knowledge-pipeline/STRUCTURE.md

# Pipeline Project Structure

## ✅ Copied Folders and Files

```
pipeline/
├── 📄 pyproject.toml              # Main project configuration
├── 📄 config.yaml                 # GCP configuration (from the original)
├── 📄 config.example.yaml         # Configuration template
├── 📄 .python-version             # Python 3.12
├── 📄 .gitignore                  # Files to ignore
├── 📄 README.md                   # Full documentation
│
├── 📁 packages/                   # Reusable libraries
│   ├── chunker/                   # ⭐ CHUNKING
│   │   ├── pyproject.toml
│   │   └── src/chunker/
│   │       ├── base_chunker.py
│   │       ├── recursive_chunker.py
│   │       ├── contextual_chunker.py
│   │       └── llm_chunker.py
│   │
│   ├── embedder/                  # ⭐ EMBEDDINGS
│   │   ├── pyproject.toml
│   │   └── src/embedder/
│   │       ├── base.py
│   │       └── vertex_ai.py
│   │
│   ├── file-storage/              # ⭐ GCS STORAGE
│   │   ├── pyproject.toml
│   │   └── src/file_storage/
│   │       ├── base.py
│   │       ├── google_cloud.py
│   │       └── cli.py
│   │
│   ├── vector-search/             # ⭐ VECTOR INDEX
│   │   ├── pyproject.toml
│   │   └── src/vector_search/
│   │       ├── base.py
│   │       ├── vertex_ai.py
│   │       └── cli/
│   │           ├── create.py
│   │           ├── query.py
│   │           ├── delete.py
│   │           └── generate.py
│   │
│   ├── llm/                       # LLM client
│   │   ├── pyproject.toml
│   │   └── src/llm/
│   │       ├── base.py
│   │       └── vertex_ai.py
│   │
│   ├── document-converter/        # PDF→Markdown conversion
│   │   ├── pyproject.toml
│   │   └── src/document_converter/
│   │       ├── base.py
│   │       └── markdown.py
│   │
│   └── utils/                     # Utilities
│       ├── pyproject.toml
│       └── src/utils/
│           └── normalize_filenames.py
│
├── 📁 apps/                       # Applications
│   └── index-gen/                 # ⭐ MAIN PIPELINE
│       ├── pyproject.toml
│       └── src/index_gen/
│           ├── cli.py
│           └── main.py            # Full KFP pipeline
│
└── 📁 src/                        # Main source code
    └── rag_eval/
        ├── __init__.py
        └── config.py              # Centralized configuration
```

## 📊 Component Summary

### Core Packages (7)
1. ✅ **chunker** - 3 chunking strategies (Recursive, Contextual, LLM)
2. ✅ **embedder** - Embedding generation with Vertex AI
3. ✅ **file-storage** - Storage in Google Cloud Storage
4. ✅ **vector-search** - Vector indexes in Vertex AI
5. ✅ **llm** - Client for Gemini/Vertex AI models
6. ✅ **document-converter** - Document conversion (PDF→Markdown)
7. ✅ **utils** - Miscellaneous utilities (filename normalization)
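
To illustrate what a recursive chunking strategy generally does, here is a minimal sketch of the technique. It is not the code in `recursive_chunker.py`: the function name `recursive_split`, its parameters, and the separator order are all assumptions made for illustration.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: tries the coarsest separator first and recurses
    with finer separators only for pieces that are still too long.
    Separators that fall on a chunk boundary are dropped.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing parts into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > max_chars:
                # Still too long: retry this piece with the finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return [c for c in chunks if c.strip()]

    # No separator left: fall back to fixed-width slices.
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    return [p for p in pieces if p.strip()]


sample = "First paragraph. It has two sentences.\n\nSecond paragraph, a bit longer than the first one."
for chunk in recursive_split(sample, max_chars=50):
    print(len(chunk), repr(chunk))
```

Trying paragraph breaks before sentence or word breaks tends to keep related text together, which is the usual motivation for this strategy over fixed-width slicing.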

### Applications (1)
1. ✅ **index-gen** - Full processing pipeline

### Configuration (1)
1. ✅ **rag_eval** - Centralized configuration

## 🔧 Configuration Files

- ✅ `pyproject.toml` - Dependencies and CLI scripts
- ✅ `config.yaml` - GCP configuration
- ✅ `config.example.yaml` - Template
- ✅ `.python-version` - Python version
- ✅ `.gitignore` - Ignored files

## 📝 Documentation

- ✅ `README.md` - Full documentation with examples
- ✅ `STRUCTURE.md` - This file

## 🎯 Available Functionality

### CLI Scripts
```bash
# Chunking
recursive-chunker input.txt output/
contextual-chunker input.txt output/ --max-chunk-size 800
llm-chunker documento.pdf output/ --model gemini-2.0-flash

# Document conversion
convert-md documento.pdf

# File storage
file-storage upload local.md remote/path.md
file-storage list remote/
file-storage download remote/path.md

# Vector search
vector-search create index-name gs://bucket/vectors/ --dimensions 768
vector-search query deployed-id "query text" --limit 5
vector-search delete index-name

# Utils
normalize-filenames input_dir/
```

### Python API
All classes can be imported directly:

```python
from chunker.contextual_chunker import ContextualChunker
from embedder.vertex_ai import VertexAIEmbedder
from file_storage.google_cloud import GoogleCloudFileStorage
from vector_search.vertex_ai import GoogleCloudVectorSearch
from llm.vertex_ai import VertexAILLM
```
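
Before relying on these imports, it can help to confirm that the packages are actually installed in the active environment. The sketch below uses only the standard library; the helper names `is_importable` and `check_modules` are hypothetical, and the module list simply mirrors the imports shown above.

```python
import importlib.util

# Module paths taken from the import examples in this document.
MODULES = [
    "chunker.contextual_chunker",
    "embedder.vertex_ai",
    "file_storage.google_cloud",
    "vector_search.vertex_ai",
    "llm.vertex_ai",
]


def is_importable(name):
    """Return True if a module spec can be found for `name`.

    Note: for a dotted name, find_spec imports the parent package first,
    so a missing parent surfaces as ModuleNotFoundError (an ImportError).
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def check_modules(names):
    """Map each module name to whether it appears importable."""
    return {name: is_importable(name) for name in names}


for name, ok in check_modules(MODULES).items():
    print(f"{'ok' if ok else 'missing':8} {name}")
```

A `missing` entry usually means the dependencies have not been installed yet or a different virtual environment is active.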

## 🚀 Next Steps

1. **Install dependencies**:
   ```bash
   cd /home/coder/sigma-chat/pipeline
   uv sync
   ```

2. **Configure GCP**:
   - Edit `config.yaml` with your credentials
   - Run `gcloud auth application-default login`

3. **Test chunking**:
   ```bash
   echo "Texto de prueba" > test.txt
   recursive-chunker test.txt output/
   ```

4. **Read the full documentation**:
   ```bash
   cat README.md
   ```
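
For the GCP configuration step, if `config.yaml` is ever missing, one option is to recreate it from the template before editing it. This is a minimal sketch that assumes `config.example.yaml` is a valid starting point; `ensure_config` is a hypothetical helper, not part of the project.

```python
import shutil
import tempfile
from pathlib import Path


def ensure_config(project_dir, config_name="config.yaml", template_name="config.example.yaml"):
    """Return the config path, creating it from the template if it is missing."""
    project_dir = Path(project_dir)
    config_path = project_dir / config_name
    if config_path.exists():
        return config_path  # never overwrite an existing configuration

    template_path = project_dir / template_name
    if not template_path.exists():
        raise FileNotFoundError(
            f"Neither {config_name} nor {template_name} found in {project_dir}"
        )

    shutil.copyfile(template_path, config_path)
    return config_path


# Demo in a throwaway directory so nothing in the real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.example.yaml").write_text("# placeholder template\n")
    created = ensure_config(tmp)
    print(created.name, created.read_text().strip())
```

Because an existing `config.yaml` is returned untouched, calling the helper repeatedly does not discard local edits.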

---

- **Total Python files copied**: roughly 30 or more
- **Total packages**: 8 (7 packages + 1 app)
- **Ready to use**: ✅