knowledge-pipeline/00_START_HERE.md

# 🚀 START HERE - RAG Pipeline

## What's in this folder?

This project contains all the code needed to:

1. ✂️ **Chunk documents** (split them into fragments)
2. 🧠 **Generate embeddings** (vectorize the chunks)
3. 💾 **Store results in GCS** (Google Cloud Storage)
4. 🔍 **Create vector indexes** (Vertex AI Vector Search)

---

## 📁 Basic Structure

```
pipeline/
├── packages/          # 7 reusable libraries (4 shown)
│   ├── chunker/      # ⭐ Splits documents into chunks
│   ├── embedder/     # ⭐ Vectorizes text
│   ├── file-storage/ # ⭐ Saves to GCS
│   └── vector-search/# ⭐ Vector indexes
│
├── apps/
│   └── index-gen/    # ⭐ Full KFP pipeline
│
└── src/rag_eval/     # Configuration
```

---

## ⚡ Quick Install

```bash
# On your GCP Workbench instance:
cd ~/pipeline
uv sync
```

---

## 🎯 Most Common Usage

### Option 1: Contextual Chunking (Recommended)

```python
from chunker.contextual_chunker import ContextualChunker
from llm.vertex_ai import VertexAILLM
from pathlib import Path

# Setup: LLM client used to generate context for each chunk
llm = VertexAILLM(project="your-project", location="us-central1")
chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)

# Process a single file into chunked documents
documents = chunker.process_path(Path("document.txt"))
print(f"Created {len(documents)} chunks")
```
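
To make the idea behind contextual chunking concrete, here is a minimal offline sketch. It is not the `ContextualChunker` implementation: the helper names (`split_fixed`, `contextualize`, `fake_describe`) are hypothetical, the size limit is assumed to be in characters, and the LLM call is replaced by a stub so the example runs without credentials.

```python
from typing import Callable, List


def split_fixed(text: str, max_chunk_size: int) -> List[str]:
    """Split text into consecutive pieces of at most max_chunk_size characters."""
    return [text[i:i + max_chunk_size] for i in range(0, len(text), max_chunk_size)]


def contextualize(
    text: str,
    max_chunk_size: int,
    describe: Callable[[str, str], str],
) -> List[str]:
    """Prefix every chunk with a short context string.

    `describe(document, chunk)` stands in for an LLM call (hypothetical).
    """
    chunks = split_fixed(text, max_chunk_size)
    return [f"{describe(text, chunk)}\n\n{chunk}" for chunk in chunks]


def fake_describe(document: str, chunk: str) -> str:
    """Offline stub for the LLM: only reports the size of the source document."""
    return f"[Context: part of a {len(document)}-character document]"


sample = "The quick brown fox jumps over the lazy dog."
contextual_chunks = contextualize(sample, max_chunk_size=12, describe=fake_describe)

for chunk in contextual_chunks:
    print(repr(chunk))
```

In a real setup the stub would be swapped for a call that asks the LLM to summarize where the chunk sits within the document, so each chunk can be retrieved with that context attached.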

### Option 2: Full Pipeline

```python
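# Note: the directory on disk is apps/index-gen (with a hyphen), which Python
# cannot import by that name. If this import fails, `from index_gen.main import ...`
# may work once the package is installed with `uv sync` (an assumption).
from apps.index_gen.src.index_gen.main import (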
    gather_files,
    process_file,
    aggregate_vectors,
    create_vector_index
)

# Process PDFs from GCS
pdf_files = gather_files("gs://my-bucket/pdfs/")

for pdf in pdf_files:
    process_file(
        file_path=pdf,
        model_name="text-embedding-005",
        contents_output_dir="gs://my-bucket/contents/",
        vectors_output_file="vectors.jsonl",
        chunk_limit=800
    )
```
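
The exact schema that `process_file` writes to `vectors.jsonl` is not shown here. As a rough sketch, the example below assumes one JSON object per line with `id` and `embedding` keys, modeled on the JSON record format that Vertex AI Vector Search accepts. The function names and sample IDs are hypothetical.

```python
import io
import json
from typing import IO, Iterable, List, Tuple


def write_vectors_jsonl(records: Iterable[Tuple[str, List[float]]], out: IO[str]) -> int:
    """Write one JSON object per line; returns the number of records written."""
    count = 0
    for chunk_id, embedding in records:
        # Assumed layout: {"id": ..., "embedding": [...]}
        out.write(json.dumps({"id": chunk_id, "embedding": embedding}) + "\n")
        count += 1
    return count


def read_vectors_jsonl(src: IO[str]) -> List[dict]:
    """Parse every non-empty line back into a dict."""
    return [json.loads(line) for line in src if line.strip()]


# An in-memory buffer keeps the example free of file system side effects.
buffer = io.StringIO()
written = write_vectors_jsonl(
    [("doc1-chunk0", [0.1, 0.2, 0.3]), ("doc1-chunk1", [0.4, 0.5, 0.6])],
    buffer,
)
buffer.seek(0)
rows = read_vectors_jsonl(buffer)
print(written, rows[0]["id"])
```

Real embeddings would have as many values as the index's configured dimension rather than the three used here.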

---

## 📚 Documentation

| File | Description |
|------|-------------|
| **[QUICKSTART.md](QUICKSTART.md)** | ⭐ Quick start with examples |
| **[README.md](README.md)** | Full documentation |
| **[STRUCTURE.md](STRUCTURE.md)** | Detailed structure |
| **config.yaml** | GCP configuration |

---

## 🔧 Required Configuration

Edit `config.yaml`:

```yaml
project_id: "your-gcp-project"    # ⚠️ CHANGE
location: "us-central1"
bucket: "your-bucket-name"        # ⚠️ CHANGE

index:
  name: "my-rag-index"
  dimensions: 768                 # Must match the embedding model's output dimension
```
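
Since the pipeline depends on these values being edited, a small sanity check before launching can catch leftover placeholders. The sketch below is an illustration only: `validate_config` is a hypothetical helper, it works on an already-parsed dictionary (loading the YAML file, for example with PyYAML's `yaml.safe_load`, is assumed and not shown), and the placeholder detection is a simple prefix heuristic.

```python
from typing import Any, Dict, List

# Heuristic assumption: template values start with one of these prefixes.
PLACEHOLDER_PREFIXES = ("your-", "tu-")


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems: List[str] = []
    for key in ("project_id", "location", "bucket"):
        value = cfg.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"missing value for '{key}'")
        elif value.startswith(PLACEHOLDER_PREFIXES):
            problems.append(f"'{key}' still has the placeholder value '{value}'")
    dims = (cfg.get("index") or {}).get("dimensions")
    if not isinstance(dims, int) or dims <= 0:
        problems.append("'index.dimensions' must be a positive integer")
    return problems


template = {
    "project_id": "your-gcp-project",
    "location": "us-central1",
    "bucket": "your-bucket-name",
    "index": {"name": "my-rag-index", "dimensions": 768},
}
# Example values only; substitute your own project and bucket.
edited = dict(template, project_id="acme-rag-prod", bucket="acme-rag-data")

template_problems = validate_config(template)
edited_problems = validate_config(edited)
print(template_problems)
print(edited_problems)
```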

---

## 💡 Available Chunking Strategies

1. **RecursiveChunker** - Simple and fast
2. **ContextualChunker** - ⭐ Adds document context using an LLM (recommended)
3. **LLMChunker** - Advanced: optimizes and merges chunks, extracts images
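
For readers new to the first strategy, the following is a generic sketch of recursive splitting: try the coarsest separator first and fall back to finer ones only for pieces that are still too large. It is not the code behind `RecursiveChunker` or `chonkie`; the function name, the character-based limit, and the separator list are assumptions made for illustration.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    max_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " "),
) -> List[str]:
    """Split text into pieces of at most max_size characters."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    pieces: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_size:
            current = candidate  # keep packing parts into the current piece
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= max_size:
            current = part
        else:
            # The part alone is too large: retry with finer separators.
            pieces.extend(recursive_split(part, max_size, finer))
    if current:
        pieces.append(current)
    return pieces


text = "alpha beta gamma\n\ndelta epsilon zeta eta"
chunks = recursive_split(text, max_size=12)
print(chunks)
```

Paragraph breaks are preferred as boundaries, so words from different paragraphs never end up in the same chunk in this example, even when they would fit.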

---

## 📦 Main Dependencies

- `google-genai` - LLM and embeddings
- `google-cloud-aiplatform` - Vertex AI
- `google-cloud-storage` - GCS
- `chonkie` - Recursive chunking
- `langchain` - Text splitting
- `tiktoken` - Token counting
- `pypdf` - PDF processing

Total installed: ~30 packages

---

## ❓ FAQ

**Q: Which chunker should I use?**
A: `ContextualChunker` for production (it adds document context to each chunk)

**Q: How do I install on Workbench?**
A: Run `uv sync` (GCP credentials are already configured)

**Q: Where is the code for the full pipeline?**
A: `apps/index-gen/src/index_gen/main.py`

**Q: How do I generate embeddings?**
A: Use `embedder.vertex_ai.VertexAIEmbedder`

---

## 🆘 Support

- See examples in [QUICKSTART.md](QUICKSTART.md)
- See the full API in [README.md](README.md)
- See the project structure in [STRUCTURE.md](STRUCTURE.md)

---

**Total**: 33 Python files | ~400KB | Ready for Workbench ✅