
# RAG Pipeline - Document Chunking & Vector Storage

This project contains all the code needed to process documents (PDFs), split them into chunks, generate vector embeddings, and store the results in Google Cloud Storage and Vertex AI Vector Search.

## 📁 Project Structure

```
pipeline/
├── packages/                    # Reusable libraries
│   ├── chunker/                # ⭐ Chunking strategies
│   │   ├── base_chunker.py
│   │   ├── recursive_chunker.py
│   │   ├── contextual_chunker.py    # Used in production
│   │   └── llm_chunker.py           # Advanced, with optimization
│   ├── embedder/               # Embedding generation
│   │   └── vertex_ai.py
│   ├── file-storage/           # Storage on GCS
│   │   └── google_cloud.py
│   ├── vector-search/          # Vector indexes
│   │   └── vertex_ai.py
│   ├── llm/                    # LLM client
│   │   └── vertex_ai.py
│   ├── document-converter/     # PDF → Markdown
│   │   └── markdown.py
│   └── utils/                  # Utilities
├── apps/
│   └── index-gen/              # ⭐ Main pipeline
│       └── src/index_gen/
│           └── main.py         # Full orchestrator
├── src/
│   └── rag_eval/
│       └── config.py           # Centralized configuration
├── pyproject.toml              # Project dependencies
└── config.yaml                 # GCP configuration
```

---

## 🚀 Installation

### 1. Prerequisites

- **Python 3.12+**
- **uv** (package manager)
- **Poppler** (required by pdf2image):
  ```bash
  # Ubuntu/Debian
  sudo apt-get update
  sudo apt-get install -y poppler-utils libcairo2-dev

  # macOS
  brew install poppler cairo
  ```

### 2. Install dependencies

```bash
# Change into the pipeline directory (adjust the path to your checkout)
cd /home/coder/sigma-chat/pipeline

# Install all dependencies
uv sync

# Or install only the runtime dependencies (no dev extras)
uv sync --no-dev
```

---

## ⚙️ Configuration

### 1. Configure GCP credentials

```bash
# Authenticate with Google Cloud
gcloud auth application-default login

# Or use a service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```

### 2. Configure `config.yaml`

Edit the `config.yaml` file:

```yaml
project_id: "your-gcp-project"
location: "us-central1"
bucket: "your-gcs-bucket"

index:
  name: "my-vector-index"
  dimensions: 768  # For text-embedding-005
  machine_type: "e2-standard-2"
```
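
Before launching a run, it can help to sanity-check the loaded configuration. The sketch below assumes the YAML has already been parsed into a plain dictionary (for example with a YAML library); `validate_config` and the two key tuples are hypothetical names for illustration and are not part of this repository.

```python
# Hypothetical helper: checks a dict shaped like the config.yaml example.
REQUIRED_TOP_LEVEL = ("project_id", "location", "bucket", "index")
REQUIRED_INDEX = ("name", "dimensions", "machine_type")


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL if k not in config]
    index = config.get("index")
    if isinstance(index, dict):
        problems += [f"missing key: index.{k}" for k in REQUIRED_INDEX if k not in index]
        dims = index.get("dimensions")
        if dims is not None and (not isinstance(dims, int) or dims <= 0):
            problems.append("index.dimensions must be a positive integer")
    elif index is not None:
        problems.append("index must be a mapping")
    return problems


example = {
    "project_id": "your-gcp-project",
    "location": "us-central1",
    "bucket": "your-gcs-bucket",
    "index": {
        "name": "my-vector-index",
        "dimensions": 768,
        "machine_type": "e2-standard-2",
    },
}
print(validate_config(example))  # an empty list means no problems were found
```

Returning a list of problems instead of raising on the first one makes it easier to fix several mistakes in a single pass.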

---

## 📖 Usage

### **Option 1: Full Pipeline (Kubeflow/Vertex AI)**

The file [`apps/index-gen/src/index_gen/main.py`](apps/index-gen/src/index_gen/main.py) defines a complete KFP pipeline:

```python
from apps.index_gen.src.index_gen.main import (
    gather_files,
    process_file,
    aggregate_vectors,
    create_vector_index
)

# 1. Find PDFs in GCS
pdf_files = gather_files("gs://my-bucket/pdfs/")

# 2. Process each file
for pdf_file in pdf_files:
    process_file(
        file_path=pdf_file,
        model_name="text-embedding-005",
        contents_output_dir="gs://my-bucket/contents/",
        vectors_output_file="vectors.jsonl",
        chunk_limit=800
    )

# 3. Aggregate vectors
aggregate_vectors(
    vector_artifacts=["vectors.jsonl"],
    output_gcs_path="gs://my-bucket/vectors/all_vectors.jsonl"
)

# 4. Create the vector index
create_vector_index(
    vectors_dir="gs://my-bucket/vectors/"
)
```

---

### **Option 2: Use Individual Chunkers**

#### **A) RecursiveChunker (Simple and Fast)**

```python
from chunker.recursive_chunker import RecursiveChunker
from pathlib import Path

chunker = RecursiveChunker()
documents = chunker.process_path(Path("document.txt"))

# Result:
# [
#   {"page_content": "...", "metadata": {"chunk_index": 0}},
#   {"page_content": "...", "metadata": {"chunk_index": 1}},
# ]
```

**CLI:**
```bash
recursive-chunker input.txt output_dir/
```
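
To illustrate the idea behind recursive chunking, here is a minimal, self-contained sketch: split on the coarsest separator first, and fall back to finer separators only for pieces that are still too long. It is not the implementation used by `RecursiveChunker` (which, per the package list, builds on `chonkie`); `recursive_split`, the separator list, and the character-based limit are assumptions made for illustration.

```python
def recursive_split(text: str, max_chars: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Split text into pieces of at most max_chars, preferring coarse separators."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = f"{current}{sep}{part}" if current else part
        if len(candidate) <= max_chars:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= max_chars:
            current = part
        else:
            # This part alone is too long: recurse with finer separators.
            chunks.extend(recursive_split(part, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
# Wrap the pieces in the same document shape shown in the result above.
documents = [
    {"page_content": chunk, "metadata": {"chunk_index": i}}
    for i, chunk in enumerate(recursive_split(sample, max_chars=7))
]
print(documents)
```

Trying paragraph breaks before line breaks before spaces keeps related sentences together whenever they fit within the limit.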

---

#### **B) ContextualChunker (⭐ Recommended for Production)**

Adds context from the original document to each chunk using an LLM:

```python
from pathlib import Path

from chunker.contextual_chunker import ContextualChunker
from llm.vertex_ai import VertexAILLM

llm = VertexAILLM(
    project="your-project",
    location="us-central1"
)

chunker = ContextualChunker(
    llm_client=llm,
    max_chunk_size=800,
    model="gemini-2.0-flash"
)

documents = chunker.process_path(Path("document.txt"))

# Result with context:
# [
#   {
#     "page_content": "> **Contexto del documento original:**\n> [LLM summary]\n\n---\n\n[Chunk content]",
#     "metadata": {"chunk_index": 0}
#   }
# ]
```

**CLI:**
```bash
contextual-chunker input.txt output_dir/ --max-chunk-size 800 --model gemini-2.0-flash
```
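
The result format above prefixes each chunk with an LLM-written summary of the source document. As a rough sketch of how such a string could be assembled and taken apart again, assuming the exact layout shown in the example result (`build_contextual_content` and `split_contextual_content` are hypothetical helpers, not part of the `chunker` package):

```python
# Header text and separator copied from the example result above.
HEADER = "> **Contexto del documento original:**"
SEPARATOR = "\n\n---\n\n"


def build_contextual_content(summary: str, chunk: str) -> str:
    """Prefix a chunk with a block-quoted summary of the source document."""
    quoted = "\n".join(f"> {line}" for line in summary.strip().splitlines())
    return f"{HEADER}\n{quoted}{SEPARATOR}{chunk}"


def split_contextual_content(page_content: str) -> tuple[str, str]:
    """Recover (summary, chunk) from a string built by build_contextual_content."""
    context, chunk = page_content.split(SEPARATOR, 1)
    lines = context.splitlines()[1:]  # drop the header line
    summary = "\n".join(line.removeprefix("> ").removeprefix(">") for line in lines)
    return summary, chunk


content = build_contextual_content("User manual for the billing module.", "Step 1: open the app.")
print(content)
print(split_contextual_content(content))
```

Being able to strip the context again is useful when the raw chunk text is needed on its own, for example when showing a retrieved passage to a user.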

---

#### **C) LLMChunker (Advanced)**

Adds optimization, chunk merging, and image extraction:

```python
from pathlib import Path

from chunker.llm_chunker import LLMChunker
from llm.vertex_ai import VertexAILLM

llm = VertexAILLM(project="your-project", location="us-central1")

chunker = LLMChunker(
    output_dir="output/",
    model="gemini-2.0-flash",
    max_tokens=1000,
    target_tokens=800,
    gemini_client=llm,
    merge_related=True,
    extract_images=True,
    custom_instructions="Keep technical terms in English"
)

documents = chunker.process_path(Path("document.pdf"))
```

**CLI:**
```bash
llm-chunker document.pdf output_dir/ \
  --model gemini-2.0-flash \
  --max-tokens 1000 \
  --target-tokens 800 \
  --merge-related \
  --extract-images
```
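
The `max_tokens`, `target_tokens`, and `merge_related` options suggest that small neighbouring chunks are fused until they approach a target size. The sketch below shows one simple greedy way to do that. It is an assumption about the general technique, not the logic inside `LLMChunker`, and it counts whitespace-separated words instead of real tokens (the package list mentions `tiktoken` for token counting). Both function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def merge_small_chunks(chunks: list[str], target_tokens: int, max_tokens: int) -> list[str]:
    """Greedily fuse each chunk into the previous one while the previous one
    is below target_tokens and the merged result stays within max_tokens."""
    merged: list[str] = []
    for chunk in chunks:
        if merged:
            candidate = f"{merged[-1]}\n\n{chunk}"
            below_target = count_tokens(merged[-1]) < target_tokens
            if below_target and count_tokens(candidate) <= max_tokens:
                merged[-1] = candidate
                continue
        merged.append(chunk)
    return merged


chunks = ["a b", "c d", "e f g h", "i"]
print(merge_small_chunks(chunks, target_tokens=5, max_tokens=6))
```

A greedy pass only ever looks at the previous chunk, so it preserves document order and runs in a single sweep; it does not try to find a globally optimal grouping.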

---

### **Option 3: Generate Embeddings**

```python
from embedder.vertex_ai import VertexAIEmbedder

embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="your-project",
    location="us-central1"
)

# Single embedding
embedding = embedder.generate_embedding("Example text")
# Returns: List[float] with 768 dimensions

# Batch embeddings
texts = ["Text 1", "Text 2", "Text 3"]
embeddings = embedder.generate_embeddings_batch(texts, batch_size=10)
# Returns: List[List[float]]
```
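
`generate_embeddings_batch` takes a `batch_size`, which implies the input list is processed in fixed-size slices. Below is a minimal sketch of that slicing, with a stub in place of the real embedding call so it runs offline. `batched`, `embed_in_batches`, and `fake_embed` are hypothetical names and say nothing about how `VertexAIEmbedder` is actually implemented.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    batch_size: int,
    embed_batch: Callable[[list[str]], list[list[float]]],
) -> list[list[float]]:
    """Call embed_batch once per slice and concatenate the results in order."""
    embeddings: list[list[float]] = []
    for batch in batched(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed(batch: list[str]) -> list[list[float]]:
    # Stub: a 2-dimensional "embedding" derived from the text length.
    return [[float(len(text)), 0.0] for text in batch]


print(list(batched(["Text 1", "Text 2", "Text 3"], batch_size=2)))
print(embed_in_batches(["a", "bb", "ccc"], batch_size=2, embed_batch=fake_embed))
```

Passing the embedding function in as a parameter keeps the batching logic testable without network access or credentials.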

---

### **Option 4: Store in GCS**

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# Upload a file
fs.put("local_file.md", "my-bucket/chunks/document_0.md")

# List files
files = fs.ls("my-bucket/chunks/")

# Download a file
content = fs.cat_file("my-bucket/chunks/document_0.md").decode("utf-8")
```

---

### **Option 5: Vector Search**

```python
from vector_search.vertex_ai import GoogleCloudVectorSearch

vector_search = GoogleCloudVectorSearch(
    project_id="your-project",
    location="us-central1",
    bucket="my-bucket",
    index_name="my-index"
)

# Create the index
vector_search.create_index(
    name="my-index",
    content_path="gs://my-bucket/vectors/all_vectors.jsonl",
    dimensions=768
)

# Deploy the index
vector_search.deploy_index(
    index_name="my-index",
    machine_type="e2-standard-2"
)

# Query (`embedder` is the VertexAIEmbedder instance created in Option 3)
query_embedding = embedder.generate_embedding("What is RAG?")
results = vector_search.run_query(
    deployed_index_id="my_index_deployed_xxxxx",
    query=query_embedding,
    limit=5
)

# Result:
# [
#   {"id": "document_0", "distance": 0.85, "content": "RAG is..."},
#   {"id": "document_1", "distance": 0.78, "content": "..."},
# ]
```

**CLI:**
```bash
vector-search create my-index gs://bucket/vectors/ --dimensions 768
vector-search query deployed_id "What is RAG?" --limit 5
vector-search delete my-index
```
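
For local experiments without a deployed index, a brute-force search over a handful of vectors returns results in a shape similar to the example above. This sketch scores with cosine similarity; whether a deployed index reports a similarity or a true distance in its `distance` field depends on how the index was configured, so treat the scoring here as an assumption. `brute_force_query` is a hypothetical helper, not part of the `vector-search` package.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def brute_force_query(records: list[dict], query: list[float], limit: int = 5) -> list[dict]:
    """Score every record against the query and return the top `limit` matches."""
    scored = [
        {"id": record["id"], "distance": cosine_similarity(record["embedding"], query)}
        for record in records
    ]
    scored.sort(key=lambda item: item["distance"], reverse=True)
    return scored[:limit]


records = [
    {"id": "document_0", "embedding": [1.0, 0.0]},
    {"id": "document_1", "embedding": [0.0, 1.0]},
    {"id": "document_2", "embedding": [1.0, 1.0]},
]
for match in brute_force_query(records, query=[1.0, 0.0], limit=2):
    print(match["id"], round(match["distance"], 3))
```

This scans every vector, so it is only practical for small test sets; an approximate nearest-neighbour index exists precisely to avoid that full scan.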

---

## 🔄 End-to-End Example

```python
import json
from pathlib import Path

import gcsfs
from chunker.contextual_chunker import ContextualChunker
from embedder.vertex_ai import VertexAIEmbedder
from llm.vertex_ai import VertexAILLM

# 1. Setup
llm = VertexAILLM(project="my-project", location="us-central1")
chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="my-project",
    location="us-central1"
)
fs = gcsfs.GCSFileSystem()

# 2. Chunking
documents = chunker.process_path(Path("document.pdf"))
print(f"Created {len(documents)} chunks")

# 3. Generate embeddings and save
vectors_path = Path("vectors.jsonl")
with vectors_path.open("w", encoding="utf-8") as vectors_file:
    for i, doc in enumerate(documents):
        chunk_id = f"doc_{i}"

        # Generate the embedding
        embedding = embedder.generate_embedding(doc["page_content"])

        # Write the chunk content straight to GCS (no local temp file needed)
        with fs.open(f"my-bucket/contents/{chunk_id}.md", "w") as f:
            f.write(doc["page_content"])

        # Append the vector to a local JSONL file
        vectors_file.write(json.dumps({"id": chunk_id, "embedding": embedding}) + "\n")
        print(f"Chunk {chunk_id}: {len(embedding)} dimensions")

# Upload the JSONL file once all vectors have been written
fs.put(str(vectors_path), "my-bucket/vectors/all_vectors.jsonl")
```
---

## 📦 Installed Packages

See the full list in [`pyproject.toml`](pyproject.toml).

**Main packages:**
- `google-genai` - GenAI SDK for LLM calls and embeddings
- `google-cloud-aiplatform` - Vertex AI
- `google-cloud-storage` - GCS
- `chonkie` - Recursive chunking
- `langchain` - Text splitting avanzado
- `tiktoken` - Token counting
- `markitdown` - Document conversion
- `pypdf` - PDF processing
- `pdf2image` - PDF to image
- `kfp` - Kubeflow Pipelines

---

## 🛠️ Available CLI Scripts

After running `uv sync`, the following commands are available:

```bash
# Chunkers
recursive-chunker input.txt output/
contextual-chunker input.txt output/ --max-chunk-size 800
llm-chunker document.pdf output/ --model gemini-2.0-flash

# Document converter
convert-md document.pdf

# File storage
file-storage upload local.md remote/path.md
file-storage list remote/
file-storage download remote/path.md

# Vector search
vector-search create index-name gs://bucket/vectors/ --dimensions 768
vector-search query deployed-id "query text" --limit 5

# Utils
normalize-filenames input_dir/
```

---

## 📊 System Architecture

```
┌─────────────┐
│   PDF File  │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────┐
│  document-converter         │
│  (PDF → Markdown)           │
└──────┬──────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│  chunker                    │
│  (Markdown → Chunks)        │
│  - RecursiveChunker         │
│  - ContextualChunker ⭐     │
│  - LLMChunker               │
└──────┬──────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│  embedder                   │
│  (Text → Vectors)           │
│  Vertex AI embeddings       │
└──────┬──────────────────────┘
       │
       ├─────────────────────────┐
       │                         │
       ▼                         ▼
┌─────────────────┐    ┌─────────────────┐
│  file-storage   │    │  vector-search  │
│  GCS Storage    │    │  Vertex AI      │
│  (.md files)    │    │  Vector Index   │
└─────────────────┘    └─────────────────┘
```

---

## 🐛 Troubleshooting

### Error: "poppler not found"
```bash
# Ubuntu/Debian
sudo apt-get install -y poppler-utils

# macOS
brew install poppler
```

### Error: "Permission denied" on GCS
```bash
gcloud auth application-default login
# Or set GOOGLE_APPLICATION_CREDENTIALS
```

### Error: "Module not found"
```bash
# Reinstall dependencies
uv sync --reinstall
```

---

## 📝 Notes

- **ContextualChunker** is the recommended choice for production (it adds document-level context to each chunk)
- **LLMChunker** is slower but produces more carefully tuned chunks (it merges related chunks and optimizes token counts)
- **RecursiveChunker** is the fastest option and works well for quick tests
- Chunks are saved as `.md` files in GCS
- Vectors are saved in JSONL format: `{"id": "...", "embedding": [...]}`
- The vector index is created in Vertex AI Vector Search
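
As a small illustration of the JSONL vector format mentioned in the notes, the sketch below writes records as one JSON object per line and reads them back while checking the embedding length. The helper names and the in-memory buffer are assumptions for the example; only the `{"id": ..., "embedding": [...]}` record shape comes from this README.

```python
import io
import json
from typing import Iterable, TextIO


def write_vectors(records: Iterable[dict], stream: TextIO) -> None:
    """Write one {"id", "embedding"} JSON object per line."""
    for record in records:
        line = json.dumps({"id": record["id"], "embedding": record["embedding"]})
        stream.write(line + "\n")


def read_vectors(stream: TextIO, expected_dimensions: int) -> list[dict]:
    """Parse JSONL records and reject embeddings with the wrong length."""
    records = []
    for line_number, line in enumerate(stream, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        actual = len(record["embedding"])
        if actual != expected_dimensions:
            raise ValueError(
                f"line {line_number}: expected {expected_dimensions} dimensions, got {actual}"
            )
        records.append(record)
    return records


buffer = io.StringIO()
write_vectors([{"id": "doc_0", "embedding": [0.1, 0.2, 0.3]}], buffer)
buffer.seek(0)
print(read_vectors(buffer, expected_dimensions=3))
```

Checking dimensions before building an index catches a mismatch between the embedding model and the configured `dimensions` value early, when it is still cheap to fix.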

---

## 📄 License

This code is part of the legacy-rag project.