First commit

2026-02-22 15:25:27 +00:00
commit 35d5a65b17
70 changed files with 4298 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,503 @@
+# RAG Pipeline - Document Chunking & Vector Storage
+
+This project contains all the code needed to process documents (PDFs), split them into chunks, generate vector embeddings, and store them in Google Cloud Storage and Vertex AI Vector Search.
+
+## 📁 Project Structure
+
+```
+pipeline/
+├── packages/                    # Reusable libraries
+│   ├── chunker/                # ⭐ Chunking strategies
+│   │   ├── base_chunker.py
+│   │   ├── recursive_chunker.py
+│   │   ├── contextual_chunker.py    # Used in production
+│   │   └── llm_chunker.py           # Advanced, with optimization
+│   ├── embedder/               # Embedding generation
+│   │   └── vertex_ai.py
+│   ├── file-storage/           # Storage in GCS
+│   │   └── google_cloud.py
+│   ├── vector-search/          # Vector indexes
+│   │   └── vertex_ai.py
+│   ├── llm/                    # LLM client
+│   │   └── vertex_ai.py
+│   ├── document-converter/     # PDF → Markdown
+│   │   └── markdown.py
+│   └── utils/                  # Utilities
+├── apps/
+│   └── index-gen/              # ⭐ Main pipeline
+│       └── src/index_gen/
+│           └── main.py         # Full orchestrator
+├── src/
+│   └── rag_eval/
+│       └── config.py           # Centralized configuration
+├── pyproject.toml              # Project dependencies
+└── config.yaml                 # GCP configuration
+```
+
+---
+
+## 🚀 Installation
+
+### 1. Prerequisites
+
+- **Python 3.12+**
+- **uv** (package manager)
+- **Poppler** (required by pdf2image):
+  ```bash
+  # Ubuntu/Debian
+  sudo apt-get update
+  sudo apt-get install -y poppler-utils libcairo2-dev
+
+  # macOS
+  brew install poppler cairo
+  ```
+
+### 2. Install dependencies
+
+```bash
+cd /home/coder/sigma-chat/pipeline
+
+# Install all dependencies
+uv sync
+
+# Or install only the runtime dependencies (no dev group)
+uv sync --no-dev
+```
+
+---
+
+## ⚙️ Configuration
+
+### 1. Configure GCP credentials
+
+```bash
+# Authenticate with Google Cloud
+gcloud auth application-default login
+
+# Or use a service account key
+export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
+```
+
+### 2. Configure `config.yaml`
+
+Edit the `config.yaml` file:
+
+```yaml
+project_id: "tu-proyecto-gcp"
+location: "us-central1"
+bucket: "tu-bucket-gcs"
+
+index:
+  name: "mi-indice-vectorial"
+  dimensions: 768  # For text-embedding-005
+  machine_type: "e2-standard-2"
+```
+
+---
+
+## 📖 Usage
+
+### **Option 1: Full Pipeline (Kubeflow/Vertex AI)**
+
+The file [`apps/index-gen/src/index_gen/main.py`](apps/index-gen/src/index_gen/main.py) defines a complete KFP pipeline:
+
+```python
+# Assumes the `index_gen` package (apps/index-gen/src/index_gen) is installed;
+# the hyphenated `index-gen` directory cannot appear in a dotted import path.
+from index_gen.main import (
+    gather_files,
+    process_file,
+    aggregate_vectors,
+    create_vector_index,
+)
+
+# 1. Find PDFs in GCS
+pdf_files = gather_files("gs://mi-bucket/pdfs/")
+
+# 2. Process each file
+for pdf_file in pdf_files:
+    process_file(
+        file_path=pdf_file,
+        model_name="text-embedding-005",
+        contents_output_dir="gs://mi-bucket/contents/",
+        vectors_output_file="vectors.jsonl",
+        chunk_limit=800
+    )
+
+# 3. Aggregate vectors
+aggregate_vectors(
+    vector_artifacts=["vectors.jsonl"],
+    output_gcs_path="gs://mi-bucket/vectors/all_vectors.jsonl"
+)
+
+# 4. Create the vector index
+create_vector_index(
+    vectors_dir="gs://mi-bucket/vectors/"
+)
+```
+
+---
+
+### **Option 2: Use Individual Chunkers**
+
+#### **A) RecursiveChunker (Simple and Fast)**
+
+```python
+from chunker.recursive_chunker import RecursiveChunker
+from pathlib import Path
+
+chunker = RecursiveChunker()
+documents = chunker.process_path(Path("documento.txt"))
+
+# Result:
+# [
+#   {"page_content": "...", "metadata": {"chunk_index": 0}},
+#   {"page_content": "...", "metadata": {"chunk_index": 1}},
+# ]
+```
+
+**CLI:**
+```bash
+recursive-chunker input.txt output_dir/
+```
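
For intuition about what recursive chunking does, here is a minimal pure-Python sketch. It is not the implementation behind `RecursiveChunker` (the package list names `chonkie` for that); the function name `recursive_split`, the separator order, and the character-based limit are all assumptions made for illustration.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars characters.

    Coarse separators (paragraphs) are tried first; a piece that is still
    too long is split again with the next, finer separator.
    """
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= max_chars:
                current = candidate  # keep packing parts into this chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(part) <= max_chars:
                current = part
            else:
                # The part alone is too long: recurse with finer separators.
                chunks.extend(recursive_split(part, max_chars, separators[index + 1:]))
        if current:
            chunks.append(current)
        return chunks

    # No separator applies: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


sample = "Alpha beta gamma.\n\nDelta epsilon zeta eta theta.\n\nIota."
for piece in recursive_split(sample, max_chars=20):
    print(repr(piece))
```

A real chunker would usually measure size in tokens rather than characters; characters keep the sketch dependency-free.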
+
+---
+
+#### **B) ContextualChunker (⭐ Recommended for Production)**
+
+Adds context from the original document using an LLM:
+
+```python
+from pathlib import Path
+
+from chunker.contextual_chunker import ContextualChunker
+from llm.vertex_ai import VertexAILLM
+
+llm = VertexAILLM(
+    project="tu-proyecto",
+    location="us-central1"
+)
+
+chunker = ContextualChunker(
+    llm_client=llm,
+    max_chunk_size=800,
+    model="gemini-2.0-flash"
+)
+
+documents = chunker.process_path(Path("documento.txt"))
+
+# Result with context:
+# [
+#   {
+#     "page_content": "> **Contexto del documento original:**\n> [Resumen LLM]\n\n---\n\n[Contenido del chunk]",
+#     "metadata": {"chunk_index": 0}
+#   }
+# ]
+```
+
+**CLI:**
+```bash
+contextual-chunker input.txt output_dir/ --max-chunk-size 800 --model gemini-2.0-flash
+```
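
The `page_content` layout shown in the result above (a quoted context header, a `---` rule, then the chunk) can be reproduced with a few lines of string formatting. The sketch below is an assumption about how such a step might look, not the code in `contextual_chunker.py`: the names `add_context` and `contextualize` are hypothetical, and `summarize` stands in for the LLM call, which may well receive different inputs in the real package.

```python
from typing import Callable

# Header text copied from the example output above.
CONTEXT_HEADER = "> **Contexto del documento original:**"


def add_context(chunk: str, summary: str) -> str:
    """Prefix a chunk with a quoted context block and a horizontal rule."""
    quoted = "\n".join(f"> {line}" for line in summary.splitlines())
    return f"{CONTEXT_HEADER}\n{quoted}\n\n---\n\n{chunk}"


def contextualize(chunks: list[str], summarize: Callable[[str, str], str]) -> list[dict]:
    """Build documents in the shape shown above.

    `summarize(document, chunk)` is a placeholder for the LLM request.
    """
    document = "\n\n".join(chunks)
    return [
        {
            "page_content": add_context(chunk, summarize(document, chunk)),
            "metadata": {"chunk_index": index},
        }
        for index, chunk in enumerate(chunks)
    ]


# Stub summarizer so the example runs without any cloud credentials.
docs = contextualize(
    ["First section.", "Second section."],
    summarize=lambda document, chunk: f"Part of a {len(document)}-character document.",
)
print(docs[0]["page_content"])
```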
+
+---
+
+#### **C) LLMChunker (Advanced)**
+
+With optimization, chunk merging, and image extraction:
+
+```python
+from pathlib import Path
+
+from chunker.llm_chunker import LLMChunker
+from llm.vertex_ai import VertexAILLM
+
+llm = VertexAILLM(project="tu-proyecto", location="us-central1")
+
+chunker = LLMChunker(
+    output_dir="output/",
+    model="gemini-2.0-flash",
+    max_tokens=1000,
+    target_tokens=800,
+    gemini_client=llm,
+    merge_related=True,
+    extract_images=True,
+    custom_instructions="Mantener términos técnicos en inglés"
+)
+
+documents = chunker.process_path(Path("documento.pdf"))
+```
+
+**CLI:**
+```bash
+llm-chunker documento.pdf output_dir/ \
+  --model gemini-2.0-flash \
+  --max-tokens 1000 \
+  --target-tokens 800 \
+  --merge-related \
+  --extract-images
+```
+
+---
+
+### **Option 3: Generate Embeddings**
+
+```python
+from embedder.vertex_ai import VertexAIEmbedder
+
+embedder = VertexAIEmbedder(
+    model_name="text-embedding-005",
+    project="tu-proyecto",
+    location="us-central1"
+)
+
+# Single embedding
+embedding = embedder.generate_embedding("Texto de ejemplo")
+# Returns: List[float] with 768 dimensions
+
+# Batch embeddings
+texts = ["Texto 1", "Texto 2", "Texto 3"]
+embeddings = embedder.generate_embeddings_batch(texts, batch_size=10)
+# Returns: List[List[float]]
+```
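
`generate_embeddings_batch` accepts a `batch_size`; the sketch below shows one way such batching can work in general. The helper names and the `embed_batch` callable are hypothetical stand-ins for the real API call and are not part of the `embedder` package.

```python
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[list[float]]],
    batch_size: int = 10,
) -> list[list[float]]:
    """Call embed_batch once per slice and keep results in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding backend returned a different number of vectors")
        vectors.extend(result)
    return vectors


# Fake backend: one-dimensional "embeddings" so the example runs offline.
fake_backend = lambda batch: [[float(len(text))] for text in batch]
vectors = embed_in_batches([f"text {i}" for i in range(25)], fake_backend, batch_size=10)
print(len(vectors))
```

Batching keeps each request within whatever per-call limits the embedding service imposes while preserving a one-to-one mapping between inputs and vectors.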
+
+---
+
+### **Option 4: Store in GCS**
+
+```python
+from file_storage.google_cloud import GoogleCloudFileStorage
+
+storage = GoogleCloudFileStorage(bucket="mi-bucket")
+
+# Upload a file
+storage.upload_file(
+    file_path="local_file.md",
+    destination_blob_name="chunks/documento_0.md",
+    content_type="text/markdown"
+)
+
+# List files
+files = storage.list_files(path="chunks/")
+
+# Download a file
+file_stream = storage.get_file_stream("chunks/documento_0.md")
+content = file_stream.read().decode("utf-8")
+```
+
+**CLI:**
+```bash
+file-storage upload local_file.md chunks/documento_0.md
+file-storage list chunks/
+file-storage download chunks/documento_0.md
+```
+
+---
+
+### **Option 5: Vector Search**
+
+```python
+from vector_search.vertex_ai import GoogleCloudVectorSearch
+
+vector_search = GoogleCloudVectorSearch(
+    project_id="tu-proyecto",
+    location="us-central1",
+    bucket="mi-bucket",
+    index_name="mi-indice"
+)
+
+# Create the index
+vector_search.create_index(
+    name="mi-indice",
+    content_path="gs://mi-bucket/vectors/all_vectors.jsonl",
+    dimensions=768
+)
+
+# Deploy the index
+vector_search.deploy_index(
+    index_name="mi-indice",
+    machine_type="e2-standard-2"
+)
+
+# Query (`embedder` is the VertexAIEmbedder instance from the embeddings example above)
+query_embedding = embedder.generate_embedding("¿Qué es RAG?")
+results = vector_search.run_query(
+    deployed_index_id="mi_indice_deployed_xxxxx",
+    query=query_embedding,
+    limit=5
+)
+
+# Result:
+# [
+#   {"id": "documento_0", "distance": 0.85, "content": "RAG es..."},
+#   {"id": "documento_1", "distance": 0.78, "content": "..."},
+# ]
+```
+
+**CLI:**
+```bash
+vector-search create mi-indice gs://bucket/vectors/ --dimensions 768
+vector-search query deployed_id "¿Qué es RAG?" --limit 5
+vector-search delete mi-indice
+```
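
To make the query step concrete, the following sketch ranks stored vectors against a query by brute force. It only illustrates the idea of nearest-neighbor retrieval: Vertex AI Vector Search uses an approximate index, and the metric configured for this project is not stated here, so cosine similarity, the `score` key, and the helper names are assumptions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], vectors: dict[str, list[float]], limit: int = 5) -> list[dict]:
    """Score every stored vector against the query and keep the best `limit`."""
    scored = [
        {"id": vector_id, "score": cosine_similarity(query, vector)}
        for vector_id, vector in vectors.items()
    ]
    scored.sort(key=lambda row: row["score"], reverse=True)
    return scored[:limit]


# Toy two-dimensional vectors; real embeddings here have 768 dimensions.
store = {"documento_0": [1.0, 0.0], "documento_1": [0.0, 1.0], "documento_2": [1.0, 1.0]}
for row in top_k([1.0, 0.0], store, limit=2):
    print(row["id"], round(row["score"], 3))
```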
+
+---
+
+## 🔄 Complete Example Flow
+
+```python
+from pathlib import Path
+from chunker.contextual_chunker import ContextualChunker
+from embedder.vertex_ai import VertexAIEmbedder
+from file_storage.google_cloud import GoogleCloudFileStorage
+from llm.vertex_ai import VertexAILLM
+
+# 1. Setup
+llm = VertexAILLM(project="mi-proyecto", location="us-central1")
+chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
+embedder = VertexAIEmbedder(
+    model_name="text-embedding-005",
+    project="mi-proyecto",
+    location="us-central1"
+)
+storage = GoogleCloudFileStorage(bucket="mi-bucket")
+
+# 2. Chunking
+documents = chunker.process_path(Path("documento.pdf"))
+print(f"Created {len(documents)} chunks")
+
+# 3. Generate embeddings and save
+for i, doc in enumerate(documents):
+    chunk_id = f"doc_{i}"
+
+    # Generate the embedding
+    embedding = embedder.generate_embedding(doc["page_content"])
+
+    # Write the chunk to a local temp file, then upload it to GCS
+    temp_path = Path(f"temp_{chunk_id}.md")
+    temp_path.write_text(doc["page_content"], encoding="utf-8")
+    storage.upload_file(
+        file_path=str(temp_path),
+        destination_blob_name=f"contents/{chunk_id}.md"
+    )
+
+    # Save the vector (write to JSONL locally, then upload)
+    print(f"Chunk {chunk_id}: {len(embedding)} dimensions")
+```
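
The loop above leaves the "write to JSONL locally" step as a comment. A minimal sketch of that step follows, using the record shape this README gives for vectors (`{"id": "...", "embedding": [...]}`). The function name `write_vectors_jsonl` and the output file name are hypothetical; uploading the resulting file is left to `storage.upload_file`.

```python
import json
from pathlib import Path
from typing import Iterable, Sequence


def write_vectors_jsonl(records: Iterable[tuple[str, Sequence[float]]], output_path: str) -> Path:
    """Write one JSON object per line: {"id": ..., "embedding": [...]}."""
    path = Path(output_path)
    with path.open("w", encoding="utf-8") as handle:
        for chunk_id, embedding in records:
            handle.write(json.dumps({"id": chunk_id, "embedding": list(embedding)}) + "\n")
    return path


if __name__ == "__main__":
    # Tiny two-dimensional vectors stand in for real 768-dimensional embeddings.
    out = write_vectors_jsonl(
        [("doc_0", [0.1, 0.2]), ("doc_1", [0.3, 0.4])],
        "all_vectors.jsonl",
    )
    print(out.read_text(encoding="utf-8"))
```

Collecting `(chunk_id, embedding)` pairs inside the loop and writing them once at the end avoids reopening the file for every chunk.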
+
+---
+
+## 📦 Installed Packages
+
+See the full list in [`pyproject.toml`](pyproject.toml).
+
+**Main packages:**
+- `google-genai` - GenAI SDK for LLM and embeddings
+- `google-cloud-aiplatform` - Vertex AI
+- `google-cloud-storage` - GCS
+- `chonkie` - Recursive chunking
+- `langchain` - Advanced text splitting
+- `tiktoken` - Token counting
+- `markitdown` - Document conversion
+- `pypdf` - PDF processing
+- `pdf2image` - PDF to image
+- `kfp` - Kubeflow Pipelines
+
+---
+
+## 🛠️ Available CLI Scripts
+
+After running `uv sync`, you can use these commands:
+
+```bash
+# Chunkers
+recursive-chunker input.txt output/
+contextual-chunker input.txt output/ --max-chunk-size 800
+llm-chunker documento.pdf output/ --model gemini-2.0-flash
+
+# Document converter
+convert-md documento.pdf
+
+# File storage
+file-storage upload local.md remote/path.md
+file-storage list remote/
+file-storage download remote/path.md
+
+# Vector search
+vector-search create index-name gs://bucket/vectors/ --dimensions 768
+vector-search query deployed-id "query text" --limit 5
+
+# Utils
+normalize-filenames input_dir/
+```
+
+---
+
+## 📊 System Architecture
+
+```
+┌─────────────┐
+│   PDF File  │
+└──────┬──────┘
+       │
+       ▼
+┌─────────────────────────────┐
+│  document-converter         │
+│  (PDF → Markdown)           │
+└──────┬──────────────────────┘
+       │
+       ▼
+┌─────────────────────────────┐
+│  chunker                    │
+│  (Markdown → Chunks)        │
+│  - RecursiveChunker         │
+│  - ContextualChunker ⭐     │
+│  - LLMChunker               │
+└──────┬──────────────────────┘
+       │
+       ▼
+┌─────────────────────────────┐
+│  embedder                   │
+│  (Text → Vectors)           │
+│  Vertex AI embeddings       │
+└──────┬──────────────────────┘
+       │
+       ├─────────────────────────┐
+       │                         │
+       ▼                         ▼
+┌─────────────────┐    ┌─────────────────┐
+│  file-storage   │    │  vector-search  │
+│  GCS Storage    │    │  Vertex AI      │
+│  (.md files)    │    │  Vector Index   │
+└─────────────────┘    └─────────────────┘
+```
+
+---
+
+## 🐛 Troubleshooting
+
+### Error: "poppler not found"
+```bash
+sudo apt-get install -y poppler-utils
+```
+
+### Error: "Permission denied" in GCS
+```bash
+gcloud auth application-default login
+# Or set GOOGLE_APPLICATION_CREDENTIALS
+```
+
+### Error: "Module not found"
+```bash
+# Reinstall dependencies
+uv sync --reinstall
+```
+
+---
+
+## 📝 Notes
+
+- **ContextualChunker** is the recommended choice for production (it adds context from the source document)
+- **LLMChunker** is slower but produces more optimized chunks (it merges chunks and optimizes token counts)
+- **RecursiveChunker** is the fastest and suits quick tests
+- Chunks are stored as `.md` files in GCS
+- Vectors are stored in JSONL format: `{"id": "...", "embedding": [...]}`
+- The vector index is created in Vertex AI Vector Search
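
Because the index is configured with a fixed number of dimensions (768 for `text-embedding-005`), it can help to check a vectors file before building the index. The validator below is a sketch under that assumption; `validate_vectors` is a hypothetical helper, not something shipped in `packages/`.

```python
import json
from typing import Iterable


def validate_vectors(lines: Iterable[str], dimensions: int = 768) -> list[str]:
    """Check each JSONL line for the expected keys and vector length.

    Returns the ids in file order; raises ValueError on the first problem.
    """
    ids: list[str] = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        if "id" not in record or "embedding" not in record:
            raise ValueError(f"line {number}: expected 'id' and 'embedding' keys")
        if len(record["embedding"]) != dimensions:
            raise ValueError(
                f"line {number}: {len(record['embedding'])} dimensions, expected {dimensions}"
            )
        ids.append(record["id"])
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate ids found")
    return ids


# Three-dimensional toy vectors keep the example short.
sample = [
    '{"id": "doc_0", "embedding": [0.1, 0.2, 0.3]}',
    '{"id": "doc_1", "embedding": [0.4, 0.5, 0.6]}',
]
print(validate_vectors(sample, dimensions=3))
```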
+
+---
+
+## 📄 License
+
+This code is part of the legacy-rag project.