First commit

2026-02-22 15:25:27 +00:00
commit 35d5a65b17
70 changed files with 4298 additions and 0 deletions
--- a/00_START_HERE.md
+++ b/00_START_HERE.md
@@ -0,0 +1,158 @@
+# 🚀 START HERE - RAG Pipeline
+
+## What's in this folder?
+
+This project contains all the code needed to:
+
+1. ✂️ **Chunk documents** (split them into fragments)
+2. 🧠 **Generate embeddings** (vectorization)
+3. 💾 **Store in GCS** (Google Cloud Storage)
+4. 🔍 **Create vector indexes** (Vertex AI Vector Search)
+
+---
+
+## 📁 Basic Structure
+
+```
+pipeline/
+├── packages/          # 7 reusable libraries (main ones shown)
+│   ├── chunker/      # ⭐ Splits documents
+│   ├── embedder/     # ⭐ Vectorizes text
+│   ├── file-storage/ # ⭐ Saves to GCS
+│   └── vector-search/# ⭐ Manages vector indexes
+│
+├── apps/
+│   └── index-gen/    # ⭐ Full KFP (Kubeflow Pipelines) pipeline
+│
+└── src/rag_eval/     # Configuration
+```
+
+---
+
+## ⚡ Quick Installation
+
+```bash
+# In your GCP Workbench:
+cd ~/pipeline
+uv sync
+```
+
+---
+
+## 🎯 Most Common Usage
+
+### Option 1: Contextual Chunking (Recommended)
+
+```python
+from chunker.contextual_chunker import ContextualChunker
+from llm.vertex_ai import VertexAILLM
+from pathlib import Path
+
+# Setup: the LLM client is used to generate context for each chunk
+llm = VertexAILLM(project="tu-proyecto", location="us-central1")
+chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
+
+# Process a single file
+documents = chunker.process_path(Path("documento.txt"))
+print(f"Created {len(documents)} chunks")
+```
+
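+### Option 2: Full Pipeline
+
+```python
+# Note: the directory on disk is apps/index-gen (hyphenated), so this dotted
+# path may need adjusting to match how the package is installed.
+from apps.index_gen.src.index_gen.main import (
+    gather_files,
+    process_file,
+    aggregate_vectors,
+    create_vector_index
+)
+
+# Process PDFs from GCS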
+pdf_files = gather_files("gs://mi-bucket/pdfs/")
+
+for pdf in pdf_files:
+    process_file(
+        file_path=pdf,
+        model_name="text-embedding-005",
+        contents_output_dir="gs://mi-bucket/contents/",
+        vectors_output_file="vectors.jsonl",
+        chunk_limit=800
+    )
+```
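+
+The `vectors_output_file` above is a JSONL file. As a rough illustration of what such a file can look like, the sketch below writes one JSON object per line with `id` and `embedding` fields, which is the shape Vertex AI Vector Search documents for JSON input. Whether `process_file` emits exactly this layout is an assumption, and `write_vectors_jsonl` is a hypothetical helper, not part of this repository.
+
+```python
+import json
+from pathlib import Path
+from typing import Iterable
+
+
+def write_vectors_jsonl(
+    records: Iterable[tuple[str, list[float]]],
+    path: Path,
+    dimensions: int = 768,
+) -> int:
+    """Write (chunk_id, embedding) pairs as one JSON object per line.
+
+    Hypothetical helper: the field names follow the Vector Search JSON
+    input format, not necessarily what process_file writes.
+    """
+    count = 0
+    with path.open("w", encoding="utf-8") as fh:
+        for chunk_id, embedding in records:
+            # Reject vectors whose length does not match the index size.
+            if len(embedding) != dimensions:
+                raise ValueError(
+                    f"{chunk_id}: expected {dimensions} values, got {len(embedding)}"
+                )
+            fh.write(json.dumps({"id": chunk_id, "embedding": embedding}) + "\n")
+            count += 1
+    return count
+```
+
+Checking the vector length at write time surfaces a mismatch with the index `dimensions` setting before the index build starts, rather than during it.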
+
+---
+
+## 📚 Documentation
+
+| File | Description |
+|------|-------------|
+| **[QUICKSTART.md](QUICKSTART.md)** | ⭐ Quick start with examples |
+| **[README.md](README.md)** | Full documentation |
+| **[STRUCTURE.md](STRUCTURE.md)** | Detailed structure |
+| **config.yaml** | GCP configuration |
+
+---
+
+## 🔧 Required Configuration
+
+Edit `config.yaml`:
+
+```yaml
+project_id: "tu-proyecto-gcp"     # ⚠️ CHANGE to your GCP project ID
+location: "us-central1"
+bucket: "tu-bucket-nombre"        # ⚠️ CHANGE to your bucket name
+
+index:
+  name: "mi-indice-rag"
+  dimensions: 768
+```
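+
+Since two of these values are placeholders that must be changed, a small check before running the pipeline can catch a config that was left as-is. The sketch below is one way to do that. It takes the already-parsed config as a dict (for example from a YAML parser such as PyYAML's `yaml.safe_load`, which is an assumption because no YAML library is named in this project), and `check_config` is a hypothetical helper.
+
+```python
+# Placeholder values shipped in config.yaml that must be replaced.
+PLACEHOLDERS = {"tu-proyecto-gcp", "tu-bucket-nombre"}
+
+
+def check_config(cfg: dict) -> list[str]:
+    """Return human-readable problems; an empty list means no issue was found."""
+    problems = []
+    for key in ("project_id", "location", "bucket"):
+        value = cfg.get(key)
+        if not value:
+            problems.append(f"{key} is missing")
+        elif value in PLACEHOLDERS:
+            problems.append(f"{key} still has the placeholder value {value!r}")
+    # The index section may be absent, so fall back to an empty mapping.
+    dims = (cfg.get("index") or {}).get("dimensions")
+    if not isinstance(dims, int) or dims <= 0:
+        problems.append("index.dimensions must be a positive integer")
+    return problems
+```
+
+The `dimensions` value should match the output size of the embedding model. 768 is consistent with the default output of `text-embedding-005`, the model used in the pipeline example.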
+
+---
+
+## 💡 Available Chunking Strategies
+
+1. **RecursiveChunker** - Simple and fast
+2. **ContextualChunker** - ⭐ Adds context with an LLM (recommended)
+3. **LLMChunker** - Advanced: optimizes, merges, and extracts images
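+
+To give a feel for what recursive chunking means, here is a minimal sketch of the general technique: split on the coarsest separator first and only fall back to finer ones for pieces that are still too large. This is not the implementation behind `RecursiveChunker` (which relies on `chonkie`). The function name, the separator list, and measuring size in characters rather than tokens are all assumptions made for illustration.
+
+```python
+def recursive_chunk(
+    text: str,
+    max_chars: int = 800,
+    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
+) -> list[str]:
+    """Split text into chunks of at most max_chars characters."""
+    text = text.strip()
+    if len(text) <= max_chars:
+        return [text] if text else []
+    if not separators:
+        # No separators left: fall back to a hard cut.
+        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
+
+    head, rest = separators[0], separators[1:]
+    chunks: list[str] = []
+    buffer = ""
+    for piece in text.split(head):
+        # Greedily pack consecutive pieces while they still fit.
+        candidate = f"{buffer}{head}{piece}" if buffer else piece
+        if len(candidate) <= max_chars:
+            buffer = candidate
+            continue
+        if buffer:
+            chunks.append(buffer)
+            buffer = ""
+        if len(piece) <= max_chars:
+            buffer = piece
+        else:
+            # The piece alone is too large: retry with finer separators.
+            chunks.extend(recursive_chunk(piece, max_chars, rest))
+    if buffer:
+        chunks.append(buffer)
+    return chunks
+```
+
+The contextual and LLM-based strategies build on the same idea of size-bounded chunks, but add an LLM call per chunk, which is why the recursive variant is the fast one.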
+
+---
+
+## 📦 Main Dependencies
+
+- `google-genai` - LLM and embeddings
+- `google-cloud-aiplatform` - Vertex AI
+- `google-cloud-storage` - GCS
+- `chonkie` - Recursive chunking
+- `langchain` - Text splitting
+- `tiktoken` - Token counting
+- `pypdf` - PDF processing
+
+Total installed: ~30 packages
+
+---
+
+## ❓ FAQ
+
+**Q: Which chunker should I use?**
+A: `ContextualChunker` for production (it adds context from the document to each chunk)
+
+**Q: How do I install on Workbench?**
+A: `uv sync` (GCP credentials are already configured there)
+
+**Q: Where is the code for the full pipeline?**
+A: `apps/index-gen/src/index_gen/main.py`
+
+**Q: How do I generate embeddings?**
+A: Use `embedder.vertex_ai.VertexAIEmbedder`
+
+---
+
+## 🆘 Support
+
+- See examples in [QUICKSTART.md](QUICKSTART.md)
+- See the full API in [README.md](README.md)
+- See the structure in [STRUCTURE.md](STRUCTURE.md)
+
+---
+
+**Total**: 33 Python files | ~400KB | Ready for Workbench ✅