Second commit

2026-02-22 15:31:05 +00:00
parent 35d5a65b17
commit 98a1b5939e
4 changed files with 0 additions and 491 deletions
--- a/00_START_HERE.md
+++ b/00_START_HERE.md
@@ -1,158 +0,0 @@
-# 🚀 START HERE - RAG Pipeline
-
-## What is in this folder?
-
-This project contains all the code needed to:
-
-1. ✂️ **Chunk documents** (split them into fragments)
-2. 🧠 **Generate embeddings** (vectorization)
-3. 💾 **Store in GCS** (Google Cloud Storage)
-4. 🔍 **Create vector indexes** (Vertex AI Vector Search)
-
---
-
-## 📁 Basic Structure
-
-```
-pipeline/
-├── packages/          # 7 reusable libraries
-│   ├── chunker/      # ⭐ Splits documents
-│   ├── embedder/     # ⭐ Vectorizes text
-│   ├── file-storage/ # ⭐ Stores files in GCS
-│   └── vector-search/# ⭐ Vector indexes
-│
-├── apps/
-│   └── index-gen/    # ⭐ Full KFP pipeline
-│
-└── src/rag_eval/     # Configuration
-```
-
---
-
-## ⚡ Quick Installation
-
-```bash
-# On your GCP Workbench instance:
-cd ~/pipeline
-uv sync
-```
-
---
-
-## 🎯 Most Common Usage
-
-### Option 1: Contextual Chunking (Recommended)
-
-```python
-from chunker.contextual_chunker import ContextualChunker
-from llm.vertex_ai import VertexAILLM
-from pathlib import Path
-
-# Setup
-llm = VertexAILLM(project="your-project", location="us-central1")
-chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
-
-# Process
-documents = chunker.process_path(Path("document.txt"))
-print(f"Created {len(documents)} chunks")
-```
-
-### Option 2: Full Pipeline
-
-```python
-from apps.index_gen.src.index_gen.main import (
-    gather_files,
-    process_file,
-    aggregate_vectors,
-    create_vector_index
-)
-
-# Process PDFs from GCS
-pdf_files = gather_files("gs://my-bucket/pdfs/")
-
-for pdf in pdf_files:
-    process_file(
-        file_path=pdf,
-        model_name="text-embedding-005",
-        contents_output_dir="gs://my-bucket/contents/",
-        vectors_output_file="vectors.jsonl",
-        chunk_limit=800
-    )
-```
-
---
-
-## 📚 Documentation
-
-| File | Description |
-|------|-------------|
-| **[QUICKSTART.md](QUICKSTART.md)** | ⭐ Quick start with examples |
-| **[README.md](README.md)** | Full documentation |
-| **[STRUCTURE.md](STRUCTURE.md)** | Detailed structure |
-| **config.yaml** | GCP configuration |
-
---
-
-## 🔧 Required Configuration
-
-Edit `config.yaml`:
-
-```yaml
-project_id: "your-gcp-project"    # ⚠️ CHANGE
-location: "us-central1"
-bucket: "your-bucket-name"        # ⚠️ CHANGE
-
-index:
-  name: "my-rag-index"
-  dimensions: 768
-```
-
---
-
-## 💡 Available Chunking Strategies
-
-1. **RecursiveChunker** - Simple and fast
-2. **ContextualChunker** - ⭐ Adds context using an LLM (recommended)
-3. **LLMChunker** - Advanced: optimizes, merges, extracts images
-
---
-
-## 📦 Main Dependencies
-
-- `google-genai` - LLM and embeddings
-- `google-cloud-aiplatform` - Vertex AI
-- `google-cloud-storage` - GCS
-- `chonkie` - Recursive chunking
-- `langchain` - Text splitting
-- `tiktoken` - Token counting
-- `pypdf` - PDF processing
-
-Total installed: ~30 packages
-
---
-
-## ❓ FAQ
-
-**Q: Which chunker should I use?**
-A: `ContextualChunker` for production (it adds document context)
-
-**Q: How do I install on Workbench?**
-A: `uv sync` (GCP credentials are already configured)
-
-**Q: Where is the full pipeline code?**
-A: `apps/index-gen/src/index_gen/main.py`
-
-**Q: How do I generate embeddings?**
-A: Use `embedder.vertex_ai.VertexAIEmbedder`
-
---
-
-## 🆘 Support
-
-- See examples in [QUICKSTART.md](QUICKSTART.md)
-- See the full API in [README.md](README.md)
-- See the structure in [STRUCTURE.md](STRUCTURE.md)
-
---
-
-**Total**: 33 Python files | ~400 KB | Ready for Workbench ✅
--- a/QUICKSTART.md
+++ b/QUICKSTART.md
@@ -1,97 +0,0 @@
-# Quick Start - GCP Workbench
-
-## 📦 Installation on Workbench
-
-```bash
-# 1. Install system dependencies (if needed)
-sudo apt-get update
-sudo apt-get install -y poppler-utils libcairo2-dev
-
-# 2. Install Python dependencies
-cd ~/pipeline
-uv sync
-
-# 3. Configure credentials (should already be set up on Workbench)
-# Application Default Credentials are already configured
-```
-
-## ⚙️ Minimal Configuration
-
-Edit `config.yaml`:
-
-```yaml
-project_id: "your-gcp-project"
-location: "us-central1"
-bucket: "your-gcs-bucket"
-
-index:
-  name: "my-vector-index"
-  dimensions: 768
-  machine_type: "e2-standard-2"
-```
-
-## 🚀 Quick Usage
-
-### 1. Simple Chunking
-```python
-from chunker.recursive_chunker import RecursiveChunker
-from pathlib import Path
-
-chunker = RecursiveChunker()
-docs = chunker.process_text("Your text here")
-print(f"Chunks: {len(docs)}")
-```
-
-### 2. Contextual Chunking (Recommended)
-```python
-from chunker.contextual_chunker import ContextualChunker
-from llm.vertex_ai import VertexAILLM
-
-llm = VertexAILLM(project="your-project", location="us-central1")
-chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
-docs = chunker.process_path(Path("document.txt"))
-```
-
-### 3. Generate Embeddings
-```python
-from embedder.vertex_ai import VertexAIEmbedder
-
-embedder = VertexAIEmbedder(
-    model_name="text-embedding-005",
-    project="your-project",
-    location="us-central1"
-)
-embedding = embedder.generate_embedding("text")
-```
-
-### 4. Full Pipeline
-```python
-from apps.index_gen.src.index_gen.main import process_file
-
-process_file(
-    file_path="gs://bucket/file.pdf",
-    model_name="text-embedding-005",
-    contents_output_dir="gs://bucket/contents/",
-    vectors_output_file="vectors.jsonl",
-    chunk_limit=800
-)
-```
-
-## 📚 Important Files
-
-- `README.md` - Full documentation
-- `STRUCTURE.md` - Project structure
-- `config.yaml` - GCP configuration
-- `pyproject.toml` - Dependencies
-
-## 🔗 Main Components
-
-1. **packages/chunker/** - Chunking (Recursive, Contextual, LLM)
-2. **packages/embedder/** - Embeddings (Vertex AI)
-3. **packages/file-storage/** - Storage (GCS)
-4. **packages/vector-search/** - Vector Search (Vertex AI)
-5. **apps/index-gen/** - Full pipeline
-
---
-
-**Total size**: ~400 KB | **Python files**: 33
--- a/RESUMEN.txt
+++ b/RESUMEN.txt
@@ -1,65 +0,0 @@
-╔════════════════════════════════════════════════════════════════╗
-║           ✅ PIPELINE PROJECT COPIED SUCCESSFULLY             ║
-╚════════════════════════════════════════════════════════════════╝
-
-📁 LOCATION: /home/coder/sigma-chat/pipeline
-
-📊 STATISTICS:
-   • Total size: ~400 KB
-   • Python files: 33
-   • Packages: 7
-   • Apps: 1
-   • Documentation files: 5
-
-📦 PACKAGES INCLUDED:
-   ✅ chunker         - 3 chunking strategies
-   ✅ embedder        - Embedding generation (Vertex AI)
-   ✅ file-storage    - Storage in GCS
-   ✅ vector-search   - Vector indexes (Vertex AI)
-   ✅ llm             - Client for Gemini/Vertex AI
-   ✅ document-converter - PDF → Markdown conversion
-   ✅ utils           - Miscellaneous utilities
-
-🎯 APPS INCLUDED:
-   ✅ index-gen       - Full KFP pipeline
-
-📚 DOCUMENTATION:
-   ✅ 00_START_HERE.md    - Quick starting point
-   ✅ QUICKSTART.md       - Quick guide with examples
-   ✅ README.md           - Full documentation
-   ✅ STRUCTURE.md        - Detailed structure
-   ✅ config.example.yaml - Configuration template
-
-⚙️  CONFIGURATION FILES:
-   ✅ pyproject.toml      - Dependencies and CLI scripts
-   ✅ config.yaml         - GCP configuration
-   ✅ .python-version     - Python 3.12
-   ✅ .gitignore          - Git exclusions
-
-🚀 NEXT STEPS:
-
-1. Upload to GCP Workbench:
-   • Compress: tar -czf pipeline.tar.gz pipeline/
-   • Upload to Workbench
-   • Extract: tar -xzf pipeline.tar.gz
-
-2. Install dependencies:
-   cd ~/pipeline
-   uv sync
-
-3. Configure:
-   nano config.yaml
-   # Edit: project_id, location, bucket
-
-4. Test:
-   echo "Test text" > test.txt
-   recursive-chunker test.txt output/
-
-📖 READ FIRST:
-   cat 00_START_HERE.md
-
-═══════════════════════════════════════════════════════════════
-
-✨ ALL SET FOR USE ON GCP WORKBENCH ✨
-
-═══════════════════════════════════════════════════════════════
--- a/STRUCTURE.md
+++ b/STRUCTURE.md
@@ -1,171 +0,0 @@
-# Pipeline Project Structure
-
-## ✅ Copied Folders and Files
-
-```
-pipeline/
-├── 📄 pyproject.toml              # Main project configuration
-├── 📄 config.yaml                 # GCP configuration (from the original)
-├── 📄 config.example.yaml         # Configuration template
-├── 📄 .python-version             # Python 3.12
-├── 📄 .gitignore                  # Files to ignore
-├── 📄 README.md                   # Full documentation
-│
-├── 📁 packages/                   # Reusable libraries
-│   ├── chunker/                   # ⭐ CHUNKING
-│   │   ├── pyproject.toml
-│   │   └── src/chunker/
-│   │       ├── base_chunker.py
-│   │       ├── recursive_chunker.py
-│   │       ├── contextual_chunker.py
-│   │       └── llm_chunker.py
-│   │
-│   ├── embedder/                  # ⭐ EMBEDDINGS
-│   │   ├── pyproject.toml
-│   │   └── src/embedder/
-│   │       ├── base.py
-│   │       └── vertex_ai.py
-│   │
-│   ├── file-storage/              # ⭐ GCS STORAGE
-│   │   ├── pyproject.toml
-│   │   └── src/file_storage/
-│   │       ├── base.py
-│   │       ├── google_cloud.py
-│   │       └── cli.py
-│   │
-│   ├── vector-search/             # ⭐ VECTOR INDEX
-│   │   ├── pyproject.toml
-│   │   └── src/vector_search/
-│   │       ├── base.py
-│   │       ├── vertex_ai.py
-│   │       └── cli/
-│   │           ├── create.py
-│   │           ├── query.py
-│   │           ├── delete.py
-│   │           └── generate.py
-│   │
-│   ├── llm/                       # LLM client
-│   │   ├── pyproject.toml
-│   │   └── src/llm/
-│   │       ├── base.py
-│   │       └── vertex_ai.py
-│   │
-│   ├── document-converter/        # PDF→Markdown conversion
-│   │   ├── pyproject.toml
-│   │   └── src/document_converter/
-│   │       ├── base.py
-│   │       └── markdown.py
-│   │
-│   └── utils/                     # Utilities
-│       ├── pyproject.toml
-│       └── src/utils/
-│           └── normalize_filenames.py
-│
-├── 📁 apps/                       # Applications
-│   └── index-gen/                 # ⭐ MAIN PIPELINE
-│       ├── pyproject.toml
-│       └── src/index_gen/
-│           ├── cli.py
-│           └── main.py            # Full KFP pipeline
-│
-└── 📁 src/                        # Main source code
-    └── rag_eval/
-        ├── __init__.py
-        └── config.py              # Centralized configuration
-```
-
-## 📊 Component Summary
-
-### Core Packages (7)
-1. ✅ **chunker** - 3 chunking strategies (Recursive, Contextual, LLM)
-2. ✅ **embedder** - Embedding generation with Vertex AI
-3. ✅ **file-storage** - Storage in Google Cloud Storage
-4. ✅ **vector-search** - Vector indexes in Vertex AI
-5. ✅ **llm** - Client for Gemini/Vertex AI models
-6. ✅ **document-converter** - Document conversion
-7. ✅ **utils** - Miscellaneous utilities
-
-### Applications (1)
-1. ✅ **index-gen** - Full processing pipeline
-
-### Configuration (1)
-1. ✅ **rag_eval** - Centralized configuration
-
-## 🔧 Configuration Files
-
-- ✅ `pyproject.toml` - Dependencies and CLI scripts
-- ✅ `config.yaml` - GCP configuration
-- ✅ `config.example.yaml` - Template
-- ✅ `.python-version` - Python version
-- ✅ `.gitignore` - Ignored files
-
-## 📝 Documentation
-
-- ✅ `README.md` - Full documentation with examples
-- ✅ `STRUCTURE.md` - This file
-
-## 🎯 Available Features
-
-### CLI Scripts
-```bash
-# Chunking
-recursive-chunker input.txt output/
-contextual-chunker input.txt output/ --max-chunk-size 800
-llm-chunker document.pdf output/ --model gemini-2.0-flash
-
-# Document conversion
-convert-md document.pdf
-
-# File storage
-file-storage upload local.md remote/path.md
-file-storage list remote/
-file-storage download remote/path.md
-
-# Vector search
-vector-search create index-name gs://bucket/vectors/ --dimensions 768
-vector-search query deployed-id "query text" --limit 5
-vector-search delete index-name
-
-# Utils
-normalize-filenames input_dir/
-```
-
-### Python API
-All classes can be imported directly:
-
-```python
-from chunker.contextual_chunker import ContextualChunker
-from embedder.vertex_ai import VertexAIEmbedder
-from file_storage.google_cloud import GoogleCloudFileStorage
-from vector_search.vertex_ai import GoogleCloudVectorSearch
-from llm.vertex_ai import VertexAILLM
-```
-
-## 🚀 Next Steps
-
-1. **Install dependencies**:
-   ```bash
-   cd /home/coder/sigma-chat/pipeline
-   uv sync
-   ```
-
-2. **Configure GCP**:
-   - Edit `config.yaml` with your GCP settings (`project_id`, `location`, `bucket`)
-   - Run `gcloud auth application-default login`
-
-3. **Test chunking**:
-   ```bash
-   echo "Test text" > test.txt
-   recursive-chunker test.txt output/
-   ```
-
-4. **Read the full documentation**:
-   ```bash
-   cat README.md
-   ```
-
---
-
-**Total Python files copied**: 30+ files
-**Total packages**: 8 (7 packages + 1 app)
-**Ready to use**: ✅