First commit

2026-02-22 15:25:27 +00:00
commit 35d5a65b17
70 changed files with 4298 additions and 0 deletions
--- a/QUICKSTART.md
+++ b/QUICKSTART.md
@@ -0,0 +1,97 @@
+# Quick Start - GCP Workbench
+
+## 📦 Installation on Workbench
+
+```bash
+# 1. Install system dependencies (if needed)
+sudo apt-get update
+sudo apt-get install -y poppler-utils libcairo2-dev
+
+# 2. Install Python dependencies
+cd ~/pipeline
+uv sync
+
+# 3. Configure credentials (these should already be set up on Workbench)
+# Application Default Credentials are already configured
+```
+
+## ⚙️ Minimal Configuration
+
+Edit `config.yaml`:
+
+```yaml
+project_id: "your-gcp-project"
+location: "us-central1"
+bucket: "your-gcs-bucket"
+
+index:
+  name: "my-vector-index"
+  dimensions: 768
+  machine_type: "e2-standard-2"
+```
+
+## 🚀 Quick Usage
+
+### 1. Simple Chunking
+```python
+from chunker.recursive_chunker import RecursiveChunker
+
+chunker = RecursiveChunker()
+docs = chunker.process_text("Your text here")
+print(f"Chunks: {len(docs)}")
+```
+
+### 2. Contextual Chunking (Recommended)
+```python
+from pathlib import Path
+from chunker.contextual_chunker import ContextualChunker
+from llm.vertex_ai import VertexAILLM
+
+llm = VertexAILLM(project="your-project", location="us-central1")
+chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
+docs = chunker.process_path(Path("document.txt"))
+```
+
+### 3. Generate Embeddings
+```python
+from embedder.vertex_ai import VertexAIEmbedder
+
+embedder = VertexAIEmbedder(
+    model_name="text-embedding-005",
+    project="your-project",
+    location="us-central1"
+)
+embedding = embedder.generate_embedding("text")
+```
+
+### 4. Full Pipeline
+```python
+from apps.index_gen.src.index_gen.main import process_file
+
+process_file(
+    file_path="gs://bucket/file.pdf",
+    model_name="text-embedding-005",
+    contents_output_dir="gs://bucket/contents/",
+    vectors_output_file="vectors.jsonl",
+    chunk_limit=800
+)
+```
+
+## 📚 Important Files
+
+- `README.md` - Full documentation
+- `STRUCTURE.md` - Project structure
+- `config.yaml` - GCP configuration
+- `pyproject.toml` - Dependencies
+
+## 🔗 Main Components
+
+1. **packages/chunker/** - Chunking (Recursive, Contextual, LLM)
+2. **packages/embedder/** - Embeddings (Vertex AI)
+3. **packages/file-storage/** - Storage (GCS)
+4. **packages/vector-search/** - Vector Search (Vertex AI)
+5. **apps/index-gen/** - Full pipeline
+
+---
+
+**Total size**: ~400KB | **Python files**: 33
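
Note on the chunking step: the quick start uses `RecursiveChunker` and a limit of 800 without showing what recursive chunking does. The sketch below is one common way to do it in plain Python, and it is an assumption for illustration only: `recursive_split`, its separator list, and the reading of 800 as a character count are hypothetical and are not taken from `packages/chunker/`.

```python
def recursive_split(text, max_size=800, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most max_size characters.

    Hypothetical helper: tries coarse separators first (paragraphs),
    then finer ones (lines, sentences, words), and finally hard-cuts.
    """
    text = text.strip()
    if len(text) <= max_size:
        return [text] if text else []

    for sep in separators:
        if sep not in text:
            continue
        # Greedily pack consecutive parts while they fit in max_size.
        chunks, current = [], ""
        for part in text.split(sep):
            candidate = current + sep + part if current else part
            if len(candidate) <= max_size:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        # A single part may still be too long; recurse so that the
        # finer separators (or the hard cut) get a chance to split it.
        result = []
        for chunk in chunks:
            result.extend(recursive_split(chunk, max_size, separators))
        return result

    # No separator present at all: fall back to fixed-size slices.
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]


if __name__ == "__main__":
    sample = "word " * 500
    pieces = recursive_split(sample, max_size=800)
    print(f"Chunks: {len(pieces)}, longest: {max(len(p) for p in pieces)}")
```

Splitting on the coarsest separator that works tends to keep paragraphs and sentences intact, which is usually why this approach is preferred over fixed-size slicing.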