# RAG Pipeline - Document Chunking & Vector Storage

This project contains all the code needed to process documents (PDFs), split them into chunks, generate vector embeddings, and store them in Google Cloud Storage + Vertex AI Vector Search.

## 📁 Project Structure

```
pipeline/
├── packages/                     # Reusable libraries
│   ├── chunker/                  # ⭐ Chunking strategies
│   │   ├── base_chunker.py
│   │   ├── recursive_chunker.py
│   │   ├── contextual_chunker.py # Used in production
│   │   └── llm_chunker.py        # Advanced, with optimization
│   ├── embedder/                 # Embedding generation
│   │   └── vertex_ai.py
│   ├── file-storage/             # Storage on GCS
│   │   └── google_cloud.py
│   ├── vector-search/            # Vector indexes
│   │   └── vertex_ai.py
│   ├── llm/                      # LLM client
│   │   └── vertex_ai.py
│   ├── document-converter/       # PDF → Markdown
│   │   └── markdown.py
│   └── utils/                    # Utilities
├── apps/
│   └── index-gen/                # ⭐ Main pipeline
│       └── src/index_gen/
│           └── main.py           # Full orchestrator
├── src/
│   └── rag_eval/
│       └── config.py             # Centralized configuration
├── pyproject.toml                # Project dependencies
└── config.yaml                   # GCP configuration
```

---

## 🚀 Installation

### 1. Prerequisites

- **Python 3.12+**
- **uv** (package manager)
- **Poppler** (required by pdf2image):
  ```bash
  # Ubuntu/Debian
  sudo apt-get update
  sudo apt-get install -y poppler-utils libcairo2-dev

  # macOS
  brew install poppler cairo
  ```

### 2. Install dependencies

```bash
cd /home/coder/sigma-chat/pipeline

# Install all dependencies
uv sync

# Or install only the runtime dependencies (no dev)
uv sync --no-dev
```

---

## ⚙️ Configuration

### 1. Configure GCP credentials

```bash
# Authenticate with Google Cloud
gcloud auth application-default login

# Or use a service account key
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```

### 2. Configure `config.yaml`

Edit the `config.yaml` file:

```yaml
project_id: "your-gcp-project"
location: "us-central1"
bucket: "your-gcs-bucket"

index:
  name: "my-vector-index"
  dimensions: 768  # For text-embedding-005
  machine_type: "e2-standard-2"
```

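
The `dimensions` value has to agree with the length of the vectors the embedding model returns (768 for `text-embedding-005`, per the comment in the config). A minimal sketch of a guard for that, assuming the `index` section has already been parsed into a plain dict; `IndexConfig` and `check_embedding` are hypothetical names, not part of this repository:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    """Hypothetical mirror of the `index` section of config.yaml."""

    name: str
    dimensions: int
    machine_type: str


def check_embedding(config: IndexConfig, embedding: list[float]) -> None:
    """Raise ValueError if the vector length differs from the configured dimensions."""
    if len(embedding) != config.dimensions:
        raise ValueError(
            f"index '{config.name}' expects {config.dimensions} dimensions, "
            f"got {len(embedding)}"
        )


# Same shape as the `index` section of the example config.
parsed = {"name": "my-vector-index", "dimensions": 768, "machine_type": "e2-standard-2"}
index_config = IndexConfig(**parsed)
check_embedding(index_config, [0.0] * 768)  # a 768-value vector passes silently
```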
---

## 📖 Usage

### **Option 1: Full Pipeline (Kubeflow/Vertex AI)**

The file [`apps/index-gen/src/index_gen/main.py`](apps/index-gen/src/index_gen/main.py) defines a complete KFP pipeline:

```python
# Assumes the `index_gen` package (apps/index-gen/src/index_gen) is importable.
# The directory name `index-gen` contains a hyphen, so it cannot be part of an import path.
from index_gen.main import (
    gather_files,
    process_file,
    aggregate_vectors,
    create_vector_index,
)

# 1. Find the PDFs in GCS
pdf_files = gather_files("gs://my-bucket/pdfs/")

# 2. Process each file
for pdf_file in pdf_files:
    process_file(
        file_path=pdf_file,
        model_name="text-embedding-005",
        contents_output_dir="gs://my-bucket/contents/",
        vectors_output_file="vectors.jsonl",
        chunk_limit=800,
    )

# 3. Aggregate the vectors
aggregate_vectors(
    vector_artifacts=["vectors.jsonl"],
    output_gcs_path="gs://my-bucket/vectors/all_vectors.jsonl",
)

# 4. Create the vector index
create_vector_index(
    vectors_dir="gs://my-bucket/vectors/",
)
```

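
Step 3 merges the per-file vector artifacts into a single JSONL file. The actual `aggregate_vectors` lives in `main.py` and is not reproduced here; the following is only a local sketch of that kind of merge, assuming each artifact is a JSONL file with one JSON object per line. `merge_jsonl` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path


def merge_jsonl(inputs: list[Path], output: Path) -> int:
    """Concatenate JSONL files into `output`, skipping blank lines.

    Each non-blank line is parsed and re-serialized, so a malformed
    record fails here rather than later during index creation.
    Returns the number of records written.
    """
    count = 0
    with output.open("w", encoding="utf-8") as out:
        for path in inputs:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    if not line.strip():
                        continue
                    record = json.loads(line)
                    out.write(json.dumps(record) + "\n")
                    count += 1
    return count


# Demo with two small artifacts in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.jsonl"
    second = Path(tmp) / "b.jsonl"
    first.write_text('{"id": "doc_0", "embedding": [0.1, 0.2]}\n', encoding="utf-8")
    second.write_text('{"id": "doc_1", "embedding": [0.3, 0.4]}\n\n', encoding="utf-8")
    merged_path = Path(tmp) / "all_vectors.jsonl"
    total = merge_jsonl([first, second], merged_path)
    merged_ids = [
        json.loads(line)["id"]
        for line in merged_path.read_text(encoding="utf-8").splitlines()
    ]
```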
---

### **Option 2: Use Individual Chunkers**

#### **A) RecursiveChunker (Simple and Fast)**

```python
from pathlib import Path

from chunker.recursive_chunker import RecursiveChunker

chunker = RecursiveChunker()
documents = chunker.process_path(Path("document.txt"))

# Result:
# [
#     {"page_content": "...", "metadata": {"chunk_index": 0}},
#     {"page_content": "...", "metadata": {"chunk_index": 1}},
# ]
```

**CLI:**
```bash
recursive-chunker input.txt output_dir/
```

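
To make the idea behind recursive chunking concrete, here is a minimal pure-Python sketch: split on the coarsest separator first, and only fall back to finer separators for pieces that are still too long. It is not the code inside `RecursiveChunker`; the function name, the separator list, and the character-based limit are assumptions chosen for illustration.

```python
def recursive_split(text: str, max_chars: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Split `text` into pieces of at most `max_chars` characters."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    head, *rest = separators
    chunks: list[str] = []
    current = ""
    for part in text.split(head):
        candidate = part if not current else current + head + part
        if len(candidate) <= max_chars:
            # Keep packing parts into the current chunk while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_chars:
            current = part
        else:
            # A single part is too long: split it with the finer separators.
            chunks.extend(recursive_split(part, max_chars, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "First paragraph.\n\nSecond paragraph is a bit longer.\n\nThird."
documents = [
    {"page_content": chunk, "metadata": {"chunk_index": i}}
    for i, chunk in enumerate(recursive_split(sample, max_chars=40))
]
```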
---

#### **B) ContextualChunker (⭐ Recommended for Production)**

Adds context from the source document to each chunk using an LLM:

```python
from pathlib import Path

from chunker.contextual_chunker import ContextualChunker
from llm.vertex_ai import VertexAILLM

llm = VertexAILLM(
    project="your-project",
    location="us-central1",
)

chunker = ContextualChunker(
    llm_client=llm,
    max_chunk_size=800,
    model="gemini-2.0-flash",
)

documents = chunker.process_path(Path("document.txt"))

# Result with context:
# [
#     {
#         "page_content": "> **Contexto del documento original:**\n> [LLM summary]\n\n---\n\n[Chunk content]",
#         "metadata": {"chunk_index": 0}
#     }
# ]
```

**CLI:**
```bash
contextual-chunker input.txt output_dir/ --max-chunk-size 800 --model gemini-2.0-flash
```

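
The `page_content` layout in the result above (a quoted context note, a horizontal rule, then the chunk) can be sketched without any LLM call. This is an illustration of the output format only, not the implementation of `ContextualChunker`; `add_context` and `fake_summarize` are hypothetical, and the callable stands in for the real LLM request.

```python
from typing import Callable


def add_context(
    chunks: list[str],
    document: str,
    summarize: Callable[[str, str], str],
) -> list[dict]:
    """Prefix each chunk with a context note produced by `summarize(document, chunk)`."""
    results = []
    for index, chunk in enumerate(chunks):
        context = summarize(document, chunk).strip()
        # Quote every line of the context so it renders as one blockquote.
        quoted = "\n".join(f"> {line}" for line in context.splitlines())
        page_content = (
            "> **Contexto del documento original:**\n"
            f"{quoted}\n\n---\n\n{chunk}"
        )
        results.append({"page_content": page_content, "metadata": {"chunk_index": index}})
    return results


def fake_summarize(document: str, chunk: str) -> str:
    """Offline stand-in for the LLM: reports where the chunk starts."""
    return f"Excerpt starting at character {document.find(chunk)} of the document."


doc = "Intro text. Details about chunking."
contextual = add_context(["Intro text.", "Details about chunking."], doc, fake_summarize)
```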
---

#### **C) LLMChunker (Advanced)**

Adds optimization, chunk merging, and image extraction:

```python
from pathlib import Path

from chunker.llm_chunker import LLMChunker
from llm.vertex_ai import VertexAILLM

llm = VertexAILLM(project="your-project", location="us-central1")

chunker = LLMChunker(
    output_dir="output/",
    model="gemini-2.0-flash",
    max_tokens=1000,
    target_tokens=800,
    gemini_client=llm,
    merge_related=True,
    extract_images=True,
    custom_instructions="Keep technical terms in English",
)

documents = chunker.process_path(Path("document.pdf"))
```

**CLI:**
```bash
llm-chunker document.pdf output_dir/ \
  --model gemini-2.0-flash \
  --max-tokens 1000 \
  --target-tokens 800 \
  --merge-related \
  --extract-images
```

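
The relationship between `target_tokens` and `max_tokens` can be illustrated with a greedy merge of adjacent chunks: keep growing a group while it is below the target, but never past the maximum. This is a simplified sketch, not what `LLMChunker` does internally; it ignores whether chunks are semantically related, and `count_tokens` is a whitespace-based stand-in for a real tokenizer.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def merge_small_chunks(chunks: list[str], target_tokens: int, max_tokens: int) -> list[str]:
    """Greedily merge adjacent chunks.

    A chunk joins the current group while the group is still below
    `target_tokens` and the merged result would not exceed `max_tokens`.
    """
    merged: list[str] = []
    current = ""
    for chunk in chunks:
        if not current:
            current = chunk
            continue
        combined = count_tokens(current) + count_tokens(chunk)
        if count_tokens(current) < target_tokens and combined <= max_tokens:
            current = f"{current}\n\n{chunk}"
        else:
            merged.append(current)
            current = chunk
    if current:
        merged.append(current)
    return merged


pieces = ["one two", "three", "four five six seven", "eight"]
result = merge_small_chunks(pieces, target_tokens=3, max_tokens=5)
```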
---

### **Option 3: Generate Embeddings**

```python
from embedder.vertex_ai import VertexAIEmbedder

embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="your-project",
    location="us-central1",
)

# Single embedding
embedding = embedder.generate_embedding("Example text")
# Returns: List[float] with 768 dimensions

# Batch embeddings
texts = ["Text 1", "Text 2", "Text 3"]
embeddings = embedder.generate_embeddings_batch(texts, batch_size=10)
# Returns: List[List[float]]
```

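
The `batch_size` argument implies that the input list is sent in slices rather than in one request. A minimal sketch of that batching pattern, assuming nothing about how `VertexAIEmbedder` implements it; `iter_batches`, `embed_in_batches`, and `fake_embed` are hypothetical names, and `fake_embed` replaces the real API call so the example runs offline.

```python
from typing import Callable, Iterator


def iter_batches(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 10,
) -> list[list[float]]:
    """Call `embed_batch` once per batch and keep the results in input order."""
    vectors: list[list[float]] = []
    for batch in iter_batches(texts, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


def fake_embed(batch: list[str]) -> list[list[float]]:
    """Offline stand-in for the embedding API: one 2-value vector per text."""
    return [[float(len(text)), 0.0] for text in batch]


vectors = embed_in_batches(["Text 1", "Text 2", "Text 3"], fake_embed, batch_size=2)
```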
---

### **Option 4: Store in GCS**

```python
import gcsfs

fs = gcsfs.GCSFileSystem()

# Upload a file
fs.put("local_file.md", "my-bucket/chunks/document_0.md")

# List files
files = fs.ls("my-bucket/chunks/")

# Read a file's contents
content = fs.cat_file("my-bucket/chunks/document_0.md").decode("utf-8")
```

---

### **Option 5: Vector Search**

```python
from embedder.vertex_ai import VertexAIEmbedder
from vector_search.vertex_ai import GoogleCloudVectorSearch

vector_search = GoogleCloudVectorSearch(
    project_id="your-project",
    location="us-central1",
    bucket="my-bucket",
    index_name="my-index",
)

# Create the index
vector_search.create_index(
    name="my-index",
    content_path="gs://my-bucket/vectors/all_vectors.jsonl",
    dimensions=768,
)

# Deploy the index
vector_search.deploy_index(
    index_name="my-index",
    machine_type="e2-standard-2",
)

# Query (the embedder is set up as in Option 3)
embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="your-project",
    location="us-central1",
)
query_embedding = embedder.generate_embedding("What is RAG?")
results = vector_search.run_query(
    deployed_index_id="my_index_deployed_xxxxx",
    query=query_embedding,
    limit=5,
)

# Result:
# [
#     {"id": "document_0", "distance": 0.85, "content": "RAG is..."},
#     {"id": "document_1", "distance": 0.78, "content": "..."},
# ]
```

**CLI:**
```bash
vector-search create my-index gs://bucket/vectors/ --dimensions 768
vector-search query deployed_id "What is RAG?" --limit 5
vector-search delete my-index
```

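
Conceptually, a query compares the query embedding against the stored vectors and returns the closest ones. The sketch below does this exhaustively on a toy in-memory index, whereas a managed service such as Vertex AI Vector Search performs the search approximately at scale. The measure the deployed index uses is not stated in this README, so cosine similarity is an assumption here, and `brute_force_query` is a hypothetical helper.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_query(
    index: dict[str, list[float]],
    query: list[float],
    limit: int = 5,
) -> list[dict]:
    """Score every stored vector against the query and return the top `limit`."""
    scored = [
        {"id": doc_id, "score": cosine_similarity(vector, query)}
        for doc_id, vector in index.items()
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:limit]


toy_index = {
    "document_0": [1.0, 0.0],
    "document_1": [0.0, 1.0],
    "document_2": [1.0, 1.0],
}
top = brute_force_query(toy_index, query=[1.0, 0.0], limit=2)
```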
---

## 🔄 Complete Example Flow

```python
import json
from pathlib import Path

import gcsfs

from chunker.contextual_chunker import ContextualChunker
from embedder.vertex_ai import VertexAIEmbedder
from llm.vertex_ai import VertexAILLM

# 1. Setup
llm = VertexAILLM(project="my-project", location="us-central1")
chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="my-project",
    location="us-central1",
)
fs = gcsfs.GCSFileSystem()

# 2. Chunking
documents = chunker.process_path(Path("document.pdf"))
print(f"Created {len(documents)} chunks")

# 3. Generate embeddings and save
vectors_path = Path("vectors.jsonl")
with vectors_path.open("w", encoding="utf-8") as vectors_file:
    for i, doc in enumerate(documents):
        chunk_id = f"doc_{i}"

        # Generate the embedding
        embedding = embedder.generate_embedding(doc["page_content"])

        # Save the chunk content to GCS (written directly, so no temp file is needed)
        fs.pipe_file(
            f"my-bucket/contents/{chunk_id}.md",
            doc["page_content"].encode("utf-8"),
        )

        # Save the vector as one JSONL record
        vectors_file.write(json.dumps({"id": chunk_id, "embedding": embedding}) + "\n")
        print(f"Chunk {chunk_id}: {len(embedding)} dimensions")

# Upload the JSONL file once all vectors are written
fs.put(str(vectors_path), "my-bucket/vectors/all_vectors.jsonl")
```

---

## 📦 Installed Packages

See the full list in [`pyproject.toml`](pyproject.toml).

**Main packages:**
- `google-genai` - GenAI SDK for the LLM and embeddings
- `google-cloud-aiplatform` - Vertex AI
- `google-cloud-storage` - GCS
- `chonkie` - Recursive chunking
- `langchain` - Advanced text splitting
- `tiktoken` - Token counting
- `markitdown` - Document conversion
- `pypdf` - PDF processing
- `pdf2image` - PDF to image
- `kfp` - Kubeflow Pipelines

---

## 🛠️ Available CLI Scripts

After running `uv sync`, these commands are available:

```bash
# Chunkers
recursive-chunker input.txt output/
contextual-chunker input.txt output/ --max-chunk-size 800
llm-chunker document.pdf output/ --model gemini-2.0-flash

# Document converter
convert-md document.pdf

# File storage
file-storage upload local.md remote/path.md
file-storage list remote/
file-storage download remote/path.md

# Vector search
vector-search create index-name gs://bucket/vectors/ --dimensions 768
vector-search query deployed-id "query text" --limit 5

# Utils
normalize-filenames input_dir/
```

---

## 📊 System Architecture

```
┌─────────────┐
│  PDF File   │
└──────┬──────┘
       │
       ▼
┌─────────────────────────────┐
│  document-converter         │
│  (PDF → Markdown)           │
└──────┬──────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│  chunker                    │
│  (Markdown → Chunks)        │
│  - RecursiveChunker         │
│  - ContextualChunker ⭐     │
│  - LLMChunker               │
└──────┬──────────────────────┘
       │
       ▼
┌─────────────────────────────┐
│  embedder                   │
│  (Text → Vectors)           │
│  Vertex AI embeddings       │
└──────┬──────────────────────┘
       │
       ├─────────────────────────┐
       │                         │
       ▼                         ▼
┌─────────────────┐     ┌─────────────────┐
│  file-storage   │     │  vector-search  │
│  GCS Storage    │     │  Vertex AI      │
│  (.md files)    │     │  Vector Index   │
└─────────────────┘     └─────────────────┘
```

---

## 🐛 Troubleshooting

### Error: "poppler not found"
```bash
sudo apt-get install -y poppler-utils
```

### Error: "Permission denied" on GCS
```bash
gcloud auth application-default login
# Or set GOOGLE_APPLICATION_CREDENTIALS
```

### Error: "Module not found"
```bash
# Reinstall dependencies
uv sync --reinstall
```

---

## 📝 Notes

- **ContextualChunker** is the recommended choice for production (it adds context from the source document to each chunk)
- **LLMChunker** is slower, but it merges related chunks and optimizes chunk sizes against the token limits
- **RecursiveChunker** is the fastest option, suited to quick tests
- Chunks are saved as `.md` files in GCS
- Vectors are saved in JSONL format: `{"id": "...", "embedding": [...]}`
- The vector index is created in Vertex AI Vector Search

---

## 📄 License

This code is part of the legacy-rag project.