# Quick Start - GCP Workbench

## 📦 Installation on Workbench

```bash
# 1. Install system dependencies (if needed)
sudo apt-get update
sudo apt-get install -y poppler-utils libcairo2-dev

# 2. Install Python dependencies
cd ~/pipeline
uv sync

# 3. Configure credentials (should already be in place on Workbench)
# Application Default Credentials are already configured
```

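If a later step fails while reading a PDF, it can help to confirm that the system packages actually landed on `PATH`. The following is a minimal sketch, assuming the pipeline relies on the `pdftoppm` and `pdftotext` executables that `poppler-utils` ships; the helper name `missing_tools` is hypothetical and not part of this project.

```python
import shutil

# Executables installed by poppler-utils. That the pipeline calls these two
# in particular is an assumption made for illustration.
REQUIRED_TOOLS = ("pdftoppm", "pdftotext")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All system tools found")
```

An empty result suggests step 1 of the installation succeeded; it does not verify the Python dependencies installed by `uv sync`.
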
## ⚙️ Minimal Configuration

Edit `config.yaml`:

```yaml
project_id: "your-gcp-project"
location: "us-central1"
bucket: "your-gcs-bucket"

index:
  name: "my-vector-index"
  dimensions: 768
  machine_type: "e2-standard-2"
```

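Before launching anything that costs money, a quick check that no setting was left out can save a failed run. The sketch below assumes the configuration has already been loaded into a dictionary (for example with PyYAML's `yaml.safe_load`, which is an assumption about the tooling) and that `name`, `dimensions`, and `machine_type` are nested under `index`. The function `find_missing_keys` is a hypothetical helper, not part of this project.

```python
# Settings taken from the config.yaml example; treating all of them as
# required is an assumption made for this sketch.
REQUIRED_KEYS = ("project_id", "location", "bucket")
REQUIRED_INDEX_KEYS = ("name", "dimensions", "machine_type")


def find_missing_keys(config):
    """Return dotted names of required settings that are absent or empty."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    index = config.get("index") or {}
    missing += [f"index.{key}" for key in REQUIRED_INDEX_KEYS if not index.get(key)]
    return missing


example = {
    "project_id": "your-gcp-project",
    "location": "us-central1",
    "bucket": "your-gcs-bucket",
    "index": {"name": "my-vector-index", "dimensions": 768},
}
print(find_missing_keys(example))  # ['index.machine_type']
```
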
## 🚀 Quick Usage

### 1. Simple Chunking
```python
from chunker.recursive_chunker import RecursiveChunker

# Split raw text into chunks using the default settings
chunker = RecursiveChunker()
docs = chunker.process_text("Your text here")
print(f"Chunks: {len(docs)}")
```

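For readers unfamiliar with the technique, recursive chunking generally means splitting on the coarsest separator first and falling back to finer ones only when a piece is still too long. The following is a minimal standalone sketch of that idea; it is not the implementation of `RecursiveChunker`, and the function name, separator list, and default size are assumptions chosen for illustration.

```python
def recursive_split(text, max_size=800, separators=("\n\n", "\n", " ")):
    """Split `text` into chunks of at most `max_size` characters."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_size:
            # The piece still fits: keep growing the current chunk.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            # The piece alone is too long: retry with a finer separator.
            chunks.extend(recursive_split(piece, max_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc", max_size=7))  # ['aaa bbb', 'ccc']
```

Trying paragraph breaks before line breaks before spaces tends to keep related sentences together, which is the usual motivation for this approach over fixed-width slicing.
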
### 2. Contextual Chunking (Recommended)
```python
from pathlib import Path

from chunker.contextual_chunker import ContextualChunker
from llm.vertex_ai import VertexAILLM

# The LLM client is passed to the chunker, which uses it while chunking
llm = VertexAILLM(project="your-project", location="us-central1")
chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
docs = chunker.process_path(Path("document.txt"))
```

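Contextual chunking, as the term is commonly used, attaches a short description of the surrounding document to each chunk so that the chunk still makes sense when retrieved on its own. The sketch below illustrates that general idea with a stub in place of the LLM; it makes no claim about how `ContextualChunker` works internally, and `add_context` and `stub_describe` are hypothetical names.

```python
def add_context(chunks, document_title, describe):
    """Prefix each chunk with a short note on where it sits in the document."""
    contextual = []
    for position, chunk in enumerate(chunks, start=1):
        context = describe(document_title, position, len(chunks), chunk)
        contextual.append(f"{context}\n\n{chunk}")
    return contextual


def stub_describe(title, position, total, chunk):
    """Stand-in for an LLM call; a real one would summarize the chunk's role."""
    return f"[{title}, part {position} of {total}]"


result = add_context(["First part.", "Second part."], "report.txt", stub_describe)
print(result[0])
```

Passing the describer as a function keeps the sketch runnable offline; in practice that callable would wrap a request to the language model.
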
### 3. Generate Embeddings
```python
from embedder.vertex_ai import VertexAIEmbedder

embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="your-project",
    location="us-central1",
)
# Embed a single string
embedding = embedder.generate_embedding("text")
```

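A simple way to sanity-check embeddings is to compare two of them with cosine similarity: related texts should score closer to 1 than unrelated ones. This is a minimal sketch that assumes `generate_embedding` returns a plain sequence of floats, which the source does not confirm; the short vectors below are made-up stand-ins, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


# Toy vectors standing in for embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```
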
### 4. Full Pipeline
```python
from apps.index_gen.src.index_gen.main import process_file

# Chunk one file, embed the chunks, and write contents and vectors
process_file(
    file_path="gs://bucket/file.pdf",
    model_name="text-embedding-005",
    contents_output_dir="gs://bucket/contents/",
    vectors_output_file="vectors.jsonl",
    chunk_limit=800,
)
```

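Vertex AI Vector Search accepts JSON Lines input in which each line is an object with an `id` and an `embedding` field. The sketch below writes and reads a file in that shape so the output of a run can be inspected locally. Whether `process_file` produces exactly these fields is an assumption, and `write_vectors_jsonl` and `read_vectors_jsonl` are hypothetical helpers.

```python
import json
import os
import tempfile


def write_vectors_jsonl(path, records):
    """Write (id, embedding) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for record_id, embedding in records:
            line = json.dumps({"id": record_id, "embedding": embedding})
            handle.write(line + "\n")


def read_vectors_jsonl(path):
    """Load every non-empty line of a JSONL file as a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Round trip with toy two-dimensional vectors (real ones would have 768 values)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vectors.jsonl")
    write_vectors_jsonl(path, [("doc-1-chunk-0", [0.1, 0.2]), ("doc-1-chunk-1", [0.3, 0.4])])
    loaded = read_vectors_jsonl(path)

print(len(loaded), loaded[0]["id"])
```
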
## 📚 Important Files

- `README.md` - Full documentation
- `STRUCTURE.md` - Project structure
- `config.yaml` - GCP configuration
- `pyproject.toml` - Dependencies

## 🔗 Main Components

1. **packages/chunker/** - Chunking (Recursive, Contextual, LLM)
2. **packages/embedder/** - Embeddings (Vertex AI)
3. **packages/file-storage/** - Storage (GCS)
4. **packages/vector-search/** - Vector Search (Vertex AI)
5. **apps/index-gen/** - Full pipeline

---

**Total size**: ~400 KB | **Python files**: 33