First commit
97
QUICKSTART.md
Normal file
@@ -0,0 +1,97 @@
# Quick Start - GCP Workbench

## 📦 Installation on Workbench

```bash
# 1. Install system dependencies (if needed)
sudo apt-get update
sudo apt-get install -y poppler-utils libcairo2-dev

# 2. Install Python dependencies
cd ~/pipeline
uv sync

# 3. Configure credentials (should already be set up on Workbench)
# Application Default Credentials are already configured
```

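To confirm that the system dependencies from step 1 are actually available, a small check like the one below may help. It is a minimal sketch: `find_missing_tools` is a hypothetical helper, not part of this project, and checking `pdftoppm` and `pdftotext` (two binaries shipped by `poppler-utils`) is an assumption about what the pipeline needs.

```python
import shutil

# Binaries shipped by the poppler-utils package. Checking exactly these two
# is an assumption; the project does not document which ones it calls.
POPPLER_TOOLS = ("pdftoppm", "pdftotext")


def find_missing_tools(tools=POPPLER_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print(f"Missing tools: {', '.join(missing)} (try installing poppler-utils)")
    else:
        print("poppler-utils binaries found on PATH")
```
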
## ⚙️ Minimal Configuration

Edit `config.yaml`:

```yaml
project_id: "your-gcp-project"
location: "us-central1"
bucket: "your-gcs-bucket"

index:
  name: "my-vector-index"
  dimensions: 768
  machine_type: "e2-standard-2"
```

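Before running anything against GCP, it can be useful to sanity-check the configuration. The sketch below validates a dict with the same keys as the example above. `validate_config` is a hypothetical helper, not part of this project, and it assumes `name`, `dimensions`, and `machine_type` are nested under `index`.

```python
# Keys taken from the config.yaml example; the nesting under "index" is assumed.
REQUIRED_TOP_LEVEL = ("project_id", "location", "bucket", "index")
REQUIRED_INDEX_KEYS = ("name", "dimensions", "machine_type")


def validate_config(cfg):
    """Return a list of problems found in `cfg`; an empty list means it looks OK."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL if not cfg.get(k)]
    index = cfg.get("index")
    if isinstance(index, dict):
        problems += [
            f"missing key: index.{k}" for k in REQUIRED_INDEX_KEYS if k not in index
        ]
        dims = index.get("dimensions")
        if dims is not None and (not isinstance(dims, int) or dims <= 0):
            problems.append("index.dimensions must be a positive integer")
    elif index is not None:
        problems.append("index must be a mapping")
    return problems


# In practice the dict would come from parsing config.yaml,
# for example with PyYAML's yaml.safe_load.
example = {
    "project_id": "your-gcp-project",
    "location": "us-central1",
    "bucket": "your-gcs-bucket",
    "index": {"name": "my-vector-index", "dimensions": 768, "machine_type": "e2-standard-2"},
}
print(validate_config(example))
```

Returning a list of problems instead of raising on the first one lets you fix every issue in a single pass.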
## 🚀 Quick Usage

### 1. Simple Chunking
```python
from chunker.recursive_chunker import RecursiveChunker
from pathlib import Path

chunker = RecursiveChunker()
docs = chunker.process_text("Your text here")
print(f"Chunks: {len(docs)}")
```

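For intuition about what recursive chunking generally does, here is a minimal standalone sketch. It is not the project's `RecursiveChunker`: `recursive_split`, its separator order, and its default size are assumptions made for illustration. The idea is to split on the coarsest separator first and fall back to finer ones only for pieces that are still too large.

```python
def recursive_split(text, max_size=800, separators=("\n\n", "\n", " ")):
    """Split `text` into chunks of at most `max_size` characters."""
    if len(text) <= max_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to fixed-width slices.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_size:
            # Keep merging pieces while the chunk still fits.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, max_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc", max_size=7))  # ['aaa bbb', 'ccc']
```
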
### 2. Contextual Chunking (Recommended)
```python
from pathlib import Path

from chunker.contextual_chunker import ContextualChunker
from llm.vertex_ai import VertexAILLM

llm = VertexAILLM(project="your-project", location="us-central1")
chunker = ContextualChunker(llm_client=llm, max_chunk_size=800)
docs = chunker.process_path(Path("document.txt"))
```

### 3. Generate Embeddings
```python
from embedder.vertex_ai import VertexAIEmbedder

embedder = VertexAIEmbedder(
    model_name="text-embedding-005",
    project="your-project",
    location="us-central1"
)
embedding = embedder.generate_embedding("text")
```

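Once you have embeddings, a common next step is comparing two of them. The sketch below computes cosine similarity with the standard library only. It assumes `generate_embedding` returns a flat sequence of floats, which this quick start does not confirm, and `cosine_similarity` is a hypothetical helper rather than part of the project.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors stand in for real embeddings here.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Values close to 1 indicate vectors pointing in nearly the same direction, and values close to 0 indicate nearly orthogonal vectors.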
### 4. Full Pipeline
```python
from apps.index_gen.src.index_gen.main import process_file

process_file(
    file_path="gs://bucket/file.pdf",
    model_name="text-embedding-005",
    contents_output_dir="gs://bucket/contents/",
    vectors_output_file="vectors.jsonl",
    chunk_limit=800
)
```

## 📚 Important Files

- `README.md` - Full documentation
- `STRUCTURE.md` - Project structure
- `config.yaml` - GCP configuration
- `pyproject.toml` - Dependencies

## 🔗 Main Components

1. **packages/chunker/** - Chunking (Recursive, Contextual, LLM)
2. **packages/embedder/** - Embeddings (Vertex AI)
3. **packages/file-storage/** - Storage (GCS)
4. **packages/vector-search/** - Vector Search (Vertex AI)
5. **apps/index-gen/** - Full pipeline

---

**Total size**: ~400 KB | **Python files**: 33