# Observability Implementation

This document describes the observability features implemented in the LLM Gateway.

## Overview

The gateway provides observability through three mechanisms:

- **Prometheus Metrics**: Track HTTP requests, provider calls, token usage, and conversation operations
- **OpenTelemetry Tracing**: Distributed tracing with OTLP exporter support
- **Enhanced Logging**: Trace context correlation for log aggregation

## Configuration

Add the following to your `config.yaml`:

```yaml
observability:
  enabled: true # Master switch for all observability features

  metrics:
    enabled: true
    path: "/metrics" # Prometheus metrics endpoint

  tracing:
    enabled: true
    service_name: "llm-gateway"
    sampler:
      type: "probability" # "always", "never", or "probability"
      rate: 0.1 # 10% sampling rate
    exporter:
      type: "otlp" # "otlp" for production, "stdout" for development
      endpoint: "localhost:4317" # OTLP collector endpoint
      insecure: true # Use insecure connection (for development)
      # headers: # Optional authentication headers
      #   authorization: "Bearer your-token"
```
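
To make the sampler settings concrete, the following sketch shows one way the three sampler types could map to an effective sampling ratio. The `SamplerConfig` type and the `EffectiveRate` function are hypothetical names chosen for illustration; they are not taken from the gateway's source.

```go
package main

import "fmt"

// SamplerConfig mirrors the observability.tracing.sampler block.
// The type and field names are assumptions made for this sketch.
type SamplerConfig struct {
	Type string  // "always", "never", or "probability"
	Rate float64 // only consulted when Type is "probability"
}

// EffectiveRate returns the fraction of traces that would be sampled,
// or an error when the configuration is not valid.
func EffectiveRate(c SamplerConfig) (float64, error) {
	switch c.Type {
	case "always":
		return 1.0, nil
	case "never":
		return 0.0, nil
	case "probability":
		// Written this way so that NaN is rejected as well.
		if !(c.Rate >= 0 && c.Rate <= 1) {
			return 0, fmt.Errorf("sampler rate %v is outside [0, 1]", c.Rate)
		}
		return c.Rate, nil
	default:
		return 0, fmt.Errorf("unknown sampler type %q", c.Type)
	}
}

func main() {
	configs := []SamplerConfig{
		{Type: "always"},
		{Type: "probability", Rate: 0.1},
		{Type: "probability", Rate: 1.5}, // invalid: rate above 1
	}
	for _, c := range configs {
		rate, err := EffectiveRate(c)
		fmt.Println(c.Type, rate, err)
	}
}
```

Rejecting out-of-range rates at startup, rather than clamping them silently, makes a misconfigured sampler visible before any traces are lost.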

## Metrics

### HTTP Metrics
- `http_requests_total` - Total HTTP requests (labels: method, path, status)
- `http_request_duration_seconds` - Request latency histogram
- `http_request_size_bytes` - Request body size histogram
- `http_response_size_bytes` - Response body size histogram

### Provider Metrics
- `provider_requests_total` - Provider API calls (labels: provider, model, operation, status)
- `provider_request_duration_seconds` - Provider latency histogram
- `provider_tokens_total` - Token usage (labels: provider, model, type=input/output)
- `provider_stream_ttfb_seconds` - Time to first byte for streaming
- `provider_stream_chunks_total` - Stream chunk count
- `provider_stream_duration_seconds` - Total stream duration

### Conversation Store Metrics
- `conversation_operations_total` - Store operations (labels: operation, backend, status)
- `conversation_operation_duration_seconds` - Store operation latency
- `conversation_active_count` - Current number of conversations (gauge)

### Example Queries

```promql
# Request rate (requests per second)
rate(http_requests_total[5m])

# P95 latency across all series
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx error rate (requests per second)
rate(http_requests_total{status=~"5.."}[5m])

# Tokens per minute by model
sum by (model) (rate(provider_tokens_total[1m])) * 60

# P95 provider latency by provider and model
histogram_quantile(0.95, sum by (le, provider, model) (rate(provider_request_duration_seconds_bucket[5m])))
```
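
The P95 queries rely on `histogram_quantile`, which estimates a quantile by interpolating linearly inside cumulative buckets. The sketch below reproduces that idea in Go so the estimate can be checked by hand. The bucket bounds and counts in `sampleBuckets` are made up for illustration, and the function is a simplified approximation of the calculation, not Prometheus's own code.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one cumulative histogram bucket, like a *_bucket series:
// Count is the number of observations less than or equal to UpperBound.
type Bucket struct {
	UpperBound float64
	Count      float64
}

// sampleBuckets holds invented latency data (seconds) for the example.
var sampleBuckets = []Bucket{
	{UpperBound: 0.1, Count: 50},
	{UpperBound: 0.5, Count: 90},
	{UpperBound: 1.0, Count: 98},
	{UpperBound: math.Inf(1), Count: 100},
}

// Quantile estimates the q-quantile from cumulative buckets that are
// sorted by UpperBound and end with a +Inf bucket.
func Quantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		if math.IsInf(b.UpperBound, 1) {
			if i == 0 {
				return math.NaN()
			}
			// The quantile falls in the +Inf bucket, so the best
			// available answer is the highest finite bound.
			return buckets[i-1].UpperBound
		}
		lower, prevCount := 0.0, 0.0
		if i > 0 {
			lower, prevCount = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		// Assume observations are spread evenly inside the bucket.
		return lower + (b.UpperBound-lower)*(rank-prevCount)/(b.Count-prevCount)
	}
	return math.NaN()
}

func main() {
	// Rank 95 of 100 lands in the (0.5, 1.0] bucket, 5/8 of the way in.
	fmt.Printf("%.4f\n", Quantile(0.95, sampleBuckets)) // prints 0.8125
}
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.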

## Tracing

### Trace Structure

Each request creates a trace with the following span hierarchy:
```
HTTP POST /v1/responses
├── provider.generate or provider.generate_stream
├── conversation.get (if using previous_response_id)
└── conversation.create (to store result)
```

### Span Attributes

HTTP spans include:
- `http.method`, `http.route`, `http.status_code`
- `http.request_id` - Request ID for correlation
- `trace_id`, `span_id` - For log correlation

Provider spans include:
- `provider.name`, `provider.model`
- `provider.input_tokens`, `provider.output_tokens`
- `provider.chunk_count`, `provider.ttfb_seconds` (for streaming)

Conversation spans include:
- `conversation.id`, `conversation.backend`
- `conversation.message_count`, `conversation.model`

### Log Correlation

Logs now include `trace_id` and `span_id` fields when tracing is enabled, allowing you to:
1. Find all logs for a specific trace
2. Jump from a log entry to the corresponding trace in Jaeger/Tempo

Example log entry:
```json
{
  "time": "2026-03-03T06:36:44Z",
  "level": "INFO",
  "msg": "response generated",
  "request_id": "74722802-6be1-4e14-8e73-d86823fed3e3",
  "trace_id": "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
  "span_id": "1a2b3c4d5e6f7a8b",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "input_tokens": 23,
  "output_tokens": 156
}
```
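
A minimal sketch of how such fields could be attached with Go's standard `log/slog` package follows. It assumes the trace and span IDs are available on the request context. In a real OpenTelemetry setup they would come from the active span context, but here they are stored under a hypothetical context key so the example needs no external dependencies. The helper names (`withTrace`, `loggerFor`, `logOnce`, `demo`) are illustrative only.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// traceIDs stands in for the identifiers of the active span.
type traceIDs struct{ TraceID, SpanID string }

// ctxKey is a private context key type for this sketch.
type ctxKey struct{}

// withTrace stores the identifiers on the context.
func withTrace(ctx context.Context, ids traceIDs) context.Context {
	return context.WithValue(ctx, ctxKey{}, ids)
}

// loggerFor adds trace_id and span_id when the context carries them
// and returns the base logger unchanged otherwise.
func loggerFor(ctx context.Context, base *slog.Logger) *slog.Logger {
	ids, ok := ctx.Value(ctxKey{}).(traceIDs)
	if !ok {
		return base
	}
	return base.With("trace_id", ids.TraceID, "span_id", ids.SpanID)
}

// logOnce writes one JSON log line and decodes it back into a map.
func logOnce(ctx context.Context) map[string]any {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))
	loggerFor(ctx, base).Info("response generated", "provider", "openai")

	var entry map[string]any
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		panic(err)
	}
	return entry
}

// demo logs once with trace context and once without.
func demo() (traced, untraced map[string]any) {
	ctx := withTrace(context.Background(), traceIDs{
		TraceID: "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
		SpanID:  "1a2b3c4d5e6f7a8b",
	})
	return logOnce(ctx), logOnce(context.Background())
}

func main() {
	traced, untraced := demo()
	fmt.Println(traced["trace_id"], traced["span_id"])
	_, hasTrace := untraced["trace_id"]
	fmt.Println("untraced entry has trace_id:", hasTrace)
}
```

Deriving the logger from the context keeps call sites unchanged: handlers log as usual, and the correlation fields appear only when a trace is active.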

## Testing Observability

### 1. Test Metrics Endpoint

```bash
# Start the gateway with observability enabled
./bin/gateway -config config.yaml

# Query metrics endpoint
curl http://localhost:8080/metrics
```

Expected output includes:
```
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 1

# HELP conversation_active_count Number of active conversations
# TYPE conversation_active_count gauge
conversation_active_count{backend="memory"} 0
```

### 2. Test Tracing with Stdout Exporter

Set up config with stdout exporter for quick testing:

```yaml
observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "always"
    exporter:
      type: "stdout"
```

Make a request and check the logs for JSON-formatted spans.

### 3. Test Tracing with Jaeger

Run Jaeger with OTLP support:

```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```

Update config:
```yaml
observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "probability"
      rate: 1.0 # 100% for testing
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true
```

Make requests and view traces at http://localhost:16686

### 4. End-to-End Test

```bash
# Make a test request
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "Hello, world!"
  }'

# Check metrics
curl http://localhost:8080/metrics | grep -E "(http_requests|provider_)"

# Expected metrics updates:
# - http_requests_total incremented
# - provider_requests_total incremented
# - provider_tokens_total incremented for input and output
# - provider_request_duration_seconds updated
```

### 5. Load Test

```bash
# Install hey if needed
go install github.com/rakyll/hey@latest

# Run load test
hey -n 1000 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"test"}' \
  http://localhost:8080/v1/responses

# Check metrics for aggregated data
curl http://localhost:8080/metrics | grep http_request_duration_seconds
```

## Integration with Monitoring Stack

### Prometheus

Add to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

### Grafana

Import dashboards for:
- HTTP request rates and latencies
- Provider performance by model
- Token usage and costs
- Error rates and types

### Tempo/Jaeger

The gateway exports traces via OTLP protocol. Configure your trace backend to accept OTLP on port 4317 (gRPC).

## Architecture

### Middleware Chain

```
Client Request
↓
loggingMiddleware (request ID, logging)
↓
tracingMiddleware (W3C Trace Context, spans)
↓
metricsMiddleware (Prometheus metrics)
↓
rateLimitMiddleware (rate limiting)
↓
authMiddleware (authentication)
↓
Application Routes
```
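
The ordering in the diagram can be sketched with plain `net/http` handlers. The example below is an illustration under assumptions: `chain` and `named` are hypothetical helpers, and each middleware is a placeholder that only records its name, standing in for the gateway's real logging, tracing, metrics, rate-limit, and auth middleware.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

type middleware func(http.Handler) http.Handler

// chain wraps h so that the first middleware listed is the outermost
// one: it sees the request first and the response last.
func chain(h http.Handler, mws ...middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// named returns a placeholder middleware that records its name when a
// request passes through it.
func named(name string, order *[]string) middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*order = append(*order, name)
			next.ServeHTTP(w, r)
		})
	}
}

// runChain sends one request through the chain and returns the order
// in which the layers were visited.
func runChain() []string {
	var order []string
	routes := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		order = append(order, "routes")
		w.WriteHeader(http.StatusOK)
	})
	h := chain(routes,
		named("logging", &order),
		named("tracing", &order),
		named("metrics", &order),
		named("rateLimit", &order),
		named("auth", &order),
	)
	req := httptest.NewRequest(http.MethodPost, "/v1/responses", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return order
}

func main() {
	fmt.Println(runChain()) // prints [logging tracing metrics rateLimit auth routes]
}
```

Placing logging, tracing, and metrics outside rate limiting and authentication means that rejected requests are still logged, traced, and counted.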

### Instrumentation Pattern

- **Providers**: Wrapped with `InstrumentedProvider` that tracks calls, latency, and token usage
- **Conversation Store**: Wrapped with `InstrumentedStore` that tracks operations and size
- **HTTP Layer**: Middleware captures request/response metrics and creates trace spans
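
The wrapper approach is a decorator: the instrumented type implements the same interface as the thing it wraps and records measurements around each call. The sketch below shows the shape of that pattern only. The `Provider` interface, the `Stats` struct (standing in for Prometheus counters and histograms), and `fakeProvider` are simplified assumptions, not the gateway's actual definitions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Provider is a hypothetical, simplified provider interface.
type Provider interface {
	Generate(ctx context.Context, prompt string) (string, error)
}

// Stats stands in for Prometheus metrics in this sketch.
// It is not safe for concurrent use.
type Stats struct {
	Calls  int
	Errors int
	Total  time.Duration
}

// InstrumentedProvider wraps another Provider and records the number
// of calls, the number of errors, and the cumulative latency.
type InstrumentedProvider struct {
	next  Provider
	stats *Stats
}

func (p InstrumentedProvider) Generate(ctx context.Context, prompt string) (string, error) {
	start := time.Now()
	out, err := p.next.Generate(ctx, prompt)
	p.stats.Calls++
	p.stats.Total += time.Since(start)
	if err != nil {
		p.stats.Errors++
	}
	return out, err
}

// fakeProvider echoes the prompt and fails on empty input.
type fakeProvider struct{}

func (fakeProvider) Generate(_ context.Context, prompt string) (string, error) {
	if prompt == "" {
		return "", errors.New("empty prompt")
	}
	return "echo: " + prompt, nil
}

// runDemo makes one successful and one failing call through the wrapper.
func runDemo() Stats {
	stats := &Stats{}
	var p Provider = InstrumentedProvider{next: fakeProvider{}, stats: stats}
	_, _ = p.Generate(context.Background(), "hello")
	_, _ = p.Generate(context.Background(), "")
	return *stats
}

func main() {
	s := runDemo()
	fmt.Println(s.Calls, s.Errors) // prints 2 1
}
```

Because the wrapper satisfies the same interface, callers do not need to know whether instrumentation is enabled.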

### W3C Trace Context

The gateway supports W3C Trace Context propagation:
- Extracts `traceparent` header from incoming requests
- Creates child spans for downstream operations
- Propagates context through the entire request lifecycle
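
For reference, a version `00` `traceparent` header has four dash-separated, lowercase hexadecimal fields: version, trace ID (32 characters), parent span ID (16 characters), and flags (2 characters). The sketch below parses that format by hand to show what extraction involves. In practice an OpenTelemetry propagator would normally do this, and `ParseTraceParent` is a hypothetical helper that handles version `00` only.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a version 00 traceparent header.
type TraceParent struct {
	TraceID  string
	ParentID string
	Sampled  bool
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9' || c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// ParseTraceParent validates and splits a traceparent header value.
func ParseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 fields")
	}
	version, traceID, parentID, flags := parts[0], parts[1], parts[2], parts[3]
	if version != "00" {
		return TraceParent{}, errors.New("traceparent: only version 00 is handled in this sketch")
	}
	if !isLowerHex(traceID, 32) || !isLowerHex(parentID, 16) || !isLowerHex(flags, 2) {
		return TraceParent{}, errors.New("traceparent: malformed field")
	}
	// All-zero identifiers are invalid.
	if traceID == strings.Repeat("0", 32) || parentID == strings.Repeat("0", 16) {
		return TraceParent{}, errors.New("traceparent: all-zero identifier")
	}
	f, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	// The lowest flag bit marks the trace as sampled.
	return TraceParent{TraceID: traceID, ParentID: parentID, Sampled: f&0x01 == 1}, nil
}

func main() {
	tp, err := ParseTraceParent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	fmt.Println(tp.TraceID, tp.ParentID, tp.Sampled, err)
}
```

A child span created for a downstream call keeps the same trace ID and uses its own new span ID as the parent ID in the outgoing header.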

## Performance Impact

Observability features have minimal overhead:
- Metrics: < 1% latency increase
- Tracing (10% sampling): < 2% latency increase
- Tracing (100% sampling): < 5% latency increase

Recommended configuration for production:
- Metrics: Enabled
- Tracing: Enabled with 10-20% sampling rate
- Exporter: OTLP to dedicated collector

## Troubleshooting

### Metrics endpoint returns 404
- Check that `observability.metrics.enabled` is `true`
- Verify that `observability.enabled` is `true`
- Check that `observability.metrics.path` matches the URL you are requesting

### No traces appearing in Jaeger
- Verify that the OTLP collector is running on the configured endpoint
- Check the sampling rate (try `type: "always"` for testing)
- Look for tracer initialization errors in the logs
- Verify that `observability.tracing.enabled` is `true`

### High memory usage
- Reduce the trace sampling rate
- Check for metric cardinality explosion (too many label combinations)
- Consider using recording rules in Prometheus
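
To see why label combinations matter, note that the worst-case number of time series for a metric is the product of the distinct values of each of its labels. The short sketch below works through that arithmetic. The cardinalities are hypothetical numbers chosen for illustration, not measurements from the gateway.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one
// metric: the product of the distinct values of each label.
func seriesCount(labelValues map[string]int) int {
	n := 1
	for _, v := range labelValues {
		n *= v
	}
	return n
}

func main() {
	// Hypothetical cardinalities for http_requests_total when the
	// path label holds route templates.
	bounded := map[string]int{"method": 4, "path": 10, "status": 8}

	// The same metric if raw URLs containing IDs end up in the
	// path label.
	unbounded := map[string]int{"method": 4, "path": 50000, "status": 8}

	fmt.Println(seriesCount(bounded), seriesCount(unbounded)) // prints 320 1600000
}
```

For histograms the count is multiplied again by the number of buckets, so keeping labels such as `path` bounded has a large effect on memory use.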

### Missing trace IDs in logs
- Ensure tracing is enabled
- Check that requests are being sampled (sampling rate > 0)
- Verify OpenTelemetry dependencies are correctly installed