8.5 KiB
Observability Implementation
This document describes the observability features implemented in the LLM Gateway.
Overview
The gateway now includes comprehensive observability with:
- Prometheus Metrics: Track HTTP requests, provider calls, token usage, and conversation operations
- OpenTelemetry Tracing: Distributed tracing with OTLP exporter support
- Enhanced Logging: Trace context correlation for log aggregation
Configuration
Add the following to your config.yaml:
observability:
enabled: true # Master switch for all observability features
metrics:
enabled: true
path: "/metrics" # Prometheus metrics endpoint
tracing:
enabled: true
service_name: "llm-gateway"
sampler:
type: "probability" # "always", "never", or "probability"
rate: 0.1 # 10% sampling rate
exporter:
type: "otlp" # "otlp" for production, "stdout" for development
endpoint: "localhost:4317" # OTLP collector endpoint
insecure: true # Use insecure connection (for development)
# headers: # Optional authentication headers
# authorization: "Bearer your-token"
Metrics
HTTP Metrics
http_requests_total- Total HTTP requests (labels: method, path, status)http_request_duration_seconds- Request latency histogramhttp_request_size_bytes- Request body size histogramhttp_response_size_bytes- Response body size histogram
Provider Metrics
provider_requests_total- Provider API calls (labels: provider, model, operation, status)provider_request_duration_seconds- Provider latency histogramprovider_tokens_total- Token usage (labels: provider, model, type=input/output)provider_stream_ttfb_seconds- Time to first byte for streamingprovider_stream_chunks_total- Stream chunk countprovider_stream_duration_seconds- Total stream duration
Conversation Store Metrics
conversation_operations_total- Store operations (labels: operation, backend, status)conversation_operation_duration_seconds- Store operation latencyconversation_active_count- Current number of conversations (gauge)
Example Queries
# Request rate
rate(http_requests_total[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# Tokens per minute by model
rate(provider_tokens_total[1m]) * 60
# Provider latency by model
histogram_quantile(0.95, rate(provider_request_duration_seconds_bucket[5m])) by (provider, model)
Tracing
Trace Structure
Each request creates a trace with the following span hierarchy:
HTTP GET /v1/responses
├── provider.generate or provider.generate_stream
├── conversation.get (if using previous_response_id)
└── conversation.create (to store result)
Span Attributes
HTTP spans include:
http.method,http.route,http.status_codehttp.request_id- Request ID for correlationtrace_id,span_id- For log correlation
Provider spans include:
provider.name,provider.modelprovider.input_tokens,provider.output_tokensprovider.chunk_count,provider.ttfb_seconds(for streaming)
Conversation spans include:
conversation.id,conversation.backendconversation.message_count,conversation.model
Log Correlation
Logs now include trace_id and span_id fields when tracing is enabled, allowing you to:
- Find all logs for a specific trace
- Jump from a log entry to the corresponding trace in Jaeger/Tempo
Example log entry:
{
"time": "2026-03-03T06:36:44Z",
"level": "INFO",
"msg": "response generated",
"request_id": "74722802-6be1-4e14-8e73-d86823fed3e3",
"trace_id": "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
"span_id": "1a2b3c4d5e6f7a8b",
"provider": "openai",
"model": "gpt-4o-mini",
"input_tokens": 23,
"output_tokens": 156
}
Testing Observability
1. Test Metrics Endpoint
# Start the gateway with observability enabled
./bin/gateway -config config.yaml
# Query metrics endpoint
curl http://localhost:8080/metrics
Expected output includes:
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 1
# HELP conversation_active_count Number of active conversations
# TYPE conversation_active_count gauge
conversation_active_count{backend="memory"} 0
2. Test Tracing with Stdout Exporter
Set up config with stdout exporter for quick testing:
observability:
enabled: true
tracing:
enabled: true
sampler:
type: "always"
exporter:
type: "stdout"
Make a request and check the logs for JSON-formatted spans.
3. Test Tracing with Jaeger
Run Jaeger with OTLP support:
docker run -d --name jaeger \
-e COLLECTOR_OTLP_ENABLED=true \
-p 4317:4317 \
-p 16686:16686 \
jaegertracing/all-in-one:latest
Update config:
observability:
enabled: true
tracing:
enabled: true
sampler:
type: "probability"
rate: 1.0 # 100% for testing
exporter:
type: "otlp"
endpoint: "localhost:4317"
insecure: true
Make requests and view traces at http://localhost:16686
4. End-to-End Test
# Make a test request
curl -X POST http://localhost:8080/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"input": "Hello, world!"
}'
# Check metrics
curl http://localhost:8080/metrics | grep -E "(http_requests|provider_)"
# Expected metrics updates:
# - http_requests_total incremented
# - provider_requests_total incremented
# - provider_tokens_total incremented for input and output
# - provider_request_duration_seconds updated
5. Load Test
# Install hey if needed
go install github.com/rakyll/hey@latest
# Run load test
hey -n 1000 -c 10 -m POST \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","input":"test"}' \
http://localhost:8080/v1/responses
# Check metrics for aggregated data
curl http://localhost:8080/metrics | grep http_request_duration_seconds
Integration with Monitoring Stack
Prometheus
Add to prometheus.yml:
scrape_configs:
- job_name: 'llm-gateway'
static_configs:
- targets: ['localhost:8080']
metrics_path: '/metrics'
scrape_interval: 15s
Grafana
Import dashboards for:
- HTTP request rates and latencies
- Provider performance by model
- Token usage and costs
- Error rates and types
Tempo/Jaeger
The gateway exports traces via OTLP protocol. Configure your trace backend to accept OTLP on port 4317 (gRPC).
Architecture
Middleware Chain
Client Request
↓
loggingMiddleware (request ID, logging)
↓
tracingMiddleware (W3C Trace Context, spans)
↓
metricsMiddleware (Prometheus metrics)
↓
rateLimitMiddleware (rate limiting)
↓
authMiddleware (authentication)
↓
Application Routes
Instrumentation Pattern
- Providers: Wrapped with
InstrumentedProviderthat tracks calls, latency, and token usage - Conversation Store: Wrapped with
InstrumentedStorethat tracks operations and size - HTTP Layer: Middleware captures request/response metrics and creates trace spans
W3C Trace Context
The gateway supports W3C Trace Context propagation:
- Extracts
traceparentheader from incoming requests - Creates child spans for downstream operations
- Propagates context through the entire request lifecycle
Performance Impact
Observability features have minimal overhead:
- Metrics: < 1% latency increase
- Tracing (10% sampling): < 2% latency increase
- Tracing (100% sampling): < 5% latency increase
Recommended configuration for production:
- Metrics: Enabled
- Tracing: Enabled with 10-20% sampling rate
- Exporter: OTLP to dedicated collector
Troubleshooting
Metrics endpoint returns 404
- Check
observability.metrics.enabledistrue - Verify
observability.enabledistrue - Check
observability.metrics.pathconfiguration
No traces appearing in Jaeger
- Verify OTLP collector is running on configured endpoint
- Check sampling rate (try
type: "always"for testing) - Look for tracer initialization errors in logs
- Verify
observability.tracing.enabledistrue
High memory usage
- Reduce trace sampling rate
- Check for metric cardinality explosion (too many label combinations)
- Consider using recording rules in Prometheus
Missing trace IDs in logs
- Ensure tracing is enabled
- Check that requests are being sampled (sampling rate > 0)
- Verify OpenTelemetry dependencies are correctly installed