# Observability Implementation

This document describes the observability features implemented in the LLM Gateway.

## Overview

The gateway now includes comprehensive observability with:

- **Prometheus Metrics**: Track HTTP requests, provider calls, token usage, and conversation operations
- **OpenTelemetry Tracing**: Distributed tracing with OTLP exporter support
- **Enhanced Logging**: Trace context correlation for log aggregation

## Configuration

Add the following to your `config.yaml`:

```yaml
observability:
  enabled: true  # Master switch for all observability features

  metrics:
    enabled: true
    path: "/metrics"  # Prometheus metrics endpoint

  tracing:
    enabled: true
    service_name: "llm-gateway"
    sampler:
      type: "probability"  # "always", "never", or "probability"
      rate: 0.1            # 10% sampling rate
    exporter:
      type: "otlp"                # "otlp" for production, "stdout" for development
      endpoint: "localhost:4317"  # OTLP collector endpoint
      insecure: true              # Use insecure connection (for development)
      # headers:                  # Optional authentication headers
      #   authorization: "Bearer your-token"
```

## Metrics

### HTTP Metrics

- `http_requests_total` - Total HTTP requests (labels: method, path, status)
- `http_request_duration_seconds` - Request latency histogram
- `http_request_size_bytes` - Request body size histogram
- `http_response_size_bytes` - Response body size histogram

### Provider Metrics

- `provider_requests_total` - Provider API calls (labels: provider, model, operation, status)
- `provider_request_duration_seconds` - Provider latency histogram
- `provider_tokens_total` - Token usage (labels: provider, model, type=input/output)
- `provider_stream_ttfb_seconds` - Time to first byte for streaming
- `provider_stream_chunks_total` - Stream chunk count
- `provider_stream_duration_seconds` - Total stream duration

### Conversation Store Metrics

- `conversation_operations_total` - Store operations (labels: operation, backend, status)
- `conversation_operation_duration_seconds` - Store operation latency
- `conversation_active_count` - Current number of conversations (gauge)

### Example Queries

```promql
# Request rate
rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Tokens per minute by model
sum by (model) (rate(provider_tokens_total[1m])) * 60

# Provider latency by model
histogram_quantile(0.95, sum by (le, provider, model) (rate(provider_request_duration_seconds_bucket[5m])))
```

## Tracing

### Trace Structure

Each request creates a trace with the following span hierarchy:

```
HTTP POST /v1/responses
├── provider.generate or provider.generate_stream
├── conversation.get (if using previous_response_id)
└── conversation.create (to store result)
```

### Span Attributes

HTTP spans include:

- `http.method`, `http.route`, `http.status_code`
- `http.request_id` - Request ID for correlation
- `trace_id`, `span_id` - For log correlation

Provider spans include:

- `provider.name`, `provider.model`
- `provider.input_tokens`, `provider.output_tokens`
- `provider.chunk_count`, `provider.ttfb_seconds` (for streaming)

Conversation spans include:

- `conversation.id`, `conversation.backend`
- `conversation.message_count`, `conversation.model`

### Log Correlation

Logs now include `trace_id` and `span_id` fields when tracing is enabled, allowing you to:

1. Find all logs for a specific trace
2. Jump from a log entry to the corresponding trace in Jaeger/Tempo

Example log entry:

```json
{
  "time": "2026-03-03T06:36:44Z",
  "level": "INFO",
  "msg": "response generated",
  "request_id": "74722802-6be1-4e14-8e73-d86823fed3e3",
  "trace_id": "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
  "span_id": "1a2b3c4d5e6f7a8b",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "input_tokens": 23,
  "output_tokens": 156
}
```

## Testing Observability

### 1. Test Metrics Endpoint

```bash
# Start the gateway with observability enabled
./bin/gateway -config config.yaml

# Query metrics endpoint
curl http://localhost:8080/metrics
```

Expected output includes:

```
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 1

# HELP conversation_active_count Number of active conversations
# TYPE conversation_active_count gauge
conversation_active_count{backend="memory"} 0
```

### 2. Test Tracing with Stdout Exporter

Set up the config with the stdout exporter for quick testing:

```yaml
observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "always"
    exporter:
      type: "stdout"
```

Make a request and check the logs for JSON-formatted spans.

### 3. Test Tracing with Jaeger

Run Jaeger with OTLP support:

```bash
docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest
```

Update config:

```yaml
observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "probability"
      rate: 1.0  # 100% for testing
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true
```

Make requests and view traces at http://localhost:16686

### 4. End-to-End Test

```bash
# Make a test request
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "Hello, world!"
  }'

# Check metrics
curl http://localhost:8080/metrics | grep -E "(http_requests|provider_)"

# Expected metrics updates:
# - http_requests_total incremented
# - provider_requests_total incremented
# - provider_tokens_total incremented for input and output
# - provider_request_duration_seconds updated
```

### 5. Load Test

```bash
# Install hey if needed
go install github.com/rakyll/hey@latest

# Run load test
hey -n 1000 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"test"}' \
  http://localhost:8080/v1/responses

# Check metrics for aggregated data
curl http://localhost:8080/metrics | grep http_request_duration_seconds
```

## Integration with Monitoring Stack

### Prometheus

Add to `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

### Grafana

Import dashboards for:

- HTTP request rates and latencies
- Provider performance by model
- Token usage and costs
- Error rates and types

### Tempo/Jaeger

The gateway exports traces via OTLP. Configure your trace backend to accept OTLP on port 4317 (gRPC).

## Architecture

### Middleware Chain

```
Client Request
    ↓
loggingMiddleware (request ID, logging)
    ↓
tracingMiddleware (W3C Trace Context, spans)
    ↓
metricsMiddleware (Prometheus metrics)
    ↓
rateLimitMiddleware (rate limiting)
    ↓
authMiddleware (authentication)
    ↓
Application Routes
```

### Instrumentation Pattern

- **Providers**: Wrapped with `InstrumentedProvider` that tracks calls, latency, and token usage
- **Conversation Store**: Wrapped with `InstrumentedStore` that tracks operations and size
- **HTTP Layer**: Middleware captures request/response metrics and creates trace spans

### W3C Trace Context

The gateway supports W3C Trace Context propagation:

- Extracts `traceparent` header from incoming requests
- Creates child spans for downstream operations
- Propagates context through the entire request lifecycle

## Performance Impact

Observability features have minimal overhead:

- Metrics: < 1% latency increase
- Tracing (10% sampling): < 2% latency increase
- Tracing (100% sampling): < 5% latency increase

Recommended configuration for production:

- Metrics: Enabled
- Tracing: Enabled with 10-20% sampling rate
- Exporter: OTLP to dedicated collector

## Troubleshooting

### Metrics endpoint returns 404

- Check `observability.metrics.enabled` is `true`
- Verify `observability.enabled` is `true`
- Check `observability.metrics.path` configuration

### No traces appearing in Jaeger

- Verify OTLP collector is running on configured endpoint
- Check sampling rate (try `type: "always"` for testing)
- Look for tracer initialization errors in logs
- Verify `observability.tracing.enabled` is `true`

### High memory usage

- Reduce trace sampling rate
- Check for metric cardinality explosion (too many label combinations)
- Consider using recording rules in Prometheus

### Missing trace IDs in logs

- Ensure tracing is enabled
- Check that requests are being sampled (sampling rate > 0)
- Verify OpenTelemetry dependencies are correctly installed
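
When trace IDs are missing for requests that arrive with a `traceparent` header, one quick check is whether that header is well-formed and has the sampled flag set. The sketch below parses the header by hand using only the Go standard library. It is an illustrative debugging aid, not the gateway's code: `TraceParent` and `parseTraceparent` are hypothetical names, only version `00` of the W3C format is handled, and a real service would typically rely on the OpenTelemetry propagator instead. Whether the gateway's sampler honors the upstream sampled flag is not stated in this document, so treat that flag as a hint only.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// TraceParent holds the fields of a W3C Trace Context `traceparent` header.
// Hypothetical type for illustration; not part of the gateway.
type TraceParent struct {
	TraceID string // 32 lowercase hex characters
	SpanID  string // 16 lowercase hex characters ("parent-id" in the spec)
	Sampled bool   // lowest bit of the trace-flags field
}

// isLowerHex reports whether s is exactly n lowercase hex characters.
func isLowerHex(s string, n int) bool {
	if len(s) != n {
		return false
	}
	for _, c := range s {
		if !(c >= '0' && c <= '9') && !(c >= 'a' && c <= 'f') {
			return false
		}
	}
	return true
}

// parseTraceparent validates a version-00 header of the form
// 00-<trace-id>-<parent-id>-<trace-flags>. Simplification: other
// versions are rejected rather than parsed leniently.
func parseTraceparent(h string) (TraceParent, error) {
	parts := strings.Split(strings.TrimSpace(h), "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	if parts[0] != "00" {
		return TraceParent{}, errors.New("traceparent: only version 00 is handled in this sketch")
	}
	traceID, spanID, flags := parts[1], parts[2], parts[3]
	if !isLowerHex(traceID, 32) || strings.Trim(traceID, "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid trace-id")
	}
	if !isLowerHex(spanID, 16) || strings.Trim(spanID, "0") == "" {
		return TraceParent{}, errors.New("traceparent: invalid parent-id")
	}
	if !isLowerHex(flags, 2) {
		return TraceParent{}, errors.New("traceparent: invalid trace-flags")
	}
	b, err := strconv.ParseUint(flags, 16, 8)
	if err != nil {
		return TraceParent{}, err
	}
	return TraceParent{TraceID: traceID, SpanID: spanID, Sampled: b&0x01 == 1}, nil
}

func main() {
	// Example header value taken from the W3C Trace Context specification.
	tp, err := parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Printf("trace_id=%s span_id=%s sampled=%t\n", tp.TraceID, tp.SpanID, tp.Sampled)
}
```

If the parsed `trace_id` never shows up in the gateway's logs, the header is probably not reaching the tracing middleware; if it parses but `sampled` is false, try `type: "always"` while debugging.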