Files
latticelm/OBSERVABILITY.md

8.5 KiB

Observability Implementation

This document describes the observability features implemented in the LLM Gateway.

Overview

The gateway now includes comprehensive observability with:

  • Prometheus Metrics: Track HTTP requests, provider calls, token usage, and conversation operations
  • OpenTelemetry Tracing: Distributed tracing with OTLP exporter support
  • Enhanced Logging: Trace context correlation for log aggregation

Configuration

Add the following to your config.yaml:

observability:
  enabled: true  # Master switch for all observability features

  metrics:
    enabled: true
    path: "/metrics"  # Prometheus metrics endpoint

  tracing:
    enabled: true
    service_name: "llm-gateway"
    sampler:
      type: "probability"  # "always", "never", or "probability"
      rate: 0.1  # 10% sampling rate
    exporter:
      type: "otlp"  # "otlp" for production, "stdout" for development
      endpoint: "localhost:4317"  # OTLP collector endpoint
      insecure: true  # Use insecure connection (for development)
      # headers:  # Optional authentication headers
      #   authorization: "Bearer your-token"

Metrics

HTTP Metrics

  • http_requests_total - Total HTTP requests (labels: method, path, status)
  • http_request_duration_seconds - Request latency histogram
  • http_request_size_bytes - Request body size histogram
  • http_response_size_bytes - Response body size histogram

Provider Metrics

  • provider_requests_total - Provider API calls (labels: provider, model, operation, status)
  • provider_request_duration_seconds - Provider latency histogram
  • provider_tokens_total - Token usage (labels: provider, model, type=input/output)
  • provider_stream_ttfb_seconds - Time to first byte for streaming
  • provider_stream_chunks_total - Stream chunk count
  • provider_stream_duration_seconds - Total stream duration

Conversation Store Metrics

  • conversation_operations_total - Store operations (labels: operation, backend, status)
  • conversation_operation_duration_seconds - Store operation latency
  • conversation_active_count - Current number of conversations (gauge)

Example Queries

# Request rate
rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Tokens per minute by model
rate(provider_tokens_total[1m]) * 60

# Provider latency by model
histogram_quantile(0.95, rate(provider_request_duration_seconds_bucket[5m])) by (provider, model)

Tracing

Trace Structure

Each request creates a trace with the following span hierarchy:

HTTP GET /v1/responses
├── provider.generate or provider.generate_stream
├── conversation.get (if using previous_response_id)
└── conversation.create (to store result)

Span Attributes

HTTP spans include:

  • http.method, http.route, http.status_code
  • http.request_id - Request ID for correlation
  • trace_id, span_id - For log correlation

Provider spans include:

  • provider.name, provider.model
  • provider.input_tokens, provider.output_tokens
  • provider.chunk_count, provider.ttfb_seconds (for streaming)

Conversation spans include:

  • conversation.id, conversation.backend
  • conversation.message_count, conversation.model

Log Correlation

Logs now include trace_id and span_id fields when tracing is enabled, allowing you to:

  1. Find all logs for a specific trace
  2. Jump from a log entry to the corresponding trace in Jaeger/Tempo

Example log entry:

{
  "time": "2026-03-03T06:36:44Z",
  "level": "INFO",
  "msg": "response generated",
  "request_id": "74722802-6be1-4e14-8e73-d86823fed3e3",
  "trace_id": "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
  "span_id": "1a2b3c4d5e6f7a8b",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "input_tokens": 23,
  "output_tokens": 156
}

Testing Observability

1. Test Metrics Endpoint

# Start the gateway with observability enabled
./bin/gateway -config config.yaml

# Query metrics endpoint
curl http://localhost:8080/metrics

Expected output includes:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 1

# HELP conversation_active_count Number of active conversations
# TYPE conversation_active_count gauge
conversation_active_count{backend="memory"} 0

2. Test Tracing with Stdout Exporter

Set up config with stdout exporter for quick testing:

observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "always"
    exporter:
      type: "stdout"

Make a request and check the logs for JSON-formatted spans.

3. Test Tracing with Jaeger

Run Jaeger with OTLP support:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

Update config:

observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "probability"
      rate: 1.0  # 100% for testing
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true

Make requests and view traces at http://localhost:16686

4. End-to-End Test

# Make a test request
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "Hello, world!"
  }'

# Check metrics
curl http://localhost:8080/metrics | grep -E "(http_requests|provider_)"

# Expected metrics updates:
# - http_requests_total incremented
# - provider_requests_total incremented
# - provider_tokens_total incremented for input and output
# - provider_request_duration_seconds updated

5. Load Test

# Install hey if needed
go install github.com/rakyll/hey@latest

# Run load test
hey -n 1000 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"test"}' \
  http://localhost:8080/v1/responses

# Check metrics for aggregated data
curl http://localhost:8080/metrics | grep http_request_duration_seconds

Integration with Monitoring Stack

Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana

Import dashboards for:

  • HTTP request rates and latencies
  • Provider performance by model
  • Token usage and costs
  • Error rates and types

Tempo/Jaeger

The gateway exports traces via OTLP protocol. Configure your trace backend to accept OTLP on port 4317 (gRPC).

Architecture

Middleware Chain

Client Request
    ↓
loggingMiddleware (request ID, logging)
    ↓
tracingMiddleware (W3C Trace Context, spans)
    ↓
metricsMiddleware (Prometheus metrics)
    ↓
rateLimitMiddleware (rate limiting)
    ↓
authMiddleware (authentication)
    ↓
Application Routes

Instrumentation Pattern

  • Providers: Wrapped with InstrumentedProvider that tracks calls, latency, and token usage
  • Conversation Store: Wrapped with InstrumentedStore that tracks operations and size
  • HTTP Layer: Middleware captures request/response metrics and creates trace spans

W3C Trace Context

The gateway supports W3C Trace Context propagation:

  • Extracts traceparent header from incoming requests
  • Creates child spans for downstream operations
  • Propagates context through the entire request lifecycle

Performance Impact

Observability features have minimal overhead:

  • Metrics: < 1% latency increase
  • Tracing (10% sampling): < 2% latency increase
  • Tracing (100% sampling): < 5% latency increase

Recommended configuration for production:

  • Metrics: Enabled
  • Tracing: Enabled with 10-20% sampling rate
  • Exporter: OTLP to dedicated collector

Troubleshooting

Metrics endpoint returns 404

  • Check observability.metrics.enabled is true
  • Verify observability.enabled is true
  • Check observability.metrics.path configuration

No traces appearing in Jaeger

  • Verify OTLP collector is running on configured endpoint
  • Check sampling rate (try type: "always" for testing)
  • Look for tracer initialization errors in logs
  • Verify observability.tracing.enabled is true

High memory usage

  • Reduce trace sampling rate
  • Check for metric cardinality explosion (too many label combinations)
  • Consider using recording rules in Prometheus

Missing trace IDs in logs

  • Ensure tracing is enabled
  • Check that requests are being sampled (sampling rate > 0)
  • Verify OpenTelemetry dependencies are correctly installed