A8065384/latticelm

Fork 0

Files

Anibal Angulo b56c78fa07 Add observabilitty and monitoring

2026-03-03 06:40:08 +00:00

8.5 KiB

Raw Blame History

Observability Implementation

This document describes the observability features implemented in the LLM Gateway.

Overview

The gateway now includes comprehensive observability with:

Prometheus Metrics: Track HTTP requests, provider calls, token usage, and conversation operations
OpenTelemetry Tracing: Distributed tracing with OTLP exporter support
Enhanced Logging: Trace context correlation for log aggregation

Configuration

Add the following to your config.yaml:

observability:
  enabled: true  # Master switch for all observability features

  metrics:
    enabled: true
    path: "/metrics"  # Prometheus metrics endpoint

  tracing:
    enabled: true
    service_name: "llm-gateway"
    sampler:
      type: "probability"  # "always", "never", or "probability"
      rate: 0.1  # 10% sampling rate
    exporter:
      type: "otlp"  # "otlp" for production, "stdout" for development
      endpoint: "localhost:4317"  # OTLP collector endpoint
      insecure: true  # Use insecure connection (for development)
      # headers:  # Optional authentication headers
      #   authorization: "Bearer your-token"

Metrics

HTTP Metrics

http_requests_total - Total HTTP requests (labels: method, path, status)
http_request_duration_seconds - Request latency histogram
http_request_size_bytes - Request body size histogram
http_response_size_bytes - Response body size histogram

Provider Metrics

provider_requests_total - Provider API calls (labels: provider, model, operation, status)
provider_request_duration_seconds - Provider latency histogram
provider_tokens_total - Token usage (labels: provider, model, type=input/output)
provider_stream_ttfb_seconds - Time to first byte for streaming
provider_stream_chunks_total - Stream chunk count
provider_stream_duration_seconds - Total stream duration

Conversation Store Metrics

conversation_operations_total - Store operations (labels: operation, backend, status)
conversation_operation_duration_seconds - Store operation latency
conversation_active_count - Current number of conversations (gauge)

Example Queries

# Request rate
rate(http_requests_total[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# Tokens per minute by model
rate(provider_tokens_total[1m]) * 60

# Provider latency by model
histogram_quantile(0.95, rate(provider_request_duration_seconds_bucket[5m])) by (provider, model)

Tracing

Trace Structure

Each request creates a trace with the following span hierarchy:

HTTP GET /v1/responses
├── provider.generate or provider.generate_stream
├── conversation.get (if using previous_response_id)
└── conversation.create (to store result)

Span Attributes

HTTP spans include:

http.method, http.route, http.status_code
http.request_id - Request ID for correlation
trace_id, span_id - For log correlation

Provider spans include:

provider.name, provider.model
provider.input_tokens, provider.output_tokens
provider.chunk_count, provider.ttfb_seconds (for streaming)

Conversation spans include:

conversation.id, conversation.backend
conversation.message_count, conversation.model

Log Correlation

Logs now include trace_id and span_id fields when tracing is enabled, allowing you to:

Find all logs for a specific trace
Jump from a log entry to the corresponding trace in Jaeger/Tempo

Example log entry:

{
  "time": "2026-03-03T06:36:44Z",
  "level": "INFO",
  "msg": "response generated",
  "request_id": "74722802-6be1-4e14-8e73-d86823fed3e3",
  "trace_id": "5d8a7c3f2e1b9a8c7d6e5f4a3b2c1d0e",
  "span_id": "1a2b3c4d5e6f7a8b",
  "provider": "openai",
  "model": "gpt-4o-mini",
  "input_tokens": 23,
  "output_tokens": 156
}

Testing Observability

1. Test Metrics Endpoint

# Start the gateway with observability enabled
./bin/gateway -config config.yaml

# Query metrics endpoint
curl http://localhost:8080/metrics

Expected output includes:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/metrics",status="200"} 1

# HELP conversation_active_count Number of active conversations
# TYPE conversation_active_count gauge
conversation_active_count{backend="memory"} 0

2. Test Tracing with Stdout Exporter

Set up config with stdout exporter for quick testing:

observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "always"
    exporter:
      type: "stdout"

Make a request and check the logs for JSON-formatted spans.

3. Test Tracing with Jaeger

Run Jaeger with OTLP support:

docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 4317:4317 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

Update config:

observability:
  enabled: true
  tracing:
    enabled: true
    sampler:
      type: "probability"
      rate: 1.0  # 100% for testing
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"
      insecure: true

Make requests and view traces at http://localhost:16686

4. End-to-End Test

# Make a test request
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": "Hello, world!"
  }'

# Check metrics
curl http://localhost:8080/metrics | grep -E "(http_requests|provider_)"

# Expected metrics updates:
# - http_requests_total incremented
# - provider_requests_total incremented
# - provider_tokens_total incremented for input and output
# - provider_request_duration_seconds updated

5. Load Test

# Install hey if needed
go install github.com/rakyll/hey@latest

# Run load test
hey -n 1000 -c 10 -m POST \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","input":"test"}' \
  http://localhost:8080/v1/responses

# Check metrics for aggregated data
curl http://localhost:8080/metrics | grep http_request_duration_seconds

Integration with Monitoring Stack

Prometheus

Add to prometheus.yml:

scrape_configs:
  - job_name: 'llm-gateway'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana

Import dashboards for:

HTTP request rates and latencies
Provider performance by model
Token usage and costs
Error rates and types

Tempo/Jaeger

The gateway exports traces via OTLP protocol. Configure your trace backend to accept OTLP on port 4317 (gRPC).

Architecture

Middleware Chain

Client Request
    ↓
loggingMiddleware (request ID, logging)
    ↓
tracingMiddleware (W3C Trace Context, spans)
    ↓
metricsMiddleware (Prometheus metrics)
    ↓
rateLimitMiddleware (rate limiting)
    ↓
authMiddleware (authentication)
    ↓
Application Routes

Instrumentation Pattern

Providers: Wrapped with InstrumentedProvider that tracks calls, latency, and token usage
Conversation Store: Wrapped with InstrumentedStore that tracks operations and size
HTTP Layer: Middleware captures request/response metrics and creates trace spans

W3C Trace Context

The gateway supports W3C Trace Context propagation:

Extracts traceparent header from incoming requests
Creates child spans for downstream operations
Propagates context through the entire request lifecycle

Performance Impact

Observability features have minimal overhead:

Metrics: < 1% latency increase
Tracing (10% sampling): < 2% latency increase
Tracing (100% sampling): < 5% latency increase

Recommended configuration for production:

Metrics: Enabled
Tracing: Enabled with 10-20% sampling rate
Exporter: OTLP to dedicated collector

Troubleshooting

Metrics endpoint returns 404

Check observability.metrics.enabled is true
Verify observability.enabled is true
Check observability.metrics.path configuration

No traces appearing in Jaeger

Verify OTLP collector is running on configured endpoint
Check sampling rate (try type: "always" for testing)
Look for tracer initialization errors in logs
Verify observability.tracing.enabled is true

High memory usage

Reduce trace sampling rate
Check for metric cardinality explosion (too many label combinations)
Consider using recording rules in Prometheus

Missing trace IDs in logs

Ensure tracing is enabled
Check that requests are being sampled (sampling rate > 0)
Verify OpenTelemetry dependencies are correctly installed

8.5 KiB Raw Blame History