A8065384/latticelm

Anibal Angulo 214e63b0c5 Add panic recovery and request size limit

2026-03-05 06:32:42 +00:00

Security Improvements - March 2026

This document summarizes the security and reliability improvements made to the go-llm-gateway project.

Issues Fixed

1. Request Size Limits (Issue #2) ✅

Problem: The server had no limits on request body size, making it vulnerable to DoS attacks via oversized payloads.

Solution: Implemented RequestSizeLimitMiddleware that enforces a maximum request body size.

Implementation Details:

Created internal/server/middleware.go with RequestSizeLimitMiddleware
Uses http.MaxBytesReader to enforce limits at the HTTP layer
Default limit: 10MB (10,485,760 bytes)
Configurable via server.max_request_body_size in config.yaml
Returns HTTP 413 (Request Entity Too Large) for oversized requests
Only applies to POST, PUT, and PATCH requests (not GET/DELETE)
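
The behavior listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: only the name RequestSizeLimitMiddleware and the use of http.MaxBytesReader come from this document, while readBodyHandler and statusFor are hypothetical helpers added for the demonstration. Note that http.MaxBytesReader does not reject a request up front; the limit takes effect when a handler reads past it, so the handler has to map that read error to a 413.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// RequestSizeLimitMiddleware caps the request body at maxBytes for methods
// that normally carry a body. The signature is an assumption.
func RequestSizeLimitMiddleware(maxBytes int64) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			switch r.Method {
			case http.MethodPost, http.MethodPut, http.MethodPatch:
				// Reads beyond maxBytes fail with *http.MaxBytesError.
				r.Body = http.MaxBytesReader(w, r.Body, maxBytes)
			}
			next.ServeHTTP(w, r)
		})
	}
}

// readBodyHandler is a hypothetical handler that reads the whole body and
// translates the size-limit error into HTTP 413.
func readBodyHandler(w http.ResponseWriter, r *http.Request) {
	_, err := io.ReadAll(r.Body)
	var tooLarge *http.MaxBytesError
	switch {
	case errors.As(err, &tooLarge):
		http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
	case err != nil:
		http.Error(w, "could not read body", http.StatusBadRequest)
	default:
		w.WriteHeader(http.StatusOK)
	}
}

// statusFor sends an n-byte body through the middleware and returns the
// resulting status code.
func statusFor(method string, n int, limit int64) int {
	h := RequestSizeLimitMiddleware(limit)(http.HandlerFunc(readBodyHandler))
	body := strings.NewReader(strings.Repeat("x", n))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, "/v1/responses", body))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(http.MethodPost, 10, 16)) // 200: under the limit
	fmt.Println(statusFor(http.MethodPost, 16, 16)) // 200: exactly at the limit
	fmt.Println(statusFor(http.MethodPost, 17, 16)) // 413: one byte over
	fmt.Println(statusFor(http.MethodGet, 17, 16))  // 200: GET is not limited
}
```

The sketch uses a 16-byte limit to keep the demonstration small; the gateway's default is 10,485,760 bytes.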

Files Modified:

internal/server/middleware.go (new file)
internal/server/server.go (added 413 error handling)
cmd/gateway/main.go (integrated middleware)
internal/config/config.go (added config field)
config.example.yaml (documented configuration)

Testing:

Comprehensive test suite in internal/server/middleware_test.go
Tests cover: small payloads, exact size, oversized payloads, different HTTP methods
Integration test verifies middleware chain behavior

2. Panic Recovery Middleware (Issue #4) ✅

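Problem: Panics in HTTP handlers were not recovered by the gateway. Go's net/http keeps such a panic from killing the process, but it only aborts the connection: the client gets no error response, and the failure is not logged with request context.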

Solution: Implemented PanicRecoveryMiddleware that catches panics and returns proper error responses.

Implementation Details:

Created PanicRecoveryMiddleware in internal/server/middleware.go
Uses a deferred function that calls recover() to catch panics raised on the request goroutine
Logs full stack trace with request context for debugging
Returns HTTP 500 (Internal Server Error) to clients
Positioned as the outermost middleware to catch panics from all layers
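
A minimal sketch of this pattern, under stated assumptions: the name PanicRecoveryMiddleware comes from this document, but the log format, the response body, and the statusAfterPanic helper are illustrative only. recover() has to be called directly inside the deferred function; a bare `defer recover()` does not stop a panic.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"runtime/debug"
)

// PanicRecoveryMiddleware converts a panic on the request goroutine into an
// HTTP 500 response. The logging format here is an assumption.
func PanicRecoveryMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			rec := recover()
			if rec == nil {
				return
			}
			// http.ErrAbortHandler is net/http's sentinel for deliberately
			// aborting a response, so it is passed through untouched.
			if rec == http.ErrAbortHandler {
				panic(rec)
			}
			log.Printf("panic recovered: %v method=%s path=%s\n%s",
				rec, r.Method, r.URL.Path, debug.Stack())
			http.Error(w, "internal server error", http.StatusInternalServerError)
		}()
		next.ServeHTTP(w, r)
	})
}

// statusAfterPanic runs a handler that panics with v behind the middleware
// and returns the status code the client would see.
func statusAfterPanic(v any) int {
	h := PanicRecoveryMiddleware(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		panic(v)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/boom", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusAfterPanic("string panic"))             // 500
	fmt.Println(statusAfterPanic(errors.New("error panic")))  // 500
	fmt.Println(statusAfterPanic(struct{ Code int }{Code: 42})) // 500
}
```

Two limits of this approach are worth keeping in mind: a panic in a goroutine started by a handler is not seen by the deferred function, and if the handler has already written response headers, the 500 status can no longer replace them.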

Files Modified:

internal/server/middleware.go (new file)
cmd/gateway/main.go (integrated as outermost middleware)

Testing:

Tests verify recovery from string panics, error panics, and struct panics
Integration test confirms panic recovery works through middleware chain
Logs are captured and verified to include stack traces

3. Error Handling Improvements (Bonus) ✅

Problem: Multiple instances of ignored JSON encoding errors could lead to incomplete responses.

Solution: Fixed all ignored json.Encoder.Encode() errors throughout the codebase.

Files Modified:

internal/server/health.go (lines 32, 86)
internal/server/server.go (lines 72, 217)

All JSON encoding errors are now logged with proper context including request IDs.
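
The kind of fix described here might look like the sketch below. The writeJSON helper, the X-Request-ID header as the source of the request ID, and the log format are assumptions made for illustration; the document does not show how the project obtains request IDs.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
)

// writeJSON is a hypothetical helper that checks the error returned by
// json.Encoder.Encode instead of discarding it.
func writeJSON(w http.ResponseWriter, r *http.Request, status int, v any) error {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	if err := json.NewEncoder(w).Encode(v); err != nil {
		// Assumption: the request ID travels in the X-Request-ID header.
		log.Printf("failed to encode JSON response: request_id=%s path=%s err=%v",
			r.Header.Get("X-Request-ID"), r.URL.Path, err)
		return err
	}
	return nil
}

// encodeFails reports whether writing v as JSON returns an error.
func encodeFails(v any) bool {
	req := httptest.NewRequest(http.MethodGet, "/health", nil)
	req.Header.Set("X-Request-ID", "req-123")
	return writeJSON(httptest.NewRecorder(), req, http.StatusOK, v) != nil
}

func main() {
	fmt.Println(encodeFails(map[string]string{"status": "ok"})) // false
	fmt.Println(encodeFails(make(chan int)))                    // true: channels cannot be encoded
}
```

Because the status line is sent before encoding starts, a failed encode cannot be turned into a different status code at this point; logging it is what makes the truncated response diagnosable.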

Architecture

Middleware Chain Order

The middleware chain is now (from outermost to innermost):

PanicRecoveryMiddleware - Catches panics raised by any inner layer
RequestSizeLimitMiddleware - Enforces body size limits
loggingMiddleware - Request/response logging
TracingMiddleware - OpenTelemetry tracing
MetricsMiddleware - Prometheus metrics
rateLimitMiddleware - Rate limiting
authMiddleware - OIDC authentication
routes - Application handlers

This order ensures:

Panics are caught from all middleware layers
Size limits are enforced before expensive operations
All requests are logged, traced, and metered
Security checks happen closest to the application
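
To make the outermost-to-innermost ordering concrete, here is a small sketch of how such a chain can be composed. The Chain and tag helpers are hypothetical, and the middlewares are stand-ins that only record their names; the document does not show how cmd/gateway/main.go actually wires the chain.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// Middleware wraps a handler with additional behavior.
type Middleware func(http.Handler) http.Handler

// Chain wraps h so that the first middleware in mws is the outermost one.
func Chain(h http.Handler, mws ...Middleware) http.Handler {
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// tag returns a placeholder middleware that records its name when a request
// enters it, standing in for the real middleware of the same name.
func tag(name string, trace *[]string) Middleware {
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			*trace = append(*trace, name)
			next.ServeHTTP(w, r)
		})
	}
}

// order sends one request through the chain and returns the entry order.
func order() string {
	var trace []string
	routes := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		trace = append(trace, "routes")
	})
	h := Chain(routes,
		tag("PanicRecovery", &trace),
		tag("RequestSizeLimit", &trace),
		tag("logging", &trace),
		tag("Tracing", &trace),
		tag("Metrics", &trace),
		tag("rateLimit", &trace),
		tag("auth", &trace),
	)
	h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	return strings.Join(trace, " > ")
}

func main() {
	// Prints the layers in the order a request passes through them.
	fmt.Println(order())
}
```

Since each layer calls the next one from inside its own handler, a panic anywhere further in unwinds back through the outermost layer, which is why panic recovery sits first.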

Configuration

Add to your config.yaml:

server:
  address: ":8080"
  max_request_body_size: 10485760  # 10MB in bytes (default)

To customize the size limit, set the value in bytes (the sizes below are binary megabytes, 1MB = 1,048,576 bytes):

1MB: 1048576
5MB: 5242880
10MB: 10485760 (default)
50MB: 52428800

If not specified, defaults to 10MB.
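
The default described above could be applied along these lines. This is a sketch under assumptions: the ServerConfig type, its field names, and the EffectiveMaxRequestBodySize method are hypothetical, and YAML parsing is left out; only the key name max_request_body_size and the default value come from this document.

```go
package main

import "fmt"

// defaultMaxRequestBodySize is 10 * 1024 * 1024 = 10485760 bytes.
const defaultMaxRequestBodySize int64 = 10 << 20

// ServerConfig mirrors the server section of config.yaml. The struct and
// its tags are assumptions, not the project's actual types.
type ServerConfig struct {
	Address            string `yaml:"address"`
	MaxRequestBodySize int64  `yaml:"max_request_body_size"`
}

// EffectiveMaxRequestBodySize returns the configured limit, falling back to
// the default when the field is unset or not positive.
func (c ServerConfig) EffectiveMaxRequestBodySize() int64 {
	if c.MaxRequestBodySize <= 0 {
		return defaultMaxRequestBodySize
	}
	return c.MaxRequestBodySize
}

func main() {
	unset := ServerConfig{Address: ":8080"}
	custom := ServerConfig{Address: ":8080", MaxRequestBodySize: 5242880}
	fmt.Println(unset.EffectiveMaxRequestBodySize())  // 10485760
	fmt.Println(custom.EffectiveMaxRequestBodySize()) // 5242880
}
```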

Testing

All new functionality includes comprehensive tests:

# Run all tests
go test ./...

# Run only middleware tests
go test ./internal/server -v -run "TestPanicRecoveryMiddleware|TestRequestSizeLimitMiddleware"

# Run with coverage
go test ./internal/server -cover

Test Coverage:

internal/server/middleware.go: 100% coverage
All edge cases covered (panics, size limits, different HTTP methods)
Integration tests verify middleware chain interactions

Production Readiness

These changes significantly improve production readiness:

DoS Protection: Request size limits mitigate memory exhaustion from oversized request bodies
Fault Tolerance: Panic recovery confines a handler panic to the failing request and returns a clean 500 response
Observability: All errors are logged with proper context
Configurability: Limits can be tuned per deployment environment

Remaining Production Concerns

While these issues are fixed, the following should still be addressed:

HIGH: Exposed credentials in .env file (must rotate the credentials and remove the file from git, including its history)
MEDIUM: Observability code has 0% test coverage
MEDIUM: Conversation store has only 27% test coverage
LOW: Missing circuit breaker pattern for provider failures
LOW: No retry logic for failed provider requests

See the original assessment for complete details.

Verification

Build and verify the changes:

# Build the application
go build ./cmd/gateway

# Run the gateway
./gateway -config config.yaml

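# Test with oversized payload (should return 413)
# The body is piped through stdin: an 11 MB string passed via -d "$(...)"
# exceeds the operating system's command-line argument size limit.
python3 -c 'print("{\"data\":\"" + "x"*11000000 + "\"}")' | \
  curl -i -X POST http://localhost:8080/v1/responses \
    -H "Content-Type: application/json" \
    --data-binary @-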

Expected response: HTTP 413 Request Entity Too Large
