Add rate limiting

2026-03-03 05:48:20 +00:00
parent 27dfe7298d
commit 119862d7ed
12 changed files with 648 additions and 2 deletions


@@ -61,6 +61,8 @@ latticelm (unified API)
**OAuth2/OIDC authentication** (Google, Auth0, any OIDC provider)
**Terminal chat client** (Python with Rich UI, PEP 723)
**Conversation tracking** (previous_response_id for efficient context)
**Rate limiting** (Per-IP token bucket with configurable limits)
**Health & readiness endpoints** (Kubernetes-compatible health checks)
## Quick Start
@@ -258,6 +260,54 @@ curl -X POST http://localhost:8080/v1/responses \
-d '{"model": "gemini-2.0-flash-exp", ...}'
```
## Production Features
### Rate Limiting
Per-IP rate limiting using a token bucket algorithm to prevent abuse and manage load:
```yaml
rate_limit:
  enabled: true
  requests_per_second: 10 # Max requests per second per IP
  burst: 20 # Maximum burst size
```
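To illustrate how `requests_per_second` and `burst` interact, here is a minimal token bucket sketch in Python. It is not the project's implementation; the `TokenBucket` class and its `allow` method are hypothetical names, and the injectable `now` argument exists only to make the behavior easy to check.

```python
import time
from typing import Optional, Tuple


class TokenBucket:
    """Illustrative token bucket: refills `rate` tokens/second, capped at `burst`."""

    def __init__(self, rate: float, burst: int, now: Optional[float] = None):
        self.rate = rate            # corresponds to requests_per_second
        self.burst = burst          # corresponds to burst
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> Tuple[bool, float]:
        """Return (allowed, seconds until the next token is available)."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(float(self.burst), self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True, 0.0
        # Not enough tokens: report how long until one full token has refilled.
        return False, (1.0 - self.tokens) / self.rate
```

With the values above, a client can send 20 requests back to back, after which requests are admitted at a sustained 10 per second. The second return value is one way a `Retry-After` value could be derived.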
Features:
- **Token bucket algorithm** for smooth rate limiting
- **Per-IP limiting** with support for X-Forwarded-For headers
- **Configurable limits** for requests per second and burst size
- **Automatic cleanup** of stale rate limiters to prevent memory leaks
- **429 responses** with a Retry-After header when limits are exceeded
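The per-IP keying and stale-entry cleanup listed above can be sketched as follows. This is a hedged illustration, not the project's code: `client_ip`, `LimiterRegistry`, and `max_idle` are hypothetical names, the choice of the first `X-Forwarded-For` entry is an assumption, and the peer address is assumed to be in `host:port` form.

```python
import time
from typing import Callable, Dict, Generic, Mapping, Optional, Tuple, TypeVar

T = TypeVar("T")


def client_ip(headers: Mapping[str, str], remote_addr: str) -> str:
    """Pick the rate-limit key: first X-Forwarded-For entry, else the peer address."""
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    if first:
        return first
    # Assumes remote_addr looks like "host:port"; drop the port.
    return remote_addr.rsplit(":", 1)[0]


class LimiterRegistry(Generic[T]):
    """Maps each client IP to its limiter and forgets IPs that have gone idle."""

    def __init__(self, factory: Callable[[], T], max_idle: float):
        self._factory = factory    # builds a fresh limiter for a new IP
        self._max_idle = max_idle  # seconds of inactivity before eviction
        self._entries: Dict[str, Tuple[T, float]] = {}

    def get(self, ip: str, now: Optional[float] = None) -> T:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(ip)
        limiter = entry[0] if entry is not None else self._factory()
        self._entries[ip] = (limiter, now)  # record last-seen time
        return limiter

    def cleanup(self, now: Optional[float] = None) -> int:
        """Remove entries idle for longer than max_idle; return how many were removed."""
        now = time.monotonic() if now is None else now
        stale = [ip for ip, (_, seen) in self._entries.items()
                 if now - seen > self._max_idle]
        for ip in stale:
            del self._entries[ip]
        return len(stale)
```

Note that `X-Forwarded-For` is set by the client unless a trusted proxy overwrites it, so keying on it is only meaningful when the gateway sits behind such a proxy.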
### Health & Readiness Endpoints
Kubernetes-compatible health check endpoints for orchestration and load balancers:
**Liveness endpoint** (`/health`):
```bash
curl http://localhost:8080/health
# {"status":"healthy","timestamp":1709438400}
```
**Readiness endpoint** (`/ready`):
```bash
curl http://localhost:8080/ready
# {
# "status":"ready",
# "timestamp":1709438400,
# "checks":{
# "conversation_store":"healthy",
# "providers":"healthy"
# }
# }
```
The readiness endpoint verifies:
- Conversation store connectivity
- At least one provider is configured

It returns 503 if any check fails.
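The readiness behavior described above can be sketched as a function that runs each check and derives the HTTP status from the results. This is an assumption-laden illustration: the `readiness` function, the `"unhealthy"` value, and the `"not_ready"` status string are hypothetical, since the source only shows the successful response.

```python
import time
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Run every check and return (http_status, response_body)."""
    results = {}
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that raises counts as a failure
        results[name] = "healthy" if ok else "unhealthy"
    ready = all(state == "healthy" for state in results.values())
    body = {
        "status": "ready" if ready else "not_ready",  # failure wording is assumed
        "timestamp": int(time.time()),
        "checks": results,
    }
    # 200 only when every check passes; 503 otherwise.
    return (200 if ready else 503), body
```

For example, calling `readiness({"conversation_store": ping_store, "providers": has_provider})` with two hypothetical check functions yields 200 only when both return `True`.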
## Next Steps
- ~~Implement streaming responses~~