The Incident
Tuesday, 2:47 PM
Our customer support chatbot is handling 280 RPS. Everything’s fine.
2:53 PM
Traffic hits 310 RPS. Response times spike. Users start complaining in Slack.
2:58 PM
P99 latency reaches 18 seconds. Some requests time out completely.
3:05 PM
We manually restart LiteLLM. Traffic drops during the restart. Users are angry.
This happened three times that week.
What We Thought the Problem Was
- “Maybe we need more replicas”
- “Let’s add a load balancer”
- “Probably need better hardware”
We scaled horizontally and added three more LiteLLM instances.
Result
- Cost increased 4×
- Traffic hit 320 RPS
- The same issues appeared
- All instances struggled simultaneously
What the Problem Actually Was
LiteLLM is built on Python + FastAPI.
At low traffic (< 200 RPS), it works well.
Past 300 RPS, Python’s architecture becomes the bottleneck.
The Python Problem
- GIL (Global Interpreter Lock): Only one thread executes Python code at a time
- Async overhead: Event loop coordination adds latency
- Memory pressure: Heavy dependencies + long-running processes
- GC pauses: Garbage collection freezes request handling
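To make the GIL point concrete, here's a minimal experiment you can run yourself (a sketch, not our production code): a CPU-bound function executed four times sequentially and then across four threads finishes in roughly the same wall-clock time on CPython, because only one thread runs Python bytecode at a time.

```python
# Minimal GIL demonstration (sketch): CPU-bound work does not speed up with
# threads in CPython, because only one thread executes Python bytecode at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def busy(n: int) -> int:
    # Pure-Python CPU-bound loop; holds the GIL the entire time it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

start = time.perf_counter()
for _ in range(4):
    busy(N)
print(f"sequential: {time.perf_counter() - start:.2f}s")

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(busy, [N] * 4))
print(f"4 threads:  {time.perf_counter() - start:.2f}s  (roughly the same, not ~4x faster)")
```

Async I/O helps with network waits, but the per-request parsing, routing, and serialization a gateway does is exactly the kind of CPU work that stays pinned to one core.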
What We Observed at 350 RPS (Single Instance)
- CPU: 85% (one core maxed due to GIL)
- Memory: 3.2 GB → 5.1 GB → 6.8 GB (steadily climbing)
- Latency: 200 ms → 2 s → 12 s → timeout
- GC pauses: 100–300 ms every ~30 seconds
After 2 hours, memory reached 8 GB.
The process was killed by the OOM killer.
This isn’t a LiteLLM-specific issue.
It’s Python hitting its limits at high throughput.
How Bifrost Solves This
We needed production-grade infrastructure, not a prototype that breaks under load.
So we built Bifrost in Go, specifically for high-throughput LLM workloads.
It’s open source and MIT licensed.
Key Architectural Differences
- True Concurrency (No GIL)
Go’s goroutines execute in parallel across all CPU cores.
// Thousands of goroutines, truly parallel
go handleRequest(req1)
go handleRequest(req2)
go handleRequest(req3)
// All executing simultaneously
- Lightweight Concurrency
Go: 10,000 goroutines ≈ ~100 MB of memory
Python: 10,000 OS threads consume gigabytes of stack and scheduler overhead; async tasks are lighter but still share a single GIL-bound core
- Predictable Memory
Go's garbage collector is designed for low-latency systems:
- Concurrent GC (doesn’t stop the world)
- Predictable pause times (typically < 1 ms)
- No circular-reference memory leaks
- Native HTTP/2
- Built-in HTTP/2 support
- Request multiplexing
- No external dependencies
The Real-World Difference
We ran the same production workload through both gateways.
Test: Customer support chatbot, real user traffic
Load: 500 RPS sustained
| Metric | LiteLLM (3 × t3.xlarge instances) | Bifrost (1 × t3.large instance) |
| --- | --- | --- |
| P50 latency | 2.1 s | 230 ms |
| P99 latency | 23.4 s | 520 ms |
| Memory | 4–7 GB per instance (climbing) | 1.4 GB (stable) |
| Timeout rate | 8% | 0.1% |
| Cost | ~$450/month | ~$60/month |
| Stability | Restart required every 6–8 hours | 30+ days without restart |
Result
- 45× lower P99 latency
- 7× cheaper
- Actually stable
But Bifrost Isn’t Just About Performance
Rebuilding from scratch let us add production features LiteLLM doesn’t have.
1. Adaptive Load Balancing
Multiple API keys?
Bifrost continuously monitors:
- Latency
- Error rates
Traffic is then reweighted automatically in real time:
├─ Key 1: 1.2× weight (healthy)
├─ Key 2: 0.5× weight (high latency)
└─ Key 3: 1.0× weight (normal)
No manual intervention required.
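Conceptually, the reweighting works like the sketch below. This is our illustration in Python, not Bifrost's actual Go implementation: each key keeps rolling latency and error statistics, and selection probability follows the derived weight.

```python
# Conceptual sketch of latency/error-aware key weighting (illustration only).
import random

class KeyStats:
    def __init__(self):
        self.latencies = []   # recent request latencies in seconds
        self.errors = 0
        self.requests = 0

    def record(self, latency: float, ok: bool):
        self.latencies = (self.latencies + [latency])[-100:]  # keep the last 100 samples
        self.requests += 1
        self.errors += 0 if ok else 1

    def weight(self) -> float:
        if not self.requests:
            return 1.0                      # no data yet: treat the key as healthy
        avg_latency = sum(self.latencies) / len(self.latencies)
        error_rate = self.errors / self.requests
        # Penalize slow or failing keys; keep a small floor so no key is starved forever.
        return max(0.05, (1.0 / (1.0 + avg_latency)) * (1.0 - error_rate))

def pick_key(stats: dict[str, KeyStats]) -> str:
    # Weighted random selection: healthy keys get picked more often.
    keys = list(stats)
    weights = [stats[k].weight() for k in keys]
    return random.choices(keys, weights=weights, k=1)[0]
```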
2. Semantic Caching
Not exact-match caching — semantic similarity.
“How do I reset my password?”
“What’s the password reset process?”
The second query hits the cache.
For our support chatbot's traffic:
- Cache hit rate: 40%
- Cost savings: ~$1,200/month
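The underlying idea is embedding similarity rather than string equality. Here's a toy sketch of the concept (not Bifrost's implementation; `embed` is a placeholder for whatever embedding model you use):

```python
# Conceptual sketch of a semantic cache (illustration only).
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed            # callable: str -> list[float] (placeholder)
        self.threshold = threshold    # similarity required to count as a hit
        self.entries = []             # list of (embedding, cached_response)

    def get(self, prompt: str):
        query = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]            # a similar-enough question was answered before
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```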
3. Zero-Overhead Observability
Every request is logged with full context:
- Inputs / outputs
- Token usage
- Latency breakdown
- Cost per request
All logging is asynchronous and off the request path, with no measurable impact on latency.
Built-in dashboard.
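The pattern that keeps this cheap is moving the write off the request path. A minimal sketch of the idea (illustration only, not Bifrost's internals):

```python
# Conceptual sketch: request handlers enqueue a record in O(1); a background
# thread does the actual I/O, so logging never blocks request handling.
import json
import queue
import threading

log_queue = queue.Queue()  # records waiting to be written

def log_writer():
    while True:
        record = log_queue.get()
        with open("requests.log", "a") as f:   # in practice, batch and ship elsewhere
            f.write(json.dumps(record) + "\n")

threading.Thread(target=log_writer, daemon=True).start()

def record_request(prompt, response, latency_ms, tokens, cost):
    log_queue.put({
        "prompt": prompt,
        "response": response,
        "latency_ms": latency_ms,
        "tokens": tokens,
        "cost": cost,
    })
```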
4. Production-Grade Failover
Primary provider down?
Bifrost automatically fails over.
We’ve had OpenAI incidents where traffic switched to Anthropic automatically.
Users didn’t notice.
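The failover logic is conceptually simple: try providers in priority order and fall through on error. A minimal sketch (illustration only; real failover also needs health checks, timeouts, and backoff):

```python
# Conceptual sketch of provider failover (illustration only).
def call_with_failover(providers, prompt):
    # `providers` is an ordered list of callables (str -> str), primary first.
    last_error = None
    for call in providers:
        try:
            return call(prompt)
        except Exception as exc:      # in practice, catch provider-specific errors
            last_error = exc          # fall through to the next provider
    raise RuntimeError("all providers failed") from last_error
```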
Migration Was Surprisingly Easy
Expected: Days of refactoring
Actual: ~15 minutes
Step 1: Start Bifrost
docker run -p 8080:8080 -v $(pwd)/data:/app/data maximhq/bifrost
Step 2: Add API Keys
Visit: http://localhost:8080
Step 3: Change One Line in Code
Before
import openai
openai.api_key = "sk-..."
After
import openai
openai.api_base = "http://localhost:8080/openai"
openai.api_key = "sk-..."
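The snippet above uses the legacy (pre-1.0) OpenAI Python SDK. If you're on openai>=1.0, the equivalent is to pass base_url when constructing the client:

```python
# Equivalent setup for openai>=1.0 (the client-object API).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/openai",  # point the SDK at Bifrost
    api_key="sk-...",
)
```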
Step 4: Deploy
That’s it.
Bifrost is OpenAI-compatible.
If your code works with OpenAI, it works with Bifrost.
Supports LangChain, LlamaIndex, LiteLLM SDK, and more.
The Production Rollout
Week 1: 10% traffic
No issues
Latency down 60%
Week 2: 50% traffic
Still stable
Costs already dropping
Week 3: 100% migration
Shut down 2 of 3 LiteLLM instances
Performance better than ever
Three Months Later
Zero downtime incidents
Handling 800+ RPS during peaks
Monthly cost: $60 vs $450
No manual restarts
When to Use Each
Use LiteLLM if:
- You’re prototyping
- Traffic is < 100 RPS
- You need deep Python ecosystem integration
- You’re okay with manual scaling and monitoring
Use Bifrost if:
- You’re running production workloads
- Traffic > 200 RPS (or will be soon)
- You care about P99 latency
- You want predictable costs
- You’re tired of restarting your gateway
Try Bifrost
Open source (MIT). Run it locally in 30 seconds:
git clone https://github.com/maximhq/bifrost
cd bifrost
docker compose up
Visit http://localhost:8080, add your API keys, and point your app at Bifrost.
Benchmark It Yourself
cd bifrost/benchmarks
./benchmark -provider bifrost -rate 500 -duration 60
Compare with your current setup.
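If you'd rather measure from your own client code, a rough percentile harness looks like this. It's a sketch; `send_request` is a placeholder for whatever call hits your gateway (for example, an OpenAI-SDK chat completion pointed at Bifrost or at your current setup).

```python
# Minimal latency-percentile harness (illustration only).
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def measure(send_request, total=500, concurrency=50):
    def timed(_):
        start = time.perf_counter()
        send_request()                          # one request against your gateway
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(total)))

    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99)] * 1000,
    }

# Example: run measure(lambda: client.chat.completions.create(...)) once per
# gateway and compare the resulting dictionaries.
```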
The Bottom Line
LiteLLM breaking at ~300 RPS wasn’t a bug.
It was Python hitting its architectural limits.
We needed production-grade infrastructure.
So we built it — in Go — and open sourced it.
If you’re hitting scale issues with your LLM gateway, you’re not alone.
We hit them too.
Bifrost solved them. Might solve yours.
Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started
Docs: https://docs.getbifrost.ai
Repo: https://github.com/maximhq/bifrost
Built by the team at Maxim AI. We also build evaluation and observability tools for production AI systems.
Top comments (1)
This makes absolutely no sense. If it's a Python issue, adding more pods behind a load balancer will absolutely solve it, regardless of whether the language is the bottleneck. The GIL only matters within a single running Python process.
I'm not a huge fan of Python in general, but each new instance is basically independent of the others. If you add more pods and still see similar performance, the bottleneck is an external resource, such as Postgres or MLflow-related services. At that point it's an external bottleneck issue.
We're able to scale LiteLLM well beyond 300 RPS with the proper external resources, very cheaply.