It's 2 AM. Your phone buzzes. Users are reporting errors. The API you depend on is down.
Here's your step-by-step survival guide.
Step 1: Verify It's Actually Down (2 minutes)
Don't assume. Verify from multiple angles:
Check Independent Monitors
Visit apistatuscheck.com — if we're showing it down, it's confirmed.
Test from Different Networks
# Test from your server
curl -I https://api.stripe.com/v1/health
# Test from a different location
# Use a VPN or cloud shell
Check Their Status Page
- Stripe: status.stripe.com
- OpenAI: status.openai.com
- AWS: health.aws.amazon.com
- Twilio: status.twilio.com
⚠️ Warning: Status pages often lag 10-15 minutes behind reality.
Review Recent Changes
# Did YOU break something?
git log --since="2 hours ago" --oneline
# Check deployment logs
kubectl get events --sort-by='.lastTimestamp'
Common self-inflicted issues:
- Expired API keys
- Rate limit exceeded
- Network policy blocking outbound requests
- Certificate validation errors
Step 2: Immediate Response (3 minutes)
Activate Fallbacks
// Queue non-critical operations
const paymentQueue = new Queue('payments', REDIS_URL);
async function processPayment(data) {
try {
return await stripe.charges.create(data);
} catch (error) {
if (error.statusCode >= 500) {
// API is down - queue for later
await paymentQueue.add(data, {
attempts: 5,
backoff: { type: 'exponential', delay: 2000 }
});
return { status: 'queued' };
}
throw error;
}
}
Show User-Friendly Messages
// ❌ Bad
throw new Error('Internal server error');
// ✅ Good
return {
error: true,
message: 'Payment processing temporarily unavailable. Your order has been saved and will be processed shortly.',
userMessage: 'We\'re experiencing a brief delay. No action needed - we\'ll email you when complete.'
};
Switch to Backup Provider
class PaymentProcessor {
async charge(amount, customerId) {
// Try primary
try {
return await this.stripe.charge(amount, customerId);
} catch (error) {
console.warn('Stripe failed, trying Braintree');
return await this.braintree.charge(amount, customerId);
}
}
}
Step 3: Communication (5 minutes)
Internal Team Notification (Slack)
🚨 API OUTAGE ALERT
Service: Stripe API
Status: Down (confirmed via apistatuscheck.com)
Impact: Payment processing unavailable
Started: 2:03 AM PST
Affected: ~1,200 users attempting checkout
Actions Taken:
✅ Enabled payment queuing
✅ Displayed maintenance message
✅ Monitoring status page
Incident Commander: @sarah
War Room: #incident-stripe-outage
Customer Communication
For transactional impact:
Subject: Order Received - Processing Delayed
We've received your order and will process payment
automatically when our payment processor returns to
normal. You'll receive confirmation by email.
No action needed on your end.
Status Page Update:
🟡 Degraded Performance
Investigating Payment Processing Issues
Posted: 2:05 AM PST
We're experiencing issues with payment processing
due to a third-party service outage. Orders are
being queued and will process automatically.
Update (2:23 AM): Confirmed upstream issue.
Update (3:41 AM): Service restored. Processing queue.
🟢 Resolved (4:02 AM): All systems operational.
Step 4: Technical Mitigation
Serve Cached Data
const cache = new NodeCache({ stdTTL: 600 });
async function fetchWithStaleCache(key, fetchFn) {
const cached = cache.get(key);
if (cached) return cached;
try {
const fresh = await fetchFn();
cache.set(key, fresh);
return fresh;
} catch (error) {
// API down - serve stale cache
const stale = cache.get(key, { ignoreExpire: true });
if (stale) {
console.warn('Serving stale cache due to API outage');
return stale;
}
throw error;
}
}
Circuit Breaker
const CircuitBreaker = require('opossum');
const breaker = new CircuitBreaker(callExternalAPI, {
timeout: 3000,
errorThresholdPercentage: 50,
resetTimeout: 30000
});
breaker.fallback(() => getCachedData());
breaker.on('open', () => {
console.error('Circuit breaker opened - API is down');
notifyTeam('API circuit breaker opened');
});
// Usage
const result = await breaker.fire(requestParams);
Rate Limit Yourself
const Bottleneck = require('bottleneck');
const limiter = new Bottleneck({
minTime: 100, // 100ms between requests
maxConcurrent: 5
});
const rateLimitedFetch = limiter.wrap(fetch);
Step 5: Monitor Recovery
// Terminal 1: Watch API status
watch -n 10 'curl -s https://api.stripe.com/v1/health | jq'
// Terminal 2: Monitor error logs
tail -f /var/log/app/errors.log | grep "stripe"
// Terminal 3: Watch queue depth
watch -n 5 'redis-cli llen payment_queue'
Step 6: Post-Incident Review
Document Timeline
## Incident Report: Stripe API Outage
**Date:** Feb 6, 2026
**Duration:** 98 minutes (2:03 AM - 3:41 AM PST)
**Impact:** 1,247 affected users
### Timeline
- 2:03 AM: First alert
- 2:05 AM: Confirmed via apistatuscheck.com
- 2:07 AM: Payment queuing enabled
- 2:12 AM: Status page updated
- 3:41 AM: Service restored
- 4:02 AM: Queue processed
### What Went Well
✅ Fast detection (2 minutes)
✅ Payment queuing prevented data loss
✅ Clear customer communication
### What Didn't
❌ No automated failover
❌ Manual status page update
❌ Support team not briefed on queuing
### Action Items
- [ ] Implement Braintree as backup
- [ ] Automate status page updates
- [ ] Create support team runbook
Specific Playbooks
Stripe Down → Queue Payments
if (await isStripeDown()) {
// Show maintenance mode
res.render('checkout-maintenance');
// Queue payment
await paymentQueue.add({
customerId: user.id,
amount: cart.total,
orderId: order.id
});
// Email confirmation
await sendEmail(user.email, {
subject: 'Order Received',
body: 'Payment will process shortly'
});
}
OpenAI Down → Fallback to Anthropic
class AIService {
async complete(prompt) {
try {
return await openai.chat.completions.create({
model: 'gpt-4',
messages: [{ role: 'user', content: prompt }]
});
} catch (error) {
console.warn('OpenAI unavailable, switching to Anthropic');
return await anthropic.messages.create({
model: 'claude-3-5-sonnet-20241022',
messages: [{ role: 'user', content: prompt }]
});
}
}
}
Twilio Down → Backup SMS Provider
class SMSService {
async send(to, message) {
try {
return await twilio.messages.create({ to, body: message });
} catch (error) {
console.warn('Twilio failed, using SNS');
return await sns.publish({ PhoneNumber: to, Message: message });
}
}
}
Prevention: Build Resilience
1. Monitor Proactively
Set up monitoring before you need it:
- Sign up at apistatuscheck.com
- Get instant alerts when APIs go down
- Track 100+ services from one dashboard
2. Design for Failure
// Assume everything fails
class ResilientAPIClient {
constructor() {
this.maxRetries = 3;
this.timeout = 5000;
this.circuitBreaker = new CircuitBreaker();
}
async call(endpoint) {
return await this.circuitBreaker.fire(async () => {
for (let i = 0; i < this.maxRetries; i++) {
try {
const response = await fetch(endpoint, {
timeout: this.timeout
});
if (response.ok) return response.json();
if (response.status >= 500 && i < this.maxRetries - 1) {
await this.sleep(Math.pow(2, i) * 1000);
continue;
}
throw new Error(`HTTP ${response.status}`);
} catch (error) {
if (i === this.maxRetries - 1) throw error;
}
}
});
}
}
3. Practice Fire Drills
# Simulate API outage in staging
iptables -A OUTPUT -d api.stripe.com -j DROP
# Verify:
# - Alerts trigger
# - Fallbacks activate
# - Team follows runbook
# - Status page updates
# Restore
iptables -D OUTPUT -d api.stripe.com -j DROP
4. Document Everything
Create runbooks for each critical API:
## Stripe Outage Runbook
### Detection
- Monitor: apistatuscheck.com/stripe
- Alert: >5% error rate for 2 minutes
### Response
1. Verify: Check apistatuscheck + status.stripe.com
2. Enable queuing: `kubectl scale payment-queue --replicas=3`
3. Update status page
4. Notify team in #incidents
### Recovery
1. Monitor queue: `redis-cli llen payment_queue`
2. Process queued payments (automated)
3. Verify success rate
4. Update status: Resolved
Quick Reference
First 5 Minutes Checklist
- [ ] Confirm outage (apistatuscheck.com)
- [ ] Check official status page
- [ ] Activate fallbacks (queue/cache/backup)
- [ ] Display user-friendly messages
- [ ] Notify team (Slack)
- [ ] Update status page
- [ ] Monitor for recovery
Communication Templates
Save these for quick copy-paste during incidents.
Monitor Recovery
# Quick status check
while true; do
curl -s https://api.example.com/health \
&& echo "✅ UP" \
|| echo "❌ DOWN"
sleep 10
done
Key Takeaways
- Verify first - Don't assume it's the API
- Act fast - First 5 minutes matter
- Communicate clearly - Internal + external
- Have fallbacks - Queue, cache, backup providers
- Document everything - Post-mortems prevent repeats
- Build resilience - Circuit breakers, retries, monitoring
Stay Prepared
API outages are inevitable. Your response isn't.
Want to know about outages instantly?
Sign up for API Status Check — get real-time alerts when the APIs you depend on go down.
✅ 100+ services monitored 24/7
✅ Instant alerts (email, Slack, Discord, webhook)
✅ Historical uptime data
✅ Free tier available
Don't wait until 2 AM to figure this out.
Read the full guide: apistatuscheck.com/blog/api-outage-response-guide
Top comments (0)