Shib™ 🚀
Posted on • Originally published at apistatuscheck.com

What to Do When an API Goes Down: Your Incident Response Playbook

It's 2 AM. Your phone buzzes. Users are reporting errors. The API you depend on is down.

Here's your step-by-step survival guide.

Step 1: Verify It's Actually Down (2 minutes)

Don't assume. Verify from multiple angles:

Check Independent Monitors

Visit apistatuscheck.com — if we're showing it down, it's confirmed.

Test from Different Networks

# Test from your server
curl -I https://api.stripe.com/v1/health

# Test from a different location
# Use a VPN or cloud shell

Check Their Status Page

  • Stripe: status.stripe.com
  • OpenAI: status.openai.com
  • AWS: health.aws.amazon.com
  • Twilio: status.twilio.com

⚠️ Warning: Status pages often lag 10-15 minutes behind reality.

Review Recent Changes

# Did YOU break something?
git log --since="2 hours ago" --oneline

# Check deployment logs
kubectl get events --sort-by='.lastTimestamp'

Common self-inflicted issues:

  • Expired API keys
  • Rate limit exceeded
  • Network policy blocking outbound requests
  • Certificate validation errors
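
The checklist above can be turned into a first-pass triage function. This is a minimal sketch, assuming you have the HTTP status (if a response arrived) and the Node.js error code from the failed request. The function name `classifyFailure` and the category labels are illustrative and not part of any library.

```javascript
// Hypothetical helper: guess whether a failed call points at you or at the provider.
// `status` is the HTTP status code (undefined if no response arrived);
// `code` is a Node.js error code such as 'ENOTFOUND' or 'ETIMEDOUT'.
function classifyFailure({ status, code } = {}) {
  // Certificate problems are usually local: clock skew, stale CA bundle, a proxy.
  const certCodes = [
    'CERT_HAS_EXPIRED',
    'UNABLE_TO_VERIFY_LEAF_SIGNATURE',
    'DEPTH_ZERO_SELF_SIGNED_CERT',
  ];
  if (certCodes.includes(code)) return 'likely-self: certificate validation';

  if (status === 401 || status === 403) return 'likely-self: credentials (expired API key?)';
  if (status === 429) return 'likely-self: rate limit exceeded';
  if (status >= 400 && status < 500) return 'likely-self: bad request';

  if (status >= 500) return 'likely-upstream: server error';

  // No response at all: could be their outage or your own egress rules.
  const networkCodes = ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENOTFOUND', 'EAI_AGAIN'];
  if (networkCodes.includes(code)) return 'ambiguous: network (check outbound policy, then upstream)';

  return 'unknown';
}

console.log(classifyFailure({ status: 429 }));
console.log(classifyFailure({ status: 503 }));
console.log(classifyFailure({ code: 'ETIMEDOUT' }));
```

Treat the result as a hint for where to look first, not as a verdict: a 5xx can still be triggered by something you changed.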

Step 2: Immediate Response (3 minutes)

Activate Fallbacks

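// Queue payments that fail because the provider is unavailable
const Queue = require('bull'); // Bull-style queue API

const paymentQueue = new Queue('payments', REDIS_URL);

async function processPayment(data) {
  try {
    return await stripe.charges.create(data);
  } catch (error) {
    // 5xx means Stripe itself is failing; connection errors carry no status code
    const providerDown =
      error.statusCode >= 500 || error.type === 'StripeConnectionError';
    if (providerDown) {
      // API is down - queue for later
      await paymentQueue.add(data, {
        attempts: 5,
        backoff: { type: 'exponential', delay: 2000 }
      });
      return { status: 'queued' };
    }
    // Card declines and validation errors (4xx) go back to the caller
    throw error;
  }
}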
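
One risk with queuing after a 5xx is that the original request may have succeeded before the error came back, so the retry charges the customer twice. A common guard is an idempotency key. The sketch below derives a deterministic key from an order ID; the helper name `idempotencyKeyFor` and the `order.id` field are assumptions for illustration.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: the same order always produces the same key, so every
// retry of one order shares a key while different orders get different keys.
function idempotencyKeyFor(orderId) {
  return crypto.createHash('sha256').update(`charge:${orderId}`).digest('hex');
}

// Assumed usage with stripe-node, which takes per-request options as a second argument:
//   await stripe.charges.create(data, { idempotencyKey: idempotencyKeyFor(order.id) });

console.log(idempotencyKeyFor('order_123') === idempotencyKeyFor('order_123')); // true
console.log(idempotencyKeyFor('order_123') === idempotencyKeyFor('order_124')); // false
```

Store the key with the queued job so that every retry attempt sends the same value.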

Show User-Friendly Messages

// ❌ Bad
throw new Error('Internal server error');

// ✅ Good
return {
  error: true,
  message: 'Payment processing temporarily unavailable. Your order has been saved and will be processed shortly.',
  userMessage: 'We\'re experiencing a brief delay. No action needed - we\'ll email you when complete.'
};

Switch to Backup Provider

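class PaymentProcessor {
  async charge(amount, customerId) {
    // Try primary
    try {
      return await this.stripe.charge(amount, customerId);
    } catch (error) {
      // Fail over only when Stripe itself is failing. A declined card or
      // invalid request (4xx) must surface, not be retried on Braintree.
      if (error.statusCode < 500) throw error;
      console.warn('Stripe failed, trying Braintree');
      return await this.braintree.charge(amount, customerId);
    }
  }
}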

Step 3: Communication (5 minutes)

Internal Team Notification (Slack)

🚨 API OUTAGE ALERT

Service: Stripe API
Status: Down (confirmed via apistatuscheck.com)
Impact: Payment processing unavailable
Started: 2:03 AM PST
Affected: ~1,200 users attempting checkout

Actions Taken:
✅ Enabled payment queuing
✅ Displayed maintenance message
✅ Monitoring status page

Incident Commander: @sarah
War Room: #incident-stripe-outage

Customer Communication

For transactional impact:

Subject: Order Received - Processing Delayed

We've received your order and will process payment
automatically when our payment processor returns to
normal. You'll receive confirmation by email.

No action needed on your end.

Status Page Update:

🟡 Degraded Performance

Investigating Payment Processing Issues
Posted: 2:05 AM PST

We're experiencing issues with payment processing
due to a third-party service outage. Orders are
being queued and will process automatically.

Update (2:23 AM): Confirmed upstream issue.
Update (3:41 AM): Service restored. Processing queue.
🟢 Resolved (4:02 AM): All systems operational.

Step 4: Technical Mitigation

Serve Cached Data

// node-cache drops expired entries, so stale copies are kept in a separate Map
const NodeCache = require('node-cache');

const cache = new NodeCache({ stdTTL: 600 });
const lastKnownGood = new Map();

async function fetchWithStaleCache(key, fetchFn) {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;

  try {
    const fresh = await fetchFn();
    cache.set(key, fresh);
    lastKnownGood.set(key, fresh);
    return fresh;
  } catch (error) {
    // API down - serve the last successful response, even if expired
    if (lastKnownGood.has(key)) {
      console.warn('Serving stale cache due to API outage');
      return lastKnownGood.get(key);
    }
    throw error;
  }
}

Circuit Breaker

const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callExternalAPI, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => getCachedData());

breaker.on('open', () => {
  console.error('Circuit breaker opened - API is down');
  notifyTeam('API circuit breaker opened');
});

// Usage
const result = await breaker.fire(requestParams);
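
To make the states behind a circuit breaker concrete, here is a minimal hand-rolled sketch. It is not how opossum is implemented: it trips on consecutive failures rather than an error percentage, and it lets every call through while half-open. The class name `SimpleBreaker` and its options are hypothetical.

```javascript
// Minimal circuit breaker sketch (illustrative only).
// `now` is injectable so the state transitions can be tested without waiting.
class SimpleBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now;
    this.failures = 0;
    this.openedAt = null; // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.resetTimeout ? 'half-open' : 'open';
  }

  async fire(...args) {
    // Open: fail fast without touching the struggling API.
    if (this.state === 'open') throw new Error('Circuit open: call skipped');
    try {
      const result = await this.action(...args); // closed or half-open: try the call
      this.failures = 0;
      this.openedAt = null; // a success closes the circuit
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.openedAt = this.now(); // trip, or re-trip after a failed probe
      }
      throw error;
    }
  }
}
```

In production a maintained library is the safer choice; the sketch only shows why an open breaker protects both your latency and the provider's recovery.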

Rate Limit Yourself

const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  minTime: 100, // 100ms between requests
  maxConcurrent: 5
});

const rateLimitedFetch = limiter.wrap(fetch);

Step 5: Monitor Recovery

# Terminal 1: Watch API status
watch -n 10 'curl -s https://api.stripe.com/v1/health | jq'

# Terminal 2: Monitor error logs
tail -f /var/log/app/errors.log | grep "stripe"

# Terminal 3: Watch queue depth
watch -n 5 'redis-cli llen payment_queue'

Step 6: Post-Incident Review

Document Timeline

## Incident Report: Stripe API Outage

**Date:** Feb 6, 2026
**Duration:** 98 minutes (2:03 AM - 3:41 AM PST)
**Impact:** 1,247 affected users

### Timeline
- 2:03 AM: First alert
- 2:05 AM: Confirmed via apistatuscheck.com
- 2:07 AM: Payment queuing enabled
- 2:12 AM: Status page updated
- 3:41 AM: Service restored
- 4:02 AM: Queue processed

### What Went Well
✅ Fast detection (2 minutes)
✅ Payment queuing prevented data loss
✅ Clear customer communication

### What Didn't
❌ No automated failover
❌ Manual status page update
❌ Support team not briefed on queuing

### Action Items
- [ ] Implement Braintree as backup
- [ ] Automate status page updates
- [ ] Create support team runbook

Specific Playbooks

Stripe Down → Queue Payments

if (await isStripeDown()) {
  // Show maintenance mode
  res.render('checkout-maintenance');

  // Queue payment
  await paymentQueue.add({
    customerId: user.id,
    amount: cart.total,
    orderId: order.id
  });

  // Email confirmation
  await sendEmail(user.email, {
    subject: 'Order Received',
    body: 'Payment will process shortly'
  });
}
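
The playbook above calls `isStripeDown()` without defining it. One possible way to back it is a rolling error-rate window fed by your own request results. This is a sketch under assumed thresholds; `createOutageDetector` and its option names are hypothetical.

```javascript
// Hypothetical rolling-window detector; the default thresholds are illustrative.
function createOutageDetector({
  windowMs = 120000,   // look at the last 2 minutes
  threshold = 0.5,     // "down" when at least half of recent calls failed
  minSamples = 10,     // avoid deciding on too little traffic
  now = Date.now,      // injectable clock for testing
} = {}) {
  let samples = []; // each entry: { at: timestamp, ok: boolean }

  function prune() {
    const cutoff = now() - windowMs;
    samples = samples.filter((sample) => sample.at > cutoff);
  }

  return {
    // Call after every request to the provider.
    record(ok) {
      samples.push({ at: now(), ok });
      prune();
    },
    // True when enough recent calls exist and too many of them failed.
    isDown() {
      prune();
      if (samples.length < minSamples) return false;
      const failures = samples.filter((sample) => !sample.ok).length;
      return failures / samples.length >= threshold;
    },
  };
}

const stripeHealth = createOutageDetector();
const isStripeDown = async () => stripeHealth.isDown();
```

Call `stripeHealth.record(true)` or `stripeHealth.record(false)` wherever Stripe requests complete, so the detector reflects what your users are actually seeing.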

OpenAI Down → Fallback to Anthropic

class AIService {
  async complete(prompt) {
    try {
      return await openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{ role: 'user', content: prompt }]
      });
    } catch (error) {
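      console.warn('OpenAI unavailable, switching to Anthropic');
      // The two SDKs return differently shaped responses; normalize before use
      return await anthropic.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024, // required by the Anthropic Messages API
        messages: [{ role: 'user', content: prompt }]
      });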
    }
  }
}

Twilio Down → Backup SMS Provider

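class SMSService {
  async send(to, message) {
    try {
      // Twilio needs a sender; TWILIO_FROM_NUMBER is an assumed env var
      return await twilio.messages.create({
        to,
        from: process.env.TWILIO_FROM_NUMBER,
        body: message
      });
    } catch (error) {
      console.warn('Twilio failed, using SNS');
      // AWS SDK v2 style: publish() returns a promise only via .promise()
      return await sns.publish({ PhoneNumber: to, Message: message }).promise();
    }
  }
}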
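
The three playbooks share one shape: try the primary, then fall back. A small helper can capture that shape and keep the "when is it safe to fail over" decision in one place. The names `withFailover` and `isProviderOutage`, and the assumed error fields (`statusCode`, `code`), are illustrative.

```javascript
// Hypothetical helper: try `primary`, and call `backup` only for errors that
// `shouldFailover` accepts.
async function withFailover(primary, backup, shouldFailover = () => true) {
  try {
    return await primary();
  } catch (error) {
    // Errors such as a declined card or invalid input would fail on the backup too.
    if (!shouldFailover(error)) throw error;
    return backup(error);
  }
}

// Assumed error shape: an HTTP status on `statusCode`, or a Node.js network code on `code`.
const NETWORK_CODES = ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT', 'ENOTFOUND'];
const isProviderOutage = (error) =>
  error.statusCode >= 500 || NETWORK_CODES.includes(error.code);

// Example with stand-in functions instead of real SDK calls.
const failingPrimary = async () => {
  const error = new Error('Service Unavailable');
  error.statusCode = 503;
  throw error;
};
const workingBackup = async () => 'sent via backup';

withFailover(failingPrimary, workingBackup, isProviderOutage).then(console.log);
```

Passing a predicate keeps client-side errors from silently triggering a second provider, which matters most for payments.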

Prevention: Build Resilience

1. Monitor Proactively

Set up monitoring before you need it:

  • Sign up at apistatuscheck.com
  • Get instant alerts when APIs go down
  • Track 100+ services from one dashboard

2. Design for Failure

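// Assume everything fails
const CircuitBreaker = require('opossum');

class ResilientAPIClient {
  constructor() {
    this.maxRetries = 3;
    this.timeout = 5000;
    // opossum wraps a function, so the retry loop is the breaker's action.
    // Its own timeout is disabled because each attempt is timed individually.
    this.circuitBreaker = new CircuitBreaker(
      (endpoint) => this.fetchWithRetry(endpoint),
      { timeout: false, errorThresholdPercentage: 50, resetTimeout: 30000 }
    );
  }

  sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  call(endpoint) {
    return this.circuitBreaker.fire(endpoint);
  }

  async fetchWithRetry(endpoint) {
    for (let i = 0; i < this.maxRetries; i++) {
      const isLastAttempt = i === this.maxRetries - 1;
      let response;
      try {
        // fetch has no `timeout` option; abort via a signal instead
        response = await fetch(endpoint, {
          signal: AbortSignal.timeout(this.timeout)
        });
      } catch (error) {
        if (isLastAttempt) throw error;
        await this.sleep(Math.pow(2, i) * 1000);
        continue;
      }

      if (response.ok) return response.json();

      // Retry only 5xx responses; a 4xx will not succeed on retry
      if (response.status < 500 || isLastAttempt) {
        throw new Error(`HTTP ${response.status}`);
      }
      await this.sleep(Math.pow(2, i) * 1000);
    }
  }
}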
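
The client above backs off by fixed exponential steps. When many clients retry on the same schedule, they can hit a recovering API in synchronized waves, so a common refinement is to add random jitter. The sketch below uses the "full jitter" variant; `backoffDelay` and its option names are hypothetical.

```javascript
// Exponential backoff with full jitter: pick a random delay between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelay(attempt, { baseMs = 1000, capMs = 30000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Show the ceiling and one sampled delay for the first four attempts.
// The sampled values differ on every run.
for (let attempt = 0; attempt < 4; attempt++) {
  const ceiling = Math.min(30000, 1000 * 2 ** attempt);
  console.log(`attempt ${attempt}: up to ${ceiling} ms, picked ${backoffDelay(attempt)} ms`);
}
```

The cap keeps a long outage from producing multi-minute sleeps, and the randomness spreads retries out once the provider comes back.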

3. Practice Fire Drills

# Simulate API outage in staging
iptables -A OUTPUT -d api.stripe.com -j DROP

# Verify:
# - Alerts trigger
# - Fallbacks activate
# - Team follows runbook
# - Status page updates

# Restore
iptables -D OUTPUT -d api.stripe.com -j DROP

4. Document Everything

Create runbooks for each critical API:

## Stripe Outage Runbook

### Detection
- Monitor: apistatuscheck.com/stripe
- Alert: >5% error rate for 2 minutes

### Response
1. Verify: Check apistatuscheck + status.stripe.com
2. Enable queuing: `kubectl scale deployment/payment-queue --replicas=3`
3. Update status page
4. Notify team in #incidents

### Recovery
1. Monitor queue: `redis-cli llen payment_queue`
2. Process queued payments (automated)
3. Verify success rate
4. Update status: Resolved

Quick Reference

First 5 Minutes Checklist

  • [ ] Confirm outage (apistatuscheck.com)
  • [ ] Check official status page
  • [ ] Activate fallbacks (queue/cache/backup)
  • [ ] Display user-friendly messages
  • [ ] Notify team (Slack)
  • [ ] Update status page
  • [ ] Monitor for recovery

Communication Templates

Save these for quick copy-paste during incidents.
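
If copy-paste feels too slow at 2 AM, the internal alert can be generated from a few fields. A minimal sketch, assuming the incident details are already collected in an object; `formatOutageAlert` and its field names are hypothetical.

```javascript
// Hypothetical formatter for the internal Slack alert; field names are illustrative.
function formatOutageAlert({ service, impact, startedAt, affected, actions = [], commander, channel }) {
  return [
    '🚨 API OUTAGE ALERT',
    '',
    `Service: ${service}`,
    `Impact: ${impact}`,
    `Started: ${startedAt}`,
    `Affected: ${affected}`,
    '',
    'Actions Taken:',
    ...actions.map((action) => `✅ ${action}`),
    '',
    `Incident Commander: ${commander}`,
    `War Room: ${channel}`,
  ].join('\n');
}

console.log(formatOutageAlert({
  service: 'Stripe API',
  impact: 'Payment processing unavailable',
  startedAt: '2:03 AM PST',
  affected: '~1,200 users attempting checkout',
  actions: ['Enabled payment queuing', 'Displayed maintenance message'],
  commander: '@sarah',
  channel: '#incident-stripe-outage',
}));
```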

Monitor Recovery

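# Quick status check
# -f makes curl exit non-zero on HTTP errors (4xx/5xx), not only on connection failures
while true; do
  curl -sf -o /dev/null https://api.example.com/health \
    && echo "✅ UP" \
    || echo "❌ DOWN"
  sleep 10
done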

Key Takeaways

  1. Verify first - Don't assume it's the API
  2. Act fast - First 5 minutes matter
  3. Communicate clearly - Internal + external
  4. Have fallbacks - Queue, cache, backup providers
  5. Document everything - Post-mortems prevent repeats
  6. Build resilience - Circuit breakers, retries, monitoring

Stay Prepared

API outages are inevitable. Your response isn't.

Want to know about outages instantly?

Sign up for API Status Check — get real-time alerts when the APIs you depend on go down.

✅ 100+ services monitored 24/7

✅ Instant alerts (email, Slack, Discord, webhook)

✅ Historical uptime data

✅ Free tier available

Don't wait until 2 AM to figure this out.


Read the full guide: apistatuscheck.com/blog/api-outage-response-guide
