Shib™ 🚀

Posted on Feb 6 • Originally published at apistatuscheck.com

What to Do When an API Goes Down: Your Incident Response Playbook

#api #devops #monitoring #incident

It's 2 AM. Your phone buzzes. Users are reporting errors. The API you depend on is down.

Here's your step-by-step survival guide.

Step 1: Verify It's Actually Down (2 minutes)

Don't assume. Verify from multiple angles:

Check Independent Monitors

Visit apistatuscheck.com. If we're showing it down, an independent monitor has confirmed the outage.

Test from Different Networks

# Test from your server
# (/v1/health is a placeholder; use an endpoint your integration actually calls)
curl -I https://api.stripe.com/v1/health

# Test from a different location
# Use a VPN or cloud shell

Check Their Status Page

Stripe: status.stripe.com
OpenAI: status.openai.com
AWS: health.aws.amazon.com
Twilio: status.twilio.com

⚠️ Warning: Status pages often lag 10-15 minutes behind reality.
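
Because any single signal can be wrong or late, one option is to require agreement between independent checks before declaring an outage. The sketch below is illustrative only: `confirmOutage`, the signal names, and the two-signal threshold are assumptions, not part of any monitoring product.

```javascript
// Hypothetical helper: decide whether an outage is confirmed from several
// independent checks. Each signal is { source: string, down: boolean }.
function confirmOutage(signals, minAgreeing = 2) {
  const reportingDown = signals.filter((s) => s.down).map((s) => s.source);
  return {
    confirmed: reportingDown.length >= minAgreeing,
    reportingDown,
  };
}

// Example: your own probe and an independent monitor agree, while the
// vendor status page has not caught up yet.
const verdict = confirmOutage([
  { source: 'own-server-curl', down: true },
  { source: 'independent-monitor', down: true },
  { source: 'vendor-status-page', down: false },
]);
console.log(verdict.confirmed); // true
```

Requiring two signals trades a little speed for fewer false alarms; a lagging status page alone does not veto the verdict.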

Review Recent Changes

# Did YOU break something?
git log --since="2 hours ago" --oneline

# Check deployment logs
kubectl get events --sort-by='.lastTimestamp'

Common self-inflicted issues:

Expired API keys
Rate limit exceeded
Network policy blocking outbound requests
Certificate validation errors
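
The list above can be turned into a rough first-pass triage. This is a sketch under assumptions: `classifyFailure` and its return labels are made up for illustration, and the mapping is a heuristic rather than a rule any provider guarantees. The error codes are standard Node.js network and TLS codes.

```javascript
// Hypothetical triage helper: guess whether a failure is likely on your
// side or the provider's. `status` is the HTTP status (if a response
// arrived) and `code` is a Node.js error code (if it never did).
const TLS_ERROR_CODES = new Set([
  'CERT_HAS_EXPIRED',
  'UNABLE_TO_VERIFY_LEAF_SIGNATURE',
  'DEPTH_ZERO_SELF_SIGNED_CERT',
]);

function classifyFailure({ status, code } = {}) {
  if (status === 401 || status === 403) return 'check-api-keys';
  if (status === 429) return 'rate-limited';
  if (status >= 500) return 'provider-error';
  if (TLS_ERROR_CODES.has(code)) return 'certificate-problem';
  if (code === 'ECONNREFUSED' || code === 'ETIMEDOUT' || code === 'ENOTFOUND') {
    return 'network-path'; // could be your network policy or their outage
  }
  return 'unknown';
}

console.log(classifyFailure({ status: 401 })); // check-api-keys
console.log(classifyFailure({ status: 503 })); // provider-error
```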

Step 2: Immediate Response (3 minutes)

Activate Fallbacks

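const Queue = require('bull'); // assuming the Bull queue library

// Queue payments that fail because the API is unavailable
const paymentQueue = new Queue('payments', REDIS_URL);

async function processPayment(data) {
  try {
    return await stripe.charges.create(data);
  } catch (error) {
    // 5xx means Stripe is failing; connection errors carry no statusCode
    const apiDown =
      error.statusCode >= 500 || error.type === 'StripeConnectionError';
    if (apiDown) {
      // API is down - queue for later
      await paymentQueue.add(data, {
        attempts: 5,
        backoff: { type: 'exponential', delay: 2000 }
      });
      return { status: 'queued' };
    }
    // Card declines and invalid requests are not outages; surface them
    throw error;
  }
}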
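
A queued payment may be retried several times, and a 5xx response does not prove the original charge was never created. A common safeguard is an idempotency key derived from something stable. The sketch below assumes each job carries an `orderId` and uses a hypothetical `createCharge` function standing in for the real client call; with stripe-node, the key would be passed as the `idempotencyKey` request option.

```javascript
const crypto = require('node:crypto');

// Derive the same key for every retry of the same order.
function idempotencyKeyFor(orderId) {
  return crypto.createHash('sha256').update(`charge:${orderId}`).digest('hex');
}

// `createCharge(data, options)` is a stand-in for the payment client call.
async function chargeOnce(createCharge, job) {
  const { orderId, ...chargeData } = job;
  return createCharge(chargeData, {
    idempotencyKey: idempotencyKeyFor(orderId),
  });
}
```

Because the key depends only on the order ID, a retry after recovery asks the provider for the same charge rather than a new one.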

Show User-Friendly Messages

// ❌ Bad
throw new Error('Internal server error');

// ✅ Good
return {
  error: true,
  message: 'Payment processing temporarily unavailable. Your order has been saved and will be processed shortly.',
  userMessage: 'We\'re experiencing a brief delay. No action needed - we\'ll email you when complete.'
};
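
One way to keep this consistent across endpoints is a single mapper from internal failures to user-facing responses. The function name, the `queued` flag, and the response fields below are assumptions for illustration; the point is that internal error details get logged, not returned.

```javascript
// Hypothetical mapper from an internal failure to a response that is safe
// to show to users. Internal details are logged, never returned.
function toUserResponse(error, { queued = false } = {}) {
  console.error('Upstream failure:', error.message);
  if (queued) {
    return {
      error: false,
      status: 'queued',
      userMessage:
        "We're experiencing a brief delay. No action needed - we'll email you when complete.",
    };
  }
  return {
    error: true,
    status: 'failed',
    userMessage:
      'Payment processing is temporarily unavailable. Please try again in a few minutes.',
  };
}
```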

Switch to Backup Provider

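class PaymentProcessor {
  async charge(amount, customerId) {
    // Try primary
    try {
      return await this.stripe.charge(amount, customerId);
    } catch (error) {
      // Fail over only on outages (5xx or no response). A declined card
      // or invalid request would fail on the backup provider too.
      const outage = error.statusCode === undefined || error.statusCode >= 500;
      if (!outage) throw error;
      console.warn('Stripe failed, trying Braintree');
      return await this.braintree.charge(amount, customerId);
    }
  }
}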

Step 3: Communication (5 minutes)

Internal Team Notification (Slack)

🚨 API OUTAGE ALERT

Service: Stripe API
Status: Down (confirmed via apistatuscheck.com)
Impact: Payment processing unavailable
Started: 2:03 AM PST
Affected: ~1,200 users attempting checkout

Actions Taken:
✅ Enabled payment queuing
✅ Displayed maintenance message
✅ Monitoring status page

Incident Commander: @sarah
War Room: #incident-stripe-outage
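
At 2 AM it helps if the alert is generated rather than typed. Here is a minimal sketch that builds the message above from an incident object; the function name and field names are assumptions, and posting the text to Slack is left out.

```javascript
// Hypothetical formatter for the internal alert. Field names are illustrative.
function formatIncidentAlert(incident) {
  const actions = incident.actionsTaken.map((a) => `✅ ${a}`).join('\n');
  return [
    '🚨 API OUTAGE ALERT',
    '',
    `Service: ${incident.service}`,
    `Status: ${incident.status}`,
    `Impact: ${incident.impact}`,
    `Started: ${incident.startedAt}`,
    `Affected: ${incident.affected}`,
    '',
    'Actions Taken:',
    actions,
    '',
    `Incident Commander: ${incident.commander}`,
    `War Room: ${incident.warRoom}`,
  ].join('\n');
}

const alertText = formatIncidentAlert({
  service: 'Stripe API',
  status: 'Down (confirmed via apistatuscheck.com)',
  impact: 'Payment processing unavailable',
  startedAt: '2:03 AM PST',
  affected: '~1,200 users attempting checkout',
  actionsTaken: ['Enabled payment queuing', 'Displayed maintenance message'],
  commander: '@sarah',
  warRoom: '#incident-stripe-outage',
});
console.log(alertText);
```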

Customer Communication

For transactional impact:

Subject: Order Received - Processing Delayed

We've received your order and will process payment
automatically when our payment processor returns to
normal. You'll receive confirmation by email.

No action needed on your end.

Status Page Update:

🟡 Degraded Performance

Investigating Payment Processing Issues
Posted: 2:12 AM PST

We're experiencing issues with payment processing
due to a third-party service outage. Orders are
being queued and will process automatically.

Update (2:23 AM): Confirmed upstream issue.
Update (3:41 AM): Service restored. Processing queue.
🟢 Resolved (4:02 AM): All systems operational.

Step 4: Technical Mitigation

Serve Cached Data

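const NodeCache = require('node-cache');

// node-cache cannot return expired entries, so keep two stores:
const cache = new NodeCache({ stdTTL: 600 });    // fresh for 10 minutes
const lastGood = new NodeCache({ stdTTL: 0 });   // 0 = never expires

async function fetchWithStaleCache(key, fetchFn) {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;

  try {
    const fresh = await fetchFn();
    cache.set(key, fresh);
    lastGood.set(key, fresh);
    return fresh;
  } catch (error) {
    // API down - serve the last known good value, if there is one
    const stale = lastGood.get(key);
    if (stale !== undefined) {
      console.warn('Serving stale cache due to API outage');
      return stale;
    }
    throw error;
  }
}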

Circuit Breaker

const CircuitBreaker = require('opossum');

const breaker = new CircuitBreaker(callExternalAPI, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

breaker.fallback(() => getCachedData());

breaker.on('open', () => {
  console.error('Circuit breaker opened - API is down');
  notifyTeam('API circuit breaker opened');
});

// Usage
const result = await breaker.fire(requestParams);
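
If the states behind a circuit breaker are unfamiliar, a stripped-down version makes them visible. This is a teaching sketch, not how opossum is implemented: it counts consecutive failures instead of using an `errorThresholdPercentage`, and the class name, options, and injectable clock are assumptions.

```javascript
// Minimal illustration of circuit breaker states (closed, open, half-open).
class SimpleBreaker {
  constructor(action, { failureThreshold = 3, resetTimeout = 30000, now = Date.now } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.now = now; // injectable clock, useful in tests
    this.state = 'closed';
    this.failures = 0;
    this.openedAt = 0;
  }

  async fire(...args) {
    if (this.state === 'open') {
      if (this.now() - this.openedAt < this.resetTimeout) {
        throw new Error('Circuit open: call skipped');
      }
      this.state = 'half-open'; // reset timeout elapsed: allow one trial call
    }
    try {
      const result = await this.action(...args);
      this.state = 'closed';
      this.failures = 0;
      return result;
    } catch (error) {
      this.failures += 1;
      // A failed trial call, or too many failures in a row, opens the circuit
      if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    }
  }
}
```

While the circuit is open, calls fail immediately without touching the struggling API, which is what keeps your own request threads from piling up.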

Rate Limit Yourself

const Bottleneck = require('bottleneck');

const limiter = new Bottleneck({
  minTime: 100, // 100ms between requests
  maxConcurrent: 5
});

const rateLimitedFetch = limiter.wrap(fetch);

Step 5: Monitor Recovery

# Terminal 1: Watch API status
watch -n 10 'curl -s https://api.stripe.com/v1/health | jq'

# Terminal 2: Monitor error logs
tail -f /var/log/app/errors.log | grep "stripe"

# Terminal 3: Watch queue depth
watch -n 5 'redis-cli llen payment_queue'
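
A single successful request during a flapping outage can look like recovery. One option is to require several consecutive successful probes before standing down. The helper below is a sketch: `waitForRecovery`, its option names, and the defaults are assumptions, and `probe` is any async function you supply that resolves to true or false.

```javascript
// Hypothetical recovery check: declare recovery only after several
// consecutive successful probes.
async function waitForRecovery(
  probe,
  { requiredSuccesses = 3, intervalMs = 10000, maxChecks = 360 } = {}
) {
  let streak = 0;
  for (let check = 1; check <= maxChecks; check++) {
    let ok = false;
    try {
      ok = Boolean(await probe());
    } catch {
      ok = false; // a throwing probe counts as a failed check
    }
    streak = ok ? streak + 1 : 0;
    if (streak >= requiredSuccesses) return { recovered: true, checks: check };
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return { recovered: false, checks: maxChecks };
}
```

Any failed probe resets the streak, so the function only reports recovery after an unbroken run of successes.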

Step 6: Post-Incident Review

Document Timeline

## Incident Report: Stripe API Outage

**Date:** Feb 6, 2026
**Duration:** 98 minutes (2:03 AM - 3:41 AM PST)
**Impact:** 1,247 affected users

### Timeline
- 2:03 AM: First alert
- 2:05 AM: Confirmed via apistatuscheck.com
- 2:07 AM: Payment queuing enabled
- 2:12 AM: Status page updated
- 3:41 AM: Service restored
- 4:02 AM: Queue processed

### What Went Well
✅ Fast detection (2 minutes)
✅ Payment queuing prevented data loss
✅ Clear customer communication

### What Didn't
❌ No automated failover
❌ Manual status page update
❌ Support team not briefed on queuing

### Action Items
- [ ] Implement Braintree as backup
- [ ] Automate status page updates
- [ ] Create support team runbook
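
The numbers in a report like this are easy to get wrong by hand. A small helper can derive them from the timeline. This sketch assumes same-day times in the `h:mm AM/PM` format used above; the function names are made up for illustration.

```javascript
// Convert a same-day clock time such as '2:03 AM' to minutes after midnight.
function toMinutes(clock) {
  const match = /^(\d{1,2}):(\d{2}) (AM|PM)$/.exec(clock);
  if (!match) throw new Error(`Unrecognized time: ${clock}`);
  const hours = (Number(match[1]) % 12) + (match[3] === 'PM' ? 12 : 0);
  return hours * 60 + Number(match[2]);
}

function minutesBetween(start, end) {
  return toMinutes(end) - toMinutes(start);
}

console.log(minutesBetween('2:03 AM', '3:41 AM')); // 98 (outage duration)
console.log(minutesBetween('2:03 AM', '2:05 AM')); // 2  (time to detect)
console.log(minutesBetween('3:41 AM', '4:02 AM')); // 21 (queue drain)
```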

Specific Playbooks

Stripe Down → Queue Payments

if (await isStripeDown()) {
  // Show maintenance mode
  res.render('checkout-maintenance');

  // Queue payment
  await paymentQueue.add({
    customerId: user.id,
    amount: cart.total,
    orderId: order.id
  });

  // Email confirmation
  await sendEmail(user.email, {
    subject: 'Order Received',
    body: 'Payment will process shortly'
  });
}

OpenAI Down → Fallback to Anthropic

class AIService {
  async complete(prompt) {
    const messages = [{ role: 'user', content: prompt }];
    try {
      const res = await openai.chat.completions.create({
        model: 'gpt-4',
        messages
      });
      return res.choices[0].message.content;
    } catch (error) {
      console.warn('OpenAI unavailable, switching to Anthropic');
      const res = await anthropic.messages.create({
        model: 'claude-3-5-sonnet-20241022',
        max_tokens: 1024, // required by the Anthropic Messages API
        messages
      });
      // The two SDKs return different shapes, so normalize to plain text
      return res.content[0].text;
    }
  }
}

Twilio Down → Backup SMS Provider

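class SMSService {
  async send(to, message) {
    try {
      return await twilio.messages.create({
        to,
        from: TWILIO_FROM_NUMBER, // required; your own number (assumed constant)
        body: message
      });
    } catch (error) {
      console.warn('Twilio failed, using SNS');
      // With AWS SDK v2, append .promise(); the v3 SNS client returns a Promise
      return await sns.publish({ PhoneNumber: to, Message: message });
    }
  }
}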
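
The three playbooks above share one shape: try the primary, and on failure try the backup. That shape can be factored out, along with a check that the failure actually looks like an outage. The sketch below is hypothetical: `withFallback` and `looksLikeOutage` are illustrative names, and the `statusCode` property is an assumption about how your client reports HTTP errors.

```javascript
// Hypothetical generic failover wrapper. `shouldFailover(error)` decides
// whether the failure looks like an outage (worth trying the backup) or a
// request-level problem (which the backup would not fix).
function withFallback(primary, secondary, shouldFailover) {
  return async (...args) => {
    try {
      return await primary(...args);
    } catch (error) {
      if (!shouldFailover(error)) throw error;
      return secondary(...args);
    }
  };
}

// Treat 5xx responses and errors without a status (network failures)
// as outages. This is a heuristic, not a guarantee.
const looksLikeOutage = (error) =>
  error.statusCode === undefined || error.statusCode >= 500;
```

Usage would look like `const send = withFallback(sendViaTwilio, sendViaSns, looksLikeOutage);`, where both sender functions are your own wrappers.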

Prevention: Build Resilience

1. Monitor Proactively

Set up monitoring before you need it:

Sign up at apistatuscheck.com
Get instant alerts when APIs go down
Track 100+ services from one dashboard

2. Design for Failure

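const CircuitBreaker = require('opossum');

// Assume everything fails
class ResilientAPIClient {
  constructor() {
    this.maxRetries = 3;
    this.timeout = 5000;
    // opossum wraps an action; fire(endpoint) passes endpoint through to it.
    // Breaker timeout must exceed the worst case of all retries (~18s here).
    this.circuitBreaker = new CircuitBreaker(
      (endpoint) => this.fetchWithRetry(endpoint),
      { timeout: 20000 }
    );
  }

  sleep(ms) {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }

  call(endpoint) {
    return this.circuitBreaker.fire(endpoint);
  }

  async fetchWithRetry(endpoint) {
    for (let i = 0; i < this.maxRetries; i++) {
      const lastAttempt = i === this.maxRetries - 1;
      let response;
      try {
        // fetch has no `timeout` option; abort via a signal instead
        response = await fetch(endpoint, {
          signal: AbortSignal.timeout(this.timeout)
        });
      } catch (error) {
        // Network error or timeout: back off, then retry
        if (lastAttempt) throw error;
        await this.sleep(Math.pow(2, i) * 1000);
        continue;
      }

      if (response.ok) return response.json();

      // Retry only server errors; a 4xx will not succeed on retry
      if (response.status >= 500 && !lastAttempt) {
        await this.sleep(Math.pow(2, i) * 1000);
        continue;
      }

      throw new Error(`HTTP ${response.status}`);
    }
  }
}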

3. Practice Fire Drills

# Simulate API outage in staging
iptables -A OUTPUT -d api.stripe.com -j DROP

# Verify:
# - Alerts trigger
# - Fallbacks activate
# - Team follows runbook
# - Status page updates

# Restore
iptables -D OUTPUT -d api.stripe.com -j DROP
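
Blocking traffic with iptables needs root and affects the whole host. For automated tests, a similar drill can be approximated in-process by wrapping the client call. The wrapper below is a hypothetical sketch: `withSimulatedOutage` and the `outage.active` switch are illustrative names, and the simulated error only mimics a refused connection.

```javascript
// Hypothetical fault injector for tests: while `outage.active` is true,
// calls fail the way an unreachable host would instead of reaching `fn`.
function withSimulatedOutage(fn, outage = { active: false }) {
  const wrapped = async (...args) => {
    if (outage.active) {
      const error = new Error('Simulated outage: connection refused');
      error.code = 'ECONNREFUSED';
      throw error;
    }
    return fn(...args);
  };
  wrapped.outage = outage; // flip wrapped.outage.active to start or end the drill
  return wrapped;
}
```

A test can flip `outage.active` to true, assert that the fallback path runs, then flip it back and assert that normal calls resume.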

4. Document Everything

Create runbooks for each critical API:

## Stripe Outage Runbook

### Detection
- Monitor: apistatuscheck.com/stripe
- Alert: >5% error rate for 2 minutes

### Response
1. Verify: Check apistatuscheck + status.stripe.com
2. Enable queuing: `kubectl scale deployment/payment-queue --replicas=3`
3. Update status page
4. Notify team in #incidents

### Recovery
1. Monitor queue: `redis-cli llen payment_queue`
2. Process queued payments (automated)
3. Verify success rate
4. Update status: Resolved
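
The detection rule in the runbook can be checked in code. The sketch below reads ">5% error rate for 2 minutes" as "more than 5% of requests in the trailing 2 minutes failed", which is one reasonable interpretation. The class name, the `minSamples` guard, and the defaults are assumptions.

```javascript
// Hypothetical tracker for the runbook's detection rule.
class ErrorRateTracker {
  constructor({ windowMs = 120000, threshold = 0.05, minSamples = 20 } = {}) {
    this.windowMs = windowMs;
    this.threshold = threshold;
    this.minSamples = minSamples; // avoid alerting on a handful of requests
    this.samples = []; // each entry: { at: epoch ms, ok: boolean }
  }

  record(ok, at = Date.now()) {
    this.samples.push({ at, ok });
  }

  shouldAlert(now = Date.now()) {
    // Drop samples that have aged out of the window
    this.samples = this.samples.filter((s) => now - s.at <= this.windowMs);
    if (this.samples.length < this.minSamples) return false;
    const failures = this.samples.filter((s) => !s.ok).length;
    return failures / this.samples.length > this.threshold;
  }
}
```

Call `record(true)` or `record(false)` after each upstream request and check `shouldAlert()` on a timer; the `minSamples` guard keeps one failed request at 3 AM from paging anyone.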

Quick Reference

First 5 Minutes Checklist

[ ] Confirm outage (apistatuscheck.com)
[ ] Check official status page
[ ] Activate fallbacks (queue/cache/backup)
[ ] Display user-friendly messages
[ ] Notify team (Slack)
[ ] Update status page
[ ] Monitor for recovery

Communication Templates

Save the Slack alert, customer email, and status page templates from Step 3 for quick copy-paste during incidents.

Monitor Recovery

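# Quick status check
# -f makes curl exit non-zero on HTTP 4xx/5xx, so error responses count as DOWN
while true; do
  curl -sf -o /dev/null --max-time 5 https://api.example.com/health \
    && echo "✅ UP" \
    || echo "❌ DOWN"
  sleep 10
done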

Key Takeaways

Verify first - Don't assume it's the API
Act fast - First 5 minutes matter
Communicate clearly - Internal + external
Have fallbacks - Queue, cache, backup providers
Document everything - Post-mortems prevent repeats
Build resilience - Circuit breakers, retries, monitoring

Stay Prepared

API outages are inevitable. Your response isn't.

Want to know about outages instantly?

✅ 100+ services monitored 24/7

✅ Instant alerts (email, Slack, Discord, webhook)

✅ Historical uptime data

✅ Free tier available

Don't wait until 2 AM to figure this out.

Read the full guide: apistatuscheck.com/blog/api-outage-response-guide
