Methodology
How AI Down measures AI provider availability — the AI Intelligence Availability Score (AIAS) and AI Health Index (AIHI).
Why Not Just "Up or Down"?
Traditional status pages report binary availability: the service is either up or down. AI services are more nuanced. An API can return 200 OK while delivering 10× normal latency, a model can be rate-limited to the point of unusability, or time-to-first-token can spike so high that streaming UX breaks — all while the status page says "Operational."
AIAS captures this spectrum. It produces a 0–100 score that reflects how available intelligence actually is from each provider, not just whether the endpoint responds.
Data Sources
Every AIAS score is built from three independent signal sources:
- Official Status Pages — We poll Atlassian Statuspage JSON APIs, AWS Health Dashboard, Azure RSS, GCP incident feeds, and BetterStack endpoints every 60 seconds.
- Synthetic API Probes — Lightweight inference requests sent through OpenRouter to 20+ models every 1–5 minutes, measuring end-to-end latency, time-to-first-token (TTFT), error rates, and rate limiting.
- Crowdsourced Reports — Users can report issues directly from each provider page, contributing a human signal before official acknowledgment.
The Six Dimensions
Each probe result is decomposed into six sub-scores, all normalized to 0–1 (1.0 = perfect, 0.0 = catastrophic):
1. Latency Score
Compares current end-to-end response time against the provider's own historical baseline for the current time-of-day. Being at the p50 baseline is normal (1.0); reaching p95 is concerning (0.6); beyond 3× p95 is catastrophic (0.0). This means a 5-second response from GPT-4o is normal, but the same from Groq signals severe degradation.
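The curve described above can be sketched as piecewise-linear interpolation between the three anchors (p50 → 1.0, p95 → 0.6, 3× p95 → 0.0). The function name and exact interpolation are illustrative, not the production implementation:

```python
def latency_score(latency: float, p50: float, p95: float) -> float:
    """Piecewise-linear latency score: 1.0 at or below the p50 baseline,
    0.6 at p95, falling to 0.0 at 3x p95 (illustrative reconstruction)."""
    if latency <= p50:
        return 1.0
    if latency <= p95:
        # Interpolate 1.0 -> 0.6 between p50 and p95
        return 1.0 - 0.4 * (latency - p50) / (p95 - p50)
    if latency <= 3 * p95:
        # Interpolate 0.6 -> 0.0 between p95 and 3x p95
        return 0.6 * (3 * p95 - latency) / (2 * p95)
    return 0.0
```

Because the baselines are per-provider, `latency_score(5.0, p50=4.0, p95=8.0)` (a slow provider) scores well while `latency_score(5.0, p50=0.3, p95=0.8)` (a fast provider like Groq) scores 0.0.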
2. Time-to-First-Token (TTFT) Score
Uses the same curve as the latency score, applied to streaming response initiation; for non-streaming endpoints it defaults to 1.0. Critical for chat and real-time applications where perceived responsiveness matters.
3. Error Score
Based on the rolling error rate across recent probes. The curve is nonlinear: ≤1% errors = healthy (1.0), 5% = noticeable (0.75), 15% = material (0.35), >50% = near-zero. The score also reacts asymmetrically: it burns down quickly when errors appear and recovers more slowly, so a 5% error rate costs far more than five times the penalty of 1%.
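One way to realize the nonlinear curve above is linear interpolation between anchor points. The anchors at 1%, 5%, and 15% come from the text; the values at 50% and 100% are assumptions for the "near-zero" tail:

```python
# (error_rate, score) anchors; the 0.50 and 1.0 entries are illustrative
# stand-ins for the "near-zero" region described in the text.
_ERROR_ANCHORS = [(0.0, 1.0), (0.01, 1.0), (0.05, 0.75),
                  (0.15, 0.35), (0.50, 0.05), (1.0, 0.0)]

def error_score(error_rate: float) -> float:
    """Nonlinear error score via linear interpolation between anchors."""
    for (x0, y0), (x1, y1) in zip(_ERROR_ANCHORS, _ERROR_ANCHORS[1:]):
        if error_rate <= x1:
            return y0 + (y1 - y0) * (error_rate - x0) / (x1 - x0)
    return 0.0
```

The asymmetric burn-down/recovery behavior would live in the smoothing layer (see Baselines & Temporal Smoothing), not in this static curve.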
4. Rate Limit Score
Measures 429 (Too Many Requests) response rate. A service can be technically "up" while refusing to serve most requests. Occasional 429s are normal (≤2%); 20%+ rate limiting means the service is barely available regardless of latency.
5. Status Score
Maps the provider's self-reported status to a score: Operational (1.0), Maintenance (0.85), Degraded (0.70), Partial Outage (0.45), Major Outage (0.15). This is a weak prior — providers often self-report "Operational" during actual degradation.
6. Confidence Score
How much we trust the data feeding this evaluation. Factors: number of contributing probes, probe node health, baseline maturity (cold start vs. learning vs. stable), and signal source confidence. Low confidence pulls the score toward 60 (uncertain, not catastrophic) rather than amplifying noise.
Score Composition
Sub-scores are combined using a gated geometric mean. Performance sub-scores are gated by worst-of — you can't average away an unusable TTFT or a flood of 429s.
- Performance Gate: min(Latency, TTFT) — bad TTFT can't hide behind good latency.
- Effective Performance: min(Performance, Rate Limit) — heavy 429s gate the score regardless of speed.
- Weighted Geometric Mean: Error (40%), Effective Performance (45%), Status (15%).
- Confidence Dampening: Final score is modulated by confidence. At full confidence the score is fully trusted; at zero confidence it pulls toward 60.
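The four composition steps above can be sketched in a few lines. The gates, weights (40/45/15), and the neutral anchor of 60 are taken from the text; the exact formula, epsilon handling, and function name are an illustrative reconstruction:

```python
def aias(latency: float, ttft: float, error: float,
         rate_limit: float, status: float, confidence: float) -> float:
    """Combine 0-1 sub-scores into a 0-100 AIAS value (sketch)."""
    perf = min(latency, ttft)            # performance gate
    eff_perf = min(perf, rate_limit)     # heavy 429s gate regardless of speed
    eps = 1e-6                           # avoid a zero base in the power terms
    # Weighted geometric mean: Error 40%, Effective Performance 45%, Status 15%
    core = (max(error, eps) ** 0.40 *
            max(eff_perf, eps) ** 0.45 *
            max(status, eps) ** 0.15) * 100
    # Confidence dampening: pull toward the neutral 60 as confidence drops
    return confidence * core + (1 - confidence) * 60
```

The geometric mean is the key design choice: any sub-score near zero drags the product toward zero, so a provider cannot offset a flood of errors with perfect latency the way an arithmetic mean would allow.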
Baselines & Temporal Smoothing
Every provider has its own historical baseline, segmented by time-of-day (24 one-hour buckets, rolling 14-day percentiles). This means "slow" is always relative to what's normal for that provider at that time of day.
Scores are smoothed using a time-decay Exponential Weighted Moving Average (EWMA) with a 15-minute half-life. This prevents single outlier probes from spiking the score while still surfacing real degradation within minutes.
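A time-decay EWMA with a fixed half-life reduces to a single update rule: the previous value's weight halves for every 15 minutes that elapse between probes. A minimal sketch (function name assumed):

```python
def ewma_update(prev: float, new: float, dt_seconds: float,
                half_life_seconds: float = 15 * 60) -> float:
    """Time-decay EWMA: the previous value's weight halves every
    `half_life_seconds` of elapsed time between observations."""
    decay = 0.5 ** (dt_seconds / half_life_seconds)
    return decay * prev + (1 - decay) * new
```

A single outlier probe 15 minutes after a healthy reading only moves the smoothed score halfway toward it, e.g. `ewma_update(100, 0, 900)` yields 50, while sustained degradation converges on the new level within a few half-lives.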
Tier Classification
The 0–100 AIAS score maps to four tiers:
| Tier | Score Range | Meaning |
|---|---|---|
| Fully Available | 90–100 | Intelligence responsive, fast, reliable |
| Degraded | 65–89 | Noticeable quality/speed reduction, still functional |
| Brownout | 35–64 | Technically "up" but materially impaired |
| Blackout | 0–34 | Effectively unavailable |
Tier transitions use hysteresis to prevent flickering: degrading requires 2 consecutive evaluations past the boundary, while recovering requires 4. The score must also cross the boundary by at least 3 points before a transition begins.
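The hysteresis rule can be modeled as a small state machine. The tier boundaries, consecutive-evaluation counts (2 down, 4 up), and 3-point margin follow the text; the class and its mechanics are an assumed sketch, not the production code:

```python
TIER_BOUNDS = [("Blackout", 0), ("Brownout", 35),
               ("Degraded", 65), ("Fully Available", 90)]

def raw_tier(score: float) -> str:
    """Margin-free tier lookup from the table above."""
    name = TIER_BOUNDS[0][0]
    for n, lower in TIER_BOUNDS:
        if score >= lower:
            name = n
    return name

class TierTracker:
    """Hysteresis sketch: a transition needs consecutive evaluations that
    clear the boundary by at least `margin` points (2 degrading, 4 recovering)."""
    def __init__(self, tier: str = "Fully Available", margin: float = 3.0):
        self.tier, self.margin = tier, margin
        self.candidate, self.streak = None, 0

    def update(self, score: float) -> str:
        order = [n for n, _ in TIER_BOUNDS]          # worst -> best
        target = raw_tier(score)
        if target == self.tier:
            self.candidate, self.streak = None, 0
            return self.tier
        degrading = order.index(target) < order.index(self.tier)
        # Nudge the score back toward the current tier; if that undoes the
        # transition, the boundary was not crossed by the full margin.
        adjusted = score + self.margin if degrading else score - self.margin
        if raw_tier(adjusted) == self.tier:
            self.candidate, self.streak = None, 0
            return self.tier
        self.streak = self.streak + 1 if target == self.candidate else 1
        self.candidate = target
        if self.streak >= (2 if degrading else 4):
            self.tier, self.candidate, self.streak = target, None, 0
        return self.tier
```

The asymmetry (2 evaluations to degrade, 4 to recover) makes the tracker quick to warn and conservative to declare recovery, which is the usual bias for monitoring systems.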
AI Health Index (AIHI)
The AIHI is the ecosystem-wide headline score — "Is AI intelligence generally available right now?" It aggregates individual AIAS provider scores using a trimmed mean (dropping the top and bottom 10%) to prevent any single provider from dominating. Only providers with fresh data (<10 minutes old) and stable baselines contribute.
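The trimmed mean itself is straightforward; a minimal sketch (freshness and baseline filtering assumed to happen before the scores list is built):

```python
def aihi(scores: list[float], trim: float = 0.10) -> float:
    """Trimmed mean of eligible provider AIAS scores: drop the top and
    bottom `trim` fraction, then average the rest (sketch)."""
    ranked = sorted(scores)
    k = int(len(ranked) * trim)          # providers dropped from each end
    kept = ranked[k:len(ranked) - k] if k else ranked
    return sum(kept) / len(kept)
```

With ten providers, one total outage (score 0) and one perfect score are both discarded, so the headline index reflects the middle eight rather than swinging on a single provider's incident.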
The Timeline page visualizes AIHI over time, showing intelligence availability trends and cascade events.
Supply Chain Correlation
Many AI providers share infrastructure (e.g., multiple AI labs run on Azure or AWS). Our Supply Chain Map tracks these dependencies. When a cloud provider experiences an outage, we automatically correlate downstream AI provider degradation, helping users understand whether an issue is provider-specific or part of a broader infrastructure event.
Limitations & Transparency
- Synthetic probes represent a single vantage point and workload type. Real-user performance may differ based on region, model, prompt complexity, and concurrency.
- Baselines require a 7-day learning period. During cold start, scores rely on conservative defaults and carry reduced confidence.
- Status page scraping depends on provider-published data, which may lag or underreport actual issues.
- AIAS measures availability of intelligence, not model quality. A provider can score 95 while producing lower-quality outputs if latency, errors, and rate limits are within normal parameters.
- Crowdsourced reports are weighted by recency and volume but are not independently verified.
Questions or Feedback?
AIAS is an evolving methodology. If you have questions, suggestions, or want to discuss the scoring approach, reach out to SecureCoders.