At a glance
Forty-eight hours after an SLA miss, you're not asking for reassurance; you're building a case for whether to renew.
A root cause analysis that documents only what the vendor observed, not what caused it, tells you exactly where their visibility stops.
If their infrastructure monitoring covers application performance but nothing below it, that gap will show up in the next incident before it shows up in their answer.
Most SaaS SLA exclusions carve out conditions in the virtualization layer, which is often exactly where the incident started.
Forty-eight hours after the incident, you send the email. Professional tone, direct ask: explain what happened, what's changed, and why it won't happen again. It goes to their engineering lead, their account director, or directly to their CTO, depending on how much the miss cost you.
If their answer is a status page entry and a postmortem that says "we're monitoring the situation," you're in a negotiation.
Here's what you actually need answers to.
Start with what happened. You're not asking for reassurance. You need a document that your engineering team or executive sponsor can read to understand why your workflows were affected and for how long.
A complete root cause analysis covers the specific failure point, the detection timeline, what their team did to remediate it, and what's structurally different now. The gap to watch for: when the degradation started in the infrastructure layer rather than the application, their logs show symptoms, not causes.
They can document what they observed and what they tried. They can't document what they couldn't see, which is often where the incident actually started.
Asking what has changed is a commitment question, not a plan question. "We're looking at X" is a plan. "We changed Y and here's the evidence" is a commitment.
An honest answer from most vendors stops at the application layer: more caching, load shedding, circuit breakers. Those are real mitigations. They don't address performance variability that originates in the compute environment beneath the workload. On hyperscale infrastructure, that often means the virtualization layer, which their team doesn't control.
If you've been through a few of these, you know the difference. When their answer stops at the application layer, ask what's happening underneath it.
If their answer covers application performance monitoring and log aggregation but nothing at the infrastructure layer, that's a partial answer. Infrastructure monitoring on a hyperscale platform tells you what the application is experiencing. It doesn't tell you what's happening in the virtualization layer beneath it. That gap is where incidents like this one tend to come from.
The follow-up worth asking: what visibility do you have there?
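To make that question concrete, here is a minimal sketch of one signal a tenant can see from inside a Linux guest: CPU steal time, the share of time the hypervisor gave to something other than this VM. It assumes the vendor's workloads run on Linux VMs that expose the aggregate `cpu` line in `/proc/stat`; the function name and the sample values are hypothetical, and steal time is only a partial view of the virtualization layer, not a full answer.

```python
def cpu_steal_percent(before: str, after: str) -> float:
    """Percent of CPU time reported as 'steal' between two samples.

    Each argument is the aggregate "cpu" line from /proc/stat on Linux.
    """
    def fields(line: str) -> list[int]:
        parts = line.split()
        if not parts or parts[0] != "cpu" or len(parts) < 9:
            raise ValueError("expected an aggregate 'cpu' line with a steal field")
        return [int(value) for value in parts[1:]]

    first, second = fields(before), fields(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total


# Hypothetical samples taken a few seconds apart (values are made up).
sample_before = "cpu  1000 0 500 8000 100 0 50 50 0 0"
sample_after = "cpu  1400 0 700 8700 120 0 60 120 0 0"
steal = cpu_steal_percent(sample_before, sample_after)
print(f"steal: {steal:.1f}%")
```

A vendor who tracks a metric like this can at least say whether the compute beneath the application was contended during the incident. A vendor who doesn't is describing symptoms.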
When you go back to the SLA itself, what you're reading for isn't the uptime percentage. It's the exclusions section, specifically whether what just happened qualifies as a covered breach or an excluded event.
Most SaaS SLAs include standard infrastructure-provider carve-outs: force majeure, third-party service disruptions, and scheduled maintenance. Those carve-outs reflect the limits of what the vendor can see and control in their environment. Conditions in the virtualization layer or a provider's underlying network often fall into the excluded category, which means the miss may not trigger the remedy you expected.
That's worth knowing before the renewal conversation starts, not during it.
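The effect of an exclusion is easy to check with arithmetic. The sketch below compares the uptime you experienced with the uptime as the contract counts it. Every number is hypothetical: a 30-day month, a 99.9% commitment, and two outages, one of which falls under a carve-out. It also assumes excluded minutes are simply not counted as downtime; some contracts remove them from the measurement period as well, so check the wording of yours.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    excluded: bool  # True if a contract carve-out applies to this outage


def uptime_percent(period_minutes: float, downtime_minutes: float) -> float:
    """Uptime as a percentage of the measurement period."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


def assess(outages: list[Outage], period_minutes: float, committed_percent: float) -> dict:
    """Compare experienced uptime with uptime as the contract counts it."""
    total_down = sum(o.minutes for o in outages)
    counted_down = sum(o.minutes for o in outages if not o.excluded)
    contractual = uptime_percent(period_minutes, counted_down)
    return {
        "experienced_percent": uptime_percent(period_minutes, total_down),
        "contractual_percent": contractual,
        "breach": contractual < committed_percent,
    }


# Hypothetical month: 30 days, 99.9% committed uptime.
PERIOD_MINUTES = 30 * 24 * 60
outages = [
    Outage(minutes=90, excluded=True),   # attributed to the provider's virtualization layer
    Outage(minutes=20, excluded=False),  # application fault the vendor owns
]
result = assess(outages, PERIOD_MINUTES, committed_percent=99.9)
print(result)
```

In this made-up case you experienced roughly 99.75% uptime, well under the commitment, but the contract counts about 99.95%, so no remedy is triggered.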