Keeping your CX up when cloud providers fall down


Last week, a major Google Cloud Platform (GCP) outage disrupted services across the globe. Among the affected was Anthropic, whose systems went offline. Thankfully, our customers’ ticket processing remained uninterrupted. Our systems automatically failed over to another provider, ensuring continuous support for our customers (and their customers).
This incident underscores a critical point I’ve emphasized before: a key way that application-layer vendors add value is by providing customers with higher reliability than the raw infrastructural building blocks offer alone. This ensures continuity for end customers even as infrastructure continues to struggle to scale under fast-growing loads.
Resilience to upstream outages like these is something we have invested in and will continue to invest in as we scale. It's an expensive investment, but one that our customers need and deserve.
The fragility of single provider dependence
Relying solely on a single AI infrastructure provider is akin to putting all your eggs in one basket. Customers don’t see the backend complexities; they see a service that’s suddenly unavailable, leading to long wait times, frustration, and potential loss of trust. At the risk of stating the obvious, AI agents have a very different reliability profile from human ones. AI agents don’t call in sick or quit at short notice. But they can all go down at once if a single service fails, while human agents aren’t all going to call in sick on the same day.
Designing for resilience
Everything AI-related is growing so fast right now that the reliability of infrastructure providers and foundational models is suffering. For example, Anthropic has had 99.34% uptime over the last 90 days, significantly less than the 99.999% (or '5 9s' in tech lingo) reliability we’ve come to expect from technology providers.
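To put those percentages in concrete terms, here is a small back-of-the-envelope calculation. It assumes availability is measured over a continuous 90-day window; the function name is ours and purely illustrative.

```python
def downtime_hours(availability_pct: float, window_days: int = 90) -> float:
    """Hours of downtime implied by an availability percentage over a window."""
    return (1 - availability_pct / 100) * window_days * 24


observed = downtime_hours(99.34)     # the 90-day figure quoted above
five_nines = downtime_hours(99.999)  # the '5 9s' benchmark

print(f"99.34% uptime:  about {observed:.1f} hours of downtime in 90 days")
print(f"99.999% uptime: about {five_nines * 3600:.0f} seconds of downtime in 90 days")
```

In other words, the gap is roughly 14 hours of unavailability versus a little over a minute across the same period.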
That’s why at Lorikeet, we’ve architected our systems with redundancy at their core. Our AI agents are designed to handle complex, multi-step support requests, and they do so by leveraging a multi-provider infrastructure. This means that if one provider experiences issues, our systems seamlessly transition to another, ensuring that our clients’ support operations remain unaffected.
Our automated failovers rely on knowing - at the level of each LLM call - what the next best model is. We do this based on a robust abstraction framework and set of evals. We've made this investment because we're acutely aware of the trust our customers put in us, and need to ensure we honor it, instead of relying on an easy out like "Anthropic went down".
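As a rough illustration of what per-call failover can look like, here is a minimal sketch. It is not our production code: `ModelOption`, `ProviderError`, and `complete_with_failover` are hypothetical names, the provider calls are stubbed out, and the ranked list is assumed to come from whatever evals decide the next best model for a given call.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error a provider wrapper raises when its upstream is unavailable."""


@dataclass
class ModelOption:
    name: str                   # illustrative label, e.g. "primary"
    call: Callable[[str], str]  # takes a prompt, returns a completion


def complete_with_failover(prompt: str, ranked: Sequence[ModelOption]) -> tuple[str, str]:
    """Try each model in ranked order and return (model_name, completion)."""
    errors = []
    for option in ranked:
        try:
            return option.name, option.call(prompt)
        except ProviderError as exc:
            # Record the failure and move on to the next best model.
            errors.append(f"{option.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients.
def primary_down(prompt: str) -> str:
    raise ProviderError("upstream outage")


def fallback_up(prompt: str) -> str:
    return f"echo: {prompt}"


used, reply = complete_with_failover(
    "Where is my refund?",
    [ModelOption("primary", primary_down), ModelOption("fallback", fallback_up)],
)
print(used, "->", reply)
```

The point of the sketch is the shape of the decision: the ranking is made per call rather than per system, so one step of a multi-step request can fail over without disturbing the others.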
The broader implications
GCP wasn't on its own. In the last thirty days alone, we’ve seen outages from:
Cloudflare
OpenAI
IBM Cloud
Microsoft Azure
Pinecone
LangChain
If you're building your own solution, you will need to ensure it's robust against future outages like these, further increasing the cost of building versus buying.
Moving forward
As we continue to build out the Lorikeet platform, we won’t just focus on capabilities. We’ll maintain our deep investment in reliability. At the end of the day, our AI agents, no matter how advanced, are only as effective as the infrastructure supporting them.