If you’re in Western Europe and work in the IT industry, you probably woke up to discover there were problems with AWS’s East Coast region (us-east-1). It didn’t take long for this information to hit major news sites across the globe, as end users reported increasing issues with services provided by big names such as Snapchat, Reddit, and several banking and gaming platforms.
AWS is one of the biggest players in the cloud computing industry. Its oldest region is located on the East Coast of the United States, in Northern Virginia, and is better known to AWS users as us-east-1.
AWS provides cloud-based services across the globe. Their presence is enormous, and they hold the largest market share in the industry, which inevitably means they host some of the biggest companies out there.
There are two sides to this. The first is the obvious one: several AWS services went offline.
The second, less visible side lies in how companies design their infrastructure and applications to be resilient — and there are some caveats in this particular scenario.
Before cloud computing became mainstream, organizations hosted their infrastructure and applications in physical data centers they owned and managed.
Hosting your own infrastructure comes at a cost, as you typically need to over-provision for busy periods. Although this approach is still viable and sometimes necessary, it becomes less attractive when there’s no strict requirement for a fully private setup — especially when you can save both time and money by using a cloud provider.
Moving to a cloud provider is also appealing because it allows you to achieve resilience by hosting your application in multiple data centers simultaneously, which we’ll explore in the next section.
Major cloud providers such as AWS advertise their services as elastic and multi-location redundant — if you design your infrastructure correctly.
Typically, a region (e.g., N. Virginia, known as us-east-1) contains multiple availability zones, each made up of one or more data centers, spread across that region. The idea is to use two or more availability zones so that if something goes wrong with one, your application can automatically fail over to another, ideally without downtime.
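To make the failover idea concrete, here is a minimal sketch in Python. It assumes a hypothetical `send` function that talks to one availability zone and raises `ConnectionError` when that zone is unreachable. In a real AWS setup this job is normally done by a load balancer and health checks rather than by application code, so treat the names below (`call_with_failover`, `fake_send`, `AllZonesFailed`) as illustrative only.

```python
from typing import Callable, Dict, Sequence


class AllZonesFailed(Exception):
    """Raised when no availability zone could serve the request."""


def call_with_failover(zones: Sequence[str], send: Callable[[str], str]) -> str:
    """Try each zone in order and return the first successful response."""
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return send(zone)
        except ConnectionError as exc:
            # Record the failure and move on to the next zone.
            errors[zone] = exc
    raise AllZonesFailed(f"no zone responded: {sorted(errors)}")


# Simulated backend (an assumption for this demo):
# us-east-1a is down, the other zones are healthy.
def fake_send(zone: str) -> str:
    if zone == "us-east-1a":
        raise ConnectionError(f"{zone} unreachable")
    return f"served from {zone}"


result = call_with_failover(["us-east-1a", "us-east-1b", "us-east-1c"], fake_send)
print(result)  # prints: served from us-east-1b
```

Note that this only helps when a single zone fails. If every zone in the list is unhealthy, as in a region-wide outage, the function raises `AllZonesFailed` and there is nothing left to fail over to.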
Beyond the built-in redundancy that cloud providers offer, there’s growing interest in what’s called a multi-cloud strategy.
As the name implies, multi-cloud means using multiple cloud providers. This can involve load balancing your applications across providers like AWS, Google Cloud, and Azure, or maintaining a replication layer between them so that if one fails, another can take over with minimal disruption.
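As a rough illustration of load balancing across providers, the sketch below picks a provider for each request according to a weight and skips any provider currently marked unhealthy. The provider names, the weights, and the `pick_provider` helper are all assumptions made for this example. A real setup would usually do this at the DNS or global load balancer level, and it would also need the data replication mentioned above, which this sketch does not cover.

```python
import random
from typing import Dict, Set


def pick_provider(weights: Dict[str, int], unhealthy: Set[str], rng: random.Random) -> str:
    """Pick a provider at random, proportionally to its weight, skipping unhealthy ones."""
    candidates = {
        name: weight
        for name, weight in weights.items()
        if name not in unhealthy and weight > 0
    }
    if not candidates:
        raise RuntimeError("no healthy provider available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


# Hypothetical traffic split: most traffic goes to AWS under normal conditions.
weights = {"aws": 70, "gcp": 20, "azure": 10}
rng = random.Random(42)  # seeded so the demo is repeatable

# Simulate an AWS outage: all traffic is shared between the remaining providers.
picks = [pick_provider(weights, {"aws"}, rng) for _ in range(1000)]
print({name: picks.count(name) for name in weights})
```

The routing logic is the easy part. The expensive part is making sure each provider can actually serve the request, which means keeping deployments, configuration, and data in sync across all of them.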
Although this sounds perfect on paper, the reality is that most companies — especially startups — don’t begin with a “gold-plated” infrastructure. They start small, focusing on delivering value quickly rather than over-engineering for rare disaster scenarios.
Over time, this leads to vendor lock-in with a single provider. While the idea of going multi-cloud often comes up in meetings, implementing it is expensive and complex — even with modern tools like Docker and Terraform.
At first glance, AWS appears to be at fault for failing to provide business-critical services to their clients, taking down large parts of the internet in the process.
But it’s not that simple. You could have hosted your services in multiple regions, or even across multiple cloud providers. However, AWS doesn’t offer a truly region-agnostic platform out of the box, and as mentioned earlier, multi-cloud setups are difficult and costly to maintain.
AWS’s redundancy model is based on availability zones, typically three or more per region. The issue with this outage was that the entire region was affected, so failing over to another availability zone in the same region didn't help. On top of that, users couldn’t scale their services because AWS temporarily throttled requests while fixing the underlying problem.
Some might argue that the fault lies with end users, since AWS offers global infrastructure and the option to host in multiple regions — for example, the unaffected us-west-1 region. However, designing applications to operate across multiple regions takes significant effort and investment, which is often unrealistic for startups or smaller teams trying to save costs while following the cloud provider’s best practices.
At the time of writing, AWS has yet to release a post-mortem. When they do, it’s worth reading for the full details.
However, as a web developer, you can take a few steps to build resilience, even if you’re not currently using multiple clouds or data centers.