This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.
I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!
Thank you!
One of the questions I frequently get is "do you automatically roll back?" And I have to hide in the corner and say "not really." Often, if you knew a rollback would work, you probably could also have known not to roll out in the first place. I've seen a lot of failures that only got worse when automation attempted to turn the thing off and on again.
Luckily from an automation roll-out standpoint, it's not that much harder to test in isolation. The harder parts to validate are things like "Does a Route 53 Failover Record really work in practice at the moment we actually need it to work?"
Usually the answer is yes, but then there's always the "but it too could be broken", and as you said, it's turtles all the way down.
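For context, here's roughly what that setup looks like (a sketch using the AWS SDK for JavaScript v3; the hosted zone ID, domain names, and thresholds below are placeholders, not our real configuration):

    import {
      Route53Client,
      CreateHealthCheckCommand,
      ChangeResourceRecordSetsCommand,
    } from "@aws-sdk/client-route-53";

    const route53 = new Route53Client({});

    // Health check against the primary region's endpoint.
    const healthCheck = await route53.send(new CreateHealthCheckCommand({
      CallerReference: `primary-hc-${Date.now()}`,
      HealthCheckConfig: {
        Type: "HTTPS",
        FullyQualifiedDomainName: "primary.api.example.com", // placeholder
        ResourcePath: "/health",
        RequestInterval: 10,
        FailureThreshold: 3,
      },
    }));

    // PRIMARY/SECONDARY failover records: Route 53 answers with the primary
    // only while its health check passes, otherwise with the secondary.
    await route53.send(new ChangeResourceRecordSetsCommand({
      HostedZoneId: "ZEXAMPLE123", // placeholder
      ChangeBatch: {
        Changes: [
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "api.example.com",
              Type: "CNAME",
              TTL: 60,
              SetIdentifier: "primary",
              Failover: "PRIMARY",
              HealthCheckId: healthCheck.HealthCheck!.Id,
              ResourceRecords: [{ Value: "primary.api.example.com" }],
            },
          },
          {
            Action: "UPSERT",
            ResourceRecordSet: {
              Name: "api.example.com",
              Type: "CNAME",
              TTL: 60,
              SetIdentifier: "secondary",
              Failover: "SECONDARY",
              ResourceRecords: [{ Value: "failover.api.example.com" }],
            },
          },
        ],
      },
    }));

The setup itself is simple; the hard part is knowing that the failover answer actually gets served the moment the primary really is unhealthy.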
The nice part is that, realistically, the automation for dealing with rollout and IaC is small and simple. We've split up our infrastructure to go with individual services, so each piece of infra is also straightforward.
In practice, our infra is less DRY and more repetitive, which has the benefit of avoiding the complexity that often comes from attempting to reduce code duplication. The ancillary benefit is that simple stuff changes less frequently, and less frequent changes mean less opportunity for issues.
Not surprisingly, most incidents come from changes humans make. The second-largest source of incidents is assumptions humans make about how a system operates in edge conditions. If you know these two things to be 100% true, you spend more time designing simple systems and avoiding changes as much as possible, unless they are absolutely required.
IaC is definitely a failure point, but the manual alternative is much worse! I've had a lot of benefit from using Pulumi, simply because the code can be more compact than the Terraform HCL was.
For example, for the failover regions (from the article) you could make a Pulumi function that parameterizes only the n things that are different per failover env and guarantee/verify the scripts are nearly identical. Of course, many people use modules/Terragrunt for similar reasons, but it ends up being quite powerful.
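Something along these lines (a rough sketch; the resource choices and the exact set of parameters are made up for illustration):

    import * as aws from "@pulumi/aws";

    // The only things that actually differ per failover environment.
    interface FailoverEnvArgs {
      region: aws.Region;
      instanceCount: number;
      certificateArn: string; // hypothetical parameter
    }

    function createFailoverEnv(name: string, args: FailoverEnvArgs) {
      // Pin every resource in this env to its region via an explicit provider.
      const provider = new aws.Provider(`${name}-provider`, { region: args.region });

      const artifacts = new aws.s3.Bucket(`${name}-artifacts`, {
        versioning: { enabled: true },
      }, { provider });

      // ...the rest of the stack is defined once and shared by every env...
      return { artifacts };
    }

    // Each env differs only in the handful of parameters passed in.
    createFailoverEnv("use2", { region: "us-east-2", instanceCount: 2, certificateArn: "arn:aws:acm:us-east-2:123456789012:certificate/example" });
    createFailoverEnv("euw1", { region: "eu-west-1", instanceCount: 1, certificateArn: "arn:aws:acm:eu-west-1:123456789012:certificate/example" });

Because the shared part is ordinary code, you can also unit-test that two envs produce the same resources modulo the parameters.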
I think some people are going to scream when I say this, but we're using mostly CloudFormation templates.
We don't use the CDK because it introduces complexity into the system.
However, to make CloudFormation usable, it is written in TypeScript and generates the templates on the fly. I know that sounds like the CDK, but given the size of our stacks, adding an additional technology in doesn't make things simpler, and there is a lot of waste that can be removed by using a programming language rather than raw JSON/YAML.
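As a very rough illustration of the shape of it (not our actual code; the service name and resource are hypothetical):

    // generate-template.ts: build the CloudFormation template as plain data,
    // then write it out. No CDK, no extra deploy-time machinery; the output is
    // an ordinary template that CloudFormation consumes directly.
    import { writeFileSync } from "node:fs";

    // Hypothetical parameters for a single service's stack.
    interface ServiceStackArgs {
      serviceName: string;
      memoryMb: number;
    }

    function buildTemplate(args: ServiceStackArgs) {
      return {
        AWSTemplateFormatVersion: "2010-09-09",
        Resources: {
          [`${args.serviceName}Function`]: {
            Type: "AWS::Lambda::Function",
            Properties: {
              FunctionName: args.serviceName,
              MemorySize: args.memoryMb,
              Runtime: "nodejs20.x",
              Handler: "index.handler",
              Role: { "Fn::ImportValue": `${args.serviceName}-role-arn` },
              Code: { ZipFile: "exports.handler = async () => ({ statusCode: 200 });" },
            },
          },
        },
      };
    }

    writeFileSync("template.json", JSON.stringify(buildTemplate({ serviceName: "TokenVerifier", memoryMb: 256 }), null, 2));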
There are cases where we have some OpenTofu, but for infrastructure resources that are customer-specific, we have deployments that are run in TypeScript using the AWS SDK for JavaScript.
It would be nice if we could make a single change and have it roll out everywhere. But the reality is that there are many more states in play than what is represented by a single state file. Especially when it comes to the interactions between our infra, our customers' configuration, the history of requests to change that configuration, and resources with mutable states.
One example of that is AWS certificates. They expire. We need them to expire. But expiring certs don't magically update state files or stacks. It's really bad to make assumptions about a customer's environment based on what we thought we knew the last time a change was rolled out.
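So for things like that, we ask the live environment rather than trusting the last deployment's view of it. Roughly (a sketch with the AWS SDK for JavaScript v3; the 30-day threshold is an arbitrary example):

    import { ACMClient, ListCertificatesCommand, DescribeCertificateCommand } from "@aws-sdk/client-acm";

    const acm = new ACMClient({});
    const soonMs = 30 * 24 * 60 * 60 * 1000; // flag anything expiring within 30 days

    const { CertificateSummaryList = [] } = await acm.send(new ListCertificatesCommand({}));
    for (const summary of CertificateSummaryList) {
      const { Certificate } = await acm.send(new DescribeCertificateCommand({ CertificateArn: summary.CertificateArn }));
      const notAfter = Certificate?.NotAfter;
      if (notAfter && notAfter.getTime() - Date.now() < soonMs) {
        // The live state, not the state file, decides whether action is needed.
        console.warn(`Certificate ${summary.DomainName} expires soon: ${notAfter.toISOString()}`);
      }
    }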
I actually like Terraform for its LACK of power (tho yeah, these days when I have a choice I use a lot of small states and orchestrate with Terragrunt).
Pulumi or CDK are for sure more powerful (and great tools) but when I need to reach for them I also worry that the infra might be getting too complex.
Agreed, it is much too easy to fall into bad habits. The whole goal of OpenTofu is declarative infrastructure. With CDK and Pulumi, it's very easy to end up in a place where you lose that.
But if you need to do something in a particular way, the tools should never be an obstacle.
IMO Pulumi and CDK are an opportunity to simplify your infra by capturing what you’re working with using higher-level abstractions and by allowing you to refactor and extract reusable pieces at any level. You can drive infra definitions easily from typed data structures, you can add conditionals using natural language syntax, and stop trying to program in a configuration language (Terraform HCL with surprises like non-short-circuited AND evaluation).
You still end up having IaC. You can still have a declarative infrastructure.
That's how we use CDK. Our CDK (in general) creates CloudFormation which we then deploy. As far as the tooling which we have for IaC is concerned, it's indistinguishable from hand-written CloudFormation — but we're able to declare our intent at a higher level of abstraction.
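In practice that looks something like the following (a simplified sketch, not our real stacks; the queue specs are made up). "cdk synth" then emits plain CloudFormation that the rest of our IaC tooling treats like any hand-written template:

    import * as cdk from "aws-cdk-lib";
    import { aws_sqs as sqs } from "aws-cdk-lib";

    // A typed data structure drives the infra definition; conditionals and loops
    // are ordinary TypeScript instead of configuration-language tricks.
    interface QueueSpec {
      name: string;
      fifo: boolean;
    }

    const queues: QueueSpec[] = [
      { name: "events", fifo: false },
      { name: "billing", fifo: true },
    ];

    const app = new cdk.App();
    const stack = new cdk.Stack(app, "MessagingStack");

    for (const spec of queues) {
      new sqs.Queue(stack, `${spec.name}Queue`, {
        queueName: spec.fifo ? `${spec.name}.fifo` : spec.name,
        fifo: spec.fifo ? true : undefined,
        retentionPeriod: cdk.Duration.days(4),
      });
    }

    app.synth(); // emits the CloudFormation templates into cdk.out/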
Absolutely, the best case is much better: safer, more readable, etc. However, the worst case is also worse. From the perspective of someone who provides DevOps support to multiple teams, Terraform is more "predictable".
If you do use Terraform, for the love of god do NOT use Terraform Cloud. It's up there with GitHub in the list of least reliable cloud vendors. I always have a "break glass" method of deploying from my work machine for that very reason.
Is there not an inherent risk using an AWS service (Route 53) to do the health check? Wouldn’t it make more sense to use a different cloud provider for redundancy?
If the check can't be done, then everything stays stable, so I'm guessing the question is, "What happens if Route 53 does the check and incorrectly reports the result?"
In that case, no matter what we are using there is going to be a critical issue. I think the best I could suggest at that point would be to have records in your zone that round robin different cloud providers, but that comes with its own challenges.
I believe there are some articles sitting around regarding how AWS plans for failure and how the fallback mechanism actually reduces load on the system rather than making it worse. I think it would require an in-depth investigation of the expected failover mode to have a good answer there.
For instance, just to make it more concrete, what sort of failure mode are you expecting to happen with the Route 53 health check? Depending on that there could be different recommendations.
Have you considered the scenario of "everything is so dead in AWS" that the check doesn't happen, plus the backends are dead too (this is assuming the backend services live in AWS as well)? But I'd guess in that case you'd know quickly enough from supplementary alerting (you guys don't seem the type to not have some sort of awesome monitoring in place), and you'd have a different/worse DR problem on your hands.
As for the OP's point though, I'm going to assume that the health checks need to stay within/from AWS, because third-party health checks could taint/dilute the point of the in-house AWS health check service to begin with.
I think there are two schools of thought on "AWS is totally dead everywhere":
* It is never going to happen, due to the way AWS is designed (or at least the way it is described to us, which explains why it is so hard to execute actions across regions).
* It will happen, but then everything else is going to be dead too, so what's the point?
One problem we've run into, related to "DNS is a single point of failure," is that there isn't a clear best strategy for "failover to a different cloud at the DNS routing level."
I'm not the foremost expert when it comes to ASNs and BGP, but from my understanding that would require some multi-cloud collaboration to get multiple CDNs to still resolve, something that feels like it would require both multiple levels of physical infrastructure and significant cost to actually implement correctly, compared to the ROI for our customers.
There's a corollary here for me, which is: stay as simple as possible while still achieving the result. Maybe there is a multi-cloud strategy, but the strategies I've seen still rely on having the DNS zone in one provider that fails over or round-robins specific infra in specific locations.
Third-party health checks have less of a problem with "tainting" and more that they just cause further complications: the more complexity you add to resolving your real state, the harder it is to get it right.
For instance, one thing we keep going back and forth on is "After the incident is over, is there a way for us to stay failed-over and not automatically fail back".
And the answer for us so far is "not really". There are a lot of bad options, which all could have catastrophic impacts if we don't get it exactly correct, and haven't come with significant benefits, yet. But I like to think I have an open mind here.
It's painful, but you can split your DNS across multiple providers. It's not usually done other than during migrations, but if you put two NS names from providerA and two from providerB, you'll get a mix of resolution (most high profile domains have 4 NS names; sometimes based on research/testing, sometimes based on cargo culting; I assume you want to fit in... but amazon.com has 8, and the DNS root and some high profile tlds have 13, so you do you :)). If either provider fails and stops responding, most resolvers will use the other provider. If one provider fails and returns bad data (including errors) or even can no longer be updated [1], the redundancy doesn't really help --- you probably went from a full outage that's easy to diagnose to a partial outage that's much harder to diagnose; and if both providers are equally reliable, you increased your chances of having an outage.
[1] But, it's DNS; the expectation is that some resolvers, hopefully very few of them, will cache data as if your TTL value were measured in days. IMHO, if you want to move all your traffic in a defined timeframe, DNS is not sufficient.
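If you do go down that road, it's worth continuously checking that both providers answer, and answer with the same records. A rough sketch using Node's built-in resolver (the nameserver IPs and hostname are placeholders):

    import { promises as dns } from "node:dns";

    // One representative nameserver IP per provider (placeholders).
    const providers = {
      providerA: "198.51.100.53",
      providerB: "203.0.113.53",
    };

    for (const [name, server] of Object.entries(providers)) {
      const resolver = new dns.Resolver();
      resolver.setServers([server]);
      try {
        const answers = await resolver.resolve4("api.example.com");
        console.log(`${name} answered:`, answers.sort());
      } catch (err) {
        // A provider that stops responding is survivable; one that answers
        // with stale or wrong data is the harder failure to catch.
        console.error(`${name} failed to resolve:`, err);
      }
    }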
Had the same thought, e.g. if things are really down, can it even do the check, etc.
Ask some friends and family if you can install an RPi on their home network that monitors your service.
Back in the day (10-12 years ago) at a telecom/cable company we accomplished this with F5 BIG-IP GSLB DNS (and later migrated to A10's GSLB-equivalent devices) as the authoritative DNS server for services/zones that required or were suitable for HA. (I can't totally remember, but I'm guessing we must have had a pretty low TTL for this.)
Had no idea that Route 53 had this sort of functionality
Speaking of F5 BIG-IP DNS devices, does anyone know of any authoritative DNS software solution for GSLB/health checking (I guess excluding Route 53 or other cloud/SaaS)? Last I looked, all I could find was the polaris-gslb add-on for PowerDNS, but the GitHub repo for that has had no activity in 8 years.
Maybe I should have titled the article "AWS Route53 HealthChecks are amazing" :)
Hey, I wrote that article!
I'll try to add comments and answer questions where I can.
- Warren
Hi Warren! I'm Chris, and I'm with AWS, where among other things, I work on the Well-Architected Framework. Would you be willing to talk with us? You can reach me at kozlowck@amazon.com. Thanks!
Edit: This is a fantastic write-up by the way!
Thank you!
> During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there
> [Our service can only go down] five minutes and 15 seconds per year.
I don't have much experience in this area, so please correct me if I'm mistaken:
Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
It's a good point.
We don't actually commit to running infrastructure in one specific AWS region. Customers can't request that the infra runs exactly in us-east-1, but they can request that it runs in "Eastern United States". The problem is that with scenarios that might require VPC peering or low-latency connections, we can't just run the infrastructure in us-east-2 and commit to never having a problem. For the same reason, what happens if us-east-2 were to have an incident?
We have to assume that our customers need it in a relatively close region, and at the same time we need to plan for the contingency that the region can be down.
Then there are the customer's users to think of as well. In some cases, those users might be globally dispersed, even if the customer's infrastructure is in only one major location. So while it would be nice to claim "well, you were also down at that moment", in practice the customer's users will notice, and realistically, we want to make sure we aren't impeding remediation on their side.
That is, even if a customer says "use us-east-1", and then us-east-1 is down, it can't look that way to the customer. This gets a lot more complicated when the services we are providing may be impacted differently. Consider us-east-1 DynamoDB being down while everything else is still working. Partial failure modes are much harder to deal with.
> Partial failure modes are much harder to deal with.
Truer words were never spoken.
Depends on what the SLA phrasing is - us-east-1 affinity is a requirement put forth by some customers so I would totally expect the SLA to specifically state it’s subject to us-east-1 availability. Essentially these customers are opting out of Authress’s fault-tolerant infrastructure and the SLA should be clear about that.
As TFA states, we have to offer services in that region because that's where some users are as well. However, the core of our services is not in that region. I have also suggested that, when the time comes for offering SLAs, there be explicit wording exempting us-east-1.
The bulk of the article discusses their failover strategy: how they detect failures in a region, how they route requests to a backup region, and how they deal with the data consistency and cost issues arising from that.
Interesting how engineers like to nerd out about SLAs, but never claim or issue credits when something does occur.
In the last decade, there has been at least one time where we did issue credits to our customers when there was a problem. Issuing credits back to our customers is a small compensation for any issue we're responsible for, and doing so is part of our Terms of Service.
I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200ms?
(For what it's worth, for some of my services, 200ms is certainly an impact; not as bad as a 2-second outage, but still noticeable and reportable.)
I think a lot of web services talk about reliability in terms of uptime (e.g. down for less than 5 minutes a year) but in reality operate on failure ratios (less than 0.001% of requests to our service fail).
Good catch. The truth is, while we track downtime for incident reporting, it's much more correct to actually be tracking the number of requests that result in a failure. Our SLAs are based on request volume, and not specifically time. Most customers don't have perfect sustained usage. Being down when they aren't running is irrelevant to everyone.
This is where grey failures come into play. It's really hard, often impossible, to know what the impact of an incident is to a customer without them telling you, even if you know you are having an incident.
In order to know that we are "down", the edge of our HTTP request handling would need to be able to track requests. For us that is CloudFront, but if there is an issue before that, at DNS, at the network level, etc., we just can't know what the actual impact is.
As far as measuring how we are down: we can pretty accurately know the list of failures that are happening (when we can know) and what the results are.
That's because most components are behind CloudFront in any case. And if CloudFront isn't having a problem, we'll have telemetry that tells us what the HTTP request/response status codes and connection completions look like. Then it's a matter of measuring from our first detection to the actual remediation being deployed (assuming there is one).
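For the curious, the CloudFront side of that is just CloudWatch metrics out of us-east-1. Something like this (a sketch; the distribution ID is a placeholder):

    import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

    // CloudFront publishes its metrics to CloudWatch in us-east-1.
    const cloudwatch = new CloudWatchClient({ region: "us-east-1" });

    const { Datapoints = [] } = await cloudwatch.send(new GetMetricStatisticsCommand({
      Namespace: "AWS/CloudFront",
      MetricName: "5xxErrorRate",
      Dimensions: [
        { Name: "DistributionId", Value: "E1EXAMPLE" }, // placeholder
        { Name: "Region", Value: "Global" },
      ],
      StartTime: new Date(Date.now() - 15 * 60 * 1000),
      EndTime: new Date(),
      Period: 60,
      Statistics: ["Average"],
    }));

    for (const point of Datapoints.sort((a, b) => +a.Timestamp! - +b.Timestamp!)) {
      console.log(point.Timestamp, `${point.Average?.toFixed(2)}% 5xx`);
    }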
Another thing that helps here is that we have multiple other products that also use Authress, and we can run technology in other regions that reports this information for those accounts (obviously it can't be for all customers). That can help us identify impact with additional accuracy, but it is often unnecessary.
This is a rare case where the original bait-y title is probably better than the de-bait-ified title, because the actual article is much less of a brag and much more of an actual case study.
Re-how'd, plus I've resisted the temptation to insert a comma that feels missing to me.
I spent a long time, trying to figure out, what the title of the article, should be. I'm terrible at SEO and generating click-bait titles, it is unfortunately, what, it, is.
You did fine! The title is clear. I was just being playful.
"How?! When AWS was down: we were not!"