What I Learned from the Lambda Outage in us-east-1
Simple answer: not much. Long answer: AWS uses Lambda on the backend for a lot of its services.
Many of you probably saw my tweet in which I said a system I made relies on Lambda.
Let me provide some background: I crafted a simple Lambda function that's subscribed to an SNS topic, which publishes a message whenever there's an AWS-reported outage or degradation. The function then forwards that message to an internal Teams/Slack channel.
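For anyone curious what a forwarder like this might look like, here is a minimal sketch rather than my actual function. It assumes the standard SNS-to-Lambda event shape (`Records[].Sns.Subject` and `Records[].Sns.Message`) and an incoming webhook that accepts a simple `{"text": ...}` JSON body. The `WEBHOOK_URL` environment variable and the function names are hypothetical.

```python
import json
import os
import urllib.request

# Hypothetical environment variable holding the chat webhook URL.
WEBHOOK_URL_ENV = "WEBHOOK_URL"


def build_messages(event):
    """Turn an SNS-triggered Lambda event into a list of chat message texts."""
    messages = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        # Subject is optional on SNS messages, so fall back to a generic title.
        subject = sns.get("Subject") or "AWS status notification"
        body = sns.get("Message", "")
        messages.append(f"*{subject}*\n{body}")
    return messages


def post_to_webhook(url, text):
    """POST a simple {"text": ...} JSON payload to an incoming webhook."""
    data = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def handler(event, context):
    """Lambda entry point: forward every SNS record to the chat channel."""
    url = os.environ[WEBHOOK_URL_ENV]
    messages = build_messages(event)
    for text in messages:
        post_to_webhook(url, text)
    return {"forwarded": len(messages)}
```

Keeping the formatting in `build_messages` separate from the network call makes the function easy to test locally, and it sticks to the standard library so there is nothing extra to package.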
However, I hadn't thoroughly considered what would happen if Lambda itself experienced an outage in the us-east-1 region. Some might call this an oversight.
Even so, it didn't matter much. I managed to fail the function over to our secondary region so it could pick up any messages published from then on.
Many may ask, "Why don’t you run an active/active setup or implement a more complex solution?" It's crucial to remember that resiliency is driven by BUSINESS requirements. The system I created doesn't contribute to the business directly; it doesn't generate revenue and serves an informational purpose. Therefore, I chose a simple solution involving a manual failover. The cost of implementing a more complex solution wouldn't be justified given the small quality-of-life improvements it would provide.
I have said it before, and I'll say it again: The BUSINESS drives resiliency. Critical applications and systems need to have stricter requirements and automated failover processes, even an active/active setup for the most important applications. However, some people tend to overlook this and aim for maximum resiliency at all costs. While this is a noble goal, it comes with significant costs. Remember, everything is a tradeoff.
The second thing I learned from this outage is that numerous AWS services have some dependency on Lambda. Many users noted that their ability to log in was affected because the Security Token Service (STS) was also experiencing issues. It's a reasonable assumption that this was because STS relies on Lambda. I, like many others, did not realize how reliant we were on Lambda through these transitive dependencies. The full list of impacted services exceeded 100. Admittedly, I was somewhat surprised to see the number climb that high. It raises a question: we focus so much on resiliency and on spreading workloads across different services within a cloud provider to ensure availability, but how much does that buy us if they all rely on the same few core services? I'm still pondering what to do with this realization.
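To make the idea of transitive dependencies concrete, here is a small sketch that walks a dependency map and lists everything affected when one service fails. The map below is entirely made up for illustration; it is not AWS's real architecture, and the STS-to-Lambda edge only mirrors the assumption above.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDS_ON = {
    "login": ["sts"],
    "sts": ["lambda"],
    "notifier": ["sns", "lambda"],
    "sns": [],
    "lambda": [],
    "static-site": ["storage"],
    "storage": [],
}


def impacted_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed service.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service != failed and service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_by("lambda", DEPENDS_ON)))
```

In this toy map, `login` never calls `lambda` directly, yet it still shows up in the result because it goes through `sts`. That is the kind of hidden reliance the outage exposed.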
In any case, I wasn't fired, so that's good news.
That's a quick and easy freebie for today. I hope it sparked some thoughts and generated some useful discussions.