The Biggest Technical Mistake I Ever Made (Millions of dollars wasted)
AWS and I got this wrong, very wrong.
Ever made a mistake that turned out to have consequences in the millions of dollars? Well, I have, and I'm going to tell you about it so you never make a similar one. To be fair, my team of principal AWS solutions architects is just as guilty, but someone has to own the blame.
Let me take you back to 2017, when the grass was greener and we were designing a data lake on AWS. We were writing out specs and requirements with our stakeholders and our AWS account team. We thought leveraging AWS's expertise and our bright minds would be enough. I already spilled the beans: it wasn't enough.
Data Lake Mistakes
The goal was to have a single place where all data is stored: breaking down data silos, enabling data services, and protecting our data. Empowering our turbo data scientists to create profitable insights. (On a tangent, being a data scientist is a great gig and you can easily overemploy, earning mid six figures, by following on Substack. Tangent over.) As a financial services firm, the number one priority was security: preventing data loss, exfiltration, and leakage.

The biggest issue when planning for data lakes is scale. Your cloud provider will tell you that object storage is virtually infinitely scalable and you have nothing to worry about. They're right, but that doesn't paint the whole picture. The decision we needed to make was whether to build a single-account data lake on AWS or a multi-account one. We had initially estimated the size of our data, the number of users, and our security requirements, and we came to the conclusion that a single account would greatly increase our agility without impacting security. This turned out to be all wrong. So let's get into why.
Security Mistakes
Even though we wanted a single account, we still wanted to implement controls to ensure data was not exfiltrated or otherwise compromised. One control we didn't think hard enough about was data segmentation. For us that meant separating data by application; for your org it could mean something else. It turns out to be very difficult to implement micro-segmentation inside a single account, because you end up on a never-ending IAM treadmill where each policy grows extremely long and convoluted. Soon you have so much tech debt in your IAM policies that they turn into a black hole no one understands.
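To give a flavor of what that treadmill looks like, here's a rough, hypothetical sketch of per-application prefix segmentation inside one bucket. The bucket, prefixes, and role names are made up, not our actual setup; the point is that every onboarded application means another statement grafted onto an already long policy.

```python
import json

# Hypothetical per-application segmentation inside ONE account/bucket.
# Every onboarded app needs its own statement, so the policy keeps growing.
APP_PREFIXES = {
    "app-payments": "payments/",
    "app-risk": "risk/",
    # ...hundreds more over time
}

def build_app_statement(bucket: str, app: str, prefix: str) -> dict:
    """Bucket-policy statement scoping one app's role to its own prefix."""
    return {
        "Sid": f"{app}-allow-own-prefix",
        "Effect": "Allow",
        # Hypothetical role naming convention.
        "Principal": {"AWS": f"arn:aws:iam::111111111111:role/{app}-role"},
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
    }

policy = {
    "Version": "2012-10-17",
    "Statement": [
        build_app_statement("edl-raw-data", app, prefix)
        for app, prefix in APP_PREFIXES.items()
    ],
}
print(json.dumps(policy, indent=2))  # grows linearly with every new app
```

And it's not just sprawl: bucket policies are capped at 20 KB, so at some point you physically can't keep adding statements.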
As a consequence, the frustrations with IAM and segmentation in a single account led to teams either accidentally or deliberately messing with data sets that weren't theirs. That meant data getting mixed, or data being manipulated that shouldn't have been. We have this wonderful thing called the Sarbanes-Oxley Act (SOX), which will put you squarely up Shit Creek if financial data is manipulated. So teams that didn't even know they were touching data in scope for SOX could have accidentally done something with real consequences. Our data lake was supposed to be our source of truth and system of record, so SOX data manipulation was a big no-no.
Lastly, perhaps the biggest oversight was creating a blast radius that was disastrous. In the security world, blast radius essentially means: if X is compromised, what else can be compromised, and what is the potential damage? When you have a single account with limited segmentation controls, the blast radius gets too big for comfort. If our EDL account was compromised, that meant ALL of our data, including business data, could be compromised. In 2023 that is such a huge risk it could financially ruin a firm. Due to the challenges with segmentation and IAM, there was no easy way to reduce our blast radius. Talk about a conversation that will make your CTO very unhappy; trust me on that one.
Engineering Mistakes
The number of issues that came out of the single-account design was totally unexpected. For us these were unknown unknowns, but I really think AWS should have caught them. The #1 issue by FAR was service limits. We ran into so many different service limits that it was almost unbelievable. Yes, S3 is infinitely scalable; you know what isn't? The number of S3 buckets per account. Yes, it's listed as a soft limit of 100 on the Service Quotas page, but let me tell you, AWS wouldn't raise it above 500 for us. This was a problem because we set out to give each application its own bucket in our AWS data lake. The problem became pretty clear when we hit about 400 onboarded applications, plus a large chunk of buckets used for operational and account-level infrastructure. We were quickly running out of room, and our protections at the bucket level were our main line of defense for several key security controls. So that turned out pretty poorly, fairly quickly.
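If you're planning something similar, it's worth tracking your bucket count against the quota before it sneaks up on you. A minimal sketch with boto3, where the 100-bucket figure is the documented default soft limit and the warning threshold is just an assumption; whatever ceiling AWS will actually grant you is a conversation with your account team:

```python
import boto3

# Rough headroom check: how close is this account to the S3 bucket quota?
# The default quota is 100 buckets; any raised ceiling is account-specific.
DEFAULT_BUCKET_QUOTA = 100

s3 = boto3.client("s3")
bucket_count = len(s3.list_buckets()["Buckets"])

print(f"{bucket_count} buckets in use "
      f"({bucket_count / DEFAULT_BUCKET_QUOTA:.0%} of the default quota)")
if bucket_count > 0.8 * DEFAULT_BUCKET_QUOTA:
    print("Time to talk to your account team, or rethink bucket-per-application.")
```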
Another big service quota we hit was around AWS Batch, which was a key service for our data operations and pipelining. From memory I believe the quota is a queue depth of 100 for Batch, which we were hitting very often given the number of applications consuming data services.
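A small monitoring sketch along the same lines, so you notice when a queue is creeping toward whatever depth limit applies to your account. The queue name and the thresholds are hypothetical, purely for illustration:

```python
import boto3

# Count jobs waiting in a Batch queue before they start running.
# Queue name and the 100-job depth are assumptions for illustration.
batch = boto3.client("batch")
QUEUE = "edl-etl-queue"

waiting = 0
paginator = batch.get_paginator("list_jobs")
for status in ("SUBMITTED", "PENDING", "RUNNABLE"):
    for page in paginator.paginate(jobQueue=QUEUE, jobStatus=status):
        waiting += len(page["jobSummaryList"])

print(f"{waiting} jobs waiting in {QUEUE}")
if waiting > 80:  # alert before hitting the (assumed) limit of 100
    print("Queue depth is getting uncomfortably close to the limit.")
```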
The last engineering mistake we made was underestimating the size of our data lake. It hit 1 PB years before our estimate. Maybe that was poor forecasting, maybe not, but I think the challenges around it could have been mitigated if the other issues didn't exist; the scale just exacerbated problems that were already there. There's a cultural aspect to this too: in a tech company, when a team sees something new and shiny, they rush to leverage it so they can show leadership they're using the newest and best technology. Sort of a human psychology thing, but I saw it often.
As you can see, all of these factors collided into a giant hurricane that could one day ruin our firm. We could no longer live with the security risk, and that played a huge role in pushing the data engineering teams to move on from the single-account structure.
The Fix
Slowly but surely a new data lake is being built with MULTIPLE accounts, bless the lord. Now people like
can stop trolling me about broken pipelines. Even though the roadmap for the new data lake migration is underway, this mistake most definitely cost my firm millions, and I am not happy about my part in it. I'd recommend a multi-account data lake separated by org, business unit, or whatever other segmentation makes sense for your company (a rough sketch of what cross-account access can look like is below). I hope that by highlighting this data architecture hazard, you won't make a mistake similar to mine.
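For what it's worth, the thing that makes the multi-account design breathe is that access between accounts becomes an explicit, per-account grant instead of one sprawling policy. Here's a hypothetical sketch; the account IDs, bucket, and role names are made up:

```python
import json

# Hypothetical cross-account grant in a multi-account data lake: the producer
# account owns the bucket, and each consumer account gets one explicit,
# auditable statement. A compromise of any one account stays contained.
PRODUCER_BUCKET = "risk-domain-curated"  # lives in the risk domain's account
CONSUMER_ROLE = "arn:aws:iam::222222222222:role/analytics-readonly"  # another account

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "analytics-account-read-only",
            "Effect": "Allow",
            "Principal": {"AWS": CONSUMER_ROLE},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{PRODUCER_BUCKET}",
                f"arn:aws:s3:::{PRODUCER_BUCKET}/*",
            ],
        }
    ],
}
print(json.dumps(policy, indent=2))
```

The consumer role still needs a matching identity policy in its own account, but the point is that each grant lives in one small place you can actually read.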
-celt