Chris Hickman and Jon Christensen of Kelsus and Rich Staats from Secret Stache offer a history lesson on the unique challenges of data at “Internet scale” that gave birth to NoSQL and DynamoDB. How did AWS get to where it is with DynamoDB? And, what is AWS doing now?
Some of the highlights of the show include:
- Werner’s worst day at Amazon: Database system crashes during Super Saver Shipping promotion
- Amazon strives to prevent problems that it knows will happen again by realizing relational database management systems aren’t built/designed for the Internet/Cloud
- Internet: Scale up vs. scale out via databases or servers; statefulness of databases prevents easy scalability
- Need sharding and partitioning of data to have clusters that can be scaled up individually
- Amazon’s Aha Moment: Realization that 90% of data accessed was simplistic, rather than relational; same thing happened at Microsoft – recall the Internet Tidal Wave memo?
- Challenge of building applications using CGI bin when Internet was brand new
- Solution: Build your own Internet database; optimize for scalability
Rich: In episode 39 of Mobycast, we lead a history lesson on the unique challenges of data at “internet scale” and how this gave rise to NoSQL. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.
Jon: Hello, welcome Chris and Rich.
Chris: Hey guys.
Rich: Hey Jon, hey Chris.
Jon: Hey, so we are at number 39 now. Hard to believe that in just a short year, we’ve done this many episodes of Mobycast. Today I think we have a lot to talk about, so I just hope you’ve had a good week so far and have been doing fun things, but instead of talking about that, we’re going to jump right into it.
Chris and I just got back from re:Invent and had just an amazing time there, learned so much, met new people and got really excited about the future of AWS. So rather than do the obvious thing which would be an AWS re:Invent recap episode, I noticed there were like 15 of those around different places and blogs and everything. So everyone’s seen AWS recaps, re:Invent recaps. So we’re going to instead talk about one thing that really was fun and interesting for us at re:Invent, particularly Chris.
We’re going to do a little history today. So we’re going to talk about the birth of NoSQL and DynamoDB. So yeah, I think in order to get started before I hand it over to Chris, I just want to say that this is going to be a little bit of Chris and some personal storytelling because he has some real world experience that’s related to this story. And so we’re going to try to weave some of that into the stuff we talked about with AWS. I’m really looking forward to everybody being able to learn more about Chris’s history. And then we’re also going to of course talk a lot about specifically what AWS is doing now and how it got to where it is now with DynamoDB. So with that, Chris, maybe we can start by talking about the NoSQL.
Chris: Yeah and with that I will totally cop to the fact that I will be showing my age here by talking about this context of a history lesson right, to kind of indicate just how long I’ve been in this business.
Jon: He’s now like about 27 so.
Chris: So, as Jon said, we were at re:Invent. For me, it’s always exciting to hear the keynotes at re:Invent. There are multiple keynotes spread over four days, I think. But the two I really look forward to are Andy Jassy’s keynote on Wednesday, followed on Thursday by Werner Vogels’ keynote. Andy’s is more along the business aspect: lots of product announcements and demonstrating how AWS is crushing the world. Very interesting.
The next day is Werner’s turn to follow up, more from a technical and vision standpoint, a “where are we and where are we going” type of thing, and then dive a little bit deeper into the technology as well, given that Werner is the CTO for AWS. This year’s keynote from Werner was just super interesting and relevant to me because he starts off his keynote with a slide saying, “Hey, I’m going to tell you guys about my worst day at Amazon.” I’m like, “Okay, let’s hear it.”
And so he says, his worst day at Amazon: December 4th, 2004. This is the peak of the holiday season, when they have lots and lots of orders coming through the system. They had a Super Saver Shipping promotion going on at that time, so if you placed your order by December 4th, you got Super Saver Shipping, which I think was, I don’t know if it was free, but obviously it was very inexpensive. That was the cutoff date to place your order in order to guarantee Christmas delivery.
Jon: I’m just trying to get my head around the date 2004. So in 2004, I was working for a company called StorePerform and we were building software for retail giants like Sears and Best Buy and Lowes and Albertsons in order to help them make all their stores the same. It’s kind of interesting for me to think about that because we’re talking about Amazon who came along and crushed all those companies. So 2004.
Chris: This is 14 years ago, so it kind of feels like multiple generations away. This was pretty early days although we’ll get into this. It wasn’t the early days, there were days before that. So they have this very busy day of traffic and, then of course, the worst, the inevitable happens their database system goes down. So they were using Oracle DB so a relational database system for storing all of their customer data, their orders, shopping carts. They’re using it to basically power their ecommerce site and that crashed due to a database bug.
So they were down for 12 hours on that day. So you have all these customers. They’ve been promised place your order by today and you’re going to get it by Christmas. But they can’t place their orders, they’re getting these error messages, they’re getting this screen saying, “Sorry, server not available. Server too busy.” Or whatever. So obviously, a disaster for the engineers and for the operations people. Hence one of the worst days that Werner had at Amazon.
Jon: Sorry to interrupt again but when he said that, the only thing I can think was well you did keep your job.
Chris: Not only did he keep his job, he got promoted. So yeah, we don’t have to cry tears for him. Spoiler alert: it works out. Obviously they made it through somehow, and they probably had to do something like, “Okay, we have to extend the deadline and we’re going to eat some shipping cost and whatnot.” But obviously it was a very traumatic, expensive, and just really untenable situation, because it’s not like traffic is going to be slowing down the next year.
This is a problem that has now hit them in the face and that they have to contend with. So as a part of this, they have to go away and do their post mortem, in essence asking, “What do we do next so that we don’t have this problem in the future?” As he describes in his keynote, the key realization was that relational database management systems just aren’t built for, aren’t designed for, the internet, for the cloud. This was a really core, fundamental problem.
Jon: I just want to put a quick technical point on that. The reason is that RDBMS systems are like a single thing, a database. Maybe you are able to have multiple machines pointing at some disks or something, but really, the databases aren’t very clusterable and can’t be scaled horizontally. You just can’t have 100,000 machines in your RDBMS database, and that’s kind of what he’s getting at. It’s a single thing; if it goes down, you’re down.
Chris: Yeah. This was a common theme around that time in dealing with internet scale: scale up versus scale out. And for things like relational databases, and even servers, really the only answer was scale up. The only way you can handle more load is to throw more hardware at it. You get faster machines, so you double the CPU power on your machine, or you increase the size of the disk, or you add more memory. So that’s scaling up.
But you can only go so far. There’s a limit to how big a machine you can have. Scale out, on the other hand, says, “Hey, instead of having just one of these things, let’s go create clusters of these and let’s have multiples of it.” So you scale out horizontally instead of scaling vertically. And that ended up being what was necessary in order to deal with the massive traffic and load that we have now.
Jon: Just to make sure, I mean I sort of stated something pretty confidently. But just to make sure I really understand it, I think the reason for that is that databases are super stateful, and every time you talk to them, you’re basically saying, “Given your understanding of the state of the world, I want to know the answer to this question.” You just can’t split that state over a bunch of machines that aren’t aware of each other’s state. So essentially, yikes, you’ve got to just scale up. You can’t just spread that out because all the machines wouldn’t be aware of what the others are doing. It’s that statefulness of databases that prevents them from being scaled out very easily.
Chris: Cozy up, people, get into a recliner and get nice and close by the fire, because this is super interesting and there’s so much here to unpack. These are all good points, and these are the kinds of issues that were really occurring for the first time in the late ’90s and early 2000s. This is what we were all talking about and struggling with, these very kinds of issues. So you’re getting to the concept that we really need sharding, we need partitioning of data, in order to have these clusters that can be scaled up individually.
And relational database systems, yeah, they did not lend themselves well to that especially by having models where all the data is intertwined. So having to do joins becomes very problematic like, how do you split that data up. How do you partition it? It’s just not built for that.
Jon: I’m maybe trying to read rows from one place and something else is trying to write about the same thing at the same time and it’s like very contentious.
Chris: Exactly. This might be a good point to kind of point out that as Werner was kind of going through this talk and they were doing their post mortem and analysis and figuring out what went wrong and what can we do to fix it. Just what is the criteria here. They took a look at the data that they were storing in these relational database systems for their customers, customer information, for the orders, the shopping carts and whatnot. Here’s what they realized, they realized that 70% of that data that they were accessing was in a single table and they’re only selecting a single row.
So very simple, I mean literally, key value. I’m going to look at something by primary key and I’m pulling that back. That was 70% of the data and the traffic they have and they were using a relational database system for that. Another 20%, single table but multiple rows. Give me all the items in a shopping cart for a particular user or something like that, doesn’t involve any joins, it’s hitting a single table. Again, not really relational data at all. It’s pretty simplistic, there’s just lots and lots of that data and that happens a lot in many operations and has to have them quickly.
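The two dominant access patterns described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `customers`, `cart_items`, `get_customer`, `add_to_cart`, and `get_cart`, and the sample records, are hypothetical and are not Amazon’s actual schema.

```python
from collections import defaultdict

# Pattern 1 (single table, single row): look one record up by primary key.
# The customer ID and fields are made-up sample data.
customers = {"c-1001": {"name": "Alice", "email": "alice@example.com"}}

def get_customer(customer_id):
    """Return the one row stored under this primary key, or None."""
    return customers.get(customer_id)

# Pattern 2 (single table, multiple rows): all rows that share one key.
cart_items = defaultdict(list)

def add_to_cart(customer_id, sku, quantity):
    """Append one item row under the customer's key."""
    cart_items[customer_id].append({"sku": sku, "quantity": quantity})

def get_cart(customer_id):
    """Return every item row for one customer; no join is involved."""
    return list(cart_items.get(customer_id, []))

add_to_cart("c-1001", "book-42", 1)
print(get_customer("c-1001"))
print(get_cart("c-1001"))
```

Neither lookup touches a second table, which is why a plain key-based store can serve both.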
And it was really only 10% of the remaining data and traffic involved multiple tables. So they look at this like man, 90% of the data that we’re throwing in here is really simplistic and it’s not relational data. So why are we using a relational database system for that? That was definitely one of those big aha moments for them. When I’m sitting there in the audience and listening to Werner go through this process of like saying okay, “Here’s what happened to us. Here’s the kind of things that we thought about and here’s what we realized about the data.” I was like, oh my goodness. This is like déjà vu because, this is exactly—almost verbatim, this is exactly what happened six years prior for me personally when I was at Microsoft.
I was at Microsoft in the late ’90s, and one of the reasons I actually got hired there was that I was fortunate enough to work on the very first wave of internet applications. I was working on a research project at Motorola for the Department of Defense. We were building a suite of tools for engineers so that they could build ASIC chips. We started off building all this in just native code; it was all UNIX-based code, a UNIX-based windowing system and whatnot. It was a lot of work.
It was also back in the day, you had to run this on like a Solaris workstation. While we’re doing this is when the internet is starting to come on the scene. You have things like the Mozilla browser and web pages started to appear and then Java came out from Sun and this had the promise of, you could write code in your applications and it would run inside the browser and it would run across platforms. So it would run anywhere. So anywhere where Java was supported in that browser, your application would run.
So we made the switch and said, instead of building this suite of applications for UNIX-based systems, let’s rewrite it as Java applets. Now every kind of platform supports it. But the net is that using the alpha version of Java was super painful.
Jon: It’s actually for a 7-year-old.
Chris: Exactly. But it was my after-school project. It’s kind of surprising how far we did get, and that work was again just super fortunate for me, because it really led to the interest of Microsoft and their recruiting team, which allowed me to go to Microsoft and work over there. This was like 1995 or ’96. This is around the time when Bill Gates wrote his famous memo where basically they acknowledged, you know what, we missed this ship, this internet ship, and it’s time to steer the ship in that direction. It’s all hands on deck, the internet is here to stay, and we really have to double down on this.
Jon: Arguably another kind of variety arriving at it like last year or the year before and finally turned the ship all the way.
Chris: Yes. They were slow to respond to this, being a bigger company. They had the Microsoft Network, which was their dial-up online service with lots of applications and a robust online community, very similar to what AOL or CompuServe had. There were a bunch of those, and that was kind of the status quo. The Microsoft Network had millions of subscribers, they were all paying a monthly fee, and all was well.
The internet comes along and it completely disrupts that model. That was one of the reasons why I went to Microsoft was like, they basically now have the charter of saying, we have to switch from a proprietary dial-up network to now be an internet network and all these applications that we had that were on the proprietary network they now have to run on the internet. So that worked but they did a building […] and Java was a direct link to now go do this work at Microsoft.
We had these huge challenges ahead of us, like how do we even build internet applications, because this was the first wave. They hadn’t been done before. The internet was brand new, and the tools didn’t exist for the most part or were just starting to come on the scene. Even web servers, I mean, this was the first wave of web servers, and things like CGI bin were the way that you actually wrote code for web servers. I don’t know if anyone remembers, but CGI bin was basically: every time a web request comes in, fork a process and run that process, and the output of that process is what you return back to the web request.
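As a rough illustration of the CGI model described here, the sketch below shows the kind of program a web server would launch once per request. It assumes the standard CGI conventions (request details arrive in environment variables such as `QUERY_STRING`, and the response is written to standard output as headers, a blank line, then the body). The `build_response` helper and the `name` parameter are hypothetical, and Python is used only for illustration; it is not what anyone in the story was running.

```python
#!/usr/bin/env python3
import os
import sys
from html import escape
from urllib.parse import parse_qs

def build_response(environ):
    """Build the full CGI response text from the request environment."""
    # The web server passes the URL query string in QUERY_STRING.
    params = parse_qs(environ.get("QUERY_STRING", ""))
    name = params.get("name", ["world"])[0]
    body = f"<html><body><h1>Hello, {escape(name)}!</h1></body></html>"
    # A CGI response is headers, then a blank line, then the body.
    return "Content-Type: text/html\r\n\r\n" + body

if __name__ == "__main__":
    # The server starts a fresh process per request and relays stdout.
    sys.stdout.write(build_response(os.environ))
```

Because a new process starts for every request, nothing is kept in memory between requests, and the per-request process startup is part of why this model was costly at high traffic.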
So we had our work cut out for us there at Microsoft. It was an interesting situation because right out of the gate we had millions of users, and there were very few other companies out there that had that kind of problem at that scale right out of the gate. I don’t think there was even a handful of companies at that point. It could have been maybe AOL, maybe Netscape.
Jon: I’m trying to think what kind of data those users might have had beyond looking at just sort of what I might call today a brochure type websites where it’s just pure information, just static website. What kind of user information did—how were they interacting with these applications?
Chris: The Microsoft Network at that point had a pretty rich set of applications that people were using, ranging from things like message boards, news groups, and chat rooms to various interest groups and different types of entertainment and content. A lot of it was pretty interactive. It wasn’t like TV where you sit back and consume it; you were interacting with it.
Some of it was games. So there was a range of various applications, but probably the common theme was that you’re in a community with other people and you’re interacting with those people. That was our charter: okay, we already have this existing suite of applications that work for the proprietary dial-up network, and that’s all native code with its own protocol for sending packets over these dial-up networks. Now we have to open it up to the internet and go over HTTP.
It was a lot of work, but pretty early on we discovered we had this big problem of how you deliver these systems at scale, and particularly how you scale the data. There were tools for us to scale the stateless components of our architecture. We didn’t really have load balancers. I mean, we kind of had load balancers, but we called them virtual IP appliances. It was the same principle: you could have a cluster of stateless web servers all fronted by a single IP. Sometimes it was software based and sometimes there was a hardware appliance to do that.
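The idea of a single address fronting a cluster of stateless web servers can be sketched as a simple round-robin dispatcher. This is a minimal illustration, not how the appliances of the time were implemented; the `VirtualIP` class and the backend names are hypothetical, and round robin is just one possible policy.

```python
from itertools import cycle

class VirtualIP:
    """One front-door address that spreads requests over stateless backends."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        # cycle() hands out the backends in order, forever.
        self._backends = cycle(list(backends))

    def route(self, request):
        """Pick the next backend in round-robin order for this request."""
        backend = next(self._backends)
        return backend, request

vip = VirtualIP(["web-1", "web-2", "web-3"])
assignments = [vip.route(f"GET /page/{n}")[0] for n in range(6)]
print(assignments)  # ['web-1', 'web-2', 'web-3', 'web-1', 'web-2', 'web-3']
```

This only works because any server can handle any request. Once a request depends on state held by one particular machine, that free choice disappears, which is the problem the data layer ran into.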
So we could scale the load with that. We could have these web servers. Even the application servers that were doing business logic, per se, were scalable as well because those were stateless. The big thing we had a problem with was the data layer, because at that point really the only option was a relational database, and being at Microsoft, that choice was SQL Server. So that was really the only repository we had for storing and retrieving data, and this was a constant problem for us right out of the gate, because we had millions of users, literally right out of the gate, and all this data that we needed to store for all these types of applications that we were building. Again, it’s like messaging and chat rooms and user profiles and preferences…
Jon: Some of that is like common to every single user data, they would all have to be in one database.
Chris: At that point there was not even the concept of active-active with SQL Server. There was one database and that was it. You could have a backup database, but it was cold. If the active database did go down, then you manually had to cut over and bring up the backup database. So we were very much limited to this scale-up architecture, and it was just a constant problem for us. The writing was on the wall as well: we’re having problems now, and we’re going to have more and more users, our apps are getting more and more complicated and sophisticated, and we’re storing more and more data about our users.
This just doesn’t work. The big realization for us there was, this is kind of silly that we’re storing this in a relational database, because it’s not even relational data. What we actually ended up calling it was “internet data.” And really what we were describing was kind of like document-based data. It was these snippets of data that didn’t have strict schemas and were being accessed by key. So it was really key-value lookups of snippets of data: things like user preferences, cookie information, history details, or chat messages.
That’s what we were putting in our database, and that’s what was causing it to fall over. This is like 1997, going into 1998, and that’s when a colleague, Marco de Mello, and I came up with this idea. He was the technical program manager there with me, and we worked very closely together, with him being a program manager and me being a developer. We said, “Let’s go build our own database. Let’s build an internet database for this stuff, because this is really not relational.”
We really think that we can build something that is so much more scalable and can handle the performance. We want to have things like partitioning; we don’t need things like joins. We really just need to optimize for scaling out. Basically, we were looking at it like this: you can have a load balancer for your web layer and for your application layer, so what if you could have a load balancer for your data layer? This is what we were going for.
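One way to picture a “load balancer for your data layer” is a store that hashes each key to pick a partition and keeps schemaless documents there. The sketch below is an assumption-laden illustration and not the database that was actually built: the `PartitionedStore` class, its methods, the key format, and the choice of SHA-256 modulo the partition count are all hypothetical.

```python
import hashlib

class PartitionedStore:
    """Key-value store that routes each key to one of N partitions."""

    def __init__(self, num_partitions):
        if num_partitions < 1:
            raise ValueError("num_partitions must be at least 1")
        # Each dict stands in for a separate database node.
        self._partitions = [{} for _ in range(num_partitions)]

    def _partition_for(self, key):
        # A stable hash sends the same key to the same partition every time.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._partitions)
        return self._partitions[index]

    def put(self, key, document):
        """Store a schemaless document (any dict) under its key."""
        self._partition_for(key)[key] = dict(document)

    def get(self, key):
        """Fetch a document by key from the single partition that owns it."""
        return self._partition_for(key).get(key)

store = PartitionedStore(num_partitions=4)
store.put("user:42:prefs", {"theme": "dark", "language": "en"})
store.put("chat:7:msg:1", {"from": "user:42", "text": "hello"})
print(store.get("user:42:prefs"))
```

Every read and write touches exactly one partition, so capacity can grow by adding partitions. Note that with plain modulo hashing, changing the partition count remaps most keys, so a real system would need a rebalancing strategy.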
So we kind of got the go-ahead to go do this and to build a prototype. I think we spent four or five months building it. We ended up taking one of our applications, I believe it was a message board type application that was running on SQL Server. We wrote this new database that was optimized for all these kinds of problems that we wanted to solve, replaced SQL Server with this brand new database, and showed the application working against that. So we definitely got some heads turned with that.
Jon: I can’t imagine how you proved that it could scale. You didn’t have the option to fire up 100 EC2 instances and throw load at it. So how did you know that it would work?
Chris: One of the great things about working at a company like Microsoft is that the level of resources they have is quite unique. In the building that I was in, just about the entire first floor was a duplicate of our production data center, which was just racks and racks of servers. We had dedicated ops and testing teams. So for us to do load testing by throwing tremendous amounts of traffic at it, we could just do that by walking down the stairs to our test data center facility.
We had tools like that at our disposal, and then also, from an architecture standpoint, we could show, “This is what we’re doing to address it.” That gives a lot of credence to it as well, backed up with things like load testing and performance testing and whatnot. So we were kind of off to the races at that point, and we ended up getting funded internally to have a larger team to do that. I think we ended up adding five or six people to our team to go actually build this into something real.
One of the unfortunate things during the time that I was at Microsoft was the fact that we would have frequent re-orgs. So this is where the powers that be, I don’t know exactly what happened, but I would imagine they get into a room and start shuffling around some papers and […] on the whiteboard, moving people around and whatnot. You kind of learned, whenever you did move, don’t even bother unpacking most of your stuff. Just leave it in the boxes in your office, because chances are in six months you’re going to be moving again to a different office in one of these re-orgs.
And so naturally, one of these re-orgs happened during this time. There were some new management folks leading the division that I was in, and due to this, and just politics and strategies and whatnot, they kind of said, “Look, this is cool, what you’re doing here with this new type of database, but we’re the Microsoft Network group, we’re MSN, and that’s not our charter. This just feels like it should be in the core database group, the folks that are doing SQL Server. So why don’t you stop doing that and instead go work on something else?” And so for me, that was the last straw.
Jon: That’s brutal.
Chris: Yeah, I was not happy with that at all and I just felt like I was screaming into the void where it’s just like, how can you walk away from this. This is a huge problem and it’s like, it’s our problem today but five years from now, it’s going to be everyone’s problem because the internet is here to stay. The internet’s not going to shrink, it’s not going to stay stagnant, it’s going to grow and like we’re just getting started people. This is a huge opportunity, someone’s going to solve this, why can’t it be us? So with that, that’s when I decided to leave Microsoft and go found a company to go build this and deal with this issue like how do you build an internet scale database. Now we get into the next chapter of the story and going off to start up land.
Jon: That’s where I’m going to stop you, I think, because we’ve spent enough time for this episode. So let’s do a serial episode and find out where this goes, and then bring it back to how the new company that you’re going to talk about next episode ends up relating back to DynamoDB.
Chris: Sounds great.
Jon: Alright, thanks everyone.
Chris: Thank you. Thanks guys, bye.
Rich: Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with the show notes and other valuable resources is available at mobycast.fm/39. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.