Amazon Web Services (AWS) offers so many services and tools. But which ones should you use? Chris Hickman and Jon Christensen of Kelsus and Rich Staats from Secret Stache share their thoughts on services that they are currently using and evaluating.
Some of the highlights of the show include:
- AWS Simple Storage Service (S3) is scalable file-based storage to write files to folders and read them back, for example image files or configuration files. S3 files are available from any app running anywhere. S3 is highly scalable; it has no hard limits on scaling, but partitioning should be evaluated to maintain performance at scale.
- S3 is not a content delivery network (CDN), but can be integrated with one. CDNs are used to reduce latency of delivering files, such as static web files and media, to end users.
- CloudWatch provides alerts and alarms to monitor resources running inside the AWS cloud: network, CPU, memory, disk space, virtual machines, clusters, auto-scaling groups, and applications. It also lets you create custom metrics and dashboards about the health of your system.
- CloudWatch's tight integration with auto-scaling groups is particularly useful, and harder to achieve with third-party monitoring tools.
- AWS documentation and console UX leave something to be desired, but its strengths outweigh this slight downside.
- For application *performance* monitoring, third-party tools such as New Relic, Datadog, and PagerDuty provide more functionality.
- Running Dockerized apps in ECS makes it less important to monitor low-level hardware resource usage such as CPU, memory, and disk space.
- We prefer Sumo Logic and Rollbar over CloudWatch Logging for managing application logs and error reporting. They have more robust features for searching, troubleshooting, and maintaining logs in a distributed system.
- Some other non-core AWS services that are promising include DynamoDB, Athena, Redshift, and the AWS developer tools.
- AWS DynamoDB is a NoSQL, document-based, key-value data store with a flexible schema. DynamoDB was created to address scaling to massive loads with non-relational data, and it's comparable to MongoDB. DynamoDB is often used for storing JSON documents.
- DynamoDB was started in 2006 to address Amazon’s own internal challenges with massively scaling non-relational data. It is very well integrated with other AWS services.
- DynamoDB is a completely managed service, so you don't need to worry about backups and restores or server patches.
- DynamoDB Streams provides a log of all data-modifying operations (creates, updates, deletes) on your DB that can be integrated with SNS and/or Lambda functions to create event-driven applications.
- You can use MongoDB on AWS as well, by installing it on EC2 instances or using a third-party Mongo hosting service such as mLab.
- AWS Athena is a data warehouse solution that competes with Snowflake. Data is stored in S3, which provides very inexpensive storage, and is accessible via SQL. This can yield cost savings when you have huge data volumes but relatively low data access throughput, and thus less need for processing horsepower. Snowflake provides HIPAA compliance out of the box; Athena does not, at least not yet. Snowflake also supports multiple clusters for accessing different parts of your data, or accessing it in different ways.
- AWS Redshift is a column-oriented data warehouse solution for large amounts of data and fast access. Redshift offers higher performance in column-oriented queries common for data warehouse applications. On the downside, scaling is more difficult in Redshift: it requires migrating data to a larger cluster.
- Amazon's developer tools manage a continuous integration and deployment pipeline: CodeStar, CodeCommit, CodePipeline, CodeDeploy, etc. Kelsus uses CircleCI and other third-party tools that meet all our needs, so we haven't had a reason to evaluate these AWS tools.
- AWS Cloud9 is a browser-based integrated development environment (IDE). Chris is happy and comfortable with Sublime, but Cloud9 is worth considering, though it's a significant investment to learn a new IDE.
- The tight integration and interoperability of the AWS services is one of their greatest overall strengths.
Links and Resources
Amazon Web Services (AWS)
Secret Stache Media
In episode 9 of Mobycast, Jon and Chris discuss a few of the AWS services they’re currently evaluating. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.
Welcome Rich and Chris, another Mobycast today.
How is it going?
Pretty good. Usually we do this once a week but Chris is going on vacation. Where are you going, Chris?
To an undisclosed location far, far away. I’m going to Rome, Italy, the eternal city.
That should be really fun. I’m looking forward to seeing pictures and hearing your stories. Unfortunately, I don’t think the rest of the audience will get to see the pictures but maybe you can share a story when you get back.
The real reason is I can’t get a good coffee in Seattle so I have to go to Rome.
I was just in Seattle visiting you last week. I think there was some decent coffee there. In fact, I bought some coffee from Elm Street Coffee Roasters. I think I bought $40 worth and it is already gone. I’m so disappointed, it was so good. Both my wife and I were like, “I love this coffee.”
We're recording a day after the one we did yesterday. In between yesterday and today, we pushed Mobycast live. Rich, that was some great work, thank you for doing that. I imagine that's been most of your time, though, between yesterday and today.
Yeah, all of my time, I think. For the most part, working through a few bugs and getting everything set up.
Mostly the same here, getting a mailing list set up to let people know about it. It's exciting. We're hoping that a few people like what we have to say.
Today, we're gonna do a continuation of yesterday's show. We talked about several of the AWS services that we like and use, ones that are core to what's happening at Kelsus. A couple of those, for example, were Elastic Beanstalk in particular and a little bit of Lambda. We talked about what we don't like to use anymore with Lambda. It wasn't that we don't like to use it, it's just that we don't like to use it for everything.
Today, there are a couple of services that are core to what we do that we didn't hit yet, and we wanna talk about those. There's another list of services that we're excited about; some of them we've used a little bit and are anticipating using a lot more, and some of them we're still in the research phase on, hoping that we might use them because they're looking very promising.
To get started, the first things that we're gonna talk about are the ones that we absolutely do use in the core of what we do. The first one of those is one that we've used for years and years and that everybody who knows AWS should be familiar with, which is S3. It's […] defining and talking about, so go ahead, Chris. What is S3?
S3: three Ss, Simple Storage Service. This is one of the very first cloud services offered by AWS. Essentially, it's completely elastic file-based storage, so you can write files to folders and read them. Over the years they have extended it with a tremendous number of different ways of doing that and of managing the lifecycle of storage. The important thing is that this is file-based storage, not block-level storage.
It's not a raw disk, it's files. Those files can have permissions and you can do all the things that you do with files. It serves as one of those core fundamental services that a lot of applications need.
It’s a place where you can store files. Maybe you’re uploading stuff, users are uploading files via your application and you need a place to put them, S3 would be a great place to do that. Maybe you have an image sharing app or something like that or a video sharing app, you might put your videos or images in S3.
I think a good way to clarify what you said about it not being block-level storage is that you can't have S3 be the disk drive for a machine. You can't say, "I wanna run my Microsoft Windows operating system on this virtual machine and the disk is gonna be S3." I was thinking of it as a special place with, I guess they're called buckets, which are like folders that are just for files; they're just files.
There are so many different use cases. Like you said, a great one is you have an app that wants to support image uploading and then be able to view those images and share them; storing those in S3 is a perfect use case. Another one would be app configuration information: sometimes you wanna store that in files and put them in S3. The great thing about S3 is that it's a cloud-based service, which means it's available to all your applications and all the machines running your application.
If you have a cluster of machines, you don't need those configuration files on each one of those servers; instead they can all go to this known place, this S3 bucket, and read the configuration from there. You have one place where you're managing that file.
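As a rough sketch of the shared-config pattern just described, the snippet below keeps one JSON config document in a single S3 location that every server reads. The bucket and key names are hypothetical, and the S3 client is passed in (in practice it would be `boto3.client("s3")`, with boto3 installed).

```python
# Sketch: every server reads the same config from one known S3 location,
# instead of keeping local copies on each machine. Names are hypothetical.
import json

DEFAULT_BUCKET = "myapp-config"          # hypothetical bucket
DEFAULT_KEY = "production/config.json"   # hypothetical object key

def save_config(s3_client, config, bucket=DEFAULT_BUCKET, key=DEFAULT_KEY):
    """Write a JSON config document to the shared S3 location."""
    s3_client.put_object(
        Bucket=bucket, Key=key,
        Body=json.dumps(config).encode("utf-8"),
        ContentType="application/json",
    )

def load_config(s3_client, bucket=DEFAULT_BUCKET, key=DEFAULT_KEY):
    """Read and parse the shared JSON config document."""
    resp = s3_client.get_object(Bucket=bucket, Key=key)
    return json.loads(resp["Body"].read())
```

Injecting the client keeps the functions easy to exercise locally before pointing them at a real bucket.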
This isn't a CDN service; I was gonna get tripped up with that. Could you use S3 as a CDN, where you'd just have a different location from which to load those files? That's not what it is.
What a CDN is, a content delivery network, is a cache of content that gets pushed to the edges of a network so that it's closer to the consumers and increases speed. Since it's a cache of content, it's important that the content isn't changing all that frequently. This is great for static web assets, images and sometimes HTML files, media files that aren't changing.
The thing is, though, that you can combine these two things. You can use something like S3 as the origination source for all that information and then integrate it with your CDN to say, "This is the source of the information that I want my CDN to pull; suck up these assets and then distribute them out to your edge locations, and have that be the origination for my content." Definitely they can go hand in hand, with the CDN leveraging S3.
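To make the "CDN leveraging S3" idea concrete, here is a small sketch of the piece of a CloudFront distribution config that names an S3 bucket as the origin. Bucket and origin names are hypothetical, and a real distribution (via boto3's CloudFront `create_distribution`) requires many more fields than shown.

```python
# Sketch: the origin entry that points a CloudFront distribution at an S3
# bucket as its origination source. Names here are hypothetical.
def s3_origin(bucket, region, origin_id="s3-assets-origin"):
    """Build the origin config entry pointing CloudFront at an S3 bucket."""
    return {
        "Id": origin_id,
        # CloudFront fetches cache misses from this S3 endpoint, then serves
        # cached copies out of its edge locations closer to users
        "DomainName": f"{bucket}.s3.{region}.amazonaws.com",
        "S3OriginConfig": {"OriginAccessIdentity": ""},
    }
```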
I wanna explain that a little bit deeper for a second. It seems to me that if you wanted to, you could say, “All my content for my whole web application is served directly by S3, I’m not using a CDN.” I’ve never tested this but the question I have is can S3 scale to any level of load? Imagine you have millions of users, can S3 handle that?
There are actual service guarantees for the amount of traffic that you can handle with an S3 bucket. You do get into situations where you need to take into account the way that partitioning is done; that could potentially be a limitation on how well you scale. For the most part, S3 is designed to scale infinitely. There are very few hard limits on it as far as throughput; there may be some design decisions that you have to make from a partitioning standpoint so that you don't have everything going to one particular bucket.
If you have content that's changing frequently but that is files, then S3 might be a good place for it, […] even for a very high-traffic, large application. If you have content that's not changing very often and you have global users, you may want to get that content closer to those users so that they're not experiencing latency, say from Argentina to the United States. There might be 300 or 400 milliseconds of latency that using a CDN can help you avoid.
There was one thing that I remember you telling me about recently that you had learned about how S3 works under the covers, Chris, that I think is worth sharing if you can recall what it is. I actually can't recall it off the top of my head, but it was basically about how the files are stored under the covers. That was really surprising to me because I didn't expect that it could be as performant as it is.
This is actually another service that we haven't talked about yet. We were actually talking about Elastic Block Store, EBS. EBS is the block-level version of storage that Amazon offers, the counterpart to S3, which is the file-based service. I always thought of EBS as getting access to an actual disk. For all intents and purposes, you're mounting that disk as a volume onto your machine.
The truth is that EBS truly is a software-based service. There are many, many thousands of actual physical disks, and the blocks are being distributed across them as the software algorithms deem fit. Whenever you read a block or write a block, it's actually going through the software, and the software is making a network request to figure out what disk to go talk to.
For me that was, "Wow, I can't believe that you can have disk access at the speed that we're used to, but it's really going over the network and through software," which is mind-blowing.
It wasn't about S3. I had thought that one service was using another, which was unexpected, like EBS using S3 under the covers or something. My memory is failing me here.
It gets complicated because there are over 100 AWS services and many of these things are built on top of each other. A great example is RDS: you can do encryption at rest with your RDS databases. It's just a button click in the console to do so, or a switch in the command-line API. It's leveraging S3 encryption to do that.
S3 itself, you can encrypt your files. That ends up being a building block for a lot of other services; just S3 itself is a building block for a lot of other services inside AWS. It's interesting how all these things plug together: there are foundational pieces and you build on top of them. That's one of the reasons why AWS's pace of innovation is increasing. It's definitely not linear; it's an exponential curve when you look at what they're doing and what they're launching.
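As a small illustration of the S3 encryption building block mentioned here, asking S3 to encrypt an object at rest is one extra parameter on the write. The client is injected for testability; in practice it would be `boto3.client("s3")`.

```python
# Sketch: server-side encryption on an S3 write. S3 encrypts the object at
# rest and decrypts it transparently on read.
def put_encrypted_object(s3_client, bucket, key, body):
    """Write an object with SSE-S3 (AES-256) server-side encryption."""
    return s3_client.put_object(
        Bucket=bucket, Key=key, Body=body,
        ServerSideEncryption="AES256",  # ask S3 to encrypt at rest
    )
```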
Part of that is they're hiring like crazy and they do have an army of people. The other thing is that it just becomes so much easier now that those foundational pieces are in place; you just start gluing these things together. The pace increases dramatically.
The singularity is gonna be brought to us by AWS.
For sure. It’s definitely before 2038.
I think we’ve done a good job on talking about S3, the next one that we have on our list is CloudWatch. It’s another service that we use a lot, there are parts of it that we love and parts of it that we don’t. Maybe you can tell us what it is and we can get into it.
CloudWatch is a big, huge beast of a topic. It's very, very broad. Generally, it's all about monitoring and getting insights into what's going on with your resources and software running inside the AWS cloud. You can get metrics, you can get graphs, you can set up alerts and alarms to know, "Let me know when this EC2 machine passes 90% disk utilization so I know I'm getting close to my disk filling up and I need to do something. Let me know when my CPU is above a certain threshold, or maybe below a certain threshold."
Getting that real-time insight into what's going on in the machines feeds in as input to a lot of the other services inside AWS as well. We can peel that back and talk about it a bit more. It's a means for monitoring and getting data about all the various services and resources that you have inside AWS.
From there, I get what it is, but it may help me put my head around it if we can talk about how we're using it in some concrete way. Can you give us an example from a project?
We have used CloudWatch for alarming; we've created CloudWatch alarms to let us know when things like network bandwidth are above a threshold that we were not expecting. We'll set an alert and we'll get an alert. At the end of the day, it might show up as an email or maybe a Slack message that lets us know, "This alert got triggered."
We've also used it to create custom dashboards, which is nice, so you can have custom metrics for what's important about the health of your application. We've used CloudWatch to create application-specific custom alerts. Say we're expecting regular activity in our application and our database: if we don't see something happen at least once an hour, we need to look closer, because that may indicate there's something wrong with the system.
We created a custom alarm for that and a dashboard to go along with it so we can visualize what's going on with that particular service. It's very customizable, very broad, very feature-packed; there's so much that you can do with it. It really depends on what it is that you're trying to do and what you need.
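The "alert if nothing happens for an hour" pattern described above could be sketched with a custom metric plus an alarm, roughly as below. The namespace, metric, and alarm names are hypothetical, and the client is injected (normally `boto3.client("cloudwatch")`).

```python
# Sketch: publish an application-activity data point as a custom metric, and
# alarm when the hourly sum drops below 1 (i.e. nothing happened for an hour).
def record_activity(cloudwatch, count=1):
    """Emit one data point of application activity."""
    cloudwatch.put_metric_data(
        Namespace="MyApp",  # hypothetical custom namespace
        MetricData=[{"MetricName": "RecordsProcessed",
                     "Value": count, "Unit": "Count"}],
    )

def create_no_activity_alarm(cloudwatch, sns_topic_arn=None):
    """Alarm when the hourly sum of activity is below 1."""
    cloudwatch.put_metric_alarm(
        AlarmName="myapp-no-activity",      # hypothetical alarm name
        Namespace="MyApp",
        MetricName="RecordsProcessed",
        Statistic="Sum",
        Period=3600,                        # one-hour evaluation window
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="LessThanThreshold",
        TreatMissingData="breaching",       # total silence also triggers
        AlarmActions=[sns_topic_arn] if sns_topic_arn else [],
    )
```

`TreatMissingData="breaching"` matters here: with an activity metric, no data at all is exactly the condition you want to be told about.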
You mentioned that it can feed into the inputs of other AWS services. Is it something that you could set up to make decisions for you, so if you reach 90% utilization it can spin up other servers or something like that, or is it really just for sending off that data to a human to make those decisions?
You use it as input to other things; that's a great example. You can have a cluster of machines being managed by an auto-scaling group. You can define scale-up and scale-down policies for that ASG. Maybe CPU is the metric that you're gonna monitor to determine when your cluster needs to get bigger.
You can set a CloudWatch alarm to look at that and see if the average CPU goes above a certain threshold. When it does, you can then say, "I'm now gonna trigger a scale-up event on my ASG." That will automatically say, "ASG, instead of being five nodes, we want you to be six nodes. Go ahead and create a new node to handle that additional load." You can do the reverse as well. Maybe your app has peak times for two hours during the day and maybe you need twice as many machines for that time, so you'll have these scale-out events that happen, and afterwards you don't need those machines.
You can have the converse: you can say, "Once the CPU utilization goes below a certain threshold, start killing machines and take them out; bring that back down so we're not paying for resources that we don't need."
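The CloudWatch-to-ASG loop described here could be wired up roughly as follows: a scale-out policy on the auto-scaling group, plus a high-CPU alarm whose action is that policy (the converse scale-in wiring is symmetric). Group and policy names are hypothetical; the clients would be `boto3.client("autoscaling")` and `boto3.client("cloudwatch")`.

```python
# Sketch: add one node to an auto-scaling group whenever average CPU across
# the group stays above a threshold. Names are hypothetical.
def wire_scale_out(autoscaling, cloudwatch, asg_name="my-asg", cpu_threshold=90.0):
    # Policy: grow the group by one instance each time it is triggered
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=asg_name,
        PolicyName=f"{asg_name}-scale-out",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=1,
    )
    # Alarm: average CPU over threshold for two 5-minute periods in a row
    cloudwatch.put_metric_alarm(
        AlarmName=f"{asg_name}-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=2,
        Threshold=cpu_threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],  # alarm fires -> ASG grows by one
    )
    return policy["PolicyARN"]
```

This is the "few button clicks" integration expressed as API calls: the alarm's action is simply the scaling policy's ARN, no external service in the loop.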
CloudWatch's easy integration with autoscaling is the killer feature that probably draws people into it and gets people locked in on AWS and CloudWatch over using some other monitoring tools that are potentially better and definitely prettier, like New Relic and Datadog, because setting up another monitoring service to do your autoscale groups feels like more steps. You're not using that super tightly integrated AWS stuff.
Autoscale groups can listen to other things too; there are other tools out there that can send an API request saying, "autoscale group, increase or decrease," but with CloudWatch it's a few button clicks and you're there.
We touched on this the other day as well: for any particular aspect of your application or system, there are various places where you can get that functionality or service. You can use the AWS versions or you can go outside. You just have to weigh the pros and cons: is it worth it?
You can't discount the network effect of using these technologies that are all integrated together versus, "If I use this outside tool, what are the initial steps that I have to do?" There's gotta be a good reason for doing that, whether it's "I don't want lock-in" or the outside tool is some factor of X better in functionality. Jon, you're totally right with something like CloudWatch and ASGs. It would be very hard to come up with a scenario where it doesn't make sense to use that.
It's about identifying what that killer feature is for the various AWS services. Across the board, I think AWS does suffer from poor documentation and some UX headaches. Every time you decide, "We're gonna commit to AWS," you have to identify the feature that makes it worth putting up with that. In CloudWatch's case, I'm thinking autoscaling is that one.
There was another one: in SNS's case, you can use another push notification service, but is there any other push notification service that has a full-on publish-subscribe message bus that lets you fan out to all the AWS services? No, there's not another one of those. So let's use AWS SNS instead of, I can't remember the name of another push notification service that's out there, but there are several.
There used to be Parse, but it is no longer with us. There were a few things that we don't like about CloudWatch; I think you touched on them yesterday. Maybe we can touch on that one more time real quick to close out our CloudWatch conversation.
I think it boils down to what you're trying to do and what's the right tool for the job. For a lot of the insights that we want, we do like our logging to give us access to that and also let us know when things are not going well. Application health, what's going on in the system: a lot of that comes from the logs themselves. Things like the core metrics, CPU utilization and disk utilization, become less of an issue because we're running our apps on top of an ECS cluster.
There are much fewer things to manage, and they're usually provisioned such that it's not much of an issue. We can use things like autoscale groups to scale up and scale down accordingly. In the monitoring space, there are some great options out there from other companies that are, in a lot of cases, a factor better. Things like New Relic, Datadog, and PagerDuty give you much, much more detailed information on what's going on in your systems; they're really geared around that.
When it comes to lifecycle management and performance management of your applications, those warrant a closer look. Personally, at Kelsus, we use New Relic for a lot of this monitoring, alerting, insight, and performance management, and lean less on the CloudWatch side for those kinds of things.
One last thing I wanna say about monitoring and watching running applications is that we look towards observability as one of the key things we care about. It's one thing to be able to look at the big screen that has a dashboard with lots of green parts on it and feel cozy that your application is up and running and happy.
It's not a world where you can see everything by doing that; most companies don't wanna have operations people or engineers who do nothing but sit around all day looking at green dashboards. When something goes wrong, you wanna be able to know that it happened and then dig into it, really uncover why it went wrong and observe everything about the thing that went wrong. That's the focus of everything we build with CloudWatch, or with New Relic, or with some of the other tools that we use like Sumo Logic and Rollbar.
Let's move on from CloudWatch. All of these topics are so enormous, but we're trying to do a high-level overview of these AWS services. A big one that I actually overlooked for a long time, but Chris went to a talk from AWS itself about it, is DynamoDB. I overlooked it; Mongo is the giant of NoSQL databases. It seems like DynamoDB may be worth a look. We are not using DynamoDB for anything right now.
But we are using Mongo for things and have used it quite a bit in the past. What was your takeaway from learning a little bit about DynamoDB from AWS? What is it, first, I guess?
DynamoDB is one of the NoSQL database offerings from Amazon. It's a document-based, key-value data store; it's not relational data. It has a flexible schema format. Usually, you're storing JSON documents, writing them and pulling them out. There are certain use cases where that's exactly the format that makes sense, and using a relational database wouldn't be the right match for them.
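The flexible-schema, key-value model Chris describes could be sketched like this: only the key attribute is fixed, and each item can carry arbitrary document fields. The table name is hypothetical; in practice `table` would be `boto3.resource("dynamodb").Table("users")`.

```python
# Sketch: DynamoDB as a document store. Only the key ("user_id") is fixed;
# extra attributes need no schema change.
def put_user(table, user_id, profile):
    """Store a user document keyed by user_id."""
    item = {"user_id": user_id, **profile}  # arbitrary document fields allowed
    table.put_item(Item=item)
    return item

def get_user(table, user_id):
    """Fetch a user document by key, or None if absent."""
    return table.get_item(Key={"user_id": user_id}).get("Item")
```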
DynamoDB is Amazon's premier NoSQL offering. It's very much directly compared against MongoDB. MongoDB is by and large probably the king of the hill in the space; they definitely were one of the first out there and have been around for quite some time. I started using it in production apps back in 2012, and I think that Mongo had been out for at least a year or two prior to that. That said, DynamoDB, they started to work on that in 2006, so it has been 12 years that they've been working on it as well.
I think that’s a common misconception because I didn’t know anything about DynamoDB before you talked to me about it. I knew it was the Amazon’s NoSQL offering and I just assumed, “They just wanna take the market share away from Mongo so they built some in-house thing.” That’s absolutely not the case.
This is a really fascinating story for me; it's very personally relevant. If I may, I'll go back a little bit because it might help explain the space. Years ago I was at Microsoft, working on MSN, the Microsoft Network. It was providing all of these services as web apps to a user base that was millions of people big. We were doing this at a time when being able to adequately scale your data layer was very, very difficult.
The only option was to use a relational database, and you had to scale up; scaling out wasn't an option. Scaling out means you can partition and cluster your database and add more database nodes into your system; instead, you couldn't do that. You had to scale up, which means you had to buy a bigger machine. You get to a point where there's no bigger machine that you can buy; you literally hit a wall.
We were having problems like that at Microsoft and realized that a lot of the data being stored in these things wasn't relational, it was document-based data. I left Microsoft with a colleague, and we founded a company to address that space: how do you build clustered database storage for documents, specifically for these internet applications.
It turns out that we were ahead of our time; the ecosystem wasn't quite right. We were victims of the dotcom bubble bursting, unfortunate timing. But in 2004, Amazon was having the exact same problems that we had at Microsoft. Their site was scaling, they were getting more users, and they have very pronounced peak periods around the holidays. In 2004, they were having huge issues with scaling their backend database. It was not staying up; it was completely full.
They were bursting at the seams and had problems keeping their site up. They had the same realization: a lot of the data going in there wasn't relational, it was documents. That started the process for Amazon to say, "We need to go build our own database that addresses this space." They started working in 2004, and they had something in place by the end of 2005 as an interim approach. That led to the work that became Dynamo.
Dynamo was created specifically to address the pain of scaling to these massive loads, with the realization that the data being put in there is not relational and you need a new kind of database for it.
I have a question. A lot of times, a piece of software like that may have barnacles on it that are actually bad for it; maybe DynamoDB is too specific to the problems Amazon was having in 2004, and it also carries the burden of having to be backwards compatible with a bunch of 2004 types of technologies. Did you get that sense at all when learning about it?
I think that even though they may have started on it in 2006, it was purely an internal tool based upon this original thesis: "This is how you build something that is massively scalable that deals with this NoSQL data." It was used completely internally at Amazon for some amount of time, and then the decision was made: "You know what, we should open this up and offer it as a service to everyone else."
That definitely wasn’t in 2006, I don’t know the exact date when they opened up DynamoDB. I’m pretty sure it was around 2013, 2014, 2015 timeframe, somewhere in there. It’s relatively recent that it’s been offered as an actual service that other people can use as part of the AWS services suite. I’m very, very confident that the code that was written in 2006, very little of that is probably left in what is now DynamoDB that Amazon offers.
That also helps explain why many of us felt DynamoDB was just a Mongo clone that AWS was putting out there, if it came later, if it didn't get to the public until '13 or '14.
That's the fascinating thing about this: these systems were created out of necessity; they're solving very real pain. They were independently developed out of that necessity. That said, how many folks actually have that deep pain of "I have millions and millions of users; how do I scale my databases?" That is why they exist.
One thing right off the bat that's painful about using MongoDB with AWS is that you have to install it; you have to fire up some […], install it and then maintain it. You can use MongoDB as a service through another company like mLab; that's totally an option, and that way it's a totally managed service. I know that we have Mongo in one of our projects; are we using mLab or are we hosting it ourselves?
We are definitely hosting it ourselves, primarily for performance reasons; otherwise you're making a network connection out of your AWS data center to go talk to some other […], and you create a very hard dependency as well. What happens if that service goes down? So we do run MongoDB ourselves; we have to install it on EC2 and manage it, and we obviously want high availability and fault tolerance.
We run it as a replica set, which means we need three separate machines all running the same MongoDB, with one acting as the primary and the other two as the replicas. That gives us the performance and availability, but it is a headache when it comes to operating and maintaining it, patching it, making sure we have backups, being able to restore. We have to do all that ourselves; that is a lot of pain.
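The three-node replica set setup just described shows up on the client side as a connection string listing every member, so the driver can find the primary and fail over. A rough sketch, with hypothetical hostnames and replica set name:

```python
# Sketch: build a MongoDB connection URI for a replica set. The driver is
# told all members and the set name so it can locate the primary.
def replica_set_uri(hosts, replica_set="rs0", db="myapp"):
    """Build a MongoDB URI listing every replica set member."""
    return f"mongodb://{','.join(hosts)}/{db}?replicaSet={replica_set}"

# With pymongo (assumed installed), connecting would look like:
# client = pymongo.MongoClient(replica_set_uri(
#     ["db1.internal:27017", "db2.internal:27017", "db3.internal:27017"]))
```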
It's why we use RDS, why everybody uses RDS: to avoid those kinds of pains. Another question: did you get a sense that there were any other killer integrations that DynamoDB has with other AWS services that would require coding or more work to accomplish with Mongo?
This is the common theme: it's the network effect of staying within the AWS ecosystem. No surprise, DynamoDB is very much integrated with a lot of the other core services and technologies that Amazon has. A really great use case is when you wanna have a more event-driven architecture. DynamoDB has a feature called DynamoDB Streams, which is a streaming transaction log of all the mutating operations happening on your database.
Every time something is mutated with a create, an update, or a delete, an event is emitted onto the stream as a journal to let you know it happened. You can then have that stream go to a Lambda function, and now you can start triggering actions based upon these events. Let's say you wanna send an email whenever someone creates a new contact, when a new user is added to the system, because you're launching your app and you're excited about seeing user signups.
Turn on streams and set up a Lambda function to read off that stream. When the Lambda function gets invoked, it has the event record in there. You can look at it and say, "What table was this? Is this a create on the users table?" Then go ahead and use SES to send the email, or maybe it's an SNS message that you're sending. A very, very powerful event-driven system with a minimal amount of work on your part.
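The signup-email flow just described could be sketched as a Lambda handler wired to the DynamoDB Stream: it reacts only to INSERT records and sends a welcome email via SES. The sender address and table layout are hypothetical, and `ses_client` would be `boto3.client("ses")` in a real deployment.

```python
# Sketch: Lambda handler for a DynamoDB Stream. Reacts only to INSERTs and
# emails each new signup. Addresses and attribute names are hypothetical.
def handler(event, context=None, ses_client=None):
    notified = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # ignore updates and deletes on the table
        new_image = record.get("dynamodb", {}).get("NewImage", {})
        email = new_image.get("email", {}).get("S")  # stream uses typed attrs
        if not email:
            continue
        if ses_client is not None:  # e.g. boto3.client("ses")
            ses_client.send_email(
                Source="signups@example.com",   # hypothetical sender address
                Destination={"ToAddresses": [email]},
                Message={"Subject": {"Data": "Welcome!"},
                         "Body": {"Text": {"Data": "Thanks for signing up."}}},
            )
        notified.append(email)
    return notified
```

Note the `{"S": ...}` access: stream records carry DynamoDB's typed attribute format rather than plain JSON values.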
In order for us, or other companies, to start using that, do they need to migrate, or have a new greenfield project where they can choose it as the data store? Did they talk at all about specific migrations from Mongo? Do they have a path for that?
That ends up being a very custom, specific thing. Amazon, like many other companies, has professional services and solutions architects that will, for a fee, absolutely help you do that work. This is one of the reasons why MongoDB was so popular: it’s so easy to get started. You don’t have to do anything, you basically just have to install the service. Your collections can be created on the fly; you’re up and running out of the box immediately.
DynamoDB is no different there. How much work it’s gonna be to migrate over really depends on your application and how you’ve developed it. If you have your database logic modularized and isolated into one area of your code, then it’s probably gonna be a lot easier than if you just have it scattered all throughout your code base. That’s gonna be much more work.
I was wondering if they had some magic thing. You can imagine they would have something that could collect all the documents off of a Mongo database and pull them in, and then maybe DynamoDB’s query language could be transmogrified into Mongo’s query language or vice versa, but it sounds like maybe not.
Amazon does have a Database Migration Service, and I’m sure this is one of the source-to-destination paths it supports. What that’s doing is taking your MongoDB data and moving it over into Dynamo, but your application still has to change.
I wouldn’t be surprised to see DynamoDB take a seat at the table in that previous list we talked about, of core services that we use.
The other thing I’ll point out is that I did recently go to the DynamoDB Summit here in Seattle at Amazon HQ. It really got on my radar when I was at re:Invent last year; so many of those talks mentioned DynamoDB as this core thing that Amazon’s own services were all using. That’s when I really perked my ears up, like, “Wow, this is really powerful, it’s really integrated, it’s battle tested.”
You don’t have to manage it, it’s a completely managed service. Things like backups and restores, I don’t have to worry about them. Things like scaling servers, patching servers, I don’t have to deal with that. They have some amazing technology in place, with multimaster configurations for Dynamo. It has an incredible scalability, availability, and fault tolerance story. Pretty amazing.
Another quick conversation we can have about database-type things is Athena and Redshift, which are both data warehouse solutions from AWS. At Kelsus, we use Redshift a little bit. In this case, maybe I have a little more experience thinking about these things than Chris does, although I can’t say I’m an expert in any way, but I can take a shot at it.
Athena, I think, is AWS’s answer to Snowflake. It’s a new data warehouse approach based on putting your data into S3 and then letting you use SQL, the standard query language, on top of S3, just like you would with any other database: an Oracle database, a PostgreSQL database, a MySQL database. You put all the data in S3. The nice thing about that is S3 is really inexpensive storage, which we didn’t talk about when we talked about S3. Gigabytes, terabytes, even petabytes of data are affordable to store in there.
If you can suddenly begin to query across that huge amount of information using standard tools like SQL, that’s pretty powerful, because it lets you separate the compute that runs queries from the disks that store the data. You only pay for the storage, plus, as needed, for the time you spend querying. In a data warehouse situation you’re not serving millions of requests per second, but rather a few requests per hour, maybe a hundred requests per hour: less than internet-scale applications, typically. For that, something like Athena is pretty interesting.
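To make that concrete, here’s a hedged sketch of the two pieces you’d hand to Athena (the bucket, database, and table names are invented for illustration): a DDL statement that maps a table onto CSV files already sitting in S3, and an ordinary SQL query against it. Here they’re just built as Python strings; in practice you’d submit them through the Athena console or the SDK.

```python
# Hypothetical Athena setup -- demo_db, demo-bucket, and the schema
# are made up. Athena never ingests the data; the "table" is just
# schema-on-read pointing at files already stored in S3.

CREATE_TABLE_DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.users (
    id   INT,
    name STRING,
    age  INT,
    city STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://demo-bucket/users/'
"""

# A normal SQL query -- Athena prices it by the amount of S3 data scanned.
AVG_AGE_QUERY = """
SELECT city, AVG(age) AS avg_age
FROM demo_db.users
GROUP BY city
"""
```

You would run these with boto3’s Athena client, `start_query_execution(QueryString=..., ResultConfiguration={"OutputLocation": "s3://..."})`, and the results land back in S3 as CSV.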
One thing I’ve noticed about Athena is that it’s a fairly new entrant in the market, and it sounds like Snowflake is the established player there. In my research, there are two advantages I saw in Snowflake that Athena doesn’t have yet. One is that Snowflake is already ready to go for things like HIPAA-compliant data storage and retrieval; Athena is not on AWS’s list of HIPAA-eligible tools. The other is that Snowflake has some features that Athena doesn’t have yet.
For example, Snowflake has a feature where you can turn on multiple virtual data warehouses, each of which is a little cluster of machines that can suck in some data from S3 and make it available and highly queryable. Say you have a bunch of health record data, and one group wants to look at it from the point of view of outcomes while another wants to look at it from the point of view of medical research.
They have completely different types of queries they need to run, and the data in S3 is petabytes large, so it’s nice to be able to spin up different sets of clusters, different data warehouses, independent of one another, to run those queries. That’s something you can do with Snowflake that I don’t think Athena supports quite yet. That’s Snowflake versus Athena.
The other piece is Snowflake versus Redshift. Redshift has been around for several years. A personal friend of mine implemented Redshift (I think this is public information because it’s on his LinkedIn profile) as part of the underlying database for the NASDAQ exchange. Everybody knows that NASDAQ is involved in billions of transactions per day.
I had seen that NASDAQ had moved from an open source stack involving Hadoop and Cassandra over to Redshift in the past few years, and is still using Redshift now, which tells you that Redshift is probably really good at holding a whole lot of data and making it available very, very fast. In fact, that is exactly what it’s all about.
I guess a big difference between Athena and Redshift is that Redshift is more of a classic database: the database lives on a computer, and the data lives on the file system mounted on that computer, not in S3. Data access is maybe a little faster because you’re talking directly to disks attached to that machine.
The other big thing about Redshift is that it’s essentially a PostgreSQL database: you talk to it with SQL, just like you do with Athena. But the structure of the underlying data is different in Redshift than in PostgreSQL. The main difference is that data is stored in columns instead of in rows. I think I can explain in about a minute what that might mean for you.
Imagine you have a table with an ID, a name, an age, and maybe a location, a city. In a classic database, you can imagine it’s stored like 1, Bob, 39, Minneapolis. Right next to that is 2, Mark, 15, Denver, then 3, Steve, 14, Seattle, and on and on. If you wanna find the average of all the ages, you have to look across a lot of data: you have to look at each of those users, get the age out, put those together, and then take the average.
If you store them in columnar format, that same data looks like 1, 2, 3, then Bob, Mark, Steve, then 39, 15, 14, then Minneapolis, Denver, Seattle. The point is, all of each column’s data is stored together. If you wanna find the average age, you seek forward to where the ages are stored and, boom, right there next to each other: 39, 15, 14, and you can find the average of those three very, very quickly.
For any kind of data where you need to look at it in aggregate, Redshift is awesome; it can return those aggregations incredibly fast. For any kind of data where you’re doing a lot of inserting of new rows, or where you need all of the columns to satisfy your queries, Redshift all of a sudden doesn’t look very good anymore.
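To make the row-versus-column idea concrete, here’s a tiny pure-Python sketch (a toy illustration of the storage layout, not Redshift itself): the same three records laid out row-wise and column-wise, where the columnar layout lets an aggregate over age scan one contiguous list and ignore everything else.

```python
# The same three records stored two ways (toy illustration, not Redshift).
rows = [
    (1, "Bob", 39, "Minneapolis"),
    (2, "Mark", 15, "Denver"),
    (3, "Steve", 14, "Seattle"),
]

# Row-oriented: to average the ages we walk every record and pick
# the age field out of each one, skipping past the other fields.
avg_age_rowwise = sum(row[2] for row in rows) / len(rows)

# Column-oriented: each column lives together, so averaging "age"
# scans one contiguous list and never touches names or cities.
columns = {
    "id": [1, 2, 3],
    "name": ["Bob", "Mark", "Steve"],
    "age": [39, 15, 14],
    "city": ["Minneapolis", "Denver", "Seattle"],
}
avg_age_columnar = sum(columns["age"]) / len(columns["age"])

# Both layouts give the same answer; the columnar one just reads less data.
```

In a real columnar store the win comes from disk I/O and compression, not Python list access, but the layout difference is the same idea.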
I guess the final thing about Redshift is scaling: let’s say you need to scale up or down, you need to make a whole new cluster and then migrate your data to the new, bigger cluster. Whereas with something like Athena or Snowflake, scaling up and down is really about adding more compute power, not migrating all of your data to a new, bigger cluster. That was a nice, long monologue. Do you have any feedback, Chris?
No, that was a great explanation. Definitely, I think you’re more well-versed in that space than I am. Thanks for sharing that.
The last ones on our list are the AWS developer tools, things like CodeCommit, CodeBuild, CodeDeploy, and CodePipeline. This is Amazon’s suite of tools for managing a continuous integration pipeline as well as continuous deployment, all the various pieces of that, and they’re very full featured. I believe these tools arose out of the internal tooling Amazon built for itself for delivering all of its various projects. Definitely something for us to look at and keep an eye on.
We have been using systems like CircleCI to do this for quite some time now. CircleCI is absolutely a great solution; there’s nothing it’s not doing for us, so there hasn’t been this huge desire to go find something else and see how Amazon’s offering compares, because there’s no pain. We have CircleCI making automated builds, running automated tests, and generating test artifacts, reports, and code coverage. We have it doing automatic deploys into our AWS cloud; we have continuous deployment going on.
It’s conditional based upon which branches we’re committing to, it’s fully scripted, very full featured. That said, all of the tools you mentioned on the Amazon side allow you to do all that and probably even more. Definitely something we’ll be keeping our eye on.
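As a rough illustration of the kind of setup described here (the images, commands, and branch filter are placeholders, not Kelsus’s actual pipeline), a CircleCI 2.0 config that builds, tests, and deploys only from the master branch might look like:

```yaml
# Hypothetical CircleCI 2.0 config -- images, scripts, and branch
# names are invented for illustration.
version: 2
jobs:
  build:
    docker:
      - image: circleci/node:8
    steps:
      - checkout
      - run: npm install
      - run: npm test            # automated tests + coverage reports
      - store_artifacts:
          path: coverage         # keep test/coverage output per build
  deploy:
    docker:
      - image: circleci/python:3.6
    steps:
      - checkout
      - run: ./scripts/deploy.sh  # e.g., push an image and update an ECS service
workflows:
  version: 2
  build-and-deploy:
    jobs:
      - build
      - deploy:
          requires:
            - build               # deploy only after a green build
          filters:
            branches:
              only: master        # continuous deployment from master only
```

The `requires` and branch `filters` stanzas are what make the deployment conditional on the branch, as described above.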
I think it’ll take a killer integration, one you can’t get anywhere else, to start to pull us in; that’s the ongoing theme across all of these.
Cloud9 is a little bit different; it was an independent company that had been around for quite some time, five or six years, building a browser-based IDE for developers. That was one thing Amazon didn’t have. About a year, a year and a half ago, they made that acquisition: they purchased Cloud9 and spent 12 to 18 months getting it completely integrated with all of the AWS stuff. If you’re a fan of cloud-based IDEs, it definitely makes sense to go look at it. Me personally, I’m really happy with Sublime and I’m sticking with that.
IDEs are like pets, or like your favorite old t-shirt: once you’re really comfortable in it, you don’t want to go wear a different t-shirt.
It’s such an investment because you spend so much of your day inside your IDE, or whatever tool you use to write, debug, and run your code. Being as efficient as possible in it is very important, and becoming really efficient takes time. It’s the 10,000 hours thing; it feels like that to learn all the various keystrokes, shortcuts, tweaks, and hacks that make you that much more efficient and comfortable with it.
Switching IDEs, it’s one of those things. It’s like, “There really has to be a good reason for doing this because I’ve invested so much in my existing tool.”
If they had bought Sublime and turned it into a cloud-based IDE integrated with all of AWS’s services, would it be a different story?
Maybe. I would probably be pretty sad to see it go cloud based. There’s something to be said for being able to pop the lid on your laptop and write some code wherever you’re at, even without a network connection. I’m sure all of these support an offline mode. At the end of the day, the line may end up blurred and you really can’t tell much of a difference, especially with tools like the Atom editor. I’m sure going forward there will be less and less of a distinction; you may not even be able to tell whether it’s native Sublime or cloud-based Sublime.
We’re already used to using Google Docs and seeing each other type on the same screen, yet that’s not the default modality, or mentality, in software development. It’s maybe overdue, I don’t know. I’m interested to see where that goes; that could be a whole rabbit hole we could jump down.
There are so many AWS services; there’s one more on our list, and I’ll just say its name: Kinesis. I don’t know that we wanna spend time on it right now. There are other AWS services on the list too; we can’t talk about them all in two days.
I think the theme that has come out is that AWS services are best when they’re super interoperable with one another and tightly integrated, and that integration turns into magic features. That’s why we’ve happily locked ourselves in, and why a lot of other companies have too.
We expect to see exponential acceleration in software development capabilities in the future because of what AWS is building and doing. Do you have any other closing remarks you wanted to make, Chris?
I think we can absolutely do a part three to this, on services that are core or that we’re interested in but haven’t talked about. We haven’t talked about Route 53 and Certificate Manager; we haven’t talked about ELB either, and that’s super important, a critical piece. We still have a lot of the core, fundamental stuff. There’s CloudFormation, another great topic to talk about. It goes to show how big and full featured AWS is, and it keeps growing.
We’d have to make one of these a week for at least a year, and we’re hoping to do so, so I bet we’ll get to those eventually. Thanks everyone for joining.
Well dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation online. This episode, along with the show notes and other valuable resources is available at mobycast.fm/09. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you. We’ll see you again next week.