Chris Hickman recounts a session at DockerCon titled “How to Create Effective Docker Images” by Abby Fuller of Amazon Web Services (AWS). If you use Docker, you need to create Docker images. What are the best practices from a caching, security, and performance standpoint?
Some of the highlights of the show include:
- Docker images are composed of layers of code; it's a sequence of layers, with each layer corresponding to a command in your Dockerfile
- Start from “Scratch”: When you start with nothing and want a container that is just an executable; equivalent of creating an ISO
- Each layer is individually addressed and cached; the number and arrangement of layers determine the size of your Docker image and how cacheable it is
- Size is a consideration when it comes to downloading a large image
- When you pull an image, Docker caches it based on the tag and layer; arrange it so only the last layer changes when code is changed
- As part of the build process of making a Docker image, fetch your dependencies that are specified and packaged; install dependencies first, and then add your code in a new layer
- Every command in your Dockerfile generates a layer; every “add” or “copy” command in the Dockerfile creates a layer
- The layer concept improves build time for images through cacheability; only the things that have actually changed need to be rebuilt
- How to choose a base image for your Dockerfile; the base image determines the overall size of your Docker image and the security footprint that needs to be managed
- Issues and Considerations: Smaller images mean less code and better security, but they can be much more difficult to work with; still, the smaller the image, the better off you are
- Run a scanning service to minimize issues; scan images as part of your build process
- Remember to periodically check the base image you inherited from to see whether it has changed; keep it up to date and refresh dependencies
Links and Resources:
Rich: In episode 20 of Mobycast, Chris recaps another DockerCon 2018 session, how to create effective container images by Abby Fuller. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.
Jon: Here we go, welcome Chris and Rich, in another Mobycast.
Chris: Hey Jon, hey Rich.
Jon: How’s it going Rich?
Rich: Going good, how are you?
Jon: Good, what have you been up to this week?
Rich: I’m back home in New Jersey at a beach town called Long Beach Island. Currently, I'm in a closet trying to escape my cousin’s children, who are being a little bit loud. I just got off the beach to do this podcast, so I’m pretty stoked and covered in sand and definitely relaxed, so I’m doing real well.
Jon: Keep that sand away from your microphone. New Jersey is one of those places where no matter whether you’ve lived more than half of your life away from it, you still say home when you go there, you still say I’m back home. It’s kind of like that, and maybe California, too.
I grew up in Colorado and I haven’t lived away from it long enough to know the answer to that question for myself and I would still call it home if I lived away from it for more than half of my life. I have noticed that people that move here often after they live more than half of their life here, they stop saying home about whatever place they came from.
Rich: If I’m in Colorado for half of my life, I might do the same. It’s weird for me though, my dad lives in L.A., so home doesn’t really exist anymore. My family is here, and I do say I’m headed home. Although I don’t really, because I don’t have a home to go home to. I sit on couches.
Jon: Right, that’s why it’s sort of striking.
Rich: It’s a nice feeling to say L.A. is my home because I have no ties to L.A. it’s just that that’s where my dad lives now.
Jon: Right, I hope you enjoy your time there. How about you Chris what have you been up to this week?
Chris: Let’s see, this week I have been enjoying the arrival of summer here in Seattle. Now that we passed July 4th, July 5th usually marks the start of the summer season here in Seattle, and it’s here, it’s really nice to see the sun, we get that for the next three months and then it’ll go away again. We’re soaking it up.
Jon: Is September a super nice month in Seattle, too? Is it sort of like the local summer, like all the kids are back in school, but it’s really good temperatures and sunny?
Chris: Yes, September is usually gorgeous, it’s very much kind of like an Indian summer type of thing where there’s usually very little rain, sunny but the sun is starting to get lower in the sky, it’s a really nice time of the year for sure.
Jon: Nice. As for me, I’ve been watching the World Cup and I lived in England in 1998. I studied abroad, and so I have some ties to England, and I was just crushed when they lost to Croatia yesterday. My sister lives there right now, and so that’s yet another tie. It just bothered me that they didn’t win that game, really deeply. Anyway, I have been trying to pick myself up off the floor and do a Mobycast.
Chris: I should mention too, by the way, that the Tour de France is going full steam right now, and I’m into cycling, so it's kind of the same deal as the World Cup. People that really know me personally know I’m into the Tour de France, and every day it’s a 5-hour stage, so it kind of feels like doing a mini Tour de France myself just trying to keep up and watch the coverage.
Jon: I expect that one of these days, you’re going to go to France and you’ll pick one of your favorite tours and ride it.
Chris: Absolutely, definitely a bucket list. I will ride my bike up Mont Ventoux and I will do Alpe d’Huez. Also, Paris–Roubaix is a race I want to go to someday and ride the Hell of the North and the cobbles, and have beer and treats on the side of the road with all the mad Belgians. Yeah, bucket list.
Jon: Cool, so we’re going to continue our series today of rehashing and talking through some of the talks, some of the breakout sessions that you attended at DockerCon. So many good things there, and we might as well get those ideas out to a broader audience and put our spin on them. Which talk was it that you had said you wanted to do, Chris?
Chris: I went to a talk that was given by Abby Fuller, who’s a senior technical evangelist at AWS. She gave a talk on how you go about creating effective container images, and this was a revamp, kind of like a remix, of the talk that she gave last year at DockerCon. It kind of continued that theme. All about just like, “Hey, if you use Docker, you need to create Docker images.” What are the best practices? What are the things you should be thinking about from a performance standpoint, from a cacheability standpoint, from a security standpoint? Pretty interesting talk with lots of practical information.
Jon: This is going to be really interesting for me, too, because we did a Mobycast, I can’t remember which one it was, maybe number six or so. Rich, I don’t know if you have the ability to look that up, maybe you can interject and let us know which one it was––on the same topic. What I’m most curious to find out is what we left out or what we thought of what she did so here we go, let’s find out. Set the stage for us, Chris, please.
Chris: To start with, let's talk about what a Dockerfile is and how Docker images get composed. One of the first things to talk about is layers. What are layers when it comes to Docker images? An important concept when you go about making your Dockerfile to create an image is to understand that it’s built up in these slices of code. You can think of it as a pyramid, or like building up a building one floor at a time, each layer building on top of the previous one.
You always start with some base layer, some base image, and then you’re now making changes to that, deltas to that with each new command in your Dockerfile to create a new layer. This is just like building on top of that until you finish it all up. When you’re done making your modifications to whatever image it is that you’re creating, that’s your last layer. The sequence of these layers, each one of these layers corresponds to a command in your Dockerfile. Again the important thing is usually you’re starting with some base image which has its own layer.
Jon: You said usually, just out of curiosity is it possible to not start with a base image?
Chris: You can, and there's a special base image for that called scratch. Scratch basically says, “I’m starting with nothing,” which is kind of an interesting thing, like, what does that even mean? It’s kind of hard to wrap your head around it a little bit, because you need an operating system. How does that all work?
Personally, I’ve never had a use case where I'd start from scratch, but my understanding is that it's for when all you really want is a container that is just literally an executable, and there are languages and runtime environments out there that are dealing with this case.
I think basically you’re kind of creating an ISO. Rust, I think, is one of those languages where you can say, “Here’s my program, go ahead,” and when you compile it, it gives you a file that is basically a self-running image. For something like that, you would use scratch as your base layer, and in that way you have a very minimal Docker image.
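As a rough sketch of the scratch idea: if you already have a statically linked executable (the binary name and build command here are purely illustrative), the entire Dockerfile can be little more than the executable itself:

```dockerfile
# Assumes "myapp" is a statically linked binary built ahead of time,
# e.g. with CGO_ENABLED=0 go build -o myapp (names are illustrative).
FROM scratch
COPY myapp /myapp
ENTRYPOINT ["/myapp"]
```

The resulting image contains nothing but the binary itself, with no shell, no package manager, and no OS userland, which is why this only works for fully self-contained executables.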
Jon: Got it. Where do we go from here? We understand the concept of a Dockerfile being layers.
Chris: Again, a very important concept to understand is that it's building up these layers. Each one of these layers is individually addressed and cached. This comes into play in a number of ways. One is that the number of layers you have is going to determine how big your Docker image is. It’s also going to determine how cacheable it is, and what happens to that cacheability when you make changes to your Dockerfile.
Docker images can be very small if you’ve built them in an optimal way, or they can be really large if you haven’t really thought about it. I've definitely seen Docker images that end up being over a gigabyte in size for literally just a Python app, just because optimization, or thinking about how to lay it out, wasn’t taken into consideration. It was just the standard “go figure out how to add this program, add this dependency,” and before you know it, it’s over a gig in size.
When you have something like that, and you’re pushing and pulling to a remote repo to store your Docker images, that size comes into play quite heavily, because now someone, say another person on the team who hasn’t worked with the image yet, has to go download a gig worth of data just to get going. If they don’t have a great internet connection, they could be sitting there for a long time.
Spending some time to understand how these layers work, and how they correspond to the overall size of your images, is pretty important. Size is one consideration. The other thing that is really important here is cacheability.
When you pull an image, when you have an image on your local machine running under Docker, Docker’s going to save that away and store that. It’s going to cache it based upon the tag and the layer, if you will. In that case where if we have the one gigabyte Docker image, it may be composed of 10 layers, and maybe nine of those layers end up being about 900 megabytes’ worth of data, and it’s only the last layer that ends up being 50 megabytes or something like that.
If you can arrange it such that only the last layer is getting changed when your code changes, then when you do subsequent pulls, Docker’s going to say, “I already have these nine layers cached, and they match up with their tag, so I’m not going to go download them, I’m just going to load them from cache. I only need to download this new layer that’s 50 megabytes.” Basically, you want to defer things that are changing to be one of the last layers in your Docker image.
That was kind of like one of the key concepts that was pointed out during this talk. It’s something we definitely do all the time at Kelsus with something like node, whereas part of the build process of making your Docker image, you have to go out and fetch all your dependencies. Those dependencies are specified in a package.json file.
One of the tricks related to this is that you want to install your dependencies first, and then add your code in a separate layer. What that does is it means Docker will only go through that expensive dependency-install step if you actually change your dependency list. If instead you do the install after the code step, then whenever your code changes, it breaks the cacheability. Docker now always has to redo that dependency install, even though the dependencies may not have changed; because the step comes later in the process, you broke its cacheability, so that work has to be redone.
Jon: I guess I’m a little confused, is there some way that you’re telling Docker this is a layer, because I guess my confusion comes from, well, don’t you always have to install your dependencies before you install your code? Isn’t there no way to mess this up?
Chris: Yes, you do tell Docker what's a layer, by virtue of the fact that every command in your Dockerfile generates a layer. That’s kind of an important thing to understand.
Jon: I guess that’s what I was getting at. If you have code dependencies, they’re very definitive––the word itself, dependency, sort of means do this first, and then you do whatever it is you need to do. Isn’t it just sort of something you don’t even need to consider? Of course dependencies are going to go in first.
Chris: It’s a little bit more subtle once you think about it from a layer standpoint. You use ADD and COPY in your Dockerfile to say, when I build this Docker image, these are the files I want to copy over into the actual image itself to be part of it, and each of those commands creates a layer.
Jon: You could potentially put some COPY command or ADD command that puts files representing dependencies into the Docker image after you put in the files that represent your code, and that would be backwards.
Chris: Right. The idea here is that your dependencies are changing much less frequently than your actual source code. So as you build up your Dockerfile, you would first do something like, “Hey, add my dependency spec file.” I’m going to add package.json, and then I’m going to say, go install all my dependencies. Now I’ve created two layers: one for the package.json file that I added to the file system, and another from running the npm install. Then for my third layer I can say, “Now do another ADD command.”
Now I can go add all of the rest of the source code, which may be in the same place as package.json. I could have done the equivalent of add *.* at the start, but instead I said, “No, I’m going to just add package.json in that first layer, and I'll come back and add the rest of it later.” Now I can do the add *.*.
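The ordering Chris describes can be sketched as a Dockerfile like this (the base image tag and entry-point file name are illustrative, not from the talk):

```dockerfile
FROM node:10-alpine            # hypothetical base image and tag
WORKDIR /app

# Layer: dependency spec only; this changes rarely
COPY package.json package-lock.json ./

# Layer: expensive install; re-runs only when the spec above changes
RUN npm install

# Layer: the rest of the source; code edits only invalidate from here down
COPY . .

CMD ["node", "server.js"]      # server.js is a placeholder entry point
```

With this layout, an edit to application code rebuilds only the final COPY layer; a fresh npm install is triggered only when package.json itself changes.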
Jon: Now, I get it. Thank you. That’s what I needed.
Chris: Almost every application language out there has this kind of concept; Python has the same thing with pip install and its dependency list. It’s a very common technique that really pays off: it really improves the build time for your images by giving you good cacheability, so you’re basically only rebuilding the things that need to be rebuilt, the things that have actually changed. Definitely a good technique to use.
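As a sketch, the Python equivalent of the same trick, using pip and a requirements file, looks nearly identical (the base tag and file names are illustrative):

```dockerfile
FROM python:3.7-slim           # hypothetical base image and tag
WORKDIR /app

# Dependency spec first, so the install layer stays cached
COPY requirements.txt .
RUN pip install -r requirements.txt

# Application code last, in its own layer
COPY . .

CMD ["python", "app.py"]       # app.py is a placeholder entry point
```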
Jon: We’re at the risk of running over a little bit, but I think we should keep going, I think this could be helpful for people that are working through getting their head around how to do best practices with building Docker images, but maybe you can take a quick pause to say what are the pieces we have left to talk about.
Chris: Yes, definitely one of the next big topics would be the base image: what base image should I choose to start off building my Docker image? This is actually one of the first choices you have to make when you want to create a Dockerfile: what is my base image?
There are considerations there that we can talk through. We can talk about multistage builds, which is an interesting and important topic as well. There are some other issues like security considerations, and cleaning up, making sure things aren’t just left lying around: how do you do garbage collection on all the artifacts that are created as part of this?
I think for this session, let's definitely talk about base images: what are the considerations there, and how do you go about choosing one? Then maybe next time we can talk about multistage builds and some of the other things around this. Those are pretty interesting topics as well.
Jon: That sounds good. I remember that we talked quite a bit about base images in that previous conversation we had on essentially the same topic. Rich, did you happen to get a chance to figure out which number that was?
Rich: I think it’s episodes six and seven, How to Create Docker Containers, parts one and two.
Jon: Okay cool, and we talked a lot about base images. We talked about how to choose between a base image that hardly has anything on it versus one that has a ton of dependencies already on it––maybe you’re using Django, or maybe you’re using Rails, and there’s a bunch of stuff that you need in there––and then we also talked about building your own custom base images that have some of the libraries you might want pre-built. Is that some of the same stuff that Abby discussed in her talk?
Chris: Yeah, absolutely. I think what we talked about then definitely matches up with the advice that Abby had as well. Your base image is going to determine a couple of things. One, it’s going to determine the overall size of your Docker image, it’s also going to determine the security footprint that you need to manage and be concerned with.
The bigger the image, the bigger your footprint, and the more you have to worry about holes, security issues, and whatnot, because you just have more software running.
There is the tradeoff, and this is one of the things that she went into as well. We can all say, just go use Alpine. If you measure it, the base image size for Alpine ends up being four megabytes versus something like 80 megs for a bigger distribution, so 20 times the size. Given that smaller size, there’s just less code in there, so it’s much more secure, because there’s less code to be exploited. But it comes at a cost, which is that it's much more difficult to work with. You may need to install a different package manager, you may need compile tool support, you may need other libraries installed and whatnot, so it becomes much less developer friendly.
You also have to think about the security footprint: it’s better because it’s smaller, but maybe there’s actually software I have to add back, which eats into that advantage. There are lots of tradeoffs to consider, and it’s just important to know what the issues and considerations are. In general, the recommendation is: the smaller your base image, probably the better off you are, unless you have good reasons for doing otherwise.
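For instance, here is a sketch of what “adding tooling back” to a minimal base can look like on Alpine (build-base is Alpine's meta-package for compilers and headers; whether you actually need it depends on your dependencies):

```dockerfile
FROM alpine:3.8
# Minimal bases often lack compilers, headers, and common libraries,
# so dependencies with native code force you to install them yourself:
RUN apk add --no-cache build-base
```

Every line like this adds size back and widens the footprint you are now responsible for patching yourself.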
Jon: I guess one of the things that occurs to me is that, especially in terms of the security piece, that you might go get a small base image, but then if you have to add a bunch of libraries and other packages to it, then all of a sudden, you’re responsible for the overall security of all the software that you put on your base image.
Whereas if you get a really popular base image that has more of what you need on it, if there turns out to be a security problem with that base image, you might find out about it from the world wide web, as opposed to finding out about it from, “Oh gosh, I shouldn’t have put that package or library on my image after getting this more secure base image.” That sort of safety in numbers in a way.
Chris: Yeah, for sure. Especially if it’s something that has lots of eyeballs on it, lots of folks using it, a very popular distribution, then there is definitely some safety in numbers there. It means it’s much more likely that something is going to be caught and then fixed, as opposed to that burden being moved to yourself. You can mitigate this quite a bit by having a scanning service running that checks all the packages against a CVE database, whether you use Docker Trusted Registry, or something like one of the tools from Aqua Security, or whatnot.
There are plenty of ways to go scan your images, and that should just be part of your build process, so you'd be alerted to those kinds of things. It's about tradeoffs, and understanding what responsibility you’re assuming versus someone else. For me, it’s also kind of dangerous and scary to say, “I’m going to put a lot of faith in someone else.” That’s why, if it’s a package or distribution that has lots of eyeballs on it and is basically coming from a source that I really trust, then I have to worry less about it. But when you get down to the dependency level, with the ecosystems that exist for things like Node and Python, just understanding what it means when you install some package is something to be aware of.
Jon: I think that we’ve sort of often taken the route of, let’s get something that has quite a few of our dependencies on it, and then essentially build our own base image from there with the additional things that we always use, and then maintain our base image after that.
Chris: Yeah, and I think we talked about this on a previous episode, but absolutely. If you are building more than one application, so you have more than one Dockerfile, and you have a common way of building out your applications, if you have your own standard microservices framework with all the various standard pieces of infrastructure that go along with that, things like logging, instrumentation, and setting up routes, endpoints, listeners, port configuration, or whatever, then absolutely, go and apply those best practices to build your own base image, and then just inherit off of that, as opposed to making each team go and figure it out for themselves. It’s definitely a very good technique: do it once and, in the spirit of reusability, now all your projects use that as their base image instead of reinventing the wheel each time.
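Concretely, the pattern is a shared base image built and pushed once to your own registry, with each application Dockerfile inheriting from it (the registry path, image name, and tag here are hypothetical):

```dockerfile
# An application Dockerfile inheriting a shared, internally maintained base
# that already bakes in logging, instrumentation, and standard configuration.
FROM mycompany/node-base:1.2   # hypothetical image in your own registry
COPY . /app
CMD ["node", "/app/server.js"]
```

Each team's Dockerfile then only has to describe what is unique to that application.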
Jon: But then at that point you’ve taken out a bit of a technical loan, you’ve got some technical debt going, because if you don’t remember, at least once a quarter, maybe twice a year, to go look at the base image that you initially inherited from to see, “Has this thing changed? Do I need to rebuild this with a new version of the base image?” then you could be in bad shape.
Chris: Absolutely. Software is hard work, isn’t it? There’s definitely no silver bullet here that makes things super easy. It’s always tradeoffs and finding the right mix, but absolutely, you can’t let your base image rot into technical debt. You need to keep it up-to-date, so you need someone responsible for making sure it has all the latest patches and that the dependencies are being refreshed accordingly. You may then have issues with backwards compatibility, and some of those changes may break some of your applications, especially if they haven’t been changed in a while, so there are lots of things to consider.
Jon: Is there one last note from Abby, or anything else she brought up on base images that we didn’t talk about that’s particularly important, or did we cover it?
Chris: I think as far as base images go, I think we’re good there.
Jon: Cool. Of course, we have those other topics that we can hit next time. We’ve run close to half an hour, and by now our listeners are sitting in a garage somewhere trying to get into work, so let’s let them get to their jobs.
Chris: Alright, sounds good.
Jon: Thank you very much, Chris. That was really informative and thanks for joining us Rich.
Rich: Alright, thanks guys, all right, see you next week.
Chris: See you.
Rich: Dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with the show notes and other valuable resources is available at mobycast.fm/20. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we’ll see you again next week.