Chris Hickman and Jon Christensen of Kelsus discuss the company’s CI/CD toolchain – continuous integration (CI) and deployment (CD) – when it comes to containerization. Your CI/CD pipeline will look completely different after being dockerized.
Some of the highlights of the show include:
- Docker influences the whole lifecycle of software from development to testing, deployment and production.
- Continuous Integration & Continuous Deployment (CI/CD) are critical processes in modern Agile software development, where software is updated and deployed very frequently. CI/CD addresses the problems of multiple people working in the same code base who need to efficiently test code changes and deploy updates, by automating many or even all of those steps.
- The high-level steps in the dev-deploy process are: Write code; make changes; test; commit to source code repository; build artifacts; and distribute artifacts.
- Kelsus prefers to automate as many of these steps as possible because it’s faster, saves developer time, allows more frequent delivery of value to users, and provides repeatability and consistency of builds.
- Continuous integration means that your coders are very frequently (daily or multiple times per day) committing code changes back into shared branches in the source code repository. Integration is no longer a ‘phase’ at the end of a waterfall process.
- Kelsus uses a continuous integration system to build Docker images each time the code is updated, and to run automated tests on the new Docker image.
- Kelsus uses CircleCI because it integrates nicely with GitHub. Every time a developer commits code and pushes it to the remote (shared) repo, a webhook invokes a CircleCI build-and-test cycle. CircleCI builds a Docker image with the new code and then runs all of our automated tests on that new image.
- We configure CircleCI to publish the Docker image only if all the tests pass. If tests fail, the image is discarded, and a notification is sent to the team so they can fix the broken build. This reinforces our process by making it difficult for developers to deploy if the tests fail – there is no Docker image to deploy!
- Robust automated tests are an investment that will pay off over time – with compounded interest. Tests improve quality, they help new team members get up to speed quickly, and they make it possible to add new features without breaking other things. As your software grows, the ‘safety net’ of automated tests is absolutely essential. Without them, you’ll reach a point where it’s nearly impossible to add new features to your system, and your velocity will approach zero.
- It is important to have strict discipline about failing tests: the issue must be fixed before deploying. If you ignore failing tests, then they quickly lose all their value, and you might as well not even run them.
- How much automated test coverage is enough? This is a philosophical question that people like to debate. We believe you should have a high percentage of your code base covered by unit tests, but you also must be practical. It might be very difficult and costly to go from 95% to 100% test coverage. Poorly written tests can be brittle and require a lot of maintenance over time. One strategy is to start by covering the most important code paths with automated tests and then to incrementally increase test coverage over time.
- UI testing is more difficult to automate. The tests are often brittle and require more investment to maintain. It’s important to be pragmatic to make sure you’re getting the ROI out of your tests.
- CI systems like CircleCI can also be configured to generate test coverage reports so you can see how much of your code base is covered by the tests.
- In what situation is it not worth the investment to build a CI/CD pipeline? For example, if you’re a startup building a proof-of-concept or MVP, you can build faster and cheaper without automated tests. In this case, you need to explicitly acknowledge the lack of CI/CD as a technical debt that will need to be paid off if and when you get traction and need to build scalable software.
- Before committing code changes to the repo, developers should run the same tests in their local environment that will later be run by the CI system on a clean build machine. We recommend using static code analysis tools like Lint as part of the build process to enforce coding style rules and, for dynamic languages, to find mistakes like typos.
- Kelsus does code reviews as part of our development process. Before code is committed to the remote repo, another team member must review the code and make recommendations for improvements. This improves quality and it’s a great way to spread knowledge.
- CircleCI uses a ‘clean’ machine to build the software and run tests. This machine isn’t polluted with any other software, so the build depends entirely on what is defined in the source code repository. If there are any missing dependencies, the build will fail.
- CircleCI starts up Docker containers to build your code and test it. The application under test runs inside a container that CircleCI built, as do the tests.
- CircleCI can recognize languages and intelligently use convention to know how to find your tests and run them. Alternatively, you can use configuration for greater control over how to run tests (e.g. with Mocha or Jest) and how to create coverage reports.
- What if your tests depend on an external system like a database? This invokes another philosophical debate. Some people insist that all database interaction must be mocked out for true unit tests. Chris prefers using a real database that is seeded with a known set of test data. It could be populated with a snapshot of production data, or a simple set of test data. If you go this route, you can launch the database as a separate Docker container in your test configuration.
- Continuous Deployment (CD) is about automating the process of publishing your Docker image to a repository (such as AWS ECR or Docker Hub) and then deploying your software to a test environment or production environment – only after all the tests have passed!
- Chris recommends two prerequisites to have in place for Continuous Deployment: (1) good integration test coverage and (2) built-in support for quick and easy rollbacks.
- Kelsus uses CircleCI to automatically deploy to Dev & Stage environments, but leaves Production deployments as a manual process.
- CircleCI can run branch-specific actions, based on what branch code was committed to in your Git repo. Kelsus uses the Gitflow model, with branches for each Feature plus Development, Staging, and Master branches. Changes to Dev and Staging branches will auto-deploy to their respective environments if all tests pass. This branching model is the foundation of our CI/CD process: each automated action is triggered by a code commit on a particular branch (Dev, Staging, or Production).
- Database changes add complexity to Continuous Deployment, but backward-compatible database changes can still be easily automated by CI/CD tools like CircleCI. Many languages have ORM tools that help by building database migration scripts that can run with your deployment: for example, SQLAlchemy (with its Alembic migration tool) for Python or Sequelize for Node.js.
- Database changes that are not backward compatible are much more complicated, requiring updates to both code and database to be made in a careful sequence of discrete steps. We don’t know of any tool that automates this process. If you know of one, please tell us!
- Kelsus has a different process to support hotfixes in Production. For a normal change, deployments go from a feature branch, to development, to staging, and if all tests pass, then finally to production. For a hotfix, changes propagate in the other direction: start with production and then merge the change back down to staging and development branches.
- How much of my budget should I expect to spend on all this CI/CD stuff? Chris estimates that it consumes 5% to 10% of team time & budget. But remember that this investment pays huge dividends as your system and your team grow in size and complexity. In fact, without that investment early on, you’ll reach the point where you just can’t move forward: at 100,000 lines of code there’s no question that CI/CD is absolutely necessary. Without it, you’ll spend most of your time fire fighting and no time adding new value to the system. In the long run, the CI/CD investment will reduce development costs.
- One thing you definitely can’t automate is how well the app conforms to a spec. With robust automated tests, the tests ARE the spec.
- Some companies are using ‘canary releases’ to leverage end-users to achieve that final bit of integrated test coverage. They deploy an app update to a subset of their user base, monitor closely for errors, and collect feedback. If the results are good enough, they extend the release to a larger group of users.
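The canary-release idea in the last highlight can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 1% error threshold, the doubling step, and the choice to roll back to 0% on a bad result are all hypothetical and do not come from the episode.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group.

    Hashing the user id gives each user a stable bucket from 0 to 99,
    so the same user always sees the same version during a rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def next_rollout_percent(current_percent: int, requests: int, errors: int,
                         max_error_rate: float = 0.01) -> int:
    """Decide how far to extend the release after monitoring the canary.

    Assumed policy (not from the episode): roll back to 0% when the
    error rate exceeds the threshold, otherwise double the audience.
    """
    if requests == 0:
        return current_percent  # no data yet, so hold steady
    if errors / requests > max_error_rate:
        return 0  # results are not good enough: roll back
    return min(100, current_percent * 2)
```

A rollout might start at `next_rollout_percent(5, requests=1000, errors=2)`, which extends the release to 10% of users under these assumed numbers.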
In Episode 4 of Mobycast, Jon and Chris introduce me to Continuous Integration and Deployment. Welcome to Mobycast, a weekly conversation about containerization, Docker, and modern software deployment. Let’s jump right in.
Today I thought we’d talk about our CI/CD Pipeline that we use at Kelsus. The reason I thought we could talk about this is because I’ll be attending Gluecon in May, which is a conference in Colorado that is targeted at people who are interested in APIs and in containerization. It also has a fairly heavy blockchain slant all of a sudden. It’s basically put on by a guy named Eric, and it covers whatever he’s interested in. He’s a VC. He started it several years ago when APIs were a hot thing: we’re going to API-ize the world. And then containerization got hot, and then serverless got kind of hot, and now blockchain is pretty hot, but he keeps it all organized in an itinerary with different tracks that you can go through. Kelsus will be talking about our CI/CD Pipeline in the containerization track.
We don’t have a planned talk yet, so I thought today we could just talk about what our CI/CD Pipeline is and work together to figure out what we will talk about. Maybe what comes out of our conversation will be useful for preparing for that talk later, making some slides and things like that.
Before we jump in, I guess a little intro. We’ve been talking about how Docker’s not just a technology choice for running software in production. It’s not just, “Oh, let me Dockerize that and put it in production.” That’s not really the goal. In the last three podcasts, we’ve talked about how Docker really touches the whole life cycle of software development, from dev to test, demo to production, and that it influences decisions that you make in every part of that, so your CI/CD Pipeline before Docker is going to look entirely different from your CI/CD Pipeline after Docker.
I think I just want to hand it over to you, Chris, and you can take a shot at giving a high-level overview of what we do, and then we might start diving in. You can talk about what we do at dev, what we do for automated testing, and then what we do for deployment. Just kind of a high-level overview, and then we’ll figure out what makes sense, get in there for each of those, and go a little deeper. Go ahead, Chris.
At the highest level, it’s the age-old problem: a developer writes some code and needs to get it tested, and then somehow get it out there into the hands of users, whether it be shrink-wrapped software or cloud-based software or whatnot. What does that whole process look like?
It’s definitely evolved over the years as the tools and technology have gotten better and more sophisticated, as the community has focused themselves more on automating more steps and placing resources and efforts in those types of areas. It’s definitely matured.
Again, there’s a wide range of possibilities there, but the major steps along the way are: as the developer, I’m writing code, I’m making changes, I’m testing it myself, assuring myself that this is good, solid code. I now want to include this in the product, so I’m committing it to my source code system. After that I need to build artifacts. If it’s compiled software, it’s producing binaries. If it’s a dynamic language, it still may need to pull in dependencies and it may need to go through stuff like transpilers, to go from an intermediate language to a first-class language. Once these artifacts are built, you need to get them distributed. That’s the deployment part.
Definitely, there are phases along the way. You need to build your code, you need to test your code, and then you want to deploy your code. That’s the overall process framework. You can automate as little or as much of that as you want.
We definitely have gravitated more towards automating as much of that as possible, and there are lots of benefits. One of the obvious benefits is that it’s faster. It doesn’t waste developer time to have machines do this stuff. That’s a big factor. Also repeatability and consistency of builds. This is something that we can go into in more depth as well, as it’s definitely one of the major advantages of having a repeatable build process: you know you’ll always have a clean slate when you’re making these artifacts, and you don’t have to worry about it being polluted with perhaps inconsistencies on one machine versus another one. Having a pure state as you build these artifacts is pretty important.
Just to quickly jump in, I realized that you’re talking about automating these pieces of the build process and the development process, and I think that’s really the ‘C’ of CI/CD, Continuous Integration and Continuous Deployment. Making it continuous is automating it.
That’s definitely a big part of it. It’s also part of the philosophy of how you actually build code. Before the term ‘continuous’ really came into the common parlance, a lot of times what would happen is teams would be isolated; separate teams would spend a lot of time working on their codebase and their individual features. Perhaps one team was working on this set of APIs, another team was working on consuming those APIs. They would all work independently.
They’d have their own source code repositories, potentially, and it’s more like the waterfall process. First we’re going to do our architecture, then we’ll do design, then we’ll do our code, then we’ll do our testing, then we’ll do our integration.
That integration part, that’s where a lot of folks realized that this was causing a huge problem. You develop these things, you make some assumptions, and if you only verify those assumptions when you’re ready to integrate, it leads to a whole bunch of problems. The idea of continuous integration is that you’re constantly in that integration process, you’re constantly committing your code back into the main repositories. You get rid of that separate phase of integration, if you will. Yeah, you have to have the automation in order to support that.
That was a bit more of a tangent than I thought it would be. You just described at a high level what it is, the three main things that we’re automating. We’re automating what developers do, what happens in test, and what happens in deploy. And then you described that doing that continuously is the reason that there’s the ‘C’ in front of the CI/CD Pipeline. But then I wanted to get into what is Kelsus doing? What are our tools of choice and what is our process of choice for doing this?
We definitely have our tools and process in place for each one of the steps. It’s obvious that we’re definitely behind Docker, and Docker is definitely a big part of our toolchain. For building our Docker images, which, at the end of the day, are the build artifacts that we produce, we are using a continuous integration system.
There are lots of them out there, but the one in particular we’re using is CircleCI. There are plenty of other ones out there, some open source as well as many paid ones. We really do like CircleCI. It integrates very nicely with our source code repositories, which are on GitHub, like most of the rest of the development world out there. When developers make commits to their code and push them to the remote repo, the shared repo, there’s a webhook that CircleCI integrates into, so it knows that a change has been made, and that causes it to kick off a build process. You can have a description file that tells CircleCI how to go build your artifacts and it will do that. For us in particular, in addition to building the artifact, that’s where we run our unit tests as well, in an automated fashion.
After it goes through, it builds the Docker image for us. It then will run the suite of unit tests, and if those unit tests pass, then that image is stored in our repository and is now available to be deployed. If it fails the unit tests, then we receive a notification that this is a bad build, and that means we now have to go and look and figure out, “Okay, what happened here? Why did it fail the build process? Why did it fail our testing?”
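The gate described here (build the image, run the tests, publish only on success, notify on failure) can be sketched as follows. This is a minimal illustration, not CircleCI’s API: `run_build`, `BuildResult`, and the `myapp` image name are hypothetical, and the test suite is passed in as a plain function so the sketch stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class BuildResult:
    # Tag of the image that was published, or None if nothing was published.
    published_image: Optional[str]
    # Messages that would be sent to the team.
    notifications: List[str] = field(default_factory=list)


def run_build(commit: str, run_tests: Callable[[str], bool]) -> BuildResult:
    """Build an image for a commit and publish it only if the tests pass."""
    image = f"myapp:{commit}"  # stand-in for building a Docker image
    if run_tests(image):
        return BuildResult(published_image=image)
    # Tests failed: the image is discarded, so there is nothing to deploy.
    return BuildResult(
        published_image=None,
        notifications=[f"Build failed for {commit}: tests did not pass"],
    )
```

Because a failed run never produces a published image, the rule "no deploys with failing tests" is enforced by the pipeline itself rather than by convention.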
I think Rich had a question about that failed part.
Yeah. I guess to make the question a little broader: what kind of tests are you putting in there? What I’ve seen from a few projects is that there’s an inevitability of tests failing, and it seems that depending on the type of failure, sometimes people will push anyway, because maybe it was really an opinionated thing that wouldn’t really break anything. Could you talk a little bit about what happens when something fails, especially when you’re on deadlines?
That’s a great question. As far as what do you test, what does that include? It definitely runs the gamut. Having good tests as part of your process is a very difficult thing to do. It’s also one of those things that, for better or worse, most developers just don’t like doing. But they are truly valuable.
Having good test cases, spending the time to do them is like an investment. You’re building up your bank account that’s going to have compounding interest for you in the future. By having good test coverage of your codebase—as your code gets bigger and it gets more complicated—those tests will pay off in dividends because it’s going to alert you if you have these side-effects in the code, especially if you have new developers work on it.
It’s one thing if you have the same developer that’s spent a year on the same codebase, but if you start bringing in new team members and it’s a large codebase of 100,000+ lines of code, they’re going to be concerned about the code they’re changing, like, “What are the other side-effects? Am I breaking anything else?” Having good test coverage in there gives a lot of peace of mind, just knowing that the change you’re making is going to work the way that you expect it to.
It’s definitely a very big topic to say, like, “What amount of testing is necessary?” It’s definitely a big topic, too, like, “What types of tests do you want to have?” You need to have a core set of test coverage for your codebase. If you are writing tests for your codebase and any of those tests fail, then you have a big problem. Your build has failed. You cannot deploy that until you actually fix those problems.
I think it may be confusing sometimes. People may have tests that are brittle and leading to some of the issues that you talked about there, Rich, where perhaps they don’t really reflect true errors. It may be that something else changed, so now the test is bad. Tests can go bad, tests can get out-of-date. You can actually have a technical debt issue with your test cases: if you change some code so the code works differently now, and that is by design, well, then you have to go change your tests. That sometimes ends up becoming a big problem, especially on projects that have been around for a while.
I think I can pull this together a little bit because one thing that I know that Rich is getting at is a project that he and I worked on a long time ago, where Kelsus actually took this project over from another company. When we started, one of the first things we heard was, “Oh don’t worry about those tests that are failing. Just ignore that.” I think what we’re getting at and what we’re saying is that is a non-starter. Get them out of the test suite, fix them, do something, but it’s never okay to say, “just ignore those failing tests,” because you’re already starting in just the worst position possible. You might as well not even have them. What’s the point of having any tests if you have some that are failing that you totally ignore?
Because then, somebody that’s new that doesn’t know the project very well, heard someone else say, “Oh just ignore those failing ones,” then maybe way back, there were three specific ones that were failing and all the other ones which were supposed to work. The new person doesn’t know that there were three specific ones that were supposed to fail, and she sees six of them failing. It’s like, “Oh yeah, but some of these are supposed to fail.” Now you got six failing and there are only supposed to be three, and like you’re playing operator, nobody really knows what’s supposed to pass and it’s just not okay. You can’t work like that, not professionally in my opinion.
Yeah, and I think that it brings up this broader question: do you start with fewer tests and not make assumptions about what you need to test, and then, as things become more complex in your codebase, start introducing new tests? Or are you proactive, saying, “Maybe it’s through our own experience, or the fact that we as developers over-engineer things, that we should anticipate this breaking and therefore build it into the tests.”
If you ask 10 different people, you’ll probably get 10 different philosophies. Me, personally, I’m pretty much a fan: tests are very valuable. I do like to see as much test coverage as makes sense. I’m not saying that you have to have 100% test coverage because a lot of the times that extra 5% is perhaps silly stuff like configuration files or loading and descriptor files, something of that nature. You have to use your judgment.
When you do have test code running, you want to have insight into what coverage looks like, so you definitely want to have code coverage tools. You want to have reports that are generated every time your tests are run to show you exactly what your test coverage is: what parts of your code did get tested and what percentage is not being tested. That’s your code coverage rate. You want to look at that, you definitely want to eyeball that, and ask yourself, “Are the most important code paths being tested, and do I have tests for them?”
If you’re in a situation, a legacy system, where you have test code that you’re not really sure what state it’s in, or there are lots of failing tests, then one of the obvious things would be to just come in and say, “Just rip them all out, burn it to the ground.” I’d rather start from scratch and say, “We’re going to write some good tests. We’ll start off with 20% test coverage.” That will be an ongoing work item: to increase that test coverage, have very good tests, be very mindful and have insight into what parts are actually being tested to make sure that the most important parts are being tested, and then iterate from there.
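One way to picture the "start at 20% and keep increasing" approach is a coverage ratchet: the build records the best coverage seen so far and fails if a change drops below it. The sketch below is an assumption about how such a check could look, not a description of any particular coverage tool; the function names and the line-number sets are hypothetical.

```python
from typing import Set, Tuple


def coverage_percent(executable_lines: Set[int], executed_lines: Set[int]) -> float:
    """Percentage of executable lines that ran at least once during the tests."""
    if not executable_lines:
        return 100.0  # nothing to cover
    covered = executable_lines & executed_lines
    return 100.0 * len(covered) / len(executable_lines)


def ratchet(current: float, baseline: float) -> Tuple[bool, float]:
    """Return (passed, new_baseline).

    The check passes when coverage did not drop below the baseline,
    and the baseline only ever moves upward.
    """
    if current < baseline:
        return False, baseline
    return True, current


# Example: 2 of 5 executable lines ran, against a starting baseline of 20%.
percent = coverage_percent({1, 2, 3, 4, 5}, {1, 2})
passed, baseline = ratchet(percent, 20.0)
```

The percentage alone does not say whether the most important code paths are covered, so a report listing the uncovered lines is still worth eyeballing.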
And to bring us back to the Kelsus CI/CD toolset. The reason that I thought this was a good time for you to ask that question, Richard, is that Chris had just finished saying that CircleCI is going to run the tests and it’s either going to make an image to put into the repository or it’s not. That’s the technical component of this that enforces the process.
It would be one thing if we said, “No failing tests,” and that’s just our rule, and it’s another thing if we actually automate that rule, and that’s what I think CircleCI helps us do. Please correct me if I’m wrong, but I believe that if tests fail, we don’t get an image on the other end that we could end up deploying to staging or production. Just nothing, there’s nothing you can do. The tests failed, your build failed. Better go fix them if you want an image that you can do something with. In that way, there’s no manager saying, “See this or that.” It’s really just the software itself saying, “Oops, go fix something.”
Correct, yeah. The way that we’ve set up the process now, it’s very difficult for someone to deploy a build that has failing tests. That said, people can always turn off the tests, so there are ways around that, even though it’s baked into the process. It’s really about how you want to conduct yourself as an engineering organization. Just realize that if you start going down this path of making these exceptions and saying, “Oh, we’ve got to have this done quickly and we know that there are issues in it, so we’re going to circumvent our process,” then you’ll have to start asking yourself some hard questions, like why even have the process to begin with if I’m not going to follow it.
There are tremendous advantages to having that process in place. It makes more sense when you have a team of more than a couple of people. If it’s just one or two developers, then you may not want to have a more defined process like that, although, for me personally, I would still be a strong advocate of it. I don’t think that there’s a reason why you can’t adhere to some of the basic things, to say like, “I’m going to have repeatable builds, and I’m going to have automated tests, and I’m going to have my test cases pass, and do my deploys.”
Yeah, just about the only case I can think of where there’s a really good business case for not doing something strict like this is, say, you got into Y Combinator, you’ve got three months to produce something that does something, and you know you’re going to get funding after that, if you’re successful, to go do it the right way. Then maybe don’t write any tests in those three months. You just hammer out some code and try to get something working, and you might be able to go a little faster, if you know what you’re doing, than if you were to do everything the right way.
I think though in that little, tiny, niche of software where you got very, very little budget leading to more budget in a rebuild later, that might be the one case where I would suggest against doing this.
Absolutely. Call it a prototype or something like that. Something where you do have to go really, really fast. You’re not really engineering. This is just a proof of concept, you’re trying to demonstrate an idea, and you do need to go fast, but knowing that this is not the final solution.
That actually brings up a point that I think I’m a little bit confused on. I don’t want to derail this conversation but really quickly, I have this picture in my head: you’re writing code locally and then you’re pushing that to a repository, and that’s going to set off a webhook that’s going to kick off a build on the CI, like CircleCI or something similar. But even when I’m running or writing code locally, there are a lot of things that I could be doing and probably should be doing that are also running tests. I could literally be writing my code with test-driven development, and a lot of the packages that I install for my dev suite are going to run some tests. Is the CI all of that, or are we talking about just that one thing that exists? It’s maybe not technically middleware but something sitting in between my server and my repository? Or is it this whole entire process including all the packages that I install, like JSLint and Codeception and all those other things?
There’s probably a couple of different philosophies here that they all fall under the umbrella of what we are calling CI/CD Pipeline, but it really is at the basic level it’s what is your engineering process? How do you go about conducting yourself? Again, the continuous integration part of it that’s less tooling and more process. It’s really about what is your development philosophy? How are you actually committing to your repos and collaborating with everyone else on a larger software project? If people are committing frequently and syncing up and merging into a shared branch often, like basically daily, then that’s definitely a big part of continuous integration.
You could be using CircleCI or a tool like CircleCI. You know the CI stands for Continuous Integration, but you can actually use it in a way that you’re not going to use continuous integration whatsoever. You can just use it as a build machine.
The individual developers, as they’re developing code on their machine, absolutely should be running those exact same tests that the build machine is going to be running. It really should be very rare that tests fail on the build machine because they already ran them on theirs. Now, it may be that on the build machine it pulls in other parts of the project and maybe it even starts getting into integration tests for you. You’re testing the interfaces between multiple subsystems of your code, and then problems may surface, but the actual core unit tests should already have been run by the developer on their machine.
Things like lint tools, those should be baked right into the whole build process as well. The developer (in this case we’re using Docker) is going to be building the Docker image locally on their machine to do the local testing, to run it, and to verify that things are working. As far as building that Docker image, it should be going through a lint process.
Things like code style should be enforced, and the silly typo-type errors should be caught, especially for the dynamic languages that aren’t compiled. All that stuff will be caught locally before the commit is actually made and merged.
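To make the "typo-type errors in dynamic languages" point concrete, here is a toy static check written with Python’s standard `ast` module. It is a deliberately simplified sketch, not how any real linter works: it ignores scoping rules entirely and just flags names that are read but never defined anywhere in the file. The function name `undefined_names` and the sample snippet are made up for illustration.

```python
import ast
import builtins


def undefined_names(source: str) -> set[str]:
    """Return names that are read somewhere but never defined in the source.

    Simplification: scopes are ignored, so this only catches names that
    are not assigned, imported, or declared anywhere in the file.
    """
    tree = ast.parse(source)
    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, ast.alias):
            defined.add((node.asname or node.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
    return used - defined


# The typo "itme" would only fail at runtime, and only if this path runs.
SNIPPET = """
def total_price(items):
    total = 0
    for item in items:
        total += itme.price
    return total
"""

print(undefined_names(SNIPPET))  # {'itme'}
```

The snippet is never executed, which is the point: the mistake is found by reading the code, before any test or user hits that line.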
We haven’t talked about other parts of the process like code reviews and having peers review your code. After you’ve done your local development and testing, then you have your peers look at the code and eyeball it to see if they can find errors or learn from it. They may know about other parts of the system that you didn’t and have advice for you. They may have experience to share and they may have different suggestions for improvements to the code, so that code review process is part of that.
All of that happens before you can even commit the code to the remote repo, which then kicks off the build process with CircleCI. There are still a lot of steps before you can get there, and for sure a lot of this is duplicated. The build machine in the CircleCI process should be reinforcing that; it should be duplicating a lot of the stuff that the individual developer did. But now it’s being done on this Switzerland of a machine. It’s not tainted, it’s fresh, you don’t have to worry about it being polluted with having some old software on it or some additional software put on it. It’s a fresh, neutral machine and it’s going against exactly what’s been checked into your code base, and it’s going to basically guarantee you that everything that needs to be checked in has been checked in. The code is solid, it’s passing tests, and that’s why you need that machine, to go build your code and publish that as an artifact.
One thing that I don’t know, because I just haven’t been doing much development since we’ve taken on this process, is this: CircleCI is going to use a build machine (you were just talking about it, this Switzerland build machine), and I know that eventually our goal is to get the application that we’re developing, or applications or microservices, onto a staging environment and then a production environment.
And then based on our conversation last week, the staging environment and production environment are actually ECS clusters, and there are probably databases that are on RDS, which is another AWS tool. Those are also big environments that, at least for the moment, we’re still setting up manually. We’d like to get to the point where we’re using CloudFormation or Terraform to set those up, but I think we’re still setting them up manually.
My question is, for CircleCI what is it using? Is it building on some other machine and then deploying to one of those environments to run its tests or is it doing everything on a single machine? What is the relationship of what CircleCI uses to those environments that things are going to eventually end up on?
CircleCI itself today uses Docker, so it’s actually spinning up Docker containers in which to build the code and to test it. You can actually set it up to work that way. In the past it had its own custom container system, and I’m sure before then virtual machines were the way to go, because obviously a company like CircleCI is not going to give you a dedicated, wipe-the-disk kind of bare-metal machine each time you want to do a build. That would take forever. So it’s like meta-meta. Here we have Docker. It’s great for us for developing our code, hosting it, deploying it, and really isolating it as a separate unit that is pure and contained. That same philosophy works great for these continuous integration build systems. They’re going to use that same technology.
Okay, so then CircleCI is going to build the Docker image and then it’s going to run it, and then it’s going to run the test against that Docker image. Does it have its own test runner Docker image that’s running those tests or do you create a test runner Docker image and tell it, “Use this.”?
A lot of these languages have certain conventions on where tests go, how tests are defined, configuration files for those, and various test runners. Out of the box, there are a lot of times where you may not have to tell CircleCI anything. It will just know how to run the tests. It will go and look in the places where it expects tests to be for the language that it detected your code is in, and it will try to run them.
You can do that, or you can be much more explicit, which is what we do at Kelsus. We actually give it some instructions and we say, “This is exactly how we want you to run tests. These are the commands.” We’re just being more declarative and having control over it, saying, “We want you to use the Mocha test runner,” or, “We want you to use Jest,” and, “This is how we want you to create code coverage reports. We’re going to use Istanbul for our Node projects to create these code coverage reports.”
So there’s an application that’s running inside the container that CircleCI built, and that’s the application under test. And CircleCI is giving commands to the Docker container to say, “Hey, also go do this.” There’s already a process running inside the container, but it’s saying to run these additional processes inside the actual container under test, so the tests and the application are all running inside the same container?
Chris: Yeah, you can think of it as, your build machine is being spun up on-demand triggered by this whole process, and that build machine is a Docker process, a Docker container. And so, you now have this.
To your code and to everything else, it looks like just a brand new fresh machine that is just running, and it’s just as if you were a developer doing stuff serially, like, “docker build my image,” and as part of building my Docker image maybe the build itself has a step that does linting, so that’s just going to happen.
And the result of that is some exit code. It’s either successful or it’s not, and if it wasn’t successful then it just says, “Okay, I’m done. This entire thing has failed. Shut down the container and report back.” If it was successful, then you can tell these systems like CircleCI what to do next, like, “After you build the Docker image, then I want you to run the tests,” and, “This is how you run the tests.” To run the tests you’re actually now going to do something like a docker run command. So you’re going to run the image that you just built to execute the tests that were included as part of that image. So that process runs, and part of that might be running some commands to generate the code coverage reports.
Again, it’s just up to you what you want that build process to look like, how much work it’s going to do. But it’s all happening inside of that Docker container that CircleCI is spinning up to basically host this whole process to make it look like it’s on its own separate machine, when in fact it’s just running inside of a container.
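As a rough illustration of the build-then-test flow just described, here is a minimal Python sketch. It is not CircleCI's configuration format or Kelsus's actual setup: the image tag `my-service:ci` and the `npm test` command are assumptions made up for the example, and `run_pipeline` is a hypothetical helper. The point it shows is only that each step reports an exit code and the whole build stops at the first non-zero one.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical image tag; the conversation does not name one.
IMAGE = "my-service:ci"

# Steps run in order. The test command (npm test) is an assumption.
STEPS: List[List[str]] = [
    ["docker", "build", "-t", IMAGE, "."],            # build the image
    ["docker", "run", "--rm", IMAGE, "npm", "test"],  # run tests baked into it
]


def run_command(cmd: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(cmd)).returncode


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = run_command,
) -> bool:
    """Run each step in order and stop at the first non-zero exit code."""
    for cmd in steps:
        code = runner(cmd)
        if code != 0:
            print(f"FAILED (exit {code}): {' '.join(cmd)}")
            return False
    return True
```

Calling `run_pipeline(STEPS)` on a machine with Docker installed would build the image and only run the tests if the build succeeded; publishing the image would be a further step gated on a `True` result.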
Got it. And what about some dependencies the container might have, like especially a database. I think that it gets ridiculous to have to mock-up all of your database interaction. It’s probably better to just assume that there’s a database available when you write unit test that may end up touching a database. So what do we do there?
It’s a great question. This is another one of those holy war issues—
Sure, you’re arguing my opinion a little bit.
—because a lot of people are pretty adamant that unit tests don’t have any dependencies whatsoever, and if your test cases are causing your code to make database calls, then it’s now an integration test. Then you have to go down the path of, do I mock out my dependencies or do I not? There’s tons of code out there for mocking stuff.
There are intermediate steps, and one of the things that I really like with Docker, and it’s just a great use of the technology, is that I can bridge that gap between unit and integration tests. So especially when I’m writing a microservice, some of my endpoint implementations for these RESTful APIs may very well want to go talk to a database in order to do their work, especially doing things like queries and whatnot.
I’m not a huge fan of mocking that stuff out and I’d rather just use the database itself. One technique that I really like a lot is to just use the power of Docker to spin up another container that hosts your database, and have a test file that you can use to seed your database if you do want to have some existing data. So you can just take a snapshot of a staging database or production database, have that available, and use that as the seed for the database that just gets spun up. It’s totally ephemeral; it only lives for the duration of those tests, but your code is none the wiser. It can go run those tests, it’s already running inside Docker, it doesn’t take a lot of effort, and you don’t have to worry about keeping a separate environment up and running and having that kind of a dependency. And away you go.
Great. That’s sort of what I expected. I wasn’t sure if you’re putting the database into the same container as the one that’s built for test or if you’re pulling from a separate container. I like that you’re pulling from a separate one because it just makes sense. That’s Docker doing its job and having each process be its own container.
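To make the separate, throwaway database container concrete, here is a small Python sketch that only assembles the Docker commands. The conversation does not say which database is used, so PostgreSQL, the `postgres:16` image, the container name `test-db`, the password and the seed file name are all assumptions for illustration. The official PostgreSQL image runs SQL files found in `/docker-entrypoint-initdb.d` when it initializes a fresh database, which is one way to seed it.

```python
from pathlib import Path
from typing import List, Tuple


def db_container_commands(
    seed_file: str, name: str = "test-db"
) -> Tuple[List[str], List[str]]:
    """Return (start, stop) docker commands for a throwaway seeded database.

    All names here are hypothetical; adapt them to the database you use.
    """
    seed = Path(seed_file).resolve()  # bind mounts need an absolute path
    start = [
        "docker", "run", "--rm", "-d",      # --rm: container is deleted on stop
        "--name", name,
        "-e", "POSTGRES_PASSWORD=test",     # throwaway credential for tests only
        "-p", "5432:5432",
        # Seed script executed by the image on first initialization.
        "-v", f"{seed}:/docker-entrypoint-initdb.d/seed.sql:ro",
        "postgres:16",
    ]
    stop = ["docker", "stop", name]
    return start, stop
```

A test job would run the `start` command before the tests and the `stop` command afterwards, so the database exists only for the duration of the run.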
So, we kept it at a real high level, like this is what the developer’s doing on their machine: they’re writing code, they’re linting their code, they’re running the unit tests, which may include some integration tests. Then they’re sending that over to GitHub, which has got a hook in it that tells CircleCI to do some stuff, and then we just did a little bit of a deep dive on what CircleCI does while it’s running the tests and how that works, and then also what our tests actually do. Assuming that the tests passed, what happens next?
The answer is, that depends. Here’s where we start getting into the ‘CD’ part of the acronym, Continuous Deployment. What do you do after you’ve actually made your artifact, tested it, and it passes all the tests? Usually the next step is to publish that artifact to your artifact repo.
For us, since we’re on Docker, we need a Docker image repo. Since we’re on Amazon we’re using ECR, which is the Elastic Container Registry, which is essentially just a directory of these images that are now available for containers to pull as they need them. So the image will get published, and then after that we have to decide if we want to do a deployment automatically.
Continuous Deployment, again, has some interesting philosophical positions, and my personal opinion on this is that to do true CD, there are a couple of prerequisites that have to be in place. One is that you need to have very good integration test coverage, and the other is that you need to have built in the sophistication for doing rollbacks very easily and very quickly.
We’ve taken a hybrid approach at Kelsus. Even though we don’t have those two things in place, we feel comfortable having continuous deployment enabled for our Dev and Staging environments because they’re non-critical.
The truth of the matter is that that is basically where integration happens and where a lot of the testing does happen, so deploying automatically there makes sense for us. For production deployments, it’s not an automatic step. That’s something we’ve not set up CircleCI to do. Instead, it’s a manual process, so someone has to manually make the decision that this particular image is indeed ready: it’s been verified, it’s passed all of our tests, and we feel very confident that it is ready to be promoted to production. In that case it’s a very quick manual step by the person doing that deployment to make that promotion.
Knowing a little bit about how this works, I know that we use the power of GitHub and Git and branching for this, so maybe you can tell us a little bit about how we set that all up?
Sure, yes. One of CircleCI’s features is that you can set up branch-specific actions, based upon what branch the code was committed to in your actual Git repo. We have specific branches in our codebase. We’ve come up with a convention that we’ll have one branch that’s known as Development, there’s another one that’s known as Staging, and then we have Master. We tell CircleCI, whenever there are code check-ins against either the Development branch or the Staging branch, to go ahead and deploy automatically to those particular environments. There are scripts that we have that get run by CircleCI when those branches are built, and those scripts are basically saying, “Okay, I’m going to go and do the things that I need to do to tell AWS to update ECS to now run this image in that particular environment.” It’s done with a branch-specific specification that is just a feature of CircleCI. It’s a very nice feature to have.
That’s the convention that we’ve employed: having specific branches in Git and then having certain instructions for CircleCI on how to go ahead and do deploys.
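The branch convention described above boils down to a small lookup. The sketch below is not CircleCI configuration; it is a hypothetical Python helper (`deploy_target` and the environment names `dev` and `staging` are made up) that shows the rule: Development and Staging deploy automatically, while Master and every other branch do not.

```python
from typing import Dict, Optional

# Branches that deploy automatically, mapped to hypothetical environment names.
AUTO_DEPLOY: Dict[str, str] = {
    "development": "dev",
    "staging": "staging",
}


def deploy_target(branch: str) -> Optional[str]:
    """Return the environment to auto-deploy to, or None for no auto-deploy.

    Master is deliberately absent: production promotion is a manual step.
    """
    return AUTO_DEPLOY.get(branch.lower())
```

A deploy script invoked by the CI system could call `deploy_target` with the current branch name and exit without deploying when it returns `None`.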
What about if there are database changes that are part of a commit?
Man you’re asking all the hard questions, eh Jon?
So database migration is a whole other can of worms on this as well. It’s one of those things that’s actually very specific to the language that you’re working in. In the past I’ve done a lot of work with Python and Python’s technology. One of the common Python ORMs that we used there is called SQLAlchemy, and SQLAlchemy had a companion component called Alembic. Alembic was this technology that would create your database migration scripts based upon the model changes in your codebase.
So the developers are working in their Python codebase, they’re changing the property on a particular database model like a customer model. They’re adding a middle name property on the customer model. You run this Alembic tool after you make your code-commit, it can then see that, “Oh, you changed something to your actual database model,” and its output is a database migration script that now includes the actual commands that you have to do for the particular database that you are using to say, “Hey go add this new column to this particular database table.”
So it has that script, and it also has the features to run that script against an actual database. What we did there was, when code is deployed, as part of its startup process it would go and run any available migrations. These migrations would be versioned in with the source code and be baked into the Docker image. When the Docker image gets run, it’s going to use Alembic to figure out, “What is the last migration that was run against this database? Do I have something that’s newer? If so, go ahead and run that as a transaction, basically under a mutex, so that it will be run once and only once.” If that succeeds, great. We now have the migration automatically done for us. If it fails, then that whole deploy has to fail, and because it was run in a transaction, no changes happen to the database. The entire deploy fails, and that’s when developers have to come into play and figure out why it happened.
That was an example of an optimal way of how that would work in the Python world. Most of our code is running in Node. With Node, we’re still working through the issues of what is a good tool for generating these migrations, and then more importantly how do you robustly play these migration scripts out and have the ability to roll back and detect errors. We’re working through some of those processes right now.
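The run-pending-migrations-at-startup idea can be sketched with nothing but Python's standard library. This is not Alembic's API and not the Kelsus implementation: the `schema_version` table, the two example migrations and the `migrate` function are hypothetical, and SQLite stands in for the real database. It shows the three properties mentioned: find the last applied version, apply newer ones inside a single transaction, and roll everything back if any step fails. `BEGIN IMMEDIATE` takes SQLite's write lock up front, which plays the role of the mutex here.

```python
import sqlite3

# Hypothetical migrations, ordered by version number.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN middle_name TEXT"),
]


def migrate(conn: sqlite3.Connection, migrations=MIGRATIONS) -> int:
    """Apply pending migrations atomically; return the resulting version."""
    conn.isolation_level = None  # we issue BEGIN/COMMIT/ROLLBACK ourselves
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)"
    )
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        current = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM schema_version"
        ).fetchone()[0]
        for version, sql in migrations:
            if version > current:
                conn.execute(sql)
                conn.execute(
                    "INSERT INTO schema_version (version) VALUES (?)", (version,)
                )
                current = version
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")  # leave the schema exactly as it was
        raise
    return current
```

Running `migrate` a second time is a no-op, since every migration is already recorded. Whether schema changes can be rolled back in a transaction depends on the database; SQLite and PostgreSQL support it, some others do not.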
That’s exactly what I was thinking about. We’ve touted ECS’s ability to make it easy for you to have 100% uptime updates to applications. But if the updates to applications include breaking changes to a database schema, that gets a little trickier. Everything has to be run in phases, where we can change the database to add a bunch of stuff that’s not going to break the old code, then we can run a script that’s going to migrate to some new tables, still not breaking the old code, then we can put in the new code that uses the new tables, and then finally pull out the old code and get rid of any tables or columns in the old database that we don’t need anymore. So multiple phases of doing things. But I’m not aware of a software tool that just takes care of all that without having to really think through it carefully.
If you do come across that, let me know because I will buy stock in that company. That’s just a super difficult problem because it really is at your architecture level, like are the changes that you’re making to your database schema backwards compatible?
We talked about how, with ECS, for redundancy reasons or performance reasons, we may have four instances of our microservice up and running. Four duplicate copies, and when we go do a new deploy, we’re going to do a rolling update. We’re going to individually kill those old versions and bring up new versions. You’re going to have a mixture of some of the old code and some of the new code. But first, before you run the new code, if you made database changes, schema changes, you have to run that migration. Well, if that migration is not backwards-compatible, what happens now with the old code that’s still running as it goes through the process?
It can be very, very complicated and usually big database changes are very difficult to do.
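The phased approach described above is often called expand and contract. The following Python sketch uses an in-memory SQLite database with a made-up `customer` table and made-up `old_code_read` and `new_code_read` queries, purely as an assumption-laden illustration: an additive change plus a backfill leaves the old code's query working, so old and new versions can run side by side during a rolling update.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customer (name) VALUES ('Ada Lovelace')")


def old_code_read(db: sqlite3.Connection):
    """Query issued by the old version of the service (hypothetical)."""
    return db.execute("SELECT id, name FROM customer").fetchall()


def new_code_read(db: sqlite3.Connection):
    """Query issued by the new version, which prefers the new column."""
    return db.execute(
        "SELECT id, COALESCE(display_name, name) FROM customer"
    ).fetchall()


# Phase 1 (expand): an additive change that the old code never notices.
conn.execute("ALTER TABLE customer ADD COLUMN display_name TEXT")

# Phase 2 (backfill): copy existing data into the new column.
conn.execute("UPDATE customer SET display_name = name WHERE display_name IS NULL")

# Phase 3 (rolling deploy): both versions run against the same schema.
during_rollout = (old_code_read(conn), new_code_read(conn))

# Phase 4 (contract) would drop the old column, and is only safe once no
# instance of the old code is still running. It is left out on purpose.
```

The hard part the speakers point out remains: deciding when it is safe to run the contract phase is a judgment about what is deployed, not something the database can check for you.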
Agreed. We don’t have the solution there yet. We don’t have a tool in our CI/CD Pipeline that solves that big [inaudible 00:53:16] problem, but is there anything we have settled on in the Node world that relates to this problem, or what is our current tool of choice in this CI/CD Pipeline world for managing this stuff?
Yes. In the past with Node, some of it has been bare SQL. We’re now transitioning more towards using Sequelize as our ORM in Node, and Sequelize is probably the most popular ORM technology for Node. It does have the ability to generate migration scripts as well. It’s just that executing those migration scripts in a very robust way is an issue that we’re still working through. There are other tools in Node. There are things like Knex and other technologies out there for doing it. So it’s not like there’s no answer to this, it’s just that we at Kelsus are still working through what that process looks like, and doing it in a way that’s smart and robust and that we’re going to be happy with.
Right on. There’s one other thing that we did recently through our CI/CD Pipeline that I think might be worth talking about a little bit which is, we made some changes in order to support hotfixes because when you’re running code in production, sometimes you need to fix the thing that’s broken in production before you worry about any other thing, so can you tell me why we did that and what we did, and how it impacted our whole process?
This is definitely one of those things that comes up with any development team, is you got this process of developing code and you have environments like Development, Staging and Prod, and you have Feature branches. So there’s code that’s in many different places and it’s definitely important to know what code is deployed where.
Usually you work bottom-up. You’re in a Feature branch, you’re working on some code, and then when you’re done with that you merge to Development. As it goes through testing, integration testing, and verification, then you now merge it into Staging. Once that’s been tested and verified, and you’re ready for it to actually put it to production, then that’s when it goes to Master.
Well, what happens when something makes it all the way to Prod, and now some critical error happens with it, there’s something that was discovered only in Production, or maybe something needs to be changed very quickly because you realized that this color should have been red instead of green, or we need to change the label of some text or something because the customer changed their mind, so you need to make a change very, very quickly. For that, you need to now work from the top-down.
So you’re now going in a different direction where you’re going to actually make the change of that in the highest level, and you then need to make sure that that change now gets propagated back down into the lower branches. That’s roughly how that process works for us is that we just have a way of knowing where to start with, with that hotfix that’s in the production code. So we take a branch off of Master—you can think of it as a hotfix branch off of Master—make a change into that, merge that back in to Master, and now have our build made from that. We can now deploy something that’s very, very quick. Once that’s done take that change that was made and merge it back down into Staging and into Development.
Does that hotfix branch bypass the integration testing that would happen otherwise?
No. All that still has to go through the same process. It still has to be built on the build machine in CircleCI. All that stays in place. The only thing that really changes here is what stages it goes through as far as the environments. It doesn’t go through Dev, it doesn’t go through Staging. Basically you’re going straight into Production.
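The top-down hotfix flow can be written out as an ordered list of Git commands. This Python sketch only builds that list; the `hotfix/` branch prefix and the lowercase branch names are assumptions, since the conversation gives the branch roles but not their exact names or the merge options used.

```python
from typing import List, Sequence


def hotfix_commands(
    name: str, lower_branches: Sequence[str] = ("staging", "development")
) -> List[List[str]]:
    """Return the git commands for a hotfix that starts from master."""
    branch = f"hotfix/{name}"  # hypothetical naming convention
    cmds = [
        # Branch off production code, not off development.
        ["git", "checkout", "-b", branch, "master"],
        # (the fix itself is committed on the hotfix branch here)
        ["git", "checkout", "master"],
        ["git", "merge", "--no-ff", branch],
    ]
    # Propagate the fix back down so lower environments do not regress.
    for lower in lower_branches:
        cmds.append(["git", "checkout", lower])
        cmds.append(["git", "merge", "master"])
    return cmds
```

Each merge into a branch that the CI system watches still triggers the normal build and test cycle, which matches the point that a hotfix skips environments, not tests.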
Right, and what this is making me realize is that our GitHub process is really the orchestrator of everything: where we put our code and when, putting it in Feature branches, putting it in Dev, putting it in Stage. Those branches are what control our whole CI/CD process. It orchestrates everything, and then Docker images and running Docker containers are the output of the process.
All right, let’s just take a moment here. I don’t know if I have any other questions or anything else to bring up during the conversation. Is there anything you felt I left out, Chris or Rich?
I have a few questions that are a little bit more meta and I think it might be interesting.
What percentage of the project budget is spent on the testing and the CI and all of this infrastructure that the client really has no deliverable?
I think it totally depends on the client and how much they value this thing. We’ve been very fortunate at Kelsus to work with clients who know that there is a lot of value to this, even though it’s not contributing to a very specific feature or materially changing the capabilities of the software that they’re delivering. They know that it’s an investment that will almost assuredly lead to higher quality software and also reduce overall development cost by having those techniques in place. As far as how much to spend, like what percentage of the budget, a lot of the toolchains and things like CircleCI make it really easy to get these things set in place, so it can end up being as little as 5%-10% of your developer time to have all this stuff in place. You can have more tests and increase your capabilities there with more coverage and more bells and whistles, like dashboards or better code coverage reporting and metrics into the system, and whatnot. You can definitely spend upwards from there, but it doesn’t have to be a huge investment to get a lot of gains from it.
Another thing that’s just an interesting question, business-wise: there’s a real conundrum when you start to work with a client, or when you’re a business that produces software rather than a consulting company like we are. If the business doesn’t understand the value of these things, then what can happen is that when you first begin a piece of software, you may actually be able to produce the software quite a bit more inexpensively by not doing all of this stuff. So to get from zero to one, you might be able to cut all these corners, not do any of this, and just write code and deploy code somewhere. Maybe you can do that for 30% less or even 40% less, just depending on how much labor you would have put into building a DevOps infrastructure. So you would get there, without it, way more cheaply.
Chris threw out 100,000 lines and that’s a good number in my mind. It just feels about intuitively right. It’s enough code that you can’t remember it all; there’s like a dark corner in it somewhere you haven’t looked at. If you’re the one developer, there are at least a few dark corners in it that you haven’t looked at for several months once it gets to be that size. So once it’s that size, there’s also probably some interaction across subsystems that you might forget, and you’re going to start making mistakes, and it’s going to become harder and harder to add new features.
It’s just classic. What you see is the development team grind to a halt, and then all of a sudden people are having meetings with the business about velocity and why it’s essentially zero, and why all the time is spent on bug fixing and not on working on new features.
And so it’s like, “Wait a minute, we did all of this work and it was all so fast, and it was all so cheap, and now all of a sudden we can’t get anything done.” And that’s because the compound interest on the investment that you should have made is not there. That’s why it’s worth spending a little more upfront to do this and to engineer it right, because when you have to add this later, it’s really a difficult pill for the business to swallow. Really, really painful. I’ve never seen a business happy to take that on and watch several months go by with no new features.
Yeah, I’ve heard this conversation a million times, where a developer would say, “In order for me to get out an MVP, I’m going to skip testing altogether.” That same developer, seasoned, like 18 months in, says, “I’ll never, ever, ever develop without TDD ever again.”
My other question is, what can’t you use automated tests for, and how do you use manual testing, like through QA once it’s in Production, or maybe even in Staging, to check the user experience or performance or things that a tool can’t really pick up on?
There is a lot that you can automate especially if you devote the resources to doing that. That said, a lot of the times things like UI testing are definitely more difficult to do in an automated fashion. It can be done, but it’s definitely much more of an investment. So a lot of teams will do that in a manual fashion or use some of the other services that are out there for doing that.
Sometimes just the overall correctness of how well does this conform to spec. That’s not something that you can do in an automated fashion. If you could, I think you could probably design a system where you wouldn’t have to write any code.
There’s that aspect to it. Sometimes integration testing can get to be pretty complicated, especially when you’ve gone to microservices as an architecture, the whole kit and caboodle, and you have many different services all interacting together. That may be an area that’s harder to automate, but the truth is there’s a lot that can be automated. You have to balance it out. It’s a very pragmatic decision, like how much time is it going to take me to automate it versus how much time is it going to take me to just do it with manual processes.
There’s something to be said, too, for what a lot of companies do now, especially with Agile and CI/CD, which is basically to let your users do a lot of your testing for you. That’s continuous deployment: having canary releases where you have a new build and only a certain portion of your user population gets diverted to it. You monitor how that’s working, and if your monitoring systems are not detecting any issues or problems, then you can increase that percentage of deployment until it’s 100%. So there are techniques like that as well.
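A canary release as described comes down to two decisions: which users see the new build, and when to widen the rollout. The Python sketch below is one possible way to do it, not a description of any particular product; `in_canary`, `next_percent` and the 25-point step are hypothetical. Hashing the user ID gives each user a stable bucket from 0 to 99, so a user who is in the canary at 10% stays in it as the percentage grows.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Return True if this user should be routed to the new build."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket 0..99
    return bucket < percent


def next_percent(current: int, healthy: bool, step: int = 25) -> int:
    """Widen the rollout while monitoring is healthy; otherwise pull it back."""
    if not healthy:
        return 0  # send everyone back to the old build
    return min(100, current + step)
```

In practice the `healthy` signal would come from whatever monitoring is in place, and the routing itself usually lives in a load balancer or service mesh rather than in application code.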
That’s all I have.
Great. I guess we can wrap it up. Just looking forward, I’m excited to do this talk at GlueCon because I think this conversation showed that there’s a lot to talk about, and it feels like we’re onto something in terms of having a well-tested and working CI/CD Pipeline that we can show to other people.
For sure. Again, for me, I think we could probably dive deep into it. I can think of six or seven different topics where we could have gone much deeper. Definitely very interesting. This is one of those things where I think everyone who’s writing software knows they should be doing this, but it’s really hard to do right in practice. So to actually break it down a bit and talk a little bit about the practicalities of how to do it, as well as the motivations behind it and what benefits you get, is absolutely very, very interesting.