Kubernetes and Containerization
This is designed to be a multi-part article. This first post, the introduction, talks about what the plan was and how it came to be, along with some of the concerns and questions we had before starting implementation. The following parts discuss the exact details behind each of the particular goals we were trying to achieve.
Update: While there are multiple parts to this "series", I didn't end up writing all of the individual pieces I had intended to originally. So instead, this article is really more of just a general overview of the planning and the design of a new system I was building at Loot Crate during this time (2016/2017). It focused on containerization, Kubernetes, and an event-driven architecture.
It began with a con…
I was promoted to Software Architect at Loot Crate specifically to spearhead the greenfield development of a new e-commerce platform that would host a spin-off of their core product(s), which would later be known as Sports Crate. This gave me two firsts in my career:
- I was responsible for the successful design and implementation of a large and cross-functional project.
- I was going to lead a team of 12 engineers, an almost 50/50 mix of front-end and back-end.
Kubernetes went GA with version 1.0 in mid-2015, and they were planning to host their first-ever KubeCon in 2016. I had already been looking into Kubernetes since it went GA, and I knew it would be a massive improvement for what we were hoping to accomplish. And keep in mind, "serverless" wasn't a thing in 2016. Although AWS Lambda had been around for a couple of years, it was severely limited in the programming languages you could use. As a Ruby on Rails shop, porting everyone to a new language just for the purpose of using Lambda would have been more effort with less reward than getting a Jenkins pipeline to automatically push new Docker container images to Kubernetes deployments.
So, in order to gather as much information as possible on Kubernetes and how it was being used in production, I asked the company to send me and some of my colleagues to KubeCon 2016, and they gladly obliged.
The event was wonderful. We met a lot of interesting people working on interesting projects in the container technology space. Monitoring, deployment, development flows, image verification, hosting solutions, and more were all represented. The most intriguing part of it all, however, was that no one really knew the “right” way of doing anything in this space yet.
Sure, people knew how to monitor a giant cluster of services, but deploying those monitoring tools as separate service containers or bundling them into existing containers was new territory. And automating a lot of the boring stuff was really new, with different companies having different ideas on how to tackle the issue. Overall it was a great font of knowledge that we eagerly chugged down.
I used the remainder of November to consolidate the notes from the convention and test out a few ideas to see if they would even be possible in the budding architecture design I had growing in my head. It seemed more and more likely as I ran more tests, and after comparing results with the testing our Director of DevOps was doing in tandem with my own work, we agreed that this path was possible.
Initial Architecture
In December I began to hammer out the official design. I’ll post other articles to talk about some of these design points in more detail, but here was the basic architecture:
- Use Docker for all services. The application code, databases, anything we wanted to run in our new system would have to be deployed from a container image, and we would use Docker to build or fetch those images (a minimal Dockerfile sketch follows this list).
- Use `docker-compose` for local development. One of the major pain points with local development is that the environment you’re developing in tends to be significantly different from the one you’re deploying to. The use of `docker-compose` significantly reduced that disparity (see the compose sketch after this list).
- Use Jenkins for automation. Not just kicking off tests, but general automation. Specifically, we used it to build images, run tests, upload images, and kick off deployments (see the pipeline sketch after this list).
- Use the Google Cloud platform to store images, host compute instances, manage network configuration (ingress/egress rules, load balancing, etc), and manage Kubernetes.
- Use Kubernetes to manage container deployments (a sample Deployment manifest follows this list).
- Use Datadog to monitor container instances.
- Use PagerDuty to automate the on-call rotation and escalation rules, with alerts typically coming in from Datadog.
- Use Cloud Pub/Sub to build an asynchronous and decentralized message bus. This was primarily to allow our legacy software and our new prototype to communicate effectively (a small publishing sketch follows this list).
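To make the Docker piece concrete, here is a minimal sketch of what a Dockerfile for one of the Rails services might have looked like. The Ruby version, port, and server command are assumptions for illustration, not the actual files we used:

```dockerfile
# Hypothetical Dockerfile for a Rails service (versions and commands are illustrative).
FROM ruby:2.3

WORKDIR /app

# Install gems first so the bundle layer is cached between code-only changes.
COPY Gemfile Gemfile.lock ./
RUN bundle install

# Copy the application code and run the app server.
COPY . .
EXPOSE 3000
CMD ["bundle", "exec", "puma", "-C", "config/puma.rb"]
```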
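For local development, a `docker-compose` file along these lines would stand up the app and a database together. Service names, ports, and the Postgres version are hypothetical:

```yaml
# Hypothetical docker-compose.yml for local development.
version: "2"
services:
  web:
    build: .                      # built from the Dockerfile sketched above
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/app_development
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: postgres
```

A developer would then run `docker-compose up` and get roughly the same containers locally that the cluster would run.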
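The Jenkins piece was essentially a build/test/push/deploy pipeline. The declarative Jenkinsfile below is a rough sketch of that flow, not our actual pipeline; the project ID, image name, and deployment name are made up, and `gcloud docker -- push` is the era-appropriate way to push to Google Container Registry:

```groovy
// Hypothetical Jenkinsfile sketching the build -> test -> push -> deploy flow.
pipeline {
  agent any
  stages {
    stage('Build image') {
      steps {
        sh 'docker build -t gcr.io/example-project/web:${BUILD_NUMBER} .'
      }
    }
    stage('Test') {
      steps {
        sh 'docker run --rm gcr.io/example-project/web:${BUILD_NUMBER} bundle exec rspec'
      }
    }
    stage('Push image') {
      steps {
        sh 'gcloud docker -- push gcr.io/example-project/web:${BUILD_NUMBER}'
      }
    }
    stage('Deploy') {
      steps {
        sh 'kubectl set image deployment/web web=gcr.io/example-project/web:${BUILD_NUMBER}'
      }
    }
  }
}
```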
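On the Kubernetes side, something like a Deployment per service is what the pipeline above would update. This manifest is a hypothetical example (names, replica count, and health-check path are invented), written against today's apps/v1 API; at the time the API group was still extensions/v1beta1:

```yaml
# Hypothetical Deployment for the web service.
apiVersion: apps/v1          # extensions/v1beta1 back in the Kubernetes 1.4/1.5 era
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: gcr.io/example-project/web:42
          ports:
            - containerPort: 3000
          readinessProbe:          # only route traffic to pods that pass this check
            httpGet:
              path: /healthz
              port: 3000
```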
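And for the Pub/Sub bus, services would publish domain events that any other system could subscribe to. A minimal Ruby sketch using the google-cloud-pubsub gem might look like this; the project, topic, and event names are purely illustrative:

```ruby
# Hypothetical publisher: emit an "order.created" event onto a Pub/Sub topic.
require "google/cloud/pubsub"
require "json"

pubsub = Google::Cloud::PubSub.new project_id: "example-project"
topic  = pubsub.topic "order-events"   # assumes the topic already exists

# The payload carries the event data; attributes let subscribers filter and route.
topic.publish(
  { order_id: 1234, email: "fan@example.com", sku: "crate-2017-04" }.to_json,
  event: "order.created"
)
```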
The “legacy” platform at Loot Crate was written as a monolithic Ruby on Rails application. Our front-end developers were writing and managing ERB and CoffeeScript files, which we already knew we didn’t want in the prototype. Aside from merely trying to improve the technology we were using, I also wanted to improve our process, so I wanted to separate the development and deployment needs of front-end and back-end code. The prototype therefore used a React project for the front-end.
Concerns and the Unknown
There was a lot of new stuff in this design that most people in the company hadn’t even heard of before. Kubernetes was still new, I was the only person who had ever used Docker before, no one had used Google Cloud, and many people weren’t sure how or why to use a message queue. We were also using Heroku for deployments and a third-party CI service for automated testing, so moving away from both of those onto a completely custom solution (Kubernetes + Jenkins) was quite polarizing for some. Overall people were excited though… assuming we could bring it all together.
The biggest concern was: if we started implementing all of this in January, could we have a functioning website/service deployed by mid-March? Well, some of the design decisions were made specifically to accommodate such a timeline. For example, we ultimately went with Google Cloud because they supported a lot of what we wanted to do, like container image hosting and Kubernetes-as-a-service, at a time when AWS wasn’t touching that stuff. I also knew my team well and had no doubt they would pick up the work I started and see it through to completion. And again, we made some concessions to ensure developers would be as effective as possible, such as using Ruby as the back-end language since all the back-end engineers knew it well. One of the reasons we were comfortable going with React was that our front-end engineers had been learning React up until that point in preparation for a switch in the legacy system, so they were all trained up and ready to go.
I made an estimated timeline and worked closely with our Director of DevOps and Product Manager to ensure that all of our requirements were within reason and that any corners we were cutting were known about ahead of time. I didn’t mind having a less-than-perfect prototype, because that’s why you build prototypes: to test things without worrying about getting them exactly correct. Everyone was on board with that, and by the end of December we had pinned down our largest tasks according to estimated effort and time-to-completion. Getting `docker-compose` to work was a large developer task since no one had used it before. Getting Kubernetes configured and working was a large DevOps task. Setting up Jenkins was a medium task but crucial since it automated all of our builds and deployments. And finally there was the task to get the back-end and front-end work completed so that we had something customers could interact with, and that was actually the least risky part of all.
Update
I had intended this to be a multi-post series on the subject, but that didn't end up happening. What I will say is that the vision played out even better than we had imagined.
QA and our front-end developers were using tools like Postman for testing purposes. They basically had a handful of environments to point at: localhost, test, staging, and production. What was nice about that was we could have QA actively developing tests for endpoints which hadn't been written yet, allowing developers to have access to a pre-built set of tests by the time they finished implementation. And front-end could see the various data representations for different cases without having to manually trigger them through the UI.
Kubernetes gave us a lot of reliability and easy scaling. I remember our Product Manager asking me on a near-daily basis when we were going to have our first crash or on-call emergency, and it literally never happened. I told her I was considering causing a small production issue just to get her to calm down; we laughed a bit, but I think she remained a little nervous because the launch was certainly smoother than most major product launches I had been part of up to then.
Containerization is (as of 2024) pretty much the de facto standard for how nearly everyone deploys their services, so getting everyone onto Docker and containerization was certainly the correct choice, and learning it enriched all of our engineers.
And the event-driven architecture we built off Cloud Pub/Sub topics gave us the ability to do a handful of post-launch tasks without changing any of the core platform or website code. The data & analytics team wanted to capture some more data and, thanks to the breadth of events we were already firing off, was able to just add a new subscriber and start ingesting data immediately.
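As a rough illustration of how little that took, a new consumer could be a standalone process along these lines, again using the google-cloud-pubsub gem with hypothetical subscription and field names, without touching the services that publish the events:

```ruby
# Hypothetical standalone subscriber, e.g. for an analytics ingester.
require "google/cloud/pubsub"
require "json"

pubsub       = Google::Cloud::PubSub.new project_id: "example-project"
subscription = pubsub.subscription "order-events-analytics"

subscriber = subscription.listen do |message|
  payload = JSON.parse(message.data)
  # Hand the event off to the analytics pipeline here.
  puts "ingested #{message.attributes['event']} for order #{payload['order_id']}"
  message.acknowledge!
end

subscriber.start
sleep   # keep the process alive; call subscriber.stop.wait! on shutdown
```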
Overall it was a massive success. It wasn't all smooth sailing all the time, but it was definitely a huge win for all involved, a lot of learning was done, and it was a great achievement.