How I learnt to love and hate Terraform in the past few weeks
The tale of a joy/pain-ful lonely journey into the infrastructure-as-code world
Once upon a time…
I might have said this a couple of times, but the best and the worst part of working in a startup environment is the diversity of activities you enroll in. One day you are debugging some bug in a Angular/React front-end page, the other you are pulling your hair off with some obscure infrastructure problem.
That is good because you sort of get an intensive crash course in many disciplines, being forced to learn things very fast — like front-end, back-end, infrastructure, design, business, billing, you name it — that would take you a lot of time (if at all) in a mid-size/large company. On the other hand, having to deal with such broad aspects together (sometimes, simultaneously more than one), means that you will not be a specialist on any of those subjects and the constant context-switching may drive your productivity low.
At Revmob, we are currently on an effort to strike a balance between flexibility and specialization in the tech team, with well defined squads and responsibilities. However, we still do not have an ops team, neither someone whose primary focus is developing and maintaining infrastructure, so basically everybody is responsible for it.
But, as we say here in Brazil, literally translated, “a dog with two owners starves to death” (I believe the closest saying in English would be “too many cooks spoil the broth”). With the lack of standards and, at some extent, of knowledge, whenever there was a problem with our infrastructure, we used to go full panic mode, basically dropping and recreating everything.
This is the poor infra-dog :(
We have relied on AWS Elastic Beanstalk for a while and, whilst it reduces the effort to setup a basic web or worker-based standard infrastructure, once you need more flexibility, you end up struggling against the multiple levels of abstraction, resources that are hard to find, the complexity of CloudFormation stacks, configuration files using the .ebextensions
directory inside projects leading to lots of code duplication, etc.
However, our main pain point was that the warm up time for a new instance being too high (about 7~8 minutes) and we verified that most of that time was spent on Beanstalk setup than on our applications. We managed to make some tweaks, by creating a custom AMI based on some Beanstalk-ready default images, so the warm up time went down to about 4~5 minutes. Still, this was not good enough for our highly elastic environments (oh, the irony!), as certain unpredictable high traffic spikes (very frequent in mobile advertising networks) could not be properly handled.
We needed to find a way around this problem.
Beyond that, we were facing the need for our first multi-region application. Since this would require a lot of work to setup by hand, we felt motivated to explore alternatives that could solve both problems.
I had heard about Terraform a couple of times and our CTO told us that the ops team in his previous company used it to manage infrastructure. So we decided to give it a shot.
If you are interested in a more technical article, check this out:
Then the journey begins…
I was chosen/volunteered to be in charge of exploring this brave new world. The initial idea was that I would spend one sprint (which means 1 week for us) in order to make the setup needed to run two new applications, including the dreaded multi-region one.
We thought it would be hard to do pairing since none of us had the slightest idea of how that thing worked, so it was decided that I by myself would be my own team for that sprint.
I started by doing the good ol’ hello world. The installation is quite easy, just download a binary file and put it in your PATH
. Then I followed all the steps of the guide. Piece of cake.
Now let’s get to the real work!
Troubles in paradise
I found a nice article series about Terraform, written by some company, other than HashiCorp — the creator of the whole thing — and started to read it. While it gave me some cool insights, when I started to try the code out, it simply did not work. I tried to move forward with the series, but ended up stumbling on the same problem.
I got kind of pissed and completely removed it from my bookmarks and never looked back. Later I found out that those articles were based on a previous version of Terraform, which was not compatible with the most recent I was using.
Then I decided to dive into the official documentation and continue by trial and error. That kind of worked for a while, however I was going in baby steps.
Differently from software development, where you can easily setup an automated test suite, put them to run on watch mode and get instant feedback when you make any changes, working with Terraform seemed to be some orders of magnitude slower. Many times I caught myself just staring at my screen to see what was going to happen after I changed a resource. But at least I felt I was moving forward.
At some point, however, I hit a wall. Then I went pray to our lord Google.
Oh, mighty Google, please save this poor siner’s soul
Nonetheless, the all-powerful was not very merciful at me. Each link of “how to do X with Terraform” led me to a completely different approach, which probably did not even work. Some articles, only a few months old, were already outdated (it seems like Terraform 0.9.x
was a completely new beast from 0.8.x
).
By leaps and bounds I could reach a somewhat satisfactory result for the simpler application I had. It was kind of nice to see all that code being translated into actual Load Balancers, Auto-Scaling Groups, VPCs, Security Groups, CloudWatch metrics and alarms, and so on.
It was already Friday an I was happy that I was done. Now I just needed to change some input parameters and Terraform would automagically reproduce the setup I had for the other application, then I would just need to make it multi-regional. Easy-peasy.
That’s what I thought.
Yeah, not so fast, hasty!
It turned out that Terraform does not work quite like I thought it would. I had gotten it all wrong. I had not understood that modules play a central role in Terraform code reuse patterns, so I just ignored them altogether. It was too late, the sprint was over and I had failed.
During the grooming meeting on the same Friday, I gave the status to my boss and he asked me if I wanted more time. I told him I would get everything done until the next Wednesday (yeah, I was that stupid). He agreed to allow me to continue my quest during the following week.
Also, in the meantime, since we are only two in my squad so far, I sort of left my squad-mate alone for the sprint and would leave him by himself again. He got sick of waiting for my return and joined another squad for the next sprint.
I feel you, SpongeBob
Taking a step back
During the weekend I was reflecting on what went wrong and I concluded that I was in frenzy mode . I needed to take it slower, gain more muscle and agility in order to win the challenge.
I focused the first days on understanding the internals of Terraform and to review some infrastructure-related and AWS specific concepts. For example, I had to remember a few things of that Computer Networks classes I had never payed too much attention during college.
I would probably never have to write a Route Table by hand nor partition a VPC into several subnets my entire life, I don’t need this bull$*!7.
I realized that while I was complaining about Beanstalk, it had spoiled me a a lot. But it wasn’t long before I could find that information in a dusty corner in my mind.
Oh, so that’s how I do to calculate the number of bits I need in my subnet mask?
After those mental push ups, I was back on track. I had finally understood why I needed modules, how to partition Terraform state and how to use it remotely. My code started to get reusable.
Of course, the screen staring wasn’t entirely over, so obviously I couldn’t get everything done by Wednesday as I promised.
The epic final battle
On Thursday morning I made a promise to myself.
I won’t drag this $*!7 with me another sprint. I will finish it until tomorrow.
Time to suit up!
So it begun. I was back on frenzy mode, but this time a little more conscious. My strategy was to deploy both two applications in a single region and, after making sure everything was running smoothly, I would replicate one of them across AWS 14 regions.
I was a little bit more savvy on Terraform than before, so my main struggle became the AWS quirks. One of the problems we had with Beanstalk finding specific resources through its console (have you tried to find the load balancer of an environment?), so I decided to use meaningful names for all resources I could.
Then came the name collisions. Luckily this was already solved by Terraform, allowing me to use name prefixes for certain resource type, generating a new name every time, while keeping its meaningfulness. I was also trying to use CodeDeploy, something we had never used before. This gave me a bit of work as well.
When I had figured out the problems with AWS, I felt the need for refactoring the infrastructure code to make it more manageable. I was ready to throw everything away and rebuild my infrastructure when I found out that Terraform allows you to move state.
That moment was the first time I loved Terraform…
This feature saved me some time, as I could keep what I had created and just move the state around. I was getting close.
Before the setup was 100% functional for a single region, I got into a fight with PM2 as well (I’m a bit feisty, you might be thinking). Again it was something I had never used and, as it was late in the night on a Friday, the tiredness started to consume me. PM2 was able to throw a few strong punches against me.
PM2 looked like Apollo Creed…
But, as Rocky Balboa himself, I managed to defeat the dreaded adversary and was ready to move on to the next challenge.
Time to go multi-regional!
It was about 5 a.m. of Saturday when I finished the first multi-regional setup. Sleep was a long lost. I wanted to finish everything before going home. I had to make some minor tweaks and after a few minutes — bam — there was my first multi-regional application deployed and ready to go.
Just to be sure, I ran a little test suite on Postman I had setup during development. The first time it failed for Japan, but I realized that the infrastructure was not fully set yet. One more try and everything ran smoothly \o/. It was 5:53 a.m.
At that moment, I started to loudly play “We are the Champions” at the Office.
Of course, there was no one there to listen… So I put if off, asked for my Uber and went home to get the sleep of the righteous.
What would I do different?
I’ve made several mistakes during those two weeks. I probably could’ve stressed less, suffered less, slept better and drunk more water during the process.
I believe our assumption that would be better to make this a single person job was wrong. There are so many things involved, it’s easy to screw up without someone helping you. I would definitely suggest people who are starting with Terraform to do pairing.
Furthermore, it got pretty lonely. Those who know me will probably say that I’m not the most sociable person they know, so trust me. I barely spoke to my teammates for two weeks, even though I was sitting right next to them. I couldn’t ask for insights — since nobody knew squad about what I was doing — or even just complain about my fate.
I also think that I aimed too high. I had to learn/remember so many things at once: Terraform, CodeDeploy, PM2, Computer Networks, AWS quirks and how to setup a full-blown multi-regional application. It was just too much. I should’ve started small and make improvements with time.
The end (?)
That was my story. I tried to summarize it as much as possible, but failed miserably.
I will probably write another post from a more technical perspective within the next few days, sharing the little I learned from this endeavor. Keep posted.
Did you like what you just read? Buy me a beer with tippin.me.