Inside CARTO Engineering: Configuration Management Makeover

Our primary goal at CARTO is to be the world’s leading Location Intelligence platform, empowering our clients with the best data and the best spatial analysis. We frequently discuss the latest techniques in spatial data science and spatial modeling, as well as the specific use-cases of our clients, and the ways they are unlocking new insights for their businesses with spatial analysis.

That said, a big part of striving towards the goal of being the top LI platform means building, updating, and maintaining our technology stack. Today we are excited to share a bit more about how we explore and implement changes to that stack so that our customers experience optimal service with CARTO.

Our Configuration Management Makeover

As you may know, PostgreSQL is the heart of our technology stack. We manage a big farm of database instances with different configurations, sizes, and requirements, and to be honest, when it comes to databases, we don’t use any fancy container orchestration technology. We still use a traditional configuration management system. In our case, that system is Chef.

Over a year ago, we, the Site Reliability Engineering team at CARTO, made an important change to the way we manage our infrastructure. We are excited to share how we moved from Chef-server to a combination of Chef-solo and Consul, and how that change impacted our day-to-day lives.

Addressing Institutional Pains

So what was the problem with the Chef-server setup at CARTO? Well, with increases in the number of instances and the number of people working on them, the workflow started to become sub-optimal: our cookbooks are under version control, so we used to edit a recipe, push it to Chef-server, and then commit the changes to git.

For example, it was very common to upload a new version of a given cookbook to Chef-server in order to test it in the staging environment before merging the code into the master branch. However, if a colleague had just uploaded a different version of the same cookbook, you’d end up with cookbook versions stored in Chef-server but missing from git. It would be better if we could use the same workflow as the rest of the developers in the company!


Another issue was performance: a single server that converged the configuration and attributes of all nodes, while serving more than 1000 servers, was really slow. Maintaining Chef-server is not our core business, and our setup was not optimal.

Last but not least, we used Chef-server as a way to do service discovery! It works after a fashion, and it is very tempting to use it this way because all of the information is available and centralized in Chef-server, but there are modern tools, such as Consul, specifically designed to provide fast and reliable service discovery.

Making a Plan

While improving this system might sound simple at first glance, there are a couple of things to take into account:

  1. Improving the workflow is the top priority
  2. We are committed to performing the full operation with zero downtime: no midnight interventions and no brief outages are allowed
  3. Our infrastructure is complex: we run many technology stacks in addition to massive databases, which are inherently hard to manage and scale
  4. We have a lot of chef recipes, and rewriting the whole thing would mean one year of work
~/src/chef/cookbooks$ find . -print | egrep "/recipes/" | wc -l
1027
~/src/chef/roles$ ls | wc -l
213
~/src/chef/cookbooks$ ls | wc -l
249


Chef-solo and Consul to the rescue! Using Chef-solo allowed us to reuse the same cookbooks and roles, and all we had to deal with were the following points:

  1. Replacing roles written in Ruby with JSON roles, since the Ruby roles are not compatible with Chef-solo
  2. Removing cookbook versioning: now the only version that counts is in Git, which became our unique source of truth
  3. Finding a way to search for users defined in Chef databags

The rest is left to Consul. We already had some experience with it, as it was required for some other services, so what did we do to adapt it to our problem? We simply registered Chef roles as Consul services.

One caveat: Chef-server comes with a very powerful knife tool that we would lose, so we had to build a tool that replaces it but uses Consul as a backend.
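As a rough illustration of registering roles as services, the Python sketch below builds one Consul service definition per Chef role and sends it to the local agent through the agent's service registration endpoint. The agent address, the `chef-role` tag, and the helper names are assumptions made for this example, not a description of our actual tooling.

```python
import json
import urllib.request

# Assumption: a Consul agent listens on its default local HTTP port.
CONSUL_AGENT = "http://127.0.0.1:8500"


def role_to_service(role, address):
    """Build a Consul service definition that represents one Chef role."""
    return {
        "Name": role,            # the Chef role name becomes the service name
        "Address": address,      # address other nodes should use for this node
        "Tags": ["chef-role"],   # hypothetical tag marking role-derived services
    }


def register(service, agent=CONSUL_AGENT):
    """PUT the definition to the local agent's registration endpoint."""
    request = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(service).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

Calling `register(role_to_service(role, address))` for each role in a node's run list would make that node discoverable under every role it carries.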

Development Towards Infrastructure Improvement

We gradually introduced Consul as a second option for service discovery. Chef attributes controlled whether a given cookbook used Consul or Chef search for this purpose.

Consul search was introduced via a Chef library that queries the local Consul REST API available on every machine. To make it work correctly, though, we had to make some modifications. If you use Chef, you probably know about its two-phase execution model (compile, then converge): the key was to make sure the value discovered through Consul was available at the moment templates were evaluated. Fortunately, adapting some recipes to use lazy evaluation was enough to solve this issue.

Likewise, to replace the user search feature provided by Chef-server, we found the chef-solo-search library, which, despite being somewhat outdated, worked perfectly for us.
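The real search lives in a Chef (Ruby) library, but the query itself is easy to sketch in Python. The example below asks the local agent's health endpoint for the passing instances of a service and extracts their addresses. The function names and the default agent address are assumptions for illustration only.

```python
import json
import urllib.parse
import urllib.request


def health_url(role, agent="http://127.0.0.1:8500"):
    """URL listing only the instances of a service that pass their checks."""
    return f"{agent}/v1/health/service/{urllib.parse.quote(role)}?passing"


def addresses(entries):
    """Extract one address per instance from a health API response."""
    result = []
    for entry in entries:
        # The service-level address may be empty; fall back to the node address.
        address = entry["Service"].get("Address") or entry["Node"]["Address"]
        result.append(address)
    return sorted(result)


def search(role):
    """Query the local agent and return the addresses of healthy nodes."""
    with urllib.request.urlopen(health_url(role)) as response:
        return addresses(json.load(response))
```

Because the lookup goes to the agent on localhost, no central server has to answer every search.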

Once we were confident that Consul search was working as expected, we completely removed chef-server searches from our cookbooks. Hooray!

Our next task was to figure out how to distribute and run recipes on instances: we created a Chef-solo-wrapper Python script that is installed on all our machines. It connects to our git repository, downloads the corresponding Chef repository, and finally runs Chef-solo. Now that we had control at this level, we took the opportunity to introduce two extra perks:

  1. We made the wrapper aware of Git branches: this lets us temporarily run a specific code branch, for a set amount of time, on a specific subset of instances. This feature has proven really valuable, not just for our team but for other teams as well.
  2. The ability to define special nodes in a cluster and give them different treatment, such as running another set of Chef roles. We did that by creating a file named after the node's FQDN in a specific folder in the repo. This allowed us to move complexity out of recipes and into simple node role overrides.
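The two wrapper features above can be sketched as a pair of small decisions made before Chef-solo runs. This is a minimal Python sketch under stated assumptions: the `node_overrides` folder name, the shape of the override dictionary, and both function names are hypothetical, not taken from our actual wrapper.

```python
from datetime import datetime, timezone
from pathlib import Path


def pick_branch(override, now, default="master"):
    """Return the override branch while it is still valid, else the default.

    `override` is a hypothetical dict such as
    {"branch": "feature-x", "expires": <timezone-aware datetime>}, or None.
    """
    if override and now < override["expires"]:
        return override["branch"]
    return default


def pick_node_file(repo_dir, fqdn, default_file):
    """Prefer a per-node override file named after the FQDN, if one exists.

    The "node_overrides" folder name is an assumption for this example.
    """
    candidate = Path(repo_dir) / "node_overrides" / fqdn
    return candidate if candidate.is_file() else Path(default_file)
```

Giving the branch override an expiry means a forgotten test branch falls back to the default on its own instead of lingering on a subset of machines.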

We knew we would miss our beloved knife tool, given that in the past we used it for almost everything! And so we made carto_infra_cli, or cic. It is a Python tool that replaces some of the utilities provided by knife and uses Consul’s REST API and cloud provider APIs to give us some important functionality for our daily operations: searching for nodes belonging to a specific role (e.g. web-servers), and listing roles or healthy nodes. Did I say we love knife? We made the syntax similar to knife:

carto infra cli demo
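To give a feel for how a knife-like command can be mapped onto Consul, here is a small Python sketch that parses a `role:<name>` query and returns the Consul HTTP API path to call. The subcommand names and the query syntax are assumptions for illustration, and the real cic may look quite different.

```python
import argparse


def build_parser():
    """Command-line parser with two hypothetical knife-like subcommands."""
    parser = argparse.ArgumentParser(prog="cic")
    commands = parser.add_subparsers(dest="command", required=True)
    search = commands.add_parser("search", help="find healthy nodes for a role")
    search.add_argument("query", help="knife-style query, e.g. role:web-servers")
    commands.add_parser("roles", help="list every registered role")
    return parser


def consul_path(argv):
    """Translate command-line arguments into a Consul HTTP API path."""
    args = build_parser().parse_args(argv)
    if args.command == "roles":
        # The catalog endpoint lists all registered services (roles, here).
        return "/v1/catalog/services"
    key, _, value = args.query.partition(":")
    if key != "role" or not value:
        raise ValueError(f"unsupported query: {args.query!r}")
    # The health endpoint with ?passing returns only healthy instances.
    return f"/v1/health/service/{value}?passing"
```

In this sketch, `consul_path(["search", "role:web-servers"])` yields the health endpoint path for the `web-servers` service, which a caller would then fetch from the local agent.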


A pleasant surprise was a huge performance boost when doing searches: in our case, the new setup was 10x faster than searching with knife against Chef-server.

Rolling Out Infrastructure Improvements

At that moment, Consul was already executing our service discovery searches and providing service addresses via DNS, but we still had Chef-server running and used by the entire platform. Using an Ansible playbook we called solify ;), we had already migrated staging nodes to Chef-solo by removing any references to Chef-server, installing and executing the Chef-solo-wrapper, and setting the proper firewall rules to ensure those machines couldn’t reach Chef-server.

We gave it some time, looking for any broken pieces, but all we found were some failing CI jobs and some scripts that were still using knife. Once we had fixed them and waited for a week, we followed the same procedure in production. The operation went smoothly, and in the vast majority of cases all we had to do was solify the instances. However, we did face some issues, such as clustering logic automated with Chef, which forced us to migrate those nodes manually.

Lessons Learned by Improving our Stack

Without any doubt, gaining a better understanding of our platform was the most important outcome of this project. Here are some points that we feel deserve special attention:

  1. It is never too late to take a deep breath, stop, think, and try to improve your workflow. It will pay off faster than you expected.
  2. Don’t be afraid of an idea just because an internet search turns up no reference to anybody taking the same approach (Chef-solo + Consul). It has been some time since we made this change and we are very happy with the result.
  3. It made us more careful with our cookbooks: it may sound silly, but this project helped us to realize that a configuration management system should be exclusively used for configuration management, not to replace a manual operation or to do service discovery.

In general terms, we gained speed and efficiency, improving our delivery as a team.

DISCLAIMER: We don’t, by any means, discourage the usage of Chef or Chef-server. We love Chef, and almost all of the issues we faced in the past were due to sub-optimal implementation or other technical debt.

Have any comments or questions for our Engineering team? Please share your thoughts with us!

About the authors

Ibrahim Menem, Site Reliability Engineer at CARTO.

Javier Villar Fernández, Infrastructure Lead at CARTO.
