Inside CARTO Engineering: Configuration Management Makeover

Summary

In order to achieve our goal of being the leading Location Intelligence platform, we continually invest in our infrastructure to provide the best service to our clients. Today we are telling a success story about how we revamped our configuration management system.


Our primary goal at CARTO is to be the world’s leading Location Intelligence platform, empowering our clients with the best data and the best spatial analysis. We frequently discuss the latest techniques in spatial data science and spatial modeling, as well as the specific use cases of our clients and the ways they are unlocking new insights for their businesses with spatial analysis.

That said, a big part of striving towards the goal of being the top LI platform means building, updating, and maintaining our technology stack. Today we are excited to share a bit more about how we explore and implement changes to that stack so that our customers experience optimal service with CARTO.

Our Configuration Management Makeover

As you may know, PostgreSQL is the heart of our technology stack. We manage a big farm of database instances with different configurations, sizes, and requirements, and to be honest, when it comes to databases we don't use any fancy container orchestration technology. We are still using a traditional configuration management system. In our case, that system is Chef.

Over a year ago we - the Site Reliability Engineering team at CARTO - made an important change to the way we manage our infrastructure. We are excited to share how we moved from Chef-server to a combination of Chef-solo and Consul, and how that change impacted our day-to-day lives.

Addressing Institutional Pains

So what was the problem with the Chef-server setup at CARTO? Well, as the number of instances and the number of people working on them grew, the workflow started to become sub-optimal: our cookbooks are under version control, so we used to edit a recipe, push it to Chef-server, and then commit the changes to Git.

For example, it was very common to upload a new version of a given cookbook to Chef-server in order to test it in the staging environment before merging the code into the master branch. However, if a colleague had just uploaded a different version of the same cookbook, you’d end up with cookbook versions stored in Chef-server but missing from Git. It would be better if we were able to use the same workflow as the rest of the developers in the company!

Another issue was performance: a single server converging the configuration and attributes of all nodes and serving more than 1,000 machines was really slow. Maintaining Chef-server is not our core business, and our setup was not optimal.

Last but not least, we used Chef-server as a way to do service discovery! It somehow works, and it is very tempting to use it this way because all of the information is available and centralized in Chef-server, but there are modern tools out there specifically designed to provide fast and reliable service discovery, such as Consul.

Making a Plan

While improving this system might sound simple at first glance, there are a few things to take into account:

  1. Improving the workflow is the top priority.
  2. We are committed to performing the full operation with zero downtime: no midnight interventions or tiny downtimes are allowed.
  3. Our infrastructure is complex: we run many technology stacks in addition to massive databases, which are inherently hard to manage and scale.
  4. We have a lot of Chef recipes, and rewriting the whole thing would mean a year of work:

~/src/chef/cookbooks$ find . -print | egrep "/recipes/" | wc -l
1027
~/src/chef/roles$ ls | wc -l
213
~/src/chef/cookbooks$ ls | wc -l
249


Chef-solo and Consul to the rescue! Using Chef-solo allowed us to reuse the same cookbooks and roles, and all we had to deal with were the following points:

  1. Replacing roles written in Ruby with JSON roles, since Ruby roles are not compatible with Chef-solo (see the sketch after this list).
  2. Removing cookbook versioning: now the only version that counts is the one in Git, which became our single source of truth.
  3. Finding a way to search for users defined in Chef data bags.
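
To give an idea of what that first point means in practice, a Chef-solo role is just a JSON file in the roles folder. The role name, run list, and attribute values below are made up for illustration, but the structure is the standard JSON role format:

{
  "name": "web-server",
  "description": "Frontend web servers",
  "chef_type": "role",
  "json_class": "Chef::Role",
  "run_list": [
    "recipe[nginx]",
    "recipe[carto-web]"
  ],
  "default_attributes": {
    "nginx": { "worker_processes": 4 }
  }
}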

The rest is left to Consul. We already had some experience using it, as it was required for some other services, so what did we do to adapt it to our problem? We simply registered Chef roles as Consul services.
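
As a rough sketch of what that looks like (service name, port, and health check are made up for illustration), every node holding a given Chef role gets a Consul service definition like this dropped into its local agent's configuration directory:

{
  "service": {
    "name": "web-server",
    "tags": ["chef-role"],
    "port": 80,
    "check": {
      "http": "http://localhost:80/health",
      "interval": "10s"
    }
  }
}

Once the agent reloads, web-server.service.consul resolves to the healthy nodes holding that role, which is exactly the kind of question we used to ask Chef-server.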

One caveat: Chef-server comes with the very powerful knife tool, which we would lose; we had to build a replacement that uses Consul as a backend.

Development Towards Infrastructure Improvement

We gradually started to introduce Consul as another option for service discovery: controlled by Chef attributes, either Consul or Chef search was used in a given cookbook for this purpose.
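
Conceptually, the toggle looked something like the recipe fragment below. The attribute name and the consul_search helper are placeholders for illustration rather than our exact code:

# Hypothetical attribute selecting the discovery backend for this cookbook.
db_hosts =
  if node['carto']['discovery_backend'] == 'consul'
    # Placeholder for our Chef library that queries the local Consul agent's
    # REST API for the healthy nodes of a service.
    consul_search('postgres')
  else
    # Classic Chef-server search, kept as a fallback during the migration.
    search(:node, 'role:postgres').map { |n| n['ipaddress'] }
  end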

Consul search was introduced via a Chef library that harnesses the local Consul REST API on all machines. But to make it work correctly we had to make some modifications! If you are using Chef you probably know about its two-phase execution model (compile, then converge); the key idea here is to make sure the value discovered via Consul is available at the moment templates are evaluated. Fortunately, adapting some recipes to use lazy evaluation was enough to solve this issue.
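
In practice that means wrapping the Consul lookup in lazy so that it runs at converge time instead of compile time. A simplified, hypothetical example (consul_search is the same placeholder helper as above):

template '/etc/carto/database.yml' do
  source 'database.yml.erb'
  # Without lazy, this block would be evaluated during the compile phase,
  # before the Consul data is available; lazy defers it to converge time.
  variables lazy {
    { db_host: consul_search('postgres').first }
  }
end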

Likewise, to replace the user search feature provided by Chef-server, we found the chef-solo-search library, which, despite being somewhat outdated, worked perfectly for us.
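
With that library in place, the usual data bag search keeps working under Chef-solo. A minimal example (the bag contents and group name are illustrative):

# chef-solo-search makes search() read the data bag files shipped in the repo,
# so recipes that look up users keep working without a Chef-server.
search(:users, 'groups:sysadmin').each do |u|
  user u['id'] do
    shell u['shell']
    home "/home/#{u['id']}"
  end
end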

Once we were confident that Consul search was working as expected, we completely removed Chef-server searches from our cookbooks. Hooray!

Our next task was to figure out how to distribute and run recipes on instances: we created a Chef-solo wrapper, a Python script that is installed on all our machines. It connects to our Git repository, downloads the corresponding Chef code, and finally runs Chef-solo. Now that we had control at this level, we took advantage of it and introduced two extra perks:

  1. We made the wrapper aware of Git branches: this enables us to temporarily run a specific code branch on a specific subset of instances for a limited amount of time. This feature has proven really valuable, not just for our team but for other teams as well.
  2. The ability to define special nodes in a cluster and give them different treatment, such as running another set of Chef roles. We did that by creating a file named after the node's FQDN in a specific folder in the repo, as sketched below. This allowed us to offload complexity from recipes to a simple per-node role override.
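
As an illustration of that second point (path, hostname, and role are hypothetical), a node override is just a JSON file named after the machine's FQDN whose run list replaces the cluster default, and the wrapper hands it to Chef-solo:

nodes/db-special-01.example.com.json:

{
  "run_list": ["role[postgres-primary]"]
}

chef-solo -c solo.rb -j nodes/db-special-01.example.com.json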

We knew we would miss our beloved knife tool, given that in the past we used it for almost everything! And so we made carto_infra_cli, or cic. It is a Python tool that replaces some of the utilities provided by knife and uses Consul’s REST API and the cloud providers’ APIs to provide some important functionality for our daily operations: searching for the nodes belonging to a specific role (e.g. web-servers), listing roles, or listing healthy nodes. Did I say we love knife? We made the syntax similar to knife:

carto infra cli demo
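
To give a flavor of that syntax, here is a side-by-side of an everyday search; the cic subcommand shown is only an illustrative approximation, not a published interface:

knife search node 'role:web-servers'   # before: asks Chef-server
cic search role:web-servers            # after: asks the local Consul API (hypothetical syntax)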


A pleasant surprise was the huge performance boost when doing searches: in our case we found the new setup 10x faster than running searches with knife against Chef-server.

Rolling Out Infrastructure Improvements

At that point, Consul was already executing our service discovery searches and providing service addresses via DNS, but we still had Chef-server running and used by the entire platform. Using an Ansible playbook we called solify ;), we migrated the staging nodes to Chef-solo by removing any references to Chef-server, installing and executing the Chef-solo wrapper, and setting the proper firewall rules to ensure those machines couldn’t reach Chef-server.

We gave it some time, looking for any broken pieces, but all we found were some failing CI jobs and some scripts that were still using knife. Once we had fixed them and waited for a week, we followed the same procedure in production. The operation went smoothly, and in the vast majority of cases all we had to do was solify the instances. However, we did face some issues, such as clustering logic automated with Chef, which forced us to migrate those nodes manually.

Lessons Learned by Improving our Stack

Without any doubt, gaining a better understanding of our platform was the most important outcome of this project. Here are some points that we feel deserve special attention:

  1. It is never too late to take a deep breath, stop, think, and try to improve your workflow. It will pay off faster than you expect.
  2. Don’t be afraid of an idea just because you search the internet and don’t find any reference to somebody else taking the same approach (Chef-solo + Consul). It has been some time since this improvement, and we are very happy with the result.
  3. It made us more careful with our cookbooks: it may sound silly, but this project helped us realize that a configuration management system should be used exclusively for configuration management, not to replace manual operations or to do service discovery.

In general terms, we gained speed and efficiency, improving our delivery as a team.

DISCLAIMER: We don’t by any means discourage the use of Chef or Chef-server. We love Chef, and almost all of the issues we faced in the past were due to a sub-optimal implementation or other technical debt.

Have any comments or questions for our Engineering team? Please share your thoughts with us!