Our primary goal at CARTO is to be the world’s leading Location Intelligence platform, empowering our clients with the best data and the best spatial analysis. We frequently discuss the latest techniques in spatial data science and spatial modeling, as well as the specific use-cases of our clients, and the ways they are unlocking new insights for their businesses with spatial analysis.
That said, a big part of striving towards the goal of being the top LI platform means building, updating, and maintaining our technology stack. Today we are excited to share a bit more about how we explore and implement changes to that stack so that our customers experience optimal service with CARTO.
As you may know, PostgreSQL is the heart of our technology stack. We manage a big farm of database instances with different configurations, sizes, and requirements, and to be honest, when it comes to databases, we don’t use any fancy container orchestration technology. We are still using a traditional configuration management system. In our case, that system is Chef.
Over a year ago, we - the Site Reliability Engineering team at CARTO - made an important change to the way we manage our Infrastructure. We are excited to share how we moved from Chef-server to a combination of Chef-solo and Consul, and how that change impacted our day-to-day lives.
So what was the problem with the Chef-server setup at CARTO? Well, with increases in the number of instances and the number of people working on them, the workflow started to become sub-optimal: our cookbooks are under version control, so we used to edit a recipe, push it to Chef-server, and then commit the changes to git.
For example, it was very common to upload a new version of a given cookbook to Chef-server, in order to test it in the staging environment, before merging the code into the master branch. However, if a colleague had just uploaded a different version of the same cookbook, you’d end up with cookbook versions stored in Chef-server but missing from git. It would be better if we could use the same workflow as the rest of the developers in the company!
Another issue was performance: a single server that converged the configuration and attributes of every node, while serving more than 1000 servers, was really slow. Maintaining Chef-server is not our core business and our setup was not optimal.
Last but not least, we used Chef-server as a way to do service discovery! It somehow works and it is very tempting to use in this way because all of the information is available and centralized in the Chef-server, but there are modern tools out there specifically designed to provide fast and reliable service discovery such as Consul.
While improving this system might sound simple at first glance, there are a couple of things to take into account:
Chef-solo and Consul to the rescue! Using Chef-solo allowed us to reuse the same cookbooks and roles, and all we had to deal with were the following points:
The rest is left to Consul. We already had some experience with it, as it was required for some other services, so what did we do to adapt it to our problem? We simply registered Chef roles as Consul services.
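To illustrate the idea of registering Chef roles as Consul services, here is a minimal Python sketch. It assumes a Consul agent listening on the default local address and uses the agent's service registration endpoint (`/v1/agent/service/register`). The function names, the ID scheme, and the `chef-role` tag are hypothetical and chosen for illustration only; they are not taken from our actual cookbooks.

```python
import json
import urllib.request

# Assumption: the local Consul agent listens on its default HTTP address.
CONSUL_AGENT = "http://127.0.0.1:8500"


def role_to_service(role, node_name, address):
    """Build a Consul service definition representing one Chef role on one node."""
    return {
        "ID": f"{role}-{node_name}",  # hypothetical ID scheme, unique per node
        "Name": role,                 # the Chef role name becomes the service name
        "Address": address,
        "Tags": ["chef-role"],        # hypothetical tag marking role-derived services
    }


def services_for_node(roles, node_name, address):
    """Build one service definition per role assigned to the node."""
    return [role_to_service(role, node_name, address) for role in roles]


def register(service, agent=CONSUL_AGENT):
    """Send a service definition to the local agent's registration endpoint."""
    request = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(service).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

With something like this in place, a node holding the roles `web-servers` and `postgresql` would show up in Consul as two services, one per role, each pointing at the node's address.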
One caveat: Chef-server comes with the very powerful knife tool, which we would lose. We had to build a tool that replaces it but uses Consul as a backend.
We gradually introduced Consul as another option for service discovery: depending on some Chef attributes, a cookbook used either Consul or chef-search for this purpose.
Consul search was introduced via a Chef library that harnesses the local Consul REST API on all machines. But to make it work correctly, we had to make some modifications! If you are using Chef, you probably know about Chef’s two-phase execution model. The key idea here is to make sure the value discovered through Consul is available at the moment templates are evaluated. Fortunately, adapting some recipes to use lazy evaluation was enough to solve this issue.
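The kind of lookup such a library performs against the local agent can be sketched in Python as follows. This is an illustration rather than our Chef library (which is written in Ruby): it assumes the default local agent address and queries Consul's health endpoint (`/v1/health/service/<name>`) for instances whose checks are passing. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def healthy_addresses(entries):
    """Extract sorted (node name, address) pairs from a health endpoint response."""
    result = []
    for entry in entries:
        node = entry["Node"]
        service = entry["Service"]
        # A service may not define its own address; fall back to the node address.
        address = service.get("Address") or node["Address"]
        result.append((node["Node"], address))
    return sorted(result)


def search_role(role, agent="http://127.0.0.1:8500"):
    """Return healthy instances of the service registered under a role name."""
    url = f"{agent}/v1/health/service/{urllib.parse.quote(role)}?passing=true"
    with urllib.request.urlopen(url) as response:
        return healthy_addresses(json.load(response))
```

Keeping the response parsing separate from the HTTP call makes the logic easy to test without a running agent.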
Likewise, to replace the user search feature provided by Chef-server, we found the chef-solo-search library, which, despite being somewhat outdated, worked perfectly for us.
Once we were confident that Consul search was working as expected, we completely removed chef-server searches from our cookbooks. Hooray!
Our next task was to figure out how to distribute and run recipes on instances: we created a Chef-solo-wrapper Python script that is installed on all our machines. It connects to our git repository, downloads the corresponding Chef repository and finally runs Chef-solo. Now that we had control at this level, we took advantage of it and introduced two extra perks:
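Setting the perks aside, the basic fetch-and-run flow of such a wrapper might look like the sketch below. It is not our actual script: the repository layout, the `solo.rb` location, and the function names are assumptions made for illustration. It relies only on `git clone` and on the `-c` (config file) and `-j` (JSON attributes) options of `chef-solo`.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(repo_url, branch, workdir, node_json):
    """Return the git and chef-solo command lines for one run."""
    checkout = Path(workdir) / "chef-repo"  # hypothetical checkout directory name
    clone = [
        "git", "clone", "--depth", "1", "--branch", branch,
        repo_url, str(checkout),
    ]
    converge = [
        "chef-solo",
        "-c", str(checkout / "solo.rb"),  # assumption: solo.rb lives at the repo root
        "-j", str(node_json),             # JSON file with the node's run list and attributes
    ]
    return clone, converge


def run(repo_url, branch, node_json):
    """Clone the Chef repository into a temporary directory and run chef-solo."""
    with tempfile.TemporaryDirectory() as workdir:
        for command in build_commands(repo_url, branch, workdir, node_json):
            # check=True stops the run if git or chef-solo exits with an error.
            subprocess.run(command, check=True)
```

A shallow clone keeps each run cheap, and using a temporary directory means every run starts from a clean copy of the repository.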
We knew we would miss our beloved knife tool, given that in the past we used it for almost everything! And so we made carto_infra_cli, or cic. It is a Python tool that replaces some of the utilities provided by knife and uses Consul’s REST API and cloud providers’ APIs to give us some important functionality in our daily operations: searching for nodes belonging to a specific role (e.g. web-servers), listing roles, or listing healthy nodes. Did I say we love knife? We made the syntax similar to knife:
A pleasant surprise was a huge performance boost in searches: in our case, the new setup was 10x faster than searching with knife against Chef-server.
At that moment, Consul was already executing our service discovery searches and providing service addresses via DNS, but we still had Chef-server running and used by the entire platform. Using an Ansible playbook we called solify ;), we had already migrated staging nodes to Chef-solo by removing any references to Chef-server, installing and executing the Chef-solo-wrapper, and setting the proper firewall rules to ensure those machines couldn’t reach Chef-server.
We gave it some time, looking for any broken pieces, but all we found were some failing CI jobs and some scripts that were using knife. Once we had fixed them and waited for a week, we followed the same procedure on production. The operation went smoothly and, in the vast majority of cases, all we had to do was to solify the instances. However, we faced some issues, such as clustering logic automated with Chef, that forced us to migrate those nodes manually.
Without any doubt, gaining a better understanding of our platform was the most important outcome of this project. Here are some points that we feel deserve special attention:
In general terms, we gained speed and efficiency, improving our delivery as a team.
DISCLAIMER: We don’t, by any means, discourage the usage of Chef or Chef-server. We love Chef, and almost all of the issues we faced in the past were due to sub-optimal implementation or other technical debt.
Have any comments or questions for our Engineering team? Please share your thoughts with us!