In a previous blog post we announced our demographic segmentation service as part of the Data Observatory. In today’s post we will discuss how we generate these segments and how we went about giving them names.
Each one of us is a precious individual snowflake of data… but if you look around your neighborhood you will start noticing similarities to your fellow humans. You might all be roughly the same age, have the same income, drive to work or take the subway. There are patterns in groups of people everywhere you look.
Luckily we can train computers to pick out these kinds of groupings, or clusters, of people. We can take each census tract and a selection of the census variables which describe it, then, using a method called K-means clustering, identify groups of tracts that are statistically similar to each other.
To see how this might work, let’s consider a simple example. Imagine we collected data on people’s ages and the probability they own a record player. We plot the data and it looks something like this:
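As humans it’s really easy for us to pick out that there are three clusters of people. K-means attempts to find these clusters programmatically. In broad strokes, it does this by picking an initial center for each cluster, assigning every point to its nearest center, moving each center to the average of the points assigned to it, and repeating the last two steps until the assignments stop changing.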
The end result of this process is to label each point on the graph a 1, 2 or 3 depending on what cluster it belongs to. If we color the points by the labels k-means gave them the data looks like this:
Awesome, the algorithm has done programmatically what we as humans do instinctively. This example is simple enough that we didn’t need k-means to find the clusters, but what if we had 150 different variables to sift through and wanted to find 55 independent clusters, as we do with the census? Then it’s essential to use an algorithm.
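For the curious, here is a minimal sketch of k-means on a toy version of the record player example. It is an illustration only, not the code behind the Data Observatory segments: the three (age, probability) groups are made up, age is divided by 100 so that both variables live on a similar scale, and the starting centers are picked with a simple "farthest point" rule (an assumption chosen so the result is reproducible). Labels here run from 0 to 2 rather than 1 to 3.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial_centers(points, k):
    """Start from the first point, then repeatedly add the point that lies
    farthest from every center chosen so far (an illustrative choice)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iterations=100):
    """Return (labels, centers) for a list of numeric tuples."""
    centers = initial_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Move each center to the average of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Made-up groups: (typical age, typical probability of owning a record player).
groups = [(25, 0.8), (45, 0.1), (70, 0.7)]
rng = random.Random(42)
points = [
    ((age + rng.uniform(-5, 5)) / 100, prob + rng.uniform(-0.1, 0.1))
    for age, prob in groups
    for _ in range(50)
]

labels, centers = kmeans(points, k=3)
for age, prob in centers:
    print(f"center: age ~{age * 100:.0f}, ownership probability ~{prob:.2f}")
```

With real census data the variables come in very different units, so some form of rescaling like the division by 100 above matters: otherwise the variable with the largest numbers dominates the distance calculation.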
Unfortunately the algorithm can’t determine meaningful names for these clusters; that’s up to us. In this example we might decide to call the yellow cluster ‘young hipsters,’ the blue cluster ‘parents with mp3 players,’ and the green cluster ‘original record player owners.’ These names are subjective but informed by our intuition and the data.
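One way to let the data inform a name is to compare each cluster's average for every variable against the overall average. The sketch below does that for a handful of made-up rows; the variable names (`age`, `owns_record_player`), the values, and the labels are all hypothetical, and this is not necessarily how the segment names were arrived at.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """For each cluster, return how far its average of every variable
    sits above or below the overall average."""
    variables = list(rows[0])
    overall = {v: sum(r[v] for r in rows) / len(rows) for v in variables}
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {
            v: sum(r[v] for r in members) / len(members) - overall[v]
            for v in variables
        }
        for label, members in grouped.items()
    }


# Hypothetical people and the cluster labels k-means might have given them.
rows = [
    {"age": 24, "owns_record_player": 0.9},
    {"age": 26, "owns_record_player": 0.7},
    {"age": 44, "owns_record_player": 0.1},
    {"age": 46, "owns_record_player": 0.1},
    {"age": 69, "owns_record_player": 0.8},
    {"age": 71, "owns_record_player": 0.6},
]
labels = [0, 0, 1, 1, 2, 2]

profiles = cluster_profiles(rows, labels)
for label, profile in sorted(profiles.items()):
    summary = ", ".join(f"{name}: {diff:+.2f}" for name, diff in profile.items())
    print(f"cluster {label}: {summary}")
```

A cluster that comes out younger than average but more likely than average to own a record player is a reasonable candidate for a name like ‘young hipsters.’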
Applying the procedure outlined above allows us to segment the census into neighborhoods that fall into one of 55 different clusters. After many hours of staring at plots of the census variables in each cluster, we gave each one a name. No doubt some of these names can be improved, and we are going to keep working on getting more accurate descriptions of these neighborhoods, but we wanted to set you loose on them early.
To get a feel for just how diverse a place the U.S. is, here are the neighborhood segments for multiple U.S. cities:
You can explore the 55 segments in more detail using this deep insights dashboard:
To create similar or even better visualizations, you can watch our Data Observatory webinar as many times as you need to!
We are working hard to generate segmentation for countries outside the U.S. Keep an eye on the blog and CartoDB to find out when we will be launching these.
Until then, happy data mapping!