Automatically Detecting Areas of Interest
One of the most common reasons people make maps with CartoDB is to explore the patterns in their data in order to make better decisions or to communicate interesting ideas. This exploratory approach to location intelligence can be difficult, but armed with the right analytical workflows and cartographic approaches, people can create incredible new value simply by using a map. In today's blog post I'm going to introduce a lesser-known method that can help you find interesting patterns in polygon data that a simple thematic map (e.g. a choropleth) alone might lead you to overlook.
Choropleths have to be one of the most popular thematic map types produced. One form of choropleth you are almost certainly familiar with right now is the election map, which shows the winning percentages of various candidates in elections (learn how to make one over on the Map Academy). Take a look at the one below, from a recent workshop on the topic. The particulars of the data don't matter to this story yet.
Choropleths can tell us a lot about the geographical distribution of the data we are mapping, but sometimes they lack information that can be critical to gaining insights and extracting value. For example, depending on your choice of color bins and ramp, a choropleth may highlight or obscure key attributes such as the average value and the shape of the data's distribution. Still, choropleths allow us to explore the distribution of data on a map. Some trends become clear easily: the Midwest is a strong place for Republican candidates, while urban areas and more populated states tend to vote more Democratic.
What we often find with choropleths is that they are good for quickly learning about your data, but become more useful the more you already know. So how can you learn more about the patterns in your data than a choropleth alone can show you?
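To make the point about color bins concrete, here is a minimal sketch comparing two common binning schemes, equal interval and quantile, on a skewed set of values. The function names and the sample numbers are made up for illustration; they are not taken from the election or Census data.

```python
def equal_interval_breaks(values, k):
    """Upper bounds of k bins that split the data range into equal widths."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Pin the last break to the maximum so rounding never drops the top value.
    return [lo + width * i for i in range(1, k)] + [hi]


def quantile_breaks(values, k):
    """Upper bounds of k bins holding roughly equal numbers of values."""
    ordered = sorted(values)
    n = len(ordered)
    breaks = []
    for i in range(1, k + 1):
        last = (n * i + k - 1) // k - 1  # index of the last value in bin i
        breaks.append(ordered[last])
    return breaks


def bin_counts(values, breaks):
    """Count how many values fall into each bin (upper bound inclusive)."""
    counts = [0] * len(breaks)
    for v in values:
        for i, upper in enumerate(breaks):
            if v <= upper:
                counts[i] += 1
                break
    return counts


# Hypothetical, skewed percentages: most are small, one is large.
sample = [1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 20.0]

equal_counts = bin_counts(sample, equal_interval_breaks(sample, 3))
quantile_counts = bin_counts(sample, quantile_breaks(sample, 3))
print(equal_counts)     # nearly everything lands in the lowest bin
print(quantile_counts)  # values are spread evenly across bins
```

With equal intervals the single large value stretches the range, so almost every other value shares one color. Quantile bins spread the colors evenly but hide how far the large value sits from the rest.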
Exploring the Census
Let's visualize American Community Survey data from the United States Census, focusing on just a couple of variables. Here we're looking at the ratio of the number of people who 'worked at home' to the number of workers 16 years and older:
As you can easily see, there are dozens of counties in the Upper Midwest, west of Minnesota and Iowa, where a large share of people reported that they work from home (WFH). From this map we can infer that work environments change north and west of the line from Dallas to Chicago: many counties to the north and west have values in the double digits, while counties to the south and east are around or below the mean of 4.7%. There are also several outliers scattered throughout the country, easily seen as dark spots.
Some of the natural groupings of work-from-home counties are obviously not contained by county borders (there are broader clusters of WFH hotspots and coldspots), but in other regions the pattern is more contained, like the anomaly in southern Missouri. This leads us to wonder how we can detect the spatial correlation of work-from-home percentages from one county to the next, to better identify regional behavior.
Tobler's First Law of Geography: Everything is related to everything else, but near things are more related than distant things.
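If you want to compute this ratio yourself, it might look something like the sketch below. The function name and the sample counts are made up for illustration; they are not actual ACS column names or figures.

```python
def wfh_percent(worked_at_home, workers_16_plus):
    """Percent of workers 16 and older who worked at home.

    Returns None when a county reports no workers, to avoid dividing by zero.
    """
    if workers_16_plus <= 0:
        return None
    return 100.0 * worked_at_home / workers_16_plus


# Hypothetical county: 470 of 10,000 workers worked at home.
example = wfh_percent(470, 10_000)
print(example)
```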
Finding Spatial Clusters
Let's try a different tack with this data. Obviously there are regions where counties are correlated with each other. That is, a county with a high percentage of WFH lies adjacent to other counties with high percentages, and vice versa for counties with low percentages. A lot of the country does not seem correlated with its neighbors: there are some highs and some lows, but most counties seem close to the mean value of workers working from home.
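The idea of comparing a county with its neighbors can be sketched as a simple neighbor average (often called a spatial lag). This is a minimal illustration that assumes we already know which counties touch each other; the county names, values, and the `neighbors` structure are hypothetical.

```python
def neighbor_average(values, neighbors):
    """Average value of each county's neighbors.

    `values` maps county -> value, `neighbors` maps county -> list of
    adjacent counties. Counties with no neighbors get None.
    """
    averages = {}
    for county, adjacent in neighbors.items():
        if adjacent:
            averages[county] = sum(values[n] for n in adjacent) / len(adjacent)
        else:
            averages[county] = None
    return averages


# Hypothetical toy data: four counties in a row, A-B-C-D.
values = {"A": 10.0, "B": 12.0, "C": 3.0, "D": 2.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

lag = neighbor_average(values, neighbors)
for county in values:
    print(county, values[county], lag[county])
```

A county whose own value and neighbor average are both high (or both low) is a candidate for a cluster; a county whose value and neighbor average disagree is a candidate for an outlier.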
Let's classify counties by how correlated they are with neighbors. The conditions we're looking for:
- Highs, where a county and its neighbors have a high value on average
- Lows, where a county and its neighbors have a low value on average
- Neutrals, where the averages of neighbors tend towards the mean (or what would be expected from a random distribution), so the counties don't seem to be clustered
- Outliers, for counties that don't fit the first three conditions (more on these later)
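The four conditions above can be sketched as a small decision function. This is a simplification under stated assumptions: "high" and "low" are taken to mean above or below the overall mean, and the `significant` flag is assumed to come from a separate statistical test. The function name and sample numbers are hypothetical.

```python
def classify(value, neighbor_mean, overall_mean, significant):
    """Assign one of the four types: high, low, neutral, or outlier."""
    if not significant:
        # Indistinguishable from a random arrangement of values.
        return "neutral"
    is_high = value > overall_mean
    neighbors_high = neighbor_mean > overall_mean
    if is_high and neighbors_high:
        return "high"
    if not is_high and not neighbors_high:
        return "low"
    # The county and its neighbors fall on opposite sides of the mean.
    return "outlier"


# Hypothetical counties, using 4.7 as the overall mean.
print(classify(12.0, 10.0, 4.7, True))
print(classify(2.0, 3.0, 4.7, True))
print(classify(12.0, 2.0, 4.7, True))
print(classify(12.0, 10.0, 4.7, False))
```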
To find these groupings we draw upon a geostatistical method for detecting clusters and outliers called [Local Moran's I], which allows us to test the distribution of our attribute of interest over geography. Moran's I is one of many statistical approaches we have been exploring lately by combining CartoDB with the PySAL library (another blog post on that soon).
By applying the Local Moran's I algorithm, we classify our data into types similar to those above. To visualize the result, we simply style the counties according to these types.
Below is a map with the clustering overlaid on the choropleth we used above for work-from-home percentages. I did this to help show how the choropleth informs how and why the clusters, neutrals, and outliers are formed. The neutrals are masked (dark grey over the choropleth) because they are not 'significant'. Here, 'not significant' means that their arrangement is consistent with what would be expected if the WFH percentages of each county were distributed randomly on the map.
What is exciting about this approach is that the clustering algorithm picks apart which variations in our data are interesting (i.e. outliers or parts of a cluster) and which may just be random variation relative to neighbors. We now have an automated way of finding areas of interest for future study. That's pretty exciting!
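For the curious, here is a minimal pure-Python sketch of one common formulation of the Local Moran's I statistic: each county's deviation from the mean, multiplied by the average deviation of its neighbors, scaled by the variance. It is not the PySAL implementation and not necessarily the exact computation behind the maps in this post; the function name, the toy values, and the neighbor lists are assumptions for illustration.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I per county, using equally weighted neighbors.

    I_i = (z_i / m2) * mean(z_j for neighbors j), where z is the deviation
    from the overall mean and m2 is the mean squared deviation.
    Assumes the values are not all identical (otherwise m2 is zero).
    """
    counties = list(values)
    n = len(counties)
    mean = sum(values[c] for c in counties) / n
    z = {c: values[c] - mean for c in counties}
    m2 = sum(d * d for d in z.values()) / n
    result = {}
    for c in counties:
        adjacent = neighbors[c]
        lag = sum(z[a] for a in adjacent) / len(adjacent) if adjacent else 0.0
        result[c] = z[c] * lag / m2
    return result


# Hypothetical toy data: four counties in a row, A-B-C-D.
values = {"A": 10.0, "B": 12.0, "C": 3.0, "D": 2.0}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

local_i = local_morans_i(values, neighbors)
for county, stat in local_i.items():
    print(county, round(stat, 3))
```

Positive values mean a county resembles its neighbors (both above or both below the mean); negative values mean it differs from them. In this toy row, A and D come out positive, while B and C, which sit on the boundary between the high pair and the low pair, come out slightly negative.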
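One common way to decide what counts as 'significant' is a conditional permutation test: keep a county's own value fixed, repeatedly hand it randomly chosen values from the rest of the map as stand-in neighbors, and see how often the shuffled statistic is as extreme as the observed one. The sketch below is a simplified illustration of that idea, not necessarily how the maps in this post were produced. The function names, the toy values, the 999 permutations, and the seed are all assumptions.

```python
import random


def local_i(z_i, neighbor_z, m2):
    """Local Moran's I for one county given its neighbors' deviations."""
    return z_i * (sum(neighbor_z) / len(neighbor_z)) / m2


def pseudo_p_value(county, values, neighbors, permutations=999, seed=0):
    """Pseudo p-value for one county (it must have at least one neighbor)."""
    rng = random.Random(seed)
    n = len(values)
    mean = sum(values.values()) / n
    z = {c: v - mean for c, v in values.items()}
    m2 = sum(d * d for d in z.values()) / n
    k = len(neighbors[county])
    observed = local_i(z[county], [z[a] for a in neighbors[county]], m2)
    others = [z[c] for c in values if c != county]
    larger = 0
    for _ in range(permutations):
        # Draw k random stand-in neighbors from the rest of the map.
        simulated = local_i(z[county], rng.sample(others, k), m2)
        if simulated >= observed:
            larger += 1
    # Fold so the count reflects whichever tail the observed value is in.
    larger = min(larger, permutations - larger)
    return (larger + 1) / (permutations + 1)


# Hypothetical data: a high county whose two neighbors are also high,
# on a map where the ten other counties are low.
values = {"hub": 12.0, "n1": 11.0, "n2": 10.0}
lows = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 1.5, 3.0]
values.update({"c%d" % i: v for i, v in enumerate(lows)})
neighbors = {"hub": ["n1", "n2"]}

p = pseudo_p_value("hub", values, neighbors)
print(p)
```

A small pseudo p-value says the arrangement around the county is unlikely under random placement, so the county would keep its color; a large one would put it under the dark grey mask.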
Let's walk through some of the observations we can make looking at the outputs:
- A broad region extends from Colorado through Nebraska, South and North Dakota, Montana, and Idaho, petering out in eastern Oregon. This is evidently America's work-from-home region. There's also a more localized cluster centered west of Lake Tahoe in Northeastern California.
- The Southeastern United States has a broad not-working-from-home region that extends up east and west of the Appalachian Mountains to West Virginia and Eastern North Carolina. Within this region you can see that the counties are consistently the lowest shade in our color ramp (all below the mean).
- A large part of the country has work-from-home rates consistent with what you would expect if the counties were placed randomly across the map, so no clusters or outliers are found in these areas.
What we see in the first two observations is evidence of a clear underlying process driving the clusters in these regions, one that goes well beyond county borders and leads us to ask what this process could be.
There is obviously a big shift in work behavior between the WFH and not-WFH regions mentioned above. What causes these differences? Perhaps a combination of topography, distance to work, population density, and nature of work? Maybe something else entirely?
The clustering also brought out some subtleties that weren't evident in our choropleth. While there is variation across a few regions, these counties can't be algorithmically told apart from what you would expect for a random arrangement about the mean value, so these 'not significant' ones lack a border and are masked by dark grey.
So what are outliers?
In the simplification above I purposefully left out a discussion of the outliers. These are statistical outliers in a spatial sense: they are counties that buck the trend in their local area. That is, they are very dissimilar from their neighbors; for instance, a high work-from-home county adjacent to several low work-from-home counties. There are dozens of counties like this in the northern parts of our work-from-home cluster.
In the map above the outliers are typically adjacent to regions of high or low values and sometimes are in an island of non-significant counties. These are areas that are very different from their immediate neighborhood -- maybe explained by a county that has a very different economy from those around it.
The Southern Missouri county (pictured above), which is an ordinary, non-spatial outlier for the dataset, looks like a shoo-in to be a spatial outlier. Instead, this county is surrounded by a lot of variability, which means that it is not correlated with its neighbors.
In the north-central regions of the US there are a number of interesting outliers sitting in the middle of our large work-from-home region. These outliers could indicate a different economy than in the surrounding regions, such as several large cities, a change in industry, or the presence of large institutions like universities. Answers about the causes can be sought in Workplace Area Characteristics data from the Census.
Outliers can be defined more precisely now:
- High outlier: A 'high' in a region of lows (on average)
- Low outlier: A 'low' in a region of highs (on average)
Finding clusters is fun!
We started playing around with this method of finding statistical clusters and outliers across different attributes of the American Community Survey. In the map below we took a look at the population aged 18 or younger, popularly known as Generation Z. This group is interesting because it is largely made up of people who haven't yet left home. It can give you a lot of insight into how a region will change over the coming years as the Gen Z population transitions into the workforce and moves into new homes and apartments nearby.
This time I peeled away the choropleth layer to only show the cluster and outlier analysis. Looking at Generation Z we see some interesting patterns:
- The Appalachians extending up into Maine; Northern Michigan, Minnesota, and Wisconsin; and large parts of Florida have low rates of young people (except in nearby major cities).
- There are two geographically large hotspots of Generation Z population: Eastern New Mexico through Western Texas and a huge area centered on Utah.
The technique is useful because it reveals many patterns that aren't immediately evident in the choropleth map, it gives a strong sense of which areas are above and below the mean value, and it allows us to make inferences about the patterns. Once we have inferences, we can use techniques such as spatial regression to make predictions.
Where to go next?
We're working hard to bring these techniques to all CartoDB users, so keep your eye on our tutorials, our documentation, and of course here on our blog over the coming weeks and months.
If you want to use any of the maps from this blog post feel free to. Here are some [basic instructions for getting them into your stories].
Happy statistical mapping!