From understanding the dynamics of a business, to modelling physical and biological processes, scale matters. For spatial data, the concept of scale is closely related to that of shape. Point-referenced data, like point of interests (POIs), are dimensionless, and as such cannot be related to any scale. On the other hand, when dealing with data associated to discrete units we typically need to account for their scale, for example by choosing the most appropriate statistical units for Census data. In this case, the scale might occur naturally, as in satellite imagery where it is determined by the sensor’s instantaneous field of view, or as a result of a measurement process, like for survey data that need to be aggregated to a fixed unit to provide meaningful statistics.
But the question is, how do we choose the right scale? Or, in other words, what is the lens through which we should look at our problem? The choice of an appropriate scale is an extremely important one, first because mechanisms vital to a process at one scale may be uninformative at another. Additionally, inferences might depend on the aggregation level or, even for the same scale, on how we choose to aggregate: this source of aggregation and/or zonal bias is a well known issue, the so-called Modifiable Areal Unit Problem (MAUP).
Usually, however, we do not have full control on this choice. As new sources of location data are becoming more available, data scientists are frequently faced with the problem of how best to integrate spatial data at multiple scales. Think for example of site planning: here, the focus is usually on summarizing the characteristics of a store location in terms of its clients’ demographics (age, population, income) and some geographical characteristics of the surrounding area (density of POIs, number of daily visitors, etc.). How can we combine sources that, because of the difference in scale, not only represent misaligned geographical regions, but also are associated with a different data “volume”? A solution to this problem seems to “transform” the data to a common spatial scale.
The problem of transforming the data from one spatial scale to another, has a name in statistics: it is called a change of support problem (COSP), where the support is defined as the size or the volume associated with each data value.
Changing the support of a variable is a challenging task and one that we are constantly dealing with at CARTO. Our Data Observatory, currently at version 2.0, contains a comprehensive collection of spatial data from different sources that data scientists can easily access using our APIs for a faster and more efficient integration in their workflows. Designing and building this repository presents some challenges: not only are data hard to find and come in different formats requiring lengthy and painful ETL processes, but there is also the question of how we should serve data with potentially different geographical supports to our users. This is when we started thinking of COSP and how can we solve it. Our solution is to try to transform our data on the finest spatial scale useful for analysis. We can always then aggregate to produce inferences over coarser scales.
Here is when statistical methods come in handy: usually we cannot turn to data providers asking for finer-resolved data, as the finest spatial scale might be limited by present technology, as in satellite imagery, by access to the raw sources, as for Census data, or by privacy regulations, as in the case of GPS-derived data.
Census data are especially problematic in this regard: the high costs of gathering data through traditional surveying methods make it challenging to study population characteristics at fine spatial scales. For example, Census data in the US are available at the finest scale at the block group level: block groups vary substantially in shape and size and, especially in sub-urban and rural areas, can represent large geographical regions, as highlighted in this map:
To overcome this barrier, international efforts have been developed to increase the spatial scale of the distribution of human populations. However, these initiatives are only limited to demographic data (total population, by age and gender categories) and have not been extended to the other socio-economic variables available from the Census. This limits the actions and insights that policy makers have available to improve resource allocation, accessibility, and disaster planning amongst others. But the assets are not limited to the public sector: from where to focus marketing campaigns, to site planning and market analysis for retailers, to client segmentation for utility companies, or for the identification of new opportunities for investment funds, being able to get a detailed picture of your population is crucial to uncover the relationship between location and business performance.
Consider for example the case of running a local political campaign: how well do you know your electorate? A striking result from recent Western European and North American elections is that politics seem to have increasingly turned into a struggle between urban and rural voters, as well exemplified by the results of the last presidential elections in the US. While the reasons for this geographical divisions might be very complex, fine scale demographic and socio-economic data could improve the planning of political campaigns and ease these differences that result in a sort of spatial segregation. The popular site planning mantra - location, location, location - could be easily applied here: from door-to-door canvassing, to placement of billboards, researching and understanding your electorate is crucial. To show the insights gained from a magnified view of the demographic and socio-economic characteristics of the population, here we will:
By comparing the results for rural and urban locations obtained from both Census data (by block group) and their downscaled version, we show how detailed information about citizens can be used to inform electoral strategy and guide tactical efforts.
Before jumping into this analysis, let’s see in practice how statistical methods can be used to enhance the spatial scale, effectively downscaling your data.
First we need to create a common grid, which depending on the chosen scale, might contain a huge amount of cells: data for a specific location need be retrieved quickly and stored efficiently. This is what quadkeys are used for: instead of storing polygons, the Earth is divided into quadrants, each uniquely defined by a string of digits with a length that depends on the level of zoom. Like Google S2 cells, or H3 hexbins developed by Uber, quadkeys capture the hierarchy of parent-children cells and are not locked to any specific boundary type, allowing the same unit anywhere on the globe. This is what a grid built with quadkeys with zoom level 17 (about 300 m x 300 m at the Equator) looks like near Albany, NY:
Next, we need a downscaling model in order to transform a random variable from its support () to the support (), the quadkey grid we just created, where this variable is not available or has never been observed. For extensive variables (i.e. one whose value for a block can be viewed as a sum of sub-block values, as in the case of population), the idea is to use data with extensive properties available at the point-level or at a finer-scale than the target support as covariates to model the variable of interest . Using these finer-scale covariates (the red pins in the figure below) has the advantage that they are available both on the source and the target support, since changing their support only requires a summation.
What we are assuming here is that this model is independent of scale, which is not necessarily true. This could mean, for example, that when we transform the data back on the source support their aggregated value will not necessarily be equal to the corresponding actual value. If () and () are nested, we can always restore the coherence between the two supports by “adjusting” the model predictions such that when we sum over all the unit cells on the target support we are sure to get the corresponding value on the source support. When the source and the target support are not nested, for the prediction task we instead need to resort to a hybrid support () resulting from their intersection. For example, if the source support is represented by block groups, this hybrid support will look like this
To finally get the predictions on the target support we then just need to simply sum over all the unit cells on the hybrid support, corresponding to a cell on the target support.
For non-extensive variables (e.g. the median household income), we can still use a similar approach based on fine-scale extensive covariates, training the model on the source support and predicting on the hybrid support. However, in this case the ad hoc adjustment is interpreted as a weighting scheme based on the model trained on the source support for “redistributing” the data rather than a correction for the change of scale. Additionally, we also need to specify the relevant operation (e.g. the arithmetic mean) to transform the predictions from the hybrid to the target support.
What kind of finer-scale covariates are available to downscale Census data? We could imagine that at least some of the main socio-economic characteristics of the US population could be inferred from the population distribution plus the geographical characteristics of the area, for example the density of roads or POIs for different categories, as shown in this map:
In this map, population distribution was obtained from Worldpop, which provides gridded population data (total and by age and gender) at a fine scale (ca. 100 m x 100 m at the Equator).
In order to reduce the model dimension, we started by selecting only a subset of these covariates by running for each Census variable a Generalized Linear Model with a Least Absolute Shrinkage and Selection Operator (or LASSO) regularization. Given the negative log-likelihood for observation , and a parameter which controls the overall strength of the penalty, we can estimate the model coefficients by solving
As we decrease the size of more covariates are included up to the full model (or the largest identified model when we have more variables than observations). To identify the optimal regularization parameter we use K-fold cross validation and selected such that the cross-validation mean-squared error is within one-standard deviation of its minimum, as indicated by the dashed black lines in these plots where the sequence above indicates the number of covariates selected for each value of .
But how well are demographic and geographical covariates likely to predict the socio-economic characteristics of a location? Clearly the answer is variable dependent. While the population distribution, POIs, and roads might be expected to be good predictors for the number of households or the median age, they show a much smaller predictive power for income-related variables. This is well exemplified by the panel on the left (a) in this figure:
That panel shows the K-fold cross validation predictions vs. the actual values on the source support for the median household income. Clearly, the model is not doing a good job, as ideally all the points should be close to the diagonal line. Given that our aim is to derive predictions at a finer scale, to overcome the unavailability of “better” covariates, we adopt a transfer learning approach.
Within this framework, we can add a second step where we add as covariates, other response variables for which the model depending on the selected fine-resolution covariates has better predictive skills. For example, we can train on the source support a model depending on fine-scale covariates for and then train a second model for with both the fine-scale features and as covariates. In this way we “learn” on the source support to transfer the fine-scale features that are predictive of the variations for to and we can then use these models to predict on the target support (under the assumption of independence on the spatial scale). Note that, in order to reduce the model dimension, we also apply a LASSO regularization on this second step. The results after applying this transfer learning approach to model the median household income are shown in the panel in the middle of the figure above (b): the cross-validation accuracy improved notably, with a more than doubled pseudo .
Finally, to further improve our model, we can account for non-linear dependencies of the response variable on the selected covariates using an ensemble tree model (e.g. Random Forest), as shown in the right panel (c) in the figure above for the median household income and in the table below for other selected response variables.
So far, we tested the prediction skills of our model on the source support, where the actual values for the variables we are trying to model are available. However, how can we get a sense of the quality of the predictions on the target support, where the ground truth values are not available? For extensive variables only, we can compare the distribution of the unadjusted aggregated predictions on the source support (i.e. at the block group level), with the distribution of the actual values, as shown in the figure below.
Ideally, these two distributions would look very similar, with any difference indicating possible over-predicting (under-predicting). For example, looking at the plot for the population 16+ not in labor force (c) , this comparison suggests that the model is not able to predict the lowest quantiles for the downscaled predictions, despite a low bias for the cross-correlation predictions on the source support.
To quantify the similarity between the actual distribution and the distribution of the unadjusted aggregated predictions , we can compute the (first order) Wasserstein distance, which measures the minimal amount of work for transporting one unit of mass from to
where is the set of probability distributions on of all couplings of and . The larger , the further away the two distributions are and therefore the larger the error on the target support due to the change of scale. For example as annotated in the previous figure, we can see that the Wasserstein distance for the now married variable is larger than for the households variable, suggesting for the latter more accurate predictions not only on the source support (c. f. the previous table) but also on the target support.
Finally, we can plot the downscaled predictions, as shown in the figure below for some selected Census variables
Going back to our analysis of the US electorate, given the downscaled Census, we can apply the results of a multivariate analysis to compare the similarity of two geographic locations, as described in a previous post. The following map shows the similarity skill score near Albany (NY) with respect to a selected location in a rural area (identified by the purple dot): the score is positive if and only if a unit cell is more similar to the selected location than the mean vector data, with a similarity skill score of 1 meaning perfect matching.
Based on the demographic and socio-economic characteristics of the US population only, we find that for the selected location the most similar sites are all located mainly in sparsely populated areas, without having included location explicitly in our analysis. Similarly, if we instead select an urban neighbourhood characterized by a wealthier and younger population, we only retrieve mostly urban sites
Looking at these maps it is clear that the model-based downscaled data are able to capture spatial clustering and heterogeneity, which is essential for a better understanding of electoral behaviour and planning of political campaigns.
On the other hand, these fine-scale local insights are lost when we apply the same analysis on the actual data simply transformed to match the target support by allocation proportional to area (area-interpolation) as a simple imputation strategy. This can be seen when we compare the following maps with those shown above
It is clear that the differences that we see between the two selected locations do not just reflect geographical differences, but also the divergence of rural and urban economies. And using the model-based downscaled data, we can for the first time capture this separation at a very fine scale. Fom setting up billboards to door-to-door canvassing, local estimates of the demographic and socio-economic characteristics of the population might help to better target the relevant households and optimize the budget, boost turnout, and efficiently transform votes into seats.
Learn more about CARTO’s location data streams today, and get in touch with our team in case we can help you magnify your insights with CARTO’s statistical downscaling models.
The new SARS-CoV-2 coronavirus, which emerged in the Chinese city of Wuhan in late 2019, continues to cross borders. The epidemic, one of the biggest health crises in recen...Data Science
Seeking an understanding of resident health based on the social factors within a neighborhood is not a new concept. But efforts to do so have long been overly simplistic, w...Data Science
Please fill out the below form and we'll be in touch real soon.