Recently we held a DeepFin Investor Tutorial on Spatial Data Science with J.P.Morgan conducted over video conference from London and Madrid. As Spatial data continues to grow in demand across all industries, not least in financial services, this served as an opportunity to explore specific capabilities and practical applications.
Originally published by J.P.Morgan this summary of the workshop covers three main areas:
Figure 1: Annual Revenue ($) Predictions in Long Island
Spatial data science treats location, distance, and spatial interaction as core aspects of the data
Traditional data science applied to spatial data essentially ascribes no additional value to e.g. longitude / latitude of the data though true spatial data science as above employs specialised methods to work with such data i.e. spatial statistic to statistics, spatial databases to databases, and geocomputation to computation. Spatial data science can be thought of as a subset of traditional data science which focuses on the importance of “where” a special characteristic of spatial data. Spatial data comes in all forms and shapes e.g. median household income, number of visitors from GPS sources, POI locations by category - see Figure 2.
Figure 2: Types of Spatial Data
One of the useful aspects of spatial data is that data in nearby locations tends to be similar, encompassed into the First law of Geography by Waldo Tobler. Essentially spatial data is spatially dependent and independence among observations – a common assumption for many statistical methods – is not present. There are a number of measures of spatial dependence varying on whether you are working with continual spatial processes, discrete spatial processes, or point-pattern processes.
Spatial modelling is similar to classical machine learning in analysing spatial data to make inferences about the models parameters, predict unsampled locations and for down/up-scaling applications. Consider a general model of a variable of interest where is the mean structure and is an IID process. When dealing with spatial data we might add to our general model an extra term representing a spatial random effect acting as a spatial smooth as in Figure 3.
Figure 3: Spatial Modelling
The extra term ensures observations that are close in space will also be “close in” and adds the spatial dependence property
Among common methods when building spatial models include Continuous Spatial Error Models (Gaussian Processes), Discrete Spatial Error Models (Gaussian Markov Random Fields), Spatially Varying Coefficient Models, Spatio-temporal models, and Spatial confounding.
Figure 4: Spatial modelling
a) Continuous spatial error models
The left-hand panel displays a continuous spatial error model of predictions (mean) for the concentration of zinc near the Meuse river in the Netherlands obtained using kriging (Gaussian process regression)
b) Discrete spatial error models
Right-hand panel displays predictions for the owner occupied housing value in Boston obtained using INLA and a Besag, York, Mollié (BYM) model for spatial random effects
Akin to clustering in traditional data science, spatial clustering adds spatial constraints to ensure geographies are maintained across groups. Such clustering methods, which build clusters built from sub-geographies, are known as regionalisation.
Figure 5: Clustering techniques
Clustering using DBSCAN: identifies clusters by grouping together items within r distance such that a cluster has m points
Regionalisation enforces contiguity constraints which allows smaller geographies to form larger, continuous regions
Using the CARTO tech stack, in the hands-on section of the workshop, the data scientists from CARTO went through step-by-step demos using Jupyter notebooks, from data exploration, to external data discovery and augmentation, to model formulation and results. The examples addressed two problems:
Want to see this in action?Request a live personalized demo
Site selection refers to the process of deciding where to open a new or relocate an existing store/facility by comparing the merits of potential locations. If some attribute of interest is known (e.g. the revenues of existing stores), statistical and machine learning methods can be used to help with the decision process.
We start by uploading (dummy) data to CARTO. The data contains the addresses of Starbucks stores, their annual revenue ($) and some store characteristics (e.g. the number of open hours per week). Next the addresses are geocoded: we extract the corresponding geographical locations, which will be used to assign each store census data of the corresponding block group.
Figure 6: Long Island, New York, Starbucks locations and Annual Revenue ($) In this problem we use dummy data for store annual revenue
Participants then downloaded the demographic and socioeconomic variables (from the CARTO Data Observatory) that will be used to build a model for the store’s revenues. We use data from the American Community Survey (ACS), which is an ongoing survey by the U.S. Census Bureau. The ACS is available at the most granular resolution at the census block group level, with the most recent geography boundaries available for 2015. The data that we will use are from the survey run from 2013 to 2017. Further as we are only interested in Long Island for the demo, we upload the geographical boundaries of this area from a geojson file.
When comparing data for irregular units like census block groups, extra care is needed for extensive variables, i.e. one whose value for a block can be viewed as a sum of sub-block values, as in the case of population. For extensive variables in fact we need first to normalise them, e.g. by dividing by the total area or the total population, depending on the application. Checking correlations among the variables there are some noticeably large correlations. To account for missing values and reduce the model dimensionality, we transform the data using principle component analysis (PCA).
To understand the relationship between the transformed variables (the PC scores) and the original variable, we plot the 10 variables most highly correlated with each PC (see below).
Figure 7: Principle Components correlation scores with the 10 most correlated variables For example, we see that the first PC, which is that dimension that explains the most of the variance, is positively correlated with the density of owner occupied housing units but negatively correlated with the density of renter occupied housing units.
Having prepared the model covariates, then we will model the annual revenues by store using a Generalised Linear Model (GLM) with the selected PC scores as covariates. Looking at the model results we notice a significant negative correlation with the first PC (which is lower in richer areas) and a significant positive correlation with the second and third PCs which is negatively correlated with a higher workforce density and a higher density of female seniors. Assessing model accuracy, we then plot the observed vs. the predicted values and compute the pseudo-R2 score. As shown in the plot below, most data points lie on the 1-1 line and alongside a high pseudo-R2 we can conclude our model is accurate for in-sample predictions.
Figure 8: Testing model accuracy – pseudo R2 for GLM
Finally, we can use this model to predict the annual revenue for each block group in Long Island, and map the results using CARTOframes.
Figure 9: Annual Revenue ($) Predictions by Long Island block group
Having assessed the performance of the non-spatial GLM model, we can check if the model residuals show any residual spatial dependence, indicating that better results might be obtained using a model that explicitly takes the property that an observation is more correlated with an observation collected at a neighbouring location than with another observation that is collected from farther away.
Variograms represent a useful tool to check for spatial dependence in the residuals. We fit a variogram models on the residuals by grouping pairs of observations in bins based on their distance and averaging the squared difference from the values of all pairs see Figure 10.
We extend the GLM to account for residual spatial dependence by adding a Gaussian process. Including a spatially-structured random effects model that incorporates such spatial dependency is beneficial when the variables used as covariates in the model may not be sufficient to explain the variability of the observations and the measurements, given the predictors, are not independent. Fitting the variogram model to the spatial-GLM models’ residuals we can see that we have consistently reduced spatial dependence in the residuals – see the scale of the y-axis in Figure 11.
Figure 10: Spatial dependence in GLM
In the extended model we have removed serial autocorrelation in the residuals.
The next step would be to make predictions using this model though computation time for GPs scale cubically with the data locations this is left for further work.
Figure 11: Spatial dependence in spatial-GLM
Location Intelligence plays a critical role in supply chain network design. From finding better ways to serve stores, the optimal location for a new distribution centre, or understanding how goods are reaching their final destination, every aspect of supply chain design is tied to location data.
Want to see a real world example?See how SEUR optimized their cold transportation network
In the demo, participants walked through all the analytical processes required to build an optimization model to design a supply chain network. In particular, we focus on the following steps:
Firstly we load the current distribution network data, including their locations, capacity and operational areas. The current network operates based on regional administrative regions as we can see Figure 12.
Figure 12: Distribution Centres | Locations and capacity (left panel), and current operational areas (right panel)
Next we look at the last year’s geolocated historical order data. Looking at the number of orders on its own – a temporal analysis of orders – shows the variability of orders. Adding the spatial component into the analysis to now run spatio-temporal analysis of orders shows the demand series varies not only by time but also by location. In the summer months, the number of orders from inner areas recedes as the number of orders from coastal areas increases.
Figure 13: Number of orders – temporal (top panel) and spatio-temporal analysis (lower panel)
The first analysis we will carry out is a clustering analysis to identify areas with a high concentration of orders. The goal of this analysis is to verify whether DCs are strategically located and whether the spatial characteristics of high density areas (land covered, administrative areas covered, etc.) can be leveraged to improve delivery areas.
For this analysis we will use DBSCAN, a density-based clustering non-parametric algorithm which groups together points that are closely packed together (points with many nearby neighbours), marking as outliers points that lie alone in low-density regions. The map below shows the clusters obtained with a sample of orders. It can easily be seen how different patterns emerge such as:
The panel on the right displays spatial intersections and spatial aggregations.
Figure 14: Supply-demand matching - analysis of the current SC network
For strategic and tactical planning, the focus is not on the exact location of orders, but rather an estimation of them aggregated both spatially and temporally. Selecting the right spatial aggregation is critical.
There are different alternatives for spatial aggregation. H3 and Quadkey grid are two examples of standard hierarchical grids. However, oftentimes business requirements impose the use of other spatial aggregations more “natural” to people.
Some examples are zip codes, municipalities, and administrative regions.
Figure 15: Spatial aggregation methods, H3 (left), Quadkey (middle) and natural (right) We aggregate orders delivered using H3, Quadkey grids and municipalities.
We move onto building an optimisation model to calculate the optimal delivery areas of each DC so that they could be compared to existing ones. The linear optimisation model has the following costs in the objective function:
To allow us to meet the third constraint we enrich our dataframe with WorldPop data to ensure that the population covered by each DC is evenly distributed with regards to the DC capacities. The enriched dataframe is visualised in Figure 16.
Figure 16: Average orders and population data The two panels here display average orders across Spain alongside population density
Given our enriched dataset we can now check the current utilisation in the supply-chain network. We can see there are a number of DCs where the average utilisation is >150% of capacity. The aim of the optimisation would be to calculate optimal delivery areas of each DC.
Figure 17: Current utilisation across DC supply-chain By in large most DCs operate below capacity but there is a notable number where average utilisation is >150% of capacity
After running the optimization algorithm, the optimal areas for DCs can be seen in the right panel in Figure 18 – compared to the current areas based on administrative regions (left panel). Interestingly whilst the average DC utilisation comes down from 124% to 116%, the distance travelled increases from 21km to 27km. This would suggest some DCs (e.g. Sabadell) may benefit from additional DC space.
Figure 18: Current operational areas (left) and optimal operating areas (right) The left panel display the current operational areas of DCs whilst the right panel displays proposed optimal operating areas
Figure 19: Optimal new average utilisation (orange dots) across DC supply-chain By in large there is a significant reduction in average utilisation across the SC as depicted by the orange dots for each DC
Want to get started?Sign up for a free account
Recently, as part of our ongoing mission to empower Data Scientists with the best data and analysis, we announced the integration of our platform with Databricks, using eit...Use Cases
Please fill out the below form and we'll be in touch real soon.