How to Use Spatial Data to Identify CPG Demand Hotspots

Summary

Focusing on organic food products in New York and Philadelphia, learn how spatial data can be used to reveal the features and areas that matter for a successful distribution rollout.


In recent years, the consumption and promotion of products in the Organic / Natural / Local category has increased dramatically. These are products that have not undergone any industrial process or that lack certain food additives and preservatives. In the US, the marketing of organic products has grown significantly, appealing to a new generation of consumers looking for healthy products and plant-based food. According to the Organic Trade Association, organic food is the fastest growing sector in the US food industry:

"Organic is the fastest growing sector of the U.S. food industry. Organic food sales increase by double digits annually, far outstripping the growth rate for the overall food market. Now, an unprecedented and conclusive study links economic health to organic agriculture."

This growth in demand for organic products is closely linked to cultural, socio-economic and health factors. In this case study, we analyze these factors spatially to understand which features and city areas can help a CPG data and marketing professional decide where to prioritize distribution rollout and identify POS (points of sale) for certain organic food products in two major US cities: New York and Philadelphia.

Data

To do this, we selected data sources available from CARTO's Data Observatory that could help us identify which areas in a city are better suited for the distribution of organic products. We used the following datasets for this analysis:

  • Mastercard - Geographic Insights: sales-based dynamics of a location, with indices measuring the evolution of credit card spend, number of transactions, average ticket, etc. in a retail area over time;
  • Spatial.ai - Geosocial Segments: behavioral segments based on the analysis of social media feeds with location information;
  • Dstillery - Behavioral Audiences: audiences derived from online behaviors;
  • Pitney Bowes - Points of Interest: database with the location of businesses and other points of interest categorized by classes and industry groups;
  • AGS - Sociodemographics: basic socio-demographic and socio-economic attributes estimated at current year and projected 5 years into the future.

Methodology

Our analysis follows three main steps:

  1. Identification of target areas with high potential for a successful rollout of organic products.
  2. Analysis of the factors that characterize the target areas and drove their selection.
  3. Identification of twin areas in San Francisco based on those selected in New York and Philadelphia.

Identifying Target Areas for Distributing Organic Products

In general, organic or "bio" products are considered premium and typically carry higher prices. For example, in Detroit, organic milk is 88% more expensive than regular milk. This means that it is preferable to place such products in stores located where consumers are willing to pay that premium, whether it’s where they live, work, or spend their leisure time.

Therefore, to identify the potential areas in which to roll out the distribution of such products, we followed these 3 steps:

  1. Identification of areas with higher average ticket size in Grocery Stores based on Mastercard data;
  2. Identification of areas where organic food potentially has higher demand, via the exploration of social media posts (using Spatial.ai geosocial segmentation) and internet search behaviors (with Dstillery's audience data);
  3. Intersection of the areas identified in the above two steps; these are the selected target areas for the remainder of the case study.

Note that all three sources used in this phase provide features aggregated at the census block group level.

Step 1 takes the index based on the monthly average ticket from credit card transactions in grocery stores per census block group and looks for the areas with higher ticket sizes. The rationale is that organic products are considered "premium", as mentioned at the beginning of this section. As a result, shoppers must have both the means and the will to spend the extra bucks to purchase organic products.

First, we check whether this variable is spatially correlated by calculating its Moran's I in the two cities, which evaluates whether the spatial pattern is clustered, dispersed, or random. The result shows a spatial pattern in which values at a location are affected by the values at nearby locations. So, as a next step, we compute the spatial lag as the average of the average ticket index in a location and its neighboring areas, which is essentially a smoothing over space.
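As a rough illustration of the two quantities in the paragraph above, the sketch below computes global Moran's I and the spatial lag on a toy set of four areas. The values, the adjacency and the binary contiguity weights are assumptions made up for this example; the post does not say which weights were used, and in practice a spatial statistics library would normally handle this step.

```python
# Toy data: four areas in a row (0-1-2-3); adjacent areas are neighbors.
# Both the index values and the adjacency are invented for illustration.
values = [10.0, 12.0, 30.0, 34.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def morans_i(values, neighbors):
    """Global Moran's I with binary (0/1) contiguity weights."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of w_ij * dev_i * dev_j over every ordered neighbor pair.
    cross = sum(dev[i] * dev[j] for i in neighbors for j in neighbors[i])
    total_weight = sum(len(js) for js in neighbors.values())
    return (n / total_weight) * cross / sum(d * d for d in dev)


def spatial_lag(values, neighbors):
    """Average of each area's own value and its neighbors' values."""
    lagged = []
    for i, v in enumerate(values):
        group = [v] + [values[j] for j in neighbors[i]]
        lagged.append(sum(group) / len(group))
    return lagged


moran = morans_i(values, neighbors)  # positive: similar values sit together
lag = spatial_lag(values, neighbors)  # smoothed version of `values`
```

A positive Moran's I, as in this toy case, indicates that neighboring areas tend to have similar values, which is what motivates smoothing with the spatial lag.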

Mastercard spatial correlation graph


Then, the spatial lag of the average ticket is classified into 5 classes using the Fisher-Jenks algorithm, which minimizes each class's average deviation from the class mean while maximizing each class's deviation from the means of the other classes. From the result, we select the top 2 classes as the census block groups with the most potential to be "profitable" for our use case from the credit card spend point of view.
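The classification step above can be sketched with a small dynamic program. This is a minimal illustration, assuming the common formulation of Fisher-Jenks that minimizes the within-class sum of squared deviations; the input values are invented, and a production analysis would more likely rely on an existing classification library.

```python
def fisher_jenks_breaks(data, k):
    """Upper bounds of k classes that minimize the within-class
    sum of squared deviations (exact dynamic programming)."""
    xs = sorted(data)
    n = len(xs)
    # Prefix sums let us get the cost of any slice in constant time.
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def cost(i, j):
        """Sum of squared deviations from the mean for xs[i:j]."""
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting xs[:j] into c classes.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                candidate = best[c - 1][i] + cost(i, j)
                if candidate < best[c][j]:
                    best[c][j] = candidate
                    cut[c][j] = i
    # Walk back through the stored cut points to recover the breaks.
    breaks, j = [], n
    for c in range(k, 0, -1):
        breaks.append(xs[j - 1])
        j = cut[c][j]
    return breaks[::-1]


# Hypothetical spatial-lag values for eleven census block groups.
lagged = [1.0, 1.2, 1.1, 5.0, 5.2, 9.0, 9.1, 14.0, 14.3, 20.0, 20.5]
breaks = fisher_jenks_breaks(lagged, k=5)
# Keep the top 2 of the 5 classes, i.e. values above the third break.
top_two = [v for v in lagged if v > breaks[2]]
```

Here `top_two` plays the role of the block groups retained from the credit card spend point of view.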

The next step is to identify areas where there is more interest in organic products based on behavioral data. For that purpose, the datasets from Spatial.ai and Dstillery are used. The former has an affinity index for organic food, which we analyze following the same procedure used with the Mastercard index. As the Spatial.ai index ranges from 0 to 100, the classification method used is quantiles.

From the Dstillery dataset, we select the interest type named "Organic and local food" and follow an analysis similar to the one used for the Mastercard and Spatial.ai indices. Again, spatial correlation was observed and found to be statistically significant, although in this case the correlation was much lower than in the Spatial.ai dataset.

Spatial.ai spatial correlation graph


Dstillery spatial correlation graph


Having identified the most relevant areas in which to sell organic products based on each data source individually, we select the final set of target areas by intersecting the union of the latter two selections (from behavioral data) with the first one (from credit card transaction data):

{Selected areas} = {Mastercard ∩ {Spatial.ai ∪ Dstillery}}
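With each selection stored as a set of census block group identifiers, the expression above maps directly onto set operators. The identifiers below are placeholders invented for this sketch.

```python
# Hypothetical block group IDs selected from each data source.
mastercard = {"bg_001", "bg_002", "bg_003", "bg_005"}
spatial_ai = {"bg_002", "bg_004"}
dstillery = {"bg_003", "bg_004", "bg_006"}

# Mastercard ∩ (Spatial.ai ∪ Dstillery)
selected_areas = mastercard & (spatial_ai | dstillery)
```

An area such as `bg_004`, flagged by both behavioral sources but not by the credit card data, is left out, which reflects the requirement that shoppers be both interested and willing to pay the premium.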

In the figures below, the selection process for each city is illustrated. Each figure contains 4 subfigures, showing the areas selected using Mastercard, Spatial.ai and Dstillery data and the final selection, respectively.

Selected census block groups in New York City

Selected census block groups in Philadelphia

Characterizing the Selected Areas

Having identified the areas of interest, we now want to understand which factors characterize them and examine the driving attributes that make an area attractive for placing organic products. For that purpose, we analyze the sociodemographic and socioeconomic factors in the selected areas provided by AGS, the number of Points of Interest (POI) from the Pitney Bowes (now Precisely) database aggregated by business group, as well as the geosocial segments from Spatial.ai. We compare the behavior of these factors between the selected and non-selected areas. Note that for the sociodemographic and socioeconomic attributes, we selected those available in the dataset as a 5-year projection.

To identify the driving factors, we first compute and compare the distribution of each feature for the selected and non-selected areas. To compare them, we perform a t-test to evaluate whether the means of the two sets of areas are significantly different from each other. We drop the features for which the test finds no significant difference between the selected and non-selected areas.
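As a sketch of the comparison described above, the snippet below computes a two-sample t statistic for one feature. The post does not say which t-test variant was used; this example assumes Welch's version (unequal variances), and the sample values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical values of one feature (e.g. an income index).
selected = [82.0, 90.0, 95.0, 88.0, 101.0]
non_selected = [55.0, 60.0, 48.0, 62.0, 58.0, 51.0]

t_stat, dof = welch_t(selected, non_selected)
```

A large absolute `t_stat` relative to the t distribution with `dof` degrees of freedom suggests the feature separates the two groups and is worth keeping; a statistics library would normally be used to turn this into a p-value.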

Additionally, for the Spatial.ai Geosocial Segments, an extra procedure is followed to reduce the number of features by identifying the segments that differ most between the selected and non-selected areas. For that purpose, the average value within the selected areas, the average value within the non-selected areas, and the ratio of those averages are used as features. The tables below show the average index in the entire city and in the selected areas, as well as the ratio between the two values. These calculated features are then clustered, and we choose the geosocial segments at the two opposite extremes. The rationale is that in the center the features behave similarly in the selected and non-selected areas, while at the extremes either a high or a low value is observed, making the distinction between them clear.

The above procedure was followed by disregarding the constant and the correlated features. For the correlated features, a threshold of 80% was used.
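One way to read this filtering step is sketched below: drop any column with a single distinct value, then drop any column whose absolute Pearson correlation with an already kept column exceeds 0.8. Interpreting the 80% threshold as |r| > 0.8 is an assumption, and the feature names and values are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_features(columns, threshold=0.8):
    """Names of columns kept after removing constant and correlated ones."""
    kept = []
    for name, col in columns.items():
        if len(set(col)) == 1:
            continue  # constant feature carries no information
        if any(abs(pearson(col, columns[k])) > threshold for k in kept):
            continue  # too similar to a feature we already kept
        kept.append(name)
    return kept


# Hypothetical feature table: one list of values per feature.
features = {
    "per_capita_income": [30.0, 45.0, 60.0, 80.0, 100.0],
    "avg_household_income": [62.0, 88.0, 125.0, 158.0, 205.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],
    "coffee_index": [70.0, 20.0, 90.0, 10.0, 50.0],
}
kept_features = prune_features(features)
```

Note that with this greedy approach the result depends on column order: of two highly correlated features, the one listed first survives.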

The tables show the values for the top 5 and bottom 5 features based on the ratio between the city average and the average in the selected areas in the respective cities. It's interesting to note that the geosocial segment "ED06_ingredient_attentive" popped up in Philadelphia and "ED03_trendy_eats" in New York.

New York Top 5 & Bottom 5 Features


Philadelphia Top 5 & Bottom 5 Features


Selecting the driving factors

Having formed and cleaned the features, we then build a classifier to derive the final impact of each selected feature. But first, we need to address the issue of imbalanced data, as the number of selected areas is much smaller than the rest: 5,389 compared to 623 selected in New York, and 1,043 compared to 35 selected in Philadelphia.

To achieve this, an upsampling technique, SMOTE, is used to generate artificial data. The training set is upsampled, generating "new" data. The goal is to create enough samples of each class so that the classifier can correctly identify the different driving factors and their impact without being overwhelmed by the majority class.
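The core idea of SMOTE can be shown in a few lines: each synthetic sample is placed at a random position on the segment between a minority-class point and one of its nearest minority-class neighbors. The function below is a simplified sketch of that idea, not the implementation used in the analysis; the points, `k` and the seed are arbitrary choices for the example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Other minority points, ordered by squared distance to `base`.
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbor))
        )
    return synthetic


# Hypothetical feature vectors of the "selected" (minority) class.
minority_points = [(1.0, 2.0), (1.5, 2.5), (2.0, 2.0), (1.2, 3.0)]
new_points = smote(minority_points, n_new=6)
```

Because the new samples are interpolations, they stay inside the region already occupied by the minority class. As the post notes, only the training set should be upsampled, so that evaluation still reflects the real class balance.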

The upsampling process is followed by the classification process, where a random forest classifier is used. Hyperparameter tuning is performed to minimize the number of incorrectly identified "selected" areas while maintaining good accuracy. As performance metrics for the classification, the confusion matrix and the receiver operating characteristic (ROC) curve are reported in the figures below.
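For readers less familiar with the two metrics mentioned above, the sketch below computes a confusion matrix and the area under the ROC curve from a handful of made-up labels and classifier scores. The 0.5 decision threshold and all values are assumptions for illustration; no classifier is trained here.

```python
def confusion_matrix(y_true, y_pred):
    """Counts of true/false positives/negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(t == 1 and p == 1 for t, p in pairs),
        "fp": sum(t == 0 and p == 1 for t, p in pairs),
        "fn": sum(t == 1 and p == 0 for t, p in pairs),
        "tn": sum(t == 0 and p == 0 for t, p in pairs),
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve: the probability that a random positive
    gets a higher score than a random negative (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 1 = "selected" area, 0 = non-selected; scores are hypothetical
# probabilities produced by a classifier.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.45, 0.3]
y_pred = [int(s >= 0.5) for s in scores]

cm = confusion_matrix(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

In this setting, the "fn" count (selected areas the classifier missed) is the quantity the tuning described above tries to keep low.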

New York Confusion Matrix & Receiver Operating Characteristic


Philadelphia Confusion Matrix & Receiver Operating Characteristic


For both cities, we can see that the performance of the classification method is good. Some might argue that overfitting has occurred, but because the selected areas were successfully identified, we have more confidence that the classifier is appropriate for an imbalanced dataset.

Looking at the importance of the top 20 features for each city, measured with Shapley values, we can extract the main driving factors. Similarities between the driving factors for both cities can also be observed: of the top 20 features, 10 are common to both cities.
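To give an intuition for what a Shapley value measures, the sketch below computes exact values by brute force for a tiny, hypothetical scoring function with three features. This is not how importances would be computed for a random forest with many features, where a dedicated explainer library is the practical choice; the model, the instance and the all-zero baseline are assumptions for the example.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution of each
    feature over all orders in which features are switched from the
    baseline value to the instance value."""
    n = len(instance)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]
            updated = predict(current)
            phi[i] += updated - previous
            previous = updated
    return [p / len(orders) for p in phi]


def toy_model(x):
    """Hypothetical score with an interaction between features 0 and 2."""
    return 0.5 * x[0] + 0.2 * x[1] + 0.1 * x[0] * x[2]


instance = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The contributions add up to the difference between the prediction for the instance and for the baseline, and the interaction term is split evenly between the two features involved.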

New York Top 20 Features


Top 20 features for New York City

Philadelphia Top 20 Features


Top 20 features for Philadelphia

Looking at the most important features in New York, we can see that areas with higher income, with the presence of the "LGBTQ culture" and "artistic appreciation" geosocial segments, as well as the segments related to premium foods and drinks, are the best suited for the distribution of organic products. We see a similar trend in Philadelphia, with most driving features being shared by both cities.

Identifying Twin Areas in Different Cities

Having analyzed the driving factors behind area selection in New York and Philadelphia, the twin areas method described in one of our previous posts can be applied in order to identify target areas for the distribution of organic products in other parts of the US. As an example for this exercise, we picked San Francisco.

Now imagine that we have already established a distribution strategy for organic products in New York City and detected the census block group that is yielding optimal results in terms of sales of our products. The purpose of the twin areas method is to help us identify similar areas in a different city, leveraging the driving factors (i.e., features) identified in the previous section.

As the top performing area we have selected this census block group in Manhattan:

Selected census block group in Manhattan


Census block group in New York City for which we are going to look for twin areas in San Francisco

The features used as criteria for the twin areas method are the common features in New York and Philadelphia that were identified in the previous section:

  • 'Per capita income (projected, five years)'
  • 'Average household Income (projected, five years)'
  • 'EB03_lgbtq_culture'
  • 'ED09_hops_and_brews'
  • 'ED08_wine_lovers'
  • 'ED04_whiskey_business'
  • 'Median household income (projected, five years)'
  • 'ED02_coffee_connoisseur'
  • 'LEGAL SERVICES'
  • 'ED01_sweet_treats'

The map below illustrates the results from the twin areas method, showcasing the census block groups in San Francisco with similarity scores greater than 0. We can filter the areas using the histogram widget to identify the most similar twins based on the features listed above.

Similarity Scores (SS) for San Francisco locations. Only locations with positive SS are shown.
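The twin areas method itself is described in the earlier post and is not reproduced here. As a loose sketch of how a similarity score with a meaningful zero can be built, the example below standardizes the features over the candidate areas, measures the Euclidean distance from the origin area to each candidate, and compares it with the distance from the origin to the candidates' average. That scoring formula, the two features and all values are assumptions for illustration and may differ from the actual method.

```python
from math import sqrt
from statistics import mean, pstdev


def similarity_scores(origin, candidates):
    """Score = 1 - d(origin, candidate) / d(origin, candidate mean),
    with features standardized over the candidates. A positive score
    means the candidate is closer to the origin than the average is."""
    columns = list(zip(*candidates))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]

    def standardize(row):
        return [(v - m) / s for v, m, s in zip(row, mus, sds)]

    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    z_origin = standardize(origin)
    # After standardization the candidate mean is the zero vector.
    reference = distance(z_origin, [0.0] * len(z_origin))
    return [
        1 - distance(z_origin, standardize(row)) / reference
        for row in candidates
    ]


# Hypothetical (income index, coffee segment index) per block group.
origin_area = (95.0, 80.0)  # top performing area in the origin city
target_areas = [(90.0, 75.0), (40.0, 30.0), (60.0, 85.0), (30.0, 20.0)]

ss = similarity_scores(origin_area, target_areas)
twins = [i for i, s in enumerate(ss) if s > 0]
```

Filtering on `s > 0` mirrors the map above, where only locations with a positive score are displayed.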

Conclusions

In this case study, we illustrated how to leverage new types of spatial data, such as aggregated credit card transactions and social media behavior, to define a methodology for selecting and characterizing the optimal areas for the rollout of organic products in New York City and Philadelphia, enabling CPG firms to see where they may have gaps in their POS networks. Finally, we applied the Twin Areas method to identify the best areas in San Francisco based on the driving factors identified in the other two cities, which are closely tied to high income areas, greater concentration of different types of retail stores, and geosocial segments related to premium food products.

The combination of new location data streams and spatial data science techniques opens up a new array of opportunities to define better strategies for the distribution of consumer goods. It allows for a greater understanding of the different retail areas based on their consumer segments, so CPG companies can place their products as close as possible to the areas with the greatest potential demand and increase sales and ROI for their brands.


This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.