Locating High Performing Retail Expansion Sites

Summary

The Retail module in our Analytics Toolbox for Google BigQuery now includes Twin Areas analysis, an essential tool in Site Selection analytics. Learn more.

This post may describe functionality for an old version of CARTO. Find out about the latest and cloud-native version here.
Locating High Performing Retail Expansion Sites

When it comes to retail site selection applications  a common task consists of combining data from different sources to compare the characteristics of target locations and hone in on the most appropriate areas for footprint expansion or consolidation.

In a previous blogpost  we explained how to build a similarity score with respect to an existing site (e.g. the location of your top performing store) for a set of target locations. This approach is an essential tool for Site Planners looking to open new stores  relocate existing points of sale or consolidate their current network.

Our Analytics Toolbox for BigQuery  a set of User Defined Functions and procedures for running advanced spatial analytics natively in BigQuery  has been further enhanced with a new twin areas procedure  adding to the existing retail-specific procedures we have developed including revenue prediction  commercial hotspots and white space analysis.

Our Approach for Finding the Most Similar Locations to Your Top Performing Stores

In this post  we tackle Twin Areas analysis in more depth. This approach consists of three main steps:

     
  • First  to select the most relevant variables given the characteristics of your business (e.g. population  income  etc.)  coming from either our Data Observatory or from your own data tables;
  •  
  • Secondly  to gridify and enrich the location of an existing site (from now on referred to as the origin location) and of all the target sites using the selected data sources. The process of gridification both for the origin and target locations is required in order to be able to compare areas of the same size and relies on the use of spatial indexes (either quadkey or h3)  using the available procedures in our Analytics Toolbox.
  •  
  • Finally  to derive a similarity score between the origin and each target location by ranking the distance between the origin and each target cell in the variable space  where the selected variables have been standardized and transformed to account for pair-wise correlations.

The similarity score for each target cell is computed with respect to the score of the average cell in the target areas. Under this scoring rule  a target cell with a larger score is more similar to the origin cell than a target cell with a lower score; moreover  this score is positive if (and only if) the target cell is more similar to the origin than the mean cell.

How to Run the Twin Areas Analysis in Google BigQuery

To demonstrate how to run the Twin Areas analysis in BigQuery  we selected as potential origin locations the position of the top 10 performing liquor stores in 2019 from the publicly available Iowa Liquor sales dataset.  

We start by using the GRIDIFY_ENRICH procedure from the data module in CARTO’s Analytics Toolbox. This procedure is used to first gridify a set of geometries (point data  in this case) to a quadkey grid with zoom 15  and then to enrich each selected location with data from a subscription to one of the datasets available in the Data Observatory  including the total population (total_pop_3409f36f) and the number of households (households_d7d24db5) at the Census Block Group level from the ACS Sociodemographics dataset  as well as from a custom dataset  which contains the count of road links (count_qualified) per zip code. The result can be found in the table cartobq.docs.twin_areas_origin_enriched.

CALL `carto-un`.carto.GRIDIFY_ENRICH(
    -- Input query
    'SELECT * FROM `cartobq.docs.twin_areas_iowa_liquor_sales_origin`' 
    -- Grid params: grid type and level
    'quadkey'  15  
    -- Data Observatory enrichment
    [('total_pop_3409f36f' 'sum') ('households_d7d24db5' 'sum')] 
    'carto-data.ac_7xhfwyml' 
    -- Custom data enrichment
    '''
    SELECT geom  count_qualified FROM `cartobq.docs.twin_areas_custom` 
    ''' 
    [('count_qualified' 'count')] 
    0 "uniform" 
    -- Output table
    'cartobq.docs.twin_areas_origin_enriched');

The map below shows both the locations of the selected stores (left) as well as the enriched grid for the population variable (right):

Similarly  we can use this procedure to gridify and enrich the target areas for which we will use the Census Tracts polygons in Texas in the main urban areas. The result can be found in the table cartobq.docs.twin_areas_target_enriched.

CALL `carto-un`.carto.GRIDIFY_ENRICH(
    -- Input query
     'SELECT geom FROM `cartobq.docs.twin_areas_target`' 
    -- Grid params: grid type and level
    'quadkey'  15  
    -- Data Observatory enrichment
    [('total_pop_3409f36f' 'sum') ('households_d7d24db5' 'sum')] 
    'carto-data.ac_7xhfwyml' 
    -- Custom data enrichment
    '''
    SELECT geom  count_qualified FROM `cartobq.docs.twin_areas_custom` 
    ''' 
    [('count_qualified' 'count')] 
    0 "uniform" 
    -- Output table
    'cartobq.docs.twin_areas_target_enriched');

Once we have gridified and enriched the origin and target areas  we can run the FIND_TWIN_AREAS procedure for a given origin location  here selected as the store with the highest revenue. The result of the analysis can be found in the table cartobq.docs.twin_areas.

 CALL `carto-un`.carto.FIND_TWIN_AREAS(
    -- Input queries
    '''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched`
                    LIMIT 1''' 
    '''SELECT * FROM `cartobq.docs.twin_areas_target_enriched`''' 
    -- Twin areas model inputs
    -- Grid type
    'quadkey' 
    -- Percentage of explained variance retained by the Principal Component Analysis (PCA)
    0.90 
    -- Maximum number of twin areas
    NULL 
    -- Output prefix used to store the results: <output_prefix>_model_pca_components for the PCA model and <output_prefix>_<origin_cell_ID>_results for the results 
    'cartobq.docs.twin_areas');

As shown by the code above  the procedure requires the user to specify through a SQL query the origin and target cells  as well as to provide some inputs for the calculation of the similarity score  namely:

     
  • The prefix for the tables where the outputs of the procedure are stored  including the results table for a given origin ID and the name of the Principal Component Analysis (PCA) model used to transform the selected variables before computing the distance between the origin and target cell. The PCA method is used to avoid that pairwise correlations between variables might (wrongly) affect the computation of the distance. Since the eigenvectors of the covariance matrix are computed only taking into account the data in the target areas  if a model with the specified name already exists  this is used to retrieve the principal component scores both for the origin and target areas  otherwise the model is generated by the call of the procedure.
  •  
  • The percentage of variance retained when extracting the principal components (by default this value is set to 90%). By specifying this value  the user can decide how much of the variability in the original variables to include in the computation of the distance (smaller values of this parameter imply that only the major modes of variability will be accounted for when computing the distance).
  •  
  • The maximum number of twin areas results. By default this parameter is set to NULL and all the target areas  for which the similarity score is positive  are returned.

The results of this procedure for a selected origin location can be seen in this map:

This map shows the similarity skill score for all the target cells with a positive score in Texas: larger scores indicate areas more similar to the origin location.

Traditionally  discovering new areas for retail business expansion represented a difficult and lengthy process  which required on-site market analysis and local expertise. Using our Twin Areas tool  retailers and CPG brands can now easily discover the best locations to expand or optimize their network without any strong prior knowledge of the area  thus optimizing their site planning process. The approach takes full advantage of our comprehensive data catalog and the analytical capabilities of CARTO’s cloud native Location Intelligence platform.

Banner promoting CARTO Workflows with the text "Prefer a low-code approach? Try Workflows" and a screenshot of an example workflow

Try Conducting Your Own Twin Areas Analysis in BigQuery with CARTO

These procedures are now available to CARTO users as part of the Analytics Toolbox for BigQuery. They can be run directly from the BigQuery console or from your SQL or Python Notebooks using the BigQuery client. Additionally  they will soon power the white space analysis engine integrated into CARTO's Site Selection solution.

EU Flag This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.