Apr 27, 2022

Giulia Carella

and

Margara Tejera

CARTO Contributor

Apr 27, 2022

Giulia Carella

Margara Tejera

and

François Baptiste

Apr 27, 2022

Locating High Performing Retail Expansion Sites

Cloud Native

mins read

Locating High Performing Retail Expansion Sites

When it comes to retail site selection applications a common task consists of combining data from different sources to compare the characteristics of target locations and hone in on the most appropriate areas for footprint expansion or consolidation.

In a previous blogpost we explained how to build a similarity score with respect to an existing site (e.g. the location of your top performing store) for a set of target locations. This approach is an essential tool for Site Planners looking to open new stores relocate existing points of sale or consolidate their current network.

Our Analytics Toolbox for BigQuery a set of User Defined Functions and procedures for running advanced spatial analytics natively in BigQuery has been further enhanced with a new twin areas procedure adding to the existing retail-specific procedures we have developed including revenue prediction commercial hotspots and white space analysis.

Our Approach for Finding the Most Similar Locations to Your Top Performing Stores

In this post we tackle Twin Areas analysis in more depth. This approach consists of three main steps:

First to select the most relevant variables given the characteristics of your business (e.g. population income etc.) coming from either our Data Observatory or from your own data tables;
Secondly to gridify and enrich the location of an existing site (from now on referred to as the origin location) and of all the target sites using the selected data sources. The process of gridification both for the origin and target locations is required in order to be able to compare areas of the same size and relies on the use of spatial indexes (either quadkey or h3) using the available procedures in our Analytics Toolbox.
Finally to derive a similarity score between the origin and each target location by ranking the distance between the origin and each target cell in the variable space where the selected variables have been standardized and transformed to account for pair-wise correlations.

The similarity score for each target cell is computed with respect to the score of the average cell in the target areas. Under this scoring rule a target cell with a larger score is more similar to the origin cell than a target cell with a lower score; moreover this score is positive if (and only if) the target cell is more similar to the origin than the mean cell.

How to Run the Twin Areas Analysis in Google BigQuery

To demonstrate how to run the Twin Areas analysis in BigQuery we selected as potential origin locations the position of the top 10 performing liquor stores in 2019 from the publicly available Iowa Liquor sales dataset.

We start by using the GRIDIFY_ENRICH procedure from the data module in CARTO’s Analytics Toolbox. This procedure is used to first gridify a set of geometries (point data in this case) to a quadkey grid with zoom 15 and then to enrich each selected location with data from a subscription to one of the datasets available in the Data Observatory including the total population (total_pop_3409f36f) and the number of households (households_d7d24db5) at the Census Block Group level from the ACS Sociodemographics dataset as well as from a custom dataset which contains the count of road links (count_qualified) per zip code. The result can be found in the table cartobq.docs.twin_areas_origin_enriched.

CALL `carto-un`.carto.GRIDIFY_ENRICH(
    -- Input query
    'SELECT * FROM `cartobq.docs.twin_areas_iowa_liquor_sales_origin`' 
    -- Grid params: grid type and level
    'quadkey'  15  
    -- Data Observatory enrichment
    [('total_pop_3409f36f' 'sum') ('households_d7d24db5' 'sum')] 
    'carto-data.ac_7xhfwyml' 
    -- Custom data enrichment
    '''
    SELECT geom  count_qualified FROM `cartobq.docs.twin_areas_custom` 
    ''' 
    [('count_qualified' 'count')] 
    0 "uniform" 
    -- Output table
    'cartobq.docs.twin_areas_origin_enriched');

The map below shows both the locations of the selected stores (left) as well as the enriched grid for the population variable (right):

Similarly we can use this procedure to gridify and enrich the target areas for which we will use the Census Tracts polygons in Texas in the main urban areas. The result can be found in the table cartobq.docs.twin_areas_target_enriched.

CALL `carto-un`.carto.GRIDIFY_ENRICH(
    -- Input query
     'SELECT geom FROM `cartobq.docs.twin_areas_target`' 
    -- Grid params: grid type and level
    'quadkey'  15  
    -- Data Observatory enrichment
    [('total_pop_3409f36f' 'sum') ('households_d7d24db5' 'sum')] 
    'carto-data.ac_7xhfwyml' 
    -- Custom data enrichment
    '''
    SELECT geom  count_qualified FROM `cartobq.docs.twin_areas_custom` 
    ''' 
    [('count_qualified' 'count')] 
    0 "uniform" 
    -- Output table
    'cartobq.docs.twin_areas_target_enriched');

Once we have gridified and enriched the origin and target areas we can run the FIND_TWIN_AREAS procedure for a given origin location here selected as the store with the highest revenue. The result of the analysis can be found in the table cartobq.docs.twin_areas.

 CALL `carto-un`.carto.FIND_TWIN_AREAS(
    -- Input queries
    '''SELECT * FROM `cartobq.docs.twin_areas_origin_enriched`
                    LIMIT 1''' 
    '''SELECT * FROM `cartobq.docs.twin_areas_target_enriched`''' 
    -- Twin areas model inputs
    -- Grid type
    'quadkey' 
    -- Percentage of explained variance retained by the Principal Component Analysis (PCA)
    0.90 
    -- Maximum number of twin areas
    NULL 
    -- Output prefix used to store the results: <output_prefix>_model_pca_components for the PCA model and <output_prefix>_<origin_cell_ID>_results for the results 
    'cartobq.docs.twin_areas');

As shown by the code above the procedure requires the user to specify through a SQL query the origin and target cells as well as to provide some inputs for the calculation of the similarity score namely:

The prefix for the tables where the outputs of the procedure are stored including the results table for a given origin ID and the name of the Principal Component Analysis (PCA) model used to transform the selected variables before computing the distance between the origin and target cell. The PCA method is used to avoid that pairwise correlations between variables might (wrongly) affect the computation of the distance. Since the eigenvectors of the covariance matrix are computed only taking into account the data in the target areas if a model with the specified name already exists this is used to retrieve the principal component scores both for the origin and target areas otherwise the model is generated by the call of the procedure.
The percentage of variance retained when extracting the principal components (by default this value is set to 90%). By specifying this value the user can decide how much of the variability in the original variables to include in the computation of the distance (smaller values of this parameter imply that only the major modes of variability will be accounted for when computing the distance).
The maximum number of twin areas results. By default this parameter is set to NULL and all the target areas for which the similarity score is positive are returned.

The results of this procedure for a selected origin location can be seen in this map:

This map shows the similarity skill score for all the target cells with a positive score in Texas: larger scores indicate areas more similar to the origin location.

Traditionally discovering new areas for retail business expansion represented a difficult and lengthy process which required on-site market analysis and local expertise. Using our Twin Areas tool retailers and CPG brands can now easily discover the best locations to expand or optimize their network without any strong prior knowledge of the area thus optimizing their site planning process. The approach takes full advantage of our comprehensive data catalog and the analytical capabilities of CARTO’s cloud native Location Intelligence platform.

Banner promoting CARTO Workflows with the text "Prefer a low-code approach? Try Workflows" and a screenshot of an example workflow

Try Conducting Your Own Twin Areas Analysis in BigQuery with CARTO

These procedures are now available to CARTO users as part of the Analytics Toolbox for BigQuery. They can be run directly from the BigQuery console or from your SQL or Python Notebooks using the BigQuery client. Additionally they will soon power the white space analysis engine integrated into CARTO's Site Selection solution.

Want to get started?

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.

Don’t forget to share this post on Twitter, Facebook and Linkedin!

About the author

Giulia Carella is a Data Scientist at CARTO. She holds a PhD in Applied Statistics and has experience in the development and application of statistics and machine learning methods for spatio-temporal data, with applications ranging from climate science to spatial demography.

More Posts from

Giulia Carella

About the author

Provided by our community, industry experts, or the CARTO Team, these blog posts cover the entire spectrum of spatial analysis. From location intelligence to GIS, spatial data science, industry trends, and much more, we’ve crafted relevant content to accompany you at every stage of your journey, whether you have a technical or business background. With our Blog, you are one step closer to taking spatial analysis to the next level.

About the author

About the author

About the author

About the author

About the author

François is passionate about algorithms and ingenious solutions of all kinds with experience in such varied fields as signal processing, digital communications, data science, distributed computing, real-time biding, graph theory, autonomous agents and last but not least geospatial!

More Posts from

François Baptiste

What divides the U.S.? The 2016 Presidential Election Visualized

Data Through Design Opening Reception: Kicking-Off NYC Open Data Week 2018 in Style

80 Data Visualization Examples Using Location Data and Maps

How our solutions team engineered WaterHack 2018

Who is who at CARTO’s Design Team

CARTO 2018: Our Year In Review

Using Location Data to Identify Communities in Williamsburg, NY

Patching Plain PostgreSQL for Parallel PostGIS Plans

Mapping the Impact of Madrid's Line 5 Shutdown

What We Learned About Open-Source Geospatial Technology at FOSS4G