# Locating High Performing Retail Expansion Sites

When it comes to retail site selection applications, a common task consists of combining data from different sources to compare the characteristics of target locations and hone in on the most appropriate areas for footprint expansion or consolidation.

In a previous blogpost, we explained how to build a similarity score with respect to an existing site (e.g. the location of your top performing store) for a set of target locations. This approach is an essential tool for Site Planners looking to open new stores, relocate existing points of sale or consolidate their current network.

Our Analytics Toolbox for BigQuery, a set of User Defined Functions and procedures for running advanced spatial analytics natively in BigQuery, has been further enhanced with a new twin areas procedure, adding to the existing retail-specific procedures we have developed including revenue prediction, commercial hotspots and white space analysis.

## Our Approach for Finding the Most Similar Locations to Your Top Performing Stores

In this post, we tackle Twin Areas analysis in more depth. This approach consists of three main steps:

• First, to select the most relevant variables given the characteristics of your business (e.g. population, income, etc.), coming from either our Data Observatory or from your own data tables;
• Secondly, to gridify and enrich the location of an existing site (from now on referred to as the origin location) and of all the target sites using the selected data sources. The process of gridification both for the origin and target locations is required in order to be able to compare areas of the same size and relies on the use of spatial indexes (either quadkey or h3), using the available procedures in our Analytics Toolbox.
• Finally, to derive a similarity score between the origin and each target location by ranking the distance between the origin and each target cell in the variable space, where the selected variables have been standardized and transformed to account for pair-wise correlations.

The similarity score for each target cell is computed with respect to the score of the average cell in the target areas. Under this scoring rule, a target cell with a larger score is more similar to the origin cell than a target cell with a lower score; moreover, this score is positive if (and only if) the target cell is more similar to the origin than the mean cell.

## How to Run the Twin Areas Analysis in Google BigQuery

To demonstrate how to run the Twin Areas analysis in BigQuery, we selected as potential origin locations the position of the top 10 performing liquor stores in 2019 from the publicly available Iowa Liquor sales dataset.

We start by using the GRIDIFY_ENRICH procedure from the data module in CARTO’s Analytics Toolbox. This procedure is used to first gridify a set of geometries (point data, in this case) to a quadkey grid with zoom 15, and then to enrich each selected location with data from a subscription to one of the datasets available in the Data Observatory, including the total population (total_pop_3409f36f) and the number of households (households_d7d24db5) at the Census Block Group level from the ACS Sociodemographics dataset, as well as from a custom dataset, which contains the count of road links (count_qualified) per zip code. The result can be found in the table cartobq.docs.twin_areas_origin_enriched.

CALL carto-un.carto.GRIDIFY_ENRICH(
-- Input query
'SELECT * FROM cartobq.docs.twin_areas_iowa_liquor_sales_origin',
-- Grid params: grid type and level
-- Data Observatory enrichment
[('total_pop_3409f36f','sum'),('households_d7d24db5','sum')],
'carto-data.ac_7xhfwyml',
-- Custom data enrichment
'''
SELECT geom, count_qualified FROM cartobq.docs.twin_areas_custom
''',
[('count_qualified','count')],
0,"uniform",
-- Output table
'cartobq.docs.twin_areas_origin_enriched');


The map below shows both the locations of the selected stores (left) as well as the enriched grid for the population variable (right):

Similarly, we can use this procedure to gridify and enrich the target areas for which we will use the Census Tracts polygons in Texas in the main urban areas. The result can be found in the table cartobq.docs.twin_areas_target_enriched.

CALL carto-un.carto.GRIDIFY_ENRICH(
-- Input query
'SELECT geom FROM cartobq.docs.twin_areas_target',
-- Grid params: grid type and level
-- Data Observatory enrichment
[('total_pop_3409f36f','sum'),('households_d7d24db5','sum')],
'carto-data.ac_7xhfwyml',
-- Custom data enrichment
'''
SELECT geom, count_qualified FROM cartobq.docs.twin_areas_custom
''',
[('count_qualified','count')],
0,"uniform",
-- Output table
'cartobq.docs.twin_areas_target_enriched');


Once we have gridified and enriched the origin and target areas, we can run the FIND_TWIN_AREAS procedure for a given origin location, here selected as the store with the highest revenue. The result of the analysis can be found in the table cartobq.docs.twin_areas.

 CALL carto-un.carto.FIND_TWIN_AREAS(
-- Input queries
'''SELECT * FROM cartobq.docs.twin_areas_origin_enriched
LIMIT 1''',
'''SELECT * FROM cartobq.docs.twin_areas_target_enriched''',
-- Twin areas model inputs
-- Grid type
-- Percentage of explained variance retained by the Principal Component Analysis (PCA)
0.90,
-- Maximum number of twin areas
NULL,
-- Output prefix used to store the results: <output_prefix>_model_pca_components for the PCA model and <output_prefix>_<origin_cell_ID>_results for the results
'cartobq.docs.twin_areas');


As shown by the code above, the procedure requires the user to specify through a SQL query the origin and target cells, as well as to provide some inputs for the calculation of the similarity score, namely:

• The prefix for the tables where the outputs of the procedure are stored, including the results table for a given origin ID and the name of the Principal Component Analysis (PCA) model used to transform the selected variables before computing the distance between the origin and target cell. The PCA method is used to avoid that pairwise correlations between variables might (wrongly) affect the computation of the distance. Since the eigenvectors of the covariance matrix are computed only taking into account the data in the target areas, if a model with the specified name already exists, this is used to retrieve the principal component scores both for the origin and target areas, otherwise the model is generated by the call of the procedure.
• The percentage of variance retained when extracting the principal components (by default this value is set to 90%). By specifying this value, the user can decide how much of the variability in the original variables to include in the computation of the distance (smaller values of this parameter imply that only the major modes of variability will be accounted for when computing the distance).
• The maximum number of twin areas results. By default this parameter is set to NULL and all the target areas, for which the similarity score is positive, are returned.

The results of this procedure for a selected origin location can be seen in this map:

This map shows the similarity skill score for all the target cells with a positive score in Texas: larger scores indicate areas more similar to the origin location.

Traditionally, discovering new areas for retail business expansion represented a difficult and lengthy process, which required on-site market analysis and local expertise. Using our Twin Areas tool, retailers and CPG brands can now easily discover the best locations to expand or optimize their network without any strong prior knowledge of the area, thus optimizing their site planning process. The approach takes full advantage of our comprehensive data catalog and the analytical capabilities of CARTO’s cloud native Location Intelligence platform.

## Try Conducting Your Own Twin Areas Analysis in BigQuery with CARTO

These procedures are now available to CARTO users as part of the Analytics Toolbox for BigQuery. They can be run directly from the BigQuery console or from your SQL or Python Notebooks using the BigQuery client. Additionally, they will soon power the white space analysis engine integrated into CARTO’s Site Selection solution.

 This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.

Data Scientist at CARTO.

Product Manager at CARTO

Software Engineer at CARTO

• ## Hexagons for Location Intelligence: Why, When & How?

You’ve probably seen hexagon grids on maps, and maybe even created some of your own. But have you ever stopped to think about why? This is CARTO’s definitive guide to hexag...

• ## Understanding Retail Store Clusters with Local Outlier Factor Analysis

The retail landscape of a city is a complex and intricate one. A store’s success or failure depends heavily on its location; whether it is in a busy area, has a strong loca...

• and

## Dark Store Site Selection: How Geospatial Data Can Help

In New York City, a number of companies such as 1520, Fridge No More, JOKR and BUYK have started offering local grocery delivery in 15 minutes from dark stores. As these se...