How to build a revenue prediction model with CARTO & BigQuery

Sales forecasting and revenue prediction have become increasingly important to retailers during the past two years as store owners have had to rapidly react to evolving consumer behaviors.

For example, understanding what consumers will buy, how much, and when can be used by retailers to optimize revenue by avoiding over or under-stocking key product lines. Historical sales data has been traditionally used to create such a forecast but retailers also need to consider the location aspect within their analysis, especially when it comes to decisions around site selection.

This year, retailers in the US will open more stores than they close for the first time since 2017 (source) with ‘retail’s recovery going faster than anyone imagined’.

Recovery has gone faster than anyone imagined. Brokers are busy. The transactional increase is massive, up 75%
Naveen Jaggi, President of Retail Advisory Services at JLL.

Trends identified by our own Black Friday analysis also illustrate a recovery period between 2020 and 2021 but crucially showed that people do not seem to be traveling as far to shop. Being able to predict the revenue potential for different locations is critical for an optimal site selection strategy, allowing retailers to reduce the time and effort required to pinpoint the best store locations for their business.

In this blogpost we will showcase how to use CARTO’s Analytics Toolbox for BigQuery to train a spatial model to predict annual store revenues across a territory.

The recently announced CARTO Spatial Extension for BigQuery is a complete redevelopment of our leading Location Intelligence platform. One of the key components of this platform is the Analytics Toolbox, which provides a complete framework of analysis capabilities to perform spatial analytics in SQL, computed natively in the leading cloud data warehouse platforms. Our implementation for BigQuery enables users of such a platform to have a fully cloud native GIS and Spatial Data Science experience while keeping the benefits around privacy, compliance, scalability, and lower costs that their data warehouse brings.

Diagram showing Analytics Toolbox components

The analysis routines in the Analytics Toolbox cover a broad range of spatial analytics use-cases and are organized into a series of modules (data, clustering, statistics, etc.) based on the functionality that they offer. Today, we are happy to announce the availability of a new domain-specific module to solve geospatial analytics for the retail sector, starting with Revenue Prediction.

Revenue Prediction with the Analytics Toolbox

In order to carry out revenue prediction analysis we have implemented three new procedures into the Analytics Toolbox that leverage the scalability and computational efficiency of spatial indexes and cover all the necessary steps to solve this use-case, end-to-end: data prep, model training, and prediction.

  • BUILD_REVENUE_MODEL_DATA: as a first step in the revenue prediction workflow, this procedure prepares the data to build the revenue prediction model, which will be used in both the training and prediction phases. As a result, the area of interest for the analysis is polyfilled using either H3 or quadkey spatial indexes and enriched with features from both Data Observatory datasets and user's first party data (including store revenue) in order to characterize it.
  • BUILD_REVENUE_MODEL: builds the predictive model from the input data (i.e., output from the previous step), the advanced configuration options, and computes the model's and features' statistics (mean absolute error, feature importance, etc.).
  • PREDICT_REVENUE_AVERAGE: the final procedure in the workflow that predicts the potential revenue of a new store to be located in a particular area.

It is important to note that these procedures are leveraging other routines from our Analytics Toolbox concurrently such as MORANS_I_H3, ENRICH_GRID and DATAOBS_ENRICH_GRID.

For the purpose of this case study we are going to use the publicly available Iowa Liquor Sales data, which provides spirits purchase information at store level for Iowa Class “E” liquor licensees by product and date of purchase from January 1st 2012 to present time. In particular, we are going to train our predictive model with the revenue data of the 135 stores that appear under the brand “Hy-Vee”, computed as the aggregate sales data at yearly level. These stores are used as a reference and the model is built using their locations and revenues as inputs. All other liquor stores will be considered Hy-Vee’s competitors. Then, we are going to simulate the scenario in which we want to predict the revenue of a new Hy-Vee store located anywhere in the state of Iowa.

Additionally, in order to train the revenue prediction model we are going to enrich the store revenue data with location-aware context, specifically with variables from CARTO’s Spatial Features dataset at H3 Resolution 8 level, available via our Data Observatory.

Data Preparation

The first step of the Revenue Prediction analysis workflow consists in preparing the data that will be used to both train the model and run predictions. The area of interest for our particular use-case is the entire state of Iowa in the US, which we are going to analyze using a H3 grid of resolution 6. The H3 grid cells within this area will be enriched with the number of Hy-Vee’s stores in each grid cell, their average annual revenue, and the number of competitor’s stores.

Moreover, this H3 grid will be enriched with the total population and number of POIs of different categories, in particular, retail, financial, food & drinks, leisure, transportation and tourism. This data is extracted from CARTO’s Spatial Features dataset, available via our Data Observatory. For this enrichment, the value of these variables for each particular grid cell and their neighboring cells, defined as those within a KRING of size 2, will be taken into account. The values at the neighboring cells will be scaled according to a configurable decay function, in this case a uniform function, with the objective of giving less weight to those cells that are further away from the origin cell.

  -- Stores: revenue, store, geom
  SELECT store_name as store,ST_GEOGFROMTEXT(any_value(store_location)) as geom, sum(sale_dollars) as revenue FROM
  where store_name like "%Hy-Vee%" and EXTRACT(YEAR FROM date)=2019
  group by store_name
  -- Store variables
  -- Competitors: competitor, geom
  SELECT store_name as competitor,ST_GEOGFROMTEXT(any_value(store_location)) as geom FROM
  where store_name not like "%Hy-Vee%" and EXTRACT(YEAR FROM date)=2019
  group by store_name
  -- Area of interest: geom
  SELECT state_geom AS geom
  FROM `bigquery-public-data.geo_us_boundaries.states`
  WHERE state_name='Iowa'
  -- Grid params: grid type and level
  'h3', 6,
  -- Decay params: kring size and decay function
  2, '',
  -- Data Observatory enrichment
  [("population_f5b8d177", 'sum'), ("retail_21cb1815", 'sum'),("financial_9b8bb507",'sum'),("food_drink_bd2ef444",'sum'),("leisure_16da1bd6",'sum'),
  -- Custom data enrichment
  -- Output destination prefix

Where <my-dataobs-project>.<my-dataobs-dataset> specifies the location of the user’s Data Observatory subscriptions in BigQuery.

In the map below, we showcase Hy-Vee’s stores styled by their yearly revenue and the enriched H3 grid that we will use to train and run the model, styled by the weighted population at each cell (i.e., considering also their neighbors as explained previously).

Now that our training data is ready, we can perform a simple correlation analysis to understand the relationship between all pairs of variables. In the diagram below, where one-to-one correlations are shown, we can identify a strong relationship between food & drink POIs with both population and retail POIs. This is expected as restaurants, coffee shops, etc. tend to be located in highly populated areas. In addition, a significant relationship is observed for the existing store count and number of competitors variables. This means that the more stores are present in a particular grid cell, the more competitor stores there are, which is expected from a business perspective. Finally, we can highlight that the lowest correlated variables are the number of tourism POIs and the annual revenue.

Triangle Correlation Heatmap

Model Training

Now that we have the data prepared, we can run the second procedure, BUILD_REVENUE_MODEL, in order to build and train the predictive model.

CALL `carto-un`.carto.BUILD_REVENUE_MODEL(
   -- Model data
   -- Options
   -- Output destination prefix

The outputs of this procedure are the trained model to be used in the revenue prediction, a SHAP table containing the list of the features and their attribution to the model, a model stats table (e.g., mean error, variance, etc.), and finally a table with the model’s feature importance.

Table showing model’s feature importances

Finally, in order to showcase the performance of our predictive model we use the following metrics, R-square and Mean Absolute Percentage Error (MAPE). As can be seen, the model is able to explain 87% of the variance, and the average prediction error is ±25%



Mean Absolute Percentage Error (MAPE)


Revenue Prediction

To predict the revenue of a new store located in any of the H3 grid cells within our area of interest (the state of Iowa), we can simply run the PREDICT_REVENUE_AVERAGE procedure, providing as inputs the corresponding cell ID, the model data (output of the BUILD_REVENUE_MODEL_DATA procedure) and the revenue model (output of the BUILD_REVENUE_MODEL procedure).


In the map below we can see the revenue predictions for a set of locations within Iowa where there were previously no existing stores, along with the annual revenue of the existing stores and the location of the competitors considered for this analysis.

The prediction results for all grid cells within Iowa can be found in the table

This is just the first retail-specific use case that we can solve with our Analytics Toolbox for BigQuery. In the coming months we will be adding other advanced use cases to the Retail module such as Twin Area Analysis, Whitespace Analysis, and Commercial Hotspot Analysis.

The revenue prediction procedures are now available from the Analytics Toolbox for BigQuery and can be run directly from the BigQuery console or from your SQL or Python Notebooks using the BigQuery client. Additionally, they will soon power the revenue prediction analysis engine integrated in CARTO’s Site Selection application.

The Analytics Toolbox for BigQuery is a component of our cloud native Location Intelligence platform. To access, sign up for a free trial or contact our team of experts for a personalized demo.

EU Flag This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 960401.
About the author
Javier Pérez Trufero

Javier is VP of Product and is globally responsible for CARTO’s product portfolio and innovation roadmap. He also leads the company strategy with respect to third party data offerings and data science activities. Javier's responsibilities span the complete product development cycle. He also coordinates new alliances with top-class data providers such as Vodafone and Mastercard.

More posts from Javier Pérez Trufero
About the author
Miguel Álvarez

Lead Data Scientist at CARTO.

More posts from Miguel Álvarez
About the author
Argyrios Kyrgiazos

Data Scientist at CARTO

More posts from Argyrios Kyrgiazos

Related Posts

Ready to optimize your territories with Location Intelligence?

Close circle icon

Contact us

Please fill out the below form and we'll be in touch real soon.