A Python package for integrating CARTO maps, analysis, and data services into data science workflows.

View source

Data Observatory

CARTO’s Data Observatory is a spatial data platfrom that enables data scientists to augment their data and broaden their analyses by using thousands of datasets from around the globe.

This guide is intended for those who want to start augmenting their data using CARTOframes and wish to explore CARTO’s public Data Observatory catalog to find datasets that best fit their use cases and analyses. For further learning you can also check out the Data Observatory examples.

Data Discovery

The Data Obsevatory data catalog is comprised of thousands of curated spatial datasets. When searching for data the easiest way to find what you are looking for is to make use of a faceted search. A faceted (or hierarchical) search allows you to narrow down search results by applying multiple filters based on the faceted classification of catalog datasets. For more information check the data discovery example.

Datasets are organized in these main hierarchies: country, category, provider and geography (or spatial resolution).

The catalog is public and you don’t need a CARTO account to search for available datasets. You can access the web version of the catalog here.

Dataset and variables metadata

The Data Observatory catalog is not only a repository of curated spatial datasets, it also contains valuable information that helps better understand the underlying data of every dataset so you can make an informed decision on what data best fits your problem.

Some of the augmented metadata you can find for each dataset in the catalog is:

  • head and tail methods to get a glimpse of the actual data. This helps you to understand the available columns, data types, etc., to start modelling your problem right away.
  • geom_coverage to visualize on a map the geographical coverage of the data in the Dataset.
  • counts, fields_by_type and a full describe method with stats of the actual values in the dataset, such as: average, stdev, quantiles, min, max, median for each of the variables of the dataset.

You don’t need a subscription to a dataset to be able to query the augmented metadata, it’s publicly available for anyone exploring the Data Observatory catalog.

Let’s review some of that information, starting by getting a glimpse of the ten first or last rows of the actual data of the dataset:

1
2
3
4
from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')
dataset.head()
DWLCY DWLPY HHDCY HHDPY POPCY POPPY geoid VPHCY1 do_date AGECYMED ... MARCYDIVOR MARCYNEVER MARCYWIDOW RCHCYAMNHS RCHCYASNHS RCHCYBLNHS RCHCYHANHS RCHCYMUNHS RCHCYOTNHS RCHCYWHNHS
0 1057 1112 932 986 1500 1648 040130405071 442 2020-01-01 00:00:00+00:00 77.40 ... 149 4 228 0 11 20 0 25 0 1317
1 1964 2069 1774 1877 2595 2868 040130405072 1049 2020-01-01 00:00:00+00:00 76.88 ... 414 160 699 0 74 68 7 55 0 2167
2 1049 1101 897 933 1585 1716 040130610182 460 2020-01-01 00:00:00+00:00 69.88 ... 31 217 246 2 55 43 9 26 0 1313
3 1084 1137 910 940 1503 1616 040138175002 392 2020-01-01 00:00:00+00:00 71.44 ... 191 79 268 8 24 38 0 8 0 1290
4 682 706 574 591 980 1039 040190043241 244 2020-01-01 00:00:00+00:00 72.38 ... 30 44 195 3 9 0 0 0 0 902
5 880 910 840 869 1249 1284 060133511032 539 2020-01-01 00:00:00+00:00 76.75 ... 160 40 319 0 136 19 2 12 5 1024
6 1467 1534 1314 1467 1658 1800 060590995101 831 2020-01-01 00:00:00+00:00 74.58 ... 423 136 496 3 226 10 1 16 0 1269
7 704 753 693 730 1078 1176 060610210391 338 2020-01-01 00:00:00+00:00 73.86 ... 117 63 215 5 33 7 0 9 0 986
8 1582 1691 1553 1650 2540 2795 060610236001 818 2020-01-01 00:00:00+00:00 68.80 ... 406 45 301 5 168 26 3 19 0 2183
9 1186 1268 1163 1234 1980 2176 060610236002 415 2020-01-01 00:00:00+00:00 68.59 ... 253 60 223 5 97 22 1 10 0 1750

10 rows × 110 columns

Alternatively, you can get the last ten ones with dataset.tail()

An overview of the coverage of the dataset

1
dataset.geom_coverage()

Some stats about the dataset:

1
dataset.counts()
1
2
3
4
5
rows                    217182.0
cells                 23890020.0
null_cells                   0.0
null_cells_percent           0.0
dtype: float64
1
dataset.fields_by_type()
1
2
3
4
5
float          5
string         2
integer      102
timestamp      1
dtype: int64
1
dataset.describe()
POPCY POPCYGRP POPCYGRPI AGECY0004 AGECY0509 AGECY1014 AGECY1519 AGECY2024 AGECY2529 AGECY3034 ... DWLCYVACNT DWLCYRENT DWLCYOWNED POPPY HHDPY DWLPY AGEPYMED INCPYPCAP INCPYAVEHH INCPYMEDHH
avg 1.520470e+03 3.725678e+01 1.798816e+01 9.047040e+01 9.341024e+01 9.581574e+01 9.726450e+01 1.000845e+02 1.083624e+02 1.044326e+02 ... 4.957671e+01 2.116296e+02 3.835577e+02 1.568307e+03 6.075594e+02 6.718555e+02 3.989449e+01 4.292896e+04 1.073098e+05 7.933328e+04
max 6.710000e+04 1.975200e+04 1.205300e+04 5.393000e+03 5.294000e+03 5.195000e+03 7.606000e+03 1.480400e+04 5.767000e+03 5.616000e+03 ... 6.547000e+03 1.005700e+04 2.367600e+04 7.584500e+04 2.811500e+04 3.264000e+04 8.750000e+01 3.824975e+06 1.112720e+07 3.500000e+05
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 ... 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
sum 3.302187e+08 8.091501e+06 3.906704e+06 1.964854e+07 2.028702e+07 2.080945e+07 2.112410e+07 2.173655e+07 2.353436e+07 2.268087e+07 ... 1.076717e+07 4.596213e+07 8.330182e+07 3.406080e+08 1.319510e+08 1.459149e+08 8.664364e+06 9.323398e+09 2.330576e+10 1.722976e+10
range 6.710000e+04 1.975200e+04 1.205300e+04 5.393000e+03 5.294000e+03 5.195000e+03 7.606000e+03 1.480400e+04 5.767000e+03 5.616000e+03 ... 6.547000e+03 1.005700e+04 2.367600e+04 7.584500e+04 2.811500e+04 3.264000e+04 8.750000e+01 3.824975e+06 1.112720e+07 3.500000e+05
stdev 1.063417e+03 2.428693e+02 1.582057e+02 8.044792e+01 8.338915e+01 8.351758e+01 1.117828e+02 1.244125e+02 9.642758e+01 9.358961e+01 ... 9.849786e+01 2.353261e+02 3.163312e+02 1.141981e+03 4.138545e+02 4.467692e+02 7.567177e+00 3.178870e+04 7.835131e+04 4.262018e+04
q1 8.500000e+02 0.000000e+00 0.000000e+00 4.400000e+01 4.400000e+01 4.600000e+01 4.500000e+01 4.300000e+01 5.000000e+01 4.900000e+01 ... 1.100000e+01 6.000000e+01 1.820000e+02 8.670000e+02 3.440000e+02 3.840000e+02 3.407000e+01 2.168000e+04 5.536100e+04 4.701800e+04
q3 1.454000e+03 0.000000e+00 0.000000e+00 8.300000e+01 8.600000e+01 8.900000e+01 8.700000e+01 8.600000e+01 9.800000e+01 9.500000e+01 ... 3.400000e+01 1.780000e+02 3.750000e+02 1.485000e+03 5.810000e+02 6.480000e+02 4.101000e+01 4.058200e+04 1.012720e+05 7.908300e+04
median 1.125000e+03 0.000000e+00 0.000000e+00 6.200000e+01 6.300000e+01 6.500000e+01 6.400000e+01 6.200000e+01 7.100000e+01 6.900000e+01 ... 2.000000e+01 1.080000e+02 2.740000e+02 1.143000e+03 4.520000e+02 5.040000e+02 3.771000e+01 3.056300e+04 7.671300e+04 6.212200e+04
interquartile_range 6.040000e+02 0.000000e+00 0.000000e+00 3.900000e+01 4.200000e+01 4.300000e+01 4.200000e+01 4.300000e+01 4.800000e+01 4.600000e+01 ... 2.300000e+01 1.180000e+02 1.930000e+02 6.180000e+02 2.370000e+02 2.640000e+02 6.940000e+00 1.890200e+04 4.591100e+04 3.206500e+04

10 rows × 107 columns

Every Dataset instance in the catalog contains other useful metadata:

1
dataset.to_dict()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
{'slug': 'ags_sociodemogr_a7e14220',
 'name': 'Sociodemographics - United States of America (Census Block Group)',
 'description': 'Census and ACS sociodemographic data estimated for the current year and data projected to five years. Projected fields are general aggregates (total population, total households, median age, avg income etc.)',
 'category_id': 'demographics',
 'country_id': 'usa',
 'data_source_id': 'sociodemographics',
 'provider_id': 'ags',
 'geography_name': 'Census Block Group - United States of America',
 'geography_description': None,
 'temporal_aggregation': 'yearly',
 'time_coverage': None,
 'update_frequency': 'yearly',
 'is_public_data': False,
 'lang': 'eng',
 'version': '2020',
 'category_name': 'Demographics',
 'provider_name': 'Applied Geographic Solutions',
 'geography_id': 'carto-do.ags.geography_usa_blockgroup_2015',
 'id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020'}

When exploring datasets in the Data Observatory catalog it’s very important that you understand clearly what variables are available to enrich your own data.

For each Variable in each dataset, the Data Observatory provides (as it does for datasets) a set of methods and attributes to understand their underlaying data.

Some of them are:

  • head and tail methods to get a glimpse of the actual data and start modelling your problem right away.
  • counts, quantiles and a full describe method with stats of the actual values in the dataset, such as: average, stdev, quantiles, min, max, median for each of the variables of the dataset.
  • an histogram plot with the distribution of the values on each variable.

Let’s review some of that augmented metadata for the variables in the AGS population dataset.

1
2
3
4
from cartoframes.data.observatory import Variable

variable = Variable.get('POPCY_4534fac4')
variable
1
<Variable.get('POPCY_4534fac4')> #'Population (2019A)'
1
variable.to_dict()
1
2
3
4
5
6
7
8
9
{'slug': 'POPCY_4534fac4',
 'name': 'Total Population',
 'description': 'Population (2019A)',
 'db_type': 'INTEGER',
 'agg_method': 'SUM',
 'column_name': 'POPCY',
 'variable_group_id': None,
 'dataset_id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020',
 'id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020.POPCY'}

There’s also some utility methods to understand the underlying data for each variable:

1
variable.head()
1
2
3
4
5
6
7
8
9
10
11
0    1500
1    2595
2    1585
3    1503
4     980
5    1249
6    1658
7    1078
8    2540
9    1980
dtype: int64
1
variable.counts()
1
2
3
4
5
6
7
8
9
10
11
12
all                 217182.000000
null                     0.000000
zero                   299.000000
extreme               9073.000000
distinct              6756.000000
outliers             26998.000000
null_percent             0.000000
zero_percent             0.137673
extreme_percent          0.041776
distinct_percent         3.110755
outliers_percent         0.124310
dtype: float64
1
variable.quantiles()
1
2
3
4
5
q1                      850
q3                     1454
median                 1125
interquartile_range     604
dtype: int64
1
variable.histogram()

1
variable.describe()
1
2
3
4
5
6
7
8
9
10
11
avg                    1.520470e+03
max                    6.710000e+04
min                    0.000000e+00
sum                    3.302187e+08
range                  6.710000e+04
stdev                  1.063417e+03
q1                     8.500000e+02
q3                     1.454000e+03
median                 1.125000e+03
interquartile_range    6.040000e+02
dtype: float64

Subscribe to a Dataset in the catalog

Once you have explored the catalog and have identified a dataset with the variables you need for your analysis and in the right spatial resolution, you can check is_public_data to know whether the dataset is freely accessible or you first need to purchase a license. Subscriptions are available for CARTO’s Enterprise plan users.

Subscriptions to datasets allow you to either use them from CARTOframes to enrich your own data or to download them. See the enrichment guide for more information.

Let’s check out the dataset and geography from our previous example:

1
dataset = Dataset.get('ags_sociodemogr_a7e14220')
1
dataset.is_public_data
1
False

This dataset is not public data, which means that you need a subscription to be able to use it to enrich your own data.

To subscribe to premium data in the Data Observatory catalog you need an Enterprise CARTO account with access to the Data Observatory.

1
2
3
from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')
1
dataset.subscribe()

Licenses to data in the Data Observatory grant you the right to use the data for the period of one year. Every non-public dataset or geography you want to use to enrich your own data require a valid license.

You can check the actual status of your subscriptions directly from the catalog.

1
2
3
from cartoframes.data.observatory import Catalog

Catalog().subscriptions()
1
2
Datasets: [<Dataset.get('ags_sociodemogr_a7e14220')>, <Dataset.get('ags_retailpoten_aaf25a8c')>]
Geographies: [<Geography.get('ags_blockgroup_1c63771c')>]

Data Access

Now that we have explored some basic information about the Dataset, we will proceed to download a sample of the Dataset into a dataframe so we can operate with it using Python.

Note: You’ll need your CARTO Account credentials to perform this action.

1
2
3
from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')
1
2
3
from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')
1
2
3
4
# Filter by SQL query
query = "SELECT * FROM $dataset$ LIMIT 50"

dataset_df = dataset.to_dataframe(sql_query=query)

Note about SQL filters

Our SQL filtering queries allow for any PostgreSQL and PostGIS operation, so you can filter the rows (by a WHERE condition) or the columns (using the SELECT). Some common examples are filtering the Dataset by bounding box or filtering by column value:

1
SELECT * FROM $dataset$ WHERE ST_IntersectsBox(geom, -74.044467,40.706128,-73.891345,40.837690)
1
SELECT total_pop, geom FROM $dataset$

A good tool to get the bounding box of a specific area is bboxfinder.com.

1
2
# First rows of the Dataset sample
dataset_df.head()
BLOCKGROUP POPCY POPCYGRP POPCYGRPI AGECY0004 AGECY0509 AGECY1014 AGECY1519 AGECY2024 AGECY2529 ... POPPY HHDPY DWLPY AGEPYMED INCPYPCAP INCPYAVEHH INCPYMEDHH geoid do_date geom
0 530330074013 1109 0 0 16 14 9 26 106 256 ... 1136 811 887 33.83 133973 187662 132266 530330074013 2020-01-01 00:00:00+00:00 POLYGON ((-122.32814 47.62490, -122.32812 47.6...
1 40134226252 1544 95 95 19 19 12 19 16 24 ... 1753 1208 1492 74.27 31521 45243 48999 40134226252 2020-01-01 00:00:00+00:00 POLYGON ((-111.70519 33.41560, -111.70522 33.4...
2 130510110061 1531 41 40 31 32 33 24 9 16 ... 1523 832 899 69.37 101521 185539 131521 130510110061 2020-01-01 00:00:00+00:00 POLYGON ((-81.06554 31.94715, -81.06484 31.946...
3 250173541004 1015 140 0 24 10 11 52 123 167 ... 1052 544 604 41.69 148962 286375 179544 250173541004 2020-01-01 00:00:00+00:00 POLYGON ((-71.12449 42.37597, -71.12478 42.375...
4 250250004014 2128 0 0 78 52 25 34 138 273 ... 2223 1264 1291 52.88 36240 63736 17438 250250004014 2020-01-01 00:00:00+00:00 POLYGON ((-71.15473 42.34121, -71.15446 42.341...

5 rows × 111 columns

You can also download the dataset directly to a CSV file

1
2
3
query = "SELECT * FROM $dataset$ LIMIT 50"

dataset_df = dataset.to_csv('my_dataset.csv', sql_query=query)
1
2
3
4
Data saved: my_dataset.csv
To load it as a DataFrame you can do:

    df = pandas.read_csv('my_dataset.csv')

Data Enrichment

We define enrichment as the process of augmenting your data with new variables by means of a spatial join between your data and a Dataset in CARTO’s Data Observatory, aggregated at a given spatial resolution, or in other words:

Enrichment is the process of adding variables to a geometry, which we call the target, (point, line, polygon…) from a spatial (polygon) dataset, which we call the source

We recommend you also check out the CARTOframes quickstart guide since it offers a complete example of data discovery and enrichment and also helps you build a simple dashboard to draw conclusions from the resulting data.

Note: You’ll need your CARTO Account credentials to perform this action.

1
2
3
from cartoframes.auth import set_default_credentials

set_default_credentials('creds.json')
1
2
3
4
5
from cartoframes.data.observatory import Dataset

dataset = Dataset.get('ags_sociodemogr_a7e14220')
variables = dataset.variables
variables
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
[<Variable.get('BLOCKGROUP_30e525a6')> #'Geographic Identifier',
 <Variable.get('POPCY_4534fac4')> #'Population (2019A)',
 <Variable.get('POPCYGRP_3033ef2e')> #'Population in Group Quarters (2019A)',
 <Variable.get('POPCYGRPI_1e42899')> #'Institutional Group Quarters Population (2019A)',
 <Variable.get('AGECY0004_aaae373a')> #'Population age 0-4 (2019A)',
 <Variable.get('AGECY0509_d2d4896c')> #'Population age 5-9 (2019A)',
 <Variable.get('AGECY1014_b09611e')> #'Population age 10-14 (2019A)',
 <Variable.get('AGECY1519_7373df48')> #'Population age 15-19 (2019A)',
 <Variable.get('AGECY2024_32919d33')> #'Population age 20-24 (2019A)',
 <Variable.get('AGECY2529_4aeb2365')> #'Population age 25-29 (2019A)',
 <Variable.get('AGECY3034_9336cb17')> #'Population age 30-34 (2019A)',
 <Variable.get('AGECY3539_eb4c7541')> #'Population age 35-39 (2019A)',
 <Variable.get('AGECY4044_41a06569')> #'Population age 40-44 (2019A)',
 <Variable.get('AGECY4549_39dadb3f')> #'Population age 45-49 (2019A)',
 <Variable.get('AGECY5054_e007334d')> #'Population age 50-54 (2019A)',
 <Variable.get('AGECY5559_987d8d1b')> #'Population age 55-59 (2019A)',
 <Variable.get('AGECY6064_d99fcf60')> #'Population age 60-64 (2019A)',
 <Variable.get('AGECY6569_a1e57136')> #'Population age 65-69 (2019A)',
 <Variable.get('AGECY7074_78389944')> #'Population age 70-74 (2019A)',
 <Variable.get('AGECY7579_422712')> #'Population age 75-79 (2019A)',
 <Variable.get('AGECY8084_a7c395dd')> #'Population age 80-84 (2019A)',
 <Variable.get('AGECYGT85_ac46767d')> #'Population age 85+ (2019A)',
 <Variable.get('AGECYMED_f218d6e9')> #'Median Age (2019A)',
 <Variable.get('SEXCYMAL_8ee6ade5')> #'Population male (2019A)',
 <Variable.get('SEXCYFEM_91d8b796')> #'Population female (2019A)',
 <Variable.get('RCHCYWHNHS_b4cab6fe')> #'Non Hispanic White (2019A)',
 <Variable.get('RCHCYBLNHS_93a8395b')> #'Non Hispanic Black (2019A)',
 <Variable.get('RCHCYAMNHS_6cb424ee')> #'Non Hispanic American Indian (2019A)',
 <Variable.get('RCHCYASNHS_dc720442')> #'Non Hispanic Asian (2019A)',
 <Variable.get('RCHCYHANHS_2b72f927')> #'Non Hispanic Hawaiian/Pacific Islander (2019A)',
 <Variable.get('RCHCYOTNHS_fe95829a')> #'Non Hispanic Other Race (2019A)',
 <Variable.get('RCHCYMUNHS_3ce9b69f')> #'Non Hispanic Multiple Race (2019A)',
 <Variable.get('HISCYHISP_e62d7c2e')> #'Population Hispanic (2019A)',
 <Variable.get('MARCYNEVER_eee4f8c3')> #'Never Married (2019A)',
 <Variable.get('MARCYMARR_17f0d887')> #'Now Married (2019A)',
 <Variable.get('MARCYSEP_d4d69eb8')> #'Separated (2019A)',
 <Variable.get('MARCYWIDOW_5ce5d993')> #'Widowed (2019A)',
 <Variable.get('MARCYDIVOR_146db750')> #'Divorced (2019A)',
 <Variable.get('AGECYGT15_7d84cd34')> #'Population Age 15+ (2019A)',
 <Variable.get('EDUCYLTGR9_ed0362fa')> #'Pop 25+ less than 9th grade (2019A)',
 <Variable.get('EDUCYSHSCH_7a88e398')> #'Pop 25+ 9th-12th grade no diploma (2019A)',
 <Variable.get('EDUCYHSCH_a7a81733')> #'Pop 25+ HS graduate (2019A)',
 <Variable.get('EDUCYSCOLL_3840e65b')> #'Pop 25+ college no diploma (2019A)',
 <Variable.get('EDUCYASSOC_dcd76160')> #'Pop 25+ Associate degree (2019A)',
 <Variable.get('EDUCYBACH_d7b78049')> #'Pop 25+ Bachelor's degree (2019A)',
 <Variable.get('EDUCYGRAD_c58943fb')> #'Pop 25+ graduate or prof school degree (2019A)',
 <Variable.get('AGECYGT25_56a99ef7')> #'Population Age 25+ (2019A)',
 <Variable.get('HHDCY_935c1592')> #'Households (2019A)',
 <Variable.get('HHDCYFAM_c1a6fccf')> #'Family Households (2019A)',
 <Variable.get('HHSCYMCFCH_bd115dc2')> #'Families married couple w children (2019A)',
 <Variable.get('HHSCYLPMCH_ce8863e2')> #'Families male no wife w children (2019A)',
 <Variable.get('HHSCYLPFCH_c2dd8c03')> #'Families female no husband children (2019A)',
 <Variable.get('HHDCYAVESZ_d265f21c')> #'Average Household Size (2019A)',
 <Variable.get('HHDCYMEDAG_4f099151')> #'Median Age of Householder (2019A)',
 <Variable.get('VPHCYNONE_3755ac60')> #'Households: No Vehicle Available (2019A)',
 <Variable.get('VPHCY1_bed441da')> #'Households: One Vehicle Available (2019A)',
 <Variable.get('VPHCYGT1_e4a07c30')> #'Households: Two or More Vehicles Available (2019A)',
 <Variable.get('INCCYPCAP_7c8377cf')> #'Per capita income (2019A)',
 <Variable.get('INCCYAVEHH_1ef75363')> #'Average household Income (2019A)',
 <Variable.get('INCCYMEDHH_98692c24')> #'Median household income (2019A)',
 <Variable.get('INCCYMEDFA_7f36b90e')> #'Median family income (2019A)',
 <Variable.get('HINCYLT10_61c14e29')> #'Household Income < $10000 (2019A)',
 <Variable.get('HINCY1015_c720a11b')> #'Household Income $10000-$14999 (2019A)',
 <Variable.get('HINCY1520_9aacc4bc')> #'Household Income $15000-$19999 (2019A)',
 <Variable.get('HINCY2025_feb85d36')> #'Household Income $20000-$24999 (2019A)',
 <Variable.get('HINCY2530_91025a13')> #'Household Income $25000-$29999 (2019A)',
 <Variable.get('HINCY3035_5f1f0b12')> #'Household Income $30000-$34999 (2019A)',
 <Variable.get('HINCY3540_66ffabb1')> #'Household Income $35000-$39999 (2019A)',
 <Variable.get('HINCY4045_8d89a56c')> #'Household Income $40000-$44999 (2019A)',
 <Variable.get('HINCY4550_e233a249')> #'Household Income $45000-$49999 (2019A)',
 <Variable.get('HINCY5060_77695404')> #'Household Income $50000-$59999 (2019A)',
 <Variable.get('HINCY6075_cad3e24')> #'Household Income $60000-$74999 (2019A)',
 <Variable.get('HINCY75100_bb90c7bb')> #'Household Income $75000-$99999 (2019A)',
 <Variable.get('HINCY10025_40903e13')> #'Household Income $100000-$124999 (2019A)',
 <Variable.get('HINCY12550_d379563b')> #'Household Income $125000-$149999 (2019A)',
 <Variable.get('HINCY15020_7243aae')> #'Household Income $150000-$199999 (2019A)',
 <Variable.get('HINCYGT200_c39e094b')> #'Household Income > $200000 (2019A)',
 <Variable.get('HINCYMED24_4ac9369')> #'Median Household Income: Age < 25 (2019A)',
 <Variable.get('HINCYMED25_73aba3ff')> #'Median Household Income: Age 25-34 (2019A)',
 <Variable.get('HINCYMED35_6ab092be')> #'Median Household Income: Age 35-44 (2019A)',
 <Variable.get('HINCYMED45_25f10479')> #'Median Household Income: Age 45-54 (2019A)',
 <Variable.get('HINCYMED55_3cea3538')> #'Median Household Income: Age 55-64 (2019A)',
 <Variable.get('HINCYMED65_17c766fb')> #'Median Household Income: Age 65-74 (2019A)',
 <Variable.get('HINCYMED75_edc57ba')> #'Median Household Income: Age 75+ (2019A)',
 <Variable.get('LBFCYPOP16_75363c6f')> #'Population Age 16+ (2019A)',
 <Variable.get('LBFCYARM_c8f45b67')> #'Pop 16+ in Armed Forces (2019A)',
 <Variable.get('LBFCYEMPL_1902fd90')> #'Pop 16+ civilian employed (2019A)',
 <Variable.get('LBFCYUNEM_befc2d4')> #'Pop 16+ civilian unemployed (2019A)',
 <Variable.get('LBFCYNLF_803bfa0d')> #'Pop 16+ not in labor force (2019A)',
 <Variable.get('UNECYRATE_a642ed8a')> #'Unemployment Rate (2019A)',
 <Variable.get('LBFCYLBF_1d3c03ed')> #'Population In Labor Force (2019A)',
 <Variable.get('LNIEXSPAN_8f8728c7')> #'SPANISH SPEAKING HOUSEHOLDS',
 <Variable.get('LNIEXISOL_c2e86dc7')> #'LINGUISTICALLY ISOLATED HOUSEHOLDS (NON-ENGLISH SP...',
 <Variable.get('HOOEXMED_48df3206')> #'Median Value of Owner Occupied Housing Units',
 <Variable.get('RNTEXMED_6ac2e609')> #'Median Cash Rent',
 <Variable.get('HUSEX1DET_231a9f6c')> #'UNITS IN STRUCTURE: 1 DETACHED',
 <Variable.get('HUSEXAPT_dc7d3c72')> #'UNITS IN STRUCTURE: 20 OR MORE',
 <Variable.get('DWLCY_50c5eee2')> #'Housing units (2019A)',
 <Variable.get('DWLCYVACNT_6b929d9a')> #'Housing units vacant (2019A)',
 <Variable.get('DWLCYRENT_3601a69e')> #'Occupied units renter (2019A)',
 <Variable.get('DWLCYOWNED_858b3ad6')> #'Occupied units owner (2019A)',
 <Variable.get('POPPY_24dbbb56')> #'Population (2024A)',
 <Variable.get('HHDPY_f2b35400')> #'Households (2024A)',
 <Variable.get('DWLPY_312aaf70')> #'Housing units (2024A)',
 <Variable.get('AGEPYMED_d5583bbb')> #'Median Age (2024A)',
 <Variable.get('INCPYPCAP_f9c107fa')> #'Per capita income (2024A)',
 <Variable.get('INCPYAVEHH_48c1d530')> #'Average household Income (2024A)',
 <Variable.get('INCPYMEDHH_ce5faa77')> #'Median household income (2024A)',
 <Variable.get('do_date_be147af')> #'First day of time period',
 <Variable.get('geoid_818ef39c')> #'Geographical Identifier']

The ags_sociodemogr_f510a947 dataset contains socio-demographic variables aggregated by Census block group level.

Let’s try and find a variable for total population:

1
2
vdf = variables.to_dataframe()
vdf[vdf['name'].str.contains('pop', case=False, na=False)]
slug name description db_type agg_method column_name variable_group_id dataset_id id
1 POPCY_4534fac4 Total Population Population (2019A) INTEGER SUM POPCY None carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us...
2 POPCYGRP_3033ef2e POPCYGRP Population in Group Quarters (2019A) INTEGER SUM POPCYGRP None carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us...
3 POPCYGRPI_1e42899 POPCYGRPI Institutional Group Quarters Population (2019A) INTEGER SUM POPCYGRPI None carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us...
84 LBFCYPOP16_75363c6f LBFCYPOP16 Population Age 16+ (2019A) INTEGER SUM LBFCYPOP16 carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us...
101 POPPY_24dbbb56 Total population Population (2024A) INTEGER SUM POPPY None carto-do.ags.demographics_sociodemographics_us... carto-do.ags.demographics_sociodemographics_us...

We can store the variable instance we need by searching the Catalog by its slug, in this case POPCY_4534fac4:

1
2
variable = Variable.get('POPCY_4534fac4')
variable.to_dict()
1
2
3
4
5
6
7
8
9
{'slug': 'POPCY_4534fac4',
 'name': 'Total Population',
 'description': 'Population (2019A)',
 'db_type': 'INTEGER',
 'agg_method': 'SUM',
 'column_name': 'POPCY',
 'variable_group_id': None,
 'dataset_id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020',
 'id': 'carto-do.ags.demographics_sociodemographics_usa_blockgroup_2015_yearly_2020.POPCY'}

The POPCY variable contains the SUM of the population per blockgroup for the year 2019. Let’s enrich our stores DataFrame with that variable.

Enrich points

Let’s start by loading the geocoded Starbucks stores:

1
2
3
4
from geopandas import read_file

stores_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_geocoded.geojson')
stores_gdf.head()
cartodb_id field_1 name address revenue geometry
0 1 0 Franklin Ave & Eastern Pkwy 341 Eastern Pkwy,Brooklyn, NY 11238 1321040.772 POINT (-73.95901 40.67109)
1 2 1 607 Brighton Beach Ave 607 Brighton Beach Avenue,Brooklyn, NY 11235 1268080.418 POINT (-73.96122 40.57796)
2 3 2 65th St & 18th Ave 6423 18th Avenue,Brooklyn, NY 11204 1248133.699 POINT (-73.98976 40.61912)
3 4 3 Bay Ridge Pkwy & 3rd Ave 7419 3rd Avenue,Brooklyn, NY 11209 1185702.676 POINT (-74.02744 40.63152)
4 5 4 Caesar's Bay Shopping Center 8973 Bay Parkway,Brooklyn, NY 11214 1148427.411 POINT (-74.00098 40.59321)

Alternatively, you can load data in any geospatial format supported by GeoPandas or CARTO.

As we can see, for each store we have its name, address, the total revenue by year and a geometry column indicating the location of the store. This is important because for the enrichment service to work, we need a DataFrame with a geometry column encoded as a shapely object.

We can now create a new Enrichment instance, and since the stores_gdf dataset represents store locations (points), we can use the enrich_points function passing as arguments the stores DataFrame and a list of Variables (that we have a valid subscription from the Data Observatory catalog for).

In this case we are only enriching one variable (the total population), but we could enrich a list of them.

1
2
3
4
from cartoframes.data.observatory import Enrichment

enriched_stores_gdf = Enrichment().enrich_points(stores_gdf, [variable])
enriched_stores_gdf.head()
cartodb_id field_1 name address revenue geometry POPCY do_area
0 1 0 Franklin Ave & Eastern Pkwy 341 Eastern Pkwy,Brooklyn, NY 11238 1321040.772 POINT (-73.95901 40.67109) 2608 59840.196748
1 2 1 607 Brighton Beach Ave 607 Brighton Beach Avenue,Brooklyn, NY 11235 1268080.418 POINT (-73.96122 40.57796) 1792 60150.636995
2 3 2 65th St & 18th Ave 6423 18th Avenue,Brooklyn, NY 11204 1248133.699 POINT (-73.98976 40.61912) 733 38950.618837
3 4 3 Bay Ridge Pkwy & 3rd Ave 7419 3rd Avenue,Brooklyn, NY 11209 1185702.676 POINT (-74.02744 40.63152) 1155 57353.293114
4 5 4 Caesar's Bay Shopping Center 8973 Bay Parkway,Brooklyn, NY 11214 1148427.411 POINT (-74.00098 40.59321) 2266 188379.242640

Once the enrichment finishes, we can see there is a new column in our DataFrame called POPCY with population projected for the year 2019, from the US Census block group which contains each one of our Starbucks stores. The enrichment process also provides an extra column called do_area with the information of the area in square meters covered by the polygons in the source dataset we are using to enrich our data.

Enrich polygons

Next, let’s do a second enrichment, but this time using a DataFrame with areas of influence calculated using the CARTOframes isochrones service to obtain the polygon around each store that covers the area within an 8, 17 and 25 minute walk.

1
2
aoi_gdf = read_file('http://libs.cartocdn.com/cartoframes/files/starbucks_brooklyn_isolines.geojson')
aoi_gdf.head()
data_range lower_data_range range_label geometry
0 500 0 8 min. MULTIPOLYGON (((-73.95959 40.67571, -73.95971 ...
1 1000 500 17 min. POLYGON ((-73.95988 40.68110, -73.95863 40.681...
2 1500 1000 25 min. POLYGON ((-73.95986 40.68815, -73.95711 40.688...
3 500 0 8 min. MULTIPOLYGON (((-73.96185 40.58321, -73.96231 ...
4 1000 500 17 min. MULTIPOLYGON (((-73.96684 40.57483, -73.96830 ...

In this case we have a DataFrame which, for each index in the stores_gdf, contains a polygon of the areas of influence around each store at 8, 17 and 25 minute walking intervals. Again the geometry is encoded as a shapely object.

In this case, the Enrichment service provides an enrich_polygons function, which in its basic version, works in the same way as the enrich_points function. It just needs a DataFrame with polygon geometries and a list of variables to enrich:

1
2
3
4
from cartoframes.data.observatory import Enrichment

enriched_aoi_gdf = Enrichment().enrich_polygons(aoi_gdf, [variable])
enriched_aoi_gdf.head()
data_range lower_data_range range_label geometry POPCY
0 500 0 8 min. MULTIPOLYGON (((-73.95959 40.67571, -73.95971 ... 21893.521552
1 1000 500 17 min. POLYGON ((-73.95988 40.68110, -73.95863 40.681... 60463.830128
2 1500 1000 25 min. POLYGON ((-73.95986 40.68815, -73.95711 40.688... 111036.739845
3 500 0 8 min. MULTIPOLYGON (((-73.96185 40.58321, -73.96231 ... 23118.113571
4 1000 500 17 min. MULTIPOLYGON (((-73.96684 40.57483, -73.96830 ... 29212.999892

We now have a new column in our areas of influence DataFrame, SUM_POPCY, which represents the SUM of the total population in the Census block groups that instersect with each polygon in our DataFrame.

How enrichment works

Let’s take a deeper look into what happens under the hood when you execute a polygon enrichment.

Imagine we have polygons representing municipalities, in blue, each of which have a population attribute, and we want to find out the population inside the green circle.

Polygon enrichment

We don’t know how the population is distributed inside these municipalities. They are probably concentrated in cities somewhere, but, since we don’t know where they are, our best guess is to assume that the population is evenly distributed in the municipality (i.e. every point inside the municipality has the same population density).

Population is an extensive property (it grows with area), so we can subset it (a region inside the municipality will always have a smaller population than the whole municipality), and also aggregate it by summing.

In this case, we’d calculate the population inside each part of the circle that intersects with a municipality.

Default aggregation methods

In the Data Observatory, we suggest a default aggregation method for certain fields. However, some fields don’t have a clear best method, and some just can’t be aggregated. In these cases, we leave the agg_method field blank and let the user choose the method that best fits their needs.