Detect Outliers and Clusters

Analysis Guides

Detect Outliers and Clusters

This guide describes how to find regions in your data where high or low values are spatially clustered, and where values on opposite sides of the mean value are adjacent to one another (outliers). This analysis also indicates if the clusters or outliers are statistically significant (not due to chance). For example, use this analysis to find connected regions of high poverty, and isolated places of high poverty adjacent to areas of low poverty.

This technique finds these correlations by looking at a geography's attribute value, and the values in its geographical neighborhood, as compared to the entire dataset.

Detect outliers and clusters is primarily used as an exploratory data analysis tool to uncover statistically significant patterns in data. In the field of geostatistics, this technique is called Moran's I. It often is used to build inferences about the underlying data, and is usually a stepping-off point for other analyses, such as regression, or simply to highlight significant areas in a dataset.

Example

An example of the Detect outliers and clusters analysis was used in the CARTO blog post, The L Train Closure -- What data can tell us. For this guide, let's apply the analysis to find the clusters of high and low rates of poverty in Brooklyn.

  1. Import the .carto file available from the "Download resources" of this guide. Builder opens displaying Brooklyn Poverty as the first and only map layer. This layer contains data from the Data Observatory on poverty and total population at the block group level from the American Community Survey.

    Click on "Download resources" from this guide to download the zip file to your local machine. Extract the zip file to view the .carto file(s) used for this guide.
  2. Select the Brooklyn Poverty map layer.

  3. Click the ANALYSIS tab.

  4. Apply the Detect outliers and cluster analysis, entering the following parameters:

    • For the TARGET COLUMN, select povery_per_pop as the numerator column.
    • For SIGNIFICANCE, enter 1, in order to get back all of the geometries.
    • For NEIGHBORS, keep the default selection of 5.
    • Click APPLY.

    The calculation is the ratio of the number of poverty within the total population, and the result detects patterns of poverty in Brooklyn. You can now visualize clusters of high and low rates of poverty.

    Dete applied to layer

Analysis Output

The quads and significance values are the most important outputs of the Detect outliers and clusters analysis. Let's add these column values as widgets on our dashboard, to explore our data.

Column Value Description
quads Contains the following values: HH, LL, HL, and LH, where H is for high, L is for low. The interpretation of these is as follows:

- The first letter is the unit compared to the rest of the dataset (i.e., is it high or low compared to the dataset's average).

- The second letter is how the average of the neighbors of the geography compare to the entire dataset.
significance Measures whether the pattern of values in the geography could be better attributed to random chance. The lower the significance value, the more likely the geography's quads designation is not due to random chance (i.e., there is an underlying process driving the spatial arrangement of values).

For example, if a county has a value of 10, neighboring counties have an average of 6, and the dataset as a whole has an average of 2, then the classification would be HH.

  1. From the Brooklyn Poverty map layer, click the DATA tab to access the widget shortcut options.

  2. Click the checkbox next to Add as a widget for the quads column and for the significance column.

  3. Edit the significance histogram widget and set the number of buckets to 30.

    Edit number of histogram widget buckets

  4. Filter the significance Histogram widget to 0.05 and less, to ensure that you are looking at highly significant patterns.

    Any time that you filter a widget connected to a layer that contains analyses, the analysis is rerun and recalculated!
  5. Apply Auto style to the quads widget. This allows the widget to act as a legend, so that you can highlight the different patterns of poverty.

Add widgets from map layer

Based on the filters applied, the following clear patterns arise:

  • Large clusters of low poverty in the northwest (Greenpoint), west (Brooklyn Heights through Prospect Heights), and southeast
  • Large clusters of high poverty south of Greenpoint, near Broadway Junction in the central east, and elsewhere
  • The outliers (LH and HL) tend to lie adjacent to the larger clusters.

Advanced Cartography Tips

A default color scheme differentiates the quads values. Alternatively, the following CartoCSS code can be applied in Builder, as a best practice for styling Detect outliers and clusters data.

  1. Click the STYLE tab of the Brooklyn Povery map layer.

  2. Switch the slider button, located at the bottom of the STYLE tab, from VALUES to CARTOCSS and apply the following custom styling.

#layer {
  polygon-fill: ramp([quads], (#ac4166,#df749d,#6ba3d9,#3c6ea9), ("HH", "HL", "LH", "LL"));
  polygon-opacity: 0.9;
  polygon-gamma: 0.5;
  line-width: 0.5;
  line-color: #FFF;
  line-opacity: 0.25;
}

Additional Reading

Advanced Resources