Spatial Data Science Glossary
Here we have collected all the key terms you need to know to become a Spatial Data Scientist.
A
Type of hierarchical clustering in which clusters are built from the bottom up. The algorithm starts with each object in its own cluster; clusters are then recursively merged (agglomerated) according to a 'linkage strategy', such as minimizing the sum of squared distances within a cluster.
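The bottom-up merging described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: it works on 1-D points and, for brevity, assumes a single-linkage rule (smallest pairwise distance) rather than the sum-of-squares criterion mentioned in the definition. The function names and sample values are hypothetical.

```python
from itertools import combinations

def single_linkage(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, n_clusters, linkage=single_linkage):
    # Start with every object in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters that the linkage rule rates as closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge (agglomerate) them; j > i, so popping j leaves index i valid.
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters

result = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], n_clusters=3)
```

Passing a different `linkage` function changes the merge criterion without touching the loop.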
Data associated with a fixed set of locations, with well-defined boundaries. The boundaries can be irregular, as in the case of administrative units (e.g. districts, regions, counties), or can be defined by a regular grid, as in the case of raster data. Typical applications consist of model inference, prediction at unsampled locations, and spatial smoothing.
B
Provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself.
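As a small worked example of the rule above, the sketch below computes a posterior probability for a binary hypothesis. The function name and the numbers (a 1% prior, a 95% chance of the data if the hypothesis is true, 5% if it is false) are made up for illustration.

```python
def posterior(prior, likelihood, likelihood_if_false):
    """P(H | D) from P(H), P(D | H) and P(D | not H)."""
    # Total probability of observing the data under either outcome.
    evidence = likelihood * prior + likelihood_if_false * (1 - prior)
    return likelihood * prior / evidence

p = posterior(prior=0.01, likelihood=0.95, likelihood_if_false=0.05)
print(round(p, 3))
```

Even with a fairly reliable observation, a low prior keeps the posterior modest, which is the intuition the theorem formalizes.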
Uses Bayes' theorem to compute and update probabilities after obtaining new data.
C
The science and technique of building maps to communicate spatial information.
Statistical technique for grouping data in such a way that items belonging to the same group (cluster) are more similar to each other than to those in other clusters.
If data are distributed randomly and uniformly over the study area, they are said to exhibit complete spatial randomness (CSR).
Conditional autoregressive (CAR) and simultaneous autoregressive (SAR) models: spatial random effects used for fitting hierarchical Bayesian models.
Model validation technique for assessing how the results of a statistical model will generalize to new data. It involves partitioning a sample of data into complementary subsets, performing the analysis on one subset and validating the analysis on the other subset.
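One common way to do the partitioning described above is k-fold splitting, where each subset takes a turn as the validation set. The sketch below only produces the index splits; the helper name is hypothetical and no model is fitted.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
```

In practice the indices are usually shuffled first; that step is omitted here to keep the example deterministic.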
D
A process for cleaning & transforming data to extract useful insights for making decisions.
Clustering method that groups together data points that are close to each other, based on a distance metric and a minimum number of data points. With an appropriate metric, it can be applied to the coordinates of point-referenced data to perform spatial clustering.
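The density-based idea above can be illustrated with a compact, brute-force sketch. It assumes Euclidean distance and two hypothetical parameters, `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size, counting the point itself); real libraries use spatial indexes instead of the quadratic neighbour search shown here.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisionally noise
            continue
        cluster += 1                 # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts: # j is also a core point: keep expanding
                queue.extend(nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

For geographic coordinates, `dist` would be replaced by a great-circle distance so that `eps` has a meaningful unit.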
F
Clustering method designed to determine the best arrangement of values into different classes.
G
Stochastic process that satisfies the Markov property: the parameters of the i-th area are conditionally independent of all the other parameters given those of its neighbours.
Stochastic process whose finite-dimensional marginal distributions are Gaussian. It is parameterized by a mean function and a covariance function, which take vectors of inputs and return the mean vector and covariance matrix of the outputs at those input points for functions drawn from the process.
Create point geometries in your data.
The process that converts an address, such as a street address, into latitude and longitude coordinates, which you can use to position a marker on a map.
A system to capture and analyze spatial data.
Spatially varying coefficient model used as an exploratory technique intended to indicate where non-stationarity is taking place.
An open format designed for encoding spatial data.
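To give a feel for such a format, the sketch below builds a small GeoJSON document with Python's standard `json` module. The feature and its property are illustrative; note that GeoJSON positions are written as longitude first, then latitude.

```python
import json

# A single point feature; coordinates are [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-3.7038, 40.4168]},
    "properties": {"name": "Madrid"},  # illustrative attribute
}
collection = {"type": "FeatureCollection", "features": [feature]}

text = json.dumps(collection, indent=2)
print(text)
```

Because the result is plain JSON, any JSON parser can read it back, and most mapping tools accept it directly.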
Data with a geographic component.
Collection of random variables that are associated with the nodes of a graph.
H
Statistical model written in a hierarchical (multi-level) form that estimates the parameters of the posterior distribution using the Bayesian method.
I
Combines analytical approximations and efficient numerical integration schemes to achieve highly accurate deterministic approximations to the posterior distribution.
A stochastic process is said to be intrinsically stationary if the variance of its increments (its variogram) does not change when shifted in space.
Isoline for travel time, that is, a curve of equal travel time.
J
An open-source web application that allows data scientists to create and share documents that contain live code, equations, visualizations, and text.
K
Non-spatial clustering method that aims to partition the data into a fixed number of clusters in which each data point belongs to the cluster with the closest mean.
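The "closest mean" rule above is usually applied iteratively, alternating between assigning points and recomputing means (often called Lloyd's algorithm). The sketch below assumes caller-supplied starting centroids and a fixed number of iterations; the names and data are hypothetical.

```python
from math import dist

def k_means(points, centroids, n_iter=10):
    """Alternate assignment and update steps from given initial centroids."""
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest mean.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centroids, groups = k_means(points, centroids=[(0, 0), (10, 0)])
```

The result depends on the starting centroids, which is why library implementations typically run several random initializations.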
Spatial interpolation method used to derive predictions at unmeasured locations based on a Gaussian process (GP). The covariance function is usually derived from a variogram analysis.
L
Data freely available for everyone to use without restrictions.
Location that someone may find useful or interesting, such as a restaurant, monument, park, or school.
The methodology for transforming your location data into business outcomes. Location data can be anything from addresses and latitude/longitude coordinates, to existing points, lines, and polygons.
M
Class of simulation methods used to approximate the posterior distribution by randomly sampling in a probabilistic space.
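One of the simplest members of this class is the random-walk Metropolis sampler, sketched below. It assumes a one-dimensional target known only through its log-density up to a constant; the function name, step size, and the standard normal target are illustrative choices, not part of the definition above.

```python
import math
import random

def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a 1-D unnormalized log-density."""
    rng = random.Random(seed)
    x, samples = start, []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(x)), on the log scale.
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        if math.log(1.0 - rng.random()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x)  # a rejected proposal repeats the current state
    return samples

# Target: standard normal, log p(x) = -x^2 / 2 + constant.
samples = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With enough samples, `mean` and `variance` should approach 0 and 1, the moments of the target distribution.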
Stochastic model describing a sequence of possible states in which the probability of each state depends only on the previous state.
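The dependence on the previous state alone can be made concrete with a transition matrix. The sketch below assumes a hypothetical two-state chain and propagates a probability distribution forward step by step.

```python
def step(distribution, transition):
    """One step of a Markov chain: new[j] = sum_i old[i] * P[i][j]."""
    n = len(transition)
    return [sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain; row i holds the probabilities of leaving state i.
P = [[0.9, 0.1],
     [0.5, 0.5]]

dist_t = [1.0, 0.0]  # start in state 0 with certainty
for _ in range(50):
    dist_t = step(dist_t, P)
```

After many steps the distribution settles at the chain's stationary distribution, here 5/6 and 1/6, regardless of the starting state.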
Comprehensive library for creating static, animated, and interactive visualizations in Python.
The model is the formulation of the problem.
Measure of global and local spatial autocorrelation for areal data.
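For the global case, a widely used statistic of this kind is Moran's I, which compares each area's deviation from the mean with that of its neighbours. The sketch below assumes a binary adjacency matrix as the spatial weights and four hypothetical areas arranged in a line.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed on areas with spatial weights w[i][j]."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)  # sum of all weights
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * num / den

# Four areas in a row; each area neighbours the one next to it.
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
i_global = morans_i([1, 2, 3, 4], w)
```

Positive values indicate that similar values tend to sit next to each other; values near zero suggest no spatial pattern.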
N
Data associated with a set of ordered points, connected by straight lines. Examples include data from mobility networks, the internet, and mobile phone networks. Typical applications include the analysis of spatial networks and route optimization.
P
Fast and open source data analysis and manipulation tool, built on top of the Python programming language.
Data representing occurrences of events where locations themselves are random. In this context, this data is useful in evaluating possible clustering or inhibition between the observations.
Data associated with a spatial index that varies continuously across space. Examples include data from GPS tracking, fixed devices, and high-resolution satellites. This data is often useful for model inference and prediction at unsampled locations.
PostgreSQL is a general-purpose, object-relational database management system.
Programming language.
R
Programming language.
Type of clustering that enforces contiguity constraints on the geographies. That means that smaller geographies can be put together to form larger, contiguous regions that are constructed to optimize for qualities such as similar populations, homogenous measures (e.g., similar socio-demographic characteristics), and compactness among others.
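Root mean squared error (RMSE) is the square root of the mean of the squared prediction errors; when the errors have zero mean, it equals their standard deviation.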
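Computed directly, the quantity is the square root of the average squared difference between observed and predicted values. The sketch below uses made-up numbers.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared prediction error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

error = rmse([3, 5, 2], [2, 5, 4])  # errors are 1, 0 and -2
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones.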
S
Regionalization method that works by constructing a contiguity-based minimum spanning tree that ensures homogeneity within trees by minimizing costs that are the inverse of the similarity of joined regions.
Clustering methods accounting for the spatial relationships inherent in spatial data.
Spatial confounding occurs when adding a spatially-correlated error term changes the estimates of the fixed-effect coefficients, especially when the fixed effects are highly correlated with the spatially structured random effect.
Cross-validation technique that uses the spatial information to partition the data into subsets.
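One simple way to use spatial information when partitioning is to split by grid cell, so that nearby points always fall in the same fold. The sketch below is only one possible scheme; the function name, the square grid, and the round-robin assignment of cells to folds are assumptions made for illustration.

```python
def spatial_folds(coords, cell_size, k):
    """Assign each (x, y) point to one of k folds via its grid cell."""
    def cell(x, y):
        return (int(x // cell_size), int(y // cell_size))

    # Distinct occupied cells, in a fixed order, dealt out to folds in turn.
    cells = sorted({cell(x, y) for x, y in coords})
    fold_of_cell = {c: i % k for i, c in enumerate(cells)}
    return [fold_of_cell[cell(x, y)] for x, y in coords]

coords = [(0.1, 0.2), (0.4, 0.9), (5.2, 0.3), (5.8, 0.1), (9.5, 9.5)]
fold_ids = spatial_folds(coords, cell_size=1.0, k=2)
```

Keeping neighbours together reduces the leakage that occurs when a model is validated on points lying right next to its training data.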
Data structure that allows spatial objects to be accessed efficiently.
Consists of the analysis of spatial data (i.e. data that exhibits spatial dependence) to make inferences about the model parameters, to predict at unsampled locations, and for spatial smoothing.
A stochastic process is said to be stationary if its joint probability distribution does not change when shifted in space.
Consists of the analysis of spatio-temporal data: the data are defined by a process indexed by space and time.
Consists of representing a GP (a continuous spatial process) using a GMRF (a discretely indexed spatial process).
U
Dimensionality reduction technique and clustering method.
V
Defines the variability between data points as a function of distance only.
Consists of computing the experimental (empirical) variogram from the data and fitting a variogram model to it in order to infer the parameters characterizing the spatial dependence.
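The first half of that procedure, computing the empirical variogram, can be sketched as follows. The example assumes the usual estimator (half the average squared difference between pairs of values, grouped by separation distance) with equal-width distance bins; the function name, bin settings, and data are hypothetical, and the model-fitting step is not shown.

```python
from itertools import combinations
from math import dist

def empirical_variogram(coords, values, bin_width, n_bins):
    """Semivariance per distance bin; bin b covers [b * width, (b + 1) * width)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, j in combinations(range(len(coords)), 2):
        b = int(dist(coords[i], coords[j]) // bin_width)
        if b < n_bins:
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    # Half the mean squared difference; None marks bins with no pairs.
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
gamma = empirical_variogram(coords, [1, 2, 4, 7], bin_width=1.0, n_bins=4)
```

A parametric model (for example spherical or exponential) would then be fitted to the non-empty bins of `gamma`.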
W
A stochastic process is said to be weakly stationary if its covariance function does not change when shifted in space.