Network Science Institute | Northeastern University
NETS 7370 Computational Urban Science
2026-02-09
This week:
Statistical Analysis of Urban Data
Computational Urban Science is primarily concerned with spatially embedded features of cities.
Some features are explicitly spatial: commuting and infrastructure networks, physical amenity visitation.
Some features are influenced by space: Social, communication, employment / opportunity networks.
The spatial structure of urban data requires special consideration.
Consider Tobler’s first “law” of Geography: “Everything is related to everything else, but near things are more related than distant things.”
In CUS, we want reliable, repeatable insights about urban systems. If everything is related to everything else:
How can we achieve reliable statistical estimates of the relationship between urban variables?
How can we measure causal relationships in spatially interconnected systems?
In Week 2, we discussed the Modifiable Areal Unit Problem (MAUP) and how the choice of scale shapes analytical conclusions.
Today, we will use many definitions of proximity and adjacency as we aim to encode the spatial structure of our data into our analysis.
How do we define which things are “near” one another as described in Tobler’s law? Euclidean distance? Geodesic distance? Travel time? Semantic distance?
How do we define adjacency? \(k\)-nearest neighbors? If so, what is \(k\)? What about physical boundaries between physically adjacent features?
Like MAUP, appropriate definitions of spatial structure in your data require your own scientific judgment.
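To make the choice of "near" concrete, here is a minimal sketch comparing two distance definitions and building a \(k\)-nearest-neighbor adjacency. The function names and the sample points are hypothetical, chosen only for illustration. Note that kNN adjacency is generally not symmetric: an isolated point has neighbors, but may be nobody's neighbor.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two (x, y) points in projected units."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def haversine_km(p, q, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

def knn_neighbors(points, k, dist):
    """Map each point index to the indices of its k nearest other points."""
    neighbors = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: dist(p, points[j]))
        neighbors[i] = others[:k]
    return neighbors

# Hypothetical points on a projected plane (arbitrary units).
points = [(0, 0), (1, 0), (0, 2), (5, 5)]
knn = knn_neighbors(points, k=2, dist=euclidean)

# Point 3 lists points 2 and 1 as neighbors, but no point lists 3 in return.
print(knn)
# One degree of longitude along the equator, in kilometers.
print(round(haversine_km((0.0, 0.0), (0.0, 1.0)), 2))
```

Swapping `dist` for a travel-time lookup would change the neighbor sets without changing the rest of the code, which is one way to test how sensitive a result is to the definition of proximity.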
A tale of two cities: London’s rich and poor in Tower Hamlets
Is Tobler’s First Law a Law? I prefer “empirical regularities”.
Spatial features have consistent, repeated patterns which should inform how you address statistical and causal inference and other analyses of spatial data.
Some of these regularities are:
Spatial autocorrelation (a.k.a. “clustering” or spatial heterogeneity)
Spatial nonstationarity (variation of statistical relationships across space)
Physical constraints on network structure
Tobler’s first law revisited: “…near things are more related than distant things.”
This is an empirical observation which holds true for a wide range of spatial phenomena.
Spatial autocorrelation permits:
Spatial autocorrelation hinders:
A funny example: Inverse-distance Weighting (IDW) (1965) beats Google Research’s [1] elevation predictions:
General Geospatial Inference with a Population Dynamics Foundation Model [1]
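For reference, inverse-distance weighting itself fits in a few lines. The sketch below is a generic IDW estimator, not the configuration used in the comparison above; the sample points, the `power` parameter default, and the function name are assumptions made for illustration.

```python
import math

def idw_predict(known, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` from (x, y, value) samples."""
    num = 0.0
    den = 0.0
    for x, y, value in known:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # target coincides with a sample: return it exactly
        w = 1.0 / d ** power  # nearer samples receive larger weights
        num += w * value
        den += w
    return num / den

# Hypothetical elevation samples: (x, y, elevation).
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0)]
estimate = idw_predict(samples, (1.0, 0.0))
print(estimate)
```

The method works only because of spatial autocorrelation: nearby observations are assumed to be informative about the unobserved location.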
Most spatial statistics depend on a spatial weights matrix \(W\). It encodes which spatial units are considered neighbors.
Formally
There are many ways to define neighbors. For example, these are the “contiguity-based” definitions:
Other methods:
Takeaway: Choosing \(W\) is a scientific judgment, not a technical detail. It should be guided by how interaction actually occurs in the urban system you study.
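As a minimal sketch of how the neighbor definition changes \(W\), the code below builds binary contiguity weights for a regular grid under two common rules: rook (shared edge) and queen (shared edge or corner). The regular grid, the function names, and the row-standardization step are assumptions for illustration; real tract or block-group geometries would need a GIS library to detect shared boundaries.

```python
def grid_contiguity(n_rows, n_cols, rule="rook"):
    """Binary weights matrix W (list of lists) for cells of a regular grid.

    rook:  cells sharing an edge are neighbors.
    queen: cells sharing an edge or a corner are neighbors.
    """
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if rule == "queen":
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    n = n_rows * n_cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c  # row-major index of cell (r, c)
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    W[i][rr * n_cols + cc] = 1.0
    return W

def row_standardize(W):
    """Scale each row to sum to 1 (rows with no neighbors stay all zero)."""
    out = []
    for row in W:
        total = sum(row)
        out.append([w / total if total else 0.0 for w in row])
    return out

W_rook = grid_contiguity(3, 3, "rook")
W_queen = grid_contiguity(3, 3, "queen")
center, corner = 4, 0
print(sum(W_rook[center]), sum(W_queen[center]))  # neighbor counts of the center cell
print(sum(W_rook[corner]), sum(W_queen[corner]))  # neighbor counts of a corner cell
```

The same nine cells yield different neighbor sets under the two rules, so any statistic computed from \(W\) inherits that choice.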
Spatial variogram: how much do two observations differ as a function of the distance \(h\) between them?
Useful for assessing degree of spatial autocorrelation of continuous spatial variables.
\[\gamma(h) = \frac{1}{2N(h)} \sum_{i=1}^{N(h)} (z(x_i) - z(x_i + h))^2\]
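The formula can be sketched as an empirical semivariogram that groups all pairs of observations into distance bins and averages their squared differences. The binning scheme, function name, and data below are assumptions for illustration; geostatistics packages offer more careful binning and model fitting.

```python
import math
from collections import defaultdict

def empirical_semivariogram(coords, values, bin_width):
    """Return {bin lower edge: gamma} from all pairs whose separation falls in each bin."""
    sq_diff = defaultdict(float)
    n_pairs = defaultdict(int)
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            h = math.dist(coords[i], coords[j])
            b = int(h // bin_width)  # index of the distance bin for this pair
            sq_diff[b] += (values[i] - values[j]) ** 2
            n_pairs[b] += 1
    # gamma(h) = sum of squared differences / (2 * number of pairs in the bin)
    return {b * bin_width: sq_diff[b] / (2 * n_pairs[b]) for b in sorted(n_pairs)}

# Hypothetical transect: four evenly spaced sites with a smooth trend.
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
values = [1.0, 2.0, 4.0, 7.0]
gamma = empirical_semivariogram(coords, values, bin_width=1.0)
print(gamma)
```

Here \(\gamma(h)\) grows with \(h\): nearby sites are more alike than distant ones, which is the signature of positive spatial autocorrelation.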

Moran’s I
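\[I = \frac{N}{W} \cdot \frac{\sum_{i=1}^{N}\sum_{j=1}^{N} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}\]
Where:
\(N\) is the number of spatial units.
\(x_i\) is the value of the variable at unit \(i\), and \(\bar{x}\) is its mean.
\(w_{ij}\) is the spatial weight between units \(i\) and \(j\).
\(W = \sum_{i}\sum_{j} w_{ij}\) is the sum of all weights (a scalar in this formula, not the weights matrix itself).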
Moran’s I
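As a minimal sketch, global Moran's I can be computed directly from its definition. The four-unit weights matrix and the two value patterns are hypothetical; the code omits the permutation-based significance test that is typically run alongside the statistic.

```python
def morans_i(x, W):
    """Global Moran's I for values x and an n x n weights matrix W (list of lists)."""
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]                # deviations from the mean
    s0 = sum(sum(row) for row in W)            # sum of all weights (the scalar W)
    cross = sum(W[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi ** 2 for zi in z)

# Four units along a line; each unit neighbors the adjacent ones.
W_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
clustered = [1.0, 1.0, 5.0, 5.0]    # similar values sit next to each other
alternating = [1.0, 5.0, 1.0, 5.0]  # neighbors always differ

print(morans_i(clustered, W_line))    # positive: spatial clustering
print(morans_i(alternating, W_line))  # negative: spatial dispersion
```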
Local Indicators of Spatial Association (LISA)
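One widely used LISA statistic is local Moran's \(I_i\), which scores each unit by how its deviation from the mean aligns with the weighted deviations of its neighbors. The sketch below assumes the form \(I_i = (z_i / m_2)\sum_j w_{ij} z_j\) with \(m_2 = \sum_i z_i^2 / N\); the data are hypothetical, and significance testing (usually by conditional permutation) is omitted.

```python
def local_morans(x, W):
    """Local Moran's I_i = (z_i / m2) * sum_j w_ij * z_j, with m2 = sum(z^2) / n."""
    n = len(x)
    mean = sum(x) / n
    z = [xi - mean for xi in x]
    m2 = sum(zi ** 2 for zi in z) / n
    # Spatial lag: weighted sum of the neighbors' deviations for each unit.
    lag = [sum(W[i][j] * z[j] for j in range(n)) for i in range(n)]
    return [z[i] * lag[i] / m2 for i in range(n)]

# Four units along a line; each unit neighbors the adjacent ones.
W_line = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
values = [1.0, 1.0, 5.0, 5.0]
local_i = local_morans(values, W_line)
total_weight = sum(sum(row) for row in W_line)

print(local_i)
# With this scaling, the local values sum to (sum of weights) * global Moran's I.
print(sum(local_i) / total_weight)
```

The two end units score high because their only neighbor shares their value, while the two middle units straddle the low/high boundary and score zero.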
Another feature of spatial data: statistical relationships can vary across space
There are multiple techniques to address spatial autocorrelation and nonstationarity:
Geographically weighted regression:
Fixed effects models:
We used one last week!
Used to handle unobserved location-specific variation that impacts dependent variables. Only allows interpretation of within-unit effects.
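The "within-unit" interpretation can be illustrated with the within transformation, one standard way to estimate a fixed effects model: subtract each unit's own mean before fitting. This is a single-regressor sketch with hypothetical data and function names, and is not necessarily the specification used last week.

```python
from collections import defaultdict

def pooled_slope(x, y):
    """Ordinary least-squares slope of y on x, ignoring unit membership."""
    xm, ym = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - xm) * (b - ym) for a, b in zip(x, y))
            / sum((a - xm) ** 2 for a in x))

def within_slope(unit, x, y):
    """Slope of y on x after subtracting each unit's own mean from x and y."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for u, xi, yi in zip(unit, x, y):
        totals[u][0] += xi
        totals[u][1] += yi
        totals[u][2] += 1
    xd = [xi - totals[u][0] / totals[u][2] for u, xi in zip(unit, x)]
    yd = [yi - totals[u][1] / totals[u][2] for u, yi in zip(unit, y)]
    return sum(a * b for a, b in zip(xd, yd)) / sum(a * a for a in xd)

# Hypothetical data: y = 2 * x + (unobserved neighborhood effect).
unit = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [2.0, 4.0, 6.0, -8.0, -6.0, -4.0]  # effect is 0 in A and -20 in B

print(pooled_slope(x, y))        # negative: confounded by the neighborhood effect
print(within_slope(unit, x, y))  # 2.0: the within-neighborhood relationship
```

The pooled slope has the wrong sign because differences between neighborhoods dominate; demeaning removes them, at the cost of saying nothing about between-unit variation.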
More on GWR and spatially-aware statistical inference next week!
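In the meantime, the core idea of GWR can be sketched as a separate weighted least-squares fit at each focal location, with observations down-weighted by their distance from it. The Gaussian kernel, the fixed bandwidth, the single predictor, and all names and data below are assumptions for illustration; dedicated GWR software also handles bandwidth selection and inference.

```python
import math

def gwr_local_fit(coords, x, y, focal, bandwidth):
    """Weighted least-squares (intercept, slope) at `focal` with Gaussian kernel weights."""
    w = [math.exp(-0.5 * (math.dist(c, focal) / bandwidth) ** 2) for c in coords]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted means
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical nonstationary process: y = 1 * x in the west, y = 3 * x in the east.
coords = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
x = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0, 3.0, 6.0, 9.0]

_, slope_west = gwr_local_fit(coords, x, y, focal=(0, 0), bandwidth=1.0)
_, slope_east = gwr_local_fit(coords, x, y, focal=(10, 0), bandwidth=1.0)
print(slope_west, slope_east)
```

A single global regression would report one compromise slope; the local fits recover the two regimes, which is what spatial nonstationarity means in practice.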
Most geostatistical analysis happens within a constrained spatial boundary
For proximity- or adjacency-based statistical methods (like GWR):
Spatial clustering techniques account for spatial proximity when defining clusters.
Supports varying cluster density (producing varying size clusters).
Spatial clustering is useful for dimensionality reduction of spatial features and for detecting spatial outliers.
In practical 5-1: note the difference between K-means clusters and geographically contiguous SKATER clusters (SKATER accounts for spatial proximity).
DBSCAN - Density Based Spatial Clustering of Applications with Noise
Most common spatial clustering algorithms:
For \(\mathit{minPts} = 4\), \(\varepsilon\) indicated by circle radius. Red: core points, Yellow: border points, Blue: outlier.
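The core, border, and outlier roles can be made concrete with a compact DBSCAN sketch. It uses brute-force neighbor search and counts a point as part of its own \(\varepsilon\)-neighborhood, which is a common convention but an assumption here; the point coordinates are hypothetical.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def region(i):
        # All points within eps of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = -1             # provisional noise; may become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue               # already assigned: do not expand again
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),          # dense group
          (2.2, 1),                                # border point near (1, 1)
          (10, 10), (10, 11), (11, 10), (11, 11),  # second dense group
          (5, 5)]                                  # isolated outlier
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

The point at (2.2, 1) has too few neighbors to be a core point, but it lies within \(\varepsilon\) of one, so it joins the first cluster as a border point; (5, 5) reaches no core point and is labeled noise.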


Geodemographic analysis groups spatial units (e.g., tracts, neighborhoods, CBGs) based on similar demographic characteristics, often while encouraging spatial coherence [5], [6], [7].
Typical inputs:
Typical outputs:
Key distinction from point clustering: We are clustering attributes of places, not locations of points. Spatial proximity is often encouraged, but not required.
From Maptitude
From ESRI Tapestry
How does it work: combining feature similarity + spatial structure
Two common approaches:
Geodemographic clusters are low-dimensional spatial representations of data. They are models of the data and should be evaluated as such. They are not “ground truth” representations of urban structure.
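One way to combine feature similarity with spatial structure is to allow merges only between adjacent clusters. The sketch below is a simple greedy, contiguity-constrained agglomerative procedure written for illustration; it is not SKATER and not the method behind any commercial product, and the tract features and adjacency are hypothetical.

```python
def constrained_clusters(features, adjacency, n_clusters):
    """Repeatedly merge the two ADJACENT clusters with the closest mean features."""
    clusters = {i: [i] for i in range(len(features))}
    dim = len(features[0])

    def centroid(members):
        return [sum(features[m][d] for m in members) / len(members)
                for d in range(dim)]

    def touching(a, b):
        # Two clusters are adjacent if any pair of their members is adjacent.
        return any(v in adjacency[u] for u in clusters[a] for v in clusters[b])

    while len(clusters) > n_clusters:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                if not touching(a, b):
                    continue
                ca, cb = centroid(clusters[a]), centroid(clusters[b])
                gap = sum((p - q) ** 2 for p, q in zip(ca, cb))
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        if best is None:
            break  # no adjacent pairs remain (disconnected map)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))

    labels = [None] * len(features)
    for label, key in enumerate(sorted(clusters)):
        for m in clusters[key]:
            labels[m] = label
    return labels

# Five hypothetical tracts in a row (0-1-2-3-4) with one standardized feature each.
features = [[0.1], [0.2], [0.9], [1.05], [0.3]]
adjacency = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
labels = constrained_clusters(features, adjacency, n_clusters=2)
print(labels)
```

Tract 4 resembles tracts 0 and 1 in feature space, yet it is grouped with tracts 2 and 3 because only adjacent clusters may merge. That is the modeling trade-off: spatial coherence is gained at the expense of attribute homogeneity, which is one reason the resulting clusters should be evaluated as models.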

CUS 2025, ©SUNLab group