Lab 5-2 - Detecting Hotspots in Urban Areas

Author
Affiliations

Esteban Moro

Network Science Institute, Northeastern University

NETS 7370 Computational Urban Science

Published

February 9, 2026

Objective

In this practical, we will use spatial clustering techniques (dbscan) to identify “hotspots” in urban areas. IN particular, we will:

  • Use the Foursquare POIs dataset in a city
  • Run DBSCAN and HDBSCAN to find hotspots.
  • Choose parameters and evaluate the results.
  • Map clusters and interpret them

Load some libraries and settings we will use

library(tidyverse)
library(arrow)
options(arrow.unsafe_metadata = TRUE)
library(sf) # for spatial data
library(tigris) # for US geospatial data
library(leaflet) # for interactive maps
library(ggthemes)
theme_set(theme_hc() + theme(axis.title.y = element_text(angle = 90))) 

Load the FSQ POIs daset

files_path <- Sys.glob("/data/CUS/FSQ_OS_Places/places/*.parquet")
pois_FSQ <- open_dataset(files_path,format="parquet")
pois_FSQ |> head() |> collect()
# A tibble: 6 × 23
  fsq_place_id         name  latitude longitude address locality region postcode
  <chr>                <chr>    <dbl>     <dbl> <chr>   <chr>    <chr>  <chr>   
1 4aee4d4688a04abe82d… Cana…    41.8       3.03 Anselm… Sant Fe… Gerona 17220   
2 cb57d89eed29405b908… Foto…    52.2      18.6  Mickie… Koło     Wielk… 62-600  
3 59a4553d112c6c2b6c3… CoHo     40.8     -73.9  <NA>    New York NY     11371   
4 4bea3677415e20a110d… Bism…    -6.13    107.   Bisma75 Sunter … Jakar… <NA>    
5 dca3aeba404e4b60066… Grac…    34.2    -119.   18545 … Tarzana  CA     91335   
6 505a6a8be4b066ff885… 沖縄海邦…    26.7     128.   辺土名130… 国頭郡国頭村…… 沖縄県 <NA>    
# ℹ 15 more variables: admin_region <chr>, post_town <chr>, po_box <chr>,
#   country <chr>, date_created <chr>, date_refreshed <chr>, date_closed <chr>,
#   tel <chr>, website <chr>, email <chr>, facebook_id <int>, instagram <chr>,
#   twitter <chr>, fsq_category_ids <list<character>>,
#   fsq_category_labels <list<character>>

Choose a city

city_location <- "Boston"
pois_city <- pois_FSQ |> filter(locality == city_location & region == "MA") |> collect() 
pois_city <- pois_city |> filter(!is.na(longitude) & !is.na(latitude))
nrow(pois_city)
[1] 56747

Convert to sf and project to meters (UTM).

pois_city_sf <- st_as_sf(pois_city, coords = c("longitude", "latitude"), crs = 4326) 

Visual sanity check

pois_city_sf |> sample_n(5000) |> 
  leaflet() |> 
  addProviderTiles("CartoDB.Positron") |> addCircles()

As we can see from the map, there are many POIS outside of the city limits. Let’s get the city limits from the Census and filter to only those POIs inside the city.

city_limits <- places(state = "MA", cb = TRUE, progress=F) |> 
  filter(NAME == city_location) |>
  st_transform(crs = st_crs(pois_city_sf))
pois_city_sf <- pois_city_sf |> st_join(city_limits, join = st_within) |> filter(!is.na(NAME))

Sanity check, plot the pois and the limits of the city

pois_city_sf |> sample_n(5000) |> 
  leaflet() |> 
  addProviderTiles("CartoDB.Positron") |>
  addCircles() |> 
  addPolygons(data = city_limits, color = "red", weight = 2, fill = FALSE)

Run DBSCAN

DBSCAN / HDBSCAN are density-based clustering algorithms that can be used to identify clusters of points in space. They are particularly useful for spatial data because they can find clusters of arbitrary shape and do not require the number of clusters to be specified in advance.

They need distances in meters, so we will project our data to UTM.

pois_city_sf_utm <- st_transform(pois_city_sf, crs = 32619) # UTM zone 19N

Now we can run DBSCAN. We need to choose two parameters:

  • eps (the maximum distance between two points to be considered neighbors) and
  • minPts (the minimum number of points required to form a cluster).
library(dbscan)
set.seed(123)
minPts <- 20
dbscan_result <- dbscan(st_coordinates(pois_city_sf_utm), eps = 100, minPts = minPts)
pois_city_sf_utm$dbscan_cluster <- dbscan_result$cluster

Now we can visualize the clusters.

pois_city_sf_utm |> 
  filter(dbscan_cluster != 0) |> 
  ggplot() + 
  geom_sf(aes(col=factor(dbscan_cluster))) +
  theme_void() +
  labs(color = "DBSCAN Cluster") +
  ggtitle("DBSCAN Clusters of POIs in Boston")

As we can see, DBSCAN has identified several clusters of POIs in the city, but most of the points belong to the first cluster (cluster 1). This is a common issue with DBSCAN, as it tends to create one large cluster and several small ones.

One possibility is to work with the eps parameter. To do that we can check how far are minPts nearest neighbors for each point using different radiuos eps

kdist <- dbscan::kNNdist(st_coordinates(pois_city_sf_utm), k = 20)
plot(
  sort(kdist[kdist>0]),
  log = "xy",
  xlab = "Points (rank, log scale)",
  ylab = "20-NN distance (meters, log scale)",
  main = "k-distance plot (log–log)"
)

Each point shows the distance to its 20th nearest neigbhor, sorted from smallest to largest. We can see different structures of the data:

  • A dense region with small distances (centimiters to meters), probably POIs packed into the same buildings/blocks
  • Urban fabric, middle distances (100-300 meters), probably POIs in the same street or block
  • Sparse region with large distances (> 300-500 meters), Suburban, industrial, isolated POIs. These points becomes noise under DBSCA

Exercise

  • Which eps value would you choose? Try different values and see how the clusters change.

HDBSCAN

As we have seen, DBSCAN requires one eps, but cities have multiple densities. Downtown is dense, but suburban areas are sparse.

HDBSCAN is a hierarchical version of DBSCAN that can find clusters of varying densities. The idea is to build a hierarchy of clusters based on the density of points, and then extract the most stable clusters from this hierarchy. We select only 20k points for speed

set.seed(123)
pois_city_sf_utm_sample <- pois_city_sf_utm |> sample_n(20000)
hdbscan_result <- hdbscan(st_coordinates(pois_city_sf_utm_sample), minPts = 20)

Let’s see the results

table(hdbscan_result$cluster)

   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
9098   20   95   42   59   45   44 1091   39  175   39   66   23   21   46   37 
  16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
  39   20   46   50   27   91   61   26   33   66   74   26   38   41   81   31 
  32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
  25   24   60   76   91   46   25   24   80   55  124   28   93   28   28   24 
  48   49   50   51   52   53   54   55   56   57   58   59   60   61   62   63 
  47  219   28  103   37   27   33  196   54  318   68   22   64   37   68   33 
  64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79 
  26   32   64   47   53   32   40   34   23   35   43   27   61   40   89  230 
  80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95 
  31   32   97  120   39   20   54   42   71   25   29  191  183  137   35   58 
  96   97   98   99  100  101  102  103  104  105  106  107  108  109  110  111 
  29  261   58  223  113   49   27   28   20   33   49   20   83   20   25   39 
 112  113  114  115  116  117  118  119  120  121  122  123  124  125  126  127 
  24   54   30   53   83   33   84   25   30   79   46  138   29   34   54   29 
 128  129  130  131  132  133  134  135  136  137  138  139  140  141  142  143 
  26   35   22   56   51   63   52   31   24  100   31   38   36   54   63   45 
 144  145  146  147  148  149  150  151  152  153  154  155  156  157  158  159 
  40   25   50   50   28   50   21   29   46   20   78   54   21   44   23   25 
 160  161  162  163  164  165  166  167  168  169  170  171  172  173  174 
  34  202   68   22   27   81   84   34   34   27   35   45   45   75   39 

As we can see HDBSCAN has identified several clusters of varying sizes, and it has also labeled 9098` points as noise (cluster 0).

HDBSCAN also assings each pont and outlier score that indicates how much of an outlier is a point: low score (0) means that it is a core point, deeply embedded in a cluster, while high-score (1) is a borderline or isolated point. We can use this score to filter out points that are not part of any cluster.

Let’s map the cluster

pois_city_sf_utm_sample <- pois_city_sf_utm_sample  |> 
  mutate(hdbscan_cluster = hdbscan_result$cluster,
         hdbscan_outlier_score = hdbscan_result$outlier_scores)

Plot the data (without legend for better visualization):

library(colorspace)
pois_plot <- pois_city_sf_utm_sample |> 
  filter(hdbscan_cluster != 0 & hdbscan_outlier_score < 0.8) |>
  mutate(hdbscan_cluster = factor(hdbscan_cluster)) |>
  st_transform(4326)

n_clusters <- nlevels(pois_plot$hdbscan_cluster)

pal <- colorFactor(
  palette = qualitative_hcl(n_clusters, palette = "Dark 3") |> sample(),
  domain  = pois_plot$hdbscan_cluster
)

leaflet(pois_plot) |>
  addProviderTiles("CartoDB.Positron") |>
  addCircleMarkers(
    radius = 5,
    stroke = FALSE,
    fillOpacity = 0.7,
    color = ~pal(hdbscan_cluster)
  )

As we can see, HDBSCAN has identified several clusters of varying sizes. The clusters are more balanced than DBSCAN, and they capture different densities in the city.

Exercise.

  • Try different values of minPts and see how the clusters change.
  • Try filtering points with different hdbscan_outlier_score thresholds and see how the clusters change.
  • Identify the most important clusters in the city and try to interpret them. What kind of places do they represent? Are they downtown areas, commercial zones, residential neighborhoods, etc.?