Introduction to sparsecommunity

Overview

sparsecommunity implements spectral clustering for community detection in sparse networks. It provides:

community_detect(): main estimation function, supporting both the stochastic block model ("sbm") and the degree-corrected stochastic block model ("dcsbm").
simulate_sbm() / simulate_dcsbm(): simulation utilities for both models.
misclustering_rate(): best-permutation evaluation metric.

The methods implement the regularized normalized Laplacian spectral clustering of Lei and Rinaldo (2015), which is consistent for community recovery when the maximum expected degree grows as slowly as \(\log(n)\) — the sparse regime where many standard spectral methods break down.

library(sparsecommunity)

C:5xW2PpK8D-intro.R

1. Stochastic block model (SBM)

1.1 Simulating SBM data

A stochastic block model with \(K\) communities is specified by a block probability matrix \(B \in [0,1]^{K \times K}\). Nodes in community \(k\) and community \(\ell\) are connected independently with probability \(B_{k\ell}\).

# Two balanced communities, n = 300 nodes
# Within-community edge probability: 0.25; between: 0.04
B_sbm <- matrix(c(0.25, 0.04,
                   0.04, 0.25), nrow = 2)

sim_sbm <- simulate_sbm(n = 300, K = 2, B = B_sbm, seed = 1)
print(sim_sbm)
#> SBM simulation: n = 300 , K = 2 
#> Community sizes: 140 160 
#> Mean degree    : 43.69

C:5xW2PpK8D-intro.R

The returned object contains the sparse adjacency matrix A and the true community labels:

# Mean degree (note: sparse regime ~ log(n)/n * n = log(n))
mean(Matrix::rowSums(sim_sbm$A))
#> [1] 43.69333

C:5xW2PpK8D-intro.R

1.2 Fitting the SBM model

For the standard SBM, community_detect() embeds the network via the regularized normalized Laplacian and applies \(k\)-means:

fit_sbm <- community_detect(sim_sbm$A, K = 2, model = "sbm", seed = 1)
print(fit_sbm)
#> Community detection fit (sparsecommunity)
#>   Model       : sbm 
#>   Nodes (n)   : 300 
#>   Communities : 2 
#>   Regularized : TRUE 
#>   Objective   : 0.0399 
#>   Sizes       : 160 140

C:5xW2PpK8D-intro.R

The fitted object contains the community labels, the spectral embedding, the top eigenvalues, and the clustering objective:

# Top-K eigenvalues of the regularized Laplacian
fit_sbm$eigenvalues
#> [1] 0.5024848 0.3675187

# Community sizes
table(fit_sbm$labels)
#> 
#>   1   2 
#> 160 140

C:5xW2PpK8D-intro.R

1.3 Evaluating accuracy

misclustering_rate() computes the fraction of misclassified nodes under the best alignment of estimated and true labels (Hungarian algorithm):

misclustering_rate(sim_sbm$labels, fit_sbm$labels)
#> [1] 0

C:5xW2PpK8D-intro.R

1.4 Examining the spectral embedding

The two-dimensional embedding reveals the cluster structure directly:

U <- fit_sbm$embedding
plot(U[, 1], U[, 2],
     col  = sim_sbm$labels + 1,
     pch  = 19, cex = 0.6,
     xlab = "Eigenvector 1",
     ylab = "Eigenvector 2",
     main = "SBM: spectral embedding")
legend("topright", legend = c("Community 1", "Community 2"),
       col = 2:3, pch = 19, bty = "n")

Spectral embedding for SBM. Points are colored by true community.

C:5xW2PpK8D-intro.R

2. Degree-corrected stochastic block model (DCSBM)

In many real networks, nodes within the same community have very different degrees. The DCSBM captures this by introducing per-node degree parameters \(\theta_i > 0\): the edge probability between nodes \(i\) and \(j\) is \(\theta_i \theta_j B_{z_i z_j}\).

Under degree heterogeneity, standard spectral methods fail because high-degree nodes concentrate together in the embedding regardless of their community. The spherical \(k\)-median algorithm (Lei and Rinaldo 2015, Theorem 4.2) corrects for this by row-normalizing the embedding before clustering.

2.1 Simulating DCSBM data

# Three communities with strong degree heterogeneity
B_dcsbm <- matrix(c(0.5, 0.04, 0.04,
                     0.04, 0.5, 0.04,
                     0.04, 0.04, 0.5), nrow = 3)

# Degree parameters: Uniform(0.3, 1.7), creating substantial heterogeneity
set.seed(2)
theta <- runif(400, min = 0.3, max = 1.7)

sim_dcsbm <- simulate_dcsbm(n = 400, K = 3, B = B_dcsbm,
                              theta = theta, seed = 2)
print(sim_dcsbm)
#> DCSBM simulation: n = 400 , K = 3 
#> Community sizes: 136 153 111 
#> Mean degree    : 77.13 
#> Theta range    : 0.31 to 1.697

C:5xW2PpK8D-intro.R

2.2 Why standard SBM clustering fails under degree heterogeneity

Applying the SBM method to DCSBM data shows the problem:

fit_wrong <- community_detect(sim_dcsbm$A, K = 3, model = "sbm", seed = 2)
cat("Misclustering rate (SBM method on DCSBM data):",
    misclustering_rate(sim_dcsbm$labels, fit_wrong$labels), "\n")
#> Misclustering rate (SBM method on DCSBM data): 0

C:5xW2PpK8D-intro.R

2.3 Fitting the DCSBM model

The "dcsbm" model row-normalizes the embedding and applies spherical \(k\)-median clustering:

fit_dcsbm <- community_detect(sim_dcsbm$A, K = 3, model = "dcsbm", seed = 2)
print(fit_dcsbm)
#> Community detection fit (sparsecommunity)
#>   Model       : dcsbm 
#>   Nodes (n)   : 400 
#>   Communities : 3 
#>   Regularized : TRUE 
#>   Objective   : 22.052 
#>   Sizes       : 136 111 153

cat("Misclustering rate (DCSBM method):",
    misclustering_rate(sim_dcsbm$labels, fit_dcsbm$labels), "\n")
#> Misclustering rate (DCSBM method): 0

C:5xW2PpK8D-intro.R

2.4 Embedding comparison

The row-normalized embedding maps all nodes to the unit sphere. Degree heterogeneity is absorbed by the normalization, and clusters separate by angle:

U_dc <- fit_dcsbm$embedding
plot(U_dc[, 1], U_dc[, 2],
     col  = sim_dcsbm$labels + 1,
     pch  = 19, cex = 0.5,
     xlab = "Eigenvector 1 (normalized)",
     ylab = "Eigenvector 2 (normalized)",
     main = "DCSBM: row-normalized spectral embedding")
legend("topright",
       legend = paste("Community", 1:3),
       col = 2:4, pch = 19, bty = "n")

Row-normalized spectral embedding for DCSBM. Colors indicate true communities.

C:5xW2PpK8D-intro.R

3. Real-data example: Zachary’s karate club

Zachary’s (1977) karate club network is a benchmark for community detection. The network has 34 members (nodes) and 78 friendships (edges). A conflict led to a split into two factions, providing known ground-truth community membership.

if (!requireNamespace("igraphdata", quietly = TRUE)) {
  message("igraphdata not installed; skipping real-data example.")
  knitr::knit_exit()
}

library(igraph)
#> Warning: package 'igraph' was built under R version 4.3.1
data("karate", package = "igraphdata")

# Extract adjacency matrix and true community labels
A_karate  <- igraph::as_adjacency_matrix(karate, sparse = TRUE)
true_comm <- igraph::V(karate)$Faction
cat("Nodes:", vcount(karate), "| Edges:", ecount(karate),
    "| Communities:", length(unique(true_comm)), "\n")
#> Nodes: 34 | Edges: 78 | Communities: 2
cat("Community sizes:", table(true_comm), "\n")
#> Community sizes: 16 18
cat("Mean degree:", round(mean(degree(karate)), 2), "\n")
#> Mean degree: 4.59

C:5xW2PpK8D-intro.R

fit_karate <- community_detect(A_karate, K = 2, model = "sbm",
                                n_init = 30, seed = 42)
summary(fit_karate)
#> Community detection summary
#>   Call        : community_detect(A = A_karate, K = 2, model = "sbm", n_init = 30,      seed = 42) 
#>   Model       : sbm 
#>   Nodes (n)   : 34 
#>   Communities : 2 
#>   Regularized : TRUE 
#>   Restarts    : 30 
#> 
#> Community sizes:
#>  Community Size Proportion
#>          1   16      0.471
#>          2   18      0.529
#> 
#> Top eigenvalues:
#> [1] 0.5463 0.4289
#> 
#> Objective (within-cluster sum of distances): 0.3856

cat("Misclustering rate:", misclustering_rate(true_comm, fit_karate$labels), "\n")
#> Misclustering rate: 0

C:5xW2PpK8D-intro.R

# Plot network colored by detected community
shape_map <- ifelse(true_comm == 1, "circle", "square")
igraph::plot.igraph(
  karate,
  vertex.color = fit_karate$labels + 1,
  vertex.shape = shape_map,
  vertex.size  = 8,
  vertex.label = NA,
  main         = "Karate club: detected vs. true communities"
)
legend("bottomleft",
       legend = c("Detected: 1", "Detected: 2"),
       fill   = 2:3, bty = "n", cex = 0.9)
legend("bottomright",
       legend = c("True: faction 1", "True: faction 2"),
       pch    = c(19, 15), bty = "n", cex = 0.9)

Karate club network. Node color = detected community; node shape = true faction.

C:5xW2PpK8D-intro.R

4. Real-data example: NCAA college football (Girvan & Newman 2002)

The NCAA football network is a standard benchmark for community detection with \(K > 2\) communities. The network has 115 Division I-A college football teams (nodes) connected by regular-season games during Fall 2000 (613 edges). Teams are divided into 12 athletic conferences, providing unambiguous ground-truth community labels (Girvan & Newman 2002).

The network is bundled directly in sparsecommunity:

data("football")
cat("Nodes:", nrow(football_A), "| Edges:", sum(football_A) / 2, "\n")
#> Nodes: 115 | Edges: 613
cat("Mean degree:", round(mean(Matrix::rowSums(football_A)), 2),
    "  log(n):", round(log(nrow(football_A)), 2), "\n")
#> Mean degree: 10.66   log(n): 4.74
table(football_labels)   # 12 conferences
#> football_labels
#>  1  2  3  4  5  6  7  8  9 10 11 12 
#>  9  8 11 12 10  5 13  8 10 12  7 10

C:5xW2PpK8D-intro.R

The mean degree (10.66) is well above \(\log(115) = 4.74\), placing the network in the sparse-but-identifiable regime. We first use estimate_K() to check whether the number of conferences can be recovered from the data alone:

estimate_K(football_A, K_max = 15)   # true K = 12
#> [1] 10

C:5xW2PpK8D-intro.R

The estimator returns 10, slightly underestimating due to two very small conferences (sizes 5 and 7) that are hard to distinguish spectrally. We proceed with the known \(K = 12\):

fit_football <- community_detect(football_A, K = 12, model = "sbm",
                                  n_init = 30, seed = 1)
misclustering_rate(football_labels, fit_football$labels)
#> [1] 0.08695652

C:5xW2PpK8D-intro.R

sparsecommunity misclassifies 10 of 115 teams (8.7%). These are primarily independent teams and teams from small conferences that schedule games across conference lines. The spectral embedding confirms the 12-cluster structure:

plot(fit_football)

Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated.

C:5xW2PpK8D-intro.R

5. Summary of the `community_detect()` interface

Argument	Default	Description
`A`	—	Adjacency matrix (sparse or dense)
`K`	—	Number of communities
`model`	`"dcsbm"`	`"sbm"` for k-means; `"dcsbm"` for spherical k-median
`n_init`	`20`	Random restarts for clustering
`max_iter`	`100`	Max iterations per restart
`reg`	`TRUE`	Degree regularization in Laplacian embedding
`seed`	`NULL`	Random seed

The returned "sparsecommunity" object contains:

Component	Description
`labels`	Integer vector of community assignments (length \(n\))
`embedding`	\(n \times K\) spectral embedding matrix
`eigenvalues`	Top-\(K\) eigenvalues of the regularized Laplacian
`centers`	\(K \times K\) cluster centers in embedding space
`objective`	Within-cluster sum of distances at convergence

Introduction to sparsecommunity

Overview

1. Stochastic block model (SBM)

1.1 Simulating SBM data

1.2 Fitting the SBM model

1.3 Evaluating accuracy

1.4 Examining the spectral embedding

2. Degree-corrected stochastic block model (DCSBM)

2.1 Simulating DCSBM data

2.2 Why standard SBM clustering fails under degree heterogeneity

2.3 Fitting the DCSBM model

2.4 Embedding comparison

3. Real-data example: Zachary’s karate club

4. Real-data example: NCAA college football (Girvan & Newman 2002)

5. Summary of the community_detect() interface

References

5. Summary of the `community_detect()` interface