Introduction to sparsecommunity

Overview

sparsecommunity implements spectral clustering for community detection in sparse networks. It provides:

The methods implement the regularized normalized Laplacian spectral clustering of Lei and Rinaldo (2015), which is consistent for community recovery when the maximum expected degree grows as slowly as \(\log(n)\) — the sparse regime where many standard spectral methods break down.

library(sparsecommunity)

C:5xW2PpK8D-intro.R


1. Stochastic block model (SBM)

1.1 Simulating SBM data

A stochastic block model with \(K\) communities is specified by a block probability matrix \(B \in [0,1]^{K \times K}\). Nodes in community \(k\) and community \(\ell\) are connected independently with probability \(B_{k\ell}\).

# Two balanced communities, n = 300 nodes
# Within-community edge probability: 0.25; between: 0.04
B_sbm <- matrix(c(0.25, 0.04,
                   0.04, 0.25), nrow = 2)

sim_sbm <- simulate_sbm(n = 300, K = 2, B = B_sbm, seed = 1)
print(sim_sbm)
#> SBM simulation: n = 300 , K = 2 
#> Community sizes: 140 160 
#> Mean degree    : 43.69

C:5xW2PpK8D-intro.R

The returned object contains the sparse adjacency matrix A and the true community labels:

# Mean degree (note: sparse regime ~ log(n)/n * n = log(n))
mean(Matrix::rowSums(sim_sbm$A))
#> [1] 43.69333

C:5xW2PpK8D-intro.R

1.2 Fitting the SBM model

For the standard SBM, community_detect() embeds the network via the regularized normalized Laplacian and applies \(k\)-means:

fit_sbm <- community_detect(sim_sbm$A, K = 2, model = "sbm", seed = 1)
print(fit_sbm)
#> Community detection fit (sparsecommunity)
#>   Model       : sbm 
#>   Nodes (n)   : 300 
#>   Communities : 2 
#>   Regularized : TRUE 
#>   Objective   : 0.0399 
#>   Sizes       : 160 140

C:5xW2PpK8D-intro.R

The fitted object contains the community labels, the spectral embedding, the top eigenvalues, and the clustering objective:

# Top-K eigenvalues of the regularized Laplacian
fit_sbm$eigenvalues
#> [1] 0.5024848 0.3675187

# Community sizes
table(fit_sbm$labels)
#> 
#>   1   2 
#> 160 140

C:5xW2PpK8D-intro.R

1.3 Evaluating accuracy

misclustering_rate() computes the fraction of misclassified nodes under the best alignment of estimated and true labels (Hungarian algorithm):

misclustering_rate(sim_sbm$labels, fit_sbm$labels)
#> [1] 0

C:5xW2PpK8D-intro.R

1.4 Examining the spectral embedding

The two-dimensional embedding reveals the cluster structure directly:

U <- fit_sbm$embedding
plot(U[, 1], U[, 2],
     col  = sim_sbm$labels + 1,
     pch  = 19, cex = 0.6,
     xlab = "Eigenvector 1",
     ylab = "Eigenvector 2",
     main = "SBM: spectral embedding")
legend("topright", legend = c("Community 1", "Community 2"),
       col = 2:3, pch = 19, bty = "n")
Spectral embedding for SBM. Points are colored by true community.

Spectral embedding for SBM. Points are colored by true community.

C:5xW2PpK8D-intro.R


2. Degree-corrected stochastic block model (DCSBM)

In many real networks, nodes within the same community have very different degrees. The DCSBM captures this by introducing per-node degree parameters \(\theta_i > 0\): the edge probability between nodes \(i\) and \(j\) is \(\theta_i \theta_j B_{z_i z_j}\).

Under degree heterogeneity, standard spectral methods fail because high-degree nodes concentrate together in the embedding regardless of their community. The spherical \(k\)-median algorithm (Lei and Rinaldo 2015, Theorem 4.2) corrects for this by row-normalizing the embedding before clustering.

2.1 Simulating DCSBM data

# Three communities with strong degree heterogeneity
B_dcsbm <- matrix(c(0.5, 0.04, 0.04,
                     0.04, 0.5, 0.04,
                     0.04, 0.04, 0.5), nrow = 3)

# Degree parameters: Uniform(0.3, 1.7), creating substantial heterogeneity
set.seed(2)
theta <- runif(400, min = 0.3, max = 1.7)

sim_dcsbm <- simulate_dcsbm(n = 400, K = 3, B = B_dcsbm,
                              theta = theta, seed = 2)
print(sim_dcsbm)
#> DCSBM simulation: n = 400 , K = 3 
#> Community sizes: 136 153 111 
#> Mean degree    : 77.13 
#> Theta range    : 0.31 to 1.697

C:5xW2PpK8D-intro.R

2.2 Why standard SBM clustering fails under degree heterogeneity

Applying the SBM method to DCSBM data shows the problem:

fit_wrong <- community_detect(sim_dcsbm$A, K = 3, model = "sbm", seed = 2)
cat("Misclustering rate (SBM method on DCSBM data):",
    misclustering_rate(sim_dcsbm$labels, fit_wrong$labels), "\n")
#> Misclustering rate (SBM method on DCSBM data): 0

C:5xW2PpK8D-intro.R

2.3 Fitting the DCSBM model

The "dcsbm" model row-normalizes the embedding and applies spherical \(k\)-median clustering:

fit_dcsbm <- community_detect(sim_dcsbm$A, K = 3, model = "dcsbm", seed = 2)
print(fit_dcsbm)
#> Community detection fit (sparsecommunity)
#>   Model       : dcsbm 
#>   Nodes (n)   : 400 
#>   Communities : 3 
#>   Regularized : TRUE 
#>   Objective   : 22.052 
#>   Sizes       : 136 111 153

cat("Misclustering rate (DCSBM method):",
    misclustering_rate(sim_dcsbm$labels, fit_dcsbm$labels), "\n")
#> Misclustering rate (DCSBM method): 0

C:5xW2PpK8D-intro.R

2.4 Embedding comparison

The row-normalized embedding maps all nodes to the unit sphere. Degree heterogeneity is absorbed by the normalization, and clusters separate by angle:

U_dc <- fit_dcsbm$embedding
plot(U_dc[, 1], U_dc[, 2],
     col  = sim_dcsbm$labels + 1,
     pch  = 19, cex = 0.5,
     xlab = "Eigenvector 1 (normalized)",
     ylab = "Eigenvector 2 (normalized)",
     main = "DCSBM: row-normalized spectral embedding")
legend("topright",
       legend = paste("Community", 1:3),
       col = 2:4, pch = 19, bty = "n")
Row-normalized spectral embedding for DCSBM. Colors indicate true communities.

Row-normalized spectral embedding for DCSBM. Colors indicate true communities.

C:5xW2PpK8D-intro.R


3. Real-data example: Zachary’s karate club

Zachary’s (1977) karate club network is a benchmark for community detection. The network has 34 members (nodes) and 78 friendships (edges). A conflict led to a split into two factions, providing known ground-truth community membership.

if (!requireNamespace("igraphdata", quietly = TRUE)) {
  message("igraphdata not installed; skipping real-data example.")
  knitr::knit_exit()
}

library(igraph)
#> Warning: package 'igraph' was built under R version 4.3.1
data("karate", package = "igraphdata")

# Extract adjacency matrix and true community labels
A_karate  <- igraph::as_adjacency_matrix(karate, sparse = TRUE)
true_comm <- igraph::V(karate)$Faction
cat("Nodes:", vcount(karate), "| Edges:", ecount(karate),
    "| Communities:", length(unique(true_comm)), "\n")
#> Nodes: 34 | Edges: 78 | Communities: 2
cat("Community sizes:", table(true_comm), "\n")
#> Community sizes: 16 18
cat("Mean degree:", round(mean(degree(karate)), 2), "\n")
#> Mean degree: 4.59

C:5xW2PpK8D-intro.R

fit_karate <- community_detect(A_karate, K = 2, model = "sbm",
                                n_init = 30, seed = 42)
summary(fit_karate)
#> Community detection summary
#>   Call        : community_detect(A = A_karate, K = 2, model = "sbm", n_init = 30,      seed = 42) 
#>   Model       : sbm 
#>   Nodes (n)   : 34 
#>   Communities : 2 
#>   Regularized : TRUE 
#>   Restarts    : 30 
#> 
#> Community sizes:
#>  Community Size Proportion
#>          1   16      0.471
#>          2   18      0.529
#> 
#> Top eigenvalues:
#> [1] 0.5463 0.4289
#> 
#> Objective (within-cluster sum of distances): 0.3856

cat("Misclustering rate:", misclustering_rate(true_comm, fit_karate$labels), "\n")
#> Misclustering rate: 0

C:5xW2PpK8D-intro.R

# Plot network colored by detected community
shape_map <- ifelse(true_comm == 1, "circle", "square")
igraph::plot.igraph(
  karate,
  vertex.color = fit_karate$labels + 1,
  vertex.shape = shape_map,
  vertex.size  = 8,
  vertex.label = NA,
  main         = "Karate club: detected vs. true communities"
)
legend("bottomleft",
       legend = c("Detected: 1", "Detected: 2"),
       fill   = 2:3, bty = "n", cex = 0.9)
legend("bottomright",
       legend = c("True: faction 1", "True: faction 2"),
       pch    = c(19, 15), bty = "n", cex = 0.9)
Karate club network. Node color = detected community; node shape = true faction.

Karate club network. Node color = detected community; node shape = true faction.

C:5xW2PpK8D-intro.R


4. Real-data example: NCAA college football (Girvan & Newman 2002)

The NCAA football network is a standard benchmark for community detection with \(K > 2\) communities. The network has 115 Division I-A college football teams (nodes) connected by regular-season games during Fall 2000 (613 edges). Teams are divided into 12 athletic conferences, providing unambiguous ground-truth community labels (Girvan & Newman 2002).

The network is bundled directly in sparsecommunity:

data("football")
cat("Nodes:", nrow(football_A), "| Edges:", sum(football_A) / 2, "\n")
#> Nodes: 115 | Edges: 613
cat("Mean degree:", round(mean(Matrix::rowSums(football_A)), 2),
    "  log(n):", round(log(nrow(football_A)), 2), "\n")
#> Mean degree: 10.66   log(n): 4.74
table(football_labels)   # 12 conferences
#> football_labels
#>  1  2  3  4  5  6  7  8  9 10 11 12 
#>  9  8 11 12 10  5 13  8 10 12  7 10

C:5xW2PpK8D-intro.R

The mean degree (10.66) is well above \(\log(115) = 4.74\), placing the network in the sparse-but-identifiable regime. We first use estimate_K() to check whether the number of conferences can be recovered from the data alone:

estimate_K(football_A, K_max = 15)   # true K = 12
#> [1] 10

C:5xW2PpK8D-intro.R

The estimator returns 10, slightly underestimating due to two very small conferences (sizes 5 and 7) that are hard to distinguish spectrally. We proceed with the known \(K = 12\):

fit_football <- community_detect(football_A, K = 12, model = "sbm",
                                  n_init = 30, seed = 1)
misclustering_rate(football_labels, fit_football$labels)
#> [1] 0.08695652

C:5xW2PpK8D-intro.R

sparsecommunity misclassifies 10 of 115 teams (8.7%). These are primarily independent teams and teams from small conferences that schedule games across conference lines. The spectral embedding confirms the 12-cluster structure:

plot(fit_football)
Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated.

Spectral embedding of the football network. Colors indicate detected community; the 12 athletic conferences are largely separated.

C:5xW2PpK8D-intro.R


5. Summary of the community_detect() interface

Argument Default Description
A Adjacency matrix (sparse or dense)
K Number of communities
model "dcsbm" "sbm" for k-means; "dcsbm" for spherical k-median
n_init 20 Random restarts for clustering
max_iter 100 Max iterations per restart
reg TRUE Degree regularization in Laplacian embedding
seed NULL Random seed

The returned "sparsecommunity" object contains:

Component Description
labels Integer vector of community assignments (length \(n\))
embedding \(n \times K\) spectral embedding matrix
eigenvalues Top-\(K\) eigenvalues of the regularized Laplacian
centers \(K \times K\) cluster centers in embedding space
objective Within-cluster sum of distances at convergence

References

Lei, J. and Rinaldo, A. (2015). Consistency of spectral clustering in stochastic block models. Annals of Statistics, 43(1), 215–237. https://doi.org/10.1214/14-AOS1274

Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826. https://doi.org/10.1073/pnas.122653799

Zachary, W. W. (1977). An information flow model for conflict and fission in small groups. Journal of Anthropological Research, 33(4), 452–473.