Ebrahim-Farrington Goodness-of-Fit Test

Overview

The ebrahim.gof package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.

New in version 2.0.0: the package is now a full goodness-of-fit toolkit. Alongside the omnibus Ebrahim-Farrington (EF) test it adds the Directed EF (DEF) test (def.gof()), a Cauchy-combination ensemble (def.ensemble.gof()), and run.all.gof() — a one-call battery of ~19 goodness-of-fit tests (plus opt-in slow tests), each verified against the implementation used in the thesis simulation.

Key Features

Installation

From GitHub (Development Version)

Copy and paste the following into R or RStudio.

# Install devtools if you haven't already
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install ebrahim.gof from GitHub
devtools::install_github("ebrahimkhaled/ebrahim.gof")

From CRAN (Stable Version)

The released version is on CRAN (currently 1.0.0):

install.packages("ebrahim.gof")

Version 2.0.0 — this version, with the directed test, the ensemble, and the one-call battery — is available from GitHub now and is being submitted to CRAN.

Quick Start

library(ebrahim.gof)

# Example with binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- 1 / (1 + exp(-linpred))
y <- rbinom(n, 1, prob)

# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)

# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)

Main Functions

ef.gof()

The main function that performs the goodness-of-fit test:

ef.gof(y, predicted_probs, G = 10, model = NULL, m = NULL,
       method = c("chisq", "normal"))

Parameters:
- y: a fitted binary-logistic glm (predicted_probs is then taken from it), or a binary response vector (0/1), or success counts for grouped data
- predicted_probs: vector of predicted probabilities from the logistic model
- G: number of groups for binary data (default: 10)
- model: optional glm object (required for the original Farrington test only)
- m: optional vector of trial counts (for grouped data; original Farrington only)
- method: reference distribution for the grouped statistic. "chisq" (default, new in 2.0.0) refers T_EF to a chi-square with G-2 df; "normal" uses the standardized Z_EF (the behaviour of versions <= 1.0.0)

Note (breaking change in 2.0.0): ef.gof() now defaults to method = "chisq". Use method = "normal" to reproduce p-values from version 1.0.0.

Returns: A data frame with test name, test statistic, and p-value.

def.gof() — Directed Ebrahim-Farrington test

Concentrates power on calibration-curve shape directions by projecting the grouped residuals onto a small smooth basis.

def.gof(object, predicted_probs = NULL, X = NULL, G = 10,
        basis = c("poly3", "poly2", "stukel", "ensemble"),
        method = c("satterthwaite", "imhof"))

# Usage (y, x1, x2 as simulated in Example 1 below)
fit <- glm(y ~ x1 + x2, family = binomial())
def.gof(fit)                      # default poly3 basis
def.gof(fit, basis = "ensemble")  # combined Cauchy decision

def.ensemble.gof() — combine the DEF bases

Combines the three DEF basis tests (optionally the omnibus EF, or extra p-values) into one decision via the Cauchy combination test (CCT).

def.ensemble.gof(fit)                # CCT of poly2 + poly3 + stukel
def.ensemble.gof(fit, add_ef = TRUE) # add the omnibus EF
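The Cauchy combination step itself can be illustrated in base R. This is a minimal sketch, not the package's code: cauchy_combine() is a hypothetical helper, equal weights are an assumption, and def.ensemble.gof() may weight or guard extreme p-values differently. The p-values are made up for illustration.

```r
# Hypothetical helper: equal-weight Cauchy combination of p-values.
cauchy_combine <- function(p, w = rep(1 / length(p), length(p))) {
  stopifnot(all(p > 0 & p < 1), abs(sum(w) - 1) < 1e-12)
  # Map each p-value to a standard Cauchy variate and take the weighted sum
  t_stat <- sum(w * tan((0.5 - p) * pi))
  # Upper-tail probability of a standard Cauchy distribution at t_stat
  0.5 - atan(t_stat) / pi
}

# Illustrative p-values only (not package output)
p_bases <- c(poly2 = 0.04, poly3 = 0.01, stukel = 0.30)
p_combined <- cauchy_combine(p_bases)
print(p_combined)
```

The combined p-value is dominated by the smallest inputs, which is why one strongly rejecting basis can carry the ensemble decision.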

run.all.gof() — the whole battery in one call

Runs a large battery of goodness-of-fit tests and returns one tidy data frame (one row per test). A failing test never aborts the run.

fit <- glm(low ~ age + lwt + factor(race), data = MASS::birthwt, family = binomial())
run.all.gof(fit)                       # the default (fast) battery + ensemble rows
run.all.gof(fit, include_slow = TRUE)  # also the opt-in slow tests
run.all.gof(fit, tests = c("EF", "DEF.poly3", "HL"))   # a chosen subset
run.all.gof(MASS::birthwt$low, fitted(fit))   # prediction-only tests (no model)
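The "never aborts" behaviour can be pictured with a small base-R pattern. This is only a sketch of the idea, not the actual code of run.all.gof(): run_battery() and the two toy tests are hypothetical names introduced here.

```r
# Hypothetical runner: each test is a function returning list(statistic, p_value).
run_battery <- function(tests, ...) {
  args <- list(...)
  rows <- lapply(names(tests), function(nm) {
    tryCatch({
      res <- do.call(tests[[nm]], args)
      data.frame(Test = nm, Statistic = res$statistic,
                 p_value = res$p_value, Error = NA_character_)
    }, error = function(e) {
      # Record the failure as a row of NAs instead of stopping the run
      data.frame(Test = nm, Statistic = NA_real_,
                 p_value = NA_real_, Error = conditionMessage(e))
    })
  })
  do.call(rbind, rows)
}

# Two toy tests: a simple calibration z-test and one that always fails
toy_tests <- list(
  Calibration = function(y, p) {
    z <- (sum(y) - sum(p)) / sqrt(sum(p * (1 - p)))
    list(statistic = z, p_value = 2 * pnorm(abs(z), lower.tail = FALSE))
  },
  Broken = function(y, p) stop("needs the fitted model")
)

set.seed(1)
p <- runif(50, 0.2, 0.8)
y <- rbinom(50, 1, p)
out <- run_battery(toy_tests, y, p)
print(out)
```

Wrapping each test separately keeps one row per test, so a failure shows up as NA values with its error message rather than as a missing result.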

Default battery (19 rows): Pearson, Deviance, Osius-Rojek, Copas-RSS, Information-Matrix, Hosmer-Lemeshow (deciles and equal-width), Pigeon-Heyse, EF, EF-normal, the three DEF bases, Stukel, Tsiatis, Xie, Pulkstenis-Robinson, and the two Cauchy-combination ensemble rows. With include_slow = TRUE it also runs le Cessie-van Houwelingen, the GAM-based tests (HL-GAM, PR-GAM, Xie-GAM; need mgcv), Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power HL test. Every test reproduces the implementation used in the original thesis simulation.

Power and size: the partition-based family

Most goodness-of-fit tests for logistic regression are partition-based: they split the data into groups — by the fitted probability, by covariate-space clusters, or by categorical patterns — and compare observed with expected event counts in each group. This is the family that ef.gof(), def.gof(), and def.ensemble.gof() belong to. In a Monte Carlo study (n = 500, 1000 replications, α = 0.05) the partition tests compare as follows:

| Test | Grouping | Size (null) | Power: quadratic | Power: wrong link |
|---|---|---|---|---|
| Hosmer–Lemeshow (decile) | fitted prob | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | fitted prob | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | fitted prob | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | fitted prob | 0.058 | 0.480 | 0.218 |
| Tsiatis | covariate clusters | 0.056 | 0.574 | 0.162 |
| Xie | covariate clusters | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | fitted prob + shape basis | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | fitted prob + 3 bases | 0.066 | 0.767 | 0.468 |

Across the family, DEF and its vote ensemble are the most powerful, with empirical sizes (0.060 and 0.066) that stay close to the nominal 0.05, so the gain in power does not come from a liberal test. On the wrong-link misfit they roughly double the power of Hosmer–Lemeshow, Tsiatis, and Xie (0.404 and 0.468 against 0.147 to 0.244).
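One way to judge how close an empirical size is to the nominal level is to compare it with the Monte Carlo error of the study. The sketch below uses only base R and the sizes from the table above; the 95% band is the usual normal approximation for a binomial proportion and is added here as an illustration, not as part of the original study.

```r
alpha  <- 0.05   # nominal level
n_reps <- 1000   # replications in the Monte Carlo study

# Standard error of an estimated rejection rate when the true size is alpha
mc_se <- sqrt(alpha * (1 - alpha) / n_reps)
band  <- alpha + c(-1, 1) * qnorm(0.975) * mc_se

# Empirical sizes copied from the table above
size <- c("HL (decile)" = 0.060, "HL (equal-width)" = 0.053,
          "Pigeon-Heyse" = 0.035, "EF (omnibus)" = 0.058,
          "Tsiatis" = 0.056, "Xie" = 0.042,
          "DEF (poly3)" = 0.060, "DEF (ensemble, vote)" = 0.066)

size_check <- data.frame(size = size,
                         within_band = size >= band[1] & size <= band[2])
print(round(band, 4))
print(size_check)
```

With 1000 replications the standard error is about 0.007, so estimated sizes between roughly 0.036 and 0.064 are compatible with a true size of 0.05.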

Figure: Size and power across partition-based GOF tests.

Pros and cons of partition-based tests

Pros
- Intuitive: they compare observed vs expected event counts within groups.
- Work for sparse data and continuous covariates, where the classical Pearson and deviance chi-square tests break down (those need replicated covariate patterns).
- Widely used and understood (Hosmer–Lemeshow is the de facto standard).
- Flexible: group by the fitted probability (HL, EF, Pigeon–Heyse, DEF) or by the covariate space (Tsiatis, Xie) to target different kinds of misfit.

Cons
- The result depends on the grouping choice: the number of groups G and the grouping rule. Hosmer–Lemeshow in particular is known to give different answers for different G and across software.
- Limited power for some departures: omnibus partition tests (HL) spread their few degrees of freedom thinly; fitted-probability grouping can miss misfit that cancels along the predicted probability; covariate-clustering tests can miss smooth link departures.
- The chi-square reference is asymptotic and needs adequate group sizes.
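The observed-versus-expected idea, and the dependence on G, can be seen with a few lines of base R. This is a textbook Hosmer-Lemeshow-type statistic with equal-size groups, written as a sketch: hl_by_groups() is a hypothetical helper, and its grouping rule is one of several in use, so its numbers need not match ef.gof(), run.all.gof(), or other software.

```r
# Hypothetical helper: group by rank of the fitted probability, then compare
# observed and expected event counts in each group.
hl_by_groups <- function(y, p_hat, G = 10) {
  n <- length(y)
  group <- ceiling(G * rank(p_hat, ties.method = "first") / n)
  n_g   <- as.vector(tapply(y, group, length))   # group sizes
  obs   <- as.vector(tapply(y, group, sum))      # observed events
  exp_g <- as.vector(tapply(p_hat, group, sum))  # expected events
  p_bar <- exp_g / n_g
  stat  <- sum((obs - exp_g)^2 / (n_g * p_bar * (1 - p_bar)))
  data.frame(G = G, statistic = stat,
             p_value = pchisq(stat, df = G - 2, lower.tail = FALSE))
}

# Data simulated from a correctly specified logistic model
set.seed(1)
n <- 500
x <- runif(n, -3, 3)
y <- rbinom(n, 1, plogis(0.5 + x))
fit <- glm(y ~ x, family = binomial())
p_hat <- fitted(fit)

# Same data, same model, three grouping choices
results <- do.call(rbind, lapply(c(5, 10, 20),
                                 function(G) hl_by_groups(y, p_hat, G)))
print(results)
```

The three rows use identical data and differ only in G, which is the grouping sensitivity described in the first item under Cons.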

DEF is built to fix the power cons without giving up size control: it keeps the intuitive fitted-probability grouping but directs the test at calibration-curve shapes, and the ensemble removes the basis choice by combining those directions — which is why it tops the table above while staying at the nominal size.

Setup: covariate x ~ Uniform(-3, 3), models fitted as glm(y ~ x); all tests computed in one call with the package’s own run.all.gof(). The dashed red line marks the nominal 0.05 size.

Examples

Example 1: Basic Usage with Binary Data

library(ebrahim.gof)

# Simulate binary data
set.seed(42)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
linpred <- -0.5 + 0.8 * x1 + 0.6 * x2
prob <- plogis(linpred)
y <- rbinom(n, 1, prob)

# Fit logistic regression
model <- glm(y ~ x1 + x2, family = binomial())
predicted_probs <- fitted(model)

# Test goodness of fit (chi-square reference by default in 2.0.0;
# use method = "normal" for the version 1.0.0 p-value)
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
#>                 Test Test_Statistic p_value
#> 1 Ebrahim-Farrington         1.3731   0.096

Example 2: Compare Different Group Numbers

# Test with different numbers of groups
results <- data.frame(
  Groups = c(4, 10, 20),
  P_value = c(
    ef.gof(y, predicted_probs, G = 4)$p_value,
    ef.gof(y, predicted_probs, G = 10)$p_value,
    ef.gof(y, predicted_probs, G = 20)$p_value
  )
)
print(results)

Example 3: Comparison with Hosmer-Lemeshow Test

library(ResourceSelection)

# Ebrahim-Farrington test
ef_result <- ef.gof(y, predicted_probs, G = 10)

# Hosmer-Lemeshow test
hl_result <- hoslem.test(y, predicted_probs, g = 10)

# Compare results
comparison <- data.frame(
  Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
  P_value = c(ef_result$p_value, hl_result$p.value)
)
print(comparison)

Example 4: Power Analysis

# Function to simulate misspecified model
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100) {
  rejections <- 0
  
  for (i in 1:n_sims) {
    x <- runif(n, -2, 2)
    # True model has quadratic term
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    
    # Fit misspecified linear model
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    
    # Test goodness of fit
    test_result <- ef.gof(y, pred_probs, G = 10)
    
    if (test_result$p_value < 0.05) {
      rejections <- rejections + 1
    }
  }
  
  return(rejections / n_sims)
}

# Calculate power for different sample sizes
# (n_sims = 100 per sample size, so each estimate is rough)
set.seed(123)
sample_sizes <- c(100, 200, 500, 1000)
power_results <- data.frame(
  n = sample_sizes,
  power = sapply(sample_sizes, simulate_power)
)
print(power_results)

Example 5: Run the whole battery at once (run.all.gof)

library(ebrahim.gof)

# a model on the classic low-birth-weight data
fit <- glm(low ~ age + lwt + factor(race) + smoke,
           data = MASS::birthwt, family = binomial())

# every test in one tidy data frame (one row per test)
run.all.gof(fit)

# also run the opt-in slow tests (le Cessie, GAM-based, Stute-Zhu, eHL, BAGofT,
# Lai-Liu); set the bootstrap reps via control
run.all.gof(fit, include_slow = TRUE,
            control = list("Stute-Zhu" = list(B = 200)))

# or just a chosen subset
run.all.gof(fit, tests = c("EF", "DEF.poly3", "Tsiatis", "HL"))

Example 6: Directed test and the ensemble (def.gof, def.ensemble.gof)

set.seed(1)
n <- 800
x <- runif(n, -3, 3)
y <- rbinom(n, 1, 1 - exp(-exp(0.6 * x)))   # true link is complementary log-log
fit <- glm(y ~ x, family = binomial())       # fitted as logit (misspecified)

# the directed test with different shape bases
def.gof(fit, basis = "poly2")
def.gof(fit, basis = "stukel")

# the recommended default: combine the three bases with the Cauchy combination
# test (no basis to choose, valid size)
def.ensemble.gof(fit)
def.ensemble.gof(fit, add_ef = TRUE)   # also fold in the omnibus EF

Methodology

The Ebrahim-Farrington test is based on Farrington’s (1996) theoretical framework, simplified for practical use with binary data. It uses a modified Pearson chi-square statistic, T_EF. For binary data with automatic grouping, the standardized test statistic is:

Z_EF = (T_EF - (G - 2)) / sqrt(2 * (G - 2))

Where:
- T_EF is the modified Pearson chi-square statistic
- G is the number of groups
- Z_EF asymptotically follows a standard normal distribution under H₀

As of version 2.0.0, ef.gof() by default refers T_EF directly to a chi-square distribution with G - 2 degrees of freedom (method = "chisq"), which is a more accurate small-sample reference; the standardized-normal form Z_EF above is still available via method = "normal".
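The two reference distributions can be compared by hand in base R. In this sketch the value of T_EF is hypothetical, and the upper-tail p-value for the normal form is an assumption about how method = "normal" is evaluated; only the formulas stated above are used.

```r
G    <- 10
T_EF <- 13.5    # hypothetical value of the grouped statistic
df   <- G - 2   # degrees of freedom of the chi-square reference

# Standardized form: centre by the mean (df) and scale by the sd (sqrt(2 * df))
Z_EF <- (T_EF - df) / sqrt(2 * df)

p_chisq  <- pchisq(T_EF, df = df, lower.tail = FALSE)  # method = "chisq"
p_normal <- pnorm(Z_EF, lower.tail = FALSE)            # method = "normal" (assumed upper tail)

print(c(Z_EF = Z_EF, p_chisq = p_chisq, p_normal = p_normal))
```

For this value the chi-square p-value is the larger of the two, because the chi-square distribution with 8 degrees of freedom has a heavier right tail than its normal approximation.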

Advantages over Hosmer-Lemeshow Test

  1. Better Power: More sensitive to model misspecification
  2. Sparse Data Handling: Specifically designed for sparse data situations
  3. Computational Efficiency: Simplified calculations for binary data
  4. Theoretical Foundation: Based on rigorous asymptotic theory

Superior Performance at G=10

Simulation results in Ebrahim (2025) show the Ebrahim-Farrington test outperforming the Hosmer-Lemeshow test with G = 10 groups, even when the model misspecification is minimal, such as a missing interaction or an omitted quadratic term.

Figure: Power_Comparison_All_Scenarios_Combined.png

Asymptotically Following the Standard Normal Distribution

The two figures below illustrate that, under the null hypothesis, the Ebrahim-Farrington test statistic is asymptotically standard normal for both single-predictor and multiple-predictor logistic regression models. This property holds even in sparse data settings, confirming the theoretical foundation of the test and supporting its use for model assessment (see Ebrahim, 2025).

These results demonstrate that the Ebrahim-Farrington test maintains the correct type I error rate and its statistic converges to the standard normal distribution as sample size increases, validating its asymptotic properties.

Figures: Farrington CDF Comparison (U-3_3); Farrington CDF Comparison (multi_indep).

References

  1. Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. Journal of the Royal Statistical Society. Series B (Methodological), 58(2), 349-360.

  2. Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. Master’s Thesis, Alexandria University.

  3. Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.

Citation

If you use this package in your research, please cite:

Ebrahim, K. E. (2025). ebrahim.gof: Ebrahim-Farrington Goodness-of-Fit Test 
for Logistic Regression. R package version 2.0.0. 
https://github.com/ebrahimkhaled/ebrahim.gof

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the GPL-3 license.

Author

Ebrahim Khaled Ebrahim
Alexandria University
Email: ebrahimkhaled@alexu.edu.eg

Acknowledgments