The ebrahim.gof package implements the Ebrahim-Farrington goodness-of-fit test for logistic regression models. This test is particularly effective for binary data and sparse datasets, providing an improved alternative to the traditional Hosmer-Lemeshow test.
New in version 2.0.0: the package is now a full
goodness-of-fit toolkit. Alongside the omnibus Ebrahim-Farrington (EF)
test it adds the Directed EF (DEF) test
(def.gof()), a Cauchy-combination ensemble
(def.ensemble.gof()), and
run.all.gof() — a one-call battery of ~19
goodness-of-fit tests (plus opt-in slow tests), each verified against
the implementation used in the thesis simulation.
- ef.gof(): omnibus test for binary data with automatic grouping (chi-square or normal reference)
- def.gof(): targets calibration-shape departures (poly2/poly3/stukel bases or their ensemble)
- def.ensemble.gof(): combines the DEF bases via the Cauchy combination test
- run.all.gof(): runs ~19 GOF tests (plus opt-in slow ones) and returns a tidy data frame

Copy and paste this into R or RStudio.
# Install devtools if you haven't already
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}
# Install ebrahim.gof from GitHub
devtools::install_github("ebrahimkhaled/ebrahim.gof")

The released version is on CRAN (currently 1.0.0):

install.packages("ebrahim.gof")

Version 2.0.0 (this version, with the directed test, the ensemble, and the one-call battery) is available from GitHub now and is being submitted to CRAN.
library(ebrahim.gof)
# Example with binary data
set.seed(123)
n <- 500
x <- rnorm(n)
linpred <- 0.5 + 1.2 * x
prob <- 1 / (1 + exp(-linpred))
y <- rbinom(n, 1, prob)
# Fit logistic regression
model <- glm(y ~ x, family = binomial())
predicted_probs <- fitted(model)
# Perform Ebrahim-Farrington test
result <- ef.gof(y, predicted_probs, G = 10)
print(result)

ef.gof()

The main function that performs the goodness-of-fit test:

ef.gof(y, predicted_probs, G = 10, model = NULL, m = NULL,
       method = c("chisq", "normal"))

Parameters:
- y: a fitted binary-logistic glm (then predicted_probs is taken from it), or a binary response vector (0/1) / success counts for grouped data
- predicted_probs: Vector of predicted probabilities from logistic model
- G: Number of groups for binary data (default: 10)
- method: reference distribution for the grouped statistic. "chisq" (default, new in 2.0.0) refers T_EF to a chi-square with G-2 df; "normal" uses the standardized Z_EF (the behaviour of versions <= 1.0.0)
- model: Optional glm object (required for the original Farrington test only)
- m: Optional vector of trial counts (for grouped data; original Farrington only)
Note (breaking change in 2.0.0): ef.gof() now defaults to method = "chisq". Use method = "normal" to reproduce p-values from version 1.0.0.
Returns: A data frame with test name, test statistic, and p-value.
def.gof(): Directed Ebrahim-Farrington test

Concentrates power on calibration-curve shape directions by projecting the grouped residuals onto a small smooth basis.

def.gof(object, predicted_probs = NULL, X = NULL, G = 10,
        basis = c("poly3", "poly2", "stukel", "ensemble"),
        method = c("satterthwaite", "imhof"))

- object: a fitted binary-logistic glm, or a 0/1 response vector y (then give predicted_probs, and X to get the exact calibration).
- basis: "poly3" (default), "poly2", "stukel", or "ensemble" (runs all three and combines them via def.ensemble.gof()).
- method: "satterthwaite" (default, no extra dependency) or "imhof" (exact, needs CompQuadForm).

fit <- glm(y ~ x1 + x2, family = binomial())
def.gof(fit) # default poly3 basis
def.gof(fit, basis = "ensemble") # combined Cauchy decision

def.ensemble.gof(): combine the DEF bases

Combines the three DEF basis tests (optionally the omnibus EF, or extra p-values) into one decision via the Cauchy combination test (CCT).
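For intuition, the Cauchy combination itself fits in a few lines of base R. The sketch below is a generic equal-weight version of the published CCT formula, not the package's internal code; the helper name cauchy_combine and the three example p-values are hypothetical.

```r
# Equal-weight Cauchy combination of p-values (generic sketch;
# `cauchy_combine` is a hypothetical helper, not a package function).
cauchy_combine <- function(p) {
  stopifnot(is.numeric(p), length(p) > 0, all(p > 0 & p < 1))
  # Map each p-value to a standard Cauchy variate and average them
  t_stat <- mean(tan((0.5 - p) * pi))
  # Upper tail of the standard Cauchy distribution
  0.5 - atan(t_stat) / pi
}

# Three made-up p-values, e.g. one per DEF basis
p_bases <- c(poly2 = 0.04, poly3 = 0.01, stukel = 0.20)
p_combined <- cauchy_combine(p_bases)
print(p_combined)
```

A single p-value is returned unchanged, and one very small p-value dominates the combination, which is what lets the ensemble pick up whichever basis matches the misfit.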
def.ensemble.gof(fit) # CCT of poly2 + poly3 + stukel
def.ensemble.gof(fit, add_ef = TRUE) # add the omnibus EF

run.all.gof(): the whole battery in one call

Runs a large battery of goodness-of-fit tests and returns one tidy data frame (one row per test). A failing test never aborts the run.
fit <- glm(low ~ age + lwt + factor(race), data = MASS::birthwt, family = binomial())
run.all.gof(fit) # the default (fast) battery + ensemble rows
run.all.gof(fit, include_slow = TRUE) # also the opt-in slow tests
run.all.gof(fit, tests = c("EF", "DEF.poly3", "HL")) # a chosen subset
run.all.gof(MASS::birthwt$low, fitted(fit)) # prediction-only tests (no model)

Default battery (19 rows): Pearson, Deviance, Osius-Rojek, Copas-RSS, Information-Matrix, Hosmer-Lemeshow (deciles and equal-width), Pigeon-Heyse, EF, EF-normal, the three DEF bases, Stukel, Tsiatis, Xie, Pulkstenis-Robinson, and the two Cauchy-combination ensemble rows. With include_slow = TRUE it also runs le Cessie-van Houwelingen, the GAM-based tests (HL-GAM, PR-GAM, Xie-GAM; need mgcv), Stute-Zhu, eHL, BAGofT, and the Lai & Liu standardized-power HL test. Every test reproduces the implementation used in the original thesis simulation.
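A battery that never aborts can be built by wrapping each test in tryCatch(). The sketch below shows that general pattern in base R; run_battery, the two toy tests, and the column names are hypothetical and are not taken from the package's source.

```r
# Run several test functions on the same inputs; a test that errors
# yields an NA row instead of stopping the run (hypothetical sketch).
run_battery <- function(tests, y, p) {
  rows <- lapply(names(tests), function(nm) {
    res <- tryCatch(tests[[nm]](y, p), error = function(e) e)
    if (inherits(res, "error")) {
      data.frame(Test = nm, Statistic = NA_real_, p_value = NA_real_,
                 Note = conditionMessage(res))
    } else {
      data.frame(Test = nm, Statistic = res$statistic,
                 p_value = res$p_value, Note = "")
    }
  })
  do.call(rbind, rows)
}

# Two toy tests: one works, one always fails
toy_tests <- list(
  mean_residual = function(y, p) {
    z <- sum(y - p) / sqrt(sum(p * (1 - p)))
    list(statistic = z, p_value = 2 * pnorm(-abs(z)))
  },
  broken = function(y, p) stop("not implemented")
)

set.seed(1)
p <- runif(200, 0.1, 0.9)
y <- rbinom(200, 1, p)
battery <- run_battery(toy_tests, y, p)
print(battery)
```

The failing test still gets its own row, with NA values and the error message kept in the Note column, so the result always has one row per requested test.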
Most goodness-of-fit tests for logistic regression are
partition-based: they split the data into groups — by
the fitted probability, by covariate-space clusters, or by categorical
patterns — and compare observed with expected event counts in each
group. This is the family that ef.gof(),
def.gof(), and def.ensemble.gof() belong to.
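As a concrete instance of the partition idea, here is a minimal base-R sketch of the classical decile Hosmer–Lemeshow statistic. It is the textbook version, not the package's EF or DEF statistic; the helper name hl_sketch and the simulated data are illustrative assumptions.

```r
# Textbook Hosmer-Lemeshow statistic with G groups formed from quantiles
# of the fitted probabilities (hypothetical helper, base R only).
hl_sketch <- function(y, p, G = 10) {
  breaks <- quantile(p, probs = seq(0, 1, length.out = G + 1))
  grp <- cut(p, breaks = breaks, include.lowest = TRUE)
  obs <- tapply(y, grp, sum)      # observed events per group
  expd <- tapply(p, grp, sum)     # expected events per group
  n_g <- tapply(p, grp, length)   # group sizes
  p_bar <- expd / n_g             # mean fitted probability per group
  stat <- sum((obs - expd)^2 / (n_g * p_bar * (1 - p_bar)))
  data.frame(statistic = stat, df = G - 2,
             p_value = pchisq(stat, df = G - 2, lower.tail = FALSE))
}

set.seed(123)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial())
hl <- hl_sketch(y, fitted(fit), G = 10)
print(hl)
```

The sketch assumes the fitted probabilities have no ties at the quantile breaks, which holds for a continuous covariate as used here.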
In a Monte Carlo study (n = 500, 1000 replications, α = 0.05) the
partition tests compare as follows:
| Test | Grouping | Size (null) | Power: quadratic | Power: wrong link |
|---|---|---|---|---|
| Hosmer–Lemeshow (decile) | fitted prob | 0.060 | 0.588 | 0.179 |
| Hosmer–Lemeshow (equal-width) | fitted prob | 0.053 | 0.332 | 0.244 |
| Pigeon–Heyse | fitted prob | 0.035 | 0.535 | 0.133 |
| EF (omnibus) | fitted prob | 0.058 | 0.480 | 0.218 |
| Tsiatis | covariate clusters | 0.056 | 0.574 | 0.162 |
| Xie | covariate clusters | 0.042 | 0.557 | 0.147 |
| DEF (poly3) | fitted prob + shape basis | 0.060 | 0.709 | 0.404 |
| DEF (ensemble, vote) | fitted prob + 3 bases | 0.066 | 0.767 | 0.468 |
Across the family, DEF and its vote ensemble are the most powerful on both misfits, with empirical sizes (0.060 and 0.066) that stay close to the nominal 0.05. On the wrong-link misfit they have more than twice the power of decile Hosmer–Lemeshow, Tsiatis, and Xie.
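The wrong-link comparison can be checked directly from the last column of the table. The snippet below only redoes that arithmetic in base R; the object names are illustrative.

```r
# Wrong-link power values copied from the table above
power_wrong_link <- c(
  "HL (decile)"    = 0.179,
  "Tsiatis"        = 0.162,
  "Xie"            = 0.147,
  "DEF (poly3)"    = 0.404,
  "DEF (ensemble)" = 0.468
)
baseline <- power_wrong_link[c("HL (decile)", "Tsiatis", "Xie")]

# Ratio of each DEF variant's power to each baseline test's power
ratios <- outer(power_wrong_link[c("DEF (poly3)", "DEF (ensemble)")],
                baseline, "/")
print(round(ratios, 2))
```

Every ratio is above 2. Against equal-width Hosmer–Lemeshow (0.244) the gap is smaller, at roughly 1.7 to 1.9 times.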
Pros
- Intuitive: compare observed vs expected event counts within groups.
- Work for sparse data and continuous covariates, where the classical Pearson and deviance chi-square tests break down (those need replicated covariate patterns).
- Widely used and understood (Hosmer–Lemeshow is the de facto standard).
- Flexible: group by the fitted probability (HL, EF, Pigeon–Heyse, DEF) or by the covariate space (Tsiatis, Xie) to target different kinds of misfit.

Cons
- The result depends on the grouping choice: the number of groups G and the grouping rule. Hosmer–Lemeshow in particular is known to give different answers for different G and across software.
- Limited power for some departures: omnibus partition tests (HL) spread their few degrees of freedom thinly; fitted-probability grouping can miss misfit that cancels along the predicted probability; covariate-clustering tests can miss smooth link departures.
- The chi-square reference is asymptotic and needs adequate group sizes.
DEF is designed to address these power limitations without giving up size control: it keeps the intuitive fitted-probability grouping but directs the test at calibration-curve shapes, and the ensemble removes the basis choice by combining those directions. This is why it tops the table above while staying close to the nominal size.
Setup: covariate x ~ Uniform(-3, 3), models fitted
as glm(y ~ x); all tests computed in one call with the
package’s own run.all.gof(). The dashed red line marks the
nominal 0.05 size.
library(ebrahim.gof)
# Simulate binary data
set.seed(42)
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
linpred <- -0.5 + 0.8 * x1 + 0.6 * x2
prob <- plogis(linpred)
y <- rbinom(n, 1, prob)
# Fit logistic regression
model <- glm(y ~ x1 + x2, family = binomial())
predicted_probs <- fitted(model)
# Test goodness of fit (chi-square reference by default in 2.0.0;
# use method = "normal" for the version 1.0.0 p-value)
result <- ef.gof(y, predicted_probs, G = 10)
print(result)
#> Test Test_Statistic p_value
#> 1 Ebrahim-Farrington 1.3731 0.096

# Test with different numbers of groups
results <- data.frame(
  Groups = c(4, 10, 20),
  P_value = c(
    ef.gof(y, predicted_probs, G = 4)$p_value,
    ef.gof(y, predicted_probs, G = 10)$p_value,
    ef.gof(y, predicted_probs, G = 20)$p_value
  )
)
print(results)

library(ResourceSelection)
# Ebrahim-Farrington test
ef_result <- ef.gof(y, predicted_probs, G = 10)
# Hosmer-Lemeshow test
hl_result <- hoslem.test(y, predicted_probs, g = 10)
# Compare results
comparison <- data.frame(
  Test = c("Ebrahim-Farrington", "Hosmer-Lemeshow"),
  P_value = c(ef_result$p_value, hl_result$p.value)
)
print(comparison)

# Function to simulate misspecified model
simulate_power <- function(n, beta_quad = 0.1, n_sims = 100) {
  rejections <- 0
  for (i in 1:n_sims) {
    x <- runif(n, -2, 2)
    # True model has quadratic term
    linpred_true <- 0 + x + beta_quad * x^2
    prob_true <- plogis(linpred_true)
    y <- rbinom(n, 1, prob_true)
    # Fit misspecified linear model
    model_mis <- glm(y ~ x, family = binomial())
    pred_probs <- fitted(model_mis)
    # Test goodness of fit
    test_result <- ef.gof(y, pred_probs, G = 10)
    if (test_result$p_value < 0.05) {
      rejections <- rejections + 1
    }
  }
  return(rejections / n_sims)
}
# Calculate power for different sample sizes
power_results <- data.frame(
  n = c(100, 200, 500, 1000),
  power = sapply(c(100, 200, 500, 1000), simulate_power)
)
print(power_results)

run.all.gof

library(ebrahim.gof)
# a model on the classic low-birth-weight data
fit <- glm(low ~ age + lwt + factor(race) + smoke,
data = MASS::birthwt, family = binomial())
# every test in one tidy data frame (one row per test)
run.all.gof(fit)
# also run the opt-in slow tests (le Cessie, GAM-based, Stute-Zhu, eHL, BAGofT,
# Lai-Liu); set the bootstrap reps via control
run.all.gof(fit, include_slow = TRUE,
control = list("Stute-Zhu" = list(B = 200)))
# or just a chosen subset
run.all.gof(fit, tests = c("EF", "DEF.poly3", "Tsiatis", "HL"))

def.gof, def.ensemble.gof

set.seed(1)
n <- 800
x <- runif(n, -3, 3)
y <- rbinom(n, 1, 1 - exp(-exp(0.6 * x))) # true link is complementary log-log
fit <- glm(y ~ x, family = binomial()) # fitted as logit (misspecified)
# the directed test with different shape bases
def.gof(fit, basis = "poly2")
def.gof(fit, basis = "stukel")
# the recommended default: combine the three bases with the Cauchy combination
# test (no basis to choose, valid size)
def.ensemble.gof(fit)
def.ensemble.gof(fit, add_ef = TRUE) # also fold in the omnibus EF

The Ebrahim-Farrington test is based on Farrington’s (1996) theoretical framework but simplified for practical implementation with binary data. The test uses a modified Pearson chi-square statistic.

For binary data with automatic grouping, the test statistic is:

Z_EF = (T_EF - (G - 2)) / sqrt(2(G - 2))

Where:
- T_EF is the modified Pearson chi-square statistic
- G is the number of groups
- Z_EF is asymptotically standard normal under H₀
As of version 2.0.0, ef.gof() by
default refers T_EF directly to a chi-square distribution
with G - 2 degrees of freedom
(method = "chisq"), which is a more accurate small-sample
reference; the standardized-normal form Z_EF above is still
available via method = "normal".
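The two reference distributions can be compared side by side for a given statistic. In the sketch below the value of T_EF is made up, and the one-sided upper-tail p-value for Z_EF is an assumption about how the normal reference is applied, not something confirmed from the package source.

```r
G <- 10
T_EF <- 12.5   # hypothetical value of the grouped statistic

# method = "chisq": refer T_EF to a chi-square with G - 2 df
p_chisq <- pchisq(T_EF, df = G - 2, lower.tail = FALSE)

# method = "normal": standardize, then use the standard normal
# (upper tail assumed here)
Z_EF <- (T_EF - (G - 2)) / sqrt(2 * (G - 2))
p_normal <- pnorm(Z_EF, lower.tail = FALSE)

print(c(Z_EF = Z_EF, p_chisq = p_chisq, p_normal = p_normal))
```

Here Z_EF = (12.5 - 8) / 4 = 1.125. The two p-values generally differ because a chi-square with G - 2 degrees of freedom is right-skewed when G is small, while the normal reference is symmetric.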

The following two figures illustrate that, under the null hypothesis, the Ebrahim-Farrington test statistic is asymptotically standard normal for both single-predictor and multiple-predictor logistic regression models. This property holds even in sparse data settings, which supports the theoretical foundation of the test and its use for model assessment (see Ebrahim, 2025).
These results demonstrate that the Ebrahim-Farrington test maintains the correct type I error rate and its statistic converges to the standard normal distribution as sample size increases, validating its asymptotic properties.

Farrington, C. P. (1996). On Assessing Goodness of Fit of Generalized Linear Models to Sparse Data. Journal of the Royal Statistical Society. Series B (Methodological), 58(2), 349-360.
Ebrahim, Khaled Ebrahim (2025). Goodness-of-Fits Tests and Calibration Machine Learning Algorithms for Logistic Regression Model with Sparse Data. Master’s Thesis, Alexandria University.
Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression, Second Edition. New York: Wiley.
If you use this package in your research, please cite:
Ebrahim, K. E. (2025). ebrahim.gof: Ebrahim-Farrington Goodness-of-Fit Test
for Logistic Regression. R package version 2.0.0.
https://github.com/ebrahimkhaled/ebrahim.gof
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the GPL-3 License.
Ebrahim Khaled Ebrahim
Alexandria University
Email: ebrahimkhaled@alexu.edu.eg