April 23, 2026
Chi-Kuang Yeh (Georgia State University) 
Gregory Rice (University of Waterloo)
Joel A. Dubin (University of Waterloo)
rtForecastEval implements the methodology in Yeh, Rice, and Dubin (2022) (preprint arXiv:2010.00781). The paper develops tools for continuously updated probabilistic forecasts: calibration-style summaries, skill (the relative performance of two forecasters) via a global delta test on L²[0,1] under squared (Brier) loss, and simple pointwise inference for the mean loss difference over time.
This package ships the statistical core used in the paper’s simulation sections. Full NBA replication (scraped data, calibration surfaces, competitor models) lives in a separate analysis repository and is not bundled here.
| Topic | Functions |
|---|---|
| Simulation designs | `df_gen()` — requires the `sde` package at runtime |
| Pointwise loss / variance | `calc_L_s2()`, `lin_interp()` |
| Global delta test | `calc_Z()`, `calc_eig()`, `calc_pval()` |
| Plotting | `plot_pcb()` (naive pointwise skill band); the README below also shows reliability plots and a calibration-difference curve between forecasters |
After installation, run `help(package = "rtForecastEval")` or `?df_gen` for full documentation.
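The skill summaries above are built on the squared (Brier) loss. As a toy, self-contained illustration of the pointwise loss difference between two forecasters (the helper `brier_diff` and the numbers are ours for illustration, not package API):

```r
# Toy pointwise Brier-loss difference between two forecasters on one game.
# y is the fixed 0/1 outcome; phat_a and phat_b are forecast paths over the
# time grid. Negative values favour A (smaller squared error), positive B.
# `brier_diff` is an illustrative helper, not part of the package API.
brier_diff <- function(phat_a, phat_b, y) {
  (phat_a - y)^2 - (phat_b - y)^2
}

grid   <- seq(0, 1, length.out = 5)
phat_a <- c(0.50, 0.60, 0.70, 0.85, 0.95) # sharpens toward the outcome
phat_b <- c(0.50, 0.55, 0.60, 0.70, 0.80)
y <- 1 # the tracked team won

brier_diff(phat_a, phat_b, y) # negative after t = 0: A is closer to Y
```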
The diagrams below use Mermaid. They render on GitHub; in other viewers you may see the source only.
```mermaid
mindmap
  root((rtForecastEval))
    Paper
      Yeh Rice Dubin 2022
      Real-time probabilistic forecasts
    Inputs
      df_gen simulation
      Your own long-format data
    Global skill test
      calc_Z delta statistic
      calc_eig covariance spectrum
      calc_pval MC p-values
    Pointwise skill
      calc_L_s2 loss and variance
      plot_pcb naive band
    Other
      lin_interp time grid
```
```mermaid
flowchart TD
  subgraph src["Data sources"]
    A["df_gen()<br>(simulation)"]
    B["Your forecasts<br>(long format)"]
  end
  C["Long tibble:<br>game, grid, Y,<br>phat_A, phat_B"]
  D["Centered / non-centered<br>differences"]
  E["calc_Z()"]
  F["calc_eig()"]
  G["calc_pval()"]
  H["calc_L_s2()"]
  P["plot_pcb()"]
  A --> C
  B --> C
  C --> D
  D --> E
  D --> F
  E --> G
  F --> G
  D --> H
  H --> P
```
```r
install.packages("pak")
pak::pak("chikuang/rtForecastEval")
```

After installing the package (and `sde` for `df_gen()`), load `dplyr`, `tidyr`, `ggplot2`, and `rtForecastEval`. The chunks below execute when you render `README.qmd` (e.g. `quarto render README.qmd` or `Rscript tools/render-readme.R README.qmd`) with the package available — for example `devtools::load_all()` from a local clone or `library(rtForecastEval)` after `pak::pak("chikuang/rtForecastEval")`.
```r
library(dplyr)
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>     filter, lag
#> The following objects are masked from 'package:base':
#>     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(rtForecastEval)
library(sde)
#> Loading required package: MASS
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>     select
#> Loading required package: stats4
#> Loading required package: fda
#> Loading required package: splines
#> Loading required package: fds
#> Loading required package: rainbow
#> Loading required package: pcaPP
#> Loading required package: RCurl
#> Attaching package: 'RCurl'
#> The following object is masked from 'package:tidyr':
#>     complete
#> Loading required package: deSolve
#> Attaching package: 'fda'
#> The following object is masked from 'package:graphics':
#>     matplot
#> The following object is masked from 'package:datasets':
#>     gait
#> Loading required package: zoo
#> Attaching package: 'zoo'
#> The following objects are masked from 'package:base':
#>     as.Date, as.Date.numeric
#> sde 2.0.21
#> Companion package to the book
#> 'Simulation and Inference for Stochastic Differential Equations With R Examples'
#> Iacus, Springer NY, (2008)
#> To check the errata corrige of the book, type vignette("sde.errata")
```
```r
set.seed(777)
nsamp <- 40   # interior steps; df_gen() uses a grid of nsamp + 1 points on [0, 1]
ngame <- 1000 # independent replicates (“games”); larger n → more games per calibration bin
D <- 8        # leading eigenvalues for the chi-square approximation
N_MC <- 2000  # Monte Carlo draws for the p-value

df_equ <- df_gen(N = nsamp, Ngame = ngame) |>
  group_by(grid) |>
  mutate(
    p_bar_12 = mean(phat_A - phat_B),
    diff_non_cent = phat_A - phat_B,
    diff_cent = phat_A - phat_B - p_bar_12
  ) |>
  ungroup()

ZZ <- calc_Z(df_equ, "phat_A", "phat_B", "Y", nsamp = nsamp, ngame = ngame)
eig <- calc_eig(df_equ, n_eig = D, ngame = ngame, nsamp = nsamp, cent = FALSE)
out <- calc_pval(ZZ, eig, quan = c(0.9, 0.95, 0.99), n_MC = N_MC)

Ltab <- calc_L_s2(df_equ, "phat_A", "phat_B")
plot_pcb(Ltab)

c(Z = ZZ, p_value = out$p_val, out$quantile)
#>          Z    p_value        90%        95%        99%
#> 0.05470115 0.72300000 0.40628731 0.58446258 0.91682971
```
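Reading this output: Z = 0.055 sits well below even the 90% null quantile (0.406), and the Monte Carlo p-value is 0.723, so there is no evidence of a skill difference. A self-contained sketch of the decision rule, with the values hard-coded from the run above:

```r
# Decision rule for the global delta test (values copied from the run above).
# Reject H0 (equal forecaster skill under Brier loss) at level alpha when the
# statistic Z exceeds the corresponding null quantile, i.e. when p < alpha.
Z <- 0.05470115
p_value <- 0.723
crit <- c("90%" = 0.40628731, "95%" = 0.58446258, "99%" = 0.91682971)

any(Z > crit) # FALSE: fail to reject at the 10%, 5%, and 1% levels
```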
`plot_pcb()` plots the pointwise mean loss difference (relative skill) with a naive normal band — not a classical calibration diagram. A complementary check is marginal calibration at one time point: bin predicted probabilities at a single grid value (here, the one closest to mid-game, t ≈ 0.5) and compare mean outcome to mean forecast (a reliability diagram). Grey vertical segments show a 95% central range for the bin mean of `Y` under H₀ (perfect calibration in the bin): under the null, outcomes are Bernoulli with probability equal to the bin’s mean forecast p̄, so the bin sum is Binomial(n, p̄), and the segment endpoints use exact quantiles of that binomial (no normal approximation or asymmetric clipping to [0, 1], which can make bands look “anchored” at y = 0). Full calibration surfaces from the paper are in the NBA replication code; this is a minimal ggplot using the same long-format columns (`phat_A`, `phat_B`, `Y`, `grid`).
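The band construction can be checked in isolation with base R alone. A small sketch (the bin size and mean forecast are made-up numbers, not simulation output):

```r
# Exact 95% central range for the bin mean of Y under H0.
# If a bin holds n_games games with mean forecast p_bar, then under perfect
# calibration sum(Y) ~ Binomial(n_games, p_bar), so the band endpoints are
# binomial quantiles divided by n_games: no normal approximation, no clipping.
n_games <- 400 # illustrative bin size
p_bar <- 0.5   # illustrative bin mean forecast

ci_lo <- qbinom(0.025, size = n_games, prob = p_bar) / n_games
ci_hi <- qbinom(0.975, size = n_games, prob = p_bar) / n_games
c(ci_lo = ci_lo, ci_hi = ci_hi)
#> ci_lo ci_hi
#>  0.45  0.55
```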
```r
# Uses `df_equ` from the chunk above
g_mid <- df_equ |>
  distinct(grid) |>
  slice_min(order_by = abs(grid - 0.5), n = 1) |>
  pull(grid)

slice_t <- df_equ |>
  filter(grid == g_mid) |>
  transmute(game, Y, phat_A, phat_B)

rel_long <- slice_t |>
  pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") |>
  mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))

rel_binned <- rel_long |>
  group_by(forecaster) |>
  mutate(bin = ntile(phat, 10)) |>
  group_by(forecaster, bin) |>
  summarise(
    mean_forecast = mean(phat),
    mean_outcome = mean(Y),
    n_games = n(),
    .groups = "drop"
  ) |>
  mutate(
    p_bin = pmin(pmax(mean_forecast, 0), 1),
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )

ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.35
  ) +
  geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  labs(
    title = "Binned reliability (calibration) at one grid",
    subtitle = paste0(
      "Grid = ", signif(g_mid, 3),
      "; grey: exact 95% central range for mean(Y) under H0 (Binomial; see text)"
    ),
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold")
  )
```
The facet plot above uses separate decile bins per forecaster, so bins are not aligned across A vs. B. Here we instead bin the same games with equal-width cuts of the midpoint m = (phat_A + phat_B) / 2 (the same definition used in the code chunk below). For each bin we compute mean(Y) and each forecaster’s mean forecast in that bin, then draw one classical reliability diagram per forecaster (horizontal axis: mean forecast; vertical axis: mean outcome; both in [0, 1]). Grey vertical segments are the exact Binomial 95% central range for mean(Y) under H₀ at that forecaster’s bin mean forecast (the same construction as in the first figure). This avoids overlaying two forecasters in one square, which is easy to misread.
```r
# Same grid as above; shared midpoint bins for the next two figures
cal_bins <- slice_t |>
  mutate(mid = (phat_A + phat_B) / 2) |>
  mutate(bin = cut(mid, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE))

rel_joint <- cal_bins |>
  group_by(bin) |>
  summarise(
    mean_forecast_A = mean(phat_A),
    mean_forecast_B = mean(phat_B),
    mean_outcome = mean(Y),
    n_games = n(),
    .groups = "drop"
  ) |>
  filter(!is.na(bin)) |>
  mutate(
    p_A = pmin(pmax(mean_forecast_A, 0), 1),
    p_B = pmin(pmax(mean_forecast_B, 0), 1),
    ci_lo_A = stats::qbinom(0.025, size = n_games, prob = p_A) / n_games,
    ci_hi_A = stats::qbinom(0.975, size = n_games, prob = p_A) / n_games,
    ci_lo_B = stats::qbinom(0.025, size = n_games, prob = p_B) / n_games,
    ci_hi_B = stats::qbinom(0.975, size = n_games, prob = p_B) / n_games
  )

rel_joint_long <- rel_joint |>
  tidyr::pivot_longer(
    c(mean_forecast_A, mean_forecast_B),
    names_to = "which",
    names_prefix = "mean_forecast_",
    values_to = "mean_forecast"
  ) |>
  mutate(
    forecaster = ifelse(which == "A", "Forecaster A", "Forecaster B"),
    ci_lo = ifelse(which == "A", ci_lo_A, ci_lo_B),
    ci_hi = ifelse(which == "A", ci_hi_A, ci_hi_B)
  )
```
```r
ggplot(rel_joint_long, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.4
  ) +
  geom_point(aes(color = forecaster, size = n_games), alpha = 0.9) +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  scale_color_manual(values = c("Forecaster A" = "steelblue", "Forecaster B" = "orangered")) +
  labs(
    title = "Reliability: shared midpoint bins (one classical diagram per forecaster)",
    subtitle = "Bins from cut((phat_A + phat_B) / 2); grey = exact 95% Binomial H0 range for mean(Y)",
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    color = NULL,
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "right"
  ) +
  guides(color = "none")
```
The reliability diagrams show levels on the 45° plot; the figure below shows the same bins, but as calibration errors, mean(Y) − mean forecast, against the bin midpoint. Curves near 0 are better calibrated. The dashed grey line is the difference in calibration errors, cal_err_B − cal_err_A, which within each bin equals mean(phat_A) − mean(phat_B).
```r
cal_compare <- cal_bins |>
  group_by(bin) |>
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),
    cal_err_B = mean(Y) - mean(phat_B),
    n_games = n(),
    .groups = "drop"
  ) |>
  filter(!is.na(bin))

cal_long <- cal_compare |>
  pivot_longer(
    c(cal_err_A, cal_err_B),
    names_to = "which",
    values_to = "cal_err"
  ) |>
  mutate(
    which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
  )

ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
  geom_line(linewidth = 0.85) +
  geom_point(aes(size = n_games), alpha = 0.85) +
  geom_line(
    data = cal_compare,
    aes(mid_hat, cal_err_B - cal_err_A),
    inherit.aes = FALSE,
    color = "grey35",
    linetype = "dashed",
    linewidth = 0.65
  ) +
  labs(
    title = "Calibration comparison (same midpoint bins)",
    subtitle = "Solid: mean calibration error per forecaster; dashed: cal_err_B - cal_err_A = mean(phat_A) - mean(phat_B)",
    x = "Mean (phat_A + phat_B) / 2 in bin",
    y = "mean(Y) - mean(phat)",
    color = "Curve",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "bottom"
  )
```
For the full workflow (explicit linear algebra vs. wrappers, larger grids, centered eigenvalues, and the same figures with narrative), open the vignette:

```r
vignette("rtForecastEval-vignette", package = "rtForecastEval")
```

License: GPL-3 — see `DESCRIPTION`.