---
title: "Summarizing transparency across a corpus"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Summarizing transparency across a corpus}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
has_ggplot <- requireNamespace("ggplot2", quietly = TRUE)
```

The detector functions (`rt_all_pmc()`, `rt_data_code_pmc()`) describe **one
article at a time**. Most studies of research transparency instead ask
corpus-level questions: across thousands of articles, how often is each practice
present? Is it improving over time? Does it differ by journal or article type?

This vignette shows how to go from per-article detector output to that kind of
summary, using `rt_summary()`, `rt_score()` and `rt_plot()`.

## From one article to many

Running a detector on a single article returns a one-row table of indicators:

```{r}
library(rtransparency)

xml <- system.file(
  "extdata", "PMID32171256-PMC7071725.xml", package = "rtransparency"
)
one <- rt_all_pmc(xml, remove_ns = TRUE)
one[, c("pmid", "is_coi_pred", "is_fund_pred", "is_register_pred")]
```

To study a corpus you run a detector over many files and stack the rows;
`purrr::map_dfr(files, rt_all_pmc, remove_ns = TRUE)` returns all eight
indicators per article in one pass. The result is one row per article with the
indicator columns `is_coi_pred`, `is_fund_pred`, `is_register_pred`,
`is_open_data`, `is_open_code`, `is_novelty_pred`, `is_replication_pred` and
`is_ai_pred`.
Because `is_ai_pred` is `NA` for articles published before 2023 and
`rt_summary()` drops `NA`s, the AI-disclosure prevalence is computed only over
the articles where the indicator applies.
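
As a minimal base-R illustration of that `NA` handling (a sketch on a made-up
vector, not the package's internal code), counting only the non-missing values
restricts the denominator to the articles that were actually assessed:

```r
# Hypothetical toy indicator: two pre-2023 articles (NA) and three later ones.
is_ai_pred <- c(NA, NA, TRUE, FALSE, FALSE)

n_assessed <- sum(!is.na(is_ai_pred))        # articles where the indicator applies
n_detected <- sum(is_ai_pred, na.rm = TRUE)  # positives among those
prevalence <- 100 * n_detected / n_assessed  # percent of assessed articles

c(assessed = n_assessed, detected = n_detected, percent = prevalence)
```

The two `NA` articles contribute to neither the numerator nor the denominator.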

This package ships a small **simulated** table of that shape, `rt_demo`, so the
rest of the vignette runs without downloading anything:

```{r}
data(rt_demo)
head(rt_demo)
```

## Prevalence of each indicator

`rt_summary()` reports, for each indicator, how many articles were assessed, how
many were positive, the apparent prevalence and its 95% confidence interval:

```{r}
s <- rt_summary(rt_demo)
knitr::kable(
  s[, c("label", "n_articles", "n_detected", "percent", "conf_low", "conf_high")],
  digits = 1,
  col.names = c("Indicator", "Assessed", "Detected", "%", "CI low", "CI high")
)
```
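
To make the columns concrete, the sketch below computes the same kind of
quantities by hand for a single indicator. It is an illustration only: the
helper `prevalence_ci()` and the toy data are hypothetical, and the Wilson score
interval from `stats::prop.test()` is an assumption, since this vignette does
not state which interval method `rt_summary()` uses.

```r
# Apparent prevalence with a Wilson score interval (assumed method, see text).
prevalence_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]  # drop articles where the indicator was not assessed
  n <- length(x)
  k <- sum(x)
  ci <- stats::prop.test(k, n, conf.level = conf_level, correct = FALSE)$conf.int
  data.frame(
    n_articles = n,
    n_detected = k,
    percent    = 100 * k / n,
    conf_low   = 100 * ci[1],
    conf_high  = 100 * ci[2]
  )
}

# Hypothetical toy data: 30 of 100 assessed articles positive, 5 not assessed.
toy <- c(rep(TRUE, 30), rep(FALSE, 70), rep(NA, 5))
prevalence_ci(toy)
```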

### Correcting for detector error

A text-mining detector is not perfect, so the **observed** prevalence is a biased
estimate of the **true** prevalence. `rt_summary()` corrects for this using each
detector's sensitivity and specificity estimates (the Rogan-Gladen estimator).
The correction is on by default and adds `adj_percent`, `adj_low` and
`adj_high`:

```{r}
knitr::kable(
  s[, c("label", "percent", "adj_percent", "adj_low", "adj_high")],
  digits = 1,
  col.names = c("Indicator", "Apparent %", "Corrected %", "CI low", "CI high")
)
```
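
The Rogan-Gladen point estimate itself is short enough to write out. The sketch
below uses a hypothetical helper, `rogan_gladen()`, and made-up accuracy values;
truncating to the 0 to 1 range and returning `NA` when accuracy is missing are
assumptions about sensible behavior, not a description of how `rt_summary()`
handles those cases.

```r
# Rogan-Gladen point estimate:
#   true = (apparent + Sp - 1) / (Se + Sp - 1), truncated to [0, 1].
# All arguments and the result are proportions, not percentages.
rogan_gladen <- function(apparent, sensitivity, specificity) {
  denom <- sensitivity + specificity - 1
  # No usable correction if accuracy is missing or no better than chance.
  if (is.na(denom) || denom <= 0) return(NA_real_)
  adjusted <- (apparent + specificity - 1) / denom
  min(max(adjusted, 0), 1)
}

# Hypothetical values: 30% apparent prevalence, Se = 0.80, Sp = 0.95.
rogan_gladen(0.30, sensitivity = 0.80, specificity = 0.95)
```

With these made-up numbers the corrected prevalence is higher than the apparent
one, because the missed positives (sensitivity below 1) outweigh the false
positives (specificity below 1).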

The accuracy values come from [`rt_accuracy`](../reference/rt_accuracy.html):

```{r}
rt_accuracy
```

AI-use disclosure has no bundled accuracy estimate here, so its corrected value
is `NA`. Novelty's estimate comes from a hand-labeled gold set
(`inst/benchmark/results_novelty_replication.md`); the data/code values are
reproducible benchmark estimates for the native detector, not untouched
external-validation estimates. Replication's correction is approximate: its
sensitivity comes from a replication-enriched sample and its specificity from
the representative 2023 sample, so, unlike conflicts of interest, funding and
registration, it does not rest on a single-design validation. The Rogan-Gladen
interval also does not propagate the uncertainty in these sensitivity and
specificity estimates.
To use your own validation (or the published `oddpub` values for data and
code), pass any table with `variable`, `sensitivity` and `specificity` columns
as the `accuracy` argument:

```{r}
my_acc <- rt_accuracy
my_acc$sensitivity[my_acc$variable == "is_open_data"] <- 0.758
rt_summary(rt_demo, indicators = "is_open_data", accuracy = my_acc)[,
  c("label", "percent", "adj_percent")]
```

## How many practices per article

`rt_score()` adds a per-article count of the openness practices met (conflicts
of interest, funding, registration, data and code). Tabulating it shows how many
articles meet zero, one, two, and so on up to all five practices:

```{r}
scored <- rt_score(rt_demo)
knitr::kable(
  as.data.frame(table(`Practices met` = scored$n_indicators)),
  col.names = c("Practices met", "Articles")
)
```
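
Conceptually, the count is a row sum over the five indicator columns. The sketch
below reproduces that idea in base R on a made-up three-article table. The
column name `n_practices` is hypothetical, and treating `NA` as "not met" via
`na.rm = TRUE` is an assumption rather than documented `rt_score()` behavior.

```r
# Hypothetical three-article table with the five openness indicators.
toy <- data.frame(
  is_coi_pred      = c(TRUE,  TRUE,  FALSE),
  is_fund_pred     = c(TRUE,  FALSE, FALSE),
  is_register_pred = c(FALSE, FALSE, FALSE),
  is_open_data     = c(TRUE,  FALSE, FALSE),
  is_open_code     = c(TRUE,  FALSE, FALSE)
)

practices <- c("is_coi_pred", "is_fund_pred", "is_register_pred",
               "is_open_data", "is_open_code")

# Count the practices met per article (TRUE counts as 1, FALSE as 0).
toy$n_practices <- rowSums(toy[practices], na.rm = TRUE)
table(toy$n_practices)
```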

## Subgroups

Pass `by` to summarize within a grouping column, such as article type:

```{r}
by_type <- rt_summary(rt_demo, by = "type", adjust = FALSE)
knitr::kable(
  by_type[by_type$indicator == "is_open_data",
          c("type", "label", "n_articles", "percent")],
  digits = 1,
  col.names = c("Type", "Indicator", "Assessed", "%")
)
```
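
The grouped summary amounts to computing the same prevalence within each level
of the grouping column. A rough base-R equivalent for one indicator, on a
made-up table whose `type` values are hypothetical, might look like this:

```r
# Hypothetical toy table: article type and one indicator.
toy <- data.frame(
  type         = c("research", "research", "research", "review", "review"),
  is_open_data = c(TRUE, FALSE, TRUE, FALSE, NA)
)

# Percent positive within each type, ignoring unassessed (NA) articles.
pct_by_type <- tapply(
  toy$is_open_data, toy$type,
  function(x) 100 * mean(x, na.rm = TRUE)
)
pct_by_type
```

Small subgroups give noisy percentages, so it is worth reading the `n_articles`
column alongside each estimate.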

## Plots

`rt_plot()` returns a `ggplot`, so it composes with the usual ggplot2 layers.
The default is a prevalence bar chart:

```{r, eval = has_ggplot, fig.width = 7, fig.height = 3.5, fig.alt = "Bar chart of the prevalence of each transparency indicator"}
library(ggplot2)
rt_plot(rt_demo) + ggtitle("Transparency indicators in rt_demo")
```

Use `type = "trend"` with a year column to see prevalence over time:

```{r, eval = has_ggplot, fig.width = 7, fig.height = 4, fig.alt = "Line chart of each transparency indicator's prevalence by year"}
rt_plot(rt_demo, type = "trend", year = "year")
```

The AI-disclosure line begins only in 2023, because the indicator is `NA`
before then. Although `rt_demo` is simulated, its rising data-sharing and AI
lines illustrate the kind of trend these summaries are meant to surface.
Restrict a plot to particular indicators
with `indicators =`, for example to follow AI-use disclosure on its own:

```{r, eval = has_ggplot, fig.width = 7, fig.height = 3.5, fig.alt = "Line chart of AI-use disclosure prevalence by year from 2023"}
rt_plot(rt_demo, type = "trend", year = "year", indicators = "is_ai_pred") +
  ggtitle("Disclosure of generative-AI use, 2023 onward")
```

Set `adjusted = TRUE` in either plot to show the error-corrected prevalence
instead of the apparent prevalence.

## Putting it together

A typical analysis therefore runs a detector over your corpus, stacks the rows,
and then summarizes, scores and plots the result:

```r
results <- purrr::map_dfr(xml_files, rt_all_pmc, remove_ns = TRUE)
rt_summary(results)                       # prevalence + corrected prevalence
rt_score(results)                         # per-article practice count
rt_plot(results, type = "trend", year = "year")
```

For the per-indicator detection methodology, see
`vignette("rtransparency")`.
