First, load the package, set the plot theme and get some data.
# Load the package
library(volker)
# Set the basic plot theme
theme_set(theme_vlkr())
# Load an example dataset ds from the package
ds <- volker::chatgpt
Decide whether your data is categorical or metric and choose the appropriate function:
- report_counts() shows frequency tables and generates simple and stacked bar charts.
- report_metrics() creates tables with distribution parameters and visualises distributions in density plots, box plots or scatter plots.
Report functions, under the hood, call functions that generate plots, tables or calculate effects. If you only need one of those outputs, you can call the functions directly:
- tab_counts(), plot_counts() or effect_counts() for categorical data.
- tab_metrics(), plot_metrics() or effect_metrics() for metric data.
All functions expect a dataset as their first parameter. The second and third parameters take your column selections. The column selections determine whether to analyse single variables, item lists or to compare and correlate multiple variables.
Try out the following examples!
# A single variable
report_counts(ds, use_private)
# A list of variables
report_counts(ds, c(use_private, use_work))
# Variables matched by a pattern
report_counts(ds, starts_with("use_"))
You can use all sorts of tidyverse style selections: a single column, a list of columns or patterns such as starts_with(), ends_with(), contains() or matches().
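To get a feeling for what such selections match, you can try them with dplyr::select() outside of volker. The following is a minimal sketch on a made-up data frame. The toy data and its values are assumptions for illustration; only the column names mirror the example dataset.

```r
library(dplyr)

# Hypothetical toy data, not the chatgpt dataset
toy <- data.frame(
  use_private = c(1, 3, 5),
  use_work    = c(2, 2, 4),
  sd_age      = c(23, 41, 35)
)

# A single column
single_col <- names(select(toy, use_private))

# A list of columns
listed_cols <- names(select(toy, c(use_private, use_work)))

# Columns matched by a pattern
use_cols <- names(select(toy, starts_with("use_")))
age_cols <- names(select(toy, ends_with("_age")))

print(use_cols)
print(age_cols)
```

The same selection expressions can be passed to the column parameters of the report functions.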
# One metric variable
report_metrics(ds, sd_age)
# Multiple metric items
report_metrics(ds, starts_with("cg_adoption_"))
Provide a grouping column in the third parameter to compare different groups.
report_counts(ds, adopter, sd_gender)
For metric variables, you can compare the mean values.
report_metrics(ds, sd_age, sd_gender)
By default, the crossing variable is treated as categorical. You can change this behavior using the metric-parameter to calculate correlations:
report_metrics(ds, sd_age, use_work, metric = TRUE)
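The difference between a categorical and a metric crossing variable can be sketched in base R: group means in the first case, a correlation coefficient in the second. The toy data below is made up, and the sketch does not reproduce the exact statistics or layout of the volker output.

```r
# Hypothetical toy data, not the chatgpt dataset
toy <- data.frame(
  sd_age    = c(20, 30, 40, 50, 60, 70),
  sd_gender = c("female", "male", "female", "male", "female", "male"),
  use_work  = c(5, 4, 4, 2, 2, 1)
)

# Categorical crossing variable: compare the mean age by group
group_means <- tapply(toy$sd_age, toy$sd_gender, mean)
print(group_means)

# Metric crossing variable: Pearson correlation between two metric columns
r <- cor(toy$sd_age, toy$use_work)
print(r)
```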
The ci parameter, where possible, adds confidence intervals to the outputs.
ds |>
filter(sd_gender != "diverse") |>
report_metrics(sd_age, sd_gender, ci = TRUE)
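As a rough illustration of what a confidence interval for a mean is, here is a base R sketch using a t-distribution. The age values are made up, and it is an assumption that this matches how volker computes its intervals.

```r
# Hypothetical age values
sd_age <- c(21, 25, 32, 38, 44, 51, 58)

# 95% confidence interval of the mean via t.test()
ci <- t.test(sd_age, conf.level = 0.95)$conf.int

# The same interval computed by hand: mean +/- t-quantile * standard error
n  <- length(sd_age)
se <- sd(sd_age) / sqrt(n)
ci_manual <- mean(sd_age) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(ci)
print(ci_manual)
```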
Conduct statistical tests with the effect-parameter.
ds |>
filter(sd_gender != "diverse") |>
report_counts(adopter, sd_gender, effect = TRUE)
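For two categorical variables, a typical test is the chi-squared test of independence, often reported together with Cramér's V as an effect size. The following base R sketch uses a made-up cross table. It is meant as an illustration of the idea, not as a description of which tests the effect-parameter runs.

```r
# Hypothetical cross table: adopter type (rows) by gender (columns)
tab <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    adopter   = c("early", "late"),
    sd_gender = c("female", "male")
  )
)

# Chi-squared test of independence (with continuity correction for 2x2 tables)
test <- chisq.test(tab)
print(test$statistic)
print(test$p.value)

# Cramér's V from the uncorrected chi-squared statistic
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
v <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
print(v)
```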
See the function help (F1 key) to learn about more options. For example, you can use the prop parameter to scale bars to 100%. The numbers parameter prints frequencies and percentages onto the bars.
ds |>
filter(sd_gender != "diverse") |>
report_counts(adopter, sd_gender, prop="rows", numbers= "n")
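Row percentages mean that the shares within each row add up to 100%. A minimal base R sketch with made-up values shows the idea. Which variable ends up in the rows of the volker output is not covered here.

```r
# Hypothetical toy data
adopter   <- c("early", "early", "late", "late", "late", "early")
sd_gender <- c("female", "male", "female", "female", "male", "male")

# Absolute frequencies
n <- table(adopter, sd_gender)
print(n)

# Row percentages: each row sums to 1
p_rows <- prop.table(n, margin = 1)
print(round(p_rows, 2))
```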
The theme_vlkr() function lets you customise colors:
theme_set(theme_vlkr(
base_fill = c("#F0983A","#3ABEF0","#95EF39","#E35FF5","#7A9B59"),
base_gradient = c("#FAE2C4","#F0983A")
))
Labels used in plots and tables are stored in the comment attribute of the variable. You can inspect all labels using the codebook() function:
codebook(ds)
#> # A tibble: 97 × 6
#> item_name item_group item_class item_label value_name value_label
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 case case numeric case <NA> <NA>
#> 2 sd_age sd numeric Age <NA> <NA>
#> 3 cg_activities cg character Activities with C… <NA> <NA>
#> 4 cg_act_write cg character cg_act_write <NA> <NA>
#> 5 cg_act_test cg character cg_act_test <NA> <NA>
#> 6 cg_act_search cg character cg_act_search <NA> <NA>
#> 7 adopter adopter factor Innovator type I try new… I try new …
#> 8 adopter adopter factor Innovator type I try new… I try new …
#> 9 adopter adopter factor Innovator type I wait un… I wait unt…
#> 10 adopter adopter factor Innovator type I only us… I only use…
#> # ℹ 87 more rows
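The comment attribute is a base R feature, so you can read and write it without any package. A minimal sketch on a made-up data frame (the label texts are hypothetical):

```r
# Hypothetical toy data
toy <- data.frame(
  sd_age   = c(23, 41, 35),
  use_work = c(2, 2, 4)
)

# Attach labels as comment attributes
comment(toy$sd_age)   <- "Age"
comment(toy$use_work) <- "Usage at work"

# Read the labels of all columns
labels <- sapply(toy, comment)
print(labels)
```

Note that the comment attribute is not shown when you print a column, which is why a codebook is handy.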
Set specific column labels by providing a named list to the items-parameter of labs_apply():
ds %>%
labs_apply(
items = list(
"cg_adoption_advantage_01" = "General advantages",
"cg_adoption_advantage_02" = "Financial advantages",
"cg_adoption_advantage_03" = "Work-related advantages",
"cg_adoption_advantage_04" = "More fun"
)
) %>%
report_metrics(starts_with("cg_adoption_advantage_"))
Labels for values inside a column can be adjusted by providing a named list to the values-parameter of labs_apply(). In addition, select the columns where value labels should be changed:
ds %>%
labs_apply(
cols = starts_with("cg_adoption"),
values = list(
"1" = "Strongly disagree",
"2" = "Disagree",
"3" = "Neutral",
"4" = "Agree",
"5" = "Strongly agree"
)
) %>%
report_metrics(starts_with("cg_adoption"))
To conveniently manage all labels of a dataset, save the result of codebook() to an Excel file, change the labels manually in a copy of the Excel file, and finally call labs_apply() with your revised codebook.
library(readxl)
library(writexl)
# Save codebook to a file
codes <- codebook(ds)
write_xlsx(codes,"codebook.xlsx")
# Load and apply a codebook from a file
codes <- read_xlsx("codebook_revised.xlsx")
ds <- labs_apply(ds, codes)
Be aware that some data operations such as mutate() from the tidyverse lose labels on their way. In this case, store the labels (in the codebook attribute of the data frame) before the operation and restore them afterwards:
ds %>%
labs_store() %>%
mutate(sd_age = 2024 - sd_age) %>%
labs_restore() %>%
report_metrics(sd_age)
Reports combine plots, tables and effect calculations in an RMarkdown document. Optionally, for item batteries, an index, clusters or factors are calculated and reported.
To see an example or develop your own reports, use the volker report template in RStudio.
Have fun developing your own reports!
Without the template, to generate a volker-report from any R-Markdown document, add volker::html_report to the output options of your Markdown document:
---
title: "How to create reports?"
output:
volker::html_report
---
Then, you can generate combined outputs using the report-functions. One advantage of the report-functions is that plots are automatically scaled to fit the page. See the function help for further options (F1 key).
#> ```{r echo=FALSE}
#> ds %>%
#> filter(sd_gender != "diverse") %>%
#> report_counts(adopter, sd_gender)
#> ```
By default, a header and tabsheets are automatically created. You can mix in custom content:
- To use a custom header, set the title parameter to FALSE and add your own title.
- To add a custom tabsheet, set the close parameter to FALSE and add a new header on the fifth level (5 x # followed by the tab name). Close your custom new tabsheet with #### {-} (4 x #).
Try out the following pattern in an RMarkdown document!
#> ### Adoption types
#>
#> ```{r echo=FALSE}
#> ds %>%
#> filter(sd_gender != "diverse") %>%
#> report_counts(adopter, sd_gender, prop="rows", title=FALSE, close=FALSE)
#> ```
#>
#> ##### Method
#> Basis: Only male and female respondents.
#>
#> #### {-}
For quick inspections of an index from a bunch of items, set the index parameter to TRUE. The index is calculated as the average value of all selected columns.
Cronbach’s Alpha and the number of items are calculated with psych::alpha() and stored as a column attribute named “psych.alpha”. The reliability values are printed by report_metrics().
ds |>
report_metrics(starts_with("cg_adoption"), index = TRUE)
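To make the calculation transparent, here is a base R sketch of a mean index and the textbook formula for Cronbach's Alpha. The item values are made up. The package itself relies on psych::alpha(), so treat this as an illustration of the idea rather than the package's implementation.

```r
# Hypothetical item battery: five cases, three items on a 1-5 scale
items <- data.frame(
  item_01 = c(1, 2, 3, 4, 5),
  item_02 = c(2, 2, 4, 4, 5),
  item_03 = c(1, 3, 3, 5, 4)
)

# The index is the average of all items for each case
idx <- rowMeans(items)
print(idx)

# Cronbach's Alpha: k / (k - 1) * (1 - sum of item variances / variance of the sum score)
k         <- ncol(items)
item_vars <- sapply(items, var)
total_var <- var(rowSums(items))
alpha     <- k / (k - 1) * (1 - sum(item_vars) / total_var)
print(alpha)
```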
You can add an index as a new column using add_index(). A new column is created with the average value of all selected columns for each case. Provide a custom name for the column using the newcol parameter. The report_metrics() function still outputs reliability values for the column.
Add a single index
ds %>%
add_index(starts_with("cg_adoption_"), newcol = "idx_cg_adoption") %>%
report_metrics(idx_cg_adoption)
Compare the index values by group
ds %>%
add_index(starts_with("cg_adoption_"), newcol = "idx_cg_adoption") %>%
report_metrics(idx_cg_adoption, adopter)
Add multiple indices and summarize them
ds %>%
add_index(starts_with("cg_adoption_")) %>%
add_index(starts_with("cg_adoption_advantage")) %>%
add_index(starts_with("cg_adoption_fearofuse")) %>%
add_index(starts_with("cg_adoption_social")) %>%
tab_metrics(starts_with("idx_cg_adoption"))
To reverse items, provide a selection of columns to the cols.reverse-parameter of add_index().
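Reversing an item means mirroring its values on the scale, so that high values become low and vice versa. A common way to do this by hand is sketched below. The scale limits of 1 and 5 are an assumption for the example, and how add_index() determines them is not covered here.

```r
# Hypothetical item on a 1-5 scale
item <- c(1, 2, 3, 4, 5, 5)

# Assumed scale limits
min_scale <- 1
max_scale <- 5

# Mirror the values: 1 becomes 5, 2 becomes 4, and so on
reversed <- (min_scale + max_scale) - item
print(reversed)
```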
The easiest way to conduct factor or cluster analyses is to use the respective parameters in the report_metrics() function.
ds |>
report_metrics(starts_with("cg_adoption"), factors = TRUE, clusters = TRUE)
Currently, cluster analysis is performed using k-means, and factor analysis is implemented as a principal component analysis. Setting the parameters to TRUE automatically generates scree plots and selects the number of factors or clusters. Alternatively, you can explicitly specify the numbers.
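If you are curious what k-means and a principal component analysis look like without the package, here is a base R sketch on the built-in iris data. It does not reproduce volker's internals. In particular, the Kaiser criterion (eigenvalues above 1) used below is only one possible rule and an assumption for this example.

```r
# Standardise four metric columns of a built-in dataset
x <- scale(iris[, 1:4])

# k-means with three clusters; the seed makes the random start reproducible
set.seed(42)
km <- kmeans(x, centers = 3, nstart = 10)
print(table(km$cluster))

# Principal component analysis
pca <- prcomp(x)

# Eigenvalues are the basis of a scree plot, e.g. plot(eigenvalues, type = "b")
eigenvalues <- pca$sdev^2
print(round(eigenvalues, 2))

# One possible rule for the number of components: eigenvalues above 1
n_components <- sum(eigenvalues > 1)
print(n_components)
```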
Add factor or cluster analysis results to the original data
If you want to work with the results, use add_factors() and add_clusters() respectively. For factor analysis, new columns prefixed with “fct_” are created to store the factor loadings based on the specified number of factors. For clustering, an additional column prefixed with “cls_” is added that assigns each observation to a cluster number.
ds |>
add_factors(starts_with("cg_adoption"), k = 3) |>
select(starts_with("fct_"))
Once you have added factor or cluster columns to your data set, you can use them with the report functions:
ds |>
add_factors(starts_with("cg_adoption"), k = 3) |>
report_metrics(fct_cg_adoption_1, fct_cg_adoption_2, metric = TRUE)
ds |>
add_clusters(starts_with("cg_adoption"), k = 3) |>
report_counts(sd_gender, cls_cg_adoption, prop = "cols")
After explicitly adding factor or cluster columns, you can inspect the analysis results using factor_tab(), factor_plot() or cluster_tab(), cluster_plot().
ds |>
add_factors(starts_with("cg_adoption"), k = 3) |>
factor_tab(starts_with("fct_"))
Automatically determine the number of factors or clusters
To automatically determine the optimal number of factors or clusters based on diagnostics, set k = NULL.
ds |>
add_factors(starts_with("cg_adoption"), k = NULL) |>
factor_tab(starts_with("fct_cg_adoption"))
Modeling in the statistical sense is predicting an outcome (dependent variable) from one or multiple predictors (independent variables).
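The report_metrics() function calculates a linear model if the model parameter is TRUE. You provide the variables in the following parameters:
- The outcome (dependent variable) goes into the first column selection.
- Categorical predictors go into the cross parameter.
- Metric predictors go into the metric parameter.
- Interaction effects go into the interactions parameter (for example, interactions = c(sd_age * sd_gender)).
ds |>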
filter(sd_gender != "diverse") |>
report_metrics(
use_work,
cross = c(sd_gender, adopter),
metric = sd_age,
model = TRUE,
diagnostics = TRUE
)
Four selected diagnostic plots are generated if the diagnostics-parameter is TRUE.
To work with the predicted values, use add_model() instead of the report function. This will add a new variable prefixed with prd_ holding the target scores.
ds <- ds |>
add_model(
use_work,
categorical = c(sd_gender, adopter),
metric = sd_age
)
report_metrics(ds, use_work, prd_use_work, metric = TRUE)
There are two functions to get the regression table or plot from the new column:
model_tab(ds, prd_use_work)
model_plot(ds, prd_use_work)
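For comparison, the same idea in base R: fit a linear model with one categorical and one metric predictor and store the predicted values in a new column. The sketch uses the built-in mtcars data, and the column name prd_mpg only imitates the prd_ prefix. It is not produced by volker.

```r
# Built-in example data with a categorical predictor
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# Outcome: mpg; categorical predictor: am; metric predictor: wt
# Use am * wt instead of am + wt to add an interaction effect
fit <- lm(mpg ~ am + wt, data = cars)
print(summary(fit)$coefficients)

# Store the predicted values in a new column
cars$prd_mpg <- predict(fit)

# The squared correlation of observed and predicted values equals R squared
r_squared <- cor(cars$mpg, cars$prd_mpg)^2
print(r_squared)
```

Calling plot(fit) in an interactive session shows the standard diagnostic plots of base R.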
By default, p values are adjusted for the number of tests by controlling the false discovery rate (FDR). Set the adjust-parameter to FALSE to disable p-value correction.
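What an FDR adjustment does to a set of p values can be tried out with p.adjust() from base R. The p values below are made up, and whether volker calls p.adjust() internally is an assumption.

```r
# Hypothetical p values from five tests
p <- c(0.001, 0.01, 0.03, 0.04, 0.2)

# Benjamini-Hochberg adjustment, which controls the false discovery rate
p_fdr <- p.adjust(p, method = "fdr")
print(p_fdr)
```

Adjusted values are never smaller than the raw values, so fewer results pass a given significance threshold.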
In content analysis, reliability is usually checked by having different persons code the same cases and then calculating the agreement. To calculate reliability scores, prepare one data frame for each person, including a column for the coder and a column for the case ID.
Next, row-bind the data frames. The columns for coder and ID make sure that each coding is uniquely identified and can be related to the cases and coders.
data_coded <- bind_rows(
data_coder1,
data_coder2
)
The final data, for example, looks like:
| case | coder | topic_sports | topic_weather |
|---|---|---|---|
| 1 | anne | TRUE | FALSE |
| 2 | anne | TRUE | FALSE |
| 3 | anne | FALSE | TRUE |
| 1 | ben | TRUE | TRUE |
| 2 | ben | TRUE | FALSE |
| 3 | ben | FALSE | TRUE |
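Calculating reliability is straightforward with report_counts():
- Provide the data to the first parameter.
- Provide the columns holding the codings (for example, using starts_with()) to the second parameter.
- Provide the coder column to the third parameter.
- Set the ids parameter to the column identifying the cases.
- Set the agree parameter to "reliability".
Example: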
report_counts(data_coded, starts_with("topic_"), coder, ids = case, prop="cols", agree = "reliability")
Alternatively, if you are only interested in the scores, not a plot, you get them using agree_tab(). Hint: You may abbreviate the method value, for example to "reli".
agree_tab(data_coded, starts_with("topic_"), coder, ids = case, method="reli")
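The simplest reliability score is the share of cases on which both coders agree. The following base R sketch computes it for the example table above. It only covers percent agreement. Chance-corrected coefficients are not part of this sketch.

```r
# The example data from the table above
data_coded <- data.frame(
  case          = c(1, 2, 3, 1, 2, 3),
  coder         = c("anne", "anne", "anne", "ben", "ben", "ben"),
  topic_sports  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
  topic_weather = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

# Put the codings of both coders side by side, matched by case
anne   <- data_coded[data_coded$coder == "anne", ]
ben    <- data_coded[data_coded$coder == "ben", ]
merged <- merge(anne, ben, by = "case", suffixes = c("_anne", "_ben"))

# Share of cases with identical codings, for each topic
topics <- c("topic_sports", "topic_weather")
agreement <- sapply(topics, function(topic) {
  mean(merged[[paste0(topic, "_anne")]] == merged[[paste0(topic, "_ben")]])
})
print(agreement)
```

In this example, the coders agree on all three cases for sports, but only on two of three cases for weather.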
Further, you can request classification performance indicators (accuracy, precision, recall, F1) with the same function by setting the method to “classification” (may be abbreviated). Use this option if you compare manual codings to automated codings (classifiers, large language models). By default, you get macro statistics (average precision, recall and F1 over categories).
Given that you have multiple values in one column, you may focus on one category to get micro statistics:
agree_tab(data_coded, starts_with("topic_"), coder, ids = case, method = "class", category = "catcontent")
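The performance indicators follow directly from comparing two codings of the same cases. Here is a base R sketch for a single binary category with made-up codings, where the manual coding is treated as the ground truth (an assumption of this example):

```r
# Hypothetical codings of six cases
manual    <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
automated <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

# Confusion matrix cells
tp <- sum(automated & manual)    # true positives
fp <- sum(automated & !manual)   # false positives
fn <- sum(!automated & manual)   # false negatives
tn <- sum(!automated & !manual)  # true negatives

accuracy  <- (tp + tn) / length(manual)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

print(c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1))
```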
Cases with missing values, by default, are omitted in all methods. Thus, the calculations are only based on cases with complete values in the selected columns.
Furthermore, each function first cleans the values:
- Values listed in the VLKR_NA_LEVELS constant are recoded to missing values (“[NA] nicht beantwortet”, “[NA] keine Angabe”, “[no answer]” and “keine Angabe”).
- Values listed in the VLKR_NA_NUMBERS constant are recoded to missing values (-9, -2, and -1).
print(volker:::VLKR_NA_LEVELS)
#> [1] "[NA] nicht beantwortet" "[NA] keine Angabe" "[no answer]"
#> [4] "keine Angabe"
print(volker:::VLKR_NA_NUMBERS)
#> [1] -9 -2 -1
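The effect of this cleaning step can be sketched in base R. The residual values are copied from the output above, the toy data is made up, and the sketch is an illustration rather than the package's own cleaning code.

```r
# Residual values as printed above
na_levels  <- c("[NA] nicht beantwortet", "[NA] keine Angabe", "[no answer]", "keine Angabe")
na_numbers <- c(-9, -2, -1)

# Hypothetical toy data
toy <- data.frame(
  sd_age    = c(23, -9, 35, 41),
  sd_gender = c("female", "male", "[no answer]", "male"),
  stringsAsFactors = FALSE
)

# Recode residual values to missing values
toy$sd_age[toy$sd_age %in% na_numbers]      <- NA
toy$sd_gender[toy$sd_gender %in% na_levels] <- NA

# Keep only cases with complete values in all columns
complete  <- toy[complete.cases(toy), ]
n_removed <- nrow(toy) - nrow(complete)

print(complete)
print(n_removed)
```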
The output always contains information about how many cases were removed due to missing values. You have three options to treat missings:
- To skip the cleaning step, set the clean-parameter of the functions to FALSE.
- To change which values are recoded, call options(vlkr.na.levels=c("Not answered")) or options(vlkr.na.numbers=c(-2,-9)). If you set the value to FALSE, no values are recoded.
- To keep cases with missing values, call options(vlkr.na.omit=FALSE) (maximal information from all items).
The volker-package is based on standard methods for data handling and visualisation. You could produce all outputs on your own. The package just keeps your code DRY (don’t repeat yourself) and wraps often used snippets into a simple interface.
Report functions call subsidiary tab, plot and effect functions, which in turn call functions specifically designed for the provided column selection. Open the package help to see which specific functions the report functions are redirected to.
Console and markdown output is enhanced by specific print- and knit-functions. To make this work, the cleaned data, produced plots, tables and markdown snippets gain new classes (vlkr_df, vlkr_plt, vlkr_tbl, vlkr_list, vlkr_rprt).
The volker-package makes use of common tidyverse functions. Basically, most outputs are generated by three functions:
- count() is used to produce counts.
- skim() is used to produce metrics.
- ggplot() is used to assemble plots.
Statistical tests, clustering and factor analysis are largely based on the stats, psych, car and effectsize packages.
Thanks to all the maintainers, authors and contributors of the packages that make the world of data a magical place.