The {simulist} R package can generate line list data
(sim_linelist()), contact tracing data
(sim_contacts()), or both (sim_outbreak()). By
default the line list produced by sim_linelist() and
sim_outbreak() contains 13 columns. Some
post-simulation data wrangling may be needed to use the simulated
epidemiological case data for certain applications. This vignette
demonstrates some common data wrangling tasks that may be performed on
simulated line list or contact tracing data.
This vignette provides data wrangling examples using both the functions available in the R language itself (commonly called “base R”) and the tidyverse R packages, which are commonly applied to data science tasks in R. The tidyverse examples are shown by default; select the “Base R” tab to see the equivalent functionality using base R. Many other tools for wrangling data in R (e.g. {data.table}) are not covered by this vignette.
See these great resources for more information on general data wrangling in R:
To simulate an outbreak we will use the sim_outbreak()
function from the {simulist} R package.
If you are unfamiliar with the {simulist} package or the
sim_outbreak() function, the Get Started
vignette is a great place to start.
First we load in some data that is required for the outbreak simulation. Data on epidemiological parameters and distributions are read from the {epiparameter} R package.
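# epiparameter(), create_prob_distribution() and epiparameter_db()
# are provided by the {epiparameter} package
library(epiparameter)

# create contact distribution (not available from {epiparameter} database)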
contact_distribution <- epiparameter(
  disease = "COVID-19",
  epi_name = "contact distribution",
  prob_distribution = create_prob_distribution(
    prob_distribution = "pois",
    prob_distribution_params = c(mean = 2)
  )
)
#> Citation cannot be created as author, year, journal or title is missing
# create infectious period (not available from {epiparameter} database)
infectious_period <- epiparameter(
  disease = "COVID-19",
  epi_name = "infectious period",
  prob_distribution = create_prob_distribution(
    prob_distribution = "gamma",
    prob_distribution_params = c(shape = 1, scale = 1)
  )
)
#> Citation cannot be created as author, year, journal or title is missing
# get onset to hospital admission from {epiparameter} database
onset_to_hosp <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to hospitalisation",
  single_epiparameter = TRUE
)
#> Using Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.. 
#> To retrieve the citation use the 'get_citation' function
# get onset to death from {epiparameter} database
onset_to_death <- epiparameter_db(
  disease = "COVID-19",
  epi_name = "onset to death",
  single_epiparameter = TRUE
)
#> Using Linton N, Kobayashi T, Yang Y, Hayashi K, Akhmetzhanov A, Jung S, Yuan
#> B, Kinoshita R, Nishiura H (2020). "Incubation Period and Other
#> Epidemiological Characteristics of 2019 Novel Coronavirus Infections
#> with Right Truncation: A Statistical Analysis of Publicly Available
#> Case Data." _Journal of Clinical Medicine_. doi:10.3390/jcm9020538
#> <https://doi.org/10.3390/jcm9020538>.. 
#> To retrieve the citation use the 'get_citation' function
The seed is set to ensure the output of the vignette is consistent. When using {simulist}, setting the seed is not required unless you need to simulate the same line list multiple times.
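The effect of setting a seed can be illustrated with base R alone. The sketch below is not {simulist} code: the seed value `1` and the Poisson draws are arbitrary choices made for illustration, and the object names are hypothetical.

```r
# draw five Poisson counts twice, resetting the seed before each draw
set.seed(1)
first_draw <- rpois(n = 5, lambda = 2)

set.seed(1)
second_draw <- rpois(n = 5, lambda = 2)

# the same seed reproduces exactly the same random numbers
identical(first_draw, second_draw)
#> [1] TRUE

# without resetting the seed, the next draw continues the random number
# stream, so it is not tied to the values drawn above
third_draw <- rpois(n = 5, lambda = 2)
```

The same principle applies to any function that draws random numbers: calling `set.seed()` with the same value immediately before the call should give the same result each time.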
Not every column in the simulated line list may be required for the
use case at hand. In this example we will remove the
$ct_value column. This would be appropriate, for instance, when simulating an
outbreak for which no laboratory testing (e.g. polymerase chain reaction,
PCR, testing) was available, so a cycle threshold (Ct) value would
not be known for confirmed cases.
# select() and the %>% pipe are provided by {dplyr}
library(dplyr)

# remove column by name
linelist %>%
  select(!ct_value)
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
# remove column by numeric column indexing
# ct_value is column 13 (the last column)
linelist[, -13]
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
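Hard-coding a column position is fragile: if the number or order of columns changes, a numeric index silently removes a different column. A minimal base R sketch of one way to guard against this is to look up the position by name first. The data frame `toy_linelist` below is a small made-up stand-in, not output from {simulist}, and the object names are hypothetical.

```r
# toy data frame standing in for a simulated line list (made-up values)
toy_linelist <- data.frame(
  id = 1:3,
  case_type = c("confirmed", "probable", "suspected"),
  date_last_contact = as.Date(c("2023-01-06", "2023-01-05", "2023-01-08")),
  ct_value = c(21.9, NA, NA)
)

# find the position of the column by name rather than hard-coding it
ct_index <- match("ct_value", colnames(toy_linelist))

# stop early if the column is not present (match() returns NA in that case)
stopifnot(!is.na(ct_index))

# remove the column using the looked-up position
toy_no_ct <- toy_linelist[, -ct_index]
colnames(toy_no_ct)
```

This keeps the numeric indexing style while tying the index to the column name, so the code still removes the intended column if columns are added or reordered.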
# remove column by column name
linelist[, colnames(linelist) != "ct_value"]
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
# remove column by assigning it to NULL
linelist$ct_value <- NULL
linelist
#>    id             case_name case_type sex age date_onset date_reporting
#> 1   1         Fabian Mrazik confirmed   m  90 2023-01-01     2023-01-01
#> 2   3       Ashley Martinez confirmed   f  71 2023-01-01     2023-01-01
#> 3   4                Tia Vu  probable   f  48 2023-01-01     2023-01-01
#> 4   5 Abdul Majeed el-Saleh confirmed   m  77 2023-01-01     2023-01-01
#> 5   6        Courtney Flood suspected   f  83 2023-01-02     2023-01-02
#> 6   7          Joseph Jiron suspected   m  56 2023-01-01     2023-01-01
#> 7   8          Kevin Liddle suspected   m  39 2023-01-01     2023-01-01
#> 8   9       Rutaiba el-Raad confirmed   f  68 2023-01-01     2023-01-01
#> 9  10       Jaime Middleton suspected   m   1 2023-01-01     2023-01-01
#> 10 14           Emily Fyffe confirmed   f  16 2023-01-01     2023-01-01
#> 11 16          Miguel Oyebi confirmed   m  54 2023-01-02     2023-01-02
#> 12 21         Katlyn Nelson  probable   f  36 2023-01-02     2023-01-02
#> 13 24       Nicholas Rentie suspected   m  49 2023-01-02     2023-01-02
#>    date_admission   outcome date_outcome date_first_contact date_last_contact
#> 1            <NA> recovered         <NA>               <NA>              <NA>
#> 2      2023-01-08      died   2023-01-10         2022-12-26        2023-01-06
#> 3            <NA> recovered         <NA>         2022-12-30        2023-01-05
#> 4            <NA> recovered         <NA>         2022-12-31        2023-01-08
#> 5            <NA> recovered         <NA>         2022-12-26        2023-01-04
#> 6            <NA> recovered         <NA>         2022-12-28        2023-01-03
#> 7            <NA> recovered         <NA>         2022-12-31        2023-01-03
#> 8            <NA> recovered         <NA>         2022-12-29        2023-01-01
#> 9            <NA> recovered         <NA>         2022-12-26        2023-01-02
#> 10     2023-01-02 recovered         <NA>         2022-12-30        2023-01-02
#> 11     2023-01-05 recovered         <NA>         2022-12-30        2023-01-05
#> 12           <NA> recovered         <NA>         2023-01-01        2023-01-03
#> 13           <NA> recovered         <NA>         2022-12-28        2023-01-04
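The base R approaches above differ in one respect that is easy to miss: bracket subsetting returns a new data frame and leaves the original untouched, whereas assigning `NULL` changes the object itself. The sketch below shows the difference on a small made-up data frame (`toy_linelist` and the other object names are hypothetical, not {simulist} output).

```r
# toy data frame standing in for a simulated line list (made-up values)
toy_linelist <- data.frame(
  id = 1:3,
  outcome = c("recovered", "died", "recovered"),
  ct_value = c(21.9, 22.7, NA)
)

# bracket subsetting returns a new data frame; the original is unchanged
toy_subset <- toy_linelist[, colnames(toy_linelist) != "ct_value"]
"ct_value" %in% colnames(toy_linelist)
#> [1] TRUE

# keep a copy if the column may be needed again later
toy_backup <- toy_linelist

# assigning NULL drops the column from the object itself
toy_linelist$ct_value <- NULL
"ct_value" %in% colnames(toy_linelist)
#> [1] FALSE
```

If only a single column would remain after bracket subsetting, adding `drop = FALSE` (e.g. `toy_linelist[, keep, drop = FALSE]`) keeps the result as a data frame rather than a vector.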