Download and process public domain works from the Project Gutenberg collection. The package includes:
- gutenberg_download(), which downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
- gutenberg_metadata, which contains information about each work, pairing Gutenberg ID with title, author, language, etc.
- gutenberg_authors, which contains information about each author, such as aliases and birth/death year.
- gutenberg_subjects, which contains pairings of works with Library of Congress subjects and topics.
Install the released version of gutenbergr from CRAN:
install.packages("gutenbergr")
Install the development version of gutenbergr from GitHub:
# install.packages("pak")
pak::pak("ropensci/gutenbergr")
The gutenberg_works() function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered.)
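As a rough illustration of what those defaults mean, the sketch below compares the unfiltered dataset with a manually filtered one. It relies only on the language and has_text columns that appear in the package's metadata output. The manual filter is an approximation, not the exact logic inside gutenberg_works(), which may apply further conditions such as de-duplication.

```r
library(dplyr)
library(gutenbergr)

# Every work in the bundled metadata, unfiltered
n_all <- nrow(gutenberg_metadata)

# Approximation of the defaults described above:
# English-language works that have text associated with them
english_with_text <- gutenberg_metadata |>
  filter(language == "en", has_text)

# The table gutenberg_works() returns with its default arguments
default_works <- gutenberg_works()

# Compare the three row counts side by side
c(
  all = n_all,
  manual_filter = nrow(english_with_text),
  default = nrow(default_works)
)
```

If the two filtered counts differ, that difference reflects whatever extra filtering gutenberg_works() performs beyond language and text availability.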
Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We could find the book’s ID by filtering:
library(dplyr)
library(gutenbergr)
gutenberg_works() |>
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <fct>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <fct> <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE
# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#> gutenberg_id title author gutenberg_author_id language
#> <int> <chr> <chr> <int> <fct>
#> 1 768 Wuthering Heights Brontë, Emily 405 en
#> gutenberg_bookshelf rights has_text
#> <chr> <fct> <lgl>
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE
Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:
wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#> gutenberg_id text
#> <int> <chr>
#> 1 768 "Wuthering Heights"
#> 2 768 ""
#> 3 768 "by Emily Brontë"
#> 4 768 ""
#> 5 768 ""
#> 6 768 ""
#> 7 768 ""
#> 8 768 "CHAPTER I"
#> 9 768 ""
#> 10 768 ""
#> # ℹ 12,332 more rows
gutenberg_download() can download multiple books when given multiple IDs. It also takes a meta_fields argument that adds variables from the metadata.
# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#> gutenberg_id text title
#> <int> <chr> <chr>
#> 1 768 "Wuthering Heights" Wuthering Heights
#> 2 768 "" Wuthering Heights
#> 3 768 "by Emily Brontë" Wuthering Heights
#> 4 768 "" Wuthering Heights
#> 5 768 "" Wuthering Heights
#> 6 768 "" Wuthering Heights
#> 7 768 "" Wuthering Heights
#> 8 768 "CHAPTER I" Wuthering Heights
#> 9 768 "" Wuthering Heights
#> 10 768 "" Wuthering Heights
#> # ℹ 33,333 more rows
books |>
  count(title)
#> # A tibble: 2 × 2
#> title n
#> <chr> <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights 12342
gutenberg_download() can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle’s works, each annotated with both gutenberg_id and title, using:
aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")
aristotle_books
#> # A tibble: 43,801 × 3
#> gutenberg_id text
#> <int> <chr>
#> 1 1974 "THE POETICS OF ARISTOTLE"
#> 2 1974 ""
#> 3 1974 "By Aristotle"
#> 4 1974 ""
#> 5 1974 "A Translation By S. H. Butcher"
#> 6 1974 ""
#> 7 1974 ""
#> 8 1974 "[Transcriber's Annotations and Conventions: the translator left"
#> 9 1974 "intact some Greek words to illustrate a specific point of the original"
#> 10 1974 "discourse. In this transcription, in order to retain the accuracy of"
#> title
#> <chr>
#> 1 The Poetics of Aristotle
#> 2 The Poetics of Aristotle
#> 3 The Poetics of Aristotle
#> 4 The Poetics of Aristotle
#> 5 The Poetics of Aristotle
#> 6 The Poetics of Aristotle
#> 7 The Poetics of Aristotle
#> 8 The Poetics of Aristotle
#> 9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 43,791 more rows
- The wikipedia column in gutenberg_authors can be matched to Wikipedia content with the WikipediR package, or to pageview statistics with the wikipediatrend package.
- [...] format_reverse function for reversing “Last, First” names).
See the data-raw directory for the scripts that generate these datasets. The current datasets were generated from the Project Gutenberg catalog on 27 May 2025.
Yes! The package respects these rules and complies to the best of our ability. Namely:
- [...] https://www.gutenberg.lib.md.us/8/84/84.zip.
- The .zip file is retrieved to minimize bandwidth on the mirror. .txt files are only retrieved if there is no .zip.
Still, this package is not the right way to download the entire Project Gutenberg corpus (or all works in a particular language). For that, follow Project Gutenberg's recommendation to set up a mirror. This package is recommended for downloading a single work, or works by a particular author or on a particular topic. See their Terms of Service for details.
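For instance, a topic-sized download might look like the sketch below. It assumes that gutenberg_subjects has gutenberg_id and subject columns. The subject pattern and the cap of three works are arbitrary choices made for illustration, not recommendations from the package.

```r
library(dplyr)
library(gutenbergr)

# IDs of works whose subject mentions Sherlock Holmes
# (assumes a `subject` column; the pattern is only an illustration)
holmes_ids <- gutenberg_subjects |>
  filter(grepl("Holmes, Sherlock", subject, fixed = TRUE)) |>
  distinct(gutenberg_id)

# Keep only works that pass the default gutenberg_works() filters,
# and cap the request at a handful of books to limit load on the mirror
to_download <- gutenberg_works() |>
  semi_join(holmes_ids, by = "gutenberg_id") |>
  head(3)

# Download the selected works, tagging each line with its title
# (this step needs network access)
holmes_books <- gutenberg_download(to_download, meta_fields = "title")
```

Building the small to_download table first makes it easy to inspect how many works a request covers before any text is fetched.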
Please note that the gutenbergr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.