textpress

textpress is an R toolkit for building text corpora and searching them – no custom object classes, just plain data frames from start to finish. It covers the full arc from URL to retrieved passage through a consistent four-step API: Fetch, Read, Process, Search. Traditional tools (KWIC, BM25, dictionary matching) sit alongside modern ones (semantic search, LLM-ready chunking), all composing cleanly with |>.

Installation

From CRAN:

install.packages("textpress")

Development version:

remotes::install_github("jaytimm/textpress")

The `textpress` API

Conventions: `corpus` is a data frame with a `text` column plus one or more identifier columns passed to `by` (default `doc_id`). All outputs are plain data frames or data.tables and are pipe-friendly.
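As a minimal illustration of that convention, a corpus can be built as an ordinary data frame. The row contents below are made up, and storing `doc_id` as character is an assumption; only the `doc_id` and `text` column names come from the convention above.

```r
# A toy corpus: one row per document, identified by doc_id.
corpus <- data.frame(
  doc_id = c("1", "2"),
  text = c(
    "Text corpora can be stored as plain data frames.",
    "Each row holds one document and its identifier."
  ),
  stringsAsFactors = FALSE
)

str(corpus)
```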

1. Fetch (`fetch_*`)

Find URLs and metadata – not full text. Pass results to read_urls() to get content.

fetch_urls(query, n_pages, date_filter) – Search engine query; returns candidate URLs with metadata.
fetch_wiki_urls(query, limit) – Wikipedia article URLs matching a search phrase.
fetch_wiki_refs(url, n) – External citation URLs from a Wikipedia article’s References section.

2. Read (`read_*`)

Scrape and parse URLs into a structured corpus.

read_urls(urls, ...) – Character vector of URLs → `list(text, meta)`. `text` has one row per node (headings, paragraphs, lists); `meta` has one row per URL. For Wikipedia, `exclude_wiki_refs = TRUE` drops the References / See also / Bibliography sections.

3. Process (`nlp_*`)

Prepare text for search or indexing.

nlp_split_paragraphs() – Break documents into structural blocks.
nlp_split_sentences() – Segment blocks into individual sentences.
nlp_tokenize_text() – Normalize text into a clean token stream.
nlp_index_tokens() – Build a weighted BM25 index for ranked retrieval.
nlp_roll_chunks() – Roll sentences into fixed-size chunks with surrounding context (RAG-style).
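To make the BM25 step concrete, here is a rough base-R sketch of how BM25 scores a query against tokenized documents. It is not taken from the package source: `bm25_scores()` is a hypothetical helper, the IDF uses the common "+1" smoothing, and `k1 = 1.2`, `b = 0.75` are conventional defaults rather than documented textpress settings.

```r
# Toy token lists, one element per document (already lowercased and split).
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "chased", "the", "cat"),
  d3 = c("stock", "markets", "fell", "sharply")
)

# Hypothetical helper: score every document against a vector of query terms.
bm25_scores <- function(docs, query, k1 = 1.2, b = 0.75) {
  n_docs  <- length(docs)
  doc_len <- lengths(docs)
  avg_len <- mean(doc_len)
  scores  <- setNames(numeric(n_docs), names(docs))
  for (term in unique(query)) {
    # Term frequency per document and document frequency across the corpus.
    tf <- vapply(docs, function(d) sum(d == term), numeric(1))
    df <- sum(tf > 0)
    if (df == 0) next
    idf <- log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating tf term, normalized by relative document length.
    scores <- scores + idf * tf * (k1 + 1) /
      (tf + k1 * (1 - b + b * doc_len / avg_len))
  }
  sort(scores, decreasing = TRUE)
}

ranked <- bm25_scores(docs, c("cat", "mat"))
print(ranked)
```

Documents containing rarer query terms score higher, and long documents are penalized through the length normalization controlled by `b`.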

4. Search (`search_*`)

Four retrieval modes over the same corpus. Data-first, pipe-friendly.

| Function | Query type | Use case |
| --- | --- | --- |
| `search_regex(corpus, query)` | Regex pattern | Specific strings, KWIC with inline highlighting. |
| `search_dict(corpus, terms)` | Term vector | Exact phrases and MWEs; built-in `dict_generations`, `dict_political`. |
| `search_index(index, query)` | Keywords | BM25 ranked retrieval over a token index. |
| `search_vector(embeddings, query)` | Numeric vector | Semantic nearest-neighbor search; use `util_fetch_embeddings()` to embed. |
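For the semantic mode, nearest-neighbor search usually comes down to ranking stored vectors by cosine similarity to a query vector. The sketch below shows that idea in base R with made-up three-dimensional embeddings. `cosine_top_n()` is a hypothetical helper, and the use of cosine similarity is an assumption; the metric `search_vector()` uses is not stated here.

```r
# Toy embedding matrix: one row per chunk, made-up 3-dimensional vectors.
embeddings <- rbind(
  chunk_1 = c(0.9, 0.1, 0.0),
  chunk_2 = c(0.1, 0.8, 0.1),
  chunk_3 = c(0.7, 0.3, 0.0)
)
query_vec <- c(1, 0, 0)

# Hypothetical helper: return the n rows most similar to the query vector.
cosine_top_n <- function(embeddings, query_vec, n = 2) {
  # Cosine similarity = dot product divided by the product of vector norms.
  sims <- as.vector(embeddings %*% query_vec) /
    (sqrt(rowSums(embeddings^2)) * sqrt(sum(query_vec^2)))
  names(sims) <- rownames(embeddings)
  ord <- order(sims, decreasing = TRUE)[seq_len(min(n, length(sims)))]
  data.frame(id = names(sims)[ord], similarity = unname(sims[ord]),
             row.names = NULL, stringsAsFactors = FALSE)
}

hits <- cosine_top_n(embeddings, query_vec)
print(hits)
```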

RAG & LLM pipelines

textpress is designed to compose cleanly into retrieval-augmented generation pipelines.

Hybrid retrieval – run search_index() and search_vector() over the same chunks, then merge the two ranked lists with reciprocal rank fusion (RRF). Chunks that rank well on both lexical match (BM25) and semantic similarity rise to the top.
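The fusion step can be sketched in a few lines of base R. RRF gives each chunk a score of 1 / (k + rank) from every list it appears in and sums those scores. `rrf_merge()` and the chunk ids are hypothetical, and `k = 60` is the constant commonly used for RRF, not a documented textpress default.

```r
# Ranked chunk ids from two hypothetical retrievers (best first).
bm25_ranked   <- c("c3", "c1", "c7", "c2")
vector_ranked <- c("c1", "c5", "c3", "c9")

# Hypothetical helper: fuse any number of ranked id vectors with RRF.
rrf_merge <- function(rankings, k = 60) {
  ids <- unique(unlist(rankings))
  scores <- setNames(numeric(length(ids)), ids)
  for (r in rankings) {
    # Each list contributes 1 / (k + rank) for the ids it contains.
    scores[r] <- scores[r] + 1 / (k + seq_along(r))
  }
  out <- data.frame(id = names(scores), rrf = unname(scores),
                    stringsAsFactors = FALSE)
  out <- out[order(out$rrf, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

fused <- rrf_merge(list(bm25_ranked, vector_ranked))
print(fused)
```

Because RRF works on ranks rather than raw scores, BM25 scores and similarity values never need to be put on a common scale.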

Context assembly – nlp_roll_chunks() with context_size > 0 gives each chunk a focal sentence plus surrounding context, so retrieved passages are self-contained when passed to an LLM.
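The idea of a focal span plus surrounding context can be illustrated without the package. The base-R sketch below is only an approximation of the behavior described above: `roll_chunks()`, its `chunk_size` argument, and the output column names are assumptions made for illustration. Only `context_size` is named in the text.

```r
sentences <- c("Alpha one.", "Beta two.", "Gamma three.",
               "Delta four.", "Epsilon five.")

# Hypothetical helper: group sentences into fixed-size chunks and attach
# context_size neighboring sentences on each side.
roll_chunks <- function(sentences, chunk_size = 2, context_size = 1) {
  stopifnot(length(sentences) > 0)
  n <- length(sentences)
  starts <- seq(1, n, by = chunk_size)
  do.call(rbind, lapply(seq_along(starts), function(i) {
    focal  <- starts[i]:min(starts[i] + chunk_size - 1, n)
    # Widen the focal span, clamped to the document boundaries.
    window <- max(1, min(focal) - context_size):min(n, max(focal) + context_size)
    data.frame(
      chunk_id = i,
      chunk = paste(sentences[focal], collapse = " "),
      chunk_plus_context = paste(sentences[window], collapse = " "),
      stringsAsFactors = FALSE
    )
  }))
}

chunks <- roll_chunks(sentences)
print(chunks)
```

Keeping the focal text and the widened window in separate columns lets retrieval score the narrow chunk while the LLM receives the fuller passage.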

Agent tool-calling – the consistent API and plain data-frame outputs map naturally to tool use:

| Agent task | Function |
| --- | --- |
| “Find recent articles on X” | `fetch_urls()` |
| “Scrape these pages” | `read_urls()` |
| “Find all mentions of these entities” | `search_dict()` |
| “Follow citations from this Wikipedia article” | `fetch_wiki_refs()` |

Vignettes

Web data – fetch_urls() + read_urls()
Basic NLP – sentence splitting, tokenization, span-aware casting
Wikipedia data – fetch_wiki_urls() + fetch_wiki_refs()
Regex search – search_regex(), KWIC
Dictionary search – search_dict(), PMI co-occurrence
Semantic search – RAG pipeline: embeddings, BM25, hybrid RRF retrieval, LLM extraction

License

Citation

citation("textpress")

Issues

Report bugs or request features at https://github.com/jaytimm/textpress/issues