textpress is an R toolkit for building text corpora and
searching them – no custom object classes, just plain data frames from
start to finish. It covers the full arc from URL to retrieved passage
through a consistent four-step API: Fetch,
Read, Process,
Search. Traditional tools (KWIC, BM25, dictionary
matching) sit alongside modern ones (semantic search, LLM-ready
chunking), all composing cleanly with |>.
From CRAN:

    install.packages("textpress")

Development version:

    remotes::install_github("jaytimm/textpress")

## textpress API

Conventions: corpus is a data frame with a
text column plus identifier column(s) passed to
by (default doc_id). All outputs are plain
data frames or data.tables; pipe-friendly.
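As a minimal illustration of this convention, the data frame below is a hand-built stand-in for a corpus. The document IDs and text are made up for the example, and no textpress function is called.

```r
# A toy corpus following the stated convention: one identifier column
# (doc_id, the documented default for `by`) plus a `text` column.
# The contents are invented for illustration.
corpus <- data.frame(
  doc_id = c("1", "2", "3"),
  text = c(
    "Text search starts with a clean corpus.",
    "Plain data frames compose well with the native pipe.",
    "Each row pairs an identifier with its text."
  ),
  stringsAsFactors = FALSE
)

# Because the corpus is an ordinary data frame, base R verbs work on it
# directly and chain with the native pipe (R >= 4.1).
long_docs <- corpus |>
  subset(nchar(text) > 40)
```

Anything that produces a data frame of this shape, whether scraped or loaded from disk, should slot into the same convention.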
### Fetch (fetch_*)

Find URLs and metadata – not full text. Pass results to read_urls() to get content.

- fetch_urls(query, n_pages, date_filter) – Search engine query; returns candidate URLs with metadata.
- fetch_wiki_urls(query, limit) – Wikipedia article URLs matching a search phrase.
- fetch_wiki_refs(url, n) – External citation URLs from a Wikipedia article’s References section.

### Read (read_*)

Scrape and parse URLs into a structured corpus.

- read_urls(urls, ...) – Character vector of URLs → list(text, meta). text is one row per node (headings, paragraphs, lists); meta is one row per URL. For Wikipedia, exclude_wiki_refs = TRUE drops References / See also / Bibliography sections.

### Process (nlp_*)

Prepare text for search or indexing.

- nlp_split_paragraphs() – Break documents into structural blocks.
- nlp_split_sentences() – Segment blocks into individual sentences.
- nlp_tokenize_text() – Normalize text into a clean token stream.
- nlp_index_tokens() – Build a weighted BM25 index for ranked retrieval.
- nlp_roll_chunks() – Roll sentences into fixed-size chunks with surrounding context (RAG-style).

### Search (search_*)

Four retrieval modes over the same corpus. Data-first, pipe-friendly.

| Function | Query type | Use case |
|---|---|---|
| search_regex(corpus, query) | Regex pattern | Specific strings, KWIC with inline highlighting. |
| search_dict(corpus, terms) | Term vector | Exact phrases and MWEs; built-in dict_generations, dict_political. |
| search_index(index, query) | Keywords | BM25 ranked retrieval over a token index. |
| search_vector(embeddings, query) | Numeric vector | Semantic nearest-neighbor search; use util_fetch_embeddings() to embed. |

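To make the BM25 row more concrete, here is a rough base R sketch of BM25 scoring over toy token lists. It is not the textpress implementation: the function name bm25_scores(), the parameter values k1 = 1.2 and b = 0.75, and the smoothed IDF variant are all assumptions chosen for illustration.

```r
# Toy token lists, one per document (hypothetical stand-ins for a
# tokenized corpus; not produced by textpress).
docs <- list(
  d1 = c("the", "cat", "sat", "on", "the", "mat"),
  d2 = c("the", "dog", "chased", "the", "cat"),
  d3 = c("birds", "sing", "at", "dawn")
)

# Okapi BM25 with a smoothed IDF: log(1 + (N - df + 0.5) / (df + 0.5)).
# k1 and b are commonly used defaults, assumed here.
bm25_scores <- function(docs, query, k1 = 1.2, b = 0.75) {
  n_docs  <- length(docs)
  doc_len <- lengths(docs)
  avg_len <- mean(doc_len)
  scores <- vapply(seq_along(docs), function(i) {
    total <- 0
    for (term in unique(query)) {
      tf  <- sum(docs[[i]] == term)                                 # term frequency
      df  <- sum(vapply(docs, function(d) term %in% d, logical(1))) # document frequency
      idf <- log(1 + (n_docs - df + 0.5) / (df + 0.5))
      total <- total + idf * tf * (k1 + 1) /
        (tf + k1 * (1 - b + b * doc_len[[i]] / avg_len))
    }
    total
  }, numeric(1))
  out <- data.frame(doc_id = names(docs), score = scores)
  out[order(-out$score), ]  # best match first
}

ranked <- bm25_scores(docs, query = c("cat", "mat"))
```

Documents containing more of the query terms, and rarer ones, rank higher; a document with none of them scores zero.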
textpress is designed to compose cleanly into
retrieval-augmented generation pipelines.
Hybrid retrieval – run search_index()
and search_vector() over the same chunks, then merge with
reciprocal rank fusion (RRF). Chunks that rank well under both term
frequency and meaning rise to the top.
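A minimal sketch of the fusion step, in base R. The two input data frames are hypothetical stand-ins for ranked output from search_index() and search_vector(); their column name chunk_id, the helper rrf_merge(), and the constant k = 60 are assumptions for illustration, not part of the textpress API.

```r
# Hypothetical ranked results (best first) from two retrievers run
# over the same chunks. Row order encodes rank.
bm25_hits   <- data.frame(chunk_id = c("c3", "c1", "c7", "c2"))
vector_hits <- data.frame(chunk_id = c("c1", "c5", "c3", "c9"))

# Reciprocal rank fusion: score(chunk) = sum over lists of 1 / (k + rank).
# k = 60 is a commonly used constant, assumed here.
rrf_merge <- function(..., k = 60) {
  lists <- list(...)
  scored <- do.call(rbind, lapply(lists, function(hits) {
    data.frame(
      chunk_id = hits$chunk_id,
      rrf = 1 / (k + seq_len(nrow(hits)))
    )
  }))
  # Sum contributions per chunk, then sort by fused score.
  fused <- aggregate(rrf ~ chunk_id, data = scored, FUN = sum)
  fused[order(-fused$rrf), ]
}

fused <- rrf_merge(bm25_hits, vector_hits)
```

Because RRF uses only ranks, the BM25 scores and embedding similarities never need to be put on a common scale before merging.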
Context assembly – nlp_roll_chunks()
with context_size > 0 gives each chunk a focal sentence
plus surrounding context, so retrieved passages are self-contained when
passed to an LLM.
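The idea can be sketched in base R as follows. This is an illustration of focal-plus-context windows, not the nlp_roll_chunks() implementation: the helper roll_with_context(), the column names, and the choice of one focal sentence per chunk within a single document are all assumptions.

```r
# Toy sentence table for a single document (contents invented).
sentences <- data.frame(
  doc_id = "1",
  sentence_id = 1:5,
  text = c("Alpha.", "Bravo.", "Charlie.", "Delta.", "Echo.")
)

# For each focal sentence, paste it together with up to `context_size`
# neighbouring sentences on each side, clipped at document boundaries.
roll_with_context <- function(sentences, context_size = 1) {
  n <- nrow(sentences)
  chunks <- lapply(seq_len(n), function(i) {
    window <- max(1, i - context_size):min(n, i + context_size)
    data.frame(
      doc_id   = sentences$doc_id[i],
      focal_id = sentences$sentence_id[i],
      focal    = sentences$text[i],
      chunk    = paste(sentences$text[window], collapse = " ")
    )
  })
  do.call(rbind, chunks)
}

chunks <- roll_with_context(sentences, context_size = 1)
```

Keeping the focal sentence in its own column lets retrieval score the narrow unit while the wider chunk is what gets passed to the model.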
Agent tool-calling – the consistent API and plain data-frame outputs map naturally to tool use:
| Agent task | Function |
|---|---|
| “Find recent articles on X” | fetch_urls() |
| “Scrape these pages” | read_urls() |
| “Find all mentions of these entities” | search_dict() |
| “Follow citations from this Wikipedia article” | fetch_wiki_refs() |

- fetch_urls() + read_urls()
- fetch_wiki_urls() + fetch_wiki_refs()
- search_regex(), KWIC
- search_dict(), PMI co-occurrence

MIT © Jason Timm

    citation("textpress")

Report bugs or request features at https://github.com/jaytimm/textpress/issues