This release brings together several years of maintenance and feature work to make textreuse easier to use on current R installations and more practical for larger document collections.
This is a CRAN resubmission that fixes a moved README URL reported by CRAN incoming checks.

TextReuseTextDocument() and
TextReuseCorpus() now accept an encoding
argument, making it easier to read source files whose text encoding is
known or differs from the platform default.

TextReuseCorpus() now keeps skipped-document
bookkeeping deterministic. Skipped documents are reported consistently,
and skip metadata is available even when
skip_short = FALSE.

align_local() now returns an empty local alignment
instead of throwing an error when two texts have no matching words. This
makes batch alignment workflows easier to run because no-match pairs can
be represented directly.

align_local() gains preserve_punctuation,
allowing displayed alignments to keep punctuation from the original
texts when that context is useful.

count_matches() and matching_tokens()
helpers expose absolute match counts and the matched tokens themselves,
so users can inspect what drove a similarity score rather than relying
only on a ratio.

pairwise_candidates() and matrix conversion now
preserve all document IDs, including documents without returned
candidate pairs.

as_sparse_matrix() provides a sparse matrix
representation of candidate results, which is more convenient for
downstream modeling, graph analysis, and workflows with many
documents.

lsh_add() can add new documents to an existing LSH
bucket cache, so users can extend an index without rebuilding it from
scratch.

lsh_compare() can run comparisons in parallel on
non-Windows platforms when options(mc.cores) is set.

shingle_ngrams()

lsh() on corpora without minhashes
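
The lsh_compare() note above can be illustrated with a small LSH workflow. This is a minimal sketch, not the package's documented example: the toy documents, the seed, the choice of two workers, and the band count are all assumptions made for illustration. It relies only on functions that already existed in earlier textreuse releases, and it assumes that setting options(mc.cores) is the whole opt-in, as the note describes.

```r
library(textreuse)

# Ask for two workers only where forked parallelism is available.
# (Two is an arbitrary choice for this sketch.)
if (.Platform$OS.type != "windows") {
  options(mc.cores = 2L)
}

# Toy documents, purely illustrative. "a" and "b" differ by one word.
docs <- c(
  a = "The quick brown fox jumps over the lazy dog near the quiet river bank today",
  b = "The quick brown fox jumps over the lazy dog near the quiet river bank tonight",
  c = "An entirely different sentence about sparse matrices and graph analysis in R"
)

# 200 minhashes split into 50 bands of 4 rows each.
minhash <- minhash_generator(n = 200, seed = 2024)
corpus <- TextReuseCorpus(
  text = docs,
  tokenizer = tokenize_ngrams, n = 3,
  minhash_func = minhash,
  progress = FALSE
)

# Bucket the documents, extract candidate pairs, then score them.
buckets <- lsh(corpus, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)
scores
```

With only three short documents the parallel path brings no measurable benefit; the option is more likely to matter when the candidate list is long.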