The R package edlibR provides bindings to the C/C++ library edlib,
which computes exact pairwise sequence alignments using the edit
distance (Levenshtein distance). The functions within edlibR are
modeled after the API of the Python package edlib on PyPI.
There are three functions within edlibR: align(), getNiceAlignment(),
and nice_print().
The function align() computes the pairwise alignment of the input
query against the input target:
align(query, target, [mode], [task], [k], [cigarFormat], [additionalEqualities])
A list is returned with the following fields:
editDistance: the edit distance between query and target.
alphabetLength: the number of distinct characters in query and target.
locations: a list of alignment locations, in the format
list(c(start, end)). Note: if the start or end position is NULL, it is
encoded as NA to work correctly with R vectors.
cigar: the CIGAR string of the alignment, or NULL if no alignment path
was requested.
cigarFormat: the cigarFormat passed to align(), returned here for use
by getNiceAlignment(). (Note: the function getNiceAlignment() only
accepts cigarFormat="extended".)

## $editDistance
## [1] 1
## 
## $alphabetLength
## [1] 5
## 
## $locations
## $locations[[1]]
## [1] 1 3
## 
## $locations[[2]]
## [1] 1 4
## 
## 
## $cigar
## [1] "3=1I"
## 
## $cigarFormat
## [1] "extended"

## $editDistance
## [1] 3
## 
## $alphabetLength
## [1] 8
## 
## $locations
## $locations[[1]]
## [1] NA  8
## 
## 
## $cigar
## NULL
## 
## $cigarFormat
## [1] "extended"

algn3 = align("ACTG", "CACTRT", mode="HW", task="path")
print(algn3)
## $editDistance
## [1] 1
## 
## $alphabetLength
## [1] 5
## 
## $locations
## $locations[[1]]
## [1] 1 3
## 
## $locations[[2]]
## [1] 1 4
## 
## 
## $cigar
## [1] "3=1I"
## 
## $cigarFormat
## [1] "extended"

## the previous example with additionalEqualities
algn4 = align("ACTG", "CACTRT", mode="HW", task="path", additionalEqualities=list(c("R", "A"), c("R", "G")))
print(algn4)
## $editDistance
## [1] 0
## 
## $alphabetLength
## [1] 5
## 
## $locations
## $locations[[1]]
## [1] 1 4
## 
## 
## $cigar
## [1] "4="
## 
## $cigarFormat
## [1] "extended"

Alignment methods supported by edlibR:
Global (mode="NW"): the standard method, in which the query is aligned
against the whole target from end to end.
Prefix (mode="SHW"): removing characters from the end of the second
sequence is not penalized. For example, with query AACT and target
AACTGGC, the edit distance would be 0, because removing GGC from the
end of the second sequence is “free” and does not count toward the
total edit distance. This method is appropriate when you want to find
out how well the first sequence fits at the beginning of the second
sequence.
Infix (mode="HW"): removing characters from the start and the end of
the second sequence is not penalized. For example, with query ACT and
target CGACTGAC, the edit distance would be 0, because removing CG
from the start and GAC from the end of the second sequence is “free”
and does not count toward the total edit distance. This method is
appropriate when you want to find out how well the first sequence fits
at any part of the second sequence. For example, if your second
sequence was a long text and your first sequence was a sentence from
that text, but slightly scrambled, you could use this method to
discover how scrambled it is and where it fits in that text. In
bioinformatics, this method is appropriate for aligning a read to a
sequence.
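The prefix and infix examples described above can be tried directly. The following is a minimal sketch, assuming edlibR is installed and that align() is called with its default task; the global ("NW") call is included only for comparison on the same pair of sequences.

```r
library(edlibR)

## Same query/target pair under global and prefix alignment
nw  = align("AACT", "AACTGGC", mode = "NW")
shw = align("AACT", "AACTGGC", mode = "SHW")

## Infix alignment: the query may match anywhere inside the target
hw  = align("ACT", "CGACTGAC", mode = "HW")

## Collect the edit distances side by side
distances = c(NW = nw$editDistance, SHW = shw$editDistance, HW = hw$editDistance)
print(distances)
```

According to the descriptions above, the SHW and HW distances should both be 0, while the global distance also counts the trailing GGC.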
The function getNiceAlignment() takes the output of align() and
represents it in a visually informative format for human inspection
(“NICE format”): a set of strings showing the matches, mismatches,
insertions, and deletions.
getNiceAlignment(alignResult, query, target, [gapSymbol])
Note: Users must use the argument task="path" within align() to output
a CIGAR; otherwise, there will be no CIGAR from which
getNiceAlignment() can reconstruct the alignment in “NICE” format.
align() must also be run with cigarFormat="extended"; otherwise, the
CIGAR will be too ambiguous for getNiceAlignment() to reconstruct the
alignment correctly.
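To illustrate what the extended CIGAR carries, here is a small base R sketch that splits a CIGAR string into run lengths and operation codes. The helper parseCigar() is hypothetical and not part of edlibR. In the extended format, "=" marks a match and "X" a mismatch, with "I" and "D" for insertions and deletions; the standard format folds matches and mismatches into a single "M", which is why it is too ambiguous for reconstructing the alignment.

```r
## Hypothetical helper (not part of edlibR): split a CIGAR string into
## run lengths and operation codes using base R only.
parseCigar = function(cigar) {
  ## Each token is a run length followed by one operation character
  tokens = regmatches(cigar, gregexpr("[0-9]+[=XIDM]", cigar))[[1]]
  data.frame(
    length = as.integer(sub("[=XIDM]$", "", tokens)),
    op = sub("^[0-9]+", "", tokens),
    stringsAsFactors = FALSE
  )
}

## "3=1I" is the extended CIGAR shown in the align() output above
ops = parseCigar("3=1I")
print(ops)
```

Here the CIGAR decomposes into a run of three matches followed by one insertion.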
library(edlibR)
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
print(nice_algn)
## $query_aligned
## [1] "-elephant"
## 
## $matched_aligned
## [1] "-|||||.|."
## 
## $target_aligned
## [1] "telephone"

The arguments of getNiceAlignment() are:
alignResult: the output of align(). As mentioned above, align() must
use the arguments task="path" and cigarFormat="extended" in order for
the CIGAR to be informative enough for getNiceAlignment() to work
properly.
query: the query used to produce alignResult.
target: the target used to produce alignResult.
gapSymbol: the symbol used to represent gaps in the aligned query and
target (default="-"). This must be a single character, i.e.
nchar(gapSymbol) must equal 1.

The function nice_print() prints the output of getNiceAlignment() to
the console for quick inspection of the alignment. Users can think of
this function as a “pretty-print” function for visualization.
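As a rough illustration of this kind of pretty-printing, the sketch below builds three labelled lines from a list shaped like the getNiceAlignment() output. The helper formatNice() is hypothetical, written in base R, and is not the package's implementation of nice_print().

```r
## Hypothetical helper (not part of edlibR): build three labelled lines
## from a list with the fields returned by getNiceAlignment().
formatNice = function(niceAlignment) {
  c(
    paste0("query:   ", niceAlignment$query_aligned),
    paste0("matched: ", niceAlignment$matched_aligned),
    paste0("target:  ", niceAlignment$target_aligned)
  )
}

## Values copied from the getNiceAlignment() output shown earlier
nice = list(
  query_aligned = "-elephant",
  matched_aligned = "-|||||.|.",
  target_aligned = "telephone"
)
formatted = formatNice(nice)
print(formatted)
```

Padding the labels to the same width keeps the three strings vertically aligned, so each column of the alignment can be read top to bottom.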
library(edlibR)
## example above from getNiceAlignment()
query = "elephant"
target = "telephone"
result = align(query, target, task = "path")
nice_algn = getNiceAlignment(result, query, target)
nice_print(nice_algn)
## [1] "query:   -elephant"
## [1] "matched: -|||||.|."
## [1] "target:  telephone"

For more information regarding edlib, please see the publication in Bioinformatics.