Alphabets¶
In order to accommodate different type of data (DNA, RNA, protein, codon or microsatellite for example)
a special class has been designed, Alphabet
, which centralizes the list of valid allelic values
(separated in exploitable and missing alleles). Even though the user can define his own alphabets, it
is recommended to use pre-defined instances present in the package, for two reasons: first, a number of
tools expect data of a particular type and in this case the alphabet is used to validate the data type
(nucleotide sequences must code using the DNA alphabet); and, second, some of the pre-implemented
alphabets are optimized (especially DNA). Pre-defined instances are availabled in
the alphabets module.
As a result, in order to instanciate an instance of the class Align
, Container
or
Site
it is mandatory to explicit under which alphabet they should operate.
For example to import a DNA alignment from a file, one can just type the following:
aln = egglib.io.from_fasta('align1.fas',alphabet=egglib.alphabets.DNA)
Note
Some functions require pre-implemented alphabets, such as tools.to_codons()
,
tools.to_bases()
or the functions of the class tools.Translator
. Other functions are
completely generic in this regard and accept custom alphabets so long as they are well-constructed.
Facilities available in Align
/Container
instances¶
Automatic conversion of string input¶
To visualize data one can used the function string()
from the SequenceView
class such as in the
following example:
aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA)
print(aln.get_sequence(0).string())
This line produces the expected output:
ACCGTGGAGAGCGCGTTGCA
The following is a similar way to obtain the same result:
aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
seq = aln.get_sequence(0)
print(seq[:])
Be careful however that when using simulated data from the coalescent simulator, the above code might get you in trouble since the simulator does not use a DNA alphabet in order to be able to accommodate more general models:
coal = egglib.coalesce.Simulator(1, num_chrom=[40], theta=2.0)
aln = coal.simul()
print(aln.get_sequence(0).string())
would return an error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.7/site-packages/egglib/_interface.py", line239, in string
raise ValueError, 'cannot generate sequence string with alphabet {0}'.format(self._parent._alphabet.name)
ValueError: cannot generate sequence string with alphabet KAM:2
and
seq = aln.get_sequence(0)
print(seq[:])
would return something like the following:
[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
It is however easy to visualize the simulated alignment with DNA sequences, one can use the to_fasta()
function
and specifying the egglib.alphabets.DNA as an argument:
print(aln.to_fasta(alphabet=egglib.alphabets.DNA))
This will map the simulated alleles to nucleotide bases, based on an arbitrary order of bases and generate an intuitive nucleotide sequence. Note that the latter will work only if the mutation model used for simulations allows at most four alleles. With more alleles, the
Protein translation¶
EggLib allows you to translate DNA sequences to proteins. In all cases, there is no
validity checking concerning characters, meaning that invalid characters
will be translated as missing data (although in the future an error might
be generated). The translation operates in two steps: DNA to codons and
codons to protein. The RNA alphabet is actually not used (it is only provided
to allow you to import sequences of this type at this moment).
The standard is therefore to use the function tools.to_codons()
which converts
the alignment from the DNA alphabet to the codon alphabet. Translation requires codon
sequences. Other functions that strictly
require coding sequences are tools.backalign()
, tools.trailing_stops()
,
tools.iter_stops()
, tools.has_stop()
and the class CodingDiversity
.
To perform the DNA to codon extraction, and if the sequences are not an open reading frame (the default assumption),
a tools.ReadingFrame
instance indicating the bounds of the reading
frame must be passed as an argument to the tools.to_codons()
function.
EggLib has built-in support for all genetic codes described in USA’s National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). The genetic codes and their identifier are listed here.
Protein translation tools lie in the tools module, which gathers various functionalities that are described in the following of this section.
The tools.translate()
function¶
The simplest way to translate nucleotide sequences to amino acids with
EggLib is probably the tools.translate()
function. This function
is flexible: it takes as input alignments, sequence sets, or individual
sequences, and returns a corresponding object, as lists in the table below:
Input type |
Returned type |
---|---|
The usage is straightforward, as exemplified below:
import egglib
aln = egglib.Align.create([
('sample1', 'TTGCTAGGTGTATAG'),
('sample2', 'TTCCTAGATGAATAG'),
('sample3', 'ATGCTAGATGAATAG')],
alphabet=egglib.alphabets.DNA)
aln.to_codons()
prot = egglib.tools.translate(aln)
print(prot.to_fasta())
The resulting protein alignment will be for this example:
>sample1
LLGV*
>sample2
FLDE*
>sample3
MLDE*
The code option of this function lets you specify the genetic code to be used, if it is not the standard one. Among options, the option in_place tells the function to overwrite the object you provide, instead of returning a new one. This can be useful in memory- or especially time-critical applications:
aln = egglib.Align.create([
('sample1', 'TTGCTAGGTGTATAG'),
('sample2', 'TTCCTAGATGAATAG'),
('sample3', 'ATGCTAGATGAATAG')],
alphabet = egglib.alphabets.DNA)
aln.to_codons()
egglib.tools.translate(aln, in_place=True) # returns None
print(aln.to_fasta())
translate()
can also translate individual sequences, either provided as
a SequenceView
or a str
(in_place does not work in this
two cases):
egglib.tools.translate('CCATTGGTAATGGCC')
Use the allow_alt=True
option to support alternative start codons (a rare
phenomenon accounted for by all genetic codes).
“Smart” translation¶
By default, codons with any missing data are translated as missing data
(the X
character). However, in certain cases it might be possible to
guess: for example (in the standard genetic code), CTN
necessarily
translates to a Leucine because all four possibilities (CTA
, CTC
,
CTG
, and CTT
) do. Similarly, both AGC
and TGC
encode a Serine,
so WGC
also necessarily encodes a Serine. The option smart=True
turns on automatic detection of those case, based on the table of ambiguity characters
in the nomenclature for nucleotide codes
reproduced in the table below:
Symbol |
Meaning |
---|---|
G |
G |
A |
A |
T |
T |
C |
C |
R |
G or A |
Y |
T or C |
M |
A or C |
K |
G or T |
S |
G or C |
W |
A or T |
H |
A or C or T |
B |
G or T or C |
V |
G or A or C |
D |
G or A or T |
N |
G or A or T or C |
The tools.Translator
class¶
There is a Translator
class in the tools module
that perform the same operations as the tools.translate()
function.
If you need to translate many data sets in one go, it will probably be faster
to use this class as in the example below, where we assume that aligns
is
a list
of Align
instances:
trans = egglib.tools.Translator(code=1)
for aln in aligns:
trans.translate_align(aln, in_place=True)
The options code=1
is actually the default. It is shown just to
show that options are specified in the class’s constructor.
Detecting open reading frames and processing stop codons¶
There are additionnal tools helping you to manage coding sequences, or sequences containing open reading frames:
tools.orf_iter()
provides an iterator over all possible open reading frames of the sequence. It is configurable (see options) and is used as this (with default options):for start, stop, length, frame in egglib.tools.orf_iter(seq): print(seq[start:stop])
(The example displays the sequence of each open reading frame, although some may need to be reverse-complemented.)
tools.longest_orf()
is a shortcut to find the longest possible open reading frame. If there is a tie, the function does not take the decision for you and raises an exception.tools.trailing_stops()
detect and optionally fixes stop codons at the end of sequences of an alignment.tools.iter_stops()
provides you with an iterator over the positions of all stop codons of the alignment.tools.has_stop()
is a shortcut to test if an alignment contains any stop codons.