Alphabets

In order to accommodate different type of data (DNA, RNA, protein, codon or microsatellite for example) a special class has been designed, Alphabet, which centralizes the list of valid allelic values (separated in exploitable and missing alleles). Even though the user can define his own alphabets, it is recommended to use pre-defined instances present in the package, for two reasons: first, a number of tools expect data of a particular type and in this case the alphabet is used to validate the data type (nucleotide sequences must code using the DNA alphabet); and, second, some of the pre-implemented alphabets are optimized (especially DNA). Pre-defined instances are availabled in the alphabets module.

As a result, in order to instanciate an instance of the class Align , Container or Site it is mandatory to explicit under which alphabet they should operate. For example to import a DNA alignment from a file, one can just type the following:

aln = egglib.io.from_fasta('align1.fas',alphabet=egglib.alphabets.DNA)

Note

Some functions require pre-implemented alphabets, such as tools.to_codons(), tools.to_bases() or the functions of the class tools.Translator. Other functions are completely generic in this regard and accept custom alphabets so long as they are well-constructed.

Facilities available in Align/Container instances

Automatic conversion of string input

To visualize data one can used the function string() from the SequenceView class such as in the following example:

aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA)
print(aln.get_sequence(0).string())

This line produces the expected output:

ACCGTGGAGAGCGCGTTGCA

The following is a similar way to obtain the same result:

aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
seq =  aln.get_sequence(0)
print(seq[:])

Be careful however that when using simulated data from the coalescent simulator, the above code might get you in trouble since the simulator does not use a DNA alphabet in order to be able to accommodate more general models:

coal = egglib.coalesce.Simulator(1, num_chrom=[40], theta=2.0)
aln = coal.simul()
print(aln.get_sequence(0).string())

would return an error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/user/.local/lib/python3.7/site-packages/egglib/_interface.py", line239, in string
raise ValueError, 'cannot generate sequence string with alphabet {0}'.format(self._parent._alphabet.name)
ValueError: cannot generate sequence string with alphabet KAM:2

and

seq =  aln.get_sequence(0)
print(seq[:])

would return something like the following:

[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

It is however easy to visualize the simulated alignment with DNA sequences, one can use the to_fasta() function and specifying the egglib.alphabets.DNA as an argument:

print(aln.to_fasta(alphabet=egglib.alphabets.DNA))

This will map the simulated alleles to nucleotide bases, based on an arbitrary order of bases and generate an intuitive nucleotide sequence. Note that the latter will work only if the mutation model used for simulations allows at most four alleles. With more alleles, the

Protein translation

EggLib allows you to translate DNA sequences to proteins. In all cases, there is no validity checking concerning characters, meaning that invalid characters will be translated as missing data (although in the future an error might be generated). The translation operates in two steps: DNA to codons and codons to protein. The RNA alphabet is actually not used (it is only provided to allow you to import sequences of this type at this moment). The standard is therefore to use the function tools.to_codons() which converts the alignment from the DNA alphabet to the codon alphabet. Translation requires codon sequences. Other functions that strictly require coding sequences are tools.backalign(), tools.trailing_stops(), tools.iter_stops(), tools.has_stop() and the class CodingDiversity. To perform the DNA to codon extraction, and if the sequences are not an open reading frame (the default assumption), a tools.ReadingFrame instance indicating the bounds of the reading frame must be passed as an argument to the tools.to_codons() function.

EggLib has built-in support for all genetic codes described in USA’s National Center for Biotechnology Information (NCBI) database (http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi). The genetic codes and their identifier are listed here.

Protein translation tools lie in the tools module, which gathers various functionalities that are described in the following of this section.

The tools.translate() function

The simplest way to translate nucleotide sequences to amino acids with EggLib is probably the tools.translate() function. This function is flexible: it takes as input alignments, sequence sets, or individual sequences, and returns a corresponding object, as lists in the table below:

Input type

Returned type

Align

Align

Container

Container

SequenceView

str

str

str

The usage is straightforward, as exemplified below:

import egglib
aln = egglib.Align.create([
    ('sample1',  'TTGCTAGGTGTATAG'),
    ('sample2',  'TTCCTAGATGAATAG'),
    ('sample3',  'ATGCTAGATGAATAG')],
    alphabet=egglib.alphabets.DNA)
aln.to_codons()
prot = egglib.tools.translate(aln)
print(prot.to_fasta())

The resulting protein alignment will be for this example:

>sample1
LLGV*
>sample2
FLDE*
>sample3
MLDE*

The code option of this function lets you specify the genetic code to be used, if it is not the standard one. Among options, the option in_place tells the function to overwrite the object you provide, instead of returning a new one. This can be useful in memory- or especially time-critical applications:

aln = egglib.Align.create([
    ('sample1',  'TTGCTAGGTGTATAG'),
    ('sample2',  'TTCCTAGATGAATAG'),
    ('sample3',  'ATGCTAGATGAATAG')],
    alphabet = egglib.alphabets.DNA)
aln.to_codons()
egglib.tools.translate(aln, in_place=True) # returns None
print(aln.to_fasta())

translate() can also translate individual sequences, either provided as a SequenceView or a str (in_place does not work in this two cases):

egglib.tools.translate('CCATTGGTAATGGCC')

Use the allow_alt=True option to support alternative start codons (a rare phenomenon accounted for by all genetic codes).

“Smart” translation

By default, codons with any missing data are translated as missing data (the X character). However, in certain cases it might be possible to guess: for example (in the standard genetic code), CTN necessarily translates to a Leucine because all four possibilities (CTA, CTC, CTG, and CTT) do. Similarly, both AGC and TGC encode a Serine, so WGC also necessarily encodes a Serine. The option smart=True turns on automatic detection of those case, based on the table of ambiguity characters in the nomenclature for nucleotide codes reproduced in the table below:

Symbol

Meaning

G

G

A

A

T

T

C

C

R

G or A

Y

T or C

M

A or C

K

G or T

S

G or C

W

A or T

H

A or C or T

B

G or T or C

V

G or A or C

D

G or A or T

N

G or A or T or C

The tools.Translator class

There is a Translator class in the tools module that perform the same operations as the tools.translate() function. If you need to translate many data sets in one go, it will probably be faster to use this class as in the example below, where we assume that aligns is a list of Align instances:

trans = egglib.tools.Translator(code=1)
for aln in aligns:
    trans.translate_align(aln, in_place=True)

The options code=1 is actually the default. It is shown just to show that options are specified in the class’s constructor.

Detecting open reading frames and processing stop codons

There are additionnal tools helping you to manage coding sequences, or sequences containing open reading frames:

  • tools.orf_iter() provides an iterator over all possible open reading frames of the sequence. It is configurable (see options) and is used as this (with default options):

    for start, stop, length, frame in egglib.tools.orf_iter(seq):
        print(seq[start:stop])
    

    (The example displays the sequence of each open reading frame, although some may need to be reverse-complemented.)

  • tools.longest_orf() is a shortcut to find the longest possible open reading frame. If there is a tie, the function does not take the decision for you and raises an exception.

  • tools.trailing_stops() detect and optionally fixes stop codons at the end of sequences of an alignment.

  • tools.iter_stops() provides you with an iterator over the positions of all stop codons of the alignment.

  • tools.has_stop() is a shortcut to test if an alignment contains any stop codons.