Importing sequences in the fasta format¶
The fasta format is described formally here. The function io.from_fasta()
imports a fasta-formatted file without any regard for the kind of data it
contains (DNA or RNA nucleotides, protein sequences). Note that it is also
possible to import Genepop-formatted genotypic data using the
io.from_genepop() function. This function returns an Align instance, like
io.from_fasta().
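For readers new to the format, the layout that io.from_fasta() reads can be illustrated with a minimal sketch in plain Python. This is not EggLib's parser: the function name read_fasta and the sample data are hypothetical, and the sketch only assumes the usual convention that a line starting with > opens a record and the lines that follow hold its sequence.

```python
import io

def read_fasta(handle):
    """Return a list of (header, sequence) pairs from a fasta-formatted text stream.

    Hypothetical helper for illustration only; it is not part of EggLib.
    """
    records = []
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, ''.join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; collect the pieces.
            chunks.append(line)
    if header is not None:
        records.append((header, ''.join(chunks)))
    return records

# Made-up data: two samples, each sequence split over two lines.
example = io.StringIO(">sam1\nACCGT\nTTGA\n>sam2\nACC-T\nTTGA\n")
records = read_fasta(example)
print(records)
```

In practice io.from_fasta() should be preferred, since it also takes care of the alphabet and returns an EggLib object.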
Simplest case¶
Let align1.fas be the name of a fasta-formatted file containing an
alignment of DNA sequences. To import it as an EggLib object, all you have to do is run:
>>> import egglib
>>> aln = egglib.io.from_fasta('align1.fas', alphabet=egglib.alphabets.DNA)
The type of the object is Align, which is central to most of EggLib’s
functionality. You must provide an Alphabet to specify the type of data
(such as alphabets.DNA or alphabets.protein, among a few others, or a custom
type; see the alphabets module for further information). Align instances are
accepted as arguments by many other methods of the package. If you want to
see the contents of an Align instance, the expression print(aln) will not be
useful: it only gives you a unique identifier of the object. This manual
will introduce some of the functionality offered by this class and its
relative Container, but to get started you can access the number of samples
and the alignment length through the instance properties ns and ls:
>>> print(aln.ns)
101
>>> print(aln.ls)
8942
Alignments and containers¶
Instances of type Container are very similar to Align, except that the
number of data entries is allowed to vary between samples. They are
specifically designed to hold unaligned sequences. Much of Align’s
functionality is shared with Container.
Automatic detection of alignments¶
There is no difference, in the fasta format, between sequence alignments and
sets of unaligned sequences. By default, io.from_fasta() automatically
detects whether all sequences have the same length: if so, it returns an
Align; otherwise, it returns a Container. You can test this by running:
>>> cnt = egglib.io.from_fasta('sequences1.fas', alphabet=egglib.alphabets.DNA)
>>> print(type(cnt))
<class 'egglib._interface.Container'>
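The detection rule described above can be sketched in plain Python. This only illustrates the logic and is not EggLib's implementation: the helper detect_type is hypothetical, it returns the class name as a string, and its handling of an empty input is an arbitrary choice of this sketch.

```python
def detect_type(sequences):
    """Return 'Align' if all sequences share one length, else 'Container'.

    Hypothetical helper; an empty input is treated as aligned here,
    which is an assumption of this sketch only.
    """
    lengths = {len(seq) for seq in sequences}
    return 'Align' if len(lengths) <= 1 else 'Container'

aligned = ['ACGT', 'AC-T', 'ACGA']   # same length everywhere
unaligned = ['ACGT', 'ACG', 'ACGAA'] # lengths differ
print(detect_type(aligned))
print(detect_type(unaligned))
```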
Enforcing return type¶
In some cases you want to enforce the return type of io.from_fasta().
Typically, unaligned sequences may have the same length just by chance, so
that the function returns an Align when a Container would actually make
sense. Conversely, malformed fasta files may exist in large sets of
alignments, and forcing the return type to be Align will help detect
invalid files and process them accordingly.
To force the return type to be an Align or a Container, use the cls option
of io.from_fasta() as follows:
>>> aln2 = egglib.io.from_fasta('align1.fas', cls=egglib.Align, alphabet=egglib.alphabets.DNA)
>>> cnt2 = egglib.io.from_fasta('align1.fas', cls=egglib.Container, alphabet=egglib.alphabets.DNA)
The object aln2 will be an Align, and the object cnt2 will be a Container.
Even though they contain exactly the same data, you will not be able to do
the same things with them since they are instances of different types.
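The use case of screening a large set of files for malformed alignments can be sketched as follows. The sketch uses plain Python stand-ins rather than EggLib calls: require_alignment is a hypothetical helper, the data are made up, and raising ValueError is an assumption of this example, not a statement about which exception io.from_fasta() raises when cls=egglib.Align cannot be satisfied.

```python
def require_alignment(name, sequences):
    """Raise ValueError if the sequences cannot form an alignment.

    Hypothetical stand-in for forcing an alignment type on import.
    """
    lengths = sorted({len(seq) for seq in sequences})
    if len(lengths) > 1:
        raise ValueError(f'{name}: sequences have different lengths {lengths}')
    return sequences

# Made-up data sets standing in for a collection of fasta files.
datasets = {
    'locus1': ['ACGT', 'AC-T', 'ACGA'],
    'locus2': ['ACGT', 'ACG'],  # malformed: one sequence is too short
}

valid, invalid = [], []
for name, seqs in datasets.items():
    try:
        require_alignment(name, seqs)
        valid.append(name)
    except ValueError:
        # Set the invalid entry aside instead of stopping the whole batch.
        invalid.append(name)

print(valid, invalid)
```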
Exporting data¶
Exporting as fasta¶
All Align/Container instances have a fasta() method generating a fasta
representation of the instance. The first argument of this method is the
name of an output file (fname):
>>> aln.fasta('align_out.fas')
If the fname argument is omitted (or None), the fasta representation of
the set of sequences is returned as a string (built-in Python str instance):
>>> print(aln.fasta())
>sample_01 @0,0
CATGGAGGATGCAAACACTGCAATCTCGCGTGGGCCGCCACATATAATCC
CCAGATCACCTCTTGGCACTATTACACCCGCAGTTTCAAACCCGTCCCCA
GGTGTCGGCCTTACCCGACCTCAAATGACCCCGGACAGGGCAGGCTGACC
ANAGGCCGTTTNCGCCACTGTGTGAGTCACATCGTCAATTTTCAGCGNCA
CAAGTGCTTAGCTATCGTCANTCCCGCACCAGAACGTAGGTGGCTGTTAG
CGGGATGTCCCGAGATATCTACGATCGCTCCAACTCGCTGGACAAACAAT
CTATGTCAGTACCCGAGAGTTNTTACCTACCTTGTAAAATTAAACTTTAA
TTATTTCGAAATATTACCGATGTTGATGCAG------ATACATGATCGCT
CGTTAGTTCATGTATGTCTAACTAGCTCGTGCTGTTACACGGACCGAAGA
...
Other arguments can be passed to this method, for example to export full names with labels or to export only some of the sequences.
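The two behaviours of the fname argument (write to a file, or return a string when it is omitted) can be mimicked with a small plain-Python sketch. The function to_fasta is hypothetical and is not the fasta() method itself; the width of 50 characters per line is only read off the sample output above and is assumed here for illustration.

```python
def to_fasta(records, fname=None, width=50):
    """Format (name, sequence) pairs as fasta text.

    Hypothetical helper: returns the text if fname is None,
    otherwise writes it to the file called fname and returns None.
    """
    lines = []
    for name, seq in records:
        lines.append('>' + name)
        # Wrap the sequence into chunks of at most `width` characters.
        for start in range(0, len(seq), width):
            lines.append(seq[start:start + width])
    text = '\n'.join(lines) + '\n'
    if fname is None:
        return text
    with open(fname, 'w') as handle:
        handle.write(text)
    return None

# Made-up sample with a 120-character sequence.
text = to_fasta([('sample_01', 'ACGT' * 30)])
print(text)
```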
Other formats¶
Sequence alignments can be exported to the following formats:
Output format of the ms software (io.to_ms()).
NEXUS format (Align.nexus()).
Phylip phylogenetic software format (Align.phylip()).
PhyML phylogenetic software format (Align.phyml()).
In addition, sequence alignments can be imported from:
Clustal alignment software format (io.from_clustal()).
Staden package software “contig dump” format (io.from_staden()).
Format of the Genalys software, which is discontinued (io.from_genalys()).
Iteration¶
Principle of proxy types¶
Both Align and Container classes are iterable (that is, they support the
expression for item in aln4, taking the alignment imported in the example
below). Iteration steps yield instances of a specialized type, named
SampleView, which represents one sample of an Align/Container: the name is
accessible as the property name, the sequence as sequence, and the list of
group labels as labels (see example below). The name is a standard str
instance, but the sequence and the list of group labels (see Structure and
group labels) are other specialized types (SequenceView and LabelView,
respectively).
In total, SampleView instances have the following properties:
| Attribute | Type | Meaning |
| name | str | Sample name |
| sequence | SequenceView | Array of genetic data |
| labels | LabelView | Array of labels |
| | | Reference to owner |
| | | Sample index in parent |
| | | Number of items for this sample |
All three types, SampleView, SequenceView, and LabelView, are proxy classes
similar in principle to dictionary views: they do not contain a deep copy of
the Align/Container data but rather act as a proxy to edit it more
conveniently. As with dictionary views, if the content of the alignment
changes, the data accessible from the proxy might change or even disappear
(causing an error if one tries to access data that are no longer available),
as we will show later. Compared with dictionary views, Align/Container proxy
types allow a wider range of editing operations, which will be addressed
later in this manual.
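The analogy with dictionary views can be made concrete with a short plain-Python sketch. The first part uses genuine dictionary views; the second part defines RowView, a hypothetical proxy class written for this illustration only, which is far simpler than SampleView and is not how EggLib implements its proxies.

```python
# Part 1: a dictionary view reflects later changes to its dictionary.
samples = {'sam1': 'ACGT', 'sam2': 'ACGA'}
names = samples.keys()      # a view, not a copy
samples['sam3'] = 'ACGG'
print(list(names))          # the new key is visible through the view

# Part 2: a hypothetical proxy to one row of a list of [name, sequence] rows.
class RowView:
    def __init__(self, parent, index):
        self._parent = parent   # reference to the owner, no data copied
        self._index = index     # position of the row in the owner

    @property
    def name(self):
        return self._parent[self._index][0]

    @name.setter
    def name(self, value):
        self._parent[self._index][0] = value

table = [['sam1', 'ACGT'], ['sam2', 'ACGA']]
view = RowView(table, 1)
view.name = 'outgroup'      # editing through the proxy changes the owner
renamed = table[1][0]

del table[1]                # the row the proxy points to no longer exists
stale = False
try:
    view.name
except IndexError:
    stale = True            # accessing vanished data raises an error
print(renamed, stale)
```

The design point is the same in both parts: the view holds a reference to its owner and an index or key, so it stays cheap to create but is only valid as long as the owner still holds the data it points to.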
Example¶
The example below shows how to display the names of all samples of an alignment by iterating over its items:
>>> aln4 = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> print(aln4.ns)
7
>>> for item in aln4:
...     print(item.name)
sam1
sam2
sam3
sam4
sam5
sam6
outgroup