Accessing and editing data¶
In general, Align/Container provide two ways to
access/edit data they contain: through the methods of
SampleView instances returned by the iterators, and through
direct methods. Those two approaches are equivalent and your choice
should be guided by readability first.
Accessing a given sample¶
SampleView instances can be directly accessed using their
index without requiring iteration using Align’s method
get_sample(). The bracket operator [] is a synonym.
Here is an illustrative example (both lines show the name of the first
sample):
>>> print(aln4[0].name)
sample1
>>> print(aln4.get_sample(0).name)
sample1
It is also possible to use the method find() to access
a SampleView based on its name.
Editing names¶
Reading or setting the name of a sample in Container/Align
instances is straightforward. The name property of SampleView is a standard str and
can be set to any new value, as long as it is, or can be cast to, a str.
Alternatively, both Container and Align have methods to get and set the
name of a sample. All techniques are listed in the table below, with item being a
SampleView instance, obj being either an Align
or Container instance, i a sample index, and x the new name:
| Expression | Result |
|---|---|
| item.name | Get the name of a sample (1). |
| item.name = x | Set the name of a sample to x (1, 2). |
| obj.get_name(i) | Get the name of the sample at index i. |
| obj.set_name(i, x) | Set the name of the sample at index i to x (2). |

Notes:
(1) The identity of the sample is defined by the origin of the SampleView instance.
(2) x must be a str.
The example below shows that those approaches are equivalent, and also demonstrates
that the content available through a SampleView is updated whenever
the underlying Align is modified, even when the change is made by other means:
>>> item = aln4.get_sample(0)
>>> print(item.name)
sample1
>>> print(aln4.get_name(0))
sample1
>>> aln4.set_name(0, 'another name')
>>> print(item.name)
another name
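The behaviour shown above follows from SampleView being a proxy rather than a copy. The sketch below is a toy illustration of that idea in plain Python. The classes ToyAlign and ToySampleView are hypothetical and are not EggLib's implementation; they only show how a view that stores a reference to its parent and an index always reflects the parent's current content.

```python
class ToyAlign:
    """Toy container storing names in a plain list (not EggLib's real storage)."""

    def __init__(self, names):
        self._names = list(names)

    def get_name(self, index):
        return self._names[index]

    def set_name(self, index, name):
        self._names[index] = str(name)

    def get_sample(self, index):
        # Return a view that remembers its parent and index, not a copy.
        return ToySampleView(self, index)


class ToySampleView:
    """Toy proxy: every access is forwarded to the parent container."""

    def __init__(self, parent, index):
        self._parent = parent
        self._index = index

    @property
    def name(self):
        return self._parent.get_name(self._index)

    @name.setter
    def name(self, value):
        self._parent.set_name(self._index, value)


toy = ToyAlign(['sample1', 'sample2'])
item = toy.get_sample(0)
toy.set_name(0, 'another name')   # edit through the container
print(item.name)                  # the view reflects the change
item.name = 'third name'          # edit through the view
print(toy.get_name(0))            # the container reflects the change
```

Because the view holds no data of its own, both editing routes end up modifying the same underlying list.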
Editing sequences or data entries¶
The SequenceView type¶
SequenceView is another proxy class, managing the sequence of
data for a given sample. It can be obtained from a SampleView
or from the method get_sequence() of both Align
and Container instances. SequenceView instances can
be treated, to some extent, as lists of data values. In particular,
they offer the same functionality for editing the data. There is one
significant limitation: the length of a SequenceView instance
connected to an Align cannot be modified.
Operations using a SequenceView as a list-like instance¶
In the table below, assume seq is a SequenceView
instance, s is a stretch of sequence as a str, c is a
one-character string (although an integer can be accepted depending on
the Alphabet used), i is any valid index (and j a
second one if a slice is needed).
| Expression | Result |
|---|---|
| len(seq) | Get the number of data entries. |
| for v in seq: ... | Iterate over data entries. |
| seq[i] | Access a data item. |
| seq[i:j] | Access a section of the sequence. |
| seq[i] = c | Modify a data item. |
| seq[i:j] = s | Replace a section of the sequence by a new sequence (1). |
| del seq[i] | Delete a data entry (2). |
| del seq[i:j] | Delete a section of the sequence (2). |
| seq.string() | Return the sequence as a str. |
| seq.insert(i, s) | Insert a stretch of sequence (2). |
| seq.find(s) | Find the position of a given motif. |
| seq.upper() | Modify the sequence to contain only upper-case characters (3). |
| seq.lower() | Modify the sequence to contain only lower-case characters (3). |
| seq.strip(s) | Remove left/right occurrences of characters present in s. |

Notes:
(1) Only available for Align instances if the length of the provided stretch matches.
(2) Not available for Align instances.
(3) Only available for instances using a case-sensitive alphabet (excluding DNA).
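To make the fixed-length restriction concrete, here is a minimal sketch of a list-like object that accepts item and same-length slice assignment but rejects anything that would change its length. The class FixedLengthSequence is hypothetical and written only for illustration; it is not how SequenceView is implemented.

```python
class FixedLengthSequence:
    """Toy list-like sequence whose length cannot change (hypothetical class)."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        if isinstance(key, slice):
            # A slice may only be replaced by a stretch of the same length.
            target = range(*key.indices(len(self._data)))
            if len(value) != len(target):
                raise ValueError('replacement must keep the sequence length')
            for pos, item in zip(target, value):
                self._data[pos] = item
        else:
            self._data[key] = value

    def __delitem__(self, key):
        raise TypeError('entries cannot be deleted from a fixed-length sequence')

    def string(self):
        return ''.join(self._data)


seq = FixedLengthSequence('ACCGTGGAGA')
seq[0] = 'T'            # modifying one entry is allowed
seq[1:4] = 'AAA'        # same-length replacement is allowed
print(seq.string())

try:
    seq[1:4] = 'AAAAA'  # would change the length
except ValueError as err:
    print('rejected:', err)
```

The same pattern explains why deletion and insertion are listed as unavailable for sequences that belong to an alignment.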
In addition, one can modify the whole sequence directly through the
SampleView, as in:
>>> item = aln4.get_sample(0)
>>> item.sequence = 'ACCGTGGAGAGCGCGTTGCA'
As noted above, if the original instance is an Align, the
sequence length must remain constant.
Using methods of Align and Container¶
Most of the functionality available through SequenceView is
also available as methods of the Align/Container.
The table below lists the available methods (or properties), where
i is a sample index, j a position, n a number of sites,
c a data entry (either an integer or character, see alphabets), and s a str or a list of data entries.
| Expression | Result |
|---|---|
| obj.ls | Get alignment length (cannot be modified) (1). |
| obj.ls(i) | Length of the sequence for an ingroup sample (2). |
| obj.get_sequence(i) | Get the sequence of a sample as a SequenceView. |
| obj.get(i, j) | Get a data entry of a sample. |
| obj.set(i, j, c) | Set a data entry of a sample. |
| obj.set_sequence(i, s) | Set the whole sequence of a sample. |
| obj.del_sites(i, j, n) | Delete data entries for a sample (2). |
| obj.insert_sites(i, j, s) | Insert a given sequence at a given position for a sample (2). |

Notes:
(1) Only available for Align instances.
(2) Only available for Container instances.
Using module functions¶
A few functions from the tools module can be used with
sequences. Note that they never modify the passed instance. On the
other hand, they can accept sequences as SequenceView or
str instances.
| Function | Operation |
|---|---|
| tools.rc() | Reverse-complement of a DNA sequence. |
| tools.compare() | Check if sequences match (supporting ambiguity characters). |
| tools.regex() | Turn a sequence with ambiguity characters into a regular expression. |
| tools.motif_iter() | Iterate over occurrences of a motif. |
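For readers unfamiliar with these operations, the sketch below shows in plain Python what reverse-complementing and converting ambiguity characters to a regular expression amount to. The helpers toy_rc and toy_regex are hypothetical names written for this illustration, based on the standard IUPAC nucleotide codes; they are not the tools module functions and may differ from them in details such as case handling.

```python
import re

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': 'AG', 'Y': 'CT', 'S': 'CG', 'W': 'AT', 'K': 'GT', 'M': 'AC',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

# Complement of each code (ambiguity codes map to the complementary set).
COMPLEMENT = str.maketrans('ACGTRYSWKMBDHVN', 'TGCAYRSWMKVHDBN')


def toy_rc(seq):
    """Return the reverse-complement of a DNA string as a new string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def toy_regex(motif):
    """Turn a motif with ambiguity codes into a regular expression string."""
    parts = []
    for base in motif.upper():
        bases = IUPAC[base]
        parts.append(bases if len(bases) == 1 else '[' + bases + ']')
    return ''.join(parts)


sequence = 'ACCGTGGAGAGCGCGTTGCA'
print(toy_rc(sequence))          # a new string; the input is left untouched
pattern = toy_regex('GNGAR')
print(pattern)
print([m.start() for m in re.finditer(pattern, sequence)])
```

Both helpers return new objects, in line with the statement above that the module functions never modify the passed instance.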
Editing labels¶
Using LabelView¶
Compared with sequences, lists of labels are relatively simple.
However, there is also a specialized proxy class, LabelView. Objects of this
type behave, to a limited extent, like a list of strings. It is not possible to delete any item
from a LabelView.
The supported operations are listed in the table below, where grp is a LabelView,
i a level index, and v a label value:
| Expression | Result |
|---|---|
| len(grp) | Get the number of label levels. |
| grp[i] | Access a label level. |
| grp[i] = v | Modify a label level. |
| for v in grp: ... | Iterate over group labels. |
| grp.append(v) | Append a label. |
Using methods of Align and Container¶
The methods (and one property) that can be used to edit group labels are listed below,
where n is a non-negative integer, i is a sample index, j is the
index of a group level and g is a group label:
| Expression | Result |
|---|---|
| obj.get_label(i, j) | Get one of the group labels of a sample. |
| obj.set_label(i, j, g) | Set one of the group labels of a sample. |
Initializing instances¶
We have seen how to create Container and Align instances
initialized from the content of a Fasta-formatted sequence file. In
Coalescent simulations we will see how to generate data sets using coalescent
simulations. Several methods exist to create sequence set objects with
more flexibility.
Creating from empty instances¶
The default constructors of Container and Align
return empty instances that can later be filled manually with the
methods described in the following sections. In addition, the
Align constructor allows one to initialize the instance to
specified dimensions, with an optional user-specified initial value
for all data entries, as shown in the example below:
>>> aln5 = egglib.Align(alphabet=egglib.alphabets.DNA)
>>> print(aln5.ns, aln5.ls)
0 0
>>> aln6 = egglib.Align(nsam=6, nsit=4, init='N',alphabet=egglib.alphabets.DNA)
>>> print(aln6.ns, aln6.ls)
6 4
>>> print(aln6.fasta())
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
Deep copy of Align and Container instances¶
Both Align and Container have a class method
create() that returns a new instance initialized from the
content of the provided argument. There can be several uses for that
functionality, and one of them is performing a deep copy of an instance.
For example, let us assume one wants to create an independent copy of an
alignment. The approach shown below will not work as intended:
>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = aln
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string()) # aln and copy refer to the same object!
CCTCCTCCTCCTCCTCCTCT
This prints the string CCTCCTCCTCCTCCTCCTCT because aln and
copy are actually references to the same underlying object (see
this FAQ
in the Python documentation). The class method create() makes it possible
to perform a proper deep copy, as demonstrated in the code below, where
copy is created as an object independent of
aln:
>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = egglib.Align.create(aln)
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string())
ACCGTGGAGAGCGCGTTGCA
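The distinction between a second reference and an independent copy is a general Python matter, not something specific to EggLib. The sketch below reproduces it with plain nested lists and the standard copy module; the toy data structure is an assumption made for the illustration and does not reflect how Align stores its data.

```python
import copy

# A toy alignment: a list of [name, sequence-as-list] records.
aln = [['sample1', list('ACCGTGGAGA')], ['sample2', list('ACCGTGGAGA')]]

alias = aln                    # second name for the same object
shallow = list(aln)            # new outer list, but the records are shared
deep = copy.deepcopy(aln)      # fully independent copy

aln[0][1][0] = 'T'             # edit the first data entry of the first sample

print(''.join(alias[0][1]))    # changed: alias is the same object as aln
print(''.join(shallow[0][1]))  # changed: inner lists are shared
print(''.join(deep[0][1]))     # unchanged: the deep copy is independent
```

Only the deep copy is insulated from later edits, which is the behaviour create() provides in the example above.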
Conversion between Align and Container instances¶
Another use of create() is to convert between Align and
Container types. It is possible to make a Container copy of
an Align as in:
>>> cnt = egglib.Container.create(aln)
Obviously, the opposite (from Container to Align) requires that
all sequences have the same length. For example, suppose that we have an alignment that
has, for some reason, a longer sequence, as in:
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
>sample4
ACCGTGGAGAGCGCGTTGCA
You must import this data set as a Container. The code below shows
that the resulting instance is a Container (the property
is_matrix is another way to tell if an object is an
Align), and confirms that the third sequence is longer:
>>> cnt = egglib.io.from_fasta('sequences2.fas', alphabet=egglib.alphabets.DNA)
>>> print(type(cnt))
<class 'egglib._interface.Container'>
>>> print(cnt.is_matrix)
False
>>> print(cnt.ls(0))
20
>>> print(cnt.ls(2))
27
After cropping the longer sequence such that all sequences have the same length,
we can turn the Container into an Align:
>>> cnt.del_sites(2, 20, 7)
>>> aln = egglib.Align.create(cnt)
>>> print(aln.is_matrix)
True
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA
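The call del_sites(2, 20, 7) above reads as "for sample 2, starting at position 20, delete 7 entries". A plain-Python equivalent on a list of lists is sketched below; toy_del_sites is a hypothetical helper written for this illustration, not an EggLib function.

```python
def toy_del_sites(sequences, i, j, n):
    """Delete n entries starting at position j from sequence i (in place)."""
    del sequences[i][j:j + n]


seqs = [
    list('ACCGTGGAGAGCGCGTTGCA'),
    list('ACCGTGGAGAGCGCGTTGCA'),
    list('ACCGTGGAGAGCGCGTTGCATTAAGTA'),   # 7 extra entries
    list('ACCGTGGAGAGCGCGTTGCA'),
]

print([len(s) for s in seqs])   # lengths before cropping
toy_del_sites(seqs, 2, 20, 7)
print([len(s) for s in seqs])   # lengths after cropping
print(''.join(seqs[2]))
```

Once all lengths agree, the data satisfy the requirement for conversion to an alignment.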
Creation from other iterable types¶
Besides Align and Container instances, the method
create() supports any compatible iterable object. To be
compatible, an object must yield, during iteration, (name,
sequence) or (name, sequence, groups) items, where name is a
name string, sequence is a sequence string (or a list of data
entries), and groups (which may be omitted) is a list of group
labels. To create an Align, all
sequences must have the same length. Typically, instances can be created from
lists in this way:
>>> aln = egglib.Align.create([('sample1', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample2', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample3', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample4', 'ACCGTGGAGAGCGCGTTGCA')],
... alphabet = egglib.alphabets.DNA)
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA
The code above re-creates the alignment discussed in the previous
section. Note that Container has a method,
equalize(), that appends stretches of ? to the
end of sequences so that all
sequences have the same length. The Container
could then be converted to an Align using Align.create(),
but this is probably not what you want if your goal is to align the
sequences.
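As an illustration of what a compatible iterable can look like, the sketch below defines a generator that yields (name, sequence) tuples from FASTA-formatted text, a check that all sequences have the same length, and a padding helper in the spirit of equalize(). The names iter_fasta, same_length and pad_right are hypothetical and written in plain Python; they do not call EggLib, and pad_right only mimics the described effect of equalize() rather than reproducing its implementation.

```python
def iter_fasta(text):
    """Yield (name, sequence) tuples from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


def same_length(records):
    """Return True if all sequences have the same length."""
    return len({len(seq) for _, seq in records}) <= 1


def pad_right(records, filler='?'):
    """Return new records padded at the end to the longest length."""
    longest = max((len(seq) for _, seq in records), default=0)
    return [(name, seq + filler * (longest - len(seq))) for name, seq in records]


fasta_text = """\
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCATTAAGTA
"""

records = list(iter_fasta(fasta_text))
print(same_length(records))     # lengths differ, so not usable as an alignment
padded = pad_right(records)
print(same_length(padded))      # equal after padding
print(padded[0][1])
```

Padding makes the lengths equal but does not align homologous positions, which is why the text above cautions against treating the padded result as a real alignment.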
Add/remove samples¶
Both Align and Container support the following operations
to change the list of samples of an instance:
Method |
Syntax |
Action |
|---|---|---|
|
Add a sample |
|
|
Add several samples |
|
|
Delete a sample |
|
|
Remove all samples |
|
|
Remove duplicates |
Editing alignments¶
Align instances have additional methods for extracting, inserting, or deleting
sections of the alignment:
Method |
Syntax |
Action |
|---|---|---|
|
Extract a site as a list (1) |
|
|
Insert columns at a given position |
|
|
Delete one or more columns |
|
|
Extract a specified range of positions |
|
|
Extract exon positions based on a |
|
|
Extract an arbitrary list of positions |
|
|
Generate a new instance with selected samples (1) |
|
|
Insert non-varying sites randomly |
|
|
Insert missing data randomly (1) |
|
|
Replace alignment gaps at ends by missing data |
Note:
(1) Also available for Container.
The following functions lie in the tools module and provide additional functionalities to manipulate alignments:
Function |
Syntax |
Action |
|---|---|---|
|
Concatenate alignments |
|
|
Remove all gaps from an alignment |
|
|
Remove sites with too many gaps |
|
|
Align (unaligned) nucleotide sequences based on an amino acid alignment |
