Accessing and editing data
In general, Align/Container provide two ways to access/edit data they
contain: through the methods of SampleView instances returned by the
iterators, and through direct methods. Those two approaches are
equivalent and your choice should be guided by readability first.
Accessing a given sample
SampleView instances can be accessed directly by index, without
iterating, using Align's method get_sample(). The bracket operator []
is a synonym. Here is an illustrative example (both lines show the name
of the first sample):
>>> print(aln4[0].name)
sam1
>>> print(aln4.get_sample(0).name)
sam1
It is also possible to use the method find() to access a SampleView
based on its name.
Editing names
There is no trick to reading or setting the name of a sample in
Container/Align instances. The name property of SampleView is a standard
str and can be set to any new value, as long as it is, or can be cast
to, a str. Alternatively, both Container and Align have methods to get
and set the name of a sample. All techniques are listed in the table
below, with item being a SampleView instance, and obj being either an
Align or Container instance:
| Expression | Result |
|---|---|
| | Get the name of a sample (1). |
| | Set the name of a sample to |
| | Get the name of the |
| | Set the name of the |

- Notes:
  (1) The identity of the sample is defined by the origin of the SampleView instance.
  (2) x must be a str.
The example below shows that these approaches are equivalent, and also
demonstrates that the content available through a SampleView changes
whenever the underlying Align is modified, even by other means:
>>> item = aln4.get_sample(0)
>>> print(item.name)
sample1
>>> print(aln4.get_name(0))
sample1
>>> aln4.set_name(0, 'another name')
>>> print(item.name)
another name
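The behaviour shown above is what one would expect from a proxy that stores no data of its own. The following is a minimal plain-Python sketch of that idea, not EggLib's implementation: MiniAlign and MiniSampleView are hypothetical toy classes, written only to illustrate why an edit made through the parent is visible through the view, and the other way round.

```python
class MiniAlign:
    """Toy stand-in for an alignment (hypothetical, not an EggLib class)."""

    def __init__(self, samples):
        self._names = [name for name, _ in samples]
        self._seqs = [seq for _, seq in samples]

    def get_name(self, index):
        return self._names[index]

    def set_name(self, index, name):
        self._names[index] = str(name)

    def get_sample(self, index):
        return MiniSampleView(self, index)


class MiniSampleView:
    """Toy proxy: keeps only (parent, index) and reads the parent on each access."""

    def __init__(self, parent, index):
        self._parent = parent
        self._index = index

    @property
    def name(self):
        return self._parent.get_name(self._index)

    @name.setter
    def name(self, value):
        self._parent.set_name(self._index, value)


aln = MiniAlign([("sample1", "ACCGT"), ("sample2", "ACCGA")])
item = aln.get_sample(0)

aln.set_name(0, "another name")        # edit through the parent
seen_through_view = item.name          # the view reports the new value

item.name = "third name"               # edit through the view
seen_through_parent = aln.get_name(0)  # the parent reports the new value

print(seen_through_view, "|", seen_through_parent)
```

Because the view holds a reference to its parent rather than a copy of the name, there is no state to keep in sync.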
Editing sequences or data entries
The SequenceView type
SequenceView is another proxy class, managing the sequence of data for
a given sample. It can be obtained from a SampleView or from the method
get_sequence() of both Align and Container instances. SequenceView
instances can be treated, to some extent, as lists of data values. In
particular, they offer the same functionalities for editing the data.
There is one significant limitation: the length of a SequenceView
instance connected to an Align cannot be modified.
Operations using a SequenceView as a list-like instance
In the table below, assume seq is a SequenceView instance, s is a
stretch of sequence as a str, c is a one-character string (although an
integer can be accepted depending on the Alphabet used), i is any valid
index (and j a second one if a slice is needed).
| Expression | Result |
|---|---|
| | Get the number of data entries. |
| | Iterate over data entries. |
| | Access a data item. |
| | Access a section of the sequence. |
| | Modify a data item. |
| | Replace a section of the sequence by a new sequence (1). |
| | Delete a data entry (2). |
| | Delete a section of the sequence (2). |
| | Return the sequence as a |
| | Insert a stretch of sequence (2). |
| | Find the position of a given motif. |
| | Modify the sequence to contain only upper-case characters (3). |
| | Modify the sequence to contain only lower-case characters (3). |
| | Remove left/right occurrences of characters present in |

- Notes:
  (1) Only available for Align instances if the length of the provided stretch matches.
  (2) Not available for Align instances.
  (3) Only available for instances using a case-sensitive alphabet (excluding DNA).
In addition, one can modify the whole sequence directly through the
SampleView, as in:
>>> item = aln4.get_sample(0)
>>> item.sequence = 'ACCGTGGAGAGCGCGTTGCA'
Obviously, and again, if the original instance is an Align, the
sequence length must be kept constant.
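The length constraint can be pictured with a small plain-Python sketch. It is only an illustration under stated assumptions: replace_sequence is a hypothetical helper, plain lists of strings stand in for Align (fixed length) and Container (free length), and ValueError is a placeholder, since the text above does not say which exception EggLib raises.

```python
def replace_sequence(sequences, index, new_sequence, fixed_length):
    """Replace one sequence in place; refuse a length change if fixed_length is True."""
    if fixed_length and len(new_sequence) != len(sequences[index]):
        # Placeholder exception type, chosen for this sketch only.
        raise ValueError(
            f"expected {len(sequences[index])} sites, got {len(new_sequence)}"
        )
    sequences[index] = new_sequence


aligned = ["ACCGT", "ACCGA"]
replace_sequence(aligned, 0, "TTTTT", fixed_length=True)    # same length: accepted

try:
    replace_sequence(aligned, 1, "TTT", fixed_length=True)  # shorter: rejected
    rejected = False
except ValueError:
    rejected = True

unaligned = ["ACCGT", "ACCGA"]
replace_sequence(unaligned, 1, "TTT", fixed_length=False)   # no constraint applies

print(aligned, rejected, unaligned)
```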
Using methods of Align and Container
Most of the functionality available through SequenceView is also
available as methods of the Align/Container. The table below lists the
available methods (or properties), where i is a sample index, j a
position, n a number of sites, c a data entry (either an integer or
character, see alphabets), and s a str or a list of data entries.
| Expression | Result |
|---|---|
| | Get alignment length (cannot be modified) (1). |
| | Length of the sequence for an ingroup sample (2). |
| | Get the sequence of a sample as a |
| | Get a data entry of a sample. |
| | Set a data entry of a sample. |
| | Set the whole sequence of a sample. |
| | Delete data entries for a sample (2). |
| | Insert a given sequence at a given position for a sample (2). |

- Notes:
  (1) Only available for Align instances.
  (2) Only available for Container instances.
Using module functions
A few functions from the tools module can be used with sequences. Note
that they never modify the passed instance. On the other hand, they can
accept sequences as SequenceView or str instances.
| Function | Operation |
|---|---|
| | Reverse-complement of a DNA sequence. |
| | Check if sequences match (supporting ambiguity characters). |
| | Turn a sequence with ambiguity characters to a regular expression. |
| | Iterate over occurrences of a motif. |
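To make these operations concrete, here is a plain-Python sketch of three of them working on upper-case str input. It is not the tools module and makes no claim about how EggLib implements them: reverse_complement, motif_to_regex and iter_motif are hypothetical names, and the mapping uses the standard IUPAC nucleotide ambiguity codes.

```python
import re

# Complement table for unambiguous bases (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard IUPAC nucleotide codes expressed as regular-expression fragments.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def reverse_complement(seq):
    """Return a new string; the argument is left untouched."""
    return seq.translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Turn a motif with ambiguity codes into a regular-expression string."""
    return "".join(IUPAC[base] for base in motif.upper())


def iter_motif(seq, motif):
    """Yield start positions of all occurrences, including overlapping ones."""
    # The lookahead makes each match zero-width, so overlaps are not skipped.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    for match in pattern.finditer(seq):
        yield match.start()


print(reverse_complement("ACCGTG"))
print(motif_to_regex("ACRT"))
print(list(iter_motif("ACATACGTAC", "ACRT")))
```

All three helpers return new objects, which mirrors the statement above that the passed instance is never modified.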
Editing labels
Using LabelView
In comparison to sequences, lists of labels are relatively simple.
However, there is also a specialized proxy class, LabelView. Objects of
this type behave to a limited extent like a list of strings. It is not
possible to delete any item from a LabelView. The supported functions
are listed in the table below, where grp is a LabelView, i a level
index, and v a label value:
| Expression | Result |
|---|---|
| | Get the number of label levels. |
| | Access a label level. |
| | Modify a label level. |
| | Iterate over group labels. |
| | Append a label. |
Using methods of Align and Container
The methods (and one property) that allow editing group labels are
listed below, where n is a non-negative integer, i is a sample index,
j is the index of a group level and g is a group label:
| Expression | Result |
|---|---|
| | Get one of the group labels of a sample. |
| | Set one of the group labels of a sample. |
Initializing instances
We have seen how to create Container and Align instances initialized
from the content of a Fasta-formatted sequence file. In Coalescent
simulations we will see how to generate data sets using coalescent
simulations. Several methods exist to create sequence set objects with
more flexibility.
Creating from empty instances
The default constructors of Container and Align return empty instances
that can later be filled manually with the methods described in the
following sections. In addition, the Align constructor allows one to
initialize the instance to specified dimensions, with optional
user-specified initial values for all data entries, as shown in the
example below:
>>> aln5 = egglib.Align(alphabet=egglib.alphabets.DNA)
>>> print(aln5.ns, aln5.ls)
0 0
>>> aln6 = egglib.Align(nsam=6, nsit=4, init='N',alphabet=egglib.alphabets.DNA)
>>> print(aln6.ns, aln6.ls)
6 4
>>> print(aln6.fasta())
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
Deep copy of Align and Container instances
Both Align and Container have a class method create() that returns a
new instance initialized from the content of the provided argument.
There can be several uses for that functionality, and one of them is
performing a deep copy of an instance. For example, let us assume one
wants to create an independent copy of an alignment. The approach
exemplified below will not work as wanted:
>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = aln
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string()) # aln and copy refer to the same object!
CCTCCTCCTCCTCCTCCTCT
This results in the string CCTCCTCCTCCTCCTCCTCT since aln and copy are
actually references to the same underlying object (see this FAQ in the
Python documentation). The class method create() makes a proper deep
copy, as demonstrated in the code below, where copy is created in such a
way that it is an object independent of aln:
>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = egglib.Align.create(aln)
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string())
ACCGTGGAGAGCGCGTTGCA
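The difference between the two snippets above is ordinary Python semantics and can be reproduced without EggLib. The sketch below uses a nested list as a stand-in for an alignment and the standard copy module; it only illustrates assignment versus deep copy and says nothing about how create() works internally.

```python
import copy

# Stand-in for an alignment: a list of [name, sequence] records.
original = [
    ["sample1", "ACCGTGGAGAGCGCGTTGCA"],
    ["sample2", "ACCGTGGAGAGCGCGTTGCA"],
]

alias = original                       # a second name for the same object
independent = copy.deepcopy(original)  # a separate object with its own records

original[0][1] = "CCTCCTCCTCCTCCTCCTCT"

print(alias is original)     # the alias is the very same object
print(alias[0][1])           # so it shows the edited sequence
print(independent[0][1])     # the deep copy keeps the earlier sequence
```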
Conversion between Align and Container instances
Another use of create() is to convert between Align and Container
types. It is possible to make a Container copy of an Align as in:
>>> cnt = egglib.Container.create(aln)
Obviously, the opposite (from Container to Align) requires that all
sequences have the same length. For example, suppose that we have an
alignment that has, for some reason, a longer sequence, as in:
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
>sample4
ACCGTGGAGAGCGCGTTGCA
You must import this data set as a Container. The code below shows that
the resulting instance is a Container (the property is_matrix is another
way to tell if an object is an Align), and confirms that the third
sequence is longer:
>>> cnt = egglib.io.from_fasta('sequences2.fas', alphabet=egglib.alphabets.DNA)
>>> print(type(cnt))
<class 'egglib._interface.Container'>
>>> print(cnt.is_matrix)
False
>>> print(cnt.ls(0))
20
>>> print(cnt.ls(2))
27
After cropping the longer sequence such that all sequences have the same
length, we can turn the Container into an Align:
>>> cnt.del_sites(2, 20, 7)
>>> aln = egglib.Align.create(cnt)
>>> print(aln.is_matrix)
True
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA
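The precondition for this conversion, namely that all sequences share one length, can be checked and enforced on plain (name, sequence) pairs. The sketch below is a hedged illustration using only built-in types: is_rectangular and crop_to_shortest are hypothetical helpers, not EggLib functions, applied to the same four sequences as above.

```python
def is_rectangular(samples):
    """True if every sequence has the same length (or there are no samples)."""
    return len({len(seq) for _, seq in samples}) <= 1


def crop_to_shortest(samples):
    """Return new (name, sequence) pairs truncated to the shortest length."""
    shortest = min(len(seq) for _, seq in samples)
    return [(name, seq[:shortest]) for name, seq in samples]


samples = [
    ("sample1", "ACCGTGGAGAGCGCGTTGCA"),
    ("sample2", "ACCGTGGAGAGCGCGTTGCA"),
    ("sample3", "ACCGTGGAGAGCGCGTTGCATTAAGTA"),  # 7 extra sites
    ("sample4", "ACCGTGGAGAGCGCGTTGCA"),
]

cropped = crop_to_shortest(samples)
print(is_rectangular(samples), is_rectangular(cropped))
print([len(seq) for _, seq in cropped])
```

Cropping from the right is only one possible choice; whether it is appropriate depends on why the sequence was longer in the first place.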
Creation from other iterable types
Besides Align and Container instances, the method create() supports any
compatible iterable object. To be compatible, an object must return,
during iteration, (name, sequence) or (name, sequence, groups) items,
where name is a name string, sequence is a sequence string (or a list of
data entries), and groups (which may be omitted) is a list of group
labels. For creating an Align, it is required that all sequences match
in length. Typically, instances can be created from lists this way:
>>> aln = egglib.Align.create([('sample1', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample2', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample3', 'ACCGTGGAGAGCGCGTTGCA'),
... ('sample4', 'ACCGTGGAGAGCGCGTTGCA')],
... alphabet = egglib.alphabets.DNA)
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA
The code above re-creates the alignment discussed in the previous
section. Note that there is a method of Container, equalize(), that
inserts stretches of ? at the end of sequences of a Container in order
to have all sequences of the same length. In such case, the Container
could be converted to an Align using Align.create(), but this is
probably not what you want to do if you want to align sequences.
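Any object that yields (name, sequence) items fits the protocol described in this section, so a generator works as well as a list. The sketch below, in plain Python, builds such a generator from Fasta-formatted text and then pads the shorter sequences with ? in the spirit of the padding described above. Both iter_fasta and pad_to_longest are hypothetical helpers written for this illustration; they are not part of EggLib and do not reproduce its parser or its equalize() method.

```python
def iter_fasta(text):
    """Yield (name, sequence) tuples from Fasta-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def pad_to_longest(samples, filler="?"):
    """Return new pairs where shorter sequences are right-padded with filler."""
    samples = list(samples)
    longest = max((len(seq) for _, seq in samples), default=0)
    return [(name, seq.ljust(longest, filler)) for name, seq in samples]


text = """>sample1
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
"""

pairs = list(iter_fasta(text))
padded = pad_to_longest(pairs)
for name, seq in padded:
    print(name, seq)
```

The padded pairs all have the same length, but the added positions carry no information, which is why padding is not a substitute for actually aligning the sequences.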
Add/remove samples
Both Align and Container support the following operations to change the
list of samples of an instance:
| Method | Syntax | Action |
|---|---|---|
| | | Add a sample |
| | | Add several samples |
| | | Delete a sample |
| | | Remove all samples |
| | | Remove duplicates |
Editing alignments
Align instances have additional methods to extract or delete sections of
the alignment:
| Method | Syntax | Action |
|---|---|---|
| | | Extract a site as a list (1) |
| | | Insert columns at a given position |
| | | Delete one or more columns |
| | | Extract a specified range of positions |
| | | Extract exon positions based on a |
| | | Extract an arbitrary list of positions |
| | | Generate a new instance with selected samples (1) |
| | | Insert non-varying sites randomly |
| | | Insert missing data randomly (1) |
| | | Replace alignment gaps at ends by missing data |

- Note:
  (1) Also available for Container.
The following functions lie in the tools module and provide additional functionalities to manipulate alignments:
| Function | Syntax | Action |
|---|---|---|
| | | Concatenate alignments |
| | | Remove all gaps from an alignment |
| | | Remove sites with too many gaps |
| | | Align (unaligned) nucleotide sequences based on an amino acid alignment |