Accessing and editing data¶

In general, Align/Container provide two ways to access/edit data they contain: through the methods of SampleView instances returned by the iterators, and through direct methods. Those two approaches are equivalent and your choice should be guided by readability first.

Accessing a given sample¶

SampleView instances can be accessed directly by index, without iteration, using Align’s method get_sample(). The bracket operator [] is a synonym. Here is an illustrative example (both lines show the name of the first sample):

>>> print(aln4[0].name)
sample1
>>> print(aln4.get_sample(0).name)
sample1

It is also possible to use the method find() to access a SampleView based on its name.

Editing names¶

Reading or setting the name of a sample in Container/Align instances is straightforward. The name property of SampleView is a standard str and can be set to any new value, as long as it is, or can be cast to, a str. Alternatively, both Container and Align have methods to get and set the name of a sample. All techniques are listed in the table below, with item being a SampleView instance, and obj being either an Align or a Container instance:

Expression	Result
`item.name`	Get the name of a sample (1).
`item.name = x`	Set the name of a sample to `x` (1)(2).
`obj.get_name(i)`	Get the name of the `i`th sample.
`obj.set_name(i,x)`	Set the name of the `i`th sample to `x` (2).

Notes:

(1) The identity of the sample is defined by the origin of the SampleView instance.
(2) x must be a str.

The example below shows that those approaches are equivalent, and also demonstrates that the content available through a SampleView changes whenever the underlying Align is modified, even if the modification is made by other means:

>>> item = aln4.get_sample(0)
>>> print(item.name)
sample1
>>> print(aln4.get_name(0))
sample1
>>> aln4.set_name(0, 'another name')
>>> print(item.name)
another name
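
The behaviour shown above is what one would expect from a proxy that keeps a reference to its parent and an index rather than a copy of the data. The sketch below mimics this with two hypothetical toy classes, ToySampleView and ToyAlign, which are not part of EggLib and say nothing about how EggLib is implemented. It only illustrates why an edit made through one route is visible through the other.

```python
class ToySampleView:
    """Hypothetical proxy: stores a parent and an index, never the data."""

    def __init__(self, parent, index):
        self._parent = parent
        self._index = index

    @property
    def name(self):
        # Always read from the parent, so the value cannot become stale.
        return self._parent.get_name(self._index)

    @name.setter
    def name(self, value):
        self._parent.set_name(self._index, value)


class ToyAlign:
    """Hypothetical container holding only sample names (not EggLib's Align)."""

    def __init__(self, names):
        self._names = list(names)

    def get_name(self, i):
        return self._names[i]

    def set_name(self, i, value):
        self._names[i] = str(value)

    def get_sample(self, i):
        return ToySampleView(self, i)


toy = ToyAlign(['sample1', 'sample2'])
item = toy.get_sample(0)

toy.set_name(0, 'another name')           # edit through the container
name_seen_by_view = item.name             # the view sees the new value

item.name = 'third name'                  # edit through the view
name_seen_by_container = toy.get_name(0)  # the container sees it too
```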

Editing sequences or data entries¶

The `SequenceView` type¶

SequenceView is another proxy class, managing the sequence of data for a given sample. It can be obtained from a SampleView or from the method get_sequence() of both Align and Container instances. SequenceView instances can be treated, to some extent, as lists of data values. In particular, they offer the same functionalities for editing the data. There is one significant limitation: the length of a SequenceView instance connected to an Align cannot be modified.

Operations using a `SequenceView` as a list-like instance¶

In the table below, assume seq is a SequenceView instance, s is a stretch of sequence as a str, c is a one-character string (although an integer can be accepted depending on the Alphabet used), i is any valid index (and j a second one if a slice is needed).

Expression	Result
`len(seq)`	Get the number of data entries.
`for v in seq`	Iterate over data entries.
`seq[i]`	Access a data item.
`seq[i:j]`	Access a section of the sequence.
`seq[i] = c`	Modify a data item.
`seq[i:j] = s`	Replace a section of the sequence by a new sequence (1).
`del seq[i]`	Delete a data entry (2).
`del seq[i:j]`	Delete a section of the sequence (2).
`seq.string()`	Return the sequence as a `str`.
`seq.insert(i, s)`	Insert a stretch of sequence (2).
`seq.find(s)`	Find the position of a given motif.
`seq.upper()`	Modify the sequence to contain only upper-case characters (3).
`seq.lower()`	Modify the sequence to contain only lower-case characters (3).
`seq.strip(s)`	Remove left/right occurrences of characters present in `s`.

Notes:

(1) Only available for Align instances if the length of the provided stretch matches.
(2) Not available for Align instances.
(3) Only available for instances using a case-sensitive alphabet (excluding DNA).

In addition, one can modify the whole sequence directly through the SampleView, as in:

>>> item = aln4.get_sample(0)
>>> item.sequence = 'ACCGTGGAGAGCGCGTTGCA'

As before, if the original instance is an Align, the sequence length must remain unchanged.
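
The length rule for sequences that belong to an alignment can be pictured with plain Python lists. The helper below, set_slice(), is a hypothetical function written for this illustration (it is not an EggLib call): it performs a slice replacement and, when fixed_length is true, refuses any replacement that would change the length of the sequence.

```python
def set_slice(seq, i, j, s, fixed_length):
    """Replace seq[i:j] by s in place; seq is a plain list of characters.

    With fixed_length=True (mimicking a sequence that is part of an
    alignment), a replacement of a different length is rejected.
    """
    old_length = len(seq[i:j])
    if fixed_length and len(s) != old_length:
        raise ValueError(
            'replacement must have length %d, got %d' % (old_length, len(s)))
    seq[i:j] = list(s)


# Same-length replacement: accepted even when the length is fixed.
aligned = list('ACCGTGGAGA')
set_slice(aligned, 2, 5, 'NNN', fixed_length=True)

# Longer replacement: rejected when the length is fixed.
try:
    set_slice(aligned, 2, 5, 'NNNNN', fixed_length=True)
    rejected = False
except ValueError:
    rejected = True

# Without the constraint, the sequence simply grows.
free = list('ACCGTGGAGA')
set_slice(free, 2, 5, 'NNNNN', fixed_length=False)
```

Deletion and insertion always change the length, which is consistent with those operations being unavailable for Align instances in the table above.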

Using methods of `Align` and `Container`¶

Most of the functionality available through SequenceView is also available as methods of the Align/Container. The table below lists the available methods (or properties), where i is a sample index, j a position, n a number of sites, c a data entry (either an integer or character, see alphabets), and s a str or a list of data entries.

Expression	Result
`aln.ls`	Get alignment length (cannot be modified) (1).
`cnt.ls(i)`	Length of the sequence for an ingroup sample (2).
`obj.get_sequence(i)`	Get the sequence of a sample as a `SequenceView`.
`obj.get_i(i,j)`	Get a data entry of a sample.
`obj.set_i(i,j,c)`	Set a data entry of a sample.
`obj.set_sequence(i,s)`	Set the whole sequence of a sample.
`cnt.del_sites(i,j,n)`	Delete data entries for a sample (2).
`cnt.insert_sites(i,j,s)`	Insert a given sequence at a given position for a sample (2).

Notes:

(1) Only available for Align instances.
(2) Only available for Container instances.

Using module functions¶

A few functions from the tools module can be used with sequences. They never modify the passed instance, and they accept sequences as either SequenceView or str instances.

Function	Operation
`tools.rc()`	Reverse-complement of a DNA sequence.
`tools.compare()`	Check if sequences match (supporting ambiguity characters).
`tools.regex()`	Turn a sequence with ambiguity characters to a regular expression.
`tools.motif_iter()`	Iterate over occurrences of a motif.
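
To make the operations in this table concrete, here is a pure-Python sketch of the underlying ideas, working on str values only. It is not EggLib's implementation and does not call EggLib: the function names are hypothetical, the complement table covers only A, C, G, T and N (an assumption of this sketch), and the regular expression is built from the standard IUPAC nucleotide ambiguity codes.

```python
import re

# Complement table limited to unambiguous bases and N (sketch assumption).
_COMPLEMENT = str.maketrans('ACGTNacgtn', 'TGCANtgcan')

# Standard IUPAC nucleotide ambiguity codes mapped to regex fragments.
_IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': '[AG]', 'Y': '[CT]', 'S': '[CG]', 'W': '[AT]',
    'K': '[GT]', 'M': '[AC]',
    'B': '[CGT]', 'D': '[AGT]', 'H': '[ACT]', 'V': '[ACG]',
    'N': '[ACGT]',
}


def reverse_complement(seq):
    """Complement every base, then reverse; the input string is untouched."""
    return seq.translate(_COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Turn a motif with ambiguity codes into a regular expression string."""
    return ''.join(_IUPAC[base] for base in motif.upper())


def motif_positions(seq, motif):
    """Start positions of all occurrences, overlapping ones included."""
    # A zero-width lookahead lets consecutive matches overlap.
    pattern = re.compile('(?=%s)' % motif_to_regex(motif))
    return [match.start() for match in pattern.finditer(seq)]


rc = reverse_complement('AACG')
regex = motif_to_regex('ARY')
hits = motif_positions('ACCGTGGAGAGCGCGTTGCA', 'GNG')
```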

Editing labels¶

Using `LabelView`¶

In comparison to sequences, lists of labels are relatively simple. However, there is also a specialized proxy class, LabelView. Objects of this type behave to a limited extent like a list of strings. It is not possible to delete any item from a LabelView. The supported functions are listed in the table below, where grp is a LabelView, i a level index, and v a label value:

Expression	Result
`len(grp)`	Get the number of label levels.
`grp[i]`	Access a label level.
`grp[i] = v`	Modify a label level.
`for v in grp`	Iterate over group labels.
`append()`	Append a label.

Using methods of `Align` and `Container`¶

The methods (and one property) for editing group labels are listed below, where n is a non-negative integer, i is a sample index, j is the index of a group level and g is a group label:

Expression	Result
`get_label(i,j)`	Get one of the group labels of a sample.
`set_label(i,j,g)`	Set one of the group labels of a sample.

Initializing instances¶

We have seen how to create Container and Align instances initialized from the content of a Fasta-formatted sequence file. The Coalescent simulations section shows how to generate data sets by simulation. Several methods exist to create sequence set objects with more flexibility.

Creating from empty instances¶

The default constructors of Container and Align return empty instances that can later be filled manually with the methods described in the following sections. In addition, the Align constructor allows one to initialize the instance to specified dimensions, with an optional user-specified initial value for all data entries, as shown in the example below:

>>> aln5 = egglib.Align(alphabet=egglib.alphabets.DNA)
>>> print(aln5.ns, aln5.ls)
0 0
>>> aln6 = egglib.Align(nsam=6, nsit=4, init='N',alphabet=egglib.alphabets.DNA)
>>> print(aln6.ns, aln6.ls)
6 4
>>> print(aln6.fasta())
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN

Deep copy of `Align` and `Container` instances¶

Both Align and Container have a class method create() that returns a new instance initialized from the content of the provided argument. There can be several uses for that functionality, and one of them is performing a deep copy of an instance. For example, let us assume one wants to create an independent copy of an alignment. The approach exemplified below will not work as wanted:

>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = aln
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string()) # aln and copy refer to the same object!
CCTCCTCCTCCTCCTCCTCT

This results in the string CCTCCTCCTCCTCCTCCTCT since aln and copy are actually references to the same underlying object (see this FAQ in the Python documentation). The class method create() makes a proper deep copy, as demonstrated in the code below, where copy is created as an object independent of aln:

>>> aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
>>> copy = egglib.Align.create(aln)
>>> aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
>>> print(copy.get_sequence(0).string())
ACCGTGGAGAGCGCGTTGCA
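
The difference between the two examples above is a general property of Python assignment, not something specific to EggLib. The sketch below reproduces it with plain nested lists. It uses copy.deepcopy from the standard library only to illustrate the idea on built-in types; for EggLib instances, the approach described in this section is create().

```python
import copy

# A toy data set: a list of [name, sequence-as-list] pairs.
original = [['sample1', list('ACCG')], ['sample2', list('ACCG')]]

alias = original                       # a second name for the same object
independent = copy.deepcopy(original)  # a new object, nested lists included

# Modify the first sequence through the original name.
original[0][1][0] = 'T'

alias_first = ''.join(alias[0][1])              # follows the change
independent_first = ''.join(independent[0][1])  # keeps the old content
```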

Conversion between `Align` and `Container` instances¶

Another use of create() is to convert between Align and Container types. It is possible to make a Container copy of an Align as in:

>>> cnt = egglib.Container.create(aln)

Obviously, the opposite (from Container to Align) requires that all sequences have the same length. For example, suppose that we have an alignment that has, for some reason, a longer sequence, as in:

>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
>sample4
ACCGTGGAGAGCGCGTTGCA

You must import this data set as a Container. The code below shows that the resulting instance is a Container (the property is_matrix is another way to tell if an object is an Align), and confirms that the third sequence is longer:

>>> cnt = egglib.io.from_fasta('sequences2.fas', alphabet=egglib.alphabets.DNA)
>>> print(type(cnt))
<class 'egglib._interface.Container'>
>>> print(cnt.is_matrix)
False
>>> print(cnt.ls(0))
20
>>> print(cnt.ls(2))
27

After cropping the longer sequence such that all sequences have the same length, we can turn the Container into an Align:

>>> cnt.del_sites(2, 20, 7)
>>> aln = egglib.Align.create(cnt)
>>> print(aln.is_matrix)
True
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA

Creation from other iterable types¶

Besides Align and Container instances, the method create() supports any compatible iterable object. To be compatible, an object must return, during iteration, (name, sequence) or (name, sequence, groups) items, where name is a name string, sequence is a sequence string (or a list of data entries), and groups (which may be omitted) is a list of group labels. For creating an Align, it is required that all sequences match in length. Typically, instances can be created from lists this way:

>>> aln = egglib.Align.create([('sample1', 'ACCGTGGAGAGCGCGTTGCA'),
...                            ('sample2', 'ACCGTGGAGAGCGCGTTGCA'),
...                            ('sample3', 'ACCGTGGAGAGCGCGTTGCA'),
...                            ('sample4', 'ACCGTGGAGAGCGCGTTGCA')],
...                            alphabet = egglib.alphabets.DNA)
>>> print(aln.fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA

The code above re-creates the alignment discussed in the previous section. Note that Container has a method, equalize(), that inserts stretches of ? at the end of sequences so that all sequences have the same length. In that case, the Container could be converted to an Align using Align.create(), but this is probably not what you want if your goal is to align the sequences.
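
The item format and the equal-length requirement described in this section can be checked on ordinary Python data before any instance is created. The helpers below are hypothetical (they are not EggLib functions) and assume that sequences are given as str. The last one pads shorter sequences on the right with ?, which is only a picture of the padding idea mentioned above, not a description of how equalize() is implemented.

```python
def normalize(items):
    """Return (name, sequence, groups) tuples from 2- or 3-field items."""
    result = []
    for item in items:
        if len(item) == 2:
            name, seq = item
            groups = []          # groups may be omitted
        elif len(item) == 3:
            name, seq, groups = item
        else:
            raise ValueError('expected 2 or 3 fields, got %d' % len(item))
        result.append((name, seq, list(groups)))
    return result


def same_length(items):
    """True if all sequences have the same length (or if there are none)."""
    return len({len(seq) for _, seq, _ in items}) <= 1


def pad_right(items, pad='?'):
    """Pad shorter sequences on the right so that all lengths match."""
    longest = max((len(seq) for _, seq, _ in items), default=0)
    return [(name, seq + pad * (longest - len(seq)), groups)
            for name, seq, groups in items]


data = [('sample1', 'ACCGT'), ('sample2', 'ACC', ['pop1'])]
items = normalize(data)
alignable_before = same_length(items)
padded = pad_right(items)
alignable_after = same_length(padded)
```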

Add/remove samples¶

Both Align and Container support the following operations to change the list of samples of an instance:

Method	Syntax	Action
`add_sample()`	`cnt.add_sample(name, sequence[, groups])`	Add a sample
`add_samples()`	`cnt.add_samples(samples)`	Add several samples
`del_sample()`	`cnt.del_sample(index)`	Delete a sample
`reset()`	`cnt.reset()`	Remove all samples
`remove_duplicates()`	`cnt.remove_duplicates()`	Remove duplicates

Editing alignments¶

Align instances have additional methods that allow one to extract or delete sections of the alignment:

Method	Syntax	Action
`column()`	`aln.column(i)`	Extract a site as a list (1)
`insert_columns()`	`aln.insert_columns(i, values)`	Insert columns at a given position
`del_columns()`	`aln.del_columns(i[, num])`	Delete one or more columns
`extract()`	`sub = aln.extract(start, stop)`	Extract a specified range of positions
	`sub = aln.extract(frame)`	Extract exon positions based on a `ReadingFrame`
	`sub = aln.extract([i, j, ..., z])`	Extract an arbitrary list of positions
`subset()`	`sub = aln.subset(samples)`	Generate a new instance with selected samples (1)
`intersperse()`	`aln.intersperse(len[, ...])`	Insert non-varying sites randomly
`random_missing()`	`aln.random_missing(p[, ...])`	Insert missing data randomly (1)
`fix_ends()`	`aln.fix_ends()`	Replace alignment gaps at ends by missing data

Note:

(1) Also available for Container.
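
As a rough picture of what extracting columns, ranges and subsets means, the sketch below performs the same kinds of operations on a plain list of equal-length strings. The function names mirror the table for readability but are hypothetical stand-ins, not EggLib calls. In particular, treating stop as excluded (the Python slice convention) is an assumption of this sketch.

```python
def column(seqs, i):
    """Return site i as a list, one entry per sample."""
    return [s[i] for s in seqs]


def extract_range(seqs, start, stop):
    """Keep positions start to stop - 1 (slice convention assumed)."""
    return [s[start:stop] for s in seqs]


def extract_positions(seqs, positions):
    """Keep an arbitrary list of positions, in the order given."""
    return [''.join(s[i] for i in positions) for s in seqs]


def subset(names, seqs, samples):
    """Keep only the samples whose indexes are listed."""
    return [names[i] for i in samples], [seqs[i] for i in samples]


names = ['sample1', 'sample2', 'sample3']
seqs = ['ACCGTG', 'ACTGTG', 'ATCGAG']

site1 = column(seqs, 1)
window = extract_range(seqs, 2, 5)
picked = extract_positions(seqs, [0, 2, 5])
sub_names, sub_seqs = subset(names, seqs, [0, 2])
```

Each helper returns new objects and leaves names and seqs untouched, which parallels the sub = ... forms in the table.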

The following functions lie in the tools module and provide additional functionalities to manipulate alignments:

Function	Syntax	Action
`tools.concat()`	`res = egglib.tools.concat(aln1, aln2)`	Concatenate alignments
`tools.ungap()`	`cnt = egglib.tools.ungap(aln)`	Remove all gaps from an alignment
`tools.ungap()`	`aln2 = egglib.tools.ungap(aln, p)`	Remove sites with too many gaps
`tools.backalign()`	`aln = egglib.tools.backalign(nucl, prot)`	Align (unaligned) nucleotide sequences based on an amino acid alignment
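
The two gap-removal modes listed above produce different kinds of results: removing every gap generally leaves sequences of unequal length, while removing gap-rich sites keeps the data aligned. The pure-Python sketch below illustrates this on a list of equal-length strings. It is not EggLib code, the function names are hypothetical, and the rule used for the threshold (a column is kept when its gap proportion is strictly below p) is an assumption, since the exact convention is not stated here.

```python
def gap_proportions(seqs, gap='-'):
    """Proportion of gap characters in each column of the alignment."""
    return [col.count(gap) / len(col) for col in zip(*seqs)]


def drop_gappy_columns(seqs, p, gap='-'):
    """Keep columns whose gap proportion is strictly below p (assumed rule)."""
    keep = [i for i, f in enumerate(gap_proportions(seqs, gap)) if f < p]
    return [''.join(s[i] for i in keep) for s in seqs]


def strip_all_gaps(seqs, gap='-'):
    """Remove every gap; the result is in general no longer aligned."""
    return [s.replace(gap, '') for s in seqs]


aligned = ['AC-GT', 'A--GT', 'ACTG-']

# Only the third column (two gaps out of three) reaches the 0.5 threshold.
filtered = drop_gappy_columns(aligned, 0.5)

# Removing all gaps yields sequences of different lengths.
unaligned = strip_all_gaps(aligned)
```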
