Accessing and editing data

In general, Align/Container provide two ways to access/edit data they contain: through the methods of SampleView instances returned by the iterators, and through direct methods of the Align/Container instances themselves. Those two approaches are equivalent and your choice should be guided by readability first.

Accessing a given sample

SampleView instances can be accessed directly by index, without iterating, using Align's method get_sample(). The bracket operator [] is a synonym for get_sample(). Here is an illustrative example (both lines show the name of the first ingroup sample):

print(aln4[0].name)
print(aln4.get_sample(0).name)

Editing names

There is no trick to reading or setting the name of a sample in Container/Align instances. The name property of SampleView is a standard str and can be set to any new value, as long as it is, or can be cast to, a str. Alternatively, both Container and Align have methods to get and set the name of a sample. All techniques are listed in the table below, with item being a SampleView instance, and obj being either an Align or a Container instance:

Expression         Result
item.name          Get the name of a sample (1).
item.name = x      Set the name of a sample to x (1)(2).
obj.get_name(i)    Get the name of the ith sample.
obj.set_name(i,x)  Set the name of the ith sample to x (2).

Notes:
  1. The identity of the sample is defined by the origin of the SampleView instance.

  2. x must be a str.

The example below shows that those approaches are equivalent, and also demonstrates that the content available through a SampleView changes whenever the underlying Align is modified, even when the change is made by other means:

item = aln4.get_sample(0)
print(item.name)
print(aln4.get_name(0))
aln4.set_name(0, 'another name')
print(item.name)
sample1
sample1
another name
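
The behaviour shown above follows from SampleView being a proxy: it holds a reference to its parent and an index rather than a copy of the data. The sketch below illustrates the idea in plain Python. TinyAlign and TinySampleView are hypothetical stand-ins written for this illustration; they are not EggLib classes and do not reflect how EggLib is implemented.

```python
class TinyAlign:
    """Hypothetical stand-in for an alignment: it owns the sample names."""

    def __init__(self, names):
        self._names = list(names)

    def get_name(self, i):
        return self._names[i]

    def set_name(self, i, name):
        self._names[i] = str(name)

    def get_sample(self, i):
        # Return a proxy that points back to this object.
        return TinySampleView(self, i)


class TinySampleView:
    """Hypothetical proxy: stores a parent and an index, but no data."""

    def __init__(self, parent, index):
        self._parent = parent
        self._index = index

    @property
    def name(self):
        # Always read through to the parent, so changes made there are visible.
        return self._parent.get_name(self._index)

    @name.setter
    def name(self, value):
        # Writing through the proxy modifies the parent.
        self._parent.set_name(self._index, value)


aln = TinyAlign(['sample1', 'sample2'])
item = aln.get_sample(0)
before = item.name
aln.set_name(0, 'another name')   # edit through the parent
after = item.name                 # the proxy sees the new value
item.name = 'third name'          # edit through the proxy
print(before, '|', after, '|', aln.get_name(0))
```

Because the proxy never caches the name, both editing routes stay in sync, which is the property the EggLib example relies on.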

Editing sequences or data entries

The SequenceView type

SequenceView is another proxy class, managing the sequence of data for a given sample. It can be obtained from a SampleView or from the method get_sequence() available from both Align and Container instances. SequenceView instances can be treated, to some extent, as lists of data values. In particular, they offer the same functionality for editing the data. There is one significant limitation: the length of a SequenceView connected to an Align cannot be modified.

Operations using a SequenceView as a list-like instance

In the table below, assume seq is a SequenceView instance, s is a stretch of sequence as a str, c is a one-character string (although an integer can be accepted depending on the Alphabet used), i is any valid index (and j a second one if a slice is needed).

Expression        Result
len(seq)          Get the number of data entries.
for v in seq      Iterate over data entries.
seq[i]            Access a data item.
seq[i:j]          Access a section of the sequence.
seq[i] = c        Modify a data item.
seq[i:j] = s      Replace a section of the sequence by a new sequence (1).
del seq[i]        Delete a data entry (2).
del seq[i:j]      Delete a section of the sequence (2).
seq.string()      Return the sequence as a str.
seq.insert(i, s)  Insert a stretch of sequence (2).
seq.find(s)       Find the position of a given motif.
seq.upper()       Modify the sequence to contain only upper-case characters (3).
seq.lower()       Modify the sequence to contain only lower-case characters (3).
seq.strip(s)      Remove left/right occurrences of characters present in s.

Notes:
  1. Only available for Align instances if the length of the provided stretch matches.

  2. Not available for Align instances.

  3. Only available for instances using a case-sensitive alphabet (excluding DNA).

In addition, one can modify the whole sequence directly through the SampleView, as in:

item = aln4.get_sample(0)
item.sequence = 'ACCGTGGAGAGCGCGTTGCA'

As before, if the original instance is an Align, the new sequence must have the same length as the one it replaces.
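
To make the fixed-length rule concrete, here is a minimal plain-Python sketch of a list-like sequence that accepts item and slice assignment only when the length stays the same, and refuses deletion. FixedLengthSequence is a hypothetical class written for this illustration. The choice of ValueError is an assumption of the sketch and says nothing about which exception EggLib raises.

```python
class FixedLengthSequence:
    """Hypothetical list-like sequence whose length is locked."""

    def __init__(self, data):
        self._data = list(data)

    def __len__(self):
        return len(self._data)

    def __getitem__(self, key):
        value = self._data[key]
        # Slices come back as a str, single items as one character.
        return ''.join(value) if isinstance(key, slice) else value

    def __setitem__(self, key, value):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self._data))
            if step != 1:
                raise ValueError('only contiguous slices are supported')
            if len(value) != max(0, stop - start):
                raise ValueError('replacement must keep the length unchanged')
            self._data[key] = list(value)
        else:
            self._data[key] = value

    def __delitem__(self, key):
        raise ValueError('cannot delete from a fixed-length sequence')

    def string(self):
        return ''.join(self._data)


seq = FixedLengthSequence('ACCGTGGA')
seq[0] = 'T'        # single item: always keeps the length
seq[1:4] = 'AAA'    # slice of 3 replaced by 3 entries: accepted

errors = []
try:
    seq[1:4] = 'AA'  # slice of 3 replaced by 2 entries: rejected
except ValueError as exc:
    errors.append(str(exc))
try:
    del seq[0]       # deletion would shorten the sequence: rejected
except ValueError as exc:
    errors.append(str(exc))

print(seq.string(), len(seq), len(errors))
```

The rejected operations leave the sequence untouched, so every row of the alignment keeps the same number of columns.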

Using methods of Align and Container

Most of the functionality available through SequenceView is also available as methods of Align/Container. The table below lists the available methods (or properties), where i is a sample index, j a position, n a number of sites, c a data entry (either an integer or a character, see alphabets), and s a str or a list of data entries.

Expression               Result
aln.ls                   Get alignment length (cannot be modified) (1).
cnt.ls(i)                Length of the sequence for an ingroup sample (2).
obj.get_sequence(i)      Get the sequence of a sample as a SequenceView.
obj.get_i(i,j)           Get a data entry of a sample.
obj.set_i(i,j,c)         Set a data entry of a sample.
obj.set_sequence(i,s)    Set the whole sequence of a sample.
cnt.del_sites(i,j,n)     Delete data entries for a sample (2).
cnt.insert_sites(i,j,s)  Insert a given sequence at a given position for a sample (2).

Notes:
  1. Only available for Align instances.

  2. Only available for Container instances.

Using module functions

A few functions from the tools module can be used with sequences. They never modify the instance passed to them, and they accept sequences as either SequenceView or str instances.

Function            Operation
tools.rc()          Reverse-complement of a DNA sequence.
tools.compare()     Check if sequences match (supporting ambiguity characters).
tools.regex()       Turn a sequence with ambiguity characters into a regular expression.
tools.motif_iter()  Iterate over occurrences of a motif.
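
For readers unfamiliar with these operations, the plain-Python sketch below shows what reverse-complementing, expanding ambiguity characters into a regular expression, and locating motif occurrences amount to. The function names are hypothetical and the code is not EggLib's implementation; the ambiguity table uses the standard IUPAC nucleotide codes, and reporting overlapping matches on the forward strand only is a choice made for this sketch.

```python
import re

# Complement table for unambiguous DNA bases (case is preserved).
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')

# Standard IUPAC nucleotide codes expressed as regex fragments.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'R': '[AG]', 'Y': '[CT]', 'S': '[CG]', 'W': '[AT]',
    'K': '[GT]', 'M': '[AC]',
    'B': '[CGT]', 'D': '[AGT]', 'H': '[ACT]', 'V': '[ACG]',
    'N': '[ACGT]',
}


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def ambiguity_regex(motif):
    """Expand each (possibly ambiguous) base into a regex fragment."""
    return ''.join(IUPAC[base] for base in motif.upper())


def motif_positions(seq, motif):
    """Return start positions of all matches, overlapping ones included."""
    # A lookahead matches without consuming, so overlaps are reported.
    pattern = re.compile('(?=' + ambiguity_regex(motif) + ')')
    return [m.start() for m in pattern.finditer(seq.upper())]


print(reverse_complement('ACCGT'))
print(ambiguity_regex('ARY'))
print(motif_positions('ACCGTGGAGAGCGCGTTGCA', 'GNG'))
```

None of these helpers modifies its argument, which mirrors the behaviour described for the tools functions.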

Editing labels

Using LabelsView

Lists of labels are simpler than sequences. However, there is also a specialized proxy class for them, LabelsView. Objects of this type behave, to a limited extent, like a list of strings. It is not possible to delete any item from a LabelView. The supported operations are listed in the table below, where grp is a LabelView, i a level index, and v a label value:

Expression     Result
len(grp)       Get the number of label levels.
grp[i]         Access a label level.
grp[i] = v     Modify a label level.
for v in grp   Iterate over group labels.
grp.append(v)  Append a label.

Using methods of Align and Container

The methods (and one property) for editing group labels are listed below, where n is a non-negative integer, i is a sample index, j is the index of a group level, and g is a group label:

Expression        Result
get_label(i,j)    Get one of the group labels of an ingroup sample.
set_label(i,j,g)  Set one of the group labels of an ingroup sample.

Initializing instances

We have seen how to create Container and Align instances initialized from the content of a Fasta-formatted sequence file. The section Coalescent simulations shows how to generate data sets by simulation. Several other methods create sequence set objects with more flexibility.

Creating from empty instances

The default constructors of Container and Align return empty instances that can later be filled manually with the methods described in the following sections. In addition, the Align constructor allows one to initialize the instance to specified dimensions, with an optional user-specified initial value for all data entries, as shown in the example below:

aln5 = egglib.Align(alphabet=egglib.alphabets.DNA)
print(aln5.ns, aln5.ls)

aln6 = egglib.Align(nsam=6, nsit=4, init='N',alphabet=egglib.alphabets.DNA)
print(aln6.ns, aln6.ls)
print(aln6.to_fasta())
0 0
6 4
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN
>
NNNN

Deep copy of Align and Container instances

Both Align and Container have a class method create() that returns a new instance initialized from the content of the provided argument. This functionality has several uses, one of which is performing a deep copy of an instance. For example, let us assume one wants to create an independent copy of an alignment. The approach exemplified below will not work as intended:

aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
copy = aln
aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
print(copy.get_sequence(0).string()) # copy and aln are the same object!

This results in the string CCTCCTCCTCCTCCTCCTCT since aln and copy are actually references to the same underlying object (see this FAQ in the Python documentation). The class method create() makes a proper deep copy, as demonstrated in the code below, where copy is created as an object independent of aln:

aln = egglib.io.from_fasta('align4.fas', alphabet=egglib.alphabets.DNA, labels=True)
copy = egglib.Align.create(aln)
aln.set_sequence(0, 'CCTCCTCCTCCTCCTCCTCT')
print(copy.get_sequence(0).string())
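
The difference between a second reference and an independent copy is a general Python matter, not something specific to EggLib. The sketch below reproduces it with nested lists and the standard copy module; the data layout is invented for this illustration and is unrelated to how Align stores its content.

```python
import copy

# A toy data set: each sample is [name, list of bases].
original = [['sample1', list('ACCG')], ['sample2', list('ACCG')]]

alias = original                 # second name for the same object
shallow = list(original)         # new outer list, same inner lists
deep = copy.deepcopy(original)   # fully independent copy

# Change the first base of the first sample in the original.
original[0][1][0] = 'T'

print(alias is original)         # True: alias is the very same object
print(''.join(alias[0][1]))      # TCCG: the alias shows the change
print(''.join(shallow[0][1]))    # TCCG: a shallow copy shares inner lists
print(''.join(deep[0][1]))       # ACCG: the deep copy is unaffected
```

Only the deep copy is insulated from later edits, which is the guarantee one wants from an independent copy of an alignment.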

Conversion between Align and Container instances

Another use of create() is to convert between Align and Container types. It is possible to make a Container copy of an Align as in:

cnt = egglib.Container.create(aln)

Obviously, the opposite (from Container to Align) requires that all sequences have the same length. For example, suppose that we have an alignment that has, for some reason, a longer sequence, as in:

>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
>sample4
ACCGTGGAGAGCGCGTTGCA

You must import this data set as a Container. The code below shows that the resulting instance is a Container (the property is_matrix is another way to tell if an object is an Align), and confirms that the third sequence is longer:

cnt = egglib.io.from_fasta('sequences2.fas', alphabet=egglib.alphabets.DNA)
print(type(cnt))
print(cnt.is_matrix)
print(cnt.ls(0))
print(cnt.ls(2))
<class 'egglib._interface.Container'>
False
20
27

After cropping the longer sequence such that all sequences have the same length, we can turn the Container into an Align:

cnt.del_sites(2, 20, 7)
aln = egglib.Align.create(cnt)
print(aln.to_fasta())
>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCA
>sample4
ACCGTGGAGAGCGCGTTGCA

Creation from other iterable types

Besides Align and Container instances, the method create() supports any compatible iterable object. To be compatible, an object must yield, during iteration, (name, sequence) or (name, sequence, groups) items, where name is a name string, sequence is a sequence string (or a list of data entries), and groups (which may be omitted) is a list of group labels. To create an Align, all sequences must have the same length. Typically, instances can be created from lists this way:

aln = egglib.Align.create([('sample1', 'ACCGTGGAGAGCGCGTTGCA'),
                           ('sample2', 'ACCGTGGAGAGCGCGTTGCA'),
                           ('sample3', 'ACCGTGGAGAGCGCGTTGCA'),
                           ('sample4', 'ACCGTGGAGAGCGCGTTGCA')],
                           alphabet = egglib.alphabets.DNA)
print(aln.to_fasta())

The code above re-creates the alignment discussed in the previous section. Note that Container has a method, equalize(), that appends stretches of ? to the end of sequences so that all sequences of the Container have the same length. The Container could then be converted to an Align using Align.create(), but this is probably not what you want if your aim is to align the sequences.
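
As an illustration of the (name, sequence) item format and of what padding to a common length involves, here is a plain-Python sketch. The helpers read_pairs(), same_length() and pad() are hypothetical names introduced for this example; pad() only mimics the idea of appending ? characters and is not EggLib's equalize().

```python
def read_pairs(text):
    """Yield (name, sequence) tuples from Fasta-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)


def same_length(pairs):
    """True if all sequences have the same length (alignment-compatible)."""
    return len({len(seq) for _, seq in pairs}) <= 1


def pad(pairs, fill='?'):
    """Right-pad every sequence with `fill` up to the longest length."""
    longest = max((len(seq) for _, seq in pairs), default=0)
    return [(name, seq.ljust(longest, fill)) for name, seq in pairs]


text = """>sample1
ACCGTGGAGAGCGCGTTGCA
>sample2
ACCGTGGAGAGCGCGTTGCA
>sample3
ACCGTGGAGAGCGCGTTGCATTAAGTA
>sample4
ACCGTGGAGAGCGCGTTGCA
"""

pairs = list(read_pairs(text))
padded = pad(pairs)
print(same_length(pairs), same_length(padded))
print(padded[0])
```

A list such as pairs has the shape described above for the argument of create(). Padding makes the lengths equal but does not place homologous positions in the same column, which is why it is no substitute for a real alignment.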

Add/remove samples

Both Align and Container support the following operations to change the list of samples of an instance:

Method               Syntax                                    Action
add_sample()         cnt.add_sample(name, sequence[, groups])  Add a sample.
add_samples()        cnt.add_samples(samples)                  Add several samples.
del_sample()         cnt.del_sample(index)                     Delete a sample.
reset()              cnt.reset()                               Remove all samples.
remove_duplicates()  cnt.remove_duplicates()                   Remove duplicates.

Editing alignments

Align instances have additional methods for extracting or deleting sections of the alignment:

Method            Syntax                             Action
column()          aln.column(i)                      Extract a site as a list (1).
insert_columns()  aln.insert_columns(i, values)      Insert columns at a given position.
del_columns()     aln.del_columns(i[, num])          Delete one or more columns.
extract()         sub = aln.extract(start, stop)     Extract a specified range of positions.
                  sub = aln.extract(frame)           Extract exon positions based on a ReadingFrame.
                  sub = aln.extract([i, j, ..., z])  Extract an arbitrary list of positions.
subset()          sub = aln.subset(samples)          Generate a new Align with selected samples (1).
intersperse()     aln.intersperse(len[, ...])        Insert non-varying sites randomly.
random_missing()  aln.random_missing(p[, ...])       Insert missing data randomly (1).
fix_ends()        aln.fix_ends()                     Replace alignment gaps at ends by missing data.

Note:
  1. Also available for Container.

The following functions lie in the tools module and provide additional functionalities to manipulate alignments:

Function           Syntax                        Action
tools.concat()     res = concat(aln1, aln2) (1)  Concatenate alignments (2).
tools.ungap()      cnt = ungap(aln)              Remove all gaps from an alignment.
                   aln2 = ungap(aln, p)          Remove sites with too many gaps.
tools.backalign()  aln = backalign(nucl, prot)   Align (unaligned) nucleotide sequences based on an amino acid alignment.

Notes:
  1. Depending on how you import egglib, all functions might have to be prefixed by egglib.tools.

  2. concat() allows you to add spacers between segments and it supports missing segments. See the documentation for more details.
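
The two forms of ungap() listed above return different types, and the plain-Python sketch below suggests why. Removing every gap from each row generally leaves rows of unequal length, whereas dropping whole columns keeps the rows aligned. The function names, the use of - as the gap symbol, and the rule that a column is kept when its gap frequency is at most the threshold are all assumptions made for this illustration; check the documentation of tools.ungap() for the exact convention EggLib uses.

```python
def strip_all_gaps(rows, gap='-'):
    """Remove every gap from every row; rows may end up with unequal lengths."""
    return [row.replace(gap, '') for row in rows]


def drop_gappy_columns(rows, max_gap_freq, gap='-'):
    """Drop columns whose gap frequency exceeds max_gap_freq (assumed rule)."""
    if not rows:
        return []
    nsam = len(rows)
    keep = [
        j for j in range(len(rows[0]))
        if sum(row[j] == gap for row in rows) / nsam <= max_gap_freq
    ]
    return [''.join(row[j] for j in keep) for row in rows]


rows = ['AC-GT',
        'A--GT',
        'ACTG-',
        'AC-GT']

unaligned = strip_all_gaps(rows)
filtered = drop_gappy_columns(rows, 0.5)
print(unaligned)   # lengths differ: no longer an alignment
print(filtered)    # third column (3 gaps out of 4) removed, rows still aligned
```

In the first result the rows no longer share a length, so they could only be stored in a Container-like object; in the second, every row loses the same columns and the result is still a matrix.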