Sequence manipulation tools¶
This module provides a number of standalone tools for manipulating sequences.
|
Handle reading frame positions. |
|
Converts a DNA object to the codon alphabet. |
Converts a codon object to the DNA alphabet. |
|
|
Concatenates sequence alignments. |
|
Remove sites with too many gaps. |
|
Remove all gaps from an object. |
|
Reverse-complement a DNA sequence. |
|
Compare two sequences. |
|
Turn a DNA sequence into a regular expression. |
|
Iterate over hits of a given query. |
|
Translates coding nucleotide sequences to proteins. |
|
Class to translate nucleotide sequences to proteins. |
|
Iterate over open reading frames. |
|
Detect the longest open reading frame. |
|
Align coding sequence based on the protein alignment. |
|
Process stop codons at the end of the sequences. |
|
Iterate over stop codons. |
|
Test if there is at least one stop codon. |
- class egglib.tools.ReadingFrame(frame=None, keep_truncated=False)[source]¶
Handle reading frame positions. The reading frame positions can be loaded as constructor argument or using the method
process()
. By default, build an instance with no exons.- Parameters:
frame –
the reading frame specification must be a sequence of
(start, stop[, codon_start])
pairs or triplets where start and stop give the limits of an exon, such thatsequence[start:stop]
returns the exon sequence, and codon_start, if specified, can be:1 if the first position of the exon is the first position of a codon (e.g.
ATG ATG
),2 if the first position of the segment is the second position of a codon (e.g.
TG ATG
),3 if the first position of the segment is the third position a of codon (e.g.
G ATG
),None
if the reading frame is continuing the previous exon.
keep_truncated – if
True
, codons that are truncated (either because the reading frame is 5’, internally, or 3’ partial, or if the number of bases is not a multiple of three) are not skipped (missing positions of truncated codons are replaced byNone
).
If codon_start of the first segment is
None
, 1 will be assumed. If codon_start of any non-first segment is notNone
, the reading frame is supposed to be interupted. This means that if any codon was not completed at the end of the previous exon, it will assumed to be assumed incomplete, and the first codon of the current codon will be incomplete as well.Attributes
Number of codons.
Number of bases in exons.
Number of exons.
Number of bases needed to apply this reading frame.
Number of bases of the reading frame.
Methods
codon_bases
(codon)Give the position of the three bases of a given codon.
codon_index
(base)Find the codon in which a given base falls.
codon_position
(base)Tell the position of a base within the codon in which it falls.
exon_index
(base)Find the exon in which a given base falls.
Iterate over codons.
Iterate over exons.
process
(frame[, keep_truncated])Reset instance with a new reading frame.
- codon_bases(codon)[source]¶
Give the position of the three bases of a given codon. One or two positions (but never the middle one alone) will be
None
if the codon is truncated (beginning/end of an exon without coverage of the previous/next one). If the codon index is out of range, returnNone
.- Parameters:
codon – any codon index.
- Returns:
A tuple with the three base positions (containing potentially one or two
None
) orNone
if the codon index is out of range.
- codon_index(base)[source]¶
Find the codon in which a given base falls.
- Parameters:
base – any base index.
- Returns:
The index of the corresponding codon, or
None
if the base does not fall in any codon.
- codon_position(base)[source]¶
Tell the position of a base within the codon in which it falls.
- Parameters:
base – any base index.
- Returns:
The index of the base in the codon (0, 1 or 3), or
None
if the base does not fall in any codon.
- exon_index(base)[source]¶
Find the exon in which a given base falls.
- Parameters:
base – any base index.
- Returns:
The index of the corresponding exon, or
None
if the base does not fall in any exon.
- iter_codons()[source]¶
Iterate over codons. This iterator returns
(first, second, third)
tuples of the positions of the three bases of each codon. Positions might beNone
if keep_truncated has been set toTrue
.
- iter_exon_bounds()[source]¶
Iterate over exons. This iterator returns
(start, stop)
tuples of the positions of the limits of each exon.
- property num_codons¶
Number of codons. Truncated codons are included.
- property num_exon_bases¶
Number of bases in exons.
- property num_exons¶
Number of exons.
- property num_needed_bases¶
Number of bases needed to apply this reading frame. Minimum size of a sequence to be consistent with this reading frame. In practice, the value equals to the end of the last exon or zero if there are no exons. If the reading frame is used with a shorter sequence, it can lead to errors.
- property num_tot_bases¶
Number of bases of the reading frame. All introns and exons are included, starting from the start of the first exon up to end of the last one.
- process(frame, keep_truncated=False)[source]¶
Reset instance with a new reading frame. All previously loaded data are discarded. The arguments of this method are identical to the constructor.
- egglib.tools.to_codons(src, frame=None)[source]¶
Converts a DNA object to the codon alphabet. In the output alignment, there is one site per codon. The number of provided sites must be a multiple of 3.
- Parameters:
aln – input alignment as an
Align
orContainer
instance. The alphabet must bealphabets.DNA
.frame – a
tools.ReadingFrame
instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region; in such case the length must be a multiple of 3).
- Returns:
A new instance of type matching the input instance. The alphabet will be
alphabets.codons
.
- egglib.tools.to_bases(src)[source]¶
Converts a codon object to the DNA alphabet. In the output object, there are three sites per codon.
- Parameters:
aln – input object as an
Align
orContainer
instance. The alphabet must bealphabets.codons
.- Returns:
A new instance of type matching the input instance. The alphabet will be
alphabets.DNA
.
- egglib.tools.concat(align1, align2, ..., spacer=0, ch='?', group_check=True, no_missing=False, ignore_names=False, dest=None)[source]¶
Concatenates sequence alignments. A unique
Align
is generated. All different sequences from all passed alignments are represented in the final alignment. Sequences whose name match are concatenated. In case several sequences have the same name in a given segment, the first one is considered and others are discarded. In case a sequence is missing for a particular segment, a stretch of non-varying characters is inserted to replace the unknown sequence.All options (excluding the alignements to be concatenated) must be specified as keyword arguments, otherwise they will be treated as part of the list of alignments.
- Parameters:
align1 –
align2 – two or more
Align
instances (their order is used for concatenation). It is not allowed to specify them using the keyword syntax. All instances must be configured to use the same alphabet.spacer – length of unsequenced stretches (represented by non-varying characters) between concatenated alignments. If spacer is a positive integer, the length of all stretches will be identical. If spacer is an iterable containing integers, each specifying the interval between two consecutive alignments (if there are
n
alignments, spacer must be of lengthn-1
).ch – character to used for conserved stretches and for missing segments. This character must be valid for the alphabet considered.
group_check – if
True
, an exception will be raised in case of a mismatch between group labels of different sequence segments bearing the same name. Otherwise, the group labels of the first segment found will be used as group labels of the final sequence.no_missing – if
True
, an exception will be raised in case the list of samples differs betweenAlign
instances. Then, the number of samples must always be the same and all samples must always be present (although it is possible that they consist in missing data only). Ignored if ignore_names isTrue
.ignore_names – don’t consider sample names and concatenate sequences based on they order in the instance. If used, the value of the option no_missing is ignored and the number of samples is required to be constant over alignments.
dest – an optional
Align
instance to recycle and to use to store the result. This instance is automatically reset, ignoring all data previously loaded. If this argument is notNone
, the function returns nothing and the passed instance is modified. Allows to recycle the same object in intensive applications.
- Returns:
If dest is
None
, a newAlign
instance. If dest isNone
, this function returnsNone
.
- egglib.tools.ungap(align, freq, gaps=None)[source]¶
Remove sites with too many gaps. Generate a new
Align
instance containing all sequences of the provided alignment but with only those sites for which the frequency of alignment gaps is less than or equal to freq.- Parameters:
align – input alignment as an
Align
.freq – maximum gap frequency in a site (if there are more gaps, the site is not included in the returned
Align
instance). This value is a relative frequency (included in the [0, 1] range).gaps – the list of allelic value(s) representing gaps. The argument must be a
list
ortuple
containing one gap allele value, or more than one if appropriate. All values must be valid alleles for the alignment alphabet. By default, use the-
character for the DNA and protein alphabets, and all triplets containing at least one alignment gaps for the codon alphabet. There is no default for other alphabets.
- Returns:
A new
Align
instance.
- egglib.tools.ungap_all(obj, gaps=None)[source]¶
Remove all gaps from an object. Generate a
Container
instance containing all sequences of the provided object excluding any gaps.- Parameters:
gaps – the list of allelic value(s) representing gaps. The argument must be a
list
ortuple
containing one gap allele value, or more than one if appropriate. All values must be valid alleles for the input alphabet. By default, use the-
character for the DNA and protein alphabets, and all triplets containing at least one alignment gaps for the codon alphabet. There is no default for other alphabets.
- Returns:
A
Container
instance.
- egglib.tools.rc(seq)[source]¶
Reverse-complement a DNA sequence.
- Parameters:
seq – input nucleotide sequence, as a
str
, or aSequenceView
.- Returns:
The reverse-complemented sequence (see details below).
The case of the provided sequence is preserved. Ambiguity characters are complemented according to the table shown here. Invalid characters characters raise a
ValueError
.
- egglib.tools.compare(seq1, seq2)[source]¶
Compare two sequences. The comparison supports ambiguity characters (see here) such that partially overlapping ambiguity sets characters are not treated as different. For example,
A
andM
are not treated as different, nor areM
andR
. Only IUPAC characters may be supplied.- Parameters:
seq1 – a DNA sequence (as a
str
or aSequenceView
).seq2 – another DNA sequence (idem).
- Returns:
True
if sequences have the same length and they either are identical or differ only by overlapping IUPAC characters,False
otherwise.
- egglib.tools.regex(query, both_strands=False)[source]¶
Turn a DNA sequence into a regular expression. The input sequence should contain IUPAC characters only (see here). Ambiguity characters will be teated as follows: an ambiguity will match either (1) itself, (2) one of the characters it defines, or (3) one of the ambiguity characters that define a subset of the characters it defines. For example, a
M
in the query will match anA
, aC
or aM
, and aD
will match all the following:A
,G
,T
,R
,W
,K
, andD
. The fully degeneratedN
matches all four bases and all intermediate ambiguity characters and, finally,?
matches all allowed characters (including alignment gaps). Note-
only matches itself.- Parameters:
query – a
str
containing IUPAC characters or aSequenceView
.both_strands – look for the query on both forward and reverse strands (by default, only on forward strand).
- Returns:
A regular expression as a
str
expanding ambiguity characters to all compatible characters.
Result of this function can be used with the module
re
to locate occurrences of a motif or the position of a sequence as in:>>> regex = egglib.tools.regex(query) >>> for hit in re.finditer(regex, subject): ... print(hit.start(), hit.end(), hit.group(0))
The returned regular expression includes upper-case characters, regardless of the case of input characters. To perform a case-insensitive search, use the
re.IGNORECASE
flag of there
module in regular expression searches (as inre.search(egglib.tools.regex(query), subject, re.IGNORECASE)
. Note that the regular expression is contained into a group if both_strands isFalse
, and two groups otherwise (one for each strand).
- egglib.tools.motif_iter(subject, query, mismatches=0, both_strands=False, case_independent=True, only_base=True)[source]¶
Iterate over hits of a given query. Return an iterator over hits of the provided query in the subject sequence. The query should only contain bases (including IUPAC ambiguity characters as listed in here, but excluding
-
and?
). Ambiguity characters are treated as described inregex()
(the bottom line is that ambiguities in the query can only match identical or less degenerate ambiguities in the subject). Mismatches are allowed, and the iterator will first yield identical hits (if they exist), then hits with one mismatch, then hits with two mismatches, and so on.- Parameters:
subject – a DNA sequence as
str
or aSequenceView
. The sequence may contain ambiguity characters. In principle, the subject sequence should contain IUPAC characters only, but this is not strictly enforced (non-IUPAC characters will always be treated as different of IUPAC characters).query – a DNA sequence as
str
or aSequenceView
, also containing IUPAC characters only.mismatches – maximum number of mismatches.
both_strands – look for the query on both forward and reverse strands (by default, only on forward strand).
case_independent – if
True
, perform case-independent searches.only_base – if
True
, never create a hit if the putative hit sequence contains an alignment gap (-
) or a?
character, irrespective of the number of allowed mismatches.
- Returns:
An iterator returning, for each hit, a
(start, stop, strand, num)
tuple wherestart
andstop
are the position of the hit such thatsubject[start:stop]
returns the hit sequence,strand
is the strand as+
or-
, andnum
is the number of mismatches of the hit.
Note
If a given query has a hit on both strands at the same position (this only happens if the sequence is a motif equal to its reverse-complement like
AATCGATT
), it is guaranteed that the only one hit will be returned for a given position (no twice on each strand). However, the value of thestrand
flag is not defined.Warning
This function is designed to be efficient only if the number of mismatches is small (only a few). A combination of a large number of mismatches (more than 2 or 3) and a large query sequence (as early as 10 or 20), will take significant time to complete. More complex problems require the use of genuine pairwise ocal alignment tools.
- egglib.tools.translate(seq, code=1, allow_alt=False, in_place=False)[source]¶
Translates coding nucleotide sequences to proteins. See the class
tools.Translator
for more details. This is a convenience method allowing to translate nucleotide sequences in a single call. For repetitive calls, direct use ofTranslator
can be more efficient.- Parameters:
seq – input codon sequences. Accepted types are
Align
,Container
,SequenceView
orstr
(or also types compatible withstr
). If the input object is anAlign
or aContainer
, the alphabet must bealphabets.codons
.code – genetic code identifier (see below). Required to be an integer among the valid values. The default value is the standard genetic code.
allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If
True
, codons are translated as a methionine (M) if, and only if, there are among the alternative start codons for the considered genetic code and they appear at the first position for the sequence. If seq is anAlign
, leading gaps are ignored as long as they appear as multiples of 3 (fully missing codons).in_place – place translated sequences in the provided
Align
orContainer
instance, overwriting initial data. Not allowed if seq is not of one of these two types. Replace the original alphabet byalphabets.protein
. By default, return a new instance.
- Returns:
Protein sequences as an
Align
or aContainer
if either of these types have been provided as seq, or as a string otherwise. If in_place has been specified, returnNone
.
All genetic codes defined by the National Center for Biotechnology Information are supported and can be accessed using codes corresponding to the GenBank
/trans_table
qualifier. The codes are integers.Identifier
Code
1
The Standard Code
2
The Vertebrate Mitochondrial Code
3
The Yeast Mitochondrial Code
4
The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5
The Invertebrate Mitochondrial Code
6
The Ciliate, Dasycladacean and Hexamita Nuclear Code
9
The Echinoderm and Flatworm Mitochondrial Code
10
The Euplotid Nuclear Code
11
The Bacterial, Archaeal and Plant Plastid Code
12
The Alternative Yeast Nuclear Code
13
The Ascidian Mitochondrial Code
14
The Alternative Flatworm Mitochondrial Code
16
Chlorophycean Mitochondrial Code
21
Trematode Mitochondrial Code
22
Scenedesmus obliquus Mitochondrial Code
23
Thraustochytrium Mitochondrial Code
24
Rhabdopleuridae Mitochondrial Code
25
Candidate Division SR1 and Gracilibacteria Code
26
Pachysolen tannophilus Nuclear Code
27
Karyorelict Nuclear Code
28
Condylostoma Nuclear Code
29
Mesodinium Nuclear Code
30
Peritrich Nuclear Code
31
Blastocrithidia Nuclear Code
33
Cephalodiscidae Mitochondrial UAA-Tyr Code
Note that the following code identifiers do not exist: 7, 8, 15, 17, 18, 19, 20, and 32, as well as 0 and values above 33.
Reference: National Center for Biotechnology Information [http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi]
- class egglib.tools.Translator(code=1)[source]¶
Class to translate nucleotide sequences to proteins.
- Parameters:
code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
Methods
translate_align
(src[, allow_alt, in_place])Translate an
Align
instance.translate_codon
(codon)Translate a single codon.
translate_container
(src[, allow_alt, in_place])Translate a
Container
instance.translate_sequence
(sequence[, allow_alt])Translate a sequence.
- translate_align(src, allow_alt=False, in_place=False)[source]¶
Translate an
Align
instance.- Parameters:
src – an
Align
containing codon sequences and using thealphabets.codons
alphabet.allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If
True
, codons are translated as a methionine (M) if they are among the alternative start codons for the considered genetic code and they appear at the first position for the considered sequence (excluding all triplets of gap symbols appearing at the 5’ end of the sequence).in_place – place translated sequences into the original
Align
instance (this discards original data and replaces the original alphabet by the protein alphabet). By default, returns a new instance.
- Returns:
By default, an original
Align
instance containing translated (protein) sequences. If in_place wasTrue
, returnNone
.
- translate_codon(codon)[source]¶
Translate a single codon. Translation is based on the genetic code defined at construction time. Alternative start codons are not supported.
- Parameters:
codon – codon as a three-character string
- Returns:
The one-letter amino acid code if the codon can be translated, or
-
if the codon is---
, orX
in all other cases (including all cases with invalid nucleotides).
- translate_container(src, allow_alt=False, in_place=False)[source]¶
Translate a
Container
instance.- Parameters:
align – a
Container
containing codons sequences.allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If
False
, codons are translated as a methionine (M) if, and only if, they are among the alternative start codons for the considered genetic code and they appear at the first position for the considered sequence.in_place – place translated sequences into the original
Container
instance (this discards original data). By default, return a new instance.
- Returns:
By default, a new
Container
instance containing translated (protein) sequences. If in_place wasTrue
, returnNone
.
- translate_sequence(sequence, allow_alt=False)[source]¶
Translate a sequence.
- Parameters:
sequence – a
str
,SequenceView
or compatible instance containing codon sequences. Only upper-case strings.frame – a
ReadingFrame
instance providing the exon positions in the correct frame. By default, a non-segmented frame covering all sequences is assumed (in case the provided alignment is the coding region and length must be a multiple of 3).allow_alt – a boolean telling whether alternative start (initiation) codons should be considered. If
False
, codons are translated as a methionine (M) if, and only if, there are among the alternative start codons for the considered genetic code and they appear at the first position for the sequence.
- Returns:
A new
str
instance containing translated (protein) sequences.
- egglib.tools.orf_iter(sequence, code=1, min_length=1, forward_only=False, force_start=True, allow_alt=False, force_stop=True)[source]¶
Iterate over open reading frames. Return an iterator over non-segmented open reading frames (ORFs) found in a provided codon sequence in any of the six possible frames.
- Parameters:
sequence – a string or a
SequenceView
representing a codon sequence.code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
min_length – minimum length of returned ORFs. This value must be at least 1. It is understood as the length of the encoded peptide (since ORFs are returned as nucleotide sequences, they will have a length of least three times this value).
forward_only – consider only the three forward frames (do not consider the reverse strand).
force_start – if
True
, all returned ORFs are required to start with a start codon. Otherwise, all non-stop codons are included in ORFs. Note that alternative start codons (CTG
andTTG
for the standard genetic code) are also supported.allow_alt – allow alternative start codons (only considered if force_start is set).
force_stop – require that ORFs end with a stop codon (this only excludes 3’-partial ORFs, that is ORFs that end with the end of the provided sequennce).
- Returns:
An iterator over all detected ORFs. Each ORF is represented by a
(start, stop, length, frame)
tuple wherestart
is the start position of the ORF andstop
the stop position (such assequence[start:stop]
returns the ORF sequence or its reverse complement [note that the stop codon, if present, is included]),length
is the ORF length (number of amino acids, excluding any stop codon) andframe
is the reading frame on which it was found: +1, +2, +3 are the frames on the forward strand (starting respectively at the first, second, and third base), and -1, -2, -3 are the frames on the reverse strand (starting respectively at the last, last but one, and last but two bases).
- egglib.tools.longest_orf(sequence, code=1, min_length=1, forward_only=False, force_start=True, allow_alt=False, force_stop=True)[source]¶
Detect the longest open reading frame. Arguments are identical to
orf_iter()
. AValueError
is raised if two or more open reading frames have the largest length.- Returns:
A
(start, stop, length, frame)
tuple (see the return value oforf_iter()
for details), orNone
if no open reading frame fits the requirements.
- egglib.tools.backalign(nucl, aln, code=1, ignore_names=False, ignore_mismatches=False, fix_stop=False, allow_alt=True)[source]¶
Align coding sequence based on the protein alignment.
- Parameters:
nucl – a
Container
orAlign
instance containing coding sequences that should be aligned. There should not be any alignment gap in the sequences. The alphabet must bealphabets.codons
.aln – an
Align
instance containing an alignment of the protein sequences encoded by the coding sequences provided as nucl. Group labels of aln are not taken into account. The alphabet must bealphabets.protein
.code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
ignore_names – if
True
, ignore names for matching sequences in the protein alignment to coding sequences. Sequences will be matched using their rank and the names in the returned alignment will be taken from nucl.ignore_mismatches – if
True
, do not generate any exception if a predicted protein does not match the provided protein sequence (if the lengths differ, an exception is always raised).fix_stop – if
True
, support a single trailing stop codon in coding sequences not represented by a*
in the provided protein alignment (if the final stop codons have been stripped during alignment). If found, this stop codon will be flushed as left as possible (immediately after the last non-gap character) in the returned coding alignment.
- Returns:
An
Align
instance containing aligned coding codon sequences.
If a mismatch is detected between a protein from aln and the corresponding prediction from nucl, an instance of
tools.BackalignError
(a subclass ofValueError
) is raised. Its attributealignment
can be used to help identify the reason of the error. Mismatches (but not differences of length) can be ignored with the option ignore_mismatches.
- class egglib.tools.BackalignError(name, fnuc, faln, i_nuc, i_aln, ls_aln, ls_nuc, translator)[source]¶
Bases:
ValueError
Exception used to report errors occurring during the use of
tools.backalign()
because of mismatches between the provided alignment and predicted proteins.Attributes
Comparaison of provided and predicted proteins.
Name of sequence for which the error occurred.
Methods
add_note
Exception.add_note(note) -- add a note to the exception
- property alignment¶
Comparaison of provided and predicted proteins. This string shows the two proteins with a middle line composed of the following symbols:
|
: match.#
: mismatch.~
: one protein shorter.
- property name¶
Name of sequence for which the error occurred.
- egglib.tools.trailing_stops(align, action=0, code=1, replacement='???')[source]¶
Process stop codons at the end of the sequences. This function detects and (optionally) fix trailing stop codons. The algorithm consists in locating the last non-gap codon of each sequence. If the last codon is partially gapped (which can happen if coding sequences are aligned using a DNA-based algorithm), it will not be detected and ignored.
- Parameters:
align – an
Align
containing aligned coding sequences. The alphabet must bealphabets.codons
.action –
an integer specifying what should be done if a stop codon is found at the end of a given sequence. Possible actions are listed in the following table:
Code
Action
0
Nothing (just count them).
1
Replace them by a gap, and delete the last position if it is made by gaps only.
2
Replace them by the value given as replacement.
Note that using
action=1
is not stricly equivalent to usingaction=2, replacement='---'
because the former deletes the last three positions of the alignment if needed while the latter does not.code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
replacement – if action is set to 2, provide the codon that should be used to replace stop codons.
- Returns:
The number of sequences that had a trailing stop codons among the considered sequences.
- egglib.tools.iter_stops(src, code=1)[source]¶
Iterate over stop codons. Return an iterator providing the coordinates of all stop codons over all sequences. Each iteration returns a
(sample, position)
wheresample
is the sample index andposition
is the index of the stop codon.- Parameters:
src – an
Align
orContainer
containing coding sequences. The alphabet must bealphabets.codons
.code – genetic code identifier (see here). Required to be an integer among the valid values. The default value is the standard genetic code.
- Returns:
An iterator over the
(sample, position)
tuples corresponding to each stop codon found in the alignment (see above).