tools¶

This module contains tools used in some other parts of EggLib but that might be of use for the package’s users.

Specific data formats¶

egglib.tools.aln2fas(fname)¶: Imports a clustal-formatted alignment from the file named fname and returns an Align instance.

egglib.tools.staden(fname=None, string=None, delete_consensus=True)¶

Imports a Staden output file as an Align instance. The file should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig. to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option delete_consensus is False.

The Staden output file can be read from a file (using the argument fname) or directly from a string (using string). Exactly one of the two must be passed: either a file name as fname or a Staden string as string, but not both.

Staden’s default convention is followed:

- codes for an unknown base and is replaced by N.
* codes for an alignment gap and is replaced by -.
. represents the same base as the consensus at that position.
White space represents missing data and is replaced by ?.

New in version 2.0.1: Add argument delete_consensus.

Changed in version 2.1.0: Read from string or fname.
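The symbol conventions listed for staden() can be illustrated with a short sketch. The helper decode_staden_row is hypothetical (it is not part of EggLib) and assumes that each row has already been extracted as a string aligned with the consensus. How staden() itself resolves . against the consensus is not documented here, so that step is an assumption.

```python
def decode_staden_row(row, consensus):
    """Translate one Staden row using the conventions listed above.

    Hypothetical helper for illustration only, not an EggLib function.
    """
    simple = {'-': 'N', '*': '-', ' ': '?'}
    out = []
    for base, ref in zip(row, consensus):
        if base == '.':
            # Assumption: '.' takes the consensus symbol at that position,
            # which is then translated like any other symbol.
            base = ref
        out.append(simple.get(base, base))
    return ''.join(out)
```

With a consensus of ACGTAA, the row "A.-* T" would decode to ACN-?T under these assumptions.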

egglib.tools.get_fgenesh(fname)¶: Imports fgenesh output. fname must be the name of a file containing fgenesh output. The feature definitions are parsed and returned as a list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.

egglib.tools.genalys2fasta(iname)¶: Converts Genalys-formatted sequence alignment files to fasta. The function imports files generated through the option Save SNPs of Genalys 2.8. iname is the name of the Genalys output file. Returns an Align instance.

class egglib.tools.Mase(input=None)¶

Bases: list

Minimal implementation of the mase format (allowing input/output operations). This class emulates a list of dictionaries, each dictionary representing a sequence and providing the keys header, name and sequence. However, the string formatter (str(mase) or print mase, where mase is a Mase instance) generates a mase-formatted string. Object attributes are header (a string with file-level information), species (the species of the ingroup) and align (an Align instance corresponding to the data contained in the instance, created upon construction). Modifying this Align instance has no effect on the Mase instance.

The constructor takes an optional argument that can be a string giving the path to a mase-formatted file, or an Align instance. The constructor is currently unable to import population labels, and only sequences marked as ingroup are imported.

Changed in version 2.0.1: An IOError is raised upon file formatting error.

Data analysis¶

egglib.tools.LD(align1, align2, shuffle)¶: Computes linkage disequilibrium statistics between two Align instances align1 and align2. If shuffle is True, randomly shuffles the sequences of align2 (without altering the original instance), emulating the hypothesis of linkage equilibrium. Returns a (n1, n2, S1, S2, K1, K2, D, Dp) tuple, where n1 is the number of sequences used from the first alignment, S1 is the number of polymorphic sites of the first alignment, K1 is the number of unique haplotypes of the first alignment (n2, S2 and K2 are the same quantities for the second alignment), D is the standard estimator of linkage disequilibrium and Dp is Lewontin’s estimator (bound by 0 and 1).
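As a reminder of what D and Dp measure, here is a minimal sketch of the textbook two-locus definitions for a pair of biallelic sites. The function pairwise_ld is hypothetical and is not EggLib's implementation: how LD() derives its statistics from two whole alignments is not specified in this page, and the sketch assumes that both sites are polymorphic with exactly two alleles.

```python
def pairwise_ld(site1, site2):
    """Return (D, Dp) for two biallelic sites given as equal-length strings.

    Illustrative only; each string holds one allele per sampled sequence.
    """
    n = float(len(site1))
    a = site1[0]  # allele used as reference at site 1
    b = site2[0]  # allele used as reference at site 2
    p_a = site1.count(a) / n
    p_b = site2.count(b) / n
    # Frequency of the haplotype carrying both reference alleles.
    p_ab = sum(1 for x, y in zip(site1, site2) if x == a and y == b) / n
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    dp = abs(d) / d_max if d_max > 0 else 0.0
    return d, dp
```

Taking the absolute value keeps Dp between 0 and 1, which matches the bounds stated for Lewontin's estimator.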

Sequence manipulation tools¶

egglib.tools.backalign(nucseq, protseq, code=1, has_stop=False)¶: Alignment of coding sequences based on aligned predicted products. Conceptual translations of the DNA sequences must exactly match the passed protein sequences (except for gaps). Stop codons are not supported, except for a final stop codon when has_stop is True. nucseq is a Container instance containing raw coding sequences. protseq is an Align instance containing aligned amino acid sequences. code specifies the genetic code; refer to the documentation of translate(). If has_stop is True, nucseq sequences are allowed to contain a final stop codon. Returns an Align instance containing aligned coding sequences.

egglib.tools.concat(aligns, spacer=0, ch='?', strict=True, groupCheck=True)¶

Concatenates sequence alignments. A single Align is returned. All different sequences from all passed alignments are represented in the final alignment. Sequences whose names match are concatenated. In case several sequences have the same name in a given segment, the first one is considered and the others are discarded. In case a sequence is missing for a particular segment, a stretch of non-varying characters is inserted to replace the unknown sequence.

aligns must be an iterable containing Align instances.

spacer specifies the length of unsequenced stretches (represented by non-varying characters) between concatenated alignments. If spacer is a positive integer, the length of all stretches will be identical. If spacer is an iterable containing integers, each integer specifies the interval between two consecutive alignments (if aligns contains n alignments, spacer must be of length n-1).

ch gives the character to use for conserved stretches and for missing segments.

If strict is False, the name comparison will not extend further than the length of the shorter name: for example, names anaconda and anaco will match, and the concatenated sequence will be named anaconda (regardless of which name appears first in the list of Align instances).

If groupCheck is True, an exception will be raised in case of a mismatch between group labels of different sequence segments bearing the same name. Otherwise, the group of the first segment found will be used as group label of the final sequence.

New in version 2.0.1: Added the arguments customizing the function’s behaviour.
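The concatenation rules described for concat() can be sketched on plain dictionaries. This is an illustration under simplifying assumptions, not EggLib code: alignments are represented as dictionaries mapping names to aligned strings (so duplicate names within a segment cannot occur), only an integer spacer is handled, and the strict and groupCheck options are ignored. The name concat_sketch is hypothetical.

```python
def concat_sketch(segments, spacer=0, ch='?'):
    """Concatenate alignments given as dicts mapping name -> aligned string."""
    # Collect every name, in order of first appearance.
    names = []
    for seg in segments:
        for name in seg:
            if name not in names:
                names.append(name)
    result = dict((name, '') for name in names)
    for i, seg in enumerate(segments):
        # All sequences of one alignment share the same length.
        length = len(next(iter(seg.values()))) if seg else 0
        for name in names:
            if i > 0:
                result[name] += ch * spacer  # unsequenced stretch between segments
            # A missing sequence is replaced by a stretch of filler characters.
            result[name] += seg.get(name, ch * length)
    return result
```

For two segments {'a': 'AC', 'b': 'AG'} and {'a': 'TT', 'c': 'GG'} with spacer=1, sequence a becomes AC?TT, while b and c are padded with ? where their segment is missing.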

egglib.tools.longest_orf(sequence, clean=False, full=False, all=False, code=1, mini=1)¶: Finds the longest open reading frame in the sequence. By default, the longest sequence without a stop codon (except for the trailing stop codon) is returned; the returned ORF therefore does not necessarily start with ATG or end with a stop codon. If clean is True, returns the longest sequence encoding a valid protein sequence (without stop, without missing data, without gap). If full is True, returns only genuine ORFs (starting with ATG and ending with a stop codon). If all is True, returns a list of all ORFs (of length at least 3), sorted by decreasing length. code specifies the genetic code; refer to the documentation of GeneticCodes.translate(). mini specifies the minimum number of codons (or amino acids) of the returned ORF or ORFs (stop codons are not taken into account).

Changed in version 2.0.1: Added options; return the trailing stop codon when appropriate.

Changed in version 2.1.0: Added option mini. The behaviour of previous versions is reproduced by setting mini to 0.

egglib.tools.rc(seq)¶: Reverse-complements a DNA sequence. Upper-case and lower-case characters can be passed; the output is always upper-case. IUPAC characters (ACGTMRWSYKBDHV) are complemented. The characters N, - and ? are returned as is. Other characters raise a ValueError.

Changed in version 2.0.1: Characters N, - and ? are correctly processed.

Changed in version 2.0.2: Reimplemented (will be faster for large sequences).
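The behaviour described for rc() can be reproduced with a lookup table. The sketch below is an independent illustration (the names COMPLEMENT and rc_sketch are hypothetical), not the EggLib implementation; it uses the standard IUPAC complement pairs.

```python
# Standard IUPAC complements; N, - and ? map to themselves.
COMPLEMENT = {
    'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A',
    'M': 'K', 'K': 'M', 'R': 'Y', 'Y': 'R',
    'W': 'W', 'S': 'S',
    'B': 'V', 'V': 'B', 'D': 'H', 'H': 'D',
    'N': 'N', '-': '-', '?': '?',
}

def rc_sketch(seq):
    """Reverse-complement a DNA string; the output is always upper-case."""
    try:
        return ''.join(COMPLEMENT[c] for c in reversed(seq.upper()))
    except KeyError as err:
        # Any character outside the table is rejected.
        raise ValueError('invalid character: %s' % err)
```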

egglib.tools.translate(input, code=1, strip=False)¶: Translates all sequences from nucleotides to proteins. Accepts sequence container instances, and the return type matches the passed type. If strip is True, any stop codon(s) present at the end of a sequence will be automatically stripped off. Setting this option to True will raise a ValueError if an Align is passed and the sequences don’t all have the same number of trailing stop codons. See the documentation of GeneticCodes.translate() for documentation of the argument code. Ambiguous codons are translated if all the implied possibilities translate to the same amino acid. The IUPAC nomenclature is used. Note that N means A, C, G or T but that codons containing ? or - will always be translated as X (except for --- codons, which are translated as -).

egglib.tools.ungap(align, freq, includeOutgroup=True)¶

Builds a new Align instance containing all sequences of align and only the columns for which the frequency of gaps (- symbols) is less than the value given by freq.

If includeOutgroup is True, the sequences with group label 999 (if any) are considered for computing the frequency of gaps. These sequences are, however, always exported to the returned alignment.

Changed in version 2.1.0: Added option includeOutgroup.
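The column filter applied by ungap() can be sketched on a list of aligned strings. This is only an illustration of the stated rule (keep columns whose gap frequency is less than freq); the function name ungap_sketch is hypothetical and outgroup handling is left out.

```python
def ungap_sketch(seqs, freq):
    """Keep the columns whose frequency of '-' is strictly less than freq."""
    n = float(len(seqs))
    keep = [i for i in range(len(seqs[0]))
            if sum(1 for s in seqs if s[i] == '-') / n < freq]
    # Rebuild every sequence from the retained columns only.
    return [''.join(s[i] for i in keep) for s in seqs]
```

With the three sequences AC-T, A--T and ACGT and freq=0.5, only the third column (two gaps out of three) is dropped.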

class egglib.tools.GeneticCodes¶

Holds genetic codes. Instantiating this class is pointless since it contains only class methods.

classmethod codes()¶: Gives the list of code identifiers. Each code is represented by three identifiers: (index, short, long) where index is the integer identifier matching NCBI nomenclature (beware that indices are not consecutive); short is an egglib-defined word summarizing the code, which can be used as an alternative means of access; and long is the full name of the genetic code.

classmethod index(name)¶: Tries to identify the index of the genetic code from its short or full name. Returns None if the string matches no model. The comparison is case-independent.

classmethod is_start(codon, code=1)¶: Returns True if the codon encodes one of the observed translational starts for this genetic code, False otherwise (including if the codon is invalid). Arguments are the same as for translate().

classmethod translate(codon, code=1)¶: Translates the codon codon using the indicated code. code is an identifier (index, short or long name) matching NCBI nomenclature. Returns the one-letter amino acid code corresponding to codon, * for a stop codon and X for any invalid codon (a string with a length different from 3 or containing missing data or gaps). The codon specification is case-independent. Ambiguous codons might still be translated if all the implied possibilities translate to the same amino acid. The IUPAC nomenclature is used. Note that N means A, C, G or T but that codons containing ? or - will always be translated as X. However, --- will be translated as -.
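The rule for ambiguous codons can be made concrete with a small sketch restricted to the standard genetic code (NCBI table 1). The names STANDARD, IUPAC and translate_codon are hypothetical and this is not the EggLib implementation; it only mirrors the behaviour described above (expand every ambiguity, translate each possibility, and return X unless all agree).

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons ordered with bases T, C, A, G.
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
STANDARD = dict((''.join(c), aa)
                for c, aa in zip(product(BASES, repeat=3), AMINO))

# Bases implied by each IUPAC nucleotide symbol.
IUPAC = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG', 'Y': 'CT', 'K': 'GT',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def translate_codon(codon):
    """Translate one codon, resolving ambiguities when they are synonymous."""
    codon = codon.upper()
    if codon == '---':
        return '-'
    if len(codon) != 3 or any(c not in IUPAC for c in codon):
        return 'X'  # wrong length, missing data or partial gap
    # Translate every unambiguous codon implied by the ambiguity symbols.
    products = set(STANDARD[''.join(c)]
                   for c in product(*[IUPAC[b] for b in codon]))
    return products.pop() if len(products) == 1 else 'X'
```

For example, GCN is translated as A because all four implied codons encode alanine, whereas ATN yields X because ATG encodes methionine and the other three encode isoleucine.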

Sequence comparison¶

egglib.tools.compare(seq1, seq2)¶: Compares two sequences. Sequences are different if they have different lengths or if they differ at one position or more. The comparison supports IUPAC ambiguity characters (for example, A and M are not considered to be different). Furthermore, partially overlapping ambiguity characters (for example, M and R) are not taken as different. Returns True if the sequences are identical (or differ only by overlapping IUPAC characters), False otherwise.
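The overlap rule used by compare() amounts to checking that the sets of bases implied by two symbols intersect. The sketch below is an illustration with hypothetical names (IUPAC_SETS, compare_sketch); it assumes upper-case input and treats any symbol outside the IUPAC table as matching only itself, which is an assumption rather than documented behaviour.

```python
# Bases implied by each IUPAC nucleotide symbol.
IUPAC_SETS = {
    'A': 'A', 'C': 'C', 'G': 'G', 'T': 'T',
    'M': 'AC', 'R': 'AG', 'W': 'AT', 'S': 'CG', 'Y': 'CT', 'K': 'GT',
    'B': 'CGT', 'D': 'AGT', 'H': 'ACT', 'V': 'ACG', 'N': 'ACGT',
}

def compare_sketch(seq1, seq2):
    """Return True if the sequences are compatible at every position."""
    if len(seq1) != len(seq2):
        return False
    for a, b in zip(seq1, seq2):
        # Two symbols are compatible when they share at least one base.
        if not set(IUPAC_SETS.get(a, a)) & set(IUPAC_SETS.get(b, b)):
            return False
    return True
```

M (A or C) and R (A or G) share A and are therefore compatible, while M and K (G or T) are not.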

egglib.tools.motifs(sequence, motif, mismatches=0, reverse=True)¶: Locates motifs in a nucleotide sequence. Standard ambiguity characters are supported (as explained in compare() documentation). sequence and motif are nucleotide sequence strings. mismatches gives the number of nucleotide differences allowed for a motif match. If reverse is True, both strands are examined (otherwise, only the forward strand is considered). Returns a list of hits. Each hit is represented by a dictionary containing the keys: start: starting position of the hit, sequence: sequence of the matching region, mismatches: number of mismatches in the hit, reverse: True if the hit is on the reverse strand. The hit position and the found motif are always given with respect to the passed sequence, even when the motif was found on the reverse strand.

egglib.tools.locate(sequence, motif, start=0, stop=-1)¶

Locates the position of the motif in sequence. motif and sequence should be DNA sequences. Ambiguity characters (M, R, W, S, Y, K, B, D, H, V and N) are recognized and match the appropriate characters. ? matches any character. Note that the meaning of N (A, C, G or T) is very different from that of ? (any character). start and stop restrict the search to a given subset of sequence (the returned index is still given with respect to the full sequence). The function returns the position of the first exact match or, if there is no exact match, the position of the first match allowing ambiguity characters or, if there is no match at all, None.

If hasAmb is True, ambiguities will be supported in the target sequence (sequence). With that mode on, ambiguities of the motif sequence (motif) will only be considered as a match if the target sequence accounts for all

Changed in version 2.1.0: Supports ambiguity characters in sequence. Returns exact matches first.

Tools not (directly) related to sequence data¶

egglib.tools.chisquare(ddl)¶: Returns the 5% critical value of the chi-square distribution with ddl degrees of freedom (maximum: 100).

egglib.tools.correl(x, y)¶: Computes correlation coefficients. x is a sequence giving the values of the explanatory variable. y is a sequence giving the values of the response variable. Returns a tuple (r, r**2, a) where r is the correlation coefficient and a the regression coefficient.
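The three returned values can be sketched as follows. The function correl_sketch is hypothetical; it assumes that r is Pearson's correlation coefficient and that the regression coefficient a is the least-squares slope of y on x, neither of which is stated explicitly above.

```python
def correl_sketch(x, y):
    """Return (r, r**2, a) for two equal-length sequences of numbers."""
    n = float(len(x))
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)  # sum of squares of x
    syy = sum((yi - my) ** 2 for yi in y)  # sum of squares of y
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    r = sxy / (sxx * syy) ** 0.5  # Pearson's correlation coefficient
    a = sxy / sxx                 # least-squares slope of y on x (assumed meaning of a)
    return r, r ** 2, a
```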

egglib.tools.ranges(values)¶: Identifies continuous ranges among the iterable values. values must be iterable but need not be sorted and can contain duplicates. The function returns a list of (start,stop) tuples, where start and stop define a continuous range.
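A minimal sketch of this kind of range detection is given below. It assumes integer values and that stop is the last value included in the range; the documentation above does not say whether stop is inclusive, so this is an assumption, and ranges_sketch is a hypothetical name.

```python
def ranges_sketch(values):
    """Group integers into runs of consecutive values as (start, stop) tuples."""
    result = []
    # Sorting and removing duplicates makes the input order irrelevant.
    for v in sorted(set(values)):
        if result and v == result[-1][1] + 1:
            # Extend the current run by one value.
            result[-1] = (result[-1][0], v)
        else:
            # Start a new run (stop is assumed to be inclusive).
            result.append((v, v))
    return result
```

Under these assumptions, the values 5, 1, 2, 3, 3, 7, 8 would give the ranges (1, 3), (5, 5) and (7, 8).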

class egglib.tools.ReadingFrame(frame)¶

Handles reading frame positions.

frame must be a sequence of (start, stop, codon_start) sequences where start and stop give the first and last positions of an exon and codon_start is 1 if the first position of the exon is the first position of a codon (e.g. ATG ATG), 2 if the first position of the segment is the second position of a codon (e.g. TG ATG), 3 if the first position of the segment is the third position of a codon (e.g. G ATG), or None if the reading frame is continuing the previous exon. If codon_start of the first segment is None, 1 will be assumed. It is not possible to modify the codon positions held by the instance after construction.

codon(x)¶: If the position x falls in a complete codon, returns the three positions of that codon. If x falls outside of the defined segments, or in a codon that appears not to be completely available, returns None.

Note

The codon positions are cached at build time. As a result, the result of this method will be incorrect if frame positions are changed after the creation of the instance.

codons()¶: Returns the list of complete codons (as triplets of absolute positions).

exon(x)¶: Returns the exon index of a position. Returns -1 if the position falls outside specified segments (out of ranges or in introns).
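To make the meaning of the (start, stop, codon_start) triplets concrete, here is a sketch of how complete codons could be derived from them. The function codons_sketch is hypothetical and is not the ReadingFrame implementation; it assumes that stop is inclusive (as "first and last position" suggests) and that an unfinished codon is dropped whenever an exon restarts the frame with an explicit codon_start.

```python
def codons_sketch(frame):
    """Return complete codons as triplets of absolute positions."""
    codons = []
    current = []  # positions of the codon being assembled
    for i, (start, stop, codon_start) in enumerate(frame):
        positions = list(range(start, stop + 1))  # stop assumed inclusive
        if codon_start is None and i == 0:
            codon_start = 1  # default for the first segment
        if codon_start is not None:
            # The frame restarts: drop any unfinished codon and skip the
            # leading bases that belong to a truncated codon.
            current = []
            positions = positions[(4 - codon_start) % 3:]
        for pos in positions:
            current.append(pos)
            if len(current) == 3:
                codons.append(tuple(current))
                current = []
    return codons
```

With frame [(0, 4, 1), (10, 13, None)], the second codon spans the intron: the result is (0, 1, 2), (3, 4, 10) and (11, 12, 13).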

class egglib.tools.Updater(target=None)¶

Helper designed to monitor the progress of long-running tasks. In principle, Updater should be coupled to a repetitive process with a fixed and known number of steps to perform (target), each requiring the same amount of time. Updater should be updated regularly at reasonable intervals (not too short, to keep it from becoming a resource load itself).

The class can be used as in the following examples:

>>> import egglib
>>> updater = egglib.tools.Updater(1000)
>>> for i in range(1000):
...     # ... time-consuming task here ...
...     updater.refresh()
...
>>> updater.close()
>>>
>>> # The number of iterations is not known - custom display
>>> import random
>>> updater = egglib.tools.Updater()
>>> maxi = 0
>>> while True:
...     X = random.random()
...     if X > maxi:
...         maxi = X
...     updater.refresh('$ELAPSED, max. value: %f' % maxi)
...     if X > 0.99999:
...         break
...
>>> updater.refresh('$DONE tries, time $ELAPSED, got %f' % maxi)
>>> updater.close()

Constructor’s argument target gives the number of iterations to perform. If None, this information is not available.

close()¶: If anything was written using refresh(), writes any cached refresh data and writes a new line. Otherwise, does nothing. This method is automatically called upon object destruction.

closed¶: True if the instance has been closed using the close() method.

format(template=None, increment=1)¶

Returns a string providing feedback about the run’s progress. template gives the template of the string to return. The following special strings are replaced by the actual values:

$DONE: Number of steps done.

$TARGET: Total number of steps to do.

$TODO: Number of steps left to do.

$ELAPSED: Time used since object creation.

$REMAINING: Estimated time to complete the task.

$LREMAINING: Like $REMAINING but computed from the last time point.

$TOTAL: Estimated total time (computed as $REMAINING + $ELAPSED)

$PERCENT: Percentage done (including % symbol).

If template is None, a template defined at construction time is used. This template is $DONE|$ELAPSED when target is None and $DONE/$TARGET (remaining: $REMAINING) if target is specified. It is stored in the object attribute template and can be modified dynamically.

If increment is different from zero, increment() is called with this number before the string is formatted.
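The placeholder syntax resembles that of Python's string.Template, which can be used to sketch the substitution step. This is only an analogy: how Updater performs the substitution internally is not documented here, the function format_sketch is hypothetical, the exact formatting of $PERCENT is assumed, and the time-based placeholders are left untouched.

```python
from string import Template

def format_sketch(template, done, target):
    """Substitute the counter placeholders of a progress template."""
    values = {
        'DONE': done,
        'TARGET': target,
        'TODO': target - done,
        # Assumed formatting: integer percentage followed by a % symbol.
        'PERCENT': '%d%%' % (100 * done // target),
    }
    # safe_substitute leaves unknown placeholders (e.g. $ELAPSED) as they are.
    return Template(template).safe_substitute(values)
```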

increment(number)¶: Adds number steps and updates the elapsed time and the estimated running time (if target was given). number might be negative.

refresh(template=None, increment=1, grain=1.0)¶: This method generates a string exactly as format() does, but writes the string to sys.stdout instead. If the same object has already written anything, an equivalent number of backspaces is written, in principle overwriting the previous string and making the string appear to update itself. The result might not be so nice if something else is written to the console in the meantime or if the console doesn’t support backspaces. This method doesn’t write a newline, but the object will upon destruction or upon a call to close(). The string will be truncated if it is longer than the object attribute length_max, which can be changed dynamically. If less time than the number given by grain (in seconds) has elapsed since the last refresh, nothing is printed.

stats()¶: Returns a dictionary with the current values of counters.

wipe()¶: Writes an empty line of the maximal possible length, therefore clearing the line from characters printed by another process (provided that these characters are not too many).

egglib.tools.wrap(string, length, indent=0)¶

Formats the string string to ensure that line lengths are not larger than length. The optional argument indent specifies the number of spaces to insert at the beginning of all lines except the first. The line breaks are inserted at spaces.

An example is given below:

>>> import egglib
>>> string = "Lekrrjf djdhs eeir djs ehehf bnreh eurvz rhffdvfu dksgta."
>>> print egglib.tools.wrap(string, 20, 4)
Lekrrjf djdhs eeir
    djs ehehf bnreh
    eurvz rhffdvfu
    dksgta.
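For comparison, the same layout can be obtained with the standard library's textwrap module. The function wrap_sketch below is a hypothetical stand-in, not the EggLib implementation, and the two may differ on edge cases such as words longer than length.

```python
import textwrap

def wrap_sketch(string, length, indent=0):
    """Wrap at spaces, indenting every line except the first."""
    return textwrap.fill(string, width=length,
                         subsequent_indent=' ' * indent)
```

Called with the string of the example above, a length of 20 and an indent of 4, this sketch produces the same four lines.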
