EggLib version 2.1 is archived.


This module contains tools that are used in other parts of EggLib but that might also be of use to the package’s users.

Specific data formats

Imports a clustal-formatted alignment from the file name fname and returns an Align instance.

(…, string=None, delete_consensus=True)

Imports a Staden output file as an Align instance. The file should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig. to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option delete_consensus is False.

The Staden output file can be read from a file (using the argument fname) or directly from a string (using string). Exactly one of the two must be passed: either a file name as fname or a Staden string as string, but not both.

Staden’s default convention is followed:

  • - codes for an unknown base and is replaced by N.
  • * codes for an alignment gap and is replaced by -.
  • . indicates that the sequence is identical to the consensus at that position.
  • White space represents missing data and is replaced by ?.
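
The convention listed above can be illustrated with a minimal sketch. The helper name decode_staden_row and the idea of processing one row against a consensus string are assumptions made for illustration; this is not EggLib's parser.

```python
def decode_staden_row(row, consensus):
    """Apply the Staden character convention to one aligned row.

    Hypothetical helper: `row` and `consensus` are strings of equal length.
    """
    mapping = {"-": "N", "*": "-", " ": "?"}
    decoded = []
    for position, char in enumerate(row):
        if char == ".":
            # Same as the consensus at this position.
            char = consensus[position]
        # Unknown base, alignment gap and missing data are recoded;
        # any other character is kept unchanged.
        decoded.append(mapping.get(char, char))
    return "".join(decoded)


print(decode_staden_row("A.*- ", "ACGTA"))
```

Dots are resolved before recoding, so a dot facing a * in the consensus ends up as an alignment gap.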

New in version 2.0.1: Add argument delete_consensus.

Changed in version 2.1.0: Read from string or fname.

Imports fgenesh output. fname must be the name of a file containing fgenesh output. The feature definitions are parsed and returned as a list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.

Converts Genalys-formatted sequence alignment files to fasta. The function imports files generated through the option Save SNPs of Genalys 2.8. iname is the name of the Genalys output file. Returns an Align instance.


Bases: list

Minimal implementation of the mase format (allowing input/output operations). This class emulates a list of dictionaries, each dictionary representing a sequence and providing the keys header, name and sequence. However, the string formatter (str(mase) or print mase, where mase is a Mase instance) generates a mase-formatted string. Object attributes are header (a string with file-level information), species (the species of the ingroup) and align (an Align instance corresponding to the data contained in the instance, created upon construction). Modifying this Align instance has no effect on the Mase instance.

The constructor takes an optional argument that can be a string giving the path to a mase-formatted file, or an Align instance. The constructor is currently unable to import population labels, and only sequences marked as ingroup are imported.

Changed in version 2.0.1: An IOError is raised upon file formatting error.

Data analysis

(…, align2, shuffle)

Computes linkage disequilibrium statistics between two Align instances align1 and align2. If shuffle is True, the sequences of align2 are randomly shuffled (without altering the original instance), emulating the hypothesis of linkage equilibrium. Returns a (n1, n2, S1, S2, K1, K2, D, Dp) tuple, where n1 is the number of used sequences of the first alignment, S1 is its number of polymorphic sites and K1 is its number of unique haplotypes (n2, S2 and K2 are the same quantities for the second alignment), D is the standard estimator of linkage disequilibrium and Dp is Lewontin’s estimator (bound by 0 and 1).
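
For orientation, the textbook definitions of D and Lewontin's D' can be sketched for the simplest case of two biallelic loci. This is an illustration only: the function name ld_biallelic and its input format are hypothetical, and the function documented above works on haplotypes of whole alignments, so its estimator may differ in detail.

```python
def ld_biallelic(haplotypes):
    """Return (D, D') for a list of (allele1, allele2) pairs.

    Hypothetical helper: assumes at most two alleles per locus.
    """
    n = float(len(haplotypes))
    # Use the alleles of the first haplotype as reference alleles.
    ref1, ref2 = haplotypes[0][0], haplotypes[0][1]
    p1 = sum(1 for h in haplotypes if h[0] == ref1) / n
    p2 = sum(1 for h in haplotypes if h[1] == ref2) / n
    p12 = sum(1 for h in haplotypes if (h[0], h[1]) == (ref1, ref2)) / n
    d = p12 - p1 * p2
    # D' rescales |D| by its maximum possible value given allele frequencies.
    if d >= 0:
        d_max = min(p1 * (1 - p2), (1 - p1) * p2)
    else:
        d_max = min(p1 * p2, (1 - p1) * (1 - p2))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime


# Complete association between the two loci.
print(ld_biallelic([("A", "G")] * 5 + [("T", "C")] * 5))
```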

Sequence manipulation tools

(…, protseq, code=1, has_stop=False)

Alignment of coding sequences based on aligned predicted products. Conceptual translations of DNA sequences must match exactly the passed protein sequences (except for gaps). Stop codons are not supported. nucseq is a Container instance containing raw coding sequences. protseq is an Align instance containing aligned amino acid sequences. code specifies the genetic code; refer to the documentation of translate(). has_stop specifies whether nucseq sequences are allowed to contain a final stop codon. Returns an Align instance containing aligned coding sequences.

(…, spacer=0, ch='?', strict=True, groupCheck=True)

Concatenates sequence alignments. A single Align is returned. All different sequences from all passed alignments are represented in the final alignment. Sequences whose names match are concatenated. In case several sequences have the same name in a given segment, the first one is considered and others are discarded. In case a sequence is missing for a particular segment, a stretch of non-varying characters is inserted to replace the unknown sequence.

aligns must be an iterable containing Align instances.

spacer specifies the length of unsequenced stretches (represented by non-varying characters) between concatenated alignments. If spacer is a positive integer, the length of all stretches will be identical. If spacer is an iterable containing integers, each integer specifies the interval between two consecutive alignments (if aligns contains n alignments, spacer must be of length n-1).

ch gives the character to use for conserved stretches and for missing segments.

If strict is False, the name comparison will not extend further than the length of the shorter name: for example, names anaconda and anaco will match, and the concatenated sequence will be named anaconda (regardless of which name appears first in the list of Align instances).

If groupCheck is True, an exception will be raised in case of a mismatch between group labels of different sequence segments bearing the same name. Otherwise, the group of the first segment found will be used as group label of the final sequence.
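
The core behaviour described above (matching by name, filling missing segments and inserting spacers) can be sketched on plain dictionaries. The function concat_sketch and the dictionary representation are assumptions made for illustration; the sketch handles only an integer spacer and ignores the strict and groupCheck options.

```python
def concat_sketch(aligns, spacer=0, ch="?"):
    """Concatenate alignments given as {name: sequence} dictionaries.

    Hypothetical helper: every dictionary is non-empty and holds
    sequences of equal length.
    """
    # Collect all names in order of first appearance.
    names = []
    for align in aligns:
        for name in align:
            if name not in names:
                names.append(name)
    stretch = ch * spacer
    result = {}
    for name in names:
        segments = []
        for align in aligns:
            length = len(next(iter(align.values())))
            # A missing sequence is replaced by a run of ch.
            segments.append(align.get(name, ch * length))
        result[name] = stretch.join(segments)
    return result


print(concat_sketch([{"a": "AC", "b": "AG"}, {"a": "TT"}], spacer=2))
```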

New in version 2.0.1: The arguments that customize the function’s behaviour.

(…, clean=False, full=False, all=False, code=1, mini=1)

Finds the longest open reading frame in the sequence. By default, the longest sequence without stop codon (except for the trailing stop codon) is returned; the returned ORF therefore does not necessarily start with ATG or end with a stop codon. If clean is True, returns the longest sequence encoding a valid protein sequence (without stop, without missing data, without gap). If full is True, returns only genuine ORFs (starting with ATG and ending with a stop codon). If all is True, returns a list of all ORFs (of length at least 3), sorted by decreasing length. code specifies the genetic code; refer to the documentation of GeneticCodes.translate(). mini specifies the minimum number of codons (or amino acids) of the returned ORF or ORFs (stop codons are not taken into account).
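
The default behaviour can be approximated by the following sketch, which scans the three forward reading frames for the longest stretch without an internal stop codon and keeps the trailing stop when there is one. The function name longest_stop_free is hypothetical, the stop codons are those of the standard genetic code, and restricting the search to the forward strand is an assumption of the sketch rather than a documented property.

```python
STOPS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def longest_stop_free(seq):
    """Longest forward-strand stretch without internal stop codon."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = frame
        # Index just past the last complete codon of this frame.
        last = frame + ((len(seq) - frame) // 3) * 3
        for i in range(frame, last, 3):
            if seq[i:i + 3] in STOPS:
                # Close the current stretch, trailing stop included.
                candidate = seq[start:i + 3]
                if len(candidate) > len(best):
                    best = candidate
                start = i + 3
        # Stretch running to the end of the frame (no trailing stop).
        candidate = seq[start:last]
        if len(candidate) > len(best):
            best = candidate
    return best


print(longest_stop_free("ATGAAATAGC"))
```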

Changed in version 2.0.1: Added options; return the trailing stop codon when appropriate.

Changed in version 2.1.0: Added option mini. The behaviour of previous versions is reproduced by setting mini to 0.

Reverse-complements a DNA sequence. Upper- and lower-case characters can be passed; the output is always upper-case. IUPAC characters (ACGTMRWSYKBDHV) are complemented. The characters N, - and ? are returned as is. Other characters raise a ValueError.
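
A minimal sketch of the behaviour described above, using the standard IUPAC complement pairs (for example M, meaning A or C, pairs with K, meaning G or T). The name rc_sketch is hypothetical and this is not the library's implementation.

```python
# Each character of the first string is complemented by the character
# at the same position in the second string; N, - and ? map to themselves.
_COMPLEMENT = dict(zip("ACGTMRWSYKBDHVN-?", "TGCAKYWSRMVHDBN-?"))


def rc_sketch(seq):
    """Reverse-complement a DNA string; the output is upper-case."""
    try:
        return "".join(_COMPLEMENT[char] for char in reversed(seq.upper()))
    except KeyError as error:
        raise ValueError("invalid character: %s" % error)


print(rc_sketch("acgtm"))
```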

Changed in version 2.0.1: Characters N, - and ? are correctly processed.

Changed in version 2.0.2: Reimplemented (will be faster for large sequences).

(…, code=1, strip=False)

Translates all sequences from nucleotides to proteins. Accepts sequence container instances and the return type matches the passed type. If strip is True, all stop codons present at the end of any sequence are automatically stripped off. Setting this option to True raises a ValueError if an Align is passed and the sequences do not all have the same number of trailing stop codons. See the documentation of GeneticCodes.translate() for the argument code. Ambiguous codons are translated if all the implied possibilities translate to the same amino acid. The IUPAC nomenclature is used. Note that N means A, C, G or T but that codons containing ? or - are always translated as X (except for --- codons, which are translated as -).

(…, freq, includeOutgroup=True)

Builds a new Align instance containing all sequences of align and only the columns for which the frequency of gaps (- symbols) is less than the value given by freq.

If includeOutgroup is True, the sequences with group label 999 (if any) are considered for computing the frequency of gaps. These sequences are however always exported to the returned alignment.
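
The column filter can be sketched on a plain list of aligned strings. The function ungap_sketch is a hypothetical stand-in that ignores group labels, so it behaves as if every sequence were counted when computing the gap frequency.

```python
def ungap_sketch(sequences, freq):
    """Keep the columns whose gap frequency is strictly less than freq.

    Hypothetical helper: `sequences` is a non-empty list of strings of
    equal length.
    """
    number = float(len(sequences))
    kept = [
        column
        for column in range(len(sequences[0]))
        if sum(1 for seq in sequences if seq[column] == "-") / number < freq
    ]
    # All sequences are returned, restricted to the kept columns.
    return ["".join(seq[column] for column in kept) for seq in sequences]


print(ungap_sketch(["A-C", "A-G", "AT-"], 0.5))
```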

Changed in version 2.1.0: Added option includeOutgroup.


Holds genetic codes. Instantiating this class is pointless since it contains only class methods.

classmethod codes()

Gives the list of code identifiers. Each code is represented by three identifiers: (index, short, long) where index is the integer identifier matching NCBI nomenclature (beware that indices are not consecutive); short is an egglib-defined word summarizing the code, which can be used as an alternative means of access; and long is the full name of the genetic code.

classmethod index(name)

Tries to identify the index of the genetic code from its short or full name. Returns None if the string matches no model. The comparison is case-independent.

classmethod is_start(codon, code=1)

Returns True if the codon encodes one of the observed translational starts for this genetic code, False otherwise (including if the codon is invalid). Arguments are the same as for translate().

classmethod translate(codon, code=1)

Translates the codon codon using the indicated code. code is an identifier (index, short or long name) matching NCBI nomenclature. Returns the one-letter amino acid code corresponding to codon, * for a stop codon and X for any invalid codon (string with a length different from 3 or containing missing data or gaps). The codon specification is case-independent. Ambiguous codons might still be translated if all the implied possibilities translate to the same amino acid. The IUPAC nomenclature is used. Note that N means A, C, G or T but that codons containing ? or - will always be translated as X. However, --- will be translated as -.
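
The handling of ambiguous codons can be sketched for the standard genetic code (NCBI index 1). The function translate_codon is a hypothetical stand-in limited to that single code: every ambiguity character is expanded into the bases it stands for, and the codon is translated only when all expansions agree.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G; third base fastest).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}

_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one codon with the standard code, resolving ambiguity."""
    codon = codon.upper()
    if codon == "---":
        return "-"
    # Wrong length, ?, - or any unknown character gives X.
    if len(codon) != 3 or any(char not in _IUPAC for char in codon):
        return "X"
    expansions = product(*(_IUPAC[char] for char in codon))
    amino_acids = set(_TABLE["".join(bases)] for bases in expansions)
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


for codon in ("ATG", "GGN", "YTA", "ATN", "AT?", "---"):
    print(codon, translate_codon(codon))
```

GGN and YTA are translated because every expansion gives the same amino acid, whereas ATN covers both isoleucine and methionine codons and is returned as X.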

Sequence comparison

(…, seq2)

Compares two sequences. Sequences are different if they have different lengths or if they differ by at least one position. The comparison supports IUPAC ambiguity characters (for example, A and M are not considered to be different). Furthermore, partially overlapping ambiguity characters (for example, M and R) are not taken as different. Returns True if the sequences are identical (or differ only by overlapping IUPAC characters), False otherwise.

(…, motif, mismatches=0, reverse=True)

Locates motifs in a nucleotide sequence. Standard ambiguity characters are supported (as explained in compare() documentation). sequence and motif are nucleotide sequence strings. mismatches gives the number of nucleotide differences allowed for motif match. If reverse is True, both strands are examined (otherwise, only the forward strand is considered). Returns a list of hits. Each hit is represented by a dictionary containing the keys:

  • start: starting position of the hit.
  • sequence: sequence of the matching region.
  • mismatches: number of mismatches in the hit.
  • reverse: True if the hit is on the reverse strand.

The hit position and the found motif are always given with respect to the passed sequence, even when the motif was found on the reverse strand.

(…, motif, start=0, stop=-1)

Locates the position of the motif in sequence. motif and sequence should be DNA sequences. Ambiguity characters (M, R, W, S, Y, K, B, D, H, V and N) are recognized and match the appropriate characters. ? matches any character. Note that the meaning of N (A, C, G or T) is very different from that of ? (any character). start and stop restrict the search to a given subset of sequence (the returned index is still given with respect to the full sequence). The function returns the position of the first exact match or, if there is no exact match, the position of the first match allowing ambiguity characters or, if there is no match at all, None.

If hasAmb is True, ambiguities will be supported in the target sequence (sequence). With that mode on, ambiguities of the motif sequence (motif) will only be considered as a match if the target sequence accounts for all

Changed in version 2.1.0: Supports ambiguity characters in sequence. Returns exact matches first.
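
The search order described for this function (exact matches first, then matches that rely on ambiguity characters) can be sketched as follows. The function locate_sketch is hypothetical: it handles ambiguity characters in the motif only and leaves out the start, stop and hasAmb options.

```python
_IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def locate_sketch(sequence, motif):
    """Position of the first exact match, else first ambiguous match."""
    sequence, motif = sequence.upper(), motif.upper()
    # Pass 1: exact string match.
    exact = sequence.find(motif)
    if exact != -1:
        return exact
    # Pass 2: let motif ambiguity characters match the bases they stand
    # for; ? matches any character.
    for position in range(len(sequence) - len(motif) + 1):
        window = sequence[position:position + len(motif)]
        if all(m == "?" or s in _IUPAC.get(m, m)
               for m, s in zip(motif, window)):
            return position
    return None


print(locate_sketch("TTACGTT", "AYG"))
```

With the sequence ACGAYG and the motif AYG, the sketch returns 3 rather than 0, because the exact occurrence takes precedence over the earlier ambiguous one.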
