EggLib
EggLib version 2.1 is archived.

Top-level Egglib components

Top-level Egglib utilities consist mostly of data storage classes: Container and Align for sequence data, SSR for microsatellite data, GenBank for annotated sequences and Tree for trees.

Class Container

class egglib.Container(fname=None, string=None, groups=False)

Bases: egglib.data.BaseContainer

Holds sequences, without requiring that they have the same length. This class is a C++-implemented class providing performant storage and access utilities, wrapped within a Python layer that interfaces several operations. In particular it allows direct instantiation from a fasta-formatted file or from a string stored in a Python str instance (see constructor’s signature below).

Container also allows subscript indexing (as in container[0]) and iteration (as in for i in container). Returned items are SequenceItem instances that can be either converted to (name, sequence, group) tuples or modified to modify the underlying instance. For example, the following code resets all group indices of the Container instance container:

>>> for i in container:
...     i.group = 0

Container supports calls to str() and expressions such as print container. In both cases, the result of the str() method (with default arguments) is returned. The result is a fasta-formatted string. Consider using the str() method to customize the output and write() to export the instance to a file on the disk.

Container instances have a len() (the result of ns() is returned) and support expressions such as name in container, which return True if name is the name of one of the sequences contained in the instance.

Changed in version 2.0.1: The [] operators accept only indices. sequenceByName() fulfils the dictionary-like behaviour. append(), extend() and __iadd__() (operator +=) are removed.

Constructor arguments

Parameters:
  • fname – the path of a fasta-formatted file or None.
  • string – a string containing fasta sequences or None.
  • groups – whether to import group labels. The labels should appear as strings @0, @1, etc. in the input file.

If fname and string are None, the value of groups is ignored and an empty instance is built. If both fname and string are specified, an error is thrown.
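To illustrate the @0, @1 label convention, here is a minimal plain-Python sketch that splits a group tag from a sequence name. It does not use EggLib: the helper name split_group_label is hypothetical, and the sketch assumes that the tag sits at the very end of the name (as str() describes for export) and that untagged names fall in group 0.

```python
import re

# Hypothetical helper, not part of EggLib. Assumption: the group tag is
# an "@" followed by digits at the very end of the sequence name.
_TAG = re.compile(r"^(.*)@(\d+)$")

def split_group_label(header, groups=True):
    """Return (name, group) for one fasta header (without the '>')."""
    if groups:
        match = _TAG.match(header)
        if match:
            return match.group(1), int(match.group(2))
    # No tag found, or tags not requested: assume group 0.
    return header, 0
```

With groups=False the sketch leaves any @x suffix in the name; whether EggLib does the same is not stated here.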

Changed in version 2.0.1: Doesn’t accept simultaneous values for fname and string.

Methods

addSequences(seqs)

Appends (name, sequence, group) tuples one after the other to the end of the object (past the last sequence). seqs must be an iterable returning (name, sequence, group) tuples (such as a Container or Align instance). The group item is optional and tuples can be of length 2. Returns the number of sequences after the operation.

New in version 2.0.1.

append(name, sequence, group=0)

Adds a sequence to the object. name is the sequence name, sequence the sequence string and group is the population label. Note that the length of sequence must match the length of the alignment, if self is of type Align. Returns the number of sequences after the operation.

appendSequence(pos, sequence)

Appends the sequence string to the end of the sequence at position pos of the instance.

clear()

Deletes all content of the current instance.

composition()

Gets the composition in characters of each sequence. Returns a dictionary with the sequence names as key. Each entry is itself a dictionary giving the absolute frequency of each character found in the corresponding sequences.
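The shape of the returned dictionary can be sketched in plain Python as follows. This is an illustration of the structure described above, not EggLib's implementation; the function takes (name, sequence) pairs instead of an instance.

```python
from collections import Counter

def composition(items):
    """items: iterable of (name, sequence) pairs.

    Returns {name: {character: absolute count}}. In this sketch, a name
    that occurs twice keeps only the counts of its last occurrence.
    """
    return {name: dict(Counter(seq)) for name, seq in items}
```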

contains_duplicates()

True if the instance contains at least one duplicate.

classmethod create(obj)

Creates an instance from the object obj. The created instance will match the type from which the method is called (Container.create(obj) will return a Container, and Align.create(obj) will return an Align, and the same goes if the method is called on an object). In the case of Align, the restriction of coherent sequence lengths applies (there is no automatic correction). obj is a priori a Container or an Align, but the method supports any iterable returning (name, sequence, group) or (name, sequence) tuples (in the latter case, groups will be initialized to 0). For example, the following is valid:

import egglib
data = []
data.append( ("sequence1", "AAAAAAAAA",     0) )
data.append( ("sequence2", "GGGG",          2) )
data.append( ("sequence3", "AAAAAAAAAAAAAA") )
container = egglib.Container.create(data)

New in version 2.0.1.

duplicates()

Returns the list of sequence names found more than once in the instance.

encode(nbits=10)

Renames all sequences using a random mapping of names to unique keys of length nbits. nbits cannot be lower than 4 (which should allow renaming several millions of sequences) or larger than 63 (which is the number of different characters available). Returns a dictionary mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number. The returned mapping can be used to restore the original names using rename().

New in version 2.0.1.

equalize(ch='?')

Appends character ch to the end of sequences so that all sequences have the same length. The length of all sequences will be the length of the longest sequence before the call. This value is returned by the method.
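The padding rule can be sketched in plain Python on a list of strings. This is an illustration only: the real method modifies the instance in place, whereas this stand-alone function returns the padded list together with the final length.

```python
def equalize(sequences, ch="?"):
    """Pad every string with ch up to the length of the longest one."""
    longest = max(len(s) for s in sequences) if sequences else 0
    padded = [s + ch * (longest - len(s)) for s in sequences]
    return padded, longest
```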

find(string, strict=True)

Returns the index of the first sequence with the name specified by string. If strict is False, names longer than string are compared only over the length of string, and their remaining characters are ignored. In other words, the name Alphacaga_tada1 will be recognized if find() is called with string Alphacaga and strict=False. If the name is not found, returns None.

Changed in version 2.1.0: Returns None instead of -1 if the name is not found.
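The strict and non-strict lookups can be sketched in plain Python over a list of names. The function below is a stand-in written for illustration, not the library's code; it treats the non-strict mode as a prefix comparison, which is how the description above reads.

```python
def find(names, string, strict=True):
    """Index of the first matching name, or None if there is no match."""
    for index, name in enumerate(names):
        if name == string:
            return index
        # Non-strict mode: characters of name beyond len(string) are ignored.
        if not strict and name.startswith(string):
            return index
    return None
```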

get(s, p)

Gets the character value of the sequence s at position p.

group(pos, group=None)

Sets/gets the group label of the sequence at index pos. If group is None, returns the current group label. Otherwise changes the group label and returns nothing. If not None, group must be a positive integer.

groupByName(name, strict=True)

Returns the group label corresponding to the first match of name. If the name is not found, raises a KeyError. If strict is True, seeks an exact match. If False, compares only until the end of the requested name (for example: 'ATCFF' will match 'ATCFF_01' if strict is false).

New in version 2.0.1.

groups()

Gets the group structure. Returns a dictionary with the group labels (as int) as keys. Values are the lists of sequence names corresponding to each group.

isEqual()

Returns True if all sequences have the same length, False otherwise.

ls(pos)

Returns the length of the sequence string at position pos.

New in version 2.0.1.

matches(format)

Returns the list of indices of the sequences whose name matches the passed format. The format is passed as is to the re module using the function search() (which doesn’t necessarily match the beginning of the string). If no sequence name matches the passed format, an empty list is returned.
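Because re.search() is used, a pattern can match anywhere in a name unless it is anchored with ^. A plain-Python sketch of the same selection over a list of names (an illustration, not the library's code):

```python
import re

def matches(names, pattern):
    """Indices of the names in which re.search finds the pattern."""
    return [i for i, name in enumerate(names) if re.search(pattern, name)]
```

For example, the pattern pop1 selects both pop1_a and x_pop1, while ^pop selects only names starting with pop.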

name(pos, name=None)

Sets/gets the name of the sequence at index pos. If name is None, returns the current name. Otherwise changes the name and returns nothing.

names()

Returns the list of sequence names.

no_duplicates()

Discards all duplicates: for all sequences with the same name, the one with the largest index is removed.

ns()

Returns the number of sequences contained in the instance.

remove(name)

Removes the first sequence having name name. If no sequence has this name, a KeyError is raised. A workaround is easy to implement:

>>> index = align.find(name)
>>> if index is not None:
...     del align[index]

Changed in version 2.0.1: New meaning.

rename(mapping, liberal=False)

Rename all sequences of the instance using the passed mapping. If liberal is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.

New in version 2.0.1.

sequence(pos, sequence=None)

Sets/gets the sequence string at index pos. If sequence is None, returns the current sequence. Otherwise changes the sequence and returns nothing. If the object is an Align, the sequence length must match the alignment length.

sequenceByName(name, strict=True)

Returns the sequence string corresponding to the first match of name. If the name is not found, raises a KeyError. If strict is True, seeks an exact match. If False, compares only until the end of the requested name (for example: 'ATCFF' will match 'ATCFF_01' if strict is False).

New in version 2.0.1.

set(sequence, position, ch)

Sets the character value at string position position of the sequence at index sequence to value ch.

shuffle(maintain_outgroup=True)

Randomly reassigns group labels. Modifies the current object and returns nothing. If maintain_outgroup is True, doesn’t reassign the outgroup (group label 999).

slice(a, b)

Extracts a selection of sequences. Sequences with indices a to b-1 are extracted and returned as a new instance. If a is smaller than 0, 0 is used instead. If b is larger than the number of sequences, the latter is used instead. If b is not larger than a, the returned instance is empty.
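The bound handling described above differs from Python's own slicing, where a negative index counts from the end. A plain-Python sketch of the documented rules on a list (illustration only):

```python
def slice_items(items, a, b):
    """Return items a to b-1, clamping the bounds as documented."""
    a = max(a, 0)            # a smaller than 0 is replaced by 0
    b = min(b, len(items))   # b larger than the size is replaced by the size
    if b <= a:
        return []            # b not larger than a: empty result
    return items[a:b]
```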

str(exportGroupLabels=False, lineLength=50)

Formats the instance as a fasta string. exportGroupLabels: if True, exports group/population membership as @x tags placed at the end of sequence names (where x is any positive integer). lineLength gives the number of characters to place on a single line in the fasta output. If 0, no newlines are inserted within sequences.
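The effect of the two options can be sketched in plain Python. The function below is an illustration working on (name, sequence, group) tuples; details such as the trailing newline are assumptions of the sketch, not a description of EggLib's exact output.

```python
def to_fasta(items, exportGroupLabels=False, lineLength=50):
    """Format (name, sequence, group) tuples as a fasta string."""
    lines = []
    for name, sequence, group in items:
        header = ">" + name
        if exportGroupLabels:
            header += "@%d" % group   # tag placed at the end of the name
        lines.append(header)
        if lineLength == 0:
            lines.append(sequence)    # no newline within the sequence
        else:
            for start in range(0, len(sequence), lineLength):
                lines.append(sequence[start:start + lineLength])
    return "\n".join(lines) + "\n"
```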

write(fname, exportGroupLabels=False, lineLength=50)

Writes the sequences to a fasta-formatted file. fname is the name of the file to create. Other arguments are as for str().

New in version 2.0.1.

Class Align

class egglib.Align(fname=None, string=None, groups=False)

Bases: egglib.data.BaseContainer

Holds sequences and ensures that they have the same length. This class is a C++-implemented class providing performant storage and access utilities, wrapped within a Python layer that interfaces several operations. In particular it allows direct instantiation from a fasta-formatted file or from a string stored in a Python str instance (see constructor’s signature below).

Align also allows subscript indexing (as in align[0]) and iteration (as in for i in align). Returned items are SequenceItem instances that can be either converted to (name, sequence, group) tuples or modified to modify the underlying instance. For example, the following code resets all group indices of the Align instance align:

>>> for i in align:
...     i.group = 0

Align supports calls to str() and, as a result, expressions such as print align. In both cases, the result of the str() method (with default arguments) is returned. The result is a fasta-formatted string. Consider using the str() method to customize the output and write() to export the instance to a file on the disk.

Align instances have a len() (the result of ns() is returned) and support expressions such as name in align, which return True if name is the name of one of the sequences contained in the instance.

Changed in version 2.0.1: The [] operators accept only indices. sequenceByName() fulfils the dictionary-like behaviour. append(), extend() and __iadd__() (operator +=) are removed.

Constructor arguments

Parameters:
  • fname – the path of a fasta-formatted file or None.
  • string – a string containing fasta sequences or None.
  • groups – whether to import group labels. The labels should appear as strings @0, @1, etc. in the input file.

If fname and string are None, the value of groups is ignored and an empty instance is built. If both fname and string are specified, an error is thrown.

Changed in version 2.0.1: Doesn’t accept simultaneous values for fname and string.

Methods

Rmin(minimumExploitableData=1.0, ignoreFrequency=0, validCharacters='ACGT', missingData='MRWSYKBDHVN?-')

Computes the minimal number of recombination events.

The computation is performed as described in Hudson, RR and NL Kaplan. 1985. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147-164. The returned parameter is the minimal number of recombination events, given by the number of non-overlapping pairs of segregating sites violating the four-gamete rule. Only sites with two alleles are considered. Note that homoplasy (multiple mutations) mimics recombination. The result of this function is not stored in this instance, and is re-computed at each call.
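The counting principle can be sketched in plain Python. The code below is not EggLib's implementation: it assumes the input is already reduced to diallelic columns without missing data, and it counts disjoint incompatible intervals with a greedy pass, which should give the same count as the procedure of Hudson and Kaplan.

```python
def four_gametes(col_i, col_j):
    """True if two diallelic columns show all four allele combinations."""
    return len(set(zip(col_i, col_j))) == 4

def rmin(columns):
    """columns: one string per diallelic segregating site, in alignment
    order. Returns the number of disjoint pairs of incompatible sites."""
    intervals = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if four_gametes(columns[i], columns[j]):
                intervals.append((i, j))
    # Greedy selection: visit intervals by increasing right end and keep
    # one only if it starts at or after the end of the last kept interval.
    intervals.sort(key=lambda pair: pair[1])
    count, last_end = 0, 0
    for start, end in intervals:
        if start >= last_end:
            count += 1
            last_end = end
    return count
```

Intervals are treated as open, so two intervals that share only an endpoint count as two separate events.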

addSequences(seqs)

Appends (name, sequence, group) tuples one after the other to the end of the object (past the last sequence). seqs must be an iterable returning (name, sequence, group) tuples (such as a Container or Align instance). The group item is optional and tuples can be of length 2. Returns the number of sequences after the operation.

New in version 2.0.1.

append(name, sequence, group=0)

Adds a sequence to the object. name is the sequence name, sequence the sequence string and group is the population label. Note that the length of sequence must match the length of the alignment, if self is of type Align. Returns the number of sequences after the operation.

appendSequence(pos, sequence)

Appends the sequence string sequence to the sequence at position pos.

binSwitch(pos)

Takes all characters at position pos; replaces 0 by 1 and the other way around, and raises an exception if another character is found. This method doesn’t have a return value.

character(s, p)

Fast accessor to a character. Returns character at position p of sequence s. This accessor is faster than get() because it doesn’t perform out-of-bound check.

clear()

Deletes all content of the current instance.

column(pos)

Extracts the alignment column at position pos as a list of characters.

composition()

Gets the composition in characters of each sequence. Returns a dictionary with the sequence names as key. Each entry is itself a dictionary giving the absolute frequency of each character found in the corresponding sequences.

consensus()

Generates a consensus of the object, assuming nucleotide sequences. The consensus is generated based on standard ambiguity (IUPAC) codes. A - character is inserted if any sequence has a -. A ? character is inserted if any sequence has a ?. Returns the consensus string.
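The rules above can be sketched in plain Python with the standard IUPAC ambiguity table. This is an illustration, not the library's code: it accepts only A, C, G, T, - and ? as input, and it checks - before ?, an assumption made here for columns that contain both.

```python
# IUPAC ambiguity codes keyed by the set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

def consensus(sequences):
    """sequences: equal-length strings over A, C, G, T, '-' and '?'."""
    result = []
    for column in zip(*sequences):
        states = set(column)
        if "-" in states:      # assumption: '-' is checked before '?'
            result.append("-")
        elif "?" in states:
            result.append("?")
        else:
            result.append(IUPAC[frozenset(states)])
    return "".join(result)
```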

contains_duplicates()

True if the instance contains at least one duplicate.

classmethod create(obj)

Creates an instance from the object obj. The created instance will match the type from which the method is called (Container.create(obj) will return a Container, and Align.create(obj) will return an Align, and the same goes if the method is called on an object). In the case of Align, the restriction of coherent sequence lengths applies (there is no automatic correction). obj is a priori a Container or an Align, but the method supports any iterable returning (name, sequence, group) or (name, sequence) tuples (in the latter case, groups will be initialized to 0). For example, the following is valid:

import egglib
data = []
data.append( ("sequence1", "AAAAAAAAA",     0) )
data.append( ("sequence2", "GGGG",          2) )
data.append( ("sequence3", "AAAAAAAAAAAAAA") )
container = egglib.Container.create(data)

New in version 2.0.1.

dataMatrix(mapping='ACGT', others=999)

Returns a copy of the current instance as a DataMatrix. mapping must be a string such as 'ACGT' indicating valid characters that will be encoded by their position in the string (i.e. 0, 1, 2, 3). others gives the index to assign to characters not found in the mapping string.
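The character encoding can be sketched in plain Python for a single sequence. This illustrates the mapping rule only (it does not build a DataMatrix), and it assumes that the comparison is case-sensitive.

```python
def encode(sequence, mapping="ACGT", others=999):
    """Replace each character by its position in mapping, or by others."""
    return [mapping.index(ch) if ch in mapping else others
            for ch in sequence]
```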

duplicates()

Returns the list of sequence names found more than once in the instance.

encode(nbits=10)

Renames all sequences using a random mapping of names to unique keys of length nbits. nbits cannot be lower than 4 (which should allow renaming several millions of sequences) or larger than 63 (which is the number of different characters available). Returns a dictionary mapping all the generated keys to the actual sequence names. The keys are case-dependent and guaranteed not to start with a number. The returned mapping can be used to restore the original names using rename().

New in version 2.0.1.

extract(*args)

Extracts the given positions (or columns) of the alignment and returns a new alignment. There are two ways of using this method. The first is by passing a range specification as in align.extract(100, 200). The bounds will be passed as is to the slice operator on all sequences. The above example will extract columns 100 to 199. As a result, out-of-bound values will be silently supported. The second use of the method is as in align.extract([80, 143, 189, 842, 967]). The single argument must be an iterable containing position indices, which may contain repetitions and need not be sorted. The positions will be extracted in the specified order.

New in version 2.0.1.
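The two calling conventions of extract() can be sketched in plain Python on a list of equal-length strings. This is an illustration of the described semantics, not the library's code.

```python
def extract(sequences, *args):
    """extract(seqs, start, stop) or extract(seqs, positions)."""
    if len(args) == 2:
        start, stop = args
        # Bounds go straight to the slice operator, so out-of-bound
        # values are silently accepted.
        return [seq[start:stop] for seq in sequences]
    (positions,) = args
    # Positions are taken in the given order; repetitions are allowed.
    return ["".join(seq[p] for p in positions) for seq in sequences]
```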

filter(ratio, valid='ACGT')

Removes the sequences with too few valid sites. ratio is the limit threshold (relative to the sequence with the largest number of valid characters). The user can specify the list of valid states through the argument valid. The comparison is case-independent. This method modifies the current instance and returns nothing.

find(string, strict=True)

Returns the index of the first sequence with the name specified by string. If strict is False, names longer than string are compared only over the length of string, and their remaining characters are ignored. In other words, the name Alphacaga_tada1 will be recognized if find() is called with string Alphacaga and strict=False. If the name is not found, returns None.

Changed in version 2.1.0: Returns None instead of -1 if the name is not found.

fix_gap_ends()

Replaces all leading or trailing alignment gaps (-) by missing data symbols (?). Internal alignment gaps (those having at least one character other than - and ? at each side) are left unchanged.
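The distinction between terminal and internal gaps can be sketched in plain Python for one sequence. This is an illustration, not the library's code; treating a sequence made only of - and ? as entirely terminal is an assumption of the sketch.

```python
def fix_gap_ends(sequence):
    """Replace leading and trailing '-' by '?', keep internal gaps."""
    informative = [i for i, ch in enumerate(sequence) if ch not in "-?"]
    if not informative:
        # Assumption: with no informative character, every gap is terminal.
        return sequence.replace("-", "?")
    first, last = informative[0], informative[-1]
    head = sequence[:first].replace("-", "?")
    tail = sequence[last + 1:].replace("-", "?")
    return head + sequence[first:last + 1] + tail
```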

get(s, p)

Gets the character value of the sequence s at position p.

group(pos, group=None)

Sets/gets the group label of the sequence at index pos. If group is None, returns the current group label. Otherwise changes the group label and returns nothing. If not None, group must be a positive integer.

groupByName(name, strict=True)

Returns the group label corresponding to the first match of name. If the name is not found, raises a KeyError. If strict is True, seeks an exact match. If False, compares only until the end of the requested name (for example: 'ATCFF' will match 'ATCFF_01' if strict is false).

New in version 2.0.1.

groups()

Gets the group structure. Returns a dictionary with the group labels (as int) as keys. Values are the lists of sequence names corresponding to each group.

ls()

Returns the length of the alignment.

matches(format)

Returns the list of indices of the sequences whose name matches the passed format. The format is passed as is to the re module using the function search() (which doesn’t necessarily match the beginning of the string). If no sequence name matches the passed format, an empty list is returned.

matrixLD(minimumExploitableData=1.0, ignoreFrequency=0, validCharacters='ACGT', missingData='MRWSYKBDHVN?-')

Generates the matrix of linkage disequilibrium between all pairs of polymorphic sites. The options have the same meaning as for polymorphism().

Returns a dictionary containing the following keys:
  • minimumExploitableData (value of input parameter),
  • ignoreFrequency (value of input parameter),
  • n (number of pairs of sequences),
  • d (alignment distance between polymorphic sites),
  • D (D linkage disequilibrium statistic),
  • Dp (D’ linkage disequilibrium statistic),
  • r (Pearson’s correlation coefficient),
  • r2 (square Pearson’s correlation coefficient).

D, Dp, r and r2 are four possible measures of linkage disequilibrium. minimumExploitableData, ignoreFrequency and n are provided as integer values. d, D, Dp, r and r2 are provided as nested dictionaries containing the matrix of values of the corresponding statistic between all pairs of polymorphic sites. The individual values can be accessed as follows (example given for D): matrixLD()['D'][i][j] where i is the alignment position of the first site and j is the alignment position of the second site such that i < j. The polymorphic sites are the same as those returned by polymorphism()['siteIndices'] (called on the same object with the same configuration options).
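For one pair of diallelic sites, the four measures can be computed with the textbook formulas sketched below in plain Python. This is not EggLib's implementation: the choice of the reference allele (here, the allele of the first sequence) is an assumption that determines the sign of D, Dp and r, and sign conventions for D’ vary between authors.

```python
import math

def pairwise_ld(col_i, col_j):
    """col_i, col_j: equal-length strings, each with exactly two alleles."""
    n = float(len(col_i))
    a = col_i[0]   # arbitrary reference allele at the first site
    b = col_j[0]   # arbitrary reference allele at the second site
    p_a = col_i.count(a) / n
    p_b = col_j.count(b) / n
    p_ab = sum(1 for x, y in zip(col_i, col_j) if x == a and y == b) / n
    d = p_ab - p_a * p_b
    # Largest absolute value D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    r = d / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"D": d, "Dp": d / d_max, "r": r, "r2": r * r}
```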

name(pos, name=None)

Sets/gets the name of the sequence at index pos. If name is None, returns the current name. Otherwise changes the name and returns nothing.

names()

Returns the list of sequence names.

nexus(prot=False)

Generates a simple nexus-formatted string. If prot is True, adds datatype=protein in the file, allowing it to be imported as proteins (but doesn’t perform further checking). Returns a nexus-formatted string. Note: any spaces and tabs in sequence names are replaced by underscores. This nexus implementation is minimal but will normally suffice to export sequences to programs expecting nexus.

no_duplicates()

Discards all duplicates: for all sequences with the same name, the one with the largest index is removed.

ns()

Returns the number of sequences contained in the instance.

phylip(format='I')

Returns a phyml-formatted string representing the content of the instance. The phyml format is suitable as input data for PhyML and PAML software. Raises a ValueError if any name of the instance contains at least one character of the following list: “()[]{},;” as well as spaces, tabs, newlines and linefeeds. Group labels are never exported. Sequence names cannot be longer than 10 characters. A ValueError will be raised if a longer name is met. format must be ‘I’ or ‘S’ (case-independent), indicating whether the data should be formatted in the sequential (S) or interleaved (I) format (see PHYLIP’s documentation for definitions).

phyml()

Returns a phyml-formatted string representing the content of the instance. The phyml format is suitable as input data for PhyML and PAML software. Raises a ValueError if any name of the instance contains at least one character of the following list: “()[]{},;” as well as spaces, tabs, newlines and linefeeds. Group information is never exported.

polymorphism(allowMultipleMutations=False, minimumExploitableData=1.0, ignoreFrequency=0, validCharacters='ACGT', missingData='MRWSYKBDHVN?-', useZeroAsAncestral=False, skipDifferentiationStats=False, skipOutgroupBasedStats=False, skipAllHaplotypeStats=False, skipHaplotypeDifferentiationStats=False)

Computes nucleotide and haplotype diversity statistics.

Arguments:

  • minimumExploitableData: sites where the non-missing data (as defined by mapping strings, see below) are at a frequency smaller than this value will be removed from the analysis. Use 1. to take only ‘complete’ sites into account and 0. to use all sites (the option is not considered for haplotype-based statistics).
  • allowMultipleMutations: if False, only sites with 1 or 2 alleles are considered, and sites with more alleles are considered as missing data. The sum of the frequencies of all alleles not matching the outgroup will be treated as the derived allele frequency (for orientable sites).
  • ignoreFrequency: removes sites that are polymorphic because of an allele at absolute frequency (as an integer: number of copies) smaller than or equal to this value. If ignoreFrequency=0, no sites are removed; if ignoreFrequency=1, singleton sites are ignored. Such sites are completely removed from the analysis (not counted in lseff). Note that if more than one mutation is allowed, the site is removed only if the counts of all the alleles but one are smaller than or equal to this value. For example, an alignment column AAAAAAGAAT is ignored with an ignoreFrequency of 1, but AAAAAAGGAT is conserved (including the third allele T which is a singleton).
  • validCharacters: a string giving the list of characters that should be considered as valid data.
  • missingData: characters indicating missing data, that is tolerated but ignored. All characters that are neither in validCharacters nor missingData, but found in the data, will cause an error.
  • useZeroAsAncestral: if true, all outgroups (if present) will be ignored and the character “0” will be considered as ancestral for all sites, whatever the character mapping.
  • skipDifferentiationStats, skipOutgroupBasedStats, skipAllHaplotypeStats, skipHaplotypeDifferentiationStats allow the user to skip part of the analysis (in order to save time).
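The ignoreFrequency rule can be sketched in plain Python for one alignment column. This is an illustration of the rule as stated, not the library's code: it assumes the column holds only valid characters and that the allele exempted from the test is the most frequent one.

```python
from collections import Counter

def is_ignored(column, ignoreFrequency=0):
    """True if a polymorphic column is dropped by the frequency filter."""
    counts = sorted(Counter(column).values(), reverse=True)
    if len(counts) < 2:
        return False   # fixed site: not polymorphic, nothing to filter
    # Dropped only if every allele but the most frequent one is rare.
    return all(count <= ignoreFrequency for count in counts[1:])
```

The two example columns given for ignoreFrequency behave as documented: is_ignored("AAAAAAGAAT", 1) is True and is_ignored("AAAAAAGGAT", 1) is False.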

The method returns a dictionary containing the diversity statistics. Some of the statistics will be computed only in the presence of more than one group in the alignment, or in the presence of an outgroup, or depending on the value of other statistics or on whether skip flags were activated (otherwise, they will have a None value).

These statistics are always computed:

  • nseff: Average number of analyzed sequences per analyzed site. It equals ns() minus the number of outgroup sequences unless minimumExploitableData is set to a value < 1. In the latter case it can be a fraction.
  • lseff: Number of analyzed sites.
  • npop: Number of populations detected in the alignment.
  • S: Number of polymorphic sites.
  • eta: Minimal number of mutations (ignores allowMultipleMutations).
  • sites: List of SitePolymorphism instances (one for each polymorphic site).
  • siteIndices: List of the alignment positions of the polymorphic sites.
  • singletons: List of positions of singletons.

These statistics are computed only if lseff is > 0:

  • thetaW: Theta estimator of Watterson (Theor. Popul. Biol. 7:256-276, 1975).
  • Pi: Nucleotide diversity.

This statistic is computed only if S is > 0:

  • D: Tajima statistic (Genetics 123:585-595, 1989)
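The standard formulas behind S, thetaW, Pi and D can be sketched in plain Python as follows. This is a textbook computation, not EggLib's implementation: it assumes at least four sequences, no missing data and no outgroup, and it reports thetaW and Pi per sequence (not per site), which may differ from the scaling EggLib uses.

```python
import math
from itertools import combinations

def diversity(sequences):
    """sequences: at least four equal-length strings, no missing data."""
    n = len(sequences)
    S = sum(1 for column in zip(*sequences) if len(set(column)) > 1)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = S / a1
    # Pi: mean number of differences over all pairs of sequences.
    pairs = list(combinations(sequences, 2))
    pi = sum(sum(x != y for x, y in zip(s, t))
             for s, t in pairs) / float(len(pairs))
    D = None
    if S > 0:
        # Constants of Tajima (1989).
        b1 = (n + 1.0) / (3 * (n - 1))
        b2 = 2.0 * (n * n + n + 3) / (9 * n * (n - 1))
        c1 = b1 - 1 / a1
        c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
        variance = (c1 / a1) * S + (c2 / (a1 ** 2 + a2)) * S * (S - 1)
        if variance > 0:
            D = (pi - theta_w) / math.sqrt(variance)
    return {"S": S, "thetaW": theta_w, "Pi": pi, "D": D}
```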

These statistics are computed only if skipAllHaplotypeStats is False:

  • He: Haplotypic diversity.
  • K: Number of distinct haplotypes.
  • alleles: A list giving the haplotype index for each sequence of the alignment, or -1 when not applicable, such as for the outgroup.

This statistic is computed only if skipAllHaplotypeStats and skipHaplotypeDifferentiationStats are False and npop is > 1:

  • Snn: Nearest neighbor statistics (Hudson Genetics 155:2011-2014, 2000).

These statistics are computed only if skipAllHaplotypeStats and skipHaplotypeDifferentiationStats are False, npop is > 1 and S is > 0:

  • Fst: Population differentiation, based on nucleotides (Hudson et al. Genetics 132:583-589, 1992).
  • Gst: Population differentiation, based on haplotypes (Nei version, Hudson et al. Mol. Biol. Evol. 9:138-151, 1992).
  • Hst: Population differentiation, based on haplotypes (Hudson et al. Mol. Biol. Evol. 9:138-151, 1992).
  • Kst: Population differentiation, based on nucleotides (Hudson et al. Mol. Biol. Evol. 9:138-151, 1992).

These statistics are computed only if skipDifferentiationStats is False and npop is > 1:

  • pair_CommonAlleles: For each pair of populations, number of sites with at least one allele shared by the two populations. Alleles that are fixed in one or both populations are taken into account, provided that they are polymorphic over the whole sample.
  • pair_FixedDifferences: For each pair of populations, number of sites with a fixed difference between the two populations.
  • pair_SharedAlleles: For each pair of populations, number of sites with at least one allele shared by the two populations. Only alleles that are segregating in both populations are taken into account.
  • pop_Polymorphisms: For each population, number of polymorphic sites in this population.
  • pop_SpecificAlleles: For each population, number of sites with at least one allele specific to this population.
  • pop_SpecificDerivedAlleles: For each population, number of sites with at least one derived allele specific to this population.
  • CommonAlleles: Number of sites with at least one allele shared among at least two populations.
  • FixedDifferences: Number of sites with at least one difference fixed between two populations.
  • SharedAlleles: Number of sites with at least one allele shared by at least two populations.
  • SpecificAlleles: Number of sites with at least one allele specific to one population.
  • SpecificDerivedAlleles: Number of sites with at least one derived allele specific to one population.

These statistics are computed only if skipDifferentiationStats is False, npop is > 1 and lseff > 0:

  • average_Pi: Average of nucleotide diversity per population.
  • pop_Pi: Vector of nucleotide diversity per population.

This statistic is computed only if skipDifferentiationStats is False and npop is 3:

  • triConfigurations: A list of 13 numbers counting the number of sites falling in the possible configurations in three populations. Only diallelic loci are considered and rooting is not considered. The possible configurations are explained using an example below. The order is as in the returned list (remember that indices start from 0 in Python). Each line gives the allele(s) present in each population: A means one allele, G the other allele and A/G both alleles (polymorphism within this population). A and G can be substituted (A/G, A, A is the same as A/G, G, G).

    • 0: A/G, A, A
    • 1: A/G, A, G
    • 2: A, A/G, A
    • 3: A, A/G, G
    • 4: A, A, A/G
    • 5: A, G, A/G
    • 6: A/G, A/G, A
    • 7: A/G, A, A/G
    • 8: A, A/G, A/G
    • 9: A/G, A/G, A/G
    • 10: A, G, G
    • 11: A, G, A
    • 12: A, A, G
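The classification of one site into the 13 configurations listed above can be sketched in plain Python. The function below is an illustration written from that list, not the library's code; it assumes that each population is represented by a non-empty string of the alleles observed at the site.

```python
def tri_configuration(pop1, pop2, pop3):
    """Return the configuration index (0-12) of one site, or None if
    the site does not carry exactly two alleles over the three groups."""
    sets = [set(pop1), set(pop2), set(pop3)]
    if len(sets[0] | sets[1] | sets[2]) != 2:
        return None
    poly = [len(s) == 2 for s in sets]
    n_poly = sum(poly)
    if n_poly == 3:
        return 9
    if n_poly == 2:
        # 6, 7, 8: the fixed population is the third, second, first one.
        return 6 + (2 - poly.index(False))
    if n_poly == 1:
        k = poly.index(True)
        fixed = [s for s in sets if len(s) == 1]
        # Even index: the two fixed populations carry the same allele.
        return 2 * k + (0 if fixed[0] == fixed[1] else 1)
    # No polymorphic population: exactly one population differs.
    if sets[1] == sets[2]:
        return 10
    if sets[0] == sets[2]:
        return 11
    return 12
```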

These statistics are computed only if skipOutgroupBasedStats is False:

  • lseffo: Number of oriented sites that were analyzed.
  • So: Number of polymorphic sites among oriented sites.

These statistics are computed only if skipOutgroupBasedStats is False and So is > 0:

  • thetaH: Theta estimator of Fay and Wu (Genetics 155:1405-1413, 2000).
  • thetaL: Theta estimator of Zeng et al. (Genetics 174:1431-1439, 2006).
  • H: Fay and Wu statistic (Genetics 155:1405-1413, 2000).
  • Z: Fay and Wu statistic standardized by Zeng et al. (Genetics 174:1431-1439, 2006).
  • E: Zeng et al. statistic (Genetics 174:1431-1439, 2006).

The returned dictionary also contains a nested dictionary options which reports the values used at function call.

Changed in version 2.0.2: Polymorphisms is renamed pop_Polymorphisms. The following statistics are added: pair_CommonAlleles, pair_FixedDifferences, pair_SharedAlleles, pop_SpecificAlleles, pop_SpecificDerivedAlleles. The following statistics are now computed only if So > 0: thetaH, thetaL, E, H and Z. The following statistics are now computed only if lseff > 0: thetaW, Pi, pop_Pi and average_Pi. The following statistics are computed only if S > 0: D, Fst, Gst, Hst and Kst. npop is always returned. For consistency, outgroup-based statistics are computed even if lseffo is 0 (except those that require So > 0).

Changed in version 2.1.0: The statistics not computed are now exported and set to None.

polymorphismBPP(dataType=1)

Computes diversity statistics using tools provided through the Bio++ libraries. Note that attempting to call this method from an EggLib module compiled without Bio++ support will result in a RuntimeError.

Arguments:

  • dataType: 1 for DNA, 2 for RNA, 3 for protein sequences, 4 for standard codons, 5 for vertebrate mitochondrial codons, 6 for invertebrate mitochondrial codons and 7 for echinoderm mitochondrial codons.

The method returns a dictionary containing the diversity statistics. Some keys will be computed only in the presence of an outgroup, or if sequences were specified as coding or depending on the value of other statistics (otherwise, they will be None).

The following statistics are always computed:

  • S: Number of polymorphic sites.
  • Sinf: Number of parsimony informative sites.
  • Ssin: Number of singleton sites.
  • eta: Minimal number of mutations.
  • thetaW: Theta estimator (Watterson Theor. Popul. Biol. 7:256-276, 1975).
  • T83: Theta estimator (Tajima Genetics 105:437-460, 1983)
  • He: Heterozygosity.
  • Ti: Number of transitions.
  • Tv: Number of transversions.
  • K: Number of haplotypes.
  • H: Haplotypic diversity.
  • rhoH: Hudson’s estimator of rho (Genet. Res. 50:245-250, 1987).

The following statistic is computed only if Tv > 0:

  • TiTv: Transition/transversion ratio.

The following statistic is computed only if S > 0:

  • D: Tajima statistic (Genetics 123:585-595, 1989).

The following statistics are computed only if eta > 0:

  • Deta: Tajima’s D computed with eta instead of S.
  • Dflstar: Fu and Li’s D* (without outgroup; Genetics 133:693-709).
  • Fstar: Fu and Li’s F* (without outgroup; Genetics 133:693-709).

The following statistic is computed only if an outgroup is found:

  • Sext: Mutations on external branches.

The following statistics are computed only if an outgroup is found and eta > 0:

  • Dfl: Fu and Li’s D (Genetics 133:693-709).
  • F: Fu and Li’s F (Genetics 133:693-709).

The following statistics are computed only if sequences are coding (dataType = 4-7):

  • ncodon1mut: Number of codon sites with exactly one mutation.
  • NSsites: Average number of non-synonymous sites.
  • nstop: Number of codon sites with a stop codon.
  • nsyn: Number of codon sites with a synonymous change.
  • PiNS: Nucleotide diversity computed on non-synonymous sites.
  • PiS: Nucleotide diversity computed on synonymous sites.
  • SNS: Number of non-synonymous polymorphic sites.
  • SS: Number of synonymous polymorphic sites.
  • Ssites: Number of synonymous sites.
  • tWNS: Watterson’s theta computed on non-synonymous sites.
  • tWS: Watterson’s theta computed on synonymous sites.

The following statistics are computed only if sequences are coding (dataType = 4-7) and an outgroup is found:

  • MK: McDonald-Kreitman test table (Nature 351:652-654, 1991).
  • NI: Neutrality index (Rand and Kann Mol. Biol. Evol. 13:735-748).

The returned dictionary also contains a nested dictionary options which reports the values used at function call.

Changed in version 2.0.2: The following statistics are now computed only if S > 0: D, Deta, Dflstar, Fstar, Dfl, F.

Changed in version 2.1.0: The statistics not computed are now exported and set to None.
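
As a rough illustration of three of the statistics listed above (S, thetaW and D), the sketch below computes them in pure Python from a list of aligned strings. It uses neither Bio++ nor EggLib. The function name tajima_d is hypothetical, values are expressed per alignment rather than per site, characters are compared as they are (no handling of gaps or missing data), and at least four sequences are assumed because the variance term vanishes for smaller samples.

```python
from itertools import combinations
from math import sqrt

def tajima_d(seqs):
    """Return (S, thetaW, pi, D) for equal-length strings; D is None if S == 0."""
    n = len(seqs)
    # S: number of alignment columns showing more than one character
    S = sum(1 for column in zip(*seqs) if len(set(column)) > 1)
    # pi: average number of differences over all pairs of sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / float(len(pairs))
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    theta_w = S / a1
    if S == 0:
        # mirrors the rule above: D is not computed without polymorphism
        return S, theta_w, pi, None
    b1 = (n + 1.0) / (3 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    D = (pi - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
    return S, theta_w, pi, D
```

Returning None for D when S is 0 follows the convention described above for statistics that cannot be computed.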

remove(name)

Removes the first sequence having name name. If no sequence has this name, a KeyError is raised. A workaround is easy to implement:

>>> index = align.find(name)
>>> if index is not None:
...     del align[index]

Changed in version 2.0.1: New meaning.

removePosition(pos)

Removes the character at position pos of all sequences (effectively removing a column of the alignment). Returns the new length of the alignment.

rename(mapping, liberal=False)

Renames all sequences of the instance using the passed mapping. If liberal is False and a name does not appear in mapping, a ValueError is raised. If liberal is True, names that don’t appear in mapping are left unchanged.

New in version 2.0.1.
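
The effect of the liberal flag can be pictured with the following pure-Python sketch. It works on a plain list of names rather than on a Container or Align instance, and the function name rename_all is hypothetical.

```python
def rename_all(names, mapping, liberal=False):
    """Return a new list of names translated through mapping."""
    renamed = []
    for name in names:
        if name in mapping:
            renamed.append(mapping[name])
        elif liberal:
            # names missing from the mapping are kept unchanged
            renamed.append(name)
        else:
            raise ValueError("no new name provided for %r" % name)
    return renamed
```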

sequence(pos, sequence=None)

Sets/gets the sequence string at index pos. If sequence is None, returns the current sequence. Otherwise changes the sequence and returns nothing. If the object is an Align, the sequence length must match the alignment length.

sequenceByName(name, strict=True)

Returns the sequence string corresponding to the first match of name. If the name is not found, raises a KeyError. If strict is True, seeks an exact match. If False, compares only until the end of the requested name (for example: 'ATCFF' will match 'ATCFF_01' if strict is False).

New in version 2.0.1.
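
The difference between strict and non-strict matching can be sketched as follows. This is not the EggLib implementation: the helper sequence_by_name is hypothetical and operates on a list of (name, sequence) tuples, and non-strict matching is interpreted as a prefix comparison, as the 'ATCFF' example suggests.

```python
def sequence_by_name(records, name, strict=True):
    """Return the sequence of the first record whose name matches."""
    for record_name, sequence in records:
        if record_name == name:
            return sequence
        # non-strict mode: compare only up to the end of the requested name
        if not strict and record_name.startswith(name):
            return sequence
    raise KeyError(name)
```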

set(sequence, position, ch)

Sets the character value at string position position of the sequence at index sequence to value ch.

shuffle(maintain_outgroup=True)

Randomly reassigns group labels. Modifies the current object and returns nothing. If maintain_outgroup is True, doesn’t reassign the outgroup (group label 999).
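
One possible reading of this behaviour is a random permutation of the existing labels, sketched below on a plain list. Whether EggLib permutes labels in exactly this way is an assumption, and the function name shuffle_groups is hypothetical.

```python
import random

OUTGROUP = 999  # outgroup label, as stated above

def shuffle_groups(groups, maintain_outgroup=True):
    """Permute group labels in place among the positions allowed to change."""
    positions = [i for i, label in enumerate(groups)
                 if not (maintain_outgroup and label == OUTGROUP)]
    labels = [groups[i] for i in positions]
    random.shuffle(labels)
    for i, label in zip(positions, labels):
        groups[i] = label
```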

simErrors(rate)

Randomly introduces missing data. rate is the desired proportion of missing data. Replaces random valid positions by N. There should be no missing data in the original object. Note that the module numpy is required, and that this method might be inefficient for large error rates. Changes the current object and returns nothing.

Changed in version 2.1.0: Restricted to Align instances.

slice(a, b)

Extracts a selection of sequences. Sequences with indices a to b-1 are extracted and returned as an Align instance. If a is smaller than 0, 0 is used instead. If b is larger than the number of sequences, the latter is used instead. If b is not larger than a, the returned instance is empty.
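
The clamping rules can be summarized by the short sketch below, which returns the selected indices instead of an Align instance. The function name selected_indices and the ns argument (the number of sequences) are introduced here for illustration only.

```python
def selected_indices(a, b, ns):
    """Return the sequence indices kept by a slice from a to b-1."""
    a = max(a, 0)    # a negative start is replaced by 0
    b = min(b, ns)   # an oversized stop is replaced by the number of sequences
    return list(range(a, b))  # empty whenever b is not larger than a
```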

slider(wwidth, wstep)

Provides a means to perform sliding-window analysis over the alignment. This method returns a generator that can be used as in for window in align.slider(wwidth,wstep), where each step window of the iteration will be an Align instance of length wwidth (or less if not enough sequence is available near the end of the alignment). Each step moves forward following the value of wstep.
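
The window coordinates produced by such a generator might look like the sketch below, which yields column bounds rather than Align instances. The function name window_bounds is hypothetical, wstep is assumed to be positive, and the exact handling of the last, shorter windows in EggLib is an assumption.

```python
def window_bounds(length, wwidth, wstep):
    """Yield (start, stop) column bounds of each window; stop is exclusive."""
    start = 0
    while start < length:
        # the last windows are truncated at the end of the alignment
        yield start, min(start + wwidth, length)
        start += wstep
```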

str(exportGroupLabels=False, lineLength=50)

Formats the instance as a fasta string. exportGroupLabels: if True, exports group/population membership as @x tags placed at the end of sequence names (where x is any positive integer). lineLength gives the number of characters to place on a single line in the fasta output. If 0, no newlines are inserted within sequences.
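
The two options can be illustrated with a minimal pure-Python formatter. It takes (name, sequence, group) tuples rather than a Container or Align instance, and the function name to_fasta is hypothetical.

```python
def to_fasta(records, exportGroupLabels=False, lineLength=50):
    """Format (name, sequence, group) tuples as a fasta string."""
    lines = []
    for name, sequence, group in records:
        header = ">" + name
        if exportGroupLabels:
            header += "@%d" % group  # group tag placed at the end of the name
        lines.append(header)
        if lineLength == 0:
            lines.append(sequence)  # no newline inserted within the sequence
        else:
            for i in range(0, len(sequence), lineLength):
                lines.append(sequence[i:i + lineLength])
    return "\n".join(lines) + "\n"
```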

write(fname, exportGroupLabels=False, lineLength=50)

Writes the sequences to a fasta-formatted file. fname is the name of the file to create. Other arguments are as for str().

New in version 2.0.1.

Class SequenceItem

class egglib.SequenceItem(parent, index)

Item managing the name, sequence and group values of a given index of a Container or Align instance. Any change applied to the SequenceItem instance is immediately propagated to the Container or Align instance (generating the corresponding exception in case of misuse). It is important to note that some errors might be generated when attempting to access data and not upon object creation. The print item statement (where item is a SequenceItem instance) returns a specially formatted string "name",sequence,group where name, sequence and group are the name string, sequence string and group index for the index corresponding to the instance (note the double quotes around the name and the commas separating the three items). The instance also supports iteration and index-based access; a SequenceItem instance always contains three items (item 0 is the name, item 1 the sequence and item 2 the group index).

New in version 2.0.1.

parent must be a Container or Align instance and index an index lying in the range of len(parent).

group

Group label

name

Name string

sequence

Sequence string

SSR class

class egglib.SSR

Bases: object

SSR data container. This class is essentially a wrapper for the lower-level class DataMatrix with parser/formatter methods.

The user doesn’t normally need to manipulate directly the class attributes described below.

The class attribute dataMatrix holds the DataMatrix instance contained by this instance. loci contains the list (in the correct order) of locus names. individuals stores the names of individuals (or automatically-generated names if they were not available) and maps them to 1 or 2 (for diploid data) indices of dataMatrix. individuals maps individuals to populations names. populations stores the population names. The labels used in the DataMatrix instance are the indices of this list. missing stores the value used to identify missing data (None if missing data are not allowed).

New in version 2.0.1.

Changed in version 2.1.0: Individual to population mapping and string formatting added.

The constructor initializes an empty instance.

Fstats(locus=None)

Computes F-statistics from the currently loaded data, using the method of Weir and Cockerham (1984). If locus is an integer, the statistics for that locus are returned. If locus is None, the multi-locus version of the statistics is returned. This method returns a (Fis, Fst, Fit) tuple. If one or several values cannot be computed (due to lack of one or more components of the variance), the corresponding value is replaced by None. This method requires that all genotypes have two alleles. In case of missing data, complete genotypes (i.e. data for one individual at a given locus) are removed.

clear()

Removes all data from the instance.

load(dataMatrix, sampleConfiguration=None)

Imports the data present in the DataMatrix instance passed as dataMatrix. Note that the SSR instance is supposed to take ownership of the DataMatrix instance, which should not be modified outside the class and which will be cleared if the SSR is cleared. The appropriate behaviour is to delete any outside reference to dataMatrix after passing it to this method. The argument sampleConfiguration is an iterable indicating the number of samples per population. There must be one item per population. If items are integers, they give the number of diploid samples (one random chromosome per individual), otherwise they must be a sequence of two integers giving the number of diploid and haploid samples (in this order). The total sum of given integers must match the number of genotypes of dataMatrix. If sampleConfiguration is None, it is assumed that all samples are haploid samples. The population labels from the passed DataMatrix are discarded unless sampleConfiguration is None.

Below are some examples of accepted values for the second argument:

  • 10 individuals from each of 4 populations with both chromosomes sampled: [10, 10, 10, 10].
  • 20 individuals from each of 4 populations with one chromosome sampled: [(0,20), (0,20), (0,20), (0,20)].
  • A mixture of samples: [10, (0,20), (5,10), (1,18)].

In all three examples, 20 chromosomes are sampled from each population, summing up to a total of 80 samples.

Changed in version 2.1.0: Population names were previously integers, they now are converted to strings.
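
The chromosome counts quoted for the three example values of sampleConfiguration can be checked with a few lines of Python. The helper chromosomes_per_population is hypothetical and only reproduces the arithmetic (two chromosomes per diploid sample, one per haploid sample); it does not touch any DataMatrix.

```python
def chromosomes_per_population(sampleConfiguration):
    """Return the number of sampled chromosomes for each population."""
    counts = []
    for item in sampleConfiguration:
        if isinstance(item, int):
            diploid, haploid = item, 0   # a bare integer counts diploid samples
        else:
            diploid, haploid = item      # (diploid, haploid) pair
        counts.append(2 * diploid + haploid)
    return counts
```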

numberOfGenotypes()

Returns the number of genotypes of the data currently loaded.

numberOfLoci()

Returns the number of loci of the data currently loaded.

parse(string, diploid=True, genotypeSeparator=None, alleleSeparator='/', header=True, missing='999')

Imports data from the string string. The data should follow this format: one line containing locus names (if header is True) and then one line per individual. The header need not be aligned with the data matrix. It is only required that the number of items on the first line matches the number of genotypes given for each individual. Each line is made of a population name (if groups is True), the individual name followed by the appropriate number of genotype values. A given genotype is coded by two (if diploid is True) or one (otherwise) integer. In the former case, the two values must be separated by alleleSeparator. genotypeSeparator gives the separator between values for one individual. The default value of genotypeSeparator matches all white spaces (including space and tabulation). The same separator is used between population and individual labels. The user must specify another value for genotypeSeparator to be able to import data that have spaces in names. Unless using the default separator, ensure that the separator is not duplicated in the string (e.g. two spaces in a row between header items): this is supported only for the default (refer to the standard library str.split() method for more details). Missing data are coded by the string or the integer given by missing.

Example (to be read with default argument values):

          locus1  locus2   locus3
pop1 ind1 001/001 001/002  001/003
pop1 ind2 002/002 001/001  001/001
pop1 ind3 002/002 001/001  002/002
pop1 ind4 001/002 001/002  003/002
pop2 ind5 001/002 003/003  000/000
pop2 ind6 003/004 004/004  002/003
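
A layout such as the one above can be read with standard string methods, as in the sketch below. It is not the parse() implementation: the function name read_genotypes is hypothetical, only whitespace-separated diploid data with a header line and population names are handled, and missing data receive no special treatment.

```python
def read_genotypes(string, alleleSeparator="/"):
    """Return (loci, genotypes) from whitespace-separated diploid data."""
    lines = [line for line in string.splitlines() if line.strip()]
    loci = lines[0].split()  # str.split() without argument merges repeated blanks
    genotypes = {}
    for line in lines[1:]:
        fields = line.split()
        population, individual, values = fields[0], fields[1], fields[2:]
        if len(values) != len(loci):
            raise ValueError("wrong number of genotypes for %s" % individual)
        alleles = [tuple(int(a) for a in value.split(alleleSeparator))
                   for value in values]
        genotypes[individual] = (population, alleles)
    return loci, genotypes
```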

stats()

Computes diversity statistics from the currently loaded data. Returns a dictionary containing the following values:

  • k, the number of alleles.
  • V, the variance of allele size.
  • He, the expected heterozygosity.
  • thetaI, theta assuming IAM.
  • thetaHe, theta from He (assuming SMM).
  • thetaV, theta from V (assuming SMM).

Each entry is a list of the corresponding statistics computed for the corresponding locus.
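
For a single locus, statistics of this kind can be sketched as follows. This page does not state which estimators EggLib uses, so everything here is an assumption: He is computed with the n/(n-1) sample correction, V is the sample variance, and the theta values come from the textbook expectations He = theta/(1+theta) under IAM, He = 1 - 1/sqrt(1+2*theta) under SMM and V = theta/2 under SMM. The function name locus_stats is hypothetical.

```python
from collections import Counter

def locus_stats(alleles):
    """alleles: list of integer allele sizes sampled at one locus (n >= 2)."""
    n = len(alleles)
    counts = Counter(alleles)
    mean = sum(alleles) / float(n)
    V = sum((a - mean) ** 2 for a in alleles) / (n - 1.0)
    He = n / (n - 1.0) * (1.0 - sum((c / float(n)) ** 2 for c in counts.values()))
    result = {"k": len(counts), "V": V, "He": He, "thetaV": 2.0 * V,
              "thetaI": None, "thetaHe": None}
    # the two He-based estimators are undefined when He reaches 1
    if He < 1.0:
        result["thetaI"] = He / (1.0 - He)
        result["thetaHe"] = 0.5 * (1.0 / (1.0 - He) ** 2 - 1.0)
    return result
```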

str()

Returns a string representation of the object. The string can be parsed by the method parse() using default options.

TIGR class

class egglib.TIGR(fname=None, string=None)

Bases: object

Automatic wrapper around the TIGR XML format for storing genome annotation data.

Object initialization: TIGR(fname=None, string=None)

fname must be the name of a file containing TIGR-formatted XML data. string should be directly a string containing TIGR-formatted XML data. It is not allowed to specify both fname and string to non-None values, but at least one of the two arguments must be specified (it is currently impossible to create an empty instance).

extract(start, stop)

Returns a GenBank instance containing the sequence of the range [start, stop] and the features that are completely included in that range. Note that positions must be expressed in the TIGR system’s own coordinate system.

GenBank class

class egglib.GenBank(fname=None, string=None)

Bases: object

GenBank represents a GenBank-formatted DNA sequence record.

Constructor signature: GenBank(fname=None, string=None). Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read directly from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which it is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.

add_feature(feature)

Pushes a feature to the instance. The argument feature must be a well-formed GenBankFeature instance.

extract(from_pos, to_pos)

Returns a new GenBank instance representing a subset of the current instance, from position from_pos to to_pos. All features that are completely included in the specified range are exported.
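
The rule that only completely included features are exported can be pictured with the following sketch. Features are represented as (type, start, stop) tuples with inclusive bounds, which is a simplification of GenBankFeature, and the function name features_within is hypothetical.

```python
def features_within(features, from_pos, to_pos):
    """Keep the features lying entirely inside [from_pos, to_pos]."""
    return [feature for feature in features
            if feature[1] >= from_pos and feature[2] <= to_pos]
```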

get_sequence()

Access to the sequence string.

number_of_features()

Gives the number of features contained in the instance.

rc()

Reverse-complements the instance (in place). All feature positions and the sequence will be reversed and applied to the complementary strand. The features will be sorted by increasing start position (after reversing). This method should be applied only to genuine nucleotide sequences.
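
The coordinate change implied by a reverse-complement operation can be sketched as below. Positions are assumed to be 1-based and inclusive, features are reduced to (start, stop) tuples, only the characters A, C, G, T and N are handled, and the function name reverse_complement is hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(sequence, features):
    """Return the reverse-complemented sequence and the converted features."""
    length = len(sequence)
    rc_sequence = "".join(COMPLEMENT[base] for base in reversed(sequence))
    # a position p counted from the start becomes length - p + 1,
    # so start and stop swap roles; the result is sorted by start position
    rc_features = sorted((length - stop + 1, length - start + 1)
                         for start, stop in features)
    return rc_sequence, rc_features
```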

set_sequence(string)

Sets the sequence string. Note that changing the record’s string might invalidate the features.

write(fname)

Creates a file named fname and writes the formatted record to it.

write_stream(stream)

Writes the content of the instance as a Genbank-formatted string within the passed file (or file-compatible) stream.

class egglib.GenBankFeature(parent)

Bases: object

GenBankFeature contains a feature associated to a GenBank instance. Instances of this class should not be instantiated or used separately from a GenBank instance. The constructor creates an empty instance (although a GenBank instance must be passed as parent) and either set() or parse() must be used subsequently.

add_qualifier(key, value)

Adds a qualifier to the instance’s qualifiers.

copy(genbank)

Returns a copy of the current instance, connected to the GenBank instance genbank.

get_sequence()

Returns the string corresponding to this feature. If the positions pass beyond the end of the parent’s sequence, a RuntimeError (instead of IndexError) is raised.

parse(string)

Updates feature information from information read in a GenBank-formatted string.

qualifiers()

Returns a dictionary with all qualifier values. This method cannot be used to change data within the instance.

Changed in version 2.1.0: Meaning changed.

rc(length=None)

Reverse-complement the feature: apply it to the complement strand and reverse positions counting from the end. The length argument specifies the length of the complete sequence and is usually not required.

set(type, location, **qualifiers)

Updates feature information: type is a string identifying the feature type (such as gene, CDS, misc_feature, etc.); location must be a GenBankFeatureLocation instance giving the feature’s location. Other qualifiers must be passed as keyword arguments. Note that type can be any string and that it is not allowed to use “type” as a qualifier keyword.

shift(shift)

Shifts all positions according to the (positive or negative) argument.

start()

Returns the first position of the (first) segment, such that start() is always smaller than stop().

stop()

Returns the last position of the (last) segment, such that start() is always smaller than stop().

type()

Returns the type string of the instance.

class egglib.GenBankFeatureLocation(string=None)

Bases: object

Holds the location of a GenBank feature. Supports various forms of location as defined in the GenBank format specification. The constructor contains a parser working from a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders). GenBankFeatureLocation supports iteration over (first,last) segments regardless of their types (for a single-base segment at position position, the tuple (position,position) is returned; similar 2-item tuples are returned for other types of segment as well). GenBankFeatureLocation also supports access (but not assignment nor deletion) through the [] operator. A (first,last) tuple is returned as for the iterator. Finally, the instance can be GenBank-formatted using str(). The length of the instance is the number of segments.

addBaseChoice(first, last, left_partial=False, right_partial=False)

Adds a segment corresponding to a single base chosen within a base range. If no segments were previously entered, sets the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use addBaseChoice(1127,1482) in combination with setComplement(). All entered positions must be larger than any positions entered previously and last must be strictly larger than first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relatively to the forward strand and consistently with the numbering system).

addBaseRange(first, last, left_partial=False, right_partial=False)

Adds a base range to the feature. If no segments were previously entered, sets the unique segment location. first and last must be integers. The feature will be set between first and last positions, including both limits. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use addBaseRange(1127,1482) in combination with setComplement(). All entered positions must be larger than any positions entered previously and last must be larger than or equal to first. left_partial and/or right_partial must be set to True if, respectively, the real start of the segment lies 5’ of first and/or the real end of the segment lies beyond last (relatively to the forward strand and consistently with the numbering system).

addBetweenBase(position)

Adds a segment lying between two consecutive bases. If no segments were entered previously, sets the unique segment location. position must be an integer. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use addBetweenBase(1127) in combination with setComplement(). All entered positions must be larger than any positions entered previously.

addSingleBase(position)

Adds a single-base segment to the feature. If no segments were entered previously, sets the unique segment location. position must be an integer. All entered positions must be larger than any positions entered previously.

asOrder()

Defines the feature as an order instead of a range.

asRange()

Defines the feature as a range, which is the default.

copy()

Returns a deep copy of the current instance.

isComplement()

True if the feature is on the complement strand.

isRange()

True if the feature is a range (the default), False if it is an order.

rc(length)

Reverse the feature positions: positions are modified to be counted from the end. The length of the complete sequence must be passed.

setComplement()

Places the feature on the complement strand.

setNotComplement()

Places the feature on the forward (not complement) strand, which is the default.

shift(shift)

Shifts all positions according to the (positive or negative) argument.

Tree class

class egglib.Tree(fname=None, string=None)

Bases: object

Handles phylogenetic trees. A tree is a linked collection of nodes which all (except the root) have at least one ascendant and any number of descendants. Nodes are implemented as TreeNode instances. A node without descendants is a leaf. A node with exactly one ascendant and one descendant is generally meaningless, but is allowed. All nodes (internal nodes as well as leaves) have a label which in the case of leaves can be used as leaf name. In accordance with the newick format, it is not possible to apply both a name and a label to a leaf node. All connections are oriented and have a length (although the lengths can be omitted), but note that labels are applied to nodes, not edges (aka branches). All Tree instances have at least one root node which is the only one allowed not to have an ascendant.

This class allows network-like structures, but note that some operations are available only for genuine trees (i.e. without closed paths). Import from and export to strings and files use the bracket-based newick format and are subject to this limitation. Tree instances can be exported using the built-in str() function, and the methods newick() and write(). Tree instances are iterable. Each step yields a TreeNode instance, starting with the root node but without a defined order.

The instance can be initialized as an empty tree (with only a root node), or from a newick-formatted string. By default, the string is read from the file name fname, but it can be passed directly through the argument string. It is not allowed to set both fname and string at the same time. The newick parser expects a well-formed newick string (including the trailing semicolon).

Changed in version 2.0.1: Imports directly from a file. If a string is passed, it is interpreted as a file name by default.

add_node(parent, label=None, brlen=None)

Adds a node to the tree. parent must be a TreeNode instance already present in the instance; label is the label to apply to the node (or the taxon name if the node is intended to be terminal); brlen is the length of the edge connecting parent to the new node. There is no formal difference between introducing a new internal node and a terminal node (or leaf). The new node has initially no descendant and is therefore a leaf until it is itself connected to a new node. The newly created node can be accessed through last_node().

all_leaves()

Returns all leaves of the tree (nodes without descendant), as a list of TreeNode instances. If the tree is empty (that is, contains only a root node), this method returns an empty list.

clean_edge_lengths()

Ensures that all edge lengths are None.

clean_internal_labels()

Ensures that all internal labels are None.

collapse(node)

Collapses a branch. node must be one of the nodes contained in the tree (as a TreeNode instance). It must have a unique ascendant. If not, a ValueError is raised. Obviously, the tree’s root cannot be collapsed. The destruction of the node might discard the information of its label. This information will be transferred to the ascending node. The ascending node’s label will carry whichever of the two labels is not None. If the two labels are identical, nothing will be done. If the two labels are different and different from None, they will be concatenated (ascending node first), separated by a semicolon like in oldlabel;newlabel. The length of the removed edge will be spread equally among all its descendants (see example below).

Collapsing node [4] on the following tree:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

will generate the following tree, with the correction of edge lengths as depicted:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]
 |             |
 |             |-------------------->[5]        L5 = L5+L4/2
[0]            |
 |             \-------------------->[6]        L6 = L6+L4/2
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

Although the total edge length of the tree is not modified, the relationships will be altered: the distance between the descendants of the collapsed node (nodes 5 and 6 in the example above) will be artificially increased.
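
The edge length correction shown in the example (L5 = L5+L4/2) can be reproduced with a small dictionary-based sketch. The tree representation (three dictionaries mapping nodes to their children, parent and edge length) and the function name collapse_node are hypothetical, labels are ignored, and the node is assumed to have at least one descendant.

```python
def collapse_node(children, parent, length, node):
    """Remove node in place, attaching its descendants to its ascendant."""
    ascendant = parent.pop(node)
    descendants = children.pop(node)
    # the removed edge length is spread equally among the descendants
    share = length.pop(node) / float(len(descendants))
    for child in descendants:
        parent[child] = ascendant
        length[child] += share
    # the descendants take the place of the collapsed node
    siblings = children[ascendant]
    position = siblings.index(node)
    siblings[position:position + 1] = descendants
```

With nodes 2 to 6 of the example, collapsing node 4 attaches nodes 5 and 6 directly to node 2 and leaves the total edge length unchanged.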

copy()

Returns a deep copy of self.

findGroup(taxa)

Checks whether a group is one of the groups defined by the tree, regardless of the orientation of the tree. If so, returns the first node found as a TreeNode instance. Returns None if no such group is found. taxa is an iterable of leaf label strings. It is not required that all labels are unique. This method returns the first node encountered whose list of descending leaves matches exactly the list taxa or whose list of ascending leaves (that is, all leaves of the tree that are not among the descending leaves) matches exactly the list taxa. This method disregards the tree orientation; for a tree represented by ((A,B),(C,(D,E)),((F,G),(H,I))), the call findGroup(['A','B','C','D','E']) will succeed and return the node placed at the root of ((F,G),(H,I)). If a monophyletic group must be explicitly searched for, consider using findMonophyleticGroup() instead. The order of leaves is irrelevant. This method returns the root node if taxa is the list of all tree’s leaves. If taxa contains a single label matching a leaf of this tree, then the result will be the same as with get_node().

findMonophyleticGroup(taxa)

Checks whether a group is one of the monophyletic groups defined by the tree. If so, returns the first such node found as a TreeNode instance. Returns None if no such group is found. taxa is an iterable of leaf label strings. It is not required that all labels are unique. This method returns the first node encountered whose list of descending leaves matches exactly the passed list. This method assumes that the tree is rooted, i.e. the orientation of branches is relevant: for a tree represented by ((A,B),(C,(D,E)),((F,G),(H,I))), the call findMonophyleticGroup(['A','B','C','D','E']) will not succeed (technically because the group is overlapping the root). If the group is searched regardless of the orientation of the tree, typically for unrooted trees, consider using findGroup() instead. The order of leaves is irrelevant. This method returns the root node if taxa is the list of all tree’s leaves. If taxa contains a single label matching a leaf of this tree, then the result will be the same as with get_node().
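
The contrast between findGroup() and findMonophyleticGroup() can be reproduced on the example tree with nested tuples. The sketch below is not based on the Tree class: all function names are hypothetical, leaves are plain strings, and leaf labels are compared as sets (so duplicated labels are not supported).

```python
def leaves(node):
    """Return the set of leaf labels descending from node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*[leaves(child) for child in node])

def nodes(node):
    """Yield node and all its descendants, starting with node itself."""
    yield node
    if not isinstance(node, str):
        for child in node:
            for sub in nodes(child):
                yield sub

def find_monophyletic_group(tree, taxa):
    taxa = frozenset(taxa)
    for node in nodes(tree):
        if leaves(node) == taxa:
            return node
    return None

def find_group(tree, taxa):
    taxa = frozenset(taxa)
    everything = leaves(tree)
    for node in nodes(tree):
        below = leaves(node)
        # accept the descending leaves or their complement (ascending leaves)
        if below == taxa or everything - below == taxa:
            return node
    return None
```

For the tree ((A,B),(C,(D,E)),((F,G),(H,I))) and the taxa A to E, find_group returns the subtree ((F,G),(H,I)) while find_monophyletic_group returns None.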

frequency_nodes(trees, relative=False)

Labels all nodes of the current instance with integers counting the number of trees, among the trees in the iterable trees, where the same node exists. Each item must be a Tree instance defining exactly the same set of leaf labels. If relative is True, the numbers are expressed as fractions. The label is converted to a string in both cases.

get_node(name)

Returns the first node of the tree bearing the given label. The returned object is a TreeNode. If no nodes of the tree match the passed name, None is returned. The order in which the nodes are examined is not defined. name can be of any type, including None (comparison is performed without conversion).

get_node_re(regex)

Returns the first node of the tree matching the regular expression regex. regex should be a valid regular expression (refer to the documentation of the re module of the standard Python library). If no nodes of the tree match the regular expression, None is returned. The order in which the nodes are examined is not defined.

get_nodes(name)

Returns all nodes of the tree that bear the given label. The returned object is always a list of zero or more TreeNode instances. The order in which nodes are sorted is not defined. name can be of any type, including None (comparison is performed without conversion).

get_nodes_re(regex)

Returns all nodes of the tree matching the regular expression regex. regex should be a valid regular expression (refer to the documentation of the re module of the standard Python library). The returned object is always a list of zero or more TreeNode instances. The order in which nodes are sorted is not defined.

get_terminal_nodes()

Returns the list of all TreeNode instances of this tree that don’t have descendants. In case of an empty tree, an empty list is returned (ie the root is never returned).

last_node()

Returns the last loaded node (as a TreeNode instance). If no nodes were loaded, the root is returned.

lateralize()

At each node of the tree, sorts the descendants based on the number of leaves that descend from them. The result is a tree where the richest branches are pushed to the back.
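
The sorting rule can be sketched with nested lists, where a leaf is any non-list object. The function names are hypothetical and the handling of ties (descendants with the same number of leaves keep their relative order) is an assumption.

```python
def count_leaves(node):
    """Return the number of leaves descending from node."""
    if not isinstance(node, list):
        return 1
    return sum(count_leaves(child) for child in node)

def lateralize(node):
    """Sort descendants in place at every level, richest branches last."""
    if isinstance(node, list):
        for child in node:
            lateralize(child)
        node.sort(key=count_leaves)
```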

midroot()

Automatic rooting of the tree using the midpoint method. The tree must not be previously rooted, there must be no closed paths or network-like structures, and edges must have an available length value.

newick(labels=True, brlens=True)

Returns the newick-formatted string representing the instance. If labels is False, omits the internal branch labels. If brlens is False, omits the branch lengths. Doesn’t support closed paths.
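
The effect of the labels and brlens flags can be illustrated with a minimal formatter. Nodes are represented here as (label, brlen, children) tuples, which is a hypothetical structure unrelated to TreeNode, and the function names to_newick and _format are introduced for illustration only.

```python
def to_newick(node, labels=True, brlens=True):
    """Return the newick string of a (label, brlen, children) structure."""
    return _format(node, labels, brlens) + ";"

def _format(node, labels, brlens):
    label, brlen, children = node
    text = ""
    if children:
        inner = ",".join(_format(child, labels, brlens) for child in children)
        text = "(" + inner + ")"
    # leaf names are always written; internal labels only when requested
    if label is not None and (labels or not children):
        text += str(label)
    if brlens and brlen is not None:
        text += ":" + str(brlen)
    return text
```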

number_of_leaves()

Gives the number of leaves (terminal nodes) of the tree. Returns 0 if the tree contains the root only.

number_of_nodes()

Gives the number of nodes of the tree (including leaves and root).

remove_node(node)

Removes the node from the tree and all its descendants. Any node can be removed, including nodes without descendants, provided that the root is not among the nodes removed. The node in question must have only one ascendant. In case its ascendant had previously only two descendants and only one ascendant, it will be automatically removed.

reoriente(new_root)

Moves the root location of the tree. This method is solely intended to alter the representation of unrooted trees (trees that have a trifurcation at the root). new_root must be a TreeNode instance contained in this tree and representing the position of the new root. It might be the current root. It is illegal to call this method on trees that have a closed path between the current root and the new root. A ValueError is raised whenever the tree cannot be reoriented.

If the original tree has this structure:

 /------------------------------------------>[1]
 |
 |             /---------------------------->[3]
 |             |
 |----------->[2]             /------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \------------->[6]
 |
 |              /--------------------------->[8]
 |              | 
 \------------>[7]            /------------>[10]
                |             |
                \----------->[9]
                              |
                              \------------>[11]

And rooting is requested at node [7], the outcome will be as depicted below, the edge lengths being ignored:

             /--------------------------------[1]
             |
 /----------[0]         /---------------------[3]
 |           |          |
 |           \---------[2]        /-----------[5]
 |                      |         |       
 |                      \--------[4]
 |                                |
[7]                               \-----------[6]
 |
 |--------------------------------------------[8]
 |
 |                     /---------------------[10]
 |                     |
 \--------------------[9]
                       |
                       \---------------------[11]

If the length of the branch on which the root is placed (L0) is not None, the length of the edge [E1] will be branchsplit * L0 and the length of [E2] will be (1 - branchsplit) * L0.

Note that the label of the outgroup node is copied to the node on the other side of the newly placed root. The rationale is that information attached to the root edge might have to be applied to both basal edges. In case the original root (or basal node) had a specified label, it will be retained and the outgroup label will not be copied.

root(outgroup, branchsplit=0.5)

Roots the tree. outgroup must be a TreeNode instance contained in this tree. The root will be placed somewhere on the branch that leads to this node (between the current root and the node). If this branch doesn’t have a branch length, the branchsplit argument is ignored. Otherwise, branchsplit must be a real number between 0 and 1 and gives the proportion of the branch that must be allocated to the basal branch leading to the outgroup, the complement being allocated to the branch leading to the rest of the tree. It is illegal to call this method on trees that are already rooted (have a bifurcation rather than a trifurcation at the root) or on trees that have a closed path between the current root (or base of the tree) and the intended root. A ValueError is raised whenever the tree cannot be rooted.

If the original tree has this structure:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \----------->[9]
                              |
                              \------------->[11]

And rooting is requested at node [9], the root will be placed on the edge marked by [ROOT] below:

 /------------------------------------------->[1]
 |
 |             /----------------------------->[3]
 |             |
 |----------->[2]             /-------------->[5]
 |             |              |
 |             \------------>[4]
[0]                           |
 |                            \-------------->[6]
 |
 |              /---------------------------->[8]
 |              | 
 \------------>[7]            /------------->[10]
                |             |
                \---[ROOT]-->[9]
                              |
                              \------------->[11]

And the outcome will be as depicted below, with the introduction of a new node (which would be [12] here) at the root:

                     /-------------------------[1]
                     |
              /-----[0]     /------------------[3]
              |      |      |
              |      \-----[2]      /----------[5]
              |             |       |
  /---[E2]---[7]            \------[4]
  |           |                     |
  |           |                     \----------[6]
  |           |
[ROOT]        \--------------------------------[8]
  |
  |                     /---------------------[10]
  |                     |
  \--------[E1]--------[9]
                        |
                        \---------------------[11]

If the length of the branch on which the root is placed (L0) is not None, the length of the edge [E1] will be branchsplit * L0 and the length of [E2] will be (1 - branchsplit) * L0.

Note that the label of the outgroup node is copied to the node on the other side of the newly placed root. The rationale is that information attached to the root edge might have to be applied to both basal edges. In case the original root (or basal node) had a specified label, it will be retained and the outgroup label will not be copied.
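The branch-length arithmetic described for root() can be checked with a short stand-alone sketch. The helper split_root_branch is hypothetical; in particular, raising ValueError for an out-of-range branchsplit is an assumption of this sketch and not a documented behaviour.

```python
def split_root_branch(length, branchsplit=0.5):
    """Return the lengths of the two basal edges created by rooting.

    The first value is the edge leading to the outgroup, the second is the
    edge leading to the rest of the tree. If the original branch has no
    length, branchsplit is ignored and both edges stay undefined.
    """
    if length is None:
        return None, None
    if not 0.0 <= branchsplit <= 1.0:
        raise ValueError("branchsplit must lie between 0 and 1")
    return branchsplit * length, (1 - branchsplit) * length

print(split_root_branch(2.0, 0.25))   # (0.5, 1.5)
print(split_root_branch(2.0))         # (1.0, 1.0)
print(split_root_branch(None, 0.25))  # (None, None)
```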

root_node()

Returns the root node as a TreeNode instance.

smallest_group(taxa, threshold=None, minimum=1)

Returns the smallest clade containing a set of leaves, as a TreeNode instance, without regard to the orientation of the tree. The node returned corresponds to the smallest clade fulfilling the criteria. taxa must be a list of leaf labels. All labels must be found either within the clade, or in the rest of the tree (all tree leaves not in this clade). Duplicates are included whenever appropriate. threshold is the minimum numerical label the node must exhibit to be returned. If threshold is None, this criterion is not applied. Otherwise, nodes that have a label not convertible to float or whose label is lower than threshold are not returned. minimum is the smallest number of descending leaves a clade must have to be returned. The root is never returned. Returns None if no valid node can be found.

Changed in version 2.0.1: The root is never returned; duplicates are supported; the minimum argument is not checked; and nodes that don’t have a numeric label are supported when threshold is not None (but they are excluded).

smallest_monophyleticGroup(taxa, threshold=None, minimum=1)

Returns the most recent common ancestor of a set of leaves, as a TreeNode instance. The node returned corresponds to the smallest clade fulfilling the criteria. taxa must be a list of leaf labels. All labels must be found within the clade, including duplicates whenever appropriate. threshold is the minimum numerical label the node must exhibit to be returned. If threshold is None, this criterion is not applied. Otherwise, nodes that have a label not convertible to float or whose label is lower than threshold are not returned. minimum is the smallest number of descending leaves a clade must have to be returned. The root is never returned. Returns None if no valid node can be found.

Changed in version 2.0.1: The root is never returned; duplicates are supported; the minimum argument is not checked; and nodes that don’t have a numeric label are supported when threshold is not None (but they are excluded).
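The selection criteria can be sketched on a toy structure. This is an illustration only, not the library’s algorithm: nodes are hypothetical (label, children) tuples, the function smallest_clade is made up for the example, and duplicate labels are not handled here.

```python
def leaves(node):
    """Leaf labels found below (or at) a (label, children) node."""
    label, children = node
    if not children:
        return [label]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found

def label_passes(label, threshold):
    """True if no threshold applies or the label is numeric and high enough."""
    if threshold is None:
        return True
    try:
        return float(label) >= threshold
    except (TypeError, ValueError):
        return False          # non-numeric labels are excluded

def smallest_clade(root, taxa, threshold=None, minimum=1):
    best = None
    stack = list(root[1])     # start below the root: it is never returned
    while stack:
        node = stack.pop()
        stack.extend(node[1])
        content = leaves(node)
        if len(content) < minimum or not set(taxa) <= set(content):
            continue
        if not label_passes(node[0], threshold):
            continue
        if best is None or len(content) < len(leaves(best)):
            best = node
    return best

tree = (None, [
    ("0.90", [("A", []), ("0.60", [("B", []), ("C", [])])]),
    ("D", []),
    ("E", []),
])
print(smallest_clade(tree, ["B", "C"])[0])                 # 0.60
print(smallest_clade(tree, ["B", "C"], threshold=0.8)[0])  # 0.90
print(smallest_clade(tree, ["B", "D"]))                    # None
```

With threshold=0.8 the clade labelled 0.60 is skipped and the next larger clade that contains both leaves is returned. The last call yields None because only the root contains both B and D.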

total_length()

Returns the sum of all branch lengths across all nodes. All branch lengths must be defined, otherwise a ValueError will be raised.
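A minimal sketch of this behaviour, assuming a hypothetical dictionary-based tree in which each node lists its (child, branch length) pairs. It is not EggLib code:

```python
def total_length(node):
    """Sum every branch length below node; refuse undefined lengths."""
    total = 0.0
    for child, brlen in node["children"]:
        if brlen is None:
            raise ValueError("a branch length is not defined")
        total += brlen + total_length(child)
    return total

def leaf(name):
    return {"label": name, "children": []}

tree = {"label": None, "children": [
    (leaf("A"), 1.0),
    ({"label": None, "children": [(leaf("B"), 0.5), (leaf("C"), 0.5)]}, 2.0),
]}
print(total_length(tree))   # 4.0
```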

write(fname, labels=True, brlens=True)

Writes the newick-formatted string representing the instance to a file named fname. If labels is False, omits the internal branch labels. If brlens is False, omits the branch lengths. Doesn’t support closed paths.

class egglib.TreeNode

Bases: object

This class provides an interface to a Tree instance’s nodes and allows access and modification of data attached to a given node as well as the tree descending from that node. A node must be understood as the point below a branch. Edges (connections between nodes) have a direction: they go from a node to another node. Nodes have therefore descendants and ascendants. Connecting a node to itself or making a two-way edge (two edges connecting the same two nodes in opposite directions) is not explicitly forbidden. Duplicate edges (between the same two nodes and in the same direction) are however illegal.

The constructor instantiates a tree node with default values.

add_son(label=None, brlen=None)

Generates a new TreeNode instance descending from the current instance. label is to be applied to the new node. brlen is the length of the edge connecting the two nodes. Note that each node will refer to the other, generating a circular reference loop and preventing garbage collection of the node instances. It is therefore required to disconnect all nodes using the method disconnect(). Returns the newly created node.

ascendants()

Returns the list of all ascendants as TreeNode instances.

branch_from(node)

Returns the length of the branch connecting node to this node. node must be a TreeNode instance present amongst this node’s ascendants. This method returns None if the value is not defined.

branch_to(node)

Returns the length of the branch connecting this node to node. node must be a TreeNode instance present amongst this node’s descendants. This method returns None if the value is not defined.

connect(node, brlen=None)

Connects this node to another, existing node. The orientation of the link is from the current instance to the passed instance. brlen is the length of the newly created edge. Note that each node will refer to the other, generating a circular reference loop and preventing garbage collection of the node instances. It is therefore required to disconnect all nodes using the method disconnect().

descendants()

Returns the list of all descendants as TreeNode instances.

get_label()

Returns the node’s label.

is_ascendant(node)

True if the TreeNode instance node is one of this node’s ascendants.

is_descendant(node)

True if the TreeNode instance node is one of this node’s descendants.

leaves_down()

Recursively gets all leaf labels descending from that node. This method supports closed paths (networks) and nodes are never processed more than once.
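The “never processed more than once” guarantee is typically obtained by tracking visited nodes. The sketch below illustrates that idea on a hypothetical dictionary mapping each node to its descendants. It is not the library’s code, and the order of the returned labels is arbitrary.

```python
def leaves_down(graph, start):
    """Collect the leaves reachable from start, visiting each node once."""
    seen = set()
    result = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue              # already reached through another path
        seen.add(node)
        children = graph.get(node, [])
        if children:
            stack.extend(children)
        else:
            result.append(node)   # no descendants: this is a leaf
    return result

# "h" is reachable from both "x" and "y", which forms a closed path.
graph = {"r": ["x", "y"], "x": ["a", "h"], "y": ["h", "b"], "h": ["c"]}
print(sorted(leaves_down(graph, "r")))   # ['a', 'b', 'c']
```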

leaves_up()

Recursively gets all leaf labels contained on the other side of the tree. In case of a network, the results of this method and leaves_down() might overlap. However, there can be no redundancy within the results returned by one of the methods.

numberOfAscendants()

Gets the number of nodes this one descends from (its ascendants).

numberOfDescendants()

Gets the number of nodes descending from this one.

numberOfRelatives()

Gets the number of nodes connected to this one.

remove_ascendant(node)

Removes the edges between this node and the node represented by the TreeNode instance node. node must be one of this node’s ascendants. Note that this method also removes this node from node’s descendants.

remove_descendant(node)

Removes the edges between this node and the node represented by the TreeNode instance node. node must be one of this node’s descendants. Note that this method also removes this node from node’s ascendants.

reverse(node, exchange_labels)

Reverses the orientation of the edge between this node and the node given by node, as a TreeNode instance. The two nodes must be connected by exactly one edge. If exchange_labels is True, the node labels are exchanged.

set_branch_from(node, value)

Sets the length of the branch connecting node to this node. node must be a TreeNode instance present amongst this node’s ascendants. value might be a float or None.

set_branch_to(node, value)

Sets the length of the branch connecting this node to node. node must be a TreeNode instance present amongst this node’s descendants. value might be a float or None. Note that this method affects both nodes.

set_label(value)

Changes the label value.

sort()

Sorts the descendants based on their number of leaves.

str(labels=True, brlens=True)

Formats the node and the subtree descending from it as a newick string. If labels is False, omits the internal branch labels. If brlens is False, omits the branch lengths. Doesn’t support closed paths.

Clears all references to other TreeNode instances contained in this instance.
