This module contains wrappers to external applications. They must be available on the user’s system and detected properly at installation time. To detect a new application a posteriori, one needs to relaunch the detection procedure by typing python setup.py build_apps from the egglib-py directory, and then re-install the configuration file by typing python setup.py install.
Runs the program ms to generate random datasets by coalescence. ms must be installed on the system. Arguments are those of the ms command (refer to the program’s documentation for details). Note that all options starting with e (past demographic changes), as well as m, n and g, expect a list of tuples (at least one), each tuple containing the appropriate number of arguments. Note also that the options are processed in the same order as in the function’s signature. If the tMRCA argument is set to True, the returned alignments will contain a tMRCA member. If both theta and segsites are set to positive values, the returned alignments will contain a prob member. If the T flag is set, the returned alignments will contain a tree member (that will be a Tree instance).
New in version 0.1: Created to provide a closer wrapper of ms.
Changed in version 2.1.0: Exported alignment might contain a prob and/or a trees member.
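The list-of-tuples convention described above can be pictured with a minimal sketch. The helper build_ms_args below is hypothetical and is not part of this module; it only assumes the standard ms flags -t, -en, -eN, -m and -g, and shows how each tuple could be expanded into one occurrence of the corresponding flag.

```python
def build_ms_args(nsam, nrep, theta=None, en=(), eN=(), m=(), g=()):
    """Hypothetical helper: flatten list-of-tuples options into ms arguments."""
    args = ["ms", str(nsam), str(nrep)]
    if theta is not None:
        args += ["-t", str(theta)]
    # Options are handled in the order of the signature; each tuple
    # yields one occurrence of the flag followed by its values.
    for flag, events in (("-en", en), ("-eN", eN), ("-m", m), ("-g", g)):
        for event in events:
            args.append(flag)
            args.extend(str(value) for value in event)
    return args


# Two past changes of the global population size (illustrative values)
cmd = build_ms_args(10, 100, theta=5.0, eN=[(0.1, 0.5), (0.5, 2.0)])
print(" ".join(cmd))
```

Passing a list with a single tuple is enough when only one event is needed.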
Provides NCBI Basic Local Alignment Search Tools for finding homologues of query sequences against a local database. All proposed methods return a dictionary of processed BLAST results. This dictionary stores the hits for each sequence, indexed by its name string. If only one sequence is passed as a string, the output dictionary will always contain one item indexed by an empty string. For a given sequence, the results are presented as a list of HSPs (there can be several HSPs on a single hit sequence), each HSP being a dictionary storing the following information:
- subject: the name of the hit sequence.
- bitScore: the bit score value.
- score: the raw score value.
- eValue: the expectation value.
- qstart, qend: the start and end positions on the query sequence.
- hstart, hend: the start and end positions on the hit sequence.
- qframe, hframe: the frame in which the hit is located in the query and the hit sequence, respectively.
- identity: the number of matching positions.
- gaps: the number of gapped positions.
- length: the length of the hit.
- qseq: the sequence of the query sequence at the HSP.
- hseq: the sequence of the hit sequence at the HSP.
- midline: the string summarizing the quality of the local alignment, indicating matching positions.
The full XML document is nonetheless accessible as the instance’s member xml_results after each call to any of the methods. xml_results is None by default.
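As an illustration of the structure described above, the following sketch keeps, for each query, the HSP with the highest bit score among those passing an E-value threshold. The helper best_hsps and the sample values are made up for the example; only the dictionary layout and the key names subject, bitScore and eValue come from the description.

```python
def best_hsps(results, max_evalue=1e-6):
    """Return the best-scoring HSP of each query (hypothetical helper)."""
    best = {}
    for query, hsps in results.items():
        kept = [hsp for hsp in hsps if hsp["eValue"] <= max_evalue]
        if kept:
            best[query] = max(kept, key=lambda hsp: hsp["bitScore"])
    return best


# Made-up results, restricted to the keys used by the helper
results = {
    "seq1": [
        {"subject": "hitA", "bitScore": 250.0, "eValue": 1e-60},
        {"subject": "hitB", "bitScore": 80.5, "eValue": 1e-12},
    ],
    "seq2": [{"subject": "hitC", "bitScore": 30.1, "eValue": 0.5}],
}
top = best_hsps(results)
for query, hsp in top.items():
    print(query, hsp["subject"], hsp["bitScore"])
```

Queries without any HSP under the threshold are simply absent from the returned dictionary.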
Searches a nucleotide database using nucleotide queries. query can be a string, a Container or an Align instance. In the latter cases, all sequences will be processed. target must refer to a valid database of the correct data type, represented either by its file system path or by a BLASTdb instance. evalue is the expectation value (expected number of random hits by chance alone, depending on the database size). The default value is 1e-6 (therefore much lower, and more stringent, than blastn’s default value, which is 10). penalty is the penalty to apply for a nucleotide mismatch (the default reward for a nucleotide match is +1). The default value is -2. The value must be negative, and should be increased (brought closer to zero) to account for more distant homologies. “A ratio of 0.33 (1/-3) is appropriate for sequences that are about 99% conserved; a ratio of 0.5 (1/-2) is best for sequences that are 95% conserved; a ratio of about one (1/-1) is best for sequences that are 75% conserved” (from BLAST online documentation). All other BLAST parameters can be set as keyword arguments. Keyword arguments are passed as is to the blastn program and can override the default values of evalue and penalty. For example, it is possible to set reward as a keyword argument, as in reward=5, penalty=-4.
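The way defaults and keyword arguments combine can be sketched as follows. The function blastn_options is hypothetical and does not reproduce the wrapper’s actual code; it assumes that each option is forwarded as a -name value pair, as accepted by the BLAST+ blastn program for evalue, penalty and reward.

```python
def blastn_options(evalue=1e-6, penalty=-2, **kwargs):
    """Hypothetical helper: merge defaults and keyword arguments."""
    if penalty >= 0:
        raise ValueError("penalty must be negative")
    options = {"evalue": evalue, "penalty": penalty}
    options.update(kwargs)  # extra keywords are passed as is
    args = []
    for name, value in options.items():
        args += ["-" + name, str(value)]
    return args


# Match/mismatch scheme 5/-4 with the default E-value threshold
print(blastn_options(reward=5, penalty=-4))
```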
Searches a protein database using protein queries. Arguments are as for blastn() with the exception that reward and penalty are not applicable. Parameters matrix, gapopen and gapextend are defined automatically based on the average length of query sequences. These automatic settings can be overridden by keyword arguments.
Searches a protein database using translated nucleotide queries. Arguments are as for blastp().
Provides NCBI Basic Local Alignment Search Tools for aligning two sequences by local alignment. All proposed methods return a list of dictionaries representing all HSPs. The items of the dictionaries correspond to the following variables: qstart, qend, sstart, send, evalue, bitscore, score, length, nident, qframe, sframe, gaps, qseq, sseq and midline. The last one is given only as identity/mismatch marks. Parameters are defined as for BLAST except that query and subject must both be sequence strings.
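A short sketch of how the returned list could be post-processed: it picks the longest HSP and derives its percentage of identity from nident and length. The helper percent_identity and the sample values are made up for the example; only the key names come from the list above.

```python
def percent_identity(hsp):
    """Percentage of identical positions over the HSP length (hypothetical helper)."""
    return 100.0 * hsp["nident"] / hsp["length"]


# Made-up HSPs, restricted to a few of the documented keys
hsps = [
    {"qstart": 1, "qend": 120, "nident": 114, "length": 120, "gaps": 0},
    {"qstart": 200, "qend": 240, "nident": 30, "length": 41, "gaps": 1},
]
longest = max(hsps, key=lambda hsp: hsp["length"])
print(longest["qstart"], longest["qend"], percent_identity(longest))
```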
Align a nucleotide query to a nucleotide subject. Arguments are as for BLAST.blastn() except that query and subject must both be sequence strings.
Align a protein query to a protein subject. Arguments are as for BLAST.blastp() except that query and subject must both be sequence strings.
Align a translated nucleotide query to a protein subject. Arguments are as for BLAST.blastx() except that query and subject must both be sequence strings.
Align a protein query to a translated nucleotide subject. Arguments are as for BLAST.tblastn() except that query and subject must both be sequence strings.
Align a translated nucleotide query to a translated nucleotide subject. Arguments are as for BLAST.tblastx() except that query and subject must both be sequence strings.
Bases: object
Handles a local BLAST database saved as temporary files. This class is most useful when a database is needed only temporarily. The database will be available for BLAST applications as long as the instance lives.
The constructor expects two (mandatory) arguments: container must be a Container or Align instance and type must be either 'nucl' (for nucleotides) or 'prot' (for proteins) and specify the appropriate data type. For protein sequences, trailing stop codons are automatically trimmed.
Returns the full path name of the local database, as required by BLAST programs. It is usually not required to use this method directly.
Performs multiple alignment using CLUSTALW. container can be a Container or an Align instance. If quiet is True, the standard output (but not standard error) of the wrapped program will be intercepted and discarded. Returns an Align instance. By default, the function preserves group labels. However, if the container contains duplicates (even if they belong to the same group), this operation will fail (with a ValueError). To process containers containing duplicates and for which group label information is not important, set the flag nogroups to True.
Performs multiple alignment using MUSCLE. container can be a Container or an Align instance. If quiet is True, progress information will not be shown. If quiet is False, progress information will be shown but the function might not be able to automatically detect errors reported by the wrapped program. Returns an Align instance. By default, the function preserves group labels. However, if the container contains duplicates (even if they belong to the same group), this operation will fail (with a ValueError). To process containers containing duplicates and for which group label information is not important, set the flag nogroups to True.
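Since duplicates make both alignment functions fail unless nogroups is set, it can be convenient to look for them beforehand. The sketch below works on a plain list of names and assumes that duplicates are sequences sharing the same name; how names are extracted from a Container or Align instance is left out, and the helper duplicated_names is hypothetical.

```python
from collections import Counter


def duplicated_names(names):
    """Return the sorted list of names occurring more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


names = ["sample1", "sample2", "sample1", "sample3", "sample2"]
duplicates = duplicated_names(names)
if duplicates:
    print("duplicated names:", ", ".join(duplicates))
```

An empty list means that group labels can be preserved safely.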
Reconstructs phylogeny using maximum likelihood through the PhyML software. input should be an Align instance. model indicates the model to use. Accepted values are HKY85, JC69, K80, F81, F84, TN93 and GTR for nucleotides and LG, WAG, JTT, MtREV, Dayhoff, DCMut, RtREV, CpREV, VT, Blosum62, MtMam, MtArt, HIVw and HIVb for protein sequences. rates gives the number of discrete categories of evolutionary rate. boot sets the number of bootstrap repetitions. Values of -1, -2 and -3 activate one of the test-based branch support evaluation methods that provide faster alternatives to bootstrap repetitions. A value of 0 will provide no branch support at all. topo allows fixing the tree topology. start allows setting the starting topology (it is illegal to set both topo and start to non-None values). search can be NNI (fastest), SPR or BEST (best of both methods). For topo or start, a Tree instance must be passed. If present, branch lengths and branch labels will be ignored. If quiet is True, the standard output of the wrapped program will be intercepted and discarded. The function returns a tuple (tree, loglk) where tree is a Tree instance and loglk the log-likelihood reported by PhyML.
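The constraints on the arguments can be summarized by a small validation sketch. The function check_phyml_options is hypothetical and is not the wrapper’s own code; it only encodes the rules stated above (accepted model names, accepted search values, and the ban on passing both topo and start) and returns an arbitrary label for the data type.

```python
NUCLEOTIDE_MODELS = {"HKY85", "JC69", "K80", "F81", "F84", "TN93", "GTR"}
PROTEIN_MODELS = {
    "LG", "WAG", "JTT", "MtREV", "Dayhoff", "DCMut", "RtREV",
    "CpREV", "VT", "Blosum62", "MtMam", "MtArt", "HIVw", "HIVb",
}


def check_phyml_options(model, topo=None, start=None, search="NNI"):
    """Hypothetical check; returns 'nt' or 'aa' depending on the model."""
    if model in NUCLEOTIDE_MODELS:
        datatype = "nt"
    elif model in PROTEIN_MODELS:
        datatype = "aa"
    else:
        raise ValueError("unknown model: %s" % model)
    if topo is not None and start is not None:
        raise ValueError("topo and start cannot both be set")
    if search not in ("NNI", "SPR", "BEST"):
        raise ValueError("unknown search method: %s" % search)
    return datatype


print(check_phyml_options("GTR", search="SPR"))
```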
Constructs a neighbor-joining tree using programs from the PHYLIP package (dnadist and neighbor). input should be an Align instance containing DNA sequences only. If group is True, the group labels will be appended to sequence names and therefore will appear in the final tree. If quiet is True, the standard output of the wrapped program will be intercepted and discarded. The function returns a Tree instance.
Interface to non-synonymous/synonymous substitution rate analyses available in the codeml program of the PAML package. The sequences and tree are loaded at construction time. The results can be accessed through the return value of fit() or as a pre-formatted string by calling str(codeml) (where codeml is a Codeml instance). Default values of influential parameters are: a start omega value of 0.4, omega not fixed and 10 discrete omega categories. They can be changed using the appropriate accessors. After running fit(), the instance caches the control file as controlfile, the codeml main output file as outputfile and the codeml standard output (where you might be able to read error messages) as standardoutput. This is done to allow manual inspection in case of errors.
Constructor arguments: aln, an Align instance, and tree, a Tree instance. The names of aln and tree must match (except that #x and $x labels, where x is an integer, are ignored at the end of tree leaf names). If tree is None, a star topology will be used. Branch lengths from the tree are discarded.
Fits a given model and collects the result.
The results are returned as a dictionary containing these keys (note that keys irrelevant to the fitted model will not be exported):
- model: the model fitted.
- lnL: the log-likelihood.
- np: number of parameters.
- kappa: the transition/transversion ratio.
- omega: omega estimate, as a single value for M0, a list of two values for M1a, three values for M2a, eleven values for M8a and M8, k values for nW (where k is the number of clades in the tree), a list of four tuples of two values for A0 and A.
- freq: estimates of the frequency of the different categories, None for M0 and nW, a list of two values for M1a, three values for M2a, eleven values for M8a and M8 and four values for A0 and A.
- beta: a tuple for p and q (beta distribution parameters), None for all models but M8a and M8.
- site_method: method used to estimate posterior site probabilities.
- trees: the trees found in the results (in order) as a list of Tree instances.
- site_proba: list with one list per site, each list contains the posterior probability of the site under each omega category, for models M1, M1a, M2a, M8a, M8, A0, A.
- site_class: the list of highest probability class for each site, for models M1a, M2a, M8a, M8, A0, A.
- site_omega: the list of average posterior omega for each site, for models M1, M1a, M2a, M8a, M8.
- site_error: the standard deviation of posterior omega for each site, for models M2a and M8.
If quiet is True, the standard output is intercepted and discarded.
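The lnL and np entries are what is needed to compare nested models with a likelihood ratio test. The sketch below is not part of the class: the helper lrt_df2 is hypothetical, the two dictionaries hold made-up values standing in for results of fit(), and the test is limited to a difference of two parameters (as between M1a and M2a), for which the chi-square survival function reduces to exp(-x/2).

```python
import math


def lrt_df2(null_fit, alt_fit):
    """Likelihood ratio test for two nested models differing by 2 parameters."""
    df = alt_fit["np"] - null_fit["np"]
    if df != 2:
        raise ValueError("this shortcut only holds for 2 degrees of freedom")
    # Test statistic: twice the log-likelihood difference (floored at 0)
    stat = max(2.0 * (alt_fit["lnL"] - null_fit["lnL"]), 0.0)
    # Chi-square survival function with 2 degrees of freedom
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Made-up values standing in for two fit() results
m1a = {"model": "M1a", "lnL": -1234.5, "np": 12}
m2a = {"model": "M2a", "lnL": -1229.5, "np": 14}
stat, p_value = lrt_df2(m1a, m2a)
print("2*deltaL = %.2f, p = %.4f" % (stat, p_value))
```

For other differences in the number of parameters, a general chi-square implementation would be needed instead of this closed form.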
Fixes omega to value. It is not required to call this method for fitting models that require a fixed value of omega.
Sets the number of discrete omega categories.
Sets the start value of omega to value. It is not legal to call this method when omega is fixed.
Releases omega from a previous call to fix_omega() and sets the start value to value.
Primer design using the program PRIMER3. The constructor takes a sequence and optional parameters. The list of parameters and default values can be accessed through the class-level attribute dictionary default_parameters.
sequence must be a nucleotide sequence. Parameter values can be passed as keyword arguments. Parameter default values are taken from Primer.default_parameters. Parameter names should be taken from the default list: a misspelled name might result in a crash later, during primer search.
Checks primer pairs defined using find_pairs() and discards the pairs that fail to pass the test. This method includes a second call to the PRIMER3 application.
Deletes all primer pairs in which at least one primer contains an invalid character close to its 3’ end. Valid characters are the fully resolved, non-missing bases A, C, G and T (case-independent). number gives the number of characters to consider at the 3’ end. If number is larger than the length of the primer, the complete sequence is considered. Returns the number of pairs.
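The filtering rule can be sketched in plain Python. Both helpers below are hypothetical stand-ins, not the class’s own code; they assume primers are given in the 5’ to 3’ orientation (so the 3’ end is the end of the string) and that pairs follow the F/R layout with a seq key for each primer.

```python
def has_clean_end(primer, number):
    """True if the last `number` bases of `primer` are all A, C, G or T."""
    if number <= 0:
        raise ValueError("number must be positive")
    # Slicing beyond the primer length returns the complete sequence
    tail = primer[-number:]
    return all(base in "ACGT" for base in tail.upper())


def filter_pairs(pairs, number):
    """Keep the pairs whose two primers have a clean 3' end."""
    return [
        pair for pair in pairs
        if has_clean_end(pair["F"]["seq"], number)
        and has_clean_end(pair["R"]["seq"], number)
    ]


# Made-up primer pairs
pairs = [
    {"F": {"seq": "ACGTNACGT"}, "R": {"seq": "ttgacca"}},
    {"F": {"seq": "ACGTACGTA"}, "R": {"seq": "TTGACRA"}},
]
print(len(filter_pairs(pairs, 4)), len(filter_pairs(pairs, 5)))
```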
Similar to clean_pair_ends() except that the lists of forward and reverse primers are considered. The pairs, if they were generated, are not affected. Returns a tuple (nf, nr) with nf and nr the numbers of forward and reverse primers, respectively.
Class-level dictionary holding default values for all run parameters.
Finds primer pairs. Primers must have been previously designed. mini and maxi give the range of accepted product sizes. This method doesn’t involve any call to PRIMER3.
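Pairing primers on product size can be sketched as follows. This is an illustration under explicit assumptions, not the method’s actual code: positions are treated as inclusive coordinates on the provided sequence, the pos of a reverse primer is taken as its leftmost coordinate on that sequence, and the helper pair_primers is hypothetical.

```python
def pair_primers(forward, reverse, mini, maxi):
    """Combine primers whose product size lies within [mini, maxi]."""
    pairs = []
    for fwd in forward:
        for rev in reverse:
            start = fwd["pos"]
            # Assumption: rev["pos"] is the leftmost base of the reverse primer
            end = rev["pos"] + len(rev["seq"]) - 1
            size = end - start + 1
            if mini <= size <= maxi:
                pairs.append({"F": fwd, "R": rev,
                              "start": start, "end": end, "size": size})
    return pairs


# Made-up primers
forward = [{"seq": "ACGTACGTACGTACGTACGT", "pos": 10}]
reverse = [{"seq": "TTGACCATTGACCATTGACC", "pos": 200},
           {"seq": "GGATCCGGATCCGGATCCGG", "pos": 500}]
for pair in pair_primers(forward, reverse, 100, 300):
    print(pair["start"], pair["end"], pair["size"])
```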
Finds primers. Returns a tuple (nf, nr) where nf is the number of forward primers found and nr the number of reverse primers found.
Returns a reference to the list of forward primers (that must have been previously detected using find_primers()). Each item of the list represents a primer as a dictionary containing the following keys: seq (the primer sequence, given in the 5’ to 3’ orientation), pos (the position of the nucleotide at the 5’ end), GC%, Tm, Q (the quality value), END (not defined in the PRIMER3 documentation, might be the maximum secondary structure stability) and ANY (also not documented in PRIMER3, might be the maximum misannealing stability with respect to the provided sequence). The last two parameters should be minimized, and their definition will be confirmed as soon as possible.
Returns the list of primer pairs found by find_pairs(). Each item is a dictionary with the values: F (the forward primer), R (the reverse primer), start, end and size. The primers are the same as given by forward_primers() and reverse_primers().
Equivalent to forward_primers(), except that the pos value is the position of the nucleotide at the 3’ end of the primer, therefore the first nucleotide when reading in the original orientation of the provided sequence.
Sorts the list of primer pairs (based on the sum of primer qualities) and selects the best pairs. number gives the number of primer pairs to retain. If there are fewer pairs, they will all be retained, but still be sorted. Returns the number of pairs retained. The lists of forward and reverse primers are not affected.
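The sort-then-truncate behaviour can be sketched on a plain list of pair dictionaries. The helper keep_best_pairs is hypothetical, and the sketch assumes that a lower Q is better (as for a penalty score); if Q is to be maximized instead, the sort order must be reversed.

```python
def keep_best_pairs(pairs, number):
    """Sort pairs in place by summed quality and keep the first `number`."""
    # Assumption: lower summed Q means a better pair
    pairs.sort(key=lambda pair: pair["F"]["Q"] + pair["R"]["Q"])
    del pairs[number:]
    return len(pairs)


# Made-up pairs, restricted to the Q key of each primer
pairs = [
    {"F": {"Q": 1.0}, "R": {"Q": 2.0}},
    {"F": {"Q": 0.5}, "R": {"Q": 0.5}},
    {"F": {"Q": 2.0}, "R": {"Q": 2.0}},
]
retained = keep_best_pairs(pairs, 2)
print(retained, [pair["F"]["Q"] + pair["R"]["Q"] for pair in pairs])
```

Asking for more pairs than available keeps them all, but the list is still sorted.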
Sorts all class attributes (forward and reverse primers and primer pairs) based on quality (sum of both primer qualities for pairs).