External application tools
This module contains functions that can run external programs within the EggLib framework (taking as arguments and/or returning EggLib objects). To use these functions, the underlying programs must be available in the system. This is controlled by the application paths object as explained in Configuring external applications.
phyml | Reconstruct maximum-likelihood phylogeny using PhyML. |
codeml | Fit nucleotide substitution models using PAML. |
nj | Neighbour-joining (or UPGMA) tree using PHYLIP. |
clustal | Multiple sequence alignment using Clustal Omega. |
muscle | Perform multiple alignment using Muscle. |
makeblastdb | Create a BLAST database. |
megablast | megablast similarity search. |
dc_megablast | Discontinuous megablast similarity search. |
blastn | blastn similarity search. |
Quick |
Quick |
Quick |
Results for a given hit of a BLAST run. |
Description of an Hsp of a BLAST run. |
Full results of a BLAST run. |
Results for a given query of a BLAST run. |
- egglib.wrappers.phyml(align, model, labels=False, rates=1, boot=0, start_tree='nj', fixed_topology=False, fixed_brlens=False, freq=None, TiTv=4.0, pinv=0.0, alpha=None, use_median=False, free_rates=False, seed=None, verbose=False)
Reconstruct maximum-likelihood phylogeny using PhyML.
PhyML is a program performing maximum-likelihood phylogeny estimation using nucleotide or amino acid sequence alignments.
Reference
Guindon S., J.-F. Dufayard, V. Lefort, M. Anisimova, W. Hordijk, and O. Gascuel. 2010. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59: 307-321.
- Parameters:
align – input sequence alignment as an Align instance.
model – substitution model to use (see list below).
labels – boolean indicating whether the group labels should be included in the names of sequences (they will be added as the following string: @lbl1,lbl2,lbl3…, that is: @ followed by all labels separated by commas).
rates – number of discrete categories of evolutionary rate. If different from 1, fits a gamma distribution of rates.
boot – number of bootstrap repetitions. Values of -1, -2 and -4 activate one of the test-based branch support evaluation methods that provide faster alternatives to bootstrap repetitions (-1: aLRT statistics, -2: Chi2-based parametric tests, -4: Shimodaira and Hasegawa-like statistics). A value of 0 provides no branch support at all.
start_tree – starting topology used by the program. Possible values are the string nj (neighbour-joining tree), the string pars (maximum-parsimony tree), and a Tree instance containing a user-provided topology. In the latter case, the names of leaves of the tree must match the names of the input alignment (without group labels), implying that names cannot be repeated.
fixed_topology – boolean indicating whether the topology provided in the Tree instance passed as start_tree argument should NOT be improved.
fixed_brlens – boolean indicating whether the branch lengths provided in the Tree instance passed as start_tree argument should NOT be improved. All branch lengths must be specified. Automatically sets fixed_topology to True.
freq – nucleotide or amino acid frequencies. Possible values are the string o (observed, frequencies measured from the data), the string m (estimated by maximum likelihood for nucleotides, or retrieved from the substitution model for amino acids), or a four-item tuple providing the relative frequencies of A, C, G and T respectively (only for nucleotides). By default, use o for nucleotides and m for amino acids.
TiTv – transition/transversion ratio. If None, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if data are not nucleotides or if the model does not support it. For the TN93 model, there must be a pair of ratios, one for purines and one for pyrimidines (in that order). However, a single value can be supplied (it will be applied to both rates).
pinv – proportion of invariable sites. If None, estimated by maximum likelihood. Otherwise, must be in the range [0, 1].
alpha – gamma shape parameter. If None, estimated by maximum likelihood. Otherwise, must be a strictly positive value. Ignored if rates is 1 or if free_rates is True.
use_median – boolean indicating whether the median (instead of the mean) should be used to report values for rate from the discretized gamma distribution. Ignored if rates is 1 or if free_rates is True.
free_rates – boolean indicating whether a mixture model should be used for substitution rate categories instead of the discretized gamma. In this case all rates and their frequencies will be estimated. Requires that rates is larger than 1.
seed – pseudo-random number generator seed. Must be a strictly positive integer, preferably large.
verbose – boolean indicating whether standard output of PhyML should be displayed.
- Returns:
A (tree, stats) tuple where tree is a Tree instance and stats is a dict containing the following statistics or estimated parameters:
lk – log-likelihood of the returned tree and model.
pars – parsimony score of the returned tree.
size – length of the returned tree.
rates – only available if model is GTR or custom, only for nucleotide sequences: relative substitution rates, as a list providing values in the following order:
    A \(\leftrightarrow\) C
    A \(\leftrightarrow\) G
    A \(\leftrightarrow\) T
    C \(\leftrightarrow\) G
    C \(\leftrightarrow\) T
    G \(\leftrightarrow\) T
alpha – gamma shape parameter (only if the number of rate categories is larger than 1 and if free_rates was False).
cats – list of (rate, proportion) tuples for each discrete rate category (only if free_rates was True, implying that the number of rates was larger than 1).
freq – list of the relative base frequencies, in the following order: A, C, G, and T (only for nucleotide sequences).
ti/tv – transition/transversion ratio (available for the models K80, HKY85, F84, and TN93). For the TN93 model, the resulting value is a pair of transition/transversion ratios, one for purines and one for pyrimidines (in that order).
pinv – proportion of invariable sites (only if the corresponding option was not set to 0).
The choice of the model defines the type of data that are expected. The available models are:
Nucleotides:
Code | Full name | Rates | Base frequencies
JC69 | Jukes and Cantor 1969 | one | equal
K80 | Kimura 1980 | two | equal
F81 | Felsenstein 1981 | one | unequal
HKY85 | Hasegawa, Kishino & Yano 1985 | two | unequal
F84 | Felsenstein 1984 | two | unequal
TN93 | Tamura and Nei 1993 | three | unequal
GTR | general time reversible | six | unequal
In addition, custom nucleotide substitution models can be specified. In that case, model must be a six-character string of numeric characters specifying which of the six (reversible) substitution rates are allowed to vary. The one-rate model is specified by the string 000000, the two-rate model (separate transition and transversion rates) is specified by 010010, and the GTR model is specified by 012345. The substitution rates are specified in the following order:
    A \(\leftrightarrow\) C
    A \(\leftrightarrow\) G
    A \(\leftrightarrow\) T
    C \(\leftrightarrow\) G
    C \(\leftrightarrow\) T
    G \(\leftrightarrow\) T
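The custom model string described above can be read as a mapping from rate classes to substitution pairs. The following is a minimal sketch of that reading; the helper name parse_custom_model is hypothetical and is not part of EggLib, which performs its own validation.

```python
# Order of the six reversible substitution rates, as documented above.
PAIRS = ["A<->C", "A<->G", "A<->T", "C<->G", "C<->T", "G<->T"]


def parse_custom_model(model):
    """Group the six substitution pairs by the digit assigned to them.

    Hypothetical helper: it only illustrates how a string such as
    010010 partitions the rates into classes.
    """
    if len(model) != 6 or any(c not in "0123456789" for c in model):
        raise ValueError("custom model must be six numeric characters")
    classes = {}
    for pair, digit in zip(PAIRS, model):
        # Pairs sharing a digit share a single substitution rate.
        classes.setdefault(digit, []).append(pair)
    return classes


if __name__ == "__main__":
    for code in ("000000", "010010", "012345"):
        print(code, parse_custom_model(code))
```

With 010010, the two transitions (A<->G and C<->T) form one class and the four transversions form the other, which matches the two-rate model.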
Amino acids:
Code | Authors
LG | Le & Gascuel (Mol. Biol. Evol. 2008)
WAG | Whelan & Goldman (Mol. Biol. Evol. 2001)
JTT | Jones, Taylor & Thornton (CABIOS 1992)
MtREV | Adachi & Hasegawa (in Computer Science Monographs 1996)
Dayhoff | Dayhoff et al. (in Atlas of Protein Sequence and Structure 1978)
DCMut | Kosiol & Goldman (Mol. Biol. Evol. 2004)
RtREV | Dimmic et al. (J. Mol. Evol. 2002)
CpREV | Adachi et al. (J. Mol. Evol. 2000)
VT | Muller & Vingron (J. Comput. Biol. 2000)
Blosum62 | Henikoff & Henikoff (PNAS 1992)
MtMam | Cao et al. (J. Mol. Evol. 1998)
MtArt | Abascal, Posada & Zardoya (Mol. Biol. Evol. 2007)
HIVw | Nickle et al. (PLoS One 2007)
HIVb | Nickle et al. (PLoS One 2007)
Changed in version 3.0.0: The model option no longer has a default value. Added custom model for nucleotides. Changed SH pseudo-bootstrap option flag from -3 to -4. The quiet option is replaced by verbose. Several additional options were added. The syntax for passing a user tree was modified. The second item in the returned tuple is now a dictionary of statistics.
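As an illustration of the (tree, stats) return value of phyml(), the sketch below formats the documented statistics for printing. The helper names describe_phyml_stats and fit_tree are hypothetical. The call to egglib.io.from_fasta and its arguments are assumptions about the rest of the EggLib API. The file name and model choice are placeholders.

```python
def describe_phyml_stats(stats):
    """Return printable lines for the statistics documented above."""
    lines = [
        "log-likelihood: {:.2f}".format(stats["lk"]),
        "parsimony score: {}".format(stats["pars"]),
        "tree length: {:.4f}".format(stats["size"]),
    ]
    # These entries are only provided for some models or options, so
    # they are read defensively (absent and None are treated alike).
    if stats.get("alpha") is not None:
        lines.append("gamma shape: {:.3f}".format(stats["alpha"]))
    if stats.get("pinv") is not None:
        lines.append("invariable sites: {:.3f}".format(stats["pinv"]))
    if stats.get("ti/tv") is not None:
        lines.append("ti/tv: {}".format(stats["ti/tv"]))
    return lines


def fit_tree(fasta_path):
    """Sketch of a full call; needs EggLib and a configured PhyML."""
    import egglib  # imported lazily so the helper above stays standalone

    # Assumption: from_fasta accepts these arguments in EggLib 3.
    aln = egglib.io.from_fasta(
        fasta_path, alphabet=egglib.alphabets.DNA, cls=egglib.Align
    )
    # HKY85 with four gamma rate categories; alpha=None (the default)
    # lets PhyML estimate the shape parameter.
    tree, stats = egglib.wrappers.phyml(aln, model="HKY85", rates=4)
    return tree, stats
```

A typical use would be to call fit_tree("alignment.fas") and print each line returned by describe_phyml_stats(stats).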
- egglib.wrappers.codeml(align, tree, model, code=1, ncat=None, codon_freq=2, verbose=False, get_files=False, kappa=2.0, fix_kappa=False, omega=0.4)
Fit nucleotide substitution models using PAML.
This function uses only the CodeML program of the PAML package.
- Parameters:
align – an Align containing a coding sequence alignment. The number of sequences must be at least 3, and the length of the alignment is required to be a multiple of 3 (unless codons are provided). There must be no stop codons (even final stop codons) and there must not be any duplicated sequence name. The alphabet may be DNA or codon.
tree – a Tree providing the phylogenetic relationships between samples. The names of the sequences in the Align and in the Tree are required to match. If tree is None, a star topology is used (usage no longer recommended and not supported by recent versions of PAML). If the tree contains branch lengths or node labels, they are ignored, except for PAML node tags (#x and $x where x is an integer), which are both allowed as node labels. If one wants to label a terminal branch of the tree, they can add the label at the end of the sample name (with an optional separating white space). The tree must not be rooted (a bifurcation at the base causes an error).
model – model name. The list of model names appears below:
    M0 – one-ratio model (1 parameter).
    free – all branches have a different ratio (1 parameter per branch).
    nW – several sets of branches. Requires labelling of branches of the tree (1 parameter per set of branches).
    M1a – nearly-neutral model (2 parameters).
    M2a – positive selection model (4 parameters).
    M3 – discrete model. Requires setting ncat (2 * ncat - 1 parameters).
    M4 – frequencies model. Requires setting ncat (ncat - 1 parameters).
    M7 – beta-distribution model. Requires setting ncat (2 parameters).
    M8a – beta + single ratio, additional ratio fixed to 1. Requires setting ncat (3 parameters).
    M8 – beta + single ratio. Requires setting ncat (4 parameters).
    A0 – null branch-site model. Requires labelling of branches of the tree with two different labels (3 parameters).
    A – branch-site model with positive selection. Requires labelling of branches of the tree with two different labels (4 parameters).
    C0 – null model for model C (M2a_rel). Does not require branch labelling (4 parameters).
    C – branch-site model. Requires labelling of branches (5 parameters).
    D – discrete branch-site model. Requires labelling of branches and requires setting ncat to either 2 or 3 (4 or 6 parameters, respectively).
The number of parameters given for each model concerns the dN/dS ratios only. Refer to PAML documentation or the following references for more details and recommendations:
    Bielawski, J.P. & Z. Yang. 2004. A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J. Mol. Evol. 59:121-132.
    Yang, Z., R. Nielsen, N. Goldman & A.M.K. Pedersen. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431-449.
    Yang, Z. & R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908-917.
    Zhang, J., R. Nielsen & Z. Yang. 2005. Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol. Biol. Evol. 22:2472-2479.
code – genetic code identifier (as listed in the EggLib documentation). Required to be an integer among the valid values. The default value is the standard genetic code. Only codes 1-11 are available.
ncat – number of dN/dS categories. Only a subset of models require the number of categories to be specified. See models.
codon_freq – an integer specifying the model for codon frequencies. Must be one of:
    0 – 1/61 each.
    1 – F1X4.
    2 – F3X4.
    3 – codon table.
    4 – F1x4MG.
    5 – F3x4MG.
    6 – FMutSel0.
    7 – FMutSel.
verbose – boolean indicating whether standard output of CodeML should be displayed.
get_files – boolean indicating whether the raw content of CodeML output files should be included in the returned data.
kappa – starting value for the transition/transversion rate ratio.
fix_kappa – boolean indicating whether the transition/transversion rate ratio should be fixed to its starting value (otherwise, it is estimated as a free parameter).
omega – starting value for the dN/dS ratio (strictly positive value).
Deprecated since version 3.3.1: The star topology option is still supported but raises a UserWarning since it can cause an error with recent versions of PAML.
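The dN/dS parameter counts given in the model list above can be tabulated as follows. This is a sketch only: the function omega_parameter_count and its n_branches and n_sets arguments are hypothetical and not part of EggLib. They stand for the number of branches and the number of labelled branch sets of the tree.

```python
# Models whose number of dN/dS parameters does not depend on ncat.
FIXED_COUNTS = {
    "M0": 1, "M1a": 2, "M2a": 4, "M7": 2, "M8a": 3, "M8": 4,
    "A0": 3, "A": 4, "C0": 4, "C": 5,
}


def omega_parameter_count(model, ncat=None, n_branches=None, n_sets=None):
    """Number of dN/dS parameters for a model, per the list above."""
    if model in FIXED_COUNTS:
        return FIXED_COUNTS[model]
    if model == "free":
        if n_branches is None:
            raise ValueError("free: the number of branches is required")
        return n_branches  # one ratio per branch
    if model == "nW":
        if n_sets is None:
            raise ValueError("nW: the number of branch sets is required")
        return n_sets  # one ratio per set of labelled branches
    if model in ("M3", "M4", "D"):
        if ncat is None:
            raise ValueError("{}: ncat is required".format(model))
        if model == "M3":
            return 2 * ncat - 1
        if model == "M4":
            return ncat - 1
        if ncat not in (2, 3):
            raise ValueError("D: ncat must be 2 or 3")
        return {2: 4, 3: 6}[ncat]
    raise ValueError("unknown model: {}".format(model))
```

For instance, M3 with ncat=3 has 5 dN/dS parameters, and D with ncat=3 has 6.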
- Returns:
A dict holding results. The keys defined in the returned dictionary are:
model – model name.
lk – log-likelihood.
np – number of parameters of the model.
kappa – fixed or estimated value of the transition/transversion rate ratio.
beta – if model is M7, M8a, or M8, a tuple with the p and q parameters of the beta distribution of neutral dN/dS ratios; otherwise, None.
K – number of dN/dS ratio categories. Equal to 0 for the free model, to the number of branch categories for the nW model, and to the number of site categories otherwise. This value is not necessarily equal to the ncat argument because the M8a and M8 models add a category, and because it has a different meaning for the nW model.
num_tags – number of branch categories detected from the imported tree (irrespective of the model that has been fitted). If the star topology has been used (tree=None), this value is 1.
omega – estimated dN/dS ratio or ratios. The structure of the value depends on the model:
    M0 model – a single value.
    free model – None (ratios are available as node labels in the tree available as tree_ratios).
    nW model – a list of dN/dS ratios for all branch categories (they are listed in the order corresponding to branch labels).
    Discrete models (M1a, M2a, M3, M4, C0, M7, M8a, and M8) – a list of K dN/dS ratios. The frequency of each dN/dS category is available in freq.
    A0 and A models – a tuple of two lists of 4 items each, containing respectively the background and foreground dN/dS ratios. The frequency of each dN/dS category is available in freq.
    C and D models – a tuple of num_tags lists (one list for each set of branches, as defined by branch labels found in the provided tree), each of them containing K dN/dS ratios. The frequency of each dN/dS category is available in freq.
freq – the frequency of dN/dS ratio categories. If defined, it is a list of K values. This entry is None for models M0, free, and nW.
length – total length of tree after estimating branch lengths with the specified model.
tree – the tree with fitted branch lengths, as a Tree instance. Branch lengths are expressed in terms of the model of codon evolution.
length_dS – total length of tree in terms of synonymous substitutions. Only available with M0, free, and nW models.
length_dN – total length of tree in terms of non-synonymous substitutions. Only available with M0, free, and nW models.
tree_dS – a Tree instance with branch lengths expressed in terms of synonymous substitutions. Only available with free and nW models.
tree_dN – a Tree instance with branch lengths expressed in terms of non-synonymous substitutions. Only available with free and nW models.
tree_ratios – a Tree instance with the dN/dS ratios included as branch labels. Only available with free and nW models.
site_w – a dict containing posterior predictions of site dN/dS ratios. Not available for models M0, free, and nW (in that case, the value is None). The dict contains the following keys:
    method – one of the strings NEB and BEB.
    aminoacid – the list of reference amino acids for all amino acid sites of the alignment (they are taken from the first sequence in the original alignment).
    proba – the list of posterior probabilities of the dN/dS categories for all amino acid sites of the alignment. For each site, a tuple of K values (K being the number of dN/dS categories) is provided.
    best – the index of the best category for each site.
    postw – list of the posterior dN/dS estimate for all sites (None if not available).
    postwsd – list of the standard deviation of the dN/dS estimate for all sites (always available if postw is available and the method is BEB, None otherwise).
    P(w>1) – probability that the dN/dS ratio is greater than 1 for all sites (None if not available).
main_output – raw content of the main CodeML output file. This key is not present if the option get_files is not set to True.
rst_output – raw content of the rst detailed CodeML output file. This key is not present if the option get_files is not set to True.
candidates – list of positively selected sites. Each site is represented by a dict with keys pos (0-based position), aa (reference amino acid), P(w>1), test (test result, either an empty string or a significance-level string), postw and stdev (standard deviation). If the block is not present in the output file, the list is replaced by None.
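The candidates entry of the returned dictionary can be filtered as sketched below. The helper strong_candidates and its threshold default of 0.95 are assumptions made for illustration, not part of EggLib. Positions are converted from the documented 0-based index to 1-based numbering for reporting.

```python
def strong_candidates(results, threshold=0.95):
    """List (position, amino acid, probability) for well-supported sites.

    results is a dictionary shaped like the one returned by codeml();
    only its candidates entry is read.
    """
    candidates = results.get("candidates")
    if candidates is None:
        # The block was absent from the CodeML output file.
        return []
    selected = []
    for site in candidates:
        if site["P(w>1)"] >= threshold:
            # pos is 0-based in the dictionary; report it 1-based.
            selected.append((site["pos"] + 1, site["aa"], site["P(w>1)"]))
    return selected
```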
Changed in version 3.0.0: Turned into a single function, interface changes (more models, more options, more results).
Changed in version 3.3.1: Raise a warning if tree is set to None (star topology).
Changed in version 3.3.4: Support alignment gaps in reference sequence and export list of positively selected sites.
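The lk and np entries returned by codeml() are the ingredients of a likelihood ratio test between nested models (for example M1a against M2a). The sketch below shows the arithmetic only; the test itself is standard practice rather than something this module provides, and the function names are hypothetical. The closed-form p-value applies only when the difference in number of parameters is 2.

```python
import math


def likelihood_ratio(null, alt):
    """Return (statistic, degrees of freedom) for two nested fits.

    null and alt are dictionaries shaped like codeml() results; only
    their lk and np entries are used.
    """
    statistic = 2.0 * (alt["lk"] - null["lk"])
    df = alt["np"] - null["np"]
    # Numerical noise can make the statistic slightly negative.
    return max(statistic, 0.0), df


def chi2_pvalue_df2(statistic):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


if __name__ == "__main__":
    # Made-up log-likelihoods, for illustration only.
    m1a = {"lk": -1000.0, "np": 10}
    m2a = {"lk": -995.0, "np": 12}
    stat, df = likelihood_ratio(m1a, m2a)
    print(stat, df, chi2_pvalue_df2(stat))
```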
- egglib.wrappers.nj(aln, model=None, kappa=None, upgma=False, outgroup=None, randomize=0, verbose=False)
Neighbour-joining (or UPGMA) tree using PHYLIP.
The programs of PHYLIP used are dnadist (or protdist) and neighbor.
- Parameters:
aln – an Align instance containing source sequences. The alphabet must be DNA or protein, matching the model argument. Note: outgroup is ignored.
model – one of the models among the list below:
    - For DNA sequences:
        JC69: Jukes & Cantor’s 1969 one-parameter model.
        K80: Kimura’s 1980 two-parameter model.
        F84: like K80 with unequal base frequencies (default).
        LD: LogDet (log-determinant of nucleotide occurrence matrix).
    - For protein sequences:
        PAM: Dayhoff PAM matrix.
        JTT: Jones-Taylor-Thornton model (default).
        PMB: probability matrix from blocks.
kappa – transition/transversion ratio (default is 2.0).
upgma – whether to use the UPGMA method rather than the neighbour-joining method.
outgroup – name of the sample to use as outgroup for printing the tree (the root is based at the parent node of this sample; default is the first sample).
randomize – whether to randomize samples.
verbose – whether to display console output of the PHYLIP programs.
- Returns:
A Tree instance containing the tree.
New in version 3.0.0.
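The relationship between the alphabet of aln and the model argument of nj() can be summarised as below. This is an illustrative sketch: resolve_nj_model is a hypothetical helper, and the strings DNA and protein stand in for the actual alphabet objects.

```python
# Models accepted for each kind of data, and the documented defaults.
NJ_MODELS = {
    "DNA": ("JC69", "K80", "F84", "LD"),
    "protein": ("PAM", "JTT", "PMB"),
}
NJ_DEFAULTS = {"DNA": "F84", "protein": "JTT"}


def resolve_nj_model(alphabet, model=None):
    """Return the model to use, applying the default when model is None."""
    if alphabet not in NJ_MODELS:
        raise ValueError("alphabet must be DNA or protein")
    if model is None:
        return NJ_DEFAULTS[alphabet]
    if model not in NJ_MODELS[alphabet]:
        raise ValueError(
            "model {} is not available for {} data".format(model, alphabet)
        )
    return model
```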
- egglib.wrappers.clustal(source, ref=None, full=False, full_iter=False, cluster_size=100, use_kimura=True, num_iter=1, threads=1, keep_order=False, verbose=False)
Multiple sequence alignment using Clustal Omega.
- Parameters:
source – a Container or Align containing the sequences to align. If a Container is provided, sequences are assumed to be unaligned, and, if an Align is provided, sequences are assumed to be aligned. The list below explains what is done based on the type of source and whether a value is provided for ref:
    If source is a Container and ref is None, the sequences in source are aligned.
    If source is an Align and ref is None, a hidden Markov model is built from the alignment, then the alignment is reset and sequences are realigned.
    If source is an Align and an alignment is provided as ref, the two alignments are preserved (their columns are left unchanged), and they are aligned with respect to each other.
    If source is a Container and an alignment is provided as ref, a hidden Markov model is built from ref, then source is aligned using it, and finally the resulting alignment is aligned with respect to ref as described for the previous case.
source must contain at least two sequences unless it is an Align and a value is provided for ref (in that case, it must contain at least one sequence).
ref – an Align instance providing an external alignment. See above for more details. ref must contain at least one sequence. Sequences must be aligned.
full – use full distance matrix to determine guide tree (the default is using the faster and less memory-intensive mBed approximation).
full_iter – use full distance matrix to determine guide tree during iterations.
cluster_size – size of clusters (as a number of sequences) used in the mBed algorithm.
use_kimura – use Kimura correction for estimating whole-alignment distance (only available if a protein alignment has been provided as source).
num_iter – number of iterations used to improve the quality of the alignment. Must be a number \(\geq 1\) or a pair of numbers \(\geq 1\). If the value is a pair of numbers, they specify the number of guide tree iterations and hidden Markov model iterations, respectively. If a single value is provided, iterations couple guide tree and hidden Markov model.
threads – number of threads for parallelization (available for parts of the program).
keep_order – return the sequences in the same order as they were loaded.
verbose – display Clustal Omega’s console output.
- Returns:
An Align instance containing aligned sequences.
Changed in version 3.0.0: Ported to Clustal Omega and added support for more options.
Changed in version 3.1.0: Support protein sequences.
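The four combinations of source type and ref described for clustal() can be written as a small decision function. The sketch below only restates the documented behaviour; clustal_plan and min_source_sequences are hypothetical names, and the booleans stand for "source is an Align" and "ref is not None".

```python
def clustal_plan(source_is_align, has_ref):
    """Describe what clustal() does for a given kind of input."""
    if not has_ref:
        if source_is_align:
            return "build an HMM from source, reset the alignment, realign"
        return "align the sequences in source"
    if source_is_align:
        return "align source and ref to each other, keeping their columns"
    return "build an HMM from ref, align source with it, then align to ref"


def min_source_sequences(source_is_align, has_ref):
    """Minimum number of sequences that source must contain."""
    # A single sequence is accepted only for an Align combined with ref.
    return 1 if (source_is_align and has_ref) else 2
```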
- egglib.wrappers.muscle(...)
Perform multiple alignment using Muscle.
Depending on the version of MUSCLE detected at configuration, the call will be forwarded to either wrappers.muscle3() or wrappers.muscle5().
Changed in version 3.2.0: Dynamically use the new muscle5() method if MUSCLE version 5 is present.
- egglib.wrappers.muscle5(source, super5=False, perm='none', perturb=0, consiters=2, refineiters=100, threads=None, verbose=False)
Perform multiple alignment using Muscle version 5.
- Parameters:
source – a Container instance containing sequences to align. Align is supported but will be treated as if it was a Container. The alphabet must be DNA or protein.
super5 – use the Super5 algorithm (recommended for datasets of more than a few hundred sequences).
perm – guide tree permutation mode. Available values are none, abc, acb, and bca. See the MUSCLE documentation for more information.
perturb – if different from 0, the value is a random number generator seed used to perform hidden Markov model perturbations.
consiters – number of consistency iterations.
refineiters – number of refinement iterations.
threads – number of threads. By default, let MUSCLE pick the value.
verbose – show MUSCLE’s output in stdout.
Note
The ensemble fasta features of muscle5 are currently not available through this wrapper.
- Returns:
An Align containing aligned sequences.
New in version 3.2.0.
- egglib.wrappers.muscle3(source, ref=None, verbose=False, **kwargs)
Perform multiple alignment using Muscle.
This wrapper is designed to run version 3 of MUSCLE. To use MUSCLE version 5, configure the application path using the egglib-config apps command.
MUSCLE’s default options tend to produce high-quality alignments but may be slow on large data sets. Muscle’s author recommends using the option maxiters=2 for large data sets, and, for fast alignment (in particular of closely related sequences): maxiters=1 diags=True aa_profile='sv' distance1='kbit20_3' (for amino acid sequences) and maxiters=1 diags=True (for nucleotide sequences).
- Parameters:
source – a Container or Align containing sequences to align. If an Align is provided, sequences are assumed to be already aligned and the alignment will be refined (using the -refine option of Muscle), unless an alignment is also provided as ref. In the latter case, the two alignments are preserved (their columns are left unchanged), and they are aligned with respect to each other.
ref – an Align instance providing an alignment that should be aligned with respect to the alignment provided as source. If ref is provided, both source and ref are required to be Align instances.
verbose – display Muscle’s console output.
kwargs – other keyword arguments are passed to Muscle. The available options are listed below:
option | value
anchors | a boolean
brenner | a boolean
cluster | a boolean
diags | a boolean
diags1 | a boolean
diags2 | a boolean
dimer | a boolean
teamgaps4 | a boolean
SUEFF | a float
aa_profile | one of: le, sp, sv
anchorspacing | an integer
center | a float
cluster1 | one of: upgma, upgmb, neighborjoining
diagbreak | an integer
diaglength | an integer
diagmargin | an integer
distance1 | one of: kmer6_6, kmer20_3, kmer20_4, kbit20_3, kmer4_6
distance2 | one of: pctidkimura, pctidlog
gapopen | a float
hydro | an integer
hydrofactor | a float
maxiters | an integer
maxtrees | an integer
minbestcolscore | a float
minsmoothscore | a float
nt_profile | one of: spn
objscore | one of: sp, ps, dp, xp, spf, spm
refinewindow | an integer
root1 | one of: pseudo, midlongestspan, minavgleafdist
seqtype | one of: protein, dna, auto
smoothscoreceil | a float
weight1 | one of: none, henikoff, henikoffpb, gsc, clustalw, threeway
weight2 | one of: none, henikoff, henikoffpb, gsc, clustalw, threeway
For a description of options, see the Muscle manual. Most of Muscle’s options are available. Note that the function takes no flag options: Muscle’s flag options are passed as boolean keyword arguments (except the amino acid and nucleotide profile score options, which are passed as strings as aa_profile and nt_profile, respectively). The order of options is preserved.
- Returns:
An Align containing aligned sequences.
Changed in version 3.0.0: Added support for most options.
Changed in version 3.2.0: Renamed as muscle3(). Available as muscle() if MUSCLE version 3 is available.
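The fast-alignment settings recommended in the description of muscle3() can be collected into keyword arguments, as in the sketch below. The helper muscle3_fast_options is hypothetical; the option names and values are the ones quoted above.

```python
def muscle3_fast_options(protein):
    """Keyword arguments for a fast muscle3() run.

    protein tells whether the sequences are amino acids (True) or
    nucleotides (False).
    """
    options = {"maxiters": 1, "diags": True}
    if protein:
        # Extra settings recommended for amino acid sequences only.
        options["aa_profile"] = "sv"
        options["distance1"] = "kbit20_3"
    return options


if __name__ == "__main__":
    print(muscle3_fast_options(protein=True))
    print(muscle3_fast_options(protein=False))
```

The resulting dictionary would be passed as egglib.wrappers.muscle3(source, **muscle3_fast_options(protein=True)), where source is a Container of protein sequences.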
- egglib.wrappers.makeblastdb(source, dbtype=None, out=None, input_type='fasta', verbose=False, title=None, parse_seqids=False, hash_index=False, mask_data=None, mask_id=None, mask_desc=None, blastdb_version=5, max_file_sz='1GB', taxid=None, taxid_map=None)
Create a BLAST database.
- Parameters:
source – name of an input file of the appropriate format. If not fasta, the format must be specified using the input_type option. Alternatively, source can be a Container or Align instance. If so, its alphabet must be DNA or protein and the dbtype argument, if specified, must match. Note that passing a Container or Align instance should be avoided for large databases.
dbtype – database type: "nucl" or "prot" are acceptable. Can be omitted if a Container or an Align is provided as source.
out – database name. Must be specified if a Container or an Align is provided as source, or if the input_type is "blastdb"; otherwise the input file name is used as database name.
input_type – format of input file. Must be "fasta" if a Container or an Align is provided as source. Otherwise must describe the format of source: "fasta", "asn1_bin", "asn1_txt", or "blastdb".
verbose – display makeblastdb output (by default, it is returned by the function). Errors are always displayed.
title – database title. A default title is inserted in case a Container or Align instance is passed as source.
parse_seqids – parse seqid from sequence names (considered if input_type is fasta, including if a Container or an Align is provided as source; argument ignored otherwise: seqid is always imported).
hash_index – create index of sequence hash values.
mask_data – list of input files containing masking data.
mask_id – list of strings to uniquely identify the masking algorithm, one for each mask file (requires mask_data).
mask_desc – list of free form strings to describe the masking algorithm details, one for each mask file (requires mask_id).
blastdb_version – version of BLAST database to be created (4 or 5).
max_file_sz – maximum file size for BLAST database files.
taxid – taxonomy ID to assign to all sequences as an integer (incompatible with taxid_map).
taxid_map – text file mapping sequence IDs to taxonomy IDs (requires parse_seqids, incompatible with taxid).
- Returns:
Standard output of the program (None if verbose was True).
Please refer to the manual of BLAST tools for more details.
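Several makeblastdb() options depend on each other (mask_id on mask_data, mask_desc on mask_id, taxid_map on parse_seqids) or exclude each other (taxid and taxid_map). The sketch below restates these documented constraints as a pre-flight check. check_makeblastdb_options is a hypothetical helper; how the wrapper itself reports such conflicts is not described here.

```python
def check_makeblastdb_options(parse_seqids=False, mask_data=None,
                              mask_id=None, mask_desc=None,
                              blastdb_version=5, taxid=None, taxid_map=None):
    """Return a list of problems found in a set of makeblastdb options."""
    problems = []
    if mask_id is not None and mask_data is None:
        problems.append("mask_id requires mask_data")
    if mask_desc is not None and mask_id is None:
        problems.append("mask_desc requires mask_id")
    if (mask_data is not None and mask_id is not None
            and len(mask_id) != len(mask_data)):
        problems.append("mask_id needs one entry per mask file")
    if taxid is not None and taxid_map is not None:
        problems.append("taxid and taxid_map are incompatible")
    if taxid_map is not None and not parse_seqids:
        problems.append("taxid_map requires parse_seqids")
    if blastdb_version not in (4, 5):
        problems.append("blastdb_version must be 4 or 5")
    return problems
```

An empty list means that no documented constraint is violated.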
- egglib.wrappers.megablast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=28, gapopen=5, gapextend=2, reward=1, penalty=-2, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0)
megablast similarity search. This is designed for strongly similar sequences using a nucleotide query on a nucleotide database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open gaps.
gapextend – cost to extend gaps.
reward – reward for nucleotide match.
penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:
    (1,-2) -- *(5,2) (2,2) (1,2) (0,2) (3,1) (2,1) (1,1)
    (1,-3) -- *(5,2) (2,2) (1,2) (0,2) (2,1) (1,1)
    (1,-4) -- *(5,2) (1,2) (0,2) (2,1) (1,1)
    (2,-3) -- (4,4) (2,4) (0,4) (3,3) (6,2) *(5,2) (4,2) (2,2)
    (4,-5) -- *(12,8) (6,5) (5,5) (4,5) (3,5)
    (1,-1) -- *(5,2) (3,2) (2,2) (1,2) (0,2) (4,1) (3,1) (2,1)
Defaults:
    megablast: (1,-2)
    dc-megablast: (2,-3)
    blastn: (2,-3)
    blastn-short: (1,-3)
strand – query strand to use: "both", "minus", or "plus".
no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.
no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.
lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).
perc_identity – percent identity cutoff.
- Returns:
A BlastOutput instance.
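The table of reward/penalty and gapopen/gapextend combinations given for megablast() lends itself to a lookup. The sketch below transcribes that table into a dictionary and checks a requested combination against it. The names GAP_COSTS and gap_costs_supported are hypothetical, and the check is illustrative rather than what the wrapper or BLAST actually runs.

```python
# Allowed (gapopen, gapextend) pairs for each (reward, penalty) pair,
# transcribed from the table above.
GAP_COSTS = {
    (1, -2): [(5, 2), (2, 2), (1, 2), (0, 2), (3, 1), (2, 1), (1, 1)],
    (1, -3): [(5, 2), (2, 2), (1, 2), (0, 2), (2, 1), (1, 1)],
    (1, -4): [(5, 2), (1, 2), (0, 2), (2, 1), (1, 1)],
    (2, -3): [(4, 4), (2, 4), (0, 4), (3, 3), (6, 2), (5, 2), (4, 2), (2, 2)],
    (4, -5): [(12, 8), (6, 5), (5, 5), (4, 5), (3, 5)],
    (1, -1): [(5, 2), (3, 2), (2, 2), (1, 2), (0, 2), (4, 1), (3, 1), (2, 1)],
}


def gap_costs_supported(reward, penalty, gapopen, gapextend):
    """Tell whether a gap cost pair is listed for a reward/penalty pair."""
    allowed = GAP_COSTS.get((reward, penalty))
    if allowed is None:
        raise ValueError(
            "unsupported reward/penalty: ({}, {})".format(reward, penalty)
        )
    return (gapopen, gapextend) in allowed


if __name__ == "__main__":
    # Signature defaults of megablast(): reward=1, penalty=-2, gaps (5, 2).
    print(gap_costs_supported(1, -2, 5, 2))
```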
- egglib.wrappers.dc_megablast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=11, gapopen=None, gapextend=None, reward=2, penalty=-3, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0, template_type='coding', template_length=18)[source]¶
Discontiguous megablast similarity search. This is designed for similar sequences (less similar than for megablast()) using a nucleotide query on a nucleotide database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open gaps.
gapextend – cost to extend gaps.
reward – reward for nucleotide match.
penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:
(1,-2) -- *(5,2) (2,2) (1,2) (0,2) (3,1) (2,1) (1,1)
(1,-3) -- *(5,2) (2,2) (1,2) (0,2) (2,1) (1,1)
(1,-4) -- *(5,2) (1,2) (0,2) (2,1) (1,1)
(2,-3) -- (4,4) (2,4) (0,4) (3,3) (6,2) *(5,2) (4,2) (2,2)
(4,-5) -- *(12,8) (6,5) (5,5) (4,5) (3,5)
(1,-1) -- *(5,2) (3,2) (2,2) (1,2) (0,2) (4,1) (3,1) (2,1)
Defaults:
megablast: (1,-2)
dc-megablast: (2,-3)
blastn: (2,-3)
blastn-short: (1,-3)
strand – query strand to use: "both", "minus", or "plus".
no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.
no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.
lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).
perc_identity – percent identity cutoff.
template_type – template type for dc-megablast. Possible values are "coding" (default), "optimal", and "coding_and_optimal".
template_length – template length for dc-megablast. Possible values are 16, 18 (the default), and 21.
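The query_loc and subject_loc arguments are Python-style ranges whose stop position is excluded, whereas the BLAST+ command-line options -query_loc and -subject_loc expect a 1-based start-stop string. The following is a minimal sketch of that conversion. The helper name to_blast_range is hypothetical, and the exact conversion performed inside EggLib is an assumption; it is not taken from the EggLib source.

```python
def to_blast_range(loc):
    """Convert a half-open, 0-based (start, stop) tuple to a BLAST+ range string.

    Hypothetical helper for illustration: BLAST+ documents the
    -query_loc/-subject_loc format as 1-based "start-stop"; that EggLib
    applies exactly this shift is an assumption.
    """
    start, stop = loc
    if start < 0 or stop <= start:
        raise ValueError(f"invalid range: {loc!r}")
    # Shift the start by one; the excluded 0-based stop equals the
    # included 1-based stop, so it is passed unchanged.
    return f"{start + 1}-{stop}"


first_hundred = to_blast_range((0, 100))
```

With this convention, (0, 100) designates the first hundred positions of the sequence, the same positions selected by seq[0:100].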
- egglib.wrappers.blastn(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=11, gapopen=None, gapextend=None, reward=2, penalty=-3, strand='both', no_dust=False, no_soft_masking=False, lcase_masking=False, perc_identity=0)[source]¶
blastn similarity search. This is designed for distant sequences using a nucleotide query on a nucleotide database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open gaps.
gapextend – cost to extend gaps.
reward – reward for nucleotide match.
penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:
(1,-2) -- *(5,2) (2,2) (1,2) (0,2) (3,1) (2,1) (1,1)
(1,-3) -- *(5,2) (2,2) (1,2) (0,2) (2,1) (1,1)
(1,-4) -- *(5,2) (1,2) (0,2) (2,1) (1,1)
(2,-3) -- (4,4) (2,4) (0,4) (3,3) (6,2) *(5,2) (4,2) (2,2)
(4,-5) -- *(12,8) (6,5) (5,5) (4,5) (3,5)
(1,-1) -- *(5,2) (3,2) (2,2) (1,2) (0,2) (4,1) (3,1) (2,1)
Defaults:
megablast: (1,-2)
dc-megablast: (2,-3)
blastn: (2,-3)
blastn-short: (1,-3)
strand – query strand to use: "both", "minus", or "plus".
no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.
no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.
lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).
perc_identity – percent identity cutoff.
- Returns:
A BlastOutput instance.
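Because only some gapopen/gapextend pairs are accepted for a given reward/penalty pair, it can be convenient to check a combination before launching a search. The sketch below transcribes the table from the penalty parameter description into a dictionary. GAP_COSTS and check_scoring are hypothetical names used for illustration only and are not part of EggLib; returning the asterisked pair when no gap costs are given is also a simplification, since for megablast the default is left to the blastn program.

```python
# (reward, penalty) -> (default gap pair, all allowed gap pairs),
# transcribed from the table in the `penalty` description.
GAP_COSTS = {
    (1, -2): ((5, 2), {(5, 2), (2, 2), (1, 2), (0, 2), (3, 1), (2, 1), (1, 1)}),
    (1, -3): ((5, 2), {(5, 2), (2, 2), (1, 2), (0, 2), (2, 1), (1, 1)}),
    (1, -4): ((5, 2), {(5, 2), (1, 2), (0, 2), (2, 1), (1, 1)}),
    (2, -3): ((5, 2), {(4, 4), (2, 4), (0, 4), (3, 3), (6, 2), (5, 2), (4, 2), (2, 2)}),
    (4, -5): ((12, 8), {(12, 8), (6, 5), (5, 5), (4, 5), (3, 5)}),
    (1, -1): ((5, 2), {(5, 2), (3, 2), (2, 2), (1, 2), (0, 2), (4, 1), (3, 1), (2, 1)}),
}


def check_scoring(reward, penalty, gapopen=None, gapextend=None):
    """Return the (gapopen, gapextend) pair to use, or raise ValueError."""
    if (reward, penalty) not in GAP_COSTS:
        raise ValueError(f"unsupported reward/penalty pair: ({reward},{penalty})")
    default, allowed = GAP_COSTS[(reward, penalty)]
    if gapopen is None and gapextend is None:
        return default  # the pair marked with an asterisk in the table
    if (gapopen, gapextend) not in allowed:
        raise ValueError(
            f"gap costs ({gapopen},{gapextend}) are not available "
            f"for reward/penalty ({reward},{penalty})"
        )
    return (gapopen, gapextend)


blastn_default_gaps = check_scoring(2, -3)
```

For example, check_scoring(1, -2, 3, 1) is accepted, while check_scoring(2, -3, 3, 1) raises a ValueError because (3,1) is not listed for (2,-3).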
- egglib.wrappers.blastn_short(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=1000, parse_deflines=False, num_threads=1, word_size=7, gapopen=None, gapextend=None, reward=1, penalty=-3, strand='both', no_dust=True, no_soft_masking=True, lcase_masking=False, perc_identity=0)[source]¶
blastn for short sequences. This is optimised for query sequences up to 50 bp long. It automatically sets evalue=1000, word_size=7, no_dust=True, no_soft_masking=True, reward=1, penalty=-3, gapopen=5 and gapextend=2.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open gaps.
gapextend – cost to extend gaps.
reward – reward for nucleotide match.
penalty – penalty for nucleotide mismatch. The available reward/penalty combinations are listed below, with the gapopen/gapextend combinations available for each reward/penalty combination. The default gapopen/gapextend combination is indicated by an asterisk. For megablast, except for (1,-1), the default for gapopen/gapextend is left up to the blastn program:
(1,-2) -- *(5,2) (2,2) (1,2) (0,2) (3,1) (2,1) (1,1)
(1,-3) -- *(5,2) (2,2) (1,2) (0,2) (2,1) (1,1)
(1,-4) -- *(5,2) (1,2) (0,2) (2,1) (1,1)
(2,-3) -- (4,4) (2,4) (0,4) (3,3) (6,2) *(5,2) (4,2) (2,2)
(4,-5) -- *(12,8) (6,5) (5,5) (4,5) (3,5)
(1,-1) -- *(5,2) (3,2) (2,2) (1,2) (0,2) (4,1) (3,1) (2,1)
Defaults:
megablast: (1,-2)
dc-megablast: (2,-3)
blastn: (2,-3)
blastn-short: (1,-3)
strand – query strand to use: "both", "minus", or "plus".
no_dust – prevent DUST filtering (by default, use BLAST’s default filtering mode). False by default, except for blastn-short.
no_soft_masking – do not apply filtering locations as soft masks (i.e., only for finding initial matches). False by default, except for blastn-short.
lcase_masking – use lower case filtering in query and subject sequences (not supported when EggLib objects are used because the DNA alphabet is case-insensitive and all bases will be passed to the blast program as upper case).
perc_identity – percent identity cutoff.
- Returns:
A BlastOutput instance.
- egglib.wrappers.blastp(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=11, comp_based_stats=2, seg=0, soft_masking=False, lcase_masking=False, window_size=40, use_sw_tback=False)[source]¶
blastp similarity search. Designed for using a protein query on a protein database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be protein.
db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
parse_deflines – parse query and subject bar-delimited sequence identifiers.
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
soft_masking – apply filtering locations as soft masks.
lcase_masking – use lower case filtering in query and subject sequences.
use_sw_tback – compute locally optimal Smith-Waterman alignments.
- Returns:
A BlastOutput instance.
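The seg argument accepts either an integer switch or a (window, locut, hicut) tuple, while the BLAST+ command-line option -seg takes the strings 'no', 'yes', or 'window locut hicut'. The sketch below shows one way such a value could be normalised into a command-line string. The function name seg_argument is hypothetical, and the mapping actually used by EggLib is an assumption.

```python
def seg_argument(seg):
    """Turn a `seg` value (0, 1, or a 3-tuple) into a BLAST+ -seg string.

    Hypothetical helper for illustration; not part of EggLib.
    """
    if isinstance(seg, tuple):
        if len(seg) != 3:
            raise ValueError("seg tuple must be (window, locut, hicut)")
        # Space-separated window, locut and hicut values.
        return " ".join(str(value) for value in seg)
    if seg == 0:
        return "no"   # SEG filtering disabled
    if seg == 1:
        return "yes"  # SEG filtering with BLAST's own default parameters
    raise ValueError(f"invalid seg value: {seg!r}")


blastx_default = seg_argument((12, 2.2, 2.5))
```

The tuple (12, 2.2, 2.5) used here is the default value of seg in the blastx() and tblastn() signatures.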
- egglib.wrappers.blastp_short(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='PAM30', threshold=16, comp_based_stats=0, seg=0, lcase_masking=False, window_size=15, use_sw_tback=False)[source]¶
blastp similarity search for short sequences.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be protein.
db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
parse_deflines – parse query and subject bar-delimited sequence identifiers.
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
lcase_masking – use lower case filtering in query and subject sequences.
use_sw_tback – compute locally optimal Smith-Waterman alignments.
- Returns:
A BlastOutput instance.
- egglib.wrappers.blastp_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, parse_deflines=False, num_threads=1, word_size=None, threshold=21, comp_based_stats=2, seg=0, lcase_masking=False, window_size=40, use_sw_tback=False)[source]¶
Quick blastp similarity search.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be protein.
db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
parse_deflines – parse query and subject bar-delimited sequence identifiers.
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
lcase_masking – use lower case filtering in query and subject sequences.
use_sw_tback – compute locally optimal Smith-Waterman alignments.
- Returns:
A BlastOutput instance.
- egglib.wrappers.blastx(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=12, seg=(12, 2.2, 2.5), soft_masking=False, lcase_masking=False, window_size=40, strand='both', query_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]¶
blastx similarity search. Designed for using a translated nucleotide query on a protein database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
soft_masking – apply filtering locations as soft masks.
lcase_masking – use lower case filtering in query and subject sequences.
query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.
max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
- Returns:
A BlastOutput instance.
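The allowed genetic code values (1-6, 9-16, 21-25) form three separate ranges, which is easy to get wrong when validating user input. A minimal sketch of such a check follows. The names ALLOWED_GENETIC_CODES and check_genetic_code are hypothetical and are not part of EggLib.

```python
# Genetic code identifiers listed for query_genetic_code / db_genetic_code:
# 1-6, 9-16 and 21-25 (range() excludes its upper bound).
ALLOWED_GENETIC_CODES = frozenset(
    list(range(1, 7)) + list(range(9, 17)) + list(range(21, 26))
)


def check_genetic_code(code):
    """Return `code` unchanged if it is an allowed value, else raise ValueError."""
    if code not in ALLOWED_GENETIC_CODES:
        raise ValueError(f"unsupported genetic code: {code!r}")
    return code


standard = check_genetic_code(1)
```

Running the check before calling a translated search gives an immediate Python error rather than a failure reported by the external program.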
- egglib.wrappers.blastx_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=21, seg=(12, 2.2, 2.5), soft_masking=False, lcase_masking=False, window_size=40, strand='both', query_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]¶
Quick blastx similarity search. Designed for using a translated nucleotide query on a protein database and optimised for faster execution.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a protein database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
soft_masking – apply filtering locations as soft masks.
lcase_masking – use lower case filtering in query and subject sequences.
query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.
max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
- Returns:
A BlastOutput instance.
- egglib.wrappers.tblastn(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=13, seg=(12, 2.2, 2.5), soft_masking=False, window_size=40, db_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]¶
tblastn similarity search. Designed for using a protein query on a translated nucleotide database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be protein.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
soft_masking – apply filtering locations as soft masks.
db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.
max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
- Returns:
A BlastOutput instance.
- egglib.wrappers.tblastn_fast(query, db=None, subject=None, query_loc=None, subject_loc=None, evalue=10, num_threads=1, word_size=None, gapopen=None, gapextend=None, matrix='BLOSUM62', threshold=21, seg=(12, 2.2, 2.5), soft_masking=False, window_size=40, db_genetic_code=1, max_intron_length=0, comp_based_stats=2)[source]¶
Quick tblastn similarity search. Designed for using a protein query on a translated nucleotide database and optimised for fast execution.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be protein.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
gapopen – cost to open a gap. None: use default.
gapextend – cost to extend a gap. None: use default.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
threshold – minimum word score such that the word is added to the BLAST lookup table (>0).
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
soft_masking – apply filtering locations as soft masks.
db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.
max_intron_length – length of the largest intron allowed in a translated nucleotide sequence when linking multiple distinct alignments (a negative value disables linking).
comp_based_stats – composition-based statistics, as an integer code: 0 (no composition-based statistics), 1 (composition-based statistics as in NAR 29:2994-3005, 2001), 2 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, conditioned on sequence properties), or 3 (composition-based score adjustment as in Bioinformatics 21:902-911, 2005, unconditionally).
- Returns:
A BlastOutput instance.
- egglib.wrappers.tblastx(*args, **kwargs)¶
tblastx similarity search. Designed for using a translated nucleotide query on a translated nucleotide database.
- Parameters:
query – input sequence, as a str, SequenceView, SampleView, Container or Align object. If an EggLib object, the alphabet must be DNA.
db – name of a nucleotide database (such as one created with makeblastdb()). Incompatible with subject.
subject – can be used alternatively to db. Subject sequence to search, as a str or a SequenceView object.
query_loc – location on the query sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software. Not supported if query is a Container.
subject_loc – location on the target sequence, as a (start, stop) tuple. The stop position is not included in the range passed to the software.
evalue – expect value (E) for saving hits.
num_threads – number of CPUs to use in blast search. Can be different from 1 only if subject is not used.
word_size – length of initial exact match.
matrix – scoring matrix name. Available values are: PAM-30, PAM-70, BLOSUM-80, and BLOSUM-62.
seg – filter query sequence with SEG as an integer (0 to disable, 1 to enable, or alternatively a tuple with the three parameters window, locut, and hicut).
window_size – multiple hits window size (use 0 to specify 1-hit algorithm).
soft_masking – apply filtering locations as soft masks.
lcase_masking – use lower case filtering in query and subject sequences.
query_genetic_code – genetic code to translate query. Allowed values are: 1-6, 9-16, 21-25.
db_genetic_code – genetic code to translate subject sequences. Allowed values are: 1-6, 9-16, 21-25.
strand – query strand(s) to search against database/subject. Choice of both, minus, or plus.
- Returns:
A BlastOutput instance.
- class egglib.wrappers.BlastOutput[source]¶
Full results of a BLAST run.
Attributes
db – Name of the database used.
num_hits – Total number of hits for all queries.
num_hsp – Total number of Hsp's for all hits of all entries.
num_queries – Number of queries used in the BLAST search.
params – Search parameters.
program – Name of the program used.
query_ID – Identifier of the query.
query_def – Description of the query.
query_len – Length of the query.
reference – Bibliographic reference.
version – Version of the program.
Methods
get_query(i) – Hits for a given query.
iter_hits() – Iterator over all hits of all queries.
iter_hsp() – Iterator over all Hsp's of all hits of all queries.
iter_queries() – Iterator over queries.
- property db¶
Name of the database used.
- get_query(i)[source]¶
Hits for a given query. An instance of BlastQueryHits is returned. blast_output.get_query(i) is also available as blast_output[i].
- iter_queries()[source]¶
Iterator over queries. Allows iterating over BlastQueryHits instances for all queries. for query_hit in blast_output.iter_queries() is also available as for query_hit in blast_output.
- property num_hits¶
Total number of hits for all queries.
- property num_hsp¶
Total number of Hsp’s for all hits of all entries.
- property num_queries¶
Number of queries used in the BLAST search.
blast_output.num_queries is also available as len(blast_output).
- property params¶
Search parameters:
"expect": E-value,
"reward": nucleotide match reward,
"penalty": nucleotide mismatch penalty,
"gapopen": cost for opening a gap,
"gapextend": cost for extending a gap,
"filter": filter string.
- property program¶
Name of the program used.
- property query_ID¶
Identifier of the query.
- property query_def¶
Description of the query.
- property query_len¶
Length of the query.
- property reference¶
Bibliographic reference.
- property version¶
Version of the program.
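The result classes nest as BlastOutput, BlastQueryHits, BlastHit and BlastHsp, and each level can be iterated directly. The sketch below walks that hierarchy to collect the Hsp's below an E-value cutoff, using only the iteration protocol and the query_ID, id and evalue properties documented in this section. So that it runs without a BLAST installation, it uses a hypothetical Stub class and made-up values in place of real EggLib objects; with a real BlastOutput the function would be called in the same way.

```python
class Stub(list):
    """Hypothetical stand-in for EggLib result objects: a list with attributes."""

    def __init__(self, items=(), **attrs):
        super().__init__(items)
        self.__dict__.update(attrs)


def significant_hsps(blast_output, max_evalue=1e-5):
    """Collect (query ID, subject ID, E-value) for Hsp's passing the cutoff."""
    rows = []
    for query_hits in blast_output:      # one BlastQueryHits per query
        for hit in query_hits:           # one BlastHit per subject sequence
            for hsp in hit:              # one BlastHsp per local alignment
                if hsp.evalue <= max_evalue:
                    rows.append((query_hits.query_ID, hit.id, hsp.evalue))
    return rows


# Made-up data mimicking the nesting of a real result.
strong = Stub(evalue=1e-30)
weak = Stub(evalue=0.5)
hit = Stub([strong, weak], id="subject1")
query = Stub([hit], query_ID="query1")
output = Stub([query])

rows = significant_hsps(output)
```

The same traversal could presumably be written with the iter_hsp() method of BlastOutput, but the nested loops keep the query and subject identifiers at hand for each Hsp.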
- class egglib.wrappers.BlastQueryHits[source]¶
Results for a given query of a BLAST run.
Attributes
H – Karlin-Altschul entropy parameter.
K – Karlin-Altschul kappa parameter.
L – Karlin-Altschul lambda parameter.
db_len – Number of letters in the database.
db_num – Number of sequences in the database.
eff_space – Effective space of the search.
hsp_len – Length adjustment.
num – Index of the query in the BLAST run.
num_hits – Number of hits for this query.
num_hsp – Total number of Hsp's for all hits.
query_ID – Identifier of the query.
query_def – Description of the query.
query_len – Length of the query.
Methods
get_hit(i) – Get a given hit, as a BlastHit instance.
iter_hits() – Iterator to the BlastHit instances of all hits.
iter_hsp() – Iterator over all Hsp's of all hits, as BlastHsp instances.
- property H¶
Karlin-Altschul entropy parameter.
- property K¶
Karlin-Altschul kappa parameter.
- property L¶
Karlin-Altschul lambda parameter.
- property db_len¶
Number of letters in the database.
- property db_num¶
Number of sequences in the database.
- property eff_space¶
Effective space of the search.
- get_hit(i)[source]¶
Get a given hit, as a BlastHit instance. query_hits.get_hit(i) is also available as query_hits[i].
- property hsp_len¶
Length adjustment.
- iter_hits()[source]¶
Iterator to the BlastHit instances of all hits. for hit in query_hits.iter_hits() is also available as for hit in query_hits.
- property num¶
Index of the query in the BLAST run.
- property num_hits¶
Number of hits for this query.
query_hits.num_hits is also available as len(query_hits).
- property num_hsp¶
Total number of Hsp’s for all hits.
- property query_ID¶
Identifier of the query.
- property query_def¶
Description of the query.
- property query_len¶
Length of the query.
- class egglib.wrappers.BlastHit[source]¶
Results for a given hit of a BLAST run.
Attributes
accession – Identifier of the subject.
descr – Description of the subject.
id – Identifier of the subject.
len – Length of subject.
num – Index of the hit for the corresponding query.
num_hsp – Number of Hsp's in this hit.
Methods
get_hsp(i) – Get a given Hsp, as a BlastHsp instance.
iter_hsp() – Iterator to the BlastHsp instances for all Hsp's.
- property accession¶
Identifier of the subject.
- property descr¶
Description of the subject.
- get_hsp(i)[source]¶
Get a given Hsp, as a BlastHsp instance. hit.get_hsp(i) is also available as hit[i].
- property id¶
Identifier of the subject.
- iter_hsp()[source]¶
Iterator to the BlastHsp instances for all Hsp’s. for hsp in hit.iter_hsp() is also available as for hsp in hit.
- property len¶
Length of subject.
- property num¶
Index of the hit for the corresponding query.
- property num_hsp¶
Number of Hsp’s in this hit.
hit.num_hsp is also available as len(hit).
- class egglib.wrappers.BlastHsp[source]¶
Description of an Hsp of a BLAST run.
Start and stop positions are always interpreted as range parameters (use frame to determine if the complement should be used):
>>> hit_sequence = seq[query_start:query_stop]
Attributes
align_len – Length of the alignment.
bit_score – Bit score of the Hsp.
evalue – Expectation value of the Hsp.
gaps – Number of gap positions.
hit_frame – Frame of the hit.
hit_start – Start position on the subject.
hit_stop – Stop position on the subject.
hseq – Aligned subject sequence.
identity – Number of identical positions.
midline – Alignment midline.
num – Index of the Hsp in the corresponding hit.
positive – Number of positions with positive score.
qseq – Aligned query sequence.
query_frame – Frame of the query.
query_start – Start position on the query.
query_stop – Stop position on the query.
- property align_len¶
Length of the alignment.
- property bit_score¶
Bit score of the Hsp.
- property evalue¶
Expectation value of the Hsp.
- property gaps¶
Number of gap positions.
- property hit_frame¶
Frame of the hit.
- property hit_start¶
Start position on the subject.
- property hit_stop¶
Stop position on the subject.
- property hseq¶
Aligned subject sequence.
- property identity¶
Number of identical positions.
- property midline¶
Alignment midline.
- property num¶
Index of the Hsp in the corresponding hit.
- property positive¶
Number of positions with positive score.
- property qseq¶
Aligned query sequence.
- property query_frame¶
Frame of the query.
- property query_start¶
Start position on the query.
- property query_stop¶
Stop position on the query.
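Since start and stop positions are range parameters and the frame tells whether the complement applies, extracting the matched part of a nucleotide query can be sketched as below. Two points are assumptions rather than documented behaviour: that a negative query_frame denotes the minus strand, and that the matched segment is then the reverse complement of the slice. The hsp object is a hypothetical stand-in built with types.SimpleNamespace, carrying only the three properties used here.

```python
from types import SimpleNamespace

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def hsp_query_segment(seq, hsp):
    """Return the part of `seq` covered by `hsp` on the query.

    Assumption: a negative query_frame means the match is on the minus
    strand, in which case the slice is reverse-complemented.
    """
    segment = seq[hsp.query_start:hsp.query_stop]
    if hsp.query_frame < 0:
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment


seq = "AAACCCGGGTTT"
plus_hsp = SimpleNamespace(query_start=1, query_stop=5, query_frame=1)
minus_hsp = SimpleNamespace(query_start=1, query_stop=5, query_frame=-1)

plus_segment = hsp_query_segment(seq, plus_hsp)
minus_segment = hsp_query_segment(seq, minus_hsp)
```

The same logic would apply to the subject using hit_start, hit_stop and hit_frame.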
Configuring paths¶
Application paths can be set using the following syntax. A ValueError is raised if the automatic test fails. The change is valid for the current session only unless save() is used:
egglib.wrappers.paths[app] = path
Application paths are accessed as follows:
egglib.wrappers.paths[app]
- egglib.wrappers.paths.autodetect(verbose=False)¶
Auto-configure application paths based on default command names.
- Parameters:
verbose – if True, print progress information.
The function returns a (npassed, nfailed, failed_info) tuple with:
npassed – the number of applications which passed.
nfailed – the number of applications which failed.
failed_info – a dict containing, for each failing application, the command which was used and the error message.
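The returned tuple can be turned into a short report, as in the sketch below. The layout of failed_info is an assumption: each value is taken to be a (command, message) pair keyed by application name, which the text above does not specify. The function name format_report is hypothetical, and the sample result is made up rather than produced by a real call to autodetect().

```python
def format_report(result):
    """Format an autodetect-style (npassed, nfailed, failed_info) tuple.

    Assumption: failed_info maps each application name to a
    (command, error message) pair.
    """
    npassed, nfailed, failed_info = result
    lines = [f"{npassed} passed, {nfailed} failed"]
    for app, (command, message) in sorted(failed_info.items()):
        lines.append(f"  {app}: command {command!r} failed: {message}")
    return "\n".join(lines)


# Made-up result, shaped like the documented return value.
sample = (3, 1, {"phyml": ("phyml", "command not found")})
report = format_report(sample)
```

Applications listed as failed can then be configured individually with the egglib.wrappers.paths[app] = path syntax shown earlier in this section.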
- egglib.wrappers.paths.load()¶
Load values of application paths from the configuration file located within the package. All values currently set are discarded.
- egglib.wrappers.paths.save()¶
Save current values of application paths in the configuration file located within the package. This action may require administrator rights. All values currently set will be reloaded at next import of the package.