Import/export utilities¶
|
Export data in the |
|
Import sequences from a fasta file. |
|
Import sequences from a fasta-formatted string. |
|
Iterative sequence-by-sequence fasta parser. |
|
Import Genepop-formatted genotypic data. |
|
Import GFF3 genome annotation data. |
|
Index a BCF file. |
|
VCF/BCF parser using htslib. |
|
Class holding BED (Browser Extensible Data) data. |
|
Import a clustal-formatted alignment. |
|
Import data from the Staden assembly package. |
|
Import Genalys data. |
|
Import fgenesh data. |
|
Process a GenBank-formatted DNA sequence record. |
In case the VCF
class is not available, the following
alternative tools are maintained:
|
Import Variant Call Format data from a file. |
|
Import Variant Call Format data from a string. |
Below are components that do not need to be manipulated directly but are referred to by the documentation of the above items.
Provide access to data of a feature. |
|
|
Feature of a GenBank record. |
|
Hold the location of a GenBank feature. |
Here is the description of the fasta format used in EggLib:
Each sequence is preceded by a header limited to a single line and starting by a
>
character.The header length is not limited and all characters are allowed, but white spaces and special characters are discouraged. The header is terminated by a newline character.
Group labels are specified a special markup system placed at the end of the header line. The labels are specified by an at sign (
@
) followed by any string (@pop1
,@pop2
,@pop3
and so on). It is allowed to define several group labels for any sequence. In that case, integer values must be enter consecutively after the at sign, separated by commas, as in@cluster1,pop3,indiv2
for a sequence belonging to a given group in three different grouping levels. Multiple grouping levels can be used to specify hierarchical structure, but not only (several independent grouping structures can be specified). The markup@#
(at sign and hash sign) specifies an outgroup sequence. The hash sign may be followed by an additional label as in@#
, Multiple grouping levels are not allowed for the outgroup.The sequence itself continues on following lines until the next
>
character or the end of the file. Each allele is represented by a single character.White spaces, tab and carriage returns are allowed at any position. They are ignored unless for terminating the header line. There is no limitation in length and different sequences can have different lengths.
Allowed characters and significance of character case is determined but the alphabet specified at time of parsing. For example, the
alphabets.DNA
supports the characters listed below, and case is ignored (all characters are turned into upper-case).
Ambiguity characters in DNA sequences are processed by several functions. The IUPAC nomenclature is followed, as described below:
Code |
Meaning |
Complement |
---|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
alignment gap |
|
|
|
|
Positions with ?
are supposed to be non-characterized, so it
is unknown whether they have a valid base or an alignment gap.
- egglib.io.to_ms(aln, outfile, positions=None, alphabet=None, as_codes=False, converter=None, spacer='auto', header='//\n')[source]¶
Export data in the
ms
program format. The program is described in Hudson (2002 [Bioinformatics 18: 337-338]). The format is as follows:One header line with two slashes (can be extended/replaced).
One line with the number of sites
One line with the positions, only if the number of sites is larger than zero.
The matrix of genotypes (one line per sample), only if the number of sites is larger than zero.
One empty line.
ms
generates binary data (0/1). This function may export any integer values depending on the data contained in the input object.- Parameters:
data –
Align
instance.outfile – where to write data. This argument can be (i) a file (or file-like) object open for writing, (ii) the name of a file (which will be created) as a string, or (iii) None (in which case the resulting string will be returned).
positions – list of site positions, with length matching the alignment length. Positions are required to be in the [0,1] range but the order is not checked. By default (if the argument value is
None
), sites are supposed to be evenly spread over the [0,1] interval.alphabet – an
Alphabet
instance to use instead of the alphabet attached to the input alignment. When using this argument, alleles are transformed using the alphabet provided as argument using a direct mapping of the input alignment alphabet to the provided alphabet.as_codes – use the internal alphabet codes for exporting. Useful if a non-int alphabet is used. Ignored if converter is used. Not allowed if alphabet is used.
converter – a user-provided function generating the appropriate allelic value based on an input alphabet code. The simple inline example below shows how to convert any missing data to 0 and shift all valid alleles of one unit:
converter=lambda x: 0 if x<0 else x+1
.spacer – insert a space between all genotypes (to separate each locus/site). By default, all genotypes are concatenated as in the standard
ms
format. In this case, it is required that all allelic values (or codes if as_codes is specified) are >=0 and <10. By default ("auto"
), the spacer is automatically inserted if the alphabet defines alleles or codes outside the [0, 9] range, or if converter is specified.header – string to print instead of the two slashes line. Must contain the final new line.
- Returns:
A
ms
-formatted string if outfile isNone
, otherwiseNone
.
- egglib.io.from_fasta(fname, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]¶
Import sequences from a fasta file. Create a new instance of either
Align
orContainer
from data read from a fasta-formatted file. To process data from a fasta-formatted string, useio.from_fasta_string()
.- Parameters:
source – name of a fasta-formatted sequence file.
alphabet – an
Alphabet
instance defining the type of data. Only character alphabets are allowed (such asalphabets.DNA
, oralphabets.protein
).labels – import group labels. If so, they are not actually required to be present for each (or any) sequence. By default tags in sequence names considered to be part of the name and not as labels.
label_marker – this option allows to change the character indicating the start of the labels.
label_separator – this options allows to change character used to separate labels.
cls – type that should be generated. Possible values are:
Align
(then, data must be aligned),Container
, orNone
. In the latter case, anAlign
is returned if data are found to be aligned or if the data set is empty, and otherwise aContainer
is returned.
- Returns:
A new
Container
orAlign
instance depending on the value of the cls option.
- egglib.io.from_fasta_string(string, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]¶
Import sequences from a fasta-formatted string. Identical to
io.from_fasta()
but directly takes an fasta-formatted string as first argument.
- class egglib.io.fasta_iter(fname, alphabet, labels=False)[source]¶
Iterative sequence-by-sequence fasta parser.
- Parameters:
fname – name of a fasta-formatted file.
alphabet – an
Alphabet
instance defining the type of data. Only character alphabets are allowed (such asalphabets.DNA
andalphabets.protein
).labels – import group labels from sequence names (by default, they are considered as part of the name).
This function can be used in an iteration as shown below:
>>> for item in egglib.io.fasta_iter(fname): ... ...
- The
with
statement is also supported, which ensures that the input file is properly closed whenever the
with
statement completes:
>>> with egglib.io.fasta_iter(fname) as f: ... for item in f: ... ...
Each iteration yields a
SampleView
instance (which is valid only during the iteration round, see the warning below).Warning
The aim of this iterator is to iterate over large fasta files without actually storing all data in memory at the same time. The
SampleView
instance provided at each iteration is a proxy to a localContainer
instance that is recycled at each iteration step. The iteration variable should be used immediately and never stored as this. If one wants to sequence data, they should copy them immediately (typically using theadd_sample()
method of a separateContainer
instance).
- egglib.io.from_genepop(fname)[source]¶
Import Genepop-formatted genotypic data. The format is described here.
- Parameters:
fname – Genepop-formatted file name.
- Returns:
A new
Align
instance.
The returned object contains data mapped to an ad hoc alphabet, with two samples per individuals. Group labels are used to indicate the structure (first level: populations, second level: individuals). In addition to normal
Align
instance, the returned object has two attributes:title
andloci
, which contain the information read from the Genepop file.
- class egglib.io.GFF3(fname, strict=True)[source]¶
Import GFF3 genome annotation data. Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.
See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.
All features are loaded into memory and can be processed interactively.
- Parameters:
source – name of a GFF3-formatted file. To pass a string, use the factory method
from_string()
.strict – by default, apply strictly GFF3 requirements. See below.
The
strict
argument can be set toFalse
to support CDS features lacking a phase.Some observations regarding this implementation:
The validity of parent-child relationship with respect to the Sequence Ontology is not enforced, meaning that this parser allows any type of feature to be parent to any feature of any type.
The Sequence Ontology accession number of types of features related to genes are automatically mapped to the type name (such as “SO:0000704”, which is translated as “gene”).
Discontinuous features are required to have matching attributes (except start, stop, and phase for CDS features) but subsequent segments are allowed to skip attributes if the first segment defines them.
fasta-formatted sequences are imported but are currently not checked for consistency with annotations and defined sequence regions.
io.GFF3
instances are iterable. The expression:>>> for feat in GFF3: ... ...
is equivalent to:
>>> for feat in GFF3.iter_features(): ... ...
Attributes
Content of all attribute-ontology directives.
Content of all feature-ontology directives.
Genome build if specified by a directive.
List of seqid.
Number of seqid of features present in instance.
Number of features without parents.
Total number of features.
Dictionary with all sequence regions defined by directives.
If specified, sequences as a
Container
.Content of all source-ontology directives.
Species (if specified by a directive).
GFF version.
Methods
from_string
(string[, strict])Import data from a GFF3-formatted string.
iter_features
([seqid, start, end, all_features])Iterate over features.
- property attribute_ontology¶
Content of all attribute-ontology directives.
- property feature_ontology¶
Content of all feature-ontology directives.
- classmethod from_string(string, strict=True)[source]¶
Import data from a GFF3-formatted string. This is a class method, which can be called as shown below to create a new object:
>>> gff3 = egglib.io.GFF3.from_string(gff3_string)
- property genome_build¶
Genome build if specified by a directive.
- iter_features(seqid=None, start=None, end=None, all_features=False)[source]¶
Iterate over features.
- Parameters:
seqid – only iterates overs features of this contig (by default, consider all contigs). If the specified seqid is not present in the GFF3 file, iteration is empty (without error).
start – start iteration at this position. By default, start with the first position.
end – stop iteration at this position (does not include features whose end position is larger than this value). By default, process all features.
all_features – whether iteration should also include features that are part of another (by default, only include features that don’t have a parent). A given feature may be repeated if it has several unconnected parents.
Note
If start and/or end are specified, seqid is required.
- property list_seqid¶
List of seqid.
- property num_seqid¶
Number of seqid of features present in instance.
- property num_top_features¶
Number of features without parents.
- property num_tot_features¶
Total number of features.
- property regions¶
Dictionary with all sequence regions defined by directives.
- property source_ontology¶
Content of all source-ontology directives.
- property species¶
Species (if specified by a directive).
- property version¶
GFF version.
- class egglib.io.Gff3Feature[source]¶
Provide access to data of a feature. This class cannot be instanciated by the user.
Attributes
Feature identifier.
All parts of this feature.
Dictionary of attributes attached to the feature.
List of features deriving from this one.
List of features descending from this one.
Feature end position.
List of features parent to this one.
Feature score.
List of segments of the feature.
Seqid on which this feature is located.
Feature source.
Feature start position.
Strand on which the feature is located.
Feature type.
- property ID¶
Feature identifier.
- property all_parts¶
All parts of this feature.
- property attributes¶
Dictionary of attributes attached to the feature. The list of attributes is:
ID
Name
Alias
Parent
Target
Gap
Derives_from
Note
Dbxref
Ontology_term
Is_circular
Other attributes as defined in the GFF3 file.
- property derivers¶
List of features deriving from this one.
- property descendants¶
List of features descending from this one.
- property end¶
Feature end position.
- property parents¶
List of features parent to this one.
- property score¶
Feature score.
- property segments¶
List of segments of the feature.
- property seqid¶
Seqid on which this feature is located.
- property source¶
Feature source.
- property start¶
Feature start position.
- property strand¶
Strand on which the feature is located.
- property type¶
Feature type.
- egglib.io.index_vcf(fname[, outname])¶
Index a BCF file. The file is required to be in format BCF. If outname is not specified, use the standard naming scheme for CSI index files.
- class egglib.io.VCF(fname, index=None, subset=None)¶
VCF/BCF parser using htslib.
- Parameters:
fname – input VCF/BCF file name. Gzip-compressed files are supported.
index – index file name. By default, try to import index with canonical file name. If index is specified, it must be present. Otherwise missing index file is not an error. Index is only imported for BCF files.
subset – sequence of sample names to import. The order of samples in this sequence is not considered. Duplicated names in this sequence are ignored. Other samples are ignored. This is useful to speed up parsing.
Attributes
Boolean indicating whether an index is available.
Number of samples.
Methods
list
of alleles.list
of alternate alleles.Chromosome or contig name.
Errors while reading last variant.
list
of filters.get_format
(tag, index)Get a given FORMAT field.
FORMAT fields for all samples.
Get genotypes.
Get list of identifiers for the current variant.
get_info
(tag)Get a given INFO field.
dict
of INFO fields.Get booleans indicating if genotypes are phased.
Chromosome position.
Quality value.
Reference allele.
get_sample
(index)Get the name of the sample at index index.
list
of all samples.Get the type(s) of the last variant.
goto
(target[, position, [limit]])Move to a given location in the file.
Check if last variant is a SNP.
Read one variant of the VCF file.
- get_chrom()¶
Chromosome or contig name. Return
None
by default (no available data).
- get_errors()¶
Errors while reading last variant. Get the non-fatal errors generated while importing last variant, as a list, or
None
if nothing has been read.
- get_format(tag, index)¶
Get a given FORMAT field. Return
None
by default (no available data for key not available).
- get_formats()¶
FORMAT fields for all samples. Return a :list: of :dict: instances. Return
None
by default (no available data).
- get_genotypes()¶
Get genotypes. Return a
list
giving, for each sample, thelist
of alleles composing its genotype. ReturnNone
by default (no data available).
- get_id()¶
Get list of identifiers for the current variant. Empty list if none provided.
None
if nothing has been read. The uniqueness of ID’s is not tested.
- get_info(tag)¶
Get a given INFO field. Return
None
by default (no available data for key not available).
- get_phased()¶
Get booleans indicating if genotypes are phased. The return value is the
tuple
: (all_phased, phased_table), with a boolean for all samples and all alleles beyond the first.None
whatever bad happens.
- get_pos()¶
Chromosome position. Return
None
by default (no available data).
- get_quality()¶
Quality value. Return
None
by default (no available data or missing value).
- get_reference()¶
Reference allele. Return
None
by default (no available data).
- get_sample(index)¶
Get the name of the sample at index index.
- get_types()¶
Get the type(s) of the last variant. Return a
list
. ReturnNone
by default (no available data).
- goto(target[, position[, limit]])¶
Move to a given location in the file. Data at the new location are available immediately with no need to call
read()
(this method should be understood as a call toread()
at an arbitrary location. If position is not specified, move to the first available position of contig target. If target does not exist in file or if position is specified, but is not available within the region specified by limit, aValueError
is thrown.ValueError
is also thrown in case of unexpected errors during processing. By default, limit is equal to position + 1 (meaning that only the exact position can be retrieved). If this condition is not met (in particular if position is past the end of the contig target, returnFalse
.Note
Only available for indexed BCF.
- has_index¶
Boolean indicating whether an index is available.
- is_snp()¶
Check if last variant is a SNP.
True
if the last variant is of type SNP, and SNP only.
- num_samples¶
Number of samples.
- read()¶
Read one variant of the VCF file. Return
True
if read is successful,False
if end of file.ValueError
in case of critical error.
- egglib.io.from_clustal(string, alphabet)[source]¶
Import a clustal-formatted alignment. The input format is the one generated and used by ClustalW.
- egglib.io.from_staden(string, keep_consensus=False)[source]¶
Import data from the Staden assembly package. Process the output file of the GAP4 program of the Staden package.
- Parameters:
string – input string.
keep_consensus – don’t delete consensus sequence.
- Returns:
An
Align
instance.
The input string should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named
CONSENSUS
, if present, is automatically removed unless the option keep_consensus is used.Staden’s default convention is followed:
-
codes for an unknown base and is replaced byN
.*
codes for an alignment gap and is replaced by-
..
represents the same sequence than the consensus at that position.White space represents missing data and is replaced by
?
.
- egglib.io.from_genalys(string)[source]¶
Import Genalys data. Genalys was a proprietary program to process sequencing reads, perform assembly and detect polymorphism. Convert Genalys-formatted sequence alignment files to fasta. This function imports files generated through the option “Save SNPs” of Genalys 2.8.
- Parameters:
string – input data as a Genalys-formatted string.
- Returns:
An
Align
instance.
- egglib.io.get_fgenesh(string, locus='locus')[source]¶
Import fgenesh data.
- Parameters:
fname – a string containing fgenesh ouput.
- Parma locus:
locus name.
- Returns:
A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a
codon_start
qualifier.
- class egglib.io.GenBank(fname=None, string=None)[source]¶
Process a GenBank-formatted DNA sequence record.
- Parameters:
fname – input file name.
string – GenBank-formatted string.
Only one of the two arguments fname and string can be non-
None
. If both areNone
, the constructor generates an empty instance with a sequence of length 0. If fname is non-None
, a GenBank record is read from the file with this name. If string is non-None
, a GenBank record is read from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value isNone
except for references and others for which default is an empty list. source is a (description, species, taxonomy)tuple
. Each of references is a (header, raw reference)tuple
and each of others is a (key, raw)tuple
.In addition to methods documented below, the following operations are supported for
gb
if it is aGenBank
instance:Operation
Result
len(gb)
length of the sequence attached to this record
for feat in gb:
Iterate over
GenBankFeature
instances of this recordstr(gb)
GenBank representation of the record
Attributes
Sequence string.
Methods
add_feature
(feature)Add a feature to the instance.
extract
(from_pos, to_pos)Extract a subset of the instance.
format
()String representation of the instance.
Number of features contained in the instance.
rc
()Reverse-complement the instance (in place).
write
(fname)Write the formatted record to a file.
write_stream
(stream)From the formatted record to a stream.
- add_feature(feature)[source]¶
Add a feature to the instance.
- Parameters:
feature – an
io.GenBankFeature
instance.
- extract(from_pos, to_pos)[source]¶
Extract a subset of the instance.
- Parameters:
from_pos – start position.
to_pos – stop position (not included).
- Returns:
A new
io.GenBank
instance representing a subset of the current instance. All features that are completely included in the specified range are exported.
- rc()[source]¶
Reverse-complement the instance (in place). All features positions and the sequence will be reverted and applied to the complementary strand. The features will be sorted in increasing start position (after reverting). This method should be applied only on genuine nucleotide sequences.
- property sequence¶
Sequence string. Note that changing the record’s string might invalidate the features (meaning that the setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).
- class egglib.io.GenBankFeature(parent)[source]¶
Feature of a GenBank record.
io.GenBankFeature
instances must be only used along anio.GenBank
instance. The constructor creates an empty instance (although a parentio.GenBank
instance is required) and eitherupdate()
orparse()
must be used subsequently.- Parameters:
parent – an
io.GenBank
instance to which the feature should be attached.
Note that
str(feature)
is equivalent tofeature.format()
.Methods
add_qualifier
(key, value)Add a qualifier.
copy
(genbank)Return a copy of the current instance.
format
()GenBank-formatted string representing the feature.
Return the string corresponding to this feature.
First position of the first segment.
get_stop
()Stop position of the last segment.
get_type
()Type of the instance.
parse
(string)Update feature information from a string.
Dictionary with all qualifier values.
rc
([length])Reverse-complement the feature.
shift
(offset)Shift all positions.
update
(feat_type, location, **qualifiers)Update feature information.
- copy(genbank)[source]¶
Return a copy of the current instance.
- Parameters:
genbank –
GenBank
instance to which the returned instance should be attached.
- get_sequence()[source]¶
Return the string corresponding to this feature. If the positions pass beyond the end of the parent’s sequence, a
RuntimeError
(and not anIndexError
) is raised.
- parse(string)[source]¶
Update feature information from a string.
- Parameters:
string – a GenBank-formatted string.
- qualifiers()[source]¶
Dictionary with all qualifier values. This method cannot be used to change data within the instance.
- rc(length=None)[source]¶
Reverse-complement the feature.
- Parameters:
length – length of the complete sequence (by default, take the information directly from the parent).
- update(feat_type, location, **qualifiers)[source]¶
Update feature information.
- Parameters:
feat_type – a string identifying the feature type (such as
"gene"
,"CDS"
,"misc_feature"
, etc.). All strings are acceppted.location – an
io.GenBankFeatureLocation
instance giving the feature’s location.qualifiers – other qualifiers must be passed as keyword arguments. It is not allowed to use
"type"
as a qualifier keyword.
- class egglib.io.GenBankFeatureLocation(string=None)[source]¶
Hold the location of a GenBank feature. Supports various forms of location specifications as allowed in the GenBank format. The constructor allows to parse a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).
In addition to methods documented below, the following operations are supported for
loc
if it is aGenBankFeatureLocation
instance:Operation
Result
len(loc)
Number of segments
loc[index]
Return the
(first, last)
tuple
for (first, last) in loc:
Iterator over segments
params.format()
Generate a GenBank representation
Methods
add_base_choice
(first, last[, left_partial, ...])Segment corresponding to a single base in a given range.
add_base_range
(first, last[, left_partial, ...])Add a base range to the feature.
add_between_base
(position)Add a segment lying between two consecutive bases.
add_single_base
(position)Add a single-base segment to the feature.
as_order
()Define the feature as an order instead of a range.
as_range
()Define the feature as a range.
copy
()Deep copy of the current instance.
format
()String representation of the instance.
True
if the feature is on the complement strand.is_range
()True
if the feature is a range.rc
(length)Reverse the feature positions.
Place the feature on the complement strand.
Place the feature on the forward strand.
shift
(offset)Shift all positions.
- add_base_choice(first, last, left_partial=False, right_partial=False)[source]¶
Segment corresponding to a single base in a given range. Arguments are identical to
add_base_range()
.
- add_base_range(first, last, left_partial=False, right_partial=False)[source]¶
Add a base range to the feature.
- Parameters:
first – first position of the range.
last – last position of the range (included).
first_partial – specify that the real start of the segment is somewhere 5’ of first.
first_partial – specify that the real end of the segment is somewhere 3’ of last.
If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use
add_base_range(1127,1482)
in combination withset_complement()
.
- add_between_base(position)[source]¶
Add a segment lying between two consecutive bases. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use
add_between_base(1127)
in combination withset_complement()
.
Alternative VCF parser¶
- class egglib.io.VcfParser(fname, threshold_PL=None, threshold_GL=None)[source]¶
Import Variant Call Format data from a file. The VCF format is designed to store data describing genomic variation in an extremely flexible way. See the description of the VCF format. To parse VCF data stored in string, use
io.VcfStringParser
.- Parameters:
fname – name of a properly formatted VCF file. The header section will be processed upon instance creation, and lines will be read later, when the user iterates over the instance (or call
readline()
).threshold_PL – call genotypes parameter. This parameter controls how genotype calling (GT field) is performed from the PL (phred-scaled genotype likelihood) field. By default (
None
), this conversion is never performed. If threshold_PL is specified, genotype calling is only performed if GT is not available and PL is available. The genotype with the lowest PL value is selected. The parameter gives the minimum acceptable gap between the best genotype and the second one. If the second genotype has a too good score, based on this parameter, the genotype is called unknown. The parameter must be at least 1.
io.VcfParser
instances are iterable. Everly loop yields a(chromosome, position, num_all)
tuple
that allows the user to determines if the variant is of interest. Note that the position is considered as an index and therefore has been decremented compared with the value found in the file. If the variant is of interest,io.VcfParser
instances provide methods to extract all data for this variant. Iterating over VCF lines can be performed manually using the methodreadline()
.Attributes
Index of the current line of the VCF file.
File format specified in the header.
Tell if the file is good for reading.
True
if an index file has been loaded.Number of ALT definitions in the header.
Number of FILTER definitions in the header.
Number of FORMAT definitions in the header.
Number of indexed lines if the file index.
Number of INFO definitions in the header.
Number of META definitions in the header.
Number of samples defined in the header.
Methods
bed_slider
(bed[, max_missing, allow_indel, ...])Perform a sliding window based on BED coordinates.
close
()Close file (if it is open)
get_alt
(idx)Get an ALT definition from the header.
get_filter
(idx)Get a FILTER definition from the header.
get_format
(idx)Get a FORMAT definition from the header.
get_genotypes
([start, stop, dest])Extract genotype data for the current site.
get_info
(idx)Get an INFO definition from the header.
get_meta
(idx)Get data for a given META field defined in the VCF header.
get_sample
(idx)Get the name of a sample read from the header.
Get all data for the current variant.
goto
(self, contig[, position])Move to an arbitrary position of the VCF file.
load_index
([fname])Load an index file allowing fast navigation.
readline
()Read one variant.
rewind
()Reset the parser.
slider
(size, step[, as_variants, start, ...])Start a sliding window from the current position.
unread
()Go back to the previous variant.
- bed_slider(bed, max_missing=0, allow_indel=False, allow_custom=False)[source]¶
Perform a sliding window based on BED coordinates.
- Parameters:
bed – an
io.BED
instance.max_missing –
maximum number of missing alleles.
Warning
Here, max_missing must be an absolute number of samples.
allow_indel – include variants with alleles of varying size.
allow_custom – include variants with custom alleles.
- Returns:
An
io.VcfWindow
instance.
Changed in version 3.2.0: max_missing is required to be an integer.
- property currline¶
Index of the current line of the VCF file.
- property file_format¶
File format specified in the header.
- get_alt(idx)¶
Get an ALT definition from the header. The index must be smaller than
num_alt
. Return a dictionary containing the following data:"id"
: identifier string."description"
: description string."extra"s
: all extra qualifiers, presented as list of(key, value)
tuple
instances.
- get_filter(idx)¶
Get a FILTER definition from the header. The index must be smaller than
num_filter
. Return a dictionary containing the following data:"id"
: identifier string."description"
: description string."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_format(idx)¶
Get a FORMAT definition from the header. The index must be smaller than
num_format
. Return a dictionary containing the following data:"id"
: identifier string."type"
: one of"Integer"
,"Float"
,"Character"
, and"String"
."description"
: description string."number"
: expected number of items. Special values areNone
(if undefined),"NUM_GENOTYPES"
(number matching the number of genotypes for any particular variant),"NUM_ALTERNATE"
(number matching the number of alternate alleles for any particular variant), and"NUM_ALLELES"
(number matching the number of alleles–including the reference–for any particular variant)."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_genotypes(start=0, stop=None, dest=None)¶
Extract genotype data for the current site.
It is required that a variant has been effectively processed.
- Parameters:
start – index of the first sample to process (at least 0 and smaller than the number of samples).
stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).
dest – a
Site
instance that will be recycled and used to place results.
- Returns:
A
Site
instance by default, orNone
if dest was specified.
- get_info(idx)¶
Get an INFO definition from the header. The index must be smaller than
num_info
. Return a dictionary containing the following data:"id"
: identifier string."type"
: one of"Integer"
,"Float"
,"Flag"
,"Character"
, and"String"
."description"
: description string."number"
: expected number of items. Special values areNone
(if undefined),"NUM_GENOTYPES"
(number matching the number of genotypes for any particular variant),"NUM_ALTERNATE"
(number matching the number of alternate alleles for any particular variant), and"NUM_ALLELES"
(number matching the number of alleles–including the reference–for any particular variant)."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_meta(idx)¶
Get data for a given META field defined in the VCF header. The index must be smaller than
num_meta
. Return atuple
containing the key and the value of the META field.
- get_sample(idx)¶
Get the name of a sample read from the header. The index must be smaller than
num_samples
.
- get_variant()¶
Get all data for the current variant. Return an
io.VcfVariant
instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.
- property good¶
Tell if the file is good for reading.
False
is the end of the file has been reached.
- goto(self, contig, position=egglib.io.FIRST)[source]¶
Move to an arbitrary position of the VCF file. Requires an index (see
load_index()
). AValueError
is raised if the position cannot be found.
- property has_index¶
True
if an index file has been loaded.
- load_index(fname=None)[source]¶
Load an index file allowing fast navigation. Index files allow fast navigation in VCF file regardless of their size, allowing to move instantly to a determined variant using its position or its index in file. Index files can be created by
io.make_vcf_index()
.- Parameters:
fname – name of the index file. By default, use the default index file name, which is the name of the VCF file with the “.vcfi” extension (removing the original extension only if it is “.vcf”).
- property num_alt¶
Number of ALT definitions in the header.
- property num_filter¶
Number of FILTER definitions in the header.
- property num_format¶
Number of FORMAT definitions in the header.
- property num_index¶
Number of indexed lines if the file index.
- property num_info¶
Number of INFO definitions in the header.
- property num_meta¶
Number of META definitions in the header.
- property num_samples¶
Number of samples defined in the header.
- readline()[source]¶
Read one variant. Return a
tuple
with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF file). Raise aValueError
if file is finished.
- slider(size, step, as_variants=False, start=0, stop=None, max_missing=0, allow_indel=False, allow_custom=False)[source]¶
Start a sliding window from the current position.
- Parameters:
size – size of the sliding window (by default, in base pairs).
step – increment of the sliding window (by default, in base pairs).
as_variants – express size and step in number of variants instead of base pairs.
start – start position of the sliding window.
stop – stop position of the sliding window.
max_missing –
maximum number of missing alleles (variants exceeding this threshold are ignored).
Warning
Here, max_missing must be an absolute number of samples.
allow_indel – include variants with alleles of variable size
allow_custom – include variants with custom alleles
- Returns:
An
io.VcfSlidingWindow
instance.
Changed in version 3.2.0: max_missing is required to be an integer.
- egglib.io.make_vcf_index(fname, outname=None)[source]¶
Create the index for a VCF file.
- Parameters:
fname – name of a VCF file.
outname – name of the index file. By default, use a default name. The default name is based on the VCF file name, stripping the extension only if it is vcf and appending the vcfi extension. Index files bearing this default name are automatically loaded if the corresponding VCF file is opened.
- class egglib.io.VcfStringParser(header, threshold_PL=None, threshold_GL=None)[source]¶
Import Variant Call Format data from a string. Alias of
io.VcfParser
processing strings instead of a file. The constructor takes a string containing a VCF header (the first line being the file format specification and the last line being the header line ,starting with#CHROM
).Attributes
File format specified in the header.
Number of ALT definitions in the header.
Number of FILTER definitions in the header.
Number of FORMAT definitions in the header.
Number of INFO definitions in the header.
Number of META definitions in the header.
Number of samples defined in the header.
Methods
get_alt
(idx)Get an ALT definition from the header.
get_filter
(idx)Get a FILTER definition from the header.
get_format
(idx)Get a FORMAT definition from the header.
get_genotypes
([start, stop, dest])Extract genotype data for the current site.
get_info
(idx)Get an INFO definition from the header.
get_meta
(idx)Get data for a given META field defined in the VCF header.
get_sample
(idx)Get the name of a sample read from the header.
Get all data for the current variant.
readline
(string)Read one variant from a user-provided single line.
- property file_format¶
File format specified in the header.
- get_alt(idx)¶
Get an ALT definition from the header. The index must be smaller than
num_alt
. Return a dictionary containing the following data:"id"
: identifier string."description"
: description string."extra"s
: all extra qualifiers, presented as list of(key, value)
tuple
instances.
- get_filter(idx)¶
Get a FILTER definition from the header. The index must be smaller than
num_filter
. Return a dictionary containing the following data:"id"
: identifier string."description"
: description string."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_format(idx)¶
Get a FORMAT definition from the header. The index must be smaller than
num_format
. Return a dictionary containing the following data:"id"
: identifier string."type"
: one of"Integer"
,"Float"
,"Character"
, and"String"
."description"
: description string."number"
: expected number of items. Special values areNone
(if undefined),"NUM_GENOTYPES"
(number matching the number of genotypes for any particular variant),"NUM_ALTERNATE"
(number matching the number of alternate alleles for any particular variant), and"NUM_ALLELES"
(number matching the number of alleles–including the reference–for any particular variant)."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_genotypes(start=0, stop=None, dest=None)¶
Extract genotype data for the current site.
It is required that a variant has been effectively processed.
- Parameters:
start – index of the first sample to process (at least 0 and smaller than the number of samples).
stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).
dest – a
Site
instance that will be recycled and used to place results.
- Returns:
A
Site
instance by default, orNone
if dest was specified.
- get_info(idx)¶
Get an INFO definition from the header. The index must be smaller than
num_info
. Return a dictionary containing the following data:"id"
: identifier string."type"
: one of"Integer"
,"Float"
,"Flag"
,"Character"
, and"String"
."description"
: description string."number"
: expected number of items. Special values areNone
(if undefined),"NUM_GENOTYPES"
(number matching the number of genotypes for any particular variant),"NUM_ALTERNATE"
(number matching the number of alternate alleles for any particular variant), and"NUM_ALLELES"
(number matching the number of alleles–including the reference–for any particular variant)."extra"
: all extra qualifiers, presented as a list of(key, value)
tuple
instances.
- get_meta(idx)¶
Get data for a given META field defined in the VCF header. The index must be smaller than
num_meta
. Return atuple
containing the key and the value of the META field.
- get_sample(idx)¶
Get the name of a sample read from the header. The index must be smaller than
num_samples
.
- get_variant()¶
Get all data for the current variant. Return an
io.VcfVariant
instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.
- property num_alt¶
Number of ALT definitions in the header.
- property num_filter¶
Number of FILTER definitions in the header.
- property num_format¶
Number of FORMAT definitions in the header.
- property num_info¶
Number of INFO definitions in the header.
- property num_meta¶
Number of META definitions in the header.
- property num_samples¶
Number of samples defined in the header.
- readline(string)[source]¶
Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that has been provided when creating this instance.
- Returns:
A
tuple
with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF line).
- class egglib.io.VcfVariant[source]¶
Represent a single VCF variant. The user cannot create instances of this class himself (instances are generated by
io.VcfParser
orio.VcfStringParser
).Note
The
AA
(ancestral allele),AN
(allele number),AC
(allele count), andAF
(allele frequency) INFO fields as well as theGT
(deduced genotype) FORMAT are automatically extracted if they are present in the the file and if their definition matches the format specification (meaning that they were not re-defined with different number/type) in the header. If present, they are available through the dedicated attributesAN
,AA
,AC
,AF
,GT
,ploidy
andGT_phased
. However, they are also available in the respectiveinfo
andsamples
(sub)-dictionaries.Attributes
Value of the AA info field.
Value of the AC info field.
Value of the AF info field.
Value of the AN info field.
GL values for all samples
Genotypes from GT fields.
Tell if the genotype for each sample is phased.
GT field as written in the VCF file
Tuple containing all IDs.
PL values for all samples
Tuple of variant alleles.
Alternate alleles types, as a tuple. One value is provided for each alternate allele. The provided values are integers whose values should always be compared to class attributes
alt_type_default
,alt_type_referred
andalt_type_breakend
, as in (for the type of the first alternate allele)::.Chromosome name.
Filters at which this variant failed.
Frozenset of FORMAT fields ID's.
Dictionary of INFO fields for this variant.
Number of alleles.
Number of alternate alleles.
Number of genotypes.
Number of samples.
Ploidy among genotypes.
Position.
Variant quality.
Information for each sample.
- alt_type_default¶
Explicit alternate allele (the string represents the nucleotide sequence of the allele).
- alt_type_referred¶
Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).
- alt_type_breakend¶
Alternate allele symbolizing a breakend (see VCF description for more details).
- property AA¶
Value of the AA info field.
None
if missing.
- property AC¶
Value of the AC info field. Provided as a tuple.
None
if missing.
- property AF¶
Value of the AF info field. Provided as a tuple.
None
if missing.
- property AN¶
Value of the AN info field.
None
if missing.
- property GL¶
GL values for all samples
- property GT¶
Genotypes from GT fields. Only if this format field is available. Provided as a
tuple
of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples
). The number of items within each sub-tuples is equal to the ploidy (ploidy
). These items are allele expression (as found inalleles
), orNone
(for missing values). This attribute isNone
if GT is not available.
- property GT_phased¶
Tell if the genotype for each sample is phased.
None
if GT is not available.
- property GT_vcf¶
GT field as written in the VCF file
- property ID¶
Tuple containing all IDs.
- property PL¶
PL values for all samples
- property alleles¶
Tuple of variant alleles. The first is the reference, which is not guaranteed to be present in samples.
- property alternate_types¶
Alternate alleles types, as a tuple. One value is provided for each alternate allele. The provided values are integers whose values should always be compared to class attributes
alt_type_default
,alt_type_referred
andalt_type_breakend
, as in (for the type of the first alternate allele):>>> type_ = variant.alternate_types[0] >>> if type_ == variant.alt_type_default: ... allele = variant.allele(0)
- property chromosome¶
Chromosome name.
None
if missing.
- property failed_tests¶
Filters at which this variant failed. Provided as a tuple or names, or
None
if no filters have been applied.
- property format_fields¶
Frozenset of FORMAT fields ID’s. The frozenset is empty if no sample data are available.
- property info¶
Dictionary of INFO fields for this variant. Keys are ID of INFO fields available for this variant, and values are always a
tuple
of items, even if there is only one item. For flag INFO types, the value is always an emptytuple
.
- property num_alleles¶
Number of alleles. The reference is always included.
- property num_alternate¶
Number of alternate alleles. Equal to
num_alleles
minus 1.
- property num_genotypes¶
Number of genotypes.
- property num_samples¶
Number of samples. Equivalent to
len(variant.samples)
.
- property ploidy¶
Ploidy among genotypes. Always 2 if GT is not available.
- property position¶
Position. Given as an index; first value is 0.
None
if missing.
- property quality¶
Variant quality.
None
if missing.
- property samples¶
Information for each sample. Empty list if no samples are defined. Otherwise, the list contains one dictionary for each sample: keys of these dictionaries are FORMAT fields identifiers (the keys are always the same as the content of
format_fields
), and their values aretuple
instances in all cases.
- class egglib.io.VcfSlidingWindow[source]¶
This class manages a sliding window on a VCF file. This class cannot be instanciated directly: instances are returned by the methods
slider()
andbed_slider()
ofio.VcfParser
instances.io.VcfSlidingWindow
instances are iterable: iteration steps return a commonio.VcfWindow
instance which is updated at each iteration round.Attributes
True
if there is still data to be processed.- property good¶
True
if there is still data to be processed.
- class egglib.io.VcfWindow[source]¶
Provide access to sites of a window from a VCF file. The following operations are supported by
io.VcfWindow
instances:Operation
Result
len(win)
number of sites in the window
win[i]
get a site as a
Site
instancefor site in win:
iterate over sites as
Site
instancesAttributes
Bounds of the window.
Name of the chromosome or contig.
Number of sites in window.
- property bounds¶
Bounds of the window. The second bound is not included in the window.
- property chromosome¶
Name of the chromosome or contig.
- property num_sites¶
Number of sites in window.
- egglib.io.FIRST¶
First available position of a contig.
- egglib.io.LAST¶
Last available position of a contig.
- class egglib.io.BED(fname=None)[source]¶
Class holding BED (Browser Extensible Data) data.
- Parameters:
fname – name of the BED-formatted input file. By default, create an empty instance.
BED is a flexible format representing genomic annotation. In the current implementation, only the chromosome/scaffold, start, and end positions (the first three fields of each lines, which are the only ones to be required) are processed. If incorrectly formed fields are provided besides the first three fields, the behaviour is undefined (that is, it is not guaranteed that are exception is raised).
Supported operations:
Operation
Result
Notes
len(bed)
Number of annotation items
bed[i]
Get an annotation
for i in bed:
Iterate over annotations
Notes:
Annotation items are represented by dictionaries containing the following keys:
"chrom"
,"start"
, and"end"
.
Methods
append
(chrom, start, end)Add an item at the end of the BED data.
extend
(values)Append items from an iterable.