Import/export utilities¶

`egglib.io.to_ms`(aln, outfile[, positions, ...])	Export data in the `ms` program format.
`egglib.io.from_fasta`(fname, alphabet[, ...])	Import sequences from a fasta file.
`egglib.io.from_fasta_string`(string, alphabet)	Import sequences from a fasta-formatted string.
`egglib.io.fasta_iter`(fname, alphabet[, labels])	Iterative sequence-by-sequence fasta parser.
`egglib.io.from_genepop`(fname)	Import Genepop-formatted genotypic data.
`egglib.io.GFF3`(fname[, strict])	Import GFF3 genome annotation data.
`egglib.io.VCF`(fname[, index, subset, dumpfile])	VCF/BCF parser using htslib.
`egglib.io.index_vcf`(fname[, outname])	Index a BCF file.
`egglib.io.VcfSlider`(vcf, size, step[, ...])	Run a sliding window on a VCF file.
`egglib.io.CodonVCF`(vcf_name, gff_name[, ...])	Analysis of coding sites from VCF data.
`egglib.io.BED`([fname])	Class holding BED (Browser Extensible Data) data.
`egglib.io.from_clustal`(string, alphabet)	Import a clustal-formatted alignment.
`egglib.io.from_staden`(string[, keep_consensus])	Import data from the Staden assembly package.
`egglib.io.from_genalys`(string)	Import Genalys data.
`egglib.io.get_fgenesh`(string[, locus])	Import fgenesh data.
`egglib.io.GenBank`([fname, string])	Process a GenBank-formatted DNA sequence record.

In case the VCF class is not available, the following alternative tools are maintained:

`egglib.io.VcfParser`(fname[, threshold_PL, ...])	Import Variant Call Format data from a file.
`egglib.io.VcfStringParser`(header[, ...])	Import Variant Call Format data from a string.

Below are components that do not need to be manipulated directly but are referred to by the documentation of the above items.

`egglib.io.Gff3Feature`()	Provide access to data of a feature.
`egglib.io.GenBankFeature`(parent)	Feature of a GenBank record.
`egglib.io.GenBankFeatureLocation`([string])	Hold the location of a GenBank feature.

Here is the description of the fasta format used in EggLib:

Each sequence is preceded by a header limited to a single line and starting by a > character.
The header length is not limited and all characters are allowed, but white spaces and special characters are discouraged. The header is terminated by a newline character.
Group labels are specified a special markup system placed at the end of the header line. The labels are specified by an at sign (@) followed by any string (@pop1, @pop2, @pop3 and so on). It is allowed to define several group labels for any sequence. In that case, integer values must be enter consecutively after the at sign, separated by commas, as in @cluster1,pop3,indiv2 for a sequence belonging to a given group in three different grouping levels. Multiple grouping levels can be used to specify hierarchical structure, but not only (several independent grouping structures can be specified). The markup @# (at sign and hash sign) specifies an outgroup sequence. The hash sign may be followed by an additional label as in @#, Multiple grouping levels are not allowed for the outgroup.
The sequence itself continues on following lines until the next > character or the end of the file. Each allele is represented by a single character.
White spaces, tab and carriage returns are allowed at any position. They are ignored unless for terminating the header line. There is no limitation in length and different sequences can have different lengths.
Allowed characters and significance of character case is determined but the alphabet specified at time of parsing. For example, the alphabets.DNA supports the characters listed below, and case is ignored (all characters are turned into upper-case).

Ambiguity characters in DNA sequences are processed by several functions. The IUPAC nomenclature is followed, as described below:

Code	Meaning	Complement
`A`	`A`	`T`
`C`	`C`	`G`
`G`	`G`	`C`
`T`	`T`	`A`
`M`	`A` or `C`	`K`
`R`	`A` or `G`	`Y`
`W`	`A` or `T`	`W`
`S`	`C` or `G`	`S`
`Y`	`C` or `T`	`R`
`K`	`G` or `T`	`M`
`V`	`A`, `C`, or `G`	`B`
`H`	`A`, `C`, or `T`	`D`
`D`	`A`, `G`, or `T`	`H`
`B`	`C`, `G`, or `T`	`V`
`N`	`G`, `A`, `T`, or `C`	`N`
`-`	alignment gap	`-`
`?`	`G`, `A`, `T`, `C`, or alignment gap	`?`

Positions with ? are supposed to be non-characterized, so it is unknown whether they have a valid base or an alignment gap.

egglib.io.to_ms(aln, outfile, positions=None, alphabet=None, as_codes=False, converter=None, spacer='auto', header='//\n')[source]¶

Export data in the ms program format. The program is described in Hudson (2002 [Bioinformatics 18: 337-338]). The format is as follows:

One header line with two slashes (can be extended/replaced).
One line with the number of sites
One line with the positions, only if the number of sites is larger than zero.
The matrix of genotypes (one line per sample), only if the number of sites is larger than zero.
One empty line.

ms generates binary data (0/1). This function may export any integer values depending on the data contained in the input object.

Parameters:

data – Align instance.
outfile – where to write data. This argument can be (i) a file (or file-like) object open for writing, (ii) the name of a file (which will be created) as a string, or (iii) None (in which case the resulting string will be returned).
positions – list of site positions, with length matching the alignment length. Positions are required to be in the [0,1] range but the order is not checked. By default (if the argument value is None), sites are supposed to be evenly spread over the [0,1] interval.
alphabet – an Alphabet instance to use instead of the alphabet attached to the input alignment. When using this argument, alleles are transformed using the alphabet provided as argument using a direct mapping of the input alignment alphabet to the provided alphabet.
as_codes – use the internal alphabet codes for exporting. Useful if a non-int alphabet is used. Ignored if converter is used. Not allowed if alphabet is used.
converter – a user-provided function generating the appropriate allelic value based on an input alphabet code. The simple inline example below shows how to convert any missing data to 0 and shift all valid alleles of one unit: converter=lambda x: 0 if x<0 else x+1.
spacer – insert a space between all genotypes (to separate each locus/site). By default, all genotypes are concatenated as in the standard ms format. In this case, it is required that all allelic values (or codes if as_codes is specified) are >=0 and <10. By default ("auto"), the spacer is automatically inserted if the alphabet defines alleles or codes outside the [0, 9] range, or if converter is specified.
header – string to print instead of the two slashes line. Must contain the final new line.

Returns:

A ms-formatted string if outfile is None, otherwise None.

egglib.io.from_fasta(fname, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]¶

Import sequences from a fasta file. Create a new instance of either Align or Container from data read from a fasta-formatted file. To process data from a fasta-formatted string, use io.from_fasta_string().

Parameters:

source – name of a fasta-formatted sequence file.
alphabet – an Alphabet instance defining the type of data. Only character alphabets are allowed (such as alphabets.DNA, or alphabets.protein).
labels – import group labels. If so, they are not actually required to be present for each (or any) sequence. By default tags in sequence names considered to be part of the name and not as labels.
label_marker – this option allows to change the character indicating the start of the labels.
label_separator – this options allows to change character used to separate labels.
cls – type that should be generated. Possible values are: Align (then, data must be aligned), Container, or None. In the latter case, an Align is returned if data are found to be aligned or if the data set is empty, and otherwise a Container is returned.

Returns:

A new Container or Align instance depending on the value of the cls option.

egglib.io.from_fasta_string(string, alphabet, labels=False, label_marker='@', label_separator=',', cls=None)[source]¶: Import sequences from a fasta-formatted string. Identical to io.from_fasta() but directly takes an fasta-formatted string as first argument.

class egglib.io.fasta_iter(fname, alphabet, labels=False)[source]¶

Iterative sequence-by-sequence fasta parser.

Parameters:

fname – name of a fasta-formatted file.
alphabet – an Alphabet instance defining the type of data. Only character alphabets are allowed (such as alphabets.DNA and alphabets.protein).
labels – import group labels from sequence names (by default, they are considered as part of the name).

This function can be used in an iteration as shown below:

>>> for item in egglib.io.fasta_iter(fname):
...     ...

The with statement is also supported, which: ensures that the input file is properly closed whenever the with statement completes:

>>> with egglib.io.fasta_iter(fname) as f:
...     for item in f:
...         ...

Each iteration yields a SampleView instance (which is valid only during the iteration round, see the warning below).

Warning

The aim of this iterator is to iterate over large fasta files without actually storing all data in memory at the same time. The SampleView instance provided at each iteration is a proxy to a local Container instance that is recycled at each iteration step. The iteration variable should be used immediately and never stored as this. If one wants to sequence data, they should copy them immediately (typically using the add_sample() method of a separate Container instance).

egglib.io.from_genepop(fname)[source]¶

Import Genepop-formatted genotypic data. The format is described here.

Parameters:: fname – Genepop-formatted file name.
Returns:: A new Align instance.

The returned object contains data mapped to an ad hoc alphabet, with two samples per individuals. Group labels are used to indicate the structure (first level: populations, second level: individuals). In addition to normal Align instance, the returned object has two attributes: title and loci, which contain the information read from the Genepop file.

class egglib.io.GFF3(fname, strict=True)[source]¶

Import GFF3 genome annotation data. Read General Feature Format (GFF)-formatted (version 3) genome annotation data from a file specified by name or from a provided string.

See the description of the GFF3 format at http://www.sequenceontology.org/gff3.shtml.

All features are loaded into memory and can be processed interactively.

Parameters:

source – name of a GFF3-formatted file. To pass a string, use the factory method from_string().
strict – by default, apply strictly GFF3 requirements. See below.

The strict argument can be set to False to support CDS features lacking a phase.

Some observations regarding this implementation:

The validity of parent-child relationship with respect to the Sequence Ontology is not enforced, meaning that this parser allows any type of feature to be parent to any feature of any type.
The Sequence Ontology accession number of types of features related to genes are automatically mapped to the type name (such as “SO:0000704”, which is translated as “gene”).
Discontinuous features are required to have matching attributes (except start, stop, and phase for CDS features) but subsequent segments are allowed to skip attributes if the first segment defines them.
fasta-formatted sequences are imported but are currently not checked for consistency with annotations and defined sequence regions.

io.GFF3 instances are iterable. The expression:

>>> for feat in GFF3:
...     ...

is equivalent to:

>>> for feat in GFF3.iter_features():
...     ...

Attributes

`attribute_ontology`	Content of all attribute-ontology directives.
`feature_ontology`	Content of all feature-ontology directives.
`genome_build`	Genome build if specified by a directive.
`list_seqid`	List of seqid.
`num_seqid`	Number of seqid of features present in instance.
`num_top_features`	Number of features without parents.
`num_tot_features`	Total number of features.
`regions`	Dictionary with all sequence regions defined by directives.
`sequences`	If specified, sequences as a `Container`.
`source_ontology`	Content of all source-ontology directives.
`species`	Species (if specified by a directive).
`version`	GFF version.

Methods

`from_string`(string[, strict])	Import data from a GFF3-formatted string.
`iter_features`([seqid, start, end, all_features])	Iterate over features.

property attribute_ontology¶: Content of all attribute-ontology directives.

property feature_ontology¶: Content of all feature-ontology directives.

classmethod from_string(string, strict=True)[source]¶

Import data from a GFF3-formatted string. This is a class method, which can be called as shown below to create a new object:

>>> gff3 = egglib.io.GFF3.from_string(gff3_string)

Parameters:

string – GFF3-formatted string.
strict – apply strict GFF3 specifications (see here).

Returns:

A new io.GFF3 instance.

property genome_build¶: Genome build if specified by a directive.

iter_features(seqid=None, start=None, end=None, all_features=False)[source]¶

Iterate over features.

Parameters:

seqid – only iterates overs features of this contig (by default, consider all contigs). If the specified seqid is not present in the GFF3 file, iteration is empty (without error).
start – start iteration at this position. By default, start with the first position.
end – stop iteration at this position (does not include features whose end position is larger than this value). By default, process all features.
all_features – whether iteration should also include features that are part of another (by default, only include features that don’t have a parent). A given feature may be repeated if it has several unconnected parents.

Note

If start and/or end are specified, seqid is required.

property list_seqid¶: List of seqid.

property num_seqid¶: Number of seqid of features present in instance.

property num_top_features¶: Number of features without parents.

property num_tot_features¶: Total number of features.

property regions¶: Dictionary with all sequence regions defined by directives.

property sequences¶: If specified, sequences as a Container. By default, this value is None.

property source_ontology¶: Content of all source-ontology directives.

property species¶: Species (if specified by a directive).

property version¶: GFF version.

class egglib.io.Gff3Feature[source]¶

Provide access to data of a feature. This class cannot be instanciated by the user.

Attributes

`ID`	Feature identifier.
`all_parts`	All parts of this feature.
`attributes`	Dictionary of attributes attached to the feature.
`derivers`	List of features deriving from this one.
`descendants`	List of features descending from this one.
`end`	Feature end position.
`parents`	List of features parent to this one.
`score`	Feature score.
`segments`	List of segments of the feature.
`seqid`	Seqid on which this feature is located.
`source`	Feature source.
`start`	Feature start position.
`strand`	Strand on which the feature is located.
`type`	Feature type.

property ID¶: Feature identifier.

property all_parts¶: All parts of this feature.

property attributes¶

Dictionary of attributes attached to the feature. The list of attributes is:

ID
Name
Alias
Parent
Target
Gap
Derives_from
Note
Dbxref
Ontology_term
Is_circular
Other attributes as defined in the GFF3 file.

property derivers¶: List of features deriving from this one.

property descendants¶: List of features descending from this one.

property end¶: Feature end position.

property parents¶: List of features parent to this one.

property score¶: Feature score.

property segments¶: List of segments of the feature.

property seqid¶: Seqid on which this feature is located.

property source¶: Feature source.

property start¶: Feature start position.

property strand¶: Strand on which the feature is located.

property type¶: Feature type.

class egglib.io.VCF(fname, index=None, subset=None, dumpfile=None)[source]¶

VCF/BCF parser using htslib.

Parameters:

fname – input VCF/BCF file name. Gzip-compressed files are supported.
index – index file name. By default, try to import index with canonical file name. If index is specified, it must be present. Otherwise missing index file is not an error. Index is only imported for BCF files.
subset – sequence of sample names to import. The order of samples in this sequence is not considered. Duplicated names in this sequence are ignored. Other samples are ignored. This is useful to speed up parsing.
dumpfile – name of an output VCF/BCF file where lines can be written as needed using dump_record(). The input VCF header is used as is. The dump file can be closed at any time using dump_close() and is closed by by default when the current instance is destroyed.

Attributes

`has_index`	Boolean indicating whether an index is available.
`num_samples`	Number of samples.

Methods

`as_site`()	Extract current variant.
`dump_close`	Close the dumpfile VCF.
`dump_record`	Write one variant of the VCF file.
`get_allele_type`	Summary of type of last variant.
`get_alleles`	`list` of alleles.
`get_alternate`	`list` of alternate alleles.
`get_chrom`	Chromosome or contig name.
`get_errors`	Errors while reading last variant.
`get_filter`	`list` of filters.
`get_format`(tag, index)	Get a given FORMAT field.
`get_formats`	FORMAT fields for all samples.
`get_genotypes`	Get genotypes.
`get_id`	Get list of identifiers for the current variant.
`get_info`(tag)	Get a given INFO field.
`get_infos`	`dict` of INFO fields.
`get_phased`	Get booleans indicating if genotypes are phased.
`get_pos`	Chromosome position.
`get_quality`	Quality value.
`get_reference`	Reference allele.
`get_sample`(index)	Get the name of the sample at index index.
`get_samples`	`list` of all samples.
`get_types`	Get the type(s) of the last variant.
`goto`(target[, position, [limit]])	Move to a given location in the file.
`is_single`	Check if last variant is invariant or SNP.
`is_snp`	Check if last variant is a SNP.
`iter_sites`([chrom[, start][, stop]][, ...)	Return an iterator over sites.
`read`	Read one variant of the VCF file.

site_iterator

as_site()[source]¶: Extract current variant. Return the current genotype values as a Site object. The alphabet is DNA if the site is a SNP, a string-type alphabet with ad hoc available alleles and ? and - as missing alleles for indels, and a custom-type alphabet with ad hoc available alleles and ? a missing data otherwise.

Added in version 3.4.

Changed in version 3.5.1: The * is not included anymore a available allele if there is an indel while - is always added as potential missing allele for string-type alphabets.

dump_close()¶: Close the dumpfile VCF. Return None if operation is successful, ValueError in case of critical error.

dump_record()¶: Write one variant of the VCF file. Return None if write is successful, ValueError in case of critical error.

get_allele_type()¶: Summary of type of last variant. Return 0 if all alleles have length 1 and are valid DNA codes (SNPs or non-variant bases), 1 if there is at least one allele with length >1 but all alleles contain valid DNA codes (indels), 2 there is at least one non-DNA codes in alleles, or None if no data are available.

Added in version 3.4..

get_alleles()¶: list of alleles. If present, the * allele (position overlapping previous deletion) is omitted. Return None by default (no available data).

Changed in version 3.5.1: Omit the * allele.

get_alternate()¶: list of alternate alleles. If present, the * allele (position overlapping previous deletion) is omitted. Return None by default (no available data).

Changed in version 3.5.1: Omit the * allele.

get_chrom()¶: Chromosome or contig name. Return None by default (no available data).

get_errors()¶: Errors while reading last variant. Get the non-fatal errors generated while importing last variant, as a list, or None if nothing has been read.

get_filter()¶: list of filters. Return None by default (no available data).

get_format(tag, index)¶: Get a given FORMAT field. Return None by default (no available data for key not available).

get_formats()¶: FORMAT fields for all samples. Return a :list: of :dict: instances. Return None by default (no available data).

get_genotypes()¶: Get genotypes. Return a list giving, for each sample, the list of alleles composing its genotype. To generate a Site object, use as_site(). Return None by default (no data available).

get_id()¶: Get list of identifiers for the current variant. Empty list if none provided. None if nothing has been read. The uniqueness of ID’s is not tested.

get_info(tag)¶: Get a given INFO field. Return None by default (no available data for key not available).

get_infos()¶: dict of INFO fields. Return None by default (no available data).

get_phased()¶: Get booleans indicating if genotypes are phased. The return value is the tuple: (all_phased, phased_table), with a boolean for all samples and all alleles beyond the first. None whatever bad happens.

get_pos()¶: Chromosome position. Return None by default (no available data).

get_quality()¶: Quality value. Return None by default (no available data or missing value).

get_reference()¶: Reference allele. Return None by default (no available data).

get_sample(index)¶: Get the name of the sample at index index.

get_samples()¶: list of all samples.

get_types()¶: Get the type(s) of the last variant. Return a list. Return None by default (no available data).

goto(target[, position[, limit]])¶

Move to a given location in the file. Data at the new location are available immediately with no need to call read() (this method should be understood as a call to read() at an arbitrary location). If position is not specified, move to the first available position of contig target. If target does not exist in file, or in case of unexpected parsing error, a ValueError is thrown. By default, limit is equal to position + 1 (meaning that only the exact position can be retrieved). If this condition is not met (in particular if position is past the end of the contig target), return False).

Returns:: True if a variant is available in the requested range, False otherwise. In case the method returns False, the user is not expected to call read() directly (the behaviour is currently undefined).

Note

Only available for indexed BCF.

Changed in version 3.4: The method returns False and doesn’t raise an exception if the target position is out of range.

has_index¶: Boolean indicating whether an index is available.

is_single()¶: Check if last variant is invariant or SNP. True if the last variant has no other flag than SNP and deletion overlap.

Added in version 3.4.

is_snp()¶: Check if last variant is a SNP. True if the last variant is of type SNP, and SNP only.

Changed in version 3.4: SNPs overlapping a deletion are now considered to be SNPs

iter_sites([chrom[, start][, stop]][, max_missing][, mode])[source]¶

Return an iterator over sites. Each variant found in the VCF file is returned as a Site instance.

By default, process sites from the current position up to the end of the file. To control the region used for iteration, use the chrom, start, and stop argument.

Parameters:

chrom – name of the chromosome to process. If None, process all chromosomes. By default (if start and stop are not specified) process the whole chromosome, going back to the first position if needed. Only available if the VCF is indexed.
start – start position. Only allowed if chrom is specified. The first site returned will be at the smallest available position starting from start. By default, start at the beginning of the chromosome.
stop – stop position. Only allowed if chrom is specified. The position of the last site returned will be a most the one before the stop position. By default, stop at the last site of the chromosome.
max_missing – maximum number of missing data to consider a variant. By default, all sites with at least one missing data are ignored.
mode – 0: include only SNP variants (variants with at least two alleles, all corresponding to a single nucleotide, although those alleles are not required to be called in genotypes); 1: include SNP variants and invariant positions; 2: include all variants from the VCF. The default is 0.

Added in version 3.4.

Changed in version 3.5.1: See change described in as_site() relative to the management of gaps due to overlaping deletions.

num_samples¶: Number of samples.

read()¶: Read one variant of the VCF file. Return True if read is successful, False if end of file. ValueError in case of critical error.

egglib.io.index_vcf(fname[, outname])¶: Index a BCF file. The file is required to be in format BCF. If outname is not specified, use the standard naming scheme for CSI index files.

class egglib.io.VcfSlider(vcf, size, step, as_variants=False, chrom=None, start=None, stop=None, max_missing=0, mode=0)[source]¶

Run a sliding window on a VCF file.

If a chromosome and a start position are specified, the sliding windows starts at this position; if only a chromosome is specified, the sliding windows starts at the beginning of this chromosome. If a chromosome is specified, the sliding window stops at the specified stop position or at the end of the chrososome. If no chromosome is specified, the sliding windows starts at the current position and stops at the end of the file.

The sliding windows stops as soon as the last variant is processed. As a result, the last window is never empty unless there is a single window overall and, if as_variants is set, the successive windows should not overlap. In line with this behaviour, empty windows are never returned with as_variants although the last window can contain less than size sites.

Parameters:

vcf – open VCF instance.
size – window size (see note below).
step – window step (see note below).
as_variants – window size and step are expressed as number of variants instead of genomic coordinates.
chrom – chromosome where to perform sliding window (by default, start at current position). This option requires that VCF to be indexed.
start – start position (only if chrom is specified; by default, chromosome start if chrom is specified or, otherwise, current positon).
stop – stop position (only if chrom is specified; by default, chromosome end). Not included in windows.
max_missing – maximum number of missing data to consider a variants.
mode – 0: include only SNP variants (variants with at least two alleles, all corresponding to a single nucleotide, although those alleles are not required to be called in genotypes); 1: include SNP variants and invariant positions; 2: include all variants from the VCF.

Note

If as_variants is True, both size and step are expressed in number of variants. Otherwise, they are expressed in bp with respect to the reference genome.

VCF sliding window objects are iterable and indexable with respect to the sites stored within the current window. Empty and uninitialized windows have length 0. Sites correspond to variants in the VCF file and are represented by Site instances.

Operation	Action
`len(sld)`	number of sites in the current window
site = `sld[i]`	get `i`th site of current window
`for site in sld:`	iterate over sites of the current window

The sliding window is operated by the move() method which returns a boolean to indicate the end.

Added in version 3.4.

Attributes

`bounds`	Bounds of the current window.
`chromosome`	Current chromosome.
`span`	Length of the current window in terms of genomic length.

Methods

move()

Move to the next window.

property bounds¶: Bounds of the current window. Bounds are defined as (start, stop) in such a way that the sequence corresponding to the window can be extracted using [start, stop] as slice operator (meaning that stop is not included in the window in any case. If as_variants is True, bounds are position of the first site and the position immediately after the last site or None if the window is empty. If as_variants is False, bounds are the start and stop positions. Before start of the iteration, set as None.

property chromosome¶: Current chromosome. Before iteration start, or after iteration end, the value is undefined.

move()[source]¶

Move to the next window. Sites that are shared by consecutive windows are represented by the same Site instances although they will generally shifted in index.

Returns:: True if the operation succeeded and False otherwise. In the former case, the window may or may not contain sites depending on the presence of sites passing criteri within the window range. In the latter case, the window should be considered to be uninitialized, and further calls to move() will keep on returning False without error.

property span¶: Length of the current window in terms of genomic length. Actually, distance between the first and last site of the window. If the window doesn’t contain sites, equal to None.

class egglib.io.CodonVCF(vcf_name, gff_name, fill_with_ref=False, ref=None)[source]¶

Analysis of coding sites from VCF data.

This class takes a VCF, which is required to be indexed, and the GFF annotation (in GFF3 format) of the reference sequence. It allows to iterate over coding sites of a given CDS feature (using iter_codons()) or extract the coding site which includes a random position (using from_position()).

Note

Codons with at least one missing nucleotide are ignored altogether.

Parameters:

vcf_name – name of an indexed VCF file.
gff_name – name of a GFF3-fomatted annotation file corresponding to the reference of the VCF file.
use_ref – assume positions not referred to in the VCF file are identical to the reference.
ref – only if use_ref is True: reference genome as a Container instance. By default, use the reference genome included in the GFF3. If no reference genome is provided and use_ref if True, a ValueError is raised.

Added in version 3.5.

Methods

`from_position`(pos)	Return the `CodingSite` corresponding to the codon to which the passed position belongs to.
`iter_codons`()	Iterate over coding sites.
`set_cds`(ID[, subset])	Specify the CDS feature to consider.

from_position(pos)[source]¶: Return the CodingSite corresponding to the codon to which the passed position belongs to.

iter_codons()[source]¶: Iterate over coding sites. Return an iterator over CodingSite objects representing all codon positions defined by the current CDS annotation. Codons positions which are incomplete in the feature due to missing 5’ or 3’ end are not included in the iteration.

set_cds(ID, subset=None)[source]¶

Specify the CDS feature to consider.

Parameters:

ID – name of a CDS feature of the annotation.
subset – optional list of indexes of samples to include (by default, all samples are included). If the sample indexes are reshuffled or repeated, the exported sites will be affected accordingly. Samples must be understood as per the VCF file (e.g. individuals if the VCF contains diploid individuals).

Note

The provided annotation ID must correspond to a CDS feature (and not a gene or a mRNA).

class egglib.io._vcfcodon.CodingSite(flag, chrom, idx, pos)[source]¶

Represent a coding site extracted from VCF data. Instances of this class are normally generated by CodonVCF .

Added in version 3.5.

Variables:

NCOD – requested position is not in coding region. Only if site is obtained from from_position(). If this flag is on, all the others are off.
MISM – there is a mismatch in sample size (including mismatch in ploidy) between sites of the codon. If this flag is on, all the others are off.
UNAVAIL – at least one of the three codon position is unavailable and there is not available reference genome. If this flag is on, all the others are off.
STOP – a stop codon has been found in the site. The stop codon is not counted as missing data but is not considered for synonymous / nonsynonymous analysis either. However it is considered to determine if there is polymorphism, multiple hits and multiple mutations.
MHIT – more than one of the three codon positions is variable.
MMUT – there are more than two different codons.
SYN – there is at least one synonymous change (excluding stop codons).
NSYN – there is at least one non-synonymous change (excluding stop codons).
VAR – there is at least one change (including stop codons).

Attributes

`alphabet`	Alphabet attached to the instance.
`chrom`	Chromosome of the site.
`codon_index`	Codon index.
`flag`	Flag value.
`ns`	Number of samples
`num_missing`	Number of missing data.
`position`	Position of the site.

Methods

`append`(val)	Add an allele at the end of the site.
`as_list`()	Generate a list containing data from the instance.
`extend`(vals)	Add several alleles at the end of the site.
`from_align`(align, index)	Import data from the provided `Align`.
`from_list`(array, alphabet)	Import alleles from the provided list.
`from_vcf`(vcf[, start, stop])	Import data from the provided `io.VcfParser`.
`is_valid`()	Check if site is valid.
`reset`()	Clear all data from the instance (including the alphabet).
`set_bit`(flag)	Turn a flag on.

property codon_index¶: Codon index. Index of the codon in the CDS specification.

Note

Only complete codons are included in the numeration.

property flag¶: Flag value.

is_valid()[source]¶: Check if site is valid. True if the site is in coding region, has no mismatch, has all available position and doesn’t contain a stop codon.

set_bit(flag)[source]¶: Turn a flag on. For example, to specify the the site has a stop codon, use: cdn.set_bit(cdn.STOP)

egglib.io.from_clustal(string, alphabet)[source]¶

Import a clustal-formatted alignment. The input format is the one generated and used by ClustalW.

Parameters:

string – clustal-formatted sequence alignment.
alphabet – an Alphabet instance.

Returns:

A new Align instance.

egglib.io.from_staden(string, keep_consensus=False)[source]¶

Import data from the Staden assembly package. Process the output file of the GAP4 program of the Staden package.

Parameters:

string – input string.
keep_consensus – don’t delete consensus sequence.

Returns:

An Align instance.

The input string should have been generated from a contig alignment by the GAP4 contig editor, using the command “dump contig to file”. The sequence named CONSENSUS, if present, is automatically removed unless the option keep_consensus is used.

Staden’s default convention is followed:

- codes for an unknown base and is replaced by N.
* codes for an alignment gap and is replaced by -.
. represents the same sequence than the consensus at that position.
White space represents missing data and is replaced by ?.

egglib.io.from_genalys(string)[source]¶

Import Genalys data. Genalys was a proprietary program to process sequencing reads, perform assembly and detect polymorphism. Convert Genalys-formatted sequence alignment files to fasta. This function imports files generated through the option “Save SNPs” of Genalys 2.8.

Parameters:: string – input data as a Genalys-formatted string.
Returns:: An Align instance.

egglib.io.get_fgenesh(string, locus='locus')[source]¶

Import fgenesh data.

Parameters:: fname – a string containing fgenesh ouput.
Parma locus:: locus name.
Returns:: A list of gene and CDS features represented by dictionaries. Note that 5’ partial features might not be in the appropriate frame and that it can be necessary to add a codon_start qualifier.

class egglib.io.GenBank(fname=None, string=None)[source]¶

Process a GenBank-formatted DNA sequence record.

Parameters:

fname – input file name.
string – GenBank-formatted string.

Only one of the two arguments fname and string can be non-None. If both are None, the constructor generates an empty instance with a sequence of length 0. If fname is non-None, a GenBank record is read from the file with this name. If string is non-None, a GenBank record is read from this string. The following variables are read from the parsed input if present: accession, definition, title, version, GI, keywords, source, references (which is a list), locus and others. Their default value is None except for references and others for which default is an empty list. source is a (description, species, taxonomy) tuple. Each of references is a (header, raw reference) tuple and each of others is a (key, raw) tuple.

In addition to methods documented below, the following operations are supported for gb if it is a GenBank instance:

Operation	Result
`len(gb)`	length of the sequence attached to this record
`for feat in gb:`	Iterate over `GenBankFeature` instances of this record
`str(gb)`	GenBank representation of the record

Attributes

sequence

Sequence string.

Methods

`add_feature`(feature)	Add a feature to the instance.
`extract`(from_pos, to_pos)	Extract a subset of the instance.
`format`()	String representation of the instance.
`number_of_features`()	Number of features contained in the instance.
`rc`()	Reverse-complement the instance (in place).
`write`(fname)	Write the formatted record to a file.
`write_stream`(stream)	From the formatted record to a stream.

add_feature(feature)[source]¶

Add a feature to the instance.

Parameters:: feature – an io.GenBankFeature instance.

extract(from_pos, to_pos)[source]¶

Extract a subset of the instance.

Parameters:

from_pos – start position.
to_pos – stop position (not included).

Returns:

A new io.GenBank instance representing a subset of the current instance. All features that are completely included in the specified range are exported.

format()[source]¶: String representation of the instance.

number_of_features()[source]¶: Number of features contained in the instance.

rc()[source]¶: Reverse-complement the instance (in place). All features positions and the sequence will be reverted and applied to the complementary strand. The features will be sorted in increasing start position (after reverting). This method should be applied only on genuine nucleotide sequences.

property sequence¶: Sequence string. Note that changing the record’s string might invalidate the features (meaning that the setting an invalid sequence might cause the features to point to incorrect or out-of-bounds regions of the sequence).

write(fname)[source]¶: Write the formatted record to a file.

write_stream(stream)[source]¶

From the formatted record to a stream.

Parameters:: stream – a file-compatible object.

class egglib.io.GenBankFeature(parent)[source]¶

Feature of a GenBank record. io.GenBankFeature instances must be only used along an io.GenBank instance. The constructor creates an empty instance (although a parent io.GenBank instance is required) and either update() or parse() must be used subsequently.

Parameters:: parent – an io.GenBank instance to which the feature should be attached.

Note that str(feature) is equivalent to feature.format().

Methods

`add_qualifier`(key, value)	Add a qualifier.
`copy`(genbank)	Return a copy of the current instance.
`format`()	GenBank-formatted string representing the feature.
`get_sequence`()	Return the string corresponding to this feature.
`get_start`()	First position of the first segment.
`get_stop`()	Stop position of the last segment.
`get_type`()	Type of the instance.
`parse`(string)	Update feature information from a string.
`qualifiers`()	Dictionary with all qualifier values.
`rc`([length])	Reverse-complement the feature.
`shift`(offset)	Shift all positions.
`update`(feat_type, location, **qualifiers)	Update feature information.

add_qualifier(key, value)[source]¶: Add a qualifier.

copy(genbank)[source]¶

Return a copy of the current instance.

Parameters:: genbank – GenBank instance to which the returned instance should be attached.

format()[source]¶: GenBank-formatted string representing the feature.

get_sequence()[source]¶: Return the string corresponding to this feature. If the positions pass beyond the end of the parent’s sequence, a RuntimeError (and not an IndexError) is raised.

get_start()[source]¶: First position of the first segment.

get_stop()[source]¶: Stop position of the last segment.

get_type()[source]¶: Type of the instance.

parse(string)[source]¶

Update feature information from a string.

Parameters:: string – a GenBank-formatted string.

qualifiers()[source]¶: Dictionary with all qualifier values. This method cannot be used to change data within the instance.

rc(length=None)[source]¶

Reverse-complement the feature.

Parameters:: length – length of the complete sequence (by default, take the information directly from the parent).

shift(offset)[source]¶: Shift all positions. The argument can be positive or negative.

update(feat_type, location, **qualifiers)[source]¶

Update feature information.

Parameters:

feat_type – a string identifying the feature type (such as "gene", "CDS", "misc_feature", etc.). All strings are acceppted.
location – an io.GenBankFeatureLocation instance giving the feature’s location.
qualifiers – other qualifiers must be passed as keyword arguments. It is not allowed to use "type" as a qualifier keyword.

class egglib.io.GenBankFeatureLocation(string=None)[source]¶

Hold the location of a GenBank feature. Supports various forms of location specifications as allowed in the GenBank format. The constructor allows to parse a GenBank-formatted string. By default, features are on the forward strand and segmented features are ranges (not orders).

In addition to methods documented below, the following operations are supported for loc if it is a GenBankFeatureLocation instance:

Operation	Result
`len(loc)`	Number of segments
`loc[index]`	Return the `(first, last)` `tuple`
`for (first, last) in loc:`	Iterator over segments
`params.format()`	Generate a GenBank representation

Methods

`add_base_choice`(first, last[, left_partial, ...])	Segment corresponding to a single base in a given range.
`add_base_range`(first, last[, left_partial, ...])	Add a base range to the feature.
`add_between_base`(position)	Add a segment lying between two consecutive bases.
`add_single_base`(position)	Add a single-base segment to the feature.
`as_order`()	Define the feature as an order instead of a range.
`as_range`()	Define the feature as a range.
`copy`()	Deep copy of the current instance.
`format`()	String representation of the instance.
`is_complement`()	`True` if the feature is on the complement strand.
`is_range`()	`True` if the feature is a range.
`rc`(length)	Reverse the feature positions.
`set_complement`()	Place the feature on the complement strand.
`set_forward`()	Place the feature on the forward strand.
`shift`(offset)	Shift all positions.

add_base_choice(first, last, left_partial=False, right_partial=False)[source]¶: Segment corresponding to a single base in a given range. Arguments are identical to add_base_range().

add_base_range(first, last, left_partial=False, right_partial=False)[source]¶

Add a base range to the feature.

Parameters:

first – first position of the range.
last – last position of the range (included).
first_partial – specify that the real start of the segment is somewhere 5’ of first.
first_partial – specify that the real end of the segment is somewhere 3’ of last.

If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1482, one must use add_base_range(1127,1482) in combination with set_complement().

add_between_base(position)[source]¶: Add a segment lying between two consecutive bases. The feature will be set between position and position + 1. If the feature is intended to be placed on the complement strand between positions, say, 1127 and 1128, one must use add_between_base(1127) in combination with set_complement().

add_single_base(position)[source]¶: Add a single-base segment to the feature.

as_order()[source]¶: Define the feature as an order instead of a range.

as_range()[source]¶: Define the feature as a range. This is the default.

copy()[source]¶: Deep copy of the current instance.

format()[source]¶: String representation of the instance.

is_complement()[source]¶: True if the feature is on the complement strand.

is_range()[source]¶: True if the feature is a range. False if it is an order.

rc(length)[source]¶

Reverse the feature positions. Positions are modified to be counted from the end.

Parameters:: length – length of the complete sequence (required).

set_complement()[source]¶: Place the feature on the complement strand.

set_forward()[source]¶: Place the feature on the forward strand. This is the default.

shift(offset)[source]¶: Shift all positions. The argument can be positive or negative.

Alternative VCF parser¶

class egglib.io.VcfParser(fname, threshold_PL=None, threshold_GL=None)[source]¶

Import Variant Call Format data from a file. The VCF format is designed to store data describing genomic variation in an extremely flexible way. See the description of the VCF format. To parse VCF data stored in string, use io.VcfStringParser.

Parameters:

fname – name of a properly formatted VCF file. The header section will be processed upon instance creation, and lines will be read later, when the user iterates over the instance (or call readline()).
threshold_PL – call genotypes parameter. This parameter controls how genotype calling (GT field) is performed from the PL (phred-scaled genotype likelihood) field. By default (None), this conversion is never performed. If threshold_PL is specified, genotype calling is only performed if GT is not available and PL is available. The genotype with the lowest PL value is selected. The parameter gives the minimum acceptable gap between the best genotype and the second one. If the second genotype has a too good score, based on this parameter, the genotype is called unknown. The parameter must be at least 1.

io.VcfParser instances are iterable. Everly loop yields a (chromosome, position, num_all) tuple that allows the user to determines if the variant is of interest. Note that the position is considered as an index and therefore has been decremented compared with the value found in the file. If the variant is of interest, io.VcfParser instances provide methods to extract all data for this variant. Iterating over VCF lines can be performed manually using the method readline().

Attributes

`currline`	Index of the current line of the VCF file.
`file_format`	File format specified in the header.
`good`	Tell if the file is good for reading.
`has_index`	`True` if an index file has been loaded.
`num_alt`	Number of ALT definitions in the header.
`num_filter`	Number of FILTER definitions in the header.
`num_format`	Number of FORMAT definitions in the header.
`num_index`	Number of indexed lines if the file index.
`num_info`	Number of INFO definitions in the header.
`num_meta`	Number of META definitions in the header.
`num_samples`	Number of samples defined in the header.

Methods

`bed_slider`(bed[, max_missing, allow_indel, ...])	Perform a sliding window based on BED coordinates.
`close`()	Close file (if it is open)
`get_alt`(idx)	Get an ALT definition from the header.
`get_filter`(idx)	Get a FILTER definition from the header.
`get_format`(idx)	Get a FORMAT definition from the header.
`get_genotypes`([start, stop, dest])	Extract genotype data for the current site.
`get_info`(idx)	Get an INFO definition from the header.
`get_meta`(idx)	Get data for a given META field defined in the VCF header.
`get_sample`(idx)	Get the name of a sample read from the header.
`get_variant`()	Get all data for the current variant.
`goto`(self, contig[, position])	Move to an arbitrary position of the VCF file.
`load_index`([fname])	Load an index file allowing fast navigation.
`readline`()	Read one variant.
`rewind`()	Reset the parser.
`slider`(size, step[, as_variants, start, ...])	Start a sliding window from the current position.
`unread`()	Go back to the previous variant.

bed_slider(bed, max_missing=0, allow_indel=False, allow_custom=False)[source]¶

Perform a sliding window based on BED coordinates.

Parameters:

bed – an io.BED instance.
max_missing –
maximum number of missing alleles.

Warning

Here, max_missing must be an absolute number of samples.
allow_indel – include variants with alleles of varying size.
allow_custom – include variants with custom alleles.

Returns:

An io.VcfWindow instance.

Changed in version 3.2.0: max_missing is required to be an integer.

close()[source]¶: Close file (if it is open)

property currline¶: Index of the current line of the VCF file.

property file_format¶: File format specified in the header.

get_alt(idx)¶

Get an ALT definition from the header. The index must be smaller than num_alt. Return a dictionary containing the following data:

"id": identifier string.
"description": description string.
"extra"s: all extra qualifiers, presented as list of (key, value) tuple instances.

get_filter(idx)¶

Get a FILTER definition from the header. The index must be smaller than num_filter. Return a dictionary containing the following data:

"id": identifier string.
"description": description string.
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_format(idx)¶

Get a FORMAT definition from the header. The index must be smaller than num_format. Return a dictionary containing the following data:

"id": identifier string.
"type": one of "Integer", "Float", "Character", and "String".
"description": description string.
"number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_genotypes(start=0, stop=None, dest=None)¶

Extract genotype data for the current site.

It is required that a variant has been effectively processed.

Parameters:

start – index of the first sample to process (at least 0 and smaller than the number of samples).
stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).
dest – a Site instance that will be recycled and used to place results.

Returns:

A Site instance by default, or None if dest was specified.

get_info(idx)¶

Get an INFO definition from the header. The index must be smaller than num_info. Return a dictionary containing the following data:

"id": identifier string.
"type": one of "Integer", "Float", "Flag", "Character", and "String".
"description": description string.
"number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_meta(idx)¶: Get data for a given META field defined in the VCF header. The index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.

get_sample(idx)¶: Get the name of a sample read from the header. The index must be smaller than num_samples.

get_variant()¶: Get all data for the current variant. Return an io.VcfVariant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.

property good¶: Tell if the file is good for reading. False is the end of the file has been reached.

goto(self, contig, position=egglib.io.FIRST)[source]¶

Move to an arbitrary position of the VCF file. Requires an index (see load_index()). A ValueError is raised if the position cannot be found.

Parameters:

contig – contig name.
position – contig position. Use io.FIRST for the first available variant of the contig, io.LAST for the last, and -1 for the position immediately before the first position (position 0 in VCF files).

property has_index¶: True if an index file has been loaded.

load_index(fname=None)[source]¶

Load an index file allowing fast navigation. Index files allow fast navigation in VCF file regardless of their size, allowing to move instantly to a determined variant using its position or its index in file. Index files can be created by io.make_vcf_index().

Parameters:: fname – name of the index file. By default, use the default index file name, which is the name of the VCF file with the “.vcfi” extension (removing the original extension only if it is “.vcf”).

property num_alt¶: Number of ALT definitions in the header.

property num_filter¶: Number of FILTER definitions in the header.

property num_format¶: Number of FORMAT definitions in the header.

property num_index¶: Number of indexed lines if the file index.

property num_info¶: Number of INFO definitions in the header.

property num_meta¶: Number of META definitions in the header.

property num_samples¶: Number of samples defined in the header.

readline()[source]¶: Read one variant. Return a tuple with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF file). Raise a ValueError if file is finished.

rewind()[source]¶: Reset the parser. Move back to the first variant. No index is required.

slider(size, step, as_variants=False, start=0, stop=None, max_missing=0, allow_indel=False, allow_custom=False)[source]¶

Start a sliding window from the current position.

Parameters:

size – size of the sliding window (by default, in base pairs).
step – increment of the sliding window (by default, in base pairs).
as_variants – express size and step in number of variants instead of base pairs.
start – start position of the sliding window.
stop – stop position of the sliding window.
max_missing –
maximum number of missing alleles (variants exceeding this threshold are ignored).

Warning

Here, max_missing must be an absolute number of samples.
allow_indel – include variants with alleles of variable size
allow_custom – include variants with custom alleles

Returns:

An io.VcfSlidingWindow instance.

Changed in version 3.2.0: max_missing is required to be an integer.

unread()[source]¶: Go back to the previous variant. No index is required, but only allowed after reading one line (not allowed immediately after creating instance or calling rewind()).

egglib.io.make_vcf_index(fname, outname=None)[source]¶

Create the index for a VCF file.

Parameters:

fname – name of a VCF file.
outname – name of the index file. By default, use a default name. The default name is based on the VCF file name, stripping the extension only if it is vcf and appending the vcfi extension. Index files bearing this default name are automatically loaded if the corresponding VCF file is opened.

class egglib.io.VcfStringParser(header, threshold_PL=None, threshold_GL=None)[source]¶

Import Variant Call Format data from a string. Alias of io.VcfParser processing strings instead of a file. The constructor takes a string containing a VCF header (the first line being the file format specification and the last line being the header line ,starting with #CHROM).

Attributes

`file_format`	File format specified in the header.
`num_alt`	Number of ALT definitions in the header.
`num_filter`	Number of FILTER definitions in the header.
`num_format`	Number of FORMAT definitions in the header.
`num_info`	Number of INFO definitions in the header.
`num_meta`	Number of META definitions in the header.
`num_samples`	Number of samples defined in the header.

Methods

`get_alt`(idx)	Get an ALT definition from the header.
`get_filter`(idx)	Get a FILTER definition from the header.
`get_format`(idx)	Get a FORMAT definition from the header.
`get_genotypes`([start, stop, dest])	Extract genotype data for the current site.
`get_info`(idx)	Get an INFO definition from the header.
`get_meta`(idx)	Get data for a given META field defined in the VCF header.
`get_sample`(idx)	Get the name of a sample read from the header.
`get_variant`()	Get all data for the current variant.
`readline`(string)	Read one variant from a user-provided single line.

property file_format¶: File format specified in the header.

get_alt(idx)¶

Get an ALT definition from the header. The index must be smaller than num_alt. Return a dictionary containing the following data:

"id": identifier string.
"description": description string.
"extra"s: all extra qualifiers, presented as list of (key, value) tuple instances.

get_filter(idx)¶

Get a FILTER definition from the header. The index must be smaller than num_filter. Return a dictionary containing the following data:

"id": identifier string.
"description": description string.
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_format(idx)¶

Get a FORMAT definition from the header. The index must be smaller than num_format. Return a dictionary containing the following data:

"id": identifier string.
"type": one of "Integer", "Float", "Character", and "String".
"description": description string.
"number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_genotypes(start=0, stop=None, dest=None)¶

Extract genotype data for the current site.

It is required that a variant has been effectively processed.

Parameters:

start – index of the first sample to process (at least 0 and smaller than the number of samples).
stop – sample index at which processing must be stopped (this sample is not processed; at least equal to start and smaller than the number of samples).
dest – a Site instance that will be recycled and used to place results.

Returns:

A Site instance by default, or None if dest was specified.

get_info(idx)¶

Get an INFO definition from the header. The index must be smaller than num_info. Return a dictionary containing the following data:

"id": identifier string.
"type": one of "Integer", "Float", "Flag", "Character", and "String".
"description": description string.
"number": expected number of items. Special values are None (if undefined), "NUM_GENOTYPES" (number matching the number of genotypes for any particular variant), "NUM_ALTERNATE" (number matching the number of alternate alleles for any particular variant), and "NUM_ALLELES" (number matching the number of alleles–including the reference–for any particular variant).
"extra": all extra qualifiers, presented as a list of (key, value) tuple instances.

get_meta(idx)¶: Get data for a given META field defined in the VCF header. The index must be smaller than num_meta. Return a tuple containing the key and the value of the META field.

get_sample(idx)¶: Get the name of a sample read from the header. The index must be smaller than num_samples.

get_variant()¶: Get all data for the current variant. Return an io.VcfVariant instance containing all data available for the last variant processed by this instance. It is required that a variant has been effectively processed.

property num_alt¶: Number of ALT definitions in the header.

property num_filter¶: Number of FILTER definitions in the header.

property num_format¶: Number of FORMAT definitions in the header.

property num_info¶: Number of INFO definitions in the header.

property num_meta¶: Number of META definitions in the header.

property num_samples¶: Number of samples defined in the header.

readline(string)[source]¶

Read one variant from a user-provided single line. The string should contain a single line of VCF-formatted data (no header). All field specifications and sample information should be consistent with the information contained in the header that has been provided when creating this instance.

Returns:: A tuple with the chromosome name, the position, and the number of alleles at the variant (as indicated in the VCF line).

class egglib.io.VcfVariant[source]¶

Represent a single VCF variant. The user cannot create instances of this class himself (instances are generated by io.VcfParser or io.VcfStringParser).

Note

The AA (ancestral allele), AN (allele number), AC (allele count), and AF (allele frequency) INFO fields as well as the GT (deduced genotype) FORMAT are automatically extracted if they are present in the the file and if their definition matches the format specification (meaning that they were not re-defined with different number/type) in the header. If present, they are available through the dedicated attributes AN, AA, AC, AF, GT, ploidy and GT_phased. However, they are also available in the respective info and samples (sub)-dictionaries.

Attributes

`AA`	Value of the AA info field.
`AC`	Value of the AC info field.
`AF`	Value of the AF info field.
`AN`	Value of the AN info field.
`GL`	GL values for all samples
`GT`	Genotypes from GT fields.
`GT_phased`	Tell if the genotype for each sample is phased.
`GT_vcf`	GT field as written in the VCF file
`ID`	Tuple containing all IDs.
`PL`	PL values for all samples
`alleles`	Tuple of variant alleles.
`alternate_types`	Alternate alleles types, as a tuple.
`chromosome`	Chromosome name.
`failed_tests`	Filters at which this variant failed.
`format_fields`	Frozenset of FORMAT fields ID's.
`info`	Dictionary of INFO fields for this variant.
`num_alleles`	Number of alleles.
`num_alternate`	Number of alternate alleles.
`num_genotypes`	Number of genotypes.
`num_samples`	Number of samples.
`ploidy`	Ploidy among genotypes.
`position`	Position.
`quality`	Variant quality.
`samples`	Information for each sample.

alt_type_default¶: Explicit alternate allele (the string represents the nucleotide sequence of the allele).

alt_type_referred¶: Alternate allele referring to a pre-defined allele (the string provides the ID of the allele).

alt_type_breakend¶: Alternate allele symbolizing a breakend (see VCF description for more details).

property AA¶: Value of the AA info field. None if missing.

property AC¶: Value of the AC info field. Provided as a tuple. None if missing.

property AF¶: Value of the AF info field. Provided as a tuple. None if missing.

property AN¶: Value of the AN info field. None if missing.

property GL¶: GL values for all samples

property GT¶: Genotypes from GT fields. Only if this format field is available. Provided as a tuple of sub-tuples. The number of sub-tuples is equal to the number of samples (num_samples). The number of items within each sub-tuples is equal to the ploidy (ploidy). These items are allele expression (as found in alleles), or None (for missing values). This attribute is None if GT is not available.

property GT_phased¶: Tell if the genotype for each sample is phased. None if GT is not available.

property GT_vcf¶: GT field as written in the VCF file

property ID¶: Tuple containing all IDs.

property PL¶: PL values for all samples

property alleles¶: Tuple of variant alleles. The first is the reference, which is not guaranteed to be present in samples.

property alternate_types¶

Alternate alleles types, as a tuple.

One value is provided for each alternate allele. The provided values are integers whose values should always be compared to class attributes alt_type_default, alt_type_referred and alt_type_breakend, as in (for the type of the first alternate allele):

>>> type_ = variant.alternate_types[0]
>>> if type_ == variant.alt_type_default:
...     allele = variant.allele(0)

property chromosome¶: Chromosome name. None if missing.

property failed_tests¶: Filters at which this variant failed. Provided as a tuple or names, or None if no filters have been applied.

property format_fields¶: Frozenset of FORMAT fields ID’s. The frozenset is empty if no sample data are available.

property info¶: Dictionary of INFO fields for this variant. Keys are ID of INFO fields available for this variant, and values are always a tuple of items, even if there is only one item. For flag INFO types, the value is always an empty tuple.

property num_alleles¶: Number of alleles. The reference is always included.

property num_alternate¶: Number of alternate alleles. Equal to num_alleles minus 1.

property num_genotypes¶: Number of genotypes.

property num_samples¶: Number of samples. Equivalent to len(variant.samples).

property ploidy¶: Ploidy among genotypes. Always 2 if GT is not available.

property position¶: Position. Given as an index; first value is 0. None if missing.

property quality¶: Variant quality. None if missing.

property samples¶: Information for each sample. Empty list if no samples are defined. Otherwise, the list contains one dictionary for each sample: keys of these dictionaries are FORMAT fields identifiers (the keys are always the same as the content of format_fields), and their values are tuple instances in all cases.

class egglib.io.VcfSlidingWindow[source]¶

This class manages a sliding window on a VCF file. This class cannot be instanciated directly: instances are returned by the methods slider() and bed_slider() of io.VcfParser instances.

io.VcfSlidingWindow instances are iterable: iteration steps return a common io.VcfWindow instance which is updated at each iteration round.

Attributes

good

True if there is still data to be processed.

property good¶: True if there is still data to be processed.

class egglib.io.VcfWindow[source]¶

Provide access to sites of a window from a VCF file. The following operations are supported by io.VcfWindow instances:

Operation	Result
`len(win)`	number of sites in the window
`win[i]`	get a site as a `Site` instance
`for site in win:`	iterate over sites as `Site` instances

Attributes

`bounds`	Bounds of the window.
`chromosome`	Name of the chromosome or contig.
`num_sites`	Number of sites in window.

property bounds¶: Bounds of the window. The second bound is not included in the window.

property chromosome¶: Name of the chromosome or contig.

property num_sites¶: Number of sites in window.

egglib.io.FIRST¶: First available position of a contig.

egglib.io.LAST¶: Last available position of a contig.

class egglib.io.BED(fname=None)[source]¶

Class holding BED (Browser Extensible Data) data.

Parameters:: fname – name of the BED-formatted input file. By default, create an empty instance.

BED is a flexible format representing genomic annotation. In the current implementation, only the chromosome/scaffold, start, and end positions (the first three fields of each lines, which are the only ones to be required) are processed. If incorrectly formed fields are provided besides the first three fields, the behaviour is undefined (that is, it is not guaranteed that are exception is raised).

Supported operations:

Operation	Result	Notes
`len(bed)`	Number of annotation items
`bed[i]`	Get an annotation
`for i in bed:`	Iterate over annotations

Notes:

Annotation items are represented by dictionaries containing the following keys: "chrom", "start", and "end".

Methods

`append`(chrom, start, end)	Add an item at the end of the BED data.
`extend`(values)	Append items from an iterable.

append(chrom, start, end)[source]¶: Add an item at the end of the BED data.

extend(values)[source]¶: Append items from an iterable. All items of values must be a sequence containing a value for chrom, start, and end (in that order).