VCF is a text file format, usually stored compressed. It contains meta-information lines, a header line, and then data lines, each containing information about one position in the genome.
Genotype information on samples for each position is optional.
See the definitions at
As usual, there is a parser class, called VCF_Reader, that can generate an iterator of objects describing the structural variant calls. These objects are of type VariantCall and each describes one line of a VCF file. See below for an example.
As a subclass of FileOrSequence, VCF_Reader can be initialized either with a file name or with an open file or another sequence of lines.
When requesting an iterator, it generates objects of type VariantCall.
VCF_Reader skips all lines starting with a single ‘#’, as this marks a comment. However, lines starting with ‘##’ contain metadata (information about filters and about the fields in the ‘info’ column).
- parse_meta(header_filename = None)
The VCF_Reader normally does not parse the meta-information, and a VariantCall does not contain unpacked meta-information either. The function parse_meta reads the header information either from the attached FileOrSequence or from the file given as header_filename. This is important if you want to access sample-specific information for the VariantCall objects in your .vcf file.
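To illustrate what kind of information sits in the meta-information and header lines, here is a minimal standalone sketch. It does not use HTSeq and is not how parse_meta is implemented; the function names (split_vcf_lines, parse_info_meta, sample_names) and the example lines are made up for illustration, and only simple ##INFO lines of the form shown are handled.

```python
import re

def split_vcf_lines(lines):
    """Separate meta-information lines, the column header, and data lines."""
    meta, header, data = [], None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)                      # meta-information line
        elif line.startswith("#"):
            header = line.lstrip("#").split("\t")  # column header line
        elif line:
            data.append(line)                      # one variant per line
    return meta, header, data

def parse_info_meta(meta_lines):
    """Collect the declared INFO fields as {ID: {Number, Type, Description}}."""
    pattern = re.compile(
        r'##INFO=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="([^"]*)"'
    )
    info = {}
    for line in meta_lines:
        m = pattern.match(line)
        if m:
            info[m.group(1)] = {
                "Number": m.group(2),
                "Type": m.group(3),
                "Description": m.group(4),
            }
    return info

def sample_names(header):
    """Sample IDs follow the eight fixed columns and the FORMAT column."""
    return header[9:] if header and len(header) > 9 else []

# Hypothetical example lines, not taken from a real file.
example = [
    '##fileformat=VCFv4.0',
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">',
    '##FILTER=<ID=q10,Description="Quality below 10">',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002',
    '1\t10327\t.\tT\tC\t.\tPASS\tDP=14\tGT\t0|0\t1|0',
]

meta, header, data = split_vcf_lines(example)
info_meta = parse_info_meta(meta)
samples = sample_names(header)
```

The sample IDs are only available from the header line, which is why the header has to be read before per-sample information can be attributed to a sample.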
A VariantCall object always contains the following attributes:
This specifies whether the VariantCall passed all the filters given in the .vcf header (value=PASS); otherwise it contains a list of the filters that failed (the filter IDs are also specified in the header).
Contains the format string specifying which per-sample information is stored in VariantCall.samples.
The ID of the VariantCall if it has been found in any database; for unknown variants this will be “.”.
This will contain either the string version of the info field for this VariantCall or a dict with the parsed and processed info-string.
A dict mapping sample IDs to subdicts, which use the entries of VariantCall.format as keys to store the per-sample information.
This function parses the info string and replaces it with a dict representation if the infodict of the originating VCF_Reader is provided.
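The relationship between the info string, the format string, and the per-sample dict can be sketched in plain Python. This is an illustration only, not HTSeq's code: parse_info and parse_samples are hypothetical helpers, and info_types is an assumed stand-in for the type information that the reader's infodict would provide.

```python
def parse_info(info_string, info_types=None):
    """Turn 'DP=14;AF=0.5,0.25;DB' into a dict.

    info_types maps a field ID to 'Integer' or 'Float' (assumed layout);
    fields without a declared type stay strings, flags map to True.
    """
    casts = {"Integer": int, "Float": float}
    info_types = info_types or {}
    result = {}
    for entry in info_string.split(";"):
        if not entry or entry == ".":
            continue                      # empty info column
        key, sep, value = entry.partition("=")
        if not sep:
            result[key] = True            # flag such as 'DB'
            continue
        cast = casts.get(info_types.get(key), str)
        values = [cast(v) for v in value.split(",")]
        result[key] = values[0] if len(values) == 1 else values
    return result

def parse_samples(format_string, sample_ids, sample_columns):
    """Map each sample ID to a dict keyed by the fields of the format string."""
    keys = format_string.split(":")
    return {
        sid: dict(zip(keys, column.split(":")))
        for sid, column in zip(sample_ids, sample_columns)
    }

# Hypothetical values for illustration.
info = parse_info("DP=14;AF=0.5,0.25;DB", {"DP": "Integer", "AF": "Float"})
samples = parse_samples("GT:DP", ["NA00001", "NA00002"], ["0|0:7", "1|0:5"])
```

Without the type declarations from the header, every value would have to stay a string, which is one reason the parsed dict form depends on the meta-information having been read first.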
Example workflow for reading dbSNP in VCF format (obtained from dbSNP, ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/00-All.vcf.gz):
>>> import HTSeq
>>> vcfr = HTSeq.VCF_Reader( "00-All.vcf.gz" )
>>> vcfr.parse_meta()
>>> vcfr.make_info_dict()
>>> for vc in vcfr:
...     print(vc)
1:10327:'T'->'C'
1:10433:'A'->'AC'
1:10439:'AC'->'A'
1:10440:'C'->'A'
FIXME The example above is not run, as the example file is still missing!
The class is instantiated with the file name of a Wiggle file, or with a sequence of lines in Wiggle format. A WiggleReader object generates an iterator, which yields pairs of the form (iv, score), where iv is a GenomicInterval object and score is a float with the score that the file assigns to the specified interval. If verbose is set to True, the user is alerted to skipped lines (comments or browser lines) by a message printed to the standard output.
The BED format is a format originally used to describe gene models but is also commonly used to describe other genomic features.
The class is instantiated with the file name of a BED file, or with a sequence of lines in BED format. A BED_Reader object generates an iterator, which yields a GenomicFeature object for each line in the BED file (except for lines starting with track, which are skipped).
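As a rough illustration of the line-by-line behaviour described here, the following standalone sketch reads BED lines and skips track lines. It does not use HTSeq and yields a simple namedtuple rather than a GenomicFeature; BedRecord, read_bed, and the default values for missing columns are assumptions made for this example.

```python
from collections import namedtuple

# Hypothetical record type standing in for a feature object.
BedRecord = namedtuple("BedRecord", "chrom start end name score strand")

def read_bed(lines):
    """Yield one BedRecord per BED line, skipping blank and 'track' lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("track"):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError("BED line needs at least 3 fields: %r" % line)
        # Columns 4 to 6 are optional in BED; the defaults are assumptions.
        name = fields[3] if len(fields) > 3 else "unnamed"
        score = float(fields[4]) if len(fields) > 4 else None
        strand = fields[5] if len(fields) > 5 else "."
        # BED coordinates are 0-based with an exclusive end.
        yield BedRecord(fields[0], int(fields[1]), int(fields[2]),
                        name, score, strand)

# Made-up example lines.
bed_lines = [
    'track name="example"',
    "chr1\t1000\t5000\tgeneA\t960\t+",
    "chr2\t200\t300",
]
records = list(read_bed(bed_lines))
```

Because the function accepts any iterable of strings, it can be fed an open file as well as a list of lines, mirroring the "file name or sequence of lines" idea in spirit.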
The attributes of the yielded GenomicFeature objects are as follows: