In this blog, chief technology officer Steve Schwartz describes the limitations of the Mastermind Cited Variants Reference (CVR) that affect genomic analyses, and how the integration of Mastermind’s API can help surmount them.
The Mastermind Cited Variants Reference (CVR) is an open-access file, available in VCF (variant call format) or CSV (comma-separated values) format for both the GRCh37 and GRCh38 reference builds. It contains a list of (at present) over 15 million variants found by Mastermind to be cited in the primary evidence of the medical literature, along with the number of articles citing each variant. Producing the CVR and making it available was a great accomplishment, allowing many websites and pipelines to integrate Mastermind’s industry-leading variant evidence data into genomic resources and analyses. However, the CVR comes with some limitations, which can be overcome with Mastermind’s API.
Reverse-transcribing frameshift variants
These limitations arise from the CVR’s basis in the industry-standard VCF specification. A VCF file is essentially a tab-separated values (TSV) file with a number of special header lines at the top. Each variant row must specify the variant by genomic coordinate, reference allele, and alternate allele. A substitution looks like this:
While a frameshift variant might look like this:
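As a minimal sketch of what such rows contain, the Python below parses two illustrative VCF data lines into the fields described above. The first line uses the GRCh37 coordinates commonly reported for BRAF V600E (an assumption on our part, not taken from the CVR file); the second is an invented one-base deletion at a placeholder position. The names `VcfRow`, `parse_vcf_line`, and `describe` are hypothetical helpers, not part of any Mastermind tooling.

```python
from typing import NamedTuple


class VcfRow(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO (tab-separated).
SUBSTITUTION_LINE = "7\t140453136\t.\tA\tT\t.\t.\t."   # assumed BRAF V600E, GRCh37
FRAMESHIFT_LINE = "7\t140000001\t.\tCA\tC\t.\t.\t."     # invented 1-bp deletion


def parse_vcf_line(line: str) -> VcfRow:
    """Extract coordinate, reference allele, and alternate allele from a data line."""
    chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return VcfRow(chrom, int(pos), ref, alt)


def describe(row: VcfRow) -> str:
    """Rough classification by allele length only.

    An indel whose length change is not a multiple of three shifts the
    reading frame, but only if it actually falls within coding sequence.
    """
    if len(row.ref) == len(row.alt):
        return "substitution"
    if abs(len(row.ref) - len(row.alt)) % 3 != 0:
        return "possible frameshift indel"
    return "in-frame indel"


for line in (SUBSTITUTION_LINE, FRAMESHIFT_LINE):
    row = parse_vcf_line(line)
    print(row, "->", describe(row))
```

Either way, the row commits to an exact position and exact alleles, which is the property that matters for the rest of this post.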
The complication comes from the types of variant descriptions identified and recognized by Mastermind in the medical literature. Take this article for example, which cites the variant BRAF:p.Pro403LeufsTer8. In fact, this article cites several such variants. However, the author never specifies the exact frameshift observed. Nevertheless, in a clinical or research setting, if you’re trying to find important information about a variant resulting in a frameshift starting at this residue, this paper could be critical to your understanding and analysis of the variant, potentially changing its pathogenicity call.
While other variant databases force users to specify variant entries exactly by their observed genomic change, Mastermind understands and recognizes variant descriptions even when they lack the granularity of data to resolve them to the specific genomic coordinates. Mastermind indexes this variant as a frameshift at this particular protein residue, and is capable of dealing with some ambiguity as to the exact genomic change.
When you query Mastermind via the UI or API for any frameshift, Mastermind understands the change described and maps it to the resulting protein effect. It then returns all articles that match not only your exact variant, but also variants resulting in the same or a similar protein effect (this behavior is described in a previous post for non-coding changes, but Mastermind does this for all other types of variants as well). This is the power of Mastermind’s Genomic Language Processing (GLP).
However, protein effects can be tricky to reverse-transcribe into all possible genomic causes, especially when described ambiguously. A description such as p.P403fs tells us approximately where the frameshift starts (at codon resolution, if not nucleotide resolution), but not exactly at which nucleotide it starts or ends. In many cases, the number of genomic changes that could produce this protein change is practically unbounded: it could be the deletion of a single nucleotide, the insertion of a nucleotide, the deletion of two nucleotides, the insertion of two nucleotides, and so on. There is no way to reverse this citation into a finite or manageable set of genomic coordinate/ref/alt variants for inclusion in the CVR file, which is released each quarter. Therefore, articles like the one above, and their ambiguous variant citations, are not included in the CVR, which follows the VCF format and so requires a nucleotide-specific description for each variant in the file.
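To get a feel for how quickly the candidate set grows, here is a rough counting sketch in Python. It rests on simplifying assumptions of ours: the frameshift begins somewhere within a single codon (three possible start positions), any insertion or deletion whose length is not a multiple of three qualifies, and equivalent representations of the same change are not merged, so the result is an upper bound. The function name and the length cap are made up for illustration.

```python
def count_frameshift_candidates(max_len: int, positions: int = 3) -> int:
    """Upper bound on distinct indels that could start a frameshift in one codon.

    max_len   -- longest insertion/deletion considered, in nucleotides
    positions -- candidate start positions (three nucleotides per codon)
    """
    total = 0
    for length in range(1, max_len + 1):
        if length % 3 == 0:
            continue  # multiples of three keep the reading frame intact
        insertions = positions * 4 ** length  # every possible inserted sequence
        deletions = positions                 # one deletion per start position
        total += insertions + deletions
    return total


for cap in (1, 2, 4, 5):
    print(f"indels up to {cap} nt: {count_frameshift_candidates(cap)} candidates")
```

Under these assumptions, capping indels at just five nucleotides already yields 3,912 candidate rows for a single ambiguous citation, and the count grows roughly fourfold with each additional nucleotide allowed.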
Reverse-transcribing missense variants
The CVR is able to reverse-transcribe missense variants due to the small number of nucleotide changes which can result in a given missense amino acid substitution. For example, if we have an article describing a variant such as BRAF:p.Val600Glu with no additional information, we can look at the protein sequences corresponding to the transcripts available for the BRAF gene, filter down to only those in which the V600 amino acid residue is valid, and then use the information in a codon chart to determine that the reference codon for BRAF’s valine at that position has only 2 possible changes which could result in the valine becoming a glutamate.
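The codon-chart step can be sketched in a few lines of Python. This is only an illustration of the idea, not the CVR's actual pipeline: it assumes the standard genetic code and assumes the reference codon for BRAF V600 is GTG on the coding strand (as commonly reported for this variant). The helper name `codons_for_substitution` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codons_for_substitution(ref_codon: str, alt_aa: str) -> dict:
    """Map each codon encoding alt_aa to the number of nucleotides
    that differ from ref_codon."""
    return {
        codon: sum(a != b for a, b in zip(ref_codon, codon))
        for codon, aa in CODON_TABLE.items()
        if aa == alt_aa and codon != ref_codon
    }


# Assumed reference codon for BRAF V600 (valine); "E" is glutamate.
print(codons_for_substitution("GTG", "E"))
```

For GTG this yields the two glutamate codons, GAG (one nucleotide changed) and GAA (two nucleotides changed), which is why a missense citation expands to a small, enumerable set of genomic variants.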
Because the CVR is often used as a filter, overlaying the CVR data on the VCF files of patient sequence data and finding the intersection of variants between the two, we want to ensure that if the patient has any of these possible causes of BRAF:p.V600E, the CVR returns it. This emphasis on sensitivity for filtering use cases is why the CVR includes all possible expansions of missense variants, regardless of whether those nucleotide-specific changes have been cited directly in the literature.
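The filtering use case amounts to an intersection on (chromosome, position, ref, alt). A minimal Python sketch follows, with assumptions stated up front: the CVR entries and article counts are invented for illustration, the coordinates are the GRCh37 positions we assume for the two nucleotide-level causes of BRAF:p.V600E, and `variant_key` and `filter_patient_variants` are hypothetical helpers rather than part of any Mastermind tool.

```python
from typing import Dict, Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def variant_key(vcf_line: str) -> VariantKey:
    """Build a comparable key from a VCF data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    # Strip a leading "chr" so "chr7" and "7" compare as equal.
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref, alt)


def filter_patient_variants(
    patient_lines: Iterable[str], cvr_counts: Dict[VariantKey, int]
) -> List[Tuple[VariantKey, int]]:
    """Keep patient variants present in the CVR, with their article counts."""
    hits = []
    for line in patient_lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        key = variant_key(line)
        if key in cvr_counts:
            hits.append((key, cvr_counts[key]))
    return hits


# Invented CVR excerpt: both expansions of p.V600E share one article count.
cvr = {
    ("7", 140453136, "A", "T"): 1000,    # assumed single-nucleotide cause
    ("7", 140453135, "CA", "TT"): 1000,  # assumed two-nucleotide cause
}
patient = [
    "##fileformat=VCFv4.2",
    "chr7\t140453135\t.\tCA\tTT\t.\tPASS\t.",
    "chr7\t55000000\t.\tG\tA\t.\tPASS\t.",
]
print(filter_patient_variants(patient, cvr))
```

Because both expansions are present in the reference set, the patient's less common two-nucleotide change still matches, which is the sensitivity argument made above.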
Advantages of the Mastermind API
The CVR is limited by the need to reverse-transcribe variants a priori: variants cited in the literature at the protein level must be converted to nucleotide-level descriptions ahead of time so that they’re compatible with the VCF file format used in standard genomics pipelines. The Mastermind API and the web-based genomic search engine have no such limitation. They forward-transcribe each variant you search into its resulting protein effect in real time and return all relevant results, without needing to guess which variants you may search before you’ve searched them.
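The asymmetry between the two directions is easy to see in a toy example. The Python sketch below is not Mastermind's implementation; it assumes the standard genetic code and uses an invented 18-nucleotide coding sequence. It applies two different genomic changes and translates the result, showing that both collapse onto a frameshift at the same residue.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def first_changed_residue(ref_cds: str, alt_cds: str):
    """1-based index of the first residue that differs, or None if none does."""
    pairs = zip(translate(ref_cds), translate(alt_cds))
    for index, (ref_aa, alt_aa) in enumerate(pairs, start=1):
        if ref_aa != alt_aa:
            return index
    return None


REF_CDS = "ATGCCTGTGAAATCTTAA"               # invented; translates to MPVKS*
one_base_del = REF_CDS[:4] + REF_CDS[5:]     # delete one base in codon 2
one_base_ins = REF_CDS[:3] + "A" + REF_CDS[3:]  # insert one base before codon 2

for label, alt in (("deletion", one_base_del), ("insertion", one_base_ins)):
    print(label, translate(alt), "first change at residue",
          first_changed_residue(REF_CDS, alt))
```

Going forward, each specific nucleotide change has exactly one protein consequence to compute. Going backward, a single "frameshift at residue 2" would have to be expanded into every change that could have caused it.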
To see more details about integrating the Mastermind API into your analysis pipeline or clinical or research applications, see the API Integration documentation available for download from Mastermind.
Still need to try Mastermind? Create your account today and start with a free trial of Professional Edition.