What information is in the Cited Variants Reference?
The Mastermind Cited Variants Reference contains the number of articles citing each variant found in the medical literature. Its two most common uses are as an evidence filter for clinical actionability in genomic analysis pipelines (based on the presence or absence of evidence in the literature) and as a quick route into the literature for variant curation, through links into the Mastermind Genomic Search Engine. We think this only scratches the surface, and we are excited to see other ways the data can be combined to reveal unique insight.
Specifically, the Cited Variants Reference includes three literature counts for each variant, denoted MMCNT1, MMCNT2, and MMCNT3, along with a link into Mastermind, denoted MMURL3. The three counts range from highly specific (MMCNT1) to highly sensitive (MMCNT3):
- MMCNT1 (most specific) – cDNA-level exact matches. This is the number of articles that mention the variant at the nucleotide level in either the title/abstract or the full-text.
- MMCNT2 – cDNA-level possible matches. This is the number of articles with nucleotide-level matches (from MMCNT1) plus articles with protein-level matches that did not specify the cDNA-level change. These articles could be referring to this nucleotide-level variant, but they contain insufficient data to determine this conclusively.
- MMCNT3 (most sensitive) – This is the number of articles citing any variant resulting in the same biological effect as this variant. This includes the articles from MMCNT1 and MMCNT2 plus articles with alternative cDNA-level variants that result in the same protein effect.
- MMURL3 – This is a deep-link into Mastermind for the selected variant, which shows all articles from MMCNT3, in order to investigate and explore the evidence in the literature.
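As a minimal sketch of reading these fields, assume the reference is distributed as a VCF in which the counts and the link appear as `MMCNT1`, `MMCNT2`, `MMCNT3`, and `MMURL3` keys in the INFO column. The function names and the sample URL below are hypothetical and used only for illustration.

```python
def parse_cvr_info(info: str) -> dict:
    """Parse a semicolon-delimited VCF INFO string into the four CVR fields.

    Assumes the field names MMCNT1, MMCNT2, MMCNT3, and MMURL3 are INFO keys.
    Missing counts default to 0 and a missing URL to None.
    """
    fields = {}
    for entry in info.split(";"):
        # Split on the first "=" only, so a URL containing "=" stays intact.
        key, sep, value = entry.partition("=")
        if sep:
            fields[key] = value
    return {
        "MMCNT1": int(fields.get("MMCNT1", 0)),
        "MMCNT2": int(fields.get("MMCNT2", 0)),
        "MMCNT3": int(fields.get("MMCNT3", 0)),
        "MMURL3": fields.get("MMURL3"),
    }


def counts_are_nested(record: dict) -> bool:
    """Check MMCNT1 <= MMCNT2 <= MMCNT3, which follows from the definitions:
    each count includes the articles of the more specific one before it."""
    return record["MMCNT1"] <= record["MMCNT2"] <= record["MMCNT3"]
```

Because each count includes the articles of the more specific count before it, a record that fails `counts_are_nested` most likely points to a parsing problem rather than a property of the data.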
How can I use this data?
A common use case is to integrate this information into a genomic analysis pipeline for NGS (next-generation sequencing) data. For example, the variant citation counts can be used to annotate a patient VCF file in order to prioritize those variants with clinical evidence, while the URL can be used to speed up the variant curation process.
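A rough sketch of such an annotation step follows. It assumes both files are uncompressed, tab-delimited VCFs with one alternate allele per record, and that chromosome naming and variant normalization already agree between them. All function names are hypothetical, and a production pipeline would more likely use a dedicated VCF annotation tool.

```python
from typing import Dict, Iterable, List, Tuple

VariantKey = Tuple[str, int, str, str]  # (CHROM, POS, REF, ALT)


def variant_key(vcf_line: str) -> VariantKey:
    """Build a lookup key from the fixed VCF columns of one data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    return (cols[0], int(cols[1]), cols[3], cols[4])


def index_reference(lines: Iterable[str]) -> Dict[VariantKey, str]:
    """Map each reference variant to its INFO string (the 8th VCF column)."""
    index = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        index[variant_key(line)] = line.rstrip("\n").split("\t")[7]
    return index


def annotate(patient_lines: Iterable[str], index: Dict[VariantKey, str]) -> List[str]:
    """Append the reference INFO fields to every matching patient variant."""
    annotated = []
    for line in patient_lines:
        line = line.rstrip("\n")
        if line.startswith("#") or not line.strip():
            annotated.append(line)  # pass headers and blanks through unchanged
            continue
        cols = line.split("\t")
        hit = index.get(variant_key(line))
        if hit is not None:
            # "." marks an empty INFO column in VCF, so replace it outright.
            cols[7] = hit if cols[7] in (".", "") else cols[7] + ";" + hit
        annotated.append("\t".join(cols))
    return annotated
```

Variants without a match are passed through unchanged, so downstream filters can treat a missing count as "no literature evidence found in the reference file" rather than as an error.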
To further improve the curation process, you may rank variants relative to one another by number of articles, giving higher priority to those with more citations. Preference may also be given to variants with more exact cDNA-level citations (MMCNT1).
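One possible ordering, shown here only as an illustration, sorts variants by exact cDNA-level citations first and uses the broader counts as tie-breakers. The dictionary layout and the function name are assumptions, not part of the reference file.

```python
def rank_variants(variants: list) -> list:
    """Return variants ordered from most to least literature support.

    Each variant is a dict with integer MMCNT1, MMCNT2, and MMCNT3 values.
    Exact matches (MMCNT1) dominate; MMCNT2 and MMCNT3 break ties.
    """
    return sorted(
        variants,
        key=lambda v: (v["MMCNT1"], v["MMCNT2"], v["MMCNT3"]),
        reverse=True,
    )
```

A pipeline that values sensitivity over specificity could reverse the key order so that MMCNT3 is compared first.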
Is all Mastermind data contained in this reference file?
No. While the file does contain over 15 million variants seen in the medical literature, it doesn’t include everything in Mastermind’s ever-expanding database.
There are some technical limitations to providing variant counts by the genomic coordinates used in the VCF specification, because the medical literature doesn’t always provide enough information to translate protein-level changes into their exact cDNA-level variants.
This is why the Cited Variants Reference includes three separate levels of specificity for each genomic-level variant. In order to provide both MMCNT2 and MMCNT3 in the file, we must expand each protein-level change, such as an amino acid substitution, into all nucleotide-level changes that could result in that change at the protein level.
For example, an article may cite p.M856V in the SLC4A11 gene with no mention of the cDNA-level change. From the gene’s transcript, we can determine that there are four nucleotide-level changes which can result in this amino acid substitution, one for each of the four codons that encode valine (GTT, GTC, GTA, and GTG).
This allows maximum sensitivity for these four nucleotide-level variants in the reference file (which absent any other articles would have counts of MMCNT1=0, MMCNT2=1, and MMCNT3=1).
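The expansion described above can be sketched with the standard genetic code. This is an illustration rather than Mastermind's actual procedure. It works at the codon level only and does not compute cDNA coordinates, which depend on the transcript. Methionine is encoded only by ATG, so that is the reference codon in the p.M856V example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order for each
# codon position; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def expand_substitution(ref_codon: str, alt_aa: str) -> list:
    """List every codon encoding alt_aa, paired with the number of bases
    that differ from ref_codon."""
    return [
        (codon, sum(r != a for r, a in zip(ref_codon, codon)))
        for codon, aa in CODON_TABLE.items()
        if aa == alt_aa and codon != ref_codon
    ]


# Methionine (ATG) to valine, as in p.M856V:
# [('GTT', 2), ('GTC', 2), ('GTA', 2), ('GTG', 1)]
print(expand_substitution("ATG", "V"))
```

Only one of the four candidates is a single-base change. The other three require two bases to change, which is why the expansion has to consider multi-nucleotide variants as well as simple substitutions.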
However, some protein-level changes cannot feasibly be normalized into all possible nucleotide-level changes, as there would be too many possibilities to list. These variants may be queried in the Mastermind user interface or API, but are not contained in the Cited Variants Reference. In summary:
- Substitutions, intronic and splice-site variants, and UTR variants are in the reference file.
- Duplications, deletions, insertions, indels, and inversions may be in the reference file, depending on the complexity of the variation and the level of nomenclature used in the literature. For maximum sensitivity, we recommend querying the Mastermind API for any of these that are not in the reference file.
- Frameshifts and upstream and downstream gene variants are not in the reference file, but can be queried in the API.
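The coverage rules in this list could be encoded as a simple routing step in a pipeline. In the sketch below, the variant-type labels and the strategy names are hypothetical, and the API call itself is deliberately left out.

```python
# Variant types grouped by their coverage in the reference file.
# The labels are illustrative, not an official vocabulary.
IN_REFERENCE = {"substitution", "intronic", "splice_site", "utr"}
MAYBE_IN_REFERENCE = {"duplication", "deletion", "insertion", "indel", "inversion"}
API_ONLY = {"frameshift", "upstream", "downstream"}


def lookup_strategy(variant_type: str) -> str:
    """Decide where to look for literature counts for a variant type."""
    kind = variant_type.lower()
    if kind in IN_REFERENCE:
        return "reference_file"
    if kind in MAYBE_IN_REFERENCE:
        # Try the file first; fall back to the API when there is no match.
        return "reference_file_then_api"
    if kind in API_ONLY:
        return "api"
    raise ValueError(f"unknown variant type: {variant_type!r}")
```

Raising an error for unknown types, rather than defaulting to the reference file, keeps a gap in coverage from being mistaken for a lack of literature evidence.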
PLEASE NOTE: If you embed data from the Cited Variants Reference (CVR) in user-facing applications, we strongly recommend that you clearly inform users of these limitations. Mastermind has found variants in the published literature that are not in the CVR (refer to the summary above). In these cases, Mastermind can be queried directly to find relevant literature.