With millions of human genetic variants cited across more than 30 million articles, comprehensively searching the medical literature for a single variant can be like finding a needle in a haystack. The stakes are high: a single paper can mean the difference between classifying a variant as a variant of uncertain significance (VUS) and classifying it as pathogenic.
Missing even one relevant article from among these millions can significantly compromise the accuracy of variant interpretation, reducing the chances that patients receive the best and most appropriate care.
Having ready access to the most complete database of published variants is essential in reducing the time it takes to interpret a variant and ensuring the accuracy of its interpretation. Using a comprehensive genomic search engine is the best way to ensure that the search is complete, the results are correct, and no articles have been overlooked.
Selecting the Best Variant Database
Here are five key considerations in assessing the comprehensiveness and quality of a variant database drawn from the published research literature:
1. Minimizing False Negatives Requires Full-Text Indexing
Many providers claim to have indexed 30 million articles, when in fact they index only the titles and abstracts found in PubMed. Titles and abstracts contain only 6.7% of the variants mentioned in the full papers, and these tend to be the most widely cited and well-known variants (which are the least likely to require literature investigation). Gaining insight into the remaining 93.3% of variants requires indexing the full-text articles.
2. Minimizing False Negatives Requires a Breadth of Literature Coverage
Some providers index the roughly 5 million open-access articles in PubMed Central. This covers only about 16% of full-text published articles, and presumably a similar fraction of all variants found in full text. Moreover, publishers often embargo articles for months to years after publication before releasing them to open access, putting the latest genomic discoveries out of reach of this type of indexing process. A broad reach into the literature is key to building a comprehensive genomic search engine: providers must go beyond a subset of journals to comprehensively identify variant citations across all published medical research.
3. Minimizing False Negatives Requires Access to Supplemental Datasets
Supplemental data is a key source of less frequently cited variants, so the ability to index and include supplemental data is critical to maximizing variant coverage. Supplemental files are heterogeneous in format, and extracting data from them is a challenging task that most providers shy away from.
4. Minimizing False Negatives Requires a Comprehensive Variant Indexing Process
Beyond matching standardized variant nomenclatures, providers must recognize a number of non-standard nomenclatures, styles, and special characters that are used to describe variants in the published literature (and even more so in the heterogeneous supplemental data). Capturing all the variant types and all the ways an author can describe a variant takes years of development and domain experience.
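As a minimal illustration of why this is hard (a sketch, not Genomenon's actual pipeline), consider protein-level variant mentions alone: a pattern for canonical HGVS notation misses one-letter shorthand and older arrow-style forms, so each nonstandard style needs its own pattern plus normalization to a common form. All pattern names below are hypothetical.

```python
import re

# Canonical HGVS protein notation, e.g. "p.Val600Glu" or "p.(Val600Glu)"
HGVS_PROTEIN = re.compile(r"p\.\(?([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})\)?")
# Common shorthand seen in papers, e.g. "V600E"
ONE_LETTER = re.compile(r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b")
# Older arrow style, e.g. "Val600→Glu" or "Val600->Glu"
ARROW_STYLE = re.compile(r"\b([A-Z][a-z]{2})(\d+)(?:→|->)([A-Z][a-z]{2})\b")

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

def normalize(text):
    """Return the set of variant mentions found in free text, normalized
    to one-letter form so different styles collapse to the same key."""
    hits = set()
    for pattern in (HGVS_PROTEIN, ARROW_STYLE):
        for ref, pos, alt in pattern.findall(text):
            if ref in THREE_TO_ONE and alt in THREE_TO_ONE:
                hits.add(f"{THREE_TO_ONE[ref]}{pos}{THREE_TO_ONE[alt]}")
    for ref, pos, alt in ONE_LETTER.findall(text):
        hits.add(f"{ref}{pos}{alt}")
    return hits

print(normalize("BRAF p.Val600Glu (also written V600E or Val600→Glu)"))
# → {'V600E'}
```

Even this toy version shows the combinatorial problem: each new variant type (indels, splice variants, fusions) multiplies the nomenclature styles to be recognized and disambiguated, which is why the real task takes years of domain-specific development.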
5. Minimizing False Positive Variants Requires Correctly Identifying Corresponding Genes
The broad range of both variant types and nomenclatures exposes the indexing process to numerous false positives arising from inappropriate variant matches and faulty gene-to-variant pairing. Eliminating these false positives systematically takes years of development and iterative improvement, along with an ongoing commitment to search-result quality focused specifically on genetics and genomics.
Selecting the Best Provider
Selecting a provider for your variant curation needs is the most important decision you can make to ensure the validity of your interpretations. I hope these criteria are helpful in your search, but it should be no surprise that I believe we provide the best solutions for variant interpretation.
Genomenon focuses solely on human genomic content from the entirety of the medical literature, updated weekly, with the most comprehensive and accurate dataset ever collected. We have addressed each of these five criteria by building the Mastermind suite of variant interpretation tools.
Mastermind Tools for Variant Interpretation
- Mastermind Genomic Search Engine – A powerful user interface used to search the medical literature
- Mastermind API – Comprehensive solution sets for programmatic access to Mastermind data
- Mastermind VCF Processing – Batch file tools to append publication data to VCF files
- Mastermind Alerts – Real-time, automated notification for patient look-backs and database updates
- Customized Solutions – Clients can commission comprehensive database assembly projects for clinical reporting; next-generation sequencing gene, variant, and fusion-gene panel design; pharmaceutical R&D activities; and more
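To make the VCF-processing idea concrete, here is a minimal sketch of what "appending publication data to VCF files" means in general terms. This is not Mastermind's actual tool or output format: the `ARTICLE_COUNT` INFO key, the in-memory lookup table, and the citation count are all hypothetical stand-ins for a real literature database query.

```python
# Hypothetical lookup: (chrom, pos, ref, alt) -> number of citing articles.
# The coordinates below are illustrative only.
ARTICLE_COUNTS = {("7", "140453136", "A", "T"): 4120}

def annotate_vcf(lines):
    """Append a hypothetical ARTICLE_COUNT INFO field to each VCF record."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            out.append(line)  # pass meta-information lines through
        elif line.startswith("#CHROM"):
            # Declare the new INFO field before the column header line
            out.append('##INFO=<ID=ARTICLE_COUNT,Number=1,Type=Integer,'
                       'Description="Articles citing this variant (illustrative)">')
            out.append(line)
        else:
            fields = line.split("\t")
            chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
            count = ARTICLE_COUNTS.get((chrom, pos, ref, alt), 0)
            info = fields[7]
            fields[7] = (f"ARTICLE_COUNT={count}" if info == "."
                         else f"{info};ARTICLE_COUNT={count}")
            out.append("\t".join(fields))
    return out
```

A batch tool following this pattern lets a pipeline flag, directly in the VCF, which called variants have literature support worth reviewing before sign-out.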