Powerful new machine learning algorithms for variant indexing deliver improved specificity for the Mastermind Genomic Search Engine.

Introduction

When indexing the genomic data across the scientific literature, simultaneously maintaining maximal sensitivity and preserving optimal specificity of search results is a critical component of successful information retrieval. In the early iterations of the Mastermind Genomic Search Engine, our focus was on ensuring the highest sensitivity.

With this approach, we offered our users the best opportunity to identify any and all references where their variant may have been cited.

We then pointed users to the evidence so that they could quickly determine whether the results were useful. Using this approach, we have created both the most comprehensive genomic search engine and the most comprehensive variant database in existence.

Our next goal was to refine the specificity of the data by reducing the rate of false positive results without compromising the sensitivity of the data.

Our Approach to Improving Specificity

The plot below illustrates the balancing act that an exercise such as this requires. Maximally specific results, reflected in line C, were always possible given the data available in Mastermind.

However, as described above, a conscious decision was made when designing the first versions of Mastermind to have search results conform to line A to ensure the highest sensitivity so as not to miss any true positive results shown under the dashed line curve.

[Figure: sensitivity vs. specificity in variant search]

With the sensitivity of Mastermind’s coverage of the genomic literature established, we shifted our focus to enhancing the specificity of the search engine.

By reviewing Mastermind’s variant search results against both manual curation and automated data analysis (via machine learning techniques), we were able to study, identify, and catalog recurrent patterns in the variant match data for false positive results.

We then adjusted our search engine algorithms to eliminate these falsely mapped variants that had previously reduced the specificity of our results.

The Genomenon scientific team manually reviewed more than 10,000 variants across many thousands of individual articles to ensure the accuracy of the variant indexing process.

Rather than simply adjusting our cutoff by redrawing the threshold at line B, which would have resulted in compromises to both sensitivity and specificity, we have introduced dozens of new data points to augment the test result information along the x-axis.

This advancement widens the gap between the true negative and true positive curves to allow for a near perfect separation, meaning that we do not need to make sacrifices to sensitivity when seeking to enhance specificity.

The shift in the measurement of variant match accuracy has effectively made it possible to have lines A and C (reflecting maximal sensitivity and maximal specificity, respectively) move closer and closer until they almost converge, creating a near-perfect separation between true negatives and true positives.

[Figure: sensitivity and specificity]
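To make the trade-off concrete, here is a minimal Python sketch of how sensitivity and specificity change as a cutoff moves from a permissive position (like line A) to a strict one (like line C). The scores, labels, and cutoff values are toy data invented for illustration; they are not Mastermind data, and the function name is hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are reported as matches."""
    tp = fn = tn = fp = 0
    for score, is_true_match in zip(scores, labels):
        reported = score >= threshold
        if is_true_match and reported:
            tp += 1
        elif is_true_match:
            fn += 1
        elif reported:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: True marks a correct variant match, False a spurious one.
labels = [True, True, True, True, False, False, False, False]

# Scores from a single weak signal: the two groups overlap.
overlapping_scores = [0.9, 0.7, 0.5, 0.3, 0.6, 0.4, 0.2, 0.1]

# Scores after adding more evidence: the two groups no longer overlap.
separated_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Permissive, middle, and strict cutoffs, loosely analogous to lines A, B, and C.
for name, threshold in [("A", 0.3), ("B", 0.5), ("C", 0.7)]:
    print(name, sensitivity_specificity(overlapping_scores, labels, threshold))

print("separated", sensitivity_specificity(separated_scores, labels, 0.5))
```

With the overlapping scores, every cutoff gives something up: the permissive cutoff keeps all true matches but admits half of the spurious ones, and the strict cutoff does the reverse. With the separated scores, a single cutoff reports every true match and no spurious ones, which is the situation the second plot describes.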

Examples of Specificity Enhancements

Example #1 – Unambiguous Variant Matches

Mastermind’s indexing process identifies variants described with a multitude of different nomenclatures. Some of these nomenclatures are easier to match unambiguously than others. For instance, variants described by their rsID numbers are easy to match unequivocally during the indexing process.

Almost nothing else in the literature takes the form “rs” followed by one or more digits, so a match is very unlikely to be anything other than an rsID. In addition, because the gene where the variant is found is implicit in the rsID, mapping the correct gene to the variant is trivial. Identifying variants by rsID therefore offers both extremely high sensitivity and high specificity.
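As a rough illustration of why this form is so easy to pick out, the following Python sketch finds rsID-like tokens with a regular expression. It is a simplified stand-in, not Mastermind’s actual matcher; the function name and the sample identifiers are placeholders.

```python
import re

# "rs" followed by one or more digits, bounded so that words such as
# "users" or "hrs12" are not picked up by accident.
RSID_PATTERN = re.compile(r"\brs\d+\b")


def find_rsids(text):
    """Return the distinct rsID-like tokens in text, in order of first appearance."""
    return list(dict.fromkeys(RSID_PATTERN.findall(text)))


sample = "Carriers of rs123456 and rs42 were compared; users with rs123456 were excluded."
print(find_rsids(sample))
```

The word boundaries do most of the work here: ordinary words that merely contain the letters “rs” are skipped, and repeated mentions collapse to a single entry.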

Example #2 – Potentially Ambiguous Variant Matches

Mastermind identifies variants described using both “c.” and “p.” nomenclature. These too are generally very easy to pick out from reference material, as the form they take is not often repeated when describing other entities. However, when matching these variants you must unambiguously determine which gene is being referred to by the authors for that variant.

In the case of single nucleotide variants described using the “c.” nomenclature, 25% of the genes mentioned in the paper are likely to match in addition to the correct gene that the author intended.
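The post does not spell out where the 25% figure comes from. One plausible reading, offered here as an assumption rather than a statement of how Mastermind works, is that a substitution such as “c.6A>T” is consistent with any gene whose coding sequence happens to have an A at position 6, and with four possible nucleotides that happens by chance about one time in four. The Python sketch below shows the idea using hypothetical gene names and toy sequences, with a simplified pattern that covers only basic substitutions.

```python
import re

# Simplified pattern for a coding substitution such as "c.6A>T".
# Real "c." nomenclature covers far more cases than this.
C_SUBSTITUTION = re.compile(r"c\.(\d+)([ACGT])>([ACGT])")

# Hypothetical genes with toy coding sequences (not real sequences).
TOY_CODING_SEQUENCES = {
    "GENE1": "ATGGCACCT",
    "GENE2": "ATGGAACGT",
    "GENE3": "ATGGTTCCA",
}


def compatible_genes(variant, coding_sequences):
    """Return the genes whose sequence carries the stated reference base at the stated position."""
    match = C_SUBSTITUTION.fullmatch(variant)
    if match is None:
        raise ValueError(f"unsupported variant description: {variant!r}")
    position = int(match.group(1))  # 1-based position in the coding sequence
    reference_base = match.group(2)
    return [
        gene
        for gene, sequence in coding_sequences.items()
        if 1 <= position <= len(sequence) and sequence[position - 1] == reference_base
    ]


print(compatible_genes("c.6A>T", TOY_CODING_SEQUENCES))
```

In this toy case two of the three genes are consistent with the same description, so the sequence check alone cannot say which gene the author meant.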

For potentially ambiguous variants such as those in Example #2 above, we sought to perform disambiguation by using context clues from each reference, in exactly the same way a human reader would when inferring the author’s intent.

To automate this disambiguation process, we relied on our manual curation database, recognizing the multiple parameters that contributed to an improved true positive gene match for any given variant.

Some of these parameters include whether the gene is mentioned in the title of the reference, the number of times the gene is mentioned in the paper, the distance between the variant and the nearest gene match, and the presence of multiple nomenclature descriptions of the same variant (for instance, in a table with both “c.” and “p.” descriptions of a variant).
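To show how parameters like these could be combined, here is a minimal Python sketch that scores each candidate gene for a variant and picks the highest-scoring one. The hand-picked weights, field names, and gene names are hypothetical and stand in for whatever the machine learning process actually learned; this is not Mastermind’s model.

```python
from dataclasses import dataclass


@dataclass
class GeneCandidate:
    gene: str
    in_title: bool                 # gene is mentioned in the title of the reference
    mention_count: int             # number of times the gene is mentioned in the paper
    nearest_mention_distance: int  # characters between the variant and the closest gene mention
    has_paired_nomenclature: bool  # variant also appears in a second, consistent form (c. and p.)


def score(candidate):
    """Combine the context clues into one number; the weights are illustrative guesses."""
    total = 0.0
    if candidate.in_title:
        total += 2.0
    # Cap the count so one heavily repeated gene cannot dominate on this clue alone.
    total += min(candidate.mention_count, 20) / 20
    # Closer gene mentions contribute more; the contribution decays with distance.
    total += 1.0 / (1.0 + candidate.nearest_mention_distance / 100)
    if candidate.has_paired_nomenclature:
        total += 1.5
    return total


def best_gene(candidates):
    """Return the name of the highest-scoring candidate gene."""
    return max(candidates, key=score).gene


candidates = [
    GeneCandidate("GENE1", in_title=True, mention_count=30,
                  nearest_mention_distance=12, has_paired_nomenclature=True),
    GeneCandidate("GENE2", in_title=False, mention_count=2,
                  nearest_mention_distance=400, has_paired_nomenclature=False),
]
print(best_gene(candidates))
```

A learned model would fit these weights to the manually curated examples rather than fixing them by hand, but the inputs are the same kinds of context clues described above.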

Results of Specificity Enhancements

As a specific example of the impact these changes have on the quality of results in Mastermind, we rescanned a paper published in BMC Med Genomics in February 2018, entitled “Efficient strategy for the molecular diagnosis of intractable early-onset epilepsy using targeted gene sequencing”, using our new variant indexing process. The total number of variants identified fell from 549 to 48. The other 501 variants previously matched were the result of matching each variant reference to every gene in the article to which it could have belonged, rather than specifically to the one gene the author intended.

Each of the 48 currently indexed variants was verified to be correctly associated with the gene intended by the author, and none of them was incorrectly matched. This paper can be examined in Mastermind at the link below.

View Results in Mastermind
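The arithmetic behind this example can be checked in a few lines of Python. The sketch assumes that the 48 verified variants were also among the 549 matched previously, which the post implies but does not state outright; the variable names are ours.

```python
previously_matched = 549
currently_matched = 48

removed = previously_matched - currently_matched
fraction_removed = removed / previously_matched

# Share of the earlier matches that were correct, assuming the 48 verified
# variants were all among the original 549.
share_correct_before = currently_matched / previously_matched

print(removed)
print(round(fraction_removed, 3))
print(round(share_correct_before, 3))
```

Under that assumption, roughly nine in ten of the earlier matches for this paper were spurious gene assignments that the new indexing process no longer reports.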

Summary

Altogether, using this collection of information driven by upstream machine learning techniques, we were able to dramatically enhance the specificity of our variant indexing process without impacting the sensitivity of our results.

The latest release of Mastermind provides more accurate search results by automatically resolving the ambiguous gene-to-variant mappings that often confound Google Scholar and PubMed.

By dramatically reducing the false positive variant results from Mastermind in a way that is not possible using Google Scholar and PubMed, we are able to:

  • Deliver far more accurate variant search results,
  • Leverage sophisticated computational intelligence algorithms to continually refine Mastermind’s search results, and
  • Allow our users to more efficiently determine the clinical significance of each variant.

Users who rely on Mastermind to understand the comprehensive landscape of variants within an entire gene will benefit from the new variant indexing process whether they are doing discovery work, biomarker identification, or gene panel design.

Let us know about the improvements you see to your search process and whether you have any questions. Much more to come!