With our update on June 10, 2020, Mastermind now has greatly enhanced capabilities when searching for intronic and non-coding variants across the medical literature. I’m covering this major enhancement in two blog posts. This first post explains the “why” of the update; the second goes into detail about the “what”.
Sensitivity versus Specificity
Mastermind was created to make it easy to find relevant genomic insight from the medical evidence. In the tug of war between sensitivity and specificity, our approach has been to first maximize sensitivity, and then optimize specificity without affecting the sensitivity. In other words:
Sensitivity > Specificity
The reason is simple:
False positive results from lower specificity can be addressed by looking at the data.
False negative results, however, are non-recoverable, since it is impossible to evaluate data you don’t have.
Especially in the case of rare disease, this could be the difference between missing the diagnosis and finding the treatment.
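To make the trade-off concrete, here is a minimal sketch using the standard definitions of sensitivity and specificity. The two searches and their counts are hypothetical numbers chosen for illustration, not Mastermind data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of the relevant articles that the search actually returned."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of the irrelevant articles that the search correctly left out."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts for two searches over the same 100 articles,
# 10 of which are truly relevant to the variant.
broad = {"tp": 10, "fn": 0, "fp": 30, "tn": 60}   # finds everything, with noise
narrow = {"tp": 6, "fn": 4, "fp": 2, "tn": 88}    # cleaner, but misses 4 articles

for name, c in (("broad", broad), ("narrow", narrow)):
    print(
        name,
        f"sensitivity={sensitivity(c['tp'], c['fn']):.2f}",
        f"specificity={specificity(c['tn'], c['fp']):.2f}",
    )
```

In this made-up case, the 30 false positives of the broad search can be reviewed and discarded, while the 4 articles the narrow search never returned cannot be reviewed at all.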
We’ve posted in the past about how this philosophy drives our approach to optimizing specificity in the context of maximal sensitivity. Mastermind optimizes for sensitivity and provides insight into the results for users to quickly validate (or invalidate) the results.
With this philosophy in mind, Mastermind has always considered the protein level to be the optimal level of variant description: it more closely describes the biological effect of a given variant, and thus maximizes sensitivity for the practical purpose of finding evidence. Mastermind allows searching for variants using any nomenclature, including genomic coordinates, cDNA, rsID, and even IVS descriptions. Internally, we call all of these “nucleotide-specific” nomenclatures, in contrast to descriptions at the protein level, which are “codon-specific”. Whichever nomenclature is used, Mastermind always normalizes searches to the resulting protein-level description.
For example, a search for BRAF:c.1799T>A would show results for all BRAF:p.Val600Glu variations, including BRAF:c.1799_1800delinsAA, as they both result in the same amino acid change within the resulting protein. This helps ensure that you don’t miss valuable information related to the biological impact of the variant being searched simply because the nomenclature being searched wasn’t as sensitive as it could have been.
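The grouping idea can be sketched in a few lines. This is an illustration only, not Mastermind's implementation: the `PROTEIN_EFFECT` lookup table and the function names are hypothetical, and a real system would derive the protein consequence from the transcript sequence rather than from a hard-coded table.

```python
from collections import defaultdict

# Hypothetical lookup from (gene, cDNA description) to protein-level description.
# Only the two BRAF examples from the text are included.
PROTEIN_EFFECT = {
    ("BRAF", "c.1799T>A"): "p.Val600Glu",
    ("BRAF", "c.1799_1800delinsAA"): "p.Val600Glu",
}


def normalize(gene: str, cdna: str) -> str:
    """Map a nucleotide-specific description to its codon-specific group key."""
    return f"{gene}:{PROTEIN_EFFECT[(gene, cdna)]}"


def group_by_protein(mentions):
    """Group cDNA-level mentions under the protein-level change they produce."""
    groups = defaultdict(list)
    for gene, cdna in mentions:
        groups[normalize(gene, cdna)].append(f"{gene}:{cdna}")
    return dict(groups)


mentions = [("BRAF", "c.1799T>A"), ("BRAF", "c.1799_1800delinsAA")]
print(group_by_protein(mentions))
```

Both cDNA descriptions land under the single key `BRAF:p.Val600Glu`, so a search for either one surfaces articles citing the other.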
For coding variants, this “protein-level” philosophy gives the optimal results. As genomics has matured over the years, however, a more nuanced understanding of sensitivity versus specificity has evolved, especially in non-coding regions where the biological impact of variations is less direct.
Since non-coding variants had no pre-established nomenclature within the realm of protein descriptions, we did what any self-respecting problem solver would do: we invented one. Just as c.1799T>A and c.1799_1800delinsAA could be grouped into p.V600E, we grouped all variations within a given intron as “int”. For example, c.981-20C>T could be described as G327int within the protein space (meaning an intronic variant bordering the G327 codon). Likewise, variants within the splice-acceptor or splice-donor sites could be grouped into “sa” and “sd”, respectively, and variants within the untranslated regions could be grouped into “5′UTR” and “3′UTR”.
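As a rough sketch of how such an “int” label could be derived from a cDNA description: the anchor coding position (981 in c.981-20C>T) falls in codon 327, since each codon spans three coding bases. The function name and regular expression below are my own assumptions for illustration, the amino acid letter is passed in by the caller because it depends on the reference sequence, and the sketch does not try to separate out the “sa” and “sd” groupings.

```python
import re

# Matches the start of an intronic cDNA description such as "c.981-20C>T":
# an anchor coding position, a sign, and an offset into the intron.
INTRONIC = re.compile(r"^c\.(\d+)([+-])(\d+)")


def intron_group_label(cdna: str, residue: str) -> str:
    """Build a label like 'G327int' for an intronic cDNA description.

    `residue` is the one-letter amino acid of the bordering codon; it must
    come from the reference sequence, which this sketch does not look up.
    """
    match = INTRONIC.match(cdna)
    if match is None:
        raise ValueError(f"not an intronic cDNA description: {cdna!r}")
    anchor = int(match.group(1))
    codon = (anchor + 2) // 3  # ceiling of anchor / 3: three bases per codon
    return f"{residue}{codon}int"


print(intron_group_label("c.981-20C>T", "G"))
```

Any other variant anchored to the same exon boundary, whatever its offset into the intron, receives the same label and therefore falls into the same search group.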
Because the effects of non-coding variants don’t tend to be as discrete as those of coding variants, this helps maximize the sensitivity of results. This approach has yielded great results in many cases, such as for one user of Mastermind searching for an intronic deletion c.2888-15_2888-4del12. Before Mastermind, they had never found any evidence citing this variant for a patient with non-small cell lung cancer. However, the sensitivity of the Mastermind search using the D963int grouping returned an article about non-small cell lung cancer associated with intronic variants including c.2888-16_2888-3del14.
However, this approach can also produce false positives. For example, the results for c.981-20C>T may be cluttered with results citing c.981-4G>A near the splice site, or even variations at the opposite end of the intron, which are less likely to behave like the searched variant despite sitting within the same intron.
As the field has matured, so too has our understanding of sensitivity. Our stated philosophy that sensitivity > specificity still holds, but a search that returns too many results can cause important ones to be overlooked or ignored. In this sense, too many true positives can cause practical false negatives, because the user misses them within the data.
On the other hand, we don’t want highly relevant results to be missed simply because an article didn’t exactly match the nucleotide-specific description of the variant given in the search.
With these goals and considerations in mind, read my companion post to see how we’re improving Non-coding Variant Precision and Prioritization in Mastermind.