Mark Kiel, Founder and Chief Science Officer of Genomenon, was a featured presenter for the LabRoots Virtual Genetics & Genomics Event on May 8th and 9th, 2019. This is a transcript and slides from his talk, which opened the event on Thursday, May 9th.
Use of the Mastermind Genomic Search Engine for ACMG Variant Interpretation
NGS, or next-generation DNA sequencing, as you’re undoubtedly aware, has dramatically changed the face of molecular medicine, and the utility of this modality is increasingly recognized in clinical practice, not just for cancer but also for constitutional disease. So the utility is expanding across different patient populations and for a growing number of different types of patients. There’s a very strong incentive for molecular diagnostic and pathology labs to implement next-generation sequencing in their clinical care, and a growing number of labs are doing so. If they’ve not already implemented NGS, whether whole exome, whole genome, or very large comprehensive gene panels, they’re making moves in that direction.
This is very exciting from my perspective because I entered graduate school when the Human Genome Project was being completed, and I matriculated into molecular pathology training through my clinical pathology residency right as the next-generation DNA sequencing revolution was underway. So I feel like we’re more than ever on the verge of fully harnessing the power of genomic medicine, and though the details are complex, the goals of realizing personalized medicine in the genomic era are really quite simple.
Simply put, they are to explain a patient’s genetic disease or, if you magnify that to entire patient groups, genetic disease more broadly, and even further and increasingly, to use that knowledge to predict the development of disease even before symptoms appear. The real challenge there is to do this quickly and inexpensively.
In previous talks, I’ve said that the Human Genome Project made it seem possible that we could realize this vision of precision medicine through genomics, and the next-generation DNA sequencing revolution made it seem plausible that we could sequence a patient’s genome in routine clinical practice. What we’re faced with now is the challenge of doing that quickly and inexpensively. As I’ll allude to in subsequent slides, that is properly referred to as the bioinformatics bottleneck, and the challenge there is making it practical: making it cost-effective, and making it possible to get this information and interpret this data quickly enough that it’s beneficial to patients who have been recently diagnosed clinically or who otherwise don’t yet know that they have a genetic disease.
How can we make this happen? This is a useful slide that schematizes the opportunities afforded by the NGS revolution and some prevailing parameters in clinical medicine, but also highlights some opposing forces. At the top [of this diagram], the new opportunities include the dramatic reduction in the cost to produce sequence data. This has outpaced Moore’s law for shrinking the size, reducing the cost, and increasing the data content of silicon chips. We have not seen, as a species, such a dramatic technological improvement as we have in the reduction in the cost and the time it takes to produce DNA sequencing data. That feeds forward through this pipeline and greatly facilitates the ability of genetics labs and pathology labs to automate their laboratory workflow. Beyond the ability to streamline and automate laboratory workflow with NGS technology, as I’ve alluded to, there’s an increasing awareness of the utility of this genetic data for multiple patients, multiple different patient types, and even multiple circumstances, as we’re seeing with non-invasive prenatal testing or liquid biopsy for detecting cancers at a molecular level before symptoms are manifest.
I like to call this the virtuous circle of using next-generation DNA sequencing to inform that research and almost immediately having that research inform clinical practice and then coming back around again where increasingly larger and larger clinical studies can be performed using NGS to further enhance the research. We’ve seen this play out repeatedly certainly since my graduate student days.
It seems to be exponentiating in its value, but there remain challenges, and those challenges include the payer system and the minimal reimbursement for diagnosis that takes place even where costs for producing this data are decreasing. It’s such a new technology, and the healthcare infrastructure has not yet risen to the challenge of ensuring that this sequencing modality is made available to all patient types. Even further, there’s next to no reimbursement for prevention. Particularly in the United States, the healthcare model is predicated on treating symptoms after disease has developed, and very minimal effort is expended on preventive techniques. Part of the reason for that is a lack of incentive to reimburse for those types of interventions or, rather, a lack of awareness of the benefit of intervening early before disease presents itself. But the last piece there, perhaps the most important and the one that I want to touch on most during this talk, is the high interpretation cost. If you go up to the upper left where I began, there’s this dramatic decrease in the cost to produce the data, but there has been a stasis in terms of how much time, and therefore cost, is required to interpret that data.
So, where in the upper left the NGS improvements have exponentially increased the amount of data that we can produce, that in most cases leads to an exponentiation of interpretation cost, and that is the real sticking point for fully realizing genomic medicine.
Here is a little bit of data to put a finer point on my assertion about high interpretation cost. With the increased use of next-generation sequencing and more and more patients being sequenced in this way, we’re finding more and more things. More and more portions of the genome, either additional exons or more genes, bigger panels, up to and including whole exomes and whole genomes, are being sequenced on a routine basis. So naturally we’re discovering more things than we knew about before, and that is challenging the efficiency of the genome, exome, or panel interpretation process.
So, this is from a recent study in Genetics in Medicine that demonstrated that 40% of all pathogenic variants, after interpretation has been performed for these large studies, exomes or large panels, are novel. That means they’ve never been seen before, and it requires that variant scientists, pathologists, or geneticists investigate that data. And then the flip side of that, from the same study, is an indication that 60% of the variants that are seen as a result of those molecular assays are seen once and only once.
And so, putting these together, you can see how, even with an increased publication rate and increased representation of the variants that we’re now learning about from clinical practice or from research efforts, in clinical practice we’re seeing this 40-60% of variants that are novel, that demand extra care and attention, and that can’t simply have their information extracted from existing databases.
So, I’ve alluded to this before but I’ll say it here; this is a colloquialism in the genomics field. There’s the $1,000 genome. It’s actually decreased below that price point, because NGS has made sequencing no longer the rate-limiting step; indeed, it’s become a commodity. But there’s nevertheless this million-dollar interpretation, and the million-dollar interpretation, or the bioinformatics bottleneck, means that we still have to interpret the data.
Again, to put a finer point on why that is a challenge: there’s just too much information. There are so many variants, as I alluded to, and there’s a large amount of data associated with each of those variants that is necessary to properly understand what those variants mean in clinical practice. Another challenge is that this data is very unstructured. There are databases that I’ll talk about in a moment where there’s a high degree of structure around those datasets, because they were produced with a view toward high-throughput use. But by and large the critical pieces of information coming from smaller-scale clinical studies are highly unstructured, and they come from the empirical medical literature that has been produced over many decades of genetic investigation.
So, too much information and highly unstructured information are two huge challenges in scaling variant interpretation. Another challenge is that, while there are steps forward in routinizing the process of interpreting these variants and the data that goes along with them, the guidelines are complicated, I think almost by necessity. But the fact that they’re complicated, and the fact that the information is vast and unstructured, has really challenged our ability to streamline their implementation, particularly in automated workflows. We’ll speak to that in a slide or two. And then finally I’ll emphasize a growing challenge in the genomics field for clinical medicine, and that is the mandate to reinterpret data.
You’ll get a sense for a consistent thread through this talk that the clinical information available in the scientific literature and from the research that’s being done in research labs does not stop. And, it particularly does not stop as a result of the NGS revolution. There’s this continual need – if you come up empty and find no information about your variant – there is a continual need to return to that literature to return to those data stores of information and see if anything new has come up that will change the patient’s diagnosis and inform their clinical care in a way that was not possible six months ago or three months ago or even a week ago because no information was available at that time.
So, with respect to the interpretation guidelines, this is the substance of the talk that I want to lay out here: the ACMG variant interpretation guidelines from this reference that I’m showcasing. ACMG, for those that aren’t aware, is the American College of Medical Genetics and Genomics, and they’ve allied with the Association for Molecular Pathology to produce this framework for interpretation of variants using the various data stores that I’ve mentioned. The aim is both to increase the accuracy of the interpretations for variants that are coming out of these large-scale next-generation sequencing platforms, large panels, exomes, or genomes, and to increase the reproducibility of the interpretations. So if you are two different variant interpretation groups, or even two different variant scientists within the same group, and you have available to you all of the same information and all of the same data for a molecular lab, there’s every hope and expectation that those two interpretations would be identical. That’s effectively the goal of the ACMG/AMP scoring criteria: to enhance the reproducibility and accuracy of these variant interpretations.
And there are very positive signs that these guidelines are being widely adopted. They’ve been adopted outside of the US and the UK, they’ve been published in Chinese to meet the demands of the expanding Asian market, and they are used by 95% of labs worldwide, which is a very positive sign. There are some challenges, typically idiosyncratic challenges in different clinical circumstances, that complicate this attempt to simplify and reduce to practice the variant interpretation guidelines. But as a community those are being worked out.
This will not be an exhaustive discussion of the ACMG criteria but rather a high-level overview. There’s something called the evidence triad for interpreting the meaningfulness and clinical significance of variants that come out of exome, genome, or large panel molecular assays. Two of those sit at the base of the pyramid, beginning with statistics on population frequency in healthy normal cohorts.
So that’s population data that comes from resources like gnomAD or ESP6500, which many of you will be familiar with. That’s one arm of the triangle. Another arm is computational data, or in silico models that predict the pathogenicity of a given variant based on the three-dimensional conformation of the protein, awareness of domain structure, and/or understanding of evolutionary conservation. Together those two facets of the evidence triad are typically not sufficient for a variant to be interpreted, and so those variants that are otherwise not interpretable are properly called variants of uncertain significance, or VUS. The way that we better understand those variants and decide which of them are pathogenic requires literature curation. It requires evidence at a functional level, where authors of scientific studies have performed in vivo or in vitro studies of the consequence of that patient’s variant and have determined that there’s some change in the function of the protein. That would be a strong indication of pathogenicity for the variant scientist who’s tasked with understanding this patient’s data.
Similarly, and in contrast to the healthy normal population statistics that come from that first arm of the triad, there are case studies: either individual probands within a single pedigree, or multiple pedigrees, or large cohorts of patients who have had sequencing performed and where the clinical researchers have identified segregation patterns of that variant tracking with development of disease in those patients. So functional studies and clinical segregation data are the two major components of evidence that inform the ACMG classification schema, and increasingly those require literature curation. That’s where I want to spend the rest of the time: talking about the challenge of parsing the literature, identifying the right references, and parsing those references to pull out the needed information to make the designation of likely pathogenic or pathogenic according to this ACMG framework.
So I’m a clinical pathologist, so this is the air I breathe. What are we missing, and why do we have so much information when I really want to answer one particular question? That’s sensitivity and specificity. In the first scenario, sensitivity, the question that you’re asking yourself as a variant scientist is: am I missing any publications? The place that most labs go now is their standard recourse for literature investigation, which is typically PubMed or Google or Google Scholar. With a simple search in Google Scholar you may return no documents, no results. The question still remains: is that because of some inaccuracy in my search query, or is it truly because there’s no evidence in the medical literature that matches the variant that I saw in my patient? There’s a myriad of ways that you can perform a search and not return any results when in fact those results do exist. These can be things as simple as not including synonyms of the gene that you’re searching for, or the very complicated variant nomenclature and the multiple different ways that a variant can be accurately described, either following convention or using colloquial names. So a negative result in a non-technical tool like Google Scholar still meets with unease for variant scientists who aren’t sure that they have performed the searches as comprehensively as they otherwise could.
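One way to picture the synonym and nomenclature problem is as query expansion. The sketch below is purely illustrative and does not describe how PubMed, Google Scholar, or Mastermind work internally. The gene symbol `GENE1`, its alias, the residue change, and the helper names are all hypothetical; it only shows how quickly the number of equivalent search strings grows for a single missense variant.

```python
from itertools import product

# Standard one-letter to three-letter amino acid codes (subset for brevity).
THREE_LETTER = {"R": "Arg", "C": "Cys", "F": "Phe", "L": "Leu", "K": "Lys"}


def variant_spellings(ref: str, pos: int, alt: str) -> list[str]:
    """Return common textual spellings of one missense change."""
    short = f"{ref}{pos}{alt}"                             # e.g. R100C
    long = f"{THREE_LETTER[ref]}{pos}{THREE_LETTER[alt]}"  # e.g. Arg100Cys
    return [short, f"p.{short}", long, f"p.{long}"]


def expand_queries(gene_names: list[str], ref: str, pos: int, alt: str) -> list[str]:
    """Pair every gene name or alias with every variant spelling."""
    spellings = variant_spellings(ref, pos, alt)
    return [f"{gene} {spelling}" for gene, spelling in product(gene_names, spellings)]


# Hypothetical gene with a single alias; real genes often have several.
queries = expand_queries(["GENE1", "GENE1-ALIAS"], "R", 100, "C")
for query in queries:
    print(query)
```

Even this toy case yields eight distinct queries, and a search that omits any one of them could miss a paper that happens to use that spelling.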
The other problem, which I’m illustrating here for a different variant, is that there’s just too much information. Google is vast; it doesn’t just include scientific literature and patents, as Google Scholar does, but a great deal of additional information. Depending on what you’ve searched for, you can have either many thousands of potentially on-target results or many hundreds or thousands of results that are false positives, and both of those scenarios get to the question of which publications I should be paying attention to. I don’t have time to look through these thousands or hundreds of papers. I want a more specific result to be returned to me.
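To make the two terms concrete, here is a minimal sketch of how sensitivity and specificity could be computed for one literature search, assuming you somehow knew which documents in the corpus were truly relevant. The document IDs and the function name are made up for illustration, and precision is included because it is the figure most people have in mind when they complain about false positives.

```python
def search_metrics(returned: set[str], relevant: set[str], corpus: set[str]) -> dict[str, float]:
    """Score one search, assuming `returned` and `relevant` are subsets of `corpus`."""
    tp = len(returned & relevant)           # relevant papers that were found
    fn = len(relevant - returned)           # relevant papers that were missed
    fp = len(returned - relevant)           # off-target papers that were returned
    tn = len(corpus - returned - relevant)  # off-target papers correctly left out
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Toy corpus of ten documents, four of which truly discuss the variant.
corpus = {f"doc{i}" for i in range(1, 11)}
relevant = {"doc1", "doc2", "doc3", "doc4"}
returned = {"doc1", "doc2", "doc7"}

metrics = search_metrics(returned, relevant, corpus)
print(metrics)
```

In this toy case the search finds half of the relevant papers (sensitivity 0.5) and one of its three hits is off target, which is the kind of result that leaves a curator unsure on both counts.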
So sensitivity and specificity are a huge challenge that limits the ability to automate literature curation for variant interpretation. But suppose you do have a maximally sensitive solution; before I start talking about Mastermind, let’s say it’s a magic solution. And suppose you also have a way to specifically decide which of those references are going to be most beneficial in your curation. You still have the challenge of extracting that information from an unstructured, sometimes very dense scientific publication, and in many scenarios you’ve prioritized your data, by maximizing the specificity of your search, down to a couple dozen references.
Typical research studies that have looked into this matter of fully investigating individual patients’ variants show that just a single variant can take between 30 minutes and 3 hours, depending on the complexity of the result, the number of references that are returned, and the type of evidence that is being sought to adequately score the variant for ACMG criteria. So at present, extracting this genetic information from these references requires human intervention. While we think that will always be true, we do feel at Genomenon, particularly with the tools that we’re building in the form of Mastermind, that the requirement can wane with advances in automation. The final arbiter will always be human expert variant interpreters, but we feel there’s a great deal of legwork that can still be automated, and that’s what I want to highlight here in the last half of the talk.
So just to summarize what I’ve mentioned here and to put it in the context of the field in general, manual literature curation doesn’t scale. It’s too time-consuming, there aren’t enough expert variant scientists trained to meet the patient sample demand and the larger mission of precision medicine in genomics.
There’s just too few of us who have been adequately trained to meet this need and increasingly, as I mentioned before, the literature doesn’t stop. In the past year or so there have been over half a million new studies that have genetic or genomic content and it seems to be growing at a faster rate.
So that’s the background; that’s the problem statement. For the last part of the talk, I’d like to introduce the Mastermind Genomic Database to those of you who aren’t familiar with it. It is properly a search engine: a comprehensive index of the genomic literature that’s annotated for clinical and functional variants. I like to say, imagine if Google Scholar never slept and had gone to med school and specialized in genetics and molecular pathology. That’s effectively what we’ve created with Mastermind, and we’ve presented the results in a way that we know comports with what variant scientists are looking for. So in essence we’ve attempted to minimize the manual work that’s required to first find the right material, to organize and annotate that material, and then to very quickly make decisions based on the evidence that’s presented to you.
So this slide is a reflection of the data content in the Mastermind Genomic Database. We understand fully all of the titles and abstracts that comprise the medical literature, and even more fully we understand all of the content in six and a half million full-text genomic articles. They are indexed very thoroughly for any one of many thousands of different diseases, the whole spectrum of human disease (and we will soon be adding human phenotypes as well to our ontology and indexing process), and for any one of many tens of thousands of human genes. In any way that any of those terms, the diseases, phenotypes, or genes, can be described, we understand which references talk about which of those entities. And then, even more fully, in papers that mention genes, we understand when a variant is described: a variant in any one of the transcripts from any one of those genes, in any position, of any type, coding or non-coding, indels or SNVs, and whether the authors have used a conventionally accepted nomenclature or have deviated from it.
This is, in essence, what Genomenon has been working toward for the better part of the last five years: perfecting this art of recognizing these pieces of the puzzle, these sources of evidence for interpreting the variants, and then presenting that data in a way that really resonates with existing workflows for variant scientists.
So without showing you everything in the software, this is a reflection of some of the data components and features that we have in the Mastermind software. On the left, you’ll notice a comprehensive variant landscape. This happens to be for the fibrillin-1 (FBN1) gene. Along the x-axis of that diagram is the linear axis of the protein, from the first position to the end of that protein, and on the y-axis is a reflection of how many times each of those variants, the individual bars, has been cited in the literature.
Then in that diagram you can see the functional domains that comprise that protein. I chose fibrillin-1, which has a loss-of-function mechanism, because it really underscores how exhaustive the automated search in Mastermind is and how comprehensive the data is for each gene.
So this is true not just of fibrillin-1, and certainly not just of constitutional disease, but truly of the many tens of thousands of genes in any disease context, from any one of the articles ever published. Then over on the right is a reflection of, as I suggested, what we’re doing to evidence that data, so that variant scientists needn’t take the Mastermind software at its word but rather can take a deep dive and fully understand what the author intended to say about that variant. Even more fully, we’ve given them the power to enhance the specificity of their search results, where the previous slide indicated that the search is maximally sensitive too.
To turn your attention as a user to specificity mode: we’ve made it possible for users to investigate that evidence by prioritizing the list of articles that mention their variant, using keywords that prioritize the data along ACMG guidelines. The example that I’m showcasing here is a search that’s looking for functionally significant references to identify PS3 or BS3: strong evidence of pathogenicity based on functional studies, or otherwise strong evidence of a benign variant based on those functional assays.
Here’s a reflection of some of those keywords that we’re using to recognize which of the articles that have cited your variant are the most likely to convey that functional significance, and then to present that evidence to you in the form of sentence snippets from each of those references. So that’s a very high-level view, in the context of what we’re talking about today, of the Mastermind software and its use in variant interpretation. The next thing that I want to do is walk through a couple of cases that illustrate the benefit of using Mastermind, again for maximizing sensitivity and specificity, as well as for very quickly showcasing the requisite evidence from these references that is meaningful for your workflow automation.
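The general idea of keyword-based prioritization with sentence snippets can be sketched in a few lines. This is an assumption-laden toy, not Mastermind’s actual algorithm or term list: the keyword weights, the article texts, and the function names are all invented for illustration.

```python
import re

# Hypothetical keyword weights suggesting functional evidence; NOT Mastermind's real list.
FUNCTIONAL_KEYWORDS = {"in vitro": 2, "in vivo": 2, "functional": 1, "activity": 1, "assay": 1}


def score_sentence(sentence: str) -> int:
    """Sum the weights of every keyword that appears in the sentence."""
    text = sentence.lower()
    return sum(weight for keyword, weight in FUNCTIONAL_KEYWORDS.items() if keyword in text)


def rank_articles(articles: dict[str, str], variant: str) -> list[tuple[str, int, list[str]]]:
    """Rank articles mentioning the variant by keyword score of the mentioning sentences."""
    ranked = []
    for article_id, full_text in articles.items():
        sentences = re.split(r"(?<=[.!?])\s+", full_text)   # naive sentence splitter
        snippets = [s for s in sentences if variant in s]    # sentences citing the variant
        if snippets:
            score = sum(score_sentence(s) for s in snippets)
            ranked.append((article_id, score, snippets))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Made-up article texts for a made-up variant.
articles = {
    "paper_A": "We sequenced the family. The R100C variant segregated with disease.",
    "paper_B": "An in vitro assay showed that R100C abolished enzyme activity. Controls were normal.",
    "paper_C": "This paper discusses an unrelated gene.",
}

ranked = rank_articles(articles, "R100C")
for article_id, score, snippets in ranked:
    print(article_id, score, snippets)
```

Here the functional paper rises to the top with its supporting sentence attached, the segregation paper follows, and the paper that never mentions the variant is dropped.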
So the first example, which I’ve talked about on at least a couple of other occasions, is in the UROS gene. It is a missense variant in the 73rd residue, which is a cysteine in the reference sequence but has changed into an arginine. Missense variants can be particularly challenging. They’re the most numerous of all the different types of variants, for obvious reasons, but you’re not able to jump to a higher tier of likely pathogenicity as you might be in certain circumstances for loss-of-function variants such as frameshift or nonsense. Missense variants can sometimes be polymorphisms, can sometimes be functionally inert passenger mutations, or other times can really damage the functionality of the protein and lead to disease. So they are somewhat enigmatic in that way.
This example is a reflection of one of those potential enigmas that is only solvable by going out to the literature. This variant has a low frequency in these population databases, as I’m showing there on the left, and further it’s predicted to be damaging, providing some small supporting evidence of pathogenicity. That’s PP3, or pathogenic supporting evidence in the third category. But those pieces of evidence, those lines of sight into the evidence triad, are not sufficient to label this variant. What’s required are two additional strong lines of evidence, or otherwise one strong line of evidence or two to three moderate lines of evidence, and the only recourse, having exhausted the other two arms of the triad, is to go out to the literature. So effectively that’s what we’ve done in this example, which comes from one of our users. We went out to PubMed and Google Scholar, or rather took their information and included it here.
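For readers who want to see how evidence codes combine, here is a simplified sketch based on my reading of the pathogenic-side combining rules in the 2015 ACMG/AMP guidelines. The spoken summary above is looser than the published table, so treat the two as approximations of each other. Benign criteria and conflicting evidence are deliberately left out, the function names are my own, and the evidence codes in the example are illustrative assumptions, not the actual curation of the variant discussed in the talk.

```python
from collections import Counter


def strength_of(code: str) -> str:
    """Map a pathogenic evidence code such as 'PS3' to its strength tier."""
    for prefix, tier in (("PVS", "very_strong"), ("PS", "strong"),
                         ("PM", "moderate"), ("PP", "supporting")):
        if code.startswith(prefix):
            return tier
    raise ValueError(f"not a pathogenic evidence code: {code}")


def classify_pathogenic_side(codes: list[str]) -> str:
    """Apply the pathogenic and likely pathogenic combining rules only."""
    counts = Counter(strength_of(code) for code in codes)
    vs, s, m, p = (counts[t] for t in ("very_strong", "strong", "moderate", "supporting"))

    pathogenic = (
        (vs >= 1 and (s >= 1 or m >= 2 or (m == 1 and p == 1) or p >= 2))
        or s >= 2
        or (s == 1 and (m >= 3 or (m == 2 and p >= 2) or (m == 1 and p >= 4)))
    )
    likely_pathogenic = (
        (vs == 1 and m == 1)
        or (s == 1 and 1 <= m <= 2)
        or (s == 1 and p >= 2)
        or m >= 3
        or (m == 2 and p >= 2)
        or (m == 1 and p >= 4)
    )
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    return "Uncertain significance"


# Illustrative only: in silico support alone, then with assumed additional codes.
print(classify_pathogenic_side(["PP3"]))
print(classify_pathogenic_side(["PS3", "PM2", "PP3"]))
```

The point of the sketch is the shape of the logic: a supporting in silico prediction alone leaves a variant uncertain, and a strong functional finding from the literature is often what moves it across the threshold.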
We also compared those results with Mastermind. The conventional search results are here on the left: PubMed, which basically has insight into the titles and abstracts of the references, and Google Scholar, which has a little more penetrating insight because it has the full text of many of these references. The results show that PubMed is a shallow reflection of the data content, showing only seven results compared to the hundred and nineteen that Google Scholar returned, and that is after the user has understood all of the different ways that the gene and the variant can be configured, which is effortful in and of itself.
You’ll notice, if you look closely or if you’ve ever used Google Scholar, that those results are very unorganized. They’re not thoroughly annotated, there are a number of duplicates, and there are a number of highly off-target results: falsely recognizing the gene, inadequately recognizing the variant, having a mismatch between the gene and the variant, or not actually including both of the terms that you’ve searched for. In Google’s attempt to produce as many results as it can, there are results where, when you investigate, you can’t see why they were returned. So just because there’s more information there does not mean that it’s a better solution.
If you’re trying to maximize the sensitivity of your search certainly that would be better but if you’re trying to enhance the automation and efficiency of your workflow in many circumstances the disorganized presentation and the cluttering of the results with false positives can pose a significant challenge.
So contrast those two results with the results of the Mastermind search. In this particular example we’ve done a pretty deep investigation of the returned results in all three circumstances. In this case Mastermind returned 85 results and, critically, there were no duplicates among those 85 results. There were no false positive results returned by Mastermind and, I think critically, the results were prioritized by clinical relevance, with all of the evidence necessary to very quickly determine which reference is the most meaningful and to validate or verify how the evidence from those references figures into the ACMG criteria.
That’s what Mastermind does, and we make that very easy for users, to the point where a user can perform this search on a single variant and complete that investigation in five to fifteen minutes. That is in contrast to the 30 minutes to 3 hours that is otherwise required to first identify references, go out to those references, pull out that evidence, and then investigate that evidence. Mastermind does all of that legwork upfront so the user can instead focus on the cerebral effort required to interpret the variants.
Just to underscore the benefit of Mastermind, this is one of the top results, I believe the top result, in Mastermind for this particular variant. Immediately it’s clear, based on a number of cues provided in the software, context clues from the full text, that this is a functional paper and that the authors have ascribed a functional significance to this variant in an in vivo mouse model of the disease that’s associated with this gene, porphyria. So if you saw this variant in your clinical workflow, this is a result that you would be able to come to within a matter of a few short minutes, compared to an otherwise disorganized and incomplete or non-comprehensive search strategy using PubMed or Google Scholar.
So on the other side, I’d like to talk about a different case. This is in a different gene, proprotein convertase subtilisin/kexin type 1, or PCSK1, and the variant is also a missense variant; in this case it’s a phenylalanine to lysine change. This variant is absent from the population data sets and it has no presence in the in silico predictive databases, and so we’re left bereft of any meaningful information to take this variant out of the realm of uncertainty into either benign or likely pathogenic. So we performed the same investigation using the conventional methods of PubMed and Google Scholar, and there were no results. This underscores the example that I gave before.
If you have no results, how can you have confidence that you haven’t done something wrong in your search, or that your search tools aren’t inadequate? In this case they were in fact inadequate. This is analogous to the situation that we hear about from many of our users, who repeatedly thank us and praise the sensitivity of Mastermind for finding references that they otherwise didn’t know about, sometimes for a matter of years. When the search is performed by the user, Mastermind will return a paper from the early 2000s, and the variant scientists at a given group were unaware that the paper had been published even at that early date, again because of the inadequacy of their search technique.
So this one result in Mastermind was also a functional study, and it allowed for the designation of PS3, or functional evidence of pathogenicity. That one paper, and that one item of evidence in the ACMG framework, was enough to take the variant out of uncertainty into likely pathogenicity. As I indicated, that is a very typical result that our users tell us about, and it’s why they continue using the Mastermind software.
I’d like to talk about what Genomenon’s overall mission is, and particularly my driving ambition as a co-founder of Genomenon and as the current Chief Science Officer. That has to do with the genomic landscapes that we’re able to produce for individual genes at a client’s request, or that we are otherwise working toward producing for, say, the ACMG genes or the Sanger consensus gene set for cancer.
Our goal is to thoroughly annotate all of the literature for each of these individual genes, understanding every one of the variants in each of those genes and providing a data set of the manually curated literature evidence that indicates whether those variants are pathogenic or not. This is a very brief overview of how we’re doing that. We begin by identifying the comprehensive and focused data content that I talked about coming from the Mastermind database, and then we use artificial intelligence, or better said, computational intelligence, to annotate, rank, and curate with a technical platform all of those variants from all of those literature sources before a final and expedited manual review. I’ll walk you very quickly through each of those steps, but that’s the overarching process.
So the first step, as I alluded to, is one we’ve already done and continue to do on a weekly basis: updating the data content in the Mastermind database for any disease, any gene, and every possible variant, and thoroughly annotating those lines of evidence with clinically and functionally significant annotations.
The next thing, as I said, is organizing that data, which happens de facto in the Mastermind indexing process. We know where all of these variants live in the medical literature and what diseases they consort with, and that’s a reflection in the Mastermind software of some of that data for a given variant.
We also understand in what contexts, in a given disease circumstance, those variants are described. Does the variant segregate with patients? Are there functionally significant experiments that have been performed that indicate this variant is a bad actor and that a patient with the variant is diagnosable with that disease as a result? I alluded to the ACMG/AMP criteria; we also have a whole battery of additional key terms such as prognostic significance or therapeutic significance, as well as diagnostic utility.
These results are ranked and prioritized using a proprietary algorithm that takes into account all of this information using some bespoke data processing algorithms that we’ve devised as well as some industry standards in data science and natural language processing.
Then, critically, we present that information to our team of manual curators, master’s level and above data reviewers. Because we know what we’re looking for, we can pattern our technology to present those results in a way that’s sufficient for us to very quickly adjudicate the meaningfulness of each of those results, similar to what I showed you in the Mastermind interface. The difference is that we’re able to do this now in batch mode for many hundreds and thousands of variants in any given project, and we’re able to annotate those results very exhaustively and accurately according to ACMG guidelines.
Then lastly, those results are delivered through the Mastermind Reporter, a reflection of which you are seeing up here. Every variant in any one of the genes of interest for a given study is investigated, the results from the manual curation and from the database organization are presented to the user, and those results are continuously updated.
So every week we understand what new information is presented, and every quarter those results are updated and presented to the user so that the database is as comprehensive as it could possibly be. Again, this is available through project work that we’re doing and in our own internal curation efforts, with a view toward eventually subsuming the entire human genome.
So that was my last slide. I wanted to say that Mastermind is available to users through the Basic Edition for research purposes. If you sign up for the Mastermind Basic Edition at the link that I’m providing below with the code labroots2019, you’ll get access to the Mastermind Professional Edition for two weeks. Mastermind Professional is intended for clinical practice in the way that I talked about earlier, and it allows you to see all of those streamlined capabilities for enhancing the accuracy and efficiency of your variant interpretation workflow. So that concludes my presentation.
I’ll reiterate that I’m available to any of the viewers for questions, which I’ll address by email after the live event. Otherwise, if you’d like to reach out and talk about any of the features of Mastermind, the data curation projects that we’re working toward, or any other forward-looking aspect of genomics and precision medicine, I’d be happy to take those emails. So with that I’ll adjourn, and thank you again for your attention.