The Mastermind Genomic Search Engine has the essential ability to filter the genomic literature by ACMG/AMP criteria.
The American College of Medical Genetics and Genomics (ACMG), in collaboration with the Association for Molecular Pathology (AMP) and the College of American Pathologists (CAP), guides the standards for the interpretation of genomic variants. These standards include classifying genetic variants into five categories based on specific scientific evidence.
While some of the evidence can be based on population data or computational data, genomic variants cannot be classified as pathogenic without citing evidence from peer-reviewed scientific literature. Mastermind is the only genomic search engine that provides an extensive search of all the scientific literature according to ACMG classification guidelines. Users can search Mastermind by disease, gene, variant, and ACMG/AMP criteria to find clinically prioritized scientific evidence that can be cited in patient reports.
Co-founders Dr. Mark Kiel and Steve Schwartz led a live discussion and demonstration of the ACMG/AMP filtering process in the Mastermind Genomic Search Engine.
- How to more efficiently identify and prioritize publications by ACMG/AMP variant classification guidelines
- How increased specificity and immediate access to annotated search results accelerate variant interpretation workflows
- How increased sensitivity in literature searches results in fewer false negatives
ACMG/AMP Masterclass Transcription
Hello, everyone, and welcome to the Mastermind Masterclass! My name is Candace Chapman, VP of Marketing here at Genomenon. Thank you for joining us for today’s live session. Wherever you are, we hope you are safe and healthy, and have at least one face mask that makes you feel like a superhero.
This Masterclass will focus on one of the most popular features of the Mastermind genomic search engine, which is the ability to filter by ACMG/AMP criteria. Filtering by ACMG/AMP criteria is a professional feature, so if you currently have the basic edition, some of this may not look familiar to you.
This is a great time to get reacquainted with the full functionality of Mastermind, and to see if you and your organization can benefit from Mastermind Professional Edition. It’s our goal to make Mastermind the first resource you use when interpreting variants, and I hope that this deep-dive Masterclass will help you get the most out of it.
Once again, we’ve gathered our two amazing founders to discuss Mastermind in depth. We also have an MVP from the data science team to walk you through the demo. Let’s welcome Mark Kiel for the scientific update. Hey, Mark.
Thanks, Candace and everyone. Today’s Masterclass is going to focus on the ACMG (American College of Medical Genetics and Genomics) guidelines for variant interpretation, and particularly how you can use Mastermind to guide your search and review of the empirical evidence necessary for accurate and efficient diagnosis of your patients. So, a little bit about the origin of the ACMG guidelines:
Many of you are undoubtedly familiar with the Richards et al. paper from several years back that first laid out the guidelines, a framework, for collecting, categorizing, and annotating the evidence necessary for your curation to build to and culminate in a variant interpretation: “pathogenic,” “likely pathogenic,” “uncertain significance,” “likely benign,” or “benign.”
The ACMG/AMP framework was the first attempt by that working group to provide reproducibility for these variant curation and interpretation activities. We’re going to focus on the ACMG side, the constitutional side here.
This poster I’m very proud of. I can’t take any credit for the aesthetics; that was all Candace and the marketing team, and some talented graphic designers on my team. I have no claim to ownership of how beautiful it looks, but it’s really a way to better understand the content of the Richards paper by segregating that information into separate subgroupings of the workflow.
So the way that I like to think about ACMG is by bifurcating the work into looking for evidence that’s intrinsic, and looking for evidence that’s extrinsic to your patient in their clinical scenario. On the intrinsic side, what I mean by that is twofold.
One, there are intrinsic properties of the variant itself that you’re trying to interpret. A very straightforward example would be the PVS1 or “Pathogenic: Very Strong” criterion for nonsense variants or frameshift variants that lead to loss of function of the protein, or otherwise demonstrably deleterious splice mutations. So that’s intrinsic to the variant.
Then also in the ACMG workflow is evidence that’s intrinsic to the clinical circumstance. A simple example is PS2, the second sub-category of “Pathogenic: Strong” evidence. If you find that the patient has a de novo variant, one that didn’t come from either mother or father, you obviously need the clinical circumstances surrounding the case to provide that evidence and information to inform the variant interpretation. So those are intrinsic: they come out of the casework and the very nature of the variant that you’re looking at, and they’re fairly straightforward.
What I want to turn your attention to now, though, are the extrinsic sources of evidence necessary to guide your interpretation. Those come in three flavors, which helpfully all begin with a P in my conception:
Firstly, Population frequency data, from publicly available resources like gnomAD. Secondly, in silico Predictive models of damage to the protein, which in part draw from evolutionary conservation across different species’ genomes. Those are databases that are quite likely already interpolated into your workflow and data annotation streams, so those two of the three aspects of the extrinsic evidence have been routinized in most or all labs that do this work.
The third aspect of the extrinsic evidence, the one I really want to focus on here, is what Mastermind can help illuminate. That is to say, the evidence from the Published literature: the empirical evidence that gets to the functional consequences of the variant from in vitro or in vivo functional studies, or that otherwise identifies evidence from clinical studies or case reports indicating that this variant has been seen before and has been ascribed to cause that disease, through segregation studies and detailed individual clinical workups, such as what you perform when you have a live case.
So intrinsic data, extrinsic data; the focus is on extrinsic, with a specific focus on the empirical evidence that can only come from data sources in the medical literature that Mastermind has mastery of. I’ll emphasize, where I talk about seven different streams of data, that the six streams that don’t include the empirical evidence from the literature are seldom if ever sufficient to allow you to ascribe pathogenicity to a variant, so there’s really a need for access to the information that we’re going to talk about, which is surfaced in the Mastermind software.
This poster was, as I said, a training tool that we devised for our own internal curation work. And, to date, using the internal suite of tools that Genomenon has at its disposal to maximize the accuracy and efficiency of variant curation, we have curated many tens of thousands of individual genetic variants using these guidelines. This poster and associated materials, such as the ACMG Variant Interpretation Cards that many of you may be familiar with, are useful to make sure that our curators are all on the same page and proceeding with their work in a very reproducible way, using those data streams that Genomenon internally has automated.
The purpose of the Masterclass today is to highlight two of those capabilities that we provide to our customers and basic edition users, the first of which is the Mastermind user interface or Genomic Search Engine, which undoubtedly many of you are already familiar with. I’ll emphasize that the professional edition is really critical for clinical casework. So you’re missing out on content and features if you’re only using the Basic Edition, whereas you have the full replete data and toolset in the Professional Edition.
Rachel will be walking through four fairly straightforward and specific examples that highlight the accuracy and efficiency of the use of the Mastermind user interface for the purposes of variant interpretation. When Rachel is done, toward the end of the webinar, Steve, my co-founder and CTO, will talk more broadly about how Genomenon specializes in genomic association data.
The aggregation, annotation, and interpretation of all of this information writ large is a core value of the Genomenon data and suite of tools. He’s going to talk about the other way that we enable our users: whether through partnerships with tertiary software providers or within your own home-brewed data processing infrastructure, Mastermind provides API data to fully automate and enhance the efficiency of large-scale, high-throughput clinical workflows. Rachel, if you want to take this opportunity to grab the screen, I’ll turn it over to you. Hey, Rachel.
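As a sketch of what that API-driven automation might look like in practice, the snippet below builds a query for one gene/variant pair and parses a mocked JSON reply. The endpoint URL, parameter names, and response fields here are hypothetical placeholders for illustration, not the actual Mastermind API; consult the official API documentation for the real interface.

```python
import json
from urllib.parse import urlencode

# NOTE: endpoint, parameters, and response shape are hypothetical stand-ins.
BASE_URL = "https://api.example-mastermind.test/v2/articles"

def build_query(gene, variant, api_token, categories=None):
    """Build a literature-search URL for one gene/variant pair."""
    params = {"gene": gene, "variant": variant, "api_token": api_token}
    if categories:
        # e.g. restrict results to ACMG functional-study keywords
        params["categories"] = ",".join(categories)
    return BASE_URL + "?" + urlencode(params)

def summarize_response(payload):
    """Extract the article count and top PMIDs from a (mocked) JSON reply."""
    data = json.loads(payload)
    return {
        "article_count": data["article_count"],
        "top_pmids": [a["pmid"] for a in data["articles"][:5]],
    }

# Mocked response standing in for an actual HTTP call:
mock_reply = json.dumps({
    "article_count": 2,
    "articles": [{"pmid": "12345678"}, {"pmid": "23456789"}],
})
url = build_query("UROS", "C73R", "YOUR_TOKEN", categories=["in_vivo", "in_vitro"])
summary = summarize_response(mock_reply)
```

In a high-throughput pipeline, a loop would issue one such query per variant in a case and feed the summarized article evidence into the downstream interpretation step.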
RACHEL: Hi everyone!
MARK: I’m going to be interjecting a little bit where I see someplace to add some color, but I’m going to turn my video off. So Rachel, if you can start sharing your screen, I’ll turn the control over to you.
Hi, everyone, thank you again for coming to our webinar today. I’m excited to share with you four case examples of how to use the ACMG interpretation in Mastermind. So let’s get started! Suppose we have a case with the UROS gene mutation C73R. And before I get into anything, I would like to share what each of the four quadrants mean in Mastermind, especially to those who might not be as familiar with Mastermind.
In the top right, we have our publication history, where each bubble represents a journal article. On the x-axis, we have the publication year, and on the y-axis, we have journal relevance. So the higher up a bubble sits, the more relevant the journal, and the size of each circle represents how often the variant is mentioned in the article, and whether it’s mentioned in the abstract or title.
In the bottom right, we have all of our lists of journal articles, and they are listed based on relevance in accordance with the top right quadrant. In the bottom left, we have our list of variants, and since we’re looking at the C73R variant, that one has been highlighted, where you’ll notice the cDNA position as well as the total number of articles.
In the top left, we have our variant diagram, where each bar represents a variant, with the x-axis being the protein position and the y-axis being the number of articles it’s mentioned in. For this case, let’s assume that we already know that C73R is associated with the disease porphyria. One way to check that is by looking at related diseases, and the top disease is porphyria. For this example we are interested in looking at functional studies only, so we’ll be going into filter categories.
You’ll notice that there are four main types of categories, the first being ACMG interpretation, the second being clinical significance, (that includes terms such as prognosis or therapy), then genetic mechanism, (that includes terms such as copy number variants or epigenetics), and the final category, which is significant terms in the abstract, (that also includes the disease). So since this is an ACMG Masterclass, today we’ll only be focusing on ACMG interpretation. For this case we’ll focus on functional.
There are two main sub-categories: in vivo and in vitro. Underneath each of these are specific terms that might be associated with these two different types of functional studies. For instance, under in vivo, you might get terms such as “knock-down” or “mouse,” and for in vitro, you might get terms such as “enzymatic assay” or “transfection.” For this case, we’re going to enable all of these key terms, and Mastermind has highlighted each of them.
Once we submit, the relevance of the journal articles changes. For this example, there has been a dramatic change in journal article relevance, and the article list in the bottom right has also changed in accordance with the relevance of the journals, including the filtered key terms. Looking at our first article, Mastermind has highlighted the genes and the variants: C73R has been highlighted, as well as UROS. For this case, we’re interested in seeing functional keywords only, so we’ll click on “keywords only,” and Mastermind has now highlighted the words “murine,” “knock-in mouse,” and “mice,” showing that this is an in vivo functional study.
Then, we’ll toggle down to our next journal article. Right next to the journal, we have a paper clip, which means that the variant is also mentioned in the supplemental data as well as the full-text article. In this article, the terms “in vitro” and “wild-type” have been highlighted, showing that this is an in vitro functional study. So in this example, we saw a case of in vivo as well as in vitro evidence.
Great, thank you, Rachel. A couple of things to highlight there. The big picture is the true power of the relevance calculation and how it cuts through the noise of the otherwise maximally sensitive result in Mastermind. With a simple click, you could see how the content was efficiently and optimally prioritized to focus on those functional studies.
The relevance calculation is something that Steve is going to be speaking to through the use of Mastermind’s API in many of our partnerships. The second thing to highlight here is the power of those full-text sentence fragments that show the context where the search terms, variant, or keywords appeared in the references used to derive the relevance. That bottom-right quadrant of full-text matches is an extremely valuable resource in Mastermind, and can either be sufficient for you to levy an interpretation about the meaningfulness of that evidence, or lead you to seek out the full-text content to review it in full, which we encourage in situations where you want to go into even more depth on that evidence.
So Rachel, why don’t you go ahead with the second example. This was a really great encapsulation of the power of those two features in Mastermind.
Of course, thank you, Mark. For our second example, suppose we have a case with the TARDBP gene mutation A315T, which segregates with the disease ALS. For this example we’re going to begin like we did with our last example, by filtering by “functional only” and enabling all, with Mastermind again highlighting all the key terms.
As we check the reprioritization, there have been changes, but none as dramatic as in our first example, meaning that this case contains a lot more functional studies than our previous one.
Let’s suppose for this case we already knew that there were a lot of functional studies, and now we’re interested in seeing case studies. So we’ll go under filter categories to “disable all” for functional, and now we’ll go into pedigrees and case studies. The different sections here include case genotypes, such as whether it’s a homozygote or a heterozygote; inheritance, whether it was autosomal dominant or autosomal recessive; and whether it segregates with the disease.
For this case, we’re interested in seeing case studies, so we’ll enable “case.” Mastermind only highlighted this section and did not highlight the rest. Once we submit, we’ll see the reprioritization. Based on the filtering, there is only one article that contains a case study. Looking at our first paper, the term “case report” has been highlighted.
Now, let’s see the full-text sentences, with “variants only” included. You’ll see the gene and the variant are both highlighted, and the second sentence notes that the patient does have a motor neuron disease with this particular variant. So in this example, we started similarly to the first example, with functional studies, but we also got introduced to case studies.
Great, thanks, Rachel. I really like the way that that prioritization illuminated the power of the enhanced specificity of the search terms that the user can specify in the ACMG criteria.
In this particular example, TARDBP is pretty well characterized at a functional level, and that was why the prioritization with the functional category still left a great number of articles prioritized, because this particular variant has been studied quite a bit. Just as, on the other side of the equation, Rachel focused on finding case reports, similarly in the functional category, if you’re looking for specific aspects of the function that any one of those prioritized functional studies contains, you can drill deeper and become more specific even within those functional categories, just as Rachel did within the clinical category key terms by focusing on “case report.”
I think, Rachel, you’re going to go into the keywords again in the third example, and I’ll encourage the viewers to note the numbers beside each of the key terms (maybe, Rachel, you can point those out in the third example), because those numbers indicate how many times each of those terms appears in the references, which will then dictate how they become prioritized.
All of that information is used, as I said, to drive the relevance calculation in the user interface, and then all of the various ways that our API is used to prioritize that content. So Rachel, why don’t you take it away with the third example.
Of course, thank you, Mark. For our third example, suppose we have a case with the ATP7B gene mutation H1069, which segregates with Wilson’s disease. For this case, let’s assume that we don’t have a preference for whether we want to see functional studies or case studies; we want to see it all. Under filter categories, we’ll not only enable functional studies, but now also enable all for pedigrees and case studies.
Mastermind allows you to enable multiple key terms across different categories, so no matter how many key terms you choose, the articles will reprioritize and place the most relevant articles at the top, depending on what keywords you have chosen. For this case, we have “functional,” “pedigrees,” and “case studies” all highlighted. As this loads, we’ll go into our first paper example.
Under this first article, under keywords, the terms “homozygous,” “Wilson’s disease,” and “compound heterozygous” are all highlighted, showing that this is a clinical study with patients diagnosed with Wilson’s disease with this particular variant.
You’ll see that Mastermind has also highlighted the variant. In the second example, we have the variant highlighted, as well as “homozygous” and “compound heterozygous.” Again, like the first article, this gives a clear indication that this is a clinical study. If we go to the third article, not only is “Wilson’s disease” highlighted, but also terms such as “immunoprecipitation,” or “Western blot,” or “mouse,” showing that these are assays or different models, showing that this is a functional study.
So for this example you get a really good mixture of not only clinical studies, but also functional studies by filtering both functional key terms and case key terms.
That’s great. This is a really great example of the efficient use of Mastermind to find the empirical evidence when you’re in situations where you need either functional studies or clinical studies, or both.
This is a common workflow for many of our users. In the example here, Rachel has highlighted how quickly, usually within the top five prioritized references, you get the information that you need, and that’s sufficient to make your interpretation.
The other thing to mention, a little bit behind the scenes, is how the relevance calculation is performed. Rachel alluded to the idea that we look for the frequency of the terms, including the variant that you’ve searched on, and that is certainly a major component of the relevance calculation. An additional and nuanced component is how close together the terms of interest that comprise your search are mentioned: whether they’re in the same sentence or the same paragraph, and whether they’re mentioned always together or frequently together.
That relevance calculation is extremely sophisticated in bringing to light in the prioritization those articles that have the most relevant content, even if you cross categories as Rachel had done. So in particular, if there’s a paper that mentions both functional studies and clinical studies, which is not uncommon, especially for some of the higher-tier journals like New England Journal of Medicine, that content gets prioritized first using this paradigm, which really maximizes the efficiency of a user’s workflow and is readily stereotyped.
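The frequency and proximity components Mark describes can be illustrated with a toy sketch. This is an illustrative model only, not Genomenon’s actual relevance formula:

```python
def frequency_score(positions):
    """Toy frequency component: more mentions of a term, higher score."""
    return float(len(positions))

def proximity_score(positions_a, positions_b, window=30):
    """Toy proximity component: pairs of mentions of two search terms
    within `window` tokens of each other score higher the closer they are."""
    score = 0.0
    for i in positions_a:
        for j in positions_b:
            distance = abs(i - j)
            if distance <= window:
                score += 1.0 / (1 + distance)  # adjacent mentions count most
    return score

# Variant mentioned at token positions 10 and 200; an ACMG keyword at
# positions 12 and 500. Only the (10, 12) pair is close enough to
# contribute to the proximity component.
freq = frequency_score([10, 200])
prox = proximity_score([10, 200], [12, 500])
```

A real engine would also distinguish same-sentence from same-paragraph co-occurrence, as Mark notes, rather than using a single token window.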
What Rachel did there can be repeated for tens or hundreds of variants in your daily curation activity to maximize the efficiency with which Mastermind surfaces the empirical evidence. So Rachel, I think you just have one more example, right? We can move into that one.
Thank you, Mark. For our fourth example, suppose you have a case of polycystic kidney disease with the PKD2 gene mutation arginine 322 leucine (R322L). As this loaded, Mastermind provided a yellow box that says “no articles match your search.” When this happens, Mastermind gives you three things. The first thing it gives you is every confidence that this variant has not been published. Mastermind has gone through over 30 million abstracts and full-text articles, as well as supplemental data, and came back with zero results.
The second thing that Mastermind allows you to do is to search for related variants. This means, are there other variants that lie within the same protein position? A way to check that is by going over to “variants” on the left side and typing in “arginine 322,” and Mastermind provided us with three missense mutations, and you can go through each of these and review each of the articles and see whether they contain clinical studies or functional studies. You can also check this by going into the variant diagram to zoom in and toggle over to the 322nd position, and Mastermind will show the three missense variants all aligned in the 322nd position.
The third thing that Mastermind allows you to do is to create an Alert. To create an Alert, all you do is click this button, “Create an Alert for the current search,” and you’ll notice that I had a notification at the top of my screen. And that is all you need to do, you don’t need to go back to Google Scholar or PubMed to continuously check for newly published articles that might contain the variant. Once Mastermind has found a newly published article that contains that variant, you will be notified by Mastermind immediately when that paper comes out.
Great, thank you, Rachel. Rachel brought up the three benefits when you encounter this situation, which happens in particular when you’re doing exome sequencing or working in rare disease. A null result here is a victory for Mastermind users; we refer to this as the “power of zero.” Rachel alluded to the fact that a user who sees this result can have every confidence that, if it’s not in Mastermind, it hasn’t been published and there’s no evidence to search for, and that search and realization took a matter of seconds. And we actually even provide quantitation around the lengths to which Mastermind has gone to determine that there are no mentions of this variant in the literature.
The other thing that she brought up was looking for additional variants at that same residue. That’s one of the things that I think Steve will speak to more broadly, about the power of Mastermind being able to aggregate this content. Just with a simple scroll, she was able to see related variants at that residue that will clearly inform in many ways the ACMG interpretation. Even when your very variant in your workflow has not been published, related variants may have been published that may very well inform and have an impact on your overall variant interpretation.
The last thing that Rachel highlighted was the power of our Alert capability. In this example, it was a one-off alert, which many of our users avail themselves of: when you have a null result like this (again, affirmation that there’s no published information about this variant), you’re empowered to create an alert and have any newly published information come to your attention through your user account and email. The true power of alerts, though, is realized for larger labs, commercial and academic, when you batch your alerts. When you’ve got a database of variants of uncertain significance, and you don’t, as Rachel said, have the time or desire to go out and find that information, Mastermind brings that information to your attention.
In many of our engagements with large reference labs that have such databases of variants of uncertain significance and have run these alerts routinely, we surface information that informs a change in the interpretation in between 10 and 12 percent, and sometimes even 15 percent, of those cases. So that alert capability, whether you’re looking back in time and running this data processing in batch, or looking forward in time to receive the information as it’s published, is an extremely powerful capability. This was a reflection of the sensitivity of Mastermind’s search capability, which was interwoven in the other examples as well, though those examples were more focused on specificity. So Rachel, I know that you just had those four, but if you’ll permit me, I’ll give you another variant.
RACHEL: Of course!
JAK3, and instead of the protein search, if you can use a cDNA search, c.1533, and I think it’s G to A. Let me see if that’s the right variant. There we go, that is the right variant. Everybody in the audience will have noted that there were two suggestions in the drop-down: one that indicated the protein change resulting from that cDNA change, and another that preserved the cDNA notation. Clicking on that cDNA change in the prompt, now when Rachel performs a search, declares to the Mastermind software that she cares explicitly about that nucleotide change. In the article pane, right next to those filled-in PDF icons, you’ll notice targets. Rachel, if you could click on that topmost prioritized reference: those targets indicate that the authors of that study have specifically mentioned that cDNA change in the context of describing their variant. You’ll also notice that those specific matches from those papers have been prioritized.
Just as that fourth example was a reflection of the power of Mastermind’s sensitivity, this is a reflection of the power of Mastermind’s specificity: the user has declared the desire to see the nucleotide-specific match prioritized first. But an important distinction to notice is that we didn’t throw away the other protein-level changes, because those are informative. So Mastermind is deftly playing both sides of the sensitivity-specificity fence in a way that maximally enables users to be sure that they’re not missing information, while also maximizing the efficiency with which they find the most informative evidence first. Since Steve and his team were behind this initiative, the targets in this case, I’m pretty sure Steve will have more to say, particularly when it comes to non-coding variants, which can be more of a challenge. In Rachel’s examples we stuck to coding changes, but Mastermind contains non-coding variant data as well.
So Rachel, thank you very much for walking through the user interface. We’re gonna change our viewpoint now, and Steve will address a couple of the follow-on components from Rachel’s demo examples, and then he’ll take it up to a high level and talk about genomic associations and how all of that information that Rachel spoke to can be automated through the API. So Steve, I’m going to disappear with Rachel, and you’ll be able to take control here.
Sounds good. A couple of things before I get to the genomic associations and the power of the API. Rachel and Mark did a great job of explaining the idea of the prioritization in the articles, and I think I can slightly further expand upon how we think about article prioritization, because Mark is right: we try to maximize sensitivity and then optimize specificity within the constraint of maximal sensitivity.
The idea behind that philosophy is that, if we’re going to err on either side of sensitivity or specificity, we want to err on the side of sensitivity, because a false positive is much easier to recover from than a false negative, particularly when dealing with such critical information as rare variants.
For a false positive, we provide you all of the tools that Rachel showed you to quickly go through the results and determine which ones are most relevant to you, whereas a false negative is not as easy to come back from, because you never see it in front of you.
I think Mark did a good job of explaining how Mastermind plays both sides of the fence on sensitivity versus specificity. One of the ways that we do that is by making sure that we show you all of the relevant variants and then prioritizing the most relevant at the top of the results, so that you don’t miss anything, but also so that you don’t spend time unnecessarily wading through evidence that might be less relevant for your particular case.
One of the ways that we do that for prioritization is we tend to think of prioritization in two separate pieces.
One is the sort of inherent relevance of a given article in isolation, in and of itself. How relevant is the journal it was published in? What’s the impact score of the journal? How recent is the publication (meaning, how likely are you to not already know that information)? How many times has that paper been cited? Things like that are intrinsic to the paper, regardless of the search that you’re doing to return that paper in the results. So there’s the intrinsic relevance of a paper versus the relative relevance of the paper with respect to the search that you are doing.
When we calculate the relevance of the articles and prioritize them for your query, what we’re really doing is taking two different relevance calculations and then combining them. We’re taking the absolute relevance of the paper, intrinsic to the paper itself, and we’re taking how relevant the paper is to the thing that you search. That’s what Rachel hit upon, talking about the genes and the variants and the disease or the keywords that are mentioned in the title, versus the abstract, versus the full text or the supplemental, and weighting those, assigning different scores or weights to how the terms that you search are both mentioned in the paper and relevant to each other within the paper. [For example,] a variant was mentioned right next to one of your ACMG keywords, versus the variant being in the abstract and the keyword being on page 13.
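The two-part scoring described above can be sketched as follows. The weights, fields, and combining rule are illustrative assumptions for the purpose of the example, not the production algorithm:

```python
CURRENT_YEAR = 2024  # fixed here so the toy example is deterministic

# Illustrative field weights -- not Genomenon's production values.
FIELD_WEIGHTS = {"title": 3.0, "abstract": 2.0, "full_text": 1.0, "supplemental": 0.5}

def intrinsic_relevance(article):
    """Static component: properties of the paper itself, independent of
    any particular search (journal impact, citations, recency)."""
    recency = 1.0 / (1 + CURRENT_YEAR - article["year"])
    return 0.5 * article["journal_impact"] + 0.01 * article["citations"] + recency

def query_relevance(article, search_terms):
    """Dynamic component: where the searched terms appear in this paper."""
    score = 0.0
    for term in search_terms:
        for field in article["mentions"].get(term, []):
            score += FIELD_WEIGHTS[field]
    return score

def combined_relevance(article, search_terms):
    """Final ranking score combines the two components."""
    return intrinsic_relevance(article) * query_relevance(article, search_terms)

# A recent paper mentioning the variant in its title and abstract outranks
# an older paper of equal standing that only mentions it in the supplement.
recent = {"year": 2020, "journal_impact": 4.0, "citations": 100,
          "mentions": {"C73R": ["title", "abstract"]}}
older = {"year": 2005, "journal_impact": 4.0, "citations": 100,
         "mentions": {"C73R": ["supplemental"]}}
```

In this toy model the two components are multiplied, so a paper must score on both axes to rank highly; proximity between terms, which Mark and Steve both mention, would feed into the dynamic component.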
Those are the things that we’re doing when you select each additional filter, whether you’re filtering by disease or therapy or adding in those ACMG/AMP keywords, we’re filtering the articles, but we’re also reprioritizing based on that dynamic component of the relevancy of the article. That’s a little bit more technically involved, but I think sometimes helpful to understand what’s happening under the hood. It helps inform how you use the software, how you can use the data that we provide.
Going back to the idea of maximizing sensitivity and then optimizing specificity, this is an example of one of the things that we’ve added within the past month, particularly for non-coding variants.
Mark and Rachel showed you how you can search by cDNA nomenclature, genomic coordinates, or RSID, which are all nucleotide-specific nomenclatures for variants. When we show the results, we use what you searched to prioritize nucleotide- or cDNA-level matches within the papers at the top, so you can see those first, while at the same time showing you any other variant that might be highly relevant to what you are searching for, meaning a different cDNA change that has the same or a very similar biological effect. We want to include those as well.
One area where this is especially prominent is the intronic regions of a gene. Previously, if you searched for a variant in an intronic region, we would categorize all intronic variants together. If there was no specific reference to your cDNA-level change, we would at least show you other variants that matched whatever filters you queried on (disease, therapy, phenotype, ACMG/AMP criteria, or other categorical keywords), along with any other variants in the same intron as the variant you searched.
That can be very helpful, and sometimes it will lead to the one and only paper that has any evidence on the variant you’re searching, in that it shows a similar variant in the same intron. However, for some introns, that can be a very broad category. You might search for a variant that’s four base pairs into the intron on the splice acceptor side, and then we show you a paper with a deletion over on the splice donor side that isn’t nearly as relevant as you might like. If there are only one or a few or even 15 results, it’s quick and easy to go through those in the Mastermind interface as Rachel showed, but if there happen to be 200 articles, it’s much less practical to look through all of them to see if any are more relevant than others.
One of the things that we’ve done recently is we’ve subdivided the intronic variants in order to give the option for more specific groupings of relevant variants.
For example, if you search for something in the splice region on the donor side, we can show you other variants in the evidence in the medical literature from that same splice region, meaning it might have a very similar or highly related biological effect to the variant that you’re searching, even if it’s not exactly your variant.
But again, if it is exactly your variant, those results will be at the top with that little crosshair icon, showing that it’s a precision match.
As I said, we previously had splice donor, intronic, and splice acceptor groupings, and we have now added more specific groupings on top of those: splice region on the donor side, splice region on the acceptor side, and the intronic variants divided into two halves, the intronic donor side and the intronic acceptor side. You can add those filters as well to increase specificity and change the prioritization.
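A minimal sketch of how an intronic position might be assigned to these groupings, using the "sd", "sa", "srd", and "sra" labels described later in the Q&A. The distance cutoffs (2 bp for the donor/acceptor sites, 8 bp for the surrounding splice region) are illustrative assumptions; Mastermind's real boundaries may differ, and exonic splice-region variants are not covered by this sketch:

```python
# Hypothetical classifier for intronic positions into splice groupings.
def classify_intronic(offset_from_donor, intron_length, region_window=8):
    """offset_from_donor: 1-based distance into the intron from the
    exon|intron (donor) boundary. Returns an assumed grouping label."""
    offset_from_acceptor = intron_length - offset_from_donor + 1
    if offset_from_donor <= 2:
        return "sd"             # splice donor site
    if offset_from_acceptor <= 2:
        return "sa"             # splice acceptor site
    if offset_from_donor <= region_window:
        return "srd"            # splice region, donor side
    if offset_from_acceptor <= region_window:
        return "sra"            # splice region, acceptor side
    # Deeper intronic positions: split the intron into two halves.
    if offset_from_donor <= intron_length / 2:
        return "intronic_donor"
    return "intronic_acceptor"
```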
If we go to the next slide now: the way most people think of the value of Mastermind and of the data we provide is the initial benefit that we’ve organized and indexed the entirety of the medical evidence and literature, making it easily searchable.
What I want to show is that that is only half of the value we provide. The other half is that we take the data we’ve indexed and pre-identify every genomic association supported by the medical evidence.
What do I mean by genomic association? Really, what I’m talking about is the fact that we identify every occurrence of several different genomic concepts within the evidence: every gene, every variant, every disease, every phenotype, every therapy, and every categorical keyword mentioned in the literature. Those categories or keywords are what Rachel just demonstrated, particularly with the ACMG/AMP subcategories.
What we do is we identify each of those entities, and then we identify the connections between any two or more of those entities. A very simple connection might be connecting a disease to a therapy, or a variant to a disease, or a gene to a disease, but causative links between these different concepts are strengthened and really become well understood when you can add more entities or more information from the evidence to these associations.
While a disease therapy association is a great start, it gives you a hypothesis which you can then start to investigate and build upon. If you can go from disease to gene to therapy, now you’re starting to better understand the causal mechanisms underlying that disease, and why that therapy acts the way it does in terms of treating that disease. If you can connect a disease – gene – variant – therapy, now you’re really starting to understand that functional mechanism underlying the disease and the efficacy of the therapy. But if you can connect a disease – gene – variant – therapy characterized by phenotypes and categories or keywords, that’s sort of the holy grail of understanding the relationship between disease and genomics.
That’s really the heart of what we’re doing: identifying every connection between any two entities.
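The idea of typed entities connected by the papers that co-cite them can be sketched with a simple index. The data model, class name, and entity tuples here are illustrative stand-ins, not Mastermind's internal representation:

```python
from collections import defaultdict

# Hypothetical sketch: papers indexed by the typed entities they mention,
# so an "association" between entities is the set of papers citing them all.
class AssociationIndex:
    def __init__(self):
        self.by_entity = defaultdict(set)   # entity -> set of paper ids

    def add_paper(self, paper_id, entities):
        # entities are (type, name) pairs, e.g. ("gene", "CFTR")
        for e in entities:
            self.by_entity[e].add(paper_id)

    def associated(self, *entities):
        # Papers citing all the given entities together; adding more
        # entities (disease, gene, variant, therapy, ...) narrows and
        # strengthens the association.
        sets = [self.by_entity[e] for e in entities]
        return set.intersection(*sets) if sets else set()
```

Chaining more entities into `associated` mirrors moving from a disease–therapy hypothesis toward the disease–gene–variant–therapy picture described above.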
And they don’t have to be different entities. For example, one of the things that Rachel showed is looking at similar variants, and that’s one of the other things that I was just talking about, with the biological groupings of intronic variants, for example.
The associations can be between one or more variants, or between one or more genes as well. While our user interface is really optimized right now for finding the associations connected to variants, this is where the power of our API starts to come into play: it makes it much easier to automate the process of finding any of these associations between any two entities, with any set of inputs producing any set of outputs.
You can start with a gene and variant as input and get a disease as output, or you can start with a variant as input and get a list of other variants as output. If we go to the next slide, I’ll show how the API can help automate this process, making your curation much more high-throughput.
A very basic use of our API is shown here: you might have a list of variants you want to investigate, and you can simply find the number of articles that cite each variant. It’s a really low barrier to entry, and it tends to be very effective for the partners and customers that integrate with our API.
It’s a very low-effort but high-quality way of prioritizing variants for curation work. Knowing which variants to look at first can save a lot of time. In this case, we have three variants shown in the graph, but for a given case you might have a few thousand variants for a patient. You can pass those into the API, get the number of articles that cite each variant, and sort by that. Now you have a pretty good heuristic for knowing which variants to investigate first. But that is just the base API, using only the count information.
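The triage pattern just described can be sketched in a few lines. The endpoint URL, query parameter, and response shape below are hypothetical stand-ins, not Mastermind's documented API; the fetch function is injectable so the sorting logic can be exercised without network access:

```python
import json
from urllib.request import urlopen

API = "https://api.example.com/counts"  # hypothetical endpoint

def article_count(variant, fetch=None):
    # fetch(variant) should return a dict like {"article_count": N};
    # by default we'd hit the (hypothetical) HTTP endpoint.
    if fetch is None:
        def fetch(v):
            with urlopen(f"{API}?variant={v}") as r:
                return json.load(r)
    return fetch(variant)["article_count"]

def prioritize(variants, fetch=None):
    # Most-cited variants first: a cheap heuristic for curation order.
    return sorted(variants, key=lambda v: article_count(v, fetch), reverse=True)
```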
Where association strength really comes in is this next example, using the advanced API. This is an example from an actual API integration we built for a large platform provider where, for a given patient or case, you might have three, five, or ten thousand variants that you pass in as one input, and a list of five to fifteen phenotypes for that patient as the other.
What our API can do is automate the combinatorial search that isn’t feasible in the user interface. The user interface makes it easy to search by any single combination of variants or phenotypes, and the Professional Edition’s advanced boolean search lets you specify multiple variants or multiple phenotypes within one query. What isn’t feasible in the user interface is the combinatorial approach of testing which combinations of variants might account for all of the phenotypes you’re seeing in a given case, for example through multigenic effects. Our API makes it much easier to automate that sort of combinatorial search. In this case, we have a script integrated into an analysis pipeline that uses our advanced API: it passes in all of the variants for a given case along with all of the phenotypes, identifies which combinations of those variants and phenotypes are cited in the literature, and prioritizes by that.
So if you have a given paper that cites four of the patient’s variants together with three of the phenotypes, and then you have another paper that cites two other variants with three more of the phenotypes, it will prioritize those two papers at the top and pull out the combinations of variants and phenotypes that they cite.
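That prioritization step can be sketched as follows. The paper records are hypothetical stand-ins for what the advanced API would return; the scoring (simply counting co-cited variants plus phenotypes) is an illustrative assumption:

```python
# Hypothetical sketch of combinatorial prioritization: rank papers by how
# many of the patient's variants and phenotypes each one co-cites.
def prioritize_papers(papers, variants, phenotypes):
    results = []
    for p in papers:
        v_hits = set(p["variants"]) & set(variants)
        p_hits = set(p["phenotypes"]) & set(phenotypes)
        if v_hits and p_hits:   # require a genuine co-citation of both kinds
            results.append((len(v_hits) + len(p_hits), p["pmid"], v_hits, p_hits))
    results.sort(key=lambda t: t[0], reverse=True)
    return results
```

A paper citing four of the patient's variants with three phenotypes (score 7) would rank above one citing two variants with three phenotypes (score 5), matching the example in the talk.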
But that’s just step one, actually, because from that set of highly prioritized articles you can then see what other entity types were cited in them.
For example, in this graphic we’re asking: from the articles that mention these combinations of variants and phenotypes, what diseases did the articles talk about? We can just as easily ask what therapies they discussed, or what other phenotypes, or any of those entities. Now you can start to automate not just the prioritization of things to investigate, but the investigation itself, in a way that isn’t feasible for a curator to do manually.
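Continuing the sketch, the extraction step could tally which entities the top-ranked papers discuss. The paper records and field names are again hypothetical stand-ins for advanced-API responses:

```python
from collections import Counter

# Hypothetical sketch: from a set of top-ranked papers, count which entities
# of a given type (diseases, therapies, phenotypes, ...) they mention, so the
# most consistently cited candidates rise to the top.
def entities_from_papers(papers, entity_type="diseases"):
    counts = Counter()
    for p in papers:
        counts.update(set(p.get(entity_type, [])))
    return counts.most_common()
```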
So that’s an example of the power inherent in the API: you can start to build this more automated approach, which then lends itself very well to building out even further with machine learning and artificial intelligence, as we score these articles and automatically pull out these other entities.
Thank you, Mark and Steve, and thank you in advance for the questions you’re going to be [answering]. And I want to thank you for joining us today! I hope you learned things that you can use in your daily work.
ACMG Masterclass Q&A
Hello, thanks for this Mastermind Masterclass on ACMG/AMP. I have a question about ACMG interpretation. On which databases are the ACMG criteria based? For example, for a mutation to be PM1, it needs to be found in a mutational hotspot based on the literature. How do you determine that? Thanks in advance.
There are two ways to use Mastermind to determine whether a variant is in a hotspot region. The first way would be to look at the number of variants found in Mastermind in the vicinity of the variant being interpreted and assess the literature evidence supporting the pathogenicity of each of these. The other way would be to search Mastermind for papers that describe defined hotspot regions (using either the category keyword hotspot or a similar free-text keyword search) and have your curation results incorporate the consensus from such references.
Does Mastermind use the HGMD database? If yes, is it the public version?
No. Mastermind does not depend on other databases to build its genomic associations, nor does it provide canned interpretations of variant pathogenicity. Instead, Mastermind contains information for many more variants, and much more information per variant, to ensure curators have the most sensitive search results to determine the most accurate and up-to-date interpretation possible.
What about transcript differences leading to variant nomenclature differences?
Mastermind includes results for all transcripts of each gene that may result in differing nomenclatures or variant numbering, as a core competency of its Genomic Language Processing (GLP) technology. As you know, most authors do not identify the transcript they are using in their publications, and the results in Mastermind are purposely designed to maximize the sensitivity of search results.
Can we search for SNPs functional evidence? Can we query by the RSID number?
Yes, Mastermind normalizes this type of variant search and can handily recognize RSIDs. Additionally, Mastermind recognizes “c.” and “p.” nomenclatures among many others.
Are you using machine learning or some sort of “AI” to scour the professional articles, or is this done by “hand”?
Mastermind uses a proprietary process we call Genomic Language Processing, which has its foundation in NLP (Natural Language Processing). We have spent six years perfecting GLP, resulting in a command of the “language” of genetics: recognizing the myriad ways genes and variants can be variably described by authors, then reconciling, disambiguating, and organizing the results. GLP is how we use automation to index the content of the literature so the needed information is readily available with a simple, single search in Mastermind.
Can searches be saved to compare them?
We do allow users to set up Alerts for variants to stay apprised of newly published information. The user’s Alert dashboard can be used in this way to keep track of the changes in literature citations for each saved variant.
Can other nomenclatures, like BIC nomenclature, be captured by Mastermind for the same mutation?
BIC nomenclatures are not explicitly included in the Mastermind indexing process. However, we strive to continue building upon our best-in-industry sensitivity for recognizing all variants cited in the medical evidence, regardless of nomenclature or formatting. If you know of any articles which cite variants exclusively by BIC nomenclature, please reach out to us, and we can add them to our roadmap. Although the BIC database is no longer actively maintained or curated, if there are any older papers that use this nomenclature exclusively, they could be useful additions to the Mastermind data.
Which reference transcript (RefSeq, MANE, etc.) is used for searching the variants? Can other transcripts be used as well?
RefSeq is currently used in the indexing process for both the canonical and legacy transcripts.
PM7 is not defined in ACMG (believe this was in the filter). Is that what you are associating with case studies?
PM7 was a one-time ACMG category that was not widely adopted; it relied on previously interpreted pathogenicity from a reputable source that had identified the variant in a clinical setting. The presence of PM7 in the Mastermind user interface is a reflection of this designation.
If I put in an intronic splice site variant, will it also find synonymous variants near the end of the exon (that may also affect splicing)? Intronic variants seem not too relevant unless there are functional studies.
Mastermind aggregates variants that influence splicing into splice donor and splice acceptor groups if the change is directly within the intronic nucleotides one or two positions from the exon-intron or intron-exon boundary. These variant groupings are defined by the protein-coding position they are closest to and labeled with either “sa” or “sd”.
Additionally, variants deeper within the intron but still close to the exon-intron and intron-exon boundary as well as coding variants in the exon that are near the exon-intron or intron-exon boundary are aggregated as “srd” and “sra”, respectively.
These splice region variants can be searched directly for results that would include any synonymous exonic variants within the splice region. To narrow down to only the synonymous exonic variants within the splice region, you can do a boolean search as well for the synonymous variant.
We have written a couple of blog posts that describe this feature of Mastermind and provide more in-depth information.
Do you highlight functional (meaning RNA splicing) results for such variants?
Functional studies that detail effects on splicing can be searched for using category keywords under “Genetic Mechanism – Variants”.
Do I have to filter by a certain ACMG guideline? Can Mastermind just tell me which ACMG criteria a given paper gives info about?
At present, we require the user to specify which aspect of ACMG they are searching for. In the future, we plan to have this information (which ACMG category is addressed in any one reference) automatically appear for each result that Mastermind returns.
Sometimes the same variant is named differently in different publications, mainly across time (especially in older publications). How does Mastermind deal with this issue?
Mark: There are a couple of ways to think about answering that question. The first is pretty generic: authors can use any kind of nomenclature they want. We talked about RSID-level searching; if the variant is mentioned in a paper as an RSID, that’s one way an author can describe the variant. Another way is cDNA versus protein. A more nuanced way to answer that generic version of the question is whether it follows HGVS standard nomenclature guidelines or not, which it almost never does, and which is part of the genius of what Steve and his team have been able to build in Mastermind’s indexing.
The other way to think about answering that question is legacy nomenclature, which I think is what you were alluding to when you talk about it across time. Mastermind is genomically literate, and it’s aware of those legacy issues. A couple of examples would be differential variant nomenclature across different transcripts of a gene; legacy nomenclature where the initiator methionine is omitted, which obviously changes all the numbering; and a signal peptide that is either included or not included depending on the author’s fancy. Mastermind is aware of all of those things, and just as Steve suggested, it maximizes sensitivity and then optimizes for specificity. Steve, I dare you to build on that!
Steve: Everything you mentioned is just scratching the surface in one respect, because those are all the things that make sense! There’s an entire class of variant nomenclatures that plainly don’t make sense. For example, there are what we consider colloquial nicknames: in CFTR you’ve got the F508 deletion that authors love referring to as Delta F508, with a delta symbol, and there are about 50 different ways to represent a delta symbol in different character maps. We’ve addressed things like that as well, and then there are others that make even less sense. I can’t recall which variant it was specifically, but there’s one variant we encountered where the seminal paper on that variant referenced it by RSID, but the author transposed the first and last digits of the RSID, so the RSID used by the author is a completely different variant than the one the author was writing the paper on. Because that paper was so early and so important for the study of that variant, the next ten papers published on that variant over the following years used the same transposed RSID as the initial author. You literally end up with ten different papers describing the same variant with a typo throughout. That’s the part of the iceberg below sea level when it comes to alternate variant nomenclatures. Then there are the ones that make sense, like legacy transcripts, which we are fully aware of when we’re indexing and searching for these variants.
We’re looking not just at the current transcript for the variant, but for those nomenclatures across different transcripts, recognizing the alternate ways to describe a variant across different nomenclatures, numbering, or position systems, taking into account things like numbering shifts that depend on whether counting starts at one or zero, all the way down to things that are purely colloquial happenstance and part of the history of publishing on that variant. We try to address all of those. Because the genome is huge, one of the ways these often come to our attention is from customers and users who specialize in those genes and let us know about the issues. Sometimes they know exactly what the issue is and can send us a list of articles that illustrate it. Other times they say, ‘you didn’t find this variant that I know this paper exists for,’ and then we have to dig into why. That’s always something we try to do: continually improve our pipeline, our genomic language processing, to take those kinds of things into account.