The Emergence of AI-Guided Genomics to Accelerate Variant Interpretation
Thursday, May 13, 2021 | 11am EST
Next-generation sequencing (NGS) data is widely used to inform both clinical diagnostics and drug development. In either case, manual curation and interpretation of this data is hindered by an inability to identify information within scientific literature quickly, comprehensively, and reproducibly. This challenge is compounded by the complexity and heterogeneity of nomenclatures used to describe diseases, genes, and genetic variants.
In this webinar, we will stand at the intersection of Artificial Intelligence (AI) and Precision Medicine to discuss the developing role of AI-driven genomics in the research space, and how these advances are improving variant interpretation workflows. Additionally, we will discuss a recent study on the Mastermind Genomic Search Engine, and observe how a computationally intelligent approach to building a search engine for genetic evidence has quickly surpassed two decades of manual effort in building static genetic databases.
Genomenon Founder and Chief Science Officer, Mark Kiel, MD, PhD, and Director of Customer Success, Brittnee Jones, PhD, lead a live discussion and Q&A session.
Topics include:
- High-level industry challenges associated with genetic variant interpretation,
- The emerging role of AI and machine learning-assisted technology in genomics, and
- Recent findings surrounding the implications of novel computational genomic intelligence platforms.
GARRETT: Good morning, everyone, and welcome to today’s webinar, The Emergence of AI-Guided Genomics to Accelerate Variant Interpretation. My name is Garrett Sheets, and today we’re joined by two members of Genomenon’s leadership team, who will guide us through a discussion about where artificial intelligence meets precision medicine. We’ll unpack emerging innovation around AI-driven genomics and how it’s improving outcomes in both the research and the clinical space. We’ll also discuss key findings from a recent study on the Mastermind Genomic Search Engine, and talk about how using a computationally intelligent approach to curate genomic data has quickly surpassed almost two decades of manual effort. They have a lot of great information to share, so I’ll get right to housekeeping!
For those watching, feel free to put your questions into the Q&A. If we have time, we’ll get to those at the end of the presentation. This webinar is being recorded, and we will email you the recording within the next day or so. So, without further ado, I’ll briefly introduce our speakers, and they’ll tell you a little bit more about themselves!
We have Dr. Mark Kiel, who is our co-founder and chief science officer. We also have Dr. Brittnee Jones, our director of customer success. Hi, Mark! Hi, Brittnee! Thank you both for joining us today. Brittnee, I’m gonna hit the ball over to you. Why don’t you get us started?
BRITTNEE: Great, thank you very much, Garrett! My name is Brittnee Jones, I’m the director of customer success. I just joined the Genomenon team. Most recently, I was at Fabric Genomics, which makes tertiary analysis software, so I was really looking at clinical decision support software, but also the world of AI, which is why I’m here with you today. I’ve spent over 10 years in the genetics and genomics space as an application scientist, and before that, I got my PhD at UCSF. Mark?
MARK: Thanks, Brittnee, and welcome! This is one of the fastest hire-to-webinar cycles that I’ve ever seen, so thank you, Brittnee, for joining me. I’m Mark, I’m the chief science officer and founder of Genomenon. My background is as an MD/PhD scientist, having done clinical genomics (both cancer and constitutional) as well as a fair bit of research genomics. I’m a data nerd; I love data, data visualization, bringing order to chaos. Hopefully, that’ll come out in the conversation that Brittnee and I are going to have.
As chief science officer here at Genomenon, I manage teams that supervise the data quality that we produce, as well as the content curation, which I’m sure will come out as Brittnee and I go through some of our questions here. Welcome to all of our attendees! It’s a pleasure to have you. We’ll welcome questions as you have them toward the end, but otherwise, Brittnee, let’s take it away.
BRITTNEE: Absolutely. We wanted to get kicked off today with a little bit of context. Can you give us a background on Genomenon, and maybe, since this is about AI, how AI figures into the platform?
MARK: Yeah, sure! Apologies to those of you who know Genomenon, if this is a little bit redundant, but we have a much wider audience here than is typical of our webinars, so let me tell you a bit about Genomenon, Mastermind, our product, how we deliver value for variant interpretation, both in the clinic and in pharma, and then how AI contributes to the value that we generate.
At Genomenon, we are organizing the world’s genomic information, and we’re doing that in the service of saving and improving lives of individuals with genetic and other related diseases. I suggested that Mastermind is our software, but it’s also our database, and that’s where the value really starts. It is a database of evidence that we have indexed from the published empirical medical literature, the clinical and research literature. It comprises titles, abstracts, full-text articles including figures and tables, as well as supplemental material. That’s all updated on a weekly basis and goes back in time many decades. It’s a highly sensitive data repository, but one of the measures of the sensitivity is a very sophisticated algorithm that we use to pull out information from those references, to organize it, to annotate it, and then to present it to our users. We sort of cheekily refer to that as genomic language processing, which is a slant on natural language processing, which we’ll talk about a little bit later. We like to say that there’s nothing natural about genomic language.
The genes and the way they’re described, the variants in the way that they’re mentioned, the obviously multiple ways that you can describe a disease or clinical phenotypes, the drugs, etc., all of that is part of our genomic language processing. We organize that data into genomic associations for our software users to see, increasingly for our higher throughput labs to access that data in higher throughput through an API, and custom configure how those data components get put together. Recently, as of the past, say, three or four years, we’ve actually begun curating that content as well, using a unique combinatorial approach of automating the indexing of that information with our GLP and computational capability, and then expertly curating that information to ensure the utmost accuracy of the information, as well as the specificity and relevance to specific questions or different diseases or therapies. We do that through manual curation.
To answer the latter part of your question specifically, Brittnee, how do we use AI — AI is a loaded term. It’s very broad. I’ve likened it before to saying “we’re doing science!” It’s just kind of contentless, and it’s taken on a bit of a life of its own. As for our general approach to using AI, our philosophy is that we don’t use it at run time; we use it while we’re learning. So on the back end, with our very capable development team and our content and scientific team, we use AI capabilities to understand the patterns in the data, and then to take action on them. Nothing that you see in Mastermind is “live” AI, but we’re leveraging AI in the background, frequently and in very sophisticated ways, to understand these patterns and most effectively prioritize the content, both for our software users and for our internal teams.
BRITTNEE: That kind of brings up another question — like you just said, AI is everything. AI is science, right? So what is the definition that you’re giving of AI? Maybe you could tell us a little bit more around machine learning applications.
MARK: It’s a good question. There’s often a confusion about AI and machine learning being the same thing. Actually, machine learning is to AI what a square is to the broader family of rectangles: broadly used and very powerful, but a narrow subset of AI. This isn’t meant to be a detailed technical conversation, because we want to keep it focused on the practical applications in clinical genomics, but I like to fractionate AI into things like interactive AI (an example being chatbots, for better or worse), then functional AI (things like Boston Dynamics and the awesome dancing robots that are also highly useful; it’s really cool and somewhat intimidating to see them dance), and then things like visual AI (you’ve probably heard about deepfakes and augmented reality, etc.). The one that I want to focus on, just to give you a landscape view, is analytic AI, and that has to do with finding patterns in data. A subset of that subset of AI is machine learning. We use machine learning at Genomenon, but not as much as we use something called named entity recognition, which, again, is a facet of natural language processing; specific applications to genetics and genomics would invoke the GLP that I talked about before. Just to put a finer point on it, for analytic AI, I like to say it’s using statistics and data to find patterns, to ensure that the evidence you’re organizing and the answers you’re coming to are appropriate for a user’s question.
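Mark's description of named entity recognition can be sketched as follows. This is a minimal, hypothetical illustration in Python, not Genomenon's actual GLP; the tiny gene lexicon and the variant pattern are illustrative assumptions only.

```python
import re

# Toy named-entity patterns; illustrative only, not Genomenon's GLP.
GENE_RE = re.compile(r"\b(?:FLT3|KIT|BRAF|NPM1|BCR|ABL1|CD34)\b")  # tiny fixed lexicon
AA = r"(?:[A-Z][a-z]{2}|[A-Z])"  # a 3-letter or 1-letter amino-acid code
VARIANT_RE = re.compile(rf"\b(?:p\.)?{AA}\d+{AA}\b")  # e.g. p.Val600Glu or V600E

def tag_entities(sentence: str):
    """Return (genes, variants) mentioned in a sentence of literature text."""
    return GENE_RE.findall(sentence), VARIANT_RE.findall(sentence)

text = "The BRAF p.Val600Glu (V600E) mutation co-occurred with FLT3 ITD."
print(tag_entities(text))  # (['BRAF', 'FLT3'], ['p.Val600Glu', 'V600E'])
```

Real genomic NER has to handle vastly more nomenclature (legacy gene aliases, DNA-level HGVS, disease and phenotype synonyms, outright typos), which is what makes the problem hard at scale.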
BRITTNEE: The other question that your original statements were bringing up: you were saying that, when we’re using AI, we’re using it as part of the learning, not as part of the final product. That brings up the question, what are some of the challenges of bringing AI into the clinical space or into a clinical practice, and how do we push that more? How can we get people to understand that, feel comfortable with it, and then bring that in-house to use the power of AI?
MARK: So, there’s a huge difference between the acuity of the result, or the importance of getting it right, on the clinical side versus the consumer side. I’m an avid YouTube watcher, I watch lectures as I fall asleep, I listen to music while I work, and there are these uncanny algorithmic predictions that come up from YouTube as they track my search data and that sort of thing. That’s AI. But if they get a video wrong, I just skip over it. If I’ve seen it before, they might know that I like to re-watch stuff, but it’s very low stakes there. Similarly for, say, Facebook ads: you do a Google search and you see a Facebook ad. In the clinic, it’s totally different. The data has to be right. So when I talk about not applying AI at run time, it’s because we want to ensure that there’s reproducibility and predictability in the results that are generated. That’s a challenge with typical applications of AI. There’s actually benefit on the consumer side to having some of that serendipity, but it’s a huge burden and a problem on the clinical side. So there’s that: you’ve got to get it right.
There’s also the need to access the evidence, and there’s a trend in AI for these to no longer be black-box tools, but for these algorithms to actually defend what they’ve said. You can imagine if the YouTube algorithm had to tell you why it gave you a certain video, and show you all the machinations it undertook in analyzing your search data. That’s not always wanted or welcome, and it can be a problem on the consumer side. That’s exactly what you need to do on the clinical side, though — we, as clinicians who are looking at this data, need to know why it’s being prioritized. The last thing I’ll say, fairly briefly, is related to showing the work, which is something that Genomenon puts a lot of primacy on. There’s got to be more of that interconnected, dynamic interaction between the end user and the evidence, and again, that’s what we make available to our users through the software.
BRITTNEE: I think that was an important point you just brought up on user input, that last piece: users can actually put in more information to refine search results. I know you had a couple of examples of that which I’d seen, with user searches for genes associated with leukemia.
MARK: Yeah, this is one of my favorite examples! It harkens back to my training and my research. Much of my clinical practice was around hematopoietic malignancies, studying them and diagnosing them. Obviously, Mastermind knows the genomic associations around leukemia, say; we’ll just keep them generic. An example search might be going into a tool with genomic association capability, like Mastermind, and saying, “I want to see genes associated with leukemia.” That’s a pretty open statement. As a user, you might know exactly what that means to you, but until you let the algorithm know what specifically you care about, it’s showing you everything. Especially in the context of clinical result returns, it’s being very sensitive on purpose, because it doesn’t want (or, not to anthropomorphize the algorithm, we don’t want) the user to be missing out on anything. A couple of results that might be returned, the ones that people are probably expecting, would be c-KIT, FLT3, CbPl; those are all mutated to cause cancer. But there are also structural variants that are gene associations with leukemia, like BCR-ABL or MLL deletions. There are also surprise entities where, say, CD34, which is used as a marker to identify stem cells for therapeutic transplant for patients with leukemia, well, that’s a gene too. So that gets at the heart of what you said about the user obtaining results that are predicated on the quality of their input: how thoughtfully they’ve told the algorithm what specifically they care about. That comes out in the results that you get, and it has to do with that competition between sensitivity and specificity.
BRITTNEE: Yeah, which brings us forward to what we wanted to discuss next, which is technical challenges. Can you speak to some of the technical challenges, especially in genomics, from that data perspective? I mean, there’s just so much data. How do we make those types of associations, and then how do we drill down?
MARK: So there are two flavors of that, actually, the proliferation of the data, and they feed off each other. One is that there’s this hyper-proliferation of data in the clinic and in the research world. Sequencing is commoditized now, so the biologist isn’t constrained; they can say, “I want to sequence a thousand patients with this disease,” and there’s no monetary or technical barrier, it’s just an organizational challenge to find these patients and have that happen. There’s this dramatic expansion in that data, and then there’s the patterning of that information in the medical literature. So it’s organizing all that data as it’s growing in the clinical lab, and then reconciling it with that proliferation of the evidence that’s been published.
We’ll talk about some more of those challenges here in a second, but that just creates a situation where there’s a need to automate, and automation is the real challenge. I’ve said before, when I trained, you’d look at a single mutation result on a paper requisition. You can’t do that with a panel or an exome or a genome. There’s this critical need to marry the research data with the sequencing data, highly computationally, and it has to be automated. All of these disparate aspects of the genomic pipeline need to be seamlessly integrated. There are a lot of moving parts. If you’ve ever coded a little bit, and taken the raw data and put it through its paces and processed it yourself, you know it’s not as easy as it looks at the clinical user level, where you just click a button and out comes the data. There are challenges to streamlining that integration.
As I was training as a clinical pathologist, one of the most interesting teaching points from my mentors was the critical need for reproducibility of the data, reproducibility of the output. It has to work predictably on a Tuesday morning and a Thursday afternoon, whether Bill or Sue is manning the shop. That reproducibility is a real challenge when you’re automating and dealing with all this data, especially when it’s so proliferative and disparate, with different types of data. That’s a flavor of some of the technical challenges.
BRITTNEE: Yeah, as part of the certification process, they often make you prove that. In these clinical labs, you have to prove that, if Joe did it one day and Sue did it the next, we’re getting the same output, so that patient one isn’t treated differently than patient two just by accident.
MARK: There’s not much room for the typical voodoo that is AI, where a little bit of overfitting can totally change the result. So, as I said, to put a fine point on this in the context of the conversation, there has to be much more management of curation of the application of the AI in the context of clinical genomics.
BRITTNEE: That’s sort of what you were talking about, how do we make sense of all of this data, and how do we organize it? People have been trying that for a long time. There are other databases out there. To dig into a little bit more of this, can you speak about some of the challenges of organizing that data? You’ve got your ClinVar, your LSDBs, your COSMIC; there’s a lot of other databases that have been trying to do this, to provide this back to interpreters.
MARK: It harkens back to this super-abundance of information, but also the complexity of that information. We talked enough, I think, about how proliferative this data is. Let’s talk a little bit more about the complexity. I remember, as a graduate student, when I was first beginning, using dbSNP, and how it was this great wealth, almost an embarrassment, of data. There was just tons of data there. When I was a graduate student, and I think still now, it was being used for things it wasn’t intended for, because there was so much data there, but there were things being inferred about the data that weren’t exactly accurate because of the nature of what was submitted. A lot of research samples, a lot of engineered variants… We’ve clearly grown up since then, and now we have more appropriate databases of population frequency, etc., that address those issues, but the challenge with this proliferation of data is: who’s going to do the work?
There’s typically a distributed model, which is what happens with ClinVar: the organizers of ClinVar have tasked the community with submitting and qualifying and interpreting those variants, and it’s a repository of that information. They’ve taken some pains to ensure the quality and accurate annotation of that data, but it’s an ongoing challenge. That’s a general database of genetic variants. Then, you have locus-specific databases (LSDBs), which, if you’re not familiar, are databases focused on a disease or a gene. Typically, this happens out of an academic setting, where one researcher takes it upon themselves to pick up that mantle and organize that data. It’s really hard. It’s very complicated. You have to keep track of the literature as it’s published, and go in and disambiguate all the variants that authors have typoed or described in a slightly different way, which makes it hard to reconcile them with already existing entries in your database. The challenge is the complexity and the need for higher throughput, but also the need for the data to be manicured and expertly curated. Those are the two competing challenges with organizing these variant databases.
BRITTNEE: I know that in the paper, it actually mentioned — we’ll go into the paper now a little bit more — the paper actually mentioned that Genomenon is finding about five times more pathogenic variants than, say, ClinVar. What do you think is leading to that discrepancy? I think it was ten times more variants overall, even.
MARK: Yep. Let’s talk about that a little bit more. A variant is not a variant is not a variant. You can have a variant that’s a polymorphism, is benign, and doesn’t contribute to disease. You can have a very well-known pathogenic variant that is widely studied, where everybody in biology knows that it’s associated with disease, but there’s this wide distribution in between. There’s the need to be as comprehensive as possible because, especially with that proliferation of evidence, any one study might contain pivotal evidence: evidence that changes your view about a variant, or that provides information for a rare variant from a rare disease circumstance. There’s that wide distribution and heterogeneity of the variants.
When you talk about how Mastermind, our database of this evidence, compares with ClinVar, the reason that we’re able to surface more evidence and find more variants is our computational approach. Put them next to each other: our model is computational, and the ClinVar model is crowdsourced. There are challenges with ClinVar in incentivizing lab-submitted and user-submitted data, but even setting that aside, there’s this hope that you would have all of the patients that have ever been seen, amortized across the decades of medical literature, captured in that database, and captured effectively. Because ClinVar is relying on users submitting evidence from real patients, there’s a substantial lag.
As you pointed out, there’s maybe 10 or 20 percent of variants that are captured in ClinVar. There’s a bias toward well-known variants or polymorphic variants on either side of that spectrum, whereas our reach into the literature and our computational capability captures everything, including that hump in the middle, where many of them are pathogenic, as you point out, but a substantial number of them have been found in the context of patients. There is evidence that they might be associated with disease. Any other paper, functional study, a new cohort or whatever will tip that balance, and the evidence would be sufficient to say, yeah, this is diagnosable.
BRITTNEE: You keep mentioning Genomenon’s computational approach. What is Genomenon’s computational approach? And what does AI have to do with that?
MARK: Good question. I’ve alluded to this a little bit before: our data substrate is the empirical literature. Now, talking about the Mastermind software, the data in the software right now is predicated on those publications. GLP is very sophisticated and knowing. It’s bioinformatically literate and aware. It knows about legacy nomenclature and different transcripts; it knows about HGVS, the standardized way to describe variants, and how authors often don’t respect HGVS nomenclature. We can take a step up and talk about structural variants and copy number variants, which are particularly challenging to search for or find in the evidence, where you also have to be sure you’re not missing something that’s similar but not identical to the variant you searched for. That’s what GLP does. Recalling that I described named entity recognition as a facet of AI: GLP is totally focused on named entity recognition, and expert at organizing that information.
Then, I alluded to one of the ways that we deliver value to our clients: through curation of this content. Your question here allows me to say that our ambition is to curate the entire genome. Every gene in the context of every disease, every variant, every scrap of evidence that’s been published about those variants — we have ambitions in the coming years here to curate all that content. We’ve begun with a targeted approach for some of our pharma clients. We’ve also taken a disease body system by body system approach, beginning with neurodegenerative disease. There are three dozen or so genes associated with ALS, general neurodegenerative disease, Parkinson’s, there are a couple dozen Alzheimer’s, etc., etc. We’ve used our AI approach to find all the variants and all the genes that are associated with those diseases, to organize and annotate that evidence. Then, there’s even sophisticated computational capability that would fall under the umbrella of AI that allows my team to curate that very quickly and highly accurately. Again, without going into much technical detail, that’s a flavor for how we’re using AI to mitigate these challenges that we talked about, with the distributed but still manual approach to organizing data like in ClinVar or locus-specific databases.
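The nomenclature challenge Mark describes can be made concrete with a toy normalizer that collapses different author spellings of a protein variant into one canonical form. This is purely an illustrative sketch, not Genomenon's GLP, and the amino-acid table is deliberately truncated.

```python
import re

# Toy variant-spelling normalizer; illustrative only, with a truncated table.
THREE_TO_ONE = {"Val": "V", "Glu": "E", "Gly": "G", "Arg": "R", "Ser": "S"}

def normalize_protein_variant(mention: str) -> str:
    """Map 'p.Val600Glu', 'Val600Glu', or 'V600E' to a canonical 'V600E'."""
    m = re.fullmatch(r"(?:p\.)?([A-Z][a-z]{2}|[A-Z])(\d+)([A-Z][a-z]{2}|[A-Z])",
                     mention)
    if not m:
        raise ValueError(f"unrecognized variant mention: {mention!r}")
    ref, pos, alt = m.groups()
    to_one = lambda aa: THREE_TO_ONE.get(aa, aa)  # pass 1-letter codes through
    return f"{to_one(ref)}{pos}{to_one(alt)}"

# Three spellings an author might use for the same variant:
for spelling in ("p.Val600Glu", "Val600Glu", "V600E"):
    assert normalize_protein_variant(spelling) == "V600E"
```

Multiply this by legacy transcripts, DNA-level notation, indels, and outright typos, and the scale of the disambiguation problem becomes clear.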
BRITTNEE: Up until now, we’ve been focusing a lot of the conversation around ClinVar. I know it has both somatic as well as constitutional information, but this is probably even more compounded in the cancer space. There’s heterogeneity and there are varying mechanisms of disease in the constitutional space, or the hereditary space, but are there similar challenges? Are they different, or are they bigger in the somatic space?
MARK: Yes! You hit the nail on the head with respect to heterogeneity. Every constitutional disease, by and large, has its own idiosyncrasies and weird way that it causes disease. Cancer, with its focus on genes, is obviously not simple either, but biology is very economical. It is unabashedly reusing mechanisms to cause disease across multiple different disease types. The challenge that this occasions is an exaggeration of the proliferation of that data. In typical searches of cancer-related data in Mastermind, say, there are a lot of results. The questions that the user is asking now are less about “is this a variant that causes disease?”, which is typically established; it’s now more about “what do I do with it, and what specifically do I want to ask?” It would be things like: what diseases has it been found in? Is there prognostic significance? What do I treat this patient with? What are the latest guidelines? Are there any untoward side effects or contraindications in this context? It’s a greater depth of evidence, and a somewhat broader need to get more specific information and evidence out.
Those are the sorts of things that I see as being different challenges in the cancer space. For our AI capability, sensitivity is less important. It’s still important, obviously, and especially for the rare variants, but it’s less important for the BRAF V600E variant. Everybody knows that it’s causative, so it’s now about prioritizing the information. Just like the leukemia example I gave: what do you actually want to see? We’re giving users capabilities to refine their search in a way that hyper-focuses their attention on the most meaningful evidence for making their downstream clinical decisions. That, I’d say, is the pivot: more emphasis on specificity because of that greater depth of evidence.
BRITTNEE: You just mentioned this — you gave two different scenarios of two different types of variants. We have these ultra rare variants, and then we have variants that have been published eighteen thousand times. There are challenges that are going to be associated with these two situations. So, if I were a variant scientist — I am not, but let’s pretend for a second, I’m a variant scientist — so I’m going in there, and for the rare variant, I want to make sure that I find those two papers that mention that variant, which could be associated with some type of phenotype, disease, whatever you want to say. There’s that challenge, as opposed to the challenge of the 7,000 references that I need to get to, and I need to actually figure out which ones are important and what matters. Can you describe how AI can be used in those two scenarios?
MARK: This is a great question. Just recalling the information that was recently published by my team, and I think this was corroborated by some of our user input: when users take a query and assign Mastermind a task, and ask these questions of archived variants or even real-time variants, using their old method of searching, did they find everything that they needed for every patient with any one of these variants? The standard yield for that rare variant that you mentioned, you talked about a needle in a haystack — absent an AI capability like Mastermind, am I missing evidence? Well, yeah, you are! That happens ten or twenty percent of the time. I think the paper cites a statistic like fourteen percent. In our experience talking to users, it’s not small. It’s ten to twenty percent of cases where you do find evidence, and it is instructive, and it only comes out of the results from this very sensitive capability in Mastermind. So that’s on the first side, and that’s where we talked about the need to ensure that you’re maximally sensitive.
Then, on the other side, supposing it’s a widely published variant, at the one extreme it would be V600E or a similarly popular variant. There, it goes back to what I answered before: what do you want to find? A more typical case, and the search results that are most typical in Mastermind, are between one paper and a dozen, so about five papers. If you’ll permit me to describe a little bit more about the standard guidelines to interpret these variants: on the constitutional side, it’s ACMG, which requires population data, predictive data, and then publication data. The AMP guidelines are a little bit more focused, but they still benefit greatly from finding these functional studies or other related clinical studies. For much of that, the crux of the evidence needs to come from the published literature. With the ACMG guidelines in particular, you might have one study that’s well performed and appropriate, and that one study can tip the scales away from a variant of uncertain significance to a likely pathogenic variant.
So, in the context you talked about, where it’s been published, it’s in ClinVar, and there’s a paper, it actually becomes really important that you’re not missing the other two papers that describe them. Those other two papers, if they’re obscure, if they’re in a foreign language, or if they’re published a decade ago, those can wind up being critically important. That number of ten to twenty percent, actually, is just the tip of the iceberg for the importance of sensitivity, which can only come, as I’ve alluded to, with a computational or AI approach to organizing this data.
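To illustrate how one well-performed study can tip the scales, here is a deliberately reduced sketch of ACMG-style combining rules. The rules shown are a simplified subset chosen for illustration, not a complete or authoritative implementation of the guidelines.

```python
# Reduced, illustrative subset of ACMG-style combining rules; not a full
# or authoritative implementation of the published guidelines.
def classify(strong: int, moderate: int, supporting: int) -> str:
    if strong >= 2:
        return "Pathogenic"
    if strong == 1 and (moderate >= 3 or (moderate >= 2 and supporting >= 2)):
        return "Pathogenic"
    if strong == 1 and moderate >= 1:
        return "Likely pathogenic"
    if moderate >= 3:
        return "Likely pathogenic"
    return "Uncertain significance"

# With only one moderate and one supporting criterion, the variant is a VUS:
print(classify(strong=0, moderate=1, supporting=1))  # Uncertain significance
# One newly surfaced functional study adds a strong criterion and tips it:
print(classify(strong=1, moderate=1, supporting=1))  # Likely pathogenic
```

This is why missing even one obscure paper matters: a single strong criterion can move a variant across a clinically meaningful boundary.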
BRITTNEE: So, say I, as this variant scientist, went through all of this information that first time, and I classified this variant. Often, in the first scenario that we just talked about, there wasn’t enough evidence. There’s that one paper, but it doesn’t really mention how the variant is affecting the disease, or we’re not really sure what that variant is doing. Maybe it got associated, but not with anything strong. Now, we’re bringing up this new issue, or new challenge, where I need to continue to review that at some frequency. I need to look back and actually ask: is there something more? So, can you speak a little bit about that struggle? You have more experience in this space, obviously, than I do, so what does that struggle look like, and what is the answer to it?
MARK: It’s a struggle. I mean, it’s a real challenge. There’s great potential for benefit in being able to do that, but there’s obviously also a huge hurdle to do it: Who’s got the time? How are we going to report this information to the patient? There are liability issues about this kind of thing, the challenge of communicating with the clinician, all of those things are outside of the need to be able to find this information altogether. There’s a lot in the ecosystem that needs to be organized, but it’s starting to become a topic of conversation. Many of the labs that we deliver Mastermind to have an approach where they’re either developing or have already instituted periodic re-review of the variants.
There are two ways to do it. One is just a staged revisit: every six months, you look back. Another way is, every time you newly encounter the variant, if it’s been longer than a certain period, then you go back, but that method takes extra time. It’s a situation where most labs aren’t doing it now, because it’s challenging, but it’s going to provide benefit to the patient. Those two things are at odds, and that’s, like I said before, a situation that’s ripe for automation. Mastermind, being updated every week and having all this data in a database, is a perfect capability to automate those look-back needs. We have that in the form of Mastermind alerts. As I mentioned, we’re starting to see an uptick, in particular, with our larger reference labs using Mastermind alerts for that purpose. That’s a scenario where computational techniques can do the heavy lifting very sensitively, and then leave it to the human to review that information for specificity and accuracy, which is how Mastermind alerts work.
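The "go back if it's been longer than a certain period" policy reduces, at its core, to a simple elapsed-time check. A minimal sketch, with an assumed six-month window:

```python
from datetime import date, timedelta

# Re-review a variant only if its last review is older than a set window.
# The six-month interval is an assumed example, not a recommendation.
REVIEW_INTERVAL = timedelta(days=180)

def needs_rereview(last_reviewed: date, today: date) -> bool:
    return today - last_reviewed > REVIEW_INTERVAL

print(needs_rereview(date(2020, 10, 1), date(2021, 5, 13)))  # True
print(needs_rereview(date(2021, 3, 1), date(2021, 5, 13)))   # False
```

An alerting service automates the other half of the problem: instead of the lab polling on a schedule, newly published matching evidence triggers the look-back.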
BRITTNEE: Yeah, I think that’s a conversation that, in my previous life with Fabric, we were having with every single hospital system. What exactly does this look like? How do I report that back? Do I want notifications, do I want to do it actively, versus that notification being kind of more passive: something new comes out, and you tell me. I think everybody’s going through those debates right now. It’s interesting to see how it develops. The next thing I was going to dig into a little more specifically from the paper was structural variants; I’d say that’s probably one of the biggest challenges. Obviously, the recommendations for classifying CNVs came out not too terribly long ago, in the world of time and space. Can you speak a little bit about CNVs and fusions? Those are very different, but very difficult problems to solve. How does one review the literature for those?
MARK: I’ll speak to fusions first, because it’s a little bit of a cleaner scenario; it’s easy to understand. Most of the time, a rearrangement or a fusion takes the “guts” of one gene and links it with the “legs” of another, and they get together, and they cause havoc. Those are most frequently described as gene-dash-gene events. Those aren’t that hard to find, although increasingly, there’s a need to see the breakpoints, and to understand at a detailed level what’s going on with that individual scenario. It’s also the case that finding new events is uncommon. Actually, as a graduate student, I discovered a fusion in the context of a cutaneous lymphoma. As soon as I looked at the data, I was like, this is real, this is a jackpot, because they’re just not that common. It’s a very complicated mechanism of causing disease. If the data says there’s one of these, there’s a high selection bias in favor of it being real. The same is true on the clinical side: if you see an NPM1 translocation, and the patient has the appropriate disease, it’s probably real and causative.
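A crude illustration of why gene-dash-gene events “aren’t that hard to find”: a simple pattern can pull candidate fusion mentions out of text. The symbol pattern here is an assumption; a real pipeline would validate candidates against a gene-symbol dictionary rather than trust the regex.

```python
import re

# Rough stand-in for a gene symbol: an uppercase letter followed by 1-9
# uppercase letters or digits. Real symbols need an HGNC-backed dictionary.
FUSION_RE = re.compile(r'\b([A-Z][A-Z0-9]{1,9})-([A-Z][A-Z0-9]{1,9})\b')

def find_fusions(text: str):
    """Return (5'-partner, 3'-partner) pairs for gene-dash-gene mentions."""
    return [(m.group(1), m.group(2)) for m in FUSION_RE.finditer(text)]

print(find_fusions("The BCR-ABL1 fusion was detected alongside NPM1-ALK."))
# [('BCR', 'ABL1'), ('NPM1', 'ALK')]
```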
I would liken the identification of fusion genes to what we talked about before, as the cancer challenge, which is to say, okay, now what? It’s easy enough to find, but what else do you want to know? I should also caution that just because I think it’s easy to find with our computational capability, that doesn’t mean that it’s in all these other databases. In fact, that fusion event that I found is not found in other databases, despite it being published and republished subsequently. It’s a real causative fusion event, it’s just not forthcoming in some of these other organized databases that are manual and out of date, or lag behind for whatever reason. But it is in Mastermind.
So that’s the fusion component, but dealing with CNVs is a totally different animal. I should emphasize, fusions cause disease at a real but somewhat lower level than, say, mutations, when you amortize that across all different diseases. CNVs cause disease in a much larger percentage of cases, across all different disease types. Depending on the disease, five to thirty percent of cases are caused by CNVs. The challenge with CNVs is the nomenclature. There are a lot of colloquialisms. We’re talking about named entity recognition in the context of AI. If you’re just logging that this gene was deleted, you often will not use HGVS, even though it’s appropriate. Instead, you’ll say “exon 8” was cut out of this gene. So that’s a challenge to find, and it gets even worse when people start to use chromosomal coordinates, which you can describe in multiple different ways. It makes it hard to index with a computational approach. Then, also, the bigger deletions, like gross aneuploidies or cytobands that are amplified or deleted — there’s that heterogeneity in the description of the CNVs.
What I really want to focus on is how weird CNVs are, and the fact that you don’t need to know or record the exact change. That’s great! In fact, you’re far less likely to find the exact change that you see in your patient. There’s interdependence in the data that we index that the user will want to see. If I search for this region, and there’s a paper that describes a very similar CNV one way, and another paper that describes it another way, but it causes disease, or there’s one of those larger cytobands with a similar phenotypic presentation to my patient’s, the challenge is first finding all of those and then presenting them in a meaningful way. The paper touches on our capability to do this. Actually, it was published before that capability went live. Now that CNV search is live, I feel like our users are finding great value in it. We’re only getting better with the way that we visualize the data, and the sophistication with which we index it, and the value that we can provide for those cases where CNVs are probably the positive finding.
BRITTNEE: Yeah, that’s always been a challenge. What is an overlap? “It’s 90 percent overlap.” Are you talking evidence with CNV? Were you going from CNV to evidence? What 90 percent was overlapping? It seems like protein domains would matter… I’ve been seeing that as a huge challenge in the space, in order to organize that type of data and search broadly enough that you actually get the information.
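The “90 percent overlap” question Brittnee raises has a standard formulation, reciprocal overlap: the shared interval must cover a minimum fraction of both CNVs, so a tiny call sitting inside a huge one does not count as a match. A minimal sketch, using half-open coordinates; the function name is illustrative.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Return the smaller of the two overlap fractions (0.0 if disjoint).
    Intervals are half-open [start, end)."""
    shared = max(0, min(a_end, b_end) - max(a_start, b_start))
    if shared == 0:
        return 0.0
    return min(shared / (a_end - a_start), shared / (b_end - b_start))

# Two similar deletions: the overlap is high in both directions.
print(round(reciprocal_overlap(100, 1100, 150, 1150), 2))   # 0.95
# A small CNV inside a much larger one: the overlap is asymmetric.
print(round(reciprocal_overlap(100, 10100, 200, 1200), 2))  # 0.1
```

Taking the minimum of the two fractions is what makes the criterion “reciprocal”: a 90 percent threshold then answers both of Brittnee’s directions at once, evidence-to-CNV and CNV-to-evidence.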
I think we’ve got some questions coming in, so what I’m going to do is just jump to the last closing questions that you and I actually put together. One of my big ones in all of this, and what we were hearing about at least at Fabric, is about time. A lot of this comes down to time. How long does it take a variant scientist to get through and actually analyze all of these papers, and how do they know what not to get through? They want to return as many results to patients as they can, that’s going to help them diagnose and help these patients as quickly as possible. Have you analyzed much of the time savings that come from these types of AI approaches, especially around prioritization?
MARK: Right, so we’ve done a couple of studies, and so too have our users, about the time savings. I’ll refer you to the paper, where I describe some of the time savings. We’ve also written a grant in the service of curating the genome, where we say it’s only possible when you expedite or semi-automate those things, as I talked about, with our combined computational approach followed by expert curation. On the question of time, you can actually break it out into components. The database organization has been automated, so that’s just straightforward bioinformatic processing. If you see this in your patient assay, results go to this database or that database for population frequency and predictive data. That all happens automatically and is highly quantitative. The data can even be figured into, say, an ACMG framework.
The real challenge, though, starts when you talk about the published evidence, which is, again, where the crux of the evidence comes from for these interpretations. That itself can be broken out into two major phases. The first is searching, finding, organizing. That’s what I call the “muscular” time component, and that takes the longest time. Then, there’s the “cerebral” component, where, once you’ve found everything and prioritized it, you know where to look for functional or clinical stuff (this is not a good paper, this is a good paper), and then you just look at the evidence and figure that evidence into the context of the ACMG framework. For example: this is a cohort of patients who have a higher allele frequency for this variant than the controls, this is a family where there’s segregation of disease, or this is a case where they’ve actually followed on with functional studies. Once you have all those references highly sensitively pre-organized and pre-annotated, and especially if the data is presented in a meaningful way to the user, the cerebral part becomes the only time barrier.
In my own studies, I’ve sort of fractionated that muscular searching component. That’s the majority of the time that’s spent for manual searches, and that’s what Mastermind and Genomenon have largely automated. We’ve shrunk that time component down now, so it’s a highly cerebral exercise. I joke with my curation team members that they’re getting smarter, because they’re exercising their brains, but their bodies are getting deconditioned, because they’re not doing the muscular component. It’s done for them.
BRITTNEE: Yeah, and it also means that those people that are highly specialized, that are being trained for all of these reasons, are using their time in the right way, and aren’t just professional Google searchers.
MARK: That’s a great point. We have variant scientists on my team, and I can actually sort of round out our part of this discussion by saying that I don’t expect AI to take over the human component, ever, particularly in clinical medicine, particularly with the complexity attendant to genetic and genomic information. The application of AI, as I just described it, is all about the muscular component: sensitivity, organizing, annotating content, delivering the information in the most appropriate way to the user. It will always rely on the human user to approve of that evidence, if they’re producing, say, a curated database, or to otherwise validate that curation as the end user. The variant scientist’s position is not going away. In fact, as you alluded, it’s only getting more meaningful. We need more of these people. There just aren’t enough of those appropriately trained folks in the workforce.
BRITTNEE: Yeah, because we’re going to be moving people away from a single-SNP assay; obviously, that’s a thing of the past, and we’ve moved to panels for the majority of sequencing. As we go from panels to whole genomes, those people are going to become more important, which actually brings us back to the beginning and this discussion of the process not being a black box. I’ve heard it said around here, and I love the words around Genomenon: “show your work.” It’s another way of talking about presenting that information back in such a way that a human, a variant scientist, a trained professional can go through and really analyze that data.
BRITTNEE: So, with that, we’ll turn it over to Garrett! I believe you’re gonna jump back on and ask some of the questions that have been coming from the audience.
GARRETT: I am, yes! Mark and Brittnee, this has been a really great conversation. We do have some questions that have come in from the audience that we’d like to get to. Everyone who’s watching, if we don’t get to your question, a member of our team will get back to you.
So, Mark and Brittnee, a viewer wants to know: How do you classify variants from “damaging” to “pathogenic?” We carry a lot of “damaging” variants, but only a few could be described as “pathogenic.”
MARK: Great question. I think one of my previous answers talked about the ACMG and the framework. If you’re not familiar, it’s a very complicated, but very structured analytical approach to reviewing and adjudicating the evidence. The viewer points out a good case, where we have these neat, tidy categories of “benign,” “likely benign,” “uncertain,” (which is a big middle ground) “likely pathogenic” and “pathogenic.” Biology just doesn’t work that way, but from a laboratory perspective, we need to have reproducibility. We can’t say, “it’s kind of this way, and we think we might do this.” You actually have to take action, you have to make a diagnosis or not, you have to treat the patient or not, etc., etc. In a way, we’re sort of forcing these categories onto what is more of a spectrum.
In particular, when we talk about variants of uncertain significance, there’s a spectrum of variants of uncertain significance. There are some that have a modest amount of evidence, but it’s insufficient to go over the threshold for that variant to be deemed “likely pathogenic” by the ACMG. Then, there are others where there’s actually a lot of evidence, but it’s still below some threshold. I think the viewer’s question is sort of touching on that idea, that really, there’s a spectrum, and the ACMG framework is intended to put the decision in the hands of the user in context-dependent terms. There are other clinical parameters around this patient and their presentation. There are other circumstances that will guide the variant analyst and/or the clinician into making the most appropriate determination. Sometimes, that can mean, yeah, it’s technically a variant of uncertain significance, but the clinical circumstance would dictate that my action as a clinician will be to treat, because of these circumstances, the clemency of the treatment, the treatment’s effectiveness, the fact that the patient has other family members who are affected who have the same mutation, etc., etc. Brittnee, I don’t know if you want to build on that?
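The combining rules Mark references can make the category-forcing concrete: the ACMG/AMP framework groups evidence codes by strength and combines them by rule, with anything that misses every rule landing in the big middle ground. This sketch encodes only a simplified subset of the 2015 guideline’s combinations and is in no way a substitute for them.

```python
from collections import Counter

STRENGTHS = ("PVS", "PS", "PM", "PP", "BA", "BS", "BP")

def classify(criteria):
    """Classify a variant from a list of met evidence codes, e.g. ['PVS1', 'PS3'].
    Simplified subset of the ACMG/AMP 2015 combining rules."""
    n = Counter()
    for code in criteria:
        for prefix in STRENGTHS:  # PVS checked before PS, PM before PP, etc.
            if code.startswith(prefix):
                n[prefix] += 1
                break
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm == 1 and pp == 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs == 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs == 1 and bp == 1) or bp >= 2

    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "Uncertain significance"  # the big middle ground

print(classify(["PVS1", "PS3"]))  # Pathogenic
print(classify(["PM2", "PP3"]))   # Uncertain significance
```

The VUS spectrum Mark describes falls out of the structure: a variant with one moderate and one supporting criterion and one with almost enough for “likely pathogenic” both return the same label, even though the underlying evidence differs greatly.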
BRITTNEE: I think you really said it. It is a spectrum. They’re trying to give you some type of guidelines, and I think one of the great things about using the type of approach that we’re taking, where we present you back with evidence in a very organized manner, is that you can actually normalize across different people. What you don’t want to see, even across different labs, is that this person over here uses that framework and says, “this is likely benign, this isn’t actually anything we’re concerned about,” while somebody else, using the same framework, because it’s just a framework and there’s still so much variability in it, says, “this is likely pathogenic.” Presenting back more information, more information those people can consider, is critical. If this person was missing half the information, they may call that variant likely benign, whereas if they had all that information, they could realize, no, there’s really a lot more.
GARRETT: Very good. Our next question is: Hi! Could you explain GLP in detail?
MARK: Hi! Oh, boy. So, GLP is a living thing. It’s been accumulating facets and capabilities and improvements since Genomenon’s founding, which is actually seven years ago yesterday, for a little bit of history!
BRITTNEE: Happy birthday!
MARK: Yeah, let me do my best to phrase this succinctly, but in sufficient detail so that you get more information than I provided before. I alluded to the way that we find genes, with all of the synonyms and aliases and nicknames and previous versions etc., etc. That is the sensitivity part. There’s also the need, once you’ve found a gene in one of these references, to know if it is, in fact, a gene, or if it’s a true technical match, but it’s a false biological match. A better example than the one I usually give: SDS is a gene, but it’s also a detergent that’s used all over the place in blots and dishwashing, that sort of thing. GLP not only recognizes these things, but is context-aware enough to say, it’s not the gene that’s being referred to here. There’s still a healthy balance between sensitivity and specificity. We take great pains to strike the right balance of being sure that we can throw away stuff, but not so much that you lose out, and throw some babies out with the bathwater.
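The SDS example can be illustrated with a toy context filter: keep a symbol match only when nearby words look genetic rather than reagent-related. The cue lists, window size, and function name here are made-up assumptions; GLP’s actual disambiguation is far more sophisticated than keyword lookup.

```python
# Illustrative cue vocabularies; a real system would learn these from data.
GENETIC_CUES = {"gene", "mutation", "variant", "expression", "allele", "exon"}
REAGENT_CUES = {"gel", "detergent", "buffer", "electrophoresis", "lysis"}

def is_gene_mention(text: str, symbol: str, window: int = 6) -> bool:
    """Decide whether a symbol occurrence looks like a gene, based on
    whichever cue set appears first in a small word window around it."""
    words = [w.strip(".,;()").lower() for w in text.split()]
    for i, w in enumerate(words):
        if w != symbol.lower():
            continue
        context = set(words[max(0, i - window): i + window + 1])
        if context & GENETIC_CUES:
            return True
        if context & REAGENT_CUES:
            return False
    return False  # symbol absent or no deciding cue: stay conservative

print(is_gene_mention(
    "Proteins were separated by SDS polyacrylamide gel electrophoresis.",
    "SDS"))  # False: a true technical match, but a false biological match
print(is_gene_mention(
    "A missense variant in the SDS gene was found.", "SDS"))  # True
```

Note the asymmetry Mark describes: when no cue decides, the sketch errs toward dropping the mention, whereas a sensitivity-first system like the one he describes would more likely keep it for human review.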
That’s at the gene level, but just like I said before, if you have a gene in a paper, you’ve also got variants in the paper; there’s the need to be sure that they go together. GLP is bioinformatically literate. It knows what the nucleotide is at this reference position in this transcript for this gene. If you have these three genes and these three mutations, the challenge for NLP, or for GLP, is to connect the right gene to the right variant. It does so fairly sophisticatedly, using the genome sequencing information, using context clues, as well as using the data that exists outside of that paper to determine whether or not it’s more likely to be this variant, because that variant was seen before.
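The “bioinformatically literate” check can be illustrated by validating a simple substitution against a coding sequence: the stated reference base has to match the transcript at that position. The tiny sequence and function name are invented for illustration; real transcript handling involves versions, UTRs, and legacy numbering.

```python
import re

def matches_transcript(transcript: str, variant: str) -> bool:
    """Check a simple substitution like 'c.5T>A' against a coding sequence
    (1-based positions, as in HGVS). Only this one variant shape is handled."""
    m = re.fullmatch(r"c\.(\d+)([ACGT])>([ACGT])", variant)
    if not m:
        return False  # sketch covers simple substitutions only
    pos, ref = int(m.group(1)), m.group(2)
    return pos <= len(transcript) and transcript[pos - 1] == ref

CDS = "ATGGTCACA"  # hypothetical 9-base coding sequence
print(matches_transcript(CDS, "c.5T>A"))  # True: position 5 is T
print(matches_transcript(CDS, "c.5G>A"))  # False: reference base disagrees
```

Run against each candidate gene’s transcript, a check like this is one signal for pairing the right variant with the right gene when a paper mentions several of each.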
The last thing I’ll say, and really the heart and soul of GLP, is how sophisticatedly we can index the variants. An rsID typically describes a polymorphism. That’s an easy thing to find. It’s just RS number number number number. It gets much more complicated when you talk about cDNA variants. There are multiple ways to describe a cDNA variant, complicated by transcripts and legacies like signal peptides and that sort of thing. Then, there are also the protein-level descriptions. I mentioned V600E before in some of my answers. Simple, right? Well, there are really about 120 different, completely distinct ways that this variant has been described in the literature. GLP is aware of all of those, and is able to accurately identify, yes, this is BRAF V600E, it belongs in this result set, whichever of those 120 ways the user chooses to type in for their search. Hopefully, that’s enough of a flavor of GLP, but keep in mind, it also handles diseases, phenotypes, drugs, categorical keywords, etc. I just gave you the tip of that iceberg.
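A taste of what collapsing those 120 notations involves: a few common spellings of a protein variant can be normalized to one canonical form. This sketch handles only a handful of patterns, nothing like GLP’s real coverage, and the pattern and function name are illustrative.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Optional 'p.' prefix, optional parentheses, one- or three-letter residues.
PATTERN = re.compile(
    r"^(?:p\.)?(?:\()?([A-Z][a-z]{2}|[A-Z])(\d+)([A-Z][a-z]{2}|[A-Z])(?:\))?$"
)

def normalize(mention: str):
    """Return a canonical one-letter form like 'V600E', or None."""
    m = PATTERN.match(mention.strip())
    if not m:
        return None
    ref, pos, alt = m.groups()
    ref = THREE_TO_ONE.get(ref, ref)
    alt = THREE_TO_ONE.get(alt, alt)
    if len(ref) != 1 or len(alt) != 1:
        return None  # unrecognized three-letter code
    return f"{ref}{pos}{alt}"

for s in ["V600E", "p.V600E", "p.Val600Glu", "Val600Glu", "(V600E)"]:
    print(s, "->", normalize(s))  # each normalizes to V600E
```

Indexing under the canonical form means a user can type any recognized spelling and land in the same result set, which is the behavior Mark describes.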
GARRETT: Okay, so in our last six minutes here, we probably have time for one more question: Do you foresee Mastermind as replacing ClinVar, or some of the public databases?
MARK: I don’t. In fact, I’ve had conversations with folks at high levels in ClinVar, and what we’re working on, it’s a different purpose. Mastermind does not receive patient information. We don’t receive variants from users. That’s what ClinVar does. Moreover, ClinVar doesn’t curate, they don’t put an emphasis on organizing the evidence or being very sensitive about those evidence citations. They’re a repository of information. They don’t supervise what the users submit. I don’t view them in a competitive light at all. I don’t think one is better than the other, they’re just used for different purposes. Actually, what I’d like to see is more of an integration of the benefit of ClinVar, from the live, flesh-and-blood, clinically submitted data, as well as the empirical, evidence-based, archival data from the literature that only gets more and more proliferative. We’re focused on this, and we’re doing it really well, and ClinVar is focused on doing this and doing it capably. I see a great benefit to merging those capabilities for the benefit of the community. Brittnee, I don’t know if you want to speak to that too?
BRITTNEE: No, I would say the exact same thing. I think there’s always going to be utility in people submitting, and also, with that, you get to see, at a glance, who that was. You get to see, at a glance, that this information came from lab one or lab two, and make the decision: I trust that lab because of the criteria they’re submitting along with their answers. There are different types of information available there, versus the breadth, the sensitivity of information that Mastermind can show, because that’s from all the literature, including supplemental information. That may not be something that anyone’s ever going to deposit in ClinVar, but hopefully we can help build ClinVar.
GARRETT: Thank you both! Do you guys have any closing thoughts?
MARK: Oh, just to say, again, Brittnee, welcome aboard! It’s really great to have you. It’s funny, I feel like we’re old friends, and we’ve only had two hours worth of conversation, with this hour being one of them — It’s good to have you on board!
BRITTNEE: Actually, my one closing remark was: you keep saying that Genomenon’s mission is to curate the entire genome, so when are we going to get on that? When is that going to happen?
MARK: Jeez. Already cracking the whip! Well, I told you, we’re taking pains to start out with a prioritized approach. I think we’ll surprise people with how quickly we do it.
BRITTNEE: That speaks to this idea that the more information people put into publications, which is great, the more scientific knowledge our work will continue to surface, because we’ll have to index even more.
GARRETT: Cool. Well, to everyone who is on this webinar, thank you for tuning in! Again, this webinar is recorded, and we’ll send that to you soon. Mark and Brittnee, thank you so much for taking your time to offer your expertise!
If you have any questions, feel free to send us an email at firstname.lastname@example.org. If you haven’t signed up for Mastermind yet, we have a bit.ly link on the slide that’s up where you can create your free account and start with a free trial of Mastermind professional edition. We’ll send you all this information in an email, as well. At the conclusion of this webinar, you’ll receive a short post-webinar questionnaire. We’d love to hear what you thought of today’s event! With that, I think we’re wrapped up. Thanks so much!
BRITTNEE: Yep, thank you! Bye!