Mastermind Masterclass: Human Phenotype Ontology (HPO)

With the release of version 2.0, the Mastermind Genomic Search Engine now includes the ability to search the genomic literature by phenotype.

Patients, particularly those with rare diseases, don’t always have a name for their condition. This makes it very difficult to diagnose and treat their disease. The ability to search the genetic literature by symptoms (following the Human Phenotype Ontology, or HPO) helps clinicians narrow their diagnosis to the appropriate disease and treatment.

Co-founders Dr. Mark Kiel and Steve Schwartz led a concise guided tour of the new HPO search feature and Q&A with attendees.

Masterclass Takeaways:

  • How to determine the causative gene(s) associated with a patient’s phenotype or combination of phenotypes
  • How to discover which syndrome(s) are associated with a given phenotype or combination of phenotypes for clinical diagnosis
  • How to identify associations between genetic mutations and clinical syndromes or phenotypes

Q&A: 

Questions Answered During the Webinar

Q: What determines the order of genes on the list in the result of the search?

A:
Mark: It’s prioritized by number of articles in its sort of base presentation. With Diane’s example for Peutz-Jeghers syndrome and STK11 being present, you saw PTEN being present below that, I think with 80-some articles, given that it does have pigmentation challenges and hamartomas. STK11 had 50% more results, and so it was prioritized upfront, and then there was a very steep drop-off after that. In its simplest version in the user interface, the ordering of gene matches when phenotypes are input is based just on the sheer burden of results. We obviously do have very sophisticated capability to prioritize the results on the variant side in that list of references that include variants. When you add any key terms, we consider how close together those key terms are to the variant or gene that you’re searching for, as well as how frequently those key terms appear throughout the paper. That is a very high-level view of how our prioritization algorithms present those article results to maximize the likelihood that the first 5 or 10 papers will be papers that you can render a diagnosis with. Especially in light of what Steve was able to show, the sky’s the limit on all the ways that you, as an informatician, might want to prioritize the results based on the collected API returns that emerge from your query.

Steve: Since this is a Masterclass I can add a little bit more depth to that as well. When talking about the prioritization of results within Mastermind, the UI, and the API, I often will dissect the problem into two main components that make it a little easier to understand. First, understand that the associations and the results that we return are from the medical evidence, so they all stem from what has been studied and cited in the literature. With that in mind, the order of article results for any given query is determined by a sort of static Association strength and a dynamic Association strength component. The static Association strength comes from characteristics of each article in and of itself. Things like, what journal was it published in? What was the publication date? What is the impact score of the journal? The citation count of the article?… things like that, so those are sort of characteristics that follow each article around no matter what the search is that you’re doing that results in that article being present in the results. The second is the dynamic component, which is how relevant is the article to the thing you just searched for? That’s where characteristics such as Mark was describing come into play, meaning if you search for a variant and a phenotype, how closely were those cited in the article? How many times were each of those cited in the article? How many times were they cited in the article relative to other entities in the article? For example, was my variant mentioned once in the article, and the article also mentioned 300 other variants because it was a GWAS study? Or was my variant mentioned 5 times in the article, and only one other variant was ever mentioned in the same article, meaning the article was about my variant? So how strong is the evidence for each term that you search for?
And then, how closely related are those two or more terms in the results? That’s how we prioritize articles. And then you can see how our capabilities become more and more sophisticated to, for example, rank the genes as results to your search query, because a lot of the strength of evidence for the genes stems from the articles. We can rank them on how relevant the articles that inform that gene association were to your search, and more. There are so many factors that we use in that prioritization, and it is something that we’re continuing to refine and improve.
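The static and dynamic components Steve describes can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the `Article` fields, the formulas, and the `dynamic_weight` value are invented for the example and are not Mastermind’s actual scoring.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    pmid: str
    journal_impact: float   # static: follows the article in every search
    citation_count: int     # static
    mentions: dict = field(default_factory=dict)  # entity -> times mentioned


def static_strength(article):
    # Hypothetical formula: properties of the article itself, query-independent.
    return article.journal_impact + 0.1 * article.citation_count


def dynamic_strength(article, query_terms):
    # Hypothetical formula: share of all entity mentions that belong to the query.
    total = sum(article.mentions.values())
    if total == 0:
        return 0.0
    return sum(article.mentions.get(term, 0) for term in query_terms) / total


def rank(articles, query_terms, dynamic_weight=10.0):
    def score(article):
        return static_strength(article) + dynamic_weight * dynamic_strength(article, query_terms)
    return sorted(articles, key=score, reverse=True)


# A GWAS-style paper: the variant is mentioned once among 300 other variants.
gwas = Article("PMID-GWAS", 5.0, 20,
               {"variant_X": 1, **{f"other_{i}": 1 for i in range(300)}})
# A focused paper: the variant is mentioned 5 times, one other variant once.
focused = Article("PMID-FOCUSED", 5.0, 20, {"variant_X": 5, "variant_Y": 1})

ranked = rank([gwas, focused], ["variant_X"])
print([a.pmid for a in ranked])
```

With identical static properties, the focused paper sorts first because the searched variant accounts for most of its entity mentions.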

Q: Can you put parentheses between the search terms? i.e. use a combination of and/or?

A:
Steve: Right now we’ve tried to build what we refer to as the ‘boolean search’ capability – the capability of searching for more than one term within any given entity. A regular search would be up to one term of each entity type. For example, a gene and variant and disease and phenotype would be a pretty complex basic non-boolean search because it’s just this and this and this and this. The boolean search comes into play when you might have more than one term for any one of those. For example, this gene and this gene, and this variant or this variant, or this variant and this phenotype. Our current boolean search has the parentheses implied around each entity type, and then you can change the and/or within any one entity. For example, you can search for (gene A and gene B), and (variant A or variant B) or (variant C or variant D), and you then have parentheses around the diseases, parentheses around the phenotypes, etc. The ability to then change those parentheses and have (gene A and variant A), or (gene B and variant B), and sort of move those parentheses so that they’re around combinations of entities is not something that we currently have. We have discussed it, but we haven’t yet had any users or customers reach out to us saying that it was a necessary component of their pipeline, or something that they had really even started investigating doing yet. All of that said, you can do that in the API, because in the API you can just put ‘or(s)’ between everything, and then you can apply whatever combination of logic you want, placing the parentheses wherever you want, in the analysis of the results as a post-processing operation on your own machine in your own code. So it is possible to do with the API as a post-processing operation.
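As a rough sketch of the post-processing Steve describes, the Python below applies arbitrary parentheses to results that were retrieved broadly (for example with ‘or’ between everything). The nested-tuple expression format and the `articles` data are assumptions made for this example; they are not part of the Mastermind API.

```python
def matches(article_entities, expr):
    """Evaluate a nested boolean expression against one article's entities.

    expr is either a term (str) or a tuple ("and" | "or", operand, operand, ...).
    """
    if isinstance(expr, str):
        return expr in article_entities
    op, *operands = expr
    results = (matches(article_entities, operand) for operand in operands)
    if op == "and":
        return all(results)
    if op == "or":
        return any(results)
    raise ValueError(f"unknown operator: {op!r}")


# Hypothetical results of a broad query: PMID -> entities found in the article.
articles = {
    "PMID-1": {"geneA", "variantA"},
    "PMID-2": {"geneB", "variantB"},
    "PMID-3": {"geneA", "variantB"},
}

# (gene A and variant A) or (gene B and variant B)
expr = ("or", ("and", "geneA", "variantA"), ("and", "geneB", "variantB"))
hits = sorted(pmid for pmid, entities in articles.items() if matches(entities, expr))
print(hits)
```

Because the filtering runs locally, the grouping can cut across entity types in ways the user interface’s implied parentheses do not allow.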

Q: If the phenotypes entered by the user don’t include the output phenotypes, what will you see?

A:
Steve: I’m assuming that question is in relation to the script that I showed. In the context of that script, which uses the API to do the variant phenotype association and identify diseases, I think the scenario we’re talking about here is that the variants and the phenotypes that you input don’t result in any articles that mention those variants together with any of the input phenotypes. The output that I showed you is only the summary file that the script generates; it actually generates five different files. It generates two summaries: one is a really high-level summary, and one is a very detailed summary. Then it generates three or four CSV files that list all of the results by article, all of the results by variant, all of the results by gene, all the results by phenotype, and all the results by disease, and however many files that was. So what will end up happening if none of the variants are cited together with any of the input phenotypes is that the summary file will show no results in the section that has the variants with the phenotype. Above that, there’s a different section that will look for any of the variants that were co-cited together in articles absent any phenotype matches, so you’ll still see the results of any variants that were cited together with each other, what those PMIDs were, and then you’ll see a list of all the phenotypes and diseases that those articles had regardless of whether they were in your search.

Mark: If your specific input phenotypes are found, that appears front and center, with maximal efficiency in finding the results that you had asked for. If none of those input phenotypes are found, what Steve was getting at was that the API script assembles all of the phenotypes, irrespective of what your input was, and puts that at the second tier, which, if the first tier is null, you’ll see emerge immediately. I like to say there are those two phases: one is recovery of an exact match or even a partial match of what you input, and the second phase is discovery of things that you didn’t even know you wanted to see, which comes out of that vast Genomic Association network that Steve described and showcased with the script, invoked on any given variant data set for any patient.
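The two tiers described in these answers can be sketched in a few lines of Python. This is a simplified illustration, not the webinar script itself: the function name, the input shape (PMID mapped to HPO IDs), and the sample data are assumptions.

```python
from collections import Counter


def recovery_and_discovery(articles, input_phenotypes):
    """Split phenotype evidence into two tiers.

    recovery:  articles that cite at least one of the input phenotypes.
    discovery: every phenotype seen in the articles, counted by article,
               regardless of whether it was part of the input.
    """
    wanted = set(input_phenotypes)
    recovery = {}
    discovery = Counter()
    for pmid, phenotypes in articles.items():
        found = set(phenotypes)
        discovery.update(found)
        matched = wanted & found
        if matched:
            recovery[pmid] = sorted(matched)
    return recovery, discovery


# Hypothetical articles that cite the patient's variants but none of the inputs.
articles = {
    "PMID-1": ["HP:0000252", "HP:0001263"],
    "PMID-2": ["HP:0001263"],
}
recovery, discovery = recovery_and_discovery(articles, ["HP:0001250"])
print(recovery)                  # first tier is empty here
print(discovery.most_common())   # second tier still surfaces what the papers report
```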

Q: How does this tool manage the issue of having many different transcripts for a single gene? Papers do not always have proper information on which transcript is being used for naming variants.

A:
Mark: They almost never do, but I’ll answer succinctly that Mastermind handles those handily. We have something called Genomic Language Processing (GLP) that is responsible for the indexing of the content as well as understanding what the user’s input is. There’s a lot of complexity there that belies my short answer, but Mastermind is “genomics aware”: aware of all these transcripts, attendant legacy nomenclature issues, all of the nuances of the genetic code, and the complexities of the human genome. One of the value propositions of Mastermind is that you needn’t worry about or think about those things, because our infrastructure has already done that consideration for you and presents the results appropriately. Steve, I don’t know if you want to speak to what the user experience is, that they get the benefits of that maximal sensitivity. I think Paula had asked that question. The indexing recognizes all of the transcripts for any given gene, and when the authors haven’t declared what transcript they’re speaking in terms of, Mastermind recognizes which of those transcripts the variant may be associated with. If the authors didn’t declare a specific transcript, we maximize the sensitivity of those results by associating with all of those transcripts that are appropriate based on the cDNA mentioned or the reference allele at that protein position.
As Diane showcased in one of her examples, we show the sentence context where the authors describe those matches, so that you can then be the judge: “is this my specific result or do I disagree that this is the very variant that I was looking for?” In which case, you can just very quickly move on. I think what Steve was interested in getting at was that when you perform a search at the cDNA level for a specific variant, one of the mechanisms that we have in place to prioritize those results will be aware that you’re looking for that specific cDNA match and, in addition to prioritizing according to that match, will showcase very quickly, in that lower right compartment where the sentence context is mentioned, what exactly the authors said. You can rapidly toggle from result to result and just be sure that you’re getting the most specific result depending on your specific clinical concern, whereas other circumstances warrant that you benefit from seeing all of those results irrespective of the specific cDNA match. Mastermind allows you to play both sides of that fence.
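To make the “associate with every appropriate transcript” idea concrete, here is a toy Python sketch. It is not how GLP is implemented: the transcript IDs and protein sequences are invented, and the only rule shown is the one mentioned above, namely that a protein-level mention is compatible with a transcript when the reference residue at that position matches.

```python
import re

# Hypothetical transcripts for one gene, with made-up protein sequences.
TRANSCRIPTS = {
    "TX_A": "MKRLH",
    "TX_B": "MKQLH",
    "TX_C": "MARLHR",
}


def candidate_transcripts(mention, transcripts):
    """Return transcripts whose reference residue agrees with a mention like 'R3H'."""
    parsed = re.fullmatch(r"([A-Z])(\d+)([A-Z])", mention)
    if parsed is None:
        raise ValueError(f"unsupported mention format: {mention!r}")
    ref_residue, position = parsed.group(1), int(parsed.group(2))
    return sorted(
        tx_id
        for tx_id, sequence in transcripts.items()
        if position <= len(sequence) and sequence[position - 1] == ref_residue
    )


# An author writes "R3H" without naming a transcript.
print(candidate_transcripts("R3H", TRANSCRIPTS))
```

Keeping every compatible transcript favors sensitivity; the sentence context shown to the user is what lets them confirm or reject a given match.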

Q: How well does Mastermind prioritize genes and variants in complex undiagnosed disease cases, or potentially a new genetic condition? Imagine a complex congenital condition with dozens of phenotypes across multiple organ systems and dozens of candidate genes or variants.

A:
Mark: Awesome question, and it gets back to the answer that I gave about data recovery and data discovery. The recovery is when you’ve got phenotypes and variants that you’ve searched on – if there’s any evidence in the literature, we’ll pull that out and show you exactly what those references are. If what we’re talking about is no explicit associations that very broadly unite the disease and the variant and the gene, etc., and the phenotypes, if that kind of information doesn’t exist in the literature, you would default then to that discovery phase. Where you’re talking about complex undiagnosed cases, the power of the Mastermind API is its ability to draw on all of that information and aggregate it all for you in a way that you couldn’t hope to do manually, and then present the fruits of that automated work in a ranked, prioritized order. Whenever you’re looking for things that aren’t crystal clear, you should be able to tolerate some false leads, but when you have no other option, the API will give you the best opportunity to very quickly find among those false leads any real lead that will then get you to the eureka moment that I think you’re seeking when it’s an undiagnosed case. Diane highlighted a facile example with Prader-Willi, where you have 15q chromosomal effects. You can imagine a case that defies diagnosis, where you’ve got some abnormal chromosomal microarray results and some confusing constellation of phenotypes that don’t form a syndrome in your eye as you’re looking at it clinically. If you input those results into the Mastermind user interface or the API, you may very well have that eureka moment based on the prioritized results that would otherwise take you probably many hours to days to comb through iteratively doing Google searches or what have you, if you’re even able to find that information at all.
That is really the power of the script that Steve showcased in the specific version that he showed, and in any permutation thereof depending on your inputs and the types of queries that you’re looking for. So your question was quantitative, and I can’t answer it quantitatively. Suffice it to say that one of the powers of the capability of the Mastermind API is data discovery in that vein.

Steve: I’ll also add that we’ve actually had quite a bit of experience with that specific scenario exactly, where we worked with one large company that was doing a trial of our API, and they ran a script very similar to the one that I showed where the first thing they threw at it was a difficult unsolvable case that they had for a rare disease. The patient’s information was sent to several different labs and other providers that were unable to explain the combination of phenotypes that the patient was having. Then they tried it with our software and it actually came up with a combination of three different variants across two different genes that when combined, fully explained the full scope of phenotypes the patient had and led them to a candidate diagnosis that they had never been able to find before. That was just the result of the one-off trial that they did with us where they tried to throw one of their toughest cases at our API. We’ve done similar trials with several other companies that have seen in some cases anywhere from 10 to 50% diagnosis rate on some of their hardest cases, so we’ve definitely seen the data shine in those kinds of scenarios.

Q: Does your search engine enable finding connections within and across other species, like mouse and mouse to human?

A:
Steve: We currently don’t do this for other species, but we have built the platform with the ability to do that. We have done it on a case by case basis with some customers in the past.

Mark: I’ll say that the gene synonyms for human genes often include those for lower species, even yeast, and certainly there’s a lot of parity across the mouse and rat and other standard research tool gene names, and the hexagon chart that Steve showed bringing all those genomic associations together would certainly be true across species, absent the variant, which is much more species-specific. There are ways to embed some of those species names in your API query to specifically pull out that information as you please in your API calls.

Q: How many HPO terms can be entered at a time?

A:
Mark: As many as won’t annoy the user! In the user interface, you can collect as many as you’d like – I’d like to say where we showcased how automatable the API is, the user interface can be programmed to include those HPO terms because it has URLs that are called “deep linking” – it’s got the actual term in the URL. If you have a workflow and you’re either savvy with Excel or you have an informatics group, you can pattern all of those HPO queries into a link that you then click into Mastermind for any given gene – I don’t know if there’s a limit.
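A deep link of the kind Mark mentions can be assembled programmatically. The sketch below shows only the general pattern; the base URL and the `gene` and `hpo` parameter names are placeholders chosen for illustration, so check the actual Mastermind URL of a search you have run in the browser before relying on it.

```python
from urllib.parse import urlencode

# Placeholder values: NOT the real Mastermind URL or parameter names.
BASE_URL = "https://example.org/search"


def deep_link(gene, hpo_terms, base_url=BASE_URL):
    """Build one clickable URL carrying a gene and any number of HPO terms."""
    params = [("gene", gene)] + [("hpo", term) for term in hpo_terms]
    return base_url + "?" + urlencode(params)


print(deep_link("STK11", ["HP:0001250", "HP:0001263"]))
```

The same pattern works as a spreadsheet formula: concatenate the base URL with the encoded terms for each row and click through.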

Steve: There’s an answer for the user interface and an answer for the API, actually. So as you said Mark, there’s not an explicit limit. I actually don’t know and I haven’t personally tried stressing the system to see if it’s a hundred or a thousand or ten thousand HPO terms. I’m actually not sure at what point the application starts to bog down and not be able to complete the request. Theoretically, the more HPO terms you put in, the slower it should get, but in our day to day use and in all of the use cases we’ve seen with the customers that we’ve been using this with so far, we haven’t seen any practical limit on how many they can do. On the API side, there’s the script that I just showed you and the output from it. We actually built that script with HIPAA considerations in mind. We wanted to limit the amount of information that was actually being sent to our server in the first place, so the script that I just showed you doesn’t query our server for HPO terms at all. What it actually does is it finds every article for every variant, and then it looks through the article info endpoint that shows the results of everything we’ve indexed in the article and it pulls out all of the HPO terms from that output, and then it mashes them all together and does an intersection operation between that and the phenotypes that you passed into the script. So for that script there’s literally no limit to the number of HPO terms you could pass into that script, because it’s never even sending them to the server anyway. It’s just grabbing all HPO terms from all papers that it queried for the variants, and then doing the filtering on HPO as a post-processing operation on your own machine. On the UI, I’m actually not sure what the limit is. We haven’t run into it yet, and on the API side it can be written in a way to where the phenotypes don’t even hit the server.
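The design Steve describes, where only variants are sent to the server and phenotypes are filtered locally, can be sketched as follows. The `fetch_article_terms` callable stands in for whatever API calls return the HPO terms indexed per article; it, the function name, and the sample data are assumptions for this example, and the real script in the cookbook repository may be organized differently.

```python
def match_phenotypes_locally(variants, input_hpo_terms, fetch_article_terms):
    """Intersect the patient's HPO terms with article terms on the client side.

    fetch_article_terms(variant) must return {pmid: [hpo_id, ...]}.
    It is the only function that talks to the server, and it only sees variants.
    """
    wanted = set(input_hpo_terms)
    matches = {}
    for variant in variants:
        for pmid, article_terms in fetch_article_terms(variant).items():
            overlap = wanted & set(article_terms)
            if overlap:
                matches.setdefault(variant, {})[pmid] = sorted(overlap)
    return matches


# Stand-in for the server, with made-up variants and PMIDs.
FAKE_INDEX = {
    "GENE1:c.100A>G": {"PMID-1": ["HP:0001250", "HP:0000252"], "PMID-2": ["HP:0001263"]},
    "GENE2:c.5C>T": {"PMID-3": ["HP:0001249"]},
}
requests_seen = []


def fake_fetch(variant):
    requests_seen.append(variant)  # record what would have left the machine
    return FAKE_INDEX.get(variant, {})


result = match_phenotypes_locally(list(FAKE_INDEX), ["HP:0001250", "HP:0001249"], fake_fetch)
print(result)
print(requests_seen)  # variants only; no HPO terms were sent
```

Since the intersection is a local set operation, the number of input HPO terms affects only the client’s memory, not the requests made.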

Mark: I will say one caveat to that is it’s probably not a technical concern like Steve was describing and more of a sensitivity/specificity concern. The more entries you use with ‘and’, the less and less likely you’re going to find all of those in any given article. So that’s something that you should be very mindful of. And if you toggle to an ‘or’ and you add a dozen, you’re gonna start to get nonspecific results. That’s why I said it’s to the user’s tolerance, and it bears some testing empirically to see what kind of results you get with what types of phenotypes.

Additional Questions Answered After the Webinar

Q: Is the API script you demonstrated shared?

A: Yes, we have several sample scripts that use the Mastermind API, written in Python, which are open source and available on GitHub here:

https://github.com/Genomenon/mastermind-api-cookbook/

The script demonstrated in the webinar is available directly here:

https://github.com/Genomenon/mastermind-api-cookbook/blob/master/variant_phenotype_evidence.py

Q: Looking at applications to weave this phenotype abstraction into broader decision support tools, are any ongoing partnerships doing this?

A: Mastermind is integrated in many third-party genome sequencing software platforms, which can be found on our Integrations Page. Many of these are now integrating the advanced features such as phenotypes.

Q: Does Mastermind also search non-published data on genes and variants, such as in ClinVar and ClinGen?

A: Mastermind has vastly more variants and more info per variant than ClinVar and ClinGen – including both references and association data. Moreover, any variant in ClinVar with reference citations is also in Mastermind, so Mastermind could be considered to be a super-set of ClinVar. We haven’t integrated the full complement of data from ClinVar or ClinGen (e.g. star ratings, variant calls), but these data are provided to clients as part of our Mastermind Genomic Landscapes.

Q: New user here. Is this demonstration showing us what’s happening in the background of the normal interface or a new interface? If so, is that included in the professional edition?

A: The AI and ML algorithms that prioritize and present data in Mastermind (what’s happening in the background) are the same for both the Basic and Professional editions, and the interface is also very similar. Some features of the demonstration, such as searching by phenotype, are Professional features. The Professional Edition is required if using Mastermind for clinical workflows or for more sophisticated research activities. Here is a link to the Mastermind Plan Comparison.

Q: Is there a webinar on how to best use user interface professional edition for variant interpretation (ACMG guidelines, etc)? If yes, please share a link to that webinar.

A: You can find demonstrations of seven different use cases for variant interpretation with Mastermind on our Tutorials Page. Filtering on ACMG/AMP criteria is included in the video “Use Case #1”. We also plan to hold a Mastermind Masterclass on this topic in June.

Q: For the script, can we do our own script on linux or / and python?

A: Yes! The Mastermind Base and Advanced APIs are RESTful JSON-formatted APIs, which can be used with any scripting language, such as Python, Ruby, JavaScript, Java, and many others, and on any operating system including Linux, Mac, or Windows.
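As a generic illustration of calling a RESTful JSON API from Python with only the standard library, the sketch below builds a GET request and parses a JSON body. The base URL, the `articles` endpoint, and the `gene` parameter are placeholders, and authentication is omitted; consult the Mastermind API documentation for the real endpoints and credentials.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_request(base_url, endpoint, params):
    """Assemble a GET request that asks for a JSON response."""
    url = f"{base_url.rstrip('/')}/{endpoint.lstrip('/')}?{urlencode(params)}"
    return Request(url, headers={"Accept": "application/json"})


def parse_body(raw_bytes):
    """Decode a JSON response body into Python objects."""
    return json.loads(raw_bytes.decode("utf-8"))


# Placeholder host, endpoint, and parameter: assumptions for this example only.
request = build_request("https://api.example.org/v1/", "/articles", {"gene": "STK11"})
print(request.full_url)

# A response would normally come from urllib.request.urlopen(request).read();
# a literal body is parsed here so the example runs without network access.
print(parse_body(b'{"article_count": 2}'))
```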

The script demonstrated in the webinar happens to be written in Python, and is open source and available on GitHub here:

https://github.com/Genomenon/mastermind-api-cookbook/blob/master/variant_phenotype_evidence.py

Q: If we have a conflict of predicting the effect of a variant mentioned for a variant of unknown significance (VUS) in ClinVar which is a causative gene for X disease, what should we do? What is the best methodology to adapt in such a situation?

A: In the event of a conflict of interpretation between ClinVar and the evidence in Mastermind that you have personally assessed, we strongly recommend that you take the evidence in Mastermind into account. ClinVar is missing a vast number of disease-causing and clinically actionable variants. And for variants that it does contain, it is missing a vast number of references citing any of these variants. In order to ensure the most accurate interpretation of a variant (and therefore the most accurate diagnosis of a patient), all of the available information from the empirical literature made available through Mastermind should be taken into consideration. If the information in Mastermind is itself conflicting (e.g. multiple reports making disparate claims about the pathogenicity of any given variant), this disparity should be reconciled by reviewing the primary evidence in detail. The variant scientist should come to his or her own conclusions based on that evidence.

Q: The explanation regarding article ranking (with the Mastermind Relevance Score) is very helpful. I suggest you consider adding a relevance “score” to each article so that the user can make a determination themselves of the relevance of the article, other than the rank order provided by the search results.

A: We have considered it. However, it’s difficult to provide an absolute score that would be useful to filter, or from which to make an absolute judgment of relevance, because the relative scoring varies greatly from one query to another.

In other words, while internal relevance scores allow sorting the article results of a given search query, they are not useful for comparing articles across different search queries. Therefore, it would be difficult if not impossible to determine any relevance score threshold that would work across searches.

For example, a single result may be highly relevant to a query that returns only that result (especially for something like rare disease diagnosis, where the article may be the needle in the haystack that makes a diagnosis possible) while still having a relatively low overall relevance score, whereas a search query that returns over 1,000 results may have over 500 results with a very high relevance score, which by definition would make most of those results less relevant (since a curator can’t possibly review that many articles), thus requiring additional filters to increase the specificity of the search.
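The example above can be played out with toy numbers in Python. The scores and the threshold are invented purely to show why a single cutoff behaves badly across queries; they are not real Mastermind relevance scores.

```python
def above_threshold(scores, threshold):
    """Keep the PMIDs whose score meets a fixed cutoff."""
    return [pmid for pmid, score in scores.items() if score >= threshold]


# Made-up scores: one rare-disease query with a single crucial result,
# and one broad query where half of 1,000 results score very high.
rare_query = {"PMID-NEEDLE": 0.2}
broad_query = {f"PMID-{i}": (0.95 if i < 500 else 0.1) for i in range(1000)}

THRESHOLD = 0.5
print(above_threshold(rare_query, THRESHOLD))        # the needle is filtered out
print(len(above_threshold(broad_query, THRESHOLD)))  # still far too many to review
```

Rank order within each query remains informative in both cases, which is why the results are presented as a sorted list.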

We can still imagine scenarios where an article-specific relevance score could be useful, though, so it’s still something we’re keeping under consideration.

Q: Is it possible to filter data by population criteria?

A: We don’t have population criteria currently built into the Mastermind user interface or API, but this is something we provide as part of our Mastermind Genomic Landscapes.

Transcription

Candace: Hello everyone and welcome to our first Mastermind Masterclass! My name is Candace Chapman, VP of Marketing here at Genomenon. Thank you for joining us for today’s live session. Wherever you are, we hope you’re safe and healthy and wearing pajama pants!

This Masterclass will focus on one of the latest features added to the Mastermind Genomic Search Engine which is the ability to search by phenotype. This is a powerful feature that even during a rehearsal this week we were exploring more applications for and I can’t wait for you to see what can be done with it. Searching by phenotype is a Professional Edition feature so if you currently have the Basic Edition some of this may not look familiar to you but during the demonstration you’ll see many features that you may not be aware are available in Mastermind. This is a great time to get reacquainted with the full functionality of Mastermind and see if you and your organization can benefit from Mastermind Professional Edition. It’s our goal to make Mastermind the first resource you use when interpreting variants and I hope that this deep dive Masterclass will help you get the most out of it. Let me cover some housekeeping items before we get started. You can submit questions in the question box at any time during our time together and we’ll have a Q&A session with the founders after the demo.

I’m really excited about this webinar because we’ve gathered not one, but two of our amazing founders to discuss Mastermind in depth. We also have an MVP from the data science team to walk you through the demo. Introductions are usually pretty boring so I thought we’d change it up a bit and I asked our presenters to share something interesting that you might not know about them.

Our first presenter is Dr. Mark Kiel, co-founder and Chief Science Officer. A paper that Mark authored disproved the immortal strand hypothesis, so that’s pretty cool – you can google it! Mark will be sharing a brief Mastermind scientific update. Next up is Steve Schwartz, co-founder and Chief Technology Officer. Steve has a competition racing license and he races when he has the time which is almost never since the birth of his two daughters but I think I know his priorities! Steve will be sharing a brief Mastermind technology update and then we have our ‘S.H.I.E.L.D.’ or our Special Handler for Information and Essential Logistics Development, Diane Nefcy. Diane tends her own tree farm with her husband and her toothless cat! Here’s a few more interesting facts about these guys: What are the odds that both Mark and Diane have identical twins? And Mark named his children after characters from Greek mythology, Diane named her pets after characters from Greek mythology, but not to be outdone Steve claims that he has books about Greek mythology! But enough about the introductions, let’s welcome Mark for the scientific update. Take it away, Mark!

Mark: Hey thanks Candace, and thanks everybody in attendance for joining us for what is the first in a series of Mastermind Masterclasses. I’m going to give a little bit of a nostalgic view back in time of where Genomenon started. I’ll spend two or three minutes on that, and then I’ll pass it over to Steve and Diane to go over the practical aspects of what we wanted to showcase here on the webinar. Many of you may not know, but Genomenon, which in Greek actually means born out of need, was started about seven years ago when I was an MD PhD frustrated that I was constantly needing to go out to Google to get the genomic information I needed to either diagnose my patient caseload or otherwise to complete the discovery work that I was doing in the research lab. Since that time we’ve accumulated a very robust and increasingly large user base that numbers around 7,000 across a hundred and more countries and continues to grow at a near exponential pace. Mastermind, the software that we’re going to be discussing today, is increasingly becoming the go-to resource for genomic evidence for both clinical workflows and for making pharma discoveries with some of the work that we’re doing with pharmaceutical companies and biotech companies. This ability to grow the user base and to become the dominant force in genomic sequencing interpretation is due in part to the advances that we’re making such as the one that we’re going to showcase today: the addition of the human phenotype ontology or phenotypic terms which are particularly valuable in solving otherwise mysterious rare disease cases for patients for whom there is sequencing information.

As Candace alluded to, I’m very happy to be sharing the spotlight with my co-founder and friend, our CTO Steve Schwartz, and one of our earliest employees and somebody who I don’t think gets enough credit for in many ways keeping Genomenon running smoothly on a day to day basis, Diane Nefcy. I’m very happy to be on the webinar with both of them. Steve is gonna walk through the technology at a high level that is driving this new development – this new feature that we’re adding into Mastermind – and Diane is gonna walk through several use cases. A couple will be in the user interface and a couple also in the API and then Steve is going to round out the conversation by showcasing some of the powerful capabilities and value that can be elicited through maximal use of the API through some scripting. Without further ado I’ll pass it over to Steve and I’ll be rejoining the group here during the Q&A session toward the end of the webinar. Steve?

Steve: As Mark said, I’m going to focus on the high level of the technology and try to make it not too boring for everyone! The technology behind Mastermind really empowers us to provide two services to our users. The first is organizing and indexing the entirety of the medical evidence, making it easily searchable with the idea being that it gets the most information to you the quickest so that it can get out of your way and you can continue – if you’re on the clinical side making a diagnosis or if you’re on the pharma side doing research. The second is identifying every genomic Association that’s supported by that medical evidence. The genomic Associations that I’m talking about stem from searching the entirety and indexing the entirety of the medical evidence for genomic entities – what we call genomic entities – so those are genes, variants, diseases, phenotypes, which we’re talking about today, as well as categories such as ACMG and AMP criteria and therapies which will come at a later date. The idea here is that when we find every one of those entities throughout the medical evidence we can then identify the Associations between any entity and any other entity in that graph, and from those Associations is how you can start to identify hypotheses and move those hypotheses into insight.

A standard use case might be a patient that has a disease and you need to find a therapy. That’s great, but any hypothesis or any Association directly between disease and therapy really is just a hypothesis until you can start to identify and understand the causative link between that disease and that therapy. In the context of genomics, that evidence tends to be genes, the function of genes, and the variants within those genes. You can see here that every time we can add an entity into that causative link, we add a piece to the puzzle and we understand that connection much better. The strongest insight that you can find is where you can connect, for example, a disease to a therapy through a specific gene and variant, categorized by phenotypes and categories, and so that’s what we’re really enabling our users to do here by adding phenotypes to our ontological structure and starting to draw those connections. An example of how you might do that in a clinical context: you have a patient that has three to five thousand variants that you’ve managed to filter down from a whole exome screening, and you have a set of phenotypes on the right. What you can start to do with our search engine, or more powerfully with our advanced API, is iterate through every possible Association, not just between one variant and one phenotype, but between any and all of the variants that the patient has and any and all of the phenotypes that they have. You can start drawing those connections much more easily, using the medical evidence as the basis for informing those Associations, so that you have strong evidence that you can then put into a report and base conclusions on.

In this example we’re using variants, connecting them to the phenotypes from the medical evidence, and then in those same papers we can look further and see what diseases or what therapies those papers also talk about, to help automate the process of diagnosing a patient or generating a case report. Or again, on the pharma side, starting to automate the process of drug discovery and drug development. With that I will now turn it over to Diane, who can show you how this phenotype search actually looks in the API and the user interface.
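
To make the "any and all variants against any and all phenotypes" idea concrete, here is a minimal Python sketch. It is not Genomenon’s script: the `article_count` callable stands in for whatever search or API lookup you use, and the variant names, phenotype names, and counts are made-up placeholders.

```python
from itertools import product


def candidate_associations(variants, phenotypes, article_count):
    """Return (variant, phenotype, count) for every pair with literature support.

    article_count is a caller-supplied function (hypothetical here) that
    returns how many articles mention both the variant and the phenotype.
    """
    hits = []
    # Visit every variant/phenotype combination, not just one chosen pair.
    for variant, phenotype in product(variants, phenotypes):
        count = article_count(variant, phenotype)
        if count > 0:
            hits.append((variant, phenotype, count))
    # Put the pairs with the most supporting articles first.
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits


# Made-up counts standing in for real search results.
_FAKE_COUNTS = {
    ("GENE1:p.A1B", "phenotype A"): 12,
    ("GENE2:p.C2D", "phenotype B"): 3,
}


def fake_article_count(variant, phenotype):
    return _FAKE_COUNTS.get((variant, phenotype), 0)


results = candidate_associations(
    ["GENE1:p.A1B", "GENE2:p.C2D"],
    ["phenotype A", "phenotype B"],
    fake_article_count,
)
```

With thousands of variants the number of pairs grows quickly, which is one reason this kind of sweep is better suited to a script than to manual searching.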

Diane: Excellent. Thank you, Steve. While I am working on sharing my screen, I would just like to remind you that not all of the features that you’ll see today are available to Basic accounts. If you would like to test the features but you don’t have access to them, just send us an email and we’d be happy to help.

Today I’m going to walk you through some Mastermind user interface examples that will allow you to discover a set of genes or diseases associated with your patient’s phenotype or even a combination of phenotypes. In this first example, say your patient has a phenotype that you suspect has an underlying genetic cause, and you’d like to find a candidate gene for testing, and not just a candidate gene but literature evidence for it. In this first example we have hyperpigmentation of the skin. You can see with this drop-down list that you not only choose the phenotype, you also get the typo-free phenotype. The other one that your patient has in this example is hamartomatous polyposis, and right here we see the phenotype and the proper canonical HPO term, and we have both of these together in our patient. We’ll click search and right away you can see in our results that the STK11 gene is the first hit on the list with 81 articles, and that is expected because these are symptoms of Peutz-Jeghers syndrome.

In our next search we will show you that you can search with either one of two phenotypes or both of them combined. Your patient, for example, has a family with a variety of phenotypes. The first one is ectopia lentis, and the second phenotype we have here is arachnodactyly, but not all of the people in the family have them, so we will click the ‘or’ to toggle between ‘and’ and ‘or’. As expected, the genes on the list are organized with the fibrillins and elastins first. We can use this interface to look more deeply at the titles and abstracts of the papers here.
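
One way to think about the ‘and’/‘or’ toggle is as set intersection versus set union over the articles matched by each phenotype. The Python sketch below only illustrates that idea; the webinar does not describe how Mastermind implements the toggle, and the article IDs are placeholders.

```python
def combine(article_sets, mode="and"):
    """Combine per-phenotype article sets.

    mode="and" keeps articles that mention every phenotype (intersection);
    mode="or" keeps articles that mention at least one (union).
    """
    sets = [set(s) for s in article_sets]
    if not sets:
        return set()
    if mode == "and":
        return set.intersection(*sets)
    if mode == "or":
        return set.union(*sets)
    raise ValueError("mode must be 'and' or 'or'")


# Placeholder article IDs for two phenotypes.
ectopia_lentis_articles = {1, 2}
arachnodactyly_articles = {2, 3}

both = combine([ectopia_lentis_articles, arachnodactyly_articles], mode="and")
either = combine([ectopia_lentis_articles, arachnodactyly_articles], mode="or")
```

Under this reading, ‘or’ is the natural choice when not every family member shows every phenotype, because it does not require a single paper to mention both.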

The default sort is by relevance and if you click on any paper, you’ll see that by default we have genes and variants highlighted, but you can switch to phenotypes and you can see that in the full text anytime that there’s a phenotype mentioned, you’ll see that highlighted and a sense proof is shown to you.

One more example we have for you today is something that would help you maximize specificity for your patient. Say you have a patient with hyperphagia. You’ll see that not only does it correct spelling for you, it also gives you the proper canonical HPO term. So we have polyphagia here; however, as you can guess, we have a lot of articles that may or may not help you for your specific patient, but in this case you know that your client has some sort of chromosomal aberration at the 15q locus. If you enter that in as a text term and search for it, this will bring up all the articles that mention 15q in the full text, and this brings up all the genes associated with the Prader-Willi locus. If you click on them we can see that it brings up papers for Angelman syndrome and for Prader-Willi.

Now I’m going to show you some API examples. I just showed you how a human would do it, and this is sort of a demonstration of how you would automate this. Just like this last example, if we fetch our API token and our HPO term of interest is hyperphagia, we submit that query. You can see that all of these outputs are labeled and we have the proper HPO term. We can take that and go to our disease endpoint, plug the HPO term in, and you can see that the API token carries over. This will give you the output of all diseases associated with that HPO term. This is just page one; we can flip through to page two, and here on page three we have Prader-Willi syndrome. If you want to find articles associated with that, we’ll highlight that and go to the articles endpoint. I’ll plug that in, and in our suggestions endpoint we can grab our HPO title, go back to articles, put that in, and here we go! We have a full count of how many articles are associated with it. You can scroll down and see all this information here, but if we’re only interested in one specific paper, we can take that, go to our article info endpoint, plug it in, and not only would we see the title and the abstract but, as we scroll down, everything that was discovered in the full text will be shown: HPO terms, and all the way down here at the bottom we’ve got all of the genes that were found in this paper as well as the gene variants. You can see this is quite a meaty paper.

Mark: Diane, it’s likely that there aren’t any in that paper, but that’s exactly right: the article info endpoint showcases all of the indexing results for that paper, meaning the phenotypes, genes, and variants, and the title and abstract that are associated with it.
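
The sequence Diane clicked through (resolve the HPO term, list diseases, find articles, pull one article’s details) can be chained in a few lines. The Python sketch below is only an outline under stated assumptions: the endpoint names, parameter names, and response fields are placeholders rather than the documented Mastermind API, and the `fetch` function is supplied by the caller (here a fake one with canned data), so nothing touches the network.

```python
def walk_phenotype(phenotype_text, disease_name, fetch):
    """Chain the four lookups from the demo using a caller-supplied fetch().

    fetch(endpoint, params) must return parsed JSON. The endpoint names,
    parameter names, and response fields used below are placeholders.
    """
    # 1. Resolve free text to a canonical HPO term.
    hpo_term = fetch("suggestions", {"hpo": phenotype_text})[0]["canonical"]
    # 2. List the diseases associated with that term.
    diseases = fetch("diseases", {"hpo": hpo_term})["diseases"]
    if disease_name not in diseases:
        return None
    # 3. Find articles that mention both the disease and the phenotype.
    articles = fetch("articles", {"disease": disease_name, "hpo": hpo_term})
    # 4. Pull the full indexing result for the first article.
    first_pmid = articles["pmids"][0]
    return fetch("article_info", {"pmid": first_pmid})


def fake_fetch(endpoint, params):
    """Canned responses standing in for real API calls."""
    canned = {
        "suggestions": [{"canonical": "Polyphagia"}],
        "diseases": {"diseases": ["Prader-Willi syndrome"]},
        "articles": {"pmids": ["PMID-PLACEHOLDER-1"]},
        "article_info": {"pmid": params.get("pmid"), "hpo_terms": ["Polyphagia"]},
    }
    return canned[endpoint]


info = walk_phenotype("hyperphagia", "Prader-Willi syndrome", fake_fetch)
```

Passing `fetch` in as an argument keeps the chaining logic separate from the HTTP details, so the real request code (including the API token) can be swapped in once you have checked the actual endpoint documentation.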

Diane: So you can imagine doing this as a single query, and all of these steps would take a fraction of a second to complete. If you do this in bulk, that would be even more powerful. I’ll show you how to discover live literature evidence that provides a link between your client’s phenotypes and genetic variations. As before, I already have an example available for you: say we suspect the NOTCH gene and your client has one specific disease. You can search and you will find this very handy variant diagram plot that will help you discover just where in the gene you would search for any suspected variants. If you keep the same gene and choose a different syndrome, you can see that the variant plot is very different.

We’re gonna go through one more API example using Hajdu-Cheney syndrome. Say you want to find out the HPO term for acroosteolysis, which is a symptom of Hajdu-Cheney. You will find that this one in particular, phalanges of the hand, is one symptom of Hajdu-Cheney, so you will copy that, and we’ll go to our variants endpoint, put in our HPO ID and the suspected gene, submit that query, and we have a handy list of all relevant variations, including the article count. If you’re interested in those articles in particular, we will go to the articles endpoint, put in our variant of interest, grab that HPO term, and we can see a comprehensive list of all eight articles associated with the variant of interest and your client’s disease. As before, if there are extra pages, we can flip through those. The Mastermind interface is great for a visual overview of these features, and the really powerful features of the API allow you to do all of those things in bulk. With that I’ll hand it over to Steve, and he might give you a more in-depth explanation of our API.
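
Flipping through result pages by hand, as in the demos above, is easy to automate. Here is a minimal Python sketch of that loop. It assumes each response reports its items and a total page count under the field names `items` and `pages`; those names are assumptions for illustration, not the documented response format, and `fetch_page` is a caller-supplied function.

```python
def fetch_all_pages(fetch_page):
    """Collect items from every page of a paginated result.

    fetch_page(page_number) is a caller-supplied function returning a dict
    with 'items' (a list) and 'pages' (total page count). Both field names
    are assumptions made for this sketch.
    """
    items = []
    page = 1
    while True:
        response = fetch_page(page)
        items.extend(response["items"])
        # Stop once the last reported page has been read.
        if page >= response["pages"]:
            break
        page += 1
    return items


# Canned two-page result standing in for a real endpoint.
_FAKE_PAGES = {1: ["article-a", "article-b"], 2: ["article-c"]}


def fake_fetch_page(page):
    return {"items": _FAKE_PAGES[page], "pages": len(_FAKE_PAGES)}


all_items = fetch_all_pages(fake_fetch_page)
```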

Steve: Awesome, thank you, Diane! What you just saw was a demonstration of the UI and the individual API endpoints. As Diane alluded to, the user interface is extremely useful when you’re doing sort of ad hoc investigation or discovery work, trying to figure out what exactly it is you’re looking for. Once you figure that out, the API makes it much easier and much faster to start automating that process across use cases. You can integrate it into your existing pipeline and you can really start to do much more in-depth analysis that isn’t very practical or realistic for a human to do in real time. One example of that, as I showed earlier with a graphical representation, is analyzing a lot of variants and a lot of phenotypes at one time, but again, going back to the Association visualization that I showed, there are a lot of different entities that you can analyze in order to start to automate the production of that output. What I am going to show you here real quick is an example script that we put together that looks at a collection of variants from a given patient with a list of phenotypes for that patient, and does an aggregate analysis across all of the medical literature to start to find any multiples of variants or phenotypes within that patient that tend to be co-cited within the medical literature – meaning papers that talk about more than one of those variants with more than one of those phenotypes at the same time. This is a technical demonstration of an API, which doesn’t tend to be extremely exciting, but I will try to make it as interesting as I can. What we have here is a script that I’ve written that takes a list of variants as input and a list of phenotypes as input. In fact, let me show what those lists look like real quick. I just have a few sample variants that I’ve put together here that look like this.
You can see it’s just a collection of several variants across several different genes. These aren’t necessarily related to each other; it’s just a bunch of random variants that I threw into a file. You can see that the script is easily able to accept the variants in whatever format you have, whether it’s protein nomenclature or cDNA; it can also handle rsIDs, genomic coordinates, and IVS nomenclature. I also have a set of phenotypes made up for this patient as well. What this script does is take that list of variants that I just showed you, as well as that list of phenotypes, and start analyzing them in aggregate. The first thing it does is find the canonical HPO terms that match each of those phenotypes in the list, then it will search through each variant finding every article.

Mark: Maybe, well, while he’s rejoining – I think he’ll be able to come back, and hopefully he’ll get a screen that illuminates the results of that script – I’ll speak from a clinical perspective to the benefit of that script. Diane alluded to it while she was showcasing the web interface and how you can iterate through those queries to amass all of the insight and information that you’re seeking around your clinical case. What Steve is highlighting is the ability to do that all at once for any given patient’s sequencing data, with all of those results, once you’ve configured the API calls and the way that the data gets output. Steve is gonna hopefully show his screen and showcase the fruits of that automated workflow to give you immediate insight. No matter what phenotypes are input, no matter what the clinical circumstance, no matter what the list of variants, the script that Steve was running, which is built on the APIs, I like to say automagically assembles all that data and prioritizes the insights that you need to very quickly, accurately, and sensitively make the appropriate diagnosis for your patient based on the information content throughout the entirety of Mastermind. Steve, thanks for jumping back on there. This is the output screen that I was suggesting he would show, and again, this is particular to the example that Steve ran and the way he configured the script, but it should be apparent to you how valuable this collected information is for prioritizing the results and, again, getting immediate and keen insight into the nature of your patient’s disease based on the clinical circumstance. Steve, if you’re on again, I’ll turn it back over to you.

Steve: Thanks Mark, thanks for rescuing me. This is an example of the output from the script that I was just showing. In this case it’s found three different phenotypes that have been co-cited in three different articles matching two different input variants. You can see the three phenotypes that it found, the articles are these three PMIDs, and these are the two variants that all three articles mentioned. Then, to start to help you automate the process of finding a diagnosis, it will also list out the diseases that were cited in those three articles. If you scroll down, here’s another output where it found two different phenotypes within this large set of PMIDs also co-citing these two variants, and here’s the larger set of diseases found in those, and then you can keep going down and see the rest of the Associations. This is just the example input that I had. We’ve run this for many more variants – between three and five thousand variants and 15 to 20 phenotypes per patient – and it can still automate the production of this kind of output. It’s an API call, so in this case I happen to have it outputting in an indented file that makes it a little easier for me to demonstrate here the kind of information that it’s outputting, but it can also output in any form that you would like, such as CSV, a VCF file, or even just directly putting that data into your own pipeline’s database for you to interpolate the results into your own pipeline, your own user interface for your variant curators, or for your research team. This kind of shows the power of the API and the kinds of things that you can start to do once you automate the multi-faceted Association identification between genomic entities for a given patient, or again for drug discovery or any number of other use cases.
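
The co-citation grouping Steve describes can be sketched in a few lines of Python. This is not the script from the demo, which was not shown in full; it is a minimal reconstruction of the idea, assuming you have already collected, for each variant and each phenotype, the set of article IDs that mention it. The variant names, phenotype names, and article IDs below are placeholders.

```python
from collections import defaultdict


def co_citations(variant_articles, phenotype_articles,
                 min_variants=2, min_phenotypes=2):
    """Group articles by the exact set of input variants and phenotypes they cite.

    Both arguments map a name to the set of article IDs mentioning it.
    Only articles citing at least min_variants variants and at least
    min_phenotypes phenotypes are kept.
    """
    # Invert the mappings: article ID -> variants / phenotypes it mentions.
    variants_by_article = defaultdict(set)
    for variant, pmids in variant_articles.items():
        for pmid in pmids:
            variants_by_article[pmid].add(variant)
    phenotypes_by_article = defaultdict(set)
    for phenotype, pmids in phenotype_articles.items():
        for pmid in pmids:
            phenotypes_by_article[pmid].add(phenotype)

    # Articles sharing the same variant set and phenotype set form one group.
    groups = defaultdict(set)
    for pmid, variants in variants_by_article.items():
        phenotypes = phenotypes_by_article.get(pmid, set())
        if len(variants) >= min_variants and len(phenotypes) >= min_phenotypes:
            groups[(frozenset(variants), frozenset(phenotypes))].add(pmid)
    return dict(groups)


# Placeholder inputs: two variants and three phenotypes share three articles.
variant_articles = {
    "V1": {"a1", "a2", "a3"},
    "V2": {"a1", "a2", "a3", "a4"},
    "V3": {"a5"},
}
phenotype_articles = {
    "P1": {"a1", "a2", "a3"},
    "P2": {"a1", "a2", "a3", "a4"},
    "P3": {"a1", "a2", "a3"},
}

groups = co_citations(variant_articles, phenotype_articles)
```

Each group key pairs a set of variants with a set of phenotypes, and the value is the set of articles citing all of them, which mirrors the shape of the output on Steve’s screen. From there the groups could be written out as CSV or loaded into a pipeline database.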