The recent pace of genomic research and the great burden of historical content in the genetic literature have made the manual collection of all genetic discoveries, new and old, from the medical literature impractical.
Indexing scholarly papers in bulk is therefore an inevitable next step for genomic medicine if it is to capture all of the evidence from the literature needed to make informed variant curation decisions.
Genomenon has risen to this challenge by creating a novel and powerful search technique that has, to date, indexed over 5 million genomic articles to identify disease, gene, and variant relationships in the medical literature. All of this information is accessible through the Mastermind Genomic Search Engine. However, the many ways in which authors can describe genetic variants make it challenging to devise effective searches for the content most likely to be useful for making diagnostic decisions.
In this blog post, we will clarify how Mastermind continues to address this issue of diversity of variant nomenclature to maximize the sensitivity of the variant search space and to provide our users with the best opportunity to identify high-yield content to inform their curation efforts.
Our goal for creating the Mastermind variant search space was to expose all relevant information available for a given variant or set of related variants with a single search query. Our initial focus was on indexing the literature for functionally consequential genetic variants and creating an appropriate search methodology for protein-coding changes.
To this end, we have developed a protein nomenclature search method derived from standard HGVS nomenclature. The HGVS format was abbreviated and optimized explicitly for speed and sensitivity, both for downstream machine learning and automated curation internally and for manual review of data by Mastermind users. While Mastermind does provide c.DNA nomenclature for each identified protein variant, and permits this search in the Mastermind user interface when “c.” is appended in the “Filter by variant” field, c.DNA nomenclature remains secondary to Mastermind’s protein nomenclature because of the ambiguity it introduces when using automated indexing techniques or assessing variants across multiple transcripts. As such, API calls and directly modified URL links use the Mastermind protein nomenclature search space described below.
Mastermind protein nomenclature search space
- Missense Variants: All Mastermind variants use the uppercase, single-character amino acid abbreviation. Missense variants in Mastermind therefore always follow the format of “A123B”, with the prefix “p.” implied from HGVS nomenclature removed.
- Nonsense Variants: Nonsense mutations follow the same convention as missense variants and are all standardized to use the “X” abbreviation for the stop codon. They will uniformly appear as “A123X”.
- Frameshift Variants and indels: Mastermind uses a condensed version of HGVS nomenclature for describing frameshift variants. Whereas the HGVS format is optimized for exact descriptions of the specific variants observed in a study, Mastermind results are meant to display all variants affecting a given residue with the same functional consequence, irrespective of the specific genetic change. This is especially true of frameshift variants, which can result from any one of a number of genetic changes that all have essentially the same protein-coding consequence. The format of these variants is therefore “A123xyz”, where “A123” describes the protein residue affected and “xyz” can be any of: “del”, “ins”, “delins”, “fs”, “dup”, or “inv”. Note that Mastermind preserves the nomenclature assigned by the original authors, so if a variant is described as a “del” but results in a frameshift, the authors’ original designation is preserved.
- Non-coding changes: UTR, splice and intronic changes are currently identified and stored in Mastermind, but the ability to search and filter by these nomenclatures is not yet available in the Mastermind software interface or API calls. Our development team is finalizing these efforts and we hope to launch updates with this capability in the near future.
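To make the conventions above concrete, here is a minimal Python sketch that converts a single-residue HGVS protein description into the condensed form described in this list. It is an illustration only and not Mastermind's actual implementation: the function name `to_search_key`, the error handling, and the decision to reject ranges and synonymous variants are all our own assumptions.

```python
import re

# Three-letter to one-letter amino acid codes; "Ter" and "*" both become "X".
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
    "Ter": "X", "*": "X",
}
ONE_LETTER = set(THREE_TO_ONE.values())
# "delins" is listed before "del" so that the longer keyword wins.
SUFFIXES = ("delins", "del", "ins", "dup", "inv")
_VARIANT = re.compile(r"^([A-Z][a-z]{2}|[A-Z*])(\d+)(.+)$")


def _one_letter(code):
    """Return the one-letter code for an amino acid token, or None."""
    if code in THREE_TO_ONE:
        return THREE_TO_ONE[code]
    return code if code in ONE_LETTER else None


def to_search_key(hgvs_p):
    """Hypothetical helper: 'p.Arg123Trp' -> 'R123W', 'p.Arg123*' -> 'R123X'."""
    text = hgvs_p.strip()
    if text.startswith("p."):
        text = text[2:]          # the "p." prefix is implied, so drop it
    text = text.strip("()")      # drop the parentheses of predicted changes
    match = _VARIANT.match(text)
    if match is None:
        raise ValueError(f"unsupported description: {hgvs_p!r}")
    ref, position, rest = match.groups()
    ref_aa = _one_letter(ref)
    if ref_aa is None:
        raise ValueError(f"unknown amino acid: {ref!r}")
    alt_aa = _one_letter(rest)
    if alt_aa is not None:       # missense or nonsense: "A123B" / "A123X"
        return f"{ref_aa}{position}{alt_aa}"
    if "fs" in rest:             # any frameshift collapses to "A123fs"
        return f"{ref_aa}{position}fs"
    for suffix in SUFFIXES:      # otherwise keep the authors' own keyword
        if rest.startswith(suffix):
            return f"{ref_aa}{position}{suffix}"
    raise ValueError(f"unsupported description: {hgvs_p!r}")


print(to_search_key("p.Arg123Trp"))        # missense
print(to_search_key("p.Arg123GlyfsTer5"))  # frameshift
```

Note that a description written as a deletion stays a “del” in this sketch even if it would cause a frameshift, mirroring the rule that the authors' original wording is preserved.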
It is important to note that Mastermind identifies a protein variant in an article only if two conditions are satisfied: the gene symbol (or a synonym for that gene) is found anywhere in the full text, and the appropriate amino acid residue and residue number are found adjacent to each other in the variant’s description in the text.
Authors seldom state which transcript of a gene they are describing. If two transcripts for a given gene have the same amino acid at the residue number, the variant from the text is therefore mapped to both transcripts, because the paper is ambiguous and Mastermind seeks to maximize the sensitivity of search results over specificity. This ambiguity occurs when one of two conditions is met. First, two transcripts may share the same initial exons, where the variant is located, and differ only in the exons included downstream; this is biologically the same variant and is correctly mapped to both. Second, two transcripts may coincidentally share the same amino acid at a given residue number, a situation that arises approximately 5% of the time and only for genes with multiple transcripts that diverge upstream of the residue number for the variant in question.
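The sensitivity-first mapping described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mastermind's code: the function `candidate_transcripts` is hypothetical, and the transcript names and protein sequences are toy data invented for the example.

```python
def candidate_transcripts(ref_aa, position, proteins):
    """Return every transcript whose protein has ref_aa at the 1-based position.

    `proteins` maps a transcript identifier to its protein sequence in
    one-letter codes. All matches are kept, favoring sensitivity.
    """
    matches = []
    for transcript_id, sequence in proteins.items():
        in_range = 1 <= position <= len(sequence)
        if in_range and sequence[position - 1] == ref_aa:
            matches.append(transcript_id)
    return matches


# Toy sequences, invented for illustration only.
toy_proteins = {
    "tx1": "MKTAYIAKQR",    # reference isoform
    "tx2": "MKTAYIAKQRLL",  # shares its first exons with tx1
    "tx3": "MSLAYGGHW",     # diverges early, but has Y at residue 5 by chance
    "tx4": "MGGHWPLNE",     # different residue at position 5
    "tx5": "MKT",           # too short to contain residue 5
}

print(candidate_transcripts("Y", 5, toy_proteins))
```

Here a variant reported at Y5 is mapped to `tx1` and `tx2` (the biologically identical case) and also to `tx3` (the coincidental case), while `tx4` and `tx5` are excluded.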
Mastermind also identifies variants in the full text described at the c.DNA level and maps these positions onto the appropriate protein residue for that gene. The challenges in mapping the variant to the correct gene and transcript are magnified for variants described at the c.DNA level. Since the output for Mastermind is meant to enhance the sensitivity of a search, all possible mappings are applied to ensure none are missed. Reviewing the context of a variant’s description in the full text is required to fully disambiguate the variant, and even then disambiguation is often not possible given the inadequacy of an author’s description.
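For readers less familiar with the arithmetic involved, the relationship between a coding position and a protein residue can be sketched as follows. This is the standard codon calculation rather than a description of Mastermind's pipeline; the function names are hypothetical, and the sketch assumes simple substitutions at plain coding positions (counted from the A of the start codon), so intronic offsets and UTR positions are rejected.

```python
import re

# Matches only simple substitutions such as "c.367C>T".
_SUBSTITUTION = re.compile(r"^c\.(\d+)[ACGT]>[ACGT]$")


def residue_for_coding_position(c_pos):
    """Map a 1-based coding position to (residue number, position in codon)."""
    if c_pos < 1:
        raise ValueError("coding positions start at 1")
    residue = (c_pos - 1) // 3 + 1      # three nucleotides per codon
    codon_offset = (c_pos - 1) % 3 + 1  # 1, 2, or 3 within that codon
    return residue, codon_offset


def residue_for_substitution(c_variant):
    """Hypothetical helper: 'c.367C>T' -> (123, 1)."""
    match = _SUBSTITUTION.match(c_variant.strip())
    if match is None:
        raise ValueError(f"unsupported description: {c_variant!r}")
    return residue_for_coding_position(int(match.group(1)))


print(residue_for_substitution("c.367C>T"))
```

Because the same coding position can fall in different codons, or outside the coding sequence entirely, depending on the transcript, a calculation like this has to be repeated for each candidate transcript, which is why the ambiguity is greater at the c.DNA level.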
The available body of genomic information grows prodigiously, demanding that the tools to aggregate this information be more flexible. Whereas HGVS nomenclature format excels at precisely describing genetic variants, Mastermind gene nomenclature for searching genetic variants seeks to maximize the sensitivity of search results to enable users to see all the evidence from the literature that may inform their variant curation needs.
We will be expanding these efforts in the coming months to add enhanced disambiguation capability and content prioritization, and we will continue to keep our users apprised of these developments through this blog. Feedback on these matters is, of course, greatly appreciated!