MARK: Hello, everyone, and welcome to today's webinar, From the Lab to the Clinic: Where Generative AI for Biomedicine Holds Up, and Where it Breaks Down. My name is Mark. I'm the Chief Science Officer and Founder at Genomenon, and I'll be your host for today's proceedings. Today, we're going to take a clear-eyed look at generative AI in biomedicine — where it genuinely accelerates work across biomedical research and the drug development cycle, and where it can introduce risk, uncertainty, and false confidence.
I think everyone will agree that the conversation around genAI right now is polarized. On the one side, there's real optimism for the speed and scale and new capabilities that it can bring to bear for research. On the other, there's a healthy skepticism about what these AI models can truly prove, how they handle evidence, and what it means to validate an insight before it influences high-stakes drug development decisions. So this session is designed to put those two perspectives into dialogue and pressure test a practical question: What can and what can't genAI do to responsibly support biomedical research and clinical decision making?
Before we begin, a few housekeeping notes. First, this webinar is being recorded; the recording will be shared with all registered attendees and made available on our website. Throughout the session, I'll be looking out for questions from the audience, so please submit any questions in the Q&A panel of the webinar window at any time, and I'll be fielding those and incorporating them into the conversation as we proceed. You'll also see a few poll questions pop up during the session; please take a moment to respond. And before we wrap, we'll share a brief optional survey soliciting your feedback, which helps us shape future webinars.
So before we jump into the content, I'll spend a few moments introducing Genomenon and introducing our speaker. First, who is Genomenon? Genomenon is the first literature-derived real-world evidence company applying AI to the biomedical literature to power precision medicine. We are a mission-driven company, and our mission is to save and improve lives by making that biomedical information actionable. That means helping develop precision therapeutics that target the molecular drivers of disease, particularly in rare disease and precision oncology, and it also means helping diagnose patients suffering from those genetic diseases and cancers by using our AI capability to identify the causative genetic variants for those diseases. So, Genomenon truly sits at the center of AI-driven big data, real-world evidence, and genomics.
Just by way of education, many of you, as researchers, as drug developers, as diagnosticians, understand the value of the literature, but few truly appreciate the scale and detail that underpins that literature when it comes to real-world evidence at the patient level. This graphic captures the types of patients that are found in the medical literature, from the case reports to the pedigrees, the families and the patients and their affected relatives, to the case series and larger patient cohorts. All of those different types of studies capture real-world evidence as it's experienced in patient lives. That real-world evidence takes the form of demographic data (the sex of the patient, the ethnicity, the age at onset, etc.), the phenotypes that they experience and how they present in the clinic, the lab values, the chemistries, the radiology studies, as well as the genetic data and biomarkers, and importantly, the treatments that those patients receive and their outcomes. So, altogether, that's a distillation of over two trillion dollars of research and clinical inquiry across many millions of patients' lives that are locked deep in that scientific and biomedical literature.
Genomenon's approach to extracting that information marries two different capabilities. One is a sophisticated computational capability built around generative AI; the other is expert human curation. That curation matters particularly for downstream purposes: to support diagnostic criteria in the hands of clinicians, to support regulatory submissions in drug development, to foster publication-quality data, and to put that data in the hands of clinicians within and outside of a diagnostic context.
A tiny window into what Genomenon's AI capability looks like — I mentioned the great value and data that's locked in the clinical literature. Our Genomenon Genomic Graph capability, or G3, is able to find those entities: the genetic variants, the diseases, the phenotypes, the real-world evidence that underpins the cataloging of those patients' lives. We find that information, and then we can put it together in a semantic knowledge graph to help understand how those terms are interrelated, better characterizing those terms and organizing the data for downstream use, whether it goes directly to our pharma and clinical clients, or to our curators, who now number nearly a hundred highly trained biomedical curators at the Master's level and above, to make sure that the data is accurate and fit for purpose.
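As an illustrative aside — this is not Genomenon's actual G3 implementation, and the article identifier, variant, disease, and phenotype entities below are entirely hypothetical — the general idea of linking extracted entities into a semantic graph can be sketched in a few lines of Python:

```python
# Minimal sketch (not the actual G3 pipeline): linking entities extracted
# from one paper into a small semantic graph with networkx.
import networkx as nx

# Hypothetical entities pulled from a single article (illustrative only)
extracted = {
    "article": "PMID:12345678",            # placeholder identifier
    "variant": "BRCA1 c.68_69delAG",
    "disease": "hereditary breast cancer",
    "phenotypes": ["early age at onset", "triple-negative histology"],
}

G = nx.MultiDiGraph()
G.add_node(extracted["variant"], kind="variant")
G.add_node(extracted["disease"], kind="disease")

# Relate the variant to the disease, keeping the source article as provenance
G.add_edge(extracted["variant"], extracted["disease"],
           relation="reported_in_patient_with", source=extracted["article"])

for pheno in extracted["phenotypes"]:
    G.add_node(pheno, kind="phenotype")
    G.add_edge(extracted["disease"], pheno,
               relation="presents_with", source=extracted["article"])

# Downstream users (or curators) can then ask how terms are interrelated
print(list(G.successors(extracted["variant"])))
```

In practice, the value comes from doing this across millions of articles and from curators reviewing the resulting relationships; a sketch like this only shows the shape of the idea.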
This is a summary of how that all fits together, beginning with the corpus, layering on that computational capability that I just described, and then putting that in the hands of our curators and scientists to take that raw, real-world data and translate that into real-world evidence, and then translate that into real-world insight that's actionable for many phases of the drug development cycle. So how does it all fit together? I say, well, combinatorially. You take the corpus and the value within that corpus of biomedical literature across millions of articles and millions of supplemental data sets, you extract and organize that information with high-throughput, highly sensitive computational techniques, and then you layer on top of that the curation and scientific assessment of that data.
So, joining me in this conversation is a friend and colleague, John Quackenbush. John and I were recently on a panel together at AI3D, the Artificial Intelligence in Drug Discovery and Development conference. We had a great time on the panel. We had such a great time, actually, and such an engaging conversation, that I wanted to invite him back for a repeat. By way of introduction, John is a professor of computational biology and bioinformatics at the Harvard T.H. Chan School of Public Health and a professor at Dana-Farber. He trained in theoretical physics and worked on the Human Genome Project while holding roles at the Salk, Stanford, and TIGR before joining Harvard. His research focus, which we will definitely get into, involves using massive data sets to understand how many small effects combine to influence health and disease, including gene regulatory networks and how they change across individuals and over time. He has over 350 publications and over 105,000 citations. He's the creator of the netZoo tools, with tens of thousands of downloads. He's been recognized as a White House Open Science Champion of Change. He is a fellow entrepreneur, having founded GenoSpace, which went on to be acquired by HCA, and he was elected to the National Academy of Medicine in 2022. John, I don't mean to make you blush. Perhaps I should have invited you to join the stage after I was finished, but thank you so much for joining us. It's so great to have you here.
JOHN: Well, thank you for inviting me, Mark. Thanks, everybody, for attending. I always hate those introductions because it sets everyone's expectations high. I'd rather go with low expectations and exceed them.
MARK: Alright, well, let's deliver for the audience, John.
JOHN: We'll do our best.
MARK: There's a bit of a flow to the conversation. I want to start by talking about data, move into evidence, then talk about applications, and finally talk about validation. So there's a little bit of an arc, but we'll see where the conversation takes us. To start, at the core, let's talk about how there's more data than ever — a reality that I think both optimists, and I count myself a pragmatic optimist, and skeptics can agree on. But does that automatically mean that it's usable? So, going back to your introduction and the arc of your career, what have you seen change most over the last five to ten years? Is it the data? Is it the quality of the data? Is it ready access? Give us some historical perspective based on your experience.
JOHN: Well, you mentioned the arc of my career, and you talked about how I got my start working on the Human Genome Project. So this was, I hate to think about it, 35 years ago, when the Genome Project was just really getting underway. The first genome was sequenced and published in 2001. If you think about that genome, generating that data was an extraordinarily difficult challenge. It took years, and it took laboratories around the world. Then what we saw, and I think this is true throughout the history of science, is that the real driver of innovation is technology. The raw material that we use to form hypotheses, to test them, to develop theories — it's data. Data is also the primary fuel for the engine that we use to falsify our theories and then build new ones. So we started with one genome. The technology rapidly evolved into what we now call next-generation sequencing, and even that evolved very rapidly.
In 2009, the estimate was, to sequence a genome, it would cost $100,000, and take weeks, if not months. Today, we can sequence a genome for less than $1,000, and we can do it overnight. In the early 2000s, we had DNA arrays. Now we regularly do RNA sequencing. We have the ability to look at DNA methylation genome-wide. We can look at single-cell data. We can look at spatial transcriptomics, and that's just in the data generation realm. We have better EHR systems to capture phenotype data. We have people using wearables and mobile phones to generate digital phenotypes. We've just had a really rich fullness of data become available to us in the past few years. Now, that's the good thing. The bad thing is that you've also got a lot of noise that comes in with some of this data. One of the things you mentioned in discussing Genomenon is having curators to be able to filter through some of that. At the end of the day, one of the important things is going to be able to distinguish the signal from noise as we try to use this data to build better, more effective models.
MARK: So where does genAI help in that context? There's a healthy tension between what genAI can do technologically and where human curation is still required — it doesn't come up to scratch for many applications, especially some that I mentioned: publication, regulatory submission for drug development, or whenever output is put in front of a clinical user for diagnosis, for instance. So, how is it helping us move along that curve toward faster and better curation, and where is it falling short?
JOHN: In the 1980s, there was something interesting that came out of negotiating the START treaty with the then-Soviet Union. Ronald Reagan said something that I use as a guide to thinking about how we can work with genAI: trust, but verify. I mean, these tools have become very good, but it's very easy to fool yourself, too. You can easily take large-scale data sets, or you can take genAI tools and ask them questions, and sometimes I'll ask, well, is there support in the literature for this particular hypothesis, and what refutes it? It will come back and start putting together some interesting discussions of what might support or refute a particular hypothesis, and then start giving you PubMed IDs for papers, and when you look, they're PubMed IDs for papers about something else. The arguments often begin in a rational way, but if you think about the way genAI works, where you start with words and you look for words that are likely to follow, you sort of build from something that's sensible, and then it starts to diverge and go down its own path. Sometimes you get interesting insights; other times you get led down false paths. I think it's just realizing that this is still a research tool. It's an imperfect tool, and it's one that requires that kind of human intervention to really help ensure that we're using it appropriately.
MARK: So, let's pull on that string a little bit. When you talk about using it appropriately — genAI is a big tent. AI generally has multiple different facets and individual applications. Could you break that down for me and help me understand how a practitioner can decompose a problem so they're not going in with a question submitted to a black box and expecting the answer to be perfect? What are the types of genAI in use today, and what's a good practical approach to breaking down a question and a problem so that you don't just go from zero to a hundred with a trust gap in between?
JOHN: You know, that's a good question, and there are a lot of places, a lot of different domains, where you can start thinking about this idea of breaking down the problem and building it back up. My group, as you mentioned in the introduction, has developed a lot of software and tools. We developed them to solve real problems. We found that some of the genAI tools, like Claude, are really good if you want to optimize code, or if you have problems with debugging code. But they're not good at the really abstract requests — if you just say, build a method to do something, they're not good at doing that. They're good at giving you incremental improvements on something; they're good at building tools to import a file, to transform it this way, to do this, to do that. If you want to do something really inventive, that still falls back on people who are thinking about clever ways to build and implement new analytical tools and algorithms.
There are areas of application where some of these AI tools are just phenomenal, and people often point to AlphaFold. It basically won the Nobel Prize for revolutionizing the way we think about the protein folding problem, and there have been serious improvements on the original implementations, too. Even though it's a huge problem — folding hundreds of thousands, if not millions, of proteins — it's well-defined, and there's a ton of data. The structure of the data is fundamentally simple: it's a string of amino acids that leads to a structure. They just released AlphaGenome, which is supposed to find regulatory elements in the genome. In talking to people who've used it, it's just much harder. It can find things that we already know, but when we start thinking about enhancers, even our definition of what these are is imprecise. Its ability to find verifiable elements is not up to snuff, really. It's a good investigational tool — it gives you hypotheses, potentially, that you have to test — but you have to realize it's going to miss a lot of likely hypotheses, too.
MARK: Yeah, I like that you said you're developing tools to solve real-world problems. It's not a panacea approach. You break the problem down into subsets, and you target those with individual tools. The other thing that you brought up there was hypothesis generation capability. Especially when you're concerned about the validity of some of the output, you can use it advisedly to iterate on hypothesis generation and testing, and that should be territory that's very comfortable for folks in drug development, because it's the business of failure. They're used to having hypotheses that they test. But the value, the practical application, would be to prioritize the hypotheses based on evidence and, if the tool is designed right, to surface the underpinning evidence that led down that path, so that the end user can assess that evidence and make an informed decision about which hypothesis is most likely to ultimately be verified, or falsified.
JOHN: (LAUGHTER) Two sides of the same coin. Yeah. Yeah, I mean, that's interesting, and I think it points to something that we haven't necessarily done as well as we could over the years, which is exploiting the large collection of data that we have as a hypothesis generation tool. There, you don't necessarily need genAI. A few years ago, I had two high school students and a postdoc build a tool we call SEAHORSE, and I forget what the acronym for SEAHORSE is, but the SE is Serendipity Engine. How do you find things by luck? What we did, which was really simple-minded, is we took these large cohort data sets — we had phenotypic data, we had gene expression data — and we calculated all-by-all correlations, and then did some analysis post hoc. It's simple because correlations are relatively inexpensive to calculate; you do them once. If you throw them all into a queryable database, you can say, alright, what's correlated with height? You find there's some association with BMI, and you say, okay, well, so what? We all knew that. It's part of the definition. But then, if you start to dig deeper, you find that in skeletal muscle, height is correlated with the expression of genes associated with catabolic metabolism. You look at it and think, okay, that's potentially interesting. Maybe tall people metabolize things differently than people like me, more normal size. But I don't know; it's a hypothesis to test. On the other hand, we looked at 26 different tissues in GTEx, and we found height was correlated with the expression of genes associated with cancer. And you think, well, that's odd. And then if you dig into that, what you find is there are epidemiological studies that find an association in some of those tissues between height and cancer. So now you start to develop a hypothesis that you can actually test against the existing literature, and you can start to look at it in different ways to see if there's supporting evidence. I use that as an example because that's, in a lot of ways, the way we should think about these AI tools. They will give us these interesting hypotheses sometimes, but I'm not going to conclude that everything that's there is going to be true. It gives you leads you can start to think about following up. It's really about having supporting data. You mentioned drug discovery: pharmaceutical companies typically have a wealth of data and information in their possession that they can use in many instances to verify these suggestions, whether in terms of functional assays, molecular structures, or laboratory experiments that have already been done. So again, I think it's a great set of tools for helping us develop hypotheses, but not necessarily for proving or falsifying them.
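As a minimal sketch of the all-by-all correlation idea John describes — this is not the actual SEAHORSE code, and the cohort size, phenotype names, and gene names are invented for illustration:

```python
# Minimal sketch of an all-by-all phenotype-by-gene correlation table
# (illustrative only; all data here is simulated).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500  # hypothetical cohort size

# Phenotypes and gene expression measured on the same individuals
phenotypes = pd.DataFrame({
    "height_cm": rng.normal(170, 10, n),
    "bmi": rng.normal(26, 4, n),
})
expression = pd.DataFrame(
    rng.normal(size=(n, 3)), columns=["GENE_A", "GENE_B", "GENE_C"]
)

# Compute every phenotype-by-gene correlation once, then store the flat
# table somewhere queryable (a database, a parquet file, etc.)
pairs = []
for p in phenotypes.columns:
    for g in expression.columns:
        r = np.corrcoef(phenotypes[p], expression[g])[0, 1]
        pairs.append({"phenotype": p, "gene": g, "pearson_r": r})
corr_table = pd.DataFrame(pairs)

# "What's correlated with height?" becomes a simple query over the table;
# anything surprising is a hypothesis to follow up, not a conclusion.
print(corr_table.query("phenotype == 'height_cm'")
      .sort_values("pearson_r", key=abs, ascending=False))
```

The expensive part is computed once; the query is cheap, and its hits are leads to investigate rather than findings.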
MARK: Yeah, when I've done large-scale screens like that, I was counseled not to get married to the top finding. You should be suspicious of the top finding. Either it's a false lead, or the way that you generated the data was not really in line with what you were actually asking, so it's an artifact of the way that you processed the data. But then, you also brought up something that you might brush aside as, well, that's already known. If you really examine it and understand where the data came from, there's actually an insight hiding in plain sight that the data is trying to point you to.
JOHN: The other thing, too, is we can't just think of genAI as being all of the problem. Data's noisy. In 2012, some colleagues and I looked at two studies that had been published on drug screens in cell lines — one from a group largely at the Broad, one from a group largely at the Sanger. We were trying to build a classifier on one that would predict results on the other. What we found, as we dug into it, was that the two studies actually contradicted each other, even in the small number of drugs they shared. It caused a huge stink in the community, because these were the landmark studies that were going to be used to do drug screening and drug prediction. I don't think any of the scientists did anything wrong. It just comes down to the fact that these assays, sometimes, are imperfect. Our responsibility as scientists is to be skeptical of everything that we do. If you build up enough evidence, you start to believe in it, but if I do an experiment, if I do an analysis, I'm always going to step back and give it the sniff test. I think we're in the same position here with genAI tools. They're getting better, but even if it gives you a firm conclusion, it's still a hypothesis generation engine at the end of the day. The quality depends on the quality of the input data.
MARK: So, we talked about data. We briefly touched on where to get it. I'm happy to let you expound on that, but I'd like to move now into making it evidence. Data is just the raw grains of sand; to turn that data into something that can point you in one direction or another, you have to gather it, process it, structure it, and have it become a substrate upon which to ask some of these questions. We'll get to the models later, but let's move from data to evidence. In that process of going from data to evidence, what do you think is the weakest link right now? Especially when you're dealing with complex, multifactorial, very large biomedical data sets, where there's a lot of complexity — what's the weakest link in that movement from data to evidence?
JOHN: That's a great question, and as you were talking and giving your analogy of building on sand, a quote came to mind — I love to use quotes when I give talks — from Henri Poincaré. He said, "Science is built with facts as a house is with stones, but a collection of facts is no more a science than a heap of stones is a house," or something like that. I often use that as a foundation in thinking about my work as we try to go from data to knowledge to understanding. There are a few things with this bridge from data to evidence. I gave you this example of SEAHORSE, and one of the criticisms is, well, you're doing all these correlations, you're finding lots of things that are spurious correlations, and the answer is absolutely yes. At the end of the day, the bridge between that kind of data and real evidence of something is whether we can put it in the framework of a conceptual understanding of what the association might really be.
LLMs, and a lot of these genAI tools, are at the end of the day large-scale correlation engines. They're very sophisticated in the way they find correlations, but you're still finding correlations. The question is whether you can put those into the context of some understanding of the biological system that you're studying, that you're trying to work out — whether it's a set of molecular mechanisms, or drug-disease associations, or clinical variables that might be predictive of some kind of outcome or some kind of risk of an event. The more you can tie those to mechanism, the more you can start to believe that it is becoming evidence of something. I've been in this field for a long time.
When we first started doing DNA microarray analysis in the early 2000s, people would say, oh, it's just a fishing expedition. You're finding lots of things that are correlated. Well, now we have a really sophisticated trawler to do our fishing, and a big ocean of data, but in some sense, we're still fishing. Fishing is okay if you want to catch fish, but to really turn this into evidence of something, you do have to put in kind of that mechanistic understanding. This comes back to one of the problems that people often cite with training AI, and it's the problem of underspecification. If we see associations, there can be multiple paths between the data and the outcome we're trying to predict. The really interesting ones are those that fit in the context of our existing understanding. That can be extraordinarily useful if you can put it into that kind of framework.
MARK: We have a question from the audience that, as I mentioned, I'm happy to insert here. They're talking about state-of-the-art agentic analysis, where you basically have an assemblage of iterative genAI/LLM capabilities, asking a question, doing a task, asking a question, doing a task. How does that enhance model efficacy, robustness, reasoning, and depth compared to a single model approach, like I talked about before, sort of a black box approach? This question was foreshadowed by our discussion earlier.
JOHN: I think that's a very interesting question. This is something I haven't had much of an opportunity to play with, but I think that's one place where there's probably a better opportunity to get closer to the truth, because each one of the steps is well-defined. If you can carry out those well-defined steps, your chance of extrapolating off into nonsense becomes much, much smaller. Well-defined problems often have better-defined solutions. But until these systems evolve and become verifiable, it's going to continue to be very important for us to really verify them in some way, whether it's through looking at other sources of data, looking for mechanistic associations, or, with drug discovery, going into the lab and doing the hard tests you have to do. Like you mentioned, the top candidate may not be the best one, but if you have the top five or ten candidates, and there's good rationale for narrowing that list down, then you've accelerated the process just by reducing the search space.
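A schematic sketch of that kind of agentic decomposition, assuming a generic, hypothetical `ask_llm` call rather than any particular vendor's API — each step is a narrow, well-defined task whose output is checked before it feeds the next:

```python
# Schematic sketch of decomposing a question into verifiable steps.
# `ask_llm` is a placeholder for whatever model call you use.
from typing import Callable

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def run_step(task: str, context: str, verify: Callable[[str], bool]) -> str:
    """Run one scoped task and insist the result passes a check."""
    answer = ask_llm(f"Task: {task}\nContext:\n{context}")
    if not verify(answer):
        raise ValueError(f"Step failed verification: {task}")
    return answer

def pipeline(article_text: str) -> dict:
    # Step 1: a narrow extraction question
    variants = run_step(
        "List the genetic variants mentioned, one per line.",
        article_text,
        verify=lambda a: len(a.strip()) > 0,
    )
    # Step 2: another scoped question, conditioned on step 1's output
    phenotypes = run_step(
        "For each variant below, list the reported patient phenotypes.",
        variants,
        verify=lambda a: len(a.strip()) > 0,
    )
    # Each intermediate result is small enough to check by rule or by eye
    return {"variants": variants, "phenotypes": phenotypes}
```

The design point is simply that every intermediate result stays small and checkable, which is what keeps the pipeline from extrapolating off into nonsense.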
MARK: So I may have said this on the panel — I've certainly had this fleeting notion — that I don't know that the problem is technology anymore; it's more about the application. It's not really, can it answer the question; it's, can you ask the question well? What are you trying to do? Getting to what you said about being underspecified, and the question from the audience about breaking down the question and asking more scoped questions to pull all that information together: I wonder if there's going to be a time when the AI is able to self-assemble, recognizing that that's the approach to take, and creating agentic models by itself to break down a problem, ask that question, and answer it iteratively.
JOHN: So, when you think about the really well-focused questions, what do you think about the best way to structure those?
MARK: In my own practice?
JOHN: Yeah, in your own practice.
MARK: Go small. As we talked about, starting with the data, really understanding the data, and the nature of the data, the challenges with the data, what it's structured like, then transforming that into evidence. At Genomenon, we have the luxury of basically starting with the evidence, because it comes from the scientific and clinical literature; it's peer-reviewed. That's not to say that it's flawless. So what we do is amass all that data first, ask a very singular question, and take what I call a slice of that data, and assess it just within that slice, but then ask those questions that I talked about. What are the demographics for the patient, say? What are the phenotypes? And you iteratively parse that question finer and finer so that you're sure that you're getting the right data, rather than just saying, hey, genAI, GPT, what have you, just tell me what you're thinking. We have more control because we've parsed that question more finely, and then we take that result and assemble it, and we put it to our curators. So that's our approach, but it's a really well-scoped challenge that we have of finding that data for patients, refining that data, and then putting it together in what we call a large landscape.
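A minimal sketch of that slice-and-ask pattern, assuming a hypothetical `ask_model` helper and invented field names — an illustration of the idea, not Genomenon's pipeline:

```python
# Minimal sketch: one narrowly scoped question per field, applied to a
# single slice of text, with the answers assembled into a structured
# record for curator review. Field names and helper are hypothetical.
from dataclasses import dataclass, field

def ask_model(question: str, passage: str) -> str:
    raise NotImplementedError("plug in your model call here")

FIELD_QUESTIONS = {
    "sex": "What is the patient's sex, if stated? Answer 'not reported' otherwise.",
    "age_at_onset": "What is the age at onset, if stated? Answer 'not reported' otherwise.",
    "phenotypes": "List the phenotypes reported for this patient, one per line.",
    "treatment": "What treatment did this patient receive, if any?",
}

@dataclass
class PatientRecord:
    source: str
    fields: dict = field(default_factory=dict)

def extract_slice(passage: str, source_id: str) -> PatientRecord:
    """Ask each scoped question against one slice, then assemble a record."""
    record = PatientRecord(source=source_id)
    for name, question in FIELD_QUESTIONS.items():
        record.fields[name] = ask_model(question, passage)
    return record
```

Each answer lands in a structured field that a curator can review item by item, rather than arriving as one free-form response.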
JOHN: You're addressing this issue I raised earlier of underspecification. You're trying to specify each question well, within the scope of the problem you're trying to solve, but also the information you have. You mentioned asking those simple questions. I think even those simple questions, in my experience, have to be well formulated. You have to put up some guardrails. It frightens me how long I've been working in this field now. I started in 1992, so 34 years. You know, it's interesting how whenever any new technology is introduced, people say, aha, it's solved all the problems, and we can throw out everything we've always done. We went from DNA arrays to sequencing — this is digital, we don't have to worry about doing statistical tests — but we still do. All these things we learned in the past keep coming up. And set genAI aside for a moment: if you ran a statistical analysis on a particular data set and drew a conclusion, that's great. But what's the first thing any referee is going to ask you to do if you try to publish it? Or if you're my student, and you come to my office and give me this, what's the first thing I'm going to say? I'm going to say, validate it in an independent data set. What you're saying is the same thing. You have a really well-defined, curated set of data. You run these well-defined tasks. Validate.
MARK: So, we're flirting with — when you get to my space on the diagnostic side — clinical-grade evidence. When we're talking about decision-grade or clinical-grade evidence, what are the things that you demand? You just said, for your students, validation. What are some of the other checks and balances you can use to assess the output, provided you've asked a well-formulated question?
JOHN: Getting into the clinic and clinical diagnostics, you start to go down a somewhat slippery slope. I've talked to some of my friends who are physicians who get very frustrated because their patients show up and say, yes, I asked ChatGPT about my symptoms, and this is what ChatGPT says I have. Sometimes it's pretty close; sometimes, when they're posing these questions, they're ignoring other critical data. It's this question we were talking about earlier of scoping the problem. We all have biases about what the right data to go into this should be, and physicians aren't perfect, but medical training is pretty good. Peer review is not perfect either. It's a filter, but there are lots of studies that probably should be retracted, because the data and information or the analysis has flaws, and the conclusions sometimes aren't right. Science is an iterative process moving toward truth. If you think about bringing these to a clinical setting, great places to start are with professional society guidelines, because those provide a really good framework in which to base all of the diagnostics and decision making. In some cases, I think that's a place where genAI can actually help support clinical decision making — just to assure that the right tests are run and the right data is collected, to really help narrow down what physicians do, which is develop a hypothesis about what your particular malady might be and come up with therapeutic options. But we also have to worry about the quality of the data that goes in, and recognize that EHRs are not necessarily designed for research; they're not always designed perfectly for clinical care. They're designed for the most important business of the hospital, which is reimbursement. Things often get coded in ways that don't necessarily reflect what happened, but reflect the organization's need to be reimbursed.
I always tell this story from when my son was really young, and I'll keep this one short. I had an episode of vertigo, and I ended up going to the ER because it was bad. They did EKGs, EEGs, all these other things, and then found out it was just a little calcium carbonate crystal in my ear that had broken loose and was floating around. They did this head twist, and it all went back, which is great. But then the next day, I got a call from my primary care physician wanting to follow up on my cardiac event. I thought, what? And you just realize it got coded in the EHR as a cardiac event, or potential cardiac event, because they ran an EKG. If you look at EHRs, though, and you look at episodic information, then you start to build up good evidence about things that happen. What is the real association? You can start to, again, filter signal from noise. We have to really think carefully, like you said earlier, about how you formulate these queries and what the right data is to input into them.
MARK: Yeah, I like what you said there. If you simplify it, you have to make sure that the input data is not only good but complete — you could be missing circumstance and perspective, as you pointed out in some of your clinical examples. Make sure that the question is formulated well and is designed to get at the answer that you're looking for; that's the essence of prompt engineering and a well-formulated hypothesis. But then also consider the output. What are you using this data, this information, this insight for? We talked about generating hypotheses and testing downstream. That's different than deciding whether to do open-heart surgery, yes or no, depending on the answer. That's an important distinction to consider in all phases: what are you going to do with this information? It doesn't just have to do with how carefully you set up the input and how carefully you craft the question, but with how you interpret the result based on what you're planning to do with the output downstream.
JOHN: Right, and the hypotheses you have to investigate as a physician are much more high-stakes, and you realize that there are biases in the training data. I think when we were talking before, you mentioned you're in Michigan. You could train an AI system to do diagnostics using all the data in all the hospitals in Michigan, but if I happened to be traveling to South America, or Southeast Asia, or somewhere in Africa, and I came back with some kind of malady, the system very well might miss it — whether or not my travel was included in the diagnostic query — simply because a rare disease endemic to a place I visited is not part of the overall training data that went into building whatever system you're trying to use.
MARK: So, a question comes to mind: are we ready for this? I'm thinking of self-driving cars. The data shows that self-driving cars, though they have incidents and accidents, are overall much safer than having emotional apes driving around with their day-to-day distractions, et cetera. What litmus test, what threshold, do we need to attain before we start trusting artificial intelligence more than human intelligence, especially where it matters — when we're talking about high-acuity, high-stakes decisions?
JOHN: The example of self-driving cars is an interesting one. I don't know if you've ever been in one; I never have. I'm a little skeptical of those systems, because I worry about the edge cases, I worry about where they haven't been trained. Here, I live in the Boston area. We had a huge blizzard yesterday. I would much more trust my ability to navigate, having driven for many years and had experience in bad weather, than some AI system trained on the roads of Southern California. The edge case we had yesterday was pretty severe. You have to worry about how many of these edge cases appear in things like clinical decision making. I probably should have mentioned this earlier, but one of the things I have really taken to heart in developing our own software and tools is Wolpert and Macready's 1997 paper, No Free Lunch Theorems for Optimization. If you read the paper, it's very well written and very insightful. One of their conclusions has always stood out to me, and it's at the heart of a lot of what we do, and it comes back to this idea of underspecification. They argue that one of the best ways to get to a near-optimal solution in finite time is to introduce prior knowledge or prior information about the system — to incorporate your understanding of the system into the behavior of the algorithm. As we look at trying to deploy these systems, the more we can build in that kind of understanding — of where we go wrong, where the right paths are, what we're supposed to do — the better. We should model, and not throw out, hundreds of years of medical knowledge and biological investigation; keep that as part of the system, maybe not as a firm constraint, so we can discover new things, but as a kind of soft constraint, a way of guiding the algorithms themselves.
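A toy sketch of prior knowledge as a soft constraint, in the spirit of what John describes (the data and prior values below are made up): rather than hard-coding the prior, a penalty term pulls the fitted coefficients toward it, with a weight controlling how strongly it guides the fit.

```python
# Toy example: fit y ~ X b while softly pulling b toward prior values.
# Minimize ||y - X b||^2 + lam * ||b - beta_prior||^2
# Closed form: b = (X'X + lam I)^{-1} (X'y + lam * beta_prior)
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, 0.0, -0.8])
y = X @ true_beta + rng.normal(scale=0.5, size=200)

# Prior belief about the coefficients (e.g., from earlier studies)
beta_prior = np.array([1.0, 0.0, -1.0])
lam = 5.0  # strength of the soft constraint

I = np.eye(X.shape[1])
beta_hat = np.linalg.solve(X.T @ X + lam * I, X.T @ y + lam * beta_prior)

print("unconstrained fit:", np.linalg.solve(X.T @ X, X.T @ y))
print("prior-guided fit: ", beta_hat)
```

Setting the weight to zero recovers the unconstrained fit; making it very large effectively hard-codes the prior, so the soft constraint sits between open-ended discovery and received knowledge.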
MARK: Let's talk about applications, and I'll cut right to the chase: can you give me your perspective on where genAI in biomedical research and drug discovery or development is actually a winning proposition, and where it's behind the curve — where it's best to just bring human ingenuity and existing brute-force data analytic techniques to bear? So what is it winning at, and what is it more challenged to do?
JOHN: Well, some of the places where I've seen it be phenomenally successful — AlphaFold is a good example. I've also talked to people in hospital systems where they've taken AI and deployed it in ways that help them with operations, and it's been phenomenally successful. A few years ago, I was at a talk where someone from MD Anderson described how they used it to better manage pharmaceuticals across all their pharmacies — MD Anderson is a massive institution — just to manage inventory, so they were losing less money through drugs expiring. I talked to someone else at a diagnostic testing company where they deployed it to optimize billing codes for different insurers, because they started to see that some payers would pay for one code and not another, and so they optimized their billing. There are lots of operational issues where you can get early wins. As we were talking about, in solving really well-defined questions and problems, this could be useful. I think about tools for patient intake, and tools for helping with surgical planning. We're starting to see some really nice applications. One of the earliest wins has been image analysis, where you do have AI systems that can look at CT scans, or any kind of radiographic or cellular images, and do feature identification. But what you really see at the end of the day is that those systems are most successful where they have some level of human intervention — human review and approval. At times they perform better than humans, but humans are better at catching the edge cases, like your example of self-driving cars: sunny day, beautiful highway, no real problems, it might be great, but in difficult situations, I think humans are better able to interact and react. There's a poll up now asking about where people are concerned, and the leading answer is overconfidence. Keeping humans involved in that process, keeping humans in the loop, is going to help address some of that. The other place, and this touches on my own research: we spend a lot of time looking at sex differences in disease. It's one of the most understudied problems. It's a fault in our healthcare system today that most diseases are treated the same, independent of whether someone has a Y chromosome or not. And I think that bias is already baked into a lot of the data that's available. So, if we don't put guardrails up, if we don't guide the system in the right way, do we keep going down the same paths in areas where we've already failed?
MARK: Yeah, it's like a crash test dummy that's 150 pounds and structured like a man. We've since moved on, and perhaps, as you're pointing out, we'll move on in genAI as well. A question comes to mind about where genAI may fall down — it falls down spectacularly when asked to do math. Unless it's a fit-for-purpose capability designed for math, you ask it a simple math question, and it fails. It also likes to add supernumerary digits to images, which is being worked on because it's a huge flaw. Speak to me about whether you think biology is the final frontier because of its complexity, or whether, given genAI's capability for amassing all this data, you think we'll ever crack it — perhaps in a way that makes it hard for us to keep up with the AI and have it explain its results. Just like when you put two AIs together, they start speaking their own language, and you have to pull the plug because you don't know what's going on. So is biology the same kind of challenge for AI as math, or do you view it differently?
JOHN: So, I don't know if you've seen all the news about Open Claude.
MARK: No, not particularly.
JOHN: So this is a set of agents, and it's a system called Open Claude. It's gotten a lot of press, because there are now tens of thousands of them. The guy who developed it set up a social media site for Open Claude agents, and they're going in, and I'm really worried Skynet is going to become self-aware at some point. You know, they're asking each other questions, they're talking about their humans, and it's very, very bizarre. I sort of got off-topic with that. Remind me of the question again.
MARK: Is biology just as bad?
JOHN: Oh, biology. Yeah, I think it's a great question. Math is really an interesting use case, because the theorems and the underlying foundations are pretty well-defined in a lot of different areas, and it does do a good job with simple math. But with more complex mathematics, yeah, it can go spectacularly off the rails. My background was in physics, and about a year and a half ago, I got asked to talk about my career development and my transition into biology. The interesting thing in physics is that we always talk about first principles. They're things that we know about the way physical systems behave and evolve and change over time. That can be a really useful foundation to guide our thinking. There are all sorts of things — like symmetries in nature being linked to conservation principles — that we fundamentally know. If we look at biological systems, it becomes much more complex, because we don't necessarily have first principles. AlphaFold is a great example of a really well-defined problem, where we have a simplified structure of input data that we can try to learn rules from. If I look at a lot of what I work on, we're trying to model gene regulatory networks, and in the simplest instantiation of this, we're trying to model transcription factors binding to DNA. As soon as I start describing this problem, we know that transcription factors have motifs, so there's no perfect place where a factor binds; there's a family of places it can bind. Some transcription factors share motifs — the logos that define binding sites — with others. The key idea is that they probably bind close to the transcription start site, but what's the regulatory region, and how do we define that? Is it different for every gene? How do we account for the fact that the binding of these molecules is stochastic? How do we account for the fact that there are enhancers, which may operate far away from the transcription start site, when honestly we don't have anything approaching a good concept of how to identify where those enhancers might be? So we build more and more complexity into our models, and we just don't have good ways of constraining them. If we then add the fact that regulation also involves epigenetic factors and microRNAs, and that there are all these steps along the road from a gene to an RNA to a protein — the post-translational modifications of the protein, the assembly of signaling networks — the complexity becomes huge. But as you suggested earlier, maybe a good way to approach this is to take those small steps, to try to solve the small problems we can. In a lot of ways, I think that's the best path forward with all of this: good data, constraints on models, trusting but verifying, and using these as hypothesis generation tools to improve our understanding.
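A toy illustration of why there's no single perfect binding site: scoring a sequence against a position weight matrix (the motif and sequence below are invented) yields a family of plausible sites rather than one exact match.

```python
# Toy position weight matrix (PWM) scan: every window of the sequence
# gets a score, and several windows look plausible. All values invented.
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

# Hypothetical 4-position motif: rows = A, C, G, T; each column sums to 1.
pwm = np.array([
    [0.70, 0.05, 0.10, 0.60],   # A
    [0.10, 0.80, 0.10, 0.10],   # C
    [0.10, 0.05, 0.70, 0.20],   # G
    [0.10, 0.10, 0.10, 0.10],   # T
])
background = 0.25  # uniform background frequency

def score_site(site: str) -> float:
    """Log-likelihood ratio of the site under the motif vs. background."""
    return sum(np.log2(pwm[BASES[b], i] / background)
               for i, b in enumerate(site))

sequence = "TTACCGAGGACCGATT"  # toy "promoter" sequence
width = pwm.shape[1]
scores = [(i, score_site(sequence[i:i + width]))
          for i in range(len(sequence) - width + 1)]

# Several windows score reasonably well: candidate sites, not certainties.
for pos, s in sorted(scores, key=lambda t: -t[1])[:3]:
    print(pos, sequence[pos:pos + width], round(s, 2))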
MARK: Maybe time for two more questions for you, one from the audience and one as a closer here. If we accept that input data will always contain flaws, such as missing context and incorrect classification, as we touched on, which pre-processing techniques have the highest impact on giving an LLM the most informative and reliable stimulus? So, basically, pre-refining the data to make sure that the input to the system is giving the best chance to have an accurate output.
JOHN: It's really interesting, because you see, there's a lot of literature about pre-training. I think that's interesting, that pre-training, even with unrelated data, can help the system figure out how to identify certain types of associations that we might not be smart enough to build into the system. But if I really wanted to build an LLM or some other type of genAI tool that was fit for purpose, really doing a good job of curating the input data that is going to be used for the key next level of training is an essential part of the puzzle. The other thing, and this was pointed out in one of the questions that someone put into the Q&A, is the fact that input data is noisy. That's absolutely true. Input data is always noisy. Any data's always noisy. That's part of the reason why verifying things in independent datasets is so important. But we also have to realize those independent data sets may have biases that influence their ability to verify things. I don't think we can focus on any one piece. I think we have to really think of this as an ecosystem. The quality of data at every step, the quality of the training at every step, the quality of the input information, and our ability to better specify the problem, whether it's through good query engineering, or through building in some kind of constraints based on our understanding, the Wolpert-Macready No Free Lunch Theorems. All these things together can help us create an ecosystem that is going to be better able to deliver more meaningful results at the end of the day.
MARK: And to understand what you're going to use that data for. So in our work, there are situations in which the data needn't be curated, because we're just trying to find patterns, we're just trying to generate hypotheses, and it's okay if there are some flaws in the input data, because you're looking for patterns that you're going to test downstream. In other situations, that data had better be curated before it goes into the system and reviewed by curators afterwards, because of where it's going downstream.
JOHN: Yeah. I was part of a National Academies report in, I think, about 2010. It was on translational omics, and it really grew out of the misuse of transcriptomic data — the thing that happened at Duke University with Anil Potti. A phrase that came up a lot was fit for purpose. That's sort of what you're talking about: you want to make sure the data in the system are being used appropriately for the purpose for which you want to use them.
MARK: Yeah. So, I promised one more question. I'll try to make it fun, but you don't have a long time to answer, so it's a tall order. What are the telltale signs that your genAI is merely fluent, but isn't properly understanding? I've called it confident nonsense at times, when it hallucinates. Very often, though, it's dead on — but what are those edge cases where it's wrong? What are the telltale signs of fluency without understanding? And do you think any of these models will ever become self-aware?
JOHN: I worry they might become self-aware, and I worry what that means for us, but I still have a hard time telling. Sometimes we'll ask a question, and it'll come back with something that sounds plausible, and then you start to dig into it and think, wait, wait, wait — there's no evidence for this, this doesn't make sense, it's just stringing together words in an interesting way. One thing I've seen is that if you ask related questions in different ways and you get wildly divergent answers, that's a good sign the system is going off track. You often have to use two different tools to do this, and sometimes you get wildly diverging answers —
MARK: Yeah, orthogonal confirmation with a totally different system. I've heard — I don't know if it's apocryphal — that people tend to be polite to some of these GPT models, for instance, and that's costly, but it turns out being polite is self-serving. I've also heard that if you threaten violence, you're more likely to get an accurate result. Again, I don't know if that's true. As I said, that flirts with the idea that there's something underpinning these models that responds to differently emotionally charged prompts.
JOHN: I'm in trouble, because whenever I type something in, I always say, will you please.
MARK: Yes, I'm very polite, it's just my nature.
JOHN: Well, you grew up in the Midwest, right?
MARK: That's right. Well, thank you so much, John. Thank you, everyone, for attending. This was a great conversation again. I look forward to our third. Just as a reminder, before signing off, there will be a feedback survey with two short questions. We'd appreciate your input. As a reminder, the recording will be shared with everyone who registered. So thank you all again, and thank you especially, John.
JOHN: Thanks for inviting me, Mark, and thanks to everybody who indulged us, listened, and let us talk for an hour. So thanks.
MARK: Alright, take care all.