When conversations turn to real-world evidence (RWE), the spotlight often falls on the big datasets - electronic health records, insurance claims, patient registries. And yes, those are vital. But they tell only part of the story. To truly see what’s happening in real clinical practice - across continents, specialties, and patient populations - you have to look beyond the obvious. One of the richest, most detailed, and most overlooked sources of insight is hiding in plain sight: the scientific literature.
RWD vs. RWE - The Basics
Real-world data (RWD) are the raw materials - the patient records, insurance claims, registries, wearable device data, and literature itself. Real-world evidence is what you get when you analyze and interpret that raw data to derive actionable information.
The Full Spectrum of RWD Sources
RWE is not built from a single type of real-world data - it comes from an ecosystem of complementary RWD sources:
- EHR/EMR data - Patient-level insights from everyday clinical care, though often inconsistent in structure.
- Insurance claims - Large-scale longitudinal patterns, but limited clinical context.
- Patient registries - Focused, disease-specific datasets with narrower scope.
- Wearables & patient-reported outcomes - Continuous, patient-driven perspectives, sometimes without standardization.
- Scientific literature - Peer-reviewed studies, case reports, observational research, and meta-analyses documenting patient presentations, clinical journeys, molecular findings, and treatment outcomes.
Each source fills part of the puzzle. Literature adds something the others often can’t: deep context, global reach, and early signals across both clinical and molecular domains. Literature also captures unique longitudinal data - detailed follow-ups and multi-year case studies that track how diseases progress, treatments evolve, and outcomes change over time.
Why Literature Matters
Traditional real-world data sources - such as electronic health records (EHRs), claims data, and patient registries - are valuable. But each has clear limitations that can leave evidence gaps. EHRs are often fragmented across systems, with variable documentation practices and limited interoperability, which can lead to incomplete patient histories [1]. Much clinically rich information remains buried in unstructured text, and missing data is common.
Claims data, while standardized, lacks the depth to capture symptoms, disease progression, or the rationale behind treatment decisions. It may also omit non-billed encounters [2].
Routinely collected data - whether from EHRs or claims - is observational and prone to biases such as confounding. These biases can significantly compromise causal inferences unless carefully addressed using robust statistical frameworks and causal-inference methods (e.g., target-trial emulation, rigorous confounder selection, and doubly robust estimators) [3].
Decades of published literature capture details that other sources can’t. These include patient presentations, clinical journeys, treatment approaches, and outcomes. This wealth of information can refine prevalence estimates, uncover correlations between patient characteristics and outcomes, clarify demographic patterns, quantify biomarker or functional assay results, and deepen understanding of disease progression.
However, extracting and interpreting this vast body of unstructured data is not easy. It’s time-consuming, complex, and prone to error. Missing key details can lead to inaccurate conclusions, delayed recruitment, and setbacks in both clinical and commercial strategies.
To overcome these challenges, the right combination of advanced AI capabilities and expert curation can turn literature from a scattered archive into a powerful, structured source of real-world evidence. By systematically extracting, organizing, and interpreting information across millions of publications, it’s possible to unlock insights that drive better trial design, more efficient recruitment, and stronger regulatory submissions - in any therapeutic area.
The Challenge - and the Opportunity
- Despite its value, literature is often underused because it is dispersed across thousands of journals and conference proceedings - suboptimal search strategies and incomplete guidance frequently leave gaps in sampling coverage [4].
- Varied use of terminology around a common theme creates barriers to communication and presents obstacles to systematic literature retrieval [5].
- Manual review is both time-consuming and error-prone, with systematic reviews reporting that screening and extracting data from even a modest set of studies can take weeks to months [6].
Without these insights, organizations risk delayed trial recruitment, weaker regulatory submissions, and missed opportunities for earlier intervention.
When extracted systematically, the scientific literature can turn scattered, inconsistent studies into high-value real-world evidence. Ultragenyx used this approach to assemble a reference-backed cohort of 42,000 published patients with familial hypercholesterolemia - uncovering diagnostic gaps invisible in other datasets. Loxo@Lilly analyzed 97 RET variants to refine its label expansion in Japan, removing non-actionable variants and securing regulatory approval. In rare cardiac disease, literature curation delivered a 500+ patient dataset - double the expected size - while surfacing new pathogenic variants to tighten trial inclusion criteria.
The Goldmine Hiding in Plain Sight
In an era where speed, precision, and evidence-based decision-making define success, ignoring the scientific literature means leaving some of the most valuable real-world insights untapped. When harnessed effectively, it doesn’t just complement other RWD sources - it transforms them, filling in the gaps and revealing connections that change outcomes. The organizations that learn to unlock this goldmine will be the ones shaping the future of healthcare.
References
1. Li, I., Pan, J., Goldwasser, J., Verma, N., Wong, W. P., Nuzumlalı, M. Y., Rosand, B., Li, Y., Zhang, M., Chang, D., Taylor, R. A., Krumholz, H. M., & Radev, D. (2021). Neural natural language processing for unstructured data in electronic health records: A review [Preprint]. arXiv. https://arxiv.org/abs/2107.02975
2. Amplity. (2025). How doctor-patient conversations provide the critical information claims data overlooks. Amplity. https://amplity.com/news/why-claims-data-cant-tell-the-whole-story
3. Doutreligne, M., Struja, T., Abecassis, J., Morgand, C., Celi, L. A., & Varoquaux, G. (2025). Step-by-step causal analysis of EHRs to ground decision-making. PLOS Digital Health, 4(2), e0000721. https://doi.org/10.1371/journal.pdig.0000721
4. Gusenbauer, M., & Gauster, S. P. (2025). How to search for literature in systematic reviews and meta-analyses: A comprehensive step-by-step guide. Technological Forecasting and Social Change, 212, 123833. https://doi.org/10.1016/j.techfore.2024.123833
5. Parker, R., & Hayden, J. (2011, October). Uncommon language: The challenges of inconsistent terminology use for evidence synthesis [Poster presentation]. 19th Cochrane Colloquium, Madrid, Spain. https://abstracts.cochrane.org/2011-madrid/uncommon-language-challenges-inconsistent-terminology-use-evidence-synthesis
6. Borah, R., Brown, A. W., Capers, P. L., & Kaiser, K. A. (2017). Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open, 7(2), e012545. https://doi.org/10.1136/bmjopen-2016-012545