[MA 2025 17] Evaluating and Improving Medical Factuality in Large Language Models
Okra.ai, Leiden
Proposed by: Michele Cafagna [michele.cafagna@okra.ai]
Introduction
Language models hold promise for medical applications, but they may generate factual inaccuracies and hallucinations that undermine trust and safety [1, 2]: such errors can lead to ill-informed market decisions or even jeopardize patients' health [3]. Ensuring factual accuracy in LLMs for medical use is not only technically challenging [3, 4]; it is also a crucial step toward their responsible and reliable deployment in tools that directly or indirectly affect our daily lives [3].
Bridging the gap between the potential of these systems and their safe, trustworthy use requires
carefully tailored evaluation, rigorous testing on in-domain data [5], and the development of
methodologies that can expose weaknesses and guide improvements in the specific domain [1].
Description of the SRP Project/Problem
This project investigates systematic methods to evaluate medical factuality in LLMs, develops approaches to measure factual accuracy and hallucinations, and explores strategies to improve information fidelity, with an emphasis on reproducibility.
In this use case, you will focus primarily on highly technical, drug-focused medical documents such as clinical trial reports and pharmacology publications.
The aim is to improve the reliability and factual grounding of LLM outputs in a question-answering (QA) scenario, ensuring alignment with contextual medical knowledge and supporting evidence-based decision-making in biomedical applications.
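As an illustration of the kind of evaluation this involves, the following Python sketch scores a model answer for factual consistency against its source document using an off-the-shelf NLI model. It is a minimal sketch: the model choice (facebook/bart-large-mnli), the sentence-level splitting, and the threshold are illustrative assumptions, not a prescribed method.

# Minimal sketch: NLI-based factual-consistency check of a QA answer
# against its source document. Model and threshold are illustrative.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL = "facebook/bart-large-mnli"  # any NLI model with an entailment label works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def entailment_score(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # bart-large-mnli label order: contradiction, neutral, entailment
    return torch.softmax(logits, dim=-1)[0, 2].item()

def answer_is_grounded(context: str, answer: str, threshold: float = 0.5) -> bool:
    # Check each answer sentence against the source document and flag the
    # answer as a potential hallucination if any sentence is unsupported.
    # The naive split on "." is a placeholder for a proper sentence segmenter.
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    return all(entailment_score(context, s) >= threshold for s in sentences)

Such a checker can serve both as an evaluation signal (e.g. counting unsupported sentences per answer) and, later, as an output-validation filter.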
Data
The main work will be based on:
- ClinicalTrials.gov: https://huggingface.co/datasets/louisbrulenaudet/clinical-trials
The student is encouraged to explore, build, or integrate additional datasets, provided they target a document-QA task within the drug-related medical domain.
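As a starting point, the dataset above can be loaded with the Hugging Face datasets library. This is a minimal sketch; the split name is an assumption and should be checked against the actual dataset card:

# Minimal sketch: load the ClinicalTrials.gov dataset listed above.
# The "train" split is an assumption; inspect the dataset card to confirm.
from datasets import load_dataset

ds = load_dataset("louisbrulenaudet/clinical-trials", split="train")
print(ds.column_names)  # inspect the schema before building QA pairs
print(ds[0])            # one trial record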
Additionally, we may grant access to proprietary data for further experiments, with the
understanding that any findings derived from such data may be partially eligible for publication.
Research questions
RQ1) How can we design a reproducible framework to assess factual accuracy and hallucinations in drug-focused medical language models for QA applications? (See the sketch after these questions.)
RQ2) How can the fidelity of drug-focused medical information produced by language models be
improved?
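Reproducibility in RQ1 is partly an engineering matter: every reported score should be traceable to an exact model, decoding configuration, and prompt template. A minimal sketch of that discipline, with all names purely illustrative placeholders:

# Minimal sketch: pin the evaluation configuration and log a fingerprint
# with every run so results can be reproduced exactly. All values below
# are illustrative placeholders, not project requirements.
import hashlib
import json
import random

EVAL_CONFIG = {
    "model": "my-medical-llm-v1",  # hypothetical model identifier
    "temperature": 0.0,            # greedy decoding for determinism
    "max_new_tokens": 256,
    "prompt_template": "Context: {context}\nQuestion: {question}\nAnswer:",
    "seed": 42,
}

def config_fingerprint(config: dict) -> str:
    """Stable hash that ties a reported score to an exact setup."""
    blob = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

random.seed(EVAL_CONFIG["seed"])
print("run fingerprint:", config_fingerprint(EVAL_CONFIG))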
Expected results
- Develop a reproducible evaluation framework for medical factuality and hallucinations, which may
take the form of metrics, a benchmark, or both, including evaluation protocols and baseline results.
- Devise methods to improve medical fidelity (e.g. context enrichment, external knowledge injection, or output validation); a sketch of the context-enrichment idea follows this list.
- Thesis or scientific report suitable for publication
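Context enrichment, for instance, amounts to retrieving the most relevant passages for a question and prepending them to the prompt, so the model answers from in-domain evidence rather than parametric memory. A minimal sketch, using TF-IDF retrieval as a deliberately simple stand-in for a production retriever:

# Minimal sketch: retrieve supporting passages and build a grounded prompt.
# TF-IDF is an illustrative stand-in for a stronger (e.g. dense) retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(question: str, passages: list[str], k: int = 3) -> list[str]:
    """Return the k passages most similar to the question."""
    vectorizer = TfidfVectorizer().fit(passages + [question])
    passage_vecs = vectorizer.transform(passages)
    question_vec = vectorizer.transform([question])
    scores = cosine_similarity(question_vec, passage_vecs)[0]
    top = scores.argsort()[::-1][:k]
    return [passages[i] for i in top]

def build_grounded_prompt(question: str, passages: list[str]) -> str:
    # Instructing the model to answer only from the retrieved context is
    # the simplest form of external knowledge injection.
    context = "\n\n".join(retrieve(question, passages))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

The same retrieved passages can then feed the NLI checker sketched earlier, closing the loop between evaluation and improvement.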
Time period (please tick at least one)
X November – June
o May – November
About Okra.ai
Okra.ai operates in the life sciences industry, curating and enhancing medical data for insight generation and leveraging advanced AI algorithms to produce critical value, ultimately enabling the delivery of the right drug to the right patient at the right moment.
We leverage a range of machine learning algorithms as well as Large Language Models (LLMs) to produce insights that help streamline the analysis of technical medical documents.
References
1. Zhou, Hongjian, et al. "A survey of large language models in medicine: Progress, application, and challenge." arXiv preprint arXiv:2311.05112 (2023).
2. Ji, Ziwei, et al. "Survey of hallucination in natural language generation." ACM Computing Surveys 55(12), 1–38 (2023).
3. Wang, Cunxiang, et al. "Survey on factuality in large language models." ACM Computing Surveys (2025).
4. Tang, Xiangru, Arman Cohan, and Mark Gerstein. "Aligning factual consistency for clinical studies summarization through reinforcement learning." Proceedings of the 5th Clinical Natural Language Processing Workshop (2023).
5. Antony, Sebastian, et al. "FactPICO: Factuality evaluation for plain language summarization of medical evidence." Association for Computational Linguistics (2024).