[MA 2025 17] Evaluating and Improving Medical Factuality in Large Language Models

Okra.ai, Leiden
Proposed by: Michele Cafagna [michele.cafagna@okra.ai]

Introduction

Language models hold promise for medical applications, but they may generate factual inaccuracies and hallucinations that undermine trust and safety [1, 2]: such errors can lead to ill-informed market decisions or even jeopardize patients' health [3]. Ensuring factual accuracy in LLMs for medical use is not only technically challenging [3, 4]; it is a crucial step toward their responsible and reliable deployment in tools that directly or indirectly affect our daily lives [3].

Bridging the gap between the potential of these systems and their safe, trustworthy use requires carefully tailored evaluation, rigorous testing on in-domain data [5], and the development of methodologies that can expose weaknesses and guide improvements in the specific domain [1].

Description of the SRP Project/Problem

This project investigates systematic methods to evaluate medical factuality in LLMs, develops approaches to measure factual accuracy and hallucinations, and explores strategies to improve information fidelity, with an emphasis on reproducibility.

In this use case you will focus primarily on highly technical, drug-focused medical documents such as clinical trial reports and pharmacology publications.

The aim is to improve the reliability and factual grounding of LLM outputs in a question-answering (QA) scenario, ensuring alignment with contextual medical knowledge and supporting evidence-based decision-making in biomedical applications.
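For illustration, below is a minimal sketch of one possible faithfulness metric in this QA setting: the source document is treated as an NLI premise and the generated answer as the hypothesis, and answers with low entailment probability are flagged as hallucinations. The NLI model and the 0.5 threshold are illustrative assumptions, not prescribed choices; a biomedical NLI model could be swapped in.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Off-the-shelf NLI model used as a proxy faithfulness judge
    # (assumption: any premise-hypothesis entailment model would do).
    MODEL = "facebook/bart-large-mnli"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

    def entailment_score(premise: str, hypothesis: str) -> float:
        """Probability that the source document (premise) entails the
        model answer (hypothesis)."""
        inputs = tokenizer(premise, hypothesis, return_tensors="pt",
                           truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        # bart-large-mnli label order: contradiction, neutral, entailment
        return logits.softmax(dim=-1)[0, 2].item()

    def hallucination_rate(pairs, threshold: float = 0.5) -> float:
        """Fraction of (context, answer) pairs whose answer is not
        entailed by its source context."""
        flagged = [entailment_score(ctx, ans) < threshold
                   for ctx, ans in pairs]
        return sum(flagged) / len(flagged)

A metric like this is only a starting point; part of the project is precisely to test how well such automatic judges agree with expert annotation on drug-focused documents.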

Data

The main work will be based on:

- ClinicalTrials.gov: https://huggingface.co/datasets/louisbrulenaudet/clinical-trials

The student is encouraged to explore, build, or integrate additional datasets, provided they focus on document-QA tasks within the drug-related medical domain.
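A minimal loading sketch for the dataset above, using the Hugging Face datasets library; the dataset id is taken from the URL, while the "train" split and the record schema are assumptions to verify on first use.

    from datasets import load_dataset

    # Inspect the schema before building QA pairs from the trial records.
    ds = load_dataset("louisbrulenaudet/clinical-trials", split="train")

    print(ds)           # row count and column names
    print(ds.features)  # schema of the trial records
    print(ds[0])        # one raw record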

Additionally, we may grant access to proprietary data for further experiments, with the understanding that any findings derived from such data may be partially eligible for publication.

Research questions

RQ1) How can we design a reproducible framework to assess factual accuracy and hallucinations in drug-focused medical language models for QA applications?

RQ2) How can the fidelity of drug-focused medical information produced by language models be improved?

Expected results

- Develop a reproducible evaluation framework for medical factuality and hallucinations, which may take the form of metrics, a benchmark, or both, including evaluation protocols and baseline results.

- Devise methods to improve medical fidelity (e.g., via context enrichment, external knowledge injection, output validation; see the retrieval sketch after this list).

- Thesis or scientific report suitable for publication.
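As one example of context enrichment, the sketch below retrieves the trial passages most similar to a question and prepends them to the prompt, so the LLM answers from evidence rather than from parametric memory alone. The embedding model and the prompt wording are illustrative assumptions; a biomedical retriever could be substituted.

    from sentence_transformers import SentenceTransformer, util

    # Illustrative general-purpose retriever (assumption).
    retriever = SentenceTransformer("all-MiniLM-L6-v2")

    def enrich_prompt(question: str, passages: list[str], k: int = 3) -> str:
        """Prepend the k passages most similar to the question, and
        instruct the model to answer only from that context."""
        passage_emb = retriever.encode(passages, convert_to_tensor=True)
        question_emb = retriever.encode(question, convert_to_tensor=True)
        scores = util.cos_sim(question_emb, passage_emb)[0]
        top = scores.topk(min(k, len(passages))).indices.tolist()
        context = "\n\n".join(passages[i] for i in top)
        return (
            "Answer the question using ONLY the context below. "
            "If the context is insufficient, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )

Coupling such an enrichment step with the evaluation framework from RQ1 would allow measuring directly whether retrieved context reduces the hallucination rate.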

Time period (please tick at least one):

X November – June

o May – November

About Okra.ai

OKRA.ai operates in the life sciences industry, curating and enhancing medical data for insight generation and leveraging advanced AI algorithms to produce critical value, ultimately enabling the delivery of the right drug to the right patient at the right moment.

We use a range of machine learning algorithms as well as large language models (LLMs) to produce valuable insights that help streamline the analysis of technical medical documents.

References

1. Zhou, Hongjian, et al. "A survey of large language models in medicine: Progress, application, and challenge." arXiv preprint arXiv:2311.05112 (2023).

2. Ji, Ziwei, et al. "Survey of hallucination in natural language generation." ACM Computing Surveys 55.12 (2023): 1-38.

3. Wang, Cunxiang, et al. "Survey on Factuality in Large Language Models." ACM Computing Surveys (2025).

4. Tang, Xiangru, Arman Cohan, and Mark Gerstein. "Aligning factual consistency for clinical studies summarization through reinforcement learning." Proceedings of the 5th Clinical Natural Language Processing Workshop, 2023.

5. Joseph, Sebastian Antony, et al. "FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence." Association for Computational Linguistics, 2024.