[MA 2025 19] Generating and Validating Synthetic Textual Data for Medical / Clinical Entity Linking
Department of Medical Informatics, Amsterdam UMC, University of Amsterdam.
Proposed by: Iacer Calixto, assistant professor of artificial intelligence [i.coimbra@amsterdamumc.nl]
Introduction
Synthetic data generation refers to the creation of novel data using computational procedures, such as large language models (LLMs) or controlled templates. In the clinical context, it is a way to expand training corpora when real clinical text is scarce, privacy-restricted, or skewed toward common cases. LLMs such as ChatGPT have already proven highly effective in the clinical domain for data annotation [9] and for generating synthetic data [8,10]. At the same time, synthetic data must be constrained and validated to avoid, for example, hallucinations and factual errors.
Biomedical named entity recognition (NER) [1,2,3] is an NLP task that identifies and segments spans in an input text referring to biomedical concepts (e.g., diseases, drugs, procedures), whereas named entity disambiguation or entity linking (EL) [4,5,6,7] aims at resolving each span to a unique concept identifier in a reference knowledge base (KB). Together, NER and EL enable downstream biomedical tasks, such as clinical decision support, by converting unstructured text into computable data.
At the NLP4Health lab, we have curated a large multi-domain dataset for training NER and EL models that spans scientific abstracts, clinical notes, and other biomedical text types. The dataset contains roughly 500,000 mentions (~10M tokens), ~45,000 unique concepts, and mappings of spans to 62 KBs across 126 semantic types. Moreover, for each mention linked to one of 11 KBs, the dataset includes the full hierarchical path, i.e., the sequence of nodes that starts at the ontology root and follows parent–child relations until the target concept is reached (e.g., "Patient experienced nausea." [nausea]: SNOMED CT concept -> symptom -> gastrointestinal -> nausea). Despite its scale and heterogeneity, our dataset exhibits a very long tail: the KBs we use cover millions of concepts, the vast majority of which are underrepresented or absent in naturally occurring clinical texts, including, for example, rare disorders, fine-grained procedures, and specific devices or microorganisms. This limits the generalizability of models trained on the data and risks biasing performance toward frequent concepts and common clinical contexts.
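One possible record layout for a single annotated mention with its hierarchical path is sketched below; the field names, identifier, and path nodes are illustrative only, not the dataset's actual schema or real SNOMED CT codes. A simple sanity check verifies that the path terminates at the linked concept:

```python
# Hypothetical record for one mention linked to a KB with its path.
record = {
    "text": "Patient experienced nausea.",
    "mention": "nausea",
    "kb": "SNOMED CT",
    "concept_id": "KB:0001",  # placeholder, not a real SNOMED CT code
    "path": ["SNOMED CT concept", "symptom", "gastrointestinal", "nausea"],
}

def path_is_consistent(rec: dict) -> bool:
    """A hierarchical path must be non-empty and end at the target concept."""
    return bool(rec["path"]) and rec["path"][-1] == rec["mention"]
```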
Description of the SRP Project
This project aims to build a synthetic text generator that creates high-quality sentences and short paragraphs containing mentions of underrepresented biomedical concepts. The goal is to produce sentences and paragraphs annotated with a mention, where the mention is paired with the correct concept in the KB of choice and, where relevant, its hierarchical path. Data diversity and quality are major requirements, and the student will investigate (a) how to best use an LLM to generate sentences annotated with concept mentions paired with valid concepts in the KB, (b) how to use an LLM judge with a brief rubric (clinical plausibility, correctness of the mapping, clarity of the text, etc.) to automatically validate and improve synthetic data quality, and (c) how well LLM judges align with expert clinician judgment, quantified in a small-scale human evaluation with clinicians on a stratified sample.
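The validation loop around the LLM judge could take a shape like the following minimal sketch, assuming the judge returns one score from 1 to 5 per rubric criterion; the criterion names, threshold, and accept/revise/reject policy are illustrative choices, and the actual generator and judge calls would be LLM API calls not shown here:

```python
RUBRIC = ("plausibility", "mapping_correctness", "clarity")
THRESHOLD = 4  # minimum acceptable judge score per criterion

def verdict(scores: dict[str, int]) -> str:
    """Map per-criterion judge scores to a pipeline decision."""
    failing = [c for c in RUBRIC if scores.get(c, 0) < THRESHOLD]
    if not failing:
        return "accept"          # keep the synthetic example as-is
    if len(failing) < len(RUBRIC):
        return "revise"          # regenerate with feedback on failing criteria
    return "reject"              # discard: failed the whole rubric

# Example judge output for one synthetic sentence:
print(verdict({"plausibility": 5, "mapping_correctness": 3, "clarity": 4}))  # revise
```

Examples routed to "revise" would be fed back to the generator together with the failing criteria, closing the generate-judge-improve loop described above.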
Research questions
- RQ1 (Model): How do proprietary and open-source LLMs compare when generating synthetic data (i.e., sentences and paragraphs annotated with a mention, where the mention is paired with the correct concept in the KB of choice) in terms of plausibility, diversity, and coherence?
- RQ2 (Strategy): What is the best strategy for generating synthetic data? For example, which in-context examples should we use to enforce stylistic variation, and how?
- RQ3 (Evaluation): How can we best use LLM judges to evaluate the quality of the generated examples? Which metrics should we use (e.g., plausibility, coherence), and how well does an LLM judge align with clinicians in a small-scale experiment?
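For the judge-clinician alignment in RQ3, a standard choice is chance-corrected agreement such as Cohen's kappa. The sketch below computes it from scratch for binary accept/reject labels; the example labels are made up purely to show the calculation:

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected by chance from each rater's label distribution:
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from the LLM judge and a clinician on six examples:
judge     = ["accept", "accept", "reject", "accept", "reject", "accept"]
clinician = ["accept", "reject", "reject", "accept", "reject", "accept"]
```

In practice one would use a library implementation (e.g., scikit-learn's `cohen_kappa_score`) and stratify the sample over frequent and long-tail concepts, as proposed above.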
Expected results
- a master’s thesis / scientific research paper to be published at an NLP venue;
- a codebase implementing the pipeline for synthetic clinical NER and EL data generation;
- a synthetic dataset generated with this pipeline, including a dataset card.
Time period
- November – June (x)
- May – November (x)
References
[1] Monajatipoor, M., Yang, J., Stremmel, J., Emami, M., Mohaghegh, F., Rouhsedaghat, M., & Chang, K. (2024). LLMs in Biomedicine: A study on clinical Named Entity Recognition. arXiv, abs/2404.07376.
[2] Perera, N., Dehmer, M., & Emmert-Streib, F. (2020). Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Frontiers in Cell and Developmental Biology, 8.
[3] Kocaman, V., & Talby, D. (2020). Biomedical Named Entity Recognition at Scale. arXiv, abs/2011.06315.
[4] Kulyabin, M., Sokolov, G., Galaida, A., Maier, A., & Arias-Vergara, T. (2024). SNOBERT: A Benchmark for clinical notes entity linking in the SNOMED CT clinical terminology. International Conference on Pattern Recognition.
[5] Borchert, F., Llorca, I., & Schapranow, M. (2024). Improving biomedical entity linking for complex entity mentions with LLM-based text simplification. Database: The Journal of Biological Databases and Curation, 2024.
[6] Zhu, M., Celikkaya, B., Bhatia, P., & Reddy, C.K. (2019). LATTE: Latent Type Modeling for Biomedical Entity Linking. arXiv, abs/1911.09787.
[7] Garda, S., Weber-Genzel, L., Martin, R., & Leser, U. (2023). BELB: a biomedical entity linking benchmark. Bioinformatics, 39(11), btad698.
[8] Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., & Wang, H. (2024). On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. arXiv, abs/2406.15126.
[9] Goel, A., Gueta, A., Gilon, O., Liu, C., Erell, S., Nguyen, L.H., Hao, X., Jaber, B., Reddy, S., Kartha, R., Steiner, J., Laish, I., & Feder, A. (2023). LLMs Accelerate Annotation for Medical Information Extraction. arXiv, abs/2312.02296.
[10] Mamooler, S., Montariol, S., Mathis, A., & Bosselut, A. (2024). PICLe: Pseudo-Annotations for In-Context Learning in Low-Resource Named Entity Detection. arXiv, abs/2412.11923.