Scientific Research Project

[MA 2025 15] Inducing causal structure to improve the prediction of blood-glucose level in patients with Diabetes Type 1: from synthetic to real-world data

Department of Medical Informatics at the Amsterdam UMC

Proposed by: Giovanni Cinà [g.cina@amsterdamumc.nl]

Introduction

Currently, most ML algorithms are forced to learn the correlations between features and outcome 'from scratch'. There are, however, fields where expert knowledge has been accumulated for decades; a clear example is medicine, where the relationship between risk factors and medical outcomes has been studied extensively. How can we leverage such knowledge in the training of our models? Moreover, data can contain all kinds of spurious correlations, so 'ignorant' models might pick up incorrect correlations that expert knowledge rules out. How do we prevent this from happening?

Causal knowledge is often encoded in a causal model, representing how the relevant variables influence each other [1]. Several researchers have proposed methodologies to infuse predictive models (especially neural networks) with such expert knowledge; one example is the method of interchange intervention training (IIT) [2]. In a previous SRP project, this approach was shown to be successful for the task of predicting blood glucose levels in patients with Diabetes Type 1 [3].

Description of the SRP Project/Problem

This project aims to continue the work on applying the IIT approach to tabular data. While the previous project worked only with simulated data, this project will extend the results to real-world data. The goal is to assess whether this technique can indeed help us infuse predictive models trained on Electronic Health Records with relevant medical knowledge.

Depending on the pace of the project, several advanced questions can be tackled: extending the IIT framework to settings with different levels of information (for example incomplete causal models, where only part of the relationships between variables is known), providing bootstrap confidence intervals, and considering counterfactual treatment plans.
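As an illustration of the bootstrap confidence intervals mentioned above, the following is a minimal sketch of a percentile bootstrap for the RMSE of a glucose predictor. It is not taken from the previous project: the function name `bootstrap_rmse_ci`, the choice of RMSE as the metric, the mg/dL unit and the toy numbers are all assumptions made for the example.

```python
import math
import random


def bootstrap_rmse_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for RMSE (hypothetical helper).

    Returns (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    rng = random.Random(seed)
    n = len(y_true)
    # Squared errors are computed once; each resample draws them with replacement.
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    point = math.sqrt(sum(sq_err) / n)

    stats = []
    for _ in range(n_resamples):
        resample = rng.choices(sq_err, k=n)
        stats.append(math.sqrt(sum(resample) / n))
    stats.sort()

    # Percentile method: read the alpha/2 and 1 - alpha/2 order statistics.
    lo_idx = math.floor((alpha / 2) * (n_resamples - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (n_resamples - 1))
    return point, stats[lo_idx], stats[hi_idx]


# Toy values in mg/dL (assumed unit), invented purely for illustration.
y_true = [100, 120, 140, 160, 180, 200]
y_pred = [110, 115, 150, 150, 190, 195]
point, lower, upper = bootstrap_rmse_ci(y_true, y_pred)
print(f"RMSE = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

This sketch resamples individual readings independently. For glucose time series with repeated measurements per patient, resampling whole patients or contiguous blocks may be more appropriate, since neighbouring readings are correlated.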

The real-world data in question is the OhioT1DM dataset [4], a public dataset accessible on request. The data will be requested by the student with support from the supervisor. In the unlikely case that the data is not accessible, the student will work on the open questions on the synthetic data (see above).

Research questions

RQ1) Can we improve predictive models of blood glucose levels trained on real-world data by leveraging expert knowledge encoded in causal models? Concretely, do we see improvements in terms of

a. model performance

b. data efficiency (how much data is needed to train)

c. compliance with expert knowledge

RQ2) How do the results on synthetic data transfer to the real-data domain?

RQ3) How can IIT be expanded to a framework where only partial information is available?

Expected results

- The main intended outcome of this SRP project is a scientific paper. The results of the work will be submitted to a top-tier machine learning workshop, and this paper will constitute your SRP.

- The second main deliverable is an open-source code base where all the experiments conducted will be shared with the research community.

Time period, please tick at least 1 time period:

- November – June (x)

- May – November (x)

References

[1] Glymour, Madelyn, Judea Pearl, and Nicholas P. Jewell. Causal inference in statistics: A primer. John Wiley & Sons, 2016.

[2] Geiger, Atticus, et al. "Inducing causal structure for interpretable neural networks." International Conference on Machine Learning. PMLR, 2022.

[3] Esponera, Ana, and Giovanni Cinà. "Inducing Causal Structure Applied to Glucose Prediction for T1DM Patients." Causal Learning and Reasoning. PMLR, 2025.

[4] Marling, Cindy, and Razvan Bunescu. "The OhioT1DM dataset for blood glucose level prediction: Update 2020." CEUR workshop proceedings. Vol. 2675. 2020.