[MA 2025 16] Is my model perplexed for the right reason? Contrasting LLMs’ Linguistic Behavior to Human Intuitions through Token-Level Perplexity
Department of Medical Informatics, Amsterdam UMC
Proposed by: Giovanni Cinà [g.cina@amsterdamumc.nl]
Introduction
When it comes to Explainable AI (XAI) and testing the abilities of artificial agents, we always run the risk of drawing unwarranted conclusions from behavior that intuitively complies with our expectations, also known as confirmation bias (see e.g. [1,2]). This is especially the case for Language Models (LMs), given their already impressive language skills. To escape this pitfall, we need a rigorous (formal) specification of the ability in question and a systematic way to measure compliance with this specification. In the context of linguistic abilities, several benchmarks have been proposed to assess the capabilities of LMs; in these efforts, a model's ability is measured in terms of perplexity-driven performance on carefully designed tasks.
Description of the SRP Project/Problem
The goal of this project is to argue that such benchmark-level performance alone is insufficient, and to leverage an existing general methodology, in the spirit of mechanistic interpretability, to formalize and test the linguistic behavior of LLMs in terms of token-level perplexity. On a variety of NLP tasks and datasets, from simple to more complex, the project will show how this approach allows us to measure the compliance of a language model with a specified behavior, letting developers understand whether the model achieves high benchmark performance for the right reasons and side-stepping confirmation bias.
The project is essentially about mapping sub-symbolic model explanations (in this case, token-level model perplexities) to a more "readable" or "interpretable" level of analysis. This will be achieved by formulating hypotheses such as: "if the model understands that this sentence is ungrammatical, it should be perplexed by a certain token X, the one making it ungrammatical". The way we test this is based on the framework proposed in [3]. The project is a purely NLP one, i.e., we only use textual data and text-only LLMs (e.g., Llama, Gemma, Mistral).
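To make this concrete, the sketch below (illustrative only, not part of the proposal) shows how token-level surprisal, the per-token counterpart of perplexity, can be computed with the Hugging Face transformers library; "gpt2" is used as a stand-in model, whereas the project targets LLMs such as Llama, Gemma, or Mistral, and the subject-verb agreement example is a hypothetical minimal pair.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open model; the project targets e.g. Llama, Gemma, Mistral
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_surprisals(sentence: str):
    """Return (token, surprisal) pairs, where surprisal = -log p(token | preceding tokens)."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits                     # shape (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)    # position i predicts token i+1
    targets = input_ids[0, 1:]
    surprisals = -log_probs[torch.arange(targets.size(0)), targets]
    return list(zip(tokenizer.convert_ids_to_tokens(targets.tolist()), surprisals.tolist()))

# The hypothesis "the model is perplexed by the token that makes the sentence
# ungrammatical" then becomes a check on where the surprisal peaks.
for sentence in ["The keys to the cabinet are on the table.",
                 "The keys to the cabinet is on the table."]:
    print(sentence)
    for token, surprisal in token_surprisals(sentence):
        print(f"  {token:>12s}  {surprisal:6.2f}")

Under the hypothesis above, one would expect the surprisal peak in the ungrammatical sentence to fall on the token that breaks the agreement ("is"), rather than merely expecting a higher sentence-level perplexity.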
There are already some preliminary results in this research line, as well as clear connections with recently published approaches, e.g. [4], so we expect this project to lead to a publication in an NLP venue. More broadly, this thesis will contribute to solving this problem by testing and refining the proposed method for aggregating mechanistic explanations into a symbolic statement that encodes model behavior, thereby avoiding confirmation bias. This line of research will connect with ongoing research in XAI and learn from other experiences in the field (e.g., [4] in NLP).
You will be working under the supervision of Sandro Pezzelle and Giovanni Cinà. Sandro is an assistant professor in Responsible AI at the Institute for Logic, Language and Computation (ILLC), Faculty of Science, working on topics at the intersection of NLP, Cognitive Science, and XAI (especially mechanistic interpretability of LLMs). Giovanni is an assistant professor in Responsible Medical AI with a joint appointment at the Faculty of Science and Faculty of Medicine. Supervision meetings will take place weekly; in addition, the student will be encouraged to take part in relevant seminars, reading groups with peers, and social gatherings.
Research questions
RQ1. For selected linguistic tasks, e.g. recognizing correct pronouns, can we define a desired behavior for the model in terms of perplexity at the token level?
RQ2. For selected linguistic tasks, e.g. recognizing correct pronouns, do we observe the ‘correct’ behavior when we analyze the LLM’s perplexity at the token level? (A toy illustration of RQ1 and RQ2 is sketched after this list.)
RQ3. What are good sanity checks for testing the correctness of this method?
RQ4. Is the analysis in terms of token-level perplexity helpful for obtaining a nuanced perspective on the performance of LLMs on linguistic benchmarks? For instance, can we reduce confirmation bias with respect to LLMs’ linguistic skills?
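As an illustration of RQ1 and RQ2, a desired behavior could be encoded as a compliance check over minimal pairs, in the spirit of the sketch above. The pairs, the pronoun task, and the critical-token index below are hypothetical placeholders, not the project's actual protocol, and the check reuses the token_surprisals helper from the earlier sketch.

# Illustrative only: encode the desired behavior "the model should be more perplexed
# by the critical token in the incorrect sentence than in the correct one" and
# aggregate it over a (hypothetical) set of minimal pairs.
# Assumes the token_surprisals helper from the earlier sketch is in scope.

def complies(correct: str, incorrect: str, critical_index: int) -> bool:
    """True if surprisal at the critical token is higher in the incorrect variant."""
    return (token_surprisals(incorrect)[critical_index][1]
            > token_surprisals(correct)[critical_index][1])

# Hypothetical minimal pairs for pronoun choice; critical_index points at the
# pronoun in the surprisal list and is tokenizer-dependent.
pairs = [
    ("The boy said that he was tired.",  "The boy said that she was tired.",  3),
    ("The girl said that she was late.", "The girl said that he was late.",   3),
]

compliance_rate = sum(complies(c, i, k) for c, i, k in pairs) / len(pairs)
print(f"Compliance with the specified behavior: {compliance_rate:.0%}")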
Expected results
- The main intended outcome of this SRP project is a scientific paper. The results of the work will be submitted to a top-tier machine learning workshop, and this paper will constitute your SRP.
- The second main deliverable is an open-source code base through which all the experiments conducted will be shared with the research community.
Time period
- November – June
- May – November
References
[1] Ghassemi, Marzyeh, Luke Oakden-Rayner, and Andrew L. Beam. "The false hope of current approaches to explainable artificial intelligence in health care." The Lancet Digital Health 3.11 (2021): e745-e750.
[2] Cinà, Giovanni, et al. "Semantic match: Debugging feature attribution methods in XAI for healthcare." Conference on Health, Inference, and Learning. PMLR, 2023.
[3] Cinà, Giovanni, et al. "Fixing confirmation bias in feature attribution methods via semantic match." arXiv preprint arXiv:2307.00897 (2023).
[4] Fang, Lizhe, et al. "What is wrong with perplexity for long-context language modeling?" The Thirteenth International Conference on Learning Representations, 2025. URL: https://openreview.net/forum?id=fL4qWkSmtM.