Rob Long · March 2, 2026
Internal experience machines
Papers on self-knowledge, introspection, and what models can tell us about themselves
I’m on a forthcoming 80,000 Hours podcast episode, out on March 3. During my conversation with Luisa Rodriguez—which was in November, many AI-progress aeons ago—I spoke about AI introspection and self-reports. At one point, I promised a list of papers on the topic. Well, now the bill has come due. Here they are!
It’s an exciting time to be working on introspection and self-reports: roughly, how and whether AI systems know things about their internal workings, and when and whether they can speak reliably about themselves. My list of papers about this is nowhere close to comprehensive, but should help people who are looking to get into this area (footnote 1).
Some of the earliest work on LLM self-knowledge concerns model calibration: can they track what they know and don’t know? Does their confidence track how likely they are to be right?
- Kadavath et al. (2022), “Language Models (Mostly) Know What They Know.” Showed that larger language models (larger for the time!) were reasonably well-calibrated on multiple-choice and true/false questions, and could be trained to predict the probability that they know the answer to a question.
- Lin, Hilton & Evans (2022), “Teaching Models to Express Their Uncertainty in Words” fine-tuned GPT-3 to express uncertainty in natural language (“maybe”). This calibration training generalized moderately.
Since then, models have gotten better in some ways but still show weird, jagged calibration. Despite massive capability increases, you can still see wild failures of both under- and over-confidence, as many LLM users have experienced:
- Barkan, Black, and Sourbut (2025), “Do Large Language Models Know What They Are Capable Of?” found that then-current LLMs were “systematically overconfident but have better-than-random ability to discriminate between tasks they can and cannot accomplish…We also find that LLMs with greater general capability often have neither better-calibrated confidence nor better discriminatory power.”
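To make “calibration” concrete, here is a quick sketch of the standard metric, expected calibration error (ECE): bin a model’s stated confidences, then measure the gap between average confidence and actual accuracy in each bin. This is a generic illustration, not code from any of the papers above.

```python
# Illustrative sketch of expected calibration error (ECE), not taken
# from any of the papers cited in this post.
from collections import defaultdict

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence; ECE is the size-weighted gap
    between average confidence and average accuracy across bins."""
    bins = defaultdict(list)
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for items in bins.values():
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece

# A well-calibrated model that says "0.8" is right 80% of the time:
confs = [0.8] * 10
right = [True] * 8 + [False] * 2
print(expected_calibration_error(confs, right))  # close to 0.0
```

An overconfident model (say, 90% stated confidence but 50% accuracy) would score around 0.4 on the same metric.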
Another important related literature is work on situational awareness. This is not always as “internal” as introspection work, but it asks, in general, how and whether models can reason about their own position in the world.
- Laine et al. (2024), “The Situational Awareness Dataset (SAD)”. Assesses whether models can answer questions and do tasks that depend on facts about themselves, their outputs, and their deployment context.
Next up, there’s a cluster of papers looking at whether models have some kind of privileged access to information about their own behavioral tendencies — information that wasn’t explicitly present in their training data, and that they can leverage better than other models do.
- Binder et al. (2024), “Looking Inward: Language Models Can Learn About Themselves by Introspection.” (I’m a middle co-author.) Fine-tunes models to predict their own behavior in hypothetical scenarios, in a way that requires them not just to simulate what they’d do, but also to extract some property of what they’d do (footnote 2).
- Song, Hu & Mahowald (2025), “Language Models Fail to Introspect About Their Knowledge of Language.” Argue that Binder et al.’s measure is not stringent enough and overstates model capabilities.
- Betley, Balesni, Hariharan, Meinke & Evans (2025), “Tell Me About Yourself: LLMs Are Aware of Their Learned Behaviors.” Fine-tunes models on implicit behavioral policies — like making risk-seeking decisions or steering users toward saying certain words — and then asks models to describe those policies without any in-context examples of their behavior. Models can do it! They can also recognize backdoor-like behaviors.
- Plunkett, Morris, Reddy & Morales (2025), “Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions.” Extends Betley et al. by fine-tuning models on novel quantitative preferences — specific attribute weights for decisions — and then asking models to report those weights. Models can do this, and training on self-report accuracy generalizes to other decision contexts.
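As a toy illustration of the self-prediction setup this cluster uses, the evaluation logic is roughly: elicit the model’s actual answer, compute some property of it, and separately ask the model to predict that property of its own hypothetical answer. This is a drastic simplification of the Binder et al. design; the model, prompts, and property below are all made up.

```python
# Toy self-prediction harness (illustrative only; everything here is a
# made-up stand-in for a real model API and real eval prompts).

def second_char(text: str) -> str:
    """The target property: the second character of the model's answer."""
    return text[1] if len(text) > 1 else ""

def self_prediction_accuracy(model, prompts, property_fn):
    """Fraction of prompts where the model correctly predicts a property
    of its own answer."""
    hits = 0
    for prompt in prompts:
        actual = property_fn(model(prompt))             # what it really does
        predicted = model("PREDICT:" + prompt).strip()  # what it says it'd do
        hits += predicted == actual
    return hits / len(prompts)

def toy_model(prompt: str) -> str:
    """Stand-in for an LLM. On self-prediction questions it simulates
    itself, i.e. a perfectly 'introspective' toy."""
    if prompt.startswith("PREDICT:"):
        return second_char(toy_model(prompt[len("PREDICT:"):]))
    return "paris" if "capital" in prompt else "unknown"

acc = self_prediction_accuracy(
    toy_model, ["capital of France?", "favorite color?"], second_char
)
print(acc)  # the toy introspects perfectly, so accuracy is 1.0
```

The interesting empirical question, of course, is how far a real model’s accuracy on this kind of task beats both chance and other models predicting it from the outside.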
Next we have some cool interpretability papers looking into this.
- Lindsey (2025), “Emergent Introspective Awareness in Large Language Models.” Lindsey injects concept vectors (activation patterns associated with “bread”, say) into model activations and tests whether models can notice. He also tests whether models can exert intentional control over their internal states. Models can do both, though not reliably; that said, they seem to be improving with scale. I wrote about this paper at length when it came out, so I’ll also point readers there.
- Lindsey et al. (2025), “On the Biology of a Large Language Model.” Not primarily about introspection, but some sections are relevant: for example, it suggests that the mechanisms a model uses to assess whether it is familiar with something (a “do I know this entity?” circuit) are separate from the mechanisms it uses to actually retrieve information about that entity. It also shows that models are not aware of the (rather convoluted) way they perform some mathematical tasks.
- Ferrando et al. (2024), “Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models”. Related mechanistic work identifying the internal circuits that distinguish entities the model knows about from those it doesn’t, in a way that can drive model refusals and other behaviors.
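To make the concept-injection idea concrete, here is a minimal numeric sketch (my own illustration, not Lindsey’s code): add a scaled “concept direction” to an activation vector and observe that the activation’s projection onto that direction jumps. In the real experiments the detector is the model’s own self-report; here it’s just a dot product.

```python
# Toy sketch of concept injection into activations (illustrative only;
# the vectors and the "bread" direction are made up).

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def inject(activations, concept, strength=4.0):
    """Steer activations toward a concept direction."""
    return [a + strength * c for a, c in zip(activations, concept)]

baseline = [0.1, -0.2, 0.05, 0.0]
bread_vec = [0.0, 1.0, 0.0, 0.0]   # made-up direction for "bread"

before = dot(baseline, bread_vec)
after = dot(inject(baseline, bread_vec), bread_vec)
print(before, after)  # the projection jumps from -0.2 to about 3.8
```

The hard part the paper tackles is not this arithmetic, but whether the model itself can report “something about bread was just injected” without being told.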
Lastly, some overview papers.
- Stanford Encyclopedia of Philosophy, Introspection. The ever-helpful SEP! This entry is by Eric Schwitzgebel, who literally wrote the book on introspection.
- Eleos AI Research (2025), “Why Model Self-Reports Are Insufficient—and Why We Studied Them Anyway”. Argues that there are three central problems with model self-reports of consciousness and other states (that is, when Claude says things like “there’s something it feels like for me to process information”):
1. We lack strong, independent evidence that LLMs have welfare-relevant states in the first place (although we take that possibility seriously), let alone human-like ones.
2. Even if models do have such states, there’s no obvious introspective mechanism by which they could reliably report them.
3. Even if models can introspect, we can’t be confident that model self-reports are actually produced by introspection.
Still, we argue that welfare interviews are worth doing: they can raise red flags, they scale with capability, and they can help identify areas for improvement. We don’t want to take self-reports at face value, but to treat them as one input among many.
- Kammerer & Frankish (2023), “What forms could introspective systems take? A research programme”. Propose a minimal definition of introspection as “a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control.” See Lindsey (2025), Song, Lederman, Hu, and Mahowald (2025), and Comșa & Shanahan (2025) for other definitions. There’s no consensus yet on what we should mean by “AI introspection”; like most interesting concepts, there’s probably no single best way to pick it out.
- Perez & Long (2023), “Towards Evaluating AI Systems for Moral Status Using Self-Reports.” We express hope that AI self-reports might eventually be able to provide evidence about morally significant states, if we can make them reliable. And we propose to train models to answer verifiable questions about themselves (as Binder and others have since done), to see if introspection-like capabilities can generalize to harder questions where we don’t have ground truth. Will this work? Will larger models just solve this with more capabilities? Only time will tell!
A final note: we can’t consider introspection and self-reports without reckoning with the distinctive kinds of entities that language models are. Human concepts will not necessarily map neatly onto them. So I recommend thinking about this alongside a heavy dose of ideas about model identity and personas. Sam Marks, Jack Lindsey, and Chris Olah give a handy collection in a recent post: Andreas, 2022; janus, 2022; Hubinger et al., 2023; Shanahan et al., 2023; Byrnes, 2024; nostalgebraist, 2025.
[1] I’ve borrowed some organizing principles from the “related work” section of Lindsey (2025).