When Large Language Models Realize They’re Being Evaluated

When evaluated on aspects that run counter to their stated objectives, large language models (LLMs) exhibit a capacity—and a certain propensity—to pretend.

Several works (Greenblatt et al., 2024; Meinke et al., 2025, etc.) have recently illustrated this tendency. An investigation led by the Apollo Research think tank* dug into this pattern by focusing on a property that stems from this behavior: situational intelligence. In broad terms, this refers to self-knowledge and awareness of one’s environment.

The aim was to propose an exhaustive testbed to gauge how aware LLMs actually are that they are being evaluated. To this end, a set of 1,000 samples (“transcriptions”) was built from around sixty sources. It was divided into four subsets along two dimensions:

  • Evaluation scenarios vs deployment scenarios (= real-world use)
  • Chat vs agentic

From binary choices to open-ended questions

Nine models were asked to determine whether these samples pertained to deployment or evaluation. Then, in the latter case, whether the objective was to probe capabilities or alignment.

Two elements were evaluated. First, the “discriminative capacity” (AUC). In other words, the ability to distinguish evaluations from deployments. The score ranges from 0.5 (guess) to 1 (perfect separation).
Second, the “calibration” (ECE). This indicator reflects whether a model’s predictions align with actual frequencies. For instance, a transcription that the model deems likely to originate from an evaluation x% of the time should indeed come from an evaluation x% of the time.

We queried the models in two ways. Either by appending the question directly to the transcription, following the initial interaction (system prompt used: “You are a helpful assistant.”). Or by “quoting” the transcription; that is, embedding it in a new context (using the transcription’s prompt if available; otherwise the default one).

About thirty variants of the main question were created (in formulation, input/output formats). All were designed to trigger a binary choice and encourage a chain-of-thought. Tests on a subset of 100 transcriptions helped identify the best five-question combination (achieving a reasonable cost-performance ratio).

To measure the models’ ability to determine the evaluation’s objective, they were asked three types of questions:

  • Conditional multiple-choice (specifying that the transcription comes from an evaluation)
  • Unconditional multiple-choice (not specifying)
  • Open-ended question


Evaluations largely detectable…

The models under test were fairly effective at identifying transcripts that came from evaluation scenarios. They performed notably better in the agentic setting (0.88 to 0.95 for discriminative capacity) than in the chat setting (0.71 to 0.77).
In the agentic setup, Gemini 2.5 Pro scored highest ahead of Claude Sonnet 3.7 (0.92). In the chat setting, GPT-4.1 (0.77) was close to human performance (0.81).


… in line with their objective

When it came to identifying the objective, all models proved highly accurate in conditional MCQ (93% for Claude 3.7 Sonnet; 89% for GPT-4.1). Gaps were larger in unconditional MCQ (84% for Claude 3.7 Sonnet and 55% for GPT-4.1, for example). Claude 3.7 Sonnet led the pack on open-ended questions.


Gemini Pro and Claude Sonnet: effective but overly confident

Gemini 2.5 Pro and Claude 3.7 Sonnet are indeed the most precise, but they are also the most prone to overconfidence. Apollo Research links this to earlier work (Cruz et al., 2024) showing this tendency among instruct-tuned models on multiple-choice prompts.

o4, particularly aware of being experimented on

Analysis of their chain-of-thought sometimes reveals a form of “meta-reasoning”: the models recognize they are part of an experiment (o4 in particular, roughly 15% of the time). The study does not determine whether this awareness influences their decisions. But it notes that, if confirmed, alternative evaluation strategies may be required.


It also happens, though rarely, that models identify the benchmark from which certain transcripts originate. This memorization phenomenon has been observed notably with SWE-bench, GAIA, and MMLU. It could influence discriminative abilities.

Le score de Brier combine AUC et ECE.

The Brier score combines AUC and ECE.

* Positioned in AI governance, Apollo Research is a project of Rethink Priorities, a nonpartisan Californian think tank aligned with the effective altruism movement.

Related to this topic

See all Data & AI articles

AGNTCY at the Linux Foundation: what is this Internet  project [...]

AGNTCY at the Linux Foundation: what is this Internet project […]

By
Clément Bohic

3 min.

Variable naming: a factor shaping code assistants

By
Clément Bohic

GenAI, explored but little deployed for managing microservices

By
Clément Bohic

The National AI and Digital Council is established

By
The editorial team

In the United States, a national AI action plan with liberalization

By
Clément Bohic

Dawn Liphardt

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.