“You are a highly experienced engineer performing a code review. Your task is to determine whether the proposed changes follow the instructions.”
Thus begins one of the system prompts Spotify defined within its agentic coding architecture.
The company began exploring this topic in February 2025. Its Fleet Management system already automated a large portion of software maintenance. Working from code snippets, it could carry out large-scale transformations in a GKE environment and open PRs on the target repositories.
This mechanism facilitated operations such as upgrading dependencies in build files, updating configuration files, and performing straightforward refactoring (for instance, removing or replacing a method call). Half of the PRs pushed since mid-2024 had been driven by this approach.
A first focus on code migration
The agentic approach was first applied to the way transformation code is declared. The objective: define and execute changes in natural language, replacing deterministic migration scripts.
Rather than adopting an off-the-shelf agent, Spotify designed a CLI. The tool can delegate the execution of a prompt to various AI models, but it can also run formatting and linting tasks through MCP, evaluate a diff with an LLM as a judge, upload logs to GCP, and capture traces in MLflow.
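Spotify has not published this CLI, but the responsibilities listed above suggest a shape along the following lines. This is a minimal Python sketch under stated assumptions: the helper functions, flag names, and default model are hypothetical, not Spotify's implementation.

```python
# Hypothetical skeleton of a migration CLI: delegate a prompt to a model, run checks,
# judge the diff, and ship logs/traces. All function bodies are illustrative stubs.
import argparse


def run_model(model: str, prompt: str, repo: str) -> str:
    """Send the prompt to the chosen AI model and return the resulting diff (stub)."""
    return ""  # a real implementation would call the model's API here


def run_checks(repo: str) -> str:
    """Run formatting and linting, e.g. via tools exposed over MCP (stub)."""
    return "lint: ok"


def judge_diff(prompt: str, diff: str) -> bool:
    """Ask a second model, acting as a judge, whether the diff follows the prompt (stub)."""
    return bool(diff)


def export_telemetry(logs: str) -> None:
    """Upload logs (e.g. to GCP) and record a trace (e.g. in MLflow) (stub)."""
    print(logs)


def main() -> None:
    parser = argparse.ArgumentParser(description="Run a natural-language code migration")
    parser.add_argument("--model", default="claude")  # which backend model to delegate to
    parser.add_argument("--repo", required=True)      # path to the target repository
    parser.add_argument("prompt")                     # the change, described in natural language
    args = parser.parse_args()

    diff = run_model(args.model, args.prompt, args.repo)
    checks = run_checks(args.repo)
    accepted = judge_diff(args.prompt, diff)
    export_telemetry(f"{checks}\njudge: {'accepted' if accepted else 'rejected'}")


if __name__ == "__main__":
    main()
```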
By early November 2025, roughly 1,500 merged PRs had passed through this system. Spotify tackled operations such as:
- Modernizing the language (for example, replacing value types with records in Java)
- Upgrades without breaking changes (migrating data pipelines to the latest version of Scio)
- Migration between UI components (moving to Backstage's new frontend system)
- Configuration changes (updating parameters in JSON and YAML files while adhering to schemas and formats)
Spotify claims these migration tasks yielded time savings of 60 to 90 percent compared with hand-coding them. It projects further ROI gains as the approach is extended to other codebases.
Slack, Jira, and co. integrated into an agentic architecture
Complementing the migration effort, work shifted toward a more generalist system capable of performing ad hoc tasks. This led to a multiagent architecture that plans, generates, and revises PRs.
At the base level, there are agents tied to different applications (Slack, Jira, GitHub Enterprise…). Interaction with them, potentially augmented by context retrieved from MCP servers, yields a prompt. That prompt then goes to the coding agent, also exposed via MCP. Its actions are verified by another group of agents.
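The article describes this flow rather than its code. As a rough illustration, it might look like the toy pipeline below, in which every class and function is a placeholder rather than one of Spotify's components.

```python
# Toy illustration of the described flow: an application-facing agent builds a prompt
# (optionally enriched with MCP-retrieved context), the coding agent produces a change,
# and a separate group of agents verifies it. Everything here is a placeholder.
from dataclasses import dataclass, field


@dataclass
class TaskPrompt:
    text: str
    context: list[str] = field(default_factory=list)  # e.g. a Slack thread, a Jira issue


def application_agent(slack_thread: str, jira_key: str) -> TaskPrompt:
    """Turn an interaction in Slack or Jira into a prompt for the coding agent."""
    return TaskPrompt(
        text=f"Implement the change requested in {jira_key}",
        context=[slack_thread, jira_key],
    )


def coding_agent(prompt: TaskPrompt) -> str:
    """The coding agent (exposed over MCP in Spotify's setup) would return a diff (stub)."""
    return f"diff for: {prompt.text}"


def verification_agents(diff: str) -> bool:
    """A distinct set of agents checks the proposed change before a PR is opened (stub)."""
    return diff.startswith("diff")


if __name__ == "__main__":
    prompt = application_agent("slack-thread-url", "JIRA-123")
    verified = verification_agents(coding_agent(prompt))
    print("verified:", verified)
```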
Among other “satisfactory” uses, Spotify notes the extraction of architectural decisions from Slack threads and the possibility for product managers to propose simple changes without needing to clone repositories on their machines.
From open-source agents to Claude Code
Initial experiments used open-source agents such as Goose and Aider. When applied to migrations, they did not produce reliable PRs. Spotify therefore built its own agentic loop layered atop LLM APIs. The principle: the user provides a prompt and a list of files the agent edits, continuously incorporating feedback from the build system at each step. The task finishes when tests pass or when certain limits are exceeded (10 turns per session; 3 retries).
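As a rough illustration of that loop (not Spotify's code), here is a sketch that assumes a hypothetical ask_model call and a Gradle build command purely for the example:

```python
# Sketch of the described loop: prompt and file list in, build feedback folded back in on
# each turn, stop when tests pass or the limits (10 turns per session, 3 retries) are hit.
import subprocess

MAX_TURNS = 10    # per-session turn limit cited in the article
MAX_RETRIES = 3   # retry limit cited in the article


def ask_model(prompt: str, files: list[str], feedback: str) -> None:
    """Placeholder for the LLM API call that edits the listed files in place."""
    raise NotImplementedError("wire up an LLM API of your choice")


def build_and_test(repo: str) -> subprocess.CompletedProcess:
    """Run the repo's build and tests; the Gradle command is an assumption for the example."""
    return subprocess.run(["./gradlew", "test"], cwd=repo, capture_output=True, text=True)


def session(prompt: str, repo: str, files: list[str]) -> bool:
    feedback = ""
    for _ in range(MAX_TURNS):
        ask_model(prompt, files, feedback)          # model proposes edits to the listed files
        result = build_and_test(repo)
        if result.returncode == 0:                  # tests pass: task complete
            return True
        feedback = result.stdout + result.stderr    # otherwise, feed build output back in
    return False                                    # turn limit exceeded


def run(prompt: str, repo: str, files: list[str]) -> bool:
    return any(session(prompt, repo, files) for _ in range(MAX_RETRIES))
```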
This approach worked for “small” changes: editing a line of code, tweaking a manifest, replacing a flag… But the agent remained difficult to use. Loading files into the context window relied on a git-grep command; depending on the search pattern, the window could become saturated or fail to provide enough context. The agent also struggled when it had to edit multiple files. The loop often hit the turn limit. And when the context window filled up, the agent would forget the task.
In this context, Spotify migrated to Claude Code. It enabled “more natural prompts” while offering its native to-do lists and the ability to spawn subagents. It now covers the majority of PRs that are merged into production.
Know when to forbid… and don't do everything at once
The initial agent performed best with strict, step-by-step prompts. Claude Code handles prompts that describe the desired end state and allow leeway in the path to get there.
Spotify notes that it can be useful to tell the agent clearly when it should not act. This avoids attempting tasks that cannot be completed, especially when prompts are reused across repositories that do not all use the same language versions.
Providing code examples also strongly influences results. Ideally, the desired end state is defined as tests, since the agent needs a verifiable objective to iterate on. It is also wise to request only one change at a time to avoid exhausting the context window. And do not hesitate to ask the agent for a post-session debrief.
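Put together, those guidelines point toward prompts of roughly this shape (a hypothetical example, not one of Spotify's prompts):

```
Replace the deprecated logging helper with the structured logger in src/main/java.
Make only this change; do not modify tests or unrelated files.
If the repository does not use Java 17 or later, stop and change nothing.
The change is complete when `mvn verify` passes.
When you finish, summarize what you changed and anything you could not do.
```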
A limited opening via MCP
Spotify favored long static prompts, where models’ reasoning tends to be simpler.
An alternative approach consists of starting with a shorter prompt but granting the agent access to MCP tools. The context it can retrieve theoretically enables it to handle more complex tasks. But it also makes the agent's behavior less verifiable and less predictable.
For now, Spotify allows its agent to access a verifier (formatting, linting, tests), a subset of Git subcommands (no push or origin changes, for example), and a set of Bash commands (such as ripgrep).
Encoding how to invoke the build systems inside MCP was deemed simpler than relying on AGENTS.md files. The reason: build configurations vary widely across the thousands of repos the agent works with. The MCP layer also helps reduce noise by summarizing tool outputs before sending them to the agent.
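Spotify has not shown these MCP tools. As a sketch, a build verifier exposed over MCP could look like the following, written with the Python MCP SDK's FastMCP helper; the build-system detection and the summarization logic are illustrative assumptions, not Spotify's code.

```python
# Sketch of a build verifier exposed as an MCP tool: the tool encodes how to invoke the
# build system (instead of an AGENTS.md per repo) and returns a summary rather than raw
# logs. Detection and summarization here are assumptions for illustration.
import subprocess
from pathlib import Path

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("verifiers")


@mcp.tool()
def verify(repo: str) -> str:
    """Run the repo's build and tests and return a short summary of the result."""
    root = Path(repo)
    if (root / "pom.xml").exists():                    # Maven project detected at the root
        cmd = ["mvn", "--batch-mode", "verify"]
    elif (root / "build.gradle").exists() or (root / "build.gradle.kts").exists():
        cmd = ["./gradlew", "check"]
    else:
        return "No supported build system found; nothing to verify."

    result = subprocess.run(cmd, cwd=repo, capture_output=True, text=True)
    if result.returncode == 0:
        return "Build and tests passed."
    # Summarize instead of dumping full logs into the agent's context window.
    tail = "\n".join(result.stdout.splitlines()[-20:])
    return f"Build failed (exit {result.returncode}). Last lines:\n{tail}"


if __name__ == "__main__":
    mcp.run()  # serve the tool over stdio for the agent to call
```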
A deterministic verification loop…
Occasionally the system fails to generate a PR. Sometimes it produces one that fails CI or is functionally incorrect. At times the problem stems from insufficient test coverage on the target component. In other cases, the agent goes beyond the instructions or simply doesn't know how to run builds and tests properly.
That is when verification loops step in to steer the agent toward the desired result. The agent does not need to understand how they work; it only needs to know that it can call them.
The loop comprises several independent verifiers, exposed—via MCP—according to the software component. For example, the Maven verifier activates only when a pom.xml file sits at the root of the codebase.
Together, these verifiers filter out much of the noise that would otherwise fill the context window. The agent does not need to grasp the specifics of invoking different build systems or parsing test results.
Whether triggered during execution or not, the relevant verifiers activate before any PR is opened. With Claude Code, this happens through the stop hook.
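The stop hook itself is configured in Claude Code's settings files; the command it runs could be as simple as the sketch below, which assumes the documented convention that a hook exiting with code 2 blocks the action and feeds its stderr back to the model. The "run-verifiers" command is a placeholder, not a real Spotify tool.

```python
#!/usr/bin/env python3
# Sketch of a script a Stop hook could invoke: run the verifiers and, if they fail,
# exit with code 2 so the agent keeps working on the problem instead of stopping.
import subprocess
import sys

result = subprocess.run(["run-verifiers"], capture_output=True, text=True)
if result.returncode != 0:
    # stderr is fed back to the model when a hook blocks with exit code 2
    print(f"Verification failed, keep going:\n{result.stdout[-2000:]}", file=sys.stderr)
    sys.exit(2)
sys.exit(0)  # verifiers passed; allow the session to stop and the PR to be opened
```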
… and an LLM as a judge
Beyond these deterministic checkers, Spotify added an LLM-as-a-judge layer to address the agent's tendency to drift from the instructions.
The LLM judge evaluates the diff of the proposed change against the original prompt. It runs after the other verifiers. Internal metrics indicate it rejects about a quarter of sessions. For about half of the remaining sessions, the agent ends up correcting itself.
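For illustration, a minimal version of such a judge could look like the sketch below, reusing the system prompt quoted at the top of the article; the model name, the APPROVE/REJECT convention, and the use of the Anthropic SDK are assumptions, not details Spotify has published.

```python
# Minimal LLM-as-a-judge sketch: evaluate the diff of the proposed change against the
# original prompt and return an accept/reject decision. Illustrative only.
import anthropic

SYSTEM_PROMPT = (
    "You are a highly experienced engineer performing a code review. "
    "Your task is to determine whether the proposed changes follow the instructions."
)


def judge(original_prompt: str, diff: str) -> bool:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=512,
        system=SYSTEM_PROMPT,
        messages=[{
            "role": "user",
            "content": (
                f"Instructions given to the coding agent:\n{original_prompt}\n\n"
                f"Proposed diff:\n{diff}\n\n"
                "Reply with APPROVE or REJECT on the first line, then a short justification."
            ),
        }],
    )
    verdict = response.content[0].text.strip().splitlines()[0].upper()
    return verdict.startswith("APPROVE")
```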
Being specialized (it does not push code, write prompts, or interact with users), the agent is also more predictable. And potentially more secure.
In early December, Spotify stated its intent to extend the verification infrastructure to more platforms (beyond Linux x86). Many of its systems have specific needs, among them its iOS applications, which require macOS hosts for the verifiers to run properly. The company also operates Arm backends. It plans to deepen the integration of its agent with its continuous deployment system, enabling it to act on CI checks in PRs. And it aims to develop more structured evaluations that encourage exploring new agent architectures.