Mistral AI Refines the DeepSeek Algorithm While Challenging Its Core Principles

Without prior distillation, reinforcement learning (RL) appears to benefit only large foundational models.

DeepSeek initially reached this conclusion while developing the reasoning models it released earlier this year: without the aid of distillation, RL seemed to enhance only very large-scale models. The experiments Mistral AI conducted while building its own models, however, offer a different perspective and challenge that assumption, particularly where mathematical capabilities are concerned.

The findings focus primarily on mathematics and coding tasks, which are central to the development of the Magistral Small and Magistral Medium models.

Distillation or not? Mistral AI’s approach to reinforcement learning

Magistral Medium was trained solely through reinforcement learning. Compared to the base model, Mistral Medium 3 Instruct, this fine-tuning yielded an improvement of nearly 50% on mathematical problem-solving benchmarks such as AIME ’24 and of roughly 30% on coding evaluations such as LiveCodeBench v5, according to Mistral AI.

Meanwhile, Magistral Small also underwent reinforcement learning, but its training began with a distillation step, a “cold start”: supervised fine-tuning (SFT) on traces from Magistral Medium, supplemented with responses generated by that same model on prompts sourced from OpenThoughts and OpenR1. Instruction-following datasets were incorporated during this phase to maintain reasoning capabilities.
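As an illustration, the cold-start mixture could be assembled along the following lines; the record fields, the instruction ratio, and the helper names are assumptions for this sketch, not details published by Mistral AI.

```python
import random

# Hypothetical records: each trace pairs a prompt with the reasoning produced
# by the larger model (Magistral Medium in the article).
reasoning_traces = [
    {"prompt": "Prove that the sum of two even numbers is even.",
     "response": "<think>Let a = 2m and b = 2n ...</think> a + b = 2(m + n)."},
]
instruction_data = [
    {"prompt": "Summarize this paragraph in one sentence.", "response": "..."},
]

def build_cold_start_mixture(traces, instructions, instruction_ratio=0.2, seed=0):
    """Assemble the SFT ("cold start") dataset: mostly distilled reasoning
    traces, plus a slice of instruction-following data as the article
    mentions. The ratio here is an illustrative assumption."""
    rng = random.Random(seed)
    n_instr = min(int(len(traces) * instruction_ratio), len(instructions))
    mixture = list(traces) + rng.sample(instructions, n_instr)
    rng.shuffle(mixture)
    return mixture

sft_dataset = build_cold_start_mixture(reasoning_traces, instruction_data)
print(f"{len(sft_dataset)} SFT examples")
```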

A refined RL pipeline

For their reinforcement learning, Mistral AI applied the same core algorithm as DeepSeek, Group Relative Policy Optimization (GRPO), but with several modifications. These included removing the KL divergence penalty and normalizing the loss to prevent a length bias in the generated responses.
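For concreteness, here is a minimal PyTorch sketch of a GRPO-style objective with the two modifications mentioned above, no KL penalty and a loss normalized over the group’s total token count; the tensor shapes, the clipping threshold, and the exact normalization are assumptions for illustration rather than Mistral AI’s implementation.

```python
import torch

def grpo_loss(logprobs, old_logprobs, rewards, mask, eps=0.2):
    """Sketch of a GRPO-style objective: group-relative advantages,
    PPO-style clipping, no KL penalty, and a loss averaged over the total
    number of generated tokens in the group (length-bias fix).

    logprobs, old_logprobs: (G, T) per-token log-probs for G sampled completions
    rewards:                (G,)  scalar reward per completion
    mask:                   (G, T) 1 for generated tokens, 0 for padding
    """
    # Group-relative advantage: compare each completion to its siblings.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    ratio = torch.exp(logprobs - old_logprobs)                  # importance ratio
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    per_token = -torch.min(ratio * adv, clipped * adv) * mask

    # Normalize by the group's total token count rather than per sequence,
    # so the objective does not favor shorter generations. No KL term is added.
    return per_token.sum() / mask.sum().clamp(min=1)

# Toy usage with random tensors (shapes only, not a real rollout).
G, T = 4, 16
lp, olp = torch.randn(G, T), torch.randn(G, T)
r, m = torch.randn(G), torch.ones(G, T)
print(grpo_loss(lp, olp, r, m))
```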

Model outputs were evaluated across four key dimensions:
– Formatting (for example, code responses had to include at least one Markdown block)
– Accuracy
– Response length
– Linguistic coherence
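A toy reward function along these four axes might look like the sketch below; the weights, thresholds, and heuristic checks are assumptions chosen for readability, not Mistral AI’s published reward design.

```python
import re

MD_FENCE = "`" * 3  # a literal Markdown code fence, built indirectly


def compute_reward(prompt_lang, response, reference_answer, max_len=16384):
    """Illustrative composite reward covering the four dimensions listed above."""
    reward = 0.0

    # 1. Formatting: e.g. code answers must contain at least one Markdown block.
    if MD_FENCE in response:
        reward += 0.1

    # 2. Accuracy: naive exact match on the last line, standing in for a real
    #    verifier that would parse the final answer or run test cases.
    last_line = response.strip().splitlines()[-1] if response.strip() else ""
    if reference_answer in last_line:
        reward += 1.0

    # 3. Response length: penalize outputs that exceed the generation budget.
    if len(response) > max_len:
        reward -= 0.2

    # 4. Linguistic coherence: the reasoning should stay in the user's language
    #    (crude keyword heuristic; a real system would use language detection).
    if prompt_lang == "fr" and not re.search(r"\b(le|la|les|des|une?)\b", response.lower()):
        reward -= 0.1

    return reward


answer = MD_FENCE + "python\nprint(6 * 7)\n" + MD_FENCE + "\nThe result is 42"
print(compute_reward("en", answer, "42"))  # 1.1
```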

Regarding coherence, the goal was for the models to reason consistently in the user’s language. Preliminary experiments indicated that without specific constraints, models sometimes mixed languages in their output. To address this, 10% of English prompts were translated into French, German, Chinese, Spanish, Italian, and Russian during training to ensure language consistency.
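A minimal sketch of that augmentation step follows; the translate helper is a hypothetical placeholder standing in for an actual translation model.

```python
import random

TARGET_LANGUAGES = ["fr", "de", "zh", "es", "it", "ru"]

def translate(prompt, lang):
    """Placeholder: a real pipeline would call a translation model here.
    The tag only keeps this sketch self-contained and runnable."""
    return f"[{lang}] {prompt}"

def add_multilingual_prompts(english_prompts, fraction=0.10, seed=0):
    """Translate a fraction (10% in the article) of the English prompts into
    the six listed languages so the model learns to reason in the user's language."""
    rng = random.Random(seed)
    chosen = rng.sample(english_prompts, int(len(english_prompts) * fraction))
    translated = [translate(p, rng.choice(TARGET_LANGUAGES)) for p in chosen]
    return english_prompts + translated

prompts = [f"Solve problem {i}" for i in range(100)]
print(len(add_multilingual_prompts(prompts)))  # 110
```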

Asynchronous system and greedy algorithm

The RL architecture features three types of “workers”:
– Trainers, which hold the main copy of the weights and perform the gradient updates
– Generators, which produce completions along with their log-probabilities
– Verifiers, which evaluate outputs and assign rewards

To reduce latency, Mistral AI designed the system so that generators do not wait for each other or for trainers; responses are immediately sent to the appropriate verifier after generation. This approach avoids batch processing delays, ensuring rapid feedback.
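This data flow can be mimicked with in-process queues, as in the simplified sketch below; the real system is distributed across many machines, so the worker functions and the dummy reward are purely illustrative.

```python
import queue
import threading
import time

# Minimal sketch of the three worker roles described above, using in-process
# queues; each completion is forwarded to verification as soon as it exists.
to_verify = queue.Queue()
to_train = queue.Queue()

def generator(worker_id):
    """Produce completions and hand each one to a verifier immediately,
    without waiting for the rest of a batch or for other generators."""
    for i in range(3):
        time.sleep(0.01 * worker_id)              # uneven generation speeds
        to_verify.put((worker_id, f"response-{i}"))

def verifier():
    """Score responses as they arrive and forward rewards to the trainer."""
    while True:
        item = to_verify.get()
        if item is None:
            break
        worker_id, response = item
        to_train.put((worker_id, response, 1.0))  # dummy reward

v = threading.Thread(target=verifier)
v.start()
gens = [threading.Thread(target=generator, args=(w,)) for w in range(2)]
for g in gens:
    g.start()
for g in gens:
    g.join()
to_verify.put(None)                               # signal verifier shutdown
v.join()

while not to_train.empty():
    print(to_train.get())                         # trainer would update weights here
```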

Additionally, they implemented a greedy sorting algorithm, one that makes locally optimal choices in the hope of reaching a good global outcome. This algorithm benefits from a micro-batch architecture in which the order of data samples does not affect the overall process.
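One plausible reading of such a greedy scheme is longest-first packing of variable-length sequences into the currently lightest micro-batch, as sketched below; the exact balancing criterion is an assumption, not a detail given in the article.

```python
def greedy_microbatches(seq_lengths, num_microbatches):
    """Greedy packing sketch: assign each sequence (longest first) to the
    micro-batch that currently holds the fewest tokens. This works because,
    as noted above, the order of samples inside a step does not matter."""
    bins = [{"total": 0, "items": []} for _ in range(num_microbatches)]
    for length in sorted(seq_lengths, reverse=True):
        lightest = min(bins, key=lambda b: b["total"])   # locally optimal choice
        lightest["items"].append(length)
        lightest["total"] += length
    return bins

lengths = [512, 4096, 128, 2048, 1024, 256, 8192, 64]
for i, b in enumerate(greedy_microbatches(lengths, 3)):
    print(f"micro-batch {i}: {b['total']} tokens -> {b['items']}")
```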

The mathematical data used for RL training came from a dataset of around 700,000 examples. Several filtering stages were applied, including one performed with Mistral Large 2 that aimed to eliminate problems that were either too simple or too difficult. After this refinement, only 38,000 examples remained.
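A difficulty filter of this kind could, for example, sample several solutions per problem and keep only those solved some but not all of the time; in the sketch below the attempt count and pass-rate thresholds are assumptions, and only the use of Mistral Large 2 as the grading model comes from the article.

```python
import random

def filter_problems(problems, solve_fn, attempts=8):
    """Keep problems that are neither always solved (too easy) nor never
    solved (too hard or unverifiable), based on the grader's pass rate."""
    kept = []
    for problem in problems:
        successes = sum(solve_fn(problem) for _ in range(attempts))
        pass_rate = successes / attempts
        if 0.0 < pass_rate < 1.0:
            kept.append(problem)
    return kept

# Toy usage with a random stand-in for the grading model.
problems = [{"id": i, "difficulty": random.random()} for i in range(1_000)]
solve = lambda p: random.random() > p["difficulty"]
print(len(filter_problems(problems, solve, attempts=4)))
```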

Side benefits: generalization and multimodality in the Magistral models

As DeepSeek had noted, prompt diversity had a stronger influence on supervised fine-tuning (SFT) than on RL, which used a limited, carefully curated set of training data.

On multilingual benchmarks, the degradation in response quality with these models is comparable to that seen with their base models, indicating similar levels of robustness.

Interestingly, reinforcement learning focused on mathematical tasks also benefits code understanding, and vice versa.

Moreover, the checkpoints obtained through RL retain multimodal encoders, including vision components. Although trained solely on text, these models develop reasoning skills across multiple modalities and can handle all types of questions, according to Mistral AI. The use of external tools further enhances their capabilities.

* Note: Using RL alone, Mistral Small 3 achieves performance on AIME ’24 comparable to that of the distilled version, and even surpasses it on math and GPQA benchmarks. It does not perform as well on coding exercises, however. Combining distillation with RL generally yields a further gain of a few points.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.