After the Datasets, Open-R1 Aims to Reproduce the Pipeline of …

When preparing a mixed dataset for fine-tuning, it is possible to take advantage of an additive property.

The technical report for the Phi-4 model (from Microsoft) includes a remark on this topic.

The property in question allows optimizing the data mixture domain by domain, then concatenating the resulting weights without loss of performance. Open-R1 has leveraged it. The project, led by Hugging Face, began in January 2025. Its goal: to create an open reproduction of DeepSeek-R1 by developing the “missing pieces” — specifically, datasets and training code.
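The idea can be sketched in a few lines (a hypothetical illustration, not Phi-4's actual tooling): weights tuned inside each domain carry over unchanged when the domains are concatenated, scaled only by the overall domain shares.

```python
# Sketch of the additive data-mixing property: per-domain weights
# tuned independently remain valid after the domains are merged.

def tune_domain_weights(sources):
    """Hypothetical stand-in for a per-domain ablation: here we
    simply weight each source by an assumed quality score."""
    total = sum(sources.values())
    return {name: q / total for name, q in sources.items()}

# Assumed per-source quality scores for two domains.
math_weights = tune_domain_weights({"numina": 3.0, "synthetic": 1.0})
code_weights = tune_domain_weights({"codeforces": 2.0, "github": 2.0})

# Concatenation: the per-domain weights are reused as-is,
# scaled only by the overall share given to each domain.
domain_share = {"math": 0.5, "code": 0.5}
final_mix = {}
for domain, weights in [("math", math_weights), ("code", code_weights)]:
    for name, w in weights.items():
        final_mix[name] = w * domain_share[domain]

print(final_mix)  # the weights sum to 1.0
```

Because each domain is optimized in isolation, adding a new domain does not require re-running the ablations for the others.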

The plan unfolds in three stages:

  • Be able to distill a high-quality reasoning dataset from DeepSeek-R1
  • Replicate the reinforcement learning pipeline of R1-Zero
  • Apply this combination to base models to turn them into reasoning models

Mathematics first

Open-R1 initially focused on a math reasoning dataset: OpenR1-Math-220k. Released under the Apache 2.0 license, it covers 400,000 problems (2 to 4 traces each) drawn from NuminaMath-1.5; after filtering, 220,000 remain. It is split into two parts. One, called “default,” groups 94,000 problems and yields the best performance. The other, called “extended,” gathers 131,000 problems… and does not produce results as good, probably because its questions are simpler.
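The filtering step can be pictured with a minimal sketch (field names and verification logic are hypothetical, not the dataset's actual schema): a problem survives only if at least one of its traces is verified.

```python
# Toy sketch of trace filtering: drop problems with no verified trace
# (field names are hypothetical, not the OpenR1-Math-220k schema).
problems = [
    {"id": 1, "traces": [{"ok": True}, {"ok": False}]},
    {"id": 2, "traces": [{"ok": False}, {"ok": False}]},
    {"id": 3, "traces": [{"ok": True}, {"ok": True}, {"ok": True}]},
]

filtered = []
for p in problems:
    good = [t for t in p["traces"] if t["ok"]]
    if good:  # keep the problem only if a correct trace remains
        filtered.append({"id": p["id"], "traces": good})

print(len(filtered))  # 2 of 3 problems survive
```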


By training Qwen2.5-Math-7B-Instruct for three epochs on the “default” portion, Hugging Face claims to have matched the performance of DeepSeek-R1-Distill-Qwen-7B. Specifically, it achieved the same score on AIME 25 (40) and performed slightly worse on MATH-500 (90.6 vs. 91.6).

The code next

The work then extended to coding, with the production of a dataset based on CodeForces competitions. It comprises about 10,000 problems (with up to 5 traces each), of which 60% are accompanied by the organizers’ explanation of the correct solution.

On this basis, R1 was asked to produce chains of thought (about 100,000 examples), resulting in the CodeForces-CoTs dataset. Published under the ODC-BY license, it was used to fine-tune Qwen2.5-Coder-Instruct 7B and 32B. From these came the OlympicCoder models. Tested on the latest International Olympiad in Informatics, they rivaled state-of-the-art LLMs (the 32B even outperformed R1).

Science to conclude

A portion of CodeForces-CoTs (83,000 Python and C++ traces) and of OpenR1-Math-220k (the “default” part) was finally combined with a subset of the Llama Nemotron post-training dataset to form Mixture-of-Thoughts. Science was thus added to code and math, for a total of about 350,000 traces. No license has been attached to it (there is a standing request for one).
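As a rough sketch of the assembly (structure and field names hypothetical, with toy counts in place of the real ~350,000 traces): the three domain subsets are simply concatenated into one mixture.

```python
# Toy sketch: concatenate three domain subsets into one mixture
# (field names hypothetical; real subsets hold tens of thousands
# of traces each).
code = [{"domain": "code", "trace": f"c{i}"} for i in range(3)]
math = [{"domain": "math", "trace": f"m{i}"} for i in range(2)]
science = [{"domain": "science", "trace": f"s{i}"} for i in range(1)]

mixture = code + math + science

# Tally the composition of the combined dataset.
counts = {}
for example in mixture:
    counts[example["domain"]] = counts.get(example["domain"], 0) + 1

print(counts)  # {'code': 3, 'math': 2, 'science': 1}
```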

Applied to a variant of Qwen2.5-Math-7B (RoPE base frequency extended to 300k to enable training over a 32k context window), this base produced OpenR1-Distill-7B. The model outperformed R1-Distill-Qwen-7B on AIME 2024 (52.7 vs. 51.3), GPQA Diamond (52.8 vs. 52.4) and LiveCodeBench v5 (39.4 vs. 37.4). These scores are reported as pass@1 (one attempt, with 4 to 64 responses per query depending on the task), at temperature 0.6 and top_p 0.95.
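The reported metric can be sketched as follows (a minimal interpretation of pass@1 averaged over multiple samples, not the project's actual evaluation harness): for each query, the success rate over the sampled responses is computed, then averaged across queries.

```python
# Sketch of pass@1 as described here: sample n responses per query
# (n ranges from 4 to 64 depending on the task), average the
# per-query success rate, then average over queries.

def pass_at_1(results):
    """results: one list of booleans per query (one per sample)."""
    per_query = [sum(r) / len(r) for r in results]
    return sum(per_query) / len(per_query)

# Toy example: 2 queries, 4 samples each.
score = pass_at_1([
    [True, True, False, False],   # query 1: 2/4 correct
    [True, True, True, False],    # query 2: 3/4 correct
])
print(score)  # 0.625
```

Averaging over several samples per query reduces the variance of the estimate compared with a single greedy attempt.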

Dawn Liphardt
