Beyond the language model, there is the architecture itself.
This aspect of Lingua Custodia’s work under the Large AI Grand Challenge is worth highlighting.
This competition was part of the European AI-BOOST project, which aims to organize six more challenges by 2027 to foster open scientific innovation in AI. The European Union has allocated 4 million euros for this initiative.
3.2 million GPU hours on two EuroHPC supercomputers
The Large AI Grand Challenge was launched in November 2023. The contract, in broad terms, was to develop, from scratch, a foundation LLM with at least 30 billion parameters that is “more capable than state-of-the-art systems on a number of tasks.” Each winner would receive €250,000 and 2 million GPU hours on a EuroHPC supercomputer (LUMI, based in Finland, or LEONARDO, located in Italy).
There were four laureates (out of 94 proposals), announced in June 2024. They were TextGain (Belgium), Tilde (Latvia), Unbabel (Portugal)… and Lingua Custodia. The Paris-area SME—small business under the French commercial code—opted for LEONARDO. At the end of 2024, it secured an additional allocation of 1.2 million hours on another EuroHPC supercomputer: JUPITER, located in Germany.
New architecture… and a new brand
Strictly speaking, the first model from this work does not fulfill the contract: it “only” has 3.6 billion parameters. Moreover, it is a so‑called “base” model, i.e., not fine-tuned for dialogue or instruction-following. Consequently, it is not production-ready as is. It should nevertheless be seen as a demonstration of the real value added: an alternative to the Transformer architecture. Its name is Dragon. With it, Lingua Custodia shifts gears. Or at least opens a new chapter. Until now, the company had been primarily known for document-processing services (classification, extraction, translation, summarization…), offered both as SaaS and via API to the financial sector.
This strategic shift comes with a rebranding: exit Lingua Custodia, enter Dragon LLM.
Exceeding the limits of Transformers and Mamba at inference
The Dragon architecture combines a variety of existing techniques to push past, in particular, the limits of Transformers’ self-attention mechanism at inference. Specifically, in the base architecture, resource use grows with sequence length: for each new token, the model attends to all preceding tokens. Compute is part of the cost, but memory becomes the main bottleneck, chiefly because of bandwidth constraints on the ever-growing cache of past keys and values.
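To make the bottleneck concrete, here is a toy NumPy sketch (not Dragon LLM's code) of autoregressive decoding with standard attention: the per-token work and the key/value cache both grow with position.

```python
import numpy as np

def decode_step(q, K_cache, V_cache):
    # One decoding step: the new token's query attends to EVERY cached
    # key/value, so per-token compute grows with the current position.
    scores = K_cache @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V_cache

d = 8
rng = np.random.default_rng(0)
K_cache = np.zeros((0, d))
V_cache = np.zeros((0, d))
for t in range(1, 6):
    k, v, q = rng.normal(size=(3, d))
    K_cache = np.vstack([K_cache, k])  # the cache grows linearly with
    V_cache = np.vstack([V_cache, v])  # context: the bandwidth bottleneck
    out = decode_step(q, K_cache, V_cache)
```

After five steps, the cache holds five (key, value) pairs; at a 100k-token context it would hold 100k, all of which must be streamed from memory for every new token.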
In response, linear variants of the attention mechanism emerged. They avoid the quadratic growth of compute with sequence length and operate within a fixed memory budget. The trick is a hidden state: a matrix that keeps not every token, but a kind of “evolving summary” of them.
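A minimal sketch of the idea, with an illustrative ReLU feature map (the feature map, names, and dimensions are assumptions for the example, not Dragon LLM's implementation):

```python
import numpy as np

def linear_attention(queries, keys, values):
    """Linear attention: a fixed-size state instead of a growing KV cache."""
    d = queries.shape[1]
    S = np.zeros((d, d))  # the "evolving summary" matrix: size never grows
    z = np.zeros(d)       # running normalizer
    outs = []
    for q, k, v in zip(queries, keys, values):
        fk = np.maximum(k, 0.0) + 1e-6  # positive feature map (illustrative)
        fq = np.maximum(q, 0.0) + 1e-6
        S += np.outer(v, fk)            # fold the token into the summary
        z += fk
        outs.append((S @ fq) / (z @ fq))  # read out against the summary
    return np.array(outs)

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 4))
y = linear_attention(x, x, x)
```

Each step costs O(d²) regardless of position, which is exactly why a fixed memory budget suffices.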
That approach has the drawback of reducing model accuracy. In this context, an alternative architecture appeared: Mamba. It replaces the attention component with a mechanism inspired by control theory: State Space Models (SSMs). With SSMs, scaling is linear. More importantly, the SSM parameters can depend on the input, so the selection of which information to retain happens at memorization time, rather than at recall time, as with Transformers.
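In spirit, that input-dependent selection can be sketched as a gated recurrence (a toy illustration; actual Mamba uses discretized SSM dynamics and learned projections):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_scan(xs, W_a, W_b):
    # The forget and write gates depend on the INPUT, so the model decides
    # what to store as tokens arrive (at memorization time), rather than
    # deciding what to look at later, as attention does at recall time.
    h = np.zeros(xs.shape[1])
    ys = []
    for x in xs:
        a = sigmoid(W_a @ x)  # input-dependent forget gate
        b = sigmoid(W_b @ x)  # input-dependent write gate
        h = a * h + b * x     # fixed-size state, linear in sequence length
        ys.append(h.copy())
    return np.array(ys)

rng = np.random.default_rng(2)
d = 4
W_a, W_b = rng.normal(size=(2, d, d))
ys = selective_scan(rng.normal(size=(8, d)), W_a, W_b)
```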
However, Mamba has a weakness that discourages abandoning self-attention entirely: such models struggle with recall. This metric measures the proportion of relevant items the model successfully retrieves. It should be distinguished from precision, which measures the proportion of the model’s answers that are correct.
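The two metrics, in a toy retrieval setting:

```python
def precision_recall(retrieved, relevant):
    # recall: share of the relevant items the model actually finds
    # precision: share of the model's answers that are correct
    tp = len(retrieved & relevant)
    return tp / len(retrieved), tp / len(relevant)

# toy example: the model returns 4 answers, 3 of them correct,
# out of 6 relevant items in total
p, r = precision_recall({"a", "b", "c", "x"}, {"a", "b", "c", "d", "e", "f"})
# p = 0.75, r = 0.5
```

A Mamba-style model can thus be precise in what it does surface while still missing much of what was present earlier in the context.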
Hymba, an NVIDIA-made backbone
Dragon LLM took these insights into account as it conducted its experiments. They involved training models ranging from 120 million to 770 million parameters on up to 50 billion tokens.
As a target for training loss, the team benchmarked against modded-NanoGPT. For recall, SWDE (500-token prompts) and FDA (2,000-token prompts) were used. To assess language modeling, HellaSwag was employed.
With these foundations laid, Dragon LLM turned to another architecture: Hymba (Hybrid Mamba). Created by NVIDIA, it combines, within each layer, traditional attention heads with SSM heads. Only 3 layers use full global attention; in the remaining layers, attention is local (restricted to the last 1,024 tokens). Models built on this base prove efficient at inference: throughput remains stable as context grows. The recall shortfall persists, however. Hence the team explored so-called differential attention mechanisms. Dragon LLM cites two, from DeepSeek and Microsoft. The results of the first could not be reproduced reliably. The second, a noise-suppression scheme intended to help the model better identify the important parts of the context, produced only marginal improvements when applied to all layers. Restricted to the global-attention layers, it yielded a meaningful benefit, perhaps, the team suggests, because it encouraged those layers to specialize in recall.
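The global/local layer layout can be illustrated with attention masks (a sketch; the layer indices and window size below are placeholders, not Hymba's actual configuration):

```python
import numpy as np

def layer_mask(seq_len, layer_idx, global_layers, window=1024):
    # Hybrid layout in the Hymba spirit: a handful of layers see the full
    # causal context; the rest are local, limited to the last `window` tokens.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if layer_idx in global_layers:
        return causal                    # full global attention
    return causal & (i - j < window)     # sliding-window attention

# illustrative: layer 0 is global, layer 1 is local with a tiny window
g = layer_mask(6, layer_idx=0, global_layers={0}, window=2)
m = layer_mask(6, layer_idx=1, global_layers={0}, window=2)
```

Local layers keep per-token cost bounded by the window, which is what keeps throughput stable as context grows.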
A bit of DeepSeek in the mix
Other techniques were deployed to boost Dragon’s performance. Among them, scaling normalization helped stabilize variance in deep layers, leading to better training.
Dragon LLM also replaced PyTorch’s default parameter initialization with a scheme borrowed from DeepSeek. It used SkyLadder scheduling, which gradually widens the attention window during training. It also applied per-head attention normalization (improving signal integrity) and repositioned the global-attention layers (improving loss and recall) while removing positional encoding for the corresponding heads. Mamba’s internal state management, for its part, was replaced with the GDN (Gated Delta Net) method, which delivers better performance once training passes the 30-billion-token mark.
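The gated delta rule at the heart of GDN can be sketched as a single state update (illustrative only; the real method uses learned, per-head gates and normalized keys):

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    # Gated Delta Net idea: first decay the whole memory with a gate, then
    # apply a delta-rule write that stores only the prediction ERROR for
    # key k, instead of blindly accumulating v k^T as linear attention does.
    S = alpha * S                         # gated decay (forget)
    err = v - S @ k                       # what the memory gets wrong for k
    return S + beta * np.outer(err, k)    # corrective write

d = 4
S = np.zeros((d, d))
k = np.eye(d)[0]                          # unit-norm key
v = np.array([1.0, 2.0, 3.0, 4.0])
S = gated_delta_step(S, k, v, alpha=1.0, beta=1.0)
# with a unit key and beta=1, the memory now returns v exactly for k
```

The error-driven write lets the state overwrite stale associations rather than merely accumulating them.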
Some techniques did not pay off. On data efficiency, for example, Rho-1 and SoftDedup. Both weight tokens: a small model assigns each token a score that sets its contribution to the loss (more informative tokens influence the gradients more). Likewise, no optimizer proved clearly superior to AdamW; others, such as AdEMAMix, introduced too much instability to be managed effectively.
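A sketch of the token-weighting idea, with hypothetical scores and a hard-selection variant in the Rho-1 spirit (SoftDedup instead softens the weights rather than selecting):

```python
import numpy as np

def reweighted_loss(token_losses, ref_scores, keep_frac=0.6):
    # A small reference model scores each token; only the highest-scoring
    # fraction contributes to the loss, so informative tokens dominate
    # the gradients. (keep_frac and scores are illustrative.)
    n_keep = max(1, int(len(token_losses) * keep_frac))
    keep = np.argsort(ref_scores)[-n_keep:]   # most informative tokens
    w = np.zeros_like(token_losses)
    w[keep] = 1.0 / n_keep
    return float(w @ token_losses)

losses = np.array([0.5, 2.0, 1.0, 3.0, 0.1])   # per-token cross-entropy
scores = np.array([0.9, 0.2, 0.8, 0.1, 0.7])   # hypothetical informativeness
loss = reweighted_loss(losses, scores, keep_frac=0.6)
```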
SmolLM3 performance, but with greater frugality
To scale up, Dragon LLM implemented its architecture in the Megatron-LM framework. The resulting model sits on par with Qwen3-4B and SmolLM3 on ARC, FDA, HellaSwag, LAMBADA, PIQA, and SWDE (0-shot). All with greater efficiency: at inference, as noted earlier (Dragon LLM even hints at CPU deployment), and in training (3,700 billion tokens, i.e., three times fewer than SmolLM3 and ten times fewer than Qwen3-4B).
Dragon LLM now targets training on more than 10,000 billion tokens, along with adapting to instruction-following and training larger models. It promises production-ready versions in the coming months.
Further reading:
JUPITER, the Arm supercomputer that puts Europe into the exascale era
IBM distances itself from Transformers for its Granite LLMs
Alibaba abandons “hybrid thinking” for its Qwen LLMs
Undisclosed, poorly understood: Deloitte admonished for its use of generative AI
GenAI, explored but not widely deployed for managing microservices