A little Transformer, a lot of Mamba: with the Granite 4.0 language models, IBM is pursuing an architectural shift.
Mamba is designed to address the bottlenecks of transformer models on long sequences. There, the attention mechanism becomes a chokepoint: it relies on a key–value cache that lets each token attend to all prior ones during prediction, so the cache's memory footprint grows linearly with context length while the attention computation grows quadratically. Techniques such as sliding-window attention and FlashAttention mitigate the problem. Mamba goes further by replacing attention with a mechanism inspired by control theory: State Space Models (SSMs), which scale linearly with sequence length. The SSM parameters can be conditioned on the input, so the selection of what information to retain happens at the memorization stage rather than at the recall stage, as is the case with transformers.
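The recurrence behind this idea can be sketched in a few lines. The following is a minimal, illustrative implementation of a selective state-space scan: the step size and projection parameters depend on the input, so the state decides what to keep as it memorizes, and the whole sequence is processed in linear time. All names, shapes, and initializations here are assumptions for illustration, not IBM's implementation.

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, w_dt):
    """x: (T, D) input sequence; runs in O(T), unlike O(T^2) attention."""
    T, D = x.shape
    N = A.shape[1]                          # state size per channel
    h = np.zeros((D, N))                    # hidden state
    y = np.empty_like(x)
    for t in range(T):
        # Input-dependent parameters: selection happens while memorizing.
        dt = np.log1p(np.exp(x[t] * w_dt))  # softplus step size, (D,)
        B = x[t] @ W_B                      # input projection, (N,)
        C = x[t] @ W_C                      # output projection, (N,)
        dA = np.exp(dt[:, None] * A)        # discretize; A < 0 keeps dA in (0, 1)
        h = dA * h + dt[:, None] * np.outer(x[t], B)
        y[t] = h @ C                        # read out, (D,)
    return y

rng = np.random.default_rng(0)
T, D, N = 16, 8, 4
y = selective_ssm(rng.standard_normal((T, D)),
                  -np.exp(rng.standard_normal((D, N))),  # stable A < 0
                  rng.standard_normal((D, N)) * 0.1,
                  rng.standard_normal((D, N)) * 0.1,
                  rng.standard_normal(D) * 0.1)
print(y.shape)  # (16, 8)
```

Note that the per-step state `h` has a fixed size regardless of how many tokens came before, which is precisely why there is no growing cache to store.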
Transformers pared down to essentials
IBM does not discard Transformers entirely but pares them down to the essentials: only 4 attention layers out of 40 in each of the currently published Granite 4.0 models (open weights, Apache 2.0 license). More precisely, the layers alternate in a fixed sequential pattern: a group of nine Mamba blocks, then a single Transformer block, and so on. The Transformer blocks are retained because they offer advantages on in-context learning tasks (typically few-shot prompting).
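The 9:1 interleaving described above can be written out explicitly. This is a hypothetical sketch of the layer plan, with illustrative labels rather than IBM's actual code:

```python
# Groups of nine Mamba blocks, each followed by one attention
# (Transformer) block, for 40 layers total.
def granite_layer_plan(n_layers=40, mamba_per_group=9):
    plan = []
    while len(plan) < n_layers:
        plan += ["mamba"] * mamba_per_group + ["attention"]
    return plan[:n_layers]

plan = granite_layer_plan()
print(plan.count("attention"), "attention layers out of", len(plan))
# 4 attention layers out of 40
```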
The architectures described above do not rely on positional encoding. By design, Mamba intrinsically preserves token order. Transformers do not: they typically incorporate positional encodings, often at the cost of a model's ability to handle sequences longer than those it was trained on.
Thinking versions on the horizon
As their predecessors did, the Granite 4.0 models are intended to generate text and code. There are currently four, all available in base and instruct variants (thinking versions expected “by the end of 2025”):
- H-Small: hybrid Mamba/Transformer MoE (32 billion parameters, of which 9 billion are active; 10 of 72 experts)
- H-Tiny: hybrid Mamba/Transformer MoE (7 billion parameters, of which 1 billion are active; 6 of 64 experts)
- H-Micro: dense hybrid Mamba/Transformer (3 billion parameters)
- Micro: classic (Transformer-only) variant of H-Micro
All of them are available in quantized versions (GGUF, with FP8 also for H-Small instruct).
In 8-bit precision, H-Small requires 33 GB of RAM; H-Tiny, 8 GB; H-Micro, 4 GB, versus 9 GB for its Transformer variant. IBM emphasizes this gain for inference, especially in long-context tasks and/or multi-session scenarios (for example a customer service agent handling several tickets in parallel).
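The figures above can be sanity-checked with back-of-the-envelope arithmetic: at 8-bit precision each parameter occupies one byte, so weight memory in GB roughly equals the parameter count in billions, with the quoted numbers adding a margin for runtime overhead. Parameter counts are taken from the article; the calculation itself is illustrative.

```python
# 8-bit quantization: 1 byte per parameter, so weights alone cost
# roughly (parameter count) bytes; quoted RAM figures add overhead.
models = {"H-Small": 32e9, "H-Tiny": 7e9, "H-Micro": 3e9}
for name, params in models.items():
    weight_gb = params / 1e9            # bytes -> GB (decimal)
    print(f"{name}: ~{weight_gb:.0f} GB of weights at 8-bit")
```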
All Granite 4.0 models have been validated on 128k-token sequences. The training of the base versions followed a four-stage pipeline (see the table below), run on CoreWeave's GB200 NVL72 servers. Fine-tuning relied on "open-licensed, permissive" datasets, internal synthetic datasets, and human-annotated data.
Integrating Mamba into the ecosystem
H-Small and H-Tiny display another form of hybridity: they are the first IBM MoE models to use "shared experts," i.e., always-active parameters that free the routed experts to specialize more effectively.
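A shared expert can be sketched as follows: one expert is applied to every token unconditionally, while a router still picks the top-k specialized experts. Everything here (names, sizes, softmax gating over the top-k) is an illustrative assumption, not IBM's implementation.

```python
import numpy as np

def moe_with_shared_expert(x, router_W, experts, shared, k=2):
    """x: (D,) one token; experts and shared are (D, D) weight matrices."""
    logits = x @ router_W                       # router scores, (n_experts,)
    top = np.argsort(logits)[-k:]               # indices of the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    routed = sum(g * (x @ experts[i]) for g, i in zip(gates, top))
    return x @ shared + routed                  # shared expert: always active

rng = np.random.default_rng(1)
D, n_experts = 8, 6
out = moe_with_shared_expert(
    rng.standard_normal(D),
    rng.standard_normal((D, n_experts)),
    [rng.standard_normal((D, D)) * 0.1 for _ in range(n_experts)],
    rng.standard_normal((D, D)) * 0.1,
)
print(out.shape)  # (8,)
```

Because the shared path absorbs features every token needs, the routed experts no longer have to duplicate that common knowledge, which is the specialization benefit the article describes.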
Nano and Medium models are on the roadmap, along with an effort to deepen Mamba support across the ecosystem: tools such as llama.cpp do not yet support it. It is partly for this reason that IBM has kept a "classic" Transformer model in its lineup.
The IBM open-weight catalog includes multimodal models, among them:
- Granite Speech (speech recognition; latest version published in August, at 2B and 8B)
- Granite Vision (latest version – 2B – published in June, with an embedding variant added in August)
- Granite Guardian (content moderation; latest version – 8B – published in September)
- Granite Docling (structured data extraction; latest version – 258M – published in September)
Its latest “code-specialized” models date back to 2024. There are also Granite models for geospatial data processing and for time series.
Read alongside this article: our brief review of the Granite 3.0 LLMs. Released almost a year ago, they introduced into IBM's model catalog techniques such as ScatterMoE (an MoE implementation that does not impose a per-expert token cap) and Evol-Instruct (generation of synthetic data from seed questions, with improved versions produced through prompt engineering).