SLM vs LLM: NVIDIA Takes a Side on Agentic AI Systems

Phi at Microsoft, xLAM at Salesforce, SmolLM2 at Hugging Face… A multitude of SLM families have shown themselves capable, on certain tasks, of matching the performance of LLMs from the same generation.

NVIDIA cites them all, benchmarks in support, in an article that reads like an ode. Its title says it all: “SLMs are the future of agentic AI.” These models would be simultaneously “powerful enough,” “intrinsically better suited,” and “necessarily more economical.” The American group draws the line at roughly 10 billion parameters: anything below that, by its definition, counts as an SLM.

On the aspect of being “powerful enough”

If scaling laws hold, the performance of SLMs will gradually converge toward that of LLMs, NVIDIA asserts. It points to evaluations where:

  • Phi-2 (2.7B) achieved scores comparable to 30B models of the same generation in reasoning and code generation, while running 15 times faster
  • Phi-3 small (7B) reached the level of 70B models of the same generation in natural language understanding, reasoning, and code generation
  • SmolLM2 models (125M to 1.7B) produced performances comparable to 14B contemporaries in natural language understanding, tool use, and instruction following
  • Its own Nemotron-H SLMs (Transformer-Mamba hybrids ranging from 2 to 9B) rivaled 30B models of the same generation on instruction following and code generation while consuming notably less compute power

The capabilities cited here are typical of agentic systems, NVIDIA notes.

On the aspect of being “more economical”

SLMs require less parallelization, NVIDIA emphasizes. This translates into infrastructure that’s easier to manage, not to mention that these “smaller” models—by their very nature—demand less compute power. This applies to both inference and fine-tuning. Moreover, parameter efficiency tends to be higher with SLMs than with LLMs.

On the aspect of being “intrinsically better suited”

Using SLMs of varied sizes and areas of expertise naturally suits the heterogeneity of agentic tasks. It broadly promotes system modularity—and with that, easier deployment and maintenance.
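The modularity argument can be pictured as a simple dispatch layer. The sketch below is hypothetical (the model names, parameter counts, and routing rules are illustrative, not taken from NVIDIA's article): each agentic task type is mapped to the smallest model assumed to handle it, with a large generalist as the fallback.

```python
# Hypothetical dispatch layer for a heterogeneous agentic system.
# Model names and sizes are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelSpec:
    name: str
    params_b: float  # parameter count, in billions

# Illustrative registry: expert SLMs for bounded tasks, one LLM fallback.
REGISTRY = {
    "tool_call":     ModelSpec("slm-tools", 2.0),
    "code_gen":      ModelSpec("slm-coder", 7.0),
    "summarization": ModelSpec("slm-summarizer", 1.7),
    "open_dialogue": ModelSpec("llm-generalist", 70.0),
}

def route(task_type: str) -> ModelSpec:
    """Return the model assigned to a task type; unknown tasks
    fall back to the general-purpose LLM."""
    return REGISTRY.get(task_type, REGISTRY["open_dialogue"])
```

Because each entry is independent, swapping a fine-tuned SLM in for one task type leaves the rest of the system untouched, which is the maintenance benefit NVIDIA points to.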

SLMs also carry a form of AI democratization. More parties are likely to contribute to their design, which NVIDIA believes will spur innovation and diversity. It takes the opportunity to highlight its ChatRTX software, which it claims demonstrates the ability to run these “small” models even on consumer-grade GPUs (GeForce RTX).

The American group continues: in agentic systems, the bulk of tasks are bounded, repetitive, and non-conversational. In this context, SLMs not only suffice but are often preferable. Especially when one needs models with tightly defined behavior (output structuring and tool calls): LLMs, owing to their heritage, appear more prone to hallucinations.
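The “tightly defined behavior” point can be made concrete with a validation gate on model output. This is a minimal, hypothetical sketch (the schema and field names are assumptions, not NVIDIA's): a tool call is accepted only if it parses as JSON and matches the expected shape, and any free-form drift is rejected.

```python
import json

# Expected shape of a tool call; the field names are illustrative.
REQUIRED_FIELDS = {"tool": str, "arguments": dict}

def validate_tool_call(raw: str):
    """Parse a model's raw output and check it against the expected shape.

    Returns the parsed call on success, or None if the output deviates
    (malformed JSON, missing fields, wrong types) - i.e. the kind of
    drift the article says larger generative models are more prone to.
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(call.get(field), ftype):
            return None
    return call
```

A structured output such as `{"tool": "search", "arguments": {"q": "x"}}` passes the gate; a conversational answer like “I think we should call search” does not.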

Agentic systems lend themselves to massive data collection from usage, NVIDIA adds (every tool and model invocation becomes a data source). This favors training expert SLMs (10,000 to 100,000 examples suffice), provided that, beforehand, tasks most likely to benefit are identified through clustering.
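The clustering step NVIDIA describes can be sketched as follows. This is a toy, assumed implementation: a real pipeline would cluster learned embeddings of the logged requests, but a greedy single-pass grouping by Jaccard similarity over token sets keeps the example self-contained while showing how recurring task types surface from usage logs.

```python
# Hypothetical sketch of the log-clustering step: group recorded agent
# requests by lexical similarity to surface recurring task types that
# could be delegated to a fine-tuned expert SLM.

def jaccard(a: set, b: set) -> float:
    """Similarity between two token sets (1.0 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_requests(requests, threshold=0.4):
    """Greedy single-pass clustering: attach each request to the first
    cluster whose seed is similar enough, else start a new cluster."""
    clusters = []  # list of (seed_tokens, member_list)
    for req in requests:
        tokens = set(req.lower().split())
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(req)
                break
        else:
            clusters.append((tokens, [req]))
    return [members for _, members in clusters]
```

On logs like `["summarize ticket 123", "summarize ticket 456", "call weather api for paris"]`, the summarization requests fall into one cluster, a candidate task for an expert SLM, while the API call starts its own.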

Counterarguments? Yes, but…

One might think that for a given task, an LLM will always have the edge in general understanding. One study in particular suggests a “semantic hub” mechanism that aids generalization across languages and domains of expertise, and that may be stronger in larger models than in SLMs.

NVIDIA concedes the argument but counters it. It points to an assumption underlying many scaling-law studies: that architecture remains constant within a given generation of models. Yet recent work on training SLMs demonstrates the benefit of adapting the architecture to model size.

The flexibility of SLMs also makes upgrades easier, and inference-time scaling is more tractable with them. As for this “hub,” its usefulness seems limited in agentic systems, where complex problems are broken down into simpler subtasks.

Another counterargument: LLMs, thanks to centralization, will remain cheaper to operate. NVIDIA concedes this: a multitude of agents potentially implies a multitude of inference resources. Yet these considerations depend heavily on the use case, it cautions. And recent advances in inference scheduling and modularization argue for SLMs, given the flexibility they bring to single-cluster configurations.

Cradle, MetaGPT, Open Operator: three use cases to illustrate the potential for replacement

In practice, LLMs remain deeply embedded in agentic systems. NVIDIA argues that SLMs suffer from a lack of visibility (they do not benefit from the same “marketing intensity”). Another barrier: a tendency to ground their design and evaluation in general-purpose benchmarks. The substantial investments already made in LLMs also come into play: the industry has built tools and infrastructures accordingly.

To estimate the potential for replacing LLMs with SLMs, NVIDIA cites three examples:

  • MetaGPT

This Apache-2.0 licensed agent framework emulates a software publishing company, assigning roles such as product manager, architect, and QA engineer. According to NVIDIA, 60% of requests within such a system could be handled by SLMs, chiefly routine code generation and template-based structured responses. LLMs’ generative capabilities could retain the edge in debugging and architectural reasoning.

  • Open Operator

This workflow automation tool defines the behavior of agents that perform API calls, monitoring, and orchestration using tools and services. NVIDIA estimates that 40% of requests in this domain could be handled by SLMs. They would be well suited to simple commands and template-based generation, but might hit their limits on tasks requiring multi-step reasoning or context maintenance.

  • Cradle

This MIT-licensed tool enables agents to drive graphical interfaces from screenshots. NVIDIA estimates that 70% of requests for this use case could be managed by SLMs, starting with repetitive interactions and the execution of pre-learned actions. They would be less capable, however, of handling dynamic interfaces or resolving unstructured errors.

