LLMs Could Be Evolving Away from Pure Transformer Architectures to Solve Mounting Memory Concerns
By Larbi Belkhit | 21 Jan 2026 | IN-8018
NEWS: NVIDIA Launches Nemotron™ 3, Introducing Hybrid Mamba-Transformer MoE Architecture
In December 2025, NVIDIA announced the NVIDIA Nemotron™ 3 family of open models, data, and libraries to help developers build and deploy multi-agent Artificial Intelligence (AI) systems at scale. This Nemotron™ family consists of three models:
- Nemotron™ 3 Nano: A 30B parameter model (activating up to 3B at any given time) for targeted, highly efficient tasks
- Nemotron™ 3 Super: An approximately 100B parameter model (activating up to 10B per token) for multi-agent applications
- Nemotron™ 3 Ultra: A Large Reasoning Model (LRM) with approximately 500B parameters (up to 50B active per token) for complex AI applications
Nemotron™ 3 introduces a hybrid Mamba-Transformer Mixture-of-Experts (MoE) architecture, Reinforcement Learning (RL) across interactive environments, and a native 1M-token context window. NVIDIA also announced that Nemotron™ 3 already has early adopters such as Cursor, Oracle Cloud Infrastructure, Perplexity, Palantir, Siemens, and others. However, NVIDIA is not the first to launch such a hybrid MoE architecture. In October 2025, IBM launched its Granite 4.0 family of models, which also leverages a hybrid Mamba-Transformer MoE architecture (see the routing sketch after the list below), including:
- Granite-4.0-H-Small: A 32B parameter model (activating up to 9B at any given time)
- Granite-4.0-H-Tiny: A 7B parameter model (1B active per token)
- Granite-4.0-H-Micro: A dense 3B parameter hybrid model
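The "active parameter" figures quoted for these models reflect MoE routing: a learned gate scores the experts in each MoE layer and only the top-scoring few are run for a given token, so a 30B- or 500B-parameter model exercises only a small fraction of its weights per forward pass. The following is a minimal sketch of that idea, assuming a hypothetical top-k softmax router rather than NVIDIA's or IBM's actual routing code; expert counts and dimensions are made up for the example.

```python
# Illustrative top-k MoE routing; all shapes here are assumptions for the sketch.
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Route a single token through only the top_k highest-scoring experts.

    x       : (d_model,) token representation
    experts : list of (d_model, d_model) expert weight matrices
    gate_w  : (n_experts, d_model) router weights
    """
    scores = gate_w @ x                                       # one routing score per expert
    top = np.argsort(scores)[-top_k:]                         # pick the top_k experts
    probs = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the selected experts
    # Only the selected experts run, so only their parameters are "active" for this token.
    return sum(p * (experts[i] @ x) for p, i in zip(probs, top))

# Example: 8 experts with top_k = 2, so roughly a quarter of the expert parameters
# are touched per token.
d = 16
experts = [np.random.randn(d, d) for _ in range(8)]
gate_w = np.random.randn(8, d)
y = moe_forward(np.random.randn(d), experts, gate_w, top_k=2)
```

Scaling the same idea up (many more experts, a small top_k) is how a model with tens or hundreds of billions of total parameters can keep per-token compute closer to that of a much smaller dense model.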
The Mamba architectures, specifically Mamba-2, are optimized State Space Models (SSMs) that replace Transformer-style self-attention with a selectivity mechanism, which is inherently more efficient: computational requirements scale linearly with sequence length. Importantly, Mamba layers avoid Key-Value (KV) cache growth, making memory usage largely independent of sequence length during inference. That said, Transformers and self-attention retain advantages on tasks requiring in-context learning, which is why NVIDIA and IBM have adopted a hybrid Mamba-Transformer architecture.
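A rough way to frame that trade-off, as order-of-magnitude scaling behavior rather than figures published by NVIDIA or IBM, is:

```latex
% Per layer, for sequence length L, model width d, and SSM state size d_state:
\begin{aligned}
\text{Self-attention:} \quad & \text{compute} \sim \mathcal{O}(L^{2} d),
  && \text{inference memory (KV cache)} \sim \mathcal{O}(L\, d) \\
\text{Mamba-2 (SSM):} \quad & \text{compute} \sim \mathcal{O}(L\, d\, d_{\text{state}}),
  && \text{inference memory (recurrent state)} \sim \mathcal{O}(d\, d_{\text{state}})
\end{aligned}
```

In a hybrid stack, only the attention layers contribute KV cache growth, so total inference memory rises far more slowly with context length than in a pure Transformer.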
IMPACT: Hybrid LLM Architectures Aim to Address the Growing Memory Bottleneck
The introduction of Mamba layers and models is driven by the growing memory and cost constraints of Transformer models, which become a much larger pain point as context lengths grow and multi-step reasoning workloads multiply. Transformer self-attention scales quadratically in compute with sequence length, and its KV cache grows with every token kept in context, so inference slows down and becomes more expensive as context length increases. Furthermore, the memory efficiency gains apply across all deployment scenarios (edge and cloud), and the range of model sizes (from very small to large) announced by NVIDIA and IBM illustrates the feasibility of the architecture across those scenarios.
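A hedged back-of-envelope sketch of that memory pressure is shown below; the layer counts, head counts, and dimensions are illustrative assumptions, not published Nemotron™ 3 or Granite 4.0 specifications.

```python
# Rough KV cache vs. SSM state sizing; all shapes are illustrative assumptions.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Attention layers store a key and a value vector for every past token (FP16 here).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_elem=2):
    # Mamba-style layers keep a fixed-size recurrent state, independent of context length.
    return n_layers * d_inner * d_state * bytes_per_elem

for ctx in (8_192, 131_072, 1_000_000):
    kv = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, seq_len=ctx)
    ssm = ssm_state_bytes(n_layers=48, d_inner=8_192, d_state=128)
    print(f"{ctx:>9} tokens: KV cache ~{kv / 2**30:.1f} GiB vs. SSM state ~{ssm / 2**30:.2f} GiB")
```

Under these assumptions, an attention-only cache climbs from roughly 1.5 GiB per sequence at 8K tokens to well over 100 GiB at 1M tokens, while the SSM state stays fixed at a fraction of a gigabyte; this gap is the core of the hybrid architecture's memory argument.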
The lower KV cache growth and lower computational requirements help alleviate some of the challenges of deploying agentic systems. In the enterprise space, this should enable more agentic systems to run at the edge or even on-device, given that these hybrid models from IBM and NVIDIA are not only more computationally efficient, but also deliver better performance for their model “class.” Overall, this should help enterprises lower their AI inference costs, which should support the wider proliferation of Agentic AI across the enterprise.
The early adoption of Nemotron™ 3 models by AI leaders helps legitimize, and draws more attention to, the feasibility of hybrid LLM architectures for agentic systems. Furthermore, NVIDIA’s support for these hybrid architectures illustrates how it aims to support the growth of the AI ecosystem by optimizing and driving innovation across the entire AI stack, not just the Graphics Processing Unit (GPU) and AI compute layers, to improve the Return on Investment (ROI) of running agentic systems for enterprises.
RECOMMENDATIONS: The Performance Discussion May Become Secondary to Cost for Agentic Systems in 2026
To date, Agentic AI has been heavily reliant on cloud processing, but increased enterprise adoption of, and spending on, Agentic AI will grow demand for distributed deployments across cloud, edge, and on-premises infrastructure. In 2026, ABI Research expects discussions around this distributed inference to ramp up significantly, given that it is the next frontier for the AI industry and is better suited to meeting enterprise needs. These discussions will inevitably focus not only on technical feasibility, but also on cost efficiency, as enterprises have so far seen many AI pilot projects fail. With growing context lengths, the additional inference passes of multi-agent systems, and integration costs, coupled with the rising cost of memory in the cloud, enterprise Agentic AI must evolve quickly to remain viable and to reduce its reliance on cloud infrastructure for both Proofs of Concept (PoCs) and commercial launches.
With IBM Granite 4.0 and Nemotron™ 3, we could be seeing an early signal of the next architectural trend in LLM engineering for agentic workflows, and incumbent model providers such as OpenAI, Anthropic, and Google could adopt similar hybrid architectures for their LLMs to drive down the cost of training and inference while they build out the massive infrastructure they have already committed to. To date, they have publicly committed only to Transformer-based LLMs. Given the sheer volume of new model releases that continue to be announced, educating developers, and enterprises in particular, on the benefits of this architectural approach is comparatively simple: the value proposition is not solely ground-breaking benchmark results, but the Total Cost of Ownership (TCO) and ROI angle, a language the enterprise C-suite is far more comfortable and familiar with. While these hybrid architectures do not resolve all concerns and challenges for agentic systems, they do illustrate the full-stack approach that NVIDIA and others are taking to address them.
Written by Larbi Belkhit