Is NVIDIA's AI Dominance in High-Performance Computing Being Challenged?

By Paul Schell | 4Q 2023 | IN-7199

The last year has seen numerous developments in High-Performance Computing (HPC) chipsets addressing demanding Artificial Intelligence (AI) workloads—from incumbents like AMD and Intel, to hyperscalers Google and Amazon Web Services (AWS). Chipsets targeting leading-edge AI training workloads are still dominated by NVIDIA’s top-shelf Graphics Processing Units (GPUs), but accessibility, price, and lock-in concerns mean that AI players are looking to novel solutions.

Numerous Chipsets Have Entered the HPC AI Arena, Knocking on NVIDIA's Door

NEWS


The majority of the market still sees NVIDIA’s top-shelf A100 and H100 Graphics Processing Units (GPUs) as the options best suited to training the most demanding frontier Large Language Models (LLMs). Despite their price, they offer huge value: the time saved during training accelerates time to market. Demand for Artificial Intelligence (AI) is mushrooming; ABI Research’s Artificial Intelligence Software market data (MD-AISOFT-101) forecasts AI software revenue to grow at a Compound Annual Growth Rate (CAGR) of 27% between 2023 and 2030 (a quick compounding check follows the list below). Against this backdrop, NVIDIA’s leading chipsets, AI systems, and strong software proposition have helped its share price triple in the last year. However, it is increasingly seeing competitors build strong propositions in the High-Performance Computing (HPC) market:

  • AMD’s recently launched MI300X accelerator offers performance improvements over the MI250 and is targeted at generative AI and LLM training. AMD claims 1.3X the AI performance of the H100.
  • Microsoft revealed the Azure Maia AI Accelerator (suited to the most demanding training workloads), which will start to roll out in early 2024, with a second generation already on the company’s roadmap.
  • Google expanded its AI-optimized infrastructure portfolio with Cloud TPU v5e, an in-house accelerator for medium- and large-scale training and inference.
  • AWS built Trainium, a high-performance Machine Learning (ML) training accelerator for models with more than 100 billion parameters.
  • Intel’s Gaudi 2 processor for Deep Learning (DL) targets both inference and training workloads. The company claims a 1.8X training throughput increase over A100 accelerators.
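
For context on the scale of the demand forecast cited above, a 27% CAGR compounds quickly. A minimal arithmetic sketch, assuming seven compounding years from 2023 to 2030:

```python
# What a 27% CAGR between 2023 and 2030 implies for cumulative growth.
cagr = 0.27
years = 2030 - 2023  # seven compounding periods
multiple = (1 + cagr) ** years
print(f"Implied revenue multiple by 2030: {multiple:.1f}x")  # roughly 5.3x
```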

Beyond benchmarks and announcements, tangible evidence of these chipsets’ ability to provide a workable alternative to NVIDIA GPUs is also mounting.

  • Lamini, a generative AI startup, revealed that it exclusively deployed AMD Instinct MI250 GPUs to train LLMs even before the launch of ChatGPT last year—as did MosaicML, another generative AI platform.
  • Stability AI opted for Intel’s Gaudi 2 accelerator for training its multimodal LLMs. Intel’s chipset is touted as delivering higher throughput than NVIDIA’s A100 GPUs, and Intel’s AI Everywhere event revealed details of Gaudi 3, launching in 2024 to rival AMD’s MI300X and NVIDIA’s H100 GPU.
  • Chinese AI company Baidu has pivoted to Huawei’s Ascend 910B AI chips as a substitute for NVIDIA’s A100 GPUs, mitigating the impact of U.S. sanctions.
  • Microsoft uses its Azure Maia AI Accelerators in conjunction with NVIDIA GPUs and AMD Instinct MI300X accelerators to power cloud Virtual Machines (VMs).

AI Companies Need Access to Cheaper, Performant Hardware Now

IMPACT


Both Stability AI and MosaicML explicitly mention lead times and costs in the Public Relations (PR) materials around their decisions to choose alternatives to NVIDIA’s GPUs. These factors are key motivators, especially for less demanding AI workloads (such as fine-tuning, inference, or training of small models), and there is now increasing competition for more demanding workloads as well, evidenced by Lamini’s and MosaicML’s decisions to deploy AMD’s solutions and Stability AI’s preference for Intel. While impressive performance benchmarks versus NVIDIA resonate with customers, software is just as much of a priority. NVIDIA’s CUDA (and NCCL) frameworks have created a “walled garden” around NVIDIA systems, but customers increasingly want to pivot toward more open solutions, as the sketch below illustrates. In this vein, Stability AI has highlighted the importance of Intel’s software stack for “its seamless model architecture compatibility,” while MosaicML is impressed with AMD’s full-stack approach.
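
To make the portability point concrete: code written against the framework layer, rather than vendor-specific APIs, moves between accelerators far more easily. A minimal, hypothetical PyTorch sketch (the model and shapes are illustrative assumptions, not drawn from any of the companies mentioned):

```python
import torch

# Writing against the framework rather than vendor-specific APIs keeps model
# code portable: ROCm builds of PyTorch expose AMD GPUs through the same
# torch.cuda interface, so this script targets NVIDIA, AMD, or CPU unchanged.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(128, 10).to(device)  # toy model, illustrative only
x = torch.randn(4, 128, device=device)       # dummy batch
print(model(x).shape)                        # torch.Size([4, 10])
```

Lock-in sets in one layer down, where kernels or collectives are written directly against CUDA and NCCL; that code does not port without rework, which is exactly the concern driving interest in more open stacks.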

In a growing, highly competitive space, all of these factors matter, but accessibility above all will determine whether challengers gain commercial success. As enterprise AI deployments continue to scale, they will need reliable access to hardware, which may lead some players to pivot away from supply-constrained, more expensive NVIDIA systems toward competitors like Intel and AMD.

An Opportunity to Capitalize on Vertically Integrated Solutions and Vendor Lock-in Fears

RECOMMENDATIONS


Considering NVIDIA’s entrenchment in the HPC market, Intel and AMD face an uphill battle. However, given the growing demand for hardware capable of training leading-edge models, and NVIDIA’s supply constraints, they certainly have an opportunity to compete. ABI Research recommends that Intel and AMD build highly differentiated propositions by focusing on the following areas:

  • Continue to expand software offerings through internal investment and acquisitions. AMD’s purchase of Nod.ai is a good example: its open-source compiler technology enables high-performance AI solutions on AMD hardware such as the Instinct accelerators. Another instance is Intel’s acquisition of Granulate, a cloud workload optimization software company.
  • Double down on the open-source approach to address vendor lock-in concerns and entice developers to their hardware. Portability remains one of the industry’s main concerns about NVIDIA’s CUDA-based ecosystem. A strong example is Intel’s OpenVINO toolkit, which converts and optimizes models (including generative AI models) built in popular frameworks for deployment across multiple hardware platforms, lowering development barriers and accelerating time to market (see the sketch after this list).
  • Increase public partnerships with leading AI players, like AMD’s collaboration with Lamini, to raise awareness of hardware and software solutions that can compete with NVIDIA’s. Such announcements are picked up by numerous media outlets thanks to the relentless focus on AI and NVIDIA’s pole position, making them an effective complement to traditional marketing strategies.
  • Given NVIDIA’s authority in the most leading-edge training workloads, Intel and AMD could build differentiated value propositions targeting less demanding training workloads from AI players that are more sensitive to hardware prices. Not everyone has the treasure-chest budgets of OpenAI, Anthropic, and Meta.
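
As a concrete illustration of the open, hardware-agnostic workflow referenced in the second recommendation, below is a minimal sketch using OpenVINO’s current Python API; the model file name, input shape, and device choice are hypothetical placeholders, not a prescribed setup.

```python
import numpy as np
import openvino as ov

core = ov.Core()

# Convert a framework model (here, a hypothetical ONNX export) into
# OpenVINO's hardware-independent intermediate representation.
model = ov.convert_model("my_model.onnx")

# Compile the same representation for whichever device is present; "AUTO"
# lets the runtime choose (CPU, GPU, etc.), which is the portability point.
compiled = core.compile_model(model, device_name="AUTO")

# Run inference on a dummy batch; the input shape is an assumption tied to
# the illustrative model above.
dummy_input = np.zeros((1, 3, 224, 224), dtype=np.float32)
result = compiled(dummy_input)
```

The design point is that vendor-specific work happens inside the runtime’s device plugins rather than in the developer’s code, the opposite of the CUDA-ecosystem coupling discussed above.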

Persistent concerns around vendor lock-in will make Intel and AMD’s open-source approach desirable compared to NVIDIA’s CUDA-based “walled garden.” But NVIDIA should not be too worried: its AI systems still provide market-leading performance and simple integration; its walled garden retains captive developers, underscoring its value proposition for customers; and it has a very strong verticalized commercial strategy. Nor is it resting on its laurels. The launch and continuous expansion of NeMo (an end-to-end platform for AI developers, including enterprises, to deploy generative AI), the acceleration of its hardware roadmap, including upgrades to its flagship GPUs, and the decision to increase the cadence of hardware releases from every 2 years to once per year all show a company working to further entrench its position. The availability of alternatives does not immediately spell trouble for NVIDIA, but this is an important space to watch, especially as Intel and AMD’s investments in AI hardware and software ecosystems mature.