AI Inference Optimization Steals NVIDIA GTC Headlines: Chip Vendor Software Capabilities Are under a Microscope Again


By Reece Hayden | 2Q 2024 | IN-7305

Although Blackwell is important, NVIDIA Inference Microservices (NIMs) were the most widely discussed announcement at NVIDIA’s GTC. This led ABI Research to return to the question of how chip vendors can develop their software stacks to drive solution differentiation and migration toward their hardware.




NIMs Are Not Revolutionary, but Certainly Speak to Enterprise Challenges


NVIDIA Inference Microservices (NIMs), announced at NVIDIA GTC, are a new part of the NVIDIA AI Enterprise offering. NIMs are pre-built containers that enable enterprises to deploy Artificial Intelligence (AI) models and applications with optimized inference on any enterprise platform using CUDA-accelerated hardware; NVIDIA claims that they reduce deployment times from weeks to minutes. These microservices are built on top of the CUDA framework using NVIDIA inference software (including Triton Inference Server and TensorRT-LLM), and form part of the NeMo platform, which means other generative AI services can be used in parallel to support application deployment. Included within these microservices are industry-standard Application Programming Interfaces (APIs) for multiple domains (including language, speech, and drug discovery) to further accelerate AI application development. Alongside this announcement, NVIDIA unveiled a raft of cloud, Independent Software Vendor (ISV), and AI partners that are already building on, or being made compatible with, NIMs.

NIMs are designed to help enterprises address the following challenges:

  1. Lower time and effort to deploy AI applications at scale.
  2. Enable quick AI optimizations to lower costs and improve application performance.
  3. Retain control of data and autonomy over model deployment location, reducing data risk.
  4. Eliminate some of the bottlenecks hindering AI adoption (including the gap between developers and systems engineers, optimization processes, and model choice).
  5. Pick and choose models from any source to meet their own Key Performance Indicators (KPIs), rather than relying on third parties.

NIMs will solve some of the challenges inherent within enterprise generative AI deployment. Of course, there are constraints. The biggest one is that NIMs are compatible only with, and “optimized” for, CUDA-accelerated NVIDIA hardware. This will deepen vendor lock-in within the NVIDIA ecosystem: a boon for NVIDIA, as it drives demand for its hardware, but restrictive for customers seeking alternative AI platforms. Some developers may see this additional layer of lock-in as one step too far; however, it is more likely that most will see NIMs as an opportunity to accelerate generative AI deployment.

Will the UXL Foundation Help Competitors Combat NVIDIA's Growing Software Dominance?


NVIDIA is certainly investing in its software offering to support its market-leading CUDA framework. The rest of the market still lags behind, but competitors are trying to catch up through internal Research and Development (R&D), acquisitions targeting the software stack, and cooperation. A recently announced consortium called the UXL Foundation (whose members include Qualcomm, Google, and Intel) aims to develop a suite of software tools that will support multiple types of AI accelerator chips. This open-source project, built on Intel’s oneAPI, aims to make computer code chip-agnostic by creating a standard programming model or common specification designed for AI. The hope is that this will loosen NVIDIA’s grip on the AI market by undercutting the role of CUDA.

Driving competition at the software layer is the right approach given NVIDIA’s chipset dominance, but ABI Research has certain reservations. The key challenge remains that developers have been using CUDA for 15+ years, have built code around it, and are still seeing the best performance from NVIDIA hardware. These constraints mean that getting developers to migrate models and workloads is easier said than done, even with a competitive open-source approach. In addition, using oneAPI as a starting point may not be beneficial: the toolkit remains complex and challenging to use, which will hinder its ability to scale across enterprise use cases. Contrasted with the high degree of simplicity that NIMs offer, ABI Research sees obvious challenges, even if the consortium brings substantial improvements to oneAPI. Lastly, oneAPI is focused on supporting developers deploying training workloads, which is certainly important, but the focus will increasingly shift toward inference, as this will be the fastest growing workload. Success for the UXL Foundation will rely on fostering common innovation that targets inference workloads and creates a clear business case that justifies the difficult transition away from NVIDIA for developers.

R&D and Investment in Optimization Is Essential for Sustained Competitive Differentiation


The UXL Foundation may be successful in opening up the AI market, but chip vendors cannot rely on this alone. They must continue to develop in-house software capabilities to build a differentiated value proposition. The first area that they must target is optimization. As AI scales rapidly, costs, resources, and performance will create problems for stakeholders. Deep model optimization with hardware, application, and domain awareness will be the key variable that can unlock these AI bottlenecks.

Most chip vendors (such as AMD, Qualcomm, and Intel) already offer tools to support developers in optimizing AI for their chipsets, but these platforms are immature and complex, requiring deep AI expertise for effective use. This is just one of the challenges for chip vendors looking to develop a strong software value proposition. Moving forward, chip vendors must look to enhance their solutions by investing in or building partnerships with third parties. Some of the key areas that they must explore are highlighted below:

  • New, Emerging Tools or Services: Chip vendor optimization tools are often quite limited, and stakeholders should look to implement new tools to support deep model optimizations. One area to explore is Neural Architecture Search (NAS). Still in development with only a few commercially ready solutions, NAS is a technique that automates the design of artificial neural networks. It aims to produce networks that perform better on specific hardware and for specific applications. Challenges around cost and resource utilization still exist due to its computational intensity, but NAS should be considered in medium-term planning.
  • Automation: AI at scale brings development and management challenges, especially given that more “personas” with different skill sets will be involved. Automation will be vital to support AI at scale. This should involve Automated Machine Learning (AutoML) to speed up optimization, alongside the integration of generative AI models into developer platforms to support previously manual tasks like data annotation.
  • Inference-Specific Tools: NVIDIA’s accelerators are likely to lead the training market for some time given their performance dominance. Competing in this market will be challenging. Instead, stakeholders should look to focus on inference, which will be the fastest growing workload in AI as enterprises look beyond model development and toward commercial outcomes. Companies like Recogni have seen this and built Software Development Kits (SDKs) to make it easier to port NVIDIA-trained models onto their “inference-specific” hardware. Inference SDKs that enable developers to quickly port “trained models” from CUDA to other environments will lower barriers to exit and support the business case for migration.
  • Support across the Distributed Compute Continuum: ABI Research forecasts that inference workloads will slowly migrate toward the edge or device for cost, performance, security, and other reasons. Inference KPIs at the edge and the cloud differ significantly: the cloud is focused on performance-to-cost (Tera Operations per Second (TOPS)/US$ or TOPS/Watt), while the edge or device focuses on a range of metrics, including cost, privacy, security, and performance. Chip vendors must provide domain-specific tools that enable deep model optimization for specific workloads; partnering with companies like Edge Impulse may be helpful in providing these tools.
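
The performance-to-cost KPIs named in the last bullet (TOPS/US$ and TOPS/Watt) are simple ratios, and a short sketch can show why the choice of KPI matters. The Python below is an illustration only: the `Accelerator` class, the `rank_by` helper, and every figure in `CANDIDATES` are hypothetical and do not describe any vendor's hardware. Peak TOPS is also only a rough proxy for delivered inference performance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    """Hypothetical accelerator description used only for this sketch."""
    name: str            # made-up label, not a real product
    tops: float          # peak tera operations per second
    price_usd: float     # unit price in US$
    power_watts: float   # board power in watts

    @property
    def tops_per_dollar(self) -> float:
        return self.tops / self.price_usd

    @property
    def tops_per_watt(self) -> float:
        return self.tops / self.power_watts


# Illustrative figures only (assumptions, not measurements).
CANDIDATES = [
    Accelerator("cloud-card-a", tops=1000.0, price_usd=20000.0, power_watts=500.0),
    Accelerator("cloud-card-b", tops=400.0, price_usd=5000.0, power_watts=250.0),
    Accelerator("edge-module-c", tops=40.0, price_usd=400.0, power_watts=10.0),
]


def rank_by(candidates, metric):
    """Return candidate names sorted from best to worst on the named metric."""
    ordered = sorted(candidates, key=lambda a: getattr(a, metric), reverse=True)
    return [a.name for a in ordered]


if __name__ == "__main__":
    for a in CANDIDATES:
        print(f"{a.name}: {a.tops_per_dollar:.3f} TOPS/US$, {a.tops_per_watt:.2f} TOPS/W")
    print("By TOPS/US$:", rank_by(CANDIDATES, "tops_per_dollar"))
    print("By TOPS/W:  ", rank_by(CANDIDATES, "tops_per_watt"))
```

With these made-up inputs, the two cloud cards swap places depending on whether cost or power is the denominator, which is one reason a single headline KPI is rarely enough when comparing deployment targets.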

As NVIDIA continues to build out its software armory, competitors must follow suit and invest appropriately in R&D or third-party partnerships to build out an enticing software value proposition that creates traction with developers and enterprise customers.

