Moving Multimodal AI to the Edge

1Q 2019 | IN-5387

Multimodal AI is a burgeoning field of AI that involves the combination of different modalities of data to build unique systems. Today, most multimodal AI systems are supported in the cloud, but as they move to the edge they will drive purchases of heterogeneous edge systems and could potentially drive demand for custom multimodal processors.


What is Multimodal and Why is it Shifting to the Edge?


Modality refers to the way in which something happens or is experienced. In AI training and inference, a modality can be thought of as a single type of data input, such as sound, vision, language, or any other kind of sensor data. Multimodal then refers to the incorporation or interaction of multiple modalities of data in a single system or application. From the 1980s until the 2010s, multimodal systems were defined by rules-based or heuristic techniques. Since 2010, with developments in the field of deep learning, deep neural networks have begun to be incorporated into multimodal systems, making them more robust, accurate, and capable of generating unique insights in situations where many parameters are simultaneously at play. An example of a multimodal AI system is a voice assistant, which has to combine audio data with a catalogue of natural language data and then make a decision on how best to respond.
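The idea of combining modality outputs into a single decision can be sketched in a few lines. The example below is purely illustrative (the function, class labels, and weights are all hypothetical, not taken from any particular product): each modality model produces per-class confidence scores, and a fusion step combines them before a decision is made.

```python
# Illustrative late-fusion sketch: two modalities (audio and language)
# each score a set of candidate intents; a weighted fusion step combines
# the scores and picks the winning intent. All names and weights here
# are assumptions for the sake of the example.

def fuse_modalities(audio_scores, text_scores, audio_weight=0.4):
    """Weighted late fusion of two per-class confidence dictionaries."""
    text_weight = 1.0 - audio_weight
    classes = set(audio_scores) | set(text_scores)
    fused = {
        c: audio_weight * audio_scores.get(c, 0.0)
           + text_weight * text_scores.get(c, 0.0)
        for c in classes
    }
    # Decision step: pick the class with the highest fused confidence.
    return max(fused, key=fused.get), fused

# Example: the audio model and the language model both weigh in on intent.
intent, scores = fuse_modalities(
    {"play_music": 0.7, "set_alarm": 0.3},
    {"play_music": 0.9, "set_alarm": 0.1},
)
```

Real systems increasingly replace the fixed weights with a learned fusion network, but the structure — independent perception per modality, then a combining layer, then a decision — is the same.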

Today, as in the case of most voice assistants, multimodal systems are mostly implemented in the cloud or on an enterprise's regional servers. Implementing multimodal AI at the cloud or server level is sufficient for certain use cases, such as voice assistants, because they are not mission critical. Where there is sensitivity around data privacy, however, or an application is mission critical or must run in real time, moving processing of the model to the edge will be necessary. A move to the edge could also be driven by enterprises not wishing to rely on potentially expensive connectivity options, or wanting to avoid continued build-out of the internal server systems that support these applications. Vendors in the robotics and automotive markets are already beginning to implement edge multimodal AI systems to support interactions between vehicles and drivers. These implementations, however, are throwing up unique challenges that could create market opportunities for those players that can best meet them with quality solutions.

What Are the Core Challenges of Edge Multimodal AI?


Multimodal AI is increasingly being developed in layers, where some elements of a system are heuristic or rules-based and others rely on deep neural networks (DNNs). The multi-layered nature of these systems makes even scheduling and sequencing tasks on a compiler difficult. As such, multimodal AI does not only require sophisticated software for integration; it also has specific hardware requirements, particularly when implemented at the edge. Developers must perform optimized DNN inference potentially at both the perception level and at the system level, where the outputs of different perception models are analyzed together. DNN inference involves completing many small calculations rapidly, so processors with parallel architectures are well suited to the task; GPUs, FPGAs, and, increasingly, ASICs designed for DNN processing are often used for this reason. This is not the only computation problem at hand, as multimodal AI systems often have layers that rely on more traditional heuristic calculation, creating a need for compute that handles this type of work well — a task to which a CPU is suited. There is also the problem of effectively scheduling these processes in time so that their interactions are synchronized; again, these calculations suit a more traditional CPU-type architecture. The combination of DNN processing, heuristic processing, and sophisticated task scheduling lends itself to a processor that incorporates several compute architectures, each suited to a different task. This type of chip, often called a heterogeneous chip, is already commonplace in consumer smartphones and a number of system-on-chip microprocessors.
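The coordination problem described above — a parallel perception stage feeding a heuristic CPU stage, with the two kept in sync — can be sketched at a very high level. The code below is a toy model, not a real inference pipeline: the "accelerator" is a stand-in for DNN hardware, the scores are fabricated, and the threshold rule is an arbitrary example of a heuristic layer.

```python
# Hedged sketch of the heterogeneous-compute pattern: a perception stage
# (stand-in for DNN inference on a GPU/ASIC) and a rules-based stage
# (CPU heuristic) run concurrently, synchronized through a queue.
# Workloads, scores, and the threshold are illustrative assumptions.

import queue
import threading

def accelerator_worker(frames, out_q):
    """Stand-in for DNN inference offloaded to a parallel processor."""
    for frame in frames:
        # Fabricated per-frame detection score in [0, 1].
        out_q.put({"frame": frame, "score": (frame % 5) / 4})
    out_q.put(None)  # sentinel: no more perception results

def rules_layer(out_q, alerts):
    """Heuristic CPU stage: apply a simple threshold rule to results."""
    while (detection := out_q.get()) is not None:
        if detection["score"] > 0.7:
            alerts.append(detection["frame"])

results_q = queue.Queue()
alerts = []
producer = threading.Thread(target=accelerator_worker,
                            args=(range(10), results_q))
consumer = threading.Thread(target=rules_layer, args=(results_q, alerts))
producer.start(); consumer.start()
producer.join(); consumer.join()
```

On a heterogeneous chip, the scheduler's job is the hardware analogue of this queue: keeping the accelerator fed while the CPU-side heuristics consume its outputs without stalling either side.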

Who Are the Key Players Moving to Capitalize on the Emerging Space?


Modern multimodal AI applications implemented at the edge will drive demand for heterogeneous processors, as these meet the mixed computational requirements of inference. None of the major chip companies today are focusing on the specific challenge posed by multimodal AI edge inference, but those already building heterogeneous processors are naturally at an advantage in addressing the market opportunity here. Companies building or designing smartphone chips that incorporate AI processors, such as Qualcomm, Huawei, Samsung, Apple, and Google, are very well placed to turn their attention to multimodal edge AI inference processing. Given that these companies are squarely focused on the smartphone market, however, they may miss opportunities to build custom systems that can address emerging applications in the robotics, consumer, and automotive markets. The earliest signs of this activity have been seen in China, where chip company UbiSound has launched a number of what it calls multimodal AI chips, combining image and sound recognition in vehicles, consumer devices, and smart city surveillance contexts. These chips will be available in the third quarter of 2019, and their uptake, along with further details about their architecture, should shed more light on the direction of the market. UbiSound already provides voice recognition chips to vendors in China such as Gree, Haier, Hisense, LeEco, and Canbot, and claims to support more than 100 million devices today. ABI Research anticipates that more players will address the multimodal AI space with their own custom chips toward the end of the year. If the established players in the heterogeneous space do not consider how best to respond to companies like UbiSound, they could quickly lose out to them.

On the software side, after experiencing many issues developing and implementing multimodal AI systems at the edge, Microsoft recently launched an open source code library called Platform for Situated Intelligence (PSI). PSI provides software developers with the tools to visualize and coordinate multimodal applications. The project has had a slow start on GitHub, picking up only 18 commits since its launch 11 months ago, and its user base is still mainly confined to Microsoft and a number of academic institutions such as Carnegie Mellon, Boise State University, and Northwestern University. Multimodal AI currently suffers from the fact that most systems are built in-house and remain proprietary, so the knowledge and toolsets for building them are not widely disseminated. PSI should help lower this barrier to entry for building commercial multimodal AI systems.

Strategically, Microsoft is trying to gain influence in the emerging domain of multimodal AI. In terms of influence in the AI community, Microsoft lags Google and Facebook, which control the TensorFlow framework and the PyTorch and Caffe2 frameworks, respectively. Microsoft's effort with PSI could help it become the main influencer in multimodal AI development, the same way that Google is with TensorFlow and deep learning more generally. Microsoft is also using PSI to encourage use of its Azure software perception tools, which it can then charge for, thereby creating a commercial incentive for the company.