Amazon, Google, IBM, and Microsoft Open Source IoT Machine Learning and Shift Focus from Proprietary Technology to Proprietary Data


By Ryan Martin | 3Q 2016 | IN-4163

At more than $1.4 trillion, Amazon, Google, IBM, and Microsoft have a combined market cap that dwarfs the annual gross domestic product (GDP) of more than 90% of countries in the world. Each has also open sourced its own deep learning library in the past 12 to 18 months.



Machine Learning for the Masses



This includes Amazon’s Deep Scalable Sparse Tensor Network Engine (DSSTNE; pronounced “destiny”), the same foundation that powers the company’s product recommendation capabilities; Google’s TensorFlow, the company’s second-generation machine learning system; IBM’s SystemML, contributed to Apache Spark and offered as a service through Bluemix; and Microsoft’s Computational Network Toolkit (CNTK), which followed the release of its Distributed Machine Learning Toolkit (DMLT).

Baidu (Warp-CTC), Facebook (through its Facebook AI Research arm, known as "FAIR"), and OpenAI (backed by Elon Musk, Reid Hoffman, Peter Thiel, etc.) are among a growing list of players also in the open-source artificial intelligence (AI)/machine learning (ML) race.

Write Once, Run Anywhere


Machine learning is the study of algorithms that learn from examples and experience instead of relying on hard-coded rules, which do not always adapt well to real-world environments.

In the case of image recognition, for example, a hard-coded rule could attempt to identify a piece of fruit by counting the pixels of a certain color and measuring the ratio of that color relative to others in the image. But this would not work on a black-and-white photo, or when the identifying color is absent from the picture. Additional hard-coded rules could patch specific scenarios in which the model breaks down, but their utility would remain that of a point solution, since the whole process would need to be replicated from scratch to achieve the same functionality in a different application.
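The brittleness described above can be sketched in a few lines of Python. The pixel values and the `classify_fruit()` rule are hypothetical, chosen purely to illustrate how a hand-picked color threshold fails outside the scenario it was written for:

```python
def orange_pixel_ratio(pixels):
    """Fraction of pixels that fall in a hand-picked 'orange' RGB range."""
    def is_orange(rgb):
        r, g, b = rgb
        return r > 200 and 100 < g < 180 and b < 100
    return sum(is_orange(p) for p in pixels) / len(pixels)

def classify_fruit(pixels):
    """Hard-coded rule: 'orange' if enough orange-ish pixels, else 'apple'."""
    return "orange" if orange_pixel_ratio(pixels) > 0.5 else "apple"

# Works on the scenario the rule was written for...
orange_photo = [(230, 140, 40)] * 80 + [(255, 255, 255)] * 20
print(classify_fruit(orange_photo))      # -> orange

# ...but fails on a grayscale photo of the same fruit: every pixel is gray,
# so the color ratio is 0 and the rule confidently answers "apple".
grayscale_photo = [(128, 128, 128)] * 100
print(classify_fruit(grayscale_photo))   # -> apple
```

Each new failure mode (lighting, occlusion, unusual varieties) would demand yet another hand-written rule, which is exactly the scaling problem learning from examples avoids.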

The various techniques used to develop machine learning algorithms fall under two broad categories (detailed in ABI Research’s Machine Learning in IoT report):

  • How they learn: based on the type of input data provided to the algorithm. Examples include supervised learning, unsupervised learning, reinforcement learning, and semi-supervised learning.
  • How they work: based on the type of operation, task, or problem performed on I/O data. Examples include classification, regression, clustering, anomaly detection, and recommendation engines.
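The two axes above can be illustrated with a toy sketch on hypothetical 1-D sensor data: the first function learns from labeled examples (supervised) and performs classification, while the second sees only raw, unlabeled values (unsupervised) and performs clustering. Both the data and the algorithms (nearest neighbor, a minimal 2-means) are illustrative assumptions, not any vendor's implementation:

```python
def nearest_neighbor_classify(labeled_examples, x):
    """Supervised classification: label x like its closest labeled example."""
    nearest = min(labeled_examples, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def two_means_cluster(values, iterations=10):
    """Unsupervised clustering: split unlabeled 1-D values into two groups (2-means)."""
    a, b = min(values), max(values)  # initialize centroids at the extremes
    for _ in range(iterations):
        group_a = [v for v in values if abs(v - a) <= abs(v - b)]
        group_b = [v for v in values if abs(v - a) > abs(v - b)]
        a = sum(group_a) / len(group_a)  # move each centroid to its group's mean
        b = sum(group_b) / len(group_b)
    return sorted(group_a), sorted(group_b)

# Supervised: temperature readings labeled by an expert
labeled = [(20.0, "normal"), (22.5, "normal"), (80.0, "overheated"), (85.0, "overheated")]
print(nearest_neighbor_classify(labeled, 78.0))      # -> overheated

# Unsupervised: the same kind of readings, but with no labels at all
print(two_means_cluster([20.0, 22.5, 80.0, 85.0]))   # -> ([20.0, 22.5], [80.0, 85.0])
```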

Early AI programs were one-trick ponies; each typically excelled at just one thing, like playing chess at a champion level (e.g., IBM's Deep Blue supercomputer vs. chess Grandmaster Garry Kasparov, 1996 to 1997) or Jeopardy! on live TV (e.g., IBM Watson vs. Jeopardy! champions Ken Jennings and Brad Rutter, 2011), but not much else. Today, the goal is to write one program that can solve many problems without the need to be rewritten: write once, run anywhere.

A classifier can be trained to automate the creation of rules for a model. It can also be thought of as a function, in the sense that it takes data as input and assigns a label as output (e.g., classify an image as an apple or an orange, or an email as Spam or Not Spam). The challenge is that learning and implementing the complex algorithms required to build ML models can be difficult and time-consuming, on top of the cost of deploying and managing the enabling infrastructure.
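The "classifier as a function" idea can be made concrete with a toy spam filter: trained on examples, it maps an input (email text) to a label (Spam / Not Spam). The training data and the word-count scoring below are hypothetical assumptions for illustration, not any vendor's method:

```python
from collections import Counter

def train(examples):
    """Learn per-label word counts from (text, label) pairs - no hand-written rules."""
    counts = {"Spam": Counter(), "Not Spam": Counter()}
    for text, label in examples:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score each label by how often it has seen the email's words; pick the best."""
    def score(label):
        return sum(counts[label][w] for w in text.lower().split())
    return max(counts, key=score)

examples = [
    ("win a free prize now", "Spam"),
    ("free money claim now", "Spam"),
    ("meeting agenda for monday", "Not Spam"),
    ("monday project status report", "Not Spam"),
]
model = train(examples)
print(classify(model, "claim your free prize"))          # -> Spam
print(classify(model, "status report for the meeting"))  # -> Not Spam
```

Note that the "rules" (the word counts) were created automatically from examples; adding more training data updates the model without anyone rewriting logic by hand.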

Engaging the open-source community can accelerate the development and integration of machine learning technologies by an order of magnitude, without requiring companies to expose their proprietary data.

Proprietary Data > Proprietary Technology


A general rule of thumb is that more training data leads to better classifiers, and that better classifiers improve the validity and reliability of ML models. But it is also important to understand how a particular ML model will perform on new data in a real-world environment. Companies (or, more specifically, data scientists) often turn to the practice of splitting a dataset into two random partitions (e.g., 70/30, 60/40, 50/50), training the model on one partition and testing it against the other.
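The hold-out practice above can be sketched in a few lines: shuffle, split 70/30, fit on the training partition, and evaluate on the held-out partition. The toy sensor dataset and the one-parameter threshold "model" are hypothetical, for illustration only:

```python
import random

def train_test_split(dataset, train_fraction=0.7, seed=42):
    """Randomly (but reproducibly) partition a dataset, e.g., 70/30."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]

def fit_threshold(train_set):
    """'Train' a one-parameter model: the midpoint between the two class means."""
    faulty = [x for x, y in train_set if y == 1]
    normal = [x for x, y in train_set if y == 0]
    return (sum(faulty) / len(faulty) + sum(normal) / len(normal)) / 2

def accuracy(threshold, test_set):
    """Evaluate on data the model never saw during training."""
    correct = sum((x > threshold) == (y == 1) for x, y in test_set)
    return correct / len(test_set)

# Toy sensor readings: label 1 = fault, 0 = normal
dataset = [(t, 0) for t in range(20, 40)] + [(t, 1) for t in range(70, 90)]
train_set, test_set = train_test_split(dataset)  # 70/30 split
threshold = fit_threshold(train_set)
print(round(accuracy(threshold, test_set), 2))   # -> 1.0
```

Testing against the held-out partition, rather than the data the model was fit on, is what gives an honest estimate of real-world performance.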

This process is fairly straightforward in IoT machine learning, since it is easier to enumerate "thing" input data than it is to codify the linguistic structure of the world's more than 6,500 languages. While companies like Amazon, Google, Microsoft, and Nuance have a solid footing in natural language understanding (NLU), not even the best commercially available syntactic parsers can marry "the human element" with "thing" technologies in a meaningful way. The open-source release of SyntaxNet, the neural network framework Google implemented in TensorFlow that includes code to train SyntaxNet models using proprietary data, is an early example of efforts to jumpstart the digital-physical convergence depicted in ABI Research's Internet of Everything Market Tracker. Add Amazon, IBM, and Microsoft to the mix, and we're talking about four companies that turned to collaborative economics to accelerate AI/ML development beyond the capacity of the 750,000+ people they employ.