Questioning The Long-Term Importance Of Big Data In AI

No asset is more prized in today’s digital economy than data. Referring to data as “the new oil” has become so widespread that it is now a cliché. As one recent Economist headline put it, data is “the world’s most valuable resource.”

Data is so highly valued today because of the essential role it plays in powering machine learning and artificial intelligence solutions. Training an AI system to function effectively, from Netflix’s recommendation engine to Google’s self-driving cars, requires massive troves of data.

The result has been an obsession with bigger and bigger data. According to the prevailing wisdom, whoever has the most data can build the best AI. Incumbents from IBM to General Electric are racing to re-brand themselves as “data companies.” SoftBank’s Vision Fund—the largest and most influential technology investor in the world—makes no secret of the fact that its focus when looking for startups to back is data assets. “Those who rule data will rule the world,” in the words of SoftBank leader Masayoshi Son.

As the business and technology worlds increasingly orient themselves around data as the ultimate kingmaker, too little attention is being paid to an important reality: the future of AI is likely to be far less data-intensive.

At the frontiers of artificial intelligence, various efforts are underway to develop improved forms of AI that do not require massive labeled datasets. These technologies will reshape our understanding of AI and disrupt the business landscape in profound ways. Industry leaders would do well to pay attention.

SYNTHETIC DATA
Today, in order to train deep learning models, practitioners must collect thousands, millions or even billions of data points. They must then attach labels to each data point, an expensive and generally manual process. What if researchers didn’t need to laboriously collect and label data from the real world, but instead could create the exact dataset they needed from scratch?

Leading technology companies—from established competitors like Nvidia to startups like Applied Intuition—are developing methods to fabricate high-fidelity data, completely digitally, at next to no cost. These artificially created datasets can be tailored to researchers’ precise needs and can include billions of alternative scenarios.

“It’s very expensive to go out and vary the lighting in the real world, and you can’t vary the lighting in an outdoor scene,” said Mike Skolones, director of simulation technology at Nvidia. But you can with synthetic data.
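
To make the idea concrete, here is a deliberately simplified Python sketch of what synthetic data generation looks like in principle. It is not representative of Nvidia’s or Applied Intuition’s actual tooling: tiny grayscale scenes are rendered procedurally under whatever lighting level the researcher chooses, and the labels come for free because the generator knows exactly what it placed in each image.

```python
# A toy sketch of synthetic data generation (illustrative only; not any vendor's
# actual pipeline). We render tiny grayscale "scenes" procedurally, varying a
# lighting parameter freely, and the label comes for free because the generator
# knows exactly what it drew.
import numpy as np

rng = np.random.default_rng(0)

def render_scene(contains_object: bool, lighting: float) -> np.ndarray:
    """Render a 32x32 grayscale scene under a chosen lighting level (0..1)."""
    img = rng.uniform(0.0, 0.2, size=(32, 32))        # background noise
    if contains_object:
        x, y = rng.integers(4, 24, size=2)            # random object position
        img[y:y + 6, x:x + 6] += 0.6                  # a bright square "object"
    return np.clip(img * lighting, 0.0, 1.0)          # apply global lighting

def generate_dataset(n: int):
    """Create n labeled examples on demand -- no manual labeling required."""
    images, labels = [], []
    for _ in range(n):
        label = int(rng.random() < 0.5)               # the label is chosen up front
        lighting = rng.uniform(0.3, 1.0)              # vary lighting at no extra cost
        images.append(render_scene(bool(label), lighting))
        labels.append(label)
    return np.stack(images), np.array(labels)

X, y = generate_dataset(10_000)                       # scale n as large as needed
print(X.shape, y.mean())                              # (10000, 32, 32), ~0.5 positives
```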

As synthetic data approaches real-world data in fidelity, it will democratize AI, undercutting the competitive advantage of proprietary data assets. If a company can quickly generate billions of miles of realistic driving data via simulation, how valuable are the few million miles of real-world driving data that Waymo has spent a decade collecting? In a world in which data can be inexpensively generated on demand, the competitive dynamics across industries will be upended.

FEW-SHOT LEARNING
Unlike today’s AI, humans do not need to see thousands of examples in order to learn a new concept. As an influential Google research paper put it, “A child can generalize the concept of ‘giraffe’ from a single picture in a book, yet our best deep learning systems need hundreds or thousands of examples.”

In order for machine intelligence to truly approach human intelligence in its capabilities, it should be able to learn and reason from a handful of examples the way that humans do. This is the goal of an important field within AI known as “few-shot learning.”

Exciting recent progress has been made on few-shot learning, particularly in the field of computer vision. (The technique is called one-shot learning when only a single example is available, and zero-shot learning when none are.) Researchers have developed AI models that, under the right circumstances, can achieve state-of-the-art performance on tasks like facial recognition based on one or a few data points.
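
Much of this work follows a recipe simple enough to sketch in a few lines of Python. In the toy example below (the embedding function is a crude stand-in, not a real face-recognition network), a new image is assigned the label of whichever single stored “support” example it sits closest to in embedding space; that is the essence of classifying from one example per class.

```python
# A minimal sketch of one-shot classification by nearest neighbor in an
# embedding space (the general recipe behind much few-shot work; the embedding
# function here is a stand-in, not a real face-recognition model).
import numpy as np

def embed(x: np.ndarray) -> np.ndarray:
    """Stand-in embedding function; in practice this is a pretrained network."""
    v = x.flatten().astype(float)
    return v / (np.linalg.norm(v) + 1e-9)

def one_shot_classify(query, support_examples, support_labels):
    """Assign the query the label of the closest single support example."""
    q = embed(query)
    sims = [q @ embed(s) for s in support_examples]   # cosine similarity
    return support_labels[int(np.argmax(sims))]

# One example per identity is enough to make a (rough) prediction.
rng = np.random.default_rng(1)
alice, bob = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
query = alice + 0.1 * rng.normal(size=(8, 8))         # a noisy new "photo" of alice
print(one_shot_classify(query, [alice, bob], ["alice", "bob"]))  # -> "alice"
```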

These advances remain mostly confined to the academic world for now. As small-data methods migrate from academia to commercial production in the coming years, however, they will fundamentally change the way AI is done—eroding the importance of big-data assets in the process.

“If you’re doing a visual inspection on smartphones, you don’t have a million pictures of scratched smartphones,” explained Andrew Ng, deep learning pioneer and former AI head of Google and Baidu. “If you can get something to work with just 100 or 10 images, it breaks open a lot of new applications.”

REINFORCEMENT LEARNING
A final AI method making important advances without the need for gobs of real-world data is reinforcement learning.

In reinforcement learning, an AI model learns not through brute-force data ingestion but through self-guided trial and error: it is let loose to experiment with different actions in a given environment, and it gradually optimizes its behavior as it receives feedback about which actions are advantageous and which are not.
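
Stripped to its essentials, that loop can be sketched in a few lines of Python. The toy example below uses tabular Q-learning on a five-cell corridor (a deliberately trivial environment, not any production system): the agent begins with an empty value table and no dataset of any kind, and improves purely from the reward signal it earns by bumping around.

```python
# A toy sketch of reinforcement learning: tabular Q-learning on a 5-cell corridor.
# The agent starts with zero data; it tries actions, receives a reward only when
# it reaches the goal cell, and gradually improves its behavior from that feedback.
import numpy as np

n_states, actions = 5, [-1, +1]             # states 0..4; actions: step left or right
Q = np.zeros((n_states, len(actions)))      # value table, filled in purely from experience
alpha, gamma, epsilon = 0.5, 0.9, 0.3       # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(1000):
    state = 0                                                # always start at the left end
    for _ in range(100):                                     # cap episode length
        if rng.random() < epsilon:                           # explore: try a random action
            a = int(rng.integers(len(actions)))
        else:                                                # exploit: best known action
            a = int(np.argmax(Q[state]))
        next_state = int(np.clip(state + actions[a], 0, n_states - 1))
        reward = 1.0 if next_state == n_states - 1 else 0.0  # feedback from the environment
        # Nudge the estimate toward reward plus discounted best future value.
        Q[state, a] += alpha * (reward + gamma * Q[next_state].max() - Q[state, a])
        state = next_state
        if state == n_states - 1:
            break

print(np.argmax(Q, axis=1))   # learned policy: action 1 ("step right") in every non-goal state
```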

One of the most widely publicized AI breakthroughs in recent years was powered by reinforcement learning: DeepMind’s defeat of the world’s best human players in the ancient game of Go.

DeepMind’s original model, AlphaGo, learned the game using a combination of historical data and reinforcement learning. But the truly remarkable achievement came with its more sophisticated successor, AlphaGo Zero. AlphaGo Zero was given absolutely no prior data other than the game’s rules. With no other input, simply by playing against itself, AlphaGo Zero learned the game of Go better than any human or machine ever had: it defeated the original AlphaGo 100-0.
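
AlphaGo Zero’s actual machinery of deep neural networks and Monte Carlo tree search is far too involved to reproduce here, but the self-play idea itself can be illustrated on a trivially small game. In the Python sketch below, a single tabular learner plays both sides of a miniature Nim variant; the only input it receives is the rules, yet it converges on the game’s known optimal strategy.

```python
# A toy sketch of learning a game purely from self-play (illustrative only; the
# game is a tiny Nim variant, nothing like Go, and the learner is a simple
# tabular method rather than AlphaGo Zero's network-plus-search architecture).
# The rules, and the only input: players alternate taking 1 or 2 counters from
# a pile, and whoever takes the last counter wins.
import numpy as np

max_pile = 10
Q = np.zeros((max_pile + 1, 2))          # Q[pile, action]; action 0 takes 1 counter, action 1 takes 2
alpha, epsilon = 0.5, 0.2                # learning rate and exploration rate
rng = np.random.default_rng(0)

def legal_actions(pile):
    return [0] if pile == 1 else [0, 1]

for game in range(10000):
    pile = int(rng.integers(1, max_pile + 1))     # random starting position
    while pile > 0:                                # the same learner plays both sides
        acts = legal_actions(pile)
        if rng.random() < epsilon:
            a = int(rng.choice(acts))
        else:
            a = max(acts, key=lambda x: Q[pile, x])
        new_pile = pile - (a + 1)
        if new_pile == 0:
            target = 1.0                           # this move takes the last counter: a win
        else:
            # The next position belongs to the opponent, so its value is the
            # negative of the opponent's best achievable value (negamax).
            target = -max(Q[new_pile, x] for x in legal_actions(new_pile))
        Q[pile, a] += alpha * (target - Q[pile, a])
        pile = new_pile

# From pile p the learned move is to take p % 3 counters whenever that is possible,
# which is the known optimal strategy for this game.
print([max(legal_actions(p), key=lambda x: Q[p, x]) + 1 for p in range(1, max_pile + 1)])
```

The key design choice mirrors the self-play setup: every position is evaluated from the perspective of the player to move, so a single value table improves both sides of the game at once.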

“Expert data sets are often expensive, unreliable or simply unavailable,” explained the AlphaGo Zero team. “By contrast, reinforcement learning systems are trained from their own experience, in principle allowing them to exceed human capabilities and to operate in domains where human expertise is lacking.”

Beyond board games, reinforcement learning is finding real-world applications in robotics, chemical engineering, advertising, and beyond. Reinforcement learning represents a novel approach within AI: rather than requiring massive preexisting datasets, it can generate its own data, learning as it goes. As it finds its way into commercial applications, it will pose yet another challenge to the orthodoxy of big data.

CONCLUSION
The world of artificial intelligence is in constant flux. As the field’s frontiers advance at breakneck speed, methodologies that are cutting-edge today can become dated tomorrow.

The dominant AI paradigm at present is deep learning, which relies on up to billions of labeled data points to train neural networks to recognize patterns and make predictions. Because neural networks are so data-hungry, business and technology leaders have become obsessed with amassing the largest datasets that they can, hoping that data will be their ultimate competitive advantage in an AI-driven world.

But deep learning is a waypoint on the long road ahead for AI, not its final destination. To base one’s long-term business strategy on present-day neural networks’ massive data needs is to fail to appreciate the paradigm shifts in AI that lie ahead. Recent advances in fields like synthetic data, few-shot learning and reinforcement learning make clear that as AI gets smarter in the years to come, it is likely to require less data, not more.

These new paradigms will reset the AI landscape and redefine the terms on which companies will compete. For forward-thinking business people and technologists, it will be a massive opportunity.

Originally posted on forbes.com by Rob Toews.