Time To Build “Picks And Shovels” For Machine Learning

Many multi-billion-dollar companies have been built by providing tools to make software development easier and more productive. Venture capitalists like to refer to businesses like these as “pick and shovel” opportunities, a reference to Mark Twain’s famous line: “When everyone is looking for gold, it’s a good time to be in the pick and shovel business.”

Atlassian, which offers a suite of software development and collaboration tools, has a public market capitalization above $30B. GitHub, a code repository, was acquired for $7.5B by Microsoft in 2018. Pivotal, which accelerates app development and deployment, was valued at $2.7B in VMware’s acquisition last year. Many more of today’s hottest high-growth startups—LaunchDarkly, GitLab, HashiCorp—offer tools for software development.

Each of these companies’ tools is built for “traditional” software engineering. In recent years, an entirely new paradigm for software development has burst onto the scene: machine learning.

Building a machine learning model is radically different from building a traditional software application. It involves different activities, different workflows and different skillsets. Correspondingly, there is a need—and opportunity—to build a whole new generation of software tools. The reward for developing this next wave of “picks and shovels”: many billions of dollars of enterprise value.

How exactly do traditional software development and machine learning differ? In traditional software development, the core task is writing code. The human programmer’s job is to craft an explicit set of instructions to tell the software program what to do given different contingencies. For software programs of any sophistication, the volume of human-written code can be immense. The Internet browser Google Chrome has 6.7 million lines of code; the operating system Microsoft Windows 10 reportedly has 50 million.

On the other hand, the fundamental premise of machine learning (as its name suggests) is that the program learns for itself how to act, by ingesting and analyzing troves of data. Human programmers need not write large volumes of rules (code) to guide the software’s actions.

In this regime, the core set of tasks for software engineers is completely different. Their primary activities become, instead, to prepare datasets for the machine learning model to ingest and learn from; to establish the overall parameters that will guide the model’s learning process; and to evaluate and monitor the model’s performance once it has been trained.
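
To make the contrast concrete, here is a deliberately tiny, illustrative sketch in Python (the example task, feature values and labels are hypothetical). In traditional software, the developer writes the decision rules by hand; in machine learning, the developer prepares labeled data, trains a model and evaluates it, and the rules are learned from the data.

```python
# Illustrative contrast between the two paradigms (toy data, not a real system).
from sklearn.linear_model import LogisticRegression

# Traditional software engineering: explicit, human-written decision logic.
def approve_loan_by_rules(income: float, debt: float) -> bool:
    return income > 50_000 and debt / income < 0.4

# Machine learning: a similar decision learned from labeled historical examples.
X = [[60_000, 10_000], [30_000, 20_000], [80_000, 5_000], [40_000, 25_000]]  # [income, debt]
y = [1, 0, 1, 0]  # 1 = approved, 0 = declined (toy labels)

model = LogisticRegression().fit(X, y)    # the "learning" step
print(model.predict([[55_000, 12_000]]))  # the learned logic makes the decision
```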

The developer tools—the “picks and shovels”—to streamline and enhance this set of activities will look quite different from those built to support traditional software engineering.

At present, a mature ecosystem of machine learning developer tools simply does not exist; machine learning itself remains a nascent discipline, after all. As a result, ML practitioners generally manage their workflows in ad-hoc ways: in documents saved on their local hard drives, in sequentially-adjusted file names, even by hand. These methods are not sustainable or scalable for production-grade machine learning deployments.

This market gap represents a massive opportunity. In the years ahead, billions of dollars of enterprise value will be created by providing tools for the machine learning development pipeline. Below, we walk through a few key categories in which these tools will be needed.

DATA LABELING
The dominant ML approach at present is known as supervised learning, which requires a label to be attached to each piece of data in a dataset in order for the model to learn from it. (Think, for instance, of a cat photo accompanied by a text label that says “cat”.) The process of creating these labels is tedious and time-consuming.

A crop of startups has emerged to handle the unglamorous work of affixing labels to companies’ corpuses of data. These startups’ business models often rely on labor arbitrage, with large forces of workers labeling data by hand in low-cost parts of the world like India. Some players are working on technology to automate parts of the labeling process.

The most prominent company in this category is Scale AI, which focuses on the autonomous vehicle sector. Scale recently raised $100M at a ~$1B valuation from Founders Fund. Other data-labeling players include Labelbox, DefinedCrowd and Figure Eight (now owned by Appen).

It is unclear how durable these businesses will be over the long term. As has been previously argued in this column, the need for massive labeled datasets may fade as the state of the art in AI races forward.

DATASET AUGMENTATION
More toward the cutting edge of machine learning, researchers and entrepreneurs are working on a set of innovations to reduce the amount of real-world data needed to train models and to enhance the value of existing datasets.

One of the most promising of these is synthetic data, a technique that allows AI practitioners to artificially fabricate the data that they need to train their models. As synthetic data increases in fidelity, it will make machine learning dramatically cheaper and faster, opening up myriad new use cases and business opportunities.
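
A deliberately simple illustration of the idea, assuming a toy numeric dataset: fit a distribution to a small amount of real data, then sample as many new training points as needed. Production synthetic-data systems, for example those built for autonomous driving, rely on physics simulators and generative models rather than this toy approach.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
real_measurements = np.array([12.1, 11.8, 12.4, 12.0, 11.9])  # hypothetical sensor readings

# Fit a simple distribution to the real data, then fabricate unlimited samples from it.
mu, sigma = real_measurements.mean(), real_measurements.std()
synthetic_measurements = rng.normal(mu, sigma, size=10_000)
print(synthetic_measurements[:5])
```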

The first commercial use case to which synthetic data has been applied at scale is autonomous vehicles; startups focusing here include Applied Intuition, Parallel Domain and Cognata. Other companies, like recently-launched Synthesis AI, are seeking to build synthetic data toolkits for computer vision more broadly.

A related category can be thought of as “data curation”: tools that evaluate and modify datasets pre-training to optimize the cost, efficiency and quality of model training runs. Gradio and Alectio are two promising early-stage startups pursuing this opportunity.

A final set of data tools likely to become increasingly valuable relates to “semi-supervised learning”, an emerging technique that trains models by leveraging a small amount of labeled data together with large volumes of unlabeled data. Semi-supervised learning holds much promise because unlabeled data is vastly cheaper and easier to come by than labeled data. Snorkel.ai, out of Stanford University, is one project generating buzz in this space.
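
One common semi-supervised technique is “self-training” (pseudo-labeling): train on the small labeled set, label the unlabeled data with the model’s own high-confidence predictions, and retrain. The sketch below uses synthetic data and scikit-learn purely for illustration; it is not how any particular vendor implements the approach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]   # small, expensive labeled set
X_unlabeled = X[50:]                    # large, cheap unlabeled set

model = LogisticRegression().fit(X_labeled, y_labeled)

# Pseudo-label the unlabeled points the model is confident about.
probs = model.predict_proba(X_unlabeled).max(axis=1)
confident = probs > 0.9
X_pseudo = X_unlabeled[confident]
y_pseudo = model.predict(X_unlabeled)[confident]

# Retrain on the labeled and pseudo-labeled data combined.
model = LogisticRegression().fit(
    np.vstack([X_labeled, X_pseudo]),
    np.concatenate([y_labeled, y_pseudo]),
)
```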

MODEL OPTIMIZATION
Once an organization has prepared its dataset, a key next step is to establish the specifications of the machine learning model into which the data will be fed. This includes decisions such as which class of algorithm to use and which hyperparameters to set in order to optimize outcomes like accuracy and training time. To give a concrete example, for a deep learning model, such specifications would include how many layers the neural network will have and how many “training epochs” the model will run before stopping.
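
A minimal sketch of what those specifications look like in practice, using the Keras API with placeholder data: the number of layers and the number of training epochs are choices the engineer makes before training begins.

```python
import numpy as np
from tensorflow import keras

X = np.random.rand(500, 10)        # placeholder training data
y = np.random.randint(0, 2, 500)   # placeholder labels

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(64, activation="relu"),   # hidden layer 1
    keras.layers.Dense(32, activation="relu"),   # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"), # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5)          # number of passes over the data before stopping
```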

The process of establishing these model specifications can be incredibly complex. New software tools can support and partially automate this process. One well-known company developing tools for model optimization is SigOpt, with a particular focus on hyperparameter optimization. Most of the “end-to-end” machine learning platforms (see section below) offer a similar set of features.
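
The workflow these tools automate can be sketched with scikit-learn’s built-in grid search over toy data: propose hyperparameter settings, train, evaluate, repeat. Commercial products such as SigOpt use far more sophisticated search strategies, but the loop they replace looks like this.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Try each combination of hyperparameters and keep the best-performing one.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5, None]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```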

“Firms are realizing that they don’t need to reinvent the wheel when it comes to state-of-the-art modeling,” said Scott Clark, CEO and founder of SigOpt. “There are a wave of enterprise solutions that are transforming modeling from an art into more of a science. The best AI teams are taking advantage of these tools today and are seeing a massive impact on their ability to efficiently deliver results.”

EXPERIMENT TRACKING & VERSION CONTROL
Machine learning entails a great deal of trial and error. There are various “knobs” that practitioners continuously adjust in efforts to improve their models. Input variables include the dataset, the parameters and hyperparameters, the training code, the software environment and the hardware used. The key output to track is the model itself, along with associated performance results.

ML researchers may run dozens or even hundreds of experiments for a given model. It is essential that companies keep systematic track of these model experiments for purposes of reproducibility and accountability. Such versioning will only become more essential as privacy regulations like GDPR roll out and concerns about the explainability of AI models proliferate.
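
A minimal, tool-agnostic sketch of the underlying idea: record the “knobs” (dataset version, hyperparameters, code version) and the resulting metrics for every run, so that any result can later be reproduced or audited. The function and file name below are hypothetical; dedicated products provide this as a managed service with richer features.

```python
import json
import subprocess
import time

def log_experiment(config: dict, metrics: dict, path: str = "experiments.jsonl") -> None:
    record = {
        "timestamp": time.time(),
        # Version of the training code (empty if not run inside a git repository).
        "git_commit": subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True
        ).stdout.strip(),
        "config": config,    # dataset version, hyperparameters, environment, ...
        "metrics": metrics,  # accuracy, loss, training time, ...
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage after a training run (values are illustrative):
log_experiment(
    config={"dataset": "v3", "learning_rate": 0.001, "epochs": 20},
    metrics={"val_accuracy": 0.91},
)
```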

Startups building tools to aid in experiment tracking and version control include Weights & Biases, Comet, Verta.ai and SigOpt.

“If software is eating the world, machine learning is eating software,” said Lukas Biewald, CEO and founder of Weights & Biases. “I got passionate about ML experiment tracking because it is underappreciated in the same way that data labeling was ten years ago.” (In 2007 Biewald founded the early data-labeling company CrowdFlower, later renamed Figure Eight, which was acquired by Appen last year for $300M.)

“I believe that in the same way no company today would dream of writing code without version control, in the near future no company will dream of building ML models without experiment tracking.”

MODEL DEPLOYMENT AND MONITORING
For those looking to drive real business value with machine learning, building a model is just the beginning. A trained ML model sitting in a Jupyter notebook on a developer’s local machine may be impressive technology, but in order for it to have commercial impact it must be deployed into production.

Deploying ML models at scale is a major challenge, in some ways more complex than building the model in the first place. Depending on the use case, machine learning workloads may need to run robustly in diverse hardware environments spanning multi-cloud, hybrid cloud and the edge. Efficient management of computing resources becomes increasingly important as model use scales.

Algorithmia and Seldon are two promising startups offering tools for ML model deployment. Seldon’s solution is architected specifically for Kubernetes, the open-source container orchestration platform that has become wildly popular among ML practitioners.
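
The first step of deployment can be sketched as wrapping a trained model in an HTTP prediction service; the model file name and feature format below are hypothetical, and a numeric prediction is assumed. Tools like Seldon and Algorithmia address the harder parts: packaging services like this into containers, scaling them across clusters and clouds, and routing traffic to them.

```python
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained, serialized model (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    prediction = model.predict([features])[0]   # assumes a numeric prediction
    return jsonify({"prediction": float(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```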

After a model goes into production, its owners must monitor it on an ongoing basis to ensure that it continues to perform as expected. Many models will go “stale” over time as external conditions change and the dataset that the model was trained on obsolesces. Tools for monitoring and retraining will become essential as more companies put machine learning models into production.
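
One common monitoring check can be sketched as a statistical comparison between the distribution of a feature in live traffic and the distribution the model was trained on, flagging the model for retraining if they have drifted apart. The data and threshold below are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, size=5000)  # what the model saw at training time
live_feature = rng.normal(0.4, 1.0, size=5000)      # what it is seeing in production now

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the distributions differ.
statistic, p_value = ks_2samp(training_feature, live_feature)
if p_value < 0.01:
    print(f"Feature drift detected (KS statistic {statistic:.3f}); consider retraining.")
```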

END-TO-END PLATFORMS
In the preceding discussion, we have broken out various segments of the overall machine learning lifecycle and discussed tools that can support those individual segments. An alternative approach is to develop an “end-to-end” machine learning platform that encompasses most or all of this lifecycle in a single solution. A varied set of competitors is working to build such an end-to-end platform.

The major cloud providers each offer some version of this. AWS’ SageMaker is likely the most sophisticated. Google Cloud officially released its ML platform, Cloud AI Platform Pipelines, just this month.

Robust open-source options also exist. Perhaps the most popular is Kubeflow, originally developed internally at Google. The Kubeflow team recently announced a major new release.

Others in the end-to-end category include well-funded late-stage companies like DataRobot and H2O.ai. These players provide data science platforms with a broad set of capabilities, extending from data management at the front end to model monitoring at the back end. What they offer in breadth, however, they lack in depth: these companies’ offerings are often perceived as “starter” solutions for machine learning, not suitable for building differentiated, complex or cutting-edge models.

Among early-stage startups, it is common for teams to begin by building a tool for a particular part of the overall ML pipeline, but to aspire to broaden this offering into a full end-to-end solution. Several of the startups mentioned above are working to evolve their product positioning from point solution to comprehensive platform.

It is difficult to say whether the ML ecosystem will trend over time toward all-in-one platforms or more specialized tools. Today’s developer tool ecosystem suggests that there will always be some appetite among programmers to patch together a customized set of point solutions from various providers depending on particular needs, workflows and preferences. The market is large enough, and the types of ML use cases diverse enough, that end-to-end platforms and specialized tools will likely thrive side by side.

CONCLUSION
AI is the most important “gold rush” of our era. To date, the picks and shovels required to support this field remain underdeveloped. This represents a massive opportunity for entrepreneurs.

Software development tools have always been an essential part of the digital economy and have always represented a massive market. As we enter the next generation of software development, oriented around machine learning, the ultimate purpose of these tools will remain the same: improved productivity, efficiency, quality assurance and collaboration.

But because human programmers’ activities and workflows are fundamentally different in the era of machine learning than they are in traditional software engineering, a whole new ecosystem of tools will need to be built.

For now, it remains anyone’s guess where the value generated by this new ecosystem of tools will accrue. A new crop of promising startups is pursuing these opportunities, but there is no guarantee that these challengers will prevail. Major tech companies and cloud providers like Amazon, Google and Microsoft are rolling out competitive offerings. Their sheer size gives them a formidable distribution advantage. Open-source tools like Kubeflow likewise continue to grow in popularity.

originally posted on forbes.com by Rob Toews