The Future Of The Cloud Lies In Analyzing All Those Petabytes

The modern cloud has made the storage and access of massive datasets routine. Companies today manage so much data in the cloud, with even hundreds of petabytes not unheard of, that the real challenge has become not how to store massive data in the cloud, but how to most easily, securely and powerfully extract insights from that data at scale. The future of the cloud therefore lies in how to help companies extract meaningful insight from their vast repositories of raw data.

It is a remarkable statement regarding the state of today’s cloud that even a decade ago companies like Google had so much storage capacity that their engineers could simply grab a petabyte of disk to run sorting benchmarks. Last year Twitter announced it was moving more than 300 petabytes of data to Google’s cloud. Even startups are now processing petabytes in the cloud.

For companies today, “big data” is no longer a buzzword; it reflects the reality that managing hundreds of terabytes or even petabytes is now considered a mundane, routine business task rather than a frontier research question.

Whether over high-speed networking or half-petabyte offline transfer appliances, shipping petabytes into the cloud is no longer an unsolved problem.

The problem is what to do with all of that data.

The challenge today is thus not storing data, but rather making sense of that data securely, at scale, in real-time, with advanced operators from geographic analysis to machine learning.

In Google’s cloud, the answer to this challenge is its BigQuery platform, which increasingly forms a central nexus of its cloud strategy, acting as an almost infinitely scalable real-time analytics platform, adding everything from GIS to machine learning.

All of these capabilities not only operate at data scale but are accessed through the same SQL language already familiar to companies’ data analysts, bringing these previously complex technologies into the world of plug-and-play ease.

The ability BigQuery provides to assess and analyze massive datasets cannot be overstated. The single greatest limiting factor in data analysis today is the ability to rapidly triage and explore large datasets in order to verify their contents, understand their applicability to different algorithms, and clean and prepare data for more advanced analyses like deep learning. Tools like BigQuery make it possible to interactively explore even the largest datasets and go beyond simple summaries like histograms towards full-fledged data-scale machine learning.

Finally, for those needing to apply the latest deep learning advances to their massive data archives, the cloud offers a vast range of solutions, from pre-built APIs ready to use right out of the box to bleeding edge experimental solutions on the very frontier of AI research.

Of course, the cloud’s limitless scale makes it possible to build traditional computing pipelines, spinning up even tens or hundreds of thousands of cores to plow through even the largest datasets, with workflow systems like Dataflow orchestrating the entire process.

Yet BigQuery’s simple SQL interface to advanced data-scale analytics, from GIS to machine learning, reminds us just what is possible in the modern cloud.

In many ways, BigQuery is essentially an intelligent storage fabric. Its pricing model charges for data at rest separately from querying, and it can even operate over flat files in cloud storage buckets, meaning companies can load their vast data archives into BigQuery and leave them until needed.

Perhaps BigQuery’s greatest strength is its ability to brute-force its way through unstructured data. This means companies don’t have to decide at ingest time what kinds of queries to support or which indexes and structures those queries will require. They can simply load their data as-is and parse, filter and normalize it at query time.
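The idea of deferring parsing and normalization to query time can be illustrated with a small sketch. This is plain Python over an in-memory list, not BigQuery itself; the sample records, the `scan` helper and the field names are all hypothetical, chosen only to show raw data being loaded as-is and cleaned up during the scan.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (no schema, no indexes).
RAW_LINES = [
    '{"user": "alice", "lang": "EN", "text": "hello cloud"}',
    '{"user": "bob", "text": "no language field here"}',
    'not valid json at all',
    '{"user": "carol", "lang": "fr", "text": "bonjour"}',
]

def scan(lines, predicate):
    """Brute-force scan: parse, normalize and filter every record at query time."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows are skipped at read time, not rejected at ingest
        if not isinstance(record, dict):
            continue
        # Normalize at read time: lowercase the language code, default when missing.
        record["lang"] = str(record.get("lang", "unknown")).lower()
        if predicate(record):
            yield record

# The "query" is decided only now, long after the data was stored.
english_users = [r["user"] for r in scan(RAW_LINES, lambda r: r["lang"] == "en")]
print(english_users)
```

The trade-off in this style is that every query pays for a full pass over the data, which is only practical when the scan itself is fast and cheap to run at scale.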

With the ability to table scan a petabyte in just 3.3 minutes, it would take Twitter just 16.5 hours to scan every single byte of its 300 petabyte archive, requiring nothing more complex than an SQL statement.
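The arithmetic behind that figure is easy to check. The sketch below assumes, as the estimate itself does, that scan time grows linearly with data size; the function and constant names are illustrative only.

```python
MINUTES_PER_PETABYTE = 3.3   # table-scan rate quoted in the text
ARCHIVE_PETABYTES = 300      # archive size quoted in the text

def full_scan_hours(petabytes, minutes_per_petabyte=MINUTES_PER_PETABYTE):
    """Hours needed to scan the archive once, assuming linear scaling."""
    return petabytes * minutes_per_petabyte / 60

hours = full_scan_hours(ARCHIVE_PETABYTES)
print(f"{hours:.1f} hours")  # prints "16.5 hours"
```

In other words, 300 petabytes at 3.3 minutes each is 990 minutes, or 16.5 hours.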

This kind of scale is truly transformative, as it enables companies to focus on asking questions of their data rather than architecting petascale database designs and compute workflows.

As deep learning deployments become increasingly limited by data curation and preparation work, tools like BigQuery could dramatically help accelerate these efforts.

Putting this all together, the cloud has upended how we think about data. Simply storing petascale data is no longer a limiting factor. The future of the cloud lies in analyzing that data and translating it into meaningful, actionable insights.

Originally posted by Kalev Leetaru