Terabytes Of News Versus One Trillion Tweets: Is Social Media Really As Big As We Think?

One of the great ironies of the “big data” revolution is how little we know about the data we use. The world of “big data” might be better called the world of “big imagination.” It is particularly ironic that we have fixated on social media as “big data” despite having absolutely no visibility into whether these platforms are anywhere near as large as we imagine them to be. Despite the promise of “big data” to make our world transparent, we turn to black boxes that aggressively market themselves as massive data repositories but refuse to provide even the most rudimentary statistics about how much data they actually hold. In turn, entire industries have been founded on the vision of social media as massive new datasets that can shine unprecedented light onto society, even as we operate entirely in the dark, uncertain of both the data and the algorithms we use. Looking closer, we find that traditional news media is actually just a few times smaller than the totality of all trillion tweets combined and is several times larger than even the largest Twitter datasets available to researchers. Even Facebook’s new petabyte research dataset pales in comparison to what is available from the news media. Is it time to step back and ask whether social media is really as big as we think it is?

Why is it that the words “Facebook” and “Twitter” appear in nearly every presentation or publication about big data today? How did social media become the benchmark we use to define what precisely we mean by “big data?”

Most importantly, how can we define social media as “big data” when we know absolutely nothing about how big it is, how fast it updates, or how varied its contents are?

In the early heyday of their explosive growth, both Facebook and Twitter released regular updates on how large their services had grown. In recent years, however, as rumors have abounded that usage of the services has slowed (or, in Facebook’s case, shifted to Instagram), the companies have ceased offering such detailed statistics.

How many novel textual posts and images are published to Facebook each day? How many total words and pixels do those posts add on a daily basis? How many links are shared on the platform each day?

The answer is that we simply don’t know and there is no independent external auditing of the few vague numbers Facebook does release. In terms of the number of links shared daily on the platform, the only insights we have are the estimates the company released for a research dataset it is compiling from its archives, suggesting that link sharing on Facebook is far rarer than imagined.

Of course, it is Twitter that has become the centerpiece of the social media analytics industry due to its decision to make a realtime firehose of its entire platform available for analytics.

Despite its ubiquity and the fact that nearly every social media analytics platform in the world offers at least basic Twitter support, the company has not released official volume numbers in quite some time.

For its part, when asked how many tweets have been published since its founding, a Twitter spokesperson this afternoon offered only “not commenting.” When asked why it no longer publishes such numbers, the company did not respond.

It didn’t use to be this way. Twitter used to regularly publish its latest volume milestones and in 2014 it proudly announced that it had reached half a trillion tweets.

In recent years, however, the company has been relatively reticent regarding precise growth numbers that would make it easier to understand the company’s trajectory.

Nor has it been forthcoming about the geographic and demographic reach and representativeness of its user base.

Extrapolating from previous trends we can estimate that there have been just over one trillion tweets sent since the service’s founding.

Despite the sheer number of tweets, the historical 140-character limit (tweets averaged around 74 bytes Twitter-wide in 2012) means those trillion tweets add up to only a few tens of terabytes of actual text, a large portion of which is duplicate posts in the form of retweets proxying for “likes.”
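
To make that arithmetic concrete, here is a rough back-of-envelope calculation, taking the figures above at face value (roughly one trillion tweets and a ~74-byte average tweet body); the numbers are the article’s estimates, not official Twitter statistics:

```python
# Back-of-envelope estimate of the total text volume of Twitter's archive,
# using the article's figures: ~1 trillion tweets and a ~74-byte average body.
total_tweets = 1_000_000_000_000      # roughly one trillion tweets since founding
avg_tweet_bytes = 74                  # Twitter-wide average tweet size circa 2012

total_bytes = total_tweets * avg_tweet_bytes
total_terabytes = total_bytes / 1e12  # 1 TB = 10^12 bytes

print(f"~{total_terabytes:.0f} TB of raw tweet text")  # ~74 TB: a few tens of terabytes
```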

Compared with online news coverage over the last four years, we see that the full Twitter firehose is actually only slightly larger, while the Decahose and 1% products that most researchers work with are considerably smaller. Even digitized books aren’t that far off.

So why are Facebook and Twitter at the center of our societal conversation around big data rather than journalism? Why are researchers all across the world rushing to work with tweets and posts but not news? Why does every marketing and analytics offering have a Twitter subscription but few license large news archives? Why are major funding agencies pouring hundreds of millions of dollars into social media research, while funding little news analytics beyond combating “fake news?” Why are journals rushing to release special issue after special issue about “big data” social media research, while viewing traditional journalism analytics as obsolete small data?

The answer is that we focus on social media because the companies have aggressively marketed themselves as defining “big data.” Their early growth was advertised in billions of posts and petabytes of data, rather than in measures of impact or insight.

Perpetually in search of the next shiny new thing, data scientists jumped on board in droves, seduced by social media’s machine-friendly firehoses and APIs and eager to capitalize on the tremendous interest from funding agencies, publication venues and companies.

In turn, we have created a world in which our analyses are based on data we know nothing about, processed by algorithms we have no visibility into.

For all its promise to make our world transparent, “big data” has instead made it more opaque than we could ever have imagined.

How is it that we accept results compiled from black-box datasets that refuse even to say how much data they actually hold?

How is it that we accept completely unknown data being shoveled through black-box algorithms we know absolutely nothing about, performing analyses that we know are exceptionally nuanced and entirely dependent on the decisions of their creators?

After all, it was less than a decade ago that the 2010 release of the Google Books NGram collection was met with howls of protest from the research world that its findings were utterly meaningless because we didn’t know precisely what was in the collection of books that made up its data.

In the decade since, researchers seem to have had a change of heart, wholeheartedly embracing the idea of reporting the trends of data they know nothing about. Much like privacy and data ethics, the idea of actually knowing something about your data has become a relic of the past.

How did we reach this point?

Much of the answer lies in the fact that, despite their titles, most “data scientists” aren’t actually very skilled with “data.” Most are programmers or statisticians (or both) who excel at figuring out how to ask questions of data but lack the critical thinking skills to step back and ask whether the data they are using actually answers their question and whether the algorithms they are using to make sense of that data are faithfully computing what they think they are.

A typical data scientist knows how to make a graph or run a regression or even build a neural network with ease. What their technical backgrounds make them blind to, however, is the underlying question of data bias. Few data scientists know how to reverse engineer an algorithm’s performance from the outside to assess how well it might perform in a given analytic context. Even fewer have the worldly experience to step back and fully examine a dataset’s nuances, limitations, and biases before ever considering it for analysis.

The end result is that data scientists today grab for whatever dataset is easiest to obtain and work with. Social media has positioned itself squarely as the go-to “big data” dataset, with real-time firehoses and APIs with simplistic authentication processes, industry-standard JSON file formats, and a vast marketing machine designed to create a reality distortion field that portrays social media and big data as one and the same.

Does social media afford us insights not available through other platforms? The answer is that much of the insight we receive from social media is readily obtainable through other media such as news.

The issue is that social media is easier to use.

Plug a few hashtags or keywords into a filter and monitor the JSON Twitter firehose feed for sharp spikes that signal sudden changes. A few lines of code and a few minutes of time are all it takes to set up a Twitter-based global alerting system.
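
As a rough illustration of how little code that takes, here is a minimal sketch that filters a stream of tweets for a handful of keywords and tallies matches per minute. The keyword list, the field names, and the assumption that tweets arrive as newline-delimited JSON on stdin are all illustrative choices, not guarantees about Twitter’s actual feed format:

```python
# Minimal sketch of a keyword filter over a firehose-style feed, assuming a
# newline-delimited JSON stream of tweets on stdin. Keywords and field names
# are hypothetical, for illustration only.
import sys, json
from collections import Counter

KEYWORDS = {"earthquake", "evacuation", "outage"}   # hypothetical watch list

counts_per_minute = Counter()

for line in sys.stdin:
    try:
        tweet = json.loads(line)
    except json.JSONDecodeError:
        continue                                    # skip malformed records
    text = tweet.get("text", "").lower()
    if any(keyword in text for keyword in KEYWORDS):
        # Bucket matches by minute using a prefix of the tweet's timestamp string.
        minute = tweet.get("created_at", "")[:16]
        counts_per_minute[minute] += 1

for minute, count in sorted(counts_per_minute.items()):
    print(minute, count)
```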

Monitoring news requires a vast global infrastructure capable of handling the digital, broadcast, and print worlds. It also requires building all of the filtering and supporting infrastructure from scratch.

Most importantly, it requires an entirely different way of thinking.

Twitter’s simplicity makes it trivial to analyze. High-volume, low-informational-content messaging streams are essentially analyzed through filters and volume counts. Twitter is, for all intents and purposes, behavioral event data, readily amenable to relatively simplistic stream analytics and anomaly detection. Trends manifest themselves in highly visible deviations that can be easily flagged through even the most basic of filtration processes.

In contrast, news media consists of a lower volume of very rich content that contains layer upon layer of informational and emotional signals. Translating such incredibly rich and nuanced content into meaningful insight is a vastly harder problem.

In short, extracting meaning from Twitter is akin to processing any other kind of high-volume event data and is a well-understood and fairly well-solved problem. In many ways it is like building a thermostat: the input signal is a well-formed, machine-friendly numeric signal that can be easily thresholded.
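
In that thermostat-like spirit, thresholding such a count signal might look like the following minimal sketch, applied to the per-minute volumes a filter like the one above would emit; the window size and spike multiplier are arbitrary illustrative parameters:

```python
# Thermostat-style thresholding of a per-minute tweet-count signal: flag any
# minute whose count far exceeds a simple rolling baseline. Window size and
# multiplier are arbitrary illustrative choices.
from collections import deque

def spike_alerts(counts, window=60, multiplier=3.0):
    """Yield (index, count) for minutes whose count exceeds multiplier times
    the rolling mean of the previous `window` minutes."""
    history = deque(maxlen=window)
    for i, count in enumerate(counts):
        if len(history) == window:
            baseline = sum(history) / window
            if count > multiplier * max(baseline, 1.0):
                yield i, count
        history.append(count)

# Example: a mostly flat signal with one sharp spike.
signal = [10] * 120
signal[90] = 200
print(list(spike_alerts(signal)))   # -> [(90, 200)]
```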

Extracting meaning from news content, on the other hand, is still very much an active area of research, falling into the category of classical machine understanding.

Seen in this light, social media has gained such popularity with data scientists because it requires almost no skill to analyze or utilize. In contrast, effectively extracting meaning from news media requires expertise possessed by very few data scientists.

Putting this all together, social media has become the poster child of the “big data” revolution not because it is the largest and richest source of information about human society, but rather because social platforms made decisions early on to position themselves for the data science community with firehoses and APIs. They have built massive marketing machines to cement themselves in the public consciousness as data-first enterprises. Their high-volume low-content modalities make them trivial to analyze and thus easy for data scientists to work with.

The remarkable success of social platforms in convincing the world of their size and impact in the total absence of any hard numbers to support those claims reminds us how much of our understanding of social media exists only in our imagination. News media is actually far larger and richer than the social datasets available to researchers, but the journalistic world has yet to rebrand itself for the digital era and lacks the enormous marketing machine to position itself as data in the public consciousness.

In the end, perhaps the biggest story is that rather than make the world transparent, the “big data” revolution has plunged it into opacity and created a world in which we accept findings from unknown algorithms running on unknown data without the slightest concern in the world. Welcome to the reality of big data.

originally posted on Forbes.com by Kalev Leetaru