Data science was supposed to create a new productivity boom. But, for many companies, that boom never arrived. What’s gone wrong? While companies have invested in data tools, much of the data that’s fed into these systems is low quality – with mislabeled, missing, or incorrect information, which in turn creates more work and more drag in the system. To address this, companies should: 1) rally employees to the cause and drive home why this is important, 2) measure data quality across the full range of departments, functions, and tasks, and 3) relentlessly attack the sources of the data quality tax.
In principle, new technologies help companies increase productivity: Logistics systems ensure that they have the right quantities of things they need, operational systems help automate production and the delivery of goods and services, and decision support systems bring the analyses and forecasts managers need to make better decisions to their fingertips. Further, during the pandemic, many companies rushed to digitize even faster. Productivity should be soaring. Unfortunately, it hasn’t happened.
While there are many competing explanations, I believe there is a fundamental explanation for low productivity gains and, even more importantly, a way to boost them: Digital technologies are fueled by data and too much data is simply bad, negating the possible productivity gains. New technologies have proliferated, but data management has not kept pace, adding enormous cost and friction. The key to boosting productivity lies in eliminating the root causes of bad data.
What Makes Data “Good” Or “Bad”?
By definition, data is of high quality if it is fit for its intended uses in operations, decision-making, planning, and data science. There is a lot to this definition. Each use comes with its own requirements, and failure to meet them drags productivity down. To illustrate, consider three scenarios.
In scenario one, a salesperson depends on leads data from the marketing department to do their work. Relatively little data (about 20 data elements) is required, but each record must be complete and correct. Thus, when the contact name is missing or incorrect, the salesperson has to find or fix it, which at best takes considerable time. Worse, if they fail to spot an error, they may lose the sale. Either way, productivity suffers. And note that I could replace “leads data and Sales” with “sales data and Operations,” “processed orders data and Inventory Management or Finance,” or any of the hundreds of ways one department depends on data from another.
In scenario two, a manager needs to know how many new customers the company has acquired in the past quarter to set budgets. They use data from both the Finance and Customer Relationship Management systems, because neither yields an answer that everyone trusts. Additional problems arise because Sales gives itself credit for a new customer when the first deal is signed, while Finance waits until the first invoice is paid. Most quarters the numbers are “close enough,” but when the discrepancy is large, the manager must ask their staff to dig deeply into both systems to sort it out. Even then, “the answer” is never fully trusted so, as a practical reality, new budgets are based more on guesswork than data. More money is wasted when the budgets are too high and opportunities lost when too low. Again, while the details differ, the essence of this scenario plays out many times each day.
The third scenario involves developing an algorithm for improving customer retention using artificial intelligence. The training data set must be reasonably accurate and the various data sources must align. If not, data scientists must spend time wrangling the data into shape. Further, these sources must be free of bias, which can be especially complex, with biases only revealing themselves once the new algorithm is in use. Finally, there are additional data requirements once the algorithm goes live. While the costs of dealing with all these issues can be considerable, the lost opportunity costs are even more important. Bad data makes it more difficult to take advantage of artificial intelligence and digital transformation, robbing companies of potential productivity gains.
Companies do not capture these costs in their accounting systems. Fortunately, good-enough-to-get-started estimates can be obtained by applying the “rule of 10”: it costs 10 times as much to complete a unit of work when the data is flawed in any way as it does when the data is good. In practice, this means that if data is good for 90% of your work, finishing the remaining 10% will cost more than the first 90% did, because of all the added friction. (90 x 1 = 90, 10 x 10 = 100.) One can view these added costs in various ways:
- the cost of “non-value-added” work (no informed customer pays more because you have to correct bad data),
- the cost incurred in the hidden data factory (“hidden,” because the accounting system doesn’t capture the cost; “data factory,” because people are re-working data),
- the cost of inefficiencies,
- the “productivity hit,” or
- perhaps counterintuitively, the size of the opportunity to improve quality and increase productivity.
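The rule-of-10 arithmetic above can be sketched as a short calculation. This is a minimal illustration: the function name and “cost units” are mine, and the 10x multiplier is the heuristic the rule proposes, not a measured figure.

```python
def data_quality_tax(total_units: int, error_free_units: int,
                     flawed_multiplier: int = 10) -> dict:
    """Estimate the cost of bad data using the rule of 10.

    A unit of work done with good data costs 1; a unit done with flawed
    data costs flawed_multiplier (10, per the rule). The 'tax' is the
    extra cost versus a baseline where every unit had good data.
    """
    flawed_units = total_units - error_free_units
    total_cost = error_free_units * 1 + flawed_units * flawed_multiplier
    baseline = total_units * 1  # cost if all data were good
    return {
        "total_cost": total_cost,
        "tax": total_cost - baseline,
        "relative_productivity": baseline / total_cost,
    }

# The article's example: data is good for 90% of 100 units of work.
# The good 90 units cost 90; the flawed 10 cost 100 -- 190 total vs. a baseline of 100.
print(data_quality_tax(100, 90))
```

Note how the arithmetic rewards partial progress: halving the errors in that example (95 error-free units instead of 90) drops the total cost from 190 to 145 without eliminating a single error category entirely.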
A manager or company need not completely eliminate errors. Even cutting the error rate in half significantly reduces costs and increases productivity.
How Much Is Low Quality Data Costing You?
As the rule of 10 makes clear, the lower the data quality, the lower the productivity, and the greater the tax. But how can leaders know – or estimate – when they’re dealing with low data quality?
When I lead executive education sessions, I ask attendees to do an exercise I call the “Friday Afternoon Measurement,” in which they audit a sample of the data in their last 100 units of work. Using a spreadsheet, they go back through the data elements for each unit of work and look for errors, marking each cell where they find a mistake. Then they count up how many mistake-free units they had, which provides a data quality (DQ) score on a 0-100 scale. (E.g., if you had 85 units with error-free data, you’d score an 85.) Finally, to complete the assignment, they apply the rule of 10 and estimate the tax for their areas. Let me offer two highlights from these sessions:
- Only 8% report a DQ score of 90 or better.
- Most score in the 40 to 80 range, with a median score of 61. At that level, the tax is 3.5 times the total cost you would incur if all the data were good. Similarly, productivity declines to less than a quarter of what it could be.
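A spreadsheet works fine for this exercise, but the scoring step can also be sketched in a few lines of code. This is a hypothetical illustration: the sample records and the `has_error` checks stand in for whatever data elements your team actually audits.

```python
# Each 'unit of work' is a record containing the data elements the team
# actually uses. These records and checks are hypothetical stand-ins;
# in practice you would audit your last 100 units of work.
records = [
    {"contact_name": "Ada Lovelace", "email": "ada@example.com"},
    {"contact_name": "",             "email": "bob@example.com"},  # missing name
    {"contact_name": "Carol Chen",   "email": "not-an-email"},     # malformed email
]

def has_error(record: dict) -> bool:
    """Flag a record as flawed if any audited element is missing or malformed."""
    return record["contact_name"].strip() == "" or "@" not in record["email"]

error_free = sum(1 for r in records if not has_error(r))
dq_score = 100 * error_free / len(records)  # 0-100 scale, as in the exercise

# Apply the rule of 10: a flawed unit costs 10x what a good unit costs.
flawed = len(records) - error_free
tax = (error_free * 1 + flawed * 10) - len(records)  # extra cost vs. all-good baseline
print(f"DQ score: {dq_score:.0f}, estimated tax: {tax} cost units")
```

The point of automating the check is not precision but repeatability: the same audit can be rerun each quarter to see whether the error rate, and with it the tax, is actually falling.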
To be sure, every company is different, and so is their opportunity to reduce the cost of bad data and improve productivity. But it’s significant, even for the least data-intense firms. And for some, it may represent their single best opportunity to improve overall performance.
What Companies Can Do
So how should companies pursue raising the bar for data quality? I find that too many simply accept the tax associated with low data quality as just another cost of doing business. But this is waste, pure and simple. Leaders need to recognize the opportunity for improvement and act.
First, adopt language that best rallies people to the cause and helps them understand the problem. I’ve used “tax” here, but “non-value-added work,” the “hidden data factory,” or “opportunity” may resonate with others.
Second, develop data quality profiles by measuring data quality across the full range of departments, functions, and tasks, using the Friday Afternoon Measurement outlined above.
Third, relentlessly attack the sources of the data quality tax. Creating data correctly the first time is the best, fastest way to do so. This means eliminating the root causes of error. I’ve helped companies do this for a very long time, and far and away the two most frequent root causes are these:
- Those who create data simply do not know that others have requirements for their data, and
- Data customers (those victimized by bad data) reflexively act to fix bad data, unconsciously incurring the tax.
Both are relatively easy to resolve: Data customers must get in the habit of seeking out the creators and explaining their quality requirements. Data creators, in turn, must understand these requirements and find and eliminate the root causes of failures to meet them. If this sounds like “old-school” quality management, it is. Most importantly, it is stunningly effective.
Finally, ignore the “data quality is boring” talk, because it simply is not true. In my experience, most people like their new roles as data creators and data customers, and they certainly appreciate spending less time working on mundane data issues. Start in areas where managers have open minds and set an initial goal of halving the error rate in six months. Train people, help them make an improvement or two, and then turn them loose. Move to the next area, building momentum as you go.
Productivity need not, indeed must not, stagnate. Many will find the connection between productivity and quality counterintuitive, yet enormous opportunity lies there. Bad data hammers productivity. It’s time to make it go away.
Originally posted on hbr.org by Thomas C. Redman.
About Author: Thomas C. Redman, “the Data Doc,” is President of Data Quality Solutions. He helps companies and people, including start-ups, multinationals, executives, and leaders at all levels, chart their courses to data-driven futures. He places special emphasis on quality, analytics, and organizational capabilities.