Data may very well be “the new oil,” but thanks to over 20 years of lax governance and data mismanagement, most companies now have a serious data hoarding problem. It’s estimated that anywhere from 50% to 90% of the data collected and stored by organizations is “dark data,” that is, data that is stored but never utilized or analyzed (some estimates can be found here and here).
In other words, dark data is just sitting idle, collecting virtual dust, increasingly within the massive storage infrastructures of major cloud service providers.
Many of the implications of rampant data hoarding are reasonably well known—including the potential compliance and privacy risks associated with storing petabytes of data you know absolutely nothing about. For many companies, “don’t ask, don’t tell” seems to be the approach when it comes to ensuring compliant management of their dark data—or at the very least, ignoring dark data represents a business risk many compliance officers seem willing to take.
Other implications, such as the cost of storing dark data or the potential value that could be unlocked by operationalizing it, are also frequently discussed. For many CDOs, the motivation to store troves of data that may never be used is a form of FOMO: the fear of being unable to support a future request for new analytical insights outweighs the cost of data storage.
In these situations, the unwillingness of many CDOs to apply methods to measure the business value of data is a primary enabler of data hoarding, where the idea that “we might need it someday” is sufficient to drive millions in annual revenues for cloud service providers.
In isolation, this inability (or unwillingness) to apply simple cost/benefit analyses to data projects may seem inconsequential to many companies, but when applied en masse at an industry level, the implications are stark.
Such is the case for the negative impacts data hoarding has on our environment. The cumulative effect of thousands of companies perpetually storing over half of their data in a virtual rainy-day fund has real-world impacts that extend beyond the data center.
According to the International Energy Agency (IEA), data centers produced over 300 million metric tons of greenhouse gases in 2020. When all aspects of data center consumption are considered (including water use, electricity, indirect emissions from manufacturing servers, and so on), the data center industry accounts for 2.5% to 3.7% of global greenhouse gas emissions, more than both the aviation and global shipping industries.
While a significant portion of data center energy is consumed by the more transactional aspects of digital life (crypto mining, cell phone apps, and buying goods online, to name a few), the fact that dark data even exists at a time of growing global energy scarcity is problematic.
The decision to keep storing ever-growing volumes of data that will never be used, yet consumes scarce energy that could power more productive enterprises, should be a major concern for company executives. This is especially true in a world where artificial intelligence is increasingly prevalent and the motivation to endlessly hoard data is greater than ever before.
In a business environment increasingly focused on ESG, it’s highly likely that at some point, CDOs and other IT leaders will be given sustainability targets that force them to reduce their reliance on data hoarding and to implement processes ensuring that every decision to store data is justified by its cost to both the company and the planet.
In the interim, there are four steps that CDOs can take to start resolving the data hoarding issue.
1. Revisit data governance policies and supporting technologies
If you’re a CDO and don’t have policies for data discovery, retention and archival, then defining these policies should be job number one. These governance policies should be supported with data management software that allows you to catalog your data and automatically manage the retention/archival processes for it. This process will require CDOs to engage with leaders across the business to define policies impacting all software applications and systems.
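To make such policies enforceable rather than aspirational, they can be expressed as machine-readable rules that catalog or data management tooling evaluates against each asset. The sketch below is purely illustrative; the classification labels, retention periods, and actions are hypothetical placeholders, not recommendations for any particular product.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical retention rules: classification -> (retention period, action once expired)
RETENTION_RULES = {
    "regulated":    (timedelta(days=7 * 365), "archive"),
    "operational":  (timedelta(days=3 * 365), "archive"),
    "analytical":   (timedelta(days=2 * 365), "review"),
    "unclassified": (timedelta(days=365),     "flag_for_owner"),
}

def retention_action(classification: str, last_modified: date, today: Optional[date] = None) -> str:
    """Return the policy action for a data asset based on its classification and age."""
    today = today or date.today()
    period, action = RETENTION_RULES.get(classification, RETENTION_RULES["unclassified"])
    return action if (today - last_modified) > period else "retain"

# Example: an unclassified dataset untouched for two years gets flagged to its business owner.
print(retention_action("unclassified", date(2021, 6, 1), today=date(2023, 9, 1)))  # flag_for_owner
```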
2. Develop analytics for data usage
Once there’s visibility into the location and attributes of all assets in a data estate, data leaders must then create dashboards to understand how, and whether, that data is actually used. In essence, this process helps data leaders determine which data is truly dark and which is not.
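As a minimal sketch of the analysis behind such a dashboard (the table names, sizes, and access metadata below are hypothetical; a real implementation would pull them from a catalog or warehouse access log), any asset with no recent queries beyond a chosen threshold can be flagged as candidate dark data:

```python
import pandas as pd

DARK_THRESHOLD_DAYS = 180  # illustrative cutoff; tune to your governance policy

# Hypothetical export of per-asset access metadata
assets = pd.DataFrame({
    "table_name":      ["sales_2019", "clickstream_raw", "customer_master"],
    "size_gb":         [1200, 54000, 80],
    "last_accessed":   pd.to_datetime(["2021-02-10", "2020-07-01", "2023-08-30"]),
    "query_count_90d": [0, 0, 412],
})

as_of = pd.Timestamp("2023-09-01")
assets["days_since_access"] = (as_of - assets["last_accessed"]).dt.days
assets["is_dark"] = (assets["days_since_access"] > DARK_THRESHOLD_DAYS) & (assets["query_count_90d"] == 0)

# Share of the estate (by storage volume) that is dark; a natural headline metric for the dashboard
dark_share = assets.loc[assets["is_dark"], "size_gb"].sum() / assets["size_gb"].sum()
print(assets[["table_name", "days_since_access", "is_dark"]])
print(f"Share of storage that is dark: {dark_share:.0%}")
```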
3. Implement procedures to support the costs/benefits of all data initiatives
In concert with revisiting data governance policies, data leaders must also revisit their processes for prioritizing and rationalizing the services they provide. This means that data leaders must determine how they will provide cost-benefit analyses for all new data initiatives and all ongoing “run the business” tasks required to maintain a data estate.
A great time to apply these newly developed data cost/benefit models is during contract renewals with major cloud service providers. Any data identified as “dark” can be actioned (per the policies defined in step 1 above), and contracts can be renewed at lower rates. Similar procedures can then be applied incrementally over time to all databases across the enterprise.
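A back-of-the-envelope version of such a cost/benefit model might look like the sketch below. The storage rate and value estimates are hypothetical placeholders, not vendor pricing; the point is simply that a dataset whose estimated annual value does not cover its annual storage cost is a candidate for archival or deletion at renewal time.

```python
# Hypothetical annual cost/benefit check for keeping a dataset in hot cloud storage
STORAGE_COST_PER_TB_YEAR = 250.0  # illustrative blended $/TB/year (storage, backups, egress)

def net_annual_value(size_tb: float, est_annual_value: float, active_consumers: int) -> float:
    """Estimated annual business value minus annual storage cost.
    Negative results suggest archiving to cold storage or deleting per the retention policy."""
    cost = size_tb * STORAGE_COST_PER_TB_YEAR
    value = est_annual_value if active_consumers > 0 else 0.0  # data nobody uses returns nothing
    return value - cost

# Example: 50 TB with no active consumers costs $12,500/year and returns nothing,
# making it a strong candidate to action before the next cloud contract renewal.
print(net_annual_value(size_tb=50, est_annual_value=0, active_consumers=0))       # -12500.0
print(net_annual_value(size_tb=50, est_annual_value=40000, active_consumers=12))  # 27500.0
```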
4. Address longer-term data culture issues
Beyond the tactical steps needed to clean up dark data, CDOs must also work diligently, over time, to address the root causes of data hoarding, including a data culture that puts a premium on hoarding data rather than on understanding its business value. Embracing processes that quantify the benefits of better data management, coupled with the cost savings realized when those practices are applied, will go a long way toward promoting a healthier overall data culture.
Enjoyed my latest deep dive on data hoarding? I discuss pressing topics like this and more with a wide range of data experts and thought leaders on the CDO Matters Podcast. Listen to the latest episodes here or wherever you get your podcasts.
Malcolm Hawker is a former Gartner analyst and the Chief Data Officer at Profisee. Follow him on LinkedIn.