Here's a statistic that should keep you up at night: only 16 per cent of business executives are confident in the accuracy of the data that informs their decisions. Imagine that. We’re living in a data-driven age – around 2.5 quintillion bytes of data are created every single day – and yet most professionals can’t say for sure whether the numbers on which they might be basing multimillion-dollar decisions actually mean anything.
Welcome to the world of Dirty Data.
What is ‘Dirty Data’?
Dirty Data is simply customer or business information that is corrupt, missing, duplicated or inaccurate. It happens every day in companies all over the world. Every time an Account Manager accidentally duplicates a customer record, someone misspells a crucial address, a CRM gets cluttered with spam emails, or a date format is applied inconsistently, Dirty Data is the result. It’s numbers without meaning. Information without purpose.
You might think, what’s the big deal? Any errors are probably small, and they’re unlikely to impact overall business efficiency, right? But Dirty Data is estimated to cost the US economy over $3 trillion every year.
Experian reports that companies around the world believe, on average, that 26 per cent of their data is inaccurate or corrupt. And since data informs pretty much every facet of industry decision making – from why groceries aren’t selling to the order of your Netflix feed and how people move through Disneyland – 26 per cent data corruption can mean something close to 26 per cent missed opportunity. Opportunity for customer engagement, interaction or simple revenue.
What causes Dirty Data?
It probably comes as no surprise that human error accounts for more than 60 per cent of all Dirty Data. Combine a human brain (or even worse, multiple teams of human brains) with manual input fields, and problems are bound to occur.
The rest is a combination of inaccurate records and poor data strategy, both of which eventually circle back to human error: there’s no point having a fully automated CRM or overarching data plan if its integrity relies on manual entry.
Often Dirty Data is the result of departmental miscommunication: different teams inputting related data into separate silos, without any cooperation or internal data logic. It’s the classic left-hand-right-hand scenario. Unfortunately, because of this kind of internal bureaucracy, Dirty Data can go unchecked or unnoticed for years. Over 57 per cent of companies only discover Dirty Data when it’s reported by the customer, which everyone can agree is probably the worst way to discover a fundamental business problem.
Cleaning up Dirty Data
‘Data Cleaning’ (also called data cleansing) refers to the often painstaking process of going through data records and correcting mistakes. Pruning all the sloppy irregularities and misspellings and duplicate entries until you’re left with crisp, clean, useful data. Ideally this happens before the data is transferred to a target database or data warehouse.
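To make that a little more concrete, here is a minimal Python sketch of one cleaning pass over customer records. The field names (`name`, `email`), the sample records and the rule that an email address identifies a customer are all assumptions made for illustration, not a description of any particular CRM.

```python
def clean_records(records):
    """Tidy whitespace and case, then split records into clean and needs-review.

    Assumption for this sketch: the email address identifies a customer.
    """
    seen_emails = set()
    cleaned = []
    needs_review = []
    for record in records:
        # Collapse runs of whitespace and trim the ends of the name.
        name = " ".join(record.get("name", "").split())
        # Emails are compared case-insensitively in this sketch.
        email = record.get("email", "").strip().lower()
        if not email:
            # Missing data is set aside for a human rather than silently dropped.
            needs_review.append({"name": name, "email": email})
            continue
        if email in seen_emails:
            # Duplicate of a record we have already kept.
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned, needs_review


raw = [
    {"name": "  Jane   Doe ", "email": "Jane.Doe@Example.com"},
    {"name": "Jane Doe", "email": "jane.doe@example.com "},
    {"name": "Sam Lee", "email": ""},
]
cleaned, needs_review = clean_records(raw)
```

In this example the two Jane Doe entries collapse into one record, and Sam Lee’s record, which has no email, lands in `needs_review` instead of being deleted. Real duplicate detection is usually fuzzier than an exact match on one field, so treat this as a starting point only.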
Data cleansing can be a manual process, but that’s often incredibly labour intensive, hard to scale for enterprise, and (like data input) relies on human accuracy and consistency. So we’re right back where we started. These days there are dozens of automated apps and tools to help developers and data scientists speed things up.
Of course, data cleansing comes with its own challenges. If you employ an automated script or cleaning program, you need to make sure it can correct data mismatches, order columns correctly, check formats (such as dates or currency values), revise or update your data schema, and even enrich existing data with supplementary information. There’s a helpful guide here, if you’re looking for more hands-on information.
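As a rough illustration of the format checks mentioned above, the sketch below normalises dates and currency amounts using only Python’s standard library. The list of accepted date formats, the day-first reading of ambiguous dates, and the `$` and comma handling are assumptions chosen for this example; your own data will need its own rules.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed input formats for this sketch; day comes first in the slash and dot forms.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date(text):
    """Return the date as an ISO 8601 string, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalise_amount(text):
    """Return the amount as a Decimal with two decimal places, or None if invalid."""
    stripped = text.strip().replace("$", "").replace(",", "")
    try:
        return Decimal(stripped).quantize(Decimal("0.01"))
    except InvalidOperation:
        return None


iso_date = normalise_date("03/04/2021")
bad_date = normalise_date("next Tuesday")
amount = normalise_amount("$1,234.5")
bad_amount = normalise_amount("twelve dollars")
```

Returning `None` for anything that doesn’t match keeps bad values visible, so they can be reviewed instead of being guessed at. Note that a value like `03/04/2021` is genuinely ambiguous (3 April or 4 March), which is exactly why a consistent format needs to be agreed before the data is entered.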
Skills for the future
Knowledge of data best practice, data integrity and even data cleansing is going to be an incredibly valuable professional asset in years to come. As companies scale up their data collection – and according to recent Accenture studies, 83 per cent of enterprise executives have already harnessed Big Data to gain a competitive edge – the demand for skilled data scientists and analysts will continue to grow.
That’s why, when RMIT designed its online data short courses with Udacity, we included units on data integrity and data cleansing. Whether you want to become a data scientist or data analyst, an understanding of Dirty Data won’t just be helpful – it’ll be essential.