Data Architecture Trends Part 1: How to Improve Data Quality
Improve the Quality of Your Data by Observing It Every Step of the Way
Poor Data Quality (DQ) is a nightmare for organisations, leading to many failed projects and the loss of millions in revenue.
As the Modern Data Stack has developed, a clear use case has emerged for improving DQ by taking the onus off the humans involved. Enter Data Observability: the concept of observing the health of your data, borrowed from application observability in the Software Engineering world.
In this first part of our deep dive into Data Architecture trends, we will focus on improving Data Quality using Data Observability measures.
Let's go!
So — What Is Data Observability?
Observe your data's health as it flows through various layers of your architecture.
Data Quality can be divided into two main categories: technical and functional checks. Technical checks usually cover things such as:
- Freshness: is the data up to date? i.e. did the last job run?
- Schema: is the table schema still the same? i.e. did someone randomly add a column and break the flow?
- Volume: does the table have all the records? i.e. did random records get dropped in the pipeline?
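To make that concrete, here is a minimal Python sketch of those three checks. The column names, SLA and tolerance are illustrative assumptions; in practice the inputs would come from your warehouse's information schema and pipeline logs.

```python
from datetime import datetime, timedelta, timezone

# Illustrative thresholds and expected schema; assumed values, not a standard.
FRESHNESS_SLA = timedelta(hours=24)
EXPECTED_COLUMNS = {"customer_id", "status", "updated_at"}

def check_freshness(last_loaded_at: datetime) -> bool:
    """Did the last job run within the agreed SLA? (expects an aware datetime)"""
    return datetime.now(timezone.utc) - last_loaded_at <= FRESHNESS_SLA

def check_schema(actual_columns: set) -> bool:
    """Is the table schema still what downstream consumers expect?"""
    return actual_columns == EXPECTED_COLUMNS

def check_volume(row_count: int, expected_count: int, tolerance: float = 0.05) -> bool:
    """Did the table land with roughly the expected number of records?"""
    return abs(row_count - expected_count) <= expected_count * tolerance
```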
Functional checks, on the other hand, verify that the data is accurate for a specific use case: if a customer is supposed to have an active status, they do have one; if a customer is no longer trading with us, no billing information should be generated for them; and so on.
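Those two examples might look like the sketch below. The row shapes (is_trading, status, customer_id) are assumptions for illustration; in practice such rules would typically run as SQL against the warehouse rather than in-memory Python.

```python
# Functional check 1: customers flagged as trading must carry an 'active' status.
def active_customers_missing_status(customers: list) -> list:
    """Return customers who are trading but do not have an 'active' status."""
    return [c for c in customers if c["is_trading"] and c["status"] != "active"]

# Functional check 2: no billing rows should exist for churned customers.
def billing_for_churned_customers(customers: list, billing: list) -> list:
    """Return billing rows generated for customers who no longer trade with us."""
    churned_ids = {c["customer_id"] for c in customers if not c["is_trading"]}
    return [b for b in billing if b["customer_id"] in churned_ids]
```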
Carrying out functional checks requires business domain knowledge, and it is usually these that get translated into DQ rules. Technical checks, however, are more implicit in the general data ecosystem. For example, if you have a data pipeline, you will likely include some audit columns confirming the number of records received from the source against the number loaded into the target.
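A reconciliation over those audit counts might look like this hypothetical sketch; the field names are assumptions for illustration.

```python
# Compare audit counts captured at extract time and after load,
# and fail loudly when source and target disagree.
def reconcile_row_counts(records_received: int, records_loaded: int) -> None:
    if records_received != records_loaded:
        raise ValueError(
            f"Audit mismatch: received {records_received} from source, "
            f"loaded {records_loaded} into target"
        )
```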
Data Observability is the practice of spreading those technical DQ checks across your architecture, from data capture to transfer to storage and consumption.
Ok — Why Is It Hyped?
Data Quality has long been a known but under-invested problem.