You’re checking your phone on the way to work and an email arrives with the subject ‘Replication errors’.
It’s a member of your team telling you that key data isn’t replicating correctly due to a technical issue. They can’t say how much data has been missed over time; they can only say how much isn’t aligned right now.
You think “How on earth could we not know what data has or hasn’t processed correctly?”
Is lack of data validation hampering your organisation?
In today’s data-flooded business environment, how sure are you of your actual data quality? You’ll be well aware that reliable data is essential, but it can be difficult to determine which of that data is valuable, and even harder to predict how you may eventually use it.
It goes without saying that for ongoing success your business will increasingly need to centralise and combine data from right across your enterprise. This will enable you to gain an in-depth understanding of your customers, operations, financial performance, and regulatory compliance. The currently preferred platform for these activities is the data lake, a centralised repository for the consolidation of your organisation’s data. One design approach for a data lake is to persist raw data as it arrives from source systems in an immutable (ie unchangeable) data layer, providing flexibility for any of your future business needs.
It’s critical that you’re able to trust the data you work with, regardless of the platform you use. You need to have confidence that what you’re seeing accurately reflects the data source it comes from.
The solution to this is data assurance. This involves designing and implementing technologies and processes to validate your data. However, despite its critical importance, the establishment of data assurance can be very easily overlooked during your platform implementation.
Data replication – the key problem areas
There are three broad problem areas associated with data replication:
- Absence: Some of the data in your source isn’t present in the data that’s stored in your target. This could manifest as missing columns, rows, tables, or other objects.
- Mis-transformation: Data is intentionally modified between your source and storage in your target, so that your data no longer accurately reflects your source.
- Corruption: Data is unintentionally modified between your source and storage in your target. This could be due to a physical disk error, or an error during transmission.
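As a minimal sketch of detecting the corruption problem, you can compare cryptographic checksums of a source extract and its landed copy: if a single byte changed in transit, the digests will differ. The filenames below are purely illustrative; in practice they would be your source extract and the copy in your target platform.

```python
import hashlib

def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Illustrative files standing in for a source extract and its target copy.
with open("source_extract.csv", "w") as f:
    f.write("id,amount\n1,10\n2,20\n")
with open("target_copy.csv", "w") as f:
    f.write("id,amount\n1,10\n2,20\n")

# Corruption check: matching digests mean the bytes are identical.
print(sha256_of("source_extract.csv") == sha256_of("target_copy.csv"))  # True
```

A checksum mismatch tells you something changed, but not what; it is a cheap first line of defence, not a substitute for the validation techniques discussed below.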
To build a trustworthy platform you need to actively address these problem areas, ensuring that your ingested data is complete, correctly transformed, and 100% uncorrupted.
Implementing a data validation plan will also very likely surface issues with your data quality controls at the point of data entry or capture. These controls have the biggest impact on your overall data quality, so you should make improving them a priority where possible.
Data validation – the key techniques
All data validation techniques involve comparing the data present in your source system with what arrives in your target platform, and the best technique will vary with your particular circumstances. Each successive technique will increase your confidence that the data is correct, but will also increase your validation costs.
- Movement: Does your data container, such as a file or message, arrive in the correct state? Eg, in the case of a file, do the source and target filenames and file sizes match?
- Structure: Does the structure of your data container match the structure of your source? Posing this question also doubles as a basic data quality check.
- Aggregate: Do summaries of your data from the source match with summaries of data from your container or target? Checking this can be as simple as carrying out a record count, or can involve data aggregation, eg, a sum of your total sales by product category.
- Row or object: Does the data in a row or object within your container match between your source and your target? Comparing each data element in each row or object will deliver 100% validation, but it’s also extremely resource intensive.
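The aggregate and row-or-object techniques above can be sketched in a few lines. This is illustrative only, assuming small in-memory row sets; the function names (`aggregate_check`, `row_check`) and the sample data are hypothetical, and real checks would run against query results from your source system and your data lake.

```python
# Hypothetical source and target row sets.
source_rows = [{"id": 1, "sales": 100.0}, {"id": 2, "sales": 250.5}]
target_rows = [{"id": 1, "sales": 100.0}, {"id": 2, "sales": 250.5}]

def aggregate_check(source, target, column):
    """Aggregate validation: record counts and a column sum should agree."""
    return (len(source) == len(target)
            and sum(r[column] for r in source) == sum(r[column] for r in target))

def row_check(source, target, key="id"):
    """Row-level validation: every source row has an identical target row,
    matched by key. 100% coverage, but the costliest technique."""
    target_by_key = {r[key]: r for r in target}
    return all(target_by_key.get(r[key]) == r for r in source)

print(aggregate_check(source_rows, target_rows, "sales"))  # True
print(row_check(source_rows, target_rows))                 # True
```

Note the trade-off the two functions embody: the aggregate check touches each row once and compares two numbers, while the row-level check must hold and compare every data element, which is why you would reserve it for data that genuinely justifies the cost.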
These are not one-size-fits-all techniques. Some of your data may need to have a high level of accuracy for regulatory or compliance purposes, in which case simply checking that your filenames match won’t provide an appropriate level of confidence. On the other hand, some of your data may never be valuable enough to justify the cost of checking that every data element of every object matches between your source and target.
Data validation – where do I start?
Building data assurance involves looking with a fresh eye at how your data arrives at your platform. It’s vital that the following questions are properly addressed before new data is introduced to your platform:
- How much confidence do you have that data in a source system and the platform are aligned?
- What data validation tools are available to you to make sure they are in sync?
- Are you deploying the appropriate tools to get the level of confidence you need?
By tackling these questions you’ll pinpoint exactly what data on your platform has or has not been correctly processed.
Keep an eye out for future articles in this series, in which I’ll be deep-diving into aspects of data validation and assurance.
Contact Bridge Consulting
As one of Australia’s leading management consulting firms specialising in digital business, we’ve delivered quality, customised data validation plans for a wide variety of clients. To discuss how we can establish data validation best practices for your organisation, get in touch with our Bridge Consulting experts today.
Shaun Baker, Managing Consultant