Traditionally the test data architecture has been identical to the production data architecture – perhaps with a cheaper infrastructure setup. Why should the Test Data Architecture differ from the Production Data Architecture?
In an era of IT transformation, with a horde of disruptive technologies and innovations setting off a paradigm shift across all industries and sectors of the economy, data architecture needs to change, test data architectures included. Driven by business objectives, data architecture plays a vital role in the flow of information and the way it is controlled.
Several trends are putting pressure on Test Data Management and make it necessary for the test data architecture to differ from the production data architecture.
Development paradigms are changing – everybody is looking at DevOps.
DevOps requires automated testing, and every test run needs a consistent initial state for its test data. So you will need a mechanism that automatically provides a consistent set of test data for each test. You also need to version your test data set, since you may need to update it as you update your code or discover hitherto undetected bugs.
DevOps, with its focus on automated testing, sets requirements for the test data architecture that do not apply to the production data architecture.
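As a minimal sketch of what "a consistent initial state for each test" could look like, the Python example below rebuilds an in-memory SQLite database from a versioned data set every time a test asks for one. The table layout, the `TEST_DATA_VERSION` constant and the helper names are assumptions made for illustration, not a prescribed design.

```python
import sqlite3

# Hypothetical versioned test data set. In practice this would live in
# version control next to the code it is meant to test.
TEST_DATA_VERSION = "1.2.0"
CUSTOMERS = [
    (1, "Alice Example", "DK"),
    (2, "Bob Example", "SE"),
]


def fresh_database() -> sqlite3.Connection:
    """Return a new in-memory database loaded with the versioned data set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT)")
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
    )
    # Record which version of the test data this database was built from.
    conn.execute(
        "INSERT INTO meta VALUES ('test_data_version', ?)", (TEST_DATA_VERSION,)
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", CUSTOMERS)
    conn.commit()
    return conn


def count_customers(conn: sqlite3.Connection) -> int:
    """Count the rows in the customers table."""
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Because each call to `fresh_database()` starts from the same versioned data, a test that deletes or changes rows cannot affect the next test, and the stored version makes it possible to tell which data set a given test run used.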
As the name suggests, Microservices architecture is the practice of breaking up an application into a set of smaller, specialized services that are highly maintainable and testable, loosely coupled, easily and individually deployable and structured around enterprise capabilities. With the software shrinking to Microservices, testing will typically involve a multitude of services, each with its own set of data. (If the service doesn’t have its own data, I would call it a function, and that is an entirely different discussion.) To test in this kind of environment you will need to provide a set of test data for each service, where the union of the test data sets is consistent. And this will have to be repeated for every test activity.
Again, the use of Microservices implies new requirements specific to the test data architecture.
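One way to read "the union of test data sets is consistent" is that every reference from one service's data resolves in another service's data. The Python sketch below checks this for two hypothetical services, a customer service and an order service. The field names and sample records are invented for illustration.

```python
from typing import Dict, List, Set


def dangling_references(orders: List[Dict], customer_ids: Set[int]) -> List[Dict]:
    """Return the orders whose customer_id is missing from the customer data."""
    return [o for o in orders if o["customer_id"] not in customer_ids]


# Hypothetical test data sets, one per service.
customer_service_data = [
    {"id": 1, "name": "Alice Example"},
    {"id": 2, "name": "Bob Example"},
]
order_service_data = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 3},  # refers to a customer that does not exist
]

known_ids = {c["id"] for c in customer_service_data}
problems = dangling_references(order_service_data, known_ids)
```

Here `problems` contains order 101, because no customer with id 3 exists in the customer service's test data. A check like this could run each time the test data sets are provisioned.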
While it is best practice to define (test) infrastructure as code, the same is not true for data. Creating a fresh test infrastructure from a versioned script is fast and reliable, but you will still need to copy test data into the new environment from somewhere, and you will need to pay for storing every version of the test data you need to have available. The alternative is to design a very ingenious way to generate the test data you need when you need it.
If you are running a hybrid cloud, keep in mind that while uploading data to the cloud is usually free, downloading it again comes at a cost.
GDPR does not per se forbid the use of real production data for testing purposes, but it does require you to get consent from each individual to use their personally identifiable data for testing. In practice this is next to impossible: either you would not get the consent, or collecting it would be too cumbersome to be feasible.
You are left with a couple of alternatives: you can either mask production data or generate synthetic data. One thing to remember in both cases is that you need to do it consistently across all systems delivering data to the test data set.
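To illustrate what masking "consistently across all systems" can mean, the Python sketch below uses a keyed hash (HMAC-SHA256 from the standard library) so that the same input always produces the same pseudonym, no matter which system it came from. This is only one possible technique. The key, the `user_` prefix and the normalization rule are assumptions, and whether keyed hashing is sufficient masking for your regulatory requirements has to be assessed separately.

```python
import hashlib
import hmac

# Assumption: one secret key shared by all masking jobs and kept outside
# the test environments. The value below is a placeholder.
MASKING_KEY = b"replace-with-a-secret-key"


def mask_value(value: str, key: bytes = MASKING_KEY) -> str:
    """Deterministically replace a value with a pseudonym."""
    # Normalize first so that trivial formatting differences between
    # source systems do not produce different pseudonyms.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256)
    return "user_" + digest.hexdigest()[:16]


# The same person, as stored in two hypothetical source systems.
crm_email = mask_value("Jane.Doe@example.com")
billing_email = mask_value("jane.doe@example.com ")
```

Both calls return the same pseudonym, so records that belonged to the same person in the CRM and billing systems can still be joined in the test data set after masking.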
For a small number of tests you may specifically need synthetic test data, e.g. to test that the data masking needed in the production environment actually works as expected*.
For production data, volume is not in question: you have one full version of the entire data set. For test purposes, though, you will need several versions of the data, and each of them incurs costs for storage etc. Obviously you want to keep the volumes down, but at the same time you need different amounts of data for different test purposes to make sure the data is representative. For example, changes to a data entry module need much less test data than changes to a customer segmentation algorithm.
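One possible way to produce test data sets of different sizes that still fit together is to sample deterministically on a business key. The Python sketch below hashes a hypothetical `customer_id` into one of 100 buckets and keeps the customers below a chosen percentage. The data and field names are invented for illustration.

```python
import hashlib


def in_sample(customer_id: int, percent: int) -> bool:
    """Decide deterministically whether a customer belongs to the sample."""
    digest = hashlib.sha256(str(customer_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent


# Hypothetical full data sets.
customers = [{"id": i} for i in range(1, 1001)]
orders = [{"order_id": 10_000 + i, "customer_id": (i % 1000) + 1} for i in range(5000)]

# A small set, e.g. for testing a data entry module.
small_customers = [c for c in customers if in_sample(c["id"], 5)]
small_orders = [o for o in orders if in_sample(o["customer_id"], 5)]

# A larger set, e.g. for testing a segmentation algorithm.
large_customers = [c for c in customers if in_sample(c["id"], 50)]
```

Because the decision depends only on the key, every order in the small set belongs to a customer in the small set, and the small set is always contained in the larger one. Whether a hash-based sample is representative enough for a given test is a separate question that depends on the data.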
There are a lot of architecture decisions you need to make that relate specifically to data in the test environments. You need to decide whether to use masked production data or synthetic data, and make architecture decisions accordingly. You also need to decide where, when, at which granularity and how often you store a versioned snapshot of data. Another decision is how you log the usage of test data so that you can provide full traceability.
* Verifying masking on data that is already masked could be misleading when it comes to the effectiveness of the masking algorithm.