Test data management is difficult, time consuming and expensive leading to incorrect implementation of test data management and significant losses in revenue from high QA costs to bugs in production. Fortunately there is a solution that alleviates the majority of the huge resource, time and planning of conventional test data management and that new technology is called data virtualization.
Dependable QA testing requires that code be tested on data that represents the data to be encountered on the final production system. The data has to respect the business rules and data has to correlate between one business rule and another. For example a customer order record has to correlate to existing order items and an existing customer. The records also have to cover the date ranges being searched. If for example the test code searches date ranges outside of the test data then the test code won’t even touch any data. It’s difficult nearly impossible to create from scratch a full set of data that represents all the possible combinations of data that will be encountered in production.
If test data management is done incorrectly then the lack proper data in testing leads bugs making their way into production leading to significant impact on the bottom line.
The absence of proper test data causes nearly one-third of the incidents we see in QA/nonproduction environments and is a major reason why nearly two-thirds of business applications reach production without being properly tested. The resulting application failures cause significant amounts of downtime, with an average price tag of $100,000 per hour for mission-critical applications, according to industry estimates. – Cognizant
After talking to hundreds of companies about their test data management we have found that QA systems typically use data that is partial, representing a subset of data in production, or the data is synthetic and had been generated to simulate data in production. In both cases the data isn’t sufficient to cover all the data possibilities that are encountered in the production system(s). In order to address these missing case, the testing plan typically includes a QA cycle on a full copy of the production system near the end of the development cycle. At this point many bugs are found leading to the project either having to delay the project release or to release on time but with existing bugs.
The question arises “why isn’t the code tested on a full copy of production earlier in the development cycle?” If code testing was run on a full copy of production earlier in the code cycle then bugs would be found earlier, fixed earlier and fixed with less code rework.
The problem is that production systems usually run on larger complex databases and making copies of these databases is difficult, time consuming and resource intensive. Even when a copy is made, if the test cycles modify data as is typically the case where data is created, modified and deleted then the same test data will change from one QA run to another leading to diverging results. This leads to the need to refresh the data to the state it was before the QA cycle started. With large complex data sets the refresh process can be deemed prohibitively expensive and time consuming. Thus the industry had come to think that testing on full size production data is not feasible.
In fact using production data is possible but it requires a technology called data virtualization. Data virtualization is a technology that allows almost instant cloning of data for almost no storage. There is a storage requirement for the original source data and even that requirement can be eased through compression such that storage requirement is on the order of one third of the original source data. Once data virtualization is linked to the source data, i.e. there is an initial copy of the source data, then from there new clones can be made for almost no storage because the clones aren’t actual copies but point to the already existing data. The beauty is that these clones are full read and write copies because when these clones modify that data, these modifications are stored separately from the original data and are only visible to the clone that made the changes. These clones are called thin clones because they don’t initial take up additional storage. The whole process is called data virtualization and when it comes to databases that use this technology they are called virtual databases. On top of the storage savings data virtualization technologies come with automated collection of changes on the source data creating a time flow of changes. The time flow means that there is never a full copy of the source taken again. Only changes are collected. Changes older than the time window, which is usually a couple of weeks, are purged out. Virtual database can be provisioned from any point in time down to the second from this time flow. Most data virtualization technologies come with automated provisioning of virtual databases making the provisioning of a up and running database on a target machine a matter of a few clicks of a mouse and a few minutes. Data virtualization also generally includes options for data masking improving security coverage in the testing cycles.
A few companies we have worked with have had test cycles that repeated over and over and between testing code the database had to be refreshed. In one specific case the refresh took 8 hours for only 20 minute of actual code testing. Going from this architecture to an architecture of virtual database we were able to use full fresh copies of production data catching bugs earlier and reduce the refresh time down to a few minutes drastically reducing the overhead of QA and increasing test coverage: