To subset or not to subset

July 28th, 2014

What I see over and over again is development and QA teams using subsets of data. A full data set is not used for testing until near the end of the development release cycle. Once a full set of data is used, testing flushes out more bugs than can be fixed in the time remaining, forcing release dates to slip or releases to ship with known bugs.

Why do people use subsets? Here is one story. At one customer, giving developers and QA full copies of production data was driving excessive storage usage, and to reduce costs they decided to use subsets of production data in development and QA:
  • Data growing, storage costs too high, decided to roll out subsetting
  • App teams and IT Ops teams had to coordinate and manage the complexity of the shift to subsets in dev/test
  • Scripts had to be written to extract correct, coherent data (for example, the right date ranges) while respecting referential integrity
  • It’s difficult to take 50% of the data while keeping 100% of the skew; a naive subset keeps 50% of the data but only 50% of the skew
  • Scripts were constantly breaking as production data evolved, requiring more work on the subsetting scripts
  • QA teams had to rewrite automated test scripts to run correctly on subsets
  • Time lost in the ADLC/SDLC to make subsets work (converting storage CapEx savings into higher OpEx) put pressure on release schedules
  • Errors were caught late in UAT, performance, and integration testing, creating “integration or testing hell” at the end of development cycles
  • Major incidents occurred post-deployment, forcing more detailed tracking of root cause analysis (RCA)
  • 20-40% of the production bugs that caused downtime were traced to non-representative data sets and volumes.
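To make the referential-integrity point concrete, here is a minimal sketch of the kind of subsetting script described above, using a hypothetical customers/orders schema (not the customer's actual schema). Pulling a date range of orders alone would orphan rows; the subset must also pull every customer those orders reference:

```python
# Sketch of a subsetting extract (hypothetical schema for illustration).
import sqlite3

src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         order_date TEXT);
    INSERT INTO customers VALUES (1, 'acme'), (2, 'globex'), (3, 'initech');
    INSERT INTO orders VALUES (10, 1, '2014-01-15'),
                              (11, 2, '2014-03-02'),
                              (12, 3, '2013-11-30');
""")

# Subset by date range: 2014 orders only...
subset_orders = src.execute(
    "SELECT * FROM orders WHERE order_date >= '2014-01-01'").fetchall()

# ...plus every customer those orders reference, so no foreign key dangles.
referenced = {row[1] for row in subset_orders}
subset_customers = src.execute(
    "SELECT * FROM customers WHERE id IN (%s)"
    % ",".join("?" * len(referenced)), sorted(referenced)).fetchall()

print(len(subset_orders), len(subset_customers))  # prints: 2 2
```

Even this toy version shows why the scripts kept breaking: every new table or foreign key in production means another closure step to chase, and every schema change invalidates the extract logic.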
Moral of the story: if you roll out subsetting, it’s worth holding the teams accountable and tracking the total cost and impact across teams and release cycles. What is the real cost of moving to subsets? How much extra time goes into building and maintaining them, and more importantly, what is the cost of the bugs that slip into production because of them? If, on the other hand, you can test on full-size data sets, you will flush bugs out early, where they can be fixed quickly and cheaply.
A robust, efficient, and cost-saving alternative is database virtualization. With database virtualization, database copies take up almost no space, can be made in minutes, and all the overhead and complexity listed above goes away. In addition, database virtualization reduces CapEx and OpEx in many other areas, such as:
  • Provisioning operational reporting environments
  • Providing controlled backup/restore for DBAs
  • Full-scale test environments.
And subsets do not provide the data control features that database virtualization offers to accelerate application projects (including analytics, MDM, ADLC/SDLC, etc.). Our customers repeatedly see 50% acceleration in project timelines and cost, savings that generally dwarf the CapEx and OpEx storage line items, thanks to the features we make available in our virtual environments:
  • Fast data refresh
  • Integration
  • Branching (split a copy of dev database off for use in QA in minutes)
  • Automated secure branches (masked data for dev)
  • Bookmarks for version control or compliance preservation
  • Share (pass errors plus the data environment from test to dev: if QA finds a bug, they can pass a copy of the database back to dev for investigation)
  • Reset/rollback (recover to pre-test state or pre-error state)
  • Parallelize all steps: have multiple QA databases to run QA suites in parallel. Give all developers their own copy of the database so they can develop without impacting other developers.
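The space and speed claims behind these features rest on copy-on-write storage: a virtual copy shares every unchanged block with its parent and stores only the blocks it modifies. Here is a toy model of branching and reset/rollback (an illustration of the general technique, not any vendor's actual implementation):

```python
# Toy copy-on-write model of virtual database copies (illustration only).
class VirtualCopy:
    def __init__(self, parent_blocks=None):
        # Blocks inherited from the parent image (shared, unchanged data).
        self.base = dict(parent_blocks or {})
        # Private blocks: only what this copy has changed.
        self.delta = {}

    def write(self, block_id, data):
        self.delta[block_id] = data  # copy-on-write: store only the change

    def read(self, block_id):
        return self.delta.get(block_id, self.base.get(block_id))

    def branch(self):
        # "Branching": a new copy in O(changed blocks), not O(full data).
        return VirtualCopy({**self.base, **self.delta})

    def reset(self):
        # "Reset/rollback": discard changes, back to the pre-test state.
        self.delta.clear()


prod = VirtualCopy({1: "row-a", 2: "row-b"})  # full production image
qa = prod.branch()                            # instant QA copy
qa.write(2, "row-b-modified")                 # QA changes one block
assert prod.read(2) == "row-b"                # production is untouched
qa.reset()                                    # rollback after the test run
assert qa.read(2) == "row-b"
```

Because each branch stores only its delta, giving every developer and every parallel QA suite its own full-size copy costs almost nothing in storage, which is exactly what subsetting was trying (and failing) to achieve.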

