The Inferior Subset

November 9th, 2013

Why Subsets qualify as an inferior good

Why are you sub-setting your data? Even with the cost of spinning disk falling by half every 18 months or so, and the cost and power of flash rapidly catching up, several large customers I’ve encountered in the last three years are investing in large-scale or pervasive programs that force their non-prod environments to subset data as a way to save storage space.

However, sub-setting involves several trade-offs and can create problems of its own, including:

* The performance of code under small sets of test data can be radically different from its performance on full sets of data.
* Creating the subset can be CPU- and memory-intensive, and may need to be repeated often.
* The process of creating consistent subsets can be arduous, iterative, and error-prone. In particular, corner cases are often missed, and creating subsets that maintain referential integrity can be quite difficult.
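To make the referential integrity point concrete, here is a minimal sketch in Python. The `customers` and `orders` tables, their sizes, and the 10% fraction are made up for illustration and are not taken from any customer system. It compares sampling each table independently with sampling the parent table first and then following the foreign key.

```python
import random

def find_orphans(orders, customer_ids):
    """Return orders whose customer_id has no matching customer row."""
    return [o for o in orders if o["customer_id"] not in customer_ids]

def naive_subset(customers, orders, fraction, rng):
    """Sample each table independently; foreign keys are ignored."""
    kept_customers = rng.sample(customers, int(len(customers) * fraction))
    kept_orders = rng.sample(orders, int(len(orders) * fraction))
    return kept_customers, kept_orders

def consistent_subset(customers, orders, fraction, rng):
    """Sample parents first, then keep only the children that reference them."""
    kept_customers = rng.sample(customers, int(len(customers) * fraction))
    kept_ids = {c["id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders

# Hypothetical data: 100 customers, 1,000 orders spread across them.
rng = random.Random(42)
customers = [{"id": i} for i in range(1, 101)]
orders = [{"id": n, "customer_id": rng.randint(1, 100)} for n in range(1, 1001)]

naive_customers, naive_orders = naive_subset(customers, orders, 0.1, rng)
naive_orphans = find_orphans(naive_orders, {c["id"] for c in naive_customers})

good_customers, good_orders = consistent_subset(customers, orders, 0.1, rng)
good_orphans = find_orphans(good_orders, {c["id"] for c in good_customers})

print("orphaned orders, independent sampling:", len(naive_orphans))
print("orphaned orders, parent-first sampling:", len(good_orphans))
```

Even in this two-table toy, the consistent version has to know the parent/child relationship up front. A real schema with dozens of foreign keys, some of them circular, is where the process becomes arduous and iterative.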

It’s difficult to get 50% of the data with 100% of the skew, rather than 50% of the data with 50% of the skew. Without the proper skew, QA could miss important cases and SQL optimization could come out completely wrong, not to mention that SQL queries could hit zero rows instead of thousands.
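As a rough illustration of how a subset can lose the interesting rows, the sketch below computes the chance that a rare value disappears entirely from a simple random sample taken without replacement. The row counts are invented for the example, and real sub-setting tools may sample quite differently, so treat it as a back-of-the-envelope model rather than a description of any particular product.

```python
def prob_value_missing(total_rows, rare_rows, sample_rows):
    """Probability that none of the rare rows lands in a simple random
    sample of sample_rows rows drawn without replacement."""
    p = 1.0
    for i in range(rare_rows):
        # Chance that the i-th rare row is also left out of the sample,
        # given that the previous rare rows were left out.
        remaining_unsampled = max(total_rows - sample_rows - i, 0)
        p *= remaining_unsampled / (total_rows - i)
    return p

# Hypothetical table: 10,000 rows, only 10 of which carry a rare status value.
TOTAL_ROWS = 10_000
RARE_ROWS = 10

for fraction in (0.5, 0.1, 0.01):
    sample_rows = int(TOTAL_ROWS * fraction)
    p_missing = prob_value_missing(TOTAL_ROWS, RARE_ROWS, sample_rows)
    print(f"{fraction:>5.0%} subset: P(rare value absent) = {p_missing:.4f}")
```

The smaller the subset, the more likely it is that a query filtering on the rare value returns zero rows in test, even though it returns rows in production.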

Why thin cloning makes subsets an inferior good

As we’ve discussed in other blogs, a thin cloning solution such as Delphix causes the total cost of data to fall dramatically, and this increases a CIO’s purchasing power (in the context of data), allowing much more data to be readily available at a much lower price point. The dramatic result we observe is that people are abandoning subsets in droves. In fact, as the price of data has fallen with the implementation of Delphix, the desire for subsets is being replaced by a desire for full-size datasets. Certainly, customers will still want subsets for reasons such as limiting data to a specific business line or team, as a security measure, or as a way to use fewer CPU and memory resources. But it is also clear that the reduction in the total cost of data has led customers to switch to full-size datasets, both to avoid performance-related late breakage and to avoid the cost of subset creation. Beyond this, it’s causing them to rethink their investment in a sub-setting apparatus altogether.

At Delphix, the data we see from customers bears this out. Subsets cost a lot to make, and with the storage savings gone, they are simply inferior to full-size sets for many applications. Once thin clones eliminate storage savings as a primary reason to subset, the inferiority of the subset quickly becomes apparent.



  2. | #1

    Subset, yes... but obfuscation still presents a challenge, i.e., sensitive data in production must be perturbed before cloning to a non-production environment.

  3. khailey
    | #2

    Hey Conner, thanks for the follow-on. Yep, obfuscation is an obstacle. This is cool! Virtual databases and masked data: Delphix and Axis Technology Team Up to Simplify Data Masking
