Difference between storage snapshots and data virtualization
People are hearing more about Delphix and data virtualization. Data virtualization makes it possible to create data copies in minutes using thin cloning. Thin cloning means that clones share the unmodified blocks on the file system, while modified blocks are private to the clone that made the modification.
As people hear about data virtualization, the question comes up: “What’s the difference between data virtualization and file system snapshots?” Comparing file system snapshots to data virtualization is like comparing an engine to a car. Creating a car from an engine takes some serious work, and so does creating data virtualization from snapshot technologies.
File system snapshots can be used to duplicate a file system. If that file system has a database on it, then a thin clone of the database can be made using the file system snapshot. The benefits of file system snapshots for database cloning, that is, thin cloning, are clear. Thin cloning saves enormous amounts of storage but, more importantly, it saves time, or in theory should save time. If thin cloning offers so much, then why is thin cloning technology so rarely used? The reason is the steep barrier to entry. It requires storage experts, specialized hardware and lots of brittle scripting and/or hands-on operations. For example, CERN, a big NetApp site, wrote over 25,000 lines of code to try to provide a minimal ability for a developer to thin clone a database.
The analogy that comes to mind between thin cloning and data virtualization is the comparison between the internet and the browser-accessed World Wide Web. The internet was around long before the web, with email, ftp, gopher, bulletin boards and so on, but hardly anyone used it until web browsers and web servers came out. When the browser came out, the barrier to entry fell away and everyone started using the internet. It’s the same with data virtualization. With data virtualization, everyone is starting to use thin cloning.
Data virtualization is like the car, whereas file system snapshots are the engine. Comparing file system snapshots to data virtualization is like comparing a car engine to an actual car. Making a full car from just a car engine is a serious amount of work, and so is implementing enterprise database virtualization from file system snapshots.
Now there will be some who say, “I can make a file system snapshot and then make a thin clone of a database using that snapshot, easy.” Sure, if you know how to put a database in hot backup mode, you can take the file system snapshot and then make a thin clone database from it. There is one problem. You made that snapshot on the production storage filer, on the same LUNs that the production database is using, so all activity on the clones will impact the performance of production. The whole point of creating a database copy was to protect production and to avoid adding more load on production. The trick is how to get a copy of production onto a development storage array, away from production, so that you can then make the snapshots on the development storage. Sure, you can copy the whole database across, but then what if you want to make clones tomorrow? Do you copy the whole database across again? That defeats the purpose of thin cloning.
Data virtualization takes care of syncing the storage used by the data virtualization tier with the source, which means continuously pulling in changes from the source. Data virtualization also takes care of many other things automatically, such as snapshotting the storage, cloning the storage, compressing the storage and then provisioning the thin clone databases. Provisioning means exposing the file system on the data virtualization tier to the hosts that run the thin clones. It also means renaming the clones, setting up the startup parameters and recovering the database.
Data virtualization has 3 parts:
- Copying and syncing the source data to a data virtualization appliance (DVA)
- Cloning the data on the DVA
- Provisioning the clone to a target machine that runs the thin clone
Each of these 3 parts requires important features.
1. Source data copying
Not only do we need to copy the source data to the data virtualization appliance (DVA), but we also need to continuously pull the changes from the source data into the DVA, so that virtual data can be created from the source at different points in time. Pulling in changes requires a time flow, meaning the DVA saves a time window of changes from the source and purges changes older than the time window. The time window allows the system to run continuously and reach a storage equilibrium without using up more and more storage.
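As a rough illustration of how a time window leads to a storage equilibrium, here is a minimal Python sketch. The `TimeFlow` class, its method names and the sizes used are all hypothetical and are not taken from any product; the sketch also simplifies by just dropping old change sets, where a real system would presumably have to fold them into its baseline copy.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class TimeFlow:
    """Hypothetical model of a time flow: a rolling window of change sets."""
    window: timedelta
    # Each entry is (time the change set was pulled from the source, size in bytes).
    changes: list = field(default_factory=list)

    def pull(self, when: datetime, size_bytes: int) -> None:
        """Record a change set pulled from the source, then purge old ones."""
        self.changes.append((when, size_bytes))
        self.purge(now=when)

    def purge(self, now: datetime) -> None:
        """Drop change sets that have fallen out of the time window."""
        cutoff = now - self.window
        self.changes = [(t, s) for (t, s) in self.changes if t >= cutoff]

    def provisionable_range(self):
        """Earliest and latest points in time a clone could be created from."""
        if not self.changes:
            return None
        return self.changes[0][0], self.changes[-1][0]

    def stored_bytes(self) -> int:
        """Total size of the change sets currently retained."""
        return sum(size for _, size in self.changes)


if __name__ == "__main__":
    start = datetime(2024, 1, 1)
    flow = TimeFlow(window=timedelta(days=14))
    for day in range(30):
        flow.pull(start + timedelta(days=day), size_bytes=100)
        print(day, flow.stored_bytes())
```

With a steady rate of change, the retained size grows until the window is full and then stays flat, which is the equilibrium described above.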
2. The storage or DVA
The DVA has to be able to snapshot, clone and compress the data for efficient storage. The DVA should share data blocks not only on disk but also in memory. The DVA tier handles and orchestrates access to the data it manages, meaning it shares unmodified duplicate data blocks between all the thin clones and keeps modified blocks private to the clone that made the modification.
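The block sharing described here is essentially copy-on-write. The following Python sketch shows the idea under simplifying assumptions: `Snapshot` and `ThinClone` are hypothetical names, blocks are held in in-memory dictionaries, and compression and memory caching are left out.

```python
class Snapshot:
    """Hypothetical immutable map of block number -> block contents."""

    def __init__(self, blocks):
        self._blocks = dict(blocks)

    def read(self, block_no):
        return self._blocks[block_no]


class ThinClone:
    """A clone that shares snapshot blocks and keeps its own changes private."""

    def __init__(self, snapshot):
        self._snapshot = snapshot  # shared with other clones, never modified
        self._private = {}         # only the blocks this clone has written

    def read(self, block_no):
        # Prefer the clone's private copy; otherwise fall back to the shared block.
        if block_no in self._private:
            return self._private[block_no]
        return self._snapshot.read(block_no)

    def write(self, block_no, data):
        # Writes never touch the snapshot, so other clones are unaffected.
        self._private[block_no] = data

    def private_block_count(self):
        return len(self._private)


if __name__ == "__main__":
    snap = Snapshot({0: b"header", 1: b"orders", 2: b"customers"})
    dev, qa = ThinClone(snap), ThinClone(snap)
    dev.write(1, b"orders-masked")
    print(dev.read(1), qa.read(1), dev.private_block_count())
```

Each clone's extra storage is only the blocks it has changed, which is why many clones can be made from one copy of the source.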
3. Provisioning
Data virtualization has to automate the provisioning of thin clones, which means providing a self-service interface. Provisioning handles exposing the data on the DVA over NFS to the target machines that run the data. Provisioning has to automatically handle things such as renaming the database that uses the data, setting the database startup parameters, and recovering and opening the thin clone database. Provisioning has to be self-service, so that anyone can provision clones, be they a DBA or a developer. In order to allow access to anyone, data virtualization has to handle logins and security groups, and has to define which groups have access to which source data, how many clones can be made, which target machines can run clone instances, which operations the user is allowed to perform and how much extra storage the user is allowed to incur on the system. Data virtualization also requires functionality such as rolling back, refreshing, branching and tagging virtual data.
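To make the access rules concrete, here is a minimal Python sketch of a self-service provisioning check. Everything in it is an assumption made for illustration: the `GroupPolicy` and `Provisioner` classes, the group, source and host names, and the clone naming scheme. Storage quotas, per-operation permissions and the actual NFS export and database recovery steps are left out.

```python
from dataclasses import dataclass, field


@dataclass
class GroupPolicy:
    """Hypothetical per-group limits for self-service provisioning."""
    sources: set      # source data the group may clone
    targets: set      # target machines the group may provision to
    max_clones: int   # how many clones the group may hold at once


@dataclass
class Provisioner:
    policies: dict  # group name -> GroupPolicy
    clones_by_group: dict = field(default_factory=dict)

    def can_provision(self, group, source, target):
        """Return (allowed, reason) without changing any state."""
        policy = self.policies.get(group)
        if policy is None:
            return False, "unknown group"
        if source not in policy.sources:
            return False, "source not permitted"
        if target not in policy.targets:
            return False, "target not permitted"
        if self.clones_by_group.get(group, 0) >= policy.max_clones:
            return False, "clone limit reached"
        return True, "ok"

    def provision(self, group, source, target):
        """Record a new clone for the group, or raise if policy forbids it."""
        allowed, reason = self.can_provision(group, source, target)
        if not allowed:
            raise PermissionError(reason)
        count = self.clones_by_group.get(group, 0) + 1
        self.clones_by_group[group] = count
        return f"{source}-clone-{count}@{target}"


if __name__ == "__main__":
    policies = {"developers": GroupPolicy({"erp"}, {"devhost1"}, max_clones=2)}
    provisioner = Provisioner(policies)
    print(provisioner.provision("developers", "erp", "devhost1"))
    print(provisioner.can_provision("developers", "hr", "devhost1"))
```

Keeping the policy check separate from the provisioning action means an interface can tell a user why a request would be refused before attempting it.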