Difference between storage snapshots and data virtualization

November 12th, 2013


photos by Keith Ramsey and Automotive Rhythms

People are hearing more about Delphix and data virtualization. Data virtualization means data copies can be made in minutes using thin cloning. Thin cloning means sharing unmodified blocks on the file system between clones; blocks modified by a clone are private to that clone.

As people hear about data virtualization, the question comes up: “What’s the difference between data virtualization and file system snapshots?” Comparing file system snapshots and data virtualization is like comparing an engine to a car. Creating a car from an engine takes some serious work. Creating data virtualization from snapshot technologies takes some serious work.

File system snapshots can be used to duplicate a file system. If that file system has a database on it, then a thin clone of the database can be made using the file system snapshot. The benefits of file system snapshots for database cloning, i.e. thin cloning, are clear: thin cloning saves enormous amounts of storage but, more importantly, it saves time, or in theory should save time. If thin cloning offers so much, then why is the technology so rarely used? The reason is the steep barrier to entry. It requires storage experts, specialized hardware and lots of brittle scripting and/or hands-on operations. For example CERN, a big NetApp site, wrote over 25K lines of code just to give developers a minimal ability to thin clone a database.


vt100 internet from Tim Patterson

The analogy that comes to mind for file system snapshots versus data virtualization is the internet versus the browser-accessed world wide web. The internet was around long before the web, with email, FTP, gopher, bulletin boards etc., but hardly anyone used it until web browsers and web servers came out. Once the browser arrived, the barrier to entry fell completely and everyone started using the internet. It's the same with data virtualization: with data virtualization, everyone is starting to use thin cloning.

Data virtualization is the car, whereas file system snapshots are the engine. Comparing file system snapshots to data virtualization is like comparing an engine maker to a car maker. Making a full car from just an engine is a serious amount of work, and implementing enterprise database virtualization from file system snapshots is serious work.


Now there will be some who say, “I can make a file system snapshot and then make a thin clone of a database using that snapshot, easy.” Sure, if you know how to put a database in hot backup mode, you can take the file system snapshot and then make a thin clone database from it (a minimal sketch of the steps follows below). There is one problem: you made that snapshot on the production storage filer, on the same LUNs the production database is using, so all activity on the clones will impact the performance of production. The whole point of creating a database copy was to protect production and avoid adding more load on it. The trick is how to get a copy of production onto a development storage array, away from production, so that you can make the snapshots on the development storage. Sure, you can copy the whole database across, but what if you want to make clones tomorrow? Do you copy the whole database across again? That defeats the purpose of thin cloning.
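
To make that concrete, here is a minimal sketch of the manual hot-backup-plus-snapshot sequence, assuming the datafiles sit on a ZFS file system and that the Oracle environment is already set; the dataset names (tank/proddata, tank/devdata) are hypothetical:

# Sketch only: dataset names are hypothetical and error handling is omitted.
SNAP=manual_clone_$(date +%Y%m%d)

# 1. Put the source database in hot backup mode so the datafiles are snapshot-consistent
sqlplus -s / as sysdba <<EOF
ALTER DATABASE BEGIN BACKUP;
EOF

# 2. Take the file system snapshot (nearly instantaneous, shares unmodified blocks)
zfs snapshot tank/proddata@$SNAP

# 3. Take the database out of hot backup mode
sqlplus -s / as sysdba <<EOF
ALTER DATABASE END BACKUP;
EOF

# 4. Create a writable thin clone of the snapshot to serve as the development copy
zfs clone tank/proddata@$SNAP tank/devdata
# The clone's datafiles still have to be recovered, renamed and opened as a separate instance.

Note that this snapshot and clone live on the same storage as production, which is exactly the problem just described.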


Data virtualization takes care of syncing the storage used by the data virtualization tier with the source, which means continuously pulling in changes from the source. Data virtualization also takes care of many other things automatically, such as snapshotting the storage, cloning the storage, compressing the storage and then provisioning the thin clone databases, which means exposing the file system on the data virtualization tier to the hosts that run the thin clones, renaming the clones, setting up the startup parameters and recovering the databases.

 


 

Data virtualization has 3 parts:

  1. Copying and syncing the source data to a data virtualization appliance (DVA)
  2. Cloning the data on the DVA
  3. Provisioning the clone to a target machine that runs the thin clone data

Each of these 3 parts requires important features.

 

1. Source data copying

 

Not only do we need to copy the source data to the data virtualization appliance (DVA), we also need to continuously pull changes from the source into the DVA so that virtual data can be created from the source at different points in time. Pulling in changes requires a time flow, meaning the DVA saves a time window of changes from the source and purges changes older than the window. The time window allows the system to run continuously and reach a storage equilibrium without using up more and more storage.
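
As an illustration of the time-window idea only (not how any particular DVA is implemented), the sketch below takes a new point-in-time snapshot of the synced copy and purges snapshots that have aged out of a retention window; the dataset name, the 14-day window and the use of ZFS are all assumptions:

#!/bin/bash
# Hypothetical sketch: keep a rolling time window of snapshots on the DVA.
DATASET=dva/source_copy          # dataset holding the synced source copy (hypothetical)
RETENTION_DAYS=14                # the time window
CUTOFF=$(date -d "-${RETENTION_DAYS} days" +%s)

# Take a new point-in-time snapshot of the synced copy
zfs snapshot ${DATASET}@sync_$(date +%Y%m%d_%H%M)

# Purge snapshots older than the window so storage reaches an equilibrium
# (snapshots that still back a live clone would have to be kept)
zfs list -H -p -t snapshot -o name,creation -r ${DATASET} |
while read -r name creation; do
    if [ "$creation" -lt "$CUTOFF" ]; then
        zfs destroy "$name"
    fi
done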

 

2. The storage or DVA

 

The DVA has to be able to snapshot, clone and compress the data for efficient storage. The DVA should share data blocks not only on disk but also in memory. The DVA tier handles and orchestrates access to the data it manages, meaning it shares unmodified duplicate data blocks between all the thin clones and keeps modified blocks private to the clone that made the modification.
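
For example, on a copy-on-write file system such as ZFS (used here purely as an illustration; the pool and dataset names are hypothetical), the snapshot, clone and compression capabilities look roughly like this:

# Compress the synced source copy
zfs set compression=lz4 dva/source_copy
# Take an immutable point-in-time image of that copy
zfs snapshot dva/source_copy@snap1
# Create writable thin clones; unmodified blocks stay shared with the snapshot,
# and only blocks a clone modifies consume new space
zfs clone dva/source_copy@snap1 dva/vdb1
zfs clone dva/source_copy@snap1 dva/vdb2

On ZFS, blocks shared on disk are also cached only once in the file system cache (the ARC), which is the kind of behavior that lets clones share memory as well as disk.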

 

3. Provisioning

 

 

Data virtualization has to automate the provisioning of thin clones, meaning it provides a self-service interface. Provisioning handles exposing the data on the DVA over NFS to the target machines that run the data. Provisioning has to automatically handle things such as renaming the databases that use the data, setting startup database parameters, and recovering and opening the database thin clone (a rough sketch of these steps follows below). Provisioning has to be self service, where anyone can provision clones, be they a DBA or a developer. In order to allow access to anyone, data virtualization has to handle logins, security groups and definitions of which groups have access to which source data, how many clones can be made, what target machines can run clone instances, what operations the user is allowed to do and how much extra storage the user is allowed to consume on the system. Data virtualization also requires functionality such as rolling back, refreshing, branching and tagging virtual data.
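
A minimal sketch of the mechanical provisioning steps on a target host, assuming the DVA exports the clone's file system over NFS and that a re-create-controlfile script has already been generated; all host, path and database names here are hypothetical, and a real provisioning layer wraps these steps in error checking and a self-service interface:

#!/bin/bash
# Hypothetical provisioning sketch for one thin clone on a target host.
DVA=dva-appliance        # the data virtualization appliance (hypothetical name)
VDB=dev1                 # name of the new virtual database (hypothetical)

# 1. Expose the cloned file system from the DVA to the target host (mount typically done as root)
mkdir -p /u01/app/oracle/oradata/${VDB}
mount -t nfs ${DVA}:/exports/${VDB} /u01/app/oracle/oradata/${VDB}

# 2. Start the clone under its new name with its own startup parameters,
#    recover it and open it
export ORACLE_SID=${VDB}
sqlplus / as sysdba <<EOF
STARTUP NOMOUNT PFILE=/u01/app/oracle/oradata/${VDB}/init${VDB}.ora
@/u01/app/oracle/oradata/${VDB}/create_controlfile_${VDB}.sql
RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL;
CANCEL
ALTER DATABASE OPEN RESETLOGS;
EOF

Multiply this by every developer request, database version and target host, and the case for automating it behind a self-service interface becomes clear.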


 

photo by zachstern



  Comments

  #1

    I’ll take the engine please!

    “It requires storage experts, specialized hardware and lots of brittle scripting and/or hands on operations.”

    We regularly use storage snapshots for our non-production databases. Yes, it does require specialized hardware. But our production database is over 10TB in size. We store this in our NetApp filers. There is no "brittle scripting". Our scripts to refresh a database are only 100 lines long and they are tried and true. They have worked properly for years. There are no hands-on operations. If you want to refresh a non-production database, it's executing one script. It doesn't get more push-button than that. I have no clue why anyone would have a 25K line script to do this type of cloning.

    “There is one problem. You made that snapshot on the production storage filer on the same LUNs that the production database is using, so all activity on the clones will impact the performance of production.”

    Sure, putting this clone on your production disk system is silly for all the reasons you mentioned. What we do is use our DR site for our non-production environments. We have a standby database there. We clone from our standby. So non-production cannot affect our production database. But that isn't the only way we have done this…

    “The trick is how do you get a copy of production onto a development storage array away from production and so that you can then make the snapshots on the development storage? Sure you can copy the whole database across but then what if you want to make clones tomorrow? Do you copy the whole database across again? ”

    You leverage your storage vendor's technology. Before we moved our non-prod databases to our DR site, we leveraged NetApp's SnapMirror update process to get our snapshots to our development storage. And the update doesn't have to copy the entire database. It only needs to copy the deltas since the previous SnapMirror update was performed.

    “Implementing enterprise database virtualization from file system snapshots is serious work.”

    I would disagree. It isn’t that much work, not with today’s storage technologies.

    Database Virtualization is cool, but I find that many of the reasons you post against disk-based snapshots just don’t fit with today’s hardware solutions.

  khailey | #2

    Hi Brian, Thanks for the long reply.

    Question: how long does it take for a developer to get a clone of production after he asks for it?

    Of interest are these two posts:
    http://www.kylehailey.com/delphix/delphix-oracle-12c/
    http://www.kylehailey.com/handoffs-delay-self-service-saves/

    Your experiences would be great to share. Have you written up your procedures and methods?

  #3

    > how long does it take for a developer to get a clone of production after he asks for it?

    In my environment, the cloned database can be up and running in 20 minutes.

    > Have written up your procedures and methods?

    I can share the scripts.

    Here is the script I use to take the snapshot of my standby database. Since this is a standby, I can make my life easy and take the standby down to perform a cold snapshot.

    ssh oracle@standby_host "/u01/app/crs11.2.0.3/bin/srvctl stop database -d standby_db"
    ##############
    #Set up variables
    snaptoday=`date +_%m_%d_%y_%a_%H_%M`
    prefix=environment
    snapname="$prefix$snaptoday"
    #Take snapshots of data volumes
    ssh root@netapp1 snap create standby_data01 $snapname
    ssh root@netapp1 snap create standby_data02 $snapname
    ##############
    ssh oracle@standby_host "/u01/app/crs11.2.0.3/bin/srvctl start database -d standby_db"
    ssh oracle@primary_host "/scripts/createcontrol.sh /scripts"

    The last line just dumps the control file to trace. Pretty simple stuff for the Oracle DBA. It also does some minor string manipulation to change the db name. This script is automated via cron to take a regular snapshot once per week. But just yesterday, I had our IT Director ask for a more current version of production for a hot-button issue. So I ran the script manually to take the snapshot right then and there. The standby database is down for about 1-2 minutes for this, which is easily tolerable for a standby environment that is not actively being used. Most of that time is spent actually stopping/starting the instances.

    When I get a request to refresh a non-production database, here is the script that I run for that db:

    refresh_db.sh:

    #!/bin/bash
    #Script to set up database volumes
    #
    #Some static variables
    script_dir=/scripts
    #Variables that can be set
    basename=file_mount_point
    sid=dev1
    snapnum=1
    dbhost=dev1-db
    #Shut down the database on the remote server
    ssh oracle@$dbhost "$script_dir/shutdowndb.sh $sid"
    #Remove _old volumes if there are any.
    ssh root@netapp1 vol offline ${basename}_data01_old
    ssh root@netapp1 vol offline ${basename}_data02_old
    #Unmount volumes on the remote server.
    ssh root@$dbhost "umount /u01/app/oracle/oradata/${basename}/data01"
    ssh root@$dbhost "umount /u01/app/oracle/oradata/${basename}/data02"
    #Rename the volumes -Also unmounts the volumes.
    ssh root@netapp1 vol rename ${basename}_data01 ${basename}_data01_old
    ssh root@netapp1 vol rename ${basename}_data02 ${basename}_data02_old
    #Make the new volumes
    $script_dir/makevolumes.sh $basename $snapnum
    #Mount up the volumes
    $script_dir/mountvolumes.sh $dbhost $sid $basename
    #Fix the startup scripts and start the database instance
    /naucommon/scripts/start_db_instance.sh $dbhost $sid

    You can see that towards the end, three scripts are called. They are:

    makevolumes.sh:
    #!/bin/bash
    #Get the latest snapmirror update snapshot.
    data01=`ssh root@netapp1 snap list -n standbydb_data01 | grep environment_ | awk '{print $4}' | head -$2 | tail -1`
    data02=`ssh root@netapp1 snap list -n standbydb_data02 | grep environment_ | awk '{print $4}' | head -$2 | tail -1`
    #Create a new volume
    ssh root@netapp1 vol clone create $1_data01 -s none -b standbydb_data01 $data01
    ssh root@netapp1 vol clone create $1_data02 -s none -b standbydb_data02 $data02
    #Disable Misc volume settings that are not needed
    ssh root@netapp1 vol autosize $1_data01 off
    ssh root@netapp1 vol autosize $1_data02 off

    mountvolumes.sh:

    #!/bin/bash
    #Script to mount database volumes
    #Mount up the volumes
    ssh root@$1 "mkdir -p /u01/app/oracle/oradata/$2/data01"
    ssh root@$1 "mkdir -p /u01/app/oracle/oradata/$2/data02"

    start_db_instance.sh:

    #!/usr/bin/expect -f
    #
    set timeout -1
    set dbhost [lindex $argv 0]
    set dbsid [lindex $argv 1]

    spawn ssh $dbhost
    expect "$"
    send "export ORAENV_ASK=NO\r"
    expect "$"
    send "export ORACLE_SID=$dbsid\r"
    expect "$"
    send ". oraenv\r"
    expect "$"
    send "sqlplus / as sysdba\r"
    expect "SQL>"
    send "@/scripts/$dbsid\_controlfilecreate.sql\r"
    expect "CANCEL}"
    send "CANCEL\r"
    expect "SQL>"
    send "ALTER DATABASE OPEN RESETLOGS;\r"
    expect "SQL>"
    send "EXIT\r"
    exit

    The last script uses Expect to handle some interaction. Our controlfile renames the database and we don’t clone the online redo logs so we have to open the database with the RESETLOGS option.

    I modified names to protect the innocent, but this is the basic process. I can’t speak to other hardware vendors, but NetApp’s technology makes this super easy for us. It takes a minute to take the snapshot of our standby database. It takes just a few minutes (up to 200 executing one script to create the clone. This has become so easy for us, that I have our application deployment team now able to refresh their deployment staging databases before they release to production. They do this on demand on their schedule without DBA involvement because they just run one script. I have thought about granting the ability to have developers push the button themselves, but I haven’t pulled the trigger on that one yet.

    This same snapshot ability is used to take hourly backups of our production database. The process to restore production is nearly the same as the process for cloning the database. I recently went through an exercise where we tested a switchover to our standby database. That worked, but I was unable to switch back. Before this test, I stopped both databases and we took snapshots of both the primary and the standby. When I was unable to switch back, I had my storage admin destroy the data volumes and create new clones on both the primary and standby. This took him about 30 minutes to accomplish since we don't script this; he had to do it manually. But in that 30 minutes, I had both the primary and the standby up and running back to where they were prior to the test. Also, our regular DR exercises use a clone database similar to this process. The upshot is we use this same process for multiple purposes: backup and recovery of production, cloning non-production databases, and DR exercises that do not involve switching over to our standby database.

    This process did take some time to set things up correctly. But that investment has paid off many times. Also, with NetApp’s FlexClone technology, the cloned database only uses additional space for changed blocks. Unchanged blocks are shared by each clone. Our production database is 10TB in size and growing. We often run 20 clones of production. Without deduplication, we would need about 200TB just for the clones. But since unmodified blocks are shared by the clones, we often find that we need about 13TB to support all of these clones. So while the disk may have a higher cost/TB, we recoup those savings with all of those clones.

    I don't want any of this discussion to take away from the great work in your article. It is great stuff. Personally, I find RMAN's DUPLICATE database functionality very cumbersome for anything other than smaller databases. I often find companies whose dev and test databases are smaller, scaled-down versions of production, and they don't get refreshed that often due to the time it takes for a refresh to complete. I encourage anyone to look at the contents of your article to see if there is a better way to clone production databases with other technologies. That way, you can have a non-production database that is a clone of production and refresh it more often.

    Thanks,
    Brian

  khailey | #4

    Hi Brian,

    Thanks for the detailed description and the scripts. Awesome stuff – appreciate your sharing the scripts. Pretty cool.

    In summary, you

    * need standby database
    * stop standby every time you snapshot
    * snapshot once a week
    * you can only create clones from one snapshot a week ?
    * up to 200 minutes to create a clone (?)
    * for each new database and each new array added to the configuration you have to code new snapshot scripts and clone scripts
    * only create clones from the latest snapshot otherwise it’s custom intervention or script modification

    These scripts illustrate perfectly the comparison between the internet before the browser, using telnet, ftp etc., and the internet with a browser. Before browsers and web servers I depended on ftp, email and telnet, but once the browser came out it opened up a world of possibility for me and, more importantly, for the entire world as the bar to entry fell completely. As you said, it took a while for you to get the process set up correctly. Delphix takes 5 minutes to install, 5 minutes to configure, and then it's automated, with 4-click provisioning of a clone down to the second or SCN that a developer can do. The NetApp approach outlined above uses fixed custom scripts, for one database and one filer, that you have to run and that a developer doesn't have access to. It's great stuff, but I'd hate to run an enterprise on it. I'd much rather use Delphix. It's not even a matter of "rather": Delphix and database virtualization open up functionality not available otherwise.

    With Delphix

    * no need for standby databases
    * no need to stop a database
    * no limit on snapshots
    * can spin up a clone from any SCN or any second in the time window which is weeks or months (configurable)
    * all virtual databases are snapshotted, so there is no need to implement standby databases for the virtual databases
    * virtual databases can themselves be cloned, also called branching: like branching code, but for data
    * virtual databases can be rolled back
    * virtual database can be refreshed
    * have security groups for developers limiting access to source databases and cloning operations and space usage
    * can start up a virtual database on any hardware with the correct Oracle binaries
    * storage vendor agnostic, allowing one to even use solid state arrays like Pure Storage http://www.kylehailey.com/supercharged-performance-at-a-fraction-of-the-price/
    * share blocks in memory and not just on disk, for enormous performance boosts http://www.kylehailey.com/delphix/performance/
    * automatic compression, typically by 1/3

    Did you say it takes up to 200 minutes to create a clone?
    ” It takes just a few minutes (up to 200″
    That’s over 3 hours? That’s cool but it’s a far cry from 2 – 10 minutes with Delphix. Why would it take 200 minutes to create a clone? or is that a typo? In the beginning of the response you said it takes about 20 minutes.

    Your scripts are super cool and I'd be proud to manage this myself, as it's a cool use of technology, but it's rare to see people succeed at this and it shows your prowess. You are a rare breed – the DBA, storage admin and system admin all in one, and you can manage these scripts. If you had to do handoffs it would kill the time, see http://www.kylehailey.com/handoffs-delay-self-service-saves/. The scripts are not a flexible solution to run an enterprise on. They work for a specific, well-worn path without deviation. There is no error checking, and error checking and validations end up being the 80% of the work in the 80/20 rule. All changes are done through modifying shell scripts. Automation and efficiency are the future. You wondered why CERN wrote 25K lines of code to provide automated cloning on NetApp to their developers. It's the 80/20 rule. It's why you hesitate to give access to your developers. Things go wrong. Controls need to be put into place. CERN is too big to get by with small, precise scripts; they have to handle 100s of source databases and clones. Providing access to developers means error checking, security groups and access control. Check out Oracle's EM 12c cloning interface: http://www.kylehailey.com/em-12c-snap-clone/

  Leighton | #5

    Hi Kyle,
    We're using NetApp SMO for daily and weekly clones, but the daily clones usually run into errors and require manual intervention. You are right in that SMO is component heavy. It makes troubleshooting difficult. That's the main reason we're planning on using custom scripts instead of SMO. This is so we can have better control of the process and remove some of the components involved.
    Another option would be to use Snap Creator, which we're testing in a POC. This will call the underlying APIs and also removes some of the top-heavy components such as SDU. I believe the OEM 12c Snap Clone works by calling the same APIs, so it's somewhat extensible.

  khailey | #6

    The world is always innovating, and though there were ways we could do it in the past, there are always better ways to do it now and in the future.

    http://itrevolution.com/manifesto/

    More than ever, organizations need to innovate in order to win in the marketplace.
    And almost all innovation requires technology. Over 95% of capital projects involve IT,
    The need to innovate and deliver technology to help the business win is a competency for any organization.
    As Chris Little quipped, “Every company is an IT company, regardless of the business they think they’re in.”

    The typical efficiency gain for a development organization using Delphix is 2x. Yes, 200%. The least I've seen is a 20% gain. The most, at two different companies, is 500%.

    “If you’re not doing DevOps, you’re in a downward spiral”

    “The status quo is pre-ordained failure”

    http://servicevirtualization.com/profiles/blogs/gene-kim-if-you-re-not-doing-devops-you-re-in-a-downward-spiral

  #7

    * need standby database

    I don't need a standby. That is just the way I currently take the snapshot. I have taken the snapshot from production (placed in hot backup mode) and then used the disk vendor's technology to ship the deltas of that snapshot to our dev disk farm.

    * stop standby every time you snapshot

    Since this is from a standby, yes. But if you do this from production, you can leave the primary up and running if you choose.

    * snapshot once a week

    That’s just our current schedule. It can be more often if we desire.

    * you can only create clones from one snapshot a week ?

    Again, that is just our current schedule. We have no business need to do this more than once a week, but it certainly could be more frequent.

    * up to 200 minutes to create a clone (?)

    That must be a typo. It should say twenty (20) minutes. And that's actually a top-end time. It actually takes about 5 minutes to create the clone volumes. The rest of the time is spent shutting down the instance and starting it back up again with the RESETLOGS option.

    * for each new database and each new array added to the configuration you have to code new snapshot scripts and clone scripts

    Just a simple search/replace for us. Actually, the script I posted is modified from an earlier script that we had where each cloned database host, SID, etc were all parameters to the script. We just took the easy way out.

    * only create clones from the latest snapshot otherwise it’s custom intervention or script modification

    Not true. In my script I have defined a variable "snapnum=1". The '1' means the current snapshot. If you want the snapshot before that, then use '2', and so on. Again…this used to be a parameter to the script but when we changed to our standby DR environment, we took the easy way out and just made this a script variable. It's minor coding to make this a parameter again.

  khailey | #8

    Why aren’t your developers directly cloning databases?
    The reason is you don't trust your scripts. Of course you don't trust your developers, but if the scripts were trustworthy then you wouldn't have to trust your developers. If the developers tried to allocate too many clones or use up too much space, the scripts would stop them, if they worked at that level.
    You as a professional can create these scripts and run them safely. If you need to change the databases or machines involved you know how, but there is little to no error checking in the scripts and no interface for an unskilled user to make customizations.
    It's nice stuff. I used to script all my Oracle backups in shell and SQL before RMAN. It was nice, but now that there is RMAN I'd be a fool not to use RMAN, and I'd also have a hard time getting a DBA job if I didn't know how to use RMAN.
    So …
    We can have a rare expert DBA code scripts like you have impressively done and these scripts can be run only by you, or we can buy a seamless automated powerful package that does much much more than any coder could ever put together.
    So why make someone as crucial as a smart expert DBA run scripts when they could be liberated to work on really innovative work by buying a more powerful package that pays for itself and more?
    http://www.kylehailey.com/theory-of-constraints-the-dba/

