It’s Not About the Snapshot

I hear this from time to time:

“Can’t I create a database copy in seconds with a file system snapshot?”

First let’s take a step back. There is a huge need to clone databases quickly, easily and with as little resource as possible for development, QA, UAT, debugging, reporting and backup yet hardly any of the industry uses file system snapshots for cloning. Cloning with file system snapshots has been a possibility for almost 20 years.    Why, if there is a solution to a huge problem, is no one using it? Because it requires specialized hardware, storage system experts coordinating with DBAs and system admins and a lot of scripting. (Unless using Delphix)

It’s a bit like saying I can make a copy of the database using the UNIX “cp” command.  Any good DBA knows using “cp” to backup a database is  naive. Why? Because either I have to shut the database down or I have to put the database in hot backup mode and then I have to be able to get a list of all the datafiles, control files and redo files. The cloning process must also perform prerequisite tasks like producing startup parameter files, database definition files, or other pre-creation tasks for the target database to function. For example, Oracle requires control files, password files, pfile/spfiles, and other pre-created components before the target can be opened for use. Controlfiles, logfiles, and datafiles together constitute a database that can be opened and read by the appropriate DBMS version. Most of the required task which will be done automatically with RMAN. RMAN will also do many things that only RMAN can do such as skip empty blocks, run incremental backups instead of having to copy the entire database  and do corrupt block checking at backup time. Similarly cloning databases from file system snapshots has similar complexities and limitations.

Can you create a thin clone database for almost no storage from a file system snapshot?

Yes! Definitely.

Can you provision a database copy in seconds with a file systems snapshot?

No :(

Longer answer is you might be able to do it in minutes if you were an expert system admin, storage admin and Oracle DBA,  you didn’t mind having some serious drawback to your clone and impact on production, you had all the resources you needed and  you had everything aligned perfectly. (see comments on previous blog post for a start on how you can do this yourself)

Lets get into the real world of actually using storage system snapshots. Talking to one of the largest video game makers, a Fortune 500 company and one of the top  software companies in the world  about their database cloning, I asked them “when you use Netapp storage snapshots, how long does it take you to create a database clone from the snapshots?” I expected the answer to be in the minutes and was shocked to hear that it took them respond with 2-4 days.  My response was “what?! 2-4 days? why?”  Their response was that if everybody required was in the same room working together maybe they could do it in 2-4 hours but because each stage of the process was handled by a different person or group such as the storage admin, system admin, DBA etc that the total process ended up taking days.

Taking 2-4 days to provision a database from a file system snapshot is the good case. I asked another subsidiary of one of the worlds largest software companies how long it took them to create a database clone with file system snapshots and their response was 3 months! The reason give was the amount of bureaucracy that one had to go through to the clones.

Why does it take so long to clone databases with file system snapshots? There are two reasons

  • bureaucracy
  • technical challenges

Bureaucracy

Depending on your company the more or less bureaucratic steps you will have to get a thin clone database allocated. If you are a DBA, Storage, Systems guru at a small company that might be you, and if so bravo, you can do it pretty  quick. On the other hand if you wear all those hats, you are probably a “Brent” as in Brent in The Phoenix Project who is the crucial person in IT and most critical IT processes grind to a halt because they depend on him and he is super busy.

Screen Shot 2014-05-23 at 4.01.39 PM

 

Why does it take so long to pass tasks between people and  groups? Because a task that might take an hour when someone is completely free and idle will take multiple days as that person starts to be 95% busy or more. See the following chart from the book The Phoenix Project:

Screen Shot 2014-05-23 at 4.02.15 PM

 

Technical Challenges

Screen Shot 2013-11-11 at 8.51.06 PM

 

The easiest way to create a clone is to snapshot the production storage. To snapshot the production storage, either shutdown the source database, take a snapshot or more  likely put all the tablespaces in hot backup mode, take a snapshot, and then take all of the tablespace out of hot backup mode. If the database spans more than one LUN it may take special storage array options to snapshot all the LUNs at the same point in time. Once the all the database LUNs are snapshot, then you can use the snapshots to create a “thin clone” of the production database on the same storage as production. If you are using EMC then most likely this is going to hurt write performance as snapshots on EMC typically use copy on write, which entails a read and two writes to manage the file system snapshot. Also on EMC one is typically limited to 16 snapshots and going over that limit incurs a full copy of the LUNs upon snapshot. On EMC typically snapshots can not themselves be snapshot, preventing branching of clones. On Netapp there is no write performance penalty,  snapshots can be taken of snapshots  and the limit on number of snapshots is 255.  The question is why limit oneself to 255 snapshots when they can be made unlimited with ZFS or with Delphix DxFS.

Problem with this scenario no matter what storage you use is that the clone is doing I/O on the same LUNs as production.  The whole point of cloning production is to protect production but in this case the clone’s I/O will be hurting production. Ooops


Screen Shot 2013-11-11 at 8.51.31 PM

What we want to do is somehow get a copy of production onto some non-production storage where we can snapshot it. This means making a full physical copy of production onto a “development filer.” Once a copy has been made we can make clones by snapshoting the copy. These snapshots then require configuration to make them available to target machines either  over fiber channel or mounting them over NFS and then recovering the database on the target machines.

Problem with this scenario is that what if tomorrow we want a clone of production as it is that day? Currently we only have the copy from yesterday, thus we have to copy across the whole copy of production onto the “development filer.” Continually copying the source each time we need a clone at a different point in time defeats the purpose of creating thin clones in the first place.

What can we do? Netapp offers a feature called Snapshot Manager for Oracle (SMO) that is suppose to copy across file system snapshots from production filer to the development filer.  One problem is it’s complex and difficult to debug problems

from http://media.netapp.com/documents/tr-3761.pdf 

Screen Shot 2013-06-05 at 8.11.09 AM

What alternatives do we have?

Well for EMC one could use SRDF if both production filer and development filer were Symmetrix or VMAX Symmetrix filers. SRDF will play the changes from production on to development, but development will still be limited to 16 snapshots and surpassing that limit results in a full copy. We have customers running up to 400 virtual databases on a single set of storage at Delphix and we have customers taking snapshots as fast as every 5 minutes.  That wouldn’t work very well on EMC nor Netapp.

For ZFS one could use Oracle data guard to relay changes from production onto the development array. It makes sense in theory but haven’t seen anyone users clearly document actually usage.  The best document I’ve seen on it is an old Oracle white paper but I’ve yet to find a company doing this.

Summary

All of the technical solutions require specialized expertise, manual intervention and scripting. Scripting for oneself can be relatively straight forward, but making those scripts production ready and runnable by anyone is another matter. As anyone who has written scripts or software to be used by someone else, it is orders of magnitude harder.

Ways to do it yourself

 Automation and Integration

 

Snapshots are the fuel and Delphix is the Ferrari.

Thin cloning obstacles are solved using Delphix.  Delphix consists of 3 technologies. The first technology continuously collects all the changes from a data source and writes them to any storage. The second technology manages the storage implementing snapshot capabilities, saving all changes in a time window and purging data older than the time window that is no longer needed. The third technology harnesses the file system snapshots and the time window to provision database thin clones to target machines either over NFS.  All of this technology is rolled into a software stack that can run on commodity hardware and map a filesystem onto any storage.

Delphix  takes care of the syncing  with the source database which means continuously pulling in changes from the source database.  Delphix  takes care of many other things automatically such as snapshotting the storage, cloning the storage, compressing the storage and then provisioning the thin clone databases which means exposing the file system on the data virtualization tier to the hosts that run the thin clones such as renaming the clones, setting up the startup parameters and recovering the database.

Screen Shot 2013-11-11 at 7.38.02 PM

Delphix technology  has 3 core parts

  1. Source sync
    • initial full copy
    • forever incremental change collection
    • rolling window of save changes with older replace data purged
  2. DxFS storage on Delphix
    • storage agnostic – maps snapshot file system onto any storage
    • compression – typically compress data by 1/3
    • memory sharing of data blocks (only technology AFAIK to do this)
  3. VDB provisioning and management
    • self service interface
    • rolls, security, quotas, access control
    • branching, refresh, rollback

Each of these 3 parts has important features.

1. Source database copying

Delphix not only copies the source database at initial setup time but we also continuously pulls in the changes to  from the source database such that one can create virtual databases from the source at different points in time. Pulling in  changes  requires a time flow meaning the Delphix saves a time window of changes from the source and purge changes older than the time window. The time window allows the system to continuously  run and reach a storage equilibrium without using up more and more storage.

2. The storage 

Delphix maps their own file system onto any storage and  manages  snapshots, clones and compression of the data for efficient storage. Delphix not only shares data blocks on disk but also in memory.   Delphix handles and orchestrates access to the data it manages meaning sharing un-mondified duplicate datablocks between all the thin clones and keeping modified blocks private to the clone that made the modification.

3. Provisioning

Delphix  automates the provisioning of thin clones meaning providing a self service interface. Provisioning handles exposing the datafiles  over NFS to the target machines that run the database. Provisioning  automatically handles renaming the database, setting startup parameters, recovering and opening the thin clone.  Provisioning is self service where anyone can provision clones be they a DBA or a developer. In order to allow access to anyone Delphix handles logins, security groups and defining which groups have access to which source databases, how many clones can be made, what target machines can run clone instances, what operations the user is allowed to do and how much extra storage the user is allow to incur on the system. Delphix also provides functionality such as rolling back, refreshing, branching and tagging virtual databases.

 

With all 3 core functionality integrated and automated Delphix provides key features

  • Database Virtualization – thin clone support for major databases such as Oracle, SQL Server, PostGres
  • Application Virtualization – support thin cloning application stacks
  • Self Service – interface easily usable by application developers, QA staff and BI teams to provision their own clones on demand.
  • Branching  – support branching clones meaning making clones of clones which crucial for supporting multi-version development or even just being able  to re-provision previous application development stack for patching while at the same time working on the current version
  • Synchronization – support cloning multiple related database such that each clone is cloned from the exact same point in time. Any Fortune 500 that has multiple revenue tracking databases will need to synchronize a clones of each for analyzing financial close discrepancies.
  • Cloud Ready – supports any storage and efficient low bandwidth replication across heterogenous storage types. Installs on any commodity hardware.
  • Any Time – make clones at any time down to the second with the timeflow window
  • Live Archive – save specific points in the timeflow for ever to support  compliance and auditing

 

Example

 

The following video shows how to create a full RAC clone of a RAC source database in 5 minutes using Delphix. To see how Dephix links to the source database see the second video below.

Here is the RAC Delphix setup video, which similarly was finished in 5 minutes

 


Uncategorized

  1. Trackbacks

  1. Comments

  2. Niko
    | #1

    The initial question on itself does not mention or require that the source database must be up and running during this process, or cannot be in semi-available status (offline files/tablespaces). If the database can be downed, the solution doesn’t even involve a third party tool. No ?

  3. khailey
    | #2

    Hi Niko,

    If the original database is down, then it eliminates the need to put the database in hot backup mode. If the database is up then one can use hot backup mode when copying the data files. Really thats more or less tangential to the discussion of how does one automate making thin clones of a source database. Without automation we are stuck with slow bureaucracy whether or not we use thin clones or not. Without thin clones there is no way to automate the process efficiently. Question is how do we combine thin cloning technology with automation. The only end to end package solution is Delphix. You can try to script automation yourself and the above discussion goes into some of the steps you’d have to address. Of course if one does it in house one really has to compare the internal cost benefit to the price of buying an automated solution. The price of the automated solution Delphix has been far eclipsed by the benefits it brings.
    In the case of writing something in house or cobling together different peices available in the industry one has to consider what one colleague calls some of these in house custom automations “Self service Frankensteins”:
    * If only one programmer- single point of failure
    * If multiple programmers – how expensive was it to build? vs buy?
    * Time to Market (TTM) – how long did it take/is it taking to build?
    * How much is each delayed day costing in lost productivity?
    * Functionality- how rich and efficient is the interface?
    * maintenance – lack of formal support plan and established SLA’s
    * Longevity – no product/feature road map, no long-term development plans
    * Agility – Business demands outpacing feature release
    * Adoption – viewed as “pet project” vs. critical business service,
    * Value – lack of proper req. gathering and needs analysis results in missing features and some nobody wants
    * Stability – Lack of full proper testing results in errors uncovered in production/failure of product.
    All of these things, plus more, will keep your Frankenstein from graduating from a science project to a First-rate Enterprise Business service, replete with all the trappings that entails (Dedicated Development, O&M budget, SLA’s, etc).

  4. Robert
    | #3

    Oracle Multitenant seems to be missing in this discussion….
    Additionally, the open/close state of a database does not address the need to provide *quick* cloning…It only addresses some questions that determine how such cloning can be done.

  5. khailey
    | #4

    Hi Robert, Thanks for your feedback. Yes, I didn’t address multi-tenant directly here but the discussion is pretty much the same. He are a couple of blog posts on multi-tenant :
    http://datavirtualizer.com/delphix-and-oracle-em-12c-partnership/
    http://datavirtualizer.com/12c-pdbs-delphix-like-chocolate-peanut-butter/
    Not sure I follow on the open and close state. Yes, it’s all about speed. Many of our customers who bought Delphix originally for storage savings later said they didn’t really care about the storage savings – what they cared about and what helped their company the most was the speed of cloning.


8 + = fifteen