Archive

Archive for November, 2013

Theory of Constraints

November 13th, 2013

Great interview with Gene Kim, author of The Phoenix Project, discussing the Theory of Constraints and delays in project development. Here is an excerpt starting at minute 6:50:

“I’ve been trained in the theory of constraints and one of the things I think is so powerful is the notion of the constraint in the value stream.  What is so provocative about that notion is that any improvement not made at the constraint is an illusion.  If you fix something before the constraint you end up with more work piled up in front of the constraint.  If you fix something after the constraint you will always be starved for work.
In most transformations, if you look at what’s really impeding flow, the fast flow of features, from development to operations to the customer, it’s typically IT operations.”

Continue reading at http://blog.delphix.com/kyle/2014/theory-constraints/

 

Uncategorized

Oaktable World UK Dec 2 & 3

November 13th, 2013

Oaktable World UK 2013

Oaktable World continues its global tour with the next stop in Manchester, UK during the UKOUG.

Scale Abilities is proud to be sponsoring the first independent OakTable World UK event along with co-sponsors Pythian and Dbvisit.

Check out the awesome lineup


And it’s free. Just be sure to register to reserve your place.

 

Register at  http://www.scaleabilities.co.uk/oaktable-world-uk-2013/


The event is offered free of charge to all visitors to the UKOUG Tech13 Conference. Registration is required in order to gain entry, but please note that the event will be intentionally oversubscribed so that delegates can attend a mixture of OakTable World and Tech13 sessions. As a consequence, entry to specific sessions will be on a first come, first served basis, so please arrive in plenty of time for each session, even if you have registered.

The event will be held across the street from the Tech13 conference at: Premier Inn, 7-11 Lower Mosley Street, Manchester, Greater Manchester M2 3DW

Speakers include:

James Morle
Jonathan Lewis
Alex Gorbachev
Iloon Ellen-Wolff
Niall Litchfield
Doug Burns
Joel Goodman
Marcin Przepiorowski
Christian Antognini
Pete Finnigan
David Kurtz
Moans Nogood

 

 

Uncategorized

Handoffs delay, self service saves

November 13th, 2013

“Self-service is awesome; if they can get work done without opening a ticket, we’re winning.” – Kelsey Hightower via Gene Kim, author of The Phoenix Project

There is a great “aha” moment in the book, The Phoenix Project, when the hero, Bill Palmer, realizes why it takes his star IT technician, Brent, several days to accomplish a task that Brent said would take him 45 minutes! The reason it takes Brent days to accomplish the 45-minute task is that the task depends on several handoffs between different people, and each handoff eats a surprising amount of time. Why, if each of the multiple steps only takes a few minutes, does the completed task end up taking several days? The reason is the queuing time between each handoff, and handoff delays grow inordinately the busier each resource is.

Here is a chart from the book The Phoenix Project, in Chapter 23, that I like, because it makes the concept easy to understand, even if the chart is less than rigorous. The chart shows that as a resource becomes busier, the amount of time it takes to process a task grows exponentially. Say, for example, a task takes 1 hour to process when the resource is 50% busy; at 90% busy it will take 9x longer, or 9 hours. The formula behind the chart, wait time = percent busy / percent idle, is more representative of queue size than wait time, but the idea is roughly the same. There is a detailed discussion of this formula on linkedin.

[Chart from The Phoenix Project: task wait time vs. percent of time the resource is busy]
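
For readers who want to reproduce the curve, here is a minimal R sketch using the "wait time = percent busy / percent idle" rule of thumb implied by the numbers above (an illustration, not a rigorous queuing model):

busy <- seq(0.05, 0.95, by=0.05)      # fraction of time the resource is busy
wait <- busy / (1 - busy)             # wait multiplier: 50% busy -> 1x, 90% busy -> 9x
plot(busy*100, wait, type="o",
     xlab="% busy", ylab="wait time multiplier")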

 

As one of the characters in The Phoenix Project puts it, “We’re losing days at each handoff”!! That’s why handoff delays are so impactful, and if the task can be made self service those delays can be eliminated.

Now take a task such as creating a clone copy of a database. There are many steps and handoffs between teams.


Of course if the database is large it takes a long time just to do the copy of a database, but if the database copy is done with thin cloning like NetApp FlexClone, how long does it take? Many of our customers are NetApp customers and previously used or tried to use NetApp to create thin clones. Electronic Arts used NetApp FlexClone to create database copies. I asked them how long it took to create a copy and they said 2-4 days! I was like “why?!” and they said it was because they had to enter work tickets. The work ticket went to the DBA, who had to submit one to the sys admin, who had to submit one to the storage admin, and they lost time in the handoffs. He said if everyone was in the same room maybe he could get a copy in 4 hours. Four hours is still vastly different from 4 clicks of a mouse by a developer and a few minutes in the Delphix interface.

A guy from Ariba told me they use FlexClone on NetApp as well and that it took them 3 months to get a database copy because of the bureaucracy and the passing of each step from team to team!

Handoffs delay, self service saves


Uncategorized

The Thin Cloning Left Shift

November 13th, 2013

The DevOps approach to software delivery manages risk by applying change in small packages instead of big releases. By increasing release frequency, overall risk falls since more working capabilities are delivered more often. The consequence of this is that problems with your data can be amplified. And, as a result, you can squeeze the risk out of one aspect of your delivery just to introduce it in another. Thin cloning attacks that risk, enhancing and amplifying the value of DevOps by reducing the data risk inherent in your architecture.

Data Delivery

How is there risk in your architecture? Well, just because you’ve embraced Agile and DevOps doesn’t mean that your architecture can support it. For example, one customer with whom I spoke had a 3-week infrastructure plan to go along with every 2-week agile sprint because it took them that long to get their data backed up, transmitted, restored and ready for use. So, sure, the developers were a lot more efficient. But, the cost in infrastructure resources, and the corresponding Total Cost of Data was still very high for each sprint. And, if a failure occurred in data movement, the result would be catastrophic to the Agile cycle.

Data Currency and Fidelity

Another common tradeoff has to do with the hidden cost of using stale data in development. The reason this cost is hidden (at least from the developer’s viewpoint) is that it shows up as a late breakage event. For example, one customer described their data as evolving so fast that a query developed against stale data might work just fine in development but then be unable to handle several cases that appear in more recent production data. Another customer had a piece of code, tested against a subset of data, that slowed to a crawl 2 months later during production-like testing. Had they not caught it, it would have resulted in a full outage.

I contend that the impact of these types of problems is chronically underestimated because we place too much emphasis on the number of errors and not enough on their early detection. Being able to remediate errors sooner is significantly more important than being able to reduce the overall error count. Why? First, because the cost of errors rises dramatically as you proceed through a project. Second, because remediating faster means avoiding secondary and tertiary effects that can result in time wasted chasing ghost errors and root-causing things that simply would not be a problem if we fixed things faster and operated on fresher data.

Thought Experiment

To test this, I did a simple thought experiment comparing two scenarios. In both scenarios, time is measured by 20 milestones and the cost of an error rises exponentially from “10” at milestone 7 to “1000” at milestone 20. In Scenario A, I hold the number of errors constant and force remediation to occur in 10% less time. In Scenario B, I leave the time for all remediation constant and shrink the total number of errors by 10%.

Scenario A
Scenario A: Defects Held Constant; Remediation Time Reduced by 10%

Scenario B
Scenario B: Remediation Time Held Constant; Defects Reduced by 10%

In each graph, the blue curve represents the before state, and the green curve the after state. For both scenarios, in the before state, the total cost of errors was marked at $2.922M. Comparing the two graphs, the savings from shrinking the total time to remediate by 10% was $939k vs. $415k from shrinking the total number of errors by 10%. In other words, even though the graphs didn’t change much at all, the dollar value of the change was significant when time to remediate was the focus: the value of reducing the time to remediate by 10% was more than twice the value of reducing the number of defects by 10%. In this thought experiment, TIME is the factor driving the cost companies pay for quality; the sooner and faster something gets fixed, the less it costs. In other words, shifting left saves money. And it doesn’t have to be a major shift left to result in a big increase in savings.
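
Here is a rough R sketch of the thought experiment. The per-milestone error counts are an assumption made purely for illustration (the post does not give them), so the dollar figures will not match the $2.922M baseline above, but the direction of the result is the same:

milestones <- 1:20
# cost of an error: flat at 10 before milestone 7, rising exponentially to 1000 at milestone 20
cost_at <- function(m) ifelse(m < 7, 10, 10 * 100^((m - 7)/13))
errors  <- rep(100, 20)                              # assumed: the same number of errors at every milestone

baseline  <- sum(errors * cost_at(milestones))
scenarioA <- sum(errors * cost_at(milestones - 2))   # remediate ~10% (2 of 20 milestones) earlier
scenarioB <- sum(0.9 * errors * cost_at(milestones)) # 10% fewer errors, same timing

baseline - scenarioA    # savings from shifting left
baseline - scenarioB    # savings from cutting the error count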

The Promise of Thin Cloning

The power of thin cloning is that it addresses both of the key aspects of data freshness: currency and timeliness. Currency measures how stale the data is compared to the source [see Segev ICDE 90] and timeliness how old it is since its creation or update at the source [see Wang JMIS 96]. These two concepts capture the real architectural issue in most organizations. There is a single point of truth somewhere that has the best data (high timeliness). But it’s very difficult to make all of the copies of that data maintain fidelity with that source (currency), and the difficulty of doing so rises in proportion to the size of the dataset and the frequency with which the target copy needs currency. Yet it’s clear that DevOps pushes in exactly this direction.

Today, most people accept the consequences of low fidelity and lack of currency because of the benefits of a DevOps approach. That is, they accept that some code will fail because it’s not tested on full size data, that they will miss cases because data is evolving too quickly, or that they will chase down ghost errors because of old or poor data. And they accept it because the benefit of DevOps is so large.

But, with thin cloning solutions like Delphix, this issue just goes away. Large – even very large – databases can be fully refreshed in minutes. That means full size datasets with minutes-old timeliness and minutes-old currency.

So what?

Even in shops that are state of the art – with the finest minds and the best processes – the results of thin cloning can be dramatic. One very large customer struggling to close their books each quarter had a close period of over 20 days, with more than 20 major errors requiring remediation. With Delphix, that close is now 2 days, and the errors have become undetectable. Across a large swath of customers, we’re seeing an average reduction of 20-30% in the overall development cycle. With Delphix, you’re DevOps ready, prepared for short iterations, and capable of delivering a smooth data supply at much lower risk.

Shifting your quality curve left saves money. Data Quality through fresh data is key to shifting that curve left. Delphix is the engine to deliver the high quality, fresh data to the right person in a fraction of the time that it takes today.

Uncategorized

Difference between storage snapshots and data virtualization

November 12th, 2013


photos by Keith Ramsey and Automotive Rhythms

People are hearing more about Delphix and data virtualization. Data virtualization is where data copies can be made in minutes using thin cloning. Thin cloning means sharing the unmodified blocks on the file system between clones, while modified blocks are private to the clone that made the modification.

As people hear about data virtualization, the question comes up “what’s the difference between data virtualization and file system snapshots?” Comparing file system snapshots and data virtualization is like comparing an engine to a car. Creating a car from an engine takes some serious work. Creating data virtualization from snapshot technologies takes some serious work.

File system snapshots can be used to duplicate a file system. If that file system has a database on it, then a thin clone of the database can be made using the file system snapshot. The benefits of file system snapshots in the arena of database cloning, i.e. thin cloning, are clear. Thin cloning saves enormous amounts of storage, but more importantly it saves time, or in theory should save time. If thin cloning offers so much, then why is the technology so rarely used? The reason is the steep barrier to entry. It requires storage experts, specialized hardware and lots of brittle scripting and/or hands-on operations. For example, CERN, a big NetApp site, wrote over 25K lines of code to try to provide a minimal ability for a developer to thin clone a database.


vt100 internet from Tim Patterson

The analogy that comes to mind for file system snapshots and data virtualization is the comparison between the internet and the browser-accessed world wide web. The internet was around long before the web, with email, ftp, gopher, bulletin boards, etc., but hardly anyone used it until the web browser and web servers came out. When the browser came out the barrier to entry fell completely and everyone started using the internet. It’s the same with data virtualization. With data virtualization everyone is starting to use thin cloning.

Data virtualization is like the car, whereas file system snapshots are the engine. Comparing file system snapshots to data virtualization is like comparing a car engine to an actual car. Making a full car from just a car engine is a serious amount of work. Implementing enterprise database virtualization from file system snapshots is equally serious work.


Now there will be some who say “I can make a file system snapshot and then make a thin clone of a database using that snapshot, easy.” Sure, if you know how to put a database in hot backup mode, you can take a file system snapshot and then make a thin clone database from that snapshot. There is one problem. You made that snapshot on the production storage filer, on the same LUNs that the production database is using, so all activity on the clones will impact the performance of production. The whole point of creating a database copy was to protect production and avoid adding more load to it. The trick is: how do you get a copy of production onto a development storage array, away from production, so that you can make the snapshots on the development storage? Sure, you can copy the whole database across, but then what if you want to make clones tomorrow? Do you copy the whole database across again? That defeats the purpose of thin cloning.


Data virtualization takes care of syncing the storage used by the data virtualization tier with the source, which means continuously pulling in changes from the source. Data virtualization also takes care of many other things automatically, such as snapshotting the storage, cloning the storage, compressing the storage and then provisioning the thin clone databases, which means exposing the file system on the data virtualization tier to the hosts that run the thin clones. It also means renaming the clones, setting up the startup parameters and recovering the database.

 


 

Data virtualization has 3 parts

  1. Copying and syncing the source data to a data virtualization appliance (DVA)
  2. Cloning the data on the DVA
  3. Provisioning the clone to the target machine that runs the thin clone data

Each of these 3 parts requires important features.

 

1. Source data copying

 

Not only do we need to copy the source data to the data virtualization appliance (DVA), we also need to continuously pull the changes into the DVA from the source, so that virtual data can be created from the source at different points in time. Pulling in changes requires a timeflow, meaning the DVA saves a time window of changes from the source and purges changes older than the window. The time window allows the system to run continuously and reach a storage equilibrium without using up more and more storage.

 

2. The storage or DVA

 

The DVA has to be able to snapshot, clone and compress the data for efficient storage. The DVA should share data blocks not only on disk but also in memory. The DVA tier handles and orchestrates access to the data it manages, meaning it shares unmodified duplicate data blocks between all the thin clones and keeps modified blocks private to the clone that made the modification.

 

3. Provisioning

 

 

Data virtualization has to automate the provisioning of thin clones, meaning it provides a self service interface. Provisioning handles exposing the data on the DVA over NFS to the target machines that run the data. Provisioning has to automatically handle things such as renaming the databases that use the data, setting startup database parameters, and recovering and opening the database thin clone. Provisioning has to be self service, where anyone can provision clones, be they a DBA or a developer. In order to allow access to anyone, data virtualization has to handle logins, security groups and defining which groups have access to which source data, how many clones can be made, which target machines can run clone instances, what operations the user is allowed to do and how much extra storage the user is allowed to consume on the system. Data virtualization also requires functionality such as rolling back, refreshing, branching and tagging virtual data.


 

photo by zachstern

Uncategorized

The Principle of Least Storage

November 12th, 2013

We’re copying and moving redundant bits

In any application environment, we’re moving a lot of bits. We move bits to create copies of Prod for BI, Warehousing, Forensics and Production Support. We move bits to create Development, QA, and Testing environments. And, we move bits to create backups. Most of the time, most of the bits we’re moving aren’t unique, and as we’ll discover, that means we’re wasting time and resources moving data that doesn’t need to be moved.

Unique Bits and Total Bits

Radically reducing the bulk and burden of caring for all of the data in the enterprise has to start with two fundamental realizations: First, the bits we store today are often massively redundant. Second, we’ve designed systems and processes to ship this redundant data in a way that makes data consolidation difficult or impossible. Let’s look at a few examples:

Backup Redundancy

Many IT shops at major companies follow the Weekly Full, Daily Incremental model and keep 4 weeks of backups on hand for recovery. If we assume that for a data store (such as a database) the daily churn rate is 5%, then we can describe the total number of bits in the 4-week backup cycle as follows (using X as the current size of the database and ignoring annual growth):

Total Bits: 4*X + 24*5%*X = 5.20*X

But how much of that data is really unique? Again, using X as the current size of the database and ignoring annual growth:

Unique Bits: X + 27*5%*X = 2.35*X

The ratio of total to unique bits is 5.2 / 2.35, or about 2.2. That is, our backups are roughly 55% redundant at a bit level. Moreover, the key observation is that the more full backups you perform, the more redundant your data is.
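
The same arithmetic in R, in units of X (a quick sketch of the numbers above):

churn       <- 0.05                 # 5% daily churn
total_bits  <- 4 + 24*churn         # 4 weekly fulls + 24 daily incrementals = 5.2 X
unique_bits <- 1 + 27*churn         # one full copy + 27 days of changes    = 2.35 X
total_bits / unique_bits            # ~2.2, i.e. roughly 55% redundant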

Environment Redundancy

According to Oracle, the average application has 8 copies of its production database, and this number is expected to rise to 20 in the next year or two. In my experience, while backups have about a 5% daily change rate, Dev/Test/QA classes of environments have about a 2% daily change rate, and are in general 95% similar to their production parent database even when accounting for data masking and obfuscation.

If we assume an environment with 8 copies that are refreshed monthly, start out 5% divergent and churn at a rate of 2% per day (so, on average, they carry about 15 days of churn since their last refresh), then we can describe the total number of bits in these 8 environments as follows (using X as the current size of the database and ignoring annual growth):

Total Bits: 8*95%*X + 2%*15*8*X = 10*X

But how much of that data is really unique? Again, using X as the current size of the database and ignoring annual growth:

Unique Bits: X + 2%*15*8*X = 3.40*X

The ratio of total to unique bits is 10 / 3.4, or about 2.9. That is, our copies are roughly two-thirds redundant at the bit level. Moreover, the key observation is that the more copies you make, the more redundant your data is.
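
And the environment arithmetic in R, again in units of X (assuming, as above, that a monthly-refreshed copy carries about 15 days of churn on average):

copies      <- 8
total_bits  <- 0.95*copies + 0.02*15*copies   # = 10 X
unique_bits <- 1 + 0.02*15*copies             # = 3.4 X
total_bits / unique_bits                      # ~2.9, i.e. roughly two-thirds redundant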

Movement is the real redundancy

Underlying this discussion of unique bits vs. total bits is the fact that most of the time, the delta in bits between the current state of our environment and the state we need it to be in is actually very small. In fact, if we eliminate the movement of bits to make operations happen, we can reduce the total work in any operation to almost nothing. If you’re hosting not just one copy but every copy from a shared data footprint, you have a huge multiplying effect on your savings.

The power of a shared data footprint is that it makes a variety of consolidations possible. If the copy of production data is shared in the same place as the data from the backup, redundant bits can be removed. If that same data is shared with each development copy, then even more redundant bits can be removed. (In fact, we see a pattern of storing only unique bits emerging.) Finally, if we need to refresh development, we can move almost NO bits. Since every bit that we want already exists in the production copy, we just have to point to those bits and do a little renaming. And because it’s a shared footprint, we don’t have to export huge amounts of data to a distant platform; we can just present those bits (e.g., via NFS).

Consider a developer who needs to refresh his 1 TB database from his production host to his development host in concert with his 2-week agile sprints. In a world without thin clones, this means transmitting 1 TB over the network every 2 weeks. In a world with thin clones and a shared footprint, we copy 8 GB locally and don’t have to transmit anything to achieve the same result.

The better answer

Regardless of our implementation, we reach maximum efficiency when we achieve our data management operations at the lowest cost. Reducing the cost of movement is part of that, so I offer the:
Principle of Least Movement:

Move the minimum bits necessary the shortest distance possible to achieve the task.

So what?

There’s a workload attached to moving these bits around – a cost measured in bits on disk or tape, network bandwidth consumed, and hours spent. Since we’re moving a lot of redundant bits, much of that work is unnecessary. There’s money to be saved, and it isn’t a small amount of money. And that cost doesn’t end in IT. It costs the business every time a Data Warehouse can’t get the fresh data it needs so that real time decisions can be made. (Should I increase my discount now, or wait until tomorrow? Should I stock more of Item X because there is a trend that people are buying it?) It costs the business when a production problem continues for an extra 4 or 6 or 8 hours because that’s how long it takes to restore a forensic copy. In fact, in my experience, the business benefit to applications far outweighs the cost advantage, which is itself not insignificant.

Uncategorized

R data structures

November 11th, 2013


Nicer formatting at https://sites.google.com/site/oraclemonitor/r-slicing-and-dicing-data

  1. R data types
  2. Converting columns into vectors
  3. Extracting Rows and converting Rows to numeric vectors
  4. Entering data
  5. Vectorwise maximum/minimum
  6. Column Sums and Row Sums

R can do some awesome data visualizations: http://gallery.r-enthusiasts.com/thumbs.php

Instead of doing one-off data visualizations as with Excel, R can automate the process, allowing one to visualize many sets of data with the same visualizations.

Installing R is pretty easy http://scs.math.yorku.ca/index.php/R:_Getting_started_with_R

There are lots of blogs out there on getting started with R. The one thing that I didn’t find explained well was slicing and dicing data.

Let’s take some data that I want to visualize. The following data shows network throughput performance, measured by latency of communication in milliseconds (avg_ms) and throughput in MB per second (MB/s).

The parameters are the I/O message size in KB (0KB is actually 1 byte) and the number of concurrent threads sending data (threads)

IOsize ,threads ,avg_ms ,    MB/s
     0 ,      1 ,   .02 ,    .010 
     0 ,      8 ,   .04 ,    .024 
     0 ,     64 ,   .20 ,    .025 
     8 ,      1 ,   .03 ,  70.529 
     8 ,      8 ,   .04 , 150.389 
     8 ,     64 ,   .23 ,  48.604 
    32 ,      1 ,   .06 , 149.405 
    32 ,      8 ,   .07 , 321.392 
    32 ,     64 ,   .18 ,  73.652 
   128 ,      1 ,   .03 , 226.457 
   128 ,      8 ,   .01 , 557.196 
   128 ,     64 ,   .06 , 180.176
  1024 ,      1 ,   .01 , 335.587
  1024 ,      8 ,   .01 , 726.876
  1024 ,     64 ,   .02 , 714.162

If this data is in a file, it can easily be loaded and charted with R.

Find out what directory R is working in:

getwd()

go to a directory with my data and R files:

setwd("C:/Users/Kyle/R")

list files

dir()

load data into a variable

mydata <- read.csv("mydata.csv")

Simple, et voila, the data is loaded. To see the data just type the name of the variable ( the “>” is the R prompt, like “SQL>” in SQL*Plus)

> mydata
   IOsize threads avg_ms    MB.s
1       0       1   0.02   0.010
2       0       8   0.04   0.024
3       0      64   0.20   0.025
4       8       1   0.03  70.529
5       8       8   0.04 150.389
6       8      64   0.23  48.604
7      32       1   0.06 149.405
8      32       8   0.07 321.392
9      32      64   0.18  73.652
10    128       1   0.03 226.457
11    128       8   0.01 557.196
12    128      64   0.06 180.176
13   1024       1   0.01 335.587
14   1024       8   0.01 726.876
15   1024      64   0.02 714.162

Creating a chart is a breeze, just say plot(x,y) where x and y are the values you want to plot.
How do we extract an x and y from mydata?
First pick what to plot. Let’s plot average latency in ms (avg_ms) versus MB per sec (MB.s).
Here is how to extract those columns from the data

x=mydata['avg_ms']
y=mydata['MB.s']

Now plot

> plot(x,y)
Error in stripchart.default(x1, ...) : invalid plotting method

huh … what’s that Error?

If we look at x and/or y, they are actually columns from mydata and plot() wants rows (actually vectors but we’ll get there).

> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02

To transpose a column into a row we can use “t()”

> t(x)
       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
avg_ms 0.02 0.04  0.2 0.03 0.04 0.23 0.06 0.07 0.18  0.03  0.01  0.06  0.01  0.01  0.02

Now we can try plotting again:

> plot(t(x),t(y))

and voila

but let’s address the issue of transforming x and y from columns to rows and specifically into vectors.
Let’s look at the original data and then the transformed data

x=mydata['avg_ms']     #  column of data extracted from a data.frame 
tx=t(mydata['avg_ms']) #  transpose the column of data into a row

Look at the datatypes of x and t(x) using the class() function

> class(mydata)
[1] "data.frame"
> class(x)
[1] "data.frame"
> class(tx)
[1] "matrix"

the column is considered a “data.frame” and the row is considered a “matrix”.

The method of extracting a column by its column name only works for the datatype class data.frame.

If the datatype was a matrix we would be required to supply both the row and column as in  matrix[“row”,”column”]

By leaving either the row or the column empty but keeping the comma in place, it acts as a wild card.

matrix[,”column”] – gives all values in that column

matrix[“row”,] – gives all the values in that row
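
For example, a quick sketch using the mydata loaded above, coerced to a matrix:

mm = as.matrix(mydata)   # coerce the data.frame to an all-numeric matrix
mm[,"avg_ms"]            # every row of the "avg_ms" column, returned as a vector
mm[3,]                   # every column of row 3, returned as a vector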

plot() wants a vector (but it forgivingly works with rows of data as we did above).

R data types

What are these datatypes in R?
There is a simple discussion of data types at http://www.statmethods.net/input/datatypes.html

The types are basically (note: "value1:value2" generates a sequence from value1 to value2 in increments of 1):

  • integer
    • > i=1:5
      > class(i)
      [1] "integer"
      > i
      [1] 1 2 3 4 5
  • character
    • > c=letters[1:5]
      > class(c)
      [1] "character"
      > c
      [1] "a" "b" "c" "d" "e"
  • booleans (coerced to integers by the ":" operator)
    • > b=FALSE:TRUE
      > class(b)
      [1] "integer"
      > b
      [1] 0 1
  • vectors
    • > v=c(1,2,3,4,5)
      > class(v)
      [1] "numeric"
      > v
      [1] 1 2 3 4 5
  • matrix
    • > m=matrix(c(1,2,3,4,5))
      > class(m)
      [1] "matrix"
      > m
           [,1]
      [1,]    1
      [2,]    2
      [3,]    3
      [4,]    4
      [5,]    5
  • data.frames – mix numeric and character
    • > df=matrix(1:5,letters[1:5])      # matrix can't contain character and numeric
      Error in matrix(1:5, letters[1:5]) : non-numeric matrix extent
      >
      > df=data.frame(1:5,letters[1:5])  # dataframe can

      > class(df)
      [1] "data.frame"
      > df
        X1.5 letters.1.5.
      1    1            a
      2    2            b
      3    3            c
      4    4            d
      5    5            e
  • lists – like a matrix but can mix different data types together such as character, number, matrix
    •  > a = c(1,2,5.3,6,-2,4) # numeric vector
      > # generates 5 x 4 numeric matrix 
      > y=matrix(1:20, nrow=5,ncol=4)
      > # example of a list with 4 components - 
      > # a string, a numeric vector, a matrix, and a scalar 
      > w= list(name="Fred", mynumbers=a, mymatrix=y, age=5.3)
      > w
      $name
      [1] "Fred"
      
      $mynumbers
      [1]  1.0  2.0  5.3  6.0 -2.0  4.0
      
      $mymatrix
           [,1] [,2] [,3] [,4]
      [1,]    1    6   11   16
      [2,]    2    7   12   17
      [3,]    3    8   13   18
      [4,]    4    9   14   19
      [5,]    5   10   15   20
      
      $age
      [1] 5.3
    • extract the various parts of a list with  list[[“name”]], as in w[[“mymatrix”]]
  • arrays – matrices with more than 2 dimensions
  • factors

Useful functions on data types

  • dput(var) – will give the structure of var
  • class(var) – will tell the data type
  • dim(var) – will give (or set) the dimensions
  • as.matrix(data.frame) – useful for changing a data.frame into a matrix, though be careful because if there are any character values in the data frame then all entries in the matrix will be character

Sometimes R transforms data in ways I don’t predict, but the best strategy is just to force R to do what I want more explicitly.

Converting columns into vectors

When originally selecting out the columns of the data, we could have selected out vectors directly instead of selecting a column and transforming the column to a vector.
Instead of asking for the column, which gives a column, we can ask for every value in that column
by adding a “,” in front of the column name. The brackets take the equivalent of x and y coordinates, or row and column position. By adding a “,” with no value before it, we are giving a wild card to the row identifier and saying: give me the values of all rows in the column “avg_ms”.

x=mydata[,'avg_ms']
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

We can also extract the values by column position instead of column name. The “avg_ms” column is column 3:

> x=mydata[,3]
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A third way to get the vector format is using “[[ ]]” syntax

> x=mydata[[3]]
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

A fourth way is with the dataframe$column syntax

> x=mydata$avg_ms
> class(x)
[1] "numeric"
> x
 [1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

Another way, which we’ll also use below for converting a row to a vector, is the apply() and as.numeric() functions.
The apply() function can also change a column to a vector:

> x=mydata['avg_ms']
> class(x)
[1] "data.frame"
> x
   avg_ms
1    0.02
2    0.04
3    0.20
4    0.03
5    0.04
6    0.23
7    0.06
8    0.07
9    0.18
10   0.03
11   0.01
12   0.06
13   0.01
14   0.01
15   0.02
> x=apply(x,1,as.numeric)
> class(x)
[1] "numeric"
> x
[1] 0.02 0.04 0.20 0.03 0.04 0.23 0.06 0.07 0.18 0.03 0.01 0.06 0.01 0.01 0.02

These vector extractions work for columns but things are different for rows.

Extracting Rows and converting Rows to numeric vectors

The other side of the coin is extracting a row into vector format. In mydata, the rows don’t have names, so we have to use position. By specifying the row position with no following column name, all column values are given for that row.

> row=mydata[3,]
> class(row)
[1] "data.frame"
> row
  IOsize threads avg_ms  MB.s
3      0      64    0.2 0.025

The resulting data is a data.frame and not a vector (i.e. a vector is of datatype numeric).
We can use the “as.numeric” function to convert the data.frame to a vector, i.e. numeric.
The apply() function will apply the “as.numeric” function to multiple values at once. The apply() function takes 3 args:

  • input variable
  • 1=row,2=col,1:2=both
  • function to apply

see http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/

> ra=apply(row,2,as.numeric)
> class(ra)
[1] "numeric"
> ra
 IOsize threads  avg_ms    MB.s 
  0.000  64.000   0.200   0.025

The above applies  the change to all columns in the given row in a data.frame.

(apply can also be used, for example, to change all 0s to NAs; NA is used rather than NULL because returning NULL from the function would collapse the result into a list instead of a matrix

new_matrix = apply(matrix,1:2,function(x) if (x==0) NA else x)

see http://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega)

For selecting the row out directly as a vector, the as.matrix() function can also be used

> row=as.matrix(mydata)[3,]
> class(row)
[1] "numeric"
> row
 IOsize threads  avg_ms    MB.s 
  0.000  64.000   0.200   0.025

yet another way

> row=c(t(mydata[3,]))
> class(row)
[1] "numeric"
> row
[1]  0.000 64.000  0.200  0.025

( see http://stackoverflow.com/questions/2545228/converting-a-dataframe-to-a-vector-by-rows)

or yet another way

> row=unlist(mydata[3,])
> class(row)
[1] "numeric"
> row
 IOsize threads  avg_ms    MB.s 
  0.000  64.000   0.200   0.025

Filtering Data

The data in the CSV file actually represents throughput not only at different I/O send sizes but also for different numbers of concurrent senders. What if I wanted to just plot the throughput by I/O send size for tests with one thread? How would I filter the data?

IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
plot(IOsize,MBs)

 

How about plotting the throughput by I/O size for each thread-count test?
The parameter type="o" makes the plot a line plot (with the points overplotted on the line).

#extract data
IOsize=subset(mydata[,'IOsize'],mydata['threads'] == 1 )
MBs_1=subset(mydata[,'MB.s'],mydata['threads'] == 1 )
MBs_8=subset(mydata[,'MB.s'],mydata['threads'] == 8 )
MBs_64=subset(mydata[,'MB.s'],mydata['threads'] == 64 )
# create graph
plot(IOsize,MBs_64,type="o")
# plot other lines
lines(IOsize,MBs_1,lty=2,col="green",type="o")
lines(IOsize,MBs_8,lty=3,col="red",type="o")

# add a legend
legend(1,700,c("1 thread","8 threads","64 threads"), cex=0.8, 
   col=c("green","red","black"), lty=c(2,3,1));

 

 

 

Entering data

Instead of entering data via a CSV file it can be entered directly into R

> m=matrix(c(
     0 ,      1 ,  .02 ,    .010 ,
     0 ,      8 ,  .04 ,    .024 ,
     0 ,     64 ,  .20 ,    .025 ,
     8 ,      1 ,  .03 ,  70.529 ,
     8 ,      8 ,  .04 , 150.389 ,
     8 ,     64 ,  .23 ,  48.604 ,
    32 ,      1 ,  .06 , 149.405 ,
    32 ,      8 ,  .07 , 321.392 ,
    32 ,     64 ,  .18 ,  73.652 ,
   128 ,      1 ,  .03 , 226.457 ,
   128 ,      8 ,  .01 , 557.196 ,
   128 ,     64 ,  .06 , 180.176 ,
  1024 ,      1 ,  .01 , 335.587 ,
  1024 ,      8 ,  .01 , 726.876 ,
  1024 ,     64 ,  .02 , 714.162 ),
nrow=4,ncol=15,
dimnames=list(rows=c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
rows      [,1]  [,2]   [,3]   [,4]    [,5]   [,6]    [,7]    [,8]   [,9]   [,10]   [,11]   [,12]    [,13]    [,14]    [,15]
  IOsize  0.00 0.000  0.000  8.000   8.000  8.000  32.000  32.000 32.000 128.000 128.000 128.000 1024.000 1024.000 1024.000
  threads 1.00 8.000 64.000  1.000   8.000 64.000   1.000   8.000 64.000   1.000   8.000  64.000    1.000    8.000   64.000
  avg_ms  0.02 0.040  0.200  0.030   0.040  0.230   0.060   0.070  0.180   0.030   0.010   0.060    0.010    0.010    0.020
  MB/s    0.01 0.024  0.025 70.529 150.389 48.604 149.405 321.392 73.652 226.457 557.196 180.176  335.587  726.876  714.162
> t(m)
        IOsize threads avg_ms    MB/s
   [1,]      0       1   0.02   0.010
   [2,]      0       8   0.04   0.024
   [3,]      0      64   0.20   0.025
   [4,]      8       1   0.03  70.529
   [5,]      8       8   0.04 150.389
   [6,]      8      64   0.23  48.604
   [7,]     32       1   0.06 149.405
   [8,]     32       8   0.07 321.392
   [9,]     32      64   0.18  73.652
  [10,]    128       1   0.03 226.457
  [11,]    128       8   0.01 557.196
  [12,]    128      64   0.06 180.176
  [13,]   1024       1   0.01 335.587
  [14,]   1024       8   0.01 726.876
  [15,]   1024      64   0.02 714.162

The bizarre thing about this is that nrow=4 corresponds to the number of fields per observation and the matrix comes out transposed, because matrix() fills column by column by default. Using t() can re-transpose it, but this is all confusing.
To make it more intuitive, add the argument byrow=TRUE and put NULL in the rowname position of the dimnames list:

m=matrix(c(
     0 ,      1 ,  .02 ,    .010 ,
     0 ,      8 ,  .04 ,    .024 ,
     0 ,     64 ,  .20 ,    .025 ,
     8 ,      1 ,  .03 ,  70.529 ,
     8 ,      8 ,  .04 , 150.389 ,
     8 ,     64 ,  .23 ,  48.604 ,
    32 ,      1 ,  .06 , 149.405 ,
    32 ,      8 ,  .07 , 321.392 ,
    32 ,     64 ,  .18 ,  73.652 ,
   128 ,      1 ,  .03 , 226.457 ,
   128 ,      8 ,  .01 , 557.196 ,
   128 ,     64 ,  .06 , 180.176 ,
  1024 ,      1 ,  .01 , 335.587 ,
  1024 ,      8 ,  .01 , 726.876 ,
  1024 ,     64 ,  .02 , 714.162 ),
nrow=15,ncol=4,byrow=TRUE,
dimnames=list(NULL,c( 'IOsize' ,'threads' ,'avg_ms' , 'MB/s'
)))
> m
     IOsize threads avg_ms    MB/s
 [1,]      0       1   0.02   0.010
 [2,]      0       8   0.04   0.024
 [3,]      0      64   0.20   0.025
 [4,]      8       1   0.03  70.529
 [5,]      8       8   0.04 150.389
 [6,]      8      64   0.23  48.604
 [7,]     32       1   0.06 149.405
 [8,]     32       8   0.07 321.392
 [9,]     32      64   0.18  73.652
[10,]    128       1   0.03 226.457
[11,]    128       8   0.01 557.196
[12,]    128      64   0.06 180.176
[13,]   1024       1   0.01 335.587
[14,]   1024       8   0.01 726.876
[15,]   1024      64   0.02 714.162

Vectorwise maximum/minimum

Another issue is trying to get the max or min of two or more vectors on a point by point basis.
Using the “min()” function gives a single minimum and not a minimum on a point by point basis.
Use “pmax()” and “pmin()” to get the point by point max and min of two or more vectors (in the example below, min is a vector variable, not the base function):
> lat
[1]  44.370  22.558  37.708  73.070 131.950
> std
[1]  37.7  21.6  67.1 136.1 186.0
> min
[1] 0.0 0.6 0.6 1.0 1.0
> pmax(lat-std,min)
[1] 6.670 0.958 0.600 1.000 1.000

 

Column Sums and Row Sums

To sum up rows or columns, use “rowSums()” and “colSums()”.

http://stat.ethz.ch/R-manual/R-patched/library/base/html/colSums.html
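
For example, with the mydata loaded earlier (just to show the syntax):

colSums(mydata[c('avg_ms','MB.s')])   # per-column totals of the two numeric columns
rowSums(mydata[c('avg_ms','MB.s')])   # per-row totals across those columns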

For more info

For more info on data types and manipulation see: http://cran.r-project.org/doc/manuals/R-intro.html

 

 

Uncategorized

The Inferior Subset

November 9th, 2013

Why Subsets qualify as an inferior good

Why are you sub-setting your data? Even with the cost of spinning disk falling by half every 18 months or so, and the cost and power of flash rapidly catching up, several large customers I’ve encountered in the last three years are investing in large scale or pervasive programs to force their non-prod environments to subset data as a way to save storage space.

However, there are also several trade-offs with sub-setting and potential issues it can create, including:

* The performance of code under small sets of test data can be radically different than results on full sets of data.
* The creation of the subset can be CPU and Memory intensive, and may need to be repeated often.
* The process to create consistent subsets can be arduous, iterative, and error prone. In particular, corner cases are often missed, and creating subsets that maintain referential integrity can be quite difficult.

It’s difficult to get 50% of the data with 100% of the skew; instead you tend to get 50% of the data with 50% of the skew. Without the proper skew, QA could miss important cases and SQL optimization could come out completely wrong, not to mention that SQL queries could hit zero rows instead of thousands.

Why thin cloning makes subsets an inferior good

As we’ve discussed in other blogs, a thin cloning solution such as Delphix causes the total cost of data to fall dramatically, and this increases a CIO’s purchasing power (in the context of data) – allowing much more data to be readily available at a much lower price point. The dramatic result we observe is that people are abandoning subsets in droves. In fact, as the price of data has fallen with the implementation of Delphix, the desire for subsets is being replaced by a desire for full size datasets. Certainly, customers will still want subsets for reasons such as limiting data to a specific business line or team, as a security measure, or as a way to utilize fewer CPU and memory resources. But it is also clear that the reduction in the total cost of data has resulted in customers switching to full size datasets to avoid performance-related late breakage and to avoid the cost of subset creation. Beyond this, it’s causing them to rethink their investment in a sub-setting apparatus altogether.

At Delphix, the data we see from customers bears this out. Subsets cost a lot to make, and with the storage savings gone, they are simply inferior to full size sets for many applications. With the elimination of storage savings as the primary reason to subset (thanks to thin clones), the inferiority of the subset is quickly being realized.

Uncategorized

Data glut is the problem. Data agility is the solution

November 8th, 2013

 

Data Agility feels like


photo by jasephotos

Data glut feels like


photo by Christophe Pfeilstücker

There is a problem in the IT industry: the problem of data. More precisely, the problem is a lack of data agility. Data agility means getting the right data to the right people at the right time. Data is the lifeblood of the applications that companies depend on to generate revenue. Data has to be pumped across project environments from production to development to QA to UAT.

A typical production database incurs a triple data copying tax. Data, i.e. copies of that production database, has to be pumped off to

  1. Reporting and analytics databases
  2. Development, QA and UAT environments
  3. Backup


The backup portion alone is enough to saturate IT infrastructure on weekends or during nightly jobs. When nightly and/or weekend job windows are not an option at global corporations running 24×7, when and how can data be pumped across the IT infrastructure?

Only with this lifeblood can applications run and development projects move forward. Without the lifeblood of data, application productivity falls off and development projects start failing.

A lot of people think they understand the data problem, but let’s go through what we’ve seen with customers in the industry:

 

  • 96% of QA cycle time spent waiting on data for QA environments
  • 95% data storage spent on duplicate data
  • 90% of lost developer time is due to waiting for data in development environments
  • 50% of DBA time spent making database copies
  • 20% of production bugs slipping in because of using subsets in development and QA

 

The above metrics come from companies who took the time to quantify these measurements. Do you have metrics to show your operational expenses and results? If you can’t measure it, you can’t manage it.

You can’t manage what you don’t measure.

Questions:

  • How many copies of databases are in your IT department?
    • Which groups have copies?
    • How long does it take to provision a copy of a database?
    • How often does development need copies of database provisioned?
  • How much of your storage is due to duplicate database copies?
  • How long does it take QA to build an environment?
    • Are the QA suites destructive?
    • Do QA environments have to be rebuilt after each QA test cycle?
    • Are these environments rolled back? refreshed?

Can you answer these basic metrics?

We are seeing the following issues eating away at the revenue of companies we are talking to:

  • Delays
    • Slow environment build times cause project delays
    • Sharing copies causes programming bottlenecks, delaying coders
    • QA can’t run multiple tests in parallel, delaying QA
  • Bugs
    • Subset databases in dev and QA allow  bugs to slip into production
    • Slow QA environment builds allow more dependent code to be written on top of bugs before the bugs are found
    • Lack of QA environments and QA testing allows more bugs into production
  • Costs
    • Cost of storage ownership: storage is cheap, but managing storage is expensive.
    • DBA team time: DBA teams spend 50% of their time building database copies.
    • Higher development costs due to lost developer days as developers wait for development environments

CIO magazine recently surveyed CIOs and found that on average CIOs had 46 projects, of which 28 were behind schedule and/or over budget. That’s 60% of projects behind schedule and/or over budget, and of those, 85% were delayed because of data and environment provisioning delays.

“Database environments are constantly getting bigger and bigger, and they’re increasingly the bottleneck,” said Tim Campos, CIO of Facebook. “If you have multiple projects going on, you have multiple copies of an environment you need to maintain. What ends up happening is you get an absolute sprawl in the database environment. Scaling that is particularly expensive for IT organizations. We have high expectations about how quickly new initiatives are rolled out and needed new technology to facilitate rollouts and support more projects simultaneously. With database virtualization, we get better use of hardware and our people. It can accelerate our enterprise application development projects significantly in the same time frame.”

Delphix reduces delays, bugs and costs. That’s why global industry leaders in all domains are turning to Delphix to accelerate their application development, transform their QA processes and eliminate thousands of hours of DBA, sysadmin, storage admin and management time that were previously required to deploy database copies, a job that can now be done in minutes.

With Delphix

  • Companies have doubled or more development team output
  • QA has gone from 4% efficiency to 99% efficiency
  • DBAs have gone from 8000 hours/year of database copying to 8 hours

Delphix accelerates application releases, driving revenue growth while driving costs down.

“The most powerful thing that an organization can do is to enable development and testing to get environments when they need them”

– Gene Kim, author of the Phoenix Project

 

gif by Steve Karam

Uncategorized

Production Possibility Frontier

November 8th, 2013

By: Woody Evans

“The most powerful thing that an organization can do is to enable development and testing to get environments when they need them”

Gene Kim, author of the Phoenix Project

App Features vs. Data Work

The power of a technology change, especially a disruptive technology shift, is that it creates opportunities to increase efficiency. The downside is that companies take a long time to realize that someone has moved their cheese. Data virtualization, i.e. automated thin cloning of databases, VM images, app stacks, etc., alters the production possibility frontier dramatically, provided customers can get past the belief that their IT is already optimized.

An Ideal Frontier

An idealized Production Possibility Frontier describing the tradeoff between Application Features and Data Related Work might look like the following, where an engineering team of developers and IT personnel can smoothly shift its focus between producing feature work and data related work.

[Chart: an idealized Production Possibility Frontier trading off Application Features against Data Related Work]

Companies on the blue line are able to shift efficiently between data related work and application feature work in their IT projects. That self-assessed efficiency, however, can become a barrier to adoption when a technology shift occurs – especially if you believe that certain tradeoffs are already optimized.

Suppose a developer needs to execute a refresh as part of a testing cycle. In this idealized world, they may be able to refresh their database in 2 hours, or they may be able to get the same effect by spending 2 hours writing a piece of throwaway rollback code. Either way, that developer trades off 2 hours that they would otherwise spend writing new application features in order to accomplish the refresh.

The Thin Cloning Technology Shift

Using a broad brush, we can classify much of the time and effort of application projects as Data Related:
• Waiting for data to arrive at a certain place
• Performing extra work because of a lack of the right data
• Trying to keep data in sync for all of the various purposes and work streams that an application endeavors to complete.

The research we have shows that 85% of the work in application delivery is really data related. The technology shift brought about by thin cloning, and in particular Delphix technology, pushes the Production Possibility Frontier and dramatically reshapes IT’s understanding of efficiency.

Because application feature work is dependent on data related work such as setting up development environments, creating builds and building QA environments, the application feature work will be constrained by the efficiency of the data and IT work. If we make the data and IT work much more efficient, then we accomplish more data work and thus more feature work.

[Chart: the expanded Production Possibility Frontier after the thin cloning technology shift]

This new production possibility frontier dwarfs the initial one. In fact, the massive size of the shift contributes greatly to IT’s resistance, because to unlock this value IT has to change what it believes to be its already optimized processes.

But, the proof is out there. And, at Delphix, we’re gathering powerful proof points every day demonstrating how customers are creating powerful efficiencies.

Waiting for data to arrive is affecting customers today. One Delphix customer was spending 96% of their testing cycle time waiting for data to refresh. That meant only 4% of their testing time frame was used to actually test the product, shifting error detection to the right, where it is more expensive. Using Delphix to refresh, they now spend less than 1% of their time waiting for refreshes. That is a 99:4 improvement, better than 20:1!

We started this post with the example of an ideal situation where a developer could choose to refresh a database in 2 hours or spend 2 hours writing a piece of throwaway rollback code. But the reality is that more often it’s a 10-hour wait to get your copy of the 3 TB database (if you can get the DBA’s attention). There’s a lot of code being written out there because we’ve accepted the “optimized” way of doing things – where we accept that we can’t get fresh data, so we write our own workaround. This kind of wasted effort just evaporates with Delphix.

And if you’re thinking that this is a small scale problem, think about all of the ETL and Master Data Management applications out there where developers spend endless hours writing code, and business users do the same configuring apps, so that data can be properly synchronized. If you had immediate access to data that was already being synchronized in near real time, all of that work would just go away.

What IT isn’t considering and CIOs should

Disruptive technology is exactly that. It uncovers an opportunity for efficiency that you don’t already see. So whatever was optimal before simply isn’t now. In fact, if you don’t challenge the current optimization, you’ll likely never reap the benefits of the disruptive technology. All the same, overcoming the resistance to the idea that a new optimization is possible, and to the idea that change can be revolutionary rather than just evolutionary, just isn’t in the DNA of war-weary, battle-hardened DBAs and developers. CIOs need to consider and understand the powerful imagery of the Production Possibility Frontier for application development using thin cloning.

Thin cloning is such a powerful shift that IT shops will often shake their heads in disbelief. CIOs need to see through that and understand that data virtualization with thin cloning is a seismic shift. 10 years ago no one knew what VMware was. Now, you can’t walk into a data center without it. 10 years from now the idea of having physical data instead of thin clones will be laughable. Careers are about to be made on data virtualization, and Delphix is the tool to which you should hitch your star.

Uncategorized