Archive for October, 2013


October 31st, 2013

 burning_rubber_latemodelresto copy

photo by Latemodel Restoration Supply

Oracle 10 added the awesome procedure dbms_xplan.display_cursor but unfortunately the documentation of the package is a bit lacking and the options and output can be confusing, so here are few clarifications.

The procedure display_cursor gives the *real* execution plan instead of an estimated execution plan which is a huge relief after “explain plan”, “set autotrace on” and  “dbms_xplan.display” which all output the expected execution plan. Of the the three, “set autotrace on” is the most peculiar as we can actually run the query and yet the output is the expected plan not the actually executed plan. Pheww – what a relief with dbms_xplan.display_cursor.

But, hold on to the seat of your pants, because that’s not all folks. The coolest thing is that display_cursor will output the actual execution statistics for each row source line in the real execution plan. Now that’s cool. But to access these statistics, one has to enable the collection of the statistics.

To enable the collection of row source statistics, run a query in SQL*Plus and include the hint

 /*+ gather_plan_statistics */

then,  immediately following the query execution run:

 select * from table(dbms_xplan.display_cursor(null,null,’ALLSTATS’));

Then you will see the following columns in the output (minus the color coding)

A few confusing things about the output. Some of the columns are estimated statistics from the optimizer. Some of the columns are actual statistics from executing the query. And some of the columns are not per row statistics but statistics that include the statistics of the child rows.

The columns E-Rows and A-Rows can be used for Tuning by Cardinality Feedback (TCF). TCF refers to the process of comparing “E-Row” and “A-Rows” to determine whether the optimizer is optimizing correctly. The two columns “E-Rows” and “A-Rows” which represent estimated rows and actual rows can be compared to highlight how close or how far off the optimizers predictions where. If the optimizers predictions are far of, it’s an indication, though not a proof, that the query has been inadequately optimized. To get the optimizer to  create an optimal execution path, one can look for ways of bringing the optimizers estimated statistics in line with the actual statistics. See Tuning by Cardinality feedback by Wolfgang Breitling.

In v$sql_plan_statistics  these “should” be equal

  • cardinality ~= output_rows/starts
  • starts*cardinality ~= output_rows

And in the output above

  • E-Rows=cardinality
  • A-Rows=output_rows

Thus in order to compare E-Rows to A-Rows, we have to multiply E-Rows by starts.

Other display_cursor 3rd argument options

There are other options besides specifying “ALLSTATS” in the display_cursor. Here are a list of the options and the fields they show:

By default statistics are total for all executions of the cursor but “LAST” can be appended to the 3rd argument to get the stats for the last execution. NOTE:  the last four columns on memory and temp space always show total or last as indicated above. Here is an example of using “LAST”:

select * from table ( dbms_xplan.display_cursor (null,null,’MEMSTATS LAST‘));

Notice that for memory consumption, some of the stats are bytes and some are kilo-bytes.

The above tables are images. The following table is in text for searching and copy/paste.

Arguments giving only optimizer estimates

Rows Bytes TempSpc  Cost Time
 null    *  *  *  *
TYPICAL  *  *  *  *  *
SERIAL  *  *  *  *  *
ALL  *  *  *  *
ADVANCED  *  *  *  *

 Arguments that give actual row source executions statistics (the red asterisks are estimates)

(the following two tables show the same arguments, but are broken in two to fit the width)

E-Rows starts A-Rows Buffers Reads Writes A-Time
MEMSTATS  *  *  *  *
ALLSTATS  *  *  *  *  *  *  *
IOSTATS  *  *  *  *  *  *   *


0Mem 1MEM 0/1/M Used-Mem Used-Tmp Max-Tmp

 Arguments for special cases

TQ IN-OUT PQ Distrib pstart pstop Instance

List of arguments for non-statistical output

predicate info note Query Block Name Column projection info outline bind vars
null X X


Enabling extended rows source execution statistics gathering

There are three ways to gather row source executions stats:

  1.  /*+ gather_plan_statistics */
  2. sql_trace=true;
  3. Statistics_level=all

The first two are cheaper (sets _rowsource_statistics_sampfreq =128 ) but less accurate than the last option which though more accurate can consume signficantly more CPU (_rowsource_statistics_sampfreq =1).

 Manually querying extended row source execution statistics

The data retrieved with display_cursor can be queried directly from

  • v$sql_plan_statistics
  • v$sql_plan_statistics_all


desc v$sql_plan_statistics


desc v$sql_plan_statistics_all
 TIME (seconds estimated by optimizer)
 OTHER_XML (bind vars and other info)
 LAST_EXECUTION (LAST whether work area was optimal, one pass or multi)
 TOTAL_EXECUTIONS (number of times work area was active)
 ACTIVE_TIME (centi-seconds, time work area is active)
 MAX_TEMPSEG_SIZE (kb -  mis-documented as bytes, though correctly bytes in v$sql_workarea )
 LAST_TEMPSEG_SIZE (kb - "" , see ) )

so you can write a query to get the data directly such as

col operation for a45
      LPAD(' ',depth)||P.OPERATION||'_'||P.OPTIONS||' '||P.OBJECT_NAME  operation
    , last_starts * cardinality e_rows_x_starts
    , last_output_rows  a_rows
    , LAST_DISK_READS     pread
    , LAST_DISK_WRITES    pwrites
    , LAST_ELAPSED_TIME   elapsed
       V$SQL_PLAN_statistics_all P
  order by child_number,id

which makes the comparison between estimate and actual easier because I can include starts*cardinality to get the e-rows_x_starts which I can compare directly to a-rows, ie output_rows.

Putting it all together

We can take the TCF ideas farther by putting them together in a query such that the output is easier to  read:

col cn format 99
col ratio format 99
col ratio1 format A6
--set pagesize 1000
set linesize 140
break on sql_id on cn
col lio_rw format 999
col "operation" format a60
col a_rows for 999,999,999
col e_rows for 999,999,999
col elapsed for 999,999,999

Def v_sql_id=&SQL_ID

       -- sql_id,
       childn                                         cn,
       --ptime, stime,
       case when stime - nvl(ptime ,0) > 0 then
          stime - nvl(ptime ,0)
        else 0 end as elapsed,
       nvl(trunc((lio-nvl(plio,0))/nullif(a_rows,0)),0) lio_ratio,
       --nvl(ratio,0)                                    TCF_ratio,
       ' '||case when ratio > 0 then
               rpad('+',ratio*-1 ,'+')
       end as                                           TCF_GRAPH,
       starts*cardinality                              e_rows,
       --nvl(lio,0) lio, nvl(plio,0)                      parent_lio,
from (
      stats.LAST_ELAPSED_TIME                             stime,
      p.elapsed                                  ptime,
      stats.sql_id                                        sql_id
    , stats.HASH_VALUE                                    hv
    , stats.CHILD_NUMBER                                  childn
    , to_char(,'990')
      ||decode(stats.filter_predicates,null,null,'F')     id
    , stats.parent_id
    , stats.CARDINALITY                                    cardinality
    , LPAD(' ',depth)||stats.OPERATION||' '||
      stats.OPTIONS||' '||
      DECODE(stats.PARTITION_START,NULL,' ',':')||
      DECODE(stats.PARTITION_STOP,NULL,' ','-')||
      TRANSLATE(stats.PARTITION_STOP,'(NRUMBE','(NR')      "operation",
      stats.last_starts                                     starts,
      stats.last_output_rows                                a_rows,
      (stats.last_cu_buffer_gets+stats.last_cr_buffer_gets) lio,
      p.lio                                                 plio,
          nullif(stats.last_output_rows,0),0)))             ratio
       v$sql_plan_statistics_all stats
       , (select sum(last_cu_buffer_gets + last_cr_buffer_gets) lio,
                 sum(LAST_ELAPSED_TIME) elapsed,
         from v$sql_plan_statistics_all
         group by child_number,sql_id, parent_id) p
    stats.sql_id='&v_sql_id'  and
    p.sql_id(+) = stats.sql_id and
    p.child_number(+) = stats.child_number and
order by sql_id, childn , id

which gives output like

    Enter value for sql_id: g2w9n4gksyys6
    old  59:     stats.sql_id='&v_sql_id'  and
    new  59:     stats.sql_id='g2w9n4gksyys6'  and

     CN   ELAPSED    LIO_RATIO TCF_GRAPH   E_ROWS       A_ROWS operation
    --- ------------ --------- ------ ------------ ------------ ------------------------------------------------------------
      0            0         0                                1 SELECT STATEMENT
           5,720,456         0                   1            1  HASH GROUP BY
              29,711         0                            1,909   NESTED LOOPS
                   0         0  +++              1        1,909    NESTED LOOPS
           1,969,304         0  +++              1        1,909     NESTED LOOPS
                   0         0  +++              1        2,027      NESTED LOOPS
           7,939,649         0  +++              1        1,656       NESTED LOOPS
             716,054         0  +++              1        1,657        NESTED LOOPS
             270,201         0  ++              39       23,171         HASH JOIN
                  23         0                   5            1          JOIN FILTER CREATE :BF0000
                  31         1                   5            1           TABLE ACCESS BY INDEX ROWID PS_PAY_CALENDAR
                  14         2                   5            1            INDEX RANGE SCAN PS0PAY_CALENDAR
             141,467         0              18,503       23,171          VIEW  VW_SQ_1
           3,032,120         0              18,503       23,171           HASH GROUP BY
             152,564         0             163,420       33,020            JOIN FILTER USE :BF0000
             407,746         0             163,420       33,020             MERGE JOIN
                  55         0                   5            1              SORT JOIN
                  12         2                   5            1               INDEX RANGE SCAN PS0PAY_CALENDAR
              79,435         0              40,000       33,020              SORT JOIN
             119,852         0              40,000       40,000               INDEX FAST FULL SCAN WB_JOB
           2,959,031        13  -           23,171        1,657         TABLE ACCESS BY INDEX ROWID WB_JOB
             944,887         1              23,171       23,174          INDEX RANGE SCAN WB_JOB
             102,650         0               1,657        1,656        VIEW PUSHED PREDICATE  VW_SQ_2
              73,769         0               1,657        1,657         SORT AGGREGATE
              25,617         0               1,657        1,657          FIRST ROW
             225,497         1               1,657        1,657           INDEX RANGE SCAN (MIN/MAX) WB_JOB
             357,872         0               3,312        2,027       TABLE ACCESS BY INDEX ROWID WB_RETROPAY_EARNS
           3,655,774         1               3,312        2,027        INDEX RANGE SCAN WB_RETROPAY_EARNS_IDX1
             199,884         0               2,027        1,909      TABLE ACCESS BY INDEX ROWID PS_RETROPAY_RQST
             317,793         1               2,027        1,909       INDEX RANGE SCAN PS_RETROPAY_RQST
              71,534         0               1,909        1,909     INDEX RANGE SCAN PS#RETROPAYPGM_TBL
              18,396         0               1,909        1,909    TABLE ACCESS BY INDEX ROWID PS_RETROPAYPGM_TBL

The 3 important parts of this query are

  • Elapsed is per row source, not cumulative of it’s children

Elapsed time format has a huge drawback in the display_cursor output as each lines elapsed time includes the elapsed time of all the children which makes an execution plan difficult to scan and see where the time is being spent. In the above output the elapsed time represents the elapsed time of each row source line.

LIO_RATIO shows the number of buffers accessed per row returned. Ideally 1 buffer or less is accessed per row returned. When the number of buffers per row becomes large, it’s a good indication that there is a more optimal method to get the rows.  The I/O stats include the stats of the child row source, so the query has to get the I/O from the childern and subtract from the parent, making the query a bit more complex.

TCP_GRAPH graphically shows the ratio of estimated rows to actual rows. The estimated rows used is cardinality* starts, not just cardinality. This value can be compared directly to actual_rows and the difference in order of magnitude is shown. Each ‘+’ represents and order of magnitude larger and each “-” represents an order of magnitude smaller. The more orders of magnitude, either way, the more the optimizers calculations are off and thus like more pointing to a possible plan that is suboptimal.

In the above output there   are 5 lines where the optimizer only expect 1 row and the actual results were over 1000, ie 3 orders of magnitude difference. These are the three lines with “+++”
There is one line with “-” where actual was an order of magnitude smaller. On that same line we see it’s one of the slower lines almost 3 seconds and that the were 13 lio’s per row returned, which is sign of inefficiency.



NFS versus dNFS

October 30th, 2013


NFS versus dNFS



Finally got to take Kevin Closson’s SLOB for spin. (BTW can one test dNFS with Orion at all?)
Nice write up on SLOB by Yury Velikanov at Pythian:

Also see Karl Arao’s SLOB cheat sheet.
NEW: see slob on steroids at
I ran into the same issues Pythian came up with

  • turn off default AWR
    • dbms_workload_repository.MODIFY_SNAPSHOT_SETTINGS(51120000, 51120000, 100, null)
  • reduce db_cache_size (cpu_count didn’t seem to work. The only way I got
    db_cache_size down was by allocating a lot to the shared pool)

    • *.sga_max_size=554M
  • avoid db file parallel reads
    • *._db_block_prefetch_limit=0

The goal was to test NFS verses DNFS.

I didn’t expect DNFS to have much of an impact in the basic case. I think
of DNFS as something that adds flexibility like multi pathing and fail
over and takes care of all the mount  issues. In some case like on LINUX
where the maximum outstanding RPC is 128, then  Oracle DNFS as I
understand it can go beyond these limits. I’m also guessing  that in the
case of rac DNFS will avoid the overhead of getattr calls that would be
required with noac that is required with out DNFS. (on single instance
noac can be taken off the mount options)


DB host Linux version 2.6.18-164.el5 

sunrpc.tcp_slot_table_entries = 128
sysctl -p


Ran SLOB straight through with default reader.sql which does 5000 loops. This took about
12+ hours to complete 18 tests – 9 on NFS and 9 on DNFS of varying user

The stats showed dNFS impact to be variable depending on the load.

I then change reader.sql to do 200 loops in order to run more tests faster
and get a handle on the standard deviation between tests.

With that change, then ran the test alternatively between NFS and dNFS, ie I ran
1,2,3,4,8,16,32,64 users loads on NFS then switch to dNFS and ran the
same tests. Then did this 4 times.

The stats were a bit different than the first run.
I then ran the test 8 times (ie full user ramp up on DNFS then NFS  , repeat 8 times)

Here are the stats.

Looks like there is a pretty consistent performance degradation on dNFS at 2 users
and a good performance improvement at 8 users.
At 16 users the test is just under line speed (1GbE, so around 105MB/s)
By 32 users the test is hitting line saturation so I grayed out those lines.

The NFS server caches a lot thus the fast I/O times.

% impact = 100 * ( NFS time – dNFS time )/ NFS time , ie % better or worse of dNFS over NFS

5000 loops 200 loops
200 loops             x8
users avg ms % impact avg ms % impact Stdev avg ms % impact Stdev
1 0.44 0.62 0.43 -3.68 10.76 0.45 -7.90 9.24
2 0.43 -16.62 0.49 -19.95 15.21 0.49 -20.97 3.67
3 0.57 6.61 0.57 -23.40 13.49 0.56 -8.62 5.63
4 0.64 5.98 0.58 -0.80 4.97 0.60 -0.86 5.12
8 0.76 17.17 0.69 14.04 11.33 0.68 11.80 2.25
16 1.29 11.86 1.20 8.56 15.70 1.20 8.34 2.23
32 2.23 -2.85 2.31 1.83 16.41 2.33 2.04 2.31
64 4.23 -8.38 4.62 1.75 12.03 4.67 2.43 0.76
128 8.18 0.73 1.86

At this point there is a clear degradation at 2 and improvement at 8 and the line maxes out just above 16, so decided to run the tests with 1-16 users, 8 times each, and here is the data:

Again, there is apparent degradation around 2 and 3 users and performance improvement peaks around 10 users, with a maximum saved time of 130us.

Here is the script to run the tests
NOTE:you have to be in the directory with “” from SLOB
and you have to have backed up as

# see:

# if EVAL=0, then script only outputs commands without running them
USERS="1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"
REPEAT="1 2 3 4 5 6 7 8"
function DNFS
   sqlplus / as sysdba << EOF
   shutdown abort
   cp  $ORACLE_HOME/lib/ $ORACLE_HOME/lib/
   sqlplus / as sysdba << EOF
function NFS
  sqlplus / as sysdba << EOF
  shutdown abort
  cp  $ORACLE_HOME/lib/ $ORACLE_HOME/lib/
  sqlplus / as sysdba << EOF
function param {
for i in 1; do
cat << EOF select name,value from v\$parameter where ISDEFAULT='FALSE'; select count(*) from v\$dnfs_servers; select count(*) from v\$dnfs_files; select count(*) from v\$dnfs_channels; show sga exit EOF done > param.sql
function runcmd {
    echo $cmd
    if   [ $EVAL  -eq 1 ] ; then
       eval  $cmd
for type in $TYPES; do
  [[ ! -d $type ]] && mkdir $type
for runnumber in $REPEAT;do
  echo "runumber $runnumber"
  for type in $TYPES; do
    if   [ $EVAL  -eq 1 ] ; then
       eval  $type
    echo "type  $type, run number $runnumber"
    for users in $USERS; do
      echo "Users $users"
      for output in 1; do
        cmd="sqlplus / as sysdba @param"
        runcmd $cmd
        cmd="./ 0 $users"
        runcmd $cmd
        cmd="mv awr.txt ${type}/awr_${users}_${runnumber}.txt"
        runcmd $cmd
      done > ${type}/${type}_${users}_${runnumber}.out 2>&1

To extract the data after the test you can grep out  the sequential read performance in the AWR files that end in “txt”:

$ grep ‘db file sequential read’ *NFS/*txt | grep User

DNFS/awr_1_1.txt:db file sequential read              32,032          14      0   88.1 User I/O
NFS/awr_1_1.txt:db file sequential read              34,633          16      0   88.8 User I/O
DNFS/awr_2_1.txt:db file sequential read              88,826          52      1   95.4 User I/O
NFS/awr_2_1.txt:db file sequential read              92,367          41      0   95.1 User I/O
DNFS/awr_3_1.txt:db file sequential read             141,068          92      1   96.0 User I/O
NFS/awr_3_1.txt:db file sequential read             145,057          81      1   96.2 User I/O
DNFS/awr_4_1.txt:db file sequential read             192,897         111      1   95.2 User I/O
NFS/awr_4_1.txt:db file sequential read             197,535         118      1   96.0 User I/O
DNFS/awr_5_1.txt:db file sequential read             246,322         153      1   95.3 User I/O
NFS/awr_5_1.txt:db file sequential read             249,038         141      1   96.2 User I/O
DNFS/awr_6_1.txt:db file sequential read             297,759         181      1   95.4 User I/O
NFS/awr_6_1.txt:db file sequential read             301,305         199      1   96.5 User I/O
DNFS/awr_7_1.txt:db file sequential read             349,475         216      1   95.5 User I/O
NFS/awr_7_1.txt:db file sequential read             352,788         244      1   96.5 User I/O
DNFS/awr_8_1.txt:db file sequential read             402,262         266      1   95.8 User I/O
NFS/awr_8_1.txt:db file sequential read             405,099         282      1   96.5 User I/O
DNFS/awr_9_1.txt:db file sequential read             453,341         306      1   96.2 User I/O
NFS/awr_9_1.txt:db file sequential read             456,009         345      1   96.7 User I/O

Alternatively calculate the average single  block read time in ms:

‘db file sequential read’ *NFS/*txt | grep User | sort -n -k 5 | sed -e
‘s/,//g’ | awk ‘{ printf(“%20s %10.2f %10d %10d\n”, $1, $6/$5*1000, $5,
$6 ) }’
DNFS/awr_1_1.txt:db       0.44      32032         14
NFS/awr_1_1.txt:db       0.46      34633         16
DNFS/awr_2_1.txt:db       0.59      88826         52
NFS/awr_2_1.txt:db       0.44      92367         41
DNFS/awr_3_1.txt:db       0.65     141068         92
NFS/awr_3_1.txt:db       0.56     145057         81
DNFS/awr_4_1.txt:db       0.58     192897        111
NFS/awr_4_1.txt:db       0.60     197535        118
DNFS/awr_5_1.txt:db       0.62     246322        153
NFS/awr_5_1.txt:db       0.57     249038        141
DNFS/awr_6_1.txt:db       0.61     297759        181
NFS/awr_6_1.txt:db       0.66     301305        199
DNFS/awr_7_1.txt:db       0.62     349475        216
NFS/awr_7_1.txt:db       0.69     352788        244
DNFS/awr_8_1.txt:db       0.66     402262        266
NFS/awr_8_1.txt:db       0.70     405099        282
DNFS/awr_9_1.txt:db       0.67     453341        306
NFS/awr_9_1.txt:db       0.76     456009        345
DNFS/awr_10_1.txt:db       0.74     507722        376
NFS/awr_10_1.txt:db       0.85     508800        433

Or to get the averages, max, min and standard deviations,

grep 'db file sequential read' *NFS/*txt | \
grep User | \
# lines look like
# NFS/awr_9_6.txt:db file sequential read   454,048   359    1 97.0 User I/O
# first number is count
# second number is time in seconds
# names are like "NFS/awr_9_6.txt:db file sequential read   "
# break 9_6 out as a number 9.6 and sort by those numbers
sed -e 's/_/ /' | \
sed -e 's/_/./' | \
sed -e 's/.txt/ /' | \
sort -n -k 2 | \
sed -e 's/,//g' | \
# lines now look like
# NFS/awr 9.6 db file sequential read   454,048   359    1 97.0 User I/O
# $2 is userload.run_number
# $7 is count
# $8 is time in seconds
# print out test # and avg ms
awk '
       { printf("%10s %10.2f ", $2,  $8/$7*1000 ) }
/^NFS/ { printf("\n"); }
' | \
# print out user load
# print out %impact of dns
awk '{printf("%10s %10.2f %10.2f\n",int($1),($4+$2)/2,($4-$2)/$2*100) }
END{printf("%d\n",17) }
' | \
# for each test, each user load
# print avg ms, avg impact, std of impact, min and max impact
awk 'BEGIN{row=1;nrow=0}
{ if ( row != $1 ) {
   if ( nrow > 0 ) {
     printf("%10s ", row )
     row = $1
     for(x=1;xmax) {max=val};
   if(val< min) {min=val};

         1       0.46       1.93      14.69     -14.00       35.00
         2       0.47     -13.24       8.14     -25.42        0.00
         3       0.61     -13.88       4.14     -21.13       -6.25
         4       0.61       0.10       5.40     -11.29        8.62
         5       0.60      -1.60       7.94      -9.68       11.67
         6       0.64      12.10       6.33      -1.54       19.30
         7       0.67      10.73       5.53       3.03       21.31
         8       0.70      12.46       4.42       6.06       19.70
         9       0.73      14.77       2.94      11.43       19.70
        10       0.78      17.08       2.91      13.89       23.61
        11       0.85      13.48       2.90       8.64       17.95
        12       0.92      10.34       2.76       6.90       16.28
        13       0.99      11.83       3.92       6.45       18.28
        14       1.04       7.23       4.42       0.00       13.13
        15       1.12       8.03       4.18      -0.93       13.21
        16       1.21       9.37       2.78       5.08       14.04



Oracle “Physical I/O” ? not always physical

October 29th, 2013

A customer called and wanted to know why the development database was so much slower than production when both databases were on the same type of machine and same type of storage.

To analyze the situation, the same query was run on both databases with

       alter session set events '10046 trace name context forever, level 12';

and sure enough, development (SID=dev) showed average I/O almost twice as slow as production:

    db file sequential read

       db   Times Waited Elapsed (ms) Avg Ela (ms)
       ~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~
       dev         55528       479930            9
       prod        65275       294785            5

MH900439292Now the above is physical I/O as reported by Oracle from looking at the wait event “db file sequential read” in the trace file. Based upon the above data alone, it would look like the “dev” database had an I/O subsystem problem. On the other hand not all “physical I/O” as reported by Oracle is really physical I/O. The I/O reported by Oracle could simple be I/O that is satisfied by the O/S file system cache. When Oracle issues a request for I/O, in this case a “db file sequential read” (aka a single block read request) all we know is Oracle is asking the operating system for the data. We don’t know where the operating system gets the data. The operating system might issue a request for the data from the underlying storage system or it might find the block in the operating system file cache. How can we tell the difference between a read from the underlying storage or a read from the O/S file system cache? Well, from the Oracle perspective there are no statistics that differentiate between the two, but based upon the latency for the I/O request we can make an educated guess. Oracle keeps I/O latency information in the v$event_histogram views, but unfortunately these views only go down to the granularity of 1ms. For I/O from a SAN that is cached in memory on the SAN, the latency could go down to micro-seconds. What we want to know is how much I/O is coming faster than a SAN could reasonably supply the data which would mean the data is coming from UNIX file system cache and not the SAN. An 8K data transfer over 4Gb FC takes 20us for the data transfer alone not to mention any O/S scheduling and or code path. Thus a reasonable value for I/O that is faster than a SAN would/could supply would be on the order of of 10s of mircro-seconds. For example 10us is faster than even a 8Gb FC could supply 8k. Now add on some overhead for code path, scheduling etc and 100us (ie 100 microseconds) is a pretty good cut off.

Taking the same trace files used for the above analysis and grouping the I/O requests into latency buckets gives:

   db  : 10u  50u  .1u  .2m  .5m  1ms  2ms    4ms   8ms  16ms  32m  64m  .1   .5 >.5
   dev :        1   14  908 13238 6900 9197 15603  9056   265   26   12   12
   prod:4419 2713  391  118 22189 2794 1688  2003 11969 14877 2003  105    3   3

What we see is that production gets substantial portion of it’s I/O in less than 10us, ie this data is coming from the file system cache but on dev most of it is coming from the SAN.f

In this case the query had only been run once on “dev” but had been run multiple times on prod, thus on prod, the O/S had cached some of this data in the file system cache.

Once the query was run a couple of times on “dev”, then the latency on “dev” when down to the latency on “prod”.

To avoid reading from the O/S cache all together (or for the most part), one can set


but by doing this alone, it would just increase the latency of the I/O as the O/S file system cache would no longer be used. In order to compensate for the loss of the O/S file system caching, then the buffer cache for the Oracle database should be increased in order to compensate for the loss. If the buffer cache is increased sufficiently to compensate for the lost O/S caching memory, the there will be less CPU used and less I/O and less latency due to I/O calls by Oracle. On the other hand if there are several database on the host then balancing the SGA sizing among the databases without use of O/S file system caching can be more trouble than it’s worth and using O/S file system caching might be more flexible.

How to extract a histogram of I/O latency from the Oracle trace file? Run the script

The script is on github, so feel free to fork the code and make your own changes.

To run, just type

    ./  sid_ora_29908.trc

where “sid_ora_29908.trc” is the name of your trace file.

The output looks like:

                    Time Breakdown (seconds)
elapsed           3690
total wait        1525
cpu               1327
unaccounted        837
idle                 0

                    Summary of Activity in trace file
           event                               count total secs    avg ms
        1) ELAPSED                                         3690
        2) CPU                                             1327
        3) db file sequential read            243910        964     3.955
        4) local write wait                    34485        175     5.095
        5) db file scattered read              25147        140     5.598
        6) log buffer space                      326        118   362.508
        7) free buffer waits                     164         64   394.329
        8) write complete waits                    4         29  7394.890

Histogram of latencies  for:
                                 db file sequential read
        64u .1m  .2m   .5m    1m    2m    4m    8m   16m  33m  65m .1s .1s+
     109572 782  118   345 10652 21701 15502 43635 34636 5270 1300 258  139

Histogram of latencies by I/O size in # of blocks for:
                                 db file scattered read
                                 direct path read
                                 direct path read temp
        64u .1m  .2m   .5m    1m    2m    4m    8m   16m  33m  65m .1s .1s+
db file scattered read
  4      46   3    1     0     2     2    11     1     4    2    1   0    0
 32       0   0 3436 14534   617   342  1542   852   834  642  361 295  186


The part of interest is “Histogram of latencies for: ”
The first histogram section is for “db file sequential read”, ie single block reads.
The next histogram section is for multiblock reads. These multiblock reads are grouped by read type and for each read type the I/O is broken down by I/O size in number of blocks. Number of blocks is the first field.


Here is an example of plotting ofhe average I/O latency on three databases. What we see is a tri-modal histogram. The histogram represents how many I/Os fell into the each time bucket. The first time bucket is < 10 us.  The next is between 10 us and 50 us. Any I/O coming from under 50 us is most likely coming from Unix file system cache. I/O with latency under 1ms is most likely, statistically speaking, coming from storage cache. I/O latency over 1ms is probably coming from disk reads. We don’t know for sure but this is a good educated guess on where the I/Os are coming from based on their latency.

What we see in the graphic is that actual disk (spindle) I/O is faster on DEV than it is on PROD which is surprising since the initial reaction of the team was that disk I/O was slower on DEV than PROD. The reason PROD seemed faster was because it was benefiting from significant Unix files system caching and DEV was not.





Listener won’t register database

October 28th, 2013



8426787137_169a12003cI must admit, I’m a sqlnet hack.  I never really sat down to figure it out. If I have problems getting a database to register with a currently listener then I generally just add a separate listener. Well, today I was stuck on a system where I couldn’t add a second listener. Everything seems fine.  There was a default listener running on 1521 with  a database instance already registered. The first and second database were in the same ORACLE_HOME but I just couldn’t get the listener  to pick up the second databases.  I bounced and reloaded the listener and registered the second database

  • lsnrctl start
  • lsnrctl stop
  • lsnrctl start
  • alter system register

No dice. Then I tried

alter system set local_listener=’(ADDRESS = (PROTOCOL=TCP)(HOST =myhost)(PORT=1521))’

and bang, it registered.

Of course, I guess I should track down why it wouldn’t register in the first place but getting it working for now was enough.

photo by Thomas Nielsen





photo by Nicola Einarson



OOUG Performance Presentations

October 25th, 2013


Had a great time last week presenting to the Ohio Oracle User Group (OOUG). I’m generally use to presenting a single presentation but for this day at OOUG, I was the guest speaker and presented 4 back to back presentations.

OOUG members picked me up at the airport, whisked me to my hotel, then picked me up in the morning to take me to the conference venue, then arranged for my transport back to the airport. All in all super smoothly run.

Many thanks to Mary Brown from Nationwide who is the Vice President of OOUG for setting this up and for Nathan and Paddy  also from Nationwide for helping out.

Here are the 4 presentations .


Database Performance: Average Active Sessions



Database Virtualization



Visual SQL Tuning



Transaction Locks in Oracle


Many thanks to Doug Burns for his help on the lock presentation with his impressive set of blog posts on Oracle transaction locks:




Merging and Branching a Database

October 16th, 2013



Webinar: DB Maestro and Delphix


Three part post


In the application world, merging and branching looks something like the picture below. Copies are made from confirmed releases on the trunk which can then be operated on by developers as a branch of the main code. Changes are made, tested, and then merged back into the trunk for a new release.

Screen Shot 2013-10-10 at 11.40.33 AM


Imagine if we could do this in the database world as well, similar to this image:

Screen Shot 2013-10-15 at 6.29.37 AM

When developing applications that run on top of a database, each branch of code theoretically requires a copy of the source database where necessary schema and data changes can be applied. For maximum development capabilities, each developer would require a full copy of the source database so they could work unimpeded without the danger of breaking code changes being made by other developers. That’s the theory, anyways. But we would need to have a way to merge in changes from other developers into each development environment as well as a way to merge developer changes back into the trunk environment for this to truly work out.

  • Challenge #1: Creating additional databases for development work to be performed in parallel to the trunk.
  • Challenge #2: Merging branches back into a trunk so at the end of the day everyone’s work is combined into a single release that can be pushed to testing and hopefully to production.


Creating a copy of a database can take days or even weeks and require extensive resources. This is where database virtualization comes into play. With database virtualization, a database copy is as simple as a few clicks of a mouse. Within a couple minutes and almost no storage footprint, a fully functional read/write database will be available to use for development.

The saved time and lack of space consumption are thanks to the fact that virtual databases share the majority of their storage. Only a single copy of the production database and its change records is actually needed to create as many branches to as many different developers are required. Each developer or development team will have their own full copy of the source database(s) and different versions of the code can be correlated with tagged versions of virtual databases.

This solves one of the most challenging aspects of database management in development environments while also solving the problem of branching for source control. However, the question of how to merge changes from different development environments with different copies into a single coherent version still remains.


Figuring out how to combine everything together (let’s call it merging, shall we?) is the next challenge, as conflicts arise (as they usually do) and all involved developers need to be consulted during the merge.

Screen Shot 2013-10-10 at 11.40.16 AM

But as we can see from this diagram, remembering everything that was changed and what it might conflict with is not practical. Even locating and focusing on conflicts to be merged can be a time consuming, error prone, ‘eye ball tested’ process. Risky is one word that leaps to mind. Damage is another one.

We really need to ask ourselves:

  1. Did we consider all changes? Overlooking a change that was introduced to Branch 1 and needs to be updated in the trunk will backfire. If Murphy is right (and it is his law) it will happen when we are already in production.
  2. Do we understand the history and origin of each change? Locating changes is one thing—but do we really understand why and where each change was introduced? Should it be integrated? If a branch and a trunk are different, which version is the right one? The last thing we want to do is override the new version of hot-bug-fix in the trunk with an older version from Branch 2 while performing the integration!
  3. Can we recognize a conflict if we see one? Ok, so we definitely know that object “x” in Branch 2 was changed, and we do see a difference in the trunk. So it’s safe to override the trunk, right? Wrong! What happens if the same object undergoes different changes in both trunk and branch? None of the versions is the right one, because the right one is a combination of both.

As we can see, dealing with branch merges can be quite a challenge. Eye balling changes is risky business. Just comparing branch and trunk to find these changes is risky as well, as we want to make sure we don’t override necessary changes—because it’s not A or B, it’s a combination of both.

Screen Shot 2013-10-15 at 6.30.18 AM

Easy creation of branches or developer sandboxes can be achieved with database virtualization. Safe branch merges require conflict and merging capabilities that are part of database deployment automation and connected to database version control, for safe and informed decision taking.

See these practices in action!

Join us for a webinar with DB Maestro Thu, Oct 24, 2013 12:00 PM – 1:00 PM EDT.

Screen Shot 2013-10-14 at 1.36.47 PM


Databases for every developer like source code?

October 15th, 2013



Webinar: DB Maestro and Delphix



Three part post


The simple truth is that databases can’t be given to every developer like source code.

Databases can be large, complex, and extremely difficult to copy. It is rare to see developers or even development groups that have a copy of the full production database to use for their tasks.

On the other hand, developers almost always have access to a full copy of the application code stack. Source control methods like git make it easy to clone, fork, and track code from the component level to the entire codebase.

For databases, the typical strategies are:

  1. Share a full copy of the source database across a group
  2. Provide small subsets of data from the source database to developers
  3. Allocate schema only ‘databases’ per developer

Each of these strategies presents their own problems.

Sharing a database across a group of developers means that any change one developer makes can impact the other developers. There is no automated way to protect a group of developers from the changes of a single developer. To protect development problems that can be caused by new changes, the changes have to be reviewed. Reviews can take up to a couple of weeks, meaning the coder who wants to implement these changes has to wait. The first in a long line of delays as a result of inadequate database resources.

If developers are given subsets of the database (partial copies, only some data) or empty skeleton schemas, it will easily lead to code that has not been tested on the full range of data that is present in production. This of course opens the door to untested bugs born from outliers in the data and queries that are not optimized on the full set of data. Finding these bugs late in the development cycle is more costly by far than finding them immediately with the developer writing the code.

Delays and errors in database development

Let’s zoom out a bit and look at the whole development process. Doing so unfortunately presents additional challenges. These challenges increase due to multiple lines of development that have to be run in parallel and merged, run through QA and UAT, and then rolled out to production.

In most cases development is not a single player effort. Different teams are working in parallel, and code deployment is often practiced using structures called branches and trunks that create separate working environments for different teams to work uninterrupted. Prior to rolling out to production, these branches have to be merged into a single working environment that can be tested and rolled into production.

Doing this for source code is standard practice in the industry, as all that is required is for a new copy of the code files to be made in a new centralized location. This is often done automatically by a source control tool. But as we’ve seen, databases are more complex than source code and more difficult to copy.

But …

There are other methods. Stay tuned tomorrow our join us for a live webinar  with DB Maestro Thu, Oct 24, 2013 12:00 PM – 1:00 PM EDT.

Screen Shot 2013-10-14 at 1.36.47 PM


Version Control Meets Data Control

October 14th, 2013




Webinar: DB Maestro and Delphix



Three part post

Source Control


Imagine being tasked with writing a twenty-page paper. In pen. How would you ensure a complete paper without any mistakes? Sure, you can rewrite a page if something goes wrong, but that’s a lot of extra work that could potentially set you back hours. In fact, even though each completed page is a saved document and therefore safe, if you are ever asked to rewrite even a single word you will have to rewrite an entire page.

So what could you do to minimize the risk and ensure you don’t waste much time? One option is to make a copy of each page after every sentence. In the event of an issue on any sentence, you can revert to a copy and begin from there and save yourself at least some work. When a page is confirmed as being good you can commit it, and when all the pages are confirmed you can merge them into a final document.

Obviously there are better solutions than this (like not using pen in the first place). But in application code, we face similar issues. Thousands of lines of code that have to be written, checked, crosschecked, put through QA cycles, subjected to UAT, and finally merged together into a cohesive package for production that may or may not be what the customer is looking for. And in that world we have a similar solution called source control.

Source control is widely used for development projects in languages like Java, .NET, C++, Ruby, PHP, JavaScript, and more. In some organizations, teams are actually built around the source control solution the company uses. Without proper source control it is next to impossible to keep track of all the pieces of code, releases, bug fixes, and requirements of the business. It not only holds the project together, it helps orchestrate the flow of the development cycle.

Sadly, in the database world we have no widespread source control. Some organizations do track changes in development, QA, and production databases as they are made. But a complete solution incorporating data versioning, change repositories, development merging, and rollback capabilities is out of reach for most organizations. Surprisingly this is even true for code such as PL/SQL, T-SQL, etc. in many cases.

Databases Pose Unique Source Control Problems

Screen Shot 2013-09-05 at 9.52.45 PM

Database development poses unique problems for source control compared to traditional code development. This is because the database is not a flat code object; instead, it is constructed from various parts:

  • Schema Structure
  • Content
  • Database Code (PL/SQL, T-SQL, etc.)

The simple truth is that databases can’t be given to every developer like source code.

Until now.

More to come tomorrow and join us for a webinar with DB Maestro Thu, Oct 24, 2013 12:00 PM – 1:00 PM EDT.

Screen Shot 2013-10-14 at 1.36.47 PM 

 above photo by Charles Kenny




Oracle 12c helps developers

October 3rd, 2013


September/October Cover

I’m super excited about two aspects of 12c (among many). First is Oracle 12c EM Express which gives light weight access to Oracle performance information without all the agents and heavy infrastructure. You basically just point a browser at the database via a setup port and voila. Since EM Express has limited functionality  but gives powerful insight into performance on the database it’s perfect for developers to use and see the impact of their code in the databases where they are running their code.

The other feature in 12c that is exciting is pluggable databases. Why? Because for me as someone who clones 100s to 1000s of databases using virtual databases via Delphix where creating a clone is 4 clicks of a mouse, almost no storage usage and takes a few minutes, it means I can clone more databases. Why? Because each of these “free” virtual databases I create with Delphix still creates it’s own SGA which is at a minimum about  half a gigabyte. Now with pluggable databases that will go down to about 100 megabytes or 1/5 the size, which means I’ll be able to put a couple of hundred Delphix virtual databases on a machine thanks to Oracle pluggable databases.

What does all this mean? It means that each developer can have their own full copy of  production with a shared disk foot print, a shared memory caching footprint (only available in Delphix) and shared SGA components and instance processes. On top of that the developers can get clear and powerful views into the impact of the code they run on their databases.

Have an article in the latest  Oracle magazine, but don’t see a way to link the article in so here it is. There is also a video that goes along with the article at the bottom of this post.












Popularity vs Productivity vs Performance

October 1st, 2013



What to choose

 Screen Shot 2013-09-05 at 9.52.45 PM

 Screen Shot 2013-09-05 at 9.52.45 PM

A more popular car will have better servicing as parts and skills will be common. A fast car is compelling but sometimes speed is easily sacrificed for comfort.

The question of language efficiency verses productivity has been nagging me. The nagging question includes  what  languages, frameworks and packages are best for collecting data, analyzing data and exposing the data in a rich UI via a web browser.

The question goes well beyond just languages. For web interfaces javascript is the clear winner and the question is more what frameworks to use such as ember.js, backbone.js, angular.js, knockout.js etc and whether to use D3 or go with something like highcharts. But this posting just goes into languages themselves.

It is a huge decision to make at the beginning of a project as to which language to use and why.

Screen Shot 2013-09-05 at 9.52.45 PM


 Screen Shot 2013-09-05 at 9.52.45 PM

 Screen Shot 2013-09-05 at 9.52.45 PM

Productivity is probably the most important factor when choosing a language.  As the founder of Slideshare, Jonathan Boutelle, said  when asked if he regretted using Ruby instead of  Java inhis response was that the question was moot, that he wouldn’t even be there talking if he had use Java. Sure Java would have scaled better and Ruby had concurrency issues but  Slideshare would not have succeeded, at least not  as fast,  or possibly not even existed had they used Java.

Here is an interesting graphic on speed of the programs in various languages

From Dion Hinchcliffe  

Now the above graphic might be hard to read (wish it was larger) but what it is indicating is that Python is  roughly 10x times slower in execution time to than the equivalent programs in Java. On the other hand the expected productivity gain for a programmer is 5-10x faster in Python than Java. Now on the surface my reaction is “OK, it’s faster to program in Python, ok, but Java is 10x faster so it’s clearly the choice!”, but if I put the numbers into real life I’m like OMG – imagine a Python program that takes 1 day to write, now that same program would take 5-10 days in Java!? Thats a huge deal. Hands down, I’m going to program in Python (or something other than Java).  I can deal with some runtime performance issues.

Should a company program in Java or Python? What if Python would take a year to bring to market? What if the Java verisontook 5-10 years ?!

Here is another graphic on productivity from the book “From Java to Ruby

The data is base on a study of productivity in C, Java, Perl, Python:

As far as productivity comparisons, you may have many issues with the above to graphics, but  I think the following two images of Python vs Java explain it nicely:



Java equivalent


For a down and out dirty view of Java sprawl that further exacerbates productivity check out:

(as a rebuttal try pursing hello world in gnu packaging : )

Language Expressiveness

 Screen Shot 2013-09-05 at 9.52.45 PM

Fascinating article about how languages are becoming more expressive

Screen Shot 2013-07-23 at 8.04.41 PM

Functional vs Object Programming

 Screen Shot 2013-09-05 at 9.52.45 PM

Related is  a discussion of functional programming verses object oriented programming

  Screen Shot 2013-09-05 at 9.52.45 PM


 Screen Shot 2013-09-05 at 9.52.45 PM

Screen Shot 2013-09-05 at 9.52.45 PM

Python rockets to ascension and Java becomes lack luster


Screen Shot 2013-07-26 at 8.38.36 AM

What are the most influential languages

Screen Shot 2013-09-05 at 9.52.45 PM

Another perspective on languages , the most influential languages:

Screen Shot 2013-07-16 at 9.52.35 AM

 Screen Shot 2013-09-05 at 9.52.45 PM

Languages mentioned in job postings as % and % increase

 Screen Shot 2013-09-05 at 9.52.45 PM

 Screen Shot 2013-09-05 at 9.52.45 PM

Language popularity based on book sales

Screen Shot 2013-09-05 at 9.52.45 PM

The issues of showing total growth  verses % growth lend itself well to heat map representation. Here is a heat map from O’Reilly:

(Interesting to all the Oracle folks: PL/SQL shows a 51% growth 2007-2008.

Here is data from O’Reilly on growth over time, showing Python and C# growing steadily:

Tiobe Index

Screen Shot 2013-09-05 at 9.52.45 PM

Google language trends

Screen Shot 2013-09-05 at 9.52.45 PM

Here is google trends normalized by C programming (with out normalization all trends are down because google trends show hits normalized by all traffic and as general traffic increases verses computer geek searches general computer geek searches trend downwards)

Interesting how Ruby seems to be trending down (as normalized against C  searches)


Other links on popularity

Screen Shot 2013-09-05 at 9.52.45 PM


Screen Shot 2013-09-05 at 9.52.45 PM

Screen Shot 2013-09-05 at 9.52.45 PM

Here is a fascinating site with empirical data on speed of languages

Current speed might be a bit misleading as popularity will impact the efforts put into the issues of a language and a language that is popular though maybe less efficient will see improvements, for example it’s interesting to see the improvements in Java over the years, and now to see the improvements in Ruby.  Though Ruby is slower than  Scala and Scala’s productivity may even be better than Ruby, Scala doesn’t, yet, have the market momentum, thus the assurances that it is a language to invest in now.

Also if speed of the language is an issue the solution is not to throw the baby out with the bathwater, ie throw out the productive programming language altogether and go for a fast one such as C or JAVA but to find the areas of slowness in the productive framework and replace those with a fast function in another language, ie polyglot programming.

 Screen Shot 2013-09-05 at 9.52.45 PM

Other stuff

 Screen Shot 2013-09-05 at 9.52.45 PM

 Screen Shot 2013-09-05 at 9.52.45 PM

Just for Fun

Here  is how I started my morning, a fun video on Python, that catalyzed the above discussion: