Oracle “Physical I/O” ? not always physical

October 29th, 2013

A customer called and wanted to know why the development database was so much slower than production when both databases were on the same type of machine and same type of storage.

To analyze the situation, the same query was run on both databases with

       alter session set events '10046 trace name context forever, level 12';

and sure enough, development (SID=dev) showed average I/O almost twice as slow as production:

    db file sequential read

       db   Times Waited Elapsed (ms) Avg Ela (ms)
       ~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~ ~~~~~~~~~~~~
       dev         55528       479930            9
       prod        65275       294785            5

MH900439292Now the above is physical I/O as reported by Oracle from looking at the wait event “db file sequential read” in the trace file. Based upon the above data alone, it would look like the “dev” database had an I/O subsystem problem. On the other hand not all “physical I/O” as reported by Oracle is really physical I/O. The I/O reported by Oracle could simple be I/O that is satisfied by the O/S file system cache. When Oracle issues a request for I/O, in this case a “db file sequential read” (aka a single block read request) all we know is Oracle is asking the operating system for the data. We don’t know where the operating system gets the data. The operating system might issue a request for the data from the underlying storage system or it might find the block in the operating system file cache. How can we tell the difference between a read from the underlying storage or a read from the O/S file system cache? Well, from the Oracle perspective there are no statistics that differentiate between the two, but based upon the latency for the I/O request we can make an educated guess. Oracle keeps I/O latency information in the v$event_histogram views, but unfortunately these views only go down to the granularity of 1ms. For I/O from a SAN that is cached in memory on the SAN, the latency could go down to micro-seconds. What we want to know is how much I/O is coming faster than a SAN could reasonably supply the data which would mean the data is coming from UNIX file system cache and not the SAN. An 8K data transfer over 4Gb FC takes 20us for the data transfer alone not to mention any O/S scheduling and or code path. Thus a reasonable value for I/O that is faster than a SAN would/could supply would be on the order of of 10s of mircro-seconds. For example 10us is faster than even a 8Gb FC could supply 8k. Now add on some overhead for code path, scheduling etc and 100us (ie 100 microseconds) is a pretty good cut off.

Taking the same trace files used for the above analysis and grouping the I/O requests into latency buckets gives:

   db  : 10u  50u  .1u  .2m  .5m  1ms  2ms    4ms   8ms  16ms  32m  64m  .1   .5 >.5
   dev :        1   14  908 13238 6900 9197 15603  9056   265   26   12   12
   prod:4419 2713  391  118 22189 2794 1688  2003 11969 14877 2003  105    3   3

What we see is that production gets substantial portion of it’s I/O in less than 10us, ie this data is coming from the file system cache but on dev most of it is coming from the SAN.f

In this case the query had only been run once on “dev” but had been run multiple times on prod, thus on prod, the O/S had cached some of this data in the file system cache.

Once the query was run a couple of times on “dev”, then the latency on “dev” when down to the latency on “prod”.

To avoid reading from the O/S cache all together (or for the most part), one can set


but by doing this alone, it would just increase the latency of the I/O as the O/S file system cache would no longer be used. In order to compensate for the loss of the O/S file system caching, then the buffer cache for the Oracle database should be increased in order to compensate for the loss. If the buffer cache is increased sufficiently to compensate for the lost O/S caching memory, the there will be less CPU used and less I/O and less latency due to I/O calls by Oracle. On the other hand if there are several database on the host then balancing the SGA sizing among the databases without use of O/S file system caching can be more trouble than it’s worth and using O/S file system caching might be more flexible.

How to extract a histogram of I/O latency from the Oracle trace file? Run the script

The script is on github, so feel free to fork the code and make your own changes.

To run, just type

    ./  sid_ora_29908.trc

where “sid_ora_29908.trc” is the name of your trace file.

The output looks like:

                    Time Breakdown (seconds)
elapsed           3690
total wait        1525
cpu               1327
unaccounted        837
idle                 0

                    Summary of Activity in trace file
           event                               count total secs    avg ms
        1) ELAPSED                                         3690
        2) CPU                                             1327
        3) db file sequential read            243910        964     3.955
        4) local write wait                    34485        175     5.095
        5) db file scattered read              25147        140     5.598
        6) log buffer space                      326        118   362.508
        7) free buffer waits                     164         64   394.329
        8) write complete waits                    4         29  7394.890

Histogram of latencies  for:
                                 db file sequential read
        64u .1m  .2m   .5m    1m    2m    4m    8m   16m  33m  65m .1s .1s+
     109572 782  118   345 10652 21701 15502 43635 34636 5270 1300 258  139

Histogram of latencies by I/O size in # of blocks for:
                                 db file scattered read
                                 direct path read
                                 direct path read temp
        64u .1m  .2m   .5m    1m    2m    4m    8m   16m  33m  65m .1s .1s+
db file scattered read
  4      46   3    1     0     2     2    11     1     4    2    1   0    0
 32       0   0 3436 14534   617   342  1542   852   834  642  361 295  186


The part of interest is “Histogram of latencies for: ”
The first histogram section is for “db file sequential read”, ie single block reads.
The next histogram section is for multiblock reads. These multiblock reads are grouped by read type and for each read type the I/O is broken down by I/O size in number of blocks. Number of blocks is the first field.


Here is an example of plotting ofhe average I/O latency on three databases. What we see is a tri-modal histogram. The histogram represents how many I/Os fell into the each time bucket. The first time bucket is < 10 us.  The next is between 10 us and 50 us. Any I/O coming from under 50 us is most likely coming from Unix file system cache. I/O with latency under 1ms is most likely, statistically speaking, coming from storage cache. I/O latency over 1ms is probably coming from disk reads. We don’t know for sure but this is a good educated guess on where the I/Os are coming from based on their latency.

What we see in the graphic is that actual disk (spindle) I/O is faster on DEV than it is on PROD which is surprising since the initial reaction of the team was that disk I/O was slower on DEV than PROD. The reason PROD seemed faster was because it was benefiting from significant Unix files system caching and DEV was not.





  1. Trackbacks


  2. | #1

    Hi Kyle,

    I’ve had this issue many times (difference in I/O performance between prod and dev servers), and I knew about OS (or hardware) cache being a factor here, but not once it occured to me to estimate OS cache hit ratio by looking at the event duration histogram. A very nice idea.


  3. Biplab Chakraborty
    | #2

    Nice concept! It has cleared few of my doubts.

five + = 14