Is NFS on ZFS slowing you down?

January 25th, 2018

If you think so, check out shell script “ioh.sh” from github at  https://github.com/khailey/ioh

Introduction and Goals

The goal of ioh.sh is to measure both the throughput and latency of the different code layers when using NFS mounts on a ZFS appliance. The ZFS appliance code layers inspected with the script are I/O from the disks, ZFS layer and the NFS layer. For each of these layers the script measures the throughput, latency and average I/O size. Some of the layers are further broken down into other layers. For example NFS writes are broken down into data sync, file sync and non-sync operations and NFS reads are broken down into cached data reads and reads that have to go to disk.

The primary three questions ioh is used to answer are

  • Is I/O latency from the I/O subsystem to ZFS appliance sufficiently fast?
  • Is NFS latency from ZFS appliance to the consumer sufficiently fast?
  • Is ZFS adding unusual latency

One: If the latency from the I/O subsystem is not adequate then look into supplying better performing I/O subsystem for ZFS appliance. For example if the goal is 3ms write times per 1K redo write but the underlying I/O subsystem is taking 6ms, then it will be impossible for ZFS appliance to meet those expectations.

Two: If the latency for NFS response from ZFS appliance is adequate and yet the NFS client reports latencies as much slower (more than 2ms slower) then one should look instead at problems in the NIC, network or NFS client host, see network tracing, example http://datavirtualizer.com/tcp-trace-analysis-for-nfs/

Three: If the I/O latency is sufficiently fast but ZFS latency is slow, then this could indicate a problem in the ZFS layer.

The answer to the question “what is adequate I/O latency” depends. In general a single random 8 Kb block read on Oracle is expected to take 3-12 ms on average, thus the typical latency is around 7.5 ms. NOTE: when measuring I/O latency on the source system it’s important to use a tool like “iostat” that will show the actually I/Os to the subsystem. The I/O measured by the Oracle database will include both I/Os satisfied from the host file system cache as well as the I/O subsystem unless the database is running with direct I/O

The ioh tool can also give insight into other useful information such as

  • Are IOPs getting near the supported IOPs of the underlying I/O subsystem
  • is NFS throughput getting near the maximum bandwidth of the NIC?”

For example if the NIC is 1GbE then the maximum bandwidth is about 115MB/s, and generally 100MB/s is a good rule of thumb for the max. If throughput is consistently near the NIC maximum, then demand is probably going above maximum and thus increasing latency

$ ioh.sh -h

usage: ./ioh.sh options

collects I/O related dtrace information into file "ioh.out"
and displays the

OPTIONS:
  -h              Show this message
  -t  seconds     runtime in seconds, defaults to forever
  -c  seconds     cycle time ie time between collections, defaults to 1 second
  -f  filename    change the output file name [defautls to ioh.out]
  -p              parse the data from output file only,  don't run collection
  -d  display     optional extra data to show: [hist|histsz|histszw|topfile|all]
                    hist    - latency histogram for NFS,ZFS and IO both reads and writes
                    histsz  - latency histogram by size for NFS reads
                    histszw - latency histogram by size for NFS writes
                    topfile - top files accessed by NFS
                    all     - all the above options
                  example
                    ioh.sh -d "histsz topfile"

two optional environment variables CWIDTH – histogram column width PAD – character between columns in the histograms, null by default

Running

$ sudo ./ioh.sh 

Outputs to the screen and put raw output into default file name “ioh.out.[date]”. The default output file name can be changed with “-o filename” option. the raw output can later be formatted with

$ ./ioh.sh -p  ioh.out.2012_10_30_10:49:27

By default it will look for “ioh.out”. If the raw data is in a different file name it can be specified with “-o filename”

The output looks like

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
C |   nfs_c:|           |          |          |        .
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
- 
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2

The sections are broken down into

  • Header with date and TCP throughput
  • Reads
  • Writes

Reads and Writes are are further broken down into

  • io
  • zfs
  • nfs

For writes, the non stable storage writes are separated from the writes to stable storage which are marked as “sync” writes. For NFS the sync writes are further broken down into “data” and “file” sync writes.

examples:

The following will refresh the display every 10 seconds and display an extra four sections of data

$ sudo ./ioh.sh -c 10 -d "hist histsz histszw topfile"   

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
- 
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |                     |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    1s    2s    2s+
R        io      .     .     .     .     .     .     .     .     .     1     3     1
R       zfs   4743   287    44    16     4     3     .     .     .     1     2     2
R       nfs      .  2913  2028    89    17     3     .     .     .     1     2     2
-
W        io      .     .     .    58   249   236    50    63   161   381   261    84    20     1
W       zfs      3    12     2
W   zfssync      .     .     .     .    26   162   258   129   228   562   636   250    75    29
W       nfs      .     .     .     .    12   164   265   134   222   567   637   250    75    29
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8     .  2909  2023    87    17     3     .     .     .     1     2     2
R   0.1   16     .     4     5     2
-
W   5.0    4     .     .     .     .     8    49    10     3     4    11     4     2     1
W  21.4    8     .     .     .     .     4    55   196    99   152   414   490   199    60    23
W  18.3   16     .     .     .     .     .    34    29    25    43    91    84    28     8     5
W  16.1   32     .     .     .     .     .    19    16     7    14    38    36    15     3
W  19.3   64     .     .     .     .     .     6    11     .     9    11    19     6     2     1
W  20.4  128     .     .     .     .     .     1     3     .     .     2     4     .     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.02MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.57MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.70MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo3.log
W  3.93MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/10.2.0.4/db_1/dbs/soe.dbf
-
R  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.05MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  7.84MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/10.2.0.4/db_1/dbs/soe.dbf
IOPs         313

Sections

First line is the date Second line is TCP MB per second in,out and retransmitted. The last value is “ip discards”

The three parts are all related and are a drill down starting with course grain data at the top to finer grain data at the bottom.

  • averages – default
  • histograms [hist]
  • histograms by size reads [histsz] writes [histszw] for NFS

The first section is a quick overview.

The second section breaks out the latency into a histogram so one can get an indication of amount of I/O from memory (ie those in microsecond ranges) as well as how far out the outliers are. (are the outliers on the VDBs matching up to the outliers seen on ZFS appliance?)

The third section differentiates between latency of single block random (typically the 8K size) and latency of multi-block sequential reads (32K and higher). The differentiation is important when comparing to Oracle stats which are grouped by single block random reads (called “db file sequential read” ) and sequential multi-block read (called “db file scattered read”).

The final section

top files read and write [topfile]

is sort of a sanity check as there are periods where there is suppose to be little to no NFS I/O and yet there is, so the top file sections tells which file and which host the NFS I/O is coming from.

The last line after all the sections is total IOPs for reads plus writes. (Note these IOPs could get converted to higher values at the storage layer if using RAID5 which will cause each write to be two reads plus two writes.)

The first section, shows up by default. The other sections require command line arguments.

To see just the first section, which is the default, run ioh.sh without any arguments:

sudo ./ioh.sh

To show non-default sections, add them to the command line

sudo ./ioh.sh -d "hist histsz histszw topfile"

A shortcut for all sections is “all”

sudo ./ioh.sh  -d all

Collecting in the background

nohup sudo ./ioh.sh -c 60 -t 86400 &

Runs the collection for 1 day (86400 seconds) collecting every 60 seconds and put raw output into default file name “ioh.out”. The default output file name can be changed with “-o filename” option.

1. Averages:

The displays I/O, ZFS and NFS data for both reads and writes. The data is grouped to try and help easily correlate these different layers First line is date in epoch format

columns

  • MB/s – MB transferred a second
  • avg_ms – average operation time
  • avg_sz_kb – average operation size in kb
  • count – number of operations

example

.
             |      MB/s|     mx_ms| avg_sz_kb|     count
 ------------|----------|----------|----------|--------------------
 R |      io:|  through | average  |  average |      count of operations
 R |     zfs:|  put     |latency   | I/O size | 
 R |     nfs:|          |millisec  | in KB    |   
 - 
 W |      io:|          |          |          |
 W | zfssync:|          |          |          |                                         
 W |     zfs:|          |          |          |                                         
 W |     nfs:|          |          |          |                                         
 W |nfssyncD:|          |          |          |                                         
 W |nfssyncF:|          |          |          |                                         

For writes

  • zfssync – these are synchronous writes. THese should mainly be Oracle redo writes.
  • nfs – unstable storage writes
  • nfssyncD – data sync writes
  • nfssyncF – file sync writes

DTrace probes used

  • io:::start/done check for read or write
  • nfs:::op-read-start/op-read-done , nfs:::op-write-start/op-write-done
  • zfs_read:entry/return, zfs_write:entry/return

2. Histograms

latency distribution for i/o, zfs, nfs for reads and writes. These distributions are not normalized by time, ie if ioh.d is outputs once a second then these counts will be equal to the counts in the first section. If ioh.d outputs every 10 seconds, then these values will be 10x higher

  1. Histograms by size for reads and writes

The first column is the average latency for the size of I/O for this line. The second column is the size. The size includes this size and every size lower up till the previous bucket. The goal here is to show the sizes of I/Os and the different latency for different sizes. For an Oralce database with 8k block size, 8k reads will tend to be random where as higher read sizes say will be multiblock requests and represent sequential reads. It’s common to see the 8K reads running slower than the larger reads.

4. Top files

shows the top 5 files for reads and writes. First column is MB/s, then R or W, then IP, then port then filename

Examples and Usage

Idle system

First thing to look at is the MB in and out which answers

  • “how busy is the system?”
  • “is NFS throughput approaching the limits of the NIC?”

In the following example, there is only less than 50KB/s total NFS throughput ( in plus out) thus the system isn’t doing much, and there must be no database activity other than the regular maintenance processes which are always running on a database. To confirm this, one can look at the top files at the bottom and see that the only activity is on the control files which are read and written to as part of database system maintenance. Otherwise there is no activity to speak of, so no reason look at I/O latency in this case. Additionally, all majority of what little I/O is in 16K sizes which is typical of control file activity, where as the default database data block activity is in 8K sizes. Most read I/O is coming from ZFS appliance cache as its 64 micro seconds.

date: 1335282646 , 24/3/2012 15:50:46
TCP  out:  0.016 MB/s, in:  0.030 MB/s, retrans:        MB/s  ip discards:
----------------
            |       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|           |          |          |        .
R |     zfs:|     0.016 |     0.01 |    1.298 |       13
R |     nfs:|     0.016 |     0.10 |   16.000 |        1
- 
W |      io:|     0.365 |     4.59 |    9.590 |       39
W | zfssync:|     0.031 |    14.49 |   16.000 |        2
W |     zfs:|     0.001 |     0.07 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     0.003 |          |          |        .
W |nfssyncF:|     0.028 |    14.33 |   14.400 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    .5s+
R        io      .
R       zfs     60     5
R       nfs      .     .     5
-
W        io      .     .     .    20    43    60    11    11     8    28    17     1
W       zfs      2     8     5     2
W   zfssync      .     .     .     .     .     .     2     .     2     5     1     1
W       nfs      .     .     .     .     .     .     2     .     2     5     1     1
--- NFS latency by size ---------
    ms   size_kb
R   0.1   16     .     .     5
-
W          8     .     .     .     .     .     .     .     .     .     1     .     1
W  16.0   16     .     .     .     .     .     .     2     .     2     4     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.00MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s  172.16.103.133 59394 /domain0/group0/vdb13/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s   172.16.100.69 39682 /domain0/group0/vdb14/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
-
R  0.00MB/s   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01MB/s  172.16.100.102 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
IOPs          39

Active System Below is an example of an active system. Looking at TCP bytes in and out, there is a fair bit 3MB/s out and 2MB/s in. These rates are a long way from saturating 1GbE but there is activity going on.

READs all reads are coming out of the cache. How do we know? For one the average ms latency is 0.07, or 70 micro seconds. Does thhis 70us include slower reads that might be off disk? Looking at the histogram, one can see that the slowest zfs I/O is under 100us and looking just above at the I/O histogram there are no I/Os thus all the I/O is coming from cache.

Writes Writes are pretty slow. Oracle Redo writes on good systems are typically 3ms or liess for small redo. Unfortunately most of the I/O is coming from datafile writes so it’s difficult to tell what the redo write times are. (maybe worth enhancing ioh.d to show average latency by file) Typically the redo does “nfssyncD” writes and datafile writes are simply unstable storage writes “nfs” writes that get sync at a later date. This particular database is using the Oracle parameter “filesystemio_options=setall” which implements direct I/O. Direct I/O can work without sync writes but the implementation depends on the OS. This O/S implementation, OpenSolaris, causes all Direct I/O writes to by sync writes.

date: 1335284387 , 24/3/2012 16:19:47
TCP out:  3.469 MB/s, in:  2.185 MB/s, retrans:        MB/s  ip discards:
----------------
            ||         |           |          |          o       MB/s|    avg_ms| avg_sz_kb|     count
------------|-----------|----------|----------|--------------------
R |      io:|           |          |          |        .
R |     zfs:|     3.387 |     0.03 |    7.793 |      445
R |     nfs:|     3.384 |     0.07 |    8.022 |      432
- 
W |      io:|     4.821 |    12.08 |   24.198 |      204
W | zfssync:|     1.935 |    38.50 |   11.385 |      174
W |     zfs:|     0.001 |     0.06 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     1.906 |    39.06 |   11.416 |      171
W |nfssyncF:|     0.028 |    14.59 |   14.400 |        2
---- histograms  -------
    area r_w   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .3s+
R        io      .
R       zfs   2185    34     5     .     1
R       nfs    903  1201    47     8     1
-
W        io      .     .    19   142   143    46    42   108   240   212    57    12     1
W       zfs     13     3     1
W   zfssync      .     .     .     .    10     6     .    21    60   384   287    86    16
W       nfs      .     .     .     .    10     5     .    21    60   384   287    86    16
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8   900  1199    47     7     1
R   0.1   16     3     2     .     1
-
W  17.7    4     .     .     .     .     3     1     .     2     5     3     3
W  41.1    8     .     .     .     .     3     .     .    13    35   292   231    76    13
W  34.0   16     .     .     .     .     3     3     .     4    13    61    30     8     2
W  39.0   32     .     .     .     .     .     1     .     .     2    16    14     2     1
W  28.3   64     .     .     .     .     1     .     .     .     2     9     8
W  26.2  128     .     .     .     .     .     .     .     2     3     2     1
W        256     .     .     .     .     .     .     .     .     .     1
---- top files ----
   MB/s             IP  PORT filename
R  0.01   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/system.dbf
R  0.02   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  3.33   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
-
W  0.01   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01  172.16.103.196 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.15   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.30   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo1.log
W  1.46   172.16.100.81 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
IOPs         204

ZFS read layer problem

          |      MB/s|    avg_ms|  avg_sz_kb
----------|----------|----------|-----------
R |   io :|    88.480|      4.60|     17.648 
R |  zfs :|    19.740|      8.51|     12.689 
R |  nfs :|    16.562|     22.67|     30.394 

In this case the ZFS I/O 19MB/s is higher than NFS at 16MB/s. Now that could because some thing is accessing the file system locally on ZFS appliance or that ZFS is doing read ahead, so there are possible explanations, but it’s interesting. Second subsystem I/O at 88MB/s is much greater than ZFS I/O at 19MB/s. Again that is notable. Could because there is a scrub going on. (to check for a scrub, run “spool status”, to turn off scrub run “zpool scrub -s domain0″ though the scrub has to be run at some point). Both interesting observations.

Now the more interesting parts. The NFS response time 22ms is almost 3x the average ZFS response time 8ms. On the other hand the average size of NFS I/O is 2.5x the average ZFS I/O size so that might be understandable. The hard part to understand is that the ZFS latency 8ms is twice the latency of subsystem I/O at 4ms yet the average size of the I/O sub-system reads is bigger than the average ZFS read. This doesn’t make any sense.

In this case to hone in the data a bit, it would be worth turning off a scrub if it was running and see what the stats are to eliminate a factor that could be muddying the waters.

But in this case, even without a scrub going, the ZFS latency was 2-3x slower than the I/O subsystem latency.

It turns out ZFS wasn’t caching and spending a lot of time trying to keep the ARC clean.

ZFS write layer problem

           |       MB/s|    avg_ms| avg_sz_kb|     count
-----------|-----------|----------|----------|--------------------
W |     io:|     10.921|     23.26|    32.562|       380
W |    zfs:|    127.001|     37.95|     0.199|      8141
W |    nfs:|     0.000 |     0.00 |    0.000 |        0

NFS is 0 MB/s because this was from http traffic. The current version of ioh would show the TCP MB/s. This version also mixed up zfs sync and non-sync writes into one bucket, but much of the ZFS writes have to be non-sync because the write rate is 127MB/s where as the I/O subsystem writes are only 10MB/s thus at least 117MB/s is not sync and if they are not sync they are just memory writes so should be blindingly fast, but they aren’t. The average latency for the ZFS writes is 37ms. All the more shockingly the average size is only 0.199K where as the I/O subsystem writes 32K in 23ms. The case here was that because of disk errors, the ZFS layer was self throttling way to much. This was a bug


Uncategorized

  1. Trackbacks

  2. No trackbacks yet.
  1. Comments

  2. No comments yet.


5 × = forty