
NFS Performance Issues at TCP level

May 6th, 2014

What happens with I/O requests over NFS, and more specifically with Oracle? How does NFS affect performance, and what can be done to improve it?

What happens at the TCP layer when I use dd to transfer an 8K chunk of data to an NFS-mounted file system?

Here is one example:

I do a

dd if=/dev/zero of=foo bs=8k count=1

where my output file is on an NFS mount. Tracing on the server side with DTrace (see tcp.d for the code) shows the TCP sends and receives between the NFS server and the client.

There is a lot of activity in this simple 8K request. What is all the communication? Frankly, at this point I'm not sure. I haven't looked at the contents of the packets, but I'd guess some of it has to do with getting file attributes. Maybe we'll go into those details in future postings.

For now what I'm interested in is throughput and latency, and for the latency, figuring out where the time is being spent.

I'm most interested in latency: what happens to a query's response time when it reads its data over NFS as opposed to from the same disks without NFS? Most NFS blogs seem to address throughput.

Before we jump into the actual network stack and OS operation latencies, let's look at the physics of the data transfer.

If we are on 1GbE we can in theory do up to 122 MB/s, but in practice the maximum is more like 115 MB/s, and for most systems 100 MB/s is a good working number. Thus:

100 MB/s
= 100 KB/ms
= 1 KB per 0.01 ms (i.e., 10 us per KB)

so it takes about 15 us (0.015 ms) to transfer a 1500-byte network packet (the MTU, or maximum transmission unit).

A 1500-byte packet includes TCP/IP framing and only carries 1448 bytes of actual data,

so an 8K block from Oracle takes 5.7 packets, which rounds up to 6 packets.

Each packet takes 15 us, so 6 packets for 8K take 90 us (interestingly, this part of the transfer would drop to about 9 us on 10GbE, if everything worked perfectly).

Now a well-tuned 8K transfer takes about 350 us (from testing, more on this later), so where does the other ~260 us come from? (260 us + 90 us of transfer time = 350 us total latency.)

If I look at the trace above, the total transfer takes 4776 us (about 4.8 ms) from start to finish, but this transfer does a lot of setup.

The actual 8K transfer (5 x 1448-byte packets plus the 1088-byte packet) takes 780 us, or about twice as long as the well-tuned figure.
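
To make the arithmetic concrete, here is a small Python sketch (my own illustration using the round numbers quoted above, not code from the original post) that reproduces the 90 us wire-time estimate and compares it with the measured 350 us and 780 us figures:

import math

# Round numbers from the text: ~100 MB/s usable on 1GbE works out to
# roughly 15 us of wire time per full 1500-byte packet.
US_PER_PACKET = 15      # wire time for one 1500-byte (MTU) packet
PAYLOAD = 1448          # data bytes per packet after TCP/IP framing

block = 8 * 1024        # one 8K Oracle block

packets = math.ceil(block / PAYLOAD)    # 8192 / 1448 = 5.66 -> 6 packets
wire_us = packets * US_PER_PACKET       # 6 * 15 = 90 us of pure wire time

well_tuned_us = 350     # a well-tuned 8K transfer, from testing
observed_us = 780       # the 8K portion of the transfer traced above

print(f"{packets} packets, ~{wire_us} us of wire time")
print(f"well-tuned transfer: {well_tuned_us - wire_us} us beyond wire time")
print(f"this trace: {observed_us / well_tuned_us:.1f}x the well-tuned figure")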

 

Comparing good throughput to bad throughput

I don’t have an answer to the following TCP performance difference between two machines, but thought I’d post about what I’ve seen as I think it’s interesting, and maybe someone else has ideas.
I’m running netio (http://freshmeat.net/projects/netio/) on one machine (OpenSolaris) and contacting two different Linux machines (both on 2.6.18-128.el5), machine A and machine B.
Machine A has a network throughput of 10MB/sec with netio and machine B 100MB/sec with netio. Netio is set to send 32K chunks:

Linux machine: netio -s -b 32k -t -p 1234
Opensolaris: netio -b 32k -t -p 1234 linuxmachine

On the OpenSolaris machine I used DTrace to trace the connections. From DTrace, all the TCP settings for the two connections look the same – same window sizes on receive and send, same ssthresh, same congestion window sizes – but the slow machine is sending an ACK for every 2 or 3 receives whereas the fast machine is sending an ACK every 12 receives.
All three machines are on the same switch.

Here is the DTrace output:

Fast Machine:

delta send   recd
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh
  122 1448 \      195200      7 131768      2 128872  1073725440
   37 1448 \      195200      7 131768      2 128872  1073725440
   20 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   19 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   18 1448 \      195200      7 131768      2 128872  1073725440
   57 1448 \      195200      7 131768      2 128872  1073725440
  171 1448 \      195200      7 131768      2 128872  1073725440
   29  912 \      195200      7 131768      2 128872  1073725440
   30      /    0 195200      7 131768      2 128872  1073725440

Slow Machine:

delta send   recd
 (us) bytes  bytes  swnd snd_ws   rwnd rcv_ws   cwnd    ssthresh
  161      /    0 195200     7 131768      2 127424   1073725440
   52 1448 \      195200     7 131768      2 128872   1073725440
   33 1448 \      195200     7 131768      2 128872   1073725440
   11 1448 \      195200     7 131768      2 128872   1073725440
  143      /    0 195200     7 131768      2 128872   1073725440
   46 1448 \      195200     7 131768      2 130320   1073725440
   31 1448 \      195200     7 131768      2 130320   1073725440
   11 1448 \      195200     7 131768      2 130320   1073725440
  157      /    0 195200     7 131768      2 130320   1073725440
   46 1448 \      195200     7 131768      2 131768   1073725440
   18 1448 \      195200     7 131768      2 131768   1073725440
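
As a rough sanity check (my own back-of-envelope arithmetic, not part of the original post), the inter-packet deltas in these two excerpts already imply most of the throughput gap: the fast machine streams a 1448-byte segment roughly every 18-20 us without pausing, while the slow machine sends two or three segments and then stalls for ~150 us waiting for an ACK.

# Back-of-envelope throughput implied by the send deltas in the excerpts above.
# Short excerpts won't match the sustained netio rates (100 MB/s vs 10 MB/s)
# exactly, but they show how ACK stalls alone account for most of the gap.

SEGMENT = 1448   # data bytes per full TCP segment

def throughput_mb_s(bytes_sent, elapsed_us):
    """MB/s given bytes transferred in elapsed_us microseconds."""
    return bytes_sent / (elapsed_us / 1e6) / 1e6

fast = throughput_mb_s(SEGMENT, 19)                      # one segment every ~19 us
slow = throughput_mb_s(3 * SEGMENT, 46 + 31 + 11 + 150)  # 3 segments, then a ~150 us ACK stall

print(f"fast ~ {fast:.0f} MB/s, slow ~ {slow:.0f} MB/s")  # ~76 vs ~18 MB/s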

DTrace code

(While running this, dtrace reported drops, e.g. "dtrace: 130717 drops on CPU 0", so some events are missing from the output.)
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
dtrace:::BEGIN
{
       TIMER = ( TICKS != NULL ) ?  TICKS : 1 ;
       ticks = TIMER;
       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %8d \ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;

      printf("%6d %8s / %8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
    flag=0;
    title--;
}

Followup: since I did the above, I have changed the dtrace code to include the number of unacknowledged bytes, and it turns out the slow connection does run up its unacknowledged bytes until it hits the congestion window, whereas the fast machine never hits its congestion window:

unack    unack    delta  bytes   bytes       send     receive   cong      ssthresh
bytes    bytes     us     sent    received    window   window    window
sent     received
139760      0     31     1448 \             195200  131768   144800   1073725440
139760      0     33     1448 \             195200  131768   144800   1073725440
144104      0     29     1448 \             195200  131768   146248   1073725440
145552      0     31          / 0           195200  131768   144800   1073725440
145552      0     41     1448 \             195200  131768   147696   1073725440
147000      0     30          / 0           195200  131768   144800   1073725440
147000      0     22     1448 \             195200  131768    76744        72400
147000      0     28          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     26          / 0           195200  131768    76744        72400
147000      0     17     1448 \             195200  131768    76744        72400
147000      0     27          / 0           195200  131768    76744        72400
147000      0     18     1448 \             195200  131768    76744        72400
147000      0     56          / 0           195200  131768    76744        72400
147000      0     22     1448 \             195200  131768    76744        72400

DTrace code:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
inline int TICKS=$1;
inline string ADDR=$$2;
dtrace:::BEGIN
{
       walltime=timestamp;   /* avoid a bogus first delta */
}
tcp:::send
/     ( args[2]->ip_daddr == ADDR || ADDR == NULL ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%6d %6d %6d %8d \ %8s  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}
tcp:::receive
/ ( args[2]->ip_saddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{
      delta=timestamp-walltime;
      walltime=timestamp;
      printf("%6d %6d %6d %8s / %-8d  %8d %8d %8d  %8d %8d %12d %12d %12d %8d %8d  %d  \n",
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        args[3]->tcps_sack_fack,
        args[3]->tcps_sack_snxt,
        args[3]->tcps_rto,
        args[3]->tcps_mss,
        args[3]->tcps_retransmit
      );
}

So the fact that the slow machine appeared to be acknowledging every second or third send was because the receiver was already behind on acknowledging previous packets.

Now the question remains: why did the receiver get so far behind on one machine and not the other?

I've checked the rmem values on both machines and set them to the same values:

net.core.rmem_default=4194304
net.core.rmem_max=4194304

I showed some code above that displayed the send and receive sizes and times over TCP on Solaris 10 with the dtrace command (see tcp.d for the code). I took this code to another machine and got errors like:

"dtrace: error on enabled probe ID 29 (ID 5461: tcp:ip:tcp_input_data:receive): invalid alignment (0xffffff516eb5e83a) in predicate at DIF offset 136"

I'm not quite sure why this was happening, but by a process of elimination I found that accessing args[4] in tcp:::receive caused these errors.

Fortunately many of the values in args[4] are found in args[3] as well.

To find the arguments to tcp:::receive, first run the following command (or alternatively look the TCP probes up on the wiki at http://wikis.sun.com/display/DTrace/tcp+Provider).

(For the command-line flags, see http://compute.cnr.berkeley.edu/cgi-bin/man-cgi?dtrace+7 :

  • -l = list probes instead of enabling them
  • -n = specify the probe name to trace or list
  • -v = set verbose mode

i.e., list verbose information about the named probes without enabling them.)

$ dtrace -lvn tcp:::receive
 5461        tcp                ip                   tcp_input_data receive

        Argument Types
                args[0]: pktinfo_t *
                args[1]: csinfo_t *
                args[2]: ipinfo_t *
                args[3]: tcpsinfo_t *
                args[4]: tcpinfo_t *

(By the way, there are a number of probes that match tcp:::receive, but they all have the same arguments. I didn't notice all these different tcp:::receive probes until I actually ran the above command; before that I'd depended on the wiki. Now I'm wondering what the difference is between some of these tcp:::receive and tcp:::send probes.)

After finding the args for a probe, you can look up the definition of the structs at  http://cvs.opensolaris.org/source/

tcp:::send and tcp:::receive both have the same arguments

  • args[3] is tcpsinfo_t
  • args[4] is tcpinfo_t

Looking up the structs at http://cvs.opensolaris.org/source/, I find the contents of the structs as follows:

tcpsinfo_t  ( args[3] )

typedef struct tcpsinfo {
	uintptr_t tcps_addr;
	int tcps_local;			/* is delivered locally, boolean */
	int tcps_active;		/* active open (from here), boolean */
	uint16_t tcps_lport;		/* local port */
	uint16_t tcps_rport;		/* remote port */
	string tcps_laddr;		/* local address, as a string */
	string tcps_raddr;		/* remote address, as a string */
	int32_t tcps_state;		/* TCP state */
	uint32_t tcps_iss;		/* Initial sequence # sent */
	uint32_t tcps_suna;		/* sequence # sent but unacked */
	uint32_t tcps_snxt;		/* next sequence # to send */
	uint32_t tcps_rack;		/* sequence # we have acked */
	uint32_t tcps_rnxt;		/* next sequence # expected */
	uint32_t tcps_swnd;		/* send window size */
	int32_t tcps_snd_ws;		/* send window scaling */
	uint32_t tcps_rwnd;		/* receive window size */
	int32_t tcps_rcv_ws;		/* receive window scaling */
	uint32_t tcps_cwnd;		/* congestion window */
	uint32_t tcps_cwnd_ssthresh;	/* threshold for congestion avoidance */
	uint32_t tcps_sack_fack;	/* SACK sequence # we have acked */
	uint32_t tcps_sack_snxt;	/* next SACK seq # for retransmission */
	uint32_t tcps_rto;		/* round-trip timeout, msec */
	uint32_t tcps_mss;		/* max segment size */
	int tcps_retransmit;		/* retransmit send event, boolean */
} tcpsinfo_t;

tcpinfo_t (args[4])

typedef struct tcpinfo {
	uint16_t tcp_sport;		/* source port */
	uint16_t tcp_dport;		/* destination port */
	uint32_t tcp_seq;		/* sequence number */
	uint32_t tcp_ack;		/* acknowledgment number */
	uint8_t tcp_offset;		/* data offset, in bytes */
	uint8_t tcp_flags;		/* flags */
	uint16_t tcp_window;		/* window size */
	uint16_t tcp_checksum;		/* checksum */
	uint16_t tcp_urgent;		/* urgent data pointer */
	tcph_t *tcp_hdr;		/* raw TCP header */
} tcpinfo_t;

In the script I accessed the TCP seq and ack in args[4], which was giving me errors in tcp:::receive, so I just switched these for the equivalents in args[3]. I'm not exactly clear on the equivalence, but it seems:

  • args[4]->tcp_seq  ?= args[3]->tcps_suna
  • args[4]->tcp_ack  ?= args[3]->tcps_rack

The ack = tcps_rack mapping seems solid, as tcps_rack is the "highest sequence number for which we have received and sent an acknowledgement".

The seq = tcps_suna mapping is less clear to me, as tcps_suna is the "lowest sequence number for which we have sent data but not received acknowledgement".

But for my purposes these distinctions might be unimportant, as I've stopped looking at seq and ack and now look at the outstanding unacknowledged bytes on the sender and receiver, which are:

  • tcps_snxt - tcps_suna, which gives the number of bytes sent but pending acknowledgement
  • tcps_rnxt - tcps_rack, which gives the number of bytes we have received but not yet acknowledged

But more about that later. Let's look at the new version of the program and its output:

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %12s %s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "seq",
        "ack"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %8d %8d %12s > %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[4]->tcp_seq  -
        args[3]->tcps_suna ,
        args[4]->tcp_ack -
        args[3]->tcps_rack ,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %8d %8d %12s < %s \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_rack % 10000,
        args[3]->tcps_suna % 10000,
        args[3]->tcps_raddr,
        curpsinfo->pr_psargs
      );
    title--;
}

output

starting up ...
    delta      cid    pid     send     recd
      570     3904    845          <-- 0
       34     3904    845          <-- 140
      455     3904    845          <-- 0
       24     3904    845          <-- 0
  4789720     3904    845      124 -->
       82     3904    845      244 -->
       99     3904    845      132 -->
    delta      cid    pid     send     recd
       52     3904    845     1448 -->
       28     3904    845     1448 -->
       24     3904    845     1448 -->
       36     3904    845     1448 -->
       33     3904    845     1448 -->
       26     3904    845      952 -->
       86     3904    845      244 -->
      212     3904    845          <-- 140
      501     3904    845          <-- 132
       60     3904    845      124 -->
      256     3904    845          <-- 140
       72     3904    845          <-- 0
    39658     3904    845          <-- 0

What the heck is that huge time, 4789720 us, i.e. almost 5 seconds? The whole operation took me less than 1 second.
I wouldn't have found the answer to this without help from Adam Leventhal. It turns out the output is ordered only per CPU, but different CPUs can output their data in a different order. Each CPU buffers up data and then passes the buffer back to userland dtrace, so on a one-CPU machine this code will always output in order, but on a multi-CPU machine there is no guarantee on the order of the output. Let's add the CPU # to the output:

#!/usr/sbin/dtrace -s
#pragma D option defaultargs
#pragma D option quiet
inline string ADDR=$$1;
dtrace:::BEGIN
{       TITLE = 10;
       title = 0;
       walltime=timestamp;
       printf("starting up ...\n");
}
tcp:::send, tcp:::receive
/   title == 0 /
{   printf("%9s %8s %6s %8s %8s %4s \n",
        "delta"    ,
        "cid"    ,
        "pid"    ,
        "send" ,
        "recd"  ,
        "cpu#"
      );
     title=TITLE;
}
tcp:::send
/     ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) /
{    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    delta= timestamp-walltime;
    walltime=timestamp;
    printf("%9d %8d %6d %8d --> %8s %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        cpu
      );
    title--;
}
tcp:::receive
/ ( args[3]->tcps_raddr == ADDR || ADDR == NULL ) && nfs[args[1]->cs_cid] /
{     delta=timestamp-walltime;
      walltime=timestamp;
      printf("%9d %8d %6d %8s <-- %-8d %d \n",
        delta/1000,
        args[1]->cs_cid  % 100000,
        args[1]->cs_pid ,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        cpu
      );
    title--;
}

output

  delta      cid    pid     send     recd cpu#
       42     3904    845      244 -->          0
       66     3904    845      124 -->          0
     6091     3904    845      124 -->          2
       81     3904    845      244 -->          2
       96     3904    845      132 -->          2
       31     3904    845     1448 -->          2
       20     3904    845     1448 -->          2
       18     3904    845     1448 -->          2
       17     3904    845     1448 -->          2
       16     3904    845     1448 -->          2
  8910406     3904    845        0 -->          3
      375     3904    845          <-- 0        3
       24     3904    845          <-- 140      3
       34     3904    845        0 -->          3
      470     3904    845          <-- 140      3
      410     3904    845          <-- 132      3
    delta      cid    pid     send     recd cpu#
      491     3904    845          <-- 140      3
      393     3904    845          <-- 0        3
       15     3904    845      952 -->          3
       36     3904    845          <-- 0        3
    delta      cid    pid     send     recd cpu#
       19     3904    845          <-- 0        3
    40167     3904    845          <-- 0        3

What we see is the data ordered by CPU. In other words, for each CPU the data is ordered, but which CPU gets printed first is unknown. In dtrace each CPU buffers up its data and then sends it to the userland dtrace process.
The only “fix” for now is to add a timestamp and order the data by timestamp.
Unsorted, it looks like:

607858102997348       281     3904    845      124 -->          2
607858103608856        84     3904    845      244 -->          2
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858102716187  11569935     3904    845          <-- 132      3
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

Sorted, it looks like:

607858102716187  11569935     3904    845          <-- 132      3
607858102997348       281     3904    845      124 -->          2
607858103245377       248     3904    845          <-- 0        3
607858103282421        37     3904    845          <-- 140      3
607858103339076        56     3904    845      244 -->          3
607858103524093       185     3904    845          <-- 140      3
607858103608856        84     3904    845      244 -->          2
607858103774417       165     3904    845          <-- 132      3
607858103823145        48     3904    845      124 -->          3
607858104027216       204     3904    845          <-- 140      3
607858104125414     delta      cid    pid     send     recd cpu#
607858104143731       116     3904    845      132 -->          2
607858104176769        33     3904    845     1448 -->          2
607858104198187        21     3904    845     1448 -->          2
607858104215813        17     3904    845     1448 -->          2
607858104233004        17     3904    845     1448 -->          2
607858104267629        34     3904    845     1448 -->          2
607858104287379        19     3904    845      952 -->          2
607858104387780       100     3904    845          <-- 0        3
607858104401487        13     3904    845          <-- 0        3
607858104520815       119     3904    845          <-- 0        3
607858144436175     delta      cid    pid     send     recd cpu#
607858144454625     39933     3904    845          <-- 0        3

So now the strange long time delta is at the beginning, where I'd expect it.

I'm not quite sure how to deal with this. Post-processing the data by sorting on the timestamp column works, but interactively processing the data to get it in the right order as it comes out seems problematic.
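
For the post-processing route, here is a minimal Python sketch (my own, assuming the script prints the raw timestamp as the first column as above and its output has been piped or saved to a file) that just sorts on that column:

import sys

# Sort dtrace output lines on the leading timestamp column so that events
# buffered on different CPUs interleave in true chronological order.
# Lines that don't start with a number (e.g. the "starting up ..." message)
# are dropped; header lines carry a timestamp and simply sort in with the data.

def ts(line):
    return int(line.split()[0])

lines = [l for l in sys.stdin if l.split() and l.split()[0].isdigit()]
for line in sorted(lines, key=ts):
    sys.stdout.write(line)

Saved as, say, tsort.py (a hypothetical name), it can be run as ./tcp_cpu.d | python tsort.py, at the cost of giving up interactive output.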

 

Jumbo Frames

Jumbo frames can have a significant impact on latency, though they are often difficult to implement. Enabling jumbo frames on a machine can cause its network communications to hang if another machine or a switch on that connection doesn't support jumbo frames, and modifying switches to support jumbo frames generally requires taking the switch offline, which may not be feasible.

Below is a comparison between the standard MTU of 1500 bytes and a jumbo frame MTU of 9000 bytes. In this case the jumbo frame connection is twice as fast.

[Figure: transfer times with the standard 1500-byte MTU vs a 9000-byte jumbo frame MTU]
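
A back-of-envelope way to see why (my own arithmetic, not taken from the measurement above): with a 9000-byte MTU an 8K Oracle block fits in a single segment instead of six, so there is one header, one interrupt and one trip through the TCP/IP stack per block rather than six.

import math

# Per-packet overhead for one 8K block at the two MTUs.
# 52 bytes is the TCP/IP framing implied by the traces above (1500 - 1448).
TCP_IP_OVERHEAD = 52
block = 8 * 1024

for mtu in (1500, 9000):
    payload = mtu - TCP_IP_OVERHEAD
    packets = math.ceil(block / payload)
    print(f"MTU {mtu}: {packets} packet(s), "
          f"{packets * TCP_IP_OVERHEAD} header bytes, "
          f"{packets} trip(s) through the stack")
# MTU 1500: 6 packets; MTU 9000: 1 packet. The reduction in per-packet
# processing, rather than raw wire time, is presumably where most of the
# improvement comes from.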


TCP Trace Analysis for NFS

August 23rd, 2013


How do we know where latency comes from when there is a disparity between the I/O latency reported by the I/O subsystem and the latency reported on the client box requesting the I/O?

For example, if I have an Oracle database requesting I/O, and Oracle says an 8K request takes 50 ms yet the I/O storage subsystem says 8K I/Os are taking 1 ms (on average), then where do the extra 49 ms come from?

When the I/O subsystem is connected to Oracle via NFS, there are a lot of layers that could be causing the extra latency.

[Figure: the layers between Oracle's timing of pread on the NFS client and the I/O subsystem behind the NFS server]

Where does the difference in latency between the NFS server and Oracle's timing of pread come from?

One strategy is to take TCP traces on the NFS server and on the NFS client (the Oracle host) at the same time, during a period of load that shows the disparity.

The two trace files can then be analyzed to show the latency seen on each side and the latency delta for packets found in both traces.

 

Files Required

 

The parsing script (thanks to Matt Amdur from Delphix for the core code) is:

https://github.com/khailey/tcpdump/blob/master/parsetcp.pl

The script requires access to the tethereal command, which comes with the wireshark install on Linux:

      yum install wireshark

Example situation

 

For example, in the following table stats were collected on the NFS server (running OpenSolaris) with DTrace and on the Oracle database using oramon.sh. The stats show the average latency for 8K read requests.

Avg latency of 8K reads, measured in ms:


NFS Server   Oracle
      30.2   126.15
      24.4   103.91
      21.9   117.33
      23.7    96.46
      32      78.43
      25.5    91.94

There is a substantial difference between the latency seen by the NFS server and the latency seen by Oracle. Where is the latency coming from? Is it a problem on the NFS server, the NFS client, or the network? How do we find out? One strategy is to collect TCP dumps on the NFS client and NFS server and compare them.

Setup problem situation

First, set up a situation where there is a load running on the database and there is a discrepancy between the latency on the NFS server and the latency seen by Oracle. During this period of latency discrepancy, collect TCP traces. The traces will be large, so run them for 20 seconds, for example, to limit the size.

 

Collecting TCP dumps

TCP dumps can be collected on Linux with tcpdump or  on Solaris with snoop:

NFS server (Solaris NFS server example)

snoop -q -d aggr0 -s 512 -o nfs_server.cap 172.16.100.102
  • -d : device
  • -o : output file, give “.cap” extension to be recognized automatically by wireshark
  • -s : amount of each packet to keep (snap length)

The NFS client host is the IP address at the end of the command line.

Linux (linux client example)

tcpdump -s 512  host 172.16.100.87 -w nfs_client.cap
  • -s : amount of each packet to keep (snap length)
  • -w : output file, give “.cap” extension to be recognized automatically by wireshark
  • host : IP of NFS server

Analyzing the Traces

 

The trace files can then be analyzed on a Linux machine with wireshark installed by running the script parsetcp.pl against the two traces.

$ ./parsetcp.pl nfs_server.cap nfs_client.cap
 ==================== RAW DATA ============

Parsing NFS server trace: nfs_server.cap
type       avg ms count
   READ : 18.10,  12493  
  WRITE : 30.08,   4918   

Parsing NFS client trace: nfs_client.cap
type       avg ms count  
   READ : 18.74,  25974 
  WRITE : 30.05,  10197   

 ==================== MATCHED DATA  ============

READ
type       avg ms 
nfs server: 18.12
nfs_client: 19.26
     diff :  1.14 
Processed 14010 packets (Matched: 11964 Missed: 2046)

WRITE
type       avg ms  
 server : 30.86    
 client : 31.83
   diff :  0.96 
Processed 5806 packets (Matched: 4391 Missed: 1415)

The script finds the average latency for reads and writes on each end, then tries to match the packets found at both the client and the server, and reports the average read and write latency for the packets that match.
In the above output the read time at the NFS client's receiving end is on par with what is seen on the NFS server.
Thus if the receive time on the NFS client host is on par with the NFS server, and Oracle's latency is much greater than that receive time, then there looks to be some issue on the NFS client host machine, such as heavy memory paging, that causes the latency seen by Oracle to be much higher than that seen in the TCP traces.
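
To illustrate the idea behind the matched-data numbers (a hypothetical sketch of the matching concept only; the real parsetcp.pl drives tethereal and has its own matching logic), suppose each trace has been reduced to request/reply timestamp pairs keyed by the NFS RPC transaction id (XID). The per-side averages, and the difference between them over the XIDs seen in both traces, then fall out directly:

# Hypothetical sketch: given, for each side, a dict of {xid: (request_ts, reply_ts)}
# in seconds extracted from its trace, compute the average latency per side and
# the server/client difference over only the XIDs present in both traces.

def avg_latency_ms(pairs, xids):
    return sum((reply - req) * 1000
               for req, reply in (pairs[x] for x in xids)) / len(xids)

def compare(server_pairs, client_pairs):
    matched = server_pairs.keys() & client_pairs.keys()
    server_ms = avg_latency_ms(server_pairs, matched)
    client_ms = avg_latency_ms(client_pairs, matched)
    print(f"server: {server_ms:.2f} ms  client: {client_ms:.2f} ms  "
          f"diff: {client_ms - server_ms:.2f} ms  matched: {len(matched)}")

# Made-up example: the client sees each reply ~1 ms after the server sent it.
compare({0xa1: (0.000, 0.018), 0xa2: (0.020, 0.049)},
        {0xa1: (0.000, 0.019), 0xa2: (0.020, 0.050)})

A small server/client difference, as in the matched output above, suggests the time is not being lost on the network between the two capture points.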

 

Example problem analysis

 

In the following 2 examples a swingbench load was run on the Oracle database on the NFS client. The database files were mounted via NFS from the NFS server. In one case the NFS client and database host was OpenSolaris, and in the other case it was Linux. In both cases the Oracle database reported much higher latency than the NFS server threads.

Test 1) OpenSolaris is the NFS client

The NFS server is OpenSolaris, and NFS server latency is measured with DTrace using the nfs start/done probes.

The NFS client is a different machine running OpenSolaris. Latency is measured from the Oracle perspective with oramon.sh:


NFS Server   Oracle
       1.4    29.92
       1.8    32.62
       1.1    50.38
       1.1    56.38
       1.5    65.45
       0.9    46.88
       0.7    45.13
       1.1    44.93
       1      39.48
       0.9    55.8

 

Test 1 TCP trace analysis

What do the TCP traces say? Tracing TCP on the NFS server side and on the NFS client side indicates that the slow latencies are coming from the NFS server:

 ==================== Individual HOST DATA ============
Parsing nfs server trace: nfs_server.cap
type       avg ms count   
   READ : 44.60,   7731    

Parsing client trace: client.cap
type       avg ms count    
   READ : 46.54,  15282    

 ==================== MATCHED DATA  ============
READ
type       avg ms   
 server : 48.39,   
 client : 49.42,   
   diff :  1.03,    
Processed 9647 packets (Matched: 5624 Missed: 4023)

Test 1 summary:

The NFS server says it is responding with ~1 ms latency (from DTrace), but the TCP traces show latency more on the order of 48 ms, which is in line with what Oracle sees on the NFS client. Thus the NFS server must be introducing the large latency somehow, and not the client.

 

Test 2) LINUX NFS client

The NFS server is OpenSolaris, and NFS server latency is measured with DTrace using the nfs start/done probes.

The NFS client is a different machine running Linux. Latency is measured from the Oracle perspective with oramon.sh:


NFS Server   Oracle
       0.1    35.73
       0.1    58.9
       0.1    44.88
       0.1    54.43
       0.1    54.57
       0.1    63.92

Again we see the same pattern as in the first test case, i.e. the NFS server says its latency is low while the Oracle database on the NFS client says the latency is high.

TCP trace analysis

==================== Individual HOST DATA ============
Parsing NFS server trace: nfs_server.cap
type       avg ms count   
   READ :  1.17,   9042            

Parsing client trace: client.cap
type       avg ms count    
   READ :  1.49,  21984   

==================== MATCHED DATA  ============
READ
type       avg ms count    
 server :  1.03  
 client :  1.49
   diff :  0.46

Processed 13815 packets (Matched: 8169 Missed: 5646)
Summary of test 2:

In this case the NFS server TCP traces show fast latency, as do the NFS client TCP traces, so the long latency must be introduced somewhere on the client machine and not on the NFS server.

 

Summary of test 1 and test 2

The tests are run with the swingbench benchmark.
The Oracle databases are set with the parameter "filesystemio_options=SETALL", which enables direct I/O, so that all I/O requests from Oracle should become NFS requests to the NFS server.
Roughly, the latency stats for 8K reads originating from Oracle are:

 

                         Oracle on Linux   Oracle on Solaris   latency data source
 NFS Server              0.1 ms            2 ms                dtrace nfs:::op-read-start/op-read-done
 TCP trace NFS Server    1 ms              44 ms               snoop
 TCP trace NFS Client    1.5 ms            45 ms               tcpdump on Linux and snoop on OpenSolaris
 Oracle                  58 ms             47 ms               "db file sequential read" wait (which is basically a timing of pread for 8K random reads)

[Figure: latency measured at each layer for the Linux and OpenSolaris NFS clients]

In the above examples there was nothing outstanding in the vmstat results on the NFS client and the NFS server.

In both cases the problem arose at a high number of swingbench sessions. The latency disparity appeared above a user load of 16 and increased as the user load increased.

It turned out that in the case where the NFS server was introducing the disparity, the maximum number of NFS threads on the server was 16. The solution was to increase the NFS server thread max: sharectl set -p servers=512 nfs.

In the case where the Linux client introduced the disparity, it turned out that the maximum number of outstanding NFS requests was 16. The solution was to raise sunrpc.tcp_slot_table_entries to 128.


The parsetcp.pl script didn't find the answers by itself, but it did identify which side of the connection was at fault, and it would also have identified the network itself if that had been the problem.

One thing to keep in mind is that snoop and tcpdump can be resource intensive and may themselves introduce significant latency. In the above cases, runs with and without TCP tracing enabled showed no more than 1 ms of difference, and the latency being investigated was an order of magnitude greater. In other cases, where there was more I/O between the NFS server and client and/or the transfer sizes were larger (like 1M), the impact of TCP tracing has eclipsed the latency disparity being investigated, and TCP tracing would not have been an appropriate investigation path.

 

