Archive for the ‘Uncategorized’ Category

Want to change the future Amazon RDS performance monitoring?

April 10th, 2019

UPDATE: All slots for this study have been filled.

On the other and would love your feedback. Please send any ideas about what you’d like to see in RDS performance monitoring to me at kylelf at



Have you ever used Amazon RDS console to manage performance on an RDS database and had ideas on how to make it better?  The Amazon UX Research team for AWS is collaborating with me to recruit for an upcoming user research study with Amazon Relational Database Service (RDS) performance monitoring. Sessions will take place between Monday, April 22, 2019 – Friday, April 26, 2019. In this study Amazon is looking to speak to experienced Database Administrators who currently use RDS. The sessions will be conducted remotely on WebEx and will be 1 hour long. As appreciation for your time and feedback, participants will receive a $100 gift card for completing a session (if your company guidelines permit).

If you are interested, sign up for this confidential study here: (Note that the study has limited spots, first come first serve.)

After you sign up, and if you get a slot, the Amazon UX team will send you a WebEx invite for your session.




Uncategorized for DB Load and Active Sessions

April 3rd, 2019 turns out to be a nice solution for collecting, retrieving and displaying multi-dimensional time series data, i.e. the kind of data you get from sampling.

For example, in the database world we have Active Session History (ASH) which at  it’s core tracks

  1. when – timestamp
  2. who – user
  3. command – what SQL are they running
  4. state – are they runnable on CPU or are they waiting and if waiting what are they waiting for like I/O, Lock, latch, buffer space, etc

Collecting this information is pretty easy to store in a relational database as I did when creating S-ASH (Simulated ASH) and Marcin Przepiorowski built upon over the years since, or even store in flatfiles like I did with W-ASH (web enabled ASH).

On the other hand retrieving the data in a way that can be graphed is challenging. To retrieve and display the data we need to transform the data into number time series.

WIth we can store, retrieve and display data by various dimensions as time series.

Just sign up at honeycomb, then start to create a dataset. Pick any application it doesn’t matter and when you hit continue for creating a dataset, you will get a writekey. WIth that writekey you can start sending data to

I’m on a Mac using Python so I just installed with

pip install libhoney


I then connected to a PostgreSQL database in Amazon RDS and looped, running a query to collect the sampled data

       state = 'active' ; "

and putting this in a call to

import libhoney
import psycopg2
import pandas as pd
import time
from time import gmtime, strftime
libhoney.init(writekey="xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", dataset="honeycomb-python-example", debug=True)
conn_string = "host="+ PGHOST +" port="+ "5432" +" dbname="+ PGDATABASE +" user=" + PGUSER  +" password="+ PGPASSWORD
sql_command = "select datname, pid, usename, application_name, client_addr, COALESCE(client_hostname,'unknown'), COALESCE(wait_event_type,'cpu'), COALESCE(wait_event,'cpu'), query from pg_stat_activity where state = 'active' ; "
print (sql_command)
builder = libhoney.Builder()
        while 1 < 2 :
                mytime=strftime("%Y-%m-%d %H:%M:%S", time.localtime())
                print "- " + mytime + " --------------------- "
                cursor = conn.cursor()
                for row in cursor:
                        query=row[8].replace('\n', ' ')
                        if group != "cpu" :
                                event= group + ":" + event
                        print '{0:10s} {1:15s} {2:15s} {3:15s} {4:40.40s}'.format(un,ip,group,event,query)
                        ev = builder.new_event()
                        ev.add_field( "hostname", ip)
                        ev.add_field( "user", un)
                        ev.add_field( "event", event)
                        ev.add_field( "sql", query)
                        #ev.created_at = mytime;

( You might notice the disconnect / connect at the end of the loop. That waste resources but for some reason querying from pg_stat_activity would return the same number of rows if I didn’t disconnect. Disconnecting it worked. For the case of a simple demo I gave up trying to figure out what was going on. This weirdness doesn’t happen for user tables)

On the dashboard page I can choose “events” under “BREAK DOWN” and “count()” under (CALCULATE PER GROUP) and I get a db load chart by wait event. I can further choose to make the graph a stacked graph:


Screen Shot 2019-04-03 at 2.38.47 PM


Now there are some limitations that make this less than a full solution. For one, zooming out cause the granularity to change from graphing a point every second to graphing points every 15, 30 or 60 seconds, yet the count will count all the points in those intervals and there is no way to normalize it by the elasped time i.e. for a granularity of 60 seconds it sums up all the points in 60 seconds and graphs that value where what I want is to take that 60 second sum and divide by 60 seconds to get the *Average* active sessions in that interval and not the sum.

But over all a fun easy demo to get started with.

I found to respond quickly to emails and they have a slack channel where folks were asking and answering questions that was responsive as well.



Amazon RDS cluster dashboard with Performance Insights

March 28th, 2019

Amazon RDS Performance Insights (PI) doesn’t have a single pane of glass dashboard for clusters, yet. Currently PI has a dashboard that has to be looked at for each instance in a cluster.

On the other hand one can create a simple cluster dashboard using Cloudwatch.

PI, when enabled, automatically sends three metrics to Cloudwatch every minute.

These metrics are

  1. DBLoad
  2. DBLoadCPU
  3. DBLoadNonCPU

DBLoad = DBLoadCPU + DBLoadNonCPU

These metrics are measured in units of Average Active Sessions (AAS). AAS is like the run queue on the UNIX top command except at the database level. AAS is the average number of SQL queries running concurrently in the database. In the case of DB Load AAS,  the average is over 1 minute since the metrics are reported each minute, and represents the total average # of SQL queries running concurrently. The DBLoad AAS can be broken down into those queries that are runnable on CPU, which is DBLoadCPU, and those queries that are not ready to run on the CPU because they are waiting for some resource like an I/O to complete, a lock , a latch, or some  resource that can only be accessed in single threaded mode like a latch or buffer.

These metrics can be used to look at the health of the database.

For example we can only have as many SQL running on the CPU as there are vCPUs. If DBLoadCPU goes above the # of vCPUs then we know that some of those queries that are runnable on the CPU are actually waiting for the CPU.

We also know that when DBLoadNonCPU is low or near 0 then the queries are not waiting for resources and can execute. When DBLoadNonCPU goes up significantly then that represents an opportunity to optimize. For example if queries are spending half their time waiting for IO then if we could buffer that IO we could remove the IO wait and theoretically the queries could go twice as fast, doubling throughput.

By looking at DBLoadCPU for each instance in a cluster we can see if the load is well balanced and we can see if the load goes above the maximum CPU resources of the instance which would indicate a CPU bottleneck.

By looking at the ratio or percentage of DBLoadNonCPU to total DBLoad we can see how much time is wasted waiting for resources instead of executing on CPU. By show this percentage for each instance in the cluster in one graph we can see if any particular instance is running into a bottleneck. If so we would want to look the performance insights dashboard for that instance to get more detail about what is happening.

So let’s set up a cluster dashboard using PI data.

Create a RDS database instance with PI enabled :

PI is supported on all RDS Oracle, Aurora PostgreSQL,  RDS PostgreSQL 10 , RDS SQL Server (except 2008) and on the most recent versions of  Aurora MySQL 5.6, RDS MySQL 5.6 & 5.7. See docs for details:

In the case of this blog I created an Aurora MySQL 5.6  cluster with the original writer node and 3 reader nodes. You can create the initial instance with one reader node, then after creation, go to modify instanced and add reader node.

My nodes are shard1, shard2, shard3 and shard4.

After creating or modifying an instance to support PI, navigate to Cloudwatch and create a new dashboard (or add to an existing one)

Screen Shot 2019-03-28 at 12.17.34 PM


after creating a dashboard (or modifying an existing one) add a new widget, click “Add widget” button

Screen Shot 2019-03-28 at 12.22.46 PM

and for this example chose the line graph, the first option on the left of popup:

Screen Shot 2019-03-25 at 6.00.44 PM

at the bottom of the screen enter “DBLoadCPU” into the search field

Screen Shot 2019-03-28 at 12.24.02 PM

hit return and click on “per database metrics”

Screen Shot 2019-03-28 at 12.25.17 PM


My cluster instances are shard1, shard2, shard3 and shard4 so I click those

Screen Shot 2019-03-28 at 12.26.19 PM

and click “Create Widget” in the bottom left

Screen Shot 2019-03-28 at 12.28.46 PM

I got a surprise, as each shard instance was suppose to have same load but can see  something is wrong on shard4. Will investigate that as we go.

For now there are some options on the widget that I want changed. I want the graph to start at 0 (zero) and have a maximum of 4 ,  since my instances have 2vCPUs and I want to be able to look quickly at the graph to know where I’m at without having to read the axis everytime.  My max available CPU load is 2 AAS since I have 2 vCPU. I set the max at 4 so there is some head room to be able to show load about 2.

There is pull down menu in top right of widget. Click it and choose “Edit”

Screen Shot 2019-03-28 at 12.31.25 PM

Click the third tab “Graph options” and enter 0 for min and for max enter a value above the # of vCPUs you have on your instance. I have 2 vCPUs, so I enter 4.

Screen Shot 2019-03-28 at 12.32.14 PM

click update in the bottom right.

Screen Shot 2019-03-28 at 12.34.30 PM

I can see I’m running a pretty good load as shards 1-3 are running  around 1.5 AAS on CPU i.e. our SQL are asking for about 75% of the CPU capacity of the box. ( 100% * (1.5 AAS CPU / 2 vCPU) ).

I also see shard4 has a lower DBLoadCPU. Turns out I had turned off the load to that shard this morning and forgot, so I restarted it.

Now lets add a widget to see how efficient our load is, i.e. what % of the load is waiting instead of being runnable on CPU.

Create a new widget and search for DBLoad and choose the 2 metrics DBLoadNonCPU & DBLoad for all 4 instances. We will use them in a mathematical expression.

Screen Shot 2019-03-28 at 12.40.19 PM

create the widget, then edit it,

uncheck all the metrics

then we click “Add a math expression”

add the expression 100*(DBLoadNonCPU/DBLoad)  for each instance

Screen Shot 2019-03-28 at 12.45.32 PM

Looks like

Screen Shot 2019-03-28 at 12.47.12 PM

You can see I restarted the load on shard4 because the DBLoadCPU has gone up.

Now for the new graph click on title and type “% bottleneck”

edit it and add min 0 and max 100 ( i.e. 0-100% range), now it looks like

Screen Shot 2019-03-28 at 12.49.36 PM

Be sure and click “Save dashboard” so you don’t loose you work.

Now what do we see? well now that I’ve restarted the load on shard4, we see on “DBLoadCPU”,  the DBLoadCPU is pretty evenly balanced.

On “% bottleneck”  we see it’s pretty low except for shard1. To find out what is happening we have to navigate to the PI dashboard for shard1. Looks like shard1 is spending a lot of it’s load waiting on resources.

Let’s go to the PI dashboard for shard1.


Screen Shot 2019-03-28 at 12.58.21 PM


we can see that on the left most of the load was not CPU. CPU is green. All other colors are waiting for resources.

This is the write node and other activity is going on than the reader nodes which are only selects.

On the right hand side we can see CPU load went up so the ratio of Wait load in relation to CPU load and total load went down. This is what we see in the “% bottleneck” widget we made in Cloudwatch.

Now what are those resources that we are waiting on and what changed to make CPU go up? We can see that by exploring the PI dashboard.

For a demo on how to use PI to  identify and solve performance issues see




“delayed commit ok initiated” – Aurora MySQL

January 11th, 2019

“delayed commit ok initiated” –  is a thread state in Aurora MySQL which indicates the thread has started the async commit process but is waiting for it to be ack’d. You will not find this thread state in MySQL as  MySQL  does not use our async commit protocal, it is Aurora MySQL specific. This is  usually the genuine commit time of a transaction.

This is a “state” and not a wait.


CLI for Amazon RDS Performance Insights

December 11th, 2018

Installing CLI on LINUX

1. install PIP

curl -O
python --user

2. install AWS CLI

pip install awscli --upgrade --user

3. configure

aws configure

For “aws configure” you will need

  • AWS Access Key ID:
  • AWS Secret Access Key:

Which you can get by going to the AWS console, going to IMS and creating access key.

Running example

Once “aws” is configured you can run the CLI like

aws \
 pi get-resource-metrics \
 --region us-east-1 \
 --service-type RDS \
 --identifier db-xxxxxx \
 --metric-queries "{\"Metric\": \"db.load.avg\"}" \
 --start-time `expr \`date +%s\` - 60480000 ` \
 --end-time `date +%s` \
 --period-in-seconds 86400

That “—identifier” is for one of my databases, so you will have to change that.
You will also have to modify region if you are accessing a database in a different region

getting json output

export AWS_DEFAULT_OUTPUT="json"





My databases

  • db-YTDU5J5V66X7CXSCVDFD2V3SZM ( Aurora PostgreSQL)

see these blogs for loads on these databases

CPU load last 5 minutes

aws \
 pi get-resource-metrics \
 --region us-east-1 \
 --service-type RDS \
 --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
 --start-time `expr \`date +%s\` - 300 ` \
 --metric-queries '{
      "Metric": "db.load.avg",
      "Filter":{"db.wait_event.type": "CPU"}
      } ' \
 --end-time `date +%s` \
 --period-in-seconds 300

Top SQL by load

aws pi describe-dimension-keys \
    --region us-east-1 \
    --service-type RDS \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
    --start-time `expr \`date +%s\` - 300 ` \
    --end-time `date +%s` \
    --metric db.load.avg \
    --group-by '{"Group":"db.sql"}'

Top Waits by load

aws pi describe-dimension-keys \
    --region us-east-1 \
    --service-type RDS \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
    --start-time `expr \`date +%s\` - 300 ` \
    --end-time `date +%s` \
    --metric db.load.avg \
    --group-by '{"Group":"db.wait_event"}'

Top User by load

aws pi describe-dimension-keys \
    --region us-east-1 \
    --service-type RDS \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
    --start-time `expr \`date +%s\` - 300 ` \
    --end-time `date +%s` \
    --metric db.load.avg \
    --group-by '{"Group":"db.user"}'


"Total": 0.15100671140939598, 
            "Dimensions": {
                "db.sql.db_id": "pi-4101593903", 
                "": "209554B4D97DBF72871AE0854DAD97385D553BAA", 
                "db.sql.tokenized_id": "1F61DDE1D315BB8F4BF198DB219D4180BC1CFE05", 
                "db.sql.statement": "WITH cte AS (\n   SELECT id     \n   FROM   authors \n   LIMIT  1     \n   )\nUPDATE authors s\nSET    email = 'toto' \nFROM   cte\nWHERE =\n\n"

Top SQL by waits grouped

aws pi describe-dimension-keys \
    --region us-east-1 \
    --service-type RDS \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
    --start-time `expr \`date +%s\` - 300 ` \
    --end-time `date +%s` \
    --metric db.load.avg \
    --group-by '{"Group":"db.sql"}' \
    --partition-by '{"Group": "db.wait_event"}'


            "Total": 0.1644295302013423, 
            "Dimensions": {
                "db.sql.db_id": "pi-4101593903", 
                "": "209554B4D97DBF72871AE0854DAD97385D553BAA", 
                "db.sql.tokenized_id": "1F61DDE1D315BB8F4BF198DB219D4180BC1CFE05", 
                "db.sql.statement": "WITH cte AS (\n   SELECT id     \n   FROM   authors \n   LIMIT  1     \n   )\nUPDATE authors s\nSET    email = 'toto' \nFROM   cte\nWHERE =\n\n"
            "Partitions": [
"PartitionKeys": [
            "Dimensions": {
                "db.wait_event.type": "CPU", 
                "": "CPU"
            "Dimensions": {
                "db.wait_event.type": "IO", 
                "": "IO:XactSync"
            "Dimensions": {
                "db.wait_event.type": "Lock", 
                "": "Lock:transactionid"

Top SQL over last 5 minutes based on CPU

 aws pi describe-dimension-keys     \
    --region us-east-1     \
    --service-type RDS     \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM     \
    --start-time `expr \`date +%s\` - 300 `     \
    --end-time `date +%s`     \
    --metric db.load.avg     \
    --group-by '{"Group":"db.sql"}'     \
    --filter '{"db.wait_event.type": "CPU"}'
"Total": 0.003355704697986577,
"Dimensions": {
"db.sql.db_id": "pi-4101593903",
"": "209554B4D97DBF72871AE0854DAD97385D553BAA",
"db.sql.tokenized_id": "1F61DDE1D315BB8F4BF198DB219D4180BC1CFE05",
"db.sql.statement": "WITH cte AS (\n SELECT id \n FROM authors \n LIMIT 1 \n )\nUPDATE authors s\nSET email = 'toto' \nFROM cte\nWHERE =\n\n"

load over last 5 minutes based on CPU

aws \
 pi get-resource-metrics \
 --region us-east-1 \
 --service-type RDS \
 --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
 --start-time `expr \`date +%s\` - 300 ` \
 --metric-queries '{
      "Metric": "db.load.avg",
      "Filter":{"db.wait_event.type": "CPU"}
      } ' \
 --end-time `date +%s` \
 --period-in-seconds 300


Top SQL over last 5 minutes based on CPU

 aws pi describe-dimension-keys     \
    --region us-east-1     \
    --service-type RDS     \
    --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM     \
    --start-time `expr \`date +%s\` - 300 `     \
    --end-time `date +%s`     \
    --metric db.load.avg     \
    --group-by '{"Group":"db.sql"}'     \
    --filter '{"db.wait_event.type": "CPU"}'

alternatively with a partition by waits

aws pi describe-dimension-keys \
 --region us-east-1 \
 --service-type RDS \
 --identifier db-YTDU5J5V66X7CXSCVDFD2V3SZM \
 --start-time `expr \`date +%s\` - 300 ` \
 --end-time `date +%s` \
 --metric db.load.avg \
 --group-by '{"Group":"db.sql"}' \
 --partition-by '{"Group": "db.wait_event"}' \
 --filter '{"db.wait_event.type": "CPU"}'

CLI  with counter metrics

aws \
 pi get-resource-metrics \
 --region us-east-1 \
 --service-type RDS \
 --metric-queries "{\"Metric\": \"db.Transactions.xact_commit.avg\"}" \
 --start-time `expr \`date +%s\` - 3600 ` \
 --end-time `date +%s` \
 --period-in-seconds 60 \


Aurora MySQL synch/mutex/innodb/aurora_lock_thread_slot_futex wait

September 18th, 2018

Thanks to Jeremiah Wilton for the following info:

This wait event indicates that there is a thread which is waiting on an InnoDB record lock. Check your database for conflicting workloads. More information on InnoDB locking can be found here:


In other words, record-level lock conflicts are happening. More than one connection is trying to update the last_login for a particular id in the_table at the same time. Those connections are conflicting and serializing on the record lock for that id. Here’s a query that can help you identify the blocker and waiter for InnoDB record locks in MySQL-family engines. Run this when you see the aurora_lock_thread_slot_futex wait event in Performance Insights. In a future release of Performance Insights, we will automatically generate and display a similar blockers-and-waiters report when Performance Insights detects this event.

select waiting_thread, p1.user waiting_user, waiting_host, it1.trx_query waiting_query,
       ilw.requesting_trx_id waiting_transaction, ilw.blocking_lock_id blocking_lock, il.lock_mode blocking_mode,
       il.lock_type blocking_type, ilw.blocking_trx_id blocking_transaction,
       case it.trx_state when 'LOCK WAIT' then it.trx_state else p.state end blocker_state, il.lock_table locked_table,
       it.trx_mysql_thread_id blocker_thread, p.user blocker_user, blocker_host
from information_schema.innodb_lock_waits ilw
join information_schema.innodb_locks il on ilw.blocking_lock_id = il.lock_id and ilw.blocking_trx_id = il.lock_trx_id
join information_schema.innodb_trx it on ilw.blocking_trx_id = it.trx_id
join information_schema.processlist p on it.trx_mysql_thread_id =
join information_schema.innodb_trx it1 on ilw.requesting_trx_id = it1.trx_id
join information_schema.processlist p1 on it1.trx_mysql_thread_id =;

| waiting_thread | waiting_user | waiting_host        | waiting_query                         | waiting_transaction | blocking_lock      | blocking_mode | blocking_type | blocking_transaction | blocker_state | locked_table         | blocker_thread | blocker_user | blocker_host        |
|           1117 | reinvent     | | UPDATE sbtest8 SET k=k+1 WHERE id=125 | 888017450           | 888017113:88:6:17  | X             | RECORD        | 888017113            | LOCK WAIT     | `sysbench`.`sbtest8` |           1196 | reinvent     | |
|           1117 | reinvent     | | UPDATE sbtest8 SET k=k+1 WHERE id=125 | 888017450           | 888017089:88:6:17  | X             | RECORD        | 888017089            | LOCK WAIT     | `sysbench`.`sbtest8` |           1431 | reinvent     | |
|           1117 | reinvent     | | UPDATE sbtest8 SET k=k+1 WHERE id=125 | 888017450           | 888015342:88:6:17  | X             | RECORD        | 888015342            | LOCK WAIT     | `sysbench`.`sbtest8` |           1680 | reinvent     | |

Also the following:

  r.trx_id waiting_trx_id,
  r.trx_mysql_thread_id waiting_thread,
  r.trx_query waiting_query,
  b.trx_id blocking_trx_id,
  b.trx_mysql_thread_id blocking_thread,
  b.trx_query blocking_query
FROM       information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b
  ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r
  ON r.trx_id = w.requesting_trx_id;
| waiting_trx_id | waiting_thread | waiting_query                          | blocking_trx_id | blocking_thread | blocking_query                         |
| 917169041      |           2822 | UPDATE sbtest5 SET k=k+1 WHERE id=126  | 917169007       |            2296 | UPDATE sbtest5 SET k=k+1 WHERE id=126  |
| 917169041      |           2822 | UPDATE sbtest5 SET k=k+1 WHERE id=126  | 917168488       |            2214 | UPDATE sbtest5 SET k=k+1 WHERE id=126  |
| 917169025      |           3069 | UPDATE sbtest2 SET k=k+1 WHERE id=125  | 917168945       |            2700 | UPDATE sbtest2 SET k=k+1 WHERE id=125  |

see AWS forum post at


Is NFS on ZFS slowing you down?

January 25th, 2018

If you think so, check out shell script “” from github at

Introduction and Goals

The goal of is to measure both the throughput and latency of the different code layers when using NFS mounts on a ZFS appliance. The ZFS appliance code layers inspected with the script are I/O from the disks, ZFS layer and the NFS layer. For each of these layers the script measures the throughput, latency and average I/O size. Some of the layers are further broken down into other layers. For example NFS writes are broken down into data sync, file sync and non-sync operations and NFS reads are broken down into cached data reads and reads that have to go to disk.

The primary three questions ioh is used to answer are

  • Is I/O latency from the I/O subsystem to ZFS appliance sufficiently fast?
  • Is NFS latency from ZFS appliance to the consumer sufficiently fast?
  • Is ZFS adding unusual latency

One: If the latency from the I/O subsystem is not adequate then look into supplying better performing I/O subsystem for ZFS appliance. For example if the goal is 3ms write times per 1K redo write but the underlying I/O subsystem is taking 6ms, then it will be impossible for ZFS appliance to meet those expectations.

Two: If the latency for NFS response from ZFS appliance is adequate and yet the NFS client reports latencies as much slower (more than 2ms slower) then one should look instead at problems in the NIC, network or NFS client host, see network tracing, example

Three: If the I/O latency is sufficiently fast but ZFS latency is slow, then this could indicate a problem in the ZFS layer.

The answer to the question “what is adequate I/O latency” depends. In general a single random 8 Kb block read on Oracle is expected to take 3-12 ms on average, thus the typical latency is around 7.5 ms. NOTE: when measuring I/O latency on the source system it’s important to use a tool like “iostat” that will show the actually I/Os to the subsystem. The I/O measured by the Oracle database will include both I/Os satisfied from the host file system cache as well as the I/O subsystem unless the database is running with direct I/O

The ioh tool can also give insight into other useful information such as

  • Are IOPs getting near the supported IOPs of the underlying I/O subsystem
  • is NFS throughput getting near the maximum bandwidth of the NIC?”

For example if the NIC is 1GbE then the maximum bandwidth is about 115MB/s, and generally 100MB/s is a good rule of thumb for the max. If throughput is consistently near the NIC maximum, then demand is probably going above maximum and thus increasing latency

$ -h

usage: ./ options

collects I/O related dtrace information into file "ioh.out"
and displays the

  -h              Show this message
  -t  seconds     runtime in seconds, defaults to forever
  -c  seconds     cycle time ie time between collections, defaults to 1 second
  -f  filename    change the output file name [defautls to ioh.out]
  -p              parse the data from output file only,  don't run collection
  -d  display     optional extra data to show: [hist|histsz|histszw|topfile|all]
                    hist    - latency histogram for NFS,ZFS and IO both reads and writes
                    histsz  - latency histogram by size for NFS reads
                    histszw - latency histogram by size for NFS writes
                    topfile - top files accessed by NFS
                    all     - all the above options
           -d "histsz topfile"

two optional environment variables CWIDTH – histogram column width PAD – character between columns in the histograms, null by default


$ sudo ./ 

Outputs to the screen and put raw output into default file name “ioh.out.[date]”. The default output file name can be changed with “-o filename” option. the raw output can later be formatted with

$ ./ -p  ioh.out.2012_10_30_10:49:27

By default it will look for “ioh.out”. If the raw data is in a different file name it can be specified with “-o filename”

The output looks like

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
            |       MB/s|    avg_ms| avg_sz_kb|     count
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
C |   nfs_c:|           |          |          |        .
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2

The sections are broken down into

  • Header with date and TCP throughput
  • Reads
  • Writes

Reads and Writes are are further broken down into

  • io
  • zfs
  • nfs

For writes, the non stable storage writes are separated from the writes to stable storage which are marked as “sync” writes. For NFS the sync writes are further broken down into “data” and “file” sync writes.


The following will refresh the display every 10 seconds and display an extra four sections of data

$ sudo ./ -c 10 -d "hist histsz histszw topfile"   

date: 1335282287 , 24/3/2012 15:44:47
TCP out:  8.107 MB/s, in:  5.239 MB/s, retrans:        MB/s  ip discards:
            |       MB/s|    avg_ms| avg_sz_kb|     count
R |      io:|     0.005 |    24.01 |    4.899 |        1
R |     zfs:|     7.916 |     0.05 |    7.947 |     1020
R |     nfs:|     7.916 |     0.09 |    8.017 |     1011
W |      io:|     9.921 |    11.26 |   32.562 |      312
W | zfssync:|     5.246 |    19.81 |   11.405 |      471
W |     zfs:|     0.001 |     0.05 |    0.199 |        3
W |     nfs:|           |                     |        .
W |nfssyncD:|     5.215 |    19.94 |   11.410 |      468
W |nfssyncF:|     0.031 |    11.48 |   16.000 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    1s    2s    2s+
R        io      .     .     .     .     .     .     .     .     .     1     3     1
R       zfs   4743   287    44    16     4     3     .     .     .     1     2     2
R       nfs      .  2913  2028    89    17     3     .     .     .     1     2     2
W        io      .     .     .    58   249   236    50    63   161   381   261    84    20     1
W       zfs      3    12     2
W   zfssync      .     .     .     .    26   162   258   129   228   562   636   250    75    29
W       nfs      .     .     .     .    12   164   265   134   222   567   637   250    75    29
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8     .  2909  2023    87    17     3     .     .     .     1     2     2
R   0.1   16     .     4     5     2
W   5.0    4     .     .     .     .     8    49    10     3     4    11     4     2     1
W  21.4    8     .     .     .     .     4    55   196    99   152   414   490   199    60    23
W  18.3   16     .     .     .     .     .    34    29    25    43    91    84    28     8     5
W  16.1   32     .     .     .     .     .    19    16     7    14    38    36    15     3
W  19.3   64     .     .     .     .     .     6    11     .     9    11    19     6     2     1
W  20.4  128     .     .     .     .     .     1     3     .     .     2     4     .     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.01MB/s 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.02MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.57MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.70MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo3.log
W  3.93MB/s 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/
R  0.01MB/s 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.05MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  7.84MB/s 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/
IOPs         313


First line is the date Second line is TCP MB per second in,out and retransmitted. The last value is “ip discards”

The three parts are all related and are a drill down starting with course grain data at the top to finer grain data at the bottom.

  • averages – default
  • histograms [hist]
  • histograms by size reads [histsz] writes [histszw] for NFS

The first section is a quick overview.

The second section breaks out the latency into a histogram so one can get an indication of amount of I/O from memory (ie those in microsecond ranges) as well as how far out the outliers are. (are the outliers on the VDBs matching up to the outliers seen on ZFS appliance?)

The third section differentiates between latency of single block random (typically the 8K size) and latency of multi-block sequential reads (32K and higher). The differentiation is important when comparing to Oracle stats which are grouped by single block random reads (called “db file sequential read” ) and sequential multi-block read (called “db file scattered read”).

The final section

top files read and write [topfile]

is sort of a sanity check as there are periods where there is suppose to be little to no NFS I/O and yet there is, so the top file sections tells which file and which host the NFS I/O is coming from.

The last line after all the sections is total IOPs for reads plus writes. (Note these IOPs could get converted to higher values at the storage layer if using RAID5 which will cause each write to be two reads plus two writes.)

The first section, shows up by default. The other sections require command line arguments.

To see just the first section, which is the default, run without any arguments:

sudo ./

To show non-default sections, add them to the command line

sudo ./ -d "hist histsz histszw topfile"

A shortcut for all sections is “all”

sudo ./  -d all

Collecting in the background

nohup sudo ./ -c 60 -t 86400 &

Runs the collection for 1 day (86400 seconds) collecting every 60 seconds and put raw output into default file name “ioh.out”. The default output file name can be changed with “-o filename” option.

1. Averages:

The displays I/O, ZFS and NFS data for both reads and writes. The data is grouped to try and help easily correlate these different layers First line is date in epoch format


  • MB/s – MB transferred a second
  • avg_ms – average operation time
  • avg_sz_kb – average operation size in kb
  • count – number of operations


             |      MB/s|     mx_ms| avg_sz_kb|     count
 R |      io:|  through | average  |  average |      count of operations
 R |     zfs:|  put     |latency   | I/O size | 
 R |     nfs:|          |millisec  | in KB    |   
 W |      io:|          |          |          |
 W | zfssync:|          |          |          |                                         
 W |     zfs:|          |          |          |                                         
 W |     nfs:|          |          |          |                                         
 W |nfssyncD:|          |          |          |                                         
 W |nfssyncF:|          |          |          |                                         

For writes

  • zfssync – these are synchronous writes. THese should mainly be Oracle redo writes.
  • nfs – unstable storage writes
  • nfssyncD – data sync writes
  • nfssyncF – file sync writes

DTrace probes used

  • io:::start/done check for read or write
  • nfs:::op-read-start/op-read-done , nfs:::op-write-start/op-write-done
  • zfs_read:entry/return, zfs_write:entry/return

2. Histograms

latency distribution for i/o, zfs, nfs for reads and writes. These distributions are not normalized by time, ie if ioh.d is outputs once a second then these counts will be equal to the counts in the first section. If ioh.d outputs every 10 seconds, then these values will be 10x higher

  1. Histograms by size for reads and writes

The first column is the average latency for the size of I/O for this line. The second column is the size. The size includes this size and every size lower up till the previous bucket. The goal here is to show the sizes of I/Os and the different latency for different sizes. For an Oralce database with 8k block size, 8k reads will tend to be random where as higher read sizes say will be multiblock requests and represent sequential reads. It’s common to see the 8K reads running slower than the larger reads.

4. Top files

shows the top 5 files for reads and writes. First column is MB/s, then R or W, then IP, then port then filename

Examples and Usage

Idle system

First thing to look at is the MB in and out which answers

  • “how busy is the system?”
  • “is NFS throughput approaching the limits of the NIC?”

In the following example, there is only less than 50KB/s total NFS throughput ( in plus out) thus the system isn’t doing much, and there must be no database activity other than the regular maintenance processes which are always running on a database. To confirm this, one can look at the top files at the bottom and see that the only activity is on the control files which are read and written to as part of database system maintenance. Otherwise there is no activity to speak of, so no reason look at I/O latency in this case. Additionally, all majority of what little I/O is in 16K sizes which is typical of control file activity, where as the default database data block activity is in 8K sizes. Most read I/O is coming from ZFS appliance cache as its 64 micro seconds.

date: 1335282646 , 24/3/2012 15:50:46
TCP  out:  0.016 MB/s, in:  0.030 MB/s, retrans:        MB/s  ip discards:
            |       MB/s|    avg_ms| avg_sz_kb|     count
R |      io:|           |          |          |        .
R |     zfs:|     0.016 |     0.01 |    1.298 |       13
R |     nfs:|     0.016 |     0.10 |   16.000 |        1
W |      io:|     0.365 |     4.59 |    9.590 |       39
W | zfssync:|     0.031 |    14.49 |   16.000 |        2
W |     zfs:|     0.001 |     0.07 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     0.003 |          |          |        .
W |nfssyncF:|     0.028 |    14.33 |   14.400 |        2
---- histograms  -------
    area r_w   32u   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .5s    .5s+
R        io      .
R       zfs     60     5
R       nfs      .     .     5
W        io      .     .     .    20    43    60    11    11     8    28    17     1
W       zfs      2     8     5     2
W   zfssync      .     .     .     .     .     .     2     .     2     5     1     1
W       nfs      .     .     .     .     .     .     2     .     2     5     1     1
--- NFS latency by size ---------
    ms   size_kb
R   0.1   16     .     .     5
W          8     .     .     .     .     .     .     .     .     .     1     .     1
W  16.0   16     .     .     .     .     .     .     2     .     2     4     1
---- top files ----
   MB/s                  IP  PORT filename
W  0.00MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01MB/s 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s 59394 /domain0/group0/vdb13/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s 39682 /domain0/group0/vdb14/datafile/home/oracle/oradata/kyle/control01.ora
W  0.01MB/s 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.00MB/s 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01MB/s 39938 /domain0/group0/vdb12/datafile/home/oracle/oradata/kyle/control01.ora
R  0.01MB/s 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
IOPs          39

Active System Below is an example of an active system. Looking at TCP bytes in and out, there is a fair bit 3MB/s out and 2MB/s in. These rates are a long way from saturating 1GbE but there is activity going on.

READs all reads are coming out of the cache. How do we know? For one the average ms latency is 0.07, or 70 micro seconds. Does thhis 70us include slower reads that might be off disk? Looking at the histogram, one can see that the slowest zfs I/O is under 100us and looking just above at the I/O histogram there are no I/Os thus all the I/O is coming from cache.

Writes Writes are pretty slow. Oracle Redo writes on good systems are typically 3ms or liess for small redo. Unfortunately most of the I/O is coming from datafile writes so it’s difficult to tell what the redo write times are. (maybe worth enhancing ioh.d to show average latency by file) Typically the redo does “nfssyncD” writes and datafile writes are simply unstable storage writes “nfs” writes that get sync at a later date. This particular database is using the Oracle parameter “filesystemio_options=setall” which implements direct I/O. Direct I/O can work without sync writes but the implementation depends on the OS. This O/S implementation, OpenSolaris, causes all Direct I/O writes to by sync writes.

date: 1335284387 , 24/3/2012 16:19:47
TCP out:  3.469 MB/s, in:  2.185 MB/s, retrans:        MB/s  ip discards:
            ||         |           |          |          o       MB/s|    avg_ms| avg_sz_kb|     count
R |      io:|           |          |          |        .
R |     zfs:|     3.387 |     0.03 |    7.793 |      445
R |     nfs:|     3.384 |     0.07 |    8.022 |      432
W |      io:|     4.821 |    12.08 |   24.198 |      204
W | zfssync:|     1.935 |    38.50 |   11.385 |      174
W |     zfs:|     0.001 |     0.06 |    0.199 |        3
W |     nfs:|           |          |          |        .
W |nfssyncD:|     1.906 |    39.06 |   11.416 |      171
W |nfssyncF:|     0.028 |    14.59 |   14.400 |        2
---- histograms  -------
    area r_w   64u   .1m   .2m   .5m    1m    2m    4m    8m   16m   33m   65m    .1s   .3s   .3s+
R        io      .
R       zfs   2185    34     5     .     1
R       nfs    903  1201    47     8     1
W        io      .     .    19   142   143    46    42   108   240   212    57    12     1
W       zfs     13     3     1
W   zfssync      .     .     .     .    10     6     .    21    60   384   287    86    16
W       nfs      .     .     .     .    10     5     .    21    60   384   287    86    16
--- NFS latency by size ---------
    ms   size_kb
R   0.1    8   900  1199    47     7     1
R   0.1   16     3     2     .     1
W  17.7    4     .     .     .     .     3     1     .     2     5     3     3
W  41.1    8     .     .     .     .     3     .     .    13    35   292   231    76    13
W  34.0   16     .     .     .     .     3     3     .     4    13    61    30     8     2
W  39.0   32     .     .     .     .     .     1     .     .     2    16    14     2     1
W  28.3   64     .     .     .     .     1     .     .     .     2     9     8
W  26.2  128     .     .     .     .     .     .     .     2     3     2     1
W        256     .     .     .     .     .     .     .     .     .     1
---- top files ----
   MB/s             IP  PORT filename
R  0.01 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
R  0.01 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
R  0.02 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/system.dbf
R  0.02 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
R  3.33 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
W  0.01 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/control01.ora
W  0.01 52482 /domain0/group0/vdb17/datafile/home/oracle/oradata/swingb/control01.ora
W  0.15 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/undo.dbf
W  0.30 21763 /domain0/group0/vdb16/datafile/export/home/oradata/sol/redo1.log
W  1.46 21763 /domain0/group0/vdb16/datafile/export/home/opt/app/product/dbs/soe.dbf 
IOPs         204

ZFS read layer problem

          |      MB/s|    avg_ms|  avg_sz_kb
R |   io :|    88.480|      4.60|     17.648 
R |  zfs :|    19.740|      8.51|     12.689 
R |  nfs :|    16.562|     22.67|     30.394 

In this case the ZFS I/O 19MB/s is higher than NFS at 16MB/s. Now that could because some thing is accessing the file system locally on ZFS appliance or that ZFS is doing read ahead, so there are possible explanations, but it’s interesting. Second subsystem I/O at 88MB/s is much greater than ZFS I/O at 19MB/s. Again that is notable. Could because there is a scrub going on. (to check for a scrub, run “spool status”, to turn off scrub run “zpool scrub -s domain0″ though the scrub has to be run at some point). Both interesting observations.

Now the more interesting parts. The NFS response time 22ms is almost 3x the average ZFS response time 8ms. On the other hand the average size of NFS I/O is 2.5x the average ZFS I/O size so that might be understandable. The hard part to understand is that the ZFS latency 8ms is twice the latency of subsystem I/O at 4ms yet the average size of the I/O sub-system reads is bigger than the average ZFS read. This doesn’t make any sense.

In this case to hone in the data a bit, it would be worth turning off a scrub if it was running and see what the stats are to eliminate a factor that could be muddying the waters.

But in this case, even without a scrub going, the ZFS latency was 2-3x slower than the I/O subsystem latency.

It turns out ZFS wasn’t caching and spending a lot of time trying to keep the ARC clean.

ZFS write layer problem

           |       MB/s|    avg_ms| avg_sz_kb|     count
W |     io:|     10.921|     23.26|    32.562|       380
W |    zfs:|    127.001|     37.95|     0.199|      8141
W |    nfs:|     0.000 |     0.00 |    0.000 |        0

NFS is 0 MB/s because this was from http traffic. The current version of ioh would show the TCP MB/s. This version also mixed up zfs sync and non-sync writes into one bucket, but much of the ZFS writes have to be non-sync because the write rate is 127MB/s where as the I/O subsystem writes are only 10MB/s thus at least 117MB/s is not sync and if they are not sync they are just memory writes so should be blindingly fast, but they aren’t. The average latency for the ZFS writes is 37ms. All the more shockingly the average size is only 0.199K where as the I/O subsystem writes 32K in 23ms. The case here was that because of disk errors, the ZFS layer was self throttling way to much. This was a bug


IP CIDR rules and address ranges

November 3rd, 2017

I always forget IP address range coverage rules and forget where to look.

It’s the wiki!

and for good reference here is the table:

/32 is for a single address

/24 is for a range in the last place x.x.x.0

/16 is for a range in the last 2 places x.x.0.0

Screen Shot 2017-11-03 at 11.28.31 AM


Best method for tuning sub-optimal execution plans

October 18th, 2017

How do you determine if the Oracle SQL optimizer has created a sub-optimal execution plan? re-run statistics and see what happens? wait for Oracle to find other execution plans? What if neither method is helping? Do you read the execution plan? What do you look at? Differences in actual vs estimated? How successful is that? Look for full table scans?  Do you look at the 10053 trace? How much time and effort does that take?  What do you look at in the 10053 trace. Do you have a systematic methodology  that works in almost all cases?

Well there is a method that is reliable and systematic. It’s laid out in Dan Tow’s book SQL Tuning.

The method is tedious as it requires a lot of manual work to draw join trees, identify constraints and relationships, manual decomposition and execution of every 2 table join in the statement. It can add up to a lot of work but it is systematic and dependable.

Cool thing is it can all be done automatically with a tool called DB Optimizer that now (as I look today) only cost about $400.

If you cost your company $50/hour then in 8 hours of saved work it’s paid for its self. In my experience as a DBA I have serveral SQL a year that take me over a day to optimize manually but that I can get done in a few minutes with DB Optimizer.  Thus with just one hard SQL the tool has paid for itself. The DB Optimizer analysis might run for a couple hours, but afterwords with the data it collects and presents, I can find better tuning path in minutes if it exists.


Here is previous blog post that gives some an overview

Here is a video that explains the method. (Same presentation in a different video.)

Slides from the videos.

Here is Dan Tow’s book SQL Tuning that originally laid out the method.

Here is post by Jonathan Lewis demonstrating the method.

Pick up a copy. I think it’s super cool and interested in feedback on the your experiences.




Oaktable World 2017 @ Oracle Open World

September 29th, 2017

The Oak Table members will be discussing their latest technical obsessions and research on Monday and Tuesday, (Oct. 2nd and 3rd, 2017).  The truth is, folks-  The Oak Table experts are an AWESOME group, (if I don’t say so myself! :)) as we could have easily done another day of incredible sessions, but alas, two days is all we have available for this year’s event.


Screen Shot 2017-09-29 at 11.10.53 AM

Screen Shot 2017-09-29 at 11.11.04 AM