
Covering Indexes in Postgres

June 19th, 2020

Covering indexes are indexes that contain all the columns a query needs so that, in theory, only the index has to be accessed. This seemingly straightforward strategy is much trickier on Postgres.

Let’s take TPC-H query 6:

select
        sum(l_extendedprice * l_discount) as revenue
from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;

We’ve got 3 predicate filter columns:

  • l_shipdate
  • l_discount
  • l_quantity

None of them is that selective on its own, but all three combined are pretty good:

select count(*) from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)

→ 1,823,373

select count(*) from
        lineitem
where l_discount between 0.06 - 0.01 and 0.06 + 0.01;

→ 3,272,056

select count(*) from
        lineitem
where l_quantity < 24;

→ 5,517,389

select count(*) from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;

→ 228,160

select count(*) from lineitem;

→ 11,997,996

228,160/11,997,996 = 1.9%

 

With an index added on all 3 fields, the optimizer doesn’t even use it!

https://explain.depesz.com/s/OG27

If I force the optimizer to use the index (set enable_seqscan=off;), then yes, there is an improvement, but a small one, from ~800 ms to ~600 ms, mainly because the index results still have to go back to the table.

https://explain.depesz.com/s/bJMg

I made a covering index with all the fields in the query and it still goes back to the table!

https://explain.depesz.com/s/92mV
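For reference, the two indexes tried above would look something like this (the index names are mine; the columns come straight from the query):

create index l_ship_disc_qty
    on lineitem (l_shipdate, l_discount, l_quantity);

-- the covering version also carries the column the select list reads,
-- so in principle the query can be answered from the index alone
create index l_ship_disc_qty_price
    on lineitem (l_shipdate, l_discount, l_quantity, l_extendedprice);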

Now, if I do a vacuum, sure enough, the covering index works and we get a 6x speedup (870 ms to 140 ms)!

https://explain.depesz.com/s/fdy5

What does the vacuum have to do with the covering index?

From Jim Nasby:

The issue with covering indexes is that you still have to visit the heap page, unless that page happens to be marked as being all visible. Pages can be marked as all visible by autovacuum (by default runs once 10% of rows in a table have been updated or deleted) or a manual vacuum. They can only be marked visible if all live rows on the page are older than the oldest running transaction in the system.

In other words, effectiveness of this technique is highly dependent on the workload.
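One way to check how much of a table is currently marked all-visible (and thus eligible for index-only scans) is the visibility counters in pg_class, for example:

select relallvisible, relpages
from pg_class
where relname = 'lineitem';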

 

Reference

https://www.percona.com/blog/2018/08/06/basic-understanding-bloat-vacuum-postgresql-mvcc/

TL;DR Postgres keeps undo in the data blocks. Vacuum clears out old, unnecessary rows that are “dead”.

https://bitnine.net/blog-useful-information/index-only-scan-and-visibility-map-12/

“Because there is no information for version in an index tuple, it is impossible to judge whether a tuple is readable in the current transaction or not”

https://blog.makandra.com/2018/11/investigating-slow-postgres-index-only-scans/

A replica with hot_standby_feedback can affect index-only scans.

 


Granularity impact on visualization

May 22nd, 2020

The longer the timeline we show in a graph, the more points we show or the larger the granularity we use.

For example, if I show 1 hour of data, then I can show 60 points at 1-minute granularity.

If I showed a day at 1-minute granularity, that would be 1440 points.

Say we wanted to limit the graph to fewer than 300 points so our UI stays responsive no matter the time scale.

Then for 24 hours I can show 5-minute granularity, which is 288 points.

If I show 25 hours, that’s 300 points at 5-minute granularity, so we are no longer under 300 points.

At this time I’d have to jump to the next level of granularity. In my case the next step up is 1 hour.
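To sketch the arithmetic, this hypothetical Postgres query picks the smallest granularity from my levels (1 minute, 5 minutes, 1 hour) that keeps a 25-hour window under 300 points:

select g.seconds as granularity
from (values (60), (300), (3600)) as g(seconds)  -- 1 min, 5 min, 1 hour
where extract(epoch from interval '25 hours') / g.seconds < 300
order by g.seconds
limit 1;

-- returns 3600: 25 hours at 5-minute granularity is exactly 300 points,
-- so we have to jump up to 1-hour granularity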

Using 1-hour granularity will look different than 5-minute granularity.

The following is roughly a day at 5 minute granularity first and 1 hour granularity second.
[image: the day at 5-minute granularity]

[image: the same day at 1-hour granularity]

The most common effect of larger granularity is losing the spikes, i.e. the 5-minute granularity could have highs and lows that get averaged out at larger granularity.

What is noticeable is that the 1-hour granularity above is still spiky: the highs and lows are roughly the same. The spikiness comes from the fact that the above graphs use points. Each point represents the average value over the width of the granularity, but since we are using points, the actual width of the granularity is not visible. It would be more accurate to use a bar for each “point”, where the bar is the width of the granularity.

 

Here are a couple of examples from the tool DB Optimizer where the width of the bar represents the granularity:

[DB Optimizer screenshots: load charts where each bar’s width matches the granularity]


Jobs at RDS

April 13th, 2020

I’m personally interested in using explain plans to identify missing index opportunities. Let me know directly if that is of interest to you. This could be in the role of a database engineer working on design, or a development engineer working on design and coding.

My manager has a bunch of openings in all sorts of areas, from database performance monitoring to recommendations to proxies.

The dev team I work with on performance monitoring and related work has dev positions open.



Where to add Index on Postgres

March 30th, 2020

 

Summary

The goal of this post is to lay out a methodology that can be implemented to scan explain plans and identify opportunities to optimize SQL execution with additional indexes.

The missing index opportunities outlined here are:

  1. Indexes that would avoid full table scan with a predicate filter that filters out most of the returned rows
  2. Joins that are missing index on the join field
  3. Index range scans that could be made more efficient by addition of a field to existing index

Also for future consideration, but not yet outlined here:

  • Recommending covering index where the addition of fields to an index avoids table access all together
  • Indexes that support the order by clause

The basic procedure is to take the output from the “explain” command, then:

  • In the explain output, only look at nodes that are at least, say, 10-30% or more of the total cost of the whole query.
  • Look for full table scans; these are nodes with the following entry: “Node Type”: “Seq Scan”
  • If the scan is the second node of a nested loop, suggest an index on the join field.
  • If it is the most expensive node of a hash join, suggest an index on the join field.
  • If it is the most expensive node of a merge join, suggest an index on the join field.
  • If it is followed by a filter, suggest an index on the filter and join columns.

If the filter is not the most expensive node in a HJ, or it’s the first node in a NL, and that node is expensive and the filter is expected to return X% (for example 10%) or less of the table after applying the filter, then suggest an index on that filter column.
Additionally, index range scans that could be improved by adding fields to the index will require using “explain analyze”. With explain analyze we can see how many rows were filtered out after the index access.

Get plan with “Explain” and text from SQL statement

EXPLAIN (COSTS true, FORMAT json, verbose true)
select id from hypg1 where id = 2006;
                QUERY PLAN                 
-------------------------------------------
 [                                        +
   {                                      +
     "Plan": {                            +
       "Node Type": "Gather",             +
       "Parallel Aware": false,           +
       "Startup Cost": 1000.00,           +
       "Total Cost": 10633.43,            +
       "Plan Rows": 1,                    +
       "Plan Width": 4,                   +
       "Output": ["id"],                  +
       "Workers Planned": 2,              +
       "Single Copy": false,              +
       "Plans": [                         +
         {                                +
           "Node Type": "Seq Scan",       +
           "Parent Relationship": "Outer",+
           "Parallel Aware": true,        +
           "Relation Name": "hypg1",      +
           "Schema": "public",            +
           "Alias": "hypg1",              +
           "Startup Cost": 0.00,          +
           "Total Cost": 9633.33,         +
           "Plan Rows": 1,                +
           "Plan Width": 4,               +
           "Output": ["id"],              +
           "Filter": "(hypg1.id = 2006)"  +
         }                                +
       ]                                  +
     }                                    +
   }                                      +
 ]

 

Cost in Explain Plan

In all cases we should sort nodes by cost and only look at nodes that take up a certain percentage or more of the cost.
Each node has a “cost”. We need to extract the “Total Cost” at each node in the plan, compare it to the “Total Cost” of the whole query (the top node), and calculate the percentage of the whole-query cost each node represents, only looking at nodes that represent some threshold, say 10% or more, of the total cost. A node’s cost includes the sum of all its indented (child) nodes, so to get the per-node time or cost we have to subtract out the child nodes.

If we can run “explain analyze” we can get the actual time per node. The command “explain analyze” actually runs the query, so I would say we should plan on “explain analyze” only with certain constraints. If the query is already being run multiple times a minute, then running it once more for tuning information is probably worth it. Time behaves just like cost: a node’s time includes the times of its child nodes, so we have to subtract out the child nodes to see the time elapsed at each node.

Here is an example of the costs at nodes, showing that the child nodes are subcomponents of the parent node’s cost:

 

[                                                               +
   {                                                             +
     "Plan": {                                                   +
       "Total Cost": 22308.54,                                   +
       "Plans": [                                                +
         {                                                       +
           "Total Cost": 11675.10,                               +
             {                                                   +
               "Total Cost": 10675.00,                           +
             }                                                   +
         },                                                      +
         {                                                       +
           "Total Cost": 10633.43,                               +
             {                                                   +
               "Total Cost": 9633.33,                            +
             }                                                   +
           ]                                                     +
         }                                                       +
       ]                                                         +
     }                                                           +
   }                                                             +
 ]

For example, subtracting out the child node costs to get the per-node costs:

[                                                               +
   {                                                             +
     "Plan": {                                                   +
       "Total Cost": 22308.54,  --  node cost 22308.54 - (11675.10+10633.43) = 0.01
       "Plans": [                                                +
         {                                                       +
           "Total Cost": 11675.10, --  node cost 11675.10 - 10675.00 = 1000.1
             {                                                   +
               "Total Cost": 10675.00,                           +
             }                                                   +
         },                                                      +
         {                                                       +
           "Total Cost": 10633.43, --  node cost   10633.43 - 9633.33 = 1000.1
             {                                                   +
               "Total Cost": 9633.33,                            +
             }                                                   +
           ]                                                     +
         }                                                       +
       ]                                                         +
     }                                                           +
   }                                                             +
 ]
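As a sketch of how this subtraction could be automated, the following hypothetical query walks a plan that has been saved into a jsonb column (the plans table and its layout are my assumption, not anything Postgres provides) and computes per-node cost as the node’s Total Cost minus the sum of its children’s:

-- assumes: create table plans (id int, plan jsonb);
-- with the json explain output stored in the plan column
with recursive nodes(node) as (
    select plan -> 0 -> 'Plan' from plans where id = 1
    union all
    select child
    from nodes,
         jsonb_array_elements(node -> 'Plans') as child
    where node ? 'Plans'
)
select
    node ->> 'Node Type' as node_type,
    (node ->> 'Total Cost')::numeric as total_cost,
    (node ->> 'Total Cost')::numeric
      - coalesce((select sum((c ->> 'Total Cost')::numeric)
                  from jsonb_array_elements(node -> 'Plans') as c), 0)
      as per_node_cost
from nodes
order by per_node_cost desc;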

Predicate filters missing indexes

For predicate filters that could use indexes, scan the explain plan for the following string

           "Node Type": "Seq Scan",       +

and see if they are followed by a “Filter”.

           "Total Cost": 9633.33,         +
           "Plan Rows": 1,                +
           "Filter": "(hypg1.id = 2006)"  +

Verify that the node is a significant part of the overall cost of the query. This node takes up most of the cost of the plan, 9633.33 out of 10633.43, so it is the most important node to optimize.
We should only look at nodes that take up a certain percentage of the total cost, say 10% or more (the value should be configurable in our code, as it is debatable where the cutoff point is).

We can compare the rows returned (“Plan Rows”) to the total rows in the table statistics.
If the filter eliminates a large number of rows, then it is a candidate.
The following query has to be run in the database the table lives in, so we have to collect the database name as part of the sampling of the load data.

SELECT reltuples FROM pg_class WHERE relname = 'hypg1';

reltuples
-----------
 1e+06

% filtered = 100 * (1e+06 - 1) / 1e+06 = 99.99% of rows are filtered out, so it’s the perfect candidate.

We should only recommend an index if the filter returns 10% or less of the table (i.e. 90% or more of the rows returned from the full table scan are filtered out), and in the advisor recommendation for the index we should indicate how much of the table is returned via the filter. We can grade the recommendations based on the %, for example:
low recommendation: 5%-10% of table returned after applying the filter
medium recommendation: 1%-5% of table returned after applying the filter
high recommendation: < 1% of table returned after applying the filter

The filter column is hypg1.id, thus we suggest an index on hypg1.id.
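As DDL, the recommendation would be something like (the index name is mine):

create index hypg1_id on hypg1 (id);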

Joins missing indexes

Using the following query in the examples below

explain (FORMAT json)
select max (t1.data)
from t1, t2
where t1.id = t2.id
and t1.clus = 10
and t2.val > 1001
;

HASH JOIN

Look for nodes with a “Hash Cond” that are followed by a “Seq Scan”.
If we find a “Seq Scan” preceded by a Hash Join whose cost is a significant amount of the total query, then suggest an index on the join column.
If both nodes of the hash join have Seq Scans, suggest an index on the join column from the more expensive node.
If that node also contains a filter, suggest including the filter column in the index.

There is “Node Type”: “Seq Scan”
on table “Relation Name”: “t2”,
with cost “Total Cost”: 13512.67.
Looking at the hash join above the Seq Scan node, the join condition is

"Hash Cond": "(t2.id = t1.id)",  

suggest index on t2.id

              "Plans": [                                                 +
                 {                                                        +
                   "Node Type": "Hash Join",                              +
                   "Parent Relationship": "Outer",                        +
                   "Parallel Aware": false,                               +
                   "Join Type": "Inner",                                  +
                   "Startup Cost": 11.40,                                 +
                   "Total Cost": 15086.97,                                +
                   "Plan Rows": 41,                                       +
                   "Plan Width": 33,                                      +
                   "Output": ["t1.data"],                                 +
                   "Inner Unique": false,                                 +
                   "Hash Cond": "(t2.id = t1.id)",                        +
                   "Plans": [                                             +
                     {                                                    +
                       "Node Type": "Seq Scan",                           +
                       "Parent Relationship": "Outer",                    +
                       "Parallel Aware": true,                            +
                       "Relation Name": "t2",                             +
                       "Schema": "public",                                +
                       "Alias": "t2",                                     +
                       "Startup Cost": 0.00,                              +
                       "Total Cost": 13512.67,                            +
                       "Plan Rows": 416667,                               +
                       "Plan Width": 4,                                   +
                       "Output": ["t2.id", "t2.clus", "t2.val", "t2.data"]+
                     },

Full explain

The total cost of the query is 41534.23.
The HJ node cost is 40532.27, on condition “Hash Cond”: “(t2.id = t1.id)”.
The most expensive node is the second node of the hash join, “Total Cost”: 24346.00.
The second node, the expensive one, is doing a full scan on “Relation Name”: “t1”.
Thus suggest an index on t1.id, the join condition of the hash join.
The t1 node also contains a filter that returns “Plan Rows”: 5000, i.e. 5000 rows of the 1M in the table (we look up total rows in pg_class). The filter is on “Filter”: “((clus) = 10)”, so we suggest an index on the combined id & clus.

Recommend an index on t1 (id, clus).
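i.e. something like (the index name is mine):

create index t1_id_clus on t1 (id, clus);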

 [                                                        +
   {                                                      +
     "Plan": {                                            +
       "Node Type": "Aggregate",                          +
       "Strategy": "Plain",                               +
       "Partial Mode": "Finalize",                        +
       "Parallel Aware": false,                           +
       "Startup Cost": 41534.23,                          +
       "Total Cost": 41534.24,                            +
       "Plan Rows": 1,                                    +
       "Plan Width": 32,                                  +
       "Plans": [                                         +
         {                                                +
           "Node Type": "Gather",                         +
           "Parent Relationship": "Outer",                +
           "Parallel Aware": false,                       +
           "Startup Cost": 41534.01,                      +
           "Total Cost": 41534.22,                        +
           "Plan Rows": 2,                                +
           "Plan Width": 32,                              +
           "Workers Planned": 2,                          +
           "Single Copy": false,                          +
           "Plans": [                                     +
             {                                            +
               "Node Type": "Aggregate",                  +
               "Strategy": "Plain",                       +
               "Partial Mode": "Partial",                 +
               "Parent Relationship": "Outer",            +
               "Parallel Aware": false,                   +
               "Startup Cost": 40534.01,                  +
               "Total Cost": 40534.02,                    +
               "Plan Rows": 1,                            +
               "Plan Width": 32,                          +
               "Plans": [                                 +
                 {                                        +
                   "Node Type": "Hash Join",              +
                   "Parent Relationship": "Outer",        +
                   "Parallel Aware": false,               +
                   "Join Type": "Inner",                  +
                   "Startup Cost": 24408.50,              +
                   "Total Cost": 40532.27,                +
                   "Plan Rows": 695,                      +
                   "Plan Width": 33,                      +
                   "Inner Unique": false,                 +
                   "Hash Cond": "(t2.id = t1.id)",        +
                   "Plans": [                             +
                     {                                    +
                       "Node Type": "Seq Scan",           +
                       "Parent Relationship": "Outer",    +
                       "Parallel Aware": true,            +
                       "Relation Name": "t2",             +
                       "Alias": "t2",                     +
                       "Startup Cost": 0.00,              +
                       "Total Cost": 15596.00,            +
                       "Plan Rows": 138889,               +
                       "Plan Width": 4,                   +
                       "Filter": "((val + 0) > 1001)"     +
                     },                                   +
                     {                                    +
                       "Node Type": "Hash",               +
                       "Parent Relationship": "Inner",    +
                       "Parallel Aware": false,           +
                       "Startup Cost": 24346.00,          +
                       "Total Cost": 24346.00,            +
                       "Plan Rows": 5000,                 +
                       "Plan Width": 37,                  +
                       "Plans": [                         +
                         {                                +
                           "Node Type": "Seq Scan",       +
                           "Parent Relationship": "Outer",+
                           "Parallel Aware": false,       +
                           "Relation Name": "t1",         +
                           "Alias": "t1",                 +
                           "Startup Cost": 0.00,          +
                           "Total Cost": 24346.00,        +
                           "Plan Rows": 5000,             +
                           "Plan Width": 37,              +
                           "Filter": "((clus) = 10)"      +
                         }                                +
                       ]                                  +
                     }                                    +
                   ]                                      +
                 }                                        +
               ]                                          +
             }                                            +
           ]                                              +
         }                                                +
       ]                                                  +
     }                                                    +
   }                                                      +

Nested Loops

Similarly, if we find a Nested Loops join where the second node has a Seq Scan, look to see if an index can be added to support the join. Only look at Nested Loop nodes that are a significant % of the total query cost.

Most of the cost of the query is in the nested loops join:

          "Node Type": "Nested Loop",        +
  ...
          "Total Cost": 1320240.33,          + 

The first node in the NL selects rows that are then looked up in the second node.
The second node is doing a full table scan:

                     "Node Type": "Seq Scan",       +
                       "Relation Name": "t2",         +
                       "Plan Rows": 1000000,          +

We want to avoid full table scans in the second node of a NL.
Every row from the first node will drive a full table scan of the second node.
The join is on

                   "Join Filter": "(t1.id = t2.id)",  +

So we suggest an index on t2.id

[                                                    +
   {                                                  +
     "Plan": {                                        +
       "Node Type": "Aggregate",                      +
       "Strategy": "Plain",                           +
       "Partial Mode": "Finalize",                    +
       "Parallel Aware": false,                       +
       "Startup Cost": 1321240.65,                    +
       "Total Cost": 1321240.66,                      +
       "Plan Rows": 1,                                +
       "Plan Width": 32,                              +
       "Plans": [                                     +
         {                                            +
           "Node Type": "Gather",                     +
           "Parent Relationship": "Outer",            +
           "Parallel Aware": false,                   +
           "Startup Cost": 1321240.44,                +
           "Total Cost": 1321240.65,                  +
           "Plan Rows": 2,                            +
           "Plan Width": 32,                          +
           "Workers Planned": 2,                      +
           "Single Copy": false,                      +
           "Plans": [                                 +
             {                                        +
               "Node Type": "Aggregate",              +
               "Strategy": "Plain",                   +
               "Partial Mode": "Partial",             +
               "Parent Relationship": "Outer",        +
               "Parallel Aware": false,               +
               "Startup Cost": 1320240.44,            +
               "Total Cost": 1320240.45,              +
               "Plan Rows": 1,                        +
               "Plan Width": 32,                      +
               "Plans": [                             +
                 {                                    +
                   "Node Type": "Nested Loop",        +
                   "Parent Relationship": "Outer",    +
                   "Parallel Aware": false,           +
                   "Join Type": "Inner",              +
                   "Startup Cost": 0.00,              +
                   "Total Cost": 1320240.33,          +
                   "Plan Rows": 41,                   +
                   "Plan Width": 33,                  +
                   "Inner Unique": false,             +
                   "Join Filter": "(t1.id = t2.id)",  +
                   "Plans": [                         +
                     {                                +
                       "Node Type": "Seq Scan",       +
                       "Parent Relationship": "Outer",+
                       "Parallel Aware": true,        +
                       "Relation Name": "t1",         +
                       "Alias": "t1",                 +
                       "Startup Cost": 0.00,          +
                       "Total Cost": 14554.33,        +
                       "Plan Rows": 41,               +
                       "Plan Width": 37,              +
                       "Filter": "(clus = 10)"        +
                     },                               +
                     {                                +
                       "Node Type": "Seq Scan",       +
                       "Parent Relationship": "Inner",+
                       "Parallel Aware": false,       +
                       "Relation Name": "t2",         +
                       "Alias": "t2",                 +
                       "Startup Cost": 0.00,          +
                       "Total Cost": 19346.00,        +
                       "Plan Rows": 1000000,          +
                       "Plan Width": 4                +
                     }                                +
                   ]                                  +
                 }                                    +
               ]                                      +
             }                                        +
           ]                                          +
         }                                            +
       ]                                              +
     }                                                +
   }                                                  +

 

Merge Join

Similar analysis applies for a merge join.

-- to create a merge join example 
 set enable_nestloop to 'off';
 SET enable_hashjoin TO off;

 

explain (FORMAT json)
select max (t1.data)
from t1, t2
where t1.id = t2.id
and t1.clus = 10
and t2.val  > 1001
;

We want to turn the merge join into a NL or HJ.
In order to do that efficiently we need an index on the join column of one of the tables.
We want the index on the table with the larger number of rows.

t1 returns 5000 rows

                           "Relation Name": "t1",         +
                           "Plan Rows": 5000,             +

t2 returns 138889

                          "Relation Name": "t2",         +
                           "Plan Rows": 138889,           +

and join on T2 is on

                   "Merge Cond": "(t2.id = t1.id)",       +

thus suggest index on t2.id

  [                                                        +
   {                                                      +
     "Plan": {                                            +
       "Node Type": "Aggregate",                          +
       "Strategy": "Plain",                               +
       "Partial Mode": "Finalize",                        +
       "Parallel Aware": false,                           +
       "Startup Cost": 55741.63,                          +
       "Total Cost": 55741.64,                            +
       "Plan Rows": 1,                                    +
       "Plan Width": 32,                                  +
       "Plans": [                                         +
         {                                                +
           "Node Type": "Gather",                         +
           "Parent Relationship": "Outer",                +
           "Parallel Aware": false,                       +
           "Startup Cost": 55741.42,                      +
           "Total Cost": 55741.63,                        +
           "Plan Rows": 2,                                +
           "Plan Width": 32,                              +
           "Workers Planned": 2,                          +
           "Single Copy": false,                          +
           "Plans": [                                     +
             {                                            +
               "Node Type": "Aggregate",                  +
               "Strategy": "Plain",                       +
               "Partial Mode": "Partial",                 +
               "Parent Relationship": "Outer",            +
               "Parallel Aware": false,                   +
               "Startup Cost": 54741.42,                  +
               "Total Cost": 54741.43,                    +
               "Plan Rows": 1,                            +
               "Plan Width": 32,                          +
               "Plans": [                                 +
                 {                                        +
                   "Node Type": "Merge Join",             +
                   "Parent Relationship": "Outer",        +
                   "Parallel Aware": false,               +
                   "Join Type": "Inner",                  +
                   "Startup Cost": 54013.29,              +
                   "Total Cost": 54739.68,                +
                   "Plan Rows": 695,                      +
                   "Plan Width": 33,                      +
                   "Inner Unique": false,                 +
                   "Merge Cond": "(t2.id = t1.id)",       +
                   "Plans": [                             +
                     {                                    +
                       "Node Type": "Sort",               +
                       "Parent Relationship": "Outer",    +
                       "Parallel Aware": false,           +
                       "Startup Cost": 29360.10,          +
                       "Total Cost": 29707.32,            +
                       "Plan Rows": 138889,               +
                       "Plan Width": 4,                   +
                       "Sort Key": ["t2.id"],             +
                       "Plans": [                         +
                         {                                +
                           "Node Type": "Seq Scan",       +
                           "Parent Relationship": "Outer",+
                           "Parallel Aware": true,        +
                           "Relation Name": "t2",         +
                           "Alias": "t2",                 +
                           "Startup Cost": 0.00,          +
                           "Total Cost": 15596.00,        +
                           "Plan Rows": 138889,           +
                           "Plan Width": 4,               +
                           "Filter": "((val ) > 1001)" +
                         }                                +
                       ]                                  +
                     },                                   +
                     {                                    +
                       "Node Type": "Sort",               +
                       "Parent Relationship": "Inner",    +
                       "Parallel Aware": false,           +
                       "Startup Cost": 24653.19,          +
                       "Total Cost": 24665.69,            +
                       "Plan Rows": 5000,                 +
                       "Plan Width": 37,                  +
                       "Sort Key": ["t1.id"],             +
                       "Plans": [                         +
                         {                                +
                           "Node Type": "Seq Scan",       +
                           "Parent Relationship": "Outer",+
                           "Parallel Aware": false,       +
                           "Relation Name": "t1",         +
                           "Alias": "t1",                 +
                           "Startup Cost": 0.00,          +
                           "Total Cost": 24346.00,        +
                           "Plan Rows": 5000,             +
                           "Plan Width": 37,              +
                           "Filter": "((clus + 0) = 10)"  +
                         }                                +
                       ]                                  +
                     }                                    +
                   ]                                      +
                 }                                        +
               ]                                          +
             }                                            +
           ]                                              +
         }                                                +
       ]                                                  +
     }                                                    +
   }                                                      +

 

3. Mixed cases – filter and join indexing opportunities

If a filter is missing an index, causing a full scan (i.e. “Seq Scan”), and the table being scanned is also used in a join (hash join, merge join, nested loops join), suggest an index on both the filter column and the join column.
In the following there is a Seq Scan on t2 followed by a filter.
The filter returns an estimated 375K rows, or about 1/3 of the table, i.e. not a good candidate for an index on its own.
The Seq Scan is part of a hash join that represents most of the cost of the query. The hash join is on t2.id.

Total cost of the query: 16973.03
t2 access for the hash join cost: 14554.33
t2 has 1M rows (we have to collect this from the table stats)
access is on t2.id: “Hash Cond”: “(t2.id = t1.id)”

Suggest an index on t2 (id, val) to avoid the full table scan.
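i.e. something like (the index name is mine):

create index t2_id_val on t2 (id, val);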

 [                                                        +
   {                                                      +
     "Plan": {                                            +
       "Node Type": "Aggregate",                          +
       "Strategy": "Plain",                               +
       "Partial Mode": "Finalize",                        +
       "Parallel Aware": false,                           +
       "Startup Cost": 16973.02,                          +
       "Total Cost": 16973.03,                            +
       "Plan Rows": 1,                                    +
       "Plan Width": 32,                                  +
       "Plans": [                                         +
         {                                                +
           "Node Type": "Gather",                         +
           "Parent Relationship": "Outer",                +
           "Parallel Aware": false,                       +
           "Startup Cost": 16972.81,                      +
           "Total Cost": 16973.02,                        +
           "Plan Rows": 2,                                +
           "Plan Width": 32,                              +
           "Workers Planned": 2,                          +
           "Single Copy": false,                          +
           "Plans": [                                     +
             {                                            +
               "Node Type": "Aggregate",                  +
               "Strategy": "Plain",                       +
               "Partial Mode": "Partial",                 +
               "Parent Relationship": "Outer",            +
               "Parallel Aware": false,                   +
               "Startup Cost": 15972.81,                  +
               "Total Cost": 15972.82,                    +
               "Plan Rows": 1,                            +
               "Plan Width": 32,                          +
               "Plans": [                                 +
                 {                                        +
                   "Node Type": "Hash Join",              +
                   "Parent Relationship": "Outer",        +
                   "Parallel Aware": false,               +
                   "Join Type": "Inner",                  +
                   "Startup Cost": 11.40,                 +
                   "Total Cost": 15972.72,                +
                   "Plan Rows": 37,                       +
                   "Plan Width": 33,                      +
                   "Inner Unique": false,                 +
                   "Hash Cond": "(t2.id = t1.id)",        +
                   "Plans": [                             +
                     {                                    +
                       "Node Type": "Seq Scan",           +
                       "Parent Relationship": "Outer",    +
                       "Parallel Aware": true,            +
                       "Relation Name": "t2",             +
                       "Alias": "t2",                     +
                       "Startup Cost": 0.00,              +
                       "Total Cost": 14554.33,            +
                       "Plan Rows": 375098,               +
                       "Plan Width": 4,                   +
                       "Filter": "(val > 1001)"           +
                     },                                   +
                     {                                    +
                       "Node Type": "Hash",               +
                       "Parent Relationship": "Inner",    +
                       "Parallel Aware": false,           +
                       "Startup Cost": 10.16,             +
                       "Total Cost": 10.16,               +
                       "Plan Rows": 99,                   +
                       "Plan Width": 37,                  +
                       "Plans": [                         +
                         {                                +
                           "Node Type": "Index Scan",     +
                           "Parent Relationship": "Outer",+
                           "Parallel Aware": false,       +
                           "Scan Direction": "Forward",   +
                           "Index Name": "t1_clus",       +
                           "Relation Name": "t1",         +
                           "Alias": "t1",                 +
                           "Startup Cost": 0.42,          +
                           "Total Cost": 10.16,           +
                           "Plan Rows": 99,               +
                           "Plan Width": 37,              +
                           "Index Cond": "(clus = 10)"    +
                         }                                +
                       ]                                  +
                     }                                    +
                   ]                                      +
                 }                                        +
               ]                                          +
             }                                            +
           ]                                              +
         }                                                +
       ]                                                  +
     }                                                    +
   }                                                      +

4. Index Range scans

To see if an index needs more columns to be efficient, it looks like it will require “explain analyze” and not just “explain”.

select count(*)
from t1
where t1.clus >= 10 and t1.clus < 15
and t1.val = 1001
;

There is an index on t1.clus and the explain shows it is used,
but we don’t know how many rows were retrieved by the index vs. filtered out by the Filter.

[                                                        +
   {                                                      +
     "Plan": {                                            +
       "Node Type": "Aggregate",                          +
       "Strategy": "Plain",                               +
       "Partial Mode": "Simple",                          +
       "Parallel Aware": false,                           +
       "Startup Cost": 27.14,                             +
       "Total Cost": 27.15,                               +
       "Plan Rows": 1,                                    +
       "Plan Width": 8,                                   +
       "Plans": [                                         +
         {                                                +
           "Node Type": "Index Scan",                     +
           "Parent Relationship": "Outer",                +
           "Parallel Aware": false,                       +
           "Scan Direction": "Forward",                   +
           "Index Name": "t1_clus",                       +
           "Relation Name": "t1",                         +
           "Alias": "t1",                                 +
           "Startup Cost": 0.42,                          +
           "Total Cost": 27.13,                           +
           "Plan Rows": 1,                                +
           "Plan Width": 0,                               +
           "Index Cond": "((clus >= 10) AND (clus < 15))",+
           "Filter": "(val = 1001)"                       +
         }                                                +
       ]                                                  +
     }                                                    +
   }                                                      +

But it doesn’t show how many rows were retrieved by the index and how many were then filtered out by the Filter.
Using “explain analyze” does.
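The plan below was captured with something like:

explain (analyze, format json)
select count(*)
from t1
where t1.clus >= 10 and t1.clus < 15
and t1.val = 1001;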
In the following output from “explain analyze”, only 1 row is returned after the index access and filter, but the filter removes 499 rows:

"Filter": "(val = 1001)",                      +
           "Rows Removed by Filter": 499                  +

Thus we suggest adding val to the existing index:

           "Index Name": "t1_clus",                       +

 

[                                                        +
   {                                                      +
     "Plan": {                                            +
       "Node Type": "Aggregate",                          +
       "Strategy": "Plain",                               +
       "Partial Mode": "Simple",                          +
       "Parallel Aware": false,                           +
       "Startup Cost": 27.14,                             +
       "Total Cost": 27.15,                               +
       "Plan Rows": 1,                                    +
       "Plan Width": 8,                                   +
       "Actual Startup Time": 0.279,                      +
       "Actual Total Time": 0.279,                        +
       "Actual Rows": 1,                                  +
       "Actual Loops": 1,                                 +
       "Plans": [                                         +
         {                                                +
           "Node Type": "Index Scan",                     +
           "Parent Relationship": "Outer",                +
           "Parallel Aware": false,                       +
           "Scan Direction": "Forward",                   +
           "Index Name": "t1_clus",                       +
           "Relation Name": "t1",                         +
           "Alias": "t1",                                 +
           "Startup Cost": 0.42,                          +
           "Total Cost": 27.13,                           +
           "Plan Rows": 1,                                +
           "Plan Width": 0,                               +
           "Actual Startup Time": 0.053,                  +
           "Actual Total Time": 0.273,                    +
           "Actual Rows": 1,                              +
           "Actual Loops": 1,                             +
           "Index Cond": "((clus >= 10) AND (clus < 15))",+
           "Rows Removed by Index Recheck": 0,            +
           "Filter": "(val = 1001)",                      +
           "Rows Removed by Filter": 499                  +
         }                                                +
       ]                                                  +
     },                                                   +
     "Planning Time": 0.270,                              +
     "Triggers": [                                        +
     ],                                                   +
     "Execution Time": 0.388                              +
   }                                                      +

 

Check if the INDEX we suggest already exists

Look up the existence of the index (https://stackoverflow.com/questions/2204058/list-columns-with-indexes-in-postgresql):

select
    t.relname as table_name,
    i.relname as index_name,
    array_to_string(array_agg(a.attname), ', ') as column_names
from
    pg_class t,
    pg_class i,
    pg_index ix,
    pg_attribute a
where
    t.oid = ix.indrelid
    and i.oid = ix.indexrelid
    and a.attrelid = t.oid
    and a.attnum = ANY(ix.indkey)
    and t.relkind = 'r'
    and t.relname like 'auth%'
group by
    t.relname,
    i.relname
order by
    t.relname,
    i.relname;
    

 table_name | index_name | column_names 
------------+------------+--------------
 authors    | authors_id | id


Creating data for test cases above

drop table t1;
 
 create table t1 ( 
    id  int,
    clus int,
    val  int,
    data VARCHAR(40)
 );
 
 insert into t1 (
    id, clus , val, data
)
  select 
  i,
  trunc(i/100),
  mod(i,10000),
  left(md5(random()::text), 40) 
from generate_series(1, 1000000) s(i);
  
select count(*) from t1 where clus = 1 ;
100
select count(*) from t1 where val =1 ;
100

create table t2 as select * from t1;  
  

explain (analyze,verbose,buffers)
select max(t1.data) from  t1, t2
where t1.id = t2.id 
and t1.clus = 1
;

Parameters that affect the explain plan

SET max_parallel_workers_per_gather = 0;
 set enable_mergejoin to 'off';
 set enable_nestloop to 'off';
 SET enable_hashjoin TO off;


Indexes on Joins for MySQL

March 24th, 2020

MySQL is simpler than the Oracle example because MySQL (at least as of 8.0.17, which I tested on) only has Nested Loops and doesn’t have Hash Join.

Here is the query, the same as in the Oracle example from the last blog post:

select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;

 

So there are only 4 examples below, all of which do NL joins:

  1. full table scan on T1 and T2
    1. 7.83 secs
  2. index on T1 predicate filter column
    1. 7.39 secs
  3. index on T2 join column
    1. 0.49 secs
  4. index on both T2 join column and T1 predicate filter column
    1. 0.06 secs

There isn’t an “explain analyze” on MySQL until MySQL 8.0.18, and I did my testing on 8.0.17, so the explain plan costs are estimates, not actual values, but the elapsed time of the query is an actual value.


drop table seed;
CREATE TABLE seed ( id INT AUTO_INCREMENT PRIMARY KEY, val INT);
 insert into seed(val) values (1);
insert into seed(val) select val from seed;  /*  2 */
insert into seed(val) select val from seed;  /*  4 */
insert into seed(val) select val from seed;  /*  8 */
insert into seed(val) select val from seed;  /*  16 */
insert into seed(val) select val from seed;  /*  32 */
insert into seed(val) select val from seed;  /*  64 */
insert into seed(val) select val from seed;  /*  128 */
insert into seed(val) select val from seed;  /*  256 */
insert into seed(val) select val from seed;  /*  512 */
insert into seed(val) select val from seed;  /*  1024 */
insert into seed(val) select val from seed;  /*  2048 */
insert into seed(val) select val from seed;  /*  4096 */
insert into seed(val) select val from seed;  /*  8192 */
insert into seed(val) select val from seed;  /*  16384 */
insert into seed(val) select val from seed;  /*  32768 */
insert into seed(val) select val from seed;  /*  65536 */
insert into seed(val) select val from seed;  /*  131072 */
insert into seed(val) select val from seed;  /*  262144 */
insert into seed(val) select val from seed;  /*  524288 */
insert into seed(val) select val from seed;  /*  1048576 */
 select count(*) from seed;

 drop table dummy;
 
  create table dummy ( 
    id  int,
    clus int,
    val int,
    data VARCHAR(40)
 );
 
 insert into dummy (
    id, clus , val, data
)
 select 
 id,
 truncate(id/169,0),
 mod(id,10000), 
 CONV(FLOOR(RAND() * 99999999999999), 10, 36) 
from seed
;

select count(*) from dummy where clus = 1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+
select count(*) from dummy where val =1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+

drop table t1;
drop table t2;

create table t1 as select * from dummy;
create table t2 as select * from dummy; 



A strange thing is that there are gaps in the sequence numbers in the seed table, hence the odd divisor in the truncate call, truncate(id/169,0), to get about 106 rows per clus value.

First we run the query with no indexes (I already created the indexes, so I force them off in this example with “USE INDEX ()”).
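For reference, the pre-created indexes referenced in the plans below would have been made along these lines (the names match the keys shown in the plans):

create index t1_clus on t1 (clus);
create index t2_id on t2 (id);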

Both nodes have access type

"access_type": "ALL"

so they are doing full table scans.

explain format=json select max(t1.data) 
from 
t1 USE INDEX ()
, t2 USE INDEX ()
where t1.id = t2.id 
and t1.clus = 1
;

1 row in set (7.83 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "18102538.86"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`clus` = 1)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "917.35",
            "eval_cost": "17.21",
            "prefix_cost": "18102538.86",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}

We can see that the filter on T1 should filter out most of the table (172 rows out of 1,045,682), so now add an index on t1.clus, the predicate filter column (I remove the “USE INDEX ()” on t1 so the index on T1 is now used in the explain). The execution time is a little faster but not by much, 7.83 vs 7.39 secs. We see that T1 is now accessed by index:

explain format=json select max(t1.data) 
from  t1, t2  
USE INDEX ()
where t1.id = t2.id 
and t1.clus    = 1
;
1 row in set (7.39 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "11085148.28"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ]
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 105,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "881.98",
            "eval_cost": "10.60",
            "prefix_cost": "11085148.28",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}

 

On T2 we see that we produce 105 rows after examining 1,045,682, thus a good opportunity for an index. Now add an index on the join column and we get a significant improvement in execution time: we go from 7.39 secs to 0.49, i.e. 7.39/0.49, a 15x improvement.

explain format=json select max(t1.data) 
from  t1 USE INDEX () , t2  
where t1.id = t2.id 
and t1.clus   = 1
;
1 row in set (0.49 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "105582.64"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "((`kyle`.`t1`.`clus` = 1) and (`kyle`.`t1`.`id` is not null))"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 172,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "171.98",
            "eval_cost": "17.21",
            "prefix_cost": "105582.64",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}

Now use both indexes, which gives us a small absolute reduction of 0.43 secs in execution time, but as a ratio from the previous run it is 0.49/0.06, i.e. an 8x improvement, still pretty good.

explain format=json select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;
1 row in set (0.06 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "153.62"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`id` is not null)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "105.93",
            "eval_cost": "10.60",
            "prefix_cost": "153.63",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}


Uncategorized

Indexes for Joins on Oracle

March 23rd, 2020

Looking at ways to programmatically analyze explain plans to see if adding indexes would help.

The analysis is straightforward for full table scans with strong predicate filters. If there is a strong predicate filter on an explain plan node with a full table scan, then it's a good candidate for an index.

Now a similar situation exists for a join.

Again if the join has a strong predicate filter on a full table scan, that is a candidate.

On the other hand when the actual join fields look like they will benefit from an index, it’s less obvious.

In the queries below T1 and T2 have the same data.

Here is the query

select  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  
   and t1.clus   = 1  ;

 

First we try with no indexes.

Then we try with an index on the predicate filter on T1.

Then we add a index on the join column on T2.

The idea being that we should see a performance improvement with the addition of each index, so that the final efficient path uses the predicate filter index on T1, returning a small number of rows, then drives into T2 with that small number of rows via an NL join using the index on the T2 join field.

The surprising thing is that a HJ that simply hashes the predicate-filtered rows from T1 and drives into the hash result with all the rows from T2 is just as efficient, and we don't have to maintain the index on the t2 join column! It's surprising to me. Sure, HJs are cool, but I always thought of them as just-in-time indexes (i.e. the hashed result), which is work that could be avoided by having an index supporting an NL join instead of recreating the hash result on every query execution. (I'll have to do some load testing to see the actual impact of the NL vs HJ query on total database resource usage and load.)

Simply adding an index on the T1 filter column gets us a HJ that is just as efficient as an NL using the T2 join field, but in the case of the HJ we don't have to maintain the T2 join column index.

There is a strong predicate filter on T1.clus (called "clus" because the values are clustered; we can look at the non-clustered case later, i.e. the same query but using the "val" column, which is not clustered).

In the queries below, an NL that drives from the predicate-filtered T1 rows into T2 using an index on the join is efficient.

But the HJ, hashing T1 after the predicate filter and driving into the hash with all the rows from T2, is surprisingly (to me) as efficient or more efficient.

The first example will start without any index usage.

Create test data:

drop table seed;
create table seed as
select rownum n from dual connect by level <= 1e6
;

drop table t1;
drop table t2;
create table t1 ( 
    id  int,
    clus int,
    val int,
    data VARCHAR(40)
 );
 
 insert into t1 (
    id, clus , val, data
)
 select 
 n,
 trunc(n/100),
 mod(n,10000),
dbms_random.string('U',dbms_random.value(2,7))
from seed;


select count(*) from t1 where clus = 1;
100
select count(*) from t1 where val = 1;
100

create table t2 as select * from t1;

alter session set Statistics_level=all;
set timing on;

The following 3 examples will use full table scans on both tables.

(I use " + 0 " to turn index usage off so I don't have to drop and recreate indexes for the demo, since I had actually already created the indexes before running these queries.)
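The pattern looks like this: wrapping the indexed column in any expression keeps the optimizer from using an index on that column (a generic illustration using the t1 table above):

select count(*) from t1 where clus + 0 = 1;  -- an index on clus cannot be used
select count(*) from t1 where clus = 1;      -- an index on clus can be used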

The first NL below is slower than the next HASH join because the NL has to read the full table T2 for every row from T1, resulting in 300K+ buffer accesses (100 rows after the T1 filter times ~3,000 buffers per full scan of T2, matching the ~325K buffers shown on the T2 line of the plan).

  • NL
    • For every row from T1 (which is all rows from T1 where t1.clus = 1, which is 100 rows)
    • Look up that row in T2 (T2 doesn't have an index, so we do a full table scan for each of the 100 rows from above)
    • -> 6.83 secs

The HASH join just has to read T2 once (3K+ buffer accesses) and look up each of those rows in a hash of T1. That is still a lot of rows (it should be 1M, i.e. all the rows in T2, which is what the estimate says, but actual says 1212, which I don't understand; a likely explanation is that this is a HASH JOIN SEMI, so the scan of T2 can stop early once every build row from T1 has found a match).

  • HJ
    • hash T1 (100 rows after the filter)
    • look up every row from T2 (1M) in hash of T1
    • -> 0.09 sec

Hashing T2 instead of T1 uses a lot more memory, 50M instead of 1.5M, and takes longer even though there are far fewer lookups from the filter on T1 (only 100).

  • HJ
    • hash T2  (1M rows, a good bit of work)
    • for every row from T1 (100 rows after filter) look them up in hash of T2
    • -> 0.32 sec
select  /*+ leading(t1 t2)  use_nl(t2) */ max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:06.83 |	 328K|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:06.83 |	 328K|
|   2 |   NESTED LOOPS	    |	   |	  1 |	  36 |	  100 |00:00:06.83 |	 328K|
|*  3 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.07 |	3275 |
|*  4 |    TABLE ACCESS FULL| T2   |	100 |	   1 |	  100 |00:00:06.76 |	 325K|
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("T1"."CLUS"+0=1)
   4 - filter("T1"."ID"="T2"."ID"+0)


select max(t1.data) from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |	OMem |	1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:00.09 |	3281 |	     |	     |		|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:00.09 |	3281 |	     |	     |		|
|*  2 |   HASH JOIN SEMI    |	   |	  1 |	  36 |	  100 |00:00:00.09 |	3281 |	1753K|	1753K| 1414K (0)|
|*  3 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.09 |	3275 |	     |	     |		|
|   4 |    TABLE ACCESS FULL| T2   |	  1 |	1000K|	 1212 |00:00:00.01 |	   6 |	     |	     |		|
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID"+ 0)
   3 - filter("T1"."CLUS"+ 0=1)


select  /*+ leading(t2 t1) use_hash(t1) */  max(t1.data)  from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |	OMem |	1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:00.32 |	6526 |	     |	     |		|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:00.32 |	6526 |	     |	     |		|
|*  2 |   HASH JOIN	    |	   |	  1 |	  36 |	  100 |00:00:00.32 |	6526 |	  50M|	9345K|	 49M (0)|
|   3 |    TABLE ACCESS FULL| T2   |	  1 |	1000K|	 1000K|00:00:00.03 |	3251 |	     |	     |		|
|*  4 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.08 |	3275 |	     |	     |		|
-----------------------------------------------------------------------------------------------------------------


Add an index on the predicate filter:

create index t1_clus on t1 (clus);

And we get a much faster hash join as we access far fewer buffers on T1.

The NL driving from T1 still has to do a full table scan on T2 for every row, so it is slow.

select max(t1.data) from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus    = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));


-------------------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      10 |	  |	  |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      10 |	  |	  ||
|*  2 |   HASH JOIN SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      10 |  1753K|  1753K| 1408K (0)|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
|   5 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1212 |00:00:00.01 |	6 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------- 

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID")
   4 - access("T1"."CLUS"=1)


select /*+ leading(t1 t2)  use_nl(t2) */  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id + 0
   and t1.clus   = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:08:14.23 |    2893K|   2892K|
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:08:14.23 |    2893K|   2892K|
|   2 |   NESTED LOOPS			      | 	|      1 |    100 |    100 |00:08:14.23 |    2893K|   2892K|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	 0 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	 0 |
|*  5 |    TABLE ACCESS FULL		      | T2	|    100 |	1 |    100 |00:08:14.23 |    2893K|   2892K|
--------------------------------------------------------------------------------------------------------------------

Adding an index on T2.id and allowing the NL driving from T1 to look up rows in T2 by index gets the NL down to about the same as the HASH JOIN.

The HJ on a hash of T2 is faster when accessing T1 by its index on the predicate filter, but hashing all of T2 still slows it down.

I don't understand why the HJ hashing T2 is slower when T2 is read via the T2_ID index fast full scan instead of a full table scan (both plans below access T1 via the T1_CLUS index; the difference between them is how T2 is read).

drop index t2_id;

create index t2_id on t2 (id);  

select /*+ leading(t1 t2)  use_nl(t2) */  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  
   and t1.clus   = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

-----------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      17 |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      17 |
|   2 |   NESTED LOOPS SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      17 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |
|*  5 |    INDEX RANGE SCAN		      | T2_ID	|    100 |   1000K|    100 |00:00:00.01 |      13 |
-----------------------------------------------------------------------------------------------------------

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id +1 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.22 |    3255 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.22 |    3255 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |     78 |    100 |00:00:00.22 |    3255 |    50M|  9345K|   49M (0)|
|   3 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1000K|00:00:00.03 |    3251 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID"+1)
   5 - access("T1"."CLUS"=1)

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:01.25 |    2246 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:01.25 |    2246 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |    100 |    100 |00:00:01.25 |    2246 |    43M|  6111K|   42M (0)|
|   3 |    INDEX FAST FULL SCAN 	      | T2_ID	|      1 |   1000K|   1000K|00:00:00.31 |    2242 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID")
   5 - access("T1"."CLUS"=1)

For reference, here is the order of NL and HJ access and the corresponding hints:

Screen Shot 2020-03-23 at 1.24.20 PM

Reference

http://datavirtualizer.com/right-deep-left-deep-and-bushy-joins-in-sql/

 

Followup:

The Hash Join put about 7x as much load on the DB (0.22 AAS) as the nested loops (0.03 AAS), with both doing 178 executions/sec.

The first load chart image has the top SQL drill-down filtered for the efficient HJ version. The HJ plan is shown here.

-------------------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      10 |	  |	  |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      10 |	  |	  ||
|*  2 |   HASH JOIN SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      10 |  1753K|  1753K| 1408K (0)|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
|   5 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1212 |00:00:00.01 |	6 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------- 

 

Screen Shot 2020-03-24 at 11.21.06 AM

The second image is the load filtered by the NL join, and here is the explain.

-----------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      17 |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      17 |
|   2 |   NESTED LOOPS SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      17 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |
|*  5 |    INDEX RANGE SCAN		      | T2_ID	|    100 |   1000K|    100 |00:00:00.01 |      13 |
-----------------------------------------------------------------------------------------------------------

 

Screen Shot 2020-03-24 at 11.21.13 AM

 

The following hash join, which hashes T2, was not as efficient as hashing the T1 result, but it confused me: in the explain it looked like reading T2 via its index (the index fast full scan) was more expensive than reading T2 with a full table scan. Under actual load it looks like the other way around.

Even though the plan that skips the T2 index shows an A-Time of 0.22 and the one using the T2 index shows 1.25, when running a concurrent load it looks the opposite.

The first example below avoids the index on T2 (via the "+1" on t2.id) and has an AAS of 10.36 and an elapsed time of 3.7 secs (instead of the 0.22 in the explain).

The second, which reads T2 via the index fast full scan, has an AAS of 8.2 and an elapsed time of 3.1 secs (instead of the 1.25 in the explain).

Both are executing 2.24 executes/sec.

 

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id +1 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.22 |    3255 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.22 |    3255 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |     78 |    100 |00:00:00.22 |    3255 |    50M|  9345K|   49M (0)|
|   3 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1000K|00:00:00.03 |    3251 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------


Screen Shot 2020-03-24 at 12.09.16 PM

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:01.25 |    2246 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:01.25 |    2246 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |    100 |    100 |00:00:01.25 |    2246 |    43M|  6111K|   42M (0)|
|   3 |    INDEX FAST FULL SCAN 	      | T2_ID	|      1 |   1000K|   1000K|00:00:00.31 |    2242 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
--------------------------------------------------------------------------------------------------------------------------------

Screen Shot 2020-03-24 at 12.09.24 PM

Reference

http://datavirtualizer.com/right-deep-left-deep-and-bushy-joins-in-sql/

http://www.nocoug.org/download/2012-08/Jonathan_Lewis_Creating_Tests.pdf

Uncategorized

Oracle Imp (import) to RDS

March 2nd, 2020

Had an old export dump file I wanted to import into RDS.

I first tried to use S3 as a hosting site for the .dmp file and use rdsadmin.rdsadmin_s3_tasks.download_from_s3 to copy the file to the RDS database then import it with DBMS_DATAPUMP.

Unfortunately DBMS_DATAPUMP only works with expdp files, so instead of using DBMS_DATAPUMP, I got the old "imp" binary and used it with TWO_TASK pointed at the RDS database I wanted the data loaded into.

For the record, here are the steps I took.

First put the file into an S3 bucket called kylelf-datadump.

Create an IAM policy to allow Amazon RDS access to an Amazon S3 bucket

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/oracle-s3-integration.html#oracle-s3-integration.preparing

My policy looks like this (note that s3:ListBucket applies to the bucket itself, while s3:GetObject and s3:PutObject apply to the objects in it, hence both resource ARNs):

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Sid": "VisualEditor0",
 "Effect": "Allow",
 "Action": [
 "s3:PutObject",
 "s3:GetObject",
 "s3:ListBucket"
 ],
 "Resource": [
 "arn:aws:s3:::kylelf-datadump",
 "arn:aws:s3:::kylelf-datadump/*"
 ]
 }
 ]
}

Create an IAM role to allow Amazon RDS access to an Amazon S3 bucket

again, see: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/oracle-s3-integration.html#oracle-s3-integration.preparing

add the role to the database you want to import into

  1. Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/.
  2. Choose the Oracle DB instance name to display its details.
  3. On the Connectivity & security tab, in the Manage IAM roles section, choose the role to add under Add IAM roles to this instance.
  4. For Feature, choose S3_INTEGRATION.

create s3 option group

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithOptionGroups.html#USER_WorkingWithOptionGroups.Create

  1. create an option group
  2. add s3 integration to option group
  3. modify instance, change option group to s3 option group

Then download the dump file from S3.

SELECT rdsadmin.rdsadmin_s3_tasks.download_from_s3(
 p_bucket_name => 'kylelf-datadump', 
 p_directory_name => 'DATA_PUMP_DIR') 
 AS TASK_ID FROM DUAL; 

TASK_ID
--------------------------------------------------------------------------------
1583185124793-43

Check status of download

SELECT text FROM
table(rdsadmin.rds_file_util.read_text_file('BDUMP','dbtask-1583185124793-43.log'));

Check to see the file is there

select * from table(RDSADMIN.RDS_FILE_UTIL.LISTDIR('DATA_PUMP_DIR')) order by mtime;
FILENAME   TYPE       FILESIZE MTIME
---------- ------ ------------ ---------
TC1.dmp    file       39829504 02-MAR-20

My file is the TC1.dmp file

Now import it

DECLARE
hdnl NUMBER;
BEGIN
hdnl := DBMS_DATAPUMP.OPEN( operation => 'IMPORT', job_mode => 'SCHEMA', job_name=>null);
DBMS_DATAPUMP.ADD_FILE( handle => hdnl, filename => 'TC1.dmp', directory => 'DATA_PUMP_DIR', filetype => dbms_datapump.ku$_file_type_dump_file);
DBMS_DATAPUMP.METADATA_FILTER(hdnl,'SCHEMA_EXPR','IN (''KYLELF'')');
DBMS_DATAPUMP.START_JOB(hdnl);
END;
/

This gave me the error

ORA-39001: invalid argument value
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 79
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3507
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3756
ORA-06512: at line 5

Not very informative.

I found an example of how to get more verbose error messages at

https://stackoverflow.com/questions/45348688/oracle-datapump-export-error-invalid-argument-value

SET VERIFY OFF
SET FEEDBACK OFF
SET SERVEROUTPUT ON

DECLARE
  ind        NUMBER;
  fileType   NUMBER;
  value      VARCHAR2(2048);
  infoTab    KU$_DUMPFILE_INFO := KU$_DUMPFILE_INFO();
 
BEGIN
  --
  -- Get the information about the dump file into the infoTab.
  --
  BEGIN
    DBMS_DATAPUMP.GET_DUMPFILE_INFO('TC1.dmp','DATA_PUMP_DIR',infoTab,fileType);
    DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
    DBMS_OUTPUT.PUT_LINE('Information for file: TC1.dmp');
 
    --
    -- Determine what type of file is being looked at.
    --
    CASE fileType
      WHEN 1 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is a Data Pump dump file');
      WHEN 2 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is an Original Export dump file');
      WHEN 3 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is an External Table dump file');
      ELSE
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is not a dump file');
        DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
    END CASE;
 
  EXCEPTION
    WHEN OTHERS THEN
      DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
      DBMS_OUTPUT.PUT_LINE('Error retrieving information for file: ' ||
                           'TC1.dmp');
      DBMS_OUTPUT.PUT_LINE(SQLERRM);
      DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
      fileType := 0;
  END;
 
  --
  -- If a valid file type was returned, then loop through the infoTab and 
  -- display each item code and value returned.
  --
  IF fileType > 0
  THEN
    DBMS_OUTPUT.PUT_LINE('The information table has ' || 
                          TO_CHAR(infoTab.COUNT) || ' entries');
    DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
 
    ind := infoTab.FIRST;
    WHILE ind IS NOT NULL
    LOOP
      --
      -- The following item codes return boolean values in the form
      -- of a '1' or a '0'. Display them as 'Yes' or 'No'.
      --
      value := NVL(infoTab(ind).value, 'NULL');
      IF infoTab(ind).item_code IN
         (DBMS_DATAPUMP.KU$_DFHDR_MASTER_PRESENT,
          DBMS_DATAPUMP.KU$_DFHDR_DIRPATH,
          DBMS_DATAPUMP.KU$_DFHDR_METADATA_COMPRESSED,
          DBMS_DATAPUMP.KU$_DFHDR_DATA_COMPRESSED,
          DBMS_DATAPUMP.KU$_DFHDR_METADATA_ENCRYPTED,
          DBMS_DATAPUMP.KU$_DFHDR_DATA_ENCRYPTED,
          DBMS_DATAPUMP.KU$_DFHDR_COLUMNS_ENCRYPTED)
      THEN
        CASE value
          WHEN '1' THEN value := 'Yes';
          WHEN '0' THEN value := 'No';
        END CASE;
      END IF;
 
      --
      -- Display each item code with an appropriate name followed by
      -- its value.
      --
      CASE infoTab(ind).item_code
        --
        -- The following item codes have been available since Oracle
        -- Database 10g, Release 10.2.
        --
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FILE_VERSION   THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Version:         ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PRESENT THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Present:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_GUID THEN
          DBMS_OUTPUT.PUT_LINE('Job Guid:                  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FILE_NUMBER THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Number:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_CHARSET_ID  THEN
          DBMS_OUTPUT.PUT_LINE('Character Set ID:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_CREATION_DATE THEN
          DBMS_OUTPUT.PUT_LINE('Creation Date:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FLAGS THEN
          DBMS_OUTPUT.PUT_LINE('Internal Dump Flags:       ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_JOB_NAME THEN
          DBMS_OUTPUT.PUT_LINE('Job Name:                  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_PLATFORM THEN
          DBMS_OUTPUT.PUT_LINE('Platform Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_INSTANCE THEN
          DBMS_OUTPUT.PUT_LINE('Instance Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_LANGUAGE THEN
          DBMS_OUTPUT.PUT_LINE('Language Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_BLOCKSIZE THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Block Size:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DIRPATH THEN
          DBMS_OUTPUT.PUT_LINE('Direct Path Mode:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_METADATA_COMPRESSED THEN
          DBMS_OUTPUT.PUT_LINE('Metadata Compressed:       ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DB_VERSION THEN
          DBMS_OUTPUT.PUT_LINE('Database Version:          ' || value);
 
        --
        -- The following item codes were introduced in Oracle Database 11g
        -- Release 11.1
        --

        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PIECE_COUNT THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Piece Count:  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PIECE_NUMBER THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Piece Number: ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DATA_COMPRESSED THEN
          DBMS_OUTPUT.PUT_LINE('Table Data Compressed:     ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_METADATA_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('Metadata Encrypted:        ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DATA_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('Table Data Encrypted:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_COLUMNS_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('TDE Columns Encrypted:     ' || value);
 
        --
        -- For the DBMS_DATAPUMP.KU$_DFHDR_ENCRYPTION_MODE item code a
        -- numeric value is returned. So examine that numeric value
        -- and display an appropriate name value for it.
        --
        WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCRYPTION_MODE THEN
          CASE TO_NUMBER(value)
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_NONE THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           None');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_PASSWORD THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Password');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_DUAL THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Dual');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_TRANS THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Transparent');
          END CASE;
 
        --
        -- The following item codes were introduced in Oracle Database 12c
        -- Release 12.1
        --
 
        --
        -- For the DBMS_DATAPUMP.KU$_DFHDR_COMPRESSION_ALG item code a
        -- numeric value is returned. So examine that numeric value and
        -- display an appropriate name value for it.
        --
        
        ELSE
          NULL;  -- avoid ORA-06592 (CASE_NOT_FOUND) for item codes not handled above
      END CASE;
      ind := infoTab.NEXT(ind);
    END LOOP;
  END IF;
END;
/

Now I can see the error better:

39001: ORA-39001: invalid argument value
-39000: ORA-39000: bad dump file specification
-39143: ORA-39143: dump file "/rdsdbdata/datapump/TC1.dmp" may be an original
export dump file

 

Looking this up via Google indicates that this is an old "exp" dump file and is not compatible with "impdp".

Thus I have to use the old imp tool.

From OTN

https://www.oracle.com/database/technologies/instant-client/linux-x86-64-downloads.html

I got

oracle-instantclient19.6-basic-19.6.0.0.0-1.x86_64.rpm
oracle-instantclient19.6-sqlplus-19.6.0.0.0-1.x86_64.rpm
oracle-instantclient19.6-tools-19.6.0.0.0-1.x86_64.rpm

Installed these on an EC2 instance:

sudo rpm -ivh oracle-instantclient19.6-basic-19.6.0.0.0-1.x86_64.rpm
sudo rpm -ivh oracle-instantclient19.6-sqlplus-19.6.0.0.0-1.x86_64.rpm
sudo rpm -ivh oracle-instantclient19.6-tools-19.6.0.0.0-1.x86_64.rpm

Then copy the TC1.dmp file to the EC2

Then use TWO_TASK to point to my database

export TWO_TASK='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=ora19.xxxxxxxx.us-east-1.rds.amazonaws.com)(PORT=1521)))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)))'

/usr/lib/oracle/19.6/client64/bin/imp user/pw file=TC1.dmp FROMUSER='scott' TOUSER='kylelf'

I also had a hiccup because the first database I tried was 11.2 and the imp binary is from the 19.0 distribution, so I got this error:

 

IMP-00003: ORACLE error 942 encountered
ORA-00942: table or view does not exist
IMP-00023: Import views not installed, please notify your DBA
IMP-00000: Import terminated unsuccessfully

So I had to create a version 19 Oracle database, and then the import worked.

Uncategorized

Index Advisor? PostgreSQL, Oracle and SQL Server: yes, MySQL: no

January 30th, 2020

How does one check if there are missing indexes in a database?

Is there a programmatic way to do it?

Explain Plan
Most blogs on the subject will say look at the explain plan for your query.
The basic concept is to look for full table scans in the explain plan and then see if there is a predicate filter (i.e. SQL text of the form "where table.field = value") on the full table scan, or a full table scan in a join (i.e. tableA.field = tableB.field). There are other considerations, things to check, and other places for indexes, but this is a good start (a sketch of the pattern follows).
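For illustration only (the table and column names here are made up), these are the two shapes to look for when the plan shows a full table scan:

select * from orders where status = 'OPEN';   -- predicate filter: candidate index on orders(status)

select o.* from orders o
join customers c on o.customer_id = c.id;     -- join filter: candidate index on the join columns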

Explain Plans for whole database
What if you don’t have a particular query and instead want to check the whole running database?
Is there a programmatic way to do it? If the answer to finding missing indexes for one query is looking at the explain plan, then by extension the answer for the whole database is looking at the explain plans for the whole database.

Execution Plan Cache
On Oracle and SQL Server the databases have a cache of execution plans (the plans actually used by the optimizer, as opposed to the plan the optimizer thinks it would use, the "explain" plan; execution plans come from executions, and explain plans come from the explain command without executing the statement. They should be the same, but for many reasons they might be different).
Thus on Oracle (V$SQL_PLAN) and SQL Server (sys.dm_exec_query_plan) the execution plans can be programmatically queried and analyzed.
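For example, on Oracle a first pass for full table scans with predicate filters could look something like this (a minimal sketch against the standard V$SQL_PLAN and V$SQL views; treating a non-null filter_predicates column as a "predicate filter" is just a rough heuristic):

SELECT p.sql_id, p.object_owner, p.object_name, s.executions
FROM   v$sql_plan p
JOIN   v$sql s ON s.sql_id = p.sql_id AND s.child_number = p.child_number
WHERE  p.operation = 'TABLE ACCESS'
AND    p.options = 'FULL'
AND    p.filter_predicates IS NOT NULL;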

Explain plan on SQL from SQL cache
What about PostgreSQL and MySQL? Neither has a query plan cache.
Both PostgreSQL (pg_stat_statements) and MySQL (events_statements_history) do have the text of executed queries. With those one can programmatically collect the SQL queries, run the "explain" command, and get the explain plan.
Another issue is getting the full SQL text: without it, the explain plan can't be run. On MySQL the SQL text length is limited to 1K by default but can be adjusted with the performance_schema_max_sql_text_length system variable at server startup. Postgres is also limited to 1K by default, which can be increased with the track_activity_query_size parameter.
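A quick way to check the current limits (run in the MySQL and Postgres clients respectively):

SHOW VARIABLES LIKE 'performance_schema_max_sql_text_length';   -- MySQL
SHOW track_activity_query_size;                                 -- Postgres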

Explain Analyze
For select statements (i.e. queries that don't modify data) one could even consider running "explain analyze", which both PostgreSQL and MySQL have. Explain analyze runs the query and gives the execution plan with both the estimated costs and the actual execution statistics, which is useful, as we will discuss later.

Explain on SQL with bind variables
Unfortunately the explain command breaks on SQL that has bind parameters. To get explain to work, real values have to be substituted for the variables.
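For example, a captured Postgres statement with a placeholder can't be explained as-is (orders is a made-up table):

explain select * from orders where customer_id = $1;   -- fails: no value bound to $1
explain select * from orders where customer_id = 42;   -- works once a literal is substituted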

Bind variable capture
Bind variables can be captured on PostgreSQL with extensions like pg_qualstats, which also captures other information that is perfect for finding missing indexes.
Bind variable capture is not possible on MySQL, even with the slow query log.
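A minimal sketch of looking at what pg_qualstats captures (assuming the extension is installed and preloaded; see the pg_qualstats docs for the full column list):

CREATE EXTENSION pg_qualstats;
SELECT * FROM pg_qualstats LIMIT 10;   -- sampled predicates, including the constant values used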

So on MySQL, without an execution plan cache and without bind variable capture, we can't build a programmatic index advisor.

Uncategorized

SQLcmd on my Mac: Seg fault

January 23rd, 2020

Trying to connect to SQL Server from my Mac, so I installed sqlcmd:

brew tap microsoft/mssql-release https://github.com/Microsoft/homebrew-mssql-release
brew update
brew install mssql-tools

then got this error:

$ sqlcmd -S kylelfsqls.xxxxxx.us-west-2.amazonaws.com,1433 -U kylelf  
Sqlcmd: Error: Microsoft ODBC Driver 17 for SQL Server : SSL Provider: [OpenSSL library could not be loaded, make sure OpenSSL 1.0 or 1.1 is installed].
Sqlcmd: Error: Microsoft ODBC Driver 17 for SQL Server : Client unable to establish connection.
Segmentation fault: 11

Didn't find much info on Google, so blogging it here.
I ran

ln -s /usr/local/opt/openssl@1.1 /usr/local/opt/openssl

and it worked

Uncategorized

MySQL – manually updating table and index optimizer statistics

January 22nd, 2020

The goal was to create an empty table and populate it with the optimizer table and index stats from a populated table, to try and get the optimizer to give the same explain plan and costs on the empty table as on the populated table. The purpose was to be able to modify the empty table, adding indexes quickly and cheaply, and still get correct optimizer costs. This worked in the case of table access, but unfortunately didn't work when bringing an index into play.

Create procedure to populate test table

DROP PROCEDURE IF EXISTS populate_data;
DELIMITER //
CREATE PROCEDURE populate_data(max_val int)
BEGIN
  DECLARE int_val INT DEFAULT 0;
  test_loop : LOOP
    IF (int_val = max_val) THEN
      LEAVE test_loop;
    END IF;
    INSERT INTO source(id) VALUES (FLOOR(RAND()*(1000000+1)+1));
    SET int_val = int_val + 1;
  END LOOP;
END;//
DELIMITER ;

Create the source table and populate it.
Create source_clone, which will stay empty but whose optimizer statistics we want to manually fill in to mimic the source table.

CREATE TABLE source(id integer);
call populate_data(10000);
CREATE TABLE source_clone (id integer) ;

Comparing table and index stats on source and source_clone, we can see that, as expected, they are different:

select * from mysql.innodb_table_stats where table_name like 'sourc%';
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
| database_name | table_name   | last_update         | n_rows | clustered_index_size | sum_of_other_index_sizes |
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
| kyle          | source       | 2020-01-22 18:49:48 |   9980 |                   21 |                        0 |
| kyle          | source_clone | 2020-01-22 18:49:44 |      0 |                    1 |                        0 |
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
select * from mysql.innodb_index_stats where table_name like 'sourc%';
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name   | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | n_diff_pfx01 |       9980 |          20 | DB_ROW_ID                         |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | n_leaf_pages |         20 |        NULL | Number of leaf pages in the index |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | size         |         21 |        NULL | Number of pages in the index      |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:49:44 | n_diff_pfx01 |          0 |           1 | DB_ROW_ID                         |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:49:44 | n_leaf_pages |          1 |        NULL | Number of leaf pages in the index |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:49:44 | size         |          1 |        NULL | Number of pages in the index      |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+

Running explain we get different row estimates, as expected:

--
mysql> explain select id from kyle.source where id > 2056;
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | source | ALL  | NULL          | NULL | NULL    | NULL | 9980 | Using where |
+----+-------------+--------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.05 sec)

mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | source_clone | ALL  | NULL          | NULL | NULL    | NULL |    1 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.04 sec)

Now let's copy the stats from the populated source table to the empty source_clone:

-- TABLE  STATS
update mysql.innodb_table_stats AS `dest`,
     (SELECT  n_rows , clustered_index_size 
     FROM mysql.innodb_table_stats 
     WHERE 
         table_name = 'source'
    ) AS `src`
SET dest.n_rows  = src.n_rows , 
    dest.clustered_index_size  = src.clustered_index_size
WHERE 
         table_name = 'source_clone';
-- INDEX STATS
UPDATE mysql.innodb_index_stats AS `dest`,
     (SELECT stat_value, stat_name
      FROM mysql.innodb_index_stats
      WHERE
          table_name = 'source'
          and index_name = 'GEN_CLUST_INDEX'
     ) AS `src`
SET dest.stat_value = src.stat_value
WHERE dest.table_name = 'source_clone'
  and dest.index_name = 'GEN_CLUST_INDEX'
  and dest.stat_name = src.stat_name;

The optimizer table and index statistics are now the same

mysql> select * from mysql.innodb_table_stats where table_name like 'sourc%';
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
| database_name | table_name   | last_update         | n_rows | clustered_index_size | sum_of_other_index_sizes |
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
| kyle          | source       | 2020-01-22 18:49:48 |   9980 |                   21 |                        0 |
| kyle          | source_clone | 2020-01-22 18:54:31 |   9980 |                   21 |                        0 |
+---------------+--------------+---------------------+--------+----------------------+--------------------------+
--
mysql> select * from mysql.innodb_index_stats where table_name like 'sourc%';
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name   | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | n_diff_pfx01 |       9980 |          20 | DB_ROW_ID                         |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | n_leaf_pages |         20 |        NULL | Number of leaf pages in the index |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:49:48 | size         |         21 |        NULL | Number of pages in the index      |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:54:46 | n_diff_pfx01 |       9980 |           1 | DB_ROW_ID                         |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:54:46 | n_leaf_pages |         20 |        NULL | Number of leaf pages in the index |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:54:46 | size         |         21 |        NULL | Number of pages in the index      |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+

Running explain we still get a different row estimate on source_clone:

--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | source_clone | ALL  | NULL          | NULL | NULL    | NULL |    1 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+

But after doing a flush table on source_clone we now get the same rows returned in the explain as on source (manual updates to the innodb stats tables are only picked up when the table is reopened, which FLUSH TABLE forces):

 FLUSH TABLE source_clone;
--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table        | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | source_clone | ALL  | NULL          | NULL | NULL    | NULL | 9980 | Using where |
+----+-------------+--------------+------+---------------+------+---------+------+------+-------------+

Now let’s create an index and try to manually update the optimizer index stats

CREATE INDEX source_id ON source (id);
CREATE INDEX source_clone_id ON source_clone (id);
--
mysql> explain select id from kyle.source where id > 2056;
+----+-------------+--------+-------+---------------+-----------+---------+------+------+--------------------------+
| id | select_type | table  | type  | possible_keys | key       | key_len | ref  | rows | Extra                    |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source | range | source_id     | source_id | 5       | NULL | 4990 | Using where; Using index |
+----+-------------+--------+-------+---------------+-----------+---------+------+------+--------------------------+
--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys   | key             | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source_clone | range | source_clone_id | source_clone_id | 5       | NULL |    1 | Using where; Using index |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
--
UPDATE mysql.innodb_index_stats AS `dest`,
     (SELECT stat_value, stat_name
      FROM mysql.innodb_index_stats
      WHERE
          table_name = 'source'
          and index_name = 'source_id'
     ) AS `src`
SET dest.stat_value = src.stat_value
WHERE dest.table_name = 'source_clone'
  and dest.index_name = 'source_clone_id'
  and dest.stat_name = src.stat_name;
--
mysql> select * from mysql.innodb_index_stats where table_name like 'sourc%';
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| database_name | table_name   | index_name      | last_update         | stat_name    | stat_value | sample_size | stat_description                  |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | n_diff_pfx01 |       9980 |          20 | DB_ROW_ID                         |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | n_leaf_pages |         20 |        NULL | Number of leaf pages in the index |
| kyle          | source       | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | size         |         21 |        NULL | Number of pages in the index      |
| kyle          | source       | source_id       | 2020-01-22 18:56:46 | n_diff_pfx01 |       9950 |          11 | id                                |
| kyle          | source       | source_id       | 2020-01-22 18:56:46 | n_diff_pfx02 |      10000 |          11 | id,DB_ROW_ID                      |
| kyle          | source       | source_id       | 2020-01-22 18:56:46 | n_leaf_pages |         11 |        NULL | Number of leaf pages in the index |
| kyle          | source       | source_id       | 2020-01-22 18:56:46 | size         |         12 |        NULL | Number of pages in the index      |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | n_diff_pfx01 |       9980 |           1 | DB_ROW_ID                         |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | n_leaf_pages |         20 |        NULL | Number of leaf pages in the index |
| kyle          | source_clone | GEN_CLUST_INDEX | 2020-01-22 18:56:46 | size         |         21 |        NULL | Number of pages in the index      |
| kyle          | source_clone | source_clone_id | 2020-01-22 18:57:38 | n_diff_pfx01 |       9950 |           1 | id                                |
| kyle          | source_clone | source_clone_id | 2020-01-22 18:57:38 | n_diff_pfx02 |      10000 |           1 | id,DB_ROW_ID                      |
| kyle          | source_clone | source_clone_id | 2020-01-22 18:57:38 | n_leaf_pages |         11 |        NULL | Number of leaf pages in the index |
| kyle          | source_clone | source_clone_id | 2020-01-22 18:57:38 | size         |         12 |        NULL | Number of pages in the index      |
+---------------+--------------+-----------------+---------------------+--------------+------------+-------------+-----------------------------------+
--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys   | key             | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source_clone | range | source_clone_id | source_clone_id | 5       | NULL |    1 | Using where; Using index |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
--
 FLUSH TABLE source_clone;
--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys   | key             | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source_clone | range | source_clone_id | source_clone_id | 5       | NULL |    1 | Using where; Using index |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
--
Flush tables;
--
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys   | key             | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source_clone | range | source_clone_id | source_clone_id | 5       | NULL |    1 | Using where; Using index |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
-- REBOOT database
mysql> explain select id from kyle.source_clone where id > 2056;
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
| id | select_type | table        | type  | possible_keys   | key             | key_len | ref  | rows | Extra                    |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+
|  1 | SIMPLE      | source_clone | range | source_clone_id | source_clone_id | 5       | NULL |    1 | Using where; Using index |
+----+-------------+--------------+-------+-----------------+-----------------+---------+------+------+--------------------------+

The optimizer index stats are the same, we have flushed the tables, and we even rebooted the database, but the explain plan still shows different row estimates on source_clone.
The optimizer must be basing costs in this case on something other than mysql.innodb_index_stats and mysql.innodb_table_stats. (A likely explanation: for a range predicate like id > 2056, InnoDB estimates the row count by diving into the actual index B-tree, the records_in_range handler call, rather than from the persistent stats tables, and the clone's index is empty.)
 

 

Uncategorized