Archive for March, 2020

Indexes on Joins for MySQL

March 24th, 2020

MySQL is simpler than the Oracle example because MySQL only has Nested Loop joins and doesn't have Hash Join (Hash Join only arrived in MySQL 8.0.18).

Here is the query, the same as in the Oracle example from the last blog post:

select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;

 

So there are only four examples below, all of which do NL joins:

  1. full table scan on T1 and T2: 7.83 secs
  2. index on T1 predicate filter column: 7.39 secs
  3. index on T2 join column: 0.49 secs
  4. index on both T2 join column and T1 predicate filter column: 0.06 secs

There wasn't an "EXPLAIN ANALYZE" in MySQL until MySQL 8.0.18 and I did my testing on 8.0.17, so the explain plan costs are estimates, not actual values, but the elapsed time of the query is an actual value.
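Before walking through the plans, the four timings line up with a simple rows-examined model of a nested loop join. This is a toy Python sketch with assumed round-number row counts (1M rows, 106 matching the filter), not MySQL's actual cost model, which also accounts for I/O and join buffering:

```python
# Toy rows-examined model for the four index scenarios.
# Row counts are assumptions for illustration, not measured values.
N = 1_000_000        # rows in t1 and t2
FILTERED = 106       # rows in t1 with clus = 1

def cost(outer_examined, outer_matches, inner_examined_per_probe):
    """Rows touched by a nested loop join: scan the outer side once,
    then probe the inner side once per matching outer row."""
    return outer_examined + outer_matches * inner_examined_per_probe

no_index      = cost(N,        FILTERED, N)  # full scans on t1 and t2
t1_index_only = cost(FILTERED, FILTERED, N)  # index finds 106 rows, t2 still scanned
t2_index_only = cost(N,        FILTERED, 1)  # t1 scanned, t2 probed by index
both_indexes  = cost(FILTERED, FILTERED, 1)  # index access on both sides

for name, c in [("no index", no_index), ("t1 only", t1_index_only),
                ("t2 only", t2_index_only), ("both", both_indexes)]:
    print(f"{name:10s} {c:>12,d} rows examined")
```

The model matches the shape of the results: the first two cases are dominated by the repeated scan of T2 (hence 7.83 vs 7.39 secs, barely different), while indexing the join column removes that term entirely.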

 

 

drop table seed;
CREATE TABLE seed ( id INT AUTO_INCREMENT PRIMARY KEY, val INT);
 insert into seed(val) values (1);
insert into seed(val) select val from seed;  /*  2 */
insert into seed(val) select val from seed;  /*  4 */
insert into seed(val) select val from seed;  /*  8 */
insert into seed(val) select val from seed;  /*  16 */
insert into seed(val) select val from seed;  /*  32 */
insert into seed(val) select val from seed;  /*  64 */
insert into seed(val) select val from seed;  /*  128 */
insert into seed(val) select val from seed;  /*  256 */
insert into seed(val) select val from seed;  /*  512 */
insert into seed(val) select val from seed;  /*  1024 */
insert into seed(val) select val from seed;  /*  2048 */
insert into seed(val) select val from seed;  /*  4096 */
insert into seed(val) select val from seed;  /*  8192 */
insert into seed(val) select val from seed;  /*  16384 */
insert into seed(val) select val from seed;  /*  32768 */
insert into seed(val) select val from seed;  /*  65536 */
insert into seed(val) select val from seed;  /*  131072 */
insert into seed(val) select val from seed;  /*  262144 */
insert into seed(val) select val from seed;  /*  524288 */
insert into seed(val) select val from seed;  /*  1048576 */
 select count(*) from seed;

 drop table dummy;
 
  create table dummy ( 
    id  int,
    clus int,
    val int,
    data VARCHAR(40)
 );
 
 insert into dummy (
    id, clus , val, data
)
 select 
 id,
 truncate(id/169,0),
 mod(id,10000), 
 CONV(FLOOR(RAND() * 99999999999999), 10, 36) 
from seed
;

select count(*) from dummy where clus = 1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+
select count(*) from dummy where val =1;
+----------+
| count(*) |
+----------+
|      106 |
+----------+

drop table t1;
drop table t2;

create table t1 as select * from dummy;
create table t2 as select * from dummy; 



The strange thing is that there are gaps in the id sequence in the seed table (the bulk insert … select statements appear to make AUTO_INCREMENT skip values), hence the odd divisor in the truncate, truncate(id/169,0), to get 106 rows per clus value.

First we run the query with no indexes (I had already created the indexes, so I force them off in the example with "USE INDEX ()").

Both plan nodes have access type

"access_type": "ALL"

so they are doing full table scans.

explain format=json select max(t1.data) 
from 
t1 USE INDEX ()
, t2 USE INDEX ()
where t1.id = t2.id 
and t1.clus = 1
;

1 row in set (7.83 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "18102538.86"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`clus` = 1)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "917.35",
            "eval_cost": "17.21",
            "prefix_cost": "18102538.86",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}
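The plan above shows `"using_join_buffer": "Block Nested Loop"` on T2. The idea behind a block nested loop can be sketched in Python: buffer a chunk of outer rows, then scan the inner table once per chunk instead of once per outer row. This is a toy sketch with made-up table shapes and unique join keys, not MySQL internals:

```python
# Toy sketch of a Block Nested Loop join. Assumes unique join keys;
# the tables below are made up for illustration.
def block_nl_join(outer, inner, join_key, buf_size=128):
    matches, inner_scans = [], 0
    for start in range(0, len(outer), buf_size):
        # hash the buffered chunk of outer rows on the join key
        buf = {row[join_key]: row for row in outer[start:start + buf_size]}
        inner_scans += 1                  # one full inner scan per chunk
        for s in inner:
            if s[join_key] in buf:
                matches.append((buf[s[join_key]], s))
    return matches, inner_scans

outer = [{"id": i} for i in range(300)]
inner = [{"id": i} for i in range(0, 600, 2)]   # even ids only
rows, scans = block_nl_join(outer, inner, "id")
print(len(rows), scans)   # 150 matches, 3 inner scans instead of 300
```

With no index on the inner table, buffering turns 300 inner scans into 3, but each scan still reads every inner row, which is why the full-scan cases stay slow.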

We can see that the filter on T1 should filter out most of the table (172 estimated rows out of 1045682), so now add an index on t1.clus, the predicate filter column (I remove the "USE INDEX ()" on t1 so the index on T1 is now used in the plan). The execution time is a little faster but not by much, 7.39 secs vs 7.83. We see that T1 is now accessed by index.

explain format=json select max(t1.data) 
from  t1, t2  
USE INDEX ()
where t1.id = t2.id 
and t1.clus    = 1
;
1 row in set (7.39 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "11085148.28"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ]
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 105,
          "filtered": "0.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "881.98",
            "eval_cost": "10.60",
            "prefix_cost": "11085148.28",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ],
          "attached_condition": "(`kyle`.`t2`.`id` = `kyle`.`t1`.`id`)"
        }
      }
    ]
  }
}

 

On T2 we see that we return 105 rows after examining 1045682, thus a good opportunity for an index. Now add an index on the join column and we get a significant improvement in execution time: from 7.39 secs to 0.49, i.e. 7.39/0.49 ≈ a 15x improvement.

explain format=json select max(t1.data) 
from  t1 USE INDEX () , t2  
where t1.id = t2.id 
and t1.clus   = 1
;
1 row in set (0.49 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "105582.64"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows_examined_per_scan": 1045682,
          "rows_produced_per_join": 172,
          "filtered": "0.02",
          "cost_info": {
            "read_cost": "105376.24",
            "eval_cost": "17.21",
            "prefix_cost": "105393.45",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "((`kyle`.`t1`.`clus` = 1) and (`kyle`.`t1`.`id` is not null))"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 172,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "171.98",
            "eval_cost": "17.21",
            "prefix_cost": "105582.64",
            "data_read_per_join": "29K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}
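The reason the index on the join column pays off so dramatically: for equality probes, an index on t2.id behaves like a map from id to row, so each outer row costs one lookup instead of a scan of all of t2. A toy Python sketch (the 50K-row tables and the 100-row filter below are made up for illustration, not the real t1/t2 data):

```python
# Toy tables: "clus" groups ids in runs of 100, "data" is a letter.
t1 = [{"id": i, "clus": i // 100, "data": chr(65 + i % 26)} for i in range(50_000)]
t2 = [{"id": i} for i in range(50_000)]

t1_filtered = [r for r in t1 if r["clus"] == 1]   # 100 rows, like clus = 1

# no index: every probe scans all of t2
def max_data_scan(outer, inner):
    return max(r["data"] for r in outer
               for s in inner if s["id"] == r["id"])

# "index": build a dict on t2.id once, then probe in O(1)
t2_by_id = {s["id"]: s for s in t2}
def max_data_indexed(outer):
    return max(r["data"] for r in outer if r["id"] in t2_by_id)

print(max_data_scan(t1_filtered, t2), max_data_indexed(t1_filtered))
```

Both return the same answer, but the scan version touches 100 × 50,000 inner rows while the indexed version does 100 lookups, the same asymmetry as the 7.39 vs 0.49 sec timings.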

Now use both indexes, which gives us a further 0.43 sec reduction in execution time; small in absolute terms, but as a ratio it is 0.49/0.06, i.e. an 8x improvement, still pretty good.

explain format=json select max(t1.data) 
from  t1, t2  
where t1.id = t2.id 
and t1.clus  = 1
;
1 row in set (0.06 sec)

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "153.62"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ref",
          "possible_keys": [
            "t1_clus"
          ],
          "key": "t1_clus",
          "used_key_parts": [
            "clus"
          ],
          "key_length": "5",
          "ref": [
            "const"
          ],
          "rows_examined_per_scan": 106,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "26.50",
            "eval_cost": "10.60",
            "prefix_cost": "37.10",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id",
            "clus",
            "data"
          ],
          "attached_condition": "(`kyle`.`t1`.`id` is not null)"
        }
      },
      {
        "table": {
          "table_name": "t2",
          "access_type": "ref",
          "possible_keys": [
            "t2_id"
          ],
          "key": "t2_id",
          "used_key_parts": [
            "id"
          ],
          "key_length": "5",
          "ref": [
            "kyle.t1.id"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 106,
          "filtered": "100.00",
          "using_index": true,
          "cost_info": {
            "read_cost": "105.93",
            "eval_cost": "10.60",
            "prefix_cost": "153.63",
            "data_read_per_join": "18K"
          },
          "used_columns": [
            "id"
          ]
        }
      }
    ]
  }
}


Indexes for Joins on Oracle

March 23rd, 2020

Looking at ways to programmatically analyze explain plans to see if adding indexes would help.

The analysis is straightforward for full table scans with strong predicate filters: if there is a strong predicate filter on an explain plan node with a full table scan, it's a good candidate for an index.

Now a similar situation exists for a join.

Again if the join has a strong predicate filter on a full table scan, that is a candidate.

On the other hand when the actual join fields look like they will benefit from an index, it’s less obvious.

In the queries below T1 and T2 have the same data.

Here is the query

select  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  
   and t1.clus   = 1  ;

 

First we try with no indexes.

Then we try with an index on the predicate filter on T1.

Then we add an index on the join column on T2.

The idea is that we should see a performance improvement with the addition of each index, so that the final, efficient path uses the predicate filter index on T1 to return a small number of rows, then drives into T2 with those rows using the index on the T2 join field via an NL join.

The surprising thing is that a HJ that simply hashes the predicate-filtered rows from T1, and probes that hash with all the rows from T2, is just as efficient, and we don't have to maintain the index on the t2 join column! That surprised me. Sure, HJs are cool, but I always thought of them as just-in-time indexes (i.e. the hashed result): work that could be avoided by having an index supporting an NL join, rather than recreating the hash on every query execution. (I'll have to do some load testing to see the actual impact of the NL vs HJ query on total database resource usage and load.)

Simply adding an index on the T1 filter column gets us a HJ that is just as efficient as an NL using the T2 join field, but in the HJ case we don't have to maintain the T2 join column index.

There is a strong predicate filter on T1.clus (called "clus" because the values are clustered; we can look at the non-clustered case later, i.e. the same query using the "val" column, which is not clustered).

In the queries below, an NL driving from the predicate-filtered T1 rows into T2, using an index on the join column, is efficient.

But the HJ, hashing T1 after the predicate filter and probing the hash with all the rows from T2, is, surprisingly (to me), as efficient or more so.
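The shape of that hash join can be sketched in Python: hash the ~100 predicate-filtered T1 rows (the small build side), then probe that hash once per T2 row. Toy data below; a real HJ hashes the join key with partitioning, workareas, possible spill to disk, and so on:

```python
# Toy tables shaped like the test data: (id, clus, data),
# with clus grouping ids in runs of 100. Made up for illustration.
t1 = [(i, i // 100, "d%03d" % (i % 1000)) for i in range(1, 100_001)]
t2 = [(i,) for i in range(1, 100_001)]

# build phase: apply the predicate filter first, hash on the join key
build = {tid: data for tid, clus, data in t1 if clus == 1}   # ~100 entries

# probe phase: one hash lookup per T2 row, no index on t2.id needed
result = max(build[tid] for (tid,) in t2 if tid in build)
print(len(build), result)
```

The hash table here is the "just-in-time index": tiny because only the filtered rows are hashed, but rebuilt on every execution, which is the trade-off against maintaining a real index on the T2 join column.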

The first example will start without any index usage.

Create test data:

drop table seed;
create table seed as
select rownum n from dual connect by level <= 1e6
;

drop table t1;
drop table t2;
create table t1 ( 
    id  int,
    clus int,
    val int,
    data VARCHAR(40)
 );
 
 insert into t1 (
    id, clus , val, data
)
 select 
 n,
 trunc(n/100),
 mod(n,10000),
dbms_random.string('U',dbms_random.value(2,7))
from seed;


select count(*) from t1 where clus = 1;
100
select count(*) from t1 where val = 1;
100

create table t2 as select * from t1;

alter session set Statistics_level=all;
set timing on;

The following 3 examples will use full table scans on both tables.

(I use "+ 0" to turn indexes off so I don't have to drop and recreate them for the demo, since I had already created the indexes before running these queries.)

The first NL below is slower than the following HASH join because the NL has to read the full table T2 for every row from T1, resulting in 300K+ buffer accesses (100 rows after the T1 filter times ~3,000 buffers per row).

The HASH join just has to read T2 once (3K+ buffer accesses) and look up each of those rows in a hash of T1; that is still a lot of rows (it should be 1M, i.e. all the rows in T2, which is what the estimate says, but the actual says 1212, which I don't understand).

Hashing T2 instead of T1 takes a lot more memory, 50M instead of 1.5M, and takes longer even though it only has to probe with the 100 rows from the T1 filter.
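The build-side choice is the point: build-plus-probe work is roughly the same number of operations either way, but the hash table must hold its entire build input in memory. A Python sketch of the scaling (dict sizes here are Python overhead, not Oracle's 50M vs 1.5M workareas, but the story is the same):

```python
import sys

# Hash the ~100 filtered T1 rows vs hashing all 1M T2 rows.
# Counts are the assumed round numbers from the test, not measurements.
small_build = {i: "d%d" % i for i in range(100)}          # hash filtered T1
big_build   = {i: "d%d" % i for i in range(1_000_000)}    # hash all of T2

print(sys.getsizeof(small_build), sys.getsizeof(big_build))
```

Probing the small hash 1M times and probing the big hash 100 times cost about the same in lookups, so the memory footprint of the build side is what separates the two plans.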

select  /*+ leading(t1 t2)  use_nl(t2) */ max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:06.83 |	 328K|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:06.83 |	 328K|
|   2 |   NESTED LOOPS	    |	   |	  1 |	  36 |	  100 |00:00:06.83 |	 328K|
|*  3 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.07 |	3275 |
|*  4 |    TABLE ACCESS FULL| T2   |	100 |	   1 |	  100 |00:00:06.76 |	 325K|
--------------------------------------------------------------------------------------
   3 - filter("T1"."CLUS"+0=1)
   4 - filter("T1"."ID"="T2"."ID"+0)


select max(t1.data) from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |	OMem |	1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:00.09 |	3281 |	     |	     |		|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:00.09 |	3281 |	     |	     |		|
|*  2 |   HASH JOIN SEMI    |	   |	  1 |	  36 |	  100 |00:00:00.09 |	3281 |	1753K|	1753K| 1414K (0)|
|*  3 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.09 |	3275 |	     |	     |		|
|   4 |    TABLE ACCESS FULL| T2   |	  1 |	1000K|	 1212 |00:00:00.01 |	   6 |	     |	     |		|
-----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID"+ 0)
   3 - filter("T1"."CLUS"+ 0=1)


select  /*+ leading(t2 t1) use_hash(t1) */   max(t1.data)  from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus + 0  = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation	    | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |	OMem |	1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |	   |	  1 |	     |	    1 |00:00:00.32 |	6526 |	     |	     |		|
|   1 |  SORT AGGREGATE     |	   |	  1 |	   1 |	    1 |00:00:00.32 |	6526 |	     |	     |		|
|*  2 |   HASH JOIN	    |	   |	  1 |	  36 |	  100 |00:00:00.32 |	6526 |	  50M|	9345K|	 49M (0)|
|   3 |    TABLE ACCESS FULL| T2   |	  1 |	1000K|	 1000K|00:00:00.03 |	3251 |	     |	     |		|
|*  4 |    TABLE ACCESS FULL| T1   |	  1 |	  36 |	  100 |00:00:00.08 |	3275 |	     |	     |		|
-----------------------------------------------------------------------------------------------------------------


Add an index on the predicate filter:

create index t1_clus on t1 (clus);

And we get a much faster hash join access as we access a lot less buffers on T1.

NL driving from T1 still has to do a full table scan on T2 for every row so is slow.

 

 

select max(t1.data) from  t1, t2 
where 
    t1.id = t2.id  + 0 
   and t1.clus    = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));


-------------------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      10 |	  |	  |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      10 |	  |	  ||
|*  2 |   HASH JOIN SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      10 |  1753K|  1753K| 1408K (0)|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
|   5 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1212 |00:00:00.01 |	6 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------- 

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("T1"."ID"="T2"."ID")
4 - access("T1"."CLUS"=1)


select /*+ leading(t1 t2)  use_nl(t2) */  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id + 0
   and t1.clus   = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers | Reads  |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:08:14.23 |    2893K|   2892K|
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:08:14.23 |    2893K|   2892K|
|   2 |   NESTED LOOPS			      | 	|      1 |    100 |    100 |00:08:14.23 |    2893K|   2892K|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	 0 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	 0 |
|*  5 |    TABLE ACCESS FULL		      | T2	|    100 |	1 |    100 |00:08:14.23 |    2893K|   2892K|
--------------------------------------------------------------------------------------------------------------------

Adding an index on T2.id and allowing the NL driving from T1 to look up rows in T2 by index gets the NL down to about the same cost as the HASH JOIN.

The HJ that hashes T2 is faster when it accesses T1 by its index on the predicate filter, but hashing all of T2 still slows it down.

I don't understand why the HJ hashing T2 is slower when accessing T1 by index instead of by full table scan.

drop index t2_id;

create index t2_id on t2 (id);  

select /*+ leading(t1 t2)  use_nl(t2) */  max(t1.data) from  t1, t2 
where 
      t1.id = t2.id  
   and t1.clus   = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS'));

-----------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      17 |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      17 |
|   2 |   NESTED LOOPS SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      17 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |
|*  5 |    INDEX RANGE SCAN		      | T2_ID	|    100 |   1000K|    100 |00:00:00.01 |      13 |
-----------------------------------------------------------------------------------------------------------

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id +1 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.22 |    3255 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.22 |    3255 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |     78 |    100 |00:00:00.22 |    3255 |    50M|  9345K|   49M (0)|
|   3 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1000K|00:00:00.03 |    3251 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID"+1)
   5 - access("T1"."CLUS"=1)

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:01.25 |    2246 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:01.25 |    2246 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |    100 |    100 |00:00:01.25 |    2246 |    43M|  6111K|   42M (0)|
|   3 |    INDEX FAST FULL SCAN 	      | T2_ID	|      1 |   1000K|   1000K|00:00:00.31 |    2242 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("T1"."ID"="T2"."ID")
   5 - access("T1"."CLUS"=1)

For reference, the order of NL and HJ access and hints:

 

Screen Shot 2020-03-23 at 1.24.20 PM

reference

 

http://datavirtualizer.com/right-deep-left-deep-and-bushy-joins-in-sql/

 

Followup:

The Hash Join put about 7x as much load on the DB (0.22 AAS) as the nested loops (0.03 AAS), and both are doing 178 executions/sec.

First, the load chart image with top SQL drill-down filtered for the efficient HJ version. The HJ plan is shown here.

 

-------------------------------------------------------------------------------------------------------------------------------- 
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
-------------------------------------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      10 |	  |	  |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      10 |	  |	  ||
|*  2 |   HASH JOIN SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      10 |  1753K|  1753K| 1408K (0)|
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
|   5 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1212 |00:00:00.01 |	6 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------- 

 

Screen Shot 2020-03-24 at 11.21.06 AM

The second image is the load filtered by the NL join, and here is the plan.

-----------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.01 |      17 |
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.01 |      17 |
|   2 |   NESTED LOOPS SEMI		      | 	|      1 |    100 |    100 |00:00:00.01 |      17 |
|   3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |
|*  4 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |
|*  5 |    INDEX RANGE SCAN		      | T2_ID	|    100 |   1000K|    100 |00:00:00.01 |      13 |
-----------------------------------------------------------------------------------------------------------

 

Screen Shot 2020-03-24 at 11.21.13 AM

 

The following hash join, which hashes T2, was not as efficient as hashing the T1 result, but it confused me that it looked like using the index on the T1 filter was more expensive than not using it. It turns out to be the other way around under actual load.

Even though skipping the T1 index shows an A-Time of 0.22 and using the T1 Index shows 1.25, when running a concurrent load it looks the opposite.

The first example below avoids the index on T1 and has an AAS of 10.36 and an elapsed time of 3.7 secs (instead of 0.22 in the explain plan).

The second, which uses the index on the T1 filter, has an AAS of 8.2 and an elapsed time of 3.1 secs (instead of 1.25 in the explain plan).

Both are executing at 2.24 executes/sec.

 

select /*+ leading(t2 t1) use_hash(t1)  */
max(t1.data) from  t1, t2 
where 
      t1.id = t2.id +1 
   and t1.clus = 1  ;
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:00.22 |    3255 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:00.22 |    3255 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |     78 |    100 |00:00:00.22 |    3255 |    50M|  9345K|   49M (0)|
|   3 |    TABLE ACCESS FULL		      | T2	|      1 |   1000K|   1000K|00:00:00.03 |    3251 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
-------------------------------------------------------------------------------------------------------------------------------


Screen Shot 2020-03-24 at 12.09.16 PM

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation			      | Name	| Starts | E-Rows | A-Rows |   A-Time	| Buffers |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT		      | 	|      1 |	  |	 1 |00:00:01.25 |    2246 |	  |	  ||
|   1 |  SORT AGGREGATE 		      | 	|      1 |	1 |	 1 |00:00:01.25 |    2246 |	  |	  ||
|*  2 |   HASH JOIN			      | 	|      1 |    100 |    100 |00:00:01.25 |    2246 |    43M|  6111K|   42M (0)|
|   3 |    INDEX FAST FULL SCAN 	      | T2_ID	|      1 |   1000K|   1000K|00:00:00.31 |    2242 |	  |	  ||
|   4 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1	|      1 |    100 |    100 |00:00:00.01 |	4 |	  |	  ||
|*  5 |     INDEX RANGE SCAN		      | T1_CLUS |      1 |    100 |    100 |00:00:00.01 |	3 |	  |	  ||
--------------------------------------------------------------------------------------------------------------------------------

Screen Shot 2020-03-24 at 12.09.24 PM

Reference

http://datavirtualizer.com/right-deep-left-deep-and-bushy-joins-in-sql/

http://www.nocoug.org/download/2012-08/Jonathan_Lewis_Creating_Tests.pdf


Oracle Imp (import) to RDS

March 2nd, 2020

I had an old export dump file I wanted to import into RDS.

I first tried to use S3 as a hosting site for the .dmp file, using rdsadmin.rdsadmin_s3_tasks.download_from_s3 to copy the file to the RDS database and then import it with DBMS_DATAPUMP.

Unfortunately DBMS_DATAPUMP only works with expdp files, so instead of using DBMS_DATAPUMP I got the old "imp" binary and pointed it via TWO_TASK at the RDS database I wanted the data loaded into.

For the record, here are the steps I took.

 

First I put the dump file into an S3 bucket called "kylelf-datadump".

Create an IAM policy to allow Amazon RDS access to an Amazon S3 bucket

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/oracle-s3-integration.html#oracle-s3-integration.preparing

My policy looks like

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Sid": "VisualEditor0",
 "Effect": "Allow",
 "Action": [
 "s3:PutObject",
 "s3:GetObject",
 "s3:ListBucket"
 ],
 "Resource": "arn:aws:s3:::kylelf-datadump"
 }
 ]
}
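A quick sanity check before pasting a policy like the one above into the IAM console is just to parse it, since a stray comma or quote is the most common failure. A small Python sketch (the policy text is the one from above; the note about resources is a general AWS convention, not something from this post):

```python
import json

# The bucket policy from above, verbatim.
policy = json.loads("""
{
 "Version": "2012-10-17",
 "Statement": [
  {
   "Sid": "VisualEditor0",
   "Effect": "Allow",
   "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
   "Resource": "arn:aws:s3:::kylelf-datadump"
  }
 ]
}
""")

stmt = policy["Statement"][0]
assert stmt["Effect"] == "Allow"
assert {"s3:GetObject", "s3:ListBucket"} <= set(stmt["Action"])
# Note: AWS examples typically list both "arn:aws:s3:::bucket" and
# "arn:aws:s3:::bucket/*" as resources, since object-level actions
# like s3:GetObject apply to objects rather than the bucket itself.
```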

Create an IAM role to allow Amazon RDS access to an Amazon S3 bucket

again, see: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/oracle-s3-integration.html#oracle-s3-integration.preparing

add the role to the database you want to import into

  1. Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/.
  2. Choose the Oracle DB instance name to display its details.
  3. On the Connectivity & security tab, in the Manage IAM roles section, choose the role to add under Add IAM roles to this instance.
  4. For Feature, choose S3_INTEGRATION.
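The console steps above can also be scripted; a rough sketch with the AWS CLI (the instance identifier and role ARN below are placeholders, substitute your own):

```
# Hypothetical identifiers -- replace with your own.
DB_INSTANCE=ora19
ROLE_ARN=arn:aws:iam::123456789012:role/rds-s3-integration-role

# Attach the IAM role to the RDS instance for the S3_INTEGRATION feature.
aws rds add-role-to-db-instance \
    --db-instance-identifier "$DB_INSTANCE" \
    --role-arn "$ROLE_ARN" \
    --feature-name S3_INTEGRATION
```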

Create an S3 option group

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithOptionGroups.html#USER_WorkingWithOptionGroups.Create

  1. Create an option group.
  2. Add S3 integration to the option group.
  3. Modify the instance to use the new option group.

Then download the dump file from S3.

SELECT rdsadmin.rdsadmin_s3_tasks.download_from_s3(
 p_bucket_name => 'kylelf-datadump', 
 p_directory_name => 'DATA_PUMP_DIR') 
 AS TASK_ID FROM DUAL; 

TASK_ID
----------------
1583185124793-43

Check the status of the download:

SELECT text FROM
table(rdsadmin.rds_file_util.read_text_file('BDUMP','dbtask-1583185124793-43.log'));

Check to see the file is there

select * from table(RDSADMIN.RDS_FILE_UTIL.LISTDIR('DATA_PUMP_DIR')) order by mtime;
FILENAME   TYPE     FILESIZE MTIME
---------- ---- ---------- ---------
TC1.dmp    file   39829504 02-MAR-20

My file is the TC1.dmp file

Now import it

DECLARE
hdnl NUMBER;
BEGIN
hdnl := DBMS_DATAPUMP.OPEN( operation => 'IMPORT', job_mode => 'SCHEMA', job_name=>null);
DBMS_DATAPUMP.ADD_FILE( handle => hdnl, filename => 'TC1.dmp', directory => 'DATA_PUMP_DIR', filetype => dbms_datapump.ku$_file_type_dump_file);
DBMS_DATAPUMP.METADATA_FILTER(hdnl,'SCHEMA_EXPR','IN (''KYLELF'')');
DBMS_DATAPUMP.START_JOB(hdnl);
END;
/

This gave me the error

ORA-39001: invalid argument value
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 79
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3507
ORA-06512: at "SYS.DBMS_DATAPUMP", line 3756
ORA-06512: at line 5

Not very informative.

I found an example of how to get more verbose error messages at

https://stackoverflow.com/questions/45348688/oracle-datapump-export-error-invalid-argument-value

SET VERIFY OFF
SET FEEDBACK OFF
SET SERVEROUTPUT ON

DECLARE
  ind        NUMBER;
  fileType   NUMBER;
  value      VARCHAR2(2048);
  infoTab    KU$_DUMPFILE_INFO := KU$_DUMPFILE_INFO();
 
BEGIN
  --
  -- Get the information about the dump file into the infoTab.
  --
  BEGIN
    DBMS_DATAPUMP.GET_DUMPFILE_INFO('TC1.dmp','DATA_PUMP_DIR',infoTab,fileType);
    DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
    DBMS_OUTPUT.PUT_LINE('Information for file: TC1.dmp');
 
    --
    -- Determine what type of file is being looked at.
    --
    CASE fileType
      WHEN 1 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is a Data Pump dump file');
      WHEN 2 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is an Original Export dump file');
      WHEN 3 THEN
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is an External Table dump file');
      ELSE
        DBMS_OUTPUT.PUT_LINE('TC1.dmp is not a dump file');
        DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
    END CASE;
 
  EXCEPTION
    WHEN OTHERS THEN
      DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
      DBMS_OUTPUT.PUT_LINE('Error retrieving information for file: ' ||
                           'TC1.dmp');
      DBMS_OUTPUT.PUT_LINE(SQLERRM);
      DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
      fileType := 0;
  END;
 
  --
  -- If a valid file type was returned, then loop through the infoTab and 
  -- display each item code and value returned.
  --
  IF fileType > 0
  THEN
    DBMS_OUTPUT.PUT_LINE('The information table has ' || 
                          TO_CHAR(infoTab.COUNT) || ' entries');
    DBMS_OUTPUT.PUT_LINE('---------------------------------------------');
 
    ind := infoTab.FIRST;
    WHILE ind IS NOT NULL
    LOOP
      --
      -- The following item codes return boolean values in the form
      -- of a '1' or a '0'. Display them as 'Yes' or 'No'.
      --
      value := NVL(infoTab(ind).value, 'NULL');
      IF infoTab(ind).item_code IN
         (DBMS_DATAPUMP.KU$_DFHDR_MASTER_PRESENT,
          DBMS_DATAPUMP.KU$_DFHDR_DIRPATH,
          DBMS_DATAPUMP.KU$_DFHDR_METADATA_COMPRESSED,
          DBMS_DATAPUMP.KU$_DFHDR_DATA_COMPRESSED,
          DBMS_DATAPUMP.KU$_DFHDR_METADATA_ENCRYPTED,
          DBMS_DATAPUMP.KU$_DFHDR_DATA_ENCRYPTED,
          DBMS_DATAPUMP.KU$_DFHDR_COLUMNS_ENCRYPTED)
      THEN
        CASE value
          WHEN '1' THEN value := 'Yes';
          WHEN '0' THEN value := 'No';
        END CASE;
      END IF;
 
      --
      -- Display each item code with an appropriate name followed by
      -- its value.
      --
      CASE infoTab(ind).item_code
        --
        -- The following item codes have been available since Oracle
        -- Database 10g, Release 10.2.
        --
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FILE_VERSION   THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Version:         ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PRESENT THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Present:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_GUID THEN
          DBMS_OUTPUT.PUT_LINE('Job Guid:                  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FILE_NUMBER THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Number:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_CHARSET_ID  THEN
          DBMS_OUTPUT.PUT_LINE('Character Set ID:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_CREATION_DATE THEN
          DBMS_OUTPUT.PUT_LINE('Creation Date:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_FLAGS THEN
          DBMS_OUTPUT.PUT_LINE('Internal Dump Flags:       ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_JOB_NAME THEN
          DBMS_OUTPUT.PUT_LINE('Job Name:                  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_PLATFORM THEN
          DBMS_OUTPUT.PUT_LINE('Platform Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_INSTANCE THEN
          DBMS_OUTPUT.PUT_LINE('Instance Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_LANGUAGE THEN
          DBMS_OUTPUT.PUT_LINE('Language Name:             ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_BLOCKSIZE THEN
          DBMS_OUTPUT.PUT_LINE('Dump File Block Size:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DIRPATH THEN
          DBMS_OUTPUT.PUT_LINE('Direct Path Mode:          ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_METADATA_COMPRESSED THEN
          DBMS_OUTPUT.PUT_LINE('Metadata Compressed:       ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DB_VERSION THEN
          DBMS_OUTPUT.PUT_LINE('Database Version:          ' || value);
 
        --
        -- The following item codes were introduced in Oracle Database 11g
        -- Release 11.1
        --

        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PIECE_COUNT THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Piece Count:  ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_MASTER_PIECE_NUMBER THEN
          DBMS_OUTPUT.PUT_LINE('Master Table Piece Number: ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DATA_COMPRESSED THEN
          DBMS_OUTPUT.PUT_LINE('Table Data Compressed:     ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_METADATA_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('Metadata Encrypted:        ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_DATA_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('Table Data Encrypted:      ' || value);
        WHEN DBMS_DATAPUMP.KU$_DFHDR_COLUMNS_ENCRYPTED THEN
          DBMS_OUTPUT.PUT_LINE('TDE Columns Encrypted:     ' || value);
 
        --
        -- For the DBMS_DATAPUMP.KU$_DFHDR_ENCRYPTION_MODE item code a
        -- numeric value is returned. So examine that numeric value
        -- and display an appropriate name value for it.
        --
        WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCRYPTION_MODE THEN
          CASE TO_NUMBER(value)
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_NONE THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           None');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_PASSWORD THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Password');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_DUAL THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Dual');
            WHEN DBMS_DATAPUMP.KU$_DFHDR_ENCMODE_TRANS THEN
              DBMS_OUTPUT.PUT_LINE('Encryption Mode:           Transparent');
          END CASE;
 
        --
        -- The following item codes were introduced in Oracle Database 12c
        -- Release 12.1
        --
 
        --
        -- For the DBMS_DATAPUMP.KU$_DFHDR_COMPRESSION_ALG item code a
        -- numeric value is returned. So examine that numeric value and
        -- display an appropriate name value for it.
        --
        
      END CASE;
      ind := infoTab.NEXT(ind);
    END LOOP;
  END IF;
END;
/

Now I can see the error better:

39001: ORA-39001: invalid argument value
-39000: ORA-39000: bad dump file specification
-39143: ORA-39143: dump file "/rdsdbdata/datapump/TC1.dmp" may be an original
export dump file

 

Looking this up via Google indicates that this is an old "exp" dump file, which is not compatible with "impdp".

Thus I had to use the old imp tool.
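You can spot this without going through DBMS_DATAPUMP at all: classic "exp" dump files start with a readable EXPORT:V<version> banner, while Data Pump (expdp) files don't. A quick hedged sketch (we fake a small old-style header here rather than shipping a real dump):

```shell
# Simulate the first bytes of an original exp dump file.
printf '\003EXPORT:V09.02.00\nDSCOTT\n' > /tmp/sample.dmp

# Classic exp dumps carry an EXPORT:V<version> banner near the start.
if head -c 32 /tmp/sample.dmp | grep -aq 'EXPORT:V'; then
    echo "original exp dump file - use imp"
else
    echo "likely a Data Pump file - use impdp"
fi
# prints: original exp dump file - use imp
```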

From OTN

https://www.oracle.com/database/technologies/instant-client/linux-x86-64-downloads.html

I got

oracle-instantclient19.6-basic-19.6.0.0.0-1.x86_64.rpm
oracle-instantclient19.6-sqlplus-19.6.0.0.0-1.x86_64.rpm
oracle-instantclient19.6-tools-19.6.0.0.0-1.x86_64.rpm

I installed these on an EC2 instance:


sudo rpm -ivh oracle-instantclient19.6-basic-19.6.0.0.0-1.x86_64.rpm
sudo rpm -ivh oracle-instantclient19.6-sqlplus-19.6.0.0.0-1.x86_64.rpm
sudo rpm -ivh oracle-instantclient19.6-tools-19.6.0.0.0-1.x86_64.rpm

Then copy the TC1.dmp file to the EC2

Then use TWO_TASK to point to my database

export TWO_TASK='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=ora19.xxxxxxxx.us-east-1.rds.amazonaws.com)(PORT=1521)))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=orcl)))'

/usr/lib/oracle/19.6/client64/bin/imp user/pw file=TC1.dmp FROMUSER='scott' TOUSER='kylelf'
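The long connect descriptor is easier to manage if it's assembled from parts; a small sketch (the endpoint values below are placeholders for your own RDS instance):

```shell
# Placeholder endpoint values -- substitute your own RDS instance.
RDS_HOST="ora19.example.us-east-1.rds.amazonaws.com"
RDS_PORT=1521
RDS_SERVICE="orcl"

# Assemble the EZ-style connect descriptor from its parts.
TWO_TASK="(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=${RDS_HOST})(PORT=${RDS_PORT})))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=${RDS_SERVICE})))"
export TWO_TASK
echo "$TWO_TASK"
```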

I also hit a hiccup: the first database I used was 11.2 while imp came from the 19.x distribution, so I got this error:

 

IMP-00003: ORACLE error 942 encountered
ORA-00942: table or view does not exist
IMP-00023: Import views not installed, please notify your DBA
IMP-00000: Import terminated unsuccessfully

So I had to create a version 19 Oracle database, and then the import worked.

 
