Covering Indexes in Postgres

June 19th, 2020

A covering index is one that contains every column a query needs, so that, in theory, only the index has to be accessed. This seemingly straightforward strategy is much trickier on Postgres.

Let’s take TPC-H query 6:

select
        sum(l_extendedprice * l_discount) as revenue
from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;

We’ve got three predicate filter columns:

  • l_shipdate
  • l_discount
  • l_quantity

None of them is very selective on its own, but all three together are pretty good:

select count(*) from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date);

→ 1,823,373

select count(*) from
        lineitem
where l_discount between 0.06 - 0.01 and 0.06 + 0.01;

→ 3,272,056

select count(*) from
        lineitem
where l_quantity < 24;

→ 5,517,389

select count(*) from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;

→ 228,160

select count(*) from lineitem;

→ 11,997,996

228,160/11,997,996 = 1.9%
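
The same figures can be gathered in a single pass over the table. This is only a sketch, using aggregate FILTER clauses (available since Postgres 9.4); the column aliases are made up for illustration, and the date and discount bounds are written out as literals.

```sql
-- One scan of lineitem instead of five separate counts.
-- Aliases (by_shipdate, ...) are illustrative names, not from the original queries.
-- date '1997-01-01' is date '1996-01-01' + 1 year; 0.05 and 0.07 are 0.06 -/+ 0.01.
select
        count(*) filter (where l_shipdate >= date '1996-01-01'
                           and l_shipdate < date '1997-01-01')   as by_shipdate,
        count(*) filter (where l_discount between 0.05 and 0.07) as by_discount,
        count(*) filter (where l_quantity < 24)                  as by_quantity,
        count(*) filter (where l_shipdate >= date '1996-01-01'
                           and l_shipdate < date '1997-01-01'
                           and l_discount between 0.05 and 0.07
                           and l_quantity < 24)                  as combined,
        count(*)                                                 as total
from
        lineitem;
```

Going by the counts above, the individual predicates keep roughly 15%, 27% and 46% of the rows, against 1.9% for all three together.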

 

With the index added on all 3 fields, the optimizer doesn’t even use it!

https://explain.depesz.com/s/OG27
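
For reference, a setup along these lines reproduces the experiment. The index name is hypothetical and the column order is an assumption, since neither is stated above.

```sql
-- Hypothetical index name; the column order is a guess.
create index lineitem_q6_filter_idx
        on lineitem (l_shipdate, l_discount, l_quantity);

-- Refresh planner statistics so the new index is costed with current numbers.
analyze lineitem;

-- Show the plan the optimizer actually picks, with timings and buffer counts.
explain (analyze, buffers)
select
        sum(l_extendedprice * l_discount) as revenue
from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;
```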

When I forced the optimizer to use the index (set enable_seqscan=off;), the improvement was small, from 800+ ms to 600+ ms, mainly because every row found in the index still has to be fetched from the table.

https://explain.depesz.com/s/bJMg
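
A sketch of that step, scoped to the current session. Note that enable_seqscan = off does not forbid sequential scans; it only makes the planner treat them as very expensive, so it picks another access path when one exists.

```sql
-- Discourage sequential scans for this session only.
set enable_seqscan = off;

explain (analyze, buffers)
select
        sum(l_extendedprice * l_discount) as revenue
from
        lineitem
where
            l_shipdate >= date '1996-01-01'
        and l_shipdate < cast(date '1996-01-01' + interval '1 year' as date)
        and l_discount between 0.06 - 0.01 and 0.06 + 0.01
        and l_quantity < 24;

-- Put the planner setting back to its configured value.
reset enable_seqscan;
```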

I then made a covering index with all the columns used in the query, and it still goes back to the table!

https://explain.depesz.com/s/92mV
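
One way such an index could be defined is shown below. The name and column order are assumptions; the only requirement for covering this query is that l_extendedprice is present alongside the three filter columns.

```sql
-- Hypothetical name; all four columns the query touches are in the index.
create index lineitem_q6_covering_idx
        on lineitem (l_shipdate, l_discount, l_quantity, l_extendedprice);

-- Alternative on Postgres 11 and later: keep the payload column out of the key.
-- create index lineitem_q6_covering_idx
--         on lineitem (l_shipdate, l_discount, l_quantity)
--         include (l_extendedprice);
```

In the EXPLAIN (ANALYZE) output, the node to look for is "Index Only Scan", and its "Heap Fetches" line reports how many rows still required a visit to the table.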

Now if I run a vacuum, sure enough, the covering index works and I get a 6x speedup (870 ms to 140 ms)!

https://explain.depesz.com/s/fdy5
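
A minimal sketch of the vacuum step, followed by a rough check of how much of the table is marked all-visible. The relpages and relallvisible columns of pg_class are estimates maintained by VACUUM and ANALYZE, so treat the percentage as approximate.

```sql
-- Vacuum the table (this also sets visibility-map bits) and refresh statistics.
vacuum (analyze) lineitem;

-- Approximate share of heap pages currently marked all-visible.
select
        relname,
        relpages,
        relallvisible,
        round(100.0 * relallvisible / nullif(relpages, 0), 1) as pct_all_visible
from
        pg_class
where
        relname = 'lineitem';
```

The closer pct_all_visible is to 100, the fewer heap fetches an index-only scan should need.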

What does the vacuum have to do with the covering index?

From Jim Nasby:

The issue with covering indexes is that you still have to visit the heap page, unless that page happens to be marked as being all visible. Pages can be marked as all visible by autovacuum (by default runs once 10% of rows in a table have been updated or deleted) or a manual vacuum. They can only be marked visible if all live rows on the page are older than the oldest running transaction in the system.

In other words, effectiveness of this technique is highly dependent on the workload.
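
To see how this plays out on a particular system, the autovacuum trigger settings and the table's vacuum history can be inspected as sketched below. The exact threshold depends on configuration: a stock PostgreSQL install ships with autovacuum_vacuum_scale_factor = 0.2 and autovacuum_vacuum_threshold = 50, so the percentage on your system may differ from the one quoted above.

```sql
-- Settings that decide when autovacuum processes a table.
select
        name,
        setting
from
        pg_settings
where
        name in ('autovacuum',
                 'autovacuum_vacuum_threshold',
                 'autovacuum_vacuum_scale_factor');

-- Dead-row count and the last time the table was vacuumed, manually or automatically.
select
        relname,
        n_live_tup,
        n_dead_tup,
        last_vacuum,
        last_autovacuum
from
        pg_stat_user_tables
where
        relname = 'lineitem';
```

On releases before PostgreSQL 13, inserts alone do not trigger autovacuum, so a freshly bulk-loaded table that is never updated can go a long time without its pages being marked all-visible.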

 

Reference

https://www.percona.com/blog/2018/08/06/basic-understanding-bloat-vacuum-postgresql-mvcc/

TL;DR: Postgres keeps old row versions (its equivalent of undo) in the data blocks. Vacuum clears out the “dead” rows that are no longer needed.

https://bitnine.net/blog-useful-information/index-only-scan-and-visibility-map-12/

“Because there is no information for version in an index tuple, it is impossible to judge whether a tuple is readable in the current transaction or not”

https://blog.makandra.com/2018/11/investigating-slow-postgres-index-only-scans/

A replica with hot_standby_feedback enabled can affect index-only scans, because long-running queries on the standby can hold back vacuum on the primary.
