The Importance of Understanding Internals - The PostgreSQL Merge Join Cost-Estimation Trap (with a Real Case)
Background
PostgreSQL supports three join methods: nested loop, merge, and hash.
The differences between these three join methods and how they work are described in
https://www.postgresql.org/docs/devel/static/planner-optimizer.html
《PostgreSQL nestloop/hash/merge join講解》
nested loop join:
The right relation is scanned once for every row found in the left relation. This strategy is easy to implement but can be very time consuming. (However, if the right relation can be scanned with an index scan, this can be a good strategy. It is possible to use values from the current row of the left relation as keys for the index scan of the right.)

merge join:
Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key.

hash join:
the right relation is first scanned and loaded into a hash table, using its join attributes as hash keys. Next the left relation is scanned and the appropriate values of every row found are used as hash keys to locate the matching rows in the table.

For a merge join, when estimating its cost, if the join column has an index the planner scans that index to obtain the column's actual minimum and maximum values.
(Note that the optimizer statistics do not store the column's minimum and maximum values.)
View "pg_catalog.pg_stats" Column | Type | Modifiers ------------------------+----------+----------- schemaname | name | tablename | name | attname | name | inherited | boolean | null_frac | real | avg_width | integer | n_distinct | real | most_common_vals | anyarray | most_common_freqs | real[] | histogram_bounds | anyarray | correlation | real | most_common_elems | anyarray | most_common_elem_freqs | real[] | elem_count_histogram | real[] |那么問題來了,如果索引出現了劇烈傾斜(或者沒有及時釋放空頁),那么在評估merge join的執行計劃時,可能導致執行計劃時間過長。
Let's look at an example.
Merge join cost estimation
Create two test tables with table-level autovacuum disabled, so it does not interfere with the results:
postgres=# create unlogged table tbl1(id int, info text) with (autovacuum_enabled=off);
CREATE TABLE
postgres=# create unlogged table tbl2(id int, info text) with (autovacuum_enabled=off);
CREATE TABLE

Insert 10 million rows into each table:
postgres=# insert into tbl1 select generate_series(1,10000000);
INSERT 0 10000000
postgres=# insert into tbl2 select generate_series(1,10000000);
INSERT 0 10000000

Check that merge join is enabled:
postgres=# show enable_mergejoin ;
 enable_mergejoin
------------------
 on
(1 row)

Turn on timing:
postgres=# \timing
Timing is on.

Look at the execution plan; at this point generating the plan takes a perfectly normal amount of time:
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Merge Join  (cost=1838455.64..2370285748.91 rows=157893676470 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Sort  (cost=919227.82..933276.56 rows=5619496 width=36)
         Sort Key: tbl1.id
         ->  Seq Scan on tbl1  (cost=0.00..100442.96 rows=5619496 width=36)
   ->  Materialize  (cost=919227.82..947325.30 rows=5619496 width=36)
         ->  Sort  (cost=919227.82..933276.56 rows=5619496 width=36)
               Sort Key: tbl2.id
               ->  Seq Scan on tbl2  (cost=0.00..100442.96 rows=5619496 width=36)
(9 rows)

Time: 1.134 ms

Collect statistics and generate the tables' visibility map (vm) and free space map (fsm) files:
postgres=# vacuum analyze tbl1;
VACUUM
Time: 834.366 ms
postgres=# vacuum analyze tbl2;
VACUUM
Time: 835.022 ms

Generate the plan again, this time forcing a merge join; planning time is still normal.
When there is no index on the join column, costing a merge join does not require fetching the minimum and maximum values:
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Hash Join  (cost=347372.66..975995.14 rows=9999985 width=72)
   Hash Cond: (tbl1.id = tbl2.id)
   ->  Seq Scan on tbl1  (cost=0.00..144247.85 rows=9999985 width=36)
   ->  Hash  (cost=144247.85..144247.85 rows=9999985 width=36)
         ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(5 rows)

Time: 0.633 ms
postgres=# set enable_hashjoin=off;
SET
Time: 0.246 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Merge Join  (cost=3285716.66..3510716.32 rows=9999985 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl1.id
         ->  Seq Scan on tbl1  (cost=0.00..144247.85 rows=9999985 width=36)
   ->  Materialize  (cost=1642858.33..1692858.26 rows=9999985 width=36)
         ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
               Sort Key: tbl2.id
               ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(9 rows)

Time: 0.469 ms
postgres=# set enable_material =off;
SET
Time: 0.205 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                 QUERY PLAN
----------------------------------------------------------------------------
 Merge Join  (cost=3285716.66..3485716.36 rows=9999985 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl1.id
         ->  Seq Scan on tbl1  (cost=0.00..144247.85 rows=9999985 width=36)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl2.id
         ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(8 rows)

Time: 0.436 ms

Now create an index on tbl1's join column, id:
postgres=# create index idx_tbl1_id on tbl1(id);
CREATE INDEX
Time: 2813.772 ms

The index is currently 214 MB:
postgres=# \di+ idx_tbl1_id
                          List of relations
 Schema |    Name     | Type  |  Owner   | Table |  Size  | Description
--------+-------------+-------+----------+-------+--------+-------------
 public | idx_tbl1_id | index | postgres | tbl1  | 214 MB |
(1 row)

Delete the first 9,999,999 rows from tbl1:
postgres=# delete from tbl1 where id<10000000;
DELETE 9999999
Time: 3123.303 ms

Generate the execution plan again: planning now takes dramatically longer.
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
 Merge Join  (cost=1642863.77..2047750.44 rows=9999985 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Index Scan using idx_tbl1_id on tbl1  (cost=0.43..229897.34 rows=10000000 width=36)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl2.id
         ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)

Time: 1317.079 ms

Planning it once more is still abnormally slow, though somewhat better, probably because the index pages are now cached in memory:
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                         QUERY PLAN
-------------------------------------------------------------------------------------------
 Merge Join  (cost=1642863.77..2047750.44 rows=9999985 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Index Scan using idx_tbl1_id on tbl1  (cost=0.43..229897.34 rows=10000000 width=36)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl2.id
         ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)

Time: 81.410 ms

The planning time is essentially the same as the time it takes to look up the join column's min and max through the index:
postgres=# select min(id),max(id) from tbl1;
   min    |   max
----------+----------
 10000000 | 10000000
(1 row)

Time: 81.591 ms

When merge join is not being costed, planning time is normal:
postgres=# explain select min(id),max(id) from tbl1;
                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.91..0.92 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.46 rows=1 width=4)
           ->  Index Only Scan using idx_tbl1_id on tbl1  (cost=0.43..210649.04 rows=10000000 width=4)
                 Index Cond: (id IS NOT NULL)
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.43..0.46 rows=1 width=4)
           ->  Index Only Scan Backward using idx_tbl1_id on tbl1 tbl1_1  (cost=0.43..210649.04 rows=10000000 width=4)
                 Index Cond: (id IS NOT NULL)
(9 rows)

Time: 0.679 ms

With enable_mergejoin turned off, planning time returns to normal, so the root cause is the costing of the merge join path itself. A more detailed analysis follows later.
postgres=# set enable_mergejoin =off;
SET
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Nested Loop  (cost=0.43..13754566787.32 rows=10000000 width=72)
   ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
   ->  Index Scan using idx_tbl1_id on tbl1  (cost=0.43..875.44 rows=50000 width=36)
         Index Cond: (id = tbl2.id)
(4 rows)

Time: 0.602 ms

The index is still 214 MB:
postgres=# \di+ idx_tbl1_id
                          List of relations
 Schema |    Name     | Type  |  Owner   | Table |  Size  | Description
--------+-------------+-------+----------+-------+--------+-------------
 public | idx_tbl1_id | index | postgres | tbl1  | 214 MB |
(1 row)

Use the pageinspect extension to examine the current state of the index:
postgres=# create extension pageinspect ;
CREATE EXTENSION

First, read the metapage to find the index's root page id:
postgres=# select * from bt_metap('idx_tbl1_id');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  290 |     2 |      290 |         2
(1 row)

Now check how many entries the root page holds. Even though all the rows have been deleted, the index has not been cleaned up yet, so the entries are still present in the index pages.
This is also why looking up min and max through this index is so slow: the index does not know those rows have been deleted; it only finds out after following each index entry to the corresponding tuple in the heap page.
postgres=# select * from bt_page_stats('idx_tbl1_id',290);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   290 | r    |         97 |          0 |            15 |      8192 |      6216 |         0 |         0 |    2 |          2
(1 row)

List the individual index entries on the root page:
postgres=# select * from bt_page_items('idx_tbl1_id',290);
 itemoffset |   ctid    | itemlen | nulls | vars |           data
------------+-----------+---------+-------+------+-------------------------
          1 | (3,1)     |       8 | f     | f    |
          2 | (289,1)   |      16 | f     | f    | 09 96 01 00 00 00 00 00
          3 | (575,1)   |      16 | f     | f    | 11 2c 03 00 00 00 00 00
          4 | (860,1)   |      16 | f     | f    | 19 c2 04 00 00 00 00 00
          5 | (1145,1)  |      16 | f     | f    | 21 58 06 00 00 00 00 00
......
         96 | (27080,1) |      16 | f     | f    | f9 ac 96 00 00 00 00 00
         97 | (27365,1) |      16 | f     | f    | 01 43 98 00 00 00 00 00
(97 rows)

Next, run vacuum on tbl1 to reclaim garbage pages. This also reclaims the garbage pages in tbl1's index: index pages that contain only dead entries are marked as empty pages.
postgres=# vacuum tbl1;
VACUUM
Time: 1797.681 ms

Now using the index is fast again:
postgres=# select min(id),max(id) from tbl1;
   min    |   max
----------+----------
 10000000 | 10000000
(1 row)

Time: 0.542 ms
postgres=# explain select min(id),max(id) from tbl1;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Aggregate  (cost=1.44..1.45 rows=1 width=8)
   ->  Index Only Scan using idx_tbl1_id on tbl1  (cost=0.12..1.44 rows=1 width=4)
(2 rows)

Time: 0.467 ms

So has merge join planning time returned to normal?
It has.
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                              QUERY PLAN
----------------------------------------------------------------------
 Nested Loop  (cost=0.00..313495.67 rows=1 width=72)
   Join Filter: (tbl1.id = tbl2.id)
   ->  Seq Scan on tbl1  (cost=0.00..44248.01 rows=1 width=36)
   ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(4 rows)

Time: 0.488 ms
postgres=# set enable_nestloop=off;
SET
Time: 0.210 ms
postgres=# explain select * from tbl1,tbl2 where tbl1.id=tbl2.id;
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Merge Join  (cost=1642863.46..1737103.01 rows=1 width=72)
   Merge Cond: (tbl1.id = tbl2.id)
   ->  Index Scan using idx_tbl1_id on tbl1  (cost=0.12..44249.74 rows=1 width=36)
   ->  Sort  (cost=1642858.33..1667858.29 rows=9999985 width=36)
         Sort Key: tbl2.id
         ->  Seq Scan on tbl2  (cost=0.00..144247.85 rows=9999985 width=36)
(6 rows)

Time: 0.505 ms

Although the index size has not changed, the index pages that are no longer referenced have all been marked as empty pages:
postgres=# \di+ idx_tbl1_id
                          List of relations
 Schema |    Name     | Type  |  Owner   | Table |  Size  | Description
--------+-------------+-------+----------+-------+--------+-------------
 public | idx_tbl1_id | index | postgres | tbl1  | 214 MB |
(1 row)

For the details, see the btree README:
src/backend/access/nbtree/README
Page Deletion
-------------

We consider deleting an entire page from the btree only when it's become completely empty of items. (Merging partly-full pages would allow better space reuse, but it seems impractical to move existing data items left or right to make this happen --- a scan moving in the opposite direction might miss the items if so.) Also, we *never* delete the rightmost page on a tree level (this restriction simplifies the traversal algorithms, as explained below). Page deletion always begins from an empty leaf page. An internal page can only be deleted as part of a branch leading to a leaf page, where each internal page has only one child and that child is also to be deleted.

Now observe how the index pages changed after the vacuum.
First get the metapage information to find the root page id. Note that the index height has not changed, it is still level 2: below the root, the first level consists of branch (internal) pages and the second level of leaf pages.
postgres=# select * from bt_metap('idx_tbl1_id');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  290 |     2 |    27421 |         0
(1 row)

Read the root page: it now holds just a single entry, pointing to one page of the branch level below it.
postgres=# select * from bt_page_stats('idx_tbl1_id',290);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
   290 | r    |          1 |          0 |             8 |      8192 |      8136 |         0 |         0 |    2 |          2
(1 row)

postgres=# select * from bt_page_items('idx_tbl1_id',290);
 itemoffset |   ctid    | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+------
          1 | (27365,1) |       8 | f     | f    |
(1 row)

Look at the first level, the branch page, to find the second-level leaf page:
postgres=# select * from bt_page_stats('idx_tbl1_id',27365);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
 27365 | i    |          1 |          0 |             8 |      8192 |      8136 |         0 |         0 |    1 |          0
(1 row)

postgres=# select * from bt_page_items('idx_tbl1_id',27365);
 itemoffset |   ctid    | itemlen | nulls | vars | data
------------+-----------+---------+-------+------+------
          1 | (27421,1) |       8 | f     | f    |
(1 row)

Look at the second level, the leaf page:
postgres=# select * from bt_page_stats('idx_tbl1_id', 27421);
 blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
 27421 | l    |          1 |          0 |            16 |      8192 |      8128 |         0 |         0 |    0 |          1
(1 row)

postgres=# select * from bt_page_items('idx_tbl1_id',27421);
 itemoffset |    ctid     | itemlen | nulls | vars |           data
------------+-------------+---------+-------+------+-------------------------
          1 | (44247,178) |      16 | f     | f    | 80 96 98 00 00 00 00 00
(1 row)

The leaf entry stores the heap tuple's ctid (block and offset), so the row can be fetched directly by that ctid:
postgres=# select * from tbl1 where ctid='(44247,178)';
    id    | info
----------+------
 10000000 |
(1 row)

From this analysis we can draw a conclusion.
When planning a multi-table join, unless enable_mergejoin=off has been set, the database may choose a merge join, or, more precisely, it must at least evaluate the cost of a merge join.
When the join column has an index, the merge join estimate uses the column's min and max values (obtained by scanning that index) in order to compute a more accurate cost.
Whatever the reason, if scanning the index for the min and max is slow, planning time is stretched out accordingly.
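If you suspect this is what is inflating planning time, one quick check (a sketch only; the numbers depend entirely on your data and cache state) is to time the min/max lookup itself, since that is essentially the work the planner repeats for the merge join estimate:

-- Before vacuum, this should take roughly as long as the slow EXPLAIN did,
-- and the buffer counts show the dead index entries being waded through.
explain (analyze, buffers) select min(id), max(id) from tbl1;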
A real-world case
One application deletes a few hundred thousand rows every day from tables holding tens of millions of rows. After those deletes, planning some join queries started taking a very long time (even though the plan finally chosen was a nested loop join, the planner still had to cost the merge join path along the way).
How was it discovered?
1. Using perf
Connect to the session and get its PID with pg_backend_pid().
Profile that session:  perf record -avg -p $PID
Run explain QUERY; in the session.
Analyze where the time was spent:  perf report --tui

2. Using gdb, or dumping the process stack with pstack
A backtrace captured in one such scenario:
while true; do pstack $PID; sleep 0.01; done

#0  0x00000000004a8238 in _bt_checkkeys ()
#1  0x00000000004a6126 in _bt_readpage ()
#2  0x00000000004a67a9 in _bt_steppage ()
#3  0x00000000004a68a8 in _bt_next ()
#4  0x00000000004a53c8 in btgettuple ()
#5  0x00000000007da563 in FunctionCall2Coll ()
#6  0x000000000049e53e in index_getnext_tid ()
#7  0x000000000049e5fa in index_getnext ()
#8  0x0000000000782820 in get_actual_variable_range ()
#9  0x0000000000786092 in ineq_histogram_selectivity ()
#10 0x0000000000786a87 in scalarineqsel ()
#11 0x0000000000787062 in mergejoinscansel ()
#12 0x000000000061896e in initial_cost_mergejoin ()
#13 0x0000000000624113 in try_mergejoin_path ()
#14 0x0000000000624c2f in add_paths_to_joinrel ()
#15 0x0000000000626678 in make_join_rel ()
#16 0x0000000000626bd8 in join_search_one_level ()
#17 0x00000000006147e3 in standard_join_search ()
#18 0x00000000006337c1 in query_planner ()
#19 0x000000000063521c in grouping_planner ()
#20 0x0000000000637a80 in standard_planner ()
#21 0x00000000006cd396 in pg_plan_query ()
#22 0x000000000056c623 in ExplainOneQuery ()
#23 0x000000000056c9c5 in ExplainQuery ()
#24 0x00000000006d1fae in standard_ProcessUtility ()
#25 0x00007f9c19f8d261 in pgss_ProcessUtility ()
#26 0x00000000006cf1c7 in PortalRunUtility ()
#27 0x00000000006d003d in FillPortalStore ()
#28 0x00000000006d0340 in PortalRun ()
#29 0x00000000006cd7bb in exec_simple_query ()
#30 0x00000000006ce9e5 in PostgresMain ()
#31 0x00000000006682e1 in PostmasterMain ()
#32 0x00000000005f179c in main ()

The stack shows the planner (pg_plan_query) costing a merge join (initial_cost_mergejoin -> mergejoinscansel) and ending up in get_actual_variable_range, which walks the bloated index (_bt_readpage / _bt_checkkeys).

How to avoid the problem
When autovacuum is disabled, a bulk delete or update can leave the index full of pages whose entries point to dead tuples, and planning joins that involve those columns (specifically, costing the merge join path) can then take much longer.
1. Using bind variables (prepared statements) may or may not avoid the problem. PostgreSQL has a mechanism to guard against plan skew with bind variables: it records how many custom plans have been generated and their average cost, and, based on the plan cache and the parameters passed in, choose_custom_plan() compares the estimated cost of a generic plan against the average custom plan cost to decide whether another custom plan is needed (see the sketch after this list).
If a custom plan is needed, the costs of the candidate plans are evaluated again and one more custom plan is generated.
The details are covered in the reference articles listed at the end of this post.
2. The autovacuum threshold may be too large: autovacuum_vacuum_scale_factor defaults to 0.2, i.e. 20%, which means a table is only vacuumed automatically after roughly 20% of its rows have changed (more precisely, when the number of dead tuples exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples; for a 10-million-row table that is about 2 million rows).
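As a minimal sketch of the prepared-statement behaviour described in point 1 (using the same tbl1/tbl2 tables as above; the exact switchover point is an internal detail of the plan cache): the first several executions are planned with the supplied parameter values (custom plans), after which the server may decide a cached generic plan is good enough and stop re-planning.

-- Prepared statement: each early EXECUTE still plans with the real parameter,
-- so a bloated index can slow down every one of those planning runs.
prepare js (int) as
    select * from tbl1, tbl2 where tbl1.id = tbl2.id and tbl1.id = $1;

explain execute js(1);   -- shows the plan chosen for this parameter value
execute js(1);
deallocate js;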
So how do we avoid it?
1. Do not disable autovacuum on the table.
2. For large tables, set a table-level autovacuum threshold, for example 1% or even lower.
This can be done at create table time or with alter table; see the PostgreSQL manual for the full syntax, and the sketch after this list.
3. Enable autovacuum at the instance level and set a reasonable, not-too-large autovacuum_vacuum_scale_factor.
4. After bulk deletes or updates, manually run vacuum analyze on the affected tables to avoid the problem described above.
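For point 2, the per-table syntax looks like the following sketch (the 0.01 scale factor and 1000-row threshold are illustrative values, not recommendations for every workload):

-- Lower the autovacuum trigger for an existing large table.
alter table tbl1 set (autovacuum_vacuum_scale_factor = 0.01, autovacuum_vacuum_threshold = 1000);

-- Or set it when the table is created (tbl3 is just a placeholder name).
create table tbl3 (id int, info text) with (autovacuum_vacuum_scale_factor = 0.01);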
Summary
When the join column has an index and the optimizer is allowed to consider a merge join, costing the merge join requires the column's min and max values, and those values are obtained from the index.
When the join columns have no index, costing a merge join does not need the min and max values, so the merge join plan is estimated quickly.
Fetching the min and max values from the index therefore directly affects how long plan generation takes.
After a bulk delete, if vacuum garbage collection has not yet run, costing the merge join can become quite expensive, which is exactly the case described in this post.
Once vacuum has run, the index garbage is cleaned up as well, and the min and max values the optimizer needs for merge join costing can be fetched quickly again.
References
《為什么用 PostgreSQL 綁定變量 沒有 Oracle pin S 等待問題》
《PostgreSQL plan cache 源碼淺析 - 如何確保不會計劃傾斜》
《執行計劃選擇算法 與 綁定變量 - PostgreSQL prepared statement: SPI_prepare, prepare|execute COMMAND, PL/pgsql STYLE: custom & generic plan cache》
《新程序員》:云原生和全面數字化實踐50位技術專家共同創作,文字、視頻、音頻交互閱讀總結
以上是生活随笔為你收集整理的理解原理的重要性 - 论PostgreSQL merge join 成本评估陷阱 含case的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: Android中使用ViewStub提高
- 下一篇: co源码解读