Execution Plan Cost Calculation: A Walkthrough of the Rules
- Physical structure of tables and indexes, and the data operations on them
- Statistics
- Cost calculation
- Demos
- Selectivity calculation
- Base relation cost
- Index cost
Statistics, Selectivity, and Cost Calculation
The statistics currently collected are listed below; they can be inspected through the pg_stats view. When statistics are consulted, the relevant pieces are combined to compute selectivities.
* Histogram: describes the data distribution. PG currently builds an equi-height histogram, i.e. every bucket contains roughly the same number of rows.
* Most common values (MCV): the set of values that occur most often. Excluding them from the histogram reduces the estimation error caused by extreme values.
* Distinct number: how many distinct values the column contains. Note that PostgreSQL does not maintain a per-bucket distinct count for the histogram.
* NULL values: how many rows are NULL. Because NULL is a very special value, it is tracked separately.
* Average value width in bytes: the average column width, used to estimate the memory a query will need and to refine the I/O cost estimate.
* Correlation: how well the physical row order (the row id) matches the logical order of the column values: 1 for perfectly correlated, -1 for perfectly anti-correlated. It is a correlation coefficient computed from the statistics.
Selectivity
Selectivity estimates the amount of data left after a filter condition is applied; it is mainly used to estimate execution cost. Computing it relies chiefly on the histogram and MCV statistics.
- For equality conditions, only the MCV list (the collected high-frequency values) is used:
  - First check whether the column is unique (isunique). If so there is no MCV list, and the selectivity is simply
    selec = 1.0 / rel->tuples
  - Otherwise, if the constant happens to be one of the MCVs, return that MCV's frequency directly.
  - Otherwise, assume the remaining data is uniformly distributed: strip out the MCV and NULL fractions and divide by the remaining distinct count (otherdistinct):
    selec = (1.0 - sumcommon - nullfrac) / otherdistinct
In the following example 100 is not an MCV, so the third rule above applies:
esoye=# explain select * from t1 where a = 100;
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=5.19..362.44 rows=99 width=12)
Recheck Cond: (a = 100)
-> Bitmap Index Scan on idx (cost=0.00..5.17 rows=99 width=0)
Index Cond: (a = 100)
(4 rows)
n_distinct | 10037
most_common_vals | {6531,9646,4220,5958,710,933,1058,1082,3047,5060,6971,7584,8043,9110}
most_common_freqs | {0.0004,0.0004,0.00036666667,0.00036666667,0.00033333333,
0.00033333333,0.00033333333,0.00033333333,0.00033333333,
0.00033333333,0.00033333333,0.00033333333,0.00033333333,0.00033333333}
Here eqsel_internal calls var_eq_const, and the calculation runs roughly as:
selec = (1.0 - sumcommon - nullfrac) / otherdistinct
      = (1 - 0.0048666666261851788 - 0) / (10037 - 14)
      = 0.99513333337381482 / 10023
      = 9.9284977888238531e-05
rows  = 9.9284977888238531e-05 * 1000000 ≈ 99
For inequality (<>) the same machinery is used; the result is the complement of the equality selectivity.
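To make the branch rules above concrete, here is a minimal standalone C++ sketch that reproduces the non-MCV arithmetic of the example; the constants are copied from the pg_stats output shown above, while the structure and names are illustrative, not the real var_eq_const code.
```c++
// Sketch only: the var_eq_const non-MCV branch, with the numbers from above.
#include <cstdio>
#include <cmath>

int main()
{
    double ntuples   = 1000000.0;   // rows in t1
    double ndistinct = 10037.0;     // n_distinct from pg_stats
    double nullfrac  = 0.0;
    // sum of the 14 most_common_freqs listed above
    double sumcommon = 0.0004 * 2 + 0.00036666667 * 2 + 0.00033333333 * 10;
    double otherdistinct = ndistinct - 14;   // distinct values not in the MCV list

    double selec = (1.0 - sumcommon - nullfrac) / otherdistinct;
    std::printf("selec = %.17g\n", selec);                     // ~9.93e-05
    std::printf("rows  = %.0f\n", std::rint(selec * ntuples)); // ~99
}
```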
- 对于 “<”, “<=”, “>”, “>=",分为下面三种情况
- 如果数据是唯一约束的,则此时统计信息没有 MCV,此时直接使用 直方图计算占比
- 否则计算时会使用直方图和和 MCV 一起计算
- MCV 计算选择率的方法为,假如此时是
a < 1000
,则 MCV 中所有小于 1000 的值的概率相加 - 直方图会计算当前常数在所在桶中得区间比例,然后加上前面桶得数量再除以所有桶的数量,计算公式为
((i - 1) + ((val - low) / (high - low))) / (sslot.nvalues - 1)
- 使用公式
selec = selec * hist_selec + mcv_selec
计算最终的选择率,selec 的初始值为1.0 - sumcommon - nullfrac
,这里是因为需要去除 MCV 中重复计算的部分,然后乘上直方图计算的结果再加上MCV计算的选择率即可
- MCV 计算选择率的方法为,假如此时是
In the example below, a large number of 1s and 2s had been inserted:
esoye=# explain select * from t1 where a < 100;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6588.95..18294.42 rows=372197 width=12)
Recheck Cond: (a < 100)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(4 rows)
n_distinct | 9674
most_common_vals | {2,1,16,821,4871,4903,5141,8049}
most_common_freqs | {0.2432,0.022566667,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003}
histogram_bounds | {0,95,197,291,400,504,593,700,794,901,1001,1098,1192,1305,1410.....}
In scalarineqsel, the functions mcv_selectivity and ineq_histogram_selectivity compute the two parts, which are then combined:
mcv_selec = 0.2432 + 0.022566667 + 0.0003 = 0.26606667094165459
sumcommon = 0.26756667101290077
binfrac   = (val - low) / (high - low) = (100 - 95) / (197 - 95) = 0.049019607843137254
histfrac  = ((double) (i - 1) + binfrac) / (double) (sslot.nvalues - 1) = 1.04901 / 100 = 0.01049
Finally:
selec = selec * hist_selec + mcv_selec = 0.27367426598623362
rows  = 1360000 * 0.27367426598623362 ≈ 372197
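The same arithmetic as a standalone sketch; the inputs are copied from the pg_stats output above, and the small residual gap to the planner's 0.273674 comes from the rounding of the printed statistics.
```c++
// Sketch: MCV part + histogram interpolation for a < 100 (numbers from above).
#include <cstdio>
#include <cmath>

int main()
{
    double ntuples = 1360000.0;

    // MCV part: frequencies of the MCVs satisfying a < 100 (values 2, 1, 16)
    double mcv_selec = 0.2432 + 0.022566667 + 0.0003;
    // total MCV frequency (all eight MCVs), excluded from the histogram part
    double sumcommon = mcv_selec + 0.0003 * 5;
    double nullfrac  = 0.0;

    // histogram part: 100 falls in bucket [95, 197), the 2nd of 100 buckets
    int    i        = 2;     // 1-based index of the bucket holding the constant
    int    nvalues  = 101;   // histogram_bounds entries (statistics target 100)
    double binfrac  = (100.0 - 95.0) / (197.0 - 95.0);
    double histfrac = ((double)(i - 1) + binfrac) / (double)(nvalues - 1);

    double selec = (1.0 - sumcommon - nullfrac) * histfrac + mcv_selec;
    std::printf("selec = %.6f\n", selec);   // ~0.2737 (planner: 0.273674)
    std::printf("rows  = %.0f\n", std::rint(selec * ntuples)); // ~372300 vs planner's 372197
}
```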
- For multiple conditions, PG assumes the clauses are independent and combines the per-clause selectivities with elementary probability rules:
  P(A or B) = P(A) + P(B) - P(AB)
  P(AB) = P(A) × P(B)
esoye=# explain select * from t1 where a < 100 or c > 9800;
QUERY PLAN
------------------------------------------------------------
Seq Scan on t1 (cost=0.00..27453.00 rows=387263 width=12)
Filter: ((a < 100) OR (c > 9800))
(2 rows)
Assuming the constraints on the two sides of the OR are independent, the addition rule is applied in clauselist_selectivity_or:
p(a < 100)  = 0.27367
p(c > 9800) = 0.01525
p(a < 100 or c > 9800) = 0.27367 + 0.01525 - 0.27367 * 0.01525 = 0.28475185873552034
1360000 * 0.28475185873552034 = 387262.52788030764 ≈ 387263
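The combination step is trivial to check by hand; a one-liner sketch (s2 is back-solved from the plan, since only a rounded value is printed above):
```c++
// Sketch: OR-clause combination as done by clauselist_selectivity_or.
#include <cstdio>

int main()
{
    double s1 = 0.27367426598623362;  // a < 100
    double s2 = 0.0152513;            // c > 9800, back-solved from the plan
    double s  = s1 + s2 - s1 * s2;    // P(A) + P(B) - P(A)P(B)
    std::printf("selec = %.6f rows = %.0f\n", s, s * 1360000.0); // ~0.284752, ~387263
}
```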
These are only simple examples; other expressions cannot be computed this way. PG provides a dedicated selectivity function per operator kind, roughly the set below; composite conditions are handled by calling these functions recursively or in a loop.
src/backend/utils/adt/selfuncs.c:
217: * eqsel - Selectivity of "=" for any data types.
548: * neqsel - Selectivity of "!=" for any data types.
561: * scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
1466: * scalarltsel - Selectivity of "<" for scalars.
1475: * scalarlesel - Selectivity of "<=" for scalars.
1484: * scalargtsel - Selectivity of ">" for scalars.
1493: * scalargesel - Selectivity of ">=" for scalars.
1502: * boolvarsel - Selectivity of Boolean variable.
1535: * booltestsel - Selectivity of BooleanTest Node.
1693: * nulltestsel - Selectivity of NullTest Node.
1811: * scalararraysel - Selectivity of ScalarArrayOpExpr Node.
2162: * rowcomparesel - Selectivity of RowCompareExpr Node.
The main entry point for selectivity is clauselist_selectivity. In the final path-building phase, the rows of each base relation's RelOptInfo are computed up front in set_baserel_size_estimates; some paths additionally apply their own restriction clauses during costing to estimate the number of rows they operate on.
Cost calculation
Summary
- Startup cost: the cost an operator incurs before it can emit its first usable tuple.
  - For non-blocking operators, the general rule is: the startup cost of the restriction clauses plus the startup cost of the targetlist. Expression costs are computed by cost_qual_eval_node, which mainly yields a startup cost and a per_tuple cost.
  - For blocking operators such as join or sort, the preparation work is added as well, e.g. building the hash table for a hash join, or the sorting cost of a sort node.
- Total cost: the cost from the first output tuple through the last. It covers the operator's own per-tuple work plus the cost of pulling data from its child nodes.
  - Selectivity is needed here to estimate how many tuples the operator touches. In RelOptInfo, baserel->tuples is the row count collected by statistics, while baserel->rows is the row count after applying selectivity.
- The cost formulas boil down to:
  startup_cost  = qpqual_cost.startup + pathtarget->cost.startup + other
  total_cost    = startup_cost + cpu_run_cost + disk_run_cost + other
  cpu_run_cost  = cpu_per_tuple * inputrows + target_cost_per_tuple * outputrows
  disk_run_cost = spc_page_cost * pages
  where "other" differs per operator.
A simple seq scan
Seq scan is straightforward: once the preparatory work is done, the path is built directly and its cost estimated.
esoye=# explain select * from t1 where c < 100 ;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..24053.00 rows=9770 width=12)
Filter: (c < 100)
(2 rows)
relpages | reltuples
----------+-----------
7053 | 1.36e+06
1. cost_qual_eval_node computes the expression's startup cost and per-row cost:
   cost->per_tuple = 0.0025
   cost->startup = 0
2. The targetlist is a plain projection, so both of its costs are 0.
3. Different storage media have different I/O costs; get_tablespace_page_costs returns the per-page I/O cost, here spc_seq_page_cost = 1.
4. startup_cost = qpqual_cost.startup + path->pathtarget->cost.startup = 0
5. disk_run_cost = spc_seq_page_cost * baserel->pages = 7053
6. cpu_run_cost = (cpu_tuple_cost + qpqual_cost.per_tuple) * baserel->tuples = 0.0125 * 1.36e+06 = 17000
   cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows = 17000
7. total_cost = startup_cost + cpu_run_cost + disk_run_cost = 0 + 17000 + 7053 = 24053
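The seven steps collapse into a few lines; a sketch using the default GUC values:
```c++
// Sketch: reproducing the seq scan cost above with default GUC values.
#include <cstdio>

int main()
{
    double cpu_tuple_cost    = 0.01;
    double cpu_operator_cost = 0.0025;   // per-tuple cost of the c < 100 qual
    double seq_page_cost     = 1.0;

    double tuples = 1.36e+06;  // reltuples
    double pages  = 7053.0;    // relpages

    double startup_cost  = 0.0;  // no startup cost in quals or targetlist
    double disk_run_cost = seq_page_cost * pages;                          // 7053
    double cpu_run_cost  = (cpu_tuple_cost + cpu_operator_cost) * tuples;  // 17000

    std::printf("total_cost = %.2f\n",
                startup_cost + cpu_run_cost + disk_run_cost);  // 24053.00
}
```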
A complex target list
Following the formulas summarized above: here the targetlist contains a subquery, so its cost is not 0.
esoye=# explain select *, (select sum(a) from t2) from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6713.76..19349.72 rows=151344 width=20)
Recheck Cond: (a < 100)
Filter: (c < 5566)
InitPlan 1 (returns $0)
-> Aggregate (cost=180.01..180.02 rows=1 width=8)
-> Seq Scan on t2 (cost=0.00..155.01 rows=10001 width=4)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(8 rows)
esoye=# explain select * from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6533.74..19169.70 rows=151344 width=12)
Recheck Cond: (a < 100)
Filter: (c < 5566)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(5 rows)
esoye=# explain select sum(a) from t2;
QUERY PLAN
--------------------------------------------------------------
Aggregate (cost=180.01..180.02 rows=1 width=8)
-> Seq Scan on t2 (cost=0.00..155.01 rows=10001 width=4)
(2 rows)
6713.76 = 6533.74 + 180.02
19349.72 = 19169.70 + 180.02
(the InitPlan's total cost is added to both the startup and the total cost)
Using an index scan
PG first puts an AM (access method) abstraction layer over indexes, with multiple index engines implemented underneath: common ones such as btree and hash, less common ones such as gin and brin, and even partial indexes. Besides the usual index control and access methods, the interface requires each index to implement its own cost estimation; the currently supported cost functions are:
amroutine->amcostestimate = blcostestimate;
amroutine->amcostestimate = brincostestimate;
amroutine->amcostestimate = gincostestimate;
amroutine->amcostestimate = gistcostestimate;
amroutine->amcostestimate = hashcostestimate;
amroutine->amcostestimate = btcostestimate;
amroutine->amcostestimate = spgcostestimate;
Their main job is to produce the following key outputs for the subsequent cost calculation:
indexStartupCost
indexTotalCost
indexSelectivity
indexCorrelation
index_pages
Below are a few simple examples illustrating how an index's cost is calculated.
esoye=# explain select * from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6533.74..19169.70 rows=151344 width=12)
Recheck Cond: (a < 100)
Filter: (c < 5566)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(5 rows)
relpages | reltuples
----------+-----------
3382 | 1.36e+06
n_distinct | 9674
most_common_vals | {2,1,16,821,4871,4903,5141,8049}
most_common_freqs | {0.2432,0.022566667,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003}
histogram_bounds | {0,95,197,291,400,504,593,700,794,901,1001,1098,1192......}
correlation | -0.5632166
Tracing add_path shows how the paths are generated: three different paths are produced, and the cheapest one is selected by cost comparison.
accept_new
SeqScan(t1) rows=151344 cost=0.00..27453.00
reject_new
IdxScan(t1) rows=151344 cost=0.43..31024.65
remove_old
SeqScan(t1) rows=151344 cost=0.00..27453.00
accept_new
BitmapHeapScan(t1) rows=151344 cost=6533.74..19169.70
For IdxScan(t1) rows=151344 cost=0.43..31024.65:
- path.rows uses the base relation's post-restriction estimate, here 151344.
- amcostestimate is the index's own cost function; for a btree index it is btcostestimate.
- indexSelectivity is the selectivity of the index condition, here a < 100; its calculation was shown earlier, giving 0.27367426598623362.
- indexCorrelation is the correlation from the statistics, read straight from the table: -0.5632166.
- index_pages, the number of index pages read under this condition, uses
  numIndexPages = ceil((index->tuples * indexSelectivity) * index->pages / index->tuples)
  Plugging in: index_pages = ceil((1360000 * 0.27367) * 3382 / 1360000) = 926
- indexStartupCost can be approximated by
  indexStartupCost = {ceil( log2(N) ) + ( H + 1 ) * 50} * cpu_operator_cost
  In the source each term is computed separately because of some special-case handling, but none of it applies in this example. ceil(log2(N)) * cpu_operator_cost is the comparison cost of descending the btree (binary search over N tuples); (H + 1) * 50 * cpu_operator_cost is the cost of processing the internal pages, where 50 is hard-coded. Plugging in:
  indexStartupCost = 0.42749999999999999
- indexTotalCost is computed much like the seq scan earlier: I/O cost plus CPU cost, except an index does random I/O. The rough general formula is
  indexTotalCost = indexStartupCost + numIndexPages * spc_random_page_cost + numIndexTuples * (cpu_index_tuple_cost + qual_op_cost)
  Plugging in: indexTotalCost = 0.4275 + 926 * 4 + 372197 * 0.0075 ≈ 6495.9
- Results:
  indexStartupCost = 0.4274
  indexTotalCost   = 6495.904
  indexSelectivity = 0.27367
  indexCorrelation = -0.5632166
  index_pages      = 926
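These outputs can be reproduced numerically; a sketch (the tree height H = 2 is inferred from the 0.4275 startup cost, since 21 + 3 * 50 = 171 operator costs):
```c++
// Sketch: reproducing btcostestimate's outputs for IdxScan(t1), a < 100.
#include <cstdio>
#include <cmath>

int main()
{
    double cpu_operator_cost    = 0.0025;
    double cpu_index_tuple_cost = 0.005;
    double spc_random_page_cost = 4.0;

    double index_tuples = 1360000.0;
    double index_pages  = 3382.0;
    double selectivity  = 0.27367426598623362;
    int    H            = 2;   // btree level, e.g. from bt_metap()

    double startup = (std::ceil(std::log2(index_tuples))  // descent comparisons
                      + (H + 1) * 50.0)                   // internal pages, 50 hard-coded
                     * cpu_operator_cost;
    double num_index_pages  = std::ceil(selectivity * index_pages);  // 926
    double num_index_tuples = selectivity * index_tuples;            // ~372197
    double qual_op_cost     = cpu_operator_cost;                     // one indexqual operator

    double total = startup + num_index_pages * spc_random_page_cost
                 + num_index_tuples * (cpu_index_tuple_cost + qual_op_cost);

    std::printf("indexStartupCost = %.4f\n", startup);  // 0.4275
    std::printf("indexTotalCost   = %.3f\n", total);    // ~6495.905
}
```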
- Back in cost_index, the final cost is assembled almost exactly like a regular cost:
  startup_cost = indexStartupCost + qpqual_cost.startup + path->path.pathtarget->cost.startup = 0.4274
- total_cost must account for fetching the heap tuples the index points at, both I/O cost and CPU cost.
  - The table I/O cost is interpolated using the index/table correlation:
    table_IO_cost = max_IO_cost + indexCorrelation^2 * (min_IO_cost − max_IO_cost)
    The squared correlation drives the interpolation: max_IO_cost is the worst case, effectively random access across the whole table; min_IO_cost is the best case, a near-sequential table read.
    Plugging in: table_IO_cost = 19876.277613
  - The CPU cost follows the usual formula. The rows the filter must process equal the rows returned by the index, 372197; the rows finally emitted after the remaining conditions are applied number 151344. Plugging in: cpu_run_cost = 4652.4625
  - indexTotalCost + table_IO_cost + cpu_run_cost + startup_cost = 6495.904 + 19876.277613 + 4652.4625 + 0.4274 = 31025.071513
- Final result:
  startup_cost = 0.4274
  total_cost   = 31025.071513
For BitmapHeapScan(t1) rows=151344 cost=6533.74..19169.70:
- A bitmap index scan reuses previously generated index paths. In this example the only usable index path is the one for a < 100; in cost_bitmap_heap_scan the bitmapqual is IdxScan(t1) rows=151344 cost=0.43..31024.65.
- indexTotalCost is pulled from the child via cost_bitmap_tree_node; since the path here is an IndexPath, the formula is
  cost = ((IndexPath *) path)->indextotalcost + 0.1 * cpu_operator_cost * path->rows
  Plugging in: cost = 6495.9049 + 0.1 * 0.0025 * 151344 = 6533.741
- total_cost uses the usual formula:
  total_cost = startup_cost + pages_fetched * cost_per_page + pathtarget->cost.per_tuple * path->rows + cpu_per_tuple * tuples_fetched
  tuples_fetched is the rows fed into the node and path->rows the rows it outputs, 372197 and 151344 respectively:
  total_cost = 6533.741 + 7053 + 0 * 151344 + 0.015 * 372197 = 19169.696
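And the bitmap heap scan numbers as a sketch, using only values already derived above:
```c++
// Sketch: reproducing the BitmapHeapScan cost above.
#include <cstdio>

int main()
{
    double cpu_operator_cost = 0.0025;
    double index_total_cost  = 6495.9049;  // from btcostestimate

    double output_rows    = 151344.0;  // path->rows after all quals
    double tuples_fetched = 372197.0;  // rows matched by a < 100

    // cost_bitmap_tree_node: index cost plus a small per-row bitmap charge
    double startup_cost = index_total_cost
                        + 0.1 * cpu_operator_cost * output_rows;  // 6533.741

    double pages_fetched = 7053.0;  // heap pages read; cost_per_page = 1 here
    double cpu_per_tuple = 0.015;   // cpu_tuple_cost + recheck + filter quals

    double total_cost = startup_cost + pages_fetched * 1.0
                      + cpu_per_tuple * tuples_fetched;  // ~19169.70

    std::printf("startup = %.3f total = %.3f\n", startup_cost, total_cost);
}
```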
btree
startup_cost = indexStartupCost + qpqual_cost.startup + path->path.pathtarget->cost.startup
* indexStartupCost = {ceil( log2(index->tuples) ) + ( H + 1 ) * 50} * cpu_operator_cost
* H = select level from bt_metap('idx_cm_base_customer_1');
total_cost = startup_cost + run_cost;
* run_cost = cpu_run_cost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + indexTotalCost - indexStartupCost
* indexTotalCost = indexStartupCost + numIndexPages * spc_random_page_cost + numIndexTuples * (cpu_index_tuple_cost + qual_op_cost)
* csquared = indexCorrelation * indexCorrelation
* indexCorrelation = correlation from pg_stats
* min_IO_cost (with pages_fetched recomputed as ceil(indexSelectivity * baserel->pages)):
    if (loop_count > 1)
        min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
    else if (pages_fetched > 0)
    {
        min_IO_cost = spc_random_page_cost;
        if (pages_fetched > 1)
            min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
    }
    else
        min_IO_cost = 0;
* max_IO_cost:
    if (loop_count > 1)
        max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
    else
        max_IO_cost = pages_fetched * spc_random_page_cost;
* pages_fetched = index_pages_fetched(tuples_fetched * loop_count, baserel->pages, (double) index->pages)
    if (indexonly)
        pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
    baserel->allvisfrac = select relallvisible from pg_class where relname = 'idx_cm_base_customer_1'
* tuples_fetched = indexSelectivity * baserel->tuples
* indexSelectivity = selectivity of the index conditions
* cpu_run_cost = cpu_per_tuple * tuples_fetched + path->path.pathtarget->cost.per_tuple * path->path.rows
cost_bitmap_or_node
for each subpath in bitmapquals:
    cost_bitmap_tree_node(subpath, &subCost, &subselec);
        // IndexPath:     indextotalcost + 0.1 * cpu_operator_cost * path->rows
        // BitmapAndPath: path->total_cost
        // BitmapOrPath:  path->total_cost
    totalCost += subCost;
    if (l != list_head(path->bitmapquals) &&
        !IsA(subpath, IndexPath))
        totalCost += 100.0 * cpu_operator_cost;
startup_cost = total_cost = totalCost
cost_bitmap_and_node
The same loop as the OR node, with two differences: the sub-selectivities are multiplied instead of added, and (in stock PG) the extra 100.0 * cpu_operator_cost is charged for every subpath after the first, without the !IsA(subpath, IndexPath) test.
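A schematic of the OR-node loop; the structs and the second subpath's cost and row numbers are made up for illustration, not PG's real types.
```c++
// Schematic of cost_bitmap_or_node's loop (simplified, illustrative only).
#include <cstdio>

const double CPU_OPERATOR_COST = 0.0025;

struct SubPath { bool is_index_path; double cost; double selec; double rows; };

// cost_bitmap_tree_node: an IndexPath contributes indextotalcost plus a
// per-row bitmap charge; AND/OR subnodes contribute their total_cost.
static double sub_cost(const SubPath &p)
{
    return p.is_index_path ? p.cost + 0.1 * CPU_OPERATOR_COST * p.rows : p.cost;
}

int main()
{
    SubPath quals[] = {
        { true, 6495.90, 0.2737, 372197.0 },  // a < 100 (from the example above)
        { true,  120.00, 0.0153,  20718.0 },  // c > 9800 (made-up cost/rows)
    };
    double totalCost = 0.0, selec = 0.0;

    for (int i = 0; i < 2; i++)
    {
        totalCost += sub_cost(quals[i]);
        selec += quals[i].selec;               // OR: add selectivities...
        if (selec > 1.0) selec = 1.0;          // ...capped at 1.0
        if (i > 0 && !quals[i].is_index_path)  // rescanning a non-index subresult
            totalCost += 100.0 * CPU_OPERATOR_COST;
    }
    // the node's startup_cost and total_cost are both set to totalCost
    std::printf("selec = %.4f totalCost = %.2f\n", selec, totalCost);
}
```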
cost_bitmap_heap_scan
startup_cost = path->pathtarget->cost.startup + qpqual_cost.startup + indexTotalCost
total_cost = startup_cost + run_cost
run_cost = cpu_run_cost + path->pathtarget->cost.per_tuple * path->rows + pages_fetched * cost_per_page
cpu_run_cost = cpu_per_tuple * tuples_fetched
tuples_fetched = indexSelectivity * baserel->tuples
Scratch numbers from a parallel bitmap heap scan rescanned 44 times under a nestloop:
58.81 = 28.55 + 28.55 + qpqual_cost.startup + path->pathtarget->cost.startup
6113.85 = pages_fetched * cost_per_page + (cpu_per_tuple * tuples_fetched) / parallel_divisor + 0 * 5405
total ≈ 25.43 - 0.57 + (44 - 1) * inner_rescan_start_cost + 6113.85 - 58.81 + (44 - 1) * inner_rescan_run_cost + cpu_per_tuple * 3440 * 44
95106.65 = 6113.85 + 25.43 * 3440 + 0.01 * 44 * 3440
seq scan
---
startup_cost = 0
total_cost = 0.0125 * baserel->tuples + baserel->pages   // 0.0125 = cpu_tuple_cost + one qual operator
btree
btcostestimate
create extension pageinspect;
select level from bt_metap('idx_cm_base_customer_1');
---
// ceil(log2(index->tuples)) is the comparison cost of the btree descent (binary search);
// H + 1 is the tree height; 50 is hard-coded; 0.0025 = cpu_operator_cost
startup_cost = {ceil( log2(index->tuples) ) + ( H + 1 ) * 50} * 0.0025
indexSelectivity = genericcostestimate(xxx)
indexTotalCost = startup_cost + ceil(indexSelectivity * index->pages) * 4 + (indexSelectivity * index->tuples) * 0.0075
// max_IO_cost + csquared * (min_IO_cost - max_IO_cost) is the disk cost, interpolated
// by the correlation; the final term is the CPU cost, 0.01 = cpu_per_tuple
max_IO_cost = PF * 4
min_IO_cost = 4 + (ceil(indexSelectivity * (double) baserel->pages) - 1)
csquared = indexCorrelation * indexCorrelation
total_cost = indexTotalCost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + 0.01 * ceil(indexSelectivity * baserel->tuples)
// a btree's indexSelectivity is simply the result of evaluating its index conditions
// PF = pages fetched: the maximum pages the statement reads in the worst case; 4 is the random I/O page cost
// min_IO_cost is the best-case page cost; the two expressions correspond to the best and worst correlation
PF =
min(2TNs/(2T+Ns), T) when T <= b
2TNs/(2T+Ns) when T > b and Ns <= 2Tb/(2T-b)
b + (Ns - 2Tb/(2T-b))*(T-b)/T when T > b and Ns > 2Tb/(2T-b)
where
T = # pages in table
N = # tuples in table
s = selectivity = fraction of table to be scanned
b = # buffer pages available (we include kernel space here)
= (double) effective_cache_size * T / (root->total_table_pages + index_pages);
effective_cache_size = 524288 (8 kB pages) = 4 GB
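The PF formula is easy to wrap as a function; a sketch where b is taken as the full effective_cache_size, ignoring the total_table_pages sharing ratio:
```c++
// Sketch: the Mackert–Lohman page-fetch formula quoted above.
#include <cstdio>
#include <cmath>

// T = pages in table, N = tuples in table, s = selectivity, b = buffer pages
static double pages_fetched(double T, double N, double s, double b)
{
    double Ns = N * s;  // tuples fetched
    if (T <= b)
        return std::fmin((2.0 * T * Ns) / (2.0 * T + Ns), T);
    if (Ns <= 2.0 * T * b / (2.0 * T - b))
        return (2.0 * T * Ns) / (2.0 * T + Ns);
    return b + (Ns - 2.0 * T * b / (2.0 * T - b)) * (T - b) / T;
}

int main()
{
    // t1 numbers from the running example
    double T = 7053.0, N = 1360000.0, s = 0.27367426598623362;
    double b = 524288.0;  // 4 GB of 8 kB pages, sharing ratio ignored
    // here the fraction exceeds T, so PF caps at T = 7053
    std::printf("PF = %.0f\n", std::ceil(pages_fetched(T, N, s, b)));
}
```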
brin
---
startup_cost = 1 * statsData.revmapNumPages
total_cost = indexTotalCost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + 0.01 * ceil(indexSelectivity * baserel->tuples)
indexTotalCost = startup_cost + 4 * (index->pages - statsData.revmapNumPages) + 0.00025 * estimatedRanges * statsData.pagesPerRange
indexRanges = max(ceil((double) baserel->pages / statsData.pagesPerRange), 1.0)
minimalRanges = ceil(indexRanges * qualSelectivity)
estimatedRanges = min(minimalRanges / indexCorrelation, indexRanges)
indexSelectivity = estimatedRanges / indexRanges
max_IO_cost = PF * 4
min_IO_cost = 4 + (ceil(indexSelectivity * baserel->pages) - 1)
csquared = indexCorrelation * indexCorrelation
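Plugging hypothetical numbers into the BRIN formulas above; pages_per_range is the BRIN default, while the qual selectivity and correlation are made up:
```c++
// Sketch: BRIN selectivity formulas from above with made-up inputs.
#include <cstdio>
#include <cmath>

int main()
{
    double heap_pages      = 7053.0;
    double pages_per_range = 128.0;   // BRIN default
    double qual_selec      = 0.01;    // hypothetical qual selectivity
    double correlation     = 0.9;     // hypothetical, well-clustered column

    double indexRanges     = std::fmax(std::ceil(heap_pages / pages_per_range), 1.0);  // 56
    double minimalRanges   = std::ceil(indexRanges * qual_selec);                      // 1
    double estimatedRanges = std::fmin(minimalRanges / correlation, indexRanges);      // ~1.11
    double indexSelectivity = estimatedRanges / indexRanges;                           // ~0.02

    std::printf("ranges = %.0f estimated = %.2f selec = %.4f\n",
                indexRanges, estimatedRanges, indexSelectivity);
}
```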
bitmap
---
startup_cost = ((IndexPath *) path)->indextotalcost + 0.1 * cpu_operator_cost * path->rows
total_cost = startup_cost + pages_fetched * cost_per_page + pathtarget->cost.per_tuple * path->rows + cpu_per_tuple * tuples_fetched
SELECT * FROM brin_page_items(get_raw_page('idxb', 2), 'idxb');
SELECT * FROM brin_revmap_data(get_raw_page('idxb', 1));
SELECT * FROM brin_metapage_info(get_raw_page('idxb', 0));
cost_sort(path=0x7ffccbfceae0, root=0x563be3d33940, pathkeys=0x0, input_cost=15406,
          tuples=1000000, width=12, comparison_cost=0, sort_mem=4096, limit_tuples=-1)
resulting path: rows = 1000000, startup_cost = 132154.34284662094,
                total_cost = 134654.34284662094, pathkeys = 0x563be3d86de0
- FQS: pgxc_FQS_create_remote_plan does not compute a cost; the cost is 0.
 Data Node Scan on "REMOTE_FQS_QUERY"  (cost=0.00..0.00 rows=0 width=0)
   Primary node/s: dn1
   Node/s: dn1, dn2
(3 rows)
- Remote create path:
benchmarksql=# explain verbose select * from t1;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Data Node Scan on t1 "REMOTE_TABLE_QUERY"  (cost=0.00..82037.50 rows=90005 width=12)
   Output: t1.c1, t1.c2, t1.c3
   Primary node/s: dn1
   Node/s: dn1, dn2
   Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE true
(5 rows)
This follows the seq scan cost formula, except the per-tuple cost is 0.9; in cluster mode the local seq scan statistics are still used to compute the cost, and which operator to add is decided afterwards.
run cost = cpu run cost + disk run cost
         = remote per-tuple cost × Ntuple + seq_page_cost × Npage
         = 0.9 * 90005 + (1 * 1033)
         = 82037.50
- Cluster: cost_cluster_gather. A single-table plan becomes a Gather:
  rows = allrows / count(dn)
  startup_cost = subpath->startup_cost
  run_cost = subpath->run_cost + (remote_tuple_cost * rows)
  total_cost = startup_cost + run_cost
 Cluster Gather  (cost=0.00..28934.25 rows=90004 width=12)
   ->  Seq Scan on t1  (cost=0.00..1933.05 rows=45002 width=12)
(2 rows)
startup_cost = 0
run_cost = 1933.05 + (0.3 * 90004) = 28934.25
total_cost = 0 + 28934.25 = 28934.25
Single-table scan
Covers reading the data, the per-tuple CPU cost, and evaluating the targetlist:
run cost = cpu run cost + disk run cost
         = cpu_tuple_cost × Ntuple + seq_page_cost × Npage
         = 0.01 * 90005 + (1 * 1033)
         = 1933.05
join
Reduce or broadcast?
reduce_conn_cost  = 1
reduce_page_cost  = 3
reduce_setup_cost = 1000
Reduce expression example: COALESCE(hash_combin_mod(2, hashint4(t2.c2)), 0)
startup_cost = reduce_conn_cost + sort_startup_cost;
path->path.startup_cost = subpath->startup_cost + startup_cost;
path->path.total_cost = subpath->total_cost + startup_cost + reduce_run_cost + sort_run_cost;
reduce_run_cost = remote_tuple_cost * reduce_max_rows          /* row cost */
    /* plus page cost */
    + page_size(reduce_max_rows, subpath->pathtarget->width) * reduce_page_cost
    /* the reduce expression's cost should be added here as well */
    ;
benchmarksql=# explain verbose select * from t1 join t2 on t1.c1 = t2.c2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Cluster Gather (cost=1562.62..10639.74 rows=22501 width=24)
Remote node: 16386,16387
-> Hash Join (cost=562.62..2889.44 rows=11250 width=24)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3
Hash Cond: (t1.c1 = t2.c2)
-> Seq Scan on public.t1 (cost=0.00..1933.05 rows=45002 width=12)
Output: t1.c1, t1.c2, t1.c3
Remote node: 16386,16387
-> Hash (cost=562.00..562.00 rows=50 width=12)
Output: t2.c1, t2.c2, t2.c3
-> Cluster Reduce (cost=1.00..562.00 rows=50 width=12)
Reduce: ('[0:1]={16386,16387}'::oid[])[COALESCE(hash_combin_mod(2, hashint4(t2.c2)), 0)]
-> Seq Scan on public.t2 (cost=0.00..543.00 rows=50 width=12)
Output: t2.c1, t2.c2, t2.c3
Remote node: 16386,16387
join - remote
rqpath->path.startup_cost = parallel_setup_cost * 2;
rqpath->path.total_cost = rqpath->path.startup_cost + rel->rows * pgxc_remote_tuple_cost;
The only input that actually matters is the row count.
hash join cost
Hash join costing happens in two phases:
- Startup cost: roughly the cost of building the inner hash table, the startup costs of the left and right children, and the startup cost of the expressions. If the inner side has to be split into multiple batches, the disk read/write cost is added as well.
- Run cost: roughly the run cost of the inner and outer sides, plus the batch spill I/O. Different join types touch different numbers of tuples, so the CPU run cost estimate also has to factor in selectivities.
```c++
initial_cost_hashjoin
=================
cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
seq_page_cost = DEFAULT_SEQ_PAGE_COST;
num_hashclauses = list_length(hashclauses);
inner_path_rows = inner_path->rows;
outer_path_rows = outer_path->rows;
startup_cost = 0
run_cost = 0
startup_cost += outer_path->startup_cost; // outer startup cost
startup_cost += inner_path->total_cost;   // the inner side must be read fully to build the hash table, hence total_cost
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows; // cost of building the hash table
run_cost += outer_path->total_cost - outer_path->startup_cost; // outer run cost
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows; // hashing the outer side
if (numbatches > 1) { // if work_mem is too small, tuples are spilled to disk in batches
    innerpages = page_size(inner_path_rows, inner_path->pathtarget->width);
    outerpages = page_size(outer_path_rows, outer_path->pathtarget->width);
    startup_cost += seq_page_cost * innerpages; // startup: the inner side must be written out before the first output
    run_cost += seq_page_cost * (innerpages + 2 * outerpages); // reading inner back, writing and reading outer
}
workspace->startup_cost = startup_cost;
workspace->total_cost = startup_cost + run_cost;

final_cost_hashjoin
===================
startup_cost = workspace->startup_cost;
run_cost = workspace->run_cost;
virtualbuckets = (double) numbuckets * (double) numbatches;
startup_cost += hash_qual_cost.startup;                    // startup cost of the hash clauses
startup_cost += qp_qual_cost.startup;                      // other restriction clauses
startup_cost += path->jpath.path.pathtarget->cost.startup; // targetlist cost
// With a SEMI or ANTI join, or if the innerrel is known unique, the executor stops
// after the first match, so the cost is lower than for other join types.
// What follows estimates the CPU cost of processing the child tuples; the selectivities
// and magic numbers all serve to estimate how many tuples are actually touched.
if (path->jpath.jointype == JOIN_SEMI || path->jpath.jointype == JOIN_ANTI || extra->inner_unique) {
    run_cost += hash_qual_cost.per_tuple *                                    // per-tuple cost of the hash clauses
        rint(outer_path_rows * extra->semifactors.outer_match_frac) *         // matched outer rows; outer_match_frac comes from compute_semi_anti_join_factors
        rint(inner_path_rows * innerbucketsize * (2.0 / (extra->semifactors.match_count + 1.0))) * // inner rows per bucket, scaled down because scanning stops at the first match
        0.5;                                                                  // half a bucket scanned on average
    run_cost += hash_qual_cost.per_tuple *                                    // per-tuple cost of the hash clauses
        (outer_path_rows - rint(outer_path_rows * extra->semifactors.outer_match_frac)) * // unmatched outer rows
        rint(inner_path_rows / virtualbuckets) *                              // inner rows per virtual bucket
        0.05;                                                                 // an unmatched outer row is assumed to scan only ~5% of its bucket
    if (path->jpath.jointype == JOIN_ANTI)
        hashjointuples = outer_path_rows - outer_matched_rows;
    else
        hashjointuples = outer_matched_rows;
} else {
    run_cost += hash_qual_cost.per_tuple *                 // per-tuple cost of the hash clauses
        outer_path_rows *                                  // outer rows
        rint(inner_path_rows * innerbucketsize) * 0.5;     // half the matching bucket on average
    hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
}
run_cost += (cpu_tuple_cost + qp_qual_cost.per_tuple) * hashjointuples;          // the other restriction clauses
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows; // the target list
path->jpath.path.startup_cost = startup_cost;
path->jpath.path.total_cost = startup_cost + run_cost;
```
Log output:
HashJoin(supplier nation) id 0x5574bc399c18 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=356 cost=(4.35..943.53)=startup_cost:{workspace->startup_cost(4.35) + hash_qual_cost.startup(0.00) + qp_qual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(4.35) + workspace->run_cost(937.28) + qual_run_cost(1.33) + pathtarget_run_cost(0.00)}
clauses: supplier.s_nationkey = nation.n_nationkey
ClusterReducePath(supplier) id 0x5574bc3975f8 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=3333 cost=(1.00..929.95)=startup_cost:{subpath->startup_cost(0.00) + reduce_conn_cost(1.00) + sort}..total_cost:{subpath->total_cost(324.00) + startup_cost(1.00) + reduce_r}
SeqScan(supplier) id 0x5574bc35aa98 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=3333 cost=(0.00..324.00)=startup_cost:{qpqual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(0.00) + cpu_run_cost(100.00) + disk_run_cost(224.00)}
SeqScan(nation) id 0x5574bc362928 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=8 cost=(0.00..3.25)=startup_cost:{qpqual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(0.00) + cpu_run_cost(0.25) + disk_run_cost(3.00)}