Execution Plan Cost Calculation: A Walkthrough of the Rules
- Physical structure of tables and indexes, and the data operations on them
- Statistics
- Cost calculation
- Demos
- Selectivity calculation
- Base relation cost
- Index cost
Statistics, Selectivity, and Cost Calculation
The statistics currently collected are listed below; they can be inspected through the pg_stats view. When statistics are consulted, the relevant pieces are combined to compute selectivities.
* Histogram: describes the data distribution. PG currently builds an equi-height histogram, i.e. every bucket contains roughly the same number of rows.
* Most common values (MCV): the set of values that occur most often. Excluding them from the histogram reduces the estimation error caused by extreme values.
* Distinct number: how many distinct values the column contains. Note that PostgreSQL does not maintain a per-bucket distinct count for the histogram.
* NULL values: how many rows are NULL. Because NULL is a very special value, it is tracked separately.
* Average value width in bytes: the average column width, used to estimate the memory a query will need and to refine the I/O cost estimate.
* Correlation: how well the physical row order (the row id) matches the logical order of the column values: 1 for perfectly correlated, -1 for perfectly anti-correlated. It is a correlation coefficient computed from the statistics.
Selectivity
Selectivity estimates the amount of data left after a filter condition is applied; it is mainly used to estimate execution cost. Computing it relies chiefly on the histogram and MCV statistics.
- For equality conditions, only the MCV list (the collected high-frequency values) is used:
  - First check whether the column is unique (isunique). If so there is no MCV list, and the selectivity is simply
    selec = 1.0 / rel->tuples
  - Otherwise, if the constant happens to be one of the MCVs, return that MCV's frequency directly.
  - Otherwise, assume the remaining data is uniformly distributed: strip out the MCV and NULL fractions and divide by the remaining distinct count (otherdistinct):
    selec = (1.0 - sumcommon - nullfrac) / otherdistinct
In the following example 100 is not an MCV, so the third rule above applies:
esoye=# explain select * from t1 where a = 100;
QUERY PLAN
-------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=5.19..362.44 rows=99 width=12)
Recheck Cond: (a = 100)
-> Bitmap Index Scan on idx (cost=0.00..5.17 rows=99 width=0)
Index Cond: (a = 100)
(4 rows)
n_distinct | 10037
most_common_vals | {6531,9646,4220,5958,710,933,1058,1082,3047,5060,6971,7584,8043,9110}
most_common_freqs | {0.0004,0.0004,0.00036666667,0.00036666667,0.00033333333,
0.00033333333,0.00033333333,0.00033333333,0.00033333333,
0.00033333333,0.00033333333,0.00033333333,0.00033333333,0.00033333333}
Here eqsel_internal calls var_eq_const, and the calculation runs roughly as:
selec = (1.0 - sumcommon - nullfrac) / otherdistinct
      = (1 - 0.0048666666261851788 - 0) / (10037 - 14)
      = 0.99513333337381482 / 10023
      = 9.9284977888238531e-05
rows  = 9.9284977888238531e-05 * 1000000 ≈ 99
For inequality (<>) the same machinery is used; the result is the complement of the equality selectivity.
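To make the branch rules above concrete, here is a minimal standalone C++ sketch that reproduces the non-MCV arithmetic of the example; the constants are copied from the pg_stats output shown above, while the structure and names are illustrative, not the real var_eq_const code.
```c++
// Sketch only: the var_eq_const non-MCV branch, with the numbers from above.
#include <cstdio>
#include <cmath>

int main()
{
    double ntuples   = 1000000.0;   // rows in t1
    double ndistinct = 10037.0;     // n_distinct from pg_stats
    double nullfrac  = 0.0;
    // sum of the 14 most_common_freqs listed above
    double sumcommon = 0.0004 * 2 + 0.00036666667 * 2 + 0.00033333333 * 10;
    double otherdistinct = ndistinct - 14;   // distinct values not in the MCV list

    double selec = (1.0 - sumcommon - nullfrac) / otherdistinct;
    std::printf("selec = %.17g\n", selec);                     // ~9.93e-05
    std::printf("rows  = %.0f\n", std::rint(selec * ntuples)); // ~99
}
```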
- 对于 “<”, “<=”, “>”, “>=",分为下面三种情况
- 如果数据是唯一约束的,则此时统计信息没有 MCV,此时直接使用 直方图计算占比
- 否则计算时会使用直方图和和 MCV 一起计算
- MCV 计算选择率的方法为,假如此时是
a < 1000
,则 MCV 中所有小于 1000 的值的概率相加 - 直方图会计算当前常数在所在桶中得区间比例,然后加上前面桶得数量再除以所有桶的数量,计算公式为
((i - 1) + ((val - low) / (high - low))) / (sslot.nvalues - 1)
- 使用公式
selec = selec * hist_selec + mcv_selec
计算最终的选择率,selec 的初始值为1.0 - sumcommon - nullfrac
,这里是因为需要去除 MCV 中重复计算的部分,然后乘上直方图计算的结果再加上MCV计算的选择率即可
- MCV 计算选择率的方法为,假如此时是
In the example below, a large number of 1s and 2s had been inserted:
esoye=# explain select * from t1 where a < 100;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6588.95..18294.42 rows=372197 width=12)
Recheck Cond: (a < 100)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(4 rows)
n_distinct | 9674
most_common_vals | {2,1,16,821,4871,4903,5141,8049}
most_common_freqs | {0.2432,0.022566667,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003}
histogram_bounds | {0,95,197,291,400,504,593,700,794,901,1001,1098,1192,1305,1410.....}
In scalarineqsel, the functions mcv_selectivity and ineq_histogram_selectivity compute the two parts, which are then combined:
mcv_selec = 0.2432 + 0.022566667 + 0.0003 = 0.26606667094165459
sumcommon = 0.26756667101290077
binfrac   = (val - low) / (high - low) = (100 - 95) / (197 - 95) = 0.049019607843137254
histfrac  = ((double) (i - 1) + binfrac) / (double) (sslot.nvalues - 1) = 1.04901 / 100 = 0.01049
Finally:
selec = selec * hist_selec + mcv_selec = 0.27367426598623362
rows  = 1360000 * 0.27367426598623362 ≈ 372197
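The same arithmetic as a standalone sketch; the inputs are copied from the pg_stats output above, and the small residual gap to the planner's 0.273674 comes from the rounding of the printed statistics.
```c++
// Sketch: MCV part + histogram interpolation for a < 100 (numbers from above).
#include <cstdio>
#include <cmath>

int main()
{
    double ntuples = 1360000.0;

    // MCV part: frequencies of the MCVs satisfying a < 100 (values 2, 1, 16)
    double mcv_selec = 0.2432 + 0.022566667 + 0.0003;
    // total MCV frequency (all eight MCVs), excluded from the histogram part
    double sumcommon = mcv_selec + 0.0003 * 5;
    double nullfrac  = 0.0;

    // histogram part: 100 falls in bucket [95, 197), the 2nd of 100 buckets
    int    i        = 2;     // 1-based index of the bucket holding the constant
    int    nvalues  = 101;   // histogram_bounds entries (statistics target 100)
    double binfrac  = (100.0 - 95.0) / (197.0 - 95.0);
    double histfrac = ((double)(i - 1) + binfrac) / (double)(nvalues - 1);

    double selec = (1.0 - sumcommon - nullfrac) * histfrac + mcv_selec;
    std::printf("selec = %.6f\n", selec);   // ~0.2737 (planner: 0.273674)
    std::printf("rows  = %.0f\n", std::rint(selec * ntuples)); // ~372300 vs planner's 372197
}
```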
- For multiple conditions, PG assumes the clauses are independent and combines the per-clause selectivities with elementary probability rules:
  P(A or B) = P(A) + P(B) - P(AB)
  P(AB) = P(A) × P(B)
esoye=# explain select * from t1 where a < 100 or c > 9800;
QUERY PLAN
------------------------------------------------------------
Seq Scan on t1 (cost=0.00..27453.00 rows=387263 width=12)
Filter: ((a < 100) OR (c > 9800))
(2 rows)
Assuming the constraints on the two sides of the OR are independent, the addition rule is applied in clauselist_selectivity_or:
p(a < 100)  = 0.27367
p(c > 9800) = 0.01525
p(a < 100 or c > 9800) = 0.27367 + 0.01525 - 0.27367 * 0.01525 = 0.28475185873552034
1360000 * 0.28475185873552034 = 387262.52788030764 ≈ 387263
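The combination step is trivial to check by hand; a one-liner sketch (s2 is back-solved from the plan, since only a rounded value is printed above):
```c++
// Sketch: OR-clause combination as done by clauselist_selectivity_or.
#include <cstdio>

int main()
{
    double s1 = 0.27367426598623362;  // a < 100
    double s2 = 0.0152513;            // c > 9800, back-solved from the plan
    double s  = s1 + s2 - s1 * s2;    // P(A) + P(B) - P(A)P(B)
    std::printf("selec = %.6f rows = %.0f\n", s, s * 1360000.0); // ~0.284752, ~387263
}
```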
These are only simple examples; other expressions cannot be computed this way. PG provides a dedicated selectivity function per operator kind, roughly the set below; composite conditions are handled by calling these functions recursively or in a loop.
src/backend/utils/adt/selfuncs.c:
217: * eqsel - Selectivity of "=" for any data types.
548: * neqsel - Selectivity of "!=" for any data types.
561: * scalarineqsel - Selectivity of "<", "<=", ">", ">=" for scalars.
1466: * scalarltsel - Selectivity of "<" for scalars.
1475: * scalarlesel - Selectivity of "<=" for scalars.
1484: * scalargtsel - Selectivity of ">" for scalars.
1493: * scalargesel - Selectivity of ">=" for scalars.
1502: * boolvarsel - Selectivity of Boolean variable.
1535: * booltestsel - Selectivity of BooleanTest Node.
1693: * nulltestsel - Selectivity of NullTest Node.
1811: * scalararraysel - Selectivity of ScalarArrayOpExpr Node.
2162: * rowcomparesel - Selectivity of RowCompareExpr Node.
The main entry point for selectivity is clauselist_selectivity. In the final path-building phase, the rows of each base relation's RelOptInfo are computed up front in set_baserel_size_estimates; some paths additionally apply their own restriction clauses during costing to estimate the number of rows they operate on.
Cost calculation
Summary
- Startup cost: the cost an operator incurs before it can emit its first usable tuple.
  - For non-blocking operators, the general rule is: the startup cost of the restriction clauses plus the startup cost of the targetlist. Expression costs are computed by cost_qual_eval_node, which mainly yields a startup cost and a per_tuple cost.
  - For blocking operators such as join or sort, the preparation work is added as well, e.g. building the hash table for a hash join, or the sorting cost of a sort node.
- Total cost: the cost from the first output tuple through the last. It covers the operator's own per-tuple work plus the cost of pulling data from its child nodes.
  - Selectivity is needed here to estimate how many tuples the operator touches. In RelOptInfo, baserel->tuples is the row count collected by statistics, while baserel->rows is the row count after applying selectivity.
- The cost formulas boil down to:
  startup_cost  = qpqual_cost.startup + pathtarget->cost.startup + other
  total_cost    = startup_cost + cpu_run_cost + disk_run_cost + other
  cpu_run_cost  = cpu_per_tuple * inputrows + target_cost_per_tuple * outputrows
  disk_run_cost = spc_page_cost * pages
  where "other" differs per operator.
A simple seq scan
Seq scan is straightforward: once the preparatory work is done, the path is built directly and its cost estimated.
esoye=# explain select * from t1 where c < 100 ;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..24053.00 rows=9770 width=12)
Filter: (c < 100)
(2 rows)
relpages | reltuples
----------+-----------
7053 | 1.36e+06
1. cost_qual_eval_node computes the expression's startup cost and per-row cost:
   cost->per_tuple = 0.0025
   cost->startup = 0
2. The targetlist is a plain projection, so both of its costs are 0.
3. Different storage media have different I/O costs; get_tablespace_page_costs returns the per-page I/O cost, here spc_seq_page_cost = 1.
4. startup_cost = qpqual_cost.startup + path->pathtarget->cost.startup = 0
5. disk_run_cost = spc_seq_page_cost * baserel->pages = 7053
6. cpu_run_cost = (cpu_tuple_cost + qpqual_cost.per_tuple) * baserel->tuples = 0.0125 * 1.36e+06 = 17000
   cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows = 17000
7. total_cost = startup_cost + cpu_run_cost + disk_run_cost = 0 + 17000 + 7053 = 24053
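The seven steps collapse into a few lines; a sketch using the default GUC values:
```c++
// Sketch: reproducing the seq scan cost above with default GUC values.
#include <cstdio>

int main()
{
    double cpu_tuple_cost    = 0.01;
    double cpu_operator_cost = 0.0025;   // per-tuple cost of the c < 100 qual
    double seq_page_cost     = 1.0;

    double tuples = 1.36e+06;  // reltuples
    double pages  = 7053.0;    // relpages

    double startup_cost  = 0.0;  // no startup cost in quals or targetlist
    double disk_run_cost = seq_page_cost * pages;                          // 7053
    double cpu_run_cost  = (cpu_tuple_cost + cpu_operator_cost) * tuples;  // 17000

    std::printf("total_cost = %.2f\n",
                startup_cost + cpu_run_cost + disk_run_cost);  // 24053.00
}
```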
A complex target list
Following the formulas summarized above: here the targetlist contains a subquery, so its cost is not 0.
esoye=# explain select *, (select sum(a) from t2) from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6713.76..19349.72 rows=151344 width=20)
Recheck Cond: (a < 100)
Filter: (c < 5566)
InitPlan 1 (returns $0)
-> Aggregate (cost=180.01..180.02 rows=1 width=8)
-> Seq Scan on t2 (cost=0.00..155.01 rows=10001 width=4)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(8 rows)
esoye=# explain select * from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6533.74..19169.70 rows=151344 width=12)
Recheck Cond: (a < 100)
Filter: (c < 5566)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(5 rows)
esoye=# explain select sum(a) from t2;
QUERY PLAN
--------------------------------------------------------------
Aggregate (cost=180.01..180.02 rows=1 width=8)
-> Seq Scan on t2 (cost=0.00..155.01 rows=10001 width=4)
(2 rows)
6713.76 = 6533.74 + 180.02
19349.72 = 19169.70 + 180.02
(the InitPlan's total cost is added to both the startup and the total cost)
Using an index scan
PG first puts an AM (access method) abstraction layer over indexes, with multiple index engines implemented underneath: common ones such as btree and hash, less common ones such as gin and brin, and even partial indexes. Besides the usual index control and access methods, the interface requires each index to implement its own cost estimation; the currently supported cost functions are:
amroutine->amcostestimate = blcostestimate;
amroutine->amcostestimate = brincostestimate;
amroutine->amcostestimate = gincostestimate;
amroutine->amcostestimate = gistcostestimate;
amroutine->amcostestimate = hashcostestimate;
amroutine->amcostestimate = btcostestimate;
amroutine->amcostestimate = spgcostestimate;
Their main job is to produce the following key outputs for the subsequent cost calculation:
indexStartupCost
indexTotalCost
indexSelectivity
indexCorrelation
index_pages
Below are a few simple examples illustrating how an index's cost is calculated.
esoye=# explain select * from t1 where a < 100 and c < 5566;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on t1 (cost=6533.74..19169.70 rows=151344 width=12)
Recheck Cond: (a < 100)
Filter: (c < 5566)
-> Bitmap Index Scan on idx (cost=0.00..6495.90 rows=372197 width=0)
Index Cond: (a < 100)
(5 rows)
relpages | reltuples
----------+-----------
3382 | 1.36e+06
n_distinct | 9674
most_common_vals | {2,1,16,821,4871,4903,5141,8049}
most_common_freqs | {0.2432,0.022566667,0.0003,0.0003,0.0003,0.0003,0.0003,0.0003}
histogram_bounds | {0,95,197,291,400,504,593,700,794,901,1001,1098,1192......}
correlation | -0.5632166
Tracing add_path shows how the paths are generated: three different paths are produced, and the cheapest one is selected by cost comparison.
accept_new
SeqScan(t1) rows=151344 cost=0.00..27453.00
reject_new
IdxScan(t1) rows=151344 cost=0.43..31024.65
remove_old
SeqScan(t1) rows=151344 cost=0.00..27453.00
accept_new
BitmapHeapScan(t1) rows=151344 cost=6533.74..19169.70
For IdxScan(t1) rows=151344 cost=0.43..31024.65:
- path.rows uses the base relation's post-restriction estimate, here 151344.
- amcostestimate is the index's own cost function; for a btree index it is btcostestimate.
- indexSelectivity is the selectivity of the index condition, here a < 100; its calculation was shown earlier, giving 0.27367426598623362.
- indexCorrelation is the correlation from the statistics, read straight from the table: -0.5632166.
- index_pages, the number of index pages read under this condition, uses
  numIndexPages = ceil((index->tuples * indexSelectivity) * index->pages / index->tuples)
  Plugging in: index_pages = ceil((1360000 * 0.27367) * 3382 / 1360000) = 926
- indexStartupCost can be approximated by
  indexStartupCost = {ceil( log2(N) ) + ( H + 1 ) * 50} * cpu_operator_cost
  In the source each term is computed separately because of some special-case handling, but none of it applies in this example. ceil(log2(N)) * cpu_operator_cost is the comparison cost of descending the btree (binary search over N tuples); (H + 1) * 50 * cpu_operator_cost is the cost of processing the internal pages, where 50 is hard-coded. Plugging in:
  indexStartupCost = 0.42749999999999999
- indexTotalCost is computed much like the seq scan earlier: I/O cost plus CPU cost, except an index does random I/O. The rough general formula is
  indexTotalCost = indexStartupCost + numIndexPages * spc_random_page_cost + numIndexTuples * (cpu_index_tuple_cost + qual_op_cost)
  Plugging in: indexTotalCost = 0.4275 + 926 * 4 + 372197 * 0.0075 ≈ 6495.9
- Results:
  indexStartupCost = 0.4274
  indexTotalCost   = 6495.904
  indexSelectivity = 0.27367
  indexCorrelation = -0.5632166
  index_pages      = 926
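These outputs can be reproduced numerically; a sketch (the tree height H = 2 is inferred from the 0.4275 startup cost, since 21 + 3 * 50 = 171 operator costs):
```c++
// Sketch: reproducing btcostestimate's outputs for IdxScan(t1), a < 100.
#include <cstdio>
#include <cmath>

int main()
{
    double cpu_operator_cost    = 0.0025;
    double cpu_index_tuple_cost = 0.005;
    double spc_random_page_cost = 4.0;

    double index_tuples = 1360000.0;
    double index_pages  = 3382.0;
    double selectivity  = 0.27367426598623362;
    int    H            = 2;   // btree level, e.g. from bt_metap()

    double startup = (std::ceil(std::log2(index_tuples))  // descent comparisons
                      + (H + 1) * 50.0)                   // internal pages, 50 hard-coded
                     * cpu_operator_cost;
    double num_index_pages  = std::ceil(selectivity * index_pages);  // 926
    double num_index_tuples = selectivity * index_tuples;            // ~372197
    double qual_op_cost     = cpu_operator_cost;                     // one indexqual operator

    double total = startup + num_index_pages * spc_random_page_cost
                 + num_index_tuples * (cpu_index_tuple_cost + qual_op_cost);

    std::printf("indexStartupCost = %.4f\n", startup);  // 0.4275
    std::printf("indexTotalCost   = %.3f\n", total);    // ~6495.905
}
```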
- Back in cost_index, the final cost is assembled almost exactly like a regular cost:
  startup_cost = indexStartupCost + qpqual_cost.startup + path->path.pathtarget->cost.startup = 0.4274
- total_cost must account for fetching the heap tuples the index points at, both I/O cost and CPU cost.
  - The table I/O cost is interpolated using the index/table correlation:
    table_IO_cost = max_IO_cost + indexCorrelation^2 * (min_IO_cost − max_IO_cost)
    The squared correlation drives the interpolation: max_IO_cost is the worst case, effectively random access across the whole table; min_IO_cost is the best case, a near-sequential table read.
    Plugging in: table_IO_cost = 19876.277613
  - The CPU cost follows the usual formula. The rows the filter must process equal the rows returned by the index, 372197; the rows finally emitted after the remaining conditions are applied number 151344. Plugging in: cpu_run_cost = 4652.4625
  - indexTotalCost + table_IO_cost + cpu_run_cost + startup_cost = 6495.904 + 19876.277613 + 4652.4625 + 0.4274 = 31025.071513
- Final result:
  startup_cost = 0.4274
  total_cost   = 31025.071513
For BitmapHeapScan(t1) rows=151344 cost=6533.74..19169.70:
- A bitmap index scan reuses previously generated index paths. In this example the only usable index path is the one for a < 100; in cost_bitmap_heap_scan the bitmapqual is IdxScan(t1) rows=151344 cost=0.43..31024.65.
- indexTotalCost is pulled from the child via cost_bitmap_tree_node; since the path here is an IndexPath, the formula is
  cost = ((IndexPath *) path)->indextotalcost + 0.1 * cpu_operator_cost * path->rows
  Plugging in: cost = 6495.9049 + 0.1 * 0.0025 * 151344 = 6533.741
- total_cost uses the usual formula:
  total_cost = startup_cost + pages_fetched * cost_per_page + pathtarget->cost.per_tuple * path->rows + cpu_per_tuple * tuples_fetched
  tuples_fetched is the rows fed into the node and path->rows the rows it outputs, 372197 and 151344 respectively:
  total_cost = 6533.741 + 7053 + 0 * 151344 + 0.015 * 372197 = 19169.696
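And the bitmap heap scan numbers as a sketch, using only values already derived above:
```c++
// Sketch: reproducing the BitmapHeapScan cost above.
#include <cstdio>

int main()
{
    double cpu_operator_cost = 0.0025;
    double index_total_cost  = 6495.9049;  // from btcostestimate

    double output_rows    = 151344.0;  // path->rows after all quals
    double tuples_fetched = 372197.0;  // rows matched by a < 100

    // cost_bitmap_tree_node: index cost plus a small per-row bitmap charge
    double startup_cost = index_total_cost
                        + 0.1 * cpu_operator_cost * output_rows;  // 6533.741

    double pages_fetched = 7053.0;  // heap pages read; cost_per_page = 1 here
    double cpu_per_tuple = 0.015;   // cpu_tuple_cost + recheck + filter quals

    double total_cost = startup_cost + pages_fetched * 1.0
                      + cpu_per_tuple * tuples_fetched;  // ~19169.70

    std::printf("startup = %.3f total = %.3f\n", startup_cost, total_cost);
}
```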
btree
startup_cost = indexStartupCost + qpqual_cost.startup + path->path.pathtarget->cost.startup
* indexStartupCost = {ceil( log2(index->tuples) ) + ( H + 1 ) * 50} * cpu_operator_cost
* H = select level from bt_metap('idx_cm_base_customer_1');
total_cost = startup_cost + run_cost;
* run_cost = cpu_run_cost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + indexTotalCost - indexStartupCost
* indexTotalCost = indexStartupCost + numIndexPages * spc_random_page_cost + numIndexTuples * (cpu_index_tuple_cost + qual_op_cost)
* csquared = indexCorrelation * indexCorrelation
* indexCorrelation = correlation from pg_stats
* min_IO_cost (with pages_fetched recomputed as ceil(indexSelectivity * baserel->pages)):
    if (loop_count > 1)
        min_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
    else if (pages_fetched > 0)
    {
        min_IO_cost = spc_random_page_cost;
        if (pages_fetched > 1)
            min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost;
    }
    else
        min_IO_cost = 0;
* max_IO_cost:
    if (loop_count > 1)
        max_IO_cost = (pages_fetched * spc_random_page_cost) / loop_count;
    else
        max_IO_cost = pages_fetched * spc_random_page_cost;
* pages_fetched = index_pages_fetched(tuples_fetched * loop_count, baserel->pages, (double) index->pages)
    if (indexonly)
        pages_fetched = ceil(pages_fetched * (1.0 - baserel->allvisfrac));
    baserel->allvisfrac = select relallvisible from pg_class where relname = 'idx_cm_base_customer_1'
* tuples_fetched = indexSelectivity * baserel->tuples
* indexSelectivity = selectivity of the index conditions
* cpu_run_cost = cpu_per_tuple * tuples_fetched + path->path.pathtarget->cost.per_tuple * path->path.rows
cost_bitmap_or_node
for each subpath in bitmapquals:
    cost_bitmap_tree_node(subpath, &subCost, &subselec);
        // IndexPath:     indextotalcost + 0.1 * cpu_operator_cost * path->rows
        // BitmapAndPath: path->total_cost
        // BitmapOrPath:  path->total_cost
    totalCost += subCost;
    if (l != list_head(path->bitmapquals) &&
        !IsA(subpath, IndexPath))
        totalCost += 100.0 * cpu_operator_cost;
startup_cost = total_cost = totalCost
cost_bitmap_and_node
The same loop as the OR node, with two differences: the sub-selectivities are multiplied instead of added, and (in stock PG) the extra 100.0 * cpu_operator_cost is charged for every subpath after the first, without the !IsA(subpath, IndexPath) test.
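A schematic of the OR-node loop; the structs and the second subpath's cost and row numbers are made up for illustration, not PG's real types.
```c++
// Schematic of cost_bitmap_or_node's loop (simplified, illustrative only).
#include <cstdio>

const double CPU_OPERATOR_COST = 0.0025;

struct SubPath { bool is_index_path; double cost; double selec; double rows; };

// cost_bitmap_tree_node: an IndexPath contributes indextotalcost plus a
// per-row bitmap charge; AND/OR subnodes contribute their total_cost.
static double sub_cost(const SubPath &p)
{
    return p.is_index_path ? p.cost + 0.1 * CPU_OPERATOR_COST * p.rows : p.cost;
}

int main()
{
    SubPath quals[] = {
        { true, 6495.90, 0.2737, 372197.0 },  // a < 100 (from the example above)
        { true,  120.00, 0.0153,  20718.0 },  // c > 9800 (made-up cost/rows)
    };
    double totalCost = 0.0, selec = 0.0;

    for (int i = 0; i < 2; i++)
    {
        totalCost += sub_cost(quals[i]);
        selec += quals[i].selec;               // OR: add selectivities...
        if (selec > 1.0) selec = 1.0;          // ...capped at 1.0
        if (i > 0 && !quals[i].is_index_path)  // rescanning a non-index subresult
            totalCost += 100.0 * CPU_OPERATOR_COST;
    }
    // the node's startup_cost and total_cost are both set to totalCost
    std::printf("selec = %.4f totalCost = %.2f\n", selec, totalCost);
}
```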
cost_bitmap_heap_scan
startup_cost = path->pathtarget->cost.startup + qpqual_cost.startup + indexTotalCost
total_cost = startup_cost + run_cost
run_cost = cpu_run_cost + path->pathtarget->cost.per_tuple * path->rows + pages_fetched * cost_per_page
cpu_run_cost = cpu_per_tuple * tuples_fetched
tuples_fetched = indexSelectivity * baserel->tuples
Scratch numbers from a parallel bitmap heap scan rescanned 44 times under a nestloop:
58.81 = 28.55 + 28.55 + qpqual_cost.startup + path->pathtarget->cost.startup
6113.85 = pages_fetched * cost_per_page + (cpu_per_tuple * tuples_fetched) / parallel_divisor + 0 * 5405
total ≈ 25.43 - 0.57 + (44 - 1) * inner_rescan_start_cost + 6113.85 - 58.81 + (44 - 1) * inner_rescan_run_cost + cpu_per_tuple * 3440 * 44
95106.65 = 6113.85 + 25.43 * 3440 + 0.01 * 44 * 3440
seq scan
---
startup_cost = 0
total_cost = 0.0125 * baserel->tuples + baserel->pages   // 0.0125 = cpu_tuple_cost + one qual operator
btree
btcostestimate
create extension pageinspect;
select level from bt_metap('idx_cm_base_customer_1');
---
// ceil(log2(index->tuples)) is the comparison cost of the btree descent (binary search);
// H + 1 is the tree height; 50 is hard-coded; 0.0025 = cpu_operator_cost
startup_cost = {ceil( log2(index->tuples) ) + ( H + 1 ) * 50} * 0.0025
indexSelectivity = genericcostestimate(xxx)
indexTotalCost = startup_cost + ceil(indexSelectivity * index->pages) * 4 + (indexSelectivity * index->tuples) * 0.0075
// max_IO_cost + csquared * (min_IO_cost - max_IO_cost) is the disk cost, interpolated
// by the correlation; the final term is the CPU cost, 0.01 = cpu_per_tuple
max_IO_cost = PF * 4
min_IO_cost = 4 + (ceil(indexSelectivity * (double) baserel->pages) - 1)
csquared = indexCorrelation * indexCorrelation
total_cost = indexTotalCost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + 0.01 * ceil(indexSelectivity * baserel->tuples)
// a btree's indexSelectivity is simply the result of evaluating its index conditions
// PF = pages fetched: the maximum pages the statement reads in the worst case; 4 is the random I/O page cost
// min_IO_cost is the best-case page cost; the two expressions correspond to the best and worst correlation
PF =
min(2TNs/(2T+Ns), T) when T <= b
2TNs/(2T+Ns) when T > b and Ns <= 2Tb/(2T-b)
b + (Ns - 2Tb/(2T-b))*(T-b)/T when T > b and Ns > 2Tb/(2T-b)
where
T = # pages in table
N = # tuples in table
s = selectivity = fraction of table to be scanned
b = # buffer pages available (we include kernel space here)
= (double) effective_cache_size * T / (root->total_table_pages + index_pages);
effective_cache_size = 524288 (8 kB pages) = 4 GB
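The PF formula is easy to wrap as a function; a sketch where b is taken as the full effective_cache_size, ignoring the total_table_pages sharing ratio:
```c++
// Sketch: the Mackert–Lohman page-fetch formula quoted above.
#include <cstdio>
#include <cmath>

// T = pages in table, N = tuples in table, s = selectivity, b = buffer pages
static double pages_fetched(double T, double N, double s, double b)
{
    double Ns = N * s;  // tuples fetched
    if (T <= b)
        return std::fmin((2.0 * T * Ns) / (2.0 * T + Ns), T);
    if (Ns <= 2.0 * T * b / (2.0 * T - b))
        return (2.0 * T * Ns) / (2.0 * T + Ns);
    return b + (Ns - 2.0 * T * b / (2.0 * T - b)) * (T - b) / T;
}

int main()
{
    // t1 numbers from the running example
    double T = 7053.0, N = 1360000.0, s = 0.27367426598623362;
    double b = 524288.0;  // 4 GB of 8 kB pages, sharing ratio ignored
    // here the fraction exceeds T, so PF caps at T = 7053
    std::printf("PF = %.0f\n", std::ceil(pages_fetched(T, N, s, b)));
}
```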
brin
---
startup_cost = 1 * statsData.revmapNumPages
total_cost = indexTotalCost + max_IO_cost + csquared * (min_IO_cost - max_IO_cost) + 0.01 * ceil(indexSelectivity * baserel->tuples)
indexTotalCost = startup_cost + 4 * (index->pages - statsData.revmapNumPages) + 0.00025 * estimatedRanges * statsData.pagesPerRange
indexRanges = max(ceil((double) baserel->pages / statsData.pagesPerRange), 1.0)
minimalRanges = ceil(indexRanges * qualSelectivity)
estimatedRanges = min(minimalRanges / indexCorrelation, indexRanges)
indexSelectivity = estimatedRanges / indexRanges
max_IO_cost = PF * 4
min_IO_cost = 4 + (ceil(indexSelectivity * baserel->pages) - 1)
csquared = indexCorrelation * indexCorrelation
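Plugging hypothetical numbers into the BRIN formulas above; pages_per_range is the BRIN default, while the qual selectivity and correlation are made up:
```c++
// Sketch: BRIN selectivity formulas from above with made-up inputs.
#include <cstdio>
#include <cmath>

int main()
{
    double heap_pages      = 7053.0;
    double pages_per_range = 128.0;   // BRIN default
    double qual_selec      = 0.01;    // hypothetical qual selectivity
    double correlation     = 0.9;     // hypothetical, well-clustered column

    double indexRanges     = std::fmax(std::ceil(heap_pages / pages_per_range), 1.0);  // 56
    double minimalRanges   = std::ceil(indexRanges * qual_selec);                      // 1
    double estimatedRanges = std::fmin(minimalRanges / correlation, indexRanges);      // ~1.11
    double indexSelectivity = estimatedRanges / indexRanges;                           // ~0.02

    std::printf("ranges = %.0f estimated = %.2f selec = %.4f\n",
                indexRanges, estimatedRanges, indexSelectivity);
}
```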
bitmap
---
startup_cost = ((IndexPath *) path)->indextotalcost + 0.1 * cpu_operator_cost * path->rows
total_cost = startup_cost + pages_fetched * cost_per_page + pathtarget->cost.per_tuple * path->rows + cpu_per_tuple * tuples_fetched
SELECT * FROM brin_page_items(get_raw_page('idxb', 2), 'idxb');
SELECT * FROM brin_revmap_data(get_raw_page('idxb', 1));
SELECT * FROM brin_metapage_info(get_raw_page('idxb', 0));
cost_sort(path=0x7ffccbfceae0, root=0x563be3d33940, pathkeys=0x0, input_cost=15406,
          tuples=1000000, width=12, comparison_cost=0, sort_mem=4096, limit_tuples=-1)
resulting path: rows = 1000000, startup_cost = 132154.34284662094,
                total_cost = 134654.34284662094, pathkeys = 0x563be3d86de0
- FQS: pgxc_FQS_create_remote_plan does not compute a cost; the cost is 0.
 Data Node Scan on "REMOTE_FQS_QUERY"  (cost=0.00..0.00 rows=0 width=0)
   Primary node/s: dn1
   Node/s: dn1, dn2
(3 rows)
- Remote create path:
benchmarksql=# explain verbose select * from t1;
                                   QUERY PLAN
--------------------------------------------------------------------------
 Data Node Scan on t1 "REMOTE_TABLE_QUERY"  (cost=0.00..82037.50 rows=90005 width=12)
   Output: t1.c1, t1.c2, t1.c3
   Primary node/s: dn1
   Node/s: dn1, dn2
   Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE true
(5 rows)
This follows the seq scan cost formula, except the per-tuple cost is 0.9; in cluster mode the local seq scan statistics are still used to compute the cost, and which operator to add is decided afterwards.
run cost = cpu run cost + disk run cost
         = remote per-tuple cost × Ntuple + seq_page_cost × Npage
         = 0.9 * 90005 + (1 * 1033)
         = 82037.50
- Cluster: cost_cluster_gather. A single-table plan becomes a Gather:
  rows = allrows / count(dn)
  startup_cost = subpath->startup_cost
  run_cost = subpath->run_cost + (remote_tuple_cost * rows)
  total_cost = startup_cost + run_cost
 Cluster Gather  (cost=0.00..28934.25 rows=90004 width=12)
   ->  Seq Scan on t1  (cost=0.00..1933.05 rows=45002 width=12)
(2 rows)
startup_cost = 0
run_cost = 1933.05 + (0.3 * 90004) = 28934.25
total_cost = 0 + 28934.25 = 28934.25
Single-table scan
Covers reading the data, the per-tuple CPU cost, and evaluating the targetlist:
run cost = cpu run cost + disk run cost
         = cpu_tuple_cost × Ntuple + seq_page_cost × Npage
         = 0.01 * 90005 + (1 * 1033)
         = 1933.05
join
Reduce or broadcast?
reduce_conn_cost  = 1
reduce_page_cost  = 3
reduce_setup_cost = 1000
Reduce expression example: COALESCE(hash_combin_mod(2, hashint4(t2.c2)), 0)
startup_cost = reduce_conn_cost + sort_startup_cost;
path->path.startup_cost = subpath->startup_cost + startup_cost;
path->path.total_cost = subpath->total_cost + startup_cost + reduce_run_cost + sort_run_cost;
reduce_run_cost = remote_tuple_cost * reduce_max_rows          /* row cost */
    /* plus page cost */
    + page_size(reduce_max_rows, subpath->pathtarget->width) * reduce_page_cost
    /* the reduce expression's cost should be added here as well */
    ;
benchmarksql=# explain verbose select * from t1 join t2 on t1.c1 = t2.c2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------
Cluster Gather (cost=1562.62..10639.74 rows=22501 width=24)
Remote node: 16386,16387
-> Hash Join (cost=562.62..2889.44 rows=11250 width=24)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3
Hash Cond: (t1.c1 = t2.c2)
-> Seq Scan on public.t1 (cost=0.00..1933.05 rows=45002 width=12)
Output: t1.c1, t1.c2, t1.c3
Remote node: 16386,16387
-> Hash (cost=562.00..562.00 rows=50 width=12)
Output: t2.c1, t2.c2, t2.c3
-> Cluster Reduce (cost=1.00..562.00 rows=50 width=12)
Reduce: ('[0:1]={16386,16387}'::oid[])[COALESCE(hash_combin_mod(2, hashint4(t2.c2)), 0)]
-> Seq Scan on public.t2 (cost=0.00..543.00 rows=50 width=12)
Output: t2.c1, t2.c2, t2.c3
Remote node: 16386,16387
join - remote
rqpath->path.startup_cost = parallel_setup_cost * 2;
rqpath->path.total_cost = rqpath->path.startup_cost + rel->rows * pgxc_remote_tuple_cost;
The only input that actually matters is the row count.
hash join cost
Hash join costing happens in two phases:
- Startup cost: roughly the cost of building the inner hash table, the startup costs of the left and right children, and the startup cost of the expressions. If the inner side has to be split into multiple batches, the disk read/write cost is added as well.
- Run cost: roughly the run cost of the inner and outer sides, plus the batch spill I/O. Different join types touch different numbers of tuples, so the CPU run cost estimate also has to factor in selectivities.
```c++
initial_cost_hashjoin
=================
cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST;
cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST;
seq_page_cost = DEFAULT_SEQ_PAGE_COST;
num_hashclauses = list_length(hashclauses);
inner_path_rows = inner_path->rows;
outer_path_rows = outer_path->rows;
startup_cost = 0
run_cost = 0
startup_cost += outer_path->startup_cost; // outer startup cost
startup_cost += inner_path->total_cost;   // the inner side must be read fully to build the hash table, hence total_cost
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows; // cost of building the hash table
run_cost += outer_path->total_cost - outer_path->startup_cost; // outer run cost
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows; // hashing the outer side
if (numbatches > 1) { // if work_mem is too small, tuples are spilled to disk in batches
    innerpages = page_size(inner_path_rows, inner_path->pathtarget->width);
    outerpages = page_size(outer_path_rows, outer_path->pathtarget->width);
    startup_cost += seq_page_cost * innerpages; // startup: the inner side must be written out before the first output
    run_cost += seq_page_cost * (innerpages + 2 * outerpages); // reading inner back, writing and reading outer
}
workspace->startup_cost = startup_cost;
workspace->total_cost = startup_cost + run_cost;

final_cost_hashjoin
===================
startup_cost = workspace->startup_cost;
run_cost = workspace->run_cost;
virtualbuckets = (double) numbuckets * (double) numbatches;
startup_cost += hash_qual_cost.startup;                    // startup cost of the hash clauses
startup_cost += qp_qual_cost.startup;                      // other restriction clauses
startup_cost += path->jpath.path.pathtarget->cost.startup; // targetlist cost
// With a SEMI or ANTI join, or if the innerrel is known unique, the executor stops
// after the first match, so the cost is lower than for other join types.
// What follows estimates the CPU cost of processing the child tuples; the selectivities
// and magic numbers all serve to estimate how many tuples are actually touched.
if (path->jpath.jointype == JOIN_SEMI || path->jpath.jointype == JOIN_ANTI || extra->inner_unique) {
    run_cost += hash_qual_cost.per_tuple *                                    // per-tuple cost of the hash clauses
        rint(outer_path_rows * extra->semifactors.outer_match_frac) *         // matched outer rows; outer_match_frac comes from compute_semi_anti_join_factors
        rint(inner_path_rows * innerbucketsize * (2.0 / (extra->semifactors.match_count + 1.0))) * // inner rows per bucket, scaled down because scanning stops at the first match
        0.5;                                                                  // half a bucket scanned on average
    run_cost += hash_qual_cost.per_tuple *                                    // per-tuple cost of the hash clauses
        (outer_path_rows - rint(outer_path_rows * extra->semifactors.outer_match_frac)) * // unmatched outer rows
        rint(inner_path_rows / virtualbuckets) *                              // inner rows per virtual bucket
        0.05;                                                                 // an unmatched outer row is assumed to scan only ~5% of its bucket
    if (path->jpath.jointype == JOIN_ANTI)
        hashjointuples = outer_path_rows - outer_matched_rows;
    else
        hashjointuples = outer_matched_rows;
} else {
    run_cost += hash_qual_cost.per_tuple *                 // per-tuple cost of the hash clauses
        outer_path_rows *                                  // outer rows
        rint(inner_path_rows * innerbucketsize) * 0.5;     // half the matching bucket on average
    hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
}
run_cost += (cpu_tuple_cost + qp_qual_cost.per_tuple) * hashjointuples;          // the other restriction clauses
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows; // the target list
path->jpath.path.startup_cost = startup_cost;
path->jpath.path.total_cost = startup_cost + run_cost;
```
Log output:
HashJoin(supplier nation) id 0x5574bc399c18 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=356 cost=(4.35..943.53)=startup_cost:{workspace->startup_cost(4.35) + hash_qual_cost.startup(0.00) + qp_qual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(4.35) + workspace->run_cost(937.28) + qual_run_cost(1.33) + pathtarget_run_cost(0.00)}
clauses: supplier.s_nationkey = nation.n_nationkey
ClusterReducePath(supplier) id 0x5574bc3975f8 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=3333 cost=(1.00..929.95)=startup_cost:{subpath->startup_cost(0.00) + reduce_conn_cost(1.00) + sort}..total_cost:{subpath->total_cost(324.00) + startup_cost(1.00) + reduce_r}
SeqScan(supplier) id 0x5574bc35aa98 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=3333 cost=(0.00..324.00)=startup_cost:{qpqual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(0.00) + cpu_run_cost(100.00) + disk_run_cost(224.00)}
SeqScan(nation) id 0x5574bc362928 reduce_info_list ( storage_nodes: { 16386,16387,16388}, exclude_exec: {}, type: H) rows=8 cost=(0.00..3.25)=startup_cost:{qpqual_cost.startup(0.00) + pathtarget->cost.startup(0.00)}..total_cost:{startup_cost(0.00) + cpu_run_cost(0.25) + disk_run_cost(3.00)}