remote Query
- 精确选择数据源,需要从语句的约束条件中计算出具体的执行数据源节点,然后把语句下发的对应的 DN
- 操作尽可能下推,下推一些操作到对应的DN上执行
remote query
当前使用两阶段优化
第一阶段是FQS,对语句进行简单的判断,如果语句可以完全在DN上执行,则直接在这里创建
remote plan
,然后直接返回,如下例子benchmarksql=# explain select * from t1; QUERY PLAN ---------------------------------------------------------------------------- Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0) Primary node/s: dn1 Node/s: dn1, dn2 (3 rows)
第二阶段,FQS判断失败的语句,此时需要进行常规优化,一般是语句需要DN之间的数据发生交换的时候,例如join使用多表, 但是关联条件不是分区键,此时需要把DN的数据读取到 CN 上,然后进行关联,
postgres=# explain select * from t1 join t2 on t1.c1 = t2.c1; QUERY PLAN --------------------------------------------------------------------------------------------------- Hash Join (cost=2752.50..5509.65 rows=3000 width=16) Hash Cond: (t2.c1 = t1.c1) -> Data Node Scan on t2 "_REMOTE_TABLE_QUERY_" (cost=0.00..2715.90 rows=3001 width=8) Primary node/s: dn1 Node/s: dn1, dn2, dn3 -> Hash (cost=2715.00..2715.00 rows=3000 width=8) -> Data Node Scan on t1 "_REMOTE_TABLE_QUERY__1" (cost=0.00..2715.00 rows=3000 width=8) Primary node/s: dn1 Node/s: dn1, dn2, dn3 (9 rows)
1. 构建基表remote query path
使用函数 create_plainrel_rqpath 构建 RemoteQueryPath ,并且 使用 add_path 添加到 pathlist 中
add_path(rel, (Path *)create_remotequery_path(root, rel, exec_nodes, param_info, NULL, NULL, 0, NULL));
exec_nodes 为当前 rel 的具体的使用条件计算之后的数据分布
2. 构建 joinrel 的 path
joinrel 使用 inner 和 outer 的path 构建path,主要分为两种:
- 当前 joinrel 不能完全下推到 dn 节点,此时按照正常的 joinrel path 逻辑进行处理,如下面的例子,
HashJoin(part partsupp)
的 outer path 和 inner path 分别是RemoteQuery(partsupp)
和RemoteQuery(part)
PPID 0x55862da74d28 NestLoop(part supplier partsupp nation) rows=1 cost=185101.75..933246.01
clauses: supplier.s_nationkey = nation.n_nationkey
PPID 0x55862d4d00f0 NestLoop(part supplier partsupp) rows=1 cost=185101.75..933220.41
clauses: partsupp.ps_suppkey = supplier.s_suppkey
PPID 0x55862d397638 HashJoin(part partsupp) rows=1 cost=185101.75..923954.75
clauses: part.p_partkey = partsupp.ps_partkey, unknown expr = partsupp.ps_supplycost
PPID 0x55862d391298 RemoteQuery(partsupp) rows=266667 cost=0.00..737453.00
PPID 0x55862d3877f0 RemoteQuery(part) rows=250 cost=0.00..185098.00
PPID 0x55862d38b378 RemoteQuery(supplier) rows=3333 cost=0.00..9224.00
PPID 0x55862d3942c8 RemoteQuery(nation) rows=8 cost=0.00..25.50
- 当前joinrel 可以下推到DN,此时使用函数 create_joinrel_rqpath 构建 RemoteQueryPath ,并且 使用 add_path 添加到 pathlist 中,create_joinrel_rqpath 在 add_paths_to_joinrel 中调用,但是由于 remote query 的cost 计算代价太大,所以当前即使有满足条件的语句,也会被代价比较然后剔除
benchmarksql=# explain select * from t1 join t2 on t1.c1 = t2.c1 join t3 on t3.c2 = t2.c1 where t1.c1 = 5;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.00..696.12 rows=25584 width=36)
-> Data Node Scan on t3 "_REMOTE_TABLE_QUERY_" (cost=0.00..415.25 rows=1 width=12)
Primary node/s: dn1
Node/s: dn1, dn2
-> Data Node Scan on "_REMOTE_TABLE_QUERY_" "_REMOTE_TABLE_QUERY__1" (cost=2.00..25.03 rows=25584 width=24)
Primary node/s: dn1
Node/s: dn1
(7 rows)
3. remote query 的cost 计算
主要分为两类
- remote query 为基表的时候,此时 cost 使用 seq scan 的方式计算代价, 只是 tuple cost 使用 pgxc_remote_tuple_cost 计算,默认为 0.9, 而之前 seq scan 的 tuple cost 为 0.01
if(rel->reloptkind == RELOPT_BASEREL)
{
cost_seqscan_internal(&rqpath->path, root, rel,
rqpath->path.param_info,
pgxc_remote_tuple_cost);
}
- remote query 为其他,当前只能是join,此时直接使用下面的计算公式,parallel_setup_cost 默认为1000,pgxc_remote_tuple_cost默认为0.9
{
rqpath->path.startup_cost = parallel_setup_cost * 2;
rqpath->path.total_cost = rqpath->path.startup_cost + rel->rows * pgxc_remote_tuple_cost;
rqpath->path.rows = rel->rows;
}
``
## 场景增强
上面说了实现以及现有的问题,现在对这些问题进行分析优化,主要思路是
1. 限制FQS,把FQS的定位从"快速判断语句是否可以ship到dn上"转换为"快速判断语句是否可以ship到一个DN上"
2. 添加下推逻辑,如果确定算子能在一个DN上执行,则下推到DN
下文中使用的表结构
```sql
create table t1 (c1 int , c2 int , c3 int);
create table t2 (c1 int , c2 int , c3 int);
create table t3 (c1 int , c2 int , c3 int);
insert into t1 select floor(random() * 100000)::int % 10, floor(random() * 100000)::int % 10, generate_series(1,100000);
insert into t2 select floor(random() * 100000)::int % 10, floor(random() * 100000)::int % 10, generate_series(1,100000);
insert into t3 select floor(random() * 100000)::int % 10, floor(random() * 100000)::int % 10, generate_series(1,100000);
1. 连接支持
链接有当前有实现对应create_remotequery_path
函数,但是只是实现了inner join
和 left join
,且在大多数时候,left join
无法触发,而inner join
也有的问题是当前remote query
的path的代价都非常高,所以即使在构建join的path 的过程中,左子树和右子树都在同一个节点,此时他的path还是常规path胜出,所以第一步是先修改此时的代价计算方式,上面的语句变为
benchmarksql=# explain select * from t1 join t2 on t1.c1 = t2.c1 where t1.c1 = 5;
QUERY PLAN
----------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..25.03 rows=25584 width=24)
Primary node/s: dn1
Node/s: dn1
(3 rows)
这是本来join的path中就有的一种path生成方式,只是修改了计算的代价的大小,现在即使多表关联,也可以使用表达式正确的进行进行优化
benchmarksql=# explain select * from t1 join t2 on t1.c1 = t2.c1 join t3 on t3.c1 = t2.c1 where t1.c1 = 5;
QUERY PLAN
------------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..117.13 rows=127920 width=36)
Primary node/s: dn1
Node/s: dn1
(3 rows)
benchmarksql=# explain select * from t1 join t2 on t1.c1 = t2.c1 join t3 on t3.c2 = t2.c1 where t1.c1 = 5;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Nested Loop (cost=2.00..696.12 rows=25584 width=36)
-> Data Node Scan on t3 "_REMOTE_TABLE_QUERY_" (cost=0.00..415.25 rows=1 width=12)
Primary node/s: dn1
Node/s: dn1, dn2
-> Data Node Scan on "_REMOTE_TABLE_QUERY_" "_REMOTE_TABLE_QUERY__1" (cost=2.00..25.03 rows=25584 width=24)
Primary node/s: dn1
Node/s: dn1
(7 rows)
left join
和right join
以及full join
在满足条件下,也可以实现相同的优化,只是需要注意他们语义上on和where的区别,这些细节join原来都处理掉了,我们只需要补充相应的规则即可。
另外对于left join
和right join
,在满足某些规则的时候会转换为inner join
,此时判断规则使用inner join
的规则。对于条件不足无法转换的语句,pg会把right join
转换为左深树,所以只需要添加left join
的判断即可,而full join ,按照条件转换为left join
或者right join
,再或者按照原来的逻辑处理
benchmarksql=# explain verbose select * from t1 left join t2 on t1.c1 = t2.c1 where t1.c1 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..2.04 rows=48 width=24)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, r.a_1, r.a_2, r.a_3 FROM ((SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) l(a_1, a_2, a_3) LEFT JOIN (SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) r(a_1, a_2, a_3) ON ((l.a_1 = r.a_1))) WHERE true
(5 rows)
benchmarksql=# explain verbose select * from t1 right join t2 on t1.c1 = t2.c1 where t2.c1 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Result (cost=2.00..2.52 rows=48 width=24)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3
-> Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..2.04 rows=48 width=24)
Output: t2.c1, t2.c2, t2.c3, t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, r.a_1, r.a_2, r.a_3 FROM ((SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) l(a_1, a_2, a_3) LEFT JOIN (SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) r(a_1, a_2, a_3) ON ((r.a_1 = l.a_1))) WHERE true
(7 rows)
benchmarksql=# explain verbose select * from t1 full join t2 on t1.c1 = t2.c1 where t1.c1 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..2.04 rows=48 width=24)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, r.a_1, r.a_2, r.a_3 FROM ((SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) l(a_1, a_2, a_3) LEFT JOIN (SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) r(a_1, a_2, a_3) ON ((l.a_1 = r.a_1))) WHERE true
(5 rows)
-- left 和right 混合的语句
benchmarksql=# explain verbose select * from t1 right join t2 on t1.c1 = t2.c1 left join t3 on t3.c1 = t2.c1 where t1.c1 = 5;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Data Node Scan on "_REMOTE_TABLE_QUERY_" (cost=2.00..117.13 rows=127920 width=36)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, l.a_4, l.a_5, l.a_6, r.a_1, r.a_2, r.a_3 FROM ((SELECT l_1.a_1, l_1.a_2, l_1.a_3, r_1.a_1, r_1.a_2, r_1.a_3 FROM ((SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) l_1(a_1, a_2, a_3) JOIN (SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) r_1(a_1, a_2, a_3)ON (true)) WHERE true) l(a_1, a_2, a_3, a_4, a_5, a_6) LEFT JOIN (SELECT t3.c1, t3.c2, t3.c3 FROM ONLY public.t3 t3 WHERE (t3.c1 = 5)) r(a_1, a_2, a_3) ON ((r.a_1 = l.a_4))) WHERE true
(5 rows)
只是pg没有实现full join 到
left join
然后再到inner join
的优化,在前面也有遇到某个场景,是可以进行优化的,但是pg并没有做,pg的优化器还是不够好,比较古老
2. 聚合操作,group支持
单纯的聚合操作是在query 的targetList中的,所以如果语句原本可以下推,则聚合操作完全不用操行,直接不做处理,deparse 的时候会进行处理
benchmarksql=# explain select avg(c2) from t1 where c1 = 2;
QUERY PLAN
----------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Primary node/s: dn1
Node/s: dn1
(3 rows)
对于group,这里需要首先明白group在query中的表现方式,group在query和结构体中表现为havingQual
和groupClause
, remote query
进行deparse 需要构造query 所以理论上,只需要在remote_query中构造对应的信息就可以了。
在pg中,group在create_plan的时候,会在原来的plan上构造一个Group node 或者Sort node 或者Agg node,然后原来的plan作为他的subplan,在原来的pgxc中,有一个create_remotegrouping_plan
函数,他的输入是此时group 的plan,作用是抽取node 中的group信息,填充到remote query
的对应的位置上去,包括targetlist,groupcluse,havingcluse等,然后此时再返回remote_query,所以这里的逻辑很清晰
在group的subplan是remoteplan 的时候,且是exec_node限定在一个datanode
的时候,受用group plan中的信息填充到remote query
中,用来构造下发到datanode
的语句
当前满足条件的语句可以实现group的下推操作
benchmarksql=# explain verbose select max(c2) from t1 where c1 = 2;
QUERY PLAN
----------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Output: (max(t1.c2))
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT max(c2) AS max FROM public.t1 t1 WHERE (c1 = 2)
(5 rows)
benchmarksql=# explain verbose select max(t1.c2), avg(t2.c2), sum(t2.c3) from t1 join t2 on t1.c1 = t2.c1 where t1.c1 = 2 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_GROUP_QUERY__" (cost=2.00..31.39 rows=32656 width=12)
Output: (max(t1.c2)), (avg(t2.c2)), (sum(t2.c3))
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT max(l.a_1), avg(r.a_1), sum(r.a_2) FROM ((SELECT t1.c2, t1.c1 FROM ONLY public.t1 t1 WHERE (t1.c1 = 2)) l(a_1, a_2) JOIN (SELECT t2.c2, t2.c3, t2.c1 FROM ONLY public.t2 t2 WHERE (t2.c1 = 2)) r(a_1, a_2, a_3) ON (true)) WHERE true
(5 rows)
benchmarksql=# explain verbose select c2, sum(c3) from t1 where c1 = 2 group by c2;
QUERY PLAN
----------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Output: t1.c2, (sum(t1.c3))
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c2, sum(c3) AS sum FROM public.t1 t1 WHERE (c1 = 2) GROUP BY c2
(5 rows)
benchmarksql=# explain verbose select c2, sum(c3) from t1 where c1 = 2 group by c2 having sum (c1) < 1000;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Output: t1.c2, (sum(t1.c3))
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c2, sum(c3) AS sum FROM public.t1 t1 WHERE (c1 = 2) GROUP BY c2 HAVING (sum(c1) < 1000)
(5 rows)
benchmarksql=# explain verbose select t1.c2, sum (t1.c1), max(t2.c2) from t1 join t2 on t1.c1 = t2.c1 and t1.c1 = 4 group by t1.c2 having sum (t2.c3) < 1000;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_GROUP_QUERY__" (cost=2.00..24.78 rows=25312 width=16)
Output: t1.c2, (sum(t1.c1)), (max(t2.c2))
Node/s: dn2
Remote query: SELECT l.a_1, sum(r.a_2), sum(l.a_2), max(r.a_1) FROM ((SELECT t1.c2, t1.c1 FROM ONLY public.t1 t1 WHERE (t1.c1 = 4)) l(a_1, a_2) JOIN (SELECT t2.c2, t2.c3, t2.c1 FROM ONLY public.t2 t2 WHERE (t2.c1 = 4)) r(a_1, a_2, a_3) ON (true)) WHERE true GROUP BY 1
Coordinator quals: ((sum(t2.c3)) < 1000)
(5 rows)
3. limit 支持
只要语句中出现limit
,则在语义正确的情况下,任何时候都可以进行下推,例如select * from t1 limit 10
,这里最优的执行计划是在dn上进行limit,然后再在此基础上在coordinate
上进行limit
,在AntDB
最初的移植的postgres-xc
的代码中,已经实现了对应的函数,只是由于内核的变动,所以后面没有继续支持。
这里进行逻辑梳理,然后添加对应的实现,具体的实现逻辑和group
类似
benchmarksql=# explain verbose select * from t1 limit 10 offset 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Limit (cost=91.03..100.14 rows=10 width=12)
Output: t1.c1, t1.c2, t1.c3
-> Data Node Scan on "__REMOTE_LIMIT_QUERY__" (cost=0.00..91033.00 rows=100000 width=12)
Output: t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1, dn2
Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE true LIMIT ('10'::bigint + '100'::bigint)
(7 rows)
benchmarksql=# explain verbose select * from t1 where c1 = 2 limit 10 offset 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_LIMIT_QUERY__" (cost=1118.25..1230.08 rows=10 width=12)
Output: t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE (c1 = 2) OFFSET '100'::bigint LIMIT '10'::bigint
(5 rows)
benchmarksql=# explain verbose select * from t1 where c2 = 2 order by c3 limit 10 offset 100;
QUERY PLAN
-------------------------------------------------------------------------------------------------
Limit (cost=91664.26..91664.29 rows=10 width=12)
Output: t1.c1, t1.c2, t1.c3
-> Sort (cost=91664.01..91688.50 rows=9793 width=12)
Output: t1.c1, t1.c2, t1.c3
Sort Key: t1.c3
-> Data Node Scan on "__REMOTE_SORT_QUERY__" (cost=0.00..91283.00 rows=9793 width=12)
Output: t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1, dn2
Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE (c2 = 2) ORDER BY 3
(10 rows)
benchmarksql=# explain verbose select * from t1 right join t2 on t1.c1 = t2.c1 left join t3 on t3.c1 = t2.c1 where t1.c1 = 5 limit 10 offset 100;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_LIMIT_QUERY__" (cost=2.09..2.10 rows=10 width=36)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, l.a_4, l.a_5, l.a_6, r.a_1, r.a_2, r.a_3 FROM ((SELECT l_1.a_1, l_1.a_2, l_1.a_3, r_1.a_1, r_1.a_2, r_1.a_3 FROM ((SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) l_1(a_1, a_2, a_3) JOIN (SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) r_1(a_1, a_2, a_3)ON (true)) WHERE true) l(a_1, a_2, a_3, a_4, a_5, a_6) LEFT JOIN (SELECT t3.c1, t3.c2, t3.c3 FROM ONLY public.t3 t3 WHERE (t3.c1 = 5)) r(a_1, a_2, a_3) ON ((r.a_1 = l.a_4))) WHERE true OFFSET '100'::bigint LIMIT '10'::bigint
(5 rows)
benchmarksql=# explain verbose select * from t1 right join t2 on t1.c1 = t2.c1 left join t3 on t3.c1 = t2.c1 where t1.c1 = 5 limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..1.11 rows=10 width=36)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
-> Nested Loop (cost=0.00..104078.12 rows=937970 width=36)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
-> Nested Loop Left Join (cost=0.00..1049.17 rows=110 width=24)
Output: t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
Join Filter: (t3.c1 = t2.c1)
-> Data Node Scan on t2 "_REMOTE_TABLE_QUERY_" (cost=0.00..632.25 rows=11 width=12)
Output: t2.c1, t2.c2, t2.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c1, c2, c3 FROM ONLY public.t2 t2 WHERE (c1 = 5)
-> Materialize (cost=0.00..415.30 rows=10 width=12)
Output: t3.c1, t3.c2, t3.c3
-> Data Node Scan on t3 "_REMOTE_TABLE_QUERY__1" (cost=0.00..415.25 rows=10 width=12)
Output: t3.c1, t3.c2, t3.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c1, c2, c3 FROM ONLY public.t3 t3 WHERE (c1 = 5)
-> Materialize (cost=0.00..91325.63 rows=8527 width=12)
Output: t1.c1, t1.c2, t1.c3
-> Data Node Scan on t1 "_REMOTE_TABLE_QUERY__2" (cost=0.00..91283.00 rows=8527 width=12)
Output: t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE (c1 = 5)
(26 rows)
这里上面的这条语句没有实现limit下推,但是这里确定对应的逻辑已经实现了,没有选择下推,最大的可能是代价对比的过程中,path被淘汰了。
4. order支持
order 有几种优化思路,
- 条件约束到单个节点执行,此时order下推
- 条件无法约束到单个节点
- dn执行查询,然后数据在
coordinate
汇总排序 - dn查询排序,然后数据在
coordinate
执行merge操作
- dn执行查询,然后数据在
其中简单语句的场景已经实现,因为order 在query中是一个附属的list,FQS中几乎没有检测,所以如果是简单的语句,按照简单语句的逻辑处理。对于复杂语句,需要在后面进行优化
在pg 原有的逻辑中,sort 在create_plan 中,会在原paln之上建立一个sort node,原plan作为他的subplan,具体细节可以看create_sort_plan
,在上面的group的优化中介绍了group的优化原理,这里的sort也是类似的操作
如果sort的subplan是remoteplan 的时候,从sort中抽取相关信息,构造remote query
的sortcluase,如果remote plan
只有一个执行节点,则直接返回subplan,否则返回sort
所以这里可以看出sort 在有多个执行节点的时候,节点上的语句也会进行sort,然后再coordinate
上再执行排序操作,具体的执行计划如下
benchmarksql=# explain verbose select * from t1 order by c2;
QUERY PLAN
---------------------------------------------------------------------------------------------
Sort (cost=99337.82..99587.82 rows=100000 width=12)
Output: t1.c1, t1.c2, t1.c3
Sort Key: t1.c2
-> Data Node Scan on "__REMOTE_SORT_QUERY__" (cost=0.00..91033.00 rows=100000 width=12)
Output: t1.c1, t1.c2, t1.c3
Primary node/s: dn1
Node/s: dn1, dn2
Remote query: SELECT c1, c2, c3 FROM ONLY public.t1 t1 WHERE true ORDER BY 2
(8 rows)
此执行计划明显可以看到ship 的语句也进行了排序,在coordinate
上也进行排序,这里coordinate
上的排序进行了一些优化,类似于merge操作,而不是一个完整的排序,这里也可以取消dn上的排序,但是并不知道这两者之间性能差异有多大,适用场景是什么,所以暂时没有改动。
dn语句上的sort原本是为了下发的语句添加的,如果有约束条件,则语句表现如下
benchmarksql=# explain verbose select * from t1 where t1.c1 = 4 order by c2;
QUERY PLAN
--------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Output: t1.c1, t1.c2, t1.c3
Node/s: dn2
Remote query: SELECT c1, c2, c3 FROM public.t1 t1 WHERE (c1 = 4) ORDER BY c2
(4 rows)
另外语句sort优化的时候,检测的plan是Sort->remoteQuery
,所以满足这个模式的plan都可以添加sort支持,根据实际执行的dn选择合适的执行计划,例如下面的复杂的语句
benchmarksql=# explain verbose select * from t1 right join t2 on t1.c1 = t2.c1 left join t3 on t3.c1 = t2.c1 where t1.c1 = 5 order by t3.c2;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Data Node Scan on "__REMOTE_SORT_QUERY__" (cost=14467.87..14787.67 rows=127920 width=36)
Output: t1.c1, t1.c2, t1.c3, t2.c1, t2.c2, t2.c3, t3.c1, t3.c2, t3.c3
Primary node/s: dn1
Node/s: dn1
Remote query: SELECT l.a_1, l.a_2, l.a_3, l.a_4, l.a_5, l.a_6, r.a_1, r.a_2, r.a_3 FROM ((SELECT l_1.a_1, l_1.a_2, l_1.a_3, r_1.a_1, r_1.a_2, r_1.a_3 FROM ((SELECT t1.c1, t1.c2, t1.c3 FROM ONLY public.t1 t1 WHERE (t1.c1 = 5)) l_1(a_1, a_2, a_3) JOIN (SELECT t2.c1, t2.c2, t2.c3 FROM ONLY public.t2 t2 WHERE (t2.c1 = 5)) r_1(a_1, a_2, a_3) ON (true)) WHERE true) l(a_1,a_2, a_3, a_4, a_5, a_6) LEFT JOIN (SELECT t3.c1, t3.c2, t3.c3 FROM ONLY public.t3 t3 WHERE (t3.c1 = 5)) r(a_1, a_2, a_3) ON ((r.a_1 = l.a_4))) WHERE true ORDER BY 8
(5 rows)
对于复杂的语句,这里一般在优化过程中,在优化到聚合或者sort算子的时候,如果是此时plan是多节点执行的话,一般早就被淘汰,所以复杂语句这里一般都是优化的一个执行节点的情况,这也是本次优化的目标
5. 子查询支持
在FQS中几乎不支持子查询的check,一般都会失败,只支持极为简单的语句,这里稍作修改,支持下面的类似的场景
benchmarksql=# explain select count (*) from t1 where exists (
select * from t2 where t2.c1 = 2
) and exists (
select * from t3 where t3.c1 = 2
) and t1.c1 = 2;
QUERY PLAN
----------------------------------------------------------------------------
Data Node Scan on "__REMOTE_FQS_QUERY__" (cost=0.00..0.00 rows=0 width=0)
Primary node/s: dn1
Node/s: dn1
(3 rows)
上面的语句属于子链接的优化,且语句在每一个子查询上都有一个约束条件的,FQS中就是使用次约束条件才正确的计算了exec_node,然后在这里还添加了一些限制条件,在FQS中,只允许exists语句,其他子链接不支持,因为其他的子链接中都有隐藏的链接条件,在FQS中处理较为困难,所以直接判断失败。
另外对于复杂的子查询,需要进行在语句优化中综合上下文进行判断,例如a join b on a.c1 = b.c1 where a.c1 = 1
和a join b on a.c1 = b.c1 and a.c1 = 1
在FQS中是两种不同的数据结构,但是在语义上是完全相同的,且其中的隐含条件b.c1 = 1
是需要我们进行推导才能知道的,仅仅凭上述的条件,对于a表和b表,得到的exec_nodes
完全无意义的,这里也可以把后续的对应的优化操作提前,但是场景太多,所有对于这种场景,需要推后到后面的优化阶段进行判断
所以子查询都是在FQS之后进行优化的,而在优化过程中,对于子查询,pg有两种优化思路,一是上拉为join,此时使用join的优化逻辑优化,前面的join优化已经处理掉了;二是直接优化为subplan,此时subplan 是一个单独的完整的语句,也可以按照正常的remoteQuery
的优化逻辑进行优化
这里还需要对子链接中存在的问题进行单独说明,前面提到,FQS中只能处理exists类型的子链接,其他类型的子链接中存在隐藏的链接条件,例如
benchmarksql=# explain select * from t1 where t1.c2 in (select c2 from t2 where t2.c1 = 3);
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Semi Join (cost=632.33..91994.57 rows=6000 width=12)
Hash Cond: (t1.c2 = t2.c2)
-> Data Node Scan on t1 "_REMOTE_TABLE_QUERY_" (cost=0.00..91033.00 rows=100000 width=12)
Primary node/s: dn1
Node/s: dn1, dn2
-> Hash (cost=632.25..632.25 rows=6 width=4)
-> Data Node Scan on t2 "_REMOTE_TABLE_QUERY__1" (cost=0.00..632.25 rows=6 width=4)
Node/s: dn2
(8 rows)
上面的语句中的隐藏条件为t1.c2 = t2.c2
,在没有表达式推导的时候,这个语句是无法确定是否能在一个数据节点运算的,所以此类语句在FQS中都限制掉了,需要在后面的优化过程中进行处理。
前面提到remoteQuery
的实现原理,remote query
之后在生成plan 之后,会使用plan构造query,然后使用query deparse 出语句,发送到datanode
,执行,但是在pg 中,并没有semi join
和anti join,此时如果优化出来的语句是semi join
或者anti join,则即使前面优化通过,确定可以生成remoteplan,但是会deparse失败,而在pg中,一般是子链接会生成这两种join。
所以对于此类语句。暂时搁置,解决方案有
- 语句中如果有子链接,则保存原始的query,优化的时候按照原有的逻辑进行优化,join中放开对
semi join
的限制,最后如果生成的plan是remotequery且只有一个执行节点,则使用保存的query 进行deparse,类似将FQS延申到这里,整个优化过程只是进行一个判断过程- 问题是语句情况复杂,场景复杂,逻辑复杂,需要对整个语句有完整的把握,需要实现过程是简单的,但是简但的替换需要考虑的问题很多
- 逻辑上来说,只要最后如果plan是remotequery,那就代表整个优化过程中已经对语句进行了完整的判断,这里是可以无条件替换的,但是有人可以保证直接替换没有问题吗
- 在deparse中实现
semi join
的等意语句的转换,无法deparse为semi join
,则直接转换为等义语句- 需要对
semi join
有完整的认知
- 需要对
- 把子链接以及join以及表达式的优化提前到FQS中,在FQS中进行判断
- 维持原状
遗留问题
- 子链接优化为
semi join
之后无法deparse - 代价模型需要综合
cluster
进行考虑,需要明确代价模型的计算原则 - 动态参数支持
- check 更多的场景
进一步优化的可能 如聚合函数更加细化
count() = sum(count(all dn)) sum() = sum(sum(all dn)) max() = max(max(all dn)) min() = min(min(all dn)) avg() = sum()/count() = sum(sum(all dn)) / sum(count(all dn)) sort() = merge(sort(all dn)) limit() = limit(limit(all dn))