(类 PG 信创)亿级 table 简单查询的一个优化问题 - V2EX
V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
HOMO114514
V2EX    数据库

(类 PG 信创)亿级 table 简单查询的一个优化问题

  •  
  •   HOMO114514 2024-08-14 09:59:43 +08:00 1909 次点击
    这是一个创建于 502 天前的主题,其中的信息可能已经有所发展或是发生改变。

    部分信息借由 GPT 进行脱敏处理

    现有一个简单表结构:

    Table "schema.t_table_" Column | Type | Modifiers | Storage | Stats target | Description -----------------+-----------------------------+------------------------------------------+----------+--------------+------------- partition_id | smallint | not null | plain | | 分区 ID transaction_time| timestamp without time zone | not null default pg_systimestamp() | plain | | 交易时间 transaction_count | integer | | plain | | 交易数量 last_update_time | timestamp without time zone | not null default pg_systimestamp() | plain | | 更新时间 source_type | character varying(10) | not null default NULL::character varying | extended | | 事件源 transaction_code | character varying(30) | not null default NULL::character varying | extended | | 交易码 key_value | character varying(100) | not null default NULL::character varying | extended | | RV 变量键 value_data | character varying(500) | not null default NULL::character varying | extended | | RV 变量值 Indexes: "t_table_m_pkey" PRIMARY KEY, ubtree (transaction_time, value_data, key_value, transaction_code, source_type, partition_id) LOCAL WITH (storage_type=USTORE, active_pages=1314323) TABLESPACE pg_default Partition By LIST(partition_id) Number of partitions: 31 (View pg_partition to check each partition range.) Has OIDs: no Options: orientation=row, compression=no, storage_type=USTORE, segment=off 

    表由 Mysql 迁移而来,这个分区表中存储的记录条数为 126000000+。在开发侧,他们反馈下面的一个简单查询执行性能非常差,每条 SQL 要跑 15s 左右:

    SELECT value_data AS dataValue, SUM(transaction_count) AS txnCount FROM schema.t_table_m WHERE 1 = 1 AND source_type = '1' AND key_value IN ( 'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9', 'key10' ) AND partition_id IN (1, 31) AND transaction_time >= '2024-07-31 15:58:11' AND transaction_time <= '2024-08-01 15:58:11' AND value_data NOT IN ('null','space','#','*') AND transaction_code NOT IN ('code1', 'code2') GROUP BY value_data; 

    执行计划如下:

    postgres=> explain performance SELECT .......; id | operation | A-time | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width | E-costs ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+---------------- 1 | -> HashAggregate | 12715.021 | 0 | 1 | | 22KB | | 8 | 28.485..28.495 2 | -> Partition Iterator | 12715.011 | 0 | 1 | | 69KB | | 8 | 0.010..28.480 3 | -> Partitioned Index Scan using t_table_m_pkey on schema.t_table_m | 12714.903 | 0 | 1 | | 29KB | | 8 | 0.010..28.480 (3 rows) Predicate Information (identified by plan id) ------------------------------------------------------------------------------------------ 2 --Partition Iterator Iterations: 2 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text) AND (schema.t_table_m.partition_id = ANY ('{1,31}'::integer[]))) Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[]))) Selected Partitions: 1,31 (6 rows) Memory Information (identified by plan id) ------------------------------------------------------------------------------------------ 1 --HashAggregate Peak Memory: 22KB, Estimate Memory: 32MB 2 --Partition Iterator Peak Memory: 69KB, Estimate Memory: 32MB 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Peak Memory: 29KB, Estimate Memory: 32MB (6 rows) Targetlist Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate Output: value_data, sum(transaction_count) Group By Key: schema.t_table_m.value_data 2 --Partition Iterator Output: value_data, transaction_count 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Output: value_data, transaction_count (7 rows) Datanode Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate (actual time=12715.021..12715.021 rows=0 loops=1) (Buffers: shared hit=760703) (CPU: ex c/r=0, ex row=0, ex cyc=437, inc cyc=847446039235951) 2 --Partition Iterator (actual time=12715.011..12715.011 rows=0 loops=1) (CPU: ex c/r=0, ex row=0, ex cyc=-847446039224001, inc cyc=847446039235514) 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m (actual time=12714.903..12714.903 rows=0 loops=2) (Buffers: shared hit=760703) (CPU: ex c/r=0, ex row=0, ex cyc=1694892078459515, inc cyc=1694892078459515) (11 rows) ====== Query Summary ===== ------------------------------------------ Datanode executor start time: 0.177 ms Datanode executor run time: 12715.042 ms Datanode executor end time: 0.021 ms Planner runtime: 1.124 ms Query Id: 10810046480632471817 Total runtime: 12715.260 ms (6 rows) 

    其中所有涉及列都已经命中索引扫描,测试时尝试更改某些 where 条件使得查询进行全表扫描的话,执行时间会暴增到 2 分钟左右。

    经由我们多轮调试后,发现仅仅是单纯地将 where 条件中partition_id IN (1, 31)更改为(partition_id=31 OR partition_id=1),就能使得查询平均耗时减少到 6s 。其它优化更改的效果低于系统性能波动,无法观测。

    原本猜测是因为 IN 语句改写后会触发分区剪枝类的优化,但是经对比发现改写前后的执行计划完全一致:

    postgres=> explain performance SELECT value_data AS dataValue, SUM(transaction_count) AS txnCount FROM schema.t_table_m WHERE 1 = 1 AND source_type = '1' AND key_value = any (array[ 'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9', 'key10' ]) AND (partition_id = 1 OR partition_id = 31) AND transaction_time >= '2024-07-31 15:58:11' AND transaction_time <= '2024-08-01 15:58:11' AND value_data not in ('null','space','#','*') AND transaction_code <> 'code1' AND transaction_code <> 'code2' GROUP BY value_data; id | operation | A-time | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width | E-costs ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+---------------- 1 | -> HashAggregate | 6421.454 | 0 | 1 | | 22KB | | 8 | 14.790..14.800 2 | -> Partition Iterator | 6421.441 | 0 | 1 | | 69KB | | 8 | 0.010..14.785 3 | -> Partitioned Index Scan using t_table_m_pkey on schema.t_table_m | 6421.276 | 0 | 1 | | 31KB | | 8 | 0.010..14.785 (3 rows) Predicate Information (identified by plan id) ----------------------------------------------------------------------------------------- 2 --Partition Iterator Iterations: 2 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text)) Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.partition_id = 31) OR (schema.t_table_m.partition_id = 1)) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[]))) Selected Partitions: 1,31 (6 rows) Memory Information (identified by plan id) ------------------------------------------------------------------------------------------ 1 --HashAggregate Peak Memory: 22KB, Estimate Memory: 32MB 2 --Partition Iterator Peak Memory: 69KB, Estimate Memory: 32MB 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Peak Memory: 31KB, Estimate Memory: 32MB (6 rows) Targetlist Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate Output: value_data, sum(transaction_count) Group By Key: schema.t_table_m.value_data 2 --Partition Iterator Output: value_data, transaction_count 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Output: value_data, transaction_count (7 rows) Datanode Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate (actual time=6421.454..6421.454 rows=0 loops=1) (Buffers: shared hit=380353) (CPU: ex c/r=0, ex row=0, ex cyc=488, inc cyc=847433445353297) 2 --Partition Iterator (actual time=6421.441..6421.441 rows=0 loops=1) (CPU: ex c/r=0, ex row=0, ex cyc=-847433445336080, inc cyc=847433445352809) 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m (actual time=6421.276..6421.276 rows=0 loops=2) (Buffers: shared hit=380353) (CPU: ex c/r=0, ex row=0, ex cyc=1694866890688889, inc cyc=1694866890688889) (11 rows) ====== Query Summary ===== ----------------------------------------- Datanode executor start time: 0.198 ms Datanode executor run time: 6421.479 ms Datanode executor end time: 0.022 ms Planner runtime: 1.417 ms Query Id: 10810046480632413864 Total runtime: 6421.729 ms (6 rows) 

    两者唯一的区别是,在 Partitioned Index Scan 时,使用 OR 的语句 Buffers: shared hit 是使用 IN 的一半。

    其中造成这样的差距的可能原因是什么?

    4 条回复    2024-08-14 12:04:29 +08:00
    967182
        1
    967182  
       2024-08-14 11:17:22 +08:00
    GPT 这样说: 从描述和执行计划中可以看出,使用 IN 与使用 OR 的主要区别在于访问缓冲区( Buffers )的行为不同。使用 OR 时,缓冲区的共享命中率只有使用 IN 的一半。这意味着在使用 OR 时,查询过程中的缓存利用效率更高,从而导致查询执行时间较短。

    以下是可能导致这种差异的原因:

    缓存使用效率:IN 操作符通常会转换成多个条件的 OR 表达式,这可能会增加某些情况下的缓存失效或增加磁盘 IO 。

    执行计划差异:尽管在高层次上执行计划看似相同,但在某些底层细节上,优化器可能会对 IN 和 OR 进行不同的处理,导致资源使用的不同。

    执行路径:OR 和 IN 在某些数据库系统中的处理路径可能不同,导致数据扫描和缓存的差异。

    总的来说,具体的原因可能是数据库在处理 IN 操作符时,在分区剪枝和索引扫描的优化方面没有完全优化,或者在使用 OR 时,查询路径更为直接,减少了缓存未命中的情况,从而使查询执行得更快。
    mark2025
        2
    mark2025  
       2024-08-14 11:27:35 +08:00
    考虑在 transaction_time 上面创建 BRIN 索引(不是 btree )
    ShuA1
        3
    ShuA1  
       2024-08-14 11:43:42 +08:00
    建议将 transaction_time 保存为时间戳,bigint 的对比速度会快很多。 你这里是日期,会存在隐式转换为时间戳的问题。
    另外我看 source_type 字段也是 text 类型, 也可以更换为 int4 。
    建议你用 explain analyze 进行分析,找出慢的地方,performance 更多是对资源消耗的分析。
    MoYi123
        4
    MoYi123  
       2024-08-14 12:04:29 +08:00
    优化器没写好呗, 能报 bug 的话就报一下, 这种应该很好修.
    关于     帮助文档     自助推广系统     博客     API     FAQ     Solana     3056 人在线   最高记录 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 33ms UTC 12:24 PVG 20:24 LAX 04:24 JFK 07:24
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86