(类 PG 信创)亿级 table 简单查询的一个优化问题

V2EX = way to explore

V2EX 是一个关于分享和探索的地方

已注册用户请登录

这是一个创建于 502 天前的主题，其中的信息可能已经有所发展或是发生改变。

部分信息借由 GPT 进行脱敏处理

现有一个简单表结构：

Table "schema.t_table_" Column | Type | Modifiers | Storage | Stats target | Description -----------------+-----------------------------+------------------------------------------+----------+--------------+------------- partition_id | smallint | not null | plain | | 分区 ID transaction_time| timestamp without time zone | not null default pg_systimestamp() | plain | | 交易时间 transaction_count | integer | | plain | | 交易数量 last_update_time | timestamp without time zone | not null default pg_systimestamp() | plain | | 更新时间 source_type | character varying(10) | not null default NULL::character varying | extended | | 事件源 transaction_code | character varying(30) | not null default NULL::character varying | extended | | 交易码 key_value | character varying(100) | not null default NULL::character varying | extended | | RV 变量键 value_data | character varying(500) | not null default NULL::character varying | extended | | RV 变量值 Indexes: "t_table_m_pkey" PRIMARY KEY, ubtree (transaction_time, value_data, key_value, transaction_code, source_type, partition_id) LOCAL WITH (storage_type=USTORE, active_pages=1314323) TABLESPACE pg_default Partition By LIST(partition_id) Number of partitions: 31 (View pg_partition to check each partition range.) Has OIDs: no Options: orientation=row, compression=no, storage_type=USTORE, segment=off

表由 Mysql 迁移而来，这个分区表中存储的记录条数为 126000000+。在开发侧，他们反馈下面的一个简单查询执行性能非常差，每条 SQL 要跑 15s 左右：

SELECT value_data AS dataValue, SUM(transaction_count) AS txnCount FROM schema.t_table_m WHERE 1 = 1 AND source_type = '1' AND key_value IN ( 'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9', 'key10' ) AND partition_id IN (1, 31) AND transaction_time >= '2024-07-31 15:58:11' AND transaction_time <= '2024-08-01 15:58:11' AND value_data NOT IN ('null','space','#','*') AND transaction_code NOT IN ('code1', 'code2') GROUP BY value_data;

执行计划如下：

postgres=> explain performance SELECT .......; id | operation | A-time | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width | E-costs ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+---------------- 1 | -> HashAggregate | 12715.021 | 0 | 1 | | 22KB | | 8 | 28.485..28.495 2 | -> Partition Iterator | 12715.011 | 0 | 1 | | 69KB | | 8 | 0.010..28.480 3 | -> Partitioned Index Scan using t_table_m_pkey on schema.t_table_m | 12714.903 | 0 | 1 | | 29KB | | 8 | 0.010..28.480 (3 rows) Predicate Information (identified by plan id) ------------------------------------------------------------------------------------------ 2 --Partition Iterator Iterations: 2 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text) AND (schema.t_table_m.partition_id = ANY ('{1,31}'::integer[]))) Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[]))) Selected Partitions: 1,31 (6 rows) Memory Information (identified by plan id) ------------------------------------------------------------------------------------------ 1 --HashAggregate Peak Memory: 22KB, Estimate Memory: 32MB 2 --Partition Iterator Peak Memory: 69KB, Estimate Memory: 32MB 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Peak Memory: 29KB, Estimate Memory: 32MB (6 rows) Targetlist Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate Output: value_data, sum(transaction_count) Group By Key: schema.t_table_m.value_data 2 --Partition Iterator Output: value_data, transaction_count 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Output: value_data, transaction_count (7 rows) Datanode Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate (actual time=12715.021..12715.021 rows=0 loops=1) (Buffers: shared hit=760703) (CPU: ex c/r=0, ex row=0, ex cyc=437, inc cyc=847446039235951) 2 --Partition Iterator (actual time=12715.011..12715.011 rows=0 loops=1) (CPU: ex c/r=0, ex row=0, ex cyc=-847446039224001, inc cyc=847446039235514) 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m (actual time=12714.903..12714.903 rows=0 loops=2) (Buffers: shared hit=760703) (CPU: ex c/r=0, ex row=0, ex cyc=1694892078459515, inc cyc=1694892078459515) (11 rows) ====== Query Summary ===== ------------------------------------------ Datanode executor start time: 0.177 ms Datanode executor run time: 12715.042 ms Datanode executor end time: 0.021 ms Planner runtime: 1.124 ms Query Id: 10810046480632471817 Total runtime: 12715.260 ms (6 rows)

其中所有涉及列都已经命中索引扫描，测试时尝试更改某些 where 条件使得查询进行全表扫描的话，执行时间会暴增到 2 分钟左右。

经由我们多轮调试后，发现仅仅是单纯地将 where 条件中partition_id IN (1, 31)更改为(partition_id=31 OR partition_id=1)，就能使得查询平均耗时减少到 6s 。其它优化更改的效果低于系统性能波动，无法观测。

原本猜测是因为 IN 语句改写后会触发分区剪枝类的优化，但是经对比发现改写前后的执行计划完全一致：

postgres=> explain performance SELECT value_data AS dataValue, SUM(transaction_count) AS txnCount FROM schema.t_table_m WHERE 1 = 1 AND source_type = '1' AND key_value = any (array[ 'key1', 'key2', 'key3', 'key4', 'key5', 'key6', 'key7', 'key8', 'key9', 'key10' ]) AND (partition_id = 1 OR partition_id = 31) AND transaction_time >= '2024-07-31 15:58:11' AND transaction_time <= '2024-08-01 15:58:11' AND value_data not in ('null','space','#','*') AND transaction_code <> 'code1' AND transaction_code <> 'code2' GROUP BY value_data; id | operation | A-time | A-rows | E-rows | E-distinct | Peak Memory | A-width | E-width | E-costs ----+------------------------------------------------------------------------------------------------+-----------+--------+--------+------------+-------------+---------+---------+---------------- 1 | -> HashAggregate | 6421.454 | 0 | 1 | | 22KB | | 8 | 14.790..14.800 2 | -> Partition Iterator | 6421.441 | 0 | 1 | | 69KB | | 8 | 0.010..14.785 3 | -> Partitioned Index Scan using t_table_m_pkey on schema.t_table_m | 6421.276 | 0 | 1 | | 31KB | | 8 | 0.010..14.785 (3 rows) Predicate Information (identified by plan id) ----------------------------------------------------------------------------------------- 2 --Partition Iterator Iterations: 2 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Index Cond: ((schema.t_table_m.transaction_time >= '2024-07-31 15:58:11'::timestamp without time zone) AND (schema.t_table_m.transaction_time <= '2024-08-01 15:58:11'::timestamp without time zone) AND ((schema.t_table_m.key_value)::text = ANY ('{key1,key2,key3,key4,key5,key6,key7,key8,key9,key10}'::text[])) AND ((schema.t_table_m.source_type)::text = '1'::text)) Filter: (((schema.t_table_m.transaction_code)::text <> 'code1'::text) AND ((schema.t_table_m.transaction_code)::text <> 'code2'::text) AND ((schema.t_table_m.partition_id = 31) OR (schema.t_table_m.partition_id = 1)) AND ((schema.t_table_m.value_data)::text <> ALL ('{"null",space,#,*}'::text[]))) Selected Partitions: 1,31 (6 rows) Memory Information (identified by plan id) ------------------------------------------------------------------------------------------ 1 --HashAggregate Peak Memory: 22KB, Estimate Memory: 32MB 2 --Partition Iterator Peak Memory: 69KB, Estimate Memory: 32MB 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Peak Memory: 31KB, Estimate Memory: 32MB (6 rows) Targetlist Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate Output: value_data, sum(transaction_count) Group By Key: schema.t_table_m.value_data 2 --Partition Iterator Output: value_data, transaction_count 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m Output: value_data, transaction_count (7 rows) Datanode Information (identified by plan id) -------------------------------------------------------------------------------------------- 1 --HashAggregate (actual time=6421.454..6421.454 rows=0 loops=1) (Buffers: shared hit=380353) (CPU: ex c/r=0, ex row=0, ex cyc=488, inc cyc=847433445353297) 2 --Partition Iterator (actual time=6421.441..6421.441 rows=0 loops=1) (CPU: ex c/r=0, ex row=0, ex cyc=-847433445336080, inc cyc=847433445352809) 3 --Partitioned Index Scan using t_table_m_pkey on schema.t_table_m (actual time=6421.276..6421.276 rows=0 loops=2) (Buffers: shared hit=380353) (CPU: ex c/r=0, ex row=0, ex cyc=1694866890688889, inc cyc=1694866890688889) (11 rows) ====== Query Summary ===== ----------------------------------------- Datanode executor start time: 0.198 ms Datanode executor run time: 6421.479 ms Datanode executor end time: 0.022 ms Planner runtime: 1.417 ms Query Id: 10810046480632413864 Total runtime: 6421.729 ms (6 rows)

两者唯一的区别是，在 Partitioned Index Scan 时，使用 OR 的语句 Buffers: shared hit 是使用 IN 的一半。

其中造成这样的差距的可能原因是什么？

partition_id

4 条回复 2024-08-14 12:04:29 +08:00

967182

2024-08-14 11:17:22 +08:00

GPT 这样说：从描述和执行计划中可以看出，使用 IN 与使用 OR 的主要区别在于访问缓冲区（ Buffers ）的行为不同。使用 OR 时，缓冲区的共享命中率只有使用 IN 的一半。这意味着在使用 OR 时，查询过程中的缓存利用效率更高，从而导致查询执行时间较短。

以下是可能导致这种差异的原因：

缓存使用效率：IN 操作符通常会转换成多个条件的 OR 表达式，这可能会增加某些情况下的缓存失效或增加磁盘 IO 。

执行计划差异：尽管在高层次上执行计划看似相同，但在某些底层细节上，优化器可能会对 IN 和 OR 进行不同的处理，导致资源使用的不同。

执行路径：OR 和 IN 在某些数据库系统中的处理路径可能不同，导致数据扫描和缓存的差异。

总的来说，具体的原因可能是数据库在处理 IN 操作符时，在分区剪枝和索引扫描的优化方面没有完全优化，或者在使用 OR 时，查询路径更为直接，减少了缓存未命中的情况，从而使查询执行得更快。

mark2025

2024-08-14 11:27:35 +08:00

考虑在 transaction_time 上面创建 BRIN 索引（不是 btree ）

ShuA1

2024-08-14 11:43:42 +08:00

建议将 transaction_time 保存为时间戳，bigint 的对比速度会快很多。你这里是日期，会存在隐式转换为时间戳的问题。
另外我看 source_type 字段也是 text 类型，也可以更换为 int4 。
建议你用 explain analyze 进行分析，找出慢的地方，performance 更多是对资源消耗的分析。

MoYi123

2024-08-14 12:04:29 +08:00

优化器没写好呗, 能报 bug 的话就报一下, 这种应该很好修.