[Discussion] Why does epoll_wait have a performance bottleneck?

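I ran the following test with perf, the benchmark tool that ships with Linux.

The machine has 28 cores. `-m` gives each reader thread its own epfd, `-t` sets the number of reader threads, and `-r` sets how long the benchmark runs (2 seconds here so it finishes faster; the default is 8 seconds).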

```
./perf bench epoll wait -m -t 1 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23298]: 1 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x27e0260 ... 0x27e035c [ 525069 ops/sec ]

Averaged 525069 operations/sec (+- 0.00%), total secs = 2
```
```
./perf bench epoll wait -m -t 2 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23312]: 2 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x1bec260 ... 0x1bec35c [ 463399 ops/sec ]
[thread 1] fdmap: 0x1bec5f0 ... 0x1bec6ec [ 463392 ops/sec ]

Averaged 463395 operations/sec (+- 0.00%), total secs = 2
```
```
./perf bench epoll wait -m -t 4 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23319]: 4 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x22ea260 ... 0x22ea35c [ 208576 ops/sec ]
[thread 1] fdmap: 0x22ea5f0 ... 0x22ea6ec [ 208576 ops/sec ]
[thread 2] fdmap: 0x22ea980 ... 0x22eaa7c [ 208576 ops/sec ]
[thread 3] fdmap: 0x22ead10 ... 0x22eae0c [ 208565 ops/sec ]

Averaged 208573 operations/sec (+- 0.00%), total secs = 2
```
```
./perf bench epoll wait -m -t 8 -r 2
# Running 'epoll/wait' benchmark:
Run summary [PID 23328]: 8 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x1832370 ... 0x183246c [ 150848 ops/sec ]
[thread 1] fdmap: 0x1832700 ... 0x18327fc [ 150848 ops/sec ]
[thread 2] fdmap: 0x1832a90 ... 0x1832b8c [ 150848 ops/sec ]
[thread 3] fdmap: 0x1832e20 ... 0x1832f1c [ 150848 ops/sec ]
[thread 4] fdmap: 0x18331b0 ... 0x18332ac [ 150848 ops/sec ]
[thread 5] fdmap: 0x1833540 ... 0x183363c [ 150844 ops/sec ]
[thread 6] fdmap: 0x18338d0 ... 0x18339cc [ 150824 ops/sec ]
[thread 7] fdmap: 0x1833c60 ... 0x1833d5c [ 150816 ops/sec ]

Averaged 150840 operations/sec (+- 0.00%), total secs = 2
```

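As you can see, the number of epoll_wait calls completed per second keeps dropping. The only parameter that changes is the thread count. Each reader thread calls epoll_wait on 64 fds, and a single writer thread writes to every fd of every reader thread. The printed result is the number of epoll_wait calls each reader thread completes per second.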

In principle, every reader thread has its own epfd and the threads share no resources, so why does throughput (epoll calls per second) drop?

16 replies 2022-05-17 00:39:50 +08:00

wslzy007

May 16, 2022

What is the point of this? epoll_wait is a system call, so in principle the fewer calls the better; otherwise sys time gets very high.

pwrliang

May 16, 2022

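@wslzy007 I am doing research on networking and measuring the peak performance of kernel-space communication methods. I have found a bottleneck but do not know what causes it.
A system call costs roughly 200 nanoseconds (`./perf bench syscall basic`), and that cost does not grow with the number of calling processes.
epoll_wait, however, seems to scale poorly: performance falls off sharply before the thread count even reaches the number of CPU cores.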

LeeReamond

May 16, 2022 via Android

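Under this kind of load, sys time should be above 50%, so the test is not really meaningful; in production, an unoptimized service would never face this pressure directly. As for synchronization, saying there are no shared resources is just your own assumption. The system does not give each process its own independent epoll management resources. The kernel has to work out that it is this interrupt and not that one, and that this interrupt is bound to this fd and not that one, and structures such as the red-black tree inevitably need synchronization. So however the software is implemented, at the hardware level the northbridge must block for an instant, which makes it understandable that the actual processing capacity falls far behind ipc.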

wslzy007

May 16, 2022

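In fact, if you are after an extremely high-performance Linux server architecture, epoll_wait per thread is only one approach. In practice, using lock-free algorithms to implement multi-epoll handles vs thread-pool seems to work better. You can also make full use of epoll's support for leader mode...
Many years ago I implemented a high-performance server framework for Linux, which is currently used in the SG tool. If you have good hardware, you can run a performance test with SG directly: ./proxy_server -i100000 -o10000 -w24 -x28080 (options: i: maximum inbound connections, o: maximum outbound connections, w: maximum threads, x: port on which to start a simple http-server)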

wslzy007

May 16, 2022

For SG, see: https://github.com/lazy-luo/smarGate

statumer

May 16, 2022

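epoll is positioned as high-performance network IO. If you need high-performance communication between kernel threads and user processes, you should use Linux's UIO framework.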

wslzy007

May 16, 2022

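In my humble opinion, the best way to pursue high performance is to reduce sys time. The system has many spinlocks, and under high concurrency they burn CPU for nothing.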

wslzy007

May 16, 2022

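Kernel threads are indeed optimal, but in practice, as long as you have good algorithms that avoid excessive thread switching, this is OK too.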

pwrliang

May 16, 2022

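@LeeReamond Thanks for the reply.
1. I checked: sys time on each core is around 20.
2. epoll_create creates a separate eventpoll in the kernel. The epoll objects this benchmark creates are isolated, and threads do not share an epfd.
3. The number of epfds and reader threads has not yet reached the core count. Softirqs should be handled by a per-core process called ksoftirqd, so the more cores there are, the higher the irq handling capacity should be.
4. The red-black tree is used to manage multiple fds and is only accessed in epoll_ctl.
5. In this benchmark's implementation, epoll monitors eventfds. An eventfd is a counter in the kernel, so nothing goes through the network. What do you mean by the northbridge?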

pwrliang

May 16, 2022

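@wslzy007 Hi, what I do not understand is why each thread's epoll throughput drops as the number of threads grows.
For example, suppose 1 thread can complete 100k epoll calls per second; then 4 threads should reach 400k (as long as the core count is not exceeded), but that is not what the test shows. It feels as if Linux's epoll implementation has a global lock that caps overall performance.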
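The gap between that expectation and the measurements can be put into numbers. This is a small sketch using the averaged per-thread figures from the opening post; the names `MEASURED` and `scaling_efficiency` are made up for illustration.

```python
# Averaged per-thread ops/sec from the opening post, keyed by reader count.
MEASURED = {1: 525069, 2: 463395, 4: 208573, 8: 150840}


def scaling_efficiency(measured):
    """Per-thread throughput relative to the 1-thread run (1.0 means linear scaling)."""
    base = measured[1]
    return {n: ops / base for n, ops in measured.items()}


for n, eff in scaling_efficiency(MEASURED).items():
    print(f"{n} threads: {eff:.0%} of the single-thread rate per thread")
```

With linear scaling every line would read 100%. Here the per-thread rate is already below half of the single-thread rate at 4 threads.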

wslzy007

May 16, 2022

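@pwrliang You need to consider the overall system load. Interrupts, system calls, protocol stack processing, event callbacks and so on all compete for CPU, so CPU switching overhead is unavoidable. And since the kernel objects behind fd resources are mostly global, lock contention overhead goes up as their number and the concurrency grow. When load testing, sampling with tools such as perf or OProfile will usually show where the bottleneck is.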

zizon

May 16, 2022

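"a single writer thread writes to every fd of every reader thread"

Could this be the problem?
I summed up a few of the runs, and the total ops does increase.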
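The summation is easy to reproduce. This is a sketch using the averaged per-thread figures from the opening post; the names `PER_THREAD` and `aggregate_ops` are made up for illustration.

```python
# Averaged per-thread ops/sec from the opening post, keyed by reader count.
PER_THREAD = {1: 525069, 2: 463395, 4: 208573, 8: 150840}


def aggregate_ops(per_thread):
    """Total ops/sec across all readers: reader count times the per-thread average."""
    return {n: n * ops for n, ops in per_thread.items()}


for n, total in aggregate_ops(PER_THREAD).items():
    print(f"{n} threads: {total} ops/sec in total")
```

The total goes from about 0.53M with one reader to about 1.21M with eight, although the 4-thread run (about 0.83M) dips below the 2-thread run (about 0.93M). So the per-thread numbers fall while the combined rate mostly rises, which is what you would expect if something other than the readers is limiting the rate.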

cnbatch

May 16, 2022

Want to squeeze out the last bit of network performance? Then consider io_uring.

heiher

May 16, 2022

{ dpdk, raw socket packet mmap, xdp } + busyloop

LeeReamond

May 16, 2022 via Android

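@pwrliang You have not understood what I mean by user resources versus management resources. After the kernel allocates fd resources through sys_epoll_create, it also has to create a corresponding eventpoll structure to manage them and register it in the red-black tree, and that process cannot run concurrently across threads without locks. In addition, after a softirq fires, the kernel has to call sproc to copy the events to user space and unbind them in the kernel, which likewise requires synchronization between threads. So however the synchronization code is implemented, the northbridge blocking at the hardware level that synchronization ultimately requires is bound to happen.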

pwrliang

May 17, 2022

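@zizon Case closed, that was indeed the problem. The benchmark that ships with Linux is flawed: a single writer cannot keep this many readers fed, so the readers sit waiting, which causes frequent context switches. [I modified the benchmark](https://github.com/pwrliang/linux/blob/master/tools/perf/bench/epoll-wait.c), and now every thread reaches a throughput of 600k.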
```
./perf bench epoll wait -t 14 -r 2 -w 14 -m
# Running 'epoll/wait' benchmark:
Run summary [PID 5905]: 14 threads monitoring on 64 file-descriptors for 2 secs.

[thread 0] fdmap: 0x1aec590 ... 0x1aec68c [ 648960 ops/sec ]
[thread 1] fdmap: 0x1aec920 ... 0x1aeca1c [ 640540 ops/sec ]
[thread 2] fdmap: 0x1aeccb0 ... 0x1aecdac [ 635712 ops/sec ]
[thread 3] fdmap: 0x1aed040 ... 0x1aed13c [ 650944 ops/sec ]
[thread 4] fdmap: 0x1aed3d0 ... 0x1aed4cc [ 638048 ops/sec ]
[thread 5] fdmap: 0x1aed760 ... 0x1aed85c [ 648064 ops/sec ]
[thread 6] fdmap: 0x1aedaf0 ... 0x1aedbec [ 632416 ops/sec ]
[thread 7] fdmap: 0x1aede80 ... 0x1aedf7c [ 647144 ops/sec ]
[thread 8] fdmap: 0x1aee210 ... 0x1aee30c [ 628896 ops/sec ]
[thread 9] fdmap: 0x1aee5a0 ... 0x1aee69c [ 634521 ops/sec ]
[thread 10] fdmap: 0x1aee930 ... 0x1aeea2c [ 640032 ops/sec ]
[thread 11] fdmap: 0x1aeecc0 ... 0x1aeedbc [ 626093 ops/sec ]
[thread 12] fdmap: 0x1aef050 ... 0x1aef14c [ 645843 ops/sec ]
[thread 13] fdmap: 0x1aef3e0 ... 0x1aef4dc [ 643849 ops/sec ]

Averaged 640075 operations/sec (+- 0.33%), total secs = 2
```
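The `-w 14` run above uses 14 readers, and `-w` presumably sets the number of writers. One plausible way to share the feeding work is to split the readers across the writers round-robin. The sketch below is purely illustrative: `assign_readers` is a hypothetical helper and is not taken from the modified benchmark, whose actual scheme may differ.

```python
def assign_readers(nreaders, nwriters):
    """Split reader indices across writers round-robin (hypothetical scheme)."""
    if nwriters < 1:
        raise ValueError("need at least one writer")
    groups = [[] for _ in range(nwriters)]
    for reader in range(nreaders):
        # Reader i is fed by writer i % nwriters.
        groups[reader % nwriters].append(reader)
    return groups
```

With `assign_readers(14, 14)` every writer feeds exactly one reader, while `assign_readers(4, 1)` reproduces the original setup where one writer has to serve all four.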