由于找不到具体问题在哪,没办法去对应的社区提问,因此我就只能发这里了,希望有划水路过的大佬帮忙看看,能帮忙解决的话就当我请大佬吃个海底捞。
背景:go 语言写的 http 服务,程序偶然出现无响应( Stop The World 那种无响应),进而导致 Tcp Socket 句柄文件不释放,进而服务器文件句柄数疯涨。故障复现方法未知,可能一周左右出现一次,也可能一个月都不出现。线上 2 个机房,均有发生,发生时都是单机,没有大量出现,重启服务后恢复
PS:目前配置了 ping 接口探活+自动摘流重启。大概最近半年就出现了十次左右,都自动重启恢复了,但是本着探索的精神,还是想找到问题根源
环境:centos7.2+golang1.18.3
调试过程:
1 、监控发现服务异常,文件句柄异常,第一反应是代码问题,没有执行 close 函数导致。但是仔细一想,我 tm 这是 go 项目啊,又不是 c 语言项目,go 项目用的 net.http 包启动的 http 服务,又不是我手动管理的 tcp ,哪特么来的 close 啊。
2 、怀疑是项目其他地方有什么 api 调用,没有关闭连接,排查了很久,并没有发现问题。
3 、发现程序出问题这段时间,连日志都不输出了,程序 debug 接口也挂了(和业务接口不是同一个端口,debug 接口本来用于监控程序 gc 、协程数量等信息的)。开始怀疑不是 close 的问题。
4 、在程序主逻辑启动前,加了一段测试代码,代码如下
go func() { for { time.Sleep(1 * time.Minute) fmt.Println("Hello World!") } }() 讲道理,这应该不受任何逻辑影响,始终都该输出吧。加上这段代码,重新发布项目,等了半个月,终于又出问题了。上服务器一看,出问题的时候,这个 HelloWorld 也不输出了。
5 、由于程序完全没响应,任何逻辑都不执行,因此从线上把故障机器摘流,然后使用 dlv 注入进程进行调试。结果如下:
( 1 )、协程总数异常,此时有 5w+个协程,正常情况应该不超过 500 个协程
( 2 )、大量协程被死锁,都是卡在同一个地方 Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s]
( 3 )、查看死锁协程的调用栈,结果与业务代码无关
0 0x000000000043a556 in runtime.gopark at /usr/local/go1.18.3/src/runtime/proc.go:362 1 0x000000000044b0b3 in runtime.goparkunlock at /usr/local/go1.18.3/src/runtime/proc.go:367 2 0x000000000044b0b3 in runtime.semacquire1 at /usr/local/go1.18.3/src/runtime/sema.go:144 3 0x00000000004665a5 in sync.runtime_SemacquireMutex at /usr/local/go1.18.3/src/runtime/sema.go:71 4 0x0000000000473785 in sync.(*Mutex).lockSlow at /usr/local/go1.18.3/src/sync/mutex.go:162 5 0x00000000004740bf in sync.(*Mutex).Lock at /usr/local/go1.18.3/src/sync/mutex.go:81 6 0x00000000004740bf in sync.(*Pool).pinSlow at /usr/local/go1.18.3/src/sync/pool.go:213 7 0x000000000047400e in sync.(*Pool).pin at /usr/local/go1.18.3/src/sync/pool.go:206 8 0x0000000000473d45 in sync.(*Pool).Get at /usr/local/go1.18.3/src/sync/pool.go:128 9 0x0000000000711e29 in net/http.newTextprotoReader at /usr/local/go1.18.3/src/net/http/request.go:994 10 0x0000000000711fd9 in net/http.readRequest at /usr/local/go1.18.3/src/net/http/request.go:1024 11 0x00000000007181ea in net/http.(*conn).readRequest at /usr/local/go1.18.3/src/net/http/server.go:988 12 0x000000000071c58b in net/http.(*conn).serve at /usr/local/go1.18.3/src/net/http/server.go:1891 13 0x0000000000721cce in net/http.(*Server).Serve.func3 at /usr/local/go1.18.3/src/net/http/server.go:3071 14 0x000000000046a641 in runtime.goexit at /usr/local/go1.18.3/src/runtime/asm_amd64.s:1571 6 、到这来我已经不知道该怎么继续排查了,服务器网络 io 、磁盘 io 、cpu 、内存等待均无明显异常。
7 、找运维在网关上面录制了流量,故障前一分钟都没发现异常流量,应该不存在被攻击的问题
8 、哪位大佬能给分析下呢,顺便给小弟讲解讲解,为啥我那个 HelloWorld 日志也不输出了?
下面附上完整的 dlv grs 结果(部分包名、路径打码了):
(dlv) grs Goroutine 1 - User: /home/xxxxxx/golibs/rpcserver/[email protected]/rpcserver/server.go:392 xxxx/golibs/rpcserver/v2/rpcserver.(*Server).waitHttp (0xd0c588) [select 452427h39m57.846100461s] Goroutine 2 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [force gc (idle) 452385h49m20.252083363s] Goroutine 3 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [finalizer wait 452427h39m57.869422736s] Goroutine 4 - User: /home/xxxxxx/devops/[email protected]/client_golang/statsdlib/psh.go:39 xxxxxx/devops/statsd/client_golang/statsdlib.statsdPushLoop (0x87b17e) [chan receive] Goroutine 9 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 10 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 15 - User: /home/xxxxxx/xxxxxx/gobiz/[email protected]/file.go:309 xxxxxx/gobiz/logger.(*fileLog).writeLoop (0x80a545) [select 452427h39m57.846215549s] Goroutine 16 - User: /home/xxxxxx/xxxxxx/rpc/[email protected]/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive] Goroutine 18 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC sweep wait] Goroutine 19 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC scavenge wait] Goroutine 27 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 28 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 29 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 30 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/[email protected]/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.86959434s] Goroutine 31 - User: /home/xxxxxx/xxxxxx/platform-ha/onekey-degrade_sdk_go/[email protected]/degrade/metric.go:96 xxxxxx/platform-ha/onekey-degrade_sdk_go/v3/degrade.(*metricSender).sendMetricFromChan (0xb05e65) [chan receive 452427h39m57.869609575s] Goroutine 34 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.683989625s] Goroutine 35 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684006698s] Goroutine 36 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684023352s] Goroutine 37 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684037091s] Goroutine 38 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684051841s] Goroutine 39 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684066197s] Goroutine 40 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684082165s] Goroutine 41 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684095326s] Goroutine 42 - User: /home/xxxxxx/xxxxxx/xxxx/golibs/[email protected]/metrics.go:280 xxxxxx/xxxx/golibs/metrics.(*Sender).asyncSendLoop.func1 (0x880a5b) [select 452391h19m51.199497773s] Goroutine 43 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68413238s] Goroutine 45 - User: /usr/local/go1.18.3/src/runtime/time.go:194 time.Sleep (0x46742e) [sleep] Goroutine 46 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.8518662s] Goroutine 48 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.863062571s] Goroutine 49 - User: /gaia/workspace-job/xxxxxx/xxxxx/common/handlers/apollo/apollo.go:50 xxxxx/common/handlers/apollo.OnConfigUpdate.func1 (0x884859) [chan receive 452427h39m57.846560172s] Goroutine 67 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 68 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 69 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [GC worker (idle)] Goroutine 72 - User: /home/xxxxxx/github.com/getsentry/[email protected]/transport.go:378 github.com/getsentry/sentry-go.(*HTTPTransport).worker (0x8010ec) [chan receive 452391h19m51.973327187s] Goroutine 73 - User: /home/xxxxxx/xxxxxx/rpc/[email protected]/rpc/common/watcher.go:49 xxxxxx/rpc/rpc-go-spl-v2/rpc/common.(*DirWatcher).run.func1 (0x84a9f9) [chan receive] Goroutine 75 - User: /usr/local/go1.18.3/src/database/sql/sql.go:1226 database/sql.(*DB).connectionOpener (0xa3d38d) [select 452427h39m57.863183594s] Goroutine 84 - User: /usr/local/go1.18.3/src/runtime/sigqueue.go:151 os/signal.signal_recv (0x466bef) (thread 1238) Goroutine 98 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait] Goroutine 99 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait 452427h39m57.846699823s] Goroutine 100 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684348233s] Goroutine 101 - User: /usr/local/go1.18.3/src/net/fd_unix.go:172 net.(*netFD).accept (0x522b15) [IO wait] Goroutine 102 - User: /usr/local/go1.18.3/src/runtime/proc.go:362 runtime.gopark (0x43a556) [select 452427h39m57.846735781s] * Goroutine 19024799 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684384197s] Goroutine 19168258 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684396903s] Goroutine 19913951 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684409102s] Goroutine 19958391 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684421539s] Goroutine 20081541 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684433615s] Goroutine 20116495 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.684446142s] Goroutine 20172239 - User: /usr/local/go1.18.3/src/runtime/sema.go:71 sync.runtime_SemacquireMutex (0x4665a5) [semacquire 452391h19m50.68446185s] ......(后面都是重复的,就省略了) 