Recently I ran into a problem: a nexus3 instance running as a Deployment has its replicas set to 0 after running for a while. HPA is not enabled, restartPolicy is set to Always, the kube-apiserver audit logs show it was not a manual operation, and the nexus3 logs contain no errors.
deployment yaml:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: service-nexus3-deployment
  namespace: service
  annotations:
    deployment.kubernetes.io/revision: '6'
spec:
  replicas: 1
  selector:
    matchLabels:
      app: service-nexus3
      envronment: test
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: service-nexus3
        envronment: test
      annotations:
        kubesphere.io/restartedAt: '2022-02-16T01:11:44.479Z'
    spec:
      volumes:
        - name: service-nexus3-volume
          persistentVolumeClaim:
            claimName: service-nexus3-pvc
        - name: docker-proxy
          configMap:
            name: docker-proxy
            defaultMode: 493
      containers:
        - name: nexus3
          # Alibaba Cloud image registry; repository name removed
          image: 'registry.cn-hangzhou.aliyuncs.com/nexus3-latest'
          ports:
            - name: tcp8081
              containerPort: 8081
              protocol: TCP
          resources:
            limits:
              cpu: '4'
              memory: 8Gi
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: service-nexus3-volume
              mountPath: /data/server/nexus3/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
        - name: docker-proxy
          # Alibaba Cloud image registry; repository name removed
          image: 'registry.cn-hangzhou.aliyuncs.com/nginx-latest'
          ports:
            - name: tcp80
              containerPort: 80
              protocol: TCP
          resources:
            limits:
              cpu: '2'
              memory: 4Gi
            requests:
              cpu: 500m
              memory: 1Gi
          volumeMounts:
            - name: docker-proxy
              mountPath: /usr/local/nginx/conf/vhosts/
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          imagePullPolicy: Always
      restartPolicy: Always
      terminationGracePeriodSeconds: 30
      dnsPolicy: ClusterFirst
      nodeSelector:
        disktype: raid1
      securityContext: {}
      imagePullSecrets:
        - name: registrysecret
      schedulerName: default-scheduler
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  revisionHistoryLimit: 10
  progressDeadlineSeconds: 600
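One extra check that might help narrow down which client keeps writing .spec.replicas (my own suggestion, not something from the original post): on clusters recent enough to populate managedFields, the entry whose fieldsV1 owns f:replicas names the field manager that last set the replica count. With kubectl 1.21+ it can be dumped like this:

# kubectl get deployment service-nexus3-deployment -n service -o yaml --show-managed-fields

If the manager owning f:replicas is not kubectl but some controller, that points at the client doing the scaling.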
HPA:
# kubectl get hpa -A
No resources found
deployment describe:
... ...
Events:
  Type    Reason             Age                From                   Message
  ----    ------             ----               ----                   -------
  Normal  ScalingReplicaSet  34m (x2 over 38h)  deployment-controller  Scaled down replica set service-nexus3-deployment-57995fcd76 to 0
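For what it's worth, the same scale-down should also be visible as namespace events, and sorting them by timestamp can make the sequence easier to read than the describe output; this is a generic suggestion rather than something from the original post:

# kubectl get events -n service --sort-by=.lastTimestamp | grep -E 'service-nexus3-deployment|57995fcd76'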
kube-controller-manager logs:
# kubectl logs kube-controller-manager-k8s-130 -n kube-system|grep nexus
I0509 10:49:11.687356       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"service", Name:"service-nexus3-deployment", UID:"e0c4abba-bbe5-4c19-9853-de63ee571124", APIVersion:"apps/v1", ResourceVersion:"126342143", FieldPath:""}): type: 'Normal' reason: 'ScalingReplicaSet' Scaled down replica set service-nexus3-deployment-57995fcd76 to 0
I0509 10:49:11.701642       1 event.go:281] Event(v1.ObjectReference{Kind:"ReplicaSet", Namespace:"service", Name:"service-nexus3-deployment-57995fcd76", UID:"9f96fdf1-1e20-4c83-ad18-1b3640d52493", APIVersion:"apps/v1", ResourceVersion:"126342151", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: service-nexus3-deployment-57995fcd76-t6bhx
Relevant kube-apiserver audit logs:
nexus3 logs:
This has already happened several times. I've gone through all the logs and still have no leads, so I'm asking here; any suggestions would be appreciated.
1  anonydmer  2022-05-09 16:53:58 +08:00
Check whether the service is unstable, i.e. whether the container keeps failing and restarting.
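Not part of the reply itself, just a quick way to check that; the pod name in the second command is a placeholder, and the Last State section only appears if the container really has restarted:

# kubectl get pods -n service -l app=service-nexus3 -o wide
# kubectl describe pod <nexus3-pod-name> -n service | grep -A 5 'Last State'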
2  rabbitz  OP
Before the replicas went to 0, RESTARTS had always been 0. [screenshot]
3  rabbitz  OP
Sorry, the screenshot above was the wrong one; the one below is the right one. [screenshot]
4  wubowen  2022-05-09 17:38:14 +08:00
I'm a bit doubtful that the content in the audit-log screenshot proves it wasn't a manual operation. Even a manually triggered scale ultimately has the replicaset controller delete the pod, right? Maybe search the audit log directly for operations by kubeconfig users and see whether someone scaled it by hand.
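A rough sketch of the kind of search #4 is suggesting, assuming JSON-format audit logs written to a file; the log path and the jq filter below follow the standard audit event schema but are assumptions about this cluster's audit policy:

# grep 'service-nexus3-deployment' /var/log/kubernetes/audit/audit.log \
    | jq -c 'select(.objectRef.resource == "deployments" and (.verb == "update" or .verb == "patch"))
             | {time: .requestReceivedTimestamp, user: .user.username, verb: .verb, subresource: .objectRef.subresource}'

A kubectl scale or a manual edit of .spec.replicas should show up here as a patch/update on the deployments resource (subresource "scale" in the kubectl scale case), with the responsible username attached.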
5  defunct9  2022-05-09 17:46:28 +08:00
Open SSH and let me take a look.
6  basefas  2022-05-09 17:48:48 +08:00
Monitor this Deployment's replicas, alert when the value changes, then look at the events.
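A minimal sketch of such an alert, assuming Prometheus with kube-state-metrics is already running in the cluster (the group/alert names and the 1m window are made up for illustration):

groups:
  - name: nexus3-replicas
    rules:
      - alert: Nexus3DeploymentScaledToZero
        # kube_deployment_spec_replicas is the desired replica count exported by kube-state-metrics
        expr: kube_deployment_spec_replicas{namespace="service", deployment="service-nexus3-deployment"} == 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "service-nexus3-deployment desired replicas dropped to 0"

Alerting on the desired (spec) replica count rather than the ready count avoids false alarms from ordinary pod restarts.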
7  hwdef  2022-05-09 17:53:01 +08:00
Hello Kitty is not bad, kind of cute.