2018.11.23: Rolled kube-proxy back to 1.11.4 on its own; will observe for a few days. 1.12.2 and 1.13.0 feel unstable: IPVS real servers (rs) are not updated correctly, and sometimes healthy rs entries get deleted, causing traffic anomalies.
Upgrading from 1.11.1 to 1.12.2
Upgrade process
1. Create the new branch with `git checkout -b k8s-v1.12` and `git push`.
2. The masters run a puppet sync; verify the master upgrade, then `git push` again.
3. Wait for the nodes to sync, and verify the node versions with `kubectl get node`.

Dashboard HPA error
```
failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)
```
kube-apiserver logs:
```
E1113 10:26:35.987606 17967 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
E1113 10:28:41.530862 17967 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
```
metrics-server is intermittently unreachable:
```
E1113 15:11:25.457824 25356 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Get https://169.169.197.26:443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1113 15:11:48.635492 25356 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
```
Manual tests with curl show intermittently very slow responses (close to 1 minute).
`kubectl get apiservice v1beta1.metrics.k8s.io -o yaml` shows:

```
status:
  conditions:
  - lastTransitionTime: 2018-11-13T06:29:47Z
    message: 'no response from https://169.169.197.26:443: Get https://169.169.197.26:443:
      net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available
```
Fixed after changing the configuration and redeploying (not certain this is the root cause):

```
 rules:
 - apiGroups:
+  - "/"
+  resources:
+  - pods
+  - nodes
+  - namespaces
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
```
On some nodes the calico-node update failed; the logs showed calico could not connect to typha. Checking with `ipvsadm -ln`, the rs entries for typha had not been updated.
Restarted kube-proxy. Error logs:
```
E1113 11:43:24.447691 8117 graceful_termination.go:89] Try delete rs "169.169.67.83:5473/TCP/10.112.35.111:5473" err: invalid argument
E1113 11:43:24.447811 8117 proxier.go:1496] Failed to list IPVS destinations, error: device or resource busy
E1113 11:43:24.447852 8117 proxier.go:809] Failed to sync endpoint for service: 169.169.135.125:8140/TCP, err: device or resource busy
E1113 11:43:24.448195 8117 proxier.go:1455] Failed to add IPVS service "kube-system/kube-dns:dns-tcp": file exists
E1113 11:43:24.448219 8117 proxier.go:812] Failed to sync service: 169.169.0.2:53/TCP, err: file exists
```
IPVS graceful termination introduced a bug that was fixed in PR 69268, but the fix did not make it into the 1.12.2 binary. Updating the kube-proxy binary to a 1.13-beta build resolves it.
Nodes with no traffic update normally:
```
TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80     Masq  1  0  0
  -> 172.20.10.73:80    Masq  1  0  0
  -> 172.20.31.175:80   Masq  1  0  0
  -> 172.20.37.54:80    Masq  1  0  0
  -> 172.20.38.80:80    Masq  1  0  0
  -> 172.20.41.86:80    Masq  1  0  0
  -> 172.20.49.11:80    Masq  1  0  0
  -> 172.20.60.11:80    Masq  1  0  0
  -> 172.20.61.16:80    Masq  1  0  0
```
Nodes receiving traffic are not updated:
```
TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80     Masq  1  0  1515
  -> 172.20.10.73:80    Masq  1  0  1523
  -> 172.20.16.74:80    Masq  1  0  709
  -> 172.20.21.86:80    Masq  1  0  694
  -> 172.20.31.175:80   Masq  1  0  1521
  -> 172.20.37.54:80    Masq  1  0  1536
  -> 172.20.38.80:80    Masq  1  1  1519
  -> 172.20.39.111:80   Masq  1  0  693
  -> 172.20.40.85:80    Masq  1  0  706
  -> 172.20.41.86:80    Masq  1  0  1517
  -> 172.20.42.129:80   Masq  1  0  686
  -> 172.20.44.101:80   Masq  1  0  690
  -> 172.20.49.11:80    Masq  1  1  1540
  -> 172.20.57.14:80    Masq  1  0  693
  -> 172.20.59.9:80     Masq  1  0  703
  -> 172.20.60.11:80    Masq  1  1  1507
```
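Two such `ipvsadm -ln` captures can be diffed mechanically to spot which rs entries a busy node failed to remove or add. A minimal bash sketch, using trimmed sample lines from the outputs above (only a few entries kept so it stays short):

```shell
#!/bin/bash
# Diff the IPVS real-server sets of one VIP as seen on an idle vs a busy node.
idle_file=$(mktemp)
busy_file=$(mktemp)
cat > "$idle_file" <<'EOF'
TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80     Masq  1  0  0
  -> 172.20.61.16:80    Masq  1  0  0
EOF
cat > "$busy_file" <<'EOF'
TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80     Masq  1  0  1515
  -> 172.20.16.74:80    Masq  1  0  709
EOF
# Keep only the real-server address column, sorted for comm(1).
rs_set() { awk '$1 == "->" { print $2 }' "$1" | sort; }
stale=$(comm -13 <(rs_set "$idle_file") <(rs_set "$busy_file"))    # only on busy: never removed
missing=$(comm -23 <(rs_set "$idle_file") <(rs_set "$busy_file"))  # only on idle: never added
echo "stale on busy node:   $stale"
echo "missing on busy node: $missing"
rm -f "$idle_file" "$busy_file"
```

Run against the full captures, the stale set would contain the eight old pod endpoints and the missing set the new 172.20.61.16 endpoint.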
Logs:
```
E1113 18:46:44.008235 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:47:44.008842 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:48:44.010713 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:49:44.011275 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:49:44.011584 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.31.175:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.31.175:80", can't find the real server
```
It appears that only some nodes fail to update.
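The stuck entries can be pulled straight out of the kube-proxy log by counting how often each rs appears in a failed "Try delete rs" line. A bash sketch over a few of the sample log lines above:

```shell
#!/bin/bash
# Count failed delete attempts per rs in kube-proxy graceful_termination logs;
# an rs that keeps reappearing is stuck and still receiving traffic.
log=$(cat <<'EOF'
E1113 18:46:44.008235 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:47:44.008842 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:49:44.011584 9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.31.175:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.31.175:80", can't find the real server
EOF
)
# Keep just the rs identifier, then tally and sort by retry count.
counts=$(printf '%s\n' "$log" | grep -o 'Try delete rs "[^"]*"' | sort | uniq -c | sort -rn)
echo "$counts"
```

In real use, feed it `journalctl -u kube-proxy` (or the equivalent log path) from each node to see which nodes carry stuck entries.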
Final conclusion: kube-proxy 1.12.2 and 1.13.0-beta.1 in IPVS mode are unstable, which broke the container network and in turn made metrics-server, the dashboard, and calico-node unstable; when calico cannot reach the typha VIP, container routing may also be affected. After rolling back to 1.11.4 everything returned to normal. Zabbix DNS monitoring of 169.169.0.2 on every node can be used to reflect the health of each node's container network.
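That DNS check could be wired into Zabbix roughly as follows. This is only a sketch of a zabbix_agentd UserParameter: the item key `cluster.dns.alive` and the file path are made up, and it assumes `dig` is installed on every node.

```
# /etc/zabbix/zabbix_agentd.d/cluster-dns.conf  (hypothetical path and item key)
# Reports 0 when the kube-dns VIP answers within 2s, 1 otherwise.
UserParameter=cluster.dns.alive,dig +time=2 +tries=1 @169.169.0.2 kubernetes.default.svc.cluster.local >/dev/null 2>&1 && echo 0 || echo 1
```

A trigger on this item staying at 1 then singles out the nodes whose kube-proxy/IPVS state is broken.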
Possibly because metrics-server is slow:
```
I1122 15:28:23.535501 19814 round_trippers.go:386] curl -k -v -XGET -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.12.2 (linux/amd64) kubernetes/17c77c7" 'https://10.0.9.8/apis/metrics.k8s.io/v1beta1/namespaces/public/pods'
I1122 15:29:00.775704 19814 round_trippers.go:405] GET https://10.0.9.8/apis/metrics.k8s.io/v1beta1/namespaces/public/pods 200 OK in 37240 milliseconds
```
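The latency is printed right in the round_trippers line, so with `kubectl -v=8` output saved to a file, slow metrics calls can be flagged mechanically. A bash sketch over the sample line above (the 5-second threshold is an arbitrary choice):

```shell
#!/bin/bash
# Extract the request latency from a kubectl round_trippers verbose log line
# and flag anything slower than 5 seconds.
line='I1122 15:29:00.775704 19814 round_trippers.go:405] GET https://10.0.9.8/apis/metrics.k8s.io/v1beta1/namespaces/public/pods 200 OK in 37240 milliseconds'
ms=$(printf '%s\n' "$line" | grep -oE 'in [0-9]+ milliseconds' | awk '{print $2}')
if [ "$ms" -gt 5000 ]; then verdict="SLOW"; else verdict="ok"; fi
echo "$ms ms -> $verdict"
```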
The exact cause is unknown. Updating the dashboard to 1.10.0 (the old version was 1.8.3) fixed it. The Role changed in 1.10.0:
```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubernetes-dashboard-minimal
  namespace: kube-system
rules:
  # Allow Dashboard to create 'kubernetes-dashboard-key-holder' secret.
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create"]
  # Allow Dashboard to create 'kubernetes-dashboard-settings' config map.
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create"]
```