Upgrading 1.11 to 1.12

2018.11.23: kube-proxy alone has been rolled back to 1.11.4; observing for a few days. 1.12.2 and 1.13.0 feel unstable: rs entries are not updated correctly, and sometimes healthy rs entries get deleted, causing traffic anomalies.

Upgrading from 1.11.1 to 1.12.2

Upgrade procedure

  • On the puppet server, run git checkout -b k8s-v1.12 to create a new branch
  • Copy the new master binaries over the old ones
  • git push, run puppet on the masters to sync, and verify the master upgrade
  • Copy the new node binaries over the old ones
  • git push, wait for the nodes to sync, and verify node versions with kubectl get node (a command sketch follows this list)
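
A minimal sketch of the flow above, assuming the binaries are kept in a puppet-managed git repository (the repository path and commit messages here are hypothetical):

# on the puppet server: create the upgrade branch (repository path is hypothetical)
cd /etc/puppet/modules/kubernetes/files && git checkout -b k8s-v1.12

# overwrite the master binaries first, then push and let puppet sync the masters
cp /tmp/kubernetes/server/bin/{kube-apiserver,kube-controller-manager,kube-scheduler} ./
git commit -am "kubernetes v1.12.2 master binaries" && git push

# verify the control plane before touching the nodes
kubectl version --short
kubectl get componentstatuses

# then the node binaries (kubelet, kube-proxy)
cp /tmp/kubernetes/server/bin/{kubelet,kube-proxy} ./
git commit -am "kubernetes v1.12.2 node binaries" && git push

# after the puppet runs finish, confirm the node versions
kubectl get node -o wide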

Problems

metrics-server errors

The dashboard and HPA report errors:

failed to get memory utilization: unable to get metrics for resource memory: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io)

kube-apiserver logs:

E1113 10:26:35.987606   17967 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again
E1113 10:28:41.530862   17967 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1beta1.metrics.k8s.io": the object has been modified; please apply your changes to the latest version and try again

metrics-server is intermittently unreachable:

E1113 15:11:25.457824   25356 available_controller.go:311] v1beta1.metrics.k8s.io failed with: Get https://169.169.197.26:443: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E1113 15:11:48.635492   25356 memcache.go:134] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

Manual testing with curl shows intermittently very slow responses (close to one minute).
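
For reference, the slow responses can be reproduced with something along these lines (169.169.197.26 is the metrics-server service IP from the apiserver errors above):

# through the apiserver aggregation layer
time kubectl get --raw /apis/metrics.k8s.io/v1beta1/nodes

# directly against the metrics-server service IP, roughly what the apiserver's discovery check does
time curl -sk https://169.169.197.26:443/apis/metrics.k8s.io/v1beta1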

kubectl get apiservice v1beta1.metrics.k8s.io -o yaml shows:

status:
  conditions:
  - lastTransitionTime: 2018-11-13T06:29:47Z
    message: 'no response from https://169.169.197.26:443: Get https://169.169.197.26:443:
      net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
    reason: FailedDiscoveryCheck
    status: "False"
    type: Available

Resolved after changing the configuration and redeploying (not sure this is the root cause):

 rules:
 - apiGroups:
+  - ""
+  resources:
+  - pods
+  - nodes
+  - namespaces
+  verbs:
+  - get
+  - list
+  - watch
+- apiGroups:
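
A sketch of the redeploy and the follow-up check, assuming the manifests live in a local metrics-server/ directory and the pods carry the upstream k8s-app=metrics-server label:

# apply the updated RBAC rules and recreate the metrics-server pod
kubectl apply -f metrics-server/
kubectl -n kube-system delete pod -l k8s-app=metrics-server

# the APIService should go back to Available=True once the discovery check passes
kubectl get apiservice v1beta1.metrics.k8s.io -o yaml | grep -A 3 conditions
kubectl top node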

IPVS rs not updated

calico-node failed to update on some of the nodes; its logs showed that calico could not connect to typha, and ipvsadm -ln showed that the rs entries for typha had not been updated.

Restarting kube-proxy
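
To check and recover a node in this state, roughly the following was used (169.169.67.83:5473 is the typha service seen in the error logs below; the systemd unit name is an assumption):

# list the rs entries behind the typha service
ipvsadm -Ln | grep -A 5 169.169.67.83

# restart kube-proxy so it rebuilds the ipvs rules, then watch its logs
systemctl restart kube-proxy
journalctl -u kube-proxy -f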

Error logs:

E1113 11:43:24.447691    8117 graceful_termination.go:89] Try delete rs "169.169.67.83:5473/TCP/10.112.35.111:5473" err: invalid argument
E1113 11:43:24.447811    8117 proxier.go:1496] Failed to list IPVS destinations, error: device or resource busy
E1113 11:43:24.447852    8117 proxier.go:809] Failed to sync endpoint for service: 169.169.135.125:8140/TCP, err: device or resource busy
E1113 11:43:24.448195    8117 proxier.go:1455] Failed to add IPVS service "kube-system/kube-dns:dns-tcp": file exists
E1113 11:43:24.448219    8117 proxier.go:812] Failed to sync service: 169.169.0.2:53/TCP, err: file exists

rs not updated after scaling down

The IPVS graceful-termination logic introduced a bug that was fixed by pull request 69268; the fix is not in the 1.12.2 binaries. Updating the kube-proxy binary to a 1.13 beta build picks it up.
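
To spot affected nodes, the rs list programmed on a node can be compared with the service's current endpoints, e.g.:

# endpoints the service should currently have (namespace/name are placeholders)
kubectl -n <namespace> get endpoints <service> -o wide

# rs entries actually programmed on this node; 169.169.218.191 is the service IP in the outputs below
ipvsadm -Ln -t 169.169.218.191:80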

Nodes with no traffic are updated correctly:

TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80               Masq    1      0          0         
  -> 172.20.10.73:80              Masq    1      0          0         
  -> 172.20.31.175:80             Masq    1      0          0         
  -> 172.20.37.54:80              Masq    1      0          0         
  -> 172.20.38.80:80              Masq    1      0          0         
  -> 172.20.41.86:80              Masq    1      0          0         
  -> 172.20.49.11:80              Masq    1      0          0         
  -> 172.20.60.11:80              Masq    1      0          0         
  -> 172.20.61.16:80              Masq    1      0          0

Nodes receiving traffic are not updated:

TCP  169.169.218.191:80 rr
  -> 172.20.2.81:80               Masq    1      0          1515      
  -> 172.20.10.73:80              Masq    1      0          1523      
  -> 172.20.16.74:80              Masq    1      0          709       
  -> 172.20.21.86:80              Masq    1      0          694       
  -> 172.20.31.175:80             Masq    1      0          1521      
  -> 172.20.37.54:80              Masq    1      0          1536      
  -> 172.20.38.80:80              Masq    1      1          1519      
  -> 172.20.39.111:80             Masq    1      0          693       
  -> 172.20.40.85:80              Masq    1      0          706       
  -> 172.20.41.86:80              Masq    1      0          1517      
  -> 172.20.42.129:80             Masq    1      0          686       
  -> 172.20.44.101:80             Masq    1      0          690       
  -> 172.20.49.11:80              Masq    1      1          1540      
  -> 172.20.57.14:80              Masq    1      0          693       
  -> 172.20.59.9:80               Masq    1      0          703       
  -> 172.20.60.11:80              Masq    1      1          1507 

Logs:

E1113 18:46:44.008235    9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:47:44.008842    9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:48:44.010713    9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:49:44.011275    9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.42.128:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.42.128:80", can't find the real server
E1113 18:49:44.011584    9987 graceful_termination.go:89] Try delete rs "169.169.218.191:80/TCP/172.20.31.175:80" err: Failed to delete rs "169.169.218.191:80/TCP/172.20.31.175:80", can't find the real server

It appears that only some of the nodes fail to update.

Dashboard responds slowly

The final conclusion is that kube-proxy 1.12.2 and 1.13.0-beta.1 are unstable in IPVS mode. This affected the container network and made metrics-server, dashboard and calico-node all unstable; when calico cannot reach the typha VIP, pod routing may also be affected. After rolling back to 1.11.4 everything returned to normal. Zabbix DNS monitoring of 169.169.0.2 on each node can be used to reflect the health of the container network on that node.
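
A minimal sketch of such a per-node check, assuming dig is installed, the cluster DNS service IP is 169.169.0.2 as above, and the cluster domain is the default cluster.local (the Zabbix UserParameter wiring is omitted):

#!/bin/bash
# Print 1 if this node can resolve a cluster-internal name through kube-dns, else 0.
# Meant to be hooked up as a Zabbix UserParameter / custom item.
if dig +time=2 +tries=1 +short @169.169.0.2 kubernetes.default.svc.cluster.local >/dev/null 2>&1; then
  echo 1
else
  echo 0
fi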

Possibly because metrics-server is slow:

I1122 15:28:23.535501   19814 round_trippers.go:386] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: kubectl/v1.12.2 (linux/amd64) kubernetes/17c77c7" 'https://10.0.9.8/apis/metrics.k8s.io/v1beta1/namespaces/public/pods'
I1122 15:29:00.775704   19814 round_trippers.go:405] GET https://10.0.9.8/apis/metrics.k8s.io/v1beta1/namespaces/public/pods 200 OK in 37240 milliseconds
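
The two lines above are kubectl's verbose HTTP logging; the same timing can be reproduced with something like:

# -v=8 prints the round_trippers request and the response time in milliseconds
kubectl -v=8 top pod -n public 2>&1 | grep -E 'round_trippers|milliseconds'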

The exact cause is unknown. After upgrading the dashboard to 1.10.0 (the old version was 1.8.3) the problem went away. The Role changed in 1.10.0:

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: kubernetes-dashboard-minimal
  namespace: kube-system
rules:
  # Allow Dashboard to create 'kubernetes-dashboard-key-holder' secret.
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["create"]
  # Allow Dashboard to create 'kubernetes-dashboard-settings' config map.
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["create"]