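Fault description
We ran yum update -y directly to upgrade to CentOS 7.6.1810 and its new kernel 3.10.0-957, and rolled the upgrade out in bulk with puppet. Because yum update upgrades every package, docker-ce was also upgraded to 19.03. As a result the containers went down, container networking stopped working, and service was interrupted.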
Solution
Uninstall docker first, then run yum update, and then reinstall docker-ce with an explicitly pinned version:
# yum install -y docker-ce-18.06.3.ce
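Before touching a host, it can help to check whether it already runs the pinned release. The sketch below is one possible way to do that. The helper name docker_is_pinned and the sample version string are assumptions for illustration. Note that the package version is written 18.06.3.ce, while docker itself typically reports 18.06.3-ce, so the sketch normalizes the separator before comparing.

```shell
#!/bin/bash
# Pinned package version, as used in the install command above.
PINNED="18.06.3.ce"

# docker_is_pinned VERSION_OUTPUT [PINNED]   (hypothetical helper)
# Returns 0 when the given `docker --version` output refers to the pinned release.
docker_is_pinned() {
    local output="$1" pinned="${2:-$PINNED}"
    local reported
    # Take the third field, e.g. "18.06.3-ce," and strip the trailing comma.
    reported=$(echo "$output" | awk '{print $3}' | tr -d ',')
    # Normalize "-ce" to ".ce" so it matches the RPM version string.
    reported=${reported//-/.}
    [ "$reported" = "$pinned" ]
}

# Example with a sample string (an assumption, not captured from a real host).
if docker_is_pinned "Docker version 19.03.1, build 74b1e89"; then
    echo "docker already at $PINNED"
else
    echo "docker must be reinstalled at $PINNED"
fi
```

On a real host the first argument would be the output of docker --version, for example collected over ssh.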
Script
#!/bin/bash
[ $# -lt 1 ] && echo "need ip" && exit 1

KERNEL=957
DOCKER="18.06.3.ce"
SSH="ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no"

# Print the STATUS column of the node, e.g. "Ready,SchedulingDisabled"
function nodeStat() {
    kubectl get node "$1" |tail -n1 |awk '{print $2}'
}
function _info() {
    echo -e "\033[32m[INFO] $1\033[0m"
}
function _warn() {
    echo -e "\033[31m[WARN] $1\033[0m"
}
# Trigger a single puppet run on the remote host
function _puppet() {
    $SSH "$1" "puppet agent --config /etc/puppet/puppet.conf --onetime --verbose --no-daemonize"
}

kubectl drain "$1" --ignore-daemonsets --delete-local-data
[ $? -ne 0 ] && _warn "kubectl drain error. exit.." && exit 1

_puppet "$1"
kernel=`$SSH "$1" "uname -r"`
dockerversion=`$SSH "$1" "docker --version"`

# Expect the 3.10.0-957 kernel
echo $kernel |grep "$KERNEL" && kernelcheck=0 || kernelcheck=1
echo $dockerversion |grep "$DOCKER" && dockercheck=0 || dockercheck=1
ready=`nodeStat "$1"`

_info "Kernel: $kernel"
_info "Docker Version: $dockerversion"
_info "nodeStat: $ready"

if [ $dockercheck -eq 1 -a "$ready"x == "Ready,SchedulingDisabled"x ];then
    $SSH "$1" "systemctl stop kubelet"
    # docker_all.sh must be prepared in advance; its content is similar to
    # docker stop $(docker ps -aq).
    # Remove the old containers so that old and newly created containers
    # do not start together and cause problems.
    $SSH "$1" "/usr/local/bin/docker_all.sh stop"
    $SSH "$1" "/usr/local/bin/docker_all.sh rm"
    $SSH "$1" "systemctl stop docker"
    $SSH "$1" "yum remove -y docker-*"
    $SSH "$1" "rm -fr /puppet/docker"
    $SSH "$1" "yum update -y"
    $SSH "$1" "yum install -y docker-ce-$DOCKER"
else
    _warn "docker version check pass. Or node Stat error. Will not update docker"
fi

if [ $kernelcheck -eq 0 ];then
    _puppet "$1"
else
    ready=`nodeStat "$1" |cut -f2 -d','`
    if [ "$ready"x == "SchedulingDisabled"x ];then
        _warn "node Will reboot"
        $SSH "$1" "reboot"
        sleep 2
    else
        _warn "node Stat is $ready. Will not reboot"
    fi
fi

second=0
while true;do
    # Count the tunl0 routes on the node; an empty result means ssh failed
    stat=`$SSH "$1" "route -n |grep tunl0 |wc -l"`
    [ "$stat"x == ""x ] && _warn "Wait node boot..." && sleep 5 && continue
    ready=`nodeStat "$1"`
    if [ $stat -gt 2 -a "$ready"x == "Ready,SchedulingDisabled"x ];then
        _info "Wait 30s to uncordon node"
        sleep 30
        kubectl uncordon "$1"
        break
    else
        _warn "$1 Not Ready. stat=$stat ready=$ready ... Wait ${second}s"
        sleep 5
        ((second+=5))
    fi
done
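The wait loop at the end of the script never gives up if the node does not come back. A bounded variant might look like the sketch below. The function wait_for and its timeout and interval arguments are hypothetical names, not part of the original script.

```shell
#!/bin/bash
# wait_for TIMEOUT INTERVAL COMMAND [ARGS...]   (hypothetical helper)
# Re-runs COMMAND every INTERVAL seconds until it succeeds, or until
# TIMEOUT seconds have been spent waiting.
# Returns 0 on success and 1 on timeout.
wait_for() {
    local timeout="$1" interval="$2" waited=0
    shift 2
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example: a check that never succeeds gives up after about 2 seconds.
if wait_for 2 1 false; then
    echo "node ready"
else
    echo "timed out waiting for node"
fi
```

In the script, COMMAND could be a function that wraps the tunl0 route check and the nodeStat comparison, so that a node which never returns produces a warning instead of blocking the batch forever.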
About --delete-local-data
The only thing that --delete-local-data affects is emptyDir volumes: https://github.com/kubernetes/kubernetes/blob/v1.10.2/pkg/kubectl/cmd/drain.go#L413
The flag doesn't seem very useful if it aborts the entire node drain because of a single pod with emptyDir volume at /tmp... I'd imagine that it might make sense to leave the pods with emptyDir volumes terminated but still scheduled on the same node for the duration of the reboot, but that isn't an option either.
See: https://github.com/kontena/pharos-host-upgrades/issues/26
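Problem description
Commands such as
grub2-set-default 0; grub2-mkconfig -o /boot/grub2/grub.cfg
had no effect, and the boot menu was not displayed at startup.
Editing the boot entry and replacing the old kernel file name with the new kernel file name allowed the system to boot, but after a reboot the old kernel was used again. It looked as if grub2 was not using the configuration in /boot/grub2/grub.cfg.
After running
grub2-install --boot-directory=/boot /dev/sda
the system booted normally again.
Reference: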
On hosts with multiple NICs, the interfaces may be detected in the wrong order, so the original eth0 may become eth1. The solution is
See: