Upgrading the Kernel

Incident description

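We ran yum update -y directly to upgrade to CentOS 7.6.1810 and its new kernel 3.10.0-957, and rolled the upgrade out in bulk with puppet. Because yum update upgrades every package, docker-ce was also upgraded to 19.03. Containers died, container networking broke, and services were interrupted.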

Solution

Uninstall docker first, run yum update, and then reinstall docker-ce with the version pinned explicitly:

snippet.bash
# yum install -y docker-ce-18.06.3.ce
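
The order of operations can be sketched as a dry run that only prints the commands, plus a check to run after the reinstall. The function names are hypothetical. The check assumes that `docker --version` reports this release as "18.06.3-ce" while the RPM version is written "18.06.3.ce", so it accepts both separators.

```shell
# Pinned Docker package version taken from this page.
DOCKER_VERSION="18.06.3.ce"

# Print the upgrade steps in order, without executing anything.
upgrade_plan() {
    local version="$1"
    echo "yum remove -y 'docker-*'"
    echo "yum update -y"
    echo "yum install -y docker-ce-${version}"
}

# Return 0 when the `docker --version` output in $2 reports the pinned
# release $1. Assumption: the binary prints "-ce" where the RPM uses ".ce".
docker_matches_pin() {
    local base="${1%.ce}"
    case "$2" in
        *"version ${base}-ce"*|*"version ${base}.ce"*) return 0 ;;
        *) return 1 ;;
    esac
}

upgrade_plan "$DOCKER_VERSION"
```

After the reinstall, something like `docker_matches_pin "$DOCKER_VERSION" "$(docker --version)"` could confirm that the node really ended up on the pinned release.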

Script

snippet.bash
#!/bin/bash
 
[ $# -lt 1 ] && echo "need ip" && exit 1
 
KERNEL=957
DOCKER="18.06.3.ce"
SSH="ssh -o ConnectTimeout=1 -o StrictHostKeyChecking=no" 
 
function nodeStat() {
	kubectl get node $1 |tail -n1 |awk '{print $2}'	
}
 
function _info() {
	echo -e "\033[32m[INFO] $1\033[0m"
}
 
function _warn() {
	echo -e "\033[31m[WARN] $1\033[0m"
}
 
function _puppet() {
	$SSH "$1" "puppet agent --config /etc/puppet/puppet.conf --onetime --verbose --no-daemonize"
}
 
 
kubectl drain $1 --ignore-daemonsets --delete-local-data
 
[ $? -ne 0 ] && _warn "kubectl drain error. exit.." && exit 1
 
_puppet "$1"
 
 
kernel=`$SSH "$1" "uname -r"`
dockerversion=`$SSH "$1" "docker --version"`
 
# Expect the 3.10.0-957 kernel
echo $kernel |grep "$KERNEL" && kernelcheck=0 || kernelcheck=1
echo $dockerversion |grep "$DOCKER" && dockercheck=0 || dockercheck=1
 
ready=`nodeStat $1`
 
_info "Kernel: $kernel"
_info "Docker Version: $dockerversion"
_info "nodeStat: $ready"
 
if [ $dockercheck -eq 1 -a "$ready"x == "Ready,SchedulingDisabled"x ];then
	$SSH "$1" "systemctl stop kubelet"
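	# This script must be prepared in advance; its content is roughly: docker stop $(docker ps -aq)
	# Clean up the old containers so that they do not start alongside newly created ones and cause problems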
	$SSH "$1" "/usr/local/bin/docker_all.sh stop"
	$SSH "$1" "/usr/local/bin/docker_all.sh rm"
	$SSH "$1" "systemctl stop docker"
	$SSH "$1" "yum remove -y 'docker-*'"
	$SSH "$1" "rm -fr /puppet/docker"
	$SSH "$1" "yum update -y"
	$SSH "$1" "yum install -y docker-ce-$DOCKER"
else
	_warn "Docker version already matches, or node is not Ready,SchedulingDisabled. Will not update docker"
fi
 
if [ $kernelcheck -eq 0 ];then
	_puppet "$1"
else
	ready=`nodeStat $1 |cut -f2 -d','`
	if [ "$ready"x == "SchedulingDisabled"x ];then
		_warn "node Will reboot"
		$SSH "$1" "reboot"
		sleep 2
	else
		_warn "node Stat is $ready. Will not reboot"
	fi
fi
 
second=0
while true;do
	stat=`$SSH "$1" "route -n |grep tunl0 |wc -l"`
	[ "$stat"x == ""x ] && _warn "Wait node boot..." && sleep 5 && continue
	ready=`nodeStat $1`
	if [ $stat -gt 2 -a "$ready"x == "Ready,SchedulingDisabled"x ];then
		_info "Wait 30s to uncordon node"
		sleep 30
		kubectl uncordon $1
		break
	else
		_warn "$1 Not Ready. stat=$stat ready=$ready ... Wait ${second}s"
		sleep 5
		((second+=5))
	fi
done
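
The final loop in the script has no upper bound, so it keeps polling forever if the node never comes back. Below is a minimal sketch of a bounded wait, together with the status parsing the script relies on. The names node_status, is_cordoned and wait_until are hypothetical helpers, not part of the original script, and the sample kubectl output in the comments is made up for illustration.

```shell
# Read `kubectl get node <name>` output on stdin and print the STATUS column,
# e.g. "Ready,SchedulingDisabled" for a drained node.
node_status() {
    tail -n1 | awk '{print $2}'
}

# Return 0 when a STATUS value contains the SchedulingDisabled condition.
is_cordoned() {
    case ",$1," in
        *,SchedulingDisabled,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run a command every INTERVAL seconds until it succeeds.
# Give up with status 1 once TIMEOUT seconds have been spent sleeping.
wait_until() {
    local timeout="$1" interval="$2" waited=0
    shift 2
    until "$@"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (hypothetical node name), left commented out so nothing runs here:
# node_ready() { [ "$(kubectl get node node1 | node_status)" = "Ready,SchedulingDisabled" ]; }
# wait_until 600 5 node_ready || echo "node1 did not come back within 10 minutes"
```

With a timeout the script can fail loudly and leave the node cordoned for manual inspection, rather than hanging.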

About --delete-local-data

The only thing that --delete-local-data affects is emptyDir volumes: https://github.com/kubernetes/kubernetes/blob/v1.10.2/pkg/kubectl/cmd/drain.go#L413
The flag doesn't seem very useful if it aborts the entire node drain because of a single pod with emptyDir volume at /tmp... I'd imagine that it might make sense to leave the pods with emptyDir volumes terminated but still scheduled on the same node for the duration of the reboot, but that isn't an option either.

See: https://github.com/kontena/pharos-host-upgrades/issues/26
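
Since the flag only matters for pods with emptyDir volumes, it can help to list those pods before draining. The sketch below filters `kubectl get pods` output in custom-columns format. The helper name is hypothetical, and the filter assumes that kubectl prints `<none>` in a column when the field is absent.

```shell
# Read rows of "NAMESPACE NAME EMPTYDIR" on stdin (header on the first line)
# and print "namespace/name" for every pod whose EMPTYDIR column is set.
# Assumption: kubectl prints "<none>" when a pod has no emptyDir volume.
pods_with_emptydir() {
    awk 'NR > 1 && $3 != "<none>" {print $1 "/" $2}'
}

# Example (NODE is a placeholder), left commented out so nothing runs here:
# kubectl get pods --all-namespaces --field-selector spec.nodeName=NODE \
#   -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,EMPTYDIR:.spec.volumes[*].emptyDir' \
#   | pods_with_emptydir
```

Any pod printed here would lose its emptyDir contents when drained with --delete-local-data.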

New kernel does not take effect

Problem description

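  1. After upgrading the kernel and rebooting, the node still booted the old kernel. Attempts such as grub2-set-default 0; grub2-mkconfig -o /boot/grub2/grub.cfg had no effect, and no boot menu was shown at startup.
  2. After deleting the old kernel outright, the node failed to boot. The boot menu now appeared, but it still listed the old kernel and not the new one. Pressing e to edit the boot entry and replacing the old kernel file name with the new one allowed the system to start. The next reboot still went for the old kernel, so grub2 did not appear to be reading /boot/grub2/grub.cfg.
  3. Running grub2-install --boot-directory=/boot /dev/sda restored normal behavior.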
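
To spot this situation without going through another reboot cycle, the running kernel can be compared with the default boot entry. This is a minimal sketch with hypothetical helper names. It assumes grubby is installed, as it normally is on CentOS 7, and that it prints the default kernel as a /boot/vmlinuz-<release> path.

```shell
# Strip the "/boot/vmlinuz-" prefix from a kernel path,
# leaving a release string comparable with `uname -r`.
kernel_from_path() {
    echo "${1#/boot/vmlinuz-}"
}

# Return 0 when the running release ($1) differs from the default entry ($2).
kernel_mismatch() {
    [ "$1" != "$(kernel_from_path "$2")" ]
}

# Only query the system when grubby is available.
if command -v grubby >/dev/null 2>&1; then
    running=$(uname -r)
    default=$(grubby --default-kernel 2>/dev/null)
    if [ -n "$default" ] && kernel_mismatch "$running" "$default"; then
        echo "running $running, default entry is $default"
    fi
fi
```

If the default entry already points at the new kernel but the node keeps booting the old one, that matches the symptom above: the bootloader is not using the regenerated configuration.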

NIC detection order problem

On hosts with multiple NICs the detection order can change after the upgrade, so the former eth0 may become eth1. Attempted fixes:

  • Pin the MAC address in ifcfg-xxx (did not work).
  • Swap the configuration files (worked).
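
Before swapping configuration files it is worth confirming which name each physical NIC actually received. The sketch below reads MAC addresses from sysfs. The helper names are hypothetical, and the sysfs root is a parameter only so the functions can be tried against a test directory.

```shell
# Print "<mac> <ifname>" for every interface under the given sysfs root.
list_macs() {
    local root="${1:-/sys/class/net}" dev
    for dev in "$root"/*; do
        [ -r "$dev/address" ] || continue
        echo "$(cat "$dev/address") ${dev##*/}"
    done
}

# Print the interface name that currently owns the MAC address in $1.
iface_for_mac() {
    list_macs "$2" | awk -v mac="$1" 'tolower($1) == tolower(mac) {print $2}'
}

# Show the current mapping on this host.
list_macs
```

Comparing this output before and after the reboot shows whether eth0 and eth1 have traded places.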

02-工程实践/kubernetes/update_kernel.txt · Last modified: 2020/04/09 00:42 by 54.36.150.129