
Issue log

Large numbers of OSDs going down

The logs show that, because of network instability, heartbeat checks received no reply:

2019-10-14 23:03:15.241 7f0edf09d700 -1 osd.0 10553 heartbeat_check: no reply from 10.110.94.50:6884 osd.38 ever on either front or back, first ping sent 2019-10-14 23:02:51.542173 (oldest deadline 2019-10-14 23:03:11.542173)
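When many OSDs drop at once, it can help to tally which peers the heartbeat_check messages name. The following is a minimal sketch, assuming the log lines have the same shape as the one above; the function name `heartbeat_peers` is hypothetical, and the log path in the usage comment is the usual default location, which may differ on your hosts.

```bash
# Count "no reply" heartbeat reports per peer OSD, reading log lines from stdin.
heartbeat_peers() {
	grep 'heartbeat_check: no reply from' \
		| sed -E 's/.*no reply from [^ ]+ (osd\.[0-9]+) .*/\1/' \
		| sort | uniq -c | sort -rn
}

# Usage (assumed default log location):
#   cat /var/log/ceph/ceph-osd.*.log | heartbeat_peers
```

A peer that shows up in the reports of many different OSDs points at that peer's host or network segment rather than at the reporting OSDs.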

Solution:

The goal is for the OSDs to start again automatically once the network recovers.

Meaning of systemctl status exit codes

Value  Description in LSB                                  Use in systemd
0      "program is running or service is OK"               unit is active
1      "program is dead and /var/run pid file exists"      unit not failed (used by is-failed)
2      "program is dead and /var/lock lock file exists"    unused
3      "program is not running"                            unit is not active
4      "program or service status is unknown"              no such unit
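The table can be turned into a small lookup helper for log messages. This is only an illustrative sketch: the function name `status_meaning` is hypothetical, and `ceph-osd@0` in the usage comment is just an example unit.

```bash
# Map a `systemctl status` exit code to its systemd meaning from the table.
status_meaning() {
	case "$1" in
		0) echo "unit is active" ;;
		1) echo "unit not failed (used by is-failed)" ;;
		2) echo "unused" ;;
		3) echo "unit is not active" ;;
		4) echo "no such unit" ;;
		*) echo "unexpected exit code $1" ;;
	esac
}

# Usage (example unit name):
#   systemctl status ceph-osd@0.service >/dev/null 2>&1
#   status_meaning $?
```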

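Running ls /var/lib/ceph/osd shows the IDs of the OSDs on this host. The following script checks the running state of each OSD and starts any that are not active; schedule it as a cron job that runs every 2 minutes.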

snippet.bash
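#!/bin/bash
# Check every local ceph-osd unit and start the ones that are not running.

DT=$(date '+%Y%m%d %H:%M:%S')
# Date part only, used in the log file name.
DAY=${DT%% *}
# OSD data directories are named ceph-<id>; strip the prefix to get the IDs.
ALL_OSD=$(ls /var/lib/ceph/osd | sed 's/ceph-//g')

check_osd() {
	for osd in $ALL_OSD; do
		systemctl status "ceph-osd@$osd.service" &>/dev/null
		stat=$?
		if [ "$stat" -eq 0 ]; then
			echo "$DT ceph-osd@$osd  ok"
		elif [ "$stat" -eq 3 ]; then
			# Exit code 3: the unit is not active, so try to start it.
			systemctl start "ceph-osd@$osd.service"
			# The status output goes to the log; $? below is its exit code.
			systemctl status "ceph-osd@$osd.service"
			echo "$DT ceph-osd@$osd  down try restart .. $?"
		else
			echo "$DT ceph-osd@$osd  unknown (exit code $stat)"
		fi
	done
}

check_osd >> "/tmp/check_osd-$DAY.log"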
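The 2-minute schedule can be expressed as a cron entry. The sketch below only prints the entry; the script path `/usr/local/bin/check_osd.sh` and the helper name `cron_entry` are assumptions, since the notes do not say where the script is installed.

```bash
# Hypothetical install path for the check script; adjust as needed.
CHECK_SCRIPT=/usr/local/bin/check_osd.sh

# Print an /etc/cron.d style entry (with a user field) that runs every 2 minutes.
cron_entry() {
	printf '%s\n' "*/2 * * * * root $CHECK_SCRIPT"
}

cron_entry
```

As root, the output could be redirected into a file under /etc/cron.d (for example a file named check_osd, also a hypothetical name). The script needs root privileges because it calls systemctl start.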

References:

missing required protocol features

snippet.bash
# dmesg -T |grep -i ceph
libceph: mon0 10.11.2.35:6789 missing required protocol features
 
Mapping manually:
# rbd map rbd/kubernetes-dynamic-pvc-7f783e8c-c00c-11e9-9838-5cb901987210
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (5) Input/output error

Kernel version

snippet.bash
# uname -r
3.10.0-327.18.2.el7.x86_64

Another node running kernel 3.10.0-957.5.1.el7.x86_64 does not have this problem, so the kernel on all k8s nodes must be upgraded.
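To find nodes that still need the upgrade, the running kernel can be compared against a known-good version. This is a minimal sketch: the helper name `kernel_at_least` is hypothetical, it relies on GNU `sort -V`, and the threshold 3.10.0-957 is simply the version observed working above, not a verified minimum.

```bash
# Succeed if kernel version $1 is greater than or equal to $2 (version sort).
kernel_at_least() {
	current=$1
	required=$2
	lowest=$(printf '%s\n%s\n' "$required" "$current" | sort -V | head -n1)
	[ "$lowest" = "$required" ]
}

# Check the local node against the version that was observed to work.
if kernel_at_least "$(uname -r)" 3.10.0-957; then
	echo "kernel $(uname -r): at or above the known-good version"
else
	echo "kernel $(uname -r): older than the known-good version, upgrade needed"
fi
```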

Testing disk read/write with dd on Alpine

The busybox version of dd does not print the transfer speed, so coreutils must be installed.
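A write test might look like the sketch below, run after installing GNU dd with `apk add coreutils`. The helper name `dd_write_test`, the target file and the size are assumptions chosen for illustration; `conv=fsync` makes dd flush the data to disk before it reports, so the figure is not just the page-cache speed.

```bash
# Write count_mb MiB of zeros to the given file and flush it to disk.
# GNU dd prints the elapsed time and throughput on stderr when it finishes.
dd_write_test() {
	file=$1
	count_mb=$2
	dd if=/dev/zero of="$file" bs=1M count="$count_mb" conv=fsync
}

# Usage (hypothetical target path on the volume under test, 1 GiB):
#   dd_write_test /data/ddtest.img 1024
#   rm -f /data/ddtest.img
```

A read test with `dd if=<file> of=/dev/null bs=1M` is only meaningful if the file is not already in the page cache.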

02-工程实践/存储/ceph/errors.txt · Last modified: 2020/04/07 06:34 by annhe