The log shows that, because of network fluctuation, the heartbeat check received no reply:
2019-10-14 23:03:15.241 7f0edf09d700 -1 osd.0 10553 heartbeat_check: no reply from 10.110.94.50:6884 osd.38 ever on either front or back, first ping sent 2019-10-14 23:02:51.542173 (oldest deadline 2019-10-14 23:03:11.542173)
Solution:
The OSD should start again automatically once the network recovers.
Meaning of the systemctl status return codes:
Value | Description in LSB | Use in systemd |
---|---|---|
0 | “program is running or service is OK” | unit is active |
1 | “program is dead and /var/run pid file exists” | unit not failed (used by is-failed) |
2 | “program is dead and /var/lock lock file exists” | unused |
3 | “program is not running” | unit is not active |
4 | “program or service status is unknown” | no such unit |
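As a small illustration of the table above, the exit code of systemctl status can be turned into a readable message. This is only a sketch: the function name describe_status and the OSD id 0 are placeholders chosen for the example, not part of the original setup.

```shell
#!/bin/bash
# Map a systemctl status exit code to its systemd meaning (per the table above).
describe_status() {
    case "$1" in
        0) echo "unit is active" ;;
        1) echo "unit not failed (used by is-failed)" ;;
        2) echo "unused" ;;
        3) echo "unit is not active" ;;
        4) echo "no such unit" ;;
        *) echo "unexpected code $1" ;;
    esac
}

# Usage: OSD id 0 is only an example; capture the exit code without aborting on failure.
rc=0
systemctl status ceph-osd@0.service &>/dev/null || rc=$?
describe_status "$rc"
```

Only codes 0 and 3 matter for the restart logic: 0 means nothing to do, 3 means the unit can be started.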
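Running ls /var/lib/ceph/osd lists the OSDs on this host (directories named ceph-<id>). The following script checks the running state of each OSD; a scheduled task (cron) runs it every 2 minutes.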
#!/bin/bash
# Check every local ceph-osd unit and start any that are not active.
DT=$(date '+%Y%m%d %H:%M:%S')
ALL_OSD=$(ls /var/lib/ceph/osd | sed 's/ceph-//g')
DAY=${DT%% *}

check_osd() {
    for osd in $ALL_OSD; do
        systemctl status "ceph-osd@$osd.service" &>/dev/null
        stat=$?
        if [ "$stat" -eq 0 ]; then
            echo "$DT ceph-osd@$osd ok"
        elif [ "$stat" -eq 3 ]; then
            # Exit code 3: the unit is not active, so try to start it.
            systemctl start "ceph-osd@$osd.service"
            systemctl status "ceph-osd@$osd.service"
            echo "$DT ceph-osd@$osd down try restart .. $?"
        else
            echo "$DT ceph-osd@$osd unknown"
        fi
    done
}

# Append the results to a per-day log file.
check_osd >> "/tmp/check_osd-$DAY.log"
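The 2-minute schedule could be expressed as a crontab entry like the one below. The script path /usr/local/bin/check_osd.sh is an assumption for illustration; use wherever the script was actually saved, and make sure it is executable.

```
# m   h  dom mon dow  command
*/2   *  *   *   *    /usr/local/bin/check_osd.sh
```

The script already appends to its own log under /tmp, so the entry needs no output redirection.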
# dmesg -T |grep -i ceph
libceph: mon0 10.11.2.35:6789 missing required protocol features
Mapping the image manually:
# rbd map rbd/kubernetes-dynamic-pvc-7f783e8c-c00c-11e9-9838-5cb901987210
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (5) Input/output error
Kernel version:
# uname -r
3.10.0-327.18.2.el7.x86_64
Another machine, running kernel 3.10.0-957.5.1.el7.x86_64, does not have this problem, so the kernel on all k8s nodes must be upgraded.
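To find the nodes that still need the upgrade, the running kernel can be compared against the version that worked. This is a rough sketch assuming GNU sort with the -V (version sort) option; the helper name kernel_older_than is made up for the example, and the baseline is simply the kernel observed to work above, not a verified minimum.

```shell
#!/bin/bash
# Return 0 if version $1 sorts strictly before version $2 (uses GNU sort -V).
kernel_older_than() {
    local current=$1 baseline=$2
    [ "$current" != "$baseline" ] &&
        [ "$(printf '%s\n%s\n' "$current" "$baseline" | sort -V | head -n1)" = "$current" ]
}

# Baseline: the kernel on the node that could map the image (an observation, not a verified minimum).
KNOWN_GOOD="3.10.0-957.5.1.el7.x86_64"

if kernel_older_than "$(uname -r)" "$KNOWN_GOOD"; then
    echo "kernel $(uname -r) is older than $KNOWN_GOOD: consider upgrading"
else
    echo "kernel $(uname -r) is at or above $KNOWN_GOOD"
fi
```

Run it on each node, for example over ssh, to list the hosts that report an older kernel.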
The busybox version of dd does not print the transfer speed; coreutils must be installed to get it.