目录

威联通NAS夜半死机:未操作时停止响应,只能强行关机

症状

最近碰到了比较折腾人的问题,每天夜间一点到四点,NAS会在某一时间停止响应,并且不会自动恢复,直到强行关机重启才能恢复。

根据阿里云自动监控nas上个人网站的记录,大约是在一点到四点之间停止响应,但阿里云有时候延迟很多,实际停止响应的时间应该比这早。

昨天我打开网页控制台并持续录屏,发现失去响应前后,Qboost和资源监控器的显示没有明显变化(或者是死机太快来不及反映在网页界面上),就是突然多出一个CPU占用%30的系统内核进程,然后网页就断开了。

失去响应期间,ssh无法登录,用ssh -vvv尝试登录发现卡在“we sent a key, wait for response”步骤。但是看硬盘灯的话,QVR Pro似乎还能工作。

第一次失去响应的时候,因为以前遇到过确实是有比较重的任务,就等了一天,结果还是不能用。第二天等了半天也是不行。于是第三天第四天就直接强关了,记得重新开机后检查文件系统。

一些尝试

首先被怀疑的是自动缩略图生成服务,这东西耗CPU我是知道的。这两天刚好下载了几个蓝光,会不会是缩略图生成累死了机器?尝试了禁止为视频文件生成缩略图,甚至关掉这个功能,但第二天起来还是死机。

然后怀疑hybrid backup的计划备份,因为第一天出问题是四点,而我依稀记得设置了三点左右自动备份,而且第二天强关后的错误记录里面有一条“因为异常断电,备份没有完成”。于是把这个时间调整到一点,于是下次死机的时间变成了两点多(看阿里云报错),我以为找对了,是以前创建的备份任务与新版不兼容。但重新设置了备份后,第三天还是死机。并且我手动启动了一次备份,并没有死机。

查相关故障,很难搜,只是有一个去年的“360发现威联通平台挖矿病毒”的消息,查了一下不像。毕竟如果一个作者注意到要在白天挺直挖矿,却在晚上全速运行到死机,是有点精神分裂的。

找了一圈找设置在晚上的定时任务,没找到时间完全对得上的。

后面就没辙了,第三天我让电脑打开网页界面,整晚上开着,然后开了个录屏,要看看到底是何方神圣,结果啥也没看到。就是有一瞬间屏幕中出现了一个占30% CPU的“内核进程”,然后网页界面停止刷新,一分钟后报错显示断开,没了。

日志分析

于是我觉得这事儿不能忍了,打算报威联通技术支持,报支持过程中发现需要详细日志,可以用“支持平台”APP生成,这个我之前就卸载了,现在只好装回来,然后生成一份日志(zip包),填信息报支持,然后自己分析。

通过时间对应,发现确实有四次彻底死机,分布在凌晨一点到四点之间,特征是在这之前一直陆续有kernel message刷新,但某一瞬间过后完全没有日志,下一条就是第二天上午或下午的开机信息。这说明情况起码严重到了日志都写不下去的地步。

于是分析司机前一瞬间的共同特征,倒是挺明显的,类似以下的一端日志在每次死机之前都出现了,在其他时候一次都没出现:

2022-01-11 03:00:59 +08:00 <4> [58878.505271] EXT4-fs (dm-114): Mount option "noacl" will be removed by 3.5
2022-01-11 03:00:59 +08:00 <4> [58878.505271] Contact linux-ext4@vger.kernel.org if you think we should keep it.
2022-01-11 03:00:59 +08:00 <4> [58878.505271] 
2022-01-11 03:00:59 +08:00 <6> [58878.676229] ext4_init_reserve_inode_table0: dm-114, 24576
2022-01-11 03:00:59 +08:00 <6> [58878.681715] ext4_init_reserve_inode_table2: dm-114, 24576, 0, 0, 4096
2022-01-11 03:00:59 +08:00 <6> [58878.725057] EXT4-fs (dm-114): orphan cleanup on readonly fs
2022-01-11 03:01:00 +08:00 <6> [58879.015642] EXT4-fs (dm-114): 16 orphan inodes deleted
2022-01-11 03:01:00 +08:00 <6> [58879.183505] EXT4-fs (dm-114): recovery complete
2022-01-11 03:01:00 +08:00 <6> [58879.320703] EXT4-fs (dm-114): mounted filesystem with ordered data mode. Opts: usrjquota=aquota.user,jqfmt=vfsv1,user_xattr,data=ordered,data_err=abort,delalloc,nopriv,nodiscard,noacl
2022-01-11 03:01:03 +08:00 <6> [58882.542275] ext4_init_reserve_inode_table0: dm-114, 24576
2022-01-11 03:01:03 +08:00 <6> [58882.547749] ext4_init_reserve_inode_table2: dm-114, 24576, 0, 0, 4096
2022-01-11 03:01:03 +08:00 <6> [58882.617280] EXT4-fs (dm-114): mounted filesystem with ordered data mode. Opts: usrjquota=aquota.user,jqfmt=vfsv1,user_xattr,data=ordered,data_err=abort,delalloc,nopriv,nodiscard,acl
2022-01-11 03:01:13 +08:00 <4> [58892.638787] ext2 filesystem being mounted at /tmp/nasconfig_tmp supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:20 +08:00 <4> [58898.975028] ext2 filesystem being mounted at /tmp/nasconfig_tmp supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:30 +08:00 <4> [58908.944119] ext2 filesystem being mounted at /tmp/nasconfig_tmp supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:33 +08:00 <4> [58912.249198] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:33 +08:00 <4> [58912.299864] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:34 +08:00 <4> [58912.948026] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:37 +08:00 <4> [58916.328700] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:01:37 +08:00 <4> [58916.363656] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:02:23 +08:00 <4> [58962.421353] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:02:43 +08:00 <4> [58981.940206] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:02:43 +08:00 <4> [58982.059447] ext2 filesystem being mounted at /tmp/config supports timestamps until 2038 (0x7fffffff)
2022-01-11 03:14:11 +08:00 <4> [59670.847551] EXT4-fs (dm-53): Mount option "noacl" will be removed by 3.5
2022-01-11 03:14:11 +08:00 <4> [59670.847551] Contact linux-ext4@vger.kernel.org if you think we should keep it.
2022-01-11 03:14:11 +08:00 <4> [59670.847551] 
2022-01-11 03:14:12 +08:00 <6> [59671.021426] ext4_init_reserve_inode_table0: dm-53, 24576
2022-01-11 03:14:12 +08:00 <6> [59671.026806] ext4_init_reserve_inode_table2: dm-53, 24576, 0, 0, 4096
2022-01-11 03:14:12 +08:00 <6> [59671.036967] EXT4-fs (dm-53): orphan cleanup on readonly fs
2022-01-11 03:14:12 +08:00 <6> [59671.346769] EXT4-fs (dm-53): 16 orphan inodes deleted
2022-01-11 03:14:12 +08:00 <6> [59671.642693] EXT4-fs (dm-53): recovery complete
2022-01-11 03:14:12 +08:00 <6> [59671.820782] EXT4-fs (dm-53): mounted filesystem with ordered data mode. Opts: usrjquota=aquota.user,jqfmt=vfsv1,user_xattr,data=ordered,data_err=abort,delalloc,nopriv,nodiscard,noacl
2022-01-11 03:17:06 +08:00 <3> [59845.231117] INFO: task python:2884 blocked for more than 122 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.237600]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.243367] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.251231] task:python          state:D stack:    0 pid: 2884 ppid:     1 flags:0x00000000
2022-01-11 03:17:06 +08:00 <6> [59845.259602] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.262079]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.265596]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.268764]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:06 +08:00 <6> [59845.272797]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.276743]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:06 +08:00 <6> [59845.280777]  vfs_write+0x2ad/0x2e0
2022-01-11 03:17:06 +08:00 <6> [59845.284199]  ksys_write+0x50/0xc0
2022-01-11 03:17:06 +08:00 <6> [59845.287530]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:06 +08:00 <6> [59845.291127]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:06 +08:00 <6> [59845.296200] RIP: 0033:0x7f49325415d0
2022-01-11 03:17:06 +08:00 <6> [59845.299793] RSP: 002b:00007ffdaf30bb78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
2022-01-11 03:17:06 +08:00 <6> [59845.307380] RAX: ffffffffffffffda RBX: 00000000000000a1 RCX: 00007f49325415d0
2022-01-11 03:17:06 +08:00 <6> [59845.314536] RDX: 00000000000000a1 RSI: 00007f493334c000 RDI: 0000000000000005
2022-01-11 03:17:06 +08:00 <6> [59845.321689] RBP: 00007f493334c000 R08: 0000000001a4bc60 R09: 00007f4933343700
2022-01-11 03:17:06 +08:00 <6> [59845.328843] R10: 00007f4933343700 R11: 0000000000000246 R12: 0000000001a4bb80
2022-01-11 03:17:06 +08:00 <6> [59845.336002] R13: 00000000000000a1 R14: 00007f492d11e820 R15: 00007f492d127b90
2022-01-11 03:17:06 +08:00 <3> [59845.343338] INFO: task dockerd:5018 blocked for more than 122 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.349891]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.355666] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.363520] task:dockerd         state:D stack:    0 pid: 5018 ppid:  2884 flags:0x00004000
2022-01-11 03:17:06 +08:00 <6> [59845.371895] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.374362]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.377876]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.381083]  io_schedule+0xd/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.384423]  filemap_fault+0x3cd/0xa70
2022-01-11 03:17:06 +08:00 <6> [59845.388201]  ? generic_perform_write+0x1e0/0x1e0
2022-01-11 03:17:06 +08:00 <6> [59845.392839]  ext4_filemap_fault+0x27/0x40
2022-01-11 03:17:06 +08:00 <6> [59845.396882]  __do_fault+0x2f/0xf0
2022-01-11 03:17:06 +08:00 <6> [59845.400225]  handle_mm_fault+0xa7b/0xf00
2022-01-11 03:17:06 +08:00 <6> [59845.404168]  exc_page_fault+0x23d/0x420
2022-01-11 03:17:06 +08:00 <6> [59845.408028]  ? asm_exc_page_fault+0x8/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.412150]  asm_exc_page_fault+0x1e/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.416179] RIP: 0033:0x4550ae
2022-01-11 03:17:06 +08:00 <6> [59845.419255] RSP: 002b:00007ffe6f2415c8 EFLAGS: 00010293
2022-01-11 03:17:06 +08:00 <6> [59845.424503] RAX: 000000007fffffff RBX: 0000000000000000 RCX: 000000c000f0be00
2022-01-11 03:17:06 +08:00 <6> [59845.431686] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.438849] RBP: 00007ffe6f241890 R08: 0000000000000000 R09: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.446000] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000033bc4f0
2022-01-11 03:17:06 +08:00 <6> [59845.453158] R13: 0000000004557a20 R14: 000000c000f0be00 R15: 00000000033bc4f0
2022-01-11 03:17:06 +08:00 <3> [59845.460641] INFO: task dockerd:5047 blocked for more than 123 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.467208]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.472973] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.480821] task:dockerd         state:D stack:    0 pid: 5047 ppid:  2884 flags:0x00000000
2022-01-11 03:17:06 +08:00 <6> [59845.489189] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.491660]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.495170]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.498332]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:06 +08:00 <6> [59845.502356]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.506297]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:06 +08:00 <6> [59845.510330]  mnt_want_write+0x60/0x80
2022-01-11 03:17:06 +08:00 <6> [59845.514012]  filename_create+0x5f/0x160
2022-01-11 03:17:06 +08:00 <6> [59845.517869]  do_mknodat+0xd4/0x200
2022-01-11 03:17:06 +08:00 <6> [59845.521291]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:06 +08:00 <6> [59845.524890]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:06 +08:00 <6> [59845.529966] RIP: 0033:0x4bf25a
2022-01-11 03:17:06 +08:00 <6> [59845.533039] RSP: 002b:000000c0008e1170 EFLAGS: 00000202 ORIG_RAX: 0000000000000103
2022-01-11 03:17:06 +08:00 <6> [59845.540627] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 00000000004bf25a
2022-01-11 03:17:06 +08:00 <6> [59845.547776] RDX: 00000000000011c0 RSI: 000000c000157a20 RDI: ffffffffffffff9c
2022-01-11 03:17:06 +08:00 <6> [59845.554924] RBP: 000000c0008e11e0 R08: 0000000000000000 R09: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.562076] R10: 0000000000000000 R11: 0000000000000202 R12: ffffffffffffffff
2022-01-11 03:17:06 +08:00 <6> [59845.569252] R13: 0000000000000027 R14: 0000000000000026 R15: 000000000000002e
2022-01-11 03:17:06 +08:00 <3> [59845.576633] INFO: task dockerd:5116 blocked for more than 123 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.583189]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.588948] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.596795] task:dockerd         state:D stack:    0 pid: 5116 ppid:  2884 flags:0x00000000
2022-01-11 03:17:06 +08:00 <6> [59845.605170] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.607641]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.611156]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.614321]  io_schedule+0xd/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.617652]  filemap_fault+0x3cd/0xa70
2022-01-11 03:17:06 +08:00 <6> [59845.621430]  ? generic_perform_write+0x1e0/0x1e0
2022-01-11 03:17:06 +08:00 <6> [59845.626068]  ext4_filemap_fault+0x27/0x40
2022-01-11 03:17:06 +08:00 <6> [59845.630093]  __do_fault+0x2f/0xf0
2022-01-11 03:17:06 +08:00 <6> [59845.633422]  handle_mm_fault+0xa7b/0xf00
2022-01-11 03:17:06 +08:00 <6> [59845.637361]  exc_page_fault+0x23d/0x420
2022-01-11 03:17:06 +08:00 <6> [59845.641215]  ? asm_exc_page_fault+0x8/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.645332]  asm_exc_page_fault+0x1e/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.649361] RIP: 0033:0x4550ae
2022-01-11 03:17:06 +08:00 <6> [59845.652436] RSP: 002b:00007f844d6be720 EFLAGS: 00010293
2022-01-11 03:17:06 +08:00 <6> [59845.657688] RAX: 000000007fffffff RBX: 0000000000000000 RCX: 000000c000f3ed80
2022-01-11 03:17:06 +08:00 <6> [59845.664848] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.672001] RBP: 00007f844d6be9e8 R08: 0000000000000000 R09: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.679150] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000334eff0
2022-01-11 03:17:06 +08:00 <6> [59845.686302] R13: 0000000004557a20 R14: 00000000024ed724 R15: 0000000000000000
2022-01-11 03:17:06 +08:00 <3> [59845.693677] INFO: task VM Periodic Tas:8389 blocked for more than 123 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.700919]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.706682] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.714523] task:VM Periodic Tas state:D stack:    0 pid: 8389 ppid:  8314 flags:0x00000320
2022-01-11 03:17:06 +08:00 <6> [59845.722892] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.725365]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.728874]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.732037]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:06 +08:00 <6> [59845.736066]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.740004]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:06 +08:00 <6> [59845.744040]  ext4_page_mkwrite+0x3bb/0x540
2022-01-11 03:17:06 +08:00 <6> [59845.748158]  do_page_mkwrite+0x4f/0xb0
2022-01-11 03:17:06 +08:00 <6> [59845.751927]  do_wp_page+0x20e/0x360
2022-01-11 03:17:06 +08:00 <6> [59845.755431]  handle_mm_fault+0x300/0xf00
2022-01-11 03:17:06 +08:00 <6> [59845.759379]  ? tick_sched_handle+0x20/0x50
2022-01-11 03:17:06 +08:00 <6> [59845.763494]  exc_page_fault+0x23d/0x420
2022-01-11 03:17:06 +08:00 <6> [59845.767359]  ? asm_exc_page_fault+0x8/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.771473]  asm_exc_page_fault+0x1e/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.775497] RIP: 0033:0x7fbba7c237be
2022-01-11 03:17:06 +08:00 <6> [59845.779090] RSP: 002b:00007fbb77cfbd60 EFLAGS: 00010206
2022-01-11 03:17:06 +08:00 <6> [59845.784328] RAX: 000000001c2f81d8 RBX: 00007fbba8ffa420 RCX: 0000000000000002
2022-01-11 03:17:06 +08:00 <6> [59845.791474] RDX: 00007fbba7501080 RSI: 00007fbba828492c RDI: 00007fbba0019a60
2022-01-11 03:17:06 +08:00 <6> [59845.798620] RBP: 00007fbb77cfbd70 R08: 00000881df9ac29c R09: 0000000000000032
2022-01-11 03:17:06 +08:00 <6> [59845.805767] R10: 0000000006d3e5c2 R11: 0018e920e8f95290 R12: 0000000000000008
2022-01-11 03:17:06 +08:00 <6> [59845.812918] R13: 00007fbba0150d60 R14: 0000000000000002 R15: 0000000000000001
2022-01-11 03:17:06 +08:00 <3> [59845.820315] INFO: task cc3-fastcgi:19981 blocked for more than 123 seconds.
2022-01-11 03:17:06 +08:00 <3> [59845.827311]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:06 +08:00 <3> [59845.833074] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:06 +08:00 <6> [59845.840922] task:cc3-fastcgi     state:D stack:    0 pid:19981 ppid:     1 flags:0x00000000
2022-01-11 03:17:06 +08:00 <6> [59845.849287] Call Trace:
2022-01-11 03:17:06 +08:00 <6> [59845.851754]  __schedule+0x1dd/0x610
2022-01-11 03:17:06 +08:00 <6> [59845.855260]  schedule+0x41/0xa0
2022-01-11 03:17:06 +08:00 <6> [59845.858430]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:06 +08:00 <6> [59845.862459]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:06 +08:00 <6> [59845.866394]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:06 +08:00 <6> [59845.870426]  vfs_write+0x2ad/0x2e0
2022-01-11 03:17:06 +08:00 <6> [59845.873845]  ksys_write+0x50/0xc0
2022-01-11 03:17:06 +08:00 <6> [59845.877180]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:06 +08:00 <6> [59845.880779]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:06 +08:00 <6> [59845.885854] RIP: 0033:0x7fd1c17ac84d
2022-01-11 03:17:06 +08:00 <6> [59845.889441] RSP: 002b:00007fd1827f9000 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
2022-01-11 03:17:06 +08:00 <6> [59845.897024] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fd1c17ac84d
2022-01-11 03:17:06 +08:00 <6> [59845.904172] RDX: 0000000000001000 RSI: 000000000184e5c0 RDI: 0000000000000009
2022-01-11 03:17:06 +08:00 <6> [59845.911316] RBP: 00000000017e36c8 R08: 0000000000000016 R09: 00007fd1827f9090
2022-01-11 03:17:06 +08:00 <6> [59845.918471] R10: 0000000000000002 R11: 0000000000000293 R12: 0000000000000000
2022-01-11 03:17:06 +08:00 <6> [59845.925626] R13: 000000000184e5c0 R14: 0000000000001000 R15: 0000000000001000
2022-01-11 03:17:07 +08:00 <3> [59845.933248] INFO: task cc3_watchdog.sh:13462 blocked for more than 123 seconds.
2022-01-11 03:17:07 +08:00 <3> [59845.940591]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:07 +08:00 <3> [59845.946356] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:07 +08:00 <6> [59845.954202] task:cc3_watchdog.sh state:D stack:    0 pid:13462 ppid:     1 flags:0x00000000
2022-01-11 03:17:07 +08:00 <6> [59845.962572] Call Trace:
2022-01-11 03:17:07 +08:00 <6> [59845.965036]  __schedule+0x1dd/0x610
2022-01-11 03:17:07 +08:00 <6> [59845.968544]  schedule+0x41/0xa0
2022-01-11 03:17:07 +08:00 <6> [59845.971704]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:07 +08:00 <6> [59845.975734]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:07 +08:00 <6> [59845.979679]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:07 +08:00 <6> [59845.983704]  mnt_want_write+0x60/0x80
2022-01-11 03:17:07 +08:00 <6> [59845.987386]  filename_create+0x5f/0x160
2022-01-11 03:17:07 +08:00 <6> [59845.991238]  do_mkdirat+0x49/0x170
2022-01-11 03:17:07 +08:00 <6> [59845.994652]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:07 +08:00 <6> [59845.998248]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:07 +08:00 <6> [59846.003311] RIP: 0033:0x7f80307d52f7
2022-01-11 03:17:07 +08:00 <6> [59846.006898] RSP: 002b:00007fff1c4458c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000053
2022-01-11 03:17:07 +08:00 <6> [59846.014480] RAX: ffffffffffffffda RBX: 0000000001135008 RCX: 00007f80307d52f7
2022-01-11 03:17:07 +08:00 <6> [59846.021636] RDX: 00007fff1c4458f6 RSI: 00000000000001fd RDI: 00007fff1c4458e0
2022-01-11 03:17:07 +08:00 <6> [59846.028791] RBP: 00007fff1c445a00 R08: 00000000004adf48 R09: 0000000000000004
2022-01-11 03:17:07 +08:00 <6> [59846.035941] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000418e40
2022-01-11 03:17:07 +08:00 <6> [59846.043091] R13: 00007fff1c446830 R14: 0000000000000000 R15: 0000000000000000
2022-01-11 03:17:07 +08:00 <3> [59846.050452] INFO: task manager_svc:12583 blocked for more than 123 seconds.
2022-01-11 03:17:07 +08:00 <3> [59846.057444]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:07 +08:00 <3> [59846.063225] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:07 +08:00 <6> [59846.071064] task:manager_svc     state:D stack:    0 pid:12583 ppid: 11331 flags:0x00000000
2022-01-11 03:17:07 +08:00 <6> [59846.079433] Call Trace:
2022-01-11 03:17:07 +08:00 <6> [59846.081935]  __schedule+0x1dd/0x610
2022-01-11 03:17:07 +08:00 <6> [59846.085439]  schedule+0x41/0xa0
2022-01-11 03:17:07 +08:00 <6> [59846.088600]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:07 +08:00 <6> [59846.092625]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:07 +08:00 <6> [59846.096565]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:07 +08:00 <6> [59846.100586]  mnt_want_write+0x60/0x80
2022-01-11 03:17:07 +08:00 <6> [59846.104269]  path_openat+0x42e/0xeb0
2022-01-11 03:17:07 +08:00 <6> [59846.107860]  ? path_lookupat+0x77/0x120
2022-01-11 03:17:07 +08:00 <6> [59846.111710]  ? filename_lookup+0x108/0x1a0
2022-01-11 03:17:07 +08:00 <6> [59846.115820]  ? rcu_sched_clock_irq+0x2b0/0x770
2022-01-11 03:17:07 +08:00 <6> [59846.120286]  do_filp_open+0x94/0x110
2022-01-11 03:17:07 +08:00 <6> [59846.123882]  ? _copy_to_user+0x1c/0x30
2022-01-11 03:17:07 +08:00 <6> [59846.127645]  ? cp_new_stat+0x154/0x190
2022-01-11 03:17:07 +08:00 <6> [59846.131411]  ? _cond_resched+0x10/0x20
2022-01-11 03:17:07 +08:00 <6> [59846.135173]  ? do_sys_openat2+0x1f3/0x390
2022-01-11 03:17:07 +08:00 <6> [59846.139203]  do_sys_openat2+0x1f3/0x390
2022-01-11 03:17:07 +08:00 <6> [59846.143057]  do_sys_open+0x55/0x70
2022-01-11 03:17:07 +08:00 <6> [59846.146479]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:07 +08:00 <6> [59846.150083]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:07 +08:00 <6> [59846.155166] RIP: 0033:0x7f681e660d19
2022-01-11 03:17:07 +08:00 <6> [59846.158756] RSP: 002b:00007f681618d600 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
2022-01-11 03:17:07 +08:00 <6> [59846.166340] RAX: ffffffffffffffda RBX: 00007f67ec01b7e0 RCX: 00007f681e660d19
2022-01-11 03:17:07 +08:00 <6> [59846.173492] RDX: 0000000000000442 RSI: 000055abeb6160a0 RDI: 00000000ffffff9c
2022-01-11 03:17:07 +08:00 <6> [59846.180653] RBP: 000055abeb4122a3 R08: 0000000000000000 R09: 000055abeb4122a4
2022-01-11 03:17:07 +08:00 <6> [59846.187803] R10: 00000000000001b6 R11: 0000000000000293 R12: 0000000000001000
2022-01-11 03:17:07 +08:00 <6> [59846.194956] R13: 000055abeb4122a2 R14: 0000000000000000 R15: 0000000000000194
2022-01-11 03:17:07 +08:00 <3> [59846.202347] INFO: task WTJourn.Flusher:14439 blocked for more than 123 seconds.
2022-01-11 03:17:07 +08:00 <3> [59846.209695]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:07 +08:00 <3> [59846.215460] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:07 +08:00 <6> [59846.223303] task:WTJourn.Flusher state:D stack:    0 pid:14439 ppid: 11331 flags:0x00000000
2022-01-11 03:17:07 +08:00 <6> [59846.231674] Call Trace:
2022-01-11 03:17:07 +08:00 <6> [59846.234147]  __schedule+0x1dd/0x610
2022-01-11 03:17:07 +08:00 <6> [59846.237654]  schedule+0x41/0xa0
2022-01-11 03:17:07 +08:00 <6> [59846.240844]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:07 +08:00 <6> [59846.244874]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:07 +08:00 <6> [59846.248811]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:07 +08:00 <6> [59846.252839]  vfs_write+0x2ad/0x2e0
2022-01-11 03:17:07 +08:00 <6> [59846.256255]  ksys_pwrite64+0x61/0x90
2022-01-11 03:17:07 +08:00 <6> [59846.259850]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:07 +08:00 <6> [59846.263442]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:07 +08:00 <6> [59846.268511] RIP: 0033:0x7f1b2aa8b04f
2022-01-11 03:17:07 +08:00 <6> [59846.272096] RSP: 002b:00007f1b1f75e410 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
2022-01-11 03:17:07 +08:00 <6> [59846.279679] RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007f1b2aa8b04f
2022-01-11 03:17:07 +08:00 <6> [59846.286829] RDX: 0000000000000080 RSI: 000055edd75b8000 RDI: 0000000000000012
2022-01-11 03:17:07 +08:00 <6> [59846.293977] RBP: 000055edd75b8000 R08: 0000000000000000 R09: 0000000000000000
2022-01-11 03:17:07 +08:00 <6> [59846.301120] R10: 00000000000f6100 R11: 0000000000000293 R12: 0000000000000080
2022-01-11 03:17:07 +08:00 <6> [59846.308265] R13: 00000000000f6100 R14: 000055edd4d50b80 R15: 00000000000f6100
2022-01-11 03:17:07 +08:00 <3> [59846.315562] INFO: task ftdc:14484 blocked for more than 123 seconds.
2022-01-11 03:17:07 +08:00 <3> [59846.321938]       Tainted: P     U     O      5.10.60-qnap #1
2022-01-11 03:17:07 +08:00 <3> [59846.327697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2022-01-11 03:17:07 +08:00 <6> [59846.335541] task:ftdc            state:D stack:    0 pid:14484 ppid: 11331 flags:0x00000000
2022-01-11 03:17:07 +08:00 <6> [59846.343906] Call Trace:
2022-01-11 03:17:07 +08:00 <6> [59846.346372]  __schedule+0x1dd/0x610
2022-01-11 03:17:07 +08:00 <6> [59846.349910]  schedule+0x41/0xa0
2022-01-11 03:17:07 +08:00 <6> [59846.353076]  percpu_rwsem_wait+0xd3/0x180
2022-01-11 03:17:07 +08:00 <6> [59846.357103]  ? percpu_up_write+0x30/0x30
2022-01-11 03:17:07 +08:00 <6> [59846.361039]  __percpu_down_read+0x53/0x60
2022-01-11 03:17:07 +08:00 <6> [59846.365064]  mnt_want_write+0x60/0x80
2022-01-11 03:17:07 +08:00 <6> [59846.368753]  path_openat+0x42e/0xeb0
2022-01-11 03:17:07 +08:00 <6> [59846.372346]  ? release_pages+0x23e/0x300
2022-01-11 03:17:07 +08:00 <6> [59846.376284]  do_filp_open+0x94/0x110
2022-01-11 03:17:07 +08:00 <6> [59846.379879]  ? do_sys_openat2+0x1f3/0x390
2022-01-11 03:17:07 +08:00 <6> [59846.383914]  do_sys_openat2+0x1f3/0x390
2022-01-11 03:17:07 +08:00 <6> [59846.387769]  ? hrtimer_interrupt+0x14d/0x260
2022-01-11 03:17:07 +08:00 <6> [59846.392054]  do_sys_open+0x55/0x70
2022-01-11 03:17:07 +08:00 <6> [59846.395487]  do_syscall_64+0x2d/0x40
2022-01-11 03:17:07 +08:00 <6> [59846.399081]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
2022-01-11 03:17:07 +08:00 <6> [59846.404144] RIP: 0033:0x7f1b2a797d19
2022-01-11 03:17:07 +08:00 <6> [59846.407730] RSP: 002b:00007f1b1c7584b0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
2022-01-11 03:17:07 +08:00 <6> [59846.415311] RAX: ffffffffffffffda RBX: 000055edd9aadb40 RCX: 00007f1b2a797d19
2022-01-11 03:17:07 +08:00 <6> [59846.422460] RDX: 0000000000000241 RSI: 000055edd4b189e0 RDI: 00000000ffffff9c
2022-01-11 03:17:07 +08:00 <6> [59846.429603] RBP: 000055edd0d967dc R08: 0000000000000000 R09: 000055edd0d967dd
2022-01-11 03:17:07 +08:00 <6> [59846.436748] R10: 00000000000001b6 R11: 0000000000000293 R12: 0000000000000004
2022-01-11 03:17:07 +08:00 <6> [59846.443892] R13: 000055edd0d967db R14: 000055edd1b76048 R15: 000055edd1b75a10

首先是几条连续的ext2 filesystem being mounted at /tmp/nasconfig_tmp supports timestamps until 2038信息,表示连续进行了一些mount,然后是若干程序严重卡住的信息(blocked for more than 123 seconds),以及一些相关的堆栈、十六进制信息等。

INFO: task manager_svc:12583 blocked for more than 123 seconds.
Tainted: P     U     O      5.10.60-qnap #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message

容易查出来blocked for more than 123 seconds是内存脏数据太多,来不及写到磁盘而撑爆缓存的消息,说明此时突然出现了大量的IO或者内存占用,但123 seconds说明这已经是死机两分钟之后了。而这个条消息与上面的大量mount之间似乎没有有用信息了。

其实还有,但我当时没有想到。

当时我想的是,书到用时方恨少,也许可以通过一些调试办法找原因,但是太难了。crontab也没发现与这个时间相关联的东西。

外援

结果威联通的回复比我想的快,午饭过后就看到了邮件,很直白的让我:

> 把区域设置修改为“全球”
> 禁用再启用一下“Malware Remover”
> 再把区域设置改回去

这个自信的语气听起来很靠谱,我赶紧去找被遗忘了的“Malware Remover”,先打开看看,发现打开后居然是个错误页面,这就更有谱了,照邮件里面说的做一遍,发现“Malware Remover”能打开了,显示我设置了一个每晚一点钟的的病毒扫描。

试着手动执行了一次错误扫描,没出问题,没有十分卡顿。再去生成一份详细日志,发现最新日志里面出现了一系列的ext2 filesystem being mounted at /tmp/nasconfig_tmp supports timestamps until 2038 (0x7fffffff),但没有后续的IO问题报告。

并且,不知道是不是心理作用,总感觉响应比前几天灵敏了一些。

就这样过了24小时,没有再次死机,功能都正常,这个问题算是解决了。

复盘的话,大概是Malware Remover在最近的某次更新里面出了什么问题,同时,设定在一点的扫描在遇到其他定时任务时会自动让路,或者由于更新的问题导致Malware Remover在夜里三四点更新后会再扫描一次,最终导致了这一系列似乎随机又似乎有规律的死机事件。

失误总结

以下失误,都是从当时的信息出发应该能够做到的,但没有做到,拖延了排错进程。

  1. 寻找定时任务的时候,没有检查Malware Remover,否则应该可以更早发现这个APP有问题——不过试了一下,从那个错误信息也搜不到特别有用的东西;
  2. 发现“威联通挖矿病毒”的消息的时候,本应该手动拿Malware Remover查一下,但实际上只是按消息内容确认了没有可疑文件;当然这里有威联通的锅,在应用商店卖McAfee,大家当然本能觉得Malware Remover不是什么好货。
  3. 发现死机与连串ext2 filesystem being mounted的相关性后,在日志中搜"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message发现特异性很强,只出现了四组,而ext2 filesystem being mounted则散布的比较多,就没有深入挖掘。实际上,ext2 filesystem being mounted虽然很多,但连串的则只有在Malware Remover运行时才能发现,大概每天一两次。如果进一步按连串消息的出现时间去对应系统事件日志,应该可以发现这种日志代表Malware Remover的后台运行。
  4. 但是,即使发现了问题与Malware Remover相关,网上搜不到也想不到改区域-禁用-启用-改回来这种骚操作。下次一有问题就立即叫技术支持,劳资付过钱的。

淦!

刚才写完文章,去看了一眼威联通的官网,结果居然在页面顶部看到一个红色弹窗,没有标日期,但肯定是最近的(否则不会在其他不相干页面弹窗):

如何解決 Malware Remover 服務更新所造成的 NAS 連線問題
近日 Malware Remover 服務更新導致部分 QNAP NAS 出現連線異常問題。使用者如於 Malware Remover 的設定選項中啟用「自動更新」將受此問題影響。我們已經再次調整 Malware Remover 服務並修正此問題。

如您的 NAS 完成 Malware Remover 服務更新後發生停止回應或是無法存取的問題,請依照以下步驟手動重啟您的 NAS:

持續按壓電源鍵直到 NAS 完成斷電關機
再次按壓電源鍵重新啟動 NAS
受影響的 NAS 經過手動重啟後即可恢復正常運作。如果仍有異常情況或疑問,請聯繫 QNAP 技術支援尋求協助。

威联通你二大爷……

不过即使我早看到了这个通知,强关再打开这种操作仍然解决不了我的问题,还是需要改区域-禁用再启用-再把区域改回来这种骚操作。之于为啥不把这个操作放到通知里面,大概是因为太丢人吧……