dpu iperf soft lockup
dmesg
- https://gitee.com/openeuler/community-issue/issues/IA5ACV?skip_mobile=true
[329724.749710] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [migration/8:53]
[329724.758280] Modules linked in: btrfs blake2b_generic xor xor_neon raid6_pq ntfs msdos 8021q garp stp mrp llc bonding rfkill rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod vfat target_core_mod fat ib_iser rdma_cm iw_cm ib_cm ext4 libiscsi scsi_transport_iscsi mbcache jbd2 sg hns_roce_hw_v2 ib_uverbs ib_core hns3_pmu uio_pdrv_genirq uio hibmc_drm drm_vram_helper ofpart drm_ttm_helper ipmi_ssif cmdlinepart ttm ipmi_devintf spi_nor mtd ipmi_msghandler spi_hisi_sfc_v3xx hisi_uncore_cpa_pmu hisi_uncore_pa_pmu hisi_uncore_l3c_pmu hisi_uncore_ddrc_pmu hisi_uncore_hha_pmu hisi_uncore_sllc_pmu hisi_ptt hisi_pcie_pmu hisi_uncore_pmu sch_fq_codel fuse xfs libcrc32c sd_mod t10_pi virtio_net net_failover virtio_blk failover ghash_ce sm4_ce sm4_ce_cipher sm4 sm3_ce sha3_ce sha3_generic hisi_sas_v3_hw sha512_ce ahci hisi_sas_main sha512_arm64 virtio_pci sha2_ce hclge libsas libahci virtio sha256_arm64 sha1_ce sbsa_gwdt virtio_pci_modern_dev libata hns3 virtio_ring hnae3
[329724.758363] scsi_transport_sas host_edma_drv hisi_trng_v2 hns_mdio dm_mirror dm_region_hash dm_log dm_mod aes_ce_blk crypto_simd cryptd aes_ce_cipher
[329724.863203] CPU: 8 PID: 53 Comm: migration/8 Kdump: loaded Tainted: G L 5.10.0-136.12.0.88.4.yql.ctl3.aarch64 #1
[329724.876429] Hardware name: Huawei S920X20/BC83AMDA, BIOS 09.04.02.01.13 01/11/2025
[329724.885280] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[329724.892563] pc : rcu_momentary_dyntick_idle+0x40/0x60
[329724.898931] lr : multi_cpu_stop+0x100/0x1a0
[329724.904391] sp : ffff800012093d60
[329724.908982] x29: ffff800012093d60 x28: ffff80006ac139e0
[329724.915591] x27: ffffb88970482dd0 x26: 00000000000000e0
[329724.922207] x25: 0000000000000000 x24: 0000000000000000
[329724.928793] x23: 0000000000000000 x22: ffffb88970ff0418
[329724.935385] x21: 0000000000000001 x20: ffff80006ac139e0
[329724.941995] x19: 0000000000000001 x18: 0000000000000000
[329724.948587] x17: 0000000000000000 x16: 0000000000000000
[329724.955177] x15: 0000000000000000 x14: 0000000000000000
[329724.961763] x13: 0000000000000000 x12: 0000000000000000
[329724.968346] x11: 00000a74804c995c x10: 19d0e478f99fa735
[329724.974917] x9 : ffffb88970482e70 x8 : ffff12795ab84c88
[329724.981495] x7 : 00000000000001e7 x6 : 0000000000000008
[329724.988080] x5 : 00000000480fd020 x4 : 0000000000000000
[329724.994708] x3 : 00000000f465d74a x2 : ffff12f6bdfb0920
[329725.001286] x1 : 00000000f465d74e x0 : ffff12f6bdfb0800
[329725.007858] Call trace:
[329725.011564]  rcu_momentary_dyntick_idle+0x40/0x60
[329725.017565]  cpu_stopper_thread+0x100/0x1b0
[329725.023016]  smpboot_thread_fn+0x15c/0x1a0
[329725.028373]  kthread+0x108/0x13c
[329725.032893]  ret_from_fork+0x10/0x18
[329725.037722] Kernel panic - not syncing: softlockup: hung tasks
[329725.044834] CPU: 8 PID: 53 Comm: migration/8 Kdump: loaded Tainted: G L 5.10.0-136.12.0.88.4.yql.ctl3.aarch64 #1
[329725.058096] Hardware name: Huawei S920X20/BC83AMDA, BIOS 09.04.02.01.13 01/11/2025
[329725.066964] Call trace:
[329725.070717]  dump_backtrace+0x0/0x1e4
[329725.075690]  show_stack+0x20/0x2c
[329725.080307]  dump_stack+0xd8/0x140
[329725.085000]  panic+0x168/0x390
[329725.089349]  watchdog_timer_fn+0x230/0x290
[329725.094725]  __run_hrtimer+0x98/0x2a0
[329725.099642]  __hrtimer_run_queues+0xb0/0x134
[329725.105202]  hrtimer_interrupt+0x13c/0x3c0
[329725.110535]  arch_timer_handler_phys+0x3c/0x50
[329725.116203]  handle_percpu_devid_irq+0x90/0x1f4
[329725.121937]  __handle_domain_irq+0x84/0xf0
[329725.127220]  gic_handle_irq+0x78/0x2c0
[329725.132142]  el1_irq+0xb8/0x140
[329725.136442]  rcu_momentary_dyntick_idle+0x40/0x60
[329725.142325]  cpu_stopper_thread+0x100/0x1b0
[329725.147641]  smpboot_thread_fn+0x15c/0x1a0
[329725.152880]  kthread+0x108/0x13c
[329725.157211]  ret_from_fork+0x10/0x18
[329725.162035] SMP: stopping secondary CPUs
[329725.177300] Starting crashdump kernel...
[329725.188123] Bye!

crash
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/5.10.0-136.12.0.88.4.ctl3.aarch64/vmlinux  [TAINTED]
    DUMPFILE: vmcore  [PARTIAL DUMP]
        CPUS: 256 [OFFLINE: 1]
        DATE: Thu Feb 27 15:52:42 CST 2025
      UPTIME: 1 days, 22:04:57
LOAD AVERAGE: 144.05, 56.30, 37.88
       TASKS: 3958
    NODENAME: gaoji-10-8-94-87
     RELEASE: 5.10.0-136.12.0.88.4.ctl3.aarch64
     VERSION:
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 1023.7 GB
       PANIC: "Kernel panic - not syncing: softlockup: hung tasks"

crash> bt
PID: 463 TASK: ffff5dd79b964d00 CPU: 90 COMMAND: "migration/90"
--- <IRQ stack> ---

crash> bt -l
PID: 463 TASK: ffff5dd79b964d00 CPU: 90 COMMAND: "migration/90"
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/./arch/arm64/include/asm/kexec.h: 52
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/panic.c: 251
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/watchdog.c: 448
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/time/hrtimer.c: 1583
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/time/hrtimer.c: 1647
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/time/hrtimer.c: 1709
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/drivers/clocksource/arm_arch_timer.c: 647
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/irq/chip.c: 933
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/./include/linux/irqdesc.h: 153
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/./include/linux/irqdesc.h: 171
--- <IRQ stack> ---
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/arch/arm64/kernel/entry.S: 672
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/rcu/tree.c: 421
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/stop_machine.c: 504
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/smpboot.c: 164
/usr/src/debug/kernel-5.10.0-136.12.0.88.4.ctl3.aarch64/linux-5.10.0-136.12.0.88.4.ctl3.aarch64/kernel/kthread.c: 313

crash> ps | grep migration | grep IN | wc -l
254
crash> ps | grep migration | grep RU | wc -l
2

crash> ps | grep migration | grep RU
>   463      2   90  ffff5dd79b964d00  RU   0.0      0      0  [migration/90]
    974      2  192  ffff7dd7111fcd00  RU   0.0      0      0  [migration/192]

crash> ps | grep migration
     14      2    0  ffff5dd79a79b9c0  IN   0.0      0      0  [migration/0]
     17      2    1  ffff5dd79aa58000  IN   0.0      0      0  [migration/1]
     22      2    2  ffff5dd79aa94d00  IN   0.0      0      0  [migration/2]
     27      2    3  ffff5dd79aada680  IN   0.0      0      0  [migration/3]
     32      2    4  ffff5dd79ab239c0  IN   0.0      0      0  [migration/4]
     38      2    5  ffff5dd79ab38000  IN   0.0      0      0  [migration/5]

crash> bt -p 974
PID: 974 TASK: ffff7dd7111fcd00 CPU: 192 COMMAND: "migration/192"

Preliminary conclusion
The log shows a Linux kernel panic triggered by "softlockup: hung tasks", that is, a soft lockup. An analysis of the log and suggested fixes follow:
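When several crash logs need to be compared, the soft lockup report line from the dmesg output above can be split into fields with a regular expression. The following is a minimal sketch only: the pattern is written against the single line shown in this log, the helper name `parse_soft_lockup` is hypothetical, and other kernel versions may format the message differently.

```python
import re

# Pattern written against the one report line in the dmesg output above;
# it is an assumption that other kernels use exactly the same wording.
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return the fields of a soft lockup report line, or None if it does not match."""
    m = SOFT_LOCKUP.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m["ts"]),      # seconds since boot
        "cpu": int(m["cpu"]),             # CPU that was stuck
        "stuck_seconds": int(m["secs"]),  # how long the watchdog saw no progress
        "comm": m["comm"],                # task name, e.g. migration/8
        "pid": int(m["pid"]),
    }


line = "[329724.749710] watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [migration/8:53]"
report = parse_soft_lockup(line)
print(report)
```

Lines that are not soft lockup reports return `None`, so the helper can be applied to every line of a log file and the matches collected.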
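1. Log analysis
- Time and system information:
  - The crash occurred on 27 February 2025 at 15:52:42 CST.
  - The system had been up for 1 day, 22 hours, 4 minutes, and 57 seconds.
  - The kernel version is 5.10.0-136.12.0.88.4.ctl3.aarch64, marked [TAINTED], which means the kernel has recorded at least one taint flag.
  - The machine is aarch64 (64-bit ARM) with 1023.7 GB of memory and 256 CPUs, 1 of which is offline.
  - The dump file is vmcore (a kernel memory dump, marked [PARTIAL DUMP]), so the system captured a dump when it crashed.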
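- Cause of the crash:
  - The primary problem is a soft lockup: a CPU ran in kernel mode for too long (typically more than 20 seconds; the dmesg line reports 23s) without letting other tasks run, and the watchdog then raised the panic.
  - The task involved (PID 463, command migration/90) is a per-CPU migration thread, the stopper kthread that the scheduler uses to move tasks between CPUs and to run stop-machine work.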
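- Call stack (backtrace):
  - The log shows the call stack at the time of the crash, which involves kernel functions such as crash_kexec, panic, and watchdog_timer_fn.
  - watchdog_timer_fn belongs to the kernel watchdog mechanism, which checks whether each CPU is still making progress. When the watchdog detects a soft lockup and the kernel is configured to panic on soft lockups, it triggers the panic.
  - The other functions (such as hrtimer_interrupt and handle_percpu_devid_irq) are the timer interrupt path on which the watchdog fired. The interrupted code below the IRQ stack is the stopper thread in rcu_momentary_dyntick_idle (pc), called from multi_cpu_stop (lr), which suggests the problem is related to stop-machine work and multi-CPU scheduling rather than to the timer itself.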
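- TAINTED mark: [TAINTED] means the kernel has recorded one or more taint flags. Flags for proprietary, out-of-tree, or unsigned modules would point to third-party code as a possible root cause, but the flags in the dmesg output are "G L": "G" means no proprietary module is loaded, and "L" only records that a soft lockup occurred.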
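To read a taint string such as the "G L" in the dmesg output above, each letter can be looked up in a table. The sketch below covers only a subset of the letters and is an illustration, not the kernel's own table; the name `decode_taint` is hypothetical, and the kernel's tainted-kernels documentation remains the authoritative list.

```python
# Subset of kernel taint letters (assumption: enough for this log;
# the full table is in the kernel's tainted-kernels documentation).
TAINT_FLAGS = {
    "G": "no proprietary module loaded",
    "P": "proprietary module loaded",
    "F": "module force-loaded",
    "D": "kernel died recently (oops or BUG)",
    "W": "kernel issued a warning",
    "C": "staging driver loaded",
    "O": "out-of-tree module loaded",
    "E": "unsigned module loaded",
    "L": "soft lockup occurred",
    "K": "kernel live-patched",
}


def decode_taint(flags):
    """Map each non-blank letter of a taint string to its meaning."""
    return [
        (letter, TAINT_FLAGS.get(letter, "not in this subset"))
        for letter in flags
        if not letter.isspace()
    ]


for letter, meaning in decode_taint("G L"):
    print(letter, "-", meaning)
```

For this log the result contains no module-related flag, which is why the taint mark alone does not implicate a third-party driver.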
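2. Possible causes
Based on the log, the following could have caused the soft lockup:
- Hardware problems: CPU overheating, memory errors, or hardware incompatibility.
- Kernel problems: a bug in this kernel version, or a misconfiguration (for example, experimental features enabled).
- Driver problems: a third-party driver (such as a GPU, storage, or network driver) may be unstable and crash the kernel.
- High load: the system load is high (load averages of 144.05, 56.30, and 37.88), which can leave tasks stuck.
- Configuration problems: kernel parameters or scheduler settings may not suit the current workload.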
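To put the load figures in context, the load averages can be divided by the number of online CPUs reported by `crash> sys` (256 CPUs, 1 offline). This is a rough sketch: the function name `load_per_cpu` is hypothetical, and since the Linux load average also counts tasks in uninterruptible sleep, the ratio is only a coarse indicator.

```python
def load_per_cpu(load_averages, total_cpus, offline_cpus):
    """Divide each load average by the number of online CPUs."""
    online = total_cpus - offline_cpus
    return [round(load / online, 3) for load in load_averages]


# Values taken from the `crash> sys` output: LOAD AVERAGE and CPUS lines.
ratios = load_per_cpu([144.05, 56.30, 37.88], total_cpus=256, offline_cpus=1)
print(ratios)
```

A ratio well below 1 per CPU would suggest that load alone did not oversubscribe the machine, although the jump in the 1-minute figure relative to the 5- and 15-minute figures shows that load was rising sharply just before the crash.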