ext4_inode_info double free
https://pms.uniontech.com/bug-view-233267.html
dmesg
dmesg.202312111431
warning
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 [201866.888630 ] WARNING: CPU: 2 PID: 67 at fs/dcache.c:338 dentry_free+0x24 /0xa8 [201866.888632 ] Modules linked in: tun uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media arc4 md4 sha512_generic sha512_arm64 nls_utf8 cifs (O) ccm dns_resolver uinput nfnetlink_queue nfnet link_log nfnetlink fuse dlp_fcore (O) bnep st bluetooth ecdh_generic rfkill vfs_monitor (O) clink_vhci_hcd (O) clink_usbip_core (O) aes_ce_blk crypto_simd cryptd nls_iso8859_1 aes_ce_cipher crct10dif_ce nls_cp437 ghash_ce aes_arm64 s ha2_ce sha256_arm64 sha1_ce hid_generic usbkbd usbhid snd_intel8x0 virtio_balloon snd_ac97_codec qemu_fw_cfg uos_resources (O) uos_bluetooth_connection_control (O) rdma_cm iw_cm ib_cm ib_core efivarfs ip_tables x_tables btrfs xor r aid6_pq virtio_blk virtio_net net_failover failover button virtio_mmio [201866.888856] CPU: 2 PID: 67 Comm: kswapd0 Kdump: loaded Tainted: G O 4.19.0-arm64-desktop #5312 [201866.888857] Hardware name: RDO OpenStack Compute, BIOS 0.0.0 02/06/2015 [201866.888860] pstate: 40c00005 (nZcv daif +PAN +UAO) [201866.888862] pc : dentry_free+0x24/0xa8 [201866.888863] lr : dentry_free+0x24/0xa8 [201866.888864] sp : ffff8003dd467a70 [201866.888865] x29: ffff8003dd467a70 x28: 000000000000040e [201866.888867] x27: ffff800069db41d8 x26: 0000000000000000 [201866.888868] x25: ffff8003dd467b70 x24: ffff800069db4180 [201866.888870] x23: ffff800069e49e70 x22: ffff800069db40c0 [201866.888871] x21: ffff800069db4118 x20: ffff800069db4000 [201866.888873] x19: ffff800069db40c0 x18: ffff000009824000 [201866.888874] x17: 0000000000000000 x16: 0000000000000000 [201866.888875] x15: 00000000fffffff0 x14: ffff000009959752 [201866.888880] x13: 0000000000061498 x12: ffff000009958000 [201866.888881] x11: ffff000009824000 x10: ffff000009958da8 [201866.888883] x9 : 0000000000000000 x8 : 0000000000000004 [201866.888884] x7 : ffff000009958000 x6 : 00000000000026ff [201866.888885] x5 : 0000000000000000 x4 : 0000000000000000 [201866.888886] x3 : 0000000000000000 x2 : ffff8003ffe4ac88 [201866.888888] x1 : 00008003f6995000 x0 : 0000000000000024 [201866.888890] Call trace: [201866.888892] dentry_free+0x24/0xa8 [201866.888894] __dentry_kill+0x148/0x1d0 [201866.888895] dentry_kill+0x1cc/0x270 [201866.888897] shrink_dentry_list+0x1d0/0x2c0 [201866.888899] prune_dcache_sb+0x44/0x58 [201866.888904] super_cache_scan+0xcc/0x160 [201866.888917] do_shrink_slab+0x188/0x320 [201866.888918] shrink_slab+0x1f8/0x2b0 [201866.888920] shrink_node+0xb4/0x480 [201866.888921] kswapd+0x3dc/0x760 [201866.888934] kthread+0x128/0x130 [201866.888940] ret_from_fork+0x10/0x18 [201866.888941] ---[ end trace 8c73588fb79b6328 ]---
warning出现前:
1 2 grep 'this file format is not right , so can not luopan ! ' dmesg.202312111431 -inr --color | wc -l 364
pc : dentry_free+0x24/0xa8
总共出现了146次,CPU 2
、CPU 6
调度过内核线程PID: 67 Comm: kswapd0
。
kswapd 是一个系统启动后一直运行的内核线程,它在系统运行期间监视内存使用情况,并根据需要执行内存回收和页面交换等操作。
oops
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 [201866.936912 ] Unable to handle kernel NULL pointer dereference at virtual address 00000000000000 d0 [201866.939427 ] Mem abort info: [201866.940177 ] ESR = 0x96000005 [201866.941018 ] Exception class = DABT (current EL), IL = 32 bits [201866.942561 ] SET = 0 , FnV = 0 [201866.943377 ] EA = 0 , S1PTW = 0 [201866.944163 ] Data abort info: [201866.944995 ] ISV = 0 , ISS = 0x00000005 [201866.946019 ] CM = 0 , WnR = 0 [201866.946820 ] user pgtable: 4 k pages, 48 -bit VAs, pgdp = 000000009b ba10c9 [201866.948553 ] [00000000000000 d0] pgd=000000014 eba4003, pud=0000000000000000 [201866.950597 ] Internal error: Oops: 96000005 [#1 ] SMP [201866.951888 ] Modules linked in: tun uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media arc4 md4 sha512_generic sha512_arm64 nls_utf8 cifs(O) ccm dns_resolver uinput nfnetlink_queue nfnetlink_log nfnetlink fuse dlp_fcore(O) bnep st bluetooth ecdh_generic rfkill vfs_monitor(O) clink_vhci_hcd(O) clink_usbip_core(O) aes_ce_blk crypto_simd cryptd nls_iso8859_1 aes_ce_cipher crct10dif_ce nls_cp437 ghash_ce aes_arm64 sha2_ce sha256_arm64 sha1_ce hid_generic usbkbd usbhid snd_intel8x0 virtio_balloon snd_ac97_codec qemu_fw_cfg uos_resources(O) uos_bluetooth_connection_control(O) rdma_cm iw_cm ib_cm ib_core efivarfs ip_tables x_tables btrfs xor raid6_pq virtio_blk virtio_net net_failover failover button virtio_mmio [201866.971480 ] Process Isolated Web Co (pid: 29060 , stack limit = 0x0000000033ac522b ) [201866.971484 ] CPU: 6 PID: 29060 Comm: Isolated Web Co Kdump: loaded Tainted: G W O 4.19 .0 -arm64-desktop #5312 [201866.971485 ] Hardware name: RDO OpenStack Compute, BIOS 0.0 .0 02 /06 /2015 [201866.971487 ] pstate: a0400005 (NzCv daif +PAN -UAO) [201866.971499 ] pc : kmem_cache_free+0xfc /0x1c8 [201866.971518 ] lr : ext4_i_callback+0x18 /0x20 [201866.971519 ] sp : ffff8003ffec1640 [201866.971520 ] x29: ffff8003ffec1640 x28: 000000000000000 a [201866.971522 ] x27: ffff8003ffec1700 x26: ffff000009809810 [201866.971523 ] x25: 00008003f 6a05000 x24: ffff0000098096d8 [201866.971525 ] x23: ffff0000094ac018 x22: ffff000009825680 [201866.971526 ] x21: ffff8003dd4a9980 x20: ffff000008383800 [201866.971527 ] x19: ffff8000791ac380 x18: 0000000000000000 [201866.971529 ] x17: 0000000000000000 x16: 0000000000000000 [201866.971530 ] x15: 0000000000000000 x14: 0000000000000000 [201866.971531 ] x13: ffff000008c17740 x12: ffff000008c17748 [201866.971533 ] x11: 0000000000000092 x10: 0000000000693e94 [201866.971534 ] x9 : 0000000000000004 x8 : 00000003f bd70000 [201866.971536 ] x7 : 0000000000000018 x6 : ffff000009984d20 [201866.971537 ] x5 : ffff000009984d20 x4 : 00000000f 0000000 [201866.971538 ] x3 : ffff7e0001e46c00 x2 : ffffffffffffffff [201866.971539 ] x1 : 0000000000000000 x0 : 0000000000000000 [201866.971541 ] Call trace: [201866.971544 ] kmem_cache_free+0xfc /0x1c8 [201866.971546 ] ext4_i_callback+0x18 /0x20 [201866.971561 ] rcu_process_callbacks+0x2d8 /0x500 [201866.971563 ] __do_softirq+0x110 /0x2e8 [201866.971573 ] irq_exit+0x9c /0xb8 [201866.971578 ] __handle_domain_irq+0x64 /0xb8 [201866.971580 ] gic_handle_irq+0x7c /0x178 [201866.971583 ] el1_irq+0xb0 /0x140 [201867.017634 ] clear_page+0x10 /0x24 [201867.017644 ] clear_subpage+0x3c /0x58 [201867.019834 ] clear_huge_page+0x80 /0x218 [201867.021046 ] do_huge_pmd_anonymous_page+0x180 /0x770 [201867.022532 ] __handle_mm_fault+0x788 /0x908 [201867.023764 ] handle_mm_fault+0xec /0x1b0 [201867.024943 ] do_page_fault+0x164 /0x480 [201867.026089 ] do_translation_fault+0x58 /0x60 [201867.027359 ] do_mem_abort+0x3c /0xd0 [201867.028442 ] el0_da+0x20 /0x24 [201867.029366 ] Code: 9 a820000 f9400c00 eb0002bf 54f ff980 (f9406801)
初步定位
1 2 3 4 5 ./scripts/faddr2line vmlinux kmem_cache_free+0xfc /0x1c8 kmem_cache_free+0xfc /0x1c8 : slab_equal_or_root 于 mm/slab.h:229 (已内连入)cache_from_obj 于 mm/slab.h:375 (已内连入)kmem_cache_free 于 mm/slub.c:2966
1 objdump -d -l -S mm/slub.o > slub.objdump
1 2 3 4 5 6 7 8 5246 slab_equal_or_root(): 5247 /data3/home/yuanqiliang/code/arm-kernel/mm/slab.h:229 5248 return p == s || p == s->memcg_params.root_cache;5249 238 c: eb0002bf cmp x21, x05250 2390 : 54f ff980 b.eq 22 c0 <kmem_cache_free+0x28 > 5251 2394 : f9406801 ldr x1, [x0, #208 ] 5252 2398 : eb0102bf cmp x21, x1 5253 239 c: 540004e0 b.eq 2438 <kmem_cache_free+0x1a0 >
1 2 p ((struct kmem_cache *)0 )->memcg_params Cannot access memory at address 0 xd0
1 2 p &((struct kmem_cache *)0 )->memcg_params$1 = (struct memcg_cache_params *) 0 xd0
源码
1 git clone "http://gerrit.uniontech.com/kernel/arm-kernel"
1 2 git log --oneline | head -n1 b3ea87eb1d45 elfverify: stricter security check for executable section
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 79 82 struct kmem_cache { 83 struct kmem_cache_cpu __percpu *cpu_slab ; 84 85 slab_flags_t flags; 86 unsigned long min_partial; 87 unsigned int size; 88 unsigned int object_size; 89 unsigned int offset; 90 #ifdef CONFIG_SLUB_CPU_PARTIAL 91 92 unsigned int cpu_partial; 93 #endif 94 struct kmem_cache_order_objects oo ; 95 96 97 struct kmem_cache_order_objects max ; 98 struct kmem_cache_order_objects min ; 99 gfp_t allocflags; 100 int refcount; 101 void (*ctor)(void *); 102 unsigned int inuse; 103 unsigned int align; 104 unsigned int red_left_pad; 105 const char *name; 106 struct list_head list ; 107 #ifdef CONFIG_SYSFS 108 struct kobject kobj ; 109 struct work_struct kobj_remove_work ; 110 #endif 111 #ifdef CONFIG_MEMCG 112 struct memcg_cache_params memcg_params ;
1 2 3 ./scripts/faddr2line vmlinux ext4_i_callback+0x18 /0x20 ext4_i_callback+0x18 /0x20 : ext4_i_callback 于 fs/ext4/super.c:1114
1 2 3 4 5 1110 static void ext4_i_callback (struct rcu_head *head) 1111 {1112 struct inode *inode = container_of(head, struct inode, i_rcu);1113 kmem_cache_free(ext4_inode_cachep, EXT4_I(inode)); 1114 }
1 21 #define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))
在 Ext4 文件系统中,每个文件或目录对应于一个唯一的 inode。 每个 inode 在内核中都有一个对应的 struct ext4_inode_info 结构体。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 940 943 struct ext4_inode_info { 944 __le32 i_data[15 ]; 945 __u32 i_dtime; 946 ext4_fsblk_t i_file_acl; ...... 1001 1009 struct rw_semaphore i_mmap_sem ;1010 struct inode vfs_inode ;1011 struct jbd2_inode *jinode ;
1 2 3 4 5 6 7 8 9 2964 void kmem_cache_free (struct kmem_cache *s, void *x) 2965 {2966 s = cache_from_obj(s, x);2967 if (!s)2968 return ;2969 slab_free(s, virt_to_head_page(x), x, NULL , 1 , _RET_IP_);2970 trace_kmem_cache_free(_RET_IP_, x);2971 }2972 EXPORT_SYMBOL(kmem_cache_free);
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 357 static inline struct kmem_cache *cache_from_obj (struct kmem_cache *s, void *x) 358 {359 struct kmem_cache *cachep ;360 struct page *page ;361 362 369 if (!memcg_kmem_enabled() &&370 !unlikely(s->flags & SLAB_CONSISTENCY_CHECKS))371 return s;372 373 page = virt_to_head_page(x); 374 cachep = page->slab_cache; 375 if (slab_equal_or_root(cachep, s))376 return cachep;377 378 pr_err("%s: Wrong slab cache. %s but object is from %s\n" ,379 __func__, s->name, cachep->name);380 WARN_ON_ONCE(1 );381 return s; 382 }
1 2 3 4 5 6 651 static inline struct page *virt_to_head_page (const void *x) 652 {653 struct page *page = virt_to_page(x);654 655 return compound_head(page);656 }
1 73 #define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
1 108 # define pfn_to_page(pfn) (vmem_map + (pfn))
1 2 3 4 5 226 static inline bool slab_equal_or_root (struct kmem_cache *s, 227 struct kmem_cache *p) 228 {229 return p == s || p == s->memcg_params.root_cache; 230 }
问题基本确定了,这是一个double free的问题。
ext4_i_callback
函数是 inode_operations 结构中的一个回调函数,定义在 fs/ext4/inode.c 文件中。这个函数在 inode 被释放时被调用,用于执行一些清理和处理工作。
如果多个线程同时尝试删除同一个文件,而删除文件的过程中会触发 ext4_i_callback,那么由于竞态条件,可能导致 double free 错误。
ext4_inode_info是ext4文件系统挂载的时候生成的,不掉盘的话ext4_inode_info对应的kmem_cache不可能指向null。
这里给出bpftrace参考语句:
1 sudo bpftrace -e 'tracepoint:syscalls:sys_enter_unlink { printf("%s deleted by command ' %s' with pid %d\n", str(args->pathname), comm, pid);} tracepoint:syscalls:sys_enter_unlinkat { printf("%s deleted by command ' %s' with pid %d\n", str(args->pathname), comm, pid);}'
尝试修复
ext4日志
crash
1 crash /lib/debug/vmlinux-4.19.0-arm64-desktop-tyy-5312 /home/uos/202312111431/dump.202312111431
ps
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 crash> ps PID PPID CPU TASK ST %MEM VSZ RSS COMM > 0 0 0 ffff0000098131c0 RU 0.0 0 0 [swapper/0] 0 0 1 ffff8003e85a2700 RU 0.0 0 0 [swapper/1] 0 0 2 ffff8003e85a3400 RU 0.0 0 0 [swapper/2] > 0 0 3 ffff8003e85a4100 RU 0.0 0 0 [swapper/3] 0 0 4 ffff8003e85a4e00 RU 0.0 0 0 [swapper/4] 0 0 5 ffff8003e85a5b00 RU 0.0 0 0 [swapper/5] 0 0 6 ffff8003e85a6800 RU 0.0 0 0 [swapper/6] > 0 0 7 ffff8003e85d8000 RU 0.0 0 0 [swapper/7] > 403 -1 2 ffff8003e80b4100 RU 0.6 214476 111464 systemd-journal > 961 -1 1 ffff800156d0db00 RU 0.2 44032 29004 clink-iotop > 1004 -1 5 ffff80006cf93400 RU 0.0 6620 3792 smartctl > 5310 -1 4 ffff8000306e2700 RU 1.1 4304016 190780 deepin-voice-no > 29060 -1 6 ffff800035462700 RU 3.6 3060304 633720 Isolated Web Co
Firefox中的进程Isolated Web Co占用内存过多,且还在请求巨型匿名页面。
bt
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 crash> bt -l PID: 29060 TASK: ffff800035462700 CPU: 6 COMMAND: "Isolated Web Co" /data3/home/yuanqiliang/code/arm-kernel/kernel/kexec_core.c: 977 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/kernel/traps.c: 202 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 258 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 286 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 588 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 597 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 728 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/kernel/entry.S: 562 PC: ffff000008268b5c [kmem_cache_free+252] LR: ffff000008383800 [ext4_i_callback+24] SP: ffff8003ffec1640 PSTATE: a0400005 X29: ffff8003ffec1640 X28: 000000000000000a X27: ffff8003ffec1700 X26: ffff000009809810 X25: 00008003f6a05000 X24: ffff0000098096d8 X23: ffff0000094ac018 X22: ffff000009825680 X21: ffff8003dd4a9980 X20: ffff000008383800 X19: ffff8000791ac380 X18: 0000000000000000 X17: 0000000000000000 X16: 0000000000000000 X15: 0000000000000000 X14: 0000000000000000 X13: ffff000008c17740 X12: ffff000008c17748 X11: 0000000000000092 X10: 0000000000693e94 X9: 0000000000000004 X8: 00000003fbd70000 X7: 0000000000000018 X6: ffff000009984d20 X5: ffff000009984d20 X4: 00000000f0000000 X3: ffff7e0001e46c00 X2: ffffffffffffffff X1: 0000000000000000 X0: 0000000000000000 /data3/home/yuanqiliang/code/arm-kernel/mm/slab.h: 229 /data3/home/yuanqiliang/code/arm-kernel/mm/slab.h: 229 /data3/home/yuanqiliang/code/arm-kernel/fs/ext4/super.c: 1113 /data3/home/yuanqiliang/code/arm-kernel/kernel/rcu/rcu.h: 236 /data3/home/yuanqiliang/code/arm-kernel/kernel/softirq.c: 292 /data3/home/yuanqiliang/code/arm-kernel/./include/linux/interrupt.h: 504 /data3/home/yuanqiliang/code/arm-kernel/kernel/irq/irqdesc.c: 685 /data3/home/yuanqiliang/code/arm-kernel/./include/linux/irqdesc.h: 173 --- <IRQ stack> --- /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/kernel/entry.S: 611 PC: ffff000008acb250 [clear_page+16] LR: ffff0000080a1cc4 [__cpu_clear_user_page+12] SP: ffff8001dd367b40 PSTATE: 80400005 X29: ffff8001dd367b40 X28: ffff8003df6234e0 X27: 0000000000000002 X26: 0000000000000000 X25: 0000000000000000 X24: 0000000000000000 X23: ffff7e0002310000 X22: 0000000000001000 X21: 0000ffff7c600000 X20: 0000000000000000 X19: ffff800035462700 X18: 0000000000000000 X17: 0000000000000000 X16: 0000000000000000 X15: 0000000000000400 X14: 0000000000000400 X13: 00000000000003f9 X12: 0000000000000001 X11: 0000000000000001 X10: 00000000000008e0 X9: ffff8001dd367ab0 X8: ffff800035463040 X7: ffff8003ffeb17b8 X6: ffff8003e1a3f900 X5: 000000000000a7f6 X4: ffff000009809000 X3: 0000000000004600 X2: 0000000000000004 X1: 0000000000000040 X0: ffff80008c518000 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/lib/clear_page.S: 23 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/lib/clear_page.S: 21 /data3/home/yuanqiliang/code/arm-kernel/./include/linux/highmem.h: 137 /data3/home/yuanqiliang/code/arm-kernel/mm/memory.c: 4679 /data3/home/yuanqiliang/code/arm-kernel/mm/huge_memory.c: 578 /data3/home/yuanqiliang/code/arm-kernel/mm/memory.c: 3942 /data3/home/yuanqiliang/code/arm-kernel/mm/memory.c: 4212 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 399 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 597 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/mm/fault.c: 728 /data3/home/yuanqiliang/code/arm-kernel/arch/arm64/kernel/entry.S: 724 PC: 0000ffff990e9548 LR: 0000aaaac18625f8 SP: 0000ffffe16bba40 X29: 0000ffffe16bd560 X28: 0000000000000000 X27: 0000ffff66a29318 X26: 0000000000000001 X25: 0000000000000001 X24: 0000ffff726478d0 X23: 0000ffff7c600000 X22: 0000000000200000 X21: 0000000000100000 X20: 0000ffff98e00000 X19: 0000ffff80000000 X18: 0000000000000000 X17: 0000ffff990e9450 X16: 0000aaaac190c020 X15: 0000ffff80000000 X14: 0000000000000000 X13: ff000178005d4748 X12: 0000000000000000 X11: 0000000004d00000 X10: 00000000039e8000 X9: 0aa90b643fdb4900 X8: 0aa90b643fdb4900 X7: 3133303131333230 X6: 32000000110000ff X5: 0000ffff7c700000 X4: 0000ffff80100000 X3: 0000ffff7c600000 X2: 0000000000100000 X1: 0000ffff80000000 X0: 0000ffff7c600000 ORIG_X0: 0000ffff7c800000 SYSCALLNO: ffffffff PSTATE: 20001000
mount
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 crash> mount MOUNT SUPERBLK TYPE DEVNAME DIRNAME ffff8003e80a4000 ffff8003e80b7000 rootfs rootfs / ffff80039444e000 ffff800394450000 sysfs sysfs /sys ffff8003e80a4c00 ffff8003e80da800 proc proc /proc ffff8003944a0000 ffff8003e7ce0000 devtmpfs udev /dev ffff8003e80a4d80 ffff800394b2c000 devpts devpts /dev/pts ffff80039452c000 ffff800394538000 tmpfs tmpfs /run ffff80039452c180 ffff80039453d000 ext4 /dev/vda2 / ffff80039452c300 ffff8003e80db800 securityfs securityfs /sys/kernel/security ffff80039452c600 ffff80039453f000 tmpfs tmpfs /dev/shm ffff80039452c480 ffff80039453f800 tmpfs tmpfs /run/lock ffff80039452c780 ffff80039453c800 tmpfs tmpfs /sys/fs/cgroup ffff80039452c900 ffff800393f75800 cgroup2 cgroup2 /sys/fs/cgroup/unified ffff80039452ca80 ffff800393f74800 cgroup cgroup /sys/fs/cgroup/systemd ffff80039452cc00 ffff800393f73800 pstore pstore /sys/fs/pstore ffff80039452cd80 ffff800393f75000 efivarfs efivarfs /sys/firmware/efi/efivars ffff80039452cf00 ffff800393f70800 bpf bpf /sys/fs/bpf ffff80039452d080 ffff800393f71800 cgroup cgroup /sys/fs/cgroup/net_cls,net_prio ffff80039452d200 ffff800393f70000 cgroup cgroup /sys/fs/cgroup/cpu,cpuacct ffff80039452d380 ffff800393f77800 cgroup cgroup /sys/fs/cgroup/blkio ffff80039452d500 ffff800393f74000 cgroup cgroup /sys/fs/cgroup/cpuset ffff80039452d680 ffff800393f72800 cgroup cgroup /sys/fs/cgroup/hugetlb ffff80039452d800 ffff800393f77000 cgroup cgroup /sys/fs/cgroup/memory ffff80039452d980 ffff800393f76800 cgroup cgroup /sys/fs/cgroup/freezer ffff80039452db00 ffff800393f76000 cgroup cgroup /sys/fs/cgroup/pids ffff80039452dc80 ffff800393f73000 cgroup cgroup /sys/fs/cgroup/devices ffff80039452de00 ffff800393f72000 cgroup cgroup /sys/fs/cgroup/rdma ffff8003dcf82000 ffff8003935f9800 autofs systemd-1 /proc/sys/fs/binfmt_misc ffff8003944a0180 ffff8003dd0c0800 mqueue mqueue /dev/mqueue ffff8003dcf82180 ffff8003e80dc000 debugfs debugfs /sys/kernel/debug ffff8003e7cc8d80 ffff8003931e9800 hugetlbfs hugetlbfs /dev/hugepages ffff8003e16d2000 ffff8003dd0c1800 configfs configfs /sys/kernel/config ffff8003dd0b4600 ffff8003947da000 squashfs /dev/loop0 /snap/core22/586 ffff8003e16d2180 ffff8003946b5800 squashfs /dev/loop1 /snap/core/15515 ffff8003e7cc8f00 ffff800393a2e800 squashfs /dev/loop2 /snap/core/16204 ffff8003944a0300 ffff800393a96000 squashfs /dev/loop3 /snap/lxd/24646 ffff8003e16d2900 ffff8003946b3000 squashfs /dev/loop4 /snap/core/15425 ffff8003944a0600 ffff8003df4c0800 vfat /dev/vda1 /boot/efi ffff8003944f3c80 ffff8003944bb000 binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc/ ffff8003dd0b4a80 ffff800393a80800 ext4 /dev/vdb /mnt/vdb ffff8003e090f800 ffff800394538000 tmpfs tmpfs /run/snapd/ns ffff8003dfb55e00 ffff8003e80da000 nsfs nsfs /run/snapd/ns/snapd/ns/lxd.mnt ffff8003e80a4f00 ffff8003e04b1800 tmpfs tmpfs /run/user/1000 ffff8003a4613080 ffff80037a40b800 fusectl fusectl /sys/fs/fuse/connections ffff8003a790e900 ffff8003e04b6800 fuse gvfsd-fuse /run/user/1000/gvfs ffff8002c2e08f00 ffff80012834c000 cifs //127.0.0.1/WANGZHIQUAN-PC C /media/WANGZHIQUAN-PC C
swap
1 2 crash> swap SWAP_INFO_STRUCT TYPE SIZE USED PCT PRI FILENAME
kmem
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 crash> kmem -i PAGES TOTAL PERCENTAGE TOTAL MEM 3967057 15.1 GB ---- FREE 28799 112.5 MB 0% of TOTAL MEM USED 3938258 15 GB 99% of TOTAL MEM SHARED 743953 2.8 GB 18% of TOTAL MEM BUFFERS 102855 401.8 MB 2% of TOTAL MEM CACHED 1387497 5.3 GB 34% of TOTAL MEM SLAB 199579 779.6 MB 5% of TOTAL MEM TOTAL HUGE 0 0 ---- HUGE FREE 0 0 0% of TOTAL HUGE TOTAL SWAP 0 0 ---- SWAP USED 0 0 0% of TOTAL SWAP SWAP FREE 0 0 0% of TOTAL SWAP COMMIT LIMIT 1983528 7.6 GB ---- COMMITTED 7399654 28.2 GB 373% of TOTAL LIMIT
这些信息反映了系统当前的内存使用情况,包括空闲内存、已用内存、缓存等。请注意,COMMITTED 显示的内存值高于 COMMIT LIMIT,这可能表明系统当前存在一些超过物理内存和交换空间总和的内存压力。在这种情况下,系统可能会开始使用 OOM(Out of Memory)策略,例如终止某些进程来释放内存。
在 crash 输出中,COMMIT LIMIT 和 COMMITTED 是有关系统内存提交(commit)的两个关键指标。
COMMIT LIMIT:
COMMIT LIMIT 表示系统在当前配置下允许的最大内存提交限制,即系统允许分配给进程的虚拟内存和交换空间的总和。
在这个例子中,COMMIT LIMIT 为 7.6 GB,表示系统限制了进程可以使用的总虚拟内存和交换空间的量。
COMMITTED:
COMMITTED 表示当前已经由系统分配和使用的虚拟内存和交换空间的总和。
在这个例子中,COMMITTED 为 28.2 GB,表示系统当前已经分配并使用的总虚拟内存和交换空间的量。
分析:
COMMITTED 大于 COMMIT LIMIT,这表明系统当前分配的虚拟内存和交换空间总量已经超过了系统的限制。
当 COMMITTED 接近或超过 COMMIT LIMIT 时,系统可能会面临内存压力,并且可能会采取 OOM(Out of Memory)策略,例如终止某些进程以释放内存。
这种情况下,你可能需要进一步分析系统的进程、内存使用情况以及是否存在内存泄漏等问题,以找出为何 COMMITTED 超过了 COMMIT LIMIT 的原因。
资源耗尽导致slub回收崩溃
SLUB回收过程需要分配和释放内存对象,以及对内存数据结构进行操作。如果系统内存已经耗尽,无法满足回收过程的内存需求,就会导致回收过程失败或异常。
此外,SLUB回收可能还涉及到其他系统资源,例如锁、CPU时间等。如果这些资源也耗尽,可能会导致回收过程无法继续进行,从而引发崩溃。
在进行SLUB回收压力测试时,如果没有足够的资源可用,例如内存资源不足,可能会导致回收过程无法正常进行,从而出现崩溃或异常行为。
为了预防资源耗尽引发SLUB回收崩溃,可以采取以下措施:
资源评估和监控:在进行压力测试之前,评估系统资源的可用性,并监控资源的使用情况。确保有足够的可用资源来执行回收过程。
调整资源限制:根据压力测试的需求,可能需要调整系统的资源限制,例如增加可用内存大小、调整锁的数量或调度策略等,以确保回收过程有足够的资源可用。
优化回收策略:对SLUB回收策略进行优化,例如调整回收频率、调整回收算法等,以减少对资源的需求或提高资源利用效率。
合理设计测试用例:设计合理的测试用例,确保测试过程中的资源使用情况在可控范围内,并避免过度消耗系统资源的操作。
overcommit_memory
1 cat /proc/sys/vm/overcommit_memory
输出的值有以下几种可能:
0 – Heuristic overcommit handling. 这是缺省值,它允许overcommit,但过于明目张胆的overcommit会被拒绝,比如malloc一次性申请的内存大小就超过了系统总内存。Heuristic的意思是“试 探式的”,内核利用某种算法猜测你的内存申请是否合理,它认为不合理就会拒绝overcommit。
1 – Always overcommit. 允许overcommit,对内存申请来者不拒。内核执行无内存过量使用处理。使用这个设置会增大内存超载的可能性,但也可以增强大量使用内存任务的性能。
2 – Don’t overcommit. 禁止overcommit。 内存拒绝等于或者大于总可用 swap 大小以及overcommit_ratio 指定的物理 RAM 比例的内存请求。如果希望减小内存过度使用的风险,这个设置就是最好的。
overcommit_ratio
overcommit_ratio、overcommit_kbytes在overcommit_memory=2时才有用,overcommit_kbytes非0时,overcommit_ratio无效。
1 cat /proc/sys/vm/overcommit_ratio
overcommit_kbytes
1 cat /proc/sys/vm/overcommit_kbytes
当前机器overcommit_memory配置
1 2 3 4 5 6 cat /proc/sys/vm/overcommit_memory 0cat /proc/sys/vm/overcommit_ratio 50cat /proc/sys/vm/overcommit_kbytes 0
单次申请的内存大小不能超过以下值,否则本次申请就会失败。
free memory + free swap + pagecache的大小 + SLAB
Linux对大部分申请内存的请求都回复"yes",以便能跑更多更大的程序。因为申请内存后,并不会马上使用内存。这种技术叫做Overcommit。当linux发现内存不足时,会发生OOM killer(OOM=out-of-memory)。它会选择杀死一些进程(用户态进程,不是内核线程),以便释放内存。
当oom-killer发生时,linux会选择杀死哪些进程?选择进程的函数是oom_badness函数(在mm/oom_kill.c中),该函数会计算每个进程的点数(0~1000)。点数越高,这个进程越有可能被杀死。每个进程的点数跟oom_score_adj有关,而且oom_score_adj可以被设置(-1000最低,1000最高)。
min_free_kbytes
1 2 cat /proc/sys/vm/min_free_kbytes 67584
通常情况下 WMARK_LOW 的值是 WMARK_MIN 的 1.25 倍,WMARK_HIGH 的值是 WMARK_LOW 的 1.5 倍。而 WMARK_MIN 的数值就是由这个内核参数 min_free_kbytes 来决定的。
WMARK_MIN = 67584KB / 1024 = 66MB
WMARK_LOW = 66MB * 1.25 = 82.5MB
WMARK_HIGH = 82.5MB * 1.5 = 123.75MB
当前free memory等于112.5MB,在 WMARK_LOW 与 WMARK_HIGH 之间,内存在正常范围内,内存回收kswpd、内存规整kcompactd被正常的周期性调度,故不会触发 OOM。
当可用物理内存低于 WMARK_MIN 会触发下方操作:
内核也不会直接开始 OOM,而是进入到重试流程,在重试流程开始之前内核需要调用 should_reclaim_retry 判断是否应该进行重试,重试标准:
如果内核已经重试了 MAX_RECLAIM_RETRIES (16) 次仍然失败,则放弃重试执行后续 OOM。
如果内核将所有可选内存区域中的所有可回收页面全部回收之后,仍然无法满足内存的分配,那么放弃重试执行后续 OOM。
__alloc_pages
初步结论
没有使用交换分区。
当前机器内核参数kernel.panic = 0
kernel.panic_on_oops = 1
。
内核发现内存不足时可以通过软中断触发内存回收kswapd、内存规整kcompactd被调度,kswapd会调用shrink_slab进行slab回收。
本次oops中是ext4_inode_info被删除了,也就是某个文件被删除了,导致ext4_i_callback回调函数被调用,ext4_i_callback会调用kmem_cache_free释放slab中的ext4_inode_info对象,发生了double free,导致了oops,进一步导致panic,应该是kernel.panic_on_oops=1导致kdump触发。
共计分析了4个kdump,kmem -i
都存在内核耗尽,基本都只剩下100M了,每次oops的堆栈都跟回收slub缓存相关。
用户的使用场景应该是在不断下载文件并删除文件,同时在使用firefox、deepin-voice-note等。
疑点
某个第三方模块多个线程删除同一个文件时存在竞态。
某个第三方模块打开多个文件未关闭,导致cache占用很多,导致cache内存无法回收。
故接下来排查方向需要为啥为啥slub回收的时候double free。
ext4文件系统稳定性问题?
进一步排查
kdump
在grub中调整crashkernel参数,看是否能获取完整的kdump:
1 crashkernel=2G-4G:320M,4G-32G:512M,32G-64G:1024M,64G-128G:2048M,128G-:4096M
ext4 journal
可以到系统底层的宿主机上执行一下ps aux | grep -i qemu,我想看看ext4文件系统对应的磁盘是通过virtio or vfio 挂载上去的?
ext4_inode_info是ext4文件系统挂载的时候生成的,不掉盘的话ext4_inode_info对应的kmem_cache不可能指向null。
grub中更改ext4根文件系统日志为journal模式:
slub_debug
打下slub_debug相关内核编译选项:
1 2 3 4 5 6 7 CONFIG_SLUB=y CONFIG_SLUB_DEBUG=y CONFIG_SLUB_DEBUG_ON=y CONFIG_SLUB_STATS=y
grub中增加slub_debug=UFPZ参数。
hcache
使用hcache查看Buffer&Cache使用率高的进程:
bpftrace
bpftrace追踪ext4磁盘稳定性问题。
合入patch
临时优化方案
启用swap分区:
1 2 3 4 sudo dd if =/dev/zero of=/tmp/swap bs=1M count=32768 sudo chmod 600 /tmp/swap sudo mkswap /tmp/swap sudo swapon /tmp/swap
1 2 3 4 5 6 7 8 9 sudo cat << EOF >> /etc/sysctl.conf vm.dirty_background_ratio=3 # 当脏页面占系统内存的3%时触发后台写回 vm.dirty_ratio=80 # 当脏页面占系统内存的80%时停止用户进程,开始写回 vm.dirty_writeback_centisecs=100 # 每100毫秒检查一次是否需要写回脏页面 vm.nr_hugepages=128 # 系统中的大页面数量设置为128 vm.overcommit_memory=2 # 启用严格的内存分配机制 vm.overcommit_ratio=90 # 允许系统内存超过物理内存大小的80% net.ipv4.tcp_fastopen=3 # 启用 TCP Fast Open 并使用 Cookie 来防止连接污染攻击 EOF