SUSE Hack Week: Kernel live dump

It is possible to run crash on a live system, but this has some drawbacks:

- not all of its features are available (e.g. inspecting the stacks of tasks),
- crash may be intrusive (e.g. the wr command, which writes to memory), which is dangerous on production systems,
- the time window for a live session may be limited.

For userspace programs there is the gcore utility (based on ptrace), which can take a core dump of a running program for deferred analysis.

Explore the possibilities of implementing live dumping for the kernel and attempt a live dump implementation.

Introduce: Live Dump


Looking for hackers with the skills:

kdump crash

This project is part of:

Hack Week 18

Activity

over 6 years ago: mkoutny originated this project.

Similar Projects

kdump

early stage kdump support by mbrugger

Project Description

When we experience an early boot crash, we cannot analyze a kernel dump, because user-space has not yet been able to load the crash system. The idea is to compile the crash system into the host kernel (think of initramfs) so that we can create a kernel dump really early in the boot process.

Goal for the Hackweeks

Investigate if this is possible and the implications it would have (done in HW21)
Hack up a PoC (done in HW22 and HW23)
Prepare an RFC series (given it's only one week, we are entering wishful thinking territory here).

update HW23

I was able to include the crash kernel into the kernel Image.
I'll need to find a way to load that from init/main.c:start_kernel(), probably after kcsan_init().
A workaround for a smoke test was to hack the kexec_file_load() system call, which has two problems:
1. My initramfs in the production kernel does not have a new enough kexec version. That's not a blocker, but it is where the week ended.
2. As the crash kernel is part of init.data, it will already be stale by the time I can call kexec_file_load() from user-space.

The solution is probably to rewrite the PoC so that the invocation can be done from init.text (that's my theory), but I'm not sure whether I can reuse the in-kernel kexec infrastructure from there, which I rely on heavily.

update HW24

Day 1
- rebased on v6.12 with no problems other than me breaking the config
- setting up a new compilation and qemu/virtme env
- getting desperate as nothing works that used to work
Day 2
- managed to invoke the loading of the early kernel from __init code after kcsan_init()
Day 3
- fix problem of memdup not being able to alloc so much memory... use 64K page sizes for now
- code refactoring
- I'm now able to load the crash kernel
- When using virtme I can boot into the crash kernel (major milestone!), although it doesn't boot completely: it crashes in elfcorehdr_read_notes()
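
The allocation failure noted on Day 3 is consistent with the kernel's cap on physically contiguous allocations. As a back-of-the-envelope sketch only (not the author's analysis): assuming the largest buddy-allocator block is PAGE_SIZE << MAX_PAGE_ORDER, and assuming the arm64 default maximum orders of 10 for 4K pages and 13 for 64K pages, switching page sizes raises the limit considerably. The order table and the 30 MiB image size are assumptions for illustration; check the actual .config.

```python
# Back-of-the-envelope limit for one physically contiguous allocation.
# Assumption: largest buddy block = PAGE_SIZE << MAX_PAGE_ORDER.
# Assumption: the orders below are arm64 Kconfig defaults, not values
# taken from the project's configuration.
ASSUMED_MAX_ORDER = {4096: 10, 16384: 11, 65536: 13}


def max_contiguous_bytes(page_size: int) -> int:
    """Largest contiguous allocation for the given page size, in bytes."""
    return page_size << ASSUMED_MAX_ORDER[page_size]


def fits(image_bytes: int, page_size: int) -> bool:
    """True if an image of image_bytes could be duplicated in one block."""
    return image_bytes <= max_contiguous_bytes(page_size)


if __name__ == "__main__":
    image = 30 * 1024 * 1024  # hypothetical embedded crash kernel size
    for page_size in sorted(ASSUMED_MAX_ORDER):
        limit_mib = max_contiguous_bytes(page_size) // (1024 * 1024)
        print(f"{page_size // 1024}K pages: limit {limit_mib} MiB, "
              f"image fits: {fits(image, page_size)}")
```

Under these assumptions a multi-megabyte image cannot be duplicated in one block with 4K pages, which would explain why 64K pages work as a stopgap.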
Day 4
- crash system crashes (no pun intended) in copy_oldmem_page(); will need to understand elfcorehdr...
- call path vmcore_init() -> parse_crash_elf_headers() -> elfcorehdr_read() -> read_from_oldmem() -> copy_oldmem_page() -> copy_to_iter()
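
For orientation, the elfcorehdr that this call path reads from old memory is an ELF core header: an ELF header followed by program headers, typically PT_NOTE entries for the crash notes and PT_LOAD entries for the RAM ranges of the crashed kernel. The sketch below parses such a header in user space. It assumes a little-endian ELF64 image; build_synthetic_header() and all addresses are hypothetical test data, not anything produced by the PoC.

```python
import struct
from typing import List, NamedTuple, Tuple

# ELF64 little-endian layouts: Elf64_Ehdr is 64 bytes, Elf64_Phdr is 56 bytes.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")
PHDR = struct.Struct("<IIQQQQQQ")
ET_CORE = 4
PT_LOAD, PT_NOTE = 1, 4
EM_AARCH64 = 183


class Segment(NamedTuple):
    kind: str    # "NOTE" or "LOAD"
    offset: int  # p_offset
    paddr: int   # p_paddr
    size: int    # p_memsz


def parse_core_header(blob: bytes) -> List[Segment]:
    """Return the PT_NOTE and PT_LOAD entries of an ELF64 core header."""
    fields = EHDR.unpack_from(blob, 0)
    ident, e_type = fields[0], fields[1]
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 image")
    if e_type != ET_CORE:
        raise ValueError("not an ELF core header")
    names = {PT_NOTE: "NOTE", PT_LOAD: "LOAD"}
    segments = []
    for i in range(e_phnum):
        p_type, _, p_offset, _, p_paddr, _, p_memsz, _ = PHDR.unpack_from(
            blob, e_phoff + i * e_phentsize)
        if p_type in names:
            segments.append(Segment(names[p_type], p_offset, p_paddr, p_memsz))
    return segments


def build_synthetic_header(notes: Tuple[int, int],
                           ram: List[Tuple[int, int]]) -> bytes:
    """Hypothetical helper: fabricate a core header for testing the parser."""
    entries = [(PT_NOTE, notes)] + [(PT_LOAD, r) for r in ram]
    phdrs = b"".join(
        PHDR.pack(p_type, 0, addr, 0, addr, size, size, 0)
        for p_type, (addr, size) in entries)
    ident = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9)  # ELFCLASS64, LSB, v1
    ehdr = EHDR.pack(ident, ET_CORE, EM_AARCH64, 1, 0, EHDR.size, 0, 0,
                     EHDR.size, PHDR.size, len(entries), 0, 0, 0)
    return ehdr + phdrs


if __name__ == "__main__":
    blob = build_synthetic_header((0x40000000, 0x1000),
                                  [(0x80000000, 0x10000000)])
    for seg in parse_core_header(blob):
        print(f"{seg.kind}: paddr={seg.paddr:#x} size={seg.size:#x}")
```

Running the same parser over a copy of the header bytes handed to the crash kernel could be one way to check whether the header itself is sane before suspecting the copy routine.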
Day 5
- hacking arch/arm64/kernel/crash_dump.c:copy_oldmem_page() to see if the crash system really starts. It does.
- fun fact: retested with more reserved memory and with UEFI FW, host kernel crashes in init but directly starts the crash kernel, so it works (somehow) \o/

update HW25

Day 1
- rebased crash-kernel on v6.12.59 (for now), still crashing
