
Project Description

Everything we do at SUSE requires a certain amount of energy. This energy has a cost and also causes a certain amount of CO2 emissions. In particular, as the Kernel QA team, we run kernel tests very often, and part of the resulting energy consumption could be saved by introducing optimizations in LTP testing.

In this project we use a new parallel execution implementation in runltp-ng to explore how the software development process can reduce energy consumption and CO2 emissions inside a software company.

Goal for this Hackweek

We want to answer the following questions:

How many tests can run in parallel?
How much energy do we save per LTP execution in a virtualized system such as openQA?
Can we improve the parallelization model to save more energy?

Resources

runltp-ng: https://github.com/linux-test-project/runltp-ng/
runltp-ng with parallelization support: https://github.com/acerv/runltp-ng/tree/parallel_coroutines

Jan 31

I had some issues with the runltp-ng parallel execution, caused by the choice of moving the UI thread into the coroutines thread. With the previous code, tests took 30% more time to complete, but now the UI thread is working again. I also created a script to check how many tests can run in parallel in each testing suite.

```
Suite: can Total tests: 3 Parallelizable tests: 2
Suite: cap_bounds Total tests: 1 Parallelizable tests: 0
Suite: commands Total tests: 37 Parallelizable tests: 0
Suite: connectors Total tests: 1 Parallelizable tests: 0
Suite: containers Total tests: 86 Parallelizable tests: 0
Suite: controllers Total tests: 346 Parallelizable tests: 1
Suite: cpuhotplug Total tests: 6 Parallelizable tests: 0
Suite: crashme Total tests: 4 Parallelizable tests: 0
Suite: crypto Total tests: 10 Parallelizable tests: 6
Suite: cve Total tests: 77 Parallelizable tests: 5
Suite: dio Total tests: 30 Parallelizable tests: 0
Suite: dma_thread_diotest Total tests: 7 Parallelizable tests: 0
Suite: fcntl-locktests Total tests: 1 Parallelizable tests: 0
Suite: filecaps Total tests: 1 Parallelizable tests: 0
Suite: fs Total tests: 68 Parallelizable tests: 0
Suite: fs_bind Total tests: 95 Parallelizable tests: 0
Suite: fs_perms_simple Total tests: 18 Parallelizable tests: 0
Suite: fs_readonly Total tests: 55 Parallelizable tests: 0
Suite: fsx Total tests: 1 Parallelizable tests: 0
Suite: hugetlb Total tests: 50 Parallelizable tests: 0
Suite: hyperthreading Total tests: 2 Parallelizable tests: 0
Suite: ima Total tests: 9 Parallelizable tests: 0
Suite: input Total tests: 6 Parallelizable tests: 0
Suite: io Total tests: 2 Parallelizable tests: 1
Suite: ipc Total tests: 8 Parallelizable tests: 0
Suite: irq Total tests: 1 Parallelizable tests: 1
Suite: kernel_misc Total tests: 16 Parallelizable tests: 0
Suite: kvm Total tests: 1 Parallelizable tests: 0
Suite: ltp-aio-stress Total tests: 54 Parallelizable tests: 0
Suite: ltp-aiodio.part1 Total tests: 140 Parallelizable tests: 0
Suite: ltp-aiodio.part2 Total tests: 83 Parallelizable tests: 0
Suite: ltp-aiodio.part3 Total tests: 48 Parallelizable tests: 0
Suite: ltp-aiodio.part4 Total tests: 57 Parallelizable tests: 0
Suite: math Total tests: 10 Parallelizable tests: 0
Suite: mm Total tests: 75 Parallelizable tests: 2
Suite: net.features Total tests: 62 Parallelizable tests: 0
Suite: net.ipv6 Total tests: 11 Parallelizable tests: 0
Suite: net.ipv6_lib Total tests: 6 Parallelizable tests: 2
Suite: net.multicast Total tests: 4 Parallelizable tests: 0
Suite: net.nfs Total tests: 84 Parallelizable tests: 0
Suite: net.rpc_tests Total tests: 51 Parallelizable tests: 0
Suite: net.sctp Total tests: 41 Parallelizable tests: 0
Suite: net.tcp_cmds Total tests: 21 Parallelizable tests: 0
Suite: net.tirpc_tests Total tests: 41 Parallelizable tests: 0
Suite: net_stress.appl Total tests: 10 Parallelizable tests: 0
Suite: net_stress.broken_ip Total tests: 11 Parallelizable tests: 0
Suite: net_stress.interface Total tests: 25 Parallelizable tests: 0
Suite: net_stress.ipsec_dccp Total tests: 104 Parallelizable tests: 0
Suite: net_stress.ipsec_icmp Total tests: 86 Parallelizable tests: 0
Suite: net_stress.ipsec_sctp Total tests: 104 Parallelizable tests: 0
Suite: net_stress.ipsec_tcp Total tests: 104 Parallelizable tests: 0
Suite: net_stress.ipsec_udp Total tests: 106 Parallelizable tests: 0
Suite: net_stress.multicast Total tests: 24 Parallelizable tests: 0
Suite: net_stress.route Total tests: 14 Parallelizable tests: 0
Suite: nptl Total tests: 1 Parallelizable tests: 0
Suite: numa Total tests: 20 Parallelizable tests: 2
Suite: power_management_tests Total tests: 5 Parallelizable tests: 0
Suite: power_management_tests_exclusive Total tests: 5 Parallelizable tests: 0
Suite: pty Total tests: 9 Parallelizable tests: 1
Suite: s390x_tests Total tests: 1 Parallelizable tests: 0
Suite: sched Total tests: 11 Parallelizable tests: 0
Suite: scsi_debug.part1 Total tests: 140 Parallelizable tests: 0
Suite: securebits Total tests: 3 Parallelizable tests: 0
Suite: smack Total tests: 10 Parallelizable tests: 0
Suite: smoketest Total tests: 15 Parallelizable tests: 5
Suite: staging Total tests: 1 Parallelizable tests: 0
Suite: syscalls Total tests: 1384 Parallelizable tests: 526
Suite: syscalls-ipc Total tests: 61 Parallelizable tests: 26
Suite: tpm_tools Total tests: 12 Parallelizable tests: 0
Suite: tracing Total tests: 9 Parallelizable tests: 0
Suite: uevent Total tests: 3 Parallelizable tests: 0
Suite: watchqueue Total tests: 9 Parallelizable tests: 9

Total tests: 4017 Parallelizable tests: 589

14.66% of the tests are parallelizable
```
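The totals at the end of the report can be re-derived from the per-suite records. The following is a minimal sketch, not the script used for the report: the `summarize` helper and the three-suite sample are assumptions made for illustration, and only the `Suite: ... Total tests: ... Parallelizable tests: ...` record layout is taken from the output above.

```python
import re

# Matches one per-suite record. The grand-total line has no "Suite:" prefix,
# so it is skipped and never counted twice.
RECORD = re.compile(
    r"Suite:\s*(?P<suite>\S+)\s+"
    r"Total tests:\s*(?P<total>\d+)\s+"
    r"Parallelizable tests:\s*(?P<parallel>\d+)"
)


def summarize(report: str) -> tuple[int, int, float]:
    """Return (total, parallelizable, percentage) summed over all suites."""
    total = parallel = 0
    for match in RECORD.finditer(report):
        total += int(match["total"])
        parallel += int(match["parallel"])
    percentage = 100.0 * parallel / total if total else 0.0
    return total, parallel, percentage


# A three-suite excerpt of the report, used only as sample input.
sample = """
Suite: crypto Total tests: 10 Parallelizable tests: 6
Suite: syscalls Total tests: 1384 Parallelizable tests: 526
Suite: syscalls-ipc Total tests: 61 Parallelizable tests: 26
"""

total, parallel, percentage = summarize(sample)
print(f"{parallel}/{total} tests are parallelizable ({percentage:.2f}%)")
```

Feeding the full report to `summarize` should give back the 4017 and 589 totals, which is a quick way to check the 14.66% figure.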

Feb 1

I added a new option, runltp-ng --force-parallel, to force parallelization even for tests that don't enable it. Using it causes application crashes, especially in the most important suites such as syscalls or syscalls-ipc, so it is not a good idea to use it. I then ran a few suites and collected the time needed to complete them. It seems the current rule selecting tests for parallel execution is not smart enough: most of the selected tests finish in a second or less. This is reflected in the timing results, where important testing suites such as syscalls finish just a few minutes earlier than with the normal execution. We can probably do better on that side by optimizing the rule currently implemented in runltp-ng.
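To make the scheduling idea concrete, here is a minimal sketch of running parallel-safe tests concurrently with coroutines while keeping the other tests serial. It is not the runltp-ng implementation: `run_test`, `run_suite`, the test names and the durations are all hypothetical, and the real selection rule is not reproduced here.

```python
import asyncio


async def run_test(name: str, duration: float) -> str:
    """Stand-in for executing one test; a real runner would spawn a process."""
    await asyncio.sleep(duration)
    return name


async def run_suite(tests: list[tuple[str, float, bool]], workers: int) -> list[str]:
    """Run parallel-safe tests concurrently, then the others one by one.

    Each test is (name, duration_in_seconds, parallel_safe).
    """
    limit = asyncio.Semaphore(workers)

    async def guarded(name: str, duration: float) -> str:
        async with limit:  # at most `workers` tests run at the same time
            return await run_test(name, duration)

    parallel = [guarded(n, d) for n, d, safe in tests if safe]
    finished = list(await asyncio.gather(*parallel))

    for name, duration, safe in tests:
        if not safe:
            finished.append(await run_test(name, duration))
    return finished


# Hypothetical test names and durations, chosen only for the demonstration.
demo = [
    ("fast_a", 0.01, True),
    ("fast_b", 0.01, True),
    ("exclusive", 0.02, False),
]
print(asyncio.run(run_suite(demo, workers=16)))
```

The sketch also shows why very short tests bring little benefit: the serial tests still run back to back, so the total time is dominated by them when the parallel group only contains tests that finish in a second or less.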

```
Qemu:
    Distro: Tumbleweed
    Kernel: 6.1.8-1-default
    SMP:    16
    RAM:    2GB

syscalls:
    tests:    1384
    parallel: 526 (38% of the tests)

    16 workers: 31m 54s
    1 worker:   36m 18s

syscalls-ipc:
    tests:    61
    parallel: 26 (42.62% of the tests)

    16 workers: 2m 4s
    1 worker:   2m 7s

mm:
    tests:    75
    parallel: 2 (2.67% of the tests)

    16 workers: 8m 2s
    1 worker:   8m 10s

cve:
    tests:    77
    parallel: 5 (6.49% of the tests)

    16 workers: 29m 53s
    1 worker:   29m 57s
```
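The time gained by each suite follows directly from the timings above. This is a small sketch that converts the durations to seconds and prints the absolute and relative saving; the `to_seconds` and `saved` helpers are names introduced here for illustration only.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '31m 54s' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)m)?\s*(?:(\d+)s)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


def saved(serial: str, parallel: str) -> tuple[int, float]:
    """Return (seconds saved, percentage saved) of parallel over serial."""
    s, p = to_seconds(serial), to_seconds(parallel)
    return s - p, 100.0 * (s - p) / s


# (1 worker, 16 workers) timings copied from the measurements above.
timings = {
    "syscalls": ("36m 18s", "31m 54s"),
    "syscalls-ipc": ("2m 7s", "2m 4s"),
    "mm": ("8m 10s", "8m 2s"),
    "cve": ("29m 57s", "29m 53s"),
}

for suite, (serial, parallel) in timings.items():
    seconds, percent = saved(serial, parallel)
    print(f"{suite:<13} saved {seconds:>4d}s ({percent:.1f}%)")
```

Only syscalls gains more than a few seconds (264s, about 12%), which matches the observation that most of the selected tests are too short to matter.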

02-03 Feb

I focused more on the syscalls testing suite, since it's the most important suite that can be easily parallelized. All power consumption measurements were taken with the powerstat -a -R -d 0 1 3600 command, collecting data from the start of the testing suite execution until its end. All stats were collected on my own laptop, since I wasn't able to physically access the openQA workers. An external device for measuring power consumption would also improve the measurements. All tests ran inside a Qemu instance. According to openQA stats, syscalls was executed 35 times in the last month (Jan 2023), so we take this value into account.
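Power readings become an energy figure by integrating them over time. The sketch below assumes one reading per second, as the `1 3600` arguments of the powerstat command suggest; the `energy_wh` helper and the sample readings are hypothetical and do not come from the actual measurements.

```python
def energy_wh(samples_watts: list[float], interval_s: float = 1.0) -> float:
    """Integrate evenly spaced power samples (W) into energy (Wh)."""
    joules = sum(samples_watts) * interval_s  # W * s = J
    return joules / 3600.0  # 1 Wh = 3600 J


# Hypothetical readings: a constant 14 W draw sampled once per second
# for one hour, which must integrate to 14 Wh.
samples = [14.0] * 3600
print(f"{energy_wh(samples):.2f} Wh")
```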

Environment

```
Laptop:
    Model:     Lenovo T14s Gen 1
    CPU:       AMD Ryzen 7 PRO 4750U
    Memory:    16GB DDR4
    Hard disk: NVMe SSD

Qemu:
    CPUs: 16
    RAM:  4096MB
```

Data

CO2 emission per kWh -> W = 0.244 kg CO2/kWh (5% uncertainty)
Avg idle consumption -> I = 2.50 W
Energy cost in Germany -> P = 0.534 $/kWh
syscalls executions per month -> R = 35

Normal execution

execution time: T1 = 38m 57s = 2337s
energy consumption: E1 = 9 Wh
monthly consumption: C1 = 35 * 9 Wh = 0.315 kWh

Parallel execution (16 workers)

execution time: T2 = 35m 22s = 2122s (about 9% less)
energy consumption: E2 = 10 Wh
monthly consumption: C2 = 35 * 10 Wh = 0.350 kWh

Results

As we can see, the difference between parallel and normal execution is small, and overall it won't noticeably affect CO2 emissions or costs. In particular, over one year the parallel execution changes the totals as follows:

diff: D = (C2 - C1) * 12 = (0.350 - 0.315) * 12 = +0.42 kWh
cost: D * P = 0.42 * 0.534 = +0.224 $
emissions: CO2 = D * W = 0.42 * 0.244 = +0.102 kg

Considering that servers might consume a bit more energy during the execution, the values might be bigger, but still pretty small. Parallel execution does not reduce energy consumption because, while it takes less time, it draws more power to run many tests at the same time.
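The yearly figures can be reproduced with a few lines of arithmetic. This is a minimal sketch using the values from the Data section; the `yearly_delta` helper is a name introduced here, and the sign convention (positive means the parallel execution consumes more) is an assumption chosen to make the direction of the difference explicit.

```python
CO2_KG_PER_KWH = 0.244  # W in the Data section
PRICE_PER_KWH = 0.534   # P in the Data section, in $
RUNS_PER_MONTH = 35     # R in the Data section


def yearly_delta(e_serial_wh: float, e_parallel_wh: float) -> tuple[float, float, float]:
    """Return the yearly change (kWh, $, kg CO2) when switching to parallel runs.

    Positive values mean the parallel execution consumes more.
    """
    monthly_kwh = (e_parallel_wh - e_serial_wh) * RUNS_PER_MONTH / 1000.0
    yearly_kwh = monthly_kwh * 12
    return yearly_kwh, yearly_kwh * PRICE_PER_KWH, yearly_kwh * CO2_KG_PER_KWH


# E1 = 9 Wh and E2 = 10 Wh are the measured energies per execution.
energy, cost, co2 = yearly_delta(e_serial_wh=9, e_parallel_wh=10)
print(f"energy: {energy:+.2f} kWh, cost: {cost:+.3f} $, CO2: {co2:+.3f} kg")
```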

Optimizations

In the end, we don't have a big impact in terms of costs or emissions, but in terms of time we can still have a significant impact over one year. We can release openQA workers sooner and complete other jobs a bit faster, which in turn will have an impact on production, energy consumption and emissions. Taking our data into account, in one year we will save:

(T1 - T2) * R * 12 = (2337 - 2122) * 35 * 12 = 90300s ≈ 25 hours

If we are able to introduce a smarter rule to select the tests that can run in parallel, the amount of time saved per year might increase significantly. Also, we still have 332 syscalls tests (about 24%) using the old API, which currently can't run in parallel.
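The yearly time saving can be computed from the measured execution times, and the same formula gives a feeling for what a better selection rule could bring. In this sketch the first line uses the measured T1 and T2, while the 25% and 50% reductions are purely hypothetical inputs, not results; `hours_saved_per_year` is a name introduced for illustration.

```python
def hours_saved_per_year(t_serial_s: float, t_parallel_s: float, runs_per_month: int = 35) -> float:
    """Yearly time saved, in hours, following (T1 - T2) * R * 12."""
    return (t_serial_s - t_parallel_s) * runs_per_month * 12 / 3600.0


T1 = 2337  # measured serial execution time, in seconds
T2 = 2122  # measured parallel execution time, in seconds
print(f"measured: {hours_saved_per_year(T1, T2):.1f} hours/year")

# What-if scenarios: these reductions are hypothetical inputs, not measurements.
for reduction in (0.25, 0.50):
    hypothetical_t2 = T1 * (1 - reduction)
    hours = hours_saved_per_year(T1, hypothetical_t2)
    print(f"{reduction:.0%} shorter runs: {hours:.1f} hours/year")
```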

Looking for hackers with the skills:

optimization energy kernel ltp runltp co2 testing

This project is part of:

Hack Week 22

Activity

almost 3 years ago: acervesato started this project.

almost 3 years ago: acervesato originated this project.
