I once had a bad dream.
I started good, a sunny day. I had just fixed an issue and push it to my fork, in order to create a Pull Request. I was happy. It felt awesome to have found a fix so elegant. Two lines of code.
But then, something happened. A cloud appear in the sky and partially hide the sun. github triggered the end to end test suite. You could see the waiting icon, you could almost hear the engines starting, ... a lightning and a thunder appeared in the sky, the sky turned dark and the e2e test started in our private jenkins server.
1 hour passed by... 2 hours .... the e2e tests still running ... 4 hours, not even 50%. Finally, I had to go, get some dinner, get some rest, so I decided to look at it the next day.
Sleep was not good. Dreaming the tests would fail and the deadline be missed .... I couldn't sleep no more so I wake up early in the morning, before sunrise. The computer still opened was lightning up the room. Out, the storm turned out to be a heavy rain. And the tests did not passed! 4 hours and half after starting them, so half an hour after leaving for dinner, and there was a fail test.
Finding out what went wrong took 2 hours...it was not even related to the code, but an infrastructure issue. A timeout when connecting to a mirror which was being restarted when the tests run. Apparently we hit a maintenance window.
Damn! So let's just "restart" the tests. This time, I decided to put an alarm 4 hours after and switch to another task, just to avoid the anxiety of the previous evening looking at the results. 4 hours passed by, and the tests are good. It is not raining anymore, a git of sun is filtering throw the window, hope is in the air. 4 more hours, and the tests still have not failed. Going to dinner and bed again and putting the alarm early in the morning. First thing in the morning I looked at the tests ... running again, feels good. All tests have passed and there is only one final "cleanup" state. Shower, breakfeast, sun is shining again!
Finally, all is green, so let's go and push the merge button ... oh oh ... clouds hide the sun again ... where is the merge button?? no way, there is a conflict with the code in master and I can't merge!!! I can't merge!!! shocked, I looked into the history ... John merged a PR overnight that was touching the same file ... hopeless I start crying on my desk...
Scared?
This is not so different on what could happen if we were running the whole e2e test suite in the SUSE Manager (uyuni) Pull Requests. However, not running any e2e tests has also a bad consequence. We find the issues after code has been merged in master, and then we spend days looking at what could have caused this, reverting commits, and starting again.
However, if we could run a subset of the e2e tests, we could shorten the time it takes and so run them at the PR level, so no broken code could get merged.
The thing is , how do we select which tests to run?
This project is about using Machine Learning to do predictive test selection, thus selecting which tests to run based on the history of previous test runs.
Inspired by: https://engineering.fb.com/2018/11/21/developer-tools/predictive-test-selection/
No Hackers yet
This project is part of:
Hack Week 20
Activity
Comments
-
almost 5 years ago by llansky3 | Reply
This could have some wider potential - clever picking of tests to shorten the time yet maximize the findings (and learning that in time based on test execution history to predict). I can see how that saves tons of money in cases where test on a real industrial machines need to be executed (e.g. cost of running engine on a test bed is 5-10k$ per day so there is huge difference if you need to run for 1 or 3 days)
Similar Projects
Kubernetes-Based ML Lifecycle Automation by lmiranda
Description
This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.
The pipeline will automate the lifecycle of a machine learning model, including:
- Data ingestion/collection
- Model training as a Kubernetes Job
- Model artifact storage in an S3-compatible registry (e.g. Minio)
- A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
- A lightweight inference service that loads and serves the latest model
- Monitoring of model performance and service health through Prometheus/Grafana
The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.
Goals
By the end of Hack Week, the project should:
Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
Resources
Updates
- Training pipeline and datasets
- Inference Service py
Backporting patches using LLM by jankara
Description
Backporting Linux kernel fixes (either for CVE issues or as part of general git-fixes workflow) is boring and mostly mechanical work (dealing with changes in context, renamed variables, new helper functions etc.). The idea of this project is to explore usage of LLM for backporting Linux kernel commits to SUSE kernels using LLM.
Goals
- Create safe environment allowing LLM to run and backport patches without exposing the whole filesystem to it (for privacy and security reasons).
- Write prompt that will guide LLM through the backporting process. Fine tune it based on experimental results.
- Explore success rate of LLMs when backporting various patches.
Resources
- Docker
- Gemini CLI
Repository
Current version of the container with some instructions for use are at: https://gitlab.suse.de/jankara/gemini-cli-backporter
The Agentic Rancher Experiment: Do Androids Dream of Electric Cattle? by moio
Rancher is a beast of a codebase. Let's investigate if the new 2025 generation of GitHub Autonomous Coding Agents and Copilot Workspaces can actually tame it. 
The Plan
Create a sandbox GitHub Organization, clone in key Rancher repositories, and let the AI loose to see if it can handle real-world enterprise OSS maintenance - or if it just hallucinates new breeds of Kubernetes resources!
Specifically, throw "Agentic Coders" some typical tasks in a complex, long-lived open-source project, such as:
❥ The Grunt Work: generate missing GoDocs, unit tests, and refactorings. Rebase PRs.
❥ The Complex Stuff: fix actual (historical) bugs and feature requests to see if they can traverse the complexity without (too much) human hand-holding.
❥ Hunting Down Gaps: find areas lacking in docs, areas of improvement in code, dependency bumps, and so on.
If time allows, also experiment with Model Context Protocol (MCP) to give agents context on our specific build pipelines and CI/CD logs.
Why?
We know AI can write "Hello World." and also moderately complex programs from a green field. But can it rebase a 3-month-old PR with conflicts in rancher/rancher? I want to find the breaking point of current AI agents to determine if and how they can help us to reduce our technical debt, work faster and better. At the same time, find out about pitfalls and shortcomings.
The CONCLUSION!!!
A
State of the Union
document was compiled to summarize lessons learned this week. For more gory details, just read on the diary below!
Kubernetes-Based ML Lifecycle Automation by lmiranda
Description
This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.
The pipeline will automate the lifecycle of a machine learning model, including:
- Data ingestion/collection
- Model training as a Kubernetes Job
- Model artifact storage in an S3-compatible registry (e.g. Minio)
- A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
- A lightweight inference service that loads and serves the latest model
- Monitoring of model performance and service health through Prometheus/Grafana
The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.
Goals
By the end of Hack Week, the project should:
Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
Resources
Updates
- Training pipeline and datasets
- Inference Service py
MCP Trace Suite by r1chard-lyu
Description
This project plans to create an MCP Trace Suite, a system that consolidates commonly used Linux debugging tools such as bpftrace, perf, and ftrace.
The suite is implemented as an MCP Server. This architecture allows an AI agent to leverage the server to diagnose Linux issues and perform targeted system debugging by remotely executing and retrieving tracing data from these powerful tools.
- Repo: https://github.com/r1chard-lyu/systracesuite
- Demo: Slides
Goals
Build an MCP Server that can integrate various Linux debugging and tracing tools, including bpftrace, perf, ftrace, strace, and others, with support for future expansion of additional tools.
Perform testing by intentionally creating bugs or issues that impact system performance, allowing an AI agent to analyze the root cause and identify the underlying problem.
Resources
- Gemini CLI: https://geminicli.com/
- eBPF: https://ebpf.io/
- bpftrace: https://github.com/bpftrace/bpftrace/
- perf: https://perfwiki.github.io/main/
- ftrace: https://github.com/r1chard-lyu/tracium/
Extended private brain - RAG my own scripts and data into offline LLM AI by tjyrinki_suse
Description
For purely studying purposes, I'd like to find out if I could teach an LLM some of my own accumulated knowledge, to use it as a sort of extended brain.
I might use qwen3-coder or something similar as a starting point.
Everything would be done 100% offline without network available to the container, since I prefer to see when network is needed, and make it so it's never needed (other than initial downloads).
Goals
- Learn something about RAG, LLM, AI.
- Find out if everything works offline as intended.
- As an end result have a new way to access my own existing know-how, but so that I can query the wisdom in them.
- Be flexible to pivot in any direction, as long as there are new things learned.
Resources
To be found on the fly.
Timeline
Day 1 (of 4)
- Tried out a RAG demo, expanded on feeding it my own data
- Experimented with qwen3-coder to add a persistent chat functionality, and keeping vectors in a pickle file
- Optimizations to keep everything within context window
- Learn and add a bit of PyTest
Day 2
- More experimenting and more data
- Study ChromaDB
- Add a Web UI that works from another computer even though the container sees network is down
Day 3
- The above RAG is working well enough for demonstration purposes.
- Pivot to trying out OpenCode, configuring local Ollama qwen3-coder there, to analyze the RAG demo.
- Figured out how to configure Ollama template to be usable under OpenCode. OpenCode locally is super slow to just running qwen3-coder alone.
Day 4 (final day)
- Battle with OpenCode that was both slow and kept on piling up broken things.
- Call it success as after all the agentic AI was working locally.
- Clean up the mess left behind a bit.
Blog Post
Summarized the findings at blog post.