Description

Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

Goals

By the end of Hack Week, we aim to have a single, working Python script that:

  1. Connects to Prometheus and executes a query to fetch detailed test failure history.
  2. Processes the raw data into a format suitable for the Gemini API.
  3. Successfully calls the Gemini API with the data and a clear prompt.
  4. Parses the AI's response to extract a simple list of flaky tests.
  5. Saves the list to a JSON file that can be displayed in Grafana.
  6. New panel in our Dashboard listing the Flaky tests

Resources

Outcome

Looking for hackers with the skills:

uyuni prometheus grafana ai completed

This project is part of:

Hack Week 25

Activity

  • 11 days ago: jordimassaguerpla liked this project.
  • 18 days ago: oscar-barrios added keyword "completed" to this project.
  • 24 days ago: deneb_alpha liked this project.
  • about 1 month ago: ygutierrez liked this project.
  • 2 months ago: oscar-barrios liked this project.
  • 2 months ago: oscar-barrios added keyword "uyuni" to this project.
  • 2 months ago: oscar-barrios added keyword "prometheus" to this project.
  • 2 months ago: oscar-barrios added keyword "grafana" to this project.
  • 2 months ago: oscar-barrios added keyword "ai" to this project.
  • 2 months ago: oscar-barrios started this project.
  • 2 months ago: oscar-barrios left this project.
  • 2 months ago: oscar-barrios started this project.
  • 2 months ago: oscar-barrios originated this project.

  • Comments

    • oscar-barrios
      19 days ago by oscar-barrios | Reply

      The code of the flaky detector is here: https://github.com/srbarrios/jenkins-flaky-tests-detector

      I also published a Docker container to use it here: https://github.com/srbarrios/jenkins-flaky-tests-detector/pkgs/container/jenkins-flaky-tests-detector

      The plan now is to write a Salt state in our MLM internal infra, so it runs this container, it expose the results in a web server running on the container, and then I parse it on Grafana.

    • oscar-barrios
      19 days ago by oscar-barrios | Reply

      I created the new Grafana dashboard for Uyuni here: https://grafana.mgr.suse.de/d/flaky-tests/flaky-tests-detection?orgId=1&from=now-6h&to=now&timezone=browser&refresh=1m

      Next step now is to build it in a way that I can get the flaky tests for all the Jenkins job test results that we monitoring in MLM.

    • oscar-barrios
      19 days ago by oscar-barrios | Reply

      Now we can select any of the running test suites, and get a list of the most probable flaky tests :)

    • oscar-barrios
      18 days ago by oscar-barrios | Reply

      I will consider this hackweek done for now, to move to my second hackweek project. The outcome it has been good, I must admit that I also vibe coded some parts using Gemini 3. Also, the script analyzing the prometheus series is not relying only on a LLM call, but it also do a first triage based on a simple algorithm, saving resources to ask AI only for ambiguos and complex test failures.

    Similar Projects

    Set Uyuni to manage edge clusters at scale by RDiasMateus

    Description

    Prepare a Poc on how to use MLM to manage edge clusters. Those cluster are normally equal across each location, and we have a large number of them.

    The goal is to produce a set of sets/best practices/scripts to help users manage this kind of setup.

    Goals

    step 1: Manual set-up

    Goal: Have a running application in k3s and be able to update it using System Update Controler (SUC)

    • Deploy Micro 6.2 machine
    • Deploy k3s - single node

      • https://docs.k3s.io/quick-start
    • Build/find a simple web application (static page)

      • Build/find a helmchart to deploy the application
    • Deploy the application on the k3s cluster

    • Install App updates through helm update

    • Install OS updates using MLM

    step 2: Automate day 1

    Goal: Trigger the application deployment and update from MLM

    • Salt states For application (with static data)
      • Deploy the application helmchart, if not present
      • install app updates through helmchart parameters
    • Link it to GIT
      • Define how to link the state to the machines (based in some pillar data? Using configuration channels by importing the state? Naming convention?)
      • Use git update to trigger helmchart app update
    • Recurrent state applying configuration channel?

    step 3: Multi-node cluster

    Goal: Use SUC to update a multi-node cluster.

    • Create a multi-node cluster
    • Deploy application
      • call the helm update/install only on control plane?
    • Install App updates through helm update
    • Prepare a SUC for OS update (k3s also? How?)
      • https://github.com/rancher/system-upgrade-controller
      • https://documentation.suse.com/cloudnative/k3s/latest/en/upgrades/automated.html
      • Update/deploy the SUC?
      • Update/deploy the SUC CRD with the update procedure


    Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil

    Join the Gitter channel! https://gitter.im/uyuni-project/hackweek

    Uyuni is a configuration and infrastructure management tool that saves you time and headaches when you have to manage and update tens, hundreds or even thousands of machines. It also manages configuration, can run audits, build image containers, monitor and much more!

    Currently there are a few distributions that are completely untested on Uyuni or SUSE Manager (AFAIK) or just not tested since a long time, and could be interesting knowing how hard would be working with them and, if possible, fix whatever is broken.

    For newcomers, the easiest distributions are those based on DEB or RPM packages. Distributions with other package formats are doable, but will require adapting the Python and Java code to be able to sync and analyze such packages (and if salt does not support those packages, it will need changes as well). So if you want a distribution with other packages, make sure you are comfortable handling such changes.

    No developer experience? No worries! We had non-developers contributors in the past, and we are ready to help as long as you are willing to learn. If you don't want to code at all, you can also help us preparing the documentation after someone else has the initial code ready, or you could also help with testing :-)

    The idea is testing Salt (including bootstrapping with bootstrap script) and Salt-ssh clients

    To consider that a distribution has basic support, we should cover at least (points 3-6 are to be tested for both salt minions and salt ssh minions):

    1. Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file)
    2. Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
    3. Package management (install, remove, update...)
    4. Patching
    5. Applying any basic salt state (including a formula)
    6. Salt remote commands
    7. Bonus point: Java part for product identification, and monitoring enablement
    8. Bonus point: sumaform enablement (https://github.com/uyuni-project/sumaform)
    9. Bonus point: Documentation (https://github.com/uyuni-project/uyuni-docs)
    10. Bonus point: testsuite enablement (https://github.com/uyuni-project/uyuni/tree/master/testsuite)

    If something is breaking: we can try to fix it, but the main idea is research how supported it is right now. Beyond that it's up to each project member how much to hack :-)

    • If you don't have knowledge about some of the steps: ask the team
    • If you still don't know what to do: switch to another distribution and keep testing.

    This card is for EVERYONE, not just developers. Seriously! We had people from other teams helping that were not developers, and added support for Debian and new SUSE Linux Enterprise and openSUSE Leap versions :-)

    In progress/done for Hack Week 25

    Guide

    We started writin a Guide: Adding a new client GNU Linux distribution to Uyuni at https://github.com/uyuni-project/uyuni/wiki/Guide:-Adding-a-new-client-GNU-Linux-distribution-to-Uyuni, to make things easier for everyone, specially those not too familiar wht Uyuni or not technical.

    openSUSE Leap 16.0

    The distribution will all love!

    https://en.opensuse.org/openSUSE:Roadmap#DRAFTScheduleforLeap16.0

    Curent Status We started last year, it's complete now for Hack Week 25! :-D

    • [W] Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file) NOTE: Done, client tools for SLMicro6 are using as those for SLE16.0/openSUSE Leap 16.0 are not available yet
    • [W] Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
    • [W] Package management (install, remove, update...). Works, even reboot requirement detection


    Uyuni read-only replica by cbosdonnat

    Description

    For now, there is no possible HA setup for Uyuni. The idea is to explore setting up a read-only shadow instance of an Uyuni and make it as useful as possible.

    Possible things to look at:

    • live sync of the database, probably using the WAL. Some of the tables may have to be skipped or some features disabled on the RO instance (taskomatic, PXT sessions…)
    • Can we use a load balancer that routes read-only queries to either instance and the other to the RW one? For example, packages or PXE data can be served by both, the API GET requests too. The rest would be RW.

    Goals

    • Prepare a document explaining how to do it.
    • PR with the needed code changes to support it


    mgr-ansible-ssh - Intelligent, Lightweight CLI for Distributed Remote Execution by deve5h

    Description

    By the end of Hack Week, the target will be to deliver a minimal functional version 1 (MVP) of a custom command-line tool named mgr-ansible-ssh (a unified wrapper for BOTH ad-hoc shell & playbooks) that allows operators to:

    1. Execute arbitrary shell commands on thousand of remote machines simultaneously using Ansible Runner with artifacts saved locally.
    2. Pass runtime options such as inventory file, remote command string/ playbook execution, parallel forks, limits, dry-run mode, or no-std-ansible-output.
    3. Leverage existing SSH trust relationships without additional setup.
    4. Provide a clean, intuitive CLI interface with --help for ease of use. It should provide consistent UX & CI-friendly interface.
    5. Establish a foundation that can later be extended with advanced features such as logging, grouping, interactive shell mode, safe-command checks, and parallel execution tuning.

    The MVP should enable day-to-day operations to efficiently target thousands of machines with a single, consistent interface.

    Goals

    Primary Goals (MVP):

    Build a functional CLI tool (mgr-ansible-ssh) capable of executing shell commands on multiple remote hosts using Ansible Runner. Test the tool across a large distributed environment (1000+ machines) to validate its performance and reliability.

    Looking forward to significantly reducing the zypper deployment time across all 351 RMT VM servers in our MLM cluster by eliminating the dependency on the taskomatic service, bringing execution down to a fraction of the current duration. The tool should also support multiple runtime flags, such as:

    mgr-ansible-ssh: Remote command execution wrapper using Ansible Runner
    
    Usage: mgr-ansible-ssh [--help] [--version] [--inventory INVENTORY]
                       [--run RUN] [--playbook PLAYBOOK] [--limit LIMIT]
                       [--forks FORKS] [--dry-run] [--no-ansible-output]
    
    Required Arguments
    --inventory, -i      Path to Ansible inventory file to use
    
    Any One of the Arguments Is Required
    --run, -r            Execute the specified shell command on target hosts
    --playbook, -p       Execute the specified Ansible playbook on target hosts
    
    Optional Arguments
    --help, -h           Show the help message and exit
    --version, -v        Show the version and exit
    --limit, -l          Limit execution to specific hosts or groups
    --forks, -f          Number of parallel Ansible forks
    --dry-run            Run in Ansible check mode (requires -p or --playbook)
    --no-ansible-output  Suppress Ansible stdout output
    

    Secondary/Stretched Goals (if time permits):

    1. Add pretty output formatting (success/failure summary per host).
    2. Implement basic logging of executed commands and results.
    3. Introduce safety checks for risky commands (shutdown, rm -rf, etc.).
    4. Package the tool so it can be installed with pip or stored internally.

    Resources

    Collaboration is welcome from anyone interested in CLI tooling, automation, or distributed systems. Skills that would be particularly valuable include:

    1. Python especially around CLI dev (argparse, click, rich)


    Uyuni Saltboot rework by oholecek

    Description

    When Uyuni switched over to the containerized proxies we had to abandon salt based saltboot infrastructure we had before. Uyuni already had integration with a Cobbler provisioning server and saltboot infra was re-implemented on top of this Cobbler integration.

    What was not obvious from the start was that Cobbler, having all it's features, woefully slow when dealing with saltboot size environments. We did some improvements in performance, introduced transactions, and generally tried to make this setup usable. However the underlying slowness remained.

    Goals

    This project is not something trying to invent new things, it is just finally implementing saltboot infrastructure directly with the Uyuni server core.

    Instead of generating grub and pxelinux configurations by Cobbler for all thousands of systems and branches, we will provide a GET access point to retrieve grub or pxelinux file during the boot:

    /saltboot/group/grub/$fqdn and similar for systems /saltboot/system/grub/$mac

    Next we adapt our tftpd translator to query these points when asked for default or mac based config.

    Lastly similar thing needs to be done on our apache server when HTTP UEFI boot is used.

    Resources


    Uyuni Health-check Grafana AI Troubleshooter by ygutierrez

    Description

    This project explores the feasibility of using the open-source Grafana LLM plugin to enhance the Uyuni Health-check tool with LLM capabilities. The idea is to integrate a chat-based "AI Troubleshooter" directly into existing dashboards, allowing users to ask natural-language questions about errors, anomalies, or performance issues.

    Goals

    • Investigate if and how the grafana-llm-app plug-in can be used within the Uyuni Health-check tool.
    • Investigate if this plug-in can be used to query LLMs for troubleshooting scenarios.
    • Evaluate support for local LLMs and external APIs through the plugin.
    • Evaluate if and how the Uyuni MCP server could be integrated as another source of information.

    Resources

    Grafana LMM plug-in

    Uyuni Health-check


    Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo

    Description

    Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.

    Goals

    Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:

    • Gain insight into the latest AI trends, tools, and architectural concepts.
    • Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).

    Resources

    • Red Hat AI Topic Articles

      • https://www.redhat.com/en/topics/ai
    • Kubeflow Documentation

      • https://www.kubeflow.org/docs/
    • Q4 2025 CNCF Technology Landscape Radar report:

      • https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
      • https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
    • Agent-to-Agent (A2A) Protocol

      • https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/


    Multi-agent AI assistant for Linux troubleshooting by doreilly

    Description

    Explore multi-agent architecture as a way to avoid MCP context rot.

    Having one agent with many tools bloats the context with low-level details about tool descriptions, parameter schemas etc which hurts LLM performance. Instead have many specialised agents, each with just the tools it needs for its role. A top level supervisor agent takes the user prompt and delegates to appropriate sub-agents.

    Goals

    Create an AI assistant with some sub-agents that are specialists at troubleshooting Linux subsystems, e.g. systemd, selinux, firewalld etc. The agents can get information from the system by implementing their own tools with simple function calls, or use tools from MCP servers, e.g. a systemd-agent can use tools from systemd-mcp.

    Example prompts/responses:

    user$ the system seems slow
    assistant$ process foo with pid 12345 is using 1000% cpu ...
    
    user$ I can't connect to the apache webserver
    assistant$ the firewall is blocking http ... you can open the port with firewall-cmd --add-port ...
    

    Resources

    Language Python. The Python ADK is more mature than Golang.

    https://google.github.io/adk-docs/

    https://github.com/djoreilly/linux-helper


    The Agentic Rancher Experiment: Do Androids Dream of Electric Cattle? by moio

    Rancher is a beast of a codebase. Let's investigate if the new 2025 generation of GitHub Autonomous Coding Agents and Copilot Workspaces can actually tame it. A GitHub robot mascot trying to lasso a blue bull with a Kubernetes logo tatooed on it


    The Plan

    Create a sandbox GitHub Organization, clone in key Rancher repositories, and let the AI loose to see if it can handle real-world enterprise OSS maintenance - or if it just hallucinates new breeds of Kubernetes resources!

    Specifically, throw "Agentic Coders" some typical tasks in a complex, long-lived open-source project, such as:


    The Grunt Work: generate missing GoDocs, unit tests, and refactorings. Rebase PRs.

    The Complex Stuff: fix actual (historical) bugs and feature requests to see if they can traverse the complexity without (too much) human hand-holding.

    Hunting Down Gaps: find areas lacking in docs, areas of improvement in code, dependency bumps, and so on.


    If time allows, also experiment with Model Context Protocol (MCP) to give agents context on our specific build pipelines and CI/CD logs.

    Why?

    We know AI can write "Hello World." and also moderately complex programs from a green field. But can it rebase a 3-month-old PR with conflicts in rancher/rancher? I want to find the breaking point of current AI agents to determine if and how they can help us to reduce our technical debt, work faster and better. At the same time, find out about pitfalls and shortcomings.

    The CONCLUSION!!!

    A add-emoji State of the Union add-emoji document was compiled to summarize lessons learned this week. For more gory details, just read on the diary below! add-emoji


    Background Coding Agent by mmanno

    Description

    I had only bad experiences with AI one-shots. However, monitoring agent work closely and interfering often did result in productivity gains.

    Now, other companies are using agents in pipelines. That makes sense to me, just like CI, we want to offload work to pipelines: Our engineering teams are consistently slowed down by "toil": low-impact, repetitive maintenance tasks. A simple linter rule change, a dependency bump, rebasing patch-sets on top of newer releases or API deprecation requires dozens of manual PRs, draining time from feature development.

    So far we have been writing deterministic, script-based automation for these tasks. And it turns out to be a common trap. These scripts are brittle, complex, and become a massive maintenance burden themselves.

    Can we make prompts and workflows smart enough to succeed at background coding?

    Goals

    We will build a platform that allows engineers to execute complex code transformations using prompts.

    By automating this toil, we accelerate large-scale migrations and allow teams to focus on high-value work.

    Our platform will consist of three main components:

    • "Change" Definition: Engineers will define a transformation as a simple, declarative manifest:
      • The target repositories.
      • A wrapper to run a "coding agent", e.g., "gemini-cli".
      • The task as a natural language prompt.
    • "Change" Management Service: A central service that orchestrates the jobs. It will receive Change definitions and be responsible for the job lifecycle.
    • Execution Runners: We could use existing sandboxed CI runners (like GitHub/GitLab runners) to execute each job or spawn a container.

    MVP

    • Define the Change manifest format.
    • Build the core Management Service that can accept and queue a Change.
    • Connect management service and runners, dynamically dispatch jobs to runners.
    • Create a basic runner script that can run a hard-coded prompt against a test repo and open a PR.

    Stretch Goals:

    • Multi-layered approach, Workflow Agents trigger Coding Agents:
      1. Workflow Agent: Gather information about the task interactively from the user.
      2. Coding Agent: Once the interactive agent has refined the task into a clear prompt, it hands this prompt off to the "coding agent." This background agent is responsible for executing the task and producing the actual pull request.
    • Use MCP:
      1. Workflow Agent gathers context information from Slack, Github, etc.
      2. Workflow Agent triggers a Coding Agent.
    • Create a "Standard Task" library with reliable prompts.
      1. Rebasing rancher-monitoring to a new version of kube-prom-stack
      2. Update charts to use new images
      3. Apply changes to comply with a new linter
      4. Bump complex Go dependencies, like k8s modules
      5. Backport pull requests to other branches
    • Add “review agents” that review the generated PR.

    See also


    Self-Scaling LLM Infrastructure Powered by Rancher by ademicev0

    Self-Scaling LLM Infrastructure Powered by Rancher

    logo


    Description

    The Problem

    Running LLMs can get expensive and complex pretty quickly.

    Today there are typically two choices:

    1. Use cloud APIs like OpenAI or Anthropic. Easy to start with, but costs add up at scale.
    2. Self-host everything - set up Kubernetes, figure out GPU scheduling, handle scaling, manage model serving... it's a lot of work.

    What if there was a middle ground?

    What if infrastructure scaled itself instead of making you scale it?

    Can we use existing Rancher capabilities like CAPI, autoscaling, and GitOps to make this simpler instead of building everything from scratch?

    Project Repository: github.com/alexander-demicev/llmserverless


    What This Project Does

    A key feature is hybrid deployment: requests can be routed based on complexity or privacy needs. Simple or low-sensitivity queries can use public APIs (like OpenAI), while complex or private requests are handled in-house on local infrastructure. This flexibility allows balancing cost, privacy, and performance - using cloud for routine tasks and on-premises resources for sensitive or demanding workloads.

    A complete, self-scaling LLM infrastructure that:

    • Scales to zero when idle (no idle costs)
    • Scales up automatically when requests come in
    • Adds more nodes when needed, removes them when demand drops
    • Runs on any infrastructure - laptop, bare metal, or cloud

    Think of it as "serverless for LLMs" - focus on building, the infrastructure handles itself.

    How It Works

    A combination of open source tools working together:

    Flow:

    • Users interact with OpenWebUI (chat interface)
    • Requests go to LiteLLM Gateway
    • LiteLLM routes requests to:
      • Ollama (Knative) for local model inference (auto-scales pods)
      • Or cloud APIs for fallback


    Move Uyuni Test Framework from Selenium to Playwright + AI by oscar-barrios

    Description

    This project aims to migrate the existing Uyuni Test Framework from Selenium to Playwright. The move will improve the stability, speed, and maintainability of our end-to-end tests by leveraging Playwright's modern features. We'll be rewriting the current Selenium code in Ruby to Playwright code in TypeScript, which includes updating the test framework runner, step definitions, and configurations. This is also necessary because we're moving from Cucumber Ruby to CucumberJS.

    If you're still curious about the AI in the title, it was just a way to grab your attention. Thanks for your understanding.

    Nah, let's be honest add-emoji AI helped a lot to vibe code a good part of the Ruby methods of the Test framework, moving them to Typescript, along with the migration from Capybara to Playwright. I've been using "Cline" as plugin for WebStorm IDE, using Gemini API behind it.


    Goals

    • Migrate Core tests including Onboarding of clients
    • Improve test reliabillity: Measure and confirm a significant reduction of flakiness.
    • Implement a robust framework: Establish a well-structured and reusable Playwright test framework using the CucumberJS

    Resources