Rancher is a beast of a codebase. Let's investigate if the new 2025 generation of GitHub Autonomous Coding Agents and Copilot Workspaces can actually tame it. 
The Plan
Create a sandbox GitHub Organization, clone in key Rancher repositories, and let the AI loose to see if it can handle real-world enterprise OSS maintenance - or if it just hallucinates new breeds of Kubernetes resources!
Specifically, throw "Agentic Coders" some typical tasks in a complex, long-lived open-source project, such as:
❥ The Grunt Work: generate missing GoDocs, unit tests, and refactorings. Rebase PRs.
❥ The Complex Stuff: fix actual (historical) bugs and feature requests to see if they can traverse the complexity without (too much) human hand-holding.
❥ Hunting Down Gaps: find areas lacking in docs, areas of improvement in code, dependency bumps, and so on.
If time allows, also experiment with Model Context Protocol (MCP) to give agents context on our specific build pipelines and CI/CD logs.
Why?
We know AI can write "Hello World" and even moderately complex programs from a green field. But can it rebase a 3-month-old PR with conflicts in rancher/rancher? I want to find the breaking point of current AI agents, to determine if and how they can help us reduce our technical debt and work faster and better - and, at the same time, to learn about the pitfalls and shortcomings.
The CONCLUSION!!!
A State of the Union document was compiled to summarize the lessons learned this week. For more gory details, just read on in the diary below!
This project is part of:
Hack Week 25
Activity
Comments
-
15 days ago by moio | Reply
Day 1
Successes
I had Copilot agents:
- Resolving a straightforward Rancher issue: adding a small feature to the Helm chart: https://github.com/moio/rancher/pull/1. A "very solid minimal implementation" (reviewer’s words) ready for iteration!
- Write docs: for a small Rancher feature https://github.com/moio/rancher-docs/pull/5
- Adapt docs: from Rancher Community to Rancher Product docs. Polyglot Powers: Copilot handled the format conversion (Markdown → Asciidoc) and - terrifyingly - also the English-to-Chinese translation! https://github.com/rancher/rancher-product-docs/pull/611
- Bump dependencies: in a non-straightforward context https://github.com/moio/dartboard/pull/2
Failures (and Tips to Avoid Them)
I had Copilot agents:
Torching API Limits: A naive approach to surveying ~3,000 open issues hit the wall immediately.
- Mitigation: think about API (mis)use *before* hitting enter. Agents like to take the shortest path to a solution.
Getting into infinite loops as they repeatedly lose context. Agents love to chase their own tails as context windows overflow, wasting time and resources!

- Mitigation: limit the amount of context the agent needs to keep in working memory to accomplish the task. That is not always possible though.
Finding duplicate issues in Rancher https://github.com/moio/rancher/pull/2. The approach just did not scale; I am now working on another approach.
Refusing to work across repos. Agents are strictly bound to a single repository. They can access other repos (including issues, PRs, etc) read-only for context, but they cannot, say, open PRs other than the one they were started in.
- Avoid by: framing multiple requests to multiple repos and coordinating the results. Or use fewer distinct repos where possible (this is a big problem in Rancher, as we use so many).
Burning through 13% of my monthly requests on Day 1! Speedrunning the budget!


The Jury is Still Out On
- Performing a whole-Rancher issue cleanup https://github.com/moio/rancher/pull/2 (needs human review)
- Mass-reviewing all Rancher design docs for consistency https://github.com/moio/rancher-architecture/pull/1 (needs human review)
- Creating a tool to mass download Rancher issues in JSON form https://github.com/moio/issue-analyzer (Copilot still working at time of writing)
Good/Bad/Ugly
Async Warfare: Massive parallelism. Fire off multiple tasks and check back later.
PR interaction: agents dutifully follow up on every single line-by-line comment! The PR review interaction works really well compared to CLI workflows.
Latency: Simple tasks can take a *painfully* long time
Context size limitations: agents do not typically have the full repo in context - even for a small one. You have to point them to files/directories/greppable strings to direct any efforts efficiently.
Pitfalls & Sharp Edges
- Pitfall: not knowing where to start - but it’s easy if you know where to click. The tricky icon is “New agent task”, close to the clone button:

- Pitfall: creating a custom org for Copilot was a no-go. It requires paying for an enterprise subscription, would need issues and PRs copied over, and offers no evident advantage for now.
- Pitfall: trying to understand why agents are not enabled on all repos. Agents require write permission on the repo being changed.
- Workaround: run agents on your fork instead
- Facepalm Moment of the Day:
- Agent: Thanks for asking me to work on this.
- Me: Dial back the obsequiousness, Microsoft! These are programming tools, not training butlers!
- Agent: Thanks for asking me to work on this.
The VERDICT for Today
Agentic AI offers high-bandwidth, high-latency coding.
- Upside: working on multiple fronts in parallel works very well
- The Catch: It’s slow. Slower than AI CLI tools. Sometimes slower than you.
- Just like CI or the Open Build Service, working with them “interactively” (that is, waiting on them to conclude) will kill your velocity.
You have been warned.
-
14 days ago by moio | Reply
Day 2
Successes
- Fixing a HackWeek site Emoji issue: Agent provided a solid starting point for resolution https://github.com/moio/hackweek/pull/1
- Tool for Mass-downloading of Rancher issues: Success. Agents excel at small, greenfield, throwaway tools. https://github.com/moio/issue-analyzer
Failures (and The Sharp Edges)
- Merge Conflicts: When multiple agents touch the same file, concurrent changes require manual reconciliation. Agents cannot force push or rebase, and they messed up merging.
- Sharp Edge: Avoid concurrent agent work on the same file/section. (Difficult to enforce in large projects like Rancher)
- Whole-Rancher issue cleanup: Total hallucination. The agent invented links to unrelated reports and fabricated responses. https://github.com/moio/rancher/pull/2

- Sharp Edge: do not even try feeding an agent that much data. In this case a full issue dump (~600 MiB) blew far past the ~2 MiB limit.
- Core Library Bugfix: Failed. The agent assumed the wrong base branch. Worse, it misinterpreted the report and confidently suggested a broken solution. https://github.com/moio/steve/pull/1
- Mitigation 1: Verify your default branch/fork. Agents accept the default as absolute truth!
- Mitigation 2: Sharpen your prompts. Since the full codebase won't fit in context, agents need "greppable" keywords (function names, error strings) to locate the relevant code. A bug report written for humans may be too vague for an agent. Compare:
- https://github.com/rancher/rancher/issues/52872 (made for humans)
- “When BeginTx is called, there is a chance SQLITE_BUSY or SQLITE_BUSY_SNAPSHOT are returned - without the busy handler being engaged. We want to retry in that case.” (points agent to portion of code to patch, making the starting point “greppable”)
The Jury is Still Out On
- Mass-reviewing all Rancher design docs for consistency https://github.com/moio/rancher-architecture/pull/1 (needs human review)
- Writing automated tests for a Rancher side project’s Terraform files https://github.com/moio/dartboard/pull/1 (looking good, there might be a specific GHA limitation to deal with)
The VERDICT for Today
Know your tool. Understanding agent internals makes the difference between velocity and hallucinated infinite loops.
- Their context window (working memory) is limited - e.g., for GPT-5.1 that’s 400k tokens or about 1.6M characters.
- Exceeding this limit guarantees severe hallucinations.
- Agents do try (and you should too) to work around that limit with plain-text search before the LLM phase.
Best Weapon: Manually limit context. Feed the agent error strings, function names, and file paths so it can scope down immediately.
Larger problems may just be intractable at this point.
Additionally, a critical limitation is that agents cannot solve merge conflicts. That limits their viability in large projects, as it reduces the parallelism.
-
13 days ago by moio | Reply
Day 3
Successes
- Terraform Testing: Successfully wrote automated tests for Terraform files in a Rancher QA project https://github.com/moio/dartboard/pull/1
- Core Library Bugfix (Redemption): After yesterday's failure, applying the "sharper prompt" mitigation yielded a credible first solution https://github.com/moio/steve/pull/2
- UI Bugfix: Easy win. This case proved that agents shine when the blast radius is confined to a single repository https://github.com/moio/dashboard/pull/1
Failures (and Tips to Avoid Them)
- Fixing a Bug Requiring Multi-Repo Coordination: The agent struggled to bridge the UI and Backend repositories simultaneously, even when hand-held (https://github.com/moio/dashboard/pull/2 https://github.com/moio/steve/pull/3). The proposed API interface was unconvincing, and the implementation failed to run. Silver Lining: It acted as a decent "grep." It identified the correct code paths, even if it couldn't fix them.
- Sharp Edge: The Firewall. Agents live in a sandbox. They cannot access the open web (e.g., k8s.io or scc.suse.com) - whitelist the domains they access in Settings → Copilot → Copilot agents, or you will get the error message below.
- Mitigation: Design for "Offline Mode." Ensure lint/test/validation steps don't require external fetch. If the agent can't reach it, it can't run it.

The Jury is Still Out On
- Resolving a bit of Rancher UI tech debt: https://github.com/rancher/dashboard/issues/15326#issuecomment-3606805281 (under human review)
The VERDICT for Today
Multi-repository work is an issue for GitHub agents. I wonder how well Claude does, and I am planning to look into it next.
-
12 days ago by moio | Reply
Day 4
Claude Code vs. GitHub Copilot
Unplanned Detour: Subscribed to Claude Pro to test the hype against GitHub Copilot.
Impressions
- The Good:
- CLI UI/UX is significantly smoother
- Text reports are readable and clear
- Seemed less prone to the "death spirals" (loops) I saw with Gemini
- The Bad:
- Integration Friction: GitHub integration feels second-class. Example: I asked it to remove lines in a review; it proceeded to remove only the blank lines. Malicious compliance?
- Blindspots: No nice UI for logging activities, just raw GHA verbose logs. Quota errors are cryptic.
- The Ugly:
- Burn Rate: The "5-hour limit" is (almost) a joke; I burned it in under 2 hours.
- ROI: Significantly lower mileage compared to Gemini or Copilot plans at similar prices.
Successes
- i18n on a Budget: almost successful at internationalizing a small hobby app (~100 strings).
- Catch: It worked, but consumed an excruciating amount of tokens and required manual "pointing" to missed spots.
Failures (and The Sharp Edges)
- The Plagiarist: While attempting a UI fix, the Copilot agent scraped code from an unmerged PR and presented it as a solution.
- Risk: It will happily reuse work in progress without credit!
- Sharp Edge: Admin Walls. Whitelisting domains (Firewall) in Copilot requires repo Admin rights.
- If you work on a shared repo without admin access, you cannot add the necessary exceptions. This adds a significant maintenance burden.
- Mass-reviewing all Rancher design docs: Copilot was technically correct, but practically useless.
- It successfully flagged inconsistencies... between "Archival" and "Live" docs. It found entropy, not value.
The Jury is Still Out On
- Copilot PR Reviews: Waiting on infrastructure team enablement.
The VERDICT for Today
No Silver Bullet. Claude has a slicker UI and avoids some logic loops, but the "smartness" delta does not seem to justify the abysmal burn rate. At least today. Looking forward to playing more tomorrow.
-
8 days ago by moio | Reply
Day 5
Continued testing of Claude Code vs. GitHub Copilot vs. Gemini CLI
Impressions
- The Good Claude
- Meta-programming: While refactoring code to change to a different coordinate system in a hobby project, the agent had to translate some hardcoded patterns. Instead of brute-forcing the change, Claude wrote a Python script to perform the coordinate transformation, applied it to the patterns, and injected the results in the pull request. Neat!
- Insight: It didn't just write code; it built a tool to write the code.
- Follow-up: it did something similar for audio later. I needed sound effects; Claude wrote a script to synthesize them rather than hallucinating binary data.
- Instruction Adherence: I found Claude to be significantly better at following complex instructions and its own self-generated "TODO" lists than Gemini CLI + Code Assist.
- Ask without the Rush: it generally produced clearer explanations than GitHub Copilot and asked clarifying questions before rushing into implementation (unlike the "shoot first, ask later" approach of Gemini CLI).
- Rebasing: Claude was surprisingly competent at handling git rebases, which did not work well in Copilot
- PR reviews: Claude did excellent PR reviews, both in terms of content and reading clarity
- The Ugly Claude
- The "Pro" Misnomer: Claude Pro should really be renamed Claude Demo.
- Sharp Edge: The usage limits are comically low for serious development. You hit the wall just as you get into the "flow"
- The "Pro" Misnomer: Claude Pro should really be renamed Claude Demo.
Being quite unsure about the verdict, I have also given the three agents (Copilot with GPT-5.1, Gemini CLI with Code Assist Pro, and Claude with Sonnet 4.5) the same mid-sized task and observed the results.
The VERDICT for Today
Smart but Lazy. Claude comes up with better plans, writes clearer text and uses tools more intelligently than its competitors. However, the aggressive rate limits mean this "Senior Engineer" only works 2 hours a day before clocking out.
Better, but not radically so: Claude Code has the best UX of the three, and the model does have an edge as described, but I did not find the difference to be huge. Paying for a Claude Pro subscription when one already has Copilot or Gemini isn’t justified. The jury is still out on Claude Ultra, where potentially much more could be delegated to GitHub Actions.
Similar Projects
Preparing KubeVirtBMC for project transfer to the KubeVirt organization by zchang
Description
KubeVirtBMC is preparing to transfer the project to the KubeVirt organization. One requirement is to enhance the modeling design's security. The current v1alpha1 API (the VirtualMachineBMC CRD) was designed during the proof-of-concept stage. It's immature and inherently insecure due to its cross-namespace object references, exposing security concerns from an RBAC perspective.
The other long-awaited feature is the ability to mount virtual media so that virtual machines can boot from remote ISO images.
Goals
- Deliver the v1beta1 API and its corresponding controller implementation
- Enable the Redfish virtual media mount function for KubeVirt virtual machines
Resources
- The KubeVirtBMC repo: https://github.com/starbops/kubevirtbmc
- The new v1beta1 API: https://github.com/starbops/kubevirtbmc/issues/83
- Redfish virtual media mount: https://github.com/starbops/kubevirtbmc/issues/44
A CLI for Harvester by mohamed.belgaied
Harvester does not officially come with a CLI tool; the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, the manipulation of Kubevirt YAML objects is absolutely not user-friendly. Inspired by tools like multipass from Canonical to easily and rapidly create one or multiple VMs, I began the development of Harvester CLI. Currently it works, but Harvester CLI needs some love to be up-to-date with Harvester v1.0.2, and needs some bug fixes and improvements as well.
Project Description
Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as:
harvester vm create my-vm --count 5
to create 5 VMs named my-vm-01 to my-vm-05.
Harvester CLI is functional but needs a number of improvements: bringing functionality up to date with Harvester v1.0.2 (some minor issues right now), modifying the default behaviour to create an openSUSE VM instead of an Ubuntu VM, solving some bugs, etc.
Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli
Done in previous Hackweeks
- Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
- Automatically package Harvester CLI for OpenSUSE / Redhat RPMs or DEBs: DONE
Goal for this Hackweek
The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.
Some nice additions might be:
- Improve handling of namespaced objects
- Add features, such as network management or Load Balancer creation?
- Add more unit tests and, why not, e2e tests
- Improve CI
- Improve the overall code quality
- Test the program and create issues for it
Issue list is here: https://github.com/belgaied2/harvester-cli/issues
Resources
The project is written in Go, using client-go, the Kubernetes Go client libraries, to communicate with the Harvester API (which is in fact Kubernetes).
Welcome contributions are:
- Testing it and creating issues
- Documentation
- Go code improvement
What you might learn
Harvester CLI might be interesting to you if you want to learn more about:
- GitHub Actions
- Harvester as a SUSE Product
- Go programming language
- Kubernetes API
- Kubevirt API objects (Manipulating VMs and VM Configuration in Kubernetes using Kubevirt)
OpenPlatform Self-Service Portal by tmuntan1
Description
In SUSE IT, we developed an internal developer platform for our engineers using SUSE technologies such as RKE2, SUSE Virtualization, and Rancher. While it works well for our existing users, the onboarding process could be better.
To improve our customer experience, I would like to build a self-service portal to make it easy for people to accomplish common actions. To get started, I would have the portal create Jira SD tickets for our customers to have better information in our tickets, but eventually I want to add automation to reduce our workload.
Goals
- Build a frontend website (Angular) that helps customers create Jira SD tickets.
- Build a backend (Rust with Axum), which would do all the hard work for the frontend.
Resources (SUSE VPN only)
- development site: https://ui-dev.openplatform.suse.com/login?returnUrl=%2Fopenplatform%2Fforms
- https://gitlab.suse.de/itpe/core/open-platform/op-portal/backend
- https://gitlab.suse.de/itpe/core/open-platform/op-portal/frontend
Kubernetes-Based ML Lifecycle Automation by lmiranda
Description
This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.
The pipeline will automate the lifecycle of a machine learning model, including:
- Data ingestion/collection
- Model training as a Kubernetes Job
- Model artifact storage in an S3-compatible registry (e.g. Minio)
- A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
- A lightweight inference service that loads and serves the latest model
- Monitoring of model performance and service health through Prometheus/Grafana
The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.
Goals
By the end of Hack Week, the project should:
Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
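The registry-scanning part of that deployment controller could start from something as small as picking the newest version out of the bucket listing. A hedged sketch - the `models/<name>/v<N>/...` key layout is an assumed convention, not a fixed Minio feature:

```go
package main

import (
	"fmt"
	"strings"
)

// latestVersion returns the newest model version among S3-style object keys
// such as "models/classifier/v3/model.onnx". Versions are compared
// numerically after the leading "v"; keys without a version are ignored.
func latestVersion(keys []string) int {
	best := 0
	for _, key := range keys {
		for _, part := range strings.Split(key, "/") {
			var n int
			if _, err := fmt.Sscanf(part, "v%d", &n); err == nil && n > best {
				best = n
			}
		}
	}
	return best
}

func main() {
	keys := []string{
		"models/classifier/v1/model.onnx",
		"models/classifier/v3/model.onnx",
		"models/classifier/v2/model.onnx",
	}
	fmt.Println(latestVersion(keys)) // 3
	// The controller would generate and apply manifests only when this
	// exceeds the version currently deployed.
}
```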
Resources
Updates
- Training pipeline and datasets
- Inference Service py
Rancher/k8s Trouble-Maker by tonyhansen
Project Description
When studying for my RHCSA, I found trouble-maker, which is a program that breaks a Linux OS and requires you to fix it. I want to create something similar for Rancher/k8s that can allow for troubleshooting an unknown environment.
Goals for Hackweek 25
- Update to modern Rancher and verify that existing tests still work
- Change testing logic to populate secrets instead of requiring a secondary script
- Add new tests
Goals for Hackweek 24 (Complete)
- Create a basic framework for creating Rancher/k8s cluster lab environments as needed for the Break/Fix
- Create at least 5 modules that can be applied to the cluster and require troubleshooting
Resources
- https://github.com/celidon/rancher-troublemaker
- https://github.com/rancher/terraform-provider-rancher2
- https://github.com/rancher/tf-rancher-up
- https://github.com/rancher/quickstart
SUSE Virtualization (Harvester): VM Import UI flow by wombelix
Description
SUSE Virtualization (Harvester) has a vm-import-controller that allows migrating VMs from VMware and OpenStack, but users need to write manifest files and apply them with kubectl to use it. This project is about adding the missing UI pieces to the harvester-ui-extension, making VM Imports accessible without requiring Kubernetes and YAML knowledge.
VMware and OpenStack admins aren't automatically familiar with Kubernetes and YAML. Implementing the UI part for the VM Import feature makes it easier to use and more accessible. The Harvester Enhancement Proposal (HEP) VM Migration controller included a UI flow implementation in its scope. Issue #2274 received multiple comments that a UI integration would be a nice addition, and issue #4663 was created to request the implementation but eventually stalled.
Right now users need to manually create either VmwareSource or OpenstackSource resources, then write VirtualMachineImport manifests with network mappings and all the other configuration options. Users should be able to do that and track import status through the UI without writing YAML.
Work during the Hack Week will be done in this fork in a branch called suse-hack-week-25, making progress publicly visible and open for contributions. When everything works out and the branch is in good shape, it will be submitted as a pull request to harvester-ui-extension to get it included in the next Harvester release.
Testing will focus on VMware since that's what is available in the lab environment (SUSE Virtualization 1.6 single-node cluster, ESXi 8.0 standalone host). Given that this is about UI and surfacing what the vm-import-controller handles, the implementation should work for OpenStack imports as well.
This project is also a personal challenge to learn vue.js and get familiar with Rancher Extensions development, since harvester-ui-extension is built on that framework.
Goals
- Learn Vue.js and Rancher Extensions fundamentals required to finish the project
- Read and learn from other Rancher UI Extensions code, especially understanding the harvester-ui-extension code base
- Understand what the vm-import-controller and its CRDs require; identify ready-to-use components in the Rancher UI Extension API that can be leveraged
- Implement UI logic for creating and managing VmwareSource/OpenstackSource and VirtualMachineImport resources with all relevant configuration options and credentials
- Implement UI elements to display VirtualMachineImport status and errors
Resources
HEP and related discussion
- https://github.com/harvester/harvester/blob/master/enhancements/20220726-vm-migration.md
- https://github.com/harvester/harvester/issues/2274
- https://github.com/harvester/harvester/issues/4663
- SUSE Virtualization VM Import Documentation
- Rancher Extensions Documentation
- Rancher UI Plugin Examples
- Vue Router Essentials
- Vue Router API
- Vuex Documentation
Self-Scaling LLM Infrastructure Powered by Rancher by ademicev0
Self-Scaling LLM Infrastructure Powered by Rancher

Description
The Problem
Running LLMs can get expensive and complex pretty quickly.
Today there are typically two choices:
- Use cloud APIs like OpenAI or Anthropic. Easy to start with, but costs add up at scale.
- Self-host everything - set up Kubernetes, figure out GPU scheduling, handle scaling, manage model serving... it's a lot of work.
What if there was a middle ground?
What if infrastructure scaled itself instead of making you scale it?
Can we use existing Rancher capabilities like CAPI, autoscaling, and GitOps to make this simpler instead of building everything from scratch?
Project Repository: github.com/alexander-demicev/llmserverless
What This Project Does
A key feature is hybrid deployment: requests can be routed based on complexity or privacy needs. Simple or low-sensitivity queries can use public APIs (like OpenAI), while complex or private requests are handled in-house on local infrastructure. This flexibility allows balancing cost, privacy, and performance - using cloud for routine tasks and on-premises resources for sensitive or demanding workloads.
A complete, self-scaling LLM infrastructure that:
- Scales to zero when idle (no idle costs)
- Scales up automatically when requests come in
- Adds more nodes when needed, removes them when demand drops
- Runs on any infrastructure - laptop, bare metal, or cloud
Think of it as "serverless for LLMs" - focus on building, the infrastructure handles itself.
How It Works
A combination of open source tools working together:
Flow:
- Users interact with OpenWebUI (chat interface)
- Requests go to LiteLLM Gateway
- LiteLLM routes requests to:
- Ollama (Knative) for local model inference (auto-scales pods)
- Or cloud APIs for fallback
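The routing decision in the flow above can be sketched as a simple policy function. The thresholds and backend labels here are illustrative assumptions, not LiteLLM configuration:

```go
package main

import "fmt"

// Request captures the two routing signals described above.
type Request struct {
	Private    bool // contains sensitive data -> must stay on-prem
	Complexity int  // rough 0-10 estimate of task difficulty
}

// route decides where a request runs: private or demanding requests go to
// the local Ollama backend, routine ones to a cloud API.
func route(r Request) string {
	if r.Private || r.Complexity >= 7 {
		return "local/ollama"
	}
	return "cloud/openai"
}

func main() {
	fmt.Println(route(Request{Private: true, Complexity: 2}))  // local/ollama
	fmt.Println(route(Request{Private: false, Complexity: 3})) // cloud/openai
	fmt.Println(route(Request{Private: false, Complexity: 9})) // local/ollama
}
```

In the actual stack this policy would live in the LiteLLM gateway layer; Knative then scales the local backend to zero when no requests are routed there.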
Rancher Cluster Lifecycle Visualizer by jferraz
Description
Rancher’s v2 provisioning system represents each downstream cluster with several Kubernetes custom resources across multiple API groups, such as clusters.provisioning.cattle.io and clusters.management.cattle.io. Understanding why a cluster is stuck in states like "Provisioning", "Updating", or "Unavailable" often requires jumping between these resources, reading conditions, and correlating them with agent connectivity and known failure modes.
This project will build a Cluster Lifecycle Visualizer: a small, read-only controller that runs in the Rancher management cluster and generates a single, human-friendly view per cluster. It will watch Rancher cluster CRDs, derive a simplified lifecycle phase, keep a history of phase transitions from installation time onward, and attach a short, actionable recommendation string that hints at what the operator should check or do next.
Goals
- Provide a compact lifecycle summary for each Rancher-managed cluster (e.g. Provisioning, WaitingForClusterAgent, Active, Updating, Error) derived from provisioning.cattle.io/v1 Cluster and management.cattle.io/v3 Cluster status and conditions.
- Maintain a phase history for each cluster, allowing operators to see how its state evolved over time since the visualizer was installed.
- Attach a recommended action to the current phase using a small ruleset based on common Rancher failure modes (for example, cluster agent not connected, cluster still stabilizing after an upgrade, or generic error states), to improve the day-to-day debugging experience.
- Deliver an easy-to-install, read-only component (single YAML or small Helm chart) that Rancher users can deploy to their management cluster and inspect via kubectl get/describe, without UI changes or direct access to downstream clusters.
- Use idiomatic Go, wrangler, and Rancher APIs.
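The condition-to-phase derivation at the heart of this design could look roughly like this. The condition and phase names are illustrative; the real Rancher condition types differ and the ruleset would be larger:

```go
package main

import "fmt"

// Condition mirrors the shape of a Kubernetes status condition.
type Condition struct {
	Type   string
	Status string // "True" / "False" / "Unknown"
}

// derivePhase collapses cluster conditions into one simplified lifecycle
// phase, checked in priority order (most severe first).
func derivePhase(conds []Condition) string {
	get := func(t string) string {
		for _, c := range conds {
			if c.Type == t {
				return c.Status
			}
		}
		return "Unknown"
	}
	switch {
	case get("Stalled") == "True":
		return "Error"
	case get("Updated") == "False":
		return "Updating"
	case get("Connected") == "False":
		return "WaitingForClusterAgent"
	case get("Ready") == "True":
		return "Active"
	default:
		return "Provisioning"
	}
}

func main() {
	healthy := []Condition{{"Ready", "True"}, {"Connected", "True"}, {"Updated", "True"}}
	fmt.Println(derivePhase(healthy)) // Active
	fmt.Println(derivePhase([]Condition{{"Connected", "False"}}))
	// WaitingForClusterAgent
}
```

The recommendation string would hang off the same switch: each derived phase maps to a short "what to check next" hint.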
Resources
- Rancher Manager documentation on RKE2 and K3s cluster configuration and provisioning flows.
- Rancher API Go types for provisioning.cattle.io/v1 and management.cattle.io/v3 (from the rancher/rancher repository or published Go packages).
- Existing Rancher architecture docs and internal notes about cluster provisioning, cluster agents, and node agents.
- A local Rancher management cluster (k3s or RKE2) with a few test downstream clusters to validate phase detection, history tracking, and recommendations.
A CLI for Harvester by mohamed.belgaied
Harvester does not officially come with a CLI tool, the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, the manipulation of Kubevirt YAML objects is absolutely not user friendly. Inspired by tools like multipass from Canonical to easily and rapidly create one of multiple VMs, I began the development of Harvester CLI. Currently, it works but Harvester CLI needs some love to be up-to-date with Harvester v1.0.2 and needs some bug fixes and improvements as well.
Project Description
Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as:
harvester vm create my-vm --count 5
to create 5 VMs named my-vm-01 to my-vm-05.
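The naming scheme above (a base name expanded into numbered VM names) can be sketched in Go. The `vmNames` helper is purely illustrative of the described behaviour, not the actual CLI code:

```go
package main

import "fmt"

// vmNames expands a base name and count into numbered VM names,
// e.g. ("my-vm", 5) -> my-vm-01 ... my-vm-05, matching the behaviour
// described above. A count of 1 keeps the base name unchanged
// (an assumption for this sketch).
func vmNames(base string, count int) []string {
	if count <= 1 {
		return []string{base}
	}
	names := make([]string, 0, count)
	for i := 1; i <= count; i++ {
		names = append(names, fmt.Sprintf("%s-%02d", base, i))
	}
	return names
}

func main() {
	for _, n := range vmNames("my-vm", 5) {
		fmt.Println(n)
	}
}
```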
Harvester CLI is functional but needs a number of improvements: bringing functionality up to date with Harvester v1.0.2 (some minor issues right now), modifying the default behaviour to create an openSUSE VM instead of an Ubuntu VM, solving some bugs, etc.
Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli
Done in previous Hackweeks
- Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
- Automatically package Harvester CLI for OpenSUSE / Redhat RPMs or DEBs: DONE
Goal for this Hackweek
The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.
Some nice additions might be:
- Improve handling of namespaced objects
- Add features, such as network management or Load Balancer creation
- Add more unit tests and, why not, e2e tests
- Improve CI
- Improve the overall code quality
- Test the program and create issues for it
Issue list is here: https://github.com/belgaied2/harvester-cli/issues
Resources
The project is written in Go and uses client-go, the Kubernetes Go client library, to communicate with the Harvester API (which is in fact a Kubernetes API).
Welcome contributions include:
- Testing it and creating issues
- Documentation
- Go code improvement
What you might learn
Harvester CLI might be interesting to you if you want to learn more about:
- GitHub Actions
- Harvester as a SUSE Product
- Go programming language
- Kubernetes API
- Kubevirt API objects (Manipulating VMs and VM Configuration in Kubernetes using Kubevirt)
Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo
Description
Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.
Goals
Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:
- Gain insight into the latest AI trends, tools, and architectural concepts.
- Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).
Resources
Red Hat AI Topic Articles
- https://www.redhat.com/en/topics/ai
Kubeflow Documentation
- https://www.kubeflow.org/docs/
Q4 2025 CNCF Technology Landscape Radar report:
- https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
- https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
Agent-to-Agent (A2A) Protocol
- https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
Uyuni Health-check Grafana AI Troubleshooter by ygutierrez
Description
This project explores the feasibility of using the open-source Grafana LLM plugin to enhance the Uyuni Health-check tool with LLM capabilities. The idea is to integrate a chat-based "AI Troubleshooter" directly into existing dashboards, allowing users to ask natural-language questions about errors, anomalies, or performance issues.
Goals
- Investigate if and how the
- Investigate if and how the grafana-llm-app plug-in can be used within the Uyuni Health-check tool.
- Investigate if this plug-in can be used to query LLMs for troubleshooting scenarios.
- Evaluate support for local LLMs and external APIs through the plugin.
- Evaluate if and how the Uyuni MCP server could be integrated as another source of information.
Resources
Kubernetes-Based ML Lifecycle Automation by lmiranda
Description
This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go and containerized ML components.
The pipeline will automate the lifecycle of a machine learning model, including:
- Data ingestion/collection
- Model training as a Kubernetes Job
- Model artifact storage in an S3-compatible registry (e.g. Minio)
- A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
- A lightweight inference service that loads and serves the latest model
- Monitoring of model performance and service health through Prometheus/Grafana
The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.
Goals
By the end of Hack Week, the project should:
Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
Resources
Updates
- Training pipeline and datasets
- Inference Service py
Song Search with CLAP by gcolangiuli
Description
Contrastive Language-Audio Pretraining (CLAP) is an open-source library that enables the training of a neural network on both audio and text descriptions, making it possible to search for audio using a text input. Several pre-trained models for song search are already available on huggingface.
Goals
Evaluate how CLAP can be used for song searching and determine which types of queries yield the best results by developing a Minimum Viable Product (MVP) in Python. Based on the results of this MVP, future steps could include:
- Music Tagging;
- Free text search;
- Integration with an LLM (for example, with MCP or the OpenAI API) for music suggestions based on your own library.
The code for this project will be entirely written using AI to better explore and demonstrate AI capabilities.
Result
In this MVP we implemented:
- Async song analysis with the CLAP model
- Free Text Search of the songs
- Similar song search based on vector representation
- Containerised version with web interface
We also documented what went well and what can be improved in the use of AI.
You can have a look at the result here:
Future work could focus on performance improvements and the stability of the analysis.
References
- CLAP: The main model being researched;
- huggingface: Pre-trained models for CLAP;
- Free Music Archive: Creative Commons songs that can be used for testing;
Update M2Crypto by mcepl
There are a couple of projects I work on which need my attention and some work to get back into shape:
Goal for this Hackweek
- Put M2Crypto into better shape (most issues closed, all pull requests processed)
- Have more fun learning jujutsu
- Play more with Gemini, to see how much it helps (or not).
- Perhaps also (just slightly related) help fix vis to work with LuaJIT, particularly to make vis-lspc work.
SUSE Observability MCP server by drutigliano
Description
The idea is to implement the SUSE Observability Model Context Protocol (MCP) Server as a specialized, middle-tier API designed to translate the complex, high-cardinality observability data from StackState (topology, metrics, and events) into highly structured, contextually rich, and LLM-ready snippets.
This MCP Server abstracts the StackState APIs. Its primary function is to serve as a Tool/Function Calling target for AI agents. When an AI receives an alert or a user query (e.g., "What caused the outage?"), the AI calls an MCP Server endpoint. The server then fetches the relevant operational facts, summarizes them, normalizes technical identifiers (like URNs and raw metric names) into natural language concepts, and returns a concise JSON or YAML payload. This payload is then injected directly into the LLM's prompt, ensuring the final diagnosis or action is grounded in real-time, accurate SUSE Observability data, effectively minimizing hallucinations.
Goals
- Grounding AI Responses: Ensure that all AI diagnoses, root cause analyses, and action recommendations are strictly based on verifiable, real-time data retrieved from the SUSE Observability StackState platform.
- Simplifying Data Access: Abstract the complexity of StackState's native APIs (e.g., Time Travel, 4T Data Model) into simple, semantic functions that can be easily invoked by LLM tool-calling mechanisms.
- Data Normalization: Convert complex, technical identifiers (like component URNs, raw metric names, and proprietary health states) into standardized, natural language terms that an LLM can easily reason over.
- Enabling Automated Remediation: Define clear, action-oriented MCP endpoints (e.g., execute_runbook) that allow the AI agent to initiate automated operational workflows (e.g., restarts, scaling) after a diagnosis, closing the loop on observability.
Hackweek STEP
- Create a functional MCP endpoint exposing one (or more) tool(s) to answer queries like "What is the health of service X?" by fetching, normalizing, and returning live StackState data in an LLM-ready format.
Scope
- Implement a read-only MCP server that can:
- Connect to a live SUSE Observability instance and authenticate (with API token)
- Use tools to fetch data for a specific component URN (e.g., current health state, metrics, possibly topology neighbors, ...).
- Normalize response fields (e.g., URN to "Service Name," health state DEVIATING to "Unhealthy", raw metrics).
- Return the data as a structured JSON payload compliant with the MCP specification.
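The normalization step above can be sketched as a small mapping function. The DEVIATING → "Unhealthy" mapping follows the example in the scope; the other health-state entries, the URN layout, and the payload field names are illustrative assumptions:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// healthNames maps StackState health states to natural-language terms.
// DEVIATING -> "Unhealthy" follows the example above; the other entries
// are assumptions for this sketch.
var healthNames = map[string]string{
	"CLEAR":     "Healthy",
	"DEVIATING": "Unhealthy",
	"CRITICAL":  "Critical",
}

// ComponentSummary is the LLM-ready payload the tool would return.
type ComponentSummary struct {
	ServiceName string `json:"service_name"`
	Health      string `json:"health"`
}

// normalize turns a component URN and raw health state into a structured
// summary. The colon-separated URN layout used here is a simplification.
func normalize(urn, health string) ComponentSummary {
	parts := strings.Split(urn, ":")
	name := parts[len(parts)-1]
	h, ok := healthNames[health]
	if !ok {
		h = "Unknown"
	}
	return ComponentSummary{ServiceName: name, Health: h}
}

func main() {
	s := normalize("urn:stackstate:component:checkout-service", "DEVIATING")
	out, _ := json.Marshal(s)
	fmt.Println(string(out))
}
```

The real server would wrap this in an MCP tool handler and return the JSON as the tool's structured result.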
Deliverables
- MCP Server v0.1: a running Golang MCP server with at least one tool.
- A README.md and a test script (e.g., curl commands or a simple notebook) showing how an AI agent would call the endpoint and the resulting JSON payload.
Outcome: a functional and testable API endpoint that proves the core concept of translating complex StackState data into a simple, LLM-ready format. This provides the foundation for developing AI-driven diagnostics and automated remediation.
Resources
- https://www.honeycomb.io/blog/its-the-end-of-observability-as-we-know-it-and-i-feel-fine
- https://www.datadoghq.com/blog/datadog-remote-mcp-server
- https://modelcontextprotocol.io/specification/2025-06-18/index
- https://modelcontextprotocol.io/docs/develop/build-server
Basic implementation
- https://github.com/drutigliano19/suse-observability-mcp-server
Results
Successfully developed and delivered a fully functional SUSE Observability MCP Server that bridges language models with SUSE Observability's operational data. This project demonstrates how AI agents can perform intelligent troubleshooting and root cause analysis using structured access to real-time infrastructure data.
Example execution
Enable more features in mcp-server-uyuni by j_renner
Description
I would like to contribute to mcp-server-uyuni, the MCP server for Uyuni / Multi-Linux Manager, by exposing additional features as tools. There are lots of relevant features to be found throughout the API, for example:
- System operations and infos
- System groups
- Maintenance windows
- Ansible
- Reporting
- ...
At the end of the week I managed to enable basic system group operations:
- List all system groups visible to the user
- Create new system groups
- List systems assigned to a group
- Add and remove systems from groups
Goals
- Set up test environment locally with the MCP server and client + a recent MLM server [DONE]
- Identify features and use cases offering a benefit with limited effort required for enablement [DONE]
- Create a PR to the repo [DONE]
Resources
Is SUSE Trending? Popularity and Developer Sentiment Insight Using Native AI Capabilities by terezacerna
Description
This project aims to explore the popularity and developer sentiment around SUSE and its technologies compared to Red Hat and their technologies. Using publicly available data sources, I will analyze search trends, developer preferences, repository activity, and media presence. The final outcome will be an interactive Power BI dashboard that provides insights into how SUSE is perceived and discussed across the web and among developers.
Goals
- Assess the popularity of SUSE products and brand compared to Red Hat using Google Trends.
- Analyze developer satisfaction and usage trends from the Stack Overflow Developer Survey.
- Use the GitHub API to compare SUSE and Red Hat repositories in terms of stars, forks, contributors, and issue activity.
- Perform sentiment analysis on GitHub issue comments to measure community tone and engagement using built-in Copilot capabilities.
- Perform sentiment analysis on Reddit comments related to SUSE technologies using built-in Copilot capabilities.
- Use Gnews.io to track and compare the volume of news articles mentioning SUSE and Red Hat technologies.
- Test the integration of Copilot (AI) within Power BI for enhanced data analysis and visualization.
- Deliver a comprehensive Power BI report summarizing findings and insights.
- Test the full potential of Power BI, including its AI features and native language Q&A.
Resources
- Google Trends: Web scraping for search popularity data
- Stack Overflow Developer Survey: For technology popularity and satisfaction comparison
- GitHub API: For repository data (stars, forks, contributors, issues, comments).
- Gnews.io API: For article volume and mentions analysis.
- Reddit: SUSE related topics with comments.
issuefs: FUSE filesystem representing issues (e.g. JIRA) for the use with AI agents code-assistants by llansky3
Description
Creating a FUSE filesystem (issuefs) that mounts issues from various ticketing systems (Github, Jira, Bugzilla, Redmine) as files to your local file system.
And why is this a good idea?
- Users can use their favorite command line tools to view and search tickets from various sources.
- Users can use AI agent capabilities from their favorite IDE or CLI to ask questions about the issues, project, or functionality, while providing relevant tickets as context without extra work.
- Users can use it during development of new features, letting the AI agent jump-start the solution. issuefs gives the AI agent context about the bug or requested feature (the agent just reads a few more files). There is no need to copy and paste issues into the prompt or to use extra MCP tools to access the issues. You can still do that, but this approach is deliberately different.

Goals
- Add Github issue support
- Prove the concept/approach by applying it to issuefs itself, using GitHub issues for tracking and developing new features
- Add support for Bugzilla and Redmine, using this approach in the process of doing it. Record a video of it.
- Clean-up and test the implementation and create some documentation
- Create a blog post about this approach
Resources
There is a prototype implementation here. This currently sort of works with JIRA only.
