Self-Scaling LLM Infrastructure Powered by Rancher

Description
The Problem
Running LLMs can get expensive and complex pretty quickly.
Today there are typically two choices:
- Use cloud APIs like OpenAI or Anthropic. Easy to start with, but costs add up at scale.
- Self-host everything - set up Kubernetes, figure out GPU scheduling, handle scaling, manage model serving... it's a lot of work.
What if there was a middle ground?
What if infrastructure scaled itself instead of making you scale it?
Can we use existing Rancher capabilities like CAPI, autoscaling, and GitOps to make this simpler instead of building everything from scratch?
Project Repository: github.com/alexander-demicev/llmserverless
What This Project Does
A complete, self-scaling LLM infrastructure that:
- Scales to zero when idle (no idle costs)
- Scales up automatically when requests come in
- Adds more nodes when needed, removes them when demand drops
- Runs on any infrastructure - laptop, bare metal, or cloud
A key feature is hybrid deployment: requests can be routed based on complexity or privacy needs. Simple or low-sensitivity queries can use public APIs (like OpenAI), while complex or private requests are handled in-house on local infrastructure. This flexibility allows balancing cost, privacy, and performance - using cloud for routine tasks and on-premises resources for sensitive or demanding workloads.
Think of it as "serverless for LLMs": you focus on building, and the infrastructure handles itself.
How It Works
A combination of open source tools working together:
Flow:
- Users interact with OpenWebUI (chat interface)
- Requests go to LiteLLM Gateway
- LiteLLM routes requests to:
  - Ollama (Knative) for local model inference (auto-scales pods)
  - Cloud APIs as a fallback
- Cluster Autoscaler scales nodes up/down as needed
- Fleet keeps everything in sync via GitOps
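As a rough illustration, a LiteLLM proxy config for this kind of routing could look like the sketch below. The model names and the in-cluster Ollama URL are illustrative, not taken from the project; see the repository for the actual configuration.

```yaml
# Hypothetical LiteLLM proxy config: local Ollama first, cloud API as fallback
model_list:
  - model_name: local-llm
    litellm_params:
      model: ollama/llama3              # served by the Knative-backed Ollama service
      api_base: http://ollama.default.svc.cluster.local
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
router_settings:
  fallbacks:
    - local-llm: ["gpt-4o"]             # if the local model fails, fall back to the cloud API
```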
Goals
- Provide a middle ground between cloud APIs and self-hosted LLMs
- Enable cost-efficient, privacy-preserving, and flexible LLM deployments
- Make LLM infrastructure easy to deploy and manage (Helm chart, GitOps)
- Support local development and production scaling
- Experiment with hybrid routing, serverless scaling, and GitOps automation
Features
- Packaged as a Helm chart: The entire stack is delivered as a Helm chart for easy deployment. See the project repository for setup instructions.
- Scale to Zero: No requests? No pods. No pods? No nodes (well, minimum 1). LLM infrastructure costs nothing when idle.
- Hybrid Routing: Simple requests can use public APIs, while complex or private queries are handled in-house, balancing cost and privacy.
- GitOps Native: The entire stack is deployed and kept in sync as Fleet bundles (see the GitRepo sketch after the Tech Stack list).
- Local Development Ready: Uses KIND + Docker provider for local dev. Same architecture that scales to production.
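For the scale-to-zero feature above, Knative's autoscaling annotations do the heavy lifting. A minimal sketch of what the Ollama Knative Service could look like follows; the image and port are Ollama's defaults, the scaling bounds are illustrative, and the real manifests live in the repository.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"  # no pods while idle
        autoscaling.knative.dev/max-scale: "3"  # cap on replicas under load
    spec:
      containers:
        - image: ollama/ollama
          ports:
            - containerPort: 11434              # Ollama's default API port
```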
Tech Stack
- Rancher 2.13 - Cluster management (Turtles is now built-in!)
- Cluster API - Infrastructure as Kubernetes resources
- Knative Serving - Serverless pod autoscaling
- Ollama - Run LLMs locally
- LiteLLM - Unified LLM API gateway
- OpenWebUI - Chat interface
- Fleet - GitOps deployment
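On the GitOps side, pointing Fleet at the repository takes a single GitRepo resource. A minimal sketch, assuming the bundles live under a `fleet` directory (the path and targeting are illustrative):

```yaml
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: llmserverless
  namespace: fleet-default
spec:
  repo: https://github.com/alexander-demicev/llmserverless
  branch: main
  paths:
    - fleet                  # assumed location of the bundle definitions
  targets:
    - clusterSelector: {}    # deploy to all downstream clusters
```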
What's Next
This is a hackweek project, but here are ideas for the future:
- GPU node pools for production workloads
- Cloud provider templates (AWS/Azure/GCP)
- Smarter routing based on prompt complexity
- Cost tracking dashboard
- Response caching
Setup & Usage
For all setup and usage instructions, please refer to the project repository.
Why This Matters
LLMs are becoming a core part of many applications. But infrastructure options are still catching up.
This project explores a middle path:
- Privacy - run models locally, keep data in-house
- Cost efficiency - scale to zero, pay only for actual usage
- Flexibility - mix local and cloud models based on needs
- Simplicity - one command to deploy, GitOps to manage
It's an experiment in making LLM infrastructure more accessible and practical.
Updates
- Update 1: Pushed an earlier project prototype, along with the changes needed to run it on the most recent Rancher version
- Update 2: Added multiple improvements to the POC
Hackweek Results and Conclusion
Project Repository: github.com/alexander-demicev/llmserverless
The main conclusion is that it’s already possible to build something like this using the existing Rancher provisioning and management features. However, there are still a few questions and areas to improve for the future:
- The POC is based on kubeadm; it can and should be migrated to RKE2.
- The SUSE AI stack wasn't used, for the sake of time efficiency; the goal was to assemble something that might currently be missing from it.
- Cluster Autoscaler is getting support in Rancher, so the POC should be updated to use the autoscaler setup recommended by Rancher.
- I'm not sure Knative is the best tool for self-scaling; maybe KEDA would be a better alternative? I found Knative a bit complicated to configure and use, and it might be overhead for the scope we have.
This project is part of:
Hack Week 25
Similar Projects
Cluster API Provider for Harvester by rcase
Project Description
The Cluster API "infrastructure provider" for Harvester, also named CAPHV, makes it possible to use Harvester with Cluster API. This enables people and organisations to create Kubernetes clusters running on VMs created by Harvester using a declarative spec.
The project has been bootstrapped in HackWeek 23, and its code is available here.
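For readers new to Cluster API, the declarative spec is a set of Kubernetes resources, with a standard CAPI Cluster delegating infrastructure to a Harvester-specific object. A rough sketch follows; the HarvesterCluster apiVersion and fields are assumptions, so check the repository's templates for the real shape.

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo
spec:
  infrastructureRef:        # delegates VM infrastructure to the CAPHV provider
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1
    kind: HarvesterCluster
    name: demo
  # controlPlaneRef (e.g. a CAPRKE2 control plane) omitted for brevity
```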
Work done in HackWeek 2023
- Have an early working version of the provider available on Rancher Sandbox: DONE
- Demonstrated the created cluster can be imported using Rancher Turtles: DONE
- Stretch goal - demonstrate using the new provider with CAPRKE2: DONE and the templates are available on the repo
DONE in HackWeek 24:
- Add more Unit Tests
- Improve Status Conditions for some phases
- Add cloud provider config generation
- Testing with Harvester v1.3.2
- Template improvements
- Issues creation
DONE in 2025 (out of Hackweek)
- Support for ClusterClass
- Added to the `clusterctl` community providers; you can add it directly with `clusterctl`
- Testing on newer versions of Harvester v1.4.X and v1.5.X
- Support for `clusterctl generate cluster ...`
- Improve Status Conditions to reflect current state of Infrastructure
- Improve CI (some bugs for release creation)
Goals for HackWeek 2025
- FIRST and FOREMOST, any topic that is important to you
- Add e2e testing
- Certify the provider for Rancher Turtles
- Add Machine pool labeling
- Add PCI-e passthrough capabilities.
- Other improvement suggestions are welcome!
Thanks to @isim and Dominic Giebert for their contributions!
Resources
Looking for help from anyone interested in Cluster API (CAPI) or who wants to learn more about Harvester.
This will be an infrastructure provider for Cluster API. Some background reading for the CAPI aspect:
SUSE Virtualization (Harvester): VM Import UI flow by wombelix
Description
SUSE Virtualization (Harvester) has a vm-import-controller that allows migrating VMs from VMware and OpenStack, but users need to write manifest files and apply them with kubectl to use it. This project is about adding the missing UI pieces to the harvester-ui-extension, making VM Imports accessible without requiring Kubernetes and YAML knowledge.
VMware and OpenStack admins aren't automatically familiar with Kubernetes and YAML. Implementing the UI part for the VM Import feature makes it easier to use and more accessible. The Harvester Enhancement Proposal (HEP) VM Migration controller included a UI flow implementation in its scope. Issue #2274 received multiple comments that a UI integration would be a nice addition, and issue #4663 was created to request the implementation but eventually stalled.
Right now users need to manually create either VmwareSource or OpenstackSource resources, then write VirtualMachineImport manifests with network mappings and all the other configuration options. Users should be able to do that and track import status through the UI without writing YAML.
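For context, a manual import today means writing manifests roughly like the following. This is a sketch based on the vm-import-controller documentation; exact field names may differ between versions, and the endpoint, datacenter, and network values are placeholders.

```yaml
apiVersion: migration.harvesterhci.io/v1beta1
kind: VmwareSource
metadata:
  name: vcenter
  namespace: default
spec:
  endpoint: "https://vcenter.example.com/sdk"
  dc: "DC0"
  credentials:
    name: vcenter-credentials      # secret holding username/password
    namespace: default
---
apiVersion: migration.harvesterhci.io/v1beta1
kind: VirtualMachineImport
metadata:
  name: demo-vm
  namespace: default
spec:
  virtualMachineName: "demo-vm"    # name of the VM on the VMware side
  sourceCluster:
    apiVersion: migration.harvesterhci.io/v1beta1
    kind: VmwareSource
    name: vcenter
    namespace: default
  networkMapping:
    - sourceNetwork: "VM Network"
      destinationNetwork: "default/vlan1"
```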
Work during the Hack Week will be done in this fork in a branch called suse-hack-week-25, making progress publicly visible and open for contributions. When everything works out and the branch is in good shape, it will be submitted as a pull request to harvester-ui-extension to get it included in the next Harvester release.
Testing will focus on VMware since that's what is available in the lab environment (SUSE Virtualization 1.6 single-node cluster, ESXi 8.0 standalone host). Given that this is about UI and surfacing what the vm-import-controller handles, the implementation should work for OpenStack imports as well.
This project is also a personal challenge to learn vue.js and get familiar with Rancher Extensions development, since harvester-ui-extension is built on that framework.
Goals
- Learn Vue.js and Rancher Extensions fundamentals required to finish the project
- Read and learn from other Rancher UI Extensions code, especially understanding the `harvester-ui-extension` code base
- Understand what the `vm-import-controller` and its CRDs require; identify ready-to-use components in the Rancher UI Extension API that can be leveraged
- Implement UI logic for creating and managing `VmwareSource`/`OpenstackSource` and `VirtualMachineImport` resources with all relevant configuration options and credentials
- Implement UI elements to display `VirtualMachineImport` status and errors
Resources
HEP and related discussion
- https://github.com/harvester/harvester/blob/master/enhancements/20220726-vm-migration.md
- https://github.com/harvester/harvester/issues/2274
- https://github.com/harvester/harvester/issues/4663
SUSE Virtualization VM Import Documentation
Rancher Extensions Documentation
Rancher UI Plugin Examples
Vue Router Essentials
Vue Router API
Vuex Documentation
Liz - Prompt autocomplete by ftorchia
Description
Liz is the Rancher AI assistant for cluster operations.
Goals
We want to help users when sending new messages to Liz, by adding an autocomplete feature to complete their requests based on the context.
Example:
- User prompt: "Can you show me the list of p"
- Autocomplete suggestion: "Can you show me the list of p...od in local cluster?"
Example:
- User prompt: "Show me the logs of #rancher-"
- Chat console: It shows a drop-down widget, next to the # character, with the list of available pod names starting with "rancher-".
Technical Overview
- The AI agent should expose a new ws/autocomplete endpoint to proxy autocomplete messages to the LLM.
- The UI extension should be able to display prompt suggestions and allow users to apply the autocomplete to the Prompt via keyboard shortcuts.
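One way the exchange over such an endpoint might look is sketched below. This is entirely hypothetical; the actual message format is up to the implementation.

```yaml
# Hypothetical ws/autocomplete message exchange
request:
  prompt: "Can you show me the list of p"
  context:
    cluster: local
response:
  suggestion: "ods in the local cluster?"   # text to append after the cursor
```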
Resources
Rancher/k8s Trouble-Maker by tonyhansen
Project Description
When studying for my RHCSA, I found trouble-maker, which is a program that breaks a Linux OS and requires you to fix it. I want to create something similar for Rancher/k8s that can allow for troubleshooting an unknown environment.
Goals for Hackweek 25
- Update to modern Rancher and verify that existing tests still work
- Change testing logic to populate secrets instead of requiring a secondary script
- Add new tests
Goals for Hackweek 24 (Complete)
- Create a basic framework for creating Rancher/k8s cluster lab environments as needed for the Break/Fix
- Create at least 5 modules that can be applied to the cluster and require troubleshooting
Resources
- https://github.com/celidon/rancher-troublemaker
- https://github.com/rancher/terraform-provider-rancher2
- https://github.com/rancher/tf-rancher-up
- https://github.com/rancher/quickstart
Rancher Cluster Lifecycle Visualizer by jferraz
Description
Rancher’s v2 provisioning system represents each downstream cluster with several Kubernetes custom resources across multiple API groups, such as clusters.provisioning.cattle.io and clusters.management.cattle.io. Understanding why a cluster is stuck in states like "Provisioning", "Updating", or "Unavailable" often requires jumping between these resources, reading conditions, and correlating them with agent connectivity and known failure modes.
This project will build a Cluster Lifecycle Visualizer: a small, read-only controller that runs in the Rancher management cluster and generates a single, human-friendly view per cluster. It will watch Rancher cluster CRDs, derive a simplified lifecycle phase, keep a history of phase transitions from installation time onward, and attach a short, actionable recommendation string that hints at what the operator should check or do next.
Goals
- Provide a compact lifecycle summary for each Rancher-managed cluster (e.g. Provisioning, WaitingForClusterAgent, Active, Updating, Error) derived from `provisioning.cattle.io/v1 Cluster` and `management.cattle.io/v3 Cluster` status and conditions.
- Maintain a phase history for each cluster, allowing operators to see how its state evolved over time since the visualizer was installed.
- Attach a recommended action to the current phase using a small ruleset based on common Rancher failure modes (for example, cluster agent not connected, cluster still stabilizing after an upgrade, or generic error states), to improve the day-to-day debugging experience.
- Deliver an easy-to-install, read-only component (single YAML or small Helm chart) that Rancher users can deploy to their management cluster and inspect via `kubectl get`/`describe`, without UI changes or direct access to downstream clusters.
- Use idiomatic Go, wrangler, and Rancher APIs.
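As a thought experiment, the per-cluster summary could be exposed as a custom resource along these lines. The group, kind, and field names are hypothetical, invented here only to make the idea concrete.

```yaml
# Hypothetical resource the visualizer might maintain per cluster
apiVersion: lifecycle.example.io/v1alpha1
kind: ClusterLifecycleSummary
metadata:
  name: downstream-1
status:
  phase: WaitingForClusterAgent
  recommendation: "Check that cattle-cluster-agent is running and can reach the Rancher server URL."
  history:
    - phase: Provisioning
      enteredAt: "2025-11-24T10:02:00Z"
    - phase: WaitingForClusterAgent
      enteredAt: "2025-11-24T10:15:00Z"
```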
Resources
- Rancher Manager documentation on RKE2 and K3s cluster configuration and provisioning flows.
- Rancher API Go types for `provisioning.cattle.io/v1` and `management.cattle.io/v3` (from the `rancher/rancher` repository or published Go packages).
- Existing Rancher architecture docs and internal notes about cluster provisioning, cluster agents, and node agents.
- A local Rancher management cluster (k3s or RKE2) with a few test downstream clusters to validate phase detection, history tracking, and recommendations.
Extended private brain - RAG my own scripts and data into offline LLM AI by tjyrinki_suse
Description
For purely studying purposes, I'd like to find out if I could teach an LLM some of my own accumulated knowledge, to use it as a sort of extended brain.
I might use qwen3-coder or something similar as a starting point.
Everything would be done 100% offline without network available to the container, since I prefer to see when network is needed, and make it so it's never needed (other than initial downloads).
Goals
- Learn something about RAG, LLM, AI.
- Find out if everything works offline as intended.
- As an end result, have a new way to access my own existing know-how, so that I can query the wisdom in my scripts and data.
- Be flexible to pivot in any direction, as long as there are new things learned.
Resources
To be found on the fly.
Timeline
Day 1 (of 4)
- Tried out a RAG demo, expanded on feeding it my own data
- Experimented with qwen3-coder to add a persistent chat functionality, and keeping vectors in a pickle file
- Optimizations to keep everything within context window
- Learn and add a bit of PyTest
Day 2
- More experimenting and more data
- Study ChromaDB
- Add a Web UI that works from another computer even though the container sees network is down
Day 3
- The above RAG is working well enough for demonstration purposes.
- Pivot to trying out OpenCode, configuring local Ollama qwen3-coder there, to analyze the RAG demo.
- Figured out how to configure the Ollama template to be usable under OpenCode. OpenCode locally is super slow compared to just running qwen3-coder alone.
Day 4 (final day)
- Battled with OpenCode, which was both slow and kept piling up broken things.
- Called it a success, as after all the agentic AI was working locally.
- Cleaned up the mess left behind a bit.
Blog Post
Summarized the findings in a blog post.
Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo
Description
Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.
Goals
Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:
- Gain insight into the latest AI trends, tools, and architectural concepts.
- Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).
Resources
Red Hat AI Topic Articles
- https://www.redhat.com/en/topics/ai
Kubeflow Documentation
- https://www.kubeflow.org/docs/
Q4 2025 CNCF Technology Landscape Radar report:
- https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
- https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
Agent-to-Agent (A2A) Protocol
- https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
Docs Navigator MCP: SUSE Edition by mackenzie.techdocs

Description
Docs Navigator MCP: SUSE Edition is an AI-powered documentation navigator that makes finding information across SUSE, Rancher, K3s, and RKE2 documentation effortless. Built as a Model Context Protocol (MCP) server, it enables semantic search, intelligent Q&A, and documentation summarization using 100% open-source AI models (no API keys required!). The project also allows you to bring your own keys from Anthropic and OpenAI for parallel processing.
Goals
- [x] Build functional MCP server with documentation tools
- [x] Implement semantic search with vector embeddings
- [x] Create user-friendly web interface
- [x] Optimize indexing performance (parallel processing)
- [x] Add SUSE branding and polish UX
- [x] Stretch Goal: Add more documentation sources
- [x] Stretch Goal: Implement document change detection for auto-updates
Coming Soon!
- Community Feedback: Test with real users and gather improvement suggestions
Resources
- Repository: Docs Navigator MCP: SUSE Edition GitHub
- UI Demo: Live UI Demo of Docs Navigator MCP: SUSE Edition
SUSE Observability MCP server by drutigliano
Description
The idea is to implement the SUSE Observability Model Context Protocol (MCP) Server as a specialized, middle-tier API designed to translate the complex, high-cardinality observability data from StackState (topology, metrics, and events) into highly structured, contextually rich, and LLM-ready snippets.
This MCP server abstracts the StackState APIs. Its primary function is to serve as a Tool/Function Calling target for AI agents. When an AI receives an alert or a user query (e.g., "What caused the outage?"), the AI calls an MCP Server endpoint. The server then fetches the relevant operational facts, summarizes them, normalizes technical identifiers (like URNs and raw metric names) into natural language concepts, and returns a concise JSON or YAML payload. This payload is then injected directly into the LLM's prompt, ensuring the final diagnosis or action is grounded in real-time, accurate SUSE Observability data, effectively minimizing hallucinations.
Goals
- Grounding AI Responses: Ensure that all AI diagnoses, root cause analyses, and action recommendations are strictly based on verifiable, real-time data retrieved from the SUSE Observability StackState platform.
- Simplifying Data Access: Abstract the complexity of StackState's native APIs (e.g., Time Travel, 4T Data Model) into simple, semantic functions that can be easily invoked by LLM tool-calling mechanisms.
- Data Normalization: Convert complex, technical identifiers (like component URNs, raw metric names, and proprietary health states) into standardized, natural language terms that an LLM can easily reason over.
- Enabling Automated Remediation: Define clear, action-oriented MCP endpoints (e.g., execute_runbook) that allow the AI agent to initiate automated operational workflows (e.g., restarts, scaling) after a diagnosis, closing the loop on observability.
Hackweek STEP
- Create a functional MCP endpoint exposing one (or more) tools to answer queries like "What is the health of service X?" by fetching, normalizing, and returning live StackState data in an LLM-ready format.
Scope
- Implement a read-only MCP server that can:
  - Connect to a live SUSE Observability instance and authenticate (with an API token)
  - Use tools to fetch data for a specific component URN (e.g., current health state, metrics, possibly topology neighbors, ...)
  - Normalize response fields (e.g., URN to "Service Name", health state DEVIATING to "Unhealthy", raw metrics)
  - Return the data as a structured JSON payload compliant with the MCP specification
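To make the normalization step concrete, the payload for a health query might look something like this. The tool name, shape, and field names are hypothetical, chosen only to illustrate the URN-to-name and health-state mapping described above.

```yaml
# Hypothetical normalized response from a get_service_health tool call
service_name: "checkout-service"      # derived from the component URN
health: "Unhealthy"                   # StackState state DEVIATING, normalized
last_state_change: "2025-11-20T09:41:00Z"
key_metrics:
  error_rate_percent: 12.4
  p95_latency_ms: 830
related_components:
  - name: "payments-db"
    health: "Healthy"
```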
Deliverables
- MCP Server v0.1: a running Golang MCP server with at least one tool.
- A README.md and a test script (e.g., curl commands or a simple notebook) showing how an AI agent would call the endpoint and the resulting JSON payload.
Outcome: a functional and testable API endpoint proving the core concept of translating complex StackState data into a simple, LLM-ready format. This provides the foundation for developing AI-driven diagnostics and automated remediation.
Resources
- https://www.honeycomb.io/blog/its-the-end-of-observability-as-we-know-it-and-i-feel-fine
- https://www.datadoghq.com/blog/datadog-remote-mcp-server
- https://modelcontextprotocol.io/specification/2025-06-18/index
- https://modelcontextprotocol.io/docs/develop/build-server
Basic implementation
- https://github.com/drutigliano19/suse-observability-mcp-server
Results
Successfully developed and delivered a fully functional SUSE Observability MCP Server that bridges language models with SUSE Observability's operational data. This project demonstrates how AI agents can perform intelligent troubleshooting and root cause analysis using structured access to real-time infrastructure data.
Example execution
Explore LLM evaluation metrics by thbertoldi
Description
Learn the best practices for evaluating LLM performance with an open-source framework such as DeepEval.
Goals
Curate the knowledge learned during practice and present it to colleagues.
-> Maybe publish a blog post on SUSE's blog?
Resources
https://deepeval.com
https://docs.pactflow.io/docs/bi-directional-contract-testing
Song Search with CLAP by gcolangiuli
Description
Contrastive Language-Audio Pretraining (CLAP) is an open-source library that enables the training of a neural network on both Audio and Text descriptions, making it possible to search for Audio using a Text input. Several pre-trained models for song search are already available on huggingface
Goals
Evaluate how CLAP can be used for song searching and determine which types of queries yield the best results by developing a Minimum Viable Product (MVP) in Python. Based on the results of this MVP, future steps could include:
- Music Tagging;
- Free text search;
- Integration with an LLM (for example, with MCP or the OpenAI API) for music suggestions based on your own library.
The code for this project will be entirely written using AI to better explore and demonstrate AI capabilities.
Result
In this MVP we implemented:
- Async song analysis with the CLAP model
- Free Text Search of the songs
- Similar song search based on vector representation
- Containerised version with web interface
We also documented what went well and what can be improved in the use of AI.
You can have a look at the result here:
Future work could focus on performance improvements and the stability of the analysis.
References
- CLAP: The main model being researched;
- huggingface: Pre-trained models for CLAP;
- Free Music Archive: Creative Commons songs that can be used for testing;
issuefs: FUSE filesystem representing issues (e.g. JIRA) for the use with AI agents code-assistants by llansky3
Description
Creating a FUSE filesystem (issuefs) that mounts issues from various ticketing systems (GitHub, Jira, Bugzilla, Redmine) as files in your local file system.
And why is this a good idea?
- Users can use their favorite command-line tools to view and search tickets from various sources.
- Users can use AI agent capabilities from their favorite IDE or CLI to ask questions about the issues, project, or functionality, while providing relevant tickets as context without extra work.
- Users can use it during development of new features, letting the AI agent jump-start the solution. issuefs gives the AI agent context about the bug or requested feature (the agent just reads a few more files). There is no need to copy and paste issues into the prompt or to use extra MCP tools to access them; you can still do that, but this approach is deliberately different.

Goals
- Add Github issue support
- Prove the concept/approach by applying it to itself, using GitHub issues for tracking and development of new features
- Add support for Bugzilla and Redmine, using this approach in the process of doing it. Record a video of it.
- Clean-up and test the implementation and create some documentation
- Create a blog post about this approach
Resources
There is a prototype implementation here. This currently sort of works with JIRA only.
Backporting patches using LLM by jankara
Description
Backporting Linux kernel fixes (either for CVE issues or as part of the general git-fixes workflow) is boring and mostly mechanical work (dealing with changes in context, renamed variables, new helper functions, etc.). The idea of this project is to explore the use of LLMs for backporting Linux kernel commits to SUSE kernels.
Goals
- Create a safe environment that allows the LLM to run and backport patches without exposing the whole filesystem to it (for privacy and security reasons).
- Write a prompt that will guide the LLM through the backporting process. Fine-tune it based on experimental results.
- Explore the success rate of LLMs when backporting various patches.
Resources
- Docker
- Gemini CLI
Repository
The current version of the container, with some instructions for use, is at: https://gitlab.suse.de/jankara/gemini-cli-backporter
Kubernetes-Based ML Lifecycle Automation by lmiranda
Description
This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.
The pipeline will automate the lifecycle of a machine learning model, including:
- Data ingestion/collection
- Model training as a Kubernetes Job
- Model artifact storage in an S3-compatible registry (e.g. Minio)
- A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
- A lightweight inference service that loads and serves the latest model
- Monitoring of model performance and service health through Prometheus/Grafana
The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.
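As an illustration of the training step in this pipeline, the system might submit a Kubernetes Job along these lines. The image, bucket, and Minio endpoint are placeholders, not the project's actual manifests.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training-v3
spec:
  backoffLimit: 2
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/ml/train:latest   # placeholder training image
          env:
            - name: S3_ENDPOINT                         # S3-compatible model registry, e.g. Minio
              value: http://minio.ml.svc.cluster.local:9000
            - name: MODEL_BUCKET
              value: models
```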
Goals
By the end of Hack Week, the project should:
Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
Resources
Updates
- Training pipeline and datasets
- Inference service (Python)
A CLI for Harvester by mohamed.belgaied
Harvester does not officially come with a CLI tool; the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, the manipulation of Kubevirt YAML objects is absolutely not user friendly. Inspired by tools like multipass from Canonical that easily and rapidly create one or multiple VMs, I began the development of Harvester CLI. Currently it works, but Harvester CLI needs some love to be up to date with Harvester v1.0.2, and needs some bug fixes and improvements as well.
Project Description
Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as:
harvester vm create my-vm --count 5
to create 5 VMs named my-vm-01 to my-vm-05.
Harvester CLI is functional but needs a number of improvements: up-to-date functionality with Harvester v1.0.2 (some minor issues right now), modifying the default behaviour to create an opensuse VM instead of an ubuntu VM, solve some bugs, etc.
Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli
Done in previous Hackweeks
- Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
- Automatically package Harvester CLI for OpenSUSE / Redhat RPMs or DEBs: DONE
Goal for this Hackweek
The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.
Some nice additions might be:
- Improve handling of namespaced objects
- Add features, such as network management or Load Balancer creation?
- Add more unit tests and, why not, e2e tests
- Improve CI
- Improve the overall code quality
- Test the program and create issues for it
Issue list is here: https://github.com/belgaied2/harvester-cli/issues
Resources
The project is written in Go and uses client-go, the Kubernetes Go client library, to communicate with the Harvester API (which is in fact Kubernetes).
Welcome contributions include:
- Testing it and creating issues
- Documentation
- Go code improvement
What you might learn
Harvester CLI might be interesting to you if you want to learn more about:
- GitHub Actions
- Harvester as a SUSE Product
- Go programming language
- Kubernetes API
- Kubevirt API objects (Manipulating VMs and VM Configuration in Kubernetes using Kubevirt)
The Agentic Rancher Experiment: Do Androids Dream of Electric Cattle? by moio
Rancher is a beast of a codebase. Let's investigate if the new 2025 generation of GitHub Autonomous Coding Agents and Copilot Workspaces can actually tame it. 
The Plan
Create a sandbox GitHub Organization, clone in key Rancher repositories, and let the AI loose to see if it can handle real-world enterprise OSS maintenance - or if it just hallucinates new breeds of Kubernetes resources!
Specifically, throw "Agentic Coders" some typical tasks in a complex, long-lived open-source project, such as:
❥ The Grunt Work: generate missing GoDocs, unit tests, and refactorings. Rebase PRs.
❥ The Complex Stuff: fix actual (historical) bugs and feature requests to see if they can traverse the complexity without (too much) human hand-holding.
❥ Hunting Down Gaps: find areas lacking in docs, areas of improvement in code, dependency bumps, and so on.
If time allows, also experiment with Model Context Protocol (MCP) to give agents context on our specific build pipelines and CI/CD logs.
Why?
We know AI can write "Hello World." and also moderately complex programs from a green field. But can it rebase a 3-month-old PR with conflicts in rancher/rancher? I want to find the breaking point of current AI agents to determine if and how they can help us to reduce our technical debt, work faster and better. At the same time, find out about pitfalls and shortcomings.
The CONCLUSION!!!
A "State of the Union" document was compiled to summarize lessons learned this week. For more gory details, just read the diary below!
Technical talks at universities by agamez
Description
This project aims to empower the next generation of tech professionals by offering hands-on workshops on containerization and Kubernetes, with a strong focus on open-source technologies. By providing practical experience with these cutting-edge tools and fostering a deep understanding of open-source principles, we aim to bridge the gap between academia and industry.
For now, the scope is limited to Spanish universities, since we already have the contacts and have started some conversations.
Goals
- Technical Skill Development: equip students with the fundamental knowledge and skills to build, deploy, and manage containerized applications using open-source tools like Kubernetes.
- Open-Source Mindset: foster a passion for open-source software, encouraging students to contribute to open-source projects and collaborate with the global developer community.
- Career Readiness: prepare students for industry-relevant roles by exposing them to real-world use cases, best practices, and open-source in companies.
Resources
- Instructors: experienced open-source professionals with deep knowledge of containerization and Kubernetes.
- SUSE Expertise: leverage SUSE's expertise in open-source technologies to provide insights into industry trends and best practices.
