Description

This project aims to build a complete end-to-end machine learning pipeline that runs entirely on Kubernetes, using Go and containerized ML components.

The pipeline will automate the lifecycle of a machine learning model, including:

Data ingestion/collection
Model training as a Kubernetes Job
Model artifact storage in an S3-compatible registry (e.g. Minio)
A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
A lightweight inference service that loads and serves the latest model
Monitoring of model performance and service health through Prometheus/Grafana

The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.

Goals

By the end of Hack Week, the project should:

Produce a fully functional ML pipeline running on Kubernetes with:
- Data collection job
- Training job container
- Storage and versioning of trained models
- Automated deployment of new model versions
- Model inference API service
- Basic monitoring dashboards
Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.
Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).
Prepare a short demo explaining the end-to-end process and how new models flow through the system.
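
The deployment automation goal above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes model artifacts are stored under object keys of the form models/<name>/v<N>/... (a hypothetical layout), and the image name and the MODEL_KEY environment variable are placeholders. Listing the bucket and applying the manifest to the cluster are left out; the sketch only picks the newest version and renders a Deployment manifest with the standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
	"strconv"
	"text/template"
)

// versionRe matches the hypothetical key layout models/<name>/v<N>/...
var versionRe = regexp.MustCompile(`^models/([^/]+)/v(\d+)/`)

// LatestVersion returns the highest version number found for the given model
// among the object keys, or ok=false if no key matches.
func LatestVersion(keys []string, model string) (version int, ok bool) {
	for _, k := range keys {
		m := versionRe.FindStringSubmatch(k)
		if m == nil || m[1] != model {
			continue
		}
		v, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		if !ok || v > version {
			version, ok = v, true
		}
	}
	return version, ok
}

// deploymentTmpl is a deliberately small Deployment manifest.
const deploymentTmpl = `apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{.Model}}-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: {{.Model}}-inference
  template:
    metadata:
      labels:
        app: {{.Model}}-inference
        model-version: "v{{.Version}}"
    spec:
      containers:
      - name: inference
        image: {{.Image}}
        env:
        - name: MODEL_KEY
          value: models/{{.Model}}/v{{.Version}}/model.bin
`

// DeploymentParams holds the values substituted into the template.
type DeploymentParams struct {
	Model   string
	Version int
	Image   string
}

// RenderDeployment fills the template and returns the manifest as YAML text.
func RenderDeployment(p DeploymentParams) (string, error) {
	t, err := template.New("deployment").Parse(deploymentTmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, p); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Example keys as they might be returned by a bucket listing.
	keys := []string{
		"models/churn/v1/model.bin",
		"models/churn/v3/model.bin",
		"models/churn/v2/model.bin",
	}
	v, ok := LatestVersion(keys, "churn")
	if !ok {
		fmt.Println("no versions found")
		return
	}
	manifest, err := RenderDeployment(DeploymentParams{
		Model:   "churn",
		Version: v,
		Image:   "registry.example.com/inference:latest", // placeholder image
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Print(manifest)
}
```

Keeping version selection and manifest rendering as pure functions makes them easy to test without a cluster or an object store.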

Resources

Project Repository

Updates

Training pipeline and datasets
Inference Service py

Looking for hackers with the skills:

ai mlops kubernetes ml learning

This project is part of:

Hack Week 25

Activity

2 months ago: lmiranda started this project.

2 months ago: lmiranda originated this project.

Similar Projects

ai

Extended private brain - RAG my own scripts and data into offline LLM AI by tjyrinki_suse

Description

For purely studying purposes, I'd like to find out if I could teach an LLM some of my own accumulated knowledge, to use it as a sort of extended brain.

I might use qwen3-coder or something similar as a starting point.

Everything would be done 100% offline without network available to the container, since I prefer to see when network is needed, and make it so it's never needed (other than initial downloads).

Goals

Learn something about RAG, LLM, AI.
Find out if everything works offline as intended.
As an end result, have a new way to access my own existing know-how, so that I can query the wisdom in it.
Be flexible to pivot in any direction, as long as there are new things learned.

Resources

To be found on the fly.

Timeline

Day 1 (of 4)

Tried out a RAG demo, expanded on feeding it my own data
Experimented with qwen3-coder to add a persistent chat functionality, and keeping vectors in a pickle file
Optimizations to keep everything within context window
Learn and add a bit of PyTest
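
The retrieval step of a RAG setup like the one tried on Day 1 typically ranks stored vectors by their similarity to the query vector. A minimal sketch of that idea follows, assuming the embeddings have already been computed elsewhere. The Chunk type and function names are hypothetical, and this is not the author's code (which, judging by the pickle file and PyTest mentions, was written in Python).

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Chunk pairs a piece of text with its precomputed embedding (hypothetical type).
type Chunk struct {
	Text   string
	Vector []float64
}

// Cosine returns the cosine similarity of a and b, or 0 if the vectors are
// empty, of different length, or have zero magnitude.
func Cosine(a, b []float64) float64 {
	if len(a) != len(b) || len(a) == 0 {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// TopK returns the k chunks most similar to the query, best match first.
// The input slice is not modified.
func TopK(query []float64, chunks []Chunk, k int) []Chunk {
	ranked := make([]Chunk, len(chunks))
	copy(ranked, chunks)
	sort.SliceStable(ranked, func(i, j int) bool {
		return Cosine(query, ranked[i].Vector) > Cosine(query, ranked[j].Vector)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	return ranked[:k]
}

func main() {
	chunks := []Chunk{
		{Text: "backup script notes", Vector: []float64{1, 0}},
		{Text: "holiday photos index", Vector: []float64{0, 1}},
		{Text: "rsync cheat sheet", Vector: []float64{1, 1}},
	}
	for _, c := range TopK([]float64{1, 0}, chunks, 2) {
		fmt.Println(c.Text)
	}
}
```

The selected chunks would then be placed into the prompt, which is where the context-window optimizations mentioned above come in.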

Day 2

More experimenting and more data
Study ChromaDB
Add a Web UI that works from another computer even though the container sees network is down

Day 3

The above RAG is working well enough for demonstration purposes.
Pivot to trying out OpenCode, configuring local Ollama qwen3-coder there, to analyze the RAG demo.
Figured out how to configure the Ollama template to be usable under OpenCode. Running OpenCode locally is very slow compared to running qwen3-coder alone.

Day 4 (final day)

Battle with OpenCode, which was both slow and kept piling up broken things.
Call it a success, since the agentic AI was working locally after all.
Clean up the mess left behind a bit.

Blog Post

Summarized the findings in a blog post.

issuefs: FUSE filesystem representing issues (e.g. JIRA) for the use with AI agents code-assistants by llansky3

Description

Creating a FUSE filesystem (issuefs) that mounts issues from various ticketing systems (GitHub, Jira, Bugzilla, Redmine) as files on your local file system.

Why is this a good idea?

Users can use their favorite command-line tools to view and search tickets from various sources.
Users can use the AI agent capabilities of their favorite IDE or CLI to ask questions about the issues, project, or functionality, with relevant tickets provided as context without extra work.
Users can use it while developing new features, letting the AI agent jump-start the solution. issuefs gives the AI agent context about the bug or requested feature (the agent just reads a few more files). There is no need to copy and paste issues into the prompt or to use extra MCP tools to access them. Those options remain available, but this approach is deliberately different.
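
The core idea above, exposing each issue as a plain file, can be illustrated without any FUSE code. The sketch below assumes a hypothetical layout of <tracker>/<project>/<id>.md and hypothetical Issue fields; the layout used by the actual prototype is not described here.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Issue is a tracker-agnostic view of a ticket (hypothetical fields).
type Issue struct {
	Tracker string // e.g. "Jira" or "GitHub"
	Project string
	ID      string
	Title   string
	Body    string
}

// FilePath returns the path, relative to the mount point, at which the issue
// would appear in the assumed layout.
func FilePath(i Issue) string {
	return path.Join(strings.ToLower(i.Tracker), i.Project, i.ID+".md")
}

// FileContent renders the issue as a small text document that command-line
// tools or an AI agent can read like any other file.
func FileContent(i Issue) string {
	return fmt.Sprintf("# %s: %s\n\n%s\n", i.ID, i.Title, i.Body)
}

func main() {
	issue := Issue{
		Tracker: "Jira",
		Project: "PROJ",
		ID:      "PROJ-42",
		Title:   "Login fails after upgrade",
		Body:    "Steps to reproduce: ...",
	}
	fmt.Println(FilePath(issue))
	fmt.Print(FileContent(issue))
}
```

With files laid out this way, ordinary tools such as grep can search across trackers, and an agent only needs file access rather than a tracker-specific integration.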

Goals

Add GitHub issue support
Prove the concept by applying the approach to itself, using GitHub issues for tracking and developing new features
Add support for Bugzilla and Redmine, using this approach in the process. Record a video of it.
Clean up and test the implementation, and create some documentation
Create a blog post about this approach

Resources

There is a prototype implementation here. This currently sort of works with JIRA only.

Bugzilla goes AI - Phase 1 by nwalter

Description

This project, Bugzilla goes AI, aims to boost developer productivity by creating an autonomous AI bug agent during Hackweek. The primary goal is to reduce the time employees spend triaging bugs by integrating Ollama to summarize issues, recommend next steps, and push focused daily reports to a Web Interface.

Goals

To reduce employee time spent on Bugzilla by implementing an AI tool that triages and summarizes bug reports, providing actionable recommendations to the team via Web Interface.

Project Charter

Bugzilla goes AI Phase 1

Description

Project Achievements during Hackweek

In this file you can read about what we achieved during Hackweek.

Project Achievements

Docs Navigator MCP: SUSE Edition by mackenzie.techdocs

Description

Docs Navigator MCP: SUSE Edition is an AI-powered documentation navigator that makes finding information across SUSE, Rancher, K3s, and RKE2 documentation effortless. Built as a Model Context Protocol (MCP) server, it enables semantic search, intelligent Q&A, and documentation summarization using 100% open-source AI models (no API keys required!). The project also allows you to bring your own keys from Anthropic and OpenAI for parallel processing.

Goals

[ X ] Build functional MCP server with documentation tools
[ X ] Implement semantic search with vector embeddings
[ X ] Create user-friendly web interface
[ X ] Optimize indexing performance (parallel processing)
[ X ] Add SUSE branding and polish UX
[ X ] Stretch Goal: Add more documentation sources
[ X ] Stretch Goal: Implement document change detection for auto-updates

Coming Soon!

Community Feedback: Test with real users and gather improvement suggestions

Resources

Repository: Docs Navigator MCP: SUSE Edition GitHub
UI Demo: Live UI Demo of Docs Navigator MCP: SUSE Edition

Self-Scaling LLM Infrastructure Powered by Rancher by ademicev0

Self-Scaling LLM Infrastructure Powered by Rancher

Description

The Problem

Running LLMs can get expensive and complex pretty quickly.

Today there are typically two choices:

Use cloud APIs like OpenAI or Anthropic. Easy to start with, but costs add up at scale.
Self-host everything - set up Kubernetes, figure out GPU scheduling, handle scaling, manage model serving... it's a lot of work.

What if there was a middle ground?

What if infrastructure scaled itself instead of making you scale it?

Can we use existing Rancher capabilities like CAPI, autoscaling, and GitOps to make this simpler instead of building everything from scratch?

Project Repository: github.com/alexander-demicev/llmserverless

What This Project Does

A key feature is hybrid deployment: requests can be routed based on complexity or privacy needs. Simple or low-sensitivity queries can use public APIs (like OpenAI), while complex or private requests are handled in-house on local infrastructure. This flexibility allows balancing cost, privacy, and performance - using cloud for routine tasks and on-premises resources for sensitive or demanding workloads.
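
The routing decision described above can be thought of as a small policy function. The sketch below is only an illustration under stated assumptions: the Request fields, the use of prompt length as a stand-in for complexity, and the 2000-character threshold are all hypothetical. In the project itself, routing is handled by the LiteLLM gateway rather than custom Go code.

```go
package main

import "fmt"

// Backend names a place where a request can be served.
type Backend string

const (
	Local Backend = "local" // in-house inference on local infrastructure
	Cloud Backend = "cloud" // public API
)

// Request carries the attributes the policy looks at (hypothetical fields).
type Request struct {
	Prompt    string
	Sensitive bool // true if the request contains private data
}

// complexityThreshold is an arbitrary illustrative cut-off in characters.
const complexityThreshold = 2000

// Route sends private or complex requests to local infrastructure and
// everything else to the cloud API.
func Route(r Request) Backend {
	if r.Sensitive || len(r.Prompt) > complexityThreshold {
		return Local
	}
	return Cloud
}

func main() {
	fmt.Println(Route(Request{Prompt: "What is Kubernetes?"}))
	fmt.Println(Route(Request{Prompt: "Summarize this contract", Sensitive: true}))
}
```

A real policy would likely use richer signals than prompt length, but the shape stays the same: classify the request, then pick a backend.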

A complete, self-scaling LLM infrastructure that:

Scales to zero when idle (no idle costs)
Scales up automatically when requests come in
Adds more nodes when needed, removes them when demand drops
Runs on any infrastructure - laptop, bare metal, or cloud

Think of it as "serverless for LLMs" - focus on building, the infrastructure handles itself.

How It Works

A combination of open source tools working together:

Flow:

Users interact with OpenWebUI (chat interface)
Requests go to LiteLLM Gateway
LiteLLM routes requests to:
- Ollama (Knative) for local model inference (auto-scales pods)
- Or cloud APIs for fallback

mlops

Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo

Description

Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.

Goals

Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:

Gain insight into the latest AI trends, tools, and architectural concepts.
Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).

Resources

Red Hat AI Topic Articles
- https://www.redhat.com/en/topics/ai
Kubeflow Documentation
- https://www.kubeflow.org/docs/
Q4 2025 CNCF Technology Landscape Radar report:
- https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
- https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
Agent-to-Agent (A2A) Protocol
- https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/

kubernetes

Technical talks at universities by agamez

Description

This project aims to empower the next generation of tech professionals by offering hands-on workshops on containerization and Kubernetes, with a strong focus on open-source technologies. By providing practical experience with these cutting-edge tools and fostering a deep understanding of open-source principles, we aim to bridge the gap between academia and industry.

For now, the scope is limited to Spanish universities, since we already have the contacts and have started some conversations.

Goals

Technical Skill Development: equip students with the fundamental knowledge and skills to build, deploy, and manage containerized applications using open-source tools like Kubernetes.
Open-Source Mindset: foster a passion for open-source software, encouraging students to contribute to open-source projects and collaborate with the global developer community.
Career Readiness: prepare students for industry-relevant roles by exposing them to real-world use cases, best practices, and open-source in companies.

Resources

Instructors: experienced open-source professionals with deep knowledge of containerization and Kubernetes.
SUSE Expertise: leverage SUSE's expertise in open-source technologies to provide insights into industry trends and best practices.

A CLI for Harvester by mohamed.belgaied

Harvester does not officially come with a CLI tool; the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, manipulating KubeVirt YAML objects is not user-friendly at all. Inspired by tools like multipass from Canonical, which easily and rapidly create one or multiple VMs, I began developing Harvester CLI. It currently works, but it needs some love to be up to date with Harvester v1.0.2, and it needs some bug fixes and improvements as well.

Project Description

Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as: harvester vm create my-vm --count 5 to create 5 VMs named my-vm-01 to my-vm-05.
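
The naming scheme implied by the example above can be sketched as follows. This assumes a zero-padded, two-digit suffix starting at 01, as in my-vm-01 to my-vm-05. How the tool behaves for a count of 1 or for counts above 99 is not stated here, and the function is not taken from the Harvester CLI source.

```go
package main

import "fmt"

// VMNames returns count names of the form <base>-NN, numbered from 01.
// A non-positive count yields an empty list.
func VMNames(base string, count int) []string {
	if count < 0 {
		count = 0
	}
	names := make([]string, 0, count)
	for i := 1; i <= count; i++ {
		names = append(names, fmt.Sprintf("%s-%02d", base, i))
	}
	return names
}

func main() {
	for _, n := range VMNames("my-vm", 5) {
		fmt.Println(n)
	}
}
```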

Harvester CLI is functional but needs a number of improvements: up-to-date functionality with Harvester v1.0.2 (some minor issues right now), changing the default behaviour to create an openSUSE VM instead of an Ubuntu VM, fixing some bugs, etc.

Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli

Done in previous Hackweeks

Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
Automatically package Harvester CLI for OpenSUSE / Redhat RPMs or DEBs: DONE

Goal for this Hackweek

The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.

Some nice additions might be:
- Improve handling of namespaced objects
- Add features, such as network management or Load Balancer creation?
- Add more unit tests and, why not, e2e tests
- Improve CI
- Improve the overall code quality
- Test the program and create issues for it

Issue list is here: https://github.com/belgaied2/harvester-cli/issues

Resources

The project is written in Go and uses client-go, the Kubernetes Go client library, to communicate with the Harvester API (which is in fact Kubernetes). Welcome contributions include:

Testing it and creating issues
Documentation
Go code improvement

What you might learn

Harvester CLI might be interesting to you if you want to learn more about:

GitHub Actions
Harvester as a SUSE Product
Go programming language
Kubernetes API
Kubevirt API objects (Manipulating VMs and VM Configuration in Kubernetes using Kubevirt)

The Agentic Rancher Experiment: Do Androids Dream of Electric Cattle? by moio

Rancher is a beast of a codebase. Let's investigate if the new 2025 generation of GitHub Autonomous Coding Agents and Copilot Workspaces can actually tame it.

The Plan

Create a sandbox GitHub Organization, clone in key Rancher repositories, and let the AI loose to see if it can handle real-world enterprise OSS maintenance - or if it just hallucinates new breeds of Kubernetes resources!

Specifically, throw "Agentic Coders" some typical tasks in a complex, long-lived open-source project, such as:

❥ The Grunt Work: generate missing GoDocs, unit tests, and refactorings. Rebase PRs.

❥ The Complex Stuff: fix actual (historical) bugs and feature requests to see if they can traverse the complexity without (too much) human hand-holding.

❥ Hunting Down Gaps: find areas lacking in docs, areas of improvement in code, dependency bumps, and so on.

If time allows, also experiment with Model Context Protocol (MCP) to give agents context on our specific build pipelines and CI/CD logs.

Why?

We know AI can write "Hello World." and also moderately complex programs from a green field. But can it rebase a 3-month-old PR with conflicts in rancher/rancher? I want to find the breaking point of current AI agents to determine if and how they can help us to reduce our technical debt, work faster and better. At the same time, find out about pitfalls and shortcomings.

The CONCLUSION!!!

A State of the Union document was compiled to summarize lessons learned this week. For more gory details, just read on in the diary below!

Cluster API Provider for Harvester by rcase

Project Description

The Cluster API "infrastructure provider" for Harvester, also named CAPHV, makes it possible to use Harvester with Cluster API. This enables people and organisations to create Kubernetes clusters running on VMs created by Harvester using a declarative spec.

The project has been bootstrapped in HackWeek 23, and its code is available here.

Work done in HackWeek 2023

Have an early working version of the provider available on Rancher Sandbox: DONE
Demonstrated the created cluster can be imported using Rancher Turtles: DONE
Stretch goal - demonstrate using the new provider with CAPRKE2: DONE and the templates are available on the repo

DONE in HackWeek 24:

Add more Unit Tests
Improve Status Conditions for some phases
Add cloud provider config generation
Testing with Harvester v1.3.2
Template improvements
Issues creation

DONE in 2025 (out of Hackweek)

Support of ClusterClass
Add to clusterctl community providers, you can add it directly with clusterctl
Testing on newer versions of Harvester v1.4.X and v1.5.X
Support for clusterctl generate cluster ...
Improve Status Conditions to reflect current state of Infrastructure
Improve CI (some bugs for release creation)

Goals for HackWeek 2025

FIRST and FOREMOST, any topic that is important to you
Add e2e testing
Certify the provider for Rancher Turtles
Add Machine pool labeling
Add PCI-e passthrough capabilities.
Other improvement suggestions are welcome!

Thanks to @isim and Dominic Giebert for their contributions!

Resources

Looking for help from anyone interested in Cluster API (CAPI) or who wants to learn more about Harvester.

This will be an infrastructure provider for Cluster API. Some background reading for the CAPI aspect:

learning

Advent of Code: The Diaries by amanzini

Description

It was the Night Before Compile Time ...

Hackweek 25 (December 1-5) perfectly coincides with the first five days of Advent of Code 2025. This project will leverage this overlap to participate in the event in real-time.

To add a layer of challenge and exploration (in the true spirit of Hackweek), the puzzles will be solved using a non-mainstream, modern language like Ruby, D, Crystal, Gleam or Zig.

The primary intent of the project is not simply to solve the puzzles, but to practice sharing results and documenting them. I'd create a public-facing repository documenting the process. This involves treating each day's puzzle as a mini-project: solving it, then documenting the solution with detailed write-ups, analysis of the language's performance and ergonomics, and visualizations.

                               |
                             \ ' /
                           -- (*) --
                              >*<
                             >0<@<
                            >>>@<<*
                           >@>*<0<<<
                          >*>>@<<<@<<
                         >@>>0<<<*<<@<
                        >*>>0<<@<<<@<<<
                       >@>>*<<@<>*<<0<*<
         \*/          >0>>*<<@<>0><<*<@<<
     ___\\U//___     >*>>@><0<<*>>@><*<0<<
     |\\ | | \\|    >@>>0<*<0>>@<<0<<<*<@<<
     | \\| | _(UU)_ >((*))_>0><*<0><@<<<0<*<
     |\ \| || / //||.*.*.*.|>>@<<*<<@>><0<<<
     |\\_|_|&&_// ||*.*.*.*|_\\db//_
     """"|'.'.'.|~~|.*.*.*|     ____|_
         |'.'.'.|   ^^^^^^|____|>>>>>>|
         ~~~~~~~~         '""""`------'
------------------------------------------------
This ASCII pic can be found at
https://asciiart.website/art/1831

Goals

Code, Docs, and Memes: An AoC Story

Have fun!
Involve more people, play together
Solve Days 1-5: Successfully solve both parts of the Advent of Code 2025 puzzles for Days 1-5 using the chosen non-mainstream language.
Daily Documentation & Language Review: Publish a detailed write-up for each day. This documentation will include the solution analysis, the chosen algorithm, and specific commentary on the language's ergonomics, performance, and standard library for the given task.