SUSE Hack Week: FuseML - accelerate your Hack Week ML projects

Project Description

So you have an idea for a machine learning project for Hack Week. Have you thought about what tools you'll be using? Choosing the right set of machine learning tools and making them work together can be time-consuming, not to mention the unavoidable learning curve. Perhaps you could use some help with that.

The SUSE AI/ML team has the answer: FuseML - an open source machine learning DevOps orchestrator that can get your machine learning projects up and running as easy as lighting a fuse.

FuseML started as a spin-off of the Carrier project. Think "Carrier for Machine Learning": you write your ML application using one of the popular machine learning libraries (e.g. scikit-learn, TensorFlow, PyTorch, XGBoost) and FuseML takes care of all operations necessary to get your machine learning models in action, so you can concentrate on your code.

FuseML workflow

The catch: FuseML is still in a pre-alpha state, although it can already be used to showcase basic features. While using it, you may run into some corner cases we haven't covered yet, but you won't be alone: we're here to help.

The rewards: access to expert knowledge in AI/ML and a chance to have your ML project published into the FuseML gallery of sample applications.

What you'll need: to install and use FuseML, you'll need a Kubernetes cluster. If you don't already have one handy, or if you're low on hardware resources, you can install minikube, kind or k3s on your machine.
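
As a quick preflight, a minimal sketch like the following could report which of the local Kubernetes options mentioned above are already installed. It only looks for executables named minikube, kind and k3s on the PATH (assumed to be the binary names of those tools); the helper name find_local_k8s_tools is hypothetical and not part of FuseML.

```python
import shutil

# Binary names assumed to match the tools mentioned above.
CANDIDATES = ("minikube", "kind", "k3s")


def find_local_k8s_tools(candidates=CANDIDATES, which=shutil.which):
    """Return the candidate tools whose executable is found on the PATH."""
    return [name for name in candidates if which(name) is not None]


if __name__ == "__main__":
    found = find_local_k8s_tools()
    if found:
        print("Found:", ", ".join(found))
    else:
        print("None of minikube, kind or k3s found on PATH.")
```

Finding a binary does not mean a cluster is running; it only tells you whether one of the three tools still needs to be installed.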

Goal for this Hackweek

discover new use cases and AI/ML tools to be enabled for FuseML
offer assistance and guidelines on AI/ML best practices and tools in the context of FuseML
pimp up FuseML's gallery of sample applications

Resources

FuseML github project page
RocketChat channel: #machine-learning


Looking for hackers with the skills:

ai machinelearning kubernetes artificial-intelligence mlops mlflow sklearn pytorch fuseml tensorflow

This project is part of:

Hack Week 20

Activity

over 4 years ago: acho liked this project.

over 4 years ago: ories liked this project.

over 4 years ago: afesta liked this project.

over 4 years ago: jsuchome joined this project.

over 4 years ago: flaviosr liked this project.

over 4 years ago: flaviosr joined this project.

over 4 years ago: stefannica started this project.

over 4 years ago: stefannica originated this project.

Similar Projects

ai

SUSE Observability MCP server by drutigliano

Description

The idea is to implement the SUSE Observability Model Context Protocol (MCP) Server as a specialized, middle-tier API designed to translate the complex, high-cardinality observability data from StackState (topology, metrics, and events) into highly structured, contextually rich, and LLM-ready snippets.

This MCP Server abstracts the StackState APIs. Its primary function is to serve as a Tool/Function Calling target for AI agents. When an AI receives an alert or a user query (e.g., "What caused the outage?"), the AI calls an MCP Server endpoint. The server then fetches the relevant operational facts, summarizes them, normalizes technical identifiers (like URNs and raw metric names) into natural language concepts, and returns a concise JSON or YAML payload. This payload is then injected directly into the LLM's prompt, ensuring the final diagnosis or action is grounded in real-time, accurate SUSE Observability data, effectively minimizing hallucinations.
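
The normalization step described above might look roughly like the sketch below. It is only an illustration under stated assumptions: the URN layout, the health-state mapping, and the function names describe_component and build_payload are all hypothetical and are not taken from the StackState APIs or from the linked implementation.

```python
import json

# Hypothetical mapping from raw health states to plain-language phrases.
HEALTH_PHRASES = {
    "CLEAR": "healthy",
    "DEVIATING": "degraded",
    "CRITICAL": "failing",
}


def describe_component(urn, health_state):
    """Turn a raw identifier and health state into an LLM-friendly record.

    The URN layout (urn:<type>:/<cluster>:<namespace>:<name>) is an assumed
    example format, not the documented StackState scheme.
    """
    parts = urn.split(":")
    kind = parts[1].replace("-", " ") if len(parts) > 1 else "component"
    name = parts[-1]
    status = HEALTH_PHRASES.get(health_state, "in an unknown state")
    return {
        "component": name,
        "kind": kind,
        "summary": f"The {kind} '{name}' is {status}.",
    }


def build_payload(components):
    """Serialize the normalized facts as compact JSON for prompt injection."""
    facts = [describe_component(c["urn"], c["health"]) for c in components]
    return json.dumps({"facts": facts}, sort_keys=True)
```

Calling build_payload with a list of {"urn": ..., "health": ...} dictionaries returns a JSON string of short natural-language facts, which is the kind of concise payload an agent could place in its prompt.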

Goals

Grounding AI Responses: Ensure that all AI diagnoses, root cause analyses, and action recommendations are strictly based on verifiable, real-time data retrieved from the SUSE Observability StackState platform.
Simplifying Data Access: Abstract the complexity of StackState's native APIs (e.g., Time Travel, 4T Data Model) into simple, semantic functions that can be easily invoked by LLM tool-calling mechanisms.
Data Normalization: Convert complex, technical identifiers (like component URNs, raw metric names, and proprietary health states) into standardized, natural language terms that an LLM can easily reason over.
Enabling Automated Remediation: Define clear, action-oriented MCP endpoints (e.g., execute_runbook) that allow the AI agent to initiate automated operational workflows (e.g., restarts, scaling) after a diagnosis, closing the loop on observability.

Resources

https://www.honeycomb.io/blog/its-the-end-of-observability-as-we-know-it-and-i-feel-fine
https://www.datadoghq.com/blog/datadog-remote-mcp-server
https://modelcontextprotocol.io/specification/2025-06-18/index

Basic implementation

https://github.com/drutigliano19/suse-observability-mcp-server

Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

Description

Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

Goals

By the end of Hack Week, we aim to have a single, working Python script that:

Connects to Prometheus and executes a query to fetch detailed test failure history.
Processes the raw data into a format suitable for the Gemini API.
Successfully calls the Gemini API with the data and a clear prompt.
Parses the AI's response to extract a simple list of flaky tests.
Saves the list to a JSON file that can be displayed in Grafana.
New panel in our Dashboard listing the Flaky tests
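
The processing step between Prometheus and the Gemini API could be sketched as below. This assumes the standard Prometheus HTTP API response shape (data.result as a list of series, each with a metric label set) and that every series carries the suite, case, buildid and status labels listed under Key Metric, with numeric build IDs. The function name summarize_history and the "flips" hint are illustrative choices, not part of the project.

```python
from collections import defaultdict

# Statuses treated as failures, taken from the reference query below.
FAILING = {"FAILED", "REGRESSION"}


def summarize_history(prom_response):
    """Group series from a Prometheus query response by (suite, case).

    Returns a list of dicts with the per-build status sequence and the
    number of pass/fail flips, which is only a rough hint of flakiness.
    """
    builds = defaultdict(dict)
    for series in prom_response["data"]["result"]:
        labels = series["metric"]
        key = (labels["suite"], labels["case"])
        # Assumes buildid is numeric so builds can be ordered.
        builds[key][int(labels["buildid"])] = labels["status"]

    summary = []
    for (suite, case), by_build in sorted(builds.items()):
        statuses = [by_build[b] for b in sorted(by_build)]
        failing = [s in FAILING for s in statuses]
        # Count transitions between failing and non-failing builds.
        flips = sum(a != b for a, b in zip(failing, failing[1:]))
        summary.append(
            {"suite": suite, "case": case, "statuses": statuses, "flips": flips}
        )
    return summary
```

The resulting list is small enough to serialize as JSON and embed in a prompt; a test that keeps flipping between failing and non-failing builds is a more likely flaky candidate than one that fails consistently, but the final call is left to the model.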

Resources

Jenkins Prometheus Exporter: https://github.com/uyuni-project/jenkins-exporter/
Data Source: Our internal Prometheus server.
Key Metric: jenkins_build_test_case_failure_age{jobname, buildid, suite, case, status, failedsince}.
Existing Query for Reference: count by (suite) (max_over_time(jenkins_build_test_case_failure_age{status=~"FAILED|REGRESSION", jobname="$jobname"}[$__range])).
AI Model: The Google Gemini API.
Example about how to interact with Gemini API: https://github.com/srbarrios/FailTale/
Visualization: Our internal Grafana Dashboard.
Internal IaC: https://gitlab.suse.de/galaxy/infrastructure/-/tree/master/srv/salt/monitoring

kubernetes

Cluster API Provider for Harvester by rcase

Project Description

The Cluster API "infrastructure provider" for Harvester, also named CAPHV, makes it possible to use Harvester with Cluster API. This enables people and organisations to create Kubernetes clusters running on VMs created by Harvester using a declarative spec.

The project has been bootstrapped in HackWeek 23, and its code is available here.

Work done in HackWeek 2023

Have an early working version of the provider available on Rancher Sandbox: DONE
Demonstrated the created cluster can be imported using Rancher Turtles: DONE
Stretch goal - demonstrate using the new provider with CAPRKE2: DONE and the templates are available on the repo

DONE in HackWeek 24:

Add more Unit Tests
Improve Status Conditions for some phases
Add cloud provider config generation
Testing with Harvester v1.3.2
Template improvements
Issues creation

DONE in 2025 (out of Hackweek)

Support of ClusterClass
Added to the clusterctl community providers, so it can be added directly with clusterctl
Testing on newer versions of Harvester v1.4.X and v1.5.X
Support for clusterctl generate cluster ...
Improve Status Conditions to reflect current state of Infrastructure
Improve CI (some bugs for release creation)

Goals for HackWeek 2025

FIRST and FOREMOST, any topic that is important to you
Add e2e testing
Certify the provider for Rancher Turtles
Add Machine pool labeling
Add PCI-e passthrough capabilities.
Other improvement suggestions are welcome!

Thanks to @isim and Dominic Giebert for their contributions!

Resources

Looking for help from anyone interested in Cluster API (CAPI) or who wants to learn more about Harvester.

This will be an infrastructure provider for Cluster API. Some background reading for the CAPI aspect:

Rancher/k8s Trouble-Maker by tonyhansen

Project Description

When studying for my RHCSA, I found trouble-maker, which is a program that breaks a Linux OS and requires you to fix it. I want to create something similar for Rancher/k8s that can allow for troubleshooting an unknown environment.

Goals for Hackweek 25

Update to modern Rancher and verify that existing tests still work
Change testing logic to populate secrets instead of requiring a secondary script
Add new tests

Goals for Hackweek 24 (Complete)

Create a basic framework for creating Rancher/k8s cluster lab environments as needed for the Break/Fix
Create at least 5 modules that can be applied to the cluster and require troubleshooting

Resources

https://github.com/celidon/rancher-troublemaker
https://github.com/rancher/terraform-provider-rancher2
https://github.com/rancher/tf-rancher-up
https://github.com/rancher/quickstart

A CLI for Harvester by mohamed.belgaied

Harvester does not officially come with a CLI tool; the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, manipulating KubeVirt YAML objects is not user friendly at all. Inspired by tools like multipass from Canonical, which easily and rapidly create one or multiple VMs, I began the development of Harvester CLI. Currently it works, but Harvester CLI needs some love to be up to date with Harvester v1.0.2 and needs some bug fixes and improvements as well.

Project Description

Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as: harvester vm create my-vm --count 5 to create 5 VMs named my-vm-01 to my-vm-05.
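
The naming pattern in that example (my-vm-01 to my-vm-05) can be illustrated with a small Python sketch. This is not the CLI's actual Go code: the function name vm_names is hypothetical, and how the real tool names a single VM when --count is omitted is not stated here, so the sketch always appends a two-digit suffix.

```python
def vm_names(base, count):
    """Return names like base-01 .. base-NN for a multi-VM create request."""
    if count < 1:
        raise ValueError("count must be at least 1")
    # Two-digit, zero-padded suffix, matching the my-vm-01 .. my-vm-05 example.
    return [f"{base}-{i:02d}" for i in range(1, count + 1)]
```

For instance, vm_names("my-vm", 5) yields the five names from the example command.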

Harvester CLI is functional but needs a number of improvements: up-to-date functionality with Harvester v1.0.2 (some minor issues right now), changing the default behaviour to create an openSUSE VM instead of an Ubuntu VM, fixing some bugs, etc.

Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli

Done in previous Hackweeks

Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
Automatically package Harvester CLI as RPMs for openSUSE / Red Hat or as DEBs: DONE

Goal for this Hackweek

The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.

Some nice additions might be:

Improve handling of namespaced objects
Add features, such as network management or Load Balancer creation?
Add more unit tests and, why not, e2e tests
Improve CI
Improve the overall code quality
Test the program and create issues for it

Issue list is here: https://github.com/belgaied2/harvester-cli/issues

Resources

The project is written in Go and uses client-go, the Kubernetes Go client library, to communicate with the Harvester API (which is in fact a Kubernetes API). Welcome contributions are:

Testing it and creating issues
Documentation
Go code improvement

What you might learn

Harvester CLI might be interesting to you if you want to learn more about:

GitHub Actions
Harvester as a SUSE Product
Go programming language
Kubernetes API

Mammuthus - The NFS-Ganesha inside Kubernetes controller by vcheng

Description

As a user-space NFS provider, NFS-Ganesha is widely used by several projects, e.g. Longhorn and Rook. We want to create a Kubernetes controller that makes configuring NFS-Ganesha easy. This controller will let users configure NFS-Ganesha with different backends such as VFS or CephFS.

Goals

Create NFS-Ganesha Package on OBS: nfs-ganesha5, nfs-ganesha6
Create NFS-Ganesha Container Image on OBS: Image
Create a Kubernetes controller for NFS-Ganesha and support the VFS configuration on demand. Mammuthus

Resources

NFS-Ganesha

Technical talks at universities by agamez

Description

This project aims to empower the next generation of tech professionals by offering hands-on workshops on containerization and Kubernetes, with a strong focus on open-source technologies. By providing practical experience with these cutting-edge tools and fostering a deep understanding of open-source principles, we aim to bridge the gap between academia and industry.

For now, the scope is limited to Spanish universities, since we already have the contacts and have started some conversations.

Goals

Technical Skill Development: equip students with the fundamental knowledge and skills to build, deploy, and manage containerized applications using open-source tools like Kubernetes.
Open-Source Mindset: foster a passion for open-source software, encouraging students to contribute to open-source projects and collaborate with the global developer community.
Career Readiness: prepare students for industry-relevant roles by exposing them to real-world use cases, best practices, and open-source in companies.

Resources

Instructors: experienced open-source professionals with deep knowledge of containerization and Kubernetes.
SUSE Expertise: leverage SUSE's expertise in open-source technologies to provide insights into industry trends and best practices.
