Description

A prerequisite for running applications in a cloud environment is the presence of a container registry. Another common scenario is users performing machine learning workloads in such environments. However, these types of workloads require dedicated infrastructure to run properly. We can leverage these two facts to help users save resources by storing their machine learning models in OCI registries, similar to how we handle some WebAssembly modules. This approach will save users the resources typically required for a machine learning model repository for the applications they need to run.

Goals

Allow PyTorch users to save and load machine learning models in OCI registries.

Resources

Looking for hackers with the skills:

ai mlops pytorch oci cloud

This project is part of:

Hack Week 24

Activity

  • about 1 year ago: horon liked this project.
  • about 1 year ago: jguilhermevanz started this project.
  • about 1 year ago: jguilhermevanz added keyword "ai" to this project.
  • about 1 year ago: jguilhermevanz added keyword "mlops" to this project.
  • about 1 year ago: jguilhermevanz added keyword "pytorch" to this project.
  • about 1 year ago: jguilhermevanz added keyword "oci" to this project.
  • about 1 year ago: jguilhermevanz added keyword "cloud" to this project.
  • about 1 year ago: jguilhermevanz originated this project.

  • Comments

    Be the first to comment!

    Similar Projects

    Backporting patches using LLM by jankara

    Description

    Backporting Linux kernel fixes (either for CVE issues or as part of general git-fixes workflow) is boring and mostly mechanical work (dealing with changes in context, renamed variables, new helper functions etc.). The idea of this project is to explore usage of LLM for backporting Linux kernel commits to SUSE kernels using LLM.

    Goals

    • Create safe environment allowing LLM to run and backport patches without exposing the whole filesystem to it (for privacy and security reasons).
    • Write prompt that will guide LLM through the backporting process. Fine tune it based on experimental results.
    • Explore success rate of LLMs when backporting various patches.

    Resources

    • Docker
    • Gemini CLI

    Repository

    Current version of the container with some instructions for use are at: https://gitlab.suse.de/jankara/gemini-cli-backporter


    Docs Navigator MCP: SUSE Edition by mackenzie.techdocs

    MCP Docs Navigator: SUSE Edition

    Description

    Docs Navigator MCP: SUSE Edition is an AI-powered documentation navigator that makes finding information across SUSE, Rancher, K3s, and RKE2 documentation effortless. Built as a Model Context Protocol (MCP) server, it enables semantic search, intelligent Q&A, and documentation summarization using 100% open-source AI models (no API keys required!). The project also allows you to bring your own keys from Anthropic and Open AI for parallel processing.

    Goals

    • [ X ] Build functional MCP server with documentation tools
    • [ X ] Implement semantic search with vector embeddings
    • [ X ] Create user-friendly web interface
    • [ X ] Optimize indexing performance (parallel processing)
    • [ X ] Add SUSE branding and polish UX
    • [ X ] Stretch Goal: Add more documentation sources
    • [ X ] Stretch Goal: Implement document change detection for auto-updates

    Coming Soon!

    • Community Feedback: Test with real users and gather improvement suggestions

    Resources


    Bugzilla goes AI - Phase 1 by nwalter

    Description

    This project, Bugzilla goes AI, aims to boost developer productivity by creating an autonomous AI bug agent during Hackweek. The primary goal is to reduce the time employees spend triaging bugs by integrating Ollama to summarize issues, recommend next steps, and push focused daily reports to a Web Interface.

    Goals

    To reduce employee time spent on Bugzilla by implementing an AI tool that triages and summarizes bug reports, providing actionable recommendations to the team via Web Interface.

    Project Charter

    Bugzilla goes AI Phase 1

    Description

    Project Achievements during Hackweek

    In this file you can read about what we achieved during Hackweek.

    Project Achievements


    Uyuni Health-check Grafana AI Troubleshooter by ygutierrez

    Description

    This project explores the feasibility of using the open-source Grafana LLM plugin to enhance the Uyuni Health-check tool with LLM capabilities. The idea is to integrate a chat-based "AI Troubleshooter" directly into existing dashboards, allowing users to ask natural-language questions about errors, anomalies, or performance issues.

    Goals

    • Investigate if and how the grafana-llm-app plug-in can be used within the Uyuni Health-check tool.
    • Investigate if this plug-in can be used to query LLMs for troubleshooting scenarios.
    • Evaluate support for local LLMs and external APIs through the plugin.
    • Evaluate if and how the Uyuni MCP server could be integrated as another source of information.

    Resources

    Grafana LMM plug-in

    Uyuni Health-check


    Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

    Description

    Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

    This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

    The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

    Goals

    By the end of Hack Week, we aim to have a single, working Python script that:

    1. Connects to Prometheus and executes a query to fetch detailed test failure history.
    2. Processes the raw data into a format suitable for the Gemini API.
    3. Successfully calls the Gemini API with the data and a clear prompt.
    4. Parses the AI's response to extract a simple list of flaky tests.
    5. Saves the list to a JSON file that can be displayed in Grafana.
    6. New panel in our Dashboard listing the Flaky tests

    Resources

    Outcome


    Kubernetes-Based ML Lifecycle Automation by lmiranda

    Description

    This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.

    The pipeline will automate the lifecycle of a machine learning model, including:

    • Data ingestion/collection
    • Model training as a Kubernetes Job
    • Model artifact storage in an S3-compatible registry (e.g. Minio)
    • A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
    • A lightweight inference service that loads and serves the latest model
    • Monitoring of model performance and service health through Prometheus/Grafana

    The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.

    Goals

    By the end of Hack Week, the project should:

    1. Produce a fully functional ML pipeline running on Kubernetes with:

      • Data collection job
      • Training job container
      • Storage and versioning of trained models
      • Automated deployment of new model versions
      • Model inference API service
      • Basic monitoring dashboards
    2. Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.

    3. Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).

    4. Prepare a short demo explaining the end-to-end process and how new models flow through the system.

    Resources

    Project Repository

    Updates

    1. Training pipeline and datasets
    2. Inference Service py


    Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo

    Description

    Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.

    Goals

    Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:

    • Gain insight into the latest AI trends, tools, and architectural concepts.
    • Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).

    Resources

    • Red Hat AI Topic Articles

      • https://www.redhat.com/en/topics/ai
    • Kubeflow Documentation

      • https://www.kubeflow.org/docs/
    • Q4 2025 CNCF Technology Landscape Radar report:

      • https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
      • https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
    • Agent-to-Agent (A2A) Protocol

      • https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/


    Create a Cloud-Native policy engine with notifying capabilities to optimize resource usage by gbazzotti

    Description

    The goal of this project is to begin the initial phase of development of an all-in-one Cloud-Native Policy Engine that notifies resource owners when their resources infringe predetermined policies. This was inspired by a current issue in the CES-SRE Team where other solutions seemed to not exactly correspond to the needs of the specific workloads running on the Public Cloud Team space.

    The initial architecture can be checked out on the Repository listed under Resources.

    Among the features that will differ this project from other monitoring/notification systems:

    • Pre-defined sensible policies written at the software-level, avoiding a learning curve by requiring users to write their own policies
    • All-in-one functionality: logging, mailing and all other actions are not required to install any additional plugins/packages
    • Easy account management, being able to parse all required configuration by a single JSON file
    • Eliminate integrations by not requiring metrics to go through a data-agreggator

    Goals

    • Create a minimal working prototype following the workflow specified on the documentation
    • Provide instructions on installation/usage
    • Work on email notifying capabilities

    Resources