Starting from prometheus ( and grafana if needed), learn how to monitor kubernetes and docker and do some valid alert/graph etc.

https://docs.docker.com/config/thirdparty/prometheus/

Looking for hackers with the skills:

golang prometheus monitoring kubernetes docker grafana

This project is part of:

Hack Week 17

Activity

  • over 7 years ago: dmaiocchi added keyword "grafana" to this project.
  • over 7 years ago: dmaiocchi added keyword "golang" to this project.
  • over 7 years ago: dmaiocchi added keyword "prometheus" to this project.
  • over 7 years ago: dmaiocchi added keyword "monitoring" to this project.
  • over 7 years ago: dmaiocchi added keyword "kubernetes" to this project.
  • over 7 years ago: dmaiocchi added keyword "docker" to this project.
  • over 7 years ago: dmaiocchi started this project.
  • over 7 years ago: dmaiocchi originated this project.

  • Comments

    Be the first to comment!

    Similar Projects

    Updatecli Autodiscovery supporting WASM plugins by olblak

    Description

    Updatecli is a Golang Update policy engine that allow to write Update policies in YAML manifest. Updatecli already has a plugin ecosystem for common update strategies such as automating Dockerfile or Kubernetes manifest from Git repositories.

    This is what we call autodiscovery where Updatecli generate manifest and apply them dynamically based on some context.

    Obviously, the Updatecli project doesn't accept plugins specific to an organization.

    I saw project using different languages such as python, C#, or JS to generate those manifest.

    It would be great to be able to share and reuse those specific plugins

    During the HackWeek, I'll hang on the Updatecli matrix channel

    https://matrix.to/#/#Updatecli_community:gitter.im

    Goals

    Implement autodiscovery plugins using WASM. I am planning to experiment with https://github.com/extism/extism

    To build a simple WASM autodiscovery plugin and run it from Updatecli

    Resources

    • https://github.com/extism/extism
    • https://github.com/updatecli/updatecli
    • https://www.updatecli.io/docs/core/autodiscovery/
    • https://matrix.to/#/#Updatecli_community:gitter.im


    Contribute to terraform-provider-libvirt by pinvernizzi

    Description

    The SUSE Manager (SUMA) teams' main tool for infrastructure automation, Sumaform, largely relies on terraform-provider-libvirt. That provider is also widely used by other teams, both inside and outside SUSE.

    It would be good to help the maintainers of this project and give back to the community around it, after all the amazing work that has been already done.

    If you're interested in any of infrastructure automation, Terraform, virtualization, tooling development, Go (...) it is also a good chance to learn a bit about them all by putting your hands on an interesting, real-use-case and complex project.

    Goals

    • Get more familiar with Terraform provider development and libvirt bindings in Go
    • Solve some issues and/or implement some features
    • Get in touch with the community around the project

    Resources


    Create a Cloud-Native policy engine with notifying capabilities to optimize resource usage by gbazzotti

    Description

    The goal of this project is to begin the initial phase of development of an all-in-one Cloud-Native Policy Engine that notifies resource owners when their resources infringe predetermined policies. This was inspired by a current issue in the CES-SRE Team where other solutions seemed to not exactly correspond to the needs of the specific workloads running on the Public Cloud Team space.

    The initial architecture can be checked out on the Repository listed under Resources.

    Among the features that will differ this project from other monitoring/notification systems:

    • Pre-defined sensible policies written at the software-level, avoiding a learning curve by requiring users to write their own policies
    • All-in-one functionality: logging, mailing and all other actions are not required to install any additional plugins/packages
    • Easy account management, being able to parse all required configuration by a single JSON file
    • Eliminate integrations by not requiring metrics to go through a data-agreggator

    Goals

    • Create a minimal working prototype following the workflow specified on the documentation
    • Provide instructions on installation/usage
    • Work on email notifying capabilities

    Resources


    terraform-provider-feilong by e_bischoff

    Project Description

    People need to test operating systems and applications on s390 platform. While this is straightforward with KVM, this is very difficult with z/VM.

    IBM Cloud Infrastructure Center (ICIC) harnesses the Feilong API, but you can use Feilong without installing ICIC(see this schema).

    What about writing a terraform Feilong provider, just like we have the terraform libvirt provider? That would allow to transparently call Feilong from your main.tf files to deploy and destroy resources on your z/VM system.

    Goal for Hackweek 23

    I would like to be able to easily deploy and provision VMs automatically on a z/VM system, in a way that people might enjoy even outside of SUSE.

    My technical preference is to write a terraform provider plugin, as it is the approach that involves the least software components for our deployments, while remaining clean, and compatible with our existing development infrastructure.

    Goals for Hackweek 24

    Feilong provider works and is used internally by SUSE Manager team. Let's push it forward!

    Let's add support for fiberchannel disks and multipath.

    Goals for Hackweek 25

    Modernization, maturity, and maintenance: support for SLES 16 and openTofu, new API calls, fixes...

    Resources

    Outcome


    Rewrite Distrobox in go (POC) by fabriziosestito

    Description

    Rewriting Distrobox in Go.

    Main benefits:

    • Easier to maintain and to test
    • Adapter pattern for different container backends (LXC, systemd-nspawn, etc.)

    Goals

    • Build a minimal starting point with core commands
    • Keep the CLI interface compatible: existing users shouldn't notice any difference
    • Use a clean Go architecture with adapters for different container backends
    • Keep dependencies minimal and binary size small
    • Benchmark against the original shell script

    Resources

    • Upstream project: https://github.com/89luca89/distrobox/
    • Distrobox site: https://distrobox.it/
    • ArchWiki: https://wiki.archlinux.org/title/Distrobox


    Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

    Description

    Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

    This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

    The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

    Goals

    By the end of Hack Week, we aim to have a single, working Python script that:

    1. Connects to Prometheus and executes a query to fetch detailed test failure history.
    2. Processes the raw data into a format suitable for the Gemini API.
    3. Successfully calls the Gemini API with the data and a clear prompt.
    4. Parses the AI's response to extract a simple list of flaky tests.
    5. Saves the list to a JSON file that can be displayed in Grafana.
    6. New panel in our Dashboard listing the Flaky tests

    Resources

    Outcome


    A CLI for Harvester by mohamed.belgaied

    Harvester does not officially come with a CLI tool, the user is supposed to interact with Harvester mostly through the UI. Though it is theoretically possible to use kubectl to interact with Harvester, the manipulation of Kubevirt YAML objects is absolutely not user friendly. Inspired by tools like multipass from Canonical to easily and rapidly create one of multiple VMs, I began the development of Harvester CLI. Currently, it works but Harvester CLI needs some love to be up-to-date with Harvester v1.0.2 and needs some bug fixes and improvements as well.

    Project Description

    Harvester CLI is a command line interface tool written in Go, designed to simplify interfacing with a Harvester cluster as a user. It is especially useful for testing purposes as you can easily and rapidly create VMs in Harvester by providing a simple command such as: harvester vm create my-vm --count 5 to create 5 VMs named my-vm-01 to my-vm-05.

    asciicast

    Harvester CLI is functional but needs a number of improvements: up-to-date functionality with Harvester v1.0.2 (some minor issues right now), modifying the default behaviour to create an opensuse VM instead of an ubuntu VM, solve some bugs, etc.

    Github Repo for Harvester CLI: https://github.com/belgaied2/harvester-cli

    Done in previous Hackweeks

    • Create a Github actions pipeline to automatically integrate Harvester CLI to Homebrew repositories: DONE
    • Automatically package Harvester CLI for OpenSUSE / Redhat RPMs or DEBs: DONE

    Goal for this Hackweek

    The goal for this Hackweek is to bring Harvester CLI up-to-speed with latest Harvester versions (v1.3.X and v1.4.X), and improve the code quality as well as implement some simple features and bug fixes.

    Some nice additions might be: * Improve handling of namespaced objects * Add features, such as network management or Load Balancer creation ? * Add more unit tests and, why not, e2e tests * Improve CI * Improve the overall code quality * Test the program and create issues for it

    Issue list is here: https://github.com/belgaied2/harvester-cli/issues

    Resources

    The project is written in Go, and using client-go the Kubernetes Go Client libraries to communicate with the Harvester API (which is Kubernetes in fact). Welcome contributions are:

    • Testing it and creating issues
    • Documentation
    • Go code improvement

    What you might learn

    Harvester CLI might be interesting to you if you want to learn more about:

    • GitHub Actions
    • Harvester as a SUSE Product
    • Go programming language
    • Kubernetes API
    • Kubevirt API objects (Manipulating VMs and VM Configuration in Kubernetes using Kubevirt)


    Kubernetes-Based ML Lifecycle Automation by lmiranda

    Description

    This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.

    The pipeline will automate the lifecycle of a machine learning model, including:

    • Data ingestion/collection
    • Model training as a Kubernetes Job
    • Model artifact storage in an S3-compatible registry (e.g. Minio)
    • A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
    • A lightweight inference service that loads and serves the latest model
    • Monitoring of model performance and service health through Prometheus/Grafana

    The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.

    Goals

    By the end of Hack Week, the project should:

    1. Produce a fully functional ML pipeline running on Kubernetes with:

      • Data collection job
      • Training job container
      • Storage and versioning of trained models
      • Automated deployment of new model versions
      • Model inference API service
      • Basic monitoring dashboards
    2. Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.

    3. Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).

    4. Prepare a short demo explaining the end-to-end process and how new models flow through the system.

    Resources

    Project Repository

    Updates

    1. Training pipeline and datasets
    2. Inference Service py


    Exploring Modern AI Trends and Kubernetes-Based AI Infrastructure by jluo

    Description

    Build a solid understanding of the current landscape of Artificial Intelligence and how modern cloud-native technologies—especially Kubernetes—support AI workloads.

    Goals

    Use Gemini Learning Mode to guide the exploration, surface relevant concepts, and structure the learning journey:

    • Gain insight into the latest AI trends, tools, and architectural concepts.
    • Understand how Kubernetes and related cloud-native technologies are used in the AI ecosystem (model training, deployment, orchestration, MLOps).

    Resources

    • Red Hat AI Topic Articles

      • https://www.redhat.com/en/topics/ai
    • Kubeflow Documentation

      • https://www.kubeflow.org/docs/
    • Q4 2025 CNCF Technology Landscape Radar report:

      • https://www.cncf.io/announcements/2025/11/11/cncf-and-slashdata-report-finds-leading-ai-tools-gaining-adoption-in-cloud-native-ecosystems/
      • https://www.cncf.io/wp-content/uploads/2025/11/cncfreporttechradar_111025a.pdf
    • Agent-to-Agent (A2A) Protocol

      • https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/


    Preparing KubeVirtBMC for project transfer to the KubeVirt organization by zchang

    Description

    KubeVirtBMC is preparing to transfer the project to the KubeVirt organization. One requirement is to enhance the modeling design's security. The current v1alpha1 API (the VirtualMachineBMC CRD) was designed during the proof-of-concept stage. It's immature and inherently insecure due to its cross-namespace object references, exposing security concerns from an RBAC perspective.

    The other long-awaited feature is the ability to mount virtual media so that virtual machines can boot from remote ISO images.

    Goals

    1. Deliver the v1beta1 API and its corresponding controller implementation
    2. Enable the Redfish virtual media mount function for KubeVirt virtual machines

    Resources


    Technical talks at universities by agamez

    Description

    This project aims to empower the next generation of tech professionals by offering hands-on workshops on containerization and Kubernetes, with a strong focus on open-source technologies. By providing practical experience with these cutting-edge tools and fostering a deep understanding of open-source principles, we aim to bridge the gap between academia and industry.

    For now, the scope is limited to Spanish universities, since we already have the contacts and have started some conversations.

    Goals

    • Technical Skill Development: equip students with the fundamental knowledge and skills to build, deploy, and manage containerized applications using open-source tools like Kubernetes.
    • Open-Source Mindset: foster a passion for open-source software, encouraging students to contribute to open-source projects and collaborate with the global developer community.
    • Career Readiness: prepare students for industry-relevant roles by exposing them to real-world use cases, best practices, and open-source in companies.

    Resources

    • Instructors: experienced open-source professionals with deep knowledge of containerization and Kubernetes.
    • SUSE Expertise: leverage SUSE's expertise in open-source technologies to provide insights into industry trends and best practices.


    Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

    Description

    Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

    This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

    The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

    Goals

    By the end of Hack Week, we aim to have a single, working Python script that:

    1. Connects to Prometheus and executes a query to fetch detailed test failure history.
    2. Processes the raw data into a format suitable for the Gemini API.
    3. Successfully calls the Gemini API with the data and a clear prompt.
    4. Parses the AI's response to extract a simple list of flaky tests.
    5. Saves the list to a JSON file that can be displayed in Grafana.
    6. New panel in our Dashboard listing the Flaky tests

    Resources

    Outcome


    Uyuni Health-check Grafana AI Troubleshooter by ygutierrez

    Description

    This project explores the feasibility of using the open-source Grafana LLM plugin to enhance the Uyuni Health-check tool with LLM capabilities. The idea is to integrate a chat-based "AI Troubleshooter" directly into existing dashboards, allowing users to ask natural-language questions about errors, anomalies, or performance issues.

    Goals

    • Investigate if and how the grafana-llm-app plug-in can be used within the Uyuni Health-check tool.
    • Investigate if this plug-in can be used to query LLMs for troubleshooting scenarios.
    • Evaluate support for local LLMs and external APIs through the plugin.
    • Evaluate if and how the Uyuni MCP server could be integrated as another source of information.

    Resources

    Grafana LMM plug-in

    Uyuni Health-check