Project Description

A supportconfig archive contains many files and a lot of system data, but it is often hard to spot the real issue in it. The idea of this project is to produce machine-readable output from the supportconfig data and analyze it.

The tool would then try to provide hints about what is wrong.

The name of this tool is: uyuni-health-check.

GitHub repository: https://github.com/uyuni-project/poc-uyuni-health-check

Summary:

  • Research on machine-learning log anomaly detectors: few alternatives exist.
  • Collect custom metrics for Salt and Uyuni from a live server via a Prometheus exporter.
  • Set up Loki to process relevant Uyuni logs from a live server.
  • Visualize the data with Grafana.
  • Provide an easy-to-use CLI tool to run "health checks" and get feedback.
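
Prometheus scrapes exporters in a plain-text exposition format. The following is a minimal sketch of how one gauge could be rendered using only the standard library. The metric name and label in the usage note are invented for illustration and are not taken from the actual exporter.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` maps a tuple of (label_name, label_value) pairs to a reading.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        # Samples without labels are written without the braces.
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"
```

For example, `render_gauge("demo_salt_jobs", "Hypothetical job count.", {(("state", "pending"),): 3})` yields a HELP line, a TYPE line and one sample line `demo_salt_jobs{state="pending"} 3`.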

Details:

  • Grafana, Loki, the Uyuni Prometheus exporter and all other components run in containers.
  • The containers run on the Uyuni server, so "podman" is required on the server.
  • The CLI tool takes care of building and deploying the container image to the server, collecting the metrics and providing output on the command line.
  • Prometheus / Grafana expose container metrics.
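
As a rough illustration of the deployment step, the sketch below only composes the podman command lines that would start a set of containers inside one pod. It does not execute anything. The pod name, container names and image names are placeholders, and the real CLI tool may well do this differently.

```python
def pod_commands(pod, containers):
    """Return the podman invocations that start `containers` in one pod.

    `containers` is a list of (container_name, image) pairs.
    """
    # The pod has to exist before containers can join it.
    commands = [["podman", "pod", "create", "--name", pod]]
    for name, image in containers:
        commands.append(
            ["podman", "run", "--detach", "--pod", pod, "--name", name, image]
        )
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a host where podman is installed.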

Goals for Hackweek #23

  • Enhance and collect more Uyuni / Salt metrics.
  • Use "supportconfig" as source for logs/metrics instead of live server.

Achievements during HW #23

  • ...

Goals for Hackweek #22

  • Improve CLI and performance.
  • Fix memory leak in "uyuni-health-exporter".
  • Complete automated deployment of Loki and other containers.

Achievements during HW #22:

  • Fix memory leak in uyuni-health-exporter.
  • Fix Python packaging and installation.
  • Deploy Grafana and Prometheus dashboard.
  • Fix Loki and Promtail deployments.
  • Run all containers in the same pod.
  • Unify console logging across deployment functions.
  • Friendlier CLI with new functions.
  • Containers are not wiped by default after execution.
  • Minor and cosmetic changes.
  • Update README.md to reflect latest changes.

Goals for Hackweek #21

  • Get a machine-readable version of supportconfig.
  • First analysis and tweaking.
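
One possible way to get a machine-readable view is to split the supportconfig text files into sections. The sketch below assumes the usual layout, in which each section starts with a header such as `#==[ Command ]====...====#` followed by a comment line naming the command or file. That layout should be verified against a real archive, and the resulting dictionary keys are an illustrative choice rather than the project's actual schema.

```python
import json
import re

# Assumed header layout: "#==[ Command ]======================#".
SECTION_RE = re.compile(r"^#==\[ (?P<kind>.+?) \]=+#\s*$")


def parse_supportconfig(text):
    """Split supportconfig text into a list of section dictionaries."""
    sections = []
    current = None
    for line in text.splitlines():
        match = SECTION_RE.match(line)
        if match:
            current = {"kind": match.group("kind"), "title": None, "lines": []}
            sections.append(current)
        elif current is None:
            continue  # preamble before the first section header
        elif (current["title"] is None and not current["lines"]
              and line.startswith("# ")):
            # The first comment line after a header names the command or file.
            current["title"] = line[2:].strip()
        else:
            current["lines"].append(line)
    for section in sections:
        # Drop the blank lines that pad each section.
        section["body"] = "\n".join(section.pop("lines")).strip()
    return sections


def to_json(text):
    """Serialize the parsed sections as JSON."""
    return json.dumps(parse_supportconfig(text), indent=2)
```

Calling `to_json()` on the contents of one supportconfig text file returns a JSON array with one object per section, each carrying `kind`, `title` and `body`.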

This project is part of:

Hack Week 21 Hack Week 22 Hack Week 23

Activity

  • over 3 years ago: cbosdonnat added keyword "uyuni" to this project.
  • over 3 years ago: cbosdonnat added keyword "susemanager" to this project.
  • over 3 years ago: cbosdonnat added keyword "monitoring" to this project.
  • over 3 years ago: cbosdonnat added keyword "grafana" to this project.
  • over 3 years ago: cbosdonnat added keyword "loki" to this project.
  • over 3 years ago: cbosdonnat added keyword "prometheus" to this project.
  • over 3 years ago: cbosdonnat added keyword "python3" to this project.
  • over 3 years ago: PSuarezHernandez joined this project.
  • over 3 years ago: cbosdonnat started this project.
  • over 3 years ago: cbosdonnat added keyword "supportconfig" to this project.
  • over 3 years ago: cbosdonnat added keyword "analysis" to this project.
  • over 3 years ago: cbosdonnat added keyword "tool" to this project.
  • over 3 years ago: cbosdonnat added keyword "dashboard" to this project.
  • over 3 years ago: cbosdonnat originated this project.

Comments

  • PSuarezHernandez, over 2 years ago:
    I've updated project description to reflect latest changes after Hackweek 22!

  • PSuarezHernandez, about 2 years ago:
    Let's keep hacking on this project during upcoming Hackweek 23!

    Similar Projects

    Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

    Description

    Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

    This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

    The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

    Goals

    By the end of Hack Week, we aim to have a single, working Python script that:

    1. Connects to Prometheus and executes a query to fetch detailed test failure history.
    2. Processes the raw data into a format suitable for the Gemini API.
    3. Successfully calls the Gemini API with the data and a clear prompt.
    4. Parses the AI's response to extract a simple list of flaky tests.
    5. Saves the list to a JSON file that can be displayed in Grafana.
    6. Feeds a new panel in our dashboard that lists the flaky tests.
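
The first two steps above can be sketched with the standard library alone. The example builds a range-query URL for the Prometheus HTTP API and reduces the JSON response to a compact per-test history that could be embedded in a prompt. The label name `case` used to identify a test is an assumption, as are the function names. The call to the Gemini API is deliberately left out.

```python
import json
from urllib.parse import urlencode

METRIC = "jenkins_build_test_case_failure_age"


def build_query_url(base_url, start, end, step="1h"):
    """Build a Prometheus range-query URL for the failure-age metric."""
    params = urlencode({"query": METRIC, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def summarize(response, name_label="case"):
    """Reduce a range-query response to {test name: [values, ...]}.

    `name_label` is a hypothetical label; adjust it to the real metric labels.
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    history = {}
    for series in response["data"]["result"]:
        name = series["metric"].get(name_label, "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        history[name] = [float(value) for _, value in series["values"]]
    return history


def to_prompt_payload(history):
    """Serialize the history deterministically for inclusion in a prompt."""
    return json.dumps(history, sort_keys=True)
```

The URL can be fetched with `urllib.request.urlopen()` and the decoded JSON passed to `summarize()`. Keeping only the values per test keeps the payload small.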

    Resources


    Move Uyuni Test Framework from Selenium to Playwright + AI by oscar-barrios

    Description

    This project aims to migrate the existing Uyuni Test Framework from Selenium to Playwright. The move will improve the stability, speed, and maintainability of our end-to-end tests by leveraging Playwright's modern features. We'll be rewriting the current Selenium code in Ruby to Playwright code in TypeScript, which includes updating the test framework runner, step definitions, and configurations. This is also necessary because we're moving from Cucumber Ruby to CucumberJS.

    If you're still curious about the AI in the title, it was just a way to grab your attention. Thanks for your understanding.


    Goals

    • Migrate core tests, including onboarding of clients.
    • Improve test reliability: measure and confirm a significant reduction in flakiness.
    • Implement a robust framework: establish a well-structured and reusable Playwright test framework using CucumberJS.

    Resources

