SUSE Hack Week: Single Cluster RBAC for Prometheus Operator

Project Description

What is Prometheus Operator?

Prometheus Operator is an open-source community project that provides Kubernetes native deployment and management of Prometheus and related monitoring components.

It facilitates the creation of Alertmanager and Prometheus pods whose configuration is kept in sync with the state of the Kubernetes cluster through the following CRDs exposed by the operator:

1) ServiceMonitors: a scrape configuration for Prometheus for one or more Kubernetes services

2) PodMonitors: a scrape configuration for Prometheus for one or more Kubernetes Pods

3) Probes: a scrape configuration for Prometheus for one or more ingresses or static targets

4) PrometheusRule: a set of rule groups for Prometheus that each contain either alerting rules or recording rules

5) AlertmanagerConfig: a subsection of the Alertmanager configuration

Problems

There are two issues with deploying Prometheus Operator in multi-tenant environments where only a single Prometheus or Alertmanager deployment is desired:

1) Granting users permission to create these custom resources can let them set up monitoring or alerting for resources they do not have access to, since Monitors can scrape across namespaces by default. A user can turn this ability off, but doing so breaks certain integrations with other features such as Istio.

2) Granting users access to the UIs can let them query series that they should not have access to.
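To make the first problem concrete, here is a minimal sketch that represents a ServiceMonitor as a plain Python dict and reports which namespaces it would select Services from. The `namespaceSelector` field with its `any` and `matchNames` keys is part of the ServiceMonitor spec. The namespace names, labels, port name, and the helper `targeted_namespaces` are made up for illustration.

```python
from typing import Any, Dict, List

ALL_NAMESPACES = "*"  # sentinel used only in this sketch


def targeted_namespaces(monitor: Dict[str, Any]) -> List[str]:
    """Return the namespaces a ServiceMonitor or PodMonitor selects targets from."""
    own_namespace = monitor["metadata"]["namespace"]
    selector = monitor.get("spec", {}).get("namespaceSelector") or {}
    if selector.get("any"):
        return [ALL_NAMESPACES]
    names = selector.get("matchNames") or []
    # With an empty selector, only the monitor's own namespace is selected.
    return sorted(names) if names else [own_namespace]


# Hypothetical example: a user in "team-a" points a monitor at "team-b".
monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "cross-namespace", "namespace": "team-a"},
    "spec": {
        "selector": {"matchLabels": {"app": "payments"}},
        "namespaceSelector": {"matchNames": ["team-b"]},
        "endpoints": [{"port": "metrics"}],
    },
}

print(targeted_namespaces(monitor))
```

Nothing in the manifest itself ties the selected namespaces to what its author is allowed to read, which is the gap the proposal tries to close.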

Proposal

Create a way for a user deploying a single Prometheus, Alertmanager, and Grafana to authorize other users to perform actions on them based on Kubernetes-native RBAC.

Goals for this Hack Week

1) Create Admission Controllers for Prometheus Operator-based deployments (including Rancher Monitoring V1, Rancher Monitoring V2, and kube-prometheus-stack) that will determine whether a user has permissions to create a Prometheus Operator custom resource depending on their Kubernetes RBAC, e.g. you must have access to a Service to create a ServiceMonitor, you must have access to a Pod to create a PodMonitor, and any PrometheusRule you create must be scoped to metrics collected from namespaces / resources you have access to.

2) Create an optional Revocation Operator that will remove ServiceMonitors and PodMonitors from a cluster when a user who creates them no longer has access to a Service or Pod within the cluster. Once revoked, the Operator should send an alert directly to a configured Alertmanager to notify the cluster admin of the removal of these resources with YAML that allows them to add it back into the cluster.

3) Create an authorization plugin for Prometheus (similar to rancher/prometheus-auth) that will verify whether a user is authenticated with the k8s API and authorized to request the results of a given series.
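The admission check in the first goal could look roughly like the sketch below. It is not an implementation from this project: it assumes a validating webhook that receives an `admission.k8s.io/v1` AdmissionReview and decides by asking whether the requesting user can read the underlying resource. The kind-to-resource mapping, the choice of the `get` verb, and the injected `can_access` callable are assumptions. In a real webhook `can_access` would submit the SubjectAccessReview built here to the API server.

```python
from typing import Any, Callable, Dict, List

# Assumed mapping from monitor kind to the resource the user must be able to read.
REQUIRED_RESOURCE = {"ServiceMonitor": "services", "PodMonitor": "pods"}

# (user, groups, namespace, resource) -> allowed. Injected so the sketch runs offline.
AccessChecker = Callable[[str, List[str], str, str], bool]


def subject_access_review(user: str, groups: List[str],
                          namespace: str, resource: str) -> Dict[str, Any]:
    """Build the SubjectAccessReview body a webhook could send to the API server."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            "user": user,
            "groups": groups,
            "resourceAttributes": {
                "namespace": namespace,  # "" means all namespaces
                "verb": "get",           # assumed verb for "has access to"
                "resource": resource,
            },
        },
    }


def review(admission_review: Dict[str, Any], can_access: AccessChecker) -> Dict[str, Any]:
    """Allow the request only if the user can read every namespace it targets."""
    request = admission_review["request"]
    obj = request["object"]
    user = request["userInfo"]["username"]
    groups = request["userInfo"].get("groups", [])
    resource = REQUIRED_RESOURCE.get(obj["kind"])

    denied: List[str] = []
    if resource is not None:
        selector = obj.get("spec", {}).get("namespaceSelector") or {}
        if selector.get("any"):
            namespaces = [""]
        else:
            namespaces = selector.get("matchNames") or [obj["metadata"]["namespace"]]
        denied = [ns or "<all namespaces>" for ns in namespaces
                  if not can_access(user, groups, ns, resource)]

    response: Dict[str, Any] = {"uid": request["uid"], "allowed": not denied}
    if denied:
        response["status"] = {
            "message": f"{user} cannot get {resource} in: {', '.join(denied)}"
        }
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}


# Hypothetical request: "alice" may only read resources in "team-a".
def only_team_a(user, groups, namespace, resource):
    return namespace == "team-a"


incoming = {"request": {
    "uid": "1234",
    "userInfo": {"username": "alice", "groups": ["devs"]},
    "object": {"kind": "ServiceMonitor",
               "metadata": {"name": "m", "namespace": "team-a"},
               "spec": {"namespaceSelector": {"matchNames": ["team-a", "team-b"]}}},
}}
print(review(incoming, only_team_a)["response"])
```

Injecting the access check keeps the decision logic testable without a cluster. PrometheusRule scoping is not covered here because it would require inspecting the rule expressions themselves.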

Resources

Looking for help from people with experience in Prometheus, Grafana, or Alertmanager who have ideas on how to better support multi-tenant environments!

Looking for hackers with the skills:

prometheus alertmanager grafana monitoring alerting

This project is part of:

Hack Week 20

Activity

almost 5 years ago: aiyengar2 originated this project.

Similar Projects

prometheus

Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios

Description

Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.

This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.

The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.

Goals

By the end of Hack Week, we aim to have a single, working Python script that:

Connects to Prometheus and executes a query to fetch detailed test failure history.
Processes the raw data into a format suitable for the Gemini API.
Successfully calls the Gemini API with the data and a clear prompt.
Parses the AI's response to extract a simple list of flaky tests.
Saves the list to a JSON file that can be displayed in Grafana.
Feeds a new panel in our dashboard that lists the flaky tests.
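The processing step in the goals above might be sketched as follows. It assumes the data arrives in the standard Prometheus HTTP API range-query shape (a `matrix` result with `metric` labels and `[timestamp, "value"]` pairs). The record layout, the prompt wording, and the sample suite and case names are invented for illustration, and the Gemini call itself is deliberately left out.

```python
import json
from typing import Any, Dict, List


def summarize(prom_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collapse a Prometheus range-query response into one compact record per series."""
    if prom_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    records = []
    for series in prom_response["data"]["result"]:
        labels = series["metric"]
        records.append({
            "suite": labels.get("suite", ""),
            "case": labels.get("case", ""),
            "status": labels.get("status", ""),
            # Sample values arrive as strings, already in time order.
            "failure_age": [float(value) for _, value in series["values"]],
        })
    return sorted(records, key=lambda r: (r["suite"], r["case"]))


def build_prompt(records: List[Dict[str, Any]]) -> str:
    """Wrap the records in a prompt; the wording is a placeholder."""
    return (
        "Identify which of these test cases show a flaky failure pattern. "
        "Answer with a JSON list of test case names.\n"
        + json.dumps(records, indent=2)
    )


# Hypothetical response with two series.
sample = {
    "status": "success",
    "data": {"resultType": "matrix", "result": [
        {"metric": {"suite": "srv_sync", "case": "test_b", "status": "FAILED"},
         "values": [[1700000000, "1"], [1700003600, "2"]]},
        {"metric": {"suite": "init_clients", "case": "test_a", "status": "REGRESSION"},
         "values": [[1700000000, "1"]]},
    ]},
}

records = summarize(sample)
print(build_prompt(records))
```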

Resources

Jenkins Prometheus Exporter: https://github.com/uyuni-project/jenkins-exporter/
Data Source: Our internal Prometheus server.
Key Metric: jenkins_build_test_case_failure_age{jobname, buildid, suite, case, status, failedsince}.
Existing Query for Reference: count by (suite) (max_over_time(jenkins_build_test_case_failure_age{status=~"FAILED|REGRESSION", jobname="$jobname"}[$__range])).
AI Model: The Google Gemini API.
Example of how to interact with the Gemini API: https://github.com/srbarrios/FailTale/
Visualization: Our internal Grafana Dashboard.
Internal IaC: https://gitlab.suse.de/galaxy/infrastructure/-/tree/master/srv/salt/monitoring
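The reference query above uses the Grafana variables `$jobname` and `$__range`, which Prometheus itself does not understand. A script would need to substitute them before sending the query, roughly as in this sketch. The `/api/v1/query` endpoint and its `query` parameter are part of the Prometheus HTTP API. The base URL, job name, and time range are placeholders.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

QUERY_TEMPLATE = (
    "count by (suite) (max_over_time(jenkins_build_test_case_failure_age"
    '{status=~"FAILED|REGRESSION", jobname="$jobname"}[$__range]))'
)


def render_query(jobname: str, time_range: str) -> str:
    """Replace the Grafana dashboard variables with concrete values."""
    return QUERY_TEMPLATE.replace("$jobname", jobname).replace("$__range", time_range)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder values; the real server and job names are not given in the text.
promql = render_query("example-acceptance-tests", "7d")
url = instant_query_url("http://prometheus.example:9090/", promql)
print(url)

# Round trip: decoding the URL gives back the exact PromQL expression.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == promql)
```

Encoding the expression with `urlencode` matters here because the query contains quotes, braces, and a `|` character.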

Outcome

grafana

Uyuni Health-check Grafana AI Troubleshooter by ygutierrez

Description

This project explores the feasibility of using the open-source Grafana LLM plugin to enhance the Uyuni Health-check tool with LLM capabilities. The idea is to integrate a chat-based "AI Troubleshooter" directly into existing dashboards, allowing users to ask natural-language questions about errors, anomalies, or performance issues.

Goals

Investigate if and how the grafana-llm-app plug-in can be used within the Uyuni Health-check tool.
Investigate if this plug-in can be used to query LLMs for troubleshooting scenarios.
Evaluate support for local LLMs and external APIs through the plugin.
Evaluate if and how the Uyuni MCP server could be integrated as another source of information.

Resources

Grafana LLM plug-in

Uyuni Health-check
