Project Description
What is Prometheus Operator?
Prometheus Operator is an open-source community project that provides Kubernetes native deployment and management of Prometheus and related monitoring components.
It facilitates the creation of Alertmanager and Prometheus pods whose configuration is kept in sync with the state of the Kubernetes cluster via the following CRDs exposed by the operator (a short example follows the list):
1) ServiceMonitor: a scrape configuration for Prometheus for one or more Kubernetes Services
2) PodMonitor: a scrape configuration for Prometheus for one or more Kubernetes Pods
3) Probe: a scrape configuration for Prometheus for one or more Ingresses or static targets
4) PrometheusRule: a set of rule groups for Prometheus, each containing either alerting rules or recording rules
5) AlertmanagerConfig: a subsection of the Alertmanager configuration
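For illustration, this is roughly how a ServiceMonitor could be created programmatically; a minimal sketch using the official Kubernetes Python client, where all names, labels, and namespaces are placeholders rather than anything from this project:

```python
# Minimal sketch: create a ServiceMonitor custom resource with the official
# kubernetes Python client. All names, labels, and namespaces are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

service_monitor = {
    "apiVersion": "monitoring.coreos.com/v1",
    "kind": "ServiceMonitor",
    "metadata": {"name": "example-app", "namespace": "monitoring"},
    "spec": {
        # Select the Services whose endpoints Prometheus should scrape.
        "selector": {"matchLabels": {"app": "example-app"}},
        # The namespaceSelector lets a monitor target Services in other
        # namespaces, which is what problem 1 below is about.
        "namespaceSelector": {"matchNames": ["example-app"]},
        "endpoints": [{"port": "metrics", "interval": "30s"}],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="monitoring.coreos.com",
    version="v1",
    namespace="monitoring",
    plural="servicemonitors",
    body=service_monitor,
)
```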
Problems
There are two issues with deploying Prometheus Operator in multi-tenant environments where only a single Prometheus or Alertmanager deployment is desired:
1) Granting users permission to create these CRDs can allow them to set up monitoring or alerting for resources they do not have access to, since Monitors can scrape across namespaces by default. This behavior can be turned off, but doing so breaks certain integrations with other features such as Istio.
2) Providing users with permissions to access the UIs can allow them to query for series that they should not have access to.
Proposal
Create a way for whoever deploys a single Prometheus, Alertmanager, and Grafana to authorize users to perform actions on them based on Kubernetes-native RBAC.
Goals for this Hack Week
1) Create admission controllers for Prometheus Operator-based deployments (including Rancher Monitoring V1, Rancher Monitoring V2, and kube-prometheus-stack) that determine whether a user has permission to create a Prometheus Operator custom resource based on their Kubernetes RBAC; e.g. you must have access to a Service to create a ServiceMonitor, you must have access to a Pod to create a PodMonitor, and any PrometheusRule you create must be scoped to metrics collected from namespaces / resources you have access to (see the sketch after this list).
2) Create an optional Revocation Operator that removes ServiceMonitors and PodMonitors from a cluster when the user who created them no longer has access to the corresponding Service or Pod. Once a resource is revoked, the operator should send an alert directly to a configured Alertmanager to notify the cluster admin of the removal, including the YAML needed to add the resource back into the cluster.
3) Create an authorization plugin for Prometheus (similar to rancher/prometheus-auth) that verifies whether a user is authenticated with the k8s API and authorized to request the results of a given series.
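As a rough illustration of the RBAC check behind goal 1, the validating webhook could ask the Kubernetes API for a SubjectAccessReview before admitting a monitor. This is only a sketch using the kubernetes Python client; the chosen verb, resource, and policy are assumptions, not a finished design:

```python
# Sketch: before admitting a ServiceMonitor, check whether the requesting
# user may "get" Services in the namespaces the monitor selects.
from kubernetes import client, config

config.load_incluster_config()  # the webhook would run inside the cluster
authz = client.AuthorizationV1Api()

def can_user_get_services(username: str, namespace: str) -> bool:
    """Return True if `username` is allowed to get Services in `namespace`."""
    review = client.V1SubjectAccessReview(
        spec=client.V1SubjectAccessReviewSpec(
            user=username,
            resource_attributes=client.V1ResourceAttributes(
                verb="get",
                resource="services",
                namespace=namespace,
            ),
        )
    )
    result = authz.create_subject_access_review(review)
    return bool(result.status.allowed)

# The webhook handler would call this for every namespace the incoming
# ServiceMonitor's namespaceSelector matches and deny the request on failure.
```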
Resources
Looking for help from people with experience in Prometheus, Grafana, or Alertmanager who have ideas on how to better support multi-tenant environments!
This project is part of Hack Week 20.
Similar Projects
Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios
Description
Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.
This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.
The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.
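A minimal sketch of the Prometheus half of that pipeline, assuming the standard Prometheus HTTP API; the server URL, job name, and 30-day range are placeholders for the internal setup:

```python
# Sketch: fetch failure history from the Prometheus HTTP API and group it
# per test case for the LLM prompt. URL, jobname, and range are placeholders.
import requests

PROMETHEUS_URL = "http://prometheus.example.internal:9090"  # placeholder
QUERY = (
    'max_over_time(jenkins_build_test_case_failure_age'
    '{status=~"FAILED|REGRESSION", jobname="my-test-job"}[30d])'
)

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
resp.raise_for_status()
results = resp.json()["data"]["result"]

# Group the failure-age samples by suite/case so the prompt can show each
# test's individual history across builds.
history = {}
for sample in results:
    labels = sample["metric"]
    key = f'{labels.get("suite", "?")}/{labels.get("case", "?")}'
    history.setdefault(key, []).append(float(sample["value"][1]))
```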
Goals
By the end of Hack Week, we aim to have a single, working Python script that:
- Connects to Prometheus and executes a query to fetch detailed test failure history.
- Processes the raw data into a format suitable for the Gemini API.
- Successfully calls the Gemini API with the data and a clear prompt.
- Parses the AI's response to extract a simple list of flaky tests.
- Saves the list to a JSON file that can be displayed in Grafana.
- A new panel in our dashboard listing the flaky tests (see the sketch after this list).
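And a matching sketch of the Gemini and output steps, assuming the google-generativeai client library and the history dict from the Prometheus sketch above; the model name, prompt wording, and output handling are illustrative only:

```python
# Sketch: ask Gemini which tests look flaky and save the answer as JSON
# for Grafana. Assumes `history` maps "suite/case" -> list of failure ages.
import json

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # placeholder
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

prompt = (
    "Given these test failure histories (failure age per build), return a "
    "JSON array with the names of tests that look flaky (intermittent "
    "failures rather than consistent breakage):\n"
    + json.dumps(history, indent=2)
)

response = model.generate_content(prompt)
# Assumes the model answers with bare JSON; real code would need to strip
# code fences or retry on malformed output.
flaky_tests = json.loads(response.text)

with open("flaky_tests.json", "w") as f:
    json.dump(flaky_tests, f, indent=2)
```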
Resources
- Jenkins Prometheus Exporter: https://github.com/uyuni-project/jenkins-exporter/
- Data Source: Our internal Prometheus server.
- Key Metric: jenkins_build_test_case_failure_age{jobname, buildid, suite, case, status, failedsince}
- Existing Query for Reference: count by (suite) (max_over_time(jenkins_build_test_case_failure_age{status=~"FAILED|REGRESSION", jobname="$jobname"}[$__range]))
- AI Model: The Google Gemini API.
- Example of how to interact with the Gemini API: https://github.com/srbarrios/FailTale/
- Visualization: Our internal Grafana Dashboard.
- Internal IaC: https://gitlab.suse.de/galaxy/infrastructure/-/tree/master/srv/salt/monitoring