Project Description
The sar(1) tool, from the openSUSE package "sysstat", provides a comprehensive method for collecting performance data on a running system.
There isn't, however, a satisfactory way to display historical sar data. It is not uncommon for users who experience a performance degradation to include a few days' worth of sar archives in their bug reports. The experts looking into the report often resort to scanning these archives with text-oriented utilities such as sed and awk, which may fail to reveal the story hidden behind the data.
We aim to devise a method and tool to visualize large historical sar datasets.
Goal for this Hackweek
Take one or two sample sar datasets and feed them into Grafana[LINK], the Perfetto trace viewer[LINK] and Performance Co-Pilot (PCP)[LINK]. The solution of choice will need to compare favorably with my two previous sar visualization attempts, made with ad-hoc scripts: [LINK-1][LINK-2]. Evaluate the outcome, write a report and select a single tool to focus future efforts on.
Prior art
To the best of my knowledge, these are the methods currently available for plotting sar data:
- kSar[MAYBE-LINK-1][MAYBE-LINK-2][MAYBE-LINK-3]
kSar is a self-contained Java GUI. This tool is advertised in our openSUSE Tuning Guide, chapter 2 "System monitoring utilities", section 2.1.3.2 "Visualizing sar data"[LINK], although we don't have a package for it. Its drawback is that it can display one sar datafile at a time: as sar writes one datafile per day, to visualize a week's worth of archives one has to keep multiple kSar windows open. The resulting charts will have different Y-axis scales, which makes them difficult to compare. You can find a set of kSar screenshots illustrating this usability problem here: [LINK] (requires SUSE Confluence login).
- sadf -g
sadf -g your_datafile [ -- sar_options ] > output.svg
sadf(1) can emit SVG files that can be viewed in web browsers. The same limitation of kSar applies here: multiple days require multiple plots.
- sar2pcp[LINK]
There exists a Performance Co-Pilot (PCP) plugin to import sar data, which we package in openSUSE as "pcp-import-sar2pcp". Last time I checked, sar2pcp had to be invoked with
LD_PRELOAD=/usr/lib64/libpcp_import.so
because that shared library wasn't correctly linked in the build of the package. That can easily be fixed, but it's unclear to me how well the sar + PCP combination works, as I haven't yet tried it. It could very well suffer from the one-day-per-plot limitation of the previous two tools. The feature is advertised in both sar's and PCP's documentation.
- sadf -j
sadf -j your_datafile [ -- sar_options ] > output.json
Convert sar data to JSON, then write an ad-hoc script to produce the charts (a minimal sketch follows this list). This is what I've done in the past, but writing special-purpose code every time takes energy away from actually analyzing the data and debugging the problem at hand. The experience was valuable though, as it showed what these charts should look like (see [LINK-1][LINK-2])
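For illustration, here is a minimal sketch of such an ad-hoc script, assuming Python with matplotlib installed. It plots overall CPU utilization from the output of a command like "sadf -j sa20230125 -- -u"; the JSON field names ("sysstat", "hosts", "statistics", "cpu-load") reflect the layout emitted by recent sysstat versions, and the filenames are hypothetical, so verify both against your own data.

import json
import matplotlib.pyplot as plt

# Load the JSON produced by, e.g.: sadf -j sa20230125 -- -u > output.json
with open("output.json") as f:
    data = json.load(f)

# Walk the per-interval samples of the first host in the file.
times, busy = [], []
for stat in data["sysstat"]["hosts"][0]["statistics"]:
    ts = stat["timestamp"]
    for cpu in stat.get("cpu-load", []):
        if cpu["cpu"] == "all":  # the aggregate over all CPUs
            times.append(ts["date"] + " " + ts["time"])
            busy.append(100.0 - cpu["idle"])  # % of time not idle

# One line chart; several days can simply be concatenated on the X axis.
plt.plot(times, busy)
plt.xticks(rotation=45)
plt.ylabel("CPU utilization (%)")
plt.tight_layout()
plt.savefig("cpu.png")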
Tentative plan
It seems a sensible choice to leverage an existing graphing tool and make adjustments so that it's tailored to sar data (e.g. create a plug-in or a new plot type within an existing framework). Candidates are Grafana (web tool), the Perfetto trace visualizer (web tool) and Performance Co-Pilot (desktop app). I'm slightly biased towards interactive charts as opposed to static images, since zooming in/out and moving around the data range may help in exploring the dataset. Interactivity is best achieved with web-based tools, which offer the possibility of sharing access to the visualization with a URL, without the need for the recipient to install new software locally.
Looking for hackers with the skills:
grafana, performance, perfetto, performance-co-pilot, pcp, sar, visualization, monitoring, observability
This project is part of:
Hack Week 22, Hack Week 23
Activity
Comments
over 2 years ago by ggherdovich
Hello Heikki, thanks for commenting. What I'm taking from the status of sar + PCP interoperability in openSUSE is that no one has ever used it. The problems you mention, plus the ones I already knew about, are failures that you'd notice immediately as you launch the tool. So on the one hand they are fixable with one-line edits in the spec file of the package, but on the other hand they show that, in my estimation, the sar + PCP combo is essentially uncharted territory.
over 2 years ago by heikkiyp
Also noted that with a later sysstat version you can convert to a PCP-supported format directly, without needing the sar2pcp tool. The newer sysstat is not available for the SLE12 branch; you have to use the SLE15, Leap or TW version. sa files from the SLE12 branch need conversion anyway, as sar2pcp will complain about the format.
The most usable way to use PCP is on SLE15, TW or Leap. Convert the sa files with a command in this style:
sadf -l -O pcparchive=sample02 sa20230125 -- -A
and then use the pmchart tool.
Combining is easy: pmlogextract sample01 sample02 looongsample
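For a whole week of daily files, the same steps can be scripted and merged into one archive, along these lines (the filenames are just examples; the flags mirror the sadf command above):
for f in sa202301??; do sadf -l -O pcparchive="pcp-$f" "$f" -- -A; done
pmlogextract pcp-sa202301* full_week
pmchart -a full_week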
over 2 years ago by ggherdovich
That's fantastic information, Heikki! Thanks for sharing! I haven't tried the sar-to-PCP conversion technique you suggest (I'll do it soon), but I imagine the limitation you mention (sadf must be from SLE-15 or later) is not so severe: I suspect the user can still collect sar data using a SLE-12 sar daemon, and only the person analyzing the data must have a recent sysstat to do the format conversion. I'd expect the new sadf can still read old archives. Again, thanks for pointing this out!
almost 2 years ago by ggherdovich
Hi Paolo, thanks for the link. It looks like a Node.js app that can be deployed on premises. I need to try hosting a local instance of that "sargraph" and see how it does on some sample datasets I have.
Similar Projects
Flaky Tests AI Finder for Uyuni and MLM Test Suites by oscar-barrios
Description
Our current Grafana dashboards provide a great overview of test suite health, including a panel for "Top failed tests." However, identifying which of these failures are due to legitimate bugs versus intermittent "flaky tests" is a manual, time-consuming process. These flaky tests erode trust in our test suites and slow down development.
This project aims to build a simple but powerful Python script that automates flaky test detection. The script will directly query our Prometheus instance for the historical data of each failed test, using the jenkins_build_test_case_failure_age metric. It will then format this data and send it to the Gemini API with a carefully crafted prompt, asking it to identify which tests show a flaky pattern.
The final output will be a clean JSON list of the most probable flaky tests, which can then be used to populate a new "Top Flaky Tests" panel in our existing Grafana test suite dashboard.
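As a rough illustration of that pipeline, here is a minimal sketch assuming the requests and google-generativeai Python packages; the Prometheus URL, query window, model name and prompt are placeholders, not the project's actual choices.

import requests
import google.generativeai as genai

PROM_URL = "http://prometheus.example.com/api/v1/query"  # hypothetical
QUERY = ('max_over_time(jenkins_build_test_case_failure_age'
         '{status=~"FAILED|REGRESSION"}[30d])')

# 1. Fetch the failure history from Prometheus.
resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=30)
series = resp.json()["data"]["result"]

# 2. Reduce the raw series to a compact text form for the model.
lines = [f'{s["metric"].get("suite")}/{s["metric"].get("case")}: '
         f'failing since build {s["metric"].get("failedsince")}'
         for s in series]

# 3. Ask Gemini which tests look flaky.
genai.configure(api_key="YOUR_API_KEY")  # hypothetical key
model = genai.GenerativeModel("gemini-1.5-flash")
prompt = ("Given this test failure history, list the tests that show an "
          "intermittent (flaky) pattern, as a JSON array of test names:\n"
          + "\n".join(lines))
answer = model.generate_content(prompt)

# 4. Save the list so a Grafana panel can display it.
with open("flaky_tests.json", "w") as f:
    f.write(answer.text)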
Goals
By the end of Hack Week, we aim to have a single, working Python script that:
- Connects to Prometheus and executes a query to fetch detailed test failure history.
- Processes the raw data into a format suitable for the Gemini API.
- Successfully calls the Gemini API with the data and a clear prompt.
- Parses the AI's response to extract a simple list of flaky tests.
- Saves the list to a JSON file that can be displayed in Grafana.
- A new panel in our dashboard listing the flaky tests.
Resources
- Jenkins Prometheus Exporter: https://github.com/uyuni-project/jenkins-exporter/
- Data Source: Our internal Prometheus server.
- Key Metric: jenkins_build_test_case_failure_age{jobname, buildid, suite, case, status, failedsince}
- Existing Query for Reference: count by (suite) (max_over_time(jenkins_build_test_case_failure_age{status=~"FAILED|REGRESSION", jobname="$jobname"}[$__range]))
- AI Model: The Google Gemini API.
- Example about how to interact with Gemini API: https://github.com/srbarrios/FailTale/
- Visualization: Our internal Grafana Dashboard.
- Internal IaC: https://gitlab.suse.de/galaxy/infrastructure/-/tree/master/srv/salt/monitoring