Description

This project plans to create an MCP Trace Suite, a system that consolidates commonly used Linux debugging tools such as bpftrace, perf, and ftrace.

The suite is implemented as an MCP Server. This architecture allows an AI agent to leverage the server to diagnose Linux issues and perform targeted system debugging by remotely executing and retrieving tracing data from these powerful tools.

  • Repo: https://github.com/r1chard-lyu/systracesuite
  • Demo: Slides

Goals

  1. Build an MCP Server that can integrate various Linux debugging and tracing tools, including bpftrace, perf, ftrace, strace, and others, with support for future expansion of additional tools.

  2. Perform testing by intentionally creating bugs or issues that impact system performance, allowing an AI agent to analyze the root cause and identify the underlying problem.

Resources

  • Gemini CLI: https://geminicli.com/
  • eBPF: https://ebpf.io/
  • bpftrace: https://github.com/bpftrace/bpftrace/
  • perf: https://perfwiki.github.io/main/
  • ftrace: https://github.com/r1chard-lyu/tracium/

Looking for hackers with the skills:

ai mcp perf observability bpftrace ftrace top

This project is part of:

Hack Week 25

Activity

  • about 1 month ago: r1chard-lyu added keyword "top" to this project.
  • about 1 month ago: r1chard-lyu added keyword "ftrace" to this project.
  • about 1 month ago: r1chard-lyu removed keyword ebpf from this project.
  • about 1 month ago: r1chard-lyu added keyword "bpftrace" to this project.
  • about 1 month ago: r1chard-lyu started this project.
  • about 2 months ago: r1chard-lyu added keyword "ai" to this project.
  • about 2 months ago: r1chard-lyu added keyword "mcp" to this project.
  • about 2 months ago: r1chard-lyu added keyword "ebpf" to this project.
  • about 2 months ago: r1chard-lyu added keyword "perf" to this project.
  • about 2 months ago: r1chard-lyu added keyword "observability" to this project.
  • about 2 months ago: r1chard-lyu originated this project.

  • Comments

    • r1chard-lyu
      about 1 month ago by r1chard-lyu | Reply

      Hack Week 25 - Status


      Day 1

      • Systracesuite Development:
        • Feature Development (Server & Tracing Integration):
          • Secured non-interactive sudo for bpftrace_list.
          • Integrated Bpftrace utilities (version and help commands).
          • Added .gemini settings for Gemini CLI MCP connection.
          • Added fastmcp to requirements.txt.
          • Implemented initial FastMCP server (with status, echo, stop, top, perf tools).
        • Documentation:
          • Refined JSON examples in README to a single-line format.
          • Updated README.md with FastMCP server installation, JSON-RPC 2.0, and usage instructions.
        • Initial Setup: Added README.md file.

      Day 2

      • Systracesuite Development:
        • Feature Development (Tracing Tools):
          • Added read_bpftrace_tool for Bpftrace tool information.
          • Integrated new Bpftrace tools (e.g., tcpconnect.bt, tcpdrop.bt, etc.).
          • Added bpftrace_help tool.
          • Added ftrace_help and ftrace_version tools.
          • Added strace_help and strace_version tools.
        • Refactoring & Setup:
          • Refactored interactive root handling, removing password prompts; now uses bpftrace_list_all (assumes passwordless sudo).
          • Implemented setup.sh for passwordless sudo on specific tools.
        • Documentation: README updated to reflect these changes.

      Day 3

      • Systracesuite Development:
        • Project Refactoring: Rename project to "Systracesuite"; update README, MCP server name, and arguments.
        • Maintenance: Remove local Gemini settings; add to .gitignore.
        • Documentation: Update the README; add LICENSE file.
      • Testing the execution result of Systracesuite.
      • Testing bpftrace tools from upstream and SUSE sources.

      Day 4

      • Systracesuite Development:
        • Documentation Updates: Clarify Bpftrace tool sources in README; include a system architecture diagram.
      • Testing different Systracesuite use cases, such as network latency and program crashes.
      • Prepare demo experiments and slides.

      Day 5

      • Demo. Slides
      • Collect Q&A and feedback.
      • Plan and outline optimization directions.

    Similar Projects

    Extended private brain - RAG my own scripts and data into offline LLM AI by tjyrinki_suse

    Description

    For purely studying purposes, I'd like to find out if I could teach an LLM some of my own accumulated knowledge, to use it as a sort of extended brain.

    I might use qwen3-coder or something similar as a starting point.

    Everything would be done 100% offline without network available to the container, since I prefer to see when network is needed, and make it so it's never needed (other than initial downloads).

    Goals

    1. Learn something about RAG, LLM, AI.
    2. Find out if everything works offline as intended.
    3. As an end result have a new way to access my own existing know-how, but so that I can query the wisdom in them.
    4. Be flexible to pivot in any direction, as long as there are new things learned.

    Resources

    To be found on the fly.

    Timeline

    Day 1 (of 4)

    • Tried out a RAG demo, expanded on feeding it my own data
    • Experimented with qwen3-coder to add a persistent chat functionality, and keeping vectors in a pickle file
    • Optimizations to keep everything within context window
    • Learn and add a bit of PyTest

    Day 2

    • More experimenting and more data
    • Study ChromaDB
    • Add a Web UI that works from another computer even though the container sees network is down

    Day 3

    • The above RAG is working well enough for demonstration purposes.
    • Pivot to trying out OpenCode, configuring local Ollama qwen3-coder there, to analyze the RAG demo.
    • Figured out how to configure Ollama template to be usable under OpenCode. OpenCode locally is super slow to just running qwen3-coder alone.

    Day 4 (final day)

    • Battle with OpenCode that was both slow and kept on piling up broken things.
    • Call it success as after all the agentic AI was working locally.
    • Clean up the mess left behind a bit.

    Blog Post

    Summarized the findings at blog post.


    Enable more features in mcp-server-uyuni by j_renner

    Description

    I would like to contribute to mcp-server-uyuni, the MCP server for Uyuni / Multi-Linux Manager) exposing additional features as tools. There is lots of relevant features to be found throughout the API, for example:

    • System operations and infos
    • System groups
    • Maintenance windows
    • Ansible
    • Reporting
    • ...

    At the end of the week I managed to enable basic system group operations:

    • List all system groups visible to the user
    • Create new system groups
    • List systems assigned to a group
    • Add and remove systems from groups

    Goals

    • Set up test environment locally with the MCP server and client + a recent MLM server [DONE]
    • Identify features and use cases offering a benefit with limited effort required for enablement [DONE]
    • Create a PR to the repo [DONE]

    Resources


    Kubernetes-Based ML Lifecycle Automation by lmiranda

    Description

    This project aims to build a complete end-to-end Machine Learning pipeline running entirely on Kubernetes, using Go, and containerized ML components.

    The pipeline will automate the lifecycle of a machine learning model, including:

    • Data ingestion/collection
    • Model training as a Kubernetes Job
    • Model artifact storage in an S3-compatible registry (e.g. Minio)
    • A Go-based deployment controller that automatically deploys new model versions to Kubernetes using Rancher
    • A lightweight inference service that loads and serves the latest model
    • Monitoring of model performance and service health through Prometheus/Grafana

    The outcome is a working prototype of an MLOps workflow that demonstrates how AI workloads can be trained, versioned, deployed, and monitored using the Kubernetes ecosystem.

    Goals

    By the end of Hack Week, the project should:

    1. Produce a fully functional ML pipeline running on Kubernetes with:

      • Data collection job
      • Training job container
      • Storage and versioning of trained models
      • Automated deployment of new model versions
      • Model inference API service
      • Basic monitoring dashboards
    2. Showcase a Go-based deployment automation component, which scans the model registry and automatically generates & applies Kubernetes manifests for new model versions.

    3. Enable continuous improvement by making the system modular and extensible (e.g., additional models, metrics, autoscaling, or drift detection can be added later).

    4. Prepare a short demo explaining the end-to-end process and how new models flow through the system.

    Resources

    Project Repository

    Updates

    1. Training pipeline and datasets
    2. Inference Service py


    AI-Powered Unit Test Automation for Agama by joseivanlopez

    The Agama project is a multi-language Linux installer that leverages the distinct strengths of several key technologies:

    • Rust: Used for the back-end services and the core HTTP API, providing performance and safety.
    • TypeScript (React/PatternFly): Powers the modern web user interface (UI), ensuring a consistent and responsive user experience.
    • Ruby: Integrates existing, robust YaST libraries (e.g., yast-storage-ng) to reuse established functionality.

    The Problem: Testing Overhead

    Developing and maintaining code across these three languages requires a significant, tedious effort in writing, reviewing, and updating unit tests for each component. This high cost of testing is a drain on developer resources and can slow down the project's evolution.

    The Solution: AI-Driven Automation

    This project aims to eliminate the manual overhead of unit testing by exploring and integrating AI-driven code generation tools. We will investigate how AI can:

    1. Automatically generate new unit tests as code is developed.
    2. Intelligently correct and update existing unit tests when the application code changes.

    By automating this crucial but monotonous task, we can free developers to focus on feature implementation and significantly improve the speed and maintainability of the Agama codebase.

    Goals

    • Proof of Concept: Successfully integrate and demonstrate an authorized AI tool (e.g., gemini-cli) to automatically generate unit tests.
    • Workflow Integration: Define and document a new unit test automation workflow that seamlessly integrates the selected AI tool into the existing Agama development pipeline.
    • Knowledge Sharing: Establish a set of best practices for using AI in code generation, sharing the learned expertise with the broader team.

    Contribution & Resources

    We are seeking contributors interested in AI-powered development and improving developer efficiency. Whether you have previous experience with code generation tools or are eager to learn, your participation is highly valuable.

    If you want to dive deep into AI for software quality, please reach out and join the effort!

    • Authorized AI Tools: Tools supported by SUSE (e.g., gemini-cli)
    • Focus Areas: Rust, TypeScript, and Ruby components within the Agama project.

    Interesting Links


    MCP Server for SCC by digitaltomm

    Description

    Provide an MCP Server implementation for customers to access data on scc.suse.com via MCP protocol. The core benefit of this MCP interface is that it has direct (read) access to customer data in SCC, so the AI agent gets enhanced knowledge about individual customer data, like subscriptions, orders and registered systems.

    Architecture

    Schema

    Goals

    We want to demonstrate a proof of concept to connect to the SCC MCP server with any AI agent, for example gemini-cli or codex. Enabling the user to ask questions regarding their SCC inventory.

    For this Hackweek, we target that users get proper responses to these example questions:

    • Which of my currently active systems are running products that are out of support?
    • Do I have ready to use registration codes for SLES?
    • What are the latest 5 released patches for SLES 15 SP6? Output as a list with release date, patch name, affected package names and fixed CVEs.
    • Which versions of kernel-default are available on SLES 15 SP6?

    Technical Notes

    Similar to the organization APIs, this can expose to customers data about their subscriptions, orders, systems and products. Authentication should be done by organization credentials, similar to what needs to be provided to RMT/MLM. Customers can connect to the SCC MCP server from their own MCP-compatible client and Large Language Model (LLM), so no third party is involved.

    Milestones

    [x] Basic MCP API setup
      MCP endpoints
      [x] Products / Repositories
      [x] Subscriptions / Orders 
      [x] Systems
      [x] Packages
    [x] Document usage with Gemini CLI, Codex
    

    Resources

    Gemini CLI setup:

    ~/.gemini/settings.json:


    Multi-agent AI assistant for Linux troubleshooting by doreilly

    Description

    Explore multi-agent architecture as a way to avoid MCP context rot.

    Having one agent with many tools bloats the context with low-level details about tool descriptions, parameter schemas etc which hurts LLM performance. Instead have many specialised agents, each with just the tools it needs for its role. A top level supervisor agent takes the user prompt and delegates to appropriate sub-agents.

    Goals

    Create an AI assistant with some sub-agents that are specialists at troubleshooting Linux subsystems, e.g. systemd, selinux, firewalld etc. The agents can get information from the system by implementing their own tools with simple function calls, or use tools from MCP servers, e.g. a systemd-agent can use tools from systemd-mcp.

    Example prompts/responses:

    user$ the system seems slow
    assistant$ process foo with pid 12345 is using 1000% cpu ...
    
    user$ I can't connect to the apache webserver
    assistant$ the firewall is blocking http ... you can open the port with firewall-cmd --add-port ...
    

    Resources

    Language Python. The Python ADK is more mature than Golang.

    https://google.github.io/adk-docs/

    https://github.com/djoreilly/linux-helper


    Bugzilla goes AI - Phase 1 by nwalter

    Description

    This project, Bugzilla goes AI, aims to boost developer productivity by creating an autonomous AI bug agent during Hackweek. The primary goal is to reduce the time employees spend triaging bugs by integrating Ollama to summarize issues, recommend next steps, and push focused daily reports to a Web Interface.

    Goals

    To reduce employee time spent on Bugzilla by implementing an AI tool that triages and summarizes bug reports, providing actionable recommendations to the team via Web Interface.

    Project Charter

    Bugzilla goes AI Phase 1

    Description

    Project Achievements during Hackweek

    In this file you can read about what we achieved during Hackweek.

    Project Achievements


    SUSE Edge Image Builder MCP by eminguez

    Description

    Based on my other hackweek project, SUSE Edge Image Builder's Json Schema I would like to build also a MCP to be able to generate EIB config files the AI way.

    Realistically I don't think I'll be able to have something consumable at the end of this hackweek but at least I would like to start exploring MCPs, the difference between an API and MCP, etc.

    Goals

    • Familiarize myself with MCPs
    • Unrealistic: Have an MCP that can generate an EIB config file

    Resources

    Result

    https://github.com/e-minguez/eib-mcp

    I've extensively used antigravity and its agent mode to code this. This heavily uses https://hackweek.opensuse.org/25/projects/suse-edge-image-builder-json-schema for the MCP to be built.

    I've ended up learning a lot of things about "prompting", json schemas in general, some golang, MCPs and AI in general :)

    Example:

    Generate an Edge Image Builder configuration for an ISO image based on slmicro-6.2.iso, targeting x86_64 architecture. The output name should be 'my-edge-image' and it should install to /dev/sda. It should deploy a 3 nodes kubernetes cluster with nodes names "node1", "node2" and "node3" as: * hostname: node1, IP: 1.1.1.1, role: initializer * hostname: node2, IP: 1.1.1.2, role: agent * hostname: node3, IP: 1.1.1.3, role: agent The kubernetes version should be k3s 1.33.4-k3s1 and it should deploy a cert-manager helm chart (the latest one available according to https://cert-manager.io/docs/installation/helm/). It should create a user called "suse" with password "suse" and set ntp to "foo.ntp.org". The VIP address for the API should be 1.2.3.4

    Generates:

    ``` apiVersion: "1.0" image: arch: x86_64 baseImage: slmicro-6.2.iso imageType: iso outputImageName: my-edge-image kubernetes: helm: charts: - name: cert-manager repositoryName: jetstack


    Song Search with CLAP by gcolangiuli

    Description

    Contrastive Language-Audio Pretraining (CLAP) is an open-source library that enables the training of a neural network on both Audio and Text descriptions, making it possible to search for Audio using a Text input. Several pre-trained models for song search are already available on huggingface

    SUSE Hackweek AI Song Search

    Goals

    Evaluate how CLAP can be used for song searching and determine which types of queries yield the best results by developing a Minimum Viable Product (MVP) in Python. Based on the results of this MVP, future steps could include:

    • Music Tagging;
    • Free text search;
    • Integration with an LLM (for example, with MCP or the OpenAI API) for music suggestions based on your own library.

    The code for this project will be entirely written using AI to better explore and demonstrate AI capabilities.

    Result

    In this MVP we implemented:

    • Async Song Analysis with Clap model
    • Free Text Search of the songs
    • Similar song search based on vector representation
    • Containerised version with web interface

    We also documented what went well and what can be improved in the use of AI.

    You can have a look at the result here:

    Future implementation can be related to performance improvement and stability of the analysis.

    References


    MCP Perl SDK by kraih

    Description

    We've been using the MCP Perl SDK to connect openQA with AI. And while the basics are working pretty well, the SDK is not fully spec compliant yet. So let's change that!

    Goals

    • Support for Resources
    • All response types (Audio, Resource Links, Embedded Resources...)
    • Tool/Prompt/Resource update notifications
    • Dynamic Tool/Prompt/Resource lists
    • New authentication mechanisms

    Resources


    dynticks-testing: analyse perf / trace-cmd output and aggregate data by m.crivellari

    Description

    dynticks-testing is a project started years ago by Frederic Weisbecker. One of the feature is to check the actual configuration (isolcpus, irqaffinity etc etc) and give feedback on it.

    An important goal of this tool is to parse the output of trace-cmd / perf and provide more readable data, showing the duration of every events grouped by PID (showing also the CPU number, if the tasks has been migrated etc).

    An example of data captured on my laptop (incomplete!!):

              -0     [005] dN.2. 20310.270699: sched_wakeup:         WaylandProxy:46380 [120] CPU:005
              -0     [005] d..2. 20310.270702: sched_switch:         swapper/5:0 [120] R ==> WaylandProxy:46380 [120]
    ...
        WaylandProxy-46380 [004] d..2. 20310.295397: sched_switch:         WaylandProxy:46380 [120] S ==> swapper/4:0 [120]
              -0     [006] d..2. 20310.295397: sched_switch:         swapper/6:0 [120] R ==> firefox:46373 [120]
             firefox-46373 [006] d..2. 20310.295408: sched_switch:         firefox:46373 [120] S ==> swapper/6:0 [120]
              -0     [004] dN.2. 20310.295466: sched_wakeup:         WaylandProxy:46380 [120] CPU:004
    

    Output of noise_parse.py:

    Task: WaylandProxy Pid: 46380 cpus: {4, 5} (Migrated!!!)
            Wakeup Latency                                Nr:        24     Duration:          89
            Sched switch: kworker/12:2                    Nr:         1     Duration:           6
    

    My first contribution is around Nov. 2024!

    Goals

    • add more features (eg cpuset)
    • test / bugfix

    Resources

    Progresses

    isolcpus and cpusets implemented and merged in master: dynticks-testing.git commit


    bpftrace contribution by mkoutny

    Description

    bpftrace is a great tool, no need to sing odes to it here. It can access any kernel data and process them in real time. It provides helpers for some common Linux kernel structures but not all.

    Goals

    • set up bpftrace toolchain
    • learn about bpftrace implementation and internals
    • implement support for percpu_counters
    • look into some of the first issues
    • send a refined PR (on Thu)

    Resources


    SUSE Observability MCP server by drutigliano

    Description

    The idea is to implement the SUSE Observability Model Context Protocol (MCP) Server as a specialized, middle-tier API designed to translate the complex, high-cardinality observability data from StackState (topology, metrics, and events) into highly structured, contextually rich, and LLM-ready snippets.

    This MCP Server abstract the StackState APIs. Its primary function is to serve as a Tool/Function Calling target for AI agents. When an AI receives an alert or a user query (e.g., "What caused the outage?"), the AI calls an MCP Server endpoint. The server then fetches the relevant operational facts, summarizes them, normalizes technical identifiers (like URNs and raw metric names) into natural language concepts, and returns a concise JSON or YAML payload. This payload is then injected directly into the LLM's prompt, ensuring the final diagnosis or action is grounded in real-time, accurate SUSE Observability data, effectively minimizing hallucinations.

    Goals

    • Grounding AI Responses: Ensure that all AI diagnoses, root cause analyses, and action recommendations are strictly based on verifiable, real-time data retrieved from the SUSE Observability StackState platform.
    • Simplifying Data Access: Abstract the complexity of StackState's native APIs (e.g., Time Travel, 4T Data Model) into simple, semantic functions that can be easily invoked by LLM tool-calling mechanisms.
    • Data Normalization: Convert complex, technical identifiers (like component URNs, raw metric names, and proprietary health states) into standardized, natural language terms that an LLM can easily reason over.
    • Enabling Automated Remediation: Define clear, action-oriented MCP endpoints (e.g., execute_runbook) that allow the AI agent to initiate automated operational workflows (e.g., restarts, scaling) after a diagnosis, closing the loop on observability.

     Hackweek STEP

    • Create a functional MCP endpoint exposing one (or more) tool(s) to answer queries like "What is the health of service X?") by fetching, normalizing, and returning live StackState data in an LLM-ready format.

     Scope

    • Implement read-only MCP server that can:
      • Connect to a live SUSE Observability instance and authenticate (with API token)
      • Use tools to fetch data for a specific component URN (e.g., current health state, metrics, possibly topology neighbors, ...).
      • Normalize response fields (e.g., URN to "Service Name," health state DEVIATING to "Unhealthy", raw metrics).
      • Return the data as a structured JSON payload compliant with the MCP specification.

    Deliverables

    • MCP Server v0.1 A running Golang MCP server with at least one tool.
    • A README.md and a test script (e.g., curl commands or a simple notebook) showing how an AI agent would call the endpoint and the resulting JSON payload.

    Outcome A functional and testable API endpoint that proves the core concept: translating complex StackState data into a simple, LLM-ready format. This provides the foundation for developing AI-driven diagnostics and automated remediation.

    Resources

    • https://www.honeycomb.io/blog/its-the-end-of-observability-as-we-know-it-and-i-feel-fine
    • https://www.datadoghq.com/blog/datadog-remote-mcp-server
    • https://modelcontextprotocol.io/specification/2025-06-18/index
    • https://modelcontextprotocol.io/docs/develop/build-server

     Basic implementation

    • https://github.com/drutigliano19/suse-observability-mcp-server

    Results

    Successfully developed and delivered a fully functional SUSE Observability MCP Server that bridges language models with SUSE Observability's operational data. This project demonstrates how AI agents can perform intelligent troubleshooting and root cause analysis using structured access to real-time infrastructure data.

    Example execution