Orcas are amazing animals. They are playful, intelligent, great swimmers, and very social. They also love to play with their food, hunting down their prey with advanced strategies - understanding where their prey hides, how it will try to escape, and how to overcome those tactics - and having a lot of fun doing so, before relentlessly tearing it apart, killing it, and eating it. Not necessarily in that order. Oh, and they have the right color scheme.

This forces their prey to improve as well and to adopt more advanced strategies and tactics. In this arms race, both sides evolve and improve: the evolutionary pressure has made cephalopods highly intelligent, adaptable, and resilient. Unfortunately (for them), they are still very tasty. So we should exert more evolutionary pressure on individuals to help them stay alive as a species.

The most prominent example of this is Netflix's Chaos Monkey. However, that is very heavily focused on Amazon's cloud services. The Ceph project also has Teuthology; but that's mainly checking whether Ceph remembers all the tricks it has been taught. And CBT, which measures how fast it can swim while the cluster is static. CeTune helps it swim faster. All are needed and provide valuable insights, but they are too tame; Ceph is not afraid enough of them.

A large distributed Ceph cluster will always be "in transition": something fails, data is being rebalanced, nodes are being added or removed, ... all the while clients expect it to deliver service.

We need a stress-test harness for Ceph that one can point at an existing Ceph cluster, and that will understand the failure domains (OSD trees, nodes, NIC connections, ...) and inject faults until it eventually breaks. (All the while measuring performance to see whether the cluster is still within its SLAs.)
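
To make "understand the failure domains" a bit more concrete, here is a minimal sketch (in Python) of how the harness might model the discovered topology. This is an assumption about one possible design, not existing Orca or Ceph code:

    # A minimal, hypothetical model of the discovered topology; none of these names
    # are an existing Orca or Ceph API.
    from dataclasses import dataclass, field

    @dataclass
    class FailureDomain:
        name: str                 # e.g. "node1", "rack2", "front-end network"
        redundancy: int           # how many concurrent faults it should tolerate
        components: list = field(default_factory=list)    # OSDs, daemons, NICs, ...
        active_faults: list = field(default_factory=list)

        def can_inject(self):
            # Only inject another fault here if we stay within the advertised redundancy.
            return len(self.active_faults) < self.redundancy

    # Example: a tiny topology, manually configured for now (automatic discovery later).
    topology = [
        FailureDomain("node1", redundancy=1, components=["osd.0", "osd.1", "mon.a"]),
        FailureDomain("node2", redundancy=1, components=["osd.2", "osd.3"]),
        FailureDomain("front-end network", redundancy=0, components=["eth0@node1", "eth0@node2"]),
    ]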

You could think of this as a form of black-/gray-box testing at the system level. We don't really need to know a lot about Ceph's internals; we only need to know the high-level architecture so we can group the components into failure domains and see how many faults we should be able to inject without causing failure. And once we heal the faults, watch while - or rather, whether - Ceph properly recovers.

Customers also don't care whether it is Ceph crashing and not recovering, or whether the specific workload has triggered a bug in some other part of the kernel. Thus, we need to test holistically at the system level.

Goals:

  • Make Ceph more robust in the face of faults
  • Improve Ceph recovery
  • Increase customer confidence in their deployed clusters
  • Improve supportability of production clusters by forcing developers to look into failure scenarios more frequently

Possible errors to inject:

  • killing daemons
  • SIGSTOP (simulates hangs)
  • inducing kernel panics
  • network outages on the front-end or back-end
  • invoking random network latency and bottlenecks
  • out-of-memory errors
  • CPU overload
  • corrupting data on disk
  • full cluster outage and reboot (think power outage)
  • ...
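
Most of these faults can be injected with standard Linux tools. As a rough illustration only - the daemon match string, device and node names are placeholders, and these are common approaches rather than a finished injector:

    # Illustrative only: a handful of the faults above, expressed as shell commands
    # that the harness could run on a target node (here via SSH). The daemon match
    # string, device name, and node names are placeholders, not real Orca config.
    import subprocess

    def run_on(node, command):
        # Assumes passwordless SSH to the cluster nodes; real tooling would want
        # proper error handling, logging, and a way to undo every fault.
        return subprocess.run(["ssh", node, command], check=False)

    FAULTS = {
        "kill-osd":     lambda node: run_on(node, "pkill -KILL -f 'ceph-osd -i 3'"),
        "hang-osd":     lambda node: run_on(node, "pkill -STOP -f 'ceph-osd -i 3'"),
        "resume-osd":   lambda node: run_on(node, "pkill -CONT -f 'ceph-osd -i 3'"),
        "net-latency":  lambda node: run_on(node, "tc qdisc add dev eth0 root netem delay 200ms"),
        "heal-latency": lambda node: run_on(node, "tc qdisc del dev eth0 root netem"),
        # Requires sysrq to be enabled on the node; obviously not for production runs.
        "kernel-panic": lambda node: run_on(node, "echo c > /proc/sysrq-trigger"),
    }

    FAULTS["hang-osd"]("node2")   # example: simulate a hung OSD daemon on node2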

There are several states of the cluster to trigger:

  • baseline ("sunny weather"): establish a performance baseline while everything actually works. (While this is never really the case in production, it provides the reference point for judging performance under adverse conditions.)

  • "lightly" degraded - the system must be able to cope with a single fault in one of its failure domains, all the while providing service within the high-end range of its SLAs. Once this fault is healed, the system should fully recover.

  • "heavily" degraded - the system should be able to cope with a single fault in each of several of its failure domains, all the while providing service within its SLAs. Once these faults are healed, the system should fully recover. (This is harder than the previous case due to unexpected interdependencies.)

  • "crashed": if the faults in any of its failure domains exceed the available redundancy, the system is expected to indeed stop providing service. However, it must do so cleanly. And for many of these scenarios, the system would still be expected to be capable of automatic recovery once the faults have healed.

  • "byzantine" faults: if the injected faults have corrupted more than a certain threshold of the persistently stored data, the data can be considered lost beyond hope. (Think split brain, etc.) For faults that are within the design spec, this state should never occur, even if the system has crashed; it must refuse service before reaching this state. Dependable systems must also fail gracefully ("safely"), detect this state ("scrub"), and refuse service as appropriate.
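
Building on the topology sketch above, the mapping from injected faults to the expected state, and fetching what the cluster actually reports, could look roughly like this (the health fields in the JSON output of "ceph status" vary between releases, so the parsing here is an assumption):

    # Sketch: derive the *expected* state from the faults we have injected, and
    # fetch what the cluster itself reports, so the two can be compared.
    import json, subprocess

    def expected_state(domains):
        over_budget = [d for d in domains if len(d.active_faults) > d.redundancy]
        faulted = [d for d in domains if d.active_faults]
        if over_budget:
            return "crashed"            # redundancy exceeded: a clean stop is acceptable
        if len(faulted) > 1:
            return "heavily degraded"
        if len(faulted) == 1:
            return "lightly degraded"
        return "baseline"

    def observed_health():
        out = subprocess.run(["ceph", "status", "--format", "json"],
                             capture_output=True, check=True)
        status = json.loads(out.stdout)
        # Field names differ between Ceph releases; adjust as needed.
        return status.get("health", {}).get("status", "UNKNOWN")   # HEALTH_OK / _WARN / _ERR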

While this can be run in a lab, it should actually be possible to run Orca against a production cluster as part of its ongoing evaluation or pre-production certification. One of these days, it may even be possible to run Teuthology while Orca is running.

Basic loop:

  • Discover topology (may be manually configured in the beginning)
  • Start load generator
  • Audit cluster health
  • Induce a new fault
  • Watch cluster state
  • Heal faults (possibly, unless we want to next induce one in a different failure domain)
  • Watch whether it heals as expected
  • Repeat ;-)
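
Roughly, in Python, with stub helpers standing in for the pieces sketched above (load generator, fault injectors, audits) - a sketch of the control flow, not actual Orca code:

    # The loop, roughly. The helpers are stubs standing in for the load generator,
    # the fault injectors, and the audits; expected_state() and the FailureDomain
    # topology come from the earlier sketches.
    import random, time

    def start_load():                   # e.g. kick off rados bench / fio clients
        pass

    def inject_fault(rng, domain):      # in reality: pick one of the fault injectors
        domain.active_faults.append("some-fault")
        return domain

    def heal_fault(domain):
        domain.active_faults.pop()

    def audit(topology, expect):
        actual = expected_state(topology)   # in reality: also check observed_health() and SLAs
        assert actual == expect, "cluster is %s, expected %s" % (actual, expect)

    def orca_loop(topology, seed, iterations=10):
        rng = random.Random(seed)       # same seed + same allowed tests => repeatable run
        start_load()
        for _ in range(iterations):
            audit(topology, expect="baseline")
            domain = rng.choice([d for d in topology if d.can_inject()])
            fault = inject_fault(rng, domain)
            time.sleep(30)              # let rebalancing / recovery kick in
            audit(topology, expect=expected_state(topology))
            heal_fault(fault)           # or keep it and fault a different domain next
            time.sleep(30)
            audit(topology, expect="baseline")   # did it actually heal as expected?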

  • Runs should be repeatable if provided with the same (random) seed and the same list of allowed tests.
  • It must also be possible to specify a list of tests and their timing explicitly.
  • Configurable lists/blacklists of tests for specific environments
  • Configurable fault inducers
  • Configurable audits
  • Maximum number of faults per failure domain and in total to be configurable, of course (a possible configuration format is sketched below)

  • Can this be done within Teuthology?

  • Can this leverage any of the Pacemaker CTS work?

  • Flag and abort the run if the cluster ends up in a worse state than anticipated; e.g., if we think we induced a lightly degraded cluster but service actually went down, or if we healed all faults and triggered a restart and the system does not recover within a reasonable timeout.

  • We need to minimize false positives; otherwise it will require just as much overhead to sort through as Teuthology.
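
A possible shape for such a run configuration, covering the seed, allowed/blacklisted tests, audits, SLAs, and fault budgets listed above - sketched as a Python dict with purely hypothetical key names:

    # A possible shape for a run configuration; every key name here is hypothetical.
    RUN_CONFIG = {
        "seed": 424242,                          # same seed => repeatable run
        "allowed_faults": ["kill-osd", "hang-osd", "net-latency"],
        "blacklisted_faults": ["kernel-panic"],  # e.g. forbidden on production clusters
        "max_faults_per_domain": 1,
        "max_faults_total": 2,
        "audits": ["ceph-health", "client-sla", "scrub-errors"],
        "sla": {"max_read_latency_ms": 50, "min_iops": 10000},
        "abort_if_worse_than_expected": True,
    }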

The first step is to flesh out the requirements a bit more and then decide where to implement this. I don't want to randomly start a new project, but I also don't want to shoehorn it into an existing project if it's not a good fit.

  • Trello board for requirements/use cases? taiga.io project? ;-)

I think that's about enough for a quick draft ;-)

Looking for hackers with the skills:

ceph testing qa paranoia distributedsystems python

This project is part of:

Hack Week 14

Activity

  • over 8 years ago: jcejka liked this project.
  • over 8 years ago: tdig liked this project.
  • over 8 years ago: locilka liked this project.
  • over 8 years ago: pgonin liked this project.
  • over 8 years ago: dwaas joined this project.
  • over 8 years ago: dwaas liked this project.
  • over 8 years ago: jluis joined this project.
  • over 8 years ago: LarsMB added keyword "python" to this project.
  • over 8 years ago: LarsMB added keyword "ceph" to this project.
  • over 8 years ago: LarsMB added keyword "testing" to this project.
  • over 8 years ago: LarsMB added keyword "qa" to this project.
  • over 8 years ago: LarsMB added keyword "paranoia" to this project.
  • over 8 years ago: LarsMB added keyword "distributedsystems" to this project.
  • over 8 years ago: LenzGr liked this project.
  • over 8 years ago: jfajerski liked this project.
  • over 8 years ago: dmdiss liked this project.
  • over 8 years ago: abhishekl joined this project.
  • over 8 years ago: jfajerski joined this project.
  • over 8 years ago: LarsMB started this project.
  • over 8 years ago: LarsMB originated this project.

  • Comments

    • jluis
      over 8 years ago by jluis | Reply

      so, are we meeting up to discuss how to approach this at some point? Also, I'd happily vote for a taiga.io project - at least to try it out with a proper project, see how it feels and what not ;)

    • jfajerski
      over 8 years ago by jfajerski | Reply

      Hi Guys, I took the liberty of creating an Etherpad with mostly content from Lars' project text and a few thoughts that came to mind. Feel free to add: https://etherpad.nue.suse.com/p/ceph-testing-with-Orca

    Similar Projects

    Hack on isotest-ng - a rust port of isotovideo (os-autoinst aka testrunner of openQA) by szarate

    Description

    Some time ago, I managed to convince ByteOtter to hack something that resembles isotovideo, but in Rust - not because I believe that Perl is dead, but because there are certain limitations in the Perl code (how it was written), and it's always hard to add new functionality when it is about implementing a new backend or fixing bugs (along with people complaining that Perl is dead, and that they don't like it).

    In reality, I wanted to see if this could be done, and ByteOtter proved that it could be, while doing an amazing job at hacking a VNC console and helping me better understand what RuPerl needs to work.

    I plan to keep working on this for the next few years, and while I don't aim for feature completion or replacing isotovideo with isotest-ng (name in progress), I do plan to be able to use it on a daily basis, using specialized tooling with interfaces instead of reimplementing everything in the backend.

    Todo

    • Add make targets for testability, e.g. "spawn qemu and type"
    • Add image search matching algorithm
    • Add a Null test distribution provider
    • Add a Perl Test Distribution Provider
    • Fix unittests https://github.com/os-autoinst/isotest-ng/issues/5
    • Research how to add new hypervisors/bare metal to OpenTofu
    • Add an interface to openQA cli

    Goals

    • Implement at least one of the above, prepare proposals for GSoC
    • Boot a system via its BMC

    Resources

    See https://github.com/os-autoinst/isotest-ng


    Make more sense of openQA test results using AI by livdywan

    Description

    AI has the potential to help with something many of us spend a lot of time doing: making sense of openQA logs when a job fails.

    User Story

    Allison Average has a puzzled look on their face while staring at log files that seem to make little sense. Is this a known issue, something completely new or maybe related to infrastructure changes?

    Goals

    • Leverage a chat interface to help Allison
    • Create a model from scratch based on data from openQA
    • Proof of concept for automated analysis of openQA test results

    Bonus

    • Use AI to suggest solutions to merge conflicts
      • This would need a merge conflict editor that can suggest solving the conflict
    • Use image recognition for needles

    Resources

    Timeline

    Day 1

    • Conversing with open-webui to teach me how to create a model based on openQA test results

    Day 2

    Highlights

    • I briefly tested and compared models to see if they would make me more productive. Between llama, gemma, and mistral there was no striking difference in the results for my case.
    • Convincing the chat interface to produce code specific to my use case required very explicit instructions.
    • Asking for advice on how to use open-webui itself better was frustratingly unfruitful both in trivial and more advanced regards.
    • Documentation on the source materials used by LLMs, and on tools for this purpose, seems virtually non-existent - specifically on whether a logo can be generated based on particular licenses.

    Outcomes

    • Chat-interface-supported development provides good starting points, and open-webui, being open source, is more flexible than Gemini - although currently some fancy features such as grounding and generated podcasts are missing.
    • Allison still has to be very experienced with openQA to use a chat interface for test review. Publicly available system prompts would make that easier, though.


    Yearly Quality Engineering Ask me Anything - AMA for not-engineering by szarate

    Goal

    Get a closer look at how developers work on the Engineering team (R & D) of SUSE, and close the collaboration gap between GSI and Engineering

    Why?

    Santiago can go over different development workflows and do a deep dive into how Quality Engineering works (think of my QE Team, the advocates for your customers). The idea of this session is to help open the doors to opportunities for collaboration and broaden our understanding of SUSE as a whole.

    Objectives

    • Give $audience a small window into how some things in engineering are done, with some questions answered either on the spot or within days
    • Give Santiago Zarate from Quality Engineering a look into how $audience sees the engineering departments, and find out possibilities for further collaboration

    How?

    By running an "Ask me Anything" session - an open Q & A format in which participants ask the host questions.

    How to make it happen?

    I'm happy to join a call, or we can do it async (online/in person is more fun). Ping me over email/Slack and let's make the magic happen! It doesn't need to be during hackweek, but we gotta kickstart the idea during hackweek ;)

    Rules

    The rules are simple: the more questions, the more fun it will be. While this will only be a window into engineering, it can also help all of us get to a similar level of understanding of the processes behind our respective areas of the organization.

    Dynamics

    The host will monitor the questions on some pre-agreed page and try to answer them to the best of their knowledge. If a question is too difficult or the host doesn't have the answer, he will do his best to provide one at a later date.

    Attendees are encouraged to add questions beforehand; in case there aren't any, we will look at how Quality Engineering tests new products or performs regression tests.

    Agenda

    • Introduction of Santiago Zarate, Product Owner of Quality Engineering Core team
    • Introduction of the Group/Team/Persons interested
    • Ice breaker
    • AMA time! Add your questions $PAGE
    • Looking at QE workflows: how
      • a maintenance update is tested before being released to our customers
      • products in development are tested before being made generally available
    • Engineering Opportunity Board


    Automated Test Report reviewer by oscar-barrios

    Description

    In the SUMA/Uyuni team we spend a lot of time reviewing test reports: analyzing each failing test case, checking whether a test is flaky, checking logs, etc.

    Goals

    Speed up the review by automating some parts with AI, so that we can consume a summary of the report that is meaningful for the reviewer.

    Resources

    No idea about the resources yet, but we will make use of:

    • HTML/JSON Report (text + screenshots)
    • The Test Suite Status GitHub board (via API)
    • The environment tested (via SSH)
    • The test framework code (via files)


    Drag Race - comparative performance testing for pull requests by balanza

    Description

    «Sophia, a backend developer, submitted a pull request with optimizations for a critical database query. Once she pushed her code, an automated load test ran, comparing her query against the main branch. Moments later, she saw a new comment automatically added to her PR: the comparison results showed reduced execution time and improved efficiency. Smiling, Sophia messaged her team, “Performance gains confirmed!”»

    Goals

    • To have a convenient and ergonomic framework to describe test scenarios, including environment and seed;
    • To compare results from different tests;
    • To have a GitHub Action that executes such tests in a CI environment.

    Resources

    The MVP will be built on top of Preevy and K6.


    Ansible for add-on management by lmanfredi

    Description

    Machines can contain various combinations of add-ons and are often modified over time.

    The list of repos can change, so I would like to create an automation able to reset the status to a given state, based on metadata available for these machines.

    Goals

    Create an Ansible automation able to take care of add-on (repo list) configuration using metadata as reference

    Resources

    Results

    Created WIP project Ansible-add-on-openSUSE


    SUSE AI Meets the Game Board by moio

    Use tabletopgames.ai’s open source TAG and PyTAG frameworks to apply Statistical Forward Planning and Deep Reinforcement Learning to two board games of our own design. On an all-green, all-open source, all-AWS stack!
    A chameleon playing chess in a train car, as a metaphor of SUSE AI applied to games


    Results: Infrastructure Achievements

    We successfully built and automated a containerized stack to support our AI experiments. This included:

    A screenshot of k9s and nvtop showing PyTAG running in Kubernetes with GPU acceleration

    ./deploy.sh and voilà - Kubernetes running PyTAG (k9s, above) with GPU acceleration (nvtop, below)

    Results: Game Design Insights

    Our project focused on modeling and analyzing two card games of our own design within the TAG framework:

    • Game Modeling: We implemented models for Dario's "Bamboo" and Silvio's "Totoro" and "R3" games, enabling AI agents to play thousands of games ...in minutes!
    • AI-driven optimization: By analyzing statistical data on moves, strategies, and outcomes, we iteratively tweaked the game mechanics and rules to achieve better balance and player engagement.
    • Advanced analytics: Leveraging AI agents with Monte Carlo Tree Search (MCTS) and random action selection, we compared performance metrics to identify optimal strategies and uncover opportunities for game refinement.

    Cards from the three games

    A family picture of our card games in progress. From the top: Bamboo, Totoro, R3

    Results: Learning, Collaboration, and Innovation

    Beyond technical accomplishments, the project showcased innovative approaches to coding, learning, and teamwork:

    • "Trio programming" with AI assistance: Our "trio programming" approach—two developers and GitHub Copilot—was a standout success, especially in handling slightly-repetitive but not-quite-exactly-copypaste tasks. Java as a language tends to be verbose, and we found this approach to fit it particularly well.
    • AI tools for reporting and documentation: We extensively used AI chatbots to streamline writing and reporting. (Including writing this report! ...but this note was added manually during edit!)
    • GPU compute expertise: Overcoming challenges with CUDA drivers and cloud infrastructure deepened our understanding of GPU-accelerated workloads in the open-source ecosystem.
    • Game design as a learning platform: By blending AI techniques with creative game design, we learned not only about AI strategies but also about making games fun, engaging, and balanced.

    Last but not least we had a lot of fun! ...and this was definitely not a chatbot generated line!

    The Context: AI + Board Games


    Symbol Relations by hli

    Description

    There are tools to build function call graphs based on parsing source code, for example, cscope.

    This project aims to achieve a similar goal by directly parsing the disassembly (i.e., objdump output) of a compiled binary. The assembly code is what the CPU sees, and is therefore more "direct". This may be useful in certain scenarios, such as gdb/crash debugging.
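
    The core idea - extracting caller/callee edges from objdump output - can be illustrated roughly like this (a simplified sketch, not the actual symrellib.py code):

        # A simplified sketch of the idea: pull caller -> callee edges
        # out of "objdump -d" output by matching call sites.
        import re, subprocess, sys
        from collections import defaultdict

        FUNC = re.compile(r'^[0-9a-f]+ <([^>]+)>:$')             # "0000000000401136 <main>:"
        CALL = re.compile(r'\bcall[ql]?\s+[0-9a-f]+ <([^>+]+)')  # "callq  401100 <foo>"

        def call_graph(binary):
            asm = subprocess.run(["objdump", "-d", binary],
                                 capture_output=True, text=True, check=True).stdout
            graph, current = defaultdict(set), None
            for line in asm.splitlines():
                m = FUNC.match(line)
                if m:
                    current = m.group(1)
                    continue
                c = CALL.search(line)
                if current and c:
                    graph[current].add(c.group(1))
            return graph

        if __name__ == "__main__":
            for caller, callees in sorted(call_graph(sys.argv[1]).items()):
                print(caller, "->", ", ".join(sorted(callees)))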

    Detailed description and Demos can be found in the README file:

    Supports x86 for now (because my customers only use x86 machines), but support for other architectures can be added easily.

    Tested with python3.6

    Goals

    Any comments are welcome.

    Resources

    https://github.com/lhb-cafe/SymbolRelations

    symrellib.py: implements the symbol relation graph and the disassembly parser

    symrel_tracer*.py: implements tracing (-t option)

    symrel.py: "cli parser"


    Run local LLMs with Ollama and explore possible integrations with Uyuni by PSuarezHernandez

    Description

    Using Ollama you can easily run different LLM models on your local computer. This project is about exploring Ollama, testing different LLMs, and trying to fine-tune them. Also, exploring potential ways of integration with Uyuni.

    Goals

    • Explore Ollama
    • Test different models
    • Fine tuning
    • Explore possible integration in Uyuni

    Resources

    • https://ollama.com/
    • https://huggingface.co/
    • https://apeatling.com/articles/part-2-building-your-training-data-for-fine-tuning/


    Saline (state deployment control and monitoring tool for SUSE Manager/Uyuni) by vizhestkov

    Project Description

    Saline is an addition to Salt used in SUSE Manager/Uyuni, aimed at providing better control and visibility for state deployment in large-scale environments.

    In its current state, the published version can be used only as a Prometheus exporter and is missing some of the key features implemented in the PoC (not published). It can provide metrics related to Salt events and the state apply process on the minions, but there is no control over this process implemented yet.

    Continue with implementation of the missing features and improve the existing implementation:

    • authentication (need to decide how, or whether, it should be related to Salt auth)

    • web service providing the control of states deployment

    Goal for this Hackweek

    • Implement missing key features

    • Implement the tool for state deployment control with CLI

    Resources

    https://github.com/openSUSE/saline