Orcas are amazing animals. They are playful, intelligent, great swimmers, and very social. They also love to play with their food, hunting down their prey with advanced strategies - understanding where its prey hides, how it will try to escape, and how to overcome those tactics - and having a lot of fun doing so, before relentlessly tearing it apart, killing it, and eat it. Not necessarily in that order. Oh, and they have the right color scheme.

This forces their prey to also improve and adapt more advanced strategies and tactics. In this arms race, both sides evolve and improve: the evolutionary pressure has made cephalopods highly intelligent, adaptable, and resilient. Unfortunately (for them), they are still very tasty. So we should exert more evolutionary pressure on individuals to help them stay alive as a species.

The most promiment example of this is Netflix's chaos monkey. However, that is very heavily focused on Amazon cloud services. The Ceph project also has Teuthology; but that's mainly checking whether Ceph remembers all the tricks it has been taught. And CBT, which measures how fast it can swim while static. CeTune helps it swim faster. All are needed and provide valuable insights, but too tame; Ceph is not afraid enough of them.

A large distributed Ceph cluster will always be "in transition"; something fails, it's being rebalanced, nodes are being added, removed, ... all the while the clients are expecting it to deliver service.

We need a stress test harness for Ceph that one can point at an existing Ceph cluster, and that will understand the failure domains (OSD trees, nodes, NIC connections, ...) and inject faults until it eventually breaks. (All the while measuring the performance to see if the cluster is still within it's SLAs.)

You could think of this as a form of black-/gray-box testing at the system level. We don't really need to know a lot about Ceph's internals; we only know the high level architecture so we can group the components into failure domains and see how many errors we should be able to inject without failure. And once we heal the error, watch while - or rather, if - Ceph properly recovers.

Customers also don't care if it's Ceph crashing and not recovering, or if the specific workload has triggered a bug in some other part of the kernel. Thus, we need to holistically test at the system level.

Goals: - Make Ceph more robust in the face of faults; - Improve Ceph recovery; - Increase customer confidence in their deployed clusters; - Improve supportability of production clusters by forcing developers to look into failure scenarios more frequently.

Possible errors to inject: - killing daemons, - SIGSTOP (simulates hangs), - inducing kernel panics, - network outages on the front-end or back-end, - invoking random network latency and bottlenecks, - out of memory errors, - CPU overload, - corrupting data on disk, - Full cluster outage and reboot (think power outage), - ...

There are several states of the cluster to trigger:

baseline ("sunny weather"): establish a performance baseline while everything actually works. (While this is never really the case in production, it is the goal of performance under adverse conditions.)
"lightly" degraded - the system must be able to cope with a single fault in one of its failure domains, all the while providing service within the high-end range of its SLAs. Also, if this error is healed, the system should fully recover.
"heavily" degraded - the system should be able to cope with a single fault in several of its failure domains, all the while providing services within its SLAs. Also, if this error is healed, the system should fully recover. (This is harder than the previous case due to unexpected interdependencies.)
"crashed": if the faults in any of its failure domains exceed the available redundancy, it would be expected that the system indeed stops providing service. However, it must do so cleanly. And for many of these scenarios, it would still be expected that the system is capable of automatically recovery once the faults have healed.
"byzantine" faults: if the faults injected have corrupted more than a certain threshold of the persistently stored data, the data can be considered lost beyond hope. (Think split brain, etc.) For faults that are within the design spec, this state should never occur, even if the system had crashed; it must refuse service before reaching this state. Dependable systems also must fail gracefully ("safely") and detect this state ("scrub") and refuse service as appropriate.

While this can be run in a lab, it should actually be possible to run Orca against a production cluster as part of its on-going evaluation or pre-production certification. It may even be possible to run Teuthology while Orca is running(?), one of these days.

Basic loop: - discover topology (may be manually configured in the beginning) - Start load generator - Audit cluster health - Induce a new fault - Watch cluster state - Heal faults (possibly, unless we want to next induce one in a different failure domain) - Watch whether it heals as expected - Repeat ;-)

Runs should be repeatable if provided with the same (random) seed and list of allowed tests.
It must also be possible to specify a list of tests and timing explicitly.
Configure list of tests/blacklists of tests for specific environments
Fault inducers configurable
Audits configurable
Number of max faults per failure domain and in total to be configurable, of course
Can this be done within Teuthology?
Can this leverage any of the Pacemaker CTS work?
Flag and abort the run if the state the cluster is in is worse than we anticipated. e.g., if we think we induced a lightly degraded cluster, but service actually went down. Or if we healed all faults and triggered a restart, and the system does not recover within a reasonable timeout.
We need to minimize false positives, otherwise it'll require just as much as overhead to sort through as Teuthology.

First step is to design the requirements a bit better and then decide where to implement this. I don't want to randomly start a new project, but also not shoehorn it into an existing project if it's not a good fit.

Trello board for requirements/use cases? taiga.io project? ;-)

I think that's about enough for a quick draft ;-)

Join this project Leave this project

Looking for hackers with the skills:

ceph testing qa paranoia distributedsystems python

This project is part of:

Hack Week 14

Activity

over 9 years ago: jcejka liked this project.

over 9 years ago: tdig liked this project.

over 9 years ago: locilka liked this project.

over 9 years ago: pgonin liked this project.

over 9 years ago: dwaas joined this project.

over 9 years ago: dwaas liked this project.

over 9 years ago: jluis joined this project.

over 9 years ago: LarsMB added keyword "python" to this project.

over 9 years ago: LarsMB added keyword "ceph" to this project.

over 9 years ago: LarsMB added keyword "testing" to this project.

over 9 years ago: LarsMB added keyword "qa" to this project.

over 9 years ago: LarsMB added keyword "paranoia" to this project.

over 9 years ago: LarsMB added keyword "distributedsystems" to this project.

over 9 years ago: LenzGr liked this project.

over 9 years ago: jfajerski liked this project.

over 9 years ago: dmdiss liked this project.

over 9 years ago: abhishekl joined this project.

over 9 years ago: jfajerski joined this project.

over 9 years ago: LarsMB started this project.

over 9 years ago: LarsMB originated this project.

Comments

over 9 years ago by jluis | Reply

so, are we meeting up to discuss how to approach this at some point? Also, I'd happily vote for a taiga.io project - at least to try it out with a proper project, see how it feels and what not ;)

over 9 years ago by jfajerski | Reply

Hi Guys, I took the liberty of creating an Etherpad with mostly content from Lars' project text and a few thoughts that came to mind. Feel free to add: https://etherpad.nue.suse.com/p/ceph-testing-with-Orca

Similar Projects

testing

Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil

Join the Gitter channel! https://gitter.im/uyuni-project/hackweek

Uyuni is a configuration and infrastructure management tool that saves you time and headaches when you have to manage and update tens, hundreds or even thousands of machines. It also manages configuration, can run audits, build image containers, monitor and much more!

Currently there are a few distributions that are completely untested on Uyuni or SUSE Manager (AFAIK) or just not tested since a long time, and could be interesting knowing how hard would be working with them and, if possible, fix whatever is broken.

For newcomers, the easiest distributions are those based on DEB or RPM packages. Distributions with other package formats are doable, but will require adapting the Python and Java code to be able to sync and analyze such packages (and if salt does not support those packages, it will need changes as well). So if you want a distribution with other packages, make sure you are comfortable handling such changes.

No developer experience? No worries! We had non-developers contributors in the past, and we are ready to help as long as you are willing to learn. If you don't want to code at all, you can also help us preparing the documentation after someone else has the initial code ready, or you could also help with testing :-)

The idea is testing Salt (including bootstrapping with bootstrap script) and Salt-ssh clients

To consider that a distribution has basic support, we should cover at least (points 3-6 are to be tested for both salt minions and salt ssh minions):

Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file)
Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
Package management (install, remove, update...)
Patching
Applying any basic salt state (including a formula)
Salt remote commands
Bonus point: Java part for product identification, and monitoring enablement
Bonus point: sumaform enablement (https://github.com/uyuni-project/sumaform)
Bonus point: Documentation (https://github.com/uyuni-project/uyuni-docs)
Bonus point: testsuite enablement (https://github.com/uyuni-project/uyuni/tree/master/testsuite)

If something is breaking: we can try to fix it, but the main idea is research how supported it is right now. Beyond that it's up to each project member how much to hack :-)

If you don't have knowledge about some of the steps: ask the team
If you still don't know what to do: switch to another distribution and keep testing.

This card is for EVERYONE, not just developers. Seriously! We had people from other teams helping that were not developers, and added support for Debian and new SUSE Linux Enterprise and openSUSE Leap versions :-)

In progress/done for Hack Week 25

Guide

We started writin a Guide: Adding a new client GNU Linux distribution to Uyuni at https://github.com/uyuni-project/uyuni/wiki/Guide:-Adding-a-new-client-GNU-Linux-distribution-to-Uyuni, to make things easier for everyone, specially those not too familiar wht Uyuni or not technical.

openSUSE Leap 16.0

The distribution will all love!

https://en.opensuse.org/openSUSE:Roadmap#DRAFTScheduleforLeap16.0

Curent Status We started last year, it's complete now for Hack Week 25! :-D

[W] Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file) NOTE: Done, client tools for SLMicro6 are using as those for SLE16.0/openSUSE Leap 16.0 are not available yet
[W] Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
[W] Package management (install, remove, update...). Works, even reboot requirement detection

openQA tests needles elaboration using AI image recognition by mdati

Description

In the openQA test framework, to identify the status of a target SUT image, a screenshots of GUI or CLI-terminal images, the needles framework scans the many pictures in its repository, having associated a given set of tags (strings), selecting specific smaller parts of each available image. For the needles management actually we need to keep stored many screenshots, variants of GUI and CLI-terminal images, eachone accompanied by a dedicated set of data references (json).

A smarter framework, using image recognition based on AI or other image elaborations tools, nowadays widely available, could improve the matching process and hopefully reduce time and errors, during the images verification and detection process.

Goals

Main scope of this idea is to match a "graphical" image of the console or GUI status of a running openQA test, an image of a shell console or application-GUI screenshot, using less time and resources and with less errors in data preparation and use, than the actual openQA needles framework; that is:

having a given SUT (system under test) GUI or CLI-terminal screenshot, with a local distribution of pixels or text commands related to a running test status,
we want to identify a desired target, e.g. a screen image status or data/commands context,
- based on AI/ML-pretrained archives containing object or other proper elaboration tools,
- possibly able to identify also object not present in the archive, i.e. by means of AI/ML mechanisms.
the matching result should be then adapted to continue working in the openQA test, likewise and in place of the same result that would have been produced by the original openQA needles framework.
We expect an improvement of the matching-time(less time), reliability of the expected result(less error) and simplification of archive maintenance in adding/removing objects(smaller DB and less actions).

Hackweek POC:

Main steps

Phase 1 - Plan
- study the available tools
- prepare a plan for the process to build
Phase 2 - Implement
- write and build a draft application
Phase 3 - Data
- prepare the data archive from a subset of needles
- initialize/pre-train the base archive
- select a screenshot from the subset, removing/changing some part
Phase 4 - Test
- run the POC application
- expect the image type is identified in a good %.

Resources

First step of this project is quite identification of useful resources for the scope; some possibilities are:

SUSE AI and other ML tools (i.e. Tensorflow)
Tools able to manage images
RPA test tools (like i.e. Robot framework)
other.

Project references

Repository: openqa-needles-AI-driven

Multimachine on-prem test with opentofu, ansible and Robot Framework by apappas

Description

A long time ago I explored using the Robot Framework for testing. A big deficiency over our openQA setup is that bringing up and configuring the connection to a test machine is out of scope.

Nowadays we have a way¹ to deploy SUTs outside openqa, but we only use if for cloud tests in conjuction with openqa. Using knowledge gained from that project I am going to try to create a test scenario that replicates an openqa test but this time including the deployment and setup of the SUT.

Goals

Create a simple multimachine test scenario with the support server and SUT all created by the robot framework.

Resources

https://github.com/SUSE/qe-sap-deployment
terraform-libvirt-provider

python

Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil

Join the Gitter channel! https://gitter.im/uyuni-project/hackweek

The idea is testing Salt (including bootstrapping with bootstrap script) and Salt-ssh clients

To consider that a distribution has basic support, we should cover at least (points 3-6 are to be tested for both salt minions and salt ssh minions):

Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file)
Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
Package management (install, remove, update...)
Patching
Applying any basic salt state (including a formula)
Salt remote commands
Bonus point: Java part for product identification, and monitoring enablement
Bonus point: sumaform enablement (https://github.com/uyuni-project/sumaform)
Bonus point: Documentation (https://github.com/uyuni-project/uyuni-docs)
Bonus point: testsuite enablement (https://github.com/uyuni-project/uyuni/tree/master/testsuite)

If something is breaking: we can try to fix it, but the main idea is research how supported it is right now. Beyond that it's up to each project member how much to hack :-)

If you don't have knowledge about some of the steps: ask the team
If you still don't know what to do: switch to another distribution and keep testing.

In progress/done for Hack Week 25

Guide

openSUSE Leap 16.0

The distribution will all love!

https://en.opensuse.org/openSUSE:Roadmap#DRAFTScheduleforLeap16.0

Curent Status We started last year, it's complete now for Hack Week 25! :-D

[W] Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file) NOTE: Done, client tools for SLMicro6 are using as those for SLE16.0/openSUSE Leap 16.0 are not available yet
[W] Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
[W] Package management (install, remove, update...). Works, even reboot requirement detection

Improvements to osc (especially with regards to the Git workflow) by mcepl

Description

There is plenty of hacking on osc, where we could spent some fun time. I would like to see a solution for https://github.com/openSUSE/osc/issues/2006 (which is sufficiently non-serious, that it could be part of HackWeek project).

Improve chore and screen time doc generator script `wochenplaner` by gniebler

Description

I wrote a little Python script to generate PDF docs, which can be used to track daily chore completion and screen time usage for several people, with one page per person/week.

I named this script wochenplaner and have been using it for a few months now.

It needs some improvements and adjustments in how the screen time should be tracked and how chores are displayed.

Goals

Fix chore field separation lines
Change screen time tracking logic from "global" (week-long) to daily subtraction and weekly addition of remainders (more intuitive than current "weekly time budget method)
Add logic to fill in chore fields/lines, ideally with pictures, falling back to text.

Resources

tbd (Gitlab repo)

Bring to Cockpit + System Roles capabilities from YAST by miguelpc

Bring to Cockpit + System Roles features from YAST

Cockpit and System Roles have been added to SLES 16 There are several capabilities in YAST that are not yet present in Cockpit and System Roles We will follow the principle of "automate first, UI later" being System Roles the automation component and Cockpit the UI one.

Goals

The idea is to implement service configuration in System Roles and then add an UI to manage these in Cockpit. For some capabilities it will be required to have an specific Cockpit Module as they will interact with a reasource already configured.

Resources

A plan on capabilities missing and suggested implementation is available here: https://docs.google.com/spreadsheets/d/1ZhX-Ip9MKJNeKSYV3bSZG4Qc5giuY7XSV0U61Ecu9lo/edit

Linux System Roles:

https://linux-system-roles.github.io/
https://build.opensuse.org/package/show/openSUSE:Factory/ansible-linux-system-roles Package on sle16 ansible-linux-system-roles

First meeting Hackweek catchup

Monday, December 1 · 11:00 – 12:00
Time zone: Europe/Madrid
Google Meet link: https://meet.google.com/rrc-kqch-hca

Update M2Crypto by mcepl

There are couple of projects I work on, which need my attention and putting them to shape:

M2Crypto

Goal for this Hackweek

Put M2Crypto into better shape (most issues closed, all pull requests processed)
More fun to learn jujutsu
Play more with Gemini, how much it help (or not).
Perhaps, also (just slightly related), help to fix vis to work with LuaJIT, particularly to make vis-lspc working.

Looking for hackers with the skills:

This project is part of:

Activity

Comments

over 9 years ago by jluis | Reply

over 9 years ago by jfajerski | Reply

Similar Projects

testing

Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil

In progress/done for Hack Week 25

Guide

openSUSE Leap 16.0

openQA tests needles elaboration using AI image recognition by mdati

Description

Goals

Hackweek POC:

Resources

Project references

Multimachine on-prem test with opentofu, ansible and Robot Framework by apappas

Description

Goals

Resources

python

Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil

In progress/done for Hack Week 25

Guide

openSUSE Leap 16.0

Improvements to osc (especially with regards to the Git workflow) by mcepl

Description

Improve chore and screen time doc generator script `wochenplaner` by gniebler

Description

Goals

Resources

Bring to Cockpit + System Roles capabilities from YAST by miguelpc

Bring to Cockpit + System Roles features from YAST

Goals

Resources

Update M2Crypto by mcepl

Goal for this Hackweek