Orcas are amazing animals. They are playful, intelligent, great swimmers, and very social. They also love to play with their food, hunting down their prey with advanced strategies - understanding where the prey hides, how it will try to escape, and how to overcome those tactics - and having a lot of fun doing so, before relentlessly tearing it apart, killing it, and eating it. Not necessarily in that order. Oh, and they have the right color scheme.
This forces their prey to also improve and adopt more advanced strategies and tactics. In this arms race, both sides evolve and improve: the evolutionary pressure has made cephalopods highly intelligent, adaptable, and resilient. Unfortunately (for them), they are still very tasty. So we should exert more evolutionary pressure on individuals to help them stay alive as a species.
The most prominent example of this is Netflix's Chaos Monkey. However, that is very heavily focused on Amazon cloud services. The Ceph project also has Teuthology; but that's mainly checking whether Ceph remembers all the tricks it has been taught. And CBT, which measures how fast it can swim while static. CeTune helps it swim faster. All are needed and provide valuable insights, but they are too tame; Ceph is not afraid enough of them.
A large distributed Ceph cluster will always be "in transition"; something fails, it's being rebalanced, nodes are being added, removed, ... all the while the clients are expecting it to deliver service.
We need a stress test harness for Ceph that one can point at an existing Ceph cluster, and that will understand the failure domains (OSD trees, nodes, NIC connections, ...) and inject faults until it eventually breaks. (All the while measuring the performance to see if the cluster is still within its SLAs.)
You could think of this as a form of black-/gray-box testing at the system level. We don't really need to know a lot about Ceph's internals; we only know the high level architecture so we can group the components into failure domains and see how many errors we should be able to inject without failure. And once we heal the error, watch while - or rather, if - Ceph properly recovers.
Customers also don't care if it's Ceph crashing and not recovering, or if the specific workload has triggered a bug in some other part of the kernel. Thus, we need to holistically test at the system level.
Goals:
- Make Ceph more robust in the face of faults
- Improve Ceph recovery
- Increase customer confidence in their deployed clusters
- Improve supportability of production clusters by forcing developers to look into failure scenarios more frequently
Possible errors to inject:
- killing daemons
- SIGSTOP (simulates hangs)
- inducing kernel panics
- network outages on the front-end or back-end
- invoking random network latency and bottlenecks
- out of memory errors
- CPU overload
- corrupting data on disk
- full cluster outage and reboot (think power outage)
- ...
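As a rough illustration of the first two fault types, here is a minimal sketch of signal-based inducers. It assumes the harness runs on the same POSIX host as the target daemon and already knows its PID; the function names `hang`, `resume`, `kill_daemon`, and `demo` are made up for this example and are not part of any existing tool. The demo uses a throwaway Python child process as a stand-in for a daemon.

```python
import os
import signal
import subprocess
import sys


def hang(pid):
    """Simulate a hung daemon: SIGSTOP freezes the process without killing it."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid):
    """Heal a simulated hang: SIGCONT lets the stopped process run again."""
    os.kill(pid, signal.SIGCONT)


def kill_daemon(pid):
    """Simulate a crash: SIGKILL cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGKILL)


def demo():
    """Exercise the inducers against a harmless stand-in process."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    hang(child.pid)         # the "daemon" stops responding
    resume(child.pid)       # fault healed
    kill_daemon(child.pid)  # second fault: hard crash
    # On POSIX, subprocess reports death by signal as a negative signal number.
    return child.wait(timeout=10)
```

Faults such as kernel panics, network outages, or disk corruption would need other mechanisms and are not covered by this sketch.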
There are several states of the cluster to trigger:
baseline ("sunny weather"): establish a performance baseline while everything actually works. (While this is never really the case in production, it is the target that performance under adverse conditions is measured against.)
"lightly" degraded - the system must be able to cope with a single fault in one of its failure domains, all the while providing service within the high-end range of its SLAs. Also, if this error is healed, the system should fully recover.
"heavily" degraded - the system should be able to cope with a single fault in several of its failure domains, all the while providing services within its SLAs. Also, if this error is healed, the system should fully recover. (This is harder than the previous case due to unexpected interdependencies.)
"crashed": if the faults in any of its failure domains exceed the available redundancy, it would be expected that the system indeed stops providing service. However, it must do so cleanly. And for many of these scenarios, it would still be expected that the system is capable of automatically recovery once the faults have healed.
"byzantine" faults: if the faults injected have corrupted more than a certain threshold of the persistently stored data, the data can be considered lost beyond hope. (Think split brain, etc.) For faults that are within the design spec, this state should never occur, even if the system had crashed; it must refuse service before reaching this state. Dependable systems also must fail gracefully ("safely") and detect this state ("scrub") and refuse service as appropriate.
While this can be run in a lab, it should actually be possible to run Orca against a production cluster as part of its on-going evaluation or pre-production certification. It may even be possible to run Teuthology while Orca is running(?), one of these days.
Basic loop:
- Discover topology (may be manually configured in the beginning)
- Start load generator
- Audit cluster health
- Induce a new fault
- Watch cluster state
- Heal faults (possibly, unless we want to next induce one in a different failure domain)
- Watch whether it heals as expected
- Repeat ;-)
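The basic loop could be sketched roughly as follows. Everything here is hypothetical: `FakeCluster` is an in-memory stand-in so the example runs without a real cluster, and its method names, the topology it returns, and the `"ok"`/`"degraded"` health values are placeholders rather than anything Ceph provides. This simplified version always heals a fault before inducing the next one.

```python
import random


class FakeCluster:
    """In-memory stand-in for a real cluster (illustration only)."""

    def __init__(self):
        self.active = set()
        self.load_running = False

    def discover_topology(self):
        # A real harness would query the cluster or read a manual config.
        return {"osd": ["osd.0", "osd.1"], "node": ["node1"], "network": ["backend"]}

    def start_load(self):
        self.load_running = True

    def health(self):
        return "degraded" if self.active else "ok"

    def induce(self, domain, fault):
        self.active.add((domain, fault))

    def heal(self, domain, fault):
        self.active.discard((domain, fault))


def run_loop(cluster, faults, rounds, seed):
    """Run the discover / load / audit / induce / watch / heal / watch cycle."""
    rng = random.Random(seed)            # seeded so runs are repeatable
    topology = cluster.discover_topology()
    cluster.start_load()
    log = []
    for _ in range(rounds):
        log.append(("audit", cluster.health()))
        domain = rng.choice(sorted(topology))
        fault = rng.choice(faults)
        cluster.induce(domain, fault)
        log.append(("induced", domain, fault, cluster.health()))
        cluster.heal(domain, fault)
        log.append(("healed", domain, fault, cluster.health()))
    return log
```

Calling `run_loop(FakeCluster(), ["kill_daemon", "sigstop"], rounds=3, seed=42)` returns a log with three entries per round, which a real harness would compare against the expected cluster state.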
- Runs should be repeatable if provided with the same (random) seed and list of allowed tests.
- It must also be possible to specify a list of tests and timing explicitly.
- Configure list of tests/blacklists of tests for specific environments
- Fault inducers configurable
- Audits configurable
- Maximum number of faults per failure domain and in total to be configurable, of course
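One possible way to meet the repeatability and configurability requirements is to derive the whole fault plan from a seed before the run starts. The sketch below is only an illustration; `plan_faults` and its parameter names are invented here, and a real implementation would also need to handle explicit test lists and timing.

```python
import random
from collections import Counter


def plan_faults(seed, domains, tests, blacklist=(), max_per_domain=1, max_total=3):
    """Build a repeatable list of (domain, test) pairs within the given limits."""
    rng = random.Random(seed)  # same seed and inputs give the same plan
    blocked = set(blacklist)
    allowed = [t for t in tests if t not in blocked]
    per_domain = Counter()
    plan = []
    while allowed and len(plan) < max_total:
        # Only domains that have not reached their fault limit are eligible.
        open_domains = [d for d in domains if per_domain[d] < max_per_domain]
        if not open_domains:
            break
        domain = rng.choice(open_domains)
        plan.append((domain, rng.choice(allowed)))
        per_domain[domain] += 1
    return plan
```

With three domains, `max_per_domain=1`, and `max_total=5`, the plan stops at three entries because every domain has reached its limit before the total is exhausted.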
Can this be done within Teuthology?
Can this leverage any of the Pacemaker CTS work?
Flag and abort the run if the state the cluster is in is worse than we anticipated, e.g., if we think we induced a lightly degraded cluster, but service actually went down, or if we healed all faults and triggered a restart, and the system does not recover within a reasonable timeout.
We need to minimize false positives, otherwise it'll require just as much overhead to sort through as Teuthology.
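The abort rule might look something like the sketch below. The state names, the `RunAborted` exception, and the `probe` callable (which would return the currently observed state) are assumptions for illustration; how the observed state is actually determined is the hard part and is left open here.

```python
import time

# Severity order of the cluster states, best to worst.
SEVERITY = {
    "baseline": 0,
    "lightly degraded": 1,
    "heavily degraded": 2,
    "crashed": 3,
    "byzantine": 4,
}


class RunAborted(Exception):
    """Raised when the cluster is in a worse state than the run anticipated."""


def check_state(expected, observed):
    """Abort if the observed state is worse than the one we meant to induce."""
    if SEVERITY[observed] > SEVERITY[expected]:
        raise RunAborted(f"expected {expected!r}, but cluster is {observed!r}")


def wait_for_recovery(probe, timeout, interval=1.0, sleep=time.sleep):
    """Poll probe() until it reports 'baseline'; abort once timeout has passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe() == "baseline":
            return
        if time.monotonic() >= deadline:
            raise RunAborted(f"no recovery within {timeout} seconds")
        sleep(interval)
```

A state that is better than expected is deliberately not flagged here, which is one small way to keep false positives down.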
First step is to design the requirements a bit better and then decide where to implement this. I don't want to randomly start a new project, but also not shoehorn it into an existing project if it's not a good fit.
- Trello board for requirements/use cases? taiga.io project? ;-)
I think that's about enough for a quick draft ;-)
This project is part of:
Hack Week 14