Description
The pacemaker3 / corosync3 stack has landed in openSUSE Tumbleweed, and the new underlying protocol, kronosnet, has become a fundamental piece of it.
This exercise tries to expand a pacemaker3 cluster toward 100+ nodes and to find the limitations and best practices for doing so.
Resources
crmsh.git/test/run-functional-tests -h
This project is part of:
Hack Week 24
Comments
about 1 month ago by zzhou
Summary
At this stage, it is challenging to start a cluster of more than 20 nodes. Even joining a single additional node can be challenging, whether the cluster is cold-started gradually or grown online.
Potential further research directions:
- tune the corosync timeouts and the knet configuration for larger cluster sizes
- use the knet transport protocol sctp instead of udp to avoid retransmit storms?
Misc.: improve the tooling to grow nodes gradually: https://github.com/ClusterLabs/crmsh/pull/1235
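For the timeout-tuning direction, it may help to see how the token timeout grows with cluster size. The sketch below follows the formula documented in corosync.conf(5), where the runtime token timeout is `token + (number_of_nodes - 2) * token_coefficient` once the nodelist has at least 3 nodes. The default values used here (3000 ms and 650 ms) are assumptions and should be checked against the installed corosync version and the actual `corosync.conf`; the function name is hypothetical.

```python
# Sketch: runtime totem token timeout as a function of node count.
# Formula per corosync.conf(5); default values below are assumptions.

def effective_token_ms(nodes: int,
                       token_ms: int = 3000,
                       token_coefficient_ms: int = 650) -> int:
    """Return the runtime token timeout in milliseconds."""
    if nodes < 3:
        # The coefficient only applies to a nodelist with at least 3 nodes.
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


if __name__ == "__main__":
    for n in (2, 20, 32, 100):
        print(f"{n:>3} nodes -> {effective_token_ms(n)} ms")
```

With these assumed defaults, a 100-node cluster would already wait more than a minute before declaring a token loss, which is one reason the timeouts probably need explicit tuning at this scale.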
Observations
Environment:
- VM, CPU=8, RAM=16G
- CPU: half of the host's CPUs
- MEM: each node container consumes 300M
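As a rough sanity check on these numbers, the memory figures alone put a ceiling on how many node containers fit in the VM. The sketch below assumes that 16G means 16 GiB, that 300M means 300 MiB per container, and that all containers run inside this one VM. The `reserved_mb` parameter is a hypothetical allowance for the VM's own OS, not a measured value.

```python
# Sketch: upper bound on node containers from memory alone.
# Units and the reserved_mb allowance are assumptions, not measurements.

def max_nodes_by_memory(total_mb: int, per_node_mb: int, reserved_mb: int = 0) -> int:
    """Return how many containers of per_node_mb fit into total_mb."""
    usable_mb = total_mb - reserved_mb
    return max(0, usable_mb // per_node_mb)


if __name__ == "__main__":
    print(max_nodes_by_memory(16 * 1024, 300))                    # no OS reserve
    print(max_nodes_by_memory(16 * 1024, 300, reserved_mb=2048))  # 2 GiB reserve
```

Under these assumptions, a single 16G VM cannot hold 100+ nodes by memory alone, so reaching that target would need more RAM or several VMs.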
Challenges:
- difficult to set up a 20-node cluster all at once
- difficult to add 4 nodes all at once to a 32-node cluster (or even a 20-node cluster)
- CPU overloaded
- DC election can't finish
- Token timeout
- corosync membership
- [TOTEM ] Token has not been received in 41025 ms
- Repeat: Completed service synchronization, ready to provide service.
- Network
- corosync-cfgtool -R might mess up the KNET layer for a while
- hanode1 kernel: neighbour: arp_cache: neighbor table overflow!
- [KNET ] loopback: send local failed. error=Resource temporarily unavailable
- kernel: podman2: port 19(veth38) entered blocking state
- Memory can be consumed quickly. Memory leak?
- CPU overloaded
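To quantify the token timeouts listed above across many nodes, the delays can be pulled out of the corosync logs. This is a minimal sketch that matches only the `[TOTEM ] Token has not been received in N ms` message shown above. The function name and the sample lines are illustrative, and real log lines may carry extra prefixes such as timestamps and host names, which the search tolerates.

```python
import re

# Matches the TOTEM warning quoted in the observations; the padding inside
# the brackets is allowed to vary.
TOKEN_RE = re.compile(r"\[TOTEM\s*\] Token has not been received in (\d+) ms")


def token_delays_ms(lines):
    """Yield the reported delay in ms from each matching log line."""
    for line in lines:
        match = TOKEN_RE.search(line)
        if match:
            yield int(match.group(1))


if __name__ == "__main__":
    sample = [
        "[TOTEM ] Token has not been received in 41025 ms",
        "[KNET  ] loopback: send local failed. error=Resource temporarily unavailable",
    ]
    print(list(token_delays_ms(sample)))
```

Collecting these values per node while growing the cluster would show whether the delays scale with node count or spike only during joins.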