Description

The pacemaker3 / corosync3 stack has landed in openSUSE Tumbleweed. Its new underlying transport protocol, kronosnet (knet), is the fundamental piece of the stack.

This exercise tries to scale a pacemaker3 cluster toward 100+ nodes, find the limitations, and establish best practices for doing so.

Resources

crmsh.git/test/run-functional-tests -h

Looking for hackers with the skills:

ha, pacemakercluster, corosync

This project is part of:

Hack Week 24

Activity

  • Comments

    • zzhou, 11 days ago

      Summary

      At this stage, it is challenging to start a cluster with more than 20 nodes. Even joining one more node can be challenging, whether cold-starting the cluster gradually or growing the node count online.
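
      A minimal sketch of joining nodes one at a time instead of in bulk, using the standard crmsh bootstrap commands; the hanodeN names, the node range, and the settle delay are assumptions for illustration:

        # Join each node against the seed node, letting membership settle
        # between joins (hanode* names and the 30 s delay are assumptions)
        for i in $(seq 5 24); do
            ssh hanode$i "crm cluster join -c hanode1 -y"
            sleep 30
        done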

      Potential further research direction: tuning corosync timeouts and the knet configuration for larger cluster sizes.
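
      For example, the totem token timeout already scales with node count through token_coefficient; a sketch of the relevant corosync.conf knobs (the values are illustrative assumptions, not validated recommendations for 100+ nodes):

        totem {
            version: 2
            transport: knet
            # effective token = token + (nodes - 2) * token_coefficient
            token: 10000              # base token timeout in ms
            token_coefficient: 650    # per-node increment in ms (the default)
            consensus: 12000          # must be larger than the token timeout
            join: 1000                # membership join timeout in ms
        }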

      Misc.: improve the tooling to grow nodes gradually: https://github.com/ClusterLabs/crmsh/pull/1235

      Observations

      Environment:

      • VM, CPU=8, RAM=16G
      • CPU: growing 4 nodes all at once consumes half of the system's CPUs
      • RAM: one node container consumes about 300M (measured as sketched below)
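
      One way to take these per-container measurements, assuming the podman-based node containers that run-functional-tests deploys:

        # Snapshot CPU and memory usage of every node container
        podman stats --no-stream --format "{{.Name}} {{.CPUPerc}} {{.MemUsage}}"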

      Challenges:

      • difficult to set up a 20-node cluster all at once

      • difficult to grow by 4 nodes all at once on a 32-node cluster (or even a 20-node one)

        • CPU overloaded
          • DC election cannot finish
          • token timeout (see the diagnostics sketch after this list)
        • corosync membership
          • [TOTEM ] Token has not been received in 41025 ms
          • repeated log: Completed service synchronization, ready to provide service.
        • Network (see the sysctl sketch after this list)
          • corosync-cfgtool -R might disrupt the KNET layer for a while
          • hanode1 kernel: neighbour: arp_cache: neighbor table overflow!
          • [KNET ] loopback: send local failed. error=Resource temporarily unavailable
          • kernel: podman2: port 19(veth38) entered blocking state
        • Memory can be consumed quickly. A memory leak?
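
      A few read-only commands for checking the effective token timeout and the membership state while reproducing the symptoms above (all are standard corosync tools):

        # Effective token timeout after token_coefficient scaling
        corosync-cmapctl -g runtime.config.totem.token
        # Current quorum and membership state
        corosync-quorumtool -s
        # Per-link knet status for the local node
        corosync-cfgtool -s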
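
      The arp_cache overflow is the generic Linux neighbour-table limit rather than a corosync problem; raising the gc_thresh sysctls on the container host is the usual remedy (the values below are illustrative assumptions, not tested recommendations):

        # Enlarge the IPv4 neighbour (ARP) table on the host
        sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
        sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
        sysctl -w net.ipv4.neigh.default.gc_thresh3=16384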
