Description

The pacemaker3 / corosync3 stack has landed in openSUSE Tumbleweed. The new underlying transport protocol, kronosnet (knet), becomes the fundamental piece.

This exercise tries to expand a pacemaker3 cluster toward 100+ nodes, find the limitations, and collect best practices for doing so.

Resources

crmsh.git/test/run-functional-tests -h

Looking for hackers with the skills:

ha pacemakercluster corosync

This project is part of:

Hack Week 24

Activity

  • about 1 year ago: zzhou added keyword "ha" to this project.
  • about 1 year ago: zzhou added keyword "pacemakercluster" to this project.
  • about 1 year ago: zzhou added keyword "corosync" to this project.
  • about 1 year ago: zzhou originated this project.

  • Comments

    • zzhou
      about 1 year ago by zzhou

      Summary

      At this stage, it is challenging to start a cluster with more than 20 nodes. Even joining one more node can be challenging, whether the cluster is cold-started gradually or the nodes are grown online.

      Potential further research directions:

      • probably tune the corosync timeouts and the knet configuration for bigger cluster sizes
      • try the knet sctp transport instead of udp to avoid retransmit storms?
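      A starting point for that tuning could look like the fragment below. The values are illustrative guesses to experiment with, not validated recommendations for 100+ nodes; note that corosync already scales the effective token timeout by adding token_coefficient per node beyond two.

      ```
      # Illustrative corosync.conf fragment -- values are assumptions
      # to experiment with, not tested recommendations
      totem {
          version: 2
          transport: knet
          # base token timeout in ms; corosync adds token_coefficient
          # for every node beyond the second
          token: 10000
          token_coefficient: 650
          interface {
              linknumber: 0
              # research direction from above: sctp instead of udp
              knet_transport: sctp
          }
      }
      ```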

      Misc.: improve the tooling to grow nodes gradually: https://github.com/ClusterLabs/crmsh/pull/1235

      Observations

      Environment:

      • VM, CPU=8, RAM=16G
      • CPU: half of the host's CPU count
      • MEM: one node container consumes ~300M

      Challenges:

      • difficult to set up a 20-node cluster all at once

      • difficult to grow by 4 nodes all at once on a 32-node cluster (even on a 20-node cluster)

        • CPU overloaded
          • DC election can't finish
          • Token timeout
        • corosync membership
          • [TOTEM ] Token has not been received in 41025 ms
          • Repeat: Completed service synchronization, ready to provide service.
        • Network
          • corosync-cfgtool -R might mess up the KNET layer for a while
          • hanode1 kernel: neighbour: arp_cache: neighbor table overflow!
          • [KNET ] loopback: send local failed. error=Resource temporarily unavailable
          • kernel: podman2: port 19(veth38) entered blocking state
        • Mem can be consumed quickly. Memory leak?
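      Token-timeout symptoms like the one above are easy to miss in a busy journal. A minimal sketch that pulls the reported wait time out of corosync TOTEM warnings; in practice the input would come from `journalctl -u corosync`, here a sample line matching the observation above is used:

      ```shell
      # Extract the reported wait time (ms) from corosync TOTEM
      # token-timeout warnings; the sample line mirrors the log
      # message quoted in the observations.
      log='Jan 01 00:00:00 hanode1 corosync[1234]:  [TOTEM ] Token has not been received in 41025 ms'
      printf '%s\n' "$log" \
        | sed -n 's/.*Token has not been received in \([0-9][0-9]*\) ms.*/\1/p'
      ```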

    Similar Projects

    Uyuni read-only replica by cbosdonnat

    Description

    For now, there is no HA setup possible for Uyuni. The idea is to explore setting up a read-only shadow instance of an Uyuni server and make it as useful as possible.

    Possible things to look at:

    • live sync of the database, probably using the WAL. Some of the tables may have to be skipped or some features disabled on the RO instance (taskomatic, PXT sessions…)
    • Can we use a load balancer that routes read-only queries to either instance and read-write queries to the RW one? For example, packages or PXE data can be served by both instances, as can API GET requests. The rest would go to the RW instance.
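    One way to sketch the routing idea is an HAProxy frontend that sends GET requests to a read-only backend and everything else to the read-write instance. Backend names, hosts, and ports below are assumptions, not Uyuni defaults, and TLS termination details are omitted:

    ```
    # Hypothetical HAProxy fragment: GETs may be served by either
    # instance, everything else goes to the read-write Uyuni.
    frontend uyuni_in
        bind *:443
        mode http
        acl is_read method GET
        use_backend uyuni_ro if is_read
        default_backend uyuni_rw

    backend uyuni_ro
        mode http
        server primary uyuni-rw.example.com:443 check
        server replica uyuni-ro.example.com:443 check

    backend uyuni_rw
        mode http
        server primary uyuni-rw.example.com:443 check
    ```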

    Goals

    • Prepare a document explaining how to do it.
    • PR with the needed code changes to support it


    Hacking a SUSE MLS 7.9 Cluster by roseswe

    Description

    SUSE MLS (Multi-Linux Support) - A subscription where SUSE provides technical support and updates for Red Hat Enterprise Linux (RHEL) and CentOS servers

    The most significant operational difference between SUSE MLS 7 and the standard SUSE Linux Enterprise Server High Availability Extension (SLES HAE) lies in the administrative toolchain. While both distributions rely on the same underlying Pacemaker resource manager and Corosync messaging layer, MLS 7 preserves the native Red Hat Enterprise Linux 7 user space. Consequently, MLS 7 administrators must utilize the Pacemaker Configuration System (pcs), a monolithic and imperative tool. The pcs utility abstracts the entire stack, controlling Corosync networking, cluster bootstrapping, and resource management through single-line commands that automatically generate the necessary configuration files.

    In contrast, SLES HAE employs the Cluster Resource Management Shell (crmsh). The crm utility operates as a declarative shell that focuses primarily on the Cluster Information Base (CIB). Unlike the command-driven nature of pcs, crmsh allows administrators to enter a configuration context to define the desired state of the cluster using syntax that maps closely to the underlying XML structure. This makes SLES HAE more flexible for complex edits but requires a different syntax knowledge base compared to the rigid, command-execution workflow of MLS 7.
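    The difference is easiest to see with the same task in both tools, e.g. creating a virtual-IP resource. The IP address and parameter values are placeholders:

    ```
    # pcs (MLS 7 / RHEL 7): one imperative command
    pcs resource create vip ocf:heartbeat:IPaddr2 \
        ip=192.0.2.10 cidr_netmask=24 op monitor interval=30s

    # crmsh (SLES HAE): declarative primitive definition
    crm configure primitive vip ocf:heartbeat:IPaddr2 \
        params ip=192.0.2.10 cidr_netmask=24 \
        op monitor interval=30s
    ```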

    Scope is here MLS 7.9

    Goals

    • Get more familiar with MLS7.9 HA toolchain and Graphical User Interface and Daemons
    • Create a two node MLS cluster with SBD
    • Check different use cases
    • Create a "SUSE Best Practices" presentation slide set suitable for Consulting Customers
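    For the two-node goal, the pcs bootstrap sequence on a RHEL 7 user space roughly looks like the sketch below. Node names and the SBD device path are placeholders, and the exact sbd options should be verified against the pcs(8) and sbd(8) man pages on MLS 7.9:

    ```
    # RHEL 7 pcs 0.9.x syntax; node names and device are placeholders
    pcs cluster auth node1 node2
    pcs cluster setup --name mlscluster node1 node2 --start --enable
    # enable SBD fencing with a shared disk and watchdog (verify the
    # option spelling against pcs(8) on MLS 7.9)
    pcs stonith sbd enable device=/dev/disk/by-id/shared-sbd-disk \
        --watchdog=/dev/watchdog
    ```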

    Resources

    • You need MLS7.9 (Qcow2) installed + subscription
    • KVM server with 2 KVMs, 2 SBD
    • RHEL7 and HA skills


    SUSE Health Check Tools by roseswe

    SUSE HC Tools Overview

    A collection of tools written in Bash or Go 1.24+ to make it easier to handle the bunch of tar.xz archives created by supportconfig.

    Background: for SUSE HC we receive a bunch of supportconfig tarballs and check them for misconfigurations, areas for improvement, or future changes.

    The main focus of these HCs is High Availability (pacemaker), SLES itself, and SAP workloads, especially around the SUSE best practices.

    Goals

    • Overall improvement of the tools
    • Adding new collectors
    • Add support for SLES16

    Resources

    ```
    csv2xls* example.sh go.mod listprodids.txt sumtext* trails.go README.md
    csv2xls.go exceltest.go go.sum m.sh* sumtext.go vercheck.py* config.ini
    csvfiles/ getrpm* listprodids* rpmdate.sh* sumxls* verdriver* credtest.go
    example.py getrpm.go listprodids.go sccfixer.sh* sumxls.go verdriver.go

    docollall.sh* extracthtml.go gethostnamectl* go.sum numastat.go cpuvul*
    extractcluster.go firmwarebug* gethostnamectl.go m.sh* numastattest.go
    cpuvul.go extracthtml* firmwarebug.go go.mod numastat* xtr_cib.sh*
    ```

    ```
    $ getrpm -r pacemaker
    Product ID: 2795 (SUSE Linux Enterprise Server for SAP Applications 15 SP7 x86_64), RPM Name:
    +--------------+----------------------------+--------+--------------+--------------------+
    | Package Name | Version                    | Arch   | Release      | Repository         |
    +--------------+----------------------------+--------+--------------+--------------------+
    | pacemaker    | 2.1.10+20250718.fdf796ebc8 | x86_64 | 150700.3.3.1 | sle-ha/15.7/x86_64 |
    | pacemaker    | 2.1.9+20250410.471584e6a2  | x86_64 | 150700.1.9   | sle-ha/15.7/x86_64 |
    +--------------+----------------------------+--------+--------------+--------------------+
    Total packages found: 2
    ```

