SUSE Hack Week: Expand the pacemaker/corosync3 cluster toward 100+ nodes

Description

The pacemaker3 / corosync3 stack has landed in openSUSE Tumbleweed, and the new underlying protocol, kronosnet (knet), has become a fundamental piece of it.

This exercise tries to expand a pacemaker3 cluster toward 100+ nodes and to find the limitations and the best practices for doing so.

Resources

crmsh.git/test/run-functional-tests -h

No Hackers yet

Looking for hackers with the skills:

ha pacemakercluster corosync

This project is part of:

Hack Week 24

Activity

about 1 year ago: zzhou added keyword "ha" to this project.

about 1 year ago: zzhou added keyword "pacemakercluster" to this project.

about 1 year ago: zzhou added keyword "corosync" to this project.

about 1 year ago: zzhou originated this project.

Comments

about 1 year ago by zzhou

Summary

At this stage, it is challenging to start a cluster with more than 20 nodes. It can be challenging to join even one more node, whether the cluster is cold-started gradually or nodes are added online.

Potential further research directions:
- tune the corosync timeouts and the knet configuration for the bigger cluster size
- use sctp instead of udp as the knet protocol to avoid retransmit storms?
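
As a rough illustration of why the corosync timeouts matter at this scale: corosync.conf(5) describes the runtime token timeout as growing linearly with the node count, token + (nodes - 2) * token_coefficient, once the nodelist has three or more nodes. The sketch below only evaluates that formula. The defaults of 3000 ms and 650 ms are assumed from the corosync 3 man page and should be checked against the installed version; the function name is made up for this example.

```python
def effective_token_ms(nodes: int,
                       token_ms: int = 3000,
                       token_coefficient_ms: int = 650) -> int:
    """Runtime token timeout in ms for a nodelist with `nodes` entries.

    Follows the formula documented in corosync.conf(5):
        token + (nodes - 2) * token_coefficient
    The coefficient only applies when there are at least 3 nodes.
    Default values are assumptions; verify them on the target system.
    """
    if nodes < 3:
        return token_ms
    return token_ms + (nodes - 2) * token_coefficient_ms


if __name__ == "__main__":
    # Cluster sizes mentioned in this project: 20, 32 and the 100-node goal.
    for n in (20, 32, 100):
        print(f"{n:>3} nodes -> token timeout {effective_token_ms(n)} ms")
```

With these assumed defaults the timeout is already above one minute at 100 nodes, so any tuning probably has to weigh failure detection time against membership stability.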

Misc.: improve the tooling to grow the cluster node by node, see https://github.com/ClusterLabs/crmsh/pull/1235

Observations

Environment:
- VM, CPU=8, RAM=16G
- CPU: Half of #CPU of the host
- MEM: One node container consumes 300M
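
A back-of-the-envelope check of the memory figures above, as a minimal sketch. It assumes that all node containers run inside the one 16G VM (the notes do not state this explicitly), reads "16G" as 16 GiB and "300M" as 300 MiB, and uses a hypothetical `reserved_mib` parameter for whatever the VM itself needs.

```python
def max_node_containers(vm_ram_mib: int,
                        per_node_mib: int = 300,
                        reserved_mib: int = 0) -> int:
    """How many node containers fit into the VM's RAM (integer floor)."""
    usable = max(vm_ram_mib - reserved_mib, 0)
    return usable // per_node_mib


def ram_needed_mib(nodes: int,
                   per_node_mib: int = 300,
                   reserved_mib: int = 0) -> int:
    """RAM in MiB needed to host `nodes` containers plus the reserve."""
    return nodes * per_node_mib + reserved_mib


if __name__ == "__main__":
    vm_ram = 16 * 1024  # 16 GiB expressed in MiB
    print("containers that fit, no reserve:", max_node_containers(vm_ram))
    print("containers that fit, 2 GiB reserve (hypothetical):",
          max_node_containers(vm_ram, reserved_mib=2048))
    print("MiB needed for 100 nodes:", ram_needed_mib(100))
```

Under these assumptions memory alone would cap a single VM well below 100 node containers, independent of the CPU and network issues listed next.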
Challenges:
- difficult to set up a 20-node cluster all at once
- difficult to add 4 nodes all at once to a 32-node cluster (or even to a 20-node cluster)
  - CPU overloaded
    - DC election can't finish
    - Token timeout
  - corosync membership
    - [TOTEM ] Token has not been received in 41025 ms
    - Repeat: Completed service synchronization, ready to provide service.
  - Network
    - corosync-cfgtool -R might disrupt the KNET layer for a while
    - hanode1 kernel: neighbour: arp_cache: neighbor table overflow!
    - [KNET ] loopback: send local failed. error=Resource temporarily unavailable
    - kernel: podman2: port 19(veth38) entered blocking state
  - Memory can be consumed quickly. Possible memory leak?
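
The log messages quoted above can be counted across many node logs with a small helper like the following. This is only a sketch: the category names are invented for this example, and the patterns are derived solely from the sample lines in this list, so they may need adjusting for other corosync or kernel versions.

```python
import re
from collections import Counter

# Patterns taken from the sample log lines in the observations above.
PATTERNS = {
    "token_stall": re.compile(
        r"\[TOTEM\s*\]\s+Token has not been received in (\d+) ms"),
    "neighbor_overflow": re.compile(r"neighbor table overflow"),
    "knet_loopback": re.compile(r"\[KNET\s*\]\s+loopback: send local failed"),
}


def classify(line: str):
    """Return the first matching category name, or None if nothing matches."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def token_stall_ms(lines):
    """Extract the stall duration in ms from every token warning line."""
    found = []
    for line in lines:
        match = PATTERNS["token_stall"].search(line)
        if match:
            found.append(int(match.group(1)))
    return found


def summarize(lines):
    """Count how often each known symptom appears in the given lines."""
    return Counter(c for c in map(classify, lines) if c is not None)
```

Feeding it the journal of each node gives a quick picture of which symptom dominates. For the neighbor table overflow in particular, the kernel's net.ipv4.neigh.default.gc_thresh1/2/3 sysctls are the usual place to look, although whether raising them helps here has not been verified in this project.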

Similar Projects

ha

Uyuni read-only replica by cbosdonnat

Description

For now, there is no HA setup available for Uyuni. The idea is to explore setting up a read-only shadow instance of Uyuni and to make it as useful as possible.

Possible things to look at:

- Live sync of the database, probably using the WAL. Some of the tables may have to be skipped or some features disabled on the RO instance (taskomatic, PXT sessions…)
- Can we use a load balancer that routes read-only queries to either instance and all other queries to the RW one? For example, packages or PXE data can be served by both, and so can the API GET requests. The rest would be RW.

Goals

Prepare a document explaining how to do it.
PR with the needed code changes to support it

pacemakercluster

Hacking a SUSE MLS 7.9 Cluster by roseswe

Description

SUSE MLS (Multi-Linux Support) - A subscription where SUSE provides technical support and updates for Red Hat Enterprise Linux (RHEL) and CentOS servers

The most significant operational difference between SUSE MLS 7 and the standard SUSE Linux Enterprise Server High Availability Extension (SLES HAE) lies in the administrative toolchain. While both distributions rely on the same underlying Pacemaker resource manager and Corosync messaging layer, MLS 7 preserves the native Red Hat Enterprise Linux 7 user space.

Consequently, MLS 7 administrators must use the Pacemaker Configuration System (pcs), a monolithic and imperative tool. The pcs utility abstracts the entire stack, controlling Corosync networking, cluster bootstrapping, and resource management through single-line commands that automatically generate the necessary configuration files.

In contrast, SLES HAE employs the Cluster Resource Management Shell (crmsh). The crm utility operates as a declarative shell that focuses primarily on the Cluster Information Base (CIB). Unlike the command-driven nature of pcs, crmsh allows administrators to enter a configuration context and define the desired state of the cluster using syntax that maps closely to the underlying XML structure. This makes SLES HAE more flexible for complex edits, but it requires a different syntax knowledge base than the rigid, command-execution workflow of MLS 7.

The scope here is MLS 7.9.

Goals

Get more familiar with MLS7.9 HA toolchain and Graphical User Interface and Daemons
Create a two node MLS cluster with SBD
Check different use cases
Create a "SUSE Best Practices" presentation slide set suitable for Consulting Customers

Resources

You need MLS7.9 (Qcow2) installed + subscription
KVM server with 2 KVMs, 2 SBD
RHEL7 and HA skills

SUSE Health Check Tools by roseswe

SUSE HC Tools Overview

A collection of tools written in Bash or Go 1.24+ that make it easier to handle the batches of tar.xz archives created by supportconfig.

Background: For a SUSE HC we receive a batch of supportconfig tarballs and check them for misconfiguration, areas for improvement, or future changes.

The main focus of these HCs is High Availability (pacemaker), SLES itself, and SAP workloads, especially with regard to the SUSE best practices.

Goals

Overall improvement of the tools
Adding new collectors
Add support for SLES16

Resources

csv2xls* example.sh go.mod listprodids.txt sumtext* trails.go README.md csv2xls.go exceltest.go go.sum m.sh* sumtext.go vercheck.py* config.ini csvfiles/ getrpm* listprodids* rpmdate.sh* sumxls* verdriver* credtest.go example.py getrpm.go listprodids.go sccfixer.sh* sumxls.go verdriver.go

docollall.sh* extracthtml.go gethostnamectl* go.sum numastat.go cpuvul* extractcluster.go firmwarebug* gethostnamectl.go m.sh* numastattest.go cpuvul.go extracthtml* firmwarebug.go go.mod numastat* xtr_cib.sh*

corosync

Hacking a SUSE MLS 7.9 Cluster by roseswe

