Description

When setting up their management cluster, Rancher users start by installing... Rancher. This can quickly result in complex sets of config files, not to mention that upgrading Rancher itself is a manual process.

What if... Rancher itself could be configured through Fleet, harnessing the power of GitOps and/or HelmOps?

Goals

  • Find a way for Rancher, at install time, to adopt an existing Fleet release instead of installing a new one, provided that the Rancher and Fleet releases are compatible with one another
  • Check what happens in Rancher upgrade cases, and improve what can and should be improved

Stretch goals

  • Establish best practices for configuring Rancher through Fleet (e.g. fleet.yaml or equivalent files, repository structure, etc.)

Resources

Outcome

Preliminary research

What already worked

Fleet could already install Rancher, just like any Helm chart, through a GitRepo or HelmOp resource. However, Rancher would then overwrite a pre-existing Fleet deployment with its own, only taking its own configuration (e.g. pinned Fleet version) into account.
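
For illustration, a minimal HelmOp achieving this could look as follows; the repository URL, version and hostname below are placeholders, and field names should be double-checked against the Fleet version in use:

    cat << EOF | kubectl apply -n fleet-local -f -
    kind: HelmOp
    apiVersion: fleet.cattle.io/v1alpha1
    metadata:
      name: rancher
    spec:
      # Target namespace for the Rancher chart
      namespace: cattle-system
      helm:
        releaseName: rancher
        repo: https://releases.rancher.com/server-charts/latest
        chart: rancher
        version: 2.13.0
        values:
          # Placeholder hostname; adapt to your environment
          hostname: rancher.example.com
    EOF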

Output of this project

Git repository structure enabling Rancher to be installed in a GitOps fashion

Rancher can be set up by doing the following:

  1. Installing Fleet
  2. Creating a GitRepo as follows (or by saving the GitRepo itself into a file and applying it through kubectl apply -n fleet-local -f $file):

    cat << EOF | kubectl apply -n fleet-local -f -
    kind: GitRepo
    apiVersion: fleet.cattle.io/v1alpha1
    metadata:
      name: install-rancher
    spec:
      repo: https://github.com/weyfonk/test-fleet
      branch: test-install-rancher
      bundles:
        - base: cert-manager
        - base: rancher-cfg
        - base: rancher
      targets:
        - clusterName: local
    EOF

This will create 3 bundles:

  • cert-manager, which will install cert-manager v1.19.0
  • rancher-cfg, containing a config map with Rancher values; this is stored in a separate directory and created as its own bundle to enable dynamic resolution of a Traefik load balancer service IP as Rancher's hostname value (a sketch follows the note below).
  • rancher, depending on the above two, installing Rancher 2.13.0.
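
Deployment progress can then be followed with standard kubectl commands, for instance:

    kubectl -n fleet-local get gitrepo install-rancher
    kubectl -n fleet-local get bundles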

Note: at the time of writing, this requires Fleet v0.15.0-alpha.1 or above, which has improved Helm lookup support. This will eventually also be possible with Fleet v0.14, as the lookup fix has been backported there, but that branch needs a new release including the fix.
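
To illustrate that lookup, here is a minimal sketch of the config map template which the rancher-cfg bundle could ship, assuming the bundle is rendered as a Helm chart so that Helm's lookup function is available at render time; the service name, namespace and resulting hostname format are illustrative:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rancher-config
      namespace: cattle-system
    data:
      values.yaml: |
        {{- /* Resolve the Traefik load balancer IP from the cluster */}}
        {{- $svc := lookup "v1" "Service" "kube-system" "traefik" }}
        {{- $ip := (index $svc.status.loadBalancer.ingress 0).ip }}
        hostname: {{ $ip }}.sslip.io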

Rancher patch enabling Fleet adoption

See the feature branch, which contains a few commits enabling Rancher's Fleet charts controller to:

  • Check if a new FleetBeforeRancher flag is enabled
  • If so, read the existing Fleet release, extract its version and values, and merge them with values corresponding to watched Rancher settings whenever such settings are updated (as opposed to installing the Fleet version pinned in Rancher, with exclusively Rancher-populated values)
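
How the FleetBeforeRancher flag is exposed may still change. Assuming it were wired like other Rancher feature flags (an assumption, not something the feature branch confirms), enabling it could look like this in the Rancher chart's values:

    # Hypothetical: the exact flag name and wiring are assumptions at this stage
    features: "fleet-before-rancher=true"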

These commits are of course not part of any Rancher release at this point. Testing them can be done as follows:

  1. Check out the feature branch locally
  2. Run make quick
  3. Tag the created rancher/rancher image as if it belonged to Rancher 2.13.0 (beware: the commit SHA may change as new commits are pushed to that branch), e.g.:

    docker tag rancher/rancher:v2.14-33eddc4a0-head rancher/rancher:v2.13.0

  4. If using k3d, import the image into your cluster, e.g.:

    k3d image import -c upstream -m direct rancher/rancher:v2.13.0

  5. Run the steps described in the previous section to install Rancher through Fleet

Bug fix in Fleet's chart URL resolution

Testing the above surfaced a glitch in Fleet's chart URL resolution; more info here. This glitch did not affect any users or customers, as it was only present on Fleet's main branch following recent refactoring. Kudos and thanks to Alejandro Ruiz for his swift help!

Way forward

Rancher configuration monitoring through Fleet

  • With Rancher installed through GitOps, secrets and config maps referenced by the Rancher deployment are subject to change; such changes should trigger corresponding updates of the Rancher deployment. This is a known shortcoming of Fleet's current usage of valuesFrom, which should be addressed through this issue.
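
For context, a fleet.yaml for the rancher bundle relying on valuesFrom could look as follows; the chart coordinates match the setup above, while bundle and config map names are illustrative:

    helm:
      releaseName: rancher
      repo: https://releases.rancher.com/server-charts/latest
      chart: rancher
      version: "2.13.0"
      valuesFrom:
        - configMapKeyRef:
            name: rancher-config
            namespace: cattle-system
            key: values.yaml
    # dependsOn entries are illustrative; actual bundle names derive
    # from the GitRepo name and bundle paths
    dependsOn:
      - name: install-rancher-cert-manager
      - name: install-rancher-rancher-cfg
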
Readiness

Bundle diffs have been added to prevent additional resources installed by the cert-manager bundle, such as CRDs, from appearing as not owned by the bundle, which caused the bundle's status to show as Modified. However, the same remains to be done for Rancher.
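
Such a diff is declared in the bundle's fleet.yaml. As a sketch, ignoring the status field of a cert-manager CRD could look like this; the exact resources and fields to cover must be determined case by case:

    diff:
      comparePatches:
        - apiVersion: apiextensions.k8s.io/v1
          kind: CustomResourceDefinition
          name: certificaterequests.cert-manager.io
          operations:
            # Exclude the server-populated status from comparison
            - {"op": "remove", "path": "/status"}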

Manual Fleet updates in Rancher

Updating Fleet manually through the Rancher UI still causes install loops; troubleshooting work is pending to diagnose and fix them.

RBAC

With Rancher installed through Fleet, a Rancher bundle will be visible through Rancher's Continuous Delivery UI. Without additional RBAC, users would be able to edit or even delete that bundle, with potentially disastrous consequences. This, together with the previous point on manually updating Fleet through the Rancher UI, raises the following question: with elements of Rancher configuration owned by Fleet, which workflow, and which restrictions or changes to the current UX, should be expected? In particular:

  • Should parts of the UI be greyed out with a warning message stating that their elements are owned by Fleet, prompting authorised users to apply changes through git commits rather than the Rancher UI?
  • Or the other way around: should authorised users be able to make changes in the UI, which would then be reflected in git? (this would run contrary to GitOps, not to mention the risk of conflicts)

Cluster registration

In the interest of time, this Hack Week's efforts have been focused on a minimal single-cluster setup. However, users and customers will be more interested in running this in a multi-cluster fashion. This would require more tests for scenarios such as cluster registration. In particular, what would happen:

  • When registering a cluster through Rancher, with Fleet having been installed before Rancher? Cluster registration should be triggered as usual by Rancher, but successful deployment of workloads by Fleet should be tested
  • When updating a multi-cluster workload deployed by Fleet prior to the Rancher installation? Would any conflicts or unexpected glitches appear?


This project is part of:

Hack Week 25
