Motivation

Our SUSE datacenters host many machines and servers, all of it physical hardware that consumes electrical power. During this hack week project I would like to find out how the hardware within our scope is currently used, and look into monitoring and measurement best practices or existing solutions, e.g. power metering data from PDUs. Then I would like to think about, and at best prototype, solutions to prevent unused, idling machines from wasting power and to feed information from the hardware in use back to users, e.g. the cost of executing jobs in a currency like € as well as the ecological impact in kg CO2e. I am also thinking about making existing hardware reusable for other purposes when it is not needed for its primary purpose, e.g. using openQA workers as hypervisors for personal virtual machines in parallel.

Goals

  • G1: General overview about current power usage of SUSE LSG QE hardware in our datacenters
  • G2: Overview about existing best practices and industry standards for efficient hardware usage in datacenters
  • G3: Budgeting information available to end-users
  • G4: Concept about using existing hardware for multiple workloads

Execution

  • Research existing best practices and industry standards for hardware inventory and power usage monitoring
  • Gather existing power usage data of SUSE LSG QE hardware in our datacenters from PDUs or the SUSE general sustainability report and break down what our use ratio is
  • Prototype power usage budgeting to users, e.g. in openQA jobs
  • Try out Harvester/Longhorn or simply a general hypervisor like libvirt on an openQA worker and look into security or performance implications. For libvirt, maybe also some kind of cluster approach, unless Harvester is already the way to go and can be combined with openQA?

Progress

Day 1

Semi-related: Picked up the hack week T-shirt and swag. Tried out https://suse-ai.openplatform.suse.com/ and its API with curl -sS https://ollama.openplatform.suse.com/api/generate -d '{"model": "gemma:2b", "prompt": "What is the time of the day?" }' | jq -rs '.[].response' | tr -d '\n' which yields:

I do not have the ability to perceive time of day or experience subjective changes in consciousness. I do not have a physical body or lived experiences that allow me to measure time.

Asked about industry best practices in https://suse-ai.openplatform.suse.com/c/98baa9ca-d086-4e57-8465-c09939be84d1

Day 2

We could measure power usage from the systems themselves. I am starting with a somewhat exotic setup to see what is generally available: diesel.qe.nue2.suse.org, an IBM Power8 machine in OPAL mode running Leap 15.6 (with the kernel kept at the 15.3 version due to boo#1202138):

```
diesel:~ # cat /sys/class/power_supply/BAT*/power_now
cat: '/sys/class/power_supply/BAT*/power_now': No such file or directory
diesel:~ # cat /sys/class/power_supply/BAT0/current_now /sys/class/power_supply/BAT0/voltage_now | xargs | awk
cat: /sys/class/power_supply/BAT0/current_now: No such file or directory
cat: /sys/class/power_supply/BAT0/voltage_now: No such file or directory
…
diesel:~ # cat /sys/class/powercap/*/energy_uj
cat: '/sys/class/powercap/*/energy_uj': No such file or directory
diesel:~ # sensors
ibmpowernv-isa-0000
Adapter: ISA adapter
Core 0:       +48.0°C
Core 8:       +49.0°C
Core 16:      +49.0°C
Core 24:      +52.0°C
Core 32:      +53.0°C
Core 40:      +52.0°C
Core 48:      +50.0°C
Core 56:      +50.0°C
Centaur 0:    +36.0°C
Centaur 1:    +45.0°C
Centaur 4:    +39.0°C
Centaur 5:    +38.0°C

diesel:~ # zypper in tlp-stat
Loading repository data...
Reading installed packages...
Package 'tlp-stat' not found.
Resolving package dependencies...
Nothing to do.
diesel:~ # zypper in powerstat
Loading repository data...
Reading installed packages...
Resolving package dependencies...

The following 2 NEW packages are going to be installed:
  powerstat powerstat-bash-completion
…
diesel:~ # powerstat
Device is not discharging, cannot measure power usage.
Perhaps re-run with -z (ignore zero power)
```
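The manual probes above could be wrapped into a few small helpers. This is only a minimal sketch, assuming the usual Linux sysfs conventions (power_supply values in micro-units, powercap energy_uj as a cumulative microjoule counter). All function names and the root-directory parameter are hypothetical and only there for illustration and testing. On diesel none of these files exist, so the probe would simply print nothing.

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Battery-style sources report microamps and microvolts:
# P [W] = I [uA] * U [uV] / 1e12
power_from_current_voltage() {
    awk -v ua="$1" -v uv="$2" 'BEGIN { printf "%.2f\n", ua * uv / 1e12 }'
}

# powercap exposes a cumulative energy counter in microjoules.
# Average power is the counter delta divided by the elapsed seconds.
# Counter wrap-around is ignored in this sketch.
power_from_energy_uj() {
    awk -v e1="$1" -v e2="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (e2 - e1) / 1e6 / s }'
}

# Print every readable power-related sysfs file below a root (default: /sys).
probe_power_sources() {
    root="${1:-/sys}"
    for f in "$root"/class/power_supply/*/power_now \
             "$root"/class/power_supply/*/current_now \
             "$root"/class/powercap/*/energy_uj; do
        [ -r "$f" ] && echo "$f"
    done
    return 0
}
```

Calling probe_power_sources without arguments checks the live system. Two energy_uj readings taken a known number of seconds apart can then be passed to power_from_energy_uj.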

So there is no easy way to retrieve this information from the running system. However, from https://racktables.nue.suse.com/index.php?page=object&tab=default&object_id=3026 I know that the machine is connected to outlet 3 on PDU-FC-B1, for which http://epdu-b1.qe.nue2.suse.org/ reports the current usage: 1.56 A, 335 W, power factor -0.96, and 3590 kWh of energy since 2022-07-17. By the way, I realized that the date on epdu-b1.qe.nue2.suse.org through epdu-b5.qe.nue2.suse.org was off by one day and that the PDUs were not using NTP. I have now switched all 5 PDUs to use ntp1.suse.de and a proper ISO 8601 date format.
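As a plausibility check, the PDU reading can be cross-checked with the single-phase relation P = U * I * |PF|. The sketch below assumes that the outlet is single-phase and that the sign of the power factor only indicates a leading or lagging load. Neither is confirmed by the PDU page, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for sanity-checking a PDU outlet reading.

# Apparent power S [VA] = P [W] / |PF|
apparent_power_va() {
    awk -v p="$1" -v pf="$2" \
        'BEGIN { if (pf < 0) pf = -pf; printf "%.1f\n", p / pf }'
}

# Implied RMS voltage U [V] = P [W] / (I [A] * |PF|)
implied_voltage_v() {
    awk -v p="$1" -v i="$2" -v pf="$3" \
        'BEGIN { if (pf < 0) pf = -pf; printf "%.1f\n", p / (i * pf) }'
}

# Values read from outlet 3 on PDU-FC-B1: 335 W, 1.56 A, power factor -0.96
apparent_power_va 335 -0.96        # prints 349.0
implied_voltage_v 335 1.56 -0.96   # prints 223.7
```

An implied voltage of roughly 224 V is in the range one would expect from a nominal 230 V supply, so the three reported values are at least consistent with each other.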

The NUE2 PDUs support "Cisco EnergyWise", which seems to be a product for measuring power usage in bigger datacenters, but I have never heard of it, so I assume SUSE does not use it. …

Results

After two days I was mostly pulled into non-hack-week work and could not continue. One idea I had is that openQA workers could either measure their own power usage or, if such data is not available to the OS on the machine, take a manually configured value from the openQA worker config. From those values each openQA worker instance could estimate its power usage, e.g. with 10 instances and 500 W overall power draw, each instance accounts for 500 W / 10 = 50 W. Jobs could then be charged the corresponding energy, e.g. 50 W * 2 h = 100 Wh for a 2 h runtime, plus the corresponding CO2e depending on the electricity grid the machines are operated in.
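The accounting idea above can be sketched in a few lines of shell. This is not existing openQA functionality, and all function names are hypothetical. The grid carbon intensity (400 g CO2e per kWh) and the electricity price (0.30 € per kWh) in the example calls are illustrative placeholders, not measured or official values.

```shell
#!/bin/sh
# Hypothetical per-job energy accounting, following the equal-split idea.

# Share of the machine's power draw per worker instance [W].
instance_power_w() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.1f\n", total / n }'
}

# Energy of one job [Wh] = instance power [W] * runtime [h].
job_energy_wh() {
    awk -v w="$1" -v h="$2" 'BEGIN { printf "%.1f\n", w * h }'
}

# Emissions [g CO2e] = energy [Wh] * grid intensity [g CO2e/kWh] / 1000.
job_co2e_g() {
    awk -v wh="$1" -v g="$2" 'BEGIN { printf "%.1f\n", wh * g / 1000 }'
}

# Cost [EUR] = energy [Wh] * price [EUR/kWh] / 1000.
job_cost_eur() {
    awk -v wh="$1" -v price="$2" 'BEGIN { printf "%.3f\n", wh * price / 1000 }'
}

instance_power_w 500 10   # prints 50.0
job_energy_wh 50 2        # prints 100.0
job_co2e_g 100 400        # prints 40.0  (placeholder intensity)
job_cost_eur 100 0.30     # prints 0.030 (placeholder price)
```

Splitting the draw equally across instances is the simplest possible model. It charges idle power to whichever jobs happen to run and ignores that different jobs load the machine differently, so the numbers are rough estimates at best.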

This project is part of:

Hack Week 24

Activity

  • about 2 months ago: okurz started this project.
  • 2 months ago: hennevogel liked this project.
  • 2 months ago: giusdp left this project.
  • 3 months ago: giusdp liked this project.
  • 3 months ago: giusdp started this project.
  • 3 months ago: dmdiss liked this project.
  • 4 months ago: okurz originated this project.

  • Comments

    • giusdp
      3 months ago by giusdp

      Hi! This project is really interesting, I'd be glad to help if possible!

      • okurz
        3 months ago by okurz

        Sure you can! Just follow your own pace along the suggested execution steps and towards the mentioned goals. I am looking forward to your project results, regardless of whether they are research results or results of experiments that you can conduct.

    • jzerebecki
      3 months ago by jzerebecki

      If you try anything for this that generally works for Harvester and/or Kubernetes, please report your experience, e.g. with https://kubernetes.io/docs/concepts/scheduling-eviction/resource-bin-packing/ or https://kubernetes.io/docs/concepts/cluster-administration/cluster-autoscaling/
