There are several use cases where it's beneficial to be able to automatically rearrange VM instances in a cloud into a different placement. These usually fall into one of two categories, which could be called defragmentation, and rebalancing:
- Defragmentation - condense VMs onto fewer physical VM host
machines (conceptually similar to VMware
DPM).
Example use cases:
- Reduce power consumption
- Increase CPU / RAM / I/O utilization (e.g. by over-committing and/or memory page sharing)
- Evacuate physical servers
- for repurposing
- for maintenance (BIOS upgrades, re-cabling etc.)
- to protect SLAs, e.g. when SMART monitors indicate potential imminent hardware failure, or when HVAC failures are likely to cause servers to shutdown due to over-heating
- Rebalancing - spread the VMs evenly across as many physical VM host
machines as possible (conceptually similar to
VMware DRS).
Example use cases:
- Optimise workloads for performance, by reducing CPU / I/O hotspots
- Maximise headroom on each physical machine
- Reduce thermal hotspots in order to reduce power consumption
Custom rearrangements may be required according to other IT- or business-driven policies, e.g. only rearrange VM instances relating to a specific workload, in order to increase locality of reference, reduce latency, respect availability zones, or facilitate other out-of-band workflows.
It is clear from the above that VM placement policies are likely to
vary greatly across clouds, and sometimes even within a single cloud.
OpenStack Compute
(nova
) has fairly sophisticated scheduling capabilities
which can be configured to implement some of the above policies on an
incremental basis, i.e. every time a VM instance is started or
migrated, the destination VM host can be chosen according to filters
and weighted cost functions. However, this approach is somewhat
limited, because the placement policies are implemented cloud-wide,
and only considered one migration at a time.
Since OpenStack is rapidly evolving to the point where a VM's network and storage dependencies can be live migrated along with the workload in a near seamless fashion, it is advantageous to develop mechanisms for implementing finer-grained placement policies, where not only is VM rearrangement performed automatically, but the policies themselves can be varied dynamically over time as workload requirements change.
Developing algorithms to determine optimal placement is distinctly non-trivial. For example, the defragmentation scenario above is a complex variant of the bin packing problem, which is NP-hard. The following constraints add significant complexity to the problem:
A useful real world solution should take into account not only the RAM footprint of the VMs, but also CPU, disk, and network.
It also needs to ensure that SLAs are maintained whilst any rearrangement takes place.
If the cloud is sufficiently near capacity, it may not be possible to rearrange the VMs from their current placement to a more optimal placement without first shutting down some VMs, which could be prohibited by the SLAs.
Even if the arrangement is achievable purely via a sequence of live migrations, the algorithm must also be sensitive to the performance impact to running workloads when performing multiple live migrations, since live migrations require intensive bursts of network I/O in order to synchronize the VM's memory contents between the source and target hosts, followed by a momentary freezing of the VM as it flips from the source to the target. This trade-off between optimal resource utilization and service availability means that a sub-optimal final placement may be preferable to an optimal one.
In the case where the hypervisor is capable of sharing memory pages between VMs, the algorithm should try to place together VMs which are likely to share memory pages (e.g. VMs running the same OS platform, OS version, software libraries, or applications. A research paper published in 2011 demonstrated that VM packing which optimises placement in this fashion can be approximated in polytime, achieving 32% to 50% reduction in servers and a 25% to 57% reduction in memory footprint compared to sharing-oblivious algorithms.
As noted by the above paper, this area of computer science is still evolving. There is one constant however: any rearrangement solution must not only provide a final VM placement optimised according to the chosen constraints, but also a sequence of migrations to it from the current placement. There will often be multiple migration sequences reaching the optimised placement from the current one, and their efficiency can vary widely. In other words, there are two questions which need answering:
- Given a starting placement A, which is the best final placement B to head for?
- What's the best way to get from A to B?
The above considerations strongly suggest that the first question is much harder to answer than the second. I propose that by adopting a divide and conquer approach, solving the second may simplify the first. Decoupling the two should also provide a mechanism for comparatively evaluating the effectiveness of potential answers to the first. Another bonus of this decoupling is that it should be possible for the path-finding algorithm to also discover opportunities for parallelizing live migrations when walking the path, so that the target placement B can be reached more quickly.
Therefore, for this Hack Week project, I intend to design and implement an algorithm which reliably calculates a reasonably optimal sequence of VM migrations from a given initial placement to a given final placement. My goals are as follows:
- If there is any migration path, the algorithm must find at least one.
- It should find a path which is reasonably optimal with respect to the migration cost function. In prior work, I already tried Dijkstra's shortest path algorithm and demonstrated that an exhaustive search for the shortest path has intolerably high complexity.
- The migration cost function should be pluggable. Initially it will simply consider the cost as proportional to the RAM footprint of the VM being migrated.
- For now, use a smoke-and-mirrors in-memory model of the cloud's state.
Later, the code can be ported to consume the
nova
API. - In light of the above, the implementation should be in Python.
- Determination of which states are sane should be pluggable. For example, the algorithm should not hardcode any assumptions about whether physical servers can over-commit CPU or RAM.
Looking for hackers with the skills:
cloud virtualization orchestration openstack performance python
This project is part of:
Hack Week 10
Activity
Comments
-
about 11 years ago by aspiers | Reply
I completed this project, and intend to publish the code and also blog about it. I uploaded a YouTube video demoing what the code can do.
Similar Projects
Mortgage Plan Analyzer by RMestre
https://github.com/rjpmestre/mortgage-plan-analyzer
Project Description
Many people face challenges when trying to renegotiate their mortgages with different banks. They receive offers from multiple lenders and struggle to compare them effectively. Each proposal may have slightly different terms and data presentation, making it hard to make informed decisions. Additionally, understanding the impact of various taxes and variables can be complex. The Mortgage Plan Analyzer project aims to address these issues.
Project Overview:
The Mortgage Plan Analyzer is a web-based tool built using PHP, Laravel, Livewire, and AdminLTE/bootstrap. It provides a user-friendly platform for individuals to input basic specifications about their mortgage, adjust taxes and variables, and obtain short-term projections for each proposal. Users can also compare multiple mortgage offers side by side, enabling them to make informed decisions about their mortgage renegotiation.
Why Start This Project:
I found myself in this position and most tools I found around are either for marketing/selling purposes or not flexible enough. As i was starting getting lost in a jungle of spreadsheets i thought I could just create a tool to help me and others that may be experiencing the same struggles to provide clarity and transparency in the decision-making process.
Hackweek 24 update
- Improved summaries graphs by adding:
- - Line graph;
- - Accumulated line graph;
- - Set the range to short/mid/long term;
- - Highlight best simulation and value per year;
- Improve the general behaviour of the forms:
- - Simulations name setting;
- - Cloning simulations;
- - Adjust update timing on input changes;
- Show/Hide big tables;
- Support multi languages (added english);
- Added examples;
- Adjustments to fonts and sizes;
- Fixed loading screen;
- Dependencies adjustments;
Hackweek 23 initial release
- Developed a base site that:
- - Allows adding up to 3 simulations;
- - Create financial plans;
- - Simulations comparison graph for the first 4 years;
- Created Github project @ https://github.com/rjpmestre/mortgage-plan-analyzer ;
- Launched a demo instance using Oracle Cloud Free Tier currently @ http://138.3.251.182/
Resources
- Banco de Portugal: Main simulator all portuguese banks have to follow ( https://clientebancario.bportugal.pt/credito-habitacao )
- Laravel: A PHP web application framework for building robust and scalable applications. ( https://laravel.com/ )
- Livewire: A Laravel library for building dynamic interfaces without writing JavaScript. ( https://livewire.laravel.com/ )
- AdminLTE: A responsive admin dashboard template for creating a visually appealing interface. ( https://adminlte.io/ )
- GitHub: We will host the project on GitHub for version control and collaboration. ( bet you didn't know this one, https://github.com/ )
- Oracle Cloud ( https://www.oracle.com/cloud/free/ )
Save pytorch models in OCI registries by jguilhermevanz
Description
A prerequisite for running applications in a cloud environment is the presence of a container registry. Another common scenario is users performing machine learning workloads in such environments. However, these types of workloads require dedicated infrastructure to run properly. We can leverage these two facts to help users save resources by storing their machine learning models in OCI registries, similar to how we handle some WebAssembly modules. This approach will save users the resources typically required for a machine learning model repository for the applications they need to run.
Goals
Allow PyTorch users to save and load machine learning models in OCI registries.
Resources
Extending KubeVirtBMC's capability by adding Redfish support by zchang
Description
In Hack Week 23, we delivered a project called KubeBMC (renamed to KubeVirtBMC now), which brings the good old-fashioned IPMI ways to manage virtual machines running on KubeVirt-powered clusters. This opens the possibility of integrating existing bare-metal provisioning solutions like Tinkerbell with virtualized environments. We even received an inquiry about transferring the project to the KubeVirt organization. So, a proposal was filed, which was accepted by the KubeVirt community, and the project was renamed after that. We have many tasks on our to-do list. Some of them are administrative tasks; some are feature-related. One of the most requested features is Redfish support.
Goals
Extend the capability of KubeVirtBMC by adding Redfish support. Currently, the virtbmc component only exposes IPMI endpoints. We need to implement another simulator to expose Redfish endpoints, as we did with the IPMI module. We aim at a basic set of functionalities:
- Power management
- Boot device selection
- Virtual media mount (this one is not so basic )
Resources
SUSE KVM Best Practices by roseswe
Description
SUSE Best Practices around KVM, especially for SAP workloads. Early Google presentation already made from various customer projects and SUSE sources.
Goals
Complete presentation we can reuse in SUSE Consulting projects
Resources
KVM (virt-manager) images
SUSE/SAP/KVM Best Practices
- https://documentation.suse.com/en-us/sles/15-SP6/single-html/SLES-virtualization/
- SAP Note 1522993 - "Linux: SAP on SUSE KVM - Kernel-based Virtual Machine" && 2284516 - SAP HANA virtualized on SUSE Linux Enterprise hypervisors https://me.sap.com/notes/2284516
- SUSECon24: [TUTORIAL-1253] Virtualizing SAP workloads with SUSE KVM || https://youtu.be/PTkpRVpX2PM
- SUSE Best Practices for SAP HANA on KVM - https://documentation.suse.com/sbp/sap-15/html/SBP-SLES4SAP-HANAonKVM-SLES15SP4/index.html
Harvester Packer Plugin by mrohrich
Description
Hashicorp Packer is an automation tool that allows automatic customized VM image builds - assuming the user has a virtualization tool at their disposal. To make use of Harvester as such a virtualization tool a plugin for Packer needs to be written. With this plugin users could make use of their Harvester cluster to build customized VM images, something they likely want to do if they have a Harvester cluster.
Goals
Write a Packer plugin bridging the gap between Harvester and Packer. Users should be able to create customized VM images using Packer and Harvester with no need to utilize another virtualization platform.
Resources
Hashicorp documentation for building custom plugins for Packer https://developer.hashicorp.com/packer/docs/plugins/creation/custom-builders
Source repository of the Harvester Packer plugin https://github.com/m-ildefons/harvester-packer-plugin
Contribute to terraform-provider-libvirt by pinvernizzi
Description
The SUSE Manager (SUMA) teams' main tool for infrastructure automation, Sumaform, largely relies on terraform-provider-libvirt. That provider is also widely used by other teams, both inside and outside SUSE.
It would be good to help the maintainers of this project and give back to the community around it, after all the amazing work that has been already done.
If you're interested in any of infrastructure automation, Terraform, virtualization, tooling development, Go (...) it is also a good chance to learn a bit about them all by putting your hands on an interesting, real-use-case and complex project.
Goals
- Get more familiar with Terraform provider development and libvirt bindings in Go
- Solve some issues and/or implement some features
- Get in touch with the community around the project
Resources
- CONTRIBUTING readme
- Go libvirt library in use by the project
- Terraform plugin development
- "Good first issue" list
Team Hedgehogs' Data Observability Dashboard by gsamardzhiev
Description
This project aims to develop a comprehensive Data Observability Dashboard that provides r insights into key aspects of data quality and reliability. The dashboard will track:
Data Freshness: Monitor when data was last updated and flag potential delays.
Data Volume: Track table row counts to detect unexpected surges or drops in data.
Data Distribution: Analyze data for null values, outliers, and anomalies to ensure accuracy.
Data Schema: Track schema changes over time to prevent breaking changes.
The dashboard's aim is to support historical tracking to support proactive data management and enhance data trust across the data function.
Goals
Although the final goal is to create a power bi dashboard that we are able to monitor, our goals is to 1. Create the necessary tables that track the relevant metadata about our current data 2. Automate the process so it runs in a timely manner
Resources
AWS Redshift; AWS Glue, Airflow, Python, SQL
Why Hedgehogs?
Because we like them.
Run local LLMs with Ollama and explore possible integrations with Uyuni by PSuarezHernandez
Description
Using Ollama you can easily run different LLM models in your local computer. This project is about exploring Ollama, testing different LLMs and try to fine tune them. Also, explore potential ways of integration with Uyuni.
Goals
- Explore Ollama
- Test different models
- Fine tuning
- Explore possible integration in Uyuni
Resources
- https://ollama.com/
- https://huggingface.co/
- https://apeatling.com/articles/part-2-building-your-training-data-for-fine-tuning/
Symbol Relations by hli
Description
There are tools to build function call graphs based on parsing source code, for example, cscope
.
This project aims to achieve a similar goal by directly parsing the disasembly (i.e. objdump) of a compiled binary. The assembly code is what the CPU sees, therefore more "direct". This may be useful in certain scenarios, such as gdb/crash debugging.
Detailed description and Demos can be found in the README file:
Supports x86 for now (because my customers only use x86 machines), but support for other architectures can be added easily.
Tested with python3.6
Goals
Any comments are welcome.
Resources
https://github.com/lhb-cafe/SymbolRelations
symrellib.py: mplements the symbol relation graph and the disassembly parser
symrel_tracer*.py: implements tracing (-t option)
symrel.py: "cli parser"
Testing and adding GNU/Linux distributions on Uyuni by juliogonzalezgil
Join the Gitter channel! https://gitter.im/uyuni-project/hackweek
Uyuni is a configuration and infrastructure management tool that saves you time and headaches when you have to manage and update tens, hundreds or even thousands of machines. It also manages configuration, can run audits, build image containers, monitor and much more!
Currently there are a few distributions that are completely untested on Uyuni or SUSE Manager (AFAIK) or just not tested since a long time, and could be interesting knowing how hard would be working with them and, if possible, fix whatever is broken.
For newcomers, the easiest distributions are those based on DEB or RPM packages. Distributions with other package formats are doable, but will require adapting the Python and Java code to be able to sync and analyze such packages (and if salt does not support those packages, it will need changes as well). So if you want a distribution with other packages, make sure you are comfortable handling such changes.
No developer experience? No worries! We had non-developers contributors in the past, and we are ready to help as long as you are willing to learn. If you don't want to code at all, you can also help us preparing the documentation after someone else has the initial code ready, or you could also help with testing :-)
The idea is testing Salt and Salt-ssh clients, but NOT traditional clients, which are deprecated.
To consider that a distribution has basic support, we should cover at least (points 3-6 are to be tested for both salt minions and salt ssh minions):
- Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file)
- Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)
- Package management (install, remove, update...)
- Patching
- Applying any basic salt state (including a formula)
- Salt remote commands
- Bonus point: Java part for product identification, and monitoring enablement
- Bonus point: sumaform enablement (https://github.com/uyuni-project/sumaform)
- Bonus point: Documentation (https://github.com/uyuni-project/uyuni-docs)
- Bonus point: testsuite enablement (https://github.com/uyuni-project/uyuni/tree/master/testsuite)
If something is breaking: we can try to fix it, but the main idea is research how supported it is right now. Beyond that it's up to each project member how much to hack :-)
- If you don't have knowledge about some of the steps: ask the team
- If you still don't know what to do: switch to another distribution and keep testing.
This card is for EVERYONE, not just developers. Seriously! We had people from other teams helping that were not developers, and added support for Debian and new SUSE Linux Enterprise and openSUSE Leap versions :-)
Pending
FUSS
FUSS is a complete GNU/Linux solution (server, client and desktop/standalone) based on Debian for managing an educational network.
https://fuss.bz.it/
Seems to be a Debian 12 derivative, so adding it could be quite easy.
[ ]
Reposync (this will require using spacewalk-common-channels and adding channels to the .ini file)[ ]
Onboarding (salt minion from UI, salt minion from bootstrap scritp, and salt-ssh minion) (this will probably require adding OS to the bootstrap repository creator)[ ]
Package management (install, remove, update...)[ ]
Patching (if patch information is available, could require writing some code to parse it, but IIRC we have support for Ubuntu already)[ ]
Applying any basic salt state (including a formula)[ ]
Salt remote commands[ ]
Bonus point: Java part for product identification, and monitoring enablement
SUSE AI Meets the Game Board by moio
Use tabletopgames.ai’s open source TAG and PyTAG frameworks to apply Statistical Forward Planning and Deep Reinforcement Learning to two board games of our own design. On an all-green, all-open source, all-AWS stack!
AI + Board Games
Board games have long been fertile ground for AI innovation, pushing the boundaries of capabilities such as strategy, adaptability, and real-time decision-making - from Deep Blue's chess mastery to AlphaZero’s domination of Go. Games aren’t just fun: they’re complex, dynamic problems that often mirror real-world challenges, making them interesting from an engineering perspective.
As avid board gamers, aspiring board game designers, and engineers with careers in open source infrastructure, we’re excited to dive into the latest AI techniques first-hand.
Our goal is to develop an all-open-source, all-green AWS-based stack powered by some serious hardware to drive our board game experiments forward!
Project Goals
Set Up the Stack:
- Install and configure the TAG and PyTAG frameworks on SUSE Linux Enterprise Base Container Images.
- Integrate with the SUSE AI stack for GPU-accelerated training on AWS.
- Validate a sample GPU-accelerated PyTAG workload on SUSE AI.
- Ensure the setup is entirely repeatable with Terraform and configuration scripts, documenting results along the way.
Design and Implement AI Agents:
- Develop AI agents for the two board games, incorporating Statistical Forward Planning and Deep Reinforcement Learning techniques.
- Fine-tune model parameters to optimize game-playing performance.
- Document the advantages and limitations of each technique.
Test, Analyze, and Refine:
- Conduct AI vs. AI and AI vs. human matches to evaluate agent strategies and performance.
- Record insights, document learning outcomes, and refine models based on real-world gameplay.
Technical Stack
- Frameworks: TAG and PyTAG for AI agent development
- Platform: SUSE AI
- Tools: AWS for high-performance GPU acceleration
Why This Project Matters
This project not only deepens our understanding of AI techniques by doing but also showcases the power and flexibility of SUSE’s open-source infrastructure for supporting high-level AI projects. By building on an all-open-source stack, we aim to create a pathway for other developers and AI enthusiasts to explore, experiment, and deploy their own innovative projects within the open-source space.
Our Motivation
We believe hands-on experimentation is the best teacher.
Combining our engineering backgrounds with our passion for board games, we’ll explore AI in a way that’s both challenging and creatively rewarding. Our ultimate goal? To hack an AI agent that’s as strategic and adaptable as a real human opponent (if not better!) — and to leverage it to design even better games... for humans to play!