I think I was a bit too ambitious when I wrote this description :) ... but it has been fun anyway.

**This is what I accomplished**

* **Setting up a SUSE CaaSP cluster where the admin and the master were running on top of KVM and the worker was a workstation with an NVIDIA GPU.**
The first trick was to set up the virtual machines to use the ethernet network interface from the host (macvtap). For whatever reason I could not set this up with virt-manager run as a "normal user", but I could if I started virt-manager from YaST (with root permissions... might that be the reason?).
The second trick was to restrict the master to 2GB of RAM and the admin to 4GB, so I could run this on my laptop (thanks [@ereslibre](/users/ereslibre) !)
Finally, the third trick was to add "hostname=UNIQUE_HOSTNAME" as a linuxrc parameter when installing each machine (otherwise they would all be named linux.lan :) )

* **Building nvidia packages for CaaSP**. The nvidia packages for SLE12SP3, built by SUSE but provided by nvidia at http://download.nvidia.com/suse/sles12sp3, had been built for an older kernel than the one released in CaaSP. Thus, when installing those packages, the nvidia kernel modules could not be loaded. For this reason, I built them for the latest kernel in openSUSE Leap 42.3, and installed them at the same time as I upgraded the kernel to the one in openSUSE Leap 42.3 (see [0] for why openSUSE Leap 42.3). You can download them from [this project](https://build.opensuse.org/project/show/home\:jordimassaguerpla\:branches\:X11\:Drivers\:Video).

* **Installing and fixing nvidia-runtime-hooks and libnvidia-containers**: There is no package for SUSE, so I took the ones from CentOS 7; the trick was to run a CentOS 7 container and follow the instructions from https://nvidia.github.io/libnvidia-container/, but add the "--downloadonly" option to yum. Luckily, the packages installed without any error ... but they were not really working! Using "strace nvidia-container-cli info" I realized the problem was in the permissions of the /dev/nvidia* files. Thus, running "chmod 0666 /dev/nvidia*" fixed the installation... but you have to do this on every reboot (actually, every time the nvidia module is loaded). The trick was to use "transactional-update shell" to do all these changes :) . Note that I am not installing nvidia-container-runtime, only the hook. That is because we will use cri-o and not docker, and cri-o does not need nvidia-container-runtime.

As "proof", see:

    > nvidia-container-cli info
    NVRM version:   390.67
    CUDA version:   9.1

    Device Index:   0
    Device Minor:   0
    Model:          GeForce GTX 1060 3GB
    GPU UUID:       GPU-f96a76d4-7ba9-07cc-2774-bb7a55ef3e68
    Bus Location:   00000000:02:00.0
    Architecture:   6.1
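
The `chmod` has to be repeated every time the nvidia module is loaded, so it could be wrapped in a small helper. The following is only a sketch: the function name `fix_nvidia_perms` and its optional directory argument (there so it can be tried on a scratch directory instead of `/dev`) are assumptions, not something from the original setup.

```shell
# Re-apply world read/write permissions to the nvidia device nodes.
# Sketch only: fix_nvidia_perms and its optional directory argument are
# illustrative, not part of any nvidia tooling.
fix_nvidia_perms() {
    dev_dir="${1:-/dev}"
    status=1
    for node in "$dev_dir"/nvidia*; do
        # Skip the unexpanded pattern (no match) and any directories.
        [ -e "$node" ] || continue
        [ -d "$node" ] && continue
        chmod 0666 "$node" || return 1
        status=0
    done
    # 0 if at least one node was changed, 1 if none was found.
    return "$status"
}
```

Called without an argument it acts on `/dev`. It returns non-zero when no matching node exists, which would be the case before the module is loaded.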

* **Setting up the cri-o hook to use libnvidia-container**: I just had to follow the instructions here: https://github.com/kubernetes-incubator/cri-o/issues/1222. I couldn't really verify this, but I am quite confident this worked, as kubelet was starting and parsing the hook.

**and this is where I failed**

* **Using a chained forward proxy to add the workstation into a SUSE CaaSP cluster which was running in a SUSE Cloud cluster**: I tried configuring 2 proxies with apache2, using mod_proxy, mod_proxy_http and mod_proxy_connect, where both were configured as forward proxies and the second one used the "ProxyRemote" directive to "chain" with the first one. Then I placed the first one inside the SUSE Cloud cluster, as a virtual machine, and the second one on my laptop. The trick worked, and I was able to access the autoyast file from the admin node in the SUSE Cloud cluster (http://admin_node/autoyast) when installing the workstation via the DVD, even though the admin node was not accessible from outside the SUSE Cloud cluster, and the SUSE Cloud cluster is inside the VPN, where the workstation is not (but the laptop is). It sounds a bit complicated, but the solution was actually quite simple. However, salt-minion does not use HTTP but ZeroMQ, so its traffic was not going through the proxies.

* **Building nvidia-container and libnvidia-container packages for SUSE**: I tried getting the spec file from GitHub, but it required so much tuning that it would have taken me the whole hackweek (or more) to get them building for SUSE, so I ended up using the ones from CentOS 7.

* **Setting up k8s to schedule jobs that require a GPU**: Even though cri-o seemed correctly configured, jobs were not being scheduled. Other docs I found on the internet referred to adding the "--experimental-nvidia-gpus=1" option to kubelet, but this is not possible because kubelet does not recognize this option and fails to start. Then I read in the k8s docs about enabling this via a device plugin: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/. This required enabling feature gates, which are not enabled by default. Here I think I failed because I didn't know how to do it, and unfortunately I ran out of time ... However, while writing this report, Flavio pointed me to https://wiki.microfocus.com/index.php/SUSE_CaaS_Platform/FAQ#How_to_enable_Kubernetes_feature_gates (thanks [@flavio_castelli](/users/flavio_castelli) !) where you can see how to enable the feature gates. **This is where we should resume the work if we have some time at some point**.

* **Running a kubeflow deployment**: I didn't have time to reach this point. This was the last step and a project on its own. Next hackweek, maybe...
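
For the chained forward proxy item above, the Apache side could look roughly like the config this helper writes. It is a sketch under assumptions: the function name, the port 3128 and the host name `first-proxy.example` are made up, and the access control a real forward proxy needs (a `<Proxy>` block with `Require`) is left out. `Listen`, `ProxyRequests`, `ProxyVia` and `ProxyRemote` are standard Apache directives.

```shell
# Write a minimal Apache forward-proxy config to the given file.
# With a third argument, requests are chained to that upstream proxy
# through ProxyRemote. Names, port and hosts are illustrative only.
write_forward_proxy_conf() {
    conf_file="$1"
    listen_port="$2"
    upstream="${3:-}"
    {
        printf 'Listen %s\n' "$listen_port"
        printf '<VirtualHost *:%s>\n' "$listen_port"
        printf '    ProxyRequests On\n'
        printf '    ProxyVia On\n'
        if [ -n "$upstream" ]; then
            # Forward every request to the next proxy in the chain.
            printf '    ProxyRemote "*" "%s"\n' "$upstream"
        fi
        printf '</VirtualHost>\n'
    } > "$conf_file"
}
```

The proxy inside the cloud would be written with `write_forward_proxy_conf first.conf 3128`, and the one on the laptop with `write_forward_proxy_conf second.conf 3128 http://first-proxy.example:3128`. This only covers HTTP traffic, so it would not help with the ZeroMQ connection that salt-minion uses.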

[0] Why openSUSE Leap 42.3? SLE12SP3 shares its common code base with openSUSE Leap 42.3, and for the hackweek I wanted to build the nvidia package in the Open Build Service, https://build.opensuse.org. Using openSUSE Leap 42.3 (plus its update repo) was easier than trying to build it for the exact kernel shipped in CaaSPv3.

The URL for how to enable the feature gates got formatted weirdly ... This is the URL:

https://wiki.microfocus.com/index.php/SUSE_CaaS_Platform/FAQ#How_to_enable_Kubernetes_feature_gates

I think this is an internal document, so for those who do not have access:

How to enable Kubernetes feature gates

Feature gates are the mechanism Kubernetes uses to enable experimental features in advance.

It's possible to enable Kubernetes feature gates on SUSE CaaS Platform 3.

Please note: feature gates are experimental features, hence they won't be supported by SUSE.

Let's assume a user wants to use two feature gates:

* DevicePlugins
* ReadOnlyAPIDataVolumes

The user would have to log into the admin node and execute this command:

    docker exec $(docker ps | grep velum-dashboard | awk '{print $1}') entrypoint.sh bundle exec rails runner "Pillar.apply(kubernetes_feature_gates: 'DevicePlugins=true,ReadOnlyAPIDataVolumes=true')"

And then issue an orchestration. This can be done using the following command on the admin node:

    docker exec $(docker ps | grep salt-master | awk '{print $1}') salt-run state.orchestrate orch.kubernetes
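
To avoid typos in the gate list, the two commands could be generated from the gate names. This is a sketch: `feature_gates_value` and `print_feature_gate_commands` are hypothetical helper names, and the second function only prints the commands from the FAQ so they can be reviewed before being run on the admin node.

```shell
# Build the comma-separated "Gate=true" string used in the pillar.
# feature_gates_value is a hypothetical helper name.
feature_gates_value() {
    value=""
    for gate in "$@"; do
        # Prepend a comma only when something is already in the list.
        value="${value:+$value,}${gate}=true"
    done
    printf '%s\n' "$value"
}

# Print (do not run) the two admin-node commands for the given gates.
print_feature_gate_commands() {
    gates="$(feature_gates_value "$@")"
    printf '%s\n' "docker exec \$(docker ps | grep velum-dashboard | awk '{print \$1}') entrypoint.sh bundle exec rails runner \"Pillar.apply(kubernetes_feature_gates: '${gates}')\""
    printf '%s\n' "docker exec \$(docker ps | grep salt-master | awk '{print \$1}') salt-run state.orchestrate orch.kubernetes"
}
```

For the GPU case, `print_feature_gate_commands DevicePlugins` would print the pillar command with only that gate, followed by the orchestration command.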

A namespaced RoleBinding would add host path mount privileges without granting excess privileges over all namespaces:

    ---
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: nvidia-device-plugin
      namespace: kube-system
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: nvidia-device-plugin-psp-privileged
      namespace: kube-system
    roleRef:
      kind: ClusterRole
      name: suse:caasp:psp:privileged
      apiGroup: rbac.authorization.k8s.io
    subjects:
    - kind: ServiceAccount
      name: nvidia-device-plugin
      namespace: kube-system

And then, in your DaemonSet spec, set `serviceAccount: nvidia-device-plugin`.

This creates the ServiceAccount+RoleBinding in the kube-system
namespace - if you're deploying into another NS, swap out `kube-system` 
for the namespace you're using.
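
Swapping the namespace by hand is easy to get wrong in one of the three places it appears, so it could be done with a small filter. A minimal sketch, assuming the manifest above is saved to a file; `retarget_namespace` and the file name in the usage note are hypothetical.

```shell
# Rewrite "namespace: kube-system" to another namespace in a manifest
# read from stdin. retarget_namespace is a hypothetical helper name.
retarget_namespace() {
    target_ns="$1"
    [ -n "$target_ns" ] || return 1
    # Namespace names are DNS labels, so they cannot contain the "/"
    # used as the sed delimiter. Only whole "namespace:" lines change.
    sed "s/^\([[:space:]]*namespace:[[:space:]]*\)kube-system[[:space:]]*\$/\1${target_ns}/"
}
```

It could be used as `retarget_namespace my-namespace < nvidia-device-plugin-rbac.yaml | kubectl apply -f -`.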

Thanks to Ludovic and Kiall

So I was able to make SUSE CaaSP schedule jobs that need an NVIDIA GPU on the node that has an NVIDIA GPU :)

Here is the documentation:

> __Disclaimer__
>
> _This is a hackweek project, so this is not ready for production use. It contains hacks and workarounds just to "make it work"._
>
> _This has been tested with [SUSE CaaSP Beta 3 (public beta)](https://www.suse.com/betaprogram/caasp-beta/). I used that "kubernetes distribution" because I am
> involved in that project, so I am familiar with it (and I love it :) )_
>
> _However, the instructions here should also be valid for [openSUSE Kubic](https://kubic.opensuse.org/) and in general for any [kubernetes](https://kubernetes.io/)+[cri-o](https://cri-o.io/) distribution._

# How to set up SUSE CaaSP to work with a GPU

...
  
## Installing SUSE CaaSP

We need a cluster, that is obvious :), so let's start with installing two nodes with [SUSE CaaSP 4.0 Beta3](https://www.suse.com/betaprogram/caasp-beta/) powered by [SUSE Linux Enterprise Server 15 SP1](https://www.suse.com/products/server/), which will become our worker and master nodes:

* __gpu__: A bare metal workstation with [NVIDIA Quadro K2000 graphics card](https://www.nvidia.com/content/PDF/data-sheet/DS_NV_Quadro_K2000_OCT13_NV_US_LR.pdf)

* __master__: A virtual machine

You can do that by following the [SUSE CaaSP Beta 3 deployment instructions](https://susedoc.github.io/doc-caasp/beta/caasp-deployment/single-html/#_deployment_on_existing_sles_installation).

> __Tip__   
_Make sure you have a user "sles" which can run "sudo" without a password, and that both nodes have each other's public ssh keys in "~sles/.ssh/authorized_keys". Also, as stated in the deployment guide, make sure the ssh-agent is running, and add the hostnames to /etc/hosts so both nodes are reachable by hostname_. Then, disable _firewalld_ and enable _sshd_.

> __Tip__
_Do not configure a swap partition, and set the VM to use 2 CPUs. Otherwise, SUSE CaaSP will fail to install._
>

Finally, we need both machines to be in the same network. For this, I set up the VM to use a _macvtap host device_. You can find more info on _macvtap_ [here](https://virt.kernelnewbies.org/MacVTap). However, I just did that with _virt-manager_ on [openSUSE Leap 15.1](https://en.opensuse.org/Portal:15.1)*.

_(*): This is not precise. Actually, for whatever reason, I could not set this up with virt-manager run as a "normal user", but I could if I started virt-manager from YaST (with root permissions... might that be the reason?)_
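
virt-manager was used here, but the same kind of VM could presumably be created from the command line with `virt-install`, where a macvtap attachment is written as a `type=direct` network. The helper below only prints such a command; its name, the memory and disk sizes and all arguments in the usage note are placeholders, not values from this setup.

```shell
# Print (do not run) a virt-install command line for a VM attached to
# the host NIC through macvtap. Name, sizes and paths are placeholders.
print_macvtap_vm_command() {
    vm_name="$1"
    host_nic="$2"
    iso_path="$3"
    # Two vCPUs, as the tip above requires; the swap partition is
    # decided inside the installer, not here.
    printf '%s\n' "virt-install --name ${vm_name} --vcpus 2 --memory 4096 --disk size=40 --cdrom ${iso_path} --network type=direct,source=${host_nic},source_mode=bridge,model=virtio"
}
```

For example, `print_macvtap_vm_command master eth0 /path/to/installer.iso` prints a command that can be reviewed and then run as root.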

>__Tip__
_Do you want to know if the cluster is properly set up? Run the [k8s conformance tests](https://github.com/cncf/k8s-conformance/blob/master/instructions.md)._
>

Once we have a SUSE CaaSP cluster running, we can proceed to install the NVIDIA drivers.

## Installing the NVIDIA graphics driver kernel module

So we have a workstation with an NVIDIA GPU [compatible with](https://developer.nvidia.com/cuda-gpus) [CUDA](https://developer.nvidia.com/cuda-zone) (in our case, a Quadro K2000). Now it is time to install the right drivers so we can use it.

Drivers can be installed from the [NVIDIA download servers](https://download.nvidia.com/suse/sle15sp1/x86_64/):

    zypper ar https://download.nvidia.com/suse/sle15sp1/ nvidia
    zypper ref
    zypper install nvidia-gfxG05-kmp-default

You can check if the drivers are loaded:

    lsmod | grep nvidia
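
In a script, the same check could be done against `/proc/modules`, which is where `lsmod` gets its list. A minimal sketch: `module_loaded` is a hypothetical helper, and its second argument exists only so it can be tried against a sample file.

```shell
# Succeed if a kernel module with exactly the given name appears in a
# /proc/modules-style file. module_loaded is a hypothetical helper.
module_loaded() {
    module="$1"
    modules_file="${2:-/proc/modules}"
    # The module name is the first whitespace-separated field of a line.
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

Unlike `grep nvidia`, this matches the name exactly, so `module_loaded nvidia` does not succeed just because `nvidia_drm` is present.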

Now that we have the drivers, we can install the NVIDIA driver for computing with GPUs using CUDA.

## Installing NVIDIA driver for computing with GPUs using CUDA

After installing the NVIDIA drivers, we need to install nvidia-computeG05. Given that we set up the nvidia repo in the previous step, all we need to do is:

    zypper install nvidia-computeG05

You can check that it works by running

    nvidia-smi

and you should get output like this:

    Wed Jun 26 14:30:59 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro K2000        Off  | 00000000:05:00.0 Off |                  N/A |
    | 31%   47C    P8    N/A /  N/A |      0MiB /  1998MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
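
If the driver version is needed in a script (for instance to compare nodes), it could be pulled out of that banner. A small sketch that reads `nvidia-smi` output on stdin; `smi_driver_version` is a hypothetical helper name.

```shell
# Print the driver version found in nvidia-smi banner text on stdin.
# smi_driver_version is a hypothetical helper name.
smi_driver_version() {
    # Keep only the digits and dots following "Driver Version:".
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}
```

It would be used as `nvidia-smi | smi_driver_version`.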

## Installing libnvidia

## Installing nvidia-container-cli

## Installing nvidia-container-runtime-hook

## Creating a Service Daemon Device Nvidia

## Testing

# TODOs

* Package
* Package
 [text](link-here)

almost 7 years ago by jordimassaguerpla | Reply

about 6 years ago by jordimassaguerpla | Reply
