The idea is to study how ephemeral storage for containers works and to investigate whether local storage could be avoided altogether, with Ceph used instead. Could a new storage driver that supports Ceph be developed for containers/storage? See https://github.com/containers/storage

The goal of this project is to understand the requirements for a container's ephemeral storage (performance, latency, etc.) and possibly to build a PoC storage driver for Ceph.

Looking for hackers with the skills:

ceph k8s

This project is part of:

Hack Week 19

Activity

  • over 2 years ago: denisok started this project.
  • over 2 years ago: denisok originated this project.

  • Comments

    • denisok
      over 2 years ago by denisok

      After reading about the different storage drivers and getting to know some details of containers/storage, I would first try simply mounting CephFS and pointing the vfs storage driver at that mount, to see what the performance impact is and how well it works.

      I also plan to look for benchmarks that I could use to test the solution.
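
      The experiment described above could be configured roughly as follows. This is a minimal sketch, not the setup actually used: it renders a containers-storage configuration that selects the vfs driver and places graphroot on a CephFS mount. The mount point /mnt/cephfs and the helper render_storage_conf are assumptions made for illustration; driver, runroot and graphroot are the keys documented for storage.conf.

```python
def render_storage_conf(graphroot: str,
                        runroot: str = "/run/containers/storage") -> str:
    """Render a minimal storage.conf that uses the vfs driver."""
    return (
        "[storage]\n"
        'driver = "vfs"\n'
        f'runroot = "{runroot}"\n'
        f'graphroot = "{graphroot}"\n'
    )


# Assumption: CephFS is already mounted here; the path is hypothetical.
CEPHFS_MOUNT = "/mnt/cephfs"

conf = render_storage_conf(f"{CEPHFS_MOUNT}/containers/storage")
print(conf)
```

      The rendered text would go into /etc/containers/storage.conf, or into ~/.config/containers/storage.conf for rootless use, once CephFS is mounted at the chosen path.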

    • denisok
      over 2 years ago by denisok

      I have created a virtual 3-node Ceph cluster and tried CephFS as the mount point for the storage driver's graphroot. It worked fine but was considerably slower than local (SSD) storage:

      dkondratenko@f66:~/Workspace> time podman pull opensuse/toolbox
      Trying to pull docker.io/opensuse/toolbox...
        denied: requested access to the resource is denied
      Trying to pull quay.io/opensuse/toolbox...
        unauthorized: access to the requested resource is not authorized
      Trying to pull registry.opensuse.org/opensuse/toolbox...
      Getting image source signatures
      Copying blob 6b18d304886e done
      Copying blob 32329fe4ff6e done
      Copying config d5ebabf36c done
      Writing manifest to image destination
      Storing signatures
      d5ebabf36cd400a32148eda6b97ec2989c307a4d4ebfdb03822ec6e35b0d7f36

      real 3m25.789s user 0m35.065s sys 0m9.852s

      real 0m28.302s user 0m30.095s sys 0m6.377s

      dkondratenko@f66:~/Workspace> time podman pull opensuse/busybox
      Trying to pull docker.io/opensuse/busybox...
        denied: requested access to the resource is denied
      Trying to pull quay.io/opensuse/busybox...
        unauthorized: access to the requested resource is not authorized
      Trying to pull registry.opensuse.org/opensuse/busybox...
      Getting image source signatures
      Copying blob ce98310bda16 done
      Copying config 52423b05e2 done
      Writing manifest to image destination
      Storing signatures
      52423b05e29476a7619641e48c4a5c960eb18b7edb5264c433f7af0bfeebcad0

      real 0m25.236s user 0m1.147s sys 0m0.593s

      real 0m3.696s user 0m0.907s sys 0m0.327s

      time podman run -it opensuse/toolbox /bin/sh -c 'zypper in -y bonnie; bonnie'

      Bonnie: Warning: You have 15927MiB RAM, but you test with only 100MiB datasize!
      Bonnie: This might yield unrealistically good results,
      Bonnie: for reading and seeking and writing.
      Bonnie 1.6: File './Bonnie.201', size: 104857600, volumes: 1
      Writing 25MiB with putc()... done: 64357 kiB/s 74.1 %CPU
      Rewriting 100MiB... done:2696297 kiB/s 41.2 %CPU
      Writing 100MiB intelligently... done: 558372 kiB/s 31.7 %CPU
      Reading 12MiB with getc()... done: 86603 kiB/s 100.0 %CPU
      Reading 100MiB intelligently... done:5248052 kiB/s 99.5 %CPU
      Seeker 2 1 3 4 7 6 8 5 9 10 12 11 15 16 14 13 start 'em ................
      Estimated seek time: raw 0.010ms, eff 0.009ms

                       ----Sequential Output (nosync)-----       ----Sequential Input---     --Rnd Seek-
                       -Per Char--   --Block----   -Rewrite---   -Per Char--   --Block----   --04k (16)-
      Machine    MiB   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU    /sec  %CPU
      2b337e6 1* 100   64357  74.1  558372  31.7 2696297  41.2   86603   100 5248052  99.5   96801   617

      real 0m18.456s user 0m0.127s sys 0m0.056s

      Bonnie: Warning: You have 15927MiB RAM, but you test with only 100MiB datasize!
      Bonnie: This might yield unrealistically good results,
      Bonnie: for reading and seeking and writing.
      Bonnie 1.6: File './Bonnie.202', size: 104857600, volumes: 1
      Writing 25MiB with putc()... done: 37109 kiB/s 46.4 %CPU
      Rewriting 100MiB... done:1687791 kiB/s 26.0 %CPU
      Writing 100MiB intelligently... done: 197676 kiB/s 11.3 %CPU
      Reading 12MiB with getc()... done: 87179 kiB/s 100.0 %CPU
      Reading 100MiB intelligently... done:6430948 kiB/s 99.4 %CPU
      Seeker 1 3 2 4 6 5 8 7 9 10 12 11 14 13 15 16 start 'em ................
      Estimated seek time: raw 1.088ms, eff 1.087ms

                       ----Sequential Output (nosync)-----       ----Sequential Input---     --Rnd Seek-
                       -Per Char--   --Block----   -Rewrite---   -Per Char--   --Block----   --04k (16)-
      Machine    MiB   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU   kiB/s  %CPU    /sec  %CPU
      b653a06 1* 100   37109  46.4  197676  11.3 1687791  26.0   87179   100 6430948  99.4     919   5.3

      real 0m51.470s user 0m0.113s sys 0m0.094s
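
      To put the paired timings above into perspective, the wall-clock ("real") values can be turned into slowdown factors. This is only a rough sketch: the transcript does not label which run of each pair used CephFS, so the code assumes the slower one did. The helper names parse_real and slowdown are made up for this example.

```python
import re


def parse_real(value: str) -> float:
    """Convert a `time`-style duration such as '3m25.789s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unexpected duration format: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# "real" values copied from the transcript, in the order they appear.
timings = {
    "pull opensuse/toolbox": ("3m25.789s", "0m28.302s"),
    "pull opensuse/busybox": ("0m25.236s", "0m3.696s"),
    "run bonnie in toolbox": ("0m18.456s", "0m51.470s"),
}


def slowdown(pair: tuple) -> float:
    """Ratio of the slower run to the faster one.

    Assumption: the slower run of each pair is the CephFS-backed one.
    """
    first, second = (parse_real(p) for p in pair)
    return max(first, second) / min(first, second)


for name, pair in timings.items():
    print(f"{name}: {slowdown(pair):.1f}x slower")
```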

    • denisok
      over 2 years ago by denisok

      So there are two main points to investigate here:

      • unpacking (tar) to CephFS is quite slow
      • operations inside the container image are too slow

      As for unpacking to CephFS: first of all, it is slow because Ceph runs in VMs here and cannot deliver real performance that way.

      Second, with CephFS a given image layer would only need to be unpacked once; after that, the same layer would not need unpacking again and could simply be mounted. Running a container from CephFS doesn't have such a dramatic performance impact:

      dkondratenko@f66:~/Workspace> time podman run -it opensuse/toolbox /bin/sh -c 'echo Hello world!'
      Hello world!

      real 0m0.448s user 0m0.097s sys 0m0.088s

      real 0m1.470s user 0m0.117s sys 0m0.079s
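
      The "unpack only once" idea mentioned above can be sketched as a small helper that keys unpacked layers by digest on a shared mount. This is an illustration under assumptions, not code from containers/storage: the directory layout, the ensure_layer helper, the digest-style directory name and the fake unpack callback are all hypothetical, and a temporary directory stands in for the CephFS mount. It also assumes a single writer per layer.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def ensure_layer(shared_root: Path, digest: str,
                 unpack: Callable[[Path], None]) -> bool:
    """Make sure shared_root/<digest> holds the unpacked layer.

    Returns True if this call unpacked the layer, False if it was
    already present and can simply be reused.
    """
    layer_dir = shared_root / digest
    if layer_dir.is_dir():
        return False
    # Unpack into a staging directory, then rename it into place so a
    # half-unpacked layer never appears under its final name.
    staging = Path(tempfile.mkdtemp(prefix=f".{digest}.", dir=shared_root))
    unpack(staging)
    os.rename(staging, layer_dir)
    return True


# Demo: a temporary directory stands in for the shared CephFS mount.
with tempfile.TemporaryDirectory() as root:
    calls = []

    def fake_unpack(target: Path) -> None:
        """Stand-in for extracting a layer tarball into target."""
        calls.append(target)
        (target / "etc").mkdir()

    first = ensure_layer(Path(root), "sha256-6b18d304886e", fake_unpack)
    second = ensure_layer(Path(root), "sha256-6b18d304886e", fake_unpack)
    print(first, second, len(calls))
```

      In the demo the second call finds the layer already in place and skips the unpack step, which is the saving a shared layer store would aim for.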

      The next point is the container layer, which shouldn't be persistent but consumes quite a lot of space and performance, as in the bonnie example, where it is used both to install bonnie and to run the tests. That layer could be mounted on a local FS and deleted after each run.

      CephFS performance could also be improved by caching: images are persistent, so techniques similar to ceph-immutable-object-cache could be used.

      I tried pulling some images to local storage and mounting CephFS layers in place of some of the local layers, but such a container failed to start. I tracked it down to a simple container that tries to start with a pre-mounted fuse-overlayfs mount containing both CephFS layers and local ones, but such a container doesn't see the right mount point. I wasn't able to track it down any further.

      Next steps:

      • try a hardware Ceph cluster and see how big the performance difference is
      • find out whether CephFS could be mounted together with local layers as overlayfs in the namespace
      • implement a storage driver that provides a hybrid approach: CephFS for the common image layers and a local layer for the container layer
      • find out whether implementing an on-prem registry that acts as a proxy and unpacks layers to CephFS makes sense
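
      The hybrid approach from the list above would boil down to an overlay mount whose read-only lower layers live on CephFS while the writable upper layer stays local. The sketch below only builds the mount command; whether such a mount works with CephFS lower layers is exactly the open question, so nothing here has been verified against a real cluster. All paths and the helper overlay_mount_cmd are hypothetical.

```python
import shlex


def overlay_mount_cmd(lower_layers, upperdir, workdir, merged):
    """Build an overlay mount command as an argument list.

    lower_layers is ordered top-most first, matching overlayfs, where
    the left-most lowerdir entry is the upper-most read-only layer.
    upperdir and workdir must be on the same (local) filesystem.
    """
    opts = (f"lowerdir={':'.join(lower_layers)},"
            f"upperdir={upperdir},workdir={workdir}")
    return ["mount", "-t", "overlay", "overlay", "-o", opts, merged]


cmd = overlay_mount_cmd(
    # Shared, read-only image layers on a hypothetical CephFS mount.
    lower_layers=["/mnt/cephfs/layers/b", "/mnt/cephfs/layers/a"],
    # Per-container writable layer on local storage, removed after the run.
    upperdir="/var/tmp/ctr1/upper",
    workdir="/var/tmp/ctr1/work",
    merged="/var/tmp/ctr1/merged",
)
print(shlex.join(cmd))
```

      Keeping the upper layer local matches the observation that the container layer is not meant to persist and can be deleted after each run.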
