* [CPU constraints in Kata Containers](#cpu-constraints-in-kata-containers)
    * [Default number of virtual CPUs](#default-number-of-virtual-cpus)
    * [Virtual CPUs and Kubernetes pods](#virtual-cpus-and-kubernetes-pods)
    * [Container lifecycle](#container-lifecycle)
    * [Container without CPU constraint](#container-without-cpu-constraint)
    * [Container with CPU constraint](#container-with-cpu-constraint)
    * [Do not waste resources](#do-not-waste-resources)
    * [CPU cgroups](#cpu-cgroups)
        * [cgroups in the guest](#cgroups-in-the-guest)
            * [CPU pinning](#cpu-pinning)
        * [cgroups in the host](#cgroups-in-the-host)

# CPU constraints in Kata Containers

## Default number of virtual CPUs

Before starting a container, the [runtime][6] reads the `default_vcpus` option
from the [configuration file][7] to determine the number of virtual CPUs
(vCPUs) needed to start the virtual machine. By default, `default_vcpus` is
equal to 1 for fast boot time and a small memory footprint per virtual machine.
Be aware that increasing this value negatively impacts the virtual machine's
boot time and memory footprint.

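For reference, the option is set in the runtime configuration file. The path
below is the usual default install location and the value shown assumes the
stock configuration; both may differ on your system:

```sh
$ grep '^default_vcpus' /usr/share/defaults/kata-containers/configuration.toml
default_vcpus = 1
```
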
In general, we recommend that you do not edit this variable unless you know
what you are doing. If your container needs more than one vCPU, use
[docker `--cpus`][1], [docker update][4], or [kubernetes `cpu` limits][2] to
assign more resources.

*Docker*

```sh
$ docker run --name foo -ti --cpus 2 debian bash
$ docker update --cpus 4 foo
```

*Kubernetes*

```yaml
# ~/cpu-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: sandbox
spec:
  containers:
  - name: cpu0
    image: vish/stress
    resources:
      limits:
        cpu: "3"
    args:
    - -cpus
    - "5"
```

```sh
$ sudo -E kubectl create -f ~/cpu-demo.yaml
```

## Virtual CPUs and Kubernetes pods

A Kubernetes pod is a group of one or more containers, with shared storage and
network, and a specification for how to run the containers [[specification][3]].
In Kata Containers this group of containers, which is called a sandbox, runs inside
the same virtual machine. If you do not specify a CPU constraint, the runtime does
not add more vCPUs and the container is not placed inside a CPU cgroup.
Instead, the container uses the number of vCPUs specified by `default_vcpus`
and shares these resources with other containers in the same situation
(without a CPU constraint).

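For illustration, the following is a minimal pod specification with two
unconstrained containers (the pod name, container names, and workload are only
examples). Both containers run inside the same virtual machine and share the
`default_vcpus` vCPUs:

```yaml
# ~/shared-vcpus-demo.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: shared-vcpus-demo
spec:
  containers:
  # Neither container sets a CPU limit, so no vCPUs are hot added; both
  # containers share the sandbox's default vCPUs.
  - name: workload-a
    image: vish/stress
    args: ["-cpus", "1"]
  - name: workload-b
    image: vish/stress
    args: ["-cpus", "1"]
```
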
## Container lifecycle

When you create a container with a CPU constraint, the runtime adds the
number of vCPUs required by the container. Similarly, when the container terminates,
the runtime removes these resources.

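For example, assuming `default_vcpus` is equal to 1 and Docker is configured to
use the Kata runtime, the hot added vCPUs are visible inside the container and
are released when the container is removed (a sketch; the container name is
arbitrary):

```sh
$ docker run -d --name bar --runtime=kata-runtime --cpus 2 debian sleep infinity
$ docker exec bar nproc    # 1 default vCPU + 2 hot added for the constraint
3
$ docker rm -f bar         # removing the container releases the added vCPUs
```
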
## Container without CPU constraint

A container without a CPU constraint uses the default number of vCPUs specified
in the configuration file. In the case of Kubernetes pods, containers without a
CPU constraint share the default number of vCPUs between them. For example, if
`default_vcpus` is equal to 1 and you have 2 containers without CPU constraints,
each trying to consume 100% of a vCPU, the vCPU time is divided in two parts,
50% for each container, because the virtual machine does not have enough
resources to satisfy both containers' needs. If you want to give a greater or
lesser portion of vCPU time to a specific container, use
[docker `--cpu-shares`][1] or [Kubernetes `cpu` requests][2].

*Docker*

```sh
$ docker run -ti --cpu-shares=512 debian bash
```

*Kubernetes*

```yaml
# ~/cpu-demo.yaml
apiVersion: v1
kind: Pod
metadata:
  name: cpu-demo
  namespace: sandbox
spec:
  containers:
  - name: cpu0
    image: vish/stress
    resources:
      requests:
        cpu: "0.7"
    args:
    - -cpus
    - "3"
```

```sh
$ sudo -E kubectl create -f ~/cpu-demo.yaml
```

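As a sketch of how the weighting works (the pod and container names are only
examples), two containers in the same pod with different `cpu` requests and no
limits share the default vCPUs in roughly a 2:1 ratio when both are CPU bound:

```yaml
# ~/cpu-requests-demo.yaml (illustrative)
apiVersion: v1
kind: Pod
metadata:
  name: cpu-requests-demo
spec:
  containers:
  # No CPU limits, so no extra vCPUs are hot added; the requests only weight
  # access to the shared default vCPUs.
  - name: heavy
    image: vish/stress
    resources:
      requests:
        cpu: "500m"
    args: ["-cpus", "1"]
  - name: light
    image: vish/stress
    resources:
      requests:
        cpu: "250m"
    args: ["-cpus", "1"]
```
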
Before running containers without a CPU constraint, consider that your containers
are not running alone. Since your containers run inside a virtual machine, other
processes use the vCPUs as well (e.g. `systemd` and the Kata Containers
[agent][5]). In general, we recommend setting `default_vcpus` equal to 1 to
allow non-container processes to run on this vCPU and to specify a CPU
constraint for each container. If your container is already running and needs
more vCPUs, you can add more using [docker update][4].

## Container with CPU constraint

The runtime calculates the number of vCPUs required by a container with CPU
constraints using the following formula: `(quota + (period - 1)) / period`, where
`quota` specifies the number of microseconds per CPU period that the container is
guaranteed CPU access and `period` specifies the CPU CFS scheduler period in
microseconds. The result determines the number of vCPUs to hot plug into the
virtual machine. Once the vCPUs have been added, the [agent][5] places the
container inside a CPU cgroup. This placement allows the container to use only
its assigned resources.

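For example, `docker run --cpus 2.5` sets `quota` to 250000 and `period` to
100000, so the runtime hot plugs three vCPUs; the `(period - 1)` term makes the
integer division round up:

```sh
# (quota + (period - 1)) / period, using integer division
$ echo $(( (250000 + (100000 - 1)) / 100000 ))
3
```

With `default_vcpus` equal to 1, the virtual machine then runs with four vCPUs
in total.
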
## Do not waste resources

If you already know the number of vCPUs needed for each container and pod, or
just want to run them with the same number of vCPUs, you can specify that
number using the `default_vcpus` option in the configuration file; each virtual
machine then starts with that number of vCPUs. One limitation of this approach is
that these vCPUs cannot be removed later and you might be wasting
resources. For example, if you set `default_vcpus` to 8 and run only one
container with a CPU constraint of 1 vCPU, you might be wasting 7 vCPUs since
the virtual machine starts with 8 vCPUs and 1 vCPU is added and assigned
to the container. Non-container processes might be able to use 8 vCPUs, but they
use a maximum of 1 vCPU, hence 7 vCPUs might go unused.

*Container without CPU constraint*

```sh
$ docker run -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_*"
1       # number of vCPUs
100000  # cfs period
-1      # cfs quota
```

*Container with CPU constraint*

```sh
$ docker run --cpus 4 -ti debian bash -c "nproc; cat /sys/fs/cgroup/cpu,cpuacct/cpu.cfs_*"
5       # number of vCPUs
100000  # cfs period
400000  # cfs quota
```

## CPU cgroups

Kata Containers runs over two layers of cgroups. The first layer is in the guest,
where only the workload is placed. The second layer is in the host; it is more
complex and might contain more than one process and task (thread), depending on
the number of containers per pod and vCPUs per container. The following diagram
represents an nginx container created with `docker` with the default number of
vCPUs.

```
$ docker run -dt --runtime=kata-runtime nginx

       .-------.
       | nginx |
    .--'-------'---.  .------------.
    | Guest Cgroup |  | Kata agent |
  .-'--------------'--'------------'-.    .-----------.
  |    Thread: Hypervisor's vCPU 0   |    | Kata Shim |
 .'----------------------------------'.  .'-----------'.
 |               Tasks                |  |  Processes  |
.'------------------------------------'--'-------------'.
|                      Host Cgroup                       |
'--------------------------------------------------------'
```

The next sections explain the difference between processes and tasks and why only hypervisor
vCPUs are constrained.

### cgroups in the guest

Only the workload process, including all its threads, is placed into CPU cgroups.
This means that `kata-agent` and `systemd` run without constraints in the guest.

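To see this from inside a running container, you can list the cgroups the
workload process belongs to (a sketch; `<container>` is a placeholder and the
exact paths depend on the guest kernel and agent version):

```sh
# Shows the cgroup hierarchies the workload process is a member of in the guest.
$ docker exec <container> cat /proc/self/cgroup
```
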
#### CPU pinning

Kata Containers tries to apply and honor the cgroups, but sometimes that is not possible.
An example of this occurs with CPU cgroups when the number of virtual CPUs (in the guest)
does not match the actual number of physical host CPUs.
To achieve good performance and a small memory footprint, Kata Containers hot adds
resources only when they are needed, so the number of virtual resources is not the same
as the number of physical resources. The problem with this approach is that it is not possible
to pin a process to a specific resource that is not present in the guest. To deal with this
limitation and to not fail when the container is being created, Kata Containers does not apply
the constraint in the first layer (guest) if the resource does not exist in the guest, but it
is applied in the second layer (host) where the hypervisor is running. The constraint is applied
in both layers when the resource is available in the guest and host. The next sections provide
further details on what parts of the hypervisor are constrained.

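For instance (an illustrative command, assuming `default_vcpus` is equal to 1),
a cpuset request that names CPUs the guest does not have cannot be honored
inside the guest, so it is only applied to the hypervisor on the host:

```sh
# The guest boots with a single vCPU, so pinning the workload to CPUs 2-3 is
# not possible in the first layer (guest); the constraint is applied in the
# second layer (host) instead.
$ docker run -ti --runtime=kata-runtime --cpuset-cpus 2,3 debian nproc
```
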
### cgroups in the host

In Kata Containers the workloads run in a virtual machine that is managed and represented by a
hypervisor running in the host. Like other processes, the hypervisor might use threads to perform
several tasks, for example I/O and network operations. One of the most important uses for these
threads is as vCPUs. The processes running in the guest see these vCPUs as physical CPUs, while
in the host those vCPUs are just threads that are part of a process. This is the key to ensuring
a workload consumes only the amount of CPU resources assigned to it without impacting
other operations. From the user's perspective, the simplest approach would be to take the
whole hypervisor, including its threads, and move it into the cgroup. Unfortunately, this would
negatively impact performance, since the vCPU, I/O, and network threads would be competing for
resources. The following table shows a random read performance comparison between a Kata Container
with all its hypervisor threads in the cgroup and another with only its hypervisor vCPU threads
constrained; the difference is huge.

| Block size | All threads | vCPU threads | Units |
|:----------:|:-----------:|:------------:|:-----:|
| 4k         | 136.2       | 294.7        | MB/s  |
| 8k         | 166.6       | 579.4        | MB/s  |
| 16k        | 178.3       | 1093.3       | MB/s  |
| 32k        | 179.9       | 1931.5       | MB/s  |
| 64k        | 213.6       | 3994.2       | MB/s  |

To have the best performance in Kata Containers, only the vCPU threads are constrained.

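One way to see this split on the host is to compare the threads of the hypervisor
process with the thread IDs placed in the sandbox's CPU cgroup. This is a hedged
sketch: cgroup v1 is assumed, QEMU is assumed as the hypervisor, and
`<cgroup-path>` is a placeholder for the sandbox's cgroup directory, which
depends on the runtime configuration:

```sh
# All threads of the hypervisor process (vCPU, I/O, and other threads).
$ HYP_PID=$(pgrep -f qemu-system | head -n 1)
$ ls /proc/${HYP_PID}/task

# Thread IDs actually constrained by the CPU cgroup (the vCPU threads only).
$ cat /sys/fs/cgroup/cpu,cpuacct/<cgroup-path>/tasks
```
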
[1]: https://docs.docker.com/config/containers/resource_constraints/#cpu
[2]: https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource
[3]: https://kubernetes.io/docs/concepts/workloads/pods/pod/
[4]: https://docs.docker.com/engine/reference/commandline/update/
[5]: https://github.com/kata-containers/agent
[6]: https://github.com/kata-containers/runtime
[7]: https://github.com/kata-containers/runtime#configuration