Augment the mock hypervisor so that we can validate that ACPI memory hotplug
is carried out as expected.
We'll augment the number of memory slots in the hypervisor config each
time the memory of the hypervisor is changed. In this way we can ensure
that large memory hotplugs are broken up into appropriately sized
pieces in the unit test.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
If we're using ACPI hotplug for memory, there's a limitation on the
amount of memory which can be hotplugged at a single time.
During hotplug, we'll allocate memory for the memmap for each page,
resulting in a 64 byte per 4KiB page allocation. As an example, hotplugging 12GiB
of memory requires ~192 MiB of *free* memory, which is about the limit
we should expect for an idle 256 MiB guest (conservative heuristic of 75%
of provided memory).
From experimentation, at pod creation time we can reliably add 48 times
what is provided to the guest. (a factor of 48 results in using 75% of
provided memory for hotplug). Using prior example of a guest with 256Mi
RAM, 256 Mi * 48 = 12 Gi; 12GiB is upper end of what we should expect
can be hotplugged successfully into the guest.
Note: It isn't expected that we'll need to hotplug large amounts of RAM
after workloads have already started -- container additions are expected
to occur first in pod lifecycle. Based on this, we expect that provided
memory should be freely available for hotplug.
If virtio-mem is being utilized, there isn't such a limitation - we can
hotplug the max allowed memory at a single time.
Fixes: #4847
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
It'll be useful to get the total memory provided to the guest
(hotplugged + coldplugged). We'll use this information when calcualting
how much memory we can add at a time when utilizing ACPI hotplug.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The root span should exist the duration of the trace. Defer ending span
until the end of the trace instead of end of function. Add the span to
the service struct to do so.
Fixes#4902
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
(cherry picked from commit fcc1e0c617)
In some cases do_create_container may return an error, mostly due to
`container.start(process)` call. This commit will do some rollback
works if this function failed.
Fixes: #4749
Signed-off-by: Bin Liu <bin@hyper.sh>
(cherry picked from commit 09672eb2da)
The new 'payload' interface now contains the 'kernel' and 'initramfs'
config.
Fixes: #4952
Signed-off-by: Bo Chen <chen.bo@intel.com>
(cherry picked from commit 3a597c2742)
Enable Kata runtime to handle `disable_selinux` flag properly in order
to be able to change the status by the runtime configuration whether the
runtime applies the SELinux label to VMM process.
Fixes: #4599
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
When a container terminated, we should make sure there's no processes
left after destroying the container.
Before this commit, kata-agent depended on the kernel's pidns
to destroy all of the process in a container after the 1 process
exit in a container. This is true for those container using a
separated pidns, but for the case of shared pidns within the
sandbox, the container exit wouldn't trigger the pidns terminated,
and there would be some daemon process left in this container, this
wasn't expected.
Fixes: #4663
Signed-off-by: Fupan Li <fupan.lfp@antgroup.com>
Replace `libc::setgroups()`, `libc::fchown()`, and `libc::sethostname()`
functions with nix crate ones for safety and maintainability.
Fixes: #4579
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Run the OCI `poststart` hooks must be called after the
user-specified process is executed but before the `start`
operation returns in accordance with OCI runtime spec.
Fixes: #4575
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Some clients like nerdctl may pass mount type of none for volumes/bind mounts,
this will lead to container start fails.
Referring to runc, it overwrites the mount type to bind and ignores the input value.
Fixes: #4548
Signed-off-by: liubin <liubin0329@gmail.com>
For runC, send the signal to the init process directly.
For kata, we try to send `SIGKILL` instead of `SIGTERM` when the process
has not installed the handler for `SIGTERM`.
The `is_signal_handled` function determine which signal the container
process has been handled. But currently `is_signal_handled` is only
catching (SigCgt). While the container process is ignoring (SigIgn) or
blocking (SigBlk) also should not be converted from the `SIGTERM` to
`SIGKILL`. For example, when using terminationGracePeriodSeconds the k8s
will send SIGTERM first and then send `SIGKILL`, in this case, the
container ignores the `SIGTERM`, so we should send the `SIGTERM` not the
`SIGKILL` to the container.
Fixes: #4478
Signed-off-by: quanweiZhou <quanweiZhou@linux.alibaba.com>
The tests ensure that interactions between drop-ins and the base
configuration.toml and among drop-ins themselves work as intended,
basically that files are evaluated in the correct order (base file
first, then drop-ins in alphabetical order) and the last one to set
a specific key wins.
Signed-off-by: Pavel Mores <pmores@redhat.com>
updateFromDropIn() uses the infrastructure built by previous commits to
ensure no contents of 'tomlConfig' are lost during decoding. To do
this, we preserve the current contents of our tomlConfig in a clone and
decode a drop-in into the original. At this point, the original
instance is updated but its Agent and/or Hypervisor fields are
potentially damaged.
To merge, we update the clone's Agent/Hypervisor from the original
instance. Now the clone has the desired Agent/Hypervisor and the
original instance has the rest, so to finish, we just need to move the
clone's Agent/Hypervisor to the original.
Signed-off-by: Pavel Mores <pmores@redhat.com>
These functions take a TOML key - an array of individual components,
e.g. ["agent" "kata" "enable_tracing"], as returned by BurntSushi - and
two 'tomlConfig' instances. They copy the value of the struct field
identified by the key from the source instance to the target one if
necessary.
This is only done if the TOML key points to structures stored in
maps by 'tomlConfig', i.e. 'hypervisor' and 'agent'. Nothing needs to
be done in other cases.
Signed-off-by: Pavel Mores <pmores@redhat.com>
For 'tomlConfig' substructures stored in Golang maps - 'hypervisor' and
'agent' - BurntSushi doesn't preserve their previous contents as it does
for substructures stored directly (e.g. 'runtime'). We use reflection
to work around this.
This commit adds three primitive operations to work with struct fields
identified by their `toml:"..."` tags - one to get a field value, one to
set a field value and one to assign a source struct field value to the
corresponding field of a target.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Return code is an int32 type, so if an error occurred, the default value
may be zero, this value will be created as a normal exit code.
Set return code to 255 will let the caller(for example Kubernetes) know
that there are some problems with the pod/container.
Fixes: #4419
Signed-off-by: liubin <liubin0329@gmail.com>
Prior device config move didn't update the comments. Let's address this,
and make sure comments match the new path...
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Ideally this config validation would be in a seperate package
(katautils?), but that would introduce circular dependency since we'd
call it from vc, and it depends on vc types (which, shouldn't be vc, but
probably a hypervisor package instead).
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
While working on the previous commits, some of the functions become
non-used. Let's simply remove them.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Expose the newly added `default_maxmemory` to the project's Makefile and
to the configuration files.
Fixes: #4516
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's adapt Cloud Hypervisor's and QEMU's code to properly behave to the
newly added `default_maxmemory` config.
While implementing this, a change of behaviour (or a bug fix, depending
on how you see it) has been introduced as if a pod requests more memory
than the amount avaiable in the host, instead of failing to start the
pod, we simply hotplug the maximum amount of memory available, mimicing
better the runc behaviour.
Fixes: #4516
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add a `default_maxmemory` configuration, which allows the admins
to set the maximum amount of memory to be used by a VM, considering the
initial amount + whatever ends up being hotplugged via the pod limits.
By default this value is 0 (zero), and it means that the whole physical
RAM is the limit.
Fixes: #4516
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Now kata shim only supports stdout/stderr of fifo from
containerd/CRI-O, but shim v2 supports logging plugins,
and nerdctl default will use the binary schema for logs.
This commit will add the others type of log plugins:
- file
- binary
In case of binary, kata shim will receive a stdout/stderr like:
binary:///nerdctl?_NERDCTL_INTERNAL_LOGGING=/var/lib/nerdctl/1935db59
That means the nerdctl process will handle the logs(stdout/stderr)
Fixes: #4420
Signed-off-by: Bin Liu <bin@hyper.sh>