Enable Kata runtime to handle `disable_selinux` flag properly in order
to be able to change the status by the runtime configuration whether the
runtime applies the SELinux label to VMM process.
Fixes: #4599
Signed-off-by: Manabu Sugimoto <Manabu.Sugimoto@sony.com>
Some clients like nerdctl may pass mount type of none for volumes/bind mounts,
this will lead to container start fails.
Referring to runc, it overwrites the mount type to bind and ignores the input value.
Fixes: #4548
Signed-off-by: liubin <liubin0329@gmail.com>
The tests ensure that interactions between drop-ins and the base
configuration.toml and among drop-ins themselves work as intended,
basically that files are evaluated in the correct order (base file
first, then drop-ins in alphabetical order) and the last one to set
a specific key wins.
Signed-off-by: Pavel Mores <pmores@redhat.com>
updateFromDropIn() uses the infrastructure built by previous commits to
ensure no contents of 'tomlConfig' are lost during decoding. To do
this, we preserve the current contents of our tomlConfig in a clone and
decode a drop-in into the original. At this point, the original
instance is updated but its Agent and/or Hypervisor fields are
potentially damaged.
To merge, we update the clone's Agent/Hypervisor from the original
instance. Now the clone has the desired Agent/Hypervisor and the
original instance has the rest, so to finish, we just need to move the
clone's Agent/Hypervisor to the original.
Signed-off-by: Pavel Mores <pmores@redhat.com>
These functions take a TOML key - an array of individual components,
e.g. ["agent" "kata" "enable_tracing"], as returned by BurntSushi - and
two 'tomlConfig' instances. They copy the value of the struct field
identified by the key from the source instance to the target one if
necessary.
This is only done if the TOML key points to structures stored in
maps by 'tomlConfig', i.e. 'hypervisor' and 'agent'. Nothing needs to
be done in other cases.
Signed-off-by: Pavel Mores <pmores@redhat.com>
For 'tomlConfig' substructures stored in Golang maps - 'hypervisor' and
'agent' - BurntSushi doesn't preserve their previous contents as it does
for substructures stored directly (e.g. 'runtime'). We use reflection
to work around this.
This commit adds three primitive operations to work with struct fields
identified by their `toml:"..."` tags - one to get a field value, one to
set a field value and one to assign a source struct field value to the
corresponding field of a target.
Signed-off-by: Pavel Mores <pmores@redhat.com>
Return code is an int32 type, so if an error occurred, the default value
may be zero, this value will be created as a normal exit code.
Set return code to 255 will let the caller(for example Kubernetes) know
that there are some problems with the pod/container.
Fixes: #4419
Signed-off-by: liubin <liubin0329@gmail.com>
Prior device config move didn't update the comments. Let's address this,
and make sure comments match the new path...
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Let's add a `default_maxmemory` configuration, which allows the admins
to set the maximum amount of memory to be used by a VM, considering the
initial amount + whatever ends up being hotplugged via the pod limits.
By default this value is 0 (zero), and it means that the whole physical
RAM is the limit.
Fixes: #4516
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Now kata shim only supports stdout/stderr of fifo from
containerd/CRI-O, but shim v2 supports logging plugins,
and nerdctl default will use the binary schema for logs.
This commit will add the others type of log plugins:
- file
- binary
In case of binary, kata shim will receive a stdout/stderr like:
binary:///nerdctl?_NERDCTL_INTERNAL_LOGGING=/var/lib/nerdctl/1935db59
That means the nerdctl process will handle the logs(stdout/stderr)
Fixes: #4420
Signed-off-by: Bin Liu <bin@hyper.sh>
Policy for whats valid/invalid within the config varies by VMM, host,
and by silicon architecture. Let's keep katautils simple for just
translating a toml to the hypervisor config structure, and leave
validation to virtcontainers.
Without this change, we're doing duplicate validation.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Before, we maintained almost identical structures between our persist
API and what we keep for our devices, with the persist API being a
slight subset of device structures.
Let's deduplicate this, now that persist is importing device package.
Json unmarshal of prior persist structure will work fine, since it was
an exact subset of fields.
Fixes: #4468
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Rather than have device package depend on persist, let's define the
(almost duplicate) structures within device itself, and have the Kata
Container's persist pkg import these.
This'll help avoid unecessary dependencies within our core packages.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.
The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.
Fixes: #2266
Signed-off-by: Feng Wang <feng.wang@databricks.com>
Remove space from root span name to follow camel casing of other tracing
span names in the runtime and to make parsing easier in testing.
Fixes#4483
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
Changed bitsize for parsing functions to 64-bit in order to avoid
parsing errors.
Fixes#4435
Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
GetOOMEvent is a blocking call that will fail if
the container exit, in this case, it's not an error or warning.
Changing the log level for logs in case of GetOOMEvent call fails
will reduce log noise in a large cluster that has pods
creating/deleting frequently.
Fixes: #4376
Signed-off-by: Bin Liu <bin@hyper.sh>
Set thestop container force flag to true so that the container state is always set to
“StateStopped” after the container wait goroutine is finished. This is necessary for
the following delete container step to succeed.
Fixes: #4359
Signed-off-by: Feng Wang <feng.wang@databricks.com>
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.
Containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.
kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html
For Kata specifically, we will look for SCHED_CORE environment variable
to be set to indicate we shuold create a new schedule core domain.
This is equivalent to the containerd shim's PR: e48bbe8394Fixes: #4309
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Michael Crosby <michael@thepasture.io>
Without this, potential errors are silently dropped. Let's ensure we
return the error code as well as potenial data from the response.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Before, we had a mix of slash, etc. Unfortunately, when cleaning URL
paths, serve mux seems to mangle the request method, resulting in each
request being a GET (instead of PUT or POST).
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Add two endpoints: ip6tables, iptables.
Each url handler supports GET and PUT operations. PUT expects
the requests' data to be []bytes, and to contain iptable information in
format to be consumed by iptables-restore.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The go default http mux AFAIK doesn’t support pattern
routing so right now client is padding the url
for direct-volume stats with a subpath of the volume
path and this will always result in 404 not found returned
by the shim.
This change will update the shim to take the volume
path as a GET query parameter instead of a subpath.
If the parameter is missing or empty, then return
400 BadRequest to the client.
Fixes: #4297
Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
Let's add the newly added disk rate limiter configurations to the Cloud
Hypervisor's hypervisor configuration.
Right now those are not used anywhere, and there's absolutely no way the
users can set those up. That's coming later in this very same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
This is the disk counterpart of the what was introduced for the network
as part of the previous commits in this series.
The newly added fields are:
* DiskRateLimiterBwMaxRate, defined in bits per second, which is used to
control the network I/O bandwidth at the VM level.
* DiskRateLimiterBwOneTimeBurst, also defined in bits per second, which
is used to define an *initial* max rate, which doesn't replenish.
* DiskRateLimiterOpsMaxRate, the operations per second equivalent of the
DiskRateLimiterBwMaxRate.
* DiskRateLimiterOpsOneTimeBurst, the operations per second equivalent of
the DiskRateLimiterBwOneTimeBurst.
For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
Let's add the newly added network rate limiter configurations to the
Cloud Hypervisor's hypervisor configuration.
Right now those are not used anywhere, and there's absolutely no way the
users can set those up. That's coming later in this very same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
In a similar way to what's already exposed as RxRateLimiterMaxRate and
TxRateLimiterMaxRate, let's add four new fields to the Hypervisor's
configuration.
The values added are related to bandwidth and operations rate limiters,
which have to be added so we can expose I/O throttling configurations to
users using Cloud Hypervisor as their preferred VMM.
The reason we cannot simply re-use {Rx,Tx}RateLimiterMaxRate is because
Cloud Hypervisor exposes a single MaxRate to be used for both inbound
and outbound queues.
The newly added fields are:
* NetRateLimiterBwMaxRate, defined in bits per second, which is used to
control the network I/O bandwidth at the VM level.
* NetRateLimiterBwOneTimeBurst, also defined in bits per second, which
is used to define an *initial* max rate, which doesn't replenish.
* NetRateLimiterOpsMaxRate, the operations per second equivalent of the
NetRateLimiterBwMaxRate.
* NetRateLimiterOpsOneTimeBurst, the operations per second equivalent of
the NetRateLimiterBwOneTimeBurst.
For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.
Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
The tests in hook_test.go run a mock hook binary, which does some debug
logging to /tmp/mock_hook.log. Currently we don't clean up those logs
when the tests are done. Use a test cleanup function to do this.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
SetupOCIConfigFile creates a temporary directory with os.MkDirTemp(). This
means the callers need to register a deferred function to remove it again.
At least one of them was commented out meaning that a /temp/katatest-
directory was leftover after the unit tests ran.
Change to using t.TempDir() which as well as better matching other parts of
the tests means the testing framework will handle cleaning it up.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>