Commit Graph

1621 Commits

Author SHA1 Message Date
Bin Liu
81acfc1286 Merge pull request #4425 from liubin/fix/4376-change-log-level-of-getoomevent
shim: change the log level for GetOOMEvent call failures
2022-06-13 17:53:11 +08:00
James O. D. Hunt
9b93db0220 Merge pull request #4417 from jodh-intel/docs-monitor-considerations
docs: Add more kata monitor details
2022-06-13 10:51:52 +01:00
Fabiano Fidêncio
1ef0b7ded0 runtime: Switch to using the rust version of virtiofsd (all but power)
So far this has been done for x86_64.  Now that the support for building
and testing has been added for all arches, let's do the second part of
the switch.

We're still not done yet for powerpc, as some a virtifosd crash on the
rust version has been found by the maintainer.

Fixes: #4258, #4260

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-13 10:41:26 +02:00
Alexandru Matei
721ca72a64 runtime: fix error when trying to parse sandbox sizing annotations
Changed bitsize for parsing functions to 64-bit in order to avoid
parsing errors.

Fixes #4435

Signed-off-by: Alexandru Matei <alexandru.matei@uipath.com>
2022-06-11 18:51:10 +03:00
Archana Shinde
aefe11b9ba Merge pull request #4331 from dgibson/config-enable-iommu-annotation
Allow io.katacontainers.config.hypervisor.enable_iommu annotation by …
2022-06-10 17:43:27 -07:00
James O. D. Hunt
412441308b docs: Add more kata monitor details
Add more detail to the `kata-monitor` doc to allow an admin to make a
more informed decision about where and how to run the daemon.

Fixes: #4416.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2022-06-09 09:20:11 +01:00
Eric Ernst
4ebf9d38b9 Merge pull request #4310 from egernst/core-sched
shim: add support for core scheduling
2022-06-08 17:42:45 +02:00
Bin Liu
eff4e1017d shim: change the log level for GetOOMEvent call failures
GetOOMEvent is a blocking call that will fail if
the container exit, in this case, it's not an error or warning.

Changing the log level for logs in case of GetOOMEvent call fails
will reduce log noise in a large cluster that has pods
creating/deleting frequently.

Fixes: #4376

Signed-off-by: Bin Liu <bin@hyper.sh>
2022-06-08 22:17:24 +08:00
dependabot[bot]
5d7fb7b7b0 build(deps): bump github.com/containerd/containerd in /src/runtime
Bumps [github.com/containerd/containerd](https://github.com/containerd/containerd) from 1.6.1 to 1.6.6.
- [Release notes](https://github.com/containerd/containerd/releases)
- [Changelog](https://github.com/containerd/containerd/blob/main/RELEASES.md)
- [Commits](https://github.com/containerd/containerd/compare/v1.6.1...v1.6.6)

---
updated-dependencies:
- dependency-name: github.com/containerd/containerd
  dependency-type: direct:production
...

Fixes: #4421
Signed-off-by: dependabot[bot] <support@github.com>
2022-06-08 10:54:46 +03:00
David Gibson
8f10e13e07 config: Allow enable_iommu pod annotation by default
Since #902 the `io.katacontainers.config.hypervisor` pod annotations
have only been permitted if explicitly allowed in the global
configuration.  The default global configuration allows no such
annotations.  That's important because several of those annotations
would cause Kata to execute arbitrary binaries, and so were wildly
unsafe.

However, this is inconvenient for the
`io.katacontainers.config.hypervisor.enable_iommu` annotation
specifically, which controls whether the sandbox VM includes a vIOMMU.
A guest side vIOMMU is necessary to implement VFIO passthrough devices
with `vfio_mode = vfio`, so enabling that mode of operation currently
requires a global configuration change, and can't just be enabled
per-pod.

Unlike some of the other hypervisor annotations, the `enable_iommu`
annotation is quite safe.  By default the vIOMMU is not present, so
allowing a user to override it for a pod only improves their
facilities for isolation.  Even if the global default were changed to
enable the vIOMMU, that doesn't compel the guest kernel to use it, so
allowing a user to disable the vIOMMU doesn't materially affect
isolation either.

Therefore, allow the io.katacontainers.config.hypervisor.enable_iommu
annotation to work in the default configurations.

fixes #4330

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-06-04 13:02:05 +10:00
Eric Ernst
430da47215 Merge pull request #4360 from fengwang666/shim-leak
runtime: ignore ESRCH error from stop container
2022-06-02 12:42:19 -07:00
Feng Wang
9726f56fdc runtime: force stop container after the container process exits
Set thestop container force flag to true so that the container state is always set to
“StateStopped” after the container wait goroutine is finished. This is necessary for
the following delete container step to succeed.

Fixes: #4359

Signed-off-by: Feng Wang <feng.wang@databricks.com>
2022-06-02 08:17:08 -07:00
Eric Ernst
d2df1209a5 docs: describe kata handling for core-scheduling
Add initial documentation for core-scheduling.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 16:17:00 -07:00
Michael Crosby
22b6a94a84 shim: add support for core scheduling
In linux 5.14 and hopefully some backports, core scheduling allows processes to
be co scheduled within the same domain on SMT enabled systems.

Containerd impl sets the core sched domain when launching a shim. This
allows a clean way for each shim(container/pod) to be in its own domain and any
additional containers, (v2 pods) be be launched with the same domain as well as
any exec'd process added to the container.

kernel docs: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/core-scheduling.html

For Kata specifically, we will look for SCHED_CORE environment variable
to be set to indicate we shuold create a new schedule core domain.

This is equivalent to the containerd shim's PR: e48bbe8394

Fixes: #4309

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Signed-off-by: Michael Crosby <michael@thepasture.io>
2022-05-31 10:10:40 -07:00
Eric Ernst
65f0cef16c kata-runtime: add iptables CLI to test http endpoint
While end users can connect directly to the shim, let's provide a way to
easily get/set iptables from kata-runtime itself.

Fixes: #4080
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
3201ad0830 shim-client: ensure we check resp status for Put/Post
Without this, potential errors are silently dropped. Let's ensure we
return the error code as well as potenial data from the response.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
0706fb28ac kata-runtime: shmgmt: make url usage consistent
Before, we had a mix of slash, etc. Unfortunately, when cleaning URL
paths, serve mux seems to mangle the request method, resulting in each
request being a GET (instead of PUT or POST).

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
2a09378dd9 shim-client: add support for DoPut
While at it, make sure we check for nil in DoPost

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
640173cfc2 shim-mgmt: Add endpoint handler for interacting with iptables
Add two endpoints: ip6tables, iptables.

Each url handler supports GET and PUT operations. PUT expects
the requests' data to be []bytes, and to contain iptable information in
format to be consumed by iptables-restore.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
0136be22ca virtcontainers: plumb iptable set/get from sandbox to agent
Introduce get/set iptable handling. We add a sandbox API for getting and
setting the IPTables within the guest. This routes it from sandbox
interface, through kata-agent, ultimately making requests to the guest
agent.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
03176a9e09 proto: update generated code based on proto update
Update the generated agent.pb.go code based on proto update.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 08:45:59 -07:00
Fabiano Fidêncio
fff832874e clh: Update to v24.0
This release has been tracked through the v24.0 project.

virtio-iommu specification describes how a device can be attached by default
to a bypass domain. This feature is particularly helpful for booting a VM with
guest software which doesn't support virtio-iommu but still need to access
the device. Now that Cloud Hypervisor supports this feature, it can boot a VM
with Rust Hypervisor Firmware or OVMF even if the virtio-block device exposing
the disk image is placed behind a virtual IOMMU.

Multiple checks have been added to the code to prevent devices with identical
identifiers from being created, and therefore avoid unexpected behaviors at boot
or whenever a device was hot plugged into the VM.

Sparse mmap support has been added to both VFIO and vfio-user devices. This
allows the device regions that are not fully mappable to be partially mapped.
And the more a device region can be mapped into the guest address space, the
fewer VM exits will be generated when this device is accessed. This directly
impacts the performance related to this device.

A new serial_number option has been added to --platform, allowing a user to
set a specific serial number for the platform. This number is exposed to the
guest through the SMBIOS.

* Fix loading RAW firmware (#4072)
* Reject compressed QCOW images (#4055)
* Reject virtio-mem resize if device is not activated (#4003)
* Fix potential mmap leaks from VFIO/vfio-user MMIO regions (#4069)
* Fix algorithm finding HOB memory resources (#3983)

* Refactor interrupt handling (#4083)
* Load kernel asynchronously (#4022)
* Only create ACPI memory manager DSDT when resizable (#4013)

Deprecated features will be removed in a subsequent release and users should
plan to use alternatives

* The mergeable option from the virtio-pmem support has been deprecated
(#3968)
* The dax option from the virtio-fs support has been deprecated (#3889)

Fixes: #4317

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-05-26 08:51:18 +00:00
Eric Ernst
6d00701ec9 Merge pull request #4298 from yibozhuang/fix-direct-volume
Fix issues with direct-volume stats feature
2022-05-23 15:23:51 -07:00
Yibo Zhuang
4428ceae16 runtime: direct-volume stats use correct name
Today the shim does a translation when doing
direct-volume stats where it takes the source and
returns the mount path within the guest.

The source for a direct-assigned volume is actually
the device path on the host and not the publish
volume path.

This change will perform a lookup of the mount info
during direct-volume stats to ensure that the
device path is provided to the shim for querying
the volume stats.

Fixes: #4297

Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
2022-05-20 18:42:47 -07:00
Yibo Zhuang
ffdc065b4c runtime: direct-volume stats update to use GET parameter
The go default http mux AFAIK doesn’t support pattern
routing so right now client is padding the url
for direct-volume stats with a subpath of the volume
path and this will always result in 404 not found returned
by the shim.

This change will update the shim to take the volume
path as a GET query parameter instead of a subpath.
If the parameter is missing or empty, then return
400 BadRequest to the client.

Fixes: #4297

Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
2022-05-20 18:41:51 -07:00
Yibo Zhuang
f295953183 runtime: fix incorrect Action function for direct-volume stats
The action function expects a function that returns error
but the current direct-volume stats Action returns
(string, error) which is invalid.

This change fixes the format and print out the stats from
the command instead.

Fixes: #4293

Signed-off-by: Yibo Zhuang <yibzhuang@gmail.com>
2022-05-20 14:55:00 -07:00
Peng Tao
2c238c8504 Merge pull request #4213 from zvonkok/vfio
runtime: Adding the correct detection of mediated PCIe devices
2022-05-20 15:00:23 +08:00
Fabiano Fidêncio
811ac6a8ce Merge pull request #4282 from r4f4/runtime-dedup-types-import
runtime: remove duplicate 'types' import
2022-05-19 22:15:36 +02:00
Chelsea Mafrica
d8be0f8e9f Merge pull request #4281 from r4f4/runtime-qemu-comments
runtime: sync docstrings with function names
2022-05-19 09:17:38 -07:00
Rafael Fonseca
7a5ccd1264 runtime: sync docstrings with function names
The functions were renamed but their docstrings were not.

Fixes #4006

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-19 14:31:47 +02:00
Greg Kurz
fa61bd43ee Merge pull request #4238 from snir911/wip/legacy_console
qemu: allow using legacy serial device for the console
2022-05-19 14:30:59 +02:00
Rafael Fonseca
ce2e521a0f runtime: remove duplicate 'types' import
Fallout of 09f7962ff

Fixes #4285

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-19 13:49:47 +02:00
Snir Sheriber
f4994e486b runtime: allow annotation configuration to use_legacy_serial
and update the docs and test

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-18 18:58:21 +03:00
Fabiano Fidêncio
c88a48be21 Merge pull request #4271 from r4f4/runtime-err-check-fix
runtime: do not check for EOF error in console watcher
2022-05-18 09:49:48 +02:00
GabyCT
12f0ab120a Merge pull request #4191 from dgibson/go-test-script
Improve Go unit test script
2022-05-17 10:27:04 -05:00
Rafael Fonseca
8052fe62fa runtime: do not check for EOF error in console watcher
The documentation of the bufio package explicitly says

"Err returns the first non-EOF error that was encountered by the
Scanner."

When io.EOF happens, `Err()` will return `nil` and `Scan()` will return
`false`.

Fixes #4079

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-17 15:14:33 +02:00
Snir Sheriber
c67b9d2975 qemu: allow using legacy serial device for the console
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added

Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:06:11 +03:00
Snir Sheriber
44814dce19 qemu: treat console kernel params within appendConsole
as it is tightly coupled with the appended console device
additionally have it tested

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:05:31 +03:00
Fabiano Fidêncio
c39852e83f runtime: Use ${LIBEXEC}/virtiofsd as the default virtiofsd path
As now we build and ship the rust version of virtiofsd, which is not
tied to QEMU, we need to update its default location to match with where
we're installing this binary.

Fixes: #4249

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-05-16 09:30:24 +02:00
David Gibson
e73b70baff runtime: Don't run unit tests verbose by default
go-test.sh by default adds the -v option to 'go test' meaning that output
will be printed from all the passing tests as well as any failing ones.
This results in a lot of output in which it's often difficult to locate the
failing tests you're interested in.

So, remove -v from the default flags.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:22:31 +10:00
David Gibson
f24a6e761f runtime: Consolidate flags setting in unit tests script
One of the responsibilities of the go-test.sh script is setting up the
default flags for 'go test'.  This is constructed across several different
places in the script using several unneeded intermediate variables though.

Consolidate all the flag construction into one place.

fixes #4190

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:22:29 +10:00
David Gibson
cf465feb02 runtime: Don't change test behaviour based on $CI or $KATA_DEV_MODE
go-test.sh changes behaviour based on both the $CI and $KATA_DEV_MODE
variables, but not in a way that makes a lot of sense.

If either one is set it uses the test_coverage path, instead of the
test_local path.  That collects coverage information, as the name
suggests, but it also means it runs the tests twice as root and
non-root, which is very non-obvious.

It's not clear what use case the test_local path is for at all.
Developer local builds will typically have $KATA_DEV_MODE set and CI
builds will have $CI set.  There's essentially no downside to running
coverage all the time - it has little impact on the test runtime.

In addition, if *both* $CI and $KATA_DEV_MODE are set, the script
refuses to run things as root, considering it "unsafe".  While having
both set might be unwise in a general sense, there's not really any
way running sudo can be any more unsafe than it is with either one
set.

So, simplify everything by just always running the test_coverage path.
This leaves the test_local path unused, so we can remove it entirely.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
34c4ac599c runtime: Remove redundant subcommands from go-test.sh
go-test.sh accepts subcommands, however invoking it in the usual way via
the Makefile doesn't use them.  In fact the only remaining subcommand is
"help" and we already have another way of getting the usage information
(-h or --help).  We don't need a second way, so just drop subcommand
handling.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
0aff5aaa39 runtime: Simplify package listing in go-test.sh
go-test.sh defaults to testing all the packages listed by go list, except
for a number filtered out.  It turns out that none of those filters are
necessary any more:
  * We've long required a Go newer than 1.9 which means the vendor filter
    isn't needed
  * The agent filter doesn't do anything now that we've moved to the Kata
    2.x unified repo
  * The tests filters don't hit anything on the list of modules in
    src/runtime (which is the only user of the script)

But since we don't need to filter anything out any more, we don't even need
to iterate through a list ourselves.  We can simply pass "./..." directly
to go test and it will iterate through all the sub-packages itself.

Interestingly this more than doubles the speed of "make test" for me - I
suspect because go test's internal paralellism works better over a larger
pool of tests.

This also lets us remove handling of non-existent coverage files from
test_go_package(), since with default options we will no longer test packages without tests
by default.  If the user explicitly requests testing of a package with no
tests, then failing makes sense.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
557c4cfd00 runtime: Don't chmod coverage files in Go tests
The go-test.sh script has an explicit chmod command, run as root, to
set the mode of the temporary coverage files to 0644.  AFAICT the
point of this is specifically the 004 bit allowing world read access,
so that we can then merge the temporary coverage file into the main
coverage file.

That's a convoluted way of doing things.  Instead we can just run the tail
command which reads the temporary file as the same user that generated it.

In addition, go-test.sh became root to remove that temporary coverage
file.  This is not necessary, since deleting a regular file just requires
write access to the directory, not the file itself.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
04c8b52e04 runtime: Remove HTML coverage option from go-test.sh
The html-coverage option to this script doesn't really alter behaviour
it just does the same thing as normal coverage, then converts the
report to HTML.  That conversion is a single command, plus a chmod to
make the final output mode 0644.  That overrides any umask the user
has set, which doesn't seem like a policy decision this script should
be making.

Nothing in the kata-containers or tests repository uses this, so it doesn't
really make sense to keep this logic inside this script.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
7f76914422 runtime: Add coverage.txt.tmp to gitignore
In addition to coverage.txt, the go-test.sh script creates
coverage.txt.tmp files while running.  These are temporary and
certainly shouldn't be committed, so add them to the gitignore file.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
David Gibson
13c2577004 runtime: Move go testing script locally
The go unit tests for the runtime are invoked by the helper script
ci/go-test.sh.  Which calls the run_go_test() function in ci/lib.sh.  Which
calls into .ci/go-test.sh from the tests repository.

But.. the runtime is the only user of this script, and generally stuff for
unit tests (rather than functional or integration tests) lives in the main
repository, not the tests repository.

So, just move the actual script into src/runtime.  A change to remove it
from the tests repo will follow.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-05-13 13:14:37 +10:00
Zvonko Kaiser
2a1d394147 runtime: Adding the correct detection of mediated PCIe devices
Fixes #4212

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2022-05-09 00:57:06 -07:00
Fabiano Fidêncio
33a8b70558 clh: Rely on Cloud Hypervisor for generating the device ID
We're currently hitting a race condition on the Cloud Hypervisor's
driver code when quickly removing and adding a block device.

This happens because the device removal is an asynchronous operation,
and we currently do *not* monitor events coming from Cloud Hypervisor to
know when the device was actually removed.  Together with this, the
sandbox code doesn't know about that and when a new device is attached
it'll quickly assign what may be the very same ID to the new device,
leading to the Cloud Hypervisor's driver trying to hotplug a device with
the very same ID of the device that was not yet removed.

This is, in a nutshell, why the tests with Cloud Hypervisor and
devmapper have been failing every now and then.

The workaround taken to solve the issue is basically *not* passing down
the device ID to Cloud Hypervisor and simply letting Cloud Hypervisor
itself generate those, as Cloud Hypervisor does it in a manner that
avoids such conflicts.  With this addition we have then to keep a map of
the device ID and the Cloud Hypervisor's generated ID, so we can
properly remove the device.

This workaround will probably stay for a while, at least till someone
has enough cycles to implement a way to watch the device removal event
and then properly act on that.  Spoiler alert, this will be a complex
change that may not even be worth it considering the race can be avoided
with this commit.

Fixes: #4176

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-05-04 09:04:03 +02:00