Commit Graph

241 Commits

Author SHA1 Message Date
Pavel Mores
351a01bd7e runtime: handle io.katacontainers.config.hypervisor.virtio_fs_extra_args
Users can specify extra arguments for virtiofsd in a pod spec using the
io.katacontainers.config.hypervisor.virtio_fs_extra_args annontation.
However, this annotation was ignored so far by the runtime.  This commit
fixes the issue by processing the annotation value (if present) and
translating it to the corresponding hypervisor configuration item.

Fixes #1523

Signed-off-by: Pavel Mores <pmores@redhat.com>
2021-04-26 10:17:48 +02:00
Bo Chen
e3efcfd40f runtime: Fix the format of the client code of cloud-hypervisor APIs
Regenerate the client code with the added `go-fmt` step. No functional
changes.

Fixes: #1606

Signed-off-by: Bo Chen <chen.bo@intel.com>
(cherry picked from commit 0c38d9ecc4)
2021-04-01 15:15:54 -07:00
Bo Chen
5a92333f4b runtime: Format auto-generated client code for cloud-hypervisor API
This patch extends the current process of generating client code for
cloud-hypervisor API with an additional step, `go-fmt`, which will remove
the generated `client/go.mod` file and format all auto-generated code.

Fixes: #1606

Signed-off-by: Bo Chen <chen.bo@intel.com>
(cherry picked from commit 52cacf8838)
2021-04-01 15:15:54 -07:00
Bo Chen
ec0424e153 versions: Update cloud-hypervisor to release v0.14.1
Highlights for cloud-hypervisor version 0.14.0 include: 1) Structured
event monitoring; 2) MSHV improvements; 3) Improved aarch64 platform; 4)
Updated hotplug documentation; 6) PTY control for serial and
virtio-console; 7) Block device rate limiting; 8) Plan to deprecate the
support of "LinuxBoot" protocol and support PVH protocol only.

Highlights for cloud-hypervisor version 0.13.0 include: 1) Wider VFIO
device support; 2) Improve huge page support; 3) MACvTAP support; 4) VHD
disk image support; 5) Improved Virtio device threading; 6) Clean
shutdown support via synthetic power button.

Details can be found:
https://github.com/cloud-hypervisor/cloud-hypervisor/releases

Note: The client code of cloud-hypervisor's OpenAPI is automatically
generated by `openapi-generator` [1-2]. As the API changes do not
impact usages in Kata, no additional changes in kata's runtime are
needed to work with the latest version of cloud-hypervisor.

[1] https://github.com/OpenAPITools/openapi-generator
[2] https://github.com/kata-containers/kata-containers/blob/main/src/runtime/virtcontainers/pkg/cloud-hypervisor/README.md

Fixes: #1591

Signed-off-by: Bo Chen <chen.bo@intel.com>
(cherry picked from commit 84b62dc3b1)
2021-04-01 15:15:06 -07:00
Eric Ernst
ac9f838e33 container: on cleanup, rm container directory for mounts path
A wrong path was being used for container directory when
virtiofs is utilized. This resulted in a warning message in
logs when a container is killed, or completes:

level=warning msg="Could not remove container share dir"

Without proper removal, they'd later be cleaned up when the shared
path is removed as part of stopping the sandbox.

Fixes: #1559

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-03-29 14:16:22 -07:00
Fabiano Fidêncio
9ea851ee53 Merge pull request #1485 from egernst/backport-bindmount-fixes
backport: bindmount fixes
2021-03-26 18:45:54 +01:00
Peng Tao
50aa89fa05 runtime: fix virtiofsd RO volume sharing
Right now we rely heavily on mount propagation to share host
files/directories to the guest. However, because virtiofsd
pivots and moves itself to a separate mount namespace, the remount
mount is not present in virtiofsd's mount. And it causes guest to be
able to write to the host RO volume.

To fix it, create a private RO mount and then move it to the host mounts
dir so that it will be present readonly in the host-guest shared dir.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2021-03-26 12:53:04 +08:00
Peng Tao
57aa746d0d runtime: mount shared mountpoint readonly
bindmount remount events are not propagated through mount subtrees,
so we have to remount the shared dir mountpoint directly.

E.g.,
```
mkdir -p source dest foo source/foo

mount -o bind --make-shared source dest

mount -o bind foo source/foo
echo bind mount rw
mount | grep foo
echo remount ro
mount -o remount,bind,ro source/foo
mount | grep foo
```
would result in:
```
bind mount rw
/dev/xvda1 on /home/ubuntu/source/foo type ext4 (rw,relatime,discard,data=ordered)
/dev/xvda1 on /home/ubuntu/dest/foo type ext4 (rw,relatime,discard,data=ordered)
remount ro
/dev/xvda1 on /home/ubuntu/source/foo type ext4 (ro,relatime,discard,data=ordered)
/dev/xvda1 on /home/ubuntu/dest/foo type ext4 (rw,relatime,discard,data=ordered)
```

The reason is that bind mount creats new mount structs and attaches them to different mount subtrees.
However, MS_REMOUNT only looks for existing mount structs to modify and does not try to propagate the
change to mount structs in other subtrees.

Fixes: #1061
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2021-03-26 12:00:28 +08:00
Peng Tao
ce2798b688 runtime: readonly mounts should be readonly bindmount on the host
So that we get protected at the VM boundary not just the guest kernel.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2021-03-26 12:00:27 +08:00
Snir Sheriber
b7208b3c6c runtime: increase dial timeout
On some setups, starting multiple kata pods (qemu) simultaneously on the same node
might cause kata VMs booting time to increase and the pods to fail with:
Failed to check if grpc server is working: rpc error: code = DeadlineExceeded desc = timed
out connecting to vsock 1358662990:1024: unknown

Increasing default dialing timeout to 30s should cover most cases.

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
Fixes: #1543
(backport https://github.com/kata-containers/kata-containers/pull/1544)
2021-03-25 10:01:36 +02:00
bin
d87076eea5 runtime: return hypervisor Pid in TaskExit event
Other RPC calls return Pid of hypervisor, the TaskExit should
return the same Pid.

Fixes: #1497

Signed-off-by: bin <bin@hyper.sh>

(backport https://github.com/kata-containers/kata-containers/pull/1498)
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
[ fix missing GetHypervisorPid method in MockSandbox ]
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2021-03-23 15:34:45 +08:00
fupan.lfp
2dd859bfce shimv2: return the hypervisor's pid as the container pid
Since the kata's hypervisor process is in the network namespace,
which is close to container's process, and some host metrics
such as cadvisor can use this pid to access the network namespace
to get some network metrics. Thus this commit replace the shim's
pid with the hypervisor's pid.

Fixes: #1451

Signed-off-by: fupan.lfp <fupan.lfp@antfin.com>

(backport https://github.com/kata-containers/kata-containers/pull/1452)
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2021-03-23 15:34:45 +08:00
Peng Tao
476467115f Merge pull request #1522 from fgiudici/stable-2.0
[backport] Fixup systemd cgroup handling
2021-03-23 15:13:53 +08:00
Eric Ernsteernst
506f4f2adc cgroups: Add systemd detection when creating cgroup manager
Look at the provided cgroup path to determine whether systemd is being
used to manage the cgroups. With this, systemd cgroups are being detected
and created appropriately for the sandbox.

Fixes: #599

Signed-off-by: Eric Ernsteernst <eric@amperecomputing.com>

(forward port of https://github.com/kata-containers/runtime/pull/2817)
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2021-03-17 17:50:14 +01:00
Eric Ernsteernst
a3e35e7e92 cgroups: remove unused SystemdCgroup variable and accessor/mutators
Since we are now detecting, no longer to keep this state.

Signed-off-by: Eric Ernsteernst <eric@amperecomputing.com>

(forward port of https://github.com/kata-containers/runtime/pull/2817)
Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2021-03-17 17:50:07 +01:00
Bo Chen
bcd8fd538d versions: Update cloud-hypervisor to release v0.12.0
Highlights for cloud-hypervisor version v0.12.0 include: removal of
`vhost-user-net` and `vhost-user-block` self spawning, migration of
`vhost-user-fs` backend, ARM64 enhancements with full support of
`--watchdog` for rebooting, and enhanced `info` HTTP API to include the
details of devices used by the VM including VFIO devices.

Fixes: #1315

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-03-17 11:31:32 +08:00
Eric Ernst
32feb10331 runtime: cpuset: when creating container, don't pass cpuset details
Today we only clear out the cpuset details when doing an update call on
existing container/pods. This works in the case of Kubernetes, but not
in the case where we are explicitly setting the cpuset details at boot
time. For example, if you are running a single container via docker ala:

docker run --cpuset-cpus 0-3 -it alpine sh

What would happen is the cpuset info would be passed in with the
container spec for create container request to the agent. At that point
in time, there'd only be the defualt number of CPUs available in the
guest (1), so you'd be left with cpusets set to 0. Next, we'd hotplug
the vCPUs, providing 0-4 CPUs in the guest, but the cpuset would never
be updated, leaving the application tied to CPU 0.

Ouch.

Until the day we support cpusets in the guest, let's make sure that we
start off clearing the cpuset fields.

Fixes: #1405

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-03-17 11:31:32 +08:00
Eric Ernst
e4cea92ad3 blk-dev: hotplug readonly if applicable
If a block based volume is read only, let's make sure we add as a RO
device

Fixes: #1246

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-01-13 14:44:45 -08:00
Eric Ernst
0590fedd98 volumes: cleanup / minor refactoring
Update some headers, very minor refactoring

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-01-13 14:44:45 -08:00
Eric Ernst
6b6668998f vendor: revendor govmm from intel to kata-containers
- Update where we vendor govmm
- Grab latest

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-01-13 14:44:45 -08:00
Eric Ernst
4f7f25d1a1 Merge pull request #1251 from bergwolf/backport-2.0.0
Backport to stable-2.0 branch
2021-01-13 12:25:15 -08:00
Julio Montes
65ae12710d runtime: clh: update cloud-hypervisor
Update cloud-hypervisor to commit 2706319.
Fixes a limitation in OpenAPITools/openapi-generator tool,
it's impossible to send go zero types, like false and 0 to
cloud-hypervisor because `omitempty` is added if a field is not
required.
See cloud-hypervisor/cloud-hypervisor#1961 for more information

Signed-off-by: Julio Montes <julio.montes@intel.com>
2021-01-13 11:38:24 -06:00
Julio Montes
9bc6fe6c83 runtime: clh: disable virtiofs DAX when FS cache size is 0
Guest consumes 120Mb more of memory when DAX is enabled and the default
FS cache size (8G) is used. Disable dax when it is not required
reducing guest's memory footprint.

Without this patch:

```
7fdea4000000-7fdee4000000 rw-s 18850589 /memfd:ch_ram (deleted)
Size:            1048576 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:              187876 kB
```

With this patch:

```
7fa970000000-7fa9b0000000 rw-s 612001  /memfd:ch_ram (deleted)
Size:            1048576 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:               57308 kB
Pss:               56722 kB
```

fixes #1100

Signed-off-by: Julio Montes <julio.montes@intel.com>
2021-01-13 11:38:24 -06:00
Bo Chen
349d496f7f versions: Update cloud-hypervisor to release v0.11.0
The release v0.11.0 of cloud-hypervisor features the following changes:
1) Improved Linux Boot Time, 2) `SIGTERM/SIGINT` Interrupt Signal,
Handling 3) Default Log Level Changed, 4) `io_uring` support by default
for `virtio-block` (on host kernel version 5.8+), 5) Windows Guest
Support, 6) New `--balloon` Parameter Added, 7) Experimental
`virtio-watchdog` Support, 8) Bug fixes.

Fixes: #1089

Signed-off-by: Bo Chen <chen.bo@intel.com>
2021-01-13 11:38:02 -06:00
Eric Ernst
40316f688a qemu: no state to save if QEMU isn't running
On pod delete, we were looking to read files that we had just deleted. In particular,
stopSandbox for QEMU was called (we cleanup up vmpath), and then QEMU's
save function was called, which immediately checks for the PID file.

Let's only update the persist store for QEMU if QEMU is actually
running. This'll avoid Error messages being displayed when we are
stopping and deleting a sandbox:

```
level=error msg="Could not read qemu pid file"
```

I reviewed CLH, and it looks like it is already taking appropriate
action, so no changes needed.

Ideally we won't spend much time saving state to persist.json unless
there's an actual error during stop/delete/shutdown path, as the persist will
also be removed after the pod is removed. We may want to optimize this,
as currently we are doing a persist store when deleting each container
(after the sandbox is stopped, VM is killed), and when we stop the sandbox.
This'll require more rework... tracked in:
  https://github.com/kata-containers/kata-containers/issues/1181

Fixes: #1179

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2021-01-13 18:30:46 +08:00
David Gibson
9117dd409e runtime/network: Fix error reporting in listRoutes()
If the upcast from resultingRoutes to *grpc.IRoutes fails, we return
(nil, err), but previous code ensures that err is nil at that point, so we
return no error.

fixes #1206

Forward port of
0ffaeeb5d8

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-13 18:30:46 +08:00
David Gibson
fce14f3697 runtime/network: Correct error reporting in listInterfaces()
If the upcast from resultingInterfaces to *grpc.Interfaces fails, we
return (nil, err), but previous code ensures that err is nil at that
point, so we return no error.

Forward port of
b86e904c2d

fixes #1206

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2021-01-13 18:30:46 +08:00
Peng Tao
78df4a0c3f runtime: remove the unused proto files
These are moved to the agent and no longer needed.

Fixes: #1028
Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2020-11-27 08:26:51 -06:00
Peng Tao
7daf9cffb1 agent: move gogo.proto out of the github.com namespance
To follow the same namespace scope as other proto files.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2020-11-27 08:26:51 -06:00
Peng Tao
293be9d0ad agent: types.pb.go is not regenerated
When types.proto was relocated, types.pb.go is not regenerated and still
references the old location.

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
2020-11-27 08:26:51 -06:00
Christophe de Dinechin
8a364d2145 annotations: Correct unit tests to validate new protections
Add the verification of some basic protections, namely that:
- EnableAnnotations is honored
- Dangerous paths cannot be modified if no match
- Errors are returned when expected

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
0cc6297716 annotations: Split addHypervisorOverrides to reduce complexity
Warning from gocyclo during make check:
 virtcontainers/pkg/oci/utils.go:404:1: cyclomatic complexity 37 of func `addHypervisorConfigOverrides` is high (> 30) (gocyclo)
 func addHypervisorConfigOverrides(ocispec specs.Spec, config *vc.SandboxConfig, runtime RuntimeConfig) error {
^

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
b6059f3566 annotations: Add unit test for checkPathIsInGlobs
There are a few interesting corner cases to consider for this
function.

Fixes: #901

Suggested-by: James O.D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
c6afad2a06 annotations: Add unit test for regexpContains function
James O.D Hunt: "But also, regexpContains() and
checkPathIsInGlobList() seem like good candidates for some unit
tests. The "look" obvious, but a few boundary condition tests would be
useful I think (filenames with spaces, backslashes, special
characters, and relative & absolute paths are also an interesting
thought here)."

There aren't that many boundary conditions on a list with regexps,
if you assume the regexp match function itself works. However, the
tests is useful in documenting expectations.

Fixes: #901

Suggested-by: James O.D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
a92a63031d annotations: Give better names to local variabes in search functions
Use more meaningful variable names for clarity.

Fixes: #901

Suggested-by: James O.D. Hunt james.o.hunt@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
997f7c4433 annotations: Rename checkPathIsInGlobList with checkPathIsInGlobs
The name is shorter and more specific

Fixes: #901

Suggested-by: James O.D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:15 +08:00
Christophe de Dinechin
73bb3fdbee config: Whitelist hypervisor annotations by name
Add a field "enable_annotations" to the runtime configuration that can
be used to whitelist annotations using a list of regular expressions,
which are used to match any part of the base annotation name, i.e. the
part after "io.katacontainers.config.hypervisor."

For example, the following configuraiton will match "virtio_fs_daemon",
"initrd" and "jailer_path", but not "path" nor "firmware":

  enable_annotations = [ "virtio.*", "initrd", "_path" ]

The default is an empty list of enabled annotations, which disables
annotations entirely.

If an anontation is rejected, the message is something like:

  annotation io.katacontainers.config.hypervisor.virtio_fs_daemon is not enabled

Fixes: #901

Suggested-by: Peng Tao <tao.peng@linux.alibaba.com>
Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:43:10 +08:00
Christophe de Dinechin
5a587ba506 config: Use glob instead of regexp to match paths in annotations
When filtering annotations that correspond to paths,
e.g. hypervisor.path, it is better to use a glob syntax than a regexp
syntax, as it is more usual for paths, and prevents classes of matches
that are undesirable in our case, such as matching .. against .*

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
29f5dec38f annotations: Fix typo in comment
A comment talking about runtime related annotations describes them as
being related to the agent. A similar comment for the agent
annotations is missing.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
28c386c51f config: Protect file_mem_backend against annotation attacks
This one could theoretically be used to overwrite data on the host.
It seems somewhat less risky than the earlier ones for a number
of reasons, but worth protecting a little anyway.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
c2a186b18c config: Protect vhost_user_store_path against annotation attacks
This path could be used to overwrite data on the host.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
b5f2a1e8c4 config: Protect ctlpath from annotation attack
This also adds annotation for ctlpath which were not present
before. It's better to implement the code consistenly right now to make
sure that we don't end up with a leaky implementation tacked on later.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
2d65b3bfd8 config: Protect jailer_path annotation
The jailer_path annotation can be used to execute arbitrary code on
the host. Add a jailer_path_list configuration entry providing a list
of regular expressions that can be used to filter annotations that
represent valid file names.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
3f7bcf54f0 annotations: Simplify negative logic
Replace strange negative logic  (!ok -> continue) with positive
logic (ok -> do it)

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
80144fc415 config: Add hypervisor path override through annotations
The annotation is provided, so it should be respected.
Furthermore, it is important to implement it with the appropriate
protetions similar to what was done for virtiofsd.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
2f5f35608a config: Fix typo in function name
There was an extra 'p' in addHypervisorVirtioFsOverrides.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Christophe de Dinechin
2faafbdd3a config: Protect virtio_fs_daemon annotation
Sending the virtio_fs_daemon annotation can be used to execute
arbitrary code on the host. In order to prevent this, restrict the
values of the annotation to a list provided by the configuration
file.

Fixes: #901

Signed-off-by: Christophe de Dinechin <dinechin@redhat.com>
2020-10-18 00:40:16 +08:00
Eric Ernst
183823398d cpuset: add cpuset pkg
Pulled from 1.18.4 Kubernetes, adding the cpuset pkg for managing
CPUSet calculations on the host. Go mod'ing the original code from
k8s.io/kubernetes was very painful, and this is very static, so let's
just pull in what we need.

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2020-10-18 00:40:16 +08:00
Eric Ernst
bfbbe8ba6b cpuset: don't set cpuset.mems in the guest
Kata doesn't map any numa topologies in the guest. Let's make sure we
clear the Cpuset fields before passing container updates to the
guest.

Note, in the future we may want to have a vCPU to guest CPU mapping and
still include the cpuset.Cpus. Until we have this support, clear this as
well.

Fixes: #932

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2020-10-18 00:40:16 +08:00
Eric Ernst
5c21ec278c sandbox: consider cpusets if quota is not enforced
CPUSet cgroup allows for pinning the memory associated with a cpuset to
a given numa node. Similar to cpuset.cpus, we should take cpuset.mems
into account for the sandbox-cgroup that Kata creates.

Signed-off-by: Eric Ernst <eric.g.ernst@gmail.com>
2020-10-18 00:40:16 +08:00