Commit Graph

1208 Commits

Author SHA1 Message Date
Samuel Ortiz
fa0e9dc6b1 virtcontainers: Make all Linux VMMs only build on Linux
Some of them (e.g. QEMU) can run on other OSes (e.g. Darwin) but the
current virtcontainers implementation is Linux specific.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:07:34 +01:00
Samuel Ortiz
c91035d0e1 virtcontainers: Move non QEMU specific constants to hypervisor.go
Hotplugging errors and 9pfs size are not particularily QEMU specific.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:07:34 +01:00
Samuel Ortiz
10ae05914c virtcontainers: Move guest protection definitions to hypervisor.go
They're not QEMU specific, other VMMs may implement support for it.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:07:31 +01:00
Samuel Ortiz
b28d0274ff virtcontainers: Make max vCPU config less QEMU specific
Even though it's still actually defined as the QEMU upper bound,
it's now abstracted away through govmm.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:06:32 +01:00
Samuel Ortiz
a5f6df6a49 govmm: Define the number of supported vCPUs per architecture
Based on qhe QEMU supports on those architectures.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-16 19:06:32 +01:00
Fabiano Fidêncio
be2e90469a Merge pull request #3669 from fidencio/wip/virtiofsd-use-announce-submounts
virtiofsd: Use "-o announce_submounts"
2022-02-16 16:43:18 +01:00
James O. D. Hunt
9818cf7196 docs: Improve top-level and runtime README
Various improvements to the top-level README file:

- Moved the following sections from the runtime's README to the
  top-level README:
  - License
  - Platform support / Hardware requirements
- Added the following sections to the top-level README:
  - Configuration
  - Hypervisors
- Improved formatting of the Documentation section in the top-level
  README.
- Removed some unused named links from the top-level README.

Also improvements to the runtime README:

- Removed confusing mention of the old 1.x runtime name.
- Clarify the binary name for the 2.x runtime and the utility program.

> **Note:**
>
> We cannot currently link to the AMD website as that site's
> configuration causes the CI static checks to fail. See
> https://github.com/kata-containers/tests/issues/4401

Fixes: #3557.

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2022-02-16 09:52:48 +00:00
bin
81a8baa5e5 runtime: add hugepages support
Add hugepages support, port from:
b486387cba

Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: bin <bin@hyper.sh>
2022-02-16 15:14:53 +08:00
bin
7df677c01e runtime: Update calculateSandboxMemory to include Hugepages Limit
Support hugepages and port from:
96dbb2e8f0

Fixes: #3342

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Pradipta Banerjee <pradipta.banerjee@gmail.com>
Signed-off-by: bin <bin@hyper.sh>
2022-02-16 15:14:37 +08:00
Samuel Ortiz
4f96e3eae3 katautils: Pass the nerdctl netns annotation to the OCI hooks
We need to let nerdctl know which namespace to use when calling the
selected CNI plugin.
See https://github.com/containerd/nerdctl/issues/787

Fixes: #1935

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-15 18:11:23 +01:00
Samuel Ortiz
a871a33b65 katautils: Run the createRuntime hooks
The preStart hooks are being deprecated over the createRuntime ones.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-15 17:31:56 +01:00
Samuel Ortiz
d9dfce1453 katautils: Run the preStart hook in the host namespace
The OCI spec is very specific about it:

"The prestart hooks MUST be executed in the runtime namespace."

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-15 17:31:56 +01:00
Samuel Ortiz
6be6d0a3b3 katautils: Pass the OCI annotations back to the called OCI hooks
That allows us to amend those annotations with information that could be
used when running those hooks.

For example nerdctl will use those annotations to resolve the networking
namespace path in where to run the CNI plugin, i.e. the created pod
networking namespace.

Fixes #3629

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-15 17:31:56 +01:00
Fabiano Fidêncio
4bd945b67b virtiofsd: Use "-o announce_submounts"
German Maglione, one of the current virtio-fs developers, has brought to
our attention that using "announce-submounts" could help us to prevent
inode number collisions.

This feature was introduced a year ago or so by Hanna Reitz as part of
the 08dce386e77eb9ab044cb118e5391dc9ae11c5a8, and as we already mandate
QEMU >= 6.1.0, let's take advantage of that.

Fixes: #3507

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-02-15 08:52:03 +01:00
Yu Li
37df1678ae build: always reset ARCH after getting it
When building with `ARCH=x86_64`, the previous `Makefile` will use it
without checking and cause:

Makefile:319: *** "ERROR: No hypervisors known for architecture x86_64 (looked for: acrn firecracker qemu cloud-hypervisor)".  Stop.

This commit fix the above issue by checking `ARCH` no matter where it
is assigned.

Fixes: #3444

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Signed-off-by: Yu Li <liyu.yukiteru@bytedance.com>
2022-02-15 14:26:34 +08:00
Shengjing Zhu
3a641b56f6 katatestutils: remove distro constraints
The distro constraint parses os release files, which may not contain
distro version(VERSION_ID field), for example rolling release distributions
like Debian testing, archlinux.

These distro constraints are not used anyway, so removing them instead
of fixing the complex version detection.

Fixes: #1864

Signed-off-by: Shengjing Zhu <zhsj@debian.org>
2022-02-15 02:11:52 +08:00
Fabiano Fidêncio
90fd625d0c versions: Udpate Cloud Hypervisor to 55479a64d237
Let's update cloud-hypervisor to a version that exposes the TDx support
via the OpenAPI's auto-generated code.

Fixes: #3663

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-02-14 17:32:30 +01:00
James O. D. Hunt
8f80dffead Merge pull request #3648 from yaoyinnan/index-in-for
runtime: The index variable is initialized multiple times in for
2022-02-14 12:36:46 +00:00
Bin Liu
cf53ec2c71 Merge pull request #2977 from luodw/support_nydus
feature(nydusd): add nydusd support to introduce lazyload ability
2022-02-14 13:08:50 +08:00
Matt Layher
c1ce67d905 runtime: use github.com/mdlayher/vsock@v1.1.0
Fixes #3625
Signed-off-by: Matt Layher <mdlayher@gmail.com>
2022-02-12 19:57:15 -05:00
yaoyinnan
42a878e6c1 runtime: The index variable is initialized multiple times in for
Change the variables `mountTypeFieldIdx := 8`, `mntDestIdx := 4` and `netNsMountType := "nsfs"` to const.

And unify the variable naming style, modify `mntDestIdx` to `mountDestIdx`.

Fixes: #3646

Signed-off-by: yaoyinnan <yaoyinnan@foxmail.com>
2022-02-12 11:10:10 +08:00
luodaowen.backend
2d9f89aec7 feature(nydusd): add nydusd support to introduse lazyload ability
Pulling image is the most time-consuming step in the container lifecycle. This PR
introduse nydus to kata container, it can lazily pull image when container start. So it
can speed up kata container create and start.

Fixes #2724

Signed-off-by: luodaowen.backend <luodaowen.backend@bytedance.com>
2022-02-11 21:41:17 +08:00
Daniel Höxtermann
b19b6938a8 docs: Fix relative links in Markdown
Relative links within this repository allow for easier navigation to
the corresponding file / directory in the current commit / for the
selected version.

Link text was slightly changed / fixed in
- docs/Unit-Test-Advice.md
- docs/how-to/how-to-run-docker-with-kata.md

Fixes #3045

Signed-off-by: Daniel Höxtermann <daniel@hxtm.dev>
2022-02-11 13:49:42 +01:00
Julio Montes
982f14fa66 runtime: support QEMU SGX
Enable SGX in QEMU when `sgx.intel.com/epc` annotation is defined

fixes #3436

Signed-off-by: Julio Montes <julio.montes@intel.com>
2022-02-10 09:45:48 -06:00
Samuel Ortiz
07b9d93f5f virtcontainer: Simplify the sandbox network creation flow
We don't need to call NewNetwork() twice, and we can have the VM factory
case return immediatly. That makes the code more readable.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
2c7087ff42 virtcontainers: Make all endpoints Linux only
All of the networking endpoints are Linux specific.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
49d2cde1e2 virtcontainers: Split network tests into generic and OS specific parts
Some unit tests are generic while others, mostly because they depend on
netlink, are Linux specific.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
0269077ebf virtcontainers: Remove the netlink package dependency from network.go
Move the netlink dependent code into network_linux.go.
Other OSes will have to provide the same functions.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
7fca5792f7 virtcontainers: Unify Network endpoints management interface
And only have AddEndpoints/RemoveEndpoints for all cases (single
endpoint vs all of them, hotplug or not).

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
c67109a251 virtcontainers: Remove the Network PostAdd method
It's used once by the sandbox code and can be implemented directly
there.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
e0b264430d virtcontainers: Define a Network interface
And move the Linux implementation into a GOOS specific file.

Fixes #3005

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
5e119e90e8 virtcontainers: Rename the Network structure fields and methods
We are converting the Network structure into an interface, so that
different host OSes can have different networking implementations for
Kata.
One step into that direction is to rename all the Network structure
fields and methods to something that is less Linux networking namespace
specific. This will make the Network interface naming consistent.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
b858d0dedf virtcontainers: Make all Network fields private
Prepare for making it a real interface.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
49eee79f5f virtcontainers: Remove the NetworkNamespace structure
It is now replaced with a single Network structure

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
844eb61992 virtcontainers: Have CreateVM use a Network reference
We are replacing the NetworkingNamespace structure with the Network
one, so we should have the hypervisor interface switching to it as well.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
d7b67a7d1a virtcontainers: Network API cleanups and simplifications
Remove unused parameters.
Reduce the number of parameters by deriving some of them (e.g. a
networking config) from their outer structure (e.g. a Sandbox
reference).

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
2edea88369 virtcontainers: Make the Network structure manage endpoints
Endpoints creations, attachement and hotplug are bound to the networking
namespace described through the Network structure.
Making them Network methods is natural and simplifies the code.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Samuel Ortiz
8f48e28325 virtcontainers: Expand the Network structure
For simplicity sake, there should only be one networking structure per
sandbox, as opposed to two (Network and NetworkingNamespace) currently.

This commit start expanding the Network structure in order to eventually
make it the single representation of a virtcontainers sandbox
networking.

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
2022-02-08 22:27:53 +01:00
Pierre Kohler
5ef522f7c3 runtime: check kvm module sev correctly
Runtime now accepts both `1` and `Y` as valid values for
kvm_amd module parameter kvm_amd.sev.

Fixes #3273

Signed-off-by: Pierre Kohler <pierre.kohler@cysec.systems>
2022-02-07 23:48:47 +01:00
Eric Ernst
e8eb5e8295 Merge pull request #3609 from egernst/rootless-linux
virtcontainers: Split the rootless package into OS specific parts
2022-02-03 12:19:31 -08:00
Jakob Naucke
7ffe9e5198 virtcontainers: Do not add a virtio-rng-ccw device
On s390x, skip adding a virtio-rng device. The on-chip CPACF provides
entropy instead. For Confidential Containers, when using Secure
Execution, entropy attacks on virtio-rng are mitigated.

Fixes: #3598
Signed-off-by: Jakob Naucke <jakob.naucke@ibm.com>
2022-02-02 17:06:20 +01:00
Julio Montes
1f29478b09 runtime: suppport split firmware
firmware can be split into FIRMWARE_VARS.fd (UEFI variables as
configuration) and FIRMWARE_CODE.fd (UEFI program image). UEFI
variables can be customized per each user while UEFI code is kept same.

fixes #3583

Signed-off-by: Julio Montes <julio.montes@intel.com>
2022-02-01 13:40:19 -06:00
Samuel Ortiz
14e7f52a91 virtcontainers: Split the rootless package into OS specific parts
Move the netns specific bits into a Linux specific file.

Fixes: #3607

Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-01-28 16:20:28 -08:00
James O. D. Hunt
7c956e0d27 virtcontainers: Enable initrd for Cloud Hypervisor
Since CH has supported booting with an initramfs since version 0.7.0
[1], allow an `initrd=` to be specified.

Fixes: #3566.

[1] - https://github.com/cloud-hypervisor/cloud-hypervisor/releases/tag/v0.7.0

Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
2022-01-28 10:49:10 +00:00
Eric Ernst
a5ebeb96c1 Merge pull request #2941 from egernst/sandbox-sizing-feature
Sandbox sizing feature
2022-01-27 09:37:57 -08:00
Eric Ernst
8cde54131a runtime: introduce static sandbox resource management
There are software and hardware architectures which do not support
dynamically adjusting the CPU and memory resources associated with a
sandbox. For these, today, they rely on "default CPU" and "default
memory" configuration options for the runtime, either set by annotation
or by the configuration toml on disk.

In the case of a single container (launched by ctr, or something like
"docker run"), we could allow for sizing the VM correctly, since all of
the information is already available to us at creation time.

In the sandbox / pod container case, it is possible for the upper layer
container runtime (ie, containerd or crio) could send a specific
annotation indicating the total workload resource requirements
associated with the sandbox creation request.

In the case of sizing information not being provided, we will follow
same behavior as today: start the VM with (just) the default CPU/memory.

If this information is provided, we'll track this as Workload specific
resources, and track default sizing information as Base resources. We
will update the hypervisor configuration to utilize Base+Workload
resources, thus starting the VM with the appropriate amount of CPU and
memory.

In this scenario (we start the VM with the "right" amount of
CPU/Memory), we do not want to update the VM resources when containers
are added, or adjusted in size.

This functionality is introduced behind a configuration flag,
`static_sandbox_resource_mgmt`. This is defaulted to false for all
configurations except Firecracker, which is set to true.

This'll greatly improve UX for folks who are utilizing
Kata with a VMM or hardware architecture that doesn't support hotplug.

Note, users will still be unable to do in place vertical pod autoscaling
or other dynamic container/pod sizing with this enabled.

Fixes: #3264

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-01-26 09:04:38 -08:00
Eric Ernst
c3e97a0a22 config: updates to configuration clh, fc toml template
There's some cruft -- let's update to reflect reality, and ensure that
we match what is expected.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-01-26 09:45:50 -08:00
Francesco Giudici
ab447285ba kata-monitor: add kubernetes pod metadata labels to metrics
Add the POD metadata we get from the container manager to the metrics by
adding more labels.

Fixes: #3551

Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2022-01-26 13:48:45 +01:00
Francesco Giudici
834e199eee kata-monitor: drop unused functions
Drop the functions we are not using anymore.
Update the tests too.

Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2022-01-26 13:48:45 +01:00
Francesco Giudici
7516a8c51b kata-monitor: rework the sandbox cache sync with the container manager
Kata-monitor detects started and terminated kata pods by monitoring the
vc/sbs fs (this makes sense since we will have to access that path to
access the sockets there to get the metrics from the shim).
While kata-monitor updates its sandbox cache based on the sbs fs events,
it will schedule also a sync with the container manager via the CRI in
order to sync the list of sandboxes there.
The container manager will be the ultimate source of truth, so we will
stick with the response from the container manager, removing the
sandboxes not reported from the container manager.

May happen anyway that when we check the container manager, the new kata
pod is not reported yet, and we will remove it from the kata-monitor pod
cache. If we don't get any new kata pod added or removed, we will not
check with the container manager again, missing reporting metrics about
that kata pod.

Let's stick with the sbs fs as the source of truth: we will update the
cache just following what happens on the sbs fs.
At this point we may have also decided to drop the container manager
connection... better instead to keep it in order to get the kube pod
metadata from it, i.e., the kube UID, Name and Namespace associated with
the sandbox.
Every time we get a new sandbox from the sbs fs we will try to retrieve the
pod metadata associated with it.

Right now we just attach the container manager sandbox id as a label to
the exposed metrics, making hard to link the metrics to the running pod
in the kubernetes cluster.
With kubernetes pod metadata we will be able to add them as labels to map
explicitly the metrics to the kubernetes workloads.

Fixes: #3550

Signed-off-by: Francesco Giudici <fgiudici@redhat.com>
2022-01-26 13:48:45 +01:00