Commit Graph

838 Commits

Author SHA1 Message Date
Georgina Kinge
4f80ea1962 CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4507
Signed-off-by: Georgina Kinge <georgina.kinge@ibm.com>
2022-06-22 10:06:27 +01:00
Eric Ernst
72049350ae Merge pull request #4288 from fengwang666/enable-qemu-sandbox
runtime: enable sandbox feature on qemu
2022-06-21 09:22:26 -07:00
Liang Zhou
ef925d40ce runtime: enable sandbox feature on qemu
Enable "-sandbox on" in qemu can introduce another protect layer
on the host, to make the secure container more secure.

The default option is disable because this feature may introduce some
performance cost, even though user can enable
/proc/sys/net/core/bpf_jit_enable to reduce the impact.

Fixes: #2266

Signed-off-by: Feng Wang <feng.wang@databricks.com>
2022-06-17 15:30:46 -07:00
Megan Wright
94695869b0 CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4460
Signed-off-by: Megan-Wright <megan.wright@ibm.com>
2022-06-15 11:05:51 +01:00
Fabiano Fidêncio
ac5dbd8598 clh: Improve logging related to the net dev addition
Let's improve the log so we make it clear that we're only *actually*
adding the net device to the Cloud Hypervisor configuration when calling
our own version of VmAddNetPut().

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:53:09 +00:00
Fabiano Fidêncio
0b75522e1f network: Set queues to 1 to ensure we get the network fds
We want to have the file descriptors of the opened tuntap device to pass
them down to the VMMs, so the VMMs don't have to explicitly open a new
tuntap device themselves, as the `container_kvm_t` label does not allow
such a thing.

With this change we ensure that what's currently done when using QEMU as
the hypervisor, can be easily replicated with other VMMs, even if they
don't support multiqueue.

As a side effect of this, we need to close the received file descriptors
in the code of the VMMs which are not going to use them.

Fixes: #3533

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:53:09 +00:00
Fabiano Fidêncio
93b61e0f07 network: Add FFI_NO_PI to the netlink flags
Adding FFI_NO_PI to the netlink flags causes no harm to the supported
and tested hypervisors as when opening the device by its name Cloud
Hypervisor[0], Firecracker[1], and QEMU[2] do set the flag already.

However, when receiving the file descriptor of an opened tutap device
Cloud Hypervisor is not able to set the flag, leaving the guest without
connectivity.

To avoid such an issue, let's simply add the FFI_NO_PI flag to the
netlink flags and ensure, from our side, that the VMMs don't have to set
it on their side when dealing with an already opened tuntap device.

Note that there's a PR opened[3] just for testing that this change
doesn't cause any breakage.

[0]: e52175c2ab/net_util/src/tap.rs (L129)
[1]: b6d6f71213/src/devices/src/virtio/net/tap.rs (L126)
[2]: 3757b0d08b/net/tap-linux.c (L54)
[3]: https://github.com/kata-containers/kata-containers/pull/4292

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:53:09 +00:00
Fabiano Fidêncio
bf3ddc125d clh: Pass the tuntap fds down to Cloud Hypervisor
This is basically a no-op right now, as:
* netPair.TapInterface.VMFds is nil
* the tap name is still passed to Cloud Hypervisor, which is the Cloud
  Hypervisor's first choice when opening a tap device.

In the very near future we'll stop passing the tap name to Cloud
Hypervisor, and start passing the file descriptors of the opened tap
instead.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:53:09 +00:00
Fabiano Fidêncio
55ed32e924 clh: Take care of the VmAdNetdPut request ourselves
Knowing that VmAddNetPut works as expected, let's switch to manually
building the request and writing it to the appropriate socket.

By doing this it gives us more flexibility to, later on, pass the file
descriptor of the tuntap device to Cloud Hypervisor, as openAPI doesn't
support such operation (it has no notion of SCM Rights).

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:53:09 +00:00
Fabiano Fidêncio
01fe09a4ee clh: Hotplug the network devices
Instead of creating the VM with the network device already plugged in,
let's actually add the network device *after* the VM is created, but
*before* the Vm is actually booted.

Although it looks like it doesn't make any functional difference between
what's done in the past and what this commit introduces, this will be
used to workaround a limitation on OpenAPI when it comes to passing down
the network device's file descriptor to Cloud Hypervisor, so Cloud
Hypervisor can use it instead of opening the device by its name on the
VMM side.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:51:02 +00:00
Fabiano Fidêncio
2e07538334 clh: Expose VmAddNetPut
VmAddNetPut is the API provided by the Cloud Hypervisor client (auto
generated) code to hotplug a new network device to the VM.

Let's expose it now as it'll be used as part this series, mostly to
guide the reviewer through the process of what we have to do, as later
on, spoiler alert, it'll end up being removed.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-06-14 10:27:30 +00:00
Megan Wright
aa9d875a8d CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4424
Signed-off-by: Megan Wright <megan.wright@ibm.com>
2022-06-08 15:51:18 +01:00
Eric Ernst
0136be22ca virtcontainers: plumb iptable set/get from sandbox to agent
Introduce get/set iptable handling. We add a sandbox API for getting and
setting the IPTables within the guest. This routes it from sandbox
interface, through kata-agent, ultimately making requests to the guest
agent.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 09:27:58 -07:00
Eric Ernst
03176a9e09 proto: update generated code based on proto update
Update the generated agent.pb.go code based on proto update.

Signed-off-by: Eric Ernst <eric_ernst@apple.com>
2022-05-31 08:45:59 -07:00
Georgina Kinge
7eb74e51be CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4345
Signed-off-by: Georgina Kinge <Georgina.Kinge@ibm.com>
2022-05-31 13:50:38 +01:00
Rafael Fonseca
322c6dab66 runtime: sync docstrings with function names
The functions were renamed but their docstrings were not.

Fixes #4006

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-30 16:02:29 +02:00
Rafael Fonseca
4d5e446643 runtime: remove duplicate 'types' import
Fallout of 09f7962ff

Fixes #4285

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-30 16:02:29 +02:00
Snir Sheriber
a48d13f68d runtime: allow annotation configuration to use_legacy_serial
and update the docs and test

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-30 16:02:29 +02:00
Snir Sheriber
060fed814c qemu: allow using legacy serial device for the console
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added

Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-30 16:02:29 +02:00
Snir Sheriber
5453128159 qemu: treat console kernel params within appendConsole
as it is tightly coupled with the appended console device
additionally have it tested

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-30 16:02:29 +02:00
Zvonko Kaiser
79a060ac68 runtime: Adding the correct detection of mediated PCIe devices
Fixes #4212

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2022-05-30 16:02:29 +02:00
Fabiano Fidêncio
fff832874e clh: Update to v24.0
This release has been tracked through the v24.0 project.

virtio-iommu specification describes how a device can be attached by default
to a bypass domain. This feature is particularly helpful for booting a VM with
guest software which doesn't support virtio-iommu but still need to access
the device. Now that Cloud Hypervisor supports this feature, it can boot a VM
with Rust Hypervisor Firmware or OVMF even if the virtio-block device exposing
the disk image is placed behind a virtual IOMMU.

Multiple checks have been added to the code to prevent devices with identical
identifiers from being created, and therefore avoid unexpected behaviors at boot
or whenever a device was hot plugged into the VM.

Sparse mmap support has been added to both VFIO and vfio-user devices. This
allows the device regions that are not fully mappable to be partially mapped.
And the more a device region can be mapped into the guest address space, the
fewer VM exits will be generated when this device is accessed. This directly
impacts the performance related to this device.

A new serial_number option has been added to --platform, allowing a user to
set a specific serial number for the platform. This number is exposed to the
guest through the SMBIOS.

* Fix loading RAW firmware (#4072)
* Reject compressed QCOW images (#4055)
* Reject virtio-mem resize if device is not activated (#4003)
* Fix potential mmap leaks from VFIO/vfio-user MMIO regions (#4069)
* Fix algorithm finding HOB memory resources (#3983)

* Refactor interrupt handling (#4083)
* Load kernel asynchronously (#4022)
* Only create ACPI memory manager DSDT when resizable (#4013)

Deprecated features will be removed in a subsequent release and users should
plan to use alternatives

* The mergeable option from the virtio-pmem support has been deprecated
(#3968)
* The dax option from the virtio-fs support has been deprecated (#3889)

Fixes: #4317

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-05-26 08:51:18 +00:00
Peng Tao
2c238c8504 Merge pull request #4213 from zvonkok/vfio
runtime: Adding the correct detection of mediated PCIe devices
2022-05-20 15:00:23 +08:00
Fabiano Fidêncio
811ac6a8ce Merge pull request #4282 from r4f4/runtime-dedup-types-import
runtime: remove duplicate 'types' import
2022-05-19 22:15:36 +02:00
Chelsea Mafrica
d8be0f8e9f Merge pull request #4281 from r4f4/runtime-qemu-comments
runtime: sync docstrings with function names
2022-05-19 09:17:38 -07:00
Rafael Fonseca
7a5ccd1264 runtime: sync docstrings with function names
The functions were renamed but their docstrings were not.

Fixes #4006

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-19 14:31:47 +02:00
Greg Kurz
fa61bd43ee Merge pull request #4238 from snir911/wip/legacy_console
qemu: allow using legacy serial device for the console
2022-05-19 14:30:59 +02:00
Rafael Fonseca
ce2e521a0f runtime: remove duplicate 'types' import
Fallout of 09f7962ff

Fixes #4285

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-19 13:49:47 +02:00
Snir Sheriber
f4994e486b runtime: allow annotation configuration to use_legacy_serial
and update the docs and test

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-18 18:58:21 +03:00
Georgina Kinge
dd78e4915c CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4275
Signed-off-by: Georgina Kinge <georgina.kinge@ibm.com>
2022-05-18 11:19:22 +01:00
Rafael Fonseca
8052fe62fa runtime: do not check for EOF error in console watcher
The documentation of the bufio package explicitly says

"Err returns the first non-EOF error that was encountered by the
Scanner."

When io.EOF happens, `Err()` will return `nil` and `Scan()` will return
`false`.

Fixes #4079

Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
2022-05-17 15:14:33 +02:00
Snir Sheriber
c67b9d2975 qemu: allow using legacy serial device for the console
This allows to get guest early boot logs which are usually
missed when virtconsole is used.
- It utilizes previous work on the govmm side:
https://github.com/kata-containers/govmm/pull/203
- unit test added

Fixes: #4237
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:06:11 +03:00
Snir Sheriber
44814dce19 qemu: treat console kernel params within appendConsole
as it is tightly coupled with the appended console device
additionally have it tested

Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
2022-05-17 12:05:31 +03:00
Zvonko Kaiser
2a1d394147 runtime: Adding the correct detection of mediated PCIe devices
Fixes #4212

Signed-off-by: Zvonko Kaiser <zkaiser@nvidia.com>
2022-05-09 00:57:06 -07:00
Megan Wright
ef1ae5bc93 CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4200
Signed-off-by: Megan Wright <megan.wright@.ibm.com>
2022-05-04 11:26:50 +01:00
Fabiano Fidêncio
33a8b70558 clh: Rely on Cloud Hypervisor for generating the device ID
We're currently hitting a race condition on the Cloud Hypervisor's
driver code when quickly removing and adding a block device.

This happens because the device removal is an asynchronous operation,
and we currently do *not* monitor events coming from Cloud Hypervisor to
know when the device was actually removed.  Together with this, the
sandbox code doesn't know about that and when a new device is attached
it'll quickly assign what may be the very same ID to the new device,
leading to the Cloud Hypervisor's driver trying to hotplug a device with
the very same ID of the device that was not yet removed.

This is, in a nutshell, why the tests with Cloud Hypervisor and
devmapper have been failing every now and then.

The workaround taken to solve the issue is basically *not* passing down
the device ID to Cloud Hypervisor and simply letting Cloud Hypervisor
itself generate those, as Cloud Hypervisor does it in a manner that
avoids such conflicts.  With this addition we have then to keep a map of
the device ID and the Cloud Hypervisor's generated ID, so we can
properly remove the device.

This workaround will probably stay for a while, at least till someone
has enough cycles to implement a way to watch the device removal event
and then properly act on that.  Spoiler alert, this will be a complex
change that may not even be worth it considering the race can be avoided
with this commit.

Fixes: #4176

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-05-04 09:04:03 +02:00
Jianyong Wu
982c32358a Merge pull request #4031 from Jaylyn-Ren/kata-spdk
Virtcontainers: Enable hot plugging vhost-user-blk device on ARM
2022-04-29 12:16:38 +08:00
Fabiano Fidêncio
a88adabaae clh: Cloud Hypervisor has a built-in Rate Limiter
The notion of "built-in rate limiter" was added as part of
bd8658e362, and that commit considered
that only Firecracker had a built-in rate limiter, which I think was the
case when that was introduced (mid 2020).

Nowadays, however, Cloud Hypervisor takes advantage of the very same crate
used by Firecraker to do I/O throttling.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:56 +02:00
Fabiano Fidêncio
63c4da03a9 clh: Implement the Disk RateLimiter logic
Let's take advantage of the newly added DiskRateLimiter* options and
apply those to the network device configuration.

The logic here is identical to the one already present in the Network
part of Cloud Hypervisor's driver.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:53 +02:00
Fabiano Fidêncio
5b18575dfe hypervisor: Add disk bandwidth and operations rate limiters
This is the disk counterpart of the what was introduced for the network
as part of the previous commits in this series.

The newly added fields are:
* DiskRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* DiskRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* DiskRateLimiterOpsMaxRate, the operations per second equivalent of the
  DiskRateLimiterBwMaxRate.
* DiskRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the DiskRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:27:11 +02:00
Fabiano Fidêncio
1cf9469297 clh: Implement the Network RateLimiter logic
Let's take advantage of the newly added NetRateLimiter* options and
apply those to the network device configuration.

The logic here is quite similar to the one already present in the
Firecracker's driver, with the main difference being the single Inbound
/ Outbound MaxRate and the presence of both Bandwidth and Operations
rate limiter.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:26:38 +02:00
Fabiano Fidêncio
00a5b1bda9 utils: Define DefaultRateLimiterRefillTimeMilliSecs
Firecracker's driver doesn't expose the RefillTime option of the rate
limiter to the user.  Instead, it uses a contant value of 1000
miliseconds (1 second).

As we're following Firecracker's driver implementation, let's expose
create a new constant, use it as part of the Firecracker's driver, and
later on re-use it as part of the Cloud Hypervisor's driver.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Fabiano Fidêncio
be1bb7e39f utils: Move FC's function to revert bytes to utils
Firecracker's revertBytes function, now called "RevertBytes", can be
exposed as part of the virtcontainers' utils file, as this function will
be reused by Cloud Hypervisor, when adding the rate limiter logic there.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Fabiano Fidêncio
2d35e6066d hypervisor: Add network bandwidth and operations rate limiters
In a similar way to what's already exposed as RxRateLimiterMaxRate and
TxRateLimiterMaxRate, let's add four new fields to the Hypervisor's
configuration.

The values added are related to bandwidth and operations rate limiters,
which have to be added so we can expose I/O throttling configurations to
users using Cloud Hypervisor as their preferred VMM.

The reason we cannot simply re-use {Rx,Tx}RateLimiterMaxRate is because
Cloud Hypervisor exposes a single MaxRate to be used for both inbound
and outbound queues.

The newly added fields are:
* NetRateLimiterBwMaxRate, defined in bits per second, which is used to
  control the network I/O bandwidth at the VM level.
* NetRateLimiterBwOneTimeBurst, also defined in bits per second, which
  is used to define an *initial* max rate, which doesn't replenish.
* NetRateLimiterOpsMaxRate, the operations per second equivalent of the
  NetRateLimiterBwMaxRate.
* NetRateLimiterOpsOneTimeBurst, the operations per second equivalent of
  the NetRateLimiterBwOneTimeBurst.

For now those extra fields have only been added to the hypervisor's
configuration and they'll be used in the coming patches of this very
same series.

Signed-off-by: Fabiano Fidêncio <fabiano.fidencio@intel.com>
2022-04-28 10:22:42 +02:00
Georgina Kinge
67015ac1d7 CCv0: Merge main into CCv0 branch
Merge remote-tracking branch 'upstream/main' into CCv0

Fixes: #4157
Signed-off-by: Georgina Kinge <georgina.kinge@ibm.com>
2022-04-27 10:39:08 +01:00
David Gibson
1b931f4203 runtime: Allock mockfs storage to be placed in any directory
Currently EnableMockTesting() takes no arguments and will always place the
mock storage in the fixed location /tmp/vc/mockfs.  This means that one
test run can interfere with the next one if anything isn't cleaned up
(and there are other bugs which means that happens).  If if those were
fixed this would allow developers testing on the same machine to interfere
with each other.

So, allow the mockfs to be placed at an arbitrary place given as a
parameter to EnableMockTesting().  In TestMain() we place it under our
existing temporary directory, so we don't need any additional cleanup just
for the mockfs.

fixes #4140

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-04-22 14:47:59 +10:00
David Gibson
ef6d54a781 runtime: Let MockFSInit create a mock fs driver at any path
Currently MockFSInit always creates the mockfs at the fixed path
/tmp/vc/mockfs.  This change allows it to be initialized at any path
given as a parameter.  This allows the tests in fs_test.go to be
simplified, because the by using a temporary directory from
t.TempDir(), which is automatically cleaned up, we don't need to
manually trigger initTestDir() (which is misnamed, it's actually a
cleanup function).

For now we still use the fixed path when auto-creating the mockfs in
MockAutoInit(), but we'll change that later.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-04-22 14:23:36 +10:00
David Gibson
5d8438e939 runtime: Move mockfs control global into mockfs.go
virtcontainers/persist/fs/mockfs.go defines a mock filesystem type for
testing.  A global variable in virtcontainers/persist/manager.go is used to
force use of the mock fs rather than a normal one.

This patch moves the global, and the EnableMockTesting() function which
sets it into mockfs.go.  This is slightly cleaner to begin with, and will
allow some further enhancements.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-04-22 14:23:36 +10:00
David Gibson
963d03ea8a runtime: Export StoragePathSuffix
storagePathSuffix defines the file path suffix - "vc" - used for
Kata's persistent storage information, as a private constant.  We
duplicate this information in fc.go which also needs it.

Export it from fs.go instead, so it can be used in fc.go.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-04-22 14:23:36 +10:00
David Gibson
1719a8b491 runtime: Don't abuse MockStorageRootPath() for factory tests
A number of unit tests under virtcontainers/factory use
MockStorageRootPath() as a general purpose temporary directory.  This
doesn't make sense: the mockfs driver isn't even in use here since we only
call EnableMockTesting for the pase virtcontainers package, not the
subpackages.

Instead use t.TempDir() which is for exactly this purpose.  As a bonus it
also handles the cleanup, so we don't need MockStorageDestroy any more.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
2022-04-22 14:23:36 +10:00