The QMP shutdown is taken care of by the sandbox release, through a
call to hypervisor.disconnect(). By shutting down the QMP at the qemu
level directly, we are creating some unrecoverable errors by trying to
close an already closed channel.
This patch simply removes the faulty code, following the same design
other hotplug functions are designed.
Fixes#627
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Add additional `context.Context` parameters and `struct` fields to allow
trace spans to be created by the `virtcontainers` internal functions,
objects and sub-packages.
Note that not every function is traced; we can add more traces as
desired.
Fixes#566.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
We need this configuration due to a limitation in seabios
firmware in handling hotplug for PCI devices with large BARS.
Long term, this needs to be fixed in the firmware.
Fixes#594
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
With qemu 2.10, a write lock was added for qcow images that
prevents the same image to be passed more than once.
This can be over-ridden using the --share-rw option which is
desired for raw images.
This solves an issue with running Kata with devicemapper
using the privileged mode as in this case all devices on the host
are passed to the container including the block device associated
with the rootfs, causing it to be passed twice to qemu.
Fixes#606
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Right now we create it in `createsandbox` and it would
create the vm dir unnecessarily for fetchsandbox() and
it ends up leaving an empty vm dir behind even after
DeleteSandbox.
Fixes: #547
Signed-off-by: Peng Tao <bergwolf@gmail.com>
To use the filepath.Join() instead of the simple
string append method to form the file path, otherwise
it will lose the "/" between the two parts.
Fixes#543.
Signed-off-by: Fupan Li <lifupan@gmail.com>
Remove the `initcall_debug` boot option from the kernel command-line as
we don't need it any more and it generates a ton of boot messages that
may well be impacting performance.
Fixes#526.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
And do not append sandbox id to kernel arguments since that
would fail qemu args comparison in vm factory.
Fixes: #523
Signed-off-by: Peng Tao <bergwolf@gmail.com>
`appendVSockPCI` function can be used to cold plug vocks, vhost file descriptor
holds the context ID and it's inherit by QEMU process, ID must be unique and
disable-modern prevents qemu from relying on fast MMIO.
Signed-off-by: Julio Montes <julio.montes@intel.com>
So that if there is any remaining state, we do not let it interfere
with the new one. This should fix the occasional vm factory hang.
Fixes: #535
Signed-off-by: Peng Tao <bergwolf@gmail.com>
The interface "VhostUserDevice" has duplicate functions and fields with
Device, so we can merge them into one interface and manage them with one
group of interfaces.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Fixes#50
Previously the devices are created with device manager and laterly
attached to hypervisor with "device.Attach()", this could work, but
there's no way to remember the reference count for every device, which
means if we plug one device to hypervisor twice, it's truly inserted
twice, but actually we only need to insert once but use it in many
places.
Use device manager as a consolidated entrypoint of device management can
give us a way to handle many "references" to single device, because it
can save all devices and remember it's use count.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
This PR got merged while it had some issues with some shim processes
being left behind after k8s testing. And because those issues were
real issues introduced by this PR (not some random failures), now
the master branch is broken and new pull requests cannot get the
CI passing. That's the reason why this commit revert the changes
introduced by this PR so that we can fix the master branch.
Fixes#529
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
The interface "VhostUserDevice" has duplicate functions and fields with
Device, so we can merge them into one interface and manage them with one
group of interfaces.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
Fixes#50
Previously the devices are created with device manager and laterly
attached to hypervisor with "device.Attach()", this could work, but
there's no way to remember the reference count for every device, which
means if we plug one device to hypervisor twice, it's truly inserted
twice, but actually we only need to insert once but use it in many
places.
Use device manager as a consolidated entrypoint of device management can
give us a way to handle many "references" to single device, because it
can save all devices and remember it's use count.
Signed-off-by: Wei Zhang <zhangwei555@huawei.com>
For each time a sandbox structure is created, we ensure s.Release()
is called. Then we can keep the qmp connection as long as Sandbox
pointer is alive.
All VC interfaces are still stateless as s.Release() is called before
each API returns.
OTOH, for VCSandbox APIs, FetchSandbox() must be paired with s.Release,
the same as before.
Fixes: #500
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Unify qmp channel setup and teardown. This also fixes the issue that
sometimes qmp pointer is not reset after qmp is shutdown.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
1. support qemu migration save operation
2. setup vm templating parameters per hypervisor config
3. create vm storage path when it does not exist. This can happen when
an empty guest is created without a sandbox.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
A hypervisor implementation does not need to depend on a sandbox
structure. Decouple them in preparation for vm factory.
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add a kernel command-line option that the agent can read to determine
the sandbox ID of the VM. It can use this to create a `sandbox=` log
field for improved log analysis.
Fixes#465.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
We only need one qmp channel and it is qemu internal detail thus
sandbox.go does not need to be aware of it.
Fixes: #428
Signed-off-by: Peng Tao <bergwolf@gmail.com>
The "Failed to allocate HTAB of requested size,
try with smaller maxmem" error in ppc64le occurs
when maxmem allocated is very high. This got fixed
in qemu 2.10 and kernel 4.11. Hence put a maxmem
restriction of 32GB per kata-container if qemu
version less than 2.10
Fixes: #415
Signed-off-by: Nitesh Konkar <niteshkonkar@in.ibm.com>
Don't fail if a new container with a CPU constraint was added to
a POD and no more vCPUs are available, instead apply the constraint
and let kernel balance the resources.
Signed-off-by: Julio Montes <julio.montes@intel.com>
There is a relation between the maximum number of vCPUs and the
memory footprint, if QEMU maxcpus option and kernel nr_cpus
cmdline argument are big, then memory footprint is big, this
issue only occurs if CPU hotplug support is enabled in the kernel,
might be because of kernel needs to allocate resources to watch all
sockets waiting for a CPU to be connected (ACPI event).
For example
```
+---------------+-------------------------+
| | Memory Footprint (KB) |
+---------------+-------------------------+
| NR_CPUS=240 | 186501 |
+---------------+-------------------------+
| NR_CPUS=8 | 110684 |
+---------------+-------------------------+
```
In order to do not affect CPU hotplug and allow to users to have containers
with the same number of physical CPUs, this patch tries to mitigate the
big memory footprint by using the actual number of physical CPUs as the
maximum number of vCPUs for each container if `default_maxvcpus` is <= 0 in
the runtime configuration file, otherwise `default_maxvcpus` is used as the
maximum number of vCPUs.
Before this patch a container with 256MB of RAM
```
total used free shared buff/cache available
Mem: 195M 40M 113M 26M 41M 112M
Swap: 0B 0B 0B
```
With this patch
```
total used free shared buff/cache available
Mem: 236M 11M 188M 26M 36M 186M
Swap: 0B 0B 0B
```
fixes#295
Signed-off-by: Julio Montes <julio.montes@intel.com>
A Unix domain socket is limited to 107 usable bytes on Linux. However,
not all code creating socket paths was checking for this limits.
Created a new `utils.BuildSocketPath()` function (with tests) to
encapsulate the logic and updated all code creating sockets to use it.
Fixes#268.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
Fixes#50
This is done for decoupling device management part from other parts.
It seperate device.go to several dirs and files:
```
virtcontainers/device
├── api
│ └── interface.go
├── config
│ └── config.go
├── drivers
│ ├── block.go
│ ├── generic.go
│ ├── utils.go
│ ├── vfio.go
│ ├── vhost_user_blk.go
│ ├── vhost_user.go
│ ├── vhost_user_net.go
│ └── vhost_user_scsi.go
└── manager
├── manager.go
└── utils.go
```
* `api` contains interface definition of device management, so upper level caller
should import and use the interface, and lower level should implement the interface.
it's bridge to device drivers and callers.
* `config` contains structed exported data.
* `drivers` contains specific device drivers including block, vfio and vhost user
devices.
* `manager` exposes an external management package with a `DeviceManager`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Introduce a new field in Drive to store the PCI address if the drive is
attached using virtio-blk.
Assign PCI address in the format bridge-addr/device-addr.
Since we need to assign the address while hotplugging, pass Drive
by address.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Change the function to return the bridge itself that the
device is attached to. This will allow bridge address to be used
for determining the PCI slot of the device within the guest.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
Pass the slot address while attaching bridges. This is needed
to determine the pci/e address of devices that are attached
to the bridge.
Fixes#210
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
When imported, the vc files carried in the 'full style' apache
license text, but the standard for kata is to use SPDX style.
Update the relevant files to SPDX.
Fixes: #227
Signed-off-by: Graham whaley <graham.whaley@intel.com>
As agreed in [the kata containers API
design](https://github.com/kata-containers/documentation/blob/master/design/kata-api-design.md),
we need to rename pod notion to sandbox. The patch is a bit big but the
actual change is done through the script:
```
sed -i -e 's/pod/sandbox/g' -e 's/Pod/Sandbox/g' -e 's/POD/SB/g'
```
The only expections are `pod_sandbox` and `pod_container` annotations,
since we already pushed them to cri shims, we have to use them unchanged.
Fixes: #199
Signed-off-by: Peng Tao <bergwolf@gmail.com>
Add a hypervisor configuration to specify if IO should
be handled in a separate thread. Add support for iothreads for
virtio-scsi for now. Since we attach all scsi drives to the
same scsi controller, all the drives will be handled in a separate
IO thread which would still give better performance.
Going forward we need to assess if adding more controllers and
attaching iothreasds to each of them with distributing drives
among teh scsi controllers should be done, based on more performance
analysis.
Signed-off-by: Archana Shinde <archana.m.shinde@intel.com>
To fix CI complains:
virtcontainers/qemu.go:248:⚠️ cyclomatic complexity 18 of
function (*qemu).createPod() is high (> 15) (gocyclo)
Signed-off-by: Peng Tao <bergwolf@gmail.com>