Merge pull request #4804 from openanolis/anolis/merge_runtime_rs_to_main

runtime-rs:merge runtime rs to main
This commit is contained in:
Bin Liu
2022-08-11 08:40:41 +08:00
committed by GitHub
279 changed files with 40922 additions and 554 deletions

View File

@@ -47,7 +47,7 @@ jobs:
uses: tim-actions/commit-message-checker-with-regex@v0.3.1
with:
commits: ${{ steps.get-pr-commits.outputs.commits }}
pattern: '^.{0,75}(\n.*)*$'
pattern: '^.{0,75}(\n.*)*$|^Merge pull request (?:kata-containers)?#[\d]+ from.*'
error: 'Subject too long (max 75)'
post_error: ${{ env.error_msg }}
@@ -95,6 +95,6 @@ jobs:
uses: tim-actions/commit-message-checker-with-regex@v0.3.1
with:
commits: ${{ steps.get-pr-commits.outputs.commits }}
pattern: '^[\s\t]*[^:\s\t]+[\s\t]*:'
pattern: '^[\s\t]*[^:\s\t]+[\s\t]*:|^Merge pull request (?:kata-containers)?#[\d]+ from.*'
error: 'Failed to find subsystem in subject'
post_error: ${{ env.error_msg }}

View File

@@ -59,7 +59,7 @@ jobs:
exit 1
}
project_name="Issue backlog"
project_name="runtime-rs"
project_type="org"
project_column="In progress"

View File

@@ -6,8 +6,10 @@
# List of available components
COMPONENTS =
COMPONENTS += libs
COMPONENTS += agent
COMPONENTS += runtime
COMPONENTS += runtime-rs
# List of available tools
TOOLS =
@@ -21,11 +23,6 @@ STANDARD_TARGETS = build check clean install test vendor
default: all
all: logging-crate-tests build
logging-crate-tests:
make -C src/libs/logging
include utils.mk
include ./tools/packaging/kata-deploy/local-build/Makefile
@@ -49,7 +46,6 @@ docs-url-alive-check:
binary-tarball \
default \
install-binary-tarball \
logging-crate-tests \
static-checks \
docs-url-alive-check

View File

@@ -71,6 +71,7 @@ See the [official documentation](docs) including:
- [Developer guide](docs/Developer-Guide.md)
- [Design documents](docs/design)
- [Architecture overview](docs/design/architecture)
- [Architecture 3.0 overview](docs/design/architecture_3.0/)
## Configuration
@@ -117,6 +118,8 @@ The table below lists the core parts of the project:
|-|-|-|
| [runtime](src/runtime) | core | Main component run by a container manager and providing a containerd shimv2 runtime implementation. |
| [agent](src/agent) | core | Management process running inside the virtual machine / POD that sets up the container environment. |
| [libraries](src/libs) | core | Library crates shared by multiple Kata Container components or published to [`crates.io`](https://crates.io/index.html) |
| [`dragonball`](src/dragonball) | core | An optional built-in VMM that brings an out-of-the-box Kata Containers experience with optimizations for container workloads |
| [documentation](docs) | documentation | Documentation common to all components (such as design and install documentation). |
| [libraries](src/libs) | core | Library crates shared by multiple Kata Container components or published to [`crates.io`](https://crates.io/index.html) |
| [tests](https://github.com/kata-containers/tests) | tests | Excludes unit tests which live with the main code. |

View File

@@ -0,0 +1,170 @@
# Kata 3.0 Architecture
## Overview
In cloud-native scenarios, there is increasing demand for faster container startup, lower resource consumption, better stability, and stronger security, areas where the present Kata Containers runtime is challenged relative to other runtimes. To address these demands, we propose a solid, field-tested and secure Rust version of the kata-runtime.
Also, we provide the following designs:
- Turnkey solution with the built-in `Dragonball` sandbox
- Async I/O to reduce resource consumption
- Extensible framework for multiple services, runtimes and hypervisors
- Lifecycle management for sandbox and container associated resources
### Rationale for choosing Rust
We chose Rust because it is designed as a system language with a focus on efficiency.
In contrast to Go, Rust makes a variety of design trade-offs in order to obtain
good execution performance, with innovative techniques that, in contrast to C or
C++, provide reasonable protection against common memory errors (buffer
overflow, invalid pointers, range errors), error checking (ensuring errors are
dealt with), thread safety, ownership of resources, and more.
These benefits were verified in our project when the Kata Containers guest agent
was rewritten in Rust. We notably saw a significant reduction in memory usage
with the Rust-based implementation.
## Design
### Architecture
![architecture](./images/architecture.png)
### Built-in VMM
#### Current Kata 2.x architecture
![not_builtin_vmm](./images/not_built_in_vmm.png)
As shown in the figure, the runtime and the VMM are separate processes. The runtime process forks the VMM process and the two interact through inter-process RPC. Inter-process communication typically consumes more resources than communication within a single process and results in relatively lower efficiency. At the same time, the cost of resource operation and maintenance must be considered: for example, when recovering resources after a failure, the failure of either process must be detected by the other, which must then trigger the appropriate recovery procedure. With additional processes, recovery becomes even more difficult.
#### How To Support Built-in VMM
We provide the `Dragonball` sandbox to enable a built-in VMM by integrating the VMM's functionality into a Rust library. The runtime performs VMM-related operations by calling into that library. Because the runtime and the VMM live in the same process, message processing is faster and APIs stay synchronized. It also guarantees consistency between the runtime and VMM life cycles, reducing the maintenance burden of resource recovery and exception handling, as shown in the figure:
![builtin_vmm](./images/built_in_vmm.png)
### Async Support
#### Why Async Is Needed
**Async is already in stable Rust and allows us to write async code**
- Async significantly reduces CPU and memory overhead, especially for workloads with a large number of IO-bound tasks
- Async is zero-cost in Rust, which means that you only pay for what you use. Specifically, you can use async without heap allocations and dynamic dispatch, which greatly improves efficiency
- For more details, see [Why Async?](https://rust-lang.github.io/async-book/01_getting_started/02_why_async.html) and [The State of Asynchronous Rust](https://rust-lang.github.io/async-book/01_getting_started/03_state_of_async_rust.html).
**Implementing the kata-runtime with sync Rust would raise several problems**
- Each new TTRPC connection spawns too many threads
  - TTRPC threads: reaper thread (1) + listener thread (1) + client handler (2)
- Each new container adds 3 I/O threads
- In sync mode, implementing a timeout mechanism is challenging. For example, in TTRPC API interaction, the timeout mechanism is difficult to align with the Golang implementation
#### How To Support Async
The kata-runtime uses TOKIO_RUNTIME_WORKER_THREADS to control the number of OS threads, which is 2 by default. TTRPC- and container-related threads all run as `tokio` tasks, and the related dependencies need to be switched to async, such as Timer, File, Netlink, etc. With the help of async, we can easily support non-blocking I/O and timers. Currently, we only use async for the kata-runtime. The built-in VMM keeps its own OS threads because that ensures the threads are controllable.
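The exact startup code lives in `runtime-rs`; the sketch below only illustrates how such a `tokio` runtime could be built. Reading the thread count from an environment variable named `TOKIO_RUNTIME_WORKER_THREADS` is an assumption made for this example, not necessarily how the real runtime obtains it.
```rust
use std::env;

use tokio::runtime::{Builder, Runtime};

// Illustrative sketch: build a multi-threaded tokio runtime with a
// configurable number of worker threads (2 by default, as described above).
fn build_runtime() -> std::io::Result<Runtime> {
    let worker_threads = env::var("TOKIO_RUNTIME_WORKER_THREADS")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(2);

    Builder::new_multi_thread()
        .worker_threads(worker_threads)
        .enable_all()
        .build()
}
```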
**For N tokio worker threads and M containers**
- Sync runtime (both plain OS threads and `tokio` tasks run as OS threads, with no `tokio` worker threads): 4 + 12*M OS threads
- Async runtime (only dedicated OS threads remain): 2 + N OS threads
```shell
├─ main(OS thread)
├─ async-logger(OS thread)
└─ tokio worker(N * OS thread)
├─ agent log forwarder(1 * tokio task)
├─ health check thread(1 * tokio task)
├─ TTRPC reaper thread(M * tokio task)
├─ TTRPC listener thread(M * tokio task)
├─ TTRPC client handler thread(7 * M * tokio task)
├─ container stdin io thread(M * tokio task)
├─ container stdout io thread(M * tokio task)
└─ container stderr io thread(M * tokio task)
```
### Extensible Framework
The Kata 3.x runtime is designed around extensible services, runtimes, and hypervisors, combined with configuration to meet the needs of different scenarios. At present, the service layer provides a register mechanism to support multiple services, and services interact with the runtime through messages. The runtime handler, in turn, handles messages from services. To allow one binary to support multiple runtimes and hypervisors, the startup code obtains the runtime handler type and hypervisor type from the configuration. A sketch of such a register mechanism follows the figure below.
![framework](./images/framework.png)
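The sketch below shows one way the register mechanism described above could look in Rust. The trait and type names (`RuntimeHandler`, `HandlerRegistry`, `Message`) are hypothetical and are not the actual runtime-rs API.
```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical message exchanged between a service and the runtime.
pub struct Message {
    pub action: String,
    pub payload: Vec<u8>,
}

/// Hypothetical runtime handler trait; Virt-Container, Wasm-Container and
/// Linux-Container handlers would each implement it.
pub trait RuntimeHandler: Send + Sync {
    /// Name used to select this handler from the configuration.
    fn name(&self) -> &'static str;
    /// Handle a message dispatched from a service.
    fn handle_message(&self, msg: Message) -> Result<(), String>;
}

/// Hypothetical registry: handlers register themselves, and startup code
/// looks one up by the runtime type found in the configuration file.
#[derive(Default)]
pub struct HandlerRegistry {
    handlers: HashMap<&'static str, Arc<dyn RuntimeHandler>>,
}

impl HandlerRegistry {
    pub fn register(&mut self, handler: Arc<dyn RuntimeHandler>) {
        self.handlers.insert(handler.name(), handler);
    }

    pub fn get(&self, name: &str) -> Option<Arc<dyn RuntimeHandler>> {
        self.handlers.get(name).cloned()
    }
}
```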
### Resource Manager
In our case, there is a variety of resources, and every resource has several subtypes. Especially for `Virt-Container`, every subtype of resource has different operations, and there may be dependencies; for example, the share-fs rootfs and the share-fs volume both use the share-fs resource to share files with the VM. Currently, network and share-fs are regarded as sandbox resources, while rootfs, volume, and cgroup are regarded as container resources. We abstract a common interface for each resource and implement the subtype-specific differences behind it, as sketched after the figure below.
![resource manager](./images/resourceManager.png)
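The sketch below illustrates the common-interface idea for resources; the trait and struct names are hypothetical, not the actual runtime-rs types.
```rust
use std::collections::HashMap;

/// Hypothetical common interface for a managed resource; each subtype
/// (share-fs volume, rootfs, cgroup, network endpoint, ...) implements
/// its own setup and cleanup behind the same trait.
pub trait Resource {
    /// Stable identifier, e.g. a volume or endpoint name.
    fn id(&self) -> String;
    /// Prepare the resource (mount a share-fs volume, create a tap device, ...).
    fn setup(&mut self) -> Result<(), String>;
    /// Release the resource when the sandbox or container goes away.
    fn cleanup(&mut self) -> Result<(), String>;
}

/// Hypothetical manager that owns sandbox-level and container-level resources
/// so they can be cleaned up together on teardown.
#[derive(Default)]
pub struct ResourceManager {
    sandbox_resources: Vec<Box<dyn Resource>>,
    container_resources: HashMap<String, Vec<Box<dyn Resource>>>,
}

impl ResourceManager {
    pub fn add_sandbox_resource(&mut self, r: Box<dyn Resource>) {
        self.sandbox_resources.push(r);
    }

    pub fn add_container_resource(&mut self, cid: &str, r: Box<dyn Resource>) {
        self.container_resources
            .entry(cid.to_string())
            .or_default()
            .push(r);
    }
}
```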
## Roadmap
- Stage 1 (June): provide basic features (currently delivered)
- Stage 2 (September): support common features
- Stage 3: support full features
| **Class** | **Sub-Class** | **Development Stage** |
| -------------------------- | ------------------- | --------------------- |
| Service | task service | Stage 1 |
| | extend service | Stage 3 |
| | image service | Stage 3 |
| Runtime handler | `Virt-Container` | Stage 1 |
| | `Wasm-Container` | Stage 3 |
| | `Linux-Container` | Stage 3 |
| Endpoint | VETH Endpoint | Stage 1 |
| | Physical Endpoint | Stage 2 |
| | Tap Endpoint | Stage 2 |
| | `Tuntap` Endpoint | Stage 2 |
| | `IPVlan` Endpoint | Stage 3 |
| | `MacVlan` Endpoint | Stage 3 |
| | MACVTAP Endpoint | Stage 3 |
| | `VhostUserEndpoint` | Stage 3 |
| Network Interworking Model | Tc filter | Stage 1 |
| | Route | Stage 1 |
| | `MacVtap` | Stage 3 |
| Storage | Virtio-fs | Stage 1 |
| | `nydus` | Stage 2 |
| Hypervisor | `Dragonball` | Stage 1 |
| | QEMU | Stage 2 |
| | ACRN | Stage 3 |
| | Cloud Hypervisor | Stage 3 |
| | Firecracker | Stage 3 |
## FAQ
- Are the "service", "message dispatcher" and "runtime handler" all part of the single Kata 3.x runtime binary?
Yes. They are components in the Kata 3.x runtime, and they will be packed into one binary.
1. Service is an interface responsible for handling multiple services such as the task service and image service.
2. The message dispatcher matches and routes requests coming from the service module.
3. The runtime handler deals with operations on sandboxes and containers.
- What is the name of the Kata 3.x runtime binary?
Apparently we can't use `containerd-shim-v2-kata` because it's already used. We are facing the hardest issue of "naming" again. Any suggestions are welcome.
Internally we use `containerd-shim-v2-rund`.
- Is the Kata 3.x design compatible with the containerd shimv2 architecture?
Yes. It is designed to follow the functionality of the Go version of Kata, and it implements the `containerd shim v2` interface/protocol.
- How will users migrate to the Kata 3.x architecture?
The migration plan will be provided before Kata 3.x is merged into the main branch.
- Is `Dragonball` limited to its own built-in VMM? Can the `Dragonball` system be configured to work using an external `Dragonball` VMM/hypervisor?
`Dragonball` can work as an external hypervisor, but stability and performance are more challenging in that case. The built-in VMM reduces container overhead, and it is easier to maintain stability.
`runD` is the `containerd-shim-v2` counterpart of `runC` and can run a pod/containers. `Dragonball` is a `microvm`/VMM designed to run container workloads. Instead of `microvm`/VMM, we sometimes refer to it as a secure sandbox.
- QEMU, Cloud Hypervisor and Firecracker support are planned, but how would that work? Will they run as separate processes?
Yes. They cannot work as built-in VMMs, so they will run as separate processes.
- What is `upcall`?
The `upcall` is used to hotplug CPU/memory/MMIO devices, and it solves two issues:
1. it avoids a dependency on PCI/ACPI
2. it avoids a dependency on `udevd` within the guest and gives deterministic results for hotplug operations. So `upcall` is an alternative to ACPI-based CPU/memory/device hotplug, and we may cooperate with the community to add support for ACPI-based CPU/memory/device hotplug if needed.
`Dbs-upcall` is a `vsock-based` direct communication tool between the VMM and the guest. The server side of the `upcall` is a driver in the guest kernel (kernel patches are needed for this feature) and it starts serving requests once the kernel has started. The client side lives in the VMM; it is a thread that communicates over VSOCK through `uds`. We have implemented device hotplug / hot-unplug directly through `upcall` in order to avoid virtualizing ACPI and thus minimize the virtual machine's overhead, and there could be many other uses for this direct communication channel. It is already open source:
https://github.com/openanolis/dragonball-sandbox/tree/main/crates/dbs-upcall
- The URL below says the kernel patches work with 4.19, but do they also work with 5.15+ ?
Forward compatibility should be achievable; we have already ported it to a 5.10-based kernel.
- Are these patches platform-specific or would they work for any architecture that supports VSOCK?
It is almost platform-independent, but some messages related to CPU hotplug are platform-dependent.
- Could the kernel driver be replaced with a userland daemon in the guest using loopback VSOCK?
We need to create device nodes for hot-added CPUs, memory, and devices, so it is not easy for a userspace daemon to do these tasks.
- The fact that `upcall` allows communication between the VMM and the guest suggests that this architecture might be incompatible with https://github.com/confidential-containers where the VMM should have no knowledge of what happens inside the VM.
1. `TDX` doesn't support CPU/memory hotplug yet.
2. ACPI-based device hotplug depends on the ACPI `DSDT` table, and the guest kernel executes `ASL` code to handle those hotplug events. It should be easier to audit VSOCK-based communication than ACPI `ASL` methods.
- What is the security boundary for the monolithic / "Built-in VMM" case?
It has the security boundary of virtualization. More details will be provided in the next stage.

Binary file not shown (new image, 100 KiB)

Binary file not shown (new image, 66 KiB)

Binary file not shown (new image, 136 KiB)

Binary file not shown (new image, 72 KiB)

Binary file not shown (new image, 139 KiB)

File diff suppressed because one or more lines are too long

View File

@@ -33,6 +33,7 @@ are available, their default values and how each setting can be used.
[Cloud Hypervisor] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) | `configuration-clh.toml` |
[Firecracker] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) | `configuration-fc.toml` |
[QEMU] | C | all | Type 2 ([KVM]) | `configuration-qemu.toml` |
[`Dragonball`] | rust | `aarch64`, `x86_64` | Type 2 ([KVM]) | `configuration-dragonball.toml` |
## Determine currently configured hypervisor
@@ -52,6 +53,7 @@ the hypervisors:
[Cloud Hypervisor] | Low latency, small memory footprint, small attack surface | Minimal | | excellent | excellent | High performance modern cloud workloads | |
[Firecracker] | Very slimline | Extremely minimal | Doesn't support all device types | excellent | excellent | Serverless / FaaS | |
[QEMU] | Lots of features | Lots | | good | good | Good option for most users | | All users |
[`Dragonball`] | Built-in VMM, low CPU and memory overhead | Minimal | | excellent | excellent | Optimized for most container workloads | `out-of-the-box` Kata Containers experience |
For further details, see the [Virtualization in Kata Containers](design/virtualization.md) document and the official documentation for each hypervisor.
@@ -60,3 +62,4 @@ For further details, see the [Virtualization in Kata Containers](design/virtuali
[Firecracker]: https://github.com/firecracker-microvm/firecracker
[KVM]: https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
[QEMU]: http://www.qemu-project.org
[`Dragonball`]: https://github.com/openanolis/dragonball-sandbox

View File

@@ -79,3 +79,6 @@ versions. This is not recommended for normal users.
* [upgrading document](../Upgrading.md)
* [developer guide](../Developer-Guide.md)
* [runtime documentation](../../src/runtime/README.md)
## Kata Containers 3.0 rust runtime installation
* [installation guide](../install/kata-containers-3.0-rust-runtime-installation-guide.md)

View File

@@ -0,0 +1,101 @@
# Kata Containers 3.0 rust runtime installation
The following is an overview of the different installation methods available.
## Prerequisites
Kata Containers 3.0 rust runtime requires nested virtualization or bare metal. Check
[hardware requirements](/src/runtime/README.md#hardware-requirements) to see if your system is capable of running Kata
Containers.
### Platform support
Kata Containers 3.0 rust runtime currently runs on 64-bit systems supporting the following
architectures:
> **Notes:**
> For other architectures, see https://github.com/kata-containers/kata-containers/issues/4320
| Architecture | Virtualization technology |
|-|-|
| `x86_64`| [Intel](https://www.intel.com) VT-x |
| `aarch64` ("`arm64`")| [ARM](https://www.arm.com) Hyp |
## Packaged installation methods
| Installation method | Description | Automatic updates | Use case | Availability
|------------------------------------------------------|----------------------------------------------------------------------------------------------|-------------------|-----------------------------------------------------------------------------------------------|----------- |
| [Using kata-deploy](#kata-deploy-installation) | The preferred way to deploy the Kata Containers distributed binaries on a Kubernetes cluster | **No!** | The best way to try Kata Containers on an already up and running Kubernetes cluster. | No |
| [Using official distro packages](#official-packages) | Kata packages provided by Linux distributions' official repositories | yes | Recommended for most users. | No |
| [Using snap](#snap-installation) | Easy to install | yes | Good alternative to official distro packages. | No |
| [Automatic](#automatic-installation) | Run a single command to install a full system | **No!** | For those wanting the latest release quickly. | No |
| [Manual](#manual-installation) | Follow a guide step-by-step to install a working system | **No!** | For those who want the latest release with more control. | No |
| [Build from source](#build-from-source-installation) | Build the software components manually | **No!** | Power users and developers only. | Yes |
### Kata Deploy Installation
`ToDo`
### Official packages
`ToDo`
### Snap Installation
`ToDo`
### Automatic Installation
`ToDo`
### Manual Installation
`ToDo`
## Build from source installation
### Rust Environment Set Up
* Download `Rustup` and install `Rust`
> **Notes:**
> Rust version 1.58 is needed
Example for `x86_64`
```
$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
$ source $HOME/.cargo/env
$ rustup install 1.58
$ rustup default 1.58-x86_64-unknown-linux-gnu
```
* Musl support for fully static binary
Example for `x86_64`
```
$ rustup target add x86_64-unknown-linux-musl
```
* [Musl `libc`](http://musl.libc.org/) install
Example for musl 1.2.3
```
$ wget https://git.musl-libc.org/cgit/musl/snapshot/musl-1.2.3.tar.gz
$ tar vxf musl-1.2.3.tar.gz
$ cd musl-1.2.3/
$ ./configure --prefix=/usr/local/
$ make && sudo make install
```
### Install Kata 3.0 Rust Runtime Shim
```
$ git clone https://github.com/kata-containers/kata-containers.git
$ cd kata-containers/src/runtime-rs
$ make && make install
```
After running the commands above, the default config file `configuration.toml` will be installed under `/usr/share/defaults/kata-containers/`, and the binary `containerd-shim-kata-v2` will be installed under `/usr/local/bin`.
### Build Kata Containers Kernel
Follow the [Kernel installation guide](/tools/packaging/kernel/README.md).
### Build Kata Rootfs
Follow the [Rootfs installation guide](../../tools/osbuilder/rootfs-builder/README.md).
### Build Kata Image
Follow the [Image installation guide](../../tools/osbuilder/image-builder/README.md).
### Install Containerd
Follow the [Containerd installation guide](container-manager/containerd/containerd-install.md).

src/agent/Cargo.lock generated (169 lines)
View File

@@ -98,6 +98,12 @@ version = "3.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "37ccbd214614c6783386c1af30caf03192f17891059cecc394b4fb119e363de3"
[[package]]
name = "byte-unit"
version = "3.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "415301c9de11005d4b92193c0eb7ac7adc37e5a49e0ac9bed0a42343512744b8"
[[package]]
name = "byteorder"
version = "1.4.3"
@@ -224,6 +230,12 @@ dependencies = [
"os_str_bytes",
]
[[package]]
name = "common-path"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2382f75942f4b3be3690fe4f86365e9c853c1587d6ee58212cebf6e2a9ccd101"
[[package]]
name = "core-foundation-sys"
version = "0.8.3"
@@ -322,6 +334,17 @@ dependencies = [
"libc",
]
[[package]]
name = "fail"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ec3245a0ca564e7f3c797d20d833a6870f57a728ac967d5225b3ffdef4465011"
dependencies = [
"lazy_static",
"log",
"rand 0.8.5",
]
[[package]]
name = "fastrand"
version = "1.7.0"
@@ -442,6 +465,17 @@ dependencies = [
"slab",
]
[[package]]
name = "getrandom"
version = "0.1.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8fc3cb4d91f53b50155bdcfd23f6a4c39ae1969c2ae85982b135750cccaf5fce"
dependencies = [
"cfg-if 1.0.0",
"libc",
"wasi 0.9.0+wasi-snapshot-preview1",
]
[[package]]
name = "getrandom"
version = "0.2.7"
@@ -453,6 +487,12 @@ dependencies = [
"wasi 0.11.0+wasi-snapshot-preview1",
]
[[package]]
name = "glob"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574"
[[package]]
name = "hashbrown"
version = "0.12.1"
@@ -584,13 +624,14 @@ dependencies = [
"clap",
"futures",
"ipnetwork",
"kata-sys-util",
"lazy_static",
"libc",
"log",
"logging",
"netlink-packet-utils 0.4.1",
"netlink-sys 0.7.0",
"nix 0.23.1",
"nix 0.24.2",
"oci",
"opentelemetry",
"procfs",
@@ -621,6 +662,47 @@ dependencies = [
"vsock-exporter",
]
[[package]]
name = "kata-sys-util"
version = "0.1.0"
dependencies = [
"byteorder",
"cgroups-rs",
"chrono",
"common-path",
"fail",
"kata-types",
"lazy_static",
"libc",
"nix 0.24.2",
"oci",
"once_cell",
"rand 0.7.3",
"serde_json",
"slog",
"slog-scope",
"subprocess",
"thiserror",
]
[[package]]
name = "kata-types"
version = "0.1.0"
dependencies = [
"byte-unit",
"glob",
"lazy_static",
"num_cpus",
"oci",
"regex",
"serde",
"serde_json",
"slog",
"slog-scope",
"thiserror",
"toml",
]
[[package]]
name = "lazy_static"
version = "1.4.0"
@@ -857,6 +939,7 @@ dependencies = [
"bitflags",
"cfg-if 1.0.0",
"libc",
"memoffset",
]
[[package]]
@@ -935,7 +1018,7 @@ dependencies = [
"lazy_static",
"percent-encoding",
"pin-project",
"rand",
"rand 0.8.5",
"serde",
"thiserror",
"tokio",
@@ -1199,9 +1282,9 @@ dependencies = [
[[package]]
name = "protobuf"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e86d370532557ae7573551a1ec8235a0f8d6cb276c7c9e6aa490b511c447485"
checksum = "cf7e6d18738ecd0902d30d1ad232c9125985a3422929b16c65517b38adc14f96"
dependencies = [
"serde",
"serde_derive",
@@ -1209,18 +1292,18 @@ dependencies = [
[[package]]
name = "protobuf-codegen"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "de113bba758ccf2c1ef816b127c958001b7831136c9bc3f8e9ec695ac4e82b0c"
checksum = "aec1632b7c8f2e620343439a7dfd1f3c47b18906c4be58982079911482b5d707"
dependencies = [
"protobuf",
]
[[package]]
name = "protobuf-codegen-pure"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2d1a4febc73bf0cada1d77c459a0c8e5973179f1cfd5b0f1ab789d45b17b6440"
checksum = "9f8122fdb18e55190c796b088a16bdb70cd7acdcd48f7a8b796b58c62e532cc6"
dependencies = [
"protobuf",
"protobuf-codegen",
@@ -1231,6 +1314,7 @@ name = "protocols"
version = "0.1.0"
dependencies = [
"async-trait",
"oci",
"protobuf",
"ttrpc",
"ttrpc-codegen",
@@ -1245,6 +1329,19 @@ dependencies = [
"proc-macro2",
]
[[package]]
name = "rand"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6a6b1679d49b24bbfe0c803429aa1874472f50d9b363131f0e89fc356b544d03"
dependencies = [
"getrandom 0.1.16",
"libc",
"rand_chacha 0.2.2",
"rand_core 0.5.1",
"rand_hc",
]
[[package]]
name = "rand"
version = "0.8.5"
@@ -1252,8 +1349,18 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404"
dependencies = [
"libc",
"rand_chacha",
"rand_core",
"rand_chacha 0.3.1",
"rand_core 0.6.3",
]
[[package]]
name = "rand_chacha"
version = "0.2.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f4c8ed856279c9737206bf725bf36935d8666ead7aa69b52be55af369d193402"
dependencies = [
"ppv-lite86",
"rand_core 0.5.1",
]
[[package]]
@@ -1263,7 +1370,16 @@ source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
dependencies = [
"ppv-lite86",
"rand_core",
"rand_core 0.6.3",
]
[[package]]
name = "rand_core"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "90bde5296fc891b0cef12a6d03ddccc162ce7b2aff54160af9338f8d40df6d19"
dependencies = [
"getrandom 0.1.16",
]
[[package]]
@@ -1272,7 +1388,16 @@ version = "0.6.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d34f1408f55294453790c48b2f1ebbb1c5b4b7563eb1f418bcfcfdbb06ebb4e7"
dependencies = [
"getrandom",
"getrandom 0.2.7",
]
[[package]]
name = "rand_hc"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ca3129af7b92a17112d59ad498c6f81eaf463253766b90396d39ea7a39d6613c"
dependencies = [
"rand_core 0.5.1",
]
[[package]]
@@ -1579,6 +1704,16 @@ version = "0.10.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "73473c0e59e6d5812c5dfe2a064a6444949f089e20eec9a2e5506596494e4623"
[[package]]
name = "subprocess"
version = "0.2.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c2e86926081dda636c546d8c5e641661049d7562a68f5488be4a1f7f66f6086"
dependencies = [
"libc",
"winapi",
]
[[package]]
name = "syn"
version = "1.0.98"
@@ -1846,9 +1981,9 @@ dependencies = [
[[package]]
name = "ttrpc"
version = "0.5.3"
version = "0.6.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "c46d73bc2a74f2440921b6539afbed68064b48b2c4f194c637430d1c83d052ad"
checksum = "2ecfff459a859c6ba6668ff72b34c2f1d94d9d58f7088414c2674ad0f31cc7d8"
dependencies = [
"async-trait",
"byteorder",
@@ -1947,6 +2082,12 @@ dependencies = [
"tokio-vsock",
]
[[package]]
name = "wasi"
version = "0.9.0+wasi-snapshot-preview1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cccddf32554fecc6acb585f82a32a72e28b48f8c4c1883ddfeeeaa96f7d8e519"
[[package]]
name = "wasi"
version = "0.10.0+wasi-snapshot-preview1"

View File

@@ -7,12 +7,12 @@ edition = "2018"
[dependencies]
oci = { path = "../libs/oci" }
rustjail = { path = "rustjail" }
protocols = { path = "../libs/protocols" }
protocols = { path = "../libs/protocols", features = ["async"] }
lazy_static = "1.3.0"
ttrpc = { version = "0.5.0", features = ["async", "protobuf-codec"], default-features = false }
protobuf = "=2.14.0"
ttrpc = { version = "0.6.0", features = ["async"], default-features = false }
protobuf = "2.27.0"
libc = "0.2.58"
nix = "0.23.0"
nix = "0.24.1"
capctl = "0.2.0"
serde_json = "1.0.39"
scan_fmt = "0.2.3"
@@ -20,6 +20,7 @@ scopeguard = "1.0.0"
thiserror = "1.0.26"
regex = "1.5.5"
serial_test = "0.5.1"
kata-sys-util = { path = "../libs/kata-sys-util" }
sysinfo = "0.23.0"
# Async helpers

View File

@@ -107,10 +107,7 @@ endef
##TARGET default: build code
default: $(TARGET) show-header
$(TARGET): $(GENERATED_CODE) logging-crate-tests $(TARGET_PATH)
logging-crate-tests:
make -C $(CWD)/../libs/logging
$(TARGET): $(GENERATED_CODE) $(TARGET_PATH)
$(TARGET_PATH): show-summary
@RUSTFLAGS="$(EXTRA_RUSTFLAGS) --deny warnings" cargo build --target $(TRIPLE) $(if $(findstring release,$(BUILD_TYPE)),--release) $(EXTRA_RUSTFEATURES)
@@ -203,7 +200,6 @@ codecov-html: check_tarpaulin
.PHONY: \
help \
logging-crate-tests \
optimize \
show-header \
show-summary \

View File

@@ -16,7 +16,7 @@ scopeguard = "1.0.0"
capctl = "0.2.0"
lazy_static = "1.3.0"
libc = "0.2.58"
protobuf = "=2.14.0"
protobuf = "2.27.0"
slog = "2.5.2"
slog-scope = "4.1.2"
scan_fmt = "0.2.6"
@@ -27,7 +27,7 @@ cgroups = { package = "cgroups-rs", version = "0.2.8" }
rlimit = "0.5.3"
cfg-if = "0.1.0"
tokio = { version = "1.2.0", features = ["sync", "io-util", "process", "time", "macros"] }
tokio = { version = "1.2.0", features = ["sync", "io-util", "process", "time", "macros", "rt"] }
futures = "0.3.17"
async-trait = "0.1.31"
inotify = "0.9.2"

View File

@@ -9,7 +9,7 @@ use anyhow::{anyhow, Result};
use nix::fcntl::{self, FcntlArg, FdFlag, OFlag};
use nix::libc::{STDERR_FILENO, STDIN_FILENO, STDOUT_FILENO};
use nix::pty::{openpty, OpenptyResult};
use nix::sys::socket::{self, AddressFamily, SockAddr, SockFlag, SockType};
use nix::sys::socket::{self, AddressFamily, SockFlag, SockType, VsockAddr};
use nix::sys::stat::Mode;
use nix::sys::wait;
use nix::unistd::{self, close, dup2, fork, setsid, ForkResult, Pid};
@@ -67,7 +67,7 @@ pub async fn debug_console_handler(
SockFlag::SOCK_CLOEXEC,
None,
)?;
let addr = SockAddr::new_vsock(libc::VMADDR_CID_ANY, port);
let addr = VsockAddr::new(libc::VMADDR_CID_ANY, port);
socket::bind(listenfd, &addr)?;
socket::listen(listenfd, 1)?;

View File

@@ -22,7 +22,7 @@ extern crate slog;
use anyhow::{anyhow, Context, Result};
use clap::{AppSettings, Parser};
use nix::fcntl::OFlag;
use nix::sys::socket::{self, AddressFamily, SockAddr, SockFlag, SockType};
use nix::sys::socket::{self, AddressFamily, SockFlag, SockType, VsockAddr};
use nix::unistd::{self, dup, Pid};
use std::env;
use std::ffi::OsStr;
@@ -128,7 +128,7 @@ async fn create_logger_task(rfd: RawFd, vsock_port: u32, shutdown: Receiver<bool
None,
)?;
let addr = SockAddr::new_vsock(libc::VMADDR_CID_ANY, vsock_port);
let addr = VsockAddr::new(libc::VMADDR_CID_ANY, vsock_port);
socket::bind(listenfd, &addr)?;
socket::listen(listenfd, 1)?;

View File

@@ -34,6 +34,7 @@ use protocols::health::{
HealthCheckResponse, HealthCheckResponse_ServingStatus, VersionCheckResponse,
};
use protocols::types::Interface;
use protocols::{agent_ttrpc_async as agent_ttrpc, health_ttrpc_async as health_ttrpc};
use rustjail::cgroups::notifier;
use rustjail::container::{BaseContainer, Container, LinuxContainer};
use rustjail::process::Process;
@@ -133,30 +134,6 @@ pub struct AgentService {
sandbox: Arc<Mutex<Sandbox>>,
}
// A container ID must match this regex:
//
// ^[a-zA-Z0-9][a-zA-Z0-9_.-]+$
//
fn verify_cid(id: &str) -> Result<()> {
let mut chars = id.chars();
let valid = match chars.next() {
Some(first)
if first.is_alphanumeric()
&& id.len() > 1
&& chars.all(|c| c.is_alphanumeric() || ['.', '-', '_'].contains(&c)) =>
{
true
}
_ => false,
};
match valid {
true => Ok(()),
false => Err(anyhow!("invalid container ID: {:?}", id)),
}
}
impl AgentService {
#[instrument]
async fn do_create_container(
@@ -165,7 +142,7 @@ impl AgentService {
) -> Result<()> {
let cid = req.container_id.clone();
verify_cid(&cid)?;
kata_sys_util::validate::verify_id(&cid)?;
let mut oci_spec = req.OCI.clone();
let use_sandbox_pidns = req.get_sandbox_pidns();
@@ -650,7 +627,7 @@ impl AgentService {
}
#[async_trait]
impl protocols::agent_ttrpc::AgentService for AgentService {
impl agent_ttrpc::AgentService for AgentService {
async fn create_container(
&self,
ctx: &TtrpcContext,
@@ -1536,7 +1513,7 @@ impl protocols::agent_ttrpc::AgentService for AgentService {
struct HealthService;
#[async_trait]
impl protocols::health_ttrpc::Health for HealthService {
impl health_ttrpc::Health for HealthService {
async fn check(
&self,
_ctx: &TtrpcContext,
@@ -1675,18 +1652,17 @@ async fn read_stream(reader: Arc<Mutex<ReadHalf<PipeStream>>>, l: usize) -> Resu
}
pub fn start(s: Arc<Mutex<Sandbox>>, server_address: &str) -> Result<TtrpcServer> {
let agent_service = Box::new(AgentService { sandbox: s })
as Box<dyn protocols::agent_ttrpc::AgentService + Send + Sync>;
let agent_service =
Box::new(AgentService { sandbox: s }) as Box<dyn agent_ttrpc::AgentService + Send + Sync>;
let agent_worker = Arc::new(agent_service);
let health_service =
Box::new(HealthService {}) as Box<dyn protocols::health_ttrpc::Health + Send + Sync>;
let health_service = Box::new(HealthService {}) as Box<dyn health_ttrpc::Health + Send + Sync>;
let health_worker = Arc::new(health_service);
let aservice = protocols::agent_ttrpc::create_agent_service(agent_worker);
let aservice = agent_ttrpc::create_agent_service(agent_worker);
let hservice = protocols::health_ttrpc::create_health(health_worker);
let hservice = health_ttrpc::create_health(health_worker);
let server = TtrpcServer::new()
.bind(server_address)?
@@ -2012,7 +1988,7 @@ fn load_kernel_module(module: &protocols::agent::KernelModule) -> Result<()> {
mod tests {
use super::*;
use crate::{
assert_result, namespace::Namespace, protocols::agent_ttrpc::AgentService as _,
assert_result, namespace::Namespace, protocols::agent_ttrpc_async::AgentService as _,
skip_if_not_root,
};
use nix::mount;
@@ -2672,233 +2648,6 @@ OtherField:other
}
}
#[tokio::test]
async fn test_verify_cid() {
#[derive(Debug)]
struct TestData<'a> {
id: &'a str,
expect_error: bool,
}
let tests = &[
TestData {
// Cannot be blank
id: "",
expect_error: true,
},
TestData {
// Cannot be a space
id: " ",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: ".",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "_",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: " a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: ".a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "_a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "..",
expect_error: true,
},
TestData {
// Too short
id: "a",
expect_error: true,
},
TestData {
// Too short
id: "z",
expect_error: true,
},
TestData {
// Too short
id: "A",
expect_error: true,
},
TestData {
// Too short
id: "Z",
expect_error: true,
},
TestData {
// Too short
id: "0",
expect_error: true,
},
TestData {
// Too short
id: "9",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-1",
expect_error: true,
},
TestData {
id: "/",
expect_error: true,
},
TestData {
id: "a/",
expect_error: true,
},
TestData {
id: "a/../",
expect_error: true,
},
TestData {
id: "../a",
expect_error: true,
},
TestData {
id: "../../a",
expect_error: true,
},
TestData {
id: "../../../a",
expect_error: true,
},
TestData {
id: "foo/../bar",
expect_error: true,
},
TestData {
id: "foo bar",
expect_error: true,
},
TestData {
id: "a.",
expect_error: false,
},
TestData {
id: "a..",
expect_error: false,
},
TestData {
id: "aa",
expect_error: false,
},
TestData {
id: "aa.",
expect_error: false,
},
TestData {
id: "hello..world",
expect_error: false,
},
TestData {
id: "hello/../world",
expect_error: true,
},
TestData {
id: "aa1245124sadfasdfgasdga.",
expect_error: false,
},
TestData {
id: "aAzZ0123456789_.-",
expect_error: false,
},
TestData {
id: "abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: false,
},
TestData {
id: "0123456789abcdefghijklmnopqrstuvwxyz.-_",
expect_error: false,
},
TestData {
id: " abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: true,
},
TestData {
id: ".abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: true,
},
TestData {
id: "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: false,
},
TestData {
id: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.-_",
expect_error: false,
},
TestData {
id: " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: true,
},
TestData {
id: ".ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: true,
},
TestData {
id: "/a/b/c",
expect_error: true,
},
TestData {
id: "a/b/c",
expect_error: true,
},
TestData {
id: "foo/../../../etc/passwd",
expect_error: true,
},
TestData {
id: "../../../../../../etc/motd",
expect_error: true,
},
TestData {
id: "/etc/passwd",
expect_error: true,
},
];
for (i, d) in tests.iter().enumerate() {
let msg = format!("test[{}]: {:?}", i, d);
let result = verify_cid(d.id);
let msg = format!("{}, result: {:?}", msg, result);
if result.is_ok() {
assert!(!d.expect_error, "{}", msg);
} else {
assert!(d.expect_error, "{}", msg);
}
}
}
#[tokio::test]
async fn test_volume_capacity_stats() {
skip_if_not_root!();

src/dragonball/.gitignore vendored Normal file (3 lines)
View File

@@ -0,0 +1,3 @@
target
Cargo.lock
.idea

src/dragonball/Cargo.toml Normal file (65 lines)
View File

@@ -0,0 +1,65 @@
[package]
name = "dragonball"
version = "0.1.0"
authors = ["The Kata Containers community <kata-dev@lists.katacontainers.io>"]
description = "A secure sandbox for Kata Containers"
keywords = ["kata-containers", "sandbox", "vmm", "dragonball"]
homepage = "https://katacontainers.io/"
repository = "https://github.com/kata-containers/kata-containers.git"
license = "Apache-2.0"
edition = "2018"
[dependencies]
arc-swap = "1.5.0"
bytes = "1.1.0"
dbs-address-space = "0.1.0"
dbs-allocator = "0.1.0"
dbs-arch = "0.1.0"
dbs-boot = "0.2.0"
dbs-device = "0.1.0"
dbs-interrupt = { version = "0.1.0", features = ["kvm-irq"] }
dbs-legacy-devices = "0.1.0"
dbs-upcall = { version = "0.1.0", optional = true }
dbs-utils = "0.1.0"
dbs-virtio-devices = { version = "0.1.0", optional = true, features = ["virtio-mmio"] }
kvm-bindings = "0.5.0"
kvm-ioctls = "0.11.0"
lazy_static = "1.2"
libc = "0.2.39"
linux-loader = "0.4.0"
log = "0.4.14"
nix = "0.23.1"
seccompiler = "0.2.0"
serde = "1.0.27"
serde_derive = "1.0.27"
serde_json = "1.0.9"
slog = "2.5.2"
slog-scope = "4.4.0"
thiserror = "1"
vmm-sys-util = "0.9.0"
virtio-queue = { version = "0.1.0", optional = true }
vm-memory = { version = "0.7.0", features = ["backend-mmap"] }
[dev-dependencies]
slog-term = "2.9.0"
slog-async = "2.7.0"
[features]
acpi = []
atomic-guest-memory = []
hotplug = ["virtio-vsock"]
virtio-vsock = ["dbs-virtio-devices/virtio-vsock", "virtio-queue"]
virtio-blk = ["dbs-virtio-devices/virtio-blk", "virtio-queue"]
virtio-net = ["dbs-virtio-devices/virtio-net", "virtio-queue"]
# virtio-fs only work on atomic-guest-memory
virtio-fs = ["dbs-virtio-devices/virtio-fs", "virtio-queue", "atomic-guest-memory"]
[patch.'crates-io']
dbs-device = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-interrupt = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-legacy-devices = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-upcall = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-utils = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-virtio-devices = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-boot = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }
dbs-arch = { git = "https://github.com/openanolis/dragonball-sandbox.git", rev = "7a8e832b53d66994d6a16f0513d69f540583dcd0" }

src/dragonball/LICENSE Symbolic link (1 line)
View File

@@ -0,0 +1 @@
../../LICENSE

src/dragonball/Makefile Normal file (29 lines)
View File

@@ -0,0 +1,29 @@
# Copyright (c) 2019-2022 Alibaba Cloud. All rights reserved.
# Copyright (c) 2019-2022 Ant Group. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
default: build
build:
# FIXME: This line will be removed when we solve the vm-memory dependency problem in Dragonball Sandbox
cargo update -p vm-memory:0.8.0 --precise 0.7.0
cargo build --all-features
check: clippy format
clippy:
@echo "INFO: cargo clippy..."
cargo clippy --all-targets --all-features \
-- \
-D warnings
format:
@echo "INFO: cargo fmt..."
cargo fmt -- --check
clean:
cargo clean
test:
@echo "INFO: testing dragonball for development build"
cargo test --all-features -- --nocapture

src/dragonball/README.md Normal file (40 lines)
View File

@@ -0,0 +1,40 @@
# Introduction
`Dragonball Sandbox` is a light-weight virtual machine manager (VMM) based on Linux Kernel-based Virtual Machine (KVM),
which is optimized for container workloads with:
- container image management and acceleration service
- flexible and high-performance virtual device drivers
- low CPU and memory overhead
- minimal startup time
- optimized concurrent startup speed
`Dragonball Sandbox` aims to provide a simple solution for the Kata Containers community. It is integrated into Kata 3.0
runtime as a built-in VMM and gives users an out-of-the-box Kata Containers experience without complex environment setup
and configuration process.
# Getting Started
[TODO](https://github.com/kata-containers/kata-containers/issues/4302)
# Documentation
Device: [Device Document](docs/device.md)
vCPU: [vCPU Document](docs/vcpu.md)
API: [API Document](docs/api.md)
The documentation is still being actively added.
See the [official documentation](docs/) page for more details.
# Supported Architectures
- x86-64
- aarch64
# Supported Kernel
[TODO](https://github.com/kata-containers/kata-containers/issues/4303)
# Acknowledgement
Part of the code is based on the [Cloud Hypervisor](https://github.com/cloud-hypervisor/cloud-hypervisor) project, the [`crosvm`](https://github.com/google/crosvm) project and the [Firecracker](https://github.com/firecracker-microvm/firecracker) project. They are all virtual machine managers written in Rust, with advantages in safety and security.
`Dragonball Sandbox` is designed to be a VMM customized for Kata Containers, and we will focus on optimizing container workloads for the Kata ecosystem. This focus on the Kata community is what differentiates us from other Rust-based virtual machine managers.
# License
`Dragonball` is licensed under [Apache License](http://www.apache.org/licenses/LICENSE-2.0), Version 2.0.

View File

@@ -0,0 +1,27 @@
// Copyright 2017 The Chromium OS Authors. All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are
// met:
//
// * Redistributions of source code must retain the above copyright
// notice, this list of conditions and the following disclaimer.
// * Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following disclaimer
// in the documentation and/or other materials provided with the
// distribution.
// * Neither the name of Google Inc. nor the names of its
// contributors may be used to endorse or promote products derived from
// this software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
// "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
// LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
// OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
// DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
// THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
// (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -0,0 +1,27 @@
# API
We provide a rich set of APIs for the Kata runtime to interact with the `Dragonball` virtual machine manager.
This document introduces each of them.
## `ConfigureBootSource`
Configure the boot source of the VM using `BootSourceConfig`. This action can only be called before the VM has booted.
### Boot Source Config
1. `kernel_path`: Path of the kernel image. `Dragonball` only supports compressed kernel images for now.
2. `initrd_path`: Path of the initrd (could be None)
3. `boot_args`: Boot arguments passed to the kernel (could be None)
## `SetVmConfiguration`
Set virtual machine configuration using `VmConfigInfo` to initialize VM.
### VM Config Info
1. `vcpu_count`: Number of vCPUs to start. Currently we only support up to 255 vCPUs.
2. `max_vcpu_count`: Maximum number of vCPUs that can be added through CPU hotplug.
3. `cpu_pm`: CPU power management.
4. `cpu_topology`: CPU topology information (including `threads_per_core`, `cores_per_die`, `dies_per_socket` and `sockets`).
5. `vpmu_feature`: `vPMU` feature level.
6. `mem_type`: Memory type that can be either `hugetlbfs` or `shmem`, default is `shmem`.
7. `mem_file_path` : Memory file path.
8. `mem_size_mib`: The memory size in MiB. The maximum memory size is 1TB.
9. `serial_path`: Optional sock path.
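To make the two requests above concrete, the sketch below fills in example values. The struct layouts follow the field descriptions in this document but are illustrative only; they are not the exact `Dragonball` types, and some fields (e.g. `cpu_pm`, `cpu_topology`, `vpmu_feature`) are omitted.
```rust
// Illustrative only: field names follow the documentation above,
// not the exact Dragonball structs.
struct BootSourceConfig {
    kernel_path: String,         // Dragonball only supports compressed kernel images for now
    initrd_path: Option<String>, // may be None
    boot_args: Option<String>,   // may be None
}

struct VmConfigInfo {
    vcpu_count: u8,     // up to 255 vCPUs
    max_vcpu_count: u8, // upper bound for CPU hotplug
    mem_type: String,   // "shmem" (default) or "hugetlbfs"
    mem_file_path: String,
    mem_size_mib: u64,  // memory size in MiB, up to 1 TB
    serial_path: Option<String>,
}

fn example_configs() -> (BootSourceConfig, VmConfigInfo) {
    let boot = BootSourceConfig {
        kernel_path: "/path/to/vmlinuz".to_string(),
        initrd_path: None,
        boot_args: Some("console=ttyS0 reboot=k panic=1".to_string()),
    };
    let vm = VmConfigInfo {
        vcpu_count: 1,
        max_vcpu_count: 4,
        mem_type: "shmem".to_string(),
        mem_file_path: String::new(),
        mem_size_mib: 2048,
        serial_path: None,
    };
    (boot, vm)
}
```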

View File

@@ -0,0 +1,20 @@
# Device
## Device Manager
Currently we have the following device managers:
| Name | Description |
| --- | --- |
| [address space manager](../src/address_space_manager.rs) | abstracts virtual machine's physical management and provide mapping for guest virtual memory and MMIO ranges of emulated virtual devices, pass-through devices and vCPU |
| [config manager](../src/config_manager.rs) | provides abstractions for configuration information |
| [console manager](../src/device_manager/console_manager.rs) | provides management for all console devices |
| [resource manager](../src/resource_manager.rs) |provides resource management for `legacy_irq_pool`, `msi_irq_pool`, `pio_pool`, `mmio_pool`, `mem_pool`, `kvm_mem_slot_pool` with builder `ResourceManagerBuilder` |
| [VSOCK device manager](../src/device_manager/vsock_dev_mgr.rs) | provides configuration info for `VIRTIO-VSOCK` and management for all VSOCK devices |
## Supported Devices
`VIRTIO-VSOCK`
`i8042`
`COM1`
`COM2`

View File

@@ -0,0 +1,42 @@
# vCPU
## vCPU Manager
The vCPU manager manages all vCPU-related actions; we will dive into some of the important structure members in this document.
For now, aarch64 vCPU support is still under development; we'll introduce it when we merge `runtime-rs` into the master branch (issue: #4445).
### vCPU config
`VcpuConfig` is used to configure guest overall CPU info.
`boot_vcpu_count` is used to define the initial vCPU number.
`max_vcpu_count` is used to define the maximum vCPU number, and it is the upper boundary for the CPU hotplug feature.
`thread_per_core`, `cores_per_die`, `dies_per_socket` and `socket` are used to define CPU topology.
`vpmu_feature` is used to define `vPMU` feature level.
If `vPMU` feature is `Disabled`, it means `vPMU` feature is off (by default).
If `vPMU` feature is `LimitedlyEnabled`, it means minimal `vPMU` counters are supported (cycles and instructions).
If `vPMU` feature is `FullyEnabled`, it means all `vPMU` counters are supported.
## vCPU State
There are four states in the vCPU state machine: `running`, `paused`, `waiting_exit`, `exited`. A state machine maintains the task flow.
When a vCPU is created, it goes into the `paused` state. After the vCPU resource is ready in the VMM, the VMM sends a `Resume` event to the vCPU thread, and the vCPU state changes to `running`.
During the `running` state, the VMM catches vCPU exits and executes different logic according to the exit reason.
If the VMM catches an exit reason that it cannot handle, the state changes to `waiting_exit` and the VMM stops the virtual machine.
When the state switches to `waiting_exit`, an exit event is sent to the vCPU `exit_evt`; the event manager detects the change in `exit_evt` and sets the VMM `exit_evt_flag` to 1. A thread serving the VMM event loop checks `exit_evt_flag`, and if the flag is 1, it stops the VMM.
When the VMM is stopped / destroyed, the state changes to `exited`.
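As a minimal sketch of the state machine described above (the names and event strings are illustrative, not the actual `Dragonball` code):
```rust
/// Illustrative vCPU states described above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum VcpuState {
    Paused,      // initial state after the vCPU is created
    Running,     // after the VMM sends a Resume event
    WaitingExit, // an unhandled exit reason was caught; the VMM stops the VM
    Exited,      // the VMM has been stopped / destroyed
}

/// Illustrative transition function; unknown events leave the state unchanged.
fn next_state(current: VcpuState, event: &str) -> VcpuState {
    match (current, event) {
        (VcpuState::Paused, "resume") => VcpuState::Running,
        (VcpuState::Running, "unhandled_exit") => VcpuState::WaitingExit,
        (VcpuState::WaitingExit, "vmm_stopped") => VcpuState::Exited,
        (state, _) => state,
    }
}
```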
## vCPU Hot plug
Since `Dragonball Sandbox` doesn't support virtualization of the ACPI system, we use [`upcall`](https://github.com/openanolis/dragonball-sandbox/tree/main/crates/dbs-upcall) to establish a direct communication channel between `Dragonball` and the guest in order to trigger vCPU hotplug.
To use `upcall`, kernel patches are needed; you can get the patches from the [`upcall`](https://github.com/openanolis/dragonball-sandbox/tree/main/crates/dbs-upcall) page, and we'll provide a ready-to-use guest kernel binary for you to try.
vCPU hot plug / hot unplug range is [1, `max_vcpu_count`]. Operations not in this range will be invalid.

View File

@@ -0,0 +1,892 @@
// Copyright (C) 2019-2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//! Address space abstraction to manage virtual machine's physical address space.
//!
//! The AddressSpace abstraction is introduced to manage virtual machine's physical address space.
//! The regions in virtual machine's physical address space may be used to:
//! 1) map guest virtual memory
//! 2) map MMIO ranges for emulated virtual devices, such as virtio-fs DAX window.
//! 3) map MMIO ranges for pass-through devices, such as PCI device BARs.
//! 4) map MMIO ranges for vCPUs, such as the local APIC.
//! 5) not used/available
//!
//! A related abstraction, vm_memory::GuestMemory, is used to access guest virtual memory only.
//! In other words, AddressSpace is the resource owner, and GuestMemory is an accessor for guest
//! virtual memory.
use std::collections::{BTreeMap, HashMap};
use std::fs::File;
use std::os::unix::io::{AsRawFd, FromRawFd};
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use dbs_address_space::{
AddressSpace, AddressSpaceError, AddressSpaceLayout, AddressSpaceRegion,
AddressSpaceRegionType, NumaNode, NumaNodeInfo, MPOL_MF_MOVE, MPOL_PREFERRED,
};
use dbs_allocator::Constraint;
use kvm_bindings::kvm_userspace_memory_region;
use kvm_ioctls::VmFd;
use log::{debug, error, info, warn};
use nix::sys::mman;
use nix::unistd::dup;
#[cfg(feature = "atomic-guest-memory")]
use vm_memory::atomic::GuestMemoryAtomic;
use vm_memory::{
Address, FileOffset, GuestAddress, GuestAddressSpace, GuestMemoryMmap, GuestMemoryRegion,
GuestRegionMmap, GuestUsize, MemoryRegionAddress, MmapRegion,
};
use crate::resource_manager::ResourceManager;
use crate::vm::NumaRegionInfo;
#[cfg(not(feature = "atomic-guest-memory"))]
/// Concrete GuestAddressSpace type used by the VMM.
pub type GuestAddressSpaceImpl = Arc<GuestMemoryMmap>;
#[cfg(feature = "atomic-guest-memory")]
/// Concrete GuestAddressSpace type used by the VMM.
pub type GuestAddressSpaceImpl = GuestMemoryAtomic<GuestMemoryMmap>;
/// Concrete GuestMemory type used by the VMM.
pub type GuestMemoryImpl = <Arc<vm_memory::GuestMemoryMmap> as GuestAddressSpace>::M;
/// Concrete GuestRegion type used by the VMM.
pub type GuestRegionImpl = GuestRegionMmap;
// Maximum number of working threads for memory pre-allocation.
const MAX_PRE_ALLOC_THREAD: u64 = 16;
// Controls the actual number of pre-allocation threads. After several performance tests, we decided to use one pre-allocation thread for every 4 GiB of memory.
const PRE_ALLOC_GRANULARITY: u64 = 32;
// We don't plan to support mainframe computers and only focus on PC servers.
// 64 as max nodes should be enough for now.
const MAX_NODE: u32 = 64;
// We will split the memory region if it conflicts with the MMIO hole.
// But if the space below the MMIO hole is smaller than the MINIMAL_SPLIT_SPACE, we won't split the memory region in order to enhance performance.
const MINIMAL_SPLIT_SPACE: u64 = 128 << 20;
/// Errors associated with virtual machine address space management.
#[derive(Debug, thiserror::Error)]
pub enum AddressManagerError {
/// Invalid address space operation.
#[error("invalid address space operation")]
InvalidOperation,
/// Invalid address range.
#[error("invalid address space region (0x{0:x}, 0x{1:x})")]
InvalidAddressRange(u64, GuestUsize),
/// No available mem address.
#[error("no available mem address")]
NoAvailableMemAddress,
/// No available KVM slots.
#[error("no available kvm slots")]
NoAvailableKvmSlot,
/// Address manager failed to create memfd to map anonymous memory.
#[error("address manager failed to create memfd to map anonymous memory")]
CreateMemFd(#[source] nix::Error),
/// Address manager failed to open memory file.
#[error("address manager failed to open memory file")]
OpenFile(#[source] std::io::Error),
/// Memory file provided is invalid due to empty file path, non-existent file path and other possible mistakes.
#[error("memory file provided to address manager {0} is invalid")]
FileInvalid(String),
/// Memory type provided is invalid (e.g. empty memory type).
#[error("memory type provided to address manager {0} is invalid")]
TypeInvalid(String),
/// Failed to set size for memory file.
#[error("address manager failed to set size for memory file")]
SetFileSize(#[source] std::io::Error),
/// Failed to unlink memory file.
#[error("address manager failed to unlink memory file")]
UnlinkFile(#[source] nix::Error),
/// Failed to duplicate fd of memory file.
#[error("address manager failed to duplicate memory file descriptor")]
DupFd(#[source] nix::Error),
/// Failure in accessing the memory located at some address.
#[error("address manager failed to access guest memory located at 0x{0:x}")]
AccessGuestMemory(u64, #[source] vm_memory::mmap::Error),
/// Failed to create GuestMemory
#[error("address manager failed to create guest memory object")]
CreateGuestMemory(#[source] vm_memory::Error),
/// Failure in initializing guest memory.
#[error("address manager failed to initialize guest memory")]
GuestMemoryNotInitialized,
/// Failed to mmap() guest memory
#[error("address manager failed to mmap() guest memory into current process")]
MmapGuestMemory(#[source] vm_memory::mmap::MmapRegionError),
/// Failed to set KVM memory slot.
#[error("address manager failed to configure KVM memory slot")]
KvmSetMemorySlot(#[source] kvm_ioctls::Error),
/// Failed to set madvise on AddressSpaceRegion
#[error("address manager failed to set madvise() on guest memory region")]
Madvise(#[source] nix::Error),
/// Failed to join worker threads.
#[error("address manager failed to join threads")]
JoinFail,
/// Failed to create Address Space Region
#[error("address manager failed to create Address Space Region {0}")]
CreateAddressSpaceRegion(#[source] AddressSpaceError),
}
type Result<T> = std::result::Result<T, AddressManagerError>;
/// Parameters to configure address space creation operations.
pub struct AddressSpaceMgrBuilder<'a> {
mem_type: &'a str,
mem_file: &'a str,
mem_index: u32,
mem_suffix: bool,
mem_prealloc: bool,
dirty_page_logging: bool,
vmfd: Option<Arc<VmFd>>,
}
impl<'a> AddressSpaceMgrBuilder<'a> {
/// Create a new [`AddressSpaceMgrBuilder`] object.
pub fn new(mem_type: &'a str, mem_file: &'a str) -> Result<Self> {
if mem_type.is_empty() {
return Err(AddressManagerError::TypeInvalid(mem_type.to_string()));
}
Ok(AddressSpaceMgrBuilder {
mem_type,
mem_file,
mem_index: 0,
mem_suffix: true,
mem_prealloc: false,
dirty_page_logging: false,
vmfd: None,
})
}
/// Enable/disable adding numbered suffix to memory file path.
/// This feature could be useful to generate hugetlbfs files with number suffix. (e.g. shmem0, shmem1)
pub fn toggle_file_suffix(&mut self, enabled: bool) {
self.mem_suffix = enabled;
}
/// Enable/disable memory pre-allocation.
/// Enabling this feature can improve performance stability at the start of a workload by avoiding page faults.
/// Disabling it may affect performance stability, but CPU resource consumption and start-up time will decrease.
pub fn toggle_prealloc(&mut self, prealloc: bool) {
self.mem_prealloc = prealloc;
}
/// Enable/disable KVM dirty page logging.
pub fn toggle_dirty_page_logging(&mut self, logging: bool) {
self.dirty_page_logging = logging;
}
/// Set KVM [`VmFd`] handle to configure memory slots.
pub fn set_kvm_vm_fd(&mut self, vmfd: Arc<VmFd>) -> Option<Arc<VmFd>> {
let mut existing_vmfd = None;
if self.vmfd.is_some() {
existing_vmfd = self.vmfd.clone();
}
self.vmfd = Some(vmfd);
existing_vmfd
}
/// Build an [`AddressSpaceMgr`] using the configured parameters.
pub fn build(
self,
res_mgr: &ResourceManager,
numa_region_infos: &[NumaRegionInfo],
) -> Result<AddressSpaceMgr> {
let mut mgr = AddressSpaceMgr::default();
mgr.create_address_space(res_mgr, numa_region_infos, self)?;
Ok(mgr)
}
fn get_next_mem_file(&mut self) -> String {
if self.mem_suffix {
let path = format!("{}{}", self.mem_file, self.mem_index);
self.mem_index += 1;
path
} else {
self.mem_file.to_string()
}
}
}
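// Illustrative sketch, not part of the original file: a typical way to wire up
// the builder. The "/dev/shm/dragonball" path is an assumption made for this
// example; the calls themselves mirror the unit tests at the bottom of the file.
fn _example_build_address_space(
    res_mgr: &ResourceManager,
    numa_region_infos: &[NumaRegionInfo],
) -> Result<AddressSpaceMgr> {
    let mut builder = AddressSpaceMgrBuilder::new("shmem", "/dev/shm/dragonball")?;
    builder.toggle_prealloc(true);
    builder.toggle_file_suffix(true);
    builder.build(res_mgr, numa_region_infos)
}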
/// Struct to manage virtual machine's physical address space.
pub struct AddressSpaceMgr {
address_space: Option<AddressSpace>,
vm_as: Option<GuestAddressSpaceImpl>,
base_to_slot: Arc<Mutex<HashMap<u64, u32>>>,
prealloc_handlers: Vec<thread::JoinHandle<()>>,
prealloc_exit: Arc<AtomicBool>,
numa_nodes: BTreeMap<u32, NumaNode>,
}
impl AddressSpaceMgr {
/// Query whether the address space manager is initialized.
pub fn is_initialized(&self) -> bool {
self.address_space.is_some()
}
/// Gets address space.
pub fn address_space(&self) -> Option<&AddressSpace> {
self.address_space.as_ref()
}
/// Create the address space for a virtual machine.
///
/// This method is designed to be called when starting up a virtual machine rather than at
/// runtime, so on failure the virtual machine is expected to be torn down and no strict error recovery is attempted.
pub fn create_address_space(
&mut self,
res_mgr: &ResourceManager,
numa_region_infos: &[NumaRegionInfo],
mut param: AddressSpaceMgrBuilder,
) -> Result<()> {
let mut regions = Vec::new();
let mut start_addr = dbs_boot::layout::GUEST_MEM_START;
// Create address space regions.
for info in numa_region_infos.iter() {
info!("numa_region_info {:?}", info);
// convert size_in_mib to bytes
let size = info
.size
.checked_shl(20)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
// Guest memory does not intersect with the MMIO hole.
// TODO: make it work for ARM (issue #4307)
if start_addr > dbs_boot::layout::MMIO_LOW_END
|| start_addr + size <= dbs_boot::layout::MMIO_LOW_START
{
let region = self.create_region(start_addr, size, info, &mut param)?;
regions.push(region);
start_addr = start_addr
.checked_add(size)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
} else {
// Add guest memory below the MMIO hole; avoid splitting the memory region
// if the available address region is smaller than MINIMAL_SPLIT_SPACE MiB.
let mut below_size = dbs_boot::layout::MMIO_LOW_START
.checked_sub(start_addr)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
if below_size < (MINIMAL_SPLIT_SPACE) {
below_size = 0;
} else {
let region = self.create_region(start_addr, below_size, info, &mut param)?;
regions.push(region);
}
// Add guest memory above the MMIO hole
let above_start = dbs_boot::layout::MMIO_LOW_END + 1;
let above_size = size
.checked_sub(below_size)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
let region = self.create_region(above_start, above_size, info, &mut param)?;
regions.push(region);
start_addr = above_start
.checked_add(above_size)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
}
}
// Create GuestMemory object
let mut vm_memory = GuestMemoryMmap::new();
for reg in regions.iter() {
// Allocate used guest memory addresses.
// These addresses are statically allocated, resource allocation/update should not fail.
let constraint = Constraint::new(reg.len())
.min(reg.start_addr().raw_value())
.max(reg.last_addr().raw_value());
let _key = res_mgr
.allocate_mem_address(&constraint)
.ok_or(AddressManagerError::NoAvailableMemAddress)?;
let mmap_reg = self.create_mmap_region(reg.clone())?;
vm_memory = vm_memory
.insert_region(mmap_reg.clone())
.map_err(AddressManagerError::CreateGuestMemory)?;
self.map_to_kvm(res_mgr, &param, reg, mmap_reg)?;
}
#[cfg(feature = "atomic-guest-memory")]
{
self.vm_as = Some(AddressSpace::convert_into_vm_as(vm_memory));
}
#[cfg(not(feature = "atomic-guest-memory"))]
{
self.vm_as = Some(Arc::new(vm_memory));
}
let layout = AddressSpaceLayout::new(
*dbs_boot::layout::GUEST_PHYS_END,
dbs_boot::layout::GUEST_MEM_START,
*dbs_boot::layout::GUEST_MEM_END,
);
self.address_space = Some(AddressSpace::from_regions(regions, layout));
Ok(())
}
// size unit: Byte
fn create_region(
&mut self,
start_addr: u64,
size_bytes: u64,
info: &NumaRegionInfo,
param: &mut AddressSpaceMgrBuilder,
) -> Result<Arc<AddressSpaceRegion>> {
let mem_file_path = param.get_next_mem_file();
let region = AddressSpaceRegion::create_default_memory_region(
GuestAddress(start_addr),
size_bytes,
info.host_numa_node_id,
param.mem_type,
&mem_file_path,
param.mem_prealloc,
false,
)
.map_err(AddressManagerError::CreateAddressSpaceRegion)?;
let region = Arc::new(region);
self.insert_into_numa_nodes(
&region,
info.guest_numa_node_id.unwrap_or(0),
&info.vcpu_ids,
);
info!(
"create new region: guest addr 0x{:x}-0x{:x} size {}",
start_addr,
start_addr + size_bytes,
size_bytes
);
Ok(region)
}
fn map_to_kvm(
&mut self,
res_mgr: &ResourceManager,
param: &AddressSpaceMgrBuilder,
reg: &Arc<AddressSpaceRegion>,
mmap_reg: Arc<GuestRegionImpl>,
) -> Result<()> {
// Build mapping between GPA <-> HVA, by adding kvm memory slot.
let slot = res_mgr
.allocate_kvm_mem_slot(1, None)
.ok_or(AddressManagerError::NoAvailableKvmSlot)?;
if let Some(vmfd) = param.vmfd.as_ref() {
let host_addr = mmap_reg
.get_host_address(MemoryRegionAddress(0))
.map_err(|_e| AddressManagerError::InvalidOperation)?;
let flags = 0u32;
let mem_region = kvm_userspace_memory_region {
slot: slot as u32,
guest_phys_addr: reg.start_addr().raw_value(),
memory_size: reg.len() as u64,
userspace_addr: host_addr as u64,
flags,
};
info!(
"VM: guest memory region {:x} starts at {:x?}",
reg.start_addr().raw_value(),
host_addr
);
// Safe because the guest regions are guaranteed not to overlap.
unsafe { vmfd.set_user_memory_region(mem_region) }
.map_err(AddressManagerError::KvmSetMemorySlot)?;
}
self.base_to_slot
.lock()
.unwrap()
.insert(reg.start_addr().raw_value(), slot as u32);
Ok(())
}
/// Mmap the address space region into current process.
pub fn create_mmap_region(
&mut self,
region: Arc<AddressSpaceRegion>,
) -> Result<Arc<GuestRegionImpl>> {
// Special check for 32bit host with 64bit virtual machines.
if region.len() > usize::MAX as u64 {
return Err(AddressManagerError::InvalidAddressRange(
region.start_addr().raw_value(),
region.len(),
));
}
// The device MMIO regions may not be backed by memory files, so refuse to mmap them.
if region.region_type() == AddressSpaceRegionType::DeviceMemory {
return Err(AddressManagerError::InvalidOperation);
}
// The GuestRegionMmap/MmapRegion will take ownership of the FileOffset object,
// so we have to duplicate the fd here. It's really a dirty design.
let file_offset = match region.file_offset().as_ref() {
Some(fo) => {
let fd = dup(fo.file().as_raw_fd()).map_err(AddressManagerError::DupFd)?;
// Safe because we have just duplicated the raw fd.
let file = unsafe { File::from_raw_fd(fd) };
let file_offset = FileOffset::new(file, fo.start());
Some(file_offset)
}
None => None,
};
let perm_flags = if (region.perm_flags() & libc::MAP_POPULATE) != 0 && region.is_hugepage()
{
// mmap(MAP_POPULATE) conflicts with madvise(MADV_HUGEPAGE) because mmap(MAP_POPULATE)
// will pre-fault in all memory with normal pages before madvise(MADV_HUGEPAGE) gets
// called. So remove the MAP_POPULATE flag and memory will be faulted in by working
// threads.
region.perm_flags() & (!libc::MAP_POPULATE)
} else {
region.perm_flags()
};
let mmap_reg = MmapRegion::build(
file_offset,
region.len() as usize,
libc::PROT_READ | libc::PROT_WRITE,
perm_flags,
)
.map_err(AddressManagerError::MmapGuestMemory)?;
if region.is_anonpage() {
self.configure_anon_mem(&mmap_reg)?;
}
if let Some(node_id) = region.host_numa_node_id() {
self.configure_numa(&mmap_reg, node_id)?;
}
if region.is_hugepage() {
self.configure_thp_and_prealloc(&region, &mmap_reg)?;
}
let reg = GuestRegionImpl::new(mmap_reg, region.start_addr())
.map_err(AddressManagerError::CreateGuestMemory)?;
Ok(Arc::new(reg))
}
fn configure_anon_mem(&self, mmap_reg: &MmapRegion) -> Result<()> {
unsafe {
mman::madvise(
mmap_reg.as_ptr() as *mut libc::c_void,
mmap_reg.size(),
mman::MmapAdvise::MADV_DONTFORK,
)
}
.map_err(AddressManagerError::Madvise)
}
fn configure_numa(&self, mmap_reg: &MmapRegion, node_id: u32) -> Result<()> {
let nodemask = 1_u64
.checked_shl(node_id)
.ok_or_else(|| AddressManagerError::InvalidOperation)?;
let res = unsafe {
libc::syscall(
libc::SYS_mbind,
mmap_reg.as_ptr() as *mut libc::c_void,
mmap_reg.size(),
MPOL_PREFERRED,
&nodemask as *const u64,
MAX_NODE,
MPOL_MF_MOVE,
)
};
if res < 0 {
warn!(
"failed to mbind memory to host_numa_node_id {}: this may affect performance",
node_id
);
}
Ok(())
}
// We enable Transparent Huge Pages (THP) through madvise() to increase performance.
// In order to reduce the impact of page faults on performance, we start several threads (up to MAX_PRE_ALLOC_THREAD) that touch every 4k page of the memory region to manually pre-allocate memory.
// The reason we don't use mmap() to enable THP and do pre-allocation is that the THP setting won't take effect in that operation (tested on kernel 4.9).
fn configure_thp_and_prealloc(
&mut self,
region: &Arc<AddressSpaceRegion>,
mmap_reg: &MmapRegion,
) -> Result<()> {
debug!(
"Setting MADV_HUGEPAGE on AddressSpaceRegion addr {:x?} len {:x?}",
mmap_reg.as_ptr(),
mmap_reg.size()
);
// Safe because we have just created the MmapRegion
unsafe {
mman::madvise(
mmap_reg.as_ptr() as *mut libc::c_void,
mmap_reg.size(),
mman::MmapAdvise::MADV_HUGEPAGE,
)
}
.map_err(AddressManagerError::Madvise)?;
if region.perm_flags() & libc::MAP_POPULATE > 0 {
// Touch every 4k page to trigger allocation. The step is 4K instead of 2M to ensure
// pre-allocation when running out of huge pages.
const PAGE_SIZE: u64 = 4096;
const PAGE_SHIFT: u32 = 12;
let addr = mmap_reg.as_ptr() as u64;
// Here we use >> PAGE_SHIFT to calculate how many 4K pages in the memory region.
let npage = (mmap_reg.size() as u64) >> PAGE_SHIFT;
let mut touch_thread = ((mmap_reg.size() as u64) >> PRE_ALLOC_GRANULARITY) + 1;
if touch_thread > MAX_PRE_ALLOC_THREAD {
touch_thread = MAX_PRE_ALLOC_THREAD;
}
let per_npage = npage / touch_thread;
for n in 0..touch_thread {
let start_npage = per_npage * n;
let end_npage = if n == (touch_thread - 1) {
npage
} else {
per_npage * (n + 1)
};
let mut per_addr = addr + (start_npage * PAGE_SIZE);
let should_stop = self.prealloc_exit.clone();
let handler = thread::Builder::new()
.name("PreallocThread".to_string())
.spawn(move || {
info!("PreallocThread start start_npage: {:?}, end_npage: {:?}, per_addr: {:?}, thread_number: {:?}",
start_npage, end_npage, per_addr, touch_thread );
for _ in start_npage..end_npage {
if should_stop.load(Ordering::Acquire) {
info!("PreallocThread stop start_npage: {:?}, end_npage: {:?}, per_addr: {:?}, thread_number: {:?}",
start_npage, end_npage, per_addr, touch_thread);
break;
}
// Reading from a THP page may be served by the zero page, so only a
// write operation can ensure THP memory allocation. So we use
// the compare_exchange(old_val, old_val) trick to trigger allocation.
let addr_ptr = per_addr as *mut u8;
let read_byte = unsafe { std::ptr::read_volatile(addr_ptr) };
let atomic_u8 : &AtomicU8 = unsafe {&*(addr_ptr as *mut AtomicU8)};
let _ = atomic_u8.compare_exchange(read_byte, read_byte, Ordering::SeqCst, Ordering::SeqCst);
per_addr += PAGE_SIZE;
}
info!("PreallocThread done start_npage: {:?}, end_npage: {:?}, per_addr: {:?}, thread_number: {:?}",
start_npage, end_npage, per_addr, touch_thread );
});
match handler {
Err(e) => error!(
"Failed to create working thread for async pre-allocation, {:?}. This may affect performance stability at the start of the workload.",
e
),
Ok(hdl) => self.prealloc_handlers.push(hdl),
}
}
}
Ok(())
}
/// Get the address space object
pub fn get_address_space(&self) -> Option<&AddressSpace> {
self.address_space.as_ref()
}
/// Get the default guest memory object, which will be used to access virtual machine's default
/// guest memory.
pub fn get_vm_as(&self) -> Option<&GuestAddressSpaceImpl> {
self.vm_as.as_ref()
}
/// Get the base to slot map
pub fn get_base_to_slot_map(&self) -> Arc<Mutex<HashMap<u64, u32>>> {
self.base_to_slot.clone()
}
/// Get NUMA node information from the address space manager.
pub fn get_numa_nodes(&self) -> &BTreeMap<u32, NumaNode> {
&self.numa_nodes
}
/// Add CPU and memory NUMA information to the BTreeMap.
fn insert_into_numa_nodes(
&mut self,
region: &Arc<AddressSpaceRegion>,
guest_numa_node_id: u32,
vcpu_ids: &[u32],
) {
let node = self
.numa_nodes
.entry(guest_numa_node_id)
.or_insert_with(NumaNode::new);
node.add_info(&NumaNodeInfo {
base: region.start_addr(),
size: region.len(),
});
node.add_vcpu_ids(vcpu_ids);
}
/// Get the address space layout from the address space manager.
pub fn get_layout(&self) -> Result<AddressSpaceLayout> {
self.address_space
.as_ref()
.map(|v| v.layout())
.ok_or(AddressManagerError::GuestMemoryNotInitialized)
}
/// Wait for the pre-allocation working threads to finish work.
///
/// Force all working threads to exit if `stop` is true.
pub fn wait_prealloc(&mut self, stop: bool) -> Result<()> {
if stop {
self.prealloc_exit.store(true, Ordering::Release);
}
while let Some(handlers) = self.prealloc_handlers.pop() {
if let Err(e) = handlers.join() {
error!("wait_prealloc join fail {:?}", e);
return Err(AddressManagerError::JoinFail);
}
}
Ok(())
}
}
impl Default for AddressSpaceMgr {
/// Create a new empty AddressSpaceMgr
fn default() -> Self {
AddressSpaceMgr {
address_space: None,
vm_as: None,
base_to_slot: Arc::new(Mutex::new(HashMap::new())),
prealloc_handlers: Vec::new(),
prealloc_exit: Arc::new(AtomicBool::new(false)),
numa_nodes: BTreeMap::new(),
}
}
}
#[cfg(test)]
mod tests {
use dbs_boot::layout::GUEST_MEM_START;
use std::ops::Deref;
use vm_memory::{Bytes, GuestAddressSpace, GuestMemory, GuestMemoryRegion};
use vmm_sys_util::tempfile::TempFile;
use super::*;
#[test]
fn test_create_address_space() {
let res_mgr = ResourceManager::new(None);
let mem_size = 128 << 20;
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: vec![1, 2],
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
let vm_as = as_mgr.get_vm_as().unwrap();
let guard = vm_as.memory();
let gmem = guard.deref();
assert_eq!(gmem.num_regions(), 1);
let reg = gmem
.find_region(GuestAddress(GUEST_MEM_START + mem_size - 1))
.unwrap();
assert_eq!(reg.start_addr(), GuestAddress(GUEST_MEM_START));
assert_eq!(reg.len(), mem_size);
assert!(gmem
.find_region(GuestAddress(GUEST_MEM_START + mem_size))
.is_none());
assert!(reg.file_offset().is_some());
let buf = [0x1u8, 0x2u8, 0x3u8, 0x4u8, 0x5u8];
gmem.write_slice(&buf, GuestAddress(GUEST_MEM_START))
.unwrap();
// Update middle of mapped memory region
let mut val = 0xa5u8;
gmem.write_obj(val, GuestAddress(GUEST_MEM_START + 0x1))
.unwrap();
val = gmem.read_obj(GuestAddress(GUEST_MEM_START + 0x1)).unwrap();
assert_eq!(val, 0xa5);
val = gmem.read_obj(GuestAddress(GUEST_MEM_START)).unwrap();
assert_eq!(val, 1);
val = gmem.read_obj(GuestAddress(GUEST_MEM_START + 0x2)).unwrap();
assert_eq!(val, 3);
val = gmem.read_obj(GuestAddress(GUEST_MEM_START + 0x5)).unwrap();
assert_eq!(val, 0);
// Read ahead of mapped memory region
assert!(gmem
.read_obj::<u8>(GuestAddress(GUEST_MEM_START + mem_size))
.is_err());
let res_mgr = ResourceManager::new(None);
let mem_size = dbs_boot::layout::MMIO_LOW_START + (1 << 30);
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: vec![1, 2],
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
let vm_as = as_mgr.get_vm_as().unwrap();
let guard = vm_as.memory();
let gmem = guard.deref();
#[cfg(target_arch = "x86_64")]
assert_eq!(gmem.num_regions(), 2);
#[cfg(target_arch = "aarch64")]
assert_eq!(gmem.num_regions(), 1);
// Test dropping GuestMemoryMmap object releases all resources.
for _ in 0..10000 {
let res_mgr = ResourceManager::new(None);
let mem_size = 1 << 20;
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: vec![1, 2],
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let _as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
}
let file = TempFile::new().unwrap().into_file();
let fd = file.as_raw_fd();
// fd should be small enough if there's no leaking of fds.
assert!(fd < 1000);
}
#[test]
fn test_address_space_mgr_get_boundary() {
let layout = AddressSpaceLayout::new(
*dbs_boot::layout::GUEST_PHYS_END,
dbs_boot::layout::GUEST_MEM_START,
*dbs_boot::layout::GUEST_MEM_END,
);
let res_mgr = ResourceManager::new(None);
let mem_size = 128 << 20;
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: vec![1, 2],
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
assert_eq!(as_mgr.get_layout().unwrap(), layout);
}
#[test]
fn test_address_space_mgr_get_numa_nodes() {
let res_mgr = ResourceManager::new(None);
let mem_size = 128 << 20;
let cpu_vec = vec![1, 2];
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: cpu_vec.clone(),
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
let mut numa_node = NumaNode::new();
numa_node.add_info(&NumaNodeInfo {
base: GuestAddress(GUEST_MEM_START),
size: mem_size,
});
numa_node.add_vcpu_ids(&cpu_vec);
assert_eq!(*as_mgr.get_numa_nodes().get(&0).unwrap(), numa_node);
}
#[test]
fn test_address_space_mgr_async_prealloc() {
let res_mgr = ResourceManager::new(None);
let mem_size = 2 << 20;
let cpu_vec = vec![1, 2];
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: cpu_vec,
}];
let mut builder = AddressSpaceMgrBuilder::new("hugeshmem", "").unwrap();
builder.toggle_prealloc(true);
let mut as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
as_mgr.wait_prealloc(false).unwrap();
}
#[test]
fn test_address_space_mgr_builder() {
let mut builder = AddressSpaceMgrBuilder::new("shmem", "/tmp/shmem").unwrap();
assert_eq!(builder.mem_type, "shmem");
assert_eq!(builder.mem_file, "/tmp/shmem");
assert_eq!(builder.mem_index, 0);
assert!(builder.mem_suffix);
assert!(!builder.mem_prealloc);
assert!(!builder.dirty_page_logging);
assert!(builder.vmfd.is_none());
assert_eq!(&builder.get_next_mem_file(), "/tmp/shmem0");
assert_eq!(&builder.get_next_mem_file(), "/tmp/shmem1");
assert_eq!(&builder.get_next_mem_file(), "/tmp/shmem2");
assert_eq!(builder.mem_index, 3);
builder.toggle_file_suffix(false);
assert_eq!(&builder.get_next_mem_file(), "/tmp/shmem");
assert_eq!(&builder.get_next_mem_file(), "/tmp/shmem");
assert_eq!(builder.mem_index, 3);
builder.toggle_prealloc(true);
builder.toggle_dirty_page_logging(true);
assert!(builder.mem_prealloc);
assert!(builder.dirty_page_logging);
}
#[test]
fn test_configure_invalid_numa() {
let res_mgr = ResourceManager::new(None);
let mem_size = 128 << 20;
let numa_region_infos = vec![NumaRegionInfo {
size: mem_size >> 20,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids: vec![1, 2],
}];
let builder = AddressSpaceMgrBuilder::new("shmem", "").unwrap();
let as_mgr = builder.build(&res_mgr, &numa_region_infos).unwrap();
let mmap_reg = MmapRegion::new(8).unwrap();
assert!(as_mgr.configure_numa(&mmap_reg, u32::MAX).is_err());
}
}

View File

@@ -0,0 +1,6 @@
// Copyright (C) 2019-2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//! API related data structures to configure the vmm.
pub mod v1;

View File

@@ -0,0 +1,55 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
use serde_derive::{Deserialize, Serialize};
/// Default guest kernel command line:
/// - `reboot=k` shutdown the guest on reboot, instead of well... rebooting;
/// - `panic=1` on panic, reboot after 1 second;
/// - `pci=off` do not scan for PCI devices (save boot time);
/// - `nomodules` disable loadable kernel module support;
/// - `8250.nr_uarts=0` disable 8250 serial interface;
/// - `i8042.noaux` do not probe the i8042 controller for an attached mouse (save boot time);
/// - `i8042.nomux` do not probe i8042 for a multiplexing controller (save boot time);
/// - `i8042.nopnp` do not use ACPIPnP to discover KBD/AUX controllers (save boot time);
/// - `i8042.dumbkbd` do not attempt to control kbd state via the i8042 (save boot time).
pub const DEFAULT_KERNEL_CMDLINE: &str = "reboot=k panic=1 pci=off nomodules 8250.nr_uarts=0 \
i8042.noaux i8042.nomux i8042.nopnp i8042.dumbkbd";
/// Strongly typed data structure used to configure the boot source of the microvm.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize, Default)]
#[serde(deny_unknown_fields)]
pub struct BootSourceConfig {
/// Path of the kernel image.
/// We only support an uncompressed kernel for Dragonball.
pub kernel_path: String,
/// Path of the initrd, if there is one.
/// Note: the rootfs is set in `BlockDeviceConfigInfo`.
pub initrd_path: Option<String>,
/// The boot arguments to pass to the kernel.
#[serde(skip_serializing_if = "Option::is_none")]
pub boot_args: Option<String>,
}
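// Illustrative sketch, not part of the original file: constructing a boot source.
// The kernel and initrd paths are assumptions made for this example; when
// `boot_args` is `None`, the VMM service falls back to DEFAULT_KERNEL_CMDLINE.
fn _example_boot_source() -> BootSourceConfig {
    BootSourceConfig {
        kernel_path: "/path/to/vmlinux".to_string(),
        initrd_path: Some("/path/to/initrd.img".to_string()),
        boot_args: Some(format!("{} console=ttyS0", DEFAULT_KERNEL_CMDLINE)),
    }
}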
/// Errors associated with actions on `BootSourceConfig`.
#[derive(Debug, thiserror::Error)]
pub enum BootSourceConfigError {
/// The kernel file cannot be opened.
#[error(
"the kernel file cannot be opened due to invalid kernel path or invalid permissions: {0}"
)]
InvalidKernelPath(#[source] std::io::Error),
/// The initrd file cannot be opened.
#[error("the initrd file cannot be opened due to invalid path or invalid permissions: {0}")]
InvalidInitrdPath(#[source] std::io::Error),
/// The kernel command line is invalid.
#[error("the kernel command line is invalid: {0}")]
InvalidKernelCommandLine(#[source] linux_loader::cmdline::Error),
/// The boot source cannot be updated post boot.
#[error("the update operation is not allowed after boot")]
UpdateNotAllowedPostBoot,
}

View File

@@ -0,0 +1,88 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// SPDX-License-Identifier: Apache-2.0
use serde_derive::{Deserialize, Serialize};
/// The microvm state.
///
/// When Dragonball starts, the instance state is Uninitialized. Once the start_microvm method is
/// called, the state goes from Uninitialized to Starting, and is changed to Running when
/// the start_microvm method completes. Halting and Halted are currently unsupported.
#[derive(Copy, Clone, Debug, Deserialize, PartialEq, Serialize)]
pub enum InstanceState {
/// Microvm is not initialized.
Uninitialized,
/// Microvm is starting.
Starting,
/// Microvm is running.
Running,
/// Microvm is Paused.
Paused,
/// Microvm received a halt instruction.
Halting,
/// Microvm is halted.
Halted,
/// Microvm exit instead of process exit.
Exited(i32),
}
/// The state of async actions
#[derive(Debug, Deserialize, Serialize, Clone, PartialEq)]
pub enum AsyncState {
/// Uninitialized
Uninitialized,
/// Success
Success,
/// Failure
Failure,
}
/// The strongly typed structure that contains general information about the microVM.
#[derive(Debug, Deserialize, Serialize)]
pub struct InstanceInfo {
/// The ID of the microVM.
pub id: String,
/// The state of the microVM.
pub state: InstanceState,
/// The version of the VMM that runs the microVM.
pub vmm_version: String,
/// The pid of the current VMM process.
pub pid: u32,
/// The state of async actions.
pub async_state: AsyncState,
/// List of tids of vcpu threads (vcpu index, tid)
pub tids: Vec<(u8, u32)>,
/// Last instance downtime
pub last_instance_downtime: u64,
}
impl InstanceInfo {
/// Create an instance info object with the given id and VMM version.
pub fn new(id: String, vmm_version: String) -> Self {
InstanceInfo {
id,
state: InstanceState::Uninitialized,
vmm_version,
pid: std::process::id(),
async_state: AsyncState::Uninitialized,
tids: Vec::new(),
last_instance_downtime: 0,
}
}
}
impl Default for InstanceInfo {
fn default() -> Self {
InstanceInfo {
id: String::from(""),
state: InstanceState::Uninitialized,
vmm_version: env!("CARGO_PKG_VERSION").to_string(),
pid: std::process::id(),
async_state: AsyncState::Uninitialized,
tids: Vec::new(),
last_instance_downtime: 0,
}
}
}

View File

@@ -0,0 +1,86 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
/// We only support this number of vCPUs for now, mostly because all vCPU related metrics are typed as u8
/// and going beyond u8 would take extra effort.
pub const MAX_SUPPORTED_VCPUS: u8 = 254;
/// Memory hotplug value should be aligned to this size (unit: MiB).
pub const MEMORY_HOTPLUG_ALIGHMENT: u8 = 64;
/// Errors associated with configuring the microVM.
#[derive(Debug, PartialEq, thiserror::Error)]
pub enum VmConfigError {
/// Cannot update the configuration of the microvm post boot.
#[error("update operation is not allowed after boot")]
UpdateNotAllowedPostBoot,
/// The max vcpu count is invalid.
#[error("the vCPU number shouldn't be larger than {}", MAX_SUPPORTED_VCPUS)]
VcpuCountExceedsMaximum,
/// The vcpu count is invalid. When hyperthreading is enabled, the `cpu_count` must be either
/// 1 or an even number.
#[error(
"the vCPU number '{0}' can only be 1 or an even number when hyperthreading is enabled"
)]
InvalidVcpuCount(u8),
/// The threads_per_core is invalid. It should be either 1 or 2.
#[error("the threads_per_core number '{0}' can only be 1 or 2")]
InvalidThreadsPerCore(u8),
/// The cores_per_die is invalid. It should be larger than 0.
#[error("the cores_per_die number '{0}' can only be larger than 0")]
InvalidCoresPerDie(u8),
/// The dies_per_socket is invalid. It should be larger than 0.
#[error("the dies_per_socket number '{0}' can only be larger than 0")]
InvalidDiesPerSocket(u8),
/// The socket number is invalid. It should be either 1 or 2.
#[error("the socket number '{0}' can only be 1 or 2")]
InvalidSocket(u8),
/// The max vcpu count inferred from cpu topology (threads_per_core * cores_per_die * dies_per_socket * sockets) should be larger than or equal to vcpu_count.
#[error("the max vcpu count inferred from cpu topology '{0}' (threads_per_core * cores_per_die * dies_per_socket * sockets) should be larger than or equal to vcpu_count")]
InvalidCpuTopology(u8),
/// The max vcpu count is invalid.
#[error(
"the max vCPU number '{0}' shouldn't be less than the vCPU count and can only be 1 or an even number when hyperthreading is enabled"
)]
InvalidMaxVcpuCount(u8),
/// The memory size is invalid. The memory can only be an unsigned integer.
#[error("the memory size 0x{0:x}MiB is invalid")]
InvalidMemorySize(usize),
/// The hotplug memory size is invalid. The memory can only be an unsigned integer.
#[error(
"the hotplug memory size '{0}' (MiB) is invalid, must be a multiple of {}",
MEMORY_HOTPLUG_ALIGHMENT
)]
InvalidHotplugMemorySize(usize),
/// The memory type is invalid.
#[error("the memory type '{0}' is invalid")]
InvalidMemType(String),
/// The memory file path is invalid.
#[error("the memory file path is invalid")]
InvalidMemFilePath(String),
/// NUMA region memory size is invalid
#[error("Total size of memory in NUMA regions: {0}, should match the memory size in config")]
InvalidNumaRegionMemorySize(usize),
/// NUMA region vCPU count is invalid
#[error("Total count of vCPUs in NUMA regions: {0}, should match the max vcpu count in config")]
InvalidNumaRegionCpuCount(u16),
/// NUMA region vCPU max id is invalid
#[error("Max id of vCPUs in NUMA regions: {0}, should match the max vcpu count in config")]
InvalidNumaRegionCpuMaxId(u16),
}

View File

@@ -0,0 +1,19 @@
// Copyright (C) 2019-2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//! API Version 1 related data structures to configure the vmm.
mod vmm_action;
pub use self::vmm_action::*;
/// Wrapper for configuring the microVM boot source.
mod boot_source;
pub use self::boot_source::{BootSourceConfig, BootSourceConfigError, DEFAULT_KERNEL_CMDLINE};
/// Wrapper over the microVM general information.
mod instance_info;
pub use self::instance_info::{InstanceInfo, InstanceState};
/// Wrapper for configuring the memory and CPU of the microVM.
mod machine_config;
pub use self::machine_config::{VmConfigError, MAX_SUPPORTED_VCPUS};

View File

@@ -0,0 +1,636 @@
// Copyright (C) 2020-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::fs::File;
use std::sync::mpsc::{Receiver, Sender, TryRecvError};
use log::{debug, error, info, warn};
use crate::error::{Result, StartMicroVmError, StopMicrovmError};
use crate::event_manager::EventManager;
use crate::vm::{CpuTopology, KernelConfigInfo, VmConfigInfo};
use crate::vmm::Vmm;
use self::VmConfigError::*;
use self::VmmActionError::MachineConfig;
#[cfg(feature = "virtio-blk")]
pub use crate::device_manager::blk_dev_mgr::{
BlockDeviceConfigInfo, BlockDeviceConfigUpdateInfo, BlockDeviceError, BlockDeviceMgr,
};
#[cfg(feature = "virtio-fs")]
pub use crate::device_manager::fs_dev_mgr::{
FsDeviceConfigInfo, FsDeviceConfigUpdateInfo, FsDeviceError, FsDeviceMgr, FsMountConfigInfo,
};
#[cfg(feature = "virtio-net")]
pub use crate::device_manager::virtio_net_dev_mgr::{
VirtioNetDeviceConfigInfo, VirtioNetDeviceConfigUpdateInfo, VirtioNetDeviceError,
VirtioNetDeviceMgr,
};
#[cfg(feature = "virtio-vsock")]
pub use crate::device_manager::vsock_dev_mgr::{VsockDeviceConfigInfo, VsockDeviceError};
use super::*;
/// Wrapper for all errors associated with VMM actions.
#[derive(Debug, thiserror::Error)]
pub enum VmmActionError {
/// Invalid virtual machine instance ID.
#[error("the virtual machine instance ID is invalid")]
InvalidVMID,
/// Failed to hotplug, due to Upcall not ready.
#[error("Upcall not ready, can't hotplug device.")]
UpcallNotReady,
/// The action `ConfigureBootSource` failed either because of bad user input or an internal
/// error.
#[error("failed to configure boot source for VM: {0}")]
BootSource(#[source] BootSourceConfigError),
/// The action `StartMicroVm` failed either because of bad user input or an internal error.
#[error("failed to boot the VM: {0}")]
StartMicroVm(#[source] StartMicroVmError),
/// The action `StopMicroVm` failed either because of bad user input or an internal error.
#[error("failed to shutdown the VM: {0}")]
StopMicrovm(#[source] StopMicrovmError),
/// One of the actions `GetVmConfiguration` or `SetVmConfiguration` failed either because of bad
/// input or an internal error.
#[error("failed to set configuration for the VM: {0}")]
MachineConfig(#[source] VmConfigError),
#[cfg(feature = "virtio-vsock")]
/// The action `InsertVsockDevice` failed either because of bad user input or an internal error.
#[error("failed to add virtio-vsock device: {0}")]
Vsock(#[source] VsockDeviceError),
#[cfg(feature = "virtio-blk")]
/// Block device related errors.
#[error("virtio-blk device error: {0}")]
Block(#[source] BlockDeviceError),
#[cfg(feature = "virtio-net")]
/// Net device related errors.
#[error("virtio-net device error: {0}")]
VirtioNet(#[source] VirtioNetDeviceError),
#[cfg(feature = "virtio-fs")]
/// The action `InsertFsDevice` failed either because of bad user input or an internal error.
#[error("virtio-fs device: {0}")]
FsDevice(#[source] FsDeviceError),
}
/// This enum represents the public interface of the VMM. Each action contains various
/// bits of information (ids, paths, etc.).
#[derive(Clone, Debug, PartialEq)]
pub enum VmmAction {
/// Configure the boot source of the microVM using `BootSourceConfig`.
/// This action can only be called before the microVM has booted.
ConfigureBootSource(BootSourceConfig),
/// Launch the microVM. This action can only be called before the microVM has booted.
StartMicroVm,
/// Shutdown the microVM. This action can only be called after the microVM has booted.
/// When the vmm is used as a crate by another process, this is needed to
/// shut down the vcpu threads and destroy all of the objects.
ShutdownMicroVm,
/// Get the configuration of the microVM.
GetVmConfiguration,
/// Set the microVM configuration (memory & vcpu) using `VmConfig` as input. This
/// action can only be called before the microVM has booted.
SetVmConfiguration(VmConfigInfo),
#[cfg(feature = "virtio-vsock")]
/// Add a new vsock device or update one that already exists using the
/// `VsockDeviceConfig` as input. This action can only be called before the microVM has
/// booted. The response is sent using the `OutcomeSender`.
InsertVsockDevice(VsockDeviceConfigInfo),
#[cfg(feature = "virtio-blk")]
/// Add a new block device or update one that already exists using the `BlockDeviceConfig` as
/// input. This action can only be called before the microVM has booted.
InsertBlockDevice(BlockDeviceConfigInfo),
#[cfg(feature = "virtio-blk")]
/// Remove a block device according to the given drive_id.
RemoveBlockDevice(String),
#[cfg(feature = "virtio-blk")]
/// Update a block device, after microVM start. Currently, the only updatable properties
/// are the RX and TX rate limiters.
UpdateBlockDevice(BlockDeviceConfigUpdateInfo),
#[cfg(feature = "virtio-net")]
/// Add a new network interface config or update one that already exists using the
/// `NetworkInterfaceConfig` as input. This action can only be called before the microVM has
/// booted. The response is sent using the `OutcomeSender`.
InsertNetworkDevice(VirtioNetDeviceConfigInfo),
#[cfg(feature = "virtio-net")]
/// Update a network interface, after microVM start. Currently, the only updatable properties
/// are the RX and TX rate limiters.
UpdateNetworkInterface(VirtioNetDeviceConfigUpdateInfo),
#[cfg(feature = "virtio-fs")]
/// Add a new shared fs device or update one that already exists using the
/// `FsDeviceConfig` as input. This action can only be called before the microVM has
/// booted.
InsertFsDevice(FsDeviceConfigInfo),
#[cfg(feature = "virtio-fs")]
/// Attach a new virtiofs Backend fs or detach an existing virtiofs Backend fs using the
/// `FsMountConfig` as input. This action can only be called _after_ the microVM has
/// booted.
ManipulateFsBackendFs(FsMountConfigInfo),
#[cfg(feature = "virtio-fs")]
/// Update fs rate limiter, after microVM start.
UpdateFsDevice(FsDeviceConfigUpdateInfo),
}
/// The enum represents the response sent by the VMM in case of success. The response is either
/// empty, when no data needs to be sent, or an internal VMM structure.
#[derive(Debug)]
pub enum VmmData {
/// No data is sent on the channel.
Empty,
/// The microVM configuration represented by `VmConfigInfo`.
MachineConfiguration(Box<VmConfigInfo>),
}
/// Request data type used to communicate between the API and the VMM.
pub type VmmRequest = Box<VmmAction>;
/// Result data type used to communicate between the API and the VMM.
pub type VmmRequestResult = std::result::Result<VmmData, VmmActionError>;
/// Response data type used to communicate between the API and the VMM.
pub type VmmResponse = Box<VmmRequestResult>;
/// VMM Service to handle requests from the API server.
///
/// There are two levels of API servers as below:
/// API client <--> VMM API Server <--> VMM Core
pub struct VmmService {
from_api: Receiver<VmmRequest>,
to_api: Sender<VmmResponse>,
machine_config: VmConfigInfo,
}
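// Illustrative sketch, not part of the original file: wiring the request and
// response channels between an API thread and the VMM service. The channel
// creation is an assumption made for this example; only `VmmRequest`,
// `VmmResponse` and `VmmService::new` come from this module. A request is then
// sent as e.g. `req_tx.send(Box::new(VmmAction::StartMicroVm))`.
fn _example_service_channels() -> (Sender<VmmRequest>, Receiver<VmmResponse>, VmmService) {
    let (req_tx, req_rx) = std::sync::mpsc::channel::<VmmRequest>();
    let (resp_tx, resp_rx) = std::sync::mpsc::channel::<VmmResponse>();
    // The API client keeps `req_tx` and `resp_rx`; the VMM core owns the service.
    let service = VmmService::new(req_rx, resp_tx);
    (req_tx, resp_rx, service)
}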
impl VmmService {
/// Create a new VMM API server instance.
pub fn new(from_api: Receiver<VmmRequest>, to_api: Sender<VmmResponse>) -> Self {
VmmService {
from_api,
to_api,
machine_config: VmConfigInfo::default(),
}
}
/// Handle requests from the HTTP API Server and send back replies.
pub fn run_vmm_action(&mut self, vmm: &mut Vmm, event_mgr: &mut EventManager) -> Result<()> {
let request = match self.from_api.try_recv() {
Ok(t) => *t,
Err(TryRecvError::Empty) => {
warn!("Got a spurious notification from api thread");
return Ok(());
}
Err(TryRecvError::Disconnected) => {
panic!("The channel's sending half was disconnected. Cannot receive data.");
}
};
debug!("receive vmm action: {:?}", request);
let response = match request {
VmmAction::ConfigureBootSource(boot_source_body) => {
self.configure_boot_source(vmm, boot_source_body)
}
VmmAction::StartMicroVm => self.start_microvm(vmm, event_mgr),
VmmAction::ShutdownMicroVm => self.shutdown_microvm(vmm),
VmmAction::GetVmConfiguration => Ok(VmmData::MachineConfiguration(Box::new(
self.machine_config.clone(),
))),
VmmAction::SetVmConfiguration(machine_config) => {
self.set_vm_configuration(vmm, machine_config)
}
#[cfg(feature = "virtio-vsock")]
VmmAction::InsertVsockDevice(vsock_cfg) => self.add_vsock_device(vmm, vsock_cfg),
#[cfg(feature = "virtio-blk")]
VmmAction::InsertBlockDevice(block_device_config) => {
self.add_block_device(vmm, event_mgr, block_device_config)
}
#[cfg(feature = "virtio-blk")]
VmmAction::UpdateBlockDevice(blk_update) => {
self.update_blk_rate_limiters(vmm, blk_update)
}
#[cfg(feature = "virtio-blk")]
VmmAction::RemoveBlockDevice(drive_id) => {
self.remove_block_device(vmm, event_mgr, &drive_id)
}
#[cfg(feature = "virtio-net")]
VmmAction::InsertNetworkDevice(virtio_net_cfg) => {
self.add_virtio_net_device(vmm, event_mgr, virtio_net_cfg)
}
#[cfg(feature = "virtio-net")]
VmmAction::UpdateNetworkInterface(netif_update) => {
self.update_net_rate_limiters(vmm, netif_update)
}
#[cfg(feature = "virtio-fs")]
VmmAction::InsertFsDevice(fs_cfg) => self.add_fs_device(vmm, fs_cfg),
#[cfg(feature = "virtio-fs")]
VmmAction::ManipulateFsBackendFs(fs_mount_cfg) => {
self.manipulate_fs_backend_fs(vmm, fs_mount_cfg)
}
#[cfg(feature = "virtio-fs")]
VmmAction::UpdateFsDevice(fs_update_cfg) => {
self.update_fs_rate_limiters(vmm, fs_update_cfg)
}
};
debug!("send vmm response: {:?}", response);
self.send_response(response)
}
fn send_response(&self, result: VmmRequestResult) -> Result<()> {
self.to_api
.send(Box::new(result))
.map_err(|_| ())
.expect("vmm: one-shot API result channel has been closed");
Ok(())
}
fn configure_boot_source(
&self,
vmm: &mut Vmm,
boot_source_config: BootSourceConfig,
) -> VmmRequestResult {
use super::BootSourceConfigError::{
InvalidInitrdPath, InvalidKernelCommandLine, InvalidKernelPath,
UpdateNotAllowedPostBoot,
};
use super::VmmActionError::BootSource;
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if vm.is_vm_initialized() {
return Err(BootSource(UpdateNotAllowedPostBoot));
}
let kernel_file = File::open(&boot_source_config.kernel_path)
.map_err(|e| BootSource(InvalidKernelPath(e)))?;
let initrd_file = match boot_source_config.initrd_path {
None => None,
Some(ref path) => Some(File::open(path).map_err(|e| BootSource(InvalidInitrdPath(e)))?),
};
let mut cmdline = linux_loader::cmdline::Cmdline::new(dbs_boot::layout::CMDLINE_MAX_SIZE);
let boot_args = boot_source_config
.boot_args
.clone()
.unwrap_or_else(|| String::from(DEFAULT_KERNEL_CMDLINE));
cmdline
.insert_str(boot_args)
.map_err(|e| BootSource(InvalidKernelCommandLine(e)))?;
let kernel_config = KernelConfigInfo::new(kernel_file, initrd_file, cmdline);
vm.set_kernel_config(kernel_config);
Ok(VmmData::Empty)
}
fn start_microvm(&mut self, vmm: &mut Vmm, event_mgr: &mut EventManager) -> VmmRequestResult {
use self::StartMicroVmError::MicroVMAlreadyRunning;
use self::VmmActionError::StartMicroVm;
let vmm_seccomp_filter = vmm.vmm_seccomp_filter();
let vcpu_seccomp_filter = vmm.vcpu_seccomp_filter();
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if vm.is_vm_initialized() {
return Err(StartMicroVm(MicroVMAlreadyRunning));
}
vm.start_microvm(event_mgr, vmm_seccomp_filter, vcpu_seccomp_filter)
.map(|_| VmmData::Empty)
.map_err(StartMicroVm)
}
fn shutdown_microvm(&mut self, vmm: &mut Vmm) -> VmmRequestResult {
vmm.event_ctx.exit_evt_triggered = true;
Ok(VmmData::Empty)
}
/// Set virtual machine configuration.
pub fn set_vm_configuration(
&mut self,
vmm: &mut Vmm,
machine_config: VmConfigInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if vm.is_vm_initialized() {
return Err(MachineConfig(UpdateNotAllowedPostBoot));
}
// Validate the fields below and, if all checks succeed, apply the configuration together.
let mut config = vm.vm_config().clone();
if config.vcpu_count != machine_config.vcpu_count {
let vcpu_count = machine_config.vcpu_count;
// Check that the vcpu_count value is >=1.
if vcpu_count == 0 {
return Err(MachineConfig(InvalidVcpuCount(vcpu_count)));
}
config.vcpu_count = vcpu_count;
}
if config.cpu_topology != machine_config.cpu_topology {
let cpu_topology = &machine_config.cpu_topology;
config.cpu_topology = handle_cpu_topology(cpu_topology, config.vcpu_count)?.clone();
} else {
// the topology is unchanged, so use the default topology
let mut default_cpu_topology = CpuTopology {
threads_per_core: 1,
cores_per_die: config.vcpu_count,
dies_per_socket: 1,
sockets: 1,
};
if machine_config.max_vcpu_count > config.vcpu_count {
default_cpu_topology.cores_per_die = machine_config.max_vcpu_count;
}
config.cpu_topology = default_cpu_topology;
}
let cpu_topology = &config.cpu_topology;
let max_vcpu_from_topo = cpu_topology.threads_per_core
* cpu_topology.cores_per_die
* cpu_topology.dies_per_socket
* cpu_topology.sockets;
// If the max_vcpu_count inferred from cpu_topology is not equal to
// max_vcpu_count, max_vcpu_count will be changed. Currently, max_vcpu_count
// is used when cpu_topology is not defined and helps define the cores_per_die
// for the default cpu topology.
let mut max_vcpu_count = machine_config.max_vcpu_count;
if max_vcpu_count < config.vcpu_count {
return Err(MachineConfig(InvalidMaxVcpuCount(max_vcpu_count)));
}
if max_vcpu_from_topo != max_vcpu_count {
max_vcpu_count = max_vcpu_from_topo;
info!("Since max_vcpu_count is not equal to cpu topo information, we have changed the max vcpu count to {}", max_vcpu_from_topo);
}
config.max_vcpu_count = max_vcpu_count;
config.cpu_pm = machine_config.cpu_pm;
config.mem_type = machine_config.mem_type;
let mem_size_mib_value = machine_config.mem_size_mib;
// Support 1 TiB of memory at most, 2 MiB aligned for huge pages.
if mem_size_mib_value == 0 || mem_size_mib_value > 0x10_0000 || mem_size_mib_value % 2 != 0
{
return Err(MachineConfig(InvalidMemorySize(mem_size_mib_value)));
}
config.mem_size_mib = mem_size_mib_value;
config.mem_file_path = machine_config.mem_file_path.clone();
if config.mem_type == "hugetlbfs" && config.mem_file_path.is_empty() {
return Err(MachineConfig(InvalidMemFilePath("".to_owned())));
}
config.vpmu_feature = machine_config.vpmu_feature;
let vm_id = vm.shared_info().read().unwrap().id.clone();
let serial_path = match machine_config.serial_path {
Some(value) => value,
None => {
if config.serial_path.is_none() {
String::from("/run/dragonball/") + &vm_id + "_com1"
} else {
// Safe to unwrap() because we have checked it has a value.
config.serial_path.as_ref().unwrap().clone()
}
}
};
config.serial_path = Some(serial_path);
vm.set_vm_config(config.clone());
self.machine_config = config;
Ok(VmmData::Empty)
}
#[cfg(feature = "virtio-vsock")]
fn add_vsock_device(&self, vmm: &mut Vmm, config: VsockDeviceConfigInfo) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if vm.is_vm_initialized() {
return Err(VmmActionError::Vsock(
VsockDeviceError::UpdateNotAllowedPostBoot,
));
}
// VMADDR_CID_ANY (-1U) means any address for binding;
// VMADDR_CID_HYPERVISOR (0) is reserved for services built into the hypervisor;
// VMADDR_CID_RESERVED (1) must not be used;
// VMADDR_CID_HOST (2) is the well-known address of the host.
if config.guest_cid <= 2 {
return Err(VmmActionError::Vsock(VsockDeviceError::GuestCIDInvalid(
config.guest_cid,
)));
}
info!("add_vsock_device: {:?}", config);
let ctx = vm.create_device_op_context(None).map_err(|e| {
info!("create device op context error: {:?}", e);
VmmActionError::Vsock(VsockDeviceError::UpdateNotAllowedPostBoot)
})?;
vm.device_manager_mut()
.vsock_manager
.insert_device(ctx, config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::Vsock)
}
#[cfg(feature = "virtio-blk")]
// Only call this function as part of the API.
// If the drive_id does not exist, a new Block Device Config is added to the list.
fn add_block_device(
&mut self,
vmm: &mut Vmm,
event_mgr: &mut EventManager,
config: BlockDeviceConfigInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
let ctx = vm
.create_device_op_context(Some(event_mgr.epoll_manager()))
.map_err(|e| {
if let StartMicroVmError::UpcallNotReady = e {
return VmmActionError::UpcallNotReady;
}
VmmActionError::Block(BlockDeviceError::UpdateNotAllowedPostBoot)
})?;
BlockDeviceMgr::insert_device(vm.device_manager_mut(), ctx, config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::Block)
}
#[cfg(feature = "virtio-blk")]
/// Updates the rate limiter configuration for an emulated block device as described in `config`.
fn update_blk_rate_limiters(
&mut self,
vmm: &mut Vmm,
config: BlockDeviceConfigUpdateInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
BlockDeviceMgr::update_device_ratelimiters(vm.device_manager_mut(), config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::Block)
}
#[cfg(feature = "virtio-blk")]
// Remove the device
fn remove_block_device(
&mut self,
vmm: &mut Vmm,
event_mgr: &mut EventManager,
drive_id: &str,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
let ctx = vm
.create_device_op_context(Some(event_mgr.epoll_manager()))
.map_err(|_| VmmActionError::Block(BlockDeviceError::UpdateNotAllowedPostBoot))?;
BlockDeviceMgr::remove_device(vm.device_manager_mut(), ctx, drive_id)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::Block)
}
#[cfg(feature = "virtio-net")]
fn add_virtio_net_device(
&mut self,
vmm: &mut Vmm,
event_mgr: &mut EventManager,
config: VirtioNetDeviceConfigInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
let ctx = vm
.create_device_op_context(Some(event_mgr.epoll_manager()))
.map_err(|e| {
if let StartMicroVmError::MicroVMAlreadyRunning = e {
VmmActionError::VirtioNet(VirtioNetDeviceError::UpdateNotAllowedPostBoot)
} else if let StartMicroVmError::UpcallNotReady = e {
VmmActionError::UpcallNotReady
} else {
VmmActionError::StartMicroVm(e)
}
})?;
VirtioNetDeviceMgr::insert_device(vm.device_manager_mut(), ctx, config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::VirtioNet)
}
#[cfg(feature = "virtio-net")]
fn update_net_rate_limiters(
&mut self,
vmm: &mut Vmm,
config: VirtioNetDeviceConfigUpdateInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
VirtioNetDeviceMgr::update_device_ratelimiters(vm.device_manager_mut(), config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::VirtioNet)
}
#[cfg(feature = "virtio-fs")]
fn add_fs_device(&mut self, vmm: &mut Vmm, config: FsDeviceConfigInfo) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
let hotplug = vm.is_vm_initialized();
if !cfg!(feature = "hotplug") && hotplug {
return Err(VmmActionError::FsDevice(
FsDeviceError::UpdateNotAllowedPostBoot,
));
}
let ctx = vm.create_device_op_context(None).map_err(|e| {
info!("create device op context error: {:?}", e);
VmmActionError::FsDevice(FsDeviceError::UpdateNotAllowedPostBoot)
})?;
FsDeviceMgr::insert_device(vm.device_manager_mut(), ctx, config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::FsDevice)
}
#[cfg(feature = "virtio-fs")]
fn manipulate_fs_backend_fs(
&self,
vmm: &mut Vmm,
config: FsMountConfigInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if !vm.is_vm_initialized() {
return Err(VmmActionError::FsDevice(FsDeviceError::MicroVMNotRunning));
}
FsDeviceMgr::manipulate_backend_fs(vm.device_manager_mut(), config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::FsDevice)
}
#[cfg(feature = "virtio-fs")]
fn update_fs_rate_limiters(
&self,
vmm: &mut Vmm,
config: FsDeviceConfigUpdateInfo,
) -> VmmRequestResult {
let vm = vmm.get_vm_mut().ok_or(VmmActionError::InvalidVMID)?;
if !vm.is_vm_initialized() {
return Err(VmmActionError::FsDevice(FsDeviceError::MicroVMNotRunning));
}
FsDeviceMgr::update_device_ratelimiters(vm.device_manager_mut(), config)
.map(|_| VmmData::Empty)
.map_err(VmmActionError::FsDevice)
}
}
fn handle_cpu_topology(
cpu_topology: &CpuTopology,
vcpu_count: u8,
) -> std::result::Result<&CpuTopology, VmmActionError> {
// Check if dies_per_socket, cores_per_die, threads_per_core and socket number is valid
if cpu_topology.threads_per_core < 1 || cpu_topology.threads_per_core > 2 {
return Err(MachineConfig(InvalidThreadsPerCore(
cpu_topology.threads_per_core,
)));
}
let vcpu_count_from_topo = cpu_topology
.sockets
.checked_mul(cpu_topology.dies_per_socket)
.ok_or(MachineConfig(VcpuCountExceedsMaximum))?
.checked_mul(cpu_topology.cores_per_die)
.ok_or(MachineConfig(VcpuCountExceedsMaximum))?
.checked_mul(cpu_topology.threads_per_core)
.ok_or(MachineConfig(VcpuCountExceedsMaximum))?;
if vcpu_count_from_topo > MAX_SUPPORTED_VCPUS {
return Err(MachineConfig(VcpuCountExceedsMaximum));
}
if vcpu_count_from_topo < vcpu_count {
return Err(MachineConfig(InvalidCpuTopology(vcpu_count_from_topo)));
}
Ok(cpu_topology)
}
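// Illustrative example, not part of the original file: with sockets = 1,
// dies_per_socket = 1, cores_per_die = 4 and threads_per_core = 2, the topology
// yields 1 * 1 * 4 * 2 = 8 vCPUs, so any vcpu_count <= 8 passes
// handle_cpu_topology(), and 8 becomes the effective max_vcpu_count in
// set_vm_configuration().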

View File

@@ -0,0 +1,760 @@
// Copyright (C) 2020-2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
use std::convert::TryInto;
use std::io;
use std::ops::{Index, IndexMut};
use std::sync::Arc;
use dbs_device::DeviceIo;
use dbs_utils::rate_limiter::{RateLimiter, TokenBucket};
use serde_derive::{Deserialize, Serialize};
/// Get bucket update for rate limiter.
#[macro_export]
macro_rules! get_bucket_update {
($self:ident, $rate_limiter: ident, $metric: ident) => {{
match &$self.$rate_limiter {
Some(rl_cfg) => {
let tb_cfg = &rl_cfg.$metric;
dbs_utils::rate_limiter::RateLimiter::make_bucket(
tb_cfg.size,
tb_cfg.one_time_burst,
tb_cfg.refill_time,
)
// Updated active rate-limiter.
.map(dbs_utils::rate_limiter::BucketUpdate::Update)
// Updated/deactivated rate-limiter
.unwrap_or(dbs_utils::rate_limiter::BucketUpdate::Disabled)
}
// No update to the rate-limiter.
None => dbs_utils::rate_limiter::BucketUpdate::None,
}
}};
}
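// Illustrative sketch, not part of the original file: how a device config update
// struct typically uses the macro. `ExampleUpdateInfo` and its `rate_limiter`
// field are assumptions made for this example; `bandwidth` and `ops` are the
// fields of `RateLimiterConfigInfo` defined below.
struct ExampleUpdateInfo {
    rate_limiter: Option<RateLimiterConfigInfo>,
}
impl ExampleUpdateInfo {
    fn bytes_bucket_update(&self) -> dbs_utils::rate_limiter::BucketUpdate {
        get_bucket_update!(self, rate_limiter, bandwidth)
    }
    fn ops_bucket_update(&self) -> dbs_utils::rate_limiter::BucketUpdate {
        get_bucket_update!(self, rate_limiter, ops)
    }
}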
/// Trait for generic configuration information.
pub trait ConfigItem {
/// Related errors.
type Err;
/// Get the unique identifier of the configuration item.
fn id(&self) -> &str;
/// Check whether current configuration item conflicts with another one.
fn check_conflicts(&self, other: &Self) -> std::result::Result<(), Self::Err>;
}
/// Struct to manage a group of configuration items.
#[derive(Debug, Default, Deserialize, PartialEq, Serialize)]
pub struct ConfigInfos<T>
where
T: ConfigItem + Clone,
{
configs: Vec<T>,
}
impl<T> ConfigInfos<T>
where
T: ConfigItem + Clone + Default,
{
/// Constructor
pub fn new() -> Self {
ConfigInfos::default()
}
/// Insert a configuration item in the group.
pub fn insert(&mut self, config: T) -> std::result::Result<(), T::Err> {
for item in self.configs.iter() {
config.check_conflicts(item)?;
}
self.configs.push(config);
Ok(())
}
/// Update a configuration item in the group.
pub fn update(&mut self, config: T, err: T::Err) -> std::result::Result<(), T::Err> {
match self.get_index_by_id(&config) {
None => Err(err),
Some(index) => {
for (idx, item) in self.configs.iter().enumerate() {
if idx != index {
config.check_conflicts(item)?;
}
}
self.configs[index] = config;
Ok(())
}
}
}
/// Insert or update a configuration item in the group.
pub fn insert_or_update(&mut self, config: T) -> std::result::Result<(), T::Err> {
match self.get_index_by_id(&config) {
None => {
for item in self.configs.iter() {
config.check_conflicts(item)?;
}
self.configs.push(config)
}
Some(index) => {
for (idx, item) in self.configs.iter().enumerate() {
if idx != index {
config.check_conflicts(item)?;
}
}
self.configs[index] = config;
}
}
Ok(())
}
/// Remove the matching configuration entry.
pub fn remove(&mut self, config: &T) -> Option<T> {
if let Some(index) = self.get_index_by_id(config) {
Some(self.configs.remove(index))
} else {
None
}
}
/// Returns an immutable iterator over the config items
pub fn iter(&self) -> ::std::slice::Iter<T> {
self.configs.iter()
}
/// Get the configuration entry with matching ID.
pub fn get_by_id(&self, item: &T) -> Option<&T> {
let id = item.id();
self.configs.iter().rfind(|cfg| cfg.id() == id)
}
fn get_index_by_id(&self, item: &T) -> Option<usize> {
let id = item.id();
self.configs.iter().position(|cfg| cfg.id() == id)
}
}
impl<T> Clone for ConfigInfos<T>
where
T: ConfigItem + Clone,
{
fn clone(&self) -> Self {
ConfigInfos {
configs: self.configs.clone(),
}
}
}
/// Struct to maintain configuration information for a device.
pub struct DeviceConfigInfo<T>
where
T: ConfigItem + Clone,
{
/// Configuration information for the device object.
pub config: T,
/// The associated device object.
pub device: Option<Arc<dyn DeviceIo>>,
}
impl<T> DeviceConfigInfo<T>
where
T: ConfigItem + Clone,
{
/// Create a new instance of [`DeviceConfigInfo`].
pub fn new(config: T) -> Self {
DeviceConfigInfo {
config,
device: None,
}
}
/// Create a new instance of [`DeviceConfigInfo`] with an optional device.
pub fn new_with_device(config: T, device: Option<Arc<dyn DeviceIo>>) -> Self {
DeviceConfigInfo { config, device }
}
/// Set the device object associated with the configuration.
pub fn set_device(&mut self, device: Arc<dyn DeviceIo>) {
self.device = Some(device);
}
}
impl<T> Clone for DeviceConfigInfo<T>
where
T: ConfigItem + Clone,
{
fn clone(&self) -> Self {
DeviceConfigInfo::new_with_device(self.config.clone(), self.device.clone())
}
}
/// Struct to maintain configuration information for a group of devices.
pub struct DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
info_list: Vec<DeviceConfigInfo<T>>,
}
impl<T> Default for DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
fn default() -> Self {
Self::new()
}
}
impl<T> DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
/// Create a new instance of [`DeviceConfigInfos`].
pub fn new() -> Self {
DeviceConfigInfos {
info_list: Vec::new(),
}
}
/// Insert or update configuration information for a device.
pub fn insert_or_update(&mut self, config: &T) -> std::result::Result<usize, T::Err> {
let device_info = DeviceConfigInfo::new(config.clone());
Ok(match self.get_index_by_id(config) {
Some(index) => {
for (idx, info) in self.info_list.iter().enumerate() {
if idx != index {
info.config.check_conflicts(config)?;
}
}
self.info_list[index] = device_info;
index
}
None => {
for info in self.info_list.iter() {
info.config.check_conflicts(config)?;
}
self.info_list.push(device_info);
self.info_list.len() - 1
}
})
}
/// Remove a device configuration information object.
pub fn remove(&mut self, index: usize) -> Option<DeviceConfigInfo<T>> {
if self.info_list.len() > index {
Some(self.info_list.remove(index))
} else {
None
}
}
/// Get number of device configuration information objects.
pub fn len(&self) -> usize {
self.info_list.len()
}
/// Returns true if the list of device configuration information objects is empty.
pub fn is_empty(&self) -> bool {
self.info_list.len() == 0
}
/// Add a device configuration information object at the tail.
pub fn push(&mut self, info: DeviceConfigInfo<T>) {
self.info_list.push(info);
}
/// Iterator for configuration information objects.
pub fn iter(&self) -> std::slice::Iter<DeviceConfigInfo<T>> {
self.info_list.iter()
}
/// Mutable iterator for configuration information objects.
pub fn iter_mut(&mut self) -> std::slice::IterMut<DeviceConfigInfo<T>> {
self.info_list.iter_mut()
}
fn get_index_by_id(&self, config: &T) -> Option<usize> {
self.info_list
.iter()
.position(|info| info.config.id().eq(config.id()))
}
}
impl<T> Index<usize> for DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
type Output = DeviceConfigInfo<T>;
fn index(&self, idx: usize) -> &Self::Output {
&self.info_list[idx]
}
}
impl<T> IndexMut<usize> for DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
fn index_mut(&mut self, idx: usize) -> &mut Self::Output {
&mut self.info_list[idx]
}
}
impl<T> Clone for DeviceConfigInfos<T>
where
T: ConfigItem + Clone,
{
fn clone(&self) -> Self {
DeviceConfigInfos {
info_list: self.info_list.clone(),
}
}
}
/// Configuration information for RateLimiter token bucket.
#[derive(Clone, Debug, Default, Deserialize, PartialEq, Serialize)]
pub struct TokenBucketConfigInfo {
/// The size for the token bucket. A TokenBucket of `size` total capacity will take `refill_time`
/// milliseconds to go from zero tokens to total capacity.
pub size: u64,
/// Number of free initial tokens, that can be consumed at no cost.
pub one_time_burst: u64,
/// Complete refill time in milliseconds.
pub refill_time: u64,
}
impl TokenBucketConfigInfo {
fn resize(&mut self, n: u64) {
if n != 0 {
self.size /= n;
self.one_time_burst /= n;
}
}
}
impl From<TokenBucketConfigInfo> for TokenBucket {
fn from(t: TokenBucketConfigInfo) -> TokenBucket {
(&t).into()
}
}
impl From<&TokenBucketConfigInfo> for TokenBucket {
fn from(t: &TokenBucketConfigInfo) -> TokenBucket {
TokenBucket::new(t.size, t.one_time_burst, t.refill_time)
}
}
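// Illustrative sketch added for documentation purposes (not part of the original
// patch): it shows how a `TokenBucketConfigInfo` is split across queues with
// `resize()` and then converted into a `TokenBucket` via the `From` impl above.
// The concrete numbers are arbitrary example values.
#[cfg(test)]
mod token_bucket_doc_example {
    use super::*;

    #[test]
    fn resize_and_convert_token_bucket() {
        let mut cfg = TokenBucketConfigInfo {
            size: 1000,
            one_time_burst: 100,
            refill_time: 1000,
        };
        // Split the bandwidth budget evenly across 4 queues.
        cfg.resize(4);
        assert_eq!(cfg.size, 250);
        assert_eq!(cfg.one_time_burst, 25);
        // `resize()` leaves the refill time untouched.
        assert_eq!(cfg.refill_time, 1000);
        // Convert into the runtime `TokenBucket` used by the rate limiter.
        let _bucket: TokenBucket = cfg.into();
    }
}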
/// Configuration information for RateLimiter objects.
#[derive(Clone, Debug, Default, Deserialize, PartialEq, Serialize)]
pub struct RateLimiterConfigInfo {
/// Data used to initialize the RateLimiter::bandwidth bucket.
pub bandwidth: TokenBucketConfigInfo,
/// Data used to initialize the RateLimiter::ops bucket.
pub ops: TokenBucketConfigInfo,
}
impl RateLimiterConfigInfo {
/// Update the bandwidth budget configuration.
pub fn update_bandwidth(&mut self, new_config: TokenBucketConfigInfo) {
self.bandwidth = new_config;
}
/// Update the ops budget configuration.
pub fn update_ops(&mut self, new_config: TokenBucketConfigInfo) {
self.ops = new_config;
}
/// Resize the limiter to 1/n of its original size.
pub fn resize(&mut self, n: u64) {
self.bandwidth.resize(n);
self.ops.resize(n);
}
}
impl TryInto<RateLimiter> for &RateLimiterConfigInfo {
type Error = io::Error;
fn try_into(self) -> Result<RateLimiter, Self::Error> {
RateLimiter::new(
self.bandwidth.size,
self.bandwidth.one_time_burst,
self.bandwidth.refill_time,
self.ops.size,
self.ops.one_time_burst,
self.ops.refill_time,
)
}
}
impl TryInto<RateLimiter> for RateLimiterConfigInfo {
type Error = io::Error;
fn try_into(self) -> Result<RateLimiter, Self::Error> {
RateLimiter::new(
self.bandwidth.size,
self.bandwidth.one_time_burst,
self.bandwidth.refill_time,
self.ops.size,
self.ops.one_time_burst,
self.ops.refill_time,
)
}
}
#[cfg(test)]
mod tests {
use super::*;
#[derive(Debug, thiserror::Error)]
pub enum DummyError {
#[error("configuration entry exists")]
Exist,
}
#[derive(Clone, Debug, Default)]
pub struct DummyConfigInfo {
id: String,
content: String,
}
impl ConfigItem for DummyConfigInfo {
type Err = DummyError;
fn id(&self) -> &str {
&self.id
}
fn check_conflicts(&self, other: &Self) -> Result<(), DummyError> {
if self.id == other.id || self.content == other.content {
Err(DummyError::Exist)
} else {
Ok(())
}
}
}
type DummyConfigInfos = ConfigInfos<DummyConfigInfo>;
#[test]
fn test_insert_config_info() {
let mut configs = DummyConfigInfos::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert(config1).unwrap();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
// Test case: cannot insert new item with the same id.
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.insert(config2).unwrap_err();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert(config3).unwrap();
assert_eq!(configs.configs.len(), 2);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
assert_eq!(configs.configs[1].id, "2");
assert_eq!(configs.configs[1].content, "c");
// Test case: cannot insert new item with the same content.
let config4 = DummyConfigInfo {
id: "3".to_owned(),
content: "c".to_owned(),
};
configs.insert(config4).unwrap_err();
assert_eq!(configs.configs.len(), 2);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
assert_eq!(configs.configs[1].id, "2");
assert_eq!(configs.configs[1].content, "c");
}
#[test]
fn test_update_config_info() {
let mut configs = DummyConfigInfos::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert(config1).unwrap();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
// Test case: succeed to update an existing entry
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.update(config2, DummyError::Exist).unwrap();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
// Test case: cannot update a non-existing entry
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.update(config3, DummyError::Exist).unwrap_err();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
// Test case: cannot update an entry with conflicting content
let config4 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert(config4).unwrap();
let config5 = DummyConfigInfo {
id: "1".to_owned(),
content: "c".to_owned(),
};
configs.update(config5, DummyError::Exist).unwrap_err();
}
#[test]
fn test_insert_or_update_config_info() {
let mut configs = DummyConfigInfos::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert_or_update(config1).unwrap();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "a");
// Test case: succeed to update an existing entry
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.insert_or_update(config2.clone()).unwrap();
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
// Add a second entry
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(config3.clone()).unwrap();
assert_eq!(configs.configs.len(), 2);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
assert_eq!(configs.configs[1].id, "2");
assert_eq!(configs.configs[1].content, "c");
// Lookup the first entry
let config4 = configs
.get_by_id(&DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
})
.unwrap();
assert_eq!(config4.id, config2.id);
assert_eq!(config4.content, config2.content);
// Lookup the second entry
let config5 = configs
.get_by_id(&DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
})
.unwrap();
assert_eq!(config5.id, config3.id);
assert_eq!(config5.content, config3.content);
// Test case: can't insert an entry with conflicting content
let config6 = DummyConfigInfo {
id: "3".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(config6).unwrap_err();
assert_eq!(configs.configs.len(), 2);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
assert_eq!(configs.configs[1].id, "2");
assert_eq!(configs.configs[1].content, "c");
}
#[test]
fn test_remove_config_info() {
let mut configs = DummyConfigInfos::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert_or_update(config1).unwrap();
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.insert_or_update(config2.clone()).unwrap();
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(config3.clone()).unwrap();
assert_eq!(configs.configs.len(), 2);
assert_eq!(configs.configs[0].id, "1");
assert_eq!(configs.configs[0].content, "b");
assert_eq!(configs.configs[1].id, "2");
assert_eq!(configs.configs[1].content, "c");
let config4 = configs
.remove(&DummyConfigInfo {
id: "1".to_owned(),
content: "no value".to_owned(),
})
.unwrap();
assert_eq!(config4.id, config2.id);
assert_eq!(config4.content, config2.content);
assert_eq!(configs.configs.len(), 1);
assert_eq!(configs.configs[0].id, "2");
assert_eq!(configs.configs[0].content, "c");
let config5 = configs
.remove(&DummyConfigInfo {
id: "2".to_owned(),
content: "no value".to_owned(),
})
.unwrap();
assert_eq!(config5.id, config3.id);
assert_eq!(config5.content, config3.content);
assert_eq!(configs.configs.len(), 0);
}
type DummyDeviceInfoList = DeviceConfigInfos<DummyConfigInfo>;
#[test]
fn test_insert_or_update_device_info() {
let mut configs = DummyDeviceInfoList::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert_or_update(&config1).unwrap();
assert_eq!(configs.len(), 1);
assert_eq!(configs[0].config.id, "1");
assert_eq!(configs[0].config.content, "a");
// Test case: succeed to update an existing entry
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.insert_or_update(&config2).unwrap();
assert_eq!(configs.len(), 1);
assert_eq!(configs[0].config.id, "1");
assert_eq!(configs[0].config.content, "b");
// Add a second entry
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(&config3).unwrap();
assert_eq!(configs.len(), 2);
assert_eq!(configs[0].config.id, "1");
assert_eq!(configs[0].config.content, "b");
assert_eq!(configs[1].config.id, "2");
assert_eq!(configs[1].config.content, "c");
// Lookup the first entry
let config4_id = configs
.get_index_by_id(&DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
})
.unwrap();
let config4 = &configs[config4_id].config;
assert_eq!(config4.id, config2.id);
assert_eq!(config4.content, config2.content);
// Lookup the second entry
let config5_id = configs
.get_index_by_id(&DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
})
.unwrap();
let config5 = &configs[config5_id].config;
assert_eq!(config5.id, config3.id);
assert_eq!(config5.content, config3.content);
// Test case: can't insert an entry with conflicting content
let config6 = DummyConfigInfo {
id: "3".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(&config6).unwrap_err();
assert_eq!(configs.len(), 2);
assert_eq!(configs[0].config.id, "1");
assert_eq!(configs[0].config.content, "b");
assert_eq!(configs[1].config.id, "2");
assert_eq!(configs[1].config.content, "c");
}
#[test]
fn test_remove_device_info() {
let mut configs = DummyDeviceInfoList::new();
let config1 = DummyConfigInfo {
id: "1".to_owned(),
content: "a".to_owned(),
};
configs.insert_or_update(&config1).unwrap();
let config2 = DummyConfigInfo {
id: "1".to_owned(),
content: "b".to_owned(),
};
configs.insert_or_update(&config2).unwrap();
let config3 = DummyConfigInfo {
id: "2".to_owned(),
content: "c".to_owned(),
};
configs.insert_or_update(&config3).unwrap();
assert_eq!(configs.len(), 2);
assert_eq!(configs[0].config.id, "1");
assert_eq!(configs[0].config.content, "b");
assert_eq!(configs[1].config.id, "2");
assert_eq!(configs[1].config.content, "c");
let config4 = configs.remove(0).unwrap().config;
assert_eq!(config4.id, config2.id);
assert_eq!(config4.content, config2.content);
assert_eq!(configs.len(), 1);
assert_eq!(configs[0].config.id, "2");
assert_eq!(configs[0].config.content, "c");
let config5 = configs.remove(0).unwrap().config;
assert_eq!(config5.id, config3.id);
assert_eq!(config5.content, config3.content);
assert_eq!(configs.len(), 0);
}
}

View File

@@ -0,0 +1,773 @@
// Copyright 2020-2022 Alibaba, Inc. or its affiliates. All Rights Reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
//! Device manager for virtio-blk and vhost-user-blk devices.
use std::collections::{vec_deque, VecDeque};
use std::convert::TryInto;
use std::fs::OpenOptions;
use std::os::unix::fs::OpenOptionsExt;
use std::os::unix::io::AsRawFd;
use std::path::{Path, PathBuf};
use std::sync::Arc;
use dbs_virtio_devices as virtio;
use dbs_virtio_devices::block::{aio::Aio, io_uring::IoUring, Block, LocalFile, Ufile};
use serde_derive::{Deserialize, Serialize};
use crate::address_space_manager::GuestAddressSpaceImpl;
use crate::config_manager::{ConfigItem, DeviceConfigInfo, RateLimiterConfigInfo};
use crate::device_manager::blk_dev_mgr::BlockDeviceError::InvalidDeviceId;
use crate::device_manager::{DeviceManager, DeviceMgrError, DeviceOpContext};
use crate::get_bucket_update;
use crate::vm::KernelConfigInfo;
use super::DbsMmioV2Device;
// The flag of whether to use the shared irq.
const USE_SHARED_IRQ: bool = true;
// The flag of whether to use the generic irq.
const USE_GENERIC_IRQ: bool = true;
macro_rules! info(
($l:expr, $($args:tt)+) => {
slog::info!($l, $($args)+; slog::o!("subsystem" => "block_manager"))
};
);
macro_rules! error(
($l:expr, $($args:tt)+) => {
slog::error!($l, $($args)+; slog::o!("subsystem" => "block_manager"))
};
);
/// Default queue size for VirtIo block devices.
pub const QUEUE_SIZE: u16 = 128;
/// Errors associated with the operations allowed on a drive.
#[derive(Debug, thiserror::Error)]
pub enum BlockDeviceError {
/// Invalid VM instance ID.
#[error("invalid VM instance id")]
InvalidVMID,
/// The block device path is invalid.
#[error("invalid block device path '{0}'")]
InvalidBlockDevicePath(PathBuf),
/// The block device type is invalid.
#[error("invalid block device type")]
InvalidBlockDeviceType,
/// The block device path was already used for a different drive.
#[error("block device path '{0}' already exists")]
BlockDevicePathAlreadyExists(PathBuf),
/// The device id doesn't exist.
#[error("invalid block device id '{0}'")]
InvalidDeviceId(String),
/// Cannot perform the requested operation after booting the microVM.
#[error("block device does not support runtime update")]
UpdateNotAllowedPostBoot,
/// A root block device was already added.
#[error("could not add multiple virtual machine root devices")]
RootBlockDeviceAlreadyAdded,
/// Failed to send patch message to block epoll handler.
#[error("could not send patch message to the block epoll handler")]
BlockEpollHanderSendFail,
/// Failure from the device manager.
#[error("device manager errors: {0}")]
DeviceManager(#[from] DeviceMgrError),
/// Failure from virtio subsystem.
#[error(transparent)]
Virtio(virtio::Error),
/// Unable to seek the block device backing file due to invalid permissions or
/// the file was deleted/corrupted.
#[error("cannot create block device: {0}")]
CreateBlockDevice(#[source] virtio::Error),
/// Cannot open the block device backing file.
#[error("cannot open the block device backing file: {0}")]
OpenBlockDevice(#[source] std::io::Error),
/// Cannot initialize a MMIO Block Device or add a device to the MMIO Bus.
#[error("failure while registering block device: {0}")]
RegisterBlockDevice(#[source] DeviceMgrError),
}
/// Type of low level storage device/protocol for virtio-blk devices.
#[derive(Clone, Copy, Debug, PartialEq, Serialize, Deserialize)]
pub enum BlockDeviceType {
/// Unknown low level device type.
Unknown,
/// Vhost-user-blk based low level device.
/// SPOOL is a reliable NVMe virtualization system for the cloud environment.
/// You can learn more about SPOOL here: https://www.usenix.org/conference/atc20/presentation/xue
Spool,
/// Local disk/file based low level device.
RawBlock,
}
impl BlockDeviceType {
/// Get type of low level storage device/protocol by parsing `path`.
pub fn get_type(path: &str) -> BlockDeviceType {
// A SPOOL path should start with "spool", e.g. "spool:/device1"
if path.starts_with("spool:/") {
BlockDeviceType::Spool
} else {
BlockDeviceType::RawBlock
}
}
}
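// Illustrative sketch added for documentation purposes (not part of the original
// patch): it demonstrates how `BlockDeviceType::get_type()` distinguishes SPOOL
// paths from local raw block paths. The paths below are hypothetical examples.
#[cfg(test)]
mod block_device_type_doc_example {
    use super::*;

    #[test]
    fn detect_block_device_type() {
        assert_eq!(
            BlockDeviceType::get_type("spool:/device1"),
            BlockDeviceType::Spool
        );
        assert_eq!(
            BlockDeviceType::get_type("/dev/vdb"),
            BlockDeviceType::RawBlock
        );
    }
}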
/// Configuration information for a block device update.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct BlockDeviceConfigUpdateInfo {
/// Unique identifier of the drive.
pub drive_id: String,
/// Rate Limiter for I/O operations.
pub rate_limiter: Option<RateLimiterConfigInfo>,
}
impl BlockDeviceConfigUpdateInfo {
/// Provides a `BucketUpdate` description for the bandwidth rate limiter.
pub fn bytes(&self) -> dbs_utils::rate_limiter::BucketUpdate {
get_bucket_update!(self, rate_limiter, bandwidth)
}
/// Provides a `BucketUpdate` description for the ops rate limiter.
pub fn ops(&self) -> dbs_utils::rate_limiter::BucketUpdate {
get_bucket_update!(self, rate_limiter, ops)
}
}
/// Configuration information for a block device.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct BlockDeviceConfigInfo {
/// Unique identifier of the drive.
pub drive_id: String,
/// Type of low level storage/protocol.
pub device_type: BlockDeviceType,
/// Path of the drive.
pub path_on_host: PathBuf,
/// If set to true, it makes the current device the root block device.
/// Setting this flag to true will mount the block device in the
/// guest under /dev/vda unless the part_uuid is present.
pub is_root_device: bool,
/// Part-UUID. Represents the unique id of the boot partition of this device.
/// It is optional and it will be used only if the `is_root_device` field is true.
pub part_uuid: Option<String>,
/// If set to true, the drive is opened in read-only mode. Otherwise, the
/// drive is opened as read-write.
pub is_read_only: bool,
/// If set to false, the drive is opened with buffered I/O mode. Otherwise, the
/// drive is opened with direct I/O mode.
pub is_direct: bool,
/// Don't close `path_on_host` file when dropping the device.
pub no_drop: bool,
/// Block device multi-queue
pub num_queues: usize,
/// Virtio queue size (number of descriptors per queue).
pub queue_size: u16,
/// Rate Limiter for I/O operations.
pub rate_limiter: Option<RateLimiterConfigInfo>,
/// Use shared irq
pub use_shared_irq: Option<bool>,
/// Use generic irq
pub use_generic_irq: Option<bool>,
}
impl std::default::Default for BlockDeviceConfigInfo {
fn default() -> Self {
Self {
drive_id: String::default(),
device_type: BlockDeviceType::RawBlock,
path_on_host: PathBuf::default(),
is_root_device: false,
part_uuid: None,
is_read_only: false,
is_direct: Self::default_direct(),
no_drop: Self::default_no_drop(),
num_queues: Self::default_num_queues(),
queue_size: 256,
rate_limiter: None,
use_shared_irq: None,
use_generic_irq: None,
}
}
}
impl BlockDeviceConfigInfo {
/// Get default queue numbers
pub fn default_num_queues() -> usize {
1
}
/// Get default value of is_direct switch
pub fn default_direct() -> bool {
true
}
/// Get default value of no_drop switch
pub fn default_no_drop() -> bool {
false
}
/// Get type of low level storage/protocol.
pub fn device_type(&self) -> BlockDeviceType {
self.device_type
}
/// Returns a reference to `path_on_host`.
pub fn path_on_host(&self) -> &PathBuf {
&self.path_on_host
}
/// Returns a reference to the part_uuid.
pub fn get_part_uuid(&self) -> Option<&String> {
self.part_uuid.as_ref()
}
/// Checks whether the drive has read-only permissions.
pub fn is_read_only(&self) -> bool {
self.is_read_only
}
/// Checks whether the drive uses direct I/O
pub fn is_direct(&self) -> bool {
self.is_direct
}
/// Get number and size of queues supported.
pub fn queue_sizes(&self) -> Vec<u16> {
(0..self.num_queues)
.map(|_| self.queue_size)
.collect::<Vec<u16>>()
}
}
impl ConfigItem for BlockDeviceConfigInfo {
type Err = BlockDeviceError;
fn id(&self) -> &str {
&self.drive_id
}
fn check_conflicts(&self, other: &Self) -> Result<(), BlockDeviceError> {
if self.drive_id == other.drive_id {
Ok(())
} else if self.path_on_host == other.path_on_host {
Err(BlockDeviceError::BlockDevicePathAlreadyExists(
self.path_on_host.clone(),
))
} else {
Ok(())
}
}
}
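// Illustrative sketch added for documentation purposes (not part of the original
// patch): it builds a read-only raw block device configuration, shows that
// `queue_sizes()` replicates `queue_size` once per queue, and shows how
// `check_conflicts()` rejects two drives that share the same host path. The
// drive ids and path are hypothetical example values.
#[cfg(test)]
mod block_device_config_doc_example {
    use super::*;

    #[test]
    fn build_and_check_block_device_config() {
        let config_a = BlockDeviceConfigInfo {
            drive_id: "drive-a".to_string(),
            path_on_host: PathBuf::from("/path/to/rootfs.img"),
            is_root_device: true,
            is_read_only: true,
            num_queues: 2,
            queue_size: 128,
            ..Default::default()
        };
        assert_eq!(config_a.device_type(), BlockDeviceType::RawBlock);
        assert_eq!(config_a.queue_sizes(), vec![128, 128]);

        // A second drive with a different id but the same host path conflicts.
        let mut config_b = config_a.clone();
        config_b.drive_id = "drive-b".to_string();
        assert!(config_a.check_conflicts(&config_b).is_err());
        // The same drive id is treated as an update, so no conflict is reported.
        assert!(config_a.check_conflicts(&config_a).is_ok());
    }
}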
impl std::fmt::Debug for BlockDeviceInfo {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{:?}", self.config)
}
}
/// Block Device Info
pub type BlockDeviceInfo = DeviceConfigInfo<BlockDeviceConfigInfo>;
/// Wrapper for the collection that holds all the Block Devices Configs
//#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
#[derive(Clone)]
pub struct BlockDeviceMgr {
/// A list of `BlockDeviceInfo` objects.
info_list: VecDeque<BlockDeviceInfo>,
has_root_block: bool,
has_part_uuid_root: bool,
read_only_root: bool,
part_uuid: Option<String>,
use_shared_irq: bool,
}
impl BlockDeviceMgr {
/// Returns a front-to-back iterator.
pub fn iter(&self) -> vec_deque::Iter<BlockDeviceInfo> {
self.info_list.iter()
}
/// Checks whether any of the added BlockDevice is the root.
pub fn has_root_block_device(&self) -> bool {
self.has_root_block
}
/// Checks whether the root device is configured using a part UUID.
pub fn has_part_uuid_root(&self) -> bool {
self.has_part_uuid_root
}
/// Checks whether the root device has read-only permissions.
pub fn is_read_only_root(&self) -> bool {
self.read_only_root
}
/// Gets the index of the device with the specified `drive_id` if it exists in the list.
pub fn get_index_of_drive_id(&self, id: &str) -> Option<usize> {
self.info_list
.iter()
.position(|info| info.config.id().eq(id))
}
/// Gets the `BlockDeviceConfigInfo` of the device with the specified `drive_id` if it exists in the list.
pub fn get_config_of_drive_id(&self, drive_id: &str) -> Option<BlockDeviceConfigInfo> {
match self.get_index_of_drive_id(drive_id) {
Some(index) => {
let config = self.info_list.get(index).unwrap().config.clone();
Some(config)
}
None => None,
}
}
/// Inserts `block_device_config` in the block device configuration list.
/// If an entry with the same id already exists, it will attempt to update
/// the existing entry.
/// Inserting a secondary root block device will fail.
pub fn insert_device(
device_mgr: &mut DeviceManager,
mut ctx: DeviceOpContext,
config: BlockDeviceConfigInfo,
) -> std::result::Result<(), BlockDeviceError> {
if !cfg!(feature = "hotplug") && ctx.is_hotplug {
return Err(BlockDeviceError::UpdateNotAllowedPostBoot);
}
let mgr = &mut device_mgr.block_manager;
// If the id of the drive already exists in the list, the operation is update.
match mgr.get_index_of_drive_id(config.id()) {
Some(index) => {
// No support for runtime update yet.
if ctx.is_hotplug {
Err(BlockDeviceError::BlockDevicePathAlreadyExists(
config.path_on_host.clone(),
))
} else {
for (idx, info) in mgr.info_list.iter().enumerate() {
if idx != index {
info.config.check_conflicts(&config)?;
}
}
mgr.update(index, config)
}
}
None => {
for info in mgr.info_list.iter() {
info.config.check_conflicts(&config)?;
}
let index = mgr.create(config.clone())?;
if !ctx.is_hotplug {
return Ok(());
}
match config.device_type {
BlockDeviceType::RawBlock => {
let device = Self::create_blk_device(&config, &mut ctx)
.map_err(BlockDeviceError::Virtio)?;
let dev = DeviceManager::create_mmio_virtio_device(
device,
&mut ctx,
config.use_shared_irq.unwrap_or(mgr.use_shared_irq),
config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(BlockDeviceError::DeviceManager)?;
mgr.update_device_by_index(index, Arc::clone(&dev))?;
// Live upgrade needs to save/restore the device from info.device.
mgr.info_list[index].set_device(dev.clone());
ctx.insert_hotplug_mmio_device(&dev, None).map_err(|e| {
let logger = ctx.logger().new(slog::o!());
BlockDeviceMgr::remove_device(device_mgr, ctx, &config.drive_id)
.unwrap();
error!(
logger,
"failed to hot-add virtio block device {}, {:?}",
&config.drive_id,
e
);
BlockDeviceError::DeviceManager(e)
})
}
_ => Err(BlockDeviceError::InvalidBlockDeviceType),
}
}
}
}
/// Attaches all block devices from the BlockDevicesConfig.
pub fn attach_devices(
&mut self,
ctx: &mut DeviceOpContext,
) -> std::result::Result<(), BlockDeviceError> {
for info in self.info_list.iter_mut() {
match info.config.device_type {
BlockDeviceType::RawBlock => {
info!(
ctx.logger(),
"attach virtio-blk device, drive_id {}, path {}",
info.config.drive_id,
info.config.path_on_host.to_str().unwrap_or("<unknown>")
);
let device = Self::create_blk_device(&info.config, ctx)
.map_err(BlockDeviceError::Virtio)?;
let device = DeviceManager::create_mmio_virtio_device(
device,
ctx,
info.config.use_shared_irq.unwrap_or(self.use_shared_irq),
info.config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(BlockDeviceError::RegisterBlockDevice)?;
info.device = Some(device);
}
_ => {
return Err(BlockDeviceError::OpenBlockDevice(
std::io::Error::from_raw_os_error(libc::EINVAL),
));
}
}
}
Ok(())
}
/// Removes all virtio-blk devices
pub fn remove_devices(&mut self, ctx: &mut DeviceOpContext) -> Result<(), DeviceMgrError> {
while let Some(mut info) = self.info_list.pop_back() {
info!(ctx.logger(), "remove drive {}", info.config.drive_id);
if let Some(device) = info.device.take() {
DeviceManager::destroy_mmio_virtio_device(device, ctx)?;
}
}
Ok(())
}
fn remove(&mut self, drive_id: &str) -> Option<BlockDeviceInfo> {
match self.get_index_of_drive_id(drive_id) {
Some(index) => self.info_list.remove(index),
None => None,
}
}
/// Remove a block device; this is essentially the inverse operation of `insert_device`.
pub fn remove_device(
dev_mgr: &mut DeviceManager,
mut ctx: DeviceOpContext,
drive_id: &str,
) -> std::result::Result<(), BlockDeviceError> {
if !cfg!(feature = "hotplug") {
return Err(BlockDeviceError::UpdateNotAllowedPostBoot);
}
let mgr = &mut dev_mgr.block_manager;
match mgr.remove(drive_id) {
Some(mut info) => {
info!(ctx.logger(), "remove drive {}", info.config.drive_id);
if let Some(device) = info.device.take() {
DeviceManager::destroy_mmio_virtio_device(device, &mut ctx)
.map_err(BlockDeviceError::DeviceManager)?;
}
}
None => return Err(BlockDeviceError::InvalidDeviceId(drive_id.to_owned())),
}
Ok(())
}
fn create_blk_device(
cfg: &BlockDeviceConfigInfo,
ctx: &mut DeviceOpContext,
) -> std::result::Result<Box<Block<GuestAddressSpaceImpl>>, virtio::Error> {
let epoll_mgr = ctx.epoll_mgr.clone().ok_or(virtio::Error::InvalidInput)?;
let mut block_files: Vec<Box<dyn Ufile>> = vec![];
match cfg.device_type {
BlockDeviceType::RawBlock => {
let custom_flags = if cfg.is_direct() {
info!(
ctx.logger(),
"Open block device \"{}\" in direct mode.",
cfg.path_on_host().display()
);
libc::O_DIRECT
} else {
info!(
ctx.logger(),
"Open block device \"{}\" in buffer mode.",
cfg.path_on_host().display(),
);
0
};
let io_uring_supported = IoUring::is_supported();
for i in 0..cfg.num_queues {
let queue_size = cfg.queue_sizes()[i] as u32;
let file = OpenOptions::new()
.read(true)
.custom_flags(custom_flags)
.write(!cfg.is_read_only())
.open(cfg.path_on_host())?;
info!(ctx.logger(), "Queue {}: block file opened", i);
if io_uring_supported {
info!(
ctx.logger(),
"Queue {}: Using io_uring Raw disk file, queue size {}.", i, queue_size
);
let io_engine = IoUring::new(file.as_raw_fd(), queue_size)?;
block_files.push(Box::new(LocalFile::new(file, cfg.no_drop, io_engine)?));
} else {
info!(
ctx.logger(),
"Queue {}: Since io_uring_supported is not enabled, change to default support of Aio Raw disk file, queue size {}", i, queue_size
);
let io_engine = Aio::new(file.as_raw_fd(), queue_size)?;
block_files.push(Box::new(LocalFile::new(file, cfg.no_drop, io_engine)?));
}
}
}
_ => {
error!(
ctx.logger(),
"invalid block device type: {:?}", cfg.device_type
);
return Err(virtio::Error::InvalidInput);
}
};
let mut limiters = vec![];
for _i in 0..cfg.num_queues {
if let Some(limiter) = cfg.rate_limiter.clone().map(|mut v| {
v.resize(cfg.num_queues as u64);
v.try_into().unwrap()
}) {
limiters.push(limiter);
}
}
Ok(Box::new(Block::new(
block_files,
cfg.is_read_only,
Arc::new(cfg.queue_sizes()),
epoll_mgr,
limiters,
)?))
}
/// Generate guest kernel command line arguments related to the root block device.
pub fn generate_kernel_boot_args(
&self,
kernel_config: &mut KernelConfigInfo,
) -> std::result::Result<(), DeviceMgrError> {
// Respect the user configuration if kernel_cmdline already contains "root=";
// pay special attention to the case where the kernel command line starts with "root=xxx".
let old_kernel_cmdline = format!(" {}", kernel_config.kernel_cmdline().as_str());
if !old_kernel_cmdline.contains(" root=") && self.has_root_block {
let cmdline = kernel_config.kernel_cmdline_mut();
if let Some(ref uuid) = self.part_uuid {
cmdline
.insert("root", &format!("PART_UUID={}", uuid))
.map_err(DeviceMgrError::Cmdline)?;
} else {
cmdline
.insert("root", "/dev/vda")
.map_err(DeviceMgrError::Cmdline)?;
}
if self.read_only_root {
if old_kernel_cmdline.contains(" rw") {
return Err(DeviceMgrError::InvalidOperation);
}
cmdline.insert_str("ro").map_err(DeviceMgrError::Cmdline)?;
}
}
Ok(())
}
/// Insert a block device's config and return its index on success.
fn create(
&mut self,
block_device_config: BlockDeviceConfigInfo,
) -> std::result::Result<usize, BlockDeviceError> {
self.check_data_file_present(&block_device_config)?;
if self
.get_index_of_drive_path(&block_device_config.path_on_host)
.is_some()
{
return Err(BlockDeviceError::BlockDevicePathAlreadyExists(
block_device_config.path_on_host,
));
}
// Check whether the device config belongs to a root device;
// we need to satisfy the condition that a VMM can only have one root device.
if block_device_config.is_root_device {
if self.has_root_block {
return Err(BlockDeviceError::RootBlockDeviceAlreadyAdded);
} else {
self.has_root_block = true;
self.read_only_root = block_device_config.is_read_only;
self.has_part_uuid_root = block_device_config.part_uuid.is_some();
self.part_uuid = block_device_config.part_uuid.clone();
// Root Device should be the first in the list whether or not PART_UUID is specified
// in order to avoid bugs in case of switching from part_uuid boot scenarios to
// /dev/vda boot type.
self.info_list
.push_front(BlockDeviceInfo::new(block_device_config));
Ok(0)
}
} else {
self.info_list
.push_back(BlockDeviceInfo::new(block_device_config));
Ok(self.info_list.len() - 1)
}
}
/// Updates a Block Device Config. The update fails if it would result in two
/// root block devices.
fn update(
&mut self,
mut index: usize,
new_config: BlockDeviceConfigInfo,
) -> std::result::Result<(), BlockDeviceError> {
// Check if the path exists
self.check_data_file_present(&new_config)?;
if let Some(idx) = self.get_index_of_drive_path(&new_config.path_on_host) {
if idx != index {
return Err(BlockDeviceError::BlockDevicePathAlreadyExists(
new_config.path_on_host.clone(),
));
}
}
if self.info_list.get(index).is_none() {
return Err(InvalidDeviceId(index.to_string()));
}
// Check if the root block device is being updated.
if self.info_list[index].config.is_root_device {
self.has_root_block = new_config.is_root_device;
self.read_only_root = new_config.is_root_device && new_config.is_read_only;
self.has_part_uuid_root = new_config.part_uuid.is_some();
self.part_uuid = new_config.part_uuid.clone();
} else if new_config.is_root_device {
// Check if a second root block device is being added.
if self.has_root_block {
return Err(BlockDeviceError::RootBlockDeviceAlreadyAdded);
} else {
// One of the non-root blocks is becoming root.
self.has_root_block = true;
self.read_only_root = new_config.is_read_only;
self.has_part_uuid_root = new_config.part_uuid.is_some();
self.part_uuid = new_config.part_uuid.clone();
// Make sure the root device is on the first position.
self.info_list.swap(0, index);
// Block config to be updated has moved to first position.
index = 0;
}
}
// Update the config.
self.info_list[index].config = new_config;
Ok(())
}
fn check_data_file_present(
&self,
block_device_config: &BlockDeviceConfigInfo,
) -> std::result::Result<(), BlockDeviceError> {
if block_device_config.device_type == BlockDeviceType::RawBlock
&& !block_device_config.path_on_host.exists()
{
Err(BlockDeviceError::InvalidBlockDevicePath(
block_device_config.path_on_host.clone(),
))
} else {
Ok(())
}
}
fn get_index_of_drive_path(&self, drive_path: &Path) -> Option<usize> {
self.info_list
.iter()
.position(|info| info.config.path_on_host.eq(drive_path))
}
/// Update device information in `info_list`. The caller of this method is
/// `insert_device` when hotplug is true.
pub fn update_device_by_index(
&mut self,
index: usize,
device: Arc<DbsMmioV2Device>,
) -> Result<(), BlockDeviceError> {
if let Some(info) = self.info_list.get_mut(index) {
info.device = Some(device);
return Ok(());
}
Err(BlockDeviceError::InvalidDeviceId("".to_owned()))
}
/// Update the ratelimiter settings of a virtio blk device.
pub fn update_device_ratelimiters(
device_mgr: &mut DeviceManager,
new_cfg: BlockDeviceConfigUpdateInfo,
) -> std::result::Result<(), BlockDeviceError> {
let mgr = &mut device_mgr.block_manager;
match mgr.get_index_of_drive_id(&new_cfg.drive_id) {
Some(index) => {
let config = &mut mgr.info_list[index].config;
config.rate_limiter = new_cfg.rate_limiter.clone();
let device = mgr.info_list[index]
.device
.as_mut()
.ok_or_else(|| BlockDeviceError::InvalidDeviceId("".to_owned()))?;
if let Some(mmio_dev) = device.as_any().downcast_ref::<DbsMmioV2Device>() {
let guard = mmio_dev.state();
let inner_dev = guard.get_inner_device();
if let Some(blk_dev) = inner_dev
.as_any()
.downcast_ref::<virtio::block::Block<GuestAddressSpaceImpl>>()
{
return blk_dev
.set_patch_rate_limiters(new_cfg.bytes(), new_cfg.ops())
.map(|_p| ())
.map_err(|_e| BlockDeviceError::BlockEpollHanderSendFail);
}
}
Ok(())
}
None => Err(BlockDeviceError::InvalidDeviceId(new_cfg.drive_id)),
}
}
}
impl Default for BlockDeviceMgr {
/// Constructor for the BlockDeviceMgr. It initializes an empty `VecDeque`.
fn default() -> BlockDeviceMgr {
BlockDeviceMgr {
info_list: VecDeque::<BlockDeviceInfo>::new(),
has_root_block: false,
has_part_uuid_root: false,
read_only_root: false,
part_uuid: None,
use_shared_irq: USE_SHARED_IRQ,
}
}
}
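// Illustrative sketch added for documentation purposes (not part of the original
// patch): a freshly constructed `BlockDeviceMgr` holds no devices, so it reports
// no root block device and cannot resolve any drive id. The drive id below is a
// hypothetical example value.
#[cfg(test)]
mod block_device_mgr_doc_example {
    use super::*;

    #[test]
    fn default_block_device_mgr_is_empty() {
        let mgr = BlockDeviceMgr::default();
        assert!(!mgr.has_root_block_device());
        assert!(mgr.get_index_of_drive_id("rootfs").is_none());
    }
}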

View File

@@ -0,0 +1,440 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
//! Virtual machine console device manager.
//!
//! A virtual console is composed of two parts: a frontend in the virtual machine and a backend in
//! the host OS. The frontend may be a serial port, virtio-console, etc.; the backend may be stdio or a
//! Unix domain socket. The manager connects the frontend with the backend.
use std::io::{self, Read};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::sync::{Arc, Mutex};
use bytes::{BufMut, BytesMut};
use dbs_legacy_devices::{ConsoleHandler, SerialDevice};
use dbs_utils::epoll_manager::{
EpollManager, EventOps, EventSet, Events, MutEventSubscriber, SubscriberId,
};
use vmm_sys_util::terminal::Terminal;
use super::{DeviceMgrError, Result};
const EPOLL_EVENT_SERIAL: u32 = 0;
const EPOLL_EVENT_SERIAL_DATA: u32 = 1;
const EPOLL_EVENT_STDIN: u32 = 2;
// Maximal backend throughput for every data transaction.
const MAX_BACKEND_THROUGHPUT: usize = 64;
/// Errors related to Console manager operations.
#[derive(Debug, thiserror::Error)]
pub enum ConsoleManagerError {
/// Cannot create unix domain socket for serial port
#[error("cannot create socket for serial console")]
CreateSerialSock(#[source] std::io::Error),
/// An operation on the epoll instance failed due to resource exhaustion or bad configuration.
#[error("failure while managing epoll event for console fd")]
EpollMgr(#[source] dbs_utils::epoll_manager::Error),
/// Cannot set mode for terminal.
#[error("failure while setting attribute for terminal")]
StdinHandle(#[source] vmm_sys_util::errno::Error),
}
enum Backend {
StdinHandle(std::io::Stdin),
SockPath(String),
}
/// Console manager to manage frontend and backend console devices.
pub struct ConsoleManager {
epoll_mgr: EpollManager,
logger: slog::Logger,
subscriber_id: Option<SubscriberId>,
backend: Option<Backend>,
}
impl ConsoleManager {
/// Create a console manager instance.
pub fn new(epoll_mgr: EpollManager, logger: &slog::Logger) -> Self {
let logger = logger.new(slog::o!("subsystem" => "console_manager"));
ConsoleManager {
epoll_mgr,
logger,
subscriber_id: Default::default(),
backend: None,
}
}
/// Create a console backend device by using stdio streams.
pub fn create_stdio_console(&mut self, device: Arc<Mutex<SerialDevice>>) -> Result<()> {
let stdin_handle = std::io::stdin();
stdin_handle
.lock()
.set_raw_mode()
.map_err(|e| DeviceMgrError::ConsoleManager(ConsoleManagerError::StdinHandle(e)))?;
let handler = ConsoleEpollHandler::new(device, Some(stdin_handle), None, &self.logger);
self.subscriber_id = Some(self.epoll_mgr.add_subscriber(Box::new(handler)));
self.backend = Some(Backend::StdinHandle(std::io::stdin()));
Ok(())
}
/// Create a console backend device by using a Unix domain socket.
pub fn create_socket_console(
&mut self,
device: Arc<Mutex<SerialDevice>>,
sock_path: String,
) -> Result<()> {
let sock_listener = Self::bind_domain_socket(&sock_path).map_err(|e| {
DeviceMgrError::ConsoleManager(ConsoleManagerError::CreateSerialSock(e))
})?;
let handler = ConsoleEpollHandler::new(device, None, Some(sock_listener), &self.logger);
self.subscriber_id = Some(self.epoll_mgr.add_subscriber(Box::new(handler)));
self.backend = Some(Backend::SockPath(sock_path));
Ok(())
}
/// Reset the host side terminal to canonical mode.
pub fn reset_console(&self) -> Result<()> {
if let Some(Backend::StdinHandle(stdin_handle)) = self.backend.as_ref() {
stdin_handle
.lock()
.set_canon_mode()
.map_err(|e| DeviceMgrError::ConsoleManager(ConsoleManagerError::StdinHandle(e)))?;
}
Ok(())
}
fn bind_domain_socket(serial_path: &str) -> std::result::Result<UnixListener, std::io::Error> {
let path = Path::new(serial_path);
if path.is_file() {
let _ = std::fs::remove_file(serial_path);
}
UnixListener::bind(path)
}
}
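// Illustrative sketch added for documentation purposes (not part of the original
// patch): it creates a `ConsoleManager` with no backend configured and shows that
// `reset_console()` is then a no-op. It assumes `EpollManager::default()` and
// `slog::Discard` are available, as they are used elsewhere in this crate.
#[cfg(test)]
mod console_manager_doc_example {
    use super::*;

    #[test]
    fn create_console_manager_without_backend() {
        let epoll_mgr = EpollManager::default();
        let logger = slog::Logger::root(slog::Discard, slog::o!());
        let mgr = ConsoleManager::new(epoll_mgr, &logger);
        // No backend has been configured yet, so there is no terminal to restore.
        assert!(mgr.reset_console().is_ok());
    }
}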
struct ConsoleEpollHandler {
device: Arc<Mutex<SerialDevice>>,
stdin_handle: Option<std::io::Stdin>,
sock_listener: Option<UnixListener>,
sock_conn: Option<UnixStream>,
logger: slog::Logger,
}
impl ConsoleEpollHandler {
fn new(
device: Arc<Mutex<SerialDevice>>,
stdin_handle: Option<std::io::Stdin>,
sock_listener: Option<UnixListener>,
logger: &slog::Logger,
) -> Self {
ConsoleEpollHandler {
device,
stdin_handle,
sock_listener,
sock_conn: None,
logger: logger.new(slog::o!("subsystem" => "console_manager")),
}
}
fn uds_listener_accept(&mut self, ops: &mut EventOps) -> std::io::Result<()> {
if self.sock_conn.is_some() {
slog::warn!(self.logger,
"UDS for serial port 1 already exists, reject the new connection";
"subsystem" => "console_mgr",
);
// Accept and drop the incoming connection to reject it.
let _ = self.sock_listener.as_mut().unwrap().accept();
} else {
// Safe to unwrap() because self.sock_listener is Some().
let (conn_sock, _) = self.sock_listener.as_ref().unwrap().accept()?;
let events = Events::with_data(&conn_sock, EPOLL_EVENT_SERIAL_DATA, EventSet::IN);
if let Err(e) = ops.add(events) {
slog::error!(self.logger,
"failed to register epoll event for serial, {:?}", e;
"subsystem" => "console_mgr",
);
return Err(std::io::Error::last_os_error());
}
let conn_sock_copy = conn_sock.try_clone()?;
// Do not expect a poisoned lock.
self.device
.lock()
.unwrap()
.set_output_stream(Some(Box::new(conn_sock_copy)));
self.sock_conn = Some(conn_sock);
}
Ok(())
}
fn uds_read_in(&mut self, ops: &mut EventOps) -> std::io::Result<()> {
let mut should_drop = true;
if let Some(conn_sock) = self.sock_conn.as_mut() {
let mut out = [0u8; MAX_BACKEND_THROUGHPUT];
match conn_sock.read(&mut out[..]) {
Ok(0) => {
// Zero-length read means EOF. Remove this conn sock.
self.device
.lock()
.expect("console: poisoned console lock")
.set_output_stream(None);
}
Ok(count) => {
self.device
.lock()
.expect("console: poisoned console lock")
.raw_input(&out[..count])?;
should_drop = false;
}
Err(e) => {
slog::warn!(self.logger,
"error while reading serial conn sock: {:?}", e;
"subsystem" => "console_mgr"
);
self.device
.lock()
.expect("console: poisoned console lock")
.set_output_stream(None);
}
}
}
if should_drop {
assert!(self.sock_conn.is_some());
// Safe to unwrap() because self.sock_conn is Some().
let sock_conn = self.sock_conn.take().unwrap();
let events = Events::with_data(&sock_conn, EPOLL_EVENT_SERIAL_DATA, EventSet::IN);
if let Err(e) = ops.remove(events) {
slog::error!(self.logger,
"failed deregister epoll event for UDS, {:?}", e;
"subsystem" => "console_mgr"
);
}
}
Ok(())
}
fn stdio_read_in(&mut self, ops: &mut EventOps) -> std::io::Result<()> {
let mut should_drop = true;
if let Some(handle) = self.stdin_handle.as_ref() {
let mut out = [0u8; MAX_BACKEND_THROUGHPUT];
// Safe to unwrap() because self.stdin_handle is Some().
let stdin_lock = handle.lock();
match stdin_lock.read_raw(&mut out[..]) {
Ok(0) => {
// Zero-length read indicates EOF. Remove from pollables.
self.device
.lock()
.expect("console: poisoned console lock")
.set_output_stream(None);
}
Ok(count) => {
self.device
.lock()
.expect("console: poisoned console lock")
.raw_input(&out[..count])?;
should_drop = false;
}
Err(e) => {
slog::warn!(self.logger,
"error while reading stdin: {:?}", e;
"subsystem" => "console_mgr"
);
self.device
.lock()
.expect("console: poisoned console lock")
.set_output_stream(None);
}
}
}
if should_drop {
let events = Events::with_data_raw(libc::STDIN_FILENO, EPOLL_EVENT_STDIN, EventSet::IN);
if let Err(e) = ops.remove(events) {
slog::error!(self.logger,
"failed to deregister epoll event for stdin, {:?}", e;
"subsystem" => "console_mgr"
);
}
}
Ok(())
}
}
impl MutEventSubscriber for ConsoleEpollHandler {
fn process(&mut self, events: Events, ops: &mut EventOps) {
slog::trace!(self.logger, "ConsoleEpollHandler::process()");
let slot = events.data();
match slot {
EPOLL_EVENT_SERIAL => {
if let Err(e) = self.uds_listener_accept(ops) {
slog::warn!(self.logger, "failed to accept incoming connection, {:?}", e);
}
}
EPOLL_EVENT_SERIAL_DATA => {
if let Err(e) = self.uds_read_in(ops) {
slog::warn!(self.logger, "failed to read data from UDS, {:?}", e);
}
}
EPOLL_EVENT_STDIN => {
if let Err(e) = self.stdio_read_in(ops) {
slog::warn!(self.logger, "failed to read data from stdin, {:?}", e);
}
}
_ => slog::error!(self.logger, "unknown epoll slot number {}", slot),
}
}
fn init(&mut self, ops: &mut EventOps) {
slog::trace!(self.logger, "ConsoleEpollHandler::init()");
if self.stdin_handle.is_some() {
slog::info!(self.logger, "ConsoleEpollHandler: stdin handler");
let events = Events::with_data_raw(libc::STDIN_FILENO, EPOLL_EVENT_STDIN, EventSet::IN);
if let Err(e) = ops.add(events) {
slog::error!(
self.logger,
"failed to register epoll event for stdin, {:?}",
e
);
}
}
if let Some(sock) = self.sock_listener.as_ref() {
slog::info!(self.logger, "ConsoleEpollHandler: sock listener");
let events = Events::with_data(sock, EPOLL_EVENT_SERIAL, EventSet::IN);
if let Err(e) = ops.add(events) {
slog::error!(
self.logger,
"failed to register epoll event for UDS listener, {:?}",
e
);
}
}
if let Some(conn) = self.sock_conn.as_ref() {
slog::info!(self.logger, "ConsoleEpollHandler: sock connection");
let events = Events::with_data(conn, EPOLL_EVENT_SERIAL_DATA, EventSet::IN);
if let Err(e) = ops.add(events) {
slog::error!(
self.logger,
"failed to register epoll event for UDS connection, {:?}",
e
);
}
}
}
}
/// Writer to process guest kernel dmesg.
pub struct DmesgWriter {
buf: BytesMut,
logger: slog::Logger,
}
impl DmesgWriter {
/// Creates a new instance.
pub fn new(logger: &slog::Logger) -> Self {
Self {
buf: BytesMut::with_capacity(1024),
logger: logger.new(slog::o!("subsystem" => "dmesg")),
}
}
}
impl io::Write for DmesgWriter {
/// 0000000 [ 0 . 0 3 4 9 1 6 ] R
/// 5b 20 20 20 20 30 2e 30 33 34 39 31 36 5d 20 52
/// 0000020 u n / s b i n / i n i t a s
/// 75 6e 20 2f 73 62 69 6e 2f 69 6e 69 74 20 61 73
/// 0000040 i n i t p r o c e s s \r \n [
///
/// dmesg messages end a line with \r\n. When redirecting messages to the logger, we should
/// remove the \r\n.
fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
let arr: Vec<&[u8]> = buf.split(|c| *c == b'\n').collect();
let count = arr.len();
for (i, sub) in arr.iter().enumerate() {
if sub.is_empty() {
if !self.buf.is_empty() {
slog::info!(
self.logger,
"{}",
String::from_utf8_lossy(self.buf.as_ref()).trim_end()
);
self.buf.clear();
}
} else if sub.len() < buf.len() && i < count - 1 {
slog::info!(
self.logger,
"{}{}",
String::from_utf8_lossy(self.buf.as_ref()).trim_end(),
String::from_utf8_lossy(sub).trim_end(),
);
self.buf.clear();
} else {
self.buf.put_slice(sub);
}
}
Ok(buf.len())
}
fn flush(&mut self) -> io::Result<()> {
Ok(())
}
}
#[cfg(test)]
mod tests {
use super::*;
use slog::Drain;
use std::io::Write;
fn create_logger() -> slog::Logger {
let decorator = slog_term::TermDecorator::new().build();
let drain = slog_term::FullFormat::new(decorator).build().fuse();
let drain = slog_async::Async::new(drain).build().fuse();
slog::Logger::root(drain, slog::o!())
}
#[test]
fn test_dmesg_writer() {
let mut writer = DmesgWriter {
buf: Default::default(),
logger: create_logger(),
};
writer.flush().unwrap();
writer.write_all("".as_bytes()).unwrap();
writer.write_all("\n".as_bytes()).unwrap();
writer.write_all("\n\n".as_bytes()).unwrap();
writer.write_all("\n\n\n".as_bytes()).unwrap();
writer.write_all("12\n23\n34\n56".as_bytes()).unwrap();
writer.write_all("78".as_bytes()).unwrap();
writer.write_all("90\n".as_bytes()).unwrap();
writer.flush().unwrap();
}
// TODO: add unit tests for console manager
}

View File

@@ -0,0 +1,528 @@
// Copyright 2020-2022 Alibaba Cloud. All Rights Reserved.
// Copyright 2019 Intel Corporation. All Rights Reserved.
//
// SPDX-License-Identifier: Apache-2.0
use std::convert::TryInto;
use dbs_utils::epoll_manager::EpollManager;
use dbs_virtio_devices::{self as virtio, Error as VirtIoError};
use serde_derive::{Deserialize, Serialize};
use slog::{error, info};
use crate::address_space_manager::GuestAddressSpaceImpl;
use crate::config_manager::{
ConfigItem, DeviceConfigInfo, DeviceConfigInfos, RateLimiterConfigInfo,
};
use crate::device_manager::{
DbsMmioV2Device, DeviceManager, DeviceMgrError, DeviceOpContext, DeviceVirtioRegionHandler,
};
use crate::get_bucket_update;
use super::DbsVirtioDevice;
// The flag of whether to use the shared irq.
const USE_SHARED_IRQ: bool = true;
// The flag of whether to use the generic irq.
const USE_GENERIC_IRQ: bool = true;
// Default cache size is 2 GiB since this is a typical VM memory size.
const DEFAULT_CACHE_SIZE: u64 = 2 * 1024 * 1024 * 1024;
// We have two supported fs device modes: vhostuser and virtio.
const VHOSTUSER_FS_MODE: &str = "vhostuser";
const VIRTIO_FS_MODE: &str = "virtio";
/// Errors associated with `FsDeviceConfig`.
#[derive(Debug, thiserror::Error)]
pub enum FsDeviceError {
/// Invalid fs, "virtio" or "vhostuser" is allowed.
#[error("the fs type is invalid, virtio or vhostuser is allowed")]
InvalidFs,
/// Cannot access address space.
#[error("Cannot access address space.")]
AddressSpaceNotInitialized,
/// Cannot convert RateLimiterConfigInfo into RateLimiter.
#[error("failure while converting RateLimiterConfigInfo into RateLimiter: {0}")]
RateLimterConfigInfoTryInto(#[source] std::io::Error),
/// The fs device tag was already used for a different fs.
#[error("VirtioFs device tag {0} already exists")]
FsDeviceTagAlreadyExists(String),
/// The fs device path was already used for a different fs.
#[error("VirtioFs device tag {0} already exists")]
FsDevicePathAlreadyExists(String),
/// The update is not allowed after booting the microvm.
#[error("update operation is not allowed after boot")]
UpdateNotAllowedPostBoot,
/// The attach_backend_fs operation failed.
#[error("fs device failed to attach a backend fs: {0}")]
AttachBackendFailed(String),
/// Attaching a backend fs must be done while the VM is running.
#[error("vm is not running when attaching a backend fs")]
MicroVMNotRunning,
/// The mount tag doesn't exist.
#[error("fs tag'{0}' doesn't exist")]
TagNotExists(String),
/// Failed to send patch message to VirtioFs epoll handler.
#[error("could not send patch message to the VirtioFs epoll handler")]
VirtioFsEpollHanderSendFail,
/// Creating a shared-fs device failed (e.g. the vhost-user socket cannot be opened).
#[error("cannot create shared-fs device: {0}")]
CreateFsDevice(#[source] VirtIoError),
/// Cannot initialize a shared-fs device or add a device to the MMIO Bus.
#[error("failure while registering shared-fs device: {0}")]
RegisterFsDevice(#[source] DeviceMgrError),
/// The device manager errors.
#[error("DeviceManager error: {0}")]
DeviceManager(#[source] DeviceMgrError),
}
/// Configuration information for a vhost-user-fs device.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct FsDeviceConfigInfo {
/// vhost-user socket path.
pub sock_path: String,
/// virtiofs mount tag name used inside the guest.
/// used as the device name during mount.
pub tag: String,
/// Number of virtqueues to use.
pub num_queues: usize,
/// Size of each virtqueue (number of descriptors).
pub queue_size: u16,
/// DAX cache window size
pub cache_size: u64,
/// Number of thread pool workers.
pub thread_pool_size: u16,
/// The caching policy the file system should use (auto, always or never).
/// This cache policy is set for virtio-fs; visit https://gitlab.com/virtio-fs/virtiofsd for further information.
pub cache_policy: String,
/// Writeback cache
pub writeback_cache: bool,
/// Enable no_open or not
pub no_open: bool,
/// Enable xattr or not
pub xattr: bool,
/// Drop CAP_SYS_RESOURCE or not
pub drop_sys_resource: bool,
/// virtio fs or vhostuser fs.
pub mode: String,
/// Enable kill_priv_v2 or not
pub fuse_killpriv_v2: bool,
/// Enable no_readdir or not
pub no_readdir: bool,
/// Rate Limiter for I/O operations.
pub rate_limiter: Option<RateLimiterConfigInfo>,
/// Use shared irq
pub use_shared_irq: Option<bool>,
/// Use generic irq
pub use_generic_irq: Option<bool>,
}
impl std::default::Default for FsDeviceConfigInfo {
fn default() -> Self {
Self {
sock_path: String::default(),
tag: String::default(),
num_queues: 1,
queue_size: 1024,
cache_size: DEFAULT_CACHE_SIZE,
thread_pool_size: 0,
cache_policy: Self::default_cache_policy(),
writeback_cache: Self::default_writeback_cache(),
no_open: Self::default_no_open(),
fuse_killpriv_v2: Self::default_fuse_killpriv_v2(),
no_readdir: Self::default_no_readdir(),
xattr: Self::default_xattr(),
drop_sys_resource: Self::default_drop_sys_resource(),
mode: Self::default_fs_mode(),
rate_limiter: Some(RateLimiterConfigInfo::default()),
use_shared_irq: None,
use_generic_irq: None,
}
}
}
impl FsDeviceConfigInfo {
/// The default mode is set to 'virtio' for 'virtio-fs' device.
pub fn default_fs_mode() -> String {
String::from(VIRTIO_FS_MODE)
}
/// The default cache policy
pub fn default_cache_policy() -> String {
"always".to_string()
}
/// The default setting of writeback cache
pub fn default_writeback_cache() -> bool {
true
}
/// The default setting of no_open
pub fn default_no_open() -> bool {
true
}
/// The default setting of killpriv_v2
pub fn default_fuse_killpriv_v2() -> bool {
false
}
/// The default setting of xattr
pub fn default_xattr() -> bool {
false
}
/// The default setting of drop_sys_resource
pub fn default_drop_sys_resource() -> bool {
false
}
/// The default setting of no_readdir
pub fn default_no_readdir() -> bool {
false
}
/// The default setting of rate limiter
pub fn default_fs_rate_limiter() -> Option<RateLimiterConfigInfo> {
None
}
}
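// Illustrative sketch added for documentation purposes (not part of the original
// patch): it builds a virtio-fs configuration from the defaults above, overriding
// only the mount tag and the DAX cache window size. The tag and window size are
// hypothetical example values.
#[cfg(test)]
mod fs_device_config_doc_example {
    use super::*;

    #[test]
    fn build_virtio_fs_config_from_defaults() {
        let cfg = FsDeviceConfigInfo {
            tag: "kataShared".to_string(),
            cache_size: 1024 * 1024 * 1024, // 1 GiB DAX window
            ..Default::default()
        };
        // Unspecified fields keep their documented defaults.
        assert_eq!(cfg.mode, FsDeviceConfigInfo::default_fs_mode());
        assert_eq!(cfg.cache_policy, FsDeviceConfigInfo::default_cache_policy());
        assert_eq!(cfg.num_queues, 1);
        assert_eq!(cfg.queue_size, 1024);
    }
}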
/// Configuration information for a virtio-fs device update.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct FsDeviceConfigUpdateInfo {
/// virtiofs mount tag name used inside the guest.
/// used as the device name during mount.
pub tag: String,
/// Rate Limiter for I/O operations.
pub rate_limiter: Option<RateLimiterConfigInfo>,
}
impl FsDeviceConfigUpdateInfo {
/// Provides a `BucketUpdate` description for the bandwidth rate limiter.
pub fn bytes(&self) -> dbs_utils::rate_limiter::BucketUpdate {
get_bucket_update!(self, rate_limiter, bandwidth)
}
/// Provides a `BucketUpdate` description for the ops rate limiter.
pub fn ops(&self) -> dbs_utils::rate_limiter::BucketUpdate {
get_bucket_update!(self, rate_limiter, ops)
}
}
impl ConfigItem for FsDeviceConfigInfo {
type Err = FsDeviceError;
fn id(&self) -> &str {
&self.tag
}
fn check_conflicts(&self, other: &Self) -> Result<(), FsDeviceError> {
if self.tag == other.tag {
Err(FsDeviceError::FsDeviceTagAlreadyExists(self.tag.clone()))
} else if self.mode.as_str() == VHOSTUSER_FS_MODE && self.sock_path == other.sock_path {
Err(FsDeviceError::FsDevicePathAlreadyExists(
self.sock_path.clone(),
))
} else {
Ok(())
}
}
}
/// Configuration information for manipulating the backend fs of a virtio-fs device.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct FsMountConfigInfo {
/// Mount operation: one of "mount", "update" or "umount".
pub ops: String,
/// The backend fs type to mount.
pub fstype: Option<String>,
/// The source file/directory the backend fs points to.
pub source: Option<String>,
/// Where the backend fs gets mounted.
pub mountpoint: String,
/// Backend fs config content in JSON format.
pub config: Option<String>,
/// virtiofs mount tag name used inside the guest.
/// used as the device name during mount.
pub tag: String,
/// Path to a file that contains the list of files to be prefetched by rafs.
pub prefetch_list_path: Option<String>,
/// File size threshold (in KiB) for DAX support.
pub dax_threshold_size_kb: Option<u64>,
}
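// Illustrative sketch added for documentation purposes (not part of the original
// patch): it builds an `FsMountConfigInfo` that asks the virtio-fs device tagged
// "kataShared" to mount a backend filesystem at "/rootfs". The tag, fstype and
// paths are hypothetical example values; `manipulate_backend_fs()` further down
// in this file consumes this structure.
#[cfg(test)]
mod fs_mount_config_doc_example {
    use super::*;

    #[test]
    fn build_fs_mount_config() {
        let mount = FsMountConfigInfo {
            ops: "mount".to_string(),
            fstype: Some("passthrough_fs".to_string()),
            source: Some("/run/kata-containers/shared".to_string()),
            mountpoint: "/rootfs".to_string(),
            config: None,
            tag: "kataShared".to_string(),
            prefetch_list_path: None,
            dax_threshold_size_kb: None,
        };
        assert_eq!(mount.ops, "mount");
        assert!(mount.config.is_none());
    }
}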
pub(crate) type FsDeviceInfo = DeviceConfigInfo<FsDeviceConfigInfo>;
impl ConfigItem for FsDeviceInfo {
type Err = FsDeviceError;
fn id(&self) -> &str {
&self.config.tag
}
fn check_conflicts(&self, other: &Self) -> Result<(), FsDeviceError> {
if self.config.tag == other.config.tag {
Err(FsDeviceError::FsDeviceTagAlreadyExists(
self.config.tag.clone(),
))
} else if self.config.sock_path == other.config.sock_path {
Err(FsDeviceError::FsDevicePathAlreadyExists(
self.config.sock_path.clone(),
))
} else {
Ok(())
}
}
}
/// Wrapper for the collection that holds all the Fs Devices Configs
pub struct FsDeviceMgr {
/// A list of `FsDeviceConfig` objects.
pub(crate) info_list: DeviceConfigInfos<FsDeviceConfigInfo>,
pub(crate) use_shared_irq: bool,
}
impl FsDeviceMgr {
/// Inserts `fs_cfg` in the shared-fs device configuration list.
pub fn insert_device(
device_mgr: &mut DeviceManager,
ctx: DeviceOpContext,
fs_cfg: FsDeviceConfigInfo,
) -> std::result::Result<(), FsDeviceError> {
// It's too complicated to manage the life cycle of the shared-fs service process for hotplug.
if ctx.is_hotplug {
error!(
ctx.logger(),
"no support of shared-fs device hotplug";
"subsystem" => "shared-fs",
"tag" => &fs_cfg.tag,
);
return Err(FsDeviceError::UpdateNotAllowedPostBoot);
}
info!(
ctx.logger(),
"add shared-fs device configuration";
"subsystem" => "shared-fs",
"tag" => &fs_cfg.tag,
);
device_mgr
.fs_manager
.lock()
.unwrap()
.info_list
.insert_or_update(&fs_cfg)?;
Ok(())
}
/// Attaches all shared-fs devices from the configuration list.
pub fn attach_devices(
&mut self,
ctx: &mut DeviceOpContext,
) -> std::result::Result<(), FsDeviceError> {
let epoll_mgr = ctx
.epoll_mgr
.clone()
.ok_or(FsDeviceError::CreateFsDevice(virtio::Error::InvalidInput))?;
for info in self.info_list.iter_mut() {
let device = Self::create_fs_device(&info.config, ctx, epoll_mgr.clone())?;
let mmio_device = DeviceManager::create_mmio_virtio_device(
device,
ctx,
info.config.use_shared_irq.unwrap_or(self.use_shared_irq),
info.config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(FsDeviceError::RegisterFsDevice)?;
info.set_device(mmio_device);
}
Ok(())
}
fn create_fs_device(
config: &FsDeviceConfigInfo,
ctx: &mut DeviceOpContext,
epoll_mgr: EpollManager,
) -> std::result::Result<DbsVirtioDevice, FsDeviceError> {
match &config.mode as &str {
VIRTIO_FS_MODE => Self::attach_virtio_fs_devices(config, ctx, epoll_mgr),
_ => Err(FsDeviceError::CreateFsDevice(virtio::Error::InvalidInput)),
}
}
fn attach_virtio_fs_devices(
config: &FsDeviceConfigInfo,
ctx: &mut DeviceOpContext,
epoll_mgr: EpollManager,
) -> std::result::Result<DbsVirtioDevice, FsDeviceError> {
info!(
ctx.logger(),
"add virtio-fs device configuration";
"subsystem" => "virito-fs",
"tag" => &config.tag,
"dax_window_size" => &config.cache_size,
);
let limiter = if let Some(rlc) = config.rate_limiter.clone() {
Some(
rlc.try_into()
.map_err(FsDeviceError::RateLimterConfigInfoTryInto)?,
)
} else {
None
};
let vm_as = ctx.get_vm_as().map_err(|e| {
error!(ctx.logger(), "virtio-fs get vm_as error: {:?}", e;
"subsystem" => "virito-fs");
FsDeviceError::DeviceManager(e)
})?;
let address_space = match ctx.address_space.as_ref() {
Some(address_space) => address_space.clone(),
None => {
error!(ctx.logger(), "virtio-fs get address_space error"; "subsystem" => "virito-fs");
return Err(FsDeviceError::AddressSpaceNotInitialized);
}
};
let handler = DeviceVirtioRegionHandler {
vm_as,
address_space,
};
let device = Box::new(
virtio::fs::VirtioFs::new(
&config.tag,
config.num_queues,
config.queue_size,
config.cache_size,
&config.cache_policy,
config.thread_pool_size,
config.writeback_cache,
config.no_open,
config.fuse_killpriv_v2,
config.xattr,
config.drop_sys_resource,
config.no_readdir,
Box::new(handler),
epoll_mgr,
limiter,
)
.map_err(FsDeviceError::CreateFsDevice)?,
);
Ok(device)
}
/// Attach a backend fs to a VirtioFs device, or detach a backend
/// fs from a VirtioFs device.
pub fn manipulate_backend_fs(
device_mgr: &mut DeviceManager,
config: FsMountConfigInfo,
) -> std::result::Result<(), FsDeviceError> {
let mut found = false;
let mgr = &mut device_mgr.fs_manager.lock().unwrap();
for info in mgr
.info_list
.iter()
.filter(|info| info.config.tag.as_str() == config.tag.as_str())
{
found = true;
if let Some(device) = info.device.as_ref() {
if let Some(mmio_dev) = device.as_any().downcast_ref::<DbsMmioV2Device>() {
let mut guard = mmio_dev.state();
let inner_dev = guard.get_inner_device_mut();
if let Some(virtio_fs_dev) = inner_dev
.as_any_mut()
.downcast_mut::<virtio::fs::VirtioFs<GuestAddressSpaceImpl>>()
{
return virtio_fs_dev
.manipulate_backend_fs(
config.source,
config.fstype,
&config.mountpoint,
config.config,
&config.ops,
config.prefetch_list_path,
config.dax_threshold_size_kb,
)
.map(|_p| ())
.map_err(|e| FsDeviceError::AttachBackendFailed(e.to_string()));
}
}
}
}
if !found {
Err(FsDeviceError::AttachBackendFailed(
"fs tag not found".to_string(),
))
} else {
Ok(())
}
}
/// Gets the index of the device with the specified `tag` if it exists in the list.
pub fn get_index_of_tag(&self, tag: &str) -> Option<usize> {
self.info_list
.iter()
.position(|info| info.config.id().eq(tag))
}
/// Update the ratelimiter settings of a virtio fs device.
pub fn update_device_ratelimiters(
device_mgr: &mut DeviceManager,
new_cfg: FsDeviceConfigUpdateInfo,
) -> std::result::Result<(), FsDeviceError> {
let mgr = &mut device_mgr.fs_manager.lock().unwrap();
match mgr.get_index_of_tag(&new_cfg.tag) {
Some(index) => {
let config = &mut mgr.info_list[index].config;
config.rate_limiter = new_cfg.rate_limiter.clone();
let device = mgr.info_list[index]
.device
.as_mut()
.ok_or_else(|| FsDeviceError::TagNotExists(new_cfg.tag.clone()))?;
if let Some(mmio_dev) = device.as_any().downcast_ref::<DbsMmioV2Device>() {
let guard = mmio_dev.state();
let inner_dev = guard.get_inner_device();
if let Some(fs_dev) = inner_dev
.as_any()
.downcast_ref::<virtio::fs::VirtioFs<GuestAddressSpaceImpl>>()
{
return fs_dev
.set_patch_rate_limiters(new_cfg.bytes(), new_cfg.ops())
.map(|_p| ())
.map_err(|_e| FsDeviceError::VirtioFsEpollHanderSendFail);
}
}
Ok(())
}
None => Err(FsDeviceError::TagNotExists(new_cfg.tag)),
}
}
}
impl Default for FsDeviceMgr {
/// Create a new `FsDeviceMgr` object.
fn default() -> Self {
FsDeviceMgr {
info_list: DeviceConfigInfos::new(),
use_shared_irq: USE_SHARED_IRQ,
}
}
}
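// Editor's note: an illustrative usage sketch, not part of this patch. It only
// shows the intended call order; building the FsDeviceConfigInfo itself (tag,
// sock_path, cache settings, ...) is left to the caller. Shared-fs devices must
// be registered before boot, since insert_device() rejects hotplug contexts, and
// the queued configs are turned into MMIO virtio-fs devices later by
// attach_devices(). Backend mounts (e.g. a RAFS snapshot) are then driven at
// runtime through manipulate_backend_fs().
fn register_shared_fs(
    device_mgr: &mut DeviceManager,
    boot_ctx: DeviceOpContext,
    fs_cfg: FsDeviceConfigInfo,
) -> Result<(), FsDeviceError> {
    // boot_ctx.is_hotplug must be false here, otherwise UpdateNotAllowedPostBoot.
    FsDeviceMgr::insert_device(device_mgr, boot_ctx, fs_cfg)
}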

View File

@@ -0,0 +1,246 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
//! Device Manager for Legacy Devices.
use std::io;
use std::sync::{Arc, Mutex};
use dbs_device::device_manager::Error as IoManagerError;
#[cfg(target_arch = "aarch64")]
use dbs_legacy_devices::RTCDevice;
use dbs_legacy_devices::SerialDevice;
use vmm_sys_util::eventfd::EventFd;
// The I8042 Data Port (IO Port 0x60) is used for reading data that was received from an I8042 device or from the I8042 controller itself, and for writing data to an I8042 device or to the I8042 controller itself.
const I8042_DATA_PORT: u16 = 0x60;
/// Errors generated by legacy device manager.
#[derive(Debug, thiserror::Error)]
pub enum Error {
/// Cannot add legacy device to Bus.
#[error("bus failure while managing legacy device")]
BusError(#[source] IoManagerError),
/// Cannot create EventFd.
#[error("failure while reading EventFd file descriptor")]
EventFd(#[source] io::Error),
/// Failed to register/deregister interrupt.
#[error("failure while managing interrupt for legacy device")]
IrqManager(#[source] vmm_sys_util::errno::Error),
}
/// The `LegacyDeviceManager` is a wrapper that is used for registering legacy devices
/// on an I/O Bus.
///
/// It currently manages the uart and i8042 devices. The `LegacyDeviceManager` should be initialized
/// only by using the constructor.
pub struct LegacyDeviceManager {
#[cfg(target_arch = "x86_64")]
i8042_reset_eventfd: EventFd,
#[cfg(target_arch = "aarch64")]
pub(crate) _rtc_device: Arc<Mutex<RTCDevice>>,
#[cfg(target_arch = "aarch64")]
_rtc_eventfd: EventFd,
pub(crate) com1_device: Arc<Mutex<SerialDevice>>,
_com1_eventfd: EventFd,
pub(crate) com2_device: Arc<Mutex<SerialDevice>>,
_com2_eventfd: EventFd,
}
impl LegacyDeviceManager {
/// Get the serial device for com1.
pub fn get_com1_serial(&self) -> Arc<Mutex<SerialDevice>> {
self.com1_device.clone()
}
/// Get the serial device for com2
pub fn get_com2_serial(&self) -> Arc<Mutex<SerialDevice>> {
self.com2_device.clone()
}
}
#[cfg(target_arch = "x86_64")]
pub(crate) mod x86_64 {
use super::*;
use dbs_device::device_manager::IoManager;
use dbs_device::resources::Resource;
use dbs_legacy_devices::{EventFdTrigger, I8042Device, I8042DeviceMetrics};
use kvm_ioctls::VmFd;
pub(crate) const COM1_IRQ: u32 = 4;
pub(crate) const COM1_PORT1: u16 = 0x3f8;
pub(crate) const COM2_IRQ: u32 = 3;
pub(crate) const COM2_PORT1: u16 = 0x2f8;
type Result<T> = ::std::result::Result<T, Error>;
impl LegacyDeviceManager {
/// Create a LegacyDeviceManager instance handling legacy devices (uart, i8042).
pub fn create_manager(bus: &mut IoManager, vm_fd: Option<Arc<VmFd>>) -> Result<Self> {
let (com1_device, com1_eventfd) =
Self::create_com_device(bus, vm_fd.as_ref(), COM1_IRQ, COM1_PORT1)?;
let (com2_device, com2_eventfd) =
Self::create_com_device(bus, vm_fd.as_ref(), COM2_IRQ, COM2_PORT1)?;
let exit_evt = EventFd::new(libc::EFD_NONBLOCK).map_err(Error::EventFd)?;
let i8042_device = Arc::new(Mutex::new(I8042Device::new(
EventFdTrigger::new(exit_evt.try_clone().map_err(Error::EventFd)?),
Arc::new(I8042DeviceMetrics::default()),
)));
let resources = [Resource::PioAddressRange {
// 0x60 and 0x64 are the I/O ports used by i8042 devices.
// We register the PIO address range 0x60-0x64, with base I8042_DATA_PORT, for the i8042 to use.
base: I8042_DATA_PORT,
size: 0x5,
}];
bus.register_device_io(i8042_device, &resources)
.map_err(Error::BusError)?;
Ok(LegacyDeviceManager {
i8042_reset_eventfd: exit_evt,
com1_device,
_com1_eventfd: com1_eventfd,
com2_device,
_com2_eventfd: com2_eventfd,
})
}
/// Get the eventfd for exit notification.
pub fn get_reset_eventfd(&self) -> Result<EventFd> {
self.i8042_reset_eventfd.try_clone().map_err(Error::EventFd)
}
fn create_com_device(
bus: &mut IoManager,
vm_fd: Option<&Arc<VmFd>>,
irq: u32,
port_base: u16,
) -> Result<(Arc<Mutex<SerialDevice>>, EventFd)> {
let eventfd = EventFd::new(libc::EFD_NONBLOCK).map_err(Error::EventFd)?;
let device = Arc::new(Mutex::new(SerialDevice::new(
eventfd.try_clone().map_err(Error::EventFd)?,
)));
// port_base defines the base port address for the COM devices.
// Every COM device has 8 data registers, so we register a PIO address range of size 0x8.
let resources = [Resource::PioAddressRange {
base: port_base,
size: 0x8,
}];
bus.register_device_io(device.clone(), &resources)
.map_err(Error::BusError)?;
if let Some(fd) = vm_fd {
fd.register_irqfd(&eventfd, irq)
.map_err(Error::IrqManager)?;
}
Ok((device, eventfd))
}
}
}
#[cfg(target_arch = "aarch64")]
pub(crate) mod aarch64 {
use super::*;
use dbs_device::device_manager::IoManager;
use dbs_device::resources::DeviceResources;
use kvm_ioctls::VmFd;
use std::collections::HashMap;
type Result<T> = ::std::result::Result<T, Error>;
/// LegacyDeviceType: com1
pub const COM1: &str = "com1";
/// LegacyDeviceType: com2
pub const COM2: &str = "com2";
/// LegacyDeviceType: rtc
pub const RTC: &str = "rtc";
impl LegacyDeviceManager {
/// Create a LegacyDeviceManager instance handling legacy devices.
pub fn create_manager(
bus: &mut IoManager,
vm_fd: Option<Arc<VmFd>>,
resources: &HashMap<String, DeviceResources>,
) -> Result<Self> {
let (com1_device, com1_eventfd) =
Self::create_com_device(bus, vm_fd.as_ref(), resources.get(COM1).unwrap())?;
let (com2_device, com2_eventfd) =
Self::create_com_device(bus, vm_fd.as_ref(), resources.get(COM2).unwrap())?;
let (rtc_device, rtc_eventfd) =
Self::create_rtc_device(bus, vm_fd.as_ref(), resources.get(RTC).unwrap())?;
Ok(LegacyDeviceManager {
_rtc_device: rtc_device,
_rtc_eventfd: rtc_eventfd,
com1_device,
_com1_eventfd: com1_eventfd,
com2_device,
_com2_eventfd: com2_eventfd,
})
}
fn create_com_device(
bus: &mut IoManager,
vm_fd: Option<&Arc<VmFd>>,
resources: &DeviceResources,
) -> Result<(Arc<Mutex<SerialDevice>>, EventFd)> {
let eventfd = EventFd::new(libc::EFD_NONBLOCK).map_err(Error::EventFd)?;
let device = Arc::new(Mutex::new(SerialDevice::new(
eventfd.try_clone().map_err(Error::EventFd)?,
)));
bus.register_device_io(device.clone(), resources.get_all_resources())
.map_err(Error::BusError)?;
if let Some(fd) = vm_fd {
let irq = resources.get_legacy_irq().unwrap();
fd.register_irqfd(&eventfd, irq)
.map_err(Error::IrqManager)?;
}
Ok((device, eventfd))
}
fn create_rtc_device(
bus: &mut IoManager,
vm_fd: Option<&Arc<VmFd>>,
resources: &DeviceResources,
) -> Result<(Arc<Mutex<RTCDevice>>, EventFd)> {
let eventfd = EventFd::new(libc::EFD_NONBLOCK).map_err(Error::EventFd)?;
let device = Arc::new(Mutex::new(RTCDevice::new()));
bus.register_device_io(device.clone(), resources.get_all_resources())
.map_err(Error::BusError)?;
if let Some(fd) = vm_fd {
let irq = resources.get_legacy_irq().unwrap();
fd.register_irqfd(&eventfd, irq)
.map_err(Error::IrqManager)?;
}
Ok((device, eventfd))
}
}
}
#[cfg(test)]
mod tests {
#[cfg(target_arch = "x86_64")]
use super::*;
#[test]
#[cfg(target_arch = "x86_64")]
fn test_create_legacy_device_manager() {
let mut bus = dbs_device::device_manager::IoManager::new();
let mgr = LegacyDeviceManager::create_manager(&mut bus, None).unwrap();
let _exit_fd = mgr.get_reset_eventfd().unwrap();
}
}

View File

@@ -0,0 +1,110 @@
// Copyright 2022 Alibaba, Inc. or its affiliates. All Rights Reserved.
//
// SPDX-License-Identifier: Apache-2.0
use std::io;
use std::sync::Arc;
use dbs_address_space::{AddressSpace, AddressSpaceRegion, AddressSpaceRegionType};
use dbs_virtio_devices::{Error as VirtIoError, VirtioRegionHandler};
use log::{debug, error};
use vm_memory::{FileOffset, GuestAddressSpace, GuestMemoryRegion, GuestRegionMmap};
use crate::address_space_manager::GuestAddressSpaceImpl;
/// This struct implements the VirtioRegionHandler trait, which inserts the memory
/// region of a virtio device into both vm_as and address_space.
///
/// * After the region is inserted into vm_as, the virtio device can read guest memory
/// data using vm_as.get_slice with a GuestAddress.
///
/// * The region is also inserted into address_space so that the correct guest last address
/// can be found when initializing the e820 table. The e820 table describes the guest memory
/// layout and is prepared before guest startup, so it must contain the correct guest memory
/// addresses and lengths. The virtio device memory belongs to the MMIO space rather than the
/// guest memory space and therefore must not be configured into the e820 table. By creating
/// the AddressSpaceRegion with the AddressSpaceRegionType::ReservedMemory type, address_space
/// knows that this region is special memory and will not put it into the e820 table.
///
/// This handler relies on the atomic-guest-memory feature. Without that feature enabled,
/// memory regions cannot be inserted into vm_as, because the insert_region interface of
/// vm_as does not insert regions in place but returns a new collection of regions; that
/// collection has to be swapped back into vm_as, which is exactly what the
/// atomic-guest-memory feature enables.
pub struct DeviceVirtioRegionHandler {
pub(crate) vm_as: GuestAddressSpaceImpl,
pub(crate) address_space: AddressSpace,
}
impl DeviceVirtioRegionHandler {
fn insert_address_space(
&mut self,
region: Arc<GuestRegionMmap>,
) -> std::result::Result<(), VirtIoError> {
let file_offset = match region.file_offset() {
// TODO: use from_arc
Some(f) => Some(FileOffset::new(f.file().try_clone()?, 0)),
None => None,
};
let as_region = Arc::new(AddressSpaceRegion::build(
AddressSpaceRegionType::DAXMemory,
region.start_addr(),
region.size() as u64,
None,
file_offset,
region.flags(),
false,
));
self.address_space.insert_region(as_region).map_err(|e| {
error!("inserting address apace error: {}", e);
// dbs-virtio-devices should not depend on dbs-address-space.
// So here io::Error is used instead of AddressSpaceError directly.
VirtIoError::IOError(io::Error::new(
io::ErrorKind::Other,
format!(
"invalid address space region ({0:#x}, {1:#x})",
region.start_addr().0,
region.len()
),
))
})?;
Ok(())
}
fn insert_vm_as(
&mut self,
region: Arc<GuestRegionMmap>,
) -> std::result::Result<(), VirtIoError> {
let vm_as_new = self.vm_as.memory().insert_region(region).map_err(|e| {
error!(
"DeviceVirtioRegionHandler failed to insert guest memory region: {:?}.",
e
);
VirtIoError::InsertMmap(e)
})?;
// Do not expect poisoned lock here, so safe to unwrap().
self.vm_as.lock().unwrap().replace(vm_as_new);
Ok(())
}
}
impl VirtioRegionHandler for DeviceVirtioRegionHandler {
fn insert_region(
&mut self,
region: Arc<GuestRegionMmap>,
) -> std::result::Result<(), VirtIoError> {
debug!(
"add geust memory region to address_space/vm_as, new region: {:?}",
region
);
self.insert_address_space(region.clone())?;
self.insert_vm_as(region)?;
Ok(())
}
}
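// Editor's note: an illustrative sketch, not part of this patch. A virtio device
// that maps extra guest-visible memory (for example a virtio-fs DAX window) hands
// the freshly created region to this handler; the single insert_region() call
// below records it both in address_space (for boot/e820 bookkeeping) and in the
// atomically swapped vm_as snapshot used for guest memory access.
fn publish_device_region(
    handler: &mut DeviceVirtioRegionHandler,
    region: Arc<GuestRegionMmap>,
) -> std::result::Result<(), VirtIoError> {
    handler.insert_region(region)
}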

File diff suppressed because it is too large

View File

@@ -0,0 +1,387 @@
// Copyright 2020-2022 Alibaba, Inc. or its affiliates. All Rights Reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::convert::TryInto;
use std::sync::Arc;
use dbs_utils::net::{MacAddr, Tap, TapError};
use dbs_utils::rate_limiter::BucketUpdate;
use dbs_virtio_devices as virtio;
use dbs_virtio_devices::net::Net;
use dbs_virtio_devices::Error as VirtioError;
use serde_derive::{Deserialize, Serialize};
use crate::address_space_manager::GuestAddressSpaceImpl;
use crate::config_manager::{
ConfigItem, DeviceConfigInfo, DeviceConfigInfos, RateLimiterConfigInfo,
};
use crate::device_manager::{DeviceManager, DeviceMgrError, DeviceOpContext};
use crate::get_bucket_update;
use super::DbsMmioV2Device;
/// Default number of virtio queues, one rx/tx pair.
pub const NUM_QUEUES: usize = 2;
/// Default size of virtio queues.
pub const QUEUE_SIZE: u16 = 256;
// The flag of whether to use the shared irq.
const USE_SHARED_IRQ: bool = true;
// The flag of whether to use the generic irq.
const USE_GENERIC_IRQ: bool = true;
/// Errors associated with virtio net device operations.
#[derive(Debug, thiserror::Error)]
pub enum VirtioNetDeviceError {
/// The virtual machine instance ID is invalid.
#[error("the virtual machine instance ID is invalid")]
InvalidVMID,
/// The iface ID is invalid.
#[error("invalid virtio-net iface id '{0}'")]
InvalidIfaceId(String),
/// Invalid queue number configuration for virtio_net device.
#[error("invalid queue number {0} for virtio-net device")]
InvalidQueueNum(usize),
/// Failure from device manager,
#[error("failure in device manager operations, {0}")]
DeviceManager(#[source] DeviceMgrError),
/// The Context Identifier is already in use.
#[error("the device ID {0} already exists")]
DeviceIDAlreadyExist(String),
/// The MAC address is already in use.
#[error("the guest MAC address {0} is already in use")]
GuestMacAddressInUse(String),
/// The host device name is already in use.
#[error("the host device name {0} is already in use")]
HostDeviceNameInUse(String),
/// Cannot open/create tap device.
#[error("cannot open TAP device")]
OpenTap(#[source] TapError),
/// Failure from virtio subsystem.
#[error(transparent)]
Virtio(VirtioError),
/// Failed to send patch message to net epoll handler.
#[error("could not send patch message to the net epoll handler")]
NetEpollHanderSendFail,
/// The update is not allowed after booting the microvm.
#[error("update operation is not allowed after boot")]
UpdateNotAllowedPostBoot,
/// Split this at some point.
/// Internal errors are due to resource exhaustion.
/// User errors are due to invalid permissions.
#[error("cannot create network device: {0}")]
CreateNetDevice(#[source] VirtioError),
/// Cannot initialize a MMIO Network Device or add a device to the MMIO Bus.
#[error("failure while registering network device: {0}")]
RegisterNetDevice(#[source] DeviceMgrError),
}
/// Configuration information for virtio net devices.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct VirtioNetDeviceConfigUpdateInfo {
/// ID of the guest network interface.
pub iface_id: String,
/// Rate Limiter for received packets.
pub rx_rate_limiter: Option<RateLimiterConfigInfo>,
/// Rate Limiter for transmitted packets.
pub tx_rate_limiter: Option<RateLimiterConfigInfo>,
}
impl VirtioNetDeviceConfigUpdateInfo {
/// Provides a `BucketUpdate` description for the RX bandwidth rate limiter.
pub fn rx_bytes(&self) -> BucketUpdate {
get_bucket_update!(self, rx_rate_limiter, bandwidth)
}
/// Provides a `BucketUpdate` description for the RX ops rate limiter.
pub fn rx_ops(&self) -> BucketUpdate {
get_bucket_update!(self, rx_rate_limiter, ops)
}
/// Provides a `BucketUpdate` description for the TX bandwidth rate limiter.
pub fn tx_bytes(&self) -> BucketUpdate {
get_bucket_update!(self, tx_rate_limiter, bandwidth)
}
/// Provides a `BucketUpdate` description for the TX ops rate limiter.
pub fn tx_ops(&self) -> BucketUpdate {
get_bucket_update!(self, tx_rate_limiter, ops)
}
}
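// Editor's note: an illustrative sketch, not part of this patch. A live
// rate-limiter update only carries the interface id plus the new bucket
// configurations; VirtioNetDeviceMgr::update_device_ratelimiters() (defined
// further below in this file) patches the running device through its MMIO state.
fn patch_net_rate_limits(
    device_mgr: &mut DeviceManager,
    iface_id: &str,
    rx: Option<RateLimiterConfigInfo>,
    tx: Option<RateLimiterConfigInfo>,
) -> std::result::Result<(), VirtioNetDeviceError> {
    let update = VirtioNetDeviceConfigUpdateInfo {
        iface_id: iface_id.to_string(),
        rx_rate_limiter: rx,
        tx_rate_limiter: tx,
    };
    VirtioNetDeviceMgr::update_device_ratelimiters(device_mgr, update)
}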
/// Configuration information for virtio net devices.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize, Default)]
pub struct VirtioNetDeviceConfigInfo {
/// ID of the guest network interface.
pub iface_id: String,
/// Host level path for the guest network interface.
pub host_dev_name: String,
/// Number of virtqueues to use.
pub num_queues: usize,
/// Size of each virtqueue, in number of descriptor entries.
pub queue_size: u16,
/// Guest MAC address.
pub guest_mac: Option<MacAddr>,
/// Rate Limiter for received packets.
pub rx_rate_limiter: Option<RateLimiterConfigInfo>,
/// Rate Limiter for transmitted packets.
pub tx_rate_limiter: Option<RateLimiterConfigInfo>,
/// allow duplicate mac
pub allow_duplicate_mac: bool,
/// Use shared irq
pub use_shared_irq: Option<bool>,
/// Use generic irq
pub use_generic_irq: Option<bool>,
}
impl VirtioNetDeviceConfigInfo {
/// Returns the tap device that `host_dev_name` refers to.
pub fn open_tap(&self) -> std::result::Result<Tap, VirtioNetDeviceError> {
Tap::open_named(self.host_dev_name.as_str(), false).map_err(VirtioNetDeviceError::OpenTap)
}
/// Returns a reference to the guest MAC address. If the MAC address is not configured, it
/// returns None.
pub fn guest_mac(&self) -> Option<&MacAddr> {
self.guest_mac.as_ref()
}
/// Returns the Rx and Tx virtqueue sizes, falling back to defaults when unset.
pub fn queue_sizes(&self) -> Vec<u16> {
let mut queue_size = self.queue_size;
if queue_size == 0 {
queue_size = QUEUE_SIZE;
}
let num_queues = if self.num_queues > 0 {
self.num_queues
} else {
NUM_QUEUES
};
(0..num_queues).map(|_| queue_size).collect::<Vec<u16>>()
}
}
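// Editor's note: an illustrative sketch, not part of this patch. "eth0" and
// "tap0" are made-up names; every field not listed keeps its derived Default
// value. With these settings queue_sizes() yields one rx/tx pair of 256
// descriptors each.
fn example_net_config() -> VirtioNetDeviceConfigInfo {
    VirtioNetDeviceConfigInfo {
        iface_id: "eth0".to_string(),
        host_dev_name: "tap0".to_string(),
        num_queues: NUM_QUEUES,
        queue_size: QUEUE_SIZE,
        ..Default::default()
    }
}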
impl ConfigItem for VirtioNetDeviceConfigInfo {
type Err = VirtioNetDeviceError;
fn id(&self) -> &str {
&self.iface_id
}
fn check_conflicts(&self, other: &Self) -> Result<(), VirtioNetDeviceError> {
if self.iface_id == other.iface_id {
Err(VirtioNetDeviceError::DeviceIDAlreadyExist(
self.iface_id.clone(),
))
} else if !other.allow_duplicate_mac
&& self.guest_mac.is_some()
&& self.guest_mac == other.guest_mac
{
Err(VirtioNetDeviceError::GuestMacAddressInUse(
self.guest_mac.as_ref().unwrap().to_string(),
))
} else if self.host_dev_name == other.host_dev_name {
Err(VirtioNetDeviceError::HostDeviceNameInUse(
self.host_dev_name.clone(),
))
} else {
Ok(())
}
}
}
/// Virtio Net Device Info
pub type VirtioNetDeviceInfo = DeviceConfigInfo<VirtioNetDeviceConfigInfo>;
/// Device manager to manage all virtio net devices.
pub struct VirtioNetDeviceMgr {
pub(crate) info_list: DeviceConfigInfos<VirtioNetDeviceConfigInfo>,
pub(crate) use_shared_irq: bool,
}
impl VirtioNetDeviceMgr {
/// Gets the index of the device with the specified `iface_id` if it exists in the list.
pub fn get_index_of_iface_id(&self, if_id: &str) -> Option<usize> {
self.info_list
.iter()
.position(|info| info.config.iface_id.eq(if_id))
}
/// Insert or update a virtio net device into the manager.
pub fn insert_device(
device_mgr: &mut DeviceManager,
mut ctx: DeviceOpContext,
config: VirtioNetDeviceConfigInfo,
) -> std::result::Result<(), VirtioNetDeviceError> {
if config.num_queues % 2 != 0 {
return Err(VirtioNetDeviceError::InvalidQueueNum(config.num_queues));
}
if !cfg!(feature = "hotplug") && ctx.is_hotplug {
return Err(VirtioNetDeviceError::UpdateNotAllowedPostBoot);
}
let mgr = &mut device_mgr.virtio_net_manager;
slog::info!(
ctx.logger(),
"add virtio-net device configuration";
"subsystem" => "net_dev_mgr",
"id" => &config.iface_id,
"host_dev_name" => &config.host_dev_name,
);
let device_index = mgr.info_list.insert_or_update(&config)?;
if ctx.is_hotplug {
slog::info!(
ctx.logger(),
"attach virtio-net device";
"subsystem" => "net_dev_mgr",
"id" => &config.iface_id,
"host_dev_name" => &config.host_dev_name,
);
match Self::create_device(&config, &mut ctx) {
Ok(device) => {
let dev = DeviceManager::create_mmio_virtio_device(
device,
&mut ctx,
config.use_shared_irq.unwrap_or(mgr.use_shared_irq),
config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(VirtioNetDeviceError::DeviceManager)?;
ctx.insert_hotplug_mmio_device(&dev.clone(), None)
.map_err(VirtioNetDeviceError::DeviceManager)?;
// live-upgrade needs to save/restore the device from info.device.
mgr.info_list[device_index].set_device(dev);
}
Err(e) => {
mgr.info_list.remove(device_index);
return Err(VirtioNetDeviceError::Virtio(e));
}
}
}
Ok(())
}
/// Update the ratelimiter settings of a virtio net device.
pub fn update_device_ratelimiters(
device_mgr: &mut DeviceManager,
new_cfg: VirtioNetDeviceConfigUpdateInfo,
) -> std::result::Result<(), VirtioNetDeviceError> {
let mgr = &mut device_mgr.virtio_net_manager;
match mgr.get_index_of_iface_id(&new_cfg.iface_id) {
Some(index) => {
let config = &mut mgr.info_list[index].config;
config.rx_rate_limiter = new_cfg.rx_rate_limiter.clone();
config.tx_rate_limiter = new_cfg.tx_rate_limiter.clone();
let device = mgr.info_list[index].device.as_mut().ok_or_else(|| {
VirtioNetDeviceError::InvalidIfaceId(new_cfg.iface_id.clone())
})?;
if let Some(mmio_dev) = device.as_any().downcast_ref::<DbsMmioV2Device>() {
let guard = mmio_dev.state();
let inner_dev = guard.get_inner_device();
if let Some(net_dev) = inner_dev
.as_any()
.downcast_ref::<virtio::net::Net<GuestAddressSpaceImpl>>()
{
return net_dev
.set_patch_rate_limiters(
new_cfg.rx_bytes(),
new_cfg.rx_ops(),
new_cfg.tx_bytes(),
new_cfg.tx_ops(),
)
.map(|_p| ())
.map_err(|_e| VirtioNetDeviceError::NetEpollHanderSendFail);
}
}
Ok(())
}
None => Err(VirtioNetDeviceError::InvalidIfaceId(
new_cfg.iface_id.clone(),
)),
}
}
/// Attach all configured virtio-net devices to the virtual machine instance.
pub fn attach_devices(
&mut self,
ctx: &mut DeviceOpContext,
) -> std::result::Result<(), VirtioNetDeviceError> {
for info in self.info_list.iter_mut() {
slog::info!(
ctx.logger(),
"attach virtio-net device";
"subsystem" => "net_dev_mgr",
"id" => &info.config.iface_id,
"host_dev_name" => &info.config.host_dev_name,
);
let device = Self::create_device(&info.config, ctx)
.map_err(VirtioNetDeviceError::CreateNetDevice)?;
let device = DeviceManager::create_mmio_virtio_device(
device,
ctx,
info.config.use_shared_irq.unwrap_or(self.use_shared_irq),
info.config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(VirtioNetDeviceError::RegisterNetDevice)?;
info.set_device(device);
}
Ok(())
}
fn create_device(
cfg: &VirtioNetDeviceConfigInfo,
ctx: &mut DeviceOpContext,
) -> std::result::Result<Box<Net<GuestAddressSpaceImpl>>, virtio::Error> {
let epoll_mgr = ctx.epoll_mgr.clone().ok_or(virtio::Error::InvalidInput)?;
let rx_rate_limiter = match cfg.rx_rate_limiter.as_ref() {
Some(rl) => Some(rl.try_into().map_err(virtio::Error::IOError)?),
None => None,
};
let tx_rate_limiter = match cfg.tx_rate_limiter.as_ref() {
Some(rl) => Some(rl.try_into().map_err(virtio::Error::IOError)?),
None => None,
};
let net_device = Net::new(
cfg.host_dev_name.clone(),
cfg.guest_mac(),
Arc::new(cfg.queue_sizes()),
epoll_mgr,
rx_rate_limiter,
tx_rate_limiter,
)?;
Ok(Box::new(net_device))
}
}
impl Default for VirtioNetDeviceMgr {
/// Create a new virtio net device manager.
fn default() -> Self {
VirtioNetDeviceMgr {
info_list: DeviceConfigInfos::new(),
use_shared_irq: USE_SHARED_IRQ,
}
}
}

View File

@@ -0,0 +1,299 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::sync::Arc;
use dbs_virtio_devices as virtio;
use dbs_virtio_devices::mmio::DRAGONBALL_FEATURE_INTR_USED;
use dbs_virtio_devices::vsock::backend::{
VsockInnerBackend, VsockInnerConnector, VsockTcpBackend, VsockUnixStreamBackend,
};
use dbs_virtio_devices::vsock::Vsock;
use dbs_virtio_devices::Error as VirtioError;
use serde_derive::{Deserialize, Serialize};
use super::StartMicroVmError;
use crate::config_manager::{ConfigItem, DeviceConfigInfo, DeviceConfigInfos};
use crate::device_manager::{DeviceManager, DeviceOpContext};
pub use dbs_virtio_devices::vsock::QUEUE_SIZES;
const SUBSYSTEM: &str = "vsock_dev_mgr";
// The flag of whether to use the shared irq.
const USE_SHARED_IRQ: bool = true;
// The flag of whether to use the generic irq.
const USE_GENERIC_IRQ: bool = true;
/// Errors associated with `VsockDeviceConfigInfo`.
#[derive(Debug, thiserror::Error)]
pub enum VsockDeviceError {
/// The virtual machine instance ID is invalid.
#[error("the virtual machine instance ID is invalid")]
InvalidVMID,
/// The Context Identifier is already in use.
#[error("the device ID {0} already exists")]
DeviceIDAlreadyExist(String),
/// The Context Identifier is invalid.
#[error("the guest CID {0} is invalid")]
GuestCIDInvalid(u32),
/// The Context Identifier is already in use.
#[error("the guest CID {0} is already in use")]
GuestCIDAlreadyInUse(u32),
/// The Unix Domain Socket path is already in use.
#[error("the Unix Domain Socket path {0} is already in use")]
UDSPathAlreadyInUse(String),
/// The net address is already in use.
#[error("the net address {0} is already in use")]
NetAddrAlreadyInUse(String),
/// The update is not allowed after booting the microvm.
#[error("update operation is not allowed after boot")]
UpdateNotAllowedPostBoot,
/// The VsockId Already Exists
#[error("vsock id {0} already exists")]
VsockIdAlreadyExists(String),
/// Inner backend create error
#[error("vsock inner backend create error: {0}")]
CreateInnerBackend(#[source] std::io::Error),
}
/// Configuration information for a vsock device.
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct VsockDeviceConfigInfo {
/// ID of the vsock device.
pub id: String,
/// A 32-bit Context Identifier (CID) used to identify the guest.
pub guest_cid: u32,
/// unix domain socket path.
pub uds_path: Option<String>,
/// tcp socket address.
pub tcp_addr: Option<String>,
/// Virtio queue size.
pub queue_size: Vec<u16>,
/// Use shared irq
pub use_shared_irq: Option<bool>,
/// Use generic irq
pub use_generic_irq: Option<bool>,
}
impl Default for VsockDeviceConfigInfo {
fn default() -> Self {
Self {
id: String::default(),
guest_cid: 0,
uds_path: None,
tcp_addr: None,
queue_size: Vec::from(QUEUE_SIZES),
use_shared_irq: None,
use_generic_irq: None,
}
}
}
impl VsockDeviceConfigInfo {
/// Get number and size of queues supported.
pub fn queue_sizes(&self) -> Vec<u16> {
self.queue_size.clone()
}
}
impl ConfigItem for VsockDeviceConfigInfo {
type Err = VsockDeviceError;
fn id(&self) -> &str {
&self.id
}
fn check_conflicts(&self, other: &Self) -> Result<(), VsockDeviceError> {
if self.id == other.id {
return Err(VsockDeviceError::DeviceIDAlreadyExist(self.id.clone()));
}
if self.guest_cid == other.guest_cid {
return Err(VsockDeviceError::GuestCIDAlreadyInUse(self.guest_cid));
}
if let (Some(self_uds_path), Some(other_uds_path)) =
(self.uds_path.as_ref(), other.uds_path.as_ref())
{
if self_uds_path == other_uds_path {
return Err(VsockDeviceError::UDSPathAlreadyInUse(self_uds_path.clone()));
}
}
if let (Some(self_net_addr), Some(other_net_addr)) =
(self.tcp_addr.as_ref(), other.tcp_addr.as_ref())
{
if self_net_addr == other_net_addr {
return Err(VsockDeviceError::NetAddrAlreadyInUse(self_net_addr.clone()));
}
}
Ok(())
}
}
/// Vsock Device Info
pub type VsockDeviceInfo = DeviceConfigInfo<VsockDeviceConfigInfo>;
/// Device manager to manage all vsock devices.
pub struct VsockDeviceMgr {
pub(crate) info_list: DeviceConfigInfos<VsockDeviceConfigInfo>,
pub(crate) default_inner_backend: Option<VsockInnerBackend>,
pub(crate) default_inner_connector: Option<VsockInnerConnector>,
pub(crate) use_shared_irq: bool,
}
impl VsockDeviceMgr {
/// Insert or update a vsock device into the manager.
pub fn insert_device(
&mut self,
ctx: DeviceOpContext,
config: VsockDeviceConfigInfo,
) -> std::result::Result<(), VsockDeviceError> {
if ctx.is_hotplug {
slog::error!(
ctx.logger(),
"no support of virtio-vsock device hotplug";
"subsystem" => SUBSYSTEM,
"id" => &config.id,
"uds_path" => &config.uds_path,
);
return Err(VsockDeviceError::UpdateNotAllowedPostBoot);
}
// VMADDR_CID_ANY (-1U) means any address for binding;
// VMADDR_CID_HYPERVISOR (0) is reserved for services built into the hypervisor;
// VMADDR_CID_RESERVED (1) must not be used;
// VMADDR_CID_HOST (2) is the well-known address of the host.
if config.guest_cid <= 2 {
return Err(VsockDeviceError::GuestCIDInvalid(config.guest_cid));
}
slog::info!(
ctx.logger(),
"add virtio-vsock device configuration";
"subsystem" => SUBSYSTEM,
"id" => &config.id,
"uds_path" => &config.uds_path,
);
self.lazy_make_default_connector()?;
self.info_list.insert_or_update(&config)?;
Ok(())
}
/// Attach all configured vsock devices to the virtual machine instance.
pub fn attach_devices(
&mut self,
ctx: &mut DeviceOpContext,
) -> std::result::Result<(), StartMicroVmError> {
let epoll_mgr = ctx
.epoll_mgr
.clone()
.ok_or(StartMicroVmError::CreateVsockDevice(
virtio::Error::InvalidInput,
))?;
for info in self.info_list.iter_mut() {
slog::info!(
ctx.logger(),
"attach virtio-vsock device";
"subsystem" => SUBSYSTEM,
"id" => &info.config.id,
"uds_path" => &info.config.uds_path,
);
let mut device = Box::new(
Vsock::new(
info.config.guest_cid as u64,
Arc::new(info.config.queue_sizes()),
epoll_mgr.clone(),
)
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?,
);
if let Some(uds_path) = info.config.uds_path.as_ref() {
let unix_backend = VsockUnixStreamBackend::new(uds_path.clone())
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?;
device
.add_backend(Box::new(unix_backend), true)
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?;
}
if let Some(tcp_addr) = info.config.tcp_addr.as_ref() {
let tcp_backend = VsockTcpBackend::new(tcp_addr.clone())
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?;
device
.add_backend(Box::new(tcp_backend), false)
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?;
}
// Add the inner backend to the first added vsock device.
if let Some(inner_backend) = self.default_inner_backend.take() {
device
.add_backend(Box::new(inner_backend), false)
.map_err(VirtioError::VirtioVsockError)
.map_err(StartMicroVmError::CreateVsockDevice)?;
}
let device = DeviceManager::create_mmio_virtio_device_with_features(
device,
ctx,
Some(DRAGONBALL_FEATURE_INTR_USED),
info.config.use_shared_irq.unwrap_or(self.use_shared_irq),
info.config.use_generic_irq.unwrap_or(USE_GENERIC_IRQ),
)
.map_err(StartMicroVmError::RegisterVsockDevice)?;
info.device = Some(device);
}
Ok(())
}
// check the default connector is present, or build it.
fn lazy_make_default_connector(&mut self) -> std::result::Result<(), VsockDeviceError> {
if self.default_inner_connector.is_none() {
let inner_backend =
VsockInnerBackend::new().map_err(VsockDeviceError::CreateInnerBackend)?;
self.default_inner_connector = Some(inner_backend.get_connector());
self.default_inner_backend = Some(inner_backend);
}
Ok(())
}
/// Get the default vsock inner connector.
pub fn get_default_connector(
&mut self,
) -> std::result::Result<VsockInnerConnector, VsockDeviceError> {
self.lazy_make_default_connector()?;
// safe to unwrap, because we created the inner connector before
Ok(self.default_inner_connector.clone().unwrap())
}
}
impl Default for VsockDeviceMgr {
/// Create a new Vsock device manager.
fn default() -> Self {
VsockDeviceMgr {
info_list: DeviceConfigInfos::new(),
default_inner_backend: None,
default_inner_connector: None,
use_shared_irq: USE_SHARED_IRQ,
}
}
}
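// Editor's note: an illustrative sketch, not part of this patch. The id and uds
// path are made-up values; guest_cid must be greater than 2, since CIDs 0-2 are
// reserved (see insert_device() above). A host-side in-process connector for the
// same transport can later be obtained from the manager with get_default_connector().
fn example_vsock_config() -> VsockDeviceConfigInfo {
    VsockDeviceConfigInfo {
        id: "vsock0".to_string(),
        guest_cid: 3,
        uds_path: Some("/run/vm/vsock.sock".to_string()),
        ..Default::default()
    }
}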

224
src/dragonball/src/error.rs Normal file
View File

@@ -0,0 +1,224 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file
//! Error codes for the virtual machine monitor subsystem.
#[cfg(feature = "dbs-virtio-devices")]
use dbs_virtio_devices::Error as VirtIoError;
use crate::{address_space_manager, device_manager, resource_manager, vcpu, vm};
/// Shorthand result type for internal VMM commands.
pub type Result<T> = std::result::Result<T, Error>;
/// Errors associated with the VMM internal logic.
///
/// These errors cannot be generated by direct user input, but can result from bad configuration
/// of the host (for example if Dragonball doesn't have permissions to open the KVM fd).
#[derive(Debug, thiserror::Error)]
pub enum Error {
/// Empty AddressSpace from parameters.
#[error("Empty AddressSpace from parameters")]
AddressSpace,
/// The zero page extends past the end of guest_mem.
#[error("the guest zero page extends past the end of guest memory")]
ZeroPagePastRamEnd,
/// Error writing the zero page of guest memory.
#[error("failed to write to guest zero page")]
ZeroPageSetup,
/// Failure occurs in issuing KVM ioctls and errors will be returned from kvm_ioctls lib.
#[error("failure in issuing KVM ioctl command: {0}")]
Kvm(#[source] kvm_ioctls::Error),
/// The host kernel reports an unsupported KVM API version.
#[error("unsupported KVM version {0}")]
KvmApiVersion(i32),
/// Cannot initialize the KVM context due to missing capabilities.
#[error("missing KVM capability: {0:?}")]
KvmCap(kvm_ioctls::Cap),
#[cfg(target_arch = "x86_64")]
#[error("failed to configure MSRs: {0:?}")]
/// Cannot configure MSRs
GuestMSRs(dbs_arch::msr::Error),
/// MSR inner error
#[error("MSR inner error")]
Msr(vmm_sys_util::fam::Error),
/// Error writing MP table to memory.
#[cfg(target_arch = "x86_64")]
#[error("failed to write MP table to guest memory: {0}")]
MpTableSetup(#[source] dbs_boot::mptable::Error),
/// Fail to boot system
#[error("failed to boot system: {0}")]
BootSystem(#[source] dbs_boot::Error),
/// Cannot open the VM file descriptor.
#[error(transparent)]
Vm(vm::VmError),
}
/// Errors associated with starting the instance.
#[derive(Debug, thiserror::Error)]
pub enum StartMicroVmError {
/// Failed to allocate resources.
#[error("cannot allocate resources")]
AllocateResource(#[source] resource_manager::ResourceError),
/// Cannot read from an Event file descriptor.
#[error("failure while reading from EventFd file descriptor")]
EventFd,
/// Cannot add event to Epoll.
#[error("failure while registering epoll event for file descriptor")]
RegisterEvent,
/// The start command was issued more than once.
#[error("the virtual machine is already running")]
MicroVMAlreadyRunning,
/// Cannot start the VM because the kernel was not configured.
#[error("cannot start the virtual machine without kernel configuration")]
MissingKernelConfig,
#[cfg(feature = "hotplug")]
/// Upcall initialization requires a virtio-vsock device.
#[error("the upcall client needs a virtio-vsock device for communication")]
UpcallMissVsock,
/// Upcall is not ready
#[error("the upcall client is not ready")]
UpcallNotReady,
/// Configuration passed in is invalid.
#[error("invalid virtual machine configuration: {0}")]
ConfigureInvalid(String),
/// This error is thrown by the minimal boot loader implementation.
/// It is related to a faulty memory configuration.
#[error("failure while configuring boot information for the virtual machine: {0}")]
ConfigureSystem(#[source] Error),
/// Cannot configure the VM.
#[error("failure while configuring the virtual machine: {0}")]
ConfigureVm(#[source] vm::VmError),
/// Cannot load initrd.
#[error("cannot load Initrd into guest memory: {0}")]
InitrdLoader(#[from] LoadInitrdError),
/// Cannot load kernel due to invalid memory configuration or invalid kernel image.
#[error("cannot load guest kernel into guest memory: {0}")]
KernelLoader(#[source] linux_loader::loader::Error),
/// Cannot load command line string.
#[error("failure while configuring guest kernel commandline: {0}")]
LoadCommandline(#[source] linux_loader::loader::Error),
/// The device manager was not configured.
#[error("the device manager failed to manage devices: {0}")]
DeviceManager(#[source] device_manager::DeviceMgrError),
/// Cannot add devices to the Legacy I/O Bus.
#[error("failure in managing legacy device: {0}")]
LegacyDevice(#[source] device_manager::LegacyDeviceError),
#[cfg(feature = "virtio-vsock")]
/// Failed to create the vsock device.
#[error("cannot create virtio-vsock device: {0}")]
CreateVsockDevice(#[source] VirtIoError),
#[cfg(feature = "virtio-vsock")]
/// Cannot initialize a MMIO Vsock Device or add a device to the MMIO Bus.
#[error("failure while registering virtio-vsock device: {0}")]
RegisterVsockDevice(#[source] device_manager::DeviceMgrError),
/// Address space manager related error, e.g. cannot access the guest address space manager.
#[error("address space manager related error: {0}")]
AddressManagerError(#[source] address_space_manager::AddressManagerError),
/// Cannot create a new vCPU file descriptor.
#[error("vCPU related error: {0}")]
Vcpu(#[source] vcpu::VcpuManagerError),
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
/// Upcall initialize Error.
#[error("failure while initializing the upcall client: {0}")]
UpcallInitError(#[source] dbs_upcall::UpcallClientError),
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
/// Upcall connect Error.
#[error("failure while connecting the upcall client: {0}")]
UpcallConnectError(#[source] dbs_upcall::UpcallClientError),
#[cfg(feature = "virtio-blk")]
/// Virtio-blk errors.
#[error("virtio-blk errors: {0}")]
BlockDeviceError(#[source] device_manager::blk_dev_mgr::BlockDeviceError),
#[cfg(feature = "virtio-net")]
/// Virtio-net errors.
#[error("virtio-net errors: {0}")]
VirtioNetDeviceError(#[source] device_manager::virtio_net_dev_mgr::VirtioNetDeviceError),
#[cfg(feature = "virtio-fs")]
/// Virtio-fs errors.
#[error("virtio-fs errors: {0}")]
FsDeviceError(#[source] device_manager::fs_dev_mgr::FsDeviceError),
}
/// Errors associated with starting the instance.
#[derive(Debug, thiserror::Error)]
pub enum StopMicrovmError {
/// Guest memory has not been initialized.
#[error("Guest memory has not been initialized")]
GuestMemoryNotInitialized,
/// Cannot remove devices.
#[error("Failed to remove devices in device_manager {0}")]
DeviceManager(#[source] device_manager::DeviceMgrError),
}
/// Errors associated with loading initrd
#[derive(Debug, thiserror::Error)]
pub enum LoadInitrdError {
/// Cannot load initrd due to an invalid memory configuration.
#[error("failed to load the initrd image to guest memory")]
LoadInitrd,
/// Cannot load initrd due to an invalid image.
#[error("failed to read the initrd image: {0}")]
ReadInitrd(#[source] std::io::Error),
}
/// A dedicated error type to glue with the vmm_epoll crate.
#[derive(Debug, thiserror::Error)]
pub enum EpollError {
/// Generic internal error.
#[error("unclassfied internal error")]
InternalError,
/// Errors from the epoll subsystem.
#[error("failed to issue epoll syscall: {0}")]
EpollMgr(#[from] dbs_utils::epoll_manager::Error),
/// Generic IO errors.
#[error(transparent)]
IOError(std::io::Error),
#[cfg(feature = "dbs-virtio-devices")]
/// Errors from virtio devices.
#[error("failed to manager Virtio device: {0}")]
VirtIoDevice(#[source] VirtIoError),
}

View File

@@ -0,0 +1,169 @@
// Copyright (C) 2020-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
//! Event manager to manage and handle IO events and requests from the API server.
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use dbs_utils::epoll_manager::{
EpollManager, EventOps, EventSet, Events, MutEventSubscriber, SubscriberId,
};
use log::{error, warn};
use vmm_sys_util::eventfd::EventFd;
use crate::error::{EpollError, Result};
use crate::vmm::Vmm;
// Statically assigned epoll slot for VMM events.
pub(crate) const EPOLL_EVENT_EXIT: u32 = 0;
pub(crate) const EPOLL_EVENT_API_REQUEST: u32 = 1;
/// Shared information between vmm::vmm_thread_event_loop() and VmmEpollHandler.
pub(crate) struct EventContext {
pub api_event_fd: EventFd,
pub api_event_triggered: bool,
pub exit_evt_triggered: bool,
}
impl EventContext {
/// Create a new instance of [`EventContext`].
pub fn new(api_event_fd: EventFd) -> Result<Self> {
Ok(EventContext {
api_event_fd,
api_event_triggered: false,
exit_evt_triggered: false,
})
}
}
/// Event manager for VMM to handle API requests and IO events.
pub struct EventManager {
epoll_mgr: EpollManager,
subscriber_id: SubscriberId,
vmm_event_count: Arc<AtomicUsize>,
}
impl Drop for EventManager {
fn drop(&mut self) {
// Vmm -> Vm -> EpollManager -> VmmEpollHandler -> Vmm
// We need to remove VmmEpollHandler to break the circular reference
// so that Vmm can drop.
self.epoll_mgr
.remove_subscriber(self.subscriber_id)
.map_err(|e| {
error!("event_manager: remove_subscriber err. {:?}", e);
e
})
.ok();
}
}
impl EventManager {
/// Create a new event manager associated with the VMM object.
pub fn new(vmm: &Arc<Mutex<Vmm>>, epoll_mgr: EpollManager) -> Result<Self> {
let vmm_event_count = Arc::new(AtomicUsize::new(0));
let handler: Box<dyn MutEventSubscriber + Send> = Box::new(VmmEpollHandler {
vmm: vmm.clone(),
vmm_event_count: vmm_event_count.clone(),
});
let subscriber_id = epoll_mgr.add_subscriber(handler);
Ok(EventManager {
epoll_mgr,
subscriber_id,
vmm_event_count,
})
}
/// Get the underlying epoll event manager.
pub fn epoll_manager(&self) -> EpollManager {
self.epoll_mgr.clone()
}
/// Register the eventfd for exit notification.
pub fn register_exit_eventfd(
&mut self,
exit_evt: &EventFd,
) -> std::result::Result<(), EpollError> {
let events = Events::with_data(exit_evt, EPOLL_EVENT_EXIT, EventSet::IN);
self.epoll_mgr
.add_event(self.subscriber_id, events)
.map_err(EpollError::EpollMgr)
}
/// Poll pending events and invoke registered event handler.
///
/// # Arguments:
/// * timeout: maximum time in milliseconds to wait for pending events
pub fn handle_events(&self, timeout: i32) -> std::result::Result<usize, EpollError> {
self.epoll_mgr
.handle_events(timeout)
.map_err(EpollError::EpollMgr)
}
/// Fetch the VMM event count and reset it to zero.
pub fn fetch_vmm_event_count(&self) -> usize {
self.vmm_event_count.swap(0, Ordering::AcqRel)
}
}
struct VmmEpollHandler {
vmm: Arc<Mutex<Vmm>>,
vmm_event_count: Arc<AtomicUsize>,
}
impl MutEventSubscriber for VmmEpollHandler {
fn process(&mut self, events: Events, _ops: &mut EventOps) {
// Do not try to recover when the lock has already been poisoned.
// And be careful to avoid deadlock between process() and vmm::vmm_thread_event_loop().
let mut vmm = self.vmm.lock().unwrap();
match events.data() {
EPOLL_EVENT_API_REQUEST => {
if let Err(e) = vmm.event_ctx.api_event_fd.read() {
error!("event_manager: failed to read API eventfd, {:?}", e);
}
vmm.event_ctx.api_event_triggered = true;
self.vmm_event_count.fetch_add(1, Ordering::AcqRel);
}
EPOLL_EVENT_EXIT => {
let vm = vmm.get_vm().unwrap();
match vm.get_reset_eventfd() {
Some(ev) => {
if let Err(e) = ev.read() {
error!("event_manager: failed to read exit eventfd, {:?}", e);
}
}
None => warn!("event_manager: leftover exit event in epoll context!"),
}
vmm.event_ctx.exit_evt_triggered = true;
self.vmm_event_count.fetch_add(1, Ordering::AcqRel);
}
_ => error!("event_manager: unknown epoll slot number {}", events.data()),
}
}
fn init(&mut self, ops: &mut EventOps) {
// Do not expect poisoned lock.
let vmm = self.vmm.lock().unwrap();
let events = Events::with_data(
&vmm.event_ctx.api_event_fd,
EPOLL_EVENT_API_REQUEST,
EventSet::IN,
);
if let Err(e) = ops.add(events) {
error!(
"event_manager: failed to register epoll event for API server, {:?}",
e
);
}
}
}
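// Editor's note: an illustrative sketch, not part of this patch. It shows the
// shape of one iteration of the loop that vmm::vmm_thread_event_loop() is
// expected to run around this manager; a negative timeout follows the usual
// epoll convention of waiting indefinitely. The returned count tells the caller
// whether the api_event_triggered/exit_evt_triggered flags in event_ctx need to
// be inspected.
fn poll_once(event_mgr: &EventManager) -> std::result::Result<usize, EpollError> {
    let _handled = event_mgr.handle_events(-1)?;
    Ok(event_mgr.fetch_vmm_event_count())
}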

View File

@@ -0,0 +1,60 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
use std::sync::Arc;
use arc_swap::{ArcSwap, Cache};
use dbs_device::device_manager::Error;
use dbs_device::device_manager::IoManager;
/// A specialized version of [`std::result::Result`] for IO manager related operations.
pub type Result<T> = std::result::Result<T, Error>;
/// Wrapper over IoManager to support device hotplug with [`ArcSwap`] and [`Cache`].
#[derive(Clone)]
pub struct IoManagerCached(pub(crate) Cache<Arc<ArcSwap<IoManager>>, Arc<IoManager>>);
impl IoManagerCached {
/// Create a new instance of [`IoManagerCached`].
pub fn new(io_manager: Arc<ArcSwap<IoManager>>) -> Self {
IoManagerCached(Cache::new(io_manager))
}
#[cfg(target_arch = "x86_64")]
#[inline]
/// Read data from IO ports.
pub fn pio_read(&mut self, addr: u16, data: &mut [u8]) -> Result<()> {
self.0.load().pio_read(addr, data)
}
#[cfg(target_arch = "x86_64")]
#[inline]
/// Write data to IO ports.
pub fn pio_write(&mut self, addr: u16, data: &[u8]) -> Result<()> {
self.0.load().pio_write(addr, data)
}
#[inline]
/// Read data from MMIO address.
pub fn mmio_read(&mut self, addr: u64, data: &mut [u8]) -> Result<()> {
self.0.load().mmio_read(addr, data)
}
#[inline]
/// Write data to MMIO address.
pub fn mmio_write(&mut self, addr: u64, data: &[u8]) -> Result<()> {
self.0.load().mmio_write(addr, data)
}
#[inline]
/// Revalidate the inner cache
pub fn revalidate_cache(&mut self) {
let _ = self.0.load();
}
#[inline]
/// Get immutable reference to underlying [`IoManager`].
pub fn load(&mut self) -> &IoManager {
self.0.load()
}
}
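// Editor's note: an illustrative sketch, not part of this patch. It shows the
// hotplug pattern this wrapper enables: the authoritative IoManager sits behind
// an ArcSwap, per-vCPU threads keep a cheap IoManagerCached copy, and after a
// device change swaps in a new IoManager the threads refresh their view with
// revalidate_cache(). The MMIO address used below is arbitrary.
fn example_cached_mmio_read() -> Result<()> {
    let shared = Arc::new(ArcSwap::new(Arc::new(IoManager::new())));
    let mut cached = IoManagerCached::new(shared.clone());

    // After hotplug the device manager would publish a new snapshot with
    // `shared.store(Arc::new(updated_io_manager))`; readers then call:
    cached.revalidate_cache();

    let mut data = [0u8; 4];
    cached.mmio_read(0x1000, &mut data)
}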

View File

@@ -0,0 +1,251 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
#![allow(dead_code)]
use kvm_bindings::KVM_API_VERSION;
use kvm_ioctls::{Cap, Kvm, VmFd};
use std::os::unix::io::{FromRawFd, RawFd};
use crate::error::{Error, Result};
/// Describes a KVM context that gets attached to the micro VM instance.
/// It gives access to the functionality of the KVM wrapper as long as every required
/// KVM capability is present on the host.
pub struct KvmContext {
kvm: Kvm,
max_memslots: usize,
#[cfg(target_arch = "x86_64")]
supported_msrs: kvm_bindings::MsrList,
}
impl KvmContext {
/// Create a new KVM context object, using the provided `kvm_fd` if one is provided.
pub fn new(kvm_fd: Option<RawFd>) -> Result<Self> {
let kvm = if let Some(fd) = kvm_fd {
// Safe because we expect kvm_fd to contain a valid fd number when is_some() == true.
unsafe { Kvm::from_raw_fd(fd) }
} else {
Kvm::new().map_err(Error::Kvm)?
};
if kvm.get_api_version() != KVM_API_VERSION as i32 {
return Err(Error::KvmApiVersion(kvm.get_api_version()));
}
Self::check_cap(&kvm, Cap::Irqchip)?;
Self::check_cap(&kvm, Cap::Irqfd)?;
Self::check_cap(&kvm, Cap::Ioeventfd)?;
Self::check_cap(&kvm, Cap::UserMemory)?;
#[cfg(target_arch = "x86_64")]
Self::check_cap(&kvm, Cap::SetTssAddr)?;
#[cfg(target_arch = "x86_64")]
let supported_msrs = dbs_arch::msr::supported_guest_msrs(&kvm).map_err(Error::GuestMSRs)?;
let max_memslots = kvm.get_nr_memslots();
Ok(KvmContext {
kvm,
max_memslots,
#[cfg(target_arch = "x86_64")]
supported_msrs,
})
}
/// Get underlying KVM object to access kvm-ioctls interfaces.
pub fn kvm(&self) -> &Kvm {
&self.kvm
}
/// Get the maximum number of memory slots reported by this KVM context.
pub fn max_memslots(&self) -> usize {
self.max_memslots
}
/// Create a virtual machine object.
pub fn create_vm(&self) -> Result<VmFd> {
self.kvm.create_vm().map_err(Error::Kvm)
}
/// Get the max vcpu count supported by kvm
pub fn get_max_vcpus(&self) -> usize {
self.kvm.get_max_vcpus()
}
fn check_cap(kvm: &Kvm, cap: Cap) -> std::result::Result<(), Error> {
if !kvm.check_extension(cap) {
return Err(Error::KvmCap(cap));
}
Ok(())
}
}
#[cfg(target_arch = "x86_64")]
mod x86_64 {
use super::*;
use dbs_arch::msr::*;
use kvm_bindings::{kvm_msr_entry, CpuId, MsrList, Msrs};
use std::collections::HashSet;
impl KvmContext {
/// Get information about supported CPUID of x86 processor.
pub fn supported_cpuid(
&self,
max_entries_count: usize,
) -> std::result::Result<CpuId, kvm_ioctls::Error> {
self.kvm.get_supported_cpuid(max_entries_count)
}
/// Get information about supported MSRs of x86 processor.
pub fn supported_msrs(
&self,
_max_entries_count: usize,
) -> std::result::Result<MsrList, kvm_ioctls::Error> {
Ok(self.supported_msrs.clone())
}
// It's very sensitive to manipulate MSRs, so please be careful when changing the code below.
fn build_msrs_list(kvm: &Kvm) -> Result<Msrs> {
let mut mset: HashSet<u32> = HashSet::new();
let supported_msr_list = kvm.get_msr_index_list().map_err(super::Error::Kvm)?;
for msr in supported_msr_list.as_slice() {
mset.insert(*msr);
}
let mut msrs = vec![
MSR_IA32_APICBASE,
MSR_IA32_SYSENTER_CS,
MSR_IA32_SYSENTER_ESP,
MSR_IA32_SYSENTER_EIP,
MSR_IA32_CR_PAT,
];
let filters_list = vec![
MSR_STAR,
MSR_VM_HSAVE_PA,
MSR_TSC_AUX,
MSR_IA32_TSC_ADJUST,
MSR_IA32_TSCDEADLINE,
MSR_IA32_MISC_ENABLE,
MSR_IA32_BNDCFGS,
MSR_IA32_SPEC_CTRL,
];
for msr in filters_list {
if mset.contains(&msr) {
msrs.push(msr);
}
}
// TODO: several msrs are optional.
// TODO: our guests don't support nested-vmx, LMCE or SGX for now.
// msrs.push(MSR_IA32_FEATURE_CONTROL);
msrs.push(MSR_CSTAR);
msrs.push(MSR_KERNEL_GS_BASE);
msrs.push(MSR_SYSCALL_MASK);
msrs.push(MSR_LSTAR);
msrs.push(MSR_IA32_TSC);
msrs.push(MSR_KVM_SYSTEM_TIME_NEW);
msrs.push(MSR_KVM_WALL_CLOCK_NEW);
// FIXME: check if it's supported.
msrs.push(MSR_KVM_ASYNC_PF_EN);
msrs.push(MSR_KVM_PV_EOI_EN);
msrs.push(MSR_KVM_STEAL_TIME);
msrs.push(MSR_CORE_PERF_FIXED_CTR_CTRL);
msrs.push(MSR_CORE_PERF_GLOBAL_CTRL);
msrs.push(MSR_CORE_PERF_GLOBAL_STATUS);
msrs.push(MSR_CORE_PERF_GLOBAL_OVF_CTRL);
const MAX_FIXED_COUNTERS: u32 = 3;
for i in 0..MAX_FIXED_COUNTERS {
msrs.push(MSR_CORE_PERF_FIXED_CTR0 + i);
}
// FIXME: skip MCE for now.
let mtrr_msrs = vec![
MSR_MTRRdefType,
MSR_MTRRfix64K_00000,
MSR_MTRRfix16K_80000,
MSR_MTRRfix16K_A0000,
MSR_MTRRfix4K_C0000,
MSR_MTRRfix4K_C8000,
MSR_MTRRfix4K_D0000,
MSR_MTRRfix4K_D8000,
MSR_MTRRfix4K_E0000,
MSR_MTRRfix4K_E8000,
MSR_MTRRfix4K_F0000,
MSR_MTRRfix4K_F8000,
];
for mtrr in mtrr_msrs {
msrs.push(mtrr);
}
const MSR_MTRRCAP_VCNT: u32 = 8;
for i in 0..MSR_MTRRCAP_VCNT {
msrs.push(0x200 + 2 * i);
msrs.push(0x200 + 2 * i + 1);
}
let msrs: Vec<kvm_msr_entry> = msrs
.iter()
.map(|reg| kvm_msr_entry {
index: *reg,
reserved: 0,
data: 0,
})
.collect();
Msrs::from_entries(&msrs).map_err(super::Error::Msr)
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use kvm_ioctls::Kvm;
use std::fs::File;
use std::os::unix::fs::MetadataExt;
use std::os::unix::io::{AsRawFd, FromRawFd};
#[test]
fn test_create_kvm_context() {
let c = KvmContext::new(None).unwrap();
assert!(c.max_memslots >= 32);
let kvm = Kvm::new().unwrap();
let f = unsafe { File::from_raw_fd(kvm.as_raw_fd()) };
let m1 = f.metadata().unwrap();
let m2 = File::open("/dev/kvm").unwrap().metadata().unwrap();
assert_eq!(m1.dev(), m2.dev());
assert_eq!(m1.ino(), m2.ino());
}
#[cfg(target_arch = "x86_64")]
#[test]
fn test_get_supported_cpu_id() {
let c = KvmContext::new(None).unwrap();
let _ = c
.supported_cpuid(kvm_bindings::KVM_MAX_CPUID_ENTRIES)
.expect("failed to get supported CPUID");
assert!(c.supported_cpuid(0).is_err());
}
#[test]
fn test_create_vm() {
let c = KvmContext::new(None).unwrap();
let _ = c.create_vm().unwrap();
}
}

60
src/dragonball/src/lib.rs Normal file
View File

@@ -0,0 +1,60 @@
// Copyright (C) 2018-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//! Dragonball is a light-weight virtual machine manager (VMM) based on the Linux Kernel-based
//! Virtual Machine (KVM), optimized for container workloads.
#![warn(missing_docs)]
// TODO: Remove this after the rest of dragonball has been committed.
#![allow(dead_code)]
/// Address space manager for virtual machines.
pub mod address_space_manager;
/// API to handle vmm requests.
pub mod api;
/// Structs to maintain configuration information.
pub mod config_manager;
/// Device manager for virtual machines.
pub mod device_manager;
/// Errors related to Virtual machine manager.
pub mod error;
/// KVM operation context for virtual machines.
pub mod kvm_context;
/// Metrics system.
pub mod metric;
/// Resource manager for virtual machines.
pub mod resource_manager;
/// Signal handler for virtual machines.
pub mod signal_handler;
/// Virtual CPU manager for virtual machines.
pub mod vcpu;
/// Virtual machine manager for virtual machines.
pub mod vm;
mod event_manager;
mod io_manager;
mod vmm;
pub use self::error::StartMicroVmError;
pub use self::io_manager::IoManagerCached;
pub use self::vmm::Vmm;
/// Success exit code.
pub const EXIT_CODE_OK: u8 = 0;
/// Generic error exit code.
pub const EXIT_CODE_GENERIC_ERROR: u8 = 1;
/// Generic exit code for an error considered not possible to occur if the program logic is sound.
pub const EXIT_CODE_UNEXPECTED_ERROR: u8 = 2;
/// Dragonball was shut down after intercepting a restricted system call.
pub const EXIT_CODE_BAD_SYSCALL: u8 = 148;
/// Dragonball was shut down after intercepting `SIGBUS`.
pub const EXIT_CODE_SIGBUS: u8 = 149;
/// Dragonball was shut down after intercepting `SIGSEGV`.
pub const EXIT_CODE_SIGSEGV: u8 = 150;
/// Invalid json passed to the Dragonball process for configuring microvm.
pub const EXIT_CODE_INVALID_JSON: u8 = 151;
/// Bad configuration for microvm's resources, when using a single json.
pub const EXIT_CODE_BAD_CONFIGURATION: u8 = 152;
/// Command line arguments parsing error.
pub const EXIT_CODE_ARG_PARSING: u8 = 153;

View File

@@ -0,0 +1,58 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
use dbs_utils::metric::SharedIncMetric;
use lazy_static::lazy_static;
use serde::Serialize;
pub use dbs_utils::metric::IncMetric;
lazy_static! {
/// Static instance used for handling metrics.
pub static ref METRICS: DragonballMetrics = DragonballMetrics::default();
}
/// Metrics specific to VCPUs' mode of functioning.
#[derive(Default, Serialize)]
pub struct VcpuMetrics {
/// Number of KVM exits for handling input IO.
pub exit_io_in: SharedIncMetric,
/// Number of KVM exits for handling output IO.
pub exit_io_out: SharedIncMetric,
/// Number of KVM exits for handling MMIO reads.
pub exit_mmio_read: SharedIncMetric,
/// Number of KVM exits for handling MMIO writes.
pub exit_mmio_write: SharedIncMetric,
/// Number of errors during this VCPU's run.
pub failures: SharedIncMetric,
/// Failures in configuring the CPUID.
pub filter_cpuid: SharedIncMetric,
}
/// Metrics for the seccomp filtering.
#[derive(Default, Serialize)]
pub struct SeccompMetrics {
/// Number of errors inside the seccomp filtering.
pub num_faults: SharedIncMetric,
}
/// Metrics related to signals.
#[derive(Default, Serialize)]
pub struct SignalMetrics {
/// Number of times that SIGBUS was handled.
pub sigbus: SharedIncMetric,
/// Number of times that SIGSEGV was handled.
pub sigsegv: SharedIncMetric,
}
/// Structure storing all metrics while enforcing serialization support on them.
#[derive(Default, Serialize)]
pub struct DragonballMetrics {
/// Metrics related to a vcpu's functioning.
pub vcpu: VcpuMetrics,
/// Metrics related to seccomp filtering.
pub seccomp: SeccompMetrics,
/// Metrics related to signals.
pub signals: SignalMetrics,
}
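Since `DragonballMetrics` derives `Serialize` and `METRICS` is a global `lazy_static`, any component can bump a counter and the whole tree can be exported in one call. A minimal sketch, not part of the diff, assuming `serde_json` is available as an extra dependency:

use dragonball::metric::{IncMetric, METRICS};

// Illustrative only: count an MMIO-read exit and export the metrics tree.
fn record_mmio_read() {
    METRICS.vcpu.exit_mmio_read.inc();
}

fn dump_metrics() -> serde_json::Result<String> {
    // `&*METRICS` derefs the lazy_static wrapper to the DragonballMetrics value.
    serde_json::to_string(&*METRICS)
}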

View File

@@ -0,0 +1,785 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
//
// SPDX-License-Identifier: Apache-2.0
use std::sync::Mutex;
use dbs_allocator::{Constraint, IntervalTree, Range};
use dbs_boot::layout::{
GUEST_MEM_END, GUEST_MEM_START, GUEST_PHYS_END, IRQ_BASE as LEGACY_IRQ_BASE,
IRQ_MAX as LEGACY_IRQ_MAX, MMIO_LOW_END, MMIO_LOW_START,
};
use dbs_device::resources::{DeviceResources, MsiIrqType, Resource, ResourceConstraint};
// We reserve the LEGACY_IRQ_BASE(5) for shared IRQ.
const SHARED_IRQ: u32 = LEGACY_IRQ_BASE;
// Since ioapic2 has 24 pins for legacy devices, irq numbers 0-23 are used, so we set MSI_IRQ_BASE to 24.
#[cfg(target_arch = "x86_64")]
const MSI_IRQ_BASE: u32 = 24;
#[cfg(target_arch = "aarch64")]
/// We define MSI_IRQ_BASE as LEGACY_IRQ_MAX + 1 for aarch64 so that it does not conflict with legacy irq numbers.
const MSI_IRQ_BASE: u32 = LEGACY_IRQ_MAX + 1;
// kvm max irq is defined in arch/x86/include/asm/kvm_host.h
const MSI_IRQ_MAX: u32 = 1023;
// x86's kvm user mem slots is defined in arch/x86/include/asm/kvm_host.h
#[cfg(target_arch = "x86_64")]
const KVM_USER_MEM_SLOTS: u32 = 509;
// aarch64's kvm user mem slots is defined in arch/arm64/include/asm/kvm_host.h
#[cfg(target_arch = "aarch64")]
const KVM_USER_MEM_SLOTS: u32 = 512;
const PIO_MIN: u16 = 0x0;
const PIO_MAX: u16 = 0xFFFF;
// Reserve the 64MB MMIO address range just below 4G; x86 systems have special
// devices, such as the LAPIC, IOAPIC, HPET etc., in this range, and we don't explicitly
// allocate MMIO addresses for those devices.
const MMIO_SPACE_RESERVED: u64 = 0x400_0000;
/// Errors associated with resource management operations
#[derive(Debug, PartialEq, thiserror::Error)]
pub enum ResourceError {
/// Unknown/unsupported resource type.
#[error("unsupported resource type")]
UnknownResourceType,
/// Invalid resource range.
#[error("invalid resource range for resource type : {0}")]
InvalidResourceRange(String),
/// No resource available.
#[error("no resource available")]
NoAvailResource,
}
#[derive(Default)]
struct ResourceManagerBuilder {
// IntervalTree for allocating legacy irq number.
legacy_irq_pool: IntervalTree<()>,
// IntervalTree for allocating message signal interrupt (MSI) irq number.
msi_irq_pool: IntervalTree<()>,
// IntervalTree for allocating port-mapped io (PIO) address.
pio_pool: IntervalTree<()>,
// IntervalTree for allocating memory-mapped io (MMIO) address.
mmio_pool: IntervalTree<()>,
// IntervalTree for allocating guest memory.
mem_pool: IntervalTree<()>,
// IntervalTree for allocating kvm memory slot.
kvm_mem_slot_pool: IntervalTree<()>,
}
impl ResourceManagerBuilder {
/// init legacy_irq_pool with arch specific constants.
fn init_legacy_irq_pool(mut self) -> Self {
// The LEGACY_IRQ_BASE irq is reserved for shared IRQ and won't be allocated / reallocated,
// so we don't insert it into the legacy_irq interval tree.
self.legacy_irq_pool
.insert(Range::new(LEGACY_IRQ_BASE + 1, LEGACY_IRQ_MAX), None);
self
}
/// init msi_irq_pool with arch specific constants.
fn init_msi_irq_pool(mut self) -> Self {
self.msi_irq_pool
.insert(Range::new(MSI_IRQ_BASE, MSI_IRQ_MAX), None);
self
}
/// init pio_pool with arch specific constants.
fn init_pio_pool(mut self) -> Self {
self.pio_pool.insert(Range::new(PIO_MIN, PIO_MAX), None);
self
}
/// Create mmio_pool with arch specific constants.
/// allow(clippy) is needed because clippy flags the `GUEST_MEM_START > MMIO_LOW_END` comparison
/// as absurd with the current constants; we may modify GUEST_MEM_START or MMIO_LOW_END in the future.
#[allow(clippy::absurd_extreme_comparisons)]
fn init_mmio_pool_helper(mmio: &mut IntervalTree<()>) {
mmio.insert(Range::new(MMIO_LOW_START, MMIO_LOW_END), None);
if !(*GUEST_MEM_END < MMIO_LOW_START
|| GUEST_MEM_START > MMIO_LOW_END
|| MMIO_LOW_START == MMIO_LOW_END)
{
#[cfg(target_arch = "x86_64")]
{
let constraint = Constraint::new(MMIO_SPACE_RESERVED)
.min(MMIO_LOW_END - MMIO_SPACE_RESERVED)
.max(0xffff_ffffu64);
let key = mmio.allocate(&constraint);
if let Some(k) = key.as_ref() {
mmio.update(k, ());
} else {
panic!("failed to reserve MMIO address range for x86 system devices");
}
}
}
if *GUEST_MEM_END < *GUEST_PHYS_END {
mmio.insert(Range::new(*GUEST_MEM_END + 1, *GUEST_PHYS_END), None);
}
}
/// init mmio_pool with helper function
fn init_mmio_pool(mut self) -> Self {
Self::init_mmio_pool_helper(&mut self.mmio_pool);
self
}
/// Create mem_pool with arch specific constants.
/// allow(clippy) is needed because clippy flags the `GUEST_MEM_START > MMIO_LOW_END` comparison
/// as absurd with the current constants; we may modify GUEST_MEM_START or MMIO_LOW_END in the future.
#[allow(clippy::absurd_extreme_comparisons)]
pub(crate) fn init_mem_pool_helper(mem: &mut IntervalTree<()>) {
if *GUEST_MEM_END < MMIO_LOW_START
|| GUEST_MEM_START > MMIO_LOW_END
|| MMIO_LOW_START == MMIO_LOW_END
{
mem.insert(Range::new(GUEST_MEM_START, *GUEST_MEM_END), None);
} else {
if MMIO_LOW_START > GUEST_MEM_START {
mem.insert(Range::new(GUEST_MEM_START, MMIO_LOW_START - 1), None);
}
if MMIO_LOW_END < *GUEST_MEM_END {
mem.insert(Range::new(MMIO_LOW_END + 1, *GUEST_MEM_END), None);
}
}
}
/// init mem_pool with helper function
fn init_mem_pool(mut self) -> Self {
Self::init_mem_pool_helper(&mut self.mem_pool);
self
}
/// init kvm_mem_slot_pool with arch specific constants.
fn init_kvm_mem_slot_pool(mut self, max_kvm_mem_slot: Option<usize>) -> Self {
let max_slots = max_kvm_mem_slot.unwrap_or(KVM_USER_MEM_SLOTS as usize);
self.kvm_mem_slot_pool
.insert(Range::new(0, max_slots as u64), None);
self
}
fn build(self) -> ResourceManager {
ResourceManager {
legacy_irq_pool: Mutex::new(self.legacy_irq_pool),
msi_irq_pool: Mutex::new(self.msi_irq_pool),
pio_pool: Mutex::new(self.pio_pool),
mmio_pool: Mutex::new(self.mmio_pool),
mem_pool: Mutex::new(self.mem_pool),
kvm_mem_slot_pool: Mutex::new(self.kvm_mem_slot_pool),
}
}
}
/// Resource manager manages all resources for a virtual machine instance.
pub struct ResourceManager {
legacy_irq_pool: Mutex<IntervalTree<()>>,
msi_irq_pool: Mutex<IntervalTree<()>>,
pio_pool: Mutex<IntervalTree<()>>,
mmio_pool: Mutex<IntervalTree<()>>,
mem_pool: Mutex<IntervalTree<()>>,
kvm_mem_slot_pool: Mutex<IntervalTree<()>>,
}
impl Default for ResourceManager {
fn default() -> Self {
ResourceManagerBuilder::default().build()
}
}
impl ResourceManager {
/// Create a resource manager instance.
pub fn new(max_kvm_mem_slot: Option<usize>) -> Self {
let res_manager_builder = ResourceManagerBuilder::default();
res_manager_builder
.init_legacy_irq_pool()
.init_msi_irq_pool()
.init_pio_pool()
.init_mmio_pool()
.init_mem_pool()
.init_kvm_mem_slot_pool(max_kvm_mem_slot)
.build()
}
/// Init mem_pool with arch specific constants.
pub fn init_mem_pool(&self) {
let mut mem = self.mem_pool.lock().unwrap();
ResourceManagerBuilder::init_mem_pool_helper(&mut mem);
}
/// Check if mem_pool is empty.
pub fn is_mem_pool_empty(&self) -> bool {
self.mem_pool.lock().unwrap().is_empty()
}
/// Allocate one legacy irq number.
///
/// Allocate the specified irq number if `fixed` contains an irq number.
pub fn allocate_legacy_irq(&self, shared: bool, fixed: Option<u32>) -> Option<u32> {
// if shared_irq is used, just return the shared irq num.
if shared {
return Some(SHARED_IRQ);
}
let mut constraint = Constraint::new(1u32);
if let Some(v) = fixed {
if v == SHARED_IRQ {
return None;
}
constraint.min = v as u64;
constraint.max = v as u64;
}
// Safe to unwrap() because we don't expect poisoned lock here.
let mut legacy_irq_pool = self.legacy_irq_pool.lock().unwrap();
let key = legacy_irq_pool.allocate(&constraint);
if let Some(k) = key.as_ref() {
legacy_irq_pool.update(k, ());
}
key.map(|v| v.min as u32)
}
/// Free a legacy irq number.
///
/// Returns an error if the irq number is out of the legacy IRQ range.
pub fn free_legacy_irq(&self, irq: u32) -> Result<(), ResourceError> {
// if the irq number is shared_irq, we don't need to do anything.
if irq == SHARED_IRQ {
return Ok(());
}
if !(LEGACY_IRQ_BASE..=LEGACY_IRQ_MAX).contains(&irq) {
return Err(ResourceError::InvalidResourceRange(
"Legacy IRQ".to_string(),
));
}
let key = Range::new(irq, irq);
// Safe to unwrap() because we don't expect poisoned lock here.
self.legacy_irq_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate a group of MSI irq numbers.
///
/// The allocated MSI irq numbers may or may not be naturally aligned.
pub fn allocate_msi_irq(&self, count: u32) -> Option<u32> {
let constraint = Constraint::new(count);
// Safe to unwrap() because we don't expect poisoned lock here.
let mut msi_irq_pool = self.msi_irq_pool.lock().unwrap();
let key = msi_irq_pool.allocate(&constraint);
if let Some(k) = key.as_ref() {
msi_irq_pool.update(k, ());
}
key.map(|v| v.min as u32)
}
/// Allocate a group of MSI irq numbers, naturally aligned to `count`.
///
/// This may be used to support PCI MSI, which requires that the allocated irq number be
/// naturally aligned.
pub fn allocate_msi_irq_aligned(&self, count: u32) -> Option<u32> {
let constraint = Constraint::new(count).align(count);
// Safe to unwrap() because we don't expect poisoned lock here.
let mut msi_irq_pool = self.msi_irq_pool.lock().unwrap();
let key = msi_irq_pool.allocate(&constraint);
if let Some(k) = key.as_ref() {
msi_irq_pool.update(k, ());
}
key.map(|v| v.min as u32)
}
/// Free a group of MSI irq numbers.
///
/// Returns an error if `irq` or `count` is invalid.
pub fn free_msi_irq(&self, irq: u32, count: u32) -> Result<(), ResourceError> {
if irq < MSI_IRQ_BASE
|| count == 0
|| irq.checked_add(count).is_none()
|| irq + count - 1 > MSI_IRQ_MAX
{
return Err(ResourceError::InvalidResourceRange("MSI IRQ".to_string()));
}
let key = Range::new(irq, irq + count - 1);
// Safe to unwrap() because we don't expect poisoned lock here.
self.msi_irq_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate a group of PIO address and returns the allocated PIO base address.
pub fn allocate_pio_address_simple(&self, size: u16) -> Option<u16> {
let constraint = Constraint::new(size);
self.allocate_pio_address(&constraint)
}
/// Allocate a group of PIO address and returns the allocated PIO base address.
pub fn allocate_pio_address(&self, constraint: &Constraint) -> Option<u16> {
// Safe to unwrap() because we don't expect poisoned lock here.
let mut pio_pool = self.pio_pool.lock().unwrap();
let key = pio_pool.allocate(constraint);
if let Some(k) = key.as_ref() {
pio_pool.update(k, ());
}
key.map(|v| v.min as u16)
}
/// Free PIO address range `[base, base + size - 1]`.
///
/// Returns an error if `base` or `size` is invalid.
pub fn free_pio_address(&self, base: u16, size: u16) -> Result<(), ResourceError> {
if base.checked_add(size).is_none() {
return Err(ResourceError::InvalidResourceRange(
"PIO Address".to_string(),
));
}
let key = Range::new(base, base + size - 1);
// Safe to unwrap() because we don't expect poisoned lock here.
self.pio_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate a MMIO address range aligned to `align` and returns the allocated base address.
pub fn allocate_mmio_address_aligned(&self, size: u64, align: u64) -> Option<u64> {
let constraint = Constraint::new(size).align(align);
self.allocate_mmio_address(&constraint)
}
/// Allocate a MMIO address range and returns the allocated base address.
pub fn allocate_mmio_address(&self, constraint: &Constraint) -> Option<u64> {
// Safe to unwrap() because we don't expect poisoned lock here.
let mut mmio_pool = self.mmio_pool.lock().unwrap();
let key = mmio_pool.allocate(constraint);
key.map(|v| v.min)
}
/// Free MMIO address range `[base, base + size - 1]`
pub fn free_mmio_address(&self, base: u64, size: u64) -> Result<(), ResourceError> {
if base.checked_add(size).is_none() {
return Err(ResourceError::InvalidResourceRange(
"MMIO Address".to_string(),
));
}
let key = Range::new(base, base + size - 1);
// Safe to unwrap() because we don't expect poisoned lock here.
self.mmio_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate guest memory address range and returns the allocated base memory address.
pub fn allocate_mem_address(&self, constraint: &Constraint) -> Option<u64> {
// Safe to unwrap() because we don't expect poisoned lock here.
let mut mem_pool = self.mem_pool.lock().unwrap();
let key = mem_pool.allocate(constraint);
key.map(|v| v.min)
}
/// Free the guest memory address range `[base, base + size - 1]`.
///
/// Returns an error if the guest memory address range is invalid.
/// allow(clippy) is needed because clippy flags the `base < GUEST_MEM_START` comparison as absurd
/// with the current constants; we may modify GUEST_MEM_START in the future.
#[allow(clippy::absurd_extreme_comparisons)]
pub fn free_mem_address(&self, base: u64, size: u64) -> Result<(), ResourceError> {
if base.checked_add(size).is_none()
|| base < GUEST_MEM_START
|| base + size > *GUEST_MEM_END
{
return Err(ResourceError::InvalidResourceRange(
"MEM Address".to_string(),
));
}
let key = Range::new(base, base + size - 1);
// Safe to unwrap() because we don't expect poisoned lock here.
self.mem_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate a kvm memory slot number.
///
/// Allocate the specified slot if `fixed` contains a slot number.
pub fn allocate_kvm_mem_slot(&self, size: u32, fixed: Option<u32>) -> Option<u32> {
let mut constraint = Constraint::new(size);
if let Some(v) = fixed {
constraint.min = v as u64;
constraint.max = v as u64;
}
// Safe to unwrap() because we don't expect poisoned lock here.
let mut kvm_mem_slot_pool = self.kvm_mem_slot_pool.lock().unwrap();
let key = kvm_mem_slot_pool.allocate(&constraint);
if let Some(k) = key.as_ref() {
kvm_mem_slot_pool.update(k, ());
}
key.map(|v| v.min as u32)
}
/// Free a kvm memory slot number.
pub fn free_kvm_mem_slot(&self, slot: u32) -> Result<(), ResourceError> {
let key = Range::new(slot, slot);
// Safe to unwrap() because we don't expect poisoned lock here.
self.kvm_mem_slot_pool.lock().unwrap().free(&key);
Ok(())
}
/// Allocate requested resources for a device.
pub fn allocate_device_resources(
&self,
requests: &[ResourceConstraint],
shared_irq: bool,
) -> std::result::Result<DeviceResources, ResourceError> {
let mut resources = DeviceResources::new();
for resource in requests.iter() {
let res = match resource {
ResourceConstraint::PioAddress { range, align, size } => {
let mut constraint = Constraint::new(*size).align(*align);
if let Some(r) = range {
constraint.min = r.0 as u64;
constraint.max = r.1 as u64;
}
match self.allocate_pio_address(&constraint) {
Some(base) => Resource::PioAddressRange {
base: base as u16,
size: *size,
},
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
ResourceConstraint::MmioAddress { range, align, size } => {
let mut constraint = Constraint::new(*size).align(*align);
if let Some(r) = range {
constraint.min = r.0;
constraint.max = r.1;
}
match self.allocate_mmio_address(&constraint) {
Some(base) => Resource::MmioAddressRange { base, size: *size },
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
ResourceConstraint::MemAddress { range, align, size } => {
let mut constraint = Constraint::new(*size).align(*align);
if let Some(r) = range {
constraint.min = r.0;
constraint.max = r.1;
}
match self.allocate_mem_address(&constraint) {
Some(base) => Resource::MemAddressRange { base, size: *size },
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
ResourceConstraint::LegacyIrq { irq } => {
match self.allocate_legacy_irq(shared_irq, *irq) {
Some(v) => Resource::LegacyIrq(v),
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
ResourceConstraint::PciMsiIrq { size } => {
match self.allocate_msi_irq_aligned(*size) {
Some(base) => Resource::MsiIrq {
ty: MsiIrqType::PciMsi,
base,
size: *size,
},
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
ResourceConstraint::PciMsixIrq { size } => match self.allocate_msi_irq(*size) {
Some(base) => Resource::MsiIrq {
ty: MsiIrqType::PciMsix,
base,
size: *size,
},
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
},
ResourceConstraint::GenericIrq { size } => match self.allocate_msi_irq(*size) {
Some(base) => Resource::MsiIrq {
ty: MsiIrqType::GenericMsi,
base,
size: *size,
},
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
},
ResourceConstraint::KvmMemSlot { slot, size } => {
match self.allocate_kvm_mem_slot(*size, *slot) {
Some(v) => Resource::KvmMemSlot(v),
None => {
if let Err(e) = self.free_device_resources(&resources) {
return Err(e);
} else {
return Err(ResourceError::NoAvailResource);
}
}
}
}
};
resources.append(res);
}
Ok(resources)
}
/// Free resources allocated for a device.
pub fn free_device_resources(&self, resources: &DeviceResources) -> Result<(), ResourceError> {
for res in resources.iter() {
let result = match res {
Resource::PioAddressRange { base, size } => self.free_pio_address(*base, *size),
Resource::MmioAddressRange { base, size } => self.free_mmio_address(*base, *size),
Resource::MemAddressRange { base, size } => self.free_mem_address(*base, *size),
Resource::LegacyIrq(base) => self.free_legacy_irq(*base),
Resource::MsiIrq { ty: _, base, size } => self.free_msi_irq(*base, *size),
Resource::KvmMemSlot(slot) => self.free_kvm_mem_slot(*slot),
Resource::MacAddresss(_) => Ok(()),
};
if result.is_err() {
return result;
}
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_allocate_legacy_irq() {
let mgr = ResourceManager::new(None);
// Allocate/free shared IRQs multiple times.
assert_eq!(mgr.allocate_legacy_irq(true, None).unwrap(), SHARED_IRQ);
assert_eq!(mgr.allocate_legacy_irq(true, None).unwrap(), SHARED_IRQ);
mgr.free_legacy_irq(SHARED_IRQ);
mgr.free_legacy_irq(SHARED_IRQ);
mgr.free_legacy_irq(SHARED_IRQ);
// Allocate specified IRQs.
assert_eq!(
mgr.allocate_legacy_irq(false, Some(LEGACY_IRQ_BASE + 10))
.unwrap(),
LEGACY_IRQ_BASE + 10
);
mgr.free_legacy_irq(LEGACY_IRQ_BASE + 10);
assert_eq!(
mgr.allocate_legacy_irq(false, Some(LEGACY_IRQ_BASE + 10))
.unwrap(),
LEGACY_IRQ_BASE + 10
);
assert!(mgr
.allocate_legacy_irq(false, Some(LEGACY_IRQ_BASE + 10))
.is_none());
assert!(mgr.allocate_legacy_irq(false, None).is_some());
assert!(mgr
.allocate_legacy_irq(false, Some(LEGACY_IRQ_BASE - 1))
.is_none());
assert!(mgr
.allocate_legacy_irq(false, Some(LEGACY_IRQ_MAX + 1))
.is_none());
assert!(mgr.allocate_legacy_irq(false, Some(SHARED_IRQ)).is_none());
}
#[test]
fn test_invalid_free_legacy_irq() {
let mgr = ResourceManager::new(None);
assert_eq!(
mgr.free_legacy_irq(LEGACY_IRQ_MAX + 1),
Err(ResourceError::InvalidResourceRange(
"Legacy IRQ".to_string(),
))
);
}
#[test]
fn test_allocate_msi_irq() {
let mgr = ResourceManager::new(None);
let msi = mgr.allocate_msi_irq(3).unwrap();
mgr.free_msi_irq(msi, 3);
let msi = mgr.allocate_msi_irq(3).unwrap();
mgr.free_msi_irq(msi, 3);
let irq = mgr.allocate_msi_irq_aligned(8).unwrap();
assert_eq!(irq & 0x7, 0);
mgr.free_msi_irq(msi, 8);
let irq = mgr.allocate_msi_irq_aligned(8).unwrap();
assert_eq!(irq & 0x7, 0);
let irq = mgr.allocate_msi_irq_aligned(512).unwrap();
assert_eq!(irq, 512);
mgr.free_msi_irq(irq, 512);
let irq = mgr.allocate_msi_irq_aligned(512).unwrap();
assert_eq!(irq, 512);
assert!(mgr.allocate_msi_irq(4099).is_none());
}
#[test]
fn test_invalid_free_msi_irq() {
let mgr = ResourceManager::new(None);
assert_eq!(
mgr.free_msi_irq(MSI_IRQ_MAX, 3),
Err(ResourceError::InvalidResourceRange("MSI IRQ".to_string()))
);
}
#[test]
fn test_allocate_pio_addr() {
let mgr = ResourceManager::new(None);
assert!(mgr.allocate_pio_address_simple(10).is_some());
let mut requests = vec![
ResourceConstraint::PioAddress {
range: None,
align: 0x1000,
size: 0x2000,
},
ResourceConstraint::PioAddress {
range: Some((0x8000, 0x9000)),
align: 0x1000,
size: 0x1000,
},
ResourceConstraint::PioAddress {
range: Some((0x9000, 0xa000)),
align: 0x1000,
size: 0x1000,
},
ResourceConstraint::PioAddress {
range: Some((0xb000, 0xc000)),
align: 0x1000,
size: 0x1000,
},
];
let resources = mgr.allocate_device_resources(&requests, false).unwrap();
mgr.free_device_resources(&resources);
let resources = mgr.allocate_device_resources(&requests, false).unwrap();
mgr.free_device_resources(&resources);
requests.push(ResourceConstraint::PioAddress {
range: Some((0xc000, 0xc000)),
align: 0x1000,
size: 0x1000,
});
assert!(mgr.allocate_device_resources(&requests, false).is_err());
let resources = mgr
.allocate_device_resources(&requests[0..requests.len() - 1], false)
.unwrap();
mgr.free_device_resources(&resources);
}
#[test]
fn test_invalid_free_pio_addr() {
let mgr = ResourceManager::new(None);
assert_eq!(
mgr.free_pio_address(u16::MAX, 3),
Err(ResourceError::InvalidResourceRange(
"PIO Address".to_string(),
))
);
}
#[test]
fn test_allocate_kvm_mem_slot() {
let mgr = ResourceManager::new(None);
assert_eq!(mgr.allocate_kvm_mem_slot(1, None).unwrap(), 0);
assert_eq!(mgr.allocate_kvm_mem_slot(1, Some(200)).unwrap(), 200);
mgr.free_kvm_mem_slot(200);
assert_eq!(mgr.allocate_kvm_mem_slot(1, Some(200)).unwrap(), 200);
assert_eq!(
mgr.allocate_kvm_mem_slot(1, Some(KVM_USER_MEM_SLOTS))
.unwrap(),
KVM_USER_MEM_SLOTS
);
assert!(mgr
.allocate_kvm_mem_slot(1, Some(KVM_USER_MEM_SLOTS + 1))
.is_none());
}
#[test]
fn test_allocate_mmio_address() {
let mgr = ResourceManager::new(None);
#[cfg(target_arch = "x86_64")]
{
// Can't allocate from reserved region
let constraint = Constraint::new(0x100_0000u64)
.min(0x1_0000_0000u64 - 0x200_0000u64)
.max(0xffff_ffffu64);
assert!(mgr.allocate_mmio_address(&constraint).is_none());
}
let constraint = Constraint::new(0x100_0000u64).min(0x1_0000_0000u64 - 0x200_0000u64);
assert!(mgr.allocate_mmio_address(&constraint).is_some());
#[cfg(target_arch = "x86_64")]
{
// Can't allocate from reserved region
let constraint = Constraint::new(0x100_0000u64)
.min(0x1_0000_0000u64 - 0x200_0000u64)
.max(0xffff_ffffu64);
assert!(mgr.allocate_mem_address(&constraint).is_none());
}
#[cfg(target_arch = "aarch64")]
{
let constraint = Constraint::new(0x200_0000u64)
.min(0x1_0000_0000u64 - 0x200_0000u64)
.max(0xffff_fffeu64);
assert!(mgr.allocate_mem_address(&constraint).is_none());
}
let constraint = Constraint::new(0x100_0000u64).min(0x1_0000_0000u64 - 0x200_0000u64);
assert!(mgr.allocate_mem_address(&constraint).is_some());
}
#[test]
#[should_panic]
fn test_allocate_duplicate_memory() {
let mgr = ResourceManager::new(None);
let constraint_1 = Constraint::new(0x100_0000u64)
.min(0x1_0000_0000u64)
.max(0x1_0000_0000u64 + 0x100_0000u64);
let constraint_2 = Constraint::new(0x100_0000u64)
.min(0x1_0000_0000u64)
.max(0x1_0000_0000u64 + 0x100_0000u64);
assert!(mgr.allocate_mem_address(&constraint_1).is_some());
assert!(mgr.allocate_mem_address(&constraint_2).is_some());
}
}
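A minimal sketch, not part of the diff, of how a device backend might use the `ResourceManager` above to obtain an MMIO window and a dedicated legacy IRQ; the 4 KiB size, the choice not to share the IRQ, and the surrounding `main` are illustrative assumptions.

use dragonball::resource_manager::ResourceManager;

// Illustrative only: grab a 4 KiB, 4 KiB-aligned MMIO window plus a private
// legacy IRQ, returning None if either pool is exhausted.
fn assign_device_resources(mgr: &ResourceManager) -> Option<(u64, u32)> {
    let mmio_base = mgr.allocate_mmio_address_aligned(0x1000, 0x1000)?;
    let irq = mgr.allocate_legacy_irq(false, None)?;
    Some((mmio_base, irq))
}

fn main() {
    let mgr = ResourceManager::new(None);
    if let Some((base, irq)) = assign_device_resources(&mgr) {
        println!("device gets MMIO base 0x{:x}, IRQ {}", base, irq);
        // The owner later returns the ranges with free_mmio_address()/free_legacy_irq().
    }
}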

View File

@@ -0,0 +1,219 @@
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
use libc::{_exit, c_int, c_void, siginfo_t, SIGBUS, SIGSEGV, SIGSYS};
use log::error;
use vmm_sys_util::signal::register_signal_handler;
use crate::metric::{IncMetric, METRICS};
// The offset of `si_syscall` (offending syscall identifier) within the siginfo structure
// expressed as an `(u)int*`.
// Offset `6` for an `i32` field means that the needed information is located at `6 * sizeof(i32)`.
// See /usr/include/linux/signal.h for the C struct definition.
// See https://github.com/rust-lang/libc/issues/716 for why the offset is different in Rust.
const SI_OFF_SYSCALL: isize = 6;
const SYS_SECCOMP_CODE: i32 = 1;
extern "C" {
fn __libc_current_sigrtmin() -> c_int;
fn __libc_current_sigrtmax() -> c_int;
}
/// Gets current sigrtmin
pub fn sigrtmin() -> c_int {
unsafe { __libc_current_sigrtmin() }
}
/// Gets current sigrtmax
pub fn sigrtmax() -> c_int {
unsafe { __libc_current_sigrtmax() }
}
/// Signal handler for `SIGSYS`.
///
/// Increments the `seccomp.num_faults` metric, logs an error message and terminates the process
/// with a specific exit code.
extern "C" fn sigsys_handler(num: c_int, info: *mut siginfo_t, _unused: *mut c_void) {
// Safe because we're just reading some fields from a supposedly valid argument.
let si_signo = unsafe { (*info).si_signo };
let si_code = unsafe { (*info).si_code };
// Sanity check. The condition should never be true.
if num != si_signo || num != SIGSYS || si_code != SYS_SECCOMP_CODE as i32 {
// Safe because we're terminating the process anyway.
unsafe { _exit(i32::from(super::EXIT_CODE_UNEXPECTED_ERROR)) };
}
// Other signals which might do async unsafe things incompatible with the rest of this
// function are blocked due to the sa_mask used when registering the signal handler.
let syscall = unsafe { *(info as *const i32).offset(SI_OFF_SYSCALL) as usize };
// SIGSYS is triggered when bad syscalls are detected. num_faults is only incremented when SIGSYS
// is detected, so it effectively counts only bad syscalls.
METRICS.seccomp.num_faults.inc();
error!(
"Shutting down VM after intercepting a bad syscall ({}).",
syscall
);
// Safe because we're terminating the process anyway. We don't actually do anything when
// running unit tests.
#[cfg(not(test))]
unsafe {
_exit(i32::from(super::EXIT_CODE_BAD_SYSCALL))
};
}
/// Signal handler for `SIGBUS` and `SIGSEGV`.
///
/// Logs an error message and terminates the process with a specific exit code.
extern "C" fn sigbus_sigsegv_handler(num: c_int, info: *mut siginfo_t, _unused: *mut c_void) {
// Safe because we're just reading some fields from a supposedly valid argument.
let si_signo = unsafe { (*info).si_signo };
let si_code = unsafe { (*info).si_code };
// Sanity check. The condition should never be true.
if num != si_signo || (num != SIGBUS && num != SIGSEGV) {
// Safe because we're terminating the process anyway.
unsafe { _exit(i32::from(super::EXIT_CODE_UNEXPECTED_ERROR)) };
}
// Other signals which might do async unsafe things incompatible with the rest of this
// function are blocked due to the sa_mask used when registering the signal handler.
match si_signo {
SIGBUS => METRICS.signals.sigbus.inc(),
SIGSEGV => METRICS.signals.sigsegv.inc(),
_ => (),
}
error!(
"Shutting down VM after intercepting signal {}, code {}.",
si_signo, si_code
);
// Safe because we're terminating the process anyway. We don't actually do anything when
// running unit tests.
#[cfg(not(test))]
unsafe {
_exit(i32::from(match si_signo {
SIGBUS => super::EXIT_CODE_SIGBUS,
SIGSEGV => super::EXIT_CODE_SIGSEGV,
_ => super::EXIT_CODE_UNEXPECTED_ERROR,
}))
};
}
/// Registers all the required signal handlers.
///
/// Custom handlers are installed for: `SIGBUS`, `SIGSEGV`, `SIGSYS`.
pub fn register_signal_handlers() -> vmm_sys_util::errno::Result<()> {
// register_signal_handler installs a handler that runs on the current thread and interrupts
// whatever work that thread is doing, so we have to keep in mind that the registered
// signal handler must only perform async-signal-safe operations.
register_signal_handler(SIGSYS, sigsys_handler)?;
register_signal_handler(SIGBUS, sigbus_sigsegv_handler)?;
register_signal_handler(SIGSEGV, sigbus_sigsegv_handler)?;
Ok(())
}
#[cfg(test)]
mod tests {
use super::*;
use libc::{cpu_set_t, syscall};
use std::convert::TryInto;
use std::{mem, process, thread};
use seccompiler::{apply_filter, BpfProgram, SeccompAction, SeccompFilter};
// This function is used when running unit tests, so all the unsafes are safe.
fn cpu_count() -> usize {
let mut cpuset: cpu_set_t = unsafe { mem::zeroed() };
unsafe {
libc::CPU_ZERO(&mut cpuset);
}
let ret = unsafe {
libc::sched_getaffinity(
0,
mem::size_of::<cpu_set_t>(),
&mut cpuset as *mut cpu_set_t,
)
};
assert_eq!(ret, 0);
let mut num = 0;
for i in 0..libc::CPU_SETSIZE as usize {
if unsafe { libc::CPU_ISSET(i, &cpuset) } {
num += 1;
}
}
num
}
#[test]
fn test_signal_handler() {
let child = thread::spawn(move || {
assert!(register_signal_handlers().is_ok());
let filter = SeccompFilter::new(
vec![
(libc::SYS_brk, vec![]),
(libc::SYS_exit, vec![]),
(libc::SYS_futex, vec![]),
(libc::SYS_getpid, vec![]),
(libc::SYS_munmap, vec![]),
(libc::SYS_kill, vec![]),
(libc::SYS_rt_sigprocmask, vec![]),
(libc::SYS_rt_sigreturn, vec![]),
(libc::SYS_sched_getaffinity, vec![]),
(libc::SYS_set_tid_address, vec![]),
(libc::SYS_sigaltstack, vec![]),
(libc::SYS_write, vec![]),
]
.into_iter()
.collect(),
SeccompAction::Trap,
SeccompAction::Allow,
std::env::consts::ARCH.try_into().unwrap(),
)
.unwrap();
assert!(apply_filter(&TryInto::<BpfProgram>::try_into(filter).unwrap()).is_ok());
assert_eq!(METRICS.seccomp.num_faults.count(), 0);
// Call the blacklisted `SYS_mkdirat`.
unsafe { syscall(libc::SYS_mkdirat, "/foo/bar\0") };
// Call SIGBUS signal handler.
assert_eq!(METRICS.signals.sigbus.count(), 0);
unsafe {
syscall(libc::SYS_kill, process::id(), SIGBUS);
}
// Call SIGSEGV signal handler.
assert_eq!(METRICS.signals.sigsegv.count(), 0);
unsafe {
syscall(libc::SYS_kill, process::id(), SIGSEGV);
}
});
assert!(child.join().is_ok());
// Sanity check.
assert!(cpu_count() > 0);
// Kcov somehow messes with our handler getting the SIGSYS signal when a bad syscall
// is caught, so the following assertion no longer holds. Ideally, we'd have a surefire
// way of either preventing this behaviour, or detecting for certain whether this test is
// run by kcov or not. The best we could do so far is to look at the perceived number of
// available CPUs. Kcov seems to make a single CPU available to the process running the
// tests, so we use this as an heuristic to decide if we check the assertion.
if cpu_count() > 1 {
// The signal handler should let the program continue during unit tests.
assert!(METRICS.seccomp.num_faults.count() >= 1);
}
assert!(METRICS.signals.sigbus.count() >= 1);
assert!(METRICS.signals.sigsegv.count() >= 1);
}
}
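A minimal sketch, not part of the diff, of wiring these handlers in at process start, before any seccomp filter is applied; the placement in `main` is an assumption.

use dragonball::signal_handler::{register_signal_handlers, sigrtmax, sigrtmin};

fn main() {
    // Install the SIGSYS/SIGBUS/SIGSEGV handlers before applying seccomp filters,
    // so a trapped syscall is reported via the metrics and exit codes above.
    register_signal_handlers().expect("failed to register signal handlers");
    // The real-time signal range remains available for per-vCPU kick signals.
    assert!(sigrtmin() <= sigrtmax());
    // ... continue with VMM setup ...
}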

View File

@@ -0,0 +1,123 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::ops::Deref;
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use crate::IoManagerCached;
use dbs_arch::regs;
use dbs_boot::get_fdt_addr;
use dbs_utils::time::TimestampUs;
use kvm_ioctls::{VcpuFd, VmFd};
use vm_memory::{Address, GuestAddress, GuestAddressSpace};
use vmm_sys_util::eventfd::EventFd;
use crate::address_space_manager::GuestAddressSpaceImpl;
use crate::vcpu::vcpu_impl::{Result, Vcpu, VcpuError, VcpuStateEvent};
use crate::vcpu::VcpuConfig;
#[allow(unused)]
impl Vcpu {
/// Constructs a new VCPU for `vm`.
///
/// # Arguments
///
/// * `id` - Represents the CPU number between [0, max vcpus).
/// * `vcpu_fd` - The kvm `VcpuFd` for the vcpu.
/// * `io_mgr` - The io-manager used to access port-io and mmio devices.
/// * `exit_evt` - An `EventFd` that will be written into when this vcpu
/// exits.
/// * `vcpu_state_event` - The eventfd used to notify the vmm that the state of some
/// vcpu should change.
/// * `vcpu_state_sender` - The channel to send state change message from
/// vcpu thread to vmm thread.
/// * `create_ts` - A timestamp used by the vcpu to calculate its lifetime.
/// * `support_immediate_exit` - whether kvm supports the immediate_exit flag.
pub fn new_aarch64(
id: u8,
vcpu_fd: Arc<VcpuFd>,
io_mgr: IoManagerCached,
exit_evt: EventFd,
vcpu_state_event: EventFd,
vcpu_state_sender: Sender<VcpuStateEvent>,
create_ts: TimestampUs,
support_immediate_exit: bool,
) -> Result<Self> {
let (event_sender, event_receiver) = channel();
let (response_sender, response_receiver) = channel();
Ok(Vcpu {
fd: vcpu_fd,
id,
io_mgr,
create_ts,
event_receiver,
event_sender: Some(event_sender),
response_receiver: Some(response_receiver),
response_sender,
vcpu_state_event,
vcpu_state_sender,
support_immediate_exit,
mpidr: 0,
exit_evt,
})
}
/// Configures an aarch64 specific vcpu.
///
/// # Arguments
///
/// * `vcpu_config` - vCPU config for this vCPU status
/// * `vm_fd` - The kvm `VmFd` for this microvm.
/// * `vm_as` - The guest memory address space used by this microvm.
/// * `kernel_load_addr` - Offset from `guest_mem` at which the kernel is loaded.
/// * `_pgtable_addr` - pgtable address for ap vcpu (not used in aarch64)
pub fn configure(
&mut self,
_vcpu_config: &VcpuConfig,
vm_fd: &VmFd,
vm_as: &GuestAddressSpaceImpl,
kernel_load_addr: Option<GuestAddress>,
_pgtable_addr: Option<GuestAddress>,
) -> Result<()> {
let mut kvi: kvm_bindings::kvm_vcpu_init = kvm_bindings::kvm_vcpu_init::default();
// This reads back the kernel's preferred target type.
vm_fd
.get_preferred_target(&mut kvi)
.map_err(VcpuError::VcpuArmPreferredTarget)?;
// We already checked that the capability is supported.
kvi.features[0] |= 1 << kvm_bindings::KVM_ARM_VCPU_PSCI_0_2;
// Non-boot cpus are powered off initially.
if self.id > 0 {
kvi.features[0] |= 1 << kvm_bindings::KVM_ARM_VCPU_POWER_OFF;
}
self.fd.vcpu_init(&kvi).map_err(VcpuError::VcpuArmInit)?;
if let Some(address) = kernel_load_addr {
regs::setup_regs(
&self.fd,
self.id,
address.raw_value(),
get_fdt_addr(vm_as.memory().deref()),
)
.map_err(VcpuError::REGSConfiguration)?;
}
self.mpidr = regs::read_mpidr(&self.fd).map_err(VcpuError::REGSConfiguration)?;
Ok(())
}
/// Gets the MPIDR register value.
pub fn get_mpidr(&self) -> u64 {
self.mpidr
}
}

View File

@@ -0,0 +1,34 @@
// Copyright (C) 2022 Alibaba Cloud Computing. All rights reserved.
// Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
//
// SPDX-License-Identifier: Apache-2.0
mod sm;
mod vcpu_impl;
mod vcpu_manager;
#[cfg(target_arch = "x86_64")]
use dbs_arch::cpuid::VpmuFeatureLevel;
pub use vcpu_manager::{VcpuManager, VcpuManagerError};
/// vcpu config collection
pub struct VcpuConfig {
/// initial vcpu count
pub boot_vcpu_count: u8,
/// max vcpu count for hotplug
pub max_vcpu_count: u8,
/// threads per core for cpu topology information
pub threads_per_core: u8,
/// cores per die for cpu topology information
pub cores_per_die: u8,
/// dies per socket for cpu topology information
pub dies_per_socket: u8,
/// socket number for cpu topology information
pub sockets: u8,
/// if vpmu feature is Disabled, it means vpmu feature is off (by default)
/// if vpmu feature is LimitedlyEnabled, it means minimal vpmu counters are supported (cycles and instructions)
/// if vpmu feature is FullyEnabled, it means all vpmu counters are supported
#[cfg(target_arch = "x86_64")]
pub vpmu_feature: VpmuFeatureLevel,
}
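A minimal sketch, not part of the diff, of filling in `VcpuConfig` for a guest that boots with 2 vCPUs and can hot-plug up to 4; the topology numbers are illustrative, and `VpmuFeatureLevel::Disabled` follows the variant names described in the comment above.

use dragonball::vcpu::VcpuConfig;
#[cfg(target_arch = "x86_64")]
use dbs_arch::cpuid::VpmuFeatureLevel;

// Illustrative only: a small SMP topology (2 sockets x 2 cores) with vPMU left off.
fn boot_vcpu_config() -> VcpuConfig {
    VcpuConfig {
        boot_vcpu_count: 2,
        max_vcpu_count: 4,
        threads_per_core: 1,
        cores_per_die: 2,
        dies_per_socket: 1,
        sockets: 2,
        #[cfg(target_arch = "x86_64")]
        vpmu_feature: VpmuFeatureLevel::Disabled,
    }
}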

View File

@@ -0,0 +1,149 @@
// Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
use std::ops::Deref;
/// Simple abstraction of a state machine.
///
/// `StateMachine<T>` is a wrapper over `T` that also encodes state information for `T`.
///
/// Each state for `T` is represented by a `StateFn<T>` which is a function that acts as
/// the state handler for that particular state of `T`.
///
/// `StateFn<T>` returns exactly one other `StateMachine<T>` thus each state gets clearly
/// defined transitions to other states.
pub struct StateMachine<T> {
function: StateFn<T>,
end_state: bool,
}
/// Type representing a state handler of a `StateMachine<T>` machine. Each state handler
/// is a function from `T` that handles a specific state of `T`.
type StateFn<T> = fn(&mut T) -> StateMachine<T>;
impl<T> StateMachine<T> {
/// Creates a new state wrapper.
///
/// # Arguments
///
/// `function` - the state handler for this state.
/// `end_state` - whether this state is final.
pub fn new(function: StateFn<T>, end_state: bool) -> StateMachine<T> {
StateMachine {
function,
end_state,
}
}
/// Creates a new state wrapper that has further possible transitions.
///
/// # Arguments
///
/// `function` - the state handler for this state.
pub fn next(function: StateFn<T>) -> StateMachine<T> {
StateMachine::new(function, false)
}
/// Creates a new state wrapper that has no further transitions. The state machine
/// will finish after running this handler.
///
/// # Arguments
///
/// `function` - the state handler for this last state.
pub fn finish(function: StateFn<T>) -> StateMachine<T> {
StateMachine::new(function, true)
}
/// Runs a state machine for `T` starting from the provided state.
///
/// # Arguments
///
/// `machine` - a mutable reference to the object running through the various states.
/// `starting_state_fn` - a `fn(&mut T) -> StateMachine<T>` that should be the handler for
/// the initial state.
pub fn run(machine: &mut T, starting_state_fn: StateFn<T>) {
// Start off in the `starting_state` state.
let mut sf = StateMachine::new(starting_state_fn, false);
// While current state is not a final/end state, keep churning.
while !sf.end_state {
// Run the current state handler, and get the next one.
sf = sf(machine);
}
}
}
// Implement Deref of `StateMachine<T>` so that we can directly call its underlying state handler.
impl<T> Deref for StateMachine<T> {
type Target = StateFn<T>;
fn deref(&self) -> &Self::Target {
&self.function
}
}
#[cfg(test)]
mod tests {
use super::*;
// DummyMachine with states `s1`, `s2` and `s3`.
struct DummyMachine {
private_data_s1: bool,
private_data_s2: bool,
private_data_s3: bool,
}
impl DummyMachine {
fn new() -> Self {
DummyMachine {
private_data_s1: false,
private_data_s2: false,
private_data_s3: false,
}
}
// DummyMachine functions here.
// Simple state-machine: start->s1->s2->s3->done.
fn run(&mut self) {
// Verify the machine has not run yet.
assert!(!self.private_data_s1);
assert!(!self.private_data_s2);
assert!(!self.private_data_s3);
// Run the state-machine.
StateMachine::run(self, Self::s1);
// Verify the machine went through all states.
assert!(self.private_data_s1);
assert!(self.private_data_s2);
assert!(self.private_data_s3);
}
fn s1(&mut self) -> StateMachine<Self> {
// Verify private data mutates along with the states.
assert!(!self.private_data_s1);
self.private_data_s1 = true;
StateMachine::next(Self::s2)
}
fn s2(&mut self) -> StateMachine<Self> {
// Verify private data mutates along with the states.
assert!(!self.private_data_s2);
self.private_data_s2 = true;
StateMachine::next(Self::s3)
}
fn s3(&mut self) -> StateMachine<Self> {
// Verify private data mutates along with the states.
assert!(!self.private_data_s3);
self.private_data_s3 = true;
// The machine ends here, adding `s1` as next state to validate this.
StateMachine::finish(Self::s1)
}
}
#[test]
fn test_sm() {
let mut machine = DummyMachine::new();
machine.run();
}
}
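Beyond the `DummyMachine` test, the same pattern covers data-driven transitions. A minimal sketch, not part of the diff, written as if inside the `vcpu` module where the private `sm::StateMachine` type is visible:

// Illustrative only: a one-state machine that re-enters `tick` until a counter
// reaches 3, then returns a finishing state (whose handler is never invoked).
struct Counter {
    n: u32,
}

impl Counter {
    fn tick(&mut self) -> StateMachine<Self> {
        self.n += 1;
        if self.n < 3 {
            StateMachine::next(Self::tick)
        } else {
            StateMachine::finish(Self::tick)
        }
    }
}

fn run_counter() -> u32 {
    let mut c = Counter { n: 0 };
    StateMachine::run(&mut c, Counter::tick);
    c.n // == 3
}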

View File

@@ -0,0 +1,975 @@
// Copyright (C) 2019-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
//! The per-vCPU implementation.
use std::cell::Cell;
use std::result;
use std::sync::atomic::{fence, Ordering};
use std::sync::mpsc::{Receiver, Sender, TryRecvError};
use std::sync::{Arc, Barrier};
use std::thread;
use dbs_utils::time::TimestampUs;
use kvm_bindings::{KVM_SYSTEM_EVENT_RESET, KVM_SYSTEM_EVENT_SHUTDOWN};
use kvm_ioctls::{VcpuExit, VcpuFd};
use libc::{c_int, c_void, siginfo_t};
use log::{error, info};
use seccompiler::{apply_filter, BpfProgram, Error as SecError};
use vmm_sys_util::eventfd::EventFd;
use vmm_sys_util::signal::{register_signal_handler, Killable};
use super::sm::StateMachine;
use crate::metric::{IncMetric, METRICS};
use crate::signal_handler::sigrtmin;
use crate::IoManagerCached;
#[cfg(target_arch = "x86_64")]
#[path = "x86_64.rs"]
mod x86_64;
#[cfg(target_arch = "aarch64")]
#[path = "aarch64.rs"]
mod aarch64;
#[cfg(target_arch = "x86_64")]
const MAGIC_IOPORT_BASE: u16 = 0xdbdb;
#[cfg(target_arch = "x86_64")]
const MAGIC_IOPORT_DEBUG_INFO: u16 = MAGIC_IOPORT_BASE;
/// Signal number (SIGRTMIN) used to kick Vcpus.
pub const VCPU_RTSIG_OFFSET: i32 = 0;
#[cfg(target_arch = "x86_64")]
/// Errors associated with the wrappers over KVM ioctls.
#[derive(Debug, thiserror::Error)]
pub enum VcpuError {
/// Failed to signal Vcpu.
#[error("cannot signal the vCPU thread")]
SignalVcpu(#[source] vmm_sys_util::errno::Error),
/// Cannot open the vCPU file descriptor.
#[error("cannot open the vCPU file descriptor")]
VcpuFd(#[source] kvm_ioctls::Error),
/// Cannot spawn a new vCPU thread.
#[error("cannot spawn vCPU thread")]
VcpuSpawn(#[source] std::io::Error),
/// Cannot cleanly initialize vCPU TLS.
#[error("cannot cleanly initialize TLS for vCPU")]
VcpuTlsInit,
/// Vcpu not present in TLS.
#[error("vCPU not present in the TLS")]
VcpuTlsNotPresent,
/// Unexpected KVM_RUN exit reason
#[error("Unexpected KVM_RUN exit reason")]
VcpuUnhandledKvmExit,
/// Pause vcpu failed
#[error("failed to pause vcpus")]
PauseFailed,
/// Kvm Ioctl Error
#[error("failure in issuing KVM ioctl command")]
Kvm(#[source] kvm_ioctls::Error),
/// Msr error
#[error("failure to deal with MSRs")]
Msr(vmm_sys_util::fam::Error),
/// A call to cpuid instruction failed on x86_64.
#[error("failure while configuring CPUID for virtual CPU on x86_64")]
CpuId(dbs_arch::cpuid::Error),
/// Error configuring the floating point related registers on x86_64.
#[error("failure while configuring the floating point related registers on x86_64")]
FPUConfiguration(dbs_arch::regs::Error),
/// Cannot set the local interruption due to bad configuration on x86_64.
#[error("cannot set the local interruption due to bad configuration on x86_64")]
LocalIntConfiguration(dbs_arch::interrupts::Error),
/// Error configuring the MSR registers on x86_64.
#[error("failure while configuring the MSR registers on x86_64")]
MSRSConfiguration(dbs_arch::regs::Error),
/// Error configuring the general purpose registers on x86_64.
#[error("failure while configuring the general purpose registers on x86_64")]
REGSConfiguration(dbs_arch::regs::Error),
/// Error configuring the special registers on x86_64.
#[error("failure while configuring the special registers on x86_64")]
SREGSConfiguration(dbs_arch::regs::Error),
/// Error configuring the page table on x86_64.
#[error("failure while configuring the page table on x86_64")]
PageTable(dbs_boot::Error),
/// The call to KVM_SET_CPUID2 failed on x86_64.
#[error("failure while calling KVM_SET_CPUID2 on x86_64")]
SetSupportedCpusFailed(#[source] kvm_ioctls::Error),
}
#[cfg(target_arch = "aarch64")]
/// Errors associated with the wrappers over KVM ioctls.
#[derive(Debug, thiserror::Error)]
pub enum VcpuError {
/// Failed to signal Vcpu.
#[error("cannot signal the vCPU thread")]
SignalVcpu(#[source] vmm_sys_util::errno::Error),
/// Cannot open the vCPU file descriptor.
#[error("cannot open the vCPU file descriptor")]
VcpuFd(#[source] kvm_ioctls::Error),
/// Cannot spawn a new vCPU thread.
#[error("cannot spawn vCPU thread")]
VcpuSpawn(#[source] std::io::Error),
/// Cannot cleanly initialize vCPU TLS.
#[error("cannot cleanly initialize TLS for vCPU")]
VcpuTlsInit,
/// Vcpu not present in TLS.
#[error("vCPU not present in the TLS")]
VcpuTlsNotPresent,
/// Unexpected KVM_RUN exit reason
#[error("Unexpected KVM_RUN exit reason")]
VcpuUnhandledKvmExit,
/// Pause vcpu failed
#[error("failed to pause vcpus")]
PauseFailed,
/// Kvm Ioctl Error
#[error("failure in issuing KVM ioctl command")]
Kvm(#[source] kvm_ioctls::Error),
/// Msr error
#[error("failure to deal with MSRs")]
Msr(vmm_sys_util::fam::Error),
#[cfg(target_arch = "aarch64")]
/// Error configuring the general purpose aarch64 registers on aarch64.
#[error("failure while configuring the general purpose registers on aarch64")]
REGSConfiguration(dbs_arch::regs::Error),
#[cfg(target_arch = "aarch64")]
/// Error setting up the global interrupt controller on aarch64.
#[error("failure while setting up the global interrupt controller on aarch64")]
SetupGIC(dbs_arch::gic::Error),
#[cfg(target_arch = "aarch64")]
/// Error getting the Vcpu preferred target on aarch64.
#[error("failure while getting the vCPU preferred target on aarch64")]
VcpuArmPreferredTarget(kvm_ioctls::Error),
#[cfg(target_arch = "aarch64")]
/// Error doing vCPU Init on aarch64.
#[error("failure while doing vCPU init on aarch64")]
VcpuArmInit(kvm_ioctls::Error),
}
/// Result for Vcpu related operations.
pub type Result<T> = result::Result<T, VcpuError>;
/// List of events that the Vcpu can receive.
#[derive(Debug)]
pub enum VcpuEvent {
/// Kill the Vcpu.
Exit,
/// Pause the Vcpu.
Pause,
/// Event that should resume the Vcpu.
Resume,
/// Get vcpu thread tid
Gettid,
/// Event to revalidate vcpu IoManager cache
RevalidateCache,
}
/// List of responses that the Vcpu reports.
pub enum VcpuResponse {
/// Vcpu is paused.
Paused,
/// Vcpu is resumed.
Resumed,
/// Vcpu index and thread tid.
Tid(u8, u32),
/// Requested Vcpu operation is not allowed.
NotAllowed,
/// Requested action encountered an error
Error(VcpuError),
/// Vcpu IoManager cache is revalidated
CacheRevalidated,
}
/// List of events that the vcpu_state_sender can send.
pub enum VcpuStateEvent {
/// (result, response) for hotplug, result 0 means failure, 1 means success.
Hotplug((i32, u32)),
}
/// Wrapper over vCPU that hides the underlying interactions with the vCPU thread.
pub struct VcpuHandle {
event_sender: Sender<VcpuEvent>,
response_receiver: Receiver<VcpuResponse>,
vcpu_thread: thread::JoinHandle<()>,
}
impl VcpuHandle {
/// Send event to vCPU thread
pub fn send_event(&self, event: VcpuEvent) -> Result<()> {
// Use expect() to crash if the other thread closed this channel.
self.event_sender
.send(event)
.expect("event sender channel closed on vcpu end.");
// Kick the vCPU so it picks up the message.
self.vcpu_thread
.kill(sigrtmin() + VCPU_RTSIG_OFFSET)
.map_err(VcpuError::SignalVcpu)?;
Ok(())
}
/// Receive response from vcpu thread
pub fn response_receiver(&self) -> &Receiver<VcpuResponse> {
&self.response_receiver
}
#[allow(dead_code)]
/// Join the vcpu thread
pub fn join_vcpu_thread(self) -> thread::Result<()> {
self.vcpu_thread.join()
}
}
#[derive(PartialEq)]
enum VcpuEmulation {
Handled,
Interrupted,
Stopped,
}
/// A wrapper around creating and using a kvm-based VCPU.
pub struct Vcpu {
// vCPU fd used by the vCPU
fd: Arc<VcpuFd>,
// vCPU id info
id: u8,
// Io manager Cached for facilitating IO operations
io_mgr: IoManagerCached,
// Records vCPU create time stamp
create_ts: TimestampUs,
// The receiving end of events channel owned by the vcpu side.
event_receiver: Receiver<VcpuEvent>,
// The transmitting end of the events channel which will be given to the handler.
event_sender: Option<Sender<VcpuEvent>>,
// The receiving end of the responses channel which will be given to the handler.
response_receiver: Option<Receiver<VcpuResponse>>,
// The transmitting end of the responses channel owned by the vcpu side.
response_sender: Sender<VcpuResponse>,
// Event notifier for CPU hotplug.
// After aarch64 supports vcpu hotplug, the dead_code attribute should be removed
#[cfg_attr(target_arch = "aarch64", allow(dead_code))]
vcpu_state_event: EventFd,
// CPU hotplug events.
// After aarch64 supports vcpu hotplug, the dead_code attribute should be removed
#[cfg_attr(target_arch = "aarch64", allow(dead_code))]
vcpu_state_sender: Sender<VcpuStateEvent>,
// An `EventFd` that will be written into when this vcpu exits.
exit_evt: EventFd,
// Whether the kvm in use supports the immediate_exit flag.
support_immediate_exit: bool,
// CPUID information for the x86_64 CPU
#[cfg(target_arch = "x86_64")]
cpuid: kvm_bindings::CpuId,
/// Multiprocessor affinity register recorded for aarch64
#[cfg(target_arch = "aarch64")]
pub(crate) mpidr: u64,
}
// Using this for easier explicit type-casting to help IDEs interpret the code.
type VcpuCell = Cell<Option<*const Vcpu>>;
impl Vcpu {
thread_local!(static TLS_VCPU_PTR: VcpuCell = Cell::new(None));
/// Associates `self` with the current thread.
///
/// It is a prerequisite to successfully run `init_thread_local_data()` before using
/// `run_on_thread_local()` on the current thread.
/// This function will return an error if there already is a `Vcpu` present in the TLS.
fn init_thread_local_data(&mut self) -> Result<()> {
Self::TLS_VCPU_PTR.with(|cell: &VcpuCell| {
if cell.get().is_some() {
return Err(VcpuError::VcpuTlsInit);
}
cell.set(Some(self as *const Vcpu));
Ok(())
})
}
/// Deassociates `self` from the current thread.
///
/// Should be called if the current `self` had called `init_thread_local_data()` and
/// now needs to move to a different thread.
///
/// Fails if `self` was not previously associated with the current thread.
fn reset_thread_local_data(&mut self) -> Result<()> {
// Best-effort to clean up TLS. If the `Vcpu` was moved to another thread
// _before_ running this, then there is nothing we can do.
Self::TLS_VCPU_PTR.with(|cell: &VcpuCell| {
if let Some(vcpu_ptr) = cell.get() {
if vcpu_ptr == self as *const Vcpu {
Self::TLS_VCPU_PTR.with(|cell: &VcpuCell| cell.take());
return Ok(());
}
}
Err(VcpuError::VcpuTlsNotPresent)
})
}
/// Runs `func` for the `Vcpu` associated with the current thread.
///
/// It requires that `init_thread_local_data()` was run on this thread.
///
/// Fails if there is no `Vcpu` associated with the current thread.
///
/// # Safety
///
/// This is marked unsafe as it allows temporary aliasing by dereferencing a pointer to an
/// already borrowed `Vcpu`.
unsafe fn run_on_thread_local<F>(func: F) -> Result<()>
where
F: FnOnce(&Vcpu),
{
Self::TLS_VCPU_PTR.with(|cell: &VcpuCell| {
if let Some(vcpu_ptr) = cell.get() {
// Dereferencing here is safe since `TLS_VCPU_PTR` is populated/non-empty,
// and it is being cleared on `Vcpu::drop` so there is no dangling pointer.
let vcpu_ref: &Vcpu = &*vcpu_ptr;
func(vcpu_ref);
Ok(())
} else {
Err(VcpuError::VcpuTlsNotPresent)
}
})
}
/// Registers a signal handler which makes use of TLS and kvm immediate exit to
/// kick the vcpu running on the current thread, if there is one.
pub fn register_kick_signal_handler() {
extern "C" fn handle_signal(_: c_int, _: *mut siginfo_t, _: *mut c_void) {
// This is safe because it's temporarily aliasing the `Vcpu` object, but we are
// only reading `vcpu.fd` which does not change for the lifetime of the `Vcpu`.
unsafe {
let _ = Vcpu::run_on_thread_local(|vcpu| {
vcpu.fd.set_kvm_immediate_exit(1);
fence(Ordering::Release);
});
}
}
register_signal_handler(sigrtmin() + VCPU_RTSIG_OFFSET, handle_signal)
.expect("Failed to register vcpu signal handler");
}
/// Returns the cpu index as seen by the guest OS.
pub fn cpu_index(&self) -> u8 {
self.id
}
/// Moves the vcpu to its own thread and constructs a VcpuHandle.
/// The handle can be used to control the remote vcpu.
pub fn start_threaded(
mut self,
seccomp_filter: BpfProgram,
barrier: Arc<Barrier>,
) -> Result<VcpuHandle> {
let event_sender = self.event_sender.take().unwrap();
let response_receiver = self.response_receiver.take().unwrap();
let vcpu_thread = thread::Builder::new()
.name(format!("db_vcpu{}", self.cpu_index()))
.spawn(move || {
self.init_thread_local_data()
.expect("Cannot cleanly initialize vcpu TLS.");
barrier.wait();
self.run(seccomp_filter);
})
.map_err(VcpuError::VcpuSpawn)?;
Ok(VcpuHandle {
event_sender,
response_receiver,
vcpu_thread,
})
}
/// Extract the vcpu running logic for test mocking.
#[cfg(not(test))]
pub fn emulate(fd: &VcpuFd) -> std::result::Result<VcpuExit<'_>, kvm_ioctls::Error> {
fd.run()
}
/// Runs the vCPU in KVM context and handles the kvm exit reason.
///
/// Returns error or enum specifying whether emulation was handled or interrupted.
fn run_emulation(&mut self) -> Result<VcpuEmulation> {
match Vcpu::emulate(&self.fd) {
Ok(run) => match run {
#[cfg(target_arch = "x86_64")]
VcpuExit::IoIn(addr, data) => {
let _ = self.io_mgr.pio_read(addr, data);
METRICS.vcpu.exit_io_in.inc();
Ok(VcpuEmulation::Handled)
}
#[cfg(target_arch = "x86_64")]
VcpuExit::IoOut(addr, data) => {
if !self.check_io_port_info(addr, data)? {
let _ = self.io_mgr.pio_write(addr, data);
}
METRICS.vcpu.exit_io_out.inc();
Ok(VcpuEmulation::Handled)
}
VcpuExit::MmioRead(addr, data) => {
let _ = self.io_mgr.mmio_read(addr, data);
METRICS.vcpu.exit_mmio_read.inc();
Ok(VcpuEmulation::Handled)
}
VcpuExit::MmioWrite(addr, data) => {
let _ = self.io_mgr.mmio_write(addr, data);
METRICS.vcpu.exit_mmio_write.inc();
Ok(VcpuEmulation::Handled)
}
VcpuExit::Hlt => {
info!("Received KVM_EXIT_HLT signal");
Err(VcpuError::VcpuUnhandledKvmExit)
}
VcpuExit::Shutdown => {
info!("Received KVM_EXIT_SHUTDOWN signal");
Err(VcpuError::VcpuUnhandledKvmExit)
}
// Documentation specifies that below kvm exits are considered errors.
VcpuExit::FailEntry => {
METRICS.vcpu.failures.inc();
error!("Received KVM_EXIT_FAIL_ENTRY signal");
Err(VcpuError::VcpuUnhandledKvmExit)
}
VcpuExit::InternalError => {
METRICS.vcpu.failures.inc();
error!("Received KVM_EXIT_INTERNAL_ERROR signal");
Err(VcpuError::VcpuUnhandledKvmExit)
}
VcpuExit::SystemEvent(event_type, event_flags) => match event_type {
KVM_SYSTEM_EVENT_RESET | KVM_SYSTEM_EVENT_SHUTDOWN => {
info!(
"Received KVM_SYSTEM_EVENT: type: {}, event: {}",
event_type, event_flags
);
Ok(VcpuEmulation::Stopped)
}
_ => {
METRICS.vcpu.failures.inc();
error!(
"Received KVM_SYSTEM_EVENT signal type: {}, flag: {}",
event_type, event_flags
);
Err(VcpuError::VcpuUnhandledKvmExit)
}
},
r => {
METRICS.vcpu.failures.inc();
// TODO: Are we sure we want to finish running a vcpu upon
// receiving a vm exit that is not necessarily an error?
error!("Unexpected exit reason on vcpu run: {:?}", r);
Err(VcpuError::VcpuUnhandledKvmExit)
}
},
// The unwrap on raw_os_error can only fail if we have a logic
// error in our code in which case it is better to panic.
Err(ref e) => {
match e.errno() {
libc::EAGAIN => Ok(VcpuEmulation::Handled),
libc::EINTR => {
self.fd.set_kvm_immediate_exit(0);
// Notify that this KVM_RUN was interrupted.
Ok(VcpuEmulation::Interrupted)
}
_ => {
METRICS.vcpu.failures.inc();
error!("Failure during vcpu run: {}", e);
#[cfg(target_arch = "x86_64")]
{
error!(
"dump regs: {:?}, dump sregs: {:?}",
self.fd.get_regs(),
self.fd.get_sregs()
);
}
Err(VcpuError::VcpuUnhandledKvmExit)
}
}
}
}
}
#[cfg(target_arch = "x86_64")]
// checkout the io port that dragonball used only
fn check_io_port_info(&self, addr: u16, data: &[u8]) -> Result<bool> {
let mut checked = false;
match addr {
// debug info signal
MAGIC_IOPORT_DEBUG_INFO => {
if data.len() == 4 {
let data = unsafe { std::ptr::read(data.as_ptr() as *const u32) };
log::warn!("KDBG: guest kernel debug info: 0x{:x}", data);
checked = true;
}
}
_ => {}
};
Ok(checked)
}
fn gettid() -> u32 {
nix::unistd::gettid().as_raw() as u32
}
fn revalidate_cache(&mut self) -> Result<()> {
self.io_mgr.revalidate_cache();
Ok(())
}
/// Main loop of the vCPU thread.
///
/// Runs the vCPU in KVM context in a loop. Handles KVM_EXITs then goes back in.
/// Note that the state of the VCPU and associated VM must be setup first for this to do
/// anything useful.
pub fn run(&mut self, seccomp_filter: BpfProgram) {
// Load seccomp filters for this vCPU thread.
// Execution panics if filters cannot be loaded; use --seccomp-level=0 if skipping filters
// altogether is the desired behaviour.
if let Err(e) = apply_filter(&seccomp_filter) {
if matches!(e, SecError::EmptyFilter) {
info!("vCPU thread {} use empty seccomp filters.", self.id);
} else {
panic!(
"Failed to set the requested seccomp filters on vCPU {}: Error: {}",
self.id, e
);
}
}
info!("vcpu {} is running", self.cpu_index());
// Start running the machine state in the `Paused` state.
StateMachine::run(self, Self::paused);
}
// This is the main loop of the `Running` state.
fn running(&mut self) -> StateMachine<Self> {
// This loop is here just for optimizing the emulation path.
// No point in ticking the state machine if there are no external events.
loop {
match self.run_emulation() {
// Emulation ran successfully, continue.
Ok(VcpuEmulation::Handled) => {
// We need to break here if KVM does not support the
// immediate_exit flag. The signal sent from the vmm thread may
// arrive while we are handling a vcpu exit event, in which case
// the external vcpu events might not be handled correctly, so we
// check the event_receiver channel after handling vcpu exit
// events to narrow the window in which external vcpu events go
// unhandled.
if !self.support_immediate_exit {
break;
}
}
// Emulation was interrupted, check external events.
Ok(VcpuEmulation::Interrupted) => break,
// Emulation was stopped due to reset or shutdown.
Ok(VcpuEmulation::Stopped) => return StateMachine::next(Self::waiting_exit),
// Emulation errors lead to vCPU exit.
Err(e) => {
error!("vcpu: {}, run_emulation failed: {:?}", self.id, e);
return StateMachine::next(Self::waiting_exit);
}
}
}
// By default don't change state.
let mut state = StateMachine::next(Self::running);
// Break this emulation loop on any transition request/external event.
match self.event_receiver.try_recv() {
// Running ---- Exit ----> Exited
Ok(VcpuEvent::Exit) => {
// Move to 'exited' state.
state = StateMachine::next(Self::exited);
}
// Running ---- Pause ----> Paused
Ok(VcpuEvent::Pause) => {
// Nothing special to do.
self.response_sender
.send(VcpuResponse::Paused)
.expect("failed to send pause status");
// TODO: we should call `KVM_KVMCLOCK_CTRL` here to make sure
// TODO continued: the guest soft lockup watchdog does not panic on Resume.
//let _ = self.fd.kvmclock_ctrl();
// Move to 'paused' state.
state = StateMachine::next(Self::paused);
}
Ok(VcpuEvent::Resume) => {
self.response_sender
.send(VcpuResponse::Resumed)
.expect("failed to send resume status");
}
Ok(VcpuEvent::Gettid) => {
self.response_sender
.send(VcpuResponse::Tid(self.cpu_index(), Vcpu::gettid()))
.expect("failed to send vcpu thread tid");
}
Ok(VcpuEvent::RevalidateCache) => {
self.revalidate_cache()
.map(|()| {
self.response_sender
.send(VcpuResponse::CacheRevalidated)
.expect("failed to revalidate vcpu IoManager cache");
})
.map_err(|e| self.response_sender.send(VcpuResponse::Error(e)))
.expect("failed to revalidate vcpu IoManager cache");
}
// Unhandled exit of the other end.
Err(TryRecvError::Disconnected) => {
// Move to 'exited' state.
state = StateMachine::next(Self::exited);
}
// All other events or lack thereof have no effect on current 'running' state.
Err(TryRecvError::Empty) => (),
}
state
}
// This is the main loop of the `Paused` state.
fn paused(&mut self) -> StateMachine<Self> {
match self.event_receiver.recv() {
// Paused ---- Exit ----> Exited
Ok(VcpuEvent::Exit) => {
// Move to 'exited' state.
StateMachine::next(Self::exited)
}
// Paused ---- Resume ----> Running
Ok(VcpuEvent::Resume) => {
self.response_sender
.send(VcpuResponse::Resumed)
.expect("failed to send resume status");
// Move to 'running' state.
StateMachine::next(Self::running)
}
Ok(VcpuEvent::Pause) => {
self.response_sender
.send(VcpuResponse::Paused)
.expect("failed to send pause status");
// Remain in the 'paused' state.
StateMachine::next(Self::paused)
}
Ok(VcpuEvent::Gettid) => {
self.response_sender
.send(VcpuResponse::Tid(self.cpu_index(), Vcpu::gettid()))
.expect("failed to send vcpu thread tid");
StateMachine::next(Self::paused)
}
Ok(VcpuEvent::RevalidateCache) => {
self.revalidate_cache()
.map(|()| {
self.response_sender
.send(VcpuResponse::CacheRevalidated)
.expect("failed to revalidate vcpu IoManager cache");
})
.map_err(|e| self.response_sender.send(VcpuResponse::Error(e)))
.expect("failed to revalidate vcpu IoManager cache");
StateMachine::next(Self::paused)
}
// Unhandled exit of the other end.
Err(_) => {
// Move to 'exited' state.
StateMachine::next(Self::exited)
}
}
}
// This is the main loop of the `WaitingExit` state.
fn waiting_exit(&mut self) -> StateMachine<Self> {
// Trigger the vmm to stop the machine.
if let Err(e) = self.exit_evt.write(1) {
METRICS.vcpu.failures.inc();
error!("Failed signaling vcpu exit event: {}", e);
}
let mut state = StateMachine::next(Self::waiting_exit);
match self.event_receiver.recv() {
Ok(VcpuEvent::Exit) => state = StateMachine::next(Self::exited),
Ok(_) => error!(
"wrong state received in waiting exit state on vcpu {}",
self.id
),
Err(_) => {
error!(
"vcpu channel closed in waiting exit state on vcpu {}",
self.id
);
state = StateMachine::next(Self::exited);
}
}
state
}
// This is the main loop of the `Exited` state.
fn exited(&mut self) -> StateMachine<Self> {
// State machine reached its end.
StateMachine::finish(Self::exited)
}
}
impl Drop for Vcpu {
fn drop(&mut self) {
let _ = self.reset_thread_local_data();
}
}
#[cfg(test)]
pub mod tests {
use std::os::unix::io::AsRawFd;
use std::sync::mpsc::{channel, Receiver};
use std::sync::Mutex;
use arc_swap::ArcSwap;
use dbs_device::device_manager::IoManager;
use kvm_ioctls::Kvm;
use lazy_static::lazy_static;
use super::*;
use crate::kvm_context::KvmContext;
pub enum EmulationCase {
IoIn,
IoOut,
MmioRead,
MmioWrite,
Hlt,
Shutdown,
FailEntry,
InternalError,
Unknown,
SystemEvent(u32, u64),
Error(i32),
}
lazy_static! {
pub static ref EMULATE_RES: Mutex<EmulationCase> = Mutex::new(EmulationCase::Unknown);
}
impl Vcpu {
pub fn emulate(_fd: &VcpuFd) -> std::result::Result<VcpuExit<'_>, kvm_ioctls::Error> {
let res = &*EMULATE_RES.lock().unwrap();
match res {
EmulationCase::IoIn => Ok(VcpuExit::IoIn(0, &mut [])),
EmulationCase::IoOut => Ok(VcpuExit::IoOut(0, &[])),
EmulationCase::MmioRead => Ok(VcpuExit::MmioRead(0, &mut [])),
EmulationCase::MmioWrite => Ok(VcpuExit::MmioWrite(0, &[])),
EmulationCase::Hlt => Ok(VcpuExit::Hlt),
EmulationCase::Shutdown => Ok(VcpuExit::Shutdown),
EmulationCase::FailEntry => Ok(VcpuExit::FailEntry),
EmulationCase::InternalError => Ok(VcpuExit::InternalError),
EmulationCase::Unknown => Ok(VcpuExit::Unknown),
EmulationCase::SystemEvent(event_type, event_flags) => {
Ok(VcpuExit::SystemEvent(*event_type, *event_flags))
}
EmulationCase::Error(e) => Err(kvm_ioctls::Error::new(*e)),
}
}
}
#[cfg(target_arch = "x86_64")]
fn create_vcpu() -> (Vcpu, Receiver<VcpuStateEvent>) {
// Calling KVM too frequently can cause errors on some host kernels.
std::thread::sleep(std::time::Duration::from_millis(5));
let kvm = Kvm::new().unwrap();
let vm = Arc::new(kvm.create_vm().unwrap());
let kvm_context = KvmContext::new(Some(kvm.as_raw_fd())).unwrap();
let vcpu_fd = Arc::new(vm.create_vcpu(0).unwrap());
let io_manager = IoManagerCached::new(Arc::new(ArcSwap::new(Arc::new(IoManager::new()))));
let supported_cpuid = kvm_context
.supported_cpuid(kvm_bindings::KVM_MAX_CPUID_ENTRIES)
.unwrap();
let reset_event_fd = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let vcpu_state_event = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let (tx, rx) = channel();
let time_stamp = TimestampUs::default();
let vcpu = Vcpu::new_x86_64(
0,
vcpu_fd,
io_manager,
supported_cpuid,
reset_event_fd,
vcpu_state_event,
tx,
time_stamp,
false,
)
.unwrap();
(vcpu, rx)
}
#[cfg(target_arch = "aarch64")]
fn create_vcpu() -> (Vcpu, Receiver<VcpuStateEvent>) {
// Calling KVM too frequently can cause errors on some host kernels.
std::thread::sleep(std::time::Duration::from_millis(5));
let kvm = Kvm::new().unwrap();
let vm = Arc::new(kvm.create_vm().unwrap());
let kvm_context = KvmContext::new(Some(kvm.as_raw_fd())).unwrap();
let vcpu_fd = Arc::new(vm.create_vcpu(0).unwrap());
let io_manager = IoManagerCached::new(Arc::new(ArcSwap::new(Arc::new(IoManager::new()))));
let reset_event_fd = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let vcpu_state_event = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let (tx, rx) = channel();
let time_stamp = TimestampUs::default();
let vcpu = Vcpu::new_aarch64(
0,
vcpu_fd,
io_manager,
reset_event_fd,
vcpu_state_event,
tx,
time_stamp,
false,
)
.unwrap();
(vcpu, rx)
}
#[test]
fn test_vcpu_run_emulation() {
let (mut vcpu, _) = create_vcpu();
#[cfg(target_arch = "x86_64")]
{
// Io in
*(EMULATE_RES.lock().unwrap()) = EmulationCase::IoIn;
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Handled)));
// Io out
*(EMULATE_RES.lock().unwrap()) = EmulationCase::IoOut;
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Handled)));
}
// Mmio read
*(EMULATE_RES.lock().unwrap()) = EmulationCase::MmioRead;
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Handled)));
// Mmio write
*(EMULATE_RES.lock().unwrap()) = EmulationCase::MmioWrite;
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Handled)));
// KVM_EXIT_HLT signal
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Hlt;
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// KVM_EXIT_SHUTDOWN signal
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Shutdown;
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// KVM_EXIT_FAIL_ENTRY signal
*(EMULATE_RES.lock().unwrap()) = EmulationCase::FailEntry;
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// KVM_EXIT_INTERNAL_ERROR signal
*(EMULATE_RES.lock().unwrap()) = EmulationCase::InternalError;
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// KVM_SYSTEM_EVENT_RESET
*(EMULATE_RES.lock().unwrap()) = EmulationCase::SystemEvent(KVM_SYSTEM_EVENT_RESET, 0);
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Stopped)));
// KVM_SYSTEM_EVENT_SHUTDOWN
*(EMULATE_RES.lock().unwrap()) = EmulationCase::SystemEvent(KVM_SYSTEM_EVENT_SHUTDOWN, 0);
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Stopped)));
// Other system event
*(EMULATE_RES.lock().unwrap()) = EmulationCase::SystemEvent(0, 0);
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// Unknown exit reason
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Unknown;
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
// Error: EAGAIN
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Error(libc::EAGAIN);
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Handled)));
// Error: EINTR
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Error(libc::EINTR);
let res = vcpu.run_emulation();
assert!(matches!(res, Ok(VcpuEmulation::Interrupted)));
// other error
*(EMULATE_RES.lock().unwrap()) = EmulationCase::Error(libc::EINVAL);
let res = vcpu.run_emulation();
assert!(matches!(res, Err(VcpuError::VcpuUnhandledKvmExit)));
}
#[cfg(target_arch = "x86_64")]
#[test]
fn test_vcpu_check_io_port_info() {
let (vcpu, _receiver) = create_vcpu();
// debug info signal
let res = vcpu
.check_io_port_info(MAGIC_IOPORT_DEBUG_INFO, &[0, 0, 0, 0])
.unwrap();
assert!(res);
}
}
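
The `running`, `paused`, `waiting_exit` and `exited` methods above each return a `StateMachine<Self>`, and `StateMachine::run` keeps invoking the returned state function until a final state is reached. The following self-contained sketch (not part of the diff; the real `StateMachine` type lives alongside the vCPU code) illustrates the pattern with a toy driver:

// Minimal re-implementation of the state-machine pattern used by the vCPU thread.
// Names mirror the real type for readability; this is an illustration only.
struct StateMachine<T> {
    function: fn(&mut T) -> StateMachine<T>,
    end_state: bool,
}

impl<T> StateMachine<T> {
    // A transient state: the loop keeps going.
    fn next(function: fn(&mut T) -> StateMachine<T>) -> Self {
        StateMachine { function, end_state: false }
    }
    // A final state: the loop stops once it is returned.
    fn finish(function: fn(&mut T) -> StateMachine<T>) -> Self {
        StateMachine { function, end_state: true }
    }
    // Drive the machine until a final state is returned.
    fn run(machine: &mut T, starting_state: fn(&mut T) -> StateMachine<T>) {
        let mut state = StateMachine::next(starting_state);
        while !state.end_state {
            state = (state.function)(machine);
        }
    }
}

struct ToyVcpu { ticks: u32 }

impl ToyVcpu {
    fn running(&mut self) -> StateMachine<Self> {
        self.ticks += 1;
        if self.ticks < 3 {
            StateMachine::next(Self::running)
        } else {
            StateMachine::next(Self::exited)
        }
    }
    fn exited(&mut self) -> StateMachine<Self> {
        StateMachine::finish(Self::exited)
    }
}

fn main() {
    let mut vcpu = ToyVcpu { ticks: 0 };
    StateMachine::run(&mut vcpu, ToyVcpu::running);
    assert_eq!(vcpu.ticks, 3);
}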

File diff suppressed because it is too large.


@@ -0,0 +1,149 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::sync::mpsc::{channel, Sender};
use std::sync::Arc;
use dbs_arch::cpuid::{process_cpuid, VmSpec};
use dbs_arch::gdt::gdt_entry;
use dbs_utils::time::TimestampUs;
use kvm_bindings::CpuId;
use kvm_ioctls::{VcpuFd, VmFd};
use log::error;
use vm_memory::{Address, GuestAddress, GuestAddressSpace};
use vmm_sys_util::eventfd::EventFd;
use crate::address_space_manager::GuestAddressSpaceImpl;
use crate::metric::{IncMetric, METRICS};
use crate::vcpu::vcpu_impl::{Result, Vcpu, VcpuError, VcpuStateEvent};
use crate::vcpu::VcpuConfig;
use crate::IoManagerCached;
impl Vcpu {
/// Constructs a new VCPU for `vm`.
///
/// # Arguments
///
/// * `id` - Represents the CPU number between [0, max vcpus).
/// * `vcpu_fd` - The kvm `VcpuFd` for the vcpu.
/// * `io_mgr` - The io-manager used to access port-io and mmio devices.
/// * `cpuid` - The `CpuId` listing the supported capabilities of this vcpu.
/// * `exit_evt` - An `EventFd` that will be written into when this vcpu
/// exits.
/// * `vcpu_state_event` - The eventfd used to notify the vmm that the state of
///   some vcpu should change.
/// * `vcpu_state_sender` - The channel used to send state change messages from
///   the vcpu thread to the vmm thread.
/// * `create_ts` - A timestamp used by the vcpu to calculate its lifetime.
/// * `support_immediate_exit` - Whether the KVM in use supports the immediate_exit flag.
#[allow(clippy::too_many_arguments)]
pub fn new_x86_64(
id: u8,
vcpu_fd: Arc<VcpuFd>,
io_mgr: IoManagerCached,
cpuid: CpuId,
exit_evt: EventFd,
vcpu_state_event: EventFd,
vcpu_state_sender: Sender<VcpuStateEvent>,
create_ts: TimestampUs,
support_immediate_exit: bool,
) -> Result<Self> {
let (event_sender, event_receiver) = channel();
let (response_sender, response_receiver) = channel();
// Initially the cpuid per vCPU is the one supported by this VM.
Ok(Vcpu {
fd: vcpu_fd,
id,
io_mgr,
create_ts,
event_receiver,
event_sender: Some(event_sender),
response_receiver: Some(response_receiver),
response_sender,
vcpu_state_event,
vcpu_state_sender,
exit_evt,
support_immediate_exit,
cpuid,
})
}
/// Configures an x86_64-specific vcpu; should be called once per vcpu.
///
/// # Arguments
///
/// * `vm_config` - The machine configuration of this microvm needed for the CPUID configuration.
/// * `vm_fd` - The kvm `VmFd` for the virtual machine this vcpu will get attached to.
/// * `vm_memory` - The guest memory used by this microvm.
/// * `kernel_start_addr` - Offset from `guest_mem` at which the kernel starts.
/// * `pgtable_addr` - Page table address for AP (secondary) vCPUs.
pub fn configure(
&mut self,
vcpu_config: &VcpuConfig,
_vm_fd: &VmFd,
vm_as: &GuestAddressSpaceImpl,
kernel_start_addr: Option<GuestAddress>,
_pgtable_addr: Option<GuestAddress>,
) -> Result<()> {
self.set_cpuid(vcpu_config)?;
dbs_arch::regs::setup_msrs(&self.fd).map_err(VcpuError::MSRSConfiguration)?;
if let Some(start_addr) = kernel_start_addr {
dbs_arch::regs::setup_regs(
&self.fd,
start_addr.raw_value() as u64,
dbs_boot::layout::BOOT_STACK_POINTER,
dbs_boot::layout::BOOT_STACK_POINTER,
dbs_boot::layout::ZERO_PAGE_START,
)
.map_err(VcpuError::REGSConfiguration)?;
dbs_arch::regs::setup_fpu(&self.fd).map_err(VcpuError::FPUConfiguration)?;
let gdt_table: [u64; dbs_boot::layout::BOOT_GDT_MAX as usize] = [
gdt_entry(0, 0, 0), // NULL
gdt_entry(0xa09b, 0, 0xfffff), // CODE
gdt_entry(0xc093, 0, 0xfffff), // DATA
gdt_entry(0x808b, 0, 0xfffff), // TSS
];
let pgtable_addr =
dbs_boot::setup_identity_mapping(&*vm_as.memory()).map_err(VcpuError::PageTable)?;
dbs_arch::regs::setup_sregs(
&*vm_as.memory(),
&self.fd,
pgtable_addr,
&gdt_table,
dbs_boot::layout::BOOT_GDT_OFFSET,
dbs_boot::layout::BOOT_IDT_OFFSET,
)
.map_err(VcpuError::SREGSConfiguration)?;
}
dbs_arch::interrupts::set_lint(&self.fd).map_err(VcpuError::LocalIntConfiguration)?;
Ok(())
}
fn set_cpuid(&mut self, vcpu_config: &VcpuConfig) -> Result<()> {
let cpuid_vm_spec = VmSpec::new(
self.id,
vcpu_config.max_vcpu_count as u8,
vcpu_config.threads_per_core,
vcpu_config.cores_per_die,
vcpu_config.dies_per_socket,
vcpu_config.vpmu_feature,
)
.map_err(VcpuError::CpuId)?;
process_cpuid(&mut self.cpuid, &cpuid_vm_spec).map_err(|e| {
METRICS.vcpu.filter_cpuid.inc();
error!("Failure in configuring CPUID for vcpu {}: {:?}", self.id, e);
VcpuError::CpuId(e)
})?;
self.fd
.set_cpuid2(&self.cpuid)
.map_err(VcpuError::SetSupportedCpusFailed)
}
}


@@ -0,0 +1,159 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::collections::HashMap;
use std::fmt::Debug;
use std::ops::Deref;
use dbs_arch::gic::GICDevice;
use dbs_arch::{DeviceInfoForFDT, DeviceType};
use dbs_boot::InitrdConfig;
use dbs_utils::epoll_manager::EpollManager;
use dbs_utils::time::TimestampUs;
use linux_loader::loader::Cmdline;
use vm_memory::{GuestAddressSpace, GuestMemory};
use vmm_sys_util::eventfd::EventFd;
use super::{Vm, VmError};
use crate::address_space_manager::{GuestAddressSpaceImpl, GuestMemoryImpl};
use crate::error::{Error, StartMicroVmError};
use crate::event_manager::EventManager;
/// Configures the system and should be called once per vm before starting vcpu threads.
/// For aarch64, we only set up the FDT.
///
/// # Arguments
///
/// * `guest_mem` - The memory to be used by the guest.
/// * `cmdline` - The kernel commandline.
/// * `vcpu_mpidr` - Array of MPIDR register values per vcpu.
/// * `device_info` - A hashmap containing the attached devices for building FDT device nodes.
/// * `gic_device` - The GIC device.
/// * `initrd` - Information about an optional initrd.
fn configure_system<T: DeviceInfoForFDT + Clone + Debug, M: GuestMemory>(
guest_mem: &M,
cmdline: &str,
vcpu_mpidr: Vec<u64>,
device_info: Option<&HashMap<(DeviceType, String), T>>,
gic_device: &Box<dyn GICDevice>,
initrd: &Option<super::InitrdConfig>,
) -> super::Result<()> {
dbs_boot::fdt::create_fdt(
guest_mem,
vcpu_mpidr,
cmdline,
device_info,
gic_device,
initrd,
)
.map_err(Error::BootSystem)?;
Ok(())
}
#[cfg(target_arch = "aarch64")]
impl Vm {
/// Gets a reference to the irqchip of the VM
pub fn get_irqchip(&self) -> &Box<dyn GICDevice> {
&self.irqchip_handle.as_ref().unwrap()
}
/// Creates the irq chip in-kernel device model.
pub fn setup_interrupt_controller(&mut self) -> std::result::Result<(), StartMicroVmError> {
let vcpu_count = self.vm_config.vcpu_count;
self.irqchip_handle = Some(
dbs_arch::gic::create_gic(&self.vm_fd, vcpu_count.into())
.map_err(|e| StartMicroVmError::ConfigureVm(VmError::SetupGIC(e)))?,
);
Ok(())
}
/// Initialize the virtual machine instance.
///
/// It initializes the virtual machine instance by:
/// 1) initialize virtual machine global state and configuration.
/// 2) create system devices, such as interrupt controller.
/// 3) create and start IO devices, such as serial, console, block, net, vsock etc.
/// 4) create and initialize vCPUs.
/// 5) configure CPU power management features.
/// 6) load guest kernel image.
pub fn init_microvm(
&mut self,
epoll_mgr: EpollManager,
vm_as: GuestAddressSpaceImpl,
request_ts: TimestampUs,
) -> Result<(), StartMicroVmError> {
let reset_eventfd =
EventFd::new(libc::EFD_NONBLOCK).map_err(|_| StartMicroVmError::EventFd)?;
self.reset_eventfd = Some(
reset_eventfd
.try_clone()
.map_err(|_| StartMicroVmError::EventFd)?,
);
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.set_reset_event_fd(reset_eventfd)
.map_err(StartMicroVmError::Vcpu)?;
// On aarch64, the vCPUs need to be created (i.e. call KVM_CREATE_VCPU) and configured before
// setting up the IRQ chip, because the `KVM_CREATE_VCPU` ioctl will return an error if the IRQCHIP
// was already initialized.
// Search for `kvm_arch_vcpu_create` in arch/arm/kvm/arm.c.
let kernel_loader_result = self.load_kernel(vm_as.memory().deref())?;
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.create_boot_vcpus(request_ts, kernel_loader_result.kernel_load)
.map_err(StartMicroVmError::Vcpu)?;
self.setup_interrupt_controller()?;
self.init_devices(epoll_mgr)?;
Ok(())
}
/// Execute system architecture specific configurations.
///
/// 1) set guest kernel boot parameters
/// 2) set up FDT data structures.
pub fn configure_system_arch(
&self,
vm_memory: &GuestMemoryImpl,
cmdline: &Cmdline,
initrd: Option<InitrdConfig>,
) -> std::result::Result<(), StartMicroVmError> {
let vcpu_manager = self.vcpu_manager().map_err(StartMicroVmError::Vcpu)?;
let vcpu_mpidr = vcpu_manager
.vcpus()
.into_iter()
.map(|cpu| cpu.get_mpidr())
.collect();
let guest_memory = vm_memory.memory();
configure_system(
guest_memory,
cmdline.as_str(),
vcpu_mpidr,
self.device_manager.get_mmio_device_info(),
self.get_irqchip(),
&initrd,
)
.map_err(StartMicroVmError::ConfigureSystem)
}
pub(crate) fn register_events(
&mut self,
event_mgr: &mut EventManager,
) -> std::result::Result<(), StartMicroVmError> {
let reset_evt = self.get_reset_eventfd().ok_or(StartMicroVmError::EventFd)?;
event_mgr
.register_exit_eventfd(reset_evt)
.map_err(|_| StartMicroVmError::RegisterEvent)?;
Ok(())
}
}


@@ -0,0 +1,72 @@
// Copyright (C) 2022 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
use std::fs::File;
/// Structure to hold guest kernel configuration information.
pub struct KernelConfigInfo {
/// The descriptor to the kernel file.
kernel_file: File,
/// The descriptor to the initrd file, if there is one
initrd_file: Option<File>,
/// The commandline for guest kernel.
cmdline: linux_loader::cmdline::Cmdline,
}
impl KernelConfigInfo {
/// Create a KernelConfigInfo instance.
pub fn new(
kernel_file: File,
initrd_file: Option<File>,
cmdline: linux_loader::cmdline::Cmdline,
) -> Self {
KernelConfigInfo {
kernel_file,
initrd_file,
cmdline,
}
}
/// Get a mutable reference to the kernel file.
pub fn kernel_file_mut(&mut self) -> &mut File {
&mut self.kernel_file
}
/// Get an immutable reference to the initrd file.
pub fn initrd_file(&self) -> Option<&File> {
self.initrd_file.as_ref()
}
/// Get a mutable reference to the initrd file.
pub fn initrd_file_mut(&mut self) -> Option<&mut File> {
self.initrd_file.as_mut()
}
/// Get a shared reference to the guest kernel boot parameter object.
pub fn kernel_cmdline(&self) -> &linux_loader::cmdline::Cmdline {
&self.cmdline
}
/// Get a mutable reference to the guest kernel boot parameter object.
pub fn kernel_cmdline_mut(&mut self) -> &mut linux_loader::cmdline::Cmdline {
&mut self.cmdline
}
}
#[cfg(test)]
mod tests {
use super::*;
use vmm_sys_util::tempfile::TempFile;
#[test]
fn test_kernel_config_info() {
let kernel = TempFile::new().unwrap();
let initrd = TempFile::new().unwrap();
let mut cmdline = linux_loader::cmdline::Cmdline::new(1024);
cmdline.insert_str("ro").unwrap();
let mut info = KernelConfigInfo::new(kernel.into_file(), Some(initrd.into_file()), cmdline);
assert_eq!(info.cmdline.as_str(), "ro");
assert!(info.initrd_file_mut().is_some());
}
}
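
A hedged usage sketch (the kernel and initrd paths are placeholders, not taken from the diff): building a `KernelConfigInfo` that can later be handed to the VM with `Vm::set_kernel_config`, using the same linux-loader `Cmdline` API as the unit test above.

use std::fs::File;

fn build_kernel_config() -> std::io::Result<KernelConfigInfo> {
    // Hypothetical paths; a real caller would take them from its configuration.
    let kernel = File::open("/path/to/vmlinux")?;
    let initrd = File::open("/path/to/initrd.img").ok();

    // Build the guest kernel command line.
    let mut cmdline = linux_loader::cmdline::Cmdline::new(4096);
    cmdline
        .insert_str("console=ttyS0 reboot=k panic=1")
        .expect("kernel command line too long");

    Ok(KernelConfigInfo::new(kernel, initrd, cmdline))
}

// The result is typically passed to the VM before boot:
// vm.set_kernel_config(build_kernel_config()?);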


@@ -0,0 +1,816 @@
// Copyright (C) 2021 Alibaba Cloud. All rights reserved.
// SPDX-License-Identifier: Apache-2.0
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Deref;
use std::os::unix::io::RawFd;
use std::sync::{Arc, Mutex, RwLock};
use dbs_address_space::AddressSpace;
#[cfg(target_arch = "aarch64")]
use dbs_arch::gic::GICDevice;
use dbs_boot::InitrdConfig;
use dbs_utils::epoll_manager::EpollManager;
use dbs_utils::time::TimestampUs;
use kvm_ioctls::VmFd;
use linux_loader::loader::{KernelLoader, KernelLoaderResult};
use seccompiler::BpfProgram;
use serde_derive::{Deserialize, Serialize};
use slog::{error, info};
use vm_memory::{Bytes, GuestAddress, GuestAddressSpace};
use vmm_sys_util::eventfd::EventFd;
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
use dbs_upcall::{DevMgrService, UpcallClient};
use crate::address_space_manager::{
AddressManagerError, AddressSpaceMgr, AddressSpaceMgrBuilder, GuestAddressSpaceImpl,
GuestMemoryImpl,
};
use crate::api::v1::{InstanceInfo, InstanceState};
use crate::device_manager::console_manager::DmesgWriter;
use crate::device_manager::{DeviceManager, DeviceMgrError, DeviceOpContext};
use crate::error::{LoadInitrdError, Result, StartMicroVmError, StopMicrovmError};
use crate::event_manager::EventManager;
use crate::kvm_context::KvmContext;
use crate::resource_manager::ResourceManager;
use crate::vcpu::{VcpuManager, VcpuManagerError};
#[cfg(target_arch = "aarch64")]
use dbs_arch::gic::Error as GICError;
mod kernel_config;
pub use self::kernel_config::KernelConfigInfo;
#[cfg(target_arch = "aarch64")]
#[path = "aarch64.rs"]
mod aarch64;
#[cfg(target_arch = "x86_64")]
#[path = "x86_64.rs"]
mod x86_64;
/// Errors associated with virtual machine instance related operations.
#[derive(Debug, thiserror::Error)]
pub enum VmError {
/// Cannot configure the IRQ.
#[error("failed to configure IRQ fot the virtual machine: {0}")]
Irq(#[source] kvm_ioctls::Error),
/// Cannot configure the microvm.
#[error("failed to initialize the virtual machine: {0}")]
VmSetup(#[source] kvm_ioctls::Error),
/// Cannot setup GIC
#[cfg(target_arch = "aarch64")]
#[error("failed to configure GIC")]
SetupGIC(GICError),
}
/// Configuration information for user defined NUMA nodes.
#[derive(Clone, Debug, Default, Serialize, Deserialize, PartialEq)]
pub struct NumaRegionInfo {
/// memory size for this region (unit: MiB)
pub size: u64,
/// numa node id on host for this region
pub host_numa_node_id: Option<u32>,
/// numa node id on guest for this region
pub guest_numa_node_id: Option<u32>,
/// vcpu ids belonging to this region
pub vcpu_ids: Vec<u32>,
}
/// Information for cpu topology to guide guest init
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct CpuTopology {
/// threads per core to indicate hyperthreading is enabled or not
pub threads_per_core: u8,
/// cores per die to guide guest cpu topology init
pub cores_per_die: u8,
/// dies per socket to guide guest cpu topology
pub dies_per_socket: u8,
/// number of sockets
pub sockets: u8,
}
impl Default for CpuTopology {
fn default() -> Self {
CpuTopology {
threads_per_core: 1,
cores_per_die: 1,
dies_per_socket: 1,
sockets: 1,
}
}
}
/// Configuration information for virtual machine instance.
#[derive(Clone, Debug, PartialEq)]
pub struct VmConfigInfo {
/// Number of vcpu to start.
pub vcpu_count: u8,
/// Max number of vcpu can be added
pub max_vcpu_count: u8,
/// cpu power management.
pub cpu_pm: String,
/// cpu topology information
pub cpu_topology: CpuTopology,
/// vpmu support level
pub vpmu_feature: u8,
/// Memory type that can be either hugetlbfs or shmem, default is shmem
pub mem_type: String,
/// Memory file path
pub mem_file_path: String,
/// The memory size in MiB.
pub mem_size_mib: usize,
/// sock path
pub serial_path: Option<String>,
}
impl Default for VmConfigInfo {
fn default() -> Self {
VmConfigInfo {
vcpu_count: 1,
max_vcpu_count: 1,
cpu_pm: String::from("on"),
cpu_topology: CpuTopology {
threads_per_core: 1,
cores_per_die: 1,
dies_per_socket: 1,
sockets: 1,
},
vpmu_feature: 0,
mem_type: String::from("shmem"),
mem_file_path: String::from(""),
mem_size_mib: 128,
serial_path: None,
}
}
}
/// Struct to manage resources and control the state of a virtual machine instance.
///
/// A `Vm` instance holds the resources assigned to a virtual machine instance, such as CPU,
/// memory, devices etc. When a `Vm` instance is dropped, all assigned resources should be
/// released.
///
/// We have explicitly built the object model as:
/// |---Vmm API Server--<-1:1-> HTTP API Server
/// | |----------<-1:1-> Shimv2/CRI API Server
/// |
/// Vmm <-1:N-> Vm <-1:1-> Address Space Manager <-1:N-> GuestMemory
/// ^ ^---1:1-> Device Manager <-1:N-> Device
/// | ^---1:1-> Resource Manager
/// | ^---1:N-> Vcpu
/// |---<-1:N-> Event Manager
pub struct Vm {
epoll_manager: EpollManager,
kvm: KvmContext,
shared_info: Arc<RwLock<InstanceInfo>>,
address_space: AddressSpaceMgr,
device_manager: DeviceManager,
dmesg_fifo: Option<Box<dyn io::Write + Send>>,
kernel_config: Option<KernelConfigInfo>,
logger: slog::Logger,
reset_eventfd: Option<EventFd>,
resource_manager: Arc<ResourceManager>,
vcpu_manager: Option<Arc<Mutex<VcpuManager>>>,
vm_config: VmConfigInfo,
vm_fd: Arc<VmFd>,
start_instance_request_ts: u64,
start_instance_request_cpu_ts: u64,
start_instance_downtime: u64,
// Arm specific fields.
// On aarch64 we need to keep around the fd obtained by creating the VGIC device.
#[cfg(target_arch = "aarch64")]
irqchip_handle: Option<Box<dyn GICDevice>>,
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
upcall_client: Option<Arc<UpcallClient<DevMgrService>>>,
}
impl Vm {
/// Constructs a new `Vm` instance using the given `Kvm` instance.
pub fn new(
kvm_fd: Option<RawFd>,
api_shared_info: Arc<RwLock<InstanceInfo>>,
epoll_manager: EpollManager,
) -> Result<Self> {
let id = api_shared_info.read().unwrap().id.clone();
let logger = slog_scope::logger().new(slog::o!("id" => id));
let kvm = KvmContext::new(kvm_fd)?;
let vm_fd = Arc::new(kvm.create_vm()?);
let resource_manager = Arc::new(ResourceManager::new(Some(kvm.max_memslots())));
let device_manager = DeviceManager::new(
vm_fd.clone(),
resource_manager.clone(),
epoll_manager.clone(),
&logger,
);
Ok(Vm {
epoll_manager,
kvm,
shared_info: api_shared_info,
address_space: AddressSpaceMgr::default(),
device_manager,
dmesg_fifo: None,
kernel_config: None,
logger,
reset_eventfd: None,
resource_manager,
vcpu_manager: None,
vm_config: Default::default(),
vm_fd,
start_instance_request_ts: 0,
start_instance_request_cpu_ts: 0,
start_instance_downtime: 0,
#[cfg(target_arch = "aarch64")]
irqchip_handle: None,
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
upcall_client: None,
})
}
/// Gets a reference to the device manager by this VM.
pub fn device_manager(&self) -> &DeviceManager {
&self.device_manager
}
/// Gets a mutable reference to the device manager by this VM.
pub fn device_manager_mut(&mut self) -> &mut DeviceManager {
&mut self.device_manager
}
/// Get a reference to EpollManager.
pub fn epoll_manager(&self) -> &EpollManager {
&self.epoll_manager
}
/// Get eventfd for exit notification.
pub fn get_reset_eventfd(&self) -> Option<&EventFd> {
self.reset_eventfd.as_ref()
}
/// Set guest kernel boot configurations.
pub fn set_kernel_config(&mut self, kernel_config: KernelConfigInfo) {
self.kernel_config = Some(kernel_config);
}
/// Get virtual machine shared instance information.
pub fn shared_info(&self) -> &Arc<RwLock<InstanceInfo>> {
&self.shared_info
}
/// Gets a reference to the address_space.address_space for guest memory owned by this VM.
pub fn vm_address_space(&self) -> Option<&AddressSpace> {
self.address_space.get_address_space()
}
/// Gets a reference to the address space for guest memory owned by this VM.
///
/// Note that `GuestMemory` does not include any device memory that may have been added after
/// this VM was constructed.
pub fn vm_as(&self) -> Option<&GuestAddressSpaceImpl> {
self.address_space.get_vm_as()
}
/// Get an immutable reference to the virtual machine configuration information.
pub fn vm_config(&self) -> &VmConfigInfo {
&self.vm_config
}
/// Set the virtual machine configuration information.
pub fn set_vm_config(&mut self, config: VmConfigInfo) {
self.vm_config = config;
}
/// Gets a reference to the kvm file descriptor owned by this VM.
pub fn vm_fd(&self) -> &VmFd {
&self.vm_fd
}
/// Returns true if the system upcall service is ready.
pub fn is_upcall_client_ready(&self) -> bool {
#[cfg(all(feature = "hotplug", feature = "dbs-upcall"))]
{
if let Some(upcall_client) = self.upcall_client() {
return upcall_client.is_ready();
}
}
false
}
/// Check whether the VM has been initialized.
pub fn is_vm_initialized(&self) -> bool {
let instance_state = {
// Use expect() to crash if the other thread poisoned this lock.
let shared_info = self.shared_info.read()
.expect("Failed to determine if instance is initialized because shared info couldn't be read due to poisoned lock");
shared_info.state
};
instance_state != InstanceState::Uninitialized
}
/// Check whether the VM instance is running.
pub fn is_vm_running(&self) -> bool {
let instance_state = {
// Use expect() to crash if the other thread poisoned this lock.
let shared_info = self.shared_info.read()
.expect("Failed to determine if instance is initialized because shared info couldn't be read due to poisoned lock");
shared_info.state
};
instance_state == InstanceState::Running
}
/// Save VM instance exit state
pub fn vm_exit(&self, exit_code: i32) {
if let Ok(mut info) = self.shared_info.write() {
info.state = InstanceState::Exited(exit_code);
} else {
error!(
self.logger,
"Failed to save exit state, couldn't be written due to poisoned lock"
);
}
}
/// Create a device operation context:
/// - if the VM is not yet initialized, return a boot-time context
/// - if the VM is running but the hotplug feature is not enabled, return an error
/// - if the VM is running but upcall initialization failed, return an error
/// - if the VM is running and upcall is initialized, return a hotplug context
pub fn create_device_op_context(
&mut self,
epoll_mgr: Option<EpollManager>,
) -> std::result::Result<DeviceOpContext, StartMicroVmError> {
if !self.is_vm_initialized() {
Ok(DeviceOpContext::create_boot_ctx(self, epoll_mgr))
} else {
self.create_device_hotplug_context(epoll_mgr)
}
}
pub(crate) fn check_health(&self) -> std::result::Result<(), StartMicroVmError> {
if self.kernel_config.is_none() {
return Err(StartMicroVmError::MissingKernelConfig);
}
Ok(())
}
pub(crate) fn get_dragonball_info(&self) -> (String, String) {
let guard = self.shared_info.read().unwrap();
let instance_id = guard.id.clone();
let dragonball_version = guard.vmm_version.clone();
(dragonball_version, instance_id)
}
}
impl Vm {
pub(crate) fn init_vcpu_manager(
&mut self,
vm_as: GuestAddressSpaceImpl,
vcpu_seccomp_filter: BpfProgram,
) -> std::result::Result<(), VcpuManagerError> {
let vcpu_manager = VcpuManager::new(
self.vm_fd.clone(),
&self.kvm,
&self.vm_config,
vm_as,
vcpu_seccomp_filter,
self.shared_info.clone(),
self.device_manager.io_manager(),
self.epoll_manager.clone(),
)?;
self.vcpu_manager = Some(vcpu_manager);
Ok(())
}
/// Get a reference to the vcpu manager.
pub(crate) fn vcpu_manager(
&self,
) -> std::result::Result<std::sync::MutexGuard<'_, VcpuManager>, VcpuManagerError> {
self.vcpu_manager
.as_ref()
.ok_or(VcpuManagerError::VcpuManagerNotInitialized)
.map(|mgr| mgr.lock().unwrap())
}
/// Pause all vcpus and record the instance downtime
pub fn pause_all_vcpus_with_downtime(&mut self) -> std::result::Result<(), VcpuManagerError> {
let ts = TimestampUs::default();
self.start_instance_downtime = ts.time_us;
self.vcpu_manager()?.pause_all_vcpus()?;
Ok(())
}
/// Resume all vcpus and calculate the instance downtime.
pub fn resume_all_vcpus_with_downtime(&mut self) -> std::result::Result<(), VcpuManagerError> {
self.vcpu_manager()?.resume_all_vcpus()?;
if self.start_instance_downtime != 0 {
let now = TimestampUs::default();
let downtime = now.time_us - self.start_instance_downtime;
info!(self.logger, "VM: instance downtime: {} us", downtime);
self.start_instance_downtime = 0;
if let Ok(mut info) = self.shared_info.write() {
info.last_instance_downtime = downtime;
} else {
error!(self.logger, "Failed to update live upgrade downtime, couldn't be written due to poisoned lock");
}
}
Ok(())
}
pub(crate) fn init_devices(
&mut self,
epoll_manager: EpollManager,
) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM: initializing devices ...");
let com1_sock_path = self.vm_config.serial_path.clone();
let kernel_config = self
.kernel_config
.as_mut()
.ok_or(StartMicroVmError::MissingKernelConfig)?;
info!(self.logger, "VM: create interrupt manager");
self.device_manager
.create_interrupt_manager()
.map_err(StartMicroVmError::DeviceManager)?;
info!(self.logger, "VM: create devices");
let vm_as =
self.address_space
.get_vm_as()
.ok_or(StartMicroVmError::AddressManagerError(
AddressManagerError::GuestMemoryNotInitialized,
))?;
self.device_manager.create_devices(
vm_as.clone(),
epoll_manager,
kernel_config,
com1_sock_path,
self.dmesg_fifo.take(),
self.address_space.address_space(),
)?;
info!(self.logger, "VM: start devices");
self.device_manager.start_devices()?;
info!(self.logger, "VM: initializing devices done");
Ok(())
}
/// Remove devices when shutting down the VM.
pub fn remove_devices(&mut self) -> std::result::Result<(), StopMicrovmError> {
info!(self.logger, "VM: remove devices");
let vm_as = self
.address_space
.get_vm_as()
.ok_or(StopMicrovmError::GuestMemoryNotInitialized)?;
self.device_manager
.remove_devices(
vm_as.clone(),
self.epoll_manager.clone(),
self.address_space.address_space(),
)
.map_err(StopMicrovmError::DeviceManager)
}
/// Reset the console into canonical mode.
pub fn reset_console(&self) -> std::result::Result<(), DeviceMgrError> {
self.device_manager.reset_console()
}
pub(crate) fn init_dmesg_logger(&mut self) {
let writer = self.dmesg_logger();
self.dmesg_fifo = Some(writer);
}
/// Create a dmesg writer that forwards output to the logger.
fn dmesg_logger(&self) -> Box<dyn io::Write + Send> {
Box::new(DmesgWriter::new(&self.logger))
}
pub(crate) fn init_guest_memory(&mut self) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM: initializing guest memory...");
// We are not allowing reinitialization of vm guest memory.
if self.address_space.is_initialized() {
return Ok(());
}
// vCPU boot-up requires local memory. Reserve 100 MiB of memory.
let mem_size = (self.vm_config.mem_size_mib as u64) << 20;
let mem_type = self.vm_config.mem_type.clone();
let mut mem_file_path = String::from("");
if mem_type == "hugetlbfs" {
let shared_info = self.shared_info.read()
.expect("Failed to determine if instance is initialized because shared info couldn't be read due to poisoned lock");
mem_file_path.push_str("/dragonball/");
mem_file_path.push_str(shared_info.id.as_str());
}
let mut vcpu_ids: Vec<u32> = Vec::new();
for i in 0..self.vm_config().max_vcpu_count {
vcpu_ids.push(i as u32);
}
// init default regions.
let mut numa_regions = Vec::with_capacity(1);
let numa_node = NumaRegionInfo {
size: self.vm_config.mem_size_mib as u64,
host_numa_node_id: None,
guest_numa_node_id: Some(0),
vcpu_ids,
};
numa_regions.push(numa_node);
info!(
self.logger,
"VM: mem_type:{} mem_file_path:{}, mem_size:{}, numa_regions:{:?}",
mem_type,
mem_file_path,
mem_size,
numa_regions,
);
let mut address_space_param = AddressSpaceMgrBuilder::new(&mem_type, &mem_file_path)
.map_err(StartMicroVmError::AddressManagerError)?;
address_space_param.set_kvm_vm_fd(self.vm_fd.clone());
self.address_space
.create_address_space(&self.resource_manager, &numa_regions, address_space_param)
.map_err(StartMicroVmError::AddressManagerError)?;
info!(self.logger, "VM: initializing guest memory done");
Ok(())
}
fn init_configure_system(
&mut self,
vm_as: &GuestAddressSpaceImpl,
) -> std::result::Result<(), StartMicroVmError> {
let vm_memory = vm_as.memory();
let kernel_config = self
.kernel_config
.as_ref()
.ok_or(StartMicroVmError::MissingKernelConfig)?;
//let cmdline = kernel_config.cmdline.clone();
let initrd: Option<InitrdConfig> = match kernel_config.initrd_file() {
Some(f) => {
let initrd_file = f.try_clone();
if initrd_file.is_err() {
return Err(StartMicroVmError::InitrdLoader(
LoadInitrdError::ReadInitrd(io::Error::from(io::ErrorKind::InvalidData)),
));
}
let res = self.load_initrd(vm_memory.deref(), &mut initrd_file.unwrap())?;
Some(res)
}
None => None,
};
self.configure_system_arch(vm_memory.deref(), kernel_config.kernel_cmdline(), initrd)
}
/// Loads the initrd from a file into the given memory slice.
///
/// * `vm_memory` - The guest memory the initrd is written to.
/// * `image` - The initrd image.
///
/// Returns the result of initrd loading
fn load_initrd<F>(
&self,
vm_memory: &GuestMemoryImpl,
image: &mut F,
) -> std::result::Result<InitrdConfig, LoadInitrdError>
where
F: Read + Seek,
{
use crate::error::LoadInitrdError::*;
let size: usize;
// Get the image size
match image.seek(SeekFrom::End(0)) {
Err(e) => return Err(ReadInitrd(e)),
Ok(0) => {
return Err(ReadInitrd(io::Error::new(
io::ErrorKind::InvalidData,
"Initrd image seek returned a size of zero",
)))
}
Ok(s) => size = s as usize,
};
// Go back to the image start
image.seek(SeekFrom::Start(0)).map_err(ReadInitrd)?;
// Get the target address
let address = dbs_boot::initrd_load_addr(vm_memory, size as u64).map_err(|_| LoadInitrd)?;
// Load the image into memory
vm_memory
.read_from(GuestAddress(address), image, size)
.map_err(|_| LoadInitrd)?;
Ok(InitrdConfig {
address: GuestAddress(address),
size,
})
}
fn load_kernel(
&mut self,
vm_memory: &GuestMemoryImpl,
) -> std::result::Result<KernelLoaderResult, StartMicroVmError> {
// This is the easy way out of consuming the value of the kernel_cmdline.
let kernel_config = self
.kernel_config
.as_mut()
.ok_or(StartMicroVmError::MissingKernelConfig)?;
let high_mem_addr = GuestAddress(dbs_boot::get_kernel_start());
#[cfg(target_arch = "x86_64")]
return linux_loader::loader::elf::Elf::load(
vm_memory,
None,
kernel_config.kernel_file_mut(),
Some(high_mem_addr),
)
.map_err(StartMicroVmError::KernelLoader);
#[cfg(target_arch = "aarch64")]
return linux_loader::loader::pe::PE::load(
vm_memory,
Some(GuestAddress(dbs_boot::get_kernel_start())),
kernel_config.kernel_file_mut(),
Some(high_mem_addr),
)
.map_err(StartMicroVmError::KernelLoader);
}
/// Set up the initial microVM state and start the vCPU threads.
///
/// This is the main entry point of the Vm object; it brings the virtual machine instance into
/// the running state.
pub fn start_microvm(
&mut self,
event_mgr: &mut EventManager,
vmm_seccomp_filter: BpfProgram,
vcpu_seccomp_filter: BpfProgram,
) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM: received instance start command");
if self.is_vm_initialized() {
return Err(StartMicroVmError::MicroVMAlreadyRunning);
}
let request_ts = TimestampUs::default();
self.start_instance_request_ts = request_ts.time_us;
self.start_instance_request_cpu_ts = request_ts.cputime_us;
self.init_dmesg_logger();
self.check_health()?;
// Use expect() to crash if the other thread poisoned this lock.
self.shared_info
.write()
.expect("Failed to start microVM because shared info couldn't be written due to poisoned lock")
.state = InstanceState::Starting;
self.init_guest_memory()?;
let vm_as = self
.vm_as()
.cloned()
.ok_or(StartMicroVmError::AddressManagerError(
AddressManagerError::GuestMemoryNotInitialized,
))?;
self.init_vcpu_manager(vm_as.clone(), vcpu_seccomp_filter)
.map_err(StartMicroVmError::Vcpu)?;
self.init_microvm(event_mgr.epoll_manager(), vm_as.clone(), request_ts)?;
self.init_configure_system(&vm_as)?;
#[cfg(feature = "dbs-upcall")]
self.init_upcall()?;
info!(self.logger, "VM: register events");
self.register_events(event_mgr)?;
info!(self.logger, "VM: start vcpus");
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.start_boot_vcpus(vmm_seccomp_filter)
.map_err(StartMicroVmError::Vcpu)?;
// Use expect() to crash if the other thread poisoned this lock.
self.shared_info
.write()
.expect("Failed to start microVM because shared info couldn't be written due to poisoned lock")
.state = InstanceState::Running;
info!(self.logger, "VM started");
Ok(())
}
}
#[cfg(feature = "hotplug")]
impl Vm {
#[cfg(feature = "dbs-upcall")]
/// Initialize the upcall client for the guest OS.
#[cfg(feature = "dbs-upcall")]
fn new_upcall(&mut self) -> std::result::Result<(), StartMicroVmError> {
// get vsock inner connector for upcall
let inner_connector = self
.device_manager
.get_vsock_inner_connector()
.ok_or(StartMicroVmError::UpcallMissVsock)?;
let mut upcall_client = UpcallClient::new(
inner_connector,
self.epoll_manager.clone(),
DevMgrService::default(),
)
.map_err(StartMicroVmError::UpcallInitError)?;
upcall_client
.connect()
.map_err(StartMicroVmError::UpcallConnectError)?;
self.upcall_client = Some(Arc::new(upcall_client));
info!(self.logger, "upcall client init success");
Ok(())
}
#[cfg(feature = "dbs-upcall")]
fn init_upcall(&mut self) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM upcall init");
if let Err(e) = self.new_upcall() {
info!(
self.logger,
"VM upcall init failed, no support hotplug: {}", e
);
Err(e)
} else {
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.set_upcall_channel(self.upcall_client().clone());
Ok(())
}
}
#[cfg(feature = "dbs-upcall")]
/// Get upcall client.
#[cfg(feature = "dbs-upcall")]
pub fn upcall_client(&self) -> &Option<Arc<UpcallClient<DevMgrService>>> {
&self.upcall_client
}
#[cfg(feature = "dbs-upcall")]
fn create_device_hotplug_context(
&self,
epoll_mgr: Option<EpollManager>,
) -> std::result::Result<DeviceOpContext, StartMicroVmError> {
if self.upcall_client().is_none() {
Err(StartMicroVmError::UpcallMissVsock)
} else if self.is_upcall_client_ready() {
Ok(DeviceOpContext::create_hotplug_ctx(self, epoll_mgr))
} else {
Err(StartMicroVmError::UpcallNotReady)
}
}
// We will support hotplug without upcall in future stages.
#[cfg(not(feature = "dbs-upcall"))]
fn create_device_hotplug_context(
&self,
_epoll_mgr: Option<EpollManager>,
) -> std::result::Result<DeviceOpContext, StartMicroVmError> {
Err(StartMicroVmError::MicroVMAlreadyRunning)
}
}
#[cfg(not(feature = "hotplug"))]
impl Vm {
fn init_upcall(&mut self) -> std::result::Result<(), StartMicroVmError> {
Ok(())
}
fn create_device_hotplug_context(
&self,
_epoll_mgr: Option<EpollManager>,
) -> std::result::Result<DeviceOpContext, StartMicroVmError> {
Err(StartMicroVmError::MicroVMAlreadyRunning)
}
}
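
A hedged sketch of preparing a `VmConfigInfo` before boot; the field values and the console socket path below are illustrative only, and the rest of the boot flow (`set_kernel_config`, seccomp filters, `start_microvm`) is as shown above.

fn configure_vm(vm: &mut Vm) {
    let config = VmConfigInfo {
        vcpu_count: 2,
        max_vcpu_count: 4,
        mem_size_mib: 512,
        mem_type: String::from("shmem"),
        // Hypothetical console socket path.
        serial_path: Some(String::from("/run/dragonball/console.sock")),
        ..Default::default()
    };
    vm.set_vm_config(config);
    // After set_kernel_config() and the seccomp filters are in place, the caller
    // would invoke vm.start_microvm(...) to bring the instance up.
}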


@@ -0,0 +1,280 @@
// Copyright (C) 2020-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::convert::TryInto;
use std::mem;
use std::ops::Deref;
use dbs_address_space::AddressSpace;
use dbs_boot::{add_e820_entry, bootparam, layout, mptable, BootParamsWrapper, InitrdConfig};
use dbs_utils::epoll_manager::EpollManager;
use dbs_utils::time::TimestampUs;
use kvm_bindings::{kvm_irqchip, kvm_pit_config, kvm_pit_state2, KVM_PIT_SPEAKER_DUMMY};
use linux_loader::cmdline::Cmdline;
use slog::info;
use vm_memory::{Address, Bytes, GuestAddress, GuestAddressSpace, GuestMemory};
use crate::address_space_manager::{GuestAddressSpaceImpl, GuestMemoryImpl};
use crate::error::{Error, Result, StartMicroVmError};
use crate::event_manager::EventManager;
use crate::vm::{Vm, VmError};
/// Configures the system and should be called once per vm before starting vcpu
/// threads.
///
/// # Arguments
///
/// * `guest_mem` - The memory to be used by the guest.
/// * `cmdline_addr` - Address in `guest_mem` where the kernel command line was
/// loaded.
/// * `cmdline_size` - Size of the kernel command line in bytes including the
/// null terminator.
/// * `initrd` - Information about where the ramdisk image was loaded in the
/// `guest_mem`.
/// * `boot_cpus` - Number of virtual CPUs the guest will have at boot time.
/// * `max_cpus` - Max number of virtual CPUs the guest will have.
/// * `rsv_mem_bytes` - Reserved memory for the microVM.
#[allow(clippy::too_many_arguments)]
fn configure_system<M: GuestMemory>(
guest_mem: &M,
address_space: Option<&AddressSpace>,
cmdline_addr: GuestAddress,
cmdline_size: usize,
initrd: &Option<InitrdConfig>,
boot_cpus: u8,
max_cpus: u8,
) -> super::Result<()> {
const KERNEL_BOOT_FLAG_MAGIC: u16 = 0xaa55;
const KERNEL_HDR_MAGIC: u32 = 0x5372_6448;
const KERNEL_LOADER_OTHER: u8 = 0xff;
const KERNEL_MIN_ALIGNMENT_BYTES: u32 = 0x0100_0000; // Must be non-zero.
let mmio_start = GuestAddress(layout::MMIO_LOW_START);
let mmio_end = GuestAddress(layout::MMIO_LOW_END);
let himem_start = GuestAddress(layout::HIMEM_START);
// Note that this puts the mptable at the last 1k of Linux's 640k base RAM
mptable::setup_mptable(guest_mem, boot_cpus, max_cpus).map_err(Error::MpTableSetup)?;
let mut params: BootParamsWrapper = BootParamsWrapper(bootparam::boot_params::default());
params.0.hdr.type_of_loader = KERNEL_LOADER_OTHER;
params.0.hdr.boot_flag = KERNEL_BOOT_FLAG_MAGIC;
params.0.hdr.header = KERNEL_HDR_MAGIC;
params.0.hdr.cmd_line_ptr = cmdline_addr.raw_value() as u32;
params.0.hdr.cmdline_size = cmdline_size as u32;
params.0.hdr.kernel_alignment = KERNEL_MIN_ALIGNMENT_BYTES;
if let Some(initrd_config) = initrd {
params.0.hdr.ramdisk_image = initrd_config.address.raw_value() as u32;
params.0.hdr.ramdisk_size = initrd_config.size as u32;
}
add_e820_entry(&mut params.0, 0, layout::EBDA_START, bootparam::E820_RAM)
.map_err(Error::BootSystem)?;
let mem_end = address_space.ok_or(Error::AddressSpace)?.last_addr();
if mem_end < mmio_start {
add_e820_entry(
&mut params.0,
himem_start.raw_value() as u64,
// it's safe to use unchecked_offset_from because
// mem_end > himem_start
mem_end.unchecked_offset_from(himem_start) as u64 + 1,
bootparam::E820_RAM,
)
.map_err(Error::BootSystem)?;
} else {
add_e820_entry(
&mut params.0,
himem_start.raw_value(),
// it's safe to use unchecked_offset_from because
// end_32bit_gap_start > himem_start
mmio_start.unchecked_offset_from(himem_start),
bootparam::E820_RAM,
)
.map_err(Error::BootSystem)?;
if mem_end > mmio_end {
add_e820_entry(
&mut params.0,
mmio_end.raw_value() + 1,
// it's safe to use unchecked_offset_from because mem_end > mmio_end
mem_end.unchecked_offset_from(mmio_end) as u64,
bootparam::E820_RAM,
)
.map_err(Error::BootSystem)?;
}
}
let zero_page_addr = GuestAddress(layout::ZERO_PAGE_START);
guest_mem
.checked_offset(zero_page_addr, mem::size_of::<bootparam::boot_params>())
.ok_or(Error::ZeroPagePastRamEnd)?;
guest_mem
.write_obj(params, zero_page_addr)
.map_err(|_| Error::ZeroPageSetup)?;
Ok(())
}
impl Vm {
/// Get the status of in-kernel PIT.
pub fn get_pit_state(&self) -> Result<kvm_pit_state2> {
self.vm_fd
.get_pit2()
.map_err(|e| Error::Vm(VmError::Irq(e)))
}
/// Set the status of in-kernel PIT.
pub fn set_pit_state(&self, pit_state: &kvm_pit_state2) -> Result<()> {
self.vm_fd
.set_pit2(pit_state)
.map_err(|e| Error::Vm(VmError::Irq(e)))
}
/// Get the status of in-kernel ioapic.
pub fn get_irqchip_state(&self, chip_id: u32) -> Result<kvm_irqchip> {
let mut irqchip: kvm_irqchip = kvm_irqchip {
chip_id,
..kvm_irqchip::default()
};
self.vm_fd
.get_irqchip(&mut irqchip)
.map(|_| irqchip)
.map_err(|e| Error::Vm(VmError::Irq(e)))
}
/// Set the status of in-kernel ioapic.
pub fn set_irqchip_state(&self, irqchip: &kvm_irqchip) -> Result<()> {
self.vm_fd
.set_irqchip(irqchip)
.map_err(|e| Error::Vm(VmError::Irq(e)))
}
}
impl Vm {
/// Initialize the virtual machine instance.
///
/// It initializes the virtual machine instance by:
/// 1) initialize virtual machine global state and configuration.
/// 2) create system devices, such as interrupt controller, PIT etc.
/// 3) create and start IO devices, such as serial, console, block, net, vsock etc.
/// 4) create and initialize vCPUs.
/// 5) configure CPU power management features.
/// 6) load guest kernel image.
pub fn init_microvm(
&mut self,
epoll_mgr: EpollManager,
vm_as: GuestAddressSpaceImpl,
request_ts: TimestampUs,
) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM: start initializing microvm ...");
self.init_tss()?;
// For x86_64 we need to create the interrupt controller before calling `KVM_CREATE_VCPUS`
// while on aarch64 we need to do it the other way around.
self.setup_interrupt_controller()?;
self.create_pit()?;
self.init_devices(epoll_mgr)?;
let reset_event_fd = self.device_manager.get_reset_eventfd().unwrap();
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.set_reset_event_fd(reset_event_fd)
.map_err(StartMicroVmError::Vcpu)?;
if self.vm_config.cpu_pm == "on" {
// TODO: add cpu_pm support. issue #4590.
info!(self.logger, "VM: enable CPU disable_idle_exits capability");
}
let vm_memory = vm_as.memory();
let kernel_loader_result = self.load_kernel(vm_memory.deref())?;
self.vcpu_manager()
.map_err(StartMicroVmError::Vcpu)?
.create_boot_vcpus(request_ts, kernel_loader_result.kernel_load)
.map_err(StartMicroVmError::Vcpu)?;
info!(self.logger, "VM: initializing microvm done");
Ok(())
}
/// Execute system architecture specific configurations.
///
/// 1) set guest kernel boot parameters
/// 2) set up BIOS configuration data structures, mainly implementing the MPSpec.
pub fn configure_system_arch(
&self,
vm_memory: &GuestMemoryImpl,
cmdline: &Cmdline,
initrd: Option<InitrdConfig>,
) -> std::result::Result<(), StartMicroVmError> {
let cmdline_addr = GuestAddress(dbs_boot::layout::CMDLINE_START);
linux_loader::loader::load_cmdline(vm_memory, cmdline_addr, cmdline)
.map_err(StartMicroVmError::LoadCommandline)?;
configure_system(
vm_memory,
self.address_space.address_space(),
cmdline_addr,
cmdline.as_str().len() + 1,
&initrd,
self.vm_config.vcpu_count,
self.vm_config.max_vcpu_count,
)
.map_err(StartMicroVmError::ConfigureSystem)
}
/// Sets the TSS (Task State Segment) base address for the VM.
pub(crate) fn init_tss(&mut self) -> std::result::Result<(), StartMicroVmError> {
self.vm_fd
.set_tss_address(dbs_boot::layout::KVM_TSS_ADDRESS.try_into().unwrap())
.map_err(|e| StartMicroVmError::ConfigureVm(VmError::VmSetup(e)))
}
/// Creates the irq chip and an in-kernel device model for the PIT.
pub(crate) fn setup_interrupt_controller(
&mut self,
) -> std::result::Result<(), StartMicroVmError> {
self.vm_fd
.create_irq_chip()
.map_err(|e| StartMicroVmError::ConfigureVm(VmError::VmSetup(e)))
}
/// Creates an in-kernel device model for the PIT.
pub(crate) fn create_pit(&self) -> std::result::Result<(), StartMicroVmError> {
info!(self.logger, "VM: create pit");
// We need to enable the emulation of a dummy speaker port stub so that writing to port 0x61
// (i.e. KVM_SPEAKER_BASE_ADDRESS) does not trigger an exit to user space.
let pit_config = kvm_pit_config {
flags: KVM_PIT_SPEAKER_DUMMY,
..kvm_pit_config::default()
};
// Safe because we know that our file is a VM fd, we know the kernel will only read the
// correct amount of memory from our pointer, and we verify the return result.
self.vm_fd
.create_pit2(pit_config)
.map_err(|e| StartMicroVmError::ConfigureVm(VmError::VmSetup(e)))
}
pub(crate) fn register_events(
&mut self,
event_mgr: &mut EventManager,
) -> std::result::Result<(), StartMicroVmError> {
let reset_evt = self
.device_manager
.get_reset_eventfd()
.map_err(StartMicroVmError::DeviceManager)?;
event_mgr
.register_exit_eventfd(&reset_evt)
.map_err(|_| StartMicroVmError::RegisterEvent)?;
self.reset_eventfd = Some(reset_evt);
Ok(())
}
}
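
A hedged sketch of how the in-kernel PIT and irqchip accessors above could be used to snapshot and restore interrupt state around a pause/resume cycle; the chip id and the surrounding flow are illustrative, not taken from the diff.

fn snapshot_and_restore_irq_state(vm: &Vm) -> Result<()> {
    // Save the current in-kernel device state.
    let pit_state = vm.get_pit_state()?;
    let irqchip_state = vm.get_irqchip_state(0)?; // chip id 0 is illustrative

    // ... pause vCPUs, snapshot, or perform a live upgrade here ...

    // Restore the saved state before resuming.
    vm.set_pit_state(&pit_state)?;
    vm.set_irqchip_state(&irqchip_state)?;
    Ok(())
}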

src/dragonball/src/vmm.rs (new file, 215 lines)

@@ -0,0 +1,215 @@
// Copyright (C) 2020-2022 Alibaba Cloud. All rights reserved.
// Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
// SPDX-License-Identifier: Apache-2.0
//
// Portions Copyright 2017 The Chromium OS Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the THIRD-PARTY file.
use std::os::unix::io::RawFd;
use std::sync::{Arc, Mutex, RwLock};
use dbs_utils::epoll_manager::EpollManager;
use log::{error, info, warn};
use seccompiler::BpfProgram;
use vmm_sys_util::eventfd::EventFd;
use crate::api::v1::{InstanceInfo, VmmService};
use crate::error::{EpollError, Result};
use crate::event_manager::{EventContext, EventManager};
use crate::vm::Vm;
use crate::{EXIT_CODE_GENERIC_ERROR, EXIT_CODE_OK};
/// Global coordinator to manage API servers, virtual machines, upgrade etc.
///
/// Originally Firecracker assumed that a VMM only manages one VM and did not distinguish between
/// the VMM and the VM, which led to a mixed and confusing design. Now we have explicitly built the object model as:
/// |---Vmm API Server--<-1:1-> HTTP API Server
/// | |----------<-1:1-> Shimv2/CRI API Server
/// |
/// Vmm <-1:N-> Vm <-1:1-> Address Space Manager <-1:N-> GuestMemory
/// ^ ^---1:1-> Device Manager <-1:N-> Device
/// | ^---1:1-> Resource Manager
/// | ^---1:N-> Vcpu
/// |---<-1:N-> Event Manager
pub struct Vmm {
pub(crate) event_ctx: EventContext,
epoll_manager: EpollManager,
// Will change to a HashMap when enabling 1 VMM with multiple VMs.
vm: Vm,
vcpu_seccomp_filter: BpfProgram,
vmm_seccomp_filter: BpfProgram,
}
impl Vmm {
/// Create a Virtual Machine Monitor instance.
pub fn new(
api_shared_info: Arc<RwLock<InstanceInfo>>,
api_event_fd: EventFd,
vmm_seccomp_filter: BpfProgram,
vcpu_seccomp_filter: BpfProgram,
kvm_fd: Option<RawFd>,
) -> Result<Self> {
let epoll_manager = EpollManager::default();
Self::new_with_epoll_manager(
api_shared_info,
api_event_fd,
epoll_manager,
vmm_seccomp_filter,
vcpu_seccomp_filter,
kvm_fd,
)
}
/// Create a Virtual Machine Monitor instance with an epoll_manager.
pub fn new_with_epoll_manager(
api_shared_info: Arc<RwLock<InstanceInfo>>,
api_event_fd: EventFd,
epoll_manager: EpollManager,
vmm_seccomp_filter: BpfProgram,
vcpu_seccomp_filter: BpfProgram,
kvm_fd: Option<RawFd>,
) -> Result<Self> {
let vm = Vm::new(kvm_fd, api_shared_info, epoll_manager.clone())?;
let event_ctx = EventContext::new(api_event_fd)?;
Ok(Vmm {
event_ctx,
epoll_manager,
vm,
vcpu_seccomp_filter,
vmm_seccomp_filter,
})
}
/// Get a reference to a virtual machine managed by the VMM.
pub fn get_vm(&self) -> Option<&Vm> {
Some(&self.vm)
}
/// Get a mutable reference to a virtual machine managed by the VMM.
pub fn get_vm_mut(&mut self) -> Option<&mut Vm> {
Some(&mut self.vm)
}
/// Get the seccomp rules for vCPU threads.
pub fn vcpu_seccomp_filter(&self) -> BpfProgram {
self.vcpu_seccomp_filter.clone()
}
/// Get the seccomp rules for VMM threads.
pub fn vmm_seccomp_filter(&self) -> BpfProgram {
self.vmm_seccomp_filter.clone()
}
/// Run the event loop to service API requests.
///
/// # Arguments
///
/// * `vmm` - An Arc reference to the global Vmm instance.
/// * `service` - VMM Service provider.
pub fn run_vmm_event_loop(vmm: Arc<Mutex<Vmm>>, mut service: VmmService) -> i32 {
let epoll_mgr = vmm.lock().unwrap().epoll_manager.clone();
let mut event_mgr =
EventManager::new(&vmm, epoll_mgr).expect("Cannot create epoll manager");
'poll: loop {
match event_mgr.handle_events(-1) {
Ok(_) => {
// Check whether there are pending vmm events.
if event_mgr.fetch_vmm_event_count() == 0 {
continue;
}
let mut v = vmm.lock().unwrap();
if v.event_ctx.api_event_triggered {
// The run_vmm_action() needs to access event_mgr, so it could
// not be handled in EpollHandler::handle_events(). It has been
// delayed to the main loop.
v.event_ctx.api_event_triggered = false;
service
.run_vmm_action(&mut v, &mut event_mgr)
.unwrap_or_else(|_| {
warn!("got spurious notification from api thread");
});
}
if v.event_ctx.exit_evt_triggered {
info!("Gracefully terminated VMM control loop");
return v.stop(EXIT_CODE_OK as i32);
}
}
Err(e) => {
error!("Abruptly exited VMM control loop: {:?}", e);
if let EpollError::EpollMgr(dbs_utils::epoll_manager::Error::Epoll(e)) = e {
if e.errno() == libc::EAGAIN || e.errno() == libc::EINTR {
continue 'poll;
}
}
return vmm.lock().unwrap().stop(EXIT_CODE_GENERIC_ERROR as i32);
}
}
}
}
/// Waits for all vCPUs to exit and terminates the Dragonball process.
fn stop(&mut self, exit_code: i32) -> i32 {
info!("Vmm is stopping.");
if let Some(vm) = self.get_vm_mut() {
if vm.is_vm_initialized() {
if let Err(e) = vm.remove_devices() {
warn!("failed to remove devices: {:?}", e);
}
if let Err(e) = vm.reset_console() {
warn!("Cannot set canonical mode for the terminal. {:?}", e);
}
// Now we use exit_code instead of invoking _exit to terminate the
// process, so all vcpu threads should be stopped before the vmm
// event loop exits.
match vm.vcpu_manager() {
Ok(mut mgr) => {
if let Err(e) = mgr.exit_all_vcpus() {
warn!("Failed to exit vcpu thread. {:?}", e);
}
}
Err(e) => warn!("Failed to get vcpu manager {:?}", e),
}
// Save the exit state in the VM instead of exiting the process.
vm.vm_exit(exit_code);
}
}
exit_code
}
}
#[cfg(test)]
pub(crate) mod tests {
use super::*;
pub fn create_vmm_instance() -> Vmm {
let info = Arc::new(RwLock::new(InstanceInfo::default()));
let event_fd = EventFd::new(libc::EFD_NONBLOCK).unwrap();
let seccomp_filter: BpfProgram = Vec::new();
let epoll_manager = EpollManager::default();
Vmm::new_with_epoll_manager(
info,
event_fd,
epoll_manager,
seccomp_filter.clone(),
seccomp_filter,
None,
)
.unwrap()
}
#[test]
fn test_create_vmm_instance() {
create_vmm_instance();
}
}
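For readers new to the `Vmm` type, the following is a minimal construction sketch mirroring the `create_vmm_instance()` test helper above. It is not part of this change set: the crate-external paths (`dragonball::Vmm`, `dragonball::api::v1::InstanceInfo`) are assumptions, and a real caller would also need a `VmmService` instance to drive `run_vmm_event_loop()`.

```rust
// Illustrative sketch only; inside the crate the tests use `super::*` instead
// of the assumed public paths shown here.
use std::sync::{Arc, RwLock};

use dragonball::api::v1::InstanceInfo;
use dragonball::Vmm;
use seccompiler::BpfProgram;
use vmm_sys_util::eventfd::EventFd;

fn build_vmm() -> Vmm {
    let info = Arc::new(RwLock::new(InstanceInfo::default()));
    // Non-blocking eventfd that the API thread uses to wake the event loop.
    let api_event_fd = EventFd::new(libc::EFD_NONBLOCK).unwrap();
    // An empty BPF program effectively disables seccomp filtering in this sketch.
    let seccomp: BpfProgram = Vec::new();
    Vmm::new(info, api_event_fd, seccomp.clone(), seccomp, None).unwrap()
}
```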

src/libs/Cargo.lock (generated, 394 lines)

@@ -2,6 +2,15 @@
# It is not intended for manual editing.
version = 3
[[package]]
name = "aho-corasick"
version = "0.7.18"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1e37cfd5e7657ada45f742d6e99ca5788580b5c529dc78faf11ece6dc702656f"
dependencies = [
"memchr",
]
[[package]]
name = "anyhow"
version = "1.0.57"
@@ -27,9 +36,9 @@ dependencies = [
[[package]]
name = "autocfg"
version = "1.0.1"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cdb031dd78e28731d87d56cc8ffef4a8f36ca26c38fe2de700543e627f8a464a"
checksum = "d468802bab17cbc0cc575e9b053f41e72aa36bfa6b7f55e3529ffa43161b97fa"
[[package]]
name = "bitflags"
@@ -37,6 +46,12 @@ version = "1.2.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "cf1de2fe8c75bc145a2f577add951f8134889b4795d47466a54a5c846d691693"
[[package]]
name = "byte-unit"
version = "3.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "415301c9de11005d4b92193c0eb7ac7adc37e5a49e0ac9bed0a42343512744b8"
[[package]]
name = "byteorder"
version = "1.4.3"
@@ -71,6 +86,18 @@ version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "baf1de4339761588bc0619e3cbc0120ee582ebb74b53b4efbf79117bd2da40fd"
[[package]]
name = "cgroups-rs"
version = "0.2.8"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1b827f9d9f6c2fff719d25f5d44cbc8d2ef6df1ef00d055c5c14d5dc25529579"
dependencies = [
"libc",
"log",
"nix 0.23.1",
"regex",
]
[[package]]
name = "chrono"
version = "0.4.19"
@@ -84,6 +111,12 @@ dependencies = [
"winapi",
]
[[package]]
name = "common-path"
version = "1.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2382f75942f4b3be3690fe4f86365e9c853c1587d6ee58212cebf6e2a9ccd101"
[[package]]
name = "crossbeam-channel"
version = "0.5.2"
@@ -121,6 +154,17 @@ version = "1.6.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e78d4f1cc4ae33bbfc157ed5d5a5ef3bc29227303d595861deb238fcec4e9457"
[[package]]
name = "fail"
version = "0.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ec3245a0ca564e7f3c797d20d833a6870f57a728ac967d5225b3ffdef4465011"
dependencies = [
"lazy_static",
"log",
"rand 0.8.5",
]
[[package]]
name = "fastrand"
version = "1.6.0"
@@ -225,6 +269,34 @@ dependencies = [
"slab",
]
[[package]]
name = "getrandom"
version = "0.1.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8fc3cb4d91f53b50155bdcfd23f6a4c39ae1969c2ae85982b135750cccaf5fce"
dependencies = [
"cfg-if",
"libc",
"wasi 0.9.0+wasi-snapshot-preview1",
]
[[package]]
name = "getrandom"
version = "0.2.6"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9be70c98951c83b8d2f8f60d7065fa6d5146873094452a1008da8c2f1e4205ad"
dependencies = [
"cfg-if",
"libc",
"wasi 0.10.2+wasi-snapshot-preview1",
]
[[package]]
name = "glob"
version = "0.3.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9b919933a397b79c37e33b77bb2aa3dc8eb6e165ad809e58ff75bc7db2e34574"
[[package]]
name = "hashbrown"
version = "0.11.2"
@@ -240,6 +312,15 @@ dependencies = [
"unicode-segmentation",
]
[[package]]
name = "hermit-abi"
version = "0.1.19"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "62b467343b94ba476dcb2500d242dadbb39557df889310ac77c5d99100aaac33"
dependencies = [
"libc",
]
[[package]]
name = "indexmap"
version = "1.8.1"
@@ -283,6 +364,50 @@ version = "1.0.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1aab8fc367588b89dcee83ab0fd66b72b50b72fa1904d7095045ace2b0c81c35"
[[package]]
name = "kata-sys-util"
version = "0.1.0"
dependencies = [
"byteorder",
"cgroups-rs",
"chrono",
"common-path",
"fail",
"kata-types",
"lazy_static",
"libc",
"nix 0.24.2",
"num_cpus",
"oci",
"once_cell",
"rand 0.7.3",
"serde_json",
"serial_test",
"slog",
"slog-scope",
"subprocess",
"tempfile",
"thiserror",
]
[[package]]
name = "kata-types"
version = "0.1.0"
dependencies = [
"byte-unit",
"glob",
"lazy_static",
"num_cpus",
"oci",
"regex",
"serde",
"serde_json",
"slog",
"slog-scope",
"thiserror",
"toml",
]
[[package]]
name = "lazy_static"
version = "1.4.0"
@@ -295,6 +420,16 @@ version = "0.2.124"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "21a41fed9d98f27ab1c6d161da622a4fa35e8a54a8adc24bbf3ddd0ef70b0e50"
[[package]]
name = "lock_api"
version = "0.4.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "327fa5b6a6940e4699ec49a9beae1ea4845c6bab9314e4f84ac68742139d8c53"
dependencies = [
"autocfg",
"scopeguard",
]
[[package]]
name = "log"
version = "0.4.16"
@@ -362,9 +497,9 @@ checksum = "e5ce46fe64a9d73be07dcbe690a38ce1b293be448fd8ce1e6c1b8062c9f72c6a"
[[package]]
name = "nix"
version = "0.20.2"
version = "0.23.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f5e06129fb611568ef4e868c14b326274959aa70ff7776e9d55323531c374945"
checksum = "9f866317acbd3a240710c63f065ffb1e4fd466259045ccb504130b7f668f35c6"
dependencies = [
"bitflags",
"cc",
@@ -375,12 +510,11 @@ dependencies = [
[[package]]
name = "nix"
version = "0.23.1"
version = "0.24.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "9f866317acbd3a240710c63f065ffb1e4fd466259045ccb504130b7f668f35c6"
checksum = "195cdbc1741b8134346d515b3a56a1c94b0912758009cfd53f99ea0f57b065fc"
dependencies = [
"bitflags",
"cc",
"cfg-if",
"libc",
"memoffset",
@@ -414,12 +548,57 @@ dependencies = [
"autocfg",
]
[[package]]
name = "num_cpus"
version = "1.13.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "19e64526ebdee182341572e50e9ad03965aa510cd94427a4549448f285e957a1"
dependencies = [
"hermit-abi",
"libc",
]
[[package]]
name = "oci"
version = "0.1.0"
dependencies = [
"libc",
"serde",
"serde_derive",
"serde_json",
]
[[package]]
name = "once_cell"
version = "1.9.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "da32515d9f6e6e489d7bc9d84c71b060db7247dc035bbe44eac88cf87486d8d5"
[[package]]
name = "parking_lot"
version = "0.11.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "7d17b78036a60663b797adeaee46f5c9dfebb86948d1255007a1d6be0271ff99"
dependencies = [
"instant",
"lock_api",
"parking_lot_core",
]
[[package]]
name = "parking_lot_core"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d76e8e1493bcac0d2766c42737f34458f1c8c50c0d23bcb24ea953affb273216"
dependencies = [
"cfg-if",
"instant",
"libc",
"redox_syscall",
"smallvec",
"winapi",
]
[[package]]
name = "petgraph"
version = "0.5.1"
@@ -442,6 +621,12 @@ version = "0.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8b870d8c151b6f2fb93e84a13146138f05d02ed11c7e7c54f8826aaaf7c9f184"
[[package]]
name = "ppv-lite86"
version = "0.2.16"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "eb9f9e6e233e5c4a35559a617bf40a4ec447db2e84c20b55a6f83167b7e57872"
[[package]]
name = "proc-macro2"
version = "1.0.37"
@@ -504,9 +689,9 @@ dependencies = [
[[package]]
name = "protobuf"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8e86d370532557ae7573551a1ec8235a0f8d6cb276c7c9e6aa490b511c447485"
checksum = "cf7e6d18738ecd0902d30d1ad232c9125985a3422929b16c65517b38adc14f96"
dependencies = [
"serde",
"serde_derive",
@@ -514,18 +699,18 @@ dependencies = [
[[package]]
name = "protobuf-codegen"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "de113bba758ccf2c1ef816b127c958001b7831136c9bc3f8e9ec695ac4e82b0c"
checksum = "aec1632b7c8f2e620343439a7dfd1f3c47b18906c4be58982079911482b5d707"
dependencies = [
"protobuf",
]
[[package]]
name = "protobuf-codegen-pure"
version = "2.14.0"
version = "2.27.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2d1a4febc73bf0cada1d77c459a0c8e5973179f1cfd5b0f1ab789d45b17b6440"
checksum = "9f8122fdb18e55190c796b088a16bdb70cd7acdcd48f7a8b796b58c62e532cc6"
dependencies = [
"protobuf",
"protobuf-codegen",
@@ -536,6 +721,7 @@ name = "protocols"
version = "0.1.0"
dependencies = [
"async-trait",
"oci",
"protobuf",
"serde",
"serde_json",
@@ -552,6 +738,77 @@ dependencies = [
"proc-macro2",
]
[[package]]
name = "rand"
version = "0.7.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6a6b1679d49b24bbfe0c803429aa1874472f50d9b363131f0e89fc356b544d03"
dependencies = [
"getrandom 0.1.16",
"libc",
"rand_chacha 0.2.2",
"rand_core 0.5.1",
"rand_hc",
]
[[package]]
name = "rand"
version = "0.8.5"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "34af8d1a0e25924bc5b7c43c079c942339d8f0a8b57c39049bef581b46327404"
dependencies = [
"libc",
"rand_chacha 0.3.1",
"rand_core 0.6.3",
]
[[package]]
name = "rand_chacha"
version = "0.2.2"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f4c8ed856279c9737206bf725bf36935d8666ead7aa69b52be55af369d193402"
dependencies = [
"ppv-lite86",
"rand_core 0.5.1",
]
[[package]]
name = "rand_chacha"
version = "0.3.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e6c10a63a0fa32252be49d21e7709d4d4baf8d231c2dbce1eaa8141b9b127d88"
dependencies = [
"ppv-lite86",
"rand_core 0.6.3",
]
[[package]]
name = "rand_core"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "90bde5296fc891b0cef12a6d03ddccc162ce7b2aff54160af9338f8d40df6d19"
dependencies = [
"getrandom 0.1.16",
]
[[package]]
name = "rand_core"
version = "0.6.3"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d34f1408f55294453790c48b2f1ebbb1c5b4b7563eb1f418bcfcfdbb06ebb4e7"
dependencies = [
"getrandom 0.2.6",
]
[[package]]
name = "rand_hc"
version = "0.2.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ca3129af7b92a17112d59ad498c6f81eaf463253766b90396d39ea7a39d6613c"
dependencies = [
"rand_core 0.5.1",
]
[[package]]
name = "redox_syscall"
version = "0.2.10"
@@ -561,6 +818,23 @@ dependencies = [
"bitflags",
]
[[package]]
name = "regex"
version = "1.5.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "d07a8629359eb56f1e2fb1652bb04212c072a87ba68546a04065d525673ac461"
dependencies = [
"aho-corasick",
"memchr",
"regex-syntax",
]
[[package]]
name = "regex-syntax"
version = "0.6.25"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f497285884f3fcff424ffc933e56d7cbca511def0c9831a7f9b5f6153e3cc89b"
[[package]]
name = "remove_dir_all"
version = "0.5.3"
@@ -585,19 +859,25 @@ dependencies = [
]
[[package]]
name = "serde"
version = "1.0.133"
name = "scopeguard"
version = "1.1.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "97565067517b60e2d1ea8b268e59ce036de907ac523ad83a0475da04e818989a"
checksum = "d29ab0c6d3fc0ee92fe66e2d99f700eab17a8d57d1c1d3b748380fb20baa78cd"
[[package]]
name = "serde"
version = "1.0.136"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ce31e24b01e1e524df96f1c2fdd054405f8d7376249a5110886fb4b658484789"
dependencies = [
"serde_derive",
]
[[package]]
name = "serde_derive"
version = "1.0.133"
version = "1.0.136"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ed201699328568d8d08208fdd080e3ff594e6c422e438b6705905da01005d537"
checksum = "08597e7152fcd306f41838ed3e37be9eaeed2b61c42e2117266a554fab4662f9"
dependencies = [
"proc-macro2",
"quote",
@@ -615,6 +895,28 @@ dependencies = [
"serde",
]
[[package]]
name = "serial_test"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "e0bccbcf40c8938196944a3da0e133e031a33f4d6b72db3bda3cc556e361905d"
dependencies = [
"lazy_static",
"parking_lot",
"serial_test_derive",
]
[[package]]
name = "serial_test_derive"
version = "0.5.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "b2acd6defeddb41eb60bb468f8825d0cfd0c2a76bc03bfd235b6a1dc4f6a1ad5"
dependencies = [
"proc-macro2",
"quote",
"syn",
]
[[package]]
name = "slab"
version = "0.4.6"
@@ -641,9 +943,9 @@ dependencies = [
[[package]]
name = "slog-json"
version = "2.4.0"
version = "2.5.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "52e9b96fb6b5e80e371423b4aca6656eb537661ce8f82c2697e619f8ca85d043"
checksum = "0f7f7a952ce80fca9da17bf0a53895d11f8aa1ba063668ca53fc72e7869329e9"
dependencies = [
"chrono",
"serde",
@@ -662,6 +964,12 @@ dependencies = [
"slog",
]
[[package]]
name = "smallvec"
version = "1.8.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "f2dd574626839106c320a323308629dcb1acfc96e32a8cba364ddc61ac23ee83"
[[package]]
name = "socket2"
version = "0.4.4"
@@ -672,6 +980,16 @@ dependencies = [
"winapi",
]
[[package]]
name = "subprocess"
version = "0.2.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "0c2e86926081dda636c546d8c5e641661049d7562a68f5488be4a1f7f66f6086"
dependencies = [
"libc",
"winapi",
]
[[package]]
name = "syn"
version = "1.0.91"
@@ -725,21 +1043,20 @@ dependencies = [
[[package]]
name = "thread_local"
version = "1.1.3"
version = "1.1.4"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "8018d24e04c95ac8790716a5987d0fec4f8b27249ffa0f7d33f1369bdfb88cbd"
checksum = "5516c27b78311c50bf42c071425c560ac799b11c30b31f87e3081965fe5e0180"
dependencies = [
"once_cell",
]
[[package]]
name = "time"
version = "0.1.44"
version = "0.1.43"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "6db9e6914ab8b1ae1c260a4ae7a49b6c5611b40328a735b21862567685e73255"
checksum = "ca8a50ef2360fbd1eeb0ecd46795a87a19024eb4b53c5dc916ca1fd95fe62438"
dependencies = [
"libc",
"wasi 0.10.0+wasi-snapshot-preview1",
"winapi",
]
@@ -784,17 +1101,26 @@ dependencies = [
]
[[package]]
name = "ttrpc"
version = "0.5.2"
name = "toml"
version = "0.5.9"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "66a973ce6d5eaa20c173635b29ffb660dafbc7ef109172c0015ba44e47a23711"
checksum = "8d82e1a7758622a465f8cee077614c73484dac5b836c02ff6a40d5d1010324d7"
dependencies = [
"serde",
]
[[package]]
name = "ttrpc"
version = "0.6.1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "2ecfff459a859c6ba6668ff72b34c2f1d94d9d58f7088414c2674ad0f31cc7d8"
dependencies = [
"async-trait",
"byteorder",
"futures",
"libc",
"log",
"nix 0.20.2",
"nix 0.23.1",
"protobuf",
"protobuf-codegen-pure",
"thiserror",
@@ -853,9 +1179,15 @@ dependencies = [
[[package]]
name = "wasi"
version = "0.10.0+wasi-snapshot-preview1"
version = "0.9.0+wasi-snapshot-preview1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "1a143597ca7c7793eff794def352d41792a93c481eb1042423ff7ff72ba2c31f"
checksum = "cccddf32554fecc6acb585f82a32a72e28b48f8c4c1883ddfeeeaa96f7d8e519"
[[package]]
name = "wasi"
version = "0.10.2+wasi-snapshot-preview1"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "fd6fbd9a79829dd1ad0cc20627bf1ed606756a7f77edff7b66b7064f9cb327c6"
[[package]]
name = "wasi"


@@ -1,7 +1,10 @@
[workspace]
members = [
"logging",
"kata-types",
"kata-sys-util",
"safe-path",
"protocols",
"oci",
]
resolver = "2"

src/libs/Makefile (new file, 42 lines)

@@ -0,0 +1,42 @@
# Copyright (c) 2021 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#
EXTRA_RUSTFEATURES :=
EXTRA_TEST_FLAGS :=
USERID=$(shell id -u)
ifeq ($(USERID), 0)
override EXTRA_TEST_FLAGS = --ignored
endif
default: build
build:
cargo build --all-features
check: clippy format
clippy:
@echo "INFO: cargo clippy..."
cargo clippy --all-targets --all-features --release \
-- \
-D warnings
format:
@echo "INFO: cargo fmt..."
cargo fmt -- --check
clean:
cargo clean
# It is essential to run these tests using *both* build profiles.
# See the `test_logger_levels()` test for further information.
test:
@echo "INFO: testing libraries for development build"
cargo test --all $(EXTRA_RUSTFEATURES) -- --nocapture $(EXTRA_TEST_FLAGS)
@echo "INFO: testing libraries for release build"
cargo test --release --all $(EXTRA_RUSTFEATURES) -- --nocapture $(EXTRA_TEST_FLAGS)
.PHONY: install vendor


@@ -6,5 +6,7 @@ Currently it provides following library crates:
| Library | Description |
|-|-|
| [logging](logging/) | Facilities to setup logging subsystem based slog. |
| [logging](logging/) | Facilities to setup logging subsystem based on slog. |
| [system utilities](kata-sys-util/) | Collection of facilities and helpers to access system services. |
| [types](kata-types/) | Collection of constants and data types shared by multiple Kata Containers components. |
| [safe-path](safe-path/) | Utilities to safely resolve filesystem paths. |


@@ -0,0 +1,36 @@
[package]
name = "kata-sys-util"
version = "0.1.0"
description = "System Utilities for Kata Containers"
keywords = ["kata", "container", "runtime"]
authors = ["The Kata Containers community <kata-dev@lists.katacontainers.io>"]
repository = "https://github.com/kata-containers/kata-containers.git"
homepage = "https://katacontainers.io/"
readme = "README.md"
license = "Apache-2.0"
edition = "2018"
[dependencies]
byteorder = "1.4.3"
cgroups = { package = "cgroups-rs", version = "0.2.7" }
chrono = "0.4.0"
common-path = "=1.0.0"
fail = "0.5.0"
lazy_static = "1.4.0"
libc = "0.2.100"
nix = "0.24.1"
once_cell = "1.9.0"
serde_json = "1.0.73"
slog = "2.5.2"
slog-scope = "4.4.0"
subprocess = "0.2.8"
rand = "0.7.2"
thiserror = "1.0.30"
kata-types = { path = "../kata-types" }
oci = { path = "../oci" }
[dev-dependencies]
num_cpus = "1.13.1"
serial_test = "0.5.1"
tempfile = "3.2.0"


@@ -0,0 +1,19 @@
# kata-sys-util
This crate is a collection of utilities and helpers for
[Kata Containers](https://github.com/kata-containers/kata-containers/) components to access system services.
It provides safe wrappers over system services, such as:
- cgroups
- file systems
- mount
- NUMA
## Support
**Operating Systems**:
- Linux
## License
This code is licensed under [Apache-2.0](../../../LICENSE).


@@ -0,0 +1,104 @@
// Copyright (c) 2019-2021 Alibaba Cloud
// Copyright (c) 2019-2021 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::fs;
use std::io::Result;
use std::os::unix::fs::{FileTypeExt, MetadataExt};
use std::path::{Path, PathBuf};
use nix::sys::stat;
use crate::{eother, sl};
const SYS_DEV_BLOCK_PATH: &str = "/sys/dev/block";
const BLKDEV_PARTITION: &str = "partition";
const BLKDEV_DEV_FILE: &str = "dev";
/// Get major and minor number of the device or of the device hosting the regular file/directory.
pub fn get_devid_for_blkio_cgroup<P: AsRef<Path>>(path: P) -> Result<Option<(u64, u64)>> {
let md = fs::metadata(path)?;
if md.is_dir() || md.is_file() {
// For regular file/directory, get major/minor of the block device hosting it.
// Note that we need to get the major/minor of the block device instead of partition,
// e.g. /dev/sda instead of /dev/sda3, because blkio cgroup works with block major/minor.
let id = md.dev();
Ok(Some((stat::major(id), stat::minor(id))))
} else if md.file_type().is_block_device() {
// For block device, get major/minor of the device special file itself
get_block_device_id(md.rdev())
} else {
Ok(None)
}
}
/// Get the block device major/minor number from a partition/block device(itself).
///
/// For example, given the dev_t of /dev/sda3 returns major and minor of /dev/sda. We rely on the
/// fact that if /sys/dev/block/$major:$minor/partition exists, then it's a partition, and find its
/// parent for the real device.
fn get_block_device_id(dev: stat::dev_t) -> Result<Option<(u64, u64)>> {
let major = stat::major(dev);
let minor = stat::minor(dev);
let mut blk_dev_path = PathBuf::from(SYS_DEV_BLOCK_PATH)
.join(format!("{}:{}", major, minor))
.canonicalize()?;
// If 'partition' file exists, then it's a partition of the real device, take its parent.
// Otherwise it's already the real device.
loop {
if !blk_dev_path.join(BLKDEV_PARTITION).exists() {
break;
}
blk_dev_path = match blk_dev_path.parent() {
Some(p) => p.to_path_buf(),
None => {
return Err(eother!(
"Can't find real device for dev {}:{}",
major,
minor
))
}
};
}
// Parse major:minor in dev file
let dev_path = blk_dev_path.join(BLKDEV_DEV_FILE);
let dev_buf = fs::read_to_string(&dev_path)?;
let dev_buf = dev_buf.trim_end();
debug!(
sl!(),
"get_real_devid: dev {}:{} -> {:?} ({})", major, minor, blk_dev_path, dev_buf
);
if let Some((major, minor)) = dev_buf.split_once(':') {
let major = major
.parse::<u64>()
.map_err(|_e| eother!("Failed to parse major number: {}", major))?;
let minor = minor
.parse::<u64>()
.map_err(|_e| eother!("Failed to parse minor number: {}", minor))?;
Ok(Some((major, minor)))
} else {
Err(eother!(
"Wrong format in {}: {}",
dev_path.to_string_lossy(),
dev_buf
))
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_get_devid() {
//let (major, minor) = get_devid_for_blkio_cgroup("/dev/vda1").unwrap().unwrap();
assert!(get_devid_for_blkio_cgroup("/dev/tty").unwrap().is_none());
get_devid_for_blkio_cgroup("/do/not/exist/file_______name").unwrap_err();
}
}
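As a usage note, a hedged sketch of how a caller might use `get_devid_for_blkio_cgroup()` when building blkio cgroup rules follows; the wrapper function is purely illustrative and assumes the `kata-sys-util` crate from this PR as a dependency.

```rust
// Minimal sketch, assuming kata-sys-util from this PR.
use kata_sys_util::device::get_devid_for_blkio_cgroup;

fn blkio_major_minor(path: &str) -> std::io::Result<Option<String>> {
    // For a regular file or directory this returns the major/minor of the block
    // device hosting it; for a block device node, the whole-disk device.
    let id = get_devid_for_blkio_cgroup(path)?;
    Ok(id.map(|(major, minor)| format!("{}:{}", major, minor)))
}
```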


@@ -0,0 +1,217 @@
// Copyright (c) 2019-2021 Alibaba Cloud
// Copyright (c) 2019-2021 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::ffi::OsString;
use std::fs::{self, File};
use std::io::{Error, Result};
use std::os::unix::io::AsRawFd;
use std::path::{Path, PathBuf};
use std::process::Command;
use crate::{eother, sl};
// nix filesystem_type for different libc and architectures
#[cfg(all(target_os = "linux", target_env = "musl"))]
type FsType = libc::c_ulong;
#[cfg(all(
target_os = "linux",
not(any(target_env = "musl", target_arch = "s390x"))
))]
type FsType = libc::__fsword_t;
#[cfg(all(target_os = "linux", not(target_env = "musl"), target_arch = "s390x"))]
type FsType = libc::c_uint;
// from linux.git/fs/fuse/inode.c: #define FUSE_SUPER_MAGIC 0x65735546
const FUSE_SUPER_MAGIC: FsType = 0x65735546;
// from linux.git/include/uapi/linux/magic.h
const OVERLAYFS_SUPER_MAGIC: FsType = 0x794c7630;
/// Get bundle path (current working directory).
pub fn get_bundle_path() -> Result<PathBuf> {
std::env::current_dir()
}
/// Get the basename of the canonicalized path
pub fn get_base_name<P: AsRef<Path>>(src: P) -> Result<OsString> {
let s = src.as_ref().canonicalize()?;
s.file_name().map(|v| v.to_os_string()).ok_or_else(|| {
eother!(
"failed to get base name of path {}",
src.as_ref().to_string_lossy()
)
})
}
/// Check whether `path` is on a fuse filesystem.
pub fn is_fuse_fs<P: AsRef<Path>>(path: P) -> bool {
if let Ok(st) = nix::sys::statfs::statfs(path.as_ref()) {
if st.filesystem_type().0 == FUSE_SUPER_MAGIC {
return true;
}
}
false
}
/// Check whether `path` is on an overlay filesystem.
pub fn is_overlay_fs<P: AsRef<Path>>(path: P) -> bool {
if let Ok(st) = nix::sys::statfs::statfs(path.as_ref()) {
if st.filesystem_type().0 == OVERLAYFS_SUPER_MAGIC {
return true;
}
}
false
}
/// Check whether the given path is a symlink.
pub fn is_symlink<P: AsRef<Path>>(path: P) -> std::io::Result<bool> {
let path = path.as_ref();
let meta = fs::symlink_metadata(path)?;
Ok(meta.file_type().is_symlink())
}
/// Reflink copy src to dst, and falls back to regular copy if reflink copy fails.
///
/// # Safety
/// The `reflink_copy()` doesn't preserve permission/security context for the copied file,
/// so caller needs to take care of it.
pub fn reflink_copy<S: AsRef<Path>, D: AsRef<Path>>(src: S, dst: D) -> Result<()> {
let src_path = src.as_ref();
let dst_path = dst.as_ref();
let src = src_path.to_string_lossy();
let dst = dst_path.to_string_lossy();
if !src_path.is_file() {
return Err(eother!("reflink_copy src {} is not a regular file", src));
}
// Make sure dst's parent exist. If dst is a regular file, then unlink it for later copy.
if dst_path.exists() {
if !dst_path.is_file() {
return Err(eother!("reflink_copy dst {} is not a regular file", dst));
} else {
fs::remove_file(dst_path)?;
}
} else if let Some(dst_parent) = dst_path.parent() {
if !dst_parent.exists() {
if let Err(e) = fs::create_dir_all(dst_parent) {
return Err(eother!(
"reflink_copy: create_dir_all {} failed: {:?}",
dst_parent.to_str().unwrap(),
e
));
}
} else if !dst_parent.is_dir() {
return Err(eother!("reflink_copy parent of {} is not a directory", dst));
}
}
// Reflink copy, and fallback to regular copy if reflink fails.
let src_file = fs::File::open(src_path)?;
let dst_file = fs::File::create(dst_path)?;
if let Err(e) = do_reflink_copy(src_file, dst_file) {
match e.raw_os_error() {
// Cross dev copy or filesystem doesn't support reflink, do regular copy
Some(os_err)
if os_err == nix::Error::EXDEV as i32
|| os_err == nix::Error::EOPNOTSUPP as i32 =>
{
warn!(
sl!(),
"reflink_copy: reflink is not supported ({:?}), do regular copy instead", e,
);
if let Err(e) = do_regular_copy(src.as_ref(), dst.as_ref()) {
return Err(eother!(
"reflink_copy: regular copy {} to {} failed: {:?}",
src,
dst,
e
));
}
}
// Reflink copy failed
_ => {
return Err(eother!(
"reflink_copy: copy {} to {} failed: {:?}",
src,
dst,
e,
))
}
}
}
Ok(())
}
// Copy file using cp command, which handles sparse file copy.
fn do_regular_copy(src: &str, dst: &str) -> Result<()> {
let mut cmd = Command::new("/bin/cp");
cmd.args(&["--sparse=auto", src, dst]);
match cmd.output() {
Ok(output) => match output.status.success() {
true => Ok(()),
false => Err(eother!("`{:?}` failed: {:?}", cmd, output)),
},
Err(e) => Err(eother!("`{:?}` failed: {:?}", cmd, e)),
}
}
/// Copy file by reflink
fn do_reflink_copy(src: File, dst: File) -> Result<()> {
use nix::ioctl_write_int;
// FICLONE ioctl number definition, from include/linux/fs.h
const FS_IOC_MAGIC: u8 = 0x94;
const FS_IOC_FICLONE: u8 = 9;
// Define FICLONE ioctl using nix::ioctl_write_int! macro.
// The generated function has the following signature:
// pub unsafe fn ficlone(fd: libc::c_int, data: libc::c_ulong) -> Result<libc::c_int>
ioctl_write_int!(ficlone, FS_IOC_MAGIC, FS_IOC_FICLONE);
// Safe because the `src` and `dst` are valid file objects and we have checked the result.
unsafe { ficlone(dst.as_raw_fd(), src.as_raw_fd() as u64) }
.map(|_| ())
.map_err(|e| Error::from_raw_os_error(e as i32))
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_get_base_name() {
assert_eq!(&get_base_name("/etc/hostname").unwrap(), "hostname");
assert_eq!(&get_base_name("/bin").unwrap(), "bin");
assert!(&get_base_name("/").is_err());
assert!(&get_base_name("").is_err());
assert!(get_base_name("/no/such/path________yeah").is_err());
}
#[test]
fn test_is_symlink() {
let tmpdir = tempfile::tempdir().unwrap();
let path = tmpdir.path();
std::os::unix::fs::symlink(path, path.join("a")).unwrap();
assert!(is_symlink(path.join("a")).unwrap());
}
#[test]
fn test_reflink_copy() {
let tmpdir = tempfile::tempdir().unwrap();
let path = tmpdir.path().join("mounts");
reflink_copy("/proc/mounts", &path).unwrap();
let content = fs::read_to_string(&path).unwrap();
assert!(!content.is_empty());
reflink_copy("/proc/mounts", &path).unwrap();
let content = fs::read_to_string(&path).unwrap();
assert!(!content.is_empty());
reflink_copy("/proc/mounts", tmpdir.path()).unwrap_err();
reflink_copy("/proc/mounts_not_exist", &path).unwrap_err();
}
}
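A usage sketch for `reflink_copy()`, along the lines of the unit test above; the source and destination paths below are placeholders, and as the doc comment warns, ownership and security labels of the copy are not preserved.

```rust
// Minimal sketch, assuming kata-sys-util from this PR; paths are hypothetical.
use kata_sys_util::fs::reflink_copy;

fn clone_guest_image() -> std::io::Result<()> {
    // Attempts a FICLONE reflink first, then falls back to `cp --sparse=auto`
    // on EXDEV/EOPNOTSUPP (cross-device copy or unsupported filesystem).
    reflink_copy("/var/lib/kata/image.raw", "/run/kata/sandbox/image.raw")
}
```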


@@ -0,0 +1,541 @@
// Copyright (c) 2019-2021 Alibaba Cloud
// Copyright (c) 2019-2021 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::collections::HashMap;
use std::ffi::OsString;
use std::hash::{Hash, Hasher};
use std::io::{self, Read, Result};
use std::path::Path;
use std::time::Duration;
use subprocess::{ExitStatus, Popen, PopenConfig, PopenError, Redirection};
use crate::{eother, sl};
const DEFAULT_HOOK_TIMEOUT_SEC: i32 = 10;
/// A simple wrapper over `oci::Hook` to provide `Hash, Eq`.
///
/// The `oci::Hook` type is auto-generated from a protobuf source file and doesn't implement `Hash` or `Eq`.
#[derive(Debug, Default, Clone)]
struct HookKey(oci::Hook);
impl From<&oci::Hook> for HookKey {
fn from(hook: &oci::Hook) -> Self {
HookKey(hook.clone())
}
}
impl PartialEq for HookKey {
fn eq(&self, other: &Self) -> bool {
self.0 == other.0
}
}
impl Eq for HookKey {}
impl Hash for HookKey {
fn hash<H: Hasher>(&self, state: &mut H) {
self.0.path.hash(state);
self.0.args.hash(state);
self.0.env.hash(state);
self.0.timeout.hash(state);
}
}
/// Execution state of OCI hooks.
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum HookState {
/// Hook is pending for executing/retry.
Pending,
/// Hook has been successfully executed.
Done,
/// Hook has been marked as ignore.
Ignored,
}
/// Structure to maintain state for hooks.
#[derive(Default)]
pub struct HookStates {
states: HashMap<HookKey, HookState>,
}
impl HookStates {
/// Create a new instance of [`HookStates`].
pub fn new() -> Self {
Self::default()
}
/// Get execution state of a hook.
pub fn get(&self, hook: &oci::Hook) -> HookState {
self.states
.get(&hook.into())
.copied()
.unwrap_or(HookState::Pending)
}
/// Update execution state of a hook.
pub fn update(&mut self, hook: &oci::Hook, state: HookState) {
self.states.insert(hook.into(), state);
}
/// Remove an execution state of a hook.
pub fn remove(&mut self, hook: &oci::Hook) {
self.states.remove(&hook.into());
}
/// Check whether some hooks are still pending and should retry execution.
pub fn should_retry(&self) -> bool {
for state in self.states.values() {
if *state == HookState::Pending {
return true;
}
}
false
}
/// Execute an OCI hook.
///
/// If `state` is valid, it will be sent to subprocess' STDIN.
///
/// The [OCI Runtime specification 1.0.0](https://github.com/opencontainers/runtime-spec/releases/download/v1.0.0/oci-runtime-spec-v1.0.0.pdf)
/// states:
/// - path (string, REQUIRED) with similar semantics to IEEE Std 1003.1-2008 execv's path.
/// This specification extends the IEEE standard in that path MUST be absolute.
/// - args (array of strings, OPTIONAL) with the same semantics as IEEE Std 1003.1-2008 execv's
/// argv.
/// - env (array of strings, OPTIONAL) with the same semantics as IEEE Std 1003.1-2008's environ.
/// - timeout (int, OPTIONAL) is the number of seconds before aborting the hook. If set, timeout
/// MUST be greater than zero.
///
/// The OCI spec also defines the context in which hooks are invoked; the caller is responsible
/// for setting up the execution context, such as namespaces etc.
pub fn execute_hook(&mut self, hook: &oci::Hook, state: Option<oci::State>) -> Result<()> {
if self.get(hook) != HookState::Pending {
return Ok(());
}
fail::fail_point!("execute_hook", |_| {
Err(eother!("execute hook fail point injection"))
});
info!(sl!(), "execute hook {:?}", hook);
self.states.insert(hook.into(), HookState::Pending);
let mut executor = HookExecutor::new(hook)?;
let stdin = if state.is_some() {
Redirection::Pipe
} else {
Redirection::None
};
let mut popen = Popen::create(
&executor.args,
PopenConfig {
stdin,
stdout: Redirection::Pipe,
stderr: Redirection::Pipe,
executable: executor.executable.to_owned(),
detached: true,
env: Some(executor.envs.clone()),
..Default::default()
},
)
.map_err(|e| eother!("failed to create subprocess for hook {:?}: {}", hook, e))?;
if let Some(state) = state {
executor.execute_with_input(&mut popen, state)?;
}
executor.execute_and_wait(&mut popen)?;
info!(sl!(), "hook {} finished", hook.path);
self.states.insert(hook.into(), HookState::Done);
Ok(())
}
/// Try to execute hooks and remember execution result.
///
/// `execute_hooks()` may be called multiple times.
/// It is first called before creating the VMM when the sandbox is created, so hooks can be
/// used to set up the environment for the VMM, such as creating a tap device.
/// It is also called when starting containers, to set up the environment for those containers.
///
/// The execution result will be recorded for each hook. Once a hook returns success, it will not
/// be invoked anymore.
pub fn execute_hooks(&mut self, hooks: &[oci::Hook], state: Option<oci::State>) -> Result<()> {
for hook in hooks.iter() {
if let Err(e) = self.execute_hook(hook, state.clone()) {
// Ignore error and try next hook, the caller should retry.
error!(sl!(), "hook {} failed: {}", hook.path, e);
}
}
Ok(())
}
}
struct HookExecutor<'a> {
hook: &'a oci::Hook,
executable: Option<OsString>,
args: Vec<String>,
envs: Vec<(OsString, OsString)>,
timeout: u64,
}
impl<'a> HookExecutor<'a> {
fn new(hook: &'a oci::Hook) -> Result<Self> {
// Ensure Hook.path is present and is an absolute path.
let executable = if hook.path.is_empty() {
return Err(eother!("path of hook {:?} is empty", hook));
} else {
let path = Path::new(&hook.path);
if !path.is_absolute() {
return Err(eother!("path of hook {:?} is not absolute", hook));
}
Some(path.as_os_str().to_os_string())
};
// Hook.args is optional, use Hook.path as arg0 if Hook.args is empty.
let args = if hook.args.is_empty() {
vec![hook.path.clone()]
} else {
hook.args.clone()
};
let mut envs: Vec<(OsString, OsString)> = Vec::new();
for e in hook.env.iter() {
match e.split_once('=') {
Some((key, value)) => envs.push((OsString::from(key), OsString::from(value))),
None => warn!(sl!(), "env {} of hook {:?} is invalid", e, hook),
}
}
// Use Hook.timeout if it's valid, otherwise default to 10s.
let mut timeout = DEFAULT_HOOK_TIMEOUT_SEC as u64;
if let Some(t) = hook.timeout {
if t > 0 {
timeout = t as u64;
}
}
Ok(HookExecutor {
hook,
executable,
args,
envs,
timeout,
})
}
fn execute_with_input(&mut self, popen: &mut Popen, state: oci::State) -> Result<()> {
let st = serde_json::to_string(&state)?;
let (stdout, stderr) = popen
.communicate_start(Some(st.as_bytes().to_vec()))
.limit_time(Duration::from_secs(self.timeout))
.read_string()
.map_err(|e| e.error)?;
if let Some(err) = stderr {
if !err.is_empty() {
error!(sl!(), "hook {} exec failed: {}", self.hook.path, err);
}
}
if let Some(out) = stdout {
if !out.is_empty() {
info!(sl!(), "hook {} exec stdout: {}", self.hook.path, out);
}
}
// Give a grace period for `execute_and_wait()`.
self.timeout = 1;
Ok(())
}
fn execute_and_wait(&mut self, popen: &mut Popen) -> Result<()> {
match popen.wait_timeout(Duration::from_secs(self.timeout)) {
Ok(v) => self.handle_exit_status(v, popen),
Err(e) => self.handle_popen_wait_error(e, popen),
}
}
fn handle_exit_status(&mut self, result: Option<ExitStatus>, popen: &mut Popen) -> Result<()> {
if let Some(exit_status) = result {
// the process has finished
info!(
sl!(),
"exit status of hook {:?} : {:?}", self.hook, exit_status
);
self.print_result(popen);
match exit_status {
subprocess::ExitStatus::Exited(code) => {
if code == 0 {
info!(sl!(), "hook {:?} succeeds", self.hook);
Ok(())
} else {
warn!(sl!(), "hook {:?} exit status with {}", self.hook, code,);
Err(eother!("hook {:?} exit status with {}", self.hook, code))
}
}
_ => {
error!(
sl!(),
"no exit code for hook {:?}: {:?}", self.hook, exit_status
);
Err(eother!(
"no exit code for hook {:?}: {:?}",
self.hook,
exit_status
))
}
}
} else {
// may be timeout
error!(sl!(), "hook poll failed, kill it");
// it is still running, kill it
popen.kill()?;
let _ = popen.wait();
self.print_result(popen);
Err(io::Error::from(io::ErrorKind::TimedOut))
}
}
fn handle_popen_wait_error(&mut self, e: PopenError, popen: &mut Popen) -> Result<()> {
self.print_result(popen);
error!(sl!(), "wait_timeout for hook {:?} failed: {}", self.hook, e);
Err(eother!(
"wait_timeout for hook {:?} failed: {}",
self.hook,
e
))
}
fn print_result(&mut self, popen: &mut Popen) {
if let Some(file) = popen.stdout.as_mut() {
let mut buffer = String::new();
file.read_to_string(&mut buffer).ok();
if !buffer.is_empty() {
info!(sl!(), "hook stdout: {}", buffer);
}
}
if let Some(file) = popen.stderr.as_mut() {
let mut buffer = String::new();
file.read_to_string(&mut buffer).ok();
if !buffer.is_empty() {
info!(sl!(), "hook stderr: {}", buffer);
}
}
}
}
#[cfg(test)]
mod tests {
use super::*;
use std::fs::{self, set_permissions, File, Permissions};
use std::io::Write;
use std::os::unix::fs::PermissionsExt;
use std::time::Instant;
fn test_hook_eq(hook1: &oci::Hook, hook2: &oci::Hook, expected: bool) {
let key1 = HookKey::from(hook1);
let key2 = HookKey::from(hook2);
assert_eq!(key1 == key2, expected);
}
#[test]
fn test_hook_key() {
let hook = oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec![],
timeout: Some(0),
};
let cases = [
(
oci::Hook {
path: "1000".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec![],
timeout: Some(0),
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "4".to_string()],
env: vec![],
timeout: Some(0),
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string()],
env: vec![],
timeout: Some(0),
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec!["5".to_string()],
timeout: Some(0),
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec![],
timeout: Some(6),
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec![],
timeout: None,
},
false,
),
(
oci::Hook {
path: "1".to_string(),
args: vec!["2".to_string(), "3".to_string()],
env: vec![],
timeout: Some(0),
},
true,
),
];
for case in cases.iter() {
test_hook_eq(&hook, &case.0, case.1);
}
}
#[test]
fn test_execute_hook() {
// this test needs root permission
if !nix::unistd::getuid().is_root() {
println!("test needs root permission");
return;
}
let tmpdir = tempfile::tempdir().unwrap();
let file = tmpdir.path().join("data");
let file_str = file.to_string_lossy();
let mut states = HookStates::new();
// test case 1: normal
// execute hook
let hook = oci::Hook {
path: "/bin/touch".to_string(),
args: vec!["touch".to_string(), file_str.to_string()],
env: vec![],
timeout: Some(0),
};
let ret = states.execute_hook(&hook, None);
assert!(ret.is_ok());
assert!(fs::metadata(&file).is_ok());
assert!(!states.should_retry());
// test case 2: timeout in 10s
let hook = oci::Hook {
path: "/bin/sleep".to_string(),
args: vec!["sleep".to_string(), "3600".to_string()],
env: vec![],
timeout: Some(0), // default timeout is 10 seconds
};
let start = Instant::now();
let ret = states.execute_hook(&hook, None).unwrap_err();
let duration = start.elapsed();
let used = duration.as_secs();
assert!((10..12u64).contains(&used));
assert_eq!(ret.kind(), io::ErrorKind::TimedOut);
assert_eq!(states.get(&hook), HookState::Pending);
assert!(states.should_retry());
states.remove(&hook);
// test case 3: timeout in 5s
let hook = oci::Hook {
path: "/bin/sleep".to_string(),
args: vec!["sleep".to_string(), "3600".to_string()],
env: vec![],
timeout: Some(5), // timeout is set to 5 seconds
};
let start = Instant::now();
let ret = states.execute_hook(&hook, None).unwrap_err();
let duration = start.elapsed();
let used = duration.as_secs();
assert!((5..7u64).contains(&used));
assert_eq!(ret.kind(), io::ErrorKind::TimedOut);
assert_eq!(states.get(&hook), HookState::Pending);
assert!(states.should_retry());
states.remove(&hook);
// test case 4: with envs
let create_shell = |shell_path: &str, data_path: &str| -> Result<()> {
let shell = format!(
r#"#!/bin/sh
echo -n "K1=${{K1}}" > {}
"#,
data_path
);
let mut output = File::create(shell_path)?;
output.write_all(shell.as_bytes())?;
// set to executable
let permissions = Permissions::from_mode(0o755);
set_permissions(shell_path, permissions)?;
Ok(())
};
let shell_path = format!("{}/test.sh", tmpdir.path().to_string_lossy());
let ret = create_shell(&shell_path, file_str.as_ref());
assert!(ret.is_ok());
let hook = oci::Hook {
path: shell_path,
args: vec![],
env: vec!["K1=V1".to_string()],
timeout: Some(5),
};
let ret = states.execute_hook(&hook, None);
assert!(ret.is_ok());
assert!(!states.should_retry());
let contents = fs::read_to_string(file);
match contents {
Err(e) => panic!("got error {}", e),
Ok(s) => assert_eq!(s, "K1=V1"),
}
// test case 5: timeout in 6s with state
let hook = oci::Hook {
path: "/bin/sleep".to_string(),
args: vec!["sleep".to_string(), "3600".to_string()],
env: vec![],
timeout: Some(6), // timeout is set to 6 seconds
};
let state = oci::State {
version: "".to_string(),
id: "".to_string(),
status: oci::ContainerState::Creating,
pid: 10,
bundle: "nouse".to_string(),
annotations: Default::default(),
};
let start = Instant::now();
let ret = states.execute_hook(&hook, Some(state)).unwrap_err();
let duration = start.elapsed();
let used = duration.as_secs();
assert!((6..8u64).contains(&used));
assert_eq!(ret.kind(), io::ErrorKind::TimedOut);
assert!(states.should_retry());
}
}
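A short sketch of how a caller might drive `HookStates` for OCI prestart hooks; the single-retry policy shown here is illustrative only and not mandated by this change set.

```rust
// Minimal sketch, assuming the kata-sys-util and oci crates from this PR.
use kata_sys_util::hooks::HookStates;

fn run_hooks_with_one_retry(hooks: &[oci::Hook]) {
    let mut states = HookStates::new();
    // Individual hook failures are logged and the hook stays Pending.
    let _ = states.execute_hooks(hooks, None);
    if states.should_retry() {
        // Hooks that already succeeded are remembered and not re-run.
        let _ = states.execute_hooks(hooks, None);
    }
}
```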


@@ -0,0 +1,69 @@
// Copyright (c) 2019-2021 Alibaba Cloud
// Copyright (c) 2019-2021 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
//! Utilities to support Kubernetes (K8s).
//!
//! This module depends on kubelet internal implementation details; a better way is needed
//! to detect the K8s EmptyDir medium type from `oci::spec::Mount` objects.
use kata_types::mount;
use oci::Spec;
use crate::mount::get_linux_mount_info;
pub use kata_types::k8s::is_empty_dir;
/// Check whether the given path is a kubernetes ephemeral volume.
///
/// This method relies on a specific path layout used by k8s to detect whether a volume is ephemeral.
/// For now this is a very k8s-specific solution that works, but in the future there should be a
/// better way for this method to determine whether the path belongs to an ephemeral volume.
pub fn is_ephemeral_volume(path: &str) -> bool {
if is_empty_dir(path) {
if let Ok(info) = get_linux_mount_info(path) {
if info.fs_type == "tmpfs" {
return true;
}
}
}
false
}
/// Check whether the given path is a kubernetes empty-dir volume of medium "default".
///
/// K8s `EmptyDir` volumes are directories on the host. If the fs type is tmpfs, it's an ephemeral
/// volume instead of an `EmptyDir` volume.
pub fn is_host_empty_dir(path: &str) -> bool {
if is_empty_dir(path) {
if let Ok(info) = get_linux_mount_info(path) {
if info.fs_type != "tmpfs" {
return true;
}
}
}
false
}
// update_ephemeral_storage_type sets the mount type to 'ephemeral'
// if the mount source path is provisioned by k8s for ephemeral storage.
// For a given pod the ephemeral volume is created only once, backed by
// tmpfs inside the VM. Successive containers of the same pod reuse the
// already existing volume.
pub fn update_ephemeral_storage_type(oci_spec: &mut Spec) {
for m in oci_spec.mounts.iter_mut() {
if mount::is_kata_guest_mount_volume(&m.r#type) {
continue;
}
if is_ephemeral_volume(&m.source) {
m.r#type = String::from(mount::KATA_EPHEMERAL_VOLUME_TYPE);
} else if is_host_empty_dir(&m.source) {
m.r#type = String::from(mount::KATA_HOST_DIR_VOLUME_TYPE);
}
}
}
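To illustrate the volume classification helpers above, here is a hedged example; the decision order mirrors `update_ephemeral_storage_type()` and the wrapper function name is made up.

```rust
// Minimal sketch, assuming kata-sys-util from this PR.
use kata_sys_util::k8s::{is_ephemeral_volume, is_host_empty_dir};

fn classify_empty_dir(path: &str) -> &'static str {
    if is_ephemeral_volume(path) {
        // tmpfs-backed EmptyDir: mapped to Kata's ephemeral volume type.
        "ephemeral"
    } else if is_host_empty_dir(path) {
        // Host-backed EmptyDir of medium "default".
        "host empty-dir"
    } else {
        "regular mount"
    }
}
```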


@@ -0,0 +1,33 @@
// Copyright (c) 2021 Alibaba Cloud
//
// SPDX-License-Identifier: Apache-2.0
//
#[macro_use]
extern crate slog;
pub mod device;
pub mod fs;
pub mod hooks;
pub mod k8s;
pub mod mount;
pub mod numa;
pub mod rand;
pub mod spec;
pub mod validate;
// Convenience macro to obtain the scoped logger
#[macro_export]
macro_rules! sl {
() => {
slog_scope::logger()
};
}
#[macro_export]
macro_rules! eother {
() => (std::io::Error::new(std::io::ErrorKind::Other, ""));
($fmt:expr, $($arg:tt)*) => ({
std::io::Error::new(std::io::ErrorKind::Other, format!($fmt, $($arg)*))
})
}
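A small usage sketch for the two macros defined above; it assumes the caller also depends on `slog` and `slog-scope`, since `sl!()` expands to `slog_scope::logger()`, and the message text is a placeholder.

```rust
// Minimal sketch, assuming kata-sys-util, slog and slog-scope are dependencies.
use kata_sys_util::{eother, sl};
use slog::info;

fn check_positive(v: i64) -> std::io::Result<()> {
    if v <= 0 {
        // eother! builds an io::Error of kind Other with a formatted message.
        return Err(eother!("invalid value: {}", v));
    }
    info!(sl!(), "value {} accepted", v);
    Ok(())
}
```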

File diff suppressed because it is too large.


@@ -0,0 +1,221 @@
// Copyright (c) 2021 Alibaba Cloud
//
// SPDX-License-Identifier: Apache-2.0
//
use std::collections::HashMap;
use std::fs::DirEntry;
use std::io::Read;
use std::path::PathBuf;
use kata_types::cpu::CpuSet;
use lazy_static::lazy_static;
use crate::sl;
use std::str::FromStr;
#[derive(thiserror::Error, Debug)]
pub enum Error {
#[error("Invalid CPU number {0}")]
InvalidCpu(u32),
#[error("Invalid node file name {0}")]
InvalidNodeFileName(String),
#[error("Can not read directory {1}: {0}")]
ReadDirectory(#[source] std::io::Error, String),
#[error("Can not read from file {0}, {1:?}")]
ReadFile(String, #[source] std::io::Error),
#[error("Can not open from file {0}, {1:?}")]
OpenFile(String, #[source] std::io::Error),
#[error("Can not parse CPU info, {0:?}")]
ParseCpuInfo(#[from] kata_types::Error),
}
pub type Result<T> = std::result::Result<T, Error>;
// global config in UT
#[cfg(test)]
lazy_static! {
static ref SYS_FS_PREFIX: PathBuf = PathBuf::from(env!("CARGO_MANIFEST_DIR")).join("test/texture");
// numa node file for UT, we can mock data
static ref NUMA_NODE_PATH: PathBuf = (&*SYS_FS_PREFIX).join("sys/devices/system/node");
// sysfs directory for CPU devices
static ref NUMA_CPU_PATH: PathBuf = (&*SYS_FS_PREFIX).join("sys/devices/system/cpu");
}
// global config in release
#[cfg(not(test))]
lazy_static! {
// sysfs directory for NUMA nodes
static ref NUMA_NODE_PATH: PathBuf = PathBuf::from("/sys/devices/system/node");
// sysfs directory for CPU devices
static ref NUMA_CPU_PATH: PathBuf = PathBuf::from("/sys/devices/system/cpu");
}
const NUMA_NODE_PREFIX: &str = "node";
const NUMA_NODE_CPU_LIST_NAME: &str = "cpulist";
/// Get numa node id for a CPU
pub fn get_node_id(cpu: u32) -> Result<u32> {
let path = NUMA_CPU_PATH.join(format!("cpu{}", cpu));
let dirs = path.read_dir().map_err(|_| Error::InvalidCpu(cpu))?;
for d in dirs {
let d = d.map_err(|e| Error::ReadDirectory(e, path.to_string_lossy().to_string()))?;
if let Some(file_name) = d.file_name().to_str() {
if !file_name.starts_with(NUMA_NODE_PREFIX) {
continue;
}
let index_str = file_name.trim_start_matches(NUMA_NODE_PREFIX);
if let Ok(i) = index_str.parse::<u32>() {
return Ok(i);
}
}
}
// Default to node 0 on UMA systems.
Ok(0)
}
/// Map cpulist to NUMA node, returns a HashMap<numa_node_id, Vec<cpu_id>>.
pub fn get_node_map(cpus: &str) -> Result<HashMap<u32, Vec<u32>>> {
// <numa id, Vec<cpu id> >
let mut node_map: HashMap<u32, Vec<u32>> = HashMap::new();
let cpuset = CpuSet::from_str(cpus)?;
for c in cpuset.iter() {
let node_id = get_node_id(*c)?;
node_map.entry(node_id).or_insert_with(Vec::new).push(*c);
}
Ok(node_map)
}
/// Get CPU to NUMA node mapping by reading `/sys/devices/system/node/nodex/cpulist`.
///
/// Return a HashMap<cpu id, node id>. The hashmap will be empty if NUMA is not enabled on the
/// system.
pub fn get_numa_nodes() -> Result<HashMap<u32, u32>> {
let mut numa_nodes = HashMap::new();
let numa_node_path = &*NUMA_NODE_PATH;
if !numa_node_path.exists() {
debug!(sl!(), "no numa node available on this system");
return Ok(numa_nodes);
}
let dirs = numa_node_path
.read_dir()
.map_err(|e| Error::ReadDirectory(e, numa_node_path.to_string_lossy().to_string()))?;
for d in dirs {
match d {
Err(e) => {
return Err(Error::ReadDirectory(
e,
numa_node_path.to_string_lossy().to_string(),
))
}
Ok(d) => {
if let Ok(file_name) = d.file_name().into_string() {
if file_name.starts_with(NUMA_NODE_PREFIX) {
let index_string = file_name.trim_start_matches(NUMA_NODE_PREFIX);
info!(
sl!(),
"get node dir {} node index {}", &file_name, index_string
);
match index_string.parse::<u32>() {
Ok(nid) => read_cpu_info_from_node(&d, nid, &mut numa_nodes)?,
Err(_e) => {
return Err(Error::InvalidNodeFileName(file_name.to_string()))
}
}
}
}
}
}
}
Ok(numa_nodes)
}
fn read_cpu_info_from_node(
d: &DirEntry,
node_index: u32,
numa_nodes: &mut HashMap<u32, u32>,
) -> Result<()> {
let cpu_list_path = d.path().join(NUMA_NODE_CPU_LIST_NAME);
let mut file = std::fs::File::open(&cpu_list_path)
.map_err(|e| Error::OpenFile(cpu_list_path.to_string_lossy().to_string(), e))?;
let mut cpu_list_string = String::new();
if let Err(e) = file.read_to_string(&mut cpu_list_string) {
return Err(Error::ReadFile(
cpu_list_path.to_string_lossy().to_string(),
e,
));
}
let split_cpus = CpuSet::from_str(cpu_list_string.trim())?;
info!(
sl!(),
"node {} list {:?} from {}", node_index, split_cpus, &cpu_list_string
);
for split_cpu_id in split_cpus.iter() {
numa_nodes.insert(*split_cpu_id, node_index);
}
Ok(())
}
/// Check whether all specified CPUs have associated NUMA node.
pub fn is_valid_numa_cpu(cpus: &[u32]) -> Result<bool> {
let numa_nodes = get_numa_nodes()?;
for cpu in cpus {
if numa_nodes.get(cpu).is_none() {
return Ok(false);
}
}
Ok(true)
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_get_node_id() {
assert_eq!(get_node_id(0).unwrap(), 0);
assert_eq!(get_node_id(1).unwrap(), 0);
assert_eq!(get_node_id(64).unwrap(), 1);
get_node_id(65).unwrap_err();
}
#[test]
fn test_get_node_map() {
let map = get_node_map("0-1,64").unwrap();
assert_eq!(map.len(), 2);
assert_eq!(map.get(&0).unwrap().len(), 2);
assert_eq!(map.get(&1).unwrap().len(), 1);
get_node_map("0-1,64,65").unwrap_err();
}
#[test]
fn test_get_numa_nodes() {
let map = get_numa_nodes().unwrap();
assert_eq!(map.len(), 65);
assert_eq!(*map.get(&0).unwrap(), 0);
assert_eq!(*map.get(&1).unwrap(), 0);
assert_eq!(*map.get(&63).unwrap(), 0);
assert_eq!(*map.get(&64).unwrap(), 1);
}
#[test]
fn test_is_valid_numa_cpu() {
assert!(is_valid_numa_cpu(&[0]).unwrap());
assert!(is_valid_numa_cpu(&[1]).unwrap());
assert!(is_valid_numa_cpu(&[63]).unwrap());
assert!(is_valid_numa_cpu(&[64]).unwrap());
assert!(is_valid_numa_cpu(&[0, 1, 64]).unwrap());
assert!(!is_valid_numa_cpu(&[0, 1, 64, 65]).unwrap());
assert!(!is_valid_numa_cpu(&[65]).unwrap());
}
}
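To show how the NUMA helpers above fit together, a hedged sketch mapping a cpulist onto nodes follows; the cpulist string passed in is arbitrary and the result depends on the host topology.

```rust
// Minimal sketch, assuming kata-sys-util from this PR.
use kata_sys_util::numa::{get_node_map, is_valid_numa_cpu, Result};

fn describe_cpus(cpulist: &str) -> Result<()> {
    // node id -> CPUs from `cpulist` that live on that node
    let map = get_node_map(cpulist)?;
    for (node, cpus) in &map {
        println!("NUMA node {} hosts CPUs {:?}", node, cpus);
    }
    // Every CPU in the list must belong to some NUMA node.
    let all_cpus: Vec<u32> = map.values().flatten().copied().collect();
    println!("all mapped: {}", is_valid_numa_cpu(&all_cpus)?);
    Ok(())
}
```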


@@ -0,0 +1,10 @@
// Copyright (c) 2019-2022 Alibaba Cloud
// Copyright (c) 2019-2022 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
mod random_bytes;
pub use random_bytes::RandomBytes;
mod uuid;
pub use uuid::UUID;


@@ -0,0 +1,62 @@
// Copyright (c) 2019-2022 Alibaba Cloud
// Copyright (c) 2019-2022 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::fmt;
use rand::RngCore;
pub struct RandomBytes {
pub bytes: Vec<u8>,
}
impl RandomBytes {
pub fn new(n: usize) -> Self {
let mut bytes = vec![0u8; n];
rand::thread_rng().fill_bytes(&mut bytes);
Self { bytes }
}
}
impl fmt::LowerHex for RandomBytes {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
for byte in &self.bytes {
write!(f, "{:x}", byte)?;
}
Ok(())
}
}
impl fmt::UpperHex for RandomBytes {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
for byte in &self.bytes {
write!(f, "{:X}", byte)?;
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn random_bytes() {
let b = RandomBytes::new(16);
assert_eq!(b.bytes.len(), 16);
// check lower hex
let lower_hex = format!("{:x}", b);
assert_eq!(lower_hex, lower_hex.to_lowercase());
// check upper hex
let upper_hex = format!("{:X}", b);
assert_eq!(upper_hex, upper_hex.to_uppercase());
// check new random bytes
let b1 = RandomBytes::new(16);
assert_ne!(b.bytes, b1.bytes);
}
}
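A usage sketch for `RandomBytes`, mirroring the unit test above; note that the `LowerHex` impl renders each byte with `{:x}`, so the output length can vary with the byte values.

```rust
// Minimal sketch, assuming kata-sys-util from this PR.
use kata_sys_util::rand::RandomBytes;

fn random_hex_suffix() -> String {
    let b = RandomBytes::new(16);
    // Uses the LowerHex impl above; each byte is rendered with `{:x}`.
    format!("{:x}", b)
}
```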


@@ -0,0 +1,74 @@
// Copyright (c) 2019-2022 Alibaba Cloud
// Copyright (c) 2019-2022 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::{convert::From, fmt};
use byteorder::{BigEndian, ByteOrder};
use rand::RngCore;
pub struct UUID([u8; 16]);
impl Default for UUID {
fn default() -> Self {
Self::new()
}
}
impl UUID {
pub fn new() -> Self {
let mut b = [0u8; 16];
rand::thread_rng().fill_bytes(&mut b);
b[6] = (b[6] & 0x0f) | 0x40;
b[8] = (b[8] & 0x3f) | 0x80;
Self(b)
}
}
/// From: convert UUID to string
impl From<&UUID> for String {
fn from(from: &UUID) -> Self {
let time_low = BigEndian::read_u32(&from.0[..4]);
let time_mid = BigEndian::read_u16(&from.0[4..6]);
let time_hi = BigEndian::read_u16(&from.0[6..8]);
let clk_seq_hi = from.0[8];
let clk_seq_low = from.0[9];
let mut buf = [0u8; 8];
buf[2..].copy_from_slice(&from.0[10..]);
let node = BigEndian::read_u64(&buf);
format!(
"{:08x}-{:04x}-{:04x}-{:02x}{:02x}-{:012x}",
time_low, time_mid, time_hi, clk_seq_hi, clk_seq_low, node
)
}
}
impl fmt::Display for UUID {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(f, "{}", String::from(self))
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_uuid() {
let uuid1 = UUID::new();
let s1: String = String::from(&uuid1);
let uuid2 = UUID::new();
let s2: String = String::from(&uuid2);
assert_eq!(s1.len(), s2.len());
assert_ne!(s1, s2);
let uuid3 = UUID([0u8, 1u8, 2u8, 3u8, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]);
let s3 = String::from(&uuid3);
assert_eq!(&s3, "00010203-0405-0607-0809-0a0b0c0d0e0f");
}
}
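A usage sketch for the `UUID` type above, which generates a version-4-style UUID and renders it through its `Display` implementation.

```rust
// Minimal sketch, assuming kata-sys-util from this PR.
use kata_sys_util::rand::UUID;

fn new_sandbox_uuid() -> String {
    // Display delegates to String::from(&UUID), producing 8-4-4-4-12 hex groups.
    UUID::new().to_string()
}
```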


@@ -0,0 +1,94 @@
// Copyright (c) 2019-2022 Alibaba Cloud
// Copyright (c) 2019-2022 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::path::PathBuf;
use kata_types::container::ContainerType;
#[derive(thiserror::Error, Debug)]
pub enum Error {
/// unknown container type
#[error("unknown container type {0}")]
UnknowContainerType(String),
/// missing sandboxID
#[error("missing sandboxID")]
MissingSandboxID,
/// oci error
#[error("oci error")]
Oci(#[from] oci::Error),
}
const CRI_CONTAINER_TYPE_KEY_LIST: &[&str] = &[
// cri containerd
"io.kubernetes.cri.container-type",
// cri-o
"io.kubernetes.cri-o.ContainerType",
// docker shim
"io.kubernetes.docker.type",
];
const CRI_SANDBOX_ID_KEY_LIST: &[&str] = &[
// cri containerd
"io.kubernetes.cri.sandbox-id",
// cri-o
"io.kubernetes.cri-o.SandboxID",
// docker shim
"io.kubernetes.sandbox.id",
];
/// container sandbox info
#[derive(Debug, Clone)]
pub enum ShimIdInfo {
/// Sandbox
Sandbox,
/// Container
Container(String),
}
/// get container type
pub fn get_contaier_type(spec: &oci::Spec) -> Result<ContainerType, Error> {
for k in CRI_CONTAINER_TYPE_KEY_LIST.iter() {
if let Some(type_value) = spec.annotations.get(*k) {
match type_value.as_str() {
"sandbox" => return Ok(ContainerType::PodSandbox),
"podsandbox" => return Ok(ContainerType::PodSandbox),
"container" => return Ok(ContainerType::PodContainer),
_ => return Err(Error::UnknowContainerType(type_value.clone())),
}
}
}
Ok(ContainerType::PodSandbox)
}
/// get shim id info
pub fn get_shim_id_info() -> Result<ShimIdInfo, Error> {
let spec = load_oci_spec()?;
match get_contaier_type(&spec)? {
ContainerType::PodSandbox => Ok(ShimIdInfo::Sandbox),
ContainerType::PodContainer => {
for k in CRI_SANDBOX_ID_KEY_LIST {
if let Some(sandbox_id) = spec.annotations.get(*k) {
return Ok(ShimIdInfo::Container(sandbox_id.into()));
}
}
Err(Error::MissingSandboxID)
}
}
}
/// get bundle path
pub fn get_bundle_path() -> std::io::Result<PathBuf> {
std::env::current_dir()
}
/// load oci spec
pub fn load_oci_spec() -> oci::Result<oci::Spec> {
let bundle_path = get_bundle_path()?;
let spec_file = bundle_path.join("config.json");
oci::Spec::load(spec_file.to_str().unwrap_or_default())
}
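For illustration only, a minimal sketch of how a shim entry point might consume these helpers; the routing function below is an assumption, not part of this file:

fn route_request() -> Result<(), Error> {
    // Hypothetical caller: a sandbox request is handled by this shim instance,
    // while a container request is delegated to the shim that owns the sandbox.
    match get_shim_id_info()? {
        ShimIdInfo::Sandbox => println!("creating a new sandbox here"),
        ShimIdInfo::Container(sandbox_id) => {
            println!("delegating to sandbox {}", sandbox_id)
        }
    }
    Ok(())
}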

View File

@@ -0,0 +1,267 @@
// Copyright (c) 2019-2022 Alibaba Cloud
// Copyright (c) 2019-2022 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
#[derive(thiserror::Error, Debug)]
pub enum Error {
#[error("invalid container ID {0}")]
InvalidContainerID(String),
}
// A container ID or exec ID must match this regex:
//
// ^[a-zA-Z0-9][a-zA-Z0-9_.-]+$
//
pub fn verify_id(id: &str) -> Result<(), Error> {
let mut chars = id.chars();
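// A valid ID has at least two characters, starts with an alphanumeric
// character, and otherwise contains only alphanumerics, '.', '-' and '_'.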
let valid = match chars.next() {
Some(first)
if first.is_alphanumeric()
&& id.len() > 1
&& chars.all(|c| c.is_alphanumeric() || ['.', '-', '_'].contains(&c)) =>
{
true
}
_ => false,
};
match valid {
true => Ok(()),
false => Err(Error::InvalidContainerID(id.to_string())),
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_verify_cid() {
#[derive(Debug)]
struct TestData<'a> {
id: &'a str,
expect_error: bool,
}
let tests = &[
TestData {
// Cannot be blank
id: "",
expect_error: true,
},
TestData {
// Cannot be a space
id: " ",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: ".",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "_",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: " a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: ".a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "_a",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "..",
expect_error: true,
},
TestData {
// Too short
id: "a",
expect_error: true,
},
TestData {
// Too short
id: "z",
expect_error: true,
},
TestData {
// Too short
id: "A",
expect_error: true,
},
TestData {
// Too short
id: "Z",
expect_error: true,
},
TestData {
// Too short
id: "0",
expect_error: true,
},
TestData {
// Too short
id: "9",
expect_error: true,
},
TestData {
// Must start with an alphanumeric
id: "-1",
expect_error: true,
},
TestData {
id: "/",
expect_error: true,
},
TestData {
id: "a/",
expect_error: true,
},
TestData {
id: "a/../",
expect_error: true,
},
TestData {
id: "../a",
expect_error: true,
},
TestData {
id: "../../a",
expect_error: true,
},
TestData {
id: "../../../a",
expect_error: true,
},
TestData {
id: "foo/../bar",
expect_error: true,
},
TestData {
id: "foo bar",
expect_error: true,
},
TestData {
id: "a.",
expect_error: false,
},
TestData {
id: "a..",
expect_error: false,
},
TestData {
id: "aa",
expect_error: false,
},
TestData {
id: "aa.",
expect_error: false,
},
TestData {
id: "hello..world",
expect_error: false,
},
TestData {
id: "hello/../world",
expect_error: true,
},
TestData {
id: "aa1245124sadfasdfgasdga.",
expect_error: false,
},
TestData {
id: "aAzZ0123456789_.-",
expect_error: false,
},
TestData {
id: "abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: false,
},
TestData {
id: "0123456789abcdefghijklmnopqrstuvwxyz.-_",
expect_error: false,
},
TestData {
id: " abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: true,
},
TestData {
id: ".abcdefghijklmnopqrstuvwxyz0123456789.-_",
expect_error: true,
},
TestData {
id: "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: false,
},
TestData {
id: "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ.-_",
expect_error: false,
},
TestData {
id: " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: true,
},
TestData {
id: ".ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.-_",
expect_error: true,
},
TestData {
id: "/a/b/c",
expect_error: true,
},
TestData {
id: "a/b/c",
expect_error: true,
},
TestData {
id: "foo/../../../etc/passwd",
expect_error: true,
},
TestData {
id: "../../../../../../etc/motd",
expect_error: true,
},
TestData {
id: "/etc/passwd",
expect_error: true,
},
];
for (i, d) in tests.iter().enumerate() {
let msg = format!("test[{}]: {:?}", i, d);
let result = verify_id(d.id);
let msg = format!("{}, result: {:?}", msg, result);
if result.is_ok() {
assert!(!d.expect_error, "{}", msg);
} else {
assert!(d.expect_error, "{}", msg);
}
}
}
}

View File

@@ -0,0 +1 @@
ffffffff,ffffffff

View File

@@ -0,0 +1 @@
ffffffff,ffffffff

View File

@@ -0,0 +1 @@
1,00000000,00000000

src/libs/kata-types/.gitignore (vendored)
View File

@@ -0,0 +1 @@
Cargo.lock

View File

@@ -0,0 +1,31 @@
[package]
name = "kata-types"
version = "0.1.0"
description = "Constants and data types shared by Kata Containers components"
keywords = ["kata", "container", "runtime"]
authors = ["The Kata Containers community <kata-dev@lists.katacontainers.io>"]
repository = "https://github.com/kata-containers/kata-containers.git"
homepage = "https://katacontainers.io/"
readme = "README.md"
license = "Apache-2.0"
edition = "2018"
[dependencies]
byte-unit = "3.1.4"
glob = "0.3.0"
lazy_static = "1.4.0"
num_cpus = "1.13.1"
regex = "1.5.4"
serde = { version = "1.0.100", features = ["derive"] }
slog = "2.5.2"
slog-scope = "4.4.0"
serde_json = "1.0.73"
thiserror = "1.0"
toml = "0.5.8"
oci = { path = "../oci" }
[dev-dependencies]
[features]
default = []
enable-vendor = []

View File

@@ -0,0 +1,18 @@
# kata-types
This crate is a collection of constants and data types shared by multiple
[Kata Containers](https://github.com/kata-containers/kata-containers/) components.
These constants and data types may be defined by Kata Containers itself or by other projects/specifications, such as:
- [Containerd](https://github.com/containerd/containerd)
- [Kubelet](https://github.com/kubernetes/kubelet)
## Support
**Operating Systems**:
- Linux
## License
This code is licensed under [Apache-2.0](../../../LICENSE).

View File

@@ -0,0 +1,13 @@
// Copyright (c) 2019 Alibaba Cloud
// Copyright (c) 2019 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
#![allow(missing_docs)]
pub const CONTAINER_TYPE_LABEL_KEY: &str = "io.kubernetes.cri.container-type";
pub const SANDBOX: &str = "sandbox";
pub const CONTAINER: &str = "container";
pub const SANDBOX_ID_LABEL_KEY: &str = "io.kubernetes.cri.sandbox-id";

View File

@@ -0,0 +1,13 @@
// Copyright (c) 2019 Alibaba Cloud
// Copyright (c) 2019 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
#![allow(missing_docs)]
pub const CONTAINER_TYPE_LABEL_KEY: &str = "io.kubernetes.cri.container-type";
pub const SANDBOX: &str = "sandbox";
pub const CONTAINER: &str = "container";
pub const SANDBOX_ID_LABEL_KEY: &str = "io.kubernetes.cri-o.SandboxID";

View File

@@ -0,0 +1,23 @@
// Copyright (c) 2019 Alibaba Cloud
// Copyright (c) 2019 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
#![allow(missing_docs)]
//! Copied from k8s.io/pkg/kubelet/dockershim/docker_service.go. These constants are used to
//! identify whether a docker container is a sandbox or a regular container; they will be removed
//! once dockershim defines them as public fields.
/// ContainerTypeLabelKey is the key of the container type (podsandbox or container) label.
pub const CONTAINER_TYPE_LABEL_KEY: &str = "io.kubernetes.docker.type";
/// ContainerTypeLabelSandbox represents a sandbox (pod sandbox) container.
pub const SANDBOX: &str = "podsandbox";
/// ContainerTypeLabelContainer represents a container running within a sandbox.
pub const CONTAINER: &str = "container";
/// SandboxIDLabelKey is the sandbox ID annotation.
pub const SANDBOX_ID_LABEL_KEY: &str = "io.kubernetes.sandbox.id";

View File

@@ -0,0 +1,910 @@
// Copyright (c) 2019-2021 Alibaba Cloud
// Copyright (c) 2019 Ant Group
//
// SPDX-License-Identifier: Apache-2.0
//
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufReader, Result};
use std::result::{self};
use std::u32;
use serde::Deserialize;
use crate::config::hypervisor::get_hypervisor_plugin;
use crate::config::TomlConfig;
use crate::sl;
/// CRI-containerd specific annotations.
pub mod cri_containerd;
/// CRI-O specific annotations.
pub mod crio;
/// Dockershim specific annotations.
pub mod dockershim;
/// Third-party annotations.
pub mod thirdparty;
// Common section
/// Prefix for Kata specific annotations
pub const KATA_ANNO_PREFIX: &str = "io.katacontainers.";
/// Prefix for Kata configuration annotations
pub const KATA_ANNO_CFG_PREFIX: &str = "io.katacontainers.config.";
/// Prefix for Kata container annotations
pub const KATA_ANNO_CONTAINER_PREFIX: &str = "io.katacontainers.container.";
/// The annotation key to fetch runtime configuration file.
pub const SANDBOX_CFG_PATH_KEY: &str = "io.katacontainers.config_path";
// OCI section
/// The annotation key to fetch the OCI configuration file path.
pub const BUNDLE_PATH_KEY: &str = "io.katacontainers.pkg.oci.bundle_path";
/// The annotation key to fetch container type.
pub const CONTAINER_TYPE_KEY: &str = "io.katacontainers.pkg.oci.container_type";
// Container resource related annotations
/// Prefix for Kata container resource related annotations.
pub const KATA_ANNO_CONTAINER_RES_PREFIX: &str = "io.katacontainers.container.resource";
/// A container annotation to specify the Resources.Memory.Swappiness.
pub const KATA_ANNO_CONTAINER_RES_SWAPPINESS: &str =
"io.katacontainers.container.resource.swappiness";
/// A container annotation to specify the Resources.Memory.Swap.
pub const KATA_ANNO_CONTAINER_RES_SWAP_IN_BYTES: &str =
"io.katacontainers.container.resource.swap_in_bytes";
// Agent related annotations
/// Prefix for Agent configurations.
pub const KATA_ANNO_CFG_AGENT_PREFIX: &str = "io.katacontainers.config.agent.";
/// KernelModules is the annotation key for passing the list of kernel modules and their parameters
/// that will be loaded in the guest kernel.
///
/// Semicolon separated list of kernel modules and their parameters. These modules will be loaded
/// in the guest kernel using modprobe(8).
/// The following example can be used to load two kernel modules with parameters
///
/// annotations:
/// io.katacontainers.config.agent.kernel_modules: "e1000e InterruptThrottleRate=3000,3000,3000 EEE=1; i915 enable_ppgtt=0"
///
/// The first word is considered as the module name and the rest as its parameters.
pub const KATA_ANNO_CFG_KERNEL_MODULES: &str = "io.katacontainers.config.agent.kernel_modules";
/// A sandbox annotation to enable tracing for the agent.
pub const KATA_ANNO_CFG_AGENT_TRACE: &str = "io.katacontainers.config.agent.enable_tracing";
/// An annotation to specify the size of the pipes created for containers.
pub const KATA_ANNO_CFG_AGENT_CONTAINER_PIPE_SIZE: &str =
"io.katacontainers.config.agent.container_pipe_size";
/// The name of the kernel parameter used to pass the container pipe size to the agent.
pub const CONTAINER_PIPE_SIZE_KERNEL_PARAM: &str = "agent.container_pipe_size";
// Hypervisor related annotations
/// Prefix for Hypervisor configurations.
pub const KATA_ANNO_CFG_HYPERVISOR_PREFIX: &str = "io.katacontainers.config.hypervisor.";
/// A sandbox annotation for passing a per container path pointing at the hypervisor that will run
/// the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_PATH: &str = "io.katacontainers.config.hypervisor.path";
/// A sandbox annotation for passing a container hypervisor binary SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_HASH: &str = "io.katacontainers.config.hypervisor.path_hash";
/// A sandbox annotation for passing a per container path pointing at the hypervisor control binary
/// that will run the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_CTLPATH: &str = "io.katacontainers.config.hypervisor.ctlpath";
/// A sandbox annotation for passing a container hypervisor control binary SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_CTLHASH: &str =
"io.katacontainers.config.hypervisor.hypervisorctl_hash";
/// A sandbox annotation for passing a per container path pointing at the jailer that will constrain
/// the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_JAILER_PATH: &str =
"io.katacontainers.config.hypervisor.jailer_path";
/// A sandbox annotation for passing a jailer binary SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_JAILER_HASH: &str =
"io.katacontainers.config.hypervisor.jailer_hash";
/// A sandbox annotation to enable IO to be processed in a separate thread.
/// Currently supported only for the virtio-scsi driver.
pub const KATA_ANNO_CFG_HYPERVISOR_ENABLE_IO_THREADS: &str =
"io.katacontainers.config.hypervisor.enable_iothreads";
/// The hash type used for assets verification
pub const KATA_ANNO_CFG_HYPERVISOR_ASSET_HASH_TYPE: &str =
"io.katacontainers.config.hypervisor.asset_hash_type";
/// SHA512 is the SHA-512 (64) hash algorithm
pub const SHA512: &str = "sha512";
// Hypervisor Block Device related annotations
/// Specify the driver to be used for block device either VirtioSCSI or VirtioBlock
pub const KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_DRIVER: &str =
"io.katacontainers.config.hypervisor.block_device_driver";
/// A sandbox annotation that disallows a block device from being used.
pub const KATA_ANNO_CFG_HYPERVISOR_DISABLE_BLOCK_DEV_USE: &str =
"io.katacontainers.config.hypervisor.disable_block_device_use";
/// A sandbox annotation that specifies whether cache-related options will be set for block devices.
pub const KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_SET: &str =
"io.katacontainers.config.hypervisor.block_device_cache_set";
/// A sandbox annotation that specifies cache-related options for block devices.
/// Denotes whether use of O_DIRECT (bypass the host page cache) is enabled.
pub const KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_DIRECT: &str =
"io.katacontainers.config.hypervisor.block_device_cache_direct";
/// A sandbox annotation that specifies cache-related options for block devices.
/// Denotes whether flush requests for the device are ignored.
pub const KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_NOFLUSH: &str =
"io.katacontainers.config.hypervisor.block_device_cache_noflush";
/// A sandbox annotation to specify whether the nvdimm device should be disabled for the guest rootfs image.
pub const KATA_ANNO_CFG_HYPERVISOR_DISABLE_IMAGE_NVDIMM: &str =
"io.katacontainers.config.hypervisor.disable_image_nvdimm";
/// A sandbox annotation that specifies the memory space used for nvdimm device by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_MEMORY_OFFSET: &str =
"io.katacontainers.config.hypervisor.memory_offset";
/// A sandbox annotation to specify if vhost-user-blk/scsi is available on the host.
pub const KATA_ANNO_CFG_HYPERVISOR_ENABLE_VHOSTUSER_STORE: &str =
"io.katacontainers.config.hypervisor.enable_vhost_user_store";
/// A sandbox annotation to specify the directory path where folders, sockets and device nodes
/// related to vhost-user devices should be created.
pub const KATA_ANNO_CFG_HYPERVISOR_VHOSTUSER_STORE_PATH: &str =
"io.katacontainers.config.hypervisor.vhost_user_store_path";
// Hypervisor Guest Boot related annotations
/// A sandbox annotation for passing a per container path pointing at the kernel needed to boot
/// the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_KERNEL_PATH: &str = "io.katacontainers.config.hypervisor.kernel";
/// A sandbox annotation for passing a container kernel image SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_KERNEL_HASH: &str =
"io.katacontainers.config.hypervisor.kernel_hash";
/// A sandbox annotation for passing additional guest kernel parameters.
pub const KATA_ANNO_CFG_HYPERVISOR_KERNEL_PARAMS: &str =
"io.katacontainers.config.hypervisor.kernel_params";
/// A sandbox annotation for passing a container guest image path.
pub const KATA_ANNO_CFG_HYPERVISOR_IMAGE_PATH: &str = "io.katacontainers.config.hypervisor.image";
/// A sandbox annotation for passing a container guest image SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_IMAGE_HASH: &str =
"io.katacontainers.config.hypervisor.image_hash";
/// A sandbox annotation for passing a per container path pointing at the initrd that will run
/// in the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_INITRD_PATH: &str = "io.katacontainers.config.hypervisor.initrd";
/// A sandbox annotation for passing a container guest initrd SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_INITRD_HASH: &str =
"io.katacontainers.config.hypervisor.initrd_hash";
/// A sandbox annotation for passing a per container path pointing at the guest firmware that will
/// run the container VM.
pub const KATA_ANNO_CFG_HYPERVISOR_FIRMWARE_PATH: &str =
"io.katacontainers.config.hypervisor.firmware";
/// A sandbox annotation for passing a container guest firmware SHA-512 hash value.
pub const KATA_ANNO_CFG_HYPERVISOR_FIRMWARE_HASH: &str =
"io.katacontainers.config.hypervisor.firmware_hash";
// Hypervisor CPU related annotations
/// A sandbox annotation to specify cpu specific features.
pub const KATA_ANNO_CFG_HYPERVISOR_CPU_FEATURES: &str =
"io.katacontainers.config.hypervisor.cpu_features";
/// A sandbox annotation for passing the default vcpus assigned for a VM by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_DEFAULT_VCPUS: &str =
"io.katacontainers.config.hypervisor.default_vcpus";
/// A sandbox annotation that specifies the maximum number of vCPUs allocated for the VM by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_DEFAULT_MAX_VCPUS: &str =
"io.katacontainers.config.hypervisor.default_max_vcpus";
// Hypervisor Device related annotations
/// A sandbox annotation used to indicate if devices need to be hotplugged on the root bus instead
/// of a bridge.
pub const KATA_ANNO_CFG_HYPERVISOR_HOTPLUG_VFIO_ON_ROOT_BUS: &str =
"io.katacontainers.config.hypervisor.hotplug_vfio_on_root_bus";
/// PCIeRootPort is used to indicate the number of PCIe Root Port devices
pub const KATA_ANNO_CFG_HYPERVISOR_PCIE_ROOT_PORT: &str =
"io.katacontainers.config.hypervisor.pcie_root_port";
/// A sandbox annotation to specify if the VM should have a vIOMMU device.
pub const KATA_ANNO_CFG_HYPERVISOR_IOMMU: &str = "io.katacontainers.config.hypervisor.enable_iommu";
/// A sandbox annotation to enable IOMMU_PLATFORM for hypervisor devices.
pub const KATA_ANNO_CFG_HYPERVISOR_IOMMU_PLATFORM: &str =
"io.katacontainers.config.hypervisor.enable_iommu_platform";
// Hypervisor Machine related annotations
/// A sandbox annotation to specify the type of machine being emulated by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_MACHINE_TYPE: &str =
"io.katacontainers.config.hypervisor.machine_type";
/// A sandbox annotation to specify machine specific accelerators for the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_MACHINE_ACCELERATORS: &str =
"io.katacontainers.config.hypervisor.machine_accelerators";
/// EntropySource is a sandbox annotation to specify the path to a host source of
/// entropy (/dev/random, /dev/urandom or real hardware RNG device)
pub const KATA_ANNO_CFG_HYPERVISOR_ENTROPY_SOURCE: &str =
"io.katacontainers.config.hypervisor.entropy_source";
// Hypervisor Memory related annotations
/// A sandbox annotation for the memory assigned for a VM by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_DEFAULT_MEMORY: &str =
"io.katacontainers.config.hypervisor.default_memory";
/// A sandbox annotation to specify the memory slots assigned to the VM by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_MEMORY_SLOTS: &str =
"io.katacontainers.config.hypervisor.memory_slots";
/// A sandbox annotation to specify if the VM memory should be pre-allocated by the hypervisor.
pub const KATA_ANNO_CFG_HYPERVISOR_MEMORY_PREALLOC: &str =
"io.katacontainers.config.hypervisor.enable_mem_prealloc";
/// A sandbox annotation to specify if the memory should be pre-allocated from huge pages.
pub const KATA_ANNO_CFG_HYPERVISOR_HUGE_PAGES: &str =
"io.katacontainers.config.hypervisor.enable_hugepages";
/// A sandbox annotation to specify the file-based memory backend root directory.
pub const KATA_ANNO_CFG_HYPERVISOR_FILE_BACKED_MEM_ROOT_DIR: &str =
"io.katacontainers.config.hypervisor.file_mem_backend";
/// A sandbox annotation that is used to enable/disable virtio-mem.
pub const KATA_ANNO_CFG_HYPERVISOR_VIRTIO_MEM: &str =
"io.katacontainers.config.hypervisor.enable_virtio_mem";
/// A sandbox annotation to enable swap of vm memory.
pub const KATA_ANNO_CFG_HYPERVISOR_ENABLE_SWAP: &str =
"io.katacontainers.config.hypervisor.enable_swap";
/// A sandbox annotation to enable swap in the guest.
pub const KATA_ANNO_CFG_HYPERVISOR_ENABLE_GUEST_SWAP: &str =
"io.katacontainers.config.hypervisor.enable_guest_swap";
// Hypervisor Network related annotations
/// A sandbox annotation to specify if vhost-net is not available on the host.
pub const KATA_ANNO_CFG_HYPERVISOR_DISABLE_VHOST_NET: &str =
"io.katacontainers.config.hypervisor.disable_vhost_net";
/// A sandbox annotation that specifies max rate on network I/O inbound bandwidth.
pub const KATA_ANNO_CFG_HYPERVISOR_RX_RATE_LIMITER_MAX_RATE: &str =
"io.katacontainers.config.hypervisor.rx_rate_limiter_max_rate";
/// A sandbox annotation that specifies max rate on network I/O outbound bandwidth.
pub const KATA_ANNO_CFG_HYPERVISOR_TX_RATE_LIMITER_MAX_RATE: &str =
"io.katacontainers.config.hypervisor.tx_rate_limiter_max_rate";
// Hypervisor Security related annotations
/// A sandbox annotation to specify the path within the VM that will be used for 'drop-in' hooks.
pub const KATA_ANNO_CFG_HYPERVISOR_GUEST_HOOK_PATH: &str =
"io.katacontainers.config.hypervisor.guest_hook_path";
/// A sandbox annotation to enable rootless hypervisor (only supported in QEMU currently).
pub const KATA_ANNO_CFG_HYPERVISOR_ENABLE_ROOTLESS_HYPERVISOR: &str =
"io.katacontainers.config.hypervisor.rootless";
// Hypervisor Shared File System related annotations
/// A sandbox annotation to specify the shared file system type, either virtio-9p or virtio-fs.
pub const KATA_ANNO_CFG_HYPERVISOR_SHARED_FS: &str =
"io.katacontainers.config.hypervisor.shared_fs";
/// A sandbox annotation to specify the virtio-fs vhost-user daemon path.
pub const KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_DAEMON: &str =
"io.katacontainers.config.hypervisor.virtio_fs_daemon";
/// A sandbox annotation to specify the virtio-fs cache mode, e.g. "none".
pub const KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_CACHE: &str =
"io.katacontainers.config.hypervisor.virtio_fs_cache";
/// A sandbox annotation to specify the DAX cache size in MiB.
pub const KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_CACHE_SIZE: &str =
"io.katacontainers.config.hypervisor.virtio_fs_cache_size";
/// A sandbox annotation to pass options to virtiofsd daemon.
pub const KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_EXTRA_ARGS: &str =
"io.katacontainers.config.hypervisor.virtio_fs_extra_args";
/// A sandbox annotation to specify the msize for 9p shares.
pub const KATA_ANNO_CFG_HYPERVISOR_MSIZE_9P: &str = "io.katacontainers.config.hypervisor.msize_9p";
// Runtime related annotations
/// Prefix for Runtime configurations.
pub const KATA_ANNO_CFG_RUNTIME_PREFIX: &str = "io.katacontainers.config.runtime.";
/// runtime name
pub const KATA_ANNO_CFG_RUNTIME_NAME: &str = "io.katacontainers.config.runtime.name";
/// hypervisor name
pub const KATA_ANNO_CFG_RUNTIME_HYPERVISOR: &str =
"io.katacontainers.config.runtime.hypervisor_name";
/// agent name
pub const KATA_ANNO_CFG_RUNTIME_AGENT: &str = "io.katacontainers.config.runtime.agent_name";
/// A sandbox annotation that determines if seccomp should be applied inside the guest.
pub const KATA_ANNO_CFG_DISABLE_GUEST_SECCOMP: &str =
"io.katacontainers.config.runtime.disable_guest_seccomp";
/// A sandbox annotation that determines if pprof is enabled.
pub const KATA_ANNO_CFG_ENABLE_PPROF: &str = "io.katacontainers.config.runtime.enable_pprof";
/// A sandbox annotation that determines if experimental features are enabled.
pub const KATA_ANNO_CFG_EXPERIMENTAL: &str = "io.katacontainers.config.runtime.experimental";
/// A sandbox annotation that determines how the VM should be connected to the container network
/// interface.
pub const KATA_ANNO_CFG_INTER_NETWORK_MODEL: &str =
"io.katacontainers.config.runtime.internetworking_model";
/// SandboxCgroupOnly is a sandbox annotation that determines if kata processes are managed only in the sandbox cgroup.
pub const KATA_ANNO_CFG_SANDBOX_CGROUP_ONLY: &str =
"io.katacontainers.config.runtime.sandbox_cgroup_only";
/// A sandbox annotation that determines whether to create a new netns for the hypervisor process.
pub const KATA_ANNO_CFG_DISABLE_NEW_NETNS: &str =
"io.katacontainers.config.runtime.disable_new_netns";
/// A sandbox annotation to specify how attached VFIO devices should be treated.
pub const KATA_ANNO_CFG_VFIO_MODE: &str = "io.katacontainers.config.runtime.vfio_mode";
/// A helper structure to query configuration information by checking annotations.
#[derive(Debug, Default, Deserialize)]
pub struct Annotation {
annotations: HashMap<String, String>,
}
impl From<HashMap<String, String>> for Annotation {
fn from(annotations: HashMap<String, String>) -> Self {
Annotation { annotations }
}
}
impl Annotation {
/// Create a new instance of [`Annotation`].
pub fn new(annotations: HashMap<String, String>) -> Annotation {
Annotation { annotations }
}
/// Deserialize an object from a json string.
pub fn deserialize<T>(path: &str) -> Result<T>
where
for<'a> T: Deserialize<'a>,
{
let f = BufReader::new(File::open(path)?);
Ok(serde_json::from_reader(f)?)
}
/// Get an immutable reference to the annotation hashmap.
pub fn get_annotations(&self) -> &HashMap<String, String> {
&self.annotations
}
/// Get a mutable reference to the annotation hashmap.
pub fn get_annotations_mut(&mut self) -> &mut HashMap<String, String> {
&mut self.annotations
}
/// Get the value of annotation with `key`
pub fn get_value<T>(
&self,
key: &str,
) -> result::Result<Option<T>, <T as std::str::FromStr>::Err>
where
T: std::str::FromStr,
{
if let Some(value) = self.get(key) {
return value.parse::<T>().map(Some);
}
Ok(None)
}
/// Get the value of annotation with `key` as string.
pub fn get(&self, key: &str) -> Option<String> {
self.annotations.get(key).map(|v| String::from(v.trim()))
}
}
// Miscellaneous annotations.
impl Annotation {
/// Get the annotation of sandbox configuration file path.
pub fn get_sandbox_config_path(&self) -> Option<String> {
self.get(SANDBOX_CFG_PATH_KEY)
}
/// Get the annotation of bundle path.
pub fn get_bundle_path(&self) -> Option<String> {
self.get(BUNDLE_PATH_KEY)
}
/// Get the annotation of container type.
pub fn get_container_type(&self) -> Option<String> {
self.get(CONTAINER_TYPE_KEY)
}
/// Get the annotation to specify the Resources.Memory.Swappiness.
pub fn get_container_resource_swappiness(&self) -> Result<Option<u32>> {
match self.get_value::<u32>(KATA_ANNO_CONTAINER_RES_SWAPPINESS) {
Ok(r) => {
if r.unwrap_or_default() > 100 {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!("{} greater than 100", r.unwrap_or_default()),
));
} else {
Ok(r)
}
}
Err(_e) => Err(io::Error::new(
io::ErrorKind::InvalidData,
"parse u32 error".to_string(),
)),
}
}
/// Get the annotation to specify the Resources.Memory.Swap.
pub fn get_container_resource_swap_in_bytes(&self) -> Option<String> {
self.get(KATA_ANNO_CONTAINER_RES_SWAP_IN_BYTES)
}
}
impl Annotation {
/// update config info by annotation
pub fn update_config_by_annotation(&self, config: &mut TomlConfig) -> Result<()> {
if let Some(hv) = self.annotations.get(KATA_ANNO_CFG_RUNTIME_HYPERVISOR) {
if config.hypervisor.get(hv).is_some() {
config.runtime.hypervisor_name = hv.to_string();
}
}
if let Some(ag) = self.annotations.get(KATA_ANNO_CFG_RUNTIME_AGENT) {
if config.agent.get(ag).is_some() {
config.runtime.agent_name = ag.to_string();
}
}
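// Resolve the (possibly annotation-overridden) hypervisor and agent sections
// that the remaining annotations below will be applied to.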
let hypervisor_name = &config.runtime.hypervisor_name;
let agent_name = &config.runtime.agent_name;
let bool_err = io::Error::new(io::ErrorKind::InvalidData, "parse bool error".to_string());
let u32_err = io::Error::new(io::ErrorKind::InvalidData, "parse u32 error".to_string());
let u64_err = io::Error::new(io::ErrorKind::InvalidData, "parse u64 error".to_string());
let i32_err = io::Error::new(io::ErrorKind::InvalidData, "parse i32 error".to_string());
let mut hv = config.hypervisor.get_mut(hypervisor_name).unwrap();
let mut ag = config.agent.get_mut(agent_name).unwrap();
for (key, value) in &self.annotations {
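// Hypervisor annotations are only applied when the selected hypervisor's
// security configuration explicitly enables them; all other keys fall
// through to the agent and runtime handling below.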
if hv.security_info.is_annotation_enabled(key) {
match key.as_str() {
// update hypervisor config
// Hypervisor related annotations
KATA_ANNO_CFG_HYPERVISOR_PATH => {
hv.validate_hypervisor_path(value)?;
hv.path = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_CTLPATH => {
hv.validate_hypervisor_ctlpath(value)?;
hv.ctlpath = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_JAILER_PATH => {
hv.validate_jailer_path(value)?;
hv.jailer_path = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_ENABLE_IO_THREADS => match self.get_value::<bool>(key)
{
Ok(r) => {
hv.enable_iothreads = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
// Hypervisor Block Device related annotations
KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_DRIVER => {
hv.blockdev_info.block_device_driver = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_DISABLE_BLOCK_DEV_USE => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.disable_block_device_use = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_SET => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.block_device_cache_set = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_DIRECT => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.block_device_cache_direct = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_BLOCK_DEV_CACHE_NOFLUSH => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.block_device_cache_noflush = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_DISABLE_IMAGE_NVDIMM => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.disable_image_nvdimm = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_MEMORY_OFFSET => match self.get_value::<u64>(key) {
Ok(r) => {
hv.blockdev_info.memory_offset = r.unwrap_or_default();
}
Err(_e) => {
return Err(u64_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_ENABLE_VHOSTUSER_STORE => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.blockdev_info.enable_vhost_user_store = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_VHOSTUSER_STORE_PATH => {
hv.blockdev_info.validate_vhost_user_store_path(value)?;
hv.blockdev_info.vhost_user_store_path = value.to_string();
}
// Hypervisor Guest Boot related annotations
KATA_ANNO_CFG_HYPERVISOR_KERNEL_PATH => {
hv.boot_info.validate_boot_path(value)?;
hv.boot_info.kernel = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_KERNEL_PARAMS => {
hv.boot_info.kernel_params = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_IMAGE_PATH => {
hv.boot_info.validate_boot_path(value)?;
hv.boot_info.image = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_INITRD_PATH => {
hv.boot_info.validate_boot_path(value)?;
hv.boot_info.initrd = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_FIRMWARE_PATH => {
hv.boot_info.validate_boot_path(value)?;
hv.boot_info.firmware = value.to_string();
}
// Hypervisor CPU related annotations
KATA_ANNO_CFG_HYPERVISOR_CPU_FEATURES => {
hv.cpu_info.cpu_features = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_DEFAULT_VCPUS => match self.get_value::<i32>(key) {
Ok(num_cpus) => {
let num_cpus = num_cpus.unwrap_or_default();
if num_cpus
> get_hypervisor_plugin(hypervisor_name)
.unwrap()
.get_max_cpus() as i32
{
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"Vcpus specified in annotation {} is more than maximum limitation {}",
num_cpus,
get_hypervisor_plugin(hypervisor_name)
.unwrap()
.get_max_cpus()
),
));
} else {
hv.cpu_info.default_vcpus = num_cpus;
}
}
Err(_e) => {
return Err(i32_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_DEFAULT_MAX_VCPUS => {
match self.get_value::<u32>(key) {
Ok(r) => {
hv.cpu_info.default_maxvcpus = r.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
}
}
// Hypervisor Device related annotations
KATA_ANNO_CFG_HYPERVISOR_HOTPLUG_VFIO_ON_ROOT_BUS => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.device_info.hotplug_vfio_on_root_bus = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_PCIE_ROOT_PORT => match self.get_value::<u32>(key) {
Ok(r) => {
hv.device_info.pcie_root_port = r.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_IOMMU => match self.get_value::<bool>(key) {
Ok(r) => {
hv.device_info.enable_iommu = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_IOMMU_PLATFORM => match self.get_value::<bool>(key) {
Ok(r) => {
hv.device_info.enable_iommu_platform = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
// Hypervisor Machine related annotations
KATA_ANNO_CFG_HYPERVISOR_MACHINE_TYPE => {
hv.machine_info.machine_type = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_MACHINE_ACCELERATORS => {
hv.machine_info.machine_accelerators = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_ENTROPY_SOURCE => {
hv.machine_info.validate_entropy_source(value)?;
hv.machine_info.entropy_source = value.to_string();
}
// Hypervisor Memory related annotations
KATA_ANNO_CFG_HYPERVISOR_DEFAULT_MEMORY => {
match byte_unit::Byte::from_str(value) {
Ok(mem_bytes) => {
let memory_size = mem_bytes
.get_adjusted_unit(byte_unit::ByteUnit::MiB)
.get_value()
as u32;
info!(sl!(), "get mem {} from annotations: {}", memory_size, value);
if memory_size
< get_hypervisor_plugin(hypervisor_name)
.unwrap()
.get_min_memory()
{
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"memory specified in annotation {} is less than minimum limitation {}",
memory_size,
get_hypervisor_plugin(hypervisor_name)
.unwrap()
.get_min_memory()
),
));
}
hv.memory_info.default_memory = memory_size;
}
Err(error) => {
error!(
sl!(),
"failed to parse byte from string {} error {:?}", value, error
);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_MEMORY_SLOTS => match self.get_value::<u32>(key) {
Ok(v) => {
hv.memory_info.memory_slots = v.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_MEMORY_PREALLOC => match self.get_value::<bool>(key) {
Ok(r) => {
hv.memory_info.enable_mem_prealloc = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_HUGE_PAGES => match self.get_value::<bool>(key) {
Ok(r) => {
hv.memory_info.enable_hugepages = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_FILE_BACKED_MEM_ROOT_DIR => {
hv.memory_info.validate_memory_backend_path(value)?;
hv.memory_info.file_mem_backend = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_VIRTIO_MEM => match self.get_value::<bool>(key) {
Ok(r) => {
hv.memory_info.enable_virtio_mem = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_ENABLE_SWAP => match self.get_value::<bool>(key) {
Ok(r) => {
hv.memory_info.enable_swap = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_ENABLE_GUEST_SWAP => match self.get_value::<bool>(key)
{
Ok(r) => {
hv.memory_info.enable_guest_swap = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
// Hypervisor Network related annotations
KATA_ANNO_CFG_HYPERVISOR_DISABLE_VHOST_NET => match self.get_value::<bool>(key)
{
Ok(r) => {
hv.network_info.disable_vhost_net = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_HYPERVISOR_RX_RATE_LIMITER_MAX_RATE => {
match self.get_value::<u64>(key) {
Ok(r) => {
hv.network_info.rx_rate_limiter_max_rate = r.unwrap_or_default();
}
Err(_e) => {
return Err(u64_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_TX_RATE_LIMITER_MAX_RATE => {
match self.get_value::<u64>(key) {
Ok(r) => {
hv.network_info.tx_rate_limiter_max_rate = r.unwrap_or_default();
}
Err(_e) => {
return Err(u64_err);
}
}
}
// Hypervisor Security related annotations
KATA_ANNO_CFG_HYPERVISOR_GUEST_HOOK_PATH => {
hv.security_info.validate_path(value)?;
hv.security_info.guest_hook_path = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_ENABLE_ROOTLESS_HYPERVISOR => {
match self.get_value::<bool>(key) {
Ok(r) => {
hv.security_info.rootless = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
}
}
// Hypervisor Shared File System related annotations
KATA_ANNO_CFG_HYPERVISOR_SHARED_FS => {
hv.shared_fs.shared_fs = self.get(key);
}
KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_DAEMON => {
hv.shared_fs.validate_virtiofs_daemon_path(value)?;
hv.shared_fs.virtio_fs_daemon = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_CACHE => {
hv.shared_fs.virtio_fs_cache = value.to_string();
}
KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_CACHE_SIZE => {
match self.get_value::<u32>(key) {
Ok(r) => {
hv.shared_fs.virtio_fs_cache_size = r.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
}
}
KATA_ANNO_CFG_HYPERVISOR_VIRTIO_FS_EXTRA_ARGS => {
let args: Vec<String> =
value.to_string().split(',').map(str::to_string).collect();
for arg in args {
hv.shared_fs.virtio_fs_extra_args.push(arg.to_string());
}
}
KATA_ANNO_CFG_HYPERVISOR_MSIZE_9P => match self.get_value::<u32>(key) {
Ok(v) => {
hv.shared_fs.msize_9p = v.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
},
_ => {
return Err(io::Error::new(
io::ErrorKind::InvalidInput,
format!("Invalid annotation type {}", key),
));
}
}
} else {
match key.as_str() {
//update agent config
KATA_ANNO_CFG_KERNEL_MODULES => {
let kernel_mod: Vec<String> =
value.to_string().split(';').map(str::to_string).collect();
for modules in kernel_mod {
ag.kernel_modules.push(modules.to_string());
}
}
KATA_ANNO_CFG_AGENT_TRACE => match self.get_value::<bool>(key) {
Ok(r) => {
ag.enable_tracing = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_AGENT_CONTAINER_PIPE_SIZE => match self.get_value::<u32>(key) {
Ok(v) => {
ag.container_pipe_size = v.unwrap_or_default();
}
Err(_e) => {
return Err(u32_err);
}
},
//update runtime config
KATA_ANNO_CFG_RUNTIME_NAME => {
let runtime = vec!["virt-container", "linux-container", "wasm-container"];
if runtime.contains(&value.as_str()) {
config.runtime.name = value.to_string();
} else {
return Err(io::Error::new(
io::ErrorKind::InvalidData,
format!(
"runtime specified in annotation {} is not in {:?}",
&value, &runtime
),
));
}
}
KATA_ANNO_CFG_DISABLE_GUEST_SECCOMP => match self.get_value::<bool>(key) {
Ok(r) => {
config.runtime.disable_guest_seccomp = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_ENABLE_PPROF => match self.get_value::<bool>(key) {
Ok(r) => {
config.runtime.enable_pprof = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_EXPERIMENTAL => {
let args: Vec<String> =
value.to_string().split(',').map(str::to_string).collect();
for arg in args {
config.runtime.experimental.push(arg.to_string());
}
}
KATA_ANNO_CFG_INTER_NETWORK_MODEL => {
config.runtime.internetworking_model = value.to_string();
}
KATA_ANNO_CFG_SANDBOX_CGROUP_ONLY => match self.get_value::<bool>(key) {
Ok(r) => {
config.runtime.sandbox_cgroup_only = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_DISABLE_NEW_NETNS => match self.get_value::<bool>(key) {
Ok(r) => {
config.runtime.disable_new_netns = r.unwrap_or_default();
}
Err(_e) => {
return Err(bool_err);
}
},
KATA_ANNO_CFG_VFIO_MODE => {
config.runtime.vfio_mode = value.to_string();
}
_ => {
warn!(sl!(), "Annotation {} not enabled", key);
}
}
}
}
Ok(())
}
}
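For illustration only, a minimal sketch of reading a typed value back out of an Annotation; the test module name and the map contents are assumptions, not part of this file:

#[cfg(test)]
mod annotation_usage_example {
    use super::*;
    use std::collections::HashMap;

    #[test]
    fn read_typed_annotation() {
        let mut map = HashMap::new();
        map.insert(
            KATA_ANNO_CFG_HYPERVISOR_DEFAULT_VCPUS.to_string(),
            "4".to_string(),
        );
        let anno = Annotation::new(map);
        // get_value() parses the raw annotation string into the requested type.
        let vcpus = anno
            .get_value::<i32>(KATA_ANNO_CFG_HYPERVISOR_DEFAULT_VCPUS)
            .unwrap_or(None);
        assert_eq!(vcpus, Some(4));
    }
}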

View File

@@ -0,0 +1,12 @@
// Copyright (c) 2021 Alibaba Cloud
//
// SPDX-License-Identifier: Apache-2.0
//
//! Third-party annotations - annotations defined by other projects or k8s plugins but that can
//! change Kata Containers behaviour.
/// Annotation to enable SGX.
///
/// Hardware-based isolation and memory encryption.
pub const SGX_EPC: &str = "sgx.intel.com/epc";

View File

@@ -0,0 +1,123 @@
// Copyright (c) 2021 Alibaba Cloud
//
// SPDX-License-Identifier: Apache-2.0
//
use std::io::Result;
use crate::config::{ConfigOps, TomlConfig};
pub use vendor::AgentVendor;
/// Kata agent configuration information.
#[derive(Debug, Default, Deserialize, Serialize)]
pub struct Agent {
/// If enabled, the agent will log additional debug messages to the system log.
#[serde(default, rename = "enable_debug")]
pub debug: bool,
/// Enable agent tracing.
///
/// If enabled, the agent will generate OpenTelemetry trace spans.
/// # Notes:
/// - If the runtime also has tracing enabled, the agent spans will be associated with the
/// appropriate runtime parent span.
/// - If enabled, the runtime will wait for the container to shut down, increasing the container
/// shutdown time slightly.
#[serde(default)]
pub enable_tracing: bool,
/// Enable debug console.
/// If enabled, the user can connect to the guest OS running inside the hypervisor through the
/// "kata-runtime exec <sandbox-id>" command.
#[serde(default)]
pub debug_console_enabled: bool,
/// Agent server port
#[serde(default)]
pub server_port: u32,
/// Agent log port
#[serde(default)]
pub log_port: u32,
/// Agent connection dialing timeout value in milliseconds
#[serde(default = "default_dial_timeout")]
pub dial_timeout_ms: u32,
/// Agent reconnect timeout value in milliseconds
#[serde(default = "default_reconnect_timeout")]
pub reconnect_timeout_ms: u32,
/// Agent request timeout value in milliseconds
#[serde(default = "default_request_timeout")]
pub request_timeout_ms: u32,
/// Agent health check request timeout value in milliseconds
#[serde(default = "default_health_check_timeout")]
pub health_check_request_timeout_ms: u32,
/// Comma separated list of kernel modules and their parameters.
///
/// These modules will be loaded in the guest kernel using modprobe(8).
/// The following example can be used to load two kernel modules with parameters:
/// - kernel_modules=["e1000e InterruptThrottleRate=3000,3000,3000 EEE=1", "i915 enable_ppgtt=0"]
/// The first word is considered as the module name and the rest as its parameters.
/// Container will not be started when:
/// - A kernel module is specified and the modprobe command is not installed in the guest
/// or it fails loading the module.
/// - The module is not available in the guest or it doesn't meet the guest kernel
/// requirements, like architecture and version.
#[serde(default)]
pub kernel_modules: Vec<String>,
/// container pipe size
pub container_pipe_size: u32,
}
fn default_dial_timeout() -> u32 {
// 10ms
10
}
fn default_reconnect_timeout() -> u32 {
// 3s
3_000
}
fn default_request_timeout() -> u32 {
// 30s
30_000
}
fn default_health_check_timeout() -> u32 {
// 90s
90_000
}
impl ConfigOps for Agent {
fn adjust_config(conf: &mut TomlConfig) -> Result<()> {
AgentVendor::adjust_config(conf)?;
Ok(())
}
fn validate(conf: &TomlConfig) -> Result<()> {
AgentVendor::validate(conf)?;
Ok(())
}
}
#[cfg(not(feature = "enable-vendor"))]
mod vendor {
use super::*;
/// Vendor customization agent configuration.
#[derive(Debug, Default, Deserialize, Serialize)]
pub struct AgentVendor {}
impl ConfigOps for AgentVendor {}
}
#[cfg(feature = "enable-vendor")]
#[path = "agent_vendor.rs"]
mod vendor;
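For illustration only, a small round-trip sketch of the defaults above; the test module, the TOML fragment and the field values are assumptions, not part of this file:

#[cfg(test)]
mod agent_defaults_example {
    use super::*;

    #[test]
    fn defaults_apply_when_fields_are_omitted() {
        // `container_pipe_size` has no serde default, so it must be present;
        // every omitted field falls back to the defaults defined above.
        let fragment = r#"
            enable_debug = true
            kernel_modules = ["i915 enable_ppgtt=0"]
            container_pipe_size = 1024
        "#;
        let agent: Agent = toml::from_str(fragment).unwrap();
        assert!(agent.debug);
        assert_eq!(agent.dial_timeout_ms, 10);
        assert_eq!(agent.request_timeout_ms, 30_000);
    }
}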

Some files were not shown because too many files have changed in this diff.