packaging: merge packaging repository

git-subtree-dir: tools/packaging
git-subtree-mainline: f818b46a41
git-subtree-split: 1f22d72d5d

Signed-off-by: Peng Tao <bergwolf@hyper.sh>
Peng Tao
2020-06-23 22:49:04 -07:00
645 changed files with 292694 additions and 0 deletions


@@ -0,0 +1,179 @@
# Build Kata Containers Kernel
* [Requirements](#requirements)
* [Usage](#usage)
* [Setup kernel source code](#setup-kernel-source-code)
* [Build the kernel](#build-the-kernel)
* [Install the Kernel in the default path for Kata](#install-the-kernel-in-the-default-path-for-kata)
* [Submit Kernel Changes](#submit-kernel-changes)
* [How is it tested](#how-is-it-tested)
* [Contribute](#contribute)
This document explains how to build a kernel recommended for use with
Kata Containers. Use the `build-kernel.sh` script, which automates the
process of building a kernel for Kata Containers.
## Requirements
The `build-kernel.sh` script requires an installed Golang version matching the
[component build requirements](https://github.com/kata-containers/documentation/blob/master/Developer-Guide.md#requirements-to-build-individual-components).
## Usage
```
$ ./build-kernel.sh -h
Overview:
Build a kernel for Kata Containers
Description: This script is the *ONLY* way to build a kernel for development.
Usage:
build-kernel.sh [options] <command> <argument>
Commands:
- setup
- build
- install
Options:
-c <path> : Path to config file to build the kernel.
-d : Enable bash debug.
-e : Enable experimental kernel.
-f : Enable force generate config when setup.
-g <vendor> : GPU vendor, intel or nvidia.
-h : Display this help.
-k <path> : Path to kernel to build.
-p <path> : Path to a directory with patches to apply to kernel.
-t <hypervisor> : Hypervisor target.
-v <version> : Kernel version to use if kernel path not provided.
```
Example:
```
$ ./build-kernel.sh -v 4.19.86 -g nvidia -f -d setup
```
> **Note**
> - `-v 4.19.86`: Specify the guest kernel version.
> - `-g nvidia`: To build a guest kernel supporting Nvidia GPU.
> - `-f`: The .config file is forced to be generated even if the kernel directory already exists.
> - `-d`: Enable bash debug mode.
## Setup kernel source code
```bash
$ go get -d -u github.com/kata-containers/packaging
$ cd $GOPATH/src/github.com/kata-containers/packaging/kernel
$ ./build-kernel.sh setup
```
The script `./build-kernel.sh` tries to apply the patches from
`${GOPATH}/src/github.com/kata-containers/packaging/kernel/patches/` when it
sets up a kernel. To modify the kernel source, add a patch to the subdirectory
of this directory that matches the kernel series (for example `4.19.x`).
The script also copies a kernel config file from
`${GOPATH}/src/github.com/kata-containers/packaging/kernel/configs/` to `.config`
in the kernel source tree. You can modify it as needed.
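The order in which patches are picked up can be sketched as follows. This is a minimal illustration rather than the script's own code: `list_patches` is a hypothetical helper, and it assumes patches follow the `git format-patch` naming scheme (a zero-padded number, a dash, then a description), so a plain sort yields the apply order.

```bash
# Hypothetical helper (not part of build-kernel.sh): print the patches found
# in a directory, lowest numeric prefix first.
list_patches() {
    local dir="$1"
    # A missing directory simply means there is nothing to apply.
    [ -d "${dir}" ] || return 0
    find "${dir}" -name '*.patch' -type f | sort
}

# Usage sketch with throwaway files.
demo_dir="$(mktemp -d)"
touch "${demo_dir}/0002-second.patch" "${demo_dir}/0001-first.patch"
list_patches "${demo_dir}"
rm -rf "${demo_dir}"
```

Keeping the numeric prefix zero-padded is what makes the lexical sort agree with the intended order.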
## Build the kernel
After the kernel source code is ready, it is possible to build the kernel.
```bash
$ ./build-kernel.sh build
```
## Install the Kernel in the default path for Kata
Kata Containers looks for the kernel to boot in a default path. The following
command installs the kernel to that default path
(`/usr/share/kata-containers/`).
```bash
$ ./build-kernel.sh install
```
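The install location is derived from the `DESTDIR` and `PREFIX` environment variables, which default to `/` and `/usr` in the script. The following is a minimal sketch of that derivation. `kata_install_path` is a hypothetical helper, and `tr -s '/'` is used here as a simple stand-in for the path normalization that the script performs with `readlink -m`.

```bash
# Hypothetical helper (not part of build-kernel.sh): compute the directory
# the kernel would be installed into for a given DESTDIR and PREFIX.
kata_install_path() {
    local destdir="${1:-/}"
    local prefix="${2:-/usr}"
    # Join the parts, then squeeze repeated slashes into a single one.
    echo "${destdir}/${prefix}/share/kata-containers" | tr -s '/'
}

kata_install_path                   # default location
kata_install_path /tmp/stage /usr   # staged install, e.g. for packaging
```

Under the same assumption, running `DESTDIR=/tmp/stage ./build-kernel.sh install` should place the kernel images under `/tmp/stage/usr/share/kata-containers/` instead of the system path.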
## Submit Kernel Changes
The Kata Containers packaging repository holds the kernel configs and patches.
The configs and patches can work for many kernel versions, but only the
kernel version defined in the [runtime versions file][runtime-versions-file] is tested.
For further details, see [the kernel configuration documentation](configs).
## How is it tested
The Kata Containers CI scripts either install the kernel from the [CI cache
job][cache-job] or build it from source.
If the kernel defined in the [runtime versions file][runtime-versions-file] has
already been built and cached with the latest kernel config and patches, the
cached kernel is installed. Otherwise, the kernel is built from source.
The Kata kernel version combines the kernel version defined in the [runtime
versions file][runtime-versions-file] with the contents of the file
`kata_config_version`. This helps to identify whether a kernel build has the
latest recommended configuration.
Example:
```bash
# From https://github.com/kata-containers/runtime/blob/master/versions.yaml
$ kernel_version_in_versions_file=4.10.1
# From https://github.com/kata-containers/packaging/blob/master/kernel/kata_config_version
$ kata_config_version=25
$ latest_kernel_version=${kernel_version_in_versions_file}-${kata_config_version}
```
The resulting version is 4.10.1-25. This helps identify whether the kernel
configs used by a given CI build are up to date.
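The composition and comparison of the version string can be sketched with two small helpers. This is an illustrative sketch only: `kata_kernel_version` and `needs_rebuild` are hypothetical names, and how the CI obtains the version of the cached kernel is an assumption that is not shown here.

```bash
# Hypothetical helper: combine the kernel version and the config version.
kata_kernel_version() {
    local kernel_version="${1#v}"   # drop a leading "v", e.g. v4.19.86
    local config_version="$2"
    echo "${kernel_version}-${config_version}"
}

# Hypothetical helper: succeed when the cached kernel is out of date.
needs_rebuild() {
    local cached="$1"
    local latest="$2"
    [ "${cached}" != "${latest}" ]
}

latest="$(kata_kernel_version 4.10.1 25)"
if needs_rebuild "4.10.1-24" "${latest}"; then
    echo "cached kernel is stale, building ${latest} from source"
fi
```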
## Contribute
There are two places to contribute Kata kernel changes:
1. [Kata runtime versions file][runtime-versions-file]: This file points to the
recommended versions to be used by Kata. To update the kernel version, send a
pull request that updates the version in this file. The Kata CI will run all
the use cases and verify that it works.
1. Kata packaging repository: This repository contains all the kernel configs
and patches recommended for the Kata Containers kernel:
- To add a new configuration (for a new kernel version or a specific
architecture), make sure the config file name has the following format:
```bash
# Format:
$ ${arch}_kata_${hypervisor_target}_${major_kernel_version}.x
# example:
$ arch=x86_64
$ hypervisor_target=kvm
$ major_kernel_version=4.19
# Resulting file
$ name: x86_64_kata_kvm_4.19.x
```
- Kernel patches: the CI and packaging scripts apply all patches in the
[patches directory][patches-dir].
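The config file naming convention above can be expressed as a small helper. This is a sketch under stated assumptions: `kata_config_name` is a hypothetical function, and the mapping of `firecracker` to the `kvm` config mirrors the fallback that `build-kernel.sh` applies when it looks up a complete config file.

```bash
# Hypothetical helper: derive the expected config file name from the
# architecture, hypervisor target, and full kernel version.
kata_config_name() {
    local arch="$1"
    local hypervisor="$2"
    local version="$3"
    local major minor
    major="$(echo "${version}" | cut -d. -f1)"
    minor="$(echo "${version}" | cut -d. -f2)"
    # firecracker reuses the kvm config in the lookup performed by the script.
    if [ "${hypervisor}" = "firecracker" ]; then
        hypervisor="kvm"
    fi
    echo "${arch}_kata_${hypervisor}_${major}.${minor}.x"
}

kata_config_name x86_64 kvm 4.19.86
```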
Note: The kernel version and configuration file live in different locations,
which could result in a circular dependency on your (runtime or packaging) PR.
In this case, the PR you submit needs to be tested together with a patch from
another Kata Containers repository. To do this you have to specify which
repository and which pull request [it depends on][depends-on-docs].
[runtime-versions-file]: https://github.com/kata-containers/runtime/blob/master/versions.yaml
[patches-dir]: https://github.com/kata-containers/packaging/tree/master/kernel/patches
[depends-on-docs]: https://github.com/kata-containers/tests/blob/master/README.md#breaking-compatibility
[cache-job]: http://jenkins.katacontainers.io/job/image-nightly-x86_64/


@@ -0,0 +1,539 @@
#!/bin/bash
#
# Copyright (c) 2018 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
description="
Description: This script is the *ONLY* way to build a kernel for development.
"
set -o errexit
set -o nounset
set -o pipefail
readonly script_name="$(basename "${BASH_SOURCE[0]}")"
readonly script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
kata_version="${kata_version:-}"
#project_name
readonly project_name="kata-containers"
[ -n "${GOPATH:-}" ] || GOPATH="${HOME}/go"
# Fetch the first element from GOPATH as working directory
# as go get only works against the first item in the GOPATH
GOPATH="${GOPATH%%:*}"
# Kernel version to be used
kernel_version=""
# Flag to know whether the kernel source needs to be downloaded
download_kernel=false
# The repository where the versions file lives
runtime_repository="github.com/${project_name}/runtime"
# The repository where kernel configuration lives
readonly kernel_config_repo="github.com/${project_name}/packaging"
readonly patches_repo="github.com/${project_name}/packaging"
readonly patches_repo_dir="${GOPATH}/src/${patches_repo}"
# Default path to search patches to apply to kernel
readonly default_patches_dir="${patches_repo_dir}/kernel/patches/"
# Default path to search config for kata
readonly default_kernel_config_dir="${GOPATH}/src/${kernel_config_repo}/kernel/configs"
# Default path to search for kernel config fragments
readonly default_config_frags_dir="${GOPATH}/src/${kernel_config_repo}/kernel/configs/fragments"
readonly default_config_whitelist="${GOPATH}/src/${kernel_config_repo}/kernel/configs/fragments/whitelist.conf"
# GPU vendor
readonly GV_INTEL="intel"
readonly GV_NVIDIA="nvidia"
#Path to kernel directory
kernel_path=""
#Experimental kernel support. Pull from virtio-fs GitLab instead of kernel.org
experimental_kernel="false"
#Force generate config when setup
force_setup_generate_config="false"
#GPU kernel support
gpu_vendor=""
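# Path to a directory with patches to apply to the kernel (-p)
patches_path=""
# Hypervisor target (-t)
hypervisor_target=""
# Target architecture (-a)
arch_target=""
# Path to the kernel config file (-c)
kernel_config_path=""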
# destdir
DESTDIR="${DESTDIR:-/}"
#PREFIX=
PREFIX="${PREFIX:-/usr}"
source "${script_dir}/../scripts/lib.sh"
usage() {
exit_code="$1"
cat <<EOT
Overview:
Build a kernel for Kata Containers
${description}
Usage:
$script_name [options] <command> <argument>
Commands:
- setup
- build
- install
Options:
-c <path> : Path to config file to build the kernel.
-d : Enable bash debug.
-e : Enable experimental kernel.
-f : Enable force generate config when setup.
-g <vendor> : GPU vendor, intel or nvidia.
-h : Display this help.
-k <path> : Path to kernel to build.
-p <path> : Path to a directory with patches to apply to kernel.
-t <hypervisor> : Hypervisor target.
-v <version> : Kernel version to use if kernel path not provided.
EOT
exit "$exit_code"
}
# Convert architecture to the name used by the Linux kernel build system
arch_to_kernel() {
local -r arch="$1"
case "$arch" in
aarch64) echo "arm64" ;;
ppc64le) echo "powerpc" ;;
s390x) echo "s390" ;;
x86_64) echo "$arch" ;;
*) die "unsupported architecture: $arch" ;;
esac
}
get_kernel() {
local version="${1:-}"
local kernel_path=${2:-}
[ -n "${kernel_path}" ] || die "kernel_path not provided"
[ ! -d "${kernel_path}" ] || die "kernel_path already exists"
if [[ ${experimental_kernel} == "true" ]]; then
kernel_tarball="linux-${version}.tar.gz"
curl --fail -OL "https://gitlab.com/virtio-fs/linux/-/archive/${version}/${kernel_tarball}"
tar xf "${kernel_tarball}"
mv "linux-${version}" "${kernel_path}"
else
#Remove extra 'v'
version=${version#v}
major_version=$(echo "${version}" | cut -d. -f1)
kernel_tarball="linux-${version}.tar.xz"
if [ ! -f sha256sums.asc ] || ! grep -q "${kernel_tarball}" sha256sums.asc; then
info "Download kernel checksum file: sha256sums.asc"
curl --fail -OL "https://cdn.kernel.org/pub/linux/kernel/v${major_version}.x/sha256sums.asc"
fi
grep "${kernel_tarball}" sha256sums.asc >"${kernel_tarball}.sha256"
if [ -f "${kernel_tarball}" ] && ! sha256sum -c "${kernel_tarball}.sha256"; then
info "invalid kernel tarball ${kernel_tarball} removing "
rm -f "${kernel_tarball}"
fi
if [ ! -f "${kernel_tarball}" ]; then
info "Download kernel version ${version}"
curl --fail -OL "https://www.kernel.org/pub/linux/kernel/v${major_version}.x/${kernel_tarball}"
else
info "kernel tarball already downloaded"
fi
sha256sum -c "${kernel_tarball}.sha256"
tar xf "${kernel_tarball}"
mv "linux-${version}" "${kernel_path}"
fi
}
get_major_kernel_version() {
local version="${1}"
[ -n "${version}" ] || die "kernel version not provided"
major_version=$(echo "${version}" | cut -d. -f1)
minor_version=$(echo "${version}" | cut -d. -f2)
echo "${major_version}.${minor_version}"
}
# Make a kernel config file from generic and arch specific
# fragments
# - arg1 - path to arch specific fragments
# - arg2 - path to kernel sources
#
get_kernel_frag_path() {
local arch_path="$1"
local common_path="${arch_path}/../common"
local gpu_path="${arch_path}/../gpu"
local kernel_path="$2"
local arch="$3"
local cmdpath="${kernel_path}/scripts/kconfig/merge_config.sh"
local config_path="${arch_path}/.config"
local arch_configs="$(ls ${arch_path}/*.conf)"
# Exclude configs if they have !$arch tag in the header
local common_configs="$(grep "\!${arch}" ${common_path}/*.conf -L)"
local experimental_configs="$(ls ${common_path}/experimental/*.conf)"
# These are the strings that the kernel merge_config.sh script kicks out
# when it reports an error or warning condition. We search for them in the
# output to try and fail when we think something has been misconfigured.
local not_in_string="not in final"
local redefined_string="redefined"
local redundant_string="redundant"
# Later, if we need to add kernel version specific subdirs in order to
# handle specific cases, then add the path definition and search/list/cat
# here.
local all_configs="${common_configs} ${arch_configs}"
if [[ ${experimental_kernel} == "true" ]]; then
all_configs="${all_configs} ${experimental_configs}"
fi
if [[ "${gpu_vendor}" != "" ]];then
info "Add kernel config for GPU due to '-g ${gpu_vendor}'"
local gpu_configs="$(ls ${gpu_path}/${gpu_vendor}.conf)"
all_configs="${all_configs} ${gpu_configs}"
fi
info "Constructing config from fragments: ${config_path}"
export KCONFIG_CONFIG=${config_path}
export ARCH=${arch_target}
cd ${kernel_path}
local results
results=$( ${cmdpath} -r -n ${all_configs} )
# Only consider results highlighting "not in final"
results=$(grep "${not_in_string}" <<< "$results")
# Do not care about options that are in whitelist
results=$(grep -v -f ${default_config_whitelist} <<< "$results")
# Did we request any entries that did not make it?
local missing=$(echo $results | grep -v -q "${not_in_string}"; echo $?)
if [ ${missing} -ne 0 ]; then
info "Some CONFIG elements failed to make the final .config:"
info "${results}"
info "Generated config file can be found in ${config_path}"
die "Failed to construct requested .config file"
fi
# Did we define something as two different values?
local redefined=$(echo ${results} | grep -v -q "${redefined_string}"; echo $?)
if [ ${redefined} -ne 0 ]; then
info "Some CONFIG elements are redefined in fragments:"
info "${results}"
info "Generated config file can be found in ${config_path}"
die "Failed to construct requested .config file"
fi
# Did we define something twice? Nominally this may not be an error, and it
# might be convenient to allow it, but for now, let's pick up on them.
local redundant=$(echo ${results} | grep -v -q "${redundant_string}"; echo $?)
if [ ${redundant} -ne 0 ]; then
info "Some CONFIG elements are redundant in fragments:"
info "${results}"
info "Generated config file can be found in ${config_path}"
die "Failed to construct requested .config file"
fi
echo "${config_path}"
}
# Locate and return the path to the relevant kernel config file
# - arg1: kernel version
# - arg2: hypervisor target
# - arg3: arch target
# - arg4: kernel source path
get_default_kernel_config() {
local version="${1}"
local hypervisor="$2"
local kernel_arch="$3"
local kernel_path="$4"
[ -n "${version}" ] || die "kernel version not provided"
[ -n "${hypervisor}" ] || die "hypervisor not provided"
[ -n "${kernel_arch}" ] || die "kernel arch not provided"
local kernel_ver
kernel_ver=$(get_major_kernel_version "${version}")
archfragdir="${default_config_frags_dir}/${kernel_arch}"
if [ -d "${archfragdir}" ]; then
config="$(get_kernel_frag_path ${archfragdir} ${kernel_path} ${kernel_arch})"
else
[ "${hypervisor}" == "firecracker" ] && hypervisor="kvm"
config="${default_kernel_config_dir}/${kernel_arch}_kata_${hypervisor}_${kernel_ver}.x"
fi
[ -f "${config}" ] || die "failed to find default config ${config}"
echo "${config}"
}
get_config_and_patches() {
if [ -z "${patches_path}" ]; then
patches_path="${default_patches_dir}"
if [ ! -d "${patches_path}" ]; then
tag="${kata_version}"
git clone -q "https://${patches_repo}.git" "${patches_repo_dir}"
pushd "${patches_repo_dir}" >> /dev/null
# Quote the variable: an unquoted empty value makes `[ -n ]` always true
if [ -n "$tag" ] ; then
info "checking out $tag"
git checkout -q "$tag"
fi
popd >> /dev/null
fi
fi
}
get_config_version() {
get_config_and_patches
config_version_file="${default_patches_dir}/../kata_config_version"
if [ -f "${config_version_file}" ]; then
cat "${config_version_file}"
else
die "failed to find ${config_version_file}"
fi
}
setup_kernel() {
local kernel_path=${1:-}
[ -n "${kernel_path}" ] || die "kernel_path not provided"
if [ -d "$kernel_path" ]; then
info "${kernel_path} already exists"
if [[ "${force_setup_generate_config}" != "true" ]];then
return
else
info "Force generate config due to '-f'"
fi
else
info "kernel path does not exist, will download kernel"
download_kernel="true"
[ -n "$kernel_version" ] || die "failed to get kernel version: Kernel version is empty"
if [[ ${download_kernel} == "true" ]]; then
get_kernel "${kernel_version}" "${kernel_path}"
fi
[ -n "$kernel_path" ] || die "failed to find kernel source path"
get_config_and_patches
[ -d "${patches_path}" ] || die " patches path '${patches_path}' does not exist"
fi
local major_kernel
major_kernel=$(get_major_kernel_version "${kernel_version}")
local patches_dir_for_version="${patches_path}/${major_kernel}.x"
local kernel_patches=""
if [ -d "${patches_dir_for_version}" ]; then
# Patches are expected to be named in the standard
# git-format-patch(1) format where the first part of the
# filename represents the patch ordering
# (lowest numbers apply first):
#
# "${number}-${dashed_description}"
#
# For example,
#
# 0001-fix-the-bad-thing.patch
# 0002-improve-the-fix-the-bad-thing-fix.patch
# 0003-correct-compiler-warnings.patch
kernel_patches=$(find "${patches_dir_for_version}" -name '*.patch' -type f |\
sort -t- -k1,1n)
else
info "kernel patches directory does not exist"
fi
[ -n "${arch_target}" ] || arch_target="$(uname -m)"
arch_target=$(arch_to_kernel "${arch_target}")
(
cd "${kernel_path}" || exit 1
for p in ${kernel_patches}; do
info "Applying patch $p"
patch -p1 --fuzz 0 <"$p"
done
[ -n "${hypervisor_target}" ] || hypervisor_target="kvm"
[ -n "${kernel_config_path}" ] || kernel_config_path=$(get_default_kernel_config "${kernel_version}" "${hypervisor_target}" "${arch_target}" "${kernel_path}")
info "Copying config file from: ${kernel_config_path}"
cp "${kernel_config_path}" ./.config
make oldconfig
)
}
build_kernel() {
local kernel_path=${1:-}
[ -n "${kernel_path}" ] || die "kernel_path not provided"
[ -d "${kernel_path}" ] || die "path to kernel does not exist, use ${script_name} setup"
[ -n "${arch_target}" ] || arch_target="$(uname -m)"
arch_target=$(arch_to_kernel "${arch_target}")
pushd "${kernel_path}" >>/dev/null
make -j $(nproc) ARCH="${arch_target}"
[ "$arch_target" != "powerpc" ] && ([ -e "arch/${arch_target}/boot/bzImage" ] || [ -e "arch/${arch_target}/boot/Image.gz" ])
[ -e "vmlinux" ]
[ "${hypervisor_target}" == "firecracker" ] && [ "${arch_target}" == "arm64" ] && [ -e "arch/${arch_target}/boot/Image" ]
popd >>/dev/null
}
install_kata() {
local kernel_path=${1:-}
[ -n "${kernel_path}" ] || die "kernel_path not provided"
[ -d "${kernel_path}" ] || die "path to kernel does not exist, use ${script_name} setup"
pushd "${kernel_path}" >>/dev/null
config_version=$(get_config_version)
[ -n "${config_version}" ] || die "failed to get config version"
install_path=$(readlink -m "${DESTDIR}/${PREFIX}/share/${project_name}")
suffix=""
if [[ ${experimental_kernel} == "true" ]]; then
suffix="-virtiofs"
fi
if [[ ${gpu_vendor} != "" ]];then
suffix="-${gpu_vendor}-gpu${suffix}"
fi
vmlinuz="vmlinuz-${kernel_version}-${config_version}${suffix}"
vmlinux="vmlinux-${kernel_version}-${config_version}${suffix}"
if [ -e "arch/${arch_target}/boot/bzImage" ]; then
bzImage="arch/${arch_target}/boot/bzImage"
elif [ -e "arch/${arch_target}/boot/Image.gz" ]; then
bzImage="arch/${arch_target}/boot/Image.gz"
elif [ "${arch_target}" != "powerpc" ]; then
die "failed to find image"
fi
# Install compressed kernel
if [ "${arch_target}" = "powerpc" ]; then
install --mode 0644 -D "vmlinux" "${install_path}/${vmlinuz}"
else
install --mode 0644 -D "${bzImage}" "${install_path}/${vmlinuz}"
fi
# Install uncompressed kernel
if [ "${arch_target}" = "arm64" ]; then
install --mode 0644 -D "arch/${arch_target}/boot/Image" "${install_path}/${vmlinux}"
else
install --mode 0644 -D "vmlinux" "${install_path}/${vmlinux}"
fi
install --mode 0644 -D ./.config "${install_path}/config-${kernel_version}"
ln -sf "${vmlinuz}" "${install_path}/vmlinuz${suffix}.container"
ln -sf "${vmlinux}" "${install_path}/vmlinux${suffix}.container"
ls -la "${install_path}/vmlinux${suffix}.container"
ls -la "${install_path}/vmlinuz${suffix}.container"
popd >>/dev/null
}
main() {
while getopts "a:c:defg:hk:p:t:v:" opt; do
case "$opt" in
a)
arch_target="${OPTARG}"
;;
c)
kernel_config_path="${OPTARG}"
;;
d)
PS4=' Line ${LINENO}: '
set -x
;;
e)
experimental_kernel="true"
;;
f)
force_setup_generate_config="true"
;;
g)
gpu_vendor="${OPTARG}"
[[ "${gpu_vendor}" == "${GV_INTEL}" || "${gpu_vendor}" == "${GV_NVIDIA}" ]] || die "GPU vendor only supports intel and nvidia"
;;
h)
usage 0
;;
k)
kernel_path="${OPTARG}"
;;
p)
patches_path="${OPTARG}"
;;
t)
hypervisor_target="${OPTARG}"
;;
v)
kernel_version="${OPTARG}"
;;
esac
done
shift $((OPTIND - 1))
subcmd="${1:-}"
[ -z "${subcmd}" ] && usage 1
# If not kernel version take it from versions.yaml
if [ -z "$kernel_version" ]; then
if [[ ${experimental_kernel} == "true" ]]; then
kernel_version=$(get_from_kata_deps "assets.kernel-experimental.tag" "${kata_version}")
else
kernel_version=$(get_from_kata_deps "assets.kernel.version" "${kata_version}")
#Remove extra 'v'
kernel_version="${kernel_version#v}"
fi
fi
if [ -z "${kernel_path}" ]; then
config_version=$(get_config_version)
if [[ ${experimental_kernel} == "true" ]]; then
kernel_path="${PWD}/kata-linux-experimental-${kernel_version}-${config_version}"
else
kernel_path="${PWD}/kata-linux-${kernel_version}-${config_version}"
fi
info "Config version: ${config_version}"
fi
info "Kernel version: ${kernel_version}"
case "${subcmd}" in
build)
build_kernel "${kernel_path}"
;;
install)
build_kernel "${kernel_path}"
install_kata "${kernel_path}"
;;
setup)
setup_kernel "${kernel_path}"
[ -d "${kernel_path}" ] || die "${kernel_path} does not exist"
echo "Kernel source ready: ${kernel_path} "
;;
*)
usage 1
;;
esac
}
main "$@"


@@ -0,0 +1,71 @@
* [Kata Containers kernel config files](#kata-containers-kernel-config-files)
* [Types of config files](#types-of-config-files)
* [How to use config files](#how-to-use-config-files)
* [How to modify config files](#how-to-modify-config-files)
# Kata Containers kernel config files
This directory contains Linux Kernel config files used to configure Kata
Containers VM kernels.
## Types of config files
This directory holds config files for the Kata Linux Kernel in two forms:
- A tree of config file 'fragments' in the `fragments` sub-folder, that are
constructed into a complete config file using the kernel
`scripts/kconfig/merge_config.sh` script.
- As complete config files that can be used as-is.
Kernel config fragments are the preferred method of constructing `.config` files
to build Kata Containers kernels, due to their improved clarity and ease of maintenance
over single file monolithic `.config`s.
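To give a feel for what combining fragments means, the sketch below concatenates fragments and keeps the last assignment of each `CONFIG` symbol. It is a deliberately simplified stand-in and not the kernel's `merge_config.sh`: `merge_fragments` is a hypothetical helper, it ignores `# CONFIG_... is not set` lines, and it does not resolve `Kconfig` dependencies.

```bash
# Hypothetical helper: merge fragment files given as arguments. The last
# assignment of a symbol wins; symbols are printed in first-seen order.
merge_fragments() {
    awk -F= '
        /^CONFIG_[A-Za-z0-9_]+=/ {
            if (!($1 in value)) order[++n] = $1
            value[$1] = $0
        }
        END { for (i = 1; i <= n; i++) print value[order[i]] }
    ' "$@"
}

# Usage sketch with throwaway fragments.
demo_dir="$(mktemp -d)"
printf '# base\nCONFIG_SMP=y\nCONFIG_NR_CPUS=64\n' > "${demo_dir}/base.conf"
printf 'CONFIG_NR_CPUS=255\n' > "${demo_dir}/arch.conf"
merge_fragments "${demo_dir}/base.conf" "${demo_dir}/arch.conf"
rm -rf "${demo_dir}"
```

Unlike this sketch, which silently overrides the earlier value, the real tooling reports a symbol that is given different values in different fragments.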
## How to use config files
The recommended way to set up a kernel tree, populate it with a relevant `.config` file,
and build a kernel, is to use the [`build-kernel.sh`](../build-kernel.sh) script. For
example:
```bash
$ ./build-kernel.sh setup
```
The `build-kernel.sh` script understands both full and fragment based config files.
Run `./build-kernel.sh -h` for more information.
## How to modify config files
Complete config files can be modified either with an editor, or preferably
using the kernel `Kconfig` configuration tools, for example:
```
$ cp x86_64_kata_kvm_4.14.x linux-4.14.22/.config
$ pushd linux-4.14.22
$ make menuconfig
$ popd
$ cp linux-4.14.22/.config x86_64_kata_kvm_4.14.x
```
Kernel fragments are best constructed using an editor. Tools such as `grep` and
`diff` can help find the differences between two config files to be placed
into a fragment.
If adding config entries for a new subsystem or feature, consider making a new
fragment with an appropriately descriptive name.
If you want to disable an entire fragment for a specific architecture, add the tag
`# !${arch}` on the first line of the fragment. You can exclude multiple architectures
on the same line. The leading `#` is required so that the tag is not interpreted as a
configuration entry.
Example of valid exclusion:
```
# !s390x !ppc64le
```
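The effect of the exclusion tag can be sketched with `grep -L`, which prints the names of files that contain no match. This is a minimal sketch, not the script's exact code: `fragments_for_arch` is a hypothetical helper, and it does a plain substring search for `!<arch>` anywhere in the file.

```bash
# Hypothetical helper: list the fragments in a directory that are not
# excluded for the given architecture.
fragments_for_arch() {
    local dir="$1"
    local arch="$2"
    # -L lists files without a match; -F treats the pattern as a fixed string.
    grep -L -F -e '!'"${arch}" "${dir}"/*.conf
}

# Usage sketch with throwaway fragments.
demo_dir="$(mktemp -d)"
printf '# !s390x !ppc64le\nCONFIG_ACPI=y\n' > "${demo_dir}/acpi.conf"
printf 'CONFIG_SMP=y\n' > "${demo_dir}/base.conf"
fragments_for_arch "${demo_dir}" s390x   # prints the path of base.conf only
rm -rf "${demo_dir}"
```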
The fragment gathering tool performs some basic sanity checks, and `build-kernel.sh` will
fail and report the error in the following cases:
- A duplicate `CONFIG` symbol appears.
- A `CONFIG` symbol is in a fragment but does not appear in the final `.config`.
  This indicates that the `CONFIG` variable is not part of the kernel `Kconfig` setup,
  which can point to a typing mistake in the name of the symbol.
- A `CONFIG` symbol appears in the fragments with multiple different values.
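The check for symbols that did not reach the final `.config` can be sketched as a filter over the merge output. This is an illustration under stated assumptions: `unexpected_missing` is a hypothetical helper, the whitelist is a file of patterns to ignore, and the sample input lines below are invented for the example.

```bash
# Hypothetical helper: read merge output on stdin and print the
# "not in final" lines that are not covered by the whitelist file.
unexpected_missing() {
    local whitelist="$1"
    # `|| true` keeps the helper successful when nothing is left to report.
    grep "not in final" | grep -v -f "${whitelist}" || true
}

# Usage sketch with an invented whitelist and invented merge output.
whitelist="$(mktemp)"
echo "CONFIG_FOO" > "${whitelist}"
printf '%s\n' \
    "Value requested for CONFIG_FOO not in final .config" \
    "Value requested for CONFIG_BAR not in final .config" |
    unexpected_missing "${whitelist}"
rm -f "${whitelist}"
```

Any line that survives the filter names a symbol worth checking for a typing mistake or an unmet `Kconfig` dependency.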

File diff suppressed because it is too large Load Diff



@@ -0,0 +1,5 @@
# ACPI on arm64 is dependent on uEFI.
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# ARM64 can run properly in ACPI hardware reduced mode.
CONFIG_ACPI_REDUCED_HARDWARE_ONLY=y


@@ -0,0 +1,42 @@
CONFIG_ARM64=y
CONFIG_ARM64_4K_PAGES=y
# ARM servers often have many cores; the following configs improve
# the CPU scheduler's decision making.
CONFIG_SCHED_MC=y
CONFIG_SCHED_SMT=y
# Virtual address space size (48-bit)
CONFIG_ARM64_VA_BITS_48=y
CONFIG_ARM64_VA_BITS=48
# Physical address space size (48-bit)
CONFIG_ARM64_PA_BITS_48=y
CONFIG_ARM64_PA_BITS=48
# Use the maximum number of CPUs supported by KVM (255)
CONFIG_NR_CPUS=255
CONFIG_PERF_EVENTS=y
# No architected NMI
CONFIG_ARM64_PSEUDO_NMI=y
CONFIG_ARM64_SVE=y
# Arm64 prefers to use REFCOUNT_FULL by default.
CONFIG_REFCOUNT_FULL=y
#
# ARMv8.1 architectural features
#
CONFIG_ARM64_HW_AFDBM=y
CONFIG_ARM64_PAN=y
# end of ARMv8.1 architectural features
#
# ARMv8.2 architectural features
#
CONFIG_ARM64_CNP=y
CONFIG_ARM64_PMEM=y
CONFIG_ARM64_RAS_EXTN=y
CONFIG_ARM64_UAO=y
# end of ARMv8.2 architectural features


@@ -0,0 +1,6 @@
# ARMv8 adds cryptographic instructions that could significantly improve
# performance on tasks such as AES encryption and SHA1 and SHA256 hashing.
CONFIG_ARM64_CRYPTO=y
CONFIG_CRYPTO_AES_ARM64=y
CONFIG_CRYPTO_AES_ARM64_CE=y
CONFIG_CRYPTO_SHA256_ARM64=y


@@ -0,0 +1,4 @@
# Device Tree and Open Firmware support
CONFIG_DTC=y
CONFIG_OF=y
CONFIG_OF_PMEM=y


@@ -0,0 +1,15 @@
# ARM errata workarounds via the alternatives framework.
# Vendor-specific options are left to users to decide.
CONFIG_ARM64_ERRATUM_1024718=y
CONFIG_ARM64_ERRATUM_1165522=y
CONFIG_ARM64_ERRATUM_1286807=y
CONFIG_ARM64_ERRATUM_1463225=y
CONFIG_ARM64_ERRATUM_819472=y
CONFIG_ARM64_ERRATUM_824069=y
CONFIG_ARM64_ERRATUM_826319=y
CONFIG_ARM64_ERRATUM_827319=y
CONFIG_ARM64_ERRATUM_832075=y
CONFIG_ARM64_ERRATUM_843419=y
CONFIG_ARM64_WORKAROUND_CLEAN_CACHE=y
CONFIG_ARM64_WORKAROUND_REPEAT_TLBI=y


@@ -0,0 +1,3 @@
# It brings PCI support to mach-virt based upon an idealised host controller.
CONFIG_PCI_HOST_COMMON=y
CONFIG_PCI_HOST_GENERIC=y


@@ -0,0 +1,7 @@
# PTP clock support
#
# The implementation of ptp_kvm on arm is an experimental feature;
# you need to apply private patches to enable it on your host machine.
# See https://github.com/kata-containers/packaging/pull/998 for detailed info.
CONFIG_PTP_1588_CLOCK=y
CONFIG_PTP_1588_CLOCK_KVM=y


@@ -0,0 +1,10 @@
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_SYSTOHC=y
# RTC interfaces
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# QEMU provides an emulated ARM AMBA PrimeCell PL031 RTC.
CONFIG_RTC_DRV_PL031=y


@@ -0,0 +1,3 @@
# This option is used for all 8250 compatible serial ports
# that are probed through device tree.
CONFIG_SERIAL_OF_PLATFORM=y


@@ -0,0 +1,17 @@
# Enable 9p(fs) support - required for Kata to mount filesystems into the workload
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
CONFIG_9P_FS=y
# NOTE - 9p client caching turned off?
# FIXME: check if that is right?
# https://github.com/kata-containers/packaging/issues/483
#CONFIG_9P_FSCACHE=y
CONFIG_NETWORK_FILESYSTEMS=y
# Q. Do we use the POSIX_ACL over 9p?
# FIXME: https://github.com/kata-containers/packaging/issues/483
CONFIG_9P_FS_POSIX_ACL=y
# NOTE - this adds security labels, such as used by SELinux - we may be able to
# disable this, for now.
# FIXME: https://github.com/kata-containers/packaging/issues/483
CONFIG_9P_FS_SECURITY=y


@@ -0,0 +1,20 @@
# enable ACPI support.
# This could do with REVIEW
# https://github.com/kata-containers/packaging/issues/483
CONFIG_ARCH_SUPPORTS_ACPI=y
CONFIG_ACPI=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_PROCESSOR_IDLE=y
# Having trouble enabling this - disable for now.
# Would add support for ACPI CPPC power control via firmware - do we need
# that for the guest??
#CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_PCI_SLOT=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_MEMORY=y
CONFIG_ACPI_NFIT=y
CONFIG_HAVE_ACPI_APEI=y


@@ -0,0 +1,52 @@
# Basic necessary items!
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_SMP=y
CONFIG_PARAVIRT=y
# Note, no nested VM support enabled here
# Turn off embedded mode, as it disabled 'too much', and we
# no longer pass all the tests. We should refine this, and
# work out which of the ~66 items it enables are really needed.
# I believe this is the actual syntax we need for a fragment to
# disable an item...
# CONFIG_EMBEDDED is not set
# Note, no virt enabled balloon yet
CONFIG_INPUT=y
CONFIG_PRINTK=y
# We use this for metrics!
CONFIG_PRINTK_TIME=y
CONFIG_UNIX98_PTYS=y
CONFIG_FUTEX=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_NO_HZ=y
CONFIG_NO_HZ_FULL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_TIMERS=y
CONFIG_PROC_SYSCTL=y
CONFIG_SHMEM=y
# For security...
CONFIG_RELOCATABLE=y
CONFIG_RANDOMIZE_BASE=y
# FIXME - check if we should be setting this
# https://github.com/kata-containers/packaging/issues/483
# I have a feeling it affects our memory hotplug maybe?
# PHYSICAL_ALIGN=0x1000000
# This would only affect two drivers, neither of which we have enabled.
# The recommendation is to have it on, and you will see it in a diff if you
# look for differences against the frag generated config - so, add it here as
# a comment to make it clear in the future why we have not set it - as it would
# only add noise to our frags and config.
# PREVENT_FIRMWARE_BUILD=y
# Trust the hardware vendor to initialise the RNG - which can speed up boot.
# This can still be dynamically disabled on the kernel command line/kata config if needed.
# Disable for now, as it upsets the entropy test, and we need to improve those: FIXME: see:
# https://github.com/kata-containers/tests/issues/1543
# RANDOM_TRUST_CPU=y


@@ -0,0 +1,26 @@
# Add cgroup support. Needed both for the agent to place the workload into, and
# also used/looked for by systemd rootfs.
CONFIG_CGROUPS=y
CONFIG_MEMCG=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
CONFIG_SOCK_CGROUP_DATA=y
# We have to enable SWAP CG, as runc/libcontainer in the agent currently fails
# to write to it, even though it does some checks to see if swap is enabled.
CONFIG_SWAP=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_SWAP_ENABLED=y
# Needed for cgroups v2
CONFIG_BPF_SYSCALL=y
CONFIG_CGROUP_BPF=y


@@ -0,0 +1,7 @@
# Items to do with CPU frequency, power etc.
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_MENU=y


@@ -0,0 +1,17 @@
# Need decompressors for root filesystems and kernels.
# Do we need all of these?
CONFIG_CRYPTO=y
# Deflate used by IPSec and IPCOMP protocols
# Also selects ZLIB and a couple of other algos
CONFIG_CRYPTO_DEFLATE=y
CONFIG_XZ_DEC=y
CONFIG_ZLIB_DEFLATE=y
# FIXME - check, do we need gzip?
# https://github.com/kata-containers/packaging/issues/483
CONFIG_DECOMPRESS_GZIP=y
# Some items required by systemd: https://github.com/systemd/systemd/blob/master/README
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_SHA256=y
CONFIG_CRYPTO_FIPS=y
CONFIG_CRYPTO_ANSI_CPRNG=y

View File

@@ -0,0 +1,32 @@
# Enable DAX and NVDIMM support so we can map in our rootfs
# Need HOTREMOVE, or ZONE_DEVICE will not get enabled
# We don't actually afaik remove any memory once we have plugged it in, as
# generally it is too 'expensive' an operation.
CONFIG_MEMORY_HOTREMOVE=y
# Also need this
CONFIG_SPARSEMEM_VMEMMAP=y
# Without these the pmem_should_map_pages() call in the kernel fails with new
# Related to the ARCH_HAS_HMM set in the arch files.
CONFIG_ZONE_DEVICE=y
CONFIG_DEV_PAGEMAP_OPS=y
CONFIG_ND_PFN=y
CONFIG_NVDIMM_PFN=y
CONFIG_NVDIMM_DAX=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_PMEM=y
CONFIG_BLK_DEV_RAM=y
CONFIG_LIBNVDIMM=y
CONFIG_ND_BLK=y
CONFIG_BTT=y
# FIXME: Should check if this is really needed
# https://github.com/kata-containers/packaging/issues/483
CONFIG_NVMEM=y
# Is auto selected by other options
#CONFIG_DAX_DRIVER=y
CONFIG_DAX=y
CONFIG_FS_DAX=y

View File

@@ -0,0 +1,5 @@
# Enable ELF loading, and script loading
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_SCRIPT=y
CONFIG_BINFMT_MISC=y

View File

@@ -0,0 +1,3 @@
# virtio-fs support
CONFIG_VIRTIO_FS=y
CONFIG_FUSE_FS=y

View File

@@ -0,0 +1,51 @@
# Enable a whole bunch of filesystem related items
CONFIG_BLK_DEV_INITRD=y
# Recommended for Docker
CONFIG_BLK_DEV_THROTTLING=y
# Required for hotplug block devices into Kata, using SCSI
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_SD=y
# support initial ramdisk
CONFIG_RD_GZIP=y
CONFIG_FS_IOMAP=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# FIXME - do we need journalling support in the container?
# https://github.com/kata-containers/packaging/issues/483
CONFIG_JBD2=y
CONFIG_FS_MBCACHE=y
CONFIG_XFS_FS=y
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_EXPORTFS_BLOCK_OPS=y
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
# A bunch of these are required for systemd at least.
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_AUTOFS4_FS=y
CONFIG_AUTOFS_FS=y
CONFIG_TMPFS=y
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EPOLL=y
CONFIG_FHANDLE=y
# We should support Async IO.
CONFIG_AIO=y
# Docker in Docker support requires overlay
CONFIG_OVERLAY_FS=y
CONFIG_OVERLAY_FS_INDEX=y
CONFIG_OVERLAY_FS_REDIRECT_DIR=y

View File

@@ -0,0 +1,13 @@
# Setups to support our hotplug - memory, PCI devices and cpus
CONFIG_MEMORY_HOTPLUG=y
CONFIG_HOTPLUG_CPU=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_ACPI=y
CONFIG_PNPACPI=y
# Define hotplugs to be online immediately. Speeds things up, and makes things
# work more smoothly on some architectures.
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y

View File

@@ -0,0 +1,12 @@
# Items to enable large/huge mmu pages and tlbs etc.
# Compaction is the only memory management component to form high order
# (larger physically contiguous) memory blocks reliably. The lack of the
# feature can lead to unexpected OOM killer invocations for high order memory requests.
CONFIG_COMPACTION=y
CONFIG_HUGETLBFS=y
# Enable memory page physical migration here, as it can come
# into play when trying to find space to allocate a hugepage.
CONFIG_MIGRATION=y

View File

@@ -0,0 +1,3 @@
# mmio devices are required for firecracker
CONFIG_VIRTIO_MMIO=y
CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y

View File

@@ -0,0 +1,5 @@
# MMU specific items
# vmap the kernel stacks - detects stack over-runs better and reduces
# the stack attack window.
CONFIG_VMAP_STACK=y

View File

@@ -0,0 +1,11 @@
# We need namespaces to isolate the workload
# Cannot have namespaces if not multi user...
CONFIG_MULTIUSER=y
CONFIG_NAMESPACES=y
CONFIG_SYSVIPC=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y

View File

@@ -0,0 +1,203 @@
# Netfilter (used by sidecars like istio)
# FIXME - this is a big file - it could probably benefit from a
# good reviewing. https://github.com/kata-containers/packaging/issues/483
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y
CONFIG_NETFILTER_INGRESS=y
CONFIG_NETFILTER_NETLINK=y
CONFIG_NETFILTER_FAMILY_ARP=y
CONFIG_NETFILTER_NETLINK_ACCT=y
CONFIG_NETFILTER_NETLINK_QUEUE=y
CONFIG_NETFILTER_NETLINK_LOG=y
CONFIG_NETFILTER_NETLINK_OSF=y
CONFIG_NF_CONNTRACK=y
CONFIG_NF_LOG_COMMON=y
CONFIG_NETFILTER_CONNCOUNT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_ZONES=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CONNTRACK_TIMEOUT=y
CONFIG_NF_CONNTRACK_TIMESTAMP=y
CONFIG_NF_CONNTRACK_LABELS=y
CONFIG_NF_CT_PROTO_DCCP=y
CONFIG_NF_CT_PROTO_GRE=y
CONFIG_NF_CT_PROTO_SCTP=y
CONFIG_NF_CT_PROTO_UDPLITE=y
CONFIG_NF_CONNTRACK_AMANDA=y
CONFIG_NF_CONNTRACK_FTP=y
CONFIG_NF_CONNTRACK_H323=y
CONFIG_NF_CONNTRACK_IRC=y
CONFIG_NF_CONNTRACK_BROADCAST=y
CONFIG_NF_CONNTRACK_NETBIOS_NS=y
CONFIG_NF_CONNTRACK_SNMP=y
CONFIG_NF_CONNTRACK_PPTP=y
CONFIG_NF_CONNTRACK_SANE=y
CONFIG_NF_CONNTRACK_SIP=y
CONFIG_NF_CONNTRACK_TFTP=y
CONFIG_NF_CT_NETLINK=y
CONFIG_NF_CT_NETLINK_TIMEOUT=y
CONFIG_NF_CT_NETLINK_HELPER=y
CONFIG_NETFILTER_NETLINK_GLUE_CT=y
CONFIG_NF_NAT=y
# NF_NAT_NEEDED is removed in newer kernels - we should drop it once we move to the next LTS (5.4).
# This is part of whitelist.conf
CONFIG_NF_NAT_NEEDED=y
# NF_NAT_PROTO_* are removed in newer kernels, but needed currently. They are part of whitelist.conf:
CONFIG_NF_NAT_PROTO_DCCP=y
CONFIG_NF_NAT_PROTO_UDPLITE=y
CONFIG_NF_NAT_PROTO_SCTP=y
CONFIG_NF_NAT_PROTO_GRE=y
CONFIG_NF_NAT_AMANDA=y
CONFIG_NF_NAT_FTP=y
CONFIG_NF_NAT_IRC=y
CONFIG_NF_NAT_SIP=y
CONFIG_NF_NAT_TFTP=y
CONFIG_NF_NAT_REDIRECT=y
CONFIG_NETFILTER_SYNPROXY=y
CONFIG_NETFILTER_XTABLES=y
CONFIG_NETFILTER_XT_MARK=y
CONFIG_NETFILTER_XT_CONNMARK=y
CONFIG_NETFILTER_XT_SET=y
CONFIG_NETFILTER_XT_TARGET_CHECKSUM=y
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=y
CONFIG_NETFILTER_XT_TARGET_CONNMARK=y
CONFIG_NETFILTER_XT_TARGET_CT=y
CONFIG_NETFILTER_XT_TARGET_DSCP=y
CONFIG_NETFILTER_XT_TARGET_HL=y
CONFIG_NETFILTER_XT_TARGET_HMARK=y
CONFIG_NETFILTER_XT_TARGET_IDLETIMER=y
CONFIG_NETFILTER_XT_TARGET_LOG=y
CONFIG_NETFILTER_XT_TARGET_MARK=y
CONFIG_NETFILTER_XT_NAT=y
CONFIG_NETFILTER_XT_TARGET_NETMAP=y
CONFIG_NETFILTER_XT_TARGET_NFLOG=y
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=y
CONFIG_NETFILTER_XT_TARGET_RATEEST=y
CONFIG_NETFILTER_XT_TARGET_REDIRECT=y
CONFIG_NETFILTER_XT_TARGET_TEE=y
CONFIG_NETFILTER_XT_TARGET_TPROXY=y
CONFIG_NETFILTER_XT_TARGET_TRACE=y
CONFIG_NETFILTER_XT_TARGET_TCPMSS=y
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=y
CONFIG_NETFILTER_XT_MATCH_ADDRTYPE=y
CONFIG_NETFILTER_XT_MATCH_BPF=y
CONFIG_NETFILTER_XT_MATCH_CGROUP=y
CONFIG_NETFILTER_XT_MATCH_CLUSTER=y
CONFIG_NETFILTER_XT_MATCH_COMMENT=y
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=y
CONFIG_NETFILTER_XT_MATCH_CONNLABEL=y
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=y
CONFIG_NETFILTER_XT_MATCH_CONNMARK=y
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=y
CONFIG_NETFILTER_XT_MATCH_CPU=y
CONFIG_NETFILTER_XT_MATCH_DCCP=y
CONFIG_NETFILTER_XT_MATCH_DEVGROUP=y
CONFIG_NETFILTER_XT_MATCH_DSCP=y
CONFIG_NETFILTER_XT_MATCH_ECN=y
CONFIG_NETFILTER_XT_MATCH_ESP=y
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=y
CONFIG_NETFILTER_XT_MATCH_HELPER=y
CONFIG_NETFILTER_XT_MATCH_HL=y
CONFIG_NETFILTER_XT_MATCH_IPCOMP=y
CONFIG_NETFILTER_XT_MATCH_IPRANGE=y
CONFIG_NETFILTER_XT_MATCH_IPVS=y
CONFIG_NETFILTER_XT_MATCH_L2TP=y
CONFIG_NETFILTER_XT_MATCH_LENGTH=y
CONFIG_NETFILTER_XT_MATCH_LIMIT=y
CONFIG_NETFILTER_XT_MATCH_MAC=y
CONFIG_NETFILTER_XT_MATCH_MARK=y
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=y
CONFIG_NETFILTER_XT_MATCH_NFACCT=y
CONFIG_NETFILTER_XT_MATCH_OSF=y
CONFIG_NETFILTER_XT_MATCH_OWNER=y
CONFIG_NETFILTER_XT_MATCH_POLICY=y
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=y
CONFIG_NETFILTER_XT_MATCH_QUOTA=y
CONFIG_NETFILTER_XT_MATCH_RATEEST=y
CONFIG_NETFILTER_XT_MATCH_REALM=y
CONFIG_NETFILTER_XT_MATCH_RECENT=y
CONFIG_NETFILTER_XT_MATCH_SCTP=y
CONFIG_NETFILTER_XT_MATCH_STATE=y
CONFIG_NETFILTER_XT_MATCH_STATISTIC=y
CONFIG_NETFILTER_XT_MATCH_STRING=y
CONFIG_NETFILTER_XT_MATCH_TCPMSS=y
CONFIG_NETFILTER_XT_MATCH_TIME=y
CONFIG_NETFILTER_XT_MATCH_U32=y
CONFIG_IP_SET=y
CONFIG_IP_SET_BITMAP_IP=y
CONFIG_IP_SET_BITMAP_IPMAC=y
CONFIG_IP_SET_BITMAP_PORT=y
CONFIG_IP_SET_HASH_IP=y
CONFIG_IP_SET_HASH_IPMARK=y
CONFIG_IP_SET_HASH_IPPORT=y
CONFIG_IP_SET_HASH_IPPORTIP=y
CONFIG_IP_SET_HASH_IPPORTNET=y
CONFIG_IP_SET_HASH_MAC=y
CONFIG_IP_SET_HASH_NETPORTNET=y
CONFIG_IP_SET_HASH_NET=y
CONFIG_IP_SET_HASH_NETNET=y
CONFIG_IP_SET_HASH_NETPORT=y
CONFIG_IP_SET_HASH_NETIFACE=y
CONFIG_IP_SET_LIST_SET=y
CONFIG_IP_VS=y
CONFIG_IP_VS_PROTO_TCP=y
CONFIG_IP_VS_PROTO_UDP=y
CONFIG_IP_VS_PROTO_AH_ESP=y
CONFIG_IP_VS_PROTO_ESP=y
CONFIG_IP_VS_PROTO_AH=y
CONFIG_IP_VS_PROTO_SCTP=y
CONFIG_IP_VS_RR=y
CONFIG_IP_VS_WRR=y
CONFIG_IP_VS_LC=y
CONFIG_IP_VS_WLC=y
CONFIG_IP_VS_FO=y
CONFIG_IP_VS_OVF=y
CONFIG_IP_VS_LBLC=y
CONFIG_IP_VS_LBLCR=y
CONFIG_IP_VS_DH=y
CONFIG_IP_VS_SH=y
CONFIG_IP_VS_SED=y
CONFIG_IP_VS_NQ=y
CONFIG_IP_VS_FTP=y
CONFIG_IP_VS_NFCT=y
CONFIG_IP_VS_PE_SIP=y
CONFIG_NF_DEFRAG_IPV4=y
CONFIG_NF_TPROXY_IPV4=y
CONFIG_NF_DUP_IPV4=y
CONFIG_NF_LOG_IPV4=y
CONFIG_NF_REJECT_IPV4=y
# NF_NAT_IPV4 is removed in newer kernels, and is part of whitelist.conf:
CONFIG_NF_NAT_IPV4=y
CONFIG_NF_NAT_SNMP_BASIC=y
CONFIG_NF_NAT_PPTP=y
CONFIG_NF_NAT_H323=y
CONFIG_IP_NF_IPTABLES=y
CONFIG_IP_NF_MATCH_AH=y
CONFIG_IP_NF_MATCH_ECN=y
CONFIG_IP_NF_MATCH_RPFILTER=y
CONFIG_IP_NF_MATCH_TTL=y
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
CONFIG_IP_NF_TARGET_SYNPROXY=y
CONFIG_IP_NF_NAT=y
CONFIG_IP_NF_TARGET_MASQUERADE=y
CONFIG_IP_NF_TARGET_NETMAP=y
CONFIG_IP_NF_TARGET_REDIRECT=y
CONFIG_IP_NF_MANGLE=y
CONFIG_IP_NF_TARGET_CLUSTERIP=y
CONFIG_IP_NF_TARGET_ECN=y
CONFIG_IP_NF_TARGET_TTL=y
CONFIG_IP_NF_RAW=y
CONFIG_IP_NF_SECURITY=y
CONFIG_IP_NF_ARPTABLES=y
CONFIG_IP_NF_ARPFILTER=y
CONFIG_IP_NF_ARP_MANGLE=y
CONFIG_NF_DUP_IPV6=y
CONFIG_NF_LOG_IPV6=y
CONFIG_NF_DEFRAG_IPV6=y

View File

@@ -0,0 +1,75 @@
# Our networking requirements
### FIXME - this probably needs a good review ###
# https://github.com/kata-containers/packaging/issues/483
# pre-reqs
CONFIG_NETDEVICES=y
CONFIG_PROC_FS=y
CONFIG_SYSFS=y
CONFIG_SECURITY=y
# The list
CONFIG_NET=y
CONFIG_ETHERNET=y
CONFIG_NET_CORE=y
CONFIG_NET_INGRESS=y
CONFIG_PACKET=y
CONFIG_PACKET_DIAG=y
CONFIG_UNIX=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
# Used for mobile ipv6 type instances, unlikely we need
#CONFIG_XFRM_MIGRATE=y
# Developer feature - unlikely we need it
#CONFIG_XFRM_STATISTICS=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ROUTE_CLASSID=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
CONFIG_SYN_COOKIES=y
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BBR=y
CONFIG_DEFAULT_BBR=y
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_STP=y
CONFIG_BRIDGE=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_HAVE_NET_DSA=y
CONFIG_LLC=y
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_CBQ=y
CONFIG_NET_SCH_MULTIQ=y
CONFIG_NET_SCH_FQ_CODEL=y
CONFIG_NET_SCH_FQ=y
CONFIG_NET_CLS=y
CONFIG_NET_CLS_CGROUP=y
CONFIG_NET_EMATCH=y
CONFIG_NET_SCH_FIFO=y
CONFIG_VSOCKETS=y
CONFIG_VIRTIO_VSOCKETS=y
CONFIG_VIRTIO_VSOCKETS_COMMON=y
CONFIG_NET_SWITCHDEV=y
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
CONFIG_CGROUP_NET_PRIO=y
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_NET_FLOW_LIMIT=y
CONFIG_GRO_CELLS=y
CONFIG_FAILOVER=y
CONFIG_HAVE_EBPF_JIT=y
# We very likely need some Intel chip support
CONFIG_NET_VENDOR_INTEL=y
# Add VETH support (necessary for running Docker in the guest)
CONFIG_VETH=y
# We quite likely need to add others for passthrough and maybe SRIOV support

View File

@@ -0,0 +1,4 @@
# enable seccomp items
CONFIG_SECCOMP=y
CONFIG_SECCOMP_FILTER=y

View File

@@ -0,0 +1,6 @@
# Let's enable stack protection checks, and strong checks
# Estimated cost (detailed in the kernel config files)
# is maybe 2.3% for both
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y

View File

@@ -0,0 +1,14 @@
# We need some sort of 'serial' for virtio-serial consoles - at the moment.
# We might not need all of these though...
# FIXME - https://github.com/kata-containers/packaging/issues/483
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_EARLYCON=y
# SERIO may be only for keyboards, mice etc., and not UARTS
# We likely don't need
#CONFIG_SERIO_RAW=y
#CONFIG_SERIO=y

View File

@@ -0,0 +1,29 @@
# We need virtio for 9p and serial and vsock at least
# To get VIRTIO, we need a bus - ours of choice is PCI. We need to enable
# PCI support to get VIRTIO_PCI support
CONFIG_PCI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
# To get to the VIRTIO_PCI, we need the VIRTIO_MENU enabled
CONFIG_VIRTIO_MENU=y
CONFIG_VIRTIO_PCI=y
# Without this nested-VM Kata does not work (we have not worked out exactly why)
CONFIG_VIRTIO_PCI_LEGACY=y
# This is used by the s390 arch at least. Leave it on globally.
CONFIG_HW_RANDOM=y
CONFIG_HW_RANDOM_VIRTIO=y
# This is required for booting from pmem
CONFIG_VIRTIO_PMEM=y
# FIXME - are we moving away from/choosing between SCSI and BLK support?
# https://github.com/kata-containers/packaging/issues/483
CONFIG_SCSI=y
CONFIG_SCSI_LOWLEVEL=y
CONFIG_SCSI_VIRTIO=y
CONFIG_VIRTIO_BLK=y
CONFIG_TTY=y
CONFIG_VIRTIO_CONSOLE=y
CONFIG_VIRTIO_NET=y

View File

@@ -0,0 +1,7 @@
# The following i915 kernel config options need to be enabled
CONFIG_DRM=y
CONFIG_DRM_I915=y
CONFIG_DRM_I915_USERPTR=y
# Linux kernel version suffix
CONFIG_LOCALVERSION="-intel-gpu"

View File

@@ -0,0 +1,14 @@
# Support mmconfig PCI config space access.
# It's used to enable the MMIO access method for PCIe devices.
CONFIG_PCI_MMCONFIG=y
# Support for loading modules.
# It is used to support loading GPU drivers.
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CRYPTO_FIPS requires this config when loading modules is enabled.
CONFIG_MODULE_SIG=y
# Linux kernel version suffix
CONFIG_LOCALVERSION="-nvidia-gpu"
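
Both GPU fragments assign `CONFIG_LOCALVERSION`, so the fragment merged last decides the kernel version suffix. The following is a minimal sketch of that last-one-wins behaviour, assuming merge semantics similar to the kernel's `scripts/kconfig/merge_config.sh`. The `merge_fragments` helper and the sample strings are hypothetical and are not part of `build-kernel.sh`; the sketch also ignores `# CONFIG_FOO is not set` lines.

```python
import re

# Matches "CONFIG_FOO=value" assignment lines; comments and blank lines do not match.
ASSIGNMENT = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def merge_fragments(fragments):
    """Merge fragment texts in order; a later assignment overrides an earlier one.

    Returns the merged option map and a list of (name, old, new) overrides.
    """
    merged = {}
    overrides = []
    for text in fragments:
        for line in text.splitlines():
            match = ASSIGNMENT.match(line.strip())
            if not match:
                continue  # comment, blank line, or malformed entry
            name, value = match.groups()
            if name in merged and merged[name] != value:
                overrides.append((name, merged[name], value))
            merged[name] = value
    return merged, overrides


# Hypothetical excerpts of the two GPU fragments.
intel = 'CONFIG_DRM=y\nCONFIG_LOCALVERSION="-intel-gpu"\n'
nvidia = '# Support for loading modules.\nCONFIG_MODULES=y\nCONFIG_LOCALVERSION="-nvidia-gpu"\n'

merged, overrides = merge_fragments([intel, nvidia])
print(merged["CONFIG_LOCALVERSION"])
print(overrides)
```

Reporting the overrides separately makes a clash such as the two `CONFIG_LOCALVERSION` values visible instead of silently resolving it.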

View File

@@ -0,0 +1,8 @@
# configuration options which may be dropped in newer kernels
# without generating an error in fragment merging
CONFIG_NF_NAT_IPV4
CONFIG_NF_NAT_NEEDED
CONFIG_NF_NAT_PROTO_DCCP
CONFIG_NF_NAT_PROTO_GRE
CONFIG_NF_NAT_PROTO_SCTP
CONFIG_NF_NAT_PROTO_UDPLITE
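
The whitelist above can be read as a filter on the "requested but missing from the final config" check. The sketch below illustrates that idea only: the `missing_options` helper, its arguments and the sample data are assumptions made for illustration and do not reproduce the actual logic in `build-kernel.sh`.

```python
def missing_options(requested, final_config, whitelist):
    """Return requested options absent from the final config and not whitelisted."""
    # Options that made it into the final .config (lines like CONFIG_FOO=y).
    present = set()
    for line in final_config.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            present.add(line.split("=", 1)[0])

    # Whitelist entries are bare option names; comments and blanks are skipped.
    allowed = {
        line.strip()
        for line in whitelist.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }

    return sorted(
        opt for opt in requested if opt not in present and opt not in allowed
    )


# Hypothetical inputs: a newer kernel dropped the two whitelisted NAT options.
whitelist = """# hypothetical whitelist excerpt
CONFIG_NF_NAT_IPV4
CONFIG_NF_NAT_NEEDED
"""
final_config = "CONFIG_NF_NAT=y\nCONFIG_NETFILTER=y\n"
requested = [
    "CONFIG_NF_NAT",
    "CONFIG_NF_NAT_IPV4",
    "CONFIG_NF_NAT_NEEDED",
    "CONFIG_VETH",
]

print(missing_options(requested, final_config, whitelist))  # ['CONFIG_VETH']
```

Only `CONFIG_VETH` is reported here: it is neither present in the final config nor whitelisted, so in this model it would still count as an error.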

View File

@@ -0,0 +1,14 @@
CONFIG_X86_INTEL_PSTATE=y
# For old smp systems that do not have proper acpi support.
# Firecracker needs this to support `vcpu_count`
CONFIG_X86_MPPARSE=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ACPI_LPIT=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
CONFIG_HAVE_ACPI_APEI_NMI=y

View File

@@ -0,0 +1,20 @@
CONFIG_X86=y
CONFIG_X86_CPUID=y
CONFIG_X86_MSR=y
CONFIG_X86_X2APIC=y
CONFIG_X86_VERBOSE_BOOTUP=y
# Configs around linux guest support and optimizations.
CONFIG_HYPERVISOR_GUEST=y
CONFIG_KVM_GUEST=y
# Use the maximum number of CPUs supported by KVM (240)
CONFIG_NR_CPUS=240
# For security
CONFIG_LEGACY_VSYSCALL_NONE=y
CONFIG_RETPOLINE=y
# Boot directly into the uncompressed kernel
# Reduce memory footprint
CONFIG_PVH=y

View File

@@ -0,0 +1,5 @@
# x86 specific filesystem items
# Yes, we do support unaligned word accesses
CONFIG_DCACHE_WORD_ACCESS=y

View File

@@ -0,0 +1,5 @@
# PCI SHPC hotplug is disabled for arm64, so this option is kept in the
# x86_64-specific fragments. See
# https://github.com/kata-containers/packaging/pull/498
# for detailed reasons.
CONFIG_HOTPLUG_PCI_SHPC=y

View File

@@ -0,0 +1,4 @@
# x86 specific mmu/memory related items
# Remove the kernel mapping from the user space - security improvement.
CONFIG_PAGE_TABLE_ISOLATION=y

View File

@@ -0,0 +1,7 @@
# Items needed to run the NEMU cut of QEMU
# NEMU uses an EFI bios/boot, so requires a few extra bits
CONFIG_MSDOS_PARTITION=y
CONFIG_EFI=y
CONFIG_EFI_ESRT=y
CONFIG_EFI_RUNTIME_WRAPPERS=y

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,7 @@
#
# This file contains config options which were removed or modified in kernel 4.14
# but are necessary for older kernels. If you are using an old kernel and Kata
# Containers fails to start, try adding these options.
#
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y

View File

@@ -0,0 +1 @@
80

View File

@@ -0,0 +1,457 @@
From bee1ae5587a7427dbb9e9e313f6d0a43a9e0ec2e Mon Sep 17 00:00:00 2001
From: Jianyong Wu <jianyong.wu@arm.com>
Date: Mon, 30 Sep 2019 09:26:22 +0800
Subject: [PATCH] 4.19: enable ptp_kvm for arm64 in kata
---
drivers/clocksource/arm_arch_timer.c | 25 ++++++
drivers/ptp/Kconfig | 2 +-
drivers/ptp/Makefile | 1 +
drivers/ptp/ptp_kvm_arm64.c | 59 ++++++++++++++
drivers/ptp/{ptp_kvm.c => ptp_kvm_common.c} | 89 +++++----------------
drivers/ptp/ptp_kvm_x86.c | 87 ++++++++++++++++++++
include/asm-generic/ptp_kvm.h | 12 +++
include/linux/arm-smccc.h | 5 ++
virt/kvm/arm/psci.c | 12 +++
9 files changed, 221 insertions(+), 71 deletions(-)
create mode 100644 drivers/ptp/ptp_kvm_arm64.c
rename drivers/ptp/{ptp_kvm.c => ptp_kvm_common.c} (56%)
create mode 100644 drivers/ptp/ptp_kvm_x86.c
create mode 100644 include/asm-generic/ptp_kvm.h
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index d8c7f5750cdb..84ba8f9e57be 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -1571,3 +1571,28 @@ static int __init arch_timer_acpi_init(struct acpi_table_header *table)
}
TIMER_ACPI_DECLARE(arch_timer, ACPI_SIG_GTDT, arch_timer_acpi_init);
#endif
+
+#if IS_ENABLED(CONFIG_PTP_1588_CLOCK_KVM)
+#include <linux/arm-smccc.h>
+int kvm_arch_ptp_get_clock_fn(long *cycle, struct timespec64 *ts,
+ struct clocksource **cs)
+{
+ struct arm_smccc_res hvc_res;
+ ktime_t ktime_overall;
+ struct arm_smccc_quirk hvc_quirk;
+
+ __arm_smccc_hvc(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID, 0, 0, 0, 0, 0, 0, 0, &hvc_res, &hvc_quirk);
+
+ if ((long)(hvc_res.a0) < 0)
+ return -EOPNOTSUPP;
+
+ ts->tv_sec = hvc_res.a0;
+ ts->tv_nsec = hvc_res.a1;
+ *cycle = hvc_res.a2 << 32 | hvc_res.a3;
+ *cs = &clocksource_counter;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_ptp_get_clock_fn);
+#endif
+
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index d137c480db46..318b3f5df1ea 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -109,7 +109,7 @@ config PTP_1588_CLOCK_PCH
config PTP_1588_CLOCK_KVM
tristate "KVM virtual PTP clock"
depends on PTP_1588_CLOCK
- depends on KVM_GUEST && X86
+ depends on KVM_GUEST && X86 || ARM64
default y
help
This driver adds support for using kvm infrastructure as a PTP
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 19efa9cfa950..1bf4940a88a6 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -4,6 +4,7 @@
#
ptp-y := ptp_clock.o ptp_chardev.o ptp_sysfs.o
+ptp_kvm-y := ptp_kvm_common.o ptp_kvm_$(ARCH).o
obj-$(CONFIG_PTP_1588_CLOCK) += ptp.o
obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o
obj-$(CONFIG_PTP_1588_CLOCK_IXP46X) += ptp_ixp46x.o
diff --git a/drivers/ptp/ptp_kvm_arm64.c b/drivers/ptp/ptp_kvm_arm64.c
new file mode 100644
index 000000000000..fcd83324c7e1
--- /dev/null
+++ b/drivers/ptp/ptp_kvm_arm64.c
@@ -0,0 +1,59 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Virtual PTP 1588 clock for use with KVM guests
+ * Copyright (C) 2019 ARM Ltd.
+ * All Rights Reserved
+ */
+
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <asm/hypervisor.h>
+#include <linux/module.h>
+#include <linux/psci.h>
+#include <linux/arm-smccc.h>
+#include <linux/timecounter.h>
+#include <linux/sched/clock.h>
+#include <asm/arch_timer.h>
+
+
+void arm_smccc_1_1_invoke(u32 id, struct arm_smccc_res *res)
+{
+ struct arm_smccc_quirk hvc_quirk;
+
+ __arm_smccc_hvc(id, 0, 0, 0, 0, 0, 0, 0, res, &hvc_quirk);
+}
+
+int kvm_arch_ptp_init(void)
+{
+ struct arm_smccc_res hvc_res;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID,
+ &hvc_res);
+ if ((long)(hvc_res.a0) < 0)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock_generic(struct timespec64 *ts,
+ struct arm_smccc_res *hvc_res)
+{
+ arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID,
+ hvc_res);
+ if ((long)(hvc_res->a0) < 0)
+ return -EOPNOTSUPP;
+
+ ts->tv_sec = hvc_res->a0;
+ ts->tv_nsec = hvc_res->a1;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock(struct timespec64 *ts)
+{
+ struct arm_smccc_res hvc_res;
+
+ kvm_arch_ptp_get_clock_generic(ts, &hvc_res);
+
+ return 0;
+}
diff --git a/drivers/ptp/ptp_kvm.c b/drivers/ptp/ptp_kvm_common.c
similarity index 56%
rename from drivers/ptp/ptp_kvm.c
rename to drivers/ptp/ptp_kvm_common.c
index c67dd11e08b1..c0b445fa6144 100644
--- a/drivers/ptp/ptp_kvm.c
+++ b/drivers/ptp/ptp_kvm_common.c
@@ -1,29 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
/*
* Virtual PTP 1588 clock for use with KVM guests
*
* Copyright (C) 2017 Red Hat Inc.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
- * GNU General Public License for more details.
- *
*/
#include <linux/device.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/module.h>
#include <uapi/linux/kvm_para.h>
#include <asm/kvm_para.h>
-#include <asm/pvclock.h>
-#include <asm/kvmclock.h>
#include <uapi/asm/kvm_para.h>
+#include <asm-generic/ptp_kvm.h>
#include <linux/ptp_clock_kernel.h>
@@ -34,56 +24,29 @@ struct kvm_ptp_clock {
DEFINE_SPINLOCK(kvm_ptp_lock);
-static struct pvclock_vsyscall_time_info *hv_clock;
-
-static struct kvm_clock_pairing clock_pair;
-static phys_addr_t clock_pair_gpa;
-
static int ptp_kvm_get_time_fn(ktime_t *device_time,
struct system_counterval_t *system_counter,
void *ctx)
{
- unsigned long ret;
+ unsigned long ret, cycle;
struct timespec64 tspec;
- unsigned version;
- int cpu;
- struct pvclock_vcpu_time_info *src;
+ struct clocksource *cs;
spin_lock(&kvm_ptp_lock);
preempt_disable_notrace();
- cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
-
- do {
- /*
- * We are using a TSC value read in the hosts
- * kvm_hc_clock_pairing handling.
- * So any changes to tsc_to_system_mul
- * and tsc_shift or any other pvclock
- * data invalidate that measurement.
- */
- version = pvclock_read_begin(src);
-
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
- clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
- if (ret != 0) {
- pr_err_ratelimited("clock pairing hypercall ret %lu\n", ret);
- spin_unlock(&kvm_ptp_lock);
- preempt_enable_notrace();
- return -EOPNOTSUPP;
- }
-
- tspec.tv_sec = clock_pair.sec;
- tspec.tv_nsec = clock_pair.nsec;
- ret = __pvclock_read_cycles(src, clock_pair.tsc);
- } while (pvclock_read_retry(src, version));
+ ret = kvm_arch_ptp_get_clock_fn(&cycle, &tspec, &cs);
+ if (ret != 0) {
+ pr_err_ratelimited("clock pairing hypercall ret %lu\n", ret);
+ spin_unlock(&kvm_ptp_lock);
+ preempt_enable_notrace();
+ return -EOPNOTSUPP;
+ }
preempt_enable_notrace();
- system_counter->cycles = ret;
- system_counter->cs = &kvm_clock;
+ system_counter->cycles = cycle;
+ system_counter->cs = cs;
*device_time = timespec64_to_ktime(tspec);
@@ -126,17 +89,13 @@ static int ptp_kvm_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
spin_lock(&kvm_ptp_lock);
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
- clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
+ ret = kvm_arch_ptp_get_clock(&tspec);
if (ret != 0) {
pr_err_ratelimited("clock offset hypercall ret %lu\n", ret);
spin_unlock(&kvm_ptp_lock);
return -EOPNOTSUPP;
}
- tspec.tv_sec = clock_pair.sec;
- tspec.tv_nsec = clock_pair.nsec;
spin_unlock(&kvm_ptp_lock);
memcpy(ts, &tspec, sizeof(struct timespec64));
@@ -176,21 +135,11 @@ static void __exit ptp_kvm_exit(void)
static int __init ptp_kvm_init(void)
{
- long ret;
-
- if (!kvm_para_available())
- return -ENODEV;
+ int ret;
- clock_pair_gpa = slow_virt_to_phys(&clock_pair);
- hv_clock = pvclock_get_pvti_cpu0_va();
-
- if (!hv_clock)
- return -ENODEV;
-
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
- if (ret == -KVM_ENOSYS || ret == -KVM_EOPNOTSUPP)
- return -ENODEV;
+ ret = kvm_arch_ptp_init();
+ if (ret)
+ return -EOPNOTSUPP;
kvm_ptp_clock.caps = ptp_kvm_caps;
diff --git a/drivers/ptp/ptp_kvm_x86.c b/drivers/ptp/ptp_kvm_x86.c
new file mode 100644
index 000000000000..a52cf1c2990c
--- /dev/null
+++ b/drivers/ptp/ptp_kvm_x86.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with KVM guests
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ */
+
+#include <asm/pvclock.h>
+#include <asm/kvmclock.h>
+#include <linux/module.h>
+#include <uapi/asm/kvm_para.h>
+#include <uapi/linux/kvm_para.h>
+#include <linux/ptp_clock_kernel.h>
+
+phys_addr_t clock_pair_gpa;
+struct kvm_clock_pairing clock_pair;
+struct pvclock_vsyscall_time_info *hv_clock;
+
+int kvm_arch_ptp_init(void)
+{
+ int ret;
+
+ if (!kvm_para_available())
+ return -ENODEV;
+
+ clock_pair_gpa = slow_virt_to_phys(&clock_pair);
+ hv_clock = pvclock_get_pvti_cpu0_va();
+ if (!hv_clock)
+ return -ENODEV;
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ if (ret == -KVM_ENOSYS || ret == -KVM_EOPNOTSUPP)
+ return -ENODEV;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock(struct timespec64 *ts)
+{
+ long ret;
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
+ clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ if (ret != 0)
+ return -EOPNOTSUPP;
+
+ ts->tv_sec = clock_pair.sec;
+ ts->tv_nsec = clock_pair.nsec;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock_fn(unsigned long *cycle, struct timespec64 *tspec,
+ struct clocksource **cs)
+{
+ unsigned long ret;
+ unsigned int version;
+ int cpu;
+ struct pvclock_vcpu_time_info *src;
+
+ cpu = smp_processor_id();
+ src = &hv_clock[cpu].pvti;
+
+ do {
+ /*
+ * We are using a TSC value read in the hosts
+ * kvm_hc_clock_pairing handling.
+ * So any changes to tsc_to_system_mul
+ * and tsc_shift or any other pvclock
+ * data invalidate that measurement.
+ */
+ version = pvclock_read_begin(src);
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
+ clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ tspec->tv_sec = clock_pair.sec;
+ tspec->tv_nsec = clock_pair.nsec;
+ *cycle = __pvclock_read_cycles(src, clock_pair.tsc);
+ } while (pvclock_read_retry(src, version));
+
+ *cs = &kvm_clock;
+
+ return 0;
+}
diff --git a/include/asm-generic/ptp_kvm.h b/include/asm-generic/ptp_kvm.h
new file mode 100644
index 000000000000..883eea494a80
--- /dev/null
+++ b/include/asm-generic/ptp_kvm.h
@@ -0,0 +1,12 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * linux/drivers/clocksource/arm_arch_timer.c
+ *
+ * Copyright (C) 2019 ARM Ltd.
+ * All Rights Reserved
+ */
+
+int kvm_arch_ptp_init(void);
+int kvm_arch_ptp_get_clock(struct timespec64 *ts);
+int kvm_arch_ptp_get_clock_fn(unsigned long *cycle,
+ struct timespec64 *tspec, void *cs);
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 18863d56273c..10e99c82d098 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -75,6 +75,11 @@
ARM_SMCCC_SMC_32, \
0, 1)
+#define ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ 0, 2)
+
#define ARM_SMCCC_ARCH_WORKAROUND_1 \
ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
ARM_SMCCC_SMC_32, \
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index 9b73d3ad918a..9b9999bdeab7 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -407,6 +407,9 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
u32 func_id = smccc_get_function(vcpu);
u32 val = SMCCC_RET_NOT_SUPPORTED;
u32 feature;
+ struct timespec64 ts;
+ u64 cycles, cycle_high, cycle_low;
+ struct system_time_snapshot systime_snapshot;
switch (func_id) {
case ARM_SMCCC_VERSION_FUNC_ID:
@@ -435,6 +438,15 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
break;
}
break;
+ case ARM_SMCCC_VENDOR_HYP_KVM_PTP_FUNC_ID:
+ ktime_get_real_ts64(&ts);
+ ktime_get_snapshot(&systime_snapshot);
+ cycles = systime_snapshot.cycles - vcpu_vtimer(vcpu)->cntvoff;
+ cycle_high = cycles >> 32;
+ cycle_low = cycles << 32 >> 32;
+
+ smccc_set_retval(vcpu, ts.tv_sec, ts.tv_nsec, cycle_high, cycle_low);
+ return 1;
default:
return kvm_psci_call(vcpu);
}
--
2.17.1
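
In the patch above, the host side of the hypercall splits the 64-bit counter value into two 32-bit halves (`cycle_high`, `cycle_low`), and the guest recombines them with `hvc_res.a2 << 32 | hvc_res.a3`. The following worked example models that arithmetic in Python. It is a sketch only: the function names are hypothetical, and explicit masks stand in for the fixed-width `u64` wraparound that C provides implicitly.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # models u64 truncation, which Python ints lack


def virtual_cycles(host_cycles, cntvoff):
    """Model `systime_snapshot.cycles - cntvoff` with u64 wraparound."""
    return (host_cycles - cntvoff) & MASK64


def split_cycles(cycles):
    """Model the host side: cycles >> 32 and cycles << 32 >> 32 on a u64."""
    cycles &= MASK64
    cycle_high = cycles >> 32
    cycle_low = ((cycles << 32) & MASK64) >> 32  # left shift drops the high half
    return cycle_high, cycle_low


def join_cycles(high, low):
    """Model the guest side: a2 << 32 | a3."""
    return ((high << 32) & MASK64) | low


cycles = virtual_cycles(0x0123456789ABCDEF + 0x1000, 0x1000)
high, low = split_cycles(cycles)
print(hex(high), hex(low))            # 0x1234567 0x89abcdef
print(hex(join_cycles(high, low)))    # 0x123456789abcdef
```

The round trip returns the original value, which is the property the guest relies on when it reassembles the counter from the two return registers.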

View File

@@ -0,0 +1,98 @@
From 33ffc9a93a1d9e72594d5eb3e4fc583a1a2911d1 Mon Sep 17 00:00:00 2001
From: Jianyong Wu <jianyong.wu@arm.com>
Date: Tue, 19 Feb 2019 01:15:32 -0500
Subject: [PATCH 2/5] Enable memory-hotplug using probe for arm64
---
arch/arm64/Kconfig | 7 +++++++
arch/arm64/mm/init.c | 9 ++++++++-
arch/arm64/mm/mmu.c | 17 +++++++++++++++++
arch/arm64/mm/numa.c | 10 ++++++++++
4 files changed, 42 insertions(+), 1 deletion(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1b1a0e95c751..881bea194d53 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -740,6 +740,13 @@ config NUMA
local memory of the CPU and add some more
NUMA awareness to the kernel.
+config ARCH_MEMORY_PROBE
+ def_bool y
+ depends on MEMORY_HOTPLUG
+
+config ARCH_ENABLE_MEMORY_HOTPLUG
+ def_bool y
+
config NODES_SHIFT
int "Maximum NUMA Nodes (as a power of 2)"
range 1 10
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 787e27964ab9..e66e44b7bafe 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -288,9 +288,16 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
int pfn_valid(unsigned long pfn)
{
phys_addr_t addr = pfn << PAGE_SHIFT;
-
if ((addr >> PAGE_SHIFT) != pfn)
return 0;
+
+#ifdef CONFIG_SPARSEMEM
+ if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
+ return 0;
+
+ if (!valid_section(__nr_to_section(pfn_to_section_nr(pfn))))
+ return 0;
+#endif
return memblock_is_map_memory(addr);
}
EXPORT_SYMBOL(pfn_valid);
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 8080c9f489c3..c393b37597af 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1028,3 +1028,20 @@ int pud_free_pmd_page(pud_t *pudp, unsigned long addr)
pmd_free(NULL, table);
return 1;
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+int arch_add_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap,
+ bool want_memblock)
+{
+ int flags = 0;
+
+ if (debug_pagealloc_enabled())
+ flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
+
+ __create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
+ size, PAGE_KERNEL, pgd_pgtable_alloc, flags);
+
+ return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+ altmap, want_memblock);
+}
+#endif
diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
index 146c04ceaa51..d276bd4d38b5 100644
--- a/arch/arm64/mm/numa.c
+++ b/arch/arm64/mm/numa.c
@@ -464,3 +464,13 @@ void __init arm64_numa_init(void)
numa_init(dummy_numa_init);
}
+
+/*
+ * We hope that we will be hotplugging memory on nodes we already know about,
+ * such that acpi_get_node() succeeds and we never fall back to this...
+ */
+int memory_add_physaddr_to_nid(u64 addr)
+{
+ pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr);
+ return 0;
+}
--
2.20.1
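
The context lines of `pfn_valid()` in the patch above reject a page frame number whose shifted address does not round-trip, which catches overflow of `phys_addr_t` before the new section checks run. The example below models that single check in Python. It is a sketch under stated assumptions: 4 KiB pages (`PAGE_SHIFT = 12`), a 64-bit `phys_addr_t`, and a hypothetical helper name.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pfn_round_trips(pfn, addr_bits=64):
    """Model `(pfn << PAGE_SHIFT) >> PAGE_SHIFT == pfn` for a fixed-width address."""
    mask = (1 << addr_bits) - 1
    addr = (pfn << PAGE_SHIFT) & mask  # truncation models phys_addr_t overflow
    return (addr >> PAGE_SHIFT) == pfn


print(pfn_round_trips(1 << 20))         # True: address fits in 64 bits
print(pfn_round_trips((1 << 52) - 1))   # True: largest PFN that still fits
print(pfn_round_trips(1 << 52))         # False: the shift overflows 64 bits
```

A PFN that fails this test can never name real memory, so returning early keeps the later section lookup from indexing with a meaningless value.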

View File

@@ -0,0 +1,47 @@
From cab495651e8f71c39e87a08abbe051916110b3ca Mon Sep 17 00:00:00 2001
From: Julio Montes <julio.montes@intel.com>
Date: Mon, 18 Sep 2017 11:46:59 -0500
Subject: [PATCH 3/5] NO-UPSTREAM: 9P: always use cached inode to fill in
v9fs_vfs_getattr
So that if in cache=none mode, we don't have to lookup server that
might not support open-unlink-fstat operation.
fixes https://github.com/01org/cc-oci-runtime/issues/47
fixes https://github.com/01org/cc-oci-runtime/issues/1062
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
fs/9p/vfs_inode.c | 2 +-
fs/9p/vfs_inode_dotl.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 85ff859d3af5..efdc2a8f37bb 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1080,7 +1080,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
- if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
+ if (!d_really_is_negative(dentry) || v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(d_inode(dentry), stat);
return 0;
}
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 4823e1c46999..daa5e6a41864 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -480,7 +480,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
- if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
+ if (!d_really_is_negative(dentry) || v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(d_inode(dentry), stat);
return 0;
}
--
2.20.1

@@ -0,0 +1,29 @@
From d78297bf9d8e41711bddc6003f460e815340a214 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Fri, 10 Aug 2018 13:22:08 +0000
Subject: [PATCH 4/5] Compile in evged always
We need evged for NEMU (and in general for hw reduced)
The config option cannot be set normally since it breaks all
regular systems, and hardware reduced is really a runtime choice.
---
drivers/acpi/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 6d59aa109a91..97f2fbbd5014 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -47,7 +47,7 @@ acpi-y += acpi_pnp.o
acpi-$(CONFIG_ARM_AMBA) += acpi_amba.o
acpi-y += power.o
acpi-y += event.o
-acpi-$(CONFIG_ACPI_REDUCED_HARDWARE_ONLY) += evged.o
+acpi-y += evged.o
acpi-y += sysfs.o
acpi-y += property.o
acpi-$(CONFIG_X86) += acpi_cmos_rtc.o
--
2.20.1

@@ -0,0 +1,49 @@
From 267ca21784bb307babbbb2f5a4a111da4da4c015 Mon Sep 17 00:00:00 2001
From: Sebastien Boeuf <sebastien.boeuf@intel.com>
Date: Thu, 13 Feb 2020 08:50:38 +0100
Subject: [PATCH] net: virtio_vsock: Fix race condition between bind and listen
Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and will
effectively never recover.
This issue happens when the guest side starts binding to the socket,
which inserts a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued a RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet virtio_transport_recv_pkt() will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't be changed into TCP_LISTEN yet, which leads the code to pick
the default statement. This default statement will only free the buffer,
while it should also respond to the host side, by sending a packet on
its TX queue.
In order to simply fix this unfortunate chain of events, it is important
that in case the default statement is entered, and because at this stage
we know the host side is waiting for an answer, we must send back a
packet containing the operation VIRTIO_VSOCK_OP_RST.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 2a8651aa90c8..7d83e2c80b15 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1051,6 +1051,7 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
virtio_transport_free_pkt(pkt);
break;
default:
+ (void)virtio_transport_reset_no_sock(pkt);
virtio_transport_free_pkt(pkt);
break;
}
--
2.20.1
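
The effect of the one-line fix above can be illustrated with a toy state machine in user-space C. This is only a sketch under stated assumptions, not the kernel code: `toy_state`, `toy_reply`, `toy_recv_pkt`, and `toy_self_check` are hypothetical names, and the `patched` flag stands in for the presence of the added `virtio_transport_reset_no_sock()` call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical socket states: bound but not yet listening, and listening. */
enum toy_state { TOY_CLOSE, TOY_LISTEN };

/* What the host sees on its side after sending a packet to the guest. */
enum toy_reply { TOY_REPLY_NONE, TOY_REPLY_RESPONSE, TOY_REPLY_RST };

/*
 * Toy stand-in for the receive path: dispatch on the socket state.
 * In the default branch the unpatched code only drops the packet, so
 * the host never gets an answer; the patched code replies with a reset.
 */
static enum toy_reply toy_recv_pkt(enum toy_state state, bool patched)
{
	switch (state) {
	case TOY_LISTEN:
		/* Normal handling is elided; assume the peer gets a response. */
		return TOY_REPLY_RESPONSE;
	default:
		return patched ? TOY_REPLY_RST : TOY_REPLY_NONE;
	}
}

/* Self-check: a packet arriving between bind and listen. */
static void toy_self_check(void)
{
	assert(toy_recv_pkt(TOY_CLOSE, false) == TOY_REPLY_NONE);
	assert(toy_recv_pkt(TOY_CLOSE, true) == TOY_REPLY_RST);
}
```

In this model, a packet that arrives while the socket is bound but not yet listening leaves the host without any reply unless the reset is sent, which corresponds to the hang described in the commit message.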

@@ -0,0 +1,47 @@
From cab495651e8f71c39e87a08abbe051916110b3ca Mon Sep 17 00:00:00 2001
From: Julio Montes <julio.montes@intel.com>
Date: Mon, 18 Sep 2017 11:46:59 -0500
Subject: [PATCH 3/5] NO-UPSTREAM: 9P: always use cached inode to fill in
v9fs_vfs_getattr
So that in cache=none mode we don't have to query a server that
might not support the open-unlink-fstat operation.
fixes https://github.com/01org/cc-oci-runtime/issues/47
fixes https://github.com/01org/cc-oci-runtime/issues/1062
Signed-off-by: Peng Tao <bergwolf@gmail.com>
---
fs/9p/vfs_inode.c | 2 +-
fs/9p/vfs_inode_dotl.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 85ff859d3af5..efdc2a8f37bb 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -1080,7 +1080,7 @@ v9fs_vfs_getattr(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
- if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
+ if (!d_really_is_negative(dentry) || v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(d_inode(dentry), stat);
return 0;
}
diff --git a/fs/9p/vfs_inode_dotl.c b/fs/9p/vfs_inode_dotl.c
index 4823e1c46999..daa5e6a41864 100644
--- a/fs/9p/vfs_inode_dotl.c
+++ b/fs/9p/vfs_inode_dotl.c
@@ -480,7 +480,7 @@ v9fs_vfs_getattr_dotl(const struct path *path, struct kstat *stat,
p9_debug(P9_DEBUG_VFS, "dentry: %p\n", dentry);
v9ses = v9fs_dentry2v9ses(dentry);
- if (v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
+ if (!d_really_is_negative(dentry) || v9ses->cache == CACHE_LOOSE || v9ses->cache == CACHE_FSCACHE) {
generic_fillattr(d_inode(dentry), stat);
return 0;
}
--
2.20.1

@@ -0,0 +1,49 @@
From ac1956caf20f8ac0589f69b2d5fcc81e6ba7c71a Mon Sep 17 00:00:00 2001
From: Sebastien Boeuf <sebastien.boeuf@intel.com>
Date: Thu, 13 Feb 2020 08:50:38 +0100
Subject: [PATCH] net: virtio_vsock: Fix race condition between bind and listen
Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and will
effectively never recover.
This issue happens when the guest side starts binding to the socket,
which inserts a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued a RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet virtio_transport_recv_pkt() will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't be changed into TCP_LISTEN yet, which leads the code to pick
the default statement. This default statement will only free the buffer,
while it should also respond to the host side, by sending a packet on
its TX queue.
In order to simply fix this unfortunate chain of events, it is important
that in case the default statement is entered, and because at this stage
we know the host side is waiting for an answer, we must send back a
packet containing the operation VIRTIO_VSOCK_OP_RST.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index fb2060dffb0a..696e9a03ad0f 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1127,6 +1127,7 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
virtio_transport_free_pkt(pkt);
break;
default:
+ (void)virtio_transport_reset_no_sock(pkt);
virtio_transport_free_pkt(pkt);
break;
}
--
2.20.1

@@ -0,0 +1,81 @@
From 3d1d7f8922ed2f080f6d8e08df0d51e22f9590ec Mon Sep 17 00:00:00 2001
From: Jianyong Wu <jianyong.wu@arm.com>
Date: Wed, 1 Apr 2020 15:19:29 +0800
Subject: [PATCH 1/9] arm/arm64: Provide a wrapper for SMCCC 1.1 calls
From: Steven Price <steven.price@arm.com>
SMCCC 1.1 calls may use either HVC or SMC depending on the PSCI
conduit. Rather than coding this in every call site, provide a macro
which uses the correct instruction. The macro also handles the case
where no conduit is configured or available by returning a not-supported
error in res, and it returns the conduit used for the call.
This allows us to remove some duplicated code and will be useful later
when adding paravirtualized time hypervisor calls.
Signed-off-by: Steven Price <steven.price@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Signed-off-by: Marc Zyngier <maz@kernel.org>
---
include/linux/arm-smccc.h | 45 +++++++++++++++++++++++++++++++++++++++
1 file changed, 45 insertions(+)
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 080012a6f025..131edde5d37e 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -302,5 +302,50 @@ asmlinkage void __arm_smccc_hvc(unsigned long a0, unsigned long a1,
#define SMCCC_RET_NOT_SUPPORTED -1
#define SMCCC_RET_NOT_REQUIRED -2
+/*
+ * Like arm_smccc_1_1* but always returns SMCCC_RET_NOT_SUPPORTED.
+ * Used when the SMCCC conduit is not defined. The empty asm statement
+ * avoids compiler warnings about unused variables.
+ */
+#define __fail_smccc_1_1(...) \
+ do { \
+ __declare_args(__count_args(__VA_ARGS__), __VA_ARGS__); \
+ asm ("" __constraints(__count_args(__VA_ARGS__))); \
+ if (___res) \
+ ___res->a0 = SMCCC_RET_NOT_SUPPORTED; \
+ } while (0)
+
+/*
+ * arm_smccc_1_1_invoke() - make an SMCCC v1.1 compliant call
+ *
+ * This is a variadic macro taking one to eight source arguments, and
+ * an optional return structure.
+ *
+ * @a0-a7: arguments passed in registers 0 to 7
+ * @res: result values from registers 0 to 3
+ *
+ * This macro will make either an HVC call or an SMC call depending on the
+ * current SMCCC conduit. If no valid conduit is available then -1
+ * (SMCCC_RET_NOT_SUPPORTED) is returned in @res.a0 (if supplied).
+ *
+ * The return value also provides the conduit that was used.
+ */
+#define arm_smccc_1_1_invoke(...) ({ \
+ int method = arm_smccc_1_1_get_conduit(); \
+ switch (method) { \
+ case SMCCC_CONDUIT_HVC: \
+ arm_smccc_1_1_hvc(__VA_ARGS__); \
+ break; \
+ case SMCCC_CONDUIT_SMC: \
+ arm_smccc_1_1_smc(__VA_ARGS__); \
+ break; \
+ default: \
+ __fail_smccc_1_1(__VA_ARGS__); \
+ method = SMCCC_CONDUIT_NONE; \
+ break; \
+ } \
+ method; \
+ })
+
#endif /*__ASSEMBLY__*/
#endif /*__LINUX_ARM_SMCCC_H*/
--
2.17.1

@@ -0,0 +1,81 @@
From b830806f5cd02119be9b25812b3ea56d97cd08f3 Mon Sep 17 00:00:00 2001
From: Mark Rutland <mark.rutland@arm.com>
Date: Fri, 9 Aug 2019 14:22:40 +0100
Subject: [PATCH 2/9] arm/arm64: smccc/psci: add arm_smccc_1_1_get_conduit()
SMCCC callers are currently amassing a collection of enums for the SMCCC
conduit, and are having to dig into the PSCI driver's internals in order
to figure out what to do.
Let's clean this up, with common SMCCC_CONDUIT_* definitions, and an
arm_smccc_1_1_get_conduit() helper that abstracts the PSCI driver's
internal state.
We can kill off the PSCI_CONDUIT_* definitions once we've migrated users
over to the new interface.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
---
drivers/firmware/psci/psci.c | 15 +++++++++++++++
include/linux/arm-smccc.h | 16 ++++++++++++++++
2 files changed, 31 insertions(+)
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index 84f4ff351c62..eb797081d159 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -57,6 +57,21 @@ struct psci_operations psci_ops = {
.smccc_version = SMCCC_VERSION_1_0,
};
+enum arm_smccc_conduit arm_smccc_1_1_get_conduit(void)
+{
+ if (psci_ops.smccc_version < SMCCC_VERSION_1_1)
+ return SMCCC_CONDUIT_NONE;
+
+ switch (psci_ops.conduit) {
+ case PSCI_CONDUIT_SMC:
+ return SMCCC_CONDUIT_SMC;
+ case PSCI_CONDUIT_HVC:
+ return SMCCC_CONDUIT_HVC;
+ default:
+ return SMCCC_CONDUIT_NONE;
+ }
+}
+
typedef unsigned long (psci_fn)(unsigned long, unsigned long,
unsigned long, unsigned long);
static psci_fn *invoke_psci_fn;
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index 131edde5d37e..e6d4cb4f61f1 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -80,6 +80,22 @@
#include <linux/linkage.h>
#include <linux/types.h>
+
+enum arm_smccc_conduit {
+ SMCCC_CONDUIT_NONE,
+ SMCCC_CONDUIT_SMC,
+ SMCCC_CONDUIT_HVC,
+};
+
+/**
+ * arm_smccc_1_1_get_conduit()
+ *
+ * Returns the conduit to be used for SMCCCv1.1 or later.
+ *
+ * When SMCCCv1.1 is not present, returns SMCCC_CONDUIT_NONE.
+ */
+enum arm_smccc_conduit arm_smccc_1_1_get_conduit(void);
+
/**
* struct arm_smccc_res - Result from SMC/HVC call
* @a0-a3 result values from registers 0 to 3
--
2.17.1

@@ -0,0 +1,641 @@
From cb55878a1cecb7ef56956a28a9f1b745d0ac522b Mon Sep 17 00:00:00 2001
From: Jianyong Wu <jianyong.wu@arm.com>
Date: Wed, 1 Apr 2020 15:39:44 +0800
Subject: [PATCH 3/3] ptp: arm64: Enable ptp_kvm for arm64.
Currently, in an arm64 virtualization environment, there is no mechanism to
keep time in sync between guest and host. Time in the guest will drift
relative to the host after boot, as each may use third-party time sources
to correct its own time. The deviation is on the order of milliseconds,
but some scenarios require higher precision. In a cloud environment, for
example, we want all VMs running on a host to acquire the same level of
accuracy from the host clock.
The KVM PTP clock, which uses the host clock source as the reference
clock to sync time between guest and host, has already been adopted by
x86, where it improves time sync accuracy from milliseconds to nanoseconds.
This patch enables KVM PTP on arm64.
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
---
drivers/clocksource/arm_arch_timer.c | 24 ++++++
drivers/firmware/psci/psci.c | 1 +
drivers/ptp/Kconfig | 2 +-
drivers/ptp/Makefile | 1 +
drivers/ptp/ptp_kvm.h | 11 +++
drivers/ptp/ptp_kvm_arm64.c | 51 ++++++++++++
drivers/ptp/{ptp_kvm.c => ptp_kvm_common.c} | 78 +++++-------------
drivers/ptp/ptp_kvm_x86.c | 87 +++++++++++++++++++++
include/linux/arm-smccc.h | 8 ++
include/linux/clocksource.h | 6 ++
include/linux/clocksource_ids.h | 13 +++
include/linux/timekeeping.h | 12 +--
include/uapi/linux/kvm.h | 1 +
kernel/time/clocksource.c | 3 +
kernel/time/timekeeping.c | 1 +
virt/kvm/arm/arm.c | 1 +
virt/kvm/arm/psci.c | 23 ++++++
17 files changed, 258 insertions(+), 65 deletions(-)
create mode 100644 drivers/ptp/ptp_kvm.h
create mode 100644 drivers/ptp/ptp_kvm_arm64.c
rename drivers/ptp/{ptp_kvm.c => ptp_kvm_common.c} (63%)
create mode 100644 drivers/ptp/ptp_kvm_x86.c
create mode 100644 include/linux/clocksource_ids.h
diff --git a/drivers/clocksource/arm_arch_timer.c b/drivers/clocksource/arm_arch_timer.c
index 9a5464c625b4..0c723df39b55 100644
--- a/drivers/clocksource/arm_arch_timer.c
+++ b/drivers/clocksource/arm_arch_timer.c
@@ -16,6 +16,7 @@
#include <linux/cpu_pm.h>
#include <linux/clockchips.h>
#include <linux/clocksource.h>
+#include <linux/clocksource_ids.h>
#include <linux/interrupt.h>
#include <linux/of_irq.h>
#include <linux/of_address.h>
@@ -187,6 +188,7 @@ static u64 arch_counter_read_cc(const struct cyclecounter *cc)
static struct clocksource clocksource_counter = {
.name = "arch_sys_counter",
+ .id = CSID_ARM_ARCH_COUNTER,
.rating = 400,
.read = arch_counter_read,
.mask = CLOCKSOURCE_MASK(56),
@@ -1623,3 +1625,25 @@ static int __init arch_timer_acpi_init(struct acpi_table_header *table)
}
TIMER_ACPI_DECLARE(arch_timer, ACPI_SIG_GTDT, arch_timer_acpi_init);
#endif
+
+#if IS_ENABLED(CONFIG_PTP_1588_CLOCK_KVM)
+#include <linux/arm-smccc.h>
+int kvm_arch_ptp_get_crosststamp(unsigned long *cycle, struct timespec64 *ts,
+ struct clocksource **cs)
+{
+ struct arm_smccc_res hvc_res;
+ ktime_t ktime_overall;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_HYP_KVM_PTP_FUNC_ID, &hvc_res);
+ if ((long)(hvc_res.a0) < 0)
+ return -EOPNOTSUPP;
+
+ ktime_overall = hvc_res.a0 << 32 | hvc_res.a1;
+ *ts = ktime_to_timespec64(ktime_overall);
+ *cycle = hvc_res.a2 << 32 | hvc_res.a3;
+ *cs = &clocksource_counter;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_ptp_get_crosststamp);
+#endif
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index eb797081d159..87a7dc18b175 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -71,6 +71,7 @@ enum arm_smccc_conduit arm_smccc_1_1_get_conduit(void)
return SMCCC_CONDUIT_NONE;
}
}
+EXPORT_SYMBOL(arm_smccc_1_1_get_conduit);
typedef unsigned long (psci_fn)(unsigned long, unsigned long,
unsigned long, unsigned long);
diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 0517272a268e..6f3688e7e440 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -110,7 +110,7 @@ config PTP_1588_CLOCK_PCH
config PTP_1588_CLOCK_KVM
tristate "KVM virtual PTP clock"
depends on PTP_1588_CLOCK
- depends on KVM_GUEST && X86
+ depends on KVM_GUEST && X86 || ARM64 && ARM_ARCH_TIMER
default y
help
This driver adds support for using kvm infrastructure as a PTP
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 677d1d178a3e..3b7554f56ad9 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -4,6 +4,7 @@
#
ptp-y := ptp_clock.o ptp_chardev.o ptp_sysfs.o
+ptp_kvm-y := ptp_kvm_$(ARCH).o ptp_kvm_common.o
obj-$(CONFIG_PTP_1588_CLOCK) += ptp.o
obj-$(CONFIG_PTP_1588_CLOCK_DTE) += ptp_dte.o
obj-$(CONFIG_PTP_1588_CLOCK_IXP46X) += ptp_ixp46x.o
diff --git a/drivers/ptp/ptp_kvm.h b/drivers/ptp/ptp_kvm.h
new file mode 100644
index 000000000000..4bf1802bbeb8
--- /dev/null
+++ b/drivers/ptp/ptp_kvm.h
@@ -0,0 +1,11 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Virtual PTP 1588 clock for use with KVM guests
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ */
+
+int kvm_arch_ptp_init(void);
+int kvm_arch_ptp_get_clock(struct timespec64 *ts);
+int kvm_arch_ptp_get_crosststamp(unsigned long *cycle,
+ struct timespec64 *tspec, void *cs);
diff --git a/drivers/ptp/ptp_kvm_arm64.c b/drivers/ptp/ptp_kvm_arm64.c
new file mode 100644
index 000000000000..446f2444d285
--- /dev/null
+++ b/drivers/ptp/ptp_kvm_arm64.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Virtual PTP 1588 clock for use with KVM guests
+ * Copyright (C) 2019 ARM Ltd.
+ * All Rights Reserved
+ */
+
+#include <linux/kernel.h>
+#include <linux/err.h>
+#include <asm/hypervisor.h>
+#include <linux/module.h>
+#include <linux/psci.h>
+#include <linux/arm-smccc.h>
+#include <linux/timecounter.h>
+#include <linux/sched/clock.h>
+#include <asm/arch_timer.h>
+
+int kvm_arch_ptp_init(void)
+{
+ struct arm_smccc_res hvc_res;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_HYP_KVM_PTP_FUNC_ID, &hvc_res);
+ if ((long)(hvc_res.a0) < 0)
+ return -EOPNOTSUPP;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock_generic(struct timespec64 *ts,
+ struct arm_smccc_res *hvc_res)
+{
+ ktime_t ktime_overall;
+
+ arm_smccc_1_1_invoke(ARM_SMCCC_HYP_KVM_PTP_FUNC_ID, hvc_res);
+ if ((long)(hvc_res->a0) < 0)
+ return -EOPNOTSUPP;
+
+ ktime_overall = hvc_res->a0 << 32 | hvc_res->a1;
+ *ts = ktime_to_timespec64(ktime_overall);
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock(struct timespec64 *ts)
+{
+ struct arm_smccc_res hvc_res;
+
+ kvm_arch_ptp_get_clock_generic(ts, &hvc_res);
+
+ return 0;
+}
diff --git a/drivers/ptp/ptp_kvm.c b/drivers/ptp/ptp_kvm_common.c
similarity index 63%
rename from drivers/ptp/ptp_kvm.c
rename to drivers/ptp/ptp_kvm_common.c
index fc7d0b77e118..60442f70d3fc 100644
--- a/drivers/ptp/ptp_kvm.c
+++ b/drivers/ptp/ptp_kvm_common.c
@@ -8,15 +8,16 @@
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
+#include <linux/slab.h>
#include <linux/module.h>
#include <uapi/linux/kvm_para.h>
#include <asm/kvm_para.h>
-#include <asm/pvclock.h>
-#include <asm/kvmclock.h>
#include <uapi/asm/kvm_para.h>
#include <linux/ptp_clock_kernel.h>
+#include "ptp_kvm.h"
+
struct kvm_ptp_clock {
struct ptp_clock *ptp_clock;
struct ptp_clock_info caps;
@@ -24,56 +25,29 @@ struct kvm_ptp_clock {
DEFINE_SPINLOCK(kvm_ptp_lock);
-static struct pvclock_vsyscall_time_info *hv_clock;
-
-static struct kvm_clock_pairing clock_pair;
-static phys_addr_t clock_pair_gpa;
-
static int ptp_kvm_get_time_fn(ktime_t *device_time,
struct system_counterval_t *system_counter,
void *ctx)
{
- unsigned long ret;
+ unsigned long ret, cycle;
struct timespec64 tspec;
- unsigned version;
- int cpu;
- struct pvclock_vcpu_time_info *src;
+ struct clocksource *cs;
spin_lock(&kvm_ptp_lock);
preempt_disable_notrace();
- cpu = smp_processor_id();
- src = &hv_clock[cpu].pvti;
-
- do {
- /*
- * We are using a TSC value read in the hosts
- * kvm_hc_clock_pairing handling.
- * So any changes to tsc_to_system_mul
- * and tsc_shift or any other pvclock
- * data invalidate that measurement.
- */
- version = pvclock_read_begin(src);
-
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
- clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
- if (ret != 0) {
- pr_err_ratelimited("clock pairing hypercall ret %lu\n", ret);
- spin_unlock(&kvm_ptp_lock);
- preempt_enable_notrace();
- return -EOPNOTSUPP;
- }
-
- tspec.tv_sec = clock_pair.sec;
- tspec.tv_nsec = clock_pair.nsec;
- ret = __pvclock_read_cycles(src, clock_pair.tsc);
- } while (pvclock_read_retry(src, version));
+ ret = kvm_arch_ptp_get_crosststamp(&cycle, &tspec, &cs);
+ if (ret != 0) {
+ pr_err_ratelimited("clock pairing hypercall ret %lu\n", ret);
+ spin_unlock(&kvm_ptp_lock);
+ preempt_enable_notrace();
+ return -EOPNOTSUPP;
+ }
preempt_enable_notrace();
- system_counter->cycles = ret;
- system_counter->cs = &kvm_clock;
+ system_counter->cycles = cycle;
+ system_counter->cs = cs;
*device_time = timespec64_to_ktime(tspec);
@@ -116,17 +90,13 @@ static int ptp_kvm_gettime(struct ptp_clock_info *ptp, struct timespec64 *ts)
spin_lock(&kvm_ptp_lock);
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
- clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
+ ret = kvm_arch_ptp_get_clock(&tspec);
if (ret != 0) {
pr_err_ratelimited("clock offset hypercall ret %lu\n", ret);
spin_unlock(&kvm_ptp_lock);
return -EOPNOTSUPP;
}
- tspec.tv_sec = clock_pair.sec;
- tspec.tv_nsec = clock_pair.nsec;
spin_unlock(&kvm_ptp_lock);
memcpy(ts, &tspec, sizeof(struct timespec64));
@@ -166,21 +136,11 @@ static void __exit ptp_kvm_exit(void)
static int __init ptp_kvm_init(void)
{
- long ret;
-
- if (!kvm_para_available())
- return -ENODEV;
+ int ret;
- clock_pair_gpa = slow_virt_to_phys(&clock_pair);
- hv_clock = pvclock_get_pvti_cpu0_va();
-
- if (!hv_clock)
- return -ENODEV;
-
- ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, clock_pair_gpa,
- KVM_CLOCK_PAIRING_WALLCLOCK);
- if (ret == -KVM_ENOSYS || ret == -KVM_EOPNOTSUPP)
- return -ENODEV;
+ ret = kvm_arch_ptp_init();
+ if (ret)
+ return -EOPNOTSUPP;
kvm_ptp_clock.caps = ptp_kvm_caps;
diff --git a/drivers/ptp/ptp_kvm_x86.c b/drivers/ptp/ptp_kvm_x86.c
new file mode 100644
index 000000000000..6c891d7299c6
--- /dev/null
+++ b/drivers/ptp/ptp_kvm_x86.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with KVM guests
+ *
+ * Copyright (C) 2017 Red Hat Inc.
+ */
+
+#include <asm/pvclock.h>
+#include <asm/kvmclock.h>
+#include <linux/module.h>
+#include <uapi/asm/kvm_para.h>
+#include <uapi/linux/kvm_para.h>
+#include <linux/ptp_clock_kernel.h>
+
+phys_addr_t clock_pair_gpa;
+struct kvm_clock_pairing clock_pair;
+struct pvclock_vsyscall_time_info *hv_clock;
+
+int kvm_arch_ptp_init(void)
+{
+ int ret;
+
+ if (!kvm_para_available())
+ return -ENODEV;
+
+ clock_pair_gpa = slow_virt_to_phys(&clock_pair);
+ hv_clock = pvclock_get_pvti_cpu0_va();
+ if (!hv_clock)
+ return -ENODEV;
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING, clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ if (ret == -KVM_ENOSYS || ret == -KVM_EOPNOTSUPP)
+ return -ENODEV;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_clock(struct timespec64 *ts)
+{
+ long ret;
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
+ clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ if (ret != 0)
+ return -EOPNOTSUPP;
+
+ ts->tv_sec = clock_pair.sec;
+ ts->tv_nsec = clock_pair.nsec;
+
+ return 0;
+}
+
+int kvm_arch_ptp_get_crosststamp(unsigned long *cycle, struct timespec64 *tspec,
+ struct clocksource **cs)
+{
+ unsigned long ret;
+ unsigned int version;
+ int cpu;
+ struct pvclock_vcpu_time_info *src;
+
+ cpu = smp_processor_id();
+ src = &hv_clock[cpu].pvti;
+
+ do {
+ /*
+ * We are using a TSC value read in the hosts
+ * kvm_hc_clock_pairing handling.
+ * So any changes to tsc_to_system_mul
+ * and tsc_shift or any other pvclock
+ * data invalidate that measurement.
+ */
+ version = pvclock_read_begin(src);
+
+ ret = kvm_hypercall2(KVM_HC_CLOCK_PAIRING,
+ clock_pair_gpa,
+ KVM_CLOCK_PAIRING_WALLCLOCK);
+ tspec->tv_sec = clock_pair.sec;
+ tspec->tv_nsec = clock_pair.nsec;
+ *cycle = __pvclock_read_cycles(src, clock_pair.tsc);
+ } while (pvclock_read_retry(src, version));
+
+ *cs = &kvm_clock;
+
+ return 0;
+}
diff --git a/include/linux/arm-smccc.h b/include/linux/arm-smccc.h
index e6d4cb4f61f1..32a46d564934 100644
--- a/include/linux/arm-smccc.h
+++ b/include/linux/arm-smccc.h
@@ -45,6 +45,7 @@
#define ARM_SMCCC_OWNER_SIP 2
#define ARM_SMCCC_OWNER_OEM 3
#define ARM_SMCCC_OWNER_STANDARD 4
+#define ARM_SMCCC_OWNER_STANDARD_HYP 5
#define ARM_SMCCC_OWNER_TRUSTED_APP 48
#define ARM_SMCCC_OWNER_TRUSTED_APP_END 49
#define ARM_SMCCC_OWNER_TRUSTED_OS 50
@@ -76,6 +77,13 @@
ARM_SMCCC_SMC_32, \
0, 0x7fff)
+/* PTP KVM call requests clock time from guest OS to host */
+#define ARM_SMCCC_HYP_KVM_PTP_FUNC_ID \
+ ARM_SMCCC_CALL_VAL(ARM_SMCCC_FAST_CALL, \
+ ARM_SMCCC_SMC_32, \
+ ARM_SMCCC_OWNER_STANDARD_HYP, \
+ 0)
+
#ifndef __ASSEMBLY__
#include <linux/linkage.h>
diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index b21db536fd52..96e85b6f9ca0 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -17,6 +17,7 @@
#include <linux/timer.h>
#include <linux/init.h>
#include <linux/of.h>
+#include <linux/clocksource_ids.h>
#include <asm/div64.h>
#include <asm/io.h>
@@ -49,6 +50,10 @@ struct module;
* 400-499: Perfect
* The ideal clocksource. A must-use where
* available.
+ * @id: Defaults to CSID_GENERIC. The id value is captured
+ * in certain snapshot functions to allow callers to
+ * validate the clocksource from which the snapshot was
+ * taken.
* @read: returns a cycle value, passes clocksource as argument
* @enable: optional function to enable the clocksource
* @disable: optional function to disable the clocksource
@@ -91,6 +96,7 @@ struct clocksource {
const char *name;
struct list_head list;
int rating;
+ enum clocksource_ids id;
int (*enable)(struct clocksource *cs);
void (*disable)(struct clocksource *cs);
unsigned long flags;
diff --git a/include/linux/clocksource_ids.h b/include/linux/clocksource_ids.h
new file mode 100644
index 000000000000..93bec8426c44
--- /dev/null
+++ b/include/linux/clocksource_ids.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_CLOCKSOURCE_IDS_H
+#define _LINUX_CLOCKSOURCE_IDS_H
+
+/* Enum to give clocksources a unique identifier */
+enum clocksource_ids {
+ CSID_GENERIC = 0,
+ CSID_ARM_ARCH_COUNTER,
+ CSID_MAX,
+};
+
+#endif
+
diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index b27e2ffa96c1..4ecc32ad3879 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -2,6 +2,7 @@
#ifndef _LINUX_TIMEKEEPING_H
#define _LINUX_TIMEKEEPING_H
+#include <linux/clocksource_ids.h>
#include <linux/errno.h>
/* Included from linux/ktime.h */
@@ -232,11 +233,12 @@ extern void timekeeping_inject_sleeptime64(const struct timespec64 *delta);
* @cs_was_changed_seq: The sequence number of clocksource change events
*/
struct system_time_snapshot {
- u64 cycles;
- ktime_t real;
- ktime_t raw;
- unsigned int clock_was_set_seq;
- u8 cs_was_changed_seq;
+ u64 cycles;
+ ktime_t real;
+ ktime_t raw;
+ enum clocksource_ids cs_id;
+ unsigned int clock_was_set_seq;
+ u8 cs_was_changed_seq;
};
/*
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52641d8ca9e8..16008ebe5474 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1000,6 +1000,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_PMU_EVENT_FILTER 173
#define KVM_CAP_ARM_IRQ_LINE_LAYOUT_2 174
#define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 175
+#define KVM_CAP_ARM_KVM_PTP 176
#ifdef KVM_CAP_IRQ_ROUTING
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index fff5f64981c6..5fe2d61172b1 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -921,6 +921,9 @@ int __clocksource_register_scale(struct clocksource *cs, u32 scale, u32 freq)
clocksource_arch_init(cs);
+ if (WARN_ON_ONCE((unsigned int)cs->id >= CSID_MAX))
+ cs->id = CSID_GENERIC;
+
/* Initialize mult/shift and max_idle_ns */
__clocksource_update_freq_scale(cs, scale, freq);
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index ca69290bee2a..a8b378338b9e 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -979,6 +979,7 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
do {
seq = read_seqcount_begin(&tk_core.seq);
now = tk_clock_read(&tk->tkr_mono);
+ systime_snapshot->cs_id = tk->tkr_mono.clock->id;
systime_snapshot->cs_was_changed_seq = tk->cs_was_changed_seq;
systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
base_real = ktime_add(tk->tkr_mono.base,
diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
index 86c6aa1cb58e..ee159ce9ca39 100644
--- a/virt/kvm/arm/arm.c
+++ b/virt/kvm/arm/arm.c
@@ -197,6 +197,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_IMMEDIATE_EXIT:
case KVM_CAP_VCPU_EVENTS:
case KVM_CAP_ARM_IRQ_LINE_LAYOUT_2:
+ case KVM_CAP_ARM_KVM_PTP:
r = 1;
break;
case KVM_CAP_ARM_SET_DEVICE_ADDR:
diff --git a/virt/kvm/arm/psci.c b/virt/kvm/arm/psci.c
index 87927f7e1ee7..6e689f9952fb 100644
--- a/virt/kvm/arm/psci.c
+++ b/virt/kvm/arm/psci.c
@@ -9,6 +9,7 @@
#include <linux/kvm_host.h>
#include <linux/uaccess.h>
#include <linux/wait.h>
+#include <linux/clocksource_ids.h>
#include <asm/cputype.h>
#include <asm/kvm_emulate.h>
@@ -389,6 +390,9 @@ static int kvm_psci_call(struct kvm_vcpu *vcpu)
int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
{
+ struct system_time_snapshot systime_snapshot;
+ long arg[4];
+ u64 cycles;
u32 func_id = smccc_get_function(vcpu);
u32 val = SMCCC_RET_NOT_SUPPORTED;
u32 feature;
@@ -428,6 +432,25 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
break;
}
break;
+ /*
+ * This is used for the virtual PTP KVM clock. Two 64-bit values are
+ * passed back, each split across two registers:
+ * reg0 stores the high 32 bits of the host ktime;
+ * reg1 stores the low 32 bits of the host ktime;
+ * reg2 stores the high 32 bits of (host cycles - cntvoff);
+ * reg3 stores the low 32 bits of (host cycles - cntvoff).
+ */
+ case ARM_SMCCC_HYP_KVM_PTP_FUNC_ID:
+ ktime_get_snapshot(&systime_snapshot);
+ if (systime_snapshot.cs_id != CSID_ARM_ARCH_COUNTER)
+ break;
+ arg[0] = systime_snapshot.real >> 32;
+ arg[1] = systime_snapshot.real << 32 >> 32;
+ cycles = systime_snapshot.cycles - vcpu_vtimer(vcpu)->cntvoff;
+ arg[2] = cycles >> 32;
+ arg[3] = cycles << 32 >> 32;
+ smccc_set_retval(vcpu, arg[0], arg[1], arg[2], arg[3]);
+ return 1;
default:
return kvm_psci_call(vcpu);
}
--
2.17.1
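
The register layout used by the PTP hypercall above (a 64-bit value split into a high and a low 32-bit half on the host, then rejoined in the guest) can be sketched in plain user-space C. This is only an illustration under stated assumptions: `high32`, `low32`, `join32`, and `check_roundtrip` are hypothetical helper names, not kernel APIs, and the sketch uses fixed-width unsigned types instead of the `long` and `ktime_t` types that appear in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Host side (hypothetical helper): upper 32 bits of a 64-bit value. */
static inline uint32_t high32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

/* Host side (hypothetical helper): lower 32 bits of a 64-bit value. */
static inline uint32_t low32(uint64_t v)
{
	return (uint32_t)(v & 0xffffffffu);
}

/*
 * Guest side (hypothetical helper): rebuild the 64-bit value from the
 * two halves. The high half is widened before shifting so that no bits
 * are lost.
 */
static inline uint64_t join32(uint32_t hi, uint32_t lo)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Splitting and rejoining must give back the original value. */
static void check_roundtrip(uint64_t v)
{
	assert(join32(high32(v), low32(v)) == v);
}
```

Splitting a value with `high32()` and `low32()` and rejoining it with `join32()` gives back the original 64-bit value, which is the property the guest relies on when it computes `a0 << 32 | a1` and `a2 << 32 | a3`.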

@@ -0,0 +1,498 @@
From ba91422b18892bceacf3b4aa60354cf36fcabf9b Mon Sep 17 00:00:00 2001
From: Penny Zheng <penny.zheng@arm.com>
Date: Wed, 8 Apr 2020 10:26:52 +0800
Subject: [PATCH] arm64/mm: Enable memory hot remove
Backport Anshuman Khandual's patch series enabling memory hot remove
on aarch64 (https://patchwork.kernel.org/cover/11419305/) to v5.4.x.
The series has already been merged and is queued for 5.7.
Signed-off-by: Penny Zheng <penny.zheng@arm.com>
---
arch/arm64/Kconfig | 3 +
arch/arm64/include/asm/memory.h | 1 +
arch/arm64/mm/mmu.c | 379 +++++++++++++++++++++++++++++++-
arch/arm64/mm/ptdump_debugfs.c | 4 +
4 files changed, 378 insertions(+), 9 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6ccd2ed30963..d18b716fa569 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -274,6 +274,9 @@ config ZONE_DMA32
config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+ def_bool y
+
config SMP
def_bool y
diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
index c23c47360664..dbba06e258f5 100644
--- a/arch/arm64/include/asm/memory.h
+++ b/arch/arm64/include/asm/memory.h
@@ -54,6 +54,7 @@
#define MODULES_VADDR (BPF_JIT_REGION_END)
#define MODULES_VSIZE (SZ_128M)
#define VMEMMAP_START (-VMEMMAP_SIZE - SZ_2M)
+#define VMEMMAP_END (VMEMMAP_START + VMEMMAP_SIZE)
#define PCI_IO_END (VMEMMAP_START - SZ_2M)
#define PCI_IO_START (PCI_IO_END - PCI_IO_SIZE)
#define FIXADDR_TOP (PCI_IO_START - SZ_2M)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d10247fab0fd..99fec235144e 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -17,6 +17,7 @@
#include <linux/mman.h>
#include <linux/nodemask.h>
#include <linux/memblock.h>
+#include <linux/memory.h>
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/mm.h>
@@ -725,6 +726,312 @@ int kern_addr_valid(unsigned long addr)
return pfn_valid(pte_pfn(pte));
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+ WARN_ON(PageReserved(page));
+ free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+ free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static bool pgtable_range_aligned(unsigned long start, unsigned long end,
+ unsigned long floor, unsigned long ceiling,
+ unsigned long mask)
+{
+ start &= mask;
+ if (start < floor)
+ return false;
+
+ if (ceiling) {
+ ceiling &= mask;
+ if (!ceiling)
+ return false;
+ }
+
+ if (end - 1 > ceiling - 1)
+ return false;
+ return true;
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ pte_t *ptep, pte;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+ if (pte_none(pte))
+ continue;
+
+ WARN_ON(!pte_present(pte));
+ pte_clear(&init_mm, addr, ptep);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pte_page(pte), PAGE_SIZE);
+ } while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ unsigned long next;
+ pmd_t *pmdp, pmd;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd));
+ if (pmd_sect(pmd)) {
+ pmd_clear(pmdp);
+
+ /*
+ * One TLBI should be sufficient here as the PMD_SIZE
+ * range is mapped with a single block entry.
+ */
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pmd_page(pmd),
+ PMD_SIZE);
+ continue;
+ }
+ WARN_ON(!pmd_table(pmd));
+ unmap_hotplug_pte_range(pmdp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(p4d_t *p4dp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ unsigned long next;
+ pud_t *pudp, pud;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(p4dp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud));
+ if (pud_sect(pud)) {
+ pud_clear(pudp);
+
+ /*
+ * One TLBI should be sufficient here as the PUD_SIZE
+ * range is mapped with a single block entry.
+ */
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (free_mapped)
+ free_hotplug_page_range(pud_page(pud),
+ PUD_SIZE);
+ continue;
+ }
+ WARN_ON(!pud_table(pud));
+ unmap_hotplug_pmd_range(pudp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_p4d_range(pgd_t *pgdp, unsigned long addr,
+ unsigned long end, bool free_mapped)
+{
+ unsigned long next;
+ p4d_t *p4dp, p4d;
+
+ do {
+ next = p4d_addr_end(addr, end);
+ p4dp = p4d_offset(pgdp, addr);
+ p4d = READ_ONCE(*p4dp);
+ if (p4d_none(p4d))
+ continue;
+
+ WARN_ON(!p4d_present(p4d));
+ unmap_hotplug_pud_range(p4dp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+ bool free_mapped)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ unmap_hotplug_p4d_range(pgdp, addr, next, free_mapped);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pte_t *ptep, pte;
+ unsigned long i, start = addr;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+
+ /*
+ * This is just a sanity check here which verifies that
+ * pte clearing has been done by earlier unmap loops.
+ */
+ WARN_ON(!pte_none(pte));
+ } while (addr += PAGE_SIZE, addr < end);
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PMD_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pte page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ ptep = pte_offset_kernel(pmdp, 0UL);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ if (!pte_none(READ_ONCE(ptep[i])))
+ return;
+ }
+
+ pmd_clear(pmdp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(ptep));
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pmd_t *pmdp, pmd;
+ unsigned long i, next, start = addr;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+ free_empty_pte_table(pmdp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+
+ if (CONFIG_PGTABLE_LEVELS <= 2)
+ return;
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PUD_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pmd page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ pmdp = pmd_offset(pudp, 0UL);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (!pmd_none(READ_ONCE(pmdp[i])))
+ return;
+ }
+
+ pud_clear(pudp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(pmdp));
+}
+
+static void free_empty_pud_table(p4d_t *p4dp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ pud_t *pudp, pud;
+ unsigned long i, next, start = addr;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(p4dp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+ free_empty_pmd_table(pudp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+
+ if (CONFIG_PGTABLE_LEVELS <= 3)
+ return;
+
+ if (!pgtable_range_aligned(start, end, floor, ceiling, PGDIR_MASK))
+ return;
+
+ /*
+ * Check whether we can free the pud page if the rest of the
+ * entries are empty. Overlap with other regions has been
+ * handled by the floor/ceiling check.
+ */
+ pudp = pud_offset(p4dp, 0UL);
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ if (!pud_none(READ_ONCE(pudp[i])))
+ return;
+ }
+
+ p4d_clear(p4dp);
+ __flush_tlb_kernel_pgtable(start);
+ free_hotplug_pgtable_page(virt_to_page(pudp));
+}
+
+static void free_empty_p4d_table(pgd_t *pgdp, unsigned long addr,
+ unsigned long end, unsigned long floor,
+ unsigned long ceiling)
+{
+ unsigned long next;
+ p4d_t *p4dp, p4d;
+
+ do {
+ next = p4d_addr_end(addr, end);
+ p4dp = p4d_offset(pgdp, addr);
+ p4d = READ_ONCE(*p4dp);
+ if (p4d_none(p4d))
+ continue;
+
+ WARN_ON(!p4d_present(p4d));
+ free_empty_pud_table(p4dp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end,
+ unsigned long floor, unsigned long ceiling)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ free_empty_p4d_table(pgdp, addr, next, floor, ceiling);
+ } while (addr = next, addr < end);
+}
+#endif
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#if !ARM64_SWAPPER_USES_SECTION_MAPS
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -772,6 +1079,12 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
void vmemmap_free(unsigned long start, unsigned long end,
struct vmem_altmap *altmap)
{
+#ifdef CONFIG_MEMORY_HOTPLUG
+ WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
+
+ unmap_hotplug_range(start, end, true);
+ free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
+#endif
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
@@ -1050,10 +1363,21 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
}
#ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+ unsigned long end = start + size;
+
+ WARN_ON(pgdir != init_mm.pgd);
+ WARN_ON((start < PAGE_OFFSET) || (end > PAGE_END));
+
+ unmap_hotplug_range(start, end, false);
+ free_empty_tables(start, end, PAGE_OFFSET, PAGE_END);
+}
+
int arch_add_memory(int nid, u64 start, u64 size,
struct mhp_restrictions *restrictions)
{
- int flags = 0;
+ int ret, flags = 0;
if (rodata_full || debug_pagealloc_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1061,22 +1385,59 @@ int arch_add_memory(int nid, u64 start, u64 size,
__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
- return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+ ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
restrictions);
+ if (ret)
+ __remove_pgd_mapping(swapper_pg_dir,
+ __phys_to_virt(start), size);
+ return ret;
}
+
void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
{
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
- /*
- * FIXME: Cleanup page tables (also in arch_add_memory() in case
- * adding fails). Until then, this function should only be used
- * during memory hotplug (adding memory), not for memory
- * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
- * unlocked yet.
- */
__remove_pages(start_pfn, nr_pages, altmap);
+ __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
+}
+
+/*
+ * This memory hotplug notifier helps prevent boot memory from being
+ * inadvertently removed as it blocks pfn range offlining process in
+ * __offline_pages(). Hence this prevents both offlining as well as
+ * removal process for boot memory which is initially always online.
+ * In future if and when boot memory could be removed, this notifier
+ * should be dropped and free_hotplug_page_range() should handle any
+ * reserved pages allocated during boot.
+ */
+static int prevent_bootmem_remove_notifier(struct notifier_block *nb,
+ unsigned long action, void *data)
+{
+ struct mem_section *ms;
+ struct memory_notify *arg = data;
+ unsigned long end_pfn = arg->start_pfn + arg->nr_pages;
+ unsigned long pfn = arg->start_pfn;
+
+ if (action != MEM_GOING_OFFLINE)
+ return NOTIFY_OK;
+
+ for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+ if (early_section(ms))
+ return NOTIFY_BAD;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block prevent_bootmem_remove_nb = {
+ .notifier_call = prevent_bootmem_remove_notifier,
+};
+
+static int __init prevent_bootmem_remove_init(void)
+{
+ return register_memory_notifier(&prevent_bootmem_remove_nb);
}
+device_initcall(prevent_bootmem_remove_init);
#endif
diff --git a/arch/arm64/mm/ptdump_debugfs.c b/arch/arm64/mm/ptdump_debugfs.c
index 064163f25592..b5eebc8c4924 100644
--- a/arch/arm64/mm/ptdump_debugfs.c
+++ b/arch/arm64/mm/ptdump_debugfs.c
@@ -1,5 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
#include <linux/debugfs.h>
+#include <linux/memory_hotplug.h>
#include <linux/seq_file.h>
#include <asm/ptdump.h>
@@ -7,7 +8,10 @@
static int ptdump_show(struct seq_file *m, void *v)
{
struct ptdump_info *info = m->private;
+
+ get_online_mems();
ptdump_walk_pgd(m, info);
+ put_online_mems();
return 0;
}
DEFINE_SHOW_ATTRIBUTE(ptdump);
--
2.17.1
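
The floor/ceiling test that guards the freeing of page table pages in this patch can be tried out in user space. The sketch below copies the arithmetic of `pgtable_range_aligned()` into a standalone function; `range_aligned()`, `TOY_PMD_SIZE`, `TOY_PMD_MASK` and the example addresses are illustrative assumptions (a 2 MiB block, as with 4 KiB pages), not values taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 2 MiB block size, standing in for PMD_SIZE with 4 KiB pages. */
#define TOY_PMD_SIZE UINT64_C(0x200000)
#define TOY_PMD_MASK (~(TOY_PMD_SIZE - 1))

/*
 * Returns true when the table page covering [start, end) lies entirely
 * within [floor, ceiling), so it is not shared with a neighbouring region.
 */
static bool range_aligned(uint64_t start, uint64_t end,
			  uint64_t floor, uint64_t ceiling, uint64_t mask)
{
	start &= mask;
	if (start < floor)
		return false;

	if (ceiling) {
		ceiling &= mask;
		if (!ceiling)
			return false;
	}

	/* Subtracting 1 makes ceiling == 0 wrap around to "no upper bound". */
	if (end - 1 > ceiling - 1)
		return false;
	return true;
}

/* A few worked cases; each assert documents the expected answer. */
static void range_aligned_examples(void)
{
	const uint64_t floor = UINT64_C(0x40000000);
	const uint64_t ceiling = UINT64_C(0x80000000);

	/* One full block well inside [floor, ceiling). */
	assert(range_aligned(UINT64_C(0x40200000), UINT64_C(0x40400000),
			     floor, ceiling, TOY_PMD_MASK));
	/* Rounded-down start drops below the floor: the table is shared. */
	assert(!range_aligned(UINT64_C(0x40300000), UINT64_C(0x40400000),
			      UINT64_C(0x40300000), ceiling, TOY_PMD_MASK));
	/* Unaligned ceiling rounds down below end: the table is shared. */
	assert(!range_aligned(UINT64_C(0x40200000), UINT64_C(0x40400000),
			      floor, UINT64_C(0x40300000), TOY_PMD_MASK));
	/* ceiling == 0 means there is no upper limit. */
	assert(range_aligned(UINT64_C(0x40200000), UINT64_C(0x40400000),
			     floor, 0, TOY_PMD_MASK));
}
```

Calling `range_aligned_examples()` from a small `main()` runs the four cases; a `false` result corresponds to the early `return` in the `free_empty_*_table()` helpers, which leaves the table page in place.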


@@ -0,0 +1,49 @@
From c7ec155ec5e0f573e9c3cc4eb38d47543a2f1e81 Mon Sep 17 00:00:00 2001
From: Sebastien Boeuf <sebastien.boeuf@intel.com>
Date: Thu, 13 Feb 2020 08:50:38 +0100
Subject: [PATCH] net: virtio_vsock: Fix race condition between bind and listen
Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and will
effectively never recover.

This issue happens when the guest side starts binding to the socket,
which inserts a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued an RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet, virtio_transport_recv_pkt(), will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't have changed to TCP_LISTEN yet, which leads the code to take
the default case. This default case only frees the buffer,
while it should also respond to the host side by sending a packet on
its TX queue.

To fix this unfortunate chain of events simply, whenever the default
case is entered, and because at this stage we know the host side is
waiting for an answer, we must send back a packet containing the
operation VIRTIO_VSOCK_OP_RST.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 6f1a8aff65c5..0b6fb687a3e0 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1048,6 +1048,7 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
virtio_transport_free_pkt(pkt);
break;
default:
+ (void)virtio_transport_reset_no_sock(t, pkt);
virtio_transport_free_pkt(pkt);
break;
}
--
2.20.1
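
The race described in the commit message can be modelled with a toy state switch. This is a simplified user-space sketch, not the kernel code: `enum toy_sk_state`, `enum toy_reply` and `toy_recv_pkt()` are hypothetical names, and the `reset_on_default` flag only exists to compare the behaviour before and after the fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the socket states named in the message. */
enum toy_sk_state { TOY_CLOSE, TOY_LISTEN };

/* What the guest puts on its TX queue in response to an RX packet. */
enum toy_reply { TOY_REPLY_NONE, TOY_REPLY_RESPONSE, TOY_REPLY_RST };

/*
 * Model of the receive path for a socket that is already bound.
 * With reset_on_default == false the default case stays silent, which
 * is the situation that leaves the host waiting forever.
 */
static enum toy_reply toy_recv_pkt(enum toy_sk_state state,
				   bool reset_on_default)
{
	switch (state) {
	case TOY_LISTEN:
		return TOY_REPLY_RESPONSE;
	default:
		/* Bound but not yet listening: the window between bind and listen. */
		return reset_on_default ? TOY_REPLY_RST : TOY_REPLY_NONE;
	}
}

/* Expected behaviour before and after the fix. */
static void toy_recv_examples(void)
{
	/* Before the fix: no answer at all, so the host hangs. */
	assert(toy_recv_pkt(TOY_CLOSE, false) == TOY_REPLY_NONE);
	/* After the fix: the host receives a reset and can recover. */
	assert(toy_recv_pkt(TOY_CLOSE, true) == TOY_REPLY_RST);
	/* A listening socket is unaffected by the change. */
	assert(toy_recv_pkt(TOY_LISTEN, true) == TOY_REPLY_RESPONSE);
}
```

The point of the one-line patch is the middle case: every packet that reaches the default branch now produces some reply, so the host never waits on a packet that was silently dropped.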


@@ -0,0 +1,453 @@
From: Anshuman Khandual <anshuman.khandual@arm.com>
Date: Mon, 15 Jul 2019 11:47:50 +0530
Subject: [PATCH] arm64/mm: Enable memory hot remove
The arch code for hot-remove must tear down portions of the linear map and
vmemmap corresponding to memory being removed. In both cases the page
tables mapping these regions must be freed, and when sparse vmemmap is in
use the memory backing the vmemmap must also be freed.

This patch adds a new remove_pagetable() helper which can be used to tear
down either region, and calls it from vmemmap_free() and
__remove_pgd_mapping(). The sparse_vmap argument determines whether the
backing memory will be freed.

remove_pagetable() makes two distinct passes over the kernel page table.
In the first pass it unmaps each mapped leaf entry, invalidates the
applicable TLB entries and frees the backing memory if required (vmemmap).
In the second pass it looks for empty page table sections whose page table
page can be unmapped, TLB invalidated and freed.

While freeing intermediate level page table pages, bail out if any of their
entries are still valid. This can happen for a partially filled kernel page
table, either from a previously attempted failed memory hot add or while
removing an address range which does not span the entire page table page
range.

The vmemmap region may share levels of table with the vmalloc region.
There can be conflicts between hot remove freeing page table pages and
a concurrent vmalloc() walking the kernel page table. This conflict cannot
simply be solved by taking the init_mm ptl because of the existing locking
scheme in vmalloc(). Hence, unlike the linear mapping, skip freeing page
table pages while tearing down the vmemmap mapping.

While here, update arch_add_memory() to handle __add_pages() failures by
just unmapping the recently added kernel linear mapping. Now enable memory
hot remove on arm64 platforms by default with ARCH_ENABLE_MEMORY_HOTREMOVE.

This implementation is broadly inspired by the kernel page table tear down
procedure on the x86 architecture.
Acked-by: Steve Capper <steve.capper@arm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
arch/arm64/Kconfig | 3 +
arch/arm64/include/asm/pgtable.h | 7 +-
arch/arm64/mm/mmu.c | 290 ++++++++++++++++++++++++++++++-
include/linux/mmzone.h | 1 +
mm/Kconfig | 2 +-
5 files changed, 291 insertions(+), 12 deletions(-)
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 3adcec05b1f6..5a1231b8b8cf 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -273,6 +273,9 @@ config ZONE_DMA32
config ARCH_ENABLE_MEMORY_HOTPLUG
def_bool y
+config ARCH_ENABLE_MEMORY_HOTREMOVE
+ def_bool y
+
config SMP
def_bool y
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5fdcfe237338..e09760ece844 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -209,7 +209,7 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
static inline pte_t pte_mkdevmap(pte_t pte)
{
- return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
+ return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
}
static inline void set_pte(pte_t *ptep, pte_t pte)
@@ -396,7 +396,10 @@ static inline int pmd_protnone(pmd_t pmd)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
#endif
-#define pmd_mkdevmap(pmd) pte_pmd(pte_mkdevmap(pmd_pte(pmd)))
+static inline pmd_t pmd_mkdevmap(pmd_t pmd)
+{
+ return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
+}
#define __pmd_to_phys(pmd) __pte_to_phys(pmd_pte(pmd))
#define __phys_to_pmd_val(phys) __phys_to_pte_val(phys)
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 750a69dde39b..282a4b26218c 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -722,6 +722,250 @@ int kern_addr_valid(unsigned long addr)
return pfn_valid(pte_pfn(pte));
}
+
+#ifdef CONFIG_MEMORY_HOTPLUG
+static void free_hotplug_page_range(struct page *page, size_t size)
+{
+ WARN_ON(!page || PageReserved(page));
+ free_pages((unsigned long)page_address(page), get_order(size));
+}
+
+static void free_hotplug_pgtable_page(struct page *page)
+{
+ free_hotplug_page_range(page, PAGE_SIZE);
+}
+
+static void free_pte_table(pmd_t *pmdp, unsigned long addr)
+{
+ struct page *page;
+ pte_t *ptep;
+ int i;
+
+ ptep = pte_offset_kernel(pmdp, 0UL);
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ if (!pte_none(READ_ONCE(ptep[i])))
+ return;
+ }
+
+ page = pmd_page(READ_ONCE(*pmdp));
+ pmd_clear(pmdp);
+ __flush_tlb_kernel_pgtable(addr);
+ free_hotplug_pgtable_page(page);
+}
+
+static void free_pmd_table(pud_t *pudp, unsigned long addr)
+{
+ struct page *page;
+ pmd_t *pmdp;
+ int i;
+
+ if (CONFIG_PGTABLE_LEVELS <= 2)
+ return;
+
+ pmdp = pmd_offset(pudp, 0UL);
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ if (!pmd_none(READ_ONCE(pmdp[i])))
+ return;
+ }
+
+ page = pud_page(READ_ONCE(*pudp));
+ pud_clear(pudp);
+ __flush_tlb_kernel_pgtable(addr);
+ free_hotplug_pgtable_page(page);
+}
+
+static void free_pud_table(pgd_t *pgdp, unsigned long addr)
+{
+ struct page *page;
+ pud_t *pudp;
+ int i;
+
+ if (CONFIG_PGTABLE_LEVELS <= 3)
+ return;
+
+ pudp = pud_offset(pgdp, 0UL);
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ if (!pud_none(READ_ONCE(pudp[i])))
+ return;
+ }
+
+ page = pgd_page(READ_ONCE(*pgdp));
+ pgd_clear(pgdp);
+ __flush_tlb_kernel_pgtable(addr);
+ free_hotplug_pgtable_page(page);
+}
+
+static void unmap_hotplug_pte_range(pmd_t *pmdp, unsigned long addr,
+ unsigned long end, bool sparse_vmap)
+{
+ struct page *page;
+ pte_t *ptep, pte;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+ if (pte_none(pte))
+ continue;
+
+ WARN_ON(!pte_present(pte));
+ page = sparse_vmap ? pte_page(pte) : NULL;
+ pte_clear(&init_mm, addr, ptep);
+ flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+ if (sparse_vmap)
+ free_hotplug_page_range(page, PAGE_SIZE);
+ } while (addr += PAGE_SIZE, addr < end);
+}
+
+static void unmap_hotplug_pmd_range(pud_t *pudp, unsigned long addr,
+ unsigned long end, bool sparse_vmap)
+{
+ unsigned long next;
+ struct page *page;
+ pmd_t *pmdp, pmd;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd));
+ if (pmd_sect(pmd)) {
+ page = sparse_vmap ? pmd_page(pmd) : NULL;
+ pmd_clear(pmdp);
+ flush_tlb_kernel_range(addr, next);
+ if (sparse_vmap)
+ free_hotplug_page_range(page, PMD_SIZE);
+ continue;
+ }
+ WARN_ON(!pmd_table(pmd));
+ unmap_hotplug_pte_range(pmdp, addr, next, sparse_vmap);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_pud_range(pgd_t *pgdp, unsigned long addr,
+ unsigned long end, bool sparse_vmap)
+{
+ unsigned long next;
+ struct page *page;
+ pud_t *pudp, pud;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(pgdp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud));
+ if (pud_sect(pud)) {
+ page = sparse_vmap ? pud_page(pud) : NULL;
+ pud_clear(pudp);
+ flush_tlb_kernel_range(addr, next);
+ if (sparse_vmap)
+ free_hotplug_page_range(page, PUD_SIZE);
+ continue;
+ }
+ WARN_ON(!pud_table(pud));
+ unmap_hotplug_pmd_range(pudp, addr, next, sparse_vmap);
+ } while (addr = next, addr < end);
+}
+
+static void unmap_hotplug_range(unsigned long addr, unsigned long end,
+ bool sparse_vmap)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ unmap_hotplug_pud_range(pgdp, addr, next, sparse_vmap);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_pte_table(pmd_t *pmdp, unsigned long addr,
+ unsigned long end)
+{
+ pte_t *ptep, pte;
+
+ do {
+ ptep = pte_offset_kernel(pmdp, addr);
+ pte = READ_ONCE(*ptep);
+ WARN_ON(!pte_none(pte));
+ } while (addr += PAGE_SIZE, addr < end);
+}
+
+static void free_empty_pmd_table(pud_t *pudp, unsigned long addr,
+ unsigned long end)
+{
+ unsigned long next;
+ pmd_t *pmdp, pmd;
+
+ do {
+ next = pmd_addr_end(addr, end);
+ pmdp = pmd_offset(pudp, addr);
+ pmd = READ_ONCE(*pmdp);
+ if (pmd_none(pmd))
+ continue;
+
+ WARN_ON(!pmd_present(pmd) || !pmd_table(pmd) || pmd_sect(pmd));
+ free_empty_pte_table(pmdp, addr, next);
+ free_pte_table(pmdp, addr);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_pud_table(pgd_t *pgdp, unsigned long addr,
+ unsigned long end)
+{
+ unsigned long next;
+ pud_t *pudp, pud;
+
+ do {
+ next = pud_addr_end(addr, end);
+ pudp = pud_offset(pgdp, addr);
+ pud = READ_ONCE(*pudp);
+ if (pud_none(pud))
+ continue;
+
+ WARN_ON(!pud_present(pud) || !pud_table(pud) || pud_sect(pud));
+ free_empty_pmd_table(pudp, addr, next);
+ free_pmd_table(pudp, addr);
+ } while (addr = next, addr < end);
+}
+
+static void free_empty_tables(unsigned long addr, unsigned long end)
+{
+ unsigned long next;
+ pgd_t *pgdp, pgd;
+
+ do {
+ next = pgd_addr_end(addr, end);
+ pgdp = pgd_offset_k(addr);
+ pgd = READ_ONCE(*pgdp);
+ if (pgd_none(pgd))
+ continue;
+
+ WARN_ON(!pgd_present(pgd));
+ free_empty_pud_table(pgdp, addr, next);
+ free_pud_table(pgdp, addr);
+ } while (addr = next, addr < end);
+}
+
+static void remove_pagetable(unsigned long start, unsigned long end,
+ bool sparse_vmap)
+{
+ unmap_hotplug_range(start, end, sparse_vmap);
+ free_empty_tables(start, end);
+}
+#endif
+
#ifdef CONFIG_SPARSEMEM_VMEMMAP
#if !ARM64_SWAPPER_USES_SECTION_MAPS
int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
@@ -769,6 +1013,27 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
void vmemmap_free(unsigned long start, unsigned long end,
struct vmem_altmap *altmap)
{
+#ifdef CONFIG_MEMORY_HOTPLUG
+ /*
+ * FIXME: We should have called remove_pagetable(start, end, true).
+ * vmemmap and vmalloc virtual range might share intermediate kernel
+ * page table entries. Removing vmemmap range page table pages here
+ * can potentially conflict with a concurrent vmalloc() allocation.
+ *
+ * This is primarily because vmalloc() does not take init_mm ptl for
+ * the entire page table walk and its modification. Instead it just
+ * takes the lock while allocating and installing page table pages
+ * via [p4d|pud|pmd|pte]_alloc(). A concurrently vanishing page table
+ * entry via memory hot remove can cause vmalloc() kernel page table
+ * walk pointers to be invalid on the fly which can cause corruption
+ * or worse, a crash.
+ *
+ * To avoid this problem, let's not free empty page table pages for
+ * given vmemmap range being hot-removed. Just unmap and free the
+ * range instead.
+ */
+ unmap_hotplug_range(start, end, true);
+#endif
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
@@ -1060,10 +1325,18 @@ int p4d_free_pud_page(p4d_t *p4d, unsigned long addr)
}
#ifdef CONFIG_MEMORY_HOTPLUG
+static void __remove_pgd_mapping(pgd_t *pgdir, unsigned long start, u64 size)
+{
+ unsigned long end = start + size;
+
+ WARN_ON(pgdir != init_mm.pgd);
+ remove_pagetable(start, end, false);
+}
+
int arch_add_memory(int nid, u64 start, u64 size,
struct mhp_restrictions *restrictions)
{
- int flags = 0;
+ int ret, flags = 0;
if (rodata_full || debug_pagealloc_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
@@ -1071,9 +1344,14 @@ int arch_add_memory(int nid, u64 start, u64 size,
__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
size, PAGE_KERNEL, __pgd_pgtable_alloc, flags);
- return __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
+ ret = __add_pages(nid, start >> PAGE_SHIFT, size >> PAGE_SHIFT,
restrictions);
+ if (ret)
+ __remove_pgd_mapping(swapper_pg_dir,
+ __phys_to_virt(start), size);
+ return ret;
}
+
void arch_remove_memory(int nid, u64 start, u64 size,
struct vmem_altmap *altmap)
{
@@ -1081,14 +1359,8 @@ void arch_remove_memory(int nid, u64 start, u64 size,
unsigned long nr_pages = size >> PAGE_SHIFT;
struct zone *zone;
- /*
- * FIXME: Cleanup page tables (also in arch_add_memory() in case
- * adding fails). Until then, this function should only be used
- * during memory hotplug (adding memory), not for memory
- * unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be
- * unlocked yet.
- */
zone = page_zone(pfn_to_page(start_pfn));
__remove_pages(zone, start_pfn, nr_pages, altmap);
+ __remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
}
#endif
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..47230ebdcb01 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1122,6 +1122,7 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
* PFN_SECTION_SHIFT pfn to/from section number
*/
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
+#define PA_SECTION_SIZE (1UL << PA_SECTION_SHIFT)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define NR_MEM_SECTIONS (1UL << SECTIONS_SHIFT)
diff --git a/mm/Kconfig b/mm/Kconfig
index 56cec636a1fc..7c980f483a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -677,7 +677,7 @@ config DEV_PAGEMAP_OPS
config HMM_MIRROR
bool "HMM mirror CPU page table into a device page table"
- depends on (X86_64 || PPC64)
+ depends on (X86_64 || PPC64 || ARM64)
depends on MMU && 64BIT
select MMU_NOTIFIER
help
--
2.17.1
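
The two-pass teardown described in the commit message (first unmap the leaf entries, then free a table page only if every one of its entries is empty) can be illustrated with a single toy table. This is a sketch under simplifying assumptions: `struct toy_table`, `toy_unmap_range()` and `toy_free_if_empty()` are hypothetical names, there is only one table level, and TLB invalidation is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 8

/* Hypothetical single-level table: one flag per entry plus a "freed" marker. */
struct toy_table {
	bool mapped[TOY_ENTRIES];
	bool freed;
};

/* Pass 1: clear every leaf entry in [start, end). */
static void toy_unmap_range(struct toy_table *t, size_t start, size_t end)
{
	assert(t && start <= end && end <= TOY_ENTRIES);
	for (size_t i = start; i < end; i++)
		t->mapped[i] = false;
}

/* Pass 2: free the table page only if no entry is still valid. */
static bool toy_free_if_empty(struct toy_table *t)
{
	for (size_t i = 0; i < TOY_ENTRIES; i++) {
		if (t->mapped[i])
			return false; /* bail out: table is still in use */
	}
	t->freed = true;
	return true;
}

/* Removing half of the range must not free the table; removing all of it may. */
static void toy_teardown_examples(void)
{
	struct toy_table t = { .freed = false };

	for (size_t i = 0; i < TOY_ENTRIES; i++)
		t.mapped[i] = true;

	toy_unmap_range(&t, 0, 4);
	assert(!toy_free_if_empty(&t)); /* entries 4..7 are still mapped */

	toy_unmap_range(&t, 4, TOY_ENTRIES);
	assert(toy_free_if_empty(&t));  /* now every entry is empty */
	assert(t.freed);
}
```

The first assertion corresponds to the partially filled table case mentioned in the message: a range that does not span the whole table page leaves the page allocated.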


@@ -0,0 +1,49 @@
From c7ec155ec5e0f573e9c3cc4eb38d47543a2f1e81 Mon Sep 17 00:00:00 2001
From: Sebastien Boeuf <sebastien.boeuf@intel.com>
Date: Thu, 13 Feb 2020 08:50:38 +0100
Subject: [PATCH] net: virtio_vsock: Fix race condition between bind and listen
Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and will
effectively never recover.

This issue happens when the guest side starts binding to the socket,
which inserts a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued an RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet, virtio_transport_recv_pkt(), will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't have changed to TCP_LISTEN yet, which leads the code to take
the default case. This default case only frees the buffer,
while it should also respond to the host side by sending a packet on
its TX queue.

To fix this unfortunate chain of events simply, whenever the default
case is entered, and because at this stage we know the host side is
waiting for an answer, we must send back a packet containing the
operation VIRTIO_VSOCK_OP_RST.
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
---
net/vmw_vsock/virtio_transport_common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 6f1a8aff65c5..0b6fb687a3e0 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1048,6 +1048,7 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
virtio_transport_free_pkt(pkt);
break;
default:
+ (void)virtio_transport_reset_no_sock(pkt);
virtio_transport_free_pkt(pkt);
break;
}
--
2.20.1


@@ -0,0 +1,39 @@
From ac36d37e943635fc072e9d4f47e40a48fbcdb3f0 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <arjan@linux.intel.com>
Date: Wed, 9 Oct 2019 15:04:33 +0200
Subject: ACPI: Always build evged in
Although the Generic Event Device is a hardware-reduced
platform device in principle, it should not be restricted to
ACPI_REDUCED_HARDWARE_ONLY.

Kernels supporting both fixed and hardware-reduced ACPI platforms
should be able to probe the GED when dynamically detecting that a
platform is hardware-reduced. For that, the driver must be
unconditionally built in.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
drivers/acpi/Makefile | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index 5d361e4e3405..ef1ac4d127da 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -48,7 +48,7 @@ acpi-y += acpi_pnp.o
acpi-$(CONFIG_ARM_AMBA) += acpi_amba.o
acpi-y += power.o
acpi-y += event.o
-acpi-$(CONFIG_ACPI_REDUCED_HARDWARE_ONLY) += evged.o
+acpi-y += evged.o
acpi-y += sysfs.o
acpi-y += property.o
acpi-$(CONFIG_X86) += acpi_cmos_rtc.o
--
cgit 1.2-0.3.lf.el7