Introduce get/set iptable handling. We add a sandbox API for getting and
setting the IPTables within the guest. This routes it from sandbox
interface, through kata-agent, ultimately making requests to the guest
agent.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The documentation of the bufio package explicitly says
"Err returns the first non-EOF error that was encountered by the
Scanner."
When io.EOF happens, `Err()` will return `nil` and `Scan()` will return
`false`.
Fixes#4079
Signed-off-by: Rafael Fonseca <r4f4rfs@gmail.com>
Translate the volume path from host-known path to guest-known path
and forward the request to kata agent.
Fixes: #3454
Signed-off-by: Feng Wang <feng.wang@databricks.com>
With the Linux implementation of the FilesystemSharer interface, we can
now remove all host filesystem sharing code from kata_agent and keep it
where it belongs: sandbox.go.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
To resourcecontrol, and make it consistent with the fact that cgroups
are a Linux implementation of the ResourceController interface.
Fixes: #3601
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
We call it a ResourceController, and we make it not so Linux specific.
Now the Linux implementations is the cgroups one.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Pulling image is the most time-consuming step in the container lifecycle. This PR
introduse nydus to kata container, it can lazily pull image when container start. So it
can speed up kata container create and start.
Fixes#2724
Signed-off-by: luodaowen.backend <luodaowen.backend@bytedance.com>
We don't need to call NewNetwork() twice, and we can have the VM factory
case return immediatly. That makes the code more readable.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
And only have AddEndpoints/RemoveEndpoints for all cases (single
endpoint vs all of them, hotplug or not).
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
We are converting the Network structure into an interface, so that
different host OSes can have different networking implementations for
Kata.
One step into that direction is to rename all the Network structure
fields and methods to something that is less Linux networking namespace
specific. This will make the Network interface naming consistent.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
We are replacing the NetworkingNamespace structure with the Network
one, so we should have the hypervisor interface switching to it as well.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Remove unused parameters.
Reduce the number of parameters by deriving some of them (e.g. a
networking config) from their outer structure (e.g. a Sandbox
reference).
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
Endpoints creations, attachement and hotplug are bound to the networking
namespace described through the Network structure.
Making them Network methods is natural and simplifies the code.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
For simplicity sake, there should only be one networking structure per
sandbox, as opposed to two (Network and NetworkingNamespace) currently.
This commit start expanding the Network structure in order to eventually
make it the single representation of a virtcontainers sandbox
networking.
Signed-off-by: Samuel Ortiz <s.ortiz@apple.com>
There are software and hardware architectures which do not support
dynamically adjusting the CPU and memory resources associated with a
sandbox. For these, today, they rely on "default CPU" and "default
memory" configuration options for the runtime, either set by annotation
or by the configuration toml on disk.
In the case of a single container (launched by ctr, or something like
"docker run"), we could allow for sizing the VM correctly, since all of
the information is already available to us at creation time.
In the sandbox / pod container case, it is possible for the upper layer
container runtime (ie, containerd or crio) could send a specific
annotation indicating the total workload resource requirements
associated with the sandbox creation request.
In the case of sizing information not being provided, we will follow
same behavior as today: start the VM with (just) the default CPU/memory.
If this information is provided, we'll track this as Workload specific
resources, and track default sizing information as Base resources. We
will update the hypervisor configuration to utilize Base+Workload
resources, thus starting the VM with the appropriate amount of CPU and
memory.
In this scenario (we start the VM with the "right" amount of
CPU/Memory), we do not want to update the VM resources when containers
are added, or adjusted in size.
This functionality is introduced behind a configuration flag,
`static_sandbox_resource_mgmt`. This is defaulted to false for all
configurations except Firecracker, which is set to true.
This'll greatly improve UX for folks who are utilizing
Kata with a VMM or hardware architecture that doesn't support hotplug.
Note, users will still be unable to do in place vertical pod autoscaling
or other dynamic container/pod sizing with this enabled.
Fixes: #3264
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
The OCI container spec specifies a limit of -1 signifies
unlimited memory. Update the sandbox memory calculator
to reflect this part of the spec.
Fixes: #3512
Signed-off-by: Braden Rayhorn <bradenrayhorn@fastmail.com>
When Sandbox AddInterface() is called, it may fail after endpoint.HotAttach,
we'd better rollback and call save() in the end.
Fixes: #3419
Signed-off-by: yangfeiyu <yangfeiyu20102011@163.com>
Today the hypervisor code in vc relies on persist pkg for two things:
1. To get the VM/run store path on the host filesystem,
2. For type definition of the Load/Save functions of the hypervisor
interface.
For (1), we can simply remove the store interface from the hypervisor
config and replace it with just the path, since this is all we really
need. When we create a NewHypervisor structure, outside of the
hypervisor, we can populate this path.
For (2), rather than have the persist pkg define the structure, let's
let the hypervisor code (soon to be pkg) define the structure. persist
API already needs to call into hypervisor anyway; let's allow us to
define the structure.
We'll probably want to look at following similar pattern for other parts
of vc that we want to make independent of the persist API.
In doing this, we started an initial hypervisors pkg, to hold these
types (avoid a circular dependency between virtcontainers and persist
pkg). Next step will be to remove all other dependencies and move the
hypervisor specific code into this pkg, and out of virtcontaienrs.
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
There are two types packages under virtcontainers, and the
virtcontainers/pkg/types has a few codes, merging them into
one can make it easy for outstanding and using types package.
Fixes: #3031
Signed-off-by: bin <bin@hyper.sh>
As github.com/containerd/cgroups doesn't support scope
units which are essential in some cases lets create
the cgroups manually and load it trough the cgroups
api
This is currently done only when there's single sandbox
cgroup (sandbox_cgroup_only=true), otherwise we set it
as static cgroup path as it used to be (until a proper
soultion for overhead cgroup under systemd will be
suggested)
Fixes: #2868
Signed-off-by: Snir Sheriber <ssheribe@redhat.com>
In later versions of OpenTelemetry label.Any() is deprecated. Create
addTag() to handle type assertions of values. Change AddTag() to
variadic function that accepts multiple keys and values.
Fixes#2547
Signed-off-by: Chelsea Mafrica <chelsea.e.mafrica@intel.com>
In order to support DPDK workloads, we need to change the way VFIO devices
will be handled in Kata containers. However, the current method, although
it is not remotely OCI compliant has real uses. Therefore, introduce a new
runtime configuration field "vfio_mode" to control how VFIO devices will be
presented to the container.
We also add a new sandbox annotation -
io.katacontainers.config.runtime.vfio_mode - to override this on a
per-sandbox basis.
For now, the only allowed value is "guest-kernel" which refers to the
current behaviour where VFIO devices added to the container will be bound
to whatever driver in the VM kernel claims them.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Last of a series of commits to export the top level
hypervisor generic methods.
s/createSandbox/CreateVM
Fixes#2880
Signed-off-by: Manohar Castelino <mcastelino@apple.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
Export commonly used hypervisor fields and utility functions.
These need to be exposed to allow the hypervisor to be consumed
externally.
Note: This does not change the hypervisor interface definition.
Those changes will be separate commits.
Signed-off-by: Manohar Castelino <mcastelino@apple.com>
Display a pseudo path to the sandbox socket in the output of
`kata-runtime env` for those hypervisors that use Hybrid VSOCK.
The path is not a real path since the command does not create a sandbox.
The output includes a `{ID}` tag which would be replaced with the real
sandbox ID (name) when the sandbox was created.
This feature is only useful for agent tracing with the trace forwarder
where the configured hypervisor uses Hybrid VSOCK.
Note that the features required a new `setConfig()` method to be added
to the `hypervisor` interface. This isn't normally needed as the
specified hypervisor configuration passed to `setConfig()` is also
passed to `createSandbox()`. However the new call is required by
`kata-runtime env` to display the correct socket path for Firecracker.
The new method isn't wholly redundant for the main code path though as
it's now used by each hypervisor's `createSandbox()` call.
Signed-off-by: James O. D. Hunt <james.o.hunt@intel.com>
The new API is based on containerd's cgroups package.
With that conversion we can simpligy the virtcontainers sandbox code and
also uniformize our cgroups external API dependency. We now only depend
on containerd/cgroups for everything cgroups related.
Depends-on: github.com/kata-containers/tests#3805
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Signed-off-by: Eric Ernst <eric_ernst@apple.com>
This is a simplification of the host cgroup handling by partitioning the
host cgroups into 2: A sandbox cgroup and an overhead cgroup.
The sandbox cgroup is always created and initialized. The overhead
cgroup is only available when sandbox_cgroup_only is unset, and is
unconstrained on all controllers. The goal of having an overhead cgroup
is to be more flexible on how we manage a pod overhead. Having such
cgroup will allow for setting a fixed overhead per pod, for a subset of
controllers, while at the same time not having the pod being accounted
for those resources.
When sandbox_cgroup_only is not set, we move all non vCPU threads
to the overhead cgroup and let them run unconstrained. When it is set,
all pod related processes and threads will run in the sandbox cgroup.
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>
Regardless of the sandbox_cgroup_only setting, we create the sandbox
cgroup manager and set the sandbox cgroup path at the same time.
Without doing this, the hypervisor constraint routine is mostly a NOP as
the sandbox state cgroup path is not initialized.
Fixes#2184
Signed-off-by: Samuel Ortiz <samuel.e.ortiz@protonmail.com>