From 5b514177b04ee70ca0144f508df07299fe2426e0 Mon Sep 17 00:00:00 2001 From: "James O. D. Hunt" Date: Fri, 18 Jun 2021 09:40:03 +0100 Subject: [PATCH] docs: Add tracing proposals doc Create a document summarising the tracing design proposals from PR #1937. Fixes: #2061. Signed-off-by: bin Signed-off-by: James O. D. Hunt --- docs/design/README.md | 4 + docs/design/proposals/README.md | 5 + docs/design/proposals/tracing-proposals.md | 213 +++++++++++++++++++++ 3 files changed, 222 insertions(+) create mode 100644 docs/design/proposals/README.md create mode 100644 docs/design/proposals/tracing-proposals.md diff --git a/docs/design/README.md b/docs/design/README.md index b6b8c59fa..1a334453e 100644 --- a/docs/design/README.md +++ b/docs/design/README.md @@ -10,3 +10,7 @@ Kata Containers design documents: - [Host cgroups](host-cgroups.md) - [`Inotify` support](inotify.md) - [Metrics(Kata 2.0)](kata-2-0-metrics.md) + +--- + +- [Design proposals](proposals) diff --git a/docs/design/proposals/README.md b/docs/design/proposals/README.md new file mode 100644 index 000000000..8f1197e08 --- /dev/null +++ b/docs/design/proposals/README.md @@ -0,0 +1,5 @@ +# Design proposals + +Kata Containers design proposal documents: + +- [Kata Containers tracing](tracing-proposals.md) diff --git a/docs/design/proposals/tracing-proposals.md b/docs/design/proposals/tracing-proposals.md new file mode 100644 index 000000000..0853ffa21 --- /dev/null +++ b/docs/design/proposals/tracing-proposals.md @@ -0,0 +1,213 @@ +# Kata Tracing proposals + +## Overview + +This document summarises a set of proposals triggered by the +[tracing documentation PR][tracing-doc-pr]. + +## Required context + +This section explains some terminology required to understand the proposals. +Further details can be found in the +[tracing documentation PR][tracing-doc-pr]. + +### Agent trace mode terminology + +| Trace mode | Description | Use-case | +|-|-|-| +| Static | Trace agent from startup to shutdown | Entire lifespan | +| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" | + +### Agent trace type terminology + +| Trace type | Description | Use-case | +|-|-|-| +| isolated | traces all relate to single component | Observing lifespan | +| collated | traces "grouped" (runtime+agent) | Understanding component interaction | + +### Container lifespan + +| Lifespan | trace mode | trace type | +|-|-|-| +| short-lived | static | collated if possible, else isolated? | +| long-running | dynamic | collated? (to see interactions) | + +## Original plan for agent + +- Implement all trace types and trace modes for agent. + +- Why? + - Maximum flexibility. + + > **Counterargument:** + > + > Due to the intrusive nature of adding tracing, we have + > learnt that landing small incremental changes is simpler and quicker! + + - Compatibility with [Kata 1.x tracing][kata-1x-tracing]. + + > **Counterargument:** + > + > Agent tracing in Kata 1.x was extremely awkward to setup (to the extent + > that it's unclear how many users actually used it!) + > + > This point, coupled with the new architecture for Kata 2.x, suggests + > that we may not need to supply the same set of tracing features (in fact + > they may not make sense)). + +## Agent tracing proposals + +### Agent tracing proposal 1: Don't implement dynamic trace mode + +- All tracing will be static. + +- Why? + - Because dynamic tracing will always be "partial" + + > In fact, not only would it be only a "snapshot" of activity, it may not + > even be possible to create a complete "trace transaction". If this is + > true, the trace output would be partial and would appear "unstructured". + +### Agent tracing proposal 2: Simplify handling of trace type + +- Agent tracing will be "isolated" by default. +- Agent tracing will be "collated" if runtime tracing is also enabled. + +- Why? + - Offers a graceful fallback for agent tracing if runtime tracing disabled. + - Simpler code! + +## Questions to ask yourself (part 1) + +- Are your containers long-running or short-lived? + +- Would you ever need to turn on tracing "briefly"? + - If "yes", is a "partial trace" useful or useless? + + > Likely to be considered useless as it is a partial snapshot. + > Alternative tracing methods may be more appropriate to dynamic + > OpenTelemetry tracing. + +## Questions to ask yourself (part 2) + +- Are you happy to stop a container to enable tracing? + If "no", dynamic tracing may be required. + +- Would you ever want to trace the agent and the runtime "in isolation" at the + same time? + - If "yes", we need to fully implement `trace_mode=isolated` + + > This seems unlikely though. + +## Trace collection + +The second set of proposals affect the way traces are collected. + +### Motivation + +Currently: + +- The runtime sends trace spans to Jaeger directly. +- The agent will send trace spans to the [`trace-forwarder`][trace-forwarder] component. +- The trace forwarder will send trace spans to Jaeger. + +Kata agent tracing overview: + +``` ++-------------------------------------------+ +| Host | +| | +| +-----------+ | +| | Trace | | +| | Collector | | +| +-----+-----+ | +| ^ +--------------+ | +| | spans | Kata VM | | +| +-----+-----+ | | | +| | Kata | spans | +-----+ | | +| | Trace |<-----------------|Kata | | | +| | Forwarder | VSOCK | |Agent| | | +| +-----------+ Channel | +-----+ | | +| +--------------+ | ++-------------------------------------------+ +``` + +Currently: + +- If agent tracing is enabled but the trace forwarder is not running, + the agent will error. + +- If the trace forwarder is started but Jaeger is not running, + the trace forwarder will error. + +### Goals + +- The runtime and agent should: + - Use the same trace collection implementation. + - Use the most the common configuration items. + +- Kata should should support more trace collection software or `SaaS` + (for example `Zipkin`, `datadog`). + +- Trace collection should not block normal runtime/agent operations + (for example if `vsock-exporter`/Jaeger is not running, Kata Containers should work normally). + +### Trace collection proposals + +#### Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy + +Kata runtime/agent all send spans to trace forwarder, and the trace forwarder, +acting as a tracing proxy, sends all spans to a tracing back-end, such as Jaeger or `datadog`. + +**Pros:** + +- Runtime/agent will be simple. +- Could update trace collection target while Kata Containers are running. + +**Cons:** + +- Requires the trace forwarder component to be running (that is a pressure to operation). + +#### Trace collection proposal 2: Send spans to collector directly from runtime/agent + +Send spans to collector directly from runtime/agent, this proposal need +network accessible to the collector. + +**Pros:** + +- No additional trace forwarder component needed. + +**Cons:** + +- Need more code/configuration to support all trace collectors. + +## Future work + +- We could add dynamic and fully isolated tracing at a later stage, + if required. + +## Further details + +- See the new [GitHub project](https://github.com/orgs/kata-containers/projects/28). +- [kata-containers-tracing-status](https://gist.github.com/jodh-intel/0ee54d41d2a803ba761e166136b42277) gist. +- [tracing documentation PR][tracing-doc-pr]. + +## Summary + +### Time line + +- 2021-07-01: A summary of the discussion was + [posted to the mail list](http://lists.katacontainers.io/pipermail/kata-dev/2021-July/001996.html). +- 2021-06-22: These proposals were + [discussed in the Kata Architecture Committee meeting](https://etherpad.opendev.org/p/Kata_Containers_2021_Architecture_Committee_Mtgs). +- 2021-06-18: These proposals where + [announced on the mailing list](http://lists.katacontainers.io/pipermail/kata-dev/2021-June/001980.html). + +### Outcome + +- Nobody opposed the agent proposals, so they are being implemented. +- The trace collection proposals are still being considered. + +[kata-1x-tracing]: https://github.com/kata-containers/agent/blob/master/TRACING.md +[trace-forwarder]: /src/trace-forwarder +[tracing-doc-pr]: https://github.com/kata-containers/kata-containers/pull/1937