Kata Tracing proposals

Overview

This document summarises a set of proposals triggered by the tracing documentation PR.

Required context

This section explains some terminology required to understand the proposals. Further details can be found in the tracing documentation PR.

Agent trace mode terminology

| Trace mode | Description | Use-case |
|------------|-------------|----------|
| Static | Trace agent from startup to shutdown | Entire lifespan |
| Dynamic | Toggle tracing on/off as desired | On-demand "snapshot" |

Agent trace type terminology

| Trace type | Description | Use-case |
|------------|-------------|----------|
| isolated | traces all relate to single component | Observing lifespan |
| collated | traces "grouped" (runtime+agent) | Understanding component interaction |

Container lifespan

| Lifespan | Trace mode | Trace type |
|----------|------------|------------|
| short-lived | static | collated if possible, else isolated? |
| long-running | dynamic | collated? (to see interactions) |

Original plan for agent

  • Implement all trace types and trace modes for agent.

  • Why?

    • Maximum flexibility.

      Counterargument:

      Due to the intrusive nature of adding tracing, we have learnt that landing small incremental changes is simpler and quicker!

    • Compatibility with Kata 1.x tracing.

      Counterargument:

      Agent tracing in Kata 1.x was extremely awkward to set up (to the extent that it's unclear how many users actually used it!).

      This point, coupled with the new architecture for Kata 2.x, suggests that we may not need to supply the same set of tracing features (in fact, they may not make sense).

Agent tracing proposals

Agent tracing proposal 1: Don't implement dynamic trace mode

  • All tracing will be static.

  • Why?

    • Because dynamic tracing will always be "partial"

      In fact, not only would it be only a "snapshot" of activity, it may not even be possible to create a complete "trace transaction". If this is true, the trace output would be partial and would appear "unstructured".

Agent tracing proposal 2: Simplify handling of trace type

  • Agent tracing will be "isolated" by default.

  • Agent tracing will be "collated" if runtime tracing is also enabled.

  • Why?

    • Offers a graceful fallback for agent tracing if runtime tracing is disabled.
    • Simpler code!
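
The rule in proposal 2 can be sketched as a single decision function. This is only an illustration of the proposed behaviour, not the agent's actual code: the names `TraceType` and `resolve_agent_trace_type` are hypothetical and introduced here for the example.

```python
from enum import Enum
from typing import Optional


class TraceType(Enum):
    """Hypothetical enum mirroring the two trace types described above."""
    ISOLATED = "isolated"
    COLLATED = "collated"


def resolve_agent_trace_type(agent_tracing: bool,
                             runtime_tracing: bool) -> Optional[TraceType]:
    """Return the agent trace type, or None if agent tracing is off.

    Hypothetical helper: it encodes only the rule stated in proposal 2.
    """
    if not agent_tracing:
        return None
    # Collated traces group agent spans with runtime spans, so they are
    # only possible when runtime tracing is enabled as well.
    if runtime_tracing:
        return TraceType.COLLATED
    # Graceful fallback: the agent still produces a trace on its own.
    return TraceType.ISOLATED


if __name__ == "__main__":
    print(resolve_agent_trace_type(True, True))
    print(resolve_agent_trace_type(True, False))
```

With this shape the user never selects a trace type explicitly. It follows from which components have tracing enabled, which is where the "simpler code" benefit comes from.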

Questions to ask yourself (part 1)

  • Are your containers long-running or short-lived?

  • Would you ever need to turn on tracing "briefly"?

    • If "yes", is a "partial trace" useful or useless?

      Likely to be considered useless as it is a partial snapshot. Alternative tracing methods may be more appropriate than dynamic OpenTelemetry tracing.

Questions to ask yourself (part 2)

  • Are you happy to stop a container to enable tracing? If "no", dynamic tracing may be required.

  • Would you ever want to trace the agent and the runtime "in isolation" at the same time?

    • If "yes", we need to fully implement the "isolated" trace type (trace_type=isolated)

      This seems unlikely though.

Trace collection

The second set of proposals affects the way traces are collected.

Motivation

Currently:

  • The runtime sends trace spans to Jaeger directly.
  • The agent sends trace spans to the trace-forwarder component.
  • The trace forwarder sends trace spans to Jaeger.

Kata agent tracing overview:

+-------------------------------------------+
| Host                                      |
|                                           |
| +-----------+                             |
| | Trace     |                             |
| | Collector |                             |
| +-----+-----+                             |
|       ^                  +--------------+ |
|       | spans            | Kata VM      | |
| +-----+-----+            |              | |
| | Kata      |    spans   |     +-----+  | |
| | Trace     |<-----------------|Kata |  | |
| | Forwarder |    VSOCK   |     |Agent|  | |
| +-----------+    Channel |     +-----+  | |
|                          +--------------+ |
+-------------------------------------------+

Currently:

  • If agent tracing is enabled but the trace forwarder is not running, the agent will error.

  • If the trace forwarder is started but Jaeger is not running, the trace forwarder will error.

Goals

  • The runtime and agent should:

    • Use the same trace collection implementation.
    • Share as many configuration items as possible.
  • Kata should support more trace collection software or SaaS (for example Zipkin, Datadog).

  • Trace collection should not block normal runtime/agent operations (for example if vsock-exporter/Jaeger is not running, Kata Containers should work normally).
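
The non-blocking goal above could be met with a bounded queue drained by a background thread, so that a missing collector costs dropped spans rather than a stalled runtime or agent. The following is a minimal sketch under that assumption. `NonBlockingExporter` and the `send` callable are hypothetical names, and the real components are not written in Python.

```python
import queue
import threading
from typing import Callable


class NonBlockingExporter:
    """Hypothetical exporter: never blocks or raises on the caller's path."""

    def __init__(self, send: Callable[[dict], None], max_pending: int = 1024):
        self._send = send  # assumed callable that delivers one span
        self._pending: "queue.Queue[dict]" = queue.Queue(maxsize=max_pending)
        self.dropped = 0   # spans discarded because the queue was full
        self.failed = 0    # spans the back-end could not accept
        threading.Thread(target=self._drain, daemon=True).start()

    def export(self, span: dict) -> bool:
        """Queue a span; return False (and drop it) if the queue is full."""
        try:
            self._pending.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> None:
        while True:
            span = self._pending.get()
            try:
                self._send(span)
            except Exception:
                # Collector or forwarder unavailable: count it and carry on.
                self.failed += 1
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        """Wait until every queued span has been attempted."""
        self._pending.join()


if __name__ == "__main__":
    def collector_down(span: dict) -> None:
        raise ConnectionError("collector not running")

    exporter = NonBlockingExporter(collector_down)
    exporter.export({"name": "create_container"})
    exporter.flush()
    print("failed:", exporter.failed, "dropped:", exporter.dropped)
```

The trade-off is that spans are lost silently when the back-end is down or slow. The two counters are there so that such loss can at least be reported.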

Trace collection proposals

Trace collection proposal 1: Send all spans to the trace forwarder as a span proxy

Both the Kata runtime and the agent send their spans to the trace forwarder. The trace forwarder, acting as a tracing proxy, then sends all spans to a tracing back-end, such as Jaeger or Datadog.

Pros:

  • Runtime/agent will be simple.
  • Could update trace collection target while Kata Containers are running.

Cons:

  • Requires the trace forwarder component to be running (an added operational burden).
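
To illustrate the second "pro" above, the sketch below shows a proxy whose back-end can be swapped while spans are still arriving. It is a toy model of the idea only. `SpanProxy`, `set_backend` and `forward` are hypothetical names, and the back-ends are plain Python callables standing in for Jaeger, Datadog and so on.

```python
import threading
from typing import Callable, Dict, List

Span = Dict[str, str]
Backend = Callable[[Span], None]


class SpanProxy:
    """Hypothetical forwarder: one entry point, replaceable back-end."""

    def __init__(self, backend: Backend):
        self._lock = threading.Lock()
        self._backend = backend

    def set_backend(self, backend: Backend) -> None:
        """Switch the collection target without restarting senders."""
        with self._lock:
            self._backend = backend

    def forward(self, span: Span) -> None:
        """Called for every span received from the runtime or the agent."""
        with self._lock:
            backend = self._backend
        # Call outside the lock so a slow back-end cannot block a switch.
        backend(span)


if __name__ == "__main__":
    first: List[Span] = []
    second: List[Span] = []

    proxy = SpanProxy(first.append)
    proxy.forward({"source": "runtime", "name": "create_sandbox"})

    # Retarget while "containers are running": senders are unaffected.
    proxy.set_backend(second.append)
    proxy.forward({"source": "agent", "name": "exec_process"})

    print(len(first), len(second))
```

Because the runtime and agent only ever talk to the proxy, back-end-specific code and configuration live in one place. This is also why the proxy has to be running for any spans to be collected.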

Trace collection proposal 2: Send spans to collector directly from runtime/agent

The runtime and agent send spans directly to the collector. This proposal requires the collector to be reachable over the network from both components.

Pros:

  • No additional trace forwarder component needed.

Cons:

  • Needs more code and configuration to support each trace collector.

Future work

  • We could add dynamic and fully isolated tracing at a later stage, if required.

Further details

Summary

Time line

Outcome

  • Nobody opposed the agent proposals, so they are being implemented.
  • The trace collection proposals are still being considered.