[lttng-dev] Unexport of kvm_x86_ops vs tracer modules

Fri Apr 8 14:06:53 EDT 2022

----- On Apr 8, 2022, at 12:24 PM, Paolo Bonzini pbonzini at redhat.com wrote:

> On 4/8/22 17:36, Mathieu Desnoyers wrote:
>> LTTng is an out-of-tree kernel module, which currently relies on the export.
>> Indeed, arch/x86/kvm/x86.c exports a set of tracepoints to kernel modules, e.g.:
>> 
>> EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry)
>> 
>> But any probe implementation hooking on that tracepoint would need kvm_x86_ops
>> to translate the struct kvm_vcpu * into meaningful tracing data.
>> 
>> I could work around this on my side in ugly ways, but I would like to discuss
>> how kernel module tracers are expected to implement kvm event probes without
>> the kvm_x86_ops symbol?
> 
> The conversion is done in the TP_fast_assign snippets, which are part of
> kvm.ko and therefore do not need the export.  As I understand it, the
> issue is that LTTng cannot use the TP_fast_assign snippets, because they
> are embedded in the trace_event_raw_event_* symbols?

Indeed, the fact that the TP_fast_assign snippets are embedded in the
trace_event_raw_event_* symbols is an issue for LTTng. This ties those
snippets to ftrace.

AFAIK, TP_fast_assign copies directly into the ftrace ring buffer, and only
afterwards are things like dynamic filters applied, which then "uncommit" the
event if need be (and if possible). Also, TP_fast_assign is tied to the
ftrace ring buffer event layout. The fact that TP_STRUCT__entry() (description)
and TP_fast_assign() (open-coded C) are separate fields really targets a
use-case where all data is serialized to a ring buffer.

In LTTng, the event fields are made available to a filter interpreter prior to
being copied into LTTng's ring buffer. This is made possible by implementing
our own LTTNG_TRACEPOINT_EVENT code generation headers. In addition, we have
recently released an event notification mechanism (LTTng 2.13) which captures
specific event fields to send with an immediate notification (thus bypassing the
tracer buffering). We are also currently working on an LTTng trace hit counter
mechanism, which aggregates through per-cpu counters and does not even allocate
a ring buffer.
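
To illustrate the ordering described above (fields exposed to a filter first,
copied into a buffer only if the filter accepts them), here is a minimal
userspace sketch. It is not LTTng code: every name in it (demo_fields,
demo_ring, demo_probe, ...) is hypothetical, and the "ring buffer" is just a
flat byte array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event payload; the field names are illustrative only. */
struct demo_fields {
	unsigned int exit_reason;
	unsigned long guest_rip;
};

/* Hypothetical filter signature: sees the fields before any copy. */
typedef bool (*demo_filter_fn)(const struct demo_fields *f);

/* Flat byte array standing in for a tracer ring buffer. */
struct demo_ring {
	unsigned char data[256];
	size_t used;
};

static bool demo_filter_nonzero_reason(const struct demo_fields *f)
{
	return f->exit_reason != 0;
}

/* Returns true when the event was written into the buffer. */
static bool demo_probe(struct demo_ring *rb, demo_filter_fn filter,
		unsigned int exit_reason, unsigned long guest_rip)
{
	struct demo_fields f = {
		.exit_reason = exit_reason,
		.guest_rip = guest_rip,
	};

	/* Filter first: a rejected event never touches the buffer. */
	if (filter && !filter(&f))
		return false;
	/* Drop the event when it does not fit. */
	if (rb->used + sizeof(f) > sizeof(rb->data))
		return false;
	memcpy(rb->data + rb->used, &f, sizeof(f));
	rb->used += sizeof(f);
	return true;
}

/* Exercise both the rejected and the accepted path. */
void demo_self_check(void)
{
	struct demo_ring rb = { .used = 0 };

	assert(!demo_probe(&rb, demo_filter_nonzero_reason, 0, 0x1000));
	assert(rb.used == 0);	/* nothing to "uncommit" */
	assert(demo_probe(&rb, demo_filter_nonzero_reason, 3, 0x1000));
	assert(rb.used == sizeof(struct demo_fields));
}
```

The point of the sketch is only that a rejected event costs no buffer space
and needs no discard step. A notification or counter consumer could use the
same on-stack fields without any buffer at all.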

For those reasons, LTTng reimplements its own tracepoint probe callbacks. All
those sit within LTTng kernel modules, which means we currently need the exported
kvm_x86_ops callbacks.

> We cannot do the extraction before calling trace_kvm_exit, because it's
> expensive.

I suspect that extracting relevant data prior to calling trace_kvm_exit
is too expensive because it cannot be skipped when the tracepoint is
disabled. This is because trace_kvm_exit() is a static inline function,
and the check to figure out if the event is enabled is within that function.
Unfortunately, even if the tracepoint is disabled, the side-effects of the
parameters passed to trace_kvm_exit() must happen.
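
A small userspace model of that problem, as a sketch only: the names below
(demo_trace_exit, demo_extract_exit_reason, ...) are hypothetical stand-ins
for the tracepoint and for an expensive kvm_x86_ops-style translation, not
kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not kernel APIs. */
static bool demo_tp_enabled;		/* models the tracepoint enabled state */
static unsigned int demo_extract_calls;	/* how many times extraction ran */
static unsigned int demo_events;	/* how many events were recorded */

/* Models an expensive translation, e.g. one going through kvm_x86_ops. */
static unsigned int demo_extract_exit_reason(void)
{
	demo_extract_calls++;
	return 42;
}

/* Models trace_kvm_exit(): the enabled check lives inside the inline. */
static inline void demo_trace_exit(unsigned int exit_reason)
{
	(void) exit_reason;
	if (demo_tp_enabled)
		demo_events++;
}

/* Call site passing the extracted value as an argument. */
static void demo_call_site(void)
{
	demo_trace_exit(demo_extract_exit_reason());
}

/* Exercise the call site with the tracepoint disabled. */
void demo_self_check(void)
{
	demo_tp_enabled = false;
	demo_extract_calls = 0;
	demo_events = 0;
	demo_call_site();
	assert(demo_events == 0);		/* nothing was recorded... */
	assert(demo_extract_calls == 1);	/* ...yet extraction still ran */
}
```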

I've solved this in LTTng-UST by implementing an lttng_ust_tracepoint()
macro, which basically "lifts" the tracepoint enabled check before the
evaluation of the arguments.

You could achieve something similar by using trace_kvm_exit_enabled() in the
kernel like so:

  if (trace_kvm_exit_enabled())
      trace_kvm_exit(....);

This would skip evaluation of the argument side-effects when the tracepoint is
disabled.
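
The same idea can be wrapped in a macro so call sites cannot forget the guard.
The following userspace sketch uses hypothetical names throughout
(demo_tracepoint, demo_trace_exit_enabled, ...). It only models the pattern
and is not the actual lttng_ust_tracepoint() or kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* All names below are hypothetical stand-ins, not kernel or LTTng APIs. */
static bool demo_tp_enabled;
static unsigned int demo_extract_calls;
static unsigned int demo_events;

/* Models an expensive argument computation. */
static unsigned int demo_extract_exit_reason(void)
{
	demo_extract_calls++;
	return 42;
}

/* Models trace_<event>_enabled(). */
static inline bool demo_trace_exit_enabled(void)
{
	return demo_tp_enabled;
}

/* Models trace_<event>(). */
static inline void demo_trace_exit(unsigned int exit_reason)
{
	(void) exit_reason;
	if (demo_tp_enabled)
		demo_events++;
}

/*
 * Lift the enabled check in front of argument evaluation: the arguments
 * are macro text, so they are only evaluated inside the taken branch.
 */
#define demo_tracepoint(...)				\
	do {						\
		if (demo_trace_exit_enabled())		\
			demo_trace_exit(__VA_ARGS__);	\
	} while (0)

static void demo_call_site(void)
{
	demo_tracepoint(demo_extract_exit_reason());
}

/* Exercise the call site disabled, then enabled. */
void demo_self_check(void)
{
	demo_tp_enabled = false;
	demo_extract_calls = 0;
	demo_events = 0;
	demo_call_site();
	assert(demo_extract_calls == 0);	/* extraction skipped */

	demo_tp_enabled = true;
	demo_call_site();
	assert(demo_extract_calls == 1);	/* extraction ran once */
	assert(demo_events == 1);
}
```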

By doing that, when multiple tracers are attached to a kvm tracepoint, the
translation from pointer-to-internal-structure to meaningful fields would only
need to be done once when a tracepoint is hit. And this would remove the need
for using kvm_x86_ops callbacks from tracer probe functions.
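
As a rough userspace sketch of that last point, assuming a simplified probe
list rather than the real tracepoint infrastructure: the call site extracts
the field once, behind the enabled check, and each attached probe receives
the already-translated value. All names (demo_vcpu, demo_register_probe, ...)
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_PROBES 4

/* Hypothetical internal structure that probes should not need to decode. */
struct demo_vcpu {
	unsigned int raw_exit_reason;
};

/* Probes receive meaningful fields, not the internal structure. */
typedef void (*demo_probe_fn)(void *priv, unsigned int exit_reason);

struct demo_probe {
	demo_probe_fn fn;
	void *priv;
};

static struct demo_probe demo_probes[DEMO_MAX_PROBES];
static size_t demo_nr_probes;
static unsigned int demo_extract_calls;

static int demo_register_probe(demo_probe_fn fn, void *priv)
{
	if (demo_nr_probes == DEMO_MAX_PROBES)
		return -1;
	demo_probes[demo_nr_probes].fn = fn;
	demo_probes[demo_nr_probes].priv = priv;
	demo_nr_probes++;
	return 0;
}

/* Models the translation that would otherwise need kvm_x86_ops. */
static unsigned int demo_extract_exit_reason(const struct demo_vcpu *vcpu)
{
	demo_extract_calls++;
	return vcpu->raw_exit_reason & 0xffff;
}

/* Instrumented call site: check, extract once, then fan out. */
static void demo_vcpu_exit(const struct demo_vcpu *vcpu)
{
	unsigned int reason;
	size_t i;

	if (!demo_nr_probes)
		return;		/* "disabled": no extraction at all */
	reason = demo_extract_exit_reason(vcpu);
	for (i = 0; i < demo_nr_probes; i++)
		demo_probes[i].fn(demo_probes[i].priv, reason);
}

/* Trivial probe: store the field it was handed. */
static void demo_store_reason(void *priv, unsigned int exit_reason)
{
	*(unsigned int *) priv = exit_reason;
}

/* Two "tracers" attached to the same event. */
void demo_self_check(void)
{
	struct demo_vcpu vcpu = { .raw_exit_reason = 0x10030 };
	unsigned int seen_a = 0, seen_b = 0;

	demo_vcpu_exit(&vcpu);
	assert(demo_extract_calls == 0);	/* no probe, no extraction */

	assert(demo_register_probe(demo_store_reason, &seen_a) == 0);
	assert(demo_register_probe(demo_store_reason, &seen_b) == 0);
	demo_vcpu_exit(&vcpu);
	assert(demo_extract_calls == 1);	/* once for both probes */
	assert(seen_a == 0x30 && seen_b == 0x30);
}
```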

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com