[lttng-dev] [RFC PATCH lttng-modules] Fix: nmi-safe clock on 32-bit systems

Tue Feb 14 20:44:16 UTC 2017

----- On Feb 14, 2017, at 3:02 PM, Trent Piepho tpiepho at kymetacorp.com wrote:

> On Thu, 2017-02-09 at 20:51 -0500, mathieu.desnoyers wrote:
>> On 32-bit systems, the current algorithm assume to have one clock read
>> per 32-bit LSB overflow period, which is not guaranteed. It also has an
>> issue on the first clock reads after module load, because the initial
>> value for the last LSB is 0. It can cause the time to stay stuck at the
>> same value for a few seconds at the beginning of the trace.
> 
> I tested this on my 32-bit ARM system and it has solved the issue with
> timestamp being fixed at zero until the first time the TSC wraps.
> 
> I'm not sure how to properly test that NMIs do not generate any reverse
> timestamps.

NMIs don't exist on ARM. They have FIQ which are close (fast interrupts),
but no NMIs. It would be a good thing to ensure that either FIQs are disabled
when the OS updates the clock, or that no tracing is ever done in FIQ
handlers. I don't think FIQs are disabled when updating the clock, but
I suspect that no tracing is done in FIQ handlers.

Also, lttng-modules, with that fix on ARM, just discard events that
would come from NMI context (and since nmi context don't exist on
ARM, it never discards anything).

> 
> 
>> Fix this by using the non-nmi-safe clock source on 32-bit systems,
>> except for x86 systems that support double-word cmpxchg (cmpxchg8b).
> 
> Is there a reason to restrict this to x86 systems?  ARM supports 64-bit
> cmpxchg too, and indeed that's what I was testing.

Since ARM don't have NMI, I don't see why we should care about making
the clock source nmi-safe, especially since the algorithm adds a
cmpxchg64_local (slight performance overhead). This might end up
being useful if we ever want to trace fast interrupt handlers though.

> 
>> +
>>  #if (LINUX_VERSION_CODE >= KERNEL_VERSION(3,17,0) \
>> +	&& (BITS_PER_LONG == 64 || CONFIG_X86_CMPXCHG64) \
>>  	&& !defined(LTTNG_CLOCK_NMI_SAFE_BROKEN))
> 
> This fails to compile on ARM, as CONFIG_X86_CMPXCHG64 is not defined.
> From what I can see, all ARM have cmpxchg64_local() so I modified this
> to allow ARM support, used defined(), etc.
> 
> It also appears that even if CONFIG_X86_CMPXCHG64 is not defined on x86,
> there is still support via another definition of cmpxchg64_local() using
> cmpxchg8b_emu.  Or does this version lack some atomic in the face of NMI
> ability that the cmpxchg8b instruction would have?

Indeed, cmpxchg8b_emu disables interrupts on UP within its emulation, which
is not safe against NMIs.

I am tempted to only use the nmi-safe clock source on architectures
where it matters, and ARM32/64 don't have NMIs. So I would use
the usual seqlock-protected clock source on ARM32/64, unless we
end up figuring out that FIQ handlers are instrumented with
tracepoints.

For x86 32/64, when 64-bit local cmpxchg is natively available,
I would like to use the nmi-safe clock source. However, using
double-word CAS on x86-32 is something that would be a new behavior
in lttng-modules, and I don't want to introduce this as a fix into
the stable branches. This is why the commit that landed into master
currently simply revert to the non-nmi-safe clock source on x86-32,
and don't trace NMIs on x86-32 for now.

I would be OK to introduce use of double-word CAS on x86-32 as a new
feature for upcoming lttng-modules versions (2.10+). But I would only use it
when CONFIG_X86_CMPXCHG64 is defined: it tells us that the architecture
does have a 64-bit cmpxchg instruction.

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com