[lttng-dev] background information about LTTng timestamps

Thu Jan 26 11:50:10 EST 2012

Hi Sébastien,

First thing first: thanks a lot for pointing me out to the fact that I
was recently mixed up between the LTTng 0.x and 2.0 schemes! :-) The
late clock fixes in LTTng-UST 2.0 prereleases, even though they fix the
problem symptoms, do not "fix" the problem, they just make it go away by
always using a 64-bit timestamp all the time.

Let's first start with a bit of background on the problem:

In LTTng, we store current time values, not delta, because we do not
want to add arbitrary dependencies between consecutive events in the
trace buffer. However, 32-bit overflow can occur every 4 minutes or so
on a system with a clock running at 1GHz. (which is the case of
CLOCK_MONOTONIC).

If there are no events recorded in a stream during this time-frame, more
than 4 minutes will elapse, but the 32-bit overflow will cause the
delta observed by comparing the consecutive 32-low-order bits to be
smaller than 4 minutes (it would be recording the new time-stamp modulo
4 minutes).

The Babeltrace machinery detects single overflows, as stated in CTF
spec:

"For a N-bit integer type referring to a clock, if the integer overflows
compared to the N low order bits of the clock prior value, then it is
assumed that one, and only one, overflow occurred. It is therefore
important that events encoding time on a small number of bits happen
frequently enough to detect when more than one N-bit overflow occurs."

A bit of history: in earlier LTTng versions (0.x), instead of using a
storage size of 64-bit, which allows very large clock values,
we had a periodical timer that wrote an event into each trace stream to
guarantee that there is at most one 32-bit overflow between events.
However, this approach requires to wake up the system periodically, and
is not well suited for user-space tracing in a scheme where each process
is responsible for its own buffers: this would require to add timers
within each process, which I really want to avoid, due to the large
intrusiveness of this approach on system behavior (in other words:
signals are a can of worms). ;-)

In LTTng 2.0/LTTng-UST 2.0, the scheme differs. It does not require a
timer anymore, so this whole problem goes away. Let me explain the 2.0
scheme a bit more:

* On 64-bit architectures:

- we keep a per-stream 64-bit last_tsc value (see lttng-ust
  libringbuffer/frontend_internal.h save_last_tsc() surroundings)
- at each event, we store the full 64-bit timestamp within this
  variable. Before we store it, we check if it has a delta of more than
  a single N bits (e.g. 27 bits) overflow from the previous value.
- if more than one N-bit overflow is detected, we use a full 64-bit
  timestamp for the event.

  -->  actually the check in the code is probably not as strict as it
       could be; we have:

        if (caa_unlikely((tsc - v_read(config, &buf->last_tsc))
                     >> config->tsc_bits))
                return 1;
        else
                return 0;

  I think I could change it to:

        if (caa_unlikely((tsc - v_read(config, &buf->last_tsc))
                     >> (config->tsc_bits + 1)))

Because the current incarnation will require a full 64-bit timestamp
storage for single-bit overflows (which can be detected by the trace
reader). Thoughts ?

* On 32-bit architectures:

- idea very similar to the 64-bit architecture case, but we cannot
  do a fast 64-bit atomic read nor write. Therefore, we save only the
  high-order bits that are needed to detect the overflow (we can
  discard the N low-order bits).
- as for the 64-bit case, I think I could make the comparison a bit
  more strict, so single N bit overflows would not require a full 64-bit
  timestamp.

* If timestamps ever go backwards when moving forward in the stream

They are going to be detected as a "more than a single N-bit overflow",
and thus will use a 64-bit storage to record the information.

* Synchronization

We have to notice here that the last_tsc update/read are performed in
two distinct atomic operations, and that it is entirely possible (and
allowed) for events to come in nested on top of the tracing code (coming
from nmi/irq/softirq in kernel code, coming from signal handlers, or
from another thread, in user-space) and update last_tsc underneath us.

The key property of this scheme is that if such a scenario occurs, it
should _never_ cause a missed more-than-1-N-bit-overflow between two
consecutive events. However, it is perfectly valid to use 64-bit of
space even if not required, if it help handling some rare race
conditions.

So if we have a sequence like this at kernel-level:

CPU A

- Thread context: event A triggered, executing probe
- do
  - read current buffer write position
  - read current TSC
  - check if it overflows last_tsc
- while space reservation fails.
           - IRQ: event B triggered
             - reserve space, update last tsc, write event, commit.
- update last_tsc

As you notice, in this race, if event A needed to use a 64-bit
timestamp as storage due to more than 1 N bit overflow since the
previous one, the nested IRQ event B will also use a 64-bit timestamp,
even though it directly follows a timestamp that is close enough not to
require such a large storage. This race is "ok" because we end up
providing more information than needed, not less.

All this being said, I need to play with the code a little more to
understand what is going on in LTTng-UST.

Let me know if I can provide more info,

Best regards,

Mathieu

* Sébastien Barthélémy (barthelemy at crans.org) wrote:
> Hello Mathieu,
> 
> I just came over commit aae88c703374f4b1fbb8a5e7e95591bf8ce3e837
> "Force 64 bits timestamp"  [1].
> 
> I never understood how 27bits overflow detection was supposed to work. However,
> when using large headers, the 32bits overflow detection made sense to me:
> 
> the 32 MSBs of the previous timestamp are kept in the channel so it can detect
> overflow before writing the header.
> 
> So I don't understand why this new change is needed. Could you
> (briefly) explain it?
> 
> Or maybe it is for < 32bits arch?
> 
> Regards
> -- Sébastien
> 
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> 

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com