[lttng-dev] background information about LTTng timestamps

Sébastien Barthélémy barthelemy at crans.org
Thu Jan 26 13:18:25 EST 2012


2012/1/26 Mathieu Desnoyers <compudj at krystal.dyndns.org>:
> Hi Sébastien,
> First thing first: thanks a lot for pointing me out to the fact that I
> was recently mixed up between the LTTng 0.x and 2.0 schemes! :-)

You're welcome.

Thank you also for the explanations, and for LTTng itself; it has
already proved useful here.

> The late clock fixes in LTTng-UST 2.0 prereleases, even though they fix the
> problem symptoms, do not "fix" the problem, they just make it go away by
> always using a 64-bit timestamp all the time.

Ok.

Note that I never actually observed this problem myself. I first believed
I had, because the traces I got were incoherent, but it turned out that
the traced program itself was behaving incoherently.

> Let's first start with a bit of background on the problem:
>
> In LTTng, we store current time values, not delta, because we do not
> want to add arbitrary dependencies between consecutive events in the
> trace buffer. However, 32-bit overflow can occur every 4 minutes or so
> on a system with a clock running at 1GHz. (which is the case of
> CLOCK_MONOTONIC).

agreed

> If there are no events recorded in a stream during this time-frame, more
> than 4 minutes will elapse, but the 32-bit overflow will cause the
> delta observed by comparing the consecutive 32-low-order bits to be
> smaller than 4 minutes (it would be recording the new time-stamp modulo
> 4 minutes).

agreed; the apparent delta might even be negative
> The Babeltrace machinery detects single overflows, as stated in CTF
> spec:
>
> "For a N-bit integer type referring to a clock, if the integer overflows
> compared to the N low order bits of the clock prior value, then it is
> assumed that one, and only one, overflow occurred. It is therefore
> important that events encoding time on a small number of bits happen
> frequently enough to detect when more than one N-bit overflow occurs."

I was missing this piece of information. I overlooked it while looking at
the code.
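For concreteness, the reconstruction rule quoted from the CTF spec can be
sketched like this (a minimal illustration only; the helper name is made
up, this is not Babeltrace's actual code):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the CTF rule quoted above: extend the previous full clock
 * value with a new N-bit compressed timestamp, assuming at most one
 * N-bit overflow occurred (hypothetical helper, not Babeltrace code). */
static uint64_t ctf_extend_clock(uint64_t prev, uint64_t low_bits, unsigned n)
{
	uint64_t mask = (1ULL << n) - 1;
	uint64_t next = (prev & ~mask) | (low_bits & mask);

	if (next < prev)		/* low-order bits went backwards: */
		next += 1ULL << n;	/* assume exactly one overflow */
	return next;
}
```

If more than one overflow actually occurred between the two events, this
reconstruction silently underestimates the elapsed time; hence the
requirement that compressed events happen frequently enough.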

> A bit of history: in earlier LTTng versions (0.x), instead of using a
> storage size of 64-bit, which allows very large clock values,
> we had a periodical timer that wrote an event into each trace stream to
> guarantee that there is at most one 32-bit overflow between events.
> However, this approach requires to wake up the system periodically, and
> is not well suited for user-space tracing in a scheme where each process
> is responsible for its own buffers: this would require to add timers
> within each process, which I really want to avoid, due to the large
> intrusiveness of this approach on system behavior (in other words:
> signals are a can of worms). ;-)

agreed ;)

> In LTTng 2.0/LTTng-UST 2.0, the scheme differs. It does not require a
> timer anymore, so this whole problem goes away. Let me explain the 2.0
> scheme a bit more:
>
> * On 64-bit architectures:
>
> - we keep a per-stream 64-bit last_tsc value (see lttng-ust
>  libringbuffer/frontend_internal.h save_last_tsc() surroundings)
> - at each event, we store the full 64-bit timestamp within this
>  variable. Before we store it, we check if it has a delta of more than
>  a single N bits (e.g. 27 bits) overflow from the previous value.
> - if more than one N-bit overflow is detected, we use a full 64-bit
>  timestamp for the event.
>
>  -->  actually the check in the code is probably not as strict as it
>       could be; we have:
>
>        if (caa_unlikely((tsc - v_read(config, &buf->last_tsc))
>                     >> config->tsc_bits))
>                return 1;
>        else
>                return 0;
>
>  I think I could change it to:
>
>        if (caa_unlikely((tsc - v_read(config, &buf->last_tsc))
>                     >> (config->tsc_bits + 1)))
>
> Because the current incarnation will require a full 64-bit timestamp
> storage for single-bit overflows (which can be detected by the trace
> reader). Thoughts ?

I don't think it would be correct.

Let's assume

  N == tsc_bits == 2
  last_tsc == 0b0000
  tsc == 0b0101 // an N-bit overflow occurred, and more than 2**N clock ticks elapsed

we get

  (tsc - last_tsc) == 0b0101
  ((tsc - last_tsc) >> tsc_bits) == 0b0001     // the current scheme forces a full timestamp
  ((tsc - last_tsc) >> (tsc_bits+1)) == 0b0000 // the suggested change does not

With the current scheme, we store 0b0000, then 0b0101.
With the suggested change, we would store 0b0000, then 0b01: time
apparently does not rewind, and there is no way babeltrace could notice
that more than one overflow occurred.

Another example:

  N == tsc_bits == 2
  last_tsc == 0b0011
  tsc == 0b0110 // an N-bit overflow occurred, but less than 2**N clock ticks elapsed

we get

  (tsc - last_tsc) == 0b0011
  ((tsc - last_tsc) >> tsc_bits) == 0b0000     // the current scheme does not force a full timestamp
  ((tsc - last_tsc) >> (tsc_bits+1)) == 0b0000 // the suggested change does not either

With either scheme, we store 0b0011, then 0b10, and let babeltrace
detect the single overflow.

I therefore think the current implementation is correct and optimal.
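The two variants of the check can be compared directly on the toy values
above (a standalone sketch; the real check operates on the per-stream
last_tsc in libringbuffer, not on plain arguments):

```c
#include <assert.h>
#include <stdint.h>

/* Current check: request a full 64-bit timestamp when the delta does
 * not fit in tsc_bits. */
static int full_tsc_current(uint64_t tsc, uint64_t last_tsc, unsigned tsc_bits)
{
	return ((tsc - last_tsc) >> tsc_bits) ? 1 : 0;
}

/* Suggested (looser) check: tolerate one extra bit of delta. */
static int full_tsc_suggested(uint64_t tsc, uint64_t last_tsc, unsigned tsc_bits)
{
	return ((tsc - last_tsc) >> (tsc_bits + 1)) ? 1 : 0;
}
```

With tsc_bits == 2, the first example (0b0000 -> 0b0101) is exactly the
case where the two checks disagree, and only the current one keeps the
trace decodable.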

> * On 32-bit architectures:
>
> - idea very similar to the 64-bit architecture case, but we cannot
>  do a fast 64-bit atomic read nor write. Therefore, we save only the
>  high-order bits that are needed to detect the overflow (we can
>  discard the N low-order bits).

I think there was an error here: we discarded the 32 low-order bits instead of
the N == 27 ones. See below.

> - as for the 64-bit case, I think I could make the comparison a bit
>  more strict, so single N bit overflows would not require a full 64-bit
>  timestamp.

see above


> * If timestamps ever go backwards when moving forward in the stream
>
> They are going to be detected as a "more than a single N-bit overflow",
> and thus will use a 64-bit storage to record the information.
>
>
> * Synchronization
>
> We have to notice here that the last_tsc update/read are performed in
> two distinct atomic operations, and that it is entirely possible (and
> allowed) for events to come in nested on top of the tracing code (coming
> from nmi/irq/softirq in kernel code, coming from signal handlers, or
> from another thread, in user-space) and update last_tsc underneath us.
>
> The key property of this scheme is that if such a scenario occurs, it
> should _never_ cause a missed more-than-1-N-bit-overflow between two
> consecutive events. However, it is perfectly valid to use 64-bit of
> space even if not required, if it help handling some rare race
> conditions.
>
> So if we have a sequence like this at kernel-level:
>
> CPU A
>
> - Thread context: event A triggered, executing probe
> - do
>  - read current buffer write position
>  - read current TSC
>  - check if it overflows last_tsc
> - while space reservation fails.
>           - IRQ: event B triggered
>             - reserve space, update last tsc, write event, commit.
> - update last_tsc
>
> As you notice, in this race, if event A needed to use a 64-bit
> timestamp as storage due to more than 1 N bit overflow since the
> previous one, the nested IRQ event B will also use a 64-bit timestamp,
> even though it directly follows a timestamp that is close enough not to
> require such a large storage. This race is "ok" because we end up
> providing more information than needed, not less.

agreed, I got that from comments.
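The key property (a nested writer can only widen timestamps, never hide
an overflow) follows from the check and the update being two separate
steps, roughly as below (names, the missing atomics and the reservation
logic are simplifications, not the real lttng code):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t last_tsc;	/* per-stream; atomic in the real code */

/* Returns nonzero when the event must carry a full 64-bit timestamp. */
static int record_event(uint64_t tsc, unsigned tsc_bits)
{
	int full = ((tsc - last_tsc) >> tsc_bits) ? 1 : 0;
	/* ... space reservation happens here; a nested event (IRQ, signal
	 * handler, other thread) may reserve, write and update last_tsc
	 * before we do, making "full" pessimistic but never wrong ... */
	last_tsc = tsc;
	return full;
}
```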

> All this being said, I need to play with the code a little more to
> understand what is going on in LTTng-UST.

In your aae88c703374f4b1fbb8a5e7e95591bf8ce3e837 commit you changed
tsc_bits from 32 to 27 in liblttng-ust/ltt-ring-buffer-client.h.

Wouldn't that alone have fixed the problem?
As explained in a previous email, I think we were storing bits 32..64
instead of 27..59 in last_tsc, and thus missing up to 2**5 N-bit
overflows.
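If that is what happened, the failure mode is easy to reproduce in
isolation (a toy demonstration of the suspected bug, not the actual
lttng-ust code path):

```c
#include <assert.h>
#include <stdint.h>

/* last_tsc keeps only the high-order bits of the clock; "shift" is 32
 * with the suspected bug, 27 (== tsc_bits) with the fix. */
static uint32_t saved_high_bits(uint64_t tsc, unsigned shift)
{
	return (uint32_t)(tsc >> shift);
}

/* An overflow is detected when the saved high-order bits differ. */
static int overflow_detected(uint64_t prev, uint64_t tsc, unsigned shift)
{
	return saved_high_bits(tsc, shift) != saved_high_bits(prev, shift);
}
```

A jump of 31 full 27-bit periods changes bits 27..31 only, so comparing
bits 32..64 sees nothing, while comparing bits 27..59 catches it.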

What do you think?

-- Sébastien


