[lttng-dev] lttng snapshots and running traces

Thu Sep 26 15:54:30 EDT 2013

On 13-09-26 03:36 PM, Thibault, Daniel wrote:
> Sent: September 26, 2013 14:33
> 
>>>    Is the following scenario possible?  The consumer tries to get_subbuf, and is denied access (because tracers
>>> are writing into it).  It then tries to get the next sub-buffer, but the luck of task scheduling is such that by the 
>>> time it actually calls the request routine, the tracers have already overwritten the sub-buffer in question and 
>>> gone on to the next one.  The get_subbuf succeeds and the consumer keeps doing its thing (the timestamps 
>>> remain monotonically increasing, so there is no corruption).  The resulting snapshot on disk has a large jump 
>>> in its timestamps, having lost a whole buffer cycle's worth of event records.
>>>
>>>    To complicate things, the tracers then stop (for lack of event occurrences) in some sub-buffer ahead of the 
>>> consumer's current position, but still short of the consumer's stop goal.  When the consumer skips over the 
>>> busy sub-buffer (where the tracers are "stuck"), it would read events that apparently jump back in time (a 
>>> whole buffer cycle's worth of time), so I guess it would stop there and close the snapshot?
>>
>> One important design note to understand this situation: even though we write in a ring-buffer, the positions 
>> are free-running counters, they "never" wrap-around (well, we are dealing with 64-bit counters). In order to 
>> map a position to an actual subbuffer, we use a sort of modulo. This operation gives us the actual subbuffer 
>> to use and also allows us to detect if the tracer has already reused this subbuffer.
> 
>    Good to know, but it doesn't answer the question.   :-)
> 
>    If the tracers pass the consumer, and then the consumer passes the tracers, does the consumer stop copying 
> the buffer contents to the snapshot trace?
> 
> * consumer reads a number of "pass n" sub-buffers (where n is the number of times the tracers have gone around)
> * tracers catch up with the consumer, skip ahead, write at least one whole "pass n+1" sub-buffer
> * tracers then stop or slow down
> * consumer reads the "pass n+1" sub-buffer(s), then overtakes the tracers
> * consumer skips over the sub-buffer where the tracers are, finds a "pass n" sub-buffer
> 
>    If the consumer doesn't close the trace at that point, it would corrupt the snapshot.
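The consumer reads absolute positions in the ring buffer. If a subbuffer
has been overwritten, it skips to the next one, and so on until it
reaches its end position.
The consumer will never read data that was recorded after it started
taking the snapshot. So in your scenario it never reads the "pass n+1"
subbuffers: a request for a "pass n" position whose subbuffer has been
reused fails, and the consumer moves on.
That is why it is important to know that we deal with absolute positions.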
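
To make the position arithmetic concrete, here is a small toy model in
Python. It is only a sketch of the idea, not the LTTng code: the sizes
and the names subbuf_index, pass_number and is_overwritten are made up
for illustration.

```python
# Toy model of free-running positions. All names and sizes here are
# assumptions for illustration; they are not taken from LTTng.
SUBBUF_SIZE = 4096                    # bytes per subbuffer (assumed)
NUM_SUBBUF = 4                        # subbuffers in the ring (assumed)
BUF_SIZE = SUBBUF_SIZE * NUM_SUBBUF   # bytes in one full pass


def subbuf_index(pos):
    """Map a free-running position to the ring slot it lands in."""
    return (pos // SUBBUF_SIZE) % NUM_SUBBUF


def pass_number(pos):
    """How many times the ring had been filled before this position."""
    return pos // BUF_SIZE


def is_overwritten(read_pos, write_pos):
    """True once the writer has come back around to read_pos's slot."""
    read_start = read_pos - read_pos % SUBBUF_SIZE
    return write_pos - read_start >= BUF_SIZE
```

Two positions exactly one pass apart share a slot (subbuf_index is the
same) but remain distinct numbers, which is what lets the reader tell a
"pass n" subbuffer from a "pass n+1" one.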

Is that clearer?

If you want to take a look, the code for the kernel is in the function
lttng_kconsumer_snapshot_channel in
src/common/kernel-consumer/kernel-consumer.c.
You will see that we take the begin and end positions, and then iterate
subbuffer by subbuffer. If a get_subbuf(absolute_position) fails, it
means the subbuffer has already been reused, and we skip to the next one.
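
As a rough illustration of that loop, here is a toy simulation in
Python. It is a sketch under simplifying assumptions, not a
transcription of lttng_kconsumer_snapshot_channel: ToyRing, snapshot,
demo and the sizes are hypothetical, and get_subbuf here fails only
when the slot has been reused (it does not model a writer that is
currently inside the subbuffer).

```python
# Toy simulation of a snapshot taken while the tracer keeps writing.
# Every name and size below is an assumption made for illustration.
SUBBUF_SIZE = 4096
NUM_SUBBUF = 4


class ToyRing:
    def __init__(self):
        # slot index -> absolute start position of the data stored there
        self.slots = {}

    def _slot(self, pos):
        return (pos // SUBBUF_SIZE) % NUM_SUBBUF

    def write_subbuf(self, start_pos):
        """Tracer side: fill the slot, overwriting older data."""
        self.slots[self._slot(start_pos)] = start_pos

    def get_subbuf(self, start_pos):
        """Consumer side: succeed only if the slot still holds start_pos."""
        return self.slots.get(self._slot(start_pos)) == start_pos


def snapshot(ring, begin, end, on_step=None):
    """Walk absolute positions in [begin, end), skipping reused subbuffers."""
    copied = []
    pos = begin
    while pos < end:
        if on_step is not None:
            on_step(pos)           # lets the demo interleave tracer activity
        if ring.get_subbuf(pos):
            copied.append(pos)
        pos += SUBBUF_SIZE
    return copied


def demo():
    ring = ToyRing()
    for i in range(NUM_SUBBUF):            # "pass n": positions 0..12288
        ring.write_subbuf(i * SUBBUF_SIZE)
    end = NUM_SUBBUF * SUBBUF_SIZE         # end position fixed at snapshot start

    def tracer_overtakes(pos):
        if pos == SUBBUF_SIZE:             # consumer is about to read slot 1
            for k in (4, 5, 6):            # "pass n+1" lands in slots 0, 1, 2
                ring.write_subbuf(k * SUBBUF_SIZE)

    return snapshot(ring, 0, end, tracer_overtakes)
```

In this run demo() returns [0, 12288]: the two reused subbuffers are
skipped, and nothing at or beyond the end position is copied, so the
snapshot has a gap but no "pass n+1" data mixed into it.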

Julien

> 
> Daniel U. Thibault
> Protection des systèmes et contremesures (PSC) | Systems Protection & Countermeasures (SPC)
> Cyber sécurité pour les missions essentielles (CME) | Mission Critical Cyber Security (MCCS)
> R & D pour la défense Canada - Valcartier (RDDC Valcartier) | Defence R&D Canada - Valcartier (DRDC Valcartier)
> 2459 route de la Bravoure
> Québec QC  G3J 1X5
> CANADA
> Vox : (418) 844-4000 x4245
> Fax : (418) 844-4538
> NAC : 918V QSDJ <http://www.travelgis.com/map.asp?addr=918V%20QSDJ>
> Gouvernement du Canada | Government of Canada
> <http://www.valcartier.drdc-rddc.gc.ca/>
>