[lttng-dev] [RFC] Per-user event ID allocation proposal

David Goulet dgoulet at efficios.com
Fri Sep 14 12:25:18 EDT 2012


Mathieu Desnoyers:
> * David Goulet (dgoulet at efficios.com) wrote:
> Mathieu Desnoyers:
>>>> * David Goulet (dgoulet at efficios.com) wrote:
>>>> Mathieu Desnoyers:
>>>>>>> * David Goulet (dgoulet at efficios.com) wrote: Hi
>>>>>>> Mathieu,
>>>>>>> This looks good! I have some questions to clarify part
>>>>>>> of the RFC.
>>>>>>> Mathieu Desnoyers:
>>>>>>>>>> Mathieu Desnoyers September 11, 2012
>>>>>>>>>> Per-user event ID allocation proposal
>>>>>>>>>> The intent of this shared event ID registry is
>>>>>>>>>> to allow sharing tracing buffers between
>>>>>>>>>> applications belonging to the same user (UID) for
>>>>>>>>>> UST (user-space) tracing.
>>>>>>>>>> A.1) Overview of System-wide, per-ABI (32 and
>>>>>>>>>> 64-bit), per-user, per-session, LTTng-UST event
>>>>>>>>>> ID allocation:
>>>>>>>>>> - Modify LTTng-UST and lttng-tools to keep a
>>>>>>>>>> system-wide, per-ABI (32 and 64-bit), per-user,
>>>>>>>>>> per-session registry of enabled events and their
>>>>>>>>>> associated numeric IDs.
>>>>>>>>>> - LTTng-UST will have to register its tracepoints
>>>>>>>>>> with the session daemon, sending the field typing
>>>>>>>>>> of these tracepoints during registration.
>>>>>>>>>> - Dynamically check that field types match upon
>>>>>>>>>> registration of an event in this global registry;
>>>>>>>>>> refuse registration if the field types do not
>>>>>>>>>> match.
>>>>>>>>>> - The metadata will be generated by lttng-tools
>>>>>>>>>> instead of the application.
>>>>>>>>>> A.2 Per-user Event ID Details:
>>>>>>>>>> The event ID registry is shared across all
>>>>>>>>>> processes for a given session/ABI/channel/user
>>>>>>>>>> (UID). The intent is to forbid one user from
>>>>>>>>>> accessing another user's tracing data, while
>>>>>>>>>> keeping the system-wide number of buffers small.
>>>>>>>>>> The event ID registry is attached to a:
>>>>>>>>>> - session,
>>>>>>>>>> - specific ABI (32/64-bit),
>>>>>>>>>> - channel,
>>>>>>>>>> - user (UID).
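Just to make the granularity concrete on my side, here is a minimal
sketch of the key such a registry could be indexed by (every name
below is hypothetical, none of this is existing lttng-tools code):

#include <stdint.h>
#include <sys/types.h>

/*
 * Hypothetical sketch only: one event ID registry per
 * (session, ABI, channel, UID) tuple. All applications owned
 * by the same user share the registry selected by this key.
 */
struct event_id_registry_key {
    uint64_t session_id;    /* tracing session */
    uint32_t abi_bits;      /* 32 or 64 */
    uint64_t channel_key;   /* channel within the session */
    uid_t uid;              /* user owning the shared buffers */
};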
>>>>>>>>>> lttng-sessiond fills this registry by pulling the
>>>>>>>>>> needed information from traced processes (a.k.a.
>>>>>>>>>> applications). This information is needed only
>>>>>>>>>> when an event is active for a created session.
>>>>>>>>>> Therefore, applications need not notify the
>>>>>>>>>> sessiond if no session is created.
>>>>>>>>>> The rationale for using a "pull" scheme, where
>>>>>>>>>> the sessiond pulls information from applications,
>>>>>>>>>> as opposed to a "push" scheme, where applications
>>>>>>>>>> would initiate commands to push the information,
>>>>>>>>>> is that it minimizes the amount of logic required
>>>>>>>>>> within liblttng-ust, and it does not require
>>>>>>>>>> liblttng-ust to wait for a reply from
>>>>>>>>>> lttng-sessiond. This minimizes the impact on
>>>>>>>>>> application behavior and makes applications
>>>>>>>>>> resilient to lttng-sessiond crashes.
>>>>>>>>>> Updates to this registry are triggered by two
>>>>>>>>>> distinct scenarios: either an "enable-event"
>>>>>>>>>> command (could also be "start", depending on the
>>>>>>>>>> sessiond design) is being executed, or, while
>>>>>>>>>> tracing, a library is being loaded within the
>>>>>>>>>> application.
>>>>>>>>>> Before we start describing the algorithms that
>>>>>>>>>> update the registry, it is _very_ important to
>>>>>>>>>> understand that an event enabled with
>>>>>>>>>> "enable-event" can contain a wildcard (e.g.
>>>>>>>>>> libc*) and a loglevel, and is therefore possibly
>>>>>>>>>> associated with _many_ events in the application.
>>>>>>>>>> Algo (1) When an "enable-event"/"start" command
>>>>>>>>>> is executed, the sessiond will get, in return for
>>>>>>>>>> sending an enable-event command to the
>>>>>>>>>> application (which applies to a channel within a
>>>>>>>>>> session), a variable-sized array of enabled
>>>>>>>>>> events (remember, we can enable a wildcard!),
>>>>>>>>>> along with their name, loglevel, field names, and
>>>>>>>>>> field types. The sessiond then checks that each
>>>>>>>>>> event does not conflict with an event already in
>>>>>>>>>> the registry that has the same name but different
>>>>>>>>>> field names/types or loglevel. If the field
>>>>>>>>>> names/types or loglevel differ from a previous
>>>>>>>>>> event, it prints a warning. If it matches a
>>>>>>>>>> previous event, it re-uses the same ID as the
>>>>>>>>>> previous event. If there is no match, it
>>>>>>>>>> allocates a new event ID. It then sends a command
>>>>>>>>>> to the application to let it know the mapping
>>>>>>>>>> between the event name and ID for the channel.
>>>>>>>>>> When the application receives that command, it
>>>>>>>>>> can finally proceed to attach the tracepoint
>>>>>>>>>> probe to the tracepoint site.
>>>>>>>>>> The sessiond keeps a per-application/per-channel
>>>>>>>>>> hash table of already enabled events, so it does
>>>>>>>>>> not provide the same event name/id mapping twice
>>>>>>>>>> for a given channel.
>>>>>>> and per-session?
>>>>>>>> Yes.
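For my own understanding, here is a rough sketch of what the ID
(re)allocation part of Algo (1) could look like on the sessiond side
(all names are invented; a real implementation would hash instead of
scanning, and would also compare field names/types):

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: look the event up by name; on a match
 * with an identical definition, reuse the ID; on a mismatch,
 * warn and refuse; otherwise allocate a fresh ID.
 */
struct event_entry {
    char name[256];
    int loglevel;
    /* field names/types omitted for brevity */
    uint32_t id;
};

struct event_registry {
    struct event_entry events[1024];
    size_t nr_events;
    uint32_t next_id;   /* never decremented: IDs are not freed */
};

static int register_event(struct event_registry *reg,
        const char *name, int loglevel, uint32_t *id)
{
    struct event_entry *e;
    size_t i;

    for (i = 0; i < reg->nr_events; i++) {
        e = &reg->events[i];
        if (strcmp(e->name, name) != 0)
            continue;
        if (e->loglevel != loglevel) {
            /* Same name, conflicting definition. */
            fprintf(stderr, "Warning: conflicting "
                "definition for event %s\n", name);
            return -EINVAL;
        }
        *id = e->id;    /* reuse the existing ID */
        return 0;
    }
    /* First registration of this name: allocate a new ID. */
    if (reg->nr_events >= 1024)
        return -ENOMEM;
    e = &reg->events[reg->nr_events++];
    snprintf(e->name, sizeof(e->name), "%s", name);
    e->loglevel = loglevel;
    e->id = reg->next_id++;
    *id = e->id;
    return 0;
}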
>>>>>>> From what I understand of this proposal, an event is
>>>>>>> associated with per-user/per-session/per-app/per-channel
>>>>>>> values.
>>>>>>>> Well, given that channels become per-user, an
>>>>>>>> application will write its data into the channel with
>>>>>>>> the same UID as itself. (This might imply some
>>>>>>>> limitations with setuid() in an application; at the
>>>>>>>> very least we would document those, or overload
>>>>>>>> setuid().)
>>>>>>>> The "per-app" part is not quite right. Event IDs are 
>>>>>>>> re-used and shared across all applications that
>>>>>>>> belong to the same UID.
>>>>>>> (I have a question at the end about how an event ID
>>>>>>> should be generated)
>>>>>>>>>> Algo (2) In the case where a library (.so) is
>>>>>>>>>> being loaded in the application while tracing,
>>>>>>>>>> the update sequence goes as follows: the
>>>>>>>>>> application first checks if there is any session
>>>>>>>>>> created. If so, it sends a NOTIFY_NEW_EVENTS
>>>>>>>>>> message to the sessiond through the communication
>>>>>>>>>> socket (normally used to send ACKs to commands).
>>>>>>>>>> The lttng-sessiond will therefore need to listen
>>>>>>>>>> (read) on each application communication socket,
>>>>>>>>>> and will also need to handle NOTIFY_NEW_EVENTS
>>>>>>>>>> messages that arrive while it expects an ACK
>>>>>>>>>> reply for a command it has sent to the
>>>>>>>>>> application.
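If I follow, the application side of Algo (2) could be as small as
this sketch (message type, helper and socket names are all invented
for illustration):

#include <stdint.h>
#include <sys/socket.h>

#define NOTIFY_NEW_EVENTS 42    /* hypothetical message type */

struct ust_notify_msg {
    uint32_t type;
};

/*
 * Hypothetical sketch: on load of an instrumented library,
 * fire an asynchronous NOTIFY_NEW_EVENTS at the sessiond and
 * return immediately. No reply is expected, so an
 * unresponsive sessiond cannot block the application.
 */
static void notify_new_events(int cmd_sock, int session_created)
{
    struct ust_notify_msg msg = { .type = NOTIFY_NEW_EVENTS };

    /* No session created: nothing to pull, stay silent. */
    if (!session_created)
        return;

    /* Fire and forget on the socket normally used for ACKs. */
    (void) send(cmd_sock, &msg, sizeof(msg), MSG_NOSIGNAL);
}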
>>>>>>> Coming back to the last sentence of that quoted
>>>>>>> paragraph, can you explain or clarify the mechanism of
>>>>>>> "dispatching a NOTIFY_NEW_EVENTS" each time an ACK
>>>>>>> reply is expected?... Do you mean that each time we are
>>>>>>> waiting for an ACK, if we get a NOTIFY instead (which
>>>>>>> could happen due to a race between notification and
>>>>>>> command handling), you will launch a NOTIFY code path
>>>>>>> where the session daemon checks the events hash table
>>>>>>> and looks for event(s) to pull from the UST tracer?
>>>>>>> ... so what about getting the real ACK after that?
>>>>>>>> In this scheme, the NOTIFY is entirely asynchronous,
>>>>>>>> and gets no acknowledgment from the sessiond to the
>>>>>>>> app. This means that when dispatching a NOTIFY
>>>>>>>> received at the ACK site (due to the race you refer
>>>>>>>> to), we could simply queue this notify within the
>>>>>>>> sessiond so it gets handled after we finish handling
>>>>>>>> the current command (e.g. next time the thread goes
>>>>>>>> back to polling fds).
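Ok, that makes sense. So something along these lines, I suppose
(types and helpers are invented for the sketch):

#include <stdint.h>

#define NOTIFY_NEW_EVENTS 42    /* hypothetical message type */

struct ust_msg {
    uint32_t type;
    /* payload omitted */
};

/* Hypothetical helpers: blocking read and deferral queue. */
extern int recv_msg(int sock, struct ust_msg *msg);
extern void queue_notify(int sock);    /* handled at next poll */

/*
 * Hypothetical sketch of the ACK-site dispatch: while the
 * sessiond waits for an ACK, a racing NOTIFY_NEW_EVENTS on
 * the same socket is queued for deferred handling, and the
 * read loop continues until the real ACK arrives.
 */
static int wait_for_ack(int sock, struct ust_msg *ack)
{
    for (;;) {
        struct ust_msg msg;

        if (recv_msg(sock, &msg) < 0)
            return -1;
        if (msg.type == NOTIFY_NEW_EVENTS) {
            /* Defer: process after the current command. */
            queue_notify(sock);
            continue;
        }
        *ack = msg;    /* the reply we were waiting for */
        return 0;
    }
}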
>>>>>>>>>> When a NOTIFY_NEW_EVENTS is received from an 
>>>>>>>>>> application, the sessiond iterates on each
>>>>>>>>>> session, each channel, redoing Algo (1). The
>>>>>>>>>> per-app/per-channel hash table that remembers
>>>>>>>>>> already enabled events will ensure that we don't
>>>>>>>>>> end up enabling the same event twice.
>>>>>>>>>> At application startup, the "registration done"
>>>>>>>>>> message will only be sent once all the commands
>>>>>>>>>> setting the mapping between event name and ID are
>>>>>>>>>> sent. This ensures tracing is not started until
>>>>>>>>>> all events are enabled (delaying the application
>>>>>>>>>> startup by a configurable amount of time).
>>>>>>>>>> At library load, a "registration done" will also
>>>>>>>>>> be sent by the sessiond some time after the
>>>>>>>>>> NOTIFY_NEW_EVENTS has been received -- at the end
>>>>>>>>>> of Algo (1). This means that library load, within
>>>>>>>>>> applications, can be delayed for the same
>>>>>>>>>> (configurable) amount of time that applies to
>>>>>>>>>> application startup.
>>>>>>>>>> The registry is emptied when the session is
>>>>>>>>>> destroyed. Event IDs are never freed, only
>>>>>>>>>> re-used for events with the same name, after a
>>>>>>>>>> loglevel, field name and field type match check.
>>>>>>> This means that event IDs here should be some kind of
>>>>>>> hash using a combination of the event's values, to make
>>>>>>> sure each ID is unique on a
>>>>>>> per-event/per-channel/per-session basis? (considering
>>>>>>> the sessiond should keep them in a separate registry)
>>>>>>>> I think it would be enough to hash the events by
>>>>>>>> their full name, and then do a comparison to check
>>>>>>>> whether the fields match. We _want_ the hash table
>>>>>>>> lookup to succeed if we get an event with the same
>>>>>>>> name but different fields, but then our detailed
>>>>>>>> check for field mismatch would fail.
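Got it. So the detailed comparison that runs after the name-only
lookup hits would be roughly like this (simplified, hypothetical
types):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct event_field {
    char name[64];
    int type;   /* simplified field type tag */
};

struct event_layout {
    int loglevel;
    size_t nr_fields;
    struct event_field fields[32];
};

/*
 * Hypothetical sketch: the hash lookup is meant to succeed on
 * a name collision precisely so this comparison can flag a
 * conflicting redefinition instead of silently allocating a
 * second ID for the same name.
 */
static bool layouts_match(const struct event_layout *a,
        const struct event_layout *b)
{
    size_t i;

    if (a->loglevel != b->loglevel || a->nr_fields != b->nr_fields)
        return false;
    for (i = 0; i < a->nr_fields; i++) {
        if (strcmp(a->fields[i].name, b->fields[i].name) != 0 ||
                a->fields[i].type != b->fields[i].type)
            return false;
    }
    return true;
}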
>>>>>>>>>> This registry is also used to generate metadata
>>>>>>>>>> from the sessiond. The sessiond will now be
>>>>>>>>>> responsible for generation of the metadata
>>>>>>>>>> stream.
>>>>>>> This implies that the session daemon will need to keep
>>>>>>> track of the global memory location of each
>>>>>>> application in order to consume metadata streams?
>>>>>>>> Uh? No. The metadata stream would be _created_ by
>>>>>>>> the sessiond. Applications would not have anything to
>>>>>>>> do with the metadata stream in this scheme.
>>>>>>>> Basically, we need to ensure that all the information
>>>>>>>> required to generate the metadata for a given
>>>>>>>> session/channel is present in the table that contains
>>>>>>>> the mapping between numeric event IDs and event
>>>>>>>> names/field names/types/loglevels.
>>>> Hmmm... the session daemon creates the metadata now... this
>>>> means you are going to get the full ringbuffer + ctf code
>>>> inside the lttng-tools tree....!?
>>>>> No. Just move the internal representation of the
>>>>> tracepoint metadata that UST currently has (see
>>>>> ust-events.h) into lttng-tools.
> Right, so the session daemon has to pass the metadata buffer
> information up to the application in order to write it?
> Please, if I'm wrong again, maybe just write a paragraph to
> explain the whole shebang :P
>> No.
>> The sessiond keeps the registry about the entire metadata for
>> each channel. The application will _not_ generate the metadata
>> stream. That's the whole change: now the sessiond will generate
>> this metadata stream. The applications would now have nothing to
>> do with that stream: they would simply write into the data
>> buffers shared across all processes of a given user.
>> Not sure if it's still unclear?

Ok! So, this isn't a small task, and we might want to elaborate on
that part, since we have to decide what resources are needed to do
it, either on the consumer side or in the session daemon... new
threads?... new dependencies?... and so on.
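For the record, I picture the metadata emission roughly like this
sketch, generated straight from the registry (simplified and
hypothetical; real TSDL event blocks carry more attributes, e.g.
stream_id and the actual field list built from the registered types):

#include <stdio.h>

struct metadata_event {
    const char *name;
    unsigned int id;
    int loglevel;
    /* field descriptions omitted */
};

/*
 * Hypothetical sketch: since the registry holds the complete
 * event ID <-> name/fields/loglevel mapping, the sessiond can
 * emit the TSDL (CTF metadata) event descriptions itself; the
 * applications never touch the metadata stream.
 */
static void emit_event_metadata(FILE *out,
        const struct metadata_event *e)
{
    fprintf(out,
        "event {\n"
        "\tname = \"%s\";\n"
        "\tid = %u;\n"
        "\tloglevel = %d;\n"
        "};\n",
        e->name, e->id, e->loglevel);
}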


>> Thanks,
>> Mathieu
> Cheers David
>>>>> Thanks,
>>>>> Mathieu
>>>> In any case, I would love to have a rationale for that,
>>>> since this seems to me an important change.
>>>> David
>>>>>>>> Anything still unclear ?
>>>>>>>> Thanks,
>>>>>>>> Mathieu
>>>>>>> Cheers! David

