[lttng-dev] Sometimes a session destroy hangs lttng-consumerd process
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Sep 14 10:48:50 EDT 2015
----- On Sep 14, 2015, at 7:12 AM, Per Hallsmark Per.Hallsmark at windriver.com wrote:
>> Hello,
>>
>> I'm hit by an issue where, when doing session create/destroy multiple times,
>> lttng-consumerd during destroy cannot get out of a loop in which it tries to
>> flush metadata, but the connection is probably already down at that stage.
>>
>> I'm attaching a script that exercises lttng. The issue seems hard to hit;
>> it took 9 hours of running, and the script console log looks like:
>>
>> ....
>> Round 4327
>> Session s1 created.
>> Traces will be written in net://128.224.95.199:53000:53001
>> Live timer set to 1000000 usec
>> UST channel metadata enabled for session s1
>> UST channel c1 enabled for session s1
>> All UST events are enabled in channel c1
>> Tracing started for session s1
>> Waiting for data availability..
>> Tracing stopped for session s1
>> Session s1 destroyed
>> Round 4328
>> Session s1 created.
>> Traces will be written in net://128.224.95.199:53000:53001
>> Live timer set to 1000000 usec
>> UST channel metadata enabled for session s1
>> UST channel c1 enabled for session s1
>> All UST events are enabled in channel c1
>> Tracing started for session s1
>> libust[14905/14905]: Error: Timed out waiting for lttng-sessiond (in
>> lttng_ust_init() at lttng-ust-comm.c:1521)
>> < here it hangs >
>>
>> If I now try a simple "lttng list" or similar on the target, that also hangs:
>>
>> root at localhost:~# lttng -vvv list
>> DEBUG2 - 10:23:36.911853 [15122/15122]: Session name: (null) (in cmd_list() at
>> commands/list.c:911)
>> DEBUG1 - 10:23:36.913314 [15122/15122]: LSM cmd type : 13 (in send_session_msg()
>> at lttng-ctl.c:131)
>> < here it hangs >
>>
>> If I start lttng-sessiond to log things:
>>
>> lttng-sessiond -Z -v -b > lttng-sessiond.log 2>&1
>>
>> Then I get a huge log which, after some 9 hours, shows:
>>
>> DEBUG1 - 21:06:53.632585 [3807/3810]: Processing client command 17 (in
>> process_client_msg() at main.
>> DEBUG3 - 21:06:53.644582 [3807/3813]: Consumer pushing metadata on sock 33 of
>> len 686 (in consumer_p
>> DEBUG1 - 21:06:53.656539 [3807/3810]: Getting session s1 by name (in
>> process_client_msg() at main.c:
>> DEBUG1 - 21:06:53.656625 [3823/3829]: UST consumer push metadata key 8658 of len
>> 686 (in lttng_ustco
>> DEBUG1 - 21:06:53.693577 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:53.693585 [3823/3827]: Metadata poll return from wait with 2
>> fd(s) (in consumer_threa
>> DEBUG1 - 21:06:53.847615 [3823/3827]: Metadata event catched in thread (in
>> consumer_thread_metadata_
>> DEBUG1 - 21:06:53.847642 [3823/3827]: Metadata available on fd 81 (in
>> consumer_thread_metadata_poll(
>> DEBUG1 - 21:06:54.548885 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.264889 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.474883 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.684874 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.894872 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:55.104848 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:55.314850 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_r
>>
>> After that I only get these "Waiting for metadata to be flushed" lines.
>>
>> Following the code, I see that area in:
>>
>> common/ust-consumer/ust-consumer.c::lttng_ustconsumer_recv_metadata() where
>> there is a loop like:
>> 	while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>> 		DBG("Waiting for metadata to be flushed");
>>
>> 		health_code_update();
>>
>> 		usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
>> 	}
>>
>> So this loop will carry on forever if consumer_metadata_cache_flushed() never
>> returns 0.
>> It seems to me we still have some metadata to flush, but during session destroy
>> the communication channel is already down, so the metadata will never be
>> flushed out.
>> My first thought was that consumer_metadata_cache_flushed() does not recognize
>> that channel->metadata_stream is down, but perhaps the problem is that we don't
>> discard the metadata after the channel has been disconnected, or something
>> along those lines.
>> I'm still trying to follow the code here, which is a bit hard for a newcomer.
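>>
>> To make it concrete, here is a rough sketch of the kind of check I suspect
>> is missing. The types, fields and function name below are simplified
>> stand-ins I made up for illustration; they are not the real lttng-tools
>> structures:
>>
>> #include <pthread.h>
>> #include <stdint.h>
>>
>> struct metadata_cache {
>> 	pthread_mutex_t lock;
>> 	uint64_t max_offset;		/* how far the cache has been written */
>> };
>>
>> struct consumer_channel {
>> 	struct metadata_cache *cache;
>> 	void *metadata_stream;		/* NULL once the stream is torn down */
>> };
>>
>> /* Returns non-zero while the caller should keep waiting, 0 otherwise. */
>> static int metadata_flush_pending(struct consumer_channel *chan,
>> 		uint64_t offset)
>> {
>> 	int pending;
>>
>> 	pthread_mutex_lock(&chan->cache->lock);
>> 	if (!chan->metadata_stream) {
>> 		/* Stream already gone (session destroy): stop waiting. */
>> 		pending = 0;
>> 	} else {
>> 		pending = chan->cache->max_offset < offset;
>> 	}
>> 	pthread_mutex_unlock(&chan->cache->lock);
>> 	return pending;
>> }
>>
>> With something along those lines, the wait loop above would terminate once
>> the destroy path has removed the metadata stream.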
>>
>> This has been seen using lttng 2.5.4 / lttng-ust 2.5.5; looking at the latest
>> master branch (which at the time I fetched it was commit fb27f84), the code
>> around this looks similar.
>>
>> I wonder if this is something that has been encountered before? At least a
>> comment a few lines up in the same function suggests it:
>> 	/*
>> 	 * Skip metadata flush on write error since the offset and len might
>> 	 * not have been updated which could create an infinite loop below when
>> 	 * waiting for the metadata cache to be flushed.
>> 	 */
>>
>> But since we unlock the metadata_cache and lock it again inside
>> consumer_metadata_cache_flushed(), I guess there is a window here where other
>> threads get a chance to do things that were not thought of.
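>>
>> To illustrate the window I mean, here is a minimal made-up example (not the
>> real code) of what can happen once the lock is dropped and re-taken:
>>
>> #include <pthread.h>
>> #include <stdbool.h>
>>
>> /* Minimal stand-ins, invented purely for illustration. */
>> static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
>> static bool metadata_pending = true;	/* cleared when metadata is flushed */
>> static bool stream_up = true;		/* cleared by the destroy path */
>>
>> /* What the destroy path does; it can run in the window between the locks. */
>> static void destroy_stream(void)
>> {
>> 	pthread_mutex_lock(&cache_lock);
>> 	stream_up = false;	/* from now on nobody will flush the cache */
>> 	pthread_mutex_unlock(&cache_lock);
>> }
>>
>> /* Shape of the flush check: re-takes the lock but only looks at the cache. */
>> static int cache_flushed_check(void)
>> {
>> 	int still_pending;
>>
>> 	pthread_mutex_lock(&cache_lock);
>> 	still_pending = metadata_pending;	/* stream_up is never consulted */
>> 	pthread_mutex_unlock(&cache_lock);
>> 	return still_pending;	/* stays non-zero once destroy_stream() has run */
>> }
>>
>> If destroy_stream() runs in that window, cache_flushed_check() keeps
>> returning non-zero and the wait loop never exits.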
>
> I ran into this situation in my stress-tests recently, and proposed
> the following fix, currently merged in lttng-tools master branch:
>
> commit c585821bc78955b3d747fcd733aa1d2b81a3258e
> Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> Date: Wed Aug 19 14:44:59 2015 -0700
>
> Fix: sessiond vs consumerd push/get metadata deadlock
>
> We need to unlock the registry while we push metadata to break a
> circular dependency between the consumerd metadata lock and the sessiond
> registry lock. Indeed, pushing metadata to the consumerd awaits that it
> gets pushed all the way to relayd, but doing so requires grabbing the
> metadata lock. If a concurrent metadata request is being performed by
> consumerd, this can try to grab the registry lock on the sessiond while
> holding the metadata lock on the consumer daemon. Those push and pull
> schemes are performed on two different bidirectional communication
> sockets.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> Signed-off-by: Jérémie Galarneau <jeremie.galarneau at efficios.com>
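>
> In rough terms, the shape of the change is the following. This is a
> simplified sketch with made-up names (the struct ust_registry fields and
> send_metadata_to_consumer() are inventions), ignoring the copying, error
> handling and concurrent-pusher bookkeeping the real code needs; it is not
> the actual code from the commit:
>
> #include <pthread.h>
> #include <stdint.h>
>
> /* Stand-in registry type, invented for illustration. */
> struct ust_registry {
> 	pthread_mutex_t lock;
> 	char *metadata;
> 	uint64_t metadata_len;
> 	uint64_t metadata_len_sent;
> };
>
> /* Pretend transport call; in reality this can block until relayd acks. */
> extern int send_metadata_to_consumer(const char *data, uint64_t len);
>
> static int push_metadata(struct ust_registry *reg)
> {
> 	const char *data;
> 	uint64_t len;
> 	int ret;
>
> 	/* Hold the registry lock only long enough to see what must be sent. */
> 	pthread_mutex_lock(&reg->lock);
> 	data = reg->metadata + reg->metadata_len_sent;
> 	len = reg->metadata_len - reg->metadata_len_sent;
> 	pthread_mutex_unlock(&reg->lock);
>
> 	/*
> 	 * Send without holding the registry lock, so a concurrent metadata
> 	 * request from consumerd can take it and no circular wait can form.
> 	 */
> 	ret = send_metadata_to_consumer(data, len);
> 	if (ret < 0) {
> 		return ret;
> 	}
>
> 	/* Re-take the lock to update the "pushed so far" bookkeeping. */
> 	pthread_mutex_lock(&reg->lock);
> 	reg->metadata_len_sent += len;
> 	pthread_mutex_unlock(&reg->lock);
> 	return 0;
> }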
>
> Can you reproduce this issue on the master branch of tools, ust and modules?
>
> Oh gosh, if this solves our issue I must say lttng issue resolution is fast!
> :-)
> I'll see how I can do that. If I'm not able to update to the master branch,
> do you think the commit above alone would do the job?
>
> During the night I've been running with a crude workaround like:
>
> diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
> index f891828..9ded651 100644
> --- a/src/common/ust-consumer/ust-consumer.c
> +++ b/src/common/ust-consumer/ust-consumer.c
> @@ -1026,12 +1026,19 @@ error:
> /*
> * Receive the metadata updates from the sessiond.
> */
> +int workaround_verifier=1;
> int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
> uint64_t len, struct lttng_consumer_channel *channel,
> int timer, int wait)
> {
> int ret, ret_code = LTTCOMM_CONSUMERD_SUCCESS;
> char *metadata_str;
> + int workaround_counter;
> +
> + if (workaround_verifier) {
> + DBG("%s: workaround exist in %s\n", __FILE__, __func__);
> + workaround_verifier=0;
> + }
>
> DBG("UST consumer push metadata key %" PRIu64 " of len %" PRIu64, key, l
> en);
>
> @@ -1072,7 +1079,11 @@ int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
> if (!wait) {
> goto end_free;
> }
> + workaround_counter = 42;
> while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
> + if (!workaround_counter--) {
> + break;
> + }
> DBG("Waiting for metadata to be flushed");
>
> health_code_update();
> --
> 1.7.9.5
>
> and it seems to help us get out of the hang, although it doesn't fix why we
> entered the loop in the first place.
> I'll back that out and try to get your patch in...
>
> OK, I tested with only that commit on top of our base. Unfortunately it doesn't
> work at all then:
>
> root at axm5516:~# sh lttng-hammer1.sh
> Round 0
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data
> availability............................................................................................................................................................................................................................................................................................................................................................................................................................................^C
> root at axm5516:~#
>
> This may be because I'm only using that one commit, I guess. It applied cleanly
> without any manual change, but perhaps it depends on other earlier changes
> anyway.
> Next is to try a build with the latest master branch as you suggested, but that
> will not be as easy here... I'll see what can be done.
> Any opinion on the crude workaround as an intermediate "fix"?
>
> BR,
> Per
>
> OK, we have now run over the weekend with the latest & greatest on the master
> branch. After 90000+ rounds, there is a breakdown:
>
> ...
> Round 90311
> Session s1 created.
> Traces will be written in net://128.224.95.15:52000:52001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data availability
> Tracing stopped for session s1
> Session s1 destroyed
> Round 90312
> Session s1 created.
> Traces will be written in net://128.224.95.15:52000:52001
> Live timer set to 1000000 usec
> Error: Channel metadata: Unable to connect to lttng-relayd (session s1)
> Error: Command error
> Error: Channel c1: Unable to connect to lttng-relayd (session s1)
> Error: Command error
> Error: Events: Unable to connect to lttng-relayd (channel c1, session s1)
> Error: Command error
> Error: No channel found in the session
> Warning: Tracing already stopped for session s1
> Session s1 destroyed
> ...
>
> After that it never gets back to a working state again.
> In the lttng-sessiond/consumerd log we see:
>
> ...
> DEBUG1 - 12:45:17.205035 [8694/8694]: Cleaning up all agent apps (in
> sessiond_cleanup() at main.c:715)
> DEBUG1 - 12:45:17.205084 [8694/8694]: Closing all UST sockets (in
> sessiond_cleanup() at main.c:718)
> DEBUG3 - 12:45:17.205167 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205173 [8694/8694]: Buffer registry destroy all registry (in
> buffer_reg_destroy_registries() at buffer-registry.c:716)
> DEBUG3 - 12:45:17.205240 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG1 - 12:45:17.205280 [8694/8694]: Unloading kernel modules (in
> sessiond_cleanup() at main.c:730)
> DEBUG3 - 12:45:17.205292 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205378 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205415 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205450 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205487 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205541 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205578 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205619 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.209681 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG1 - 12:45:17.209711 [8694/8695]: [ht-cleanup] quit. (in thread_ht_cleanup()
> at ht-cleanup.c:154)
> DEBUG1 - 12:45:17.209778 [8694/8695]: [ht-cleanup] Thread terminates. (in
> thread_ht_cleanup() at ht-cleanup.c:165)
> DEBUG1 - 12:45:17.210254 [8694/8694]: Cleaning up options (in
> sessiond_cleanup_options() at main.c:777)
> DEBUG1 - 12:45:17.210293 [8694/8694]: *** assert failed :-) *** ==> Matthew,
> BEET driven development works! (in sessiond_cleanup_options() at main.c:807)
> root at axm5516:~#
>
> Hard to say whether that is the same situation, but at least it seems to happen
> during a session destroy scenario.
> So the newer code base still seems to have issues.
We're missing the log from relayd to understand what is going on there.
Thanks,
Mathieu
>
> BR,
> Per
>
> Thanks,
>
> Mathieu
>
>>
>> Best regards,
>> Per
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev at lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com