[lttng-dev] Sometimes a session destroy hangs lttng-consumerd process
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Sep 14 10:48:50 EDT 2015
----- On Sep 14, 2015, at 7:12 AM, Per Hallsmark Per.Hallsmark at windriver.com wrote:
>> Hello,
>>
>> I'm hit by an issue where, when doing session create/destroy multiple times,
>> lttng-consumerd during destroy cannot get out of a loop in which it tries to
>> flush metadata, but the connection is probably already down at that stage.
>>
>> I'm attaching a script that exercises lttng. The issue seems hard to hit;
>> it took 9 hours of running, and the script console log looks like:
>>
>> ....
>> Round 4327
>> Session s1 created.
>> Traces will be written in net://128.224.95.199:53000:53001
>> Live timer set to 1000000 usec
>> UST channel metadata enabled for session s1
>> UST channel c1 enabled for session s1
>> All UST events are enabled in channel c1
>> Tracing started for session s1
>> Waiting for data availability..
>> Tracing stopped for session s1
>> Session s1 destroyed
>> Round 4328
>> Session s1 created.
>> Traces will be written in net://128.224.95.199:53000:53001
>> Live timer set to 1000000 usec
>> UST channel metadata enabled for session s1
>> UST channel c1 enabled for session s1
>> All UST events are enabled in channel c1
>> Tracing started for session s1
>> libust[14905/14905]: Error: Timed out waiting for lttng-sessiond (in
>> lttng_ust_init() at lttng-ust-comm.c:1521)
>> < here it hangs >
>>
>> If I now try a simple "lttng list" or similar on the target, that also hangs:
>>
>> root at localhost:~# lttng -vvv list
>> DEBUG2 - 10:23:36.911853 [15122/15122]: Session name: (null) (in cmd_list() at
>> commands/list.c:911)
>> DEBUG1 - 10:23:36.913314 [15122/15122]: LSM cmd type : 13 (in send_session_msg()
>> at lttng-ctl.c:131)
>> < here it hangs >
>>
>> If I start lttng-sessiond to log things:
>>
>> lttng-sessiond -Z -v -b > lttng-sessiond.log 2>&1
>>
>> Then I get a huge log which, after some 9 hours, shows:
>>
>> DEBUG1 - 21:06:53.632585 [3807/3810]: Processing client command 17 (in
>> process_client_msg() at main.
>> DEBUG3 - 21:06:53.644582 [3807/3813]: Consumer pushing metadata on sock 33 of
>> len 686 (in consumer_p
>> DEBUG1 - 21:06:53.656539 [3807/3810]: Getting session s1 by name (in
>> process_client_msg() at main.c:
>> DEBUG1 - 21:06:53.656625 [3823/3829]: UST consumer push metadata key 8658 of len
>> 686 (in lttng_ustco
>> DEBUG1 - 21:06:53.693577 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:53.693585 [3823/3827]: Metadata poll return from wait with 2
>> fd(s) (in consumer_threa
>> DEBUG1 - 21:06:53.847615 [3823/3827]: Metadata event catched in thread (in
>> consumer_thread_metadata_
>> DEBUG1 - 21:06:53.847642 [3823/3827]: Metadata available on fd 81 (in
>> consumer_thread_metadata_poll(
>> DEBUG1 - 21:06:54.548885 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.264889 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.474883 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.684874 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:54.894872 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:55.104848 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_recv_
>> DEBUG1 - 21:06:55.314850 [3823/3829]: Waiting for metadata to be flushed (in
>> lttng_ustconsumer_r
>>
>> After that I only get these "Waiting for metadata to be flushed" lines.
>>
>> Following the code, I see that area in:
>>
>> common/ust-consumer/ust-consumer.c::lttng_ustconsumer_recv_metadata() where
>> there is a loop like:
>> 	while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>> 		DBG("Waiting for metadata to be flushed");
>>
>> 		health_code_update();
>>
>> 		usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
>> 	}
>>
>> So this loop will carry on forever if consumer_metadata_cache_flushed() never
>> returns 0.
>> It seems to me we still have some metadata to flush, but during session destroy
>> the communication channel is already down, so the metadata will never be
>> flushed out.
>> My first thought was that consumer_metadata_cache_flushed() does not recognize
>> that channel->metadata_stream is down, but perhaps the problem is that we don't
>> discard the metadata after the channel has been disconnected, or something
>> along those lines.
>> I'm still trying to follow the code here, which is a bit hard for a newcomer.
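>>
>> To make it concrete, here is a rough sketch of the kind of check I suspect
>> is missing. The types, fields and function name below are simplified
>> stand-ins I made up for illustration; they are not the real lttng-tools
>> structures:
>>
>> #include <pthread.h>
>> #include <stdint.h>
>>
>> struct metadata_cache {
>> 	pthread_mutex_t lock;
>> 	uint64_t max_offset;		/* how far the cache has been written */
>> };
>>
>> struct consumer_channel {
>> 	struct metadata_cache *cache;
>> 	void *metadata_stream;		/* NULL once the stream is torn down */
>> };
>>
>> /* Returns non-zero while the caller should keep waiting, 0 otherwise. */
>> static int metadata_flush_pending(struct consumer_channel *chan,
>> 		uint64_t offset)
>> {
>> 	int pending;
>>
>> 	pthread_mutex_lock(&chan->cache->lock);
>> 	if (!chan->metadata_stream) {
>> 		/* Stream already gone (session destroy): stop waiting. */
>> 		pending = 0;
>> 	} else {
>> 		pending = chan->cache->max_offset < offset;
>> 	}
>> 	pthread_mutex_unlock(&chan->cache->lock);
>> 	return pending;
>> }
>>
>> With something along those lines, the wait loop above would terminate once
>> the destroy path has removed the metadata stream.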
>>
>> This has been seen using lttng 2.5.4 / lttng-ust 2.5.5; looking at the latest
>> master branch (which at the time I fetched it was commit fb27f84), the code
>> around this looks similar.
>>
>> I wonder if this is something that has been encountered before? At least a
>> comment a few lines up in the same function suggests it:
>> 	/*
>> 	 * Skip metadata flush on write error since the offset and len might
>> 	 * not have been updated which could create an infinite loop below when
>> 	 * waiting for the metadata cache to be flushed.
>> 	 */
>>
>> But since we unlock the metadata_cache and lock it again inside
>> consumer_metadata_cache_flushed(), I guess there is a window here where other
>> threads get a chance to do things that were not thought of.
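>>
>> To illustrate the window I mean, here is a minimal made-up example (not the
>> real code) of what can happen once the lock is dropped and re-taken:
>>
>> #include <pthread.h>
>> #include <stdbool.h>
>>
>> /* Minimal stand-ins, invented purely for illustration. */
>> static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
>> static bool metadata_pending = true;	/* cleared when metadata is flushed */
>> static bool stream_up = true;		/* cleared by the destroy path */
>>
>> /* What the destroy path does; it can run in the window between the locks. */
>> static void destroy_stream(void)
>> {
>> 	pthread_mutex_lock(&cache_lock);
>> 	stream_up = false;	/* from now on nobody will flush the cache */
>> 	pthread_mutex_unlock(&cache_lock);
>> }
>>
>> /* Shape of the flush check: re-takes the lock but only looks at the cache. */
>> static int cache_flushed_check(void)
>> {
>> 	int still_pending;
>>
>> 	pthread_mutex_lock(&cache_lock);
>> 	still_pending = metadata_pending;	/* stream_up is never consulted */
>> 	pthread_mutex_unlock(&cache_lock);
>> 	return still_pending;	/* stays non-zero once destroy_stream() has run */
>> }
>>
>> If destroy_stream() runs in that window, cache_flushed_check() keeps
>> returning non-zero and the wait loop never exits.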
>
> I ran into this situation in my stress-tests recently, and proposed
> the following fix, currently merged in lttng-tools master branch:
>
> commit c585821bc78955b3d747fcd733aa1d2b81a3258e
> Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> Date: Wed Aug 19 14:44:59 2015 -0700
>
> Fix: sessiond vs consumerd push/get metadata deadlock
>
> We need to unlock the registry while we push metadata to break a
> circular dependency between the consumerd metadata lock and the sessiond
> registry lock. Indeed, pushing metadata to the consumerd awaits that it
> gets pushed all the way to relayd, but doing so requires grabbing the
> metadata lock. If a concurrent metadata request is being performed by
> consumerd, this can try to grab the registry lock on the sessiond while
> holding the metadata lock on the consumer daemon. Those push and pull
> schemes are performed on two different bidirectional communication
> sockets.
>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
> Signed-off-by: Jérémie Galarneau <jeremie.galarneau at efficios.com>
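>
> In rough terms, the shape of the change is the following. This is a
> simplified sketch with made-up names (the struct ust_registry fields and
> send_metadata_to_consumer() are inventions), ignoring the copying, error
> handling and concurrent-pusher bookkeeping the real code needs; it is not
> the actual code from the commit:
>
> #include <pthread.h>
> #include <stdint.h>
>
> /* Stand-in registry type, invented for illustration. */
> struct ust_registry {
> 	pthread_mutex_t lock;
> 	char *metadata;
> 	uint64_t metadata_len;
> 	uint64_t metadata_len_sent;
> };
>
> /* Pretend transport call; in reality this can block until relayd acks. */
> extern int send_metadata_to_consumer(const char *data, uint64_t len);
>
> static int push_metadata(struct ust_registry *reg)
> {
> 	const char *data;
> 	uint64_t len;
> 	int ret;
>
> 	/* Hold the registry lock only long enough to see what must be sent. */
> 	pthread_mutex_lock(&reg->lock);
> 	data = reg->metadata + reg->metadata_len_sent;
> 	len = reg->metadata_len - reg->metadata_len_sent;
> 	pthread_mutex_unlock(&reg->lock);
>
> 	/*
> 	 * Send without holding the registry lock, so a concurrent metadata
> 	 * request from consumerd can take it and no circular wait can form.
> 	 */
> 	ret = send_metadata_to_consumer(data, len);
> 	if (ret < 0) {
> 		return ret;
> 	}
>
> 	/* Re-take the lock to update the "pushed so far" bookkeeping. */
> 	pthread_mutex_lock(&reg->lock);
> 	reg->metadata_len_sent += len;
> 	pthread_mutex_unlock(&reg->lock);
> 	return 0;
> }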
>
> Can you reproduce this issue on the master branch of tools, ust and modules?
>
> Oh gosh, if this solves our issue I must say lttng issue resolution is fast!
> :-)
> I'll see how I can do that. If I'm not able to update to the master branch,
> do you think the commit above alone would do the job?
>
> During the night I've been running with a crude workaround like:
>
> diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
> index f891828..9ded651 100644
> --- a/src/common/ust-consumer/ust-consumer.c
> +++ b/src/common/ust-consumer/ust-consumer.c
> @@ -1026,12 +1026,19 @@ error:
> /*
> * Receive the metadata updates from the sessiond.
> */
> +int workaround_verifier=1;
> int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
> uint64_t len, struct lttng_consumer_channel *channel,
> int timer, int wait)
> {
> int ret, ret_code = LTTCOMM_CONSUMERD_SUCCESS;
> char *metadata_str;
> + int workaround_counter;
> +
> + if (workaround_verifier) {
> + DBG("%s: workaround exist in %s\n", __FILE__, __func__);
> + workaround_verifier=0;
> + }
>
> DBG("UST consumer push metadata key %" PRIu64 " of len %" PRIu64, key, l
> en);
>
> @@ -1072,7 +1079,11 @@ int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
> if (!wait) {
> goto end_free;
> }
> + workaround_counter = 42;
> while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
> + if (!workaround_counter--) {
> + break;
> + }
> DBG("Waiting for metadata to be flushed");
>
> health_code_update();
> --
> 1.7.9.5
>
> and it seems to help us get out of the hang, although it doesn't fix why we
> entered the loop in the first place.
> I'll back that out and try to get your patch in...
>
> OK, I tested with only that commit on top of our base. Unfortunately it doesn't
> work at all then:
>
> root at axm5516:~# sh lttng-hammer1.sh
> Round 0
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data
> availability............................................................................................................................................................................................................................................................................................................................................................................................................................................^C
> root at axm5516:~#
>
> This may be because I'm only using that one commit, I guess. It applied cleanly
> without any manual change, but perhaps it depends on other earlier changes
> anyway.
> Next is to try a build with the latest master branch as you suggested, but that
> will not be as easy here... I'll see what can be done.
> Any opinion on the crude workaround as an intermediate "fix"?
>
> BR,
> Per
>
> OK, we have now run over the weekend with the latest & greatest on the master
> branch. After 90000+ rounds, there is a breakdown:
>
> ...
> Round 90311
> Session s1 created.
> Traces will be written in net://128.224.95.15:52000:52001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data availability
> Tracing stopped for session s1
> Session s1 destroyed
> Round 90312
> Session s1 created.
> Traces will be written in net://128.224.95.15:52000:52001
> Live timer set to 1000000 usec
> Error: Channel metadata: Unable to connect to lttng-relayd (session s1)
> Error: Command error
> Error: Channel c1: Unable to connect to lttng-relayd (session s1)
> Error: Command error
> Error: Events: Unable to connect to lttng-relayd (channel c1, session s1)
> Error: Command error
> Error: No channel found in the session
> Warning: Tracing already stopped for session s1
> Session s1 destroyed
> ...
>
> After that it never gets back to a working state again.
> In the lttng-sessiond/consumerd log we see:
>
> ...
> DEBUG1 - 12:45:17.205035 [8694/8694]: Cleaning up all agent apps (in
> sessiond_cleanup() at main.c:715)
> DEBUG1 - 12:45:17.205084 [8694/8694]: Closing all UST sockets (in
> sessiond_cleanup() at main.c:718)
> DEBUG3 - 12:45:17.205167 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205173 [8694/8694]: Buffer registry destroy all registry (in
> buffer_reg_destroy_registries() at buffer-registry.c:716)
> DEBUG3 - 12:45:17.205240 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG1 - 12:45:17.205280 [8694/8694]: Unloading kernel modules (in
> sessiond_cleanup() at main.c:730)
> DEBUG3 - 12:45:17.205292 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205378 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205415 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205450 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205487 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205541 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.205578 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG3 - 12:45:17.205619 [8694/8695]: [ht-thread] Polling. (in
> thread_ht_cleanup() at ht-cleanup.c:67)
> DEBUG3 - 12:45:17.209681 [8694/8695]: [ht-thread] Returning from poll on 2 fds.
> (in thread_ht_cleanup() at ht-cleanup.c:74)
> DEBUG1 - 12:45:17.209711 [8694/8695]: [ht-cleanup] quit. (in thread_ht_cleanup()
> at ht-cleanup.c:154)
> DEBUG1 - 12:45:17.209778 [8694/8695]: [ht-cleanup] Thread terminates. (in
> thread_ht_cleanup() at ht-cleanup.c:165)
> DEBUG1 - 12:45:17.210254 [8694/8694]: Cleaning up options (in
> sessiond_cleanup_options() at main.c:777)
> DEBUG1 - 12:45:17.210293 [8694/8694]: *** assert failed :-) *** ==> Matthew,
> BEET driven development works! (in sessiond_cleanup_options() at main.c:807)
> root at axm5516:~#
>
> Hard to say whether that is the same situation, but at least it seems to happen
> during a session destroy scenario.
> So the newer code base still seems to have issues.
We're missing the log from relayd to understand what is going on there.
Thanks,
Mathieu
>
> BR,
> Per
>
> Thanks,
>
> Mathieu
>
>>
>> Best regards,
>> Per
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev at lists.lttng.org
>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com