[lttng-dev] Sometimes a session destroy hangs lttng-consumerd process

Hallsmark, Per Per.Hallsmark at windriver.com
Mon Sep 14 07:12:40 EDT 2015


> Hello,
>
> I'm hit by an issue where, when doing session create/destroy multiple times,
> during destroy lttng-consumerd cannot get out of a loop in which it wants to
> flush metadata, but the connection is probably already down at that stage.
>
> I'm attaching a script that exercises lttng. The issue seems hard to hit; it
> took 9 hours of running, and the script console log looks like:
>
> ....
> Round 4327
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data availability..
> Tracing stopped for session s1
> Session s1 destroyed
> Round 4328
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> libust[14905/14905]: Error: Timed out waiting for lttng-sessiond (in
> lttng_ust_init() at lttng-ust-comm.c:1521)
> < here it hangs >
>
> If I'm now trying a simple "lttng list" or so in target, that also hangs:
>
> root@localhost:~# lttng -vvv list
> DEBUG2 - 10:23:36.911853 [15122/15122]: Session name: (null) (in cmd_list() at
> commands/list.c:911)
> DEBUG1 - 10:23:36.913314 [15122/15122]: LSM cmd type : 13 (in send_session_msg()
> at lttng-ctl.c:131)
> < here it hangs >
>
> If I start lttng-sessiond to log things:
>
> lttng-sessiond -Z -v -b > lttng-sessiond.log 2>&1
>
> Then I get a huge log, where after some 9 hours it shows:
>
> DEBUG1 - 21:06:53.632585 [3807/3810]: Processing client command 17 (in
> process_client_msg() at main.
> DEBUG3 - 21:06:53.644582 [3807/3813]: Consumer pushing metadata on sock 33 of
> len 686 (in consumer_p
> DEBUG1 - 21:06:53.656539 [3807/3810]: Getting session s1 by name (in
> process_client_msg() at main.c:
> DEBUG1 - 21:06:53.656625 [3823/3829]: UST consumer push metadata key 8658 of len
> 686 (in lttng_ustco
> DEBUG1 - 21:06:53.693577 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:53.693585 [3823/3827]: Metadata poll return from wait with 2
> fd(s) (in consumer_threa
> DEBUG1 - 21:06:53.847615 [3823/3827]: Metadata event catched in thread (in
> consumer_thread_metadata_
> DEBUG1 - 21:06:53.847642 [3823/3827]: Metadata available on fd 81 (in
> consumer_thread_metadata_poll(
> DEBUG1 - 21:06:54.548885 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.264889 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.474883 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.684874 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.894872 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:55.104848 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_recv_
> DEBUG1 - 21:06:55.314850 [3823/3829]: Waiting for metadata to be flushed (in
> lttng_ustconsumer_r
>
> After that I only get these "Waiting for metadata to be flushed" lines.
>
> Following the code, I see that the relevant area is in
> common/ust-consumer/ust-consumer.c::lttng_ustconsumer_recv_metadata(), where
> there is a loop like:
>        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>                DBG("Waiting for metadata to be flushed");
>
>                health_code_update();
>
>                usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
>        }
>
> So this loop will carry on forever if consumer_metadata_cache_flushed() isn't
> returning 0.
> It seems to me we have some metadata to flush out, but during session destroy
> the communication channel is already down so the metadata will never be flushed
> out.
> My first thought was that it may be because consumer_metadata_cache_flushed()
> is not recognizing that channel->metadata_stream is down, but perhaps it is the
> fact that we don't scrap the metadata after we have disconnected the channel.
> I'm still trying to follow the code here, which is a bit hard for a newcomer.
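>
> Just to make that first idea concrete, I would naively have expected a bail-out
> along these lines somewhere in the wait path (purely hypothetical, ignoring
> locking, not actual lttng code):
>
>        /* Hypothetical guard: stop waiting if the metadata stream is
>         * already torn down, instead of polling forever. */
>        if (!channel->metadata_stream) {
>                break;
>        }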
>
> This has been seen using lttng 2.5.4 / lttng-ust 2.5.5; looking at the latest
> master branch (which at the time I fetched it was commit fb27f84), the code
> around this looks similar.
>
> I wonder if this is something that has been encountered before? At least a
> comment a few lines up in the same function suggests it:
>                /*
>                 * Skip metadata flush on write error since the offset and len might
>                 * not have been updated which could create an infinite loop below when
>                 * waiting for the metadata cache to be flushed.
>                 */
>
> But as we unlock the metadata_cache and lock it again in
> consumer_metadata_cache_flushed(), I guess there is a window here where other
> things could get time to do stuff that was not thought of.

I ran into this situation in my stress-tests recently, and proposed
the following fix, currently merged in lttng-tools master branch:

commit c585821bc78955b3d747fcd733aa1d2b81a3258e
Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Date:   Wed Aug 19 14:44:59 2015 -0700

    Fix: sessiond vs consumerd push/get metadata deadlock

    We need to unlock the registry while we push metadata to break a
    circular dependency between the consumerd metadata lock and the sessiond
    registry lock. Indeed, pushing metadata to the consumerd awaits that it
    gets pushed all the way to relayd, but doing so requires grabbing the
    metadata lock. If a concurrent metadata request is being performed by
    consumerd, this can try to grab the registry lock on the sessiond while
    holding the metadata lock on the consumer daemon. Those push and pull
    schemes are performed on two different bidirectionnal communication
    sockets.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
    Signed-off-by: Jérémie Galarneau <jeremie.galarneau at efficios.com>
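
In a nutshell, the shape of the change is to do the blocking push with the
registry lock released. Roughly like this, heavily simplified; the names below
are made up for illustration and do not match the actual sources:

#include <pthread.h>
#include <stddef.h>

/* Illustrative stand-in for the sessiond registry lock. */
static pthread_mutex_t registry_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper that blocks until consumerd has pushed the data on. */
extern int push_metadata_to_consumerd(const char *buf, size_t len);

static int push_metadata(const char *buf, size_t len)
{
        int ret;

        pthread_mutex_lock(&registry_lock);
        /* Snapshot what needs to be pushed while holding the lock. */
        pthread_mutex_unlock(&registry_lock);

        /*
         * Push without holding the registry lock: consumerd may be
         * servicing a concurrent metadata request that needs this lock,
         * so blocking here while holding it is the deadlock described
         * in the commit message.
         */
        ret = push_metadata_to_consumerd(buf, len);

        pthread_mutex_lock(&registry_lock);
        /* Record the pushed offset based on ret. */
        pthread_mutex_unlock(&registry_lock);

        return ret;
}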

Can you reproduce this issue on the master branch of tools, ust and modules?

Oh gosh, if this solves our issue I must say the lttng issue resolution is fast! :-)
I'll see how I can do that. If I'm not able to update to the master branch, do you
think the commit above alone would do the job?

During the night I've been running with a crude workaround like:

diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
index f891828..9ded651 100644
--- a/src/common/ust-consumer/ust-consumer.c
+++ b/src/common/ust-consumer/ust-consumer.c
@@ -1026,12 +1026,19 @@ error:
 /*
  * Receive the metadata updates from the sessiond.
  */
+int workaround_verifier=1;
 int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
                uint64_t len, struct lttng_consumer_channel *channel,
                int timer, int wait)
 {
        int ret, ret_code = LTTCOMM_CONSUMERD_SUCCESS;
        char *metadata_str;
+       int workaround_counter;
+
+       if (workaround_verifier) {
+               DBG("%s: workaround exist in %s\n", __FILE__, __func__);
+               workaround_verifier=0;
+       }

        DBG("UST consumer push metadata key %" PRIu64 " of len %" PRIu64, key, l
en);

@@ -1072,7 +1079,11 @@ int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
        if (!wait) {
                goto end_free;
        }
+       workaround_counter = 42;
        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
+               if (!workaround_counter--) {
+                       break;
+               }
                DBG("Waiting for metadata to be flushed");

                health_code_update();
--
1.7.9.5

and it seems to help us get out of the hang, although it doesn't fix why we entered it in the first place.
I'll back that out and try to get your patch in...
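
If a stop-gap like this is needed at all, a wall-clock bound would probably be
cleaner than my fixed iteration count; roughly like this (untested sketch, the
10 s limit is an arbitrary pick and <time.h> is needed for clock_gettime()):

        struct timespec start, now;

        clock_gettime(CLOCK_MONOTONIC, &start);
        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
                DBG("Waiting for metadata to be flushed");

                clock_gettime(CLOCK_MONOTONIC, &now);
                if (now.tv_sec - start.tv_sec >= 10) {
                        /* Arbitrary cap so a dead peer cannot wedge this thread forever. */
                        DBG("Giving up waiting for metadata to be flushed");
                        break;
                }

                health_code_update();

                usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
        }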

OK, I tested with only that commit on top of our base. Unfortunately it doesn't work at all then:

root@axm5516:~# sh lttng-hammer1.sh
Round 0
Session s1 created.
Traces will be written in net://128.224.95.199:53000:53001
Live timer set to 1000000 usec
UST channel metadata enabled for session s1
UST channel c1 enabled for session s1
All UST events are enabled in channel c1
Tracing started for session s1
Waiting for data availability............................................................................................................................................................................................................................................................................................................................................................................................................................................^C
root@axm5516:~#

This may be because I'm only using that one commit, I guess. It applied cleanly without any manual change, but perhaps it depends on other earlier changes anyway.
Next is to try a build with the latest master branch as you suggest, but that will not be as easy here... I'll see what can be done.
Any opinion on the crude workaround as an intermediate "fix"?

BR,
Per

OK, we have now run over the weekend with the latest and greatest on the master branch.
After 90000+ rounds, there is a breakdown:

...
Round 90311        
Session s1 created.                                      
Traces will be written in net://128.224.95.15:52000:52001
Live timer set to 1000000 usec             
UST channel metadata enabled for session s1
UST channel c1 enabled for session s1   
All UST events are enabled in channel c1
Tracing started for session s1
Waiting for data availability 
Tracing stopped for session s1
Session s1 destroyed
Round 90312        
Session s1 created.                                      
Traces will be written in net://128.224.95.15:52000:52001
Live timer set to 1000000 usec             
Error: Channel metadata: Unable to connect to lttng-relayd (session s1)
Error: Command error                    
Error: Channel c1: Unable to connect to lttng-relayd (session s1)
Error: Command error         
Error: Events: Unable to connect to lttng-relayd (channel c1, session s1)
Error: Command error
Error: No channel found in the session
Warning: Tracing already stopped for session s1
Session s1 destroyed                                     
...

After that it never gets back to a working state again.
In the lttng-sessiond/consumerd log we see:

...
DEBUG1 - 12:45:17.205035 [8694/8694]: Cleaning up all agent apps (in sessiond_cleanup() at main.c:715)
DEBUG1 - 12:45:17.205084 [8694/8694]: Closing all UST sockets (in sessiond_cleanup() at main.c:718)
DEBUG3 - 12:45:17.205167 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG3 - 12:45:17.205173 [8694/8694]: Buffer registry destroy all registry (in buffer_reg_destroy_registries() at buffer-registry.c:716)
DEBUG3 - 12:45:17.205240 [8694/8695]: [ht-thread] Polling. (in thread_ht_cleanup() at ht-cleanup.c:67)
DEBUG1 - 12:45:17.205280 [8694/8694]: Unloading kernel modules (in sessiond_cleanup() at main.c:730)
DEBUG3 - 12:45:17.205292 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG3 - 12:45:17.205378 [8694/8695]: [ht-thread] Polling. (in thread_ht_cleanup() at ht-cleanup.c:67)
DEBUG3 - 12:45:17.205415 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG3 - 12:45:17.205450 [8694/8695]: [ht-thread] Polling. (in thread_ht_cleanup() at ht-cleanup.c:67)
DEBUG3 - 12:45:17.205487 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG3 - 12:45:17.205541 [8694/8695]: [ht-thread] Polling. (in thread_ht_cleanup() at ht-cleanup.c:67)
DEBUG3 - 12:45:17.205578 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG3 - 12:45:17.205619 [8694/8695]: [ht-thread] Polling. (in thread_ht_cleanup() at ht-cleanup.c:67)
DEBUG3 - 12:45:17.209681 [8694/8695]: [ht-thread] Returning from poll on 2 fds. (in thread_ht_cleanup() at ht-cleanup.c:74)
DEBUG1 - 12:45:17.209711 [8694/8695]: [ht-cleanup] quit. (in thread_ht_cleanup() at ht-cleanup.c:154)
DEBUG1 - 12:45:17.209778 [8694/8695]: [ht-cleanup] Thread terminates. (in thread_ht_cleanup() at ht-cleanup.c:165)
DEBUG1 - 12:45:17.210254 [8694/8694]: Cleaning up options (in sessiond_cleanup_options() at main.c:777)
DEBUG1 - 12:45:17.210293 [8694/8694]: *** assert failed :-) *** ==> Matthew, BEET driven development works! (in sessiond_cleanup_options() at main.c:807)
root@axm5516:~#

Hard to say if that is the same situation, but at least it seems to happen during a session destroy scenario.
So the newer code base still seems to have issues.

BR,
Per

Thanks,

Mathieu

>
> Best regards,
> Per
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
_______________________________________________
lttng-dev mailing list
lttng-dev at lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
