[lttng-dev] Sometimes a session destroy hangs lttng-consumerd process

Hallsmark, Per Per.Hallsmark at windriver.com
Thu Sep 10 02:47:47 EDT 2015



> Hello,
>
> I'm hit by an issue where, when doing session create/destroy multiple times,
> lttng-consumerd, during the destroy, cannot get out of a loop in which it wants to
> flush metadata, but the connection is probably already down at this stage.
>
> I'm attaching a script that exercises lttng. The issue seems hard to hit;
> it took 9 hours of running, and the script console log looks like:
>
> ....
> Round 4327
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> Waiting for data availability..
> Tracing stopped for session s1
> Session s1 destroyed
> Round 4328
> Session s1 created.
> Traces will be written in net://128.224.95.199:53000:53001
> Live timer set to 1000000 usec
> UST channel metadata enabled for session s1
> UST channel c1 enabled for session s1
> All UST events are enabled in channel c1
> Tracing started for session s1
> libust[14905/14905]: Error: Timed out waiting for lttng-sessiond (in lttng_ust_init() at lttng-ust-comm.c:1521)
> < here it hangs >
>
> If I now try a simple "lttng list" or similar on the target, that also hangs:
>
> root@localhost:~# lttng -vvv list
> DEBUG2 - 10:23:36.911853 [15122/15122]: Session name: (null) (in cmd_list() at commands/list.c:911)
> DEBUG1 - 10:23:36.913314 [15122/15122]: LSM cmd type : 13 (in send_session_msg() at lttng-ctl.c:131)
> < here it hangs >
>
> If I start lttng-sessiond to log things:
>
> lttng-sessiond -Z -v -b > lttng-sessiond.log 2>&1
>
> Then I get a huge log which, after some 9 hours, shows entries like:
>
> DEBUG1 - 21:06:53.632585 [3807/3810]: Processing client command 17 (in process_client_msg() at main.
> DEBUG3 - 21:06:53.644582 [3807/3813]: Consumer pushing metadata on sock 33 of len 686 (in consumer_p
> DEBUG1 - 21:06:53.656539 [3807/3810]: Getting session s1 by name (in process_client_msg() at main.c:
> DEBUG1 - 21:06:53.656625 [3823/3829]: UST consumer push metadata key 8658 of len 686 (in lttng_ustco
> DEBUG1 - 21:06:53.693577 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:53.693585 [3823/3827]: Metadata poll return from wait with 2 fd(s) (in consumer_threa
> DEBUG1 - 21:06:53.847615 [3823/3827]: Metadata event catched in thread (in consumer_thread_metadata_
> DEBUG1 - 21:06:53.847642 [3823/3827]: Metadata available on fd 81 (in consumer_thread_metadata_poll(
> DEBUG1 - 21:06:54.548885 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.264889 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.474883 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.684874 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:54.894872 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:55.104848 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
> DEBUG1 - 21:06:55.314850 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_r
>
> After that I only get these "Waiting for metadata to be flushed" lines.
>
> Following the code, I see that this happens in
>
> common/ust-consumer/ust-consumer.c::lttng_ustconsumer_recv_metadata(), where
> there is a loop like:
>        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>                DBG("Waiting for metadata to be flushed");
>
>                health_code_update();
>
>                usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
>        }
>
> So this loop will carry on forever if consumer_metadata_cache_flushed() doesn't
> return 0.
> It seems to me that we have some metadata to flush out, but during session destroy
> the communication channel is already down, so the metadata will never be flushed
> out.
> My first thought was that consumer_metadata_cache_flushed() is not
> recognizing that channel->metadata_stream is down, but perhaps it is the fact that
> we don't scrap the metadata after we have disconnected the channel, or something like that.
> I'm still trying to follow the code here, which is a bit hard for a newcomer.
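>
> Just to illustrate what I mean, something like the following is what I would
> have expected (only a sketch, not a proposed patch; whether checking
> channel->metadata_stream like this is safe or sufficient is a guess on my part):
>
>        /*
>         * Sketch only: bail out of the wait loop if the metadata stream has
>         * already been torn down by the session destroy, instead of retrying
>         * forever.  The check below is an assumption, not the upstream code.
>         */
>        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>                if (!channel->metadata_stream) {
>                        /* Nobody left to consume the cache; stop waiting. */
>                        break;
>                }
>                DBG("Waiting for metadata to be flushed");
>
>                health_code_update();
>
>                usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
>        }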
>
> This has been seen using lttng-tools 2.5.4 / lttng-ust 2.5.5; looking at the latest
> master branch (which, at the time I fetched it, was commit fb27f84), the code
> around this looks similar.
>
> I wonder if this is something that has been encountered before? At least a
> comment some lines up in the same function suggests it:
>                /*
>                 * Skip metadata flush on write error since the offset and len might
>                 * not have been updated which could create an infinite loop below when
>                 * waiting for the metadata cache to be flushed.
>                 */
>
> But as we unlock the metadata_cache and then lock it again inside
> consumer_metadata_cache_flushed(), I guess there is a window here where other
> things could get time to do stuff that wasn't thought of.
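>
> Roughly, the timeline I'm imagining looks like this (pseudo-C for illustration
> only; the exact locking, e.g. channel->metadata_cache->lock, is my assumption
> from reading the code):
>
>        pthread_mutex_lock(&channel->metadata_cache->lock);
>        /* ... write the received metadata into the cache ... */
>        pthread_mutex_unlock(&channel->metadata_cache->lock);
>
>        /*
>         * <-- window: a concurrent session destroy can tear down the
>         * metadata stream right here, so nothing will ever consume the
>         * cached data ...
>         */
>
>        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
>                /* ... and this keeps re-checking the cache forever. */
>        }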

I ran into this situation in my stress-tests recently, and proposed
the following fix, currently merged in lttng-tools master branch:

commit c585821bc78955b3d747fcd733aa1d2b81a3258e
Author: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Date:   Wed Aug 19 14:44:59 2015 -0700

    Fix: sessiond vs consumerd push/get metadata deadlock

    We need to unlock the registry while we push metadata to break a
    circular dependency between the consumerd metadata lock and the sessiond
    registry lock. Indeed, pushing metadata to the consumerd awaits that it
    gets pushed all the way to relayd, but doing so requires grabbing the
    metadata lock. If a concurrent metadata request is being performed by
    consumerd, this can try to grab the registry lock on the sessiond while
    holding the metadata lock on the consumer daemon. Those push and pull
    schemes are performed on two different bidirectionnal communication
    sockets.

    Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
    Signed-off-by: Jérémie Galarneau <jeremie.galarneau at efficios.com>
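
Roughly, the idea of the fix is the following (a simplified sketch, not the
literal patch; registry->lock and push_metadata_to_consumerd() are placeholders
for the real sessiond registry lock and push path):

        /*
         * Snapshot what needs to be pushed while holding the registry lock,
         * then release it before talking to the consumerd, so that a
         * concurrent metadata request from the consumerd can take the
         * registry lock without deadlocking on the consumerd metadata lock.
         */
        pthread_mutex_lock(&registry->lock);
        /* ... copy the new metadata and note its offset/length ... */
        pthread_mutex_unlock(&registry->lock);

        /* Push to the consumerd without holding the registry lock. */
        ret = push_metadata_to_consumerd(socket, metadata_key, metadata_str, len, offset);

        pthread_mutex_lock(&registry->lock);
        /* ... update the registry state according to ret ... */
        pthread_mutex_unlock(&registry->lock);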

Can you reproduce this issue on the master branch of tools, ust and modules?

Oh gosh, if this solves our issue I must say that lttng issue resolution is fast! :-)
I'll see how I can do that. If I'm not able to update to the master branch, do you
think the commit above alone would do the job?

During the night I've been running with a crude workaround like:

diff --git a/src/common/ust-consumer/ust-consumer.c b/src/common/ust-consumer/ust-consumer.c
index f891828..9ded651 100644
--- a/src/common/ust-consumer/ust-consumer.c
+++ b/src/common/ust-consumer/ust-consumer.c
@@ -1026,12 +1026,19 @@ error:
 /*
  * Receive the metadata updates from the sessiond.
  */
+int workaround_verifier=1;
 int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
 		uint64_t len, struct lttng_consumer_channel *channel,
 		int timer, int wait)
 {
 	int ret, ret_code = LTTCOMM_CONSUMERD_SUCCESS;
 	char *metadata_str;
+	int workaround_counter;
+
+	if (workaround_verifier) {
+		DBG("%s: workaround exist in %s\n", __FILE__, __func__);
+		workaround_verifier=0;
+	}
 
 	DBG("UST consumer push metadata key %" PRIu64 " of len %" PRIu64, key, l
en);
 
@@ -1072,7 +1079,11 @@ int lttng_ustconsumer_recv_metadata(int sock, uint64_t key, uint64_t offset,
 	if (!wait) {
 		goto end_free;
 	}
+	workaround_counter = 42;
 	while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
+		if (!workaround_counter--) {
+			break;
+		}
 		DBG("Waiting for metadata to be flushed");
 
 		health_code_update();
-- 
1.7.9.5

and it seems to help us get out of the hang, although it doesn't fix why we ended up
there in the first place. I'll back that out and try to get your patch in...

Thanks,

Mathieu

>
> Best regards,
> Per
> _______________________________________________
> lttng-dev mailing list
> lttng-dev at lists.lttng.org
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

