[lttng-dev] Sometimes a session destroy hangs lttng-consumerd process

Hallsmark, Per Per.Hallsmark at windriver.com
Wed Sep 9 08:23:39 EDT 2015


Hello,

I'm hitting an issue where, after doing session create/destroy many times, a destroy leaves lttng-consumerd stuck in a loop in which it wants to flush metadata, but the connection is probably already down at that stage.

I'm attaching a script that exercises lttng. The issue is hard to reproduce; it took 9 hours of running, and the script's console log looks like this:

....
Round 4327
Session s1 created.
Traces will be written in net://128.224.95.199:53000:53001
Live timer set to 1000000 usec
UST channel metadata enabled for session s1
UST channel c1 enabled for session s1
All UST events are enabled in channel c1
Tracing started for session s1
Waiting for data availability..
Tracing stopped for session s1
Session s1 destroyed
Round 4328
Session s1 created.
Traces will be written in net://128.224.95.199:53000:53001
Live timer set to 1000000 usec
UST channel metadata enabled for session s1
UST channel c1 enabled for session s1
All UST events are enabled in channel c1
Tracing started for session s1
libust[14905/14905]: Error: Timed out waiting for lttng-sessiond (in lttng_ust_init() at lttng-ust-comm.c:1521)
< here it hangs >

If I then try a simple "lttng list" or similar on the target, that also hangs:

root at localhost:~# lttng -vvv list
DEBUG2 - 10:23:36.911853 [15122/15122]: Session name: (null) (in cmd_list() at commands/list.c:911)
DEBUG1 - 10:23:36.913314 [15122/15122]: LSM cmd type : 13 (in send_session_msg() at lttng-ctl.c:131)
< here it hangs >

If I start lttng-sessiond to log things:

lttng-sessiond -Z -v -b > lttng-sessiond.log 2>&1

Then I get a huge log which, after some 9 hours, shows entries like these:

DEBUG1 - 21:06:53.632585 [3807/3810]: Processing client command 17 (in process_client_msg() at main.
DEBUG3 - 21:06:53.644582 [3807/3813]: Consumer pushing metadata on sock 33 of len 686 (in consumer_p
DEBUG1 - 21:06:53.656539 [3807/3810]: Getting session s1 by name (in process_client_msg() at main.c:
DEBUG1 - 21:06:53.656625 [3823/3829]: UST consumer push metadata key 8658 of len 686 (in lttng_ustco
DEBUG1 - 21:06:53.693577 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:53.693585 [3823/3827]: Metadata poll return from wait with 2 fd(s) (in consumer_threa
DEBUG1 - 21:06:53.847615 [3823/3827]: Metadata event catched in thread (in consumer_thread_metadata_
DEBUG1 - 21:06:53.847642 [3823/3827]: Metadata available on fd 81 (in consumer_thread_metadata_poll(
DEBUG1 - 21:06:54.548885 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:54.264889 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:54.474883 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:54.684874 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:54.894872 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:55.104848 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_recv_
DEBUG1 - 21:06:55.314850 [3823/3829]: Waiting for metadata to be flushed (in lttng_ustconsumer_r

After that I only get these "Waiting for metadata to be flushed" lines.

Following the code, that message comes from this loop in
common/ust-consumer/ust-consumer.c::lttng_ustconsumer_recv_metadata():
        while (consumer_metadata_cache_flushed(channel, offset + len, timer)) {
                DBG("Waiting for metadata to be flushed");

                health_code_update();

                usleep(DEFAULT_METADATA_AVAILABILITY_WAIT_TIME);
        }

So this loop carries on forever as long as consumer_metadata_cache_flushed() doesn't return 0.
It seems to me there is still some metadata to flush, but during session destroy the communication channel is already down, so that metadata will never be flushed out.
My first thought was that consumer_metadata_cache_flushed() does not recognize that channel->metadata_stream is down, but perhaps the real problem is that we don't discard the metadata after the channel has been disconnected.
I'm still trying to follow the code here, which is a bit hard for a newcomer.
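
To make that first thought concrete, here is a small stand-alone sketch of the kind of check I would expect somewhere on the consumer_metadata_cache_flushed() path. This is not lttng-tools code; the fake_* structs and cache_flushed_sketch() are made up for illustration. The idea is just that a torn-down metadata stream should make the predicate report "flushed" so the caller stops sleeping:

/*
 * Stand-alone sketch, NOT lttng-tools code: the fake_* types and
 * cache_flushed_sketch() are invented for illustration only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct fake_metadata_stream {
        uint64_t metadata_pushed;  /* bytes of metadata already pushed out */
        bool endpoint_active;      /* cleared once the network side goes away */
};

struct fake_channel {
        struct fake_metadata_stream *metadata_stream;  /* NULL once torn down */
};

/* Return 1 while metadata up to 'offset' still has to be flushed, 0 otherwise. */
static int cache_flushed_sketch(const struct fake_channel *chan, uint64_t offset)
{
        const struct fake_metadata_stream *stream = chan->metadata_stream;

        if (!stream || !stream->endpoint_active) {
                /*
                 * Stream gone or endpoint down (e.g. during session destroy):
                 * nothing can ever be flushed anymore, so report "flushed"
                 * instead of making the caller wait forever.
                 */
                return 0;
        }
        return stream->metadata_pushed >= offset ? 0 : 1;
}

int main(void)
{
        struct fake_metadata_stream s = { .metadata_pushed = 0, .endpoint_active = false };
        struct fake_channel chan = { .metadata_stream = &s };

        /* 686 is the metadata length seen in the log above. */
        printf("still waiting? %d\n", cache_flushed_sketch(&chan, 686));
        return 0;
}

Whether such a check belongs inside consumer_metadata_cache_flushed() itself or as an extra break condition in the while loop above, I can't say yet; the sketch is only meant to show the intent.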

This has been seen with lttng 2.5.4 / lttng-ust 2.5.5; looking at the latest master branch (commit fb27f84 at the time I fetched it), the code around this area looks similar.

I wonder if this is something that has been encountered before? At least a comment a few lines up in the same function suggests it:
                /*
                 * Skip metadata flush on write error since the offset and len might
                 * not have been updated which could create an infinite loop below when
                 * waiting for the metadata cache to be flushed.
                 */

But since we unlock the metadata_cache lock and take it again inside consumer_metadata_cache_flushed(), I guess there is a window here in which other threads can get in and do things that were not accounted for.
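
Spelled out as code, the shape I'm reading is roughly the following. Again this is only a sketch with a single fake lock and fake functions (cache_flushed_poll(), recv_metadata_shape()), not the real lttng-tools locking, just to show where that window sits:

/*
 * Sketch only, not the real lttng-tools locking: a single fake cache
 * lock and dummy functions, just to show where the unlock/re-lock
 * window between the cache write and the flushed-poll sits.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* In the hang this would keep returning 1; here it returns 0 so the sketch ends. */
static int cache_flushed_poll(void)
{
        int still_pending;

        pthread_mutex_lock(&cache_lock);   /* re-taken on every iteration */
        still_pending = 0;
        pthread_mutex_unlock(&cache_lock);
        return still_pending;
}

static void recv_metadata_shape(void)
{
        pthread_mutex_lock(&cache_lock);
        /* ... the metadata cache write happens here ... */
        pthread_mutex_unlock(&cache_lock);

        /*
         * <-- the window: between the unlock above and each re-lock inside
         * cache_flushed_poll(), another thread (e.g. the one handling the
         * session destroy) can take cache_lock and tear the metadata stream
         * down. If the poll never notices that, the loop below spins
         * forever, which is what the repeated "Waiting for metadata to be
         * flushed" log lines look like.
         */
        while (cache_flushed_poll()) {
                printf("Waiting for metadata to be flushed\n");
                usleep(200000);
        }
}

int main(void)
{
        recv_metadata_shape();
        return 0;
}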

Best regards,
Per
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lttng-hammer1.sh
Type: application/x-shellscript
Size: 350 bytes
Desc: lttng-hammer1.sh
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20150909/38c350a5/attachment-0001.bin>

