[lttng-dev] Lttng Soft Hang issue

Aravind HT aravind.ht at gmail.com
Mon Jul 13 15:29:16 EDT 2015


cds_list_del(&stream->recv_list) definitely needs to be called from
relay_close_stream() else conn->recv_head will continue to have a stale
pointer(for streams that got closed) until
the entire connection is destroyed, and conversely,
connection_destruction() means destroying the conn->recv_head list which
means iterating/removing entries of streams ( some of them could be stale,
if relay_close_stream() does not do cds_list_del() ) and in which case,
cds_list_del(&stream->recv_list) needs to be called also.

So removing either of them, I think, is not the good and in this case,
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
could be reverted


 *** However *** ,
there could also be a synchronization issue if relay_close_stream() and
connection_destroy() can happen concurrently if these calls are not
serialized ( stream->recv_list being accessed from two contexts ), need to
verify this before reverting
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f


I also do not know the kind of issues being seen/details to prompt the
reversal of
http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80



On Tue, Jul 14, 2015 at 12:17 AM, Aravind HT <aravind.ht at gmail.com> wrote:

> The specific change in
> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
> that I need is for cds_list_del(&stream->recv_list); to be called.
> Below is the patch extract.
>
> diff --git a/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> b/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
>
> index a93151a
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> ..d82a341
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
> 100644 (file)
> --- a/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc>
> +++ b/src/bin/lttng-relayd/main.c
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80>
> @@ -1340,6
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=a93151ac47f560875f07c9cbf113225f9dd892cc#l1340>
> +1340,18
> <http://git.lttng.org/?p=lttng-tools.git;a=blob;f=src/bin/lttng-relayd/main.c;h=d82a3412d7b8578be8f5b39abf0a9fd949ff07bc;hb=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80#l1340>
> @@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
>         session->stream_count--;
>         assert(session->stream_count >= 0);
>
> +       /*
>
> +        * Remove the stream from the connection recv list since we are about to
>
> +        * flag it invalid and thus might be freed. This has to be done here since
> +        * only the control thread can do actions on that list.
> +        *
>
> +        * Note that this stream might NOT be in the list but we have to try to
>
> +        * remove it here else this can race with the stream destruction freeing
>
> +        * the object and the connection destroy doing a use after free when
> +        * deleting the remaining nodes in this list.
> +        */
> +       cds_list_del(&stream->recv_list);
> +
>         /* Check if we can close it or else the data will do it. */
>         try_close_stream(session, stream);
>
>
>
>
>
>
> On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
> jeremie.galarneau at efficios.com> wrote:
>
>> Do you have a specific changeset in mind when you say you applied the
>> changes mentioned in that thread?
>>
>> Do you simply revert the following commit?
>>
>> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>>
>> Thanks,
>> Jérémie
>>
>>
>> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>>
>>> To try this fix, I also needed the changes talked about in
>>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
>>> otherwise, I would see relayd coring.
>>> Once I took them I could see that the soft hang issue was no longer
>>> reproducible.
>>>
>>> Thanks for helping.
>>>
>>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>>>
>>>> Sure, I will try the fix and update.
>>>>
>>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>>> jeremie.galarneau at efficios.com> wrote:
>>>>
>>>>> Would you mind testing Mathieu's patch?
>>>>>
>>>>> I have rebased it on stable-2.6:
>>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>>
>>>>> Thanks,
>>>>> Jérémie
>>>>>
>>>>>
>>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>>> jeremie.galarneau at efficios.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What do you observe when this happens? Does the session daemon become
>>>>>> unresponsive?
>>>>>>
>>>>>> Jérémie
>>>>>>
>>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Request someone to kindly help with this. Blocked at this point,
>>>>>>> unable to continue as any application crash leads to lttng not working.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Aravind.
>>>>>>>
>>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>>> consumerd failed soon after starting the session.
>>>>>>>>
>>>>>>>>
>>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>>> causing an overall health check failure.
>>>>>>>>
>>>>>>>>
>>>>>>>> Here are the sequence of steps that I found contributed to
>>>>>>>> *thread_manage_consumer()* exiting.
>>>>>>>>
>>>>>>>>
>>>>>>>> 1.       In *thread_manage_apps() : 1558 ,
>>>>>>>> ust_app_unregister(pollfd)* is being called. This happens when
>>>>>>>> there is an error detected by *revents = LTTNG_POLL_GETEV(&events,
>>>>>>>> i)*
>>>>>>>>
>>>>>>>> My initial guess here is that as one of my apps has crashed,
>>>>>>>> producing a *LPOLLERR | LPOLLHUP | LPOLLRDHUP *to be generated for
>>>>>>>> *epoll()* causing *ust_app_unregister()* to be called for that app.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 2.       In *ust_app_unregister():3154 , close_metadata(registry,
>>>>>>>> ua_sess->consumer)  *in which* registry->metadata_closed = 1 *is
>>>>>>>> set*.*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> (2.a) Note:* close_metadata() *also calls
>>>>>>>> * consumer_close_metadata() *which sends
>>>>>>>> * LTTNG_CONSUMER_CLOSE_METADATA *and* metadata_key *to the
>>>>>>>> consumerd to stop it from further dealing with the concerned app. Somehow
>>>>>>>> this doesn’t seem to help*.*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 3.       Next, I see that the *thread_manage_consumer():1353* for
>>>>>>>> some reason has ignored the above 2.a and gets to do request/reply for that
>>>>>>>> app by calling*ust_consumer_metadata_request():491 ->
>>>>>>>> ust_app_push_metadat(ust_reg, socket,1) *which at line 460 checks
>>>>>>>> for *registry->metadata_closed* and returns an *–EPIPE*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> 4.       This *–EPIPE* error cascades all the way back up to
>>>>>>>> *thread_manage_consumer():1353* at which point
>>>>>>>> *thread_manage_consumer()* decides to exit causing health_check()
>>>>>>>> to fail.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> So it looks like under some scenario, an application crash could
>>>>>>>> cause the lttng some problems.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think a possible fix for this scenario, is to instead of 4, send
>>>>>>>> an *ERROR* message back to *consumerd()* . This could be done from
>>>>>>>> *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>>>> know if this is correct and shed more light on the issue ?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Please forgive if there are any guideline omissions for posting
>>>>>>>> here from my part. This is my first post.
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Aravind.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> lttng-dev mailing list
>>>>>>> lttng-dev at lists.lttng.org
>>>>>>> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jérémie Galarneau
>>>>>> EfficiOS Inc.
>>>>>> http://www.efficios.com
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jérémie Galarneau
>>>>> EfficiOS Inc.
>>>>> http://www.efficios.com
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Jérémie Galarneau
>> EfficiOS Inc.
>> http://www.efficios.com
>>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20150714/dca785ca/attachment-0001.html>


More information about the lttng-dev mailing list