[lttng-dev] Lttng Soft Hang issue
Aravind HT
aravind.ht at gmail.com
Mon Jul 13 14:47:39 EDT 2015
The specific change in
http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
that I need is for cds_list_del(&stream->recv_list); to be called.
Below is the patch extract.
diff --git a/src/bin/lttng-relayd/main.c b/src/bin/lttng-relayd/main.c
index a93151a..d82a341 100644
--- a/src/bin/lttng-relayd/main.c
+++ b/src/bin/lttng-relayd/main.c
@@ -1340,6 +1340,18 @@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
 	session->stream_count--;
 	assert(session->stream_count >= 0);
+	/*
+	 * Remove the stream from the connection recv list since we are about to
+	 * flag it invalid and thus might be freed. This has to be done here since
+	 * only the control thread can do actions on that list.
+	 *
+	 * Note that this stream might NOT be in the list but we have to try to
+	 * remove it here else this can race with the stream destruction freeing
+	 * the object and the connection destroy doing a use after free when
+	 * deleting the remaining nodes in this list.
+	 */
+	cds_list_del(&stream->recv_list);
+
 	/* Check if we can close it or else the data will do it. */
 	try_close_stream(session, stream);
On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
jeremie.galarneau at efficios.com> wrote:
> Do you have a specific changeset in mind when you say you applied the
> changes mentioned in that thread?
>
> Do you simply revert the following commit?
>
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>
> Thanks,
> Jérémie
>
>
> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>
>> To try this fix, I also needed the changes talked about in
>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ,
>> otherwise, I would see relayd coring.
>> Once I took them I could see that the soft hang issue was no longer
>> reproducible.
>>
>> Thanks for helping.
>>
>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>>
>>> Sure, I will try the fix and update.
>>>
>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>> jeremie.galarneau at efficios.com> wrote:
>>>
>>>> Would you mind testing Mathieu's patch?
>>>>
>>>> I have rebased it on stable-2.6:
>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>
>>>> Thanks,
>>>> Jérémie
>>>>
>>>>
>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>> jeremie.galarneau at efficios.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> What do you observe when this happens? Does the session daemon become
>>>>> unresponsive?
>>>>>
>>>>> Jérémie
>>>>>
>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Could someone please help with this? I am blocked at this point,
>>>>>> because any application crash leaves lttng in a non-working state.
>>>>>>
>>>>>> Thanks,
>>>>>> Aravind.
>>>>>>
>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>> consumerd failed soon after starting the session.
>>>>>>>
>>>>>>>
>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>> causing an overall health check failure.
>>>>>>>
>>>>>>>
>>>>>>> Here is the sequence of steps that I found led to
>>>>>>> *thread_manage_consumer()* exiting.
>>>>>>>
>>>>>>>
>>>>>>> 1. In *thread_manage_apps():1558*,
>>>>>>> *ust_app_unregister(pollfd)* is called. This happens when
>>>>>>> an error is detected in *revents = LTTNG_POLL_GETEV(&events,
>>>>>>> i)*.
>>>>>>>
>>>>>>> My initial guess is that one of my apps crashed, which made
>>>>>>> *epoll()* report *LPOLLERR | LPOLLHUP | LPOLLRDHUP* for that app
>>>>>>> and in turn caused *ust_app_unregister()* to be called for it.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2. In *ust_app_unregister():3154*, *close_metadata(registry,
>>>>>>> ua_sess->consumer)* is called, which sets
>>>>>>> *registry->metadata_closed = 1*.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> (2.a) Note: *close_metadata()* also calls
>>>>>>> *consumer_close_metadata()*, which sends
>>>>>>> *LTTNG_CONSUMER_CLOSE_METADATA* and *metadata_key* to the
>>>>>>> consumerd to stop it from dealing with the concerned app any further.
>>>>>>> Somehow this doesn't seem to help.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 3. Next, I see that *thread_manage_consumer():1353* has for
>>>>>>> some reason ignored 2.a above and still handles a request/reply for
>>>>>>> that app by calling *ust_consumer_metadata_request():491 ->
>>>>>>> ust_app_push_metadata(ust_reg, socket, 1)*, which at line 460 checks
>>>>>>> *registry->metadata_closed* and returns *-EPIPE*.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 4. This *-EPIPE* error cascades all the way back up to
>>>>>>> *thread_manage_consumer():1353*, at which point
>>>>>>> *thread_manage_consumer()* decides to exit, causing health_check()
>>>>>>> to fail.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> So it looks like, under some scenarios, an application crash can
>>>>>>> make *thread_manage_consumer()* exit and break the health check.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I think a possible fix for this scenario is, instead of step 4, to
>>>>>>> send an *ERROR* message back to consumerd. This could be done from
>>>>>>> the *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>>> know if this is correct and shed more light on the issue?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Please forgive any omissions of the posting guidelines on my
>>>>>>> part. This is my first post.
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Aravind.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jérémie Galarneau
>>>>> EfficiOS Inc.
>>>>> http://www.efficios.com
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
>