[lttng-dev] Lttng Soft Hang issue

Jérémie Galarneau jeremie.galarneau at efficios.com
Mon Jul 13 11:34:38 EDT 2015


Do you have a specific changeset in mind when you say you applied the
changes mentioned in that thread?

Did you simply revert the following commit?
https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f

Thanks,
Jérémie

On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht at gmail.com> wrote:

> To try this fix, I also needed the changes discussed in
> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ;
> without them, relayd would dump core.
> Once I applied them, the soft hang issue was no longer
> reproducible.
>
> Thanks for helping.
>
> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>
>> Sure, I will try the fix and update.
>>
>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>> jeremie.galarneau at efficios.com> wrote:
>>
>>> Would you mind testing Mathieu's patch?
>>>
>>> I have rebased it on stable-2.6:
>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>
>>> Thanks,
>>> Jérémie
>>>
>>>
>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>> jeremie.galarneau at efficios.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> What do you observe when this happens? Does the session daemon become
>>>> unresponsive?
>>>>
>>>> Jérémie
>>>>
>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht at gmail.com>
>>>> wrote:
>>>>
>>>>> Could someone please help with this? I am blocked at this point:
>>>>> any application crash leaves lttng in a non-working state.
>>>>>
>>>>> Thanks,
>>>>> Aravind.
>>>>>
>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have recently started trying lttng 2.6 for a few applications on my
>>>>>> Ubuntu 12.04 and I noticed that the health check on sessiond and consumerd
>>>>>> failed soon after starting the session.
>>>>>>
>>>>>>
>>>>>> On investigating, I found that thread_manage_consumer() had exited,
>>>>>> causing an overall health check failure.
>>>>>>
>>>>>> Here is the sequence of steps that I found contributed to
>>>>>> *thread_manage_consumer()* exiting.
>>>>>>
>>>>>>
>>>>>> 1.       In *thread_manage_apps():1558*, *ust_app_unregister(pollfd)*
>>>>>> is called. This happens when an error is detected by
>>>>>> *revents = LTTNG_POLL_GETEV(&events, i)*.
>>>>>>
>>>>>> My initial guess is that one of my apps crashed, causing *epoll()*
>>>>>> to report *LPOLLERR | LPOLLHUP | LPOLLRDHUP*, which in turn caused
>>>>>> *ust_app_unregister()* to be called for that app.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2.       *ust_app_unregister():3154* calls *close_metadata(registry,
>>>>>> ua_sess->consumer)*, which sets *registry->metadata_closed = 1*.
>>>>>>
>>>>>>
>>>>>>
>>>>>> (2.a) Note: *close_metadata()* also calls
>>>>>> *consumer_close_metadata()*, which sends
>>>>>> *LTTNG_CONSUMER_CLOSE_METADATA* and the *metadata_key* to consumerd
>>>>>> to stop it from dealing any further with the app concerned. Somehow
>>>>>> this doesn't seem to help.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 3.       Next, I see that *thread_manage_consumer():1353* has, for
>>>>>> some reason, ignored 2.a above and goes on to handle a request/reply
>>>>>> for that app by calling *ust_consumer_metadata_request():491 ->
>>>>>> ust_app_push_metadata(ust_reg, socket, 1)*, which at line 460 checks
>>>>>> *registry->metadata_closed* and returns *-EPIPE*.
>>>>>>
>>>>>>
>>>>>>
>>>>>> 4.       This *-EPIPE* error cascades all the way back up to
>>>>>> *thread_manage_consumer():1353*, at which point
>>>>>> *thread_manage_consumer()* decides to exit, causing health_check()
>>>>>> to fail.
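
The four steps above can be condensed into a small stand-alone C model. It is only a sketch of the failure path as described in this report, not lttng-tools code: the struct and every function whose name ends in `_model` are hypothetical stand-ins. The only behaviour taken from the report is that a closed registry makes the metadata push return -EPIPE, and that any negative return ends the consumer management loop.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-application registry. */
struct registry_model {
	bool metadata_closed;
};

/* Step 2: unregistering a crashed app marks its metadata as closed. */
static void close_metadata_model(struct registry_model *reg)
{
	reg->metadata_closed = true;
}

/* Step 3: a push against a closed registry fails with -EPIPE. */
static int push_metadata_model(const struct registry_model *reg)
{
	if (reg->metadata_closed) {
		return -EPIPE;
	}
	return 0;
}

/* Step 3: the metadata request simply forwards the push result. */
static int metadata_request_model(const struct registry_model *reg)
{
	return push_metadata_model(reg);
}

/*
 * Step 4: the management loop treats every negative return as fatal.
 * Returns true if all requests were served, false if the loop
 * (standing in for the thread) had to exit early.
 */
static bool consumer_thread_survives(const struct registry_model *reg,
		int nr_requests)
{
	int i;

	for (i = 0; i < nr_requests; i++) {
		if (metadata_request_model(reg) < 0) {
			return false;
		}
	}
	return true;
}
```

In this model, a registry that was never closed lets the loop complete, while calling close_metadata_model() first makes the very next request end it, which corresponds to the health check failure described in step 4.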
>>>>>>
>>>>>> So it looks like, in some scenarios, an application crash can cause
>>>>>> the lttng health check to fail.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think a possible fix for this scenario is, instead of step 4, to
>>>>>> send an *ERROR* message back to *consumerd*. This could be done from
>>>>>> the *ust_consumer_metadata_request()* call. Can someone please let me
>>>>>> know if this is correct and shed more light on the issue?
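
A minimal sketch of the fix proposed above, assuming the request handler can tell "application already gone" apart from other failures. All names here (`reply_status`, `app_registry`, `handle_metadata_request`) are hypothetical and correspond neither to the lttng-tools API nor to the patch on the hang-fix branch mentioned in this thread. The sketch only shows -EPIPE being turned into an error reply instead of a fatal return value.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reply codes sent back to the consumer daemon. */
enum reply_status {
	REPLY_OK = 0,
	REPLY_ERROR = 1
};

/* Hypothetical stand-in for the per-application registry. */
struct app_registry {
	bool metadata_closed;
};

/* As in the report: a closed registry makes the push fail with -EPIPE. */
static int push_metadata(const struct app_registry *reg)
{
	return reg->metadata_closed ? -EPIPE : 0;
}

/*
 * Handle one metadata request.
 * Returns 0 when the caller should keep running (with *reply filled in),
 * or a negative errno value for failures that are still treated as fatal.
 */
static int handle_metadata_request(const struct app_registry *reg,
		enum reply_status *reply)
{
	int ret = push_metadata(reg);

	if (ret == -EPIPE) {
		/* The app is gone: report an error, but do not exit. */
		*reply = REPLY_ERROR;
		return 0;
	}
	if (ret < 0) {
		/* Any other failure is left fatal in this sketch. */
		return ret;
	}
	*reply = REPLY_OK;
	return 0;
}
```

The design choice illustrated is to keep the per-application failure local: the caller's loop only stops on a negative return, so a crashed application yields an error reply while requests for other applications continue to be served.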
>>>>>>
>>>>>> Please forgive me if I have missed any of the guidelines for posting
>>>>>> here. This is my first post.
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Aravind.
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Jérémie Galarneau
>>>> EfficiOS Inc.
>>>> http://www.efficios.com
>>>>


-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com