[lttng-dev] Lttng Soft Hang issue

Aravind HT aravind.ht at gmail.com
Wed Jul 15 10:00:54 EDT 2015


I have gone through a few scenarios; below is my analysis of the problem,
with my proposed solution at the end.

Please let me know if this is correct.


The brief analysis is:

1.       Whenever a new stream is added:

a.      a stream object containing path_name, channel_name, recv_list and
other members is allocated and populated.

b.      the recv_list member embedded in the stream object is linked into
the conn->recv_head list. conn represents a connection that is made up
of multiple streams.



2.       In one scenario, when a stream is closed (RELAYD_CLOSE_STREAM), the
memory allocated in 1.a is freed through a call to ctf_trace_destroy() ->
stream_delete() and stream_destroy(). However, the stream's entry in
conn->recv_head is not removed, so the list now holds a dangling pointer. A
stream object can be closed in several different ways, and each of them
causes the same dangling-pointer problem. The attached .xps document shows
quite a few code paths in which this can happen, which adds to the
complexity.

3.       Among the code paths that access the conn->recv_head list are
set_viewer_ready_flag() and the addition of a new stream through
relay_add_stream(); both iterate over the list and modify it. There may be
other cases as well.



In short, there is a linked list whose nodes are embedded in memory regions
that get freed, but the nodes are not unlinked from the list at that point.
Any subsequent access to the list then touches freed memory; a minimal
sketch of the situation follows.
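
To make this concrete, here is a minimal, self-contained sketch of the
situation (the type names below are invented for illustration and only
mirror the relevant parts of the real relayd structures; this is not
lttng-tools code):

#include <stdlib.h>
#include <urcu/list.h>

struct relay_stream_ex {
	char *channel_name;
	struct cds_list_head recv_list;	/* 1.a: node embedded in the stream */
};

struct relay_connection_ex {
	struct cds_list_head recv_head;	/* list head owned by the connection */
};

int main(void)
{
	struct relay_connection_ex conn;
	struct relay_stream_ex *stream = calloc(1, sizeof(*stream));

	CDS_INIT_LIST_HEAD(&conn.recv_head);

	/* 1.b: the embedded node is linked into conn->recv_head. */
	cds_list_add(&stream->recv_list, &conn.recv_head);

	/* 2: the stream is freed without a matching
	 * cds_list_del(&stream->recv_list); conn.recv_head still points at
	 * the freed node, so every later traversal (3) touches freed memory. */
	free(stream);

	return 0;
}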




Fixes attempted so far:

https://github.com/lttng/lttng-tools/commit/cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80

https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f


The above fixes remove the node from conn->recv_head using cds_list_del(),
and they attempt to do so from multiple places:


1. relay_close_stream()

There is nothing wrong with removing the node here using cds_list_del().


2. connection_destroy()


This is where the corruption problem is not solved, because
connection_destroy() gets called from destroy_connection():


static void destroy_connection(struct lttng_ht *relay_connections_ht,
		struct relay_connection *conn)
{
	assert(relay_connections_ht);
	assert(conn);

	connection_delete(relay_connections_ht, conn);

	/* For the control socket, we try to destroy the session. */
	if (conn->type == RELAY_CONTROL && conn->session) {
		destroy_session(conn->session, conn->sessions_ht);
	}

	connection_destroy(conn);
}


But destroy_session() above calls ctf_trace_destroy(), which frees the
memory allocated for the streams; the later attempt by connection_destroy()
to iterate over the conn->recv_head list is then a use after free, since the
list now contains dangling pointers. The ordering is sketched below.
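
The ordering problem can be boiled down to the following self-contained
sketch (again with invented stand-in names, not the actual relayd code; the
real connection_destroy() is only assumed here to delete the remaining
recv_head nodes, as described above):

#include <stdlib.h>
#include <urcu/list.h>

struct stream_ex { struct cds_list_head recv_list; };
struct conn_ex   { struct cds_list_head recv_head; };

/* Stand-in for destroy_session() -> ctf_trace_destroy(): frees the streams
 * but leaves their nodes linked into conn->recv_head. */
static void destroy_session_ex(struct conn_ex *conn)
{
	struct stream_ex *s, *tmp;

	cds_list_for_each_entry_safe(s, tmp, &conn->recv_head, recv_list) {
		free(s);	/* node is not unlinked first */
	}
}

/* Stand-in for connection_destroy(): deletes whatever is left on recv_head. */
static void connection_destroy_ex(struct conn_ex *conn)
{
	struct stream_ex *s, *tmp;

	cds_list_for_each_entry_safe(s, tmp, &conn->recv_head, recv_list) {
		cds_list_del(&s->recv_list);	/* use after free */
	}
}

int main(void)
{
	struct conn_ex conn;
	struct stream_ex *s = calloc(1, sizeof(*s));

	CDS_INIT_LIST_HEAD(&conn.recv_head);
	cds_list_add(&s->recv_list, &conn.recv_head);

	/* Same order as destroy_connection(): session first, connection last. */
	destroy_session_ex(&conn);
	connection_destroy_ex(&conn);
	return 0;
}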



3. From my side, I thought that just calling cds_list_del() from
relay_close_stream() should fix it. However, consider what happens when
set_viewer_ready_flag() is called:


static void set_viewer_ready_flag(struct relay_connection *conn)
{
	struct relay_stream *stream, *tmp_stream;

	pthread_mutex_lock(&conn->session->viewer_ready_lock);
	cds_list_for_each_entry_safe(stream, tmp_stream, &conn->recv_head,
			recv_list) {
		stream->viewer_ready = 1;
		cds_list_del(&stream->recv_list);
	}
	pthread_mutex_unlock(&conn->session->viewer_ready_lock);

	return;
}


It iterates over the entire conn->recv_head list and removes all of its
entries. So my previous solution would then try to remove an entry that has
already been removed, causing a different kind of corruption
(cds_list_del() has no safety check against deleting a node twice); the
sketch below shows why.
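
A minimal sketch of the double-removal problem (a standalone example, not
relayd code): liburcu's cds_list_del() only rewires the neighbours of the
removed node, so the node's own next/prev keep pointing at its old
neighbours, and a second cds_list_del() on the same node writes through
those stale pointers.

#include <assert.h>
#include <urcu/list.h>

struct node {
	struct cds_list_head list;
};

int main(void)
{
	struct cds_list_head head;
	struct node a, b;

	CDS_INIT_LIST_HEAD(&head);
	cds_list_add(&a.list, &head);
	cds_list_add(&b.list, &head);

	/* First removal is fine, but a.list.next/prev still point at the old
	 * neighbours: cds_list_del() does not reset the removed node. */
	cds_list_del(&a.list);

	/* A second cds_list_del(&a.list) here would write through those stale
	 * pointers and silently corrupt the list -- nothing catches it.
	 * Resetting the node after removal (the memset() in the patch below,
	 * or cds_list_del_init() if the liburcu version in use provides it)
	 * at least makes the "already removed" state detectable. */

	assert(!cds_list_empty(&head));	/* b is still linked */
	return 0;
}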



The solution that I am thinking of is below:


1.  Wherever cds_list_del(&stream->recv_list) gets called, reset the
pointers to NULL as

    stream->recv_list.next = stream->recv_list.prev = NULL;


2. Remove the entry from the conn->recv_head list only from stream_delete(),
after checking for the above NULLs and the stream->viewer_ready flag.

   The NULL set/check may not strictly be needed, as we could manage with
the viewer_ready flag alone, but in case we have missed any other scenario,
the NULLs would help us identify/fix it faster. A sketch of such a guarded
removal is below; the actual patch follows further down.
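
For illustration, a guarded removal combining both checks could look like
the following sketch (stream_recv_list_del_safe() is a hypothetical helper
name, not something that exists in lttng-tools, and it assumes the struct
relay_stream definition from lttng-relayd; the actual patch below only
checks viewer_ready):

/* Hypothetical helper, for illustration only: unlink the stream from
 * conn->recv_head unless it was already unlinked and reset elsewhere. */
static void stream_recv_list_del_safe(struct relay_stream *stream)
{
	/* NULL next/prev means another path already removed and reset the
	 * node (solution step 1); viewer_ready set means
	 * set_viewer_ready_flag() already removed it. */
	if (stream->recv_list.next == NULL || stream->viewer_ready) {
		return;
	}
	cds_list_del(&stream->recv_list);
	stream->recv_list.next = stream->recv_list.prev = NULL;
}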




/---------------------------------------------------------------------/

The patch with the fix :

/---------------------------------------------------------------------/


diff --git a/src/bin/lttng-relayd/main.c b/src/bin/lttng-relayd/main.c
index cc68410..da1b2e2 100644
--- a/src/bin/lttng-relayd/main.c
+++ b/src/bin/lttng-relayd/main.c
@@ -1153,6 +1153,8 @@ void set_viewer_ready_flag(struct relay_connection *conn)
 			recv_list) {
 		stream->viewer_ready = 1;
 		cds_list_del(&stream->recv_list);
+		/* Reset the removed node memory, we shall check in other places for this */
+		memset(&stream->recv_list, 0, sizeof(struct cds_list_head));
 	}
 	pthread_mutex_unlock(&conn->session->viewer_ready_lock);
 	return;

diff --git a/src/bin/lttng-relayd/stream.c b/src/bin/lttng-relayd/stream.c
index 410fae8..7043797 100644
--- a/src/bin/lttng-relayd/stream.c
+++ b/src/bin/lttng-relayd/stream.c
@@ -144,6 +144,13 @@ void stream_delete(struct lttng_ht *ht, struct relay_stream *stream)
 	assert(!ret);
 
 	cds_list_del(&stream->trace_list);
+	/* Before we attempt to remove the stream from conn->recv_head list, need to check if it was already removed from set_viewer_ready_flag() */
+	if (stream->viewer_ready == 0) {
+		cds_list_del(&stream->recv_list);
+		/* Reset the removed node memory, we shall check in other places for this */
+		memset(&stream->recv_list, 0, sizeof(struct cds_list_head));
+	}
+
 }



On Tue, Jul 14, 2015 at 12:59 AM, Aravind HT <aravind.ht at gmail.com> wrote:

> cds_list_del(&stream->recv_list) definitely needs to be called from
> relay_close_stream(); otherwise conn->recv_head will continue to hold a
> stale pointer (for streams that got closed) until the entire connection is
> destroyed. Conversely, destroying the connection means tearing down the
> conn->recv_head list, i.e. iterating over and removing the stream entries
> (some of which could be stale if relay_close_stream() does not call
> cds_list_del()), and in that case cds_list_del(&stream->recv_list) needs to
> be called there as well.
>
> So removing either of the two calls is, I think, not good, and in that case
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
> could be reverted.
>
>
> *** However ***,
> there could also be a synchronization issue if relay_close_stream() and
> connection_destroy() can run concurrently without these calls being
> serialized (stream->recv_list would then be accessed from two contexts);
> this needs to be verified before reverting
> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>
>
> I also do not know what kind of issues were being seen, or the details
> that prompted the reversal of
> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
>
>
>
> On Tue, Jul 14, 2015 at 12:17 AM, Aravind HT <aravind.ht at gmail.com> wrote:
>
>> The specific change in
>> http://git.lttng.org/?p=lttng-tools.git;a=commitdiff;h=cd2ef1ef1d54ced9e4d0d03b865bb7fc6a905f80
>> that I need is for cds_list_del(&stream->recv_list); to be called.
>> Below is the patch extract.
>>
>> diff --git a/src/bin/lttng-relayd/main.c b/src/bin/lttng-relayd/main.c
>> index a93151a..d82a341 100644
>> --- a/src/bin/lttng-relayd/main.c
>> +++ b/src/bin/lttng-relayd/main.c
>> @@ -1340,6 +1340,18 @@ int relay_close_stream(struct lttcomm_relayd_hdr *recv_hdr,
>>         session->stream_count--;
>>         assert(session->stream_count >= 0);
>>
>> +       /*
>> +        * Remove the stream from the connection recv list since we are about to
>> +        * flag it invalid and thus might be freed. This has to be done here since
>> +        * only the control thread can do actions on that list.
>> +        *
>> +        * Note that this stream might NOT be in the list but we have to try to
>> +        * remove it here else this can race with the stream destruction freeing
>> +        * the object and the connection destroy doing a use after free when
>> +        * deleting the remaining nodes in this list.
>> +        */
>> +       cds_list_del(&stream->recv_list);
>> +
>>         /* Check if we can close it or else the data will do it. */
>>         try_close_stream(session, stream);
>>
>>
>>
>>
>>
>> On Mon, Jul 13, 2015 at 9:04 PM, Jérémie Galarneau <
>> jeremie.galarneau at efficios.com> wrote:
>>
>>> Do you have a specific changeset in mind when you say you applied the
>>> changes mentioned in that thread?
>>>
>>> Do you simply revert the following commit?
>>>
>>> https://github.com/lttng/lttng-tools/commit/1dc0526df43f2b5f86ef451e4c0331445346b15f
>>>
>>> Thanks,
>>> Jérémie
>>>
>>>
>>> On Sat, Jul 11, 2015 at 3:26 AM, Aravind HT <aravind.ht at gmail.com>
>>> wrote:
>>>
>>>> To try this fix, I also needed the changes discussed in
>>>> http://lists.lttng.org/pipermail/lttng-dev/2015-July/024689.html ;
>>>> otherwise, I would see relayd dumping core.
>>>> Once I took them, I could see that the soft hang issue was no longer
>>>> reproducible.
>>>>
>>>> Thanks for helping.
>>>>
>>>> On Wed, Jul 8, 2015 at 9:49 AM, Aravind HT <aravind.ht at gmail.com>
>>>> wrote:
>>>>
>>>>> Sure, I will try the fix and update.
>>>>>
>>>>> On Mon, Jul 6, 2015 at 10:12 PM, Jérémie Galarneau <
>>>>> jeremie.galarneau at efficios.com> wrote:
>>>>>
>>>>>> Would you mind testing Mathieu's patch?
>>>>>>
>>>>>> I have rebased it on stable-2.6:
>>>>>> git clone https://github.com/jgalar/lttng-tools.git -b hang-fix
>>>>>>
>>>>>> Thanks,
>>>>>> Jérémie
>>>>>>
>>>>>>
>>>>>> On Mon, Jul 6, 2015 at 11:23 AM, Jérémie Galarneau <
>>>>>> jeremie.galarneau at efficios.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> What do you observe when this happens? Does the session daemon
>>>>>>> become unresponsive?
>>>>>>>
>>>>>>> Jérémie
>>>>>>>
>>>>>>> On Wed, May 27, 2015 at 6:19 AM, Aravind HT <aravind.ht at gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I request someone to kindly help with this. I am blocked at this
>>>>>>>> point and unable to continue, as any application crash leads to
>>>>>>>> lttng not working.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Aravind.
>>>>>>>>
>>>>>>>> On Thu, May 21, 2015 at 12:18 AM, Aravind HT <aravind.ht at gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have recently started trying lttng 2.6 for a few applications on
>>>>>>>>> my Ubuntu 12.04 and I noticed that the health check on sessiond and
>>>>>>>>> consumerd failed soon after starting the session.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On investigating, I found that thread_manage_consumer() had exited
>>>>>>>>> causing an overall health check failure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Here is the sequence of steps that I found contributed to
>>>>>>>>> thread_manage_consumer() exiting.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1.       In thread_manage_apps() : 1558,
>>>>>>>>> ust_app_unregister(pollfd) is called. This happens when an error
>>>>>>>>> is detected by revents = LTTNG_POLL_GETEV(&events, i).
>>>>>>>>>
>>>>>>>>> My initial guess here is that, as one of my apps has crashed, an
>>>>>>>>> LPOLLERR | LPOLLHUP | LPOLLRDHUP event is generated for epoll(),
>>>>>>>>> causing ust_app_unregister() to be called for that app.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2.       In ust_app_unregister():3154, close_metadata(registry,
>>>>>>>>> ua_sess->consumer) is called, in which registry->metadata_closed = 1
>>>>>>>>> is set.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2.a) Note: close_metadata() also calls consumer_close_metadata(),
>>>>>>>>> which sends LTTNG_CONSUMER_CLOSE_METADATA and the metadata_key to
>>>>>>>>> the consumerd to stop it from further dealing with the concerned
>>>>>>>>> app. Somehow this does not seem to help.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 3.       Next, I see that thread_manage_consumer():1353 has for
>>>>>>>>> some reason ignored the above 2.a and goes on to do a request/reply
>>>>>>>>> for that app by calling ust_consumer_metadata_request():491 ->
>>>>>>>>> ust_app_push_metadata(ust_reg, socket, 1), which at line 460 checks
>>>>>>>>> registry->metadata_closed and returns an -EPIPE.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 4.       This -EPIPE error cascades all the way back up to
>>>>>>>>> thread_manage_consumer():1353, at which point
>>>>>>>>> thread_manage_consumer() decides to exit, causing the health check
>>>>>>>>> to fail.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So it looks like, in some scenarios, an application crash can
>>>>>>>>> cause lttng problems.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think a possible fix for this scenario is, instead of step 4, to
>>>>>>>>> send an ERROR message back to the consumerd. This could be done
>>>>>>>>> from the ust_consumer_metadata_request() call. Can someone please
>>>>>>>>> let me know if this is correct and shed more light on the issue?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Please forgive any posting-guideline omissions on my part; this is
>>>>>>>>> my first post.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Aravind.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jérémie Galarneau
>>>>>>> EfficiOS Inc.
>>>>>>> http://www.efficios.com
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Jérémie Galarneau
>>>>>> EfficiOS Inc.
>>>>>> http://www.efficios.com
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Jérémie Galarneau
>>> EfficiOS Inc.
>>> http://www.efficios.com
>>>
>>
>>
>