[lttng-dev] [RFC PATCH v2 06/13] Fix: unregister app notify socket on sessiond tear down

Jérémie Galarneau jeremie.galarneau at efficios.com
Thu Dec 14 01:56:15 UTC 2017


On 18 September 2017 at 18:51, Jonathan Rajotte
<jonathan.rajotte-julien at efficios.com> wrote:
> A race between the sessiond tear down and applications initialization
> can lead to a deadlock.
>
> Applications try to communicate via the notify sockets while sessiond

the sessiond

> does not listen anymore on these sockets since the thread responsible
> for reception/response is terminated (ust_thread_manage_notify). These
> sockets are never closed hence an application could hang on
> communication.
>
> Sessiond hang happen during call to cmd_destroy_session during

The sessiond hang happens during the call to cmd_destroy_session initiated
by sessiond_cleanup().

> sessiond_cleanup. Sessiond is trying to communicate with the app while
> the app is waiting for a response on the app notification socket.
>
> To prevent this situation a call to ust_app_notify_sock_unregister is
> performed on all entry of the ust_app_ht_by_notify_sock hash table at

entry -> entries

> the time of termination. This ensure that any pending communication

ensure -> ensures

> initiated by the application will be terminated since all sockets will
> be closed at the end of the grace period via call_rcu inside
> ust_app_notify_sock_unregister. The use of ust_app_ht_by_notify_sock
> instead of the ust_app_ht prevent a double call_rcu since entries are

prevent -> prevents

Do I understand the "big picture" of what you are trying to solve here?
  - App communicates on its 'notify' socket and waits for a reply
    - Sessiond never sends that reply since apps_notify_thread (the
      thread that manages application 'notify' sockets) is dead.
  - Sessiond is sending a command through the application 'command'
    socket during its clean-up
    - Sessiond never receives a reply since the application is still
      waiting for a reply on the 'notify' socket.

The 'simple' fix I see would be that the apps_notify_thread should
close all the FDs it manages (and set them to -1) as it is the only
thread interacting with those FDs (to my knowledge). That would
unblock the applications and prevent further notification from
being sent.

What I understand you are trying to do is to perform the clean-up
in the same way it would have occurred had the application simply
died. At first glance, that sounds like a clean way to handle the
problem.

Going back to your patch:

Why does using ust_app_ht_by_notify_sock instead of ust_app_ht
prevent a double call_rcu?

I see how this thread performs a deferred clean-up using call_rcu().
What is causing the second one?

> removed from ust_app_ht_by_notify_sock during ust_app_notify_sock_unregister.
>
> This can be reproduced using the sessiond_teardown_active_session
> scenario provided by [1].
>
> [1] https://github.com/PSRCode/lttng-stress
>
> Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
> ---
>  src/bin/lttng-sessiond/ust-thread.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/src/bin/lttng-sessiond/ust-thread.c b/src/bin/lttng-sessiond/ust-thread.c
> index 1e7a8229..8f11133a 100644
> --- a/src/bin/lttng-sessiond/ust-thread.c
> +++ b/src/bin/lttng-sessiond/ust-thread.c
> @@ -27,6 +27,19 @@
>  #include "health-sessiond.h"
>  #include "testpoint.h"
>
> +
> +static
> +void notify_sock_unregister_all()
> +{
> +       struct lttng_ht_iter iter;
> +       struct ust_app *app;

Missing empty line here.

> +       rcu_read_lock();
> +       cds_lfht_for_each_entry(ust_app_ht_by_notify_sock->ht, &iter.iter, app, notify_sock_n.node) {

This line exceeds 80 chars.

> +               ust_app_notify_sock_unregister(app->notify_sock);
> +       }
> +       rcu_read_unlock();
> +}
> +
>  /*
>   * This thread manage application notify communication.
>   */
> @@ -53,7 +66,7 @@ void *ust_thread_manage_notify(void *data)
>
>         ret = lttng_poll_create(&events, 2, LTTNG_CLOEXEC);
>         if (ret < 0) {
> -               goto error;
> +               goto error_poll_create;
>         }
>
>         /* Add quit pipe */
> @@ -197,6 +210,8 @@ error_poll_create:
>  error_testpoint:
>         utils_close_pipe(apps_cmd_notify_pipe);
>         apps_cmd_notify_pipe[0] = apps_cmd_notify_pipe[1] = -1;
> +       notify_sock_unregister_all();
> +
>         DBG("Application notify communication apps thread cleanup complete");
>         if (err) {
>                 health_error();
> --
> 2.11.0
>



-- 
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com


More information about the lttng-dev mailing list