[lttng-dev] [RFC PATCH v2 06/13] Fix: unregister app notify socket on sessiond tear down
Jérémie Galarneau
jeremie.galarneau at efficios.com
Thu Dec 14 01:56:15 UTC 2017
On 18 September 2017 at 18:51, Jonathan Rajotte
<jonathan.rajotte-julien at efficios.com> wrote:
> A race between the sessiond tear down and applications initialization
> can lead to a deadlock.
>
> Applications try to communicate via the notify sockets while sessiond
the sessiond
> does not listen anymore on these sockets since the thread responsible
> for reception/response is terminated (ust_thread_manage_notify). These
> sockets are never closed hence an application could hang on
> communication.
>
> Sessiond hang happen during call to cmd_destroy_session during
The sessiond hang happens during the call to cmd_destroy_session initiated
by sessiond_cleanup().
> sessiond_cleanup. Sessiond is trying to communicate with the app while
> the app is waiting for a response on the app notification socket.
>
> To prevent this situation a call to ust_app_notify_sock_unregister is
> performed on all entry of the ust_app_ht_by_notify_sock hash table at
entry -> entries
> the time of termination. This ensure that any pending communication
ensure -> ensures
> initiated by the application will be terminated since all sockets will
> be closed at the end of the grace period via call_rcu inside
> ust_app_notify_sock_unregister. The use of ust_app_ht_by_notify_sock
> instead of the ust_app_ht prevent a double call_rcu since entries are
prevent -> prevents
Do I understand the "big picture" of what you are trying to solve here?
- App communicates on its 'notify' socket and waits for a reply
- Sessiond never sends that reply since apps_notify_thread (the
thread that manages application 'notify' sockets) is dead.
- Sessiond is sending a command through the application 'command'
socket during its clean-up
- Sessiond never receives a reply since the application is still
waiting for a reply on the 'notify' socket.
The 'simple' fix I see would be that the apps_notify_thread should
close all the FDs it manages (and set them to -1) as it is the only
thread interacting with those FDs (to my knowledge). That would
unblock the applications and prevent further notification from
being sent.
What I understand you are trying to do is to perform the clean-up
in the same way it would have occurred had the application simply
died. At first glance, that sounds like a clean way to handle the
problem.
Going back to your patch:
Why does using ust_app_ht_by_notify_sock instead of ust_app_ht
prevent a double call_rcu?
I see how this thread performs a deferred clean-up using call_rcu().
What is causing the second one?
> removed from ust_app_ht_by_notify_sock during ust_app_notify_sock_unregister.
>
> This can be reproduced using the sessiond_teardown_active_session
> scenario provided by [1].
>
> [1] https://github.com/PSRCode/lttng-stress
>
> Signed-off-by: Jonathan Rajotte <jonathan.rajotte-julien at efficios.com>
> ---
> src/bin/lttng-sessiond/ust-thread.c | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/src/bin/lttng-sessiond/ust-thread.c b/src/bin/lttng-sessiond/ust-thread.c
> index 1e7a8229..8f11133a 100644
> --- a/src/bin/lttng-sessiond/ust-thread.c
> +++ b/src/bin/lttng-sessiond/ust-thread.c
> @@ -27,6 +27,19 @@
> #include "health-sessiond.h"
> #include "testpoint.h"
>
> +
> +static
> +void notify_sock_unregister_all()
> +{
> + struct lttng_ht_iter iter;
> + struct ust_app *app;
Missing empty line here.
> + rcu_read_lock();
> + cds_lfht_for_each_entry(ust_app_ht_by_notify_sock->ht, &iter.iter, app, notify_sock_n.node) {
This line exceeds 80 chars.
> + ust_app_notify_sock_unregister(app->notify_sock);
> + }
> + rcu_read_unlock();
> +}
> +
> /*
> * This thread manage application notify communication.
> */
> @@ -53,7 +66,7 @@ void *ust_thread_manage_notify(void *data)
>
> ret = lttng_poll_create(&events, 2, LTTNG_CLOEXEC);
> if (ret < 0) {
> - goto error;
> + goto error_poll_create;
> }
>
> /* Add quit pipe */
> @@ -197,6 +210,8 @@ error_poll_create:
> error_testpoint:
> utils_close_pipe(apps_cmd_notify_pipe);
> apps_cmd_notify_pipe[0] = apps_cmd_notify_pipe[1] = -1;
> + notify_sock_unregister_all();
> +
> DBG("Application notify communication apps thread cleanup complete");
> if (err) {
> health_error();
> --
> 2.11.0
>
--
Jérémie Galarneau
EfficiOS Inc.
http://www.efficios.com
More information about the lttng-dev
mailing list