[lttng-dev] Crash in application due to watchdog timeout with python3 lttng

Kienan Stewart kstewart at efficios.com
Fri Feb 16 11:41:43 EST 2024


Hi Lakshmi,

On 2/16/24 09:33, Lakshmi Deverkonda wrote:
> This is how, we have created the logger. So the first logger is for file 
> logging where is as the second one is for lttng.
> 
>   self.logger  = logging.getLogger('cd')
> self.lttng_logger = logging.getLogger('cd-lttng')
> 
> It seems like at the instant exactly when lttng is logging some data on 
> a particular thread and the same instant we receive SIGTERM for the 
> application,
> we are unable to join that particular thread. Can you please help.
> 

This doesn't constitute a usable reproducer for us. You are also 
omitting information on the setup and usage of lttng on your system.

I get the impression you are not in a position to share your code. 
EfficiOS offers support contracts with NDAs that could allow us to work 
with you to analyze and improve your use of LTTng. For more info, please 
feel free to contact sales at efficios.com.

> Also we see that performance of lttng is not that good for python3. My 
> application has around 24 threads and when logging is enabled for each 
> of the threads,
> there is a delay upto 24s for processing the external events.
> Please suggest how to proceed further on these issues.

Could you describe what you mean by 'processing external events'?

Which system(s) are involved in processing the events?

Are the 'external events' the events emitted by invoking 
`self.lttng_logger.info('...')`, for example?

What versions of lttng-tools, lttng-ust, urcu, babeltrace, and python3 
are you using? Are you using a relay-daemon at any point?

How are your lttng sessions configured? Eg. memory allocation, blocking 
settings, behaviour on full buffers, etc. The commands you use to create 
the session, enable the channels, and activate the events would be great 
information to have.

While performing the logging is the system under heavy load from other 
sources? What resources on the system face the most contention (CPU, IO, 
memory, ...)?

We'd be more than happy to analyze the performance of python-lttngust 
and work to make improvements so it can meet your needs under a 
development contract. For more information, please reach out to 
sales at efficios.com.

> 
> Regards,
> Lakshmi

I have taken the time to invent a fictitious example based on the few 
details you have given: 
https://gist.github.com/kienanstewart/879bd3bf19d852653b70a3c42caef361 
which spawns a number of python threads using the threading module.

I am using lttng-tools master, lttng-ust master at 
47bc09f338f3c1199a878f77b7b18be8d2a224f6, urcu master at 
81270292c23ff28aba1abd9a65f0624b657de82b, and babeltrace2 master at 
b93af5a2d22e36cf547da1739d60e19791daccbd. My system is running Debian 
sid with python 3.11.8.

To set up a recording session I do the following:

```
lttng create
lttng enable-event --python 'tp'
lttng start
```

To run the application, I do the following:

```
time python3 ./main.py
```

To quit the application, I send sigterm using the following command

```
killall $(pgrep -f 'python3 ./main.py')
```

After the application terminates, I stop the session and view the events

```
lttng stop
lttng view
lttng view | wc -l
```

In a 25s run of the application on my 4-thread laptop, I recorded 
1010748 events.

thanks,
kienan

> 
> ------------------------------------------------------------------------
> *From:* Lakshmi Deverkonda <laksd at nvidia.com>
> *Sent:* 13 February 2024 21:05
> *To:* Kienan Stewart <kstewart at efficios.com>; lttng-dev at lists.lttng.org 
> <lttng-dev at lists.lttng.org>
> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout 
> with python3 lttng
> Yes. We are trying to join only the threads related to the application. 
> The timeout is happening while trying to join the threads started by the 
> application.
> 
> Regards,
> Lakshmi
> ------------------------------------------------------------------------
> *From:* Kienan Stewart <kstewart at efficios.com>
> *Sent:* 13 February 2024 20:50
> *To:* Lakshmi Deverkonda <laksd at nvidia.com>; lttng-dev at lists.lttng.org 
> <lttng-dev at lists.lttng.org>
> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout 
> with python3 lttng
> External email: Use caution opening links or attachments
> 
> 
> Hi Lakshmi,
> 
> when the lttngust python agent starts, it attempts to connect to one or
> more session daemons[1].
> 
> Each connection starts a thread that loops forever, retrying the
> registration in case an exception occurs[2].
> 
> I don't think the it's designed to have `join()` called on those
> threads, which I assume is happening in some of the code you or your
> team have written.
> 
> My initial thought is that you should `join()` only the threads that
> pertinent to your application, ignoring the lttngust agent threads and
> then exit the application as normal.
> 
> [1]:
> https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334>
> [2]:
> https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83>
> 
> thanks,
> kienan
> 
> On 2/13/24 09:23, Lakshmi Deverkonda via lttng-dev wrote:
>> Hi,
>>
>> We are able to integrate python3 lttng module in our application(python3
>> based). However, we are seeing that whenever the application terminates,
>> there is watchdog timeout due to timeout in joining the threads. What
>> could be the reason for this ? Does lttng module hold any thread event
>> locks ?
>> We are completely blocked on this issue. Could you please help ?
>>
>> Here is the snippet of the core dump
>>
>> (gdb) py-bt
>> Traceback (most recent call first):
>>    File "/usr/lib/python3.7/threading.py", line 1048, in
>> _wait_for_tstate_lock
>>      elif lock.acquire(block, timeout):
>>    File "/usr/lib/python3.7/threading.py", line 1032, in join
>>      self._wait_for_tstate_lock()
>>    File "/usr/lib/python3/dist-packages/h.py", line 231, in JoinThreads
>>      self.TT.join()
>>    File "/usr/sbin/c", line 1466, in do_exit
>>      H.JoinThreads()
>>    File "/usr/sbin/c", line 7201, in main
>>      do_exit(nlm, status)
>>    File "/usr/sbin/c", line 7233, in <module>
>>      main()
>> (gdb)
>>
>> On a parallel note, thanks to Kienan who has been trying to provide
>> pointers on various issues reported so far.
>>
>> Need help on this issue as well.
>> Thanks in advance,
>>
>> Regards,
>> Lakshmi
>>
>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev at lists.lttng.org
>> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev>


More information about the lttng-dev mailing list