[lttng-dev] Crash in application due to watchdog timeout with python3 lttng

Kienan Stewart kstewart at efficios.com
Fri Feb 23 10:09:02 EST 2024


Hi Lakshmi,

On 2/20/24 10:24 AM, Lakshmi Deverkonda wrote:
> Hi,
> 
> Sorry, I could not reply on this thread. We tried to handle these issues 
> in our application code.
> One query, what is the write way to delete the lttng logger to stop 
> logging the events(from the code).


I think that the design of LTTng-UST agents is such that tracing is not 
meant to be tightly controlled by the traced applications themselves.

To stop emitting events:

  * disable the python events for the session(s): `lttng disable-event 
--python --all-events --session=<session name> --channel=<channel name>`
  * stop the tracing session(s)`lttng stop <session name>`
  * quit the `sessiond`

Note: the agent thread(s) in the python application will continue to 
run, but not emit events.

If you want to do things from within the application, there are some 
options:

  * Implement your own logic to stop calling `self.lttng_logger.log()`
  * Add a filter to the `lttngust.loghandler._Handler` objects on the 
root logger
  * Forcibly remove the `lttngust.loghandler._Handler` handler(s) from 
root logger (this really isn't meant to be done)
  * Use liblttngctl to perform one of the above actions (disable the 
python event(s), stopping the tracing sessions(s), ...)

thanks,
kienan

> 
> 
> Regards,
> Lakshmi
> ------------------------------------------------------------------------
> *From:* Kienan Stewart <kstewart at efficios.com>
> *Sent:* 16 February 2024 22:11
> *To:* Lakshmi Deverkonda <laksd at nvidia.com>; lttng-dev at lists.lttng.org 
> <lttng-dev at lists.lttng.org>
> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout 
> with python3 lttng
> External email: Use caution opening links or attachments
> 
> 
> Hi Lakshmi,
> 
> On 2/16/24 09:33, Lakshmi Deverkonda wrote:
>> This is how, we have created the logger. So the first logger is for file
>> logging where is as the second one is for lttng.
>>
>>   self.logger  = logging.getLogger('cd')
>> self.lttng_logger = logging.getLogger('cd-lttng')
>>
>> It seems like at the instant exactly when lttng is logging some data on
>> a particular thread and the same instant we receive SIGTERM for the
>> application,
>> we are unable to join that particular thread. Can you please help.
>>
> 
> This doesn't constitute a usable reproducer for us. You are also
> omitting information on the setup and usage of lttng on your system.
> 
> I get the impression you are not in a position to share your code.
> EfficiOS offers support contracts with NDAs that could allow us to work
> with you to analyze and improve your use of LTTng. For more info, please
> feel free to contact sales at efficios.com.
> 
>> Also we see that performance of lttng is not that good for python3. My
>> application has around 24 threads and when logging is enabled for each
>> of the threads,
>> there is a delay upto 24s for processing the external events.
>> Please suggest how to proceed further on these issues.
> 
> Could you describe what you mean by 'processing external events'?
> 
> Which system(s) are involved in processing the events?
> 
> Are the 'external events' the events emitted by invoking
> `self.lttng_logger.info('...')`, for example?
> 
> What versions of lttng-tools, lttng-ust, urcu, babeltrace, and python3
> are you using? Are you using a relay-daemon at any point?
> 
> How are your lttng sessions configured? Eg. memory allocation, blocking
> settings, behaviour on full buffers, etc. The commands you use to create
> the session, enable the channels, and activate the events would be great
> information to have.
> 
> While performing the logging is the system under heavy load from other
> sources? What resources on the system face the most contention (CPU, IO,
> memory, ...)?
> 
> We'd be more than happy to analyze the performance of python-lttngust
> and work to make improvements so it can meet your needs under a
> development contract. For more information, please reach out to
> sales at efficios.com.
> 
>>
>> Regards,
>> Lakshmi
> 
> I have taken the time to invent a fictitious example based on the few
> details you have given:
> https://gist.github.com/kienanstewart/879bd3bf19d852653b70a3c42caef361 <https://gist.github.com/kienanstewart/879bd3bf19d852653b70a3c42caef361>
> which spawns a number of python threads using the threading module.
> 
> I am using lttng-tools master, lttng-ust master at
> 47bc09f338f3c1199a878f77b7b18be8d2a224f6, urcu master at
> 81270292c23ff28aba1abd9a65f0624b657de82b, and babeltrace2 master at
> b93af5a2d22e36cf547da1739d60e19791daccbd. My system is running Debian
> sid with python 3.11.8.
> 
> To set up a recording session I do the following:
> 
> ```
> lttng create
> lttng enable-event --python 'tp'
> lttng start
> ```
> 
> To run the application, I do the following:
> 
> ```
> time python3 ./main.py
> ```
> 
> To quit the application, I send sigterm using the following command
> 
> ```
> killall $(pgrep -f 'python3 ./main.py')
> ```
> 
> After the application terminates, I stop the session and view the events
> 
> ```
> lttng stop
> lttng view
> lttng view | wc -l
> ```
> 
> In a 25s run of the application on my 4-thread laptop, I recorded
> 1010748 events.
> 
> thanks,
> kienan
> 
>>
>> ------------------------------------------------------------------------
>> *From:* Lakshmi Deverkonda <laksd at nvidia.com>
>> *Sent:* 13 February 2024 21:05
>> *To:* Kienan Stewart <kstewart at efficios.com>; lttng-dev at lists.lttng.org
>> <lttng-dev at lists.lttng.org>
>> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout
>> with python3 lttng
>> Yes. We are trying to join only the threads related to the application.
>> The timeout is happening while trying to join the threads started by the
>> application.
>>
>> Regards,
>> Lakshmi
>> ------------------------------------------------------------------------
>> *From:* Kienan Stewart <kstewart at efficios.com>
>> *Sent:* 13 February 2024 20:50
>> *To:* Lakshmi Deverkonda <laksd at nvidia.com>; lttng-dev at lists.lttng.org
>> <lttng-dev at lists.lttng.org>
>> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout
>> with python3 lttng
>> External email: Use caution opening links or attachments
>>
>>
>> Hi Lakshmi,
>>
>> when the lttngust python agent starts, it attempts to connect to one or
>> more session daemons[1].
>>
>> Each connection starts a thread that loops forever, retrying the
>> registration in case an exception occurs[2].
>>
>> I don't think the it's designed to have `join()` called on those
>> threads, which I assume is happening in some of the code you or your
>> team have written.
>>
>> My initial thought is that you should `join()` only the threads that
>> pertinent to your application, ignoring the lttngust agent threads and
>> then exit the application as normal.
>>
>> [1]:
>> https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334> <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334>>
>> [2]:
>> https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83> <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83 <https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83>>
>>
>> thanks,
>> kienan
>>
>> On 2/13/24 09:23, Lakshmi Deverkonda via lttng-dev wrote:
>>> Hi,
>>>
>>> We are able to integrate python3 lttng module in our application(python3
>>> based). However, we are seeing that whenever the application terminates,
>>> there is watchdog timeout due to timeout in joining the threads. What
>>> could be the reason for this ? Does lttng module hold any thread event
>>> locks ?
>>> We are completely blocked on this issue. Could you please help ?
>>>
>>> Here is the snippet of the core dump
>>>
>>> (gdb) py-bt
>>> Traceback (most recent call first):
>>>    File "/usr/lib/python3.7/threading.py", line 1048, in
>>> _wait_for_tstate_lock
>>>      elif lock.acquire(block, timeout):
>>>    File "/usr/lib/python3.7/threading.py", line 1032, in join
>>>      self._wait_for_tstate_lock()
>>>    File "/usr/lib/python3/dist-packages/h.py", line 231, in JoinThreads
>>>      self.TT.join()
>>>    File "/usr/sbin/c", line 1466, in do_exit
>>>      H.JoinThreads()
>>>    File "/usr/sbin/c", line 7201, in main
>>>      do_exit(nlm, status)
>>>    File "/usr/sbin/c", line 7233, in <module>
>>>      main()
>>> (gdb)
>>>
>>> On a parallel note, thanks to Kienan who has been trying to provide
>>> pointers on various issues reported so far.
>>>
>>> Need help on this issue as well.
>>> Thanks in advance,
>>>
>>> Regards,
>>> Lakshmi
>>>
>>>
>>>
>>> _______________________________________________
>>> lttng-dev mailing list
>>> lttng-dev at lists.lttng.org
>>> https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev> <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev <https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev>>


More information about the lttng-dev mailing list