[lttng-dev] Crash in application due to watchdog timeout with python3 lttng

Lakshmi Deverkonda laksd at nvidia.com
Tue Feb 20 10:24:10 EST 2024


Hi,

Sorry, I could not reply on this thread. We tried to handle these issues in our application code.
One query, what is the write way to delete the lttng logger to stop logging the events(from the code).


Regards,
Lakshmi
________________________________
From: Kienan Stewart <kstewart at efficios.com>
Sent: 16 February 2024 22:11
To: Lakshmi Deverkonda <laksd at nvidia.com>; lttng-dev at lists.lttng.org <lttng-dev at lists.lttng.org>
Subject: Re: [lttng-dev] Crash in application due to watchdog timeout with python3 lttng

External email: Use caution opening links or attachments


Hi Lakshmi,

On 2/16/24 09:33, Lakshmi Deverkonda wrote:
> This is how, we have created the logger. So the first logger is for file
> logging where is as the second one is for lttng.
>
>   self.logger  = logging.getLogger('cd')
> self.lttng_logger = logging.getLogger('cd-lttng')
>
> It seems like at the instant exactly when lttng is logging some data on
> a particular thread and the same instant we receive SIGTERM for the
> application,
> we are unable to join that particular thread. Can you please help.
>

This doesn't constitute a usable reproducer for us. You are also
omitting information on the setup and usage of lttng on your system.

I get the impression you are not in a position to share your code.
EfficiOS offers support contracts with NDAs that could allow us to work
with you to analyze and improve your use of LTTng. For more info, please
feel free to contact sales at efficios.com.

> Also we see that performance of lttng is not that good for python3. My
> application has around 24 threads and when logging is enabled for each
> of the threads,
> there is a delay upto 24s for processing the external events.
> Please suggest how to proceed further on these issues.

Could you describe what you mean by 'processing external events'?

Which system(s) are involved in processing the events?

Are the 'external events' the events emitted by invoking
`self.lttng_logger.info('...')`, for example?

What versions of lttng-tools, lttng-ust, urcu, babeltrace, and python3
are you using? Are you using a relay-daemon at any point?

How are your lttng sessions configured? Eg. memory allocation, blocking
settings, behaviour on full buffers, etc. The commands you use to create
the session, enable the channels, and activate the events would be great
information to have.

While performing the logging is the system under heavy load from other
sources? What resources on the system face the most contention (CPU, IO,
memory, ...)?

We'd be more than happy to analyze the performance of python-lttngust
and work to make improvements so it can meet your needs under a
development contract. For more information, please reach out to
sales at efficios.com.

>
> Regards,
> Lakshmi

I have taken the time to invent a fictitious example based on the few
details you have given:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgist.github.com%2Fkienanstewart%2F879bd3bf19d852653b70a3c42caef361&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985100994598%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=W%2Fb94D39McorSr2ztImThxlyb%2FsFcAgTrnb22Ymo50Q%3D&reserved=0<https://gist.github.com/kienanstewart/879bd3bf19d852653b70a3c42caef361>
which spawns a number of python threads using the threading module.

I am using lttng-tools master, lttng-ust master at
47bc09f338f3c1199a878f77b7b18be8d2a224f6, urcu master at
81270292c23ff28aba1abd9a65f0624b657de82b, and babeltrace2 master at
b93af5a2d22e36cf547da1739d60e19791daccbd. My system is running Debian
sid with python 3.11.8.

To set up a recording session I do the following:

```
lttng create
lttng enable-event --python 'tp'
lttng start
```

To run the application, I do the following:

```
time python3 ./main.py
```

To quit the application, I send sigterm using the following command

```
killall $(pgrep -f 'python3 ./main.py')
```

After the application terminates, I stop the session and view the events

```
lttng stop
lttng view
lttng view | wc -l
```

In a 25s run of the application on my 4-thread laptop, I recorded
1010748 events.

thanks,
kienan

>
> ------------------------------------------------------------------------
> *From:* Lakshmi Deverkonda <laksd at nvidia.com>
> *Sent:* 13 February 2024 21:05
> *To:* Kienan Stewart <kstewart at efficios.com>; lttng-dev at lists.lttng.org
> <lttng-dev at lists.lttng.org>
> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout
> with python3 lttng
> Yes. We are trying to join only the threads related to the application.
> The timeout is happening while trying to join the threads started by the
> application.
>
> Regards,
> Lakshmi
> ------------------------------------------------------------------------
> *From:* Kienan Stewart <kstewart at efficios.com>
> *Sent:* 13 February 2024 20:50
> *To:* Lakshmi Deverkonda <laksd at nvidia.com>; lttng-dev at lists.lttng.org
> <lttng-dev at lists.lttng.org>
> *Subject:* Re: [lttng-dev] Crash in application due to watchdog timeout
> with python3 lttng
> External email: Use caution opening links or attachments
>
>
> Hi Lakshmi,
>
> when the lttngust python agent starts, it attempts to connect to one or
> more session daemons[1].
>
> Each connection starts a thread that loops forever, retrying the
> registration in case an exception occurs[2].
>
> I don't think the it's designed to have `join()` called on those
> threads, which I assume is happening in some of the code you or your
> team have written.
>
> My initial thought is that you should `join()` only the threads that
> pertinent to your application, ignoring the lttngust agent threads and
> then exit the application as normal.
>
> [1]:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flttng%2Flttng-ust%2Fblob%2F3287f48be61ef3491aff0a80b7185ac57b3d8a5d%2Fsrc%2Fpython-lttngust%2Flttngust%2Fagent.py%23L334&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101003000%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=mZjN5F0W%2BHNWslQAULblQHefyQIT5cpSNFylxAY2kAk%3D&reserved=0<https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flttng%2Flttng-ust%2Fblob%2F3287f48be61ef3491aff0a80b7185ac57b3d8a5d%2Fsrc%2Fpython-lttngust%2Flttngust%2Fagent.py%23L334&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101009055%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=JypI4VvwxEZzTGG9621hNAvonem63ljMZiXaqvJhIQw%3D&reserved=0<https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L334>>
> [2]:
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flttng%2Flttng-ust%2Fblob%2F3287f48be61ef3491aff0a80b7185ac57b3d8a5d%2Fsrc%2Fpython-lttngust%2Flttngust%2Fagent.py%23L83&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101015117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=ZTbmTvR2uY1QoS9qKA6HqwXRKFd3pY%2F39kw2qwjaaVM%3D&reserved=0<https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flttng%2Flttng-ust%2Fblob%2F3287f48be61ef3491aff0a80b7185ac57b3d8a5d%2Fsrc%2Fpython-lttngust%2Flttngust%2Fagent.py%23L83&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101021691%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=%2B6JlJyEJVLlfwCMv8eHWwnQwr298L%2Fsk3vGNUuhaAMU%3D&reserved=0<https://github.com/lttng/lttng-ust/blob/3287f48be61ef3491aff0a80b7185ac57b3d8a5d/src/python-lttngust/lttngust/agent.py#L83>>
>
> thanks,
> kienan
>
> On 2/13/24 09:23, Lakshmi Deverkonda via lttng-dev wrote:
>> Hi,
>>
>> We are able to integrate python3 lttng module in our application(python3
>> based). However, we are seeing that whenever the application terminates,
>> there is watchdog timeout due to timeout in joining the threads. What
>> could be the reason for this ? Does lttng module hold any thread event
>> locks ?
>> We are completely blocked on this issue. Could you please help ?
>>
>> Here is the snippet of the core dump
>>
>> (gdb) py-bt
>> Traceback (most recent call first):
>>    File "/usr/lib/python3.7/threading.py", line 1048, in
>> _wait_for_tstate_lock
>>      elif lock.acquire(block, timeout):
>>    File "/usr/lib/python3.7/threading.py", line 1032, in join
>>      self._wait_for_tstate_lock()
>>    File "/usr/lib/python3/dist-packages/h.py", line 231, in JoinThreads
>>      self.TT.join()
>>    File "/usr/sbin/c", line 1466, in do_exit
>>      H.JoinThreads()
>>    File "/usr/sbin/c", line 7201, in main
>>      do_exit(nlm, status)
>>    File "/usr/sbin/c", line 7233, in <module>
>>      main()
>> (gdb)
>>
>> On a parallel note, thanks to Kienan who has been trying to provide
>> pointers on various issues reported so far.
>>
>> Need help on this issue as well.
>> Thanks in advance,
>>
>> Regards,
>> Lakshmi
>>
>>
>>
>> _______________________________________________
>> lttng-dev mailing list
>> lttng-dev at lists.lttng.org
>> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.lttng.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Flttng-dev&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101026830%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=mB2APfDsn7BObrStEEWtfOHkxa3zC8LOnk%2FsuKdL9%2F0%3D&reserved=0<https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.lttng.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Flttng-dev&data=05%7C02%7Claksd%40nvidia.com%7C41cb3fd667944d89cab708dc2f0e2a78%7C43083d15727340c1b7db39efd9ccc17a%7C0%7C0%7C638436985101031212%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C0%7C%7C%7C&sdata=LEnpdupFl2tg8vh0cV3YlRpSBXAnfaW9oW40eVXzBfI%3D&reserved=0<https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.lttng.org/pipermail/lttng-dev/attachments/20240220/ad276c86/attachment-0001.htm>


More information about the lttng-dev mailing list