[lttng-dev] Latency tracing using LTTng
jdesfossez at efficios.com
Wed Jan 10 21:56:35 UTC 2018
Sorry for the delay.
I just tried the examples from the blog post (it has been a while!). I
had to update liblttng-profile-base/module_abi.h in lttng-profile like this:
#define LTTNGPROFILE_PROC "syscalls"
And it works. I did not patch lttng-modules to get the kernel events,
but I see them in ftrace; I will rebase the
0001-connect-to-latency_tracker-tracepoints.patch from the
latency-tracker onto a more recent version of lttng-modules and it should
work as is. I also see the stack traces in the user-space and kernel traces.
I also see the flakiness you are talking about: sometimes I launch
sleep-demo and nothing happens. I will try to have a look at that.
I did not observe any kernel panic or hang, but I just tried on an
x86-64 4.4 non-RT kernel, so there might be some differences here.
Now for the main problem, which seems related to Process-2 never working
with lttng-profile: do you think you could investigate a bit more
(check in dmesg whether it gets registered, whether the LD_PRELOAD is
overridden/lost when it is started by a launcher script, etc.)?
Ideally, if you could get a minimal C program that has the same
behavior, that would help us investigate this a bit more.
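Something along these lines would be a good starting point (just a sketch,
assuming the hang also shows up with a plain blocking UDP receive loop; the
file name, the lib_is_mapped() helper, the port and the buffer size below are
placeholders I made up). It also checks /proc/self/maps at startup, which
should tell us whether liblttng-profile.so really got preloaded when
Process-2 is started through its launcher:

/*
 * udp-recv-demo.c: minimal stand-in for Process-2 (a sketch, not the real
 * subscriber).
 * Build: gcc -Wall -O2 -o udp-recv-demo udp-recv-demo.c
 * Run:   LD_PRELOAD=liblttng-profile.so ./udp-recv-demo
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if a library whose path contains "name" is mapped into
 * this process, by scanning /proc/self/maps. */
static int lib_is_mapped(const char *name)
{
	FILE *maps = fopen("/proc/self/maps", "r");
	char line[512];
	int found = 0;

	if (!maps)
		return 0;
	while (fgets(line, sizeof(line), maps)) {
		if (strstr(line, name)) {
			found = 1;
			break;
		}
	}
	fclose(maps);
	return found;
}

int main(void)
{
	struct sockaddr_in addr;
	int fd;

	/* If the launcher script drops LD_PRELOAD, this will say "NOT". */
	fprintf(stderr, "liblttng-profile.so is %sloaded\n",
		lib_is_mapped("liblttng-profile") ? "" : "NOT ");

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0) {
		perror("socket");
		return EXIT_FAILURE;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(5005);	/* arbitrary port for the test */

	if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
		perror("bind");
		return EXIT_FAILURE;
	}

	/* Blocking receive loop, roughly what the subscriber thread does. */
	for (;;) {
		char buf[1500];
		ssize_t len = recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);

		if (len < 0) {
			perror("recvfrom");
			break;
		}
		printf("received %zd bytes\n", len);
	}

	close(fd);
	return EXIT_SUCCESS;
}

If that one also hangs under LD_PRELOAD, we at least know the problem is
not specific to the ROS/FASTRTPS stack.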
On 2018-01-04 02:05 PM, Anup Pemmaiah wrote:
> Was curious if any of the experts had any comments/suggestions on the
> issue I am facing.
> On Wed, Dec 27, 2017 at 9:02 PM, Anup Pemmaiah <anup.pemmaiah at gmail.com>
> wrote:
> On Wed, Dec 27, 2017 at 6:40 PM, Jonathan Rajotte-Julien
> <jonathan.rajotte-julien at efficios.com> wrote:
> On Wed, Dec 27, 2017 at 05:21:30PM -0800, Anup Pemmaiah wrote:
> > Hi,
> > I am trying to find the reason for an intermittent latency that I am noticing
> > on my Linux system running an RT_PREEMPT kernel. I have patched it on an
> > Ubuntu 16.04 LTS. I have stopped the desktop lightdm service and this is on an
> Could we get more details on the version of the RT kernel?
> The kernel with the RT_PREEMPT patch is 4.4.38-rt49
> > Arm aarch64 platform.
> > At a very high level, there are two processes interacting with each other.
> > Process-1 does some work and sends UDP messages to Process-2 (using the
> > FASTRTPS real-time Publish-Subscribe protocol).
> Are you using the eProsima implementation? If so, could you give us more
> information on the tracing strategy?
> Yes. In my situation, it is part of the ROS (http://www.ros.org/)
> stack. The ROS middleware uses eProsima underneath for publishing and
> subscribing. My process sits on top of the ROS stack and
> uses it for message passing.
> Right now I have not assigned any RT priorities. All tasks are
> running with SCHED_OTHER with the default prio of 20.
> I load the modules as follows
> $modprobe latency_tracker
> $modprobe latency_tracker_syscalls
> $modprobe lttng-probe-latency_tracker
> $echo 100000 > /sys/kernel/debug/latency/syscalls/default/threshold
> Start LTTng
> $sudo lttng-sessiond --daemonize
> $echo 1 > /sys/kernel/debug/latency/syscalls/default/tracking_on
> Process-1 has a thread which keeps doing some work and publishing
> messages every 100ms. I start Process-1 as
> LD_PRELOAD=liblttng-profile.so <Process-1>
> It gives some relevant information about a few stack traces.
> (Columns: Timestamp, Channel, CPU, Event type, Contents, Binary Location,
> Function Location, Source Location, TID)
> 16:13:32.971 988   overhead=313869, context._vtid=1171311713
> 16:13:32.973 972   overhead=809697, context._vtid=1171311713
> 16:13:32.973 981   /usr/lib/aarch64-linux-gnu/libstdc++.so., overhead=469267,
> 16:13:32.973 999   /opt/ros2/lib/, overhead=1289108, context._vtid=1171211712
> Once in a while, when I kill the process with <ctrl-c>, it hangs, but
> not always.
> Process-2 has a thread which subscribes to the messages.
> When I start LD_PRELOAD=liblttng-profile.so <Process-2>, it just hangs.
> > Data is sent from Process-1 to Process-2 every 100ms. I see that often
> > there is a delay in Process-2 receiving the message.
> > I am trying to use LTTng along with other supported profiling tools like
> > latency-tracker and lttng-profile.
> > Something like the description in this link
> > http://lttng.org/blog/2015/03/18/full-stack-latencies is what I am looking
> > for, but I am facing issues. The latency_tracker kernel modules and also
> > lttng-profile are a bit flaky, as they sometimes work and sometimes panic
> > the kernel. Simple test cases as provided in
> > https://github.com/fdoray/latency_tracker_demo work
> > sometimes. And also
> Not sure if those aged properly.
> If you did not already, I would recommend you take a look at the
> GitHub page that this project references and uses. It changed quite a lot
> in the last 3 years:
> https://github.com/efficios/latency-tracker
> Yes, that is where I downloaded the source from and what I followed.
> > profiling Process-1 works by providing some useful traces (with intermittent
> > kernel panics), but when profiling Process-2 the process just hangs.
> What version of lttng-modules are you using? The latency_tracker_demo
> install script seems to use lttng-modules 2.6. 2.6 is not supported
> anymore; it would be interesting to perform your experimentation with a
> newer version.
> Sorry for not being clear in my earlier msg. I did not use the
> install scripts. All the kernel modules, LTTng modules,
> latency-tracker modules and user-space libs
> (lttng-ust, lttng-tools, lttng-profile) were downloaded from GitHub
> a few weeks ago, built from source, and loaded.
> From the latency_tracker_demo, I just built the sleep-demo and used
> trace.sh to do the tracing.
> Still, could you provide the panic's call stack?
> I do not see a panic trace all the time. In the previous run, when I
> tried to kill Process-2 with "kill -9 <pid>", as it was hung,
> the system just rebooted without a panic.
> This is from the syslog, and I see the same output on the console:
> 1613 Dec 27 16:53:37 kernel: [12566.627939] ring buffer
> relay-metadata: 0 records written, 0 records overrun
> 1614 Dec 27 16:53:37 kernel: [12566.634983] ring buffer
> relay-metadata, cpu -1: non-consumed data
> 1615 Dec 27 16:53:37 kernel: [12566.634983] [ 12288 bytes
> written, 8192 bytes read ]
> 1616 Dec 27 16:53:37 kernel: [12566.646189] ring buffer:
> relay-metadata, cpu -1: 8192 bytes committed
> 1617 Dec 27 16:53:37 kernel: [12566.652780] ring buffer
> relay-discard, cpu 0: 0 records written, 0 records overrun
> 1618 Dec 27 16:53:38 kernel: [12566.660865] ring buffer
> relay-discard, cpu 1: 0 records written, 0 records overrun
> 1619 Dec 27 16:53:38 kernel: [12566.668924] ring buffer
> relay-discard, cpu 2: 0 records written, 0 records overrun
> 1620 Dec 27 16:53:38 kernel: [12566.676897] ring buffer
> relay-discard, cpu 3: 0 records written, 0 records overrun
> 1621 Dec 27 16:53:38 kernel: [12566.684809] ring buffer
> relay-discard, cpu 4: 0 records written, 0 records overrun
> 1622 Dec 27 16:53:38 kernel: [12566.692681] ring buffer
> relay-discard, cpu 5: 0 records written, 0 records overrun
> 1623 Dec 27 16:54:16 avahi-daemon: Registering new address
> record for 10.31.33.48 on eth0.IPv4.
> 1624 Dec 27 16:54:16 rsyslogd-2007: action 'action 9' suspended,
> next retry is Wed Dec 27 16:55:46 2017 [v8.16.0 try
> http://www.rsyslog.com/e/2007 ]
> 1625 Dec 27 16:54:16 systemd-timesyncd: Network configuration
> changed, trying to establish connection.
> 1626 Dec 27 16:54:16 systemd-timesyncd: Synchronized to time
> server 220.127.116.11:123 (ntp.ubuntu.com).
> 1627 Dec 27 16:54:17 rsyslogd: [origin software="rsyslogd"
> swVersion="8.16.0" x-pid="924" x-info="http://www.rsyslog.com"] start
> 1628 Dec 27 16:54:17 rsyslogd-2222: command
> 'KLogPermitNonKernelFacility' is currently not permitted - did you
> already set it via a RainerScript command (v6+ config)?
> [v8.16.0 try http://www.rsyslog.com/e/2222 ]
> 1629 Dec 27 16:54:17 rsyslogd: rsyslogd's groupid changed to 115
> 1630 Dec 27 16:54:17 rsyslogd: rsyslogd's userid changed to 108
> 1631 Dec 27 16:54:17 kernel: [ 0.000000] Booting Linux on
> physical CPU 0x100
> 1632 Dec 27 16:54:17 kernel: [ 0.000000] Initializing cgroup
> subsys cpuset
> 1633 Dec 27 16:54:17 kernel: [ 0.000000] Initializing cgroup
> subsys cpu
> 1634 Dec 27 16:54:17 kernel: [ 0.000000] Initializing cgroup
> subsys cpuacct
> 1635 Dec 27 16:54:17 kernel: [ 0.000000] Linux version
> 4.4.38-rt49-tegra (anup.pemmaiah at anup-pemmaiah) (gcc version
> 4.9.2 (GCC) ) #1 SMP PREEMPT RT Thu Nov 30 17:12:21 PST 2017
> 1636 Dec 27 16:54:17 kernel: [ 0.000000] Boot CPU: AArch64
> Processor [411fd073]
> 1637 Dec 27 16:54:17 kernel: [ 0.000000] earlycon: Early serial
> console at MMIO32
> > Not sure how up to date the above pasted links are, because I notice there
> > is no longer a usec_threshold parameter, and I use the threshold file
> > in /sys/kernel/debug/latency/syscalls/default/threshold
> > I understand each situation is different, but was curious if any of you
> > have comments/suggestions that would help in debugging latency-related
> > issues using LTTng. I have looked in many places to get the required info,
> > but without much help.
> Most of the maintainers are on vacation during this part of the
> year, so it might take some time to get answers. I'm sure that
> Julien Desfossez will be interested in your use case.
> I understand, and any suggestions will be very helpful.
> > Your comments/suggestions will be greatly appreciated.
> > Thanks
> > Anup
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev at lists.lttng.org
> > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> Jonathan Rajotte-Julien