[lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent

David OShea David.OShea at quantum.com
Mon Jan 12 01:33:07 EST 2015


Hi Mathieu,

Apologies for the delay in getting back to you, please see below:

> -----Original Message-----
> From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> Sent: Friday, 12 December 2014 2:07 AM
> To: David OShea
> Cc: lttng-dev
> Subject: Re: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
> 
> ________________________________
> 
> 	From: "David OShea" <David.OShea at quantum.com>
> 	To: "lttng-dev" <lttng-dev at lists.lttng.org>
> 	Sent: Sunday, December 7, 2014 10:30:04 PM
> 	Subject: [lttng-dev] Segfault at v_read() called from
> lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> dependent
> 
> 
> 
> 	Hi all,
> 
> 	We have encountered a problem with using LTTng-UST tracing with
> our application, where on a particular VMware vCenter cluster we almost
> always get segfaults when tracepoints are enabled, whereas on another
> vCenter cluster, and on every other machine we’ve ever used, we don’t
> hit this problem.
> 
> 	I can reproduce this using lttng-ust/tests/hello after using:
> 
> 	"""
> 
> 	lttng create
> 	lttng enable-channel channel0 --userspace
> 	lttng add-context --userspace -t vpid -t vtid -t procname
> 	lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> 	lttng start
> 
> 	"""
> 
> 	In which case I get the following stack trace with an obvious
> NULL pointer dereference:
> 
> 	"""
> 
> 	Program terminated with signal SIGSEGV, Segmentation fault.
> 
> 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> 	48              return uatomic_read(&v_a->a);
> 	[...]
> 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> 	#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> 	    buf=0x7f4a98008a00, chan=0x7f4a98008a00, offsets=0x7fffef67c620,
> 	    ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> 	#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow (ctx=0x7fffef67ca40)
> 	    at ring_buffer_frontend.c:1819
> 	#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve (ctx=0x7fffef67ca40,
> 	    config=0x7f4aa12b8ae0 <client_config>)
> 	    at ../libringbuffer/frontend_api.h:211
> 	#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> 	    at lttng-ring-buffer-client.h:473
> 	#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest (
> 	    __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> 	    text=0x7fffef67cb70 "test", textlen=<optimized out>, doublearg=2,
> 	    floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> 	#6  0x0000000000400d2c in __tracepoint_cb_ust_tests_hello___tptest (
> 	    boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> 	    text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> 	    netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> 	#7  main (argc=<optimized out>, argv=<optimized out>) at hello.c:92
> 
> 	"""
> 
> 	I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> otherwise had the same software installed on them:
> 
> 	- CentOS 6-based
> 
> 	- kernel-2.6.32-504.1.3.el6 with some minor changes made in
> networking
> 
> 	- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> which might have some minor patches backported, and leftovers of
> changes to get them to build on CentOS 5
> 
> 	On the “good” vCenter, I tested on two different VM hosts:
> 
> 	Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> 
> 	EVC Mode: Intel(R) "Nehalem" Generation
> 
> 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> 
> 	Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> 
> 	EVC Mode: Intel(R) "Nehalem" Generation
> 
> 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> 
> 	The “bad” vCenter VM host that I tested on had this
> configuration:
> 
> 	ESX Version: VMware ESXi, 5.0.0, 469512
> 
> 	Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> 
> 	Any ideas?
> 
> 
> My bet would be that the OS is lying to userspace about the
> number of possible CPUs. I wonder what liblttng-ust
> libringbuffer/shm.h num_possible_cpus() is returning compared
> to what lib_ring_buffer_get_cpu() returns.
> 
> 
> Can you check this out ?

Yes, this seems to be the case - 'gdb' on the core dump shows:

(gdb) p __num_possible_cpus
$1 = 2

which matches how I configured the virtual machine, and is consistent with this output:

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              4
CPU MHz:               1995.000
BogoMIPS:              3990.00
Hypervisor vendor:     VMware
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              18432K
NUMA node0 CPU(s):     0,1

Despite there being only 2 CPUs, when I hacked lttng-ring-buffer-client.h to print the result of lib_ring_buffer_get_cpu() and then ran tests/hello with tracing enabled, I could see the process sit on CPU 0 or CPU 1 for a while, sometimes moving between the two, but eventually a value of 2 or 3 would appear, immediately followed by the segfault.
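
For what it's worth, a rough standalone check along the same lines - assuming num_possible_cpus() boils down to sysconf(_SC_NPROCESSORS_CONF) and lib_ring_buffer_get_cpu() to sched_getcpu(), which I have not verified against the sources - should show the same mismatch without patching liblttng-ust:

"""
/*
 * cpu_id_check.c - watch whether the scheduler ever reports a CPU id
 * greater than or equal to the CPU count userspace saw at start-up.
 * Build: gcc -Wall -std=gnu99 -o cpu_id_check cpu_id_check.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Assumption: what num_possible_cpus() in libringbuffer relies on. */
	long configured = sysconf(_SC_NPROCESSORS_CONF);
	long online = sysconf(_SC_NPROCESSORS_ONLN);
	int max_seen = -1;

	printf("configured CPUs: %ld, online CPUs: %ld\n", configured, online);

	/* Poll for about a minute; bouncing between CPUs is enough. */
	for (long i = 0; i < 60 * 1000; i++) {
		/* Assumption: what lib_ring_buffer_get_cpu() ends up returning. */
		int cpu = sched_getcpu();

		if (cpu > max_seen) {
			max_seen = cpu;
			printf("now on CPU %d\n", cpu);
			if (cpu >= configured)
				printf("CPU id %d >= configured count %ld - this is "
				       "where a per-CPU buffer lookup would walk off "
				       "the end\n", cpu, configured);
		}
		usleep(1000);
	}
	return 0;
}
"""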

The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading enabled.  The VM has its "HT Sharing" option set to "Any", which according to https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html means that each of the virtual machine's virtual cores can share a physical core with another virtual machine, with each virtual core using a different hardware thread on that physical core.  I assume none of this should be relevant unless there are bugs in VMware.

Is it possible that this is an issue in LTTng, or should I work out how the kernel determines which CPU a process is running on and then look into whether there are any VMware bugs in that area?
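
In case it is useful for narrowing that down, something along these lines should show whether the raw getcpu(2) syscall itself returns ids outside the range the kernel advertises in /sys/devices/system/cpu/possible, independently of anything glibc or LTTng does (just a sketch):

"""
/*
 * getcpu_raw.c - ask the kernel directly which CPU we are on, bypassing
 * any caching in glibc's sched_getcpu()/vDSO path, and print the range
 * of CPUs the kernel says are possible.
 * Build: gcc -Wall -o getcpu_raw getcpu_raw.c
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	unsigned int cpu = 0, node = 0;
	char possible[64];
	FILE *f = fopen("/sys/devices/system/cpu/possible", "r");

	if (f) {
		if (fgets(possible, sizeof(possible), f))
			printf("kernel 'possible' CPUs: %s", possible);
		fclose(f);
	}

	/* Raw syscall, so nothing cached in userspace can get in the way. */
	if (syscall(SYS_getcpu, &cpu, &node, NULL) == 0)
		printf("getcpu(2):      cpu=%u node=%u\n", cpu, node);

	printf("sched_getcpu(): %d\n", sched_getcpu());
	return 0;
}
"""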

Thanks in advance,
David

