[lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Jan 12 10:34:37 EST 2015
----- Original Message -----
> From: "David OShea" <David.OShea at quantum.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> Sent: Monday, January 12, 2015 1:33:07 AM
> Subject: RE: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
>
> Hi Mathieu,
>
> Apologies for the delay in getting back to you, please see below:
>
> > -----Original Message-----
> > From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> > Sent: Friday, 12 December 2014 2:07 AM
> > To: David OShea
> > Cc: lttng-dev
> > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> >
> > ________________________________
> >
> > From: "David OShea" <David.OShea at quantum.com>
> > To: "lttng-dev" <lttng-dev at lists.lttng.org>
> > Sent: Sunday, December 7, 2014 10:30:04 PM
> > Subject: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> >
> >
> >
> > Hi all,
> >
> > We have encountered a problem using LTTng-UST tracing with
> > our application: on one particular VMware vCenter cluster we almost
> > always get segfaults when tracepoints are enabled, whereas on another
> > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > hit this problem.
> >
> > I can reproduce this using lttng-ust/tests/hello after using:
> >
> > """
> >
> > lttng create
> >
> > lttng enable-channel channel0 --userspace
> >
> > lttng add-context --userspace -t vpid -t vtid -t procname
> >
> > lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> >
> > lttng start
> >
> > """
> >
> > In which case I get the following stack trace with an obvious
> > NULL pointer dereference:
> >
> > """
> >
> > Program terminated with signal SIGSEGV, Segmentation fault.
> >
> > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> >
> > 48 return uatomic_read(&v_a->a);
> >
> > [...]
> >
> > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> >
> > #1 0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> >
> > buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> > offsets=0x7fffef67c620,
> >
> > ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> >
> > #2 0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> > (ctx=0x7fffef67ca40)
> >
> > at ring_buffer_frontend.c:1819
> >
> > #3 0x00007f4aa1095b75 in lib_ring_buffer_reserve
> > (ctx=0x7fffef67ca40,
> >
> > config=0x7f4aa12b8ae0 <client_config>)
> >
> > at ../libringbuffer/frontend_api.h:211
> >
> > #4 lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> >
> > at lttng-ring-buffer-client.h:473
> >
> > #5 0x000000000040135f in __event_probe__ust_tests_hello___tptest
> > (
> >
> > __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> >
> > text=0x7fffef67cb70 "test", textlen=<optimized out>,
> > doublearg=2,
> >
> > floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> >
> > #6 0x0000000000400d2c in
> > __tracepoint_cb_ust_tests_hello___tptest (
> >
> > boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> >
> > text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> >
> > netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> >
> > #7 main (argc=<optimized out>, argv=<optimized out>) at
> > hello.c:92
> >
> > """
> >
> > I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > otherwise had the same software installed on them:
> >
> > - CentOS 6-based
> >
> > - kernel-2.6.32-504.1.3.el6 with some minor changes made in
> > networking
> >
> > - userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> > which might have some minor patches backported, and leftovers of
> > changes to get them to build on CentOS 5
> >
> > On the “good” vCenter, I tested on two different VM hosts:
> >
> > Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> >
> > EVC Mode: Intel(R) "Nehalem" Generation
> >
> > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> >
> > Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> >
> > EVC Mode: Intel(R) "Nehalem" Generation
> >
> > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> >
> > The “bad” vCenter VM host that I tested on had this
> > configuration:
> >
> > ESX Version: VMware ESXi, 5.0.0, 469512
> >
> > Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> >
> > Any ideas?
> >
> >
> > My bet would be that the OS is lying to userspace about the
> > number of possible CPUs. I wonder what liblttng-ust
> > libringbuffer/shm.h num_possible_cpus() is returning compared
> > to what lib_ring_buffer_get_cpu() returns.
> >
> >
> > Can you check this out?
>
> Yes, this seems to be the case - 'gdb' on the core dump shows:
>
> (gdb) p __num_possible_cpus
> $1 = 2
>
> which is consistent with how I configured the virtual machine, which is
> consistent with this output:
>
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                2
> On-line CPU(s) list:   0,1
> Thread(s) per core:    1
> Core(s) per socket:    1
> Socket(s):             2
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 26
> Stepping:              4
> CPU MHz:               1995.000
> BogoMIPS:              3990.00
> Hypervisor vendor:     VMware
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              18432K
> NUMA node0 CPU(s):     0,1
>
> Despite the fact that there are 2 CPUs, when I hacked
> lttng-ring-buffer-client.h to print the result of lib_ring_buffer_get_cpu()
> and then ran tests/hello with tracing enabled, I could see it sit on
> CPU 0 for a while, or on CPU 1, sometimes moving between the two, but
> eventually either 2 or 3 would appear, immediately followed by the segfault.
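>
> (For reference, the hack was nothing more than something along these
> lines in lttng_event_reserve() - a sketch from memory, not the exact
> patch:)
>
> """
> /* lttng-ring-buffer-client.h, lttng_event_reserve(), debug hack: */
> int cpu = lib_ring_buffer_get_cpu(&client_config);
> fprintf(stderr, "lib_ring_buffer_get_cpu() = %d\n", cpu);
> """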
>
> The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading enabled.
> The VM has its "HT Sharing" option set to "Any", which according to
> https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> means that each one of the virtual machine's virtual cores can share a
> physical core with another virtual machine, each virtual core using a
> different thread on that physical core. I assume none of this should be
> relevant except perhaps if there are bugs in VMware.
>
> Is it possible that this is an issue in LTTng, or should I look into how
> the kernel determines which CPU it is running on, and then check whether
> there are any VMware bugs in this area?
This is very likely a VMware bug: /proc/cpuinfo should show 4 CPUs (and
sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current CPU number
can be 0, 1, 2, or 3 at any point during execution.
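
To double-check this on the bad VM without involving UST at all, a
minimal test along these lines (a sketch; sched_getcpu() needs glibc 2.6
or later) should expose the inconsistency:

"""
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_CONF);

        /* Sample the current CPU number for a while; any value >= ncpus
         * is the condition that crashes the ring buffer lookup. */
        for (long i = 0; i < 100000000; i++) {
                int cpu = sched_getcpu();

                if (cpu >= ncpus) {
                        printf("inconsistent: cpu %d, ncpus %ld\n",
                               cpu, ncpus);
                        return 1;
                }
        }
        printf("no inconsistency observed (ncpus %ld)\n", ncpus);
        return 0;
}
"""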
Thanks,
Mathieu
>
> Thanks in advance,
> David
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com