[lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Mon Jan 12 10:36:08 EST 2015


----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> To: "David OShea" <David.OShea at quantum.com>
> Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> Sent: Monday, January 12, 2015 10:34:37 AM
> Subject: Re: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
> 
> ----- Original Message -----
> > From: "David OShea" <David.OShea at quantum.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> > Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> > Sent: Monday, January 12, 2015 1:33:07 AM
> > Subject: RE: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > - CPU/VMware dependent
> > 
> > Hi Mathieu,
> > 
> > Apologies for the delay in getting back to you, please see below:
> > 
> > > -----Original Message-----
> > > From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> > > Sent: Friday, 12 December 2014 2:07 AM
> > > To: David OShea
> > > Cc: lttng-dev
> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > > 
> > > ________________________________
> > > 
> > > 	From: "David OShea" <David.OShea at quantum.com>
> > > 	To: "lttng-dev" <lttng-dev at lists.lttng.org>
> > > 	Sent: Sunday, December 7, 2014 10:30:04 PM
> > > 	Subject: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > > 
> > > 
> > > 
> > > 	Hi all,
> > > 
> > > 	We have encountered a problem with using LTTng-UST tracing with
> > > our application, where on a particular VMware vCenter cluster we almost
> > > always get segfaults when tracepoints are enabled, whereas on another
> > > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > > hit this problem.
> > > 
> > > 	I can reproduce this using lttng-ust/tests/hello after using:
> > > 
> > > 	"""
> > > 
> > > 	lttng create
> > > 
> > > 	lttng enable-channel channel0 --userspace
> > > 
> > > 	lttng add-context --userspace -t vpid -t vtid -t procname
> > > 
> > > 	lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> > > 
> > > 	lttng start
> > > 
> > > 	"""
> > > 
> > > 	In which case I get the following stack trace with an obvious
> > > NULL pointer dereference:
> > > 
> > > 	"""
> > > 
> > > 	Program terminated with signal SIGSEGV, Segmentation fault.
> > > 
> > > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > 
> > > 	48              return uatomic_read(&v_a->a);
> > > 
> > > 	[...]
> > > 
> > > 	#0  v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > > 
> > > 	#1  0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (buf=0x7f4a98008a00, chan=0x7f4a98008a00, offsets=0x7fffef67c620, ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> > > 
> > > 	#2  0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow (ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1819
> > > 
> > > 	#3  0x00007f4aa1095b75 in lib_ring_buffer_reserve (ctx=0x7fffef67ca40, config=0x7f4aa12b8ae0 <client_config>) at ../libringbuffer/frontend_api.h:211
> > > 
> > > 	#4  lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0) at lttng-ring-buffer-client.h:473
> > > 
> > > 	#5  0x000000000040135f in __event_probe__ust_tests_hello___tptest (__tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50, text=0x7fffef67cb70 "test", textlen=<optimized out>, doublearg=2, floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> > > 
> > > 	#6  0x0000000000400d2c in __tracepoint_cb_ust_tests_hello___tptest (boolarg=true, floatarg=2222, doublearg=2, textlen=4, text=0x7fffef67cb70 "test", values=0x7fffef67cb50, netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> > > 
> > > 	#7  main (argc=<optimized out>, argv=<optimized out>) at hello.c:92
> > > 
> > > 	"""
> > > 
> > > 	I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > > otherwise had the same software installed on them:
> > > 
> > > 	- CentOS 6-based
> > > 
> > > 	- kernel-2.6.32-504.1.3.el6 with some minor changes made in
> > > networking
> > > 
> > > 	- userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> > > which might have some minor patches backported, and leftovers of
> > > changes to get them to build on CentOS 5
> > > 
> > > 	On the “good” vCenter, I tested on two different VM hosts:
> > > 
> > > 	Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> > > 
> > > 	EVC Mode: Intel(R) "Nehalem" Generation
> > > 
> > > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > > 
> > > 	Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> > > 
> > > 	EVC Mode: Intel(R) "Nehalem" Generation
> > > 
> > > 	Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > > 
> > > 	The “bad” vCenter VM host that I tested on had this
> > > configuration:
> > > 
> > > 	ESX Version: VMware ESXi, 5.0.0, 469512
> > > 
> > > 	Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> > > 
> > > 	Any ideas?
> > > 
> > > 
> > > My bet would be that the OS is lying to userspace about the
> > > number of possible CPUs. I wonder what liblttng-ust
> > > libringbuffer/shm.h num_possible_cpus() is returning compared
> > > to what lib_ring_buffer_get_cpu() returns.
> > > 
> > > 
> > > Can you check this out ?
> > 
> > Yes, this seems to be the case - 'gdb' on the core dump shows:
> > 
> > (gdb) p __num_possible_cpus
> > $1 = 2
> > 
> > which is consistent with how I configured the virtual machine, which is
> > consistent with this output:
> > 
> > # lscpu
> > Architecture:          x86_64
> > CPU op-mode(s):        32-bit, 64-bit
> > Byte Order:            Little Endian
> > CPU(s):                2
> > On-line CPU(s) list:   0,1
> > Thread(s) per core:    1
> > Core(s) per socket:    1
> > Socket(s):             2
> > NUMA node(s):          1
> > Vendor ID:             GenuineIntel
> > CPU family:            6
> > Model:                 26
> > Stepping:              4
> > CPU MHz:               1995.000
> > BogoMIPS:              3990.00
> > Hypervisor vendor:     VMware
> > Virtualization type:   full
> > L1d cache:             32K
> > L1i cache:             32K
> > L2 cache:              256K
> > L3 cache:              18432K
> > NUMA node0 CPU(s):     0,1
> > 
> > Despite the fact that there are 2 CPUs, when I hacked
> > lttng-ring-buffer-client.h to output the result of
> > lib_ring_buffer_get_cpu()
> > and then ran tests/hello with tracing enabled, I could see it would sit on
> > CPU 0 for a while, or CPU 1, and perhaps move between the two, but
> > eventually either 2 or 3 would appear, immediately followed by the
> > segfault.
> > 
> > The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading
> > enabled.
> > The VM has its "HT Sharing" option set to "Any", which according to
> > https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> > means that each one of the virtual machine's virtual cores can share a
> > physical core with another virtual machine, each virtual core using a
> > different thread on that physical core.  I assume none of this should be
> > relevant except perhaps if there are bugs in VMware.
> > 
> > Is it possible that this is an issue in LTTng, or should I work out how the
> > kernel works out which CPU it is running on and then look into whether
> > there
> > are any VMware bugs in this area?
> 
> This appears to be very likely a VMware bug. /proc/cpuinfo should show
> 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current
> CPU number can be 0, 1, 2, 3 throughout execution.

You might want to look at the sysconf(3) manpage, especially the parts about
_SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that VMware is lying
about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
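
For reference, something like the following stand-alone check (just a sketch,
not part of lttng-ust; it assumes a Linux guest with glibc's sched_getcpu())
compares what sysconf() reports against the CPU number the kernel actually
runs the process on:

"""
#define _GNU_SOURCE
#include <sched.h>	/* sched_getcpu() (GNU extension) */
#include <stdio.h>
#include <unistd.h>	/* sysconf() */

int main(void)
{
	long conf = sysconf(_SC_NPROCESSORS_CONF);	/* "possible" CPUs */
	long onln = sysconf(_SC_NPROCESSORS_ONLN);	/* online CPUs */
	int i;

	printf("_SC_NPROCESSORS_CONF = %ld\n", conf);
	printf("_SC_NPROCESSORS_ONLN = %ld\n", onln);

	/*
	 * Sample the current CPU for a while. If the guest ever reports
	 * a CPU number >= the "possible" count, any per-CPU array sized
	 * from _SC_NPROCESSORS_CONF gets indexed out of bounds.
	 */
	for (i = 0; i < 10000000; i++) {
		int cpu = sched_getcpu();

		if (cpu >= conf)
			printf("current cpu %d >= possible cpus %ld\n",
			       cpu, conf);
	}
	return 0;
}
"""

If that ever prints a CPU number >= _SC_NPROCESSORS_CONF on the "bad" host,
the mismatch is clearly happening below LTTng-UST.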

Thanks,

Mathieu


> 
> Thanks,
> 
> Mathieu
> 
> 
> > 
> > Thanks in advance,
> > David
> > 
> 
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


