[lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Jan 12 10:36:08 EST 2015
----- Original Message -----
> From: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> To: "David OShea" <David.OShea at quantum.com>
> Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> Sent: Monday, January 12, 2015 10:34:37 AM
> Subject: Re: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
>
> ----- Original Message -----
> > From: "David OShea" <David.OShea at quantum.com>
> > To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> > Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> > Sent: Monday, January 12, 2015 1:33:07 AM
> > Subject: RE: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> > - CPU/VMware dependent
> >
> > Hi Mathieu,
> >
> > Apologies for the delay in getting back to you, please see below:
> >
> > > -----Original Message-----
> > > From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> > > Sent: Friday, 12 December 2014 2:07 AM
> > > To: David OShea
> > > Cc: lttng-dev
> > > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > >
> > > ________________________________
> > >
> > > From: "David OShea" <David.OShea at quantum.com>
> > > To: "lttng-dev" <lttng-dev at lists.lttng.org>
> > > Sent: Sunday, December 7, 2014 10:30:04 PM
> > > Subject: [lttng-dev] Segfault at v_read() called from
> > > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > > dependent
> > >
> > >
> > >
> > > Hi all,
> > >
> > > We have encountered a problem with using LTTng-UST tracing with
> > > our application, where on a particular VMware vCenter cluster we almost
> > > always get segfaults when tracepoints are enabled, whereas on another
> > > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > > hit this problem.
> > >
> > > I can reproduce this using lttng-ust/tests/hello after using:
> > >
> > > """
> > >
> > > lttng create
> > >
> > > lttng enable-channel channel0 --userspace
> > >
> > > lttng add-context --userspace -t vpid -t vtid -t procname
> > >
> > > lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> > >
> > > lttng start
> > >
> > > """
> > >
> > > In which case I get the following stack trace with an obvious
> > > NULL pointer dereference:
> > >
> > > """
> > >
> > > Program terminated with signal SIGSEGV, Segmentation fault.
> > >
> > > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > >
> > > 48 return uatomic_read(&v_a->a);
> > >
> > > [...]
> > >
> > > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> > >
> > > #1 0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> > >
> > > buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> > > offsets=0x7fffef67c620,
> > >
> > > ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> > >
> > > #2 0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> > > (ctx=0x7fffef67ca40)
> > >
> > > at ring_buffer_frontend.c:1819
> > >
> > > #3 0x00007f4aa1095b75 in lib_ring_buffer_reserve
> > > (ctx=0x7fffef67ca40,
> > >
> > > config=0x7f4aa12b8ae0 <client_config>)
> > >
> > > at ../libringbuffer/frontend_api.h:211
> > >
> > > #4 lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> > >
> > > at lttng-ring-buffer-client.h:473
> > >
> > > #5 0x000000000040135f in __event_probe__ust_tests_hello___tptest
> > > (
> > >
> > > __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> > >
> > > text=0x7fffef67cb70 "test", textlen=<optimized out>,
> > > doublearg=2,
> > >
> > > floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> > >
> > > #6 0x0000000000400d2c in
> > > __tracepoint_cb_ust_tests_hello___tptest (
> > >
> > > boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> > >
> > > text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> > >
> > > netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> > >
> > > #7 main (argc=<optimized out>, argv=<optimized out>) at
> > > hello.c:92
> > >
> > > """
> > >
> > > I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > > otherwise had the same software installed on them:
> > >
> > > - CentOS 6-based
> > > - kernel-2.6.32-504.1.3.el6 with some minor changes made in networking
> > > - userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2 which might
> > >   have some minor patches backported, and leftovers of changes to get them
> > >   to build on CentOS 5
> > >
> > > On the “good” vCenter, I tested on two different VM hosts:
> > >
> > > Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> > > EVC Mode: Intel(R) "Nehalem" Generation
> > > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > >
> > > Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> > > EVC Mode: Intel(R) "Nehalem" Generation
> > > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> > >
> > > The “bad” vCenter VM host that I tested on had this configuration:
> > >
> > > ESX Version: VMware ESXi, 5.0.0, 469512
> > > Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> > >
> > > Any ideas?
> > >
> > >
> > > My bet would be that the OS is lying to userspace about the
> > > number of possible CPUs. I wonder what liblttng-ust
> > > libringbuffer/shm.h num_possible_cpus() is returning compared
> > > to what lib_ring_buffer_get_cpu() returns.
> > >
> > >
> > > Can you check this out ?
> >
> > Yes, this seems to be the case - 'gdb' on the core dump shows:
> >
> > (gdb) p __num_possible_cpus
> > $1 = 2
> >
> > which is consistent with how I configured the virtual machine, which is
> > consistent with this output:
> >
> > # lscpu
> > Architecture: x86_64
> > CPU op-mode(s): 32-bit, 64-bit
> > Byte Order: Little Endian
> > CPU(s): 2
> > On-line CPU(s) list: 0,1
> > Thread(s) per core: 1
> > Core(s) per socket: 1
> > Socket(s): 2
> > NUMA node(s): 1
> > Vendor ID: GenuineIntel
> > CPU family: 6
> > Model: 26
> > Stepping: 4
> > CPU MHz: 1995.000
> > BogoMIPS: 3990.00
> > Hypervisor vendor: VMware
> > Virtualization type: full
> > L1d cache: 32K
> > L1i cache: 32K
> > L2 cache: 256K
> > L3 cache: 18432K
> > NUMA node0 CPU(s): 0,1
> >
> > Despite the fact that there are 2 CPUs, when I hacked
> > lttng-ring-buffer-client.h to output the result of
> > lib_ring_buffer_get_cpu()
> > and then ran tests/hello with tracing enabled, I could see it would sit on
> > CPU 0 for a while, or CPU 1, and perhaps move between the two, but
> > eventually either 2 or 3 would appear, immediately followed by the
> > segfault.
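> >
> > For what it's worth, something similar can be watched for outside LTTng with
> > a small loop. This is only a sketch, and it assumes the fast path's CPU
> > lookup behaves like sched_getcpu(2) and that the per-CPU buffers are sized
> > from sysconf(_SC_NPROCESSORS_CONF):
> >
> > """
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <stdio.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> >         long possible = sysconf(_SC_NPROCESSORS_CONF); /* "possible" CPUs */
> >
> >         for (;;) {
> >                 int cpu = sched_getcpu();
> >
> >                 /* A CPU number >= possible would make a per-CPU buffer
> >                  * lookup fail, matching the NULL v_a in the trace. */
> >                 if (cpu < 0 || cpu >= possible) {
> >                         printf("current CPU %d, possible CPUs %ld\n",
> >                                cpu, possible);
> >                         return 1;
> >                 }
> >         }
> > }
> > """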
> >
> > The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading
> > enabled.
> > The VM has its "HT Sharing" option set to "Any", which according to
> > https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> > means that each one of the virtual machine's virtual cores can share a
> > physical core with another virtual machine, each virtual core using a
> > different thread on that physical core. I assume none of this should be
> > relevant except perhaps if there are bugs in VMware.
> >
> > Is it possible that this is an issue in LTTng, or should I work out how the
> > kernel determines which CPU it is running on and then look into whether
> > there are any VMware bugs in this area?
>
> This appears to be very likely a VMware bug. /proc/cpuinfo should show
> 4 CPUs (and sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current
> CPU number can be 0, 1, 2, 3 throughout execution.
You might want to look at the sysconf(3) manpage, especially the parts about
_SC_NPROCESSORS_CONF and _SC_NPROCESSORS_ONLN. My guess is that VMware is lying
about the number of "possible" CPUs (_SC_NPROCESSORS_CONF).
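
A quick way to see both values (plain C, nothing LTTng-specific, just a sketch):

"""
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        /* Configured ("possible") CPUs vs. CPUs currently online. */
        printf("_SC_NPROCESSORS_CONF: %ld\n", sysconf(_SC_NPROCESSORS_CONF));
        printf("_SC_NPROCESSORS_ONLN: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
        return 0;
}
"""

If _SC_NPROCESSORS_CONF reports 2 while the scheduler can still run the process
on CPU 2 or 3, the tracer ends up indexing a per-CPU buffer that was never
allocated, which matches the NULL v_a above.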
Thanks,
Mathieu
>
> Thanks,
>
> Mathieu
>
>
> >
> > Thanks in advance,
> > David
> >
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com