[lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware dependent
Mathieu Desnoyers
mathieu.desnoyers at efficios.com
Mon Jan 12 10:34:37 EST 2015
----- Original Message -----
> From: "David OShea" <David.OShea at quantum.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> Cc: "lttng-dev" <lttng-dev at lists.lttng.org>
> Sent: Monday, January 12, 2015 1:33:07 AM
> Subject: RE: [lttng-dev] Segfault at v_read() called from lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app
> - CPU/VMware dependent
>
> Hi Mathieu,
>
> Apologies for the delay in getting back to you, please see below:
>
> > -----Original Message-----
> > From: Mathieu Desnoyers [mailto:mathieu.desnoyers at efficios.com]
> > Sent: Friday, 12 December 2014 2:07 AM
> > To: David OShea
> > Cc: lttng-dev
> > Subject: Re: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> >
> > ________________________________
> >
> > From: "David OShea" <David.OShea at quantum.com>
> > To: "lttng-dev" <lttng-dev at lists.lttng.org>
> > Sent: Sunday, December 7, 2014 10:30:04 PM
> > Subject: [lttng-dev] Segfault at v_read() called from
> > lib_ring_buffer_try_reserve_slow() in LTTng-UST traced app - CPU/VMware
> > dependent
> >
> >
> >
> > Hi all,
> >
> > We have encountered a problem using LTTng-UST tracing with
> > our application: on one particular VMware vCenter cluster we almost
> > always get segfaults when tracepoints are enabled, whereas on another
> > vCenter cluster, and on every other machine we’ve ever used, we don’t
> > hit this problem.
> >
> > I can reproduce this using lttng-ust/tests/hello after using:
> >
> > """
> >
> > lttng create
> >
> > lttng enable-channel channel0 --userspace
> >
> > lttng add-context --userspace -t vpid -t vtid -t procname
> >
> > lttng enable-event --userspace "ust_tests_hello:*" -c channel0
> >
> > lttng start
> >
> > """
> >
> > In which case I get the following stack trace with an obvious
> > NULL pointer dereference:
> >
> > """
> >
> > Program terminated with signal SIGSEGV, Segmentation fault.
> >
> > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> >
> > 48 return uatomic_read(&v_a->a);
> >
> > [...]
> >
> > #0 v_read (config=<optimized out>, v_a=0x0) at vatomic.h:48
> >
> > #1 0x00007f4aa10a4804 in lib_ring_buffer_try_reserve_slow (
> >
> > buf=0x7f4a98008a00, chan=0x7f4a98008a00,
> > offsets=0x7fffef67c620,
> >
> > ctx=0x7fffef67ca40) at ring_buffer_frontend.c:1677
> >
> > #2 0x00007f4aa10a6c9f in lib_ring_buffer_reserve_slow
> > (ctx=0x7fffef67ca40)
> >
> > at ring_buffer_frontend.c:1819
> >
> > #3 0x00007f4aa1095b75 in lib_ring_buffer_reserve
> > (ctx=0x7fffef67ca40,
> >
> > config=0x7f4aa12b8ae0 <client_config>)
> >
> > at ../libringbuffer/frontend_api.h:211
> >
> > #4 lttng_event_reserve (ctx=0x7fffef67ca40, event_id=0)
> >
> > at lttng-ring-buffer-client.h:473
> >
> > #5 0x000000000040135f in __event_probe__ust_tests_hello___tptest
> > (
> >
> > __tp_data=0xed3410, anint=0, netint=0, values=0x7fffef67cb50,
> >
> > text=0x7fffef67cb70 "test", textlen=<optimized out>,
> > doublearg=2,
> >
> > floatarg=2222, boolarg=true) at ././ust_tests_hello.h:32
> >
> > #6 0x0000000000400d2c in
> > __tracepoint_cb_ust_tests_hello___tptest (
> >
> > boolarg=true, floatarg=2222, doublearg=2, textlen=4,
> >
> > text=0x7fffef67cb70 "test", values=0x7fffef67cb50,
> >
> > netint=<optimized out>, anint=0) at ust_tests_hello.h:32
> >
> > #7 main (argc=<optimized out>, argv=<optimized out>) at
> > hello.c:92
> >
> > """
> >
> > I hit this segfault 10 out of 10 times I ran “hello” on a VM on
> > one vCenter and 0 out of 10 times I ran it on the other, and the VMs
> > otherwise had the same software installed on them:
> >
> > - CentOS 6-based
> >
> > - kernel-2.6.32-504.1.3.el6 with some minor changes made in
> > networking
> >
> > - userspace-rcu-0.8.3, lttng-ust-2.3.2 and lttng-tools-2.3.2
> > which might have some minor patches backported, and leftovers of
> > changes to get them to build on CentOS 5
> >
> > On the “good” vCenter, I tested on two different VM hosts:
> >
> > Processor Type: Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
> >
> > EVC Mode: Intel(R) "Nehalem" Generation
> >
> > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> >
> > Processor Type: Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
> >
> > EVC Mode: Intel(R) "Nehalem" Generation
> >
> > Image Profile: (Updated) ESXi-5.1.0-799733-standard
> >
> > The “bad” vCenter VM host that I tested on had this
> > configuration:
> >
> > ESX Version: VMware ESXi, 5.0.0, 469512
> >
> > Processor Type: Intel(R) Xeon(R) CPU X7550 @ 2.00GHz
> >
> > Any ideas?
> >
> >
> > My bet would be that the OS is lying to userspace about the
> > number of possible CPUs. I wonder what liblttng-ust
> > libringbuffer/shm.h num_possible_cpus() is returning compared
> > to what lib_ring_buffer_get_cpu() returns.
> >
> >
> > Can you check this out?
>
> Yes, this seems to be the case - 'gdb' on the core dump shows:
>
> (gdb) p __num_possible_cpus
> $1 = 2
>
> which is consistent with how I configured the virtual machine, which is
> consistent with this output:
>
> # lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                2
> On-line CPU(s) list:   0,1
> Thread(s) per core:    1
> Core(s) per socket:    1
> Socket(s):             2
> NUMA node(s):          1
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 26
> Stepping:              4
> CPU MHz:               1995.000
> BogoMIPS:              3990.00
> Hypervisor vendor:     VMware
> Virtualization type:   full
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              18432K
> NUMA node0 CPU(s):     0,1
>
> Despite the fact that there are 2 CPUs, when I hacked
> lttng-ring-buffer-client.h to print the result of lib_ring_buffer_get_cpu()
> and then ran tests/hello with tracing enabled, I could see it sit on
> CPU 0 for a while, or on CPU 1, sometimes moving between the two, but
> eventually either 2 or 3 would appear, immediately followed by the segfault.
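>
> (For reference, the hack was nothing more than something along these
> lines in lttng_event_reserve() - a sketch from memory, not the exact
> patch:)
>
> """
> /* lttng-ring-buffer-client.h, lttng_event_reserve(), debug hack: */
> int cpu = lib_ring_buffer_get_cpu(&client_config);
> fprintf(stderr, "lib_ring_buffer_get_cpu() = %d\n", cpu);
> """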
>
> The VM host has 4 sockets, 8 cores per socket, with Hyper-Threading enabled.
> The VM has its "HT Sharing" option set to "Any", which according to
> https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vm_admin.doc_50%2FGUID-101176D4-9866-420D-AB4F-6374025CABDA.html
> means that each one of the virtual machine's virtual cores can share a
> physical core with another virtual machine, each virtual core using a
> different thread on that physical core. I assume none of this should be
> relevant except perhaps if there are bugs in VMware.
>
> Is it possible that this is an issue in LTTng, or should I look into how
> the kernel determines which CPU it is running on, and then check whether
> there are any VMware bugs in this area?
This is very likely a VMware bug: /proc/cpuinfo should show 4 CPUs (and
sysconf(_SC_NPROCESSORS_CONF) should return 4) if the current CPU number
can be 0, 1, 2, or 3 at any point during execution.
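
To double-check this on the bad VM without involving UST at all, a
minimal test along these lines (a sketch; sched_getcpu() needs glibc 2.6
or later) should expose the inconsistency:

"""
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        long ncpus = sysconf(_SC_NPROCESSORS_CONF);

        /* Sample the current CPU number for a while; any value >= ncpus
         * is the condition that crashes the ring buffer lookup. */
        for (long i = 0; i < 100000000; i++) {
                int cpu = sched_getcpu();

                if (cpu >= ncpus) {
                        printf("inconsistent: cpu %d, ncpus %ld\n",
                               cpu, ncpus);
                        return 1;
                }
        }
        printf("no inconsistency observed (ncpus %ld)\n", ncpus);
        return 0;
}
"""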
Thanks,
Mathieu
>
> Thanks in advance,
> David
>
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com