[ltt-dev] use percpu variable for ltt_nesting

Jiaying Zhang jiayingz at google.com
Mon Sep 15 03:46:45 EDT 2008


Hi Mathieu,

On Tue, Sep 9, 2008 at 2:51 PM, Mathieu Desnoyers <
compudj at krystal.dyndns.org> wrote:

> Hi Jiaying,
>
> I'm interested in knowing a little bit more about the setup generating
> these numbers. Can you ship me the .config and dmesg output of a machine
> running LTTng ? I am particularly interested in knowing if LTTng
> detected synchronized TSCs. I would also like it if you could try to force
> the TIF_KERNEL_TRACE to 0 (by bypassing kernel/sched.c:
> set_kernel_trace_flag_all_tasks()).


Here are the LTT configurations in my .config file:

CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
# CONFIG_MARKERS_USERSPACE is not set
CONFIG_HAVE_MARKERS_ABI=y
CONFIG_TRACEPROBES=y

#
# Linux Trace Toolkit
#
CONFIG_LTT_TIMESTAMP=y
CONFIG_HAVE_LTT_UNSTABLE_TSC=y
CONFIG_HAVE_LTT_CLOCK=y
# CONFIG_HAVE_LTT_SYNTHETIC_TSC is not set
CONFIG_HAVE_LTT_DUMP_TABLES=y
CONFIG_LTT=y
CONFIG_LTT_RELAY=y
CONFIG_LTT_SERIALIZE=y
CONFIG_LTT_MARKER_CONTROL=y
CONFIG_LTT_TRACER=y
CONFIG_LTT_ALIGNMENT=y
CONFIG_LTT_VMCORE=y
# CONFIG_LTT_HEARTBEAT is not set
CONFIG_LTT_NETLINK_CONTROL=m
CONFIG_LTT_STATEDUMP=m

The TSCs of the machine should be synchronized, but I think LTT treats the
TSC as unstable by default. I tried bypassing TIF_KERNEL_TRACE as you
suggested; performance is about 10% better on my testing machine, but I did
not get any valid tracing data in that case. So I wonder whether the
performance difference is just because we did not dump any tracing data to
disk in that case.
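
For reference, the bypass looks roughly like the sketch below. This is an
assumed form of the LTTng-patched kernel/sched.c, so the real function body
may differ; the point is only that returning early leaves TIF_KERNEL_TRACE
cleared for every task:

#include <linux/sched.h>	/* tasklist_lock, do_each_thread, set_tsk_thread_flag */

void set_kernel_trace_flag_all_tasks(void)
{
#if 1	/* benchmark bypass: leave TIF_KERNEL_TRACE cleared for all tasks */
	return;
#else	/* assumed original body -- the real code in kernel/sched.c may differ */
	struct task_struct *g, *t;

	read_lock(&tasklist_lock);
	do_each_thread(g, t) {
		set_tsk_thread_flag(t, TIF_KERNEL_TRACE);
	} while_each_thread(g, t);
	read_unlock(&tasklist_lock);
#endif
}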

>
> Having benchmarks for your setup with :
>
> - LTTng compiled out
> - LTTng compiled-in, all markers disabled
> - LTTng compiled-in, default markers enabled
> - LTTng compiled-in, markers disabled, a single trace is created, but not
>  active. (lttctl -n name -c)
>  This will call set_kernel_trace_flag_all_tasks() to activate syscall
>  tracing for all threads.
> - LTTng compiled-in, default markers enabled, taking an active trace in
>  normal mode (dumping to disk).
> - Same as above, but LTTng running in flight recorder mode, with default
>  size buffers.
> - Same as above, but with LTTng running with tiny buffers (~64kB each).
>
> If it's easy enough, doing these tests on UP and SMP kernels would be
> good. Knowing all that, we will be able to see if we can start improving
> performance.


I collected some performance results with tbench. For the runs with
tracing enabled, I enabled the following tracepoints:

   - fs_exec
   - kernel_arch_kthread_create
   - kernel_process_exit
   - kernel_process_fork
   - kernel_process_free
   - kernel_process_wait
   - kernel_sched_schedule
   - list_process_state
   - kernel_arch_syscall_entry
   - kernel_arch_syscall_exit
   - kernel_irq_entry
   - kernel_irq_exit
   - kernel_arch_trap_entry
   - kernel_arch_trap_exit

Here are the results:
SMP:
- LTTng compiled out
Throughput 761.508 MB/sec 4 procs
- LTTng compiled-in, all markers disabled
Throughput 722.983 MB/sec 4 procs
- LTTng compiled-in, default markers enabled
Throughput 711.769 MB/sec 4 procs
- LTTng compiled-in, markers disabled, a single trace is created, but not
active. (lttctl -n name -c)
Throughput 706.733 MB/sec 4 procs
- LTTng compiled-in, default markers enabled, taking an active trace in
normal mode (dumping to disk).
Throughput 636.569 MB/sec 4 procs
- Same as above, but LTTng running in flight recorder mode, with default
size buffers.
Throughput 646.674 MB/sec 4 procs
- Same as above, but with LTTng running with tiny buffers (~64kB each).
Throughput 645.467 MB/sec 4 procs

UP:
- LTTng compiled out
Throughput 233.791 MB/sec 1 procs
- LTTng compiled-in, all markers disabled
Throughput 226.542 MB/sec 1 procs
- LTTng compiled-in, default markers enabled
Throughput 226.053 MB/sec 1 procs
- LTTng compiled-in, markers disabled, a single trace is created, but not
active. (lttctl -n name -c)
Throughput 232.174 MB/sec 1 procs
- LTTng compiled-in, default markers enabled, taking an active trace in
normal mode (dumping to disk)
Throughput 214.197 MB/sec 1 procs
- Same as above, but LTTng running in flight recorder mode, with default
size buffers.
Throughput 201.183 MB/sec 1 procs
- Same as above, but with LTTng running with tiny buffers (~64kB each).
Throughput 200.156 MB/sec 1 procs

The first observation is that just compiling LTT in adds about 6-7%
overhead with SMP. I tried enabling/disabling different LTT configuration
options and found that CONFIG_TRACEPROBES is the one that adds the
additional cost. If I configure out CONFIG_TRACEPROBES, or configure it
as a module, the performance is the same as with LTTng configured out.
Here are the results I collected.

- LTTng compiled out: CONFIG_TRACEPOINTS=n CONFIG_MARKERS=n
CONFIG_TRACEPROBES=n CONFIG_LTT=n
Throughput 761.508 MB/sec 4 procs
- CONFIG_TRACEPOINTS=y CONFIG_MARKERS=y CONFIG_TRACEPROBES=m CONFIG_LTT=n
Throughput 766.4 MB/sec 4 procs
- CONFIG_TRACEPOINTS=y CONFIG_MARKERS=y CONFIG_TRACEPROBES=m CONFIG_LTT=y
(similar results seen when CONFIG_TRACEPOINTS=n and CONFIG_TRACEPROBES=y)
Throughput 760.434 MB/sec 4 procs
- LTTng compiled-in, all markers disabled, i.e., CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y CONFIG_TRACEPROBES=y CONFIG_LTT=y
Throughput 722.983 MB/sec 4 procs

The only difference I can see with CONFIG_TRACEPROBES is that the system
needs to run extra code when the option is enabled. But I am not sure
whether there is also some synchronization cost that makes the additional
overhead so high.
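
To make that concrete, my (possibly wrong) reading is that with
CONFIG_TRACEPROBES=y the probe functions get connected to the tracepoints
at boot, so each instrumented event still runs something like the path
sketched below even though every marker is disabled. The names here are
made up for illustration; only the shape of the per-event path matters:

#include <linux/compiler.h>	/* unlikely() */
#include <linux/marker.h>	/* trace_mark() */

/*
 * Illustrative sketch only -- not the actual LTTng code.
 */
struct tracepoint_sketch {
	const char *name;	/* event name */
	int state;		/* non-zero once a probe is connected */
	void **funcs;		/* connected probe callbacks */
};

/* With CONFIG_TRACEPROBES=y the probe is connected, so state != 0. */
static struct tracepoint_sketch tp_kernel_irq_entry = {
	.name = "kernel_irq_entry", .state = 1, .funcs = NULL,
};

/* The probe: its marker drops the event while markers are disabled, but
 * we still paid for the call and for the marker state check. */
static void probe_irq_entry(unsigned int irq)
{
	trace_mark(kernel_irq_entry, "irq_id %u", irq);
}

/* The inline call site: a load of ->state plus a branch on every event,
 * and, with the probe connected, the whole path above as well. */
static inline void trace_kernel_irq_entry(unsigned int irq)
{
	if (unlikely(tp_kernel_irq_entry.state))
		probe_irq_entry(irq);
}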

Another observation is that tracing adds about another 10% overhead with
SMP. The overhead with UP is a little lower, about 7%. So there may be
some synchronization cost with SMP tracing.
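
One source of SMP-only cost that fits this pattern is the kind of
cache-line sharing the ltt_nesting patch quoted below was meant to remove:
per-CPU counters kept in a plain global array share cache lines, so even
"CPU-local" increments bounce those lines between processors. As a general
illustration (a sketch, not the actual LTTng code):

#include <linux/percpu.h>
#include <linux/smp.h>

/* Counters packed into one array: entries for different CPUs share cache
 * lines, so this increment causes cross-CPU cache traffic on every traced
 * event. */
static unsigned int nesting_array[NR_CPUS];

static inline void nest_array(void)
{
	nesting_array[smp_processor_id()]++;
}

/* Per-cpu variable: each CPU's counter lives in its own per-cpu area, so
 * the increment stays in the local cache. */
static DEFINE_PER_CPU(unsigned int, nesting);

static inline void nest_percpu(void)
{
	per_cpu(nesting, smp_processor_id())++;
}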

Using 'flight' mode does not seem to help performance much. I guess
dumping data to disk isn't the bottleneck here. Do you have any
suggestions about the places we should look at where tracing may add
overhead?

Jiaying

>
>
> Thanks,
>
> Mathieu
>
> * Jiaying Zhang (jiayingz at google.com) wrote:
> > Hi Mathieu,
> >
> > I found lttng sometimes has very poor performance on multiple-CPU
> > systems, and it seems the more processors the system has, the more
> > performance overhead I saw with lttng enabled.
> > Here are the results I collected with the tbench benchmark (
> > http://samba.org/ftp/tridge/dbench/).
> >
> > single processor:
> >    lttng disabled: 236.07 MB/sec
> >    lttng enabled:  210.569 MB/sec
> >
> > 16 processors:
> >    lttng disabled:  4173.83 MB/sec
> >    lttng enabled:  1766.77 MB/sec
> >
> > After playing with the code for a while and asking around, I found
> > this performance issue is caused by CPU contention while updating the
> > ltt_nesting variable used in ltt/ltt-serialize.c:ltt_vtrace:
> >   ltt_nesting[smp_processor_id()]++;
> >   ...
> >   ltt_nesting[smp_processor_id()]--;
> >
> > Attached is the patch that changes ltt_nesting into a per_cpu
> > variable. With the patch applied, tbench throughput with lttng
> > enabled gets to about 3600 MB/sec on the 16-processor system.
> >
> > Jiaying
>
> > Index: linux-2.6.26/include/linux/ltt-core.h
> > ===================================================================
> > --- linux-2.6.26.orig/include/linux/ltt-core.h        2008-09-02 21:36:27.000000000 -0700
> > +++ linux-2.6.26/include/linux/ltt-core.h     2008-09-02 21:36:52.000000000 -0700
> > @@ -25,7 +25,7 @@
> >
> >
> >  /* Keep track of trap nesting inside LTT */
> > -extern unsigned int ltt_nesting[];
> > +extern unsigned int per_cpu_var(ltt_nesting);
> >
> >  typedef int (*ltt_run_filter_functor)(void *trace, uint16_t eID);
> >  extern ltt_run_filter_functor ltt_run_filter;
> > Index: linux-2.6.26/ltt/ltt-relay.c
> > ===================================================================
> > --- linux-2.6.26.orig/ltt/ltt-relay.c 2008-09-02 21:36:27.000000000 -0700
> > +++ linux-2.6.26/ltt/ltt-relay.c      2008-09-02 21:36:52.000000000 -0700
> > @@ -1013,7 +1013,7 @@
> >       /*
> >        * Perform retryable operations.
> >        */
> > -     if (ltt_nesting[smp_processor_id()] > 4) {
> > +     if (per_cpu(ltt_nesting, smp_processor_id()) > 4) {
> >               local_inc(&ltt_buf->events_lost);
> >               return NULL;
> >       }
> > @@ -1212,7 +1212,7 @@
> >       "buffer full : event lost in blocking "
> >       "mode. Increase LTT_RESERVE_CRITICAL.\n");
> >       printk(KERN_ERR "LTT nesting level is %u.\n",
> > -             ltt_nesting[cpu]);
> > +             per_cpu(ltt_nesting, cpu));
> >       printk(KERN_ERR "LTT avail size %lu.\n",
> >               dbg->avail_size);
> >       printk(KERN_ERR "avai write : %lu, read : %lu\n",
> > Index: linux-2.6.26/ltt/ltt-serialize.c
> > ===================================================================
> > --- linux-2.6.26.orig/ltt/ltt-serialize.c     2008-09-02 21:36:27.000000000 -0700
> > +++ linux-2.6.26/ltt/ltt-serialize.c  2008-09-02 21:36:52.000000000 -0700
> > @@ -647,7 +647,7 @@
> >               cpu = smp_processor_id();
> >       else
> >               cpu = private_data->cpu;
> > -     ltt_nesting[smp_processor_id()]++;
> > +     per_cpu(ltt_nesting, smp_processor_id())++;
> >
> >       if (unlikely(private_data && private_data->trace))
> >               dest_trace = private_data->trace;
> > @@ -703,7 +703,7 @@
> >               va_end(args_copy);
> >               ltt_commit_slot(channel, &transport_data, buffer, slot_size);
> >       }
> > -     ltt_nesting[smp_processor_id()]--;
> > +     per_cpu(ltt_nesting, smp_processor_id())--;
> >       preempt_enable();
> >  }
> >  EXPORT_SYMBOL_GPL(ltt_vtrace);
>
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
>