[lttng-dev] 答复: UST performance

zhenyu.ren zhenyu.ren at aliyun.com
Tue Apr 28 21:31:16 EDT 2015


Hi,Mathieu> Is your sched_getcpu() system call a vdso too    I use VSYSCALL_ADDR(__NR_vgetcpu) instead sched_getcpu so there is no getcpu system calls.
> make sure you #define _LGPL_SOURCE in your benchmark program    I do it as you suggested but no difference.
> Then, you can try using "perf" to profile the runtime execution $~/renzhyu/perf record ./sample
WARNING: Kernel address maps (/proc/{kallsyms,modules}) are restricted,
check /proc/sys/kernel/kptr_restrict.

Samples in kernel functions may not be resolved if a suitable vmlinux
file is not found in the buildid cache or in the vmlinux path.

Samples in kernel modules won't be resolved at all.

If some relocation was applied (e.g. kexec) symbols may be misresolved
even with a suitable vmlinux or kallsyms file.


using 5720094251 ns
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.221 MB perf.data (~9658 samples) ]
$~/renzhyu/perf report
Events: 5K cycles
16.82% sample liblttng-ust.so.0.0.0 [.] lttng_event_commit10.55% sample liblttng-ust.so.0.0.0 [.] lttng_event_strcpy10.45% sample [kernel.kallsyms] [k] 0xffffffff8103548a8.16% sample liblttng-ust.so.0.0.0 [.] lttng_event_reserve7.57% sample sample [.] __event_get_size__sample_component___message6.42% sample sample [.] __event_probe__sample_component___message6.24% sample liblttng-ust.so.0.0.0 [.] lttng_write_event_header5.40% sample sample [.] _rcu_read_lock_update5.35% sample sample [.] _rcu_read_unlock_bp5.17% sample liburcu-bp.so.2.0.0 [.] rcu_read_unlock_bp4.14% sample ld-2.5.so [.] __tls_get_addr3.38% sample [vdso] [.] 0x7fff8f72e6be3.35% sample sample [.] main1.88% sample sample [.] lib_ring_buffer_ctx_init1.79% sample sample [.] _rcu_read_lock_bp1.20% sample liburcu-bp.so.2.0.0 [.] rcu_read_lock_bp0.66% sample sample [.] lib_ring_buffer_align_ctx0.38% sample sample [.] cds_list_empty0.37% sample sample [.] lib_ring_buffer_align0.35% sample liburcu-bp.so.2.0.0 [.] __tls_get_addr at plt0.28% sample sample [.] __event_get_align__sample_component___message 0.09% sample ld-2.5.so [.] do_lookup_The symbol  0xffffffff8103548a is in native_write_msr_safe,I don't know what the function is and why that kernel function can be TOP 3.ffffffff81035480 <native_write_msr_safe>:
ffffffff81035480: 55 push %rbp
ffffffff81035481: 89 f9 mov %edi,%ecx
ffffffff81035483: 89 f0 mov %esi,%eax
ffffffff81035485: 48 89 e5 mov %rsp,%rbp
ffffffff81035488: 0f 30 wrmsr
ffffffff8103548a: 31 c0 xor %eax,%eax
ffffffff8103548c: c9 leaveq
ffffffff8103548d: c3 retq
ffffffff8103548e: 66 90 xchg %ax,%axPS:my sample.c#include <time.h>
#include <unistd.h>
#include <stdio.h>
#define _LGPL_SOURCE
#define TRACEPOINT_DEFINE
#include "sample_component_provider.h"

#define NUM 10000000
int main(int argc, char **argv)
{
 int i = 0;
 struct timespec bef,aft;
 long diff;
 clock_gettime(CLOCK_MONOTONIC,&bef);
 for (i = 0; i < NUM; i++) {
tracepoint(sample_component, message, "HelloWorld");
 }
 clock_gettime(CLOCK_MONOTONIC,&aft);
 diff = (aft.tv_sec-bef.tv_sec)*1000000000+(aft.tv_nsec-bef.tv_nsec);
 printf("using %ld ns\n", diff);
 return 0;
}
Thankszhenyu.ren------------------------------------------------------------------发件人:Mathieu Desnoyers <mathieu.desnoyers at efficios.com>发送时间:2015年4月28日(星期二) 23:47收件人:zhenyu.ren <zhenyu.ren at aliyun.com>抄 送:lttng-dev <lttng-dev at lists.lttng.org>主 题:Re: [lttng-dev] UST performance
Hi,
Is your sched_getcpu() system call a vdso too ? This may explain the difference.

Moreover, make sure you #define _LGPL_SOURCE in your benchmark program
before including any lttng ust header. It will inline many functions associated with
tracepoints.

Then, you can try using "perf" to profile the runtime execution of your program
traced by lttng-ust ? This should give us the information we need to investigate the
performance profile.

Thanks,

Mathieu


Hi,dev
    About a month ago,I asked a question about how to do time recording for ust event when vdso_clock_gettime not availble on my platform(Glibc too old).Recently I implemented a lib to analyze vdso in memory and lookuped the symbol vdso_clock_gettime successfully.With this vdso_clock_gettime availble,I get a 16% improvement in time spending per ust event. I.E. I measured the time taked by a event with 10 bytes string :
 no   vdso_clock_gettime : 0.642us with vdso_clock_gettime:0.554us
However,Both the numbers are much higher than the following 0.28us(But closed to 0.5us --W/O opt). So what's the difference between UST(0.28us) and UST W/O OPT(0.5us)?? 


PS:My hardware2 x Xeon E5-2430  2.20GHz 100MHz FSB (24 cores) 94.6GB / 96GB 1333MHz DDR3 == 6 x 16GB, 6 x empty
Thankszhenyu.ren


_______________________________________________
lttng-dev mailing list
lttng-dev at lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev



-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.lttng.org/pipermail/lttng-dev/attachments/20150429/84c72d06/attachment-0001.html>


More information about the lttng-dev mailing list