[ltt-dev] UST clock rdtsc vs clock_gettime

David Goulet david.goulet at polymtl.ca
Tue Jul 13 12:55:35 EDT 2010



On 10-07-07 12:32 PM, Mathieu Desnoyers wrote:
> * David Goulet (david.goulet at polymtl.ca) wrote:
>> On 10-07-06 03:39 PM, Nils Carlson wrote:
>>> Cool, so the measurements came through...
>>>
>>
>> I've retested the UST per-event time with the new commit made a few days
>> ago fixing the custom probes and cache line alignment. Here are the
>> results for the TSC counter and clock_gettime (test run 1000 times on an
>> i7):
>>
>> rdtsc:
>> Average: 0.000000242229708 sec, 242.22971 nsec
>> Standard deviation: 0.000000001663147 sec, 1.66315 nsec
>>
>> clock_gettime:
>> Average: 0.000000272516616 sec, 272.51662 nsec
>> Standard deviation: 0.000000002340784 sec, 2.34078 nsec
>>
>>> What I would like to see is the automatic detection of whether the rdtsc
>>> instruction is usable,
>>> a test for this already exists in the kernel and the question is whether
>>> this info is currently exported
>>> or whether we need to submit a patch to export it.
>>>
>>
>> From userspace, testing this would be a syscall via prctl, right? The
>> thing is that it's needed at compile time. Right now, the __i386__ and
>> __x86_64__ defines are tested. At gcc compile time, it would be great to
>> have something like a TSC_AVAILABLE define and then compile the right
>> function (either clock_gettime or rdtsc).
>>
>> However, there are some issues about consistency when using the TSC, for
>> example between per-CPU counters... so I think we need to be very careful
>> about that, even if the per-event time is 30ns lower and much more
>> _stable_ (see the standard deviation).
>
> We only care about having a consistent read across CPUs (and speed, aka
> throughput). Having a different standard deviation does not matter much.
> We cannot know if the architecture we will be deployed on has consistent
> TSCs across cores, so we have to test it at runtime.
>
> One approach might be to try using prctl at library load, but I don't
> see any information about consistent tsc in there.
>
> The other approach is to use a vDSO for trace clock (as I proposed
> earlier). You can try to create something very similar in userland for
> benchmarks: Create a function that tests a global boolean to figure out
> if we can simply read the TSC, and perform the TSC read if the check is
> ok. Make sure the function is -not- static and has the attribute
> "noinline", so the compiler generates the function call. Also make sure
> that the variable you are testing for "tsc consistency" is not marked
> static either, but rather marked "volatile", so the compiler does not
> optimize the load away. Compile with -O2.
>
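For reference, a minimal sketch of the benchmark function described above, under a few assumptions: the names trace_clock_read and trace_clock_tsc_ok are made up for illustration, CLOCK_MONOTONIC is picked arbitrarily as the fallback clock, and nothing here sets the flag (a real runtime check would do that at library load). The two paths return different units (cycles vs. nanoseconds), which is fine for measuring call overhead but not for producing real timestamps.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdint.h>
#include <time.h>

/* Hypothetical flag, meant to be set once at library load.
 * Not static and marked volatile, so the compiler has to
 * reload it on every call instead of optimizing the test away. */
volatile int trace_clock_tsc_ok = 0;

/* Raw TSC read; returns 0 on non-x86 builds (flag must stay 0 there). */
static inline uint64_t read_tsc(void)
{
#if defined(__i386__) || defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return 0;
#endif
}

/* Fallback path: nanoseconds from CLOCK_MONOTONIC (an assumption). */
static uint64_t read_clock_gettime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Not static and noinline, so every read pays for a real function
 * call, roughly as a vDSO-style trace clock would. */
__attribute__((noinline)) uint64_t trace_clock_read(void)
{
    if (trace_clock_tsc_ok)
        return read_tsc();      /* units: TSC cycles */
    return read_clock_gettime(); /* units: nanoseconds */
}
```

Build with gcc -O2 (older glibc versions also need -lrt for clock_gettime) and call trace_clock_read() in the benchmark loop, once with the flag at 0 and once at 1.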

I'm wondering why the test function needs to be "noinline" and the
bool volatile. If the test is done at library load (prctl() syscall), the
result won't change for the rest of the execution, so inlining should be
more efficient here, and so should a static bool, no?

Note that this is not about TSC consistency, as we discussed the other
day, but rather only about checking _if_ the TSC is available.
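
A minimal sketch of such an availability check, assuming Linux on x86: prctl(PR_GET_TSC) reports whether the calling process is allowed to execute RDTSC (PR_TSC_ENABLE) or would get a SIGSEGV instead (PR_TSC_SIGSEGV). The wrapper name tsc_read_allowed is hypothetical. This is a runtime check, so it cannot produce a compile-time define, and it says nothing about consistency across cores.

```c
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Hypothetical helper, intended to be called once at library load.
 * Returns  1 if the process may execute RDTSC,
 *          0 if RDTSC would raise SIGSEGV,
 *         -1 if the query is not supported (non-x86, old kernel,
 *            or not Linux). */
int tsc_read_allowed(void)
{
#if defined(__linux__) && defined(PR_GET_TSC)
    int mode = 0;

    /* The kernel writes PR_TSC_ENABLE or PR_TSC_SIGSEGV into mode. */
    if (prctl(PR_GET_TSC, &mode, 0, 0, 0) != 0)
        return -1;
    return mode == PR_TSC_ENABLE;
#else
    return -1;
#endif
}
```

The result could be stored in a global at load time and tested on each clock read.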

Thanks
David

> For the cases where we need more kernel support (due to non-consistent
> TSCs across cores), you might also want to export the linux sequence
> lock: include/linux/seqlock.h into user-space (we only need the read
> seqlock part, with smp_mb() mapped to the urcu memory barriers) and
> figure out the overhead of this sequence lock. This will be needed to
> ensure consistency of the data structures that will be needed to support
> the vDSO when the "consistent tsc" dynamic check fails.
>
> Thanks,
>
> Mathieu
>
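To get a feel for the read-side cost, here is a rough userspace sketch of a sequence lock. It is not a port of include/linux/seqlock.h and it does not use the urcu barriers mentioned above: C11 atomics and fences are swapped in so the example stays self-contained. The structure, its fields and the function names are all hypothetical, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical clock parameters protected by a sequence counter.
 * seq is even when the data is stable, odd while a write is in
 * progress. Fields are atomic (accessed relaxed) so that racing
 * reads are well defined in C11. */
struct seq_clock {
    atomic_uint seq;
    _Atomic uint64_t base_ns;
    _Atomic uint64_t last_tsc;
};

/* Writer side. Assumes a single writer at a time. */
void seq_write(struct seq_clock *c, uint64_t base_ns, uint64_t last_tsc)
{
    unsigned s = atomic_load_explicit(&c->seq, memory_order_relaxed);

    atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release); /* odd seq before data */
    atomic_store_explicit(&c->base_ns, base_ns, memory_order_relaxed);
    atomic_store_explicit(&c->last_tsc, last_tsc, memory_order_relaxed);
    atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/* Reader side: retry until the same even sequence number is seen
 * before and after copying the data. Never blocks the writer. */
void seq_read(struct seq_clock *c, uint64_t *base_ns, uint64_t *last_tsc)
{
    unsigned s0, s1;

    do {
        s0 = atomic_load_explicit(&c->seq, memory_order_acquire);
        *base_ns = atomic_load_explicit(&c->base_ns, memory_order_relaxed);
        *last_tsc = atomic_load_explicit(&c->last_tsc, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire); /* data before recheck */
        s1 = atomic_load_explicit(&c->seq, memory_order_relaxed);
    } while ((s0 & 1) || s0 != s1);
}
```

Timing seq_read() in a loop with no concurrent writer gives a lower bound on the overhead added to each clock read when the "consistent tsc" check fails.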
>>
>> David
>>
>>> Then we should probably start looking at a simple choosing mechanism,
>>> probably a function pointer?
>>>
>>> /Nils
>>> On Jul 6, 2010, at 8:12 PM, David Goulet wrote:
>>>
>>>> Hey,
>>>>
>>>> After some talks with Nils from Ericsson, there were some questions
>>>> about using the TSC counter and not clock_gettime in include/ust/clock.h
>>>>
>>>> I ran some tests after the meeting and was quite surprised by the
>>>> overhead of clock_gettime.
>>>>
>>>> On an average run ...
>>>> WITH clock_gettime : ~ 266ns per event
>>>> WITH rdtsc instruction : ~ 235ns per event
>>>>
>>>> And it is systematic... I'm getting stable results with rdtsc, with a
>>>> standard deviation of ~2ns.
>>>>
>>>> As little as I know about the TSC, one thing is for sure: with SMP, it
>>>> becomes much more "fragile" to rely on it, because we have no assurance
>>>> of coherent counters between CPUs, and the CPU frequency scaling policy
>>>> also comes into play (ondemand is the default on Ubuntu now). New CPUs
>>>> support the constant_tsc and nonstop_tsc flags, but still only a small
>>>> range of them.
>>>>
>>>> Right now, UST is forcing the use of clock_gettime even if i386 or
>>>> x86_64 is used.
>>>> Should a change be considered?
>>>>
>>>> Thanks
>>>> David
>>>>
>>>> _______________________________________________
>>>> ltt-dev mailing list
>>>> ltt-dev at lists.casi.polymtl.ca
>>>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
>>>
>>
>

-- 
David Goulet
LTTng project, DORSAL Lab.

PGP/GPG : 1024D/16BD8563
BE3C 672B 9331 9796 291A  14C6 4AF7 C14B 16BD 8563



