[ltt-dev] OMAP3/4 trace clock and lockdep tracing

Thu Apr 7 08:57:01 EDT 2011

2011/4/6 Mathieu Desnoyers <compudj at krystal.dyndns.org>:
> I haven't tested this on OMAP4 personally, so I'm really interested in
> your feedback. I did my development on OMAP3 UP, but I designed the
> trace clock to support SMP (but it's not tested by myself, others have
> reported success though).
Good to know that others have succeeded on an SMP system.

> You might want to try with and without:
>
> - power management suspend/resume
> - cpufreq changes
For initial debugging I could turn these off, but the use cases we need
to analyse must have them enabled, and must have a correct trace clock
during these activities.

> There is code in trace clock to support each of these. 10 ms offset
> between the cores is way too much, so I guess there is something wrong
> there. First identifying if it's suspend/resume or cpufreq that are
> causing the problems would help us there.
Yes, it must be something with suspend/resume not resyncing correctly.

> What do you mean by "enters/leaves WFI" ? (WFI TLA means ?)
WFI (Wait For Interrupt) is the last instruction the processor executes
before it actually sleeps or idles. It stalls on this instruction until
it is woken up again. Hence I placed the calls to save the clock and to
resync it before and after these points.
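To make the save/resync placement concrete, here is a minimal user-space model. It assumes the per-CPU cycle counter stops or resets while the core sits in WFI, and that an always-on 32 kHz counter keeps running and is used to measure the time spent asleep. The names (`struct cpu_trace_clock`, `trace_clock_save`, `trace_clock_resync`) and the 1 GHz rate are hypothetical; this is a sketch of the idea, not the LTTng OMAP3 code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed rates, for illustration only. */
#define CPU_HZ 1000000000ULL /* hypothetical per-CPU cycle counter rate */
#define REF_HZ 32768ULL      /* always-on 32 kHz reference */

/* Hypothetical per-CPU state: trace time = raw cycles + offset. */
struct cpu_trace_clock {
    uint64_t offset;
    uint64_t saved_time; /* trace time just before WFI */
    uint64_t saved_ref;  /* 32 kHz count just before WFI */
};

static uint64_t trace_time(const struct cpu_trace_clock *c, uint64_t raw)
{
    return raw + c->offset; /* unsigned wrap-around is intended */
}

/* Call right before WFI. */
static void trace_clock_save(struct cpu_trace_clock *c, uint64_t raw,
                             uint64_t ref)
{
    c->saved_time = trace_time(c, raw);
    c->saved_ref = ref;
}

/*
 * Call right after wake-up. The cycle counter may have stopped or reset,
 * so rebuild the offset from the time elapsed on the 32 kHz reference.
 * The multiplication only overflows after roughly 6.5 days of sleep.
 */
static void trace_clock_resync(struct cpu_trace_clock *c, uint64_t raw,
                               uint64_t ref)
{
    uint64_t slept = (ref - c->saved_ref) * CPU_HZ / REF_HZ;

    c->offset = c->saved_time + slept - raw;
}
```

For example, saving at raw cycle 5000 and waking one second (32768 reference ticks) later with the cycle counter reset to 0 yields a trace time of 1000005000 cycles. Each resync can only be as precise as one 32 kHz period (about 30.5 µs), so a missed save or resync on one core shows up as a persistent offset between cores.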

> Given what you say above, cpufreq running in performance mode should
> make it OK as far as cpufreq is concerned, although there might be some
> discrepancy at boot time if the cpufreq mode is only changed later.
I have found that the idle code sometimes enters a state clocked at
32 kHz, which would mess up the clock counter. This might be the cause
of the problem, but I have not had time to look into it in more detail.

> I suspect that the problem might come from that your OMAP4 architecture
> does not call the trace clock resync after power management resume, like
> OMAP3 is doing.
Note that I'm not running on OMAP4 but on another dual-core SMP ARMv7.
I thought I called resync in all the right places, but maybe not.

> You might also want to have a look at the Linaro 2.6.38
> git tree, which integrates LTTng with Linaro, which might have better
> support for OMAP4 suspend/resume.
I have not checked that yet; I need to take a closer look later on.

> I'm really interested in fixing this up, but I'll need your help for
> testing.
That's generous of you. I'm happy to do testing, but it might be
difficult since you don't have access to the actual architecture files,
and the issue is most likely in those. Could I get back to you on this
one? I would like to get good performance here, but I'm using code that
my employer has not yet released (we need to follow the process). For
now, since people were waiting on a solution, I decided to implement a
much simpler trace clock that is always in sync between cores and
doesn't need a resync at idle/wake-up, but still has decent resolution.
It looks like this:

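#include <linux/clocksource.h>
#include <linux/cnt32_to_63.h>
#include <asm/atomic.h>

/* clock32k and TRACE_CLOCK_32K_BITSHIFT come from the platform code. */
static atomic64_t last_clock_val;

inline u64 trace_clock_read64(void)
{
	u64 clock_val, ret, inc_clock_val;

	do {
		/* Extend the 32 kHz counter to 63 bits, then scale it up. */
		clock_val = cnt32_to_63(clock32k->read(clock32k))
				<< TRACE_CLOCK_32K_BITSHIFT;
		/* Speculatively bump the shared value (implies a memory barrier). */
		inc_clock_val = atomic64_add_return(1, &last_clock_val);
		/* If the 32 kHz clock is ahead, publish it; retry if we raced. */
	} while (clock_val > inc_clock_val &&
		 atomic64_cmpxchg(&last_clock_val, inc_clock_val, clock_val) !=
		 inc_clock_val);

	if (clock_val > inc_clock_val)
		ret = clock_val;	/* our cmpxchg published the 32 kHz value */
	else
		ret = inc_clock_val;	/* the speculative value is already ahead */
	return ret;
}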
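For anyone who wants to play with the algorithm outside the kernel, here is a user-space model using C11 atomics. It is a sketch under stated assumptions: the 32 kHz clocksource is replaced by a hypothetical `fake_32k_ticks` counter that is already 64 bits wide (so no `cnt32_to_63` step), and `TRACE_CLOCK_32K_BITSHIFT` is set to an example value of 10, since the real value isn't given here. `atomic_fetch_add(...) + 1` stands in for `atomic64_add_return`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Example value only; the real shift is defined elsewhere. */
#define TRACE_CLOCK_32K_BITSHIFT 10

/* Shared last-returned timestamp, as in the kernel version. */
static _Atomic uint64_t last_clock_val;

/* Hypothetical stand-in for the always-on 32 kHz counter. */
static _Atomic uint64_t fake_32k_ticks;

static uint64_t read_clock32k(void)
{
    return atomic_load(&fake_32k_ticks);
}

static uint64_t trace_clock_read64(void)
{
    for (;;) {
        uint64_t clock_val = read_clock32k() << TRACE_CLOCK_32K_BITSHIFT;
        /* fetch_add returns the old value; +1 mirrors atomic64_add_return. */
        uint64_t inc_clock_val = atomic_fetch_add(&last_clock_val, 1) + 1;

        if (clock_val <= inc_clock_val)
            return inc_clock_val; /* speculative counter is already ahead */

        /* The 32 kHz clock has ticked past us: try to publish its value. */
        uint64_t expected = inc_clock_val;
        if (atomic_compare_exchange_strong(&last_clock_val, &expected,
                                           clock_val))
            return clock_val;
        /* Another thread changed last_clock_val in between: retry. */
    }
}
```

Starting from zero, two calls return 1 and 2; after the fake 32 kHz counter advances to 1, the next call publishes 1 << 10 = 1024 and the one after returns 1025. Every returned value is the value `last_clock_val` held right after one specific atomic update, and that variable only ever increases, so timestamps are unique and increase in the order those updates happened, even across cores.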

It is the same concept as the generic jiffies + counter trace clock, but
it uses an always-on 32 kHz clock and no timers. It speculatively
increments the LSBs of the shared atomic clock variable and replaces
the variable with the left-shifted 32 kHz clock value once that clock
has ticked past it. In the worst case, if two threads go through each
line in lock-step, the while loop spins 1 << TRACE_CLOCK_32K_BITSHIFT
times (and the atomic operations themselves can also spin). We are
still looking into improving it, but it seems to work. ARM has faster
cache snooping than x86, so sharing one atomic variable between cores
is less of a problem. I'm also considering advancing the counter by the
cycle counter's progress instead of by 1, but that would mean
TRACE_CLOCK_32K_BITSHIFT needs to be larger, and that is problematic.
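A rough sketch of the arithmetic behind that trade-off, assuming a 1 GHz cycle counter and an example shift of 10 (neither value is given above): one trace-clock unit lasts 1 / (32768 * 2^shift) seconds, and advancing by elapsed cycles would need at least one unit per CPU cycle within a single 32 kHz tick. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define REF_HZ 32768.0 /* always-on 32 kHz clock */

/* Length of one trace-clock unit, in ns, for a given left shift. */
static double unit_ns(unsigned shift)
{
    return 1e9 / (REF_HZ * (double)(1ULL << shift));
}

/*
 * Smallest shift giving at least one unit per CPU cycle within one
 * 32 kHz tick, i.e. 2^shift >= ceil(cpu_hz / 32768).
 */
static unsigned min_shift_for_cycles(uint64_t cpu_hz)
{
    uint64_t cycles_per_tick = (cpu_hz + 32767) / 32768;
    unsigned shift = 0;

    while ((1ULL << shift) < cycles_per_tick)
        shift++;
    return shift;
}
```

With a shift of 10 one unit is about 29.8 ns, while a 1 GHz counter (about 30518 cycles per 32 kHz tick) would need a shift of 15. By the worst-case bound stated above, that also raises the possible spin count from 1024 to 32768 iterations.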

Regards,
 Harald