[ltt-dev] feedback on lost events patch ARM linux - 2.6.21 , hardwaretimer integration

Venu A venu_a at hyd.hellosoft.com
Wed Jan 7 12:41:05 EST 2009


Dear Mathieu

Thank you for the clarity in your reply.

Quick question before I move on to other things:
I see an application thread calling gettimeofday in a loop, though there is
no explicit code doing that.
How do I map a system call address in the trace to a function in either the
kernel or the user application? I could not find the address anywhere in the
kernel's System.map or in the application's map file (which I obtain with
#nm <executable_name>). Am I missing something?

Please find the cumulative patch for the ARM jiffies TSC, with a Kconfig
option, at the end of this mail and also as an attachment. It builds on
Richard's patch for ARM, using the jiffies TSC as the timestamp instead of a
logical clock.

After much thought, I decided to use the jiffies hardware timer for
timestamps. It runs on a 1MHz clock and thus offers 1us precision at best.
Since the timer register is easily relatable to time (as opposed to CPU
cycles), I read the timer register in ltt_get_timestamp32() instead of
calling get_cycles().

With a logical clock, some of the problems I faced were:
 - There is a jump of 10ms between entry and exit of the timer IRQ, because
jiffies is incremented in the timer handler and the LTT logical clock is
updated within the handler. As a result, the duration for which the timer
IRQ is serviced is incorrectly shown as 10ms in LTTv. Since the timer IRQ
can occur in any thread, different threads in LTTv randomly gain 10ms of CPU
time, distorting the CPU occupancy of each thread.
 - When analyzing the timing of periodic events, there is 10ms of jitter,
because sometimes an event occurs just before the timer IRQ and sometimes
just after. The event timing therefore suffers a resolution loss of
+/-10ms, which is quite a big loss of time.

My point is not to show the inadequacies of the logical clock (in fact it
ensures great portability and preserves event order on SMP), but that for
ARM processors and other single-CPU platforms, the jiffies timer may be
simpler and avoids some of the issues mentioned above. Reading the hardware
timer register does not consume many cycles, and no locking is necessary on
single-CPU systems. Also, since the jiffies timer is being used, there is no
need to allocate another hardware timer. A separate hardware timer can be
justified if it has better precision than 1us. Any thoughts on this are
welcome.

As an extension of the above for SMP systems: event ordering within 1us can
suffer in my approach. To preserve event ordering, we can combine the
jiffies timer and the logical clock. The timestamp would be incremented by
1ns for every event that occurs within the same 1us time window. Since the
ns field is always zero at 1us timer resolution, we can theoretically order
1000 events within 1us.

Pseudo-code:
==========
In initialization:

static u32 logical_incr = 0;   /* per-window ordering counter */
static u32 last_read_time = 0; /* last timer value read, in us */

/*** please use locking/atomics appropriately for your platform ***/
u32 ltt_get_timestamp32(void)
{
	u32 new_time = read_time_in_us_from_timer_register();

	/* atomically compare against the previous read */
	if (new_time == last_read_time)
		logical_incr += 1;   /* another event in the same 1us window */
	else
		logical_incr = 0;    /* new window: reset the ordering counter */

	/* atomically exchange */
	last_read_time = new_time;

	if (logical_incr >= 1000)
		logical_incr = 999;  /* at most 1000 ordered events per us */

	/* timestamp in ns: 1us ticks scaled up, plus the sub-us ordering */
	return (new_time * 1000 + logical_incr);
}
PS: I don't work with SMP systems, so I would appreciate it if someone could
take this forward.

regards
Venu Annamraju
Team Lead, VoIP-DSP
HelloSoft India

===============================================================================
--- linux.orig/include/asm/ltt.h 2008-11-03 14:30:24.000000000 +0530
+++ include/asm/ltt.h 2009-01-07 21:47:36.000000000 +0530
@@ -15,68 +15,98 @@
 #define LTT_ARCH_TYPE LTT_ARCH_TYPE_ARM
 #define LTT_ARCH_VARIANT LTT_ARCH_VARIANT_NONE

-#undef LTT_HAS_TSC
+#ifdef CONFIG_LTT_USE_JIFFIES_TSC
+   #include <asm/io.h>
+   #include <asm/arch/io.h>
+   /* static inline void __iomem *__io(unsigned long addr) { return (void __iomem *)addr; } */
+   /***************** Platform specific stuff -- modify for your platform if using the jiffies timer register *********************/
+   #include <asm/arch/hardware.h>
+   /* #define IO_ADDRESS(x)           (((x) & 0x0fffffff) + (((x) >> 4) & 0x0f000000) + 0xf0000000) */
+   #include  <../arch/arm/mach-hs1xx/core.h>
+   /* #define __io_address(n)         __io(IO_ADDRESS(n)) */
+   #include <asm/arch/platform.h>
+   /* #define HS1XX_TMR_BASE (0x70002000)
+   #define HS1XX_VIC_BASE       (0xfefff000)
+   #define INT_TMR0     (1)      // Interrupt bit position of Timer 0 in Interrupt RAW Status Register
+   #define IRQMASK_TMR0         (1 << INT_TMR0)  // IRQ interrupt mask for Timer 0 */
+   #include <asm/hardware/arm_timer.h>
+   /* #define TIMER_VALUE         (0x04)         // timer register offset from timer base */
+   #include <asm/hardware/vic.h>
+   /* #define VIC_RAW_STATUS (0x08) // Offset of Raw Interrupt Status Register from VIC_BASE of Vectored Interrupt Controller */
+   #define VIC_BASE       (0xfefff000)
+   /******************* End of Platform specific stuff ********************************************************************************/
+   #define JIFFIES_VA_BASE (__io_address(HS1XX_TMR_BASE))
+   #define VADDR_VIC_BASE       ( __io(VIC_BASE))
+   #define TICKS_PER_uS         (1) // Timer ticks in 1 microsecond: 1 for a 1MHz clock with no division. May be higher for higher-precision timers.
+   #define TMR_INTERVAL  (TICKS_PER_uS * 10000) // jiffies timer fires every 10ms, so the timer interval in ticks is 10000.
+#endif
+
+u64 ltt_heartbeat_read_synthetic_tsc(void);
+extern atomic_t lttng_clock;

-#define LTTNG_LOGICAL_SHIFT 13
+#undef LTT_HAS_TSC

-extern atomic_t lttng_logical_clock;
+#define LTT_CLOCK_SHIFT 13

 static inline u32 ltt_get_timestamp32(void)
 {
- unsigned long seq;
- unsigned long try = 5;
- u32 ret;
-
- do {
-  seq = read_seqbegin(&xtime_lock);
-  ret = (jiffies << LTTNG_LOGICAL_SHIFT)
-   | (atomic_add_return(1, &lttng_logical_clock));
- } while (read_seqretry(&xtime_lock, seq) && (--try) > 0);
-
- if (try == 0)
-  return 0;
- else
-  return ret;
+#ifdef CONFIG_LTT_USE_JIFFIES_TSC
+    int tmp_clk, status, ovr = 0;
+    /* This interrupt check helps provide a smooth, continuous timer out of
+       the jiffies timer. If the timer interrupt is pending because we are in
+       an NMI or at the entry of the timer IRQ handler, the timer register
+       will have wrapped around, and a timestamp taken at that instant would
+       show negative time. This check adds a jiffy if the timer interrupt is
+       pending when ltt_get_timestamp32() is called. */
+    status = __raw_readl(VADDR_VIC_BASE + VIC_RAW_STATUS);
+    if (status & IRQMASK_TMR0)
+    {
+       ovr = 1;
+    }
+    tmp_clk = atomic_read(&lttng_clock);
+    return (tmp_clk + TICKS_PER_uS*(((TMR_INTERVAL)<<ovr) - (readl(JIFFIES_VA_BASE + TIMER_VALUE))));
+#else
+ return atomic_add_return(1, &lttng_clock);
+#endif
 }

-
 /* The shift overflow doesn't matter */
 static inline u64 ltt_get_timestamp64(void)
 {
- unsigned long seq;
- unsigned long try = 5;
- u64 ret;
-
- do {
-  seq = read_seqbegin(&xtime_lock);
-  ret = (jiffies_64 << LTTNG_LOGICAL_SHIFT)
-   | (atomic_add_return(1, &lttng_logical_clock));
- } while (read_seqretry(&xtime_lock, seq) && (--try) > 0);
-
- if (try == 0)
-  return 0;
- else
-  return ret;
+ return ltt_heartbeat_read_synthetic_tsc();
 }

-/* this has to be called with the write seqlock held */
-static inline void ltt_reset_timestamp(void)
+static inline void ltt_inc_timestamp(void)
 {
- atomic_set(&lttng_logical_clock, 0);
+ int old_clock, new_clock;
+
+#ifdef  CONFIG_LTT_USE_JIFFIES_TSC
+ do {
+ old_clock = atomic_read(&lttng_clock);
+ new_clock = (old_clock + TICKS_PER_uS*(((TMR_INTERVAL)<<1) - (readl(JIFFIES_VA_BASE + TIMER_VALUE))));
+ } while (atomic_cmpxchg(&lttng_clock, old_clock, new_clock) != old_clock);
+#else
+ do {
+ old_clock = atomic_read(&lttng_clock);
+ new_clock = (old_clock + (1 << LTT_CLOCK_SHIFT)) & (~((1 << LTT_CLOCK_SHIFT) - 1));
+ } while (atomic_cmpxchg(&lttng_clock, old_clock, new_clock) != old_clock);
+#endif
 }


 static inline unsigned int ltt_frequency(void)
 {
-  return HZ << LTTNG_LOGICAL_SHIFT;
+#ifdef  CONFIG_LTT_USE_JIFFIES_TSC
+   return HZ*TMR_INTERVAL*TICKS_PER_uS; // number of timer ticks in a second
+#else
+#else
+ return HZ << LTT_CLOCK_SHIFT;
+#endif
 }

-
 static inline u32 ltt_freq_scale(void)
 {
   return 1;
 }

-
-
 #endif
--- linux.orig/arch/arm/kernel/time.c 2008-11-03 14:30:24.000000000 +0530
+++ arch/arm/kernel/time.c 2009-01-06 15:06:29.000000000 +0530
@@ -336,10 +336,10 @@
  profile_tick(CPU_PROFILING);
  do_leds();
  do_set_rtc();
- do_timer(1);
 #ifdef CONFIG_LTT
- ltt_reset_timestamp();
+ ltt_inc_timestamp();
 #endif
+ do_timer(1);
 #ifndef CONFIG_SMP
  update_process_times(user_mode(get_irq_regs()));
 #endif
--- linux.orig/ltt/ltt-heartbeat.c 2008-11-03 14:30:24.000000000 +0530
+++ ltt/ltt-heartbeat.c 2008-12-22 10:59:25.000000000 +0530
@@ -78,7 +78,7 @@

  preempt_disable();
  cpu_synth = &synthetic_tsc[smp_processor_id()];
- tsc = (u32)get_cycles(); /* We deal with a 32 LSB TSC */
+        tsc = ltt_get_timestamp32(); /* We deal with a 32 LSB TSC */

  if (tsc < cpu_synth->tsc[cpu_synth->index][1]) {
   unsigned int new_index = cpu_synth->index ? 0 : 1; /* 0 <-> 1 */
@@ -109,7 +109,7 @@
  preempt_disable();
  cpu_synth = &synthetic_tsc[smp_processor_id()];
  index = cpu_synth->index; /* atomic read */
- tsc = (u32)get_cycles(); /* We deal with a 32 LSB TSC */
+        tsc = ltt_get_timestamp32(); /* We deal with a 32 LSB TSC */

  if (tsc < cpu_synth->tsc[index][1]) {
   /* Overflow */
--- linux.orig/ltt/Kconfig 2008-11-03 15:14:44.000000000 +0530
+++ ltt/Kconfig 2009-01-06 14:51:54.000000000 +0530
@@ -4,6 +4,10 @@
  depends on MARKERS
  select LTT_HEARTBEAT if MIPS
  select LTT_SYNTHETIC_TSC if MIPS
+        select LTT_HEARTBEAT if ARM
+        select LTT_SYNTHETIC_TSC if ARM
+        select LTT_HEARTBEAT if PPC
+        select LTT_SYNTHETIC_TSC if PPC
  default n
  help
    It is possible for the kernel to log important events to a trace
@@ -80,6 +84,16 @@
    but it is more efficient on some architectures (especially 64 bits) to
    align data than to write it unaligned.

+config LTT_USE_JIFFIES_TSC
+        bool "Use the jiffies hardware timer - Warning: may crash or fail to compile unless integrated properly!"
+        depends on LTT
+        depends on ARM
+        default n
+        help
+          This option uses the jiffies hardware timer instead of the LTT
+          logical clock, so the timing is more accurate for analysis.
+          It needs access to the hardware timer register and the interrupt
+          registers.
+
 config LTT_HEARTBEAT
  bool "Activate Linux Trace Toolkit Heartbeat Timer"
  depends on LTT

===================================================================================


-regards
venu


----- Original Message ----- 
From: "Mathieu Desnoyers" <mathieu.desnoyers at polymtl.ca>
To: "Venu A" <venu_a at hyd.hellosoft.com>
Cc: <ltt-dev at lists.casi.polymtl.ca>; <rpurdie at rpsys.net>
Sent: Tuesday, January 06, 2009 3:09 AM
Subject: Re: feedback on lost events patch ARM linux - 2.6.21 ,
hardwaretimer integration


>* Venu A (venu_a at hyd.hellosoft.com) wrote:
>> Hello Mathieu / Richard
>> Apologies for the long mail...
>> Recently we integrated LTTng for ARM-linux 2.6.21 to isolate some thread
>> control issue during VoIP call.
>>
>> Some of my experiences on ARM+LTTng:
>> 1)Procedure
>> Being a newbie, had some teething problems in getting LTTng up and
>> running, the mailing list did help a lot though.
>> While the procedure for installing is pretty straight forward for x86,
>> same on ARM is different. I will share the procedure (once I get done)
>> for my fellow newbies working on ARM.
>>
>> 2) Lost events in Normal mode or Flight recorder mode
>> Lost events always give the jitters ...there is always a fear of going
>> off course because of loss of critical data.
>> There were significant lost events when using the lttd daemon and the
>> flight recorder mode on ARM. When using the daemon, lost events could be
>> explained due to a commit failure as explained in Mathieu's paper. But
>> lost events in flight recorder mode did not make sense. Same did not
>> happen in x86, making the issue more specific to ARM.
>> Luckily I chanced upon Richard Purdie's and Mathieu's conversation and
>> integrated the patch. After that no lost events!!! thanks for the patch
>> Richard/Mathieu
>>
>> 3) Timer IRQ Almost exactly 10ms was being spent in the timer IRQ every
>> single time. Other IRQs would exit in a few microseconds.
>> ------------------------------------------------------------------------------------------
>> kernel.irq_entry: 42949913.090079349
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, IRQ { 1, kernel }
>> kernel.irq_exit: 42949913.100001224
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, SYSCALL
>> kernel.soft_irq_entry: 42949913.100002444
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, SOFTIRQ { 1 }
>> kernel.timer_set: 42949913.100003665
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, SOFTIRQ { 24015, 0xbf0237e4, 3275424544 }
>> kernel.timer_set: 42949913.100004886
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, SOFTIRQ { 24015, 0xc02dc8fc, 0 }
>> kernel.soft_irq_exit: 42949913.100006106
>> (/home/voip/ltt/ltt_traces/flt_oncall/flight-cpu_0), 0, 0, UNNAMED,
>> UNBRANDED, 0, 0x0, SYSCALL { 1 }
>> --------------------------------------------------------------------------------------------
>> I incorrectly assumed the timing in nanoseconds was accurate since there
>> was no rounding. Only later I realized from the mailing-list that ARM
>> does not have TSC, hence a logical clock is being used. Also in the timer
>> IRQ handler, jiffies is updated, causing 10ms to be added at IRQ
>> exit. This really threw me off course for a couple of days.
>> After I integrated Richard's/ Mathieu's patch the kernel_update_time
>> started showing up in between the timer IRQ entry/exit, and I could
>> safely ignore this behavior.
>> ------------------------------------------------------------------------------------------
>> kernel.irq_entry: 8593.370026854
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, IRQ { 1, kernel }
>> kernel.update_time: 8593.370028075
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, IRQ { 4296320922, 13836, 260000000, 0, 0 }
>> kernel.irq_exit: 8593.380001220
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, SYSCALL
>> kernel.soft_irq_entry: 8593.380002440
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, SOFTIRQ { 1 }
>> kernel.timer_set: 8593.380003661
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, SOFTIRQ { 1353627, 0xc02dcbf4, 0 }
>> kernel.soft_irq_exit: 8593.380004882
>> (/home/voip/ltt/ltt_traces/lttver4/flt_oncall2/flight-cpu_0), 0, 0,
>> swapper, UNBRANDED, 0, 0x0, SYSCALL { 1 }
>> ------------------------------------------------------------------------------------------
>>
>> 4) FLIGHT RECORDER MODE
>> To minimize impact of LTTng profiling, I prefer using flight recorder
>> mode. I noticed that LTT daemon periodically dumps profiling data into a
>> file causing a burst of CPU processing in eventsd/0 process. This can
>> distort the picture of a real-time or a heavy load process. The downside
>> of using flight recorder mode is obvious ... only last few seconds of
>> data is stored (I still need to understand and use the hybrid mode). In
>> my case, with all the probes enabled, the default configuration would
>> store 10 seconds of data.
>>
>> 5) Text Dump
>> The text dump utility proved very useful for in-depth analysis... I felt
>> the viewer was somewhat hard to navigate. It would be helpful if
>> there was some documentation of the other plug-ins and their options
>>
>
> Hi Venu,
>
> Yes, indeed the fact that the logical clock does not follow time
> accurately should probably be told to the user as a warning in LTTV.
>
>> Some questions
>> -------------------- 
>> Hardware Timer Integration for ARM for more precision: (Our Soc has
>> free hardware timers available for use. )
>> Regarding timer integration:
>
> Yes, having platform-specific support for this on architectures which
> has free timers is definitely the way to go.
>
>> 1) For ARM, can the hardware timer be integrated in
>> ltt_get_timestamp32() following powerpc example or in get_cycles() as
>> in mips ?
>
> Yes. Just make sure this timer increments (and does not decrement). You
> may need to do 1 - value to make it go upward.
>
>> 2) Is there any penalty incurred in reading off a timer register in
>> ltt_get_timestamp32(). if so any ways to minimize ? Is any locking
>> required?
>
> It's pretty cheap, I would not worry about it. It only disables
> preemption and uses an RCU-like mechanism which is very efficient and
> has good reentrancy WRT interrupts, nmis, ...
>
>> 3) Is it ok to reset the dedicated hardware timer at boot up and let
>> it overflow, assuming the heartbeat event and synthetic TSC are used.
>> Does the overflow period have any bearing other than the heartbeat
>> event period ?
>
> In the (old) LTTng version you are using, the overflow has effect on
> both the heartbeat timer and on the synthetic tsc 32->64 bits. On newer
> LTTng, the heartbeat is not required anymore.
>
> Therefore, both timers must happen at least once every overflow period
> while tracing is active.
>
>> Regarding timer resolution:
>> 1) In mathieu's paper, for x86 "the time spent in the two TSC reads
>> varies between 97 and 105 cycles, with an average of 100.0 cycles".
>> Would this overhead also apply to ARM for directly reading the timer
>> register.
>
> No, it's platform-dependent. And LTTng changed a lot since these timings
> have been taken.
>
>> Also for x86, the average time spent in the probe is 288.5 cycles. Can
>> this be used as a ballpark figure for ARM.
>
> I don't think so. Very platform-dependent.
>
>> 2)I feel if profiling event takes x cycles, then the resolution of
>> timer can be set to 2x or 3x. Is it reasonable to assume that the
>> granularity of profiling is limited by the overhead of profiling event
>> itself ?
>
> You can dig back in an LKML thread for a discussion between Linus and me
> about this.
>
> http://lkml.org/lkml/2008/9/23/186
>
>> 3) At 300 MHz for a 32 bit timer:
>> a) with no division factor per cycle - Timer will overflow in 14.3
>> seconds, resolution is 3.3ns
>> b) for a division factor of 256 the timer will overflow in 3660 sec or
>> 1 hour, resolution is 844ns or close to 1 microsecond.
>> Which is more preferable ?
>
> Highest resolution (option a).
>
>> 4) Are there any open issues pending for ARM are during sleep mode and
>> frequency scaling modes ?
>
> Probably, I have not looked much into this area yet. Testing would be
> welcome, and there will probably be problems.
>
>> Really appreciate the work being done in the LTTng group.
>>
>
> You are very welcome. Thanks for the useful feedback. Hopefully we'll
> manage to integrate your work in the LTTng tree.
>
> Mathieu
>
>> CPU LOADING Suggestion
>> for plotting micro CPU loading on x86 and other platforms with dedicated
>> timers:
>>
>> Since the timer interrupt comes periodically every 10ms,  the time
>> difference between last_event before Timer_IRQ_entry and the previous
>> Timer_IRQ_exit can give the instantaneous loading.
>> Eg:
>> Timer_IRQ_entry @ T1
>> Timer_IRQ_exit @ T2
>> {
>> some processing
>> last_event @ T3
>> }
>> Timer_IRQ_entry @ T4
>>
>> micro processing load % = ((Last_event @ T3 - Timer_IRQ_exit @ T2) / 10)
>> * 100
>>
>> This data can be averaged over user-defined windows of 10ms, 100ms, 1 sec,
>> etc. to give an average trend of CPU loading. The CPU loading analyzer can
>> help identify regions in the trace where CPU loading is uneven. Any other
>> ideas on this regard ?
>>
>> regards
>> Venu Annamraju
>> Team Lead, VoIP-DSP
>> HelloSoft India
>>
>> The information contained in this e-mail message and/or attachments to it
>>   may
>> contain confidential or privileged information. If you are not the
>> intended
>>   recipient, any dissemination, use, review, distribution, printing or
>>   copying of the information contained in this e-mail message and/or
>>   attachments to it are strictly prohibited. If you have received this
>>   communication in error, please notify us by reply e-mail immediately
>> and
>>   permanently delete the message and any attachments.
>>
>
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE
> 9A68


-------------- next part --------------
A non-text attachment was scrubbed...
Name: arm_2_6_21_jiffies_tsc.patch
Type: application/octet-stream
Size: 7305 bytes
Desc: not available
URL: <http://lists.casi.polymtl.ca/pipermail/lttng-dev/attachments/20090107/1218fcee/attachment-0003.obj>

