[ltt-dev] cli/sti vs local_cmpxchg and local_add_return
Mathieu Desnoyers
compudj at krystal.dyndns.org
Wed Mar 18 11:10:24 EDT 2009
* Nick Piggin (nickpiggin at yahoo.com.au) wrote:
> On Wednesday 18 March 2009 02:14:37 Mathieu Desnoyers wrote:
> > * Nick Piggin (nickpiggin at yahoo.com.au) wrote:
> > > On Tuesday 17 March 2009 12:32:20 Mathieu Desnoyers wrote:
> > > > Hi,
> > > >
> > > > I am trying to get access to some non-x86 hardware to run some atomic
> > > > primitive benchmarks for a paper on LTTng I am preparing. That should
> > > > be useful to argue about performance benefit of per-cpu atomic
> > > > operations vs interrupt disabling. I would like to run the following
> > > > benchmark module on CONFIG_SMP :
> > > >
> > > > - PowerPC
> > > > - MIPS
> > > > - ia64
> > > > - alpha
> > > >
> > > > usage :
> > > > make
> > > > insmod test-cmpxchg-nolock.ko
> > > > insmod: error inserting 'test-cmpxchg-nolock.ko': -1 Resource temporarily unavailable
> > > > dmesg (see dmesg output)
> > > >
> > > > If some of you would be kind enough to run my test module provided
> > > > below and provide the results of these tests on a recent kernel
> > > > (2.6.26~2.6.29 should be good) along with their cpuinfo, I would
> > > > greatly appreciate it.
> > > >
> > > > Here are the CAS results for various Intel-based architectures :
> > > >
> > > > Architecture        | Speedup                     |     CAS      |        Interrupts
> > > >                     | (cli + sti) / local cmpxchg | local | sync | Enable (sti) | Disable (cli)
> > > > --------------------|-----------------------------|-------|------|--------------|--------------
> > > > Intel Pentium 4     | 5.24                        |  25   |  81  |      70      |      61
> > > > AMD Athlon(tm)64 X2 | 4.57                        |   7   |  17  |      17      |      15
> > > > Intel Core2         | 6.33                        |   6   |  30  |      20      |      18
> > > > Intel Xeon E5405    | 5.25                        |   8   |  24  |      20      |      22
> > > >
> > > > The benefit expected on PowerPC, ia64 and alpha should principally come
> > > > from removed memory barriers in the local primitives.
> > >
> > > Benefit versus what? I think all of those architectures can do SMP
> > > atomic compare exchange sequences without barriers, can't they?
> >
> > Hi Nick,
> >
> > I want to determine whether it is faster to synchronize the tracing
> > hot path against interrupts with an SMP CAS without barriers, or to
> > simply disable interrupts. That decision will depend on the benchmark
> > I propose, because it compares the time each approach takes.
> >
> > Overall, the benchmarks will let me choose between these two
> > simplified hotpath pseudo-codes (offset is global to the buffer,
> > commit_count is per-subbuffer).
> >
> >
> > * lockless :
> >
> > do {
> > old_offset = local_read(&offset);
> > get_cycles();
> > compute needed size.
> > new_offset = old_offset + size;
> > } while (local_cmpxchg(&offset, old_offset, new_offset) != old_offset);
> >
> > /*
> > * note : writing to buffer is done out-of-order wrt buffer slot
> > * physical order.
> > */
> > write_to_buffer(offset);
> >
> > /*
> > * Make sure the data is written in the buffer before commit count is
> > * incremented.
> > */
> > smp_wmb();
> >
> > /* note : incrementing the commit count is also done out-of-order */
> > count = local_add_return(size, &commit_count[subbuf_index]);
> > if (count is filling a subbuffer)
> > allow to wake up readers
>
> Ah OK, so you just mean the benefit of using local atomics is avoiding
> the barriers that you get with atomic_t.
>
> I'd thought you were referring to some benefit over irq disable pattern.
>
On powerpc and mips, for instance, yes, the gain is just the removed
barriers. On x86 it becomes more interesting because we can drop the
lock prefix, which gives a good speedup. All I want to do here is to
figure out which of barrier-less local_t ops vs. disabling interrupts is
faster (and how much faster/slower) on various architectures.
For instance, on an architecture like powerpc64 (tests provided by Paul
McKenney), there is a difference of less than 4 cycles between irq
off/on (14-16 cycles, and this is without doing the data access) and
doing both local_cmpxchg and local_add_return (18 cycles). So given that
we might have tracepoints called from NMI context, the tiny performance
impact of the local_t ops does not counterbalance the benefit of
having a lockless, NMI-safe trace buffer management algorithm.
Thanks,
Mathieu
>
> > * irq off :
> >
> > (note : offset and commit count would each be written to atomically
> > (type unsigned long))
> >
> > local_irq_save(flags);
> >
> > get_cycles();
> > compute needed size;
> > offset += size;
> >
> > write_to_buffer(offset);
> >
> > /*
> > * Make sure the data is written in the buffer before commit count is
> > * incremented.
> > */
> > smp_wmb();
> >
> > count = commit_count[subbuf_index] += size;
> > if (count is filling a subbuffer)
> > allow to wake up readers
> >
> > local_irq_restore(flags);
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68