[lttng-dev] Xeon Phi memory barriers

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Sat Dec 7 00:58:54 EST 2013


----- Original Message -----
> From: "Paul E. McKenney" <paulmck at linux.vnet.ibm.com>
> To: "Mathieu Desnoyers" <mathieu.desnoyers at efficios.com>
> Cc: "Simon Marchi" <simon.marchi at polymtl.ca>, lttng-dev at lists.lttng.org
> Sent: Friday, December 6, 2013 10:40:45 PM
> Subject: Re: [lttng-dev] Xeon Phi memory barriers
> 
> On Fri, Dec 06, 2013 at 08:15:38PM +0000, Mathieu Desnoyers wrote:
> > ----- Original Message -----
> > > From: "Simon Marchi" <simon.marchi at polymtl.ca>
> > > To: lttng-dev at lists.lttng.org
> > > Sent: Tuesday, November 19, 2013 4:26:06 PM
> > > Subject: [lttng-dev] Xeon Phi memory barriers
> > > 
> > > Hello there,
> > 
> > Hi Simon,
> > 
> > While reading this reply, please keep in mind that I'm coming off a
> > full week of meetings, and it's late on Friday evening here. So YMMV ;-)
> > I'm CCing Paul E. McKenney, so he can debunk my answer :)
> > 
> > > 
> > > liburcu does not build on the Intel Xeon Phi, because the chip is
> > > recognized as x86_64, but lacks the {s,l,m}fence instructions found on
> > > usual x86_64 processors. The following is taken from the Xeon Phi dev
> > > guide:
> > 
> > Let's have a look:
> > 
> > > 
> > > The Intel® Xeon Phi™ coprocessor memory model is the same as that of
> > > the Intel® Pentium processor. The reads and writes always appear in
> > > programmed order at the system bus (or the ring interconnect in the
> > > case of the Intel® Xeon Phi™ coprocessor); the exception being that
> > > read misses are permitted to go ahead of buffered writes on the system
> > > bus when all the buffered writes are cached hits and are, therefore,
> > > not directed to the same address being accessed by the read miss.
> > 
> > OK, so reads can be reordered with respect to following writes.
> 
> That would be -preceding- writes, correct?

Oh, yes, I got it reversed.
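
To make the reordering concrete, here is the classic store-buffering
litmus test (my own illustration, not from the Xeon Phi guide). Each
CPU writes one variable, then reads the other; if a read miss may pass
a preceding buffered write, both reads can observe zero in the same run:

  int x = 0, y = 0;
  int r0, r1;

  void cpu0(void)
  {
          x = 1;          /* buffered write */
          /* without smp_mb() here, the read below may pass it */
          r0 = y;
  }

  void cpu1(void)
  {
          y = 1;
          /* without smp_mb() here, the read below may pass it */
          r1 = x;
  }

  /* Outcome r0 == 0 && r1 == 0 is possible without the barriers. */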

> 
> > > As a consequence of its stricter memory ordering model, the Intel®
> > > Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
> > > instructions that provide a more efficient way of controlling memory
> > > ordering on other Intel processors.
> > 
> > I guess sfence and lfence are indeed completely useless, because we can
> > only ever care about ordering reads vs writes (mfence). But even mfence
> > is not there.
> 
> The usual approach is an atomic operation to a dummy location on the
> stack.  Is that the recommendation for Xeon Phi?

Yes, see below,

> 
> Either way, what should userspace RCU do to detect that it is being built
> on a Xeon Phi?  I am sure that Mathieu would welcome the relevant patches
> for this.  ;-)
> 
> > > While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
> > > program order on the system bus,
> > 
> > This part of the sentence seems misleading to me. Didn't the first
> > sentence state the opposite ? "the exception being that
> > read misses are permitted to go ahead of buffered writes on the system
> > bus when all the buffered writes are cached hits and are, therefore,
> > not directed to the same address being accessed by the read miss."
> > 
> > I'm probably missing something.
> 
> The trick might be that read misses are only allowed to pass write
> -hits-, which would mean that the system bus would have already seen
> the invalidate corresponding to the delayed write, and thus would
> have no evidence of any misordering.
> 
> > > the compiler can still reorder
> > > unrelated memory operations while maintaining program order on a
> > > single Intel® Xeon Phi™ coprocessor (hardware thread). If software
> > > running on an Intel® Xeon Phi™ coprocessor is dependent on the order
> > > of memory operations on another Intel® Xeon Phi™ coprocessor then a
> > > serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
> > > between the memory operations is required to guarantee completion of
> > > all memory accesses issued prior to the serializing instruction before
> > > any subsequent memory operations are started.
> 
> OK, sounds like my guess of atomic instruction to dummy stack location
> is correct, or perhaps carrying out a nearby assignment using an
> xchg instruction.

Yes, or a CPUID instruction seems OK too. We already use lock; addl on the
stack in URCU for cases where fence instructions may not be available (x86-32).
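
For reference, that fallback boils down to this (a sketch modeled on
urcu/arch/x86.h; exact macro names may differ). The lock prefix turns
the no-op add of zero to the word at the top of the stack into a
serializing memory operation, and the "memory" clobber makes it a
compiler barrier as well:

  /* Full barrier without fence instructions: a lock-prefixed
   * read-modify-write of the word at the top of the stack.
   * Adding zero leaves memory unchanged; the lock prefix
   * provides the ordering. */
  #define smp_mb()  \
          __asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory")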

> 
> > > (end of quote)
> > > 
> > > From what I understand, it is safe to leave out any run-time memory
> > > barriers, but we still need barriers that prevent the compiler from
> > > reordering (using __asm__ __volatile__ ("":::"memory")). In
> > > urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
> > > memory barriers result in both compile-time and run-time memory
> > > barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
> > > I guess this would work for the Phi, but the lock instruction does not
> > > seem necessary.
> > 
> > Actually, either a cpuid (core serializing) instruction or a lock-prefixed
> > instruction (which serializes memory accesses as a side-effect) seems
> > required.
> 
> It would certainly be safe.  One approach would be to keep it that way
> unless/until someone showed it to be unnecessary.
> 
> > > So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
> > > for the Phi and go on with our lives, or should we add a specific
> > > config for this case?
> > 
> > I _think_ we could get away with this mapping:
> > 
> > smp_wmb() -> barrier()
> >   reasoning: write vs write are not reordered by the processor.
> > 
> > smp_rmb() -> barrier()
> >   reasoning: read vs read not reordered by processor.
> > 
> > smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
> >    or a cpuid instruction
> >   reasoning: the cpu can reorder reads ahead of earlier writes.
> > 
> > smp_read_barrier_depends() -> nothing at all (not needed at any level).
> 
> This should be safe, though I would argue for do { } while (0) for
> smp_read_barrier_depends().

Indeed.
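
So, spelled out as a sketch (illustrative macro names; liburcu's actual
macros carry the cmm_ prefix, and on a 64-bit target such as k1om the
stack reference would be 0(%%rsp) rather than 0(%%esp)):

  /* Xeon Phi mapping (sketch, not the actual liburcu header).
   * Stores are not reordered with stores, loads not with loads,
   * so wmb/rmb reduce to compiler barriers; only the full
   * barrier needs a serializing instruction. */
  #define barrier()   __asm__ __volatile__ ("" ::: "memory")
  #define smp_wmb()   barrier()
  #define smp_rmb()   barrier()
  #define smp_mb()    \
          __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory")
  #define smp_read_barrier_depends()  do { } while (0)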

> 
> > Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe instead
> > of defining a compile option specifically for Xeon Phi, we could instead
> > define an x86-tso.h header variant in userspace RCU and use it on all Intel
> > processors that map to TSO (hint: the very vast majority). The only
> > exceptions seem to be the Pentium Pro (needing smp_rmb() -> lfence) and
> > some WinChip processors which could reorder stores (thus needing
> > smp_wmb() -> sfence).
> > 
> > Thoughts ?
> 
> As long as there is some reasonable way of detecting them.

The issue here is that I don't see any easy way to detect PPro and WinChip. AFAIU
it needs to be done dynamically (e.g. by reading /proc/cpuinfo), and this would
require code patching. We unfortunately don't have the infrastructure for this yet.
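
To illustrate what the dynamic check could look like (a rough sketch
using GCC's cpuid.h helper rather than parsing /proc/cpuinfo; the
actual vendor/family/model matching for PPro and WinChip, and the code
patching it would have to drive, are the hard parts I'm waving away):

  #include <cpuid.h>

  /* Return nonzero if we must assume the stronger barrier mapping. */
  static int cpu_needs_fence_barriers(void)
  {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
                  return 1;    /* no cpuid: be conservative */
          /* Placeholder: decode family/model from eax and match
           * Pentium Pro and WinChip parts here. */
          return ((eax >> 8) & 0xf) == 6;   /* illustrative only */
  }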

> 
> Actually, why not use the locked add of zero for all x86 systems for
> smp_mb()?

I suspect that on NUMA x86-64 systems, using a locked add might have a
more severe performance impact than mfence. Also, AFAIU, when compiling for
Xeon Phi, the compiler targets a specific sub-architecture:

  "x$host_vendor" == "xk1om"

So we might not need to find the lowest common denominator between x86-64
(generic) and Xeon Phi at all, since the Phi has its own instruction set.
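
And if the k1om compilers advertise the target with a predefined macro
(I would expect something like __MIC__ or __k1om__, worth verifying),
the arch header could simply branch at compile time, along these lines:

  /* Hypothetical dispatch in urcu/arch/x86.h (sketch only). */
  #if defined(__MIC__) || defined(__k1om__)
  /* Xeon Phi: no fence instructions, use a locked no-op add. */
  #define smp_mb()  \
          __asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory")
  #else
  #define smp_mb()  __asm__ __volatile__ ("mfence" ::: "memory")
  #endif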

Thoughts ?

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > Simon
> > > 
> > > _______________________________________________
> > > lttng-dev mailing list
> > > lttng-dev at lists.lttng.org
> > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > 
> > 
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> > 
> 
> 

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


