[lttng-dev] Xeon Phi memory barriers

Fri Dec 6 15:15:38 EST 2013

----- Original Message -----
> From: "Simon Marchi" <simon.marchi at polymtl.ca>
> To: lttng-dev at lists.lttng.org
> Sent: Tuesday, November 19, 2013 4:26:06 PM
> Subject: [lttng-dev] Xeon Phi memory barriers
> 
> Hello there,

Hi Simon,

While reading this reply, please keep in mind that I've just spent a
full week in meetings, and it's late on Friday evening here. So YMMV ;-)
I'm CCing Paul E. McKenney, so he can debunk my answer :)

> 
> liburcu does not build on the Intel Xeon Phi, because the chip is
> recognized as x86_64, but lacks the {s,l,m}fence instructions found on
> usual x86_64 processors. The following is taken from the Xeon Phi dev
> guide:

Let's have a look:

> 
> The Intel® Xeon Phi™ coprocessor memory model is the same as that of
> the Intel® Pentium processor. The reads and writes always appear in
> programmed order at the system bus (or the ring interconnect in the
> case of the Intel® Xeon Phi™ coprocessor); the exception being that
> read misses are permitted to go ahead of buffered writes on the system
> bus when all the buffered writes are cached hits and are, therefore,
> not directed to the same address being accessed by the read miss.

OK, so a read can be reordered ahead of earlier buffered writes (to
other addresses).
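
To make that concrete, here is a rough store-buffering litmus test. It
is only a sketch: the harness and every name in it (count_both_zero,
thread0, thread1) are made up for illustration, and it uses a portable
C11 fence as a stand-in for the full barrier. A compiler may lower that
fence to mfence, so this is meant for an ordinary x86 box, not for the
Phi as-is. Without the fence, both threads are allowed to observe 0;
with it, that outcome is forbidden.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical litmus harness; none of these names come from liburcu. */
static atomic_int x, y;      /* shared locations, reset to 0 each round */
static int r0, r1;           /* what each thread observed */
static int use_full_barrier; /* 1: fence between the store and the load */

static void *thread0(void *arg)
{
	(void)arg;
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	if (use_full_barrier)
		atomic_thread_fence(memory_order_seq_cst);
	r0 = atomic_load_explicit(&y, memory_order_relaxed);
	return NULL;
}

static void *thread1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	if (use_full_barrier)
		atomic_thread_fence(memory_order_seq_cst);
	r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return NULL;
}

/* Returns how many rounds ended with r0 == 0 && r1 == 0, or -1 on error. */
static int count_both_zero(int rounds, int full_barrier)
{
	int i, hits = 0;

	use_full_barrier = full_barrier;
	for (i = 0; i < rounds; i++) {
		pthread_t t0, t1;

		atomic_store(&x, 0);
		atomic_store(&y, 0);
		if (pthread_create(&t0, NULL, thread0, NULL))
			return -1;
		if (pthread_create(&t1, NULL, thread1, NULL)) {
			pthread_join(t0, NULL);
			return -1;
		}
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);
		if (r0 == 0 && r1 == 0)
			hits++;
	}
	return hits;
}
```

count_both_zero(rounds, 1) should always return 0. count_both_zero(rounds, 0)
may return a non-zero count, but spawning threads on every round makes
the race window small, so seeing 0 there proves nothing.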

> 
> As a consequence of its stricter memory ordering model, the Intel®
> Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
> instructions that provide a more efficient way of controlling memory
> ordering on other Intel processors.

I guess sfence and lfence are indeed completely useless, because the only
ordering we can ever care about is reads vs. writes (mfence). But even
mfence is not there.

> 
> While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
> program order on the system bus,

This part of the sentence seems misleading to me. Didn't the first
sentence state the opposite? "the exception being that
read misses are permitted to go ahead of buffered writes on the system
bus when all the buffered writes are cached hits and are, therefore,
not directed to the same address being accessed by the read miss."

I'm probably missing something.

> the compiler can still reorder
> unrelated memory operations while maintaining program order on a
> single Intel® Xeon Phi™ coprocessor (hardware thread). If software
> running on an Intel® Xeon Phi™ coprocessor is dependent on the order
> of memory operations on another Intel® Xeon Phi™ coprocessor then a
> serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
> between the memory operations is required to guarantee completion of
> all memory accesses issued prior to the serializing instruction before
> any subsequent memory operations are started.
> 
> (end of quote)
> 
> From what I understand, it is safe to leave out any run-time memory
> barriers, but we still need barriers that prevent the compiler from
> reordering (using __asm__ __volatile__ ("":::"memory")). In
> urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
> memory barriers result in both compile-time and run-time memory
> barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
> I guess this would work for the Phi, but the lock instruction does not
> seem necessary.

Actually, either a cpuid (core serializing) instruction or a lock-prefixed
instruction (which serializes memory accesses as a side effect) seems to
be required.
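
For reference, the lock-prefixed variant is the one already quoted above
from urcu/arch/x86.h. The cpuid variant could look roughly like the
following. full_barrier_cpuid() is a hypothetical name, not something
liburcu provides, and the non-x86 branch is only there so the sketch
builds elsewhere.

```c
/* Hypothetical helper: full barrier built on the serializing cpuid. */
static inline void full_barrier_cpuid(void)
{
#if defined(__x86_64__) || defined(__i386__)
	/* cpuid overwrites eax, ebx, ecx and edx, so all four are operands. */
	unsigned int eax = 0, ebx, ecx = 0, edx;

	__asm__ __volatile__ ("cpuid"
			      : "+a" (eax), "=b" (ebx), "+c" (ecx), "=d" (edx)
			      :
			      : "memory"); /* also a compiler barrier */
	(void) eax;
	(void) ebx;
	(void) ecx;
	(void) edx;
#else
	/* Assumption: GCC-style builtin as a stand-in on other targets. */
	__sync_synchronize();
#endif
}
```

I would expect cpuid to be noticeably more expensive than a locked add,
so the lock-prefixed form looks like the better default.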

> 
> So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
> for the Phi and go on with our lives, or should we add a specific
> config for this case?

I _think_ we could get away with this mapping:

smp_wmb() -> barrier()
  reasoning: writes are not reordered against other writes by the processor.

smp_rmb() -> barrier()
  reasoning: reads are not reordered against other reads by the processor.

smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
   or a cpuid instruction
  reasoning: the CPU can reorder reads ahead of earlier writes.

smp_read_barrier_depends() -> nothing at all (not needed at any level).
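
Spelled out as code, the mapping above might look like the sketch below.
The tso_ prefix, struct msg, publish() and try_consume() are all
hypothetical names chosen for illustration. I also use %rsp on 64-bit
builds where the existing x86.h line uses %esp, and the non-x86 branch
is only a fallback so the sketch builds anywhere.

```c
/* Compiler-only barrier: emits no instruction. */
#define tso_barrier()	__asm__ __volatile__ ("" ::: "memory")

/* Full barrier: lock-prefixed no-op add on the top of the stack. */
#if defined(__x86_64__)
#define tso_smp_mb() \
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
#elif defined(__i386__)
#define tso_smp_mb() \
	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#else
#define tso_smp_mb()	__sync_synchronize()	/* assumption: non-x86 fallback */
#endif

/* The CPU keeps store-store and load-load order, so only the compiler
 * has to be restrained. */
#define tso_smp_wmb()	tso_barrier()
#define tso_smp_rmb()	tso_barrier()
#define tso_smp_read_barrier_depends()	do { } while (0)

/* Illustrative message-passing pair built on the macros above. */
struct msg {
	int payload;
	int ready;
};

static void publish(struct msg *m, int value)
{
	m->payload = value;
	tso_smp_wmb();			/* payload store before flag store */
	*(volatile int *) &m->ready = 1;
}

static int try_consume(struct msg *m, int *out)
{
	if (!*(volatile int *) &m->ready)
		return 0;
	tso_smp_rmb();			/* flag load before payload load */
	*out = m->payload;
	return 1;
}
```

publish() and try_consume() only need the compiler barriers. tso_smp_mb()
is reserved for the case where a store must be visible before a later
load is performed.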

Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe, instead
of defining a compile-time option specifically for Xeon Phi, we could
define an x86-tso.h header variant in userspace RCU and use it on all Intel
processors that map to TSO (hint: the vast majority). The only exceptions
seem to be Pentium Pro (needing smp_rmb() -> lfence) and some WinChip
processors, which could reorder stores (thus needing smp_wmb() -> sfence).

Thoughts?

Thanks,

Mathieu

> 
> Simon

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com