[ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)

Mathieu Desnoyers compudj at krystal.dyndns.org
Wed Feb 11 01:35:20 EST 2009


* Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck at linux.vnet.ibm.com) wrote:
> > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > 
> > > > > > > > 8-way x86_64
> > > > > > > > E5405 @2 GHZ
> > > > > > > > 
> > > > > > > > ./urcutorture 8 perf
> > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > 
> > > > > > > > ./urcutorture 8 uperf
> > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > 
> > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > 
> > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > 
> > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > would be:
> > > > > > > 
> > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > 
> > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > 
> > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > 
> > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > where there is a single switch_qparity().
> > > > > > > 
> > > > > > 
> > > > > > I just did a mb() version of the urcu :
> > > > > > 
> > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > 
> > > > > > Time per read : 48.4086 cycles
> > > > > > (about 6-7 times slower, as expected)
> > > > > > 
> > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > 
> > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > 
> > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > in an invalid state.
> > > > > > 
> > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > 
> > > > > > It now appears after just a few seconds when running
> > > > > > ./test_urcu_yield 20 -r -n
> > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > 
> > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > 
> > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > 
> > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > see a failure:
> > > > > 
> > > > > paulmck at paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > paulmck at paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > ./rcu_nest32 1 stress
> > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > paulmck at paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > ./rcu_nest32 1 stress
> > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > paulmck at paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > 
> > > > > This is my prototype version, with read-side memory barriers, no
> > > > > signals, and without your initialization-value speedup.
> > > > > 
> > > > 
> > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > a current version of your prototype, I could review it.
> > > 
> > > Look at:
> > > 
> > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > 
> > > In the git archive:
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> 
> Yep, this is before your optimization.
> 

Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
possible effect on the LSBs. That should work even if it overflows. OK.
That should even work with my optimization. But I somehow prefer the xor
(if it's not slower), because we really only need 1 bit to flip on and
off.

> > Another major difference between our tree is the lack of smp_mb() at the
> > end of flip_counter_and_wait() (in your tree).
> > 
> > Your code does :
> > 
> >   smp_mb()
> >   switch parity
> >   smp_mb()
> >   wait for each thread ongoing old gp
> >     <<<<<<< ---- missing smp_mb.
> >   switch parity
> >   smp_mb()
> >   wait for each thread ongoing old gp
> >   smp_mb()
> 
> This should be OK -- or am I missing a failure scenario?
> Keep in mind that I get failures only when omitting a counter
> flip, not with the above code.
> 

OK, it's good that you point out that the failure only occurs when
omitting the counter flip.

So if we leave out the mb() we can end up in a situation where a reader
thread is still in an ongoing old gp and we switch the parity. The big
question is : should we be concerned about this ?

From the writer point of view :

Given there is no data dependency between the parity update and the
per_thread(rcu_reader_gp, t) read done in the while loop waiting for
threads, and given even the compiler barrier() has no effect wrt the
last test done after the last iteration of the loop, we could think of
compiler optimizations doing the following to our code (let's focus on a
single loop of for_each_thread) :

transforming

                while (rcu_old_gp_ongoing(t))
                        barrier();
                rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;

into

                if (!rcu_old_gp_ongoing(t))
                  goto end;
                while (rcu_old_gp_ongoing(t))
                        barrier();
end:
                rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;

This leaves the choice to the compiler to perform the rcu_gp_ctr
increment before the per_thread(rcu_reader_gp, t) read, because there is
no barrier.

Not only does this apply to the compiler, but also to the memory
barriers. We can end up in a situation where the CPU decides to to the
rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
given there is no data dependency between those two.

You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
t) read will order reads, but I don't think we should rely on this on
SMP. This is really supposed to be there just to make sure we don't end
up doing multiple variable reads on UP wrt to local interrupts.

You could also argue that rcu_gp_ctr is read within
rcu_old_gp_ongoing(), which should normally order the memory accesses.
It actually does only order memory access to the rcu_gp_ctr variable,
not the per_thread(rcu_reader_gp, t), because, here again, there if no
data dependency whatsoever between per_thread(rcu_reader_gp, t) and
rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
the rcu_gp_ctr increment, and only then could the
per_thread(rcu_reader_gp, t) variable be read to perform the test.

But I see that even in rcu_read_lock, there is no strict ordering
between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
I conclude that ordering between those two variables does not matter at
all. I also suspect that this is the core reason for doing 2 q.s. period
flip at each update.

Am I correct ?


> > I also wonder why you have a smp_mb() after spin_unlock() in your
> > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > your spin_lock/unlock primitives yet).
> 
> Perhaps things have changed, but last I knew, spin_lock() and
> spin_unlock() were only required to keep the critical section in, not
> to keep things out of the critical section.
> 

Hrm, reading Documentation/memory-barriers.txt again tells me things
might have changed (if I am reading correctly the section LOCKS VS
MEMORY ACCESSES).

Correct me if I am wrong, but I don't think it makes sense to insure
memory barriers to keep accesses within the critical section and not
outside, because such memory access could well be another spinlock.

Therefore, we could end up in a situation where we have two locks, A and
B, taken in the following order in the source code :

LOCK A

UNLOCK A

LOCK B

UNLOCK B

Then, following your assumption, it would be possible for a CPU to do
the memory accesses associated to lock A and B in a random order one vs
the other. Given there would be no requirement to keep things out of
those respective critical sections, LOCK A could be taken within LOCK B,
and the opposite would also be valid.

Valid memory access orders :

1)
LOCK A
LOCK B
UNLOCK B
UNLOCK A

2)
LOCK B
LOCK A
UNLOCK A
UNLOCK B

The only constraint that ensures we won't end up in this situation is
the fact that memory accesses done outside of the critical section stays
outside of the critical section.

Mathieu



> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 							Thanx, Paul
> > > 
> > > _______________________________________________
> > > ltt-dev mailing list
> > > ltt-dev at lists.casi.polymtl.ca
> > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68




More information about the lttng-dev mailing list