[lttng-dev] [RFC PATCH] timekeeping: introduce timekeeping_is_busy()

Paul E. McKenney paulmck at linux.vnet.ibm.com
Wed Sep 11 16:36:18 EDT 2013


On Wed, Sep 11, 2013 at 02:54:41PM -0400, Mathieu Desnoyers wrote:
> * John Stultz (john.stultz at linaro.org) wrote:
> > On 09/11/2013 08:08 AM, Mathieu Desnoyers wrote:
> [...]
> 
> Now focusing on features (the fix discussion is in a separate
> sub-thread):
> 
> > 
> > > LTTng uses ktime to have the same time-base across kernel and
> > > user-space, so traces gathered from LTTng-modules and LTTng-UST can be
> > > correlated. We plan on using ktime until a fast, scalable, and
> > > fine-grained time-source for tracing that can be used across kernel and
> > > user-space, and which does not rely on read seqlock for kernel-level
> > > synchronization, makes its way into the kernel.
> > 
> > So my fix for the issue aside, I could see cases where using timekeeping
> > for tracing could run into similar issues, so something like your
> > timekeeping_is_busy() check sounds reasonable.
> 
> Yep, it would certainly make those uses of ktime_get() more robust
> against internal changes.
> 
> > I might suggest we wrap
> > the timekeeper locking in a helper function so we don't have the
> > spinlock(); set_owner(); write_seqcount(); pattern all over the place
> > (and it avoids me forgetting to set the owner in some future change,
> > further mucking things up :).
> 
> Good idea.
> 
> > As for your waiting for "fast, scalable, and fine-grained time-source
> > for tracing that can be used across kernel and user-space, and which
> > does not rely on read seqlock for kernel-level synchronization" wish,
> > I'd be interested in hearing ideas if anyone has them.
> 
> So far, the best I could come up with is this: using a RCU (or RCU-like)
> scheme to protect in-kernel timestamp reads (possibly with RCU sched,
> which is based on preemption disabling), and use a sequence lock to
> protect reads from user-space.
> 
> Time updates within the kernel would have to deal with both RCU pointer
> update and track quiescent state, and would need to hold a write seqlock
> to synchronize against concurrent user-space reads.
> 
> > After getting the recent lock-hold reduction work merged in 3.10, I had
> > some thoughts that maybe we could do some sort of rcu style timekeeper
> > switch. The downside is that there really is a time bound for which the
> > timekeeper state is valid, so there would have to be some sort of
> > seqcount style "retry if we didn't finish the calculation within the
> > valid bound" (which can run into similar deadlock problems if the
> > updater is delayed by a reader spinning waiting for an update).
> 
> What could make a reader fail to finish the calculation within the
> valid time bound? Besides preemption? If it's caused by an overly long
> interrupt, this will affect timekeeping as a whole, because the timer
> interrupt will likely be delayed, and therefore the periodic update
> changing the write seqlock value will be delayed too. So for the
> interrupt case, it looks like a "too long" interrupt (or
> interrupt-disable section) will already disrupt timekeeping with the
> current design.
> 
> > 
> > Also there is the memory issue of having N timekeeper structures hanging
> > around, since there could be many readers delayed mid-calculation, but
> > that could probably be bound by falling back to a seqcount (and again,
> > that opens up deadlock possibilities). Anyway, it all gets pretty
> > complicated pretty quickly, which makes ensuring correctness even harder. :(
> > 
> > But yea, I'd be interested in other ideas and approaches.
> 
> If we can afford a synchronize_rcu_sched() wherever the write seqlock is
> needed, we could go with the following. Please note that I use
> synchronize_rcu_sched() rather than call_rcu_sched() here because I try
> to avoid having too many timekeeper structures hanging around, and I
> think it can generally be a good thing to ensure the timekeeping core
> does not depend on the memory allocator (but I could very well be wrong).

The issue called out the last time I remember this being put forward
was that grace periods can be delayed for longer than is an acceptable
gap between timekeeping updates.  But maybe something has changed since
then -- that was a few years ago.

							Thanx, Paul

> In kernel/time/timekeeping.c:
> 
> static DEFINE_MUTEX(timekeeper_mutex);
> static seqcount_t timekeeper_user_seq;
> 
> struct timekeeper_rcu {
>         struct timekeeper a[2];
>         struct timekeeper *p;   /* current */
> };
> 
> /* Timekeeper structure for kernel readers */
> static struct timekeeper_rcu timekeeper_rcu;
> 
> /* Timekeeper structure for userspace readers */
> static struct timekeeper timekeeper_user;
> 
> /* for updates */
> static void update_time(void)
> {
>         struct timekeeper *next_p;
> 
>         mutex_lock(&timekeeper_mutex);
> 
>         /* RCU update, for kernel readers */
>         if (timekeeper_rcu.p == &timekeeper_rcu.a[0])
>                 next_p = &timekeeper_rcu.a[1];
>         else
>                 next_p = &timekeeper_rcu.a[0];
> 
>         timekeeper_copy(next_p, timekeeper_rcu.p);
>         timekeeper_do_update(next_p, ...);
> 
>         rcu_assign_pointer(timekeeper_rcu.p, next_p);
> 
>         /*
>          * Ensure there are no more readers in flight accessing the old
>          * element.
>          */
>         synchronize_rcu_sched();
> 
>         /* seqlock update, for user-space readers */
>         write_seqcount_begin(&timekeeper_user_seq);
>         timekeeper_do_update_user(&timekeeper_user);
>         write_seqcount_end(&timekeeper_user_seq);
> 
>         mutex_unlock(&timekeeper_mutex);
> }
> 
> 
> /*
>  * Accessing time from the kernel. Should be called with preemption
>  * disabled.
>  */
> u64 __read_time_kernel(void)
> {
>         struct timekeeper *p;
> 
>         p = rcu_dereference(timekeeper_rcu.p);
>         return get_value_from(p);
> }
> 
> /* Accessing time from user-space (vDSO) */
> 
> u64 read_time_user(void)
> {
>         /*
>          * Read timekeeper_user within a read_seqcount loop.
>          */
> }
> 
> 
> As far as I see, the write seqcount is currently taken by:
> 
> - do_settimeofday()
> - timekeeping_inject_offset()
> - timekeeping_set_tai_offset()
> - change_clocksource()
> - timekeeping_init()
> - timekeeping_inject_sleeptime()
> - timekeeping_resume()
> - timekeeping_suspend()
> - update_wall_time()
> - do_adjtimex()
> - hardpps()
> 
> It might be good to break down which of these functions could afford a
> synchronize_rcu_sched(), and which can't. Maybe part of the solution
> could be to use different data structures, and separate synchronization,
> for the adjtime part (executed in thread context) vs. the
> update_wall_time part, which AFAIK is executed in interrupt context. We
> might also want to consider that the adjtime part needs to be global,
> whereas update_wall_time could possibly be done on per-cpu data
> structures, which could simplify synchronization.
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 