[ltt-dev] [patch] add tracepoints to trace activate/deactivate task

Peter Zijlstra peterz at infradead.org
Wed Dec 10 02:08:22 EST 2008


On Tue, 2008-12-09 at 17:10 -0500, Mathieu Desnoyers wrote:
> * Jason Baron (jbaron at redhat.com) wrote:
> > On Mon, Dec 08, 2008 at 11:42:49PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2008-12-08 at 17:38 -0500, Jason Baron wrote:
> > > > On Mon, Dec 08, 2008 at 08:54:10PM +0100, Peter Zijlstra wrote:
> > > > > On Mon, 2008-12-08 at 14:49 -0500, Jason Baron wrote:
> > > > > > hi,
> > > > > > 
> > > > > > I thought it would be useful to track when a task is
> > > > > > 'activated/deactivated'. This case is different from wakeup/wait, in
> > > > > > that a task can be activated and deactivated when the scheduler
> > > > > > re-balances tasks, the allowable cpuset changes, or cpu hotplug occurs.
> > > > > > Using these patches I can more precisely figure out when a task becomes
> > > > > > runnable and why.
> > > > > 
> > > > > Then I still don't agree with it, because it does not expose the event
> > > > > that did the change.
> > > > > 
> > > > > If you want the cpu allowed mask, put a tracepoint there. If you want
> > > > > migrate information (didn't we have that?) then put one there, etc.
> > > > > 
> > > > 
> > > > Well, with a stap backtrace I can figure out the event; otherwise I'm
> > > > sprinkling 14 more trace events in the scheduler... I can go down that
> > > > path if people think it's better?
> > > 
> > > What events are you interested in? Some of them are just straight
> > > syscall things like nice.
> > > 
> > > But yes, I'd rather you did the events - that's what tracepoints are
> > > all about: marking individual events, not some fugly hook for stap.
> > 
> > Well, I think that the activate/deactivate combination gives you a lot
> > of interesting statistics. You could figure out how long tasks wait on
> > the runqueue, when and how tasks are migrated between runqueues, queue
> > lengths, average queue lengths, large queue lengths. These statistics
> > could help diagnose performance problems.
> > 
> > For example, I just wrote the systemtap script below, which outputs the
> > distribution of queue lengths per-cpu on my system. I'm sure Frank could
> > improve the stap code, but below are the script and the output.
> > 
> 
> Quoting yourself in this thread:
> 
> "I thought it would be useful to track when a task is
> 'activated/deactivated'. This case is different from wakeup/wait, in
> that task can be activated and deactivated, when the scheduler
> re-balances tasks, the allowable cpuset changes, or cpu hotplug occurs.
> Using these patches I can more precisely figure out when a task becomes
> runnable and why."
> 
> Peter Zijlstra objected that the key events we would like to see traced
> are more detailed than just "activate/deactivate" state, e.g. an event
> for wakeup, one for wait, one for re-balance, one for cpuset change, one
> for hotplug. Doing this will allow other tracers to do other useful
> stuff with the information.
> 
> So trying to argue that "activate/deactivate" is "good" is missing the
> point here. Yes, we need that information, but in fact we need _more
> precise_ information, which is a superset of those "activate/deactivate"
> events.
> 
> Peter, am I understanding your point correctly?

Quite so.

The point is that for some operations the task never conceptually leaves
the rq, like renice - it's never not runnable, we just need to pull it
off and re-insert it in order to make the weight change, but it's for all
intents and purposes an 'atomic' operation.
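
Purely as an illustration - this is a paraphrase, not the exact kernel
code - the renice path boils down to something like:

	on_rq = p->se.on_rq;
	if (on_rq)
		dequeue_task(rq, p, 0);		/* pull it off the rq */

	p->static_prio = NICE_TO_PRIO(nice);
	set_load_weight(p);			/* the actual change */

	if (on_rq)
		enqueue_task(rq, p, 0);		/* and put it straight back */

so an activate/deactivate style tracepoint would fire here even though
the task was runnable the whole time.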

So if you want that information, you want a trace_sched_setscheduler()
hook that will tell you about the old scheduler/prio and the new
scheduler/prio for a task - because that is the information native to
that event; deactivate/activate is not, and one could conceivably
implement renice without using them at all.
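
Purely as a sketch (the name and arguments here are made up, not part of
the patch below), such a hook could follow the same DECLARE_TRACE
convention already used in include/trace/sched.h:

	DECLARE_TRACE(sched_setscheduler,
		TPPROTO(struct task_struct *p, int old_policy, int old_prio,
			int new_policy, int new_prio),
			TPARGS(p, old_policy, old_prio, new_policy, new_prio));

with the corresponding trace_sched_setscheduler() call placed where the
policy/priority actually change, so the event carries exactly the
before/after state that is native to it.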

Same thing with migration: it's conceptually an atomic op, we just need
to lift a task from one cpu to another; the task doesn't conceptually
become unrunnable during that move.

So looking at activate/deactivate is wrong.

---
Subject: trace: fixup task migration trace point

The tracepoint only caught one of many places where a task changes cpu;
put it in the right place so we get all of them.

Change the signature while we're at it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra at chello.nl>
---
 include/trace/sched.h |    4 ++--
 kernel/sched.c        |    3 ++-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/trace/sched.h b/include/trace/sched.h
index 9b2854a..f4549d5 100644
--- a/include/trace/sched.h
+++ b/include/trace/sched.h
@@ -30,8 +30,8 @@ DECLARE_TRACE(sched_switch,
 		TPARGS(rq, prev, next));
 
 DECLARE_TRACE(sched_migrate_task,
-	TPPROTO(struct rq *rq, struct task_struct *p, int dest_cpu),
-		TPARGS(rq, p, dest_cpu));
+	TPPROTO(struct task_struct *p, int orig_cpu, int dest_cpu),
+		TPARGS(p, orig_cpu, dest_cpu));
 
 DECLARE_TRACE(sched_process_free,
 	TPPROTO(struct task_struct *p),
diff --git a/kernel/sched.c b/kernel/sched.c
index 0eff15b..3dc54cd 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1861,6 +1861,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
 	clock_offset = old_rq->clock - new_rq->clock;
 
+	trace_sched_migrate_task(p, task_cpu(p), new_cpu);
+
 #ifdef CONFIG_SCHEDSTATS
 	if (p->se.wait_start)
 		p->se.wait_start -= clock_offset;
@@ -2841,7 +2843,6 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)
 	    || unlikely(!cpu_active(dest_cpu)))
 		goto out;
 
-	trace_sched_migrate_task(rq, p, dest_cpu);
 	/* force the process onto the specified CPU */
 	if (migrate_task(p, dest_cpu, &req)) {
 		/* Need to wait for migration thread (might exit: take ref). */
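
Not part of the patch, but to show what a consumer of the changed
tracepoint would look like: a minimal probe-module sketch, assuming the
usual register_trace_* / unregister_trace_* helpers that DECLARE_TRACE
generates (module and probe names are made up here):

	#include <linux/module.h>
	#include <linux/sched.h>
	#include <trace/sched.h>

	/* Probe signature matches the new TPPROTO: task, source cpu, dest cpu. */
	static void probe_sched_migrate_task(struct task_struct *p,
					     int orig_cpu, int dest_cpu)
	{
		printk(KERN_DEBUG "migrate: %s/%d %d -> %d\n",
		       p->comm, p->pid, orig_cpu, dest_cpu);
	}

	static int __init migrate_probe_init(void)
	{
		return register_trace_sched_migrate_task(probe_sched_migrate_task);
	}

	static void __exit migrate_probe_exit(void)
	{
		unregister_trace_sched_migrate_task(probe_sched_migrate_task);
		/* wait for in-flight probes before the module text goes away */
		tracepoint_synchronize_unregister();
	}

	module_init(migrate_probe_init);
	module_exit(migrate_probe_exit);
	MODULE_LICENSE("GPL");

The point being that the probe now sees the source and destination cpu
for every migration, not just the one path the old hook covered.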




