[lttng-dev] Best way to analyze CTF files

Eugene Ivanov eiva at tbricks.com
Wed Oct 22 07:27:48 EDT 2014


On Mon, Oct 20, 2014 at 7:48 PM, Boisvert, Sebastien <boisvert at anl.gov>
wrote:

> > From: Eugene Ivanov [eiva at tbricks.com]
> > Sent: Monday, October 20, 2014 6:01 AM
> > To: Boisvert, Sebastien
> > Cc: lttng-dev at lists.lttng.org
> > Subject: Re: [lttng-dev] Best way to analyze CTF files
> >
> >
> >
> > Hi Sebastian,
>
> *Sebastien* or *Sébastien*
>

Oops, I'm sorry, Sebastien.


> >
> > In our company, one of the main LTTng use cases is to measure the latency
> > between two probes that receive the same ID as an argument.
>
> We do the same thing with LTTng (our ID is called message_number, and
> every actor message carries one).
>
> > For this purpose we've implemented a special utility in C, based on the
> > CTF reading libraries (IIRC). Also, some developers often use their own
> > simple Perl scripts to analyze babeltrace output.
>
> In our system, messages are sent by actors (like Erlang processes, but in
> C99 for throughput).
>
> We ended up using LTTng's powerful real-time filtering engine to keep only
> the messages of a single actor:
>
> lttng enable-event --userspace thorium_message:* --filter
> "message_source_actor == 1001563"
>
>
> Then we simply look at individual messages (LTTng provides us with the
> delivery path of every actor message for the selected actor):
>
> [boisvert at bigmem biosal]$ cat trace.txt | grep "message_number = 99," | awk '{print $4" "$2}'
> thorium_message:actor_send: (+0.003747706)
> thorium_message:worker_send: (+0.000001442)
> thorium_message:worker_enqueue_message: (+0.000001257)
> thorium_message:worker_publish_message: (+0.000000976)
> thorium_message:worker_pool_dequeue: (+0.003363601)
> thorium_message:node_send: (+0.000001483)
> thorium_message:node_send_system: (+0.000001349)
> thorium_message:transport_send: (+0.000001040)
> thorium_message:node_receive: (+0.000073889)
> thorium_message:node_dispatch_message: (+0.000001891)
> thorium_message:worker_pool_enqueue: (+0.000001696)
> thorium_message:worker_receive: (+0.000001774)
> thorium_message:actor_receive: (+0.000002841)
>
>
> > I wish there were a better way to implement custom CTF analyzers, perhaps
> > some basic DSL.
>
> I think an efficient approach would be to just import everything into
> sqlite. Otherwise, filtering is good, since tracing without it generates
> files that are too large.
>
> Without filtering, our HPC app (supercomputing, not cloud) generates ~30
> GiB of LTTng CTF data in 2 minutes with only 4 Thorium nodes (~30 x86-64
> cores). In production, we use ~6000 x86 cores on a Cray XE6 for ~20
> minutes and ~30000 POWER (A2) cores for ~20 minutes. So LTTng's built-in
> filter is really our only option, IMO.
>

I see.
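
The sqlite idea is interesting, though. As a rough, untested sketch of what
the import could look like (this assumes the babeltrace 1.x Python bindings
are installed; the field names are just examples borrowed from your trace,
so adjust them to the actual tracepoints):

#!/usr/bin/env python
# Rough sketch: dump CTF events into sqlite for ad-hoc SQL analysis.
# Assumes the babeltrace 1.x Python bindings. The field names
# (message_number, message_source_actor) are examples taken from the
# trace shown in this thread.
import sqlite3
import babeltrace

col = babeltrace.TraceCollection()
# The path must point at the directory containing the CTF "metadata" file.
col.add_trace('/path/to/trace', 'ctf')

db = sqlite3.connect('trace.db')
db.execute('CREATE TABLE events (name TEXT, ts_ns INTEGER, '
           'message_number INTEGER, message_source_actor INTEGER)')

for ev in col.events:
    db.execute('INSERT INTO events VALUES (?, ?, ?, ?)',
               (ev.name, ev.timestamp,
                ev.get('message_number'), ev.get('message_source_actor')))
db.commit()

# Delivery time per message is then a single self-join.
for row in db.execute(
        "SELECT s.message_number, r.ts_ns - s.ts_ns AS latency_ns "
        "FROM events s JOIN events r "
        "ON s.message_number = r.message_number "
        "AND s.message_source_actor = r.message_source_actor "
        "WHERE s.name = 'message:actor_send' "
        "AND r.name = 'message:actor_receive'"):
    print(row)

Once the events are in a table like this, ad-hoc questions become plain SQL
instead of one-off scripts.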


> >
> >
> >
> > On Sat, Oct 18, 2014 at 2:23 AM, Boisvert, Sebastien
> > <boisvert at anl.gov> wrote:
> >
> > Bonjour,
> >
> > First, thank you for LTTng-UST. This is very useful and convenient.
> >
> > I just got started today using LTTng (LTTng-UST) for tracing an HPC
> > application that I am working on (I am a postdoc). I am impressed by how
> > easy LTTng is to use.
> >
> > In my system, an actor message is represented by a pair
> > <message_source_actor, message_number>.
> >
> > I want to list all messages that have a high delivery time
> > (message:actor_receive - message:actor_send).
> >
> > I am doing this to get the messages of one actor (actor 1000019):
> >
> > [boisvert at bigmem biosal]$ babeltrace ~/lttng-traces/auto-20141017-181240 | grep "message_source_actor = 1000019" > actor_1000019
> >
> > Then, I can look at one message (message <1000019, 14>):
> >
> > [boisvert at bigmem biosal]$ grep "message_number = 14," actor_1000019
> > [18:12:43.647017211] (+0.000005110) bigmem.knoxville.kbase.us message:actor_send: { cpu_id = 30 }, { message_number = 14, message_action = 31592, message_count = 8, message_source_actor = 1000019, message_destination_actor = 1000059, message_source_node = -1, message_destination_node = -1 }
> > [18:12:43.647025249] (+0.000002860) bigmem.knoxville.kbase.us message:actor_receive: { cpu_id = 49 }, { message_number = 14, message_action = 31592, message_count = 8, message_source_actor = 1000019, message_destination_actor = 1000059, message_source_node = 3, message_destination_node = 3 }
> >
> > If I subtract the times:
> >
> > irb(main):003:0> (43.647025249-43.647017211)*10**9
> > => 8038.00000426236
> >
> > This message (<1000019, 14>) required 8038 ns for delivery. This one is
> > fine.
> >
> >
> > So basically my question is:
> >
> > Is there an easy way to analyze these tracepoint files?
> > --
> > Eugene


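For what it's worth, the pairing logic in the utility I mentioned is tiny.
A rough, untested Python sketch of the same idea, using the babeltrace 1.x
bindings (event and field names are taken from your trace above):

#!/usr/bin/env python
# Rough sketch: pair actor_send/actor_receive events by message ID and
# print the delivery time. Assumes the babeltrace 1.x Python bindings;
# event and field names follow the trace shown in this thread.
import sys
import babeltrace

col = babeltrace.TraceCollection()
col.add_trace(sys.argv[1], 'ctf')  # directory containing the CTF metadata

send_ts = {}  # (message_source_actor, message_number) -> send timestamp (ns)

for ev in col.events:
    key = (ev.get('message_source_actor'), ev.get('message_number'))
    if ev.name == 'message:actor_send':
        send_ts[key] = ev.timestamp
    elif ev.name == 'message:actor_receive' and key in send_ts:
        delta = ev.timestamp - send_ts.pop(key)
        print('message %s: delivered in %d ns' % (key, delta))

Sorting the output by the delta (or only printing deltas above a threshold)
would give you the list of slow messages directly.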

-- 
Eugene