[lttng-dev] Best way to analyze CTF files

Sat Oct 18 11:18:38 EDT 2014

Hi Philippe,

Thanks for the detailed Python code.

In my application, I know that something is wrong with the message delivery path 
for specific actors. But I need to know more.

I looked carefully at the LTTng documentation and I found what I needed.
There is a --filter option which carefully selects events, and
reduces the amount of recorded events.

Thanks to LTTng, I am closer to solving this problem:

One message with the same source node and destination node:

[boisvert at bigmem biosal]$ babeltrace  ~/lttng-traces/auto-20141018-104939/ |grep "message_number = 8000," | awk '{print $4" "$2}'|sed 's/: / /g'|sed 's/(//g'|sed 's/)//g'
thorium_message:actor_send +0.000010835
thorium_message:worker_send +0.000000876
thorium_message:worker_enqueue_message +0.000001410
thorium_message:worker_dequeue_message +0.003886256   <=========================
thorium_message:worker_pool_dequeue +0.000001434
thorium_message:node_send +0.000001438
thorium_message:node_send_system +0.000001006
thorium_message:node_dispatch_message +0.000001231
thorium_message:worker_pool_enqueue +0.000001827
thorium_message:node_send_dispatch +0.000001100
thorium_message:worker_receive +0.000003944
thorium_message:actor_receive +0.000003028

A message sent between nodes:

[boisvert at bigmem biosal]$ babeltrace  ~/lttng-traces/auto-20141018-104939/ |grep "message_number = 8800," | awk '{print $4" "$2}'|sed 's/: / /g'|sed 's/(//g'|sed 's/)//g'
thorium_message:actor_send +0.004115537
thorium_message:worker_send +0.000001402
thorium_message:worker_enqueue_message +0.000001457
thorium_message:worker_dequeue_message +0.004062639   <=============================
thorium_message:worker_pool_dequeue +0.000001536
thorium_message:node_send +0.000000934
thorium_message:node_send_system +0.000001185
thorium_message:transport_send +0.000001039
thorium_message:node_receive +0.000061961
thorium_message:node_dispatch_message +0.000001502
thorium_message:worker_pool_enqueue +0.000002299
thorium_message:worker_receive +0.000001632
thorium_message:actor_receive +0.000003461

> ________________________________________
> From: Philippe Proulx [eeppeliteloop at gmail.com]
> Sent: Friday, October 17, 2014 9:15 PM
> To: Christian Babeux
> Cc: Boisvert, Sebastien; lttng-dev at lists.lttng.org
> Subject: Re: [lttng-dev] Best way to analyze CTF files
> On Fri, Oct 17, 2014 at 8:13 PM, Christian Babeux
> <christian.babeux at efficios.com> wrote:
> > Hi Sebastien,
> >
> > A simple Python script using the Babeltrace bindings would probably do
> > the job here. See [1] for an example using the bindings.
> Yes, here's a start Sébastien <http://pastebin.com/QXfev1ve>:
> import babeltrace
> import json
> import sys
> 
> def get_msg_intervals(trace_path):
> col = babeltrace.TraceCollection()
> if col.add_trace(trace_path, 'ctf') is None:
> raise RuntimeError('Cannot add trace')
> msg_intervals = {}
> for ev in col.events:
> if ev.name not in ['message:actor_send', 'message:actor_receive']:
> continue
> msg_source_actor = ev['message_source_actor']
> msg_number = ev['message_number']
> if msg_source_actor not in msg_intervals:
> msg_intervals[msg_source_actor] = {}
> if msg_number not in msg_intervals[msg_source_actor]:
> msg_intervals[msg_source_actor][msg_number] = {}
> interval = msg_intervals[msg_source_actor][msg_number]
> if ev.name == 'message:actor_send':
> interval['send_ts'] = ev.timestamp
> else:
> interval['recv_ts'] = ev.timestamp
> return msg_intervals
> 
> if __name__ == '__main__':
> msg_intervals = get_msg_intervals(sys.argv[1])
> print(json.dumps(msg_intervals, indent=2))
> This will output a JSON dict mapping actors to message numbers, and then
> each message number to a dict containing both the send and receive timestamps:
> python3 intervals.py /path/to/trace > intervals.json
> Once you have that JSON database, you may either read it manually or query it
> easily using another Python script <http://pastebin.com/fLwyvEz8>:
> import json
> import sys
> 
> def print_msg_diff(filename, actor_id, msg_number):
> def get_diff(msg):
> send_ts = msg['send_ts']
> recv_ts = msg['recv_ts']
> if send_ts == -1 or recv_ts == -1:
> return None
> return recv_ts - send_ts
> with open(filename) as f:
> msg_intervals = json.load(f)
> actor = msg_intervals[actor_id]
> if msg_number is None:
> for msg_number in sorted(actor.keys()):
> msg = actor[msg_number]
> print('{}: {}'.format(msg_number, get_diff(msg)))
> else:
> print(get_diff(actor[msg_number]))
> 
> if __name__ == '__main__':
> filename = sys.argv[1]
> actor_id = sys.argv[2]
> msg_number = None
> if len(sys.argv) > 3:
> msg_number = sys.argv[3]
> print_msg_diff(filename, actor_id, msg_number)
> Use it like this:
> python3 query.py <JSON file path> <actor ID> [<message number>]
> For example:
> python3 query.py intervals.json 1000019
> python3 query.py intervals.json 1000059
> python3 query.py intervals.json 1000019 14
> Modify those scripts at will to fit your exact use case. Of course the versions
> above could crash in many ways because they lack a few run time checks, etc.
> Hope it helps,
> Phil
> >
> > Thanks,
> >
> > Christian
> >
> > [1] - https://lttng.org/docs/#doc-viewing-and-analyzing-your-traces
> >
> > On Fri, Oct 17, 2014 at 6:23 PM, Boisvert, Sebastien <boisvert at anl.gov> wrote:
> >> Bonjour,
> >>
> >> First, thank you for LTTng-UST. This is very useful and convenient.
> >>
> >> I just got started today using LTTng (LTTng-UST) for tracing a HPC application
> >> that I am working on (I am a postdoc). I am impressed by how easy LTTng is to use it.
> >>
> >> In my system, an actor message is represented by a pair
> >> <message_actor_source, message_number>.
> >>
> >> I want to list all messages that have a high delivery time (message:actor_receive - message:actor_send).
> >>
> >> I am doing this to get the messages of one actor (actor 1000019):
> >>
> >> [boisvert at bigmem biosal]$ babeltrace ~/lttng-traces/auto-20141017-181240|grep "message_source_actor = 1000019"  > actor_1000019
> >>
> >> Then, I can look at one message with (message <1000019, 14>):
> >>
> >> [boisvert at bigmem biosal]$ grep "message_number = 14," actor_1000019
> >> [18:12:43.647017211] (+0.000005110) bigmem.knoxville.kbase.us message:actor_send: { cpu_id = 30 }, { message_number = 14, message_action = 31592, message_count = 8, message_source_actor = 1000019, message_destination_actor = 1000059, message_source_node = -1, message_destination_node = -1 }
> >> [18:12:43.647025249] (+0.000002860) bigmem.knoxville.kbase.us message:actor_receive: { cpu_id = 49 }, { message_number = 14, message_action = 31592, message_count = 8, message_source_actor = 1000019, message_destination_actor = 1000059, message_source_node = 3, message_destination_node = 3 }
> >>
> >> If I substract the times:
> >>
> >> irb(main):003:0> (43.647025249-43.647017211)*10**9
> >> => 8038.00000426236
> >>
> >> This message (<1000019, 14>) required 8038 ns for the delivery. This one is fine.
> >>
> >>
> >> So basically my question is:
> >>
> >> Is there an easy way to analyze these tracepoint files ?
> >> _______________________________________________
> >> lttng-dev mailing list
> >> lttng-dev at lists.lttng.org
> >> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> >
> > _______________________________________________
> > lttng-dev mailing list
> > lttng-dev at lists.lttng.org
> > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev