[ltt-dev] General Questions
Mathieu Desnoyers
compudj at krystal.dyndns.org
Thu Jul 31 08:54:34 EDT 2008
* Kevin Harms (harms at mcs.anl.gov) wrote:
>
> I've made progress with processing the trace output, but I'm not sure I'm seeing what I expect.
>
> I ran a trace of my test program which produced the following output:
>
> 0.014005
> 0.013193
> 0.913196
> 0.012794
> 0.013194
> 0.013196
> 0.013195
> 0.012796
> 0.026788
Ok. Now let's concentrate on the interesting time window : the one
containing the events likely to cause the long fsync(). Given that we
know fsync() takes the buffers (filled by write, writev, ...) system
calls and flushes them to disk, the root cause of the problem should be
those write* system calls between the previous (correct) fsync() and the
fsync() which takes 0.9 s.
You can then find out what happened :
- First get the timestamp associated to your 0.9 s fsync(). It will be
used as an anchor in the trace.
- In graphical mode, try to paste this time (format 00.000000) on the
"Current time" label at the bottom right of the screen. It should seek
the trace to this time.
- See what happens in the time interval before, by locating the previous
fsync() too with a similar method. You can change the start and end
time by pasting the timestamp values in a similar way.
- It may help to refresh (refresh button) the GUI once your time window
has shrinked : it removes the inactive threads from the control flow
view, thus making it easier to navigate.
- I recommend to launch another detailed event list view and to use the
graphical filter associated to it (small button at the left of the
view) and select write, writev and other writev-like events and/or
on syscall_entry. This should help you navigate more easily through
the important events.
- Then, see which thread/process causes this problem, and fix userspace
code ;) Given we currently don't have the userspace stack dump (which
we once had before), it can be a little bit harder to find which
userspace code is responsible for those write system calls, but
hopefully there is not too much code there.
Mathieu
> 0.912795
> 0.013195
> 0.012794
> 0.013196
> 0.013195
> 0.013195
> 0.012795
> 0.913194
> 0.013196
> 0.013195
> 0.013196
>
> These are timings of fsync(). You can see normally the fsync time is on the order of 0.01 seconds but sometimes there's a 0.9 second one. This occurs on different hardware and file systems, so it would seem to be something with my test program or linux. Looking at the trace output. I don't really see this. The fsync calls seemed to be at regular intervals with no obvious ones taking longer than expected. See the screen shot at the link below. Am I looking in the wrong place?
>
> http://www-unix.mcs.anl.gov/~harms/lttv-capture.png
>
> thanks,
> kevin
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
More information about the lttng-dev
mailing list