[lttng-dev] [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED

Mathieu Desnoyers mathieu.desnoyers at efficios.com
Tue Jul 2 20:55:14 EDT 2013


* Mathieu Desnoyers (mathieu.desnoyers at efficios.com) wrote:
> * Dave Chinner (david at fromorbit.com) wrote:
> > On Thu, Jun 20, 2013 at 08:20:16AM -0400, Mathieu Desnoyers wrote:
> > > * Rob van der Heij (rvdheij at gmail.com) wrote:
> > > > Wouldn't you batch the calls to drop the pages from cache rather than drop
> > > > one packet at a time?
> > > 
> > > By default for kernel tracing, lttng's trace packets are 1MB, so I
> > > consider the call to fadvise to be already batched by applying it to 1MB
> > > packets rather than individual pages. Even there, it seems that the
> > > extra overhead added by the lru drain on each CPU is noticeable.
> > > 
> > > Another reason for not batching this in larger chunks is to limit the
> > > impact of the tracer on the kernel page cache. LTTng limits itself to
> > > its own set of buffers, and uses the page cache for what is absolutely
> > > needed to perform I/O, but no more.
> > 
> > I think you are doing it wrong. This is a poster child case for
> > using Direct IO and completely avoiding the page cache altogether....
> 
> I just tried replacing my sync_file_range()+fadvise() calls with the
> O_DIRECT flag passed to open(). Unfortunately, I must be doing
> something very wrong, because I get only 1/3rd of the throughput, and
> the page cache still fills up. Any idea why?

Since O_DIRECT does not seem to provide acceptable throughput, it may be
interesting to investigate other ways to lessen the latency impact of
the fadvise DONTNEED hint.

Given it is just a hint, we should be allowed to perform page
deactivation lazily. Is there any fundamental reason to wait for the
worker threads on each CPU to complete their LRU drain before returning
from fadvise() to user-space?

Thanks,

Mathieu

> 
> Here are my results:
> 
> heavy-syscall.c: 30M sigaction() syscalls with bad parameters (each
> returns immediately). Used as a high-throughput stress-test for the
> tracer.
> Tracing to disk with LTTng, all kernel tracepoints activated, including
> system calls.
> 
> Tracer configuration: per-core buffers split into 4 sub-buffers of
> 262kB. splice() is used to transfer data from buffers to disk. Runs on
> an 8-core Intel machine.
> 
> Writing to a software raid-1 ext3 partition.
> ext3 mount options: rw,errors=remount-ro
> 
> * sync_file_range+fadvise 3.9.8
>   - with lru drain on fadvise
> 
> Kernel cache usage:
> Before tracing: 56272k cached
> After tracing:  56388k cached
> 
> 939M	/root/lttng-traces/auto-20130702-090430
> time ./heavy-syscall 
> real	0m21.910s
> throughput: 42MB/s
> 
> 
> * sync_file_range+fadvise 3.9.8
>   - without lru drain on fadvise: manually reverted
> 
> Kernel cache usage:
> Before tracing: 67968k cached
> After tracing:  67984k cached
> 
> 945M	/root/lttng-traces/auto-20130702-092505
> time ./heavy-syscall 
> real	0m21.872s
> throughput: 43MB/s
> 
> 
> * O_DIRECT 3.9.8
>   - O_DIRECT flag on open(), removed fadvise and sync_file_range calls
> 
> Kernel cache usage:
> Before tracing:  99480k cached
> After tracing:  360132k cached
> 
> 258M	/root/lttng-traces/auto-20130702-090603
> time ./heavy-syscall 
> real	0m19.627s
> throughput: 13MB/s
> 
> 
> * No cache hints 3.9.8
>   - only removed fadvise and sync_file_range calls
> 
> Kernel cache usage:
> Before tracing: 103556k cached
> After tracing:  363712k cached
> 
> 945M	/root/lttng-traces/auto-20130702-092505
> time ./heavy-syscall 
> real	0m19.672s
> throughput: 48MB/s
> 
> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
