<div dir="ltr"><div><div>Wouldn't you batch the calls that drop pages from the cache rather than dropping one packet at a time? Your effort to help the Linux mm seems a bit overkill, and you don't want every application to have to do this itself. The fadvise will not even work while the pages are still waiting to be flushed out. Without the patch that started the thread, it would also fail 'at random' because of an SMP race condition (not multi-threading).<br>
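<br>A minimal sketch of the kind of batching I have in mind, assuming a single output file descriptor. The struct drop_batch type, the drop_batch_add() helper and the threshold field are hypothetical names for illustration, not LTTng code. It flushes before dropping because DONTNEED skips pages that are still dirty:<br>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not LTTng code: accumulates written ranges and
 * drops them from the page cache with one call per batch. */
struct drop_batch {
	int fd;
	off_t start;     /* first byte not yet dropped */
	off_t pending;   /* bytes written since the last drop */
	off_t threshold; /* issue a drop once this many bytes are pending */
};

/* Returns 1 if a drop was issued, 0 if still accumulating, -1 on error. */
static int drop_batch_add(struct drop_batch *b, off_t len)
{
	assert(len >= 0);
	b->pending += len;
	if (b->pending < b->threshold)
		return 0;

	/* Flush first: DONTNEED does not discard dirty pages. */
	if (fdatasync(b->fd) != 0)
		return -1;

	/* posix_fadvise() returns an error number directly, 0 on success. */
	if (posix_fadvise(b->fd, b->start, b->pending, POSIX_FADV_DONTNEED) != 0)
		return -1;

	b->start += b->pending;
	b->pending = 0;
	return 1;
}
```

The caller would invoke drop_batch_add() once per packet written. The threshold is an assumption to tune: a larger value means fewer fadvise calls but more data left in the cache between drops.<br>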
<br>I believe the right way would be for Linux to implement the fadvise flags properly to guide cache replacement.<br><br></div>My situation was slightly different. I run in a virtualized environment where it is a global improvement for the Linux guest not to cache data, even though from the Linux mm perspective there is plenty of space and no good reason to drop the data. My scenario intercepted the close() call to drop each input file from the cache during a tar or file system scan. This avoids building a huge page cache full of data that will not be referenced until the next backup and that would therefore get paged out by the hypervisor. There's some more here:<br>
<a href="http://zvmperf.wordpress.com/2013/01/27/taming-linux-page-cache/">http://zvmperf.wordpress.com/2013/01/27/taming-linux-page-cache/</a><br><br></div>Rob<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">
On 19 June 2013 21:25, Mathieu Desnoyers <span dir="ltr"><<a href="mailto:mathieu.desnoyers@efficios.com" target="_blank">mathieu.desnoyers@efficios.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div class="HOEnZb"><div class="h5">* Mel Gorman (<a href="mailto:mgorman@suse.de">mgorman@suse.de</a>) wrote:<br>
> On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:<br>
> > On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:<br>
> > > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <<a href="mailto:mathieu.desnoyers@efficios.com">mathieu.desnoyers@efficios.com</a>> wrote:<br>
> > ><br>
> > > > Hi,<br>
> > > ><br>
> > > > CCing lkml on this,<br>
> > > ><br>
> > > > * Yannick Brosseau (<a href="mailto:yannick.brosseau@gmail.com">yannick.brosseau@gmail.com</a>) wrote:<br>
> > > > > Hi all,<br>
> > > > ><br>
> > > > > We discovered a performance regression in recent kernels with LTTng<br>
> > > > > related to the use of fadvise DONTNEED.<br>
> > > > > A call to this syscall is present in the LTTng consumer.<br>
> > > > ><br>
> > > > > The following kernel commit causes the call to fadvise to sometimes be<br>
> > > > > much slower.<br>
> > > > ><br>
> > > > > Kernel commit info:<br>
> > > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard<br>
> > > > > all pages<br>
> > > > > main tree: (since 3.9-rc1)<br>
> > > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732<br>
> > > > > stable tree: (since 3.8.1)<br>
> > > > > <a href="https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d" target="_blank">https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d</a><br>
> > > > ><br>
> > > > > On the test workload, we observe that the call to fadvise takes about<br>
> > > > > 4-5 us before this patch is applied. After applying the patch, the<br>
> > > > > syscall sometimes takes anywhere from 5 us up to 4 ms (4000 us). The<br>
> > > > > effect on lttng is that the consumer is frozen for this long period,<br>
> > > > > which leads to dropped events in the trace.<br>
> > ><br>
> > > That change wasn't terribly efficient - if there are any unpopulated<br>
> > > pages in the range (which is quite likely), fadvise() will now always<br>
> > > call invalidate_mapping_pages() a second time.<br>
> > ><br>
> ><br>
> > I did not view the path as being performance critical and did not anticipate<br>
> > a delay that severe.<br>
><br>
> Which I should have, schedule_on_each_cpu is documented to be slow.<br>
><br>
> > The original test case as well was based on<br>
> > sequential IO as part of a backup so I was also generally expecting the<br>
> > range to be populated. I looked at the other users of lru_add_drain_all()<br>
> > but there are fairly few. compaction uses them but only when used via sysfs<br>
> > or proc. ksm uses it but it's not likely to be that noticeable. mlock uses<br>
> > it but it's unlikely it is being called frequently so I'm not going to<br>
> > worry about the performance of lru_add_drain_all() in general. I'll look closer at<br>
> > properly detecting when it's necessary to call in the fadvise case.<br>
> ><br>
><br>
> This compile-only tested prototype should detect remaining pages in the range<br>
> that were not invalidated. This will at least detect unpopulated pages but<br>
> whether it has any impact depends on what lttng is invalidating. If it's<br>
> invalidating a range of per-cpu traces then I doubt this will work because<br>
> there will always be remaining pages. Similarly I wonder if passing down<br>
> the mapping will really help if a large number of cpus are tracing as we<br>
> end up scheduling work on every CPU regardless.<br>
<br>
</div></div>Here is how LTTng consumerd is using POSIX_FADV_DONTNEED. Please let me<br>
know if we are doing something stupid. ;-)<br>
<br>
The LTTng consumerd deals with trace packets, each of which consists of a set of<br>
pages. I'll just discuss how the pages are moved around, leaving out the<br>
discussion of synchronization between producer and consumer.<br>
<br>
Whenever the Nth trace packet is ready to be written to disk:<br>
<br>
1) splice the entire set of pages of trace packet N to disk through a<br>
pipe,<br>
2) sync_file_range on trace packet N-1 with the following flags:<br>
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER<br>
so we wait (blocking) on the _previous_ trace packet to be completely<br>
written to disk.<br>
3) posix_fadvise POSIX_FADV_DONTNEED on trace packet N-1.<br>
<br>
There are a couple of comments in my consumerd code explaining why we do<br>
this sequence of steps:<br>
<br>
/*<br>
* Give hints to the kernel about how we access the file:<br>
* POSIX_FADV_DONTNEED : we won't re-access data in a near future after<br>
* we write it.<br>
*<br>
* We need to call fadvise again after the file grows because the<br>
* kernel does not seem to apply fadvise to non-existing parts of the<br>
* file.<br>
*<br>
* Call fadvise _after_ having waited for the page writeback to<br>
* complete because the dirty page writeback semantic is not well<br>
* defined. So it can be expected to lead to lower throughput in<br>
* streaming.<br>
*/<br>
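<br>
Steps 2 and 3 above can be sketched as a single helper. This is an illustration only, not the actual consumerd code: the name retire_previous_packet() and its parameters are hypothetical, and it assumes the caller has already spliced packet N to out_fd and knows the file offset and length of packet N-1.<br>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper, not the actual consumerd code.
 * Call after packet N has been spliced to out_fd; prev_off and prev_len
 * describe the file range occupied by packet N-1.
 * Returns 0 on success, -1 on error. */
static int retire_previous_packet(int out_fd, off_t prev_off, off_t prev_len)
{
	/* A length of 0 would mean "to end of file" for both calls. */
	assert(prev_off >= 0 && prev_len > 0);

	/* Step 2: block until packet N-1 is completely written to disk. */
	if (sync_file_range(out_fd, prev_off, prev_len,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) != 0)
		return -1;

	/* Step 3: the pages are clean now, so hint that they can be dropped.
	 * posix_fadvise() returns an error number directly, 0 on success. */
	if (posix_fadvise(out_fd, prev_off, prev_len, POSIX_FADV_DONTNEED) != 0)
		return -1;

	return 0;
}
```

Waiting on packet N-1 rather than N gives the write of the current packet time to proceed in the background while the previous one is retired.<br>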
<br>
Currently, the lttng-consumerd is single-threaded, but we plan to<br>
re-introduce multi-threading, and per-cpu affinity, in the near future.<br>
<br>
Yannick will try your patch tomorrow and report whether it improves<br>
performance or not,<br>
<br>
Thanks!<br>
<span class="HOEnZb"><font color="#888888"><br>
Mathieu<br>
</font></span><div class="HOEnZb"><div class="h5"><br>
><br>
> diff --git a/include/linux/fs.h b/include/linux/fs.h<br>
> index 2c28271..e2bb47e 100644<br>
> --- a/include/linux/fs.h<br>
> +++ b/include/linux/fs.h<br>
> @@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *, int);<br>
> #endif<br>
> unsigned long invalidate_mapping_pages(struct address_space *mapping,<br>
> pgoff_t start, pgoff_t end);<br>
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,<br>
> + pgoff_t start, pgoff_t end);<br>
><br>
> static inline void invalidate_remote_inode(struct inode *inode)<br>
> {<br>
> diff --git a/mm/fadvise.c b/mm/fadvise.c<br>
> index 7e09268..0579e60 100644<br>
> --- a/mm/fadvise.c<br>
> +++ b/mm/fadvise.c<br>
> @@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)<br>
> end_index = (endbyte >> PAGE_CACHE_SHIFT);<br>
><br>
> if (end_index >= start_index) {<br>
> - unsigned long count = invalidate_mapping_pages(mapping,<br>
> + unsigned long nr_remaining;<br>
> +<br>
> + nr_remaining = invalidate_mapping_pages_check(mapping,<br>
> start_index, end_index);<br>
><br>
> /*<br>
> @@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)<br>
> * a per-cpu pagevec for a remote CPU. Drain all<br>
> * pagevecs and try again.<br>
> */<br>
> - if (count < (end_index - start_index + 1)) {<br>
> + if (nr_remaining) {<br>
> lru_add_drain_all();<br>
> invalidate_mapping_pages(mapping, start_index,<br>
> end_index);<br>
> diff --git a/mm/truncate.c b/mm/truncate.c<br>
> index c75b736..86cfc2e 100644<br>
> --- a/mm/truncate.c<br>
> +++ b/mm/truncate.c<br>
> @@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)<br>
> }<br>
> EXPORT_SYMBOL(truncate_inode_pages);<br>
><br>
> -/**<br>
> - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode<br>
> - * @mapping: the address_space which holds the pages to invalidate<br>
> - * @start: the offset 'from' which to invalidate<br>
> - * @end: the offset 'to' which to invalidate (inclusive)<br>
> - *<br>
> - * This function only removes the unlocked pages, if you want to<br>
> - * remove all the pages of one inode, you must call truncate_inode_pages.<br>
> - *<br>
> - * invalidate_mapping_pages() will not block on IO activity. It will not<br>
> - * invalidate pages which are dirty, locked, under writeback or mapped into<br>
> - * pagetables.<br>
> - */<br>
> -unsigned long invalidate_mapping_pages(struct address_space *mapping,<br>
> - pgoff_t start, pgoff_t end)<br>
> +static void __invalidate_mapping_pages(struct address_space *mapping,<br>
> + pgoff_t start, pgoff_t end,<br>
> + unsigned long *ret_nr_invalidated,<br>
> + unsigned long *ret_nr_remaining)<br>
> {<br>
> struct pagevec pvec;<br>
> pgoff_t index = start;<br>
> unsigned long ret;<br>
> - unsigned long count = 0;<br>
> + unsigned long nr_invalidated = 0;<br>
> + unsigned long nr_remaining = 0;<br>
> int i;<br>
><br>
> /*<br>
> @@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,<br>
> if (index > end)<br>
> break;<br>
><br>
> - if (!trylock_page(page))<br>
> + if (!trylock_page(page)) {<br>
> + nr_remaining++;<br>
> continue;<br>
> + }<br>
> WARN_ON(page->index != index);<br>
> ret = invalidate_inode_page(page);<br>
> unlock_page(page);<br>
> @@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,<br>
> */<br>
> if (!ret)<br>
> deactivate_page(page);<br>
> - count += ret;<br>
> + else<br>
> + nr_remaining++;<br>
> + nr_invalidated += ret;<br>
> }<br>
> pagevec_release(&pvec);<br>
> mem_cgroup_uncharge_end();<br>
> cond_resched();<br>
> index++;<br>
> }<br>
> - return count;<br>
> +<br>
> + *ret_nr_invalidated = nr_invalidated;<br>
> + *ret_nr_remaining = nr_remaining;<br>
> }<br>
> EXPORT_SYMBOL(invalidate_mapping_pages);<br>
><br>
> +/**<br>
> + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode<br>
> + * @mapping: the address_space which holds the pages to invalidate<br>
> + * @start: the offset 'from' which to invalidate<br>
> + * @end: the offset 'to' which to invalidate (inclusive)<br>
> + *<br>
> + * This function only removes the unlocked pages, if you want to<br>
> + * remove all the pages of one inode, you must call truncate_inode_pages.<br>
> + *<br>
> + * invalidate_mapping_pages() will not block on IO activity. It will not<br>
> + * invalidate pages which are dirty, locked, under writeback or mapped into<br>
> + * pagetables.<br>
> + *<br>
> + * Returns the number of pages invalidated<br>
> + */<br>
> +unsigned long invalidate_mapping_pages(struct address_space *mapping,<br>
> + pgoff_t start, pgoff_t end)<br>
> +{<br>
> + unsigned long nr_invalidated, nr_remaining;<br>
> +<br>
> + __invalidate_mapping_pages(mapping, start, end,<br>
> + &nr_invalidated, &nr_remaining);<br>
> +<br>
> + return nr_invalidated;<br>
> +}<br>
> +<br>
> +/**<br>
> + * invalidate_mapping_pages_check - Invalidate all the unlocked pages of one inode and check for remaining pages.<br>
> + * @mapping: the address_space which holds the pages to invalidate<br>
> + * @start: the offset 'from' which to invalidate<br>
> + * @end: the offset 'to' which to invalidate (inclusive)<br>
> + *<br>
> + * This function only removes the unlocked pages, if you want to<br>
> + * remove all the pages of one inode, you must call truncate_inode_pages.<br>
> + *<br>
> + * invalidate_mapping_pages() will not block on IO activity. It will not<br>
> + * invalidate pages which are dirty, locked, under writeback or mapped into<br>
> + * pagetables.<br>
> + *<br>
> + * Returns the number of pages remaining in the invalidated range<br>
> + */<br>
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,<br>
> + pgoff_t start, pgoff_t end)<br>
> +{<br>
> + unsigned long nr_invalidated, nr_remaining;<br>
> +<br>
> + __invalidate_mapping_pages(mapping, start, end,<br>
> + &nr_invalidated, &nr_remaining);<br>
> +<br>
> + return nr_remaining;<br>
> +}<br>
> +<br>
> /*<br>
> * This is like invalidate_complete_page(), except it ignores the page's<br>
> * refcount. We do this because invalidate_inode_pages2() needs stronger<br>
<br>
</div></div><div class="HOEnZb"><div class="h5">--<br>
Mathieu Desnoyers<br>
EfficiOS Inc.<br>
<a href="http://www.efficios.com" target="_blank">http://www.efficios.com</a><br>
</div></div></blockquote></div><br></div>