[lttng-dev] [-stable 3.8.1 performance regression] madvise POSIX_FADV_DONTNEED

Rob van der Heij rvdheij at gmail.com
Thu Jun 20 02:36:59 EDT 2013


Wouldn't you batch the calls to drop the pages from cache rather than drop
them one packet at a time? Your effort to help the Linux mm seems a bit of
overkill, and you don't want every application to have to do it like that
itself. The fadvise will not even work when the page is still waiting to be
flushed out. Without the patch that started the thread, it would fail 'at
random' due to an SMP race condition (not multi-threading).

I believe the right way would be for Linux to implement the fadvise flags
properly to guide cache replacement.

My situation was slightly different. I run in a virtualized environment
where it would be a global improvement for the Linux guest not to cache the
data, even though from the Linux mm perspective there is plenty of space
and no good reason not to keep it. My scenario intercepted the close() call
to drop each input file during a tar or file system scan. This avoids
building a huge page cache full of data that will not be referenced until
the next backup and would otherwise get paged out by the hypervisor.
There's some more here:
http://zvmperf.wordpress.com/2013/01/27/taming-linux-page-cache/

Rob


On 19 June 2013 21:25, Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:

> * Mel Gorman (mgorman at suse.de) wrote:
> > On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:
> > > On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> > > > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <mathieu.desnoyers at efficios.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > CCing lkml on this,
> > > > >
> > > > > * Yannick Brosseau (yannick.brosseau at gmail.com) wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > We discovered a performance regression in recent kernels with
> > > > > > LTTng related to the use of fadvise DONTNEED.
> > > > > > A call to this syscall is present in the LTTng consumer.
> > > > > >
> > > > > > The following kernel commit causes the call to fadvise to
> > > > > > sometimes be much slower.
> > > > > >
> > > > > > Kernel commit info:
> > > > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to
> > > > > > discard all pages
> > > > > > main tree: (since 3.9-rc1)
> > > > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > > > > stable tree: (since 3.8.1)
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > > > > >
> > > > > > On the workload test, we observe that the call to fadvise takes
> > > > > > about 4-5 us before this patch is applied. After applying the
> > > > > > patch, the syscall sometimes takes anywhere from 5 us up to 4 ms
> > > > > > (4000 us). The effect on LTTng is that the consumer is frozen
> > > > > > for this long period, which leads to dropped events in the
> > > > > > trace.
> > > >
> > > > That change wasn't terribly efficient - if there are any unpopulated
> > > > pages in the range (which is quite likely), fadvise() will now always
> > > > call invalidate_mapping_pages() a second time.
> > > >
> > >
> > > I did not view the path as being performance critical and did not
> > > anticipate a delay that severe.
> >
> > Which I should have, schedule_on_each_cpu is documented to be slow.
> >
> > > The original test case as well was based on sequential IO as part of
> > > a backup, so I was also generally expecting the range to be populated.
> > > I looked at the other users of lru_add_drain_all() but there are
> > > fairly few. Compaction uses it, but only when triggered via sysfs or
> > > proc. ksm uses it, but it's not likely to be that noticeable. mlock
> > > uses it, but it's unlikely it is being called frequently, so I'm not
> > > going to worry about the performance of lru_add_drain_all() in
> > > general. I'll look closer at properly detecting when it's necessary
> > > to call it in the fadvise case.
> > >
> >
> > This compile-only tested prototype should detect remaining pages in
> > the range that were not invalidated. This will at least detect
> > unpopulated pages, but whether it has any impact depends on what lttng
> > is invalidating. If it's invalidating a range of per-cpu traces then I
> > doubt this will work because there will always be remaining pages.
> > Similarly, I wonder if passing down the mapping will really help if a
> > large number of cpus are tracing, as we end up scheduling work on every
> > CPU regardless.
>
> Here is how LTTng consumerd is using POSIX_FADV_DONTNEED. Please let me
> know if we are doing something stupid. ;-)
>
> The LTTng consumerd deals with trace packets, each of which consists of
> a set of pages. I'll just discuss how the pages are moved around, leaving
> out any discussion of synchronization between producer and consumer.
>
> Whenever the Nth trace packet is ready to be written to disk:
>
> 1) splice the entire set of pages of trace packet N to disk through a
>    pipe,
> 2) sync_file_range on trace packet N-1 with the following flags:
>    SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
>    SYNC_FILE_RANGE_WAIT_AFTER,
>    so we wait (blocking) on the _previous_ trace packet to be completely
>    written to disk,
> 3) posix_fadvise POSIX_FADV_DONTNEED on trace packet N-1.
>
> There are a couple of comments in my consumerd code explaining why we do
> this sequence of steps:
>
>         /*
>          * Give hints to the kernel about how we access the file:
>          * POSIX_FADV_DONTNEED : we won't re-access data in a near future
>          * after we write it.
>          *
>          * We need to call fadvise again after the file grows because the
>          * kernel does not seem to apply fadvise to non-existing parts of
>          * the file.
>          *
>          * Call fadvise _after_ having waited for the page writeback to
>          * complete because the dirty page writeback semantic is not well
>          * defined. So it can be expected to lead to lower throughput in
>          * streaming.
>          */
>
> Currently, the lttng-consumerd is single-threaded, but we plan to
> re-introduce multi-threading, and per-cpu affinity, in the near future.
>
> Yannick will try your patch tomorrow and report whether it improves
> performance or not,
>
> Thanks!
>
> Mathieu
>
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 2c28271..e2bb47e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *, int);
> >  #endif
> >  unsigned long invalidate_mapping_pages(struct address_space *mapping,
> >                                       pgoff_t start, pgoff_t end);
> > +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> > +                                     pgoff_t start, pgoff_t end);
> >
> >  static inline void invalidate_remote_inode(struct inode *inode)
> >  {
> > diff --git a/mm/fadvise.c b/mm/fadvise.c
> > index 7e09268..0579e60 100644
> > --- a/mm/fadvise.c
> > +++ b/mm/fadvise.c
> > @@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >               end_index = (endbyte >> PAGE_CACHE_SHIFT);
> >
> >               if (end_index >= start_index) {
> > -                     unsigned long count = invalidate_mapping_pages(mapping,
> > +                     unsigned long nr_remaining;
> > +
> > +                     nr_remaining = invalidate_mapping_pages_check(mapping,
> >                                               start_index, end_index);
> >
> >                       /*
> > @@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> >                        * a per-cpu pagevec for a remote CPU. Drain all
> >                        * pagevecs and try again.
> >                        */
> > -                     if (count < (end_index - start_index + 1)) {
> > +                     if (nr_remaining) {
> >                               lru_add_drain_all();
> >                               invalidate_mapping_pages(mapping, start_index,
> >                                               end_index);
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index c75b736..86cfc2e 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
> >  }
> >  EXPORT_SYMBOL(truncate_inode_pages);
> >
> > -/**
> > - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> > - * @mapping: the address_space which holds the pages to invalidate
> > - * @start: the offset 'from' which to invalidate
> > - * @end: the offset 'to' which to invalidate (inclusive)
> > - *
> > - * This function only removes the unlocked pages, if you want to
> > - * remove all the pages of one inode, you must call truncate_inode_pages.
> > - *
> > - * invalidate_mapping_pages() will not block on IO activity. It will not
> > - * invalidate pages which are dirty, locked, under writeback or mapped into
> > - * pagetables.
> > - */
> > -unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > -             pgoff_t start, pgoff_t end)
> > +static void __invalidate_mapping_pages(struct address_space *mapping,
> > +             pgoff_t start, pgoff_t end,
> > +             unsigned long *ret_nr_invalidated,
> > +             unsigned long *ret_nr_remaining)
> >  {
> >       struct pagevec pvec;
> >       pgoff_t index = start;
> >       unsigned long ret;
> > -     unsigned long count = 0;
> > +     unsigned long nr_invalidated = 0;
> > +     unsigned long nr_remaining = 0;
> >       int i;
> >
> >       /*
> > @@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> >                       if (index > end)
> >                               break;
> >
> > -                     if (!trylock_page(page))
> > +                     if (!trylock_page(page)) {
> > +                             nr_remaining++;
> >                               continue;
> > +                     }
> >                       WARN_ON(page->index != index);
> >                       ret = invalidate_inode_page(page);
> >                       unlock_page(page);
> > @@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> >                        */
> >                       if (!ret)
> >                               deactivate_page(page);
> > -                     count += ret;
> > +                     else
> > +                             nr_remaining++;
> > +                     nr_invalidated += ret;
> >               }
> >               pagevec_release(&pvec);
> >               mem_cgroup_uncharge_end();
> >               cond_resched();
> >               index++;
> >       }
> > -     return count;
> > +
> > +     *ret_nr_invalidated = nr_invalidated;
> > +     *ret_nr_remaining = nr_remaining;
> >  }
> >  EXPORT_SYMBOL(invalidate_mapping_pages);
> >
> > +/**
> > + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> > + * @mapping: the address_space which holds the pages to invalidate
> > + * @start: the offset 'from' which to invalidate
> > + * @end: the offset 'to' which to invalidate (inclusive)
> > + *
> > + * This function only removes the unlocked pages, if you want to
> > + * remove all the pages of one inode, you must call truncate_inode_pages.
> > + *
> > + * invalidate_mapping_pages() will not block on IO activity. It will not
> > + * invalidate pages which are dirty, locked, under writeback or mapped into
> > + * pagetables.
> > + *
> > + * Returns the number of pages invalidated
> > + */
> > +unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > +             pgoff_t start, pgoff_t end)
> > +{
> > +     unsigned long nr_invalidated, nr_remaining;
> > +
> > +     __invalidate_mapping_pages(mapping, start, end,
> > +                                &nr_invalidated, &nr_remaining);
> > +
> > +     return nr_invalidated;
> > +}
> > +
> > +/**
> > + * invalidate_mapping_pages_check - Invalidate all the unlocked pages of one inode and check for remaining pages.
> > + * @mapping: the address_space which holds the pages to invalidate
> > + * @start: the offset 'from' which to invalidate
> > + * @end: the offset 'to' which to invalidate (inclusive)
> > + *
> > + * This function only removes the unlocked pages, if you want to
> > + * remove all the pages of one inode, you must call truncate_inode_pages.
> > + *
> > + * invalidate_mapping_pages() will not block on IO activity. It will not
> > + * invalidate pages which are dirty, locked, under writeback or mapped into
> > + * pagetables.
> > + *
> > + * Returns the number of pages remaining in the invalidated range
> > + */
> > +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> > +             pgoff_t start, pgoff_t end)
> > +{
> > +     unsigned long nr_invalidated, nr_remaining;
> > +
> > +     __invalidate_mapping_pages(mapping, start, end,
> > +                                &nr_invalidated, &nr_remaining);
> > +
> > +     return nr_remaining;
> > +}
> > +
> >  /*
> >   * This is like invalidate_complete_page(), except it ignores the page's
> >   * refcount.  We do this because invalidate_inode_pages2() needs stronger
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
>

