[04/15] mm: remove the pgmap field from struct hmm_vma_walk

Submitted by Jason Gunthorpe on Aug. 7, 2019, 5:45 p.m.

Message ID 20190807174548.GJ1571@mellanox.com
State New
Series "Series without cover letter" ( rev: 2 ) in Nouveau

Commit Message

Jason Gunthorpe Aug. 7, 2019, 5:45 p.m.
On Tue, Aug 06, 2019 at 07:05:42PM +0300, Christoph Hellwig wrote:
> There is only a single place where the pgmap is passed over a function
> call, so replace it with local variables in the places where we deal
> with the pgmap.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>  mm/hmm.c | 62 ++++++++++++++++++++++++--------------------------------
>  1 file changed, 27 insertions(+), 35 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9a908902e4cc..d66fa29b42e0 100644
> +++ b/mm/hmm.c
> @@ -278,7 +278,6 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
>  
>  struct hmm_vma_walk {
>  	struct hmm_range	*range;
> -	struct dev_pagemap	*pgmap;
>  	unsigned long		last;
>  	unsigned int		flags;
>  };
> @@ -475,6 +474,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	struct hmm_vma_walk *hmm_vma_walk = walk->private;
>  	struct hmm_range *range = hmm_vma_walk->range;
> +	struct dev_pagemap *pgmap = NULL;
>  	unsigned long pfn, npages, i;
>  	bool fault, write_fault;
>  	uint64_t cpu_flags;
> @@ -490,17 +490,14 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>  	pfn = pmd_pfn(pmd) + pte_index(addr);
>  	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
>  		if (pmd_devmap(pmd)) {
> -			hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> -					      hmm_vma_walk->pgmap);
> -			if (unlikely(!hmm_vma_walk->pgmap))
> +			pgmap = get_dev_pagemap(pfn, pgmap);
> +			if (unlikely(!pgmap))
>  				return -EBUSY;

Unrelated to this patch, but what is the point of getting and checking
that the pgmap exists for the page and then immediately releasing it?
This code has this pattern in several places.

It feels racy
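
For reference, the pattern being questioned, condensed from the hunk
above and the put a few lines further down (a simplified sketch of the
shape of the code before and after this patch, not the verbatim
mm/hmm.c source):

	for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
		if (pmd_devmap(pmd)) {
			/* take a pgmap reference just to prove it exists */
			pgmap = get_dev_pagemap(pfn, pgmap);
			if (unlikely(!pgmap))
				return -EBUSY;
		}
		pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
	}
	/* ... and drop it again before anything dereferences it */
	if (pgmap)
		put_dev_pagemap(pgmap);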

>  		}
>  		pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
>  	}
> -	if (hmm_vma_walk->pgmap) {
> -		put_dev_pagemap(hmm_vma_walk->pgmap);
> -		hmm_vma_walk->pgmap = NULL;

Putting the value in the hmm_vma_walk would have made some sense to me
if the pgmap was not set to NULL all over the place. Then most of the
xa_loads would be eliminated, as I would expect the pgmap to be
mostly uniform for these use cases.

Is there some reason the pgmap ref can't be held across
faulting/sleeping? ie like below.

Anyhow, I looked over this pretty carefully and the change looks
functionally OK, I just don't know why the code is like this in the
first place.

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>

diff --git a/mm/hmm.c b/mm/hmm.c
index 9a908902e4cc38..4e30128c23a505 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -497,10 +497,6 @@  static int hmm_vma_handle_pmd(struct mm_walk *walk,
 		}
 		pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
 	}
-	if (hmm_vma_walk->pgmap) {
-		put_dev_pagemap(hmm_vma_walk->pgmap);
-		hmm_vma_walk->pgmap = NULL;
-	}
 	hmm_vma_walk->last = end;
 	return 0;
 #else
@@ -604,10 +600,6 @@  static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	return 0;
 
 fault:
-	if (hmm_vma_walk->pgmap) {
-		put_dev_pagemap(hmm_vma_walk->pgmap);
-		hmm_vma_walk->pgmap = NULL;
-	}
 	pte_unmap(ptep);
 	/* Fault any virtual address we were asked to fault */
 	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -690,16 +682,6 @@  static int hmm_vma_walk_pmd(pmd_t *pmdp,
 			return r;
 		}
 	}
-	if (hmm_vma_walk->pgmap) {
-		/*
-		 * We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
-		 * so that we can leverage get_dev_pagemap() optimization which
-		 * will not re-take a reference on a pgmap if we already have
-		 * one.
-		 */
-		put_dev_pagemap(hmm_vma_walk->pgmap);
-		hmm_vma_walk->pgmap = NULL;
-	}
 	pte_unmap(ptep - 1);
 
 	hmm_vma_walk->last = addr;
@@ -751,10 +733,6 @@  static int hmm_vma_walk_pud(pud_t *pudp,
 			pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
 				  cpu_flags;
 		}
-		if (hmm_vma_walk->pgmap) {
-			put_dev_pagemap(hmm_vma_walk->pgmap);
-			hmm_vma_walk->pgmap = NULL;
-		}
 		hmm_vma_walk->last = end;
 		return 0;
 	}
@@ -1026,6 +1004,14 @@  long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 			/* Keep trying while the range is valid. */
 		} while (ret == -EBUSY && range->valid);
 
+		/*
+		 * We do put_dev_pagemap() here so that we can leverage
+		 * get_dev_pagemap() optimization which will not re-take a
+		 * reference on a pgmap if we already have one.
+		 */
+		if (hmm_vma_walk->pgmap)
+			put_dev_pagemap(hmm_vma_walk->pgmap);
+
 		if (ret) {
 			unsigned long i;
 

Comments

On Wed, Aug 7, 2019 at 10:45 AM Jason Gunthorpe <jgg@mellanox.com> wrote:
>
> On Tue, Aug 06, 2019 at 07:05:42PM +0300, Christoph Hellwig wrote:
> > There is only a single place where the pgmap is passed over a function
> > call, so replace it with local variables in the places where we deal
> > with the pgmap.
> >
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> >  mm/hmm.c | 62 ++++++++++++++++++++++++--------------------------------
> >  1 file changed, 27 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 9a908902e4cc..d66fa29b42e0 100644
> > +++ b/mm/hmm.c
> > @@ -278,7 +278,6 @@ EXPORT_SYMBOL(hmm_mirror_unregister);
> >
> >  struct hmm_vma_walk {
> >       struct hmm_range        *range;
> > -     struct dev_pagemap      *pgmap;
> >       unsigned long           last;
> >       unsigned int            flags;
> >  };
> > @@ -475,6 +474,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >       struct hmm_vma_walk *hmm_vma_walk = walk->private;
> >       struct hmm_range *range = hmm_vma_walk->range;
> > +     struct dev_pagemap *pgmap = NULL;
> >       unsigned long pfn, npages, i;
> >       bool fault, write_fault;
> >       uint64_t cpu_flags;
> > @@ -490,17 +490,14 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
> >       pfn = pmd_pfn(pmd) + pte_index(addr);
> >       for (i = 0; addr < end; addr += PAGE_SIZE, i++, pfn++) {
> >               if (pmd_devmap(pmd)) {
> > -                     hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
> > -                                           hmm_vma_walk->pgmap);
> > -                     if (unlikely(!hmm_vma_walk->pgmap))
> > +                     pgmap = get_dev_pagemap(pfn, pgmap);
> > +                     if (unlikely(!pgmap))
> >                               return -EBUSY;
>
> Unrelated to this patch, but what is the point of getting and checking
> that the pgmap exists for the page and then immediately releasing it?
> This code has this pattern in several places.
>
> It feels racy

Agree, not sure what the intent is here. The only other reason to call
get_dev_pagemap() is to just check in general if the pfn is indeed
owned by some ZONE_DEVICE instance, but if the intent is to make sure
the device is still attached/enabled that check is invalidated at
put_dev_pagemap().

If it's the former case, validating ZONE_DEVICE pfns, I imagine we can
do something cheaper with a helper that is on the order of the same
cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag
or something similar.
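
Something along those lines could be as simple as the below (a rough
sketch using only existing helpers; pfn_is_zone_device() is a made-up
name, and the mem_section-flag variant would avoid even the
pfn_to_page() dereference):

	static inline bool pfn_is_zone_device(unsigned long pfn)
	{
		if (!pfn_valid(pfn))
			return false;
		return is_zone_device_page(pfn_to_page(pfn));
	}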

>
> >               }
> >               pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
> >       }
> > -     if (hmm_vma_walk->pgmap) {
> > -             put_dev_pagemap(hmm_vma_walk->pgmap);
> > -             hmm_vma_walk->pgmap = NULL;
>
> Putting the value in the hmm_vma_walk would have made some sense to me
> if the pgmap was not set to NULL all over the place. Then most of the
> xa_loads would be eliminated, as I would expect the pgmap to be
> mostly uniform for these use cases.
>
> Is there some reason the pgmap ref can't be held across
> faulting/sleeping? ie like below.

No restriction on holding refs over faulting / sleeping.

>
> Anyhow, I looked over this pretty carefully and the change looks
> functionally OK, I just don't know why the code is like this in the
> first place.
>
> Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 9a908902e4cc38..4e30128c23a505 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -497,10 +497,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk,
>                 }
>                 pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
>         }
> -       if (hmm_vma_walk->pgmap) {
> -               put_dev_pagemap(hmm_vma_walk->pgmap);
> -               hmm_vma_walk->pgmap = NULL;
> -       }
>         hmm_vma_walk->last = end;
>         return 0;
>  #else
> @@ -604,10 +600,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>         return 0;
>
>  fault:
> -       if (hmm_vma_walk->pgmap) {
> -               put_dev_pagemap(hmm_vma_walk->pgmap);
> -               hmm_vma_walk->pgmap = NULL;
> -       }
>         pte_unmap(ptep);
>         /* Fault any virtual address we were asked to fault */
>         return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
> @@ -690,16 +682,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
>                         return r;
>                 }
>         }
> -       if (hmm_vma_walk->pgmap) {
> -               /*
> -                * We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
> -                * so that we can leverage get_dev_pagemap() optimization which
> -                * will not re-take a reference on a pgmap if we already have
> -                * one.
> -                */
> -               put_dev_pagemap(hmm_vma_walk->pgmap);
> -               hmm_vma_walk->pgmap = NULL;
> -       }
>         pte_unmap(ptep - 1);
>
>         hmm_vma_walk->last = addr;
> @@ -751,10 +733,6 @@ static int hmm_vma_walk_pud(pud_t *pudp,
>                         pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
>                                   cpu_flags;
>                 }
> -               if (hmm_vma_walk->pgmap) {
> -                       put_dev_pagemap(hmm_vma_walk->pgmap);
> -                       hmm_vma_walk->pgmap = NULL;
> -               }
>                 hmm_vma_walk->last = end;
>                 return 0;
>         }
> @@ -1026,6 +1004,14 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
>                         /* Keep trying while the range is valid. */
>                 } while (ret == -EBUSY && range->valid);
>
> +               /*
> +                * We do put_dev_pagemap() here so that we can leverage
> +                * get_dev_pagemap() optimization which will not re-take a
> +                * reference on a pgmap if we already have one.
> +                */
> +               if (hmm_vma_walk->pgmap)
> +                       put_dev_pagemap(hmm_vma_walk->pgmap);
> +

Seems ok, but only if the caller is guaranteeing that the range does
not span outside of a single pagemap instance. If that guarantee is
met why not just have the caller pass in a pinned pagemap? If that
guarantee is not met, then I think we're back to your race concern.
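
Purely as an illustration of "pass in a pinned pagemap" (none of this
exists today; the pgmap member of struct hmm_range and where the caller
gets its pgmap from are both hypothetical):

	pgmap = get_dev_pagemap(device_pfn, NULL);	/* caller pins once */
	if (!pgmap)
		return -EINVAL;
	range.pgmap = pgmap;				/* hypothetical field */
	ret = hmm_range_fault(&range, 0);
	put_dev_pagemap(pgmap);

where device_pfn is some pfn the caller already knows belongs to its
device memory.
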
On Wed, Aug 07, 2019 at 11:47:22AM -0700, Dan Williams wrote:
> > Unrelated to this patch, but what is the point of getting and checking
> > that the pgmap exists for the page and then immediately releasing it?
> > This code has this pattern in several places.
> >
> > It feels racy
> 
> Agree, not sure what the intent is here. The only other reason to call
> get_dev_pagemap() is to just check in general if the pfn is indeed
> owned by some ZONE_DEVICE instance, but if the intent is to make sure
> the device is still attached/enabled that check is invalidated at
> put_dev_pagemap().
> 
> If it's the former case, validating ZONE_DEVICE pfns, I imagine we can
> do something cheaper with a helper that is on the order of the same
> cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag
> or something similar.

The hmm literally never dereferences the pgmap, so validity checking is
the only explanation for it.

> > +               /*
> > +                * We do put_dev_pagemap() here so that we can leverage
> > +                * get_dev_pagemap() optimization which will not re-take a
> > +                * reference on a pgmap if we already have one.
> > +                */
> > +               if (hmm_vma_walk->pgmap)
> > +                       put_dev_pagemap(hmm_vma_walk->pgmap);
> > +
> 
> Seems ok, but only if the caller is guaranteeing that the range does
> not span outside of a single pagemap instance. If that guarantee is
> met why not just have the caller pass in a pinned pagemap? If that
> guarantee is not met, then I think we're back to your race concern.

It iterates over multiple ptes in a non-huge pmd.  Is there any kind of
limitation on different pgmap instances inside a pmd?  I can't think
of one, so this might actually be a bug.
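
For reference, get_dev_pagemap() already copes with the pgmap changing
from one pfn to the next: if the pfn passed in falls outside the pgmap
we already hold, it drops that reference and looks up (and takes a
reference on) the matching one.  Roughly, paraphrasing rather than
quoting the current implementation:

	struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
					    struct dev_pagemap *pgmap)
	{
		resource_size_t phys = PFN_PHYS(pfn);

		/* fast path: pfn still covered by the pgmap we hold */
		if (pgmap) {
			if (phys >= pgmap->res.start && phys <= pgmap->res.end)
				return pgmap;
			put_dev_pagemap(pgmap);
		}
		/* slow path: xa_load() lookup + percpu_ref_tryget_live() */
		...
	}

i.e. the cached pgmap is swapped, not reused blindly, when the pfn
moves to a different pgmap.
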
On Wed, Aug 7, 2019 at 11:59 PM Christoph Hellwig <hch@lst.de> wrote:
>
> On Wed, Aug 07, 2019 at 11:47:22AM -0700, Dan Williams wrote:
> > > Unrelated to this patch, but what is the point of getting and checking
> > > that the pgmap exists for the page and then immediately releasing it?
> > > This code has this pattern in several places.
> > >
> > > It feels racy
> >
> > Agree, not sure what the intent is here. The only other reason to call
> > get_dev_pagemap() is to just check in general if the pfn is indeed
> > owned by some ZONE_DEVICE instance, but if the intent is to make sure
> > the device is still attached/enabled that check is invalidated at
> > put_dev_pagemap().
> >
> > If it's the former case, validating ZONE_DEVICE pfns, I imagine we can
> > do something cheaper with a helper that is on the order of the same
> > cost as pfn_valid(). I.e. replace PTE_DEVMAP with a mem_section flag
> > or something similar.
>
> The hmm literally never dereferences the pgmap, so validity checking is
> the only explanation for it.
>
> > > +               /*
> > > +                * We do put_dev_pagemap() here so that we can leverage
> > > +                * get_dev_pagemap() optimization which will not re-take a
> > > +                * reference on a pgmap if we already have one.
> > > +                */
> > > +               if (hmm_vma_walk->pgmap)
> > > +                       put_dev_pagemap(hmm_vma_walk->pgmap);
> > > +
> >
> > Seems ok, but only if the caller is guaranteeing that the range does
> > not span outside of a single pagemap instance. If that guarantee is
> > met why not just have the caller pass in a pinned pagemap? If that
> > guarantee is not met, then I think we're back to your race concern.
>
> It iterates over multiple ptes in a non-huge pmd.  Is there any kind of
> limitation on different pgmap instances inside a pmd?  I can't think
> of one, so this might actually be a bug.

Section alignment constraints somewhat save us here. The only example
I can think of where a PMD does not contain a uniform pgmap association
for each pte is the case where the pgmap overlaps normal DRAM, i.e.
shares the same 'struct mem_section' for a given span. Otherwise,
distinct pgmaps arrange to manage their own exclusive sections (and now
subsections as of v5.3); if they did not, the implementation could not
guarantee different mapping lifetimes.

That said, this seems to want a better mechanism to determine "pfn is
ZONE_DEVICE".

On Tue, Aug 13, 2019 at 06:36:33PM -0700, Dan Williams wrote:
> Section alignment constraints somewhat save us here. The only example
> I can think of where a PMD does not contain a uniform pgmap association
> for each pte is the case where the pgmap overlaps normal DRAM, i.e.
> shares the same 'struct mem_section' for a given span. Otherwise,
> distinct pgmaps arrange to manage their own exclusive sections (and now
> subsections as of v5.3); if they did not, the implementation could not
> guarantee different mapping lifetimes.
> 
> That said, this seems to want a better mechanism to determine "pfn is
> ZONE_DEVICE".

So I guess this patch is fine for now, and once you provide a better
mechanism we can switch over to it?