mm/hmm: replace hmm_update with mmu_notifier_range

Submitted by Ralph Campbell on July 23, 2019, 9:05 p.m.

Details

Message ID 20190723210506.25127-1-rcampbell@nvidia.com
State New
Headers show
Series "mm/hmm: replace hmm_update with mmu_notifier_range" ( rev: 1 ) in DRI devel

Not browsing as part of any series.

Commit Message

Ralph Campbell July 23, 2019, 9:05 p.m.
The hmm_mirror_ops callback function sync_cpu_device_pagetables() passes
a struct hmm_update which is a simplified version of struct
mmu_notifier_range. This is unnecessary so replace hmm_update with
mmu_notifier_range directly.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ben Skeggs <bskeggs@redhat.com>
---

This is based on 5.3.0-rc1 plus Christoph Hellwig's 6 patches
("hmm_range_fault related fixes and legacy API removal v2").
Jason, I believe this is the patch you were requesting.

 drivers/gpu/drm/nouveau/nouveau_svm.c |  4 ++--
 include/linux/hmm.h                   | 31 ++++-----------------------
 mm/hmm.c                              | 13 ++++-------
 3 files changed, 10 insertions(+), 38 deletions(-)

Patch hide | download patch | download mbox

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index a9c5c58d425b..6298d2dadb55 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -252,13 +252,13 @@  nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit)
 
 static int
 nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-					const struct hmm_update *update)
+					const struct mmu_notifier_range *update)
 {
 	struct nouveau_svmm *svmm = container_of(mirror, typeof(*svmm), mirror);
 	unsigned long start = update->start;
 	unsigned long limit = update->end;
 
-	if (!update->blockable)
+	if (!mmu_notifier_range_blockable(update))
 		return -EAGAIN;
 
 	SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit);
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 9f32586684c9..659e25a15700 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -340,29 +340,6 @@  static inline uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range,
 
 struct hmm_mirror;
 
-/*
- * enum hmm_update_event - type of update
- * @HMM_UPDATE_INVALIDATE: invalidate range (no indication as to why)
- */
-enum hmm_update_event {
-	HMM_UPDATE_INVALIDATE,
-};
-
-/*
- * struct hmm_update - HMM update information for callback
- *
- * @start: virtual start address of the range to update
- * @end: virtual end address of the range to update
- * @event: event triggering the update (what is happening)
- * @blockable: can the callback block/sleep ?
- */
-struct hmm_update {
-	unsigned long start;
-	unsigned long end;
-	enum hmm_update_event event;
-	bool blockable;
-};
-
 /*
  * struct hmm_mirror_ops - HMM mirror device operations callback
  *
@@ -383,9 +360,9 @@  struct hmm_mirror_ops {
 	/* sync_cpu_device_pagetables() - synchronize page tables
 	 *
 	 * @mirror: pointer to struct hmm_mirror
-	 * @update: update information (see struct hmm_update)
-	 * Return: -EAGAIN if update.blockable false and callback need to
-	 *          block, 0 otherwise.
+	 * @update: update information (see struct mmu_notifier_range)
+	 * Return: -EAGAIN if mmu_notifier_range_blockable(update) is false
+	 * and callback needs to block, 0 otherwise.
 	 *
 	 * This callback ultimately originates from mmu_notifiers when the CPU
 	 * page table is updated. The device driver must update its page table
@@ -397,7 +374,7 @@  struct hmm_mirror_ops {
 	 * synchronous call.
 	 */
 	int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
-					  const struct hmm_update *update);
+				const struct mmu_notifier_range *update);
 };
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index 16b6731a34db..b810a4fa3de9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -165,7 +165,6 @@  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
-	struct hmm_update update;
 	struct hmm_range *range;
 	unsigned long flags;
 	int ret = 0;
@@ -173,15 +172,10 @@  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	if (!kref_get_unless_zero(&hmm->kref))
 		return 0;
 
-	update.start = nrange->start;
-	update.end = nrange->end;
-	update.event = HMM_UPDATE_INVALIDATE;
-	update.blockable = mmu_notifier_range_blockable(nrange);
-
 	spin_lock_irqsave(&hmm->ranges_lock, flags);
 	hmm->notifiers++;
 	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update.end < range->start || update.start >= range->end)
+		if (nrange->end < range->start || nrange->start >= range->end)
 			continue;
 
 		range->valid = false;
@@ -198,9 +192,10 @@  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	list_for_each_entry(mirror, &hmm->mirrors, list) {
 		int rc;
 
-		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
+		rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
 		if (rc) {
-			if (WARN_ON(update.blockable || rc != -EAGAIN))
+			if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
+			    rc != -EAGAIN))
 				continue;
 			ret = -EAGAIN;
 			break;

Comments

On Wed 24-07-19 12:28:58, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 09:05:53AM +0200, Christoph Hellwig wrote:
> > Looks good:
> > 
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > 
> > One comment on a related cleanup:
> > 
> > >  	list_for_each_entry(mirror, &hmm->mirrors, list) {
> > >  		int rc;
> > >  
> > > -		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> > > +		rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
> > >  		if (rc) {
> > > -			if (WARN_ON(update.blockable || rc != -EAGAIN))
> > > +			if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
> > > +			    rc != -EAGAIN))
> > >  				continue;
> > >  			ret = -EAGAIN;
> > >  			break;
> > 
> > This magic handling of error seems odd.  I think we should merge rc and
> > ret into one variable and just break out if any error happens instead
> > or claiming in the comments -EAGAIN is the only valid error and then
> > ignoring all others here.
> 
> The WARN_ON is enforcing the rules already commented near
> mmuu_notifier_ops.invalidate_start - we could break or continue, it
> doesn't much matter how to recover from a broken driver, but since we
> did the WARN_ON this should sanitize the ret to EAGAIN or 0
> 
> Humm. Actually having looked this some more, I wonder if this is a
> problem:
> 
> I see in __oom_reap_task_mm():
> 
> 			if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
> 				tlb_finish_mmu(&tlb, range.start, range.end);
> 				ret = false;
> 				continue;
> 			}
> 			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
> 			mmu_notifier_invalidate_range_end(&range);
> 
> Which looks like it creates an unbalanced start/end pairing if any
> start returns EAGAIN?
> 
> This does not seem OK.. Many users require start/end to be paired to
> keep track of their internal locking. Ie for instance hmm breaks
> because the hmm->notifiers counter becomes unable to get to 0.
> 
> Below is the best idea I've had so far..
> 
> Michal, what do you think?

IIRC we have discussed this with Jerome back then when I've introduced
this code and unless I misremember he said the current code was OK.
Maybe new users have started relying on a new semantic in the meantime,
back then, none of the notifier has even started any action in blocking
mode on a EAGAIN bailout. Most of them simply did trylock early in the
process and bailed out so there was nothing to do for the range_end
callback.

Has this changed?
On Wed 24-07-19 17:33:05, Christoph Hellwig wrote:
> On Wed, Jul 24, 2019 at 12:28:58PM -0300, Jason Gunthorpe wrote:
> > Humm. Actually having looked this some more, I wonder if this is a
> > problem:
> 
> What a mess.
> 
> Question: do we even care for the non-blocking events?  The only place
> where mmu_notifier_invalidate_range_start_nonblock is called is the oom
> killer, which means the process is about to die and the pagetable will
> get torn down ASAP.  Should we just skip them unconditionally?

How do you tell whether they are going to block or not without trying?
On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
> On Wed 24-07-19 12:28:58, Jason Gunthorpe wrote:
> > On Wed, Jul 24, 2019 at 09:05:53AM +0200, Christoph Hellwig wrote:
> > > Looks good:
> > > 
> > > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > > 
> > > One comment on a related cleanup:
> > > 
> > > >  	list_for_each_entry(mirror, &hmm->mirrors, list) {
> > > >  		int rc;
> > > >  
> > > > -		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> > > > +		rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
> > > >  		if (rc) {
> > > > -			if (WARN_ON(update.blockable || rc != -EAGAIN))
> > > > +			if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
> > > > +			    rc != -EAGAIN))
> > > >  				continue;
> > > >  			ret = -EAGAIN;
> > > >  			break;
> > > 
> > > This magic handling of error seems odd.  I think we should merge rc and
> > > ret into one variable and just break out if any error happens instead
> > > or claiming in the comments -EAGAIN is the only valid error and then
> > > ignoring all others here.
> > 
> > The WARN_ON is enforcing the rules already commented near
> > mmuu_notifier_ops.invalidate_start - we could break or continue, it
> > doesn't much matter how to recover from a broken driver, but since we
> > did the WARN_ON this should sanitize the ret to EAGAIN or 0
> > 
> > Humm. Actually having looked this some more, I wonder if this is a
> > problem:
> > 
> > I see in __oom_reap_task_mm():
> > 
> > 			if (mmu_notifier_invalidate_range_start_nonblock(&range)) {
> > 				tlb_finish_mmu(&tlb, range.start, range.end);
> > 				ret = false;
> > 				continue;
> > 			}
> > 			unmap_page_range(&tlb, vma, range.start, range.end, NULL);
> > 			mmu_notifier_invalidate_range_end(&range);
> > 
> > Which looks like it creates an unbalanced start/end pairing if any
> > start returns EAGAIN?
> > 
> > This does not seem OK.. Many users require start/end to be paired to
> > keep track of their internal locking. Ie for instance hmm breaks
> > because the hmm->notifiers counter becomes unable to get to 0.
> > 
> > Below is the best idea I've had so far..
> > 
> > Michal, what do you think?
> 
> IIRC we have discussed this with Jerome back then when I've introduced
> this code and unless I misremember he said the current code was OK.

Nope, it has always been broken.

> Maybe new users have started relying on a new semantic in the meantime,
> back then, none of the notifier has even started any action in blocking
> mode on a EAGAIN bailout. Most of them simply did trylock early in the
> process and bailed out so there was nothing to do for the range_end
> callback.

Single notifiers are not the problem. I tried to make this clear in
the commit message, but lets be more explicit.

We have *two* notifiers registered to the mm, A and B:

A invalidate_range_start: (has no blocking)
    spin_lock()
    counter++
    spin_unlock()

A invalidate_range_end:
    spin_lock()
    counter--
    spin_unlock()

And this one:

B invalidate_range_start: (has blocking)
    if (!try_mutex_lock())
        return -EAGAIN;
    counter++
    mutex_unlock()

B invalidate_range_end:
    spin_lock()
    counter--
    spin_unlock()

So now the oom path does:

invalidate_range_start_non_blocking:
 for each mn:
   a->invalidate_range_start
   b->invalidate_range_start
   rc = EAGAIN

Now we SKIP A's invalidate_range_end even though A had no idea this
would happen has state that needs to be unwound. A is broken.

B survived just fine.

A and B *alone* work fine, combined they fail.

When the commit was landed you can use KVM as an example of A and RDMA
ODP as an example of B

Jason
On Wed 24-07-19 15:08:37, Jason Gunthorpe wrote:
> On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
[...]
> > Maybe new users have started relying on a new semantic in the meantime,
> > back then, none of the notifier has even started any action in blocking
> > mode on a EAGAIN bailout. Most of them simply did trylock early in the
> > process and bailed out so there was nothing to do for the range_end
> > callback.
> 
> Single notifiers are not the problem. I tried to make this clear in
> the commit message, but lets be more explicit.
> 
> We have *two* notifiers registered to the mm, A and B:
> 
> A invalidate_range_start: (has no blocking)
>     spin_lock()
>     counter++
>     spin_unlock()
> 
> A invalidate_range_end:
>     spin_lock()
>     counter--
>     spin_unlock()
> 
> And this one:
> 
> B invalidate_range_start: (has blocking)
>     if (!try_mutex_lock())
>         return -EAGAIN;
>     counter++
>     mutex_unlock()
> 
> B invalidate_range_end:
>     spin_lock()
>     counter--
>     spin_unlock()
> 
> So now the oom path does:
> 
> invalidate_range_start_non_blocking:
>  for each mn:
>    a->invalidate_range_start
>    b->invalidate_range_start
>    rc = EAGAIN
> 
> Now we SKIP A's invalidate_range_end even though A had no idea this
> would happen has state that needs to be unwound. A is broken.
> 
> B survived just fine.
> 
> A and B *alone* work fine, combined they fail.

But that requires that they share some state, right?

> When the commit was landed you can use KVM as an example of A and RDMA
> ODP as an example of B

Could you point me where those two share the state please? KVM seems to
be using kvm->mmu_notifier_count but I do not know where to look for the
RDMA...
On Wed 24-07-19 20:56:17, Michal Hocko wrote:
> On Wed 24-07-19 15:08:37, Jason Gunthorpe wrote:
> > On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
> [...]
> > > Maybe new users have started relying on a new semantic in the meantime,
> > > back then, none of the notifier has even started any action in blocking
> > > mode on a EAGAIN bailout. Most of them simply did trylock early in the
> > > process and bailed out so there was nothing to do for the range_end
> > > callback.
> > 
> > Single notifiers are not the problem. I tried to make this clear in
> > the commit message, but lets be more explicit.
> > 
> > We have *two* notifiers registered to the mm, A and B:
> > 
> > A invalidate_range_start: (has no blocking)
> >     spin_lock()
> >     counter++
> >     spin_unlock()
> > 
> > A invalidate_range_end:
> >     spin_lock()
> >     counter--
> >     spin_unlock()
> > 
> > And this one:
> > 
> > B invalidate_range_start: (has blocking)
> >     if (!try_mutex_lock())
> >         return -EAGAIN;
> >     counter++
> >     mutex_unlock()
> > 
> > B invalidate_range_end:
> >     spin_lock()
> >     counter--
> >     spin_unlock()
> > 
> > So now the oom path does:
> > 
> > invalidate_range_start_non_blocking:
> >  for each mn:
> >    a->invalidate_range_start
> >    b->invalidate_range_start
> >    rc = EAGAIN
> > 
> > Now we SKIP A's invalidate_range_end even though A had no idea this
> > would happen has state that needs to be unwound. A is broken.
> > 
> > B survived just fine.
> > 
> > A and B *alone* work fine, combined they fail.
> 
> But that requires that they share some state, right?
> 
> > When the commit was landed you can use KVM as an example of A and RDMA
> > ODP as an example of B
> 
> Could you point me where those two share the state please? KVM seems to
> be using kvm->mmu_notifier_count but I do not know where to look for the
> RDMA...

Scratch that. ELONGDAY... I can see your point. It is all or nothing
that doesn't really work here. Looking back at your patch it seems
reasonable but I am not sure what is supposed to be a behavior for
notifiers that failed.
On Wed, Jul 24, 2019 at 08:59:10PM +0200, Michal Hocko wrote:
> On Wed 24-07-19 20:56:17, Michal Hocko wrote:
> > On Wed 24-07-19 15:08:37, Jason Gunthorpe wrote:
> > > On Wed, Jul 24, 2019 at 07:58:58PM +0200, Michal Hocko wrote:
> > [...]
> > > > Maybe new users have started relying on a new semantic in the meantime,
> > > > back then, none of the notifier has even started any action in blocking
> > > > mode on a EAGAIN bailout. Most of them simply did trylock early in the
> > > > process and bailed out so there was nothing to do for the range_end
> > > > callback.
> > > 
> > > Single notifiers are not the problem. I tried to make this clear in
> > > the commit message, but lets be more explicit.
> > > 
> > > We have *two* notifiers registered to the mm, A and B:
> > > 
> > > A invalidate_range_start: (has no blocking)
> > >     spin_lock()
> > >     counter++
> > >     spin_unlock()
> > > 
> > > A invalidate_range_end:
> > >     spin_lock()
> > >     counter--
> > >     spin_unlock()
> > > 
> > > And this one:
> > > 
> > > B invalidate_range_start: (has blocking)
> > >     if (!try_mutex_lock())
> > >         return -EAGAIN;
> > >     counter++
> > >     mutex_unlock()
> > > 
> > > B invalidate_range_end:
> > >     spin_lock()
> > >     counter--
> > >     spin_unlock()
> > > 
> > > So now the oom path does:
> > > 
> > > invalidate_range_start_non_blocking:
> > >  for each mn:
> > >    a->invalidate_range_start
> > >    b->invalidate_range_start
> > >    rc = EAGAIN
> > > 
> > > Now we SKIP A's invalidate_range_end even though A had no idea this
> > > would happen has state that needs to be unwound. A is broken.
> > > 
> > > B survived just fine.
> > > 
> > > A and B *alone* work fine, combined they fail.
> > 
> > But that requires that they share some state, right?
> > 
> > > When the commit was landed you can use KVM as an example of A and RDMA
> > > ODP as an example of B
> > 
> > Could you point me where those two share the state please? KVM seems to
> > be using kvm->mmu_notifier_count but I do not know where to look for the
> > RDMA...
> 
> Scratch that. ELONGDAY... I can see your point. It is all or nothing
> that doesn't really work here. Looking back at your patch it seems
> reasonable but I am not sure what is supposed to be a behavior for
> notifiers that failed.

Okay, good to know I'm not missing something. The idea was the failed
notifier would have to handle the mandatory _end callback.

I've reflected on it some more, and I have a scheme to be able to
'undo' that is safe against concurrent hlist_del_rcu.

If we change the register to keep the hlist sorted by address then we
can do a targetted 'undo' of past starts terminated by address
less-than comparison of the first failing struct mmu_notifier.

It relies on the fact that rcu is only used to remove items, the list
adds are all protected by mm locks, and the number of mmu notifiers is
very small.

This seems workable and does not need more driver review/update...

However, hmm's implementation still needs more fixing.

Thanks,
Jason
On Wed, Jul 24, 2019 at 09:48:55PM +0200, Christoph Hellwig wrote:
> On Wed, Jul 24, 2019 at 04:21:55PM -0300, Jason Gunthorpe wrote:
> > If we change the register to keep the hlist sorted by address then we
> > can do a targetted 'undo' of past starts terminated by address
> > less-than comparison of the first failing struct mmu_notifier.
> > 
> > It relies on the fact that rcu is only used to remove items, the list
> > adds are all protected by mm locks, and the number of mmu notifiers is
> > very small.
> > 
> > This seems workable and does not need more driver review/update...
> > 
> > However, hmm's implementation still needs more fixing.
> 
> Can we take one step back, please?  The only reason why drivers
> implement both ->invalidate_range_start and ->invalidate_range_end and
> expect them to be called paired is to keep some form of counter of
> active invalidation "sections".  So instead of doctoring around
> undo schemes the only sane answer is to take such a counter into the
> core VM code instead of having each driver struggle with it.

This might work as a hybrid sort of idea, like what HMM tried to do
with the counter and valid together.

If we keep the counter global and then provide an 'all invalidates
finished' callback then the driver could potentially still ignore
invalidates that do not touch its ranges during its page fault path.

I'd have to sketch it..

I agree it would solve this problem as well and better advance the
goal to make mmu notifiers simpler to use.. 

But I didn't audit all the invalidate_end users to be sure :)

Jason
On Tue, Jul 23, 2019 at 02:05:06PM -0700, Ralph Campbell wrote:
> The hmm_mirror_ops callback function sync_cpu_device_pagetables() passes
> a struct hmm_update which is a simplified version of struct
> mmu_notifier_range. This is unnecessary so replace hmm_update with
> mmu_notifier_range directly.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> 
> This is based on 5.3.0-rc1 plus Christoph Hellwig's 6 patches
> ("hmm_range_fault related fixes and legacy API removal v2").
> Jason, I believe this is the patch you were requesting.

Doesn't this need revision to include amgpu?

drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:         .sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:         .sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,

Thanks,
Jason
On 7/24/19 6:14 PM, Jason Gunthorpe wrote:
> On Tue, Jul 23, 2019 at 02:05:06PM -0700, Ralph Campbell wrote:
>> The hmm_mirror_ops callback function sync_cpu_device_pagetables() passes
>> a struct hmm_update which is a simplified version of struct
>> mmu_notifier_range. This is unnecessary so replace hmm_update with
>> mmu_notifier_range directly.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Jason Gunthorpe <jgg@mellanox.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Ben Skeggs <bskeggs@redhat.com>
>>
>> This is based on 5.3.0-rc1 plus Christoph Hellwig's 6 patches
>> ("hmm_range_fault related fixes and legacy API removal v2").
>> Jason, I believe this is the patch you were requesting.
> 
> Doesn't this need revision to include amgpu?
> 
> drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:         .sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
> drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c:         .sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,
> 
> Thanks,
> Jason
> 

Yes. I have added this to v2 which I'll send out with Christoph's 2 
patches and the hmm_range.vma removal patch you suggested.