drm/vblank: Fixup and document timestamp update/read barriers

Submitted by Daniel Vetter on April 15, 2015, 7:17 a.m.

Details

Message ID 1429082222-20820-1-git-send-email-daniel.vetter@ffwll.ch
State New

Commit Message

This was a bit too much cargo-culted, so let's make it solid:
- vblank->count doesn't need to be an atomic, writes are always done
  under the protection of dev->vblank_time_lock. Switch to an unsigned
  long instead and update comments. Note that atomic_read is just a
  normal read of a volatile variable, so no need to audit all the
  read-side access specifically.

- The barriers for the vblank counter seqlock weren't complete: The
  read-side was missing the first barrier between the counter read and
  the timestamp read, it only had a barrier between the ts and the
  counter read. We need both.

- Barriers weren't properly documented. Since barriers only work if
  you have them on both sides of the transaction it's prudent to
  reference where the other side is. To avoid duplicating the
  write-side comment 3 times extract a little store_vblank() helper.
  In that helper also assert that we do indeed hold
  dev->vblank_time_lock, since in some cases the lock is acquired a
  few functions up in the callchain.

Spotted while reviewing a patch from Chris Wilson to add a fastpath to
the vblank_wait ioctl.
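The write/read barrier pairing described above can be sketched as a userspace analogue. This is only a sketch, using C11 fences in place of smp_wmb()/smp_rmb(); the names (store_sample, read_sample, RBSIZE) are invented for illustration and are not the kernel API:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RBSIZE 16
#define SLOT(c) ((c) % RBSIZE)

static uint64_t timestamps[RBSIZE];	/* stands in for vblanktimestamp() */
static atomic_ulong count;		/* stands in for vblank->count */

/* Write side: callers are assumed to serialize with a lock. */
static void store_sample(unsigned long inc, uint64_t ts)
{
	/* Fill the slot the *new* count will point at. */
	timestamps[SLOT(count + inc)] = ts;

	/* Publish the timestamp before the new count becomes visible. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&count, count + inc, memory_order_relaxed);
	/* Order the count update against the *next* timestamp write. */
	atomic_thread_fence(memory_order_release);
}

/* Read side: lockless; retry if the count moved during the read. */
static uint64_t read_sample(unsigned long *out_count)
{
	unsigned long c;
	uint64_t ts;

	do {
		c = atomic_load_explicit(&count, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire); /* first rmb */
		ts = timestamps[SLOT(c)];
		atomic_thread_fence(memory_order_acquire); /* second rmb */
	} while (c != atomic_load_explicit(&count, memory_order_relaxed));

	*out_count = c;
	return ts;
}
```

The first read-side fence is the one the commit message says was missing: without it the timestamp load could be reordered before the counter load.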

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Michel Dänzer <michel@daenzer.net>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
 include/drm/drmP.h        |  8 +++--
 2 files changed, 54 insertions(+), 46 deletions(-)

diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index c8a34476570a..23bfbc61a494 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -74,6 +74,33 @@  module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
 module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
 module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
 
+static void store_vblank(struct drm_device *dev, int crtc,
+			 unsigned vblank_count_inc,
+			 struct timeval *t_vblank)
+{
+	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
+	u32 tslot;
+
+	assert_spin_locked(&dev->vblank_time_lock);
+
+	if (t_vblank) {
+		tslot = vblank->count + vblank_count_inc;
+		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
+	}
+
+	/*
+	 * vblank timestamp updates are protected on the write side with
+	 * vblank_time_lock, but on the read side done locklessly using a
+	 * sequence-lock on the vblank counter. Ensure correct ordering using
+	 * memory barriers. We need the barrier both before and also after the
+	 * counter update to synchronize with the next timestamp write.
+	 * The read-side barriers for this are in drm_vblank_count_and_time.
+	 */
+	smp_wmb();
+	vblank->count += vblank_count_inc;
+	smp_wmb();
+}
+
 /**
  * drm_update_vblank_count - update the master vblank counter
  * @dev: DRM device
@@ -93,7 +120,7 @@  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
 static void drm_update_vblank_count(struct drm_device *dev, int crtc)
 {
 	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
-	u32 cur_vblank, diff, tslot;
+	u32 cur_vblank, diff;
 	bool rc;
 	struct timeval t_vblank;
 
@@ -129,18 +156,12 @@  static void drm_update_vblank_count(struct drm_device *dev, int crtc)
 	if (diff == 0)
 		return;
 
-	/* Reinitialize corresponding vblank timestamp if high-precision query
-	 * available. Skip this step if query unsupported or failed. Will
-	 * reinitialize delayed at next vblank interrupt in that case.
+	/*
+	 * Only reinitialize corresponding vblank timestamp if high-precision query
+	 * available and didn't fail. Will reinitialize delayed at next vblank
+	 * interrupt in that case.
 	 */
-	if (rc) {
-		tslot = atomic_read(&vblank->count) + diff;
-		vblanktimestamp(dev, crtc, tslot) = t_vblank;
-	}
-
-	smp_mb__before_atomic();
-	atomic_add(diff, &vblank->count);
-	smp_mb__after_atomic();
+	store_vblank(dev, crtc, diff, rc ? &t_vblank : NULL);
 }
 
 /*
@@ -218,7 +239,7 @@  static void vblank_disable_and_save(struct drm_device *dev, int crtc)
 	/* Compute time difference to stored timestamp of last vblank
 	 * as updated by last invocation of drm_handle_vblank() in vblank irq.
 	 */
-	vblcount = atomic_read(&vblank->count);
+	vblcount = vblank->count;
 	diff_ns = timeval_to_ns(&tvblank) -
 		  timeval_to_ns(&vblanktimestamp(dev, crtc, vblcount));
 
@@ -234,17 +255,8 @@  static void vblank_disable_and_save(struct drm_device *dev, int crtc)
 	 * available. In that case we can't account for this and just
 	 * hope for the best.
 	 */
-	if (vblrc && (abs64(diff_ns) > 1000000)) {
-		/* Store new timestamp in ringbuffer. */
-		vblanktimestamp(dev, crtc, vblcount + 1) = tvblank;
-
-		/* Increment cooked vblank count. This also atomically commits
-		 * the timestamp computed above.
-		 */
-		smp_mb__before_atomic();
-		atomic_inc(&vblank->count);
-		smp_mb__after_atomic();
-	}
+	if (vblrc && (abs64(diff_ns) > 1000000))
+		store_vblank(dev, crtc, 1, &tvblank);
 
 	spin_unlock_irqrestore(&dev->vblank_time_lock, irqflags);
 }
@@ -852,7 +864,7 @@  u32 drm_vblank_count(struct drm_device *dev, int crtc)
 
 	if (WARN_ON(crtc >= dev->num_crtcs))
 		return 0;
-	return atomic_read(&vblank->count);
+	return vblank->count;
 }
 EXPORT_SYMBOL(drm_vblank_count);
 
@@ -897,16 +909,17 @@  u32 drm_vblank_count_and_time(struct drm_device *dev, int crtc,
 	if (WARN_ON(crtc >= dev->num_crtcs))
 		return 0;
 
-	/* Read timestamp from slot of _vblank_time ringbuffer
-	 * that corresponds to current vblank count. Retry if
-	 * count has incremented during readout. This works like
-	 * a seqlock.
+	/*
+	 * Vblank timestamps are read lockless. To ensure consistency the vblank
+	 * counter is rechecked and ordering is ensured using memory barriers.
+	 * This works like a seqlock. The write-side barriers are in store_vblank.
 	 */
 	do {
-		cur_vblank = atomic_read(&vblank->count);
+		cur_vblank = vblank->count;
+		smp_rmb();
 		*vblanktime = vblanktimestamp(dev, crtc, cur_vblank);
 		smp_rmb();
-	} while (cur_vblank != atomic_read(&vblank->count));
+	} while (cur_vblank != vblank->count);
 
 	return cur_vblank;
 }
@@ -1715,7 +1728,7 @@  bool drm_handle_vblank(struct drm_device *dev, int crtc)
 	 */
 
 	/* Get current timestamp and count. */
-	vblcount = atomic_read(&vblank->count);
+	vblcount = vblank->count;
 	drm_get_last_vbltimestamp(dev, crtc, &tvblank, DRM_CALLED_FROM_VBLIRQ);
 
 	/* Compute time difference to timestamp of last vblank */
@@ -1731,20 +1744,11 @@  bool drm_handle_vblank(struct drm_device *dev, int crtc)
 	 * e.g., due to spurious vblank interrupts. We need to
 	 * ignore those for accounting.
 	 */
-	if (abs64(diff_ns) > DRM_REDUNDANT_VBLIRQ_THRESH_NS) {
-		/* Store new timestamp in ringbuffer. */
-		vblanktimestamp(dev, crtc, vblcount + 1) = tvblank;
-
-		/* Increment cooked vblank count. This also atomically commits
-		 * the timestamp computed above.
-		 */
-		smp_mb__before_atomic();
-		atomic_inc(&vblank->count);
-		smp_mb__after_atomic();
-	} else {
+	if (abs64(diff_ns) > DRM_REDUNDANT_VBLIRQ_THRESH_NS)
+		store_vblank(dev, crtc, 1, &tvblank);
+	else
 		DRM_DEBUG("crtc %d: Redundant vblirq ignored. diff_ns = %d\n",
 			  crtc, (int) diff_ns);
-	}
 
 	spin_unlock(&dev->vblank_time_lock);
 
diff --git a/include/drm/drmP.h b/include/drm/drmP.h
index 62c40777c009..4c31a2cc5a33 100644
--- a/include/drm/drmP.h
+++ b/include/drm/drmP.h
@@ -686,9 +686,13 @@  struct drm_pending_vblank_event {
 struct drm_vblank_crtc {
 	struct drm_device *dev;		/* pointer to the drm_device */
 	wait_queue_head_t queue;	/**< VBLANK wait queue */
-	struct timeval time[DRM_VBLANKTIME_RBSIZE];	/**< timestamp of current count */
 	struct timer_list disable_timer;		/* delayed disable timer */
-	atomic_t count;			/**< number of VBLANK interrupts */
+
+	/* vblank counter, protected by dev->vblank_time_lock for writes */
+	unsigned long count;
+	/* vblank timestamps, protected by dev->vblank_time_lock for writes */
+	struct timeval time[DRM_VBLANKTIME_RBSIZE];
+
 	atomic_t refcount;		/* number of users of vblank interruptsper crtc */
 	u32 last;			/* protected by dev->vbl_lock, used */
 					/* for wraparound handling */

Comments

On Wed, Apr 15, 2015 at 09:17:02AM +0200, Daniel Vetter wrote:
> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> index c8a34476570a..23bfbc61a494 100644
> --- a/drivers/gpu/drm/drm_irq.c
> +++ b/drivers/gpu/drm/drm_irq.c
> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
>  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
>  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>  
> +static void store_vblank(struct drm_device *dev, int crtc,
> +			 unsigned vblank_count_inc,
> +			 struct timeval *t_vblank)
> +{
> +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> +	u32 tslot;
> +
> +	assert_spin_locked(&dev->vblank_time_lock);
> +
> +	if (t_vblank) {
> +		tslot = vblank->count + vblank_count_inc;
> +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> +	}

It is not obvious this updates the right tslot in all circumstances.
Care to explain?

Otherwise the rest looks consistent with seqlock, using the
vblank->count as the latch.
-Chris
On Wed, Apr 15, 2015 at 09:17:03AM +0100, Chris Wilson wrote:
> On Wed, Apr 15, 2015 at 09:17:02AM +0200, Daniel Vetter wrote:
> > diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> > index c8a34476570a..23bfbc61a494 100644
> > --- a/drivers/gpu/drm/drm_irq.c
> > +++ b/drivers/gpu/drm/drm_irq.c
> > @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
> >  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
> >  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
> >  
> > +static void store_vblank(struct drm_device *dev, int crtc,
> > +			 unsigned vblank_count_inc,
> > +			 struct timeval *t_vblank)
> > +{
> > +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> > +	u32 tslot;
> > +
> > +	assert_spin_locked(&dev->vblank_time_lock);
> > +
> > +	if (t_vblank) {
> > +		tslot = vblank->count + vblank_count_inc;
> > +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> > +	}
> 
> It is not obvious this updates the right tslot in all circumstances.
> Care to explain?

Writers are synchronized with vblank_time_lock, so there shouldn't be any
races. Mario also has a patch to clear the ts slot if we don't have
anything to set it to (that one will conflict ofc).

Or what exactly do you mean?
-Daniel
On Wed, Apr 15, 2015 at 11:25:00AM +0200, Daniel Vetter wrote:
> On Wed, Apr 15, 2015 at 09:17:03AM +0100, Chris Wilson wrote:
> > On Wed, Apr 15, 2015 at 09:17:02AM +0200, Daniel Vetter wrote:
> > > diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> > > index c8a34476570a..23bfbc61a494 100644
> > > --- a/drivers/gpu/drm/drm_irq.c
> > > +++ b/drivers/gpu/drm/drm_irq.c
> > > @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
> > >  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
> > >  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
> > >  
> > > +static void store_vblank(struct drm_device *dev, int crtc,
> > > +			 unsigned vblank_count_inc,
> > > +			 struct timeval *t_vblank)
> > > +{
> > > +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> > > +	u32 tslot;
> > > +
> > > +	assert_spin_locked(&dev->vblank_time_lock);
> > > +
> > > +	if (t_vblank) {
> > > +		tslot = vblank->count + vblank_count_inc;
> > > +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> > > +	}
> > 
> > It is not obvious this updates the right tslot in all circumstances.
> > Care to explain?
> 
> Writers are synchronized with vblank_time_lock, so there shouldn't be any
> races. Mario also has a patch to clear the ts slot if we don't have
> anything to set it too (that one will conflict ofc).
> 
> Or what exactly do you mean?

I was staring at vblank->count and reading backwards from the smp_wmb().

Just something like:
if (t_vblank) {
	/* All writers hold the spinlock, but readers are serialized by
	 * the latching of vblank->count below.
	 */
	 u32 tslot = vblank->count + vblank_count_inc;
	 ...

would help me understand the relationship better.
-Chris
Hi Daniel,

On 04/15/2015 03:17 AM, Daniel Vetter wrote:
> This was a bit too much cargo-culted, so lets make it solid:
> - vblank->count doesn't need to be an atomic, writes are always done
>   under the protection of dev->vblank_time_lock. Switch to an unsigned
>   long instead and update comments. Note that atomic_read is just a
>   normal read of a volatile variable, so no need to audit all the
>   read-side access specifically.
> 
> - The barriers for the vblank counter seqlock weren't complete: The
>   read-side was missing the first barrier between the counter read and
>   the timestamp read, it only had a barrier between the ts and the
>   counter read. We need both.
> 
> - Barriers weren't properly documented. Since barriers only work if
>   you have them on boths sides of the transaction it's prudent to
>   reference where the other side is. To avoid duplicating the
>   write-side comment 3 times extract a little store_vblank() helper.
>   In that helper also assert that we do indeed hold
>   dev->vblank_time_lock, since in some cases the lock is acquired a
>   few functions up in the callchain.
> 
> Spotted while reviewing a patch from Chris Wilson to add a fastpath to
> the vblank_wait ioctl.
> 
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> Cc: Michel Dänzer <michel@daenzer.net>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
>  include/drm/drmP.h        |  8 +++--
>  2 files changed, 54 insertions(+), 46 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> index c8a34476570a..23bfbc61a494 100644
> --- a/drivers/gpu/drm/drm_irq.c
> +++ b/drivers/gpu/drm/drm_irq.c
> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
>  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
>  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>  
> +static void store_vblank(struct drm_device *dev, int crtc,
> +			 unsigned vblank_count_inc,
> +			 struct timeval *t_vblank)
> +{
> +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> +	u32 tslot;
> +
> +	assert_spin_locked(&dev->vblank_time_lock);
> +
> +	if (t_vblank) {
> +		tslot = vblank->count + vblank_count_inc;
> +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> +	}
> +
> +	/*
> +	 * vblank timestamp updates are protected on the write side with
> +	 * vblank_time_lock, but on the read side done locklessly using a
> +	 * sequence-lock on the vblank counter. Ensure correct ordering using
> +	 * memory barrriers. We need the barrier both before and also after the
> +	 * counter update to synchronize with the next timestamp write.
> +	 * The read-side barriers for this are in drm_vblank_count_and_time.
> +	 */
> +	smp_wmb();
> +	vblank->count += vblank_count_inc;
> +	smp_wmb();

The comment and the code are each self-contradictory.

If vblank->count writes are always protected by vblank_time_lock (something I
did not verify but that the comment above asserts), then the trailing write
barrier is not required (and the assertion that it is in the comment is incorrect).

A spin unlock operation is always a write barrier.

Regards,
Peter Hurley

> +}
> +
>  /**
>   * drm_update_vblank_count - update the master vblank counter
>   * @dev: DRM device
> @@ -93,7 +120,7 @@ module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>  static void drm_update_vblank_count(struct drm_device *dev, int crtc)
>  {
>  	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> -	u32 cur_vblank, diff, tslot;
> +	u32 cur_vblank, diff;
>  	bool rc;
>  	struct timeval t_vblank;
>  
> @@ -129,18 +156,12 @@ static void drm_update_vblank_count(struct drm_device *dev, int crtc)
>  	if (diff == 0)
>  		return;
>  
> -	/* Reinitialize corresponding vblank timestamp if high-precision query
> -	 * available. Skip this step if query unsupported or failed. Will
> -	 * reinitialize delayed at next vblank interrupt in that case.
> +	/*
> +	 * Only reinitialize corresponding vblank timestamp if high-precision query
> +	 * available and didn't fail. Will reinitialize delayed at next vblank
> +	 * interrupt in that case.
>  	 */
> -	if (rc) {
> -		tslot = atomic_read(&vblank->count) + diff;
> -		vblanktimestamp(dev, crtc, tslot) = t_vblank;
> -	}
> -
> -	smp_mb__before_atomic();
> -	atomic_add(diff, &vblank->count);
> -	smp_mb__after_atomic();
> +	store_vblank(dev, crtc, diff, rc ? &t_vblank : NULL);
>  }
>  
>  /*
> @@ -218,7 +239,7 @@ static void vblank_disable_and_save(struct drm_device *dev, int crtc)
>  	/* Compute time difference to stored timestamp of last vblank
>  	 * as updated by last invocation of drm_handle_vblank() in vblank irq.
>  	 */
> -	vblcount = atomic_read(&vblank->count);
> +	vblcount = vblank->count;
>  	diff_ns = timeval_to_ns(&tvblank) -
>  		  timeval_to_ns(&vblanktimestamp(dev, crtc, vblcount));
>  
> @@ -234,17 +255,8 @@ static void vblank_disable_and_save(struct drm_device *dev, int crtc)
>  	 * available. In that case we can't account for this and just
>  	 * hope for the best.
>  	 */
> -	if (vblrc && (abs64(diff_ns) > 1000000)) {
> -		/* Store new timestamp in ringbuffer. */
> -		vblanktimestamp(dev, crtc, vblcount + 1) = tvblank;
> -
> -		/* Increment cooked vblank count. This also atomically commits
> -		 * the timestamp computed above.
> -		 */
> -		smp_mb__before_atomic();
> -		atomic_inc(&vblank->count);
> -		smp_mb__after_atomic();
> -	}
> +	if (vblrc && (abs64(diff_ns) > 1000000))
> +		store_vblank(dev, crtc, 1, &tvblank);
>  
>  	spin_unlock_irqrestore(&dev->vblank_time_lock, irqflags);
>  }
> @@ -852,7 +864,7 @@ u32 drm_vblank_count(struct drm_device *dev, int crtc)
>  
>  	if (WARN_ON(crtc >= dev->num_crtcs))
>  		return 0;
> -	return atomic_read(&vblank->count);
> +	return vblank->count;
>  }
>  EXPORT_SYMBOL(drm_vblank_count);
>  
> @@ -897,16 +909,17 @@ u32 drm_vblank_count_and_time(struct drm_device *dev, int crtc,
>  	if (WARN_ON(crtc >= dev->num_crtcs))
>  		return 0;
>  
> -	/* Read timestamp from slot of _vblank_time ringbuffer
> -	 * that corresponds to current vblank count. Retry if
> -	 * count has incremented during readout. This works like
> -	 * a seqlock.
> +	/*
> +	 * Vblank timestamps are read lockless. To ensure consistency the vblank
> +	 * counter is rechecked and ordering is ensured using memory barriers.
> +	 * This works like a seqlock. The write-side barriers are in store_vblank.
>  	 */
>  	do {
> -		cur_vblank = atomic_read(&vblank->count);
> +		cur_vblank = vblank->count;
> +		smp_rmb();
>  		*vblanktime = vblanktimestamp(dev, crtc, cur_vblank);
>  		smp_rmb();
> -	} while (cur_vblank != atomic_read(&vblank->count));
> +	} while (cur_vblank != vblank->count);
>  
>  	return cur_vblank;
>  }
> @@ -1715,7 +1728,7 @@ bool drm_handle_vblank(struct drm_device *dev, int crtc)
>  	 */
>  
>  	/* Get current timestamp and count. */
> -	vblcount = atomic_read(&vblank->count);
> +	vblcount = vblank->count;
>  	drm_get_last_vbltimestamp(dev, crtc, &tvblank, DRM_CALLED_FROM_VBLIRQ);
>  
>  	/* Compute time difference to timestamp of last vblank */
> @@ -1731,20 +1744,11 @@ bool drm_handle_vblank(struct drm_device *dev, int crtc)
>  	 * e.g., due to spurious vblank interrupts. We need to
>  	 * ignore those for accounting.
>  	 */
> -	if (abs64(diff_ns) > DRM_REDUNDANT_VBLIRQ_THRESH_NS) {
> -		/* Store new timestamp in ringbuffer. */
> -		vblanktimestamp(dev, crtc, vblcount + 1) = tvblank;
> -
> -		/* Increment cooked vblank count. This also atomically commits
> -		 * the timestamp computed above.
> -		 */
> -		smp_mb__before_atomic();
> -		atomic_inc(&vblank->count);
> -		smp_mb__after_atomic();
> -	} else {
> +	if (abs64(diff_ns) > DRM_REDUNDANT_VBLIRQ_THRESH_NS)
> +		store_vblank(dev, crtc, 1, &tvblank);
> +	else
>  		DRM_DEBUG("crtc %d: Redundant vblirq ignored. diff_ns = %d\n",
>  			  crtc, (int) diff_ns);
> -	}
>  
>  	spin_unlock(&dev->vblank_time_lock);
>  
> diff --git a/include/drm/drmP.h b/include/drm/drmP.h
> index 62c40777c009..4c31a2cc5a33 100644
> --- a/include/drm/drmP.h
> +++ b/include/drm/drmP.h
> @@ -686,9 +686,13 @@ struct drm_pending_vblank_event {
>  struct drm_vblank_crtc {
>  	struct drm_device *dev;		/* pointer to the drm_device */
>  	wait_queue_head_t queue;	/**< VBLANK wait queue */
> -	struct timeval time[DRM_VBLANKTIME_RBSIZE];	/**< timestamp of current count */
>  	struct timer_list disable_timer;		/* delayed disable timer */
> -	atomic_t count;			/**< number of VBLANK interrupts */
> +
> +	/* vblank counter, protected by dev->vblank_time_lock for writes */
> +	unsigned long count;
> +	/* vblank timestamps, protected by dev->vblank_time_lock for writes */
> +	struct timeval time[DRM_VBLANKTIME_RBSIZE];
> +
>  	atomic_t refcount;		/* number of users of vblank interruptsper crtc */
>  	u32 last;			/* protected by dev->vbl_lock, used */
>  					/* for wraparound handling */
>
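Peter's observation that a spin unlock is always a write barrier has a direct userspace analogue: an unlock is a release operation, so stores made inside the critical section cannot be reordered past it, which is what would make a trailing smp_wmb() redundant between lock-holding writers. A hedged sketch with pthreads (illustrative names, not the kernel API):

```c
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t shared_ts;		/* stands in for the timestamp slot */
static unsigned long shared_count;	/* stands in for vblank->count */

/*
 * pthread_mutex_unlock() is a release operation: the stores to
 * shared_ts and shared_count are ordered before the unlock, and the
 * next writer's pthread_mutex_lock() is the matching acquire. Note the
 * lock only orders writer against writer; lockless readers still need
 * their own barrier pairing against the writer's fences.
 */
static void locked_store(unsigned long inc, uint64_t ts)
{
	pthread_mutex_lock(&lock);
	shared_ts = ts;
	shared_count += inc;
	pthread_mutex_unlock(&lock);	/* release: commits both stores */
}
```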
On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
> Hi Daniel,
> 
> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
> > This was a bit too much cargo-culted, so lets make it solid:
> > - vblank->count doesn't need to be an atomic, writes are always done
> >   under the protection of dev->vblank_time_lock. Switch to an unsigned
> >   long instead and update comments. Note that atomic_read is just a
> >   normal read of a volatile variable, so no need to audit all the
> >   read-side access specifically.
> > 
> > - The barriers for the vblank counter seqlock weren't complete: The
> >   read-side was missing the first barrier between the counter read and
> >   the timestamp read, it only had a barrier between the ts and the
> >   counter read. We need both.
> > 
> > - Barriers weren't properly documented. Since barriers only work if
> >   you have them on boths sides of the transaction it's prudent to
> >   reference where the other side is. To avoid duplicating the
> >   write-side comment 3 times extract a little store_vblank() helper.
> >   In that helper also assert that we do indeed hold
> >   dev->vblank_time_lock, since in some cases the lock is acquired a
> >   few functions up in the callchain.
> > 
> > Spotted while reviewing a patch from Chris Wilson to add a fastpath to
> > the vblank_wait ioctl.
> > 
> > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
> > Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> > Cc: Michel Dänzer <michel@daenzer.net>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> > ---
> >  drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
> >  include/drm/drmP.h        |  8 +++--
> >  2 files changed, 54 insertions(+), 46 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> > index c8a34476570a..23bfbc61a494 100644
> > --- a/drivers/gpu/drm/drm_irq.c
> > +++ b/drivers/gpu/drm/drm_irq.c
> > @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
> >  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
> >  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
> >  
> > +static void store_vblank(struct drm_device *dev, int crtc,
> > +			 unsigned vblank_count_inc,
> > +			 struct timeval *t_vblank)
> > +{
> > +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> > +	u32 tslot;
> > +
> > +	assert_spin_locked(&dev->vblank_time_lock);
> > +
> > +	if (t_vblank) {
> > +		tslot = vblank->count + vblank_count_inc;
> > +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> > +	}
> > +
> > +	/*
> > +	 * vblank timestamp updates are protected on the write side with
> > +	 * vblank_time_lock, but on the read side done locklessly using a
> > +	 * sequence-lock on the vblank counter. Ensure correct ordering using
> > +	 * memory barrriers. We need the barrier both before and also after the
> > +	 * counter update to synchronize with the next timestamp write.
> > +	 * The read-side barriers for this are in drm_vblank_count_and_time.
> > +	 */
> > +	smp_wmb();
> > +	vblank->count += vblank_count_inc;
> > +	smp_wmb();
> 
> The comment and the code are each self-contradictory.
> 
> If vblank->count writes are always protected by vblank_time_lock (something I
> did not verify but that the comment above asserts), then the trailing write
> barrier is not required (and the assertion that it is in the comment is incorrect).
> 
> A spin unlock operation is always a write barrier.

Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
That the spinlock is held I can assure. That no one goes around and does
multiple vblank updates (because somehow that code raced with the hw
itself) I can't easily assure with a simple assert or something similar.
It's not the case right now, but that can change.

Also it's not contradictory here, since you'd need to audit all the
callers to be able to make the claim that the 2nd smp_wmb() is redundant.
I'll just add a comment about this.
-Daniel
Tested-By: Intel Graphics QA PRTS (Patch Regression Test System Contact: shuang.he@intel.com)
Task id: 6195
-------------------------------------Summary-------------------------------------
Platform          Delta          drm-intel-nightly          Series Applied
PNV                                  276/276              276/276
ILK                 -1              302/302              301/302
SNB                                  318/318              318/318
IVB                                  341/341              341/341
BYT                                  287/287              287/287
HSW                                  395/395              395/395
BDW                                  318/318              318/318
-------------------------------------Detailed-------------------------------------
Platform  Test                                drm-intel-nightly          Series Applied
*ILK  igt@gem_fenced_exec_thrash@no-spare-fences-busy-interruptible      PASS(2)      DMESG_WARN(1)PASS(1)
(dmesg patch applied)drm:i915_hangcheck_elapsed[i915]]*ERROR*Hangcheck_timer_elapsed...bsd_ring_idle@Hangcheck timer elapsed... bsd ring idle
Note: You need to pay more attention to line start with '*'
On 04/15/2015 01:31 PM, Daniel Vetter wrote:
> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>> Hi Daniel,
>>
>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>> This was a bit too much cargo-culted, so lets make it solid:
>>> - vblank->count doesn't need to be an atomic, writes are always done
>>>   under the protection of dev->vblank_time_lock. Switch to an unsigned
>>>   long instead and update comments. Note that atomic_read is just a
>>>   normal read of a volatile variable, so no need to audit all the
>>>   read-side access specifically.
>>>
>>> - The barriers for the vblank counter seqlock weren't complete: The
>>>   read-side was missing the first barrier between the counter read and
>>>   the timestamp read, it only had a barrier between the ts and the
>>>   counter read. We need both.
>>>
>>> - Barriers weren't properly documented. Since barriers only work if
>>>   you have them on boths sides of the transaction it's prudent to
>>>   reference where the other side is. To avoid duplicating the
>>>   write-side comment 3 times extract a little store_vblank() helper.
>>>   In that helper also assert that we do indeed hold
>>>   dev->vblank_time_lock, since in some cases the lock is acquired a
>>>   few functions up in the callchain.
>>>
>>> Spotted while reviewing a patch from Chris Wilson to add a fastpath to
>>> the vblank_wait ioctl.
>>>
>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>> Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>> Cc: Michel Dänzer <michel@daenzer.net>
>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>> ---
>>>  drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
>>>  include/drm/drmP.h        |  8 +++--
>>>  2 files changed, 54 insertions(+), 46 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>> index c8a34476570a..23bfbc61a494 100644
>>> --- a/drivers/gpu/drm/drm_irq.c
>>> +++ b/drivers/gpu/drm/drm_irq.c
>>> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
>>>  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
>>>  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>>>  
>>> +static void store_vblank(struct drm_device *dev, int crtc,
>>> +			 unsigned vblank_count_inc,
>>> +			 struct timeval *t_vblank)
>>> +{
>>> +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
>>> +	u32 tslot;
>>> +
>>> +	assert_spin_locked(&dev->vblank_time_lock);
>>> +
>>> +	if (t_vblank) {
>>> +		tslot = vblank->count + vblank_count_inc;
>>> +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
>>> +	}
>>> +
>>> +	/*
>>> +	 * vblank timestamp updates are protected on the write side with
>>> +	 * vblank_time_lock, but on the read side done locklessly using a
>>> +	 * sequence-lock on the vblank counter. Ensure correct ordering using
>>> +	 * memory barrriers. We need the barrier both before and also after the
>>> +	 * counter update to synchronize with the next timestamp write.
>>> +	 * The read-side barriers for this are in drm_vblank_count_and_time.
>>> +	 */
>>> +	smp_wmb();
>>> +	vblank->count += vblank_count_inc;
>>> +	smp_wmb();
>>
>> The comment and the code are each self-contradictory.
>>
>> If vblank->count writes are always protected by vblank_time_lock (something I
>> did not verify but that the comment above asserts), then the trailing write
>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>
>> A spin unlock operation is always a write barrier.
> 
> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
> That the spinlock is held I can assure. That no one goes around and does
> multiple vblank updates (because somehow that code raced with the hw
> itself) I can't easily assure with a simple assert or something similar.
> It's not the case right now, but that can changes.

The algorithm would be broken if multiple updates for the same vblank
count were allowed; that's why it checks to see if the vblank count has
not advanced before storing a new timestamp.

Otherwise, the read side would not be able to determine that the
timestamp is valid by double-checking that the vblank count has not
changed.

And besides, even if the code looped without dropping the spinlock,
the correct write order would still be observed because it would still
be executing on the same cpu.

My objection to the write memory barrier is not about optimization;
it's about correct code.

Regards,
Peter Hurley
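Peter's point that the algorithm depends on never reusing a count value can be seen in a small single-threaded sketch of the reader's recheck: if the count is bumped between the reader's two count loads, the reader discards the possibly torn timestamp and retries. Names are illustrative, not the kernel API, and the "race" is injected deterministically:

```c
#include <stdint.h>

#define RBSIZE 16

static uint64_t slots[RBSIZE];
static unsigned long seq;	/* stands in for vblank->count */

/* Writer step: fill the next slot, then publish the new count. */
static void publish(uint64_t ts)
{
	slots[(seq + 1) % RBSIZE] = ts;
	seq++;	/* in the kernel this is bracketed by smp_wmb() */
}

/* Reader with one injected "concurrent" writer update mid-read. */
static uint64_t racy_read(int *retries)
{
	unsigned long c;
	uint64_t ts;
	int injected = 0;

	do {
		c = seq;
		ts = slots[c % RBSIZE];
		if (!injected) {	/* simulate a writer racing us */
			publish(ts + 100);
			injected = 1;
			(*retries)++;
		}
	} while (c != seq);	/* recheck: count moved, so retry */

	return ts;
}
```

If a writer could update the timestamp for an already-published count, this recheck would pass while returning a stale timestamp, which is why each store must advance the count.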
On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
> > On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
> >> Hi Daniel,
> >>
> >> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
> >>> This was a bit too much cargo-culted, so lets make it solid:
> >>> - vblank->count doesn't need to be an atomic, writes are always done
> >>>   under the protection of dev->vblank_time_lock. Switch to an unsigned
> >>>   long instead and update comments. Note that atomic_read is just a
> >>>   normal read of a volatile variable, so no need to audit all the
> >>>   read-side access specifically.
> >>>
> >>> - The barriers for the vblank counter seqlock weren't complete: The
> >>>   read-side was missing the first barrier between the counter read and
> >>>   the timestamp read, it only had a barrier between the ts and the
> >>>   counter read. We need both.
> >>>
> >>> - Barriers weren't properly documented. Since barriers only work if
> >>>   you have them on boths sides of the transaction it's prudent to
> >>>   reference where the other side is. To avoid duplicating the
> >>>   write-side comment 3 times extract a little store_vblank() helper.
> >>>   In that helper also assert that we do indeed hold
> >>>   dev->vblank_time_lock, since in some cases the lock is acquired a
> >>>   few functions up in the callchain.
> >>>
> >>> Spotted while reviewing a patch from Chris Wilson to add a fastpath to
> >>> the vblank_wait ioctl.
> >>>
> >>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> >>> Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
> >>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >>> Cc: Michel Dänzer <michel@daenzer.net>
> >>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >>> ---
> >>>  drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
> >>>  include/drm/drmP.h        |  8 +++--
> >>>  2 files changed, 54 insertions(+), 46 deletions(-)
> >>>
> >>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
> >>> index c8a34476570a..23bfbc61a494 100644
> >>> --- a/drivers/gpu/drm/drm_irq.c
> >>> +++ b/drivers/gpu/drm/drm_irq.c
> >>> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
> >>>  module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
> >>>  module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
> >>>  
> >>> +static void store_vblank(struct drm_device *dev, int crtc,
> >>> +			 unsigned vblank_count_inc,
> >>> +			 struct timeval *t_vblank)
> >>> +{
> >>> +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> >>> +	u32 tslot;
> >>> +
> >>> +	assert_spin_locked(&dev->vblank_time_lock);
> >>> +
> >>> +	if (t_vblank) {
> >>> +		tslot = vblank->count + vblank_count_inc;
> >>> +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> >>> +	}
> >>> +
> >>> +	/*
> >>> +	 * vblank timestamp updates are protected on the write side with
> >>> +	 * vblank_time_lock, but on the read side done locklessly using a
> >>> +	 * sequence-lock on the vblank counter. Ensure correct ordering using
> >>> +	 * memory barriers. We need the barrier both before and also after the
> >>> +	 * counter update to synchronize with the next timestamp write.
> >>> +	 * The read-side barriers for this are in drm_vblank_count_and_time.
> >>> +	 */
> >>> +	smp_wmb();
> >>> +	vblank->count += vblank_count_inc;
> >>> +	smp_wmb();
> >>
> >> The comment and the code are each self-contradictory.
> >>
> >> If vblank->count writes are always protected by vblank_time_lock (something I
> >> did not verify but that the comment above asserts), then the trailing write
> >> barrier is not required (and the assertion that it is in the comment is incorrect).
> >>
> >> A spin unlock operation is always a write barrier.
> > 
> > Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
> > That the spinlock is held I can assure. That no one goes around and does
> > multiple vblank updates (because somehow that code raced with the hw
> > itself) I can't easily assure with a simple assert or something similar.
> > It's not the case right now, but that can changes.
> 
> The algorithm would be broken if multiple updates for the same vblank
> count were allowed; that's why it checks to see if the vblank count has
> not advanced before storing a new timestamp.
> 
> Otherwise, the read side would not be able to determine that the
> timestamp is valid by double-checking that the vblank count has not
> changed.
> 
> And besides, even if the code looped without dropping the spinlock,
> the correct write order would still be observed because it would still
> be executing on the same cpu.
> 
> My objection to the write memory barrier is not about optimization;
> it's about correct code.

Well diff=0 is not allowed, I guess I could enforce this with some
WARN_ON. And I still think my point of non-local correctness is solid.
With the smp_wmb() removed the following still works correctly:

spin_lock(vblank_time_lock);
store_vblank(dev, crtc, 1, ts1);
spin_unlock(vblank_time_lock);

spin_lock(vblank_time_lock);
store_vblank(dev, crtc, 1, ts2);
spin_unlock(vblank_time_lock);

But with the smp_wmb() removed the following would be broken:

spin_lock(vblank_time_lock);
store_vblank(dev, crtc, 1, ts1);
store_vblank(dev, crtc, 1, ts2);
spin_unlock(vblank_time_lock);

because the compiler/cpu is free to reorder the store for vblank->count
_ahead_ of the store for the timestamp. And that would trick readers into
believing that they have a valid timestamp when they potentially raced.

Now you're correct that right now there's no such thing going on, and it's
unlikely to happen (given the nature of vblank updates). But my point is
that if we optimize this then the correctness can't be proven locally
anymore by just looking at store_vblank, but instead you must audit all
the callers. And leaking locking/barriers like that is too fragile design
for my taste.

But you insist that my approach is broken somehow and dropping the smp_wmb
is needed for correctness. I don't see how that's the case at all.
-Daniel
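
Daniel's back-to-back scenario can be sketched in the same style; the trailing fence is the one he argues keeps store_vblank() correct without auditing callers. A C11 model (illustrative names; a fixed-size ring stands in for the real vblanktimestamp() accessor):

```c
#include <stdatomic.h>

atomic_uint vb_count;   /* models vblank->count */
int tslots[8];          /* models the timestamp ring */

void store_vblank_model(unsigned inc, int ts)
{
	unsigned c = atomic_load_explicit(&vb_count, memory_order_relaxed);

	tslots[(c + inc) & 7] = ts;                  /* timestamp slot first */
	atomic_thread_fence(memory_order_release);   /* leading smp_wmb() */
	atomic_store_explicit(&vb_count, c + inc, memory_order_relaxed);

	/*
	 * trailing smp_wmb(): orders this counter store before any
	 * timestamp store a subsequent call makes in the *same* lock
	 * region -- the case Daniel's second example exercises.
	 */
	atomic_thread_fence(memory_order_release);
}
```

Calling this twice without an intervening unlock is exactly the pattern Daniel shows; without the trailing fence, nothing inside the lock region orders the first counter store against the second timestamp store.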
On 04/16/2015 03:03 PM, Daniel Vetter wrote:
> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>>>> Hi Daniel,
>>>>
>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>>>> [...]
>>>>> +static void store_vblank(struct drm_device *dev, int crtc,
>>>>> +			 unsigned vblank_count_inc,
>>>>> +			 struct timeval *t_vblank)
>>>>> +{
>>>>> +	struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
>>>>> +	u32 tslot;
>>>>> +
>>>>> +	assert_spin_locked(&dev->vblank_time_lock);
>>>>> +
>>>>> +	if (t_vblank) {
>>>>> +		tslot = vblank->count + vblank_count_inc;
>>>>> +		vblanktimestamp(dev, crtc, tslot) = *t_vblank;
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * vblank timestamp updates are protected on the write side with
>>>>> +	 * vblank_time_lock, but on the read side done locklessly using a
>>>>> +	 * sequence-lock on the vblank counter. Ensure correct ordering using
>>>>> +	 * memory barriers. We need the barrier both before and also after the
>>>>> +	 * counter update to synchronize with the next timestamp write.
>>>>> +	 * The read-side barriers for this are in drm_vblank_count_and_time.
>>>>> +	 */
>>>>> +	smp_wmb();
>>>>> +	vblank->count += vblank_count_inc;
>>>>> +	smp_wmb();
>>>>
>>>> The comment and the code are each self-contradictory.
>>>>
>>>> If vblank->count writes are always protected by vblank_time_lock (something I
>>>> did not verify but that the comment above asserts), then the trailing write
>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>>>
>>>> A spin unlock operation is always a write barrier.
>>>
>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
>>> That the spinlock is held I can assure. That no one goes around and does
>>> multiple vblank updates (because somehow that code raced with the hw
>>> itself) I can't easily assure with a simple assert or something similar.
>>> It's not the case right now, but that can changes.
>>
>> The algorithm would be broken if multiple updates for the same vblank
>> count were allowed; that's why it checks to see if the vblank count has
>> not advanced before storing a new timestamp.
>>
>> Otherwise, the read side would not be able to determine that the
>> timestamp is valid by double-checking that the vblank count has not
>> changed.
>>
>> And besides, even if the code looped without dropping the spinlock,
>> the correct write order would still be observed because it would still
>> be executing on the same cpu.
>>
>> My objection to the write memory barrier is not about optimization;
>> it's about correct code.
>
> Well diff=0 is not allowed, I guess I could enforce this with some
> WARN_ON. And I still think my point of non-local correctness is solid.
> With the smp_wmb() removed the following still works correctly:
>
> spin_lock(vblank_time_lock);
> store_vblank(dev, crtc, 1, ts1);
> spin_unlock(vblank_time_lock);
>
> spin_lock(vblank_time_lock);
> store_vblank(dev, crtc, 1, ts2);
> spin_unlock(vblank_time_lock);
>
> But with the smp_wmb(); removed the following would be broken:
>
> spin_lock(vblank_time_lock);
> store_vblank(dev, crtc, 1, ts1);
> store_vblank(dev, crtc, 1, ts2);
> spin_unlock(vblank_time_lock);
>
> because the compiler/cpu is free to reorder the store for vblank->count
> _ahead_ of the store for the timestamp. And that would trick readers into
> believing that they have a valid timestamp when they potentially raced.
>
> Now you're correct that right now there's no such thing going on, and it's
> unlikely to happen (given the nature of vblank updates). But my point is
> that if we optimize this then the correctness can't be proven locally
> anymore by just looking at store_vblank, but instead you must audit all
> the callers. And leaking locking/barriers like that is too fragile design
> for my taste.
>
> But you insist that my approach is broken somehow and dropping the smp_wmb
> is needed for correctness. I don't see how that's the case at all.
> -Daniel
>

Fwiw, i spent some time reeducating myself about memory barriers (thanks 
for your explanations) and thinking about this, and the last version of 
your patch looks good to me. It also makes sense to me to leave that 
last smp_wmb() in place to make future use of the helper robust - for 
non-local correctness, to avoid having to audit all future callers of 
that helper.

I also tested your patch + a slightly modified version of Chris vblank 
delayed disable / instant query patches + my fixes using my own stress 
tests and hardware timing test equipment on both intel and nouveau, and 
everything seems to work fine.

So i'm all for including this patch and it has my

Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>

I just sent out an updated version of my patches, so they don't conflict 
with this one and also fix a compile failure of drm/qxl with yours.

Thanks,
-mario
On 05/04/2015 12:52 AM, Mario Kleiner wrote:
> On 04/16/2015 03:03 PM, Daniel Vetter wrote:
>> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
>>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
>>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>>>>> [...]
>>>>>> +static void store_vblank(struct drm_device *dev, int crtc,
>>>>>> +             unsigned vblank_count_inc,
>>>>>> +             struct timeval *t_vblank)
>>>>>> +{
>>>>>> +    struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
>>>>>> +    u32 tslot;
>>>>>> +
>>>>>> +    assert_spin_locked(&dev->vblank_time_lock);
>>>>>> +
>>>>>> +    if (t_vblank) {
>>>>>> +        tslot = vblank->count + vblank_count_inc;
>>>>>> +        vblanktimestamp(dev, crtc, tslot) = *t_vblank;
>>>>>> +    }
>>>>>> +
>>>>>> +    /*
>>>>>> +     * vblank timestamp updates are protected on the write side with
>>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
>>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
>>>>>> +     * memory barriers. We need the barrier both before and also after the
>>>>>> +     * counter update to synchronize with the next timestamp write.
>>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
>>>>>> +     */
>>>>>> +    smp_wmb();
>>>>>> +    vblank->count += vblank_count_inc;
>>>>>> +    smp_wmb();
>>>>>
>>>>> The comment and the code are each self-contradictory.
>>>>>
>>>>> If vblank->count writes are always protected by vblank_time_lock (something I
>>>>> did not verify but that the comment above asserts), then the trailing write
>>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>>>>
>>>>> A spin unlock operation is always a write barrier.
>>>>
>>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
>>>> That the spinlock is held I can assure. That no one goes around and does
>>>> multiple vblank updates (because somehow that code raced with the hw
>>>> itself) I can't easily assure with a simple assert or something similar.
>>>> It's not the case right now, but that can changes.
>>>
>>> The algorithm would be broken if multiple updates for the same vblank
>>> count were allowed; that's why it checks to see if the vblank count has
>>> not advanced before storing a new timestamp.
>>>
>>> Otherwise, the read side would not be able to determine that the
>>> timestamp is valid by double-checking that the vblank count has not
>>> changed.
>>>
>>> And besides, even if the code looped without dropping the spinlock,
>>> the correct write order would still be observed because it would still
>>> be executing on the same cpu.
>>>
>>> My objection to the write memory barrier is not about optimization;
>>> it's about correct code.
>>
>> Well diff=0 is not allowed, I guess I could enforce this with some
>> WARN_ON. And I still think my point of non-local correctness is solid.
>> With the smp_wmb() removed the following still works correctly:
>>
>> spin_lock(vblank_time_lock);
>> store_vblank(dev, crtc, 1, ts1);
>> spin_unlock(vblank_time_lock);
>>
>> spin_lock(vblank_time_lock);
>> store_vblank(dev, crtc, 1, ts2);
>> spin_unlock(vblank_time_lock);
>>
>> But with the smp_wmb(); removed the following would be broken:
>>
>> spin_lock(vblank_time_lock);
>> store_vblank(dev, crtc, 1, ts1);
>> store_vblank(dev, crtc, 1, ts2);
>> spin_unlock(vblank_time_lock);
>>
>> because the compiler/cpu is free to reorder the store for vblank->count
>> _ahead_ of the store for the timestamp. And that would trick readers into
>> believing that they have a valid timestamp when they potentially raced.
>>
>> Now you're correct that right now there's no such thing going on, and it's
>> unlikely to happen (given the nature of vblank updates). But my point is
>> that if we optimize this then the correctness can't be proven locally
>> anymore by just looking at store_vblank, but instead you must audit all
>> the callers. And leaking locking/barriers like that is too fragile design
>> for my taste.
>>
>> But you insist that my approach is broken somehow and dropping the smp_wmb
>> is needed for correctness. I don't see how that's the case at all.

Daniel,

I've been really busy this last week; my apologies for not replying promptly.

> Fwiw, i spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smb_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.

My concern wrt unnecessary barriers in this algorithm is that the trailing
barrier now appears mandatory, when in fact it is not.

Moreover, this algorithm is, in general, fragile and not designed to handle
random or poorly-researched changes.

For example, if only the read and store operations are considered, it's obviously
unsafe, since a read may unwittingly retrieve a store in progress.


CPU 0                                   | CPU 1
                                        |
                             /* vblank->count == 0 */
                                        |
drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
                                        |
  cur_vblank <= LOAD vblank->count      |
                                        |   tslot = vblank->count + 2
                                        |   /* tslot == 2 */
                                        |   STORE vblanktime[0]
  - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -
  /* cur_vblank == 0 */                 |
  local <= LOAD vblanktime[0]           |
  smp_rmb - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
                                        |
 * cpu has loaded the wrong timestamp * |
                                        |
  local <= LOAD vblank->count           |
  cur_vblank == local?                  |
  yes - exit loop                       |
                                        |   vblank->count += 2
  - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -

Regards,
Peter Hurley


> I also tested your patch + a slightly modified version of Chris vblank delayed disable / instant query patches + my fixes using my own stress tests and hardware timing test equipment on both intel and nouveau, and everything seems to work fine.
>
> So i'm all for including this patch and it has my
> 
> Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> 
> I just sent out an updated version of my patches, so they don't conflict with this one and also fix a compile failure of drm/qxl with yours.
> 
> Thanks,
> -mario
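The read side that Peter's diagram exercises is, per the commit message, a retry loop with two read-side barriers (the first of which this patch adds). A simplified C11 model of that loop follows; the names are hypothetical, not the actual drm_vblank_count_and_time():

```c
#include <stdatomic.h>

atomic_uint vbl_count;   /* written by the store_vblank() side */
int vbl_time[8];         /* timestamp ring, indexed by count */

int read_count_and_time(unsigned *out)
{
	unsigned c1, c2;
	int ts;

	do {
		c1 = atomic_load_explicit(&vbl_count, memory_order_relaxed);
		/* rmb #1: order the counter read before the timestamp read */
		atomic_thread_fence(memory_order_acquire);
		ts = vbl_time[c1 & 7];
		/* rmb #2: order the timestamp read before the re-check */
		atomic_thread_fence(memory_order_acquire);
		c2 = atomic_load_explicit(&vbl_count, memory_order_relaxed);
	} while (c1 != c2);	/* writer raced us: retry */

	*out = c1;
	return ts;
}
```

The c1 == c2 re-check is the seqlock-style validation both sides of the thread agree on; the dispute is only about which writer-side barriers it pairs with.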
On Tue, May 05, 2015 at 10:36:24AM -0400, Peter Hurley wrote:
> On 05/04/2015 12:52 AM, Mario Kleiner wrote:
> > On 04/16/2015 03:03 PM, Daniel Vetter wrote:
> >> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
> >>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
> >>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
> >>>>> Hi Daniel,
> >>>>>
> >>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
> >>>>>> [...]
> >>>>>> +static void store_vblank(struct drm_device *dev, int crtc,
> >>>>>> +             unsigned vblank_count_inc,
> >>>>>> +             struct timeval *t_vblank)
> >>>>>> +{
> >>>>>> +    struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
> >>>>>> +    u32 tslot;
> >>>>>> +
> >>>>>> +    assert_spin_locked(&dev->vblank_time_lock);
> >>>>>> +
> >>>>>> +    if (t_vblank) {
> >>>>>> +        tslot = vblank->count + vblank_count_inc;
> >>>>>> +        vblanktimestamp(dev, crtc, tslot) = *t_vblank;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    /*
> >>>>>> +     * vblank timestamp updates are protected on the write side with
> >>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
> >>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
> >>>>>> +     * memory barriers. We need the barrier both before and also after the
> >>>>>> +     * counter update to synchronize with the next timestamp write.
> >>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
> >>>>>> +     */
> >>>>>> +    smp_wmb();
> >>>>>> +    vblank->count += vblank_count_inc;
> >>>>>> +    smp_wmb();
> >>>>>
> >>>>> The comment and the code are each self-contradictory.
> >>>>>
> >>>>> If vblank->count writes are always protected by vblank_time_lock (something I
> >>>>> did not verify but that the comment above asserts), then the trailing write
> >>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
> >>>>>
> >>>>> A spin unlock operation is always a write barrier.
> >>>>
> >>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
> >>>> That the spinlock is held I can assure. That no one goes around and does
> >>>> multiple vblank updates (because somehow that code raced with the hw
> >>>> itself) I can't easily assure with a simple assert or something similar.
> >>>> It's not the case right now, but that can changes.
> >>>
> >>> The algorithm would be broken if multiple updates for the same vblank
> >>> count were allowed; that's why it checks to see if the vblank count has
> >>> not advanced before storing a new timestamp.
> >>>
> >>> Otherwise, the read side would not be able to determine that the
> >>> timestamp is valid by double-checking that the vblank count has not
> >>> changed.
> >>>
> >>> And besides, even if the code looped without dropping the spinlock,
> >>> the correct write order would still be observed because it would still
> >>> be executing on the same cpu.
> >>>
> >>> My objection to the write memory barrier is not about optimization;
> >>> it's about correct code.
> >>
> >> Well diff=0 is not allowed, I guess I could enforce this with some
> >> WARN_ON. And I still think my point of non-local correctness is solid.
> >> With the smp_wmb() removed the following still works correctly:
> >>
> >> spin_lock(vblank_time_lock);
> >> store_vblank(dev, crtc, 1, ts1);
> >> spin_unlock(vblank_time_lock);
> >>
> >> spin_lock(vblank_time_lock);
> >> store_vblank(dev, crtc, 1, ts2);
> >> spin_unlock(vblank_time_lock);
> >>
> >> But with the smp_wmb(); removed the following would be broken:
> >>
> >> spin_lock(vblank_time_lock);
> >> store_vblank(dev, crtc, 1, ts1);
> >> store_vblank(dev, crtc, 1, ts2);
> >> spin_unlock(vblank_time_lock);
> >>
> >> because the compiler/cpu is free to reorder the store for vblank->count
> >> _ahead_ of the store for the timestamp. And that would trick readers into
> >> believing that they have a valid timestamp when they potentially raced.
> >>
> >> Now you're correct that right now there's no such thing going on, and it's
> >> unlikely to happen (given the nature of vblank updates). But my point is
> >> that if we optimize this then the correctness can't be proven locally
> >> anymore by just looking at store_vblank, but instead you must audit all
> >> the callers. And leaking locking/barriers like that is too fragile design
> >> for my taste.
> >>
> >> But you insist that my approach is broken somehow and dropping the smp_wmb
> >> is needed for correctness. I don't see how that's the case at all.
> 
> Daniel,
> 
> I've been really busy this last week; my apologies for not replying promptly.
> 
> > Fwiw, i spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smb_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.
> 
> My concern wrt to unnecessary barriers in this algorithm is that the trailing
> barrier now appears mandatory, when in fact it is not.
> 
> Moreover, this algorithm is, in general, fragile and not designed to handle
> random or poorly-researched changes.

Less fragility is exactly why I want that surplus barrier. But I've run
out of new ideas for how to explain that ...

> For example, if only the read and store operations are considered, it's obviously
> unsafe, since a read may unwittingly retrieve an store in progress.
> 
> 
> CPU 0                                   | CPU 1
>                                         |
>                              /* vblank->count == 0 */
>                                         |
> drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
>                                         |
>   cur_vblank <= LOAD vblank->count      |
>                                         |   tslot = vblank->count + 2
>                                         |   /* tslot == 2 */
>                                         |   STORE vblanktime[0]

This line here is wrong, it should be "STORE vblanktime[2]"

The "STORE vblanktime[0]" happened way earlier, before two smp_wmb()s and
the previous update of vblank->count.

I'm also somewhat confused about how you draw a line across both cpus for
barriers because barriers only have cpu-local effects (which is why we
always need a barrier on both ends of a transaction).

In short I still don't follow what's wrong.
-Daniel

>   - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -
>   /* cur_vblank == 0 */                 |
>   local <= LOAD vblanktime[0]           |
>   smp_rmb - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>                                         |
>  * cpu has loaded the wrong timestamp * |
>                                         |
>   local <= LOAD vblank->count           |
>   cur_vblank == local?                  |
>   yes - exit loop                       |
>                                         |   vblank->count += 2
>   - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -
> 
> Regards,
> Peter Hurley
> 
> 
> > I also tested your patch + a slightly modified version of Chris vblank delayed disable / instant query patches + my fixes using my own stress tests and hardware timing test equipment on both intel and nouveau, and everything seems to work fine.
> >
> > So i'm all for including this patch and it has my
> > 
> > Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
> > 
> > I just sent out an updated version of my patches, so they don't conflict with this one and also fix a compile failure of drm/qxl with yours.
> > 
> > Thanks,
> > -mario
>
On 05/05/2015 11:42 AM, Daniel Vetter wrote:
> On Tue, May 05, 2015 at 10:36:24AM -0400, Peter Hurley wrote:
>> On 05/04/2015 12:52 AM, Mario Kleiner wrote:
>>> On 04/16/2015 03:03 PM, Daniel Vetter wrote:
>>>> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
>>>>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
>>>>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>>>>>>> This was a bit too much cargo-culted, so lets make it solid:
>>>>>>>> - vblank->count doesn't need to be an atomic, writes are always done
>>>>>>>>    under the protection of dev->vblank_time_lock. Switch to an unsigned
>>>>>>>>    long instead and update comments. Note that atomic_read is just a
>>>>>>>>    normal read of a volatile variable, so no need to audit all the
>>>>>>>>    read-side access specifically.
>>>>>>>>
>>>>>>>> - The barriers for the vblank counter seqlock weren't complete: The
>>>>>>>>    read-side was missing the first barrier between the counter read and
>>>>>>>>    the timestamp read, it only had a barrier between the ts and the
>>>>>>>>    counter read. We need both.
>>>>>>>>
>>>>>>>> - Barriers weren't properly documented. Since barriers only work if
>>>>>>>>    you have them on boths sides of the transaction it's prudent to
>>>>>>>>    reference where the other side is. To avoid duplicating the
>>>>>>>>    write-side comment 3 times extract a little store_vblank() helper.
>>>>>>>>    In that helper also assert that we do indeed hold
>>>>>>>>    dev->vblank_time_lock, since in some cases the lock is acquired a
>>>>>>>>    few functions up in the callchain.
>>>>>>>>
>>>>>>>> Spotted while reviewing a patch from Chris Wilson to add a fastpath to
>>>>>>>> the vblank_wait ioctl.
>>>>>>>>
>>>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>>> Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
>>>>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>>>> Cc: Michel Dänzer <michel@daenzer.net>
>>>>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>> ---
>>>>>>>>   drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
>>>>>>>>   include/drm/drmP.h        |  8 +++--
>>>>>>>>   2 files changed, 54 insertions(+), 46 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>>>>>>> index c8a34476570a..23bfbc61a494 100644
>>>>>>>> --- a/drivers/gpu/drm/drm_irq.c
>>>>>>>> +++ b/drivers/gpu/drm/drm_irq.c
>>>>>>>> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
>>>>>>>>   module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
>>>>>>>>   module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>>>>>>>>
>>>>>>>> +static void store_vblank(struct drm_device *dev, int crtc,
>>>>>>>> +             unsigned vblank_count_inc,
>>>>>>>> +             struct timeval *t_vblank)
>>>>>>>> +{
>>>>>>>> +    struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
>>>>>>>> +    u32 tslot;
>>>>>>>> +
>>>>>>>> +    assert_spin_locked(&dev->vblank_time_lock);
>>>>>>>> +
>>>>>>>> +    if (t_vblank) {
>>>>>>>> +        tslot = vblank->count + vblank_count_inc;
>>>>>>>> +        vblanktimestamp(dev, crtc, tslot) = *t_vblank;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    /*
>>>>>>>> +     * vblank timestamp updates are protected on the write side with
>>>>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
>>>>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
>>>>>>>> +     * memory barrriers. We need the barrier both before and also after the
>>>>>>>> +     * counter update to synchronize with the next timestamp write.
>>>>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
>>>>>>>> +     */
>>>>>>>> +    smp_wmb();
>>>>>>>> +    vblank->count += vblank_count_inc;
>>>>>>>> +    smp_wmb();
>>>>>>>
>>>>>>> The comment and the code are each self-contradictory.
>>>>>>>
>>>>>>> If vblank->count writes are always protected by vblank_time_lock (something I
>>>>>>> did not verify but that the comment above asserts), then the trailing write
>>>>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>>>>>>
>>>>>>> A spin unlock operation is always a write barrier.
>>>>>>
>>>>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
>>>>>> That the spinlock is held I can assure. That no one goes around and does
>>>>>> multiple vblank updates (because somehow that code raced with the hw
>>>>>> itself) I can't easily assure with a simple assert or something similar.
>>>>>> It's not the case right now, but that can changes.
>>>>>
>>>>> The algorithm would be broken if multiple updates for the same vblank
>>>>> count were allowed; that's why it checks to see if the vblank count has
>>>>> not advanced before storing a new timestamp.
>>>>>
>>>>> Otherwise, the read side would not be able to determine that the
>>>>> timestamp is valid by double-checking that the vblank count has not
>>>>> changed.
>>>>>
>>>>> And besides, even if the code looped without dropping the spinlock,
>>>>> the correct write order would still be observed because it would still
>>>>> be executing on the same cpu.
>>>>>
>>>>> My objection to the write memory barrier is not about optimization;
>>>>> it's about correct code.
>>>>
>>>> Well diff=0 is not allowed, I guess I could enforce this with some
>>>> WARN_ON. And I still think my point of non-local correctness is solid.
>>>> With the smp_wmb() removed the following still works correctly:
>>>>
>>>> spin_lock(vblank_time_lock);
>>>> store_vblank(dev, crtc, 1, ts1);
>>>> spin_unlock(vblank_time_lock);
>>>>
>>>> spin_lock(vblank_time_lock);
>>>> store_vblank(dev, crtc, 1, ts2);
>>>> spin_unlock(vblank_time_lock);
>>>>
>>>> But with the smp_wmb(); removed the following would be broken:
>>>>
>>>> spin_lock(vblank_time_lock);
>>>> store_vblank(dev, crtc, 1, ts1);
>>>> store_vblank(dev, crtc, 1, ts2);
>>>> spin_unlock(vblank_time_lock);
>>>>
>>>> because the compiler/cpu is free to reorder the store for vblank->count
>>>> _ahead_ of the store for the timestamp. And that would trick readers into
>>>> believing that they have a valid timestamp when they potentially raced.
>>>>
>>>> Now you're correct that right now there's no such thing going on, and it's
>>>> unlikely to happen (given the nature of vblank updates). But my point is
>>>> that if we optimize this then the correctness can't be proven locally
>>>> anymore by just looking at store_vblank, but instead you must audit all
>>>> the callers. And leaking locking/barriers like that is too fragile design
>>>> for my taste.
>>>>
>>>> But you insist that my approach is broken somehow and dropping the smp_wmb
>>>> is needed for correctness. I don't see how that's the case at all.
>>
>> Daniel,
>>
>> I've been really busy this last week; my apologies for not replying promptly.
>>
>>> Fwiw, I spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smp_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.
>>
>> My concern wrt unnecessary barriers in this algorithm is that the trailing
>> barrier now appears mandatory, when in fact it is not.
>>
>> Moreover, this algorithm is, in general, fragile and not designed to handle
>> random or poorly-researched changes.
> 
> Less fragility is exactly why I want that surplus barrier. But I've run
> out of new ideas for how to explain that ...
> 
>> For example, if only the read and store operations are considered, it's obviously
>> unsafe, since a read may unwittingly retrieve a store in progress.
>>
>>
>> CPU 0                                   | CPU 1
>>                                         |
>>                              /* vblank->count == 0 */
>>                                         |
>> drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
>>                                         |
>>   cur_vblank <= LOAD vblank->count      |
>>                                         |   tslot = vblank->count + 2
>>                                         |   /* tslot == 2 */
>>                                         |   STORE vblanktime[0]
> 
> This line here is wrong, it should be "STORE vblanktime[2]"
> 
> The "STORE vblanktime[0]" happened way earlier, before 2 smp_wmb and the
> previous updating of vblank->count.

&vblanktime[0] == &vblanktime[2]

That's why I keep trying to explain that you actually have to look at and
understand the algorithm before blindly assuming local behavior is
sufficient.

> I'm also somewhat confused about how you to a line across both cpus for
> barriers because barriers only have cpu-local effects (which is why we
> always need a barrier on both ends of a transaction).
> 
> In short I still don't follow what's wrong.
> -Daniel
> 
>>   - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -
>>   /* cur_vblank == 0 */                 |
>>   local <= LOAD vblanktime[0]           |
>>   smp_rmb - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
>>                                         |
>>  * cpu has loaded the wrong timestamp * |
>>                                         |
>>   local <= LOAD vblank->count           |
>>   cur_vblank == local?                  |
>>   yes - exit loop                       |
>>                                         |   vblank->count += 2
>>   - - - - - - - - - - - - - - - - - - - - - smp_wmb() - - - - - - - - - -
>>
>> Regards,
>> Peter Hurley
>>
>>
>>> I also tested your patch + a slightly modified version of Chris vblank delayed disable / instant query patches + my fixes using my own stress tests and hardware timing test equipment on both intel and nouveau, and everything seems to work fine.
>>>
>>> So i'm all for including this patch and it has my
>>>
>>> Reviewed-and-tested-by: Mario Kleiner <mario.kleiner.de@gmail.com>
>>>
>>> I just sent out an updated version of my patches, so they don't conflict with this one and also fix a compile failure of drm/qxl with yours.
>>>
>>> Thanks,
>>> -mario
>>
>
On 05/05/2015 11:57 AM, Peter Hurley wrote:
> On 05/05/2015 11:42 AM, Daniel Vetter wrote:
>> I'm also somewhat confused about how you to a line across both cpus for
>> barriers because barriers only have cpu-local effects (which is why we
>> always need a barrier on both ends of a transaction).

I'm sorry if my barrier notation confuses you; I find that it clearly
identifies matching pairs.

Also, there is a distinction between "can be visible" and "must be visible";
the loads and stores themselves are not cpu-local.

Regards,
Peter Hurley
On Tue, May 05, 2015 at 11:57:42AM -0400, Peter Hurley wrote:
> On 05/05/2015 11:42 AM, Daniel Vetter wrote:
> > On Tue, May 05, 2015 at 10:36:24AM -0400, Peter Hurley wrote:
> >> On 05/04/2015 12:52 AM, Mario Kleiner wrote:
> >>> On 04/16/2015 03:03 PM, Daniel Vetter wrote:
> >>>> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
> >>>>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
> >>>>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
> >>>>>>> Hi Daniel,
> >>>>>>>
> >>>>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
> >>>>>>>> [commit message and patch hunk snipped; quoted in full earlier in the thread]
> >>>>>>>> +    /*
> >>>>>>>> +     * vblank timestamp updates are protected on the write side with
> >>>>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
> >>>>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
> >>>>>>>> +     * memory barrriers. We need the barrier both before and also after the
> >>>>>>>> +     * counter update to synchronize with the next timestamp write.
> >>>>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
> >>>>>>>> +     */
> >>>>>>>> +    smp_wmb();
> >>>>>>>> +    vblank->count += vblank_count_inc;
> >>>>>>>> +    smp_wmb();
> >>>>>>>
> >>>>>>> The comment and the code are each self-contradictory.
> >>>>>>>
> >>>>>>> If vblank->count writes are always protected by vblank_time_lock (something I
> >>>>>>> did not verify but that the comment above asserts), then the trailing write
> >>>>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
> >>>>>>>
> >>>>>>> A spin unlock operation is always a write barrier.
> >>>>>>
> >>>>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
> >>>>>> That the spinlock is held I can assure. That no one goes around and does
> >>>>>> multiple vblank updates (because somehow that code raced with the hw
> >>>>>> itself) I can't easily assure with a simple assert or something similar.
> >>>>>> It's not the case right now, but that can changes.
> >>>>>
> >>>>> The algorithm would be broken if multiple updates for the same vblank
> >>>>> count were allowed; that's why it checks to see if the vblank count has
> >>>>> not advanced before storing a new timestamp.
> >>>>>
> >>>>> Otherwise, the read side would not be able to determine that the
> >>>>> timestamp is valid by double-checking that the vblank count has not
> >>>>> changed.
> >>>>>
> >>>>> And besides, even if the code looped without dropping the spinlock,
> >>>>> the correct write order would still be observed because it would still
> >>>>> be executing on the same cpu.
> >>>>>
> >>>>> My objection to the write memory barrier is not about optimization;
> >>>>> it's about correct code.
> >>>>
> >>>> Well diff=0 is not allowed, I guess I could enforce this with some
> >>>> WARN_ON. And I still think my point of non-local correctness is solid.
> >>>> With the smp_wmb() removed the following still works correctly:
> >>>>
> >>>> spin_lock(vblank_time_lock);
> >>>> store_vblank(dev, crtc, 1, ts1);
> >>>> spin_unlock(vblank_time_lock);
> >>>>
> >>>> spin_lock(vblank_time_lock);
> >>>> store_vblank(dev, crtc, 1, ts2);
> >>>> spin_unlock(vblank_time_lock);
> >>>>
> >>>> But with the smp_wmb(); removed the following would be broken:
> >>>>
> >>>> spin_lock(vblank_time_lock);
> >>>> store_vblank(dev, crtc, 1, ts1);
> >>>> store_vblank(dev, crtc, 1, ts2);
> >>>> spin_unlock(vblank_time_lock);
> >>>>
> >>>> because the compiler/cpu is free to reorder the store for vblank->count
> >>>> _ahead_ of the store for the timestamp. And that would trick readers into
> >>>> believing that they have a valid timestamp when they potentially raced.
> >>>>
> >>>> Now you're correct that right now there's no such thing going on, and it's
> >>>> unlikely to happen (given the nature of vblank updates). But my point is
> >>>> that if we optimize this then the correctness can't be proven locally
> >>>> anymore by just looking at store_vblank, but instead you must audit all
> >>>> the callers. And leaking locking/barriers like that is too fragile design
> >>>> for my taste.
> >>>>
> >>>> But you insist that my approach is broken somehow and dropping the smp_wmb
> >>>> is needed for correctness. I don't see how that's the case at all.
> >>
> >> Daniel,
> >>
> >> I've been really busy this last week; my apologies for not replying promptly.
> >>
> >>> Fwiw, i spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smb_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.
> >>
> >> My concern wrt to unnecessary barriers in this algorithm is that the trailing
> >> barrier now appears mandatory, when in fact it is not.
> >>
> >> Moreover, this algorithm is, in general, fragile and not designed to handle
> >> random or poorly-researched changes.
> > 
> > Less fragility is exactly why I want that surplus barrier. But I've run
> > out of new ideas for how to explain that ...
> > 
> >> For example, if only the read and store operations are considered, it's obviously
> >> unsafe, since a read may unwittingly retrieve an store in progress.
> >>
> >>
> >> CPU 0                                   | CPU 1
> >>                                         |
> >>                              /* vblank->count == 0 */
> >>                                         |
> >> drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
> >>                                         |
> >>   cur_vblank <= LOAD vblank->count      |
> >>                                         |   tslot = vblank->count + 2
> >>                                         |   /* tslot == 2 */
> >>                                         |   STORE vblanktime[0]
> > 
> > This line here is wrong, it should be "STORE vblanktime[2]"
> > 
> > The "STORE vblanktime[0]" happened way earlier, before 2 smp_wmb and the
> > previous updating of vblank->count.
> 
> &vblanktime[0] == &vblanktime[2]
> 
> That's why I keep trying to explain you actually have to look at and
> understand the algorithm before blindly assuming local behavior is
> sufficient.

Ok, now I think I got it: the issue is when the array (which is only 2
elements big) wraps around. And that's racy because the counter isn't
incremented both before _and_ after the write-side timestamp update. But
that seems like a bug that's always been there?
-Daniel
On 05/06/2015 04:56 AM, Daniel Vetter wrote:
> On Tue, May 05, 2015 at 11:57:42AM -0400, Peter Hurley wrote:
>> On 05/05/2015 11:42 AM, Daniel Vetter wrote:
>>> On Tue, May 05, 2015 at 10:36:24AM -0400, Peter Hurley wrote:
>>>> On 05/04/2015 12:52 AM, Mario Kleiner wrote:
>>>>> On 04/16/2015 03:03 PM, Daniel Vetter wrote:
>>>>>> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
>>>>>>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
>>>>>>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>>>>>>>>> Hi Daniel,
>>>>>>>>>
>>>>>>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>>>>>>>>>> [commit message and patch hunk snipped; quoted in full earlier in the thread]
>>>>>>>>>> +    /*
>>>>>>>>>> +     * vblank timestamp updates are protected on the write side with
>>>>>>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
>>>>>>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
>>>>>>>>>> +     * memory barrriers. We need the barrier both before and also after the
>>>>>>>>>> +     * counter update to synchronize with the next timestamp write.
>>>>>>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
>>>>>>>>>> +     */
>>>>>>>>>> +    smp_wmb();
>>>>>>>>>> +    vblank->count += vblank_count_inc;
>>>>>>>>>> +    smp_wmb();
>>>>>>>>>
>>>>>>>>> The comment and the code are each self-contradictory.
>>>>>>>>>
>>>>>>>>> If vblank->count writes are always protected by vblank_time_lock (something I
>>>>>>>>> did not verify but that the comment above asserts), then the trailing write
>>>>>>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>>>>>>>>
>>>>>>>>> A spin unlock operation is always a write barrier.
>>>>>>>>
>>>>>>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
>>>>>>>> That the spinlock is held I can assure. That no one goes around and does
>>>>>>>> multiple vblank updates (because somehow that code raced with the hw
>>>>>>>> itself) I can't easily assure with a simple assert or something similar.
>>>>>>>> It's not the case right now, but that can changes.
>>>>>>>
>>>>>>> The algorithm would be broken if multiple updates for the same vblank
>>>>>>> count were allowed; that's why it checks to see if the vblank count has
>>>>>>> not advanced before storing a new timestamp.
>>>>>>>
>>>>>>> Otherwise, the read side would not be able to determine that the
>>>>>>> timestamp is valid by double-checking that the vblank count has not
>>>>>>> changed.
>>>>>>>
>>>>>>> And besides, even if the code looped without dropping the spinlock,
>>>>>>> the correct write order would still be observed because it would still
>>>>>>> be executing on the same cpu.
>>>>>>>
>>>>>>> My objection to the write memory barrier is not about optimization;
>>>>>>> it's about correct code.
>>>>>>
>>>>>> Well diff=0 is not allowed, I guess I could enforce this with some
>>>>>> WARN_ON. And I still think my point of non-local correctness is solid.
>>>>>> With the smp_wmb() removed the following still works correctly:
>>>>>>
>>>>>> spin_lock(vblank_time_lock);
>>>>>> store_vblank(dev, crtc, 1, ts1);
>>>>>> spin_unlock(vblank_time_lock);
>>>>>>
>>>>>> spin_lock(vblank_time_lock);
>>>>>> store_vblank(dev, crtc, 1, ts2);
>>>>>> spin_unlock(vblank_time_lock);
>>>>>>
>>>>>> But with the smp_wmb(); removed the following would be broken:
>>>>>>
>>>>>> spin_lock(vblank_time_lock);
>>>>>> store_vblank(dev, crtc, 1, ts1);
>>>>>> store_vblank(dev, crtc, 1, ts2);
>>>>>> spin_unlock(vblank_time_lock);
>>>>>>
>>>>>> because the compiler/cpu is free to reorder the store for vblank->count
>>>>>> _ahead_ of the store for the timestamp. And that would trick readers into
>>>>>> believing that they have a valid timestamp when they potentially raced.
>>>>>>
>>>>>> Now you're correct that right now there's no such thing going on, and it's
>>>>>> unlikely to happen (given the nature of vblank updates). But my point is
>>>>>> that if we optimize this then the correctness can't be proven locally
>>>>>> anymore by just looking at store_vblank, but instead you must audit all
>>>>>> the callers. And leaking locking/barriers like that is too fragile design
>>>>>> for my taste.
>>>>>>
>>>>>> But you insist that my approach is broken somehow and dropping the smp_wmb
>>>>>> is needed for correctness. I don't see how that's the case at all.
>>>>
>>>> Daniel,
>>>>
>>>> I've been really busy this last week; my apologies for not replying promptly.
>>>>
>>>>> Fwiw, i spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smb_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.
>>>>
>>>> My concern wrt to unnecessary barriers in this algorithm is that the trailing
>>>> barrier now appears mandatory, when in fact it is not.
>>>>
>>>> Moreover, this algorithm is, in general, fragile and not designed to handle
>>>> random or poorly-researched changes.
>>>
>>> Less fragility is exactly why I want that surplus barrier. But I've run
>>> out of new ideas for how to explain that ...
>>>
>>>> For example, if only the read and store operations are considered, it's obviously
>>>> unsafe, since a read may unwittingly retrieve an store in progress.
>>>>
>>>>
>>>> CPU 0                                   | CPU 1
>>>>                                         |
>>>>                              /* vblank->count == 0 */
>>>>                                         |
>>>> drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
>>>>                                         |
>>>>   cur_vblank <= LOAD vblank->count      |
>>>>                                         |   tslot = vblank->count + 2
>>>>                                         |   /* tslot == 2 */
>>>>                                         |   STORE vblanktime[0]
>>>
>>> This line here is wrong, it should be "STORE vblanktime[2]"
>>>
>>> The "STORE vblanktime[0]" happened way earlier, before 2 smp_wmb and the
>>> previous updating of vblank->count.
>>
>> &vblanktime[0] == &vblanktime[2]
>>
>> That's why I keep trying to explain you actually have to look at and
>> understand the algorithm before blindly assuming local behavior is
>> sufficient.
> 
> Ok now I think I got it, the issue is when the array (which is only 2
> elements big) wraps around. And that's racy because we don't touch the
> increment before _and_ after the write side update. But that seems like a
> bug that's always been there?

I'm not sure if those conditions can actually occur; it's been a long time
since I analyzed vblank timestamping.
On 05/07/2015 01:56 PM, Peter Hurley wrote:
> On 05/06/2015 04:56 AM, Daniel Vetter wrote:
>> On Tue, May 05, 2015 at 11:57:42AM -0400, Peter Hurley wrote:
>>> On 05/05/2015 11:42 AM, Daniel Vetter wrote:
>>>> On Tue, May 05, 2015 at 10:36:24AM -0400, Peter Hurley wrote:
>>>>> On 05/04/2015 12:52 AM, Mario Kleiner wrote:
>>>>>> On 04/16/2015 03:03 PM, Daniel Vetter wrote:
>>>>>>> On Thu, Apr 16, 2015 at 08:30:55AM -0400, Peter Hurley wrote:
>>>>>>>> On 04/15/2015 01:31 PM, Daniel Vetter wrote:
>>>>>>>>> On Wed, Apr 15, 2015 at 09:00:04AM -0400, Peter Hurley wrote:
>>>>>>>>>> Hi Daniel,
>>>>>>>>>>
>>>>>>>>>> On 04/15/2015 03:17 AM, Daniel Vetter wrote:
>>>>>>>>>>> This was a bit too much cargo-culted, so lets make it solid:
>>>>>>>>>>> - vblank->count doesn't need to be an atomic, writes are always done
>>>>>>>>>>>     under the protection of dev->vblank_time_lock. Switch to an unsigned
>>>>>>>>>>>     long instead and update comments. Note that atomic_read is just a
>>>>>>>>>>>     normal read of a volatile variable, so no need to audit all the
>>>>>>>>>>>     read-side access specifically.
>>>>>>>>>>>
>>>>>>>>>>> - The barriers for the vblank counter seqlock weren't complete: The
>>>>>>>>>>>     read-side was missing the first barrier between the counter read and
>>>>>>>>>>>     the timestamp read, it only had a barrier between the ts and the
>>>>>>>>>>>     counter read. We need both.
>>>>>>>>>>>
>>>>>>>>>>> - Barriers weren't properly documented. Since barriers only work if
>>>>>>>>>>>     you have them on boths sides of the transaction it's prudent to
>>>>>>>>>>>     reference where the other side is. To avoid duplicating the
>>>>>>>>>>>     write-side comment 3 times extract a little store_vblank() helper.
>>>>>>>>>>>     In that helper also assert that we do indeed hold
>>>>>>>>>>>     dev->vblank_time_lock, since in some cases the lock is acquired a
>>>>>>>>>>>     few functions up in the callchain.
>>>>>>>>>>>
>>>>>>>>>>> Spotted while reviewing a patch from Chris Wilson to add a fastpath to
>>>>>>>>>>> the vblank_wait ioctl.
>>>>>>>>>>>
>>>>>>>>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>>>>>>>>> Cc: Mario Kleiner <mario.kleiner.de@gmail.com>
>>>>>>>>>>> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
>>>>>>>>>>> Cc: Michel Dänzer <michel@daenzer.net>
>>>>>>>>>>> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>>>>>>>>>>> ---
>>>>>>>>>>>    drivers/gpu/drm/drm_irq.c | 92 ++++++++++++++++++++++++-----------------------
>>>>>>>>>>>    include/drm/drmP.h        |  8 +++--
>>>>>>>>>>>    2 files changed, 54 insertions(+), 46 deletions(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
>>>>>>>>>>> index c8a34476570a..23bfbc61a494 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/drm_irq.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/drm_irq.c
>>>>>>>>>>> @@ -74,6 +74,33 @@ module_param_named(vblankoffdelay, drm_vblank_offdelay, int, 0600);
>>>>>>>>>>>    module_param_named(timestamp_precision_usec, drm_timestamp_precision, int, 0600);
>>>>>>>>>>>    module_param_named(timestamp_monotonic, drm_timestamp_monotonic, int, 0600);
>>>>>>>>>>>
>>>>>>>>>>> +static void store_vblank(struct drm_device *dev, int crtc,
>>>>>>>>>>> +             unsigned vblank_count_inc,
>>>>>>>>>>> +             struct timeval *t_vblank)
>>>>>>>>>>> +{
>>>>>>>>>>> +    struct drm_vblank_crtc *vblank = &dev->vblank[crtc];
>>>>>>>>>>> +    u32 tslot;
>>>>>>>>>>> +
>>>>>>>>>>> +    assert_spin_locked(&dev->vblank_time_lock);
>>>>>>>>>>> +
>>>>>>>>>>> +    if (t_vblank) {
>>>>>>>>>>> +        tslot = vblank->count + vblank_count_inc;
>>>>>>>>>>> +        vblanktimestamp(dev, crtc, tslot) = *t_vblank;
>>>>>>>>>>> +    }
>>>>>>>>>>> +
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * vblank timestamp updates are protected on the write side with
>>>>>>>>>>> +     * vblank_time_lock, but on the read side done locklessly using a
>>>>>>>>>>> +     * sequence-lock on the vblank counter. Ensure correct ordering using
>>>>>>>>>>> +     * memory barriers. We need the barrier both before and also after the
>>>>>>>>>>> +     * counter update to synchronize with the next timestamp write.
>>>>>>>>>>> +     * The read-side barriers for this are in drm_vblank_count_and_time.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    smp_wmb();
>>>>>>>>>>> +    vblank->count += vblank_count_inc;
>>>>>>>>>>> +    smp_wmb();
>>>>>>>>>>
>>>>>>>>>> The comment and the code are each self-contradictory.
>>>>>>>>>>
>>>>>>>>>> If vblank->count writes are always protected by vblank_time_lock (something I
>>>>>>>>>> did not verify but that the comment above asserts), then the trailing write
>>>>>>>>>> barrier is not required (and the assertion that it is in the comment is incorrect).
>>>>>>>>>>
>>>>>>>>>> A spin unlock operation is always a write barrier.
>>>>>>>>>
>>>>>>>>> Hm yeah. Otoh to me that's bordering on "code too clever for my own good".
>>>>>>>>> That the spinlock is held I can assure. That no one goes around and does
>>>>>>>>> multiple vblank updates (because somehow that code raced with the hw
>>>>>>>>> itself) I can't easily assure with a simple assert or something similar.
>>>>>>>>> It's not the case right now, but that can change.
>>>>>>>>
>>>>>>>> The algorithm would be broken if multiple updates for the same vblank
>>>>>>>> count were allowed; that's why it checks to see if the vblank count has
>>>>>>>> not advanced before storing a new timestamp.
>>>>>>>>
>>>>>>>> Otherwise, the read side would not be able to determine that the
>>>>>>>> timestamp is valid by double-checking that the vblank count has not
>>>>>>>> changed.
>>>>>>>>
>>>>>>>> And besides, even if the code looped without dropping the spinlock,
>>>>>>>> the correct write order would still be observed because it would still
>>>>>>>> be executing on the same cpu.
>>>>>>>>
>>>>>>>> My objection to the write memory barrier is not about optimization;
>>>>>>>> it's about correct code.
>>>>>>>
>>>>>>> Well diff=0 is not allowed, I guess I could enforce this with some
>>>>>>> WARN_ON. And I still think my point of non-local correctness is solid.
>>>>>>> With the smp_wmb() removed the following still works correctly:
>>>>>>>
>>>>>>> spin_lock(vblank_time_lock);
>>>>>>> store_vblank(dev, crtc, 1, ts1);
>>>>>>> spin_unlock(vblank_time_lock);
>>>>>>>
>>>>>>> spin_lock(vblank_time_lock);
>>>>>>> store_vblank(dev, crtc, 1, ts2);
>>>>>>> spin_unlock(vblank_time_lock);
>>>>>>>
>>>>>>> But with the smp_wmb(); removed the following would be broken:
>>>>>>>
>>>>>>> spin_lock(vblank_time_lock);
>>>>>>> store_vblank(dev, crtc, 1, ts1);
>>>>>>> store_vblank(dev, crtc, 1, ts2);
>>>>>>> spin_unlock(vblank_time_lock);
>>>>>>>
>>>>>>> because the compiler/cpu is free to reorder the store for vblank->count
>>>>>>> _ahead_ of the store for the timestamp. And that would trick readers into
>>>>>>> believing that they have a valid timestamp when they potentially raced.
>>>>>>>
>>>>>>> Now you're correct that right now there's no such thing going on, and it's
>>>>>>> unlikely to happen (given the nature of vblank updates). But my point is
>>>>>>> that if we optimize this then the correctness can't be proven locally
>>>>>>> anymore by just looking at store_vblank, but instead you must audit all
>>>>>>> the callers. And leaking locking/barriers like that is too fragile design
>>>>>>> for my taste.
>>>>>>>
>>>>>>> But you insist that my approach is broken somehow and dropping the smp_wmb
>>>>>>> is needed for correctness. I don't see how that's the case at all.
>>>>>
>>>>> Daniel,
>>>>>
>>>>> I've been really busy this last week; my apologies for not replying promptly.
>>>>>
>>>>>> Fwiw, i spent some time reeducating myself about memory barriers (thanks for your explanations) and thinking about this, and the last version of your patch looks good to me. It also makes sense to me to leave that last smb_wmb() in place to make future use of the helper robust - for non-local correctness, to avoid having to audit all future callers of that helper.
>>>>>
>>>>> My concern wrt to unnecessary barriers in this algorithm is that the trailing
>>>>> barrier now appears mandatory, when in fact it is not.
>>>>>
>>>>> Moreover, this algorithm is, in general, fragile and not designed to handle
>>>>> random or poorly-researched changes.
>>>>
>>>> Less fragility is exactly why I want that surplus barrier. But I've run
>>>> out of new ideas for how to explain that ...
>>>>
>>>>> For example, if only the read and store operations are considered, it's obviously
>>>>> unsafe, since a read may unwittingly retrieve a store in progress.
>>>>>
>>>>>
>>>>> CPU 0                                   | CPU 1
>>>>>                                          |
>>>>>                               /* vblank->count == 0 */
>>>>>                                          |
>>>>> drm_vblank_count_and_time()             | store_vblank(.., inc = 2, ...)
>>>>>                                          |
>>>>>    cur_vblank <= LOAD vblank->count      |
>>>>>                                          |   tslot = vblank->count + 2
>>>>>                                          |   /* tslot == 2 */
>>>>>                                          |   STORE vblanktime[0]
>>>>
>>>> This line here is wrong, it should be "STORE vblanktime[2]"
>>>>
>>>> The "STORE vblanktime[0]" happened way earlier, before 2 smp_wmb and the
>>>> previous updating of vblank->count.
>>>
>>> &vblanktime[0] == &vblanktime[2]
>>>
>>> That's why I keep trying to explain you actually have to look at and
>>> understand the algorithm before blindly assuming local behavior is
>>> sufficient.
>>
>> Ok now I think I got it, the issue is when the array (which is only 2
>> elements big) wraps around. And that's racy because we don't touch the
>> increment before _and_ after the write side update. But that seems like a
>> bug that's always been there?
>
> I'm not sure if those conditions can actually occur; it's been a long time
> since I analyzed vblank timestamping.
>
>

They shouldn't occur under correct use. Normally one has to wrap any
call to drm_vblank_count() or drm_vblank_count_and_time() in a
drm_vblank_get() -> query -> drm_vblank_put() pair. Only drm_vblank_get()
will call drm_update_vblank_count() on a refcount 0 -> 1 transition if
vblanks were previously off, and only that function bumps the count by
more than +1. In other words, the overflow case is never executed in
parallel with queries, so the problem is avoided.

Proper _get()->query->put() sequence is given for queueing vblank 
events, waiting for target vblank counts or during pageflip completion.

The one exception would be Chris' recently proposed "lockless instant 
query" patch, where a pure query is done - that's the patch that 
triggered Daniel's cleanup patch. I was just about to ack that one, and 
testing with my timing tests and under normal use didn't show problems. 
There drm_vblank_count_and_time() is used outside a get/put protected 
path, and I'm not sure whether some race there could happen under 
realistic conditions.

-mario