[13/14] staging: android: ion: Do not sync CPU cache on map/unmap

Submitted by Andrew F. Davis on Jan. 11, 2019, 6:05 p.m.

Details

Message ID 20190111180523.27862-14-afd@ti.com
State New
Series "Misc ION cleanups and adding unmapped heap"

Commit Message

Andrew F. Davis Jan. 11, 2019, 6:05 p.m.
Buffers may not be mapped from the CPU so skip cache maintenance here.
Accesses from the CPU to a cached heap should be bracketed with
{begin,end}_cpu_access calls so maintenance should not be needed anyway.

Signed-off-by: Andrew F. Davis <afd@ti.com>
---
 drivers/staging/android/ion/ion.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
index 14e48f6eb734..09cb5a8e2b09 100644
--- a/drivers/staging/android/ion/ion.c
+++ b/drivers/staging/android/ion/ion.c
@@ -261,8 +261,8 @@  static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
 
 	table = a->table;
 
-	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
-			direction))
+	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
+			      direction, DMA_ATTR_SKIP_CPU_SYNC))
 		return ERR_PTR(-ENOMEM);
 
 	return table;
@@ -272,7 +272,8 @@  static void ion_unmap_dma_buf(struct dma_buf_attachment *attachment,
 			      struct sg_table *table,
 			      enum dma_data_direction direction)
 {
-	dma_unmap_sg(attachment->dev, table->sgl, table->nents, direction);
+	dma_unmap_sg_attrs(attachment->dev, table->sgl, table->nents,
+			   direction, DMA_ATTR_SKIP_CPU_SYNC);
 }
 
 static int ion_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)

Comments

Liam Mark Jan. 14, 2019, 5:13 p.m.
On Fri, 11 Jan 2019, Andrew F. Davis wrote:

> Buffers may not be mapped from the CPU so skip cache maintenance here.
> Accesses from the CPU to a cached heap should be bracketed with
> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> 
> Signed-off-by: Andrew F. Davis <afd@ti.com>
> ---
>  drivers/staging/android/ion/ion.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> index 14e48f6eb734..09cb5a8e2b09 100644
> --- a/drivers/staging/android/ion/ion.c
> +++ b/drivers/staging/android/ion/ion.c
> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>  
>  	table = a->table;
>  
> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> -			direction))
> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))

Unfortunately, I don't think you can do this, for a couple of reasons.
You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
If the calls to {begin,end}_cpu_access were made before the call to 
dma_buf_attach then there won't have been a device attached so the calls 
to {begin,end}_cpu_access won't have done any cache maintenance.

Also, ION no longer provides DMA-ready memory, so if you are not doing CPU
access then there is no requirement (that I am aware of) for you to call
{begin,end}_cpu_access before passing the buffer to the device. If this
buffer is cached and your device is not IO-coherent, then the cache
maintenance in ion_map_dma_buf and ion_unmap_dma_buf is required.
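
The first point can be illustrated with a toy model. It assumes, as the
description above implies, that begin_cpu_access does its cache
maintenance once per attached device. The struct and function names are
hypothetical and are not taken from ion.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma-buf and its attached devices. */
struct toy_dmabuf {
	size_t nr_attachments;   /* devices currently attached */
	size_t cpu_syncs;        /* cache sync operations performed so far */
};

/*
 * Models a begin_cpu_access that syncs once per attached device,
 * the way a per-attachment loop would. No attachments means no sync.
 */
static void toy_begin_cpu_access(struct toy_dmabuf *buf)
{
	for (size_t i = 0; i < buf->nr_attachments; i++)
		buf->cpu_syncs++;
}

/* Returns how many syncs a single begin_cpu_access call performs. */
static size_t toy_syncs_for(size_t nr_attachments)
{
	struct toy_dmabuf buf = { .nr_attachments = nr_attachments };

	toy_begin_cpu_access(&buf);
	return buf.cpu_syncs;
}
```

With zero attachments the call is a no-op, so any dirty lines left by the
CPU are still in the cache when a device attaches later.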

>  		return ERR_PTR(-ENOMEM);
>  
>  	return table;
> @@ -272,7 +272,8 @@ static void ion_unmap_dma_buf(struct dma_buf_attachment *attachment,
>  			      struct sg_table *table,
>  			      enum dma_data_direction direction)
>  {
> -	dma_unmap_sg(attachment->dev, table->sgl, table->nents, direction);
> +	dma_unmap_sg_attrs(attachment->dev, table->sgl, table->nents,
> +			   direction, DMA_ATTR_SKIP_CPU_SYNC);
>  }
>  
>  static int ion_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
> -- 
> 2.19.1
> 
> _______________________________________________
> devel mailing list
> devel@linuxdriverproject.org
> http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Andrew F. Davis Jan. 15, 2019, 3:44 p.m.
On 1/14/19 11:13 AM, Liam Mark wrote:
> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> 
>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>> Accesses from the CPU to a cached heap should be bracketed with
>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>
>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>> ---
>>  drivers/staging/android/ion/ion.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>> index 14e48f6eb734..09cb5a8e2b09 100644
>> --- a/drivers/staging/android/ion/ion.c
>> +++ b/drivers/staging/android/ion/ion.c
>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>  
>>  	table = a->table;
>>  
>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>> -			direction))
>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> 
> Unfortunately I don't think you can do this for a couple reasons.
> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> If the calls to {begin,end}_cpu_access were made before the call to 
> dma_buf_attach then there won't have been a device attached so the calls 
> to {begin,end}_cpu_access won't have done any cache maintenance.
> 

That should be okay, though: if you have no attachments (or all
attachments are IO-coherent) then there is no need for cache
maintenance. Unless you mean a sequence where a non-IO-coherent device
is attached later, after data has already been written. Does that
sequence need supporting? DMA-BUF doesn't have to allocate the backing
memory until map_dma_buf() time, and that should only happen after all
the devices have attached so it can know where to put the buffer. So we
shouldn't expect any CPU access to buffers before all the devices are
attached and mapped, right?

> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> access then there is no requirement (that I am aware of) for you to call 
> {begin,end}_cpu_access before passing the buffer to the device and if this 
> buffer is cached and your device is not IO-coherent then the cache maintenance
> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> 

If I am not doing any CPU access then why do I need CPU cache
maintenance on the buffer?

Andrew

>>  		return ERR_PTR(-ENOMEM);
>>  
>>  	return table;
>> @@ -272,7 +272,8 @@ static void ion_unmap_dma_buf(struct dma_buf_attachment *attachment,
>>  			      struct sg_table *table,
>>  			      enum dma_data_direction direction)
>>  {
>> -	dma_unmap_sg(attachment->dev, table->sgl, table->nents, direction);
>> +	dma_unmap_sg_attrs(attachment->dev, table->sgl, table->nents,
>> +			   direction, DMA_ATTR_SKIP_CPU_SYNC);
>>  }
>>  
>>  static int ion_mmap(struct dma_buf *dmabuf, struct vm_area_struct *vma)
>> -- 
>> 2.19.1
>>
>> _______________________________________________
>> devel mailing list
>> devel@linuxdriverproject.org
>> http://driverdev.linuxdriverproject.org/mailman/listinfo/driverdev-devel
>>
> 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
Liam Mark Jan. 15, 2019, 5:45 p.m.
On Tue, 15 Jan 2019, Andrew F. Davis wrote:

> On 1/14/19 11:13 AM, Liam Mark wrote:
> > On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> > 
> >> Buffers may not be mapped from the CPU so skip cache maintenance here.
> >> Accesses from the CPU to a cached heap should be bracketed with
> >> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> >>
> >> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >> ---
> >>  drivers/staging/android/ion/ion.c | 7 ++++---
> >>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> >> index 14e48f6eb734..09cb5a8e2b09 100644
> >> --- a/drivers/staging/android/ion/ion.c
> >> +++ b/drivers/staging/android/ion/ion.c
> >> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
> >>  
> >>  	table = a->table;
> >>  
> >> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >> -			direction))
> >> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> > 
> > Unfortunately I don't think you can do this for a couple reasons.
> > You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> > If the calls to {begin,end}_cpu_access were made before the call to 
> > dma_buf_attach then there won't have been a device attached so the calls 
> > to {begin,end}_cpu_access won't have done any cache maintenance.
> > 
> 
> That should be okay though, if you have no attachments (or all
> attachments are IO-coherent) then there is no need for cache
> maintenance. Unless you mean a sequence where a non-io-coherent device
> is attached later after data has already been written. Does that
> sequence need supporting? 

Yes. I also think there are cases in Android where CPU access can happen
before attach, but I will focus on the later case for now.

> DMA-BUF doesn't have to allocate the backing
> memory until map_dma_buf() time, and that should only happen after all
> the devices have attached so it can know where to put the buffer. So we
> shouldn't expect any CPU access to buffers before all the devices are
> attached and mapped, right?
> 

Here is an example where CPU access can happen later in Android.

Camera device records video -> software post-processing -> video device
(which compresses the raw data and writes it to a file)

In this example assume the buffer is cached and the devices are not 
IO-coherent (quite common).

ION buffer is allocated.

//Camera device records video
dma_buf_attach
dma_map_attachment (buffer needs to be cleaned)
[camera device writes to buffer]
dma_buf_unmap_attachment (buffer needs to be invalidated)
dma_buf_detach  (device cannot stay attached because it is being sent down 
the pipeline and Camera doesn't know the end of the use case)

// buffer is sent down the pipeline

// Userspace software post-processing occurs
mmap buffer
DMA_BUF_IOCTL_SYNC ioctl with flags DMA_BUF_SYNC_START // No CMO (cache
maintenance operation) since no devices attached to buffer
[CPU reads/writes to the buffer]
DMA_BUF_IOCTL_SYNC ioctl with flags DMA_BUF_SYNC_END // No CMO since no
devices attached to buffer
munmap buffer

// buffer is sent down the pipeline
// Buffer is sent to the video device (which compresses the raw data and
// writes it to a file)
dma_buf_attach
dma_map_attachment (buffer needs to be cleaned)
[video device writes to buffer]
dma_buf_unmap_attachment 
dma_buf_detach  (device cannot stay attached because it is being sent down 
the pipeline and Video doesn't know the end of the use case)
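
For reference, the userspace half of the sequence above might look
roughly like the sketch below. DMA_BUF_IOCTL_SYNC, struct dma_buf_sync
and the DMA_BUF_SYNC_* flags come from <linux/dma-buf.h>; the helper
names (dmabuf_sync, dmabuf_begin_cpu_access, dmabuf_end_cpu_access) are
made up for illustration, and the mmap/munmap calls around them are
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/* Issue one DMA_BUF_IOCTL_SYNC with the given flags; returns 0 or -errno. */
static int dmabuf_sync(int dmabuf_fd, uint64_t flags)
{
	struct dma_buf_sync sync = { .flags = flags };

	if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
		return -errno;
	return 0;
}

/* Call before the CPU reads or writes the mmap'ed buffer. */
static int dmabuf_begin_cpu_access(int dmabuf_fd)
{
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW);
}

/* Call after the CPU is done with the mmap'ed buffer. */
static int dmabuf_end_cpu_access(int dmabuf_fd)
{
	return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW);
}
```

Whether either call performs any cache maintenance is decided by the
exporter; in the sequence above neither does, because no device is
attached at that point.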



> > Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> > access then there is no requirement (that I am aware of) for you to call 
> > {begin,end}_cpu_access before passing the buffer to the device and if this 
> > buffer is cached and your device is not IO-coherent then the cache maintenance
> > in ion_map_dma_buf and ion_unmap_dma_buf is required.
> > 
> 
> If I am not doing any CPU access then why do I need CPU cache
> maintenance on the buffer?
> 

Because ION no longer provides DMA-ready memory.
Take the above example.

ION allocates memory from the buddy allocator and requests zeroing.
Zeros are written to the cache.

You pass the buffer to the camera device, which is not IO-coherent.
The camera device writes directly to the buffer in DDR.
Since you didn't clean the buffer, a dirty cache line (one of the zeros)
is later evicted from the cache. This zero overwrites data the camera
device has written, which corrupts your data.
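
A toy model of that failure, for illustration only: a single write-back
cache line sits in front of a single line of DDR, and the device writes
straight to DDR. Everything here (struct toy_mem, the helper names, the
0xAB "frame" byte) is hypothetical; the cache_clean() call merely stands
in for the maintenance that would otherwise happen at map time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64

/* Hypothetical one-line write-back cache in front of one line of DDR. */
struct toy_mem {
	uint8_t ddr[LINE_SIZE];    /* what a non-coherent device sees */
	uint8_t cache[LINE_SIZE];  /* what the CPU sees while the line is valid */
	bool valid;
	bool dirty;
};

/* CPU store: allocates the line in the cache and marks it dirty. */
static void cpu_fill(struct toy_mem *m, uint8_t value)
{
	memset(m->cache, value, LINE_SIZE);
	m->valid = true;
	m->dirty = true;
}

/* Clean: write a dirty line back to DDR and keep it valid. */
static void cache_clean(struct toy_mem *m)
{
	if (m->valid && m->dirty) {
		memcpy(m->ddr, m->cache, LINE_SIZE);
		m->dirty = false;
	}
}

/* Eviction can happen at any time; dirty data is written back first. */
static void cache_evict(struct toy_mem *m)
{
	cache_clean(m);
	m->valid = false;
}

/* Non-coherent device write: goes straight to DDR, bypassing the cache. */
static void device_write(struct toy_mem *m, uint8_t value)
{
	memset(m->ddr, value, LINE_SIZE);
}

/* Returns the byte left in DDR after zeroing, optional clean, DMA, eviction. */
static uint8_t run_scenario(bool clean_before_dma)
{
	struct toy_mem m = {0};

	cpu_fill(&m, 0x00);       /* allocator zeroes the page through the cache */
	if (clean_before_dma)
		cache_clean(&m);  /* stands in for the maintenance at map time */
	device_write(&m, 0xAB);   /* camera writes a frame */
	cache_evict(&m);          /* the line falls out of the cache later */
	return m.ddr[0];
}
```

Without the clean, run_scenario() ends with the stale zero in DDR; with
it, the device's data survives the eviction.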

Liam

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Andrew F. Davis Jan. 15, 2019, 6:38 p.m.
On 1/15/19 11:45 AM, Liam Mark wrote:
> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> 
>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>
>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>>>
>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>> ---
>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>> --- a/drivers/staging/android/ion/ion.c
>>>> +++ b/drivers/staging/android/ion/ion.c
>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>>>  
>>>>  	table = a->table;
>>>>  
>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>> -			direction))
>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>
>>> Unfortunately I don't think you can do this for a couple reasons.
>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
>>> If the calls to {begin,end}_cpu_access were made before the call to 
>>> dma_buf_attach then there won't have been a device attached so the calls 
>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>
>>
>> That should be okay though, if you have no attachments (or all
>> attachments are IO-coherent) then there is no need for cache
>> maintenance. Unless you mean a sequence where a non-io-coherent device
>> is attached later after data has already been written. Does that
>> sequence need supporting? 
> 
> Yes, but also I think there are cases where CPU access can happen before 
> in Android, but I will focus on later for now.
> 
>> DMA-BUF doesn't have to allocate the backing
>> memory until map_dma_buf() time, and that should only happen after all
>> the devices have attached so it can know where to put the buffer. So we
>> shouldn't expect any CPU access to buffers before all the devices are
>> attached and mapped, right?
>>
> 
> Here is an example where CPU access can happen later in Android.
> 
> Camera device records video -> software post processing -> video device 
> (who does compression of raw data) and writes to a file
> 
> In this example assume the buffer is cached and the devices are not 
> IO-coherent (quite common).
> 

This is the start of the problem: having cached mappings of memory that
is also being accessed non-coherently is going to cause issues one way
or another. On top of the speculative cache fills that have to be
constantly fought back against with CMOs like below, some coherent
interconnects behave badly when you mix coherent and non-coherent access
(snoop filters get messed up).

The solution is to either always have the addresses marked non-coherent
(like device memory, no-map carveouts), or if you really want to use
regular system memory allocated at runtime, then all cached mappings of
it need to be dropped, even the kernel logical address (as painful
as that would be).

> ION buffer is allocated.
> 
> //Camera device records video
> dma_buf_attach
> dma_map_attachment (buffer needs to be cleaned)

Why does the buffer need to be cleaned here? I just got through reading
the thread linked by Laura in the other reply. I do like +Brian's
suggestion of tracking if the buffer has had CPU access since the last
time and only flushing the cache if it has. As unmapped heaps never get
CPU-mapped, this would never be the case for them, so it solves my
problem.
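
A minimal sketch of that tracking idea, under the assumption that one
flag per buffer is enough. struct toy_buffer and the toy_* helpers are
hypothetical and do not correspond to fields or functions in ion.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-buffer state; not part of the real struct ion_buffer. */
struct toy_buffer {
	bool cpu_touched;        /* set on CPU access, cleared once synced */
	unsigned int cmo_count;  /* cache maintenance operations performed */
};

/* Record that the CPU has read or written the buffer. */
static void toy_cpu_access(struct toy_buffer *b)
{
	b->cpu_touched = true;
}

/*
 * Map for a non-coherent device: do cache maintenance only if the CPU
 * touched the buffer since the last sync. Returns true if a CMO was done.
 */
static bool toy_map_for_device(struct toy_buffer *b)
{
	if (!b->cpu_touched)
		return false;
	b->cmo_count++;
	b->cpu_touched = false;
	return true;
}

/* Returns the number of CMOs after two consecutive device mappings. */
static unsigned int toy_cmos_after_two_maps(bool cpu_access_first)
{
	struct toy_buffer b = {0};

	if (cpu_access_first)
		toy_cpu_access(&b);
	toy_map_for_device(&b);
	toy_map_for_device(&b);
	return b.cmo_count;
}
```

A buffer that is never CPU-mapped never sets the flag, so it never pays
for a CMO, while a buffer touched by the CPU is synced exactly once
before the next device uses it.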

> [camera device writes to buffer]
> dma_buf_unmap_attachment (buffer needs to be invalidated)

It doesn't know there will be any further CPU access; the buffer could
get freed after this for all we know. The invalidate can be saved until
the CPU requests access again.

> dma_buf_detach  (device cannot stay attached because it is being sent down 
> the pipeline and Camera doesn't know the end of the use case)
> 

This seems like a broken use-case. I understand the desire to keep
everything as modular as possible and separate the steps, but at this
point no one owns this buffer's backing memory, not the CPU or any
device. I would go as far as to say DMA-BUF should be free now to
de-allocate the backing storage if it wants, that way it could get ready
for the next attachment, which may change the required backing memory
completely.

All devices should attach before the first mapping, and only let go
after the task is complete. Otherwise this buffer's data needs to be
copied off to a different location, or the CPU needs to take ownership
in between.

> //buffer is send down the pipeline
> 
> // Usersapce software post processing occurs
> mmap buffer

Perhaps the invalidate should happen here in mmap.

> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
> devices attached to buffer

And that should be okay: mmap does the sync, and if no devices are
attached nothing could have changed the underlying memory in the
meantime, so DMA_BUF_SYNC_START can safely be a no-op as it is.

> [CPU reads/writes to the buffer]
> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
> devices attached to buffer
> munmap buffer
> 
> //buffer is send down the pipeline
> // Buffer is send to video device (who does compression of raw data) and 
> writes to a file
> dma_buf_attach
> dma_map_attachment (buffer needs to be cleaned)
> [video device writes to buffer]
> dma_buf_unmap_attachment 
> dma_buf_detach  (device cannot stay attached because it is being sent down 
> the pipeline and Video doesn't know the end of the use case)
> 
> 
> 
>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
>>> access then there is no requirement (that I am aware of) for you to call 
>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
>>> buffer is cached and your device is not IO-coherent then the cache maintenance
>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>
>>
>> If I am not doing any CPU access then why do I need CPU cache
>> maintenance on the buffer?
>>
> 
> Because ION no longer provides DMA ready memory.
> Take the above example.
> 
> ION allocates memory from buddy allocator and requests zeroing.
> Zeros are written to the cache.
> 
> You pass the buffer to the camera device which is not IO-coherent.
> The camera devices writes directly to the buffer in DDR.
> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
> evicted from the cache, this zero overwrites data the camera device has 
> written which corrupts your data.
> 

The zeroing *is* a CPU access, therefore it should handle the needed CMO
for CPU access at the time of zeroing.

Andrew

> Liam
> 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
Andrew F. Davis Jan. 15, 2019, 6:40 p.m.
On 1/15/19 12:38 PM, Andrew F. Davis wrote:
> On 1/15/19 11:45 AM, Liam Mark wrote:
>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>
>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>
>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>>>>
>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>> ---
>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>>>>  
>>>>>  	table = a->table;
>>>>>  
>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>> -			direction))
>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>
>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
>>>> If the calls to {begin,end}_cpu_access were made before the call to 
>>>> dma_buf_attach then there won't have been a device attached so the calls 
>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>
>>>
>>> That should be okay though, if you have no attachments (or all
>>> attachments are IO-coherent) then there is no need for cache
>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>> is attached later after data has already been written. Does that
>>> sequence need supporting? 
>>
>> Yes, but also I think there are cases where CPU access can happen before 
>> in Android, but I will focus on later for now.
>>
>>> DMA-BUF doesn't have to allocate the backing
>>> memory until map_dma_buf() time, and that should only happen after all
>>> the devices have attached so it can know where to put the buffer. So we
>>> shouldn't expect any CPU access to buffers before all the devices are
>>> attached and mapped, right?
>>>
>>
>> Here is an example where CPU access can happen later in Android.
>>
>> Camera device records video -> software post processing -> video device 
>> (who does compression of raw data) and writes to a file
>>
>> In this example assume the buffer is cached and the devices are not 
>> IO-coherent (quite common).
>>
> 
> This is the start of the problem, having cached mappings of memory that
> is also being accessed non-coherently is going to cause issues one way
> or another. On top of the speculative cache fills that have to be
> constantly fought back against with CMOs like below; some coherent
> interconnects behave badly when you mix coherent and non-coherent access
> (snoop filters get messed up).
> 
> The solution is to either always have the addresses marked non-coherent
> (like device memory, no-map carveouts), or if you really want to use
> regular system memory allocated at runtime, then all cached mappings of
> it need to be dropped, even the kernel logical address (area as painful
> as that would be).
> 
>> ION buffer is allocated.
>>
>> //Camera device records video
>> dma_buf_attach
>> dma_map_attachment (buffer needs to be cleaned)
> 
> Why does the buffer need to be cleaned here? I just got through reading
> the thread linked by Laura in the other reply. I do like +Brian's

Actually +Brian this time :)

> suggestion of tracking if the buffer has had CPU access since the last
> time and only flushing the cache if it has. As unmapped heaps never get
> CPU mapped this would never be the case for unmapped heaps, it solves my
> problem.
> 
>> [camera device writes to buffer]
>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> 
> It doesn't know there will be any further CPU access, it could get freed
> after this for all we know, the invalidate can be saved until the CPU
> requests access again.
> 
>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>> the pipeline and Camera doesn't know the end of the use case)
>>
> 
> This seems like a broken use-case, I understand the desire to keep
> everything as modular as possible and separate the steps, but at this
> point no one owns this buffers backing memory, not the CPU or any
> device. I would go as far as to say DMA-BUF should be free now to
> de-allocate the backing storage if it wants, that way it could get ready
> for the next attachment, which may change the required backing memory
> completely.
> 
> All devices should attach before the first mapping, and only let go
> after the task is complete, otherwise this buffers data needs copied off
> to a different location or the CPU needs to take ownership in-between.
> 
>> //buffer is send down the pipeline
>>
>> // Usersapce software post processing occurs
>> mmap buffer
> 
> Perhaps the invalidate should happen here in mmap.
> 
>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
>> devices attached to buffer
> 
> And that should be okay, mmap does the sync, and if no devices are
> attached nothing could have changed the underlying memory in the
> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> 
>> [CPU reads/writes to the buffer]
>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
>> devices attached to buffer
>> munmap buffer
>>
>> //buffer is send down the pipeline
>> // Buffer is send to video device (who does compression of raw data) and 
>> writes to a file
>> dma_buf_attach
>> dma_map_attachment (buffer needs to be cleaned)
>> [video device writes to buffer]
>> dma_buf_unmap_attachment 
>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>> the pipeline and Video doesn't know the end of the use case)
>>
>>
>>
>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
>>>> access then there is no requirement (that I am aware of) for you to call 
>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>
>>>
>>> If I am not doing any CPU access then why do I need CPU cache
>>> maintenance on the buffer?
>>>
>>
>> Because ION no longer provides DMA ready memory.
>> Take the above example.
>>
>> ION allocates memory from buddy allocator and requests zeroing.
>> Zeros are written to the cache.
>>
>> You pass the buffer to the camera device which is not IO-coherent.
>> The camera devices writes directly to the buffer in DDR.
>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
>> evicted from the cache, this zero overwrites data the camera device has 
>> written which corrupts your data.
>>
> 
> The zeroing *is* a CPU access, therefor it should handle the needed CMO
> for CPU access at the time of zeroing.
> 
> Andrew
> 
>> Liam
>>
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>> a Linux Foundation Collaborative Project
>>
Laura Abbott Jan. 15, 2019, 7:05 p.m.
On 1/15/19 10:38 AM, Andrew F. Davis wrote:
> On 1/15/19 11:45 AM, Liam Mark wrote:
>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>
>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>
>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>>>>
>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>> ---
>>>>>   drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>>>>   
>>>>>   	table = a->table;
>>>>>   
>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>> -			direction))
>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>
>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
>>>> If the calls to {begin,end}_cpu_access were made before the call to
>>>> dma_buf_attach then there won't have been a device attached so the calls
>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>
>>>
>>> That should be okay though, if you have no attachments (or all
>>> attachments are IO-coherent) then there is no need for cache
>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>> is attached later after data has already been written. Does that
>>> sequence need supporting?
>>
>> Yes, but also I think there are cases where CPU access can happen before
>> in Android, but I will focus on later for now.
>>
>>> DMA-BUF doesn't have to allocate the backing
>>> memory until map_dma_buf() time, and that should only happen after all
>>> the devices have attached so it can know where to put the buffer. So we
>>> shouldn't expect any CPU access to buffers before all the devices are
>>> attached and mapped, right?
>>>
>>
>> Here is an example where CPU access can happen later in Android.
>>
>> Camera device records video -> software post processing -> video device
>> (who does compression of raw data) and writes to a file
>>
>> In this example assume the buffer is cached and the devices are not
>> IO-coherent (quite common).
>>
> 
> This is the start of the problem, having cached mappings of memory that
> is also being accessed non-coherently is going to cause issues one way
> or another. On top of the speculative cache fills that have to be
> constantly fought back against with CMOs like below; some coherent
> interconnects behave badly when you mix coherent and non-coherent access
> (snoop filters get messed up).
> 
> The solution is to either always have the addresses marked non-coherent
> (like device memory, no-map carveouts), or if you really want to use
> regular system memory allocated at runtime, then all cached mappings of
> it need to be dropped, even the kernel logical address (area as painful
> as that would be).
> 

I agree it's broken, hence my desire to remove it :)

The other problem is that uncached buffers are being used for
performance reasons, so anything that would involve getting
rid of the logical address would probably negate any performance
benefit.

>> ION buffer is allocated.
>>
>> //Camera device records video
>> dma_buf_attach
>> dma_map_attachment (buffer needs to be cleaned)
> 
> Why does the buffer need to be cleaned here? I just got through reading
> the thread linked by Laura in the other reply. I do like +Brian's
> suggestion of tracking if the buffer has had CPU access since the last
> time and only flushing the cache if it has. As unmapped heaps never get
> CPU mapped this would never be the case for unmapped heaps, it solves my
> problem.
> 
>> [camera device writes to buffer]
>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> 
> It doesn't know there will be any further CPU access, it could get freed
> after this for all we know, the invalidate can be saved until the CPU
> requests access again.
> 
>> dma_buf_detach  (device cannot stay attached because it is being sent down
>> the pipeline and Camera doesn't know the end of the use case)
>>
> 
> This seems like a broken use-case, I understand the desire to keep
> everything as modular as possible and separate the steps, but at this
> point no one owns this buffers backing memory, not the CPU or any
> device. I would go as far as to say DMA-BUF should be free now to
> de-allocate the backing storage if it wants, that way it could get ready
> for the next attachment, which may change the required backing memory
> completely.
> 
> All devices should attach before the first mapping, and only let go
> after the task is complete, otherwise this buffers data needs copied off
> to a different location or the CPU needs to take ownership in-between.
> 

Maybe it's broken, but it's the status quo, and we spent a good
amount of time at Plumbers concluding there isn't a great way
to fix it :/

>> //buffer is send down the pipeline
>>
>> // Usersapce software post processing occurs
>> mmap buffer
> 
> Perhaps the invalidate should happen here in mmap.
> 
>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no
>> devices attached to buffer
> 
> And that should be okay, mmap does the sync, and if no devices are
> attached nothing could have changed the underlying memory in the
> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> 
>> [CPU reads/writes to the buffer]
>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no
>> devices attached to buffer
>> munmap buffer
>>
>> //buffer is send down the pipeline
>> // Buffer is send to video device (who does compression of raw data) and
>> writes to a file
>> dma_buf_attach
>> dma_map_attachment (buffer needs to be cleaned)
>> [video device writes to buffer]
>> dma_buf_unmap_attachment
>> dma_buf_detach  (device cannot stay attached because it is being sent down
>> the pipeline and Video doesn't know the end of the use case)
>>
>>
>>
>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU
>>>> access then there is no requirement (that I am aware of) for you to call
>>>> {begin,end}_cpu_access before passing the buffer to the device and if this
>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>
>>>
>>> If I am not doing any CPU access then why do I need CPU cache
>>> maintenance on the buffer?
>>>
>>
>> Because ION no longer provides DMA ready memory.
>> Take the above example.
>>
>> ION allocates memory from buddy allocator and requests zeroing.
>> Zeros are written to the cache.
>>
>> You pass the buffer to the camera device which is not IO-coherent.
>> The camera devices writes directly to the buffer in DDR.
>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is
>> evicted from the cache, this zero overwrites data the camera device has
>> written which corrupts your data.
>>
> 
> The zeroing *is* a CPU access, therefor it should handle the needed CMO
> for CPU access at the time of zeroing.
> 
> Andrew
> 
>> Liam
>>
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>> a Linux Foundation Collaborative Project
>>
Brian Starkey Jan. 16, 2019, 3:19 p.m.
Hi :-)

On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
> > On 1/15/19 11:45 AM, Liam Mark wrote:
> >> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> >>
> >>> On 1/14/19 11:13 AM, Liam Mark wrote:
> >>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> >>>>
> >>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
> >>>>> Accesses from the CPU to a cached heap should be bracketed with
> >>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> >>>>>
> >>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >>>>> ---
> >>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
> >>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> >>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> >>>>> --- a/drivers/staging/android/ion/ion.c
> >>>>> +++ b/drivers/staging/android/ion/ion.c
> >>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
> >>>>>  
> >>>>>  	table = a->table;
> >>>>>  
> >>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >>>>> -			direction))
> >>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> >>>>
> >>>> Unfortunately I don't think you can do this for a couple reasons.
> >>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> >>>> If the calls to {begin,end}_cpu_access were made before the call to 
> >>>> dma_buf_attach then there won't have been a device attached so the calls 
> >>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> >>>>
> >>>
> >>> That should be okay though, if you have no attachments (or all
> >>> attachments are IO-coherent) then there is no need for cache
> >>> maintenance. Unless you mean a sequence where a non-io-coherent device
> >>> is attached later after data has already been written. Does that
> >>> sequence need supporting? 
> >>
> >> Yes, but also I think there are cases where CPU access can happen before 
> >> in Android, but I will focus on later for now.
> >>
> >>> DMA-BUF doesn't have to allocate the backing
> >>> memory until map_dma_buf() time, and that should only happen after all
> >>> the devices have attached so it can know where to put the buffer. So we
> >>> shouldn't expect any CPU access to buffers before all the devices are
> >>> attached and mapped, right?
> >>>
> >>
> >> Here is an example where CPU access can happen later in Android.
> >>
> >> Camera device records video -> software post processing -> video device 
> >> (who does compression of raw data) and writes to a file
> >>
> >> In this example assume the buffer is cached and the devices are not 
> >> IO-coherent (quite common).
> >>
> > 
> > This is the start of the problem, having cached mappings of memory that
> > is also being accessed non-coherently is going to cause issues one way
> > or another. On top of the speculative cache fills that have to be
> > constantly fought back against with CMOs like below; some coherent
> > interconnects behave badly when you mix coherent and non-coherent access
> > (snoop filters get messed up).
> > 
> > The solution is to either always have the addresses marked non-coherent
> > (like device memory, no-map carveouts), or if you really want to use
> > regular system memory allocated at runtime, then all cached mappings of
> > it need to be dropped, even the kernel logical address (area as painful
> > as that would be).

Ouch :-( I wasn't aware of these potential interconnect issues. How
"real" is that? It seems that we aren't really hitting that today on
real devices.

> > 
> >> ION buffer is allocated.
> >>
> >> //Camera device records video
> >> dma_buf_attach
> >> dma_map_attachment (buffer needs to be cleaned)
> > 
> > Why does the buffer need to be cleaned here? I just got through reading
> > the thread linked by Laura in the other reply. I do like +Brian's
> 
> Actually +Brian this time :)
> 
> > suggestion of tracking if the buffer has had CPU access since the last
> > time and only flushing the cache if it has. As unmapped heaps never get
> > CPU mapped this would never be the case for unmapped heaps, it solves my
> > problem.
> > 
> >> [camera device writes to buffer]
> >> dma_buf_unmap_attachment (buffer needs to be invalidated)
> > 
> > It doesn't know there will be any further CPU access, it could get freed
> > after this for all we know, the invalidate can be saved until the CPU
> > requests access again.

We don't have any API to allow the invalidate to happen on CPU access
if all devices have already detached. We need a struct device pointer to
give to the DMA API, otherwise on arm64 there'll be no invalidate.

I had a chat with a few people internally after the previous
discussion with Liam. One suggestion was to use
DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
one other device attached (guarantees that we can do an invalidate in
the future if begin_cpu_access is called). If the last device
detaches, do a sync then.

Conversely, in map_dma_buf, we would track if there was any CPU access
and use/skip the sync appropriately.
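
As a rough illustration of the bookkeeping described above, here is a
minimal userspace C sketch. It is not kernel code and not a proposed
patch: every name in it (struct model_buf, model_map(), and so on) is
hypothetical, and the boolean return values only stand in for "a CPU
sync would be issued here" versus "DMA_ATTR_SKIP_CPU_SYNC would be
passed". It assumes the zeroing at allocation counts as a CPU access.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. Every name here is hypothetical; none of it is
 * ION or DMA-BUF API.
 */
struct model_buf {
	int attached;        /* devices currently attached */
	bool cpu_touched;    /* CPU accessed the buffer since the last map-time sync */
	bool inval_pending;  /* an unmap skipped its sync and nobody has done it yet */
};

static struct model_buf model_alloc(void)
{
	/* Zeroing at allocation is treated as a CPU access. */
	struct model_buf b = { 0, true, false };
	return b;
}

static void model_attach(struct model_buf *b)
{
	b->attached++;
}

/* Returns true if the last device leaving had to perform a deferred sync. */
static bool model_detach(struct model_buf *b)
{
	assert(b->attached > 0);
	b->attached--;
	if (b->attached == 0 && b->inval_pending) {
		b->inval_pending = false;
		return true;
	}
	return false;
}

/* map_dma_buf: returns true if a sync is needed, false if it can be skipped. */
static bool model_map(struct model_buf *b)
{
	bool sync = b->cpu_touched;

	b->cpu_touched = false;
	return sync;
}

/* unmap_dma_buf: returns true if the sync is done now, false if it is skipped. */
static bool model_unmap(struct model_buf *b)
{
	if (b->attached > 1) {
		/* Another device stays attached, so the invalidate can wait. */
		b->inval_pending = true;
		return false;
	}
	b->inval_pending = false;
	return true;
}

/* begin_cpu_access: returns true if a deferred invalidate is performed now. */
static bool model_begin_cpu_access(struct model_buf *b)
{
	bool sync = b->inval_pending && b->attached > 0;

	if (sync)
		b->inval_pending = false;
	b->cpu_touched = true;
	return sync;
}
```

In this model a lone camera device that maps and unmaps always syncs at
unmap, because nobody else is attached to carry the invalidate later.
With two devices attached, the first unmap skips the sync and the next
begin_cpu_access (or the last detach) picks it up.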

I did start poking the code to check out how that would look, but then
Christmas happened and I'm still catching back up.

> > 
> >> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >> the pipeline and Camera doesn't know the end of the use case)
> >>
> > 
> > This seems like a broken use-case, I understand the desire to keep
> > everything as modular as possible and separate the steps, but at this
> > point no one owns this buffers backing memory, not the CPU or any
> > device. I would go as far as to say DMA-BUF should be free now to
> > de-allocate the backing storage if it wants, that way it could get ready
> > for the next attachment, which may change the required backing memory
> > completely.
> > 
> > All devices should attach before the first mapping, and only let go
> > after the task is complete, otherwise this buffers data needs copied off
> > to a different location or the CPU needs to take ownership in-between.
> > 

Yeah.. that's certainly the theory. Are there any DMA-BUF
implementations which actually do that? I hear it quoted a lot,
because that's what the docs say - but if the reality doesn't match
it, maybe we should change the docs.

> >> //buffer is send down the pipeline
> >>
> >> // Usersapce software post processing occurs
> >> mmap buffer
> > 
> > Perhaps the invalidate should happen here in mmap.
> > 
> >> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
> >> devices attached to buffer
> > 
> > And that should be okay, mmap does the sync, and if no devices are
> > attached nothing could have changed the underlying memory in the
> > mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.

Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
Liam was saying that it's too painful for them to do that every time a
device unmaps - when in many cases (device->device, no CPU) it's not
needed.

> > 
> >> [CPU reads/writes to the buffer]
> >> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
> >> devices attached to buffer
> >> munmap buffer
> >>
> >> //buffer is send down the pipeline
> >> // Buffer is send to video device (who does compression of raw data) and 
> >> writes to a file
> >> dma_buf_attach
> >> dma_map_attachment (buffer needs to be cleaned)
> >> [video device writes to buffer]
> >> dma_buf_unmap_attachment 
> >> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >> the pipeline and Video doesn't know the end of the use case)
> >>
> >>
> >>
> >>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> >>>> access then there is no requirement (that I am aware of) for you to call 
> >>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
> >>>> buffer is cached and your device is not IO-coherent then the cache maintenance
> >>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> >>>>
> >>>
> >>> If I am not doing any CPU access then why do I need CPU cache
> >>> maintenance on the buffer?
> >>>
> >>
> >> Because ION no longer provides DMA ready memory.
> >> Take the above example.
> >>
> >> ION allocates memory from buddy allocator and requests zeroing.
> >> Zeros are written to the cache.
> >>
> >> You pass the buffer to the camera device which is not IO-coherent.
> >> The camera devices writes directly to the buffer in DDR.
> >> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
> >> evicted from the cache, this zero overwrites data the camera device has 
> >> written which corrupts your data.
> >>
> > 
> > The zeroing *is* a CPU access, therefor it should handle the needed CMO
> > for CPU access at the time of zeroing.
> > 

Actually that should be at the point of the first non-coherent device
mapping the buffer right? No point in doing CMO if the future accesses
are coherent.
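
To make the hazard concrete, here is a toy userspace C model of a single
write-back cache line sitting in front of DDR. It is only a sketch under
simplifying assumptions (one line, one byte of payload, a device that
bypasses the cache entirely), and all names in it are made up for
illustration. It shows why the clean has to happen no later than the
first non-coherent device mapping, and why an invalidate is needed
before the CPU reads the result.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one cache line in front of one byte of DDR. Hypothetical names. */
struct model_line {
	unsigned char ddr;     /* what is in memory */
	unsigned char cached;  /* what the CPU cache holds */
	bool valid;
	bool dirty;
};

/* CPU store: lands in the cache only (write-back). */
static void model_cpu_write(struct model_line *l, unsigned char v)
{
	l->cached = v;
	l->valid = true;
	l->dirty = true;
}

/* CPU load: served from the cache if the line is valid, else filled from DDR. */
static unsigned char model_cpu_read(struct model_line *l)
{
	if (!l->valid) {
		l->cached = l->ddr;
		l->valid = true;
		l->dirty = false;
	}
	return l->cached;
}

/* Non-coherent device store: goes straight to DDR, the cache is not snooped. */
static void model_device_write(struct model_line *l, unsigned char v)
{
	l->ddr = v;
}

/* Clean: write a dirty line back to DDR, keep it valid. */
static void model_clean(struct model_line *l)
{
	if (l->valid && l->dirty) {
		l->ddr = l->cached;
		l->dirty = false;
	}
}

/* Invalidate: drop the line without writing it back. */
static void model_invalidate(struct model_line *l)
{
	l->valid = false;
	l->dirty = false;
}

/* Eviction, which hardware may do at any time: write back if dirty, then drop. */
static void model_evict(struct model_line *l)
{
	model_clean(l);
	l->valid = false;
}
```

Zeroing followed by a device write and an eviction leaves the zero in
DDR, which is the corruption described above. With a clean before the
device write, the eviction no longer writes anything back, but the CPU
still reads the stale cached zero until the line is invalidated.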

Cheers,
-Brian

> > Andrew
> > 
> >> Liam
> >>
> >> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >> a Linux Foundation Collaborative Project
> >>
Andrew F. Davis Jan. 16, 2019, 4:17 p.m.
On 1/15/19 1:05 PM, Laura Abbott wrote:
> On 1/15/19 10:38 AM, Andrew F. Davis wrote:
>> On 1/15/19 11:45 AM, Liam Mark wrote:
>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>>
>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>>
>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance
>>>>>> here.
>>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed
>>>>>> anyway.
>>>>>>
>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>>> ---
>>>>>>   drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/staging/android/ion/ion.c
>>>>>> b/drivers/staging/android/ion/ion.c
>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct
>>>>>> dma_buf_attachment *attachment,
>>>>>>         table = a->table;
>>>>>>   -    if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>>> -            direction))
>>>>>> +    if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>>> +                  direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>>
>>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>>> You can't rely on {begin,end}_cpu_access calls to do cache
>>>>> maintenance.
>>>>> If the calls to {begin,end}_cpu_access were made before the call to
>>>>> dma_buf_attach then there won't have been a device attached so the
>>>>> calls
>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>>
>>>>
>>>> That should be okay though, if you have no attachments (or all
>>>> attachments are IO-coherent) then there is no need for cache
>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>>> is attached later after data has already been written. Does that
>>>> sequence need supporting?
>>>
>>> Yes, but also I think there are cases where CPU access can happen before
>>> in Android, but I will focus on later for now.
>>>
>>>> DMA-BUF doesn't have to allocate the backing
>>>> memory until map_dma_buf() time, and that should only happen after all
>>>> the devices have attached so it can know where to put the buffer. So we
>>>> shouldn't expect any CPU access to buffers before all the devices are
>>>> attached and mapped, right?
>>>>
>>>
>>> Here is an example where CPU access can happen later in Android.
>>>
>>> Camera device records video -> software post processing -> video device
>>> (who does compression of raw data) and writes to a file
>>>
>>> In this example assume the buffer is cached and the devices are not
>>> IO-coherent (quite common).
>>>
>>
>> This is the start of the problem, having cached mappings of memory that
>> is also being accessed non-coherently is going to cause issues one way
>> or another. On top of the speculative cache fills that have to be
>> constantly fought back against with CMOs like below; some coherent
>> interconnects behave badly when you mix coherent and non-coherent access
>> (snoop filters get messed up).
>>
>> The solution is to either always have the addresses marked non-coherent
>> (like device memory, no-map carveouts), or if you really want to use
>> regular system memory allocated at runtime, then all cached mappings of
>> it need to be dropped, even the kernel logical address (area as painful
>> as that would be).
>>
> 
> I agree it's broken, hence my desire to remove it :)
> 
> The other problem is that uncached buffers are being used for
> performance reason so anything that would involve getting
> rid of the logical address would probably negate any performance
> benefit.
> 

I wouldn't go as far as to remove them just yet.. Liam seems pretty
adamant that they have valid uses. I'm just not sure performance is one
of them, maybe in the case of software locks between devices or
something where there needs to be a lot of back and forth interleaved
access on small amounts of data?

>>> ION buffer is allocated.
>>>
>>> //Camera device records video
>>> dma_buf_attach
>>> dma_map_attachment (buffer needs to be cleaned)
>>
>> Why does the buffer need to be cleaned here? I just got through reading
>> the thread linked by Laura in the other reply. I do like +Brian's
>> suggestion of tracking if the buffer has had CPU access since the last
>> time and only flushing the cache if it has. As unmapped heaps never get
>> CPU mapped this would never be the case for unmapped heaps, it solves my
>> problem.
>>
>>> [camera device writes to buffer]
>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
>>
>> It doesn't know there will be any further CPU access, it could get freed
>> after this for all we know, the invalidate can be saved until the CPU
>> requests access again.
>>
>>> dma_buf_detach  (device cannot stay attached because it is being sent
>>> down
>>> the pipeline and Camera doesn't know the end of the use case)
>>>
>>
>> This seems like a broken use-case, I understand the desire to keep
>> everything as modular as possible and separate the steps, but at this
>> point no one owns this buffers backing memory, not the CPU or any
>> device. I would go as far as to say DMA-BUF should be free now to
>> de-allocate the backing storage if it wants, that way it could get ready
>> for the next attachment, which may change the required backing memory
>> completely.
>>
>> All devices should attach before the first mapping, and only let go
>> after the task is complete, otherwise this buffers data needs copied off
>> to a different location or the CPU needs to take ownership in-between.
>>
> 
> Maybe it's broken but it's the status quo and we spent a good
> amount of time at plumbers concluding there isn't a great way
> to fix it :/
> 

Hmm, guess that doesn't prove there is not a great way to fix it either.. :/

Perhaps just stronger rules on sequencing of operations? I'm not saying
I have a good solution either, I just don't see any way forward without
some use-case getting broken, so better to fix now over later.

>>> //buffer is send down the pipeline
>>>
>>> // Usersapce software post processing occurs
>>> mmap buffer
>>
>> Perhaps the invalidate should happen here in mmap.
>>
>>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no
>>> devices attached to buffer
>>
>> And that should be okay, mmap does the sync, and if no devices are
>> attached nothing could have changed the underlying memory in the
>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
>>
>>> [CPU reads/writes to the buffer]
>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no
>>> devices attached to buffer
>>> munmap buffer
>>>
>>> //buffer is send down the pipeline
>>> // Buffer is send to video device (who does compression of raw data) and
>>> writes to a file
>>> dma_buf_attach
>>> dma_map_attachment (buffer needs to be cleaned)
>>> [video device writes to buffer]
>>> dma_buf_unmap_attachment
>>> dma_buf_detach  (device cannot stay attached because it is being sent
>>> down
>>> the pipeline and Video doesn't know the end of the use case)
>>>
>>>
>>>
>>>>> Also ION no longer provides DMA ready memory, so if you are not
>>>>> doing CPU
>>>>> access then there is no requirement (that I am aware of) for you to
>>>>> call
>>>>> {begin,end}_cpu_access before passing the buffer to the device and
>>>>> if this
>>>>> buffer is cached and your device is not IO-coherent then the cache
>>>>> maintenance
>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>>
>>>>
>>>> If I am not doing any CPU access then why do I need CPU cache
>>>> maintenance on the buffer?
>>>>
>>>
>>> Because ION no longer provides DMA ready memory.
>>> Take the above example.
>>>
>>> ION allocates memory from buddy allocator and requests zeroing.
>>> Zeros are written to the cache.
>>>
>>> You pass the buffer to the camera device which is not IO-coherent.
>>> The camera devices writes directly to the buffer in DDR.
>>> Since you didn't clean the buffer a dirty cache line (one of the
>>> zeros) is
>>> evicted from the cache, this zero overwrites data the camera device has
>>> written which corrupts your data.
>>>
>>
>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
>> for CPU access at the time of zeroing.
>>
>> Andrew
>>
>>> Liam
>>>
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>> a Linux Foundation Collaborative Project
>>>
>
Andrew F. Davis Jan. 16, 2019, 5:05 p.m.
On 1/16/19 9:19 AM, Brian Starkey wrote:
> Hi :-)
> 
> On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
>> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
>>> On 1/15/19 11:45 AM, Liam Mark wrote:
>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>>>
>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>>>
>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>>>>>>
>>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>>>> ---
>>>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>>>>>>  
>>>>>>>  	table = a->table;
>>>>>>>  
>>>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>>>> -			direction))
>>>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>>>
>>>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
>>>>>> If the calls to {begin,end}_cpu_access were made before the call to 
>>>>>> dma_buf_attach then there won't have been a device attached so the calls 
>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>>>
>>>>>
>>>>> That should be okay though, if you have no attachments (or all
>>>>> attachments are IO-coherent) then there is no need for cache
>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>>>> is attached later after data has already been written. Does that
>>>>> sequence need supporting? 
>>>>
>>>> Yes, but also I think there are cases where CPU access can happen before 
>>>> in Android, but I will focus on later for now.
>>>>
>>>>> DMA-BUF doesn't have to allocate the backing
>>>>> memory until map_dma_buf() time, and that should only happen after all
>>>>> the devices have attached so it can know where to put the buffer. So we
>>>>> shouldn't expect any CPU access to buffers before all the devices are
>>>>> attached and mapped, right?
>>>>>
>>>>
>>>> Here is an example where CPU access can happen later in Android.
>>>>
>>>> Camera device records video -> software post processing -> video device 
>>>> (who does compression of raw data) and writes to a file
>>>>
>>>> In this example assume the buffer is cached and the devices are not 
>>>> IO-coherent (quite common).
>>>>
>>>
>>> This is the start of the problem, having cached mappings of memory that
>>> is also being accessed non-coherently is going to cause issues one way
>>> or another. On top of the speculative cache fills that have to be
>>> constantly fought back against with CMOs like below; some coherent
>>> interconnects behave badly when you mix coherent and non-coherent access
>>> (snoop filters get messed up).
>>>
>>> The solution is to either always have the addresses marked non-coherent
>>> (like device memory, no-map carveouts), or if you really want to use
>>> regular system memory allocated at runtime, then all cached mappings of
>>> it need to be dropped, even the kernel logical address (area as painful
>>> as that would be).
> 
> Ouch :-( I wasn't aware about these potential interconnect issues. How
> "real" is that? It seems that we aren't really hitting that today on
> real devices.
> 

Sadly there is at least one real device like this now (TI AM654). We
spent some time working with the ARM interconnect spec designers to see
if this was allowed behavior; the final conclusion was that mixing
coherent and non-coherent accesses is never a good idea. So we have been
working to minimize any cases of mixed attributes [0]. If a region is
coherent then everyone in the system needs to treat it as such, and
vice versa. Even clever CMOs that work on other systems won't save you
here. :(

[0] https://github.com/ARM-software/arm-trusted-firmware/pull/1553


>>>
>>>> ION buffer is allocated.
>>>>
>>>> //Camera device records video
>>>> dma_buf_attach
>>>> dma_map_attachment (buffer needs to be cleaned)
>>>
>>> Why does the buffer need to be cleaned here? I just got through reading
>>> the thread linked by Laura in the other reply. I do like +Brian's
>>
>> Actually +Brian this time :)
>>
>>> suggestion of tracking if the buffer has had CPU access since the last
>>> time and only flushing the cache if it has. As unmapped heaps never get
>>> CPU mapped this would never be the case for unmapped heaps, it solves my
>>> problem.
>>>
>>>> [camera device writes to buffer]
>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
>>>
>>> It doesn't know there will be any further CPU access, it could get freed
>>> after this for all we know, the invalidate can be saved until the CPU
>>> requests access again.
> 
> We don't have any API to allow the invalidate to happen on CPU access
> if all devices already detached. We need a struct device pointer to
> give to the DMA API, otherwise on arm64 there'll be no invalidate.
> 
> I had a chat with a few people internally after the previous
> discussion with Liam. One suggestion was to use
> DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
> one other device attached (guarantees that we can do an invalidate in
> the future if begin_cpu_access is called). If the last device
> detaches, do a sync then.
> 
> Conversely, in map_dma_buf, we would track if there was any CPU access
> and use/skip the sync appropriately.
> 

Now that I think this all through I agree this patch is probably wrong.
The real fix needs to be better handling in the dma_map_sg() to deal
with the case of the memory not being mapped (what I'm dealing with for
unmapped heaps), and for cases when the memory in question is not cached
(Liam's issue I think). For both these cases the dma_map_sg() does the
wrong thing.

> I did start poking the code to check out how that would look, but then
> Christmas happened and I'm still catching back up.
> 
>>>
>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>>>> the pipeline and Camera doesn't know the end of the use case)
>>>>
>>>
>>> This seems like a broken use-case, I understand the desire to keep
>>> everything as modular as possible and separate the steps, but at this
>>> point no one owns this buffers backing memory, not the CPU or any
>>> device. I would go as far as to say DMA-BUF should be free now to
>>> de-allocate the backing storage if it wants, that way it could get ready
>>> for the next attachment, which may change the required backing memory
>>> completely.
>>>
>>> All devices should attach before the first mapping, and only let go
>>> after the task is complete, otherwise this buffers data needs copied off
>>> to a different location or the CPU needs to take ownership in-between.
>>>
> 
> Yeah.. that's certainly the theory. Are there any DMA-BUF
> implementations which actually do that? I hear it quoted a lot,
> because that's what the docs say - but if the reality doesn't match
> it, maybe we should change the docs.
> 

Do you mean on the userspace side? I'm not sure, but it seems like Android
might be doing this wrong from what I can gather. From the kernel side, if
you mean the "de-allocate the backing storage" part, we will have some cases
like this soon, so I want to make sure userspace is not abusing DMA-BUF
in ways not specified in the documentation. Changing the docs to force
the backing memory to always be allocated breaks the central goal of
having attach/map in DMA-BUF separate.

>>>> //buffer is send down the pipeline
>>>>
>>>> // Usersapce software post processing occurs
>>>> mmap buffer
>>>
>>> Perhaps the invalidate should happen here in mmap.
>>>
>>>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
>>>> devices attached to buffer
>>>
>>> And that should be okay, mmap does the sync, and if no devices are
>>> attached nothing could have changed the underlying memory in the
>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> 
> Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
> Liam was saying that it's too painful for them to do that every time a
> device unmaps - when in many cases (device->device, no CPU) it's not
> needed.

Invalidates are painless, at least compared to a real cache flush: they
just set the invalid bit instead of actually writing out lines. I thought
the issue was on the map side.
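
A tiny userspace C model of that cost difference, counting only
write-backs to memory. This is a deliberately crude sketch: the names
are hypothetical, the cache has a fixed eight lines, and it ignores the
cost of walking the address range, which real by-VA maintenance still
pays for both operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_LINES 8

/* Hypothetical model: only the valid/dirty state of each line is tracked. */
struct model_cache {
	bool valid[MODEL_LINES];
	bool dirty[MODEL_LINES];
};

/* CPU writes the whole buffer: every line becomes valid and dirty. */
static void model_cpu_write_all(struct model_cache *c)
{
	for (size_t i = 0; i < MODEL_LINES; i++) {
		c->valid[i] = true;
		c->dirty[i] = true;
	}
}

/* Clean: write back every dirty line. Returns the number of lines written out. */
static size_t model_clean(struct model_cache *c)
{
	size_t writes = 0;

	for (size_t i = 0; i < MODEL_LINES; i++) {
		if (c->valid[i] && c->dirty[i]) {
			c->dirty[i] = false;
			writes++;
		}
	}
	return writes;
}

/* Invalidate: mark every line invalid. Nothing is written out, so returns 0. */
static size_t model_invalidate(struct model_cache *c)
{
	for (size_t i = 0; i < MODEL_LINES; i++) {
		c->valid[i] = false;
		c->dirty[i] = false;
	}
	return 0;
}
```

The flip side, visible in the model, is that an invalidate on dirty
lines throws the CPU's data away, so it is only safe when nothing in the
cache needs to reach memory.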

> 
>>>
>>>> [CPU reads/writes to the buffer]
>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
>>>> devices attached to buffer
>>>> munmap buffer
>>>>
>>>> //buffer is send down the pipeline
>>>> // Buffer is send to video device (who does compression of raw data) and 
>>>> writes to a file
>>>> dma_buf_attach
>>>> dma_map_attachment (buffer needs to be cleaned)
>>>> [video device writes to buffer]
>>>> dma_buf_unmap_attachment 
>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>>>> the pipeline and Video doesn't know the end of the use case)
>>>>
>>>>
>>>>
>>>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
>>>>>> access then there is no requirement (that I am aware of) for you to call 
>>>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
>>>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>>>
>>>>>
>>>>> If I am not doing any CPU access then why do I need CPU cache
>>>>> maintenance on the buffer?
>>>>>
>>>>
>>>> Because ION no longer provides DMA ready memory.
>>>> Take the above example.
>>>>
>>>> ION allocates memory from buddy allocator and requests zeroing.
>>>> Zeros are written to the cache.
>>>>
>>>> You pass the buffer to the camera device which is not IO-coherent.
>>>> The camera devices writes directly to the buffer in DDR.
>>>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
>>>> evicted from the cache, this zero overwrites data the camera device has 
>>>> written which corrupts your data.
>>>>
>>>
>>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
>>> for CPU access at the time of zeroing.
>>>
> 
> Actually that should be at the point of the first non-coherent device
> mapping the buffer right? No point in doing CMO if the future accesses
> are coherent.

I see your point, as long as the zeroing is guaranteed to be the first
access to this buffer then it should be safe.

Andrew

> 
> Cheers,
> -Brian
> 
>>> Andrew
>>>
>>>> Liam
>>>>
>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>>> a Linux Foundation Collaborative Project
>>>>
Liam Mark Jan. 16, 2019, 10:48 p.m.
On Wed, 16 Jan 2019, Andrew F. Davis wrote:

> On 1/15/19 1:05 PM, Laura Abbott wrote:
> > On 1/15/19 10:38 AM, Andrew F. Davis wrote:
> >> On 1/15/19 11:45 AM, Liam Mark wrote:
> >>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> >>>
> >>>> On 1/14/19 11:13 AM, Liam Mark wrote:
> >>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> >>>>>
> >>>>>> Buffers may not be mapped from the CPU so skip cache maintenance
> >>>>>> here.
> >>>>>> Accesses from the CPU to a cached heap should be bracketed with
> >>>>>> {begin,end}_cpu_access calls so maintenance should not be needed
> >>>>>> anyway.
> >>>>>>
> >>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >>>>>> ---
> >>>>>>   drivers/staging/android/ion/ion.c | 7 ++++---
> >>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/staging/android/ion/ion.c
> >>>>>> b/drivers/staging/android/ion/ion.c
> >>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> >>>>>> --- a/drivers/staging/android/ion/ion.c
> >>>>>> +++ b/drivers/staging/android/ion/ion.c
> >>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct
> >>>>>> dma_buf_attachment *attachment,
> >>>>>>         table = a->table;
> >>>>>>   -    if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >>>>>> -            direction))
> >>>>>> +    if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >>>>>> +                  direction, DMA_ATTR_SKIP_CPU_SYNC))
> >>>>>
> >>>>> Unfortunately I don't think you can do this for a couple reasons.
> >>>>> You can't rely on {begin,end}_cpu_access calls to do cache
> >>>>> maintenance.
> >>>>> If the calls to {begin,end}_cpu_access were made before the call to
> >>>>> dma_buf_attach then there won't have been a device attached so the
> >>>>> calls
> >>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> >>>>>
> >>>>
> >>>> That should be okay though, if you have no attachments (or all
> >>>> attachments are IO-coherent) then there is no need for cache
> >>>> maintenance. Unless you mean a sequence where a non-io-coherent device
> >>>> is attached later after data has already been written. Does that
> >>>> sequence need supporting?
> >>>
> >>> Yes, but also I think there are cases where CPU access can happen before
> >>> in Android, but I will focus on later for now.
> >>>
> >>>> DMA-BUF doesn't have to allocate the backing
> >>>> memory until map_dma_buf() time, and that should only happen after all
> >>>> the devices have attached so it can know where to put the buffer. So we
> >>>> shouldn't expect any CPU access to buffers before all the devices are
> >>>> attached and mapped, right?
> >>>>
> >>>
> >>> Here is an example where CPU access can happen later in Android.
> >>>
> >>> Camera device records video -> software post processing -> video device
> >>> (who does compression of raw data) and writes to a file
> >>>
> >>> In this example assume the buffer is cached and the devices are not
> >>> IO-coherent (quite common).
> >>>
> >>
> >> This is the start of the problem, having cached mappings of memory that
> >> is also being accessed non-coherently is going to cause issues one way
> >> or another. On top of the speculative cache fills that have to be
> >> constantly fought back against with CMOs like below; some coherent
> >> interconnects behave badly when you mix coherent and non-coherent access
> >> (snoop filters get messed up).
> >>
> >> The solution is to either always have the addresses marked non-coherent
> >> (like device memory, no-map carveouts), or if you really want to use
> >> regular system memory allocated at runtime, then all cached mappings of
> >> it need to be dropped, even the kernel logical address (area as painful
> >> as that would be).
> >>
> > 
> > I agree it's broken, hence my desire to remove it :)
> > 
> > The other problem is that uncached buffers are being used for
> > performance reason so anything that would involve getting
> > rid of the logical address would probably negate any performance
> > benefit.
> > 
> 
> I wouldn't go as far as to remove them just yet.. Liam seems pretty
> adamant that they have valid uses. I'm just not sure performance is one
> of them, maybe in the case of software locks between devices or
> something where there needs to be a lot of back and forth interleaved
> access on small amounts of data?
> 

I wasn't aware that ARM considered this unsupported; I thought it was
supported but they advised against it because of the potential performance
impact.

This is after all supported in the DMA APIs and up until now devices have
been successfully commercialized with these configurations, and I think
they will continue to be commercialized with these configurations for quite a
while.

It would be really unfortunate if support was removed as I think that 
would drive clients away from using upstream ION.

> >>> ION buffer is allocated.
> >>>
> >>> //Camera device records video
> >>> dma_buf_attach
> >>> dma_map_attachment (buffer needs to be cleaned)
> >>
> >> Why does the buffer need to be cleaned here? I just got through reading
> >> the thread linked by Laura in the other reply. I do like +Brian's
> >> suggestion of tracking if the buffer has had CPU access since the last
> >> time and only flushing the cache if it has. As unmapped heaps never get
> >> CPU mapped this would never be the case for unmapped heaps, it solves my
> >> problem.
> >>
> >>> [camera device writes to buffer]
> >>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> >>
> >> It doesn't know there will be any further CPU access, it could get freed
> >> after this for all we know, the invalidate can be saved until the CPU
> >> requests access again.
> >>
> >>> dma_buf_detach  (device cannot stay attached because it is being sent
> >>> down
> >>> the pipeline and Camera doesn't know the end of the use case)
> >>>
> >>
> >> This seems like a broken use-case, I understand the desire to keep
> >> everything as modular as possible and separate the steps, but at this
> >> point no one owns this buffers backing memory, not the CPU or any
> >> device. I would go as far as to say DMA-BUF should be free now to
> >> de-allocate the backing storage if it wants, that way it could get ready
> >> for the next attachment, which may change the required backing memory
> >> completely.
> >>
> >> All devices should attach before the first mapping, and only let go
> >> after the task is complete, otherwise this buffers data needs copied off
> >> to a different location or the CPU needs to take ownership in-between.
> >>
> > 
> > Maybe it's broken but it's the status quo and we spent a good
> > amount of time at plumbers concluding there isn't a great way
> > to fix it :/
> > 
> 
> Hmm, guess that doesn't prove there is not a great way to fix it either.. :/
> 
> Perhaps just stronger rules on sequencing of operations? I'm not saying
> I have a good solution either, I just don't see any way forward without
> some use-case getting broken, so better to fix now over later.
> 

I can see the benefits of Android doing things the way they do. I would
request that changes we make continue to support Android, or that we find a
way to convince them to change, as they are the main ION client and I assume
other ION clients in the future will want to do this as well.

I am concerned that if you go with a solution which enforces what you
mention above, and bring ION out of staging that way, it will make it that
much harder to solve this for Android and therefore harder to get
Android clients to move to the upstream ION (and get everybody off their
vendor-modified Android versions).

> >>> //buffer is send down the pipeline
> >>>
> >>> // Userspace software post processing occurs
> >>> mmap buffer
> >>
> >> Perhaps the invalidate should happen here in mmap.
> >>
> >>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_START // No CMO since no
> >>> devices attached to buffer
> >>
> >> And that should be okay, mmap does the sync, and if no devices are
> >> attached nothing could have changed the underlying memory in the
> >> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> >>
> >>> [CPU reads/writes to the buffer]
> >>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no
> >>> devices attached to buffer
> >>> munmap buffer
> >>>
> >>> //buffer is send down the pipeline
> >>> // Buffer is send to video device (who does compression of raw data) and
> >>> writes to a file
> >>> dma_buf_attach
> >>> dma_map_attachment (buffer needs to be cleaned)
> >>> [video device writes to buffer]
> >>> dma_buf_unmap_attachment
> >>> dma_buf_detach  (device cannot stay attached because it is being sent
> >>> down
> >>> the pipeline and Video doesn't know the end of the use case)
> >>>
> >>>
> >>>
> >>>>> Also ION no longer provides DMA ready memory, so if you are not
> >>>>> doing CPU
> >>>>> access then there is no requirement (that I am aware of) for you to
> >>>>> call
> >>>>> {begin,end}_cpu_access before passing the buffer to the device and
> >>>>> if this
> >>>>> buffer is cached and your device is not IO-coherent then the cache
> >>>>> maintenance
> >>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> >>>>>
> >>>>
> >>>> If I am not doing any CPU access then why do I need CPU cache
> >>>> maintenance on the buffer?
> >>>>
> >>>
> >>> Because ION no longer provides DMA ready memory.
> >>> Take the above example.
> >>>
> >>> ION allocates memory from buddy allocator and requests zeroing.
> >>> Zeros are written to the cache.
> >>>
> >>> You pass the buffer to the camera device which is not IO-coherent.
> >>> The camera devices writes directly to the buffer in DDR.
> >>> Since you didn't clean the buffer a dirty cache line (one of the
> >>> zeros) is
> >>> evicted from the cache, this zero overwrites data the camera device has
> >>> written which corrupts your data.
> >>>
> >>
> >> The zeroing *is* a CPU access, therefore it should handle the needed CMO
> >> for CPU access at the time of zeroing.
> >>
> >> Andrew
> >>
> >>> Liam
> >>>
> >>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>> a Linux Foundation Collaborative Project
> >>>
> > 
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Liam Mark Jan. 16, 2019, 10:54 p.m.
On Wed, 16 Jan 2019, Andrew F. Davis wrote:

> On 1/16/19 9:19 AM, Brian Starkey wrote:
> > Hi :-)
> > 
> > On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
> >> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
> >>> On 1/15/19 11:45 AM, Liam Mark wrote:
> >>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> >>>>
> >>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
> >>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> >>>>>>
> >>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
> >>>>>>> Accesses from the CPU to a cached heap should be bracketed with
> >>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> >>>>>>>
> >>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >>>>>>> ---
> >>>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
> >>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> >>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> >>>>>>> --- a/drivers/staging/android/ion/ion.c
> >>>>>>> +++ b/drivers/staging/android/ion/ion.c
> >>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
> >>>>>>>  
> >>>>>>>  	table = a->table;
> >>>>>>>  
> >>>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >>>>>>> -			direction))
> >>>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >>>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> >>>>>>
> >>>>>> Unfortunately I don't think you can do this for a couple reasons.
> >>>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> >>>>>> If the calls to {begin,end}_cpu_access were made before the call to 
> >>>>>> dma_buf_attach then there won't have been a device attached so the calls 
> >>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> >>>>>>
> >>>>>
> >>>>> That should be okay though, if you have no attachments (or all
> >>>>> attachments are IO-coherent) then there is no need for cache
> >>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
> >>>>> is attached later after data has already been written. Does that
> >>>>> sequence need supporting? 
> >>>>
> >>>> Yes, but also I think there are cases where CPU access can happen before 
> >>>> in Android, but I will focus on later for now.
> >>>>
> >>>>> DMA-BUF doesn't have to allocate the backing
> >>>>> memory until map_dma_buf() time, and that should only happen after all
> >>>>> the devices have attached so it can know where to put the buffer. So we
> >>>>> shouldn't expect any CPU access to buffers before all the devices are
> >>>>> attached and mapped, right?
> >>>>>
> >>>>
> >>>> Here is an example where CPU access can happen later in Android.
> >>>>
> >>>> Camera device records video -> software post processing -> video device 
> >>>> (who does compression of raw data) and writes to a file
> >>>>
> >>>> In this example assume the buffer is cached and the devices are not 
> >>>> IO-coherent (quite common).
> >>>>
> >>>
> >>> This is the start of the problem, having cached mappings of memory that
> >>> is also being accessed non-coherently is going to cause issues one way
> >>> or another. On top of the speculative cache fills that have to be
> >>> constantly fought back against with CMOs like below; some coherent
> >>> interconnects behave badly when you mix coherent and non-coherent access
> >>> (snoop filters get messed up).
> >>>
> >>> The solution is to either always have the addresses marked non-coherent
> >>> (like device memory, no-map carveouts), or if you really want to use
> >>> regular system memory allocated at runtime, then all cached mappings of
> >>> it need to be dropped, even the kernel logical address (area as painful
> >>> as that would be).
> > 
> > Ouch :-( I wasn't aware about these potential interconnect issues. How
> > "real" is that? It seems that we aren't really hitting that today on
> > real devices.
> > 
> 
> Sadly there is at least one real device like this now (TI AM654). We
> spent some time working with the ARM interconnect spec designers to see
> if this was allowed behavior, final conclusion was mixing coherent and
> non-coherent accesses is never a good idea.. So we have been working to
> try to minimize any cases of mixed attributes [0], if a region is
> coherent then everyone in the system needs to treat it as such and
> vice-versa, even clever CMO that work on other systems won't save you
> here. :(
> 
> [0] https://github.com/ARM-software/arm-trusted-firmware/pull/1553
> 
> 
> >>>
> >>>> ION buffer is allocated.
> >>>>
> >>>> //Camera device records video
> >>>> dma_buf_attach
> >>>> dma_map_attachment (buffer needs to be cleaned)
> >>>
> >>> Why does the buffer need to be cleaned here? I just got through reading
> >>> the thread linked by Laura in the other reply. I do like +Brian's
> >>
> >> Actually +Brian this time :)
> >>
> >>> suggestion of tracking if the buffer has had CPU access since the last
> >>> time and only flushing the cache if it has. As unmapped heaps never get
> >>> CPU mapped this would never be the case for unmapped heaps, it solves my
> >>> problem.
> >>>
> >>>> [camera device writes to buffer]
> >>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> >>>
> >>> It doesn't know there will be any further CPU access, it could get freed
> >>> after this for all we know, the invalidate can be saved until the CPU
> >>> requests access again.
> > 
> > We don't have any API to allow the invalidate to happen on CPU access
> > if all devices already detached. We need a struct device pointer to
> > give to the DMA API, otherwise on arm64 there'll be no invalidate.
> > 
> > I had a chat with a few people internally after the previous
> > discussion with Liam. One suggestion was to use
> > DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
> > one other device attached (guarantees that we can do an invalidate in
> > the future if begin_cpu_access is called). If the last device
> > detaches, do a sync then.
> > 
> > Conversely, in map_dma_buf, we would track if there was any CPU access
> > and use/skip the sync appropriately.
> > 
> 
> Now that I think this all through I agree this patch is probably wrong.
> The real fix needs to be better handling in the dma_map_sg() to deal
> with the case of the memory not being mapped (what I'm dealing with for
> unmapped heaps), and for cases when the memory in question is not cached
> (Liam's issue I think). For both these cases the dma_map_sg() does the
> wrong thing.
> 
> > I did start poking the code to check out how that would look, but then
> > Christmas happened and I'm still catching back up.
> > 
> >>>
> >>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >>>> the pipeline and Camera doesn't know the end of the use case)
> >>>>
> >>>
> >>> This seems like a broken use-case, I understand the desire to keep
> >>> everything as modular as possible and separate the steps, but at this
> >>> point no one owns this buffers backing memory, not the CPU or any
> >>> device. I would go as far as to say DMA-BUF should be free now to
> >>> de-allocate the backing storage if it wants, that way it could get ready
> >>> for the next attachment, which may change the required backing memory
> >>> completely.
> >>>
> >>> All devices should attach before the first mapping, and only let go
> >>> after the task is complete, otherwise this buffers data needs copied off
> >>> to a different location or the CPU needs to take ownership in-between.
> >>>
> > 
> > Yeah.. that's certainly the theory. Are there any DMA-BUF
> > implementations which actually do that? I hear it quoted a lot,
> > because that's what the docs say - but if the reality doesn't match
> > it, maybe we should change the docs.
> > 
> 
> Do you mean on the userspace side? I'm not sure, seems like Android
> might be doing this wrong from what I can gather. From kernel side if
> you mean the "de-allocate the backing storage", we will have some cases
> like this soon, so I want to make sure userspace is not abusing DMA-BUF
> in ways not specified in the documentation. Changing the docs to force
> the backing memory to always be allocated breaks the central goal in
> having attach/map in DMA-BUF separate.
> 
> >>>> //buffer is send down the pipeline
> >>>>
> >>>> // Userspace software post processing occurs
> >>>> mmap buffer
> >>>
> >>> Perhaps the invalidate should happen here in mmap.
> >>>
> >>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_START // No CMO since no
> >>>> devices attached to buffer
> >>>
> >>> And that should be okay, mmap does the sync, and if no devices are
> >>> attached nothing could have changed the underlying memory in the
> >>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> > 
> > Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
> > Liam was saying that it's too painful for them to do that every time a
> > device unmaps - when in many cases (device->device, no CPU) it's not
> > needed.
> 
> Invalidates are painless, at least compared to a real cache flush, just
> set the invalid bit vs actually writing out lines. I thought the issue
> was on the map side.
> 

Invalidates aren't painless for us because we have a coherent system cache 
so clean lines get written out.
And these invalidates can occur on fairly large buffers.

That is why we haven't gone with using cached ION memory and "tracking CPU
access" because it only solves half the problem, i.e. there isn't a way to
safely skip the invalidate (because we can't read the future).
Our solution was to go with uncached ION memory (when possible), but as 
you can see in other discussions upstream support for uncached memory has
its own issues.
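
To make the "half the problem" point concrete, here is a rough userspace model
of the tracking scheme suggested earlier in the thread (skip the clean on map
when the CPU has not touched the buffer; skip the invalidate on unmap only
while another device stays attached). It is a sketch, not ION code: struct
buf_state and all function names are hypothetical, and unmap and detach are
folded into one step for brevity.

```c
#include <stdbool.h>

/* Hypothetical per-buffer bookkeeping; not a real ION structure. */
struct buf_state {
	bool cpu_dirty;		  /* CPU wrote since the last clean */
	bool needs_invalidate;	  /* device wrote; CPU cache may be stale */
	unsigned int attachments; /* devices currently attached */
};

/* Allocation zeroes the buffer, which counts as a CPU access. */
static void buf_init(struct buf_state *b)
{
	b->cpu_dirty = true;
	b->needs_invalidate = false;
	b->attachments = 0;
}

/*
 * Map side: a clean is only needed if the CPU touched the buffer since
 * the last clean. Returns true when the caller must clean the cache.
 */
static bool map_needs_clean(struct buf_state *b)
{
	bool clean = b->cpu_dirty;

	b->cpu_dirty = false;
	return clean;
}

/*
 * Unmap side: the invalidate can be deferred only while some other
 * device remains attached, because a later begin_cpu_access still has
 * a device to perform it against. When the last device leaves, the
 * invalidate has to be done now, whether or not the CPU will ever
 * read the buffer. Returns true when the caller must invalidate now.
 */
static bool unmap_needs_invalidate_now(struct buf_state *b)
{
	b->needs_invalidate = true;
	if (b->attachments > 1)
		return false;	/* deferred to a later CPU access */
	b->needs_invalidate = false;
	return true;		/* last device: cannot be skipped safely */
}
```

The map-side check removes the redundant clean in device-to-device pipelines,
but the unmap-side check still forces an invalidate whenever the last device
leaves, which is the case described above.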

> > 
> >>>
> >>>> [CPU reads/writes to the buffer]
> >>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
> >>>> devices attached to buffer
> >>>> munmap buffer
> >>>>
> >>>> //buffer is send down the pipeline
> >>>> // Buffer is send to video device (who does compression of raw data) and 
> >>>> writes to a file
> >>>> dma_buf_attach
> >>>> dma_map_attachment (buffer needs to be cleaned)
> >>>> [video device writes to buffer]
> >>>> dma_buf_unmap_attachment 
> >>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >>>> the pipeline and Video doesn't know the end of the use case)
> >>>>
> >>>>
> >>>>
> >>>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> >>>>>> access then there is no requirement (that I am aware of) for you to call 
> >>>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
> >>>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
> >>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> >>>>>>
> >>>>>
> >>>>> If I am not doing any CPU access then why do I need CPU cache
> >>>>> maintenance on the buffer?
> >>>>>
> >>>>
> >>>> Because ION no longer provides DMA ready memory.
> >>>> Take the above example.
> >>>>
> >>>> ION allocates memory from buddy allocator and requests zeroing.
> >>>> Zeros are written to the cache.
> >>>>
> >>>> You pass the buffer to the camera device which is not IO-coherent.
> >>>> The camera devices writes directly to the buffer in DDR.
> >>>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
> >>>> evicted from the cache, this zero overwrites data the camera device has 
> >>>> written which corrupts your data.
> >>>>
> >>>
> >>> The zeroing *is* a CPU access, therefore it should handle the needed CMO
> >>> for CPU access at the time of zeroing.
> >>>
> > 
> > Actually that should be at the point of the first non-coherent device
> > mapping the buffer right? No point in doing CMO if the future accesses
> > are coherent.
> 
> I see your point, as long as the zeroing is guaranteed to be the first
> access to this buffer then it should be safe.
> 
> Andrew
> 
> > 
> > Cheers,
> > -Brian
> > 
> >>> Andrew
> >>>
> >>>> Liam
> >>>>
> >>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>>> a Linux Foundation Collaborative Project
> >>>>
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Andrew F. Davis Jan. 17, 2019, 4:13 p.m.
On 1/16/19 4:48 PM, Liam Mark wrote:
> On Wed, 16 Jan 2019, Andrew F. Davis wrote:
> 
>> On 1/15/19 1:05 PM, Laura Abbott wrote:
>>> On 1/15/19 10:38 AM, Andrew F. Davis wrote:
>>>> On 1/15/19 11:45 AM, Liam Mark wrote:
>>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>>>>
>>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>>>>
>>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance
>>>>>>>> here.
>>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed
>>>>>>>> anyway.
>>>>>>>>
>>>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>>>>> ---
>>>>>>>>   drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/staging/android/ion/ion.c
>>>>>>>> b/drivers/staging/android/ion/ion.c
>>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct
>>>>>>>> dma_buf_attachment *attachment,
>>>>>>>>         table = a->table;
>>>>>>>>   -    if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>>>>> -            direction))
>>>>>>>> +    if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>>>>> +                  direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>>>>
>>>>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache
>>>>>>> maintenance.
>>>>>>> If the calls to {begin,end}_cpu_access were made before the call to
>>>>>>> dma_buf_attach then there won't have been a device attached so the
>>>>>>> calls
>>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>>>>
>>>>>>
>>>>>> That should be okay though, if you have no attachments (or all
>>>>>> attachments are IO-coherent) then there is no need for cache
>>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>>>>> is attached later after data has already been written. Does that
>>>>>> sequence need supporting?
>>>>>
>>>>> Yes, but also I think there are cases where CPU access can happen before
>>>>> in Android, but I will focus on later for now.
>>>>>
>>>>>> DMA-BUF doesn't have to allocate the backing
>>>>>> memory until map_dma_buf() time, and that should only happen after all
>>>>>> the devices have attached so it can know where to put the buffer. So we
>>>>>> shouldn't expect any CPU access to buffers before all the devices are
>>>>>> attached and mapped, right?
>>>>>>
>>>>>
>>>>> Here is an example where CPU access can happen later in Android.
>>>>>
>>>>> Camera device records video -> software post processing -> video device
>>>>> (who does compression of raw data) and writes to a file
>>>>>
>>>>> In this example assume the buffer is cached and the devices are not
>>>>> IO-coherent (quite common).
>>>>>
>>>>
>>>> This is the start of the problem, having cached mappings of memory that
>>>> is also being accessed non-coherently is going to cause issues one way
>>>> or another. On top of the speculative cache fills that have to be
>>>> constantly fought back against with CMOs like below; some coherent
>>>> interconnects behave badly when you mix coherent and non-coherent access
>>>> (snoop filters get messed up).
>>>>
>>>> The solution is to either always have the addresses marked non-coherent
>>>> (like device memory, no-map carveouts), or if you really want to use
>>>> regular system memory allocated at runtime, then all cached mappings of
>>>> it need to be dropped, even the kernel logical address (area as painful
>>>> as that would be).
>>>>
>>>
>>> I agree it's broken, hence my desire to remove it :)
>>>
>>> The other problem is that uncached buffers are being used for
>>> performance reason so anything that would involve getting
>>> rid of the logical address would probably negate any performance
>>> benefit.
>>>
>>
>> I wouldn't go as far as to remove them just yet.. Liam seems pretty
>> adamant that they have valid uses. I'm just not sure performance is one
>> of them, maybe in the case of software locks between devices or
>> something where there needs to be a lot of back and forth interleaved
>> access on small amounts of data?
>>
> 
> I wasn't aware that ARM considered this not supported, I thought it was 
> supported but they advised against it because of the potential performance 
> impact.
> 

Not sure what you mean by "this" being not supported; do you mean mixed
attribute mappings? If so, it will certainly cause problems, and the
problems will change from platform to platform. "Avoid at all costs" is my
understanding of ARM's position.

> This is after all supported in the DMA APIs and up until now devices have 
> been successfully commercialized with these configurations, and I think 
> they will continue to be commercialized with these configurations for quite a 
> while.
> 

Uncached memory mappings are almost always wrong in my experience
and are used to work around some bug or because the user doesn't want to
implement proper CMOs. Counterexamples welcome.

> It would be really unfortunate if support was removed as I think that 
> would drive clients away from using upstream ION.
> 

I'm not petitioning to remove support, but at the very least let's reverse
the ION_FLAG_CACHED flag. Ion should hand out cached normal memory by
default, to get uncached you should need to add a flag to your
allocation command pointing out you know what you are doing.
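
As a sketch of what reversing the flag would mean for callers: today a buffer
is cached only when ION_FLAG_CACHED is passed, while the proposal makes cached
the default and uncached the opt-in. EXAMPLE_FLAG_UNCACHED below is a
hypothetical name and value used purely for illustration (no such flag exists
in ION), and the value of ION_FLAG_CACHED is assumed to match the staging uapi
header.

```c
#include <stdbool.h>

/* Assumed to match the staging uapi header; treat as illustrative. */
#define ION_FLAG_CACHED		1u

/* Hypothetical opt-in flag for the proposed semantics; not part of ION. */
#define EXAMPLE_FLAG_UNCACHED	(1u << 8)

/* Current semantics: cached only when explicitly requested. */
static bool is_cached_today(unsigned int flags)
{
	return (flags & ION_FLAG_CACHED) != 0;
}

/* Proposed semantics: cached unless the caller explicitly opts out. */
static bool is_cached_proposed(unsigned int flags)
{
	return (flags & EXAMPLE_FLAG_UNCACHED) == 0;
}
```

With the proposed semantics an allocation with flags == 0 gets cached memory,
so a caller has to ask for uncached memory deliberately.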

>>>>> ION buffer is allocated.
>>>>>
>>>>> //Camera device records video
>>>>> dma_buf_attach
>>>>> dma_map_attachment (buffer needs to be cleaned)
>>>>
>>>> Why does the buffer need to be cleaned here? I just got through reading
>>>> the thread linked by Laura in the other reply. I do like +Brian's
>>>> suggestion of tracking if the buffer has had CPU access since the last
>>>> time and only flushing the cache if it has. As unmapped heaps never get
>>>> CPU mapped this would never be the case for unmapped heaps, it solves my
>>>> problem.
>>>>
>>>>> [camera device writes to buffer]
>>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
>>>>
>>>> It doesn't know there will be any further CPU access, it could get freed
>>>> after this for all we know, the invalidate can be saved until the CPU
>>>> requests access again.
>>>>
>>>>> dma_buf_detach  (device cannot stay attached because it is being sent
>>>>> down
>>>>> the pipeline and Camera doesn't know the end of the use case)
>>>>>
>>>>
>>>> This seems like a broken use-case, I understand the desire to keep
>>>> everything as modular as possible and separate the steps, but at this
>>>> point no one owns this buffers backing memory, not the CPU or any
>>>> device. I would go as far as to say DMA-BUF should be free now to
>>>> de-allocate the backing storage if it wants, that way it could get ready
>>>> for the next attachment, which may change the required backing memory
>>>> completely.
>>>>
>>>> All devices should attach before the first mapping, and only let go
>>>> after the task is complete, otherwise this buffers data needs copied off
>>>> to a different location or the CPU needs to take ownership in-between.
>>>>
>>>
>>> Maybe it's broken but it's the status quo and we spent a good
>>> amount of time at plumbers concluding there isn't a great way
>>> to fix it :/
>>>
>>
>> Hmm, guess that doesn't prove there is not a great way to fix it either.. :/
>>
>> Perhaps just stronger rules on sequencing of operations? I'm not saying
>> I have a good solution either, I just don't see any way forward without
>> some use-case getting broken, so better to fix now over later.
>>
> 
> I can see the benefits of Android doing things the way they do, I would 
> request that changes we make continue to support Android, or we find a way 
> to convince them to change, as they are the main ION client and I assume 
> other ION clients in the future will want to do this as well.
> 

Android may be the biggest user today (makes sense, Ion came out of the
Android project), but that can change, and getting changes into Android
will be easier than the upstream kernel once Ion is out of staging.

Unlike some other big ARM vendors, we (TI) do not primarily build mobile
chips targeting Android, our core offerings target more traditional
Linux userspaces, and I'm guessing others will start to do the same as
ARM tries to push more into desktop, server, and other spaces again.

> I am concerned that if you go with a solution which enforces what you 
> mention above, and bring ION out of staging that way, it will make it that
> much harder to solve this for Android and therefore harder to get 
> Android clients to move to the upstream ION (and get everybody off their 
> vendor modified Android versions).
> 

That would be an Android problem, reducing functionality in upstream to
match what some evil vendor trees do to support Android is not the way
forward on this. At least for us we are going to try to make all our
software offerings follow proper buffer ownership (including our Android
offering).

>>>>> //buffer is send down the pipeline
>>>>>
>>>>> // Userspace software post processing occurs
>>>>> mmap buffer
>>>>
>>>> Perhaps the invalidate should happen here in mmap.
>>>>
>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_START // No CMO since no
>>>>> devices attached to buffer
>>>>
>>>> And that should be okay, mmap does the sync, and if no devices are
>>>> attached nothing could have changed the underlying memory in the
>>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
>>>>
>>>>> [CPU reads/writes to the buffer]
>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no
>>>>> devices attached to buffer
>>>>> munmap buffer
>>>>>
>>>>> //buffer is send down the pipeline
>>>>> // Buffer is send to video device (who does compression of raw data) and
>>>>> writes to a file
>>>>> dma_buf_attach
>>>>> dma_map_attachment (buffer needs to be cleaned)
>>>>> [video device writes to buffer]
>>>>> dma_buf_unmap_attachment
>>>>> dma_buf_detach  (device cannot stay attached because it is being sent
>>>>> down
>>>>> the pipeline and Video doesn't know the end of the use case)
>>>>>
>>>>>
>>>>>
>>>>>>> Also ION no longer provides DMA ready memory, so if you are not
>>>>>>> doing CPU
>>>>>>> access then there is no requirement (that I am aware of) for you to
>>>>>>> call
>>>>>>> {begin,end}_cpu_access before passing the buffer to the device and
>>>>>>> if this
>>>>>>> buffer is cached and your device is not IO-coherent then the cache
>>>>>>> maintenance
>>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>>>>
>>>>>>
>>>>>> If I am not doing any CPU access then why do I need CPU cache
>>>>>> maintenance on the buffer?
>>>>>>
>>>>>
>>>>> Because ION no longer provides DMA ready memory.
>>>>> Take the above example.
>>>>>
>>>>> ION allocates memory from buddy allocator and requests zeroing.
>>>>> Zeros are written to the cache.
>>>>>
>>>>> You pass the buffer to the camera device which is not IO-coherent.
>>>>> The camera devices writes directly to the buffer in DDR.
>>>>> Since you didn't clean the buffer a dirty cache line (one of the
>>>>> zeros) is
>>>>> evicted from the cache, this zero overwrites data the camera device has
>>>>> written which corrupts your data.
>>>>>
>>>>
>>>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
>>>> for CPU access at the time of zeroing.
>>>>
>>>> Andrew
>>>>
>>>>> Liam
>>>>>
>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>>>> a Linux Foundation Collaborative Project
>>>>>
>>>
>>
> 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
Andrew F. Davis Jan. 17, 2019, 4:25 p.m.
On 1/16/19 4:54 PM, Liam Mark wrote:
> On Wed, 16 Jan 2019, Andrew F. Davis wrote:
> 
>> On 1/16/19 9:19 AM, Brian Starkey wrote:
>>> Hi :-)
>>>
>>> On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
>>>> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
>>>>> On 1/15/19 11:45 AM, Liam Mark wrote:
>>>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
>>>>>>
>>>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
>>>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
>>>>>>>>
>>>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
>>>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
>>>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
>>>>>>>>> ---
>>>>>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
>>>>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
>>>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
>>>>>>>>> --- a/drivers/staging/android/ion/ion.c
>>>>>>>>> +++ b/drivers/staging/android/ion/ion.c
>>>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
>>>>>>>>>  
>>>>>>>>>  	table = a->table;
>>>>>>>>>  
>>>>>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
>>>>>>>>> -			direction))
>>>>>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
>>>>>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
>>>>>>>>
>>>>>>>> Unfortunately I don't think you can do this for a couple reasons.
>>>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
>>>>>>>> If the calls to {begin,end}_cpu_access were made before the call to 
>>>>>>>> dma_buf_attach then there won't have been a device attached so the calls 
>>>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
>>>>>>>>
>>>>>>>
>>>>>>> That should be okay though, if you have no attachments (or all
>>>>>>> attachments are IO-coherent) then there is no need for cache
>>>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
>>>>>>> is attached later after data has already been written. Does that
>>>>>>> sequence need supporting? 
>>>>>>
>>>>>> Yes, but also I think there are cases where CPU access can happen before 
>>>>>> in Android, but I will focus on later for now.
>>>>>>
>>>>>>> DMA-BUF doesn't have to allocate the backing
>>>>>>> memory until map_dma_buf() time, and that should only happen after all
>>>>>>> the devices have attached so it can know where to put the buffer. So we
>>>>>>> shouldn't expect any CPU access to buffers before all the devices are
>>>>>>> attached and mapped, right?
>>>>>>>
>>>>>>
>>>>>> Here is an example where CPU access can happen later in Android.
>>>>>>
>>>>>> Camera device records video -> software post processing -> video device 
>>>>>> (who does compression of raw data) and writes to a file
>>>>>>
>>>>>> In this example assume the buffer is cached and the devices are not 
>>>>>> IO-coherent (quite common).
>>>>>>
>>>>>
>>>>> This is the start of the problem, having cached mappings of memory that
>>>>> is also being accessed non-coherently is going to cause issues one way
>>>>> or another. On top of the speculative cache fills that have to be
>>>>> constantly fought back against with CMOs like below; some coherent
>>>>> interconnects behave badly when you mix coherent and non-coherent access
>>>>> (snoop filters get messed up).
>>>>>
>>>>> The solution is to either always have the addresses marked non-coherent
>>>>> (like device memory, no-map carveouts), or if you really want to use
>>>>> regular system memory allocated at runtime, then all cached mappings of
>>>>> it need to be dropped, even the kernel logical address (area as painful
>>>>> as that would be).
>>>
>>> Ouch :-( I wasn't aware about these potential interconnect issues. How
>>> "real" is that? It seems that we aren't really hitting that today on
>>> real devices.
>>>
>>
>> Sadly there is at least one real device like this now (TI AM654). We
>> spent some time working with the ARM interconnect spec designers to see
>> if this was allowed behavior, final conclusion was mixing coherent and
>> non-coherent accesses is never a good idea.. So we have been working to
>> try to minimize any cases of mixed attributes [0], if a region is
>> coherent then everyone in the system needs to treat it as such and
>> vice-versa, even clever CMO that work on other systems wont save you
>> here. :(
>>
>> [0] https://github.com/ARM-software/arm-trusted-firmware/pull/1553
>>
>>
>>>>>
>>>>>> ION buffer is allocated.
>>>>>>
>>>>>> //Camera device records video
>>>>>> dma_buf_attach
>>>>>> dma_map_attachment (buffer needs to be cleaned)
>>>>>
>>>>> Why does the buffer need to be cleaned here? I just got through reading
>>>>> the thread linked by Laura in the other reply. I do like +Brian's
>>>>
>>>> Actually +Brian this time :)
>>>>
>>>>> suggestion of tracking if the buffer has had CPU access since the last
>>>>> time and only flushing the cache if it has. As unmapped heaps never get
>>>>> CPU mapped this would never be the case for unmapped heaps, it solves my
>>>>> problem.
>>>>>
>>>>>> [camera device writes to buffer]
>>>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
>>>>>
>>>>> It doesn't know there will be any further CPU access, it could get freed
>>>>> after this for all we know, the invalidate can be saved until the CPU
>>>>> requests access again.
>>>
>>> We don't have any API to allow the invalidate to happen on CPU access
>>> if all devices already detached. We need a struct device pointer to
>>> give to the DMA API, otherwise on arm64 there'll be no invalidate.
>>>
>>> I had a chat with a few people internally after the previous
>>> discussion with Liam. One suggestion was to use
>>> DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
>>> one other device attached (guarantees that we can do an invalidate in
>>> the future if begin_cpu_access is called). If the last device
>>> detaches, do a sync then.
>>>
>>> Conversely, in map_dma_buf, we would track if there was any CPU access
>>> and use/skip the sync appropriately.
>>>
>>
>> Now that I think this all through I agree this patch is probably wrong.
>> The real fix needs to be better handling in the dma_map_sg() to deal
>> with the case of the memory not being mapped (what I'm dealing with for
>> unmapped heaps), and for cases when the memory in question is not cached
>> (Liam's issue I think). For both these cases the dma_map_sg() does the
>> wrong thing.
>>
>>> I did start poking the code to check out how that would look, but then
>>> Christmas happened and I'm still catching back up.
>>>
>>>>>
>>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>>>>>> the pipeline and Camera doesn't know the end of the use case)
>>>>>>
>>>>>
>>>>> This seems like a broken use-case, I understand the desire to keep
>>>>> everything as modular as possible and separate the steps, but at this
>>>>> point no one owns this buffers backing memory, not the CPU or any
>>>>> device. I would go as far as to say DMA-BUF should be free now to
>>>>> de-allocate the backing storage if it wants, that way it could get ready
>>>>> for the next attachment, which may change the required backing memory
>>>>> completely.
>>>>>
>>>>> All devices should attach before the first mapping, and only let go
>>>>> after the task is complete, otherwise this buffers data needs copied off
>>>>> to a different location or the CPU needs to take ownership in-between.
>>>>>
>>>
>>> Yeah.. that's certainly the theory. Are there any DMA-BUF
>>> implementations which actually do that? I hear it quoted a lot,
>>> because that's what the docs say - but if the reality doesn't match
>>> it, maybe we should change the docs.
>>>
>>
>> Do you mean on the userspace side? I'm not sure, seems like Android
>> might be doing this wrong from what I can gather. From kernel side if
>> you mean the "de-allocate the backing storage", we will have some cases
>> like this soon, so I want to make sure userspace is not abusing DMA-BUF
>> in ways not specified in the documentation. Changing the docs to force
>> the backing memory to always be allocated breaks the central goal in
>> having attach/map in DMA-BUF separate.
>>
>>>>>> //buffer is send down the pipeline
>>>>>>
>>>>>> // Usersapce software post processing occurs
>>>>>> mmap buffer
>>>>>
>>>>> Perhaps the invalidate should happen here in mmap.
>>>>>
>>>>>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no 
>>>>>> devices attached to buffer
>>>>>
>>>>> And that should be okay, mmap does the sync, and if no devices are
>>>>> attached nothing could have changed the underlying memory in the
>>>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
>>>
>>> Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
>>> Liam was saying that it's too painful for them to do that every time a
>>> device unmaps - when in many cases (device->device, no CPU) it's not
>>> needed.
>>
>> Invalidates are painless, at least compared to a real cache flush, just
>> set the invalid bit vs actually writing out lines. I thought the issue
>> was on the map side.
>>
> 
> Invalidates aren't painless for us because we have a coherent system cache 
> so clean lines get written out.

That seems very broken. Why would clean lines ever need to be written
out? That defeats the whole point of having the invalidate separate from
clean. How do you deal with stale cache lines? I guess in your case this
is what forces you to use uncached memory for DMA-able memory.
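
To make the distinction I have in mind concrete, here is a toy model of a
single write-back cache line. It is only an illustration of the conventional
clean vs invalidate semantics, not a model of any real hardware (in
particular not of a system cache like the one you describe), and all the
names in it are invented:

```c
#include <stdbool.h>

/* Toy model of one write-back cache line. */
struct toy_line {
	bool valid;
	bool dirty;
	int  data;		/* value currently held in the cache */
};

/* Toy model of the memory behind that line. */
struct toy_mem {
	int data;		/* value currently held in DDR */
	int writebacks;		/* how many times a line was written out */
};

/* Clean: write the line out only if it is dirty; the line stays valid. */
void toy_clean(struct toy_line *l, struct toy_mem *m)
{
	if (l->valid && l->dirty) {
		m->data = l->data;
		m->writebacks++;
		l->dirty = false;
	}
}

/* Invalidate: drop the line, nothing is written out. */
void toy_invalidate(struct toy_line *l)
{
	l->valid = false;
	l->dirty = false;
}

/* Flush: clean followed by invalidate. */
void toy_flush(struct toy_line *l, struct toy_mem *m)
{
	toy_clean(l, m);
	toy_invalidate(l);
}
```

In this model invalidating a clean line costs no memory traffic at all, which
is why I would expect it to be cheap compared to a clean or a flush of dirty
data.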

> And these invalidates can occur on fairly large buffers.
> 
> That is why we haven't went with using cached ION memory and "tracking CPU 
> access" because it only solves half the problem, ie there isn't a way to 
> safely skip the invalidate (because we can't read the future).
> Our solution was to go with uncached ION memory (when possible), but as 
> you can see in other discussions upstream support for uncached memory has
> its own issues.
> 

Sounds like you need to fix upstream support then. Finding a way to drop
all cacheable mappings of memory you want to map as uncached seems to be
the only solution.

>>>
>>>>>
>>>>>> [CPU reads/writes to the buffer]
>>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
>>>>>> devices attached to buffer
>>>>>> munmap buffer
>>>>>>
>>>>>> //buffer is send down the pipeline
>>>>>> // Buffer is send to video device (who does compression of raw data) and 
>>>>>> writes to a file
>>>>>> dma_buf_attach
>>>>>> dma_map_attachment (buffer needs to be cleaned)
>>>>>> [video device writes to buffer]
>>>>>> dma_buf_unmap_attachment 
>>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
>>>>>> the pipeline and Video doesn't know the end of the use case)
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
>>>>>>>> access then there is no requirement (that I am aware of) for you to call 
>>>>>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
>>>>>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
>>>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
>>>>>>>>
>>>>>>>
>>>>>>> If I am not doing any CPU access then why do I need CPU cache
>>>>>>> maintenance on the buffer?
>>>>>>>
>>>>>>
>>>>>> Because ION no longer provides DMA ready memory.
>>>>>> Take the above example.
>>>>>>
>>>>>> ION allocates memory from buddy allocator and requests zeroing.
>>>>>> Zeros are written to the cache.
>>>>>>
>>>>>> You pass the buffer to the camera device which is not IO-coherent.
>>>>>> The camera devices writes directly to the buffer in DDR.
>>>>>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
>>>>>> evicted from the cache, this zero overwrites data the camera device has 
>>>>>> written which corrupts your data.
>>>>>>
>>>>>
>>>>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
>>>>> for CPU access at the time of zeroing.
>>>>>
>>>
>>> Actually that should be at the point of the first non-coherent device
>>> mapping the buffer right? No point in doing CMO if the future accesses
>>> are coherent.
>>
>> I see your point, as long as the zeroing is guaranteed to be the first
>> access to this buffer then it should be safe.
>>
>> Andrew
>>
>>>
>>> Cheers,
>>> -Brian
>>>
>>>>> Andrew
>>>>>
>>>>>> Liam
>>>>>>
>>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>>>>> a Linux Foundation Collaborative Project
>>>>>>
>>
> 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
Liam Mark Jan. 18, 2019, 1:04 a.m.
On Thu, 17 Jan 2019, Andrew F. Davis wrote:

> On 1/16/19 4:48 PM, Liam Mark wrote:
> > On Wed, 16 Jan 2019, Andrew F. Davis wrote:
> > 
> >> On 1/15/19 1:05 PM, Laura Abbott wrote:
> >>> On 1/15/19 10:38 AM, Andrew F. Davis wrote:
> >>>> On 1/15/19 11:45 AM, Liam Mark wrote:
> >>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> >>>>>
> >>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
> >>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> >>>>>>>
> >>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance
> >>>>>>>> here.
> >>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
> >>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed
> >>>>>>>> anyway.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >>>>>>>> ---
> >>>>>>>>   drivers/staging/android/ion/ion.c | 7 ++++---
> >>>>>>>>   1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/staging/android/ion/ion.c
> >>>>>>>> b/drivers/staging/android/ion/ion.c
> >>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> >>>>>>>> --- a/drivers/staging/android/ion/ion.c
> >>>>>>>> +++ b/drivers/staging/android/ion/ion.c
> >>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct
> >>>>>>>> dma_buf_attachment *attachment,
> >>>>>>>>         table = a->table;
> >>>>>>>>   -    if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >>>>>>>> -            direction))
> >>>>>>>> +    if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >>>>>>>> +                  direction, DMA_ATTR_SKIP_CPU_SYNC))
> >>>>>>>
> >>>>>>> Unfortunately I don't think you can do this for a couple reasons.
> >>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache
> >>>>>>> maintenance.
> >>>>>>> If the calls to {begin,end}_cpu_access were made before the call to
> >>>>>>> dma_buf_attach then there won't have been a device attached so the
> >>>>>>> calls
> >>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> >>>>>>>
> >>>>>>
> >>>>>> That should be okay though, if you have no attachments (or all
> >>>>>> attachments are IO-coherent) then there is no need for cache
> >>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
> >>>>>> is attached later after data has already been written. Does that
> >>>>>> sequence need supporting?
> >>>>>
> >>>>> Yes, but also I think there are cases where CPU access can happen before
> >>>>> in Android, but I will focus on later for now.
> >>>>>
> >>>>>> DMA-BUF doesn't have to allocate the backing
> >>>>>> memory until map_dma_buf() time, and that should only happen after all
> >>>>>> the devices have attached so it can know where to put the buffer. So we
> >>>>>> shouldn't expect any CPU access to buffers before all the devices are
> >>>>>> attached and mapped, right?
> >>>>>>
> >>>>>
> >>>>> Here is an example where CPU access can happen later in Android.
> >>>>>
> >>>>> Camera device records video -> software post processing -> video device
> >>>>> (who does compression of raw data) and writes to a file
> >>>>>
> >>>>> In this example assume the buffer is cached and the devices are not
> >>>>> IO-coherent (quite common).
> >>>>>
> >>>>
> >>>> This is the start of the problem, having cached mappings of memory that
> >>>> is also being accessed non-coherently is going to cause issues one way
> >>>> or another. On top of the speculative cache fills that have to be
> >>>> constantly fought back against with CMOs like below; some coherent
> >>>> interconnects behave badly when you mix coherent and non-coherent access
> >>>> (snoop filters get messed up).
> >>>>
> >>>> The solution is to either always have the addresses marked non-coherent
> >>>> (like device memory, no-map carveouts), or if you really want to use
> >>>> regular system memory allocated at runtime, then all cached mappings of
> >>>> it need to be dropped, even the kernel logical address (area as painful
> >>>> as that would be).
> >>>>
> >>>
> >>> I agree it's broken, hence my desire to remove it :)
> >>>
> >>> The other problem is that uncached buffers are being used for
> >>> performance reason so anything that would involve getting
> >>> rid of the logical address would probably negate any performance
> >>> benefit.
> >>>
> >>
> >> I wouldn't go as far as to remove them just yet.. Liam seems pretty
> >> adamant that they have valid uses. I'm just not sure performance is one
> >> of them, maybe in the case of software locks between devices or
> >> something where there needs to be a lot of back and forth interleaved
> >> access on small amounts of data?
> >>
> > 
> > I wasn't aware that ARM considered this not supported, I thought it was 
> > supported but they advised against it because of the potential performance 
> > impact.
> > 
> 
> Not sure what you mean by "this" being not supported, do you mean mixed
> attribute mappings? If so, it will certainly cause problems, and the
> problems will change from platform to platform, avoid at all costs is my
> understanding of ARM's position.
> 
> > This is after all supported in the DMA APIs and up until now devices have 
> > been successfully commercializing with this configurations, and I think 
> > they will continue to commercialize with these configurations for quite a 
> > while.
> > 
> 
> Use of uncached memory mappings are almost always wrong in my experience
> and are used to work around some bug or because the user doesn't want to
> implement proper CMOs. Counter examples welcome.
> 

Okay, let me first try to clarify what I am referring to, as perhaps I am 
misunderstanding the conversation.

In this discussion I was originally referring to a use case with cached
memory being accessed by a non-IO-coherent device:

"In this example assume the buffer is cached and the devices are not
IO-coherent (quite common)."

which you did not think was supported:

"This is the start of the problem, having cached mappings of memory
that is also being accessed non-coherently is going to cause issues
one way or another."

And I interpreted Laura's comment below as saying she wanted to remove
support in ION for cached memory being accessed by non-IO-coherent
devices:
"I agree it's broken, hence my desire to remove it :)"

So, assuming my understanding above is correct (and you are not talking
about something separate such as removing uncached ION allocation
support), I guess I am not clear why current use cases which use cached
memory with non-IO-coherent devices are considered to be working around
some bug or not implementing proper CMOs.

They use CPU cached mappings because that is the most effective way to
access the memory from the CPU side, and the devices have an uncached
IOMMU mapping because they don't support IO-coherency. Currently they do
cache maintenance on the CPU at the time of dma map and dma unmap, so
to me they are implementing correct CMOs.
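
As a rough illustration of why I consider the maintenance at map and unmap
time to be the correct CMOs for this configuration, here is a toy,
single-cache-line model of the camera example. It is only a sketch: the
names are invented, and toy_map_clean() / toy_unmap_invalidate() merely stand
in for whatever the DMA API does for a non-coherent device.

```c
#include <stdbool.h>

/* One buffer location, as seen by a non-coherent device and by the CPU. */
struct toy_buf {
	int  ddr;		/* contents of memory, what the device sees */
	int  cache;		/* contents of the CPU cache line */
	bool cache_valid;
	bool cache_dirty;
};

/* CPU store: lands in the write-back cache, DDR is not updated yet. */
void cpu_write(struct toy_buf *b, int v)
{
	b->cache = v;
	b->cache_valid = true;
	b->cache_dirty = true;
}

/* CPU load: a hit returns the cached value, a miss fills from DDR. */
int cpu_read(struct toy_buf *b)
{
	if (!b->cache_valid) {
		b->cache = b->ddr;
		b->cache_valid = true;
		b->cache_dirty = false;
	}
	return b->cache;
}

/* Non-coherent device write: goes straight to DDR, bypassing the cache. */
void dev_write(struct toy_buf *b, int v)
{
	b->ddr = v;
}

/* Eviction can happen at any time; a dirty line is written out to DDR. */
void evict(struct toy_buf *b)
{
	if (b->cache_valid && b->cache_dirty)
		b->ddr = b->cache;
	b->cache_valid = false;
	b->cache_dirty = false;
}

/* Stand-in for the clean done at dma map time. */
void toy_map_clean(struct toy_buf *b)
{
	if (b->cache_valid && b->cache_dirty) {
		b->ddr = b->cache;
		b->cache_dirty = false;
	}
}

/* Stand-in for the invalidate done at dma unmap time. */
void toy_unmap_invalidate(struct toy_buf *b)
{
	b->cache_valid = false;
	b->cache_dirty = false;
}

/* Zeroing -> camera write -> CPU read, with or without the CMOs. */
int run_pipeline(bool do_cmo)
{
	struct toy_buf b = { .ddr = -1 };

	cpu_write(&b, 0);		/* allocator zeroes the buffer */
	if (do_cmo)
		toy_map_clean(&b);
	dev_write(&b, 42);		/* camera writes a frame */
	if (!do_cmo)
		evict(&b);		/* a dirty zero line gets evicted */
	else
		toy_unmap_invalidate(&b);
	return cpu_read(&b);		/* software post processing */
}
```

With the maintenance in place the CPU reads back what the camera wrote; with
it skipped, the evicted zero overwrites the camera data in DDR.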

> > It would be really unfortunate if support was removed as I think that 
> > would drive clients away from using upstream ION.
> > 
> 
> I'm not petitioning to remove support, but at very least lets reverse
> the ION_FLAG_CACHED flag. Ion should hand out cached normal memory by
> default, to get uncached you should need to add a flag to your
> allocation command pointing out you know what you are doing.
> 

You may not be petitioning to remove support for using cached memory with
non-IO-coherent devices, but I interpreted Laura's comment as wanting to do
so, and I had concerns about that.

> >>>>> ION buffer is allocated.
> >>>>>
> >>>>> //Camera device records video
> >>>>> dma_buf_attach
> >>>>> dma_map_attachment (buffer needs to be cleaned)
> >>>>
> >>>> Why does the buffer need to be cleaned here? I just got through reading
> >>>> the thread linked by Laura in the other reply. I do like +Brian's
> >>>> suggestion of tracking if the buffer has had CPU access since the last
> >>>> time and only flushing the cache if it has. As unmapped heaps never get
> >>>> CPU mapped this would never be the case for unmapped heaps, it solves my
> >>>> problem.
> >>>>
> >>>>> [camera device writes to buffer]
> >>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> >>>>
> >>>> It doesn't know there will be any further CPU access, it could get freed
> >>>> after this for all we know, the invalidate can be saved until the CPU
> >>>> requests access again.
> >>>>
> >>>>> dma_buf_detach  (device cannot stay attached because it is being sent
> >>>>> down
> >>>>> the pipeline and Camera doesn't know the end of the use case)
> >>>>>
> >>>>
> >>>> This seems like a broken use-case, I understand the desire to keep
> >>>> everything as modular as possible and separate the steps, but at this
> >>>> point no one owns this buffers backing memory, not the CPU or any
> >>>> device. I would go as far as to say DMA-BUF should be free now to
> >>>> de-allocate the backing storage if it wants, that way it could get ready
> >>>> for the next attachment, which may change the required backing memory
> >>>> completely.
> >>>>
> >>>> All devices should attach before the first mapping, and only let go
> >>>> after the task is complete, otherwise this buffers data needs copied off
> >>>> to a different location or the CPU needs to take ownership in-between.
> >>>>
> >>>
> >>> Maybe it's broken but it's the status quo and we spent a good
> >>> amount of time at plumbers concluding there isn't a great way
> >>> to fix it :/
> >>>
> >>
> >> Hmm, guess that doesn't prove there is not a great way to fix it either.. :/
> >>
> >> Perhaps just stronger rules on sequencing of operations? I'm not saying
> >> I have a good solution either, I just don't see any way forward without
> >> some use-case getting broken, so better to fix now over later.
> >>
> > 
> > I can see the benefits of Android doing things the way they do, I would 
> > request that changes we make continue to support Android, or we find a way 
> > to convice them to change, as they are the main ION client and I assume 
> > other ION clients in the future will want to do this as well.
> > 
> 
> Android may be the biggest user today (makes sense, Ion come out of the
> Android project), but that can change, and getting changes into Android
> will be easier that the upstream kernel once Ion is out of staging.
> 
> Unlike some other big ARM vendors, we (TI) do not primarily build mobile
> chips targeting Android, our core offerings target more traditional
> Linux userspaces, and I'm guessing others will start to do the same as
> ARM tries to push more into desktop, server, and other spaces again.
> 
> > I am concerned that if you go with a solution which enforces what you 
> > mention above, and bring ION out of staging that way, it will make it that
> > much harder to solve this for Android and therefore harder to get 
> > Android clients to move to the upstream ION (and get everybody off their 
> > vendor modified Android versions).
> > 
> 
> That would be an Android problem, reducing functionality in upstream to
> match what some evil vendor trees do to support Android is not the way
> forward on this. At least for us we are going to try to make all our
> software offerings follow proper buffer ownership (including our Android
> offering).
> 
> >>>>> //buffer is send down the pipeline
> >>>>>
> >>>>> // Usersapce software post processing occurs
> >>>>> mmap buffer
> >>>>
> >>>> Perhaps the invalidate should happen here in mmap.
> >>>>
> >>>>> DMA_BUF_IOCTL_SYNC IOCT with flags DMA_BUF_SYNC_START // No CMO since no
> >>>>> devices attached to buffer
> >>>>
> >>>> And that should be okay, mmap does the sync, and if no devices are
> >>>> attached nothing could have changed the underlying memory in the
> >>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> >>>>
> >>>>> [CPU reads/writes to the buffer]
> >>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no
> >>>>> devices attached to buffer
> >>>>> munmap buffer
> >>>>>
> >>>>> //buffer is send down the pipeline
> >>>>> // Buffer is send to video device (who does compression of raw data) and
> >>>>> writes to a file
> >>>>> dma_buf_attach
> >>>>> dma_map_attachment (buffer needs to be cleaned)
> >>>>> [video device writes to buffer]
> >>>>> dma_buf_unmap_attachment
> >>>>> dma_buf_detach  (device cannot stay attached because it is being sent
> >>>>> down
> >>>>> the pipeline and Video doesn't know the end of the use case)
> >>>>>
> >>>>>
> >>>>>
> >>>>>>> Also ION no longer provides DMA ready memory, so if you are not
> >>>>>>> doing CPU
> >>>>>>> access then there is no requirement (that I am aware of) for you to
> >>>>>>> call
> >>>>>>> {begin,end}_cpu_access before passing the buffer to the device and
> >>>>>>> if this
> >>>>>>> buffer is cached and your device is not IO-coherent then the cache
> >>>>>>> maintenance
> >>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> >>>>>>>
> >>>>>>
> >>>>>> If I am not doing any CPU access then why do I need CPU cache
> >>>>>> maintenance on the buffer?
> >>>>>>
> >>>>>
> >>>>> Because ION no longer provides DMA ready memory.
> >>>>> Take the above example.
> >>>>>
> >>>>> ION allocates memory from buddy allocator and requests zeroing.
> >>>>> Zeros are written to the cache.
> >>>>>
> >>>>> You pass the buffer to the camera device which is not IO-coherent.
> >>>>> The camera devices writes directly to the buffer in DDR.
> >>>>> Since you didn't clean the buffer a dirty cache line (one of the
> >>>>> zeros) is
> >>>>> evicted from the cache, this zero overwrites data the camera device has
> >>>>> written which corrupts your data.
> >>>>>
> >>>>
> >>>> The zeroing *is* a CPU access, therefor it should handle the needed CMO
> >>>> for CPU access at the time of zeroing.
> >>>>
> >>>> Andrew
> >>>>
> >>>>> Liam
> >>>>>
> >>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>>>> a Linux Foundation Collaborative Project
> >>>>>
> >>>
> >>
> > 
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project
> > 
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
Liam Mark Jan. 18, 2019, 1:11 a.m.
On Thu, 17 Jan 2019, Andrew F. Davis wrote:

> On 1/16/19 4:54 PM, Liam Mark wrote:
> > On Wed, 16 Jan 2019, Andrew F. Davis wrote:
> > 
> >> On 1/16/19 9:19 AM, Brian Starkey wrote:
> >>> Hi :-)
> >>>
> >>> On Tue, Jan 15, 2019 at 12:40:16PM -0600, Andrew F. Davis wrote:
> >>>> On 1/15/19 12:38 PM, Andrew F. Davis wrote:
> >>>>> On 1/15/19 11:45 AM, Liam Mark wrote:
> >>>>>> On Tue, 15 Jan 2019, Andrew F. Davis wrote:
> >>>>>>
> >>>>>>> On 1/14/19 11:13 AM, Liam Mark wrote:
> >>>>>>>> On Fri, 11 Jan 2019, Andrew F. Davis wrote:
> >>>>>>>>
> >>>>>>>>> Buffers may not be mapped from the CPU so skip cache maintenance here.
> >>>>>>>>> Accesses from the CPU to a cached heap should be bracketed with
> >>>>>>>>> {begin,end}_cpu_access calls so maintenance should not be needed anyway.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Andrew F. Davis <afd@ti.com>
> >>>>>>>>> ---
> >>>>>>>>>  drivers/staging/android/ion/ion.c | 7 ++++---
> >>>>>>>>>  1 file changed, 4 insertions(+), 3 deletions(-)
> >>>>>>>>>
> >>>>>>>>> diff --git a/drivers/staging/android/ion/ion.c b/drivers/staging/android/ion/ion.c
> >>>>>>>>> index 14e48f6eb734..09cb5a8e2b09 100644
> >>>>>>>>> --- a/drivers/staging/android/ion/ion.c
> >>>>>>>>> +++ b/drivers/staging/android/ion/ion.c
> >>>>>>>>> @@ -261,8 +261,8 @@ static struct sg_table *ion_map_dma_buf(struct dma_buf_attachment *attachment,
> >>>>>>>>>  
> >>>>>>>>>  	table = a->table;
> >>>>>>>>>  
> >>>>>>>>> -	if (!dma_map_sg(attachment->dev, table->sgl, table->nents,
> >>>>>>>>> -			direction))
> >>>>>>>>> +	if (!dma_map_sg_attrs(attachment->dev, table->sgl, table->nents,
> >>>>>>>>> +			      direction, DMA_ATTR_SKIP_CPU_SYNC))
> >>>>>>>>
> >>>>>>>> Unfortunately I don't think you can do this for a couple reasons.
> >>>>>>>> You can't rely on {begin,end}_cpu_access calls to do cache maintenance.
> >>>>>>>> If the calls to {begin,end}_cpu_access were made before the call to 
> >>>>>>>> dma_buf_attach then there won't have been a device attached so the calls 
> >>>>>>>> to {begin,end}_cpu_access won't have done any cache maintenance.
> >>>>>>>>
> >>>>>>>
> >>>>>>> That should be okay though, if you have no attachments (or all
> >>>>>>> attachments are IO-coherent) then there is no need for cache
> >>>>>>> maintenance. Unless you mean a sequence where a non-io-coherent device
> >>>>>>> is attached later after data has already been written. Does that
> >>>>>>> sequence need supporting? 
> >>>>>>
> >>>>>> Yes, but also I think there are cases where CPU access can happen before 
> >>>>>> in Android, but I will focus on later for now.
> >>>>>>
> >>>>>>> DMA-BUF doesn't have to allocate the backing
> >>>>>>> memory until map_dma_buf() time, and that should only happen after all
> >>>>>>> the devices have attached so it can know where to put the buffer. So we
> >>>>>>> shouldn't expect any CPU access to buffers before all the devices are
> >>>>>>> attached and mapped, right?
> >>>>>>>
> >>>>>>
> >>>>>> Here is an example where CPU access can happen later in Android.
> >>>>>>
> >>>>>> Camera device records video -> software post processing -> video device 
> >>>>>> (who does compression of raw data) and writes to a file
> >>>>>>
> >>>>>> In this example assume the buffer is cached and the devices are not 
> >>>>>> IO-coherent (quite common).
> >>>>>>
> >>>>>
> >>>>> This is the start of the problem, having cached mappings of memory that
> >>>>> is also being accessed non-coherently is going to cause issues one way
> >>>>> or another. On top of the speculative cache fills that have to be
> >>>>> constantly fought back against with CMOs like below; some coherent
> >>>>> interconnects behave badly when you mix coherent and non-coherent access
> >>>>> (snoop filters get messed up).
> >>>>>
> >>>>> The solution is to either always have the addresses marked non-coherent
> >>>>> (like device memory, no-map carveouts), or if you really want to use
> >>>>> regular system memory allocated at runtime, then all cached mappings of
> >>>>> it need to be dropped, even the kernel logical address (as painful
> >>>>> as that would be).
> >>>
> >>> Ouch :-( I wasn't aware about these potential interconnect issues. How
> >>> "real" is that? It seems that we aren't really hitting that today on
> >>> real devices.
> >>>
> >>
> >> Sadly there is at least one real device like this now (TI AM654). We
> >> spent some time working with the ARM interconnect spec designers to see
> >> if this was allowed behavior, final conclusion was mixing coherent and
> >> non-coherent accesses is never a good idea. So we have been working to
> >> try to minimize any cases of mixed attributes [0], if a region is
> >> coherent then everyone in the system needs to treat it as such and
> >> vice-versa, even clever CMO that work on other systems won't save you
> >> here. :(
> >>
> >> [0] https://github.com/ARM-software/arm-trusted-firmware/pull/1553
> >>
> >>
> >>>>>
> >>>>>> ION buffer is allocated.
> >>>>>>
> >>>>>> //Camera device records video
> >>>>>> dma_buf_attach
> >>>>>> dma_map_attachment (buffer needs to be cleaned)
> >>>>>
> >>>>> Why does the buffer need to be cleaned here? I just got through reading
> >>>>> the thread linked by Laura in the other reply. I do like +Brian's
> >>>>
> >>>> Actually +Brian this time :)
> >>>>
> >>>>> suggestion of tracking if the buffer has had CPU access since the last
> >>>>> time and only flushing the cache if it has. As unmapped heaps never get
> >>>>> CPU mapped this would never be the case for unmapped heaps, it solves my
> >>>>> problem.
> >>>>>
> >>>>>> [camera device writes to buffer]
> >>>>>> dma_buf_unmap_attachment (buffer needs to be invalidated)
> >>>>>
> >>>>> It doesn't know there will be any further CPU access, it could get freed
> >>>>> after this for all we know, the invalidate can be saved until the CPU
> >>>>> requests access again.
> >>>
> >>> We don't have any API to allow the invalidate to happen on CPU access
> >>> if all devices already detached. We need a struct device pointer to
> >>> give to the DMA API, otherwise on arm64 there'll be no invalidate.
> >>>
> >>> I had a chat with a few people internally after the previous
> >>> discussion with Liam. One suggestion was to use
> >>> DMA_ATTR_SKIP_CPU_SYNC in unmap_dma_buf, but only if there's at least
> >>> one other device attached (guarantees that we can do an invalidate in
> >>> the future if begin_cpu_access is called). If the last device
> >>> detaches, do a sync then.
> >>>
> >>> Conversely, in map_dma_buf, we would track if there was any CPU access
> >>> and use/skip the sync appropriately.
> >>>
> >>
> >> Now that I think this all through I agree this patch is probably wrong.
> >> The real fix needs to be better handling in the dma_map_sg() to deal
> >> with the case of the memory not being mapped (what I'm dealing with for
> >> unmapped heaps), and for cases when the memory in question is not cached
> >> (Liam's issue I think). For both these cases the dma_map_sg() does the
> >> wrong thing.
> >>
> >>> I did start poking the code to check out how that would look, but then
> >>> Christmas happened and I'm still catching back up.
> >>>
> >>>>>
> >>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >>>>>> the pipeline and Camera doesn't know the end of the use case)
> >>>>>>
> >>>>>
> >>>>> This seems like a broken use-case, I understand the desire to keep
> >>>>> everything as modular as possible and separate the steps, but at this
> >>>>> point no one owns this buffer's backing memory, not the CPU or any
> >>>>> device. I would go as far as to say DMA-BUF should be free now to
> >>>>> de-allocate the backing storage if it wants, that way it could get ready
> >>>>> for the next attachment, which may change the required backing memory
> >>>>> completely.
> >>>>>
> >>>>> All devices should attach before the first mapping, and only let go
> >>>>> after the task is complete, otherwise this buffer's data needs to be copied off
> >>>>> to a different location or the CPU needs to take ownership in-between.
> >>>>>
> >>>
> >>> Yeah.. that's certainly the theory. Are there any DMA-BUF
> >>> implementations which actually do that? I hear it quoted a lot,
> >>> because that's what the docs say - but if the reality doesn't match
> >>> it, maybe we should change the docs.
> >>>
> >>
> >> Do you mean on the userspace side? I'm not sure, seems like Android
> >> might be doing this wrong from what I can gather. From kernel side if
> >> you mean the "de-allocate the backing storage", we will have some cases
> >> like this soon, so I want to make sure userspace is not abusing DMA-BUF
> >> in ways not specified in the documentation. Changing the docs to force
> >> the backing memory to always be allocated breaks the central goal in
> >> having attach/map in DMA-BUF separate.
> >>
> >>>>>> //buffer is sent down the pipeline
> >>>>>>
> >>>>>> // Userspace software post processing occurs
> >>>>>> mmap buffer
> >>>>>
> >>>>> Perhaps the invalidate should happen here in mmap.
> >>>>>
> >>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_START // No CMO since no 
> >>>>>> devices attached to buffer
> >>>>>
> >>>>> And that should be okay, mmap does the sync, and if no devices are
> >>>>> attached nothing could have changed the underlying memory in the
> >>>>> mean-time, DMA_BUF_SYNC_START can safely be a no-op as they are.
> >>>
> >>> Yeah, that's true - so long as you did an invalidate in unmap_dma_buf.
> >>> Liam was saying that it's too painful for them to do that every time a
> >>> device unmaps - when in many cases (device->device, no CPU) it's not
> >>> needed.
> >>
> >> Invalidates are painless, at least compared to a real cache flush, just
> >> set the invalid bit vs actually writing out lines. I thought the issue
> >> was on the map side.
> >>
> > 
> > Invalidates aren't painless for us because we have a coherent system cache 
> > so clean lines get written out.
> 
> That seems very broken, why would clean lines ever need to be written
> out, that defeats the whole point of having the invalidate separate from
> clean. How do you deal with stale cache lines? I guess in your case this
> is what forces you to have to use uncached memory for DMA-able memory.
> 

My understanding is that our ARM invalidate is a clean + invalidate. I had 
concerns about the clean lines being written to the system cache as 
part of the 'clean', but the following 'invalidate' would take care of 
actually invalidating the lines (so nothing is broken).
But I am probably wrong on this, and it is probably smart enough not to do 
the writing of the clean lines.

But regardless, targets supporting a coherent system cache is a legitimate 
configuration, and an invalidate on this configuration does have to go to 
the bus to invalidate the system cache (which isn't free), so I don't think
you can assume that invalidates are cheap enough that it is okay 
to do them (even if they are not needed) on every dma unmap.
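
For reference, the idea quoted above (track CPU access, and defer the
invalidate on unmap while another device is still attached) could be
modelled roughly as below. This is only a standalone user-space sketch
under stated assumptions: it is not ION or DMA-BUF code, and every name in
it (struct buf_model, model_map() and so on) is hypothetical. It only
counts which cache operations would be issued, nothing more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one cached buffer; not a kernel structure. */
struct buf_model {
	int attached;		/* devices currently attached */
	bool cpu_dirty;		/* CPU touched the buffer since the last clean */
	bool inval_pending;	/* an invalidate was skipped at unmap time */
	int cleans;		/* clean operations the model would issue */
	int invals;		/* invalidate operations the model would issue */
};

/* 'zeroed' models the allocator zeroing the buffer through the cache. */
static void model_init(struct buf_model *b, bool zeroed)
{
	b->attached = 0;
	b->cpu_dirty = zeroed;
	b->inval_pending = false;
	b->cleans = 0;
	b->invals = 0;
}

static void model_attach(struct buf_model *b)
{
	b->attached++;
}

/* Map: clean only if the CPU touched the buffer. Returns true if cleaned. */
static bool model_map(struct buf_model *b)
{
	if (!b->cpu_dirty)
		return false;
	b->cleans++;
	b->cpu_dirty = false;
	return true;
}

/*
 * Unmap: skip the invalidate while another device is still attached,
 * since that device can be used for a later sync. Returns true if an
 * invalidate was issued now.
 */
static bool model_unmap(struct buf_model *b)
{
	if (b->attached > 1) {
		b->inval_pending = true;
		return false;
	}
	b->invals++;
	b->inval_pending = false;
	return true;
}

/* Detach: the last device to leave performs any deferred invalidate. */
static bool model_detach(struct buf_model *b)
{
	assert(b->attached > 0);
	b->attached--;
	if (b->attached == 0 && b->inval_pending) {
		b->invals++;
		b->inval_pending = false;
		return true;
	}
	return false;
}

/* CPU access: perform a deferred invalidate if a device is available. */
static bool model_begin_cpu_access(struct buf_model *b)
{
	bool did_inval = false;

	if (b->inval_pending && b->attached > 0) {
		b->invals++;
		b->inval_pending = false;
		did_inval = true;
	}
	/* Conservatively treat any CPU access as dirtying the cache. */
	b->cpu_dirty = true;
	return did_inval;
}
```

In this model a device->device handoff with no CPU access in between
issues no clean on the second map. A lone device that unmaps still pays
for the invalidate immediately, which is the cost discussed above.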

> > And these invalidates can occur on fairly large buffers.
> > 
> > That is why we haven't gone with using cached ION memory and "tracking CPU 
> > access" because it only solves half the problem, ie there isn't a way to 
> > safely skip the invalidate (because we can't read the future).
> > Our solution was to go with uncached ION memory (when possible), but as 
> > you can see in other discussions upstream support for uncached memory has
> > its own issues.
> > 
> 
> Sounds like you need to fix upstream support then, finding a way to drop
> all cacheable mappings of memory you want to make uncached mappings for
> seems to be the only solution.
> 

I think we can probably agree that there wouldn't be a good way to remove 
cached mappings without causing an unacceptable performance degradation 
since it would fragment all the nice 1GB kernel mappings we have.
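
To put a rough number on the fragmentation, here is a small sketch of the
arithmetic. It assumes a 4 KiB translation granule with 512 entries per
table, so a 1 GiB block splits into 512 x 2 MiB blocks and each of those
into 512 x 4 KiB pages. The function name and the restriction that the
hole sits inside a single 2 MiB block are assumptions made for
illustration only.

```c
#include <assert.h>

#define ENTRIES_PER_TABLE 512u	/* assumed: 4 KiB granule, 8-byte entries */

/*
 * Leaf entries still mapping memory after 'hole_pages' 4 KiB pages are
 * removed from what used to be a single 1 GiB block mapping. The hole is
 * assumed to lie within one 2 MiB block.
 */
static unsigned int leaf_entries_after_hole(unsigned int hole_pages)
{
	assert(hole_pages <= ENTRIES_PER_TABLE);

	if (hole_pages == 0)
		return 1;	/* the 1 GiB block stays intact */

	/* 511 untouched 2 MiB blocks plus the pages left in the split one. */
	return (ENTRIES_PER_TABLE - 1) + (ENTRIES_PER_TABLE - hole_pages);
}
```

Under these assumptions, removing a single 4 KiB page turns one leaf
entry into 1022, and every 1 GiB region that contains part of a buffer
pays that cost.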

So I am trying to find an alternative solution.

> >>>
> >>>>>
> >>>>>> [CPU reads/writes to the buffer]
> >>>>>> DMA_BUF_IOCTL_SYNC IOCTL with flags DMA_BUF_SYNC_END // No CMO since no 
> >>>>>> devices attached to buffer
> >>>>>> munmap buffer
> >>>>>>
> >>>>>> //buffer is sent down the pipeline
> >>>>>> // Buffer is sent to video device (who does compression of raw data) and 
> >>>>>> writes to a file
> >>>>>> dma_buf_attach
> >>>>>> dma_map_attachment (buffer needs to be cleaned)
> >>>>>> [video device writes to buffer]
> >>>>>> dma_buf_unmap_attachment 
> >>>>>> dma_buf_detach  (device cannot stay attached because it is being sent down 
> >>>>>> the pipeline and Video doesn't know the end of the use case)
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>>> Also ION no longer provides DMA ready memory, so if you are not doing CPU 
> >>>>>>>> access then there is no requirement (that I am aware of) for you to call 
> >>>>>>>> {begin,end}_cpu_access before passing the buffer to the device and if this 
> >>>>>>>> buffer is cached and your device is not IO-coherent then the cache maintenance
> >>>>>>>> in ion_map_dma_buf and ion_unmap_dma_buf is required.
> >>>>>>>>
> >>>>>>>
> >>>>>>> If I am not doing any CPU access then why do I need CPU cache
> >>>>>>> maintenance on the buffer?
> >>>>>>>
> >>>>>>
> >>>>>> Because ION no longer provides DMA ready memory.
> >>>>>> Take the above example.
> >>>>>>
> >>>>>> ION allocates memory from buddy allocator and requests zeroing.
> >>>>>> Zeros are written to the cache.
> >>>>>>
> >>>>>> You pass the buffer to the camera device which is not IO-coherent.
> >>>>>> The camera devices writes directly to the buffer in DDR.
> >>>>>> Since you didn't clean the buffer a dirty cache line (one of the zeros) is 
> >>>>>> evicted from the cache, this zero overwrites data the camera device has 
> >>>>>> written which corrupts your data.
> >>>>>>
> >>>>>
> >>>>> The zeroing *is* a CPU access, therefore it should handle the needed CMO
> >>>>> for CPU access at the time of zeroing.
> >>>>>
> >>>
> >>> Actually that should be at the point of the first non-coherent device
> >>> mapping the buffer right? No point in doing CMO if the future accesses
> >>> are coherent.
> >>
> >> I see your point, as long as the zeroing is guaranteed to be the first
> >> access to this buffer then it should be safe.
> >>
> >> Andrew
> >>
> >>>
> >>> Cheers,
> >>> -Brian
> >>>
> >>>>> Andrew
> >>>>>
> >>>>>> Liam
> >>>>>>
> >>>>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> >>>>>> a Linux Foundation Collaborative Project
> >>>>>>
> >>
> > 
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > a Linux Foundation Collaborative Project
> > 
> 

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project