Revert "PCI: Enable NVIDIA HDA controllers"

Submitted by Lyude Paul on July 31, 2019, 8:19 p.m.

Details

Message ID 20190731201927.22054-1-lyude@redhat.com
State New
Series "Revert "PCI: Enable NVIDIA HDA controllers"" ( rev: 1 ) in Nouveau

Commit Message

Lyude Paul July 31, 2019, 8:19 p.m.
This reverts commit b516ea586d717472178e6ef1c152e85608b0ce32.

While that commit fixes audio for a number of users, it has the side
effect of breaking the BIOS workaround that's required to make the GPU
on the Lenovo ThinkPad P50 work, by causing the GPU's PCI device
function to stop working after it's been set to multifunction mode.

Signed-off-by: Lyude Paul <lyude@redhat.com>
Fixes: b516ea586d71 ("PCI: Enable NVIDIA HDA controllers")
Cc: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Drake <drake@endlessm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Aaron Plattner <aplattner@nvidia.com>
Cc: Peter Wu <peter@lekensteyn.nl>
Cc: Ilia Mirkin <imirkin@alum.mit.edu>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Maik Freudenberg <hhfeuer@gmx.de>
Cc: linux-pci@vger.kernel.org
---

I'm not really holding my breath on this patch being accepted:
there's a good chance there's a better solution for this (and I'm going
to continue investigating for one after sending this patch), this is
more just to start a conversation on what the proper way to fix this is.

So, I'm kind of confused about why exactly this was implemented as an
early boot quirk in the first place. If we're seeing the GPU's PCI
device, we already know the GPU is there. Shouldn't we be able to check
for the existence of the HDA device once we probe the GPU in nouveau?
This would make a lot more sense and be a lot less troublesome. I can
see in the discussion on

https://bugs.freedesktop.org/show_bug.cgi?id=75985

that people mentioned that unloading nouveau and then trying to reprobe
for the audio device didn't work, but that still doesn't explain why this
was implemented as an early quirk and not as something we just do before
nouveau is set up. Can we maybe move this somewhere a little more
sensible?

 drivers/pci/quirks.c    | 30 ------------------------------
 include/linux/pci_ids.h |  1 -
 2 files changed, 31 deletions(-)

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 208aacf39329..c66c0ca446c4 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -5011,36 +5011,6 @@  DECLARE_PCI_FIXUP_CLASS_FINAL(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
 			      PCI_CLASS_SERIAL_UNKNOWN, 8,
 			      quirk_gpu_usb_typec_ucsi);
 
-/*
- * Enable the NVIDIA GPU integrated HDA controller if the BIOS left it
- * disabled.  https://devtalk.nvidia.com/default/topic/1024022
- */
-static void quirk_nvidia_hda(struct pci_dev *gpu)
-{
-	u8 hdr_type;
-	u32 val;
-
-	/* There was no integrated HDA controller before MCP89 */
-	if (gpu->device < PCI_DEVICE_ID_NVIDIA_GEFORCE_320M)
-		return;
-
-	/* Bit 25 at offset 0x488 enables the HDA controller */
-	pci_read_config_dword(gpu, 0x488, &val);
-	if (val & BIT(25))
-		return;
-
-	pci_info(gpu, "Enabling HDA controller\n");
-	pci_write_config_dword(gpu, 0x488, val | BIT(25));
-
-	/* The GPU becomes a multi-function device when the HDA is enabled */
-	pci_read_config_byte(gpu, PCI_HEADER_TYPE, &hdr_type);
-	gpu->multifunction = !!(hdr_type & 0x80);
-}
-DECLARE_PCI_FIXUP_CLASS_HEADER(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-			       PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
-DECLARE_PCI_FIXUP_CLASS_RESUME_EARLY(PCI_VENDOR_ID_NVIDIA, PCI_ANY_ID,
-			       PCI_BASE_CLASS_DISPLAY, 16, quirk_nvidia_hda);
-
 /*
  * Some IDT switches incorrectly flag an ACS Source Validation error on
  * completions for config read requests even though PCIe r4.0, sec
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index c842735a4f45..f496fb619287 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -1336,7 +1336,6 @@ 
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP78S_SMBUS    0x0752
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP77_IDE       0x0759
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP73_SMBUS     0x07D8
-#define PCI_DEVICE_ID_NVIDIA_GEFORCE_320M           0x08A0
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP79_SMBUS     0x0AA2
 #define PCI_DEVICE_ID_NVIDIA_NFORCE_MCP89_SATA	    0x0D85
 

Comments

Also, I realized after sending this that I should clarify something so there
isn't any confusion.

A bunch of people on the bug that was mentioned in b516ea586d71 ("PCI: Enable
NVIDIA HDA controllers") said that this worked perfectly for their P50
laptops. While I don't doubt that at all, it should be noted that the P50
quirk there is only present on a _very specific_ subset of P50 SKUs, so it's
quite likely that the people in that bug report just didn't have a P50 that
hits this issue. The relevant model numbers of the P50 with the flaky BIOSes
that require this quirk should be mentioned here:

https://bugzilla.kernel.org/show_bug.cgi?id=203003



On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote:
> While that commit fixes audio for a number of users, it has the side
> effect of breaking the BIOS workaround that's required to make the GPU
> on the Lenovo ThinkPad P50 work, by causing the GPU's PCI device
> function to stop working after it's been set to multifunction mode.

This is missing a reference to the commit introducing the P50 quirk,
which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot
if necessary").

Please describe in more detail how the GPU's PCI function stops working.
Does it respond with "all ones" when accessing MMIO?
Do MMIO accesses cause the system to hang?

Could you provide lspci -vvxx output for the GPU and its associated
HDA controller with and without b516ea586d71?

Does this machine have external display connectors via which audio
can be streamed?


> I'm not really holding my breath on this patch being accepted:
> there's a good chance there's a better solution for this (and I'm going
> to continue investigating for one after sending this patch), this is
> more just to start a conversation on what the proper way to fix this is.

Posting as an RFC might have been more appropriate then.


> So, I'm kind of confused about why exactly this was implemented as an
> early boot quirk in the first place. If we're seeing the GPU's PCI
> device, we already know the GPU is there. Shouldn't we be able to check
> for the existence of the HDA device once we probe the GPU in nouveau?

I think a motivation to keep this generic was to make it work with
other drivers besides nouveau, specifically Nvidia's proprietary driver.
nouveau might not even be enabled.


> that still doesn't explain why this was implemented as an early quirk

This isn't an early quirk.  Those live in arch/x86/kernel/early-quirks.c.
This is just a PCI quirk executed on device enumeration and on resume.
Devices aren't necessarily enumerated only on boot, e.g. think Thunderbolt.
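
To illustrate the distinction, here is a minimal user-space sketch (not
kernel code) of a fixup table keyed by pass, so that one hook can be
registered for both enumeration and resume. The names fake_pci_dev,
fixup_entry, run_fixups, the pass enum and the device ID 0x1234 are all
hypothetical; the real mechanism is the DECLARE_PCI_FIXUP_* macros seen in
the patch, and the kernel's matching logic has more detail than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANY_ID 0xffffu  /* simplified stand-in for PCI_ANY_ID (assumption) */

/* Hypothetical subset of fixup passes, for illustration only. */
enum fixup_pass { PASS_HEADER, PASS_FINAL, PASS_RESUME_EARLY };

/* Hypothetical, heavily reduced model of a PCI device. */
struct fake_pci_dev {
	uint16_t vendor, device;
	uint32_t class_code;    /* 24-bit class code, e.g. 0x030000 for VGA */
	int hook_calls;         /* how many times a hook ran on this device */
};

struct fixup_entry {
	enum fixup_pass pass;
	uint16_t vendor, device;
	uint32_t class_match;   /* compared against class_code >> class_shift */
	unsigned int class_shift;
	void (*hook)(struct fake_pci_dev *dev);
};

static void count_hook(struct fake_pci_dev *dev)
{
	dev->hook_calls++;
}

/* Same hook registered for two passes, as the reverted quirk was. */
static const struct fixup_entry fixups[] = {
	{ PASS_HEADER,       0x10de, ANY_ID, 0x03, 16, count_hook },
	{ PASS_RESUME_EARLY, 0x10de, ANY_ID, 0x03, 16, count_hook },
};

/* Run every hook registered for `pass` that matches `dev`; return the count. */
static int run_fixups(enum fixup_pass pass, struct fake_pci_dev *dev)
{
	int ran = 0;

	assert(dev != NULL);
	for (size_t i = 0; i < sizeof(fixups) / sizeof(fixups[0]); i++) {
		const struct fixup_entry *f = &fixups[i];

		if (f->pass != pass)
			continue;
		if (f->vendor != ANY_ID && f->vendor != dev->vendor)
			continue;
		if (f->device != ANY_ID && f->device != dev->device)
			continue;
		if ((dev->class_code >> f->class_shift) != f->class_match)
			continue;
		f->hook(dev);
		ran++;
	}
	return ran;
}
```

In this model, calling run_fixups(PASS_HEADER, &gpu) whenever a device is
enumerated (at boot or on hotplug) and run_fixups(PASS_RESUME_EARLY, &gpu)
on resume would invoke the same hook in both situations.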

Thanks,

Lukas
On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote:
>
> On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote:
> > While that commit fixes audio for a number of users, it has the side
> > effect of breaking the BIOS workaround that's required to make the GPU
> > on the Lenovo ThinkPad P50 work, by causing the GPU's PCI device
> > function to stop working after it's been set to multifunction mode.
>
> This is missing a reference to the commit introducing the P50 quirk,
> which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot
> if necessary").
>
> Please describe in more detail how the GPU's PCI function stops working.
> Does it respond with "all ones" when accessing MMIO?
> Do MMIO accesses cause the system to hang?
>
> Could you provide lspci -vvxx output for the GPU and its associated
> HDA controller with and without b516ea586d71?
>
> Does this machine have external display connectors via which audio
> can be streamed?
>
>
> > I'm not really holding my breath on this patch being accepted:
> > there's a good chance there's a better solution for this (and I'm going
> > to continue investigating for one after sending this patch), this is
> > more just to start a conversation on what the proper way to fix this is.
>
> Posting as an RFC might have been more appropriate then.
>

no, a revert is actually appropriate.  If a commit fixes something,
but breaks something else, it gets either reverted or fixed. If nobody
fixes it, then revert it is.

On Wed, 2019-07-31 at 23:26 +0200, Karol Herbst wrote:
> On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote:
> > On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote:
> > > While that commit fixes audio for a number of users, it has the side
> > > effect of breaking the BIOS workaround that's required to make the GPU
> > > on the Lenovo ThinkPad P50 work, by causing the GPU's PCI device
> > > function to stop working after it's been set to multifunction mode.
> > 
> > This is missing a reference to the commit introducing the P50 quirk,
> > which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot
> > if necessary").
> > 
> > Please describe in more detail how the GPU's PCI function stops working.
> > Does it respond with "all ones" when accessing MMIO?
> > Do MMIO accesses cause the system to hang?
> > 
> > Could you provide lspci -vvxx output for the GPU and its associated
> > HDA controller with and without b516ea586d71?
> > 
> > Does this machine have external display connectors via which audio
> > can be streamed?
> > 
> > 
> > > I'm not really holding my breath on this patch being accepted:
> > > there's a good chance there's a better solution for this (and I'm going
> > > to continue investigating for one after sending this patch), this is
> > > more just to start a conversation on what the proper way to fix this is.
> > 
> > Posting as an RFC might have been more appropriate then.
> > 
> 
> no, a revert is actually appropriate.  If a commit fixes something,
> but breaks something else, it gets either reverted or fixed. If nobody
> fixes it, then revert it is.

To answer Lukas's question btw: most of the details on how things break are
back in the original commit (sorry for forgetting the reference!), there's a
_lot_ of explanation there that I'd rather not retype, so just refer back to
the commit and bug @ https://bugs.freedesktop.org/show_bug.cgi?id=75985

Additionally, there was some extra discussion providing some more detail in
the email thread that I had with Bjorn:

https://lkml.org/lkml/2019/2/12/1172

As for how this commit breaks the workaround: it seems that when we enable the
HDA controller and put the GPU into multifunction mode, the function-level
reset stops working and thus we can't reset the GPU anymore. Currently I can
see a couple of solutions (again, please feel free to suggest more!):

 * Just revert the commit. We should do this if necessary, but of course I'd
   much rather try finding a fix first
 * Disable the HDA controller temporarily when a GPU reset is needed in
   quirk_reset_lenovo_thinkpad_p50_nvgpu(), then call the function level
   reset, then re-enable the HDA controller. I have no idea if this actually
   works yet, but I'm about to try this on my system
 * Get quirk_reset_lenovo_thinkpad_p50_nvgpu() to run before
   quirk_nvidia_hda(). This would probably be fine, but we would need to
   rework some stuff in the PCI subsystem (maybe it already has a way to do
   this? haven't checked yet) so that we could perform an FLR probe early
   enough to perform the quirk
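
The second option above could be sketched roughly as follows. This is a
user-space model only, written under the assumption (untested at this point
in the thread) that the reset fails while the HDA function is enabled. The
register offset 0x488 and bit 25 come from the reverted quirk; fake_gpu,
cfg_read, cfg_write, fake_reset and reset_with_hda_masked are hypothetical
names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HDA_ENABLE_REG 0x488u      /* offset used by the reverted quirk */
#define HDA_ENABLE_BIT (1u << 25)  /* bit 25 enables the HDA controller */

/* Hypothetical stand-in for a GPU's config space; not a kernel type. */
struct fake_gpu {
	uint32_t cfg[0x1000 / 4];
	int resets;                /* number of successful modelled resets */
};

static uint32_t cfg_read(const struct fake_gpu *g, uint32_t off)
{
	assert(off % 4 == 0 && off < sizeof(g->cfg));
	return g->cfg[off / 4];
}

static void cfg_write(struct fake_gpu *g, uint32_t off, uint32_t val)
{
	assert(off % 4 == 0 && off < sizeof(g->cfg));
	g->cfg[off / 4] = val;
}

/* Modelled reset: assumed to fail while the HDA function is enabled. */
static int fake_reset(struct fake_gpu *g)
{
	if (cfg_read(g, HDA_ENABLE_REG) & HDA_ENABLE_BIT)
		return -1;
	g->resets++;
	return 0;
}

/* Mask the HDA enable bit, reset, then restore the bit if it was set. */
static int reset_with_hda_masked(struct fake_gpu *g)
{
	uint32_t val = cfg_read(g, HDA_ENABLE_REG);
	bool hda_was_on = (val & HDA_ENABLE_BIT) != 0;
	int ret;

	if (hda_was_on)
		cfg_write(g, HDA_ENABLE_REG, val & ~HDA_ENABLE_BIT);

	ret = fake_reset(g);

	/* Re-read first: a real reset may have changed the register. */
	if (hda_was_on)
		cfg_write(g, HDA_ENABLE_REG,
			  cfg_read(g, HDA_ENABLE_REG) | HDA_ENABLE_BIT);
	return ret;
}
```

Whether real hardware tolerates toggling the bit around a reset like this is
exactly the open question; the model only shows the intended ordering.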
On Wed, 2019-07-31 at 17:35 -0400, Lyude Paul wrote:
> On Wed, 2019-07-31 at 23:26 +0200, Karol Herbst wrote:
> > On Wed, Jul 31, 2019 at 11:18 PM Lukas Wunner <lukas@wunner.de> wrote:
> > > On Wed, Jul 31, 2019 at 04:19:27PM -0400, Lyude Paul wrote:
> > > > While that commit fixes audio for a number of users, it has the side
> > > > effect of breaking the BIOS workaround that's required to make the GPU
> > > > on the Lenovo ThinkPad P50 work, by causing the GPU's PCI device
> > > > function to stop working after it's been set to multifunction mode.
> > > 
> > > This is missing a reference to the commit introducing the P50 quirk,
> > > which is e0547c81bfcf ("PCI: Reset Lenovo ThinkPad P50 nvgpu at boot
> > > if necessary").
> > > 
> > > Please describe in more detail how the GPU's PCI function stops working.
> > > Does it respond with "all ones" when accessing MMIO?
> > > Do MMIO accesses cause the system to hang?
> > > 
> > > Could you provide lspci -vvxx output for the GPU and its associated
> > > HDA controller with and without b516ea586d71?
> > > 
> > > Does this machine have external display connectors via which audio
> > > can be streamed?
> > > 
> > > 
> > > > I'm not really holding my breath on this patch being accepted:
> > > > there's a good chance there's a better solution for this (and I'm
> > > > going
> > > > to continue investigating for one after sending this patch), this is
> > > > more just to start a conversation on what the proper way to fix this
> > > > is.
> > > 
> > > Posting as an RFC might have been more appropriate then.
> > > 
> > 
> > no, a revert is actually appropriate.  If a commit fixes something,
> > but breaks something else, it gets either reverted or fixed. If nobody
> > fixes it, then revert it is.
> 
> To answer Lukas's question btw: most of the details on how things break are
> back in the original commit (sorry for forgetting the reference!), there's a
> _lot_ of explanation there that I'd rather not retype, so just refer back to
> the commit and bug @ https://bugs.freedesktop.org/show_bug.cgi?id=75985
> 
> Additionally, there was some extra discussion providing some more detail in
> the email thread that I had with Bjorn:
> 
> https://lkml.org/lkml/2019/2/12/1172
> 
> As for how this commit breaks the workaround: it seems that when we enable
> the
> HDA controller and put the GPU into multifunction mode, the function-level
> reset stops working and thus we can't reset the GPU anymore. Currently I can
> see a couple of solutions (again, please feel free to suggest more!):
> 
>  * Just revert the commit. We should do this if necessary, but of course I'd
>    much rather try finding a fix first
>  * Disable the HDA controller temporarily when a GPU reset is needed in
>    quirk_reset_lenovo_thinkpad_p50_nvgpu(), then call the function level
>    reset, then re-enable the HDA controller. I have no idea if this actually
>    works yet, but I'm about to try this on my system
>  * Get quirk_reset_lenovo_thinkpad_p50_nvgpu() to run before
>    quirk_nvidia_hda(). This would probably be fine, but we would need to
>    rework some stuff in the PCI subsystem (maybe it already has a way to do
>    this? haven't checked yet) so that we could perform an FLR probe early
>    enough to perform the quirk

Good news! After some investigation, it looks like the function-level reset
actually does work; it's just that after we put the GPU in multifunction mode,
pci_parent_bus_reset() sees multiple devices on the bus and returns -ENOTTY as
a result. So I should definitely be able to come up with a fix for this other
than reverting the commit :). Will send out patches soon.
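
For anyone following along, the behaviour described above can be modelled in
user space like this. It is a simplified sketch, not the kernel function:
fake_dev, fake_bus, bus_add and parent_bus_reset_allowed are hypothetical
names, and the real pci_parent_bus_reset() performs additional checks before
walking the bus's device list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_bus;

/* Hypothetical, simplified stand-ins for the kernel's bus/device lists. */
struct fake_dev {
	struct fake_bus *bus;
	struct fake_dev *next;     /* next device on the same bus */
};

struct fake_bus {
	struct fake_dev *devices;  /* singly linked list of devices */
};

static void bus_add(struct fake_bus *bus, struct fake_dev *dev)
{
	dev->bus = bus;
	dev->next = bus->devices;
	bus->devices = dev;
}

/*
 * A bus-wide reset hits every device on the bus, so it is only allowed
 * when `dev` is alone there. Returns 0 if allowed, -ENOTTY otherwise.
 */
static int parent_bus_reset_allowed(const struct fake_dev *dev)
{
	assert(dev != NULL && dev->bus != NULL);
	for (const struct fake_dev *d = dev->bus->devices; d; d = d->next)
		if (d != dev)
			return -ENOTTY;
	return 0;
}
```

With only the GPU on the bus the check passes; once the HDA function shows up
as a second device on the same bus, the same call returns -ENOTTY, which
matches the symptom.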
