amd-staging-drm-next: Oops - BUG: unable to handle kernel NULL pointer dereference, bisected.

Submitted by Koenig, Christian on Jan. 30, 2019, 12:42 p.m.

Details

Message ID 7a65412b-1b41-9e5b-f700-0a944a33cf49@amd.com
State New
Series "amd-staging-drm-next: Oops - BUG: unable to handle kernel NULL pointer dereference, bisected." ( rev: 1 ) in AMD X.Org drivers

Commit Message

Koenig, Christian Jan. 30, 2019, 12:42 p.m.
Does the attached patch fix the issue?

Christian.

On 30.01.19 at 13:06, Christian König wrote:
Sorry I accidentally replied to the wrong mail.

This is a new issue. Going to take a look now.

Christian.

On 30.01.19 at 13:02, Christian König wrote:
This is a known issue, see here as well https://bugs.freedesktop.org/show_bug.cgi?id=109487

Christian.

On 30.01.19 at 12:07, Przemek Socha wrote:

Good morning,

after the last pull from the amd-staging-drm-next tree (29th of January) I get
random Oopses on an A6 6310 APU with r4 Mullins.

Here is the Oops part of the log taken from pstore:

<1>[   55.166270] BUG: unable to handle kernel NULL pointer dereference at 0000000000000208
<1>[   55.166281] #PF error: [normal kernel read fault]
<6>[   55.166285] PGD 0 P4D 0
<4>[   55.166293] Oops: 0000 [#1] PREEMPT SMP
<4>[   55.166301] CPU: 3 PID: 11006 Comm: kwin_x11:cs0 Not tainted 5.0.0-rc1+ #44
<4>[   55.166305] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 08/04/2016
<4>[   55.166320] RIP: 0010:ttm_bo_bulk_move_lru_tail+0xd3/0x188 [ttm]
<4>[   55.166326] Code: 00 4c 8b 0a 48 8b 81 a8 00 00 00 48 81 c1 a8 00 00 00 49 89 02 4c 8b 92 b0 00 00 00 4c 89 50 08 44 89 c0 48 c1 e0 04 4c 01 c8 <4c> 8b 90 08 02 00 00 4d 89 1a 4c 8b 90 08 02 00 00 4c 89 92 b0 00
<4>[   55.166330] RSP: 0018:ffffa8bdc0f33b18 EFLAGS: 00010246
<4>[   55.166335] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9cfa935778f8
<4>[   55.166339] RDX: ffff9cfa950c5050 RSI: 0000000000000070 RDI: ffff9cfa93575dd0
<4>[   55.166342] RBP: ffff9cfa5d44d800 R08: 0000000000000000 R09: 0000000000000000
<4>[   55.166346] R10: ffff9cfa8f7730f8 R11: ffff9cfa950c50f8 R12: ffff9cfa93575dd0
<4>[   55.166350] R13: ffff9cfa93575800 R14: 0000000000000001 R15: ffffffffc03adc10
<4>[   55.166355] FS:  00007fb327fff700(0000) GS:ffff9cfa97b80000(0000) knlGS:0000000000000000
<4>[   55.166359] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   55.166363] CR2: 0000000000000208 CR3: 00000002150f0000 CR4: 00000000000406e0
<4>[   55.166366] Call Trace:
<4>[   55.166477]  amdgpu_vm_move_to_lru_tail+0xe4/0x100 [amdgpu]
<4>[   55.166563]  amdgpu_cs_ioctl+0x14e7/0x1b08 [amdgpu]
<4>[   55.166586]  ? __switch_to_asm+0x40/0x70
<4>[   55.166689]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
<4>[   55.166698]  drm_ioctl_kernel+0xa4/0xe8
<4>[   55.166707]  drm_ioctl+0x1db/0x358
<4>[   55.166805]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
<4>[   55.166901]  amdgpu_drm_ioctl+0x44/0x78 [amdgpu]
<4>[   55.166931]  do_vfs_ioctl+0x9f/0x618
<4>[   55.166940]  ksys_ioctl+0x5b/0x88
<4>[   55.166947]  __x64_sys_ioctl+0x11/0x18
<4>[   55.166955]  do_syscall_64+0x50/0x168
<4>[   55.166963]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
<4>[   55.166969] RIP: 0033:0x7fb34b035fa7
<4>[   55.166974] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 8d dc 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a9 ae 0c 00 f7 d8 64 89 01 48
<4>[   55.166978] RSP: 002b:00007fb327ffea88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[   55.166984] RAX: ffffffffffffffda RBX: 00007fb327ffec58 RCX: 00007fb34b035fa7
<4>[   55.166987] RDX: 00007fb327ffeb10 RSI: 00000000c0186444 RDI: 0000000000000010
<4>[   55.166991] RBP: 00007fb327ffeb10 R08: 00007fb327ffec80 R09: 00007fb327ffec58
<4>[   55.166995] R10: 00007fb327ffeca0 R11: 0000000000000246 R12: 00000000c0186444
<4>[   55.166998] R13: 0000000000000010 R14: 000055ecd2705dc0 R15: 0000000000000003
<4>[   55.167004] Modules linked in: rfcomm nf_tables ebtable_nat ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables overlay squashfs loop bnep ipv6 rtsx_usb_ms memstick rtsx_usb_sdmmc rtsx_usb uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev media ath3k btusb btintel bluetooth ecdh_generic ath9k ath9k_common kvm_amd ath9k_hw sdhci_pci kvm cqhci irqbypass mac80211 sdhci crc32_pclmul ghash_clmulni_intel ath serio_raw mmc_core cfg80211 amdgpu mfd_core chash gpu_sched xhci_pci ttm xhci_hcd ehci_pci ehci_hcd sp5100_tco
<4>[   55.167063] CR2: 0000000000000208
<4>[   55.167069] ---[ end trace bf1c4be089002236 ]---
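
A note on reading the Oops above: the byte in angle brackets on the Code:
line, <4c>, is the first byte of the faulting instruction. The bytes
4c 8b 90 08 02 00 00 decode to mov 0x208(%rax),%r10, RAX is 0, and CR2 reports
0000000000000208, so the fault looks like a read of a field at offset 0x208
through a NULL base pointer. The sketch below reproduces that arithmetic in
user-space C. The function names are hypothetical, and the decoder only
handles this one instruction shape.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Decode the 32-bit displacement of a "mov disp32(%base),%reg" instruction:
 * REX prefix, opcode 0x8b, ModRM with mod == 2 and no SIB byte.
 * Returns 0 on success, -1 if the bytes do not match that shape.
 */
int decode_mov_disp32(const uint8_t *insn, size_t len, uint32_t *disp)
{
	if (len < 7)
		return -1;
	if ((insn[0] & 0xf0) != 0x40)	/* REX prefix */
		return -1;
	if (insn[1] != 0x8b)		/* MOV reg, r/m */
		return -1;
	if ((insn[2] >> 6) != 2)	/* ModRM.mod == 10b: disp32 follows */
		return -1;
	if ((insn[2] & 7) == 4)		/* rm == 100b means a SIB byte; not handled */
		return -1;

	/* The displacement is stored little-endian after the ModRM byte. */
	*disp = (uint32_t)insn[3] | (uint32_t)insn[4] << 8 |
		(uint32_t)insn[5] << 16 | (uint32_t)insn[6] << 24;
	return 0;
}

/* Effective address of disp32(%base); the displacement is sign-extended. */
uint64_t effective_address(uint64_t base, uint32_t disp)
{
	return base + (uint64_t)(int64_t)(int32_t)disp;
}
```

Feeding it the seven bytes above gives a displacement of 0x208, and
effective_address(0, 0x208) matches the CR2 value in the log.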

I bisected it, and the first bad commit appears to be "drm/amdgpu: cleanup
setting bulk_movable". I hope this is relevant.

Full git bisect log:

git bisect start
# good: [10117450735c7a7c0858095fb46a860e7037cb9a] drm/amd/display: add -msse2 to prevent Clang from emitting libcalls to undefined SW FP routines
git bisect good 10117450735c7a7c0858095fb46a860e7037cb9a
# bad: [b9c6252b7f980e7e03c0bf659a251798b36a8094] Revert "drm/amd/display: add -msse2 to prevent Clang from emitting libcalls to undefined SW FP routines"
git bisect bad b9c6252b7f980e7e03c0bf659a251798b36a8094
# good: [1de29da5b7281c9a8427d84948bf3d77bc4b8d16] drm: disable uncached DMA optimization for ARM and arm64
git bisect good 1de29da5b7281c9a8427d84948bf3d77bc4b8d16
# good: [bbf48cae572b39c4df6023b01d6f8de66ef41b34] Revert "test patch for hpd dpms check"
git bisect good bbf48cae572b39c4df6023b01d6f8de66ef41b34
# good: [257b75d373c77d6792d0011f7379398ba60799ec] drm/amdgpu: Show XGMI node and hive message per device only once
git bisect good 257b75d373c77d6792d0011f7379398ba60799ec
# good: [4d771657c533d8fe3b574c561084f66aebc77bb6] drm/amdgpu: cleanup amdgpu_pte_update_params
git bisect good 4d771657c533d8fe3b574c561084f66aebc77bb6
# bad: [4ef27005fefd4be102010b7d8552fec1ee13435a] drm/amdgpu: cleanup setting bulk_movable
git bisect bad 4ef27005fefd4be102010b7d8552fec1ee13435a
# first bad commit: [4ef27005fefd4be102010b7d8552fec1ee13435a] drm/amdgpu: cleanup setting bulk_movable
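
The bisect log above is a binary search over history: each test halves the
range between the last known good and the first known bad commit. A minimal
sketch of that search in C, assuming a linear history and that every commit
after the first bad one is also bad (real git bisect works on the commit
graph). first_bad_commit, is_bad_fn and bad_from are hypothetical names.

```c
#include <stddef.h>

/* Test callback: returns nonzero if the commit at index i shows the bug. */
typedef int (*is_bad_fn)(size_t i, void *ctx);

/*
 * Find the first bad index in (good, bad]. The caller guarantees that index
 * "good" tests good and index "bad" tests bad. *steps receives the number of
 * commits that had to be tested.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad, void *ctx,
			size_t *steps)
{
	*steps = 0;
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		(*steps)++;
		if (is_bad(mid, ctx))
			bad = mid;	/* the regression is at mid or earlier */
		else
			good = mid;	/* the regression is after mid */
	}
	return bad;
}

/* Example predicate: every index at or after *ctx is bad. */
int bad_from(size_t i, void *ctx)
{
	return i >= *(size_t *)ctx;
}
```

With 40 commits between the good and bad ends, the loop needs at most six
tests to isolate the first bad one.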

4ef27005fefd4be102010b7d8552fec1ee13435a is the first bad commit
commit 4ef27005fefd4be102010b7d8552fec1ee13435a
Author: Christian König <christian.koenig@amd.com>
Date:   Mon Jan 28 13:41:58 2019 +0100

    drm/amdgpu: cleanup setting bulk_movable

    We only need to set this to false now when BOs are removed from the LRU.

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Chunming Zhou <david1.zhou@amd.com>


If any other info is needed, please do not hesitate to ask.

Thanks,
Przemek.




_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Patch

From 3a7a65eb1952439a90f244a07f6d9bb338c2e4b1 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Christian=20K=C3=B6nig?= <christian.koenig@amd.com>
Date: Wed, 30 Jan 2019 13:41:05 +0100
Subject: [PATCH] drm/amdgpu: partial revert cleanup setting bulk_movable
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

We still need to set bulk_movable to false when new BOs are added.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 79f9dde70bc0..1e101a77eec9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -332,6 +332,7 @@  static void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
 	if (bo->tbo.resv != vm->root.base.bo->tbo.resv)
 		return;
 
+	vm->bulk_moveable = false;
 	if (bo->tbo.type == ttm_bo_type_kernel)
 		amdgpu_vm_bo_relocated(base);
 	else
-- 
2.17.1


Comments

On Wednesday, 30 January 2019 13:42:33 CET you wrote:
> Does the attached patch fix the issue?
> 
> Christian.
> 
> .....

Thanks for the rapid response, but unfortunately no.
The system freezes and only the mouse pointer still moves (I cannot switch
ttys, and neither rebooting with the power button nor the three-finger salute
works).

Here is a trace log taken after applying the patch. I'm attaching it because it
looks different:

<4>[   46.864336] ------------[ cut here ]------------
<2>[   46.864343] kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:196!
<4>[   46.864361] invalid opcode: 0000 [#1] PREEMPT SMP
<4>[   46.864369] CPU: 3 PID: 10966 Comm: plasmashel:cs0 Not tainted 5.0.0-
rc1+ #44
<4>[   46.864373] Hardware name: LENOVO 80E3/Lancer 5B2, BIOS A2CN45WW(V2.13) 
08/04/2016
<4>[   46.864388] RIP: 0010:ttm_bo_ref_bug+0x0/0x8 [ttm]
<4>[   46.864393] Code: 00 00 08 00 75 0c 48 83 c7 0c 4c 39 cf 75 ab 31 c0 c3 
b8 01 00 00 00 c3 66 90 f0 ff 8f a4 00 00 00 c3 0f 1f 84 00 00 00 00 00 <0f> 0b 
66 0f 1f 44 00 00 53 48 8b 07 48 89 fb 48 8b 40 18 48 8b 40
<4>[   46.864397] RSP: 0018:ffffa86fc1263af8 EFLAGS: 00010247
<4>[   46.864403] RAX: ffff8c7b133a787c RBX: ffffa86fc1263c48 RCX: ffff8c7b0f7698f8
<4>[   46.864406] RDX: ffff8c7b133a78f8 RSI: ffff8c7b11aa2800 RDI: ffff8c7b133a787c
<4>[   46.864410] RBP: ffff8c7ac16d1b38 R08: ffff8c7b1348d0f8 R09: ffffa86fc12639b0
<4>[   46.864414] R10: ffffcfd6c84d07c0 R11: 0000000000000003 R12: ffffffffc0364c10
<4>[   46.864417] R13: ffffa86fc1263be0 R14: 0000000000000000 R15: 
ffffa86fc1263c48
<4>[   46.864422] FS:  00007f3e34019700(0000) GS:ffff8c7b17b80000(0000) knlGS:
0000000000000000
<4>[   46.864426] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   46.864430] CR2: 00007fb1cb765000 CR3: 000000021333e000 CR4: 
00000000000406e0
<4>[   46.864433] Call Trace:
<4>[   46.864446]  ttm_bo_del_from_lru+0xab/0xc8 [ttm]
<4>[   46.864456]  ttm_eu_reserve_buffers+0x140/0x2c8 [ttm]
<4>[   46.864557]  amdgpu_cs_ioctl+0x4ee/0x1b08 [amdgpu]
<4>[   46.864575]  ? __switch_to_asm+0x40/0x70
<4>[   46.864668]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
<4>[   46.864678]  drm_ioctl_kernel+0xa4/0xe8
<4>[   46.864686]  drm_ioctl+0x1db/0x358
<4>[   46.864767]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
<4>[   46.864848]  amdgpu_drm_ioctl+0x44/0x78 [amdgpu]
<4>[   46.864859]  do_vfs_ioctl+0x9f/0x618
<4>[   46.864867]  ksys_ioctl+0x5b/0x88
<4>[   46.864874]  __x64_sys_ioctl+0x11/0x18
<4>[   46.864881]  do_syscall_64+0x50/0x168
<4>[   46.864888]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
<4>[   46.864895] RIP: 0033:0x7f3e4a939fa7
<4>[   46.864900] Code: 00 00 00 75 0c 48 c7 c0 ff ff ff ff 48 83 c4 18 c3 e8 8d 
dc 01 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 10 00 00 00 0f 05 <48> 3d 
01 f0 ff ff 73 01 c3 48 8b 0d a9 ae 0c 00 f7 d8 64 89 01 48
<4>[   46.864904] RSP: 002b:00007f3e34018ab8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
<4>[   46.864909] RAX: ffffffffffffffda RBX: 00007f3e34018c58 RCX: 00007f3e4a939fa7
<4>[   46.864913] RDX: 00007f3e34018b40 RSI: 00000000c0186444 RDI: 
0000000000000010
<4>[   46.864916] RBP: 00007f3e34018b40 R08: 00007f3e34018c80 R09: 
00007f3e34018c58
<4>[   46.864920] R10: 00007f3e34018ca0 R11: 0000000000000246 R12: 
00000000c0186444
<4>[   46.864923] R13: 0000000000000010 R14: 000055555e550d70 R15: 
0000000000000003
<4>[   46.864929] Modules linked in: rfcomm nf_tables ebtable_nat ip_set 
nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables overlay squashfs 
loop bnep ipv6 rtsx_usb_ms memstick rtsx_usb_sdmmc rtsx_usb ath3k btusb 
btintel bluetooth ecdh_generic uvcvideo videobuf2_vmalloc videobuf2_memops 
videobuf2_v4l2 videobuf2_common videodev media kvm_amd ath9k kvm ath9k_common 
irqbypass ath9k_hw crc32_pclmul mac80211 sdhci_pci cqhci sdhci 
ghash_clmulni_intel serio_raw mmc_core ath cfg80211 amdgpu mfd_core chash 
gpu_sched xhci_pci ttm ehci_pci xhci_hcd ehci_hcd sp5100_tco
<4>[   46.864981] ---[ end trace 7bdf1a5927cdc874 ]---

Thanks,
Przemek.
On 2019-01-30 7:42 a.m., Koenig, Christian wrote:
> Does the attached patch fix the issue?


No.  Now I get a lockup when I start GNOME and try to bring up a 
terminal.  The patch also didn't apply cleanly on top of drm-next but I 
was able to just manually add the line.

[   88.018735] general protection fault: 0000 [#1] SMP NOPTI
[   88.018741] CPU: 5 PID: 4164 Comm: gnome-shel:cs0 Tainted: G        W 
         5.0.0-rc1+ #20
[   88.018743] Hardware name: System manufacturer System Product 
Name/TUF B350M-PLUS GAMING, BIOS 4011 04/19/2018
[   88.018750] RIP: 0010:ttm_bo_bulk_move_lru_tail+0x36/0x190 [ttm]
[   88.018753] Code: 90 48 85 d2 74 66 48 8b 4c 37 98 4c 8b 92 b0 00 00 
00 4c 8d 9a a8 00 00 00 4c 8b 0a 48 8b 81 a8 00 00 00 48 81 c1 a8 00 00 
00 <49> 89 02 4c 8b 92 b0 00 00 00 4c 89 50 08 44 89 c0 48 c1 e0 04 4c
[   88.018755] RSP: 0018:ffffb419c1fefb18 EFLAGS: 00010296
[   88.018757] RAX: ffff9692d9a013a0 RBX: 0000000000000000 RCX: 
ffff9693032f2f90
[   88.018759] RDX: ffff9692e099cad8 RSI: 0000000000000070 RDI: 
ffff9693058a7598
[   88.018761] RBP: ffff9692ed34f4e8 R08: 0000000000000000 R09: 
6b6b6b6b6b6b6b6b
[   88.018762] R10: 6b6b6b6b6b6b6b6b R11: ffff9692e099cb80 R12: 
ffff9693058a7598
[   88.018763] R13: ffff9693058a6fc8 R14: 0000000000000001 R15: 
ffffffffc033dbc0
[   88.018765] FS:  00007fc351843700(0000) GS:ffff969337b40000(0000) 
knlGS:0000000000000000
[   88.018767] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   88.018769] CR2: 00007fa78adb08a0 CR3: 0000000206d56000 CR4: 
00000000003406e0
[   88.018770] Call Trace:
[   88.018807]  amdgpu_vm_move_to_lru_tail+0xe1/0x100 [amdgpu]
[   88.018842]  amdgpu_cs_ioctl+0x14de/0x1ad0 [amdgpu]
[   88.018846]  ? __switch_to_asm+0x34/0x70
[   88.018881]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[   88.018884]  drm_ioctl_kernel+0xa4/0xf0
[   88.018887]  drm_ioctl+0x1db/0x370
[   88.018921]  ? amdgpu_cs_find_mapping+0x110/0x110 [amdgpu]
[   88.018970]  amdgpu_drm_ioctl+0x44/0x80 [amdgpu]
[   88.018975]  do_vfs_ioctl+0x9f/0x610
[   88.018980]  ? __x64_sys_futex+0x137/0x180
[   88.018983]  ksys_ioctl+0x5b/0x90
[   88.018986]  __x64_sys_ioctl+0x11/0x20
[   88.018989]  do_syscall_64+0x43/0xf0
[   88.018992]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   88.018995] RIP: 0033:0x7fc37a1b1c97
[   88.018997] Code: 00 00 90 48 8b 05 09 82 2c 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d9 81 2c 00 f7 d8 64 89 01 48
[   88.018999] RSP: 002b:00007fc3518425d8 EFLAGS: 00000202 ORIG_RAX: 
0000000000000010
[   88.019002] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 
00007fc37a1b1c97
[   88.019004] RDX: 00007fc3518426b0 RSI: 00000000c0186444 RDI: 
000000000000000c
[   88.019005] RBP: 00007fc351842610 R08: 00007fc351842720 R09: 
00007fc351842798
[   88.019007] R10: 0000000000000000 R11: 0000000000000202 R12: 
00007ffdf8dc068e
[   88.019009] R13: 00007ffdf8dc068f R14: 00007ffdf8dc0690 R15: 
0000000000000000
[   88.019011] Modules linked in: fuse amdgpu mfd_core chash gpu_sched 
ttm ax88179_178a usbnet
[   88.019019] ---[ end trace dc532cd45c6dc064 ]---

Tom
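
A note on the register dump above: R09 and R10 both hold 6b6b6b6b6b6b6b6b.
0x6b is the kernel's POISON_FREE byte (include/linux/poison.h), which slab
debugging writes over freed objects, so this pattern usually points at a use
after free. That reading assumes slab poisoning was enabled in this build,
which the log does not state. The faulting bytes <49> 89 02 decode to
mov %rax,(%r10), and a store through a non-canonical address like this one
raises a general protection fault rather than a page fault. A small
user-space sketch of both checks, with hypothetical helper names and assuming
48-bit virtual addresses:

```c
#include <stdbool.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* same value as POISON_FREE in include/linux/poison.h */

/* True if every byte of v equals the free-poison byte. */
bool is_free_poison(uint64_t v)
{
	return v == 0x0101010101010101ULL * POISON_FREE;
}

/* True if addr is canonical for 48-bit addressing: bits 63..47 all equal. */
bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47; /* the upper 17 bits */

	return top == 0 || top == 0x1ffff;
}
```

The R10 value from the log is poison and non-canonical, while the RAX value
ffff9692d9a013a0 is an ordinary canonical kernel address.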

On Wednesday, 30 January 2019 13:42:33 CET you wrote:
> Does the attached patch fix the issue?
> 
> Christian.

I have tested this one also - "drm/amdgpu: partial revert cleanup setting 
bulk_movable v2"

>We still need to set bulk_movable to false when new BOs are added or removed.
>
>v2: also set it to false on removal
>
>Signed-off-by: Christian König <christian.koenig@amd.com>
>---
> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
>diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>index 79f9dde70bc0..822546a149fa 100644
>--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>@@ -332,6 +332,7 @@  static void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
> 	if (bo->tbo.resv != vm->root.base.bo->tbo.resv)
> 		return;
> 
>+	vm->bulk_moveable = false;
> 	if (bo->tbo.type == ttm_bo_type_kernel)
> 		amdgpu_vm_bo_relocated(base);
> 	else
>@@ -2772,6 +2773,9 @@  void amdgpu_vm_bo_rmv(struct amdgpu_device *adev,
> 	struct amdgpu_vm_bo_base **base;
> 
> 	if (bo) {
>+		if (bo->tbo.resv == vm->root.base.bo->tbo.resv)
>+			vm->bulk_moveable = false;
>+
> 		for (base = &bo_va->base.bo->vm_bo; *base;
> 		     base = &(*base)->next) {
> 			if (*base != &bo_va->base)

and so far I have seen no lockup and no Oops, so I think this one is OK.

Thank you very much,
Przemek.
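
The v2 patch above treats vm->bulk_moveable as a validity flag for a cached
LRU range: the call traces show amdgpu_vm_move_to_lru_tail() handing that
range to ttm_bo_bulk_move_lru_tail(), so the flag has to be cleared whenever
the set of BOs sharing the root reservation object changes. The following is
a user-space model of that invariant, not the kernel code. All model_* names
are hypothetical, and the rebuild logic is an assumption about how the
fast and slow paths relate.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_BOS 8

/* Hypothetical stand-in for a buffer object; not the kernel structure. */
struct model_bo {
	int id;
	bool shares_root_resv; /* models bo->tbo.resv == vm->root.base.bo->tbo.resv */
};

/* Hypothetical stand-in for the VM state discussed in the thread. */
struct model_vm {
	struct model_bo *bos[MODEL_MAX_BOS];	/* BOs currently tracked */
	size_t count;
	struct model_bo *first, *last;		/* cached range for the fast path */
	bool bulk_moveable;			/* true while the cache is valid */
};

/* Adding a BO: mirrors the hunk in amdgpu_vm_bo_base_init(). */
bool model_vm_add(struct model_vm *vm, struct model_bo *bo)
{
	if (vm->count == MODEL_MAX_BOS)
		return false;
	if (bo->shares_root_resv)
		vm->bulk_moveable = false;
	vm->bos[vm->count++] = bo;
	return true;
}

/* Removing a BO: mirrors the hunk in amdgpu_vm_bo_rmv(). */
void model_vm_rmv(struct model_vm *vm, struct model_bo *bo)
{
	size_t i;

	if (bo->shares_root_resv)
		vm->bulk_moveable = false;
	for (i = 0; i < vm->count; i++) {
		if (vm->bos[i] != bo)
			continue;
		/* Close the gap left by the removed entry. */
		for (; i + 1 < vm->count; i++)
			vm->bos[i] = vm->bos[i + 1];
		vm->count--;
		return;
	}
}

/* Returns true if the cached range was reused, false if it was rebuilt. */
bool model_vm_move_to_lru_tail(struct model_vm *vm)
{
	if (vm->bulk_moveable)
		return true; /* fast path: trusts vm->first and vm->last */

	vm->first = vm->count ? vm->bos[0] : NULL;
	vm->last = vm->count ? vm->bos[vm->count - 1] : NULL;
	vm->bulk_moveable = true;
	return false;
}
```

In this model, dropping the flag reset from model_vm_rmv() would leave
vm->last pointing at a BO that is no longer tracked, and the fast path would
go on using it.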
On 2019-01-31 4:23 a.m., Przemek Socha wrote:
> On Wednesday, 30 January 2019 13:42:33 CET you wrote:
>> Does the attached patch fix the issue?
>>
>> Christian.
>
> I have tested this one also - "drm/amdgpu: partial revert cleanup setting
> bulk_movable v2"
>
>> We still need to set bulk_movable to false when new BOs are added or removed.
>>
>> v2: also set it to false on removal
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 79f9dde70bc0..822546a149fa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -332,6 +332,7 @@  static void amdgpu_vm_bo_base_init(struct amdgpu_vm_bo_base *base,
>> 	if (bo->tbo.resv != vm->root.base.bo->tbo.resv)
>> 		return;
>>
>> +	vm->bulk_moveable = false;
>> 	if (bo->tbo.type == ttm_bo_type_kernel)
>> 		amdgpu_vm_bo_relocated(base);
>> 	else
>> @@ -2772,6 +2773,9 @@  void amdgpu_vm_bo_rmv(struct amdgpu_device *adev,
>> 	struct amdgpu_vm_bo_base **base;
>>
>> 	if (bo) {
>> +		if (bo->tbo.resv == vm->root.base.bo->tbo.resv)
>> +			vm->bulk_moveable = false;
>> +
>> 		for (base = &bo_va->base.bo->vm_bo; *base;
>> 		     base = &(*base)->next) {
>> 			if (*base != &bo_va->base)
>
> and so far I have no lockup and Oops, so I think this one is ok.


In my experience only the last chunk of the patch is necessary.  Can you 
try this without:

 >> +	vm->bulk_moveable = false;


Too?

Thanks,
Tom
On Thursday, 31 January 2019 17:56:32 CET you wrote:

> In my experience only the last chunk of the patch is necessary.  Can you 
> try this without:
> 
> 
>  >> +	vm->bulk_moveable = false;
> 
> 
> Too?
> 
> Thanks,
> Tom

Sure.

I have applied only the last chunk of the patch on top of today's
amd-staging-drm-next pull:

> >> @@ -2772,6 +2773,9 @@  void amdgpu_vm_bo_rmv(struct amdgpu_device *adev,
> >> 	struct amdgpu_vm_bo_base **base;
> >>
> >> 	if (bo) {
> >> +		if (bo->tbo.resv == vm->root.base.bo->tbo.resv)
> >> +			vm->bulk_moveable = false;
> >> +
> >> 		for (base = &bo_va->base.bo->vm_bo; *base;
> >> 		     base = &(*base)->next) {
> >> 			if (*base != &bo_va->base)

and it also seems to be working as expected.

Thanks,
Przemek.