NVidia Quadro P5000 in a Razer Core X Chroma eGPU fails to initialize

Discussion in 'The Linux Corner' started by andrejpodzimek, Nov 30, 2020.

Thread Status:
Not open for further replies.
  1. andrejpodzimek

    andrejpodzimek New Member

    Hi insiders! I have posted this on the NVidia forums and now repost here for increased visibility, hoping that someone from the community may have already encountered (or even resolved) this.

    I'm trying to make my NVidia Quadro P5000 work in a Razer Core X Chroma eGPU, but it just fails to initialize. (The GPU had been working fine for years in a regular desktop machine.) What could be wrong? I'm out of ideas.

    On a Desktop

    Motherboard: ASRock x570 Creator
    CPU: AMD Ryzen 3950X
    System: ArchLinux with kernel 5.9.11
    GPU in the on-board PCIe: AMD Radeon Pro W5700
    Related kernel flags: pci=realloc,assign-busses,hpbussize=0x33 radeon.auxch=1 mem_encrypt=on

    Without the pci=... flag, Thunderbolt devices don't work. With the flag they appear to work just fine (tested e.g. with a Lenovo Thunderbolt 3 dock).

    Here's the `dmesg` output from when I plug in the eGPU. The most relevant part might be:

    Nov 30 16:44:33 charon kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 234
    Nov 30 16:44:33 charon kernel: nvidia 0000:3d:00.0: enabling device (0000 -> 0003)
    Nov 30 16:44:33 charon kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
    NVRM: BAR0 is 0M @ 0x0 (PCI:0000:3d:00.0)
    Nov 30 16:44:33 charon kernel: NVRM: The system BIOS may have misconfigured your GPU.
    Nov 30 16:44:33 charon kernel: nvidia: probe of 0000:3d:00.0 failed with error -1
    Nov 30 16:44:33 charon kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
    Nov 30 16:44:33 charon kernel: NVRM: None of the NVIDIA devices were initialized.
    Nov 30 16:44:33 charon kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 234
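The "BAR0 is 0M @ 0x0" complaint can also be cross-checked without the NVIDIA driver by reading the device's `resource` file in sysfs, where each line holds a region's start, end and flags in hex. The following is a minimal sketch, not a definitive diagnostic: the helper names are made up for illustration, and the PCI address is simply the one from the log above.

```python
from pathlib import Path


def parse_resource_table(text):
    """Parse the text of a sysfs PCI `resource` file into (start, end, flags) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip anything that is not a "start end flags" row
        start, end, flags = (int(field, 16) for field in fields)
        rows.append((start, end, flags))
    return rows


def unassigned_bar0(rows):
    """Return True if the first region (BAR0) has no address range assigned."""
    if not rows:
        return True
    start, end, _flags = rows[0]
    return start == 0 and end == 0


if __name__ == "__main__":
    # Assumption: 0000:3d:00.0 is the address from the log above; adjust as needed.
    path = Path("/sys/bus/pci/devices/0000:3d:00.0/resource")
    if path.exists():
        rows = parse_resource_table(path.read_text())
        print("BAR0 unassigned:", unassigned_bar0(rows))
    else:
        print("device not present:", path)
```

If BAR0 shows up as all zeros here, the kernel never found room for the region, which would be consistent with the NVRM message.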

    I've searched for the error messages. Starting from the NVidia forums (1) (2), I've double-checked that
    • Above 4G Decoding is enabled in my UEFI setup and
    • I do have at least one 64-bit window (>8 hex digits) earlier in dmesg:
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io 0x0000-0x03af window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io 0x03e0-0x0cf7 window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io 0x03b0-0x03df window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x000c0000-0x000dffff window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0xb0000000-0xefffffff window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [mem 0x2050000000-0x7fffffffff window]
      Nov 30 15:44:31 archlinux kernel: pci_bus 0000:00: root bus resource [bus 00-ff]
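The "at least one 64-bit window" check can be done mechanically rather than by counting hex digits. Here is a rough sketch that scans dmesg text for root bus `mem` windows and keeps those starting at or above 4 GiB. The function names are illustrative, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as: root bus resource [mem 0xb0000000-0xefffffff window]
WINDOW_RE = re.compile(
    r"root bus resource \[mem (0x[0-9a-fA-F]+)-(0x[0-9a-fA-F]+) window\]"
)
FOUR_GIB = 1 << 32


def mem_windows(dmesg_text):
    """Return (start, end) for every root bus memory window found in the text."""
    return [(int(a, 16), int(b, 16)) for a, b in WINDOW_RE.findall(dmesg_text)]


def windows_above_4g(dmesg_text):
    """Return only the windows that start at or above the 4 GiB boundary."""
    return [(s, e) for s, e in mem_windows(dmesg_text) if s >= FOUR_GIB]


if __name__ == "__main__":
    sample = """\
pci_bus 0000:00: root bus resource [io 0x0d00-0xffff window]
pci_bus 0000:00: root bus resource [mem 0xb0000000-0xefffffff window]
pci_bus 0000:00: root bus resource [mem 0x2050000000-0x7fffffffff window]
"""
    for start, end in windows_above_4g(sample):
        print(f"64-bit window: {start:#x}-{end:#x}")
```

A non-empty result only shows that the firmware exposes address space above 4 GiB. It does not show that the hotplug bridge behind the Thunderbolt controller was given a share of it.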

    The eGPU appears normally in boltctl list (authorized etc.):
    ● Razer Core X Chroma
    ├─ type: peripheral
    ├─ name: Core X Chroma
    ├─ vendor: Razer
    ├─ uuid: 00653854-e510-2701-ffff-ffffffffffff
    ├─ generation: Thunderbolt 3
    ├─ status: authorized
    │ ├─ domain: ce010000-0060-6c0e-03b7-b91c46b12223
    │ ├─ rx speed: 40 Gb/s = 2 lanes * 20 Gb/s
    │ ├─ tx speed: 40 Gb/s = 2 lanes * 20 Gb/s
    │ └─ authflags: secure
    ├─ authorized: Mon 30 Nov 2020 05:12:12 PM UTC
    ├─ connected: Mon 30 Nov 2020 05:12:01 PM UTC
    └─ stored: Mon 30 Nov 2020 02:18:57 PM UTC
       ├─ policy: auto
       └─ key: yes
    ● Razer Core X Chroma #2
    ├─ type: peripheral
    ├─ name: Core X Chroma
    ├─ vendor: Razer
    ├─ uuid: 00306925-e510-2701-ffff-ffffffffffff
    ├─ generation: Thunderbolt 3
    ├─ status: authorized
    │ ├─ domain: ce010000-0060-6c0e-03b7-b91c46b12223
    │ ├─ rx speed: 40 Gb/s = 2 lanes * 20 Gb/s
    │ ├─ tx speed: 40 Gb/s = 2 lanes * 20 Gb/s
    │ └─ authflags: secure
    ├─ authorized: Mon 30 Nov 2020 05:12:18 PM UTC
    ├─ connected: Mon 30 Nov 2020 05:12:02 PM UTC
    └─ stored: Mon 30 Nov 2020 02:19:08 PM UTC
       ├─ policy: auto
       └─ key: yes

    NVidia Quadro P5000 appears in lspci. However, nothing else works, neither the NVidia itself nor the USB hub(s) (with a built-in ASIX ethernet) in the eGPU.

    Some threads recommended /sys/bus/pci/devices gymnastics, such as this post. Not only does that not work for me, but a crash reported back in 2015 still reproduces on my machine today: the system freezes and panic-reboots when I try it. So I haven't experimented any further.

    On a Laptop

    Machine: Lenovo X1 Carbon v7
    CPU: Intel Core i7-8665U
    System: Debian with kernel 5.9.8
    Related kernel flags: pci=noats

    Importantly, the laptop does not have the NVidia driver installed, because some forum posts explicitly asked for dmesg without the NVidia driver. So here is the `dmesg` output from the laptop without NVidia drivers.

    Again, boltctl list looks normal (authorized etc., just as above). The NVidia Quadro P5000 appears in lspci. The difference from the desktop case above is that at least something works — the USB buses and the ASIX network card (ax88179_178a). But the NVidia card doesn't work — "no space for" occurs a number of times in dmesg.
     
    Last edited: Nov 30, 2020
  2. andrejpodzimek

    andrejpodzimek New Member

    Update: I've managed to make it work on the laptop. Despite the issues reported in dmesg, enabling the NVidia driver made it work. I have the module loaded, nvidia-smi shows something reasonable and I can offload applications using __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only __GLX_VENDOR_LIBRARY_NAME=nvidia <command>, as described here.
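For anyone who prefers not to type the three variables every time, the offload invocation above can be wrapped in a small launcher. This is only a sketch: the variable names and values are exactly the ones quoted in the post, while the helper names and the `prime_run.py` file name are invented for illustration.

```python
import os
import subprocess
import sys

# Environment variables quoted in the post above for PRIME render offload.
OFFLOAD_ENV = {
    "__NV_PRIME_RENDER_OFFLOAD": "1",
    "__VK_LAYER_NV_optimus": "NVIDIA_only",
    "__GLX_VENDOR_LIBRARY_NAME": "nvidia",
}


def offload_environment(base=None):
    """Return a copy of the base environment with the offload variables added."""
    env = dict(os.environ if base is None else base)
    env.update(OFFLOAD_ENV)
    return env


def run_offloaded(command):
    """Run the command (a list of arguments) with offload enabled; return its exit code."""
    return subprocess.run(command, env=offload_environment()).returncode


if __name__ == "__main__":
    if len(sys.argv) > 1:
        sys.exit(run_offloaded(sys.argv[1:]))
    # prime_run.py is a hypothetical file name for this sketch.
    print("usage: prime_run.py <command> [args...]")
```

Saved under that hypothetical name, `python prime_run.py glxgears` would then be equivalent to prefixing `glxgears` with the three variables by hand.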

    Remaining problems:
    • It works fine for glxgears (uh oh) and (e.g.) for stellarium, but google-chrome yields a 100% black window.
    • It's terribly slow; acceleration on the Intel GPU is way faster. I think this is because I have a 4k laptop display, a 5k Thunderbolt monitor connected to one TB port of the laptop (daisy-chained through a dock, actually) and the eGPU connected to the other TB port. So there may be (?) a shortage of bandwidth somewhere in the setup (laptop → eGPU → laptop → dock → monitor). (I can't connect the monitor directly to the eGPU, because the P5000 has no Thunderbolt output.)
    • The desktop: It just doesn't work. But I think this narrows the possible causes down to the ASRock x570 Creator motherboard. In hindsight I should have asked ASRock rather than NVidia and Razer.
    Summary: I'll ask ASRock about this. Perhaps this is an inherent limitation of the CPU/chipset that doesn't allow an additional GPU when there is already a GPU in a slot (AMD Radeon Pro W5700), Thunderbolt is enabled, both M.2 slots have SSDs in them, etc.
     
    Last edited: Dec 1, 2020
  3. andrejpodzimek

    andrejpodzimek New Member

    Alright, I've figured it out, based on this post. The magic is:
    Code:
    pcie_ports=native pci=assign-busses,hpbussize=0x33,realloc,hpmmiosize=128M,hpmmioprefsize=16G
    
    With this^^^ on the kernel command line, I can just plug in the eGPU and it works, no problem at all. The nvidia kernel module loads correctly and I'm calculating Folding@Home on the eGPU right now, so it definitely works.
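Since a bootloader typo is easy to miss, it may help to confirm that the running kernel actually received these options. The sketch below parses a command line string and reports which of the settings from the post are absent. The expected values are the ones quoted above, and the helper names are illustrative only.

```python
from pathlib import Path

# Values of the pci= sub-options quoted in the post above ("" means a bare flag).
EXPECTED = {
    "assign-busses": "",
    "hpbussize": "0x33",
    "realloc": "",
    "hpmmiosize": "128M",
    "hpmmioprefsize": "16G",
}


def parse_pci_options(cmdline):
    """Collect the comma-separated sub-options of every pci= parameter."""
    options = {}
    for token in cmdline.split():
        if not token.startswith("pci="):
            continue
        for item in token[len("pci="):].split(","):
            key, _, value = item.partition("=")
            options[key] = value
    return options


def missing_options(cmdline):
    """List the expected settings that are absent or have a different value."""
    found = parse_pci_options(cmdline)
    missing = [key for key, value in EXPECTED.items() if found.get(key) != value]
    if "pcie_ports=native" not in cmdline.split():
        missing.append("pcie_ports=native")
    return missing


if __name__ == "__main__":
    path = Path("/proc/cmdline")
    if path.exists():
        print("missing:", missing_options(path.read_text()) or "nothing")
```

Comparing against the flags from the first post shows what changed: the earlier command line lacked `hpmmiosize`, `hpmmioprefsize` and `pcie_ports=native`.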

    (My machine won't boot if I add the recommended nocrs to pci=..., because the kernel can't talk to SATA controllers and drives in that mode and freezes forever while trying to do so. But the eGPU works without nocrs just fine, so I'm not messing with that any further.)
     
  4. DeepCerisePERIDOTbiz287

    DeepCerisePERIDOTbiz287 Active Member

    Razer will always work better with razer.
     