


EGPU

TL;DR of https://docs.kernel.org/admin-guide/thunderbolt.html:

  1. upgrade kernel (??)
  2. install gfx (nvidia|amd) drivers
  3. plug card
  4. reboot
  5. trust (authorize) the thunderbolt device
The authorized attribute reads 0 which means no PCIe tunnels are created yet. The user can authorize the device by simply entering:

# echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized

This will create the PCIe tunnels and the device is now connected.
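
A minimal sketch of the authorization step, assuming the sysfs layout described in the kernel docs. The function names `tb_list` and `tb_authorize_all` and the optional directory argument are made up here so the logic can be tried against a fake directory tree; on a real system the write needs root.

```shell
# List Thunderbolt devices as: <id> <authorized> <device_name>.
# $1 is an optional (hypothetical) override of the sysfs directory.
tb_list() {
    root="${1:-/sys/bus/thunderbolt/devices}"
    for dev in "$root"/*; do
        # Entries such as domain0 have no "authorized" attribute; skip them.
        [ -f "$dev/authorized" ] || continue
        name="unknown"
        if [ -f "$dev/device_name" ]; then
            name=$(cat "$dev/device_name")
        fi
        printf '%s\t%s\t%s\n' "$(basename "$dev")" "$(cat "$dev/authorized")" "$name"
    done
}

# Write 1 to every device whose "authorized" attribute currently reads 0.
tb_authorize_all() {
    root="${1:-/sys/bus/thunderbolt/devices}"
    for dev in "$root"/*; do
        [ -f "$dev/authorized" ] || continue
        if [ "$(cat "$dev/authorized")" = "0" ]; then
            echo 1 > "$dev/authorized"
        fi
    done
}
```

Run `tb_list` first to see which device id (0-1 in the example above) is waiting for authorization, then run `tb_authorize_all` from a root shell.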

upgrade kernel

from mainline (the kernel from the ubuntu dist-upgrade is too conservative: 5.19)

cd /tmp
rm -i *deb
 
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-headers-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-headers-6.3.7-060307_6.3.7-060307.202306090936_all.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-image-unsigned-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-modules-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
 
sudo dpkg -i *.deb
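
After the reboot it is worth checking that the machine actually came up on the new kernel. A small sketch, assuming GNU `sort` with `-V` is available; `version_ge` is a helper name invented here, and 6.3.7 is simply the version downloaded above.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on the version sort (-V) of GNU sort.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# uname -r prints something like 6.3.7-060307-generic;
# keep only the part before the first dash.
running=$(uname -r | cut -d- -f1)
if version_ge "$running" "6.3.7"; then
    echo "running kernel $running is at least 6.3.7"
else
    echo "running kernel $running is older than 6.3.7, reboot into the new one?"
fi
```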

trust

Hmm, the card needs to be connected before boot.
Now for permissions:

$ sudo dmesg
dprobe" pid=563 comm="apparmor_parser"
[    7.888207] audit: type=1400 audit(1686781044.331:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=563 comm="apparmor_parser"

authorized the tamala!

(base) user@eight:~$ echo 1 | sudo tee /sys/bus/thunderbolt/devices/0-1/authorized
1
(base) user@eight:~$ sudo ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.4/0000:04:00.0/0000:05:01.0/0000:07:00.0/0000:08:01.0/0000:09:00.0 ==
modalias : pci:v000010DEd00000DD8sv000010DEsd0000084Abc03sc00i00
vendor   : NVIDIA Corporation
model    : GF106GL [Quadro 2000]
manual_install: True
driver   : nvidia-driver-390 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

just an old card…

but EEK

(base) user@eight:~$ lspci | tail
08:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] (rev 06)
09:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GF106 High Definition Audio Controller (rev a1)
 
$ sudo dmesg
[ 1041.053826] nvidia: module license 'NVIDIA' taints kernel.
[ 1041.053831] Disabling lock debugging due to kernel taint
[ 1041.484017] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 1041.484032] NVRM: The NVIDIA Quadro 2000 GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 535.43.02 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[ 1041.501047] NVRM: No NVIDIA GPU found.
[ 1041.521176] nvidia-nvlink: Unregistered Nvlink Core, major device number 509
[ 1042.332830] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 1042.332842] NVRM: The NVIDIA Quadro 2000 GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 535.43.02 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[ 1042.335282] NVRM: No NVIDIA GPU found.
[ 1042.335835] nvidia-nvlink: Unregistered Nvlink Core, major device number 509

WE ARE TAINTED
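
The NVRM message itself says which legacy branch the card needs. A rough sketch that pulls that number out of the kernel log; `nvrm_legacy_branch` is a name made up for this note, and the pattern only matches the exact wording seen in the dmesg output above.

```shell
# Read kernel log text on stdin and print the legacy driver branch
# (e.g. 390) named in the NVRM "supported through" message, once.
nvrm_legacy_branch() {
    sed -n 's/.*supported through the NVIDIA \([0-9][0-9]*\)\.xx Legacy drivers.*/\1/p' | sort -u
}
```

Usage would be `sudo dmesg | nvrm_legacy_branch`; it prints nothing when the loaded driver does not complain about the card.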

driver

We went with the driver that ubuntu selected,

though NVIDIA's own driver list is cute: https://www.nvidia.com/en-us/drivers/unix/

$ sudo apt install nvidia-headless-535
 
 
#downgrade nvidia to quadro supported version
sudo apt install nvidia-headless-390
 
# EEK
ERROR (dkms apport): kernel package linux-headers-6.3.7-060307-generic is not supported
Error! Bad return status for module build on kernel: 6.3.7-060307-generic (x86_64)
Consult /var/lib/dkms/nvidia/390.157/build/make.log for more information.
dpkg: error processing package nvidia-dkms-390 (--configure):
 installed nvidia-dkms-390 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-headless-390:
 nvidia-headless-390 depends on nvidia-dkms-390; however:
  Package nvidia-dkms-390 is not configured yet.
 
dpkg: error processing package nvidia-headless-390 (--configure):
 dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.36-0ubuntu4) ...
No apport report written because the error message indicates its a followup error from a previous failure.
/sbin/ldconfig.real: /lib/libndi.so.4 is not a symbolic link
 
Processing triggers for man-db (2.10.2-2) ...
Processing triggers for initramfs-tools (0.140ubuntu17) ...
update-initramfs: Generating /boot/initrd.img-6.3.7-060307-generic
Errors were encountered while processing:
 nvidia-dkms-390
 nvidia-headless-390

Downgrading, but only to the headless package.
Does that avoid touching the X config?

Following the Avalon README:

[1] A 3D video game environment and benchmark designed from scratch for reinforcement learning research

conda create -n avalon python=3.9
conda activate avalon
 
sudo apt install --no-install-recommends libegl-dev libglew-dev libglfw3-dev libnvidia-gl libopengl-dev libosmesa6 mesa-utils-extra
 
#this will also install torch...
pip install avalon-rl[train] 
 
python -m avalon.install_godot_binary
python -m avalon.common.check_install

Why even bother, the Quadro is just a test.
Need to cleanly remove the 390 driver and move back to the current NVIDIA driver.
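
A sketch for finding what is left of the 390 branch before purging it. It filters `dpkg -l` output; `nvidia_branch_packages` is a helper name invented here, and the name pattern (anything containing "nvidia" and ending in the branch number) is an assumption about how the Ubuntu packages are named.

```shell
# Read `dpkg -l` output on stdin and print the packages of one NVIDIA
# driver branch ($1, e.g. 390) that are installed or half-installed.
nvidia_branch_packages() {
    branch="$1"
    # $1 of each dpkg line is the status (ii, iU, iF, rc, ...),
    # $2 is the package name, possibly with an :arch suffix.
    awk -v b="$branch" '$1 ~ /^i/ && $2 ~ ("nvidia.*-" b "(:|$)") { print $2 }'
}
```

`dpkg -l | nvidia_branch_packages 390` gives the list to review; after checking it, the names can be handed to `sudo apt purge`.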

P40

https://github.com/JingShing/How-to-use-tesla-p40

installing

sudo apt install nvidia-headless-535

There is some issue: unlike with the other cards we tried (Quadro 2000 and 660 Ti), the blue LED doesn't turn green on thunderbolt connection.
No power is passing to the GPU.

:(

1080Ti

looks legit

$ sudo dmesg -w
[96236.873213] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
 
[96236.874544] nvidia 0000:09:00.0: enabling device (0006 -> 0007)
[96236.874646] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[96236.991272] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.43.02  Mon May 22 20:46:13 UTC 2023
[96237.009537] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.43.02  Mon May 22 20:25:24 UTC 2023
[96237.013346] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[96238.239429] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 1
[96238.269008] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[96238.330257] nvidia-uvm: Loaded the UVM driver, major device number 507.
[96238.399348] NVRM: API mismatch: the client has the version 390.157, but
               NVRM: this kernel module has the version 535.43.02.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
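
To see the two mismatched versions at a glance, a sketch that extracts them from the log. `nvrm_mismatch_versions` is a name made up for this note and the patterns assume the exact NVRM wording shown above.

```shell
# Read kernel log text on stdin and print the userspace (client) and
# kernel module versions from an NVRM "API mismatch" message.
nvrm_mismatch_versions() {
    sed -n \
        -e 's/.*the client has the version \([0-9.]*[0-9]\).*/client=\1/p' \
        -e 's/.*kernel module has the version \([0-9.]*[0-9]\).*/module=\1/p'
}
```

With the log above, `sudo dmesg | nvrm_mismatch_versions` should report client 390.157 against module 535.43.02, i.e. leftover 390 userspace libraries talking to the 535 kernel module.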

The userspace client is still 390.157 while the kernel module is 535.43.02, so update the driver packages to match:

$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.4/0000:04:00.0/0000:05:01.0/0000:07:00.0/0000:08:01.0/0000:09:00.0 ==
modalias : pci:v000010DEd00001B06sv00001458sd0000377Abc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [GeForce GTX 1080 Ti]
manual_install: True
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-535 - third-party non-free recommended
driver   : nvidia-driver-515 - distro non-free
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-530 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
 
$ sudo ubuntu-drivers autoinstall
The following additional packages will be installed:
  libnvidia-common-535 libnvidia-compute-535:i386
  libnvidia-decode-535 libnvidia-decode-535:i386
  libnvidia-encode-535 libnvidia-encode-535:i386
  libnvidia-extra-535 libnvidia-fbc1-535 libnvidia-fbc1-535:i386
  libnvidia-gl-535 libnvidia-gl-535:i386 nvidia-prime
  nvidia-settings nvidia-utils-535 screen-resolution-extra
  xserver-xorg-video-nvidia-535
The following packages will be REMOVED:
  libnvidia-common-390 libnvidia-gl-390
The following NEW packages will be installed:
  libnvidia-common-535 libnvidia-compute-535:i386
  libnvidia-decode-535 libnvidia-decode-535:i386
  libnvidia-encode-535 libnvidia-encode-535:i386
  libnvidia-extra-535 libnvidia-fbc1-535 libnvidia-fbc1-535:i386
  libnvidia-gl-535 libnvidia-gl-535:i386 nvidia-driver-535
  nvidia-prime nvidia-settings nvidia-utils-535
  screen-resolution-extra xserver-xorg-video-nvidia-535
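
Before letting autoinstall loose, the recommended package can be picked out of the `ubuntu-drivers devices` listing. A small sketch; `recommended_driver` is a helper name invented here, and it relies on the line format shown above (the word "recommended" at the end of a "driver" line).

```shell
# Read `ubuntu-drivers devices` output on stdin and print the package
# name of every driver line marked as recommended.
recommended_driver() {
    awk '$1 == "driver" && $NF == "recommended" { print $3 }'
}
```

For the 1080 Ti listing above, `ubuntu-drivers devices | recommended_driver` would give nvidia-driver-535; for the Quadro 2000 it would give nvidia-driver-390.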

misc

$ lspci -v | grep -A 2 -E "(VGA comp|3D)"
00:02.0 VGA compatible controller: Intel Corporation Iris Pro Graphics 580 (rev 09) (prog-if 00 [VGA controller])
	DeviceName:  CPU
	Subsystem: Intel Corporation Iris Pro Graphics 580
--
09:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GF106GL [Quadro 2000]
	Flags: bus master, fast devsel, latency 0

Power comes from the 12 V DC plug (150 W?)
https://www.reddit.com/r/eGPU/comments/ukqto9/comment/ige1rwv

https://egpu.io/forums/thunderbolt-linux-setup/

tamiwiki/projects/egpu.1686950832.txt.gz · Last modified: 2023/06/17 00:27 by yair