EGPU

We are using the TH3P4G3 external Thunderbolt eGPU dock.

Linux kernel notes: https://docs.kernel.org/admin-guide/thunderbolt.html

TLDR

  1. upgrade kernel (??)
  2. install gfx (nvidia|amd) drivers
  3. plug card
  4. reboot
  5. trust thunderbolt
The authorized attribute reads 0 which means no PCIe tunnels are created yet. The user can authorize the device by simply entering:

# echo 1 > /sys/bus/thunderbolt/devices/0-1/authorized

This will create the PCIe tunnels and the device is now connected.
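
To see which devices still read 0 before echoing into them, something like the following may help. It is only a sketch: it assumes the sysfs layout described in the kernel notes linked above (`authorized` and `device_name` attributes per device), and the optional directory argument is a hypothetical addition so the function can be tried against a fake tree.

```shell
#!/usr/bin/env bash
# tb_list_auth [DIR]: print each Thunderbolt device with its authorization state.
# DIR defaults to the real sysfs path; the argument exists only so the
# function can be pointed at a test directory (an assumption of this sketch).
tb_list_auth() {
    local root="${1:-/sys/bus/thunderbolt/devices}"
    local dev name
    for dev in "$root"/*; do
        # Entries without an 'authorized' attribute (e.g. domain0) are skipped.
        if [ ! -r "$dev/authorized" ]; then
            continue
        fi
        name="unknown"
        if [ -r "$dev/device_name" ]; then
            name=$(cat "$dev/device_name")
        fi
        printf '%s\t%s\tauthorized=%s\n' \
            "$(basename "$dev")" "$name" "$(cat "$dev/authorized")"
    done
}

tb_list_auth
```

Any line ending in `authorized=0` is a device whose PCIe tunnels have not been created yet.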

upgrade kernel

from mainline (the Ubuntu dist-upgrade is too conservative, it only gets you 5.19)

cd /tmp
rm -i *deb
 
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-headers-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-headers-6.3.7-060307_6.3.7-060307.202306090936_all.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-image-unsigned-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
wget -c   https://kernel.ubuntu.com/~kernel-ppa/mainline/v6.3.7/amd64/linux-modules-6.3.7-060307-generic_6.3.7-060307.202306090936_amd64.deb
 
sudo dpkg -i *.deb
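
After the reboot it is worth confirming the new kernel is the one actually running. A minimal sketch, assuming GNU `sort -V` is available; the 6.3.7 threshold is just the version installed above, not a known minimum for the eGPU, and `kernel_at_least` is a made-up helper name.

```shell
#!/usr/bin/env bash
# kernel_at_least MIN [CURRENT]: succeed if CURRENT (default: `uname -r`) is >= MIN.
kernel_at_least() {
    local min="$1"
    local cur="${2:-$(uname -r)}"
    cur="${cur%%-*}"   # drop the "-060307-generic" style suffix
    # sort -V orders version strings; MIN sorting first means CURRENT >= MIN.
    [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n1)" = "$min" ]
}

if kernel_at_least 6.3.7; then
    echo "running $(uname -r): mainline kernel is active"
else
    echo "running $(uname -r): still on the old kernel"
fi
```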

trust

hmm, you need to connect the eGPU before boot.
now for permissions:

$ sudo dmesg
dprobe" pid=563 comm="apparmor_parser"
[    7.888207] audit: type=1400 audit(1686781044.331:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=563 comm="apparmor_parser"

authorized the tamala!

(base) user@eight:~$ echo 1 | sudo tee /sys/bus/thunderbolt/devices/0-1/authorized
1
(base) user@eight:~$ sudo ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.4/0000:04:00.0/0000:05:01.0/0000:07:00.0/0000:08:01.0/0000:09:00.0 ==
modalias : pci:v000010DEd00000DD8sv000010DEsd0000084Abc03sc00i00
vendor   : NVIDIA Corporation
model    : GF106GL [Quadro 2000]
manual_install: True
driver   : nvidia-driver-390 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

just an old card…
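
To pull just the recommended package out of that listing, a small filter is enough. This is a sketch that assumes the output format shown above (a `driver : <package> - ... recommended` line); `recommended_driver` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# recommended_driver: read `ubuntu-drivers devices` output on stdin and print
# the package name from the line tagged "recommended".
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3 }'
}

# Usage (not run here):
#   ubuntu-drivers devices | recommended_driver
```

For the Quadro 2000 listing above this would print `nvidia-driver-390`.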

but EEK

(base) user@eight:~$ lspci | tail
08:04.0 PCI bridge: Intel Corporation JHL7540 Thunderbolt 3 Bridge [Titan Ridge DD 2018] (rev 06)
09:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
09:00.1 Audio device: NVIDIA Corporation GF106 High Definition Audio Controller (rev a1)
 
$ sudo dmesg
[ 1041.053826] nvidia: module license 'NVIDIA' taints kernel.
[ 1041.053831] Disabling lock debugging due to kernel taint
[ 1041.484017] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 1041.484032] NVRM: The NVIDIA Quadro 2000 GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 535.43.02 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[ 1041.501047] NVRM: No NVIDIA GPU found.
[ 1041.521176] nvidia-nvlink: Unregistered Nvlink Core, major device number 509
[ 1042.332830] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
[ 1042.332842] NVRM: The NVIDIA Quadro 2000 GPU installed in this system is
               NVRM:  supported through the NVIDIA 390.xx Legacy drivers. Please
               NVRM:  visit http://www.nvidia.com/object/unix.html for more
               NVRM:  information.  The 535.43.02 NVIDIA driver will ignore
               NVRM:  this GPU.  Continuing probe...
[ 1042.335282] NVRM: No NVIDIA GPU found.
[ 1042.335835] nvidia-nvlink: Unregistered Nvlink Core, major device number 509

WE ARE TAINTED

driver

we went with ubuntu selection

but cute https://www.nvidia.com/en-us/drivers/unix/

$ sudo apt install nvidia-headless-535
 
 
#downgrade nvidia to quadro supported version
sudo apt install nvidia-headless-390
 
# EEK
ERROR (dkms apport): kernel package linux-headers-6.3.7-060307-generic is not supported
Error! Bad return status for module build on kernel: 6.3.7-060307-generic (x86_64)
Consult /var/lib/dkms/nvidia/390.157/build/make.log for more information.
dpkg: error processing package nvidia-dkms-390 (--configure):
 installed nvidia-dkms-390 package post-installation script subprocess returned error exit status 10
dpkg: dependency problems prevent configuration of nvidia-headless-390:
 nvidia-headless-390 depends on nvidia-dkms-390; however:
  Package nvidia-dkms-390 is not configured yet.
 
dpkg: error processing package nvidia-headless-390 (--configure):
 dependency problems - leaving unconfigured
Processing triggers for libc-bin (2.36-0ubuntu4) ...
No apport report written because the error message indicates its a followup error from a previous failure.
/sbin/ldconfig.real: /lib/libndi.so.4 is not a symbolic link
 
Processing triggers for man-db (2.10.2-2) ...
Processing triggers for initramfs-tools (0.140ubuntu17) ...
update-initramfs: Generating /boot/initrd.img-6.3.7-060307-generic
Errors were encountered while processing:
 nvidia-dkms-390
 nvidia-headless-390

downgrading but to headless,
without touching the x config?

going with avalon readme

[1] A 3D video game environment and benchmark designed from scratch for reinforcement learning research

conda create -n avalon python=3.9
conda activate avalon
 
sudo apt install --no-install-recommends libegl-dev libglew-dev libglfw3-dev libnvidia-gl libopengl-dev libosmesa6 mesa-utils-extra
 
#this will also install torch...
pip install avalon-rl[train] 
 
python -m avalon.install_godot_binary
python -m avalon.common.check_install

why even bother, the Quadro is just a test.
need to cleanly remove the 390 driver and move back to

NVIDIA-CURRENT
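
One way to find what the 390 branch left behind, including the half-configured packages from the failed DKMS build, is to filter `dpkg -l`. A sketch only: it assumes the Ubuntu convention that driver packages end in `-<branch>`, and `nvidia_branch_packages` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# nvidia_branch_packages BRANCH: read `dpkg -l` output on stdin and print the
# nvidia packages for that branch whose desired state is "install".
# This also catches half-configured ones (status iF / iU), not just "ii".
nvidia_branch_packages() {
    local branch="$1"
    awk -v b="$branch" \
        '$1 ~ /^i/ && $2 ~ /nvidia/ && $2 ~ ("-" b "(:|$)") { print $2 }' |
        sort -u
}

# Usage (not run here; review the list before purging):
#   dpkg -l | nvidia_branch_packages 390
#   sudo apt purge $(dpkg -l | nvidia_branch_packages 390)
```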

P40

unlike other cards, the blue LED doesn't turn green on Thunderbolt connection.

NVIDIA Tesla P40 24GB GDDR5 GPU Accelerator Card Dual PCI-E 3.0 x16
needs to be retrofitted with a fan, it doesn't come with one

got one on eBay for $200 (+shipping) (ebay mirror)

some dude got it working, https://github.com/JingShing/How-to-use-tesla-p40

SPECIFICATIONS:

  • GPU Architecture: NVIDIA Pascal
  • Single-Precision Performance: 12 TeraFLOPS*
  • Integer Operations (INT8): 47 TOPS* (TeraOperations per Second)
  • GPU Memory: 24 GB
  • Memory Bandwidth: 346 GB/s
  • System Interface: PCI Express 3.0 x16
  • Form Factor: 4.4” H x 10.5” L, Dual Slot, Full Height
  • Max Power: 250 W
  • Enhanced Programmability with Page Migration Engine: Yes
  • ECC Protection: Yes
  • Server-Optimized for Data Center Deployment: Yes
  • Hardware-Accelerated Video Engine: 1x Decode Engine, 2x Encode Engine
  • NVPN: 699-2G610-0200-100
  • NVIDIA® CUDA® cores: 3840
installing

sudo apt install nvidia-headless-535

there is some issue: unlike other cards, the blue LED doesn't turn green on Thunderbolt connection.
no power is passing to the GPU.

:(

1080Ti

looks legit

$ sudo dmesg -w
[96236.873213] nvidia-nvlink: Nvlink Core is being initialized, major device number 509
 
[96236.874544] nvidia 0000:09:00.0: enabling device (0006 -> 0007)
[96236.874646] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[96236.991272] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.43.02  Mon May 22 20:46:13 UTC 2023
[96237.009537] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.43.02  Mon May 22 20:25:24 UTC 2023
[96237.013346] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[96238.239429] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 1
[96238.269008] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[96238.330257] nvidia-uvm: Loaded the UVM driver, major device number 507.
[96238.399348] NVRM: API mismatch: the client has the version 390.157, but
               NVRM: this kernel module has the version 535.43.02.  Please
               NVRM: make sure that this kernel module and all NVIDIA driver
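
The API mismatch means leftover 390 userspace bits are talking to the 535 kernel module. A rough check of what the module side reports, assuming `/proc/driver/nvidia/version` carries the same "Kernel Module  <version>" line as the dmesg output above; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# nvidia_module_version [FILE]: print the version of the loaded kernel module.
# FILE defaults to /proc/driver/nvidia/version; the argument is only for testing.
nvidia_module_version() {
    local f="${1:-/proc/driver/nvidia/version}"
    [ -r "$f" ] || return 1
    # Assumption: the first dotted number in the file is the module version,
    # e.g. 535.43.02.
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' "$f" | head -n1
}

# nvidia_client_matches CLIENT_VERSION [FILE]: succeed only on an exact match,
# since the NVRM message above complains about any difference.
nvidia_client_matches() {
    local module
    module=$(nvidia_module_version "${2:-}") || return 1
    [ "$1" = "$module" ]
}

# Usage (not run here):
#   nvidia_client_matches 390.157 || echo "client and kernel module differ"
```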

update the driver to fit

$ ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:1c.4/0000:04:00.0/0000:05:01.0/0000:07:00.0/0000:08:01.0/0000:09:00.0 ==
modalias : pci:v000010DEd00001B06sv00001458sd0000377Abc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [GeForce GTX 1080 Ti]
manual_install: True
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-510 - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-525-server - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-535 - third-party non-free recommended
driver   : nvidia-driver-515 - distro non-free
driver   : nvidia-driver-515-server - distro non-free
driver   : nvidia-driver-530 - distro non-free
driver   : nvidia-driver-470-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
 
$ sudo ubuntu-drivers autoinstall
The following additional packages will be installed:
  libnvidia-common-535 libnvidia-compute-535:i386
  libnvidia-decode-535 libnvidia-decode-535:i386
  libnvidia-encode-535 libnvidia-encode-535:i386
  libnvidia-extra-535 libnvidia-fbc1-535 libnvidia-fbc1-535:i386
  libnvidia-gl-535 libnvidia-gl-535:i386 nvidia-prime
  nvidia-settings nvidia-utils-535 screen-resolution-extra
  xserver-xorg-video-nvidia-535
The following packages will be REMOVED:
  libnvidia-common-390 libnvidia-gl-390
The following NEW packages will be installed:
  libnvidia-common-535 libnvidia-compute-535:i386
  libnvidia-decode-535 libnvidia-decode-535:i386
  libnvidia-encode-535 libnvidia-encode-535:i386
  libnvidia-extra-535 libnvidia-fbc1-535 libnvidia-fbc1-535:i386
  libnvidia-gl-535 libnvidia-gl-535:i386 nvidia-driver-535
  nvidia-prime nvidia-settings nvidia-utils-535
  screen-resolution-extra xserver-xorg-video-nvidia-535

misc

$ lspci -v | grep -A 2 -E "(VGA comp|3D)"
00:02.0 VGA compatible controller: Intel Corporation Iris Pro Graphics 580 (rev 09) (prog-if 00 [VGA controller])
	DeviceName:  CPU
	Subsystem: Intel Corporation Iris Pro Graphics 580
--
09:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation GF106GL [Quadro 2000]
	Flags: bus master, fast devsel, latency 0

power from 12v dc plug (150W?)
https://www.reddit.com/r/eGPU/comments/ukqto9/comment/ige1rwv

https://egpu.io/forums/thunderbolt-linux-setup/

tamiwiki/projects/egpu.1687102772.txt.gz · Last modified: 2023/06/18 18:39 by yair