
From Bottleneck to Beast: How I Supercharged My Deep Learning Rig

  • Writer: Istvan Benedek
  • 4 days ago
  • 2 min read

Updated: 2 days ago


[Image: a dancing horse]

Like many working in AI and deep learning, I’m constantly testing the limits of my hardware. Speed is everything—especially when you're experimenting with large datasets and iterating over models with fine-grained control. For a while, I was using an external GPU (eGPU) connected via Thunderbolt 4, thinking it would be a sleek and flexible solution for model training.


Reality Check: Thunderbolt Isn’t Magic


Although Thunderbolt 4 offers a theoretical maximum bandwidth of 40 Gbps, it soon became evident that this ceiling was a bottleneck for my workload.


The realization came unexpectedly while reviewing the technical specifications of the new MacBook Pro, which features Thunderbolt 5 support, capable of delivering up to 120 Gbps with Bandwidth Boost. This significant jump in throughput underscored the limitations of my current system, which is constrained to Thunderbolt 4.


Every second of latency adds up when you're training across 250,000 images ((384×384×3 + 8) × 250,000 = 110.594 GB), over multiple epochs, while trying to fine-tune interpolative regressors under monotonic constraints. The conclusion was clear: I needed to ditch the eGPU setup and go internal.
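As a sanity check, the dataset-size arithmetic above can be reproduced in a few lines (the 8 extra bytes per image are assumed here to be per-image label/metadata overhead, as implied by the formula in the text):

```python
# Back-of-the-envelope dataset size, matching the formula in the text:
# (384*384*3 + 8) bytes per image, times 250,000 images.
bytes_per_image = 384 * 384 * 3 + 8   # raw RGB pixels + 8 bytes overhead
n_images = 250_000

total_bytes = bytes_per_image * n_images
print(f"{total_bytes / 1e9:.3f} GB")  # 110.594 GB
```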



The Upgrade: PCIe Power


So, I took the GPU out of its Gigabyte Gaming Box enclosure and installed it directly into my desktop via PCIe. That move immediately unleashed its full potential: a PCIe 4.0 x16 slot on a modern desktop offers roughly 256 Gbps (and PCIe 5.0 doubles that to over 500 Gbps), several times the bandwidth Thunderbolt can deliver.
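To put those link speeds in perspective, here is a rough estimate of how long it would take to move the ~110.6 GB dataset over each interface, using nominal link rates and ignoring protocol overhead (real sustained throughput is lower on both):

```python
# Rough transfer time for the ~110.594 GB dataset over each link.
# Nominal link rates only; protocol overhead is ignored.
dataset_gbit = 110.594 * 8  # GB -> Gbit

for name, gbps in [("Thunderbolt 4", 40), ("PCIe 4.0 x16 (approx.)", 256)]:
    seconds = dataset_gbit / gbps
    print(f"{name}: {seconds:.1f} s per full pass over the data")
```

Even this crude estimate shows a full pass over the images shrinking from tens of seconds to a few seconds of pure transfer time, and that difference compounds across every epoch.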

The performance boost was undeniable:

  • Training time dropped significantly (to be honest, I started using a RAM drive as well...)

  • Inference latency shrank

  • GPU utilization finally hit peak levels; I had never seen my NVIDIA RTX 4090 draw 380 W before...



Memory Matters


I also upgraded my system’s RAM to a massive 192 GB (though I can already see the limitation).


Now, I can:

  • load my training and validation images onto a RAM drive

  • start caching tensors for smaller datasets; at ~1.6 MB per image stored as a tensor, that limits me to roughly 50,000 training and 15,000 validation images in memory

  • maybe consider running several trainings simultaneously in the future to maximize GPU utilization, since with small models I still have spare GPU memory and compute capacity...
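The in-memory budget from the last two bullets can be sketched as follows. The 1.6 MB-per-tensor figure comes from the text; the headroom reserved for the OS, the RAM drive, and the training process is an assumption for illustration:

```python
# How many per-image tensors fit in RAM, using the figures from the text.
ram_gb = 192
reserve_gb = 60        # assumed headroom: OS + RAM drive + training process
mb_per_tensor = 1.6    # per-image tensor size stated in the text

budget_mb = (ram_gb - reserve_gb) * 1000
capacity = int(budget_mb / mb_per_tensor)
print(capacity)  # 82500 tensors -> enough for 50k train + 15k val images
```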




© 2023 by Istvan Benedek
