This isn't necessarily a Linux problem, but I'll ask it here anyway. I'm using a workstation mainly for training deep learning and machine learning models. I run training code on both the CPU and the GPU.
CPU: AMD Ryzen 9 5950X 16-Core Processor
GPU: NVIDIA GeForce RTX 3090
OS: Ubuntu 22.04 LTS
The libraries that I use (PyTorch, XGBoost, LightGBM, etc.) use swap memory a lot for data loading. When I work on big datasets, swap usage grows slowly until it exceeds the limit (2 GB). When that happens, all of the cores go crazy and the CPU overheats. The workstation shuts itself down a couple of seconds later.
I'm a data scientist and I'm not good with hardware. It took me a couple of weeks to figure out why my workstation kept shutting itself down. I have to find a way to prevent this, since I can't make progress on my own tasks anymore. What are your suggestions?
To give you more details, this wasn't happening 3-4 months ago. It started very recently.
Edit: Added nvidia-smi and sensors outputs, captured while training two models (UNet and YOLOv6) simultaneously.
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0A:00.0 Off |                  N/A |
|100%   79C    P2   338W / 350W |  14171MiB / 24576MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1361      G   /usr/lib/xorg/Xorg                 56MiB |
|    0   N/A  N/A      1568      G   /usr/bin/gnome-shell               10MiB |
|    0   N/A  N/A     27955      C   python                           2743MiB |
|    0   N/A  N/A     31692      C   python                          11355MiB |
+-----------------------------------------------------------------------------+
sensors
nvme-pci-0300
Adapter: PCI adapter
Composite:    +74.8°C  (low  = -273.1°C, high = +84.8°C)
                       (crit = +84.8°C)
Sensor 1:     +74.8°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +74.8°C  (low  = -273.1°C, high = +65261.8°C)

iwlwifi_1-virtual-0
Adapter: Virtual device
temp1:        +57.0°C

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:         +87.8°C
Tccd1:        +89.2°C
Tccd2:        +79.5°C
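
In case it helps with diagnosis, here is a minimal sketch of how swap usage and the CPU temperature could be logged side by side during training, so that the moment swap fills up can be lined up with the temperature readings. It assumes the Linux /proc/meminfo format and the plain-text `sensors` layout shown above. The function names (`parse_meminfo`, `swap_used_kb`, `parse_temperatures`, `monitor`) are made up for illustration and are not part of any library.

```python
import re
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, in kB (SwapTotal minus SwapFree)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


def parse_temperatures(sensors_text):
    """Return {label: degrees C} from lines shaped like 'Tctl:  +87.8°C'.

    Only the first reading on each line is kept. If two chips share a
    label (for example 'temp1'), the later one overwrites the earlier one.
    """
    temps = {}
    for line in sensors_text.splitlines():
        match = re.match(r"\s*([^:]+):\s+\+?(-?\d+(?:\.\d+)?)°C", line)
        if match:
            temps[match.group(1).strip()] = float(match.group(2))
    return temps


def monitor(interval_s=5.0):
    """Print one line per interval until interrupted with Ctrl+C."""
    while True:
        with open("/proc/meminfo") as f:
            meminfo = parse_meminfo(f.read())
        # Assumes the lm-sensors `sensors` command is installed and on PATH.
        result = subprocess.run(["sensors"], capture_output=True, text=True)
        temps = parse_temperatures(result.stdout)
        print(
            time.strftime("%H:%M:%S"),
            f"swap_used={swap_used_kb(meminfo) / 1024:.0f}MiB",
            f"Tctl={temps.get('Tctl')}",
            flush=True,
        )
        time.sleep(interval_s)
```

Calling `monitor()` in a second terminal while a training job runs (and redirecting the output to a file) would leave a record of the last readings before a shutdown.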