1. Overview
The University of South Carolina High Performance Computing (HPC) clusters are available to researchers requiring specialized hardware resources for computational research applications.
For more information: https://sc.edu/about/offices_and_divisions/division_of_information_technology/rc/hpc_clusters/index.php
Here are the basic steps to successfully train your model on HPC.
2. How to use HPC
— Apply for an account
Submit your request here. After several hours (maybe several days), you will receive some emails about your request. Normally, you can find your account information in those emails. I strongly suggest you read these emails carefully. (PLEASE READ THEM CAREFULLY)
— Configure your environment
Here are some official instructions. Please read them carefully before you start using the cluster. It won't take much of your time.
- Log in to the server
You can use any SSH tool. Please replace the user information in the following command.
ssh -p 222 username@login.rci.sc.edu
Notes: a very important thing is DO NOT RUN YOUR CODE DIRECTLY ON THIS SERVER. EVERYTHING YOU WANT TO RUN OR TEST SHOULD BE SUBMITTED TO THE GPU CLUSTER USING SBATCH!!!
- Load modules
Read the instructions here to get familiar with loading modules.
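For example, the standard module commands look like this (the exact module names available on the cluster may differ):
module avail              # list all modules available on the cluster
module avail anaconda     # search for Anaconda-related modules
module list               # show the modules currently loaded in your session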
Load the Anaconda module:
module load python3/anaconda/2020.02
Notes: after you successfully load conda, please change the working and temporary paths to your /work/username folder, since you only have 25 GB of space under /home/username.
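A minimal sketch of one way to do this (the /work/username path comes from the note above; the specific environment variables are my suggestion, not an official HPC instruction):
export CONDA_PKGS_DIRS=/work/username/conda/pkgs    # conda package cache
export CONDA_ENVS_PATH=/work/username/conda/envs    # where new conda environments are created
export PIP_CACHE_DIR=/work/username/.cache/pip      # pip download cache
export TMPDIR=/work/username/tmp                    # temporary files
mkdir -p /work/username/conda/pkgs /work/username/conda/envs /work/username/.cache/pip /work/username/tmp
You can put these lines in your ~/.bashrc so they apply to every session.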
- Configure your personal environment
I think everyone should already be familiar with this part. You can use
conda create xxxx
to build your environment. Then, use pip to install the necessary libraries.
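For example (the environment name mmbox10.1 just matches the job script below; the Python version and packages are placeholders for whatever your project needs):
conda create --name mmbox10.1 python=3.8    # create the environment
source activate mmbox10.1                   # activate it
pip install numpy                           # then install the libraries you need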
— Submit your task
- Build a script
Create a test.sh file and write commands like the following. For instructions, read this.
#!/bin/sh
#SBATCH --job-name=finetune_1
#SBATCH -N 1                            ## Run on 1 node
#SBATCH -n 28                           ## Request 28 tasks (CPU cores)
#SBATCH --gres=gpu:1                    ## Run on 1 GPU
#SBATCH --output ./log/finetune%j.out   ## stdout log (the ./log folder must already exist)
#SBATCH --error ./log/finetune%j.err    ## stderr log
#SBATCH -p gpu-v100-16gb                ## Partition with 16 GB V100 GPUs
## Load your modules and run code here
date
module load cuda/11.3
module load python3/anaconda/2020.02
nvidia-smi
source activate mmbox10.1
python --version
python ./finetune.py > ./log/finetune.txt
conda deactivate
- Submit
Submit your task by running:
sbatch test.sh
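sbatch prints a job ID after submission. A couple of standard Slurm commands (replace username and the job ID with your own) help you track it:
squeue -u username     # status of your jobs: PD = pending, R = running
scancel <job_id>       # cancel a job if you submitted it by mistake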
3. Performance
The GPUs in the HPC clusters are Tesla V100s. In my experience, the performance is almost the same as our lab's GPU server.
But since it is a shared platform, you sometimes need to wait a long time for your code to execute.
4. Suggestions
— Storage and Priority
If your storage is not enough (a dataset may take a lot of space), you can send an email describing your situation to the person who sent you the account information. Hopefully, you will get an extra 1 TB (this worked for me).
Different users have different priorities. If you want your code to execute faster, you can send an email to request higher priority.
— Waiting List
Before you submit your job, you can check the waiting list and available nodes:
sinfo
squeue xxx # check the instructions I mentioned
Perhaps you can find available nodes.
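For example, to look only at the GPU partition used in the script above (the partition name comes from that job script; the flags are standard Slurm options):
sinfo -p gpu-v100-16gb     # node states in that partition (idle nodes can take jobs immediately)
squeue -p gpu-v100-16gb    # jobs waiting or running in that partition
squeue -u username         # only your own jobs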
5. Possible Problems
Sorry about this section. I had written up a lot of the problems I encountered, but my server ran into trouble and I lost everything.
If you encounter any problems, feel free to ask me for solutions.
And if you have solved these problems, you are welcome to leave your comments here. That will help others a lot. Thanks!
1. How to match your Python installation with the HPC's GPU driver. More precisely, you need to match the CUDA version of your Python installation with the CUDA version of the HPC GPU. For any HPC GPU resource you want to apply for, you can check its CUDA version by running nvidia-smi in the job script. I use PyTorch, which needs a specific CUDA version (print(torch.version.cuda) lists it). The two CUDA versions must be the same.
2. How to activate your specific Python environment within the job script. On HPC, after you install the target Python environment with which you run your job, you still need to activate it in the job script before you run any Python script. My method is "source .bashrc". That is also why there is a line in the job script that checks Python's version. A rough sketch covering both points is shown below.
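As a sketch of both points, the relevant lines in the job script could look like this (the environment name mmbox10.1 is just the one from the example script; this is an illustration, not an official recipe):
nvidia-smi                                             # shows the driver and the CUDA version it supports
source ~/.bashrc                                       # make conda available inside the batch job
source activate mmbox10.1                              # activate your environment
python --version                                       # confirm the right Python is active
python -c "import torch; print(torch.version.cuda)"    # CUDA version PyTorch was built with; should match the one above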
Hope this is useful for your research.