1. Overview
The University of South Carolina High Performance Computing (HPC) clusters are available to researchers requiring specialized hardware resources for computational research applications.
For more information: https://sc.edu/about/offices_and_divisions/division_of_information_technology/rc/hpc_clusters/index.php
Here are the basic steps to successfully train your model on HPC.
2. How to use HPC
— Apply for an account
Submit your request here. After several hours (maybe several days), you will receive some emails about your request. Normally, you can find your account information in those emails. I strongly suggest you read these emails carefully. (PLEASE READ THEM CAREFULLY)
— Configure your environment
Here are some official instructions. Please read them carefully before you start using the cluster. It won't take much of your time.
- Log in to the server
You can use any SSH tool. Please replace the user information in the following command.
ssh -p 222 username@login.rci.sc.edu
Notes: a very important thing is DO NOT RUN YOUR CODE DIRECTLY ON THIS SERVER. EVERYTHING YOU WANT TO RUN OR TEST SHOULD BE SUBMITTED TO THE GPU CLUSTER USING SBATCH!!!
- Load modules
Read the instructions here to get familiar with loading modules.
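For example, the standard module commands look like this (the exact module names available on the cluster may differ):
module avail              # list all modules available on the cluster
module avail anaconda     # search for Anaconda-related modules
module list               # show the modules currently loaded in your session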
Load the Anaconda module:
module load python3/anaconda/2020.02
Notes: after you successfully load conda, please change the working and temporary paths to your /work/username folder, since you only have 25 GB of space under /home/username.
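A minimal sketch of one way to do this (the /work/username path comes from the note above; the specific environment variables are my suggestion, not an official HPC instruction):
export CONDA_PKGS_DIRS=/work/username/conda/pkgs    # conda package cache
export CONDA_ENVS_PATH=/work/username/conda/envs    # where new conda environments are created
export PIP_CACHE_DIR=/work/username/.cache/pip      # pip download cache
export TMPDIR=/work/username/tmp                    # temporary files
mkdir -p /work/username/conda/pkgs /work/username/conda/envs /work/username/.cache/pip /work/username/tmp
You can put these lines in your ~/.bashrc so they apply to every session.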
- Configure your personal environment
I think everyone should already be familiar with this part. You can use
conda create xxxx
to build your environment. Then, use pip to install the necessary libraries.
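For example (the environment name mmbox10.1 just matches the job script below; the Python version and packages are placeholders for whatever your project needs):
conda create --name mmbox10.1 python=3.8    # create the environment
source activate mmbox10.1                   # activate it
pip install numpy                           # then install the libraries you need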
— Submit your task
- Build a script
Create a test.sh file and write commands like the following. For instructions, read this.
#!/bin/sh
#SBATCH --job-name=finetune_1
#SBATCH -N 1                            ## Run on 1 node
#SBATCH -n 28                           ## Request 28 tasks (CPU cores)
#SBATCH --gres=gpu:1                    ## Run on 1 GPU
#SBATCH --output ./log/finetune%j.out   ## stdout log (the ./log folder must already exist)
#SBATCH --error ./log/finetune%j.err    ## stderr log
#SBATCH -p gpu-v100-16gb                ## Partition with 16 GB V100 GPUs
## Load your modules and run code here
date
module load cuda/11.3
module load python3/anaconda/2020.02
nvidia-smi
source activate mmbox10.1
python --version
python ./finetune.py > ./log/finetune.txt
conda deactivate
- Submit
Submit your task by running:
sbatch test.sh
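sbatch prints a job ID after submission. A couple of standard Slurm commands (replace username and the job ID with your own) help you track it:
squeue -u username     # status of your jobs: PD = pending, R = running
scancel <job_id>       # cancel a job if you submitted it by mistake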
3. Performance
The GPUs in the HPC clusters are Tesla V100s. In my experience, the performance is almost the same as our lab's GPU server.
But since it is a shared platform, you sometimes need to wait a long time for your code to execute.
4. Suggestions
— Storage and Priority
If your storage is not enough (a dataset may take a lot of space), you can send an email describing your situation to the person who sent you the account information. Hopefully, you will get an extra 1 TB (this worked for me).
Different users have different priorities. If you want your code to execute faster, you can send an email to request higher priority.
— Waiting List
Before you submit your job, you can check the waiting list and available nodes:
sinfo
squeue xxx # check the instructions I mentioned
Perhaps you can find available nodes.
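For example, to look only at the GPU partition used in the script above (the partition name comes from that job script; the flags are standard Slurm options):
sinfo -p gpu-v100-16gb     # node states in that partition (idle nodes can take jobs immediately)
squeue -p gpu-v100-16gb    # jobs waiting or running in that partition
squeue -u username         # only your own jobs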
5. Possible Problems
Sorry about this section. I had written up a lot of the problems I encountered, but my server ran into trouble and I lost everything.
If you encounter any problems, feel free to ask me for solutions.
And if you have solved these problems, you are welcome to leave your comments here. That will help others a lot. Thanks!
1. How to match your Python installation with the HPC's GPU driver. More precisely, you need to match the CUDA version of your Python installation with the CUDA version of the HPC GPU. For any HPC GPU resource you want to apply for, you can check its CUDA version by running nvidia-smi in the job script. I use PyTorch, which needs a specific CUDA version (print(torch.version.cuda) lists it). The two CUDA versions must be the same.
2. How to activate your specific Python environment within the job script. On HPC, after you install the target Python environment with which you run your job, you still need to activate it in the job script before you run any Python script. My method is "source .bashrc". That is also why there is a line in the job script that checks Python's version. A rough sketch covering both points is shown below.
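As a sketch of both points, the relevant lines in the job script could look like this (the environment name mmbox10.1 is just the one from the example script; this is an illustration, not an official recipe):
nvidia-smi                                             # shows the driver and the CUDA version it supports
source ~/.bashrc                                       # make conda available inside the batch job
source activate mmbox10.1                              # activate your environment
python --version                                       # confirm the right Python is active
python -c "import torch; print(torch.version.cuda)"    # CUDA version PyTorch was built with; should match the one above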
Hope this is useful for your research.