Cluster quick-start

FAQs

Usage policies

Slurm

Enroot

Using VS code within containers

Riemann

Introduction

Abacus is the deep-learning-oriented cluster infrastructure of the Data Science for Health (DSH) lab at the Bruno Kessler Foundation. The cluster was assembled to process and analyze large medical images with deep learning models, and its architecture is strongly shaped by that purpose. This documentation introduces and explains the key concepts, technologies, and technical details you need in order to use the system. But first, let’s look at an overview of the architecture and how we expect you, as a researcher, to use it.

<aside> 💡 IMPORTANT: This setup was designed by researchers rather than professional IT personnel, with the primary goal of providing an easy-to-use experience while maintaining optimal system performance. If you encounter any component that you believe is not designed or functioning optimally, please do not hesitate to contact the admin staff to discuss potential improvements. Likewise, if you find that any of our policies are too restrictive for your project, feel free to reach out to us. Our goal is to support your ability to conduct great research. For questions or problems related to using the cluster, you can write to the system administrators on Teams or via email: [email protected] or [email protected]

</aside>


<aside> 💡 TERMINOLOGY:

- Abacus: all the resources of the system
- Cluster: the SLURM system (GPU + login)
- Storage: laplace, the storage mounted on each server
- GPU servers: any GPU-capable server

</aside>

At the time of writing (August 12, 2024), Abacus is composed of the following machines:

| Server name | Purpose | GPU type | Number of GPUs | Disk space (type) | RAM (GB) | Brand | Number of cores (threads) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gauss | compute | H100 SXM | 8 | 20TB (NVMe) | 2048 | Dell | 104 (104) |
| maryam | compute | L40s | 8 | 14TB (NVMe) | 1024 | Lenovo | 48 (48) |
| fermi | compute | L40s | 8 | 14TB (NVMe) | 1024 | Lenovo | 48 (48) |
| laplace | storage-fast | NA | NA | 120TB (SSD) | 256 | Dell | - |
| fourier | storage-backup | NA | NA | 120TB (SSD) | 32 | Dell | - |
| beppe | virtualization | NA | NA | 14TB (SSD) | 1024 | Dell | 64 |
| riemann | compute | L40s | 6 | 11TB (NVMe) | 1024 | Lenovo | 48 (96) |
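
As a quick way to reason about the capacity listed above, the inventory can be written down as a small data structure. This is only an illustrative sketch: the `Server` class and the helper functions are hypothetical names introduced here and are not part of any Abacus tooling. The values are copied from the table as of August 12, 2024, with "NA" GPU counts stored as 0.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Server:
    """One row of the inventory table (hypothetical helper type)."""
    name: str
    purpose: str
    gpu_type: Optional[str]  # None where the table says "NA"
    gpus: int                # 0 where the table says "NA"
    ram_gb: int


# Values copied from the inventory table (August 12, 2024).
SERVERS: List[Server] = [
    Server("gauss", "compute", "H100 SXM", 8, 2048),
    Server("maryam", "compute", "L40s", 8, 1024),
    Server("fermi", "compute", "L40s", 8, 1024),
    Server("laplace", "storage-fast", None, 0, 256),
    Server("fourier", "storage-backup", None, 0, 32),
    Server("beppe", "virtualization", None, 0, 1024),
    Server("riemann", "compute", "L40s", 6, 1024),
]


def gpu_servers(servers: List[Server]) -> List[Server]:
    """Return the machines that count as 'GPU servers' (at least one GPU)."""
    return [s for s in servers if s.gpus > 0]


def gpus_by_type(servers: List[Server]) -> Dict[str, int]:
    """Sum the number of GPUs per GPU model across all GPU servers."""
    totals: Dict[str, int] = {}
    for s in gpu_servers(servers):
        totals[s.gpu_type] = totals.get(s.gpu_type, 0) + s.gpus
    return totals


if __name__ == "__main__":
    print([s.name for s in gpu_servers(SERVERS)])
    print(gpus_by_type(SERVERS))
    print(sum(s.gpus for s in SERVERS))
```

Under these figures, four machines are GPU servers, with 8 H100 SXM and 22 L40s GPUs, 30 in total.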

As of August 12, 2024, the resources are divided into the following components:

<aside> 💡 IMPORTANT: Data in the storage is backed up; data stored on the individual servers is not. It’s crucial to ensure that your data is securely stored and backed up before starting any project.

</aside>


How do we expect users to use Abacus resources?

Let’s look at a generic project’s workflow, and see the quick-start for a taste of how you can carry it out on Abacus: