Abacus user guide

Introduction

Abacus is the deep-learning-oriented cluster infrastructure of the Data Science for Health (DSH) lab at the Bruno Kessler Foundation. The cluster was assembled to process and analyze large medical images with deep learning models, and its architecture reflects that purpose. This documentation introduces the key concepts, technologies, and technical details you need to know in order to use the system. First, let’s have an overview of the architecture and of how we expect you, as a researcher, to use it.

<aside> 💡 IMPORTANT: This setup was designed by researchers, rather than professional IT personnel, with the primary goal of providing an easy-to-use experience while maintaining optimal system performance. If you encounter any component that you believe is not designed or functioning optimally, please do not hesitate to contact the admin staff to discuss potential improvements. Likewise, if you find that any of our policies are too restrictive for your project, feel free to reach out to us. Our goal is to support your ability to conduct great research. For questions or problems related to the usage of the cluster, you can write to the system administrators on Teams or by email: [email protected] or [email protected]

</aside>

<aside> 💡 TERMINOLOGY:
Abacus -> with this term we refer to all the resources of the system
Cluster -> with this term we refer to the SLURM system (GPU servers + login node)
Storage -> with this term we refer to laplace, the storage mounted on each server
GPU servers -> with this term we refer to any GPU-capable server

</aside>

At the time of writing (August 12, 2024), Abacus is composed of the following machines:

Server name	Purpose	GPU type	Number of GPUs	Disk space (type)	RAM (GB)	Brand	Number of cores (threads)
gauss	compute	H100 SXM	8	20TB (NVMe)	2048	Dell	104 (104)
maryam	compute	L40s	8	14TB (NVMe)	1024	Lenovo	48 (48)
fermi	compute	L40s	8	14TB (NVMe)	1024	Lenovo	48 (48)
laplace	storage-fast	NA	NA	120TB (SSD)	256	Dell	-
fourier	storage-backup	NA	NA	120TB (SSD)	32	Dell	-
beppe	virtualization	NA	NA	14TB (SSD)	1024	Dell	64
riemann	compute	L40s	6	11TB (NVMe)	1024	Lenovo	48 (96)

As of January 21, 2025, the resources are divided into the following components:

Interactive server: riemann (no longer available) ~~Users can log in directly to the server via ssh and execute code in a standard fashion~~
GPU cluster: gauss, maryam, fermi. These servers are not directly accessible to users; they are managed by a SLURM queue system for optimal resource allocation. To use them, users must first log in to the front-end node of the cluster and from there submit a job, which requests the necessary resources from the available servers. In a cluster configuration, each server within the cluster is referred to as a "node". Understanding this terminology is essential for navigating this guide effectively.
Shared storage: laplace, which is mounted on and visible from the interactive server as well as the nodes of the cluster.
Backup storage: fourier, which is used for backups.

<aside> 💡 IMPORTANT: Data in the storage is backed up; data stored on the individual servers is not. It’s crucial to ensure that your data is securely stored and backed up before starting any project.

</aside>

Virtual machines (VMs): One server is dedicated to hosting virtual machines, among which is the front-end of the cluster. This is the only VM that users need to know about. If you need a test environment that does not require GPU resources for a particular application, you can request a VM by contacting the admins of the DSH group.
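
To make the "log in to the front-end, then submit a job" idea more concrete, here is a minimal sketch in Python that only builds the text of a SLURM batch script. It is an illustration, not the official Abacus job template: the job name, the resource values, the `train.py` command, and the assumption that GPUs are requested with `--gres=gpu:N` are all hypothetical and may differ from the actual cluster configuration.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Resources requested for one job (all values used below are illustrative)."""
    name: str
    gpus: int
    cpus: int
    hours: int
    command: str


def render_batch_script(job: JobRequest) -> str:
    """Return the text of a SLURM batch script for the given request."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        # Assumption: GPUs are exposed as a generic resource named "gpu".
        f"#SBATCH --gres=gpu:{job.gpus}",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --time={job.hours:02d}:00:00",
        # %x expands to the job name and %j to the job id.
        "#SBATCH --output=%x-%j.out",
        "",
        job.command,
    ]
    return "\n".join(lines) + "\n"


# Hypothetical request: 1 GPU, 4 CPU cores, 2 hours.
script = render_batch_script(
    JobRequest("demo", gpus=1, cpus=4, hours=2, command="python train.py")
)
print(script)
```

If the printed text were saved on the front-end node as, say, `job.sh`, it would typically be submitted with `sbatch job.sh` and monitored with `squeue`. SLURM then decides which node runs the job, which is why users never log in to gauss, maryam, or fermi directly.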

How are we expecting users to use Abacus resources?

Let’s look at a generic project workflow and at the quick-start for a taste of how you can carry it out on Abacus: