Using VS Code within containers
Abacus is the deep-learning-oriented cluster infrastructure of the Data Science for Health (DSH) lab at the Bruno Kessler Foundation. The cluster was assembled to process and analyze large medical images with deep learning models, and its architecture reflects that purpose. This documentation introduces the key concepts, technologies, and technical details you need to know in order to use the system. But first, let’s have an overview of the architecture and of how we expect you, as a researcher, to use it.
<aside> 💡 IMPORTANT: This setup was designed by researchers rather than professional IT personnel, with the primary goal of providing an easy-to-use experience while maintaining optimal system performance. If you encounter any component that you believe is not designed or functioning optimally, please do not hesitate to contact the admin staff to discuss potential improvements. Likewise, if you find that any of our policies are too restrictive for your project, feel free to reach out to us. Our goal is to support your ability to conduct great research. For questions or problems related to using the cluster, you can write to the system administrators on Teams or by email: [email protected] or [email protected]
</aside>
<aside> 💡 TERMINOLOGY:
Abacus -> with this term we refer to all the resources of the system
Cluster -> with this term we refer to the SLURM system (GPU + login)
Storage -> with this term we refer to laplace, the storage mounted on each server
GPU servers -> with this term we refer to any GPU-capable server
</aside>
At the time of writing (August 12, 2024), Abacus is composed of the following machines:
Server name | Purpose | GPU type | Number of GPUs | Disk space (type) | RAM (GB) | Brand | Number of cores (threads) |
---|---|---|---|---|---|---|---|
gauss | compute | H100 SXM | 8 | 20TB (NVMe) | 2048 | Dell | 104 (104) |
maryam | compute | L40s | 8 | 14TB (NVMe) | 1024 | Lenovo | 48 (48) |
fermi | compute | L40s | 8 | 14TB (NVMe) | 1024 | Lenovo | 48 (48) |
laplace | storage-fast | NA | NA | 120TB (SSD) | 256 | Dell | - |
fourier | storage-backup | NA | NA | 120TB (SSD) | 32 | Dell | - |
beppe | virtualization | NA | NA | 14TB (SSD) | 1024 | Dell | 64 |
riemann | compute | L40s | 6 | 11TB (NVMe) | 1024 | Lenovo | 48 (96) |
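To get a feel for the total GPU capacity, the compute rows of the table can be tallied with a few lines of Python. This is only an illustrative sketch: the numbers are copied from the table above, while the names `COMPUTE_SERVERS` and `gpus_by_type` are made up for this example and are not part of any Abacus tooling.

```python
from collections import Counter

# (server name, GPU type, number of GPUs), copied from the compute rows
# of the table above. The variable name is hypothetical.
COMPUTE_SERVERS = [
    ("gauss", "H100 SXM", 8),
    ("maryam", "L40s", 8),
    ("fermi", "L40s", 8),
    ("riemann", "L40s", 6),
]


def gpus_by_type(servers):
    """Return the total number of GPUs for each GPU type."""
    totals = Counter()
    for _name, gpu_type, count in servers:
        totals[gpu_type] += count
    return dict(totals)


if __name__ == "__main__":
    totals = gpus_by_type(COMPUTE_SERVERS)
    print(totals)                # {'H100 SXM': 8, 'L40s': 22}
    print(sum(totals.values()))  # 30
```
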
As of August 12, 2024, the resources are divided into the following components:
<aside> 💡 IMPORTANT: Data in the storage is backed up; data stored on the individual servers is not. It’s crucial to ensure that your data is securely stored and backed up before starting any project.
</aside>
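One way to act on this is to copy results from a server's local disk to the storage once a run finishes. The sketch below is a minimal illustration, not an official Abacus procedure: the function name `copy_to_storage` is hypothetical, and the demo uses temporary stand-in directories because the real mount point of the storage and the local disk paths are not given here. Replace them with the actual paths on your server.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_storage(local_dir: Path, storage_dir: Path) -> Path:
    """Copy a results directory from server-local disk into the storage.

    Both paths are supplied by the caller; nothing here assumes a
    particular Abacus directory layout.
    """
    target = storage_dir / local_dir.name
    # dirs_exist_ok=True lets the copy be repeated as new files appear.
    shutil.copytree(local_dir, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Stand-in directories so the sketch runs anywhere (assumption:
    # on Abacus these would be the local disk and the storage mount).
    with tempfile.TemporaryDirectory() as tmp:
        local = Path(tmp) / "local_disk" / "experiment_01"
        storage = Path(tmp) / "storage_mount"
        local.mkdir(parents=True)
        (local / "metrics.txt").write_text("placeholder\n")

        copied = copy_to_storage(local, storage)
        print(sorted(p.name for p in copied.iterdir()))
```

Because the copy can be repeated safely, it can be run at the end of every job rather than only once at the end of a project.
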
How are we expecting users to use Abacus resources?
Let’s have a look at a generic project workflow, and see the quick-start for a taste of how you can carry it out on Abacus: