Overview

Riemann is a GPU node equipped with 6 NVIDIA L40S GPUs, 48 cores (96 threads), and 1 TB of RAM. It is dedicated exclusively to the DSH unit and operates outside of the SLURM scheduler. The primary purpose of the Riemann node is to facilitate the development and debugging of code, since users can ssh directly into the machine; it is not meant for long training jobs. For guidelines on when to use Riemann, please refer to the usage policies.

How to connect

To connect to Riemann, please refer to the “Connecting to the Cluster” section in the quick-start guide. The steps are identical to those for connecting to the main cluster, except that you need to substitute the machine's DNS name with Riemann's. Use ssh username@abacus-riemann.fbk.eu to connect.
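For convenience, you can also add a host alias to your SSH configuration. A minimal sketch, where jdoe is a placeholder for your own username:

    # ~/.ssh/config (hypothetical entry; replace jdoe with your username)
    Host riemann
        HostName abacus-riemann.fbk.eu
        User jdoe

    # afterwards you can connect with just:
    ssh riemann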

Containers

To understand how containers and enroot work, you should first visit the dedicated page.

When using Riemann you will have to manually execute some of the actions that are automated by Pyxis on the SLURM cluster. These actions are described below; a consolidated sketch of the full workflow follows the list.

How to build and use containers

  1. Import the image (e.g., CUDA 12.1 with PyTorch 2.4.0):

    Go to your project directory on the storage, then:

    enroot import 'docker://pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime'
    

    You can also use an image that is already available: just provide the path to its .sqsh file in the next step.

  2. Build the container:

    enroot create --name pytorch_2-4_cuda_12-1 pytorch+pytorch+2.4.0-cuda12.1-cudnn9-runtime.sqsh
    

    This will create a container named pytorch_2-4_cuda_12-1, which you can then see with the command enroot list.

  3. Start the container:

    You can then run the container with the enroot start command; see all the available options with enroot start --help:

    enroot start pytorch_2-4_cuda_12-1
    

    This will log you into the container's terminal. Type exit or press Ctrl+D to exit.

  4. Remove the container:

    Until you remove a container, it remains on the system. This can be convenient during rapid development, but it can also quickly lead to excessive disk usage if containers are left on the system "just in case." To manage this, you should save your container as an image (see instructions below) and store it in the designated storage area (refer to the usage policy for more details). After saving the container, you can remove it using the following command:

    enroot remove pytorch_2-4_cuda_12-1
    

    <aside> 💡 IMPORTANT: To prevent the accumulation of unused containers on Riemann, all containers are automatically removed one month after their creation. However, it is still your responsibility to manage your containers proactively to avoid unpleasant surprises. Be sure to save your work regularly and remove containers when they are no longer needed.

    </aside>
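Putting the four steps together, here is a minimal sketch of the full workflow, using the same image as above (the project directory /storage/DSH/demo is a hypothetical example):

    # go to your project directory on the storage (hypothetical path)
    cd /storage/DSH/demo

    # 1. import the image from Docker Hub (produces a .sqsh file)
    enroot import 'docker://pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime'

    # 2. create the container from the .sqsh image
    enroot create --name pytorch_2-4_cuda_12-1 pytorch+pytorch+2.4.0-cuda12.1-cudnn9-runtime.sqsh

    # 3. start an interactive shell inside the container
    enroot start pytorch_2-4_cuda_12-1

    # 4. once your work is saved, remove the container
    enroot remove pytorch_2-4_cuda_12-1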

Working with containers

Here is how to start a container in a way that is more useful for your daily work; a sketch showing how to run a one-off command inside the container follows the list.

  1. Mount volumes:

    The --mount host_path:container_path option makes a directory of the host available inside the container, and --root gives you root privileges inside it (useful for installing packages in the next step):

    enroot start --root --mount /storage/DSH/demo:/my_data pytorch_2-4_cuda_12-1

  2. Modify the container:

    # update the package index and install curl (requires --root)
    apt update && apt upgrade -y
    apt install -y curl
    
    
    # install Python packages in the default conda environment
    conda activate
    conda install pandas
    

    Now exit from your container.

  3. Save your modified container:

    enroot export --output my_container_image.sqsh pytorch_2-4_cuda_12-1
    

    This saves your modified container as an image that you can then share with anybody.

    <aside> 💡 Note: To see how you can attach to the container using VS Code, please read the guide on using VS Code within containers.

    </aside>
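Besides opening an interactive shell, enroot start also accepts a command to run inside the container, which is convenient for quick, one-off debugging runs. A minimal sketch, where train.py is a hypothetical script inside the mounted directory:

    # run a single command inside the container and exit
    enroot start --root --mount /storage/DSH/demo:/my_data \
        pytorch_2-4_cuda_12-1 python /my_data/train.py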

Data management

Like all other nodes in the cluster, the Riemann node has access to the main storage (Laplace), which is mounted on /storage. This setup facilitates data transfer among nodes and should be used as the primary location for saving and storing data, code, container images, and results. Additionally, Riemann features a fast local NVMe disk for high-speed I/O, mounted at /mnt/md0/data, just like the other cluster nodes. The contents of this disk are temporary and are deleted one week after creation.

As with the cluster nodes, it is recommended to mount both your project directory on /storage/project_dir and /mnt/md0/data inside your container (see above for how to do it). You should copy the data from the main storage to the local disk, work with it there, and then save the results back to the main storage. For a detailed explanation, please refer to the usage policies, specifically the section on data management.
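A minimal sketch of that workflow, assuming a hypothetical project directory /storage/DSH/demo and dataset my_dataset:

    # stage the data on the fast local NVMe disk
    mkdir -p /mnt/md0/data/$USER
    cp -r /storage/DSH/demo/my_dataset /mnt/md0/data/$USER/

    # start the container with both locations mounted inside
    enroot start --root \
        --mount /storage/DSH/demo:/my_project \
        --mount /mnt/md0/data/$USER:/my_data \
        pytorch_2-4_cuda_12-1

    # ...work inside the container, writing results under /my_data...

    # after exiting, copy the results back to the main storage
    cp -r /mnt/md0/data/$USER/results /storage/DSH/demo/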

<aside> 💡 Important:

To prevent disk space overflow on the Riemann node, each user is limited to a maximum of 300GB on /mnt/md0 and 30GB in their home directory. Please manage your storage usage accordingly to avoid exceeding these limits. In special cases, additional storage allowance can be granted upon request; please contact the system administrators if you require more space.

</aside>
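To check how close you are to these limits, a quick sketch (the per-user directory under /mnt/md0/data is an assumption; adjust it to wherever you keep your files):

    # size of your home directory (30GB limit)
    du -sh ~

    # size of your data on the local NVMe disk (300GB limit)
    du -sh /mnt/md0/data/$USER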