What is HPC?¶
High-Performance Computing (HPC) is the practice of aggregating computing power to deliver much higher performance than a typical desktop or laptop can offer. Researchers use HPC to tackle problems that are too large, too slow, or too complex for a single machine. Things like simulating climate models, analyzing genomic data, or training machine learning models are all typical HPC workloads.
If you've ever found yourself waiting hours (or days) for code to finish on your laptop, HPC is likely the solution.
Why Not Just Use a Faster Laptop?¶
Even the most powerful laptop has hard limits: a fixed number of CPU cores, a ceiling on memory, and limited storage. HPC systems let you go beyond these limits by connecting many machines together so they can work on a problem collaboratively.
There are two key architectural models to understand:
Shared memory is what you're used to. Your laptop has one pool of memory that all its CPU cores can access directly. This is simple and fast, but it doesn't scale beyond a single machine.
Distributed memory is the model used by clusters of computers working collaboratively. Many individual machines (called compute nodes) each have their own private memory. They communicate over a high-speed network, coordinating to solve problems that no single machine could handle alone. This is what makes it possible to run a computation across hundreds or even thousands of processors simultaneously.
Note
We can still leverage shared memory within an individual node of the cluster, letting all of that node's CPUs work together. A distributed memory programming model is only required once we scale out across multiple nodes.
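In practice, the two models show up in how you request resources from a scheduler like Slurm (introduced later in this article). The fragments below are a sketch, assuming a typical Slurm configuration; they are illustrative directives, not a complete job script:

```shell
# Shared memory: all cores on ONE node, a single process using threads
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16    # 16 cores sharing one node's memory

# Distributed memory: processes spread ACROSS nodes (e.g. with MPI)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8   # 32 processes, each with its own private memory
```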
```mermaid
graph LR
    subgraph "Your Laptop (Shared Memory)"
        CPU1[Core 1] --> RAM[Shared RAM]
        CPU2[Core 2] --> RAM
        CPU3[Core 3] --> RAM
        CPU4[Core 4] --> RAM
    end
```

```mermaid
graph TB
    subgraph "HPC Cluster (Distributed Memory)"
        direction TB
        subgraph Node1["Node 1"]
            C1[CPUs] --- M1[Memory]
        end
        subgraph Node2["Node 2"]
            C2[CPUs] --- M2[Memory]
        end
        subgraph Node3["Node 3"]
            C3[CPUs] --- M3[Memory]
        end
        Node1 <-->|"High-Speed Network"| Node2
        Node2 <-->|"High-Speed Network"| Node3
        Node1 <-->|"High-Speed Network"| Node3
    end
```

Storage: Where Your Data Lives¶
Beyond raw computing power, HPC workloads also depend on storage. Most clusters provide at least two tiers: high-speed scratch storage that compute nodes can read and write quickly during a job, and long-term storage for datasets and results that need to persist between jobs but don't require the same speed. Choosing the right storage tier for each stage of your workflow can make a big difference in both performance and cost. We cover this in detail in the Storage Fundamentals article.
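A common pattern is to stage data onto fast scratch storage before a job and copy the results back to long-term storage afterward. A minimal sketch, assuming hypothetical paths (substitute your cluster's actual scratch and long-term directories):

```shell
# Stage input data from long-term storage onto fast scratch (paths are hypothetical)
rsync -av /longterm/mylab/dataset/ /scratch/myuser/dataset/

# ... run your job against /scratch/myuser/dataset ...

# Copy results back to long-term storage when the job is done
rsync -av /scratch/myuser/results/ /longterm/mylab/results/
```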
Dartmouth's HPC Systems¶
Dartmouth provides several HPC systems, each suited to different kinds of work. Andes, Polaris, and Discovery are available to all researchers on campus, while the Babylon servers are operated by Thayer School of Engineering and Computer Science for their users.
Andes & Polaris (Shared Memory)¶
Andes and Polaris are large-scale shared memory systems available to all Dartmouth researchers. Unlike Discovery, they work more like a powerful version of your personal computer: you log in and run programs interactively, with all processors sharing the same memory space.
These systems are well suited for workloads that need large amounts of memory on a single machine, or for interactive data analysis and development. The limitation of these systems is that all resources are shared at all times. Your workloads might have to compete with other users' workloads for CPU time and memory, causing unnecessary slowdowns. Since they are stand-alone machines, you may also hit their resource ceiling.
Babylon (Thayer/CS — Shared Memory)¶
Thayer School of Engineering and Computer Science operate a set of shared memory compute servers named babylon1 through babylon12 (babylon1.thayer.dartmouth.edu, etc.). Like Andes and Polaris, these are standalone shared memory machines, but they are only available to Thayer and CS users. You can log in interactively via SSH.
The Babylon servers are a good stepping stone between your laptop and the full HPC cluster. They have faster processors and more memory than lab workstations, and they're convenient for engineering-specific software (MATLAB, Abaqus, Mathematica, etc.) that's pre-installed on the Thayer infrastructure. However, like Andes and Polaris, they are shared — your processes run alongside other users' work with no scheduler to guarantee dedicated resources.
Thayer/CS credentials
The Babylon servers authenticate using your NetID and password and are only accessible to Thayer and CS users. See the Thayer Linux documentation for full details.
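Logging in is a standard SSH session. For example, to reach the first Babylon server (replace `netid` with your own NetID):

```shell
# Connect to a Babylon server with your NetID and password
ssh netid@babylon1.thayer.dartmouth.edu
```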
Discovery (Campus-Wide HPC Cluster)¶
Discovery is Dartmouth's primary HPC cluster, available to all researchers on campus. It consists of many compute nodes connected by a high-speed network, and jobs are managed by the Slurm scheduler. You write a script describing what resources you need (CPUs, memory, time, GPUs), submit it, and Slurm runs it on the appropriate hardware when resources are available.
This is the system you'll use most often, and the one this cookbook focuses on.
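A minimal Slurm batch script might look like the sketch below. The script name, output file, and the `python analyze.py` command are placeholders for your own work; the `#SBATCH` directives are standard Slurm options:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis     # name shown in the queue
#SBATCH --cpus-per-task=4          # CPU cores for this job
#SBATCH --mem=8G                   # total memory
#SBATCH --time=02:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=my_analysis.out   # file where output is written

# Your actual computation goes here (placeholder command)
python analyze.py
```

You would submit this with `sbatch job.sh`, and Slurm runs it on a compute node once the requested resources are available.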
Which System Should I Use?¶
With multiple systems available, it's natural to wonder which one is right for your work. Here's a quick guide:
Use Andes or Polaris if your work fits on a single large machine and you want to run it interactively. These are good for exploratory data analysis, prototyping code, or running jobs that need a lot of memory but not a lot of scheduling overhead. You log in, run your program, and see the results in real time — much like working on your own computer, just with far more resources.
Use the Babylon servers if you're in Thayer or CS and need quick access to a more powerful machine for interactive work, especially if your workflow depends on engineering software available on the Thayer infrastructure.
Use Discovery if your work needs dedicated resources, GPUs, long runtimes, or the ability to run many jobs at once. Because Discovery uses a scheduler, your jobs get exclusive access to the resources you request — no competing with other users for CPU time. This is the right choice for production runs, batch processing, GPU-accelerated workloads like deep learning, and any workflow where you want to submit a job and walk away.
When in doubt, use Discovery. It's the most versatile system, it's open to everyone at Dartmouth, and it's the one this cookbook focuses on. The scheduler may feel like an extra step at first, but it ensures fair access for everyone and gives you predictable, reproducible performance.
| | Andes / Polaris | Babylon | Discovery |
|---|---|---|---|
| Best for | Large-memory single-machine tasks | Quick interactive work for Thayer/CS users | Batch jobs, GPU workloads, long-running or multi-job workflows |
| How you run work | Directly on the command line | Directly on the command line | Submit via the Slurm scheduler |
| Resource sharing | Shared with other users in real time | Shared with other users in real time | Dedicated resources for your job |
| Who can access | All Dartmouth users | Thayer and CS users | All Dartmouth users |
| GPUs | No | No | Yes |
| Scheduler | No | No | Yes (Slurm) |
How You Interact with an HPC Cluster¶
On your own computer, you simply start a program and it runs immediately using as many resources as it needs until it hits the system's limits.
Using an HPC cluster is different from using your personal computer. Here's the typical workflow:
- Connect — You log in to the cluster's login node over SSH from your own computer.
- Prepare — On the login node, you write scripts, transfer data, and set up your environment.
- Submit — You submit a job to the scheduler, describing what you want to run and what resources it needs.
- Wait — The scheduler finds available resources and runs your job. You don't need to stay connected.
- Collect results — When the job finishes, you retrieve the output files.
```mermaid
flowchart LR
    A["Your Computer"] -->|SSH| B["Login Node"]
    B -->|"sbatch job.sh"| C["Scheduler"]
    C --> D["Compute Nodes"]
    D -->|results| B
```

The login node is shared

When you connect to Discovery, you land on the login node, a shared gateway used by all users. It's appropriate for lightweight tasks like editing files and submitting or monitoring jobs. Do not run computationally intensive work on the login node; doing so can degrade performance for everyone. Use the scheduler to submit real work to the compute nodes.
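The whole workflow can be sketched as a short terminal session. The hostname and directory below are illustrative placeholders, and `slurm-<jobid>.out` is Slurm's default output file name:

```shell
# 1. Connect to the cluster's login node (hostname is illustrative)
ssh netid@discovery.dartmouth.edu

# 2. Prepare: move to your project and edit your job script
cd ~/my_project

# 3. Submit the job to the Slurm scheduler
sbatch job.sh

# 4. Wait: check on your job (it keeps running even if you log out)
squeue -u $USER

# 5. Collect results once the job has finished
cat slurm-*.out
```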
Key Terminology¶
New to HPC? Hover over technical terms to see a (hopefully) helpful definition. Check the glossary for the full list of terms like nodes, jobs, partitions, and more that you'll encounter throughout this cookbook.
What's Next?¶
Now that you have a sense of what HPC is and how Dartmouth's systems are organized, it's time to get hands-on:
- Request an account for our HPC systems
- Connect to our HPC systems via SSH
- Submit your first job with Slurm