What is HPC?¶
High-Performance Computing (HPC) is the practice of aggregating computing power to deliver much higher performance than a typical desktop or laptop can offer. Researchers use HPC to tackle problems that are too large, too slow, or too complex for a single machine. Things like simulating climate models, analyzing genomic data, or training machine learning models are all typical HPC workloads.
If you've ever found yourself waiting hours (or days) for code to finish on your laptop, HPC is likely the solution.
Why Not Just Use a Faster Laptop?¶
Even the most powerful laptop has hard limits: a fixed number of CPU cores, a ceiling on memory, and limited storage. HPC systems let you go beyond these limits by connecting many machines together so they can work on a problem collaboratively.
There are two key architectural models to understand:
Shared memory is what you're used to. Your laptop has one pool of memory that all its CPU cores can access directly. This is simple and fast, but it doesn't scale beyond a single machine.
Distributed memory is the model used by clusters of computers working collaboratively. Many individual machines (called compute nodes) each have their own private memory. They communicate over a high-speed network, coordinating to solve problems that no single machine could handle alone. This is what makes it possible to run a computation across hundreds or even thousands of processors simultaneously.
Note
We can still leverage shared memory within an individual node of the cluster, letting all of that node's CPUs work together. A distributed memory programming model is only required once we scale out across multiple nodes.
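In practice, the two models show up in how you request resources from a scheduler like Slurm (introduced later in this article). The fragments below are a sketch, assuming a typical Slurm configuration; they are illustrative directives, not a complete job script:

```shell
# Shared memory: all cores on ONE node, a single process using threads
#SBATCH --nodes=1
#SBATCH --cpus-per-task=16    # 16 cores sharing one node's memory

# Distributed memory: processes spread ACROSS nodes (e.g. with MPI)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8   # 32 processes, each with its own private memory
```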
```mermaid
graph LR
    subgraph "Your Laptop (Shared Memory)"
        CPU1[Core 1] --> RAM[Shared RAM]
        CPU2[Core 2] --> RAM
        CPU3[Core 3] --> RAM
        CPU4[Core 4] --> RAM
    end
```

```mermaid
graph TB
    subgraph "HPC Cluster (Distributed Memory)"
        direction TB
        subgraph Node1["Node 1"]
            C1[CPUs] --- M1[Memory]
        end
        subgraph Node2["Node 2"]
            C2[CPUs] --- M2[Memory]
        end
        subgraph Node3["Node 3"]
            C3[CPUs] --- M3[Memory]
        end
        Node1 <-->|"High-Speed Network"| Node2
        Node2 <-->|"High-Speed Network"| Node3
        Node1 <-->|"High-Speed Network"| Node3
    end
```

Storage: Where Your Data Lives¶
Beyond raw computing power, HPC workloads also depend on storage. Most clusters provide at least two tiers: high-speed scratch storage that compute nodes can read and write quickly during a job, and long-term storage for datasets and results that need to persist between jobs but don't require the same speed. Choosing the right storage tier for each stage of your workflow can make a big difference in both performance and cost. We cover this in detail in the Storage Fundamentals article.
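A common pattern is to stage data onto fast scratch storage before a job and copy the results back to long-term storage afterward. A minimal sketch, assuming hypothetical paths (substitute your cluster's actual scratch and long-term directories):

```shell
# Stage input data from long-term storage onto fast scratch (paths are hypothetical)
rsync -av /longterm/mylab/dataset/ /scratch/myuser/dataset/

# ... run your job against /scratch/myuser/dataset ...

# Copy results back to long-term storage when the job is done
rsync -av /scratch/myuser/results/ /longterm/mylab/results/
```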
Dartmouth's HPC Systems¶
Dartmouth provides several HPC systems, each suited to different kinds of work. Andes, Polaris, and Discovery are available to all researchers on campus, while the Babylon servers are operated by Thayer School of Engineering and Computer Science for their users.
Andes & Polaris (Shared Memory)¶
Andes and Polaris are large-scale shared memory systems available to all Dartmouth researchers. Unlike Discovery, they work more like a powerful version of your personal computer: you log in and run programs interactively, with all processors sharing the same memory space.
These systems are well suited for workloads that need large amounts of memory on a single machine, or for interactive data analysis and development. The limitation of these systems is that all resources are shared at all times. Your workloads might have to compete with other users' workloads for CPU time and memory, causing unnecessary slowdowns. Since they are stand-alone machines, you may also hit their resource ceiling.
Babylon (Thayer/CS — Shared Memory)¶
Thayer School of Engineering and Computer Science operate a set of shared memory compute servers named babylon1 through babylon12 (babylon1.thayer.dartmouth.edu, etc.). Like Andes and Polaris, these are standalone shared memory machines, but they are only available to Thayer and CS users. You can log in interactively via SSH.
The Babylon servers are a good stepping stone between your laptop and the full HPC cluster. They have faster processors and more memory than lab workstations, and they're convenient for engineering-specific software (MATLAB, Abaqus, Mathematica, etc.) that's pre-installed on the Thayer infrastructure. However, like Andes and Polaris, they are shared — your processes run alongside other users' work with no scheduler to guarantee dedicated resources.
Thayer/CS credentials
The Babylon servers authenticate using your NetID and password and are only accessible to Thayer and CS users. See the Thayer Linux documentation for full details.
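Logging in is a standard SSH session. For example, to reach the first Babylon server (replace `netid` with your own NetID):

```shell
# Connect to a Babylon server with your NetID and password
ssh netid@babylon1.thayer.dartmouth.edu
```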
Discovery (Campus-Wide HPC Cluster)¶
Discovery is Dartmouth's primary HPC cluster, available to all researchers on campus. It consists of many compute nodes connected by a high-speed network, and jobs are managed by the Slurm scheduler. You write a script describing what resources you need (CPUs, memory, time, GPUs), submit it, and Slurm runs it on the appropriate hardware when resources are available.
This is the system you'll use most often, and the one this cookbook focuses on.
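A minimal Slurm batch script might look like the sketch below. The script name, output file, and the `python analyze.py` command are placeholders for your own work; the `#SBATCH` directives are standard Slurm options:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis     # name shown in the queue
#SBATCH --cpus-per-task=4          # CPU cores for this job
#SBATCH --mem=8G                   # total memory
#SBATCH --time=02:00:00            # wall-clock limit (HH:MM:SS)
#SBATCH --output=my_analysis.out   # file where output is written

# Your actual computation goes here (placeholder command)
python analyze.py
```

You would submit this with `sbatch job.sh`, and Slurm runs it on a compute node once the requested resources are available.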
Which System Should I Use?¶
With multiple systems available, it's natural to wonder which one is right for your work. Here's a quick guide:
Use Andes or Polaris if your work fits on a single large machine and you want to run it interactively. These are good for exploratory data analysis, prototyping code, or running jobs that need a lot of memory but not a lot of scheduling overhead. You log in, run your program, and see the results in real time — much like working on your own computer, just with far more resources.
Use the Babylon servers if you're in Thayer or CS and need quick access to a more powerful machine for interactive work, especially if your workflow depends on engineering software available on the Thayer infrastructure.
Use Discovery if your work needs dedicated resources, GPUs, long runtimes, or the ability to run many jobs at once. Because Discovery uses a scheduler, your jobs get exclusive access to the resources you request — no competing with other users for CPU time. This is the right choice for production runs, batch processing, GPU-accelerated workloads like deep learning, and any workflow where you want to submit a job and walk away.
When in doubt, use Discovery. It's the most versatile system, it's open to everyone at Dartmouth, and it's the one this cookbook focuses on. The scheduler may feel like an extra step at first, but it ensures fair access for everyone and gives you predictable, reproducible performance.
| | Andes / Polaris | Babylon | Discovery |
|---|---|---|---|
| Best for | Large-memory single-machine tasks | Quick interactive work for Thayer/CS users | Batch jobs, GPU workloads, long-running or multi-job workflows |
| How you run work | Directly on the command line | Directly on the command line | Submit via the Slurm scheduler |
| Resource sharing | Shared with other users in real time | Shared with other users in real time | Dedicated resources for your job |
| Who can access | All Dartmouth users | Thayer and CS users | All Dartmouth users |
| GPUs | No | No | Yes |
| Scheduler | No | No | Yes (Slurm) |
How You Interact with an HPC Cluster¶
On your own computer, you simply start a program and it runs immediately using as many resources as it needs until it hits the system's limits.
Using an HPC cluster is different from using your personal computer. Here's the typical workflow:
- Connect — You log in to the cluster's login node over SSH from your own computer.
- Prepare — On the login node, you write scripts, transfer data, and set up your environment.
- Submit — You submit a job to the scheduler, describing what you want to run and what resources it needs.
- Wait — The scheduler finds available resources and runs your job. You don't need to stay connected.
- Collect results — When the job finishes, you retrieve the output files.
```mermaid
flowchart LR
    A["Your Computer"] -->|SSH| B["Login Node"]
    B -->|"sbatch job.sh"| C["Scheduler"]
    C --> D["Compute Nodes"]
    D -->|results| B
```

The login node is shared

When you connect to Discovery, you land on the login node, a shared gateway used by all users. It's appropriate for lightweight tasks like editing files and submitting or monitoring jobs. Do not run computationally intensive work on the login node; doing so can degrade performance for everyone. Use the scheduler to submit real work to the compute nodes.
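The whole workflow can be sketched as a short terminal session. The hostname and directory below are illustrative placeholders, and `slurm-<jobid>.out` is Slurm's default output file name:

```shell
# 1. Connect to the cluster's login node (hostname is illustrative)
ssh netid@discovery.dartmouth.edu

# 2. Prepare: move to your project and edit your job script
cd ~/my_project

# 3. Submit the job to the Slurm scheduler
sbatch job.sh

# 4. Wait: check on your job (it keeps running even if you log out)
squeue -u $USER

# 5. Collect results once the job has finished
cat slurm-*.out
```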
Key Terminology¶
New to HPC? Hover over technical terms to see a (hopefully) helpful definition. Check the glossary for the full list of terms like nodes, jobs, partitions, and more that you'll encounter throughout this cookbook.
What's Next?¶
Now that you have a sense of what HPC is and how Dartmouth's systems are organized, it's time to get hands-on:
- Request an account for our HPC systems
- Connect to our HPC systems via SSH
- Submit your first job with Slurm