Storage Fundamentals

When you work on your laptop, you have one hard drive that holds everything: your operating system, your documents, your datasets, and your results. HPC clusters are different. They typically offer multiple storage tiers, each designed for a different purpose. Using the right tier at the right time can dramatically improve your job performance and help you avoid common pitfalls like running out of space mid-job or bottlenecking on slow I/O.

Why Storage Tiers Matter

HPC workloads generate and consume data at scales that a single file system can't efficiently handle. A genomics pipeline might need to read terabytes of sequencing data, write hundreds of intermediate files during processing, and then store a few gigabytes of final results. Each of those stages has different requirements:

  • Reading input data needs reliable, always-available storage.
  • Writing intermediate files during a job needs speed — the faster your storage, the less time your CPUs spend waiting on disk.
  • Keeping final results needs persistence and enough capacity that you're not constantly deleting old work.

That's why clusters separate storage into tiers.

High-Speed Scratch Storage

Scratch storage is designed for speed. It sits on fast disks (often SSDs or parallel file systems like Lustre) and is optimized for the heavy read/write patterns that jobs produce. This is where your jobs should write temporary and intermediate files.

The tradeoff is that scratch storage is not permanent. Files on scratch are typically subject to automatic purge policies — if a file hasn't been accessed in a certain number of days, it may be deleted to free space for other users. Scratch is working space, not archival space.

When to use scratch

Use scratch for anything your job produces during execution: intermediate files, checkpoint files, temporary outputs. Once your job finishes, copy the results you need to longer-term storage.
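As a sketch of this pattern (the `$SCRATCH` environment variable and paths below are hypothetical; check your cluster's documentation for the actual scratch location), a job might stage its working files like this:

```bash
# Hypothetical paths: many clusters export $SCRATCH, but check your cluster's docs.
SCRATCH_DIR="${SCRATCH:-/tmp/${USER:-demo}-scratch}"
WORK_DIR="$SCRATCH_DIR/myjob-$$"     # per-job subdirectory (PID used here for uniqueness)
mkdir -p "$WORK_DIR"

# Intermediate and checkpoint files belong on scratch, not home or long-term storage.
echo "partial output from step 1" > "$WORK_DIR/intermediate.dat"
echo "checkpoint after step 1"    > "$WORK_DIR/checkpoint.chk"

# When the job finishes, copy only the results you need to long-term storage;
# anything left behind will eventually be removed by the purge policy.
ls "$WORK_DIR"
```

Writing each job's files into its own subdirectory makes cleanup easy and keeps concurrent jobs from clobbering each other's files.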

Long-Term Storage

Long-term (or persistent) storage is where you keep datasets, code, and results that need to stick around. It's typically larger in capacity than scratch but slower to access, since it's optimized for durability and availability rather than raw I/O speed.

This is the right place for:

  • Input datasets you'll reuse across multiple jobs
  • Final results and publication-ready data
  • Code repositories and environments
  • Anything you can't afford to lose

When to use long-term storage

Keep your important data — input datasets, final results, and anything you'd be upset to lose — on long-term storage. Copy what you need to scratch at the start of a job, and copy results back when the job finishes.

A Typical Storage Workflow

A common pattern for HPC jobs looks like this:

  1. Before the job — Your input data lives on long-term storage.
  2. Job starts — Your job script copies input data from long-term storage to scratch.
  3. Job runs — All intermediate I/O happens on fast scratch storage.
  4. Job finishes — Your job script copies results from scratch back to long-term storage.
  5. After the job — Scratch files are eventually purged; your results are safe on long-term storage.

```mermaid
flowchart LR
    A["Long-Term Storage"] -->|"1. Copy input"| B["Scratch Storage"]
    B -->|"2. Job reads/writes"| C["Compute Nodes"]
    C -->|"3. Job writes results"| B
    B -->|"4. Copy results back"| A
```

This approach gives you the best of both worlds: fast I/O where it matters and safe, persistent storage for everything else.
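The five numbered steps can be sketched as a minimal shell script. All paths here are placeholders standing in for your actual long-term and scratch locations, and the "compute" step is a trivial stand-in; a real job would run inside your scheduler's submission script:

```bash
#!/bin/bash
# Sketch of the copy-in / compute / copy-out pattern.
# All three paths are placeholders; substitute your real storage locations.
INPUT_DIR="${INPUT_DIR:-/tmp/demo-longterm/input}"      # stands in for long-term storage
RESULTS_DIR="${RESULTS_DIR:-/tmp/demo-longterm/results}"
SCRATCH_DIR="${SCRATCH_DIR:-/tmp/demo-scratch/job-$$}"  # stands in for scratch

mkdir -p "$INPUT_DIR" "$RESULTS_DIR" "$SCRATCH_DIR"
echo "sample data" > "$INPUT_DIR/data.txt"              # demo input file

# 1. Copy input from long-term storage to scratch.
cp -r "$INPUT_DIR/." "$SCRATCH_DIR/"

# 2-3. Do all heavy I/O on scratch (here, a trivial transformation).
tr 'a-z' 'A-Z' < "$SCRATCH_DIR/data.txt" > "$SCRATCH_DIR/result.txt"

# 4. Copy results back to long-term storage.
cp "$SCRATCH_DIR/result.txt" "$RESULTS_DIR/"

# 5. Clean up scratch explicitly (the purge policy would get there eventually).
rm -rf "$SCRATCH_DIR"
```

Only the final `result.txt` ends up on long-term storage; everything the job touched on scratch is gone once the script finishes.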

Home Directories

Most clusters also give each user a home directory. Your home directory is persistent and backed up, but it usually has a small quota (often just a few gigabytes). It's meant for configuration files, scripts, and small personal files — not for large datasets or job I/O.

Don't run jobs from your home directory

Home directories are typically on slower storage with tight quotas. Running jobs that do heavy I/O in your home directory can be slow for you and disruptive for other users sharing the same file system.
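If you suspect your home directory is near its quota, a quick way to find the culprits is to summarize usage per top-level item (standard `du` and `df` invocations; your cluster may also provide its own quota-reporting command):

```bash
# Disk usage of each top-level item in your home directory, largest first.
du -sh "$HOME"/* 2>/dev/null | sort -rh | head -n 10

# Overall usage and free space on the file system holding your home directory.
df -h "$HOME"
```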

Storage at Dartmouth

Dartmouth provides several storage options for researchers. Here's how they map to the tiers described above:

DartFS (Long-Term Shared Storage)

DartFS is Dartmouth's shared network storage, available from all campus HPC systems as well as your desktop. It's designed for storing research data, datasets, and results that need to persist long-term. DartFS is backed up and accessible from Discovery, Andes, Polaris, and other campus systems.

This is where you should keep your important research data — the datasets you'll reuse, the results you want to preserve, and anything you'd need to recover if something went wrong.

For details on requesting a DartFS allocation and getting started, see the Research Computing DartFS documentation.

Scratch on Discovery

Discovery provides high-speed scratch storage at /dartfs-hpc/scratch. When you have an account on Discovery, you'll have a personal scratch directory where your jobs can read and write data quickly. Files on scratch are not backed up and are subject to a purge policy — don't rely on scratch for long-term storage.
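To see which of your scratch files might be at risk, you can list files that haven't been accessed recently. The 30-day window below is purely illustrative; check Discovery's actual purge policy:

```bash
# Hypothetical scratch path and purge window; check your cluster's actual policy.
SCRATCH_DIR="${SCRATCH_DIR:-/dartfs-hpc/scratch/$USER}"
PURGE_DAYS=30

if [ -d "$SCRATCH_DIR" ]; then
    # Files not accessed in the last $PURGE_DAYS days are candidates for purging.
    find "$SCRATCH_DIR" -type f -atime +"$PURGE_DAYS" -print
else
    echo "No scratch directory at $SCRATCH_DIR"
fi
```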

Your Home Directory

Your home directory on Discovery (~) is persistent and backed up, but has a limited quota. Use it for scripts, configuration files, and small personal files. Avoid running jobs or storing large datasets here.

Which Storage Should I Use?

| Storage | Speed | Capacity | Persistence | Best for |
| --- | --- | --- | --- | --- |
| DartFS | Moderate | Large (by allocation) | Backed up | Research data, datasets, long-term results |
| Scratch (`/dartfs-hpc/scratch`) | Fast | Large | Purged periodically | Job I/O, temporary files, intermediate results |
| Home (`~`) | Moderate | Small (quota) | Backed up | Scripts, config files, small personal files |

Key Takeaways

Understanding which storage tier to use is one of the simplest ways to improve your HPC experience. The general rule is straightforward: use scratch for speed during jobs, and long-term storage for everything you want to keep. Your home directory is for small personal files and configuration, not for heavy workloads.

What's Next?