
Data Archiving (FORTRESS tape archive)

Overview

The Rosen Center for Advanced Computing (RCAC) at Purdue University provides long-term archival storage of data on the Fortress tape archive system. The tape archive effectively provides limitless storage, but at the price of slow access. Tape is used because it is a reliable and inexpensive medium for long-term storage. Fortress should not be used for regular operations, but instead for capturing datasets that are not in immediate use and that would be too difficult or costly to regenerate. Once files are uploaded to Fortress, the system will maintain two copies to guard against medium errors (e.g., failure or erasure of a single tape copy).

When should I use Fortress? (thoughts from Dr. Cherkauer)

  • When you have assembled a new raw dataset (e.g., returned from the field). Archive all raw files before digging into analysis. Then if a raw file is accidentally modified during analysis, the original can still be recovered.
  • When submitting a manuscript to a journal. Archive all files used for analysis and figure generation. Then when the manuscript is returned for revisions, you will be able to return files to the state they were when submitted - even if you continued to work in the same directories.
  • When you have completed preliminary analysis of a large dataset. Archive the raw dataset when you no longer need rapid and repeated access to it. This lets you clear up hard disk space, while maintaining a copy in case you need to rerun some of the original processing.
  • When you have completed a large model run. Archive the model output before conducting analysis, so that you can recover the original output if files are accidentally modified during the analysis process.

Remember that backups typically record changes from originals, and are mostly designed to recover from the accidental deletion or modification of a file or directory. Backups are less useful at returning all of your files to a specific time, since changes to files occur over an extended period as you work. An archive is a snapshot of what all files looked like at a specific time, so opening an archive can recover the original files, analysis code, plotting scripts, and everything else as it was at the time the archive was created.

Why to create a single archive file

Fortress functions most efficiently with a small number of large files. A tape can store 6-18 TB of data, but reserves space at the start of the tape to use to record filenames as they are added to the tape. That means that there is a limited number of files that can be written to each tape. Once the space for filenames is used up, no more files can be written to the tape, no matter how much space is left.

Info

Always bundle your files into a single tar, zip or other archive file format.

That said, it is also wise to consider the maximum size of your bundled files as well. While many TB can be written into a single TAR file, the entire TAR file will have to be downloaded, and then unarchived before a specific file can be used. Thus it is good practice to break very large file structures into smaller, more manageable archive files.
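One way to keep archives manageable is to build one tar file per top-level subfolder instead of a single tar file for the whole tree. The following is a minimal sketch, assuming GNU tar on Linux. The folder names (ProjectData, Site1, Site2) and the temporary work area are hypothetical stand-ins, created only so the example runs on its own.

```shell
set -eu

# Hypothetical example tree, built in a temporary work area (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/ProjectData/Site1" "$WORK/ProjectData/Site2"
echo "sample" > "$WORK/ProjectData/Site1/a.txt"
echo "sample" > "$WORK/ProjectData/Site2/b.txt"

OUT="$WORK/archives"
mkdir -p "$OUT"

# Create one compressed archive per top-level subfolder of ProjectData.
for dir in "$WORK"/ProjectData/*/; do
    name=$(basename "$dir")
    # -C changes into the parent folder first, so stored paths start at "$name".
    tar -czf "$OUT/ProjectData-${name}.tgz" -C "$WORK/ProjectData" "$name"
done

ls "$OUT"
```

Each resulting file can then be retrieved and unpacked on its own, without downloading the rest of the project.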

How to assess the size of a potential archive

Linux commands for getting disk space being used:

du -sh 'directory name' # will list just the total size of all files/directories within directory name

du -ah 'directory name' # will list all file/directory sizes within directory name, going down all subdirectories

du -ah --max-depth=1 'directory name' # will provide total of storage for directories down to specified max-depth level.

If using FreeBSD based systems (e.g., Mac OS X), the method to control depth (last option above) is different:

du -h -d 1 'directory name' # will provide total of storage for directories down to specified max-depth level.

Note

The -h flag provides results in "human readable" form, so instead of sizes in bytes, sizes are provided in TB, GB, MB, KB, depending on relative size.

Note

If running the du command on the Data Depot (or any other mirrored disk system), the resulting sizes will be double the actual size. The du command is unable to distinguish between the duplicate copies. The resulting archive file will reflect the actual size, or half of what the du command reports.
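To see which subfolders dominate a directory before deciding how to bundle it, the depth-limited du output can be sorted by size. This is a minimal sketch, assuming GNU coreutils (--max-depth and sort -h are GNU options; on macOS, du -h -d 1 plays the same role). The SampleData folder and its contents are hypothetical stand-ins.

```shell
set -eu

# Hypothetical sample tree in a temporary work area (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/SampleData/small" "$WORK/SampleData/large"
head -c 1024 /dev/urandom > "$WORK/SampleData/small/s.bin"
head -c 1048576 /dev/urandom > "$WORK/SampleData/large/l.bin"

# Sizes of SampleData and its immediate subfolders, smallest first.
du -h --max-depth=1 "$WORK/SampleData" | sort -h

# Total size in kilobytes, handy for scripted checks.
TOTAL_KB=$(du -sk "$WORK/SampleData" | cut -f1)
echo "Total: ${TOTAL_KB} KB"
```

On a mirrored file system such as the Data Depot, the totals printed here would still be subject to the doubling described in the note above.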

How to create an archive file for storage on Fortress

There are many methods by which to create an archive file from a collection of folders and files. The most commonly used archiving method is still to use the tar command to create a "tarred file" or "tarball". The tar command was originally created to read and write files from tape ("t"ape "ar"chive). It is still used today because it is efficient, reliable and preserves file structure and permissions in a single archive file. While the tar command does not directly compress files, it works with many compression schemes. Perhaps the most common is gzip, which is widely available on Linux systems. Tar files can be created from the command line in Linux and Windows, and are also a standard output type for most compression and archiving software (e.g., 7-zip).

Examples of using tar for file archiving:

# Create a single tar file (.tar) from a directory
>>> tar -cvf ArchiveName.tar Directory2Archive

# Create a compressed tar file (.tgz) from the same directory
>>> tar -cvzf ArchiveName.tgz Directory2Archive

# Review the contents of a tar file
>>> tar -tvf ArchiveName.tar

# Extract the contents of a compressed tar file
>>> tar -xvzf ArchiveName.tgz

# Create a single tar file (.tar) from a directory using files newer than 6 months ago
>>> tar --newer="6 months ago" -cvf ArchiveName.tar Directory2Archive

The options -c, -v, -t, etc. can be given separately or merged together (as shown above). Note that when they are merged, the -f option must be the last one in the group, because the archive filename must immediately follow the -f flag.

Time and date can be specified in many ways, see Date input formats for suggestions.

Note

I suggest running the tar command from the parent directory of the relative path or files you want to archive. Whatever Directory2Archive is will be used as the base path in the archive file. If Directory2Archive is a local folder, like "FirstPaper" then when the archive is extracted, contents will be written into a local folder called "FirstPaper". If you use an absolute path, then extracting the files will build all of the parent directory structure to place the archived folders correctly, for example if you use "/home/negishi/username/FirstPaper" then extracting the archive will put everything into the local folders "home/negishi/username/FirstPaper".
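The effect of the starting directory can be checked by listing an archive before uploading it. The following sketch assumes GNU tar. FirstPaper is the folder name used in the note above, while the Parent folder and the file name are hypothetical. The -C option tells tar to change into the named directory before reading files, which gives the same result as running tar from the parent directory.

```shell
set -eu

# Hypothetical folder layout in a temporary work area (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/Parent/FirstPaper"
echo "draft" > "$WORK/Parent/FirstPaper/manuscript.txt"

# Archive with paths stored relative to the parent directory.
tar -cf "$WORK/relative.tar" -C "$WORK/Parent" FirstPaper

# List the stored paths; they begin with "FirstPaper/".
tar -tf "$WORK/relative.tar"

# Extract somewhere else; contents land in a local FirstPaper folder there.
mkdir "$WORK/restore"
tar -xf "$WORK/relative.tar" -C "$WORK/restore"
ls "$WORK/restore/FirstPaper"
```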

This tutorial, How to Compress Files in Linux | Tar Command, covers many ways of using the tar command and other compression schemes on Linux. The command line options should be similar when used from the Windows terminal.

Archiving files to Fortress

Transferring an archive file to Fortress

There are two primary ways to get files from a local computer to Fortress: (1) using Globus, and (2) using the HSI tool. The HSI tool is a Linux command line tool that can support automation and replicates many standard Linux shell commands. The Globus tool is a GUI that provides the easiest way to view and navigate local and remote file systems, while also providing a robust transfer protocol that works in the background.

Note

If you want to use HSI or HTAR to transfer files to or from Fortress from a system not administered by RCAC, you will need to generate a keytab using the Fortress HPSS Keytab Generator.

This will only work from a computer on the Purdue network or from a remote computer using Purdue's VPN.

Creating an archive file on RCAC cluster systems

Start by creating the archive file. On the RCAC cluster systems, I suggest that you create the archive file in your personal scratch allocation, using this suggested naming convention:

# general command
>>> tar --ignore-failed-read -cvzf /scratch/negishi/<username>/<short descriptive name for archive>_Archived<YYYYMMDD>.tgz <Archive folder or file list> 

# sample command
>>> tar --ignore-failed-read -cvzf /scratch/negishi/cherkaue/NASA-RSWQ-FieldData-2023_Archived20250609.tgz 2023RawData/

The --ignore-failed-read flag will make sure that the tar command finishes even if it has problems reading the contents of some files in the archived folder. This occurs most often when someone puts a file in a shared location without confirming that permissions are set correctly.

Including the year, month and day of archive creation in the filename means that even if the file's creation date is modified (say, by being recovered from a backup), the date of creation is not lost. Writing dates in year-month-day order, and padding month and day with a leading zero when they are less than 10, means that they will always sort in numerical order. Thus if you make a new archive with the same name prefix (e.g., "NASA-RSWQ-FieldData-2023_"), the files will automatically sort from oldest to newest. That does not work if you change the year-month-day order or forget the zeros.
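The date stamp does not have to be typed by hand. The sketch below lets the date command build the zero-padded YYYYMMDD string, assuming GNU tar (for --ignore-failed-read). The prefix and folder name come from the sample command above; the temporary work area is a hypothetical stand-in for a scratch allocation such as /scratch/negishi/<username>.

```shell
set -eu

# Hypothetical stand-in for the folder to archive (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/2023RawData"
echo "reading" > "$WORK/2023RawData/obs.csv"

PREFIX="NASA-RSWQ-FieldData-2023"   # short descriptive name for the archive
STAMP=$(date +%Y%m%d)               # zero-padded year, month, day of today
ARCHIVE="$WORK/${PREFIX}_Archived${STAMP}.tgz"

# Same options as the sample command, with the date filled in automatically.
tar --ignore-failed-read -czf "$ARCHIVE" -C "$WORK" 2023RawData
echo "Created $(basename "$ARCHIVE")"
```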

Warning

It is critically important to follow good file naming practices and keep archive folders organized using effective folder names when using the PHIG Fortress archive. The archive grows larger every year, and is used to pass data new and old between current and future students within the group. If the contents of the archive file cannot be determined by reviewing the name and creation date, it is effectively trash, since no one is going to download multiple terabyte-sized files, unpack them and sort through them to find the information they need. (Just because it was your first paper archive, calling it FirstPaper.tar is not going to help anyone determine whose first paper, and whether or not we agree on which is the first paper.) Be a good data steward, and if you have questions about what makes a good archive name please ask Dr. Cherkauer or Dr. Bowling.

Use Globus to transfer files to and from Fortress (GUI option)

Globus is research cyberinfrastructure, developed and operated as a not-for-profit service by the University of Chicago. As Purdue is a subscriber to the service, you can use it to securely move, share and discover data across all kinds of platforms, including the Data Depot, Fortress and your personal computer.

Note

Globus is the preferred method for transferring files to the PHIG Fortress archive because it will set the Linux group permissions correctly. If you use HSI or HTAR, you will need to double check that the group is set to "phig" and that read-write access is granted to the group. If not, your files will be locked for all other users, which for anyone else in the group is as good as never having created the archive file.

Note

To transfer files between a personal computer and HPC resources, you can install Globus Connect Personal and turn your personal computer into a Globus collection.

  • First thing to do is log into Globus by navigating to globus.org. You can also access Globus by navigating to https://transfer.rcac.purdue.edu. Everything works the same, but the web tool is Purdue branded.
  • You should log in using your Purdue career account credentials from the Globus Web Application login screen (this is what globus.org shows).

  • Once in Globus, you will see some version of the Globus Web Application file transfer screen.

    • This screen is split to show both ends of the transfer, but you can use the panel icons to focus on only one side of the transfer at a time. That works better for searching a single file system.

    • To transfer files, you need to open collections at both ends of the transfer. This can be Purdue resources, resources shared with you from another institute, or files on a personal computer (if you install and setup Globus Connect Personal).

    • Select the files or folders that you want to transfer (that is, the archive file) from the source computer and click on the blue Start button (the arrow should point from the source to the destination).
    • Once the transfer has been started, you can close Globus without stopping the transfer process. You will receive an email, once the job is done.

      Warning

      If you are using Globus Connect Personal and the local computer turns off or is disconnected from the internet then your transfer will be paused. It will restart as soon as the computer is turned back on and reconnected to the internet.

  • To access disk storage on a specific cluster:

    • Click on the "Search" box on the left side of the system.
    • Search for "Purdue Negishi Cluster" (or another RCAC cluster system).
    • Once you find the system, open it to find two folders: "home" and "scratch".
      • You can navigate to subfolders by clicking on a folder to view its contents, just like in File Explorer.
      • On the clusters, you will probably be happier clicking in the "Path" box and giving an initial path, for example "/scratch/negishi/cherkaue", as this lets Globus skip over the process of loading all of the scratch folders on Negishi.
    • This will open the contents of the folder in the panel.
    • You can navigate the directory structure in this window.
  • To access Fortress:

    • In the "Search" box, type in "Purdue Fortress HPSS Archive".
    • This may require an additional login step (use your Purdue password only, not ",push" format).
    • Once logged in, you should see the contents of your personal Fortress archive (e.g., /home/<username>/).
    • If you have access to a shared Fortress folder, it will be in /group/<groupname>.

    Note

    If you are a member of PHIG, Dr. Cherkauer has created a direct collection, "PHIG_Fortress_Archive", which should be used instead of the general Fortress login. If you find that you do not have access, please contact Dr. Cherkauer to be added to the PHIG group.

Using HSI to transfer files to and from Fortress (command line option)

  • Here is a link to the RCAC overview page.
  • HSI, the Hierarchical Storage Interface, is the preferred method of transferring files to and from Fortress. HSI is designed to be a friendly interface for users of the High Performance Storage System (HPSS). It provides a familiar Unix-style environment for working within HPSS while automatically taking advantage of high-speed, parallel file transfers without requiring any special user knowledge.
  • The HSI system works very much like ftp or sftp from the Linux command line.
    • Use the put command to push files from the local system to the remote system (Fortress).
    • Use the get command to pull files from the remote system to the local system.
    • Linux commands are used to navigate the remote system.
    • Add the prefix 'l' for "local" to the Linux command to apply it to the local system, for example the command ls lists the remote system, while lls lists the local system.
  • When you are done with file transfers, type exit to leave HSI.
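The put command described above can also be scripted, since HSI accepts a quoted, semicolon-separated list of commands on its command line. The following is only a sketch under assumptions: the remote folder /group/phig/Projects/ExampleProject and the DO_TRANSFER switch are hypothetical, and the archive name is taken from the earlier sample command. By default the script performs a dry run and prints the command instead of contacting Fortress.

```shell
set -eu

ARCHIVE="NASA-RSWQ-FieldData-2023_Archived20250609.tgz"   # archive name from the sample command
REMOTE_DIR="/group/phig/Projects/ExampleProject"          # hypothetical project folder

# Commands for HSI, separated by semicolons: change to the shared folder,
# upload the archive, then list it so group and permissions can be checked.
HSI_CMDS="cd ${REMOTE_DIR}; put ${ARCHIVE}; ls -l ${ARCHIVE}"

# DO_TRANSFER is a hypothetical switch: set it to 1 to run the transfer for real.
if [ "${DO_TRANSFER:-0}" = "1" ] && command -v hsi >/dev/null 2>&1; then
    hsi "$HSI_CMDS"
else
    # Dry run: show the command that would be executed.
    echo "hsi \"$HSI_CMDS\""
fi
```

The final ls -l is included because files uploaded with HSI may not end up with the expected group, so the listing is worth reading before moving on.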

Note

When using HSI, the home directory (~) is your personal Fortress folder. The contents of this folder cannot be shared with anyone else. To make sure that other PHIG members have access to your archive files, you will need to navigate to /group/phig/. Once there, please consider placing your files in the existing file structure instead of creating a new folder. Most of your work can probably go in the Projects folder, in a subfolder named for your project (something someone else will be able to identify). When it is in an appropriate folder, the archive file name does not have to explain the project, just the contents of the current archive.

Direct archiving to Fortress using HTAR (command line option)

There is an option to create a tar file directly on Fortress, which can be useful for smaller

  • Use htar to create a tar file directly on Fortress.
    • Here is a link to the RCAC overview page.
    • HTAR (short for "HPSS TAR") is a utility program that writes TAR-compatible archive files directly onto Fortress, without having to first create a local file.
    • The remote location defaults to your personal home folder on the Fortress system. Use the prefix /group/phig/ and an appropriate subdirectory to store archive files where they are accessible by Dr. Cherkauer and future PHIG members.
    • HTAR will fail if any one file is larger than 64 GB. If you are trying to archive larger files, create your archive using tar, which has no such limit, and use Globus or HSI to transfer the resulting file.
    • HTAR also appears to have a maximum number of files that can be written to an archive. The -M <num files> flag can be used to increase this past the default value, but Larry Biehl found that it could not exceed 5 million files.

      Note

      Larry found this maximum when backing up the DOE ARPA-E Phenosorg project, which had over 95 TB stored on the Data Depot. I do not recommend creating archive files anywhere near that large as they are exceedingly difficult to use for data recovery.
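Both HTAR limits described above can be checked locally before starting a long transfer. This is a minimal sketch using find, with a hypothetical sample folder. Note that the G suffix in find means GiB (1024^3 bytes), which may not match exactly how HTAR measures its 64 GB limit, so treat files close to the threshold with caution.

```shell
set -eu

# Hypothetical folder to be archived, in a temporary work area (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/Directory2Archive/sub"
echo "a" > "$WORK/Directory2Archive/one.txt"
echo "b" > "$WORK/Directory2Archive/sub/two.txt"
TARGET="$WORK/Directory2Archive"

# Count regular files, to compare against the HTAR file-count limit.
NUM_FILES=$(find "$TARGET" -type f | wc -l | tr -d ' ')

# List any single file larger than 64 GiB.
TOO_BIG=$(find "$TARGET" -type f -size +64G)

echo "Files to archive: $NUM_FILES"
if [ -n "$TOO_BIG" ]; then
    echo "Too large for HTAR; use tar with Globus or HSI instead:"
    echo "$TOO_BIG"
else
    echo "No single file exceeds the size limit"
fi
```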

Retrieving files from the FORTRESS archive

Both the Globus and HSI tools previously mentioned can be used to recover files from the Fortress tape archive. Just note that most of the time files are written to tape and removed from the Fortress HDD system within days of being uploaded (actual time depends on the size of your files and the number of other users). The Fortress HDD system is only used for transfer operations, and should NEVER be used for anything else.

Once files are written to tape, the recovery process requires that Fortress locate the proper tape and copy the file to the HDD system, at which point it can be transferred. Depending on the size of the file being retrieved and the number of other active jobs, this process can take minutes to days. In my experience it typically requires no more than a couple of hours.

Because of this delay, Globus is the best tool for retrieving files, since you can put in the transfer request and then close Globus. You will get an email from Globus once the file transfer has been completed.

I have not used HSI in a while, but when I did you either had to wait for the process to complete before exiting, or, if your connection was interrupted, you would typically find that the file was pulled to disk with the first request and that it was then available for immediate transfer when you reestablished the connection. Just don't wait too long or the disk copy will be removed to make way for other jobs.

EXAMPLES: Specific Use Cases

Here are some specific use cases that have been employed within PHIG:

Archive all UAS flights on a given date

This creates a compressed tar file with all images and supplemental data in the /depot/phig/uasdata folder that are stored in folders named for the flight date, June 25, 2025.

>>> tar -cvzf /scratch/negishi/cherkaue/TempArchive/UASFlights_20250625.tgz `find . -iname 20250625 -type d`

The find command returns the list of directories:

./Raw_data/GPS/20250625
./Raw_data/Red_Edge/20250625
./Raw_data/Thermal/20250625
./Raw_data/RGB/20250625

These are used by the tar command to locate files that should be added to the archive.
The resulting archive will have the directory structure ./Raw_data/ with folders for each camera and data type, each with a subfolder for the specified flight date. If there are folders in ./Processed_data with the same date, then they will also be included in the archive.
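The backtick form above passes the find results through the shell, which splits them on spaces, so a folder name containing a space would break the command. A variant that avoids this is sketched below, assuming GNU find and GNU tar. The folder tree is a hypothetical miniature of the uasdata layout, built in a temporary work area so the example runs on its own.

```shell
set -eu

# Hypothetical miniature of the uasdata layout (assumption for illustration).
WORK=$(mktemp -d)
mkdir -p "$WORK/uasdata/Raw_data/RGB/20250625" \
         "$WORK/uasdata/Raw_data/Thermal/20250625" \
         "$WORK/uasdata/Raw_data/RGB/20250701"
echo "img" > "$WORK/uasdata/Raw_data/RGB/20250625/frame 001.jpg"
echo "img" > "$WORK/uasdata/Raw_data/Thermal/20250625/t001.tif"
echo "img" > "$WORK/uasdata/Raw_data/RGB/20250701/other.jpg"

FLIGHT_DATE=20250625
ARCHIVE="$WORK/UASFlights_${FLIGHT_DATE}.tgz"

cd "$WORK/uasdata"
# -print0 and --null pass folder names separated by NUL characters, so spaces survive;
# -T - tells tar to read the list of items to archive from standard input.
find . -type d -name "$FLIGHT_DATE" -print0 | tar --null -T - -czf "$ARCHIVE"

# Review the contents: only folders for the chosen flight date should appear.
tar -tzf "$ARCHIVE"
```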