Syncing to Archival Storage

From CAC Documentation wiki

General Information

Purpose of this how-to

This is a user-level guide for syncing a Linux or a Windows machine with Globus, particularly to the CAC Archival Storage.

Assumptions and definitions

  • You have a Globus account, which we'll indicate by <globusAccountName>
  • You have a local endpoint you want to sync, indicated by <localEndpointName>
  • There is a destination endpoint <destinationEndpointName>; for syncing to the CAC archive resource it is cac#archive01/export/archive01/<CACProjectName>/<path>
    • The CAC archive endpoint is active
  • Any subsidiary paths will be written as <path>.

Limitations

This requires that both endpoints be active. The sync command itself can be run from any machine, not necessarily one that hosts an endpoint. However, Globus Connect Personal runs under a user identity and dies when that user logs out. This is not an issue on Linux, but it limits the usefulness of syncing as a scheduled task on Windows. It may be possible to run Globus Connect Personal as a Windows service, but that has not yet been tested.

Linux

Setup

To back up a directory from a Linux file server to CAC's archive, you must first start GlobusConnect on the file server. Designate or create an account to run the syncing process, which we will call <sync-user>. Create the following scripts so that the <sync-user> account can execute them:

gc_start.sh:

#!/bin/bash
sh /opt/globusconnectpersonal-2.0.3/globusconnect -start -restrict-paths rw/<path to back up>&

gc_status.sh:

#!/bin/bash
sh /opt/globusconnectpersonal-2.0.3/globusconnect -status

gc_stop.sh:

#!/bin/bash
sh /opt/globusconnectpersonal-2.0.3/globusconnect -stop
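
The three scripts differ only in the flag they pass, so a single wrapper could stand in for them. The following is a minimal sketch rather than a tested replacement: the function name gc_ctl, the GC_BIN and GC_BACKUP_PATH variables, and the /srv/data default are all hypothetical, while the flags are the ones used in the scripts above.

```shell
#!/bin/bash
# Hypothetical wrapper around gc_start.sh, gc_status.sh and gc_stop.sh.
# GC_BIN and GC_BACKUP_PATH are illustrative names; adjust them to your install.
GC_BIN="${GC_BIN:-/opt/globusconnectpersonal-2.0.3/globusconnect}"
GC_BACKUP_PATH="${GC_BACKUP_PATH:-/srv/data}"   # stands in for <path to back up>

gc_ctl() {
    case "$1" in
        start)
            # Same flags as gc_start.sh; the client runs in the background.
            sh "$GC_BIN" -start -restrict-paths "rw${GC_BACKUP_PATH}" &
            ;;
        status|stop)
            # Same flags as gc_status.sh and gc_stop.sh.
            sh "$GC_BIN" "-$1"
            ;;
        *)
            echo "usage: gc_ctl {start|status|stop}" >&2
            return 2
            ;;
    esac
}
```

After sourcing the file, "gc_ctl start", "gc_ctl status" and "gc_ctl stop" would behave like the three separate scripts.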

Execute gc_start.sh:

<pathToScript>/gc_start.sh

You also need SSH keys set up with the Globus system. Store your private key locally (typically in the .ssh subdirectory of your home directory) under a name of your choice, which we'll represent as <mykey>, and then register the public key with Globus:

  • Go to the globus website and click on your account name at top-right, and select "manage identities"
    • Select "Add linked identity" and pick "Add SSH public key"
    • Paste the public key into the box for it and give the key a name
    • Click "Submit"
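
If you do not already have a key pair, one way to create it is with OpenSSH's ssh-keygen. This is a sketch under a few assumptions: the key has no passphrase (a cron job cannot type one), it is a 2048-bit RSA key to mirror the Windows instructions, and the function name make_sync_key is hypothetical. The path you pass in plays the role of <mykey>.

```shell
#!/bin/bash
# Generate an RSA key pair for use with the Globus CLI (sketch).
# $1 = path of the private key file, e.g. "$HOME/.ssh/globus_sync"
make_sync_key() {
    local keyfile="$1"
    # Refuse to overwrite an existing key.
    if [ -e "$keyfile" ]; then
        echo "refusing to overwrite existing key: $keyfile" >&2
        return 1
    fi
    mkdir -p "$(dirname "$keyfile")"
    # -N "" sets an empty passphrase so unattended jobs can use the key;
    # -q suppresses the progress and randomart output.
    ssh-keygen -q -t rsa -b 2048 -N "" -f "$keyfile" || return 1
    chmod 600 "$keyfile"
    echo "Public key to paste into Globus: ${keyfile}.pub"
}
```

The contents of the .pub file are what you paste into the Globus web page.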

How to perform a backup

Once GlobusConnect is started you next issue a command to the CLI via ssh. For example:

ssh -t <globusAccountName>@cli.globusonline.org transfer -s 2 --preserve-mtime --verify-checksum \
  -- <globusAccountName>#<localEndpointName> cac#archive01/export/archive01/<CACProjectName>/<path> -r

With /home/fs01 as the source path on the local endpoint, this command backs up that directory recursively (-r) to the CAC archive, preserving last-modified timestamps (--preserve-mtime) and verifying checksums (--verify-checksum). With -s 2, only new files and files whose timestamps are newer than the copies in the archive are transferred. Nothing will be deleted.
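
Because the command is long and easy to mistype, it can help to assemble it from its parts. The sketch below only builds and prints the command line shown above; it does not contact Globus. The function name build_transfer_cmd and its argument order are hypothetical, and the values in the usage comment are placeholders.

```shell
#!/bin/bash
# Build the Globus CLI transfer command from its parts (sketch).
# $1 = Globus account, $2 = local endpoint name, $3 = source path on that
# endpoint, $4 = CAC project name, $5 = destination path under the project.
build_transfer_cmd() {
    local account="$1" endpoint="$2" src="$3" project="$4" dest="$5"
    printf 'ssh -t %s@cli.globusonline.org transfer -s 2 --preserve-mtime --verify-checksum -- %s#%s%s cac#archive01/export/archive01/%s/%s -r\n' \
        "$account" "$account" "$endpoint" "$src" "$project" "$dest"
}

# Example with placeholder values: print the command, inspect it, then run it.
# build_transfer_cmd alice fs01 /home/fs01/ myproject backups/fs01/
```

Printing the command first makes it possible to check the endpoint and path before anything is transferred.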

To monitor the status of your backup, go to the cacsystems GlobusOnline transfer activity page. If you don't have the password, ask other CAC staff for it. Once your backup has completed, an automated summary will be mailed to cac-systems. You then need to stop the GlobusConnect client on the file server by running:

<pathToScript>/gc_stop.sh

You can check the status of GlobusConnect by running:

<pathToScript>/gc_status.sh

Scheduled Backups

You can use cron jobs to perform scheduled backups. You need a user account under which these jobs will run; we will call this user <sync-user>. Our example jobs are:

  1. Daily backup of /home/fs01/ running at 11:00PM
  2. Weekly backup of /home/shared running on Sunday at 11:30PM

Say you want to run these daily and weekly sync cron jobs as the CTC_ITH\<sync-user> user. In this user's home directory, create a daily-sync.sh and a weekly-sync.sh file, and schedule each one in that user's crontab:

0 23 * * * /home/<sync-user>/daily-sync.sh
30 23 * * 0 /home/<sync-user>/weekly-sync.sh
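
Crontab fields are, in order: minute, hour, day of month, month, day of week (0 is Sunday). As a sketch, the helper below formats entries for the two example jobs at the times stated in the list above (11:00 PM daily, 11:30 PM on Sunday) and appends each job's output to a log file. The function name cron_entry, the literal sync-user home directory, and the log file paths are hypothetical.

```shell
#!/bin/bash
# Format one crontab line (sketch). Field order in a crontab is:
#   minute hour day-of-month month day-of-week command
# $1 = minute, $2 = hour (0-23), $3 = day of week (0 = Sunday, * = every day),
# $4 = script path, $5 = log file (hypothetical; cron otherwise mails output)
cron_entry() {
    printf '%s %s * * %s %s >> %s 2>&1\n' "$1" "$2" "$3" "$4" "$5"
}

# Daily at 23:00 and Sundays at 23:30.
cron_entry 0 23 '*' /home/sync-user/daily-sync.sh /home/sync-user/daily-sync.log
cron_entry 30 23 0 /home/sync-user/weekly-sync.sh /home/sync-user/weekly-sync.log
```

The printed lines can be pasted into the editor opened by "crontab -e" for the sync user.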

Because of the limited program control available in these scripts, each one currently does the following:

  1. attempts to start the GlobusConnect client.
  2. auto-activates the GlobusConnect endpoint on hd-hni-fs.cac.cornell.edu
  3. initiates a transfer command

Example content of weekly-sync.sh:

#!/bin/bash
<pathToScript>/gc_start.sh
ssh -i $HOME/.ssh/<mykey> -t <globusAccount>@cli.globusonline.org endpoint-activate <globusAccount>#<my-endpoint>
ssh -i $HOME/.ssh/<mykey> -t <globusAccount>@cli.globusonline.org transfer -s 2 --verify-checksum -- <globusAccount>#<my-endpoint>/home/shared/ cac#archive01/export/archive01/<CACProjectName>/<path> -r

Ideally, each script would terminate the GlobusConnect client when the transfer completes, but this is not yet implemented and may never be, depending on the time and effort required to make it work.
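
One modest improvement that does not depend on any further Globus features is to log the outcome of each step and stop at the first failure. The following is only a sketch of that idea; the function name run_step and the SYNC_LOG variable with its default location are hypothetical.

```shell
#!/bin/bash
# Run one step of a sync script and log its exit status (sketch).
# SYNC_LOG is a hypothetical log location; override it as needed.
SYNC_LOG="${SYNC_LOG:-${HOME:-.}/globus-sync.log}"

run_step() {
    local label="$1"; shift
    local rc=0
    # Run the remaining arguments as a command, capturing all output in the log.
    "$@" >> "$SYNC_LOG" 2>&1 || rc=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') ${label}: exit ${rc}" >> "$SYNC_LOG"
    return "$rc"
}

# Usage inside a sync script: prefix each existing command with
# "run_step <label>" and append "|| exit 1", for example:
#   run_step start /path/to/gc_start.sh || exit 1
```

Each step then leaves a timestamped line in the log, which makes a failed nightly run easier to diagnose.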

Windows

Assumption

The client endpoint -- the one containing the resources to be transferred to the CAC endpoint -- is active.


Explanation

We're going to use the command-line interface (CLI) to Globus, which basically means logging into their dedicated server over SSH and sending commands. The CLI is detailed in the Globus documentation.

Setup

  • Download the latest version of PuTTY (some older versions won't work), including PuTTYGen and plink (the Windows installer contains them all: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html )
  • Launch PuTTYGen
    • Make sure the "SSH2-RSA" radio button is selected, and a key length of at least 2048 in the box below that
    • Click the "Generate" button. You'll need to keep moving the cursor over the blank grey area to generate randomness
    • Leave the passphrase empty; a scheduled task cannot enter one
    • Save the keys:
      • the private key should be called something like <privateKeyName>_id_rsa.ppk and stored somewhere safe but accessible to the scheduled task
      • The public key can be saved, or you can just copy it to the clipboard from the box where it is shown in clear text
  • With the public key in the clipboard, go to the globus website and click on your account name at top-right, and select "manage identities"
    • Select "Add linked identity" and pick "Add SSH public key"
    • Paste the public key into the box for it and give the key a name
    • Click "Submit"
  • Create a connection:
    • Start PuTTY
    • In the "Session" tab:
      • put <globusAccountName>@cli.globusonline.org in the Host Name box
      • in the "Saved Sessions" textbox give the session a name; I use "globusSync" and we'll refer to this as <sessionName>
    • On the Connection > SSH > Auth tab, for "Private Key for authentication" click "browse" and select the file in which you wrote your private key
    • Back on the "Session" tab, click "save"; your session name should now appear on the list of saved sessions
    • You can now test that it works: double-click the saved session name and, after accepting the server key, you should find yourself in an SSH session

You'll use plink to actually send the sync command (or any other Globus CLI commands you want to use). If plink is not on your PATH, use the full path to the executable (for example, C:\Program Files (x86)\PuTTY\plink.exe), particularly when you set this up as a scheduled task. The basic command, to run on the command line, is this:

   "C:\Program Files (x86)\PuTTY\plink.exe" <sessionName> transfer -s 2 --preserve-mtime --verify-checksum -- <globusAccountName>#<localEndpointName>/<path> cac#archive01/export/archive01/<CACProjectName>/<path> -r

You should test that it works by calling up cmd.exe and executing it.

Creating the scheduled task

  • Start up the task scheduler
  • Select "Create task" from the "Actions" tab
  • Give the task a name and select the user identity under which it should run, ensuring that it has access to the SSH session and key information you set up. Select the "Run whether user is logged on or not" radio button (note that this doesn't fix the issue of the endpoint going down if the owning user isn't logged on). If only local resources will be required, you can choose not to store the password
    • On the "Triggers" tab
      • Select "New"
      • Select "run on a schedule" from the drop-down, and select when you want it to run and on what cadence
      • Select the checkbox for "Enabled" (important!)
    • On the "Actions" tab
      • Select "New"
      • Select "Start a program" from the drop-down
      • For "Program name" put the full path to plink.exe, enclosed in double quotes if it contains a space, eg:
   "C:\Program Files (x86)\PuTTY\plink.exe"
      • For "Arguments" enter:
   <sessionName> transfer -s 2 --preserve-mtime --verify-checksum -- <globusAccountName>#<localEndpointName>/<path> cac#archive01/export/archive01/<CACProjectName>/<path> -r
  • Accept the other defaults, and click "OK". Unless you selected the option not to store the password, you'll have to enter the Windows credentials for the account under which the task will run.