Difference between revisions of "Syncing to Archival Storage"

From CAC Documentation wiki
==General Information==
+
This is a user-level guide for syncing a directory to CAC Archival Storage using Globus.
===Purpose of this how-to===
 
This is a user-level guide for syncing a Linux or a Windows machine with Globus, particularly to the CAC Archival Storage.
 
  
===Assumptions and definitions===
+
== Prerequisites ==
 +
* You know [https://docs.globus.org/how-to/get-started/#log_in_with_an_existing_identity how to log into Globus].
 +
* You are a user of a CAC project with archival storage service enabled. In this document, <CACUser> denotes your CAC user name and <CACProject> denotes your CAC project name.
 +
* On the host from which you want to run sync commands (either one-time or regularly scheduled), [https://docs.globus.org/cli/installation/ install the Globus CLI client].
 +
* If the source directory is not located on an existing Globus endpoint, install Globus Connect Personal for [https://docs.globus.org/how-to/globus-connect-personal-linux/ Linux], [https://docs.globus.org/how-to/globus-connect-personal-mac/ MacOS], or [https://docs.globus.org/how-to/globus-connect-personal-windows/ Windows] on the host where the source directory is located.
  
:* You have a Globus account, which we'll indicate by <globusAccountName>
+
== Setup ==
:* You have a local endpoint you want to sync, indicated by <localEndpointName>
 
:* There is a destination endpoint <destinationEndpointName>; for syncing to the CAC archive resource it is cac#archive01/export/archive01/<CACProjectName>/<path>
 
:** The CAC archive endpoint is active
 
:* Any subsidiary paths will be written as <path>.
 
  
===Limitations===
+
On the host from which you want to run sync commands:
This requires that both endpoints be active. The sync command can be run from any machine, not necessarily the one hosting an endpoint. However, Globus Connect Personal runs under a user identity and stops when that user logs out. This is not an issue on Linux, but it limits the usefulness of syncing as a scheduled task on Windows. It may be possible to run Globus Connect Personal as a Windows service, but that has not yet been tested.
 
  
==Linux==
+
:* Log into Globus using Globus CLI:
 +
<pre>
 +
$ globus login
 +
Please authenticate with Globus here:
 +
------------------------------------
 +
https://auth.globus.org/v2/oauth2/authorize?...........
 +
------------------------------------
 +
 +
Enter the resulting Authorization Code here: </pre>
 +
::Copy and paste the URL into a web browser. Log into Globus as instructed in the web browser. After logging in, copy and paste the code back into the terminal.
 +
<pre>
 +
You have successfully logged in to the Globus CLI!
  
===Setup===
+
You can check your primary identity with
 +
  globus whoami
  
To back up a directory from a Linux file server to CAC's archive, you must first start GlobusConnect on the file server. Designate or create an account to run the syncing process, which we will call <sync-user>. Create these scripts so that the <sync-user> account can execute them:
+
For information on which of your identities are in session use
 +
  globus session show
  
gc_start.sh:
+
Log out of the Globus CLI with
 
+
  globus logout
<pre>#!/bin/bash
 
sh /opt/globusconnectpersonal-2.0.3/globusconnect -start -restrict-paths rw/<path to back up>&</pre>
 
 
 
gc_status.sh:
 
 
 
<pre>#!/bin/bash
 
sh /opt/globusconnectpersonal-2.0.3/globusconnect -status</pre>
 
 
 
gc_stop.sh:
 
 
 
<pre>#!/bin/bash
 
sh /opt/globusconnectpersonal-2.0.3/globusconnect -stop
 
 
</pre>
 
</pre>
 
Execute gc_start.sh:
 
 
<pre><pathToScript>/gc_start.sh</pre>
 
 
You also need SSH keys set up with the Globus system. Store your private key locally (typically in the .ssh subdirectory of your home directory) under a name, which we'll represent as <mykey>, then register the public key:
 
 
:* Go to the Globus website, click on your account name at top-right, and select "manage identities"
 
:** Select "Add linked identity" and pick "Add SSH public key"
 
:** Paste the public key into the box for it and give the key a name
 
:** Click "Submit"
 
 
=== How to perform a backup ===
 
 
Once GlobusConnect is started, issue a command to the CLI via ssh. For example:
 
<pre>ssh -t <globusAccountName>@cli.globusonline.org transfer -s 2 --preserve-mtime --verify-checksum
 
-- <globusAccountName>#<localEndpointName> cac#archive01/export/archive01/<CACProjectName>/<path> -r</pre>
 
This command backs up the /home/fs01 directory to the CAC archive, preserving the last-modified timestamp, verifying checksums, and transferring only new files or files with timestamps newer than those already in the archive. Nothing will be deleted.
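
With the Globus CLI client (the <code>globus</code> command named in the Prerequisites), a comparable transfer might be assembled as in the sketch below. This is an illustration under assumptions, not the documented CAC procedure: the endpoint UUIDs and paths are hypothetical placeholders, and it assumes that sync level 2 of the ssh-based CLI corresponds to <code>--sync-level mtime</code>. The script only prints the command so it can be reviewed before anything is submitted.

```shell
#!/bin/bash
# Hypothetical placeholders: substitute real endpoint UUIDs and paths.
SRC="SRC_ENDPOINT_UUID:/home/fs01/"
DST="DST_ENDPOINT_UUID:/export/archive01/CACProjectName/path/"

# --recursive           transfer the whole directory tree
# --sync-level mtime    copy only files that are new or newer at the source
# --preserve-timestamp  keep source modification times on the destination
# --verify-checksum     checksum each file after it is transferred
sync_cmd=(globus transfer --recursive --sync-level mtime
          --preserve-timestamp --verify-checksum "$SRC" "$DST")

# Print the command for review; to submit it, run: "${sync_cmd[@]}"
echo "${sync_cmd[*]}"
```

Holding the command in an array keeps each argument intact if a path contains spaces, and makes it easy to switch from printing to executing.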
 
 
To monitor the status of your backup, go to [https://www.globusonline.org/xfer/ViewTransfers the cacsystems GlobusOnline transfer activity page]. If you don't have the password, talk to other CAC staff to obtain it. Once your backup is completed, an automated summary will be mailed to cac-systems. Next, stop the GlobusConnect client on the file server by running:
 
<pre><pathToScript>/gc_stop.sh</pre>
 
 
You can check the status of GlobusConnect by running:
 
<pre><pathToScript>/gc_status.sh</pre>
 
 
=== Scheduled Backups ===
 
You can use cron jobs to perform scheduled backups. You need a user account in whose context these jobs will run; we will call this user <sync-user>. Our example jobs are
 
 
# Daily backup of /home/fs01/ running at 11:00PM
 
# Weekly backup of /home/shared running on Sunday at 11:30PM
 
 
Say you want to run these daily and weekly sync cronjobs in the context of the CTC_ITH\<sync-user> user. In this user's home folder create a daily-sync.sh and a weekly-sync.sh file. Each file should be scheduled accordingly via crontab.
 
 
<pre>0 23 * * * /home/<sync-user>/daily-sync.sh
 
30 23 * * 0 /home/<sync-user>/weekly-sync.sh</pre>
 
 
Because of the limited program control available in these script files, each file currently does the following:
 
 
# attempts to start the GlobusConnect client.
 
# auto-activates the GlobusConnect endpoint on hd-hni-fs.cac.cornell.edu
 
# initiates a transfer command
 
 
Example content of weekly-sync.sh
 
<source lang="bash">
 
#!/bin/bash
 
<pathToScript>/gc_start.sh
 
ssh -i .ssh/<mykey> -t <globusAccount>@cli.globusonline.org endpoint-activate <globusAccount>#<my-endpoint>
 
ssh -i .ssh/<mykey> -t <globusAccount>@cli.globusonline.org transfer -s 2 --verify-checksum -- <globusAccount>#<my-endpoint>/home/shared/ cac#archive01/export/archive01/<CACProjectName>/<path>$
 
</source>
 
 
Ideally, each script would terminate the GlobusConnect client when the transfer completes, but this is not yet implemented and may never be, depending on the time and effort required to make it work.
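
One possible way to close that gap, sketched here with the Globus CLI client (<code>globus</code>) rather than the ssh-based CLI used in the scripts above: submit the transfer, capture its task ID, block on <code>globus task wait</code>, and then call the stop script. The function name <code>sync_and_stop</code>, the <code>GC_START</code> and <code>GC_STOP</code> variables, and the script paths are hypothetical, and the approach is untested against the CAC archive endpoint.

```shell
#!/bin/bash
# Sketch only. GC_START and GC_STOP are hypothetical variables pointing at
# the gc_start.sh and gc_stop.sh scripts described earlier.
GC_START="${GC_START:-/path/to/gc_start.sh}"
GC_STOP="${GC_STOP:-/path/to/gc_stop.sh}"

sync_and_stop() {
    local src="$1" dst="$2" task_id status

    "$GC_START" || return 1

    # Submit the transfer and capture only the task ID from the response.
    if ! task_id="$(globus transfer --recursive --sync-level mtime \
            --verify-checksum --jmespath 'task_id' --format unix \
            "$src" "$dst")"; then
        "$GC_STOP"
        return 1
    fi

    # Block until the task finishes, then stop the client either way.
    globus task wait "$task_id"
    status=$?
    "$GC_STOP"
    return "$status"
}

# Example call (placeholder endpoint UUIDs and paths):
# sync_and_stop "SRC_ENDPOINT_UUID:/home/shared/" "DST_ENDPOINT_UUID:/export/archive01/CACProjectName/path/"
```

Stopping the client in both the failure and the success branch avoids leaving GlobusConnect running after a failed submission.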
 
 
==Windows==
 
 
===Assumption===
 
 
The client endpoint -- the one containing the resources to be transferred to the CAC endpoint -- is active.
 
 
 
===Explanation===
 
 
We're going to use the command-line interface (CLI) to Globus, which basically means logging into their dedicated server over SSH and sending commands. The CLI is detailed [//support.globus.org/forums/22861518-Command-Line-Interface here].
 
 
===Setup===
 
 
:* Download the latest version of PuTTY (some older versions won't work), including PuTTYGen and plink (the Windows installer contains them all: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html )
 
:* Launch PuTTYGen
 
:** Make sure the "SSH2-RSA" radio button is selected, and a key length of at least 2048 in the box below that
 
:** Click the "Generate" button. You'll need to keep moving the cursor over the blank grey area to generate randomness
 
:** You don't want to use a passphrase!
 
:** Save the keys:
 
:*** the private key should be called something like <privateKeyName>_id_rsa.ppk and stored somewhere safe but accessible to the scheduled task
 
:*** The public key can be saved, or you can just copy it to the clipboard from the box where it is displayed in clear text
 
:* With the public key in the clipboard, go to the Globus website, click on your account name at top-right, and select "manage identities"
 
:** Select "Add linked identity" and pick "Add SSH public key"
 
:** Paste the public key into the box for it and give the key a name
 
:** Click "Submit"
 
   
 
:*Create a connection:
 
:** Start PuTTY
 
:** In the "Session" tab:
 
:*** put <accountName>@cli.globusonline.org in the Host Name box
 
:*** in the "Saved Sessions" textbox give the session a name; I use "globusSync" and we'll refer to this as <sessionName>
 
:** On the Connection > SSH > Auth tab, for "Private Key for authentication" click "browse" and select the file in which you wrote your private key
 
:** Back on the "Session" tab, click "save"; your session name should now appear on the list of saved sessions
 
:** You can now test that it works: double-click the saved session name and accept the server key; you should then find yourself in an ssh session
 
 
 
You'll use plink to send the sync command (or any other Globus CLI commands you want to use). Depending on whether plink is on your PATH, you may wish to use the full path to the plink executable (for example, C:\Program Files (x86)\PuTTY\plink.exe) when you set this up as a scheduled task. The basic command, to run on the command line, is this:
 
 
    "C:\Program Files (x86)\PuTTY\plink.exe" <sessionName> transfer -s 2 --preserve-mtime --verify-checksum -- <accountName>#<localEndpointName>/<path> cac#archive01/export/archive01/<CACProjectName>/<path> -r
 
 
You should test that it works by calling up cmd.exe and executing it.
 
 
===Creating the scheduled task===
 
 
:* Start up the task scheduler
 
:* Select "Create task" from the "Actions" tab
 
:* Give the task a name, select a user identity as which this should run (ensuring it has access to the ssh session and key information you set up) and select the "run whether user is logged on or not" radio button (note that this doesn't fix the issue of the endpoint going down if the owning user isn't logged on). If only local resources will be required you can select to not store password details
 
:** On the "Triggers" tab
 
:*** Select "New"
 
:*** Select "run on a schedule" from the drop-down, and select when you want it to run and on what cadence
 
:*** Select the checkbox for "Enabled" (important!)
 
:** On the "Actions" tab
 
:*** Select "New"
 
:*** Select "Start a program" from the drop-down
 
:*** For "Program name" put the full path to plink.exe, enclosed in double quotes if it contains a space, e.g.:
 
    "C:\Program Files (x86)\PuTTY\plink.exe"
 
:*** For "arguments" enter:
 
    <sessionName> transfer -s 2 --preserve-mtime --verify-checksum -- <globusAccountName>#<localEndpointName>/<path> cac#archive01/export/archive01/<CACProjectName>/<path> -r
 
:* Accept the other defaults, and click "OK". You'll have to enter the Windows credentials for the account under which the process will run if you didn't select the option not to store the password.
 

Revision as of 15:34, 31 March 2020

This is a user-level guide for syncing a directory to CAC Archival Storage using Globus.

Prerequisites

  • You know how to log into Globus.
  • You are a user of a CAC project with archival storage service enabled. In this document, <CACUser> denotes your CAC user name and <CACProject> denotes your CAC project name.
  • On the host from which you want to run sync commands (either one-time or regularly scheduled), install the Globus CLI client.
  • If the source directory is not located on an existing Globus endpoint, install Globus Connect Personal for Linux, MacOS, or Windows on the host where the source directory is located.

Setup

On the host from which you want to run sync commands:

  • Log into Globus using Globus CLI:
 $ globus login
 Please authenticate with Globus here:
 ------------------------------------
 https://auth.globus.org/v2/oauth2/authorize?...........
 ------------------------------------
 
 Enter the resulting Authorization Code here: 
Copy and paste the URL into a web browser. Log into Globus as instructed in the web browser. After logging in, copy and paste the code back into the terminal.
You have successfully logged in to the Globus CLI!

You can check your primary identity with
  globus whoami

For information on which of your identities are in session use
  globus session show

Log out of the Globus CLI with
  globus logout
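
For regularly scheduled sync commands, the identity check above can double as a guard at the top of a script. The sketch below assumes that <code>globus whoami</code> exits with a non-zero status when no login is stored; the function name <code>require_globus_login</code> is a hypothetical helper, not part of the Globus CLI.

```shell
#!/bin/bash
# Sketch: refuse to start a scheduled sync when the Globus CLI has no login.
# Assumes `globus whoami` returns a non-zero exit status when logged out.
require_globus_login() {
    if ! globus whoami >/dev/null 2>&1; then
        echo "Globus CLI is not logged in; run 'globus login' first." >&2
        return 1
    fi
}

# Example use at the top of a sync script:
# require_globus_login || exit 1
```

Failing early with a clear message is easier to spot in cron mail or a log file than a transfer command that errors out later.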