- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Data Transfer with GridFTP: Difference between revisions

From HLRS Platforms
(Reworked documentation.)
= Introduction =
== Introduction ==


For transferring large amounts of data, the simple FTP protocol cannot
For transferring large amounts of data, the simple FTP protocol cannot fully exploit high bandwidth connections (especially when they have high latencies, like intra- or international Wide Area Networks (WANs)). For this task, an extension has been defined: GridFTP. It supports parallel TCP streams and multi-node transfers (also known as ''Striping'') to achieve a high data rate on high bandwidth connections (even with high latencies). Furthermore, transfers can be restarted and third-party transfers can be established, which means that a third party (i.e. the user) can initiate and control a transfer between two GridFTP servers.
utilize high bandwidth channels. For this task, an extension has been
defined: GridFTP supports parallel TCP streams and multi-node transfers
to achieve a high data rate via high bandwidth connections.  
Furthermore, transfers can be restarted and third-party transfers can be  
established. This means one can initiate transfers between two end hosts
that are mediated by a third party.  


GridFTP has a typical client/server architecture, where the server stores  
GridFTP has a typical client/server architecture, where the server stores the data or has access to the data and where the client downloads/uploads data or controls a server-to-server transfer in a third-party transfer as described above. The Globus Toolkit includes a simple GridFTP client - <code>globus-url-copy</code> - which is described in more detail below. On top of that there is <code>gtransfer</code>, a more user-friendly tool with additional features, which is also described in more detail below.
the data or has access to the data. A simple GridFTP client - globus-url-copy - is provided by the Globus Toolkit.  


At HLRS, a dedicated GridFTP server is running which has access to the relevant file systems. This server cannot be accessed directly but has to be controlled by a GridFTP client, which means it has to be used in a third-party transfer.
At HLRS, dedicated GridFTP servers are available for use which have access to the high-performance file systems of the Hazelhen and Laki supercomputers at HLRS. These servers can be used with a GridFTP client. Usually these GridFTP servers are used in third-party transfers, where users download/upload data from/to another GridFTP server e.g. at their home institution. There are two ways to conduct third-party transfers with our GridFTP servers: Either you use the pre-installed GridFTP clients on our Hazelhen frontend nodes or you install GridFTP clients somewhere else outside the HLRS network, for example at your home institution.


There are two places from where you can conduct third-party transfers by the GridFTP client: Either you use the GridFTP client which is pre-installed at our Hazelhen frontend nodes or you install the GridFTP client somewhere else outside the HLRS network, for example at your home institution.


= Requirements =
== Prerequisites for using our GridFTP servers ==
 
* '''A personal X.509 certificate.''' For accessing our GridFTP servers and performing your data transfers with GridFTP you need a GSI proxy credential (GPC) signed by your personal X.509 certificate. Please see [http://toolkit.globus.org/toolkit/docs/6.0/gsic/key/index.html "Key concepts of GSI security"] for more information about GSI proxy certificates. This means that you first need a personal X.509 certificate signed by your organization or institute. In addition, the source and destination GridFTP services must be able to verify your GPC to enable the data transfer. By default, a GPC derived from a personal X.509 certificate issued by one of the grid certificate authorities (CAs) that are members of the [https://www.igtf.net/ IGTF] or their affiliated registration authorities (RAs) is required for data transfers. Please contact your IT department on how to acquire such a personal X.509 certificate.
 
* '''The distinguished name (DN) of your X.509 certificate.''' After receiving your personal X.509 certificate you need to forward the certificate's DN to the HLRS personnel in order to activate access to our GridFTP servers. To determine the DN you can use the following openssl command on your personal X.509 certificate:


* A personal X509 certificate. (Further information: http://en.wikipedia.org/wiki/X.509 , http://www.eugridpma.org/members/worldmap/ , http://www.prace-project.eu/Certificates-FAQ?lang=en)
* The DN of this certificate has to be extracted and sent to us. One can extract the DN by the following command:
<pre>
<pre>
openssl x509 -subject -in <USERCERT> -noout | sed -e 's/subject= //'
$ openssl x509 -noout -subject -in <YOUR_PERSONAL_X509_CERTIFICATE_FILE>
</pre>  
</pre>
* A Linux System with the GridFTP client installed (which is installed on the HLRS frontend nodes)


= Pre-installed GridFTP client on the Hazelhen frontend nodes =
* '''A Linux System with a GridFTP client installed''' (e.g. one of the Hazelhen frontend nodes)
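The DN extraction described above can be wrapped in a small helper. This is only a sketch: the function name <code>strip_subject_prefix</code> and the file name <code>usercert.pem</code> are assumptions for illustration, and the exact formatting of the subject line depends on your OpenSSL version.

```shell
# strip_subject_prefix: hypothetical helper that removes the leading
# "subject=" label (and any blanks directly after it) from the output
# of "openssl x509 -noout -subject", leaving only the DN.
strip_subject_prefix() {
    sed -e 's/^subject= *//'
}

# Intended usage (assumes your certificate is stored as usercert.pem):
#   openssl x509 -noout -subject -in usercert.pem | strip_subject_prefix
```

The resulting line is the DN you forward to the HLRS personnel.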


* Create a directory ''.globus/'' in your homedir and place both the certificate and your keyfile into this directory


* In the above directory, create another directory ''certificates/'' and place all the CA files there. These files can be found at e.g. [https://winnetou.surfsara.nl/prace/certs/globuscerts.tar.gz here] as a tarball- just untar them into the above directory. These files are needed to later verify your certificate against the Certificate Authority.
=== Further information on X.509 certificates ===


* Load the module ''tools/globus-gridftp-client'' on the Hazelhen frontend node you are currently logged in.
* http://en.wikipedia.org/wiki/X.509
* http://www.eugridpma.org/members/worldmap/
* http://www.prace-project.eu/Certificates-FAQ?lang=en


== Pre-installed GridFTP client on the Hazelhen frontend nodes ==
* Create a GSI proxy credential (GPC) locally at your workstation with either <code>grid-proxy-init</code> (requires installation of Globus packages or manual compilation and installation of the Globus Toolkit, see below) or <code>genproxy</code> (just requires the Bash shell and OpenSSL). Afterwards copy the resulting GPC (usually named "x509up_u<UID>") to your home directory at HLRS with scp and configure the environment variable <code>X509_USER_PROXY</code> with the path to your GPC (<code>$</code> denotes a user prompt, user and host names are symbolic!):
<pre>
<pre>
hpcxxxx@eslogin008:~> module load tools/globus-gridftp-client
user@local:~$ genproxy
Your identity: /C=DE/O=GridGermany/OU=Universitaet Stuttgart/OU=[..]/CN=[...]
Enter pass phrase for /home/user/.globus/userkey.pem:
Your proxy `/tmp/x509up_u1234' is valid until: Fri May 19 11:16:36 CEST 2017
 
user@local:~$ scp /tmp/x509up_u1234 user@hazelhen.hww.de:X509_USER_PROXY
 
user@local:~$ ssh user@hazelhen.hww.de
 
user@hazelhen:~$ export X509_USER_PROXY="$HOME/X509_USER_PROXY"
</pre>
 
* To use <code>gtransfer</code>, load the <code>tools/gtransfer</code> module (which automatically loads all prerequisite modules) on the Hazelhen frontend node you are currently logged in to (<code>$</code> denotes a user prompt, user and host names are symbolic!):
<pre>
user@hazelhen:~$ module load tools/gtransfer
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)
 
To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.
 
Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):
 
```
$ export X509_USER_PROXY="/path/to/gpc"
```
 
Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:
 
```
$ module load tools/gtransfer
```
 
load tgftp 0.7.0 (PATH, MANPATH)
In addition to the manual pages (man {tgftp|tgftp_log}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/tgftp/0.7.0/share/doc/README).
load gtransfer 0.8.1 (PATH, MANPATH)
Bash completion loaded: press the TAB key for completion.
In addition to the manual pages (man {gtransfer|gt|dparam|dpath|halias|gcat|gls|gmkdir|gmv|grm}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/gtransfer/0.8.1/README.md).
</pre>


  **** Globus GridFTP Client ****
* To use <code>globus-url-copy</code> alone, load the module <code>tools/globus-gridftp-client</code> on the Hazelhen frontend node you are currently logged in to (<code>$</code> denotes a user prompt, user and host names are symbolic!):
<pre>
user@hazelhen:~$ module load tools/globus-gridftp-client
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)


  Initialisation of the Globus GridFTP client for fast data transfer.
To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.


  Usage
Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):


  When the module is loaded, you can use the Globus GridFTP client with the command
```
      globus-url-copy
$ export X509_USER_PROXY="/path/to/gpc"
  For data transfer, simply issue the command
```
      globus-url-copy <optional command line switches> <gsiftp://<server adress>:<port> | file://><absolute path> <gsiftp://<server adress>:<port> | file://><absolute path>
  With the parameter -p <p> you can specify the degree of parallelism of your transfer
  Before data transfer, please issue the command
      grid-proxy-init
  to get a proxy certificate


  For more information on the GridFTP client and HPSS see:
Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:


  https://kb.hlrs.de/platforms/index.php/Data_Transfer_with_GridFTP
```
$ module load tools/gtransfer
```
</pre>
</pre>


= Installing the GridFTP client at your home institution =
== Installing the GridFTP client at your home institution ==


== Installation & Configuration ==
* Since version 5.2 of the Globus Toolkit, the GridFTP client is also available as pre-compiled RPM (for '''Red Hat Enterprise Linux 6 and 7''', '''CentOS 6 and 7''', '''Scientific Linux 6 and 7''' and possibly others) or DEB (for '''Debian GNU/Linux 7, 8 and 9''' and '''Ubuntu Linux 14.04 LTS, 16.04 LTS, 16.10 and 17.04''') package. Install the GridFTP client by following the instructions in the [http://toolkit.globus.org/toolkit/docs/6.0/admin/install/index.html Globus Toolkit 6.0 documentation]. If a pre-compiled package is available, it is usually named <code>globus-gass-copy-progs</code>; for source installs, <code>make gridftp</code> will include it. Be sure to also install the <code>grid-proxy-init</code> tool - included in the <code>globus-proxy-utils</code> package or in an installation from source with <code>make gridftp</code> - or just use the <code>genproxy</code> tool mentioned above. Only one of these tools is required for the creation of GSI proxy credentials.


* Since version 5.2, the GridFTP client is also available as an rpm or deb package. Install the GridFTP client by following the instructions on [http://toolkit.globus.org/toolkit/docs/6.0/admin/install/index.html this page]. Be sure to have "globus-proxy-utils" as well (if the client is compiled from source, this is included).
* Create a directory <code>.globus</code> in your home directory and place both your personal X.509 certificate (as <code>usercert.pem</code>) and your private key file (as <code>userkey.pem</code>) there. Alternatively you can also place a PKCS#12 keystore (as <code>usercred.p12</code>) there - the Firefox web browser for example exports user certificates and keys as PKCS#12 keystore.


* Create a directory ''.globus/'' in your homedir and place both the certificate and your keyfile into this directory  
* Additionally create another directory named <code>certificates</code> in <code>.globus</code> and place all the trusted CA certificates there. A collection suitable for use with the Globus Toolkit is provided by SURFsara as a [https://winnetou.surfsara.nl/prace/certs/globuscerts.tar.gz tarball] - download and untar it into the above directory. The included files are needed to authenticate remote entities (i.e. GridFTP servers).


* In the above directory, create another directory ''certificates/'' and place all the CA files there. These files can be found at e.g. [https://winnetou.surfsara.nl/prace/certs/globuscerts.tar.gz here] as a tarball- just untar them into the above directory. These files are needed to later verify your certificate against the Certificate Authority.
* Run <code>grid-proxy-init</code> or <code>genproxy</code> to verify the validity of your personal X.509 certificate and to create a GSI proxy credential signed by your personal X.509 certificate with a default lifetime of 12 hours. This step has to be repeated after the created GSI proxy credential has expired.
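Before starting a transfer it can help to check which proxy file the tools will pick up. The following is a minimal sketch, not part of the Globus Toolkit: the function names are hypothetical, and it assumes that the proxy is taken from <code>X509_USER_PROXY</code> if set and otherwise from <code>/tmp/x509up_u<UID></code>, and that the proxy file should not be accessible by group or others.

```shell
# proxy_path: print the proxy file location, assuming X509_USER_PROXY
# takes precedence over the default /tmp/x509up_u<UID>.
proxy_path() {
    printf '%s\n' "${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
}

# proxy_is_private: succeed only if the file exists and grants no
# permissions to group or others (mode characters 5-10 of "ls -l").
proxy_is_private() {
    [ -f "$1" ] || return 1
    [ "$(ls -ld -- "$1" | cut -c5-10)" = "------" ]
}

# Intended usage:
#   proxy_is_private "$(proxy_path)" || echo "check your proxy file"
```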


* run


<pre> grid-proxy-init </pre>
== Usage ==
 
This tool verifies the validity of your certificate and creates a proxy, that is internally needed by the GridFTP client. This step has to be repeated before the usage. If something like  
 
=== Workspaces ===
 
The paths to your workspaces are identical on supercomputers and GridFTP servers. To get the path of a specific workspace, first log in to the respective supercomputer frontend(s), then determine the name of the workspace you want to use and then enter <code>ws_find <WORKSPACE_NAME></code> to get the actual path to this specific workspace. More information about workspaces at HLRS can be found in the [https://kb.hlrs.de/platforms/index.php/Workspace_mechanism platforms wiki].
 
 
=== gtransfer (gt) ===
 
* Type <code>gt</code> and hit the ENTER/RETURN key to get a brief usage message. Use <code>gt --help</code> and <code>man gt</code> to get a description of all gt options.
* To start a transfer, build the command line step by step with the help of the gt bash completion:
** Enter <code>gt</code>, hit the SPACE key and then hit the TAB key three times. You'll get a listing of all available options.
** Start with <code>-s</code> to enter the source address. The <code>-</code> character was already provided by the gt bash completion. After entering <code>s</code> hit the SPACE key and enter your source address, e.g. <code>gsiftp://gridftp.domain.tld:2811</code>. You can also hit the TAB key two times to get the preconfigured GridFTP source server addresses or [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/host-aliases.md host aliases].
** Add the path to your desired workspace just like on the supercomputer frontends (e.g. <code>/lustre/cray/ws8/ws/user-workspace/</code>) and then hit the TAB key two to three times to get a listing of the files and directories in your workspace directory on the remote server. Depending on the latency and the number of files present there, it can take a few seconds until you see results. This will only work if your GSI proxy certificate is considered valid by the remote GridFTP server and you are trying to list a directory where you have <code>rx</code> (read and execute) permissions.
** Type in the beginning of your desired file or directory and hit the TAB key to complete the name. If you want to copy all files in a directory, add <code>/*</code> or just <code>/</code> to the end of the path.
** Now continue with the destination address. Add <code>-d</code> to the command line, hit the SPACE key and continue with the destination address just like you entered the source address. Enter a <code>/</code> at the end of the destination path.
* To recursively copy all files and directories below a given directory, add the <code>-r</code> option to the gt command line.
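The rule for copying a whole directory (source path ending in <code>/</code>) can be sketched as a small helper. The function name <code>gt_dir_source</code> is an assumption made for this illustration and is not part of gtransfer; the workspace path is the symbolic one used on this page.

```shell
# gt_dir_source: hypothetical helper that combines a host alias and an
# absolute directory path into a gt source address ending in exactly
# one "/", so that all files in the directory are selected.
gt_dir_source() {
    # $1 = host alias without the trailing colon, $2 = absolute path
    printf '%s:%s/\n' "$1" "${2%/}"
}

# Intended usage on a frontend (assumes ws_find prints the workspace path):
#   gt -r -s "$(gt_dir_source hazelhen "$(ws_find <WORKSPACE_NAME>)")" \
#      -d gsiftp://gridftp.domain.tld:2811/~/
```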


Example:
<pre>
<pre>
Your identity: <YourDNhere>
$ gt <TAB>
Creating proxy ............................................... Done
 
Your proxy is valid until: Wed Apr 18 22:25:32 2012
$ gt -
 
$ gt -<TAB><TAB>
 
$ gt -
--                      --configfile            --gt-max-retries        -m                      -s                      --verbose
-a                      -d                      --gt-progress-indicator  --metric                --source                --version
--auto-clean            --destination            --guc-max-retries        --no-sync                --sync-level           
--auto-optimize          -e                      --help                  -o                      --transfer-list         
-c                      --encrypt-data-channel  -l                      -r                      -v                     
--checksum-data-channel  -f                      --logfile                --recursive              -V
 
$ gt -s <TAB><TAB>
 
$ gt -s
hazelhen:  laki:
 
$ gt -s h<TAB>
 
$ gt -s hazelhen:
 
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/<TAB>
 
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
 
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file<TAB><TAB>
 
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
hazelhen:/lustre/cray/ws8/ws/user-workspace/file1  hazelhen:/lustre/cray/ws8/ws/user-workspace/file2  hazelhen:/lustre/cray/ws8/ws/user-workspace/file3
 
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file* -d gsiftp://gridftp.domain.tld:2811/~/
</pre>
</pre>


shows up, everything is installed correctly.
==== Hints ====




=== Firewall issues ===
===== I have multiple user accounts at a remote GridFTP server. How can I choose a specific account? =====


Because there is a distinction between control and data connection, some ports of the firewall on the client side have to be opened:
This can be done by inserting a <code><USER>@</code> portion into your GridFTP URLs or prefixing host aliases with <code><USER>@</code>. Replace <code><USER></code> with your desired username on the remote site.
* Port 2812 for the control channel to the frontend node gridftp-fr1.hww.de (HAZELHEN) or gridftp-fr2.hww.de (LAKI)
* Ports 20000-20500 for data channels to the backend node (hostnames on request). These ports have to be opened for both incoming and outgoing connections.


Moreover, you have to set the following environment variables to instruct your client to use the specified ports:
Examples:
<pre>
 
export GLOBUS_TCP_PORT_RANGE=20000,20500
* GridFTP URL:
export GLOBUS_TCP_SOURCE_RANGE=20000,20500
<code>gsiftp://gridftp.domain.tld:2811/[...]/files/*</code> => <code>gsiftp://user1@gridftp.domain.tld:2811/[...]/files/*</code>
</pre>  
* Host alias: 
<code>my-gridftp:/[...]/files/</code> => <code>user1@my-gridftp:/[...]/files/</code>
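The two rewriting rules from the examples above can be sketched as a shell function. The name <code>with_user</code> is hypothetical and only serves this illustration; <code>user1</code>, <code>gridftp.domain.tld</code> and <code>my-gridftp</code> are the symbolic names used on this page.

```shell
# with_user: hypothetical helper that inserts "<USER>@" into a GridFTP
# URL (after the gsiftp:// scheme) or prefixes a host alias with it.
with_user() {
    user=$1
    target=$2
    case "$target" in
        gsiftp://*)
            # URL form: put the user name right after the scheme
            printf 'gsiftp://%s@%s\n' "$user" "${target#gsiftp://}"
            ;;
        *)
            # host alias form: simply prefix the alias
            printf '%s@%s\n' "$user" "$target"
            ;;
    esac
}
```

For example, <code>with_user user1 my-gridftp:/data/</code> prints <code>user1@my-gridftp:/data/</code>.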
 
 
===== Can gtransfer automatically create non-existing directories on the destination side? =====
 
Yes, this is possible and activated by default. Just enter the desired name or path in your destination URL and gtransfer will automatically create non-existing directories on the destination side (with the help of [http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#globus-url-copy globus-url-copy]).
 
 
===== Use host aliases for your GridFTP servers =====
 
There are already two host aliases defined which point to the two GridFTP servers at HLRS:
 
* <code>hazelhen:</code>  
* <code>laki:</code>


Please refer to this document for further information: http://www.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-config-client-firewall
You can use them instead of the longer host part of a GridFTP URL in the source and destination URLs, e.g. you can use:


= Usage =
* <code>hazelhen:/lustre/cray/ws8/ws/user-workspace</code> instead of
* <code>gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace</code>


Suppose you have either loaded the module ''tools/globus-gridftp-client'' on the Hazelhen frontend node or you have installed the GridFTP client on a machine outside the HLRS.
To create your own host aliases, please refer to the host aliases documentation linked below.


First, run
===== What if the gtransfer command fails during a data transfer? =====
[http://toolkit.globus.org/toolkit/docs/latest-stable/gridftp/user/#globus-url-copy Globus-url-copy] - the tool gtransfer actually uses through [https://github.com/fr4nk5ch31n3r/tgftp tgftp] to transfer data - is configured by gtransfer to retry the transfer of files that failed to transfer successfully to the destination GridFTP server. And if that fails, gtransfer will retry the whole process three times until giving up on the transfer. And even if that happens, you can later continue a failed or interrupted transfer by simply issuing the very same gtransfer command. Gtransfer stores state information about a transfer in your home directory below <code>.gtransfer</code>. So this mechanism will work in the same home directory and with the same user account and as long as the state files are not touched in between.


<pre> grid-proxy-init </pre>
===== What if I need to interrupt a data transfer? =====
   
   
This tool verifies the validity of your certificate and creates a proxy, that is internally needed by the GridFTP client. This step has to be repeated before the usage. If something like
You can always interrupt a gtransfer data transfer by hitting CTRL+C during a data transfer, which effectively sends a <code>SIGINT</code> to the gtransfer process group and interrupts the data transfer. You can continue the transfer from where it was interrupted by issuing the very same gtransfer command - as with failed transfers described above. The same restrictions - same host, same user account, no fiddling with the state files in between - apply here.


<pre>
Your identity: <YourDNhere>
Creating proxy ............................................... Done
Your proxy is valid until: Wed Apr 18 22:25:32 2012
</pre>


shows up, a proxy certificate has been set up properly.
==== Documentation ====


Then, transfers can be started.


See http://toolkit.globus.org/toolkit/docs/6.0/appendices/commands/index.html#globus-url-copy
===== General =====
for details of the globus-url-copy tool


The basic syntax is:
* [https://github.com/fr4nk5ch31n3r/gtransfer/ gtransfer GitHub repository and README]


<pre>globus-url-copy [optional command line switches] source destination</pre>


where source and destination can be further resolved to
===== Man pages =====
<pre>globus-url-copy [optional command line switches] [gsiftp://<server address>:<port> | file://]<absolute path> [gsiftp://<server address>:<port> | file://]<absolute path></pre>


Files on remote systems are referenced by ''gsiftp://'' whereas local files are referenced by ''file://''. Always use absolute paths.
Man(ual) pages are also available locally on the Hazelhen frontends. Simply enter <code>man</code> and the name of the man page (e.g. <code>gtransfer</code> or <code>dpath</code>) to read a specific page. If man pages with the same name exist in different sections, you also have to specify the section number after the <code>man</code> command but before the name of the man page. E.g. to read the <code>dparam(5)</code> man page - which contains the file format description for dparams - you would enter <code>man 5 dparam</code>.


To access files on '''HAZELHEN''', use:<br>
*server address: ''gridftp-fr1.hww.de''<br>
*port: ''2812''<br>


To access files on '''LAKI''', use:<br>
====== Section 1 ======
*server address: ''gridftp-fr2.hww.de''<br>
*port: ''2812'' <br>


* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/gtransfer.1.md gtransfer(1)]
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/dparam.1.md dparam(1)]
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/dpath.1.md dpath(1)]
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/halias.1.md halias(1)]


For the referenced directories, you have to specify the absolute path to your
workspace.  If you are logged into HAZELHEN, you can find out about your
workspace with the command


<pre>ws_list</pre>
====== Section 5 ======


that lists all your available workspaces. Your workspace will reside in
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/dparam.5.md dparam(5)]
a directory like
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/dpath.5.md dpath(5)]


<pre>/univ_1/ws1/ws/<username-name></pre>


Suppose your workspace directory is
===== Special functionality =====


<pre>/univ_1/ws1/ws/foo-test-0</pre>
* [https://github.com/fr4nk5ch31n3r/gtransfer/blob/master/share/doc/host-aliases.md Host aliases]


and you want to copy files from this workspace to the home directory of
the machine you are currently  logged in, perform these commands:


<pre>grid-proxy-init
=== globus-url-copy (aka Globus GridFTP client (GGC)) ===


globus-url-copy -tcp-bs 4000000 -p 8 gsiftp://gridftp-fr1.hww.de:2812/univ_1/ws1/ws/foo-test-0/file  file:///home/foo/file
* Type <code>globus-url-copy</code> and hit the ENTER/RETURN key to get a brief usage message. Use <code>globus-url-copy -help</code> and <code>man globus-url-copy</code> to get a description of all globus-url-copy options.
</pre>


If you want to copy files from this workspace to another machine that also runs a GridFTP server, say in the PRACE network, perform these commands:
* The basic syntax is:


<pre>grid-proxy-init
<pre>globus-url-copy [optional command line switches] source destination</pre>


globus-url-copy -tcp-bs 4000000 -p 8 gsiftp://gridftp-fr1.hww.de:2812/univ_1/ws1/ws/foo-test-0/file  gsiftp://juqueen1p.fz-juelich.de:2812/~
* Source and destination can be further resolved to:
</pre>


It may be necessary to experiment with the parameters (e.g. the number of parallel streams and the TCP buffer size) to achieve optimal performance.
<pre>globus-url-copy [optional command line switches] {gsiftp://<server address>:<port> | file://}<absolute path> {gsiftp://<server address>:<port> | file://}<absolute path></pre>


=== Some important parameters ===
* Files on remote systems can be referenced by <code>gsiftp://</code> URLs whereas local files have to be referenced by <code>file://</code> URLs. The usage of gtransfer host aliases is not supported by globus-url-copy, hence you need to enter the server addresses and ports manually. Use the following table for reference:


{|border="1" cellpadding="2"
{| class="wikitable"
|-
!Host
|'''Parameter'''||'''Description'''
!Server address
|-
!Port
| -help|| Prints out a detailed list of parameters and their descriptions
|-
| -vb||Verbose mode, show more information: number of bytes transferred, performance since the last update (every 5 seconds) and average performance for the whole transfer
|-
| -dbg|| Debug mode, gives detailed information for debugging
|-
|-
| -p || Specifies the number of the parallel streams.
|Hazelhen
|gridftp-fr1.hww.de
|2812
|-
|-
| -tcp-bs || Specifies the size (in bytes) of the  TCP buffer to be used by the underlying ftp data channels. Please note that while higher values yield better performance, many parallel streams (high p) together with large buffer sizes could drive the systems out of memory.
|Laki
|gridftp-fr2.hww.de
|2812
|}
|}


= Further Information =
Example:
<pre>
$ globus-url-copy -cc 2 -tcp-bs 4M -p 2 -cd gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace/file* gsiftp://gridftp.domain.tld:2811/~/
</pre>
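Because globus-url-copy does not understand host aliases, the full URLs have to be assembled by hand. The following sketch shows one way to do that; both function names are hypothetical helpers for this illustration, and <code>guc_dry_run</code> only prints a command line with the <code>-p</code> and <code>-tcp-bs</code> switches instead of starting a transfer.

```shell
# gsiftp_url: hypothetical helper that builds a gsiftp:// URL from a
# server address, a port and an absolute path (see the table above).
gsiftp_url() {
    printf 'gsiftp://%s:%s%s\n' "$1" "$2" "$3"
}

# guc_dry_run: hypothetical helper that prints (but does not run) a
# globus-url-copy command line.
#   $1 = number of parallel streams, $2 = TCP buffer size,
#   $3 = source URL, $4 = destination URL
guc_dry_run() {
    printf 'globus-url-copy -p %s -tcp-bs %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Intended usage:
#   src=$(gsiftp_url gridftp-fr1.hww.de 2812 /lustre/cray/ws8/ws/user-workspace/file1)
#   guc_dry_run 2 4M "$src" gsiftp://gridftp.domain.tld:2811/~/
```

Printing the command first makes it easy to check the URLs before a long-running transfer is started.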
 
 
==== Documentation ====
 
See the [http://toolkit.globus.org/toolkit/docs/6.0/gridftp/user/index.html#globus-url-copy Globus Toolkit documentation on globus-url-copy] for more details about this tool.
 
 
== Further Information ==
 
* http://www.globus.org/toolkit/docs/latest-stable/gridftp/ - Official documentation
* http://www.prace-ri.eu/Data-Transfer-with-GridFTP-Details - Intended for PRACE users, but could also be helpful to others
* http://www.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-config-client-firewall - Firewall issues


* http://www.globus.org/toolkit/docs/latest-stable/gridftp/ Official documentation
* http://www.prace-ri.eu/Data-Transfer-with-GridFTP-Details Intended for PRACE users, but could also be helpful to others
* http://www.globus.org/toolkit/docs/latest-stable/gridftp/user/#gridftp-user-config-client-firewall Firewall issues


== Support ==  
== Support ==  


[http://www.hlrs.de/organization/people/schembera/ Björn Schembera] [mailto:schembera@hlrs.de schembera@hlrs.de]
* [http://www.hlrs.de/about-us/organization/people/person/schembera/ Björn Schembera] [mailto:schembera@hlrs.de schembera@hlrs.de]

Revision as of 12:46, 6 July 2017

Introduction

For transferring large amounts of data, the simple FTP protocol cannot fully exploit high bandwidth connections (especially when they have high latencies, like intra- or international Wide Area Networks (WANs)). For this task, an extension has been defined: GridFTP. It supports parallel TCP streams and multi-node transfers (also known as Striping) to achieve a high data rate on high bandwidth connections (even with high latencies). Furthermore, transfers can be restarted and third-party transfers can be established, which means that a third party (i.e. the user) can initiate and control a transfer between two GridFTP servers.

GridFTP has a typical client/server architecture, where the server stores the data or has access to the data and where the client downloads/uploads data or controls a server-to-server transfer in a third-party transfer as described above. The Globus Toolkit includes a simple GridFTP client - globus-url-copy - which is described in more detail below. On top of that there is gtransfer, a more user-friendly tool with additional features, which is also described in more detail below.

At HLRS, dedicated GridFTP servers are available for use which have access to the high-performance file systems of the Hazelhen and Laki supercomputers at HLRS. These servers can be used with a GridFTP client. Usually these GridFTP servers are used in third-party transfers, where users download/upload data from/to another GridFTP server e.g. at their home institution. There are two ways to conduct third-party transfers with our GridFTP servers: Either you use the pre-installed GridFTP clients on our Hazelhen frontend nodes or you install GridFTP clients somewhere else outside the HLRS network, for example at your home institution.


Prerequisites for using our GridFTP servers

  • A personal X.509 certificate. For accessing our GridFTP servers and performing your data transfers with GridFTP you need a GSI proxy credential (GPC) signed by your personal X.509 certificate. Please see "Key concepts of GSI security" for more information about GSI proxy certificates. This means that you first need a personal X.509 certificate signed by your organization or institute. In addition, the source and destination GridFTP services must be able to verify your GPC to enable the data transfer. By default, a GPC derived from a personal X.509 certificate issued by one of the grid certificate authorities (CAs) that are members of the IGTF or their affiliated registration authorities (RAs) is required for data transfers. Please contact your IT department on how to acquire such a personal X.509 certificate.
  • The distinguished name (DN) of your X.509 certificate. After receiving your personal X.509 certificate you need to forward the certificate's DN to the HLRS personnel in order to activate access to our GridFTP servers. To determine the DN you can use the following openssl command on your personal X.509 certificate:
$ openssl x509 -noout -subject -in <YOUR_PERSONAL_X509_CERTIFICATE_FILE>
  • A Linux System with a GridFTP client installed (e.g. one of the Hazelhen frontend nodes)
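
Before forwarding the DN, it can help to confirm that the certificate file is readable and has not expired. The following is a minimal sketch that relies only on standard openssl x509 options (-subject, -enddate, -checkend); the function name show_cert_info and the example path are assumptions made for illustration.

```shell
# show_cert_info is a hypothetical helper name, not part of any toolkit.
# It prints the subject (DN) and expiry date of a PEM-encoded X.509
# certificate and reports whether it stays valid for at least one more day.
show_cert_info() {
    cert_file="$1"
    if [ ! -r "$cert_file" ]; then
        echo "cannot read certificate: $cert_file" >&2
        return 1
    fi
    # -subject prints the DN, -enddate prints the notAfter date
    openssl x509 -noout -subject -enddate -in "$cert_file" || return 1
    # -checkend N exits with status 0 if the certificate does not expire
    # within the next N seconds (86400 s = 24 h)
    if openssl x509 -noout -checkend 86400 -in "$cert_file" >/dev/null; then
        echo "certificate is valid for at least 24 more hours"
    else
        echo "certificate expires within 24 hours or has already expired"
    fi
}

# Example call (the path is an assumption):
# show_cert_info "$HOME/.globus/usercert.pem"
```

Depending on the OpenSSL version, the subject may be printed in a comma-separated form rather than the slash-separated form shown elsewhere on this page.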


Further information on X.509 certificates


Pre-installed GridFTP client on the Hazelhen frontend nodes

  • Create a GSI proxy credential (GPC) locally at your workstation with either grid-proxy-init (requires installation of Globus packages or manual compilation and installation of the Globus Toolkit, see below) or genproxy (just requires the Bash shell and OpenSSL). Afterwards copy the resulting GPC (usually named "x509up_u<UID>") to your home directory at HLRS with scp and configure the environment variable X509_USER_PROXY with the path to your GPC ($ denotes a user prompt, user and host names are symbolic!):
user@local:~$ genproxy
Your identity: /C=DE/O=GridGermany/OU=Universitaet Stuttgart/OU=[..]/CN=[...]
Enter pass phrase for /home/user/.globus/userkey.pem:
Your proxy `/tmp/x509up_u1234' is valid until: Fri May 19 11:16:36 CEST 2017

user@local:~$ scp /tmp/x509up_u1234 user@hazelhen.hww.de:X509_USER_PROXY

user@local:~$ ssh user@hazelhen.hww.de

user@hazelhen:~$ export X509_USER_PROXY="$HOME/X509_USER_PROXY"
  • To use gtransfer, load the tools/gtransfer module (which automatically loads all prerequisite modules) on the Hazelhen frontend node you are currently logged in to ($ denotes a user prompt, user and host names are symbolic!):
user@hazelhen:~$ module load tools/gtransfer
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)

To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.

Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):

```
$ export X509_USER_PROXY="/path/to/gpc"
```

Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:

```
$ module load tools/gtransfer
```

load tgftp 0.7.0 (PATH, MANPATH)
In addition to the manual pages (man {tgftp|tgftp_log}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/tgftp/0.7.0/share/doc/README).
load gtransfer 0.8.1 (PATH, MANPATH)
Bash completion loaded: press the TAB key for completion.
In addition to the manual pages (man {gtransfer|gt|dparam|dpath|halias|gcat|gls|gmkdir|gmv|grm}), there is also a longer README file available (less /sw/hazelhen/hlrs/tools/gtransfer/0.8.1/README.md).
  • To use globus-url-copy alone, load the module tools/globus-gridftp-client on the Hazelhen frontend node you are currently logged in to ($ denotes a user prompt, user and host names are symbolic!):
user@hazelhen:~$ module load tools/globus-gridftp-client
load globus-gridftp-client gt-6.0.1478289945 (PATH, MANPATH, GLOBUS_LOCATION, GLOBUS_TCP_PORT_RANGE, GLOBUS_TCP_SOURCE_RANGE, X509_CERT_DIR, LD_LIBRARY_PATH)

To make use of the Globus GridFTP client (GGC) you need a GSI proxy credential (GPC)
that authenticates you against the involved GridFTP servers.

Create your GPC at your local workstation and copy it to this system (e.g. via scp).
Then make it known to the Globus tools with ($ is the prompt and not part of the
command!):

```
$ export X509_USER_PROXY="/path/to/gpc"
```

Although you can use the GGC alone to transfer files via GridFTP, we strongly
recommend to use gtransfer - a more advanced GridFTP client on top of GGC, tgftp and
uberftp - instead. To use it, simply load its modulefile with:

```
$ module load tools/gtransfer
```
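
After copying the GPC to Hazelhen, a quick sanity check can save a failed transfer attempt. The sketch below assumes the proxy file has the usual layout (proxy certificate first, followed by its private key and the user certificate), so that openssl x509 reads the proxy certificate itself. The function name check_proxy is hypothetical, and the default path follows the scp example above.

```shell
# check_proxy is a hypothetical helper name. It uses $X509_USER_PROXY if set,
# otherwise the path used in the scp example ($HOME/X509_USER_PROXY).
check_proxy() {
    proxy="${X509_USER_PROXY:-$HOME/X509_USER_PROXY}"
    if [ ! -f "$proxy" ]; then
        echo "no proxy credential found at $proxy" >&2
        return 1
    fi
    # The proxy file contains a private key, so restrict access to the owner
    chmod 600 "$proxy"
    # -checkend 3600 succeeds only if the first certificate in the file
    # (assumed to be the proxy certificate) stays valid for one more hour
    if openssl x509 -noout -checkend 3600 -in "$proxy" >/dev/null 2>&1; then
        echo "proxy is valid for at least one more hour"
    else
        echo "proxy is expired, expires within the hour, or is unreadable" >&2
        return 1
    fi
}
```

If the check fails, create a fresh GPC on your workstation and copy it over again.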

Installing the GridFTP client at your home institution

  • Since version 5.2 of the Globus Toolkit, the GridFTP client is also available as a pre-compiled RPM package (for Red Hat Enterprise Linux 6 and 7, CentOS 6 and 7, Scientific Linux 6 and 7 and possibly others) or DEB package (for Debian GNU/Linux 7, 8 and 9 and Ubuntu Linux 14.04 LTS, 16.04 LTS, 16.10 and 17.04). Install the GridFTP client by following the instructions in the Globus Toolkit 6.0 documentation: if a pre-compiled package is available, it is usually named globus-gass-copy-progs; for source installs, make gridftp will include it. Be sure to also install the grid-proxy-init tool, which is included in the globus-proxy-utils package or in an installation from source with make gridftp, or just use the genproxy tool mentioned above. Only one of these tools is required for the creation of GSI proxy credentials.
  • Create a directory .globus in your home directory and place both your personal X.509 certificate (as usercert.pem) and your private key file (as userkey.pem) there. Alternatively you can also place a PKCS#12 keystore (as usercred.p12) there - the Firefox web browser for example exports user certificates and keys as PKCS#12 keystore.
  • Additionally create another directory named certificates in .globus and place all the trusted CA certificates there. A collection suitable for use with the Globus Toolkit is provided by SURFsara as a tarball - download and untar it into the above directory. The included files are needed to authenticate remote entities (i.e. GridFTP servers).
  • Run grid-proxy-init or genproxy to verify the validity of your personal X.509 certificate and to create a GSI proxy credential signed by your personal X.509 certificate with a default lifetime of 12 hours. This step has to be repeated after the created GSI proxy credential has expired.
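
The directory layout described in the steps above can be sketched as follows. This is only an illustration: setup_globus_dir is a hypothetical helper, and the source file names passed to it are placeholders for wherever your certificate and key currently reside. The permissions reflect the common convention that the private key must be readable by its owner only; Globus tools typically refuse a key with more permissive modes.

```shell
# setup_globus_dir is a hypothetical helper, not part of the Globus Toolkit.
# Usage: setup_globus_dir <certificate.pem> <private-key.pem> [target-dir]
setup_globus_dir() {
    cert_src="$1"
    key_src="$2"
    globus_dir="${3:-$HOME/.globus}"

    # Never overwrite an existing private key silently
    if [ -e "$globus_dir/userkey.pem" ]; then
        echo "$globus_dir/userkey.pem already exists, not overwriting" >&2
        return 1
    fi

    # certificates/ holds the trusted CA certificates
    # (untar the CA collection into this directory afterwards)
    mkdir -p "$globus_dir/certificates" || return 1

    cp "$cert_src" "$globus_dir/usercert.pem" || return 1
    cp "$key_src" "$globus_dir/userkey.pem" || return 1

    # The certificate may be world-readable, the private key must not be
    chmod 644 "$globus_dir/usercert.pem"
    chmod 400 "$globus_dir/userkey.pem"
}

# Example call (file names are assumptions):
# setup_globus_dir ./mycert.pem ./mykey.pem
```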


Usage

Workspaces

The paths to your workspaces are identical on supercomputers and GridFTP servers. To get the path of a specific workspace, first login to the respective supercomputer frontend(s), then determine the workspace name of the workspace you want to use and then enter ws_find <WORKSPACE_NAME> to get the actual path to this specific workspace. More information about workspaces at HLRS can be found in the platforms wiki.
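
Because the paths are identical on both sides, the output of ws_find can be used directly as the path part of a gtransfer source or destination address. The following sketch assumes that ws_find prints just the workspace path on standard output. The helper name ws_source_address and the workspace name in the example call are made up for illustration.

```shell
# ws_source_address is a hypothetical helper.
# Usage: ws_source_address <WORKSPACE_NAME> [host-alias]
# Prints an address such as hazelhen:/path/to/workspace/
ws_source_address() {
    ws_name="$1"
    host_alias="${2:-hazelhen}"
    if ! command -v ws_find >/dev/null 2>&1; then
        echo "ws_find not found; run this on a supercomputer frontend" >&2
        return 1
    fi
    # Assumption: ws_find prints the absolute path of the named workspace
    ws_path="$(ws_find "$ws_name")" || return 1
    # Strip a possible trailing slash, then add exactly one
    printf '%s:%s/\n' "$host_alias" "${ws_path%/}"
}

# Example call (the workspace name is an assumption):
# ws_source_address my-workspace
```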


gtransfer (gt)

  • Type gt and hit the ENTER/RETURN key to get a brief usage message. Use gt --help and man gt to get a description of all gt options.
  • To start a transfer, enter gt, hit the SPACE key and then hit the TAB key three times to make use of the gt bash completion. You'll get a listing of all available options. Start with -s to enter the source address; the - character has already been provided by the gt bash completion. After entering s, hit the SPACE key and enter your source address, e.g. gsiftp://gridftp.domain.tld:2811. You can also hit the TAB key twice to get the preconfigured GridFTP source server addresses or host aliases. Add the path to your desired workspace just like on the supercomputer frontends (e.g. /lustre/cray/ws8/ws/user-workspace/) and then hit the TAB key two to three times to get a listing of the files and directories in your workspace directory on the remote server. Depending on the latency and the number of files present there, it can take a few seconds until you see results. The listing only works if the remote GridFTP server considers your GSI proxy credential valid and you have rx (read and execute) permissions on the directory you are trying to list. Type the beginning of your desired file or directory name and hit the TAB key to complete it. If you want to copy all files in a directory, add /* or just / to the end of the path. Now continue with the destination address: add -d to the command line, hit the SPACE key and enter the destination address just like you entered the source address. End the destination path with a /.
  • To recursively copy all files and directories below a given directory, add the -r option to the gt command line.

Example:

$ gt <TAB>

$ gt -

$ gt -<TAB><TAB>

$ gt -
--                       --configfile             --gt-max-retries         -m                       -s                       --verbose
-a                       -d                       --gt-progress-indicator  --metric                 --source                 --version
--auto-clean             --destination            --guc-max-retries        --no-sync                --sync-level             
--auto-optimize          -e                       --help                   -o                       --transfer-list          
-c                       --encrypt-data-channel   -l                       -r                       -v                       
--checksum-data-channel  -f                       --logfile                --recursive              -V

$ gt -s <TAB><TAB>

$ gt -s
hazelhen:  laki:

$ gt -s h<TAB>

$ gt -s hazelhen:

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/<TAB>

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file<TAB><TAB>

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
hazelhen:/lustre/cray/ws8/ws/user-workspace/file1  hazelhen:/lustre/cray/ws8/ws/user-workspace/file2  hazelhen:/lustre/cray/ws8/ws/user-workspace/file3

$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file* -d gsiftp://gridftp.domain.tld:2811/~/
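
For a recursive copy of a whole directory tree, the same pattern applies with the -r option added. The sketch below is illustrative only: the directory name results is an assumption, and gridftp.domain.tld is the symbolic host used throughout this page. The command is printed first so that it can be reviewed and later reissued unchanged, and it is executed only where gt is available and a proxy credential is configured.

```shell
# Directory names are assumptions; the destination host is symbolic.
src="hazelhen:/lustre/cray/ws8/ws/user-workspace/results/"
dst="gsiftp://gridftp.domain.tld:2811/~/results/"

# Print the command for review
echo "gt -r -s $src -d $dst"

# Run it only if gt is on the PATH and X509_USER_PROXY is set
if command -v gt >/dev/null 2>&1 && [ -n "${X509_USER_PROXY:-}" ]; then
    gt -r -s "$src" -d "$dst"
fi
```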

Hints

I have multiple user accounts at a remote GridFTP server. How can I choose a specific account?

This can be done by inserting a <USER>@ portion into your GridFTP URLs or prefixing host aliases with <USER>@. Replace <USER> with your desired username on the remote site.

Examples:

  • GridFTP URL:

gsiftp://gridftp.domain.tld:2811/[...]/files/* => gsiftp://user1@gridftp.domain.tld:2811/[...]/files/*

  • Host alias:

my-gridftp:/[...]/files/ => user1@my-gridftp:/[...]/files/
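
The rewriting rule for both address forms can be sketched as a small shell function. The name with_user is hypothetical, and the path /data/files/ stands in for the elided path in the examples above.

```shell
# with_user is a hypothetical helper that prefixes a GridFTP URL or a
# gtransfer host alias address with "<USER>@".
# Usage: with_user <USER> <address>
with_user() {
    user="$1"
    address="$2"
    case "$address" in
        gsiftp://*)
            # Insert the user name between the scheme and the host
            printf 'gsiftp://%s@%s\n' "$user" "${address#gsiftp://}"
            ;;
        *)
            # Host alias form, e.g. my-gridftp:/path
            printf '%s@%s\n' "$user" "$address"
            ;;
    esac
}

with_user user1 "gsiftp://gridftp.domain.tld:2811/data/files/*"
# prints: gsiftp://user1@gridftp.domain.tld:2811/data/files/*
with_user user1 "my-gridftp:/data/files/"
# prints: user1@my-gridftp:/data/files/
```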


Can gtransfer automatically create non-existing directories on the destination side?

Yes, this is possible and activated by default. Just enter the desired name or path in your destination URL and gtransfer will automatically create non-existing directories on the destination side (with the help of globus-url-copy).


Use host aliases for your GridFTP servers

There are already two host aliases defined which point to the two GridFTP servers at HLRS:

  • hazelhen:
  • laki:

You can use them instead of the longer host part of a GridFTP URL in the source and destination URLs, e.g. you can use:

  • hazelhen:/lustre/cray/ws8/ws/user-workspace instead of
  • gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace

To create your own host aliases, please refer to the host aliases documentation linked below.

What if the gtransfer command fails during a data transfer?

globus-url-copy, the tool gtransfer actually uses (through tgftp) to transfer data, is configured by gtransfer to retry files that failed to transfer to the destination GridFTP server. If that fails as well, gtransfer retries the whole process three times before giving up on the transfer. Even then, you can later continue a failed or interrupted transfer by simply issuing the very same gtransfer command. gtransfer stores state information about a transfer in your home directory below .gtransfer, so this mechanism works only with the same home directory and user account, and only as long as the state files have not been modified in the meantime.

What if I need to interrupt a data transfer?

You can always interrupt a gtransfer data transfer by hitting CTRL+C during a data transfer, which effectively sends a SIGINT to the gtransfer process group and interrupts the data transfer. You can continue the transfer from where it was interrupted by issuing the very same gtransfer command - as with failed transfers described above. The same restrictions - same host, same user account, no fiddling with the state files in between - apply here.


Documentation

General


Man pages

Man(ual) pages are also available locally on the Hazelhen frontends. Simply enter man and the name of the man page (e.g. gtransfer or dpath) to read a specific page. If man pages with the same name exist in different sections, you also have to specify the section number after the man command but before the name of the man page. For example, to read the dparam(5) man page, which contains the file format description for dparams, you would enter man 5 dparam.


Section 1


Section 5


Special functionality


globus-url-copy (aka Globus GridFTP client (GGC))

  • Type globus-url-copy and hit the ENTER/RETURN key to get a brief usage message. Use globus-url-copy -help and man globus-url-copy to get a description of all globus-url-copy options.
  • The basic syntax is:
globus-url-copy [optional command line switches] source destination
  • Source and destination can be further resolved to:
globus-url-copy [optional command line switches] {gsiftp://<server address>:<port> | file://}<absolute path> {gsiftp://<server address>:<port> | file://}<absolute path>
  • Files on remote systems can be referenced by gsiftp:// URLs whereas local files have to be referenced by file:// URLs. The usage of gtransfer host aliases is not supported by globus-url-copy, hence you need to enter the server addresses and ports manually. Use the following table for reference:
Host      Server address      Port
Hazelhen  gridftp-fr1.hww.de  2812
Laki      gridftp-fr2.hww.de  2812

Example:

$ globus-url-copy -cc 2 -tcp-bs 4M -p 2 -cd gsiftp://gridftp-fr1.hww.de:2812/lustre/cray/ws8/ws/user-workspace/file* gsiftp://gridftp.domain.tld:2811/~/
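
In the example above, -cc 2 requests two concurrent file transfers, -tcp-bs 4M sets the TCP buffer size, -p 2 uses two parallel data streams per transfer, and -cd creates missing destination directories. Because host aliases are not available here, it can be convenient to build the gsiftp:// URLs from the table with a small helper. The following is a sketch only: gridftp_url and local_url are hypothetical function names, and the command is printed rather than executed.

```shell
# gridftp_url is a hypothetical helper that maps the host names from the
# table above to gsiftp:// URLs.
# Usage: gridftp_url <hazelhen|laki> <absolute path>
gridftp_url() {
    case "$1" in
        hazelhen) server="gridftp-fr1.hww.de" ;;
        laki)     server="gridftp-fr2.hww.de" ;;
        *) echo "unknown host: $1" >&2; return 1 ;;
    esac
    printf 'gsiftp://%s:2812%s\n' "$server" "$2"
}

# local_url is a hypothetical helper; local files need file:// URLs
# with an absolute path.
local_url() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "path must be absolute: $1" >&2; return 1 ;;
    esac
}

src="$(gridftp_url hazelhen '/lustre/cray/ws8/ws/user-workspace/file*')"
dst="gsiftp://gridftp.domain.tld:2811/~/"

# Print the resulting command line for review
echo "globus-url-copy -cc 2 -tcp-bs 4M -p 2 -cd $src $dst"
```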


Documentation

See the Globus Toolkit documentation on globus-url-copy for more details about this tool.


Further Information


Support