- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -
Data Transfer with GridFTP
Revision as of 12:39, 14 September 2021
Introduction
For transferring large amounts of data, the simple FTP protocol cannot fully exploit high-bandwidth connections, especially when they have high latencies, as national or international Wide Area Networks (WANs) do. For this task, an extension has been defined: GridFTP. It supports parallel TCP streams and multi-node transfers (also known as striping) to achieve a high data rate on high-bandwidth connections, even with high latencies. Furthermore, transfers can be restarted and third-party transfers can be established, which means users can initiate transfers between two GridFTP servers that are controlled by a third party (i.e. the user).
GridFTP has a typical client/server architecture: the server stores the data or has access to it, and the client downloads or uploads data, or controls a server-to-server transfer in a third-party transfer as described above. The Globus Toolkit includes a simple GridFTP client, globus-url-copy, which is described in more detail below. On top of that there is gtransfer, a more user-friendly tool with additional features, which is also described in more detail below.
At HLRS, dedicated GridFTP servers are available which have access to the high-performance file system of the Hawk supercomputer. These servers can be used with a GridFTP client, which needs to be installed by the user.
Prerequisites for using our GridFTP servers
- A personal X.509 certificate. For accessing our GridFTP servers and performing your data transfers with GridFTP you need a GSI proxy credential (GPC) signed by your personal X.509 certificate. Please see "Key concepts of GSI security" for more information about GSI proxy certificates. This means that you first need a personal X.509 certificate signed by your organization or institute. In addition, the source and destination GridFTP services must be able to verify your GPC to enable the data transfer. By default, a GPC derived from a personal X.509 certificate issued by one of the grid certificate authorities (CAs) that are members of the IGTF, or by one of their affiliated registration authorities (RAs), is required for data transfers. Please contact your IT department on how to acquire such a personal X.509 certificate.
- The distinguished name (DN) of your X.509 certificate. After receiving your personal X.509 certificate you need to forward the certificate's DN to the HLRS personnel in order to activate access to our GridFTP servers. To determine the DN you can use the following openssl command on your personal X.509 certificate:
$ openssl x509 -noout -subject -in <YOUR_PERSONAL_X509_CERTIFICATE_FILE>
- A Linux system with a GridFTP client installed.
Further information on X.509 certificates
Installing the GridFTP client at your home institution
- Since version 5.2 of the Globus Toolkit, the GridFTP client is also available as a pre-compiled RPM package (for Red Hat Enterprise Linux 6 and 7, CentOS 6 and 7, Scientific Linux 6 and 7 and possibly others) or DEB package (for Debian GNU/Linux 7, 8 and 9 and Ubuntu Linux 14.04 LTS, 16.04 LTS, 16.10 and 17.04). Install the GridFTP client by following the instructions in the Grid Community Toolkit documentation. If a pre-compiled package is available, it is usually named globus-gass-copy-progs; for source installs, make gridftp will include it. Be sure to also install the grid-proxy-init tool, which is included in the globus-proxy-utils package or in an installation from source with make gridftp, or just use the genproxy tool. Only one of these tools is required for the creation of GSI proxy credentials.
- Create a directory .globus in your home directory and place both your personal X.509 certificate (as usercert.pem) and your private key file (as userkey.pem) there. To create these files from a PKCS#12 keystore, follow these instructions but use the names from above for the destination files. When using grid-proxy-init to create a GSI proxy credential, you can also place a PKCS#12 keystore (as usercred.p12) there; the Firefox web browser, for example, exports user certificates and keys as a PKCS#12 keystore.
- Additionally create another directory named certificates in .globus and place all the trusted CA certificates there. A collection suitable for use with the Globus Toolkit is provided by SURFsara as a tarball; download and untar it into the above directory. The included files are needed to authenticate remote entities (i.e. GridFTP servers).
- Run grid-proxy-init or genproxy to verify the validity of your personal X.509 certificate and to create a GSI proxy credential signed by your personal X.509 certificate. The default lifetime is 12 hours for grid-proxy-init and 24 hours for genproxy. This step has to be repeated after the created GSI proxy credential has expired.
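The directory layout from the steps above can be sketched as a small shell function. This is only a sketch: the function name setup_globus_dir and its argument are hypothetical, and the file modes reflect the common convention that the private key must be readable by its owner only, which is an assumption here rather than something this page states.

```shell
# Sketch: prepare the credential directory layout described above.
# The function name and its argument are hypothetical.
setup_globus_dir() (
    globus_dir=$1

    # Create <dir> and <dir>/certificates (for the trusted CA certificates).
    mkdir -p "$globus_dir/certificates" || exit 1

    # Assumption: the certificate may be world-readable, but the private
    # key should be readable by its owner only.
    if [ -f "$globus_dir/usercert.pem" ]; then
        chmod 644 "$globus_dir/usercert.pem" || exit 1
    fi
    if [ -f "$globus_dir/userkey.pem" ]; then
        chmod 400 "$globus_dir/userkey.pem" || exit 1
    fi
)

# Typical use (not executed here):
#   setup_globus_dir "$HOME/.globus"
#   grid-proxy-init        # or: genproxy
```

The function only creates directories and adjusts permissions of files that are already present. Copying the certificate, the key and the CA certificates into place remains a manual step.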
Usage
Workspaces
The paths to your workspaces are identical on supercomputers and GridFTP servers. To get the path of a specific workspace, first log in to the respective supercomputer frontend(s), then determine the name of the workspace you want to use and enter ws_find <WORKSPACE_NAME> to get the actual path to this specific workspace. More information about workspaces at HLRS can be found in the platforms wiki.
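Because the workspace path is the same on the frontends and on the GridFTP servers, the output of ws_find can be turned directly into a GridFTP URL. The following is a minimal sketch to be run on a supercomputer frontend. The function name workspace_url is hypothetical, and the host and port in the usage comment are taken from the server table further down this page.

```shell
# Sketch: build a gsiftp:// URL for a workspace from the output of ws_find.
# workspace_url is a hypothetical helper name.
workspace_url() (
    host=$1
    port=$2
    ws_name=$3

    # Ask the workspace tools for the actual path of the workspace.
    ws_path=$(ws_find "$ws_name") || exit 1
    [ -n "$ws_path" ] || exit 1

    # Strip one trailing slash (if any), then append exactly one.
    printf 'gsiftp://%s:%s%s/\n' "$host" "$port" "${ws_path%/}"
)

# Typical use on a frontend (not executed here):
#   workspace_url gridftp-fr1.hww.de 2812 <WORKSPACE_NAME>
```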
gtransfer (gt)
- Type gt and hit the ENTER/RETURN key to get a brief usage message. Use gt --help and man gt to get a description of all gt options.
- To start a transfer, enter gt, hit the SPACE key and then hit the TAB key three times to make use of the gt bash completion. You'll get a listing of all available options. Start with -s to enter the source address. The - character was already provided by the gt bash completion. After entering s, hit the SPACE key and enter your source address, e.g. gsiftp://gridftp.domain.tld:2811. You can also hit the TAB key two times to get the preconfigured GridFTP source server addresses or host aliases. Add the path to your desired workspace just like on the supercomputer frontends (e.g. /lustre/cray/ws8/ws/user-workspace/) and then hit the TAB key two to three times to get a listing of the files and directories in your workspace directory on the remote server. Depending on the latency and the number of files present there, it can take a few seconds until you see results. This only works if your GSI proxy certificate is considered valid by the remote GridFTP server and you are trying to list a directory where you have rx (read and execute) permissions. Type the beginning of your desired file or directory name and hit the TAB key to complete it. If you want to copy all files in a directory, add /* or just / to the end of the path. Now continue with the destination address. Add -d to the command line, hit the SPACE key and continue with the destination address just like you entered the source address. Enter a / at the end of the destination path.
- To recursively copy all files and directories below a given directory, add the -r option to the gt command line.
Example:
$ gt <TAB>
$ gt -
$ gt -<TAB><TAB>
$ gt -
-- --configfile --gt-max-retries -m -s --verbose
-a -d --gt-progress-indicator --metric --source --version
--auto-clean --destination --guc-max-retries --no-sync --sync-level
--auto-optimize -e --help -o --transfer-list
-c --encrypt-data-channel -l -r -v
--checksum-data-channel -f --logfile --recursive -V
$ gt -s <TAB><TAB>
$ gt -s
hazelhen: laki:
$ gt -s h<TAB>
$ gt -s hazelhen:
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/<TAB>
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file<TAB><TAB>
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file
hazelhen:/lustre/cray/ws8/ws/user-workspace/file1 hazelhen:/lustre/cray/ws8/ws/user-workspace/file2 hazelhen:/lustre/cray/ws8/ws/user-workspace/file3
$ gt -s hazelhen:/lustre/cray/ws8/ws/user-workspace/file* -d gsiftp://gridftp.domain.tld:2811/~/
Hints
I have multiple user accounts at a remote GridFTP server. How can I choose a specific account?
This can be done by inserting a <USER>@ portion into your GridFTP URLs or by prefixing host aliases with <USER>@. Replace <USER> with your desired username on the remote site.
Examples:
- GridFTP URL: gsiftp://gridftp.domain.tld:2811/[...]/files/* => gsiftp://user1@gridftp.domain.tld:2811/[...]/files/*
- Host alias: my-gridftp:/[...]/files/ => user1@my-gridftp:/[...]/files/
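The rewriting rule shown in the examples above can be expressed as a small shell helper. This is an illustrative sketch only: with_user is a hypothetical name and not part of gtransfer, and the [...] in the usage comment is the same elision as in the examples, not a literal path.

```shell
# Sketch: insert a <USER>@ portion into a GridFTP URL or host alias.
# with_user is a hypothetical helper, not part of gtransfer.
with_user() (
    user=$1
    target=$2

    case $target in
        gsiftp://*)
            # GridFTP URL: put the user name right after the scheme.
            rest=${target#gsiftp://}
            printf 'gsiftp://%s@%s\n' "$user" "$rest"
            ;;
        *)
            # Host alias: simply prefix it with the user name.
            printf '%s@%s\n' "$user" "$target"
            ;;
    esac
)

# Typical use (quote the argument so the shell does not expand the *):
#   gt -s "$(with_user user1 'gsiftp://gridftp.domain.tld:2811/[...]/files/*')" -d <DESTINATION>
```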
Can gtransfer automatically create non-existing directories on the destination side?
Yes, this is possible and activated by default. Just enter the desired name or path in your destination URL and gtransfer will automatically create non-existing directories on the destination side (with the help of globus-url-copy).
Use host aliases for your GridFTP servers
To create your own host aliases, please refer to the host aliases documentation linked below.
What if the gtransfer command fails during a data transfer?
globus-url-copy, the tool gtransfer actually uses (through tgftp) to transfer data, is configured by gtransfer to retry the transfer of files that failed to reach the destination GridFTP server. If that fails, gtransfer will retry the whole process three times before giving up on the transfer. Even if that happens, you can later continue a failed or interrupted transfer by simply issuing the very same gtransfer command. gtransfer stores state information about a transfer in your home directory below .gtransfer. This mechanism therefore works only in the same home directory, with the same user account, and as long as the state files are not touched in between.
What if I need to interrupt a data transfer?
You can always interrupt a gtransfer data transfer by hitting CTRL+C during the transfer, which effectively sends a SIGINT to the gtransfer process group and interrupts the data transfer. You can continue the transfer from where it was interrupted by issuing the very same gtransfer command, as with failed transfers described above. The same restrictions (same host, same user account, no fiddling with the state files in between) apply here.
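Since a failed or interrupted transfer is continued by issuing the very same command again, the re-issuing can be scripted. The wrapper below is a hypothetical sketch, not a gtransfer feature. It assumes that the wrapped command returns a non-zero exit status on failure, and the retry count is an arbitrary choice.

```shell
# Sketch: re-issue the very same command until it succeeds or the
# attempt limit is reached. retry_same_command is a hypothetical helper.
retry_same_command() (
    max_attempts=$1
    shift

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Run the command exactly as given, so that stored state
        # information can be matched to the same transfer.
        if "$@"; then
            exit 0
        fi
        echo "attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    exit 1
)

# Typical use (not executed here):
#   retry_same_command 3 gt -s <SOURCE> -d <DESTINATION>
```

Run the wrapper on the same host and with the same user account each time, and leave the state files below .gtransfer untouched between attempts.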
Documentation
General
Man pages
Man(ual) pages are available as a part of the software installation. Simply enter man and the name of the man page (e.g. gtransfer or dpath) to read a specific page. If man pages with the same name exist in different sections, you also have to specify the section number after the man command but before the name of the man page. For example, to read the dparam(5) man page, which contains the file format description for dparams, you would enter man 5 dparam.
Section 1
Section 5
Special functionality
globus-url-copy (aka Globus GridFTP client (GGC))
- Type globus-url-copy and hit the ENTER/RETURN key to get a brief usage message. Use globus-url-copy -help and man globus-url-copy to get a description of all globus-url-copy options.
- The basic syntax is:
globus-url-copy [optional command line switches] source destination
- Source and destination can be further resolved to:
globus-url-copy [optional command line switches] {gsiftp://<server address>:<port> | file://}<absolute path> {gsiftp://<server address>:<port> | file://}<absolute path>
- Files on remote systems can be referenced by gsiftp:// URLs, whereas local files have to be referenced by file:// URLs. The usage of gtransfer host aliases is not supported by globus-url-copy, hence you need to enter the server addresses and ports manually. Use the following table for reference:
Server address | Port |
---|---|
gridftp-fr1.hww.de | 2812 |
gridftp-fr2.hww.de | 2812 |
Example:
$ globus-url-copy -cc 2 -tcp-bs 4M -p 2 -cd gsiftp://gridftp-fr1.hww.de:2812/lustre/hpe/ws10/ws10.3/ws/user-workspace/file* gsiftp://gridftp.domain.tld:2811/~/
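In the example above, -cc sets the number of concurrent connections, -p the number of parallel streams per transfer, -tcp-bs the TCP buffer size, and -cd creates missing destination directories. For downloads to the local machine, the destination has to be a file:// URL built from an absolute path. The helper below is a minimal sketch of that rule. The name local_file_url and the local path in the usage comment are hypothetical.

```shell
# Sketch: turn an absolute local path into a file:// URL and reject
# relative paths. local_file_url is a hypothetical helper name.
local_file_url() (
    path=$1

    case $path in
        /*)
            printf 'file://%s\n' "$path"
            ;;
        *)
            echo "error: '$path' is not an absolute path" >&2
            exit 1
            ;;
    esac
)

# Typical use (not executed here; /home/user/data/ is a placeholder):
#   globus-url-copy -cd gsiftp://gridftp-fr1.hww.de:2812<absolute path> "$(local_file_url /home/user/data/)"
```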
Documentation
See the Grid Community Toolkit documentation on globus-url-copy for more details.
man globus-url-copy
Further Information
- https://gridcf.org/gct-docs/latest/gridftp/index.html - Official documentation
- https://gridcf.org/gct-docs/latest/gridftp/user/index.html#gridftp-user-quickstart-config - Firewall issues