- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

Workspace migration
{{Warning
| text = This page describes the necessary steps to migrate workspaces to another workspace filesystem.<br/>
On 18th October 2024 all data on the old Hawk ws10 filesystems will be deleted. Follow the guide below to transfer your data to the ws11 workspace filesystems.
}}


== User migration to new workspaces ==




On Hawk, the workspace filesystem ws11 will become the new default workspace filesystem. The current default workspace filesystem (ws10) will be shut down on 18th October 2024; the policy settings of ws10 will be transferred to the ws11 filesystem. Users therefore have to migrate their workspaces located on the ws10 filesystems onto the ws11 filesystem. Run the command ''ws_list -a'' on a frontend system to display the paths of all your workspaces. If a path matches one of the mount points listed in the following table, that workspace needs to be migrated to the ws11 filesystem.
 


{| class="wikitable"
|-
! File System
! mounted on
|-
| ws10.0
| /lustre/hpe/ws10/ws10.0
|-
| ws10.1
| /lustre/hpe/ws10/ws10.1
|-
| ws10.2
| /lustre/hpe/ws10/ws10.2
|-
| ws10.3
| /lustre/hpe/ws10/ws10.3
|-
| ws10.3P
| /lustre/hpe/ws10/ws10.3P
|}


== Before you start ==


Migration of large amounts of data consumes a lot of IO resources. '''Please review and remove data you no longer need, or move it into [[High_Performance_Storage_System_(HPSS)| HPSS]].'''


== How to proceed / Time schedule for the gradual shutdown of ws10 ==


* <font color=red>2024-07-22 10:00</font>:
** The default workspace filesystem will be switched from ws10 to ws11. ws_* tools applied to ws10 then require the additional option <tt>-F <workspacefilesystem></tt>.
** ws11 will get the [[Workspace_migration#Operation_/_Policies_of_the_workspaces_on_ws11: |same policy settings]] as ws10.
** <font color=red>Quota limit settings for each user group will be moved from ws10 to ws11 and quota enforcement will be enabled on ws11! (Batch jobs will not be scheduled if your group has exceeded its quota limit.)</font>
** Max. number of extensions for workspaces on ws10 will be reduced from 3 to 1 (ws_allocate, ws_extend).
** Max. duration for workspaces on ws10 will be reduced from 60 days to 40 days (ws_allocate, ws_extend).


* <font color=red>2024-08-19</font>:
** Max. duration for workspaces on ws10 will be reduced from 40 days to 10 days (ws_allocate).
** Deactivation of ws_extend for ws10.


* <font color=red>2024-09-09</font>:
** Max. duration for workspaces on ws10 will be reduced from 10 days to 3 days (ws_allocate).


* <font color=red>2024-10-18</font>:
** Start removing all remaining data located on ws10.
** Final shutdown.


=== Important remarks ===
* If you have to migrate data residing in workspaces from one filesystem to another, do not use the ''mv'' command to transfer the data. For large amounts of data this will fail due to time limits. For e.g. millions of small files or large amounts of data, we currently recommend running the following command inside a single-node batch job: ''rsync -a --hard-links Old_ws/ new_ws/''
* Preferably, use the [[Workspace_migration#Using_mpifileutils_for_data_transfer | mpifileutils '''dcp''' or '''dsync''']].
* Take care when you create new batch jobs. If you have to migrate your workspace from an old filesystem to the new location, this takes time. Do not run any job while the migration is in progress; this may result in inconsistent data.


== Operation / Policies of the workspaces on ws11: ==


* No job of any user-group member will be scheduled for computation as long as the group quota is exceeded.
* Accounting.
* Max. lifetime of a workspace is currently 60 days.
* Default lifetime of a workspace is 1 day.
* Max. number of workspace extensions is 3.
* Please read the related man pages or the online [[Workspace_mechanism | workspace mechanism document]].<BR>
: In particular, note that the workspace tools allow you to explicitly address a specific workspace filesystem using the <tt>-F</tt> option (e.g. <tt>ws_allocate -F ws11.0 my_workspace 10</tt>).
* To list your available workspace filesystems, use <tt>ws_list -l</tt>.
* Users can restore expired workspaces using ''ws_restore''.
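A typical migration session with the workspace tools might look like the following sketch (the workspace name ''my_workspace'' and the duration are placeholders; these commands are only available on the HLRS systems):

 # allocate a new workspace on ws11 for 10 days
 ws_allocate -F ws11.0 my_workspace 10
 # list all workspaces and their paths
 ws_list -a
 # after the data has been transferred and verified, release the old workspace
 ws_release -F ws10.0 my_workspace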


Please read https://kb.hlrs.de/platforms/index.php/Storage_usage_policy
 
== Using mpifileutils for data transfer ==
The mpifileutils suite provides MPI-based tools to handle typical jobs like copy, remove, and compare for large datasets, providing speedups of up to 50x compared to single-process jobs. It can only be run on compute nodes via mpirun.

dcp and dsync are similar to cp -r and rsync; simply provide a source directory and a destination, and dcp / dsync will recursively copy the source directory to the destination in parallel.

dcp / dsync have a number of useful options; use dcp -h or dsync -h to see a description, or consult the [https://mpifileutils.readthedocs.io/en/v0.11.1/ User Guide].


They should be invoked via mpirun.

We highly recommend using dcp / dsync with an empty ~/.profile and ~/.bashrc only! Furthermore, take care that only the following modules are loaded when using mpifileutils (this can be achieved by logging into the system without modifying the list of modules and loading only the modules openmpi and mpifileutils): <br>
1) system/site_names <br>
2) system/ws/8b99237 <br>
3) system/wrappers/1.0 <br>
4) hlrs-software-stack/current <br>
5) gcc/10.2.0 <br>
6) openmpi/4.1.4 <br>
7) mpifileutils/0.11 <br>


=== dcp ===
Parallel MPI application to recursively copy files and directories.

dcp is a file copy tool in the spirit of cp(1) that evenly distributes the work of scanning the directory tree and copying file data across a large cluster without any centralized state. It is designed for copying files that are located on a distributed parallel file system, and it splits large file copies across multiple processes.

Run '''dcp''' with the '''-p''' option to preserve permissions, timestamps, and ownership.<br>

'''-p''': preserve permissions, timestamps, and ownership

'''--chunksize C''': copy files larger than C bytes in C-byte chunks (default is 4MB)

We highly recommend using the '''-p''' option.
 
=== dsync ===
Parallel MPI application to synchronize two files or two directory trees.

dsync makes DEST match SRC, adding missing entries to DEST and updating existing entries in DEST as necessary, so that SRC and DEST have identical content, ownership, timestamps, and permissions.

'''--chunksize C''': copy files larger than C bytes in C-byte chunks (default is 4MB)


=== Job Script example ===
Here is an example of a job script.

You have to change SOURCEDIR and TARGETDIR according to your setup. The number of nodes and the wallclock time should also be adjusted.


 #!/bin/bash
 #PBS -N parallel-copy
 #PBS -l select=2:node_type=rome:mpiprocs=128
 #PBS -l walltime=00:20:00
 
 module load openmpi mpifileutils
 
 SOURCEDIR=<YOUR SOURCE DIRECTORY HERE>
 TARGETDIR=<YOUR TARGET DIRECTORY HERE>
 
 sleep 5
 nodes=$(cat $PBS_NODEFILE | sort -u | wc -l)
 let cores=nodes*20
 
 time_start=$(date +%s)
 #mpirun -np $cores dcp -p --bufsize 8MB ${SOURCEDIR}/ ${TARGETDIR}/
 mpirun -np $cores dsync --bufsize 8MB $SOURCEDIR $TARGETDIR
 time_end=$(date +%s)
 
 (( total_time=$time_end-$time_start ))
 echo "Total runtime in seconds: $total_time"
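After the copy has finished, the mpifileutils suite also provides a parallel compare tool, ''dcmp'', which can be used to check that source and destination match (sketch, using the same environment and variables as the job script above):

 mpirun -np $cores dcmp $SOURCEDIR $TARGETDIR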
