- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CAE utilities


A collection of small helper scripts.

module load cae

will take care that these are found when called from the command line or e.g. within a job script.


Workspaces

ws_exchange procedure


Exchanging data between users can be done using a workspace. If the users don't share a common group, it is not advisable to simply create such a workspace with permissions set to read (and/or write) and execute for the world (others), since everyone on the system then gets access to the data. ws_exchange uses such a workspace with a slight modification: the workspace directory permissions are set to execute for all, but only the owner is allowed to read its contents. A readable(/writable) subdirectory is therefore "invisible"/secret to others. This approach is not secure, but better than the "open to all" approach.

Warning: Never use the secret subdirectory name within the arguments of command line tools. (also see ws_cp2exchange)


ws_exchange

Looking at the help first:

$ module load cae

$ ws_exchange -h
usage: ws_exchange [-s <secret subdirectory name>] [-d <workspace duration>] [-w <workspacename>] [-f <workspace filesystem>] [-p <permissions>]
       -s <secret subdirectory name> [default: random name]
       -d <workspace duration [d]>   [default: 1 days (48 hours)]
       -w <workspacename>            [default: exchange%Y%m%d%H%M%S]
       -f <workspace filesystem>     [default: default of ws_allocate]
       -p <subdirectory permissions> [default: go+rwx]
ws_exchange will create a workspace, which is executable but not readable for the world.
Its contents are hidden and a additional created subdirectory, which name serves as a password, could be open to the world.

Executing ws_exchange with default options will create a one-day workspace in the default file system with a name starting with "exchange" followed by the date and time, and a subdirectory with a random name and execute/read/write permissions:

$ ws_exchange
Workspace created on gerris
exchange20081018113236                                      Oct 18 11:32:51  0 days 23 hours
                 /scratch2/ws/hpcstruc-exchange20081018113236-0
 exchange directory ----------------------------------------
/scratch2/ws/hpcstruc-exchange20081018113236-0/Yj2mskvAuC6
------------------------------------------------------------
you might ws_cp2exchange to copy your files to exchange20081018113236

The other users concerned have to be notified about the path ("/scratch2/ws/hpcstruc-exchange20081018113236-0/Yj2mskvAuC6" in the example) e.g. by email. Another example:

$ ws_exchange -w myexchangews -s secretsubdir -d 30 -f lustre -p go+rx
Workspace created on gerris
myexchangews                                                Oct 18 12:43:19  0 days 23 hours
                 /scratch2/ws/hpcstruc-myexchangews-0
 exchange directory ----------------------------------------
/scratch2/ws/hpcstruc-myexchangews-0/kkE2WVIY
------------------------------------------------------------
you might ws_cp2exchange to copy your files to myexchangews

Now the workspace (on the lustre file system) named myexchangews lasts for 30 days.

ws_cp2exchange

Files also have permissions, and especially when copying files into a ws_exchange workspace these permissions are usually too restrictive. It is also a good idea to first change the current directory to the ws_exchange subdirectory and copy into this directory rather than the other way round. To simplify this, ws_cp2exchange was created.

$ ws_cp2exchange -h
usage: ws_cp2exchange [cp|mv] [<options>] <src> <wsname>

To copy a file "testfile" to the ws_exchange workspace subdirectory created above simply write

$ ws_cp2exchange testfile exchange20081018113236
working directory: /scratch2/ws/hpcstruc-exchange20081018113236-0/Yj2mskvAuC6
cp  /DDN1/HLRS/hlrs/hpcstruc/testfile .
change permissions to g+rwx,o+rwx
done

Copying whole directories needs an option (see "man cp"):

$ ws_cp2exchange cp -r testdir1 myexchangews
working directory: /scratch2/ws/hpcstruc-myexchangews-0/kkE2WVIY
cp  -r /DDN1/HLRS/hlrs/hpcstruc/testdir1 .
change permissions to g+rwx,o+rwx
done

(The "cp" can be omitted, since copy is the default mode.)

To move a directory (or file) use the "mv" mode:

$ ws_cp2exchange mv testdir2 myexchangews
working directory: /scratch2/ws/hpcstruc-myexchangews-0/kkE2WVIY
mv  /DDN1/HLRS/hlrs/hpcstruc/testdir2 .
change permissions to g+rwx,o+rwx
done

ws_exchange replacements

Note: The ws_exchange utility is not needed any more due to the introduction of ACLs and the new ws_share workspace tool.


Usage of ACLs

Since Access Control Lists (ACLs) are now enabled, permissions for file objects are no longer limited to user, group and other, but can be granted via a list of entries. When creating a new workspace, the directory can e.g. be made readable for a specific user SHAREUSER like

$ WSDIR=$(ws_allocate $WSNAME)

$ setfacl -d -m u:$SHAREUSER:r $WSDIR
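
A slightly more complete sketch (assumed standard setfacl/getfacl usage, not taken from the original page): in addition to the default ACL for newly created files, the other user also needs permission to enter and list the workspace directory itself, and the result can be checked with getfacl:

$ WSDIR=$(ws_allocate $WSNAME 7)       # allocate e.g. a 7-day workspace and capture its path
$ setfacl -m u:$SHAREUSER:rx $WSDIR    # access ACL: SHAREUSER may enter and list the directory
$ setfacl -d -m u:$SHAREUSER:r $WSDIR  # default ACL: newly created files inherit read permission
$ getfacl $WSDIR                       # verify the ACL entries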

ws_share

see the Workspace mechanism documentation


Jobs

jobnodesinfo

jobnodesinfo.py shows which nodes are used by a job (and for HAWK how they are placed):

$ jobnodesinfo.py --help
usage: jobnodesinfo [-h] [-v] [-p PLATFORM] [-n NODESFILE] [-j JOBID]
                    [-f FORMAT] [-o [OUTPUTFILE]] [--nocolor]
                    [--nodedistribmarkers NODEDISTRIBMARKERS]
                    [nodes [nodes ...]]

print nodes information

positional arguments:
  nodes

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         verbose level (default, if not set: ERROR)
  -p PLATFORM, --platform PLATFORM
                        platform name (default, if not set: hawk)
  -n NODESFILE, --nodesfile NODESFILE
                        nodes file (default, if not set: $PBS_NODEFILE)
  -j JOBID, --jobid JOBID
                        PBS JobID
  -f FORMAT, --format FORMAT
                        format string (default, if not set: platform specific)
  -o [OUTPUTFILE], --outputfile [OUTPUTFILE]
                        redirect output to file (default: stdout)
  --nocolor             no color output
  --nodedistribmarkers NODEDISTRIBMARKERS
                        nodesdistribution markers

Within a job there is no need to set a job ID, since -j $PBS_JOBID resp. -n $PBS_NODEFILE is automatically used as a default.
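
For example (a sketch; the job ID is just a placeholder), inside a job it is enough to call the script directly, while a specific job can be queried via -j:

$ module load cae
$ jobnodesinfo.py              # inside a job: the nodes are taken from $PBS_NODEFILE
$ jobnodesinfo.py -j 2611850   # outside a job: query a specific PBS job ID (placeholder)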


qgen

qgen will generate a job script based on a template.

$ module load cae

$ qgen -h

qgen	a tool to generate(&submit) pbs jobfiles	(version: 2012/03/29, author: Martin Bernreuther <bernreuther@hlrs.de>)
usage: qgen [-n|--nodes <nodes>] [-t|--walltime <walltime>] [-N|--jobname <jobname>] [-w|--workdir <workdir>] [-o <jobfile>] [--submit|--submitq <queue>|--execute] [-] [<template>] [<template_opt1><template_opt2>...] 
      -n|--nodes <nodes>	[default: 1]	number of nodes($QGEN_NODES)
      -t|--walltime <walltime>	[default: 24:00:00]	 walltime ($QGEN_WALLTIME)
      -N|--jobname <jobname> 	[default: <template><datetime>]	job name ($QGEN_NAME)
      -w|--workdir <workdir> 	[default: /zhome/academic/HLRS/hlrs/hpcstruc]	 working directory ($QGEN_WORKDIR)
      -o <jobfile> 	[default: <stdout>]	 save script to <jobfile>
      --submit    	submit immediately (to standard queue)
      --submitq <queue>    	submit immediately to queue <queue>
      --execute    	execute immediately on current host
      -l          	list system templates
      -v          	be verbose
      -h|--help 	 	print help and exit
      <template>	 	template to use
      <template_options>	additional options depending on and passed directly to the template ($*)
qgen will create a pbs jobscript based on a template, substituting
$QGEN_NODES, $QGEN_WALLTIME, $QGEN_NAME, $QGEN_WORKDIR and $* (latter with template options)
Instead of setting the values with command line arguments,
environment variables might be used (except for template options)
These environment variables might be also set in the configuration file
.../.qgenrc, which is sourced at the beginning.
Templates are searched for in $QGEN_TEMPLATE_PATH, actually set to:
...

will print some help and

$ qgen -l

will list all available templates found in the QGEN_TEMPLATE_PATH. One of these templates is "test", which doesn't need any further arguments.

$ qgen test

A job file will be printed. Save this job file to "test.pbs" and compare it with the template:

$ qgen -o test.pbs test
$ tkdiff test.pbs /app/rus/struct/bin/qgen_templates/test

(If there's no X available, replace tkdiff with e.g. sdiff)

qgen just replaces some placeholders within the template. Everyone can extend the system by writing their own templates. The default path for these files is ~/qgen_templates

Now the job script could be submitted with "qsub test.pbs", but (approximately) the same, here for 2 requested nodes, can also be achieved with:

$ qgen -n 2 --submit test

However, if the -o option is omitted, the executed job script will not be saved for the user.

In general, templates require additional options. The help might give you a hint at the end, e.g.

$ qgen -h abaqus

qgen    a tool to generate(&submit) pbs jobfiles        (version: 2012/03/29, author: Martin Bernreuther <bernreuther@hlrs.de>)
usage: qgen [-n|--nodes <nodes>] [-t|--walltime <walltime>] [-N|--jobname <jobname>] [-w|--workdir <workdir>] [-o <jobfile>] [--submit|--submitq <queue>|--execute] [-] [<template>] [<template_opt1> <template_opt2>...]
      -n|--nodes <nodes>       [default: 1:nehalem:ppn=8]      number of nodes($QGEN_NODES)
      -t|--walltime <walltime> [default: 24:00:00]      walltime ($QGEN_WALLTIME)
      -N|--jobname <jobname>   [default: <template><datetime>] job name ($QGEN_NAME)
      -w|--workdir <workdir>   [default: /zhome/academic/HLRS/hlrs/hpcbern]     working directory ($QGEN_WORKDIR)
      -o <jobfile>     [default: <stdout>]      save script to <jobfile>
      --submit         submit immediately (to standard queue)
      --submitq <queue>        submit immediately to queue <queue>
      --execute        execute immediately on current host
      -l               list system templates
      -v               be verbose
      -h|--help                print help and exit
      <template>               template to use
      <template_options>       additional options depending on and passed directly to the template ($*)
qgen will create a pbs jobscript based on a template, substituting
$QGEN_NODES, $QGEN_WALLTIME, $QGEN_NAME, $QGEN_WORKDIR and $* (latter with template options)
Instead of setting the values with command line arguments,
environment variables might be used (except for template options)
These environment variables might be also set in the configuration file
/zhome/academic/HLRS/hlrs/hpcbern/.qgenrc, which is sourced at the beginning.
Templates are searched for in $QGEN_TEMPLATE_PATH, actually set to:
/zhome/academic/HLRS/hlrs/hpcbern/qgen_templates /opt/cae/bin/qgen_templates

abaqus template_options: [<version>] <abaqus-options>
                         cpus and mp_mode are set automatically
               examples: job=<jobname>
                         692 job=<jobname> inp=<inputfile> user=<usrprgfile> scratch=.

ABAQUS options:
                job=job-name
                [input=input-file]
                [user={source-file | object-file}]
                [oldjob=oldjob-name]
                [fil={append | new}]
                [globalmodel={results file-name | output database file-name}]
                [domains=number-of-domains]
                [dynamic_load_balancing]
                [standard_parallel={all | solver}]
                [gpus=number-of-gpgpus]
                [memory=memory-size]
                [interactive |  background | queue=[queue-name][after=time]]
                [double={explicit | both | off | constraint}]
                [scratch=scratch-dir]
                [output_precision={single | full} ]
                [field={odb | exodus | nemesis} ]
                [history={odb | csv} ]
                [madymo=MADYMO-input-file]
                [port=co-simulation port-number]
                [host=co-simulation hostname]
                [listenerport=Co-Simulation Engine listener port-number]
                [remoteconnections=Co-Simulation Engine host:port-number, remote job host:port-number]
                [timeout=co-simulation timeout value in seconds]
                [unconnected_regions={yes | no}]

(Nutzung dieses Templates in eigener Verantwortung - use this template on your own responsibility)

Submitting an ABAQUS job on 4 nodes is as easy as executing

$ qgen -n 4 --submit abaqus job=<jobname>

with <jobname> as the ABAQUS input file (absolute path or path relative to the current directory). Typically qgen is executed within a writable directory; here a workspace is used. As mentioned before, more control is gained if the PBS job file is generated first and submitted afterwards, like in this LS-Dyna example:

$ qgen -o lsdyna.pbs -n 8 dyna i=<inputfile>
$ qsub lsdyna.pbs

Before the submission with qsub, the jobfile might be changed/tuned.

The qgen command can also be used within LSOPT for LS-Dyna optimization/DoE runs. After loading the necessary modules (module load cae lstc) start "lsoptui" to do the configuration. Within the "Solvers" tab, after choosing "LS-DYNA" as "Solver Package Name", a "Command" like "qgen --submit -n 1:ppn=8 -o job.pbs dyna" will generate and submit the LS-Dyna jobs, controlled by lsopt.

There's also a simple template to execute a single command:

$ qgen -N cmdtoptest --submit cmd top -b -n 1

The -N option sets the job name and affects the names of the stdout/stderr files. The output of the command "top -b -n 1" can thus be found in the cmdtoptest.o* file. The error output cmdtoptest.e* is hopefully empty...
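
After the job has finished, the result can be inspected like this (a trivial sketch; the file names follow from the job name set above):

$ qstat                # check whether the job is still queued or running
$ cat cmdtoptest.o*    # output of "top -b -n 1"
$ cat cmdtoptest.e*    # error output (hopefully empty)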

qgen customization

The preset qgen templates can be extended by user defined ones. By default, qgen will check the directory ~/qgen_templates for templates, and files in this directory are preferred over the preset templates. There is also the environment variable $QGEN_TEMPLATE_PATH to define which directories qgen should search for a given template name.
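
As a sketch (assuming the space-separated directory list format shown in the qgen help output above; "my_qgen_templates" is just a hypothetical directory name), an additional template directory could be added like this:

$ export QGEN_TEMPLATE_PATH="${HOME}/my_qgen_templates ${HOME}/qgen_templates /opt/cae/bin/qgen_templates"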

One strategy is to copy an already existing template file (see qgen -l) and tailor it to your own needs. Another is to take an existing job script and replace the varying parts with QGEN variables.

To show what such a template file looks like, a simple helloworld template will be created from scratch (if you're not familiar with vi and you have X enabled, use another editor like nedit, kwrite, ...):

$ mkdir ${HOME}/qgen_templates
$ vim ${HOME}/qgen_templates/helloworld

The content of ~/qgen_templates/helloworld might look like

#!/bin/bash
#QGENHELP pbs jobscript file for testing/demonstration purposes
#QGENHELP helloworld template_options: <name>
#PBS -l nodes=$QGEN_NODES
#PBS -l walltime=$QGEN_WALLTIME
#PBS -N $QGEN_NAME
cd $QGEN_WORKDIR
echo -e "Hello world!\nThe current working directory is $PWD and my name is...\n$*"

The $QGEN_XXX strings will be replaced by qgen with some job specific data according to default values and the qgen options set.
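
For illustration only (a sketch based on the default values shown in the qgen help above, not verbatim qgen output; <datetime> stands for the actual timestamp and the #QGENHELP comment lines are omitted here), the job script generated from "qgen helloworld Rumpelstilzchen" might look roughly like:

#!/bin/bash
#PBS -l nodes=1
#PBS -l walltime=24:00:00
#PBS -N helloworld<datetime>
cd /zhome/academic/HLRS/hlrs/hpcstruc
echo -e "Hello world!\nThe current working directory is $PWD and my name is...\nRumpelstilzchen"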

Now we can use this template to create job files and submit the first jobs based on it (the cae module has to be loaded only once; otherwise the full path to qgen has to be typed...):

$ module load cae
$ qgen -h helloworld
$ qgen -o helloworld1.pbs --submit helloworld Rumpelstilzchen
$ qgen -o helloworld2.pbs -n 1:node_type=rome -t 0:03:00 -N helloworldtest -w $HOME --submit helloworld Rumpelstilzchen
$ qstat -f

The help message will also display the QGENHELP part of the template at the end.

qgen usage examples

The examples will create a workspace, copy/extract an example input file and generate a job script file to be submitted. The provided qgen templates will try to use the academic licenses provided by HLRS. To use other licenses, the templates have to be adapted (see qgen customization above).

dyna (LS-Dyna)

Using a single precision mpp LS-DYNA version to compute 100 time steps of the odb10m example on a Vulcan "genoa" node. The same procedure can also be used on Hawk, but the node_type has to be adapted to "rome".

$ ws_allocate dynatest 1
$ cd `ws_find dynatest`
$ gunzip -c /sw/general/x86_64/cae/lstc/examples/topcrunch.org/ODB-10M/odb10m-ver18.k.gz > odb10m.k
$ module load cae
$ qgen -n 1:node_type=genoa:mpiprocs=64 -t 0:05:00 -o job.pbs dyna s mpp i=odb10m.k ncycle=100 d=NODUMP
$ qsub job.pbs

ansys (ANSYS mechanical)

$ ws_allocate ansystest 1
$ cd `ws_find ansystest`
$ cp /sw/general/x86_64/cae/ansys_inc/v241/ansys/site/ansys/mapdl/core/examples/verif/vm263.dat .
$ module load cae
$ qgen -n 1:node_type=clx-25:mpiprocs=1:ompthreads=10 -t 0:05:00 -o job.pbs ansys -i vm263.dat -o vm263_output
$ qsub job.pbs


qwtime

qwtime will show the (remaining) walltime of a job.

$ module load cae

$ qwtime -h

usage: /sw/general/x86_64/cae/bin/qwtime [-r] [-R] [-e] [--fmt '<FORMAT>'] [--datimefmt '<DATIMEFORMAT>'] [<JOBID> ...]
      -r	print remaining time (FORMAT '%r') [sec]
      -R	print remaining time (FORMAT '%:r') (HH:MM:SS)
      -e	print (estimated) end time (FORMAT '%e') (also see --datimefmt)
      --fmt <FORMAT>	print with format <FORMAT>
        (default:'%j\tjobname:\t%n (jobowner: %o, #hosts: %#h, host1: %h1)\n%j\tjobworkdir:\t%d\n%j\ttime range:\t%s\t%e\t(%:w)\n%j\twalltime      used:\t%:u\t(%us, %%u%)\n%j\twalltime remaining:\t%:r\t(%rs, %%r%)\n%j\t|%b|\n')
        environment variable qwtimeARGfmtDEFAULT (not set)
      --datimefmt <DATIMEFORMAT>	use time format <DATIMEFORMAT>
        (default:'%Y-%m-%dT%H:%M:%S')
        environment variable: qwtimeARGdatimefmtDEFAULT (not set)
      <JOBID>	job ID (default: PBS_JOBID or actual USER jobs)
FORMAT:
      %j	job ID (without suffix, %J with suffix)
      %n	job Name
      %o	job Owner
      %h	job hosts (%h1 first host/MOM)
      %d	job workdir
      %s	start time (also see --datimefmt)
      %e	(estimated) endtime (also see --datimefmt)
      %w	requested Walltime [sec] (%:w [HH:MM:SS])
      %u	used walltime [sec] (%:u [HH:MM:SS], %%u [%])
      %r	remaining walltime [sec] (%:r [HH:MM:SS], %%r [%])
      %b	progress bar
DATIMEFORMAT:	see "man 1 date | grep -m1 -A105 '^ *FORMAT'"

If no JOBIDs are given, the currently running jobs will be queried. The output can be customized using a --fmt format string, which might include format specifiers. With options like -r, only a specific piece of information (e.g. the remaining walltime in seconds) will be printed (by adjusting the format string accordingly).
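
A small sketch of how this can be used in a job script (assuming, as the help above states, that -r prints just the remaining walltime in seconds; run_next_work_item is a placeholder for the actual work):

# inside a PBS job script, after "module load cae"
while [ "$(qwtime -r)" -gt 300 ]; do   # more than 5 minutes of walltime left?
    run_next_work_item                 # placeholder for one unit of work
done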


qcat

Whereas the qsub option "-k oe" causes the system to create a *.o* and *.e* file for the standard and error output in the home directory right from the start of a job run, the default behaviour is that these files are kept in a spool directory and moved to the working directory at the end. qcat will show these files using cat or copy the current version to the working directory.

$ qcat -h
usage: /opt/cae/bin/qcat [o|e] [<JOBID>] [-c] [-t [<N>]]
      o|e      show output or error output file        default:o
      <JOBID>  OpenPBS job ID                          default: last own job
      -c       copy file to final destination          default: no copy
      -t [<N>] show tail (only <N> last lines)         default: show all
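
Some typical calls (a sketch based on the usage above; the job ID is just a placeholder):

$ qcat                 # show the stdout file of the last own job
$ qcat -t 20           # show only its last 20 lines
$ qcat e 2611850       # show the error output of job 2611850 (placeholder)
$ qcat o 2611850 -c    # copy the current stdout file to its final destination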

MPI/OpenMP

showaffinity

showaffinity will show/print the affinity (pinning) of processes and serves as a stand-in for the program that would otherwise be executed with the same options. The possible places (cores) a process is allowed to run on will be marked.


sequential program

$ showaffinity --help
hybrid MPI program to show process to core affinity; <bernreuther@hlrs.de>
Usage:
  showaffinity [OPTION...]

  -h, --help            Print help
  -v, --verbose         Verbose output
  -f, --maskformat arg  mask format template string
  -p, --preset arg      mask format presets
      --marker arg      marker (default: .X)
$ showaffinity.seq
$ taskset 1 showaffinity.seq
$ taskset 0x0000000b showaffinity.seq
$ taskset --cpu-list 0-2,4 showaffinity.seq
$ numactl --cpunodebind=0 --membind=0 showaffinity.seq


OpenMP program

$ showaffinity.omp
$ OMPLACES 
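
Thread placement can also be controlled via the standard OpenMP environment variables (a generic sketch, not taken from the original examples):

$ OMP_NUM_THREADS=4 OMP_PLACES=cores OMP_PROC_BIND=close showaffinity.omp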

MPI program

using MPT on HAWK

$ mpirun showaffinity.mpt -p hawk0
$ mpirun -np 4 omplace -v showaffinity.mpt


hybrid MPI+OpenMP program

using MPT on HAWK

$ mpirun -np 2 omplace -nt $OMP_NUM_THREADS -vv showaffinity.mptomp -p hawk1

Monitoring

meminfo

meminfo.py will just print the data found in the Linux /proc/meminfo, which looks like this (shown for a Hawk compute node)

$ nl /proc/meminfo; meminfo.py -c 1-3,17
    1  MemTotal:       262730452 kB
    2  MemFree:        250762824 kB
    3  MemAvailable:   250228776 kB
    4  Buffers:           39288 kB
    5  Cached:          1151664 kB
    6  SwapCached:            0 kB
    7  Active:           232808 kB
    8  Inactive:        1677452 kB
    9  Active(anon):     109068 kB
   10  Inactive(anon):  1436592 kB
   11  Active(file):     123740 kB
   12  Inactive(file):   240860 kB
   13  Unevictable:           0 kB
   14  Mlocked:               0 kB
   15  SwapTotal:             0 kB
   16  SwapFree:              0 kB
   17  Dirty:                 0 kB
   18  Writeback:             0 kB
   19  AnonPages:        717452 kB
   20  Mapped:           294688 kB
   21  Shmem:            826352 kB
   22  KReclaimable:     340944 kB
   23  Slab:            4913136 kB
   24  SReclaimable:     340944 kB
   25  SUnreclaim:      4572192 kB
   26  KernelStack:       54800 kB
   27  PageTables:        10864 kB
   28  NFS_Unstable:          0 kB
   29  Bounce:                0 kB
   30  WritebackTmp:          0 kB
   31  CommitLimit:    131365224 kB
   32  Committed_AS:   10660508 kB
   33  VmallocTotal:   34359738367 kB
   34  VmallocUsed:     3109344 kB
   35  VmallocChunk:          0 kB
   36  Percpu:           475136 kB
   37  HardwareCorrupted:     0 kB
   38  AnonHugePages:    573440 kB
   39  ShmemHugePages:        0 kB
   40  ShmemPmdMapped:        0 kB
   41  FileHugePages:         0 kB
   42  FilePmdMapped:         0 kB
   43  HugePages_Total:       0
   44  HugePages_Free:        0
   45  HugePages_Rsvd:        0
   46  HugePages_Surp:        0
   47  Hugepagesize:       2048 kB
   48  Hugetlb:               0 kB
   49  DirectMap4k:     1905588 kB
   50  DirectMap2M:    189710336 kB
   51  DirectMap1G:    76546048 kB

but offers some additional features:

$ module load cae
$ meminfo.py --help
usage: meminfo [-h] [-o [OUTPUTFILE]] [--mpi] [--avg] [--printnodenames]
              [-r REPEAT] [-c COLUMNS] [--delay DELAY] [--noheader]
              [--valfmt VALFMT] [--timestampfmt TIMESTAMPFMT] [--daemonize]
gather and print meminfo data
optional arguments:
 -h, --help            show this help message and exit
 -o [OUTPUTFILE], --outputfile [OUTPUTFILE]
                       redirect output to file (default: stdout)
 --mpi                 MPI multinode version (default: False)
 --avg                 also calculate averages and standard deviation
                       (default: False) [needs --mpi]
 --printnodenames      print node names (default: False) [needs --mpi;
                       inactive for --noheader]
 -r REPEAT, --repeat REPEAT
                       N|+N|+HH:MM:SS,T|HH:MM:SS repeat N-1 times|within +N
                       sec|+HH:MM:SS with a sleep time of T sec|HH:MM:SS in
                       between
 -c COLUMNS, --columns COLUMNS
                       columns selection e.g. 1-3,17,18,21,33-34 (if not set:
                       select all) - see `nl /proc/meminfo`
 --delay DELAY         delay (sleep) before execution (default: None)
 --noheader            omit printing header
 --valfmt VALFMT       value format (default: {})
 --timestampfmt TIMESTAMPFMT
                       timestamp format (default: %y%m%dT%H%M%S)
 --daemonize           put process into background (default: False)

E.g. filtering some values like

$ meminfo.py -c 1-3,17
# meminfo running at XXX XXX XX XX:XX:XX XXXX
# TIMESTAMP     MemTotal        MemFree MemAvailable    Dirty
XXXXXXTXXXXXX   269035982848.0  256765276160.0  256218411008.0  0.0

where the numbers correspond to the line numbers above (without -c, all entries will be listed). For single-node jobs, or if only the first node should be monitored, meminfo.py can be "daemonized", so that the following commands in a job script are executed directly afterwards. Especially in this case it is advisable to redirect the output to a file. It is also possible to repeat the data collection and to start it after a given delay. In the following case, 2 columns will be written to meminfo.out after a waiting time of 10 seconds, 10 times, with 5 seconds of waiting in between:

$ meminfo.py --daemonize --delay 10 -r 10,5 -c 2,3 -o meminfo.out

meminfo.py also offers an MPI mode (using mpi4py) to collect the data of multiple nodes:

$ qsub -IXlselect=2:node_type=rome:mpiprocs=128,walltime=0:5:0 -q test
$ module load cae mpi4py
$ ( mpirun -np 256 meminfo.py --mpi --avg --delay 0:0:30 -r +0:10:00,0:0:10 -c 1-3 -o meminfo.out.${PBS_JOBID%%.*} ) &

The output will contain the MIN and MAX values. With --avg (as in the example above), the average and the standard deviation are additionally printed. Only a single process per node will collect the data (all others will just idle). "Daemonize" is not supported in MPI mode, so in the example the whole program start is put into a subprocess in the background (notice the parentheses and the trailing ampersand).