- Information contained in the HLRS Wiki is not legally binding and HLRS is not responsible for any damages that might result from its use -

CAE howtos

Latest revision as of 14:38, 22 September 2023

Licensing

ssh-Tunnel

To use a remote license server, an ssh tunnel can be used. If the tunnel connects a TCP port on the local compute node with the port the license server listens on, the license can be checked out through the local port.

Setup

ssh tunnel for license server

application node (compute node)

the node where the license is drawn

ssh server

a proxy between the application node and the license server

  • The ssh server has to be reachable from the application node (possibly through a NAT gateway), and the license server has to be reachable from the ssh server, so no firewall may block these connections. The ssh server's firewall only has to allow connections from the application node and to the license server port (and probably from an administration computer or internal network).
  • The sshd configuration has to contain "AllowTcpForwarding yes" (sshd may also listen on an alternative port instead of port 22).
  • The ssh server user does not need a login shell just to establish an ssh tunnel (/bin/false is enough), but
  • passwordless access is needed to automate the setup of the ssh tunnel from a job script.

license server

the node that serves the license

Job script example excerpt

# specify license server and port (using a TCP connection)
export LICSERVER=licserver.mydomain.de # license server
export LICSERVER_PORT=12345 # license server port (use vendor daemon port for flexnet)
echo -e "license server:\t ${LICSERVER}:${LICSERVER_PORT}"
export LICSERVERlocal=localhost # local license server
#export LICSERVERlocal=`hostname`       # needs ssh \* binding address
export LICSERVERlocal_PORT=${LICSERVER_PORT:-12345} # local license port
echo -e "local license ssh tunnel end:\t${LICSERVERlocal}:${LICSERVERlocal_PORT}"
SSH_userserver="user@sshserver.mydomain.de"    # passwordless ssh access needed!
SSH_PORT=22
SSH_ctrlsocket="sshtunnelCtrlSocket.${PBS_JOBID}"
echo "[`date +%Y-%m-%dT%H:%M:%S`] setting up ssh tunnel through ${SSH_userserver} (control socket: ${SSH_ctrlsocket})"
#rm -f "${SSH_ctrlsocket}" # removing socket file should not be necessary
# establish ssh tunnel (might add additional options like e.g. -o ServerAliveInterval=60 or -o TCPKeepAlive=yes)
ssh -MS "${SSH_ctrlsocket}" -fNTL ${LICSERVERlocal_PORT}:${LICSERVER}:${LICSERVER_PORT} -p ${SSH_PORT} ${SSH_userserver}
# check ssh tunnel
ssh -S "${SSH_ctrlsocket}" -O check ${SSH_userserver} || (echo "ssh CTRL socket  ${SSH_ctrlsocket} check failed - wait some more time..."; sleep 10)
## adjusting license server environment variables to the ssh tunnel end
# e.g. flexnet (using vendor daemon port)
export LM_LICENSE_FILE="${LICSERVERlocal_PORT}@${LICSERVERlocal}"
echo "[`date +%Y-%m-%dT%H:%M:%S`] licensing redirected to ${LM_LICENSE_FILE}"
# alternative check of connection (output redirected to stderr)
nc -zvw4 ${LICSERVERlocal} ${LICSERVERlocal_PORT} 1>&2
# alternatives, e.g.:
##nmap --system-dns -PN -p${LICSERVERlocal_PORT} ${LICSERVERlocal}
if [ $? -ne 0 ]; then
        echo "ERROR reaching ${LICSERVERlocal}:${LICSERVERlocal_PORT}"
else
        echo "test connection to ${LICSERVERlocal}:${LICSERVERlocal_PORT} succeeded"
fi
#
# start simulation...
#
# close connection
ssh -S "${SSH_ctrlsocket}" -O exit ${SSH_userserver}
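The excerpt above checks the control socket once and then sleeps for 10 seconds. A minimal sketch of a more patient check is shown below. The helper name retry and the attempt and delay values are assumptions for illustration, not part of the original job script.

```shell
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed,
# sleeping DELAY seconds between two runs.
# (hypothetical helper, not part of the HLRS environment)
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "${i}" -le "${attempts}" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt ${i}/${attempts} failed: $*" >&2
        if [ "${i}" -lt "${attempts}" ]; then
            sleep "${delay}"
        fi
        i=$(( i + 1 ))
    done
    return 1
}

# Possible use in the job script (values are assumptions):
#   retry 6 10 ssh -S "${SSH_ctrlsocket}" -O check ${SSH_userserver} \
#       || { echo "ssh tunnel could not be established" >&2; exit 1; }
```

Aborting the job when the tunnel never comes up avoids starting a simulation that would only fail at license checkout.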

Improvements

  • Replace ssh with autossh (https://www.harding.motd.ca/autossh/), which automatically restarts the ssh connection if necessary and thus improves resiliency.
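A minimal sketch of how the tunnel could be started with autossh instead of plain ssh, assuming autossh is installed on the compute node (not confirmed for HLRS systems). The function names, the pid file argument and the keep-alive values are assumptions. Because the control socket is not used here, the tunnel is closed through the pid file that autossh writes.

```shell
# start_license_tunnel LOCAL_PORT LIC_HOST LIC_PORT USER@SSHSERVER PIDFILE
# (hypothetical helper; all values are supplied by the caller)
start_license_tunnel() {
    local local_port="$1" lic_host="$2" lic_port="$3" userserver="$4" pidfile="$5"
    # -M 0 disables the autossh monitoring port; a dead connection is then
    # detected by ssh itself through the ServerAlive* options.
    # -f sends autossh to the background, AUTOSSH_PIDFILE records its pid.
    AUTOSSH_PIDFILE="${pidfile}" \
    autossh -M 0 -f -N -T \
        -o ServerAliveInterval=60 -o ServerAliveCountMax=3 \
        -o ExitOnForwardFailure=yes \
        -L "${local_port}:${lic_host}:${lic_port}" "${userserver}"
}

# stop_license_tunnel PIDFILE
# Terminates autossh (and with it the ssh tunnel) via the recorded pid.
stop_license_tunnel() {
    local pidfile="$1"
    if [ -f "${pidfile}" ]; then
        kill "$(cat "${pidfile}")"
    fi
}

# Possible use with the variables of the job script excerpt:
#   start_license_tunnel "${LICSERVERlocal_PORT}" "${LICSERVER}" "${LICSERVER_PORT}" \
#       "${SSH_userserver}" "autossh.${PBS_JOBID}.pid"
#   ... start simulation ...
#   stop_license_tunnel "autossh.${PBS_JOBID}.pid"
```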

Connectivity checks

Firewalls might block a direct connection. To find out whether a connection can be established, the following checks can be performed (e.g. from a frontend or within an interactive session):

  • check if port can be reached
nc -zvw4 <SERVER> <PORT>
nmap --system-dns -PN -p <PORT> <SERVER>

(nmap is not available on HLRS/HWW systems at the moment.)

  • check how far we get (assuming TCP connection)
traceroute -T -p <PORT> <SERVER>

(traceroute is disabled on HLRS/HWW systems at the moment.)

For UDP, tracepath might be used as well.

  • check external IP address

The IP address "seen" from outside might differ from the internal one. Check e.g.

https://websrv.hlrs.de/ipinfo
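If neither nc nor nmap is available, the port check from the first item can be approximated with bash alone. The sketch below relies on the /dev/tcp redirection feature of bash and on timeout from coreutils. The function name port_open and the default timeout are assumptions.

```shell
# port_open HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened, non-zero otherwise.
# (hypothetical helper; uses the bash-only /dev/tcp pseudo device)
port_open() {
    local host="$1" port="$2" wait_s="${3:-4}"
    # The connection is opened on file descriptor 3 of a short-lived bash
    # process and is closed again when that process exits.
    timeout "${wait_s}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "${host}" "${port}" 2>/dev/null
}

# Possible use:
#   if port_open <SERVER> <PORT>; then echo "reachable"; else echo "not reachable"; fi
```

A refused connection and a timeout both yield a non-zero status, so this check cannot tell a closed port from a filtered one.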

Jobscripts

Self-initiated termination & more

PBSpro (like other batch systems) sends a SIGTERM to the running jobscript at the end of the job walltime. However, the time left before the job is terminated might be too short, so handling this within the jobscript itself is a more flexible alternative. First, the time to wait is calculated (the example assumes a bash jobscript):

timebeforeend=$(( 5*60 ))  # 5 min
module load cae
jobremainingwalltime=$(qwtime -r)
remaintingwalltime2stop=$(( jobremainingwalltime-timebeforeend ))
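If less than five minutes of walltime remain, the subtraction above becomes negative, and a later sleep would fail. A minimal sketch of a guarded calculation follows, assuming that qwtime -r prints the remaining walltime as a plain integer number of seconds (implied by the arithmetic above, but not confirmed here). The function name seconds_until_stop is hypothetical.

```shell
# seconds_until_stop REMAINING MARGIN
# Prints REMAINING - MARGIN, but never less than 0.
# Fails if REMAINING is not a non-negative integer.
# (hypothetical helper, not part of the cae module)
seconds_until_stop() {
    local remaining="$1" margin="$2" wait_s
    case "${remaining}" in
        ''|*[!0-9]*)
            echo "unexpected walltime value: '${remaining}'" >&2
            return 1
            ;;
    esac
    wait_s=$(( remaining - margin ))
    if [ "${wait_s}" -lt 0 ]; then
        wait_s=0
    fi
    echo "${wait_s}"
}

# Possible use with the variables defined above:
#   remaintingwalltime2stop=$(seconds_until_stop "$(qwtime -r)" "${timebeforeend}") || exit 1
```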

Send SIGTERM to command after some time

If the name of the command is known, killall, for example, can send it a SIGTERM:

cmd="path/mycommand"
(sleep ${remaintingwalltime2stop}; killall "${cmd}" ) &  # start a subshell in the background which will sleep first
$cmd $options  # also with e.g. mpirun
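killall signals every process with the given name, which can hit unrelated processes of the same user on a shared node. As an alternative sketch, the command can be started in the background and signalled through its process ID. The function name run_with_deadline is an assumption. Whether a launcher such as mpirun forwards the SIGTERM to its ranks depends on the MPI implementation.

```shell
# run_with_deadline DELAY COMMAND [ARGS...]
# Runs COMMAND and sends it a SIGTERM after DELAY seconds if it is still running.
# Returns the exit status of COMMAND (143 if it was ended by the SIGTERM).
# (hypothetical helper)
run_with_deadline() {
    local delay="$1"
    shift
    local status=0
    "$@" &                                  # start the command in the background
    local cmd_pid=$!
    # Timer subshell; its output is discarded so it never blocks a pipe.
    ( sleep "${delay}"; kill -TERM "${cmd_pid}" 2>/dev/null ) >/dev/null 2>&1 &
    local timer_pid=$!
    wait "${cmd_pid}" || status=$?
    # Cancel the timer if the command finished first. The timer's own sleep
    # may linger until it expires, but it can no longer send a signal.
    kill "${timer_pid}" 2>/dev/null || true
    wait "${timer_pid}" 2>/dev/null || true
    return "${status}"
}

# Possible use with the variables defined above:
#   run_with_deadline "${remaintingwalltime2stop}" $cmd $options
```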

LS-Dyna

LS-Dyna checks for the existence and content of a file named d3kil in its working directory, which makes it possible to trigger a program termination:

# LS-Dyna sense switches
##Type          Response
# SW1.          A restart file is written and LS-DYNA terminates.
# SW2.          LS-DYNA responds with time and cycle numbers.
# SW3.          A restart file is written and LS-DYNA continues.
# SW4.          A plot state is written and LS-DYNA continues.
# SW5.          Enter interactive graphics phase and real time visualization.
# SW7.          Turn off real time visualization.
# SW8.          Interactive 2D rezoner for solid elements and real time visualization.
# SW9.          Turn off real time visualization (for option SW8).
# SWA.          Flush ASCII file buffers.
# lprint        Enable/Disable printing of equation solver memory, cpu requirements.
# nlprint       Enable/Disable printing of nonlinear equilibrium iteration information.
# iter          Enable/Disable output of binary plot database "d3iter" showing mesh after each equilibrium iteration. Useful for debugging convergence problems.
# conv          Temporarily override nonlinear convergence tolerances.
# stop          Halt execution immediately, closing open files.
##
dumpsenseswitch='SW1'
# see above for the definition of remaintingwalltime2stop
(sleep ${remaintingwalltime2stop}; echo ${dumpsenseswitch} >d3kil ) &
# LS-Dyna will be executed afterwards
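A slightly extended sketch of the same idea: a d3kil file left over from an earlier run is removed before the timer starts, and the timer's process ID is kept so that the timer can be cancelled once LS-Dyna has finished on its own. The function name, the variable SENSE_TIMER_PID and the possibility of a stale d3kil file are assumptions.

```shell
# schedule_sense_switch DELAY SWITCH [WORKDIR]
# Writes SWITCH into WORKDIR/d3kil after DELAY seconds.
# The pid of the background timer is stored in SENSE_TIMER_PID.
# (hypothetical helper)
schedule_sense_switch() {
    local delay="$1" switch="$2" workdir="${3:-$PWD}"
    rm -f "${workdir}/d3kil"      # assumption: a stale file must not trigger a switch
    ( sleep "${delay}"; echo "${switch}" > "${workdir}/d3kil" ) >/dev/null 2>&1 &
    SENSE_TIMER_PID=$!
}

# Possible use with the variables defined above:
#   schedule_sense_switch "${remaintingwalltime2stop}" "${dumpsenseswitch}"
#   ... run LS-Dyna ...
#   kill "${SENSE_TIMER_PID}" 2>/dev/null   # cancel the timer if LS-Dyna ended earlier
```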

Check free memory

The same technique can also be used to check the free memory periodically, e.g. every minute after an initial waiting time of 10 seconds, saving the results to a file in the current directory with

(sleep 10; freeavail.sh --periodic 60:`qwtime -r` -n `qjobnodes.sh -n` > "$PWD/freeavail_${PBS_JOBNAME%.*}.${PBS_JOBID%%.*}") &

before starting the program.
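On systems where freeavail.sh and qjobnodes.sh are not available, a rough single-node substitute can be sketched with /proc/meminfo. This is not the HLRS tool and only covers the node the jobscript runs on. The function name, the output format and the log file name are assumptions.

```shell
# log_mem_available INTERVAL COUNT
# Prints COUNT lines "<timestamp> <MemAvailable in kB>", INTERVAL seconds apart.
# (hypothetical helper; requires a Linux kernel that reports MemAvailable)
log_mem_available() {
    local interval="$1" count="$2" i=0
    while [ "${i}" -lt "${count}" ]; do
        printf '%s %s\n' "$(date +%Y-%m-%dT%H:%M:%S)" \
            "$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
        i=$(( i + 1 ))
        if [ "${i}" -lt "${count}" ]; then
            sleep "${interval}"
        fi
    done
}

# Possible use before starting the program (one sample per minute for one hour):
#   (sleep 10; log_mem_available 60 60 > "$PWD/memavail.${PBS_JOBID%%.*}") &
```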


ISV codes

  • also see ISV_Usage