NEC Cluster cacau introduction

This platform enables the development and execution of parallel programs on Intel Xeon processors with Intel EM64T technology. The two major parallel programming standards, MPI and OpenMP, are supported. Please note that you must limit the execution time of your jobs during the daytime to guarantee the short turnaround times that are necessary for development.

Hardware and Architecture

The HWW Xeon EM64T cluster platform consists of one frontend node for interactive access (cacau.hww.de) and several nodes for the execution of parallel programs. The cluster consists of 210 dual-socket and 2 quad-socket nodes with 3.2 GHz/3.0 GHz/2.4 GHz Xeon EM64T CPUs and 2/8/16/128 GByte memory, plus a 2-way frontend node with two Xeon EM64T 3.2 GHz CPUs and 6 GByte memory. Additionally, a RAID system with 8 TByte and a GPFS with 15 TByte are available. The local disk of each node (58 GByte) serves as a scratch disk. Two nodes are installed with 128 GByte memory and a fast local disk of 1.7 TByte.

Features:

Cluster of 210 dual-SMP nodes (NEC Express5800/120Re-1 servers) with 2/8/16/128 GByte memory
Frontend node: a 2-way NEC Express5800/120Rg-2 server with 6 GByte memory
Node-node interconnect: Voltaire InfiniBand (switch: ISR9288) network + Gigabit Ethernet
Disk: 8 TByte home/shared scratch + 1.2 TByte local scratch + 15 TByte GPFS parallel filesystem
Batch system: Torque, Maui scheduler
Operating System: Scientific Linux SL release 5.2 (Boron), Kernel: 2.6.18-92.1.6.el5 (x86_64)
NEC HPC Linux software packages
Intel Compilers
Voltaire MPI
Switcher/Module

Peak performance:	3.9 TFLOP/s
Cores/node:	4
Memory:	1 TB
Shared disk:	24 TB
Local disks/node:	80 GB
Number of nodes:	212
Node-node data transfer rate:	10 Gbps (full bisectional: 20 Gbps) InfiniBand

**Short overview of installed compute nodes**
Type	Memory	Frequency	Cores	Local disk	PBS queue	PBS properties	Interconnect	Node names	Number of nodes
1	2GB	3.2 GHz	2*1 = 2	80GB	-	mem2gb	InfiniBand	noco001-075, noco109-204	172
2	8GB	3.0 GHz	2*2 = 4	160GB	workq	-	InfiniBand	noco075-106	32
3	8GB	3.2 GHz	2*1 = 2	80GB	-	mem8gb	InfiniBand	noco205-208	4
4	16GB	3.2 GHz	2*1 = 2	80GB	workq	mem16gb	InfiniBand	noco209-210	2
5	128GB	2.4 GHz	2*4 = 8	1.7TB	pp	-	GigE	pp2 - pp3	2
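As an illustration of how the table might be used, the following sketch builds a Torque resource request for a given node type. It assumes that the entries in the "PBS properties" column can be appended as node properties to a -l nodes=... request; this is standard Torque syntax, but it is not confirmed here as cacau policy. build_nodes_request and job.sh are hypothetical names, and the qsub command is only printed, not submitted.

```shell
#!/bin/sh
# build_nodes_request is a hypothetical helper, not a tool installed on cacau.
# It prints a Torque "nodes" resource string, optionally with a node property.
build_nodes_request() {
    count="$1"
    property="$2"
    if [ -n "$property" ]; then
        printf 'nodes=%s:%s\n' "$count" "$property"
    else
        printf 'nodes=%s\n' "$count"
    fi
}

# Example: four nodes of type 3 (8GB memory, property mem8gb).
# The qsub line is printed for inspection instead of being executed.
request="$(build_nodes_request 4 mem8gb)"
echo "qsub -l ${request} job.sh"
```

Printing the command first makes it easy to check the request against the table before anything is queued.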

Access

The only way to access cacau.hww.de (the frontend node of the NEC cluster) from outside the HWW network is through ssh. Information on how to set up ssh can be found on our web server under Secure Shell (ssh).

Usage

The frontend node cacau.hww.de is intended as the single point of access to the entire cluster. Here you can set up your environment, move your data, edit and compile your programs, and create batch scripts. Interactive use that causes a high load, such as running your programs, is NOT allowed on the frontend node cacau.hww.de. The compute nodes for running parallel jobs are available only through the batch system installed on the frontend node cacau.hww.de!

HOME directories

All user HOME directories for every compute node of the cluster are located on the master node cacau.hww.de. The compute nodes mount the HOME directories via NFS, so the path to your HOME is the same on every node of the cluster. The filesystem space on HOME is limited by a quota of 50 MB! Please note the Filesystem Policy! Default startup files (.profile, .cshrc, ...) for your environment settings can be found in /usr/local/skel. Only the default startup files and the commands module or switcher support the HWW cluster features such as MPI and compiler settings (see Program Development and Environment Settings).
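Because the HOME quota is small, it can help to check your usage before writing more data. The sketch below compares the disk usage of a directory with a limit in KB. dir_usage_kb and check_usage are hypothetical helper names, and the sketch assumes that du approximates what the quota system counts, which may not hold exactly.

```shell
#!/bin/sh
# Hypothetical helpers; the 50 MB figure is the HOME quota stated above.
QUOTA_KB=$((50 * 1024))

# Print the disk usage of a directory in KB (du -k reports 1024-byte units).
dir_usage_kb() {
    du -sk "$1" | cut -f1
}

# Print "ok" if the directory uses at most limit_kb, otherwise "over".
check_usage() {
    used="$(dir_usage_kb "$1")"
    if [ "$used" -le "$2" ]; then
        echo "ok"
    else
        echo "over"
    fi
}

# Usage on the cluster (not run here): check_usage "$HOME" "$QUOTA_KB"
```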

SCRATCH directories

Local scratch

When you allocate nodes using the batch queuing system (Torque), the system creates your own scratch area on each of the allocated nodes. The path to this local scratch area is stored in the environment variable SCRDIR (echo $SCRDIR) in your batch job shell. After your batch job has finished, $SCRDIR is removed automatically.
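A minimal sketch of how a job script might use the local scratch area: work inside $SCRDIR and copy the results out before the job ends, since the directory is removed afterwards. RESULT_DIR is a hypothetical variable for a destination that outlives the job, and the mktemp fallbacks are an assumption added only so the sketch can be tried outside a batch job.

```shell
#!/bin/sh
# Inside a Torque job SCRDIR is set by the batch system. The mktemp fallback
# exists only so this sketch can also be tried outside a job.
SCRDIR="${SCRDIR:-$(mktemp -d)}"
# RESULT_DIR is a hypothetical destination that survives the job.
RESULT_DIR="${RESULT_DIR:-$(mktemp -d)}"

cd "$SCRDIR" || exit 1

# Placeholder for the real computation: write a temporary file in scratch
# and derive a small result from it.
echo "intermediate data" > work.tmp
wc -c < work.tmp > result.txt

# Copy results out before the job ends, because $SCRDIR is removed afterwards.
cp result.txt "$RESULT_DIR"/
```

In a real job, RESULT_DIR would be set to a directory you control on a shared filesystem rather than left to the fallback.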

Global scratch

Global scratch space is also available on shared filesystems. There are 2 global shared filesystems available on cacau:

default

A filesystem that is available via NFS on all cacau cluster nodes and on the cacau frontend system.

gpfs

An IBM GPFS filesystem shared globally across different HWW clusters. To use it on the cacau compute nodes, you need to create a file named '.gpfs' in your HOME directory (touch $HOME/.gpfs). The GPFS filesystem needs some of the compute nodes' memory. If you are short of memory on those nodes and you do not need this filesystem, please delete $HOME/.gpfs. If no such file is found in your HOME, the GPFS modules will not be loaded on the compute nodes.

You are responsible for obtaining global scratch space from the system yourself. To get access to these global scratch filesystems, you have to use the workspace mechanism.
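The .gpfs marker file described above can be switched on and off with two one-line commands. The sketch below wraps them in small shell functions; gpfs_enable, gpfs_disable and gpfs_status are hypothetical names, and only the marker file $HOME/.gpfs itself comes from this page.

```shell
#!/bin/sh
# Hypothetical helpers around the marker file $HOME/.gpfs.

# Request that the GPFS modules are loaded on the compute nodes.
gpfs_enable() {
    touch "$HOME/.gpfs"
}

# Stop requesting GPFS, e.g. when compute node memory is tight.
gpfs_disable() {
    rm -f "$HOME/.gpfs"
}

# Report whether the marker file currently exists.
gpfs_status() {
    if [ -e "$HOME/.gpfs" ]; then
        echo "enabled"
    else
        echo "disabled"
    fi
}

# Usage (not run here): gpfs_enable; gpfs_status
```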

Environment Settings

In order to use software features such as special MPI versions or compilers, you have to adjust your environment settings.

Environment Settings using command switcher

switcher <tag> --show [--system or --user]

This shows you the current system or user default for a certain tag.

switcher --list

This shows you all available tags.

switcher <tag> --list

This shows you all available names for the given tag.

switcher mpi = voltaire_gcc

This sets a new MPI default.
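The switcher calls above can be combined into one small routine: list the names for the mpi tag, show the current default, then set a new one. The sketch below uses only the invocations shown on this page. show_and_set_mpi is a hypothetical wrapper name, and nothing is assumed about the output format of switcher or about whether it asks for confirmation.

```shell
#!/bin/sh
# show_and_set_mpi is a hypothetical wrapper around the switcher calls above.
# It refuses to run where switcher is not installed.
show_and_set_mpi() {
    name="$1"
    if ! command -v switcher >/dev/null 2>&1; then
        echo "switcher not found; run this on the cacau frontend" >&2
        return 1
    fi
    switcher mpi --list          # names available for the mpi tag
    switcher mpi --show          # current default
    switcher mpi = "$name"       # set the new default
}

# Usage (not run here): show_and_set_mpi voltaire_gcc
```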

Environment Settings using command module

The module command manages environment variables such as PATH, MANPATH and LD_LIBRARY_PATH.

To invoke the module command, type:
```
module option args
```

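module help modulecommand

help prints usage information for the module command.

module avail

avail lists all available modules.

module list

list shows the modules that are currently loaded.

module add / module load modulename

add (or load) loads the module modulename into your environment.

module rm / module unload modulename

rm (or unload) removes the module modulename from your environment.

module display modulename

display shows the environment changes that the module modulename would make.

module switch modulename/currentversion modulename/newversion

switch replaces the currently loaded version of a module with another version.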

Using $HOME/.modulerc

#%Module1.0#
set version 1.0
module load use.own

The module use.own will add $HOME/privatemodules to the list of directories that the module command will search for modules. Place your own module files here. This module, when loaded, will create this directory if necessary.
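As a sketch of what a private module file might look like, the following shell function writes a small Tcl modulefile into a given directory. write_private_module, the module name mytool and the path $HOME/mytool/bin are hypothetical and used only for illustration; module-whatis and prepend-path are standard modulefile commands.

```shell
#!/bin/sh
# write_private_module is a hypothetical helper; "mytool" and its path are
# made-up names used only for illustration.
write_private_module() {
    moddir="$1"
    mkdir -p "$moddir" || return 1
    # Quoted heredoc delimiter: the Tcl text is written literally,
    # so $env(HOME) is expanded by the module command, not by the shell.
    cat > "$moddir/mytool" <<'EOF'
#%Module1.0
module-whatis "mytool (example private module)"
prepend-path PATH $env(HOME)/mytool/bin
EOF
}

# Usage on the cluster (not run here):
#   write_private_module "$HOME/privatemodules"
#   module load use.own
#   module load mytool
```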

Filesystem Policy

IMPORTANT! NO BACKUP!! There is NO backup of any user data located on HWW systems. The only protection of your data is the redundant disk subsystem. This RAID system (RAID 5) is able to handle the failure of one component (e.g. a single disk or a controller). There is NO way to recover inadvertently removed data. Users have to back up critical data at their local site!
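Since backups are the user's responsibility, one possible approach is to pack important directories into a dated archive and then copy that archive to your own site. The sketch below covers only the packing step. make_backup_archive and the paths in the usage comment are hypothetical, and the transfer itself (for example with scp) is left out.

```shell
#!/bin/sh
# make_backup_archive is a hypothetical helper. It packs directory $1 into
# a dated, compressed tar archive inside directory $2 and prints its path.
make_backup_archive() {
    src="$1"
    dest="$2"
    archive="$dest/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Usage (not run here; both paths are placeholders):
#   make_backup_archive /path/to/results /path/to/archive_dir
```

Keep the HOME quota in mind when choosing where the archive is written.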

The home directory of each user is available on all nodes. Each node has a local /scratch directory (which is much faster than the NFS-mounted home directory) that should be used as temporary file space. In your batch job session, the environment variable SCRDIR is set on all allocated nodes. $SCRDIR is your local scratch directory, located in /scratch.

Support / Feedback

Please report all problems to:

System Administrators

Thomas Beisel

Bernd Krischok

Danny Sternkopf

Applications

Martin Bernreuther
