Notes about Installation and Configuration of WN using Torque with MPI support

  • These notes are provided by site admins on a best-effort basis as a contribution to the IGI communities and MUST NOT be considered a substitute for the official IGI documentation.
  • This document is addressed to site administrators responsible for middleware installation and configuration.
  • The goal of this page is to provide some hints and examples on how to install and configure a WN + MPI service based on EMI middleware, using TORQUE as the batch system.

References

  1. About IGI - Italian Grid infrastructure
  2. About IGI Release
  3. IGI Official Installation and Configuration guide
  4. Generic Installation & Configuration for EMI 1
  5. Yaim Guide
  6. site-info.def yaim variables
  7. MPI yaim variables
  8. WN yaim variables
  9. TORQUE Yaim variables
  10. EMI-WN v.1.0.0
  11. gLite-MPI v.1.0.0
  12. MPI-Start Installation and Configuration
  13. Troubleshooting Guide for Operational Errors on EGI Sites
  14. Grid Administration FAQs page

Service installation

O.S. and Repos

  • Start from a fresh installation of Scientific Linux 5.x (x86_64).
# cat /etc/redhat-release 
Scientific Linux SL release 5.7 (Boron) 

  • Install the additional repositories: EPEL, Certification Authority, UMD

# yum install yum-priorities yum-protectbase
# cd /etc/yum.repos.d/
# rpm -ivh http://mirror.switch.ch/ftp/mirror/epel//5/x86_64/epel-release-5-4.noarch.rpm
# wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
# rpm -ivh http://repo-pd.italiangrid.it/mrepo/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
# wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-emi.repo

  • Be sure that SELinux is disabled (or permissive):

# getenforce 
Disabled
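The check above can also be scripted, for example at the top of a node setup script. This is only a minimal sketch: the function name selinux_mode_ok is made up for illustration, and the system is queried only if getenforce is available.

```shell
#!/bin/sh
# Return 0 if the given SELinux mode is acceptable for this setup
# (Disabled or Permissive), 1 otherwise.
# NOTE: selinux_mode_ok is a hypothetical helper name, not part of any package.
selinux_mode_ok() {
    case "$1" in
        Disabled|Permissive) return 0 ;;
        *) return 1 ;;
    esac
}

# Query the running system only when getenforce is actually installed.
if command -v getenforce >/dev/null 2>&1; then
    mode=$(getenforce)
    if selinux_mode_ok "$mode"; then
        echo "SELinux mode is $mode: OK"
    else
        echo "SELinux mode is $mode: set it to permissive or disabled first" >&2
    fi
fi
```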

  • Check the repos list (sl-*.repo are the repos of the O.S. and they should be present by default).

# ls /etc/yum.repos.d/
egi-trustanchors.repo  
emi1-third-party.repo emi1-base.repo emi1-updates.repo
epel.repo epel-testing.repo  igi-emi.repo
sl-contrib.repo sl-fastbugs.repo sl-security.repo sl-testing.repo sl-debuginfo.repo sl.repo sl-srpms.repo
IMPORTANT: remove the dag repository if present
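The repository check can be automated with a small helper like the one below. It is a sketch under some assumptions: the function name check_repos is hypothetical, the list of expected files is taken from the listing above (the sl-* files of the O.S. are not checked), and the dag repository is assumed to live in a file matching *dag*.repo.

```shell
#!/bin/sh
# Report problems in a yum repository directory ($1, default /etc/yum.repos.d).
# Returns 1 if an expected repo file is missing or a dag repo file is present.
# NOTE: check_repos and the *dag*.repo pattern are assumptions for illustration.
check_repos() {
    local dir status f
    dir=${1:-/etc/yum.repos.d}
    status=0
    for f in egi-trustanchors.repo emi1-base.repo emi1-updates.repo \
             emi1-third-party.repo epel.repo igi-emi.repo; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    # The notes ask to remove the dag repository if present.
    for f in "$dir"/*dag*.repo; do
        if [ -e "$f" ]; then
            echo "remove: ${f##*/}"
            status=1
        fi
    done
    return $status
}
```

Calling check_repos without arguments inspects /etc/yum.repos.d; no output and exit status 0 mean that nothing needs to be fixed.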

yum install

# yum clean all
Loaded plugins: downloadonly, kernel-module, priorities, protect-packages, protectbase, security, verify, versionlock
Cleaning up Everything

# yum install ca-policy-egi-core
# yum install igi-wn_torque_noafs 
# yum install glite-mpi 
# yum install emi-version
# yum install openmpi openmpi-devel
# yum install mpich2 mpich2-devel
# yum install nfs-utils
see here for details
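After the yum commands have finished, it can be useful to verify that nothing was skipped. The sketch below compares the package names used above with a list of installed package names read from standard input; the function name missing_packages and the WANTED variable are hypothetical.

```shell
#!/bin/sh
# Packages installed in the notes above.
WANTED="ca-policy-egi-core igi-wn_torque_noafs glite-mpi emi-version
openmpi openmpi-devel mpich2 mpich2-devel nfs-utils"

# Read installed package names (one per line) on stdin and print the
# wanted packages that are not among them.
# NOTE: missing_packages is a hypothetical helper name.
missing_packages() {
    local installed pkg
    installed=$(cat)
    for pkg in $WANTED; do
        # -x matches whole lines, so "openmpi" does not match "openmpi-devel"
        printf '%s\n' "$installed" | grep -qxF -- "$pkg" || echo "$pkg"
    done
}
```

On a WN it would be fed with the rpm database, for example: rpm -qa --qf '%{NAME}\n' | missing_packages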

Service configuration

Copy the example configuration files to another path, for example the home directory of root, and then edit them as described in the following sections:
# ls /opt/glite/yaim/examples/siteinfo/
services  site-info.def

# ls /opt/glite/yaim/examples/siteinfo/services/
glite-mpi  glite-mpi_ce  glite-mpi_sl4-x86  glite-mpi_sl5-x86_64  glite-mpi_wn  glite-vobox  glite-wn  glite-wn_tar  igi-mpi

# cp -r /opt/glite/yaim/examples/siteinfo/* .

In the services directory, keep and edit only these files:

# ls services/
glite-mpi  glite-mpi_ce  glite-mpi_wn
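The clean-up of the copied services directory can be sketched as follows. The function name prune_services is hypothetical; it deletes every regular file except the three listed above, so it should be run only on the copy, never on /opt/glite/yaim/examples.

```shell
#!/bin/sh
# Remove every regular file in directory $1 except the three MPI service
# files that these notes keep.
# NOTE: prune_services is a hypothetical helper name; it deletes files.
prune_services() {
    local dir f
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "${f##*/}" in
            glite-mpi|glite-mpi_ce|glite-mpi_wn) ;;   # keep these
            *) rm -f -- "$f" ;;
        esac
    done
}
```

For the layout used in these notes the call would be: prune_services ./services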

vo.d directory

Create the vo.d directory for the VO configuration files (you can either keep the VO information in site-info.def or put it in the vo.d directory):
# mkdir vo.d
Here is an example for some VOs.

Information about the individual VOs is available at the CENTRAL OPERATIONS PORTAL.

users and groups configuration

Here is an example of how to define pool accounts (ig-users.conf) and groups (ig-groups.conf) for several VOs.
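Pool account lines are repetitive, so they are often generated rather than typed by hand. The sketch below assumes the usual yaim users.conf layout UID:LOGIN:GID:GROUP:VO:FLAG: with an empty flag for plain pool accounts; the function name gen_pool_accounts and the UID/GID values in the usage line are purely illustrative and must be adapted to your site.

```shell
#!/bin/sh
# Print pool account lines for one VO.
#   $1 = VO name (also used as login prefix and group name)
#   $2 = number of accounts, $3 = first UID, $4 = GID
# NOTE: gen_pool_accounts is a hypothetical helper; the line layout
# UID:LOGIN:GID:GROUP:VO:FLAG: is assumed, check it against the Yaim Guide.
gen_pool_accounts() {
    local vo count uid gid i
    vo=$1; count=$2; uid=$3; gid=$4
    i=1
    while [ "$i" -le "$count" ]; do
        printf '%d:%s%03d:%d:%s:%s::\n' \
            "$((uid + i - 1))" "$vo" "$i" "$gid" "$vo" "$vo"
        i=$((i + 1))
    done
}
```

For example, gen_pool_accounts dteam 50 18118 1811 >> ig-users.conf would append 50 accounts named dteam001 to dteam050.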

site-info.def

SUGGESTION: use the same site-info.def for CREAM and the WNs. For this reason, the example file contains yaim variables used by CREAM, TORQUE and emi-WN.

It also includes the settings of some VOs.

For your convenience, each yaim variable is explained. For more details, see [6, 7, 8, 9].

glite-mpi

In the following example, support for MPICH2 and OPENMPI is enabled; moreover, the WNs are configured to use shared homes.

############################################
# Mandatory parameters in services/mpi     #
############################################

# N.B. this file contains common configuration for CE and WN
# As such, it should be included in your site-info.def to ensure
# that the configuration of the CE and WNs remains in sync.

#----------------------------------
# MPI-related configuration:
#----------------------------------
# Several MPI implementations (or "flavours") are available.
# If you do NOT want a flavour to be configured, set its variable
# to "no". Otherwise, set it to "yes". If you want to use an
# already installed version of an implementation, set its "_PATH" and
# "_VERSION" variables to match your setup (examples below).
#
# NOTE 1: the CE_RUNTIMEENV will be automatically updated in the file
# functions/config_mpi_ce, so that the CE advertises the MPI implementations
# you choose here - you do NOT have to change it manually in this file.
# It will become something like this:
#
#   CE_RUNTIMEENV="$CE_RUNTIMEENV
#              MPICH
#              MPICH-1.2.7p4
#              MPICH2
#              MPICH2-1.0.4
#              OPENMPI
#              OPENMPI-1.1
#              LAM"
#
# NOTE 2: it is currently NOT possible to configure multiple concurrent
# versions of the same implementations (e.g. MPICH-1.2.3 and MPICH-1.2.7)
# using YAIM. Customize "/opt/glite/yaim/functions/config_mpi_ce" file
# to do so.

###############
# The following examples are applicable to default SL 5.3 x86_64 (gLite 3.2 WN)
# Support for MPICH 1 is dropped
MPI_MPICH_ENABLE="no"
MPI_MPICH2_ENABLE="yes"
MPI_OPENMPI_ENABLE="yes"
MPI_LAM_ENABLE="no"

#---
# Example for using an already installed version of MPI.
# Just fill in the path to its current installation (e.g. "/usr")
# and which version it is (e.g. "6.5.9").
#---

# DEFAULT Parameters
# The following parameters are correct for a default SL 5.X x86_64 WN
#MPI_MPICH_PATH="/opt/mpich-1.2.7p1/"
#MPI_MPICH_VERSION="1.2.7p1"
MPI_MPICH2_PATH="/usr/lib64/mpich2/"
MPI_MPICH2_VERSION="1.2.1p1"
MPI_OPENMPI_PATH="/usr/lib64/openmpi/1.4-gcc/"
MPI_OPENMPI_VERSION="1.4"
#MPI_LAM_VERSION="7.1.2"

# If you provide mpiexec (http://www.osc.edu/~pw/mpiexec/index.php)
# for MPICH or MPICH2, please state the full path to that file here.
# Otherwise do not set this variable. (Default is to set this to
# the location of mpiexec set by the glite-MPI_WN metapackage.)

# Most versions of MPI now distribute their own versions of mpiexec
# However, I had some problems with the MPICH2 version - so use standard mpiexec
MPI_MPICH_MPIEXEC="/usr/bin/mpiexec"
MPI_MPICH2_MPIEXEC="/usr/bin/mpiexec"
MPI_OPENMPI_MPIEXEC="/usr/lib64/openmpi/1.4-gcc/bin/mpiexec"


#########  MPI_SHARED_HOME section
# Set this variable to one of the following:
# MPI_SHARED_HOME="no" if a shared directory is not used
# MPI_SHARED_HOME="yes" if the HOME directory area is shared
# MPI_SHARED_HOME="/Path/to/Shared/Location" if a shared area other
#    than the HOME directory is used.

# If you do NOT provide a shared home, set MPI_SHARED_HOME to "no" (default).
#MPI_SHARED_HOME=${MPI_SHARED_HOME:-"no"}

# If you do provide a shared home and Grid jobs normally start in that area,
# set MPI_SHARED_HOME to "yes".
MPI_SHARED_HOME="yes"

# If you have a shared area but Grid jobs don't start there, then set
# MPI_SHARED_HOME to the location of this shared area. The permissions
# of this area need to be the same as /tmp (i.e. 1777) so that users
# can create their own subdirectories.
#MPI_SHARED_HOME=/share/cluster/mpi


######## Intra WN authentication
# This variable is normally set to yes when shared homes are not used.
# This allows the wrapper script to copy the job data to the other nodes
#
# If enabling SSH Hostbased Authentication you must ensure that
# the appropriate ssh config files are deployed.
# Affected files are the system ssh_config, sshd_config and ssh_known_hosts.
# The edg-pbs-knownhosts can be used to generate the ssh_known_hosts
#
# If you do NOT have SSH Hostbased Authentication between your WNs, 
# set this variable to "no" (default). Otherwise set it to "yes".
#
MPI_SSH_HOST_BASED_AUTH=${MPI_SSH_HOST_BASED_AUTH:-"no"}
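Before running yaim, it may be worth checking that the paths set in services/glite-mpi really exist on the WN, since the defaults depend on the installed openmpi and mpich2 versions. The following is a sketch only: the function name check_mpi_config is hypothetical, and it looks just at the MPICH2 and OPENMPI variables shown above.

```shell
#!/bin/sh
# Source the given glite-mpi file in a subshell and report enabled
# flavours whose install path or mpiexec is missing on this host.
# Returns 0 if everything is found, 1 otherwise, 2 if the file cannot be read.
# NOTE: check_mpi_config is a hypothetical helper name.
check_mpi_config() {
    (
        . "$1" || exit 2
        status=0
        for flavour in MPICH2 OPENMPI; do
            eval "enabled=\${MPI_${flavour}_ENABLE:-no}"
            [ "$enabled" = "yes" ] || continue
            eval "mpi_path=\${MPI_${flavour}_PATH:-}"
            eval "mpi_exec=\${MPI_${flavour}_MPIEXEC:-}"
            if [ ! -d "$mpi_path" ]; then
                echo "$flavour: path not found: $mpi_path"
                status=1
            fi
            # mpiexec is optional, so check it only when it is set
            if [ -n "$mpi_exec" ] && [ ! -x "$mpi_exec" ]; then
                echo "$flavour: mpiexec not executable: $mpi_exec"
                status=1
            fi
        done
        exit $status
    )
}
```

A typical call, using the full path of the edited file, would be check_mpi_config /root/services/glite-mpi; the file is sourced in a subshell, so the calling shell is not modified.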

glite-mpi_ce

# Setup configuration variables that are common to both the CE and WN

if [ -r ${config_dir}/services/glite-mpi ]; then
 source ${config_dir}/services/glite-mpi
fi

# The MPI CE config function can create a submit filter for
# Torque to ensure that CPU allocation is performed correctly.
# Change this variable to "yes" to have YAIM create this filter.
# Warning: if you have an existing torque.cfg it will be modified.
#MPI_SUBMIT_FILTER=${MPI_SUBMIT_FILTER:-"no"}
MPI_SUBMIT_FILTER="yes"

glite-mpi_wn

# Setup configuration variables that are common to both the CE and WN
# Most variables are common to CE and WN. It is easier to define
# These in a common file ${config_dir}/services/glite-mpi

if [ -r ${config_dir}/services/glite-mpi ]; then
 source ${config_dir}/services/glite-mpi
fi

munge configuration

IMPORTANT: Unlike previous versions, the updated EPEL5 build of torque-2.5.7-1 enables munge as an inter-node authentication method.

  • Verify that munge is correctly installed:
# rpm -qa | grep munge
munge-libs-0.5.8-8.el5
munge-0.5.8-8.el5
  • On one host (for example the batch server) generate a key by launching:
# /usr/sbin/create-munge-key

# ls -ltr /etc/munge/
total 4
-r-------- 1 munge munge 1024 Jan 13 14:32 munge.key
  • Copy the key /etc/munge/munge.key to every host of your cluster, adjusting the ownership:
# chown munge:munge /etc/munge/munge.key
  • Start the munge daemon on each node:
# service munge start
Starting MUNGE:                                            [  OK  ]

# chkconfig munge on
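Once the key has been copied, a quick sanity check on each node can save some debugging later. The sketch below checks that the key file has the same mode as in the listing above (0400) and, if the munge tools are installed, encodes and decodes a credential locally. The function name munge_key_mode_ok is hypothetical, and stat -c assumes GNU coreutils.

```shell
#!/bin/sh
# Return 0 if the key file $1 exists and has mode 400, as in the
# "ls -ltr /etc/munge/" listing of these notes.
# NOTE: munge_key_mode_ok is a hypothetical helper; stat -c is GNU-specific.
munge_key_mode_ok() {
    [ -f "$1" ] && [ "$(stat -c %a "$1")" = "400" ]
}

# Local round trip: encode an empty credential and decode it again.
# Runs only when both tools are installed and needs a running munge daemon.
if command -v munge >/dev/null 2>&1 && command -v unmunge >/dev/null 2>&1; then
    munge -n | unmunge | grep STATUS
fi
```

On a configured node, munge_key_mode_ok /etc/munge/munge.key should succeed, and the round trip should report a successful status if the daemon is running with a valid key.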

software area settings

You have to import the software area from the CE (or another host).
  • Edit the file /etc/fstab by adding a line like the following:
cremino.cnaf.infn.it:/opt/exp_soft/ /opt/exp_soft/ nfs rw,defaults 0 0
  • Check the nfs and portmap status:
# service nfs status
rpc.mountd is stopped
nfsd is stopped

# service portmap status
portmap is stopped

# service portmap start
Starting portmap:                                          [  OK  ]

# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]

# chkconfig nfs on
# chkconfig portmap on
  • After any modification of /etc/fstab, run:
# mount -a
  • Verify the mount:
# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              65G  1.9G   59G   4% /
/dev/sda1              99M   18M   76M  19% /boot
tmpfs                 2.0G     0  2.0G   0% /dev/shm
cremino.cnaf.infn.it:/opt/exp_soft/
                       65G  4.4G   57G   8% /opt/exp_soft

yaim check

# /opt/glite/yaim/bin/yaim -v -s site-info_batch.def -n MPI_WN -n WN_torque_noafs
   INFO: Using site configuration file: site-info_batch.def
   INFO: Sourcing service specific configuration file: ./services/glite-mpi_wn
   INFO: 
         ###################################################################
         
         .             /'.-. ')
         .     yA,-"-,( ,m,:/ )   .oo.     oo    o      ooo  o.     .oo
         .    /      .-Y a  a Y-.     8. .8'    8'8.     8    8b   d'8
         .   /           ~ ~ /         8'    .8oo88.     8    8  8'  8
         . (_/         '===='          8    .8'     8.   8    8  Y   8
         .   Y,-''-,Yy,-.,/           o8o  o8o    o88o  o8o  o8o    o8o
         .    I_))_) I_))_)
         
         
         current working directory: /root
         site-info.def date: Apr 24 09:22 site-info_batch.def
         yaim command: -v -s site-info_batch.def -n MPI_WN -n WN_torque_noafs
         log file: /opt/glite/yaim/bin/../log/yaimlog
         Tue Apr 24 11:53:02 CEST 2012 : /opt/glite/yaim/bin/yaim
         
         Installed YAIM versions:
         glite-yaim-clients 5.0.0-1
         glite-yaim-core 5.0.2-1
         glite-yaim-mpi 1.1.10-10
         glite-yaim-torque-client 5.0.0-1
         glite-yaim-torque-utils 5.0.0-1
         
         ####################################################################
   INFO: The default location of the grid-env.(c)sh files will be: /usr/libexec
   INFO: Sourcing the utilities in /opt/glite/yaim/functions/utils
   INFO: Detecting environment
   INFO: Executing function: config_mpi_wn_check 
   INFO: Executing function: config_ntp_check 
   INFO: Executing function: config_sysconfig_lcg_check 
   INFO: Executing function: config_globus_clients_check 
   INFO: Executing function: config_lcgenv_check 
   INFO: Executing function: config_users_check 
   INFO: Executing function: config_sw_dir_check 
   INFO: Executing function: config_amga_client_check 
   INFO: Executing function: config_wn_check 
   INFO: Executing function: config_vomsdir_check 
   INFO: Executing function: config_vomses_check 
   INFO: Executing function: config_glite_saga_check 
   INFO: Executing function: config_add_pool_env_check 
   INFO: Executing function: config_wn_info_check 
   INFO: Executing function: config_torque_client_check 
   INFO: Checking is done.
   INFO: All the necessary variables to configure MPI_WN WN_torque_noafs are defined in your configuration files.
   INFO: Please, bear in mind that YAIM only guarantees the definition of variables
   INFO: controlled in the _check functions.
   INFO: YAIM terminated succesfully.

yaim config

# /opt/glite/yaim/bin/yaim -c -s site-info_batch.def -n MPI_WN -n WN_torque_noafs
   INFO: Using site configuration file: site-info_batch.def
   INFO: Sourcing service specific configuration file: ./services/glite-mpi_wn
   INFO: 
         ###################################################################
         
         .             /'.-. ')
         .     yA,-"-,( ,m,:/ )   .oo.     oo    o      ooo  o.     .oo
         .    /      .-Y a  a Y-.     8. .8'    8'8.     8    8b   d'8
         .   /           ~ ~ /         8'    .8oo88.     8    8  8'  8
         . (_/         '===='          8    .8'     8.   8    8  Y   8
         .   Y,-''-,Yy,-.,/           o8o  o8o    o88o  o8o  o8o    o8o
         .    I_))_) I_))_)
         
         
         current working directory: /root
         site-info.def date: Apr 24 09:22 site-info_batch.def
         yaim command: -c -s site-info_batch.def -n MPI_WN -n WN_torque_noafs
         log file: /opt/glite/yaim/bin/../log/yaimlog
         Tue Apr 24 11:53:15 CEST 2012 : /opt/glite/yaim/bin/yaim
         
         Installed YAIM versions:
         glite-yaim-clients 5.0.0-1
         glite-yaim-core 5.0.2-1
         glite-yaim-mpi 1.1.10-10
         glite-yaim-torque-client 5.0.0-1
         glite-yaim-torque-utils 5.0.0-1
         
         ####################################################################
   INFO: The default location of the grid-env.(c)sh files will be: /usr/libexec
   INFO: Sourcing the utilities in /opt/glite/yaim/functions/utils
   INFO: Detecting environment
   INFO: Executing function: config_mpi_wn_check 
   INFO: Executing function: config_ntp_check 
   INFO: Executing function: config_sysconfig_lcg_check 
   INFO: Executing function: config_globus_clients_check 
   INFO: Executing function: config_lcgenv_check 
   INFO: Executing function: config_users_check 
   INFO: Executing function: config_sw_dir_check 
   INFO: Executing function: config_amga_client_check 
   INFO: Executing function: config_wn_check 
   INFO: Executing function: config_vomsdir_check 
   INFO: Executing function: config_vomses_check 
   INFO: Executing function: config_glite_saga_check 
   INFO: Executing function: config_add_pool_env_check 
   INFO: Executing function: config_wn_info_check 
   INFO: Executing function: config_torque_client_check 
   INFO: Executing function: config_mpi_wn_setenv 
   INFO: Executing function: config_mpi_wn 
   INFO: Executing function: config_ldconf 
   INFO: config_ldconf: function not needed anymore, left empy waiting to be removed
   INFO: Executing function: config_ntp_setenv 
   INFO: Executing function: config_ntp 
   INFO: Storing old ntp settings in /etc/ntp.conf.yaimold.20120424_115316
   INFO: Executing function: config_sysconfig_edg 
   INFO: Executing function: config_sysconfig_globus 
   INFO: Executing function: config_sysconfig_lcg 
   INFO: Executing function: config_crl 
   INFO: Now updating the CRLs - this may take a few minutes...
Enabling periodic fetch-crl:                               [  OK  ]
   INFO: Executing function: config_rfio 
   INFO: Executing function: config_globus_clients_setenv 
   INFO: Executing function: config_globus_clients 
   INFO: Configure the globus service - not needed in EMI
   INFO: Executing function: config_lcgenv 
   INFO: Executing function: config_users 
   INFO: Executing function: config_sw_dir_setenv 
   INFO: Executing function: config_sw_dir 
   INFO: Executing function: config_nfs_sw_dir_client 
   INFO: Variable $BASE_SW_DIR is not set!
   INFO: The directory /opt/exp_soft won't be mounted with NFS!
   INFO: Executing function: config_fts_client 
   INFO: Executing function: config_amga_client_setenv 
   INFO: Executing function: config_amga_client 
   INFO: Executing function: config_wn_setenv 
   INFO: Executing function: config_wn 
   INFO: Executing function: config_vomsdir_setenv 
   INFO: Executing function: config_vomsdir 
   INFO: Executing function: config_vomses 
   INFO: Executing function: config_glite_saga_setenv 
   INFO: SAGA configuration is not required
   INFO: Executing function: config_glite_saga 
   INFO: SAGA configuration is not required
   INFO: Executing function: config_add_pool_env_setenv 
   INFO: Executing function: config_add_pool_env 
   INFO: Executing function: config_wn_info 
   WARNING: No subcluster has been defined for the WN in the WN_LIST file /root/wn-list.conf
   WARNING: YAIM will use the default subcluster id: CE_HOST -> cream-01.cnaf.infn.it
   INFO: Executing function: config_torque_client 
   INFO: starting pbs_mom...
Shutting down TORQUE Mom: pbs_mom already stopped          [  OK  ]
Starting TORQUE Mom:                                       [  OK  ]
   INFO: Configuration Complete.                                               [  OK  ]
   INFO: YAIM terminated succesfully.
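The two yaim steps above can be chained so that the configuration is applied only when the check succeeds. The sketch below assumes that yaim returns a non-zero exit status when the verification fails; the function name yaim_check_then_config and the YAIM override variable are hypothetical.

```shell
#!/bin/sh
# Path of the yaim executable; can be overridden from the environment.
YAIM=${YAIM:-/opt/glite/yaim/bin/yaim}

# Run "yaim -v" first and "yaim -c" only if the verification succeeds.
#   $1 = site-info file, remaining arguments = node types
# NOTE: hypothetical helper; it assumes yaim exits non-zero on a failed check.
yaim_check_then_config() {
    local siteinfo nodes n
    siteinfo=$1
    shift
    nodes=""
    for n in "$@"; do
        nodes="$nodes -n $n"
    done
    # $nodes is left unquoted on purpose so that it splits into arguments
    if ! "$YAIM" -v -s "$siteinfo" $nodes; then
        echo "yaim check failed, configuration not applied" >&2
        return 1
    fi
    "$YAIM" -c -s "$siteinfo" $nodes
}
```

With the file names used in these notes, the call would be: yaim_check_then_config site-info_batch.def MPI_WN WN_torque_noafs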

Revisions

Date         Comment                         By
2012-05-25   installation notes completed    Alessandro Paolini
2012-04-23   First draft                     Alessandro Paolini

-- AlessandroPaolini - 2012-04-23
