Notes about Installation and Configuration of a CREAM Computing Element - EMI-2 - SL6 (external Torque, external Argus, MPI enabled)
- These notes are provided by site admins on a best-effort basis as a contribution to the IGI communities and MUST NOT be considered a substitute for the official IGI documentation.
- This document is addressed to site administrators responsible for middleware installation and configuration.
- The goal of this page is to provide some hints and examples on how to install and configure an EMI-2 CREAM CE service based on EMI middleware, in no-cluster mode, with TORQUE as batch system installed on a different host, using an external ARGUS server for user authorization, and with MPI enabled.
References
- About IGI - Italian Grid infrastructure
- About IGI Release
- IGI Official Installation and Configuration guide
- EMI-2 Release
- CREAM
- CREAM TORQUE module
- Yaim Guide
- site-info.def yaim variables
- CREAM yaim variables
- TORQUE Yaim variables
- Troubleshooting Guide for Operational Errors on EGI Sites
- Grid Administration FAQs page
Service installation
O.S. and Repos
- Start from a fresh installation of Scientific Linux 6.x (x86_64).
# cat /etc/redhat-release
Scientific Linux release 6.2 (Carbon)
- Install the additional repositories: EPEL, Certification Authority, EMI-2
# yum install yum-priorities yum-protectbase epel-release
# rpm -ivh http://emisoft.web.cern.ch/emisoft/dist/EMI/2/sl6/x86_64/base/emi-release-2.0.0-1.sl6.noarch.rpm
# cd /etc/yum.repos.d/
# wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
- Be sure that SELinux is disabled (or permissive). Details on how to disable SELinux are here:
# getenforce
Disabled
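A minimal way to make this setting persistent, assuming the standard /etc/selinux/config file (a reboot is needed for the change to take effect):
# grep ^SELINUX= /etc/selinux/config
SELINUX=disabled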
yum install
# yum clean all
# yum install ca-policy-egi-core emi-cream-ce emi-torque-utils
Service configuration
You have to copy the example configuration files to another path, for example /root, and set them properly (see below):
# cp -vr /opt/glite/yaim/examples/siteinfo .
vo.d directory
Create the directory siteinfo/vo.d and fill it with a file for each supported VO. You can download them from HERE, and here is an example for some VOs. Information about the various VOs is available on the CENTRAL OPERATIONS PORTAL.
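Each file in siteinfo/vo.d is named after the VO and contains the yaim VO variables without the VO_<name>_ prefix. A minimal sketch for the dteam VO (the VOMS server hostname, port and DNs below are placeholders, not the real values):
# cat siteinfo/vo.d/dteam
SW_DIR=$VO_SW_DIR/dteam
DEFAULT_SE=$SE_HOST
VOMS_SERVERS="'vomss://voms.example.org:8443/voms/dteam?/dteam'"
VOMSES="'dteam voms.example.org 15004 /DC=org/DC=example/CN=voms.example.org dteam'"
VOMS_CA_DN="'/DC=org/DC=example/CN=Example CA'"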
users and groups
You can download the users and groups configuration files (USERS_CONF and GROUPS_CONF) from HERE.
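For reference, the yaim formats of these files are the following (UIDs, GIDs and pool account names are illustrative placeholders):
# users.conf format: UID:LOGIN:GID1[,GID2,...]:GROUP1[,GROUP2,...]:VO:FLAG:
40001:dteam001:4000:dteam:dteam::
40002:dteam002:4000:dteam:dteam::
# groups.conf format: "VOMS_FQAN":GROUP:GID:FLAG:[VO]
"/dteam"::::
"/dteam/ROLE=lcgadmin":::sgm: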
Munge
Copy the key /etc/munge/munge.key from the Torque server to every host of your cluster, adjust the permissions and start the service:
# chown munge:munge /etc/munge/munge.key
# ls -ltr /etc/munge/
total 4
-r-------- 1 munge munge 1024 Jan 13 14:32 munge.key
# chkconfig munge on
# /etc/init.d/munge restart
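For example, from the Torque server you could push the key with scp (hostnames taken from this example setup; adapt them to your cluster):
# scp -p /etc/munge/munge.key root@cream-01.cnaf.infn.it:/etc/munge/
# scp -p /etc/munge/munge.key root@wn05.cnaf.infn.it:/etc/munge/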
site-info.def
KISS: Keep it simple, stupid! For your convenience there is an explanation of each yaim variable. For more details look HERE.
# cat siteinfo/site-info.def
BATCH_SERVER=batch.cnaf.infn.it
CE_HOST=cream-01.cnaf.infn.it
CE_SMPSIZE=8
USERS_CONF=/root/siteinfo/ig-users.conf
GROUPS_CONF=/root/siteinfo/ig-groups.conf
VOS="comput-er.it dteam igi.italiangrid.it infngrid ops gridit"
QUEUES="cert prod"
CERT_GROUP_ENABLE="dteam infngrid ops /dteam/ROLE=lcgadmin /dteam/ROLE=production /ops/ROLE=lcgadmin /ops/ROLE=pilot /infngrid/ROLE=SoftwareManager /infngrid/ROLE=pilot"
PROD_GROUP_ENABLE="comput-er.it gridit igi.italiangrid.it /comput-er.it/ROLE=SoftwareManager /gridit/ROLE=SoftwareManager /igi.italiangrid.it/ROLE=SoftwareManager"
WN_LIST="/root/siteinfo/wn-list.conf"
MUNGE_KEY_FILE=/etc/munge/munge.key
CONFIG_MAUI="no"
SITE_NAME=IGI-BOLOGNA
APEL_DB_PASSWORD=not_used
APEL_MYSQL_HOST=not_used
WN list
Set the list of WNs in this file, for example:
# less /root/siteinfo/wn-list.conf
wn05.cnaf.infn.it
wn06.cnaf.infn.it
site-info.def
SUGGESTION: use the same site-info.def for CREAM and WNs: for this reason this example file contains yaim variables used by CREAM, TORQUE or the emi-WN. The settings of some VOs are also included.
For your convenience there is an explanation of each yaim variable. For more details look at [8, 9, 10].
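For instance, a WN-related variable like the following (used later in the Software Area section) can stay in the same file:
VO_SW_DIR=/opt/exp_soft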
services/glite-creamce
#
# YAIM creamCE specific variables
#
# LSF settings: path where lsf.conf is located
#BATCH_CONF_DIR=lsf_install_path/conf
#
# CE-monitor host (by default CE-monitor is installed on the same machine as
# cream-CE)
CEMON_HOST=$CE_HOST
#
# CREAM database user
CREAM_DB_USER=*********
#
CREAM_DB_PASSWORD=*********
# Machine hosting the BLAH blparser.
# In this machine batch system logs must be accessible.
#BLPARSER_HOST=set_to_fully_qualified_host_name_of_machine_hosting_blparser_server
BLPARSER_HOST=$CE_HOST
services/dgas_sensors
#
# YAIM DGAS Sensors specific variables
#
################################
# DGAS configuration variables #
################################
# For any details about DGAS variables please refer to the guide:
# http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:dgas
# Reference Resource HLR for the site.
DGAS_HLR_RESOURCE="prod-hlr-01.pd.infn.it"
# Specify the type of job which the CE has to process.
# Set "all" on the "main CE" of the site, "grid" on the others.
# Default value: all
#DGAS_JOBS_TO_PROCESS="all"
# This parameter can be used to specify the list of VOs to publish.
# If the parameter is specified, the sensors (pushd) will forward
# to the Site HLR just records belonging to one of the specified VOs.
# Leave commented if you want to send records for ALL VOs
# Default value: parameter not specified
#DGAS_VO_TO_PROCESS="vo1;vo2;vo3..."
# Bound date on jobs backward processing.
# The backward processing does not consider jobs prior to that date.
# Default value: 2009-01-01.
#DGAS_IGNORE_JOBS_LOGGED_BEFORE="2011-11-01"
# Main CE of the site.
# ATTENTION: set this variable only in the case of a site with a "single LRMS"
# in which there is more than one CE or local submission host (i.e. a host
# from which you may submit jobs directly to the batch system).
# In this case, the DGAS_USE_CE_HOSTNAME parameter must be set to the same value
# for all hosts sharing the LRMS and this value can be arbitrarily chosen among
# these submitting hostnames (you may choose the best one).
# Otherwise leave it commented.
# we have 2 CEs, cremino is the main one
DGAS_USE_CE_HOSTNAME="cremino.cnaf.infn.it"
# Path for the batch-system log files.
# * for torque/pbs:
# DGAS_ACCT_DIR=/var/torque/server_priv/accounting
# * for LSF:
# DGAS_ACCT_DIR=lsf_install_path/work/cluster_name/logdir
# * for SGE:
# DGAS_ACCT_DIR=/opt/sge/default/common/
DGAS_ACCT_DIR=/var/torque/server_priv/accounting
# Full path to the 'condor_history' command, used to gather DGAS usage records
# when Condor is used as a batch system. Otherwise leave it commented.
#DGAS_CONDOR_HISTORY_COMMAND=""
host certificate
# ll /etc/grid-security/host*
-rw-r--r-- 1 root root 1440 Oct 18 09:31 /etc/grid-security/hostcert.pem
-r-------- 1 root root 887 Oct 18 09:31 /etc/grid-security/hostkey.pem
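If needed, adjust ownership and permissions to match the listing above:
# chown root:root /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem
# chmod 644 /etc/grid-security/hostcert.pem
# chmod 400 /etc/grid-security/hostkey.pem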
munge configuration
IMPORTANT: The updated EPEL build of torque-2.5.7-1 enables munge as an inter-node authentication method (earlier builds did not).
- verify that munge is correctly installed:
# rpm -qa | grep munge
munge-libs-0.5.8-8.el5
munge-0.5.8-8.el5
- On one host (for example the batch server) generate a key by launching:
# /usr/sbin/create-munge-key
# ls -ltr /etc/munge/
total 4
-r-------- 1 munge munge 1024 Jan 13 14:32 munge.key
- Copy the key /etc/munge/munge.key to every host of your cluster, adjusting the permissions:
# chown munge:munge /etc/munge/munge.key
- Start the munge daemon on each node:
# service munge start
Starting MUNGE: [ OK ]
# chkconfig munge on
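A quick way to verify that munge credentials are accepted across hosts (assuming ssh access from the CE to the batch server):
# munge -n | unmunge
# munge -n | ssh batch.cnaf.infn.it unmunge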
Verify that you have set all the yaim variables by launching:
# /opt/glite/yaim/bin/yaim -v -s siteinfo/site-info.def -n creamCE -n TORQUE_utils -n DGAS_sensors
see details
# /opt/glite/yaim/bin/yaim -c -s siteinfo/site-info.def -n creamCE -n TORQUE_utils -n DGAS_sensors
see details
Software Area settings
If the Software Area is hosted on your CE, you have to create it and export it to the WNs.
In the site-info.def we set:
VO_SW_DIR=/opt/exp_soft
so create the directory:
# mkdir /opt/exp_soft/
- edit /etc/exports adding a line like the following (the corresponding mount on the WN side is sketched after this list):
/opt/exp_soft/ *.cnaf.infn.it(rw,sync,no_root_squash)
- check nfs and rpcbind status
# service nfs status
rpc.mountd is stopped
nfsd is stopped
# service rpcbind status
rpcbind is stopped
# service rpcbind start
Starting rpcbind: [ OK ]
# service nfs start
Starting NFS services: [ OK ]
Starting NFS daemon: [ OK ]
Starting NFS mountd: [ OK ]
Starting RPC idmapd: [ OK ]
# chkconfig nfs on
# chkconfig rpcbind on
- after any modification of /etc/exports you can launch
# exportfs -ra
or simply restart the nfs daemon
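On the WN side the exported area then has to be mounted, for example with an /etc/fstab entry like the following (a sketch, assuming the CE of this example exports the directory):
cream-01.cnaf.infn.it:/opt/exp_soft /opt/exp_soft nfs defaults 0 0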
walltime workaround
If the queues publish:
GlueCEStateWaitingJobs: 444444
and in the log /var/log/bdii/bdii-update.log you notice errors like the following:
Traceback (most recent call last):
File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
then the queues probably have no "resources_default.walltime" parameter configured.
Define it for each queue by launching, for example:
# qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00"
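You can verify the setting afterwards by listing the queue attributes on the Torque server, e.g.:
# qmgr -c "list queue prod"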
Service Checks
- After the service installation, to check that everything was installed properly, you can have a look at the Service CREAM Reference Card
- You can also perform some checks after the installation and configuration of your CREAM CE
TORQUE checks:
# qmgr -c 'p s'
# pbsnodes -a
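CREAM checks: from a UI, with a valid proxy of one of the enabled VOs, you could try a submission (the JDL file name is a hypothetical example):
$ glite-ce-allowed-submission cream-01.cnaf.infn.it:8443
$ glite-ce-job-submit -a -r cream-01.cnaf.infn.it:8443/cream-pbs-cert test.jdl
$ glite-ce-job-status <jobid_returned_by_the_submission>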
maui settings
In order to reserve a job slot for test jobs, you need to apply some settings in the maui configuration (/var/spool/maui/maui.cfg)
Suppose you have enabled the test VOs (ops, dteam and infngrid) on the "cert" queue and that you have 8 job slots available. Add the following lines to the /var/spool/maui/maui.cfg file:
CLASSWEIGHT 1
QOSWEIGHT 1
QOSCFG[normal] MAXJOB=7
CLASSCFG[prod] QDEF=normal
CLASSCFG[cert] PRIORITY=5000
After the modification restart maui.
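For example, on the host running maui (here the Torque server), assuming the standard init script:
# service maui restart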
In order to avoid that yaim overwrites this file during a host reconfiguration, set:
CONFIG_MAUI="no"
in your site-info.def (the first time you launch the yaim script, it has to be set to "yes").
Revisions
--
PaoloVeronesi - 2012-05-25