Notes about Installation and Configuration of a stand alone Torque server

  • These notes are provided by site admins on a best effort base as a contribution to the IGI communities and MUST not be considered as a subsitute of the Official IGI documentation.
  • This document is addressed to site administrators responsible for middleware installation and configuration.
  • The goal of this page is to provide some hints and examples on how to install and configure a stand alone EMI TORQUE server

References

  1. About IGI - Italian Grid infrastructure
  2. About IGI Release
  3. IGI Official Installation and Configuration guide
  4. Generic Installation & Configuration for EMI 1
  5. Yaim Guide
  6. site-info.def yaim variables
  7. MPI yaim variables
  8. WN yaim variables
  9. TORQUE Yaim variables
  10. EMI-WN v.1.0.0
  11. gLite-MPI v.1.0.0
  12. MPI-Start Installation and Configuration
  13. Troubleshooting Guide for Operational Errors on EGI Sites
  14. Grid Administration FAQs page

Service installation

O.S. and Repos

  • Starts from a fresh installation of Scientific Linux 5.x (x86_64).
# cat /etc/redhat-release 
Scientific Linux SL release 5.7 (Boron) 

* Install the additional repositories: EPEL, Certification Authority, UMD

# yum install yum-priorities yum-protectbase
# cd /etc/yum.repos.d/
# rpm -ivh http://mirror.switch.ch/ftp/mirror/epel//5/x86_64/epel-release-5-4.noarch.rpm
# wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
# rpm -ivh http://repo-pd.italiangrid.it/mrepo/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
# wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-emi.repo

  • Be sure that SELINUX is disabled (or permissive). Details on how to disable SELINUX are here:

# getenforce 
Disabled

  • Check the repos list (sl-*.repo are the repos of the O.S. and they should be present by default).

# ls /etc/yum.repos.d/
egi-trustanchors.repo  
emi1-third-party.repo emi1-base.repo emi1-updates.repo
epel.repo epel-testing.repo  igi-emi.repo
sl-contrib.repo sl-fastbugs.repo sl-security.repo sl-testing.repo sl-debuginfo.repo sl.repo sl-srpms.repo
IMPORTANT: remove the dag repository if present

yum install

# yum clean all
Loaded plugins: downloadonly, kernel-module, priorities, protect-packages, protectbase, security, verify, versionlock
Cleaning up Everything

# yum install emi-torque-server emi-torque-utils
# yum install yaim-addons
# yum install nfs-utils
see here for details

Service configuration

You have to copy the configuration files in another path, for example root, and set them properly (see later):
# cp -r /opt/glite/yaim/examples/siteinfo/* .

vo.d directory

Create the vo.d directory for the VO configuration file (you can decide if keep the VO information in the site.def or putting them in the vo.d directory)
# mkdir vo.d
here an example for some VOs.

Information about the several VOs are available at the CENTRAL OPERATIONS PORTAL.

users and groups configuration

here an example on how to define pool accounts (ig-users.conf) and groups (ig-groups.conf) for several VOs

wn-list.conf

Set in this file the WNs list, for example:

# less wn-list.conf 
wn05.cnaf.infn.it
wn06.cnaf.infn.it

site-info.def

SUGGESTION: use the same site-info.def for CREAM and WNs: for this reason in this example file there are yaim variable used by CREAM, TORQUE or emi-WN.

It is also included the settings of some VOs

For your convenience there is an explanation of each yaim variable. For more details look at [8, 9, 10]

host certificate

# ll /etc/grid-security/host*
-rw-r--r-- 1 root root 1440 Oct 18 09:31 /etc/grid-security/hostcert.pem
-r-------- 1 root root  887 Oct 18 09:31 /etc/grid-security/hostkey.pem

munge configuration

IMPORTANT: The updated EPEL5 build of torque-2.5.7-1 as compared to previous versions enables munge as an inter node authentication method.

  • verify that munge is correctly installed:
# rpm -qa | grep munge
munge-libs-0.5.8-8.el5
munge-0.5.8-8.el5
  • On one host (for example the batch server) generate a key by launching:
# /usr/sbin/create-munge-key

# ls -ltr /etc/munge/
total 4
-r-------- 1 munge munge 1024 Jan 13 14:32 munge.key
  • Copy the key, /etc/munge/munge.key to every host of your cluster, adjusting the permissions:
# chown munge:munge /etc/munge/munge.key
  • Start the munge daemon on each node:
# service munge start
Starting MUNGE:                                            [  OK  ]

# chkconfig munge on

Verify to have set all the yaim variables by launching:
# /opt/glite/yaim/bin/yaim -v -s site-info_batch.def -n TORQUE_server -n TORQUE_utils

see details

# /opt/glite/yaim/bin/yaim -c -s site-info_batch.def -n TORQUE_server -n TORQUE_utils

see details

tomcat and ldap users

It is necessary to create tomcat and ldap users on the torque server, otherwise the computing elements will fail in connecting the server.

When those users doesn't exist on the server, on the CE you will see errors like the following

2012-04-24 15:37:29 lcg-info-dynamic-scheduler: LRMS backend command returned nonzero exit status
2012-04-24 15:37:29 lcg-info-dynamic-scheduler: Exiting without output, GIP will use static values
Can not obtain pbs version from host
[...]

instead, on the torque server:

04/24/2012 14:00:46;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from tomcat@cream-01.cnaf.infn.it
04/24/2012 14:01:02;0080;PBS_Server;Req;req_reject;Reject reply code=15021(Invalid credential), aux=0, type=StatusJob, from ldap@cream-01.cnaf.infn.it

Solution is to add tomcat and ldap users/groups to torque host and restart pbs_server - as they exists only on CreamCE host.

# echo 'tomcat:x:91:91:Tomcat:/usr/share/tomcat5:/bin/sh' >> /etc/passwd
# echo 'ldap:x:55:55:LDAP User:/var/lib/ldap:/bin/false' >> /etc/passwd
# echo 'tomcat:x:91:' >> /etc/group
# echo 'ldap:x:55:' >> /etc/group

Software Area settings

If the Software Area is hosted on your CE, you have to create it and export to the WNs in the site.def we set:

VO_SW_DIR=/opt/exp_soft
  • directory creation
mkdir /opt/exp_soft/
  • edit /etc/exports creating a line like the following:
/opt/exp_soft/ *.cnaf.infn.it(rw,sync,no_root_squash)
  • check nfs and portmap status
# service nfs status
rpc.mountd is stopped
nfsd is stopped

# service portmap status
portmap is stopped

# service portmap start
Starting portmap:                                          [  OK  ]

# service nfs start
Starting NFS services:                                     [  OK  ]
Starting NFS daemon:                                       [  OK  ]
Starting NFS mountd:                                       [  OK  ]
Starting RPC idmapd:                                       [  OK  ]

# chkconfig nfs on
# chkconfig portmap on
  • after any modification in /etc/exports you can launch
# exportfs -ra
or simply restart nfs daemon

walltime workaround

If on the CE queues there is published:
GlueCEStateWaitingJobs: 444444
and in the log /var/log/bdii/bdii-update.log on CE you notice errors like the folllowing:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
    wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
probably the queues have no "resources_default.walltime" parameter configured.

So define it for each queue by launching, for example:

# qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00"

adding a second CE

In order to allow the submission from a second CE, do the following actions on the batch server:

  • edit the files /etc/hosts.equiv and /etc/ssh/shosts.equiv adding the FQDN of the second CE

  • define the parameter authorized_users in the pbs server:
# qmgr -c "set server authorized_users += *@cream-02.cnaf.infn.it"
Regarding the ssh configuration, have a look NotesAboutInstallationAndConfigurationOfCREAMForTORQUE

Service Checks

  • After service installation to have a look if all were installed in a proper way, you could have a look to Service CREAM Reference Card
  • You can also perform some checks after the installation and configuration of your CREAM

TORQUE checks:

  • check the pbs settings:
# qmgr -c 'p s'
  • check the WNs state
# pbsnodes -a

maui settings

In order to reserve a job slot for test jobs, you need to apply some settings in the maui configuration (/var/spool/maui/maui.cfg)

Suppose you have enabled the test VOs (ops, dteam and infngrid) on the "cert" queue and that you have 8 job slots available. Add the following lines in the maui.cfg files:

CLASSWEIGHT 1
QOSWEIGHT 1

QOSCFG[normal] MAXJOB=7

CLASSCFG[prod] QDEF=normal
CLASSCFG[cert] PRIORITY=5000

After the modification restart maui.

In order to avoid that yaim overwrites this file during the host reconfiguration, set:

CONFIG_MAUI="no"

in your site.def (the first time you launch the yaim script, it has to be set to "yes"

Revisions

Date Comment By
2012-05-31 installation notes completed Alessandro Paolini
2012-05-25 First draft Alessandro Paolini

-- AlessandroPaolini - 2012-05-25

Topic revision: r2 - 2012-06-13 - AlessandroPaolini
 
This site is powered by the TWiki collaboration platformCopyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback