Notes about Installation and Configuration of a CREAM Computing Element - EMI-2 - SL6 (external Torque, external Argus, MPI enabled)
# yum clean all
# yum install ca-policy-egi-core emi-cream-ce emi-torque-utils glite-mpi
Service configuration
# cp -vr /opt/glite/yaim/examples/siteinfo .
host certificate
Make sure the host certificate and key are in place with the right ownership and permissions:
# ll /etc/grid-security/host*
-rw-r--r-- 1 root root 1440 Oct 18 09:31 /etc/grid-security/hostcert.pem
-r-------- 1 root root  887 Oct 18 09:31 /etc/grid-security/hostkey.pem
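If the files do not match the listing above, the permissions can be corrected with a small helper. This is an illustrative sketch (the function name is made up; the 644/400 modes mirror the listing above):

```shell
# Hypothetical helper: enforce the ownership and permissions shown
# above (certificate world-readable, key readable only by its owner).
fix_host_cert_perms() {
  cert=$1; key=$2
  chmod 644 "$cert"    # -rw-r--r--
  chmod 400 "$key"     # -r--------
  # On the real CE both files must also belong to root;
  # only attempt the chown when actually running as root.
  if [ "$(id -u)" -eq 0 ]; then
    chown root:root "$cert" "$key"
  fi
}
```

Usage on the CE: `fix_host_cert_perms /etc/grid-security/hostcert.pem /etc/grid-security/hostkey.pem`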
vo.d directory
Create the directory siteinfo/vo.d and fill it with a file for each supported VO. You can download them from HERE; here is an example for some VOs.
Information about the several VOs is available at the CENTRAL OPERATIONS PORTAL.
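As a sketch of the layout, the directory can be prepared like this. The VO name, VOMS endpoint and CA DN below are placeholders, not values from this guide; take the real ones from the operations portal:

```shell
# Create the vo.d directory (on the CE this would be /root/siteinfo)
# and add one file per supported VO. The file name is the VO name;
# inside it the usual VO_<name>_* yaim variables appear without the
# per-VO prefix.
SITEINFO=${SITEINFO:-./siteinfo}
mkdir -p "$SITEINFO/vo.d"
cat > "$SITEINFO/vo.d/myvo.example.org" <<'EOF'
# Placeholder VO definition -- replace with the real values
# published for your VO.
VOMS_SERVERS="'vomss://voms.example.org:8443/voms/myvo.example.org?/myvo.example.org'"
VOMSES="'myvo.example.org voms.example.org 15000 /DC=org/DC=example/CN=voms.example.org myvo.example.org'"
VOMS_CA_DN="'/DC=org/DC=example/CN=Example CA'"
EOF
```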
site-info.def
KISS: Keep it simple, stupid! For your convenience there is an explanation of each yaim variable. For more details look HERE.
SUGGESTION: use the same site-info.def for CREAM and WNs: for this reason this example file contains yaim variables used by CREAM, TORQUE or the emi-WN.
# cat site-info.def
CE_HOST=cream-01.cnaf.infn.it
SITE_NAME=IGI-BOLOGNA
BATCH_SERVER=batch.cnaf.infn.it
BATCH_LOG_DIR=/var/torque
#BDII_HOST=egee-bdii.cnaf.infn.it
CE_BATCH_SYS=torque
JOB_MANAGER=pbs
BATCH_VERSION=torque-2.5.7
#CE_DATADIR=
CE_INBOUNDIP=FALSE
CE_OUTBOUNDIP=TRUE
CE_OS="ScientificSL"
CE_OS_RELEASE=6.2
CE_OS_VERSION="Carbon"
CE_RUNTIMEENV="IGI-BOLOGNA"
CE_PHYSCPU=8
CE_LOGCPU=16
CE_MINPHYSMEM=16000
CE_MINVIRTMEM=32000
CE_SMPSIZE=8
CE_CPU_MODEL=Xeon
CE_CPU_SPEED=2493
CE_CPU_VENDOR=intel
CE_CAPABILITY="CPUScalingReferenceSI00=1039 glexec"
CE_OTHERDESCR="Cores=1,Benchmark=4.156-HEP-SPEC06"
CE_SF00=951
CE_SI00=1039
CE_OS_ARCH=x86_64
CREAM_PEPC_RESOURCEID="http://cnaf.infn.it/cremino"
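The benchmark values above are related: by the usual EGI/WLCG convention, the published GlueSI00 figure is the HEP-SPEC06 score multiplied by 250, which is how Benchmark=4.156-HEP-SPEC06 yields CE_SI00=1039. A quick check of that arithmetic:

```shell
# SI00 convention: 1 HEP-SPEC06 corresponds to 250 SpecInt2000,
# so CE_SI00 = HEP-SPEC06 score * 250 (truncated to an integer).
hepspec06=4.156
si00=$(awk -v h="$hepspec06" 'BEGIN { printf "%d", h * 250 }')
echo "CE_SI00=$si00"
```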
USERS_CONF=/root/siteinfo/ig-users.conf
GROUPS_CONF=/root/siteinfo/ig-users.conf
QUEUES="cert prod"
CERT_GROUP_ENABLE="dteam infngrid ops /dteam/ROLE=lcgadmin /dteam/ROLE=production /ops/ROLE=lcgadmin /ops/ROLE=pilot /infngrid/ROLE=SoftwareManager /infngrid/ROLE=pilot"
PROD_GROUP_ENABLE="comput-er.it gridit igi.italiangrid.it /comput-er.it/ROLE=SoftwareManager /gridit/ROLE=SoftwareManager /igi.italiangrid.it/ROLE=SoftwareManager"
VO_SW_DIR=/opt/exp_soft
WN_LIST="/root/siteinfo/wn-list.conf"
MUNGE_KEY_FILE=/etc/munge/munge.key
CONFIG_MAUI="no"
MYSQL_PASSWORD=*********************************
APEL_DB_PASSWORD=not_used
APEL_MYSQL_HOST=not_used
SE_LIST="darkstorm.cnaf.infn.it"
SE_MOUNT_INFO_LIST="none"
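Before running yaim it can help to verify that no mandatory variable was forgotten. A minimal sketch (the helper name is made up, and the variable list in the usage line is illustrative, drawn from the example above rather than from an exhaustive yaim requirement list):

```shell
# Check that each expected variable is assigned somewhere in the
# site-info.def; report the missing ones and fail if any is absent.
check_siteinfo() {
  file=$1; shift
  missing=0
  for var in "$@"; do
    grep -q "^${var}=" "$file" || { echo "missing: $var"; missing=1; }
  done
  return $missing
}
```

Usage: `check_siteinfo /root/siteinfo/site-info.def SITE_NAME CE_HOST BATCH_SERVER QUEUES USERS_CONF GROUPS_CONF WN_LIST`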
WN list
Set in this file the WNs list, for example:
# less /root/siteinfo/wn-list.conf
wn05.cnaf.infn.it
wn06.cnaf.infn.it
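For larger clusters the file can be generated from a hostname pattern instead of typed by hand; a sketch (the node numbers are just the two examples from this guide):

```shell
# Write one worker-node FQDN per line, the format yaim's WN_LIST
# variable expects.
WN_LIST_FILE=${WN_LIST_FILE:-./wn-list.conf}
: > "$WN_LIST_FILE"
for n in 05 06; do
  echo "wn${n}.cnaf.infn.it" >> "$WN_LIST_FILE"
done
```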
services/glite-mpi_ce
# cp /opt/glite/yaim/examples/siteinfo/services/glite-mpi_ce /root/siteinfo/services/
# cat services/glite-mpi_ce
# Setup configuration variables that are common to both the CE and WN
if [ -r ${config_dir}/services/glite-mpi ]; then
    source ${config_dir}/services/glite-mpi
fi

# The MPI CE config function can create a submit filter for
# Torque to ensure that CPU allocation is performed correctly.
# Change this variable to "yes" to have YAIM create this filter.
# Warning: if you have an existing torque.cfg it will be modified.
MPI_SUBMIT_FILTER=${MPI_SUBMIT_FILTER:-"yes"}
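For context, a Torque submit filter is simply an executable that receives the job script on stdin and writes a possibly modified script to stdout; qsub then submits the filtered version. A minimal hand-written illustration of the mechanism, not the filter yaim generates (which rewrites the CPU allocation request):

```shell
# Hypothetical pass-through submit filter, wrapped in a function so it
# can be exercised directly. A real filter lives in its own executable
# file referenced from torque.cfg, and typically edits the job's
# resource request lines rather than just adding a marker comment.
submit_filter() {
  read -r first_line           # usually the "#!/bin/sh" line
  echo "$first_line"
  echo "# passed through example submit filter"
  cat                          # rest of the job script, unchanged
}
```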
services/glite-creamce
# cat /root/siteinfo/services/glite-creamce
#
# YAIM creamCE specific variables
#
#
# CE-monitor host (by default CE-monitor is installed on the same machine as
# cream-CE)
CEMON_HOST=$CE_HOST
#
# CREAM database user
CREAM_DB_USER=********************
CREAM_DB_PASSWORD=****************************
#
# Machine hosting the BLAH blparser.
# In this machine batch system logs must be accessible.
BLPARSER_HOST=$CE_HOST
# Value to be published as GlueCEStateStatus instead of Production
#CREAM_CE_STATE=Special
services/dgas_sensors (not available yet)
TODO
yaim check
Verify that you have set all the yaim variables by launching:
# /opt/glite/yaim/bin/yaim -v -s /root/siteinfo/site-info.def -n creamCE -n TORQUE_utils
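When scripting this step it can be convenient to fail early if the configuration file is missing, instead of letting yaim error out mid-run. A small illustrative wrapper (the function name is made up; the node-type list matches the command above):

```shell
# Run the yaim verification step only when the site-info.def is
# actually readable; otherwise print a clear error and fail.
yaim_verify() {
  siteinfo=$1
  if [ ! -r "$siteinfo" ]; then
    echo "cannot read $siteinfo" >&2
    return 1
  fi
  /opt/glite/yaim/bin/yaim -v -s "$siteinfo" -n creamCE -n TORQUE_utils
}
```

Usage: `yaim_verify /root/siteinfo/site-info.def`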
yaim config
# /opt/glite/yaim/bin/yaim -c -s /root/siteinfo/site-info.def -n creamCE -n TORQUE_utils
Service Checks