System Administrator Guide for CREAM for EMI-1 release
1 Installation and Configuration
1.1 Prerequisites
1.1.1 Operating system
A standard 64-bit SL(C)5 distribution must be properly installed.
1.1.2 Node synchronization
A general requirement for the Grid nodes is that they are synchronized. This requirement may be fulfilled in several ways. One of the most common is to use the NTP protocol with a time server.
1.1.3 Cron and logrotate
Many components deployed on the CREAM CE rely on the presence of cron (including support for /etc/cron.* directories) and logrotate. You should make sure these utilities are available on your system.
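A quick way to confirm these prerequisites before installing is a small shell check. The following is only a sketch and is not part of the CREAM distribution: the helper names have_cmd and check_prereqs are hypothetical, and the command names looked for (crontab, logrotate, ntpq) are assumptions about how the tools are packaged on your node.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command is found in PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: reports each prerequisite and returns the number
# of missing commands (0 means everything was found).
check_prereqs() {
    missing=0
    for cmd in "$@"; do
        if have_cmd "$cmd"; then
            echo "OK:      $cmd"
        else
            echo "MISSING: $cmd"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Assumed command names; adjust them to your installation.
check_prereqs crontab logrotate ntpq || echo "Some prerequisites are missing."
```

Finding the commands in PATH does not prove that the cron and NTP daemons are actually running, so this only complements a check of the services themselves.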
1.1.4 Batch system
If you plan to use Torque as batch system for your CREAM CE, it will be installed and configured along with the middleware (i.e. you don't have to install and configure it in advance).
If you plan to use LSF as batch system for your CREAM CE, you have to install and configure it before installing and configuring the CREAM software. Since LSF is commercial software, it can't be distributed together with the middleware.
If you plan to use GE as batch system for your CREAM CE, you have to install and configure it before installing and configuring the CREAM software. The CREAM CE integration was tested with GE 6.2u5 but it should work with any forked version of the original GE software. The support of the GE batch system software (or any of its forked versions) is out of the scope of this activity.
More information about batch system integration is available in the relevant section.
1.2 Plan how to deploy the CREAM CE
1.2.1 CREAM CE and gLite-cluster
glite-CLUSTER is a node type that can publish information about clusters and subclusters in a site, referenced by any number of compute elements. In particular it allows dealing with sites having multiple CREAM CE nodes and/or multiple subclusters (i.e. disjoint sets of worker nodes, each set having sufficiently homogeneous properties).
In Glue1, batch system queues are represented through GlueCE objectclasses. Each GlueCE refers to a Cluster, which can be composed of one or more SubClusters. However the gLite WMS requires the publication of exactly one SubCluster per Cluster (and hence per batch queue). Thus sites with heterogeneous hardware have two possible choices:
1.2.2 Define a DNS alias to refer to a set of CREAM CEs
In order to distribute load for job submissions, it is possible to deploy multiple CREAM CE head nodes referring to the same set of resources. As explained in the previous section, this should be implemented with:
POISE is used; using metrics (which take into account in particular the load and the sandbox size ) it decides the physical instance the alias should point to. Another possibility to define aliases is to use commercial network techniques such as F5.
It must be noted that, as observed by DESY sysadmins, the propagation of the aliases (CNAME records) is not well defined among DNS servers. Therefore changes of an alias can sometimes take hours to be propagated to other sites.
The use of alias for job submission is a good solution to improve load balancing and availability of the service (the unavailability of a physical CREAM CE would be hidden by the use of the alias). It must however be noted that:
1.2.3 Choose the authorization model
The CREAM CE can be configured to use as authorization system:
USE_ARGUS must be set in the following way:
USE_ARGUS=yes
In this case it is also necessary to set the following yaim variables:
USE_ARGUS must be set in the following way:
USE_ARGUS=no
1.2.4 Choose the BLAH BLparser deployment model
The BLAH Blparser is the component of the CREAM CE responsible for notifying CREAM about job status changes. For LSF and PBS/Torque it is possible to configure the BLAH blparser in two possible ways:
1.2.4.1 New BLAH Blparser
The new Blparser runs on the CREAM CE machine and it is automatically installed when installing the CREAM CE. The configuration of the new BLAH Blparser is done when configuring the CREAM CE (i.e. it is not necessary to configure the Blparser separately from the CREAM CE).
To use the new BLAH blparser, it is just necessary to set:
BLPARSER_WITH_UPDATER_NOTIFIER=true
in the siteinfo.def and then configure the CREAM CE. This is the default value.
The new BLParser doesn't parse the log files. However the bhist (for LSF) and tracejob (for Torque) commands (used by the new BLParser) require the batch system log files, which therefore must be available on the CREAM CE node (e.g. via NFS). Actually, for Torque the blparser uses tracejob (which requires the log files) only when qstat can no longer find the job. This can happen if the job completed more than keep_completed seconds ago and the blparser was not able to detect earlier that the job had completed, been cancelled, etc., e.g. because keep_completed is too short or because the BLAH blparser for whatever reason didn't run for a while. If the log files are not available and the tracejob command is issued (for the reasons specified above), the BLAH blparser will not be able to find the job, which will be considered "lost" (DONE-FAILED wrt CREAM).
The init script of the new Blparser is /etc/init.d/glite-ce-blahparser . Please note that it is not needed to explicitly start the new blparser: when CREAM is started, it starts also this new BLAH Blparser if it is not already running.
When the new Blparser is running, you should see the following two processes on the CREAM CE node:
tomcat on the CREAM CE should be allowed to issue the relevant status/history commands (for Torque: qstat, tracejob; for LSF: bhist, bjobs). Some sites configure the batch system so that users can only see their own jobs (e.g. in Torque: set server query_other_jobs = False). If this is done at the site, then the tomcat user will need a special privilege in order to be exempt from this setting (in Torque: set server operators += tomcat@creamce.yoursite.domain).
1.2.4.2 Old BLAH Blparser
The old BLAH blparser must be installed on a machine where the batch system log files are available (let's call this host BLPARSER_HOST). So the BLPARSER_HOST can be the batch system master or a different machine where the log files are available (e.g. they have been exported via NFS). There are two possible layouts:
BLPARSER_WITH_UPDATER_NOTIFIER=false
in the siteinfo.def before configuring via yaim.
1.2.5 Deployment models for CREAM databases
The databases used by CREAM can be deployed on the CREAM CE host (which is the default scenario) or on a different machine. See the section on configuring the CREAM databases on a host different from the CREAM-CE for details.
1.3 CREAM CE Installation
This section explains how to install:
1.3.1 Repositories
For a successful installation, you will need to configure your package manager to reference a number of repositories (in addition to those of your OS).
1.3.1.1 The EPEL repository
You can install the EPEL repository, issuing:
rpm -Uvh http://dl.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
1.3.1.2 The EMI middleware repository
You can install the EMI-1 yum repository, issuing:
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
yum install ./emi-release-1.0.1-1.sl5.noarch.rpm
1.3.1.3 The Certification Authority repository
The most up-to-date version of the list of trusted Certification Authorities (CA) is needed on your node. The relevant yum repo can be installed issuing:
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo -O /etc/yum.repos.d/EGI-trustanchors.repo
1.3.1.4 Important note on automatic updates
An update of an RPM not followed by configuration can cause problems. Therefore WE STRONGLY RECOMMEND NOT TO USE AUTOMATIC UPDATE PROCEDURE OF ANY KIND.
Running the script available at http://forge.cnaf.infn.it/frs/download.php/101/disable_yum.sh (implemented by Giuseppe Platania, INFN Catania), yum autoupdate will be disabled.
1.3.2 Installation of a CREAM CE node in no cluster mode
First of all, install the yum-protectbase rpm:
yum install yum-protectbase
Then proceed with the installation of the CA certificates.
1.3.2.1 Installation of the CA certificates
The CA certificates can be installed issuing:
yum install ca-policy-egi-core
1.3.2.2 Installation of the CREAM CE software
The CREAM software itself can work with both Sun JDK and OpenJDK. However, the apel-core package, deployed on the CREAM CE node, requires mm-mysql, which explicitly requires Sun JDK. So, to install the middleware software needed for the CREAM CE, first install Sun JDK (jdk). This is actually not needed on a standard SL5 box, since in this case the Sun JDK rpm is available in the OS repo. Please note that this dependency on mm-mysql is being fixed by APEL developers.
Then install xml-commons-apis:
yum install xml-commons-apis java-1.6.0-openjdk-devel
This is due to a dependency problem within the Tomcat distribution.
Then install the CREAM-CE metapackage:
yum install emi-cream-ce
1.3.2.3 Installation of the batch system specific software
After the installation of the CREAM CE metapackage it is necessary to install the batch system specific metapackage(s):
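If you are running Torque and the CREAM CE node is also the Torque server:
yum install emi-torque-server
yum install emi-torque-utils
If you are running Torque and the Torque server is on a different host:
yum install emi-torque-utils
If you are running LSF:
yum install emi-lsf-utils
If you are running GE:
yum install emi-ge-utils
1.3.3 Installation of a CREAM CE node in cluster mode
First of all, install the yum-protectbase rpm: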
yum install yum-protectbase
Then proceed with the installation of the CA certificates.
1.3.3.1 Installation of the CA certificates
The CA certificates can be installed issuing:
yum install ca-policy-egi-core
1.3.3.2 Installation of the CREAM CE software
The CREAM software itself can work with both Sun JDK and OpenJDK. However, the apel-core package, deployed on the CREAM CE node, requires mm-mysql, which explicitly requires Sun JDK. So, to install the middleware software needed for the CREAM CE, first install Sun JDK (jdk). This is actually not needed on a standard SL5 box, since in this case the Sun JDK rpm is available in the OS repo. Please note that this dependency on mm-mysql is being fixed by APEL developers.
Then install xml-commons-apis:
yum install xml-commons-apis java-1.6.0-openjdk-devel
This is due to a dependency problem within the Tomcat distribution.
Then install the CREAM-CE metapackage:
yum install emi-cream-ce
1.3.3.3 Installation of the batch system specific software
After the installation of the CREAM CE metapackage it is necessary to install the batch system specific metapackage(s).
If you are running Torque and the CREAM CE node is also the Torque server:
yum install emi-torque-server
yum install emi-torque-utils
If you are running Torque and the Torque server is on a different host:
yum install emi-torque-utils
If you are running LSF:
yum install emi-lsf-utils
If you are running GE:
yum install emi-ge-utils
1.3.3.4 Installation of the cluster metapackage
If the CREAM CE node also has to host the glite-cluster, install this metapackage as well:
yum install emi-cluster
1.3.4 Installation of a glite-cluster node
First of all, install the yum-protectbase rpm:
yum install yum-protectbase
Then proceed with the installation of the CA certificates.
1.3.4.1 Installation of the CA certificates
The CA certificates can be installed issuing:
yum install ca-policy-egi-core
1.3.4.2 Installation of the cluster metapackage
Install the glite-CLUSTER metapackage:
yum install emi-cluster
1.3.5 Installation of the BLAH BLparser
If the new BLAH Blparser is used, nothing needs to be installed for the BLAH Blparser (i.e. the installation of the CREAM-CE is enough). This is also the case when the old BLAH Blparser is used AND the BLPARSER_HOST is the CREAM-CE.
Only when the old BLAH Blparser is used AND the BLPARSER_HOST is different from the CREAM-CE is it necessary to install the BLParser software on this BLPARSER_HOST. This is done in the following way:
yum install glite-ce-blahp
yum install glite-ce-yaim-cream-ce
1.3.6 Installation of the CREAM CLI
The CREAM CLI is part of the EMI-UI. To install it please refer to https://twiki.cern.ch/twiki/bin/view/EMI/EMIui#Client_Installation_Configuratio
1.4 CREAM CE update
To update the CREAM CE from EMI-1 Update x to EMI-1 Update y:
1.5 CREAM CE configuration
1.5.1 Using the YAIM configuration tool
For a detailed description on how to configure the middleware with YAIM, please check the YAIM guide. The YAIM modules needed to configure a certain node type are automatically installed with the middleware.
1.5.2 Configuration of a CREAM CE node in no cluster mode
1.5.2.1 Install host certificate
The CREAM CE node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already. Once you have obtained a valid certificate:
Copy the hostcert.pem and hostkey.pem files into the /etc/grid-security directory. Then set the proper mode and ownership by running:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 644 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
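To verify the result, the file modes can be compared with the expected values. The following is a minimal sketch, assuming GNU stat (as shipped with SL5); check_mode is a hypothetical helper name, not a tool provided by the middleware.

```shell
#!/bin/sh
# Hypothetical helper: compare a file's octal mode with the expected one.
# Relies on GNU stat's "-c %a" format (an assumption about the platform).
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
check_mode() {
    file=$1
    expected=$2
    actual=$(stat -c %a "$file" 2>/dev/null) || {
        echo "cannot stat $file"
        return 2
    }
    if [ "$actual" = "$expected" ]; then
        echo "$file: mode $actual (ok)"
    else
        echo "$file: mode $actual, expected $expected"
        return 1
    fi
}

# Expected modes: 644 for the certificate, 400 for the key.
check_mode /etc/grid-security/hostcert.pem 644
check_mode /etc/grid-security/hostkey.pem 400
```

The sketch checks only the mode; ownership (root.root) would need a separate check.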
1.5.2.2 Configure the siteinfo.def file
Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE is available at https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#cream_CE
Be sure that CREAMCE_CLUSTER_MODE is set to no (or not set at all).
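This setting can be checked before running yaim. The sketch below assumes that siteinfo.def contains plain VAR=value lines without trailing comments; the helper names siteinfo_get and is_no_cluster_mode are hypothetical and are not part of yaim.

```shell
#!/bin/sh
# Hypothetical helper: print the value assigned to a variable in a
# siteinfo.def-style file, or a default when the variable is not set.
# The last assignment wins, as it would when the file is sourced.
siteinfo_get() {
    file=$1
    var=$2
    default=$3
    value=$(sed -n "s/^[[:space:]]*${var}=//p" "$file" | tail -n 1)
    # Strip optional surrounding double quotes.
    value=$(printf '%s' "$value" | sed 's/^"\(.*\)"$/\1/')
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "$default"
    fi
}

# Hypothetical helper: succeeds only if the file selects no cluster mode,
# i.e. CREAMCE_CLUSTER_MODE is "no" or not set at all.
is_no_cluster_mode() {
    [ "$(siteinfo_get "$1" CREAMCE_CLUSTER_MODE no)" = "no" ]
}
```

For example, is_no_cluster_mode /path/to/siteinfo.def && echo "no cluster mode" prints the message only when the variable is unset or set to no.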
1.5.2.3 Run yaim
After having filled in the siteinfo.def file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode>
Examples:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils
1.5.3 Configuration of a CREAM CE node in cluster mode
1.5.3.1 Install host certificate
The CREAM CE node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already. Once you have obtained a valid certificate:
Copy the hostcert.pem and hostkey.pem files into the /etc/grid-security directory. Then set the proper mode and ownership by running:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
1.5.3.2 Configure the siteinfo.def file
Set your siteinfo.def file, which is the input file used by yaim.
Variables which are required in cluster mode are described below.
When the CREAM CE is configured in cluster mode it will stop publishing information about clusters and subclusters. That information should be published by the glite-CLUSTER node type instead. A specific set of yaim variables has been defined for configuring the information which is still required by the CREAM CE in cluster mode. The names of these variables follow this syntax:
Be sure that CREAMCE_CLUSTER_MODE is set to yes.
1.5.3.3 Run yaim
After having filled in the siteinfo.def file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]
-n glite-CLUSTER must be specified only if the glite-CLUSTER is deployed on the same node as the CREAM-CE.
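The choice of -n arguments follows a fixed pattern per batch system, which can be expressed as a small shell function. This is only a sketch: the function name yaim_cream_cmd, the batch system keywords (torque-server, torque, lsf, ge) and the siteinfo path in the sample call are hypothetical, while the node type names are the ones used in this guide. The function prints the command instead of running it, so that it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print the yaim command line for a CREAM CE.
#   $1  path to the siteinfo.def file
#   $2  batch system keyword: torque-server | torque | lsf | ge
#   $3  "yes" if glite-CLUSTER is deployed on the same node (default: no)
yaim_cream_cmd() {
    siteinfo=$1
    case $2 in
        torque-server) lrms="-n TORQUE_server -n TORQUE_utils" ;;
        torque)        lrms="-n TORQUE_utils" ;;
        lsf)           lrms="-n LSF_utils" ;;
        ge)            lrms="-n SGE_utils" ;;
        *) echo "unknown batch system: $2" >&2; return 1 ;;
    esac
    cmd="/opt/glite/yaim/bin/yaim -c -s $siteinfo -n creamCE $lrms"
    if [ "${3:-no}" = "yes" ]; then
        cmd="$cmd -n glite-CLUSTER"
    fi
    echo "$cmd"
}

# Sample call with an assumed siteinfo path: LSF, glite-CLUSTER co-located.
yaim_cream_cmd /root/siteinfo/site-info.def lsf yes
```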
Examples:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils -n glite-CLUSTER
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils -n glite-CLUSTER
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils -n glite-CLUSTER
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils -n glite-CLUSTER
1.5.4 Configuration of a glite-CLUSTER node
1.5.4.1 Install host certificate
The glite-CLUSTER node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already. Once you have obtained a valid certificate:
Copy the hostcert.pem and hostkey.pem files into the /etc/grid-security directory. Then set the proper mode and ownership by running:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
1.5.4.2 Configure the siteinfo.def file
Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for glite-CLUSTER is available at https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#CLUSTER
1.5.4.3 Run yaim
After having filled in the siteinfo.def file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n glite-CLUSTER
1.5.5 Configuration of the BLAH Blparser
If the new BLAH Blparser is used, nothing needs to be configured for the BLAH Blparser (i.e. the configuration of the CREAM-CE is enough).
If the old BLparser is used, it is necessary to configure it on the BLPARSER_HOST (which, as said above, can be the CREAM-CE node or a different host). This is done in the following way:
/opt/glite/yaim/bin/yaim -r -s <site-info.def> -n creamCE -f config_cream_blparser
Then it is necessary to restart tomcat on the CREAM-CE node:
service tomcat5 restart
1.5.5.1 Configuration of the old BLAH Blparser to serve multiple CREAM CEs
The configuration instructions reported above explain how to configure a CREAM CE and the BLAH blparser (old model) considering the scenario where the BLAH blparser has to "serve" a single CREAM CE. Considering that the blparser (old model) has to run where the batch system log files are available, let's consider a scenario where there are 2 CREAM CEs (ce1.mydomain and ce2.mydomain) that must be configured. Let's suppose that the batch system log files are not available on these 2 CREAM CE machines, but on another machine (blhost.mydomain), where the old blparser has to be installed.
The following summarizes what must be done:
BLPARSER_HOST=blhost.mydomain
BLAH_JOBID_PREFIX=cre01_
BLP_PORT=33333
and configure ce1.mydomain via yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]

BLPARSER_HOST=blhost.mydomain
BLAH_JOBID_PREFIX=cre02_
BLP_PORT=33334
and configure ce2.mydomain via yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]

CREAM_PORT=56565
and configure blhost.mydomain via yaim:
/opt/glite/yaim/bin/yaim -r -s <site-info.def> -n creamCE -f config_cream_blparser

GLITE_CE_BLPARSERPBS_NUM=2
# ce01.mydomain
GLITE_CE_BLPARSERPBS_PORT1=33333
GLITE_CE_BLPARSERPBS_CREAMPORT1=56565
# ce02.mydomain
GLITE_CE_BLPARSERPBS_PORT2=33334
GLITE_CE_BLPARSERPBS_CREAMPORT2=56566
/etc/init.d/glite-ce-blparser restart
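When more than two CREAM CEs share the same old blparser, the numbered GLITE_CE_BLPARSERPBS_* settings shown above become tedious to write by hand. The sketch below generates them from a list of CEs. The function name blparser_pbs_settings and its name:blp_port:cream_port argument format are hypothetical; only the variable names and the port values in the sample call come from the example above.

```shell
#!/bin/sh
# Hypothetical helper: print the GLITE_CE_BLPARSERPBS_* settings for a
# list of CREAM CEs. Each argument has the form "name:blp_port:cream_port".
blparser_pbs_settings() {
    echo "GLITE_CE_BLPARSERPBS_NUM=$#"
    i=1
    for entry in "$@"; do
        name=${entry%%:*}      # text before the first colon
        ports=${entry#*:}      # "blp_port:cream_port"
        echo "# $name"
        echo "GLITE_CE_BLPARSERPBS_PORT$i=${ports%%:*}"
        echo "GLITE_CE_BLPARSERPBS_CREAMPORT$i=${ports#*:}"
        i=$((i + 1))
    done
}

# Reproduces the two-CE settings of the scenario described above.
blparser_pbs_settings ce1.mydomain:33333:56565 ce2.mydomain:33334:56566
```

Each BLP port must match the BLP_PORT set in the siteinfo.def of the corresponding CREAM CE.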
1.5.5.2 Configuration of the new BLAH Blparser to use cached batch system commands
With BLAH version >= 1.16-3, the new BLAH blparser can be configured so that it does not interact directly with the batch system, but through a program (to be implemented by the site admin) which can implement some caching functionality. This is the case, for example, of CommandProxyTools, implemented at CERN.
To enable this feature, add in /etc/blah.config (the example below is for LSF, with /usr/bin/runcmd.pl as name of the "caching" program):
lsf_batch_caching_enabled=yes
batch_command_caching_filter=/usr/bin/runcmd.pl
So the blparser, instead of issuing bjobs -u ...., will issue /usr/bin/runcmd.pl bjobs -u ....
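Since the caching program is left to the site admin, the following sketch shows one possible shape for it. It is not CommandProxyTools and it is written in shell rather than Perl; the function name cached_run, the cache directory and the 30 second lifetime are all assumptions made for illustration. The wrapper receives the batch system command as its arguments, as in the runcmd.pl invocation above, and reuses the stored output of an identical command line while it is still fresh.

```shell
#!/bin/sh
# Minimal caching wrapper sketch (hypothetical; not CommandProxyTools).
# Usage: cached_run <command> [args...]
CACHE_DIR=${CACHE_DIR:-/tmp/batch-cmd-cache}   # assumed location
CACHE_TTL=${CACHE_TTL:-30}                     # assumed lifetime in seconds

cached_run() {
    mkdir -p "$CACHE_DIR" || return 1
    # One cache file per distinct command line.
    key=$(printf '%s\n' "$@" | cksum | tr -d ' \t')
    file="$CACHE_DIR/$key"
    now=$(date +%s)
    if [ -f "$file" ]; then
        # The first line of a cache file holds its creation time.
        stamp=$(head -n 1 "$file")
        if [ $((now - stamp)) -lt "$CACHE_TTL" ]; then
            tail -n +2 "$file"
            return 0
        fi
    fi
    # Cache miss or stale entry: run the real command and store its output.
    tmp="$file.$$"
    { echo "$now"; "$@"; } > "$tmp" || { rm -f "$tmp"; return 1; }
    mv "$tmp" "$file"
    tail -n +2 "$file"
}
```

Saved as an executable script that ends with the line cached_run "$@", it could be referenced by batch_command_caching_filter. A production version would also need locking and cache cleanup, which this sketch omits.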
1.5.6 Configuration of the CREAM databases on a host different from the CREAM-CE
To configure the CREAM databases on a host different from the CREAM-CE:
creamdb and delegationdb . This means that some manual hacks are needed if the same mysql server host should be used to host the databases of multiple CREAM CEs.
The files where the database names are specified and should be changed before configuring via yaim are:
url="jdbc:mysql://localhost:3306/creamdb"
and:
url="jdbc:mysql://localhost:3306/delegationdb?autoReconnect=true"
You must NOT change the following line:
<Resource name="jdbc/creamdb"
if [ $3 == "creamdb" -a -d /opt/glite/var/cream_sandbox ] ; then
/************ Drop: Database ***************/
DROP DATABASE IF EXISTS creamdb;
/************ Create: Database ***************/
CREATE DATABASE creamdb;
/************ Use: Database ***************/
USE creamdb;

/************ Drop: Database ***************/
DROP DATABASE IF EXISTS delegationdb;
/************ Create: Database ***************/
CREATE DATABASE delegationdb;
/************ Use Delegation DB ***************/
use delegationdb;
my $querycmd= "mysql -B --skip-column-names -u" . $userdb . " --password=\"" . $passworddb . "\" -e \"use creamdb;select count(*) from job_status AS status LEFT OUTER JOIN job_status AS latest ON latest.jobId=status.jobId AND status.id < latest.id WHERE latest.id IS null AND status.type IN ('0','1','2','3','4','6');\"";
<cream db name>_database_version="2.4"
<delegation db name>_database_version="2.4"
just after:
creamdb_database_version="2.4"
delegationdb_database_version="2.4"

create_mysql_db creamdatabase ${CREAM_DB_USER} ${CREAM_DB_PASSWORD} "${GLITE_CREAM_LOCATION_ETC}/glite-ce-cream/populate_creamdb_mysql.sql" ${CREAM_DB_HOST} \
&& create_mysql_db delegationdatabase ${CREAM_DB_USER} ${CREAM_DB_PASSWORD} "${GLITE_CREAM_LOCATION_ETC}/glite-ce-cream/populate_delegationdb.sql" ${CREAM_DB_HOST} \

-e "s/url=\"jdbc:mysql:\/\/localhost:3306\/delegationdb?autoReconnect=true/url=\"jdbc:mysql:\/\/${CREAM_DB_HOST}:3306\/delegationdb\?autoReconnect=true/" \
-e "s/url=\"jdbc:mysql:\/\/localhost:3306\/creamdb\"/url=\"jdbc:mysql:\/\/${CREAM_DB_HOST}:3306\/creamdb\"/" \

mysqlshow --password="$MYSQL_PASSWORD" | grep "creamdatabase" > /dev/null 2>&1
mysql -u root --password="$MYSQL_PASSWORD" -e "DROP DATABASE creamdatabase"
mysqlshow --password="$MYSQL_PASSWORD" | grep "delegationdatabase" > /dev/null 2>&1
mysql -u root --password="$MYSQL_PASSWORD" -e "DROP DATABASE delegationdatabase"
1.5.7 Configuration of the CREAM CLI
The CREAM CLI is part of the EMI-UI. To configure it please refer to https://twiki.cern.ch/twiki/bin/view/EMI/EMIui#Client_Installation_Configuratio
1.5.8 Manual configuration
yaim allows choosing the most important parameters (via yaim variables) related to the CREAM-CE. It is then possible to tune some other attributes by manually editing the relevant configuration files. The following subsections describe some of the parameters that can be manually configured. Please note that:
1.5.8.1 Tune the number of concurrent BLAH instances
To allow parallelism when interacting with the batch system, in particular to have a good throughput when submitting jobs to the batch system, CREAM can run multiple BLAH instances. A new instance is created whenever needed, until the maximum number defined in the etc/glite-ce-cream/cream-config.xml configuration file is reached. The relevant attribute is cream_concurrency_level. The default value is 50.
This value should be usually fine. You might need to decrease it if you notice an overload of the batch system and many jobs aborted because the submission to the batch system failed with "send command timeout" error message.
Please note that with BLAH version < 1.16.3, there are some memory issues: each BLAH instance uses too much memory and therefore a high value of this parameter can cause memory problems. The problem is fixed with BLAH version 1.16.3. Besides this version of BLAH, it is also needed to have in blah.config :
job_registry_use_mmap=yes
which is automatically added by yaim (starting with yaim-cream-ce version 4.2.2-1). Starting with yaim-cream-ce version 4.2.3-1, the number of concurrent BLAH instances is configurable via yaim. The name of the relevant yaim variable is CREAM_CONCURRENCY_LEVEL.
1.5.8.2 Tune the BLAH BUpdater polling frequency
If the new BLAH Blparser is used, the bupdater_loop_interval attribute in /etc/blah.config defines how often the batch system is queried to check the status of the jobs. If a low value is used, job status changes are detected promptly, but the batch system is queried more often, which can cause a high load.
With yaim-cream-ce v >= 4.2.2-1, this parameter is configurable via yaim: the relevant yaim variable is BUPDATER_LOOP_INTERVAL which has 30 (secs) as default value.
With yaim-cream-ce v. < 4.2.2-1, this parameter is not configurable via yaim, and therefore it is needed to manually edit the blah configuration file, otherwise the default value (5 s.) is used. After having set this value in the blah configuration file it is necessary:
bhist -u all -d -l -n 1 ).
1.6 Batch system integration
1.6.1 Torque
1.6.1.1 Installation
If the CREAM-CE also has to be the Torque server, install the emi-torque-server metapackage:
yum install emi-torque-server
In all cases (Torque server on the CREAM-CE or on a different host) then install the emi-torque-utils metapackage:
yum install emi-torque-utils
1.6.1.2 Configuration
Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE is available at https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#cream_CE
The CREAM CE Torque integration is then configured running YAIM:
1.6.2 LSF
1.6.2.1 Requirements
You have to install and configure the LSF batch system software before installing and configuring the CREAM software.
1.6.2.2 Installation
If you are running LSF, install the emi-lsf-utils metapackage:
yum install emi-lsf-utils
1.6.2.3 Configuration
Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE is available at https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#cream_CE
The CREAM CE LSF integration is then configured running YAIM:
1.6.3 Grid Engine
1.6.3.1 Requirements
You have to install and configure the GE batch system software before installing and configuring the CREAM software. The CREAM CE integration was tested with GE 6.2u5 but it should work with any forked version of the original GE software. The support of the GE batch system software (or any of its forked versions) is out of the scope of this activity. Before proceeding, please take note of the following remarks:
1.6.3.2 Integration plugins
The GE integration with CREAM CE consists of deploying specific BLAH plugins and configuring them to properly interoperate with the Grid Engine batch system. The following GE BLAH plugins are deployed with the CREAM CE installation: BUpdaterSGE, sge_hold.sh, sge_submit.sh, sge_resume.sh, sge_status.sh and sge_cancel.
1.6.3.3 Installation
If you are running GE, install the emi-ge-utils metapackage:
yum install emi-ge-utils
1.6.3.4 Configuration
Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE and GE is available at
SGE_SHARED_INSTALL=yes in your site-info.def , otherwise YAIM may change your setup according to the definitions in your site-info.def .
The CREAM CE GE integration is then configured running YAIM:
1.6.3.5 Important notes
1.6.3.5.1 File transfers
Besides the input/output sandbox files (transferred via GFTP) there are some other files that need to be transferred from/to the CREAM sandbox directory on the CE node to/from the Worker Node, namely:
# diff -Nua sge_filestaging.modified sge_filestaging.orig
--- sge_filestaging.modified 2010-03-25 19:38:11.000000000 +0000
+++ sge_filestaging.orig 2010-03-25 19:05:43.000000000 +0000
Line: 21 to 21
my $remotefile = $3;
if ( $STAGEIN ) {
- system( 'cp', $remotefile, $localfile );
+ system( 'scp', "$remotemachine:$remotefile", $localfile );
} else {
- system( 'cp', $localfile, $remotefile );
+ system( 'scp', $localfile, "$remotemachine:$remotefile" );
}
}
1.6.3.5.2 GE accounting file
BUpdaterSGE needs to consult the GE accounting file to determine how a given job ended. Therefore, the GE accounting file must be shared between the GE SERVER / QMASTER and the CREAM CE. Moreover, to guarantee that the accounting file is updated on the fly, the GE configuration should be tuned (using qconf -mconf) by adding the following definitions under reporting_params:
accounting=true accounting_flush_time=00:00:00
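After editing the configuration, the two definitions can be checked from the command line. The sketch below reads a configuration dump on standard input and assumes that reporting_params appears on a single line with space-separated entries, as printed by qconf -sconf; the function name has_accounting_params is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read a GE configuration dump on stdin and succeed
# only if reporting_params contains both accounting definitions.
has_accounting_params() {
    # Keep only the value part of the reporting_params line.
    params=$(sed -n 's/^reporting_params[[:space:]]*//p')
    case " $params " in
        *" accounting=true "*) ;;
        *) return 1 ;;
    esac
    case " $params " in
        *" accounting_flush_time=00:00:00 "*) ;;
        *) return 1 ;;
    esac
}
```

On the QMASTER it could be used as qconf -sconf | has_accounting_params && echo "accounting settings present".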
1.6.3.5.3 GE SERVER (QMASTER) tuning
The following suggestions should be implemented to achieve better performance when integrating with CREAM CE:
1 Postconfiguration
Have a look at the Known issue page. In particular consider the workaround needed for this problem.
2 Operating the system
2.1 Tomcat configuration guidelines
In /etc/tomcat5/tomcat5.conf, there are some settings related to the heap. They are in the JAVA_OPTS setting (see -Xms and -Xmx).
It is suggested to customize such settings taking into account how much physical memory is available, as indicated in the following table (which refers to 64bit architectures):
2.2 MySQL database configuration guidelines
Default values of some MySQL settings are likely to be suboptimal, especially for large machines. In particular, some parameters could improve the overall performance if carefully tuned.
In this context one relevant parameter to be set is innodb_buffer_pool_size, which specifies the size of the buffer pool (the default value is 8MB).
The main benefits of using a proper value for this parameter are an appreciable performance improvement and a reduced amount of disk I/O needed for accessing the data in the tables. The optimal value depends on the amount of physical memory and the CPU architecture available in the host machine.
The maximum value depends on the CPU architecture, 32-bit or 64-bit. For 32-bit systems, the CPU architecture and operating system sometimes impose a lower practical maximum size. The larger this value is set, the less disk I/O is needed to access data in tables. On a dedicated database server, it is possible to set this to up to 80% of the machine's physical memory size. Scale back this value if one of the following issues occurs:
In /etc/my.cnf, in particular within the [mysqld] section, it is suggested to customize the innodb_buffer_pool_size parameter taking into account how much physical memory is available.
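A candidate value can be derived from the physical memory of the host. The following sketch is only an aid for choosing a starting point: the function name suggest_pool_mb is hypothetical, and the 80% used in the sample call is the upper bound mentioned above for a dedicated database server, so a host that also runs the CREAM CE services will need a lower percentage.

```shell
#!/bin/sh
# Hypothetical helper: print a candidate innodb_buffer_pool_size in MB.
#   $1  physical memory in MB
#   $2  percentage of memory to dedicate to the buffer pool
suggest_pool_mb() {
    echo $(( $1 * $2 / 100 ))
}

# Physical memory in MB, read from /proc/meminfo (Linux-specific).
if [ -r /proc/meminfo ]; then
    total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)
    echo "innodb_buffer_pool_size=$(suggest_pool_mb "$total_mb" 80)M"
fi
```

The printed line has the form expected in the [mysqld] section; the result uses integer arithmetic, so it is rounded down to a whole number of megabytes.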
Example:
[mysqld]
innodb_buffer_pool_size=512M
After that, it's necessary to restart the mysql service to apply the change:
/etc/init.d/mysqld restart
Finally, the following sql command (root rights are needed) can be used to check whether the new value was applied successfully:
SHOW VARIABLES like 'innodb_buffer_pool_size';
2.3 MySQL database: How to resize Innodb log files
If the following error occurs (see the mysql log file: /var/log/mysqld.log)
InnoDB: ERROR: the age of the last checkpoint is ,
InnoDB: which exceeds the log group capacity .
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
then you must resize the innodb log files. Follow these steps:
SHOW VARIABLES like "innodb_log_file_size";
service mysqld stop
[mysqld] innodb_log_file_size=64M
mv /var/lib/mysql/ib_logfile* /tmp
service mysqld start
ls -lrth /var/lib/mysql/ib_logfile*
2.4 How to start the CREAM service
A site admin can start the CREAM service just by starting the CREAM container:
/etc/init.d/tomcat5 start
In case the new BLAH blparser is used, this will also start it (if not already running). If for some reason it is necessary to explicitly start the new BLAH blparser, the following command can be used:
/etc/init.d/glite-ce-blahparser start
If instead the old BLAH blparser is used, before starting tomcat it is necessary to start it on the BLPARSER_HOST using the command:
/etc/init.d/glite-ce-blparser start
To stop the CREAM service, it is just necessary to stop the CREAM container:
/etc/init.d/tomcat5 stop
2.5 Daemons
Information about daemons running in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Daemons_running
2.6 Init scripts
Information about init scripts in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Init_scripts_and_options_start_s
2.7 Configuration files
Information about configuration files in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Configuration_files_location_wit
2.8 Log files
Information about log files in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Logfile_locations_and_management
2.9 Network ports
Information about ports used in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Open_ports
2.10 Cron jobs
Information about cron jobs used in the CREAM CE is available in http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Cron_jobs
2.11 Security related operations
2.11.1 How to enable a certain VO for a certain CREAM CE in Argus
Let's consider that a certain CREAM CE has been configured to use ARGUS as authorization system. Let's suppose that we chose http://pd.infn.it/cream-18 as the id of the CREAM CE (i.e. the yaim variable CREAM_PEPC_RESOURCEID is http://pd.infn.it/cream-18).
On the ARGUS box (identified by the yaim variable ARGUS_PEPD_ENDPOINTS), the following policy must be defined to enable the VO XYZ:
resource "http://pd.infn.it/cream-18" {
    obligation "http://glite.org/xacml/obligation/local-environment-map" {}
    action ".*" {
        rule permit { vo = "XYZ" }
    }
}
2.11.2 Security recommendations
Security recommendations relevant for the CREAM CE are available at http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Security_recommendations
2.11.3 How to block/ban a user
Information about how to ban users is available at http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#How_to_block_ban_a_user
2.11.4 How to block/ban a VO
To ban a VO, it is suggested to reconfigure the service via yaim without that VO in the siteinfo.def.
2.11.5 How to define a CREAM administrator
A CREAM administrator (aka super-user) can also manage (e.g. cancel, check the status of, etc.) the jobs submitted by other users. Moreover, he/she can issue some privileged operations, in particular the ones to disable new job submissions (glite-ce-disable-submission) and then to re-enable them (glite-ce-enable-submission).
To define a CREAM CE administrator for a specific CREAM CE, the DN of this person must be specified in the /etc/grid-security/admin-list file of this CREAM CE node, e.g.:
"/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"
Please note that enclosing the DN in double quotes (") is important.
2.12 Input and Output Sandbox files transfer between the CREAM CE and the WN
The input and output sandbox files (unless they have to be copied from/to remote servers) are copied between the CREAM CE node and the Worker Node. These file transfers can be done in two possible ways:
SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN . Possible values are:
2.13 Sharing of the CREAM sandbox area between the CREAM CE and the WN
Besides the input/output sandbox files there are some other files that need to be transferred from/to the CREAM sandbox directory on the CE node to/from the Worker Node:
2.13.1 Sharing of the CREAM sandbox area between the CREAM CE and the WN for Torque
When Torque is used as batch system, to share the CREAM sandbox area between the CREAM CE node and the WNs:
$usecp <CE node>://var/cream_sandbox /cream_sandbox
This $usecp line means that every time Torque has to copy a file from/to the cream_sandbox directory on the CE (which is the case during the stage in/stage out phase), it will use a cp from /cream_sandbox instead.
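The effect of the $usecp line can be pictured as a simple path rewrite. The following is only an illustrative sketch, not Torque code: the function name usecp_map, the host name ce.example.org and the file names are hypothetical, and the sketch assumes that the CE directory /var/cream_sandbox is visible on the WN as /cream_sandbox.

```shell
# Illustration only: mimics the path rewrite implied by the $usecp line.
# usecp_map, the host name and the file names are hypothetical.
usecp_map() {
    remote="$1"                 # e.g. ce.example.org:/var/cream_sandbox/job1/out.txt
    path="${remote#*:}"         # drop the "<CE node>:" part
    case "$path" in
        /var/cream_sandbox/*)
            # Same file, reached through the shared area mounted on the WN
            printf '%s\n' "/cream_sandbox/${path#/var/cream_sandbox/}"
            ;;
        *)
            # Outside the shared area: the rewrite does not apply
            return 1
            ;;
    esac
}

usecp_map "ce.example.org:/var/cream_sandbox/job1/out.txt"
# prints /cream_sandbox/job1/out.txt
```

A file outside the sandbox area is left unmapped, which corresponds to the case where a plain cp cannot replace the remote copy.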
2.14 Self-limiting CREAM behavior
CREAM is able to protect itself if the load, memory usage, etc. is too high. It does so by disabling new job submissions, while the other commands are still allowed. This is implemented via a limiter script (/usr/bin/glite_cream_load_monitor) very similar to the one used in the WMS.
Basically this limiter script checks the values of some system and CREAM specific parameters, and compares them against some thresholds. If one or more thresholds are exceeded, new job submissions are disabled. If a new submission is attempted when submissions are disabled, an error message is returned, e.g.:
$ glite-ce-job-submit -a -r cream-35.pd.infn.it:8443/cream-lsf-creamtest2 myjob.jdl
MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment] FaultCause=[Threshold for Memory Usage: 95 => Detected value for Memory Usage: 96.71%] Timestamp=[Mon 02 Nov 2009 21:36:04]
The limiter script is run every 10 minutes. To disable the limiter, edit the CREAM configuration file /etc/glite-ce-cream/cream-config.xml, set JOB_SUBMISSION_MANAGER_ENABLE to false and restart tomcat.
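The kind of comparison performed by the limiter can be sketched in a few lines of shell. This is not the actual glite_cream_load_monitor code: the function name exceeds_threshold is hypothetical, and the values are simply those reported in the error message above.

```shell
# Sketch of a single threshold check; not the actual limiter script.
# Returns 0 (true) when the detected value exceeds the threshold.
exceeds_threshold() {
    detected="${1%\%}"          # tolerate a trailing "%", e.g. 96.71%
    threshold="$2"
    awk -v d="$detected" -v t="$threshold" 'BEGIN { exit !(d + 0 > t + 0) }'
}

# Values taken from the error message shown above
if exceeds_threshold "96.71%" 95; then
    echo "Memory Usage above threshold: new submissions would be disabled"
fi
```

The comparison is delegated to awk because plain shell arithmetic does not handle fractional values such as 96.71.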
The values that are currently taken into account are the following:
# Default Values
If needed, the limiter script can easily be extended to take other parameters into account.
2.15 How to drain a CREAM CE
The administrator of a CREAM CE can decide to drain it, that is to disable new job submissions while still allowing the other commands. This can be useful, for example, before a scheduled shutdown of the CREAM CE. Draining is done via the glite-ce-disable-submission command (provided by the CREAM CLI package installed on the UI), which can be issued only by a CREAM CE administrator, i.e. a person whose DN is listed in the /etc/grid-security/admin-list file of the CE.
If new job submissions are attempted, users will get an error message such as:
> glite-ce-job-submit -a -r grid006.pd.infn.it:8443/cream-lsf-grid02 lnl_test.jdl
MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM2 service cannot accept jobs anymore] FaultCause=[The CREAM2 service cannot accept jobs anymore] Timestamp=[Tue 22 Jan 2008 16:28:47]
New job submissions can then be resumed by calling the glite-ce-enable-submission command.
To check whether job submissions on a specific CREAM CE are allowed, the glite-ce-allowed-submission command can be used.
E.g.:
> glite-ce-disable-submission grid006.pd.infn.it:8443
Operation for disabling new submissions succeeded
>
> glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is disabled
>
> glite-ce-enable-submission grid006.pd.infn.it:8443
Operation for enabling new submissions succeeded
>
> glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is enabled
2.16 How to trace a specific job
To trace a specific job, first of all get the CREAMjobid. If the job was submitted through the WMS, you can get its CREAMjobid in the following way:
glite-wms-job-logging-info -v 2 <gridjobid> | grep "Dest jobid"
If the job is not yours and you are not an LB admin, you can still get the CREAMjobid of that gridjobid, provided you have access to the CREAM logs, with:
grep <gridjobid> /var/log/cream/glite-ce-cream.log*
Then grep the "last part" of the CREAMjobid in the CREAM log file (e.g. if the CREAMjobid is https://cream-07.pd.infn.it:8443/CREAM383606450, consider CREAM383606450):
grep CREAM383606450 /var/log/cream/glite-ce-cream.log*
This will return all the information relevant for this job.
2.17 How to check if you are using the old or the new blparser
If you want to quickly check whether you are using the old or the new BLAH blparser, do a grep registry /etc/blah.config. If you see something like:
# grep registry /etc/blah.config
job_registry=/var/blah/user_blah_job_registry.bjr
you are using the new BLAH blparser. Otherwise you are using the old one.
2.18 Job purging
Purging a CREAM job means removing it from the CREAM database and removing from the CREAM CE any information relevant for that job (e.g. the job sandbox area). Once a job has been purged, it is no longer possible to manage it (e.g. its status can no longer be checked). A job can be purged:
/etc/glite_wms.conf ) the attribute purge_jobs in the ICE section is set to false .
2.18.1 Automatic job purging
The automatic CREAM job purger is responsible for purging old, forgotten jobs, according to a policy specified in the CREAM configuration file (/etc/glite-ce-cream/cream-config.xml).
This policy is specified by the attribute JOB_PURGE_POLICY .
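For example, if JOB_PURGE_POLICY is the following:
<parameter name="JOB_PURGE_POLICY" value="ABORTED 1 days; CANCELLED 2 days; DONE-OK 3 days; DONE-FAILED 4 days; REGISTERED 5 days;" />
then the job purger will purge jobs that have been in ABORTED status for more than 1 day, in CANCELLED status for more than 2 days, in DONE-OK status for more than 3 days, in DONE-FAILED status for more than 4 days, or in REGISTERED status for more than 5 days.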
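As a rough illustration, a policy value in this format can be split into status/age pairs with a few lines of shell. This is only a sketch for reading the policy at a glance, not how CREAM itself parses the attribute; the function name parse_purge_policy is hypothetical.

```shell
# Sketch: split a JOB_PURGE_POLICY value into "STATUS DAYS" pairs.
# parse_purge_policy is a hypothetical helper, not part of CREAM.
parse_purge_policy() {
    # One "STATUS N days" entry per line, then keep status and number only
    printf '%s\n' "$1" | tr ';' '\n' | awk 'NF >= 2 { print $1, $2 }'
}

policy="ABORTED 1 days; CANCELLED 2 days; DONE-OK 3 days; DONE-FAILED 4 days; REGISTERED 5 days;"
parse_purge_policy "$policy"
```

For the value above this prints one line per status (ABORTED 1, CANCELLED 2, DONE-OK 3, DONE-FAILED 4, REGISTERED 5), i.e. the number of days after which a job in that status becomes eligible for purging.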
2.18.2 Purging jobs in a non terminal status
The (manual or automatic) purge operation can be issued only for jobs which are in a terminal status. If it is necessary to purge a job that has actually terminated but that CREAM still considers to be in a non terminal status (e.g. RUNNING, REALLY_RUNNING) because of some bug or problem, a specific utility (JobDBAdminPurger) provided with the glite-ce-cream package can be used.
JobDBAdminPurger can purge jobs based on their CREAM jobids and/or their status (taking into account how long the job has been in that status).
Usage:
JobDBAdminPurger.sh [-c|--conf CREAMConfPath] -u|--userDB userDB -p|--pswDB pswDB [-j|--jobids jobId1:jobId2:...] | [-f|--filejobIds filenameJobIds] | [-s|--status statusType0,deltaTime:statusType1:...] [-h|--help]
Options:
JobDBAdminPurger.sh -u xyz -p abc -j CREAM217901296:CREAM324901232
JobDBAdminPurger.sh -u xyz -p abc -s registered:pending:idle
JobDBAdminPurger.sh -u xyz -p abc -s registered,3:pending:idle,5
JobDBAdminPurger.sh -u xyz -p abc -f /tmp/jobIdsToPurge.txt
Please note that this script should be run only to clean the CREAM DB in case of problems (i.e. jobs reported in a non terminal status while this is not the case).
Please also note that this script purges jobs from the CREAM DB. The relevant job sandbox directories are also deleted.
2.19 Proxy purging
Expired delegation proxies are automatically purged:
/etc/glite-ce-cream/cream-config.xml ) there is a property called delegation_purge_rate which defines how often the proxy purger is run. The default value is 720 (720 minutes, that is 12 hours).
If the value is changed, it is then necessary to restart tomcat. |