System Administrator Guide for CREAM for EMI-1 release
1 Installation and Configuration
1.1 Prerequisites
1.1.1 Operating system
A standard 64-bit SL(C)5 distribution must be properly installed.
1.1.2 Node synchronization
A general requirement for the Grid nodes is that they are synchronized. This requirement may be fulfilled in several ways. One of the most common is to use the
NTP
protocol with a time server.
1.1.3 Cron and logrotate
Many components deployed on the CREAM CE rely on the presence of
cron
(including support for
/etc/cron.*
directories) and
logrotate
. You should make sure these utilities are available on your system.
1.1.4 Batch system
If you plan to use Torque as batch system for your CREAM CE, it will be installed and configured along with the middleware (i.e. you don't have to install and configure it in advance).
If you plan to use LSF as batch system for your CREAM CE, you have to install and configure it before installing and configuring the CREAM software. Since LSF is commercial software, it can't be distributed together with the middleware.
1.2 Plan how to deploy the CREAM CE
1.2.1 CREAM CE and gLite-cluster
glite-CLUSTER is a node type that can publish information about clusters and subclusters in a site, referenced by any number of compute elements.
glite-CLUSTER can be deployed in the same host of the CREAM-CE or in a different one.
The following deployment models are possible:
- CREAM-CE can be configured without worrying about the glite-CLUSTER node. This can be useful for small sites who don't want to worry about cluster/subcluster configurations because they have a very simple setup. In this case CREAM-CE will publish a single cluster/subcluster. This is called no cluster mode. This is done as described below by defining the yaim setting
CREAMCE_CLUSTER_MODE=no
(or by not defining that variable at all).
- CREAM-CE can work in cluster mode using the glite-CLUSTER node type. This is done as described below by defining the yaim setting
CREAMCE_CLUSTER_MODE=yes
. The CREAM-CE can be in the same host or in a different host from the glite-CLUSTER node.
More information about glite-CLUSTER can be found at
https://twiki.cern.ch/twiki/bin/view/LCG/CLUSTER
1.2.2 Choose the authorization model
The CREAM CE can be configured to use as authorization system:
- the ARGUS authorization framework
OR
- the grid Java Authorization Framework (gJAF)
In the former case an ARGUS box (usually at site level), where the policies for the CREAM CE box are defined, is needed.
To use ARGUS as authorization system, yaim variable
USE_ARGUS
must be set in the following way:
USE_ARGUS=yes
In this case it is also necessary to set the following yaim variables:
-
ARGUS_PEPD_ENDPOINTS
The endpoint of the ARGUS box (e.g. "https://cream-43.pd.infn.it:8154/authz")
-
CREAM_PEPC_RESOURCEID
The id of the CREAM CE in the ARGUS box (e.g. "http://pd.infn.it/cream-18")
If instead gJAF should be used as authorization system, yaim variable
USE_ARGUS
must be set in the following way:
USE_ARGUS=no
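For the ARGUS case described above, the three settings could appear together in siteinfo.def roughly as follows. This is a minimal sketch: the endpoint and the resource id are simply the example values quoted in this section and must be replaced with the values of your site.

```shell
# siteinfo.def fragment (sketch): use ARGUS as authorization system.
# Both values below are the example values from this guide, not real
# settings for your site.
USE_ARGUS=yes
ARGUS_PEPD_ENDPOINTS="https://cream-43.pd.infn.it:8154/authz"
CREAM_PEPC_RESOURCEID="http://pd.infn.it/cream-18"
```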
1.2.3 Choose the BLAH BLparser deployment model
The BLAH Blparser is the component of the CREAM CE responsible for notifying CREAM about job status changes.
For LSF and PBS/Torque it is possible to configure the BLAH blparser in two possible ways:
- The new BLAH BLparser, which relies on the status/history batch system commands
- The old BLAH BLparser, which parses the batch system log files
For SGE and Condor, only the configuration with the new BLAH blparser is possible.
1.2.3.1 New BLAH Blparser
The new Blparser runs on the CREAM CE machine and it is automatically installed when installing the CREAM CE. The configuration of the new BLAH Blparser is done when configuring the CREAM CE (i.e. it is not necessary to configure the Blparser separately from the CREAM CE).
To use the new BLAH blparser, it is just necessary to set:
BLPARSER_WITH_UPDATER_NOTIFIER=true
in the siteinfo.def and then configure the CREAM CE. This is the default value.
The new BLParser doesn't parse the log files. However the
bhist
(for LSF) and
tracejob
(for Torque) commands (used by the new BLParser) require the batch system log files, which therefore must be available on the CREAM CE node (e.g. via NFS). Actually, for Torque the blparser uses
tracejob
(which requires the log files) only when
qstat
can no longer find the job. This can happen if the job completed more than
keep_completed
seconds ago and the blparser was not able to detect earlier that the job had completed, been cancelled, etc. This can happen e.g. if keep_completed is too short or if the BLAH blparser for whatever reason didn't run for a while. If the log files are not available and the
tracejob
command is issued (for the reasons specified above), the BLAH blparser will not be able to find the job, which will be considered "lost" (DONE-FAILED wrt CREAM).
The init script of the new Blparser is
/etc/init.d/glite-ce-blahparser
. Please note that it is not necessary to explicitly start the new blparser: when CREAM is started, it also starts the new BLAH Blparser if it is not already running.
When the new Blparser is running, you should see the following two processes on the CREAM CE node:
-
/usr/bin/BUpdaterxxx
-
/usr/bin/BNotifier
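A quick check for these two processes could look like the sketch below. The function name is hypothetical; it only looks for the two path prefixes listed above in a process listing read from standard input, so it makes no assumption about the suffix of the BUpdater binary.

```shell
# Return success (0) only if both new-BLparser daemons appear in the
# process listing read from stdin. blparser_running is a hypothetical helper.
blparser_running() {
    listing=$(cat)
    echo "$listing" | grep -q '/usr/bin/BUpdater' || return 1
    echo "$listing" | grep -q '/usr/bin/BNotifier' || return 1
    return 0
}

# Typical use on the CREAM CE node:
#   ps -eo args | blparser_running && echo "new BLparser is running"
```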
Please note that the user
tomcat
on the CREAM CE should be allowed to issue the relevant status/history commands (for Torque:
qstat
,
tracejob
, for LSF:
bhist
,
bjobs
). Some sites configure the batch system so that users can only see their own jobs (e.g. in torque:
set server query_other_jobs = False
). If this is done at the site, then the tomcat user will need a special privilege in order to be exempt from this setting (in torque:
set server operators += tomcat@creamce.yoursite.domain
).
1.2.3.2 Old BLAH Blparser
The old BLAH blparser must be installed on a machine where the batch system log files are available (let's call this host
BLPARSER_HOST
). The BLPARSER_HOST can therefore be the batch system master or a different machine where the log files are available (e.g. because they have been exported via NFS). There are two possible layouts:
- The BLPARSER_HOST is the CREAM CE host
- The BLPARSER_HOST is different from the CREAM CE host
If the BLPARSER_HOST is the CREAM CE host, after having installed and configured the CREAM CE, it is necessary to configure the old BLAH Blparser as explained below.
If the BLPARSER_HOST is different from the CREAM CE host, after having installed and configured the CREAM CE it is necessary:
- to install the old BLAH BLparser software on this BLPARSER_HOST as explained below
- to configure this software
After having configured CREAM, it is necessary to also configure the BLAH Blparser as explained below.
On the CREAM CE, to use the old BLAH blparser, it is necessary to set:
BLPARSER_WITH_UPDATER_NOTIFIER=false
in the siteinfo.def before configuring via yaim.
1.3 Installation
This section explains how to install:
- a CREAM CE in no cluster mode
- a CREAM CE in cluster mode
- a glite-CLUSTER node
For all these scenarios, the setting of the repositories is the same.
1.3.1 Repositories
For a successful installation, you will need to configure your package manager to reference a number of repositories (in addition to your OS):
- the EPEL repository
- the EMI middleware repository
- the CA repository
and to
REMOVE (!!!) or
DEACTIVATE (!!!)
1.3.1.1 The EPEL repository
You can install the EPEL repository, issuing:
rpm -Uvh http://download.fedora.redhat.com/pub/epel/5/i386/epel-release-5-4.noarch.rpm
1.3.1.2 The EMI middleware repository
You can install the EMI-1 yum repository, issuing:
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/base/emi-release-1.0.0-1.sl5.noarch.rpm
yum install ./emi-release-1.0.0-1.sl5.noarch.rpm
1.3.1.3 The Certification Authority repository
The most up-to-date version of the list of trusted Certification Authorities (CA) is needed on your node. The relevant yum repo can be installed issuing:
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/EGI-trustanchors.repo -O /etc/yum.repos.d/EGI-trustanchors.repo
1.3.1.4 Important note on automatic updates
An update of an RPM not followed by configuration can cause problems. Therefore
WE STRONGLY RECOMMEND NOT TO USE AUTOMATIC UPDATE PROCEDURE OF ANY KIND.
Yum autoupdate can be disabled by running the script available at
http://forge.cnaf.infn.it/frs/download.php/101/disable_yum.sh
(implemented by Giuseppe Platania, INFN Catania).
1.3.2 Installation of a CREAM CE node in no cluster mode
First of all, install the
yum-protectbase
rpm:
yum install yum-protectbase.noarch
Then proceed with the installation of the CA certificates.
1.3.2.1 Installation of the CA certificates
The CA certificates can be installed by issuing:
yum install ca-policy-egi-core
1.3.2.2 Installation of the CREAM CE software
The CREAM software itself can work with both Sun jdk and openjdk. However, the apel-core package, deployed in the CREAM CE node, requires
mm-mysql
, which explicitly requires Sun jdk. So to install the middleware software needed for the CREAM CE, install first of all Sun JDK (
jdk
). This is actually not needed in a standard SL5 box, since in this case the Sun JDK rpm is available in the OS repo.
Then install
xml-commons-apis
:
yum install xml-commons-apis
This is needed because of a dependency problem within the Tomcat distribution.
Then install the CREAM-CE metapackage:
yum install emi-cream-ce
1.3.2.3 Installation of the batch system specific software
After the installation of the CREAM CE metapackage it is necessary to install the batch system specific metapackage(s):
- If you are running Torque, and your CREAM CE node is the torque master, install the
emi-torque-server
and emi-torque-utils
metapackages:
yum install emi-torque-server
yum install emi-torque-utils
- If you are running Torque, and your CREAM CE node is NOT the torque master, install the
emi-torque-utils
metapackage:
yum install emi-torque-utils
- If you are running LSF, install the
emi-lsf-utils
metapackage:
yum install emi-lsf-utils
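The three cases above can be summarized in a small helper that prints the metapackages to install. This is only a sketch: the layout labels (torque-master, torque-client, lsf) are hypothetical names chosen for this example, while the package names are the ones listed above. The example prints the yum command instead of running it.

```shell
# Print the batch-system metapackages for a given layout.
# The layout labels are hypothetical; the package names come from this guide.
batch_metapackages() {
    case "$1" in
        torque-master) echo "emi-torque-server emi-torque-utils" ;;
        torque-client) echo "emi-torque-utils" ;;
        lsf)           echo "emi-lsf-utils" ;;
        *) echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Example: show the command for a CREAM CE that is also the Torque master.
echo "yum install $(batch_metapackages torque-master)"
```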
1.3.3 Installation of a CREAM CE node in cluster mode
First of all, install the
yum-protectbase
rpm:
yum install yum-protectbase.noarch
Then proceed with the installation of the CA certificates.
1.3.3.1 Installation of the CA certificates
The CA certificates can be installed by issuing:
yum install ca-policy-egi-core
1.3.3.2 Installation of the CREAM CE software
The CREAM software itself can work with both Sun jdk and openjdk. However, the apel-core package, deployed in the CREAM CE node, requires
mm-mysql
, which explicitly requires Sun jdk. So to install the middleware software needed for the CREAM CE, install first of all Sun JDK (
jdk
). This is actually not needed in a standard SL5 box, since in this case the Sun JDK rpm is available in the OS repo.
Then install
xml-commons-apis
:
yum install xml-commons-apis
This is needed because of a dependency problem within the Tomcat distribution.
Then install the CREAM-CE metapackage:
yum install emi-cream-ce
1.3.3.3 Installation of the batch system specific software
After the installation of the CREAM CE metapackage it is necessary to install the batch system specific metapackage(s).
- If you are running Torque, and your CREAM CE node is the torque master, install the
emi-torque-server
and emi-torque-utils
metapackages:
yum install emi-torque-server
yum install emi-torque-utils
- If you are running Torque, and your CREAM CE node is NOT the torque master, install the
emi-torque-utils
metapackage:
yum install emi-torque-utils
- If you are running LSF, install the
emi-lsf-utils
metapackage:
yum install emi-lsf-utils
1.3.3.4 Installation of the cluster metapackage
If the CREAM CE node has to host also the
glite-cluster
, install also this metapackage:
yum install emi-cluster
1.3.4 Installation of a glite-cluster node
First of all, install the
yum-protectbase
rpm:
yum install yum-protectbase.noarch
Then proceed with the installation of the CA certificates.
1.3.4.1 Installation of the CA certificates
The CA certificates can be installed by issuing:
yum install ca-policy-egi-core
1.3.4.2 Installation of the cluster metapackage
Install the glite-CLUSTER metapackage:
yum install emi-cluster
1.3.5 Installation of the BLAH BLparser
If the new BLAH Blparser must be used, there isn't anything to be installed for the BLAH Blparser (i.e. the installation of the CREAM-CE is enough).
This is also the case when the old BLAH Blparser must be used
AND the BLPARSER_HOST is the CREAM-CE.
Only when the old BLAH Blparser must be used
AND the BLPARSER_HOST is different than the CREAM-CE, it is necessary to install the BLParser software on this BLPARSER_HOST. This is done in the following way:
yum install glite-ce-blahp
yum install glite-yaim-cream-ce
1.3.6 Installation of the CREAM CLI
The CREAM CLI is part of the EMI-UI. To install it please refer to xxx.
1.4 Configuration
1.4.1 Using the YAIM configuration tool
For a detailed description on how to configure the middleware with YAIM, please check the
YAIM guide
.
The necessary YAIM modules needed to configure a certain node type are automatically installed with the middleware.
1.4.2 Configuration of a CREAM CE node in no cluster mode
1.4.2.1 Install host certificate
The CREAM CE node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already.
Once you have obtained a valid certificate:
- hostcert.pem - containing the machine public key
- hostkey.pem - containing the machine private key
make sure to place the two files in the target node into the
/etc/grid-security
directory. Then set the proper mode and ownerships doing:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
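To verify the result, the modes of the two files can be checked with a sketch like the following. The function name is hypothetical, and the check relies on GNU stat (the -c '%a' option), as found on Linux; it only compares the modes against the values set above and does not check ownership.

```shell
# Check that hostcert.pem and hostkey.pem in the given directory have
# modes 600 and 400 respectively. Requires GNU stat.
# check_hostcert_modes is a hypothetical helper, not part of the middleware.
check_hostcert_modes() {
    dir=$1
    cert_mode=$(stat -c '%a' "$dir/hostcert.pem") || return 1
    key_mode=$(stat -c '%a' "$dir/hostkey.pem") || return 1
    [ "$cert_mode" = "600" ] && [ "$key_mode" = "400" ]
}

# Typical use:
#   check_hostcert_modes /etc/grid-security || echo "wrong modes"
```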
1.4.2.2 Configure the siteinfo.def file
Set your
siteinfo.def
file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE is available at
https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#cream_CE
Be sure that
CREAMCE_CLUSTER_MODE
is set to
no
(or not set at all).
1.4.2.3 Run yaim
After having filled the
siteinfo.def
file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode>
Examples:
- Configuration of a CREAM CE in no cluster mode using Torque as batch system, with the CREAM CE being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
- Configuration of a CREAM CE in no cluster mode using Torque as batch system, with the CREAM CE NOT being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
- Configuration of a CREAM CE in no cluster mode using LSF as batch system
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
1.4.3 Configuration of a CREAM CE node in cluster mode
1.4.3.1 Install host certificate
The CREAM CE node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already.
Once you have obtained a valid certificate:
- hostcert.pem - containing the machine public key
- hostkey.pem - containing the machine private key
make sure to place the two files in the target node into the
/etc/grid-security
directory. Then set the proper mode and ownerships doing:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
1.4.3.2 Configure the siteinfo.def file
Set your
siteinfo.def
file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE is available at
https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#cream_CE
Be sure that
CREAMCE_CLUSTER_MODE
is set to
yes
1.4.3.3 Run yaim
After having filled the
siteinfo.def
file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]
-n glite-CLUSTER
must be specified only if the glite-CLUSTER is deployed on the same node as the CREAM-CE.
Examples:
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on a different node) using LSF as batch system
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on a different node) using Torque as batch system, with the CREAM CE being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on a different node) using Torque as batch system, with the CREAM CE NOT being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on the same node of the CREAM-CE) using LSF as batch system
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils -n glite-CLUSTER
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on the same node of the CREAM-CE) using Torque as batch system, with the CREAM CE being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils -n glite-CLUSTER
- Configuration of a CREAM CE in cluster mode (with glite-CLUSTER deployed on the same node of the CREAM-CE) using Torque as batch system, with the CREAM CE NOT being also Torque server
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils -n glite-CLUSTER
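All the combinations listed above follow one pattern, which the sketch below makes explicit. It only builds and prints the command line, it does not run yaim. The function name, the layout labels and the siteinfo.def path used in the example are hypothetical; the node type names are the ones used in the examples above.

```shell
# Build (but do not run) the yaim command line for a CREAM CE.
#   $1: path to siteinfo.def
#   $2: batch layout: torque-master | torque-client | lsf (hypothetical labels)
#   $3: "yes" if glite-CLUSTER is deployed on this same node, otherwise "no"
yaim_command() {
    cmd="/opt/glite/yaim/bin/yaim -c -s $1 -n creamCE"
    case "$2" in
        torque-master) cmd="$cmd -n TORQUE_server -n TORQUE_utils" ;;
        torque-client) cmd="$cmd -n TORQUE_utils" ;;
        lsf)           cmd="$cmd -n LSF_utils" ;;
        *) echo "unknown layout: $2" >&2; return 1 ;;
    esac
    if [ "$3" = "yes" ]; then
        cmd="$cmd -n glite-CLUSTER"
    fi
    echo "$cmd"
}

# Example: LSF, with glite-CLUSTER on the same node (path is a placeholder).
yaim_command /root/siteinfo.def lsf yes
```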
1.4.4 Configuration of a glite-CLUSTER node
1.4.4.1 Install host certificate
The glite-CLUSTER node requires the host certificate/key files to be installed. Contact your national Certification Authority (CA) to understand how to obtain a host certificate if you do not have one already.
Once you have obtained a valid certificate:
- hostcert.pem - containing the machine public key
- hostkey.pem - containing the machine private key
make sure to place the two files in the target node into the
/etc/grid-security
directory. Then set the proper mode and ownerships doing:
chown root.root /etc/grid-security/hostcert.pem
chown root.root /etc/grid-security/hostkey.pem
chmod 600 /etc/grid-security/hostcert.pem
chmod 400 /etc/grid-security/hostkey.pem
1.4.4.2 Configure the siteinfo.def file
Set your
siteinfo.def
file, which is the input file used by yaim. Documentation about yaim variables relevant for glite-CLUSTER is available at
https://twiki.cern.ch/twiki/bin/view/LCG/Site-info_configuration_variables#CLUSTER
1.4.4.3 Run yaim
After having filled the
siteinfo.def
file, run yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n glite-CLUSTER
1.4.5 Configuration of the BLAH Blparser
If the new BLAH Blparser must be used, there isn't anything to be configured for the BLAH Blparser (i.e. the configuration of the CREAM-CE is enough).
If the old BLparser must be used, it is necessary to configure it on the BLPARSER_HOST (which, as said above, can be the CREAM-CE node or on a different host). This is done in the following way:
/opt/glite/yaim/bin/yaim -r -s <site-info.def> -n creamCE -f config_cream_blparser
Then it is necessary to restart tomcat on the CREAM-CE node:
service tomcat5 restart
1.4.5.1 Configuration of the old BLAH Blparser to serve multiple CREAM CEs
The configuration instructions reported above explain how to configure a CREAM CE and the BLAH blparser (old model) considering the scenario where the BLAH blparser has to "serve" a single CREAM CE.
Considering that the blparser (old model) has to run where the batch system log files are available, let's consider a scenario where there are 2 CREAM CEs (
ce1.mydomain
and
ce2.mydomain
) that must be configured. Let's suppose that the batch system log files are not available on these 2 CREAM CE machines. Let's assume they are available on another machine (
blhost.mydomain
), where the old blparser has to be installed.
The following summarizes what must be done:
- In the
/services/glite-creamce
for ce1.mydomain
set:
BLPARSER_HOST=blhost.mydomain
BLAH_JOBID_PREFIX=cre01_
BLP_PORT=33333
and configure
ce1.mydomain
via yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]
- In the
/services/glite-creamce
for ce2.mydomain
set:
BLPARSER_HOST=blhost.mydomain
BLAH_JOBID_PREFIX=cre02_
BLP_PORT=33334
and configure
ce2.mydomain
via yaim:
/opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n <LRMSnode> [-n glite-CLUSTER]
- In the
/services/glite-creamce
for blhost.mydomain
set:
CREAM_PORT=56565
and configure
blhost.mydomain
via yaim:
/opt/glite/yaim/bin/yaim -r -s <site-info.def> -n creamCE -f config_cream_blparser
- In
blhost.mydomain
edit the file /etc/blparser.conf
setting (considering the pbs/torque scenario):
GLITE_CE_BLPARSERPBS_NUM=2
# ce1.mydomain
GLITE_CE_BLPARSERPBS_PORT1=33333
GLITE_CE_BLPARSERPBS_CREAMPORT1=56565
# ce2.mydomain
GLITE_CE_BLPARSERPBS_PORT2=33334
GLITE_CE_BLPARSERPBS_CREAMPORT2=56566
- Restart the blparser on
blhost.mydomain
:
/etc/init.d/glite-ce-blparser restart
- Restart tomcat on
ce1.mydomain
and ce2.mydomain
You can of course replace 33333, 33334, 56565, 56566 (reported in the above examples) with other port numbers.
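With more than two CREAM CEs the numbered settings in /etc/blparser.conf follow the same pattern. The sketch below generates them for the PBS/Torque case from a list of port pairs. The function name and the "BLP port:CREAM port" argument format are assumptions made for this example; the variable names are the ones shown above.

```shell
# Emit the PBS/Torque settings of blparser.conf for several CREAM CEs.
# Each argument is "<BLP_PORT>:<CREAM_PORT>" for one CE, in order.
# blparser_pbs_conf is a hypothetical helper.
blparser_pbs_conf() {
    echo "GLITE_CE_BLPARSERPBS_NUM=$#"
    i=1
    for pair in "$@"; do
        echo "GLITE_CE_BLPARSERPBS_PORT$i=${pair%%:*}"
        echo "GLITE_CE_BLPARSERPBS_CREAMPORT$i=${pair##*:}"
        i=$((i + 1))
    done
}

# Reproduces the two-CE example of this section.
blparser_pbs_conf 33333:56565 33334:56566
```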
1.4.6 Configuration of the CREAM CLI
The CREAM CLI is part of the EMI-UI. To configure it please refer to xxx.
2 Operating the system
2.1 Tomcat configuration guidelines
In
/etc/tomcat5/tomcat5.conf
, there are some settings related to heap. They are in the
JAVA_OPTS
setting (see
-Xms
and
-Xmx
).
It is suggested to customize such settings taking into account how much physical memory is available, as indicated in the following table (which refers to 64bit architectures):
After having made the changes, it is necessary to restart tomcat.
2.2 How to start the CREAM service
A site admin can start the CREAM service just starting the CREAM container:
/etc/init.d/tomcat5 start
In case the new BLAH blparser is used, this will also start it (if not already running).
If for some reason it is necessary to explicitly start the new BLAH blparser, the following command can be used:
/etc/init.d/glite-ce-blahparser start
If instead the old BLAH blparser is used, before starting tomcat it is necessary to start it on the BLPARSER_HOST using the command:
/etc/init.d/glite-ce-blparser start
To stop the CREAM service, it is just necessary to stop the CREAM container:
/etc/init.d/tomcat5 stop
2.3 Daemons
Information about daemons running in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Daemons_running
2.4 Init scripts
Information about init scripts in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Init_scripts_and_options_start_s
2.5 Configuration files
Information about configuration files in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Configuration_files_location_wit
2.6 Log files
Information about log files in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Logfile_locations_and_management
2.7 Network ports
Information about ports used in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Open_ports
2.8 Cron jobs
Information about cron jobs used in the CREAM CE is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Cron_jobs
2.9 Security related operations
2.9.1 How to enable a certain VO for a certain CREAM CE in Argus
Let's consider that a certain CREAM CE has been configured to use ARGUS as authorization system.
Let's suppose that the id of the CREAM CE in the ARGUS box (yaim variable
CREAM_PEPC_RESOURCEID
) is
http://pd.infn.it/cream-18
.
On the ARGUS box (identified by the yaim variable
ARGUS_PEPD_ENDPOINTS
) to enable the VO XYZ, it is necessary to define the following policy:
resource "http://pd.infn.it/cream-18" {
obligation "http://glite.org/xacml/obligation/local-environment-map" {}
action ".*" {
rule permit { vo = "XYZ" }
}
}
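When several VOs or several CREAM CEs have to be enabled, the policy text can be generated rather than typed. The following sketch prints the same policy for a given resource id and VO name; the function name is hypothetical, and how the resulting text is loaded into the ARGUS box is not covered here.

```shell
# Print an Argus policy that permits a VO on a given CREAM CE resource id.
#   $1: resource id (value of CREAM_PEPC_RESOURCEID)
#   $2: VO name
# argus_vo_policy is a hypothetical helper.
argus_vo_policy() {
    cat <<EOF
resource "$1" {
    obligation "http://glite.org/xacml/obligation/local-environment-map" {}
    action ".*" {
        rule permit { vo = "$2" }
    }
}
EOF
}

# Reproduces the example policy of this section.
argus_vo_policy "http://pd.infn.it/cream-18" "XYZ"
```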
2.9.2 Security recommendations
Security recommendations relevant for the CREAM CE are available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Security_recommendations
2.9.3 How to block/ban a user
Information about how to ban users is available in
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#How_to_block_ban_a_user
2.9.4 How to block/ban a VO
To ban a VO, it is suggested to reconfigure the service via yaim without that VO in the
siteinfo.def
2.9.5 How to define a CREAM administrator
A CREAM administrator (aka super-user) can manage (e.g. cancel, check the status, etc.) also the jobs submitted by other people.
Moreover he/she can issue some privileged operations, in particular the ones to disable the new job submissions (
glite-ce-disable-submission
) and then to re-enable them (
glite-ce-enable-submission
).
To define a CREAM CE administrator for a specific CREAM CE, the DN of this person must be specified in the
/etc/grid-security/admin-list
of this CREAM CE node, e.g.:
"/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"
Please note that enclosing the DN in double quotes is important.
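Since the quoting is easy to get wrong by hand, a small helper can add the entry. The sketch below wraps the DN in double quotes and appends it only if the exact line is not already present; the function name is hypothetical.

```shell
# Append a DN, wrapped in double quotes, to an admin list file unless the
# exact entry is already there.
#   $1: admin list file (normally /etc/grid-security/admin-list)
#   $2: the DN, without surrounding quotes
# add_cream_admin is a hypothetical helper.
add_cream_admin() {
    entry="\"$2\""
    touch "$1" || return 1
    grep -Fxq -e "$entry" "$1" || printf '%s\n' "$entry" >> "$1"
}

# Typical use:
#   add_cream_admin /etc/grid-security/admin-list \
#       "/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"
```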
2.10 Input and Output Sandbox files transfer between the CREAM CE and the WN
The input and output sandbox files (unless they have to be copied from/to remote servers) are copied between the CREAM CE node and the Worker Node.
These files transfers can be done in two possible ways:
- Using gridftp
- Using the staging capabilities of the batch system
The choice is done at configuration time setting the yaim variable
SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN
. Possible values are:
-
GSIFTP
to use gridftp. This is the default value
-
LRMS
to use the staging capabilities of the batch system
2.11 Sharing of the CREAM sandbox area between the CREAM CE and the WN
Besides the input/output sandbox files there are some other files that need to be transferred from/to the CREAM sandbox directory on the CE node to/from the Worker Node:
- The CREAM job wrapper and the user proxies, that are staged from the CE node to the WN where the job will run
- The standard output and error files of the Cream job wrapper, that are copied from the WN to the CE when the job completes its execution.
To manage that, there are two possible options:
- Use the staging capabilities of the batch system (e.g. for Torque this requires configuring the ssh trust between CE and WNs)
- Share the CREAM sandbox area between the CREAM CE node and the Worker Node and configure the batch system appropriately
Please note:
- If you want to have several CREAM CEs sharing the same WNs, you need to mount each CE sandbox area to a different mount point on the WN, such as
/cream_sandbox/ce_hostname
.
- The CREAM sandbox directory name (default
/var/cream_sandbox
) can be changed using the yaim variable CREAM_SANDBOX_DIR
2.11.1 Sharing of the CREAM sandbox area between the CREAM CE and the WN for Torque
When Torque is used as batch system, to share the CREAM sandbox area between the CREAM CE node and the WNs:
- Mount the
cream_sandbox
directory also in the WNs. Let's assume that on the CE node the CREAM sandbox directory is called /var/cream_sandbox
and that on the WN it is mounted as /cream_sandbox
.
- On the WNs, add the following to the Torque client config file (generally
/var/spool/pbs/mom_priv/config
):
$usecp <CE node>:/var/cream_sandbox /cream_sandbox
This
$usecp
line means that every time Torque has to copy a file from/to the cream_sandbox directory on the CE (which is the case during the stage in/stage out phase), it will use a
cp
from
/cream_sandbox
instead.
2.12 Self-limiting CREAM behavior
CREAM is able to protect itself if the load, memory usage, etc. is too high. This happens disabling new job submissions, while the other commands are still allowed.
This is implemented via a limiter script (/usr/bin/glite_cream_load_monitor), very similar to the one used in the WMS.
Basically this limiter script checks the values of some system and CREAM specific parameters, and compares them against some thresholds. If one or more thresholds are exceeded, new job submissions get disabled. If a new submission is attempted when submissions are disabled, an error message is returned, e.g.:
$ glite-ce-job-submit -a -r cream-35.pd.infn.it:8443/cream-lsf-creamtest2 myjob.jdl
MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment]
FaultCause=[Threshold for Memory Usage: 95 => Detected value for Memory Usage: 96.71%] Timestamp=[Mon 02 Nov 2009 21:36:04]
The limiter script is run every 10 minutes.
To disable the limiter, it is necessary to edit the CREAM configuration file
/etc/glite-ce-cream/cream-config.xml
setting
JOB_SUBMISSION_MANAGER_ENABLE
to false and restarting tomcat.
The values that are currently taken into account are the following:
| Value | Default threshold |
| Load average (1 min) | 40 |
| Load average (5 min) | 40 |
| Load average (15 min) | 20 |
| Memory usage | 95 % |
| Swap usage | 95 % |
| Free file descriptors | 500 |
| File descriptors used by tomcat | 800 |
| Number of FTP connections | 30 |
| Number of active jobs | no limit |
| Number of pending commands (commands still to be executed) | no limit |
If needed, the thresholds can be modified by editing the limiter script itself. They are defined in the section starting with:
# Default Values
If needed, the limiter script can be easily augmented to take into account some other parameters.
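The kind of comparison the limiter performs can be pictured with the sketch below. It is not taken from the limiter script itself: the function name is hypothetical and awk is used only because load averages and percentages are not integers. The example uses the memory usage values from the error message quoted earlier in this section.

```shell
# Return success (0) if a measured value exceeds its threshold.
# awk does the comparison so that non-integer values work.
# exceeds_threshold is a hypothetical helper, not the limiter's own code.
exceeds_threshold() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v > t) }'
}

# Memory usage 96.71 % against the default threshold of 95 %.
if exceeds_threshold 96.71 95; then
    echo "threshold exceeded: new submissions would be disabled"
fi
```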
2.13 How to drain a CREAM CE
The administrator of a CREAM CE can decide to drain it, that is, to disable new job submissions while still allowing the other commands. This can be useful, for example, before a scheduled shutdown of the CREAM CE.
This can be achieved via the
glite-ce-disable-submission
command (provided by the CREAM CLI package installed on the UI), which can be issued only by a CREAM CE administrator, i.e. the DN of this person must be listed in the
/etc/grid-security/admin-list
file of the CE.
If new job submissions are attempted, users will get an error message such as:
> glite-ce-job-submit -a -r grid006.pd.infn.it:8443/cream-lsf-grid02 lnl_test.jdl
MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM2 service cannot accept jobs anymore]
FaultCause=[The CREAM2 service cannot accept jobs anymore] Timestamp=[Tue 22 Jan 2008 16:28:47]
It is possible to then resume new job submissions calling the
glite-ce-enable-submission
command.
To check if job submissions on a specific CREAM CE are allowed, the command
glite-ce-allowed-submission
can be used.
E.g.:
> glite-ce-disable-submission grid006.pd.infn.it:8443
Operation for disabling new submissions succeeded
>
> glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is disabled
>
> glite-ce-enable-submission grid006.pd.infn.it:8443
Operation for enabling new submissions succeeded
>
> glite-ce-allowed-submission grid006.pd.infn.it:8443
Job Submission to this CREAM CE is enabled
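In a script (for example one run before a scheduled shutdown) the answer of glite-ce-allowed-submission can be tested automatically. The sketch below assumes that the command prints exactly the messages shown in the example above; the function name is hypothetical.

```shell
# Read the output of glite-ce-allowed-submission on stdin and return
# success (0) only if it reports that submission is enabled.
# Assumes the message text shown in the example above.
submission_enabled() {
    grep -q 'Job Submission to this CREAM CE is enabled'
}

# Typical use from a UI, with the CE endpoint of the example above:
#   glite-ce-allowed-submission grid006.pd.infn.it:8443 | submission_enabled
```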
2.14 How to trace a specific job
To trace a specific job, first of all get the CREAMjobid.
If the job was submitted through the WMS, you can get its CREAMjobid in the following way:
glite-wms-job-logging-info -v 2 <gridjobid> | grep "Dest jobid"
If the job is not yours and you are not LB admin, you can get the CREAMjobid of that gridjobid, provided you have access to the CREAM logs, by doing:
grep <gridjobid> /var/log/cream/glite-ce-cream.log*
Grep the "last part" of the CREAMjobid in the CREAM log file (e.g. if the CREAMjobid is
https://cream-07.pd.infn.it:8443/CREAM383606450
consider CREAM383606450):
grep CREAM383606450 /var/log/cream/glite-ce-cream.log*
This will return all the information relevant for this job
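The two steps above (take the last part of the CREAMjobid, then grep the logs) can be sketched as a pair of shell functions. The function names cream_short_id and trace_cream_job are hypothetical helpers, not commands shipped with CREAM; the log path is the one used above.

```shell
# Hypothetical helpers, not part of the CREAM distribution.

# Print the "last part" of a CREAMjobid, i.e. everything after the final '/'.
cream_short_id() {
    printf '%s\n' "${1##*/}"
}

# Grep the CREAM log files for all lines concerning the given CREAMjobid.
# Must be run on the CREAM CE by a user who can read the logs.
trace_cream_job() {
    grep "$(cream_short_id "$1")" /var/log/cream/glite-ce-cream.log*
}

# Example (on the CE):
#   trace_cream_job https://cream-07.pd.infn.it:8443/CREAM383606450
```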
2.15 Job purging
Purging a CREAM job means removing it from the CREAM database and removing from the CREAM CE any information relevant for that job (e.g. the job sandbox area).
When a job has been purged, it can no longer be managed (e.g. its status can no longer be checked).
A job can be purged:
- Explicitly by the user who submitted that job, using e.g. the
glite-ce-job-purge
command provided by the CREAM CLI
- Automatically by the CREAM job purger, which is responsible for purging old, forgotten jobs according to a policy specified in the CREAM configuration file (
/etc/glite-ce-cream/cream-config.xml
).
A user can purge only jobs she submitted. Only a CREAM CE admin can purge jobs submitted by other users.
For jobs submitted to a CREAM CE through the WMS, the purging is done by the ICE component of the WMS when it detects the job has reached a terminal status. The purging operation is not done if in the WMS conf file (
/etc/glite_wms.conf
) the attribute
purge_jobs
in the ICE section is set to
false
.
2.15.1 Automatic job purging
The automatic CREAM job purger is responsible for purging old, forgotten jobs according to a policy specified in the CREAM configuration file (
/etc/glite-ce-cream/cream-config.xml
).
This policy is specified by the attribute
JOB_PURGE_POLICY
.
For example, if
JOB_PURGE_POLICY
is the following:
<parameter name="JOB_PURGE_POLICY" value="ABORTED 1 days; CANCELLED 2 days; DONE-OK 3 days; DONE-FAILED 4 days; REGISTERED 5 days;" />
then the job purger will purge jobs which are:
- in ABORTED status for more than 1 day
- in CANCELLED status for more than 2 days
- in DONE-OK status for more than 3 days
- in DONE-FAILED status for more than 4 days
- in REGISTERED status for more than 5 days
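To illustrate how the policy string is structured, the following shell sketch splits a JOB_PURGE_POLICY value into one "status days" pair per line. The function name parse_purge_policy is hypothetical and is not part of CREAM; it only assumes the "STATUS N days;" layout shown in the example above.

```shell
# Hypothetical helper, not part of the CREAM distribution.
# Split a JOB_PURGE_POLICY value into "STATUS DAYS" lines.
parse_purge_policy() {
    # Entries are separated by ';'. Within an entry the first word is the
    # status and the second is the number of days; the unit word is dropped.
    printf '%s\n' "$1" | tr ';' '\n' | awk 'NF >= 2 { print $1, $2 }'
}

# Example:
#   parse_purge_policy "ABORTED 1 days; CANCELLED 2 days; DONE-OK 3 days;"
```

Such a listing can help to double check a policy before restarting the service, for instance to spot a status that has been left out.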
2.15.2 Purging jobs in a non terminal status
The (manual or automatic) purge operation can be issued only for jobs which are in a terminal status. If it is necessary to purge a job which has terminated but which CREAM still reports in a non terminal status (e.g. RUNNING, REALLY-RUNNING) because of a bug or some other problem, a specific utility (
JobDBAdminPurger.sh
) provided with the
glite-ce-cream
package can be used.
JobDBAdminPurger.sh allows purging jobs based on their CREAM jobids and/or their status (taking into account how long the job has been in that status).
Usage:
JobDBAdminPurger.sh [-c|--conf CREAMConfPath] -u|--userDB userDB -p|--pswDB pswDB [-j|--jobids jobId1:jobId2:...] | [-f|--filejobIds filenameJobIds] | [-s|--status statusType0,deltaTime:statusType1:...] [-h|--help]
Options:
- -c | --conf: the CREAM conf file (to be specified only if it is not the standard value /etc/glite-ce-cream/cream-config.xml)
- -u | --userDB: the CREAM DB user (as specified in the /etc/tomcat5/Catalina/localhost/ce-cream.xml file)
- -p | --pswDB: the CREAM DB password (as specified in the /etc/tomcat5/Catalina/localhost/ce-cream.xml file)
- -j | --jobids: the IDs (list of values separated by ':') of the jobs to be purged
- -f | --filejobIds: the file containing a list of jobids (one per line) to be purged
- -s | --status: the list of state,deltatime pairs (separated by ':') of the jobs to be purged. A job is purged if it has been in the specified state for more than deltatime days. deltatime can be omitted (which means that all jobs in that status will be purged). The possible states are:
- REGISTERED
- PENDING
- IDLE
- RUNNING
- REALLY-RUNNING
- CANCELLED
- HELD
- DONE-OK
- DONE-FAILED
- PURGED
- ABORTED
Examples:
JobDBAdminPurger.sh -u xyz -p abc -j CREAM217901296:CREAM324901232
JobDBAdminPurger.sh -u xyz -p abc -s registered:pending:idle
JobDBAdminPurger.sh -u xyz -p abc -s registered,3:pending:idle,5
JobDBAdminPurger.sh -u xyz -p abc -f /tmp/jobIdsToPurge.txt
Please note that this script should be run only to clean the CREAM DB in case of problems (i.e. jobs reported in a non terminal status when this is not actually the case).
Please also note that this script purges jobs from the CREAM DB. The relevant job sandbox directories are also deleted.
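When the jobs to be purged are known only by their full CREAMjobid URLs, the file for the -f option can be prepared with a small filter. This is only a sketch: to_short_jobids is a hypothetical helper, and it assumes that the file given to -f holds the same short IDs (e.g. CREAM217901296) that the -j examples use.

```shell
# Hypothetical helper, not part of the CREAM distribution.
# Read CREAMjobids (full URLs or short IDs) on standard input and print
# one short ID per line, dropping empty lines.
to_short_jobids() {
    sed -e 's#.*/##' -e '/^$/d'
}

# Example (the input file name is hypothetical):
#   to_short_jobids < /tmp/creamJobUrls.txt > /tmp/jobIdsToPurge.txt
#   JobDBAdminPurger.sh -u xyz -p abc -f /tmp/jobIdsToPurge.txt
```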
2.16 Proxy purging
Expired delegation proxies are automatically purged:
- from the DelegationDB
- from the file system (
<cream-sandbox-dir>/<group>/<DN>/proxy
)
In the CREAM configuration file (
/etc/glite-ce-cream/cream-config.xml
) there is a property called
delegation_purge_rate
which defines how often the proxy purger is run. The default value is 720 (720 minutes, that is 12 hours).
If the value is changed, it is then necessary to restart tomcat.
Setting the value to -1 disables the proxy purger.
2.17 Job wrapper management
2.17.1 Customization points
The CREAM JobWrapper running on the WN executes some scripts (to be provided by the local administrators) if they exist. These are called
customization points
.
There are 3 customization points:
-
${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh
. This is executed at the beginning of the CREAM job wrapper execution, before the creation of the temporary directory where the job is executed.
-
${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_2.sh
. This is executed just after the execution of the user job, before executing the epilogue (if any).
-
${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh
. This is executed just before the end of the JobWrapper execution.
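A customization point is an ordinary shell script. As a minimal sketch, a site could use cp_1.sh to record when and on which WN the job wrapper started. The function name and the log line format below are hypothetical, and the sketch writes to standard error only as an assumption about where such output is most easily found.

```shell
# Hypothetical content for ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh.
# The script deliberately avoids calling 'exit', since it is not stated
# whether the job wrapper sources or executes customization points.

# Print a timestamped line naming the customization point and the WN.
log_customization_point() {
    printf '%s %s host=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$(uname -n)" >&2
}

log_customization_point cp_1
```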
If the customization points are not enough, the administrator can also customize the CREAM job wrapper itself, as explained in the following section.
2.17.2 Customization of the CREAM Job wrapper
To customize the CREAM job wrapper, edit the template file
/usr/share/tomcat5/webapps/ce-cream/WEB-INF/jobwrapper.tpl
.
When done, tomcat must be restarted.
2.17.3 Customization of the Input/Output Sandbox file transfers
The CREAM job wrapper, besides running the user payload, is also responsible for other operations, such as the transfer of the input and output sandbox files from/to remote gridftp servers.
If such a transfer fails, the operation is retried after a while. The sleep time between the first attempt and the second one is the “initial wait time” specified in the CREAM configuration file. On every subsequent attempt the sleep time is doubled.
In the CREAM configuration file (
/etc/glite-ce-cream/cream-config.xml
) it is possible to set:
- the maximum number of times a file transfer should be tried
- the initial wait time (i.e. the wait time between the first attempt and the second one)
Different values can be used for input (ISB) and output (OSB) files.
The relevant section in the CREAM configuration file is this one:
<parameter name="JOB_WRAPPER_COPY_RETRY_COUNT_ISB" value="2" />
<parameter name="JOB_WRAPPER_COPY_RETRY_FIRST_WAIT_ISB" value="60" /> <!-- sec. -->
<parameter name="JOB_WRAPPER_COPY_RETRY_COUNT_OSB" value="6" />
<parameter name="JOB_WRAPPER_COPY_RETRY_FIRST_WAIT_OSB" value="300" /> <!-- sec. -->
If one or more of these values are changed, it is then necessary to restart tomcat.
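The sleep times that result from a given pair of values can be worked out with a few lines of shell. The sketch below is only an illustration: retry_waits is a hypothetical helper, not part of CREAM, and it assumes that the retry count is the total number of attempts, so that there is one sleep fewer than there are attempts.

```shell
# Hypothetical helper, not part of the CREAM distribution.
# Usage: retry_waits FIRST_WAIT_SECONDS MAX_ATTEMPTS
# Prints the sleep times (in seconds) between consecutive attempts,
# doubling the wait after every failed attempt.
retry_waits() {
    delay=$1
    attempts=$2
    i=1
    out=""
    while [ "$i" -lt "$attempts" ]; do
        out="${out:+$out }$delay"
        delay=$((delay * 2))
        i=$((i + 1))
    done
    printf '%s\n' "$out"
}

# Example with the values shown above:
#   retry_waits 60 2     (ISB)
#   retry_waits 300 6    (OSB)
```

Under that assumption, the values shown above give a single 60 second sleep for the ISB, and sleeps of 300, 600, 1200, 2400 and 4800 seconds for the OSB.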
2.18 Managing the forwarding of requirements to the batch system
The CREAM CE allows forwarding requirements to the batch system via the BLAH component.
From a site administrator's point of view, this requires creating and properly filling in some scripts (
/usr/bin/xxx_local_submit_attributes.sh
).
The relevant documentation is available at
http://wiki.italiangrid.org/twiki/bin/view/CREAM/UserGuide#1_Forward_of_requirements_to_the
2.19 Querying the CREAM Database
2.19.1 Check how many jobs are stored in the CREAM database
The following mysql query can be used to check how many jobs (along with their status) are reported in the CREAM database:
mysql> select jstd.name, count(*) from job, job_status_type_description jstd, job_status AS status LEFT OUTER JOIN job_status AS latest ON latest.jobId=status.jobId AND status.id < latest.id WHERE latest.id IS null and job.id=status.jobId and jstd.type=status.type group by jstd.name;
--
MassimoSgaravatto - 2011-04-07