System Administrator Guide for CREAM for EMI-2 release

0.0.0.1 Installation

If the CREAM-CE also has to be the Torque server, install the emi-torque-server metapackage:
 
  • for SL5_x86_64: yum install emi-torque-server

In all cases (Torque server on the CREAM-CE or on a different host), install the emi-torque-utils metapackage:

  • for SL5_x86_64: yum install emi-torque-utils

TBC

0.0.0.2 Yaim Configuration

Set your siteinfo.def file, which is the input file used by yaim.

The CREAM CE Torque integration is then configured by running YAIM:

  • no cluster mode with CREAM-CE being also Torque server: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
  • no cluster mode with CREAM-CE not being also Torque server: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
  • cluster mode with glite-CLUSTER deployed on a different node with CREAM-CE being also Torque server: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils
  • cluster mode with glite-CLUSTER deployed on a different node with CREAM-CE not being also Torque server: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils
  • cluster mode with glite-CLUSTER deployed on the same node of the CREAM-CE with CREAM-CE being also Torque server : /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_server -n TORQUE_utils -n glite-CLUSTER
  • cluster mode with glite-CLUSTER deployed on the same node of the CREAM-CE with CREAM-CE not being also Torque server : /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n TORQUE_utils -n glite-CLUSTER

0.0.1 LSF

0.0.1.1 Requirements

You have to install and configure the LSF batch system software before installing and configuring the CREAM software.

0.0.1.2 Installation

If you are running LSF, install the emi-lsf-utils metapackage:

  • for sl5_x86_64: yum install emi-lsf-utils

0.0.1.3 Yaim Configuration

Set your siteinfo.def file, which is the input file used by yaim.

The CREAM CE LSF integration is then configured by running YAIM:

  • no cluster mode: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
  • cluster mode with glite-CLUSTER deployed on a different node: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils
  • cluster mode with glite-CLUSTER deployed on the same node of the CREAM-CE: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n LSF_utils -n glite-CLUSTER

0.0.2 Grid Engine

0.0.2.1 Requirements

You have to install and configure the GE batch system software before installing and configuring the CREAM software. The CREAM CE integration was tested with GE 6.2u5 but it should work with any forked version of the original GE software. The support of the GE batch system software (or any of its forked versions) is out of the scope of this activity.

Before proceeding, please take note of the following remarks:

  1. CREAM CE must be installed in a separate node from the GE SERVER (GE QMASTER).
  2. CREAM CE must work as a GE submission host (use qconf -as <CE.MY.DOMAIN> in the GE QMASTER to set it up).

0.0.2.2 Integration plugins

The GE integration with the CREAM CE consists in deploying specific BLAH plugins and configuring them to properly interoperate with the Grid Engine batch system. The following GE BLAH plugins are deployed with the CREAM CE installation: BUpdaterSGE, sge_hold.sh, sge_submit.sh, sge_resume.sh, sge_status.sh and sge_cancel.sh.

0.0.2.3 Installation

If you are running GE, install the emi-ge-utils metapackage:

  • for sl5_x86_64: yum install emi-ge-utils

0.0.2.4 Yaim Configuration

Set your siteinfo.def file, which is the input file used by yaim. Documentation about yaim variables relevant for CREAM CE and GE is available at

The most relevant GE YAIM variables to set in your site-info.def are:
  1. BLPARSER_WITH_UPDATER_NOTIFIER= "true"
  2. JOB_MANAGER= sge
  3. CE_BATCH_SYS= sge
  4. SGE_ROOT= <Path to your SGE installation>. Default: "/usr/local/sge/pro"
  5. SGE_CELL= <Path to your SGE CELL>. Default: "default"
  6. SGE_QMASTER= <SGE QMASTER PORT>. Default: "536"
  7. SGE_EXECD= <SGE EXECD PORT>. Default: "537"
  8. SGE_SPOOL_METH= "classic"
  9. BATCH_SERVER= <FQDN of your QMASTER>
  10. BATCH_LOG_DIR= <Path for the GE accounting file>
  11. BATCH_BIN_DIR= <Path for the GE binaries>
  12. BATCH_VERSION= <GE version>
Some sites use GE installations shared via NFS (or equivalent) on the CREAM CE. To prevent YAIM from changing that setup when it is executed, define SGE_SHARED_INSTALL=yes in your site-info.def; otherwise YAIM may modify your GE setup according to the definitions in your site-info.def. A minimal example excerpt is sketched below.
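As an illustration only, a GE-related site-info.def excerpt might look like the following; host names, paths and the GE version are placeholders to be replaced with your site values:

BLPARSER_WITH_UPDATER_NOTIFIER="true"
JOB_MANAGER=sge
CE_BATCH_SYS=sge
SGE_ROOT="/usr/local/sge/pro"
SGE_CELL="default"
SGE_QMASTER="536"
SGE_EXECD="537"
SGE_SPOOL_METH="classic"
BATCH_SERVER="qmaster.mydomain.org"
BATCH_LOG_DIR="/usr/local/sge/pro/default/common"   # directory holding the GE accounting file (placeholder)
BATCH_BIN_DIR="/usr/local/sge/pro/bin/lx26-amd64"   # GE binaries (placeholder)
BATCH_VERSION="6.2u5"
# Only for GE installations shared via NFS (or equivalent):
SGE_SHARED_INSTALL=yes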

The CREAM CE GE integration is then configured by running YAIM:

  • no cluster mode: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils
  • cluster mode with glite-CLUSTER deployed on a different node: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils
  • cluster mode with glite-CLUSTER deployed on the same node of the CREAM-CE: /opt/glite/yaim/bin/yaim -c -s <site-info.def> -n creamCE -n SGE_utils -n glite-CLUSTER

0.0.2.5 Important notes

0.0.2.5.1 File transfers

Besides the input/output sandbox files (transferred via GFTP), there are some other files that need to be transferred between the CREAM sandbox directory on the CE node and the Worker Node, namely:

  • The CREAM job wrapper and the user proxies, that are staged from the CE node to the WN where the job will run
  • The standard output and error files of the CREAM job wrapper, that are copied from the WN to the CE when the job completes its execution.
Since GE does not implement staging capabilities by default, we distribute the sge_filestaging file with the GE CREAM software. To enable the copying of these files:
  1. Copy the sge_filestaging file to all your WNs (or to a shared directory mounted on your WNs)
  2. Add <path>/sge_filestaging --stagein and <path>/sge_filestaging --stageout to the prolog and epilog defined in the GE global configuration (use qconf -mconf), or alternatively, in each queue configuration (qconf -mq <QUEUE>); a sketch is given after the diff below.
  3. If you do not share the CREAM sandbox area between the CREAM CE node and the Worker Node, the sge_filestaging file requires configuring the ssh trust between CE and WNs.
  4. If you share the CREAM sandbox area between the CREAM CE node and the Worker Node, the sge_filestaging file has to be changed according to:

# diff -Nua sge_filestaging.modified sge_filestaging.orig
--- sge_filestaging.modified    2010-03-25 19:38:11.000000000 +0000
+++ sge_filestaging.orig    2010-03-25 19:05:43.000000000 +0000
 my $remotefile = $3;
 
 if ( $STAGEIN ) {
-    system( 'cp', $remotefile, $localfile );
+    system( 'scp', "$remotemachine:$remotefile", $localfile );
 }
 else {
-    system( 'cp', $localfile, $remotefile );
+    system( 'scp', $localfile, "$remotemachine:$remotefile" );
 }
 }
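As a sketch only (the path is a placeholder, adapt it to where you copied sge_filestaging on the WNs), the corresponding prolog and epilog entries in the GE global configuration (edited with qconf -mconf) would look like:

prolog   /opt/glite/bin/sge_filestaging --stagein
epilog   /opt/glite/bin/sge_filestaging --stageout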

0.0.2.5.2 GE accounting file

BUpdaterSGE needs to consult the GE accounting file to determine how a given job ended. Therefore, the GE accounting file must be shared between the GE SERVER / QMASTER and the CREAM CE.

Moreover, to guarantee that the accounting file is updated on the fly, the GE configuration should be tuned (using qconf -mconf) by adding the following definitions under reporting_params: accounting=true accounting_flush_time=00:00:00
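For illustration, the resulting reporting_params line in qconf -mconf could read as follows; the entries other than accounting=true and accounting_flush_time=00:00:00 are the usual GE defaults and may differ at your site:

reporting_params   accounting=true reporting=false flush_time=00:00:00 joblog=false sharelog=00:00:00 accounting_flush_time=00:00:00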

0.0.2.5.3 GE SERVER (QMASTER) tuning

The following suggestions should be implemented to achieve better performance when integrating with CREAM CE:

  1. The CREAM CE machine must be set as a submission machine
  2. The GE QMASTER configuration should have the definition execd_params INHERIT_ENV=false (use qconf -mconf to set it up). This setting ensures that the environment of the submission machine (CREAM CE) is propagated to the execution machine (WN).

1 Postconfiguration

Have a look at the Known issue page to check if some postconfigurations are needed.

2 Operating the system

2.1 Tomcat configuration guidelines

In /etc/tomcat5/tomcat5.conf there are some settings related to the heap size. They are part of the JAVA_OPTS setting (see -Xms and -Xmx).

It is suggested to customize these settings taking into account how much physical memory is available, as indicated in the following table (which refers to 64-bit architectures):

  Memory     | JAVA_OPTS setting
  < 2 GB     | -Xms128m -Xmx512m
  2 - 4 GB   | -Xms512m -Xmx1024m
  > 4 GB     | -Xms512m -Xmx2048m
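For example, on a 64-bit host with 2 - 4 GB of memory the relevant line in /etc/tomcat5/tomcat5.conf could look like the following sketch (any other options already present in JAVA_OPTS should be kept):

JAVA_OPTS="${JAVA_OPTS} -Xms512m -Xmx1024m"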

After having made the changes, it is necessary to restart tomcat.

2.2 MySQL database configuration guidelines

Default values of some MySQL settings are likely to be suboptimal, especially for large machines; carefully tuning some parameters can improve the overall performance.
In this context one relevant parameter to set is innodb_buffer_pool_size, which specifies the size of the InnoDB buffer pool (the default value is 8 MB).

The main benefits of setting a proper value for this parameter are an appreciable performance improvement and a reduced amount of disk I/O needed to access the data in the tables. The optimal value depends on the amount of physical memory and on the CPU architecture of the host machine.

The maximum value depends on the CPU architecture, 32-bit or 64-bit. For 32-bit systems, the CPU architecture and operating system sometimes impose a lower practical maximum size.
The larger this value is, the less disk I/O is needed to access data in tables. On a dedicated database server it is possible to set it up to 80% of the physical memory of the machine.
Scale back this value if one of the following issues occurs:

  • competition for physical memory causes paging in the operating system;
  • InnoDB reserves additional memory for buffers and control structures, so that the total allocated space is approximately 10% greater than the specified size.
In /etc/my.cnf, within the [mysqld] section, it is suggested to customize the innodb_buffer_pool_size parameter taking into account how much physical memory is available.

Example:

[mysqld]
innodb_buffer_pool_size=512M

After that, it is necessary to restart the mysql service to apply the change:

/etc/init.d/mysqld restart

Finally, the following SQL command (root rights are needed) can be used to check whether the new value was applied successfully:

SHOW VARIABLES like 'innodb_buffer_pool_size';

2.3 MySQL database: How to resize Innodb log files

If the following error occurs (see the mysql log file: /var/log/mysqld.log)

InnoDB: ERROR: the age of the last checkpoint is ,
InnoDB: which exceeds the log group capacity .
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

then you must resize the innodb log files.

Follow these steps:

  • check for the value of the innodb_log_file_size.
SHOW VARIABLES  like "innodb_log_file_size"; 

  • Stop the MySQL server and make sure it shuts down without any errors. Have a look at the error log to verify that there are no errors.
service mysqld stop

  • Once the server has stopped, edit the configuration file ( /etc/my.cnf) and insert or change the value of innodb_log_file_size to your desired value (64M should be a proper value). Example:
[mysqld]
innodb_log_file_size=64M

  • Move the log files ib_logfile* out of the directory where they reside. Example:
mv /var/lib/mysql/ib_logfile* /tmp

  • Now restart the server.
service mysqld start

  • Check for errors in /var/log/mysqld.log file

  • Verify the correct size of the log files
ls -lrth /var/lib/mysql/ib_logfile*

2.4 How to start the CREAM service

A site admin can start the CREAM service just starting the CREAM container:

For sl5_x86_64:

/etc/init.d/tomcat5 start

In case the new BLAH blparser is used, this will also start it (if not already running).

If for some reason it is necessary to explicitly start the new BLAH blparser, the following command can be used:

/etc/init.d/glite-ce-blah-parser start

If instead the old BLAH blparser is used, before starting tomcat it is necessary to start it on the BLPARSER_HOST using the command:

/etc/init.d/glite-ce-blah-parser start

To stop the CREAM service, it is just necessary to stop the CREAM container.

For sl5_x86_64:

/etc/init.d/tomcat5 stop

2.5 Daemons

Information about daemons running in the CREAM CE is available in TBC

2.6 Init scripts

Information about init scripts in the CREAM CE is available in TBC

2.7 Configuration files

Information about configuration files in the CREAM CE is available in TBC

2.8 Log files

Information about log files in the CREAM CE is available in TBC

2.9 Network ports

Information about ports used in the CREAM CE is available in TBC

2.10 Cron jobs

Information about cron jobs used in the CREAM CE is available in TBC

2.11 Security related operations

2.11.1 How to enable a certain VO for a certain CREAM CE in Argus

Let's consider that a certain CREAM CE has been configured to use ARGUS as the authorization system.

Let's suppose that we chose http://pd.infn.it/cream-18 as the id of the CREAM CE (i.e. yaim variable CREAM_PEPC_RESOURCEID is http://pd.infn.it/cream-18).

On the ARGUS box (identified by the yaim variable ARGUS_PEPD_ENDPOINTS) to enable the VO XYZ, it is necessary to define the following policy:

resource "http://pd.infn.it/cream-18" {
    obligation "http://glite.org/xacml/obligation/local-environment-map" {}
    action ".*" {
        rule permit { vo = "XYZ" }
    }
}
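Assuming the policy above has been saved in a file (e.g. vo-xyz-policy.txt, a file name chosen here purely for illustration), it is typically loaded into the Argus PAP with the pap-admin tool; please check the Argus documentation for the exact syntax of your version. A possible sequence is:

pap-admin add-policies-from-file vo-xyz-policy.txt
pap-admin list-policies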

2.11.2 Security recommendations

Security recommendations relevant for the CREAM CE is available in TBC

2.11.3 How to block/ban a user

Information about how to ban users is available in TBC

2.11.4 How to block/ban a VO

To ban a VO, it is suggested to reconfigure the service via yaim without that VO in the siteinfo.def

2.11.5 How to define a CREAM administrator

A CREAM administrator (aka super-user) can also manage (e.g. cancel, check the status of, etc.) the jobs submitted by other people.

Moreover he/she can issue some privileged operations, in particular the ones to disable new job submissions ( glite-ce-disable-submission) and then to re-enable them ( glite-ce-enable-submission).

To define a CREAM CE administrator for a specific CREAM CE, the DN of this person must be specified in the /etc/grid-security/admin-list of this CREAM CE node, e.g.:

"/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"

Please note that enclosing the DN in double quotes is important.

2.12 Input and Output Sandbox files transfer between the CREAM CE and the WN

The input and output sandbox files (unless they have to be copied from/to remote servers) are copied between the CREAM CE node and the Worker Node.

These files transfers can be done in two possible ways:

  • Using gridftp
  • Using the staging capabilities of the batch system
The choice is made at configuration time by setting the yaim variable SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN (see the example after the following list). Possible values are:

  • GSIFTP to use gridftp. This is the default value
  • LRMS to use the staging capabilities of the batch system
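For example, to use the staging capabilities of the batch system, the following line would be set in site-info.def (the default, GSIFTP, does not need to be set explicitly):

SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN="LRMS"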

2.13 Sharing of the CREAM sandbox area between the CREAM CE and the WN

Besides the input/output sandbox files, there are some other files that need to be transferred between the CREAM sandbox directory on the CE node and the Worker Node:

  • The CREAM job wrapper and the user proxies, that are staged from the CE node to the WN where the job will run
  • The standard output and error files of the CREAM job wrapper, that are copied from the WN to the CE when the job completes its execution.
To manage that, there are two possible options:

  • Use the staging capabilities of the batch system (e.g. for Torque this requires configuring the ssh trust between CE and WNs)
  • Share the CREAM sandbox area between the CREAM CE node and the Worker Node and configure the batch system appropriately
Please note:

  • If you want to have several CREAM CEs sharing the same WNs, you need to mount each CE sandbox area on a different mount point on the WN, such as /cream_sandbox/ce_hostname (see the example mount below).

  • The CREAM sandbox directory name (default /var/cream_sandbox) can be changed using the yaim variable CREAM_SANDBOX_DIR
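As a sketch of a possible mount setup on a WN serving two CREAM CEs (the host names and the use of NFS are placeholders, not requirements):

# /etc/fstab on the WN
ce01.mydomain.org:/var/cream_sandbox   /cream_sandbox/ce01.mydomain.org   nfs   defaults   0 0
ce02.mydomain.org:/var/cream_sandbox   /cream_sandbox/ce02.mydomain.org   nfs   defaults   0 0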

2.13.1 Sharing of the CREAM sandbox area between the CREAM CE and the WN for Torque

When Torque is used as batch system, to share the CREAM sandbox area between the CREAM CE node and the WNs:

  • Mount the cream_sandbox directory also on the WNs. Let's assume that on the CE node the cream sandbox directory is called /var/cream_sandbox and that on the WN it is mounted as /cream_sandbox
  • On the WNs, add the following to the Torque client config file (generally /var/spool/pbs/mom_priv/config):

$usecp <CE node>:/var/cream_sandbox /cream_sandbox

This $usecp line means that every time Torque has to copy a file from/to the cream_sandbox directory on the CE (which is the case during the stage in/stage out phase), it will instead use a local cp from/to /cream_sandbox.

2.14 Self-limiting CREAM behavior

CREAM is able to protect itself if the load, memory usage, etc. is too high. This is done by disabling new job submissions, while the other commands are still allowed.

This is implemented via a limiter script ( /usr/bin/glite_cream_load_monitor), very similar to the one used in the WMS.

Basically this limiter script checks the values of some system and CREAM specific parameters and compares them against some thresholds. If one or more thresholds are exceeded, new job submissions get disabled. If a new submission is attempted while submissions are disabled, an error message is returned, e.g.:

TBC

2.15 How to drain a CREAM CE

The administrator of a CREAM CE can decide to drain a CREAM CE, that is to disable new job submissions while allowing the other commands. This can be useful, for example, because of a scheduled shutdown of the CREAM CE.

This can be achieved via the glite-ce-disable-submission command (provided by the CREAM CLI package installed on the UI), which can be issued only by a CREAM CE administrator, that is a person whose DN is listed in the /etc/grid-security/admin-list file of the CE.

If new job submissions are attempted, users will get an error message such as:

TBC

2.16 How to trace a specific job

To trace a specific job, first of all get the CREAMjobid.

If the job was submitted through the WMS, you can get its CREAM jobid in the following way:

glite-wms-job-logging-info -v 2 <gridjobid> | grep "Dest jobid"

If the job is not yours and you are not an LB admin, you can get the CREAM jobid of that gridjobid, if you have access to the CREAM logs, by doing:

grep <gridjobid> /var/log/cream/glite-ce-cream.log*

Grep the "last part" of the CREAMjobid in the CREAM log file (e.g. if the CREAMjobid is https://cream-07.pd.infn.it:8443/CREAM383606450 considers CREAM383606450):

grep CREAM383606450 /var/log/cream/glite-ce-cream.log*

This will return all the information relevant for this job.
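If you have a valid VOMS proxy on a UI, the current status of the job can also be queried directly with the CREAM CLI, passing the full CREAM jobid (note that only the job owner or a CREAM administrator can query it), e.g.:

glite-ce-job-status https://cream-07.pd.infn.it:8443/CREAM383606450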

2.17 How to check if you are using the old or the new blparser

If you want to quickly check whether you are using the old or the new BLAH blparser, run grep registry blah.config. If you see something like:

# grep registry blah.config
job_registry=/var/blah/user_blah_job_registry.bjr

you are using the new BLAH blparser. Otherwise you are using the old one.

2.18 Job purging

Purging a CREAM job means removing it from the CREAM database and removing from the CREAM CE any information relevant for that job (e.g. the job sandbox area).

When a job has been purged, it is not possible to manage it anymore (e.g. it is not possible to check anymore its status).

A job can be purged:

  • Explicitly by the user who submitted that job, using e.g. the glite-ce-job-purge command provided by the CREAM CLI

  • Automatically by the automatic CREAM job purger, which is responsible for purging old, forgotten jobs according to a policy specified in the CREAM configuration file ( /etc/glite-ce-cream/cream-config.xml).
A user can purge only jobs she submitted. Only a CREAM CE admin can purge jobs submitted by other users.

For jobs submitted to a CREAM CE through the WMS, the purging is done by the ICE component of the WMS when it detects the job has reached a terminal status. The purging operation is not done if in the WMS conf file ( /etc/glite_wms.conf) the attribute purge_jobs in the ICE section is set to false.

2.18.1 Automatic job purging

The automatic CREAM job purger is responsible for purging old, forgotten jobs according to a policy specified in the CREAM configuration file ( /etc/glite-ce-cream/cream-config.xml).

This policy is specified by the attribute JOB_PURGE_POLICY.

For example, if JOB_PURGE_POLICY is the following:

<parameter name="JOB_PURGE_POLICY" value="ABORTED 1 days; CANCELLED 2 days; DONE-OK 3 days; DONE-FAILED 4 days; REGISTERED 5 days;" />

then the job purger will purge jobs which are:

  • in ABORTED status for more than 1 day
  • in CANCELLED status for more than 2 days
  • in DONE-OK status for more than 3 days
  • in DONE-FAILED status for more than 4 days
  • in REGISTERED status for more than 5 days

2.18.2 Purging jobs in a non terminal status

The (manual or automatic) purge operation can be issued only for jobs which are in a terminal status. If it is necessary to purge a job which has terminated but which for CREAM is in a non-terminal status (e.g. RUNNING, REALLY_RUNNING) because of some bugs/problems/..., a specific utility ( JobAdminPurger) provided with the glite-ce-cream package can be used.

JobAdminPurger allows purging jobs based on their CREAM jobids and/or their status (taking into account how long the job has been in that status).

TBC

2.19 Proxy purging

Expired delegation proxies are automatically purged:

  • from the DelegationDB
  • from the file system ( <cream-sandbox-dir>/<group>/<DN>/proxy)
In the CREAM configuration file ( /etc/glite-ce-cream/cream-config.xml) there is a property called delegation_purge_rate which defines how often the proxy purger is run. The default value is 720 (720 minutes, that is 12 hours).

If the value is changed, it is then necessary to restart tomcat.

Setting that value to -1 means disabling the proxy purger.

2.20 Job wrapper management

2.20.1 Customization points

The CREAM JobWrapper running on the WN executes some scripts (to be provided by the local administrator), if they exist. These are called customization points.

There are 3 customization points (a minimal example is sketched after this list):

  • ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh. This is executed in the beginning of the CREAM job wrapper execution, before the creation of the temporary directory where the job is executed.

  • ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_2.sh. This is executed just after the execution of the user job, before executing the epilogue (if any)

  • ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_3.sh. This is executed just before the end of the JobWrapper execution.
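As a minimal sketch of a customization point (the logged message and the log file path are purely illustrative), a cp_1.sh recording when each job wrapper starts could look like:

#!/bin/sh
# ${GLITE_LOCAL_CUSTOMIZATION_DIR}/cp_1.sh - illustrative example only
# Append a line to a local log file each time a CREAM job wrapper starts on this WN
echo "$(date) - job wrapper started on $(hostname) for user $(id -un)" >> /tmp/cream_cp1.log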
If setting the customization points is not enough, the administrator can also customize the CREAM job wrapper, as explained in the following section.

2.20.2 Customization of the CREAM Job wrapper

To customize the CREAM job wrapper it is just necessary to edit as appropriate the template file /etc/glite-ce-cream/jobwrapper.tpl.

When done, tomcat must be restarted.

2.20.3 Customization of the Input/Output Sandbox file transfers

The CREAM job wrapper, besides running the user payload, is also responsible for other operations, such as the transfer of the input and output sandbox files from/to remote gridftp servers.

If such a transfer fails, the operation is retried after a while. The sleep time between the first attempt and the second one is the “initial wait time” specified in the CREAM configuration file. On every subsequent attempt the sleep time is doubled.

In the CREAM configuration file ( /etc/glite-ce-cream/cream-config.xml) it is possible to set:

  • the maximum number of file transfers that should be tried
  • the initial wait time (i.e. the wait time between the first attempt and the second one)
Different values can be used for input (ISB) and output (OSB) files.

The relevant section in the CREAM configuration file is this one:

<parameter name="JOB_WRAPPER_COPY_RETRY_COUNT_ISB" value="2" />
<parameter name="JOB_WRAPPER_COPY_RETRY_FIRST_WAIT_ISB" value="60" /> <!-- sec. -->
<parameter name="JOB_WRAPPER_COPY_RETRY_COUNT_OSB" value="6" />
<parameter name="JOB_WRAPPER_COPY_RETRY_FIRST_WAIT_OSB" value="300" /> <!-- sec. -->

If one or more of these values are changed, it is then necessary to restart tomcat.

2.21 Managing the forwarding of requirements to the batch system

The CREAM CE allows forwarding, via the BLAH component, requirements to the batch system.

From a site administrator's point of view, this requires creating and properly filling some scripts ( /usr/bin/xxx_local_submit_attributes.sh).

The relevant documentation is available at TBC

2.22 Querying the CREAM Database

2.22.1 Check how many jobs are stored in the CREAM database

The following mysql query can be used to check how many jobs (along with their status) are reported in the CREAM database:

mysql> select jstd.name, count(*)
       from job, job_status_type_description jstd,
            job_status AS status LEFT OUTER JOIN job_status AS latest
            ON latest.jobId=status.jobId AND status.id < latest.id
       WHERE latest.id IS null and job.id=status.jobId and jstd.type=status.type
       group by jstd.name;
  -- MassimoSgaravatto - 2011-12-20
 