Difference: KnownIssues (40 vs. 41)

Revision 41 2012-12-06 - GoncaloBorges

Line: 1 to 1
 
META TOPICPARENT name="GeneralDocumentation"

Known issues

Line: 98 to 98
 This will be fixed in EMI 2.

CREAM jobs are cancelled with status reason=3 in a GE system

Added:
>
>
When BLAH submits a job to a GE system, the sge_submit.sh script registers it in the blah job_registry with status "1". All subsequent status changes are written to the blah job_registry by the BUpdaterSGE daemon.

BUpdaterSGE is the daemon that determines the status of a given job by examining the output of the "qstat" command. A tricky situation arises when a job disappears from that output: was it cancelled, or did it finish? To tell the difference, BUpdaterSGE uses "qacct" to query the accounting log. If the accounting log contains information about the job, the job finished; otherwise it was cancelled. The accounting log is queried twice with "qacct -j", one minute apart. If both queries return an error, the job is assumed to have been cancelled.
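The decision described above can be sketched as a small shell function. This is only an illustration of the logic, not the BUpdaterSGE source: the function name classify_missing_job and the QACCT and RETRY_DELAY variables are hypothetical and exist only so the sketch can be tried without a live GE system.

```shell
#!/bin/sh
# Sketch of the "finished or cancelled?" decision for a job that has
# disappeared from the qstat output.
# QACCT and RETRY_DELAY are hypothetical knobs for this illustration;
# they are not BUpdaterSGE configuration options.
QACCT="${QACCT:-qacct}"
RETRY_DELAY="${RETRY_DELAY:-60}"

classify_missing_job() {
    job_id="$1"
    # First query of the accounting log.
    if $QACCT -j "$job_id" >/dev/null 2>&1; then
        echo finished
        return 0
    fi
    # The accounting record may not be written yet: wait, then query again.
    sleep "$RETRY_DELAY"
    if $QACCT -j "$job_id" >/dev/null 2>&1; then
        echo finished
    else
        # Both queries returned an error: assume the job was cancelled.
        echo cancelled
    fi
}
```

Note that the function cannot distinguish "qacct found no record" from "qacct could not run at all": both end up as cancelled, which is exactly why a broken environment shows up as systematically cancelled jobs.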

If you see jobs being systematically cancelled in glite-cream-ce.log, with entries like

02 Dec 2012 12:43:01,247 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM172977114 STATUS CHANGED: REALLY-RUNNING => CANCELLED [description=Cancelled by CE admin] [failureReason=reason=3] [localUser=XXX] [workerNode=XXX] [delegationId=1354371579.274301]

this most probably means that the "qstat" and "qacct" commands cannot be executed successfully by tomcat. This can happen for several reasons:

* The BUpdaterSGE daemon does not inherit the correct GE environment variables
* Tomcat user is not allowed to query the GE system
* The accounting file is not shared in the CreamCE

The BUpdaterSGE daemon does not inherit the correct GE environment variables

If the environment of the BUpdaterSGE process does not include the GE environment variables, BUpdaterSGE cannot execute the GE client commands (qstat, qconf).

As a consequence, BUpdaterSGE will assume that jobs have been cancelled, because it receives no information from qstat or qacct. You can check the environment of the BUpdaterSGE process using the following commands and searching for the GE environment variables (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL):

Line: 115 to 133
  . /etc/profile.d/sge.sh
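As a complementary check, the environment of a running process can be inspected on Linux through /proc/<pid>/environ. The sketch below reports which of the GE variables named above are absent from such a file. The function name missing_sge_vars is hypothetical, and the variable names are matched by prefix because they are taken as written in this topic (the actual variables may carry a suffix such as _PORT).

```shell
#!/bin/sh
# Print the GE variables that are NOT present in a process environment file.
# Argument: path to a NUL-separated environment file such as /proc/<pid>/environ.
missing_sge_vars() {
    environ_file="$1"
    for var in SGE_EXECD SGE_QMASTER SGE_ROOT SGE_CLUSTER_NAME SGE_CELL; do
        # Convert NUL separators to newlines, then look for a variable
        # whose name starts with the expected prefix.
        if ! tr '\0' '\n' < "$environ_file" | grep -q "^${var}"; then
            echo "$var"
        fi
    done
}
```

Assuming the daemon shows up in the process list under the name BUpdaterSGE, it could be called as root with: missing_sge_vars "/proc/$(pgrep -o -f BUpdaterSGE)/environ". Empty output means all five prefixes were found.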
Added:
>
>

Tomcat user is not allowed to query the GE system

Some GE systems use certificates to encrypt the communication between GE client and server. For CreamCE, the tomcat user must be able to query your system, because the BUpdaterSGE daemon runs as user tomcat. If it cannot, you will most probably get the following error when running "qstat" as user tomcat:

su - tomcat
sh-3.2$ qstat -u '*'
error: commlib error: can't set CA chain file (/var/sgeCA/sge_qmaster/GridKa/userkeys/tomcat/cert.pem)
error: commlib error: ssl error ([ID=33558530] in module "system library": "No such file or directory")
error: unable to send message to qmaster using port 15020 on host "xxxxxxxx": can't set CA chain file
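To repeat this check non-interactively, the qstat output can be piped through a small filter that looks for the "commlib error" string shown above. This is a minimal sketch; the function name classify_qstat_output and the labels it prints are hypothetical.

```shell
#!/bin/sh
# Classify the combined stdout/stderr of a qstat run, read from stdin.
# Prints "comm-error" if a commlib error is reported, "ok" otherwise.
classify_qstat_output() {
    if grep -q 'commlib error'; then
        echo comm-error
    else
        echo ok
    fi
}
```

Run as root, a possible invocation is: su - tomcat -c "qstat -u '*'" 2>&1 | classify_qstat_output. An "ok" result only means that no commlib error was printed; it does not prove that qacct works as well.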
 

Significant changes introduced with Torque 2.5.7-1

The updated EPEL5 build of torque-2.5.7-1 as compared to previous versions enables munge[1] as an inter node authentication method. Please see

 