Difference: TroubleshootingGuide (3 vs. 4)

Revision 4 - 2011-04-29 - MassimoSgaravatto

Line: 1 to 1
 
META TOPICPARENT name="SystemAdministratorDocumentation"

Troubleshooting guide for CREAM

Line: 259 to 259
 Note that the torque_server node has to list the submission hosts (the CREAM CEs) in its /etc/hosts.equiv. Alternatively, in recent versions of Torque, consider the acl_hosts and acl_hosts_enable attributes in the Torque server configuration (see http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml#open).
Added:

0.1 org.glite.security.delegation.storage.GrDPStorageException

Example:

2009-09-10 15:48:16,082 ERROR - Received NULL fault; the error is due to
another cause:
FaultString=[org.glite.security.delegation.storage.GrDPStorageException:
Configuration error: delegation_factorynull] -
FaultCode=[SOAP-ENV:Server.userException] -
FaultSubCode=[SOAP-ENV:Server.userException] -
FaultDetail=[<ns1:hostname>vtb-generic-64.cern.ch</ns1:hostname>]

This is likely a problem with the MySQL database (e.g. MySQL not accessible, problems with grants, etc.).
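As a first check, one can verify that the MySQL port is reachable at all. The following is a minimal Python sketch, assuming the database runs on the CE itself and listens on the default MySQL port 3306 (both are assumptions, adjust them to your setup). The helper name can_connect is hypothetical. It only tests TCP reachability and says nothing about grants.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Assumption: MySQL runs on this host, on the default port 3306.
    print("mysql reachable:", can_connect("localhost", 3306))
```

If the port is reachable, the next thing to look at is the grants of the database user configured for CREAM.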

0.2 sudo: 0 incorrect password attempts

This means that there is an error in the sudoers file created by yaim-cream-ce (e.g. the mapped user is not "enabled"), so this is very likely a configuration problem.

0.3 Authorization error: Failed to get the local user id via glexec

This usually means that an error occurred while running glexec to get the local user id. Check the glexec log files (syslog or the log files defined in /etc/glexec.conf). You might also need to increase the glexec/lcas/lcmaps debug levels (setting them to 5) in /etc/glexec.conf.

0.4 Cannot find grid-proxy-info

This means that the job wrapper running on the WN could not find the grid-proxy-info executable. There are several possible reasons. The most common ones are:

  • the grid-proxy-info executable is not installed on the WN
  • the grid-proxy-info executable is not found on the WN (e.g. because it is not in the path of the local account executing the job)
  • the which executable is not installed on the WN
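The three causes above can be checked in one go. This is a minimal Python sketch, assuming Python is available on the WN and that the script is run as the local account the job is mapped to. The function name missing_executables is hypothetical.

```python
import shutil


def missing_executables(names, path=None):
    """Return the names that cannot be found on the search path.

    With path=None, shutil.which uses the PATH of the current process.
    """
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    # Run this on the WN, as the local account executing the job.
    missing = missing_executables(["grid-proxy-info", "which"])
    if missing:
        print("not found in PATH:", ", ".join(missing))
    else:
        print("all required executables found")
```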

0.5 Problem to detect the lifetime of the proxy

This means that the job wrapper running on the WN could not detect the lifetime of the proxy (using the grid-proxy-info command). There are several possible reasons. The most common ones are:

  • the proxy for some reason was not staged on the WN
  • the grid-proxy-info executable was not found (or it was not in the path) on the WN
  • the which executable is not installed on the WN
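To tell these cases apart, one can look for the proxy file first and only then call grid-proxy-info. The sketch below is in Python and rests on assumptions: the proxy location follows the usual Globus convention (the X509_USER_PROXY variable, otherwise /tmp/x509up_u<uid>), the -file and -timeleft options are available in the installed grid-proxy-info, and the function names are hypothetical. The job wrapper itself may locate the proxy differently.

```python
import os
import shutil
import subprocess


def proxy_path(environ=None, uid=None):
    """Return the expected proxy location (assumed Globus convention)."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid
    return environ.get("X509_USER_PROXY") or f"/tmp/x509up_u{uid}"


def diagnose(environ=None, uid=None):
    """Return a one-line diagnosis of the proxy lifetime problem."""
    path = proxy_path(environ, uid)
    if not os.path.isfile(path):
        return f"proxy file not found: {path}"
    if shutil.which("grid-proxy-info") is None:
        return "grid-proxy-info not found in PATH"
    # Assumption: -file and -timeleft are supported by this installation.
    result = subprocess.run(
        ["grid-proxy-info", "-file", path, "-timeleft"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return f"grid-proxy-info failed: {result.stderr.strip()}"
    return f"time left (s): {result.stdout.strip()}"
```

Calling diagnose() on the WN, as the mapped local account, reports the first check that fails.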

0.6 Cannot create the job's working directory! [failure reason = ">>> sudoers file: Alias `XYZ' already defined, line ABC <<<"]

This means that there is a syntax error in the sudoers file created by yaim-cream-ce. A likely reason is that the same VO has been enabled more than once in the siteinfo.def.
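A repeated alias can be located with a short script. The following Python sketch is an illustration, not a sudoers parser: it assumes one alias definition per line, counts by alias name regardless of alias type, and ignores line continuations. The function name duplicate_aliases and the alias names in the usage example are hypothetical. The authoritative syntax check remains visudo -c.

```python
import re
from collections import defaultdict

# Matches lines such as "User_Alias NAME = ..." (simplified).
ALIAS_RE = re.compile(
    r"^\s*(User_Alias|Runas_Alias|Host_Alias|Cmnd_Alias)\s+([A-Z][A-Z0-9_]*)\s*="
)


def duplicate_aliases(sudoers_text):
    """Map each alias name defined more than once to its line numbers."""
    seen = defaultdict(list)
    for lineno, line in enumerate(sudoers_text.splitlines(), start=1):
        match = ALIAS_RE.match(line)
        if match:
            seen[match.group(2)].append(lineno)
    return {name: lines for name, lines in seen.items() if len(lines) > 1}
```

For example, duplicate_aliases(open("/etc/sudoers").read()) returns an empty dictionary for a file without repeated aliases, and otherwise a mapping such as {"GLEXEC_DTEAM": [12, 40]}.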

0.7 Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted

This problem can happen when submitting to CREAM through the WMS. It means that the CE was blacklisted by that WMS (in particular by its ICE component) because connections from that WMS to this CE timed out 3 times (default timeout: 60 seconds). ICE blacklists a CREAM CE for 30 minutes. During this period, submissions to that CREAM CE by that WMS/ICE are not attempted and fail with this error message.
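The timing described above can be pictured with a toy model. The Python sketch below is not ICE code: the class and method names are hypothetical, and it assumes that the timeout counter is reset once the endpoint is blacklisted. Only the two numbers (3 timeouts, 30 minutes) come from the description above.

```python
TIMEOUT_LIMIT = 3            # timeouts before blacklisting (from the text)
BLACKLIST_SECONDS = 30 * 60  # blacklist duration in seconds (from the text)


class EndpointBlacklist:
    """Toy model of the blacklisting behaviour described above."""

    def __init__(self):
        self.timeouts = {}           # endpoint -> timeouts counted so far
        self.blacklisted_until = {}  # endpoint -> time when blacklisting ends

    def record_timeout(self, endpoint, now):
        """Count one connection timeout; blacklist on the third one."""
        self.timeouts[endpoint] = self.timeouts.get(endpoint, 0) + 1
        if self.timeouts[endpoint] >= TIMEOUT_LIMIT:
            self.blacklisted_until[endpoint] = now + BLACKLIST_SECONDS
            self.timeouts[endpoint] = 0  # assumption: counter starts over

    def is_blacklisted(self, endpoint, now):
        """Return True while submissions to the endpoint would be refused."""
        return now < self.blacklisted_until.get(endpoint, 0)
```

In this model a CE that timed out for the third time at second 20 is refused until second 1820, after which submissions are attempted again.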

0.8 Cannot create the job's working directory! failure reason = "sudo: no tty present and no askpass program specified"

This happens when there is a problem with the sudoers file, which is created by yaim.

0.9 User ABC not authorized for operation XYZ

 

1 Other problems

Added:

0.1 Job failure with reason=999

This can happen with the new BLAH BLparser when it is not able to detect the status of the job for more than x seconds. The default value for x is 86400 (24 hours). This value can be modified by setting the attribute alldone_interval in /etc/blah.config, e.g.:

alldone_interval=100000

It is then necessary to restart the BLAH blparser:

/etc/init.d/glite-ce-blahparser restart 
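To see which value is currently in effect, the configuration file can be inspected with a few lines of Python. This is a minimal sketch, assuming that /etc/blah.config uses plain key=value lines with # comments and that the last assignment wins. The function name alldone_interval is hypothetical, and the default is the one quoted above.

```python
DEFAULT_ALLDONE_INTERVAL = 86400  # default stated above, in seconds


def alldone_interval(config_text):
    """Return alldone_interval from blah.config-style text, or the default."""
    value = DEFAULT_ALLDONE_INTERVAL
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "alldone_interval":
            # Tolerate optional quotes around the number.
            value = int(raw.strip().strip("\"'"))
    return value
```

For example, alldone_interval(open("/etc/blah.config").read()) returns 100000 after the change shown above, and 86400 if the attribute is not set.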
 

0.1 Jobs are successfully submitted to Torque, but they stay in W status

Torque puts a job in 'W' status when it cannot stage the job's input files. Check whether the grid accounts on your WNs can "scp" to the CE and/or the Torque server host without being asked for a password.
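The scp check can be made non-interactive, so that it fails instead of prompting. The Python sketch below is one way to do it, assuming an OpenSSH scp that accepts -o options. The function names, the test file and the host name in the usage note are hypothetical. Run it on a WN as one of the grid pool accounts.

```python
import subprocess


def scp_check_command(local_file, remote_host, remote_path="/tmp/"):
    """Build an scp command that fails instead of asking for a password."""
    return [
        "scp",
        "-o", "BatchMode=yes",      # never prompt for password or passphrase
        "-o", "ConnectTimeout=10",  # do not hang on an unreachable host
        local_file,
        f"{remote_host}:{remote_path}",
    ]


def can_scp_without_password(local_file, remote_host, remote_path="/tmp/"):
    """Return True if the copy succeeds without any interaction."""
    cmd = scp_check_command(local_file, remote_host, remote_path)
    return subprocess.run(cmd, capture_output=True).returncode == 0
```

For example, can_scp_without_password("/etc/hostname", "ce.example.org") returning False for a pool account points to missing host-based or key-based authentication between the WN and the CE.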

 
Copyright © 2008-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.