Troubleshooting guide for CREAM
1 Checks to be done after installation and configuration
1.1 Check via browser
Open the following URL in a browser in which a valid certificate is installed:
- For CREAM EMI1 (and previous versions):
https://hostname-of-cream-ce:8443/ce-cream/services
- For CREAM EMI2 (using axis2):
https://hostname-of-cream-ce:8443/ce-cream/services/listServices
A page with a link to the CREAM WSDL should be shown.
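On a host without a graphical browser, the same check can be approximated from the command line. This is only a minimal sketch and not part of CREAM: the function names are hypothetical, and it assumes that curl is installed, that the user certificate and key are in ~/.globus, and that the CA certificates are in /etc/grid-security/certificates (each can be overridden through the usual X509_* environment variables).

```shell
#!/bin/sh
# Build the services URL for a CREAM CE.
# $1: CE hostname; $2: "emi2" for the axis2 based release,
# anything else for EMI1 and previous versions.
# Hypothetical helper, not a CREAM tool.
cream_services_url() {
    case "$2" in
        emi2) echo "https://$1:8443/ce-cream/services/listServices" ;;
        *)    echo "https://$1:8443/ce-cream/services" ;;
    esac
}

# Fetch the page with curl and print the HTTP status code.
# The default certificate, key and CA paths are assumptions; adjust them.
check_cream_services() {
    curl --silent --output /dev/null --write-out '%{http_code}\n' \
        --cert "${X509_USER_CERT:-$HOME/.globus/usercert.pem}" \
        --key "${X509_USER_KEY:-$HOME/.globus/userkey.pem}" \
        --capath "${X509_CERT_DIR:-/etc/grid-security/certificates}" \
        "$(cream_services_url "$1" "$2")"
}
```

For example, check_cream_services hostname-of-cream-ce emi2 would be expected to print 200 if the service page is reachable. This only tells you that the page is served; it does not replace looking at the page content.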
1.2 Check the CREAM log file
Check in the CREAM log file ( /var/log/cream/glite-ce-cream.log ) for the following strings:
CREAM started!
(BLParserClient) Connection with BLParser (xxx) correctly established
(Replace xxx with the name of your batch system: lsf or pbs.)
If these strings are not there, CREAM has not started properly.
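The log check can be scripted. The following is a minimal sketch that searches the log file for the two messages quoted above; check_cream_log is a hypothetical helper name, and the patterns assume the messages appear exactly as shown.

```shell
#!/bin/sh
# Report whether the two startup messages appear in the CREAM log.
# $1: log file (defaults to the path given above).
# check_cream_log is a hypothetical helper name.
check_cream_log() {
    log_file="${1:-/var/log/cream/glite-ce-cream.log}"
    rc=0
    if grep -q 'CREAM started!' "$log_file"; then
        echo "found: CREAM started!"
    else
        echo "missing: CREAM started!"
        rc=1
    fi
    # In a basic regular expression the parentheses are literal;
    # ".*" stands for the batch system name (lsf, pbs, ...).
    if grep -q 'Connection with BLParser (.*) correctly established' "$log_file"; then
        echo "found: BLParser connection established"
    else
        echo "missing: BLParser connection established"
        rc=1
    fi
    return "$rc"
}
```

Calling check_cream_log without arguments uses the default log path. Keep in mind that the log may still contain these messages from an earlier start, so compare the timestamps with the time of the last restart.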
1.3 Test glexec
- Log on to the CREAM CE
- Become the tomcat user:
su - tomcat
- Take a user proxy (e.g. /tmp/user.proxy) of a user authorized to use that CREAM CE. This proxy file must be owned by tomcat.tomcat
- Issue the following:
export GLEXEC_MODE="lcmaps_get_account"
export GLEXEC_CLIENT_CERT=/tmp/user.proxy
/usr/sbin/glexec /usr/bin/id
This should return the id of the local user mapped to that Grid user.
Please note that this test makes sense only when the CREAM CE is configured to NOT use Argus. When the CREAM CE is instead configured to use Argus, glexec is not used at all in the CREAM CE node.
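The glexec test above can be wrapped in a small function that first checks the proxy file. This is a sketch only: test_glexec_mapping and the GLEXEC_BIN override are hypothetical names introduced here, and the ownership check covers the owning user but not the group.

```shell
#!/bin/sh
# Run the glexec account-mapping test for a given proxy file.
# $1: proxy file (e.g. /tmp/user.proxy). Meant to be run as tomcat.
# GLEXEC_BIN is a hypothetical override so that the path can be changed;
# it defaults to the path given above.
test_glexec_mapping() {
    proxy="$1"
    glexec_bin="${GLEXEC_BIN:-/usr/sbin/glexec}"
    if [ ! -r "$proxy" ]; then
        echo "proxy file $proxy is missing or not readable" >&2
        return 2
    fi
    # -O is true when the file is owned by the effective user,
    # i.e. by tomcat when this is run as tomcat.
    if [ ! -O "$proxy" ]; then
        echo "proxy file $proxy is not owned by $(id -un)" >&2
        return 2
    fi
    # Set the two variables for this one command only.
    GLEXEC_MODE="lcmaps_get_account" GLEXEC_CLIENT_CERT="$proxy" \
        "$glexec_bin" /usr/bin/id
}
```

Run as tomcat, test_glexec_mapping /tmp/user.proxy should print the id of the mapped local user, as in the manual test.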
1.4 Test gridftp
Try a gsiftp transfer (e.g. using globus-url-copy or uberftp) towards that CREAM CE. E.g.:
uberftp <hostname-of-cream-ce> "ls /etc"
1.5 Check if submissions are enabled
Try the following command from a UI:
glite-ce-allowed-submission <hostname-of-cream-ce>:8443
It should return:
Job Submission to this CREAM CE is enabled
1.6 Try a direct submission
Try a submission to that CE using the glite-ce-job-submit command, e.g.:
$ /bin/cat test.jdl
[
executable="/bin/sleep";
arguments="1";
]
$ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl
https://alice16.spbu.ru:8443/CREAM336256203
Check the status of that job, which eventually should be DONE-OK:
$ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM336256203
****** JobID=[https://alice16.spbu.ru:8443/CREAM336256203]
Status = [DONE-OK]
ExitCode = [0]
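Since the job reaches DONE-OK only after some time, the status check usually has to be repeated. The sketch below parses the Status line of the output shown above and polls until a final state is reached. The helper names, the number of attempts and the sleep interval are hypothetical, and the set of final states used (DONE-OK, DONE-FAILED, ABORTED, CANCELLED) is an assumption to adapt if needed.

```shell
#!/bin/sh
# Extract the value of the "Status = [...]" line from
# glite-ce-job-status output read on standard input.
extract_job_status() {
    sed -n 's/.*Status *= *\[\([^]]*\)].*/\1/p' | head -n 1
}

# Poll a job until it reaches a final state or the attempts run out.
# $1: job ID; $2: number of attempts (default 30);
# $3: seconds between attempts (default 10).
# Returns 0 for DONE-OK, 1 for another final state, 2 on timeout.
wait_for_job() {
    attempts="${2:-30}"
    job_state=""
    while [ "$attempts" -gt 0 ]; do
        job_state=$(glite-ce-job-status "$1" | extract_job_status)
        case "$job_state" in
            DONE-OK) echo "$job_state"; return 0 ;;
            DONE-FAILED|ABORTED|CANCELLED) echo "$job_state"; return 1 ;;
        esac
        attempts=$((attempts - 1))
        sleep "${3:-10}"
    done
    echo "timeout (last status: $job_state)"
    return 2
}
```

For example, wait_for_job https://alice16.spbu.ru:8443/CREAM336256203 would print the final status once the job has finished.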
1.7 Try a job cancellation
Try a submission to that CE using the glite-ce-job-submit command, and then try to cancel it (using the glite-ce-job-cancel command).
$ /bin/cat test.jdl
[
executable="/bin/sleep";
arguments="1000";
]
$ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl
https://alice16.spbu.ru:8443/CREAM510970530
$ glite-ce-job-cancel https://alice16.spbu.ru:8443/CREAM510970530
Check the status of that job, which eventually should be CANCELLED:
$ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM510970530
****** JobID=[https://alice16.spbu.ru:8443/CREAM510970530]
Status = [CANCELLED]
ExitCode = []
Description = [Cancelled by user]
1.8 Try a submission through the WMS
Try a submission to that CE through the WMS, i.e. using the glite-wms-job-submit command.
2 Log files
If problems occur, check the log files first. See http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Logfile_locations_and_management for the relevant information.
3 Error messages
3.1 Batch system xxx not supported
Example:
$ glite-ce-job-submit -a -r cert-26.pd.infn.it:8443/cream-pbs-cream oo.jdl
2008-01-15 13:46:18,167 FATAL - MethodName=[jobRegister] Timestamp=[Tue 15
Jan 2008 13:46:18] ErrorCode=[0] Description=[system error]
FaultCause=[Batch System pbs not supported!]
This means that either:
- the batch system specified in the CREAM CE id is not supported by that CREAM CE (this can also be caused by a wrong setting of the JOB_MANAGER variable, e.g. lcgpbs instead of pbs), or
- the BLParser was not running when CREAM was started. In this case an error message describing the problem is printed in the CREAM log file (/var/log/cream/glite-ce-cream.log*). This message is something like:
org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection: getting info about BLParser (xxx) from BLAH (retry count=yy/zz)
...
org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection error: cannot get BLParser (lsf) HOST:PORT information from BLAH. Please, be sure that BLAH is properly configured and RESTART the CREAM service.
Here "batch system" means the value specified as JOB_MANAGER in siteinfo.def. Please note that the valid values are lsf, pbs and condor (and not lcgpbs, lcglsf, lcgcondor).
This value is reported in the /etc/blah.config file (attribute supported_lrms).
Suppose that you have lsf as this value.
As user tomcat from the CE node try the following:
$ /usr/bin/blahpd
$GahpVersion: 1.14.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
BLAH_GET_HOSTPORT lsf
This should return:
S
Then type:
results
It should return something like
lsf 0 lsf/cream-35.pd.infn.it:56565
If this is the case (i.e. you get a value such as this one), your problem is likely the second one (i.e. the blparser was not running when tomcat was started).
For LSF, also check that the path of lsf.profile in /etc/blah.config is correct.
3.2 The job cannot be submitted because the blparser service is not alive
Check whether the BLAH blparser is running. If it should be running but is not, check its log file(s) (/var/log/cream/glite-xxxparser.log for the old parser, /var/log/cream/glite-ce-bnotifier.log and /var/log/cream/glite-ce-bupdater.log for the new one) for relevant messages.
If needed, (re)start it (/etc/init.d/glite-ce-blparser restart for the old one, /etc/init.d/glite-ce-blahparser restart for the new one) and then restart tomcat (service tomcat5 restart).
If your blparser node serves multiple CREAM CEs, please make sure you have followed the instructions reported at http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_4_5_1_Configuration_of_the_old
3.3 Delegation error: the proxy delegationID "xx" is not more valid!
Example:
$ glite-ce-job-submit -D de2 -r cream-02.pd.infn.it:8443/cream-lsf-cream prren1.jdl
2008-01-28 11:27:04,859 FATAL - MethodName=[jobRegister] Timestamp=[Mon 28
Jan 2008 11:27:04] ErrorCode=[0] Description=[delegation error: the proxy delegationID "de2" is not more valid!;] FaultCause=[delegation proxy expired!]
This means that the delegationID used was found on the target CREAM CE, but it is no longer valid (i.e. the proxy has expired).
3.4 job id list file error: File [xxx] is not a CREAM job list file
Example:
$ glite-ce-job-status -i job_ids
2008-04-24 09:59:54,963 FATAL - File [job_ids] is not a CREAM job list file. Stop.
This means that the file passed on the command line (job_ids in this example) does not have the right format.
Here's an example of a file in the right format:
$ cat job_ids
##CREAMJOBS##
https://devel03.cnaf.infn.it:8443/CREAM683051516
https://devel03.cnaf.infn.it:8443/CREAM481684356
https://devel03.cnaf.infn.it:8443/CREAM333841302
https://devel03.cnaf.infn.it:8443/CREAM279829555
https://devel03.cnaf.infn.it:8443/CREAM334653961
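A file in this format can be checked or produced with a few lines of shell. The sketch below assumes, based only on the example above, that the file consists of a ##CREAMJOBS## header on the first line followed by one job ID per line; the function names are hypothetical and the real client may apply stricter checks.

```shell
#!/bin/sh
# Return 0 if the file starts with the ##CREAMJOBS## header line.
# is_cream_job_list is a hypothetical helper, not a CREAM command.
is_cream_job_list() {
    [ -r "$1" ] && [ "$(head -n 1 "$1")" = "##CREAMJOBS##" ]
}

# Write a job list file from job IDs given as arguments.
# $1: output file; remaining arguments: job IDs.
write_cream_job_list() {
    out="$1"
    shift
    {
        echo "##CREAMJOBS##"
        for id in "$@"; do
            echo "$id"
        done
    } > "$out"
}
```

For example, is_cream_job_list job_ids || echo "header missing" gives a quick hint before the file is passed to glite-ce-job-status -i.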
3.5 bad UID for job execution
If your job fails with an error such as:
****** JobID=[https://ppsce03.pic.es:8443/CREAM880596078]
Status = [ABORTED]
ExitCode = []
FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Bad UID for job execution MSG=ruserok failed
validating dteam017/dteam017 from ppsce03.pic.es-) N/A (jobId = CREAM880596078)]
note that the Torque server node must list the submission hosts (the CREAM CEs) in its /etc/hosts.equiv. Alternatively, in recent versions of Torque, consider the acl_hosts and acl_host_enable attributes in the Torque server configuration (see http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml#open).
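Whether a CREAM CE is listed in /etc/hosts.equiv can be checked on the Torque server with a short function. This is a sketch only: host_in_hosts_equiv is a hypothetical name, and it only recognizes plain entries where the host name is the first field of a line; it does not interpret the + and - prefixes or netgroup entries that hosts.equiv also allows.

```shell
#!/bin/sh
# Return 0 if a host appears as the first field of a line in a
# hosts.equiv-style file.
# $1: host name; $2: file (default /etc/hosts.equiv).
# host_in_hosts_equiv is a hypothetical helper name.
host_in_hosts_equiv() {
    awk -v h="$1" '$1 == h { found = 1 } END { exit !found }' \
        "${2:-/etc/hosts.equiv}"
}
```

For example, host_in_hosts_equiv ppsce03.pic.es || echo "CE not listed" run on the Torque server. The name has to be written the same way as in the file for the comparison to match.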
3.6 org.glite.security.delegation.storage.GrDPStorageException
Example:
2009-09-10 15:48:16,082 ERROR - Received NULL fault; the error is due to
another cause:
FaultString=[org.glite.security.delegation.storage.GrDPStorageException:
Configuration error: delegation_factorynull] -
FaultCode=[SOAP-ENV:Server.userException] -
FaultSubCode=[SOAP-ENV:Server.userException] -
FaultDetail=[<ns1:hostname>vtb-generic-64.cern.ch</ns1:hostname>]
This is likely a problem with the MySQL database (e.g. MySQL not accessible, problems with grants, etc.).
3.7 sudo: 0 incorrect password attempts
This means that there is an error in the sudoers file created by yaim-cream-ce (e.g. the mapped user is not "enabled"). This is very likely a configuration problem.
3.8 Authorization error: Failed to get the local user id via glexec
This usually means that an error occurred while running glexec to get the local user id. Check the glexec log files (syslog or the log files defined in /etc/glexec.conf). You might also need to increase the glexec/lcas/lcmaps debug levels in /etc/glexec.conf (setting them to 5).
3.9 Cannot find grid-proxy-info
This means that the job wrapper running on the WN could not find the grid-proxy-info executable. This could be due to several reasons. The most common ones are:
- the grid-proxy-info executable is not installed on the WN
- the grid-proxy-info executable is not found on the WN (e.g. because it is not in the path of the local account executing the job)
- the which executable is not installed on the WN
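The causes listed above can be checked directly on the WN, logged in as the local account to which the job is mapped. The following is a minimal sketch; check_commands is a hypothetical helper name. It relies on the shell builtin command -v rather than on which, so it still works when which itself is the missing executable.

```shell
#!/bin/sh
# Report which of the given commands cannot be found in PATH.
# Returns 0 only if all of them are found.
# check_commands is a hypothetical helper name.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "found: $cmd ($(command -v "$cmd"))"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_commands which grid-proxy-info prints one line per command. The PATH of an interactive login may differ from the one the job wrapper sees, so a "found" result here is a hint rather than a proof.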
3.10 Problem to detect the lifetime of the proxy
This means that the job wrapper running on the WN could not detect the lifetime of the proxy (using the grid-proxy-info command). This could be due to several reasons. The most common ones are:
- the proxy for some reason was not staged on the WN
- the grid-proxy-info executable was not found (or it was not in the path) on the WN
- the which executable is not installed on the WN