Difference: TroubleshootingGuide (2 vs. 3)

Revision 3 2011-04-28 - MassimoSgaravatto

Line: 1 to 1
 
META TOPICPARENT name="SystemAdministratorDocumentation"

Troubleshooting guide for CREAM

Line: 196 to 196
 For LSF, check also in the /etc/blah.config if the path of lsf.profile is correct.
Added:
>
>

0.1 The job cannot be submitted because the blparser service is not alive

 
Added:
>
>
Check whether the BLAH blparser is running. If it should be running but is not, check its log files for errors: /var/log/cream/glite-xxxparser.log for the old parser; /var/log/cream/glite-ce-bnotifier.log and /var/log/cream/glite-ce-bupdater.log for the new one.

If needed, (re)start it (/etc/init.d/glite-ce-blparser restart for the old parser, /etc/init.d/glite-ce-blahparser restart for the new one) and then restart Tomcat (service tomcat5 restart).

If your blparser node serves multiple CREAM CEs, make sure you have followed the instructions at http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_4_5_1_Configuration_of_the_old
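
The log check described above can be scripted. The following is a minimal sketch and not part of CREAM: the function name blparser_logs is hypothetical, the log paths are the ones listed above, and the glob glite-*parser.log is an assumption standing in for the glite-xxxparser.log placeholder.

```shell
# Print the last lines of every blparser log file found in a directory.
# Usage: blparser_logs [LOGDIR] [LINES]
blparser_logs() {
    logdir="${1:-/var/log/cream}"
    lines="${2:-50}"
    for f in "$logdir"/glite-*parser.log \
             "$logdir"/glite-ce-bnotifier.log \
             "$logdir"/glite-ce-bupdater.log; do
        # Skip names that do not exist (including an unmatched glob).
        [ -f "$f" ] || continue
        echo "== $f"
        tail -n "$lines" "$f"
    done
}
```

Calling blparser_logs with no arguments reads /var/log/cream and shows the last 50 lines of each log that is present, which is usually enough to see why the parser stopped before restarting it.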

0.1 Delegation error: the proxy delegationID "xx" is not more valid!

Example:

$ glite-ce-job-submit -D de2 -r cream-02.pd.infn.it:8443/cream-lsf-cream prren1.jdl
2008-01-28 11:27:04,859 FATAL - MethodName=[jobRegister] Timestamp=[Mon 28 Jan 2008 11:27:04] ErrorCode=[0] Description=[delegation error: the proxy delegationID "de2" is not more valid!;] FaultCause=[delegation proxy expired!]

This means that the delegationID used was found on the target CREAM CE, but it is no longer valid (i.e. the delegated proxy has expired).
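
When submissions are scripted, this case can be detected from the command output. The sketch below only matches the FaultCause text shown in the example above; the function name delegation_expired is hypothetical, and whether the FATAL line is written to stdout or stderr is an assumption, so the usage comment captures both.

```shell
# Succeed if the text on stdin reports an expired delegated proxy.
# The pattern is the FaultCause field from the example above.
delegation_expired() {
    grep -qF 'FaultCause=[delegation proxy expired!]'
}

# Possible usage (same command as in the example above):
#   glite-ce-job-submit -D de2 -r cream-02.pd.infn.it:8443/cream-lsf-cream prren1.jdl 2>&1 \
#       | delegation_expired && echo "delegation de2 has expired"
```

If the check succeeds, the job has to be submitted with a delegation that is still valid, for instance a newly created one.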

0.2 job id list file error: File [xxx] is not a CREAM job list file

Example:

$ glite-ce-job-status -i job_ids
2008-04-24 09:59:54,963 FATAL - File [job_ids] is not a CREAM job list file. Stop.

This means that the file passed on the command line (job_ids in this example) does not have the right format.

Here's an example of a file in the right format:

$ cat job_ids
##CREAMJOBS##
https://devel03.cnaf.infn.it:8443/CREAM683051516
https://devel03.cnaf.infn.it:8443/CREAM481684356
https://devel03.cnaf.infn.it:8443/CREAM333841302
https://devel03.cnaf.infn.it:8443/CREAM279829555
https://devel03.cnaf.infn.it:8443/CREAM334653961
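
A file can be checked, or built, before it is passed to the CLI. This is a minimal sketch under one assumption inferred only from the example above: that a valid file starts with the line ##CREAMJOBS## followed by one job ID per line. The function names are hypothetical.

```shell
# Write a job list to stdout: the header line, then one job ID per line.
make_cream_job_list() {
    echo '##CREAMJOBS##'
    for id in "$@"; do
        echo "$id"
    done
}

# Succeed if the first line of FILE is the ##CREAMJOBS## header.
is_cream_job_list() {
    [ -r "$1" ] || return 1
    [ "$(head -n 1 "$1")" = '##CREAMJOBS##' ]
}
```

For example, make_cream_job_list https://devel03.cnaf.infn.it:8443/CREAM683051516 > job_ids produces a two-line file that is_cream_job_list job_ids accepts.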

0.3 bad UID for job execution

If your job fails with an error such as:

 ******  JobID=[https://ppsce03.pic.es:8443/CREAM880596078]
     Status        = [ABORTED]
     ExitCode      = []
     FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Bad UID for job execution MSG=ruserok failed validating dteam017/dteam017 from ppsce03.pic.es-) N/A (jobId = CREAM880596078)]

note that the Torque server node must list the submission hosts (the CREAM CEs) in its /etc/hosts.equiv. Alternatively, in recent versions of Torque, consider the acl_hosts and acl_host_enable attributes in the Torque server configuration (see http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml#open).
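
The hosts.equiv check can be sketched as follows, to be run on the Torque server node. The function name host_in_equiv is hypothetical, and the match is deliberately simple: it only finds entries that consist of the bare host name on a line of its own.

```shell
# Succeed if HOST appears as a whole line in the hosts.equiv file.
# Simplification: entries with a user field or +/- prefixes are not handled.
host_in_equiv() {
    host="$1"
    file="${2:-/etc/hosts.equiv}"
    [ -r "$file" ] && grep -qxF "$host" "$file"
}

# Possible usage, with the CE from the example above:
#   host_in_equiv ppsce03.pic.es || echo "ppsce03.pic.es is missing from /etc/hosts.equiv"
# The current server ACL settings can be inspected (read-only) with:
#   qmgr -c 'print server' | grep acl_host
```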

1 Other problems

1.1 Jobs are successfully submitted to Torque, but they stay in W status

Torque puts a job in 'W' status when it cannot stage the job's input files. Check whether the grid accounts on your WNs can use scp without a password to the CE and/or the Torque server host.
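
One way to test this non-interactively is sketched below. It probes SSH rather than scp itself, on the assumption that scp uses the same SSH authentication; BatchMode=yes makes ssh fail instead of prompting for a password. The function name and the SSH_CMD override (present only so the function can be exercised without a network) are hypothetical.

```shell
# Succeed if the current user can open an SSH session to HOST without
# being prompted for a password.
can_login_without_password() {
    ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Possible usage, run on a WN as one of the grid accounts:
#   can_login_without_password ppsce03.pic.es || echo "password-less access is not working"
```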

1.2 Jobs submitted to LSF fail with error code 127

This is likely a problem with the staging of files between the CE node and the WN. Check whether the relevant LSF daemons are running properly.

 

1 Other troubleshooting hints

 