Troubleshooting guide for CREAM
1 Checks to be done after installation and configuration
1.1 Check via browser
Point a browser in which a valid certificate is installed to
https://hostname-of-cream-ce:8443/ce-cream/services
A page with a link to the CREAM WSDL should be shown.
1.2 Check the CREAM log file
Check the CREAM log file (/var/log/cream/glite-ce-cream.log) for the following strings:
CREAM started!
(BLParserClient) Connection with BLParser (xxx) correctly established
(xxx is the name of your batch system: lsf or pbs)
If they are not there, CREAM has not started properly.
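The log check above can be scripted. The following is a minimal sketch, not part of CREAM: check_cream_log is a hypothetical helper name, and it only assumes that the two strings appear in the log as quoted above.

```shell
# Sketch: report whether the two start-up messages are present in the log.
# check_cream_log is a hypothetical helper, not a CREAM command.
check_cream_log() {
    log="${1:-/var/log/cream/glite-ce-cream.log}"
    rc=0
    # First string, matched literally.
    if ! grep -Fq 'CREAM started!' "$log"; then
        echo "missing: CREAM started!"
        rc=1
    fi
    # Second string; the batch system name in parentheses can be anything.
    if ! grep -Eq 'Connection with BLParser \(.*\) correctly established' "$log"; then
        echo "missing: BLParser connection message"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "CREAM start-up messages found"
    fi
    return "$rc"
}
# Example: check_cream_log /var/log/cream/glite-ce-cream.log
```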
1.3 Test glexec
- Log on to the CREAM CE
- Switch to the tomcat user:
su - tomcat
- Take a user proxy (e.g. /tmp/user.proxy) of a user authorized to use that CREAM CE. This proxy file must be owned by tomcat.tomcat (user and group)
- Issue the following:
export GLEXEC_MODE="lcmaps_get_account"
export GLEXEC_CLIENT_CERT=/tmp/user.proxy
/usr/sbin/glexec /usr/bin/id
This should return the id of the local user to which that Grid user is mapped.
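If glexec fails, one thing worth ruling out is the ownership of the proxy file. The sketch below is only an illustration: proxy_owned_by is a hypothetical helper, and it relies on the standard -user and -group tests of find.

```shell
# Sketch: succeed only if the file is owned by the given user and group.
# proxy_owned_by is a hypothetical helper; names or numeric IDs are accepted.
proxy_owned_by() {
    file="$1"
    user="$2"
    group="$3"
    # find prints the path only when both ownership tests match.
    [ -n "$(find "$file" -user "$user" -group "$group" -print 2>/dev/null)" ]
}
# Example (on the CREAM CE):
#   proxy_owned_by /tmp/user.proxy tomcat tomcat || echo "proxy is not owned by tomcat.tomcat"
```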
1.4 Test gridftp
Try a gsiftp transfer (e.g. using globus-url-copy or uberftp) towards that CREAM CE.
E.g.:
uberftp hostname-of-cream-ce "ls /etc"
1.5 Check if submissions are enabled
Try the following command from a UI:
glite-ce-allowed-submission hostname-of-cream-ce:8443
It should return:
Job Submission to this CREAM CE is enabled
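For use in a monitoring script, the message can be turned into an exit status. This is a minimal sketch: submission_enabled is a hypothetical helper that only looks for the message quoted above and makes no assumption about what is printed otherwise.

```shell
# Sketch: read the command output on stdin and succeed if the
# "enabled" message quoted above is present.
# submission_enabled is a hypothetical helper, not a CREAM command.
submission_enabled() {
    grep -Fq 'Job Submission to this CREAM CE is enabled'
}
# Example (from a UI):
#   glite-ce-allowed-submission hostname-of-cream-ce:8443 | submission_enabled \
#       || echo "submissions are not enabled on this CE"
```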
1.6 Try a direct submission
Try a submission to that CE using the glite-ce-job-submit command, e.g.:
$ /bin/cat test.jdl
[
executable="/bin/sleep";
arguments="1";
]
$ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl
https://alice16.spbu.ru:8443/CREAM336256203
Check the status of that job, which eventually should be DONE-OK:
$ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM336256203
****** JobID=[https://alice16.spbu.ru:8443/CREAM336256203]
Status = [DONE-OK]
ExitCode = [0]
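Since the job only reaches DONE-OK after some time, the status check usually has to be repeated. The sketch below polls glite-ce-job-status and parses the Status line in the format shown above. extract_status and wait_for_job are hypothetical helpers, and the list of final states, the 10-second interval and the default of 60 attempts are assumptions.

```shell
# Sketch: extract the state from glite-ce-job-status output read on stdin.
# Prints e.g. DONE-OK for a line such as "Status = [DONE-OK]".
extract_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\([^]]*\)].*$/\1/p' | head -n 1
}

# Sketch: poll a job until it reaches a final state.
# Returns 0 on DONE-OK, 1 on another final state, 2 if attempts run out.
wait_for_job() {
    job_id="$1"
    tries="${2:-60}"          # assumed default number of attempts
    while [ "$tries" -gt 0 ]; do
        state=$(glite-ce-job-status "$job_id" | extract_status)
        echo "$job_id: ${state:-unknown}"
        case "$state" in
            DONE-OK) return 0 ;;
            DONE-FAILED|CANCELLED|ABORTED) return 1 ;;   # assumed final states
        esac
        tries=$((tries - 1))
        sleep 10              # assumed polling interval
    done
    return 2
}
# Example: wait_for_job https://alice16.spbu.ru:8443/CREAM336256203
```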
1.7 Try a job cancellation
Submit a job to that CE using the glite-ce-job-submit command, and then try to cancel it using the glite-ce-job-cancel command.
$ /bin/cat test.jdl
[
executable="/bin/sleep";
arguments="1000";
]
$ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl
https://alice16.spbu.ru:8443/CREAM510970530
$ glite-ce-job-cancel https://alice16.spbu.ru:8443/CREAM510970530
Check the status of that job, which eventually should be CANCELLED:
$ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM510970530
****** JobID=[https://alice16.spbu.ru:8443/CREAM510970530]
Status = [CANCELLED]
ExitCode = []
Description = [Cancelled by user]
1.8 Try a submission through the WMS
Try a submission to that CE through the WMS, i.e. using the glite-wms-job-submit command.
2 Log files
In case of problems, check the log files first. For their locations and management see
http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Logfile_locations_and_management
3 Error messages
3.1 Batch system xxx not supported
Example:
$ glite-ce-job-submit -a -r cert-26.pd.infn.it:8443/cream-pbs-cream oo.jdl
2008-01-15 13:46:18,167 FATAL - MethodName=[jobRegister] Timestamp=[Tue 15
Jan 2008 13:46:18] ErrorCode=[0] Description=[system error]
FaultCause=[Batch System pbs not supported!]
This means that:
- the batch system specified in the CREAM CE id is not supported by that CREAM CE (this can also be caused by a wrong setting of the JOB_MANAGER variable, e.g. lcgpbs instead of pbs), or
- the BLParser was not running when CREAM was started. In this case an error message describing the problem is printed in the CREAM log file (/var/log/cream/glite-ce-cream.log*). This message is something like:
org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection: getting info about BLParser (xxx) from BLAH (retry count=yy/zz)
...
org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection error: cannot get BLParser (lsf) HOST:PORT information from BLAH. Please, be sure that BLAH is properly configured and RESTART the CREAM service.
The batch system is the value specified as JOB_MANAGER in siteinfo.def. Please note that the valid values are lsf, pbs and condor (and not lcgpbs, lcglsf, lcgcondor).
This value is reported in the /etc/blah.config file (attribute supported_lrms).
Suppose that you have lsf as this value.
As user tomcat from the CE node try the following:
$ /usr/bin/blahpd
$GahpVersion: 1.14.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
BLAH_GET_HOSTPORT lsf
This should return:
S
Then type:
results
It should return something like
lsf 0 lsf/cream-35.pd.infn.it:56565
If this is the case (i.e. you get a value such as this one), your problem is likely the second one (i.e. the BLParser was not running when tomcat was started).
For LSF, also check that the path of lsf.profile in /etc/blah.config is correct.
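The supported_lrms value discussed in this section can be inspected with a short script. This is a sketch under assumptions: check_supported_lrms is a hypothetical helper, and it assumes a key=value line in /etc/blah.config whose value may be quoted and may list several comma-separated names. The set of accepted names is the one given above (lsf, pbs, condor).

```shell
# Sketch: print each supported_lrms entry and flag unexpected names.
# check_supported_lrms is a hypothetical helper; key=value syntax is assumed.
check_supported_lrms() {
    conf="${1:-/etc/blah.config}"
    # Take the last supported_lrms assignment and strip any quotes.
    value=$(sed -n 's/^[[:space:]]*supported_lrms[[:space:]]*=[[:space:]]*//p' "$conf" \
        | tail -n 1 | tr -d "\"'")
    if [ -z "$value" ]; then
        echo "supported_lrms not set in $conf"
        return 1
    fi
    rc=0
    for lrms in $(echo "$value" | tr ',' ' '); do
        case "$lrms" in
            lsf|pbs|condor) echo "ok: $lrms" ;;
            *) echo "unexpected: $lrms"; rc=1 ;;
        esac
    done
    return "$rc"
}
# Example: check_supported_lrms /etc/blah.config
```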
4 Other troubleshooting hints
4.1 Saving the batch job submission script
If, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the file
/usr/bin/blah_common_submit_functions.sh
replacing the following lines:
# DEBUG: cp $bls_tmp_file /tmp
rm -f $bls_tmp_file
with:
cp $bls_tmp_file /tmp
# rm -f $bls_tmp_file
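The same edit can be applied with sed instead of by hand. The following is a sketch, not an official procedure: enable_blah_script_copy is a hypothetical helper, and it assumes that the two lines appear exactly as shown above, with the rm line directly after the DEBUG comment. A copy of the original file is kept with the .bak suffix.

```shell
# Sketch: uncomment the debug "cp" and comment out the following "rm".
# enable_blah_script_copy is a hypothetical helper, not part of BLAH.
enable_blah_script_copy() {
    file="${1:-/usr/bin/blah_common_submit_functions.sh}"
    # On the DEBUG line: drop the "# DEBUG: " prefix, then move to the
    # next line (n) and comment out the rm found there.
    sed -i.bak '/# DEBUG: cp \$bls_tmp_file \/tmp/{
s/# DEBUG: cp/cp/
n
s/^\([[:space:]]*\)rm -f \$bls_tmp_file/\1# rm -f $bls_tmp_file/
}' "$file"
}
# Example: enable_blah_script_copy /usr/bin/blah_common_submit_functions.sh
# To undo, restore the .bak copy created next to the file.
```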
4.2 Saving files on the worker node after job execution
For debugging purposes, it is possible to save the stdout, stderr and proxy files on the Worker Node after job execution.
This is done by setting blah_debug_save_wn_files in /etc/blah.config on the CREAM CE node to an existing directory (on the WN) in which the user running the job has write permission.
A directory called XXX.debug will be created within the specified directory.
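Both requirements on the directory (it exists on the WN, and the job user can write to it) can be verified before enabling the option. The sketch below is meant to be run on the worker node as the user that runs the job. check_debug_dir is a hypothetical helper, /var/tmp/blah-debug is a placeholder path, and the key=value form of the configuration line is an assumption.

```shell
# Sketch: succeed if the directory exists and is writable by the current user.
# check_debug_dir is a hypothetical helper, not part of BLAH.
check_debug_dir() {
    dir="$1"
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir exists and is writable"
    else
        echo "problem: $dir is missing or not writable"
        return 1
    fi
}
# Example (on the WN, as the job user):
#   check_debug_dir /var/tmp/blah-debug
# Corresponding line in /etc/blah.config on the CREAM CE (syntax assumed):
#   blah_debug_save_wn_files=/var/tmp/blah-debug
```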
--
MassimoSgaravatto - 2011-04-27