Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 295 to 295 | ||||||||
| ||||||||
Added: | ||||||||
> > |
| |||||||
0.1 Cannot create the job's working directory! [failure reason = ">>> sudoers file: Alias `XYZ' already defined, line ABC <<<"] |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 443 to 443 | ||||||||
1 Other problems | ||||||||
Added: | ||||||||
> > | 0.1 Job cancelled with description "Cancelled by CE admin"All activities about the job are tracked on its history. So, whenever the CREAM service receives from BLAH the cancel state about a job, it checks in the job's history if the user (or admin) explicitly requested its cancellation or whether the cream itself decided to cancel it because of the lease expiration or some other reason. If no one "jobCancel" command has been executed by CREAM (i.e. due to user/admin request or job lease expiration), it means that the action has been executed externally by the system administrator (e.g. bkill) or the batch system itself (e.g. batch system killed the job due to excessive memory usage). In this case, CREAM is not able to know exactly what happened and so it can just inform the user about the job cancellation with the description "Cancelled by CE admin". The LRMS logs may contain some useful detailed information for understanding who canceled the job, but this approach is batch system dependent and it needs more investigation. At the present time the hint we can provide to the user is to ask the LRMS admin to investigate the cancellation reason by using the tools provided by the LRMS itself. | |||||||
0.1 Dynamic information not published in the BDII |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 8 to 8 | ||||||||
0.1 Check via browser | ||||||||
Changed: | ||||||||
< < | Open your browser (where a valid certificate must be installed) to https://hostname-of-cream-ce:8443/ce-cream/services | |||||||
> > | Open your browser (where a valid certificate must be installed)
https://hostname-of-cream-ce:8443/ce-cream/services
https://hostname-of-cream-ce:8443/ce-cream/services/listServices | |||||||
A page with link to the CREAM WSDL should be shown |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 451 to 451 | ||||||||
0.1 High load caused by bupdater process | ||||||||
Changed: | ||||||||
< < | This can be because of a too high polling frequency when checking the status of the jobs. Follow this link for more information. | |||||||
> > | This can be because of a too high polling frequency when checking the status of the jobs. Follow this link for more information. | |||||||
0.1 Job failure with reason=999 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 334 to 334 | ||||||||
In glite-ce-cream.log the reason for the authorization problem should be explained | ||||||||
Added: | ||||||||
> > | 0.1 Authorization layer is not configuredThis error on the client side may be caused by many different errors arising during the startup of the CREAM service. This is a list of possible reasons:
| |||||||
0.1 Failed to create a delegation id for job <jobid>: reason is User <userDN> not authorized for operation <operation>See what reported here |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 386 to 386 | ||||||||
see: http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#3_3_MySQL_database_How_to_resize | ||||||||
Added: | ||||||||
> > | 0.1 InnoDB: Problem with the log file sizes.If the following error occurs (see the mysql log file: /var/log/mysqld.log):InnoDB: The log sequence number in ibdata files does not match InnoDB: the log sequence number in the ib_logfiles!or a similar one: InnoDB: Error: log file ./ib_logfile0 is of different size 0 5242880 bytes InnoDB: than specified in the .cnf file 0 67108864 bytes!then the "innodb_log_file_size" property has been changed without moving the log files ib_log* to some place out of the the directory where the log files reside. Follow these steps (see http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#3_3_MySQL_database_How_to_resize):
| |||||||
0.1 copyNewProxyToSandbox error: sudo: no tty present and no askpass program specified |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 397 to 397 | ||||||||
1 Other problems | ||||||||
Added: | ||||||||
> > | 0.1 Dynamic information not published in the BDIIIf dynamic information is not published in the resource BDII of the CREAM-CE and e.g.lcg-infosites reports 444444 as number of waiting jobs, on the CREAM CE tries to manually run the infoprovider script:
/sbin/runuser -s /bin/sh ldap -c "/var/lib/bdii/gip/plugin/glite-info-dynamic-ce"and check if some errors are returned. If the information is correctly returned, check the BDII logs if some errors are reported. | |||||||
0.1 High load caused by bupdater processThis can be because of a too high polling frequency when checking the status of the jobs. Follow this link for more information. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 431 to 431 | ||||||||
0.1 Saving the batch job submission script | ||||||||
Changed: | ||||||||
< < | If, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the file /usr/bin/blah_common_submit_functions.sh ( /usr/libexec/blah_common_submit_functions.sh starting with EMI-2, i.e. starting with BLAH v. 1.18)replacing the following lines: | |||||||
> > | If, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the files blah_common_submit_functions.sh
and xxx_submit.sh (e.g. pbs_submit.sh ) and comment all the occurrences of: | |||||||
Deleted: | ||||||||
< < | # DEBUG: cp $bls_tmp_file /tmp | |||||||
rm -f $bls_tmp_file | ||||||||
Changed: | ||||||||
< < | with: | |||||||
> > | These scripts are in /usr/bin in EMI-1, in usr/libexec starting with EMI-2 | |||||||
Deleted: | ||||||||
< < | cp $bls_tmp_file /tmp # rm -f $bls_tmp_file | |||||||
0.1 Saving files on the worker node after job execution |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 457 to 457 | ||||||||
If you want to check the queue of commands and notifications still to be processed, the mysql query command to be issued is: | ||||||||
Added: | ||||||||
> > |
| |||||||
use creamdb; select count(*), name from command group by name; | ||||||||
Added: | ||||||||
> > |
use creamdb; select count(*), name from command_queue group by name; | |||||||
If the CREAM CE is behaving properly, such queue should be always reasonable short.
0.1 Remove a delegationid |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 431 to 431 | ||||||||
0.1 Saving the batch job submission script | ||||||||
Changed: | ||||||||
< < | If, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the file /usr/bin/blah_common_submit_functions.sh replacing the following lines: | |||||||
> > | If, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the file /usr/bin/blah_common_submit_functions.sh ( /usr/libexec/blah_common_submit_functions.sh starting with EMI-2, i.e. starting with BLAH v. 1.18)replacing the following lines: | |||||||
# DEBUG: cp $bls_tmp_file /tmp |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 386 to 386 | ||||||||
see: http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#3_3_MySQL_database_How_to_resize | ||||||||
Added: | ||||||||
> > |
0.1 copyNewProxyToSandbox error: sudo: no tty present and no askpass program specifiedThis means that the sudo operation (which is used to copy the proxy to a user specific directory) failed, very likely because of an an error in the sudoers file (e.g. the relevant users are not authorized). This file is supposed to be filled by yaim. | |||||||
1 Other problems1.1 High load caused by bupdater process |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 334 to 334 | ||||||||
In glite-ce-cream.log the reason for the authorization problem should be explained | ||||||||
Added: | ||||||||
> > | 0.1 Failed to create a delegation id for job <jobid>: reason is User <userDN> not authorized for operation <operation>See what reported here | |||||||
0.1 Cannot generate the job wrapper! the problem seems to be related to the jdl: Number mismatch for maxOutputSandboxSize = -1,000000000000000E+00"This happens for submissions through the WMS to a CREAM CE deployed on a machine installed using a non-English (en-US ) language. This is because of different representations of decimal numbers. The workaround in this case is to uncomment the line: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 366 to 366 | ||||||||
0.1 InnoDB: ERROR: the age of the last checkpoint is ... | ||||||||
Changed: | ||||||||
< < | If the following error occurs | |||||||
> > | If the following error occurs (see the mysql log file: /var/log/mysqld.log) | |||||||
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 370 to 370 | ||||||||
| ||||||||
Added: | ||||||||
> > | ||||||||
InnoDB: ERROR: the age of the last checkpoint is , | ||||||||
Deleted: | ||||||||
< < | ||||||||
InnoDB: which exceeds the log group capacity . | ||||||||
Deleted: | ||||||||
< < | ||||||||
InnoDB: If you are using big BLOB or TEXT rows,you must set the | ||||||||
Deleted: | ||||||||
< < | ||||||||
InnoDB: combined size of log files at least 10 times bigger than the | ||||||||
Deleted: | ||||||||
< < | ||||||||
InnoDB: largest such row. | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | ||||||||
Changed: | ||||||||
< < | then you must resize innodb log files. | |||||||
> > | then you must resize the innodb log files. | |||||||
see: http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#3_3_MySQL_database_How_to_resize |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 364 to 364 | ||||||||
You can then restart the PEP server, but normally the logging configuration is automatically reload every 5 minutes. | ||||||||
Added: | ||||||||
> > | 0.1 InnoDB: ERROR: the age of the last checkpoint is ...If the following error occursInnoDB: ERROR: the age of the last checkpoint is , InnoDB: which exceeds the log group capacity . InnoDB: If you are using big BLOB or TEXT rows,you must set the InnoDB: combined size of log files at least 10 times bigger than the InnoDB: largest such row. then you must resize innodb log files. see: http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#3_3_MySQL_database_How_to_resize | |||||||
1 Other problems | ||||||||
Line: 394 to 413 | ||||||||
This is likely a problem with staging of files from/to the CE node to/from the WN. Check if the relevant LSF daemons run properly. | ||||||||
Deleted: | ||||||||
< < | ||||||||
0.1 Jobs submitted to Torque are stuck in W status | ||||||||
Changed: | ||||||||
< < | The 'W' status means "waiting on stagein": that should never happen for grid jobs and signals a misconfiguration of the CE or the batch system. The most likely cause is explained here | |||||||
> > | The 'W' status means "waiting on stagein": that should never happen for grid jobs and signals a misconfiguration of the CE or the batch system. The most likely cause is explained here | |||||||
1 Other troubleshooting hints |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 367 to 367 | ||||||||
1 Other problems | ||||||||
Added: | ||||||||
> > | 0.1 High load caused by bupdater processThis can be because of a too high polling frequency when checking the status of the jobs. Follow this link for more information. | |||||||
0.1 Job failure with reason=999This can happen with the new BLAH BLparser when it is not able to detect the status of the job for more thanx seconds. The default value for x is 86400. This value can be modified setting the attribute alldone_interval in /etc/blah.config , e.g.: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 344 to 344 | ||||||||
in $CATALINA_HOME/conf/tomcat5.conf and then restart tomcat | ||||||||
Added: | ||||||||
> > | 0.1 Missing property local.user.idOn a CREAM-CE using Argus this means that ARGUS was not able to provide the mapping, or CREAM was not able to process it. To check if the mapping is correctly done in Argus, look for a line like this in the Arugs PEP Server/var/log/argus/pepd/process.log :
2011-10-10 08:42:44.209Z - INFO [DFPMObligationHandler] - ACCOUNTMAP_OH: DN: CN=Valery Tschopp 9FEE5EE3,O=SWITCH,DC=slcs,DC=switch,DC=ch pFQAN: /dteam FQANs: [/dteam] mapped to POSIX account: PosixAccount{user=dteam024 group=dteam}You might need to set the debug mode on for the obligation handler, editing /etc/argus/pepd/logging.xml adding the lines:
<logger name="org.glite.authz.pep.obligation.dfpmap"> <level value="DEBUG" /> </logger>You can then restart the PEP server, but normally the logging configuration is automatically reload every 5 minutes. | |||||||
1 Other problems1.1 Job failure with reason=999 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 368 to 368 | ||||||||
This is likely a problem with staging of files from/to the CE node to/from the WN. Check if the relevant LSF daemons run properly. | ||||||||
Added: | ||||||||
> > |
0.1 Jobs submitted to Torque are stuck in W statusThe 'W' status means "waiting on stagein": that should never happen for grid jobs and signals a misconfiguration of the CE or the batch system. The most likely cause is explained here | |||||||
1 Other troubleshooting hints1.1 Saving the batch job submission script |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 407 to 407 | ||||||||
0.1 Remove a delegationid | ||||||||
Changed: | ||||||||
< < | If for some reason you need to remove a delegation from the CREAM DB, do the following mysql commands: | |||||||
> > | If for some reason you need to remove a delegation from the CREAM DB, use the following mysql commands: | |||||||
Changed: | ||||||||
< < | use delegationdb | |||||||
> > | use delegationdb; | |||||||
delete from t_credential where dlg_id="pippo" and local_user = "pluto"; |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 405 to 405 | ||||||||
If the CREAM CE is behaving properly, such queue should be always reasonable short. | ||||||||
Added: | ||||||||
> > | 0.1 Remove a delegationidIf for some reason you need to remove a delegation from the CREAM DB, do the following mysql commands:use delegationdb delete from t_credential where dlg_id="pippo" and local_user = "pluto";Of course replace pippo and pluto with the proper strings.
For submissions through the WMS, if a certain delegationid is removed, the ICE component of the WMS will detect it and will create a new delegation. | |||||||
-- MassimoSgaravatto - 2011-04-27 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 394 to 394 | ||||||||
A directory called XXX.debug will be created within the specified directory. | ||||||||
Added: | ||||||||
> > | 0.1 Check the queue of command/notifications still to be processedIf you want to check the queue of commands and notifications still to be processed, the mysql query command to be issued is:use creamdb; select count(*), name from command group by name;If the CREAM CE is behaving properly, such queue should be always reasonable short. | |||||||
-- MassimoSgaravatto - 2011-04-27 \ No newline at end of file |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 39 to 39 | ||||||||
/usr/sbin/glexec /usr/bin/id | ||||||||
Changed: | ||||||||
< < | This should return the id of the local user mapped to that Grid user | |||||||
> > | This should return the id of the local user mapped to that Grid user. Please note that this test makes sense only when the CREAM CE is configured to NOT use Argus. When the CREAM CE is instead configured to use Argus, glexec is not used at all in the CREAM CE node. | |||||||
0.1 Test gridftp |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 45 to 43 | ||||||||
0.1 Test gridftp | ||||||||
Changed: | ||||||||
< < | Try a gsiftp (e.g. using globus-url-copy or uberftp ) towards that CREAM CE.
E.g.: | |||||||
> > | Try a gsiftp (e.g. using globus-url-copy or uberftp ) towards that CREAM CE. E.g.: | |||||||
uberftp hostname-of-cream-ce> "ls /etc" | ||||||||
Line: 301 to 287 | ||||||||
| ||||||||
Changed: | ||||||||
< < |
0.1 Cannot create the job's working directory! [failure reason = ">>> sudoers file: Alias `XYZ' already defined, line ABC <<<"] | |||||||
> > | 0.1 Cannot create the job's working directory! [failure reason = ">>> sudoers file: Alias `XYZ' already defined, line ABC <<<"] | |||||||
This means that there is a syntax error in the sudoers file created by yaim-cream-ce. A likely reason if that the same VO as been enabled more than once in the siteinfo.def. | ||||||||
Line: 317 to 301 | ||||||||
0.1 User ABC not authorized for operation XYZ | ||||||||
Added: | ||||||||
> > | This means that there was an authorization problem. | |||||||
Added: | ||||||||
> > | If the authorization is managed via gJAF, check first of all if the relevant VOMS role has been enabled in the grid-mapfile. | |||||||
Added: | ||||||||
> > | Then check if you have all the relevant .lsc files in /etc/grid-security/vomsdir/<VO> (there must be a file for each "supported" VOMS server for that VO) and if they are correct (please note that the VOMS server certificate in /etc/grid-security/vomsdir is not needed anymore and it is used only if the relevant lsc file is not foun)d. | |||||||
Changed: | ||||||||
< < | 1 Other problems | |||||||
> > | If this was not enough, edit /etc/glite-ce-cream/log4j.properties | |||||||
Added: | ||||||||
> > | replacing: | |||||||
Changed: | ||||||||
< < | 0.1 Job failure with reason=999 | |||||||
> > | log4j.logger.org.glite=info, fileout | |||||||
Added: | ||||||||
> > | with:
log4j.logger.org.glite=debug, fileoutand comment the following lines: log4j.logger.org.glite.security=off log4j.logger.org.glite.voms=offThen restart tomcat In glite-ce-cream.log the reason for the authorization problem should be explained
0.1 Cannot generate the job wrapper! the problem seems to be related to the jdl: Number mismatch for maxOutputSandboxSize = -1,000000000000000E+00"This happens for submissions through the WMS to a CREAM CE deployed on a machine installed using a non-English (en-US ) language. This is because of different representations of decimal numbers. The workaround in this case is to uncomment the line:
LANG=en_USin $CATALINA_HOME/conf/tomcat5.conf and then restart tomcat
1 Other problems1.1 Job failure with reason=999 | |||||||
This can happen with the new BLAH BLparser when it is not able to detect the status of the job for more than x seconds. The default value for x is 86400. This value can be modified setting the attribute alldone_interval in /etc/blah.config , e.g.: | ||||||||
Line: 344 to 364 | ||||||||
0.1 Jobs submitted to LSF fails with errorcode 127 | ||||||||
Changed: | ||||||||
< < | This is likely a problem with staging of files from/to the CE node to/from the WN. Check if the relevant LSF daemons run properly. | |||||||
> > | This is likely a problem with staging of files from/to the CE node to/from the WN. Check if the relevant LSF daemons run properly. | |||||||
1 Other troubleshooting hints |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 259 to 259 | ||||||||
notesthat the torque_server node has to contain the submission hosts (the CREAM CEs) in its /etc/hosts.equiv . Or in recent versions of torque consider the acl_hosts and acl_hosts_enable attributes in the Torque server configuration (see http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml#open) | ||||||||
Added: | ||||||||
> > | 0.1 org.glite.security.delegation.storage.GrDPStorageExceptionExample:2009-09-10 15:48:16,082 ERROR - Received NULL fault; the error is due to another cause: FaultString=[org.glite.security.delegation.storage.GrDPStorageException: Configuration error: delegation_factorynull] - FaultCode=[SOAP-ENV:Server.userException] - FaultSubCode=[SOAP-ENV:Server.userException] - FaultDetail=[<ns1:hostname>vtb-generic-64.cern.ch</ns1:hostname>]This is likely a problem with the mysql DB (e.g. mysql not accessible, problems with grants, etc.) 0.2 sudo: 0 incorrect password attemptsThis means that there is an error in thesudoers file (e.g. the mapped user is not "enabled") created by yaim-cream-ce). So this is very likely a configuration problem.
0.3 Authorization error: Failed to get the local user id via glexecThis usually means an error while running glexec to get the local userid Check therefore the glexec log files (syslog or the log files defined in/etc/glexec.conf ). You might also need to increase (setting to 5) the glexec/lcas/lcmaps debug levels in /etc/glexec.conf .
0.4 Cannot find grid-proxy-infoThis means that the job wrapper running on the WN could not find thegrid-proxy-info executable. This could be due to several reasons.The most common ones are:
0.5 Problem to detect the lifetime of the proxyThis means that the job wrapper running on the WN could not detect the lifetime of the proxy (using thegrid-proxy-info command). This could be due to several reasons.The most common ones are:
0.6 Cannot create the job's working directory! [failure reason = ">>> sudoers file: Alias `XYZ' already defined, line ABC <<<"]This means that there is a syntax error in the sudoers file created by yaim-cream-ce. A likely reason if that the same VO as been enabled more than once in the siteinfo.def.0.7 Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklistedThis problem can happen when submitting to CREAM through the WMS. This means that the CE was blacklisted by that WMS (in particular by the ICE component) because the connection to this CE from that WMS went in timeout (default value for timeout: 60 secs) for 3 times. The ICE blacklisting of a CREAM CE lasts for 30 minutes. In this period submissions to that CREAM CE by that WMS/ICE are not attempted and fail with this error message.0.8 Cannot create the job's working directory! failure reason = "sudo: no tty present and no askpass program specified"The problem happens if there are some problems with the sudoers file, which is created by yaim0.9 User ABC not authorized for operation XYZ | |||||||
1 Other problems | ||||||||
Added: | ||||||||
> > |
0.1 Job failure with reason=999This can happen with the new BLAH BLparser when it is not able to detect the status of the job for more thanx seconds. The default value for x is 86400. This value can be modified setting the attribute alldone_interval in /etc/blah.config , e.g.:
alldone_interval=100000It is then necessary to restart the BLAH blparser: /etc/init.d/glite-ce-blahparser restart | |||||||
0.1 Jobs are successfully submitted to Torque, but they stay in W statusTorque will put a job in 'W' when it cannot stage the job's input files. Check if grid accounts on your WNs can "scp" without password to the CE and/or the Torque server host. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 196 to 196 | ||||||||
For LSF, check also in the /etc/blah.config if the path of lsf.profile is correct. | ||||||||
Added: | ||||||||
> > | 0.1 The job cannot be submitted because the blparser service is not alive | |||||||
Added: | ||||||||
> > | Check if the BLAH blparser is running. If it is supposed to run but it is not running, check in its log file(s) (/var/log/cream/glite-xxxparser.log for the old parser, /var/log/cream/glite-ce-bnotifier.log and /var/log/cream/glite-ce-bupdater.log for the new one) if something interesting is reported.
In case (re)start it (/etc/init.d/glite-ce-blparser restart for the old one, /etc/init.d/glite-ce-blahparser restart for the new one) and then restart tomcat (service tomcat5 restart )
In case your blparser node servers multiple CREAM CEs, please be sure to have followed the instructions reported at http://wiki.italiangrid.org/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_4_5_1_Configuration_of_the_old
0.1 Delegation error: the proxy delegationID "xx" is not more valid!Example:$ glite-ce-job-submit -D de2 -r cream-02.pd.infn.it:8443/cream-lsf-cream prren1.jdl 2008-01-28 11:27:04,859 FATAL - MethodName=[jobRegister] Timestamp=[Mon 28 Jan 2008 11:27:04] ErrorCode=[0] Description=[delegation error: the proxy delegationID "de2" is not more valid!;] FaultCause=[delegation proxy expired!]This means that the used delegationID was found on the target CREAM CE, but it is not more valid (i.e. the proxy expired) 0.2 job id list file error: File [xxx] is not a CREAM job list fileExample:$ glite-ce-job-status -i job_ids 2008-04-24 09:59:54,963 FATAL - File [job_ids] is not a CREAM job list file. Stop.This means the [job_ids] file passed in the command line has not the right format Here's an example of a file in the right format: $ cat job_ids ##CREAMJOBS## https://devel03.cnaf.infn.it:8443/CREAM683051516 https://devel03.cnaf.infn.it:8443/CREAM481684356 https://devel03.cnaf.infn.it:8443/CREAM333841302 https://devel03.cnaf.infn.it:8443/CREAM279829555 https://devel03.cnaf.infn.it:8443/CREAM334653961 0.3 bad UID for job executionIf your job fails with an error such as:****** JobID=[https://ppsce03.pic.es:8443/CREAM880596078] Status = [ABORTED] ExitCode = [] FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Bad UID for job execution MSG=ruserok failed validating dteam017/dteam017 from ppsce03.pic.es-) N/A (jobId = CREAM880596078)]notesthat the torque_server node has to contain the submission hosts (the CREAM CEs) in its /etc/hosts.equiv . Or in recent versions of torque consider the acl_hosts and acl_hosts_enable attributes in the Torque server configuration (see http://www.clusterresources.com/torquedocs21/a.bserverparameters.shtml#open)
1 Other problems1.1 Jobs are successfully submitted to Torque, but they stay in W statusTorque will put a job in 'W' when it cannot stage the job's input files. Check if grid accounts on your WNs can "scp" without password to the CE and/or the Torque server host.1.2 Jobs submitted to LSF fails with errorcode 127This is likely a problem with staging of files from/to the CE node to/from the WN. Check if the relevant LSF daemons run properly. | |||||||
1 Other troubleshooting hints |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Troubleshooting guide for CREAM | ||||||||
Line: 201 to 201 | ||||||||
1 Other troubleshooting hints | ||||||||
Added: | ||||||||
> > | 0.1 Saving the batch job submission scriptIf, for debugging purposes, it is necessary to save the script used by the BLAH component of CREAM to submit the job to the batch system, edit the file/usr/bin/blah_common_submit_functions.sh replacing the following lines:
# DEBUG: cp $bls_tmp_file /tmp rm -f $bls_tmp_filewith: cp $bls_tmp_file /tmp # rm -f $bls_tmp_file | |||||||
0.1 Saving files on the worker node after job execution | ||||||||
Changed: | ||||||||
< < | Fore debugging purposes, it is possible to save the stdout, stderr and proxy files on the Worker Node after job execution. | |||||||
> > | For debugging purposes, it is possible to save the stdout, stderr and proxy files on the Worker Node after job execution. | |||||||
This is done setting blah_debug_save_wn_files in /etc/blah.config on the CREAM CE node to an existing directory (on the WN) where the user running the job has the proper rights for writing. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Troubleshooting guide for CREAM
1 Checks to be done after installation and configuration1.1 Check via browserOpen your browser (where a valid certificate must be installed) tohttps://hostname-of-cream-ce:8443/ce-cream/services
A page with link to the CREAM WSDL should be shown
1.2 Check the CREAM log fileCheck in the CREAM log file (/var/log/cream/glite-ce-cream.log ) for the following strings:
CREAM started! (BLParserClient) Connection with BLParser (xxx) correctly established(Replace xxx with the name of your management system: lsf or pbs) If they are not there, it means that CREAM has not started properly 1.3 Test glexec
export GLEXEC_MODE="lcmaps_get_account" export GLEXEC_CLIENT_CERT=/tmp/user.proxy /usr/sbin/glexec /usr/bin/idThis should return the id of the local user mapped to that Grid user 1.4 Test gridftpTry a gsiftp (e.g.using globus-url-copy or uberftp ) towards that CREAM CE.
E.g.:
uberftp hostname-of-cream-ce> "ls /etc" 1.5 Check if submissions are enabledTry the following command from a UI:glite-ce-allowed-submission <<hostname-of-cream-ce>>:8443It should return: Job Submission to this CREAM CE is enabled 1.6 Try a direct submissionTry a submission to that CE using theglite-ce-job-submit command , e.g.:
$ /bin/cat test.jdl [ executable="/bin/sleep"; arguments="1"; ] $ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl https://alice16.spbu.ru:8443/CREAM336256203Check the status of that job, which eventually should be DONE-OK: $ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM336256203 ****** JobID=[https://alice16.spbu.ru:8443/CREAM336256203] Status = [DONE-OK] ExitCode = [0] 1.7 Try a job cancellationTry a submission to that CE using theglite-ce-job-submit command, and then try to cancel it (using the glite-ce-job-cancel command).
$ /bin/cat test.jdl [ executable="/bin/sleep"; arguments="1000"; ] $ glite-ce-job-submit -a -r alice16.spbu.ru:8443/cream-pbs-dteam test.jdl https://alice16.spbu.ru:8443/CREAM510970530 $ glite-ce-job-cancel https://alice16.spbu.ru:8443/CREAM510970530Check the status of that job, which eventually should be CANCELLED :
$ glite-ce-job-status https://alice16.spbu.ru:8443/CREAM510970530 ****** JobID=[https://alice16.spbu.ru:8443/CREAM510970530 Status = [CANCELLED] ExitCode = [] Description = [Cancelled by user] 1.8 Try a submission through the WMSTry a submission to that CE through the WMS, i.e. using theglite-wms-job-submit command
2 Log filesIn case of problems first of all check the log files. See http://wiki.italiangrid.org/twiki/bin/view/CREAM/ServiceReferenceCard#Logfile_locations_and_management for relevant information3 Error messages3.1 Batch system xxx not supportedExample:$ glite-ce-job-submit -a -r cert-26.pd.infn.it:8443/cream-pbs-cream oo.jdl 2008-01-15 13:46:18,167 FATAL - MethodName=[jobRegister] Timestamp=[Tue 15 Jan 2008 13:46:18] ErrorCode=[0] Description=[system error] FaultCause=[Batch System pbs not supported!]This means that:
org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection: getting info about BLParser (xxx) from BLAH (retry count=yy/zz) ... org.glite.ce.cream.jobmanagement.cmdexecutor.blah.BLParserClient - initializeConnection error: cannot get BLParser (lsf) HOST:PORT information from BLAH. Please, be sure that BLAH is properly configured and RESTART the CREAM service.As batch system it is meant the value specified as JOB_MANAGER in the siteinfo.def . Please note that valid values are lsf, pbs and condor (and not lcgpbs, lcglsf, lcgcondor).
This value is reported in the /etc/blah.config file (attribute supported_lrms ).
Suppose that you have lsf as this value.
As user tomcat from the CE node try the following:
$ /usr/bin/blahpd $GahpVersion: 1.14.0 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $ BLAH_GET_HOSTPORT lsfThis should return: SThen type: resultsIt should return something like lsf 0 lsf/cream-35.pd.infn.it:56565If this is the case (i.e. you are getting a value such as this one) your problem is likely the second one (i.e. the blparser was not running when tomcat was started). For LSF, check also in the /etc/blah.config if the path of lsf.profile is correct.
4 Other troubleshooting hints4.1 Saving files on the worker node after job executionFore debugging purposes, it is possible to save the stdout, stderr and proxy files on the Worker Node after job execution. This is done settingblah_debug_save_wn_files in /etc/blah.config on the CREAM CE node to an existing directory (on the WN) where the user running the job has the proper rights for writing.
A directory called XXX.debug will be created within the specified directory.
-- MassimoSgaravatto - 2011-04-27 |