Difference: TroubleshootingGuide (26 vs. 27)

Revision 27 2013-02-25 - LisaZangrando

Line: 1 to 1
 
META TOPICPARENT name="SystemAdministratorDocumentation"

Troubleshooting guide for CREAM

Line: 445 to 445
 

1 Other problems

1.1 Job cancelled with description "Cancelled by CE admin"

Changed:
<
<
All activity on the job is tracked in its history. So, whenever the CREAM service receives the cancelled state for a job from BLAH, it checks the job's history to see whether the user (or admin) explicitly requested the cancellation or whether CREAM itself decided to cancel the job because of lease expiration or some other reason. If no "jobCancel" command has been executed by CREAM (i.e. following a user/admin request or a job lease expiration), the action was executed externally by the system administrator (e.g. bkill) or by the batch system itself (e.g. the batch system killed the job due to excessive memory usage). In this case, CREAM cannot know exactly what happened, so it can only inform the user of the job cancellation with the description "Cancelled by CE admin". The LRMS logs may contain useful detailed information for working out who cancelled the job, but this approach is batch system dependent and needs more investigation. At present, the hint we can provide to the user is to ask the LRMS admin to investigate the cancellation reason using the tools provided by the LRMS itself.
>
>
All activity on a job is tracked in its "command history", a record of the main internal and external events (i.e. the commands) that affect the job's life cycle. Whenever CREAM receives a command for a job, it records the event in that job's command history. A command can be requested explicitly by the user (i.e. the owner of the job), by the CREAM administrator (a privileged user who can control all jobs), or by CREAM itself, which sometimes must act on the job in place of the user or admin (e.g. when a lease expires, the associated jobs must be cancelled). Unfortunately, a job can also be controlled externally, either by the batch system administrator (e.g. with bkill, bstop, bresume, etc.) or by the batch system itself (e.g. the LRMS kills the job because of excessive memory usage). Such actions bypass CREAM: it never sees the command executed against the job and learns only of the resulting job status change, through BLAH. In this scenario, when a job is cancelled, CREAM cannot know exactly what happened, or who cancelled the job and why, so it can only inform the user of the cancellation with the message "Cancelled by CE admin" in the description field of the job status. At present CREAM cannot provide more detailed information. The LRMS logs may contain useful details for working out who cancelled the job, but this approach is batch system dependent and needs more investigation. The advice for the user is to ask the system administrator to investigate the reason for the cancellation using the tools provided by the LRMS itself.
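The decision described above can be pictured with a minimal sketch. This is not CREAM's actual code or API: the `Command` and `Job` classes, the `issued_by` values and the function name are hypothetical, and every description string other than "Cancelled by CE admin" is a placeholder invented for illustration. The sketch only shows the idea of falling back to "Cancelled by CE admin" when the command history holds no "jobCancel" entry.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical model of a command history entry (not CREAM's real data model).
@dataclass
class Command:
    name: str                      # e.g. "jobSubmit", "jobCancel"
    issued_by: str                 # assumed values: "user", "admin", "cream"
    reason: Optional[str] = None   # e.g. "lease expired" when issued by CREAM


@dataclass
class Job:
    job_id: str
    command_history: List[Command] = field(default_factory=list)


def cancellation_description(job: Job) -> str:
    """Choose a description once BLAH has reported the job as cancelled."""
    # Look for the most recent jobCancel command that went through CREAM.
    for cmd in reversed(job.command_history):
        if cmd.name == "jobCancel":
            if cmd.issued_by == "cream":
                # Placeholder wording; the real message is not given in this guide.
                return f"Cancelled by CREAM ({cmd.reason or 'no reason recorded'})"
            # Placeholder wording for a user or admin request.
            return f"Cancelled by {cmd.issued_by}"
    # No jobCancel was recorded: the cancellation bypassed CREAM
    # (batch system administrator or the batch system itself).
    return "Cancelled by CE admin"
```

For example, a job whose history contains only a "jobSubmit" command and which BLAH then reports as cancelled would get the description "Cancelled by CE admin", because nothing in the history explains the cancellation.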
 

0.1 Dynamic information not published in the BDII

 