
Nagios WMS probe

Tests the WMS service by submitting jobs to CEs. The probe is based on the python-GridMon library developed by the SAM team. Details about the command line parameters can be found here.

WMS Metrics

  1. emi.wms.WMS-JobState. Submits grid job to CE(s) via WMS under test. Accepts passive check updates from emi.wms.WMS-JobMonit.
  2. emi.wms.WMS-JobMonit. Monitors submitted grid jobs.
  3. emi.wms.WMS-JobCancel. Cancels grid job.
  4. emi.wms.WMS-JobSubmit. Passive check. Holds terminal status of job submission.

emi.wms.WMS-JobState

This metric is used to submit jobs through the WMS under test. It accepts these parameters:

--jdl-templ <file>              JDL template file (full path). 
                                Default: <emi.wms.ProbesLocation>/WMS-jdl.template
--jdl-retrycount <val>          JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>   JDL ShallowRetryCount (Default: 1).
--ces-file <file>               File with list of CEs. Supported schemes: [file:] or http:
                                (Default: /var/lib/gridprobes/<vo>/GoodCEs)
--prev-status <0-3>             Previous Nagios status of the metric.

GoodCEs

If the file "GoodCEs" exists, all CEs listed in it are OR'ed together in the resulting Requirements ClassAd expression, e.g.:

Requirements = (other.GlueCEInfoHostName == "ce106.cern.ch") || (other.GlueCEInfoHostName == "creamce.gina.sara.nl")

This file can be populated automatically using the script gather_healthy_nodes from the hr.srce grid-monitoring probes.

If the file does not exist, a very general requirement, Requirements = true, is used for match-making.
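The construction of the Requirements expression described above can be sketched in Python. This is only an illustration and not the probe's actual code: the function names are hypothetical, and the sketch assumes that the GoodCEs file holds one CE hostname per line, which this page does not state.

```python
def load_ce_hosts(path):
    """Read CE hostnames from a file; return an empty list if it is missing.

    Assumption: one hostname per line (the real file format is not
    documented here).
    """
    try:
        with open(path) as handle:
            return handle.read().splitlines()
    except OSError:
        return []


def build_requirements(ce_hosts):
    """Build a JDL Requirements expression from a list of CE hostnames."""
    # Drop surrounding whitespace and skip empty lines.
    hosts = [h.strip() for h in ce_hosts if h.strip()]
    if not hosts:
        # No usable CE list: fall back to the very general requirement.
        return "true"
    # OR together one GlueCEInfoHostName clause per CE.
    return " || ".join(
        '(other.GlueCEInfoHostName == "%s")' % host for host in hosts
    )
```

For example, build_requirements(load_ce_hosts("/var/lib/gridprobes/<vo>/GoodCEs")) would give either the OR'ed expression or true, depending on whether the file could be read.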

JDL template

This is the default JDL template used for submission:

Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;

At the moment the parameter jdlExecutable is hardcoded: the executable /bin/hostname is always used.
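How the placeholders in the template might be filled in can be sketched as follows. This is a minimal illustration using plain string replacement, not the probe's implementation; the function name fill_jdl and its keyword arguments are hypothetical, while the default values are taken from the parameter descriptions above.

```python
# Default template as shown above, with <...> placeholders.
JDL_TEMPLATE = """Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
"""


def fill_jdl(template, executable="/bin/hostname", retry_count=0,
             shallow_retry_count=1, requirements="true"):
    """Return the template with every placeholder replaced by its value."""
    values = {
        "<jdlExecutable>": executable,
        "<jdlRetryCount>": str(retry_count),
        "<jdlShallowRetryCount>": str(shallow_retry_count),
        "<jdlReqCEInfoHostName>": requirements,
    }
    for placeholder, value in values.items():
        template = template.replace(placeholder, value)
    return template


if __name__ == "__main__":
    print(fill_jdl(JDL_TEMPLATE))
```

Calling fill_jdl(JDL_TEMPLATE) with no further arguments yields a JDL that runs /bin/hostname with RetryCount = 0, ShallowRetryCount = 1 and Requirements = true.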

emi.wms.WMS-JobMonit

Monitors submitted grid jobs. The implementation is threaded, with one thread per monitored resource and a maximum of 10 threads.

While a job is not in a terminal state, the metric passively updates emi.wms.WMS-JobState with the latest state of the job according to the WMS. When the job enters a terminal state or is cancelled, the metric updates both emi.wms.WMS-JobState and emi.wms.WMS-JobSubmit with the final job status. These two metrics are updated (as passive checks) via either the Nagios command file or NSCA. emi.wms.WMS-JobSubmit is the metric that goes to the Metric Store Database.

It accepts these parameters:

--timeout-job-global <sec>   Global timeout for jobs. Job will be cancelled and dropped 
                             if it is not in terminal state by that time. (Default: 3600)
--timeout-job-waiting <sec>  Time allowed for a job to stay in Waiting with 'no compatible 
                             resources'. (Default: 2700)
--hosts <h1,h2,..>           Comma-separated list of CE hostnames to run monitor on.
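
The interplay of the two timeouts can be illustrated with a small Python sketch. It is not taken from the probe: the function name and arguments are hypothetical, times are plain seconds, and the order in which the two limits are checked is an assumption. What the probe does once the waiting limit is exceeded is not described on this page, so the sketch only reports which limit was hit.

```python
def exceeded_timeout(now, submitted_at, waiting_since=None,
                     timeout_global=3600, timeout_waiting=2700):
    """Report which timeout a non-terminal job has exceeded, if any.

    now, submitted_at, waiting_since: times in seconds (e.g. time.time()).
    waiting_since is None unless the job is currently in Waiting with
    'no compatible resources'.
    Returns "global", "waiting" or None.
    """
    # Global limit: the job did not reach a terminal state in time.
    if now - submitted_at > timeout_global:
        return "global"
    # Waiting limit: the job stayed in Waiting for too long.
    if waiting_since is not None and now - waiting_since > timeout_waiting:
        return "waiting"
    return None
```

With the default values, a job submitted at time 0 and still not terminal at time 4000 exceeds the global timeout, while a job that has been in Waiting since time 100 exceeds the waiting timeout at time 3000.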
Topic revision: r3 - 2011-09-26 - AlessioGianelle