Nagios WMS probe
Tests the WMS service by submitting jobs to CEs. It is based on the
python-GridMon library developed by the
SAM team. Details about the command line parameters can be found
here.
Installation
This probe needs to be installed on a WMS User Interface because it uses wms-cli commands to monitor the WMS.
WMS Metrics
- emi.wms.WMS-JobState. Submits a grid job to CE(s) via the WMS under test. Accepts passive check updates from emi.wms.WMS-JobMonit.
- emi.wms.WMS-JobMonit. Monitors submitted grid jobs.
- emi.wms.WMS-JobCancel. Cancels a grid job.
- emi.wms.WMS-JobSubmit. Passive check. Holds terminal status of job submission.
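When a passive check result is written to the Nagios command file, it takes the form of an external command line. As a rough illustration, and not the probe's actual code, the sketch below formats one such line using the standard Nagios PROCESS_SERVICE_CHECK_RESULT syntax. The function name, host name and output text are hypothetical, and the real probe may name its Nagios services differently.

```python
import time

# Standard Nagios status codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def passive_result_line(host, service, status, output, timestamp=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command for Nagios."""
    if status not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError("status must be in the range 0-3")
    if timestamp is None:
        timestamp = int(time.time())
    # A newline would end the command early, so flatten the plugin output.
    output = output.replace("\n", " ")
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, status, output)


if __name__ == "__main__":
    # Hypothetical host and output, for illustration only.
    print(passive_result_line("wms.example.org", "emi.wms.WMS-JobSubmit",
                              OK, "OK: job finished"), end="")
```

The resulting line would be appended to the Nagios command file; delivery through NSCA uses a different input format and is not covered by this sketch.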
emi.wms.WMS-JobState
This metric is used to submit jobs through the WMS under test. It accepts these parameters:
--jdl-templ <file> JDL template file (full path).
Default: <emi.wms.ProbesLocation>/WMS-jdl.template
--jdl-retrycount <val> JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val> JDL ShallowRetryCount (Default: 1).
--ces-file <file> File with list of CEs. Two schemes [file:] or http:
(Default: /var/lib/gridprobes/<vo>/GoodCEs)
--prev-status <0-3> Previous Nagios status of the metric.
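The --ces-file option accepts either a local path (with an optional file: prefix) or an http: URL. The following is a minimal Python sketch of how such a location could be read, not the probe's own implementation. The function name load_ce_list is hypothetical, and the file layout (one CE hostname per line, with blank lines and '#' comments ignored) is an assumption.

```python
import urllib.request


def load_ce_list(location):
    """Return CE hostnames read from a local path, a file: path or an http: URL.

    Assumption: the file holds one hostname per line.
    """
    # https: is accepted here too, although only http: is documented.
    if location.startswith(("http:", "https:")):
        with urllib.request.urlopen(location) as response:
            text = response.read().decode("utf-8")
    else:
        # The "file:" prefix is optional, so strip it when present.
        path = location[len("file:"):] if location.startswith("file:") else location
        with open(path, encoding="utf-8") as handle:
            text = handle.read()
    hosts = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            hosts.append(line)
    return hosts
```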
GoodCEs
If the file "GoodCEs" exists, all CEs from the file are OR'ed in the resulting Requirements ClassAd expression, e.g.:
Requirements = (other.GlueCEInfoHostName == "ce106.cern.ch") || (other.GlueCEInfoHostName == "creamce.gina.sara.nl")
This file can be automatically populated using the script gather_healthy_nodes from the hr.srce grid-monitoring probes.
If the file does not exist, a very general requirement:
Requirements = true
is used for match-making.
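The construction of the Requirements expression described above can be sketched in Python as follows. This is an illustration rather than the probe's code: the function name build_requirements is hypothetical, and treating an empty host list like a missing file is an assumption.

```python
def build_requirements(ce_hosts):
    """Return the right-hand side of the JDL Requirements attribute."""
    if not ce_hosts:
        # No GoodCEs file (or, by assumption, an empty one): match any CE.
        return "true"
    clauses = ['(other.GlueCEInfoHostName == "%s")' % host for host in ce_hosts]
    return " || ".join(clauses)


if __name__ == "__main__":
    print("Requirements = %s" % build_requirements(
        ["ce106.cern.ch", "creamce.gina.sara.nl"]))
```

With the two hosts shown, the printed line matches the example expression given earlier in this section.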
JDL template
This is the default JDL template used for submission:
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
At the moment the parameter jdlExecutable is hardcoded, and /bin/hostname is used as the executable.
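To show how the placeholders in the template relate to the command line options, here is a small Python sketch that fills them in. It is only an assumption about how the substitution might work; the function name render_jdl and the values dictionary are hypothetical.

```python
import re

JDL_TEMPLATE = """\
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
"""


def render_jdl(template, values):
    """Replace every <jdl...> placeholder in template with values[name]."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError("no value for placeholder <%s>" % name)
        return str(values[name])
    return re.sub(r"<(jdl\w+)>", substitute, template)


if __name__ == "__main__":
    print(render_jdl(JDL_TEMPLATE, {
        "jdlExecutable": "/bin/hostname",      # hardcoded in the probe
        "jdlRetryCount": 0,                    # --jdl-retrycount default
        "jdlShallowRetryCount": 1,             # --jdl-shallowretrycount default
        "jdlReqCEInfoHostName": "true",        # no GoodCEs file
    }), end="")
```

Raising an error for a missing value, rather than leaving the placeholder in place, avoids submitting a JDL file that the WMS would reject.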
emi.wms.WMS-JobMonit
Monitors submitted grid jobs. The implementation is threaded, with one thread per monitored resource and a maximum of 10 threads. It passively updates emi.wms.WMS-JobState with the latest state of the job according to the WMS while the job is not in a terminal state. When the job enters a terminal state or is cancelled, the metric updates both emi.wms.WMS-JobState and emi.wms.WMS-JobSubmit with the final job status. The latter metrics are updated (as passive checks) either via the Nagios command file or NSCA. emi.wms.WMS-JobSubmit is the metric which goes to the Metric Store Database. emi.wms.WMS-JobMonit accepts these parameters:
--timeout-job-global <sec> Global timeout for jobs. Job will be cancelled and dropped
if it is not in terminal state by that time. (Default: 3600)
--timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with 'no compatible
resources'. (Default: 2700)
--hosts <h1,h2,..> Comma-separated list of CE hostnames to run monitor on.
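The interplay of the two timeouts can be illustrated with a short Python sketch. It is not taken from the probe: the function name should_cancel is hypothetical, the set of terminal states is an assumption based on the usual WMS job states, and cancelling a job that exceeds the waiting timeout is also an assumption, since the option text only says how long the job is allowed to wait.

```python
# Assumed terminal states of a WMS job.
TERMINAL_STATES = {"Done", "Aborted", "Cancelled", "Cleared"}


def should_cancel(state, age, waiting_no_match_for=None,
                  timeout_global=3600, timeout_waiting=2700):
    """Decide whether a monitored job should be cancelled and dropped.

    state                -- current job state reported by the WMS
    age                  -- seconds since the job was submitted
    waiting_no_match_for -- seconds spent in Waiting with
                            'no compatible resources', or None
    """
    if state in TERMINAL_STATES:
        return False  # nothing left to cancel
    if age >= timeout_global:
        return True   # --timeout-job-global exceeded
    if (state == "Waiting" and waiting_no_match_for is not None
            and waiting_no_match_for >= timeout_waiting):
        return True   # --timeout-job-waiting exceeded (assumed behaviour)
    return False


if __name__ == "__main__":
    print(should_cancel("Running", 120))
    print(should_cancel("Waiting", 3000, waiting_no_match_for=2800))
```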