---+ Nagios WMS probe

This probe tests the WMS service by submitting jobs to CEs. It is based on the [[https://twiki.cern.ch/twiki/bin/view/LCG/SAMToNagios#Python_based_probes_using_gridmo][python-GridMon]] library developed by the [[https://tomtools.cern.ch/confluence/display/SAM/Home][SAM team]]. Details about the command-line parameters can be found [[https://tomtools.cern.ch/confluence/display/SAM/Probes+org.sam#Probesorg.sam-Probesandmetrics][here]].

%TOC%

---++ Installation

This probe needs to be installed on a WMS User Interface, because it uses the WMS command-line tools (=glite-wms-job-*=) to monitor the WMS.

---+++ Dependencies

| *Package* | *Version* |
| python | >= 2.4 |
| python-ldap | |
| openssl | >= 0.9.8e-12 |
| nagios-submit-conf | >= 0.2 |
| python-GridMon | >= 1.1.10 |

The last two RPMs can be installed from the EGI SAM repository. Save the following definition as a =.repo= file under =/etc/yum.repos.d/=:
<verbatim>
[egi-sam]
name=EGI SAM repo
baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
enabled=1
gpgcheck=0
protect=1
priority=10
</verbatim>

---++ WMS Metrics

   1 *emi.wms.WMS-JobState*: submits a grid job to the CE(s) via the WMS under test. Accepts passive check updates from emi.wms.WMS-JobMonit.
   1 *emi.wms.WMS-JobMonit*: monitors the submitted grid jobs.
   1 *emi.wms.WMS-JobCancel*: cancels the grid job.
   1 *emi.wms.WMS-JobSubmit*: passive check that holds the terminal status of the job submission.

---+++ *emi.wms.WMS-JobState*

This metric submits jobs through the WMS under test. It accepts these parameters:
<verbatim>
--jdl-templ <file>             JDL template file (full path).
                               Default: <emi.wms.ProbesLocation>/WMS-jdl.template
--jdl-retrycount <val>         JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>  JDL ShallowRetryCount (Default: 1).
--ces-file <file>              File with list of CEs. Two schemes: [file:] or http:
                               (Default: /var/lib/gridprobes/<vo>/GoodCEs)
--prev-status <0-3>            Previous Nagios status of the metric.
</verbatim>

---++++ !GoodCEs

If the file "<em>GoodCEs</em>" exists, all the CEs listed in it are OR'ed in the resulting =Requirements= ClassAd expression, e.g.:
<verbatim>
Requirements = (other.GlueCEInfoHostName == "ce106.cern.ch") || (other.GlueCEInfoHostName == "creamce.gina.sara.nl")
</verbatim>
This file can be populated automatically with the =gather_healthy_nodes= script from the =hr.srce grid-monitoring probes=. If the file does not exist, a very general requirement, =Requirements = true=, is used for match-making.

---++++ JDL template

This is the default JDL template used for submission:
<verbatim>
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
OutputSandbox = {"gridjob.out"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = <jdlReqCEInfoHostName>;
</verbatim>
At the moment the _jdlExecutable_ parameter is hardcoded, and =/bin/hostname= is used as the executable.

---+++ *emi.wms.WMS-JobMonit*

Monitors the submitted grid jobs. The implementation is threaded, with one thread per monitored resource and at most 10 threads.

While a job is not in a terminal state, the metric passively updates =emi.wms.WMS-JobState= with the latest job state reported by the WMS. When the job enters a terminal state, or has been cancelled, the metric updates both =emi.wms.WMS-JobState= and =emi.wms.WMS-JobSubmit= with the final job status. These passive check results are submitted either via the Nagios command file or via NSCA. =emi.wms.WMS-JobSubmit= is the metric that goes to the Metric Store Database.

It accepts these parameters:
<verbatim>
--timeout-job-global <sec>   Global timeout for jobs. Job will be cancelled and dropped
                             if it is not in terminal state by that time. (Default: 3600)
--timeout-job-waiting <sec>  Time allowed for a job to stay in Waiting with
                             'no compatible resources'. (Default: 2700)
--hosts <h1,h2,..>           Comma-separated list of CE hostnames to run monitor on.
</verbatim>

---++ Test

To test the probe you first have to create a valid VOMS proxy, e.g. with =voms-proxy-init --voms <vo>=.
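Every command below needs the path of your proxy for the =-x= option. The following is a minimal Python sketch, assuming the common grid convention (=$X509_USER_PROXY= if set, otherwise =/tmp/x509up_u<uid>=, which matches the =/tmp/x509up_u501= used in the examples). The helpers =default_proxy_path= and =probe_command= are hypothetical and not part of the probe; they only assemble the same command lines shown on this page.

```python
import os

# Probe location, as used in the examples on this page.
WMS_PROBE = "/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe"


def default_proxy_path(environ=None, uid=None):
    """Return the proxy file to pass with -x (hypothetical helper).

    Assumed convention: $X509_USER_PROXY if set, otherwise /tmp/x509up_u<uid>.
    """
    if environ is None:
        environ = os.environ
    explicit = environ.get("X509_USER_PROXY")
    if explicit:
        return explicit
    if uid is None:
        uid = os.getuid()  # POSIX only
    return "/tmp/x509up_u%d" % uid


def probe_command(vo, proxy, wms_host, metric, extra_args=()):
    """Build the WMS-probe argument list for one metric (hypothetical helper)."""
    return [WMS_PROBE, "--vo", vo, "-x", proxy, "-H", wms_host,
            "-m", metric] + list(extra_args)
```

For example, =subprocess.call(probe_command("dteam", default_proxy_path(), "cream-45.pd.infn.it", "emi.wms.WMS-JobMonit", ["--pass-check-dest", "active"]))= would run the monitoring step shown below. Passing an argument list instead of a shell string avoids quoting problems with the paths.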
---+++ State + Monit

First you have to "submit" a JDL:

*/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState*

<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
==========================================================================
</verbatim>

Then you can "monitor" the job:

*/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active*

<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results
>>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
OK: [Scheduled] https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status: Scheduled
Status Reason: unavailable
Destination: cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted: Mon Nov 7 16:38:01 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
</verbatim>

If you run the _emi.wms.WMS-JobState_ metric again, its output shows the last job status seen by the _emi.wms.WMS-JobMonit_ metric:
<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
OK: Active job - Scheduled [2011-11-07T15:38:28Z]
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Active job - Scheduled [2011-11-07T15:38:28Z]
</verbatim>

Once the job has finished, the next run of the _emi.wms.WMS-JobMonit_ metric should also trigger the retrieval of the job output (=glite-wms-job-output=):
<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results
>>> <cream-45.pd.infn.it,emi.wms.WMS-JobSubmit-dteam>
OK: success.
glite-wms-job-output --noint --nosubdir --dir /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ 2>&1
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
Warning - option --nosubdir specified: output files with same name will be overridden
Warning - Directory already exists: /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput
================================================================================
JOB GET OUTPUT OUTCOME
Output sandbox files for the job:
https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
have been successfully retrieved and stored in the directory:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobOutput
================================================================================
metric results
>>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: success.
OK: Jobs processed - 1
OK: Jobs processed - 1
Done : 1|jobs_processed=1;; DONE=1;; RUNNING=0;; SCHEDULED=0;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
</verbatim>

You can verify that everything worked correctly by checking the status of the job, which is now =Cleared=:
<verbatim>
[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/o9ELufQPDtyIxkJ0_YGuKQ
Current Status: Cleared
Status Reason: user retrieved output sandbox
Destination: cccreamceli09.in2p3.fr:8443/cream-sge-long
Submitted: Mon Nov 7 16:38:01 2011 CET
==========================================================================
</verbatim>

---+++ State + Monit + Cancel

First you have to "submit" a JDL:

*/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobState*

<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobState
OK:
OK:
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Connecting to the service https://cream-45.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
The job identifier has been saved in the following file:
/var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
==========================================================================
</verbatim>

As before, you can "monitor" the job:

*/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobMonit --pass-check-dest active*

<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
metric results
>>> <cream-45.pd.infn.it,emi.wms.WMS-JobState-dteam>
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
OK: [Scheduled] https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status: Scheduled
Status Reason: unavailable
Destination: t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted: Mon Nov 7 16:51:29 2011 CET
==========================================================================
OK: Jobs processed - 1
OK: Jobs processed - 1
[Scheduled] : 1|jobs_processed=1;; DONE=0;; RUNNING=0;; SCHEDULED=1;; SUBMITTED=0;; READY=0;; WAITING=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
</verbatim>

Before the job finishes you can "cancel" it:

*/usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo <vo> -x <path of the proxy> -H <WMS hostname> -m emi.wms.WMS-JobCancel*

<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobCancel
OK: job cancelled
OK: job cancelled
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Job cancellation request sent: glite-wms-job-cancel --noint -i /var/lib/gridprobes/dteam/emi.wms/WMS/cream-45.pd.infn.it/jobID
Job bookkeeping files deleted.
</verbatim>

A subsequent run of the _emi.wms.WMS-JobMonit_ metric finds no active jobs:
<verbatim>
[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.wms/WMS-probe --vo dteam -x /tmp/x509up_u501 -H cream-45.pd.infn.it -m emi.wms.WMS-JobMonit --pass-check-dest active
OK: no active jobs [2011-11-07T16:51:53Z]
OK: no active jobs [2011-11-07T16:51:53Z]|jobs_processed=0;;
</verbatim>

You can verify that everything worked correctly by checking the status of the job, which is now =Cancelled=:
<verbatim>
[ale@cream-48 ~]$ glite-wms-job-status https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://cream-45.pd.infn.it:9000/apDAgqvkXls1HNLfRxPO0A
Current Status: Cancelled
Destination: t2-ce-01.lnl.infn.it:8443/cream-lsf-cert1
Submitted: Mon Nov 7 16:51:29 2011 CET
==========================================================================
</verbatim>
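The text after the =|= in the _emi.wms.WMS-JobMonit_ outputs above is Nagios performance data: space-separated =label=value;warn;crit= tokens with one counter per job state. If you want to post-process those counters (e.g. in your own checks), a minimal Python sketch could look like this. =parse_perfdata= and =active_jobs= are hypothetical helpers, and the list of non-terminal states used by =active_jobs= is an assumption based on the gLite job states, not something the probe defines.

```python
def parse_perfdata(output_line):
    """Return {label: value} from the performance data of a probe output line.

    Everything after the first '|' is split on whitespace; for each
    label=value;warn;crit token only the value is kept.
    """
    if "|" not in output_line:
        return {}
    perfdata = output_line.split("|", 1)[1]
    counters = {}
    for token in perfdata.split():
        label, sep, rest = token.partition("=")
        if not sep:
            continue  # not a label=value token
        value = rest.split(";", 1)[0]
        if value:
            counters[label] = int(value)  # the counters shown on this page are integers
    return counters


# Assumed non-terminal job states (counter names as printed by the probe).
NON_TERMINAL = ("SUBMITTED", "WAITING", "READY", "SCHEDULED", "RUNNING")


def active_jobs(counters):
    """Number of processed jobs that are still in a non-terminal state."""
    return sum(counters.get(name, 0) for name in NON_TERMINAL)
```

Applied to the first monitoring run above, this would report one =SCHEDULED= job (=active_jobs= returns 1); for the run after the output retrieval it returns 0, and the "no active jobs" line yields only =jobs_processed=.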
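Besides the manual cancellation shown above, _emi.wms.WMS-JobMonit_ cancels and drops a job by itself when it is not in a terminal state by =--timeout-job-global=, and limits how long a job may stay in Waiting with 'no compatible resources' via =--timeout-job-waiting=. The following Python sketch only illustrates that rule as stated in the parameter list; it is not the probe's code. The function name =should_cancel=, the set of terminal states, the substring match on the status reason and the way elapsed times are measured are all assumptions.

```python
# Defaults of the emi.wms.WMS-JobMonit parameters (seconds).
TIMEOUT_JOB_GLOBAL = 3600   # --timeout-job-global
TIMEOUT_JOB_WAITING = 2700  # --timeout-job-waiting

# Assumed set of terminal job states, as printed by glite-wms-job-status.
TERMINAL_STATES = frozenset(["Done", "Aborted", "Cancelled", "Cleared"])


def should_cancel(status, reason, seconds_since_submission, seconds_in_waiting,
                  timeout_global=TIMEOUT_JOB_GLOBAL,
                  timeout_waiting=TIMEOUT_JOB_WAITING):
    """Hypothetical decision: should a monitored job be cancelled and dropped?"""
    if status in TERMINAL_STATES:
        return False  # nothing left to cancel
    if seconds_since_submission >= timeout_global:
        return True  # not terminal by the global deadline
    # Only a job stuck in Waiting because no CE matches is subject to the shorter limit.
    stuck_unmatched = status == "Waiting" and "no compatible resources" in reason
    return stuck_unmatched and seconds_in_waiting >= timeout_waiting
```

With the defaults, the job in the cancel example (Scheduled after a few seconds) would not be cancelled automatically; it would only be dropped if still not terminal one hour after submission.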
Topic revision: r6 - 2011-11-07 - AlessioGianelle
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.