Difference: DjsCreamProbe (3 vs. 4)

Revision 4 - 2011-11-10 - AlessioGianelle

Line: 1 to 1
 
META TOPICPARENT name="NagiosProbes"

CREAM-CE direct job submission metrics

Line: 40 to 40
 

emi.cream.CREAMCEDJS-DirectJobMonit

Monitors submitted grid jobs. The implementation is threaded: one thread per monitored resource, with a maximum of 10 threads. While a job is not in a terminal state, the metric passively updates emi.cream.CREAMCEDJS-DirectJobState with the latest state of the job according to CREAM. When the job enters a terminal state or is cancelled, the metric updates both emi.cream.CREAMCEDJS-DirectJobState and emi.cream.CREAMCEDJS-DirectJobSubmit with the final job status. These two metrics are updated (as passive checks) either via the Nagios command file or via NSCA. emi.cream.CREAMCEDJS-DirectJobSubmit is the metric that goes to the Metric Store Database.

Added:

Test

To test the probe, you first have to create a valid proxy.

State + Monit

First you have to "submit" a job:

/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo <vo> -x <path of the proxy> -H <CREAM hostname> -m emi.cream.CREAMCEDJS-DirectJobState --resource <CREAM CE url>

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it -m emi.cream.CREAMCEDJS-DirectJobState --resource cream-30.pd.infn.it:8443/cream-pbs-cert
OK: Job was submitted [https://cream-30.pd.infn.it:8443/CREAM126240562].
OK: Job was submitted [https://cream-30.pd.infn.it:8443/CREAM126240562].
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL

https://cream-30.pd.infn.it:8443/CREAM126240562
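
The job identifier is printed in square brackets on the first output line. A minimal sketch for pulling it out of captured probe output, for example to pass it to glite-ce-job-status later. The function name extract_job_id is hypothetical, and the pattern assumes the output format shown above:

```shell
# Hypothetical helper: read probe output on stdin and print the first
# job URL found inside "OK: Job was submitted [...]".
extract_job_id() {
    sed -n 's/^OK: Job was submitted \[\(https:[^]]*\)]\.*$/\1/p' | head -n 1
}

# Example with the output line shown above.
printf '%s\n' 'OK: Job was submitted [https://cream-30.pd.infn.it:8443/CREAM126240562].' \
    | extract_job_id
```

The probe prints the summary line twice, so the helper keeps only the first match.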

Then you can monitor the job:

/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo <vo> -x <path of the proxy> -H <CREAM hostname> -m emi.cream.CREAMCEDJS-DirectJobMonit --pass-check-dest active

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it -m emi.cream.CREAMCEDJS-DirectJobMonit --pass-check-dest active
OK: DONE.
metric results >>> <cream-30.pd.infn.it,emi.cream.CREAMCEDJS-DirectJobSubmit-dteam>



metric results >>> <cream-30.pd.infn.it,emi.cream.CREAMCEDJS-DirectJobState-dteam>



OK: Jobs processed - 1
OK: Jobs processed - 1
DONE : 1|jobs_processed=1;; DONE=1;; REALLY-RUNNING=0;; RUNNING=0;; REGISTERED=0;; PENDING=0;; IDLE=0;; HELD=0;; CANCELLED=0;; ABORTED=0;; UNKNOWN=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
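
The last output line carries the per-state job counters as Nagios performance data after the "|". A small sketch for reading a single counter from that line; the helper name perf_counter is hypothetical and it assumes the space-separated label=value;; layout shown above:

```shell
# Hypothetical helper: print the value of one counter from a Nagios
# performance-data line read on stdin (the part after the "|").
perf_counter() {
    # Drop everything up to the "|", put one label=value pair per line,
    # then print the number before the first ";" for the requested label.
    sed 's/^[^|]*|//' | tr ' ' '\n' \
        | awk -F'[=;]' -v n="$1" '$1 == n { print $2; exit }'
}

line='DONE : 1|jobs_processed=1;; DONE=1;; REALLY-RUNNING=0;; RUNNING=0;; REGISTERED=0;; PENDING=0;; IDLE=0;; HELD=0;; CANCELLED=0;; ABORTED=0;; UNKNOWN=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2'
printf '%s\n' "$line" | perf_counter DONE    # prints 1
```

The label is compared exactly, so RUNNING does not match REALLY-RUNNING and UNKNOWN does not match unknown.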

When the job finishes, the output file is retrieved and stored in /var/lib/gridprobes/<VO or FQAN>/emi.cream/CREAMCEDJS/<hostname>/jobOutput. The output file should contain the hostname of the worker node where the job ran.

[ale@cream-48 ~]$ cat /var/lib/gridprobes/dteam/emi.cream/CREAMCEDJS/cream-30.pd.infn.it/jobOutput/cream-30.pd.infn.it_8443_CREAM126240562/cream.out 
cream-wn-030.pn.pd.infn.it
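
In the example above, the per-job subdirectory name looks like the job URL with the scheme removed and ":" and "/" replaced by "_". A sketch that builds the directory path on that assumption; the function name job_output_dir is hypothetical and the naming rule is inferred from this single example only:

```shell
# Hypothetical helper: map a VO, CE hostname and CREAM job URL to the
# directory where the probe stored the job output in the example above.
job_output_dir() {
    vo=$1; host=$2; job_url=$3
    # "https://host:port/CREAMnnn" -> "host_port_CREAMnnn" (inferred rule)
    sub=$(printf '%s\n' "$job_url" | sed -e 's|^https://||' -e 's|[:/]|_|g')
    printf '%s\n' "/var/lib/gridprobes/$vo/emi.cream/CREAMCEDJS/$host/jobOutput/$sub"
}

job_output_dir dteam cream-30.pd.infn.it https://cream-30.pd.infn.it:8443/CREAM126240562
```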

State + Cancel

To test the "Cancel" metric easily, you need to modify the JDL template so that the job runs longer:

[ale@cream-48 ~]$ cat /usr/libexec/grid-monitoring/probes/emi.cream/CREAMDJS-jdl.template
[
Type="Job";
JobType="Normal";
#Executable = "<jdlExecutable>";
Executable = "/bin/sleep";
#Arguments = "<jdlArguments>";
Arguments = "100";
StdOutput = "cream.out";
StdError = "cream.out";
OutputSandbox = {"cream.out"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
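
A minimal sketch that prints a JDL equivalent to the modified template above, with a configurable sleep time. The function name make_sleep_jdl is hypothetical; the result would still have to be copied into the template file by hand, since nothing here shows a probe option for using an alternative template:

```shell
# Hypothetical helper: print a test JDL that runs /bin/sleep for the
# given number of seconds (default 100, as in the template above).
make_sleep_jdl() {
    seconds=${1:-100}
    cat <<EOF
[
Type="Job";
JobType="Normal";
Executable = "/bin/sleep";
Arguments = "$seconds";
StdOutput = "cream.out";
StdError = "cream.out";
OutputSandbox = {"cream.out"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
EOF
}

# Print a JDL for a job that sleeps for 300 seconds.
make_sleep_jdl 300
```

A longer sleep leaves more time to observe the RUNNING state before cancelling.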

Then you have to "submit" the job:

/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo <vo> -x <path of the proxy> -H <CREAM hostname> -m emi.cream.CREAMCEDJS-DirectJobState --resource <CREAM CE url>

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it -m emi.cream.CREAMCEDJS-DirectJobState --resource cream-30.pd.infn.it:8443/cream-pbs-cert
OK: Job was submitted [https://cream-30.pd.infn.it:8443/CREAM226348631].
OK: Job was submitted [https://cream-30.pd.infn.it:8443/CREAM226348631].
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL

https://cream-30.pd.infn.it:8443/CREAM226348631

Monitor the job until it reaches the RUNNING state:

/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo <vo> -x <path of the proxy> -H <CREAM hostname> -m emi.cream.CREAMCEDJS-DirectJobMonit --pass-check-dest active

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it -m emi.cream.CREAMCEDJS-DirectJobMonit --pass-check-dest active
metric results >>> <cream-30.pd.infn.it,emi.cream.CREAMCEDJS-DirectJobState-dteam>
OK: [RUNNING] https://cream-30.pd.infn.it:8443/CREAM226348631
OK: [RUNNING] https://cream-30.pd.infn.it:8443/CREAM226348631

glite-ce-job-status https://cream-30.pd.infn.it:8443/CREAM226348631

******  JobID=[https://cream-30.pd.infn.it:8443/CREAM226348631]
  Status        = [RUNNING]
OK: Jobs processed - 1
OK: Jobs processed - 1
[RUNNING] : 1|jobs_processed=1;; DONE=0;; REALLY-RUNNING=0;; RUNNING=1;; REGISTERED=0;; PENDING=0;; IDLE=0;; HELD=0;; CANCELLED=0;; ABORTED=0;; UNKNOWN=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
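
Repeating the monitoring call by hand can be scripted. The sketch below reruns an arbitrary command until its output contains "[RUNNING]", as in the transcript above. The function name wait_for_running and its arguments (maximum attempts, delay in seconds) are hypothetical:

```shell
# Hypothetical helper: run the given command repeatedly until its output
# contains "[RUNNING]", up to max_tries attempts, sleeping between tries.
# Returns 0 once the state is seen, 1 if it never shows up.
wait_for_running() {
    max_tries=$1; delay=$2; shift 2
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if "$@" 2>&1 | grep -qF '[RUNNING]'; then
            return 0
        fi
        try=$((try + 1))
        sleep "$delay"
    done
    return 1
}

# Usage sketch with the DirectJobMonit command shown above:
# wait_for_running 20 30 /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe \
#     --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it \
#     -m emi.cream.CREAMCEDJS-DirectJobMonit --pass-check-dest active
```

If the job finishes before a poll sees it running, the loop simply runs out of attempts, so the sleep time in the JDL should be comfortably longer than the polling delay.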

Then you can cancel it:

/usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo <vo> -x <path of the proxy> -H <CREAM hostname> -m emi.cream.CREAMCEDJS-DirectJobCancel

[ale@cream-48 ~]$ /usr/libexec/grid-monitoring/probes/emi.cream/CREAMCEDJS-probe --vo dteam -x /tmp/x509up_u501 -H cream-30.pd.infn.it -m emi.cream.CREAMCEDJS-DirectJobCancel
OK: job cancelled
OK: job cancelled
Testing from: cream-48.pd.infn.it
DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy
VOMS FQANs: /dteam/Role=NULL/Capability=NULL, /dteam/NGI_IT/Role=NULL/Capability=NULL
Job cancellation request sent:
glite-ce-job-cancel --noint https://cream-30.pd.infn.it:8443/CREAM226348631
Job bookkeeping files deleted.

You can then check manually whether the final status of the job is CANCELLED, as expected:

[ale@cream-48 ~]$ glite-ce-job-status https://cream-30.pd.infn.it:8443/CREAM226348631

******  JobID=[https://cream-30.pd.infn.it:8443/CREAM226348631]
   Status        = [CANCELLED]
   ExitCode      = []
   Description   = [Cancelled by user]
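
The manual check can also be scripted. A small sketch that extracts the bracketed value of the Status line from glite-ce-job-status output like the one above; the function name job_status is hypothetical and the pattern assumes this output layout:

```shell
# Hypothetical helper: read glite-ce-job-status output on stdin and
# print the value between the brackets on the first "Status = [...]" line.
job_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\([^]]*\)].*$/\1/p' | head -n 1
}

# Example with the line shown above.
printf '%s\n' '   Status        = [CANCELLED]' | job_status    # prints CANCELLED

# Usage sketch against a real job:
# [ "$(glite-ce-job-status "$job_id" | job_status)" = "CANCELLED" ] && echo "cancelled as expected"
```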

 