Difference: DjsCreamProbeNew (3 vs. 4)

Revision 4 2014-01-31 - MarcoVerlato

Line: 1 to 1
 
META TOPICPARENT name="NagiosProbes"
Changed:
<
<

New CREAM-CE direct job submission metrics

>
>

New CREAM-CE direct job submission metrics and WN probes

 
Changed:
<
<
The following metrics are a restructured version of the existing ones and provide a better approach for probing a CREAM CE.
>
>
The following metrics are a restructured version of the existing ones and provide a better approach for probing a CREAM CE and its worker nodes (WNs).
 
  1. cream_serviceInfo.py - get CREAM CE service info
  2. cream_allowedSubmission.py - check if the submission to the selected CREAM CE is allowed
  3. cream_jobSubmit.py - submit a job directly to the selected CREAM CE
  4. cream_jobCancel.py - cancel an active job.
  5. cream_jobPurge.py - purge a terminated job.
Added:
>
>
  1. WN-softver probe - check middleware version on WN (via cream_jobSubmit.py)
  2. WN-csh probe - check if WN has csh (via cream_jobSubmit.py)
  All of them are implemented in Python and are based on the cream-cli commands. They share the same logical structure and report their version and usage (i.e. help), including the list of options and their meaning. For example:
Line: 30 to 32
  -H HOSTNAME, --hostname=HOSTNAME The hostname of the CREAM service. -p PORT, --port=PORT The port of the service. [default: none]
Deleted:
<
<
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]
  -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False]
Added:
>
>
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]
  $ ./cream_serviceInfo.py --version cream_serviceInfo v.1.0
Line: 50 to 55
 The verbose mode (--verbose) can be enabled for each metric. It provides details about the probe execution by showing the internal commands:
Changed:
<
<
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 --verbose
>
>
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 -x /tmp/dteam.proxy --verbose
 executing command: /usr/bin/voms-proxy-info -timeleft invoking service info executing command: /usr/bin/glite-ce-service-info cream-41.pd.infn.it:8443
Changed:
<
<
Interface Version = [2.1] Service Version = [1.16.1 - EMI version: 3.5.1-1.el6] Description = [CREAM 2] Started at = [Thu Jan 2 19:06:51 2014] Submission enabled = [YES] Status = [RUNNING]
>
>
OK: Service Version = [1.16.1 - EMI version: 3.5.1-1.el6]
 

If an option or its value is wrong, the probe tries to explain the problem. For example, cream_serviceInfo does not support the --queue option:

Changed:
<
<
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 --queue creamtest1 --verbose
>
>
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 --queue creamtest1 -x /tmp/dteam.proxy --verbose
 Usage: cream_serviceInfo.py [options]

cream_serviceInfo.py: error: no such option: --queue

Line: 75 to 75
 If an error occurs while interacting with the CREAM CE, the probe reports details about the failure:
Changed:
<
<
$ ./cream_allowedSubmission.py --url https://cream-43.pd.infn.it:8443
>
>
$ ./cream_allowedSubmission.py --url https://cream-43.pd.infn.it:8443 -x /tmp/dteam.proxy
 command '/usr/bin/glite-ce-allowed-submission cream-43.pd.infn.it:8443' failed: return_code=1 details: ['2014-01-16 15:59:57,085 FATAL - Received NULL fault; the error is due to another cause: FaultString=[connection error] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] - FaultDetail=[Connection refused]\n']
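A minimal sketch of how a probe could wrap an external command and produce a failure message in the format shown above. This is not the code of the actual probes: the function name run_command is hypothetical, the sketch uses Python 3 (subprocess.run), and mapping failures and timeouts to CRITICAL is an assumption. Only the numeric exit codes follow the standard Nagios plugin convention.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_command(argv, timeout_sec=60):
    """Run an external command and return (nagios_status, message).

    Hypothetical helper: the message layout mimics the failure output
    shown in this page, but the real probes may build it differently.
    """
    cmd = " ".join(argv)
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        # Assumption: a timeout is reported as CRITICAL.
        return CRITICAL, "command '%s' timed out after %d sec" % (cmd, timeout_sec)
    except OSError as exc:
        # The executable is missing or cannot be started.
        return UNKNOWN, "command '%s' could not be started: %s" % (cmd, exc)

    if proc.returncode != 0:
        # Keep line endings so the details resemble the list shown above.
        details = (proc.stdout + proc.stderr).splitlines(True)
        return CRITICAL, "command '%s' failed: return_code=%d details: %s" % (
            cmd, proc.returncode, details)
    return OK, proc.stdout.strip()


if __name__ == "__main__":
    status, message = run_command(sys.argv[1:] or ["/bin/hostname"])
    print(message)
    sys.exit(status)
```

Returning a status together with a message keeps the decision about the exit code in one place, so the caller only has to print the message and exit.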
Line: 95 to 95
  -p PORT, --port=PORT The port of the service. [default: none] -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False] -u URL, --url=URL The status endpoint URL of the service. Example: https://[:]
Line: 103 to 106
 In order to get information about the CREAM service on the host https://cream-41.pd.infn.it:8443, use the following command:
Changed:
<
<
$ ./cream_serviceInfo.py --url https://cream-41.pd.infn.it:8443 Interface Version = [2.1] Service Version = [1.16.1 - EMI version: 3.5.1-1.el6] Description = [CREAM 2] Started at = [Thu Jan 2 19:06:51 2014] Submission enabled = [YES] Status = [RUNNING]
>
>
$ ./cream_serviceInfo.py --url https://cream-41.pd.infn.it:8443 -x /tmp/dteam.proxy OK: Service Version = [1.16.1 - EMI version: 3.5.1-1.el6]
 

or similarly:

Changed:
<
<
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 Interface Version = [2.1] Service Version = [1.16.1 - EMI version: 3.5.1-1.el6] Description = [CREAM 2] Started at = [Thu Jan 2 19:06:51 2014] Submission enabled = [YES] Status = [RUNNING]
>
>
$ ./cream_serviceInfo.py --hostname cream-41.pd.infn.it --port 8443 -x /tmp/dteam.proxy OK: Service Version = [1.16.1 - EMI version: 3.5.1-1.el6]
 

cream_allowedSubmission

Line: 137 to 130
  -H HOSTNAME, --hostname=HOSTNAME The hostname of the CREAM service. -p PORT, --port=PORT The port of the service. [default: none]
Deleted:
<
<
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]
  -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False]
Added:
>
>
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]
 

Notice: using the --url option is equivalent to specifying both the --hostname and --port options:

Changed:
<
<
$ ./cream_allowedSubmission.py --hostname cream-41.pd.infn.it --port 8443 ENABLED
>
>
$ ./cream_allowedSubmission.py --hostname cream-41.pd.infn.it --port 8443 -x /tmp/dteam.proxy OK: ENABLED
 
Changed:
<
<
$ ./cream_allowedSubmission.py --url https://cream-41.pd.infn.it:8443 ENABLED
>
>
$ ./cream_allowedSubmission.py --url https://cream-41.pd.infn.it:8443 -x /tmp/dteam.proxy OK: ENABLED
 

The verbose mode highlights the internal commands:

Changed:
<
<
$ ./cream_allowedSubmission.py --url https://cream-41.pd.infn.it:8443 --verbose
>
>
$ ./cream_allowedSubmission.py --url https://cream-41.pd.infn.it:8443 -x /tmp/dteam.proxy --verbose
 executing command: /usr/bin/voms-proxy-info -timeleft invoking allowedSubmission executing command: /usr/bin/glite-ce-allowed-submission cream-41.pd.infn.it:8443
Changed:
<
<
ENABLED
>
>
OK: ENABLED
 
Line: 178 to 174
  -H HOSTNAME, --hostname=HOSTNAME The hostname of the CREAM service. -p PORT, --port=PORT The port of the service. [default: none]
Deleted:
<
<
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]/cream--
  -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False]
Added:
>
>
-u URL, --url=URL The status endpoint URL of the service. Example: https://[:]/cream--
  -l LRMS, --lrms=LRMS The LRMS name (e.g.: 'lsf', 'pbs' etc) -q QUEUE, --queue=QUEUE The queue name (e.g.: 'creamtest') -j JDL, --jdl=JDL The jdl path
Added:
>
>
-d DIR, --dir=DIR The output sandbox path
 

The --url (-u) directive must be used to target the probe to a specific CREAM CE identified by its identifier (i.e. CREAM CE ID). Alternatively, it is possible to specify the CREAM CE identifier by using the --hostname, --port, --lrms and --queue options, which are mutually exclusive with the --url option.
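The relation between the --url form and the four separate options can be sketched as follows. The layout https://HOST:PORT/cream-LRMS-QUEUE is inferred from the example URLs in this page (e.g. https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1). The function names are hypothetical, and the sketch assumes that the LRMS name contains no hyphen.

```python
from urllib.parse import urlsplit


def parse_ce_id(url):
    """Split a CREAM CE ID into hostname, port, lrms and queue.

    Assumes the path has the form /cream-<lrms>-<queue> and that the
    LRMS name has no hyphen (the queue name may contain hyphens).
    """
    parts = urlsplit(url)
    fields = parts.path.strip("/").split("-", 2)
    if len(fields) != 3 or fields[0] != "cream" or not all(fields):
        raise ValueError("not a CREAM CE ID: %r" % url)
    return {
        "hostname": parts.hostname,
        "port": parts.port,
        "lrms": fields[1],
        "queue": fields[2],
    }


def build_ce_id(hostname, port, lrms, queue):
    """Inverse operation: build the --url value from the four options."""
    return "https://%s:%d/cream-%s-%s" % (hostname, port, lrms, queue)


if __name__ == "__main__":
    ce = parse_ce_id("https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1")
    print(ce)
    print(build_ce_id(**ce))
```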

Line: 199 to 199
 Type="Job"; JobType="Normal"; Executable = "/bin/hostname";
Added:
>
>
Arguments = "-s";
 StdOutput = "std.out"; StdError = "std.err"; OutputSandbox = {"std.out","std.err"};
Changed:
<
<
OutputSandboxBaseDestUri="gsiftp://localhost"
>
>
OutputSandboxBaseDestUri="gsiftp://localhost";
 ]

If verbose mode is disabled, the output should look like this:

Changed:
<
<
$ ./cream_jobSubmit.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./hostname.jdl Job terminated with status DONE-OK
>
>
$ ./cream_jobSubmit.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl DONE-OK: prod-wn-001
 

Notice: using the --url option is equivalent to specifying the --hostname, --port, --lrms and --queue options:

Changed:
<
<
$ ./cream_jobSubmit.py --hostname cream-41.pd.infn.it --port 8443 --lrms lsf --queue creamtest1 --jdl ./hostname.jdl Job terminated with status DONE-OK
>
>
$ ./cream_jobSubmit.py --hostname cream-41.pd.infn.it --port 8443 --lrms lsf --queue creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl DONE-OK: prod-wn-001
 

If verbose mode is enabled, the output of the above command should look like this:

Changed:
<
<
$ ./cream_jobSubmit.py --hostname cream-41.pd.infn.it --port 8443 --lrms lsf --queue creamtest1 --jdl ./hostname.jdl --verbose
>
>
$ ./cream_jobSubmit.py --hostname cream-41.pd.infn.it --port 8443 --lrms lsf --queue creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl --verbose
 executing command: /usr/bin/voms-proxy-info -timeleft
Changed:
<
<
executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 ./hostname.jdl ['2014-01-16 13:52:57,305 DEBUG - Using certificate proxy file [/tmp/x509up_u0]\n', '2014-01-16 13:52:57,324 DEBUG - VO from certificate=[dteam]\n', '2014-01-16 13:52:57,324 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-16 13:52:57,324 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140116-135257.log]\n', '2014-01-16 13:52:57,328 INFO - certUtil::generateUniqueID() - Generated DelegationID: [99b19aafc98e11bb7956cc58d901bd860697227d]\n', '2014-01-16 13:52:59,645 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/hostname"; Type = "Job"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[./hostname.jdl]\n', '2014-01-16 13:53:00,022 DEBUG - Will invoke JobStart for JobID [CREAM304354901]\n', 'https://cream-41.pd.infn.it:8443/CREAM304354901\n'] job id: https://cream-41.pd.infn.it:8443/CREAM304354901
>
>
executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 hostname.jdl ['2014-01-31 12:18:12,740 DEBUG - Using certificate proxy file [/tmp/dteam.proxy]\n', '2014-01-31 12:18:12,756 DEBUG - VO from certificate=[dteam]\n', '2014-01-31 12:18:12,756 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-31 12:18:12,756 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140131-121812.log]\n', '2014-01-31 12:18:12,760 INFO - certUtil::generateUniqueID() - Generated DelegationID: [e0b6d023c766400e8b27e51cf5a2b1fa179d78f9]\n', '2014-01-31 12:18:14,606 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/hostname"; Type = "Job"; Arguments = "-s"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[hostname.jdl]\n', '2014-01-31 12:18:14,942 DEBUG - Will invoke JobStart for JobID [CREAM446576112]\n', 'https://cream-41.pd.infn.it:8443/CREAM446576112\n'] job id: https://cream-41.pd.infn.it:8443/CREAM446576112
 invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM304354901 job status: REALLY-RUNNING invoking jobStatus executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM304354901
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM446576112 ['\n', '****** JobID=[https://cream-41.pd.infn.it:8443/CREAM446576112]\n', '\tStatus = [DONE-OK]\n', '\tExitCode = [0]\n', '\n', '\n'] exitCode= ExitCode = [0]
 job status: DONE-OK invoking getOutputSandbox
Changed:
<
<
executing command: /usr/bin/glite-ce-job-output --noint https://cream-41.pd.infn.it:8443/CREAM187996258 output sandbox dir: ./cream-41.pd.infn.it_8443_CREAM187996258
>
>
executing command: /usr/bin/glite-ce-job-output --noint --dir /tmp https://cream-41.pd.infn.it:8443/CREAM446576112 output sandbox dir: /tmp/cream-41.pd.infn.it_8443_CREAM446576112
 invoking jobPurge
Changed:
<
<
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM187996258 Job terminated with status DONE-OK
>
>
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM446576112 DONE-OK: prod-wn-001
 
Changed:
<
<
Notice the output sandbox dir: ./cream-41.pd.infn.it_8443_CREAM187996258. This is the output sandbox directory containing all the produced files:
>
>
Notice the output sandbox dir: ./cream-41.pd.infn.it_8443_CREAM446576112. This is the output sandbox directory containing all the produced files:
 
Changed:
<
<
$ ls -la ./cream-41.pd.infn.it_8443_CREAM187996258
>
>
$ ls -la ./cream-41.pd.infn.it_8443_CREAM446576112
 total 12 drwxr-xr-x 2 root root 4096 17 gen 16:20 . dr-xr-x---. 23 root root 4096 17 gen 16:20 ..
Line: 267 to 271
  -p PORT, --port=PORT The port of the service. [default: none] -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False] -u URL, --url=URL The status endpoint URL of the service. Example: https://[:]/cream--
Line: 280 to 287
 For example, consider the job (i.e. hostname.jdl) of the above metric. In this case the probe will fail because the job has already terminated before the glite-ce-job-cancel command is executed:
Changed:
<
<
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./hostname.jdl
>
>
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl
 job already terminated
Line: 300 to 307
  The output of the probe should look like this:
Changed:
<
<
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./sleep.jdl job cancelled
>
>
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./sleep.jdl OK: job cancelled
 

or like this with --verbose option specified:

Changed:
<
<
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./sleep.jdl --verbose
>
>
$ ./cream_jobCancel.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./sleep.jdl --verbose
 executing command: /usr/bin/voms-proxy-info -timeleft executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 ./sleep.jdl
Changed:
<
<
['2014-01-16 17:22:49,728 DEBUG - Using certificate proxy file [/tmp/x509up_u0]\n', '2014-01-16 17:22:49,744 DEBUG - VO from certificate=[dteam]\n', '2014-01-16 17:22:49,745 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-16 17:22:49,745 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140116-172249.log]\n', '2014-01-16 17:22:49,747 INFO - certUtil::generateUniqueID() - Generated DelegationID: [69e3bdaf4e818e1f71f1b7ff442f74583c869b84]\n', '2014-01-16 17:22:52,165 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "cream.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/sleep"; Type = "Job"; Arguments = "200"; JobType = "Normal"; StdError = "cream.out" ] - JDL File=[./sleep.jdl]\n', '2014-01-16 17:22:52,513 DEBUG - Will invoke JobStart for JobID [CREAM437649288]\n', 'https://cream-41.pd.infn.it:8443/CREAM437649288\n'] job id: https://cream-41.pd.infn.it:8443/CREAM437649288
>
>
['2014-01-31 12:30:42,469 DEBUG - Using certificate proxy file [/tmp/dteam.proxy]\n', '2014-01-31 12:30:42,489 DEBUG - VO from certificate=[dteam]\n', '2014-01-31 12:30:42,489 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-31 12:30:42,489 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140131-123042.log]\n', '2014-01-31 12:30:42,493 INFO - certUtil::generateUniqueID() - Generated DelegationID: [59882bed5243b5788a83404ad027cf571319ef79]\n', '2014-01-31 12:30:44,059 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "cream.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/sleep"; Type = "Job"; Arguments = "200"; JobType = "Normal"; StdError = "cream.out" ] - JDL File=[./sleep.jdl]\n', '2014-01-31 12:30:44,368 DEBUG - Will invoke JobStart for JobID [CREAM076606856]\n', 'https://cream-41.pd.infn.it:8443/CREAM076606856\n'] job id: https://cream-41.pd.infn.it:8443/CREAM076606856
 invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM437649288
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM076606856 ['\n', '****** JobID=[https://cream-41.pd.infn.it:8443/CREAM076606856]\n', '\tStatus = [REALLY-RUNNING]\n', '\n', '\n']
 job status: REALLY-RUNNING invoking jobCancel
Changed:
<
<
executing command: /usr/bin/glite-ce-job-cancel --noint https://cream-41.pd.infn.it:8443/CREAM437649288
>
>
executing command: /usr/bin/glite-ce-job-cancel --noint https://cream-41.pd.infn.it:8443/CREAM076606856
 invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM437649288
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM076606856 ['\n', '****** JobID=[https://cream-41.pd.infn.it:8443/CREAM076606856]\n', '\tStatus = [REALLY-RUNNING]\n', '\n', '\n']
 job status: REALLY-RUNNING invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM437649288
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM076606856 ['\n', '****** JobID=[https://cream-41.pd.infn.it:8443/CREAM076606856]\n', '\tStatus = [CANCELLED]\n', '\tExitCode = []\n', '\tDescription = [Cancelled by user]\n', '\n', '\n'] exitCode= ExitCode = []
 job status: CANCELLED invoking jobPurge
Changed:
<
<
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM437649288 job cancelled
>
>
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM076606856 OK: job cancelled
 
Line: 344 to 356
  -p PORT, --port=PORT The port of the service. [default: none] -x PROXY, --proxy=PROXY The proxy path
Added:
>
>
-t TIMEOUT, --timeout=TIMEOUT Probe execution time limit. [default:
                1. sec]
  -v, --verbose verbose mode [default: False] -u URL, --url=URL The status endpoint URL of the service. Example: https://[:]/cream--
Line: 354 to 369
 
Changed:
<
<
$ ./cream_jobPurge.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./hostname.jdl job purged
>
>
$ ./cream_jobPurge.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl OK: job purged
 
Added:
>
>
 
Changed:
<
<
$ ./cream_jobPurge.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 --jdl ./hostname.jdl --verbose
>
>
$ ./cream_jobPurge.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy --jdl ./hostname.jdl --verbose
 executing command: /usr/bin/voms-proxy-info -timeleft
Changed:
<
<
executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 ./hostname.jdl ['2014-01-16 14:02:48,282 DEBUG - Using certificate proxy file [/tmp/x509up_u0]\n', '2014-01-16 14:02:48,301 DEBUG - VO from certificate=[dteam]\n', '2014-01-16 14:02:48,301 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-16 14:02:48,302 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140116-140248.log]\n', '2014-01-16 14:02:48,305 INFO - certUtil::generateUniqueID() - Generated DelegationID: [85a80615fc8046baf6dce2a27b708fc82ecbce55]\n', '2014-01-16 14:02:51,105 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/hostname"; Type = "Job"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[./hostname.jdl]\n', '2014-01-16 14:02:51,504 DEBUG - Will invoke JobStart for JobID [CREAM691625071]\n', 'https://cream-41.pd.infn.it:8443/CREAM691625071\n'] job id: https://cream-41.pd.infn.it:8443/CREAM691625071
>
>
executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 hostname.jdl ['2014-01-31 12:27:50,347 DEBUG - Using certificate proxy file [/tmp/dteam.proxy]\n', '2014-01-31 12:27:50,364 DEBUG - VO from certificate=[dteam]\n', '2014-01-31 12:27:50,364 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-31 12:27:50,364 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140131-122750.log]\n', '2014-01-31 12:27:50,367 INFO - certUtil::generateUniqueID() - Generated DelegationID: [5978b3f267779dbf4d691889ea316a14dafeb4bb]\n', '2014-01-31 12:27:51,780 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "/bin/hostname"; Type = "Job"; Arguments = "-s"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; StdError = "std.err" ] - JDL File=[hostname.jdl]\n', '2014-01-31 12:27:52,109 DEBUG - Will invoke JobStart for JobID [CREAM973322659]\n', 'https://cream-41.pd.infn.it:8443/CREAM973322659\n'] job id: https://cream-41.pd.infn.it:8443/CREAM973322659
 invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM691625071
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM973322659 ['\n', '****** JobID=[https://cream-41.pd.infn.it:8443/CREAM973322659]\n', '\tStatus = [DONE-OK]\n', '\tExitCode = [0]\n', '\n', '\n'] exitCode= ExitCode = [0]
 job status: DONE-OK invoking jobPurge
Changed:
<
<
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM691625071
>
>
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM973322659
 invoking jobStatus
Changed:
<
<
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM691625071 job purged
>
>
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM973322659 OK: job purged
 
Added:
>
>

WN-softver probe

This probe checks the middleware version on a WN managed by the CREAM-CE. It makes use of cream_jobSubmit.py in the following way:
$ ./cream_jobSubmit.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy -j ./WN-softver.jdl
DONE-OK: prod-wn-002 has EMI 1.11.0-1
where
$ cat WN-softver.jdl
[
Type="Job";
JobType="Normal";
Executable = "WN-softver.sh";
#Arguments = "a b c";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"WN-softver.sh"};
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
and WN-softver.sh is attached.

The verbose option enabled gives the following output:

$ ./cream_jobSubmit.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy -j ./WN-softver.jdl --verbose
executing command: /usr/bin/voms-proxy-info -timeleft
executing command: /usr/bin/glite-ce-job-submit -d -a -r cream-41.pd.infn.it:8443/cream-lsf-creamtest1 ./WN-softver.jdl
['2014-01-31 13:05:05,802 DEBUG - Using certificate proxy file [/tmp/dteam.proxy]\n', '2014-01-31 13:05:05,823 DEBUG - VO from certificate=[dteam]\n', '2014-01-31 13:05:05,824 WARN - No configuration file suitable for loading. Using built-in configuration\n', '2014-01-31 13:05:05,824 DEBUG - Logfile is [/tmp/glite_cream_cli_logs/glite-ce-job-submit_CREAM_root_20140131-130505.log]\n', '2014-01-31 13:05:05,824 DEBUG - Processing file [WN-softver.sh]...\n', '2014-01-31 13:05:05,824 DEBUG - Adding absolute path [/root/WN-softver.sh]...\n', '2014-01-31 13:05:05,824 DEBUG - Inserting mangled InputSandbox in JDL: [{"/root/WN-softver.sh"}]...\n', '2014-01-31 13:05:05,827 INFO - certUtil::generateUniqueID() - Generated DelegationID: [b18342097b01b309cc9112a683d4c6bd15d35796]\n', '2014-01-31 13:05:07,517 DEBUG - Registering to [https://cream-41.pd.infn.it:8443/ce-cream/services/CREAM2] JDL=[ StdOutput = "std.out"; BatchSystem = "lsf"; QueueName = "creamtest1"; Executable = "WN-softver.sh"; Type = "Job"; JobType = "Normal"; OutputSandboxBaseDestUri = "gsiftp://localhost"; OutputSandbox = { "std.out","std.err" }; InputSandbox = { "/root/WN-softver.sh" }; StdError = "std.err" ] - JDL File=[./WN-softver.jdl]\n', '2014-01-31 13:05:07,870 DEBUG - JobID=[https://cream-41.pd.infn.it:8443/CREAM680089855]\n', '2014-01-31 13:05:07,871 DEBUG - UploadURL=[gsiftp://cream-41.pd.infn.it/var/glite/cream_sandbox/dteam/CN_Marco_Verlato_L_Padova_OU_Personal_Certificate_O_INFN_C_IT_dteam_Role_NULL_Capability_NULL_dteam019/68/CREAM680089855/ISB]\n', '2014-01-31 13:05:07,873 INFO - Sending file [gsiftp://cream-41.pd.infn.it/var/glite/cream_sandbox/dteam/CN_Marco_Verlato_L_Padova_OU_Personal_Certificate_O_INFN_C_IT_dteam_Role_NULL_Capability_NULL_dteam019/68/CREAM680089855/ISB/WN-softver.sh]\n', '2014-01-31 13:05:08,044 DEBUG - Will invoke JobStart for JobID [CREAM680089855]\n', 'https://cream-41.pd.infn.it:8443/CREAM680089855\n']
job id: https://cream-41.pd.infn.it:8443/CREAM680089855
invoking jobStatus
executing command: /usr/bin/glite-ce-job-status https://cream-41.pd.infn.it:8443/CREAM680089855
['\n', '******  JobID=[https://cream-41.pd.infn.it:8443/CREAM680089855]\n', '\tStatus        = [DONE-OK]\n', '\tExitCode      = [0]\n', '\n', '\n']
exitCode=       ExitCode      = [0]

job status: DONE-OK
invoking getOutputSandbox
executing command: /usr/bin/glite-ce-job-output --noint https://cream-41.pd.infn.it:8443/CREAM680089855
output sandbox dir: ./cream-41.pd.infn.it_8443_CREAM680089855
invoking jobPurge
executing command: /usr/bin/glite-ce-job-purge --noint https://cream-41.pd.infn.it:8443/CREAM680089855
DONE-OK: prod-wn-002 has EMI 1.11.0-1
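
The job status lines printed in the verbose output above have a regular "Name = [value]" shape, so they can be parsed with a small helper. This is only a sketch of one possible approach; the function name parse_job_status is hypothetical and the real probes may extract the fields differently.

```python
import re

# Matches fields such as "\tStatus        = [DONE-OK]" or "JobID=[https://...]".
_FIELD = re.compile(r"(\w+)\s*=\s*\[(.*)\]")


def parse_job_status(lines):
    """Collect the 'Name = [value]' fields of glite-ce-job-status output."""
    fields = {}
    for line in lines:
        match = _FIELD.search(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    output = [
        "\n",
        "******  JobID=[https://cream-41.pd.infn.it:8443/CREAM680089855]\n",
        "\tStatus        = [DONE-OK]\n",
        "\tExitCode      = [0]\n",
        "\n",
        "\n",
    ]
    print(parse_job_status(output))
```

An empty value such as "ExitCode = []", which appears for cancelled jobs, is returned as an empty string.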
 
Changed:
<
<
-- LisaZangrando - 2014-01-16
>
>

WN-csh probe

This probe checks that csh is available on a WN managed by the CREAM-CE. It makes use of cream_jobSubmit.py in the following way:
$ ./cream_jobSubmit.py --url https://cream-41.pd.infn.it:8443/cream-lsf-creamtest1 -x /tmp/dteam.proxy -j ./WN-csh.jdl
DONE-OK: prod-wn-002 has csh
where
$ cat WN-csh.jdl
[
Type="Job";
JobType="Normal";
Executable = "WN-csh.sh";
#Arguments = "a b c";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"WN-csh.sh"};
OutputSandbox = {"std.out","std.err"};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
and WN-csh.sh is attached.
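
The WN-softver.jdl and WN-csh.jdl files differ only in the name of the script they ship in the input sandbox. As an illustration, a JDL for a further WN check could be rendered from a template like the one below. The template text is copied from the JDL files above (without the commented Arguments line); the names JDL_TEMPLATE and render_wn_jdl are hypothetical and not part of the probes.

```python
# Doubled braces are literal braces in str.format templates.
JDL_TEMPLATE = """[
Type="Job";
JobType="Normal";
Executable = "{script}";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {{"{script}"}};
OutputSandbox = {{"std.out","std.err"}};
OutputSandboxBaseDestUri="gsiftp://localhost";
]
"""


def render_wn_jdl(script):
    """Return the JDL text for a WN probe that runs the given script."""
    return JDL_TEMPLATE.format(script=script)


if __name__ == "__main__":
    print(render_wn_jdl("WN-csh.sh"))
```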

Deployment example

On a test instance of a Nagios server (version 3.5.0), we deployed the files needed to run the probes described above in the following directories:
$ ls -l /usr/libexec/grid-monitoring/probes/emi.cream/
total 48
-rwxr-xr-x 1 root root  1361 Jan 30 16:58 cream_allowedSubmission.py
-rwxr-xr-x 1 root root  2494 Jan 30 17:00 cream_jobCancel.py
-rwxr-xr-x 1 root root  2103 Jan 30 17:01 cream_jobPurge.py
-rwxr-xr-x 1 root root  2972 Jan 31 12:42 cream_jobSubmit.py
-rwxr-xr-x 1 root root 15527 Jan 30 16:29 cream.py
-rwxr-xr-x 1 root root  1416 Jan 31 12:42 cream_serviceInfo.py
-rw-r--r-- 1 root root   213 Jan 29 14:26 hostname.jdl
-rw-r--r-- 1 root root   129 Jan 30 16:21 sleep.jdl
drwxr-xr-x 2 root root  4096 Jan 31 11:34 wn
and
$ ls -l /usr/libexec/grid-monitoring/probes/emi.cream/wn
total 16
-rw-r--r-- 1 root root  292 Jan 31 11:34 WN-csh.jdl
-rwxr-xr-x 1 root root  603 Jan 31 11:34 WN-csh.sh
-rw-r--r-- 1 root root  300 Jan 31 11:34 WN-softver.jdl
-rwxr-xr-x 1 root root 1144 Jan 31 11:34 WN-softver.sh

We then defined the new services by adding the following lines to the file /etc/nagios/objects/services.cfg:

define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.CEDIRECT-AllowedSubmission
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_allowedSubmission.py!60!-x /tmp/dteam.proxy -p 8443
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}
define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.CEDIRECT-JobCancel
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_jobCancel.py!60!-x /tmp/dteam.proxy -p 8443 -l lsf -q creamtest1 -j /usr/libexec/grid-monitoring/probes/emi.cream/sleep.jdl
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}
define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.CEDIRECT-JobPurge
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_jobPurge.py!60!-x /tmp/dteam.proxy -p 8443 -l lsf -q creamtest1 -j /usr/libexec/grid-monitoring/probes/emi.cream/hostname.jdl
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}

define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.CEDIRECT-ServiceInfo
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_serviceInfo.py!60!-x /tmp/dteam.proxy -p 8443
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}
define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.CEDIRECT-JobSubmit
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_jobSubmit.py!60!-x /tmp/dteam.proxy -p 8443 -l lsf -q creamtest1 -j /usr/libexec/grid-monitoring/probes/emi.cream/hostname.jdl --dir /tmp
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}
define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.WN-Softver
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_jobSubmit.py!60!-x /tmp/dteam.proxy -p 8443 -l lsf -q creamtest1 -j /usr/libexec/grid-monitoring/probes/emi.cream/wn/WN-softver.jdl --dir /tmp
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}
define service{
        use                             local-service
        host_name                       cream-41.pd.infn.it
        service_description             emi.cream.WN-Csh
        check_command                   ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/cream_jobSubmit.py!60!-x /tmp/dteam.proxy -p 8443 -l lsf -q creamtest1 -j /usr/libexec/grid-monitoring/probes/emi.cream/wn/WN-csh.jdl --dir /tmp
        normal_check_interval           6
        retry_check_interval            3
        max_check_attempts              2
        obsess_over_service             0
}

The check_command ncg_check_native was defined in the file /etc/nagios/objects/commands.cfg as below:

define command{
        command_name                    ncg_check_native
        command_line                    $ARG1$ -H $HOSTNAME$ -t $ARG2$ $ARG3$
}
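
To see which command line Nagios ends up running for one of the services above, the macro substitution can be imitated with a few lines of Python. This is a simplified illustration only: the function expand_check_command is hypothetical, and real Nagios macro processing also handles escaping and many more macros.

```python
def expand_check_command(check_command, command_line, hostname):
    """Imitate how Nagios fills $ARGn$ and $HOSTNAME$ in a command_line.

    check_command is the service value, e.g. "name!arg1!arg2!arg3";
    the arguments after the command name are separated by "!".
    """
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTNAME$", hostname)
    for index, arg in enumerate(args, start=1):
        expanded = expanded.replace("$ARG%d$" % index, arg)
    return name, expanded


if __name__ == "__main__":
    name, line = expand_check_command(
        "ncg_check_native!/usr/libexec/grid-monitoring/probes/emi.cream/"
        "cream_allowedSubmission.py!60!-x /tmp/dteam.proxy -p 8443",
        "$ARG1$ -H $HOSTNAME$ -t $ARG2$ $ARG3$",
        "cream-41.pd.infn.it",
    )
    print(name)
    print(line)
```

With these definitions, $ARG1$ is the probe path, $ARG2$ is the timeout passed to -t, and $ARG3$ carries the remaining probe options, while -H receives the host_name of the service.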

When the CREAM-CE is working properly, the Nagios output looks like this:

CEDIRECT-probes.jpg

If instead the CREAM-CE is not reachable (e.g. tomcat down in the CE), you will see:

CREAM-failure.jpg

and in the detail of the error message you will recognize that the connection was refused:

CREAM-failure-d.jpg

If instead e.g. the GRIDFTP server of the CREAM-CE is down, you will see:

GRIDFTP-failure.jpg

i.e. only the probes that do not involve file transfers via GRIDFTP are still OK. Among the others, the ones that wait for the job to reach a terminal state will time out, while the WN-* probes will fail because they try to transfer the sandbox. In fact, the status information field shows the GRIDFTP problem (ERROR - data_cb_read() - globus_xio: Unable to connect to cream-41.pd.infn.it:2811).

-- LisaZangrando - 2014-01-16
-- MarcoVerlato - 2014-01-31

META FILEATTACHMENT attachment="WN-softver.sh" attr="" comment="" date="1391171091" name="WN-softver.sh" path="WN-softver.sh" size="1144" user="MarcoVerlato" version="1"
META FILEATTACHMENT attachment="WN-csh.sh" attr="" comment="" date="1391171091" name="WN-csh.sh" path="WN-csh.sh" size="603" user="MarcoVerlato" version="1"
META FILEATTACHMENT attachment="CEDIRECT-probes.jpg" attr="" comment="new CREAM-CE Nagios probes in action" date="1391174324" name="CEDIRECT-probes.jpg" path="CEDIRECT-probes.jpg" size="309312" user="MarcoVerlato" version="1"
META FILEATTACHMENT attachment="CREAM-failure.jpg" attr="" comment="Screenshots when CREAM-CE is down" date="1391175849" name="CREAM-failure.jpg" path="CREAM-failure.jpg" size="420872" user="MarcoVerlato" version="1"
META FILEATTACHMENT attachment="CREAM-failure-d.jpg" attr="" comment="Screenshots when CREAM-CE is down" date="1391175849" name="CREAM-failure-d.jpg" path="CREAM-failure-d.jpg" size="338847" user="MarcoVerlato" version="1"
META FILEATTACHMENT attachment="GRIDFTP-failure.jpg" attr="" comment="Screenshot when GRIDFTP is down" date="1391178142" name="GRIDFTP-failure.jpg" path="GRIDFTP-failure.jpg" size="357316" user="MarcoVerlato" version="1"
 