Difference: WebDocumentationz (3 vs. 4)

Revision 4 - 2008-01-31 - DaniloDongiovanni

Line: 1 to 1
 
META TOPICPARENT name="WebDownloadz"

Overview

Changed:
<
<
WMSMON Main Page Open the CNAF instance at
https://cert-wms-01:8443/wmsmon/main/main.php
Notice that you need a certificate in your browser in order to access it.
>
>
WMSMON Main Page Open the CNAF instance at

https://cert-wms-01:8443/wmsmon/main/main.php

Notice that you need a certificate in your browser in order to access it.

 Once in the main page you will see a table with one row for each monitored WMS instance, giving a synthetic overview of the instance status across the columns.
A filter to find VO-dedicated WMS instances is also provided.
A short automatic help appears when the mouse pointer is positioned over the main buttons and variables.
Changed:
<
<
The following columns are reported:

>
>
The following columns are reported:
 
WMS, DATE, RUNNING JOBS, IDLE JOBS, WM QUEUE, JC QUEUE, VO VIEWS, LB EVENTS QUEUE, CPU LOAD, SANDBOX PARTITION, GENERAL STATUS, DAEMONS STATUS
Changed:
<
<
Specific WMS instance hostname. Date-time of last measurement Number of jobs in the state "Running" in the condor Number of jobs in 'Idle' state in condor Number of entries in input.fl Number of entries in queue.fl This is calculated as:
-number of VO Views used in last Match making from workload Manager as reported in 'workload_manager.log'
or
-number of entries in ismdump.fl if last MM is older than 1 hour.
None is returned in case none of the two measures above is successful.
Number of dg20logd_* files in directory /var/tmp. (Events not yet stored in the LB server DB) Average load of the machine during the past 15 minutes Occupancy (in %) of directory /var/glite/SandboxDir This is a FLAG:
-Green-Yellow-Red depending on whether all main variables status measured is OK, or at least one variable is in Warning or Error status respectively
This is a FLAG:
-Green-Red depending on whether all WMSLB daemons returned a normal status, or at least one returns an Error status


Clicking on WMS hostname or one of the STATUS flags you will access the relative detailed data page, described below.

>
>
WMS: Specific WMS instance hostname
DATE: Date-time of the last measurement
RUNNING JOBS: Number of jobs in the "Running" state in Condor
IDLE JOBS: Number of jobs in the "Idle" state in Condor
WM QUEUE: Number of entries in input.fl
JC QUEUE: Number of entries in queue.fl
VO VIEWS: Number of VO views available for the WMS Match Making
LB EVENTS QUEUE: Number of dg20logd_* files in directory /var/tmp (events not yet stored in the LB server DB)
CPU LOAD: Average load of the machine during the past 15 minutes
SANDBOX PARTITION: Occupancy (in %) of directory /var/glite/SandboxDir
GENERAL STATUS: Green-Yellow-Red flag: green if the status of all main measured variables is OK, yellow if at least one variable is in Warning status, red if at least one variable is in Error status
DAEMONS STATUS: Green-Red flag: green if all WMSLB daemons returned a normal status, red if at least one returned an Error

Clicking on the WMS hostname or on one of the STATUS flags opens the corresponding detailed data page, described below.
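
As an illustration of how the two status flags could be derived, here is a minimal Python sketch. The function names and the "OK" / "Warning" / "Error" strings are hypothetical, and the rule that Error takes precedence over Warning is an assumption; the actual wmsmon code may differ.

```python
from typing import Iterable

# Hypothetical status labels; the real wmsmon values are not documented here.
OK, WARNING, ERROR = "OK", "Warning", "Error"


def general_status(variable_states: Iterable[str]) -> str:
    """Return the GENERAL STATUS flag colour for a set of variable states."""
    states = list(variable_states)
    if ERROR in states:      # assumption: Error wins over Warning
        return "Red"
    if WARNING in states:
        return "Yellow"
    return "Green"


def daemons_status(daemons_ok: Iterable[bool]) -> str:
    """Return the DAEMONS STATUS flag: Green only if every daemon is fine."""
    return "Green" if all(daemons_ok) else "Red"
```

For example, `general_status(["OK", "Warning", "OK"])` gives "Yellow", while a single failing daemon turns `daemons_status` to "Red".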

Specific WMSLB Instance Detailed data page

 
Deleted:
<
<
Specific WMSLB Instance Detailed data page
 Here, more information is made available on the specific WMS instance.
Changed:
<
<
Two kinds of data are collected:
- WMSLB service and HW status variables (such as daemons status, condor jobs status statistics, File descriptors opened by main processes..) for which the value at the time of measurement is shown
- Mean job flow rates between wms components (WMproxy, Workload Manager, Job Controller, Condor) across a given time interval reported as well (across past 15 min is the default)
>
>
You can go back to the main page by clicking on the wmsmon logo or on the corresponding part of the navigation bar (e.g. WMSMonitor >> main >> details::cert-rb-01.cnaf.infn.it). Before describing the available info in detail, notice that two kinds of data are collected:
- WMSLB service and HW status variables (such as daemon status, Condor job status statistics, file descriptors opened by the main processes), for which the value at the time of measurement is shown. These data are collected by means of a Python client application running on the WMS and LB instances.
- Mean job flow rates between WMS components (WMproxy, Workload Manager, Job Controller, Condor) across a given time interval. The time interval is the past 15 min by default. These data are collected by means of a Python client application running a MySQL query on the specific LB instance.

Components Details BOX

In this box, information about each WMSLB component is presented.
Here again, a short automatic help appears when the mouse pointer is positioned over the main buttons and variables.
A flag (Green=OK / Red=Error) beside each component's label reports the status of the corresponding daemon.

Components Details from t1 to t2 (Reports the exact time interval in the LB query used to collect data)

WM Proxy [WMproxy daemon status]

Jobs -> WMProxy : number of jobs submitted within reported time interval
Collections submitted : number of collections of jobs submitted within reported time interval
Mean nodes per coll. : mean number of nodes per collection within reported time interval
Std nodes per coll.: standard deviation from the mean of the number of nodes per collection in reported time interval
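
The collection statistics above can be sketched in a few lines of Python. The function name and the input list are hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the page does not say which is used.

```python
from statistics import mean, pstdev


def collection_stats(nodes_per_collection):
    """Summarise collections submitted in an interval.

    nodes_per_collection: one integer per collection, giving its node count.
    """
    if not nodes_per_collection:
        # No collections in the interval: mean and std are undefined.
        return {"collections": 0, "mean_nodes": None, "std_nodes": None}
    return {
        "collections": len(nodes_per_collection),
        "mean_nodes": mean(nodes_per_collection),
        # Assumption: population standard deviation.
        "std_nodes": pstdev(nodes_per_collection),
    }
```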

Proxy Renewal [PX daemon status]

Workload Manager [WM daemon status]

WM file descriptors: Number of file descriptors opened by the Workload Manager process at t2 time
WM queue: Number of entries in input.fl at t2 time
Jobs -> WM: Number of jobs enqueued to Workload Manager from WMproxy within reported time interval
Jobs Resub -> WM: Number of jobs enqueued to Workload Manager from JobController, i.e. resubmitted after failure
VO Views: Number of VO views available for the WMS Match Making. This is taken from workload_manager.log as the number of VO views used in the last Match Making by the Workload Manager or, if the last Match Making is older than 1 hour, as the number of entries in ismdump.fl. None is returned if neither of the two measurements succeeds.
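
The fallback rule for VO Views can be sketched as follows. This is only an illustration: the function and parameter names are hypothetical, the inputs are assumed to have been parsed already from workload_manager.log and ismdump.fl, and falling back to ismdump.fl when the log could not be parsed at all is an assumption.

```python
from datetime import datetime, timedelta
from typing import Optional

MAX_MATCH_MAKING_AGE = timedelta(hours=1)


def vo_views(now: datetime,
             last_mm_time: Optional[datetime],
             last_mm_views: Optional[int],
             ismdump_entries: Optional[int]) -> Optional[int]:
    """Pick the VO Views value from the two available measurements."""
    log_ok = last_mm_time is not None and last_mm_views is not None
    if log_ok and now - last_mm_time <= MAX_MATCH_MAKING_AGE:
        # Recent Match Making found in the log: use its VO view count.
        return last_mm_views
    if ismdump_entries is not None:
        # Last Match Making too old (or not found): count ismdump.fl entries.
        return ismdump_entries
    return None  # neither measurement succeeded
```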

Log Monitor [LM daemon status]

 
Changed:
<
<
Notice that: - A short automatic help on mouse pointer positioning over main buttons and variables
- Passing the mouse pointer on charts a window will pop up with the correspondent variable values and date of measurement.
- A link to lemon monitoring of the specific instance (both WMS and LB if on separate machines) is available
>
>
LM file descriptors: Number of file descriptors opened by the Log Monitor process at t2 time
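
For reference, one common way to count the file descriptors opened by a process on Linux is to list its /proc entry. The sketch below is an assumption about how such a value could be obtained, not a description of the wmsmon sensor; the function name is hypothetical.

```python
import os


def count_open_fds(pid: int) -> int:
    """Count the file descriptors currently opened by process `pid`.

    Linux only: relies on the /proc/<pid>/fd directory, and needs
    permission to read it (same user or root).
    """
    return len(os.listdir(f"/proc/{pid}/fd"))
```

Calling `count_open_fds(os.getpid())` returns the count for the current process.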
 
Added:
>
>
Job Controller [JC daemon status]

JC queue: Number of entries in queue.fl at t2 time
JC file descriptors: Number of file descriptors opened by process Job Controller at t2 time
Jobs -> JC : Number of jobs enqueued to Job Controller from Workload Manager within reported time interval
Jobs JC -> Condor: Number of jobs enqueued to Condor from Job Controller within reported time interval

Local Logger [LL daemon status]
LB events queue: Number of dg20logd_* files in directory /var/tmp at t2 time. These are events not yet stored in the LB server DB, hence not available for LB queries, which may be affected by this.
LL file descriptors: Number of file descriptors opened by the Local Logger process at t2 time
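
Counting the pending LB event files amounts to a simple glob over the directory. A minimal sketch, with a hypothetical function name and the directory and file pattern taken from the description above:

```python
import glob
import os


def lb_events_queue(directory: str = "/var/tmp",
                    pattern: str = "dg20logd_*") -> int:
    """Count the files matching `pattern` in `directory`."""
    return len(glob.glob(os.path.join(directory, pattern)))
```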

LB Proxy [LBPX daemon status]

Transfers [FTPD daemon status]

gftp: number of gftp sessions opened at t2 time

CHARTS BOX

 
Deleted:
<
<
*UNDERSTANDING CHARTS*
 For most relevant variables, charts with recent history (whose time interval can be selected) are made available
Changed:
<
<
*Components Report History*
>
>
Notice that when the mouse pointer passes over a chart, a window pops up with the corresponding variable values and date of measurement.

Two tags of charts are available: 'Components Report History' and 'Daily Statistics'.

Components Report History

 Four charts are available under the tag "Components Report History" reporting info respectively on :
Added:
>
>
  1. Condor Statistics: the number of jobs in the "Running" and "Idle" states.
  2. Component Queues: the number of jobs enqueued to, and waiting to be processed by, the corresponding WMSLB component.
    In particular, the number of jobs enqueued to the Workload Manager and to the Job Controller is reported.
    The number of events still to be transferred to or processed by the LB (and therefore not yet accounted for in LB queries from users or from wmsmon itself) is also reported, as "LB events queue".
  3. Job Flow Rates: the number of jobs processed by the corresponding component, reported as a mean value in Hz.
    Jobs->WMProxy: each point reports the mean job submission rate since the previous measurement (e.g. 900 jobs successfully submitted to the WMS between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
    Jobs->WM: each point reports the mean rate of jobs enqueued to Workload Manager from WMproxy since the previous measurement (e.g. 900 jobs successfully enqueued to Workload Manager from WMproxy between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
    Jobs Resub->WM: each point reports the mean rate of jobs enqueued to Workload Manager from JobController, i.e. resubmitted after failure, since the previous measurement (e.g. 900 jobs successfully enqueued to Workload Manager from JobController between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
  4. Job Flow Rates: the number of jobs processed by the corresponding component, reported as a mean value in Hz.
    Total Jobs->WM: each point reports the mean rate of jobs enqueued to Workload Manager from both WMproxy and Job Controller (i.e. both submitted and resubmitted jobs) since the previous measurement (e.g. 900 jobs successfully enqueued to Workload Manager from both WMproxy and Job Controller between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
    Jobs->JC: each point reports the mean rate of jobs enqueued to Job Controller from Workload Manager since the previous measurement (e.g. 900 jobs successfully enqueued to Job Controller from Workload Manager between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
    Jobs->Condor: each point reports the mean rate of jobs enqueued to Condor from Job Controller since the previous measurement (e.g. 900 jobs successfully enqueued to Condor from Job Controller between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).
Note that the time interval for which the charts are generated can be selected.
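
The rate arithmetic used in the examples above (900 jobs in 15 minutes giving a 1 Hz point) can be written as a small Python sketch. The function name is hypothetical; only the jobs-per-second definition comes from the text.

```python
from datetime import datetime


def mean_rate_hz(jobs: int, start: datetime, end: datetime) -> float:
    """Mean job rate in Hz (jobs per second) between two measurements."""
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end must be later than start")
    return jobs / seconds


# 900 jobs over 15 min = 900 s, plotted at the end of the interval.
rate = mean_rate_hz(900,
                    datetime(2008, 1, 20, 10, 0, 0),
                    datetime(2008, 1, 20, 10, 15, 0))
```

Here `rate` equals 1.0, matching the 1 Hz point of the example.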

Daily Statistics

Three charts are available under the tag "Daily Statistics" reporting info respectively on :

  1. Jobs Flow: for each day in the selected interval, the total number of jobs processed by each component is reported.

    In particular, the chart reports:
    Jobs->WMProxy: each point reports the total number of jobs successfully submitted to the WMS on the corresponding day.
    Jobs->WM: each point reports the total number of jobs successfully enqueued to Workload Manager from WMproxy on the corresponding day.
    Jobs Resub->WM: each point reports the total number of jobs enqueued to Workload Manager from JobController (i.e. the total number of resubmissions) on the corresponding day.

  2. Jobs Final State: for each day in the selected interval, the total numbers of jobs with final state "Done Successfully" and "Aborted" are reported.
  3. Jobs Flow: for each day in the selected interval, the total number of jobs processed by each component is reported.
    In particular, the chart reports:
    Total Jobs->WM: each point reports the total number of jobs enqueued to Workload Manager from both WMproxy and Job Controller (i.e. both submitted and resubmitted jobs) on the corresponding day.
    Jobs->JC: each point reports the total number of jobs enqueued to Job Controller from Workload Manager on the corresponding day.
    Jobs->Condor: each point reports the total number of jobs enqueued to Condor from Job Controller on the corresponding day.

Note that the time interval for which the charts are generated can be selected.
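
A possible way to obtain daily totals from per-measurement counts is sketched below. It assumes that each sample carries the number of jobs enqueued since the previous measurement and that daily values are plain sums grouped by calendar day; the function name and the sample layout are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_job_flow(samples):
    """Sum per-measurement job counts by calendar day.

    samples: iterable of (timestamp, jobs_to_wm, jobs_resub_to_wm) tuples.
    """
    totals = defaultdict(lambda: {"Jobs->WM": 0, "Jobs Resub->WM": 0})
    for timestamp, submitted, resubmitted in samples:
        day = totals[timestamp.date()]
        day["Jobs->WM"] += submitted
        day["Jobs Resub->WM"] += resubmitted
    # Total Jobs->WM counts both submitted and resubmitted jobs.
    return {
        day: {**counts,
              "Total Jobs->WM": counts["Jobs->WM"] + counts["Jobs Resub->WM"]}
        for day, counts in sorted(totals.items())
    }
```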

GENERAL INFO BOXES

These two boxes report information about the WMS and LB status.

General Info at TIME-OF-MEASUREMENT Lemon (link to Lemon monitoring of the specific instance)

WMS HW Status
Sandbox partition: Occupancy (in %) of directory /var/glite/SandboxDir
/tmp partition: Occupancy (in %) of directory /tmp
CPU load: Average load of the machine during the past 15 minutes

Job Stats (Condor)
Running jobs: Number of jobs in the state "Running" in Condor
Idle jobs: Number of jobs in the state "Idle" in Condor
Total Condor jobs: Number of jobs in the states "Running", "Idle" or "Held" in the Condor queue

Info from LB LB-SERVER-HOSTNAME Lemon (link to Lemon monitoring of the specific instance)

LB Status Daemon Status
CPU load: Average load of the machine during the past 15 minutes
/ partition: Occupancy (in %) of directory /
LB connections: Number of connections opened on ports 9000, 9001 and 9003
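
The hardware variables above (partition occupancy and 15-minute load average) can be read with the Python standard library, as in the following sketch. The function names are hypothetical and this is not necessarily how the wmsmon sensors measure them; in particular, the percentage may differ slightly from what `df` prints because of reserved blocks.

```python
import os
import shutil


def partition_occupancy_percent(path: str) -> float:
    """Occupancy (in %) of the partition holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo file systems may report a zero size
    return 100.0 * usage.used / usage.total


def load_average_15min() -> float:
    """Average load of the machine over the past 15 minutes (Unix only)."""
    return os.getloadavg()[2]
```

For instance, `partition_occupancy_percent("/var/glite/SandboxDir")` would give the Sandbox partition value on a WMS host where that directory exists.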
 
Deleted:
<
<
1 -Condor Statistics: i.e. the number of jobs in the status "Running", "Idle"
2 Component Queues: i.e. the number of jobs enqueued to and waiting to be processed correspondent wmslb component
In particular the number of jobs enqueued to the Workload Manager and Job Controller are reported.
Also the number of events still to be transferred/processed by the LB (and therefore not yet accounted for in LB queries from users or wmsmon itself), is reported as "LB events queue"
*3*-Job Flow Rates: i.e. the number of jobs processed by the correspondent component reported as mean value in Hz. Jobs->WMProxy: for each point reports the mean job submission rate since previous measurement (ex. 900 jobs successfully submitted to the WMS between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)
Jobs->WM: for each point reports the mean rate of jobs enqueued to Workload Manager from WMproxy since previous measurement (ex. 900 jobs successfully enqueued to Workload Manager from WMproxy between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)
Jobs Resub->WM: for each point reports the mean rate of jobs enqueued to Workload Manager from JobController, i.e. resubmitted after failure, since previous measurement (ex. 900 jobs successfully enqueued to Workload Manager from JobController between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)
*4*-Job Flow Rates: i.e. the number of jobs processed by the correspondent component reported as mean value in Hz. Total Jobs->WM: for each point reports the mean rate of jobs enqueued to Workload Manager from both WMproxy and Job Controller (i.e. both Submitted and Resubmitted jobs) since previous measurement (ex. 900 jobs successfully enqueued to Workload Manager from both WMproxy and Job Controller between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)
Jobs->JC: for each point reports the mean rate of jobs enqueued to Job Controller from Workload Manager since previous measurement (ex. 900 jobs successfully enqueued to Job Controller from Workload Manage between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)
Jobs->Condor: for each point reports the mean rate of jobs enqueued to Condor from Job Controller since previous measurement (ex. 900 jobs successfully enqueued to Condor from Job Controller between 2008-20-01 10:00:00 and 2008-20-01 10:15:00 , i.e. 15 min = 900sec, will produce a 1 Hz point at time 20-01 10:15)

*Daily Statistics*
Three charts are available under the tag "Daily Statistics" reporting info respectively on :
*1*-Jobs Flow: for each day in the selected interval the total number of Jobs processed respectively by each component are reported.
In particular in the chart are reported:
Jobs->WMProxy: for each point reports the total number of jobs successfully submitted to the WMS for correspondent day.
Jobs ->WM: for each point reports the total number of jobs successfully enqueued to Workload Manager from both WMproxy for correspondent day.
Jobs Resub->WM: for each point reports the total number of jobs enqueued to Workload Manager from JobController (i.e. total number of resubmissions) for correspondent day.

*2*-Jobs Final State:for each day in the selected interval the total number of Jobs with final state "Done Successfully" and "Aborted" respectively are reported.

*3*-Jobs Flow: for each day in the selected interval the total number of Jobs processed respectively by each component are reported.
In particular in the chart are reported:
Total Jobs->WM: for each point reports the total number of jobs enqueued to Workload Manager from both WMproxy and Job Controller (i.e. both Submitted and Resubmitted jobs) for correspondent day.
Jobs->JC: for each point reports the total number of jobs enqueued to Job Controller from Workload Manager for correspondent day.
Jobs->Condor: for each point reports the total number of jobs enqueued to Condor from Job Controller for correspondent day.

User Documentation

Line: 70 to 115
 
  • You need the certificate and Flash installed in your browser
--> 2) data are collected by sensors on the WMS and LB and sent to the server using a Python SOAP module
Changed:
<
<
-->

Interaction with Related Tools

>
>
 

Future work:

-Packaging for current release deployment
 