---+ Overview

*WMSMON Main Page*

Open the CNAF instance at<br />
https://cert-wms-01:8443/wmsmon/main/main.php<br />
Note that you need a certificate installed in your browser in order to access it.<br />
The main page shows a table with one row for each monitored WMS instance, giving a synthetic overview of the instance status across its columns.<br />
A filter to find VO-dedicated WMS instances is also provided.<br />
A short help text appears automatically when the mouse pointer is positioned over the main buttons and variables.<br />
The following columns are reported:<br /><br />

| WMS | DATE | RUNNING JOBS | IDLE JOBS | WM QUEUE | JC QUEUE | VO VIEWS | LB EVENTS QUEUE | CPU LOAD | SANDBOX PARTITION | GENERAL STATUS | DAEMONS STATUS |
| Hostname of the specific WMS instance. | Date and time of the last measurement. | Number of jobs in the "Running" state in Condor. | Number of jobs in the "Idle" state in Condor. | Number of entries in input.fl. | Number of entries in queue.fl. | Calculated as: <br /> - the number of VO views used in the last matchmaking by the Workload Manager, as reported in 'workload_manager.log', <br /> or <br /> - the number of entries in ismdump.fl, if the last matchmaking is older than 1 hour. <br /> None is returned if neither of the two measurements succeeds. | Number of dg20logd_* files in directory /var/tmp (events not yet stored in the LB server DB). | Average load of the machine during the past 15 minutes. | Occupancy (in %) of directory /var/glite/SandboxDir. | A flag: <br /> green if the status of all main measured variables is OK, yellow if at least one variable is in Warning status, red if at least one variable is in Error status. | A flag: <br /> green if all WMSLB daemons returned a normal status, red if at least one returned an Error status. |

<br />
Clicking on the WMS hostname or on one of the STATUS flags opens the corresponding detailed data page, described below.<br /><br />

*Specific WMSLB Instance Detailed Data Page*

This page provides more information on the specific WMS instance.<br />
Two kinds of data are collected:<br />
- WMSLB service and hardware status variables (such as daemon status, Condor job status statistics, and file descriptors opened by the main processes), for which the value at the time of measurement is shown.<br />
- Mean job flow rates between WMS components (WMproxy, Workload Manager, Job Controller, Condor) over a given time interval (the past 15 minutes by default).<br />
Note that:<br />
- A short help text appears automatically when the mouse pointer is positioned over the main buttons and variables.<br />
- Moving the mouse pointer over a chart opens a pop-up window with the corresponding variable values and the date of measurement.<br />
- A link to the Lemon monitoring of the specific instance (both WMS and LB, if they run on separate machines) is available.<br />

*UNDERSTANDING CHARTS*<br />
For the most relevant variables, charts of the recent history are available; the time interval can be selected.<br />

*Components Report History*<br />
Four charts are available under the tag "Components Report History":<br />
*1* - Condor Statistics: the number of jobs in the "Running" and "Idle" states.<br />
*2* - Component Queues: the number of jobs enqueued to, and waiting to be processed by, the corresponding WMSLB component.<br />
In particular, the number of jobs enqueued to the Workload Manager and to the Job Controller is reported.<br />
The number of events still to be transferred to or processed by the LB (and therefore not yet accounted for in LB queries from users or from wmsmon itself) is reported as "LB events queue".<br />
*3* - Job Flow Rates: the number of jobs processed by the corresponding component, reported as a mean value in Hz.<br />
Jobs->WMProxy: each point reports the mean job submission rate since the previous measurement (e.g. 900 jobs successfully submitted to the WMS between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />
Jobs->WM: each point reports the mean rate of jobs enqueued to the Workload Manager from WMproxy since the previous measurement (e.g. 900 jobs successfully enqueued to the Workload Manager from WMproxy between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />
Jobs Resub->WM: each point reports the mean rate of jobs enqueued to the Workload Manager from the Job Controller, i.e. resubmitted after failure, since the previous measurement (e.g. 900 jobs successfully enqueued to the Workload Manager from the Job Controller between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />
*4* - Job Flow Rates: the number of jobs processed by the corresponding component, reported as a mean value in Hz.<br />
Total Jobs->WM: each point reports the mean rate of jobs enqueued to the Workload Manager from both WMproxy and the Job Controller (i.e. both submitted and resubmitted jobs) since the previous measurement (e.g. 900 jobs successfully enqueued to the Workload Manager from both WMproxy and the Job Controller between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />
Jobs->JC: each point reports the mean rate of jobs enqueued to the Job Controller from the Workload Manager since the previous measurement (e.g. 900 jobs successfully enqueued to the Job Controller from the Workload Manager between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />
Jobs->Condor: each point reports the mean rate of jobs enqueued to Condor from the Job Controller since the previous measurement (e.g. 900 jobs successfully enqueued to Condor from the Job Controller between 2008-01-20 10:00:00 and 2008-01-20 10:15:00, i.e. in 15 min = 900 s, produce a 1 Hz point at 2008-01-20 10:15).<br />

*Daily Statistics*<br />
Three charts are available under the tag "Daily Statistics":<br />
*1* - Jobs Flow: for each day in the selected interval, the total number of jobs processed by each component.<br />
The chart reports:<br />
Jobs->WMProxy: each point reports the total number of jobs successfully submitted to the WMS on the corresponding day.<br />
Jobs->WM: each point reports the total number of jobs successfully enqueued to the Workload Manager from WMproxy on the corresponding day.<br />
Jobs Resub->WM: each point reports the total number of jobs enqueued to the Workload Manager from the Job Controller (i.e. the total number of resubmissions) on the corresponding day.<br />
*2* - Jobs Final State: for each day in the selected interval, the total number of jobs with final state "Done Successfully" and "Aborted" respectively.<br />
*3* - Jobs Flow: for each day in the selected interval, the total number of jobs processed by each component.<br />
The chart reports:<br />
Total Jobs->WM: each point reports the total number of jobs enqueued to the Workload Manager from both WMproxy and the Job Controller (i.e. both submitted and resubmitted jobs) on the corresponding day.<br />
Jobs->JC: each point reports the total number of jobs enqueued to the Job Controller from the Workload Manager on the corresponding day.<br />
Jobs->Condor: each point reports the total number of jobs enqueued to Condor from the Job Controller on the corresponding day.<br />

---+ User Documentation
   * UserManual

<!-- - Charts usage - ---+ Installation * Requirements ---+ User Requirements * You need the certificate and flash installed on your browser -->
2) Data are collected by sensors on the WMS and LB and sent to the server using a Python SOAP module.<br />

---+ Interaction with Related Tools

---+ Future work:<br />
- Packaging for current release deployment<br />
- Implementation of estimates of job latency<br />
- Cumulative statistics plot reporting on all WMS instances in use by a specific VO<br />
- Old RB monitoring<br />
- Mail notification system<br />

-- Main.DaniloDongiovanni - 25 Jan 2008
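
*Worked example: rate points and status flag*

The Overview describes each job flow rate point as a mean rate in Hz since the previous measurement, and the GENERAL STATUS flag as an aggregate of the per-variable statuses. The sketch below illustrates that arithmetic in Python. It is not taken from the WMSMON code: the function names =mean_rate_hz= and =general_status=, the status strings, and the rule that Error takes precedence over Warning are assumptions made for illustration only.

```python
from datetime import datetime


def mean_rate_hz(job_count, previous_measurement, current_measurement):
    """Mean job rate (jobs per second) between two measurement times.

    Hypothetical helper: divides the number of jobs counted in the
    interval by the interval length in seconds.
    """
    elapsed = (current_measurement - previous_measurement).total_seconds()
    if elapsed <= 0:
        raise ValueError("current measurement must be later than the previous one")
    return job_count / elapsed


def general_status(variable_statuses):
    """Aggregate per-variable statuses into a single flag colour.

    Assumption: statuses are the strings "OK", "Warning" or "Error",
    and an Error outranks a Warning when both are present.
    """
    if "Error" in variable_statuses:
        return "Red"
    if "Warning" in variable_statuses:
        return "Yellow"
    return "Green"


# The example from the text: 900 jobs in 15 min (900 s) give a 1 Hz point.
rate = mean_rate_hz(
    900,
    datetime(2008, 1, 20, 10, 0, 0),
    datetime(2008, 1, 20, 10, 15, 0),
)
flag = general_status(["OK", "Warning", "OK"])
```

With the interval from the text, =rate= evaluates to 1.0 Hz, and a single variable in Warning status is enough to turn the flag yellow.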
This topic: WMSMonitor
Topic revision: r3 - 2008-01-31 - DaniloDongiovanni