Service Monitoring Guide
WLCG Grid Middleware Services Monitoring Definition Template
This definition template aims
- to assist middleware service developers in describing, at a high level, information to be used to monitor the health and availability of a grid service
- to provide a reference for monitoring developers and site administrators creating sensors for integration with local fabric monitoring systems
A description of Categories A, B and C, further information about WLCG Grid Service Monitoring, and other terminology used below are available in "Understanding WLCG Monitoring".
Some of the information required may vary with service configuration options; where it does, please describe what is commonly accepted as the default configuration, or otherwise indicate the variation appropriately.
Service Name:
Contact Details:
Context and Dependencies
The Workload Management System (WMS) comprises a set of Grid middleware components responsible for the distribution and management of tasks across Grid resources, such that applications are executed conveniently, efficiently and effectively. The WMS is composed of the following sub-services:
- Workload Management – WM: the core component of the WMS; it accepts and satisfies job-management requests coming from its clients
- WMProxy: Web service interface for submitting jobs to the WM
- Job Controller – JC: acts as an interface to Condor for the WM
- Log Monitor – LM: directly connected to the JC, it acts as a job-monitoring tool, parsing Condor log files
- Local Logger: copies events to be sent to the LB server into a local disk file
- LBProxy: keeps a local view of the job state to be sent to the LB server
- Proxy Renewal: service to renew the proxies of long-running jobs
The WMS makes use of the following external dependencies:
- boost >= 1.32.0
- classads >= 0.9.8
- fcgi >= 2.4.0
- c-ares >= 1.3.0
- expat >= 1.95.7
- gridsite-shared
- mod_fastcgi >= 2.4.3
- vdt_globus_essentials >= VDT1.6.0x86_rhas_4
- xerces-c >= 2.7.0
- zlib >= 1.2.1
- log4cxx >= 0.9.7-1
- condor >= 6.8.4
- mysql
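A simple sensor can verify that these dependencies are present before checking anything else. The sketch below is an assumption, not a shipped tool: it queries each package via `rpm -q` (the expected mechanism on an RPM-based WMS node), and the `QUERY_CMD` override and the function name `missing_pkgs` are illustrative.

```shell
#!/bin/sh
# Hypothetical dependency probe: print every package from the list above
# that is not installed. "rpm -q" is the assumed query command; QUERY_CMD
# may override it (e.g. for testing on a non-RPM host).
missing_pkgs() {
    query=${QUERY_CMD:-"rpm -q"}
    for pkg in "$@"; do
        # a silent query failure means the package is absent
        $query "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Example invocation on a WMS node:
# missing_pkgs boost classads fcgi c-ares expat mod_fastcgi xerces-c zlib log4cxx condor mysql
```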
Category A - Traces in the fabric
Process State
- /opt/glite/bin/glite-wms-workload_manager
- Number of instances: 1
- User: glite
- /opt/glite/bin/glite-wms-job_controller
- Number of instances: 1
- User: glite
- /opt/glite/bin/glite-wms-log_monitor
- Number of instances: 1
- User: glite
- /opt/glite/bin/glite-lb-proxy
- Number of instances: 1 master and 10 slaves
- User: glite
- glite_wms_wmproxy_server
- Number of instances: a variable number depending on load and configuration
- User: glite
- /opt/globus/sbin/globus-gridftp-server
- Number of instances: a variable number depending on load and configuration
- User: glite
- /opt/glite/bin/glite-proxy-renewd
- Number of instances: 1 master and 1 slave
- User: glite
- /opt/glite/bin/glite-lb-logd
- Number of instances: 1
- User: glite
- /opt/glite/bin/glite-lb-interlogd
- Number of instances: 1
- User: glite
- Condor: each of the following Condor processes, running as user 'glite', must always be present on a healthy running WMS
29260 ? Ss 9:01 /opt/condor-c/sbin/condor_master
29265 ? Ss 7:13 \_ condor_collector -f
29268 ? Ss 6:50 \_ condor_schedd -f
29898 ? S 0:33 | \_ perl /var/local/condor/spool/cluster1.ickpt.subproc0
29906 ? S 0:07 | \_ perl /var/local/condor/spool/cluster2.ickpt.subproc0
29916 ? S 0:00 | \_ perl /var/local/condor/spool/cluster3.ickpt.subproc0
29276 ? Ss 1:47 \_ condor_negotiator -f
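The process-state table above translates directly into a category-A probe. The following is a sketch under stated assumptions, not part of the WMS release: the helper names (`count_instances`, `check_daemon`) are illustrative, and instance counting is done by pattern-matching `ps` output for user 'glite'.

```shell
#!/bin/sh
# Hypothetical category-A process probe: compare the number of running
# instances of each daemon against the minimum listed above.

# count_instances PATTERN — reads "ps -o args=" style output on stdin
count_instances() {
    grep -c -- "$1"
}

# check_daemon PATTERN MIN — OK/CRITICAL result for one daemon
check_daemon() {
    n=$(ps -u glite -o args= 2>/dev/null | count_instances "$1")
    if [ "$n" -ge "$2" ]; then
        echo "OK: $1 ($n instance(s))"
    else
        echo "CRITICAL: $1 ($n < $2)"
        return 2
    fi
}

# Example checks on a WMS node:
# check_daemon glite-wms-workload_manager 1
# check_daemon glite-wms-job_controller 1
# check_daemon glite-lb-proxy 11   # 1 master + 10 slaves
```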
Network Ports
Port Number / Process Name / Security (GSI)
- 7443/ WMProxy httpd / GSI
- 2811/ globus-gridftp / GSI
- 9002/ glite-lb-logd / GSI
- 9000/ glite-lb-proxy / GSI
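A minimal reachability check for the ports above can be sketched as follows. This is an assumption, not an official sensor: it only confirms that each port accepts a TCP connection, not that GSI authentication succeeds, and it relies on bash's /dev/tcp redirection.

```shell
#!/bin/bash
# Hypothetical TCP reachability probe for the WMS service ports.

# check_port HOST PORT — exit 0 if a TCP connection can be opened
check_port() {
    # bash opens /dev/tcp/HOST/PORT as a network connection;
    # the subshell fails if the connection is refused
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# Example checks on a WMS node:
# for p in 7443 2811 9002 9000; do
#     check_port localhost "$p" && echo "OK $p" || echo "CRITICAL $p"
# done
```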
Filesystems/Directories
Path / Owner/Group / Permissions / Min. Space
- /var/glite/SandboxDir/
- Owner/Group: glite glite
- Permissions: drwxrwx-wx
- Min Space: depends on load and the type of jobs submitted; 30-40 GB is the minimum on several production WMSs with the weekly job-sandbox purge cron job enabled
- /var/glite/workloadmanager/
- Owner/Group: glite glite
- Permissions: drwxr-xr-x
- /var/glite/logmonitor/
- Owner/Group: glite glite
- Permissions: drwxr-xr-x
- Min Space: this directory mainly contains Condor log files, so the space used depends on how many jobs are submitted to the WMS. Unused files are periodically moved into the 'CondorLogRecycleDir' defined in the LogMonitor section of the /opt/glite/etc/glite_wms.conf file, which by default is set to /var/glite/logmonitor/CondorG.log/recycle/. On several production WMSs the minimum space is 1-2 GB. This space can be reduced by removing the /var/glite/logmonitor/CondorG.log/recycle/ directory or by setting the 'CondorLogRecycleDir' parameter to a directory such as '/tmp' or '/dev/null'
- /var/glite/jobcontrol/
- Owner/Group: glite glite
- Permissions: drwxr-xr-x
- Min Space: depends on the value of the 'RemoveJobFiles' flag in the LogMonitor section of the /opt/glite/etc/glite_wms.conf file. If set to true, all files used to submit jobs to Condor are removed once they are no longer needed. On several production WMSs the flag is set to true and the minimum space is 1-2 GB
- /var/log/glite/
- Owner/Group: glite glite
- Permissions: drwxrwxr-t
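The minimum-space figures above can be monitored with a small threshold check. This is a sketch, not a shipped tool: the function name `min_space_ok` is illustrative, and free space is read from `df -P` in 1 KB blocks.

```shell
#!/bin/sh
# Hypothetical filesystem probe: verify that a directory's filesystem still
# has at least MIN_KB kilobytes available.

# min_space_ok PATH MIN_KB — exit 0 if available space >= MIN_KB
min_space_ok() {
    # POSIX df output: the 4th field of the data row is the available space
    avail=$(df -P "$1" 2>/dev/null | awk 'NR==2 {print $4}')
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# Example on a WMS node: warn when the sandbox area drops below ~30 GB
# min_space_ok /var/glite/SandboxDir 31457280 || echo "CRITICAL: sandbox space low"
```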
Log Files
Path / Owner/Group / Permissions
- (log) Path: /var/log/glite/workloadmanager_events.log
- Owner/Group: glite glite
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/wmproxy.log
- Owner/Group: glite glite
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/jobcontroller_events.log
- Owner/Group: glite glite
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/logmonitor_events.log
- Owner/Group: glite glite
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/httpd-wmproxy-access.log
- Owner/Group: root root
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/httpd-wmproxy-errors.log
- Owner/Group: root root
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/glite-wms-wmproxy-purge-proxycache.log*
- Owner/Group: root root
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/glite-wms-purgeStorage.log
- Owner/Group: root root
- Permissions: 644
- (exp)
- (log) Path: /var/log/glite/lcmaps.log
- Owner/Group: glite glite
- Permissions: 644
- (exp)
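A sensor can also confirm that each log file above exists with the expected owner and mode. The sketch below is an assumption rather than an official probe: `check_log` is an illustrative name, and `stat -c` is GNU coreutils syntax.

```shell
#!/bin/sh
# Hypothetical log-file probe: check existence, owner and permissions
# against the table above.

# check_log PATH OWNER MODE
check_log() {
    [ -f "$1" ] || { echo "MISSING $1"; return 1; }
    # e.g. "glite 644" for a file owned by glite with mode 0644
    actual=$(stat -c '%U %a' "$1")
    if [ "$actual" = "$2 $3" ]; then
        echo "OK $1"
    else
        echo "WARN $1 (found: $actual, expected: $2 $3)"
        return 1
    fi
}

# Example checks on a WMS node:
# check_log /var/log/glite/wmproxy.log glite 644
# check_log /var/log/glite/lcmaps.log glite 644
```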
Category B - Interpretation of user-level operations
Describe (or provide persistent urls to information about) commands which can be used as category B probes including how the status return and output may be parsed to diagnose the performance or status of the service. Include the release packages in which the commands are made available.
- Scripts to check the daemons' status and to start/stop them are located in the /opt/glite/etc/init.d/ directory (e.g. /opt/glite/etc/init.d/glite-wms-wm start/stop/status). gLite production installations also provide a more generic service, called gLite, to manage all of them simultaneously; try 'service gLite status/start/stop'
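A category-B probe can wrap these status commands and map their exit codes to a Nagios-style result. This is a sketch under stated assumptions: `probe_service` is an illustrative name, and the init-script paths in the examples are those quoted above.

```shell
#!/bin/sh
# Hypothetical category-B probe wrapper: run a status command and translate
# its exit code into OK (0) / CRITICAL (2).

# probe_service CMD [ARGS...]
probe_service() {
    if "$@" >/dev/null 2>&1; then
        echo "OK: $*"
        return 0
    else
        echo "CRITICAL: $*"
        return 2
    fi
}

# Examples on a WMS node:
# probe_service /opt/glite/etc/init.d/glite-wms-wm status
# probe_service service gLite status
```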
Category C - Internal metrics presented by the service
Describe (or provide persistent urls to information about) commands, or otherwise how to invoke mechanisms, to gather category C metrics about this service (see endnote i) including how the status return and output may be parsed to diagnose the performance or status of the service. Include the release packages in which any commands are made available.
- No specific algorithms have been implemented. Some performance and service-status metrics may be gathered simply by searching the following files:
- /var/glite/jobcontrol/queue.fl -> JobController queue of jobs waiting to be submitted to Condor
- /var/glite/workload_manager/input.fl -> queue of job requests waiting for the WM
- Information about jobs not yet completed can also be obtained using Condor commands, such as condor_q
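A rough category-C metric can be derived from those queue files. The sketch below is an assumption: `queue_depth` is an illustrative name, and it uses line count as a crude proxy for the number of pending records, which would need adjusting to the actual .fl file format.

```shell
#!/bin/sh
# Hypothetical category-C metric: estimate the depth of the JobController
# and WM input queues from their queue files.

# queue_depth FILE — prints an estimated queue depth, or -1 if unreadable
queue_depth() {
    if [ -r "$1" ]; then
        # line count as a rough proxy for the number of queued records
        wc -l < "$1" | tr -d ' '
    else
        echo -1
    fi
}

# Examples on a WMS node:
# queue_depth /var/glite/jobcontrol/queue.fl
# queue_depth /var/glite/workload_manager/input.fl
```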
--
ElisabettaMolinari - 11 Apr 2008