Nagios for WeNMR
WeNMR Nagios web page:
https://grid-monitor03.pd.infn.it:50080/nagios/
Access is permitted only to enmr.eu members, using a personal certificate (authorized DNs are retrieved by /etc/cron.d/voms-htpasswd and listed in the files /etc/nagios/htpasswd.users and /etc/httpd/httpd.users).
This Nagios instance monitors hosts belonging to the NGIs of Belgium, Germany, France, Italy, Spain, Portugal, Netherlands, UK, Poland, Malaysia, Taiwan and Brazil, where enmr.eu probes can be executed.
Sites from South Africa and OSG will soon be monitored too.
Detailed documentation about Nagios can be found here
How to quickly add new WeNMR probes
- 1. In /etc/ncg-metric-config.d create the file wenmr-probes.conf with JSON-formatted directives
- 2. In /usr/libexec/grid-monitoring/probes/wenmr/wnjob/etc/wn.d/wenmr/ edit the services.cfg and commands.cfg files
- 3. Implement the probe in the file /usr/libexec/grid-monitoring/probes/wenmr/wnjob/probes/wenmr/probe_name (see e.g. gromacs probe)
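- 4. Check that /etc/ncg/ncg-localdb.d/wenmr-custom.conf contains the two lines shown below, then reconfigure with yaim and restart Nagios: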
cat /etc/ncg/ncg-localdb.d/wenmr-custom.conf
MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
/opt/glite/yaim/bin/yaim -s siteinfo/site-info.def -d 6 -c -n glite-UI -n glite-NAGIOS && service nagios restart
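Step 4 above can be made repeatable with a small helper that appends a line only when it is missing. This is a minimal sketch, not part of the original procedure: the function names `ensure_line` and `wenmr_metric_line` are hypothetical, and the file is assumed to end with a newline.

```shell
#!/bin/bash
# Append LINE to FILE only if that exact line is not already present.
# Usage: ensure_line FILE LINE
ensure_line() {
    local file=$1 line=$2
    touch "$file" || return 1
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Build one MODIFY_METRIC_PARAMETER line for the given metric and wnjob path.
# The format string is single-quoted so that "!" is never history-expanded.
wenmr_metric_line() {
    printf 'MODIFY_METRIC_PARAMETER!%s!--add-wntar-nag!%s\n' "$1" "$2"
}

# Example (as root on the Nagios host):
# conf=/etc/ncg/ncg-localdb.d/wenmr-custom.conf
# wnjob=/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
# for metric in org.sam.CREAMCE-JobState org.sam.CE-JobState; do
#     ensure_line "$conf" "$(wenmr_metric_line "$metric" "$wnjob")"
# done
```

Running the example twice leaves the file unchanged the second time, so it is safe to repeat after each reconfiguration.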
Information about latest update 09
Management instructions
For its probes, Nagios uses Badoer's certificate, which must be renewed before it expires (it lasts one week):
[badoer@grid-monitor03]# myproxy-init --voms enmr.eu:/enmr.eu/ops -k NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -s prod-ui-02.pd.infn.it -l nagios -x -Z "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
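Because the credential lasts only one week, a periodic check can help avoid an unnoticed expiry. The sketch below only encodes the decision: the function name `needs_renewal` and the one-day default threshold are assumptions, and how the remaining lifetime in seconds is obtained on this host is left open.

```shell
#!/bin/bash
# Return 0 (true) when the remaining lifetime is below the threshold.
# Usage: needs_renewal SECONDS_LEFT [THRESHOLD_SECONDS]
needs_renewal() {
    local seconds_left=$1
    local threshold=${2:-86400}   # assumed default: one day
    [ "$seconds_left" -lt "$threshold" ]
}

# Example: warn when less than two days of a one-week credential remain.
# left=...   # remaining lifetime in seconds, obtained by whatever means is available
# if needs_renewal "$left" 172800; then
#     echo "Nagios credential expires soon: run myproxy-init again" >&2
# fi
```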
After a yaim reconfiguration, perform the following steps:
- Keep only the CERM WMS, removing the other WMSes from:
- /opt/glite/etc/enmr.eu/glite_wms.conf
- /opt/glite/etc/enmr.eu/glite_wmsui.conf
- Add WeNMR BDII for SRM probes in sites not belonging to EGI-GOCDB
- add the following line to /etc/ncg/ncg-localdb.d/uncert.conf
- MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!ldap://bdii-wenmr.pd.infn.it:2170
- or, equivalently:
cp /etc/ncg/ncg-localdb.d/uncert.conf.GOOD /etc/ncg/ncg-localdb.d/uncert.conf
- Configure ncg to find sites not belonging to EGI-GOCDB on their site-BDII
- add the following lines to /etc/ncg/ncg.conf.d/uncert.conf for each site outside EGI-GOCDB
- # <GOCDB/>
- ADD_HOSTS=1
- LDAP_ADDRESS=<siteBDII>
- or, equivalently:
cp /etc/ncg/ncg.conf.d/uncert.conf-OK-OutOfGOCDB_sites /etc/ncg/ncg.conf.d/uncert.conf
To add a site:
- if the site is certified in EGI-GOCDB:
- edit the yaim configuration file, adding the NGI of the site (if not already present) to the variable NCG_GOCDB_ROC_NAME
- reconfigure with yaim
- if the site is present in EGI-GOCDB but not certified:
- edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (the site should be present in the TopBDII bdii-wenmr.pd.infn.it)
- reconfigure with yaim
- if the site is not present in EGI-GOCDB
- edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (the site should be present in the TopBDII bdii-wenmr.pd.infn.it)
- reconfigure with yaim
- edit grid-monitor03:/etc/ncg/ncg.conf.d/uncert.conf changing:
- <NCG::SiteInfo SITE_NAME>
- # <GOCDB/>
- <LDAP>
- LDAP_ADDRESS=SITE_BDII
- # ADD_HOSTS=0
- ADD_HOSTS=1
- </LDAP>
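For the two cases that touch UNCERTIFIED_SITES, the new value can be computed without introducing duplicates. This is a sketch under stated assumptions: the helper `add_site` is hypothetical, the list is assumed to be space-separated (as NCG_GOCDB_ROC_NAME is), and EXAMPLE-SITE is a placeholder name.

```shell
#!/bin/bash
# Print LIST with SITE appended, unless SITE is already one of its
# space-separated entries (the separator is an assumption).
# Usage: add_site "CURRENT LIST" SITE
add_site() {
    local list=$1 site=$2 entry
    for entry in $list; do
        if [ "$entry" = "$site" ]; then
            printf '%s\n' "$list"
            return 0
        fi
    done
    if [ -z "$list" ]; then
        printf '%s\n' "$site"
    else
        printf '%s %s\n' "$list" "$site"
    fi
}

# Example:
# UNCERTIFIED_SITES=$(add_site "BCBR" "EXAMPLE-SITE")
# echo "UNCERTIFIED_SITES=\"$UNCERTIFIED_SITES\""
```

The printed assignment still has to be copied into the yaim configuration file by hand before reconfiguring.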
To authorize a user whose DN is not automatically retrieved from VOMS into /etc/nagios/htpasswd.users:
- copy the user's DN into a file /etc/voms2htpasswd-static.d/*.conf
To add a custom Nagios probe, see here
Installation and configuration instructions
This documentation was followed: VO SAM
Here are the steps executed on grid-monitor03.pd.infn.it.
Installation
Installed SL5 x86_64
service yum stop
chkconfig yum off
- host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''
cd /etc/yum.repos.d/
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.X/glite-BDII.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/dag.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/ig.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/glite-ui.repo
vi egi-sam.repo
[egi-sam]
name=EGI SAM repo
baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
enabled=1
gpgcheck=0
protect=1
priority=10
mv sl.repo sl.repo.disable
mv sl-security.repo sl-security.repo.disable
mv sl-fastbugs.repo sl-fastbugs.repo.disable
mv sl-contrib.repo sl-contrib.repo.disable
yum clean all
yum install lcg-CA
yum install httpd
yum groupinstall ig_UI_noafs
yum install yum-priorities
yum remove mysql-server-5.0.77-4.el5_5.4 mysql-5.0.77-4.el5_5.4 mysql-devel-5.0.77-4.el5_5.4 [necessary because the yaim configuration wants a newer version of MySQL]
- edited /etc/yum.repos.d/dag.repo because of missing dependencies (why??) with perl-DBD-mysql-4.014-1.el5.rfx (needed by egee-NAGIOS)
[root@grid-monitor03 ~]# cat /etc/yum.repos.d/dag.repo
[dag]
name=DAG rpms
baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        http://ftp1.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        http://ftp2.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        ftp://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
enabled=1
# To use priorities you must have yum-priorities installed
priority=30
[dag-extra]
name=DAG extras
baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/extras/
enabled=1
yum install egee-NAGIOS
yum install 'perl(Class::Inspector)' [needed to let Nagios update file /etc/nagios/htpasswd.users, where authorized users are listed]
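The four `mv` commands used above to disable the SL repositories can be written as one loop that skips files which are missing or already renamed. A minimal sketch; the function name `disable_repos` is hypothetical.

```shell
#!/bin/bash
# Rename each named file in DIR to NAME.disable; skip names that do not exist.
# Usage: disable_repos DIR NAME...
disable_repos() {
    local dir=$1 name
    shift
    for name in "$@"; do
        if [ -f "$dir/$name" ]; then
            mv -- "$dir/$name" "$dir/$name.disable"
        fi
    done
}

# Example (as root):
# disable_repos /etc/yum.repos.d sl.repo sl-security.repo sl-fastbugs.repo sl-contrib.repo
# yum clean all
```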
Configuration
- edit file <yaim-conf-dir>/3_2/nodes/grid-monitor03
VOS="enmr.eu"
NAGIOS_HOST=grid-monitor03.$MY_DOMAIN
NAGIOS_ADMIN_DNS="/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi,/C=IT/O=INFN/OU=Personal Certificate/L=LNL/CN=Simone Badoer,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Marco Verlato"
NCG_NAGIOS_ADMIN=simone.badoer@pd.infn.it
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=all
NCG_VO=enmr.eu
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS=xxx
NAGIOS_NCG_ENABLE_CRON=true
NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
MYSQL_ADMIN="xxx"
DB_PASS="xxx"
MYEGI_ADMIN_NAME="Simone Badoer"
MYEGI_ADMIN_EMAIL="simone.badoer@pd.infn.it"
MYEGI_DEFAULT_PROFILE="ROC"
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
NCG_NOTIFICATION_HEADER="WeNMR Nagios"
NCG_INCLUDE_EMPTY_HOSTS=0
# Found from GOCDB:
NCG_GOCDB_ROC_NAME="Italy NGI_IT NGI_NL NGI_DE UKI NGI_IBERGRID ROC_IGALC"
# Needed for uncertified sites:
UNCERTIFIED_SITES="BCBR"
UNCERTIFIED_WMS=wms-enmr.chem.uu.nl
UNCERTIFIED_BDII=bdii-enmr.chem.uu.nl
/opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/conf_ig_UI_noafs__glite-NAGIOS.`hostname -s`.`date +%Y%m%d%H%M%S`.log
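A log file name with a sortable timestamp could be built as in the sketch below. The function name `yaim_log_name` and the `%Y%m%d%H%M%S` format are assumptions; note that `date` only expands conversion specifiers written with a leading `%`.

```shell
#!/bin/bash
# Print a per-host, per-run log file name for the yaim configuration output.
# Usage: yaim_log_name [HOST]   (HOST defaults to the short hostname)
yaim_log_name() {
    local host stamp
    host=${1:-$(hostname -s)}
    stamp=$(date +%Y%m%d%H%M%S)   # e.g. year, month, day, hour, minute, second
    printf '/root/conf_ig_UI_noafs__glite-NAGIOS.%s.%s.log\n' "$host" "$stamp"
}

# Example:
# /opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current \
#     -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee "$(yaim_log_name)"
```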
- in the yaim configuration file of prod-ui-02, changed these variables and reconfigured prod-ui-02:
GRID_AUTHORIZED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"
GRID_TRUSTED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"
useradd badoer
[badoer@grid-monitor03]# myproxy-init -k NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -s prod-ui-02.pd.infn.it -l nagios -x -Z "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
- cp /etc/nagios/plugins/send_to_db.ini /etc/nagios/plugins/send_to_db.ini
- edit /etc/nagios/plugins/send_to_db.ini changing:
db_user=mrs
db_pwd=xxx
Information about old update 07
This Nagios monitors hosts published by Top BDII bdii-enmr.cerm.unifi.it
Detailed documentation and instructions about Nagios can be found here
Management instructions
Nothing needs to be done about the published information; a cron job keeps it up to date.
After a reconfiguration
- ''service nagios restart''
Installation and configuration instructions
Initially this documentation was followed:
After a series of problems due to database name errors (only one database, named 'mrs', must be used, whereas the previous links let the user define the database names as yaim variables; see ticket https://gus.fzk.de/ws/ticket_info.php?ticket=65594), the following documentation was used to correctly complete the first installation:
"All on one box" configuration (Nagios + NRPE + ig_UI) was installed.
Here are the steps executed on grid-monitor03.pd.infn.it.
Installation
- host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''
- copied repo files in ''/etc/yum.repos.d/'' as described in documentation
- ''lcg-CA.repo''
- ''glite-BDII.repo''
- rpmforge (''rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm'')
- sa1-release (''sa1-release-3-1.el5.noarch.rpm'')
- ''ig.repo''
- ''dag.repo'' (different versions for ig and for glite)
- disabled (renamed with a different extension) ''dag.repo'', ''sl.repo'' and ''sl-security.repo'', because the option "event-scheduler=1" used in ''/etc/my.cnf'' is available only for MySQL > 5.1.6 (the default MySQL was 5.0.77)
- ''yum install httpd''
- ''yum install lcg-CA''
- ''yum groupinstall ig_UI_noafs''
- ''yum install egee-NAGIOS''
Configuration
Nagios-specific variables were defined in /opt/nfs_install/3_2/nodes/grid-monitor03.pd.infn.it
In particular:
- NAGIOS_ROLE=vo
- it creates some databases (ATP, MDDB, MS); it is unclear whether they are all really necessary
- it searches VOMS for the specified VO
- NCG_VO=enmr.eu
- BDII_HOST=bdii-enmr.cerm.unifi.it
- sets the Top BDII in which to find the sites to be monitored
- NCG_LDAP_FILTER="GlueSiteUniqueID=*"
- this is a dummy filter (* matches every site), but the variable must have a value so that yaim (config_ncg) generates the correct ''/etc/ncg/ncg.conf'' on its own, in such a way that ncg considers only the Top BDII; if the variable is not set, ncg searches for all sites matching other parameters, for example ROC=Italy
Several bugs had to be fixed before a working configuration was obtained.
- syntax error in ''/opt/glite/yaim/functions/config_mysql''
- hardcoded parameter in ''/usr/share/doc/atp-1.15.6/mysql_schema/ver_1_6/increase_version.sql''
- line 1: removed ''USE `mrs`;''
- ATP_DB_NAME is a variable defined in ''site-info.def'' with value 'atp', not hardcoded with value 'mrs', so yaim couldn't create table atp.schema_details
- no problems using only one DB named 'mrs' (ATP_DB_NAME=`mrs`)
- wrong comment in ''/opt/glite/yaim/functions/config_mddb_mysql''
- line 112: uncomment ''#mysqladmin -u root --password=${MYSQL_ADMIN} create $MDDB_DB_NAME > /dev/null 2>&1''
- MDDB database couldn't be created
- no problems using only one DB named 'mrs' (MDDB_DB_NAME=`mrs`)
- wrong parameter in function tableName in file ''/usr/libexec/mddb/synchronizer.php''
- line 22: changed ''vo'' with ''atp.vo''
- a check was being made against a nonexistent table
- no problems using only one DB named 'mrs'
- hardcoded parameters in each file in directory ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/''
- removed every instance of ''USE `mrs`;'' on every file
- MS_DB_NAME is a variable defined in ''site-info.def'' with value 'metricstore', not hardcoded with value 'mrs', so yaim couldn't go on
- no problems using only one DB named 'mrs' (MS_DB_NAME=`mrs`)
- undefined tables in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
- created tables ''vo, metrics, service, profile'' and their dependencies, copying their definitions from the ATP database (is it wrong??)
- these tables are required by other tables declared in the same file because of foreign keys, for example:
- FOREIGN KEY (vo_id )
- REFERENCES vo (id )
- no problems using only one DB named 'mrs'
- undefined field in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
- added field 'db_name' on table 'schema_details' copying from its definition in MDDB database
- it's used by file ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/increase_version.sql''
- no problems using only one DB named 'mrs'
- LDAP error with some TOPOLOGY definition:
- set these variables:
- NCG_TOPOLOGY_USE_SAM=true
- NCG_TOPOLOGY_USE_GOCDB=false
- NCG_TOPOLOGY_USE_ENOC=false
- NCG_TOPOLOGY_USE_LDAP=false
- initially the values were inverted, but that caused a blocking LDAP error whenever a host could not be contacted:
- Invoking NCG::SiteInfo::LDAP.
- DEBUG: in NCG::SiteDB::siteName with args:
- DEBUG: in NCG::SiteDB::siteLDAP with args:
- Getting info from LDAP: inaf-ce-01.ct.pi2s2.it:2170/Mds-Vo-Name=GRISU-COMETA-INAF-CT, O=Grid
- ERROR: Cannot connect to inaf-ce-01.ct.pi2s2.it:2170
- Module NCG::SiteInfo::LDAP hit critical error, stopping NCG
- exit with error in ''/usr/sbin/ncg.reload.sh''
- moved 'exit 0' from line 18 to line 19, outside the more internal 'if'.
- if the nagios service is stopped (as it is at the first configuration), ''service nagios reload'' fails with exit code 7: reload implies stop and start, and /etc/init.d/nagios treats stopping an already stopped service as an error; since the exit code was non-zero, yaim failed
- wrong directory in ''/opt/glite/yaim/functions/config_nagios''
- line 266: changed ''lock_file=/var/run/nagios.pid'' with ''lock_file=/var/run/nagios/nagios.pid''
- there was a permission denied error because the nagios daemon runs as user nagios, but the pid file was not being created in a directory writable by that user
- short(?) timeout in ''/opt/glite/yaim/functions/config_ncg''
- lines 299 and 448: changed from ''TIMEOUT=600'' to ''#TIMEOUT=600''
- error starting ncg; the log in /var/log/ncg.log:
- ERROR: Could not get results from SAM: 500 Server closed connection without sending any data back
- ERROR: Could not get list of critical metrics from SAM: 500 Server closed connection without sending any data back
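Manual patches such as the TIMEOUT change in the list above are lost when the package is updated, so it can help to script them. The sketch below comments out every `TIMEOUT=600` assignment in a file and keeps a backup; the function name `comment_out_timeout` is hypothetical, and matching by content rather than by line number (299 and 448 in the original notes) is an assumption.

```shell
#!/bin/bash
# Comment out each line that starts (after optional whitespace) with TIMEOUT=600.
# A copy of the previous content is kept as FILE.orig.
# Usage: comment_out_timeout FILE
comment_out_timeout() {
    sed -i.orig 's/^\([[:space:]]*\)TIMEOUT=600/\1#TIMEOUT=600/' "$1"
}

# Example (as root):
# comment_out_timeout /opt/glite/yaim/functions/config_ncg
# diff /opt/glite/yaim/functions/config_ncg.orig /opt/glite/yaim/functions/config_ncg
```

Running it a second time changes nothing in the file, but it overwrites the `.orig` backup with the already patched version, so the backup should be saved elsewhere first if it matters.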
After correcting the bugs, the final yaim configuration command was:
- /opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/yaim38.log
Post configuration
- changed https port to make site visible outside pd.infn.it
- edited ''/etc/httpd/conf.d/ssl.conf'' changing from 443 to 50080
- set variable NAGIOS_HTTPD_ENABLE_CONFIG=false in the yaim configuration file, to prevent the https configuration from being reset at every reconfiguration
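The port change in ''/etc/httpd/conf.d/ssl.conf'' can be previewed before editing the file. This sketch assumes the stock mod_ssl layout, with a `Listen 443` directive and a `<VirtualHost _default_:443>` line; the function name `change_https_port` is hypothetical, and other forms such as `Listen 0.0.0.0:443` are not handled.

```shell
#!/bin/bash
# Print FILE with the HTTPS port changed from OLD to NEW (default 443 -> 50080).
# Only the "Listen PORT" directive and the "<VirtualHost ...:PORT>" line are touched.
# Usage: change_https_port FILE [OLD] [NEW]
change_https_port() {
    local file=$1 old=${2:-443} new=${3:-50080}
    sed -e "s/^\([[:space:]]*Listen[[:space:]][[:space:]]*\)${old}[[:space:]]*\$/\1${new}/" \
        -e "s/^\([[:space:]]*<VirtualHost[[:space:]][^>]*:\)${old}>/\1${new}>/" \
        "$file"
}

# Example (as root): review the result, then install it and restart httpd.
# change_https_port /etc/httpd/conf.d/ssl.conf > /tmp/ssl.conf.new
# diff /etc/httpd/conf.d/ssl.conf /tmp/ssl.conf.new
```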
--
MarcoVerlato - 2014-02-19