Nagios for WeNMR

WeNMR Nagios web page: https://grid-monitor03.pd.infn.it:50080/nagios/

Access is permitted only to enmr.eu members, using a personal certificate (authorized DNs are retrieved by /etc/cron.d/voms-htpasswd and listed in the files /etc/nagios/htpasswd.users and /etc/httpd/httpd.users)

This Nagios monitors hosts belonging to the NGIs of Belgium, Germany, France, Italy, Spain, Portugal, Netherlands, UK, Poland, Malaysia, Taiwan and Brazil where enmr.eu probes can be executed.

Sites from South Africa and OSG will soon be monitored too.

Detailed documentation about Nagios can be found here

How to quickly add new WeNMR probes

  • 1. In /etc/ncg-metric-config.d create the file wenmr-probes.conf with JSON-formatted directives
  • 2. In /usr/libexec/grid-monitoring/probes/wenmr/wnjob/etc/wn.d/wenmr/ edit the services.cfg and commands.cfg files
  • 3. Implement the probe in the file /usr/libexec/grid-monitoring/probes/wenmr/wnjob/probes/wenmr/probe_name (see e.g. gromacs probe)
  • 4. Ensure the following file content, then reconfigure with yaim and restart Nagios:
cat /etc/ncg/ncg-localdb.d/wenmr-custom.conf
MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
/opt/glite/yaim/bin/yaim -s siteinfo/site-info.def -d 6 -c -n glite-UI -n glite-NAGIOS && service nagios restart
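
The check in step 4 can be scripted. This is a minimal sketch, assuming that the two MODIFY_METRIC_PARAMETER lines shown above are exactly what the file must contain; the function name check_wntar_conf is hypothetical and not part of any Nagios or NCG package.

```shell
#!/bin/sh
# Hypothetical helper: verify that both JobState metrics point at the
# WeNMR wnjob directory in the given ncg-localdb.d file.
WNJOB_DIR=/usr/libexec/grid-monitoring/probes/wenmr/wnjob/

check_wntar_conf() {
    conf_file=$1
    for metric in org.sam.CREAMCE-JobState org.sam.CE-JobState; do
        # Single-quoted format string keeps the "!" separators literal.
        wanted=$(printf 'MODIFY_METRIC_PARAMETER!%s!--add-wntar-nag!%s' \
            "$metric" "$WNJOB_DIR")
        # -F: fixed string, -x: match the whole line, -q: no output
        if ! grep -Fxq -- "$wanted" "$conf_file"; then
            echo "missing: $wanted" >&2
            return 1
        fi
    done
    return 0
}

# Example:
# check_wntar_conf /etc/ncg/ncg-localdb.d/wenmr-custom.conf && echo OK
```

The function returns non-zero as soon as one of the two lines is missing, so it can be used as a guard before running yaim.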


Information about latest update 09

Management instructions

To renew the proxy used by Nagios (it lasts 100 days), run the following from an EMI UI:

 [emi-ui]$  myproxy-init -l nagios -s prod-ui-02.pd.infn.it -k  NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -c 2400 -x -Z  "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
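
The value 2400 passed to -c is the credential lifetime in hours, which corresponds to the 100 days mentioned above. A small sketch of that arithmetic (the variable names are illustrative only):

```shell
#!/bin/sh
# myproxy-init -c expects the credential lifetime in hours.
proxy_days=100
proxy_hours=$((proxy_days * 24))
echo "$proxy_hours"   # prints 2400
```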

After a yaim reconfiguration, perform the following steps:

  • Keep only the CERM WMS, removing the other WMSes from
    • /opt/glite/etc/enmr.eu/glite_wms.conf
    • /opt/glite/etc/enmr.eu/glite_wmsui.conf

  • Add WeNMR BDII for SRM probes in sites not belonging to EGI-GOCDB
    • add the following line to /etc/ncg/ncg-localdb.d/uncert.conf
      • MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!ldap://bdii-wenmr.pd.infn.it:2170
    • or, equivalently:
 cp /etc/ncg/ncg-localdb.d/uncert.conf.GOOD /etc/ncg/ncg-localdb.d/uncert.conf 

  • Configure ncg to find sites not belonging to EGI-GOCDB on their site-BDII
    • add the following lines to /etc/ncg/ncg.conf.d/uncert.conf for each site outside EGI-GOCDB
      • # <GOCDB/>
      • ADD_HOSTS=1
      • LDAP_ADDRESS=<siteBDII>
    • or, equivalently:
 cp /etc/ncg/ncg.conf.d/uncert.conf-OK-OutOfGOCDB_sites /etc/ncg/ncg.conf.d/uncert.conf 
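
Since these steps have to be repeated after every yaim run, adding the lines idempotently can help. The following is a sketch only; ensure_line is a hypothetical helper, and it assumes the target file ends with a newline.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a file only if it is not there yet,
# so the step can be repeated safely after every yaim run.
ensure_line() {
    line=$1
    file=$2
    touch "$file"
    if ! grep -Fxq -- "$line" "$file"; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (path and value taken from the list above):
# ensure_line 'MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!ldap://bdii-wenmr.pd.infn.it:2170' \
#     /etc/ncg/ncg-localdb.d/uncert.conf
```

Calling the function twice with the same arguments leaves a single copy of the line in the file.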

To add a site:

  • if the site is certified in EGI-GOCDB:
    • edit the yaim configuration file, adding the NGI of the site (if not already present) to the variable NCG_GOCDB_ROC_NAME
    • reconfigure with yaim
  • if the site is present in EGI-GOCDB but not certified:
    • edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (it should be present in the Top BDII bdii-wenmr.pd.infn.it)
    • reconfigure with yaim
  • if the site is not present in EGI-GOCDB:
    • edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (it should be present in the Top BDII bdii-wenmr.pd.infn.it)
    • reconfigure with yaim
    • edit grid-monitor03:/etc/ncg/ncg.conf.d/uncert.conf changing:
      • <NCG::SiteInfo SITE_NAME>
      • # <GOCDB/>
      • <LDAP>
      • LDAP_ADDRESS=SITE_BDII
      • # ADD_HOSTS=0
      • ADD_HOSTS=1
      • </LDAP>
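
The three cases above can be summarised as a small decision helper. This is only an illustrative sketch: site_steps is a hypothetical function that prints a reminder of what to do, it does not edit any file.

```shell
#!/bin/sh
# Hypothetical helper summarising the three cases above.
# Arguments: in_gocdb (yes|no), certified (yes|no).
site_steps() {
    in_gocdb=$1
    certified=$2
    if [ "$in_gocdb" = yes ] && [ "$certified" = yes ]; then
        echo "add the NGI of the site to NCG_GOCDB_ROC_NAME; reconfigure with yaim"
    elif [ "$in_gocdb" = yes ]; then
        echo "add the site to UNCERTIFIED_SITES; reconfigure with yaim"
    else
        echo "add the site to UNCERTIFIED_SITES; reconfigure with yaim; edit /etc/ncg/ncg.conf.d/uncert.conf"
    fi
}

# Example:
# site_steps no no
```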

To authorize a user whose DN isn't automatically retrieved from VOMS to /etc/nagios/htpasswd.users:

  • copy the user's DN into a file /etc/voms2htpasswd-static.d/*.conf

To add a custom Nagios probe see here

Installation and configuration instructions

This documentation was followed: VO SAM

Here are the steps executed on grid-monitor03.pd.infn.it.

Installation

Installed SL5 x86_64

   service yum stop
   chkconfig yum off
  • host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''
   cd /etc/yum.repos.d/
   wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo
   wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.X/glite-BDII.repo
   wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/dag.repo
   wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/ig.repo
   wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/glite-ui.repo

   vi egi-sam.repo
      [egi-sam]
      name=EGI SAM repo
      baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
      enabled=1
      gpgcheck=0
      protect=1
      priority=10

   mv sl.repo sl.repo.disable
   mv sl-security.repo sl-security.repo.disable
   mv sl-fastbugs.repo sl-fastbugs.repo.disable
   mv sl-contrib.repo sl-contrib.repo.disable

   yum clean all
   yum install lcg-CA
   yum install httpd
   yum groupinstall ig_UI_noafs
   yum install yum-priorities
   yum remove mysql-server-5.0.77-4.el5_5.4 mysql-5.0.77-4.el5_5.4 mysql-devel-5.0.77-4.el5_5.4 [necessary because yaim configuration wants a newer version of MySQL]
  • edited /etc/yum.repos.d/dag.repo because of missing dependencies (why??) with perl-DBD-mysql-4.014-1.el5.rfx (needed by egee-NAGIOS)
   [root@grid-monitor03 ~]# cat /etc/yum.repos.d/dag.repo
      [dag]
      name=DAG rpms
      baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
      http://ftp1.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
      http://ftp2.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
      ftp://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
      enabled=1
      # To use priorities you must have yum-priorities installed
      priority=30
      [dag-extra]
      name=DAG extras
      baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/extras/
      enabled=1

   yum install egee-NAGIOS
   yum install 'perl(Class::Inspector)' [needed to let Nagios update file /etc/nagios/htpasswd.users, where authorized users are listed]
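
The four mv commands used above to disable the stock SL repositories can be written as a loop that skips files that are not present. A minimal sketch, assuming the same file names; disable_sl_repos is a hypothetical helper. It relies on the fact that yum only reads files ending in .repo.

```shell
#!/bin/sh
# Hypothetical helper: rename the stock SL repo files so that yum ignores
# them. The directory defaults to /etc/yum.repos.d.
disable_sl_repos() {
    repo_dir=${1:-/etc/yum.repos.d}
    for name in sl sl-security sl-fastbugs sl-contrib; do
        if [ -f "$repo_dir/$name.repo" ]; then
            mv "$repo_dir/$name.repo" "$repo_dir/$name.repo.disable"
        fi
    done
}

# Example:
# disable_sl_repos /etc/yum.repos.d && yum clean all
```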

Configuration

  • edit file <yaim-conf-dir>/3_2/nodes/grid-monitor03
      VOS="enmr.eu"

      NAGIOS_HOST=grid-monitor03.$MY_DOMAIN
      NAGIOS_ADMIN_DNS="/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi,/C=IT/O=INFN/OU=Personal Certificate/L=LNL/CN=Simone Badoer,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Marco Verlato"
      NCG_NAGIOS_ADMIN=simone.badoer@pd.infn.it
      NAGIOS_ROLE=vo
      NCG_PROBES_TYPE=all
      NCG_VO=enmr.eu
      NAGIOS_HTTPD_ENABLE_CONFIG=true
      NAGIOS_NCG_ENABLE_CONFIG=true
      NAGIOS_SUDO_ENABLE_CONFIG=true
      NAGIOS_NAGIOS_ENABLE_CONFIG=true
      NAGIOS_CGI_ENABLE_CONFIG=true
      NAGIOS_NSCA_PASS=xxx

      NAGIOS_NCG_ENABLE_CRON=true

      NCG_TOPOLOGY_USE_SAM=true
      NCG_TOPOLOGY_USE_GOCDB=false
      NCG_TOPOLOGY_USE_ENOC=false
      NCG_TOPOLOGY_USE_LDAP=false

      NCG_REMOTE_USE_SAM=false
      NCG_REMOTE_USE_NAGIOS=false
      NCG_REMOTE_USE_ENOC=false

      MYSQL_ADMIN="xxx"
      DB_PASS="xxx"

      MYEGI_ADMIN_NAME="Simone Badoer"
      MYEGI_ADMIN_EMAIL="simone.badoer@pd.infn.it"
      MYEGI_DEFAULT_PROFILE="ROC"

      NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
      NCG_NOTIFICATION_HEADER="WeNMR Nagios"
      NCG_INCLUDE_EMPTY_HOSTS=0
      # Found from GOCDB:
      NCG_GOCDB_ROC_NAME="Italy NGI_IT NGI_NL NGI_DE UKI NGI_IBERGRID ROC_IGALC"

      # Needed for uncertified sites:
      UNCERTIFIED_SITES="BCBR"
      UNCERTIFIED_WMS=wms-enmr.chem.uu.nl
      UNCERTIFIED_BDII=bdii-enmr.chem.uu.nl

/opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/conf_ig_UI_noafs__glite-NAGIOS.`hostname -s`.`date +mHS`.log 
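
As written, `date +mHS` prints the literal string mHS, because date format specifiers start with %; the original specifiers were presumably lost. The sketch below builds a timestamped log file name instead. The format %Y%m%d-%H%M%S is an assumption, not the format originally used, and log_name is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: build a per-host, per-run log file name.
# The date format is an assumption; any format accepted by date(1) works.
log_name() {
    host=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    printf '/root/conf_ig_UI_noafs__glite-NAGIOS.%s.%s.log\n' "$host" "$stamp"
}

# Example:
# ... 2>&1 | tee "$(log_name "$(hostname -s)")"
```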

  • in the yaim configuration file of prod-ui-02, changed these variables and reconfigured prod-ui-02:
      GRID_AUTHORIZED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"
      GRID_TRUSTED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"

   useradd badoer
   [badoer@grid-monitor03]# myproxy-init -k NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -s prod-ui-02.pd.infn.it -l nagios -x -Z "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
  • cp /etc/nagios/plugins/send_to_db.ini /etc/nagios/plugins/send_to_db.ini
  • edit /etc/nagios/plugins/send_to_db.ini changing:
      db_user=mrs
      db_pwd=xxx
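
The edit of send_to_db.ini can also be done non-interactively. This is a sketch under assumptions: set_ini_value is a hypothetical helper, the .bak backup suffix is not taken from the original procedure, and the value must not contain the characters | or &.

```shell
#!/bin/sh
# Hypothetical helper: set key=value in an ini-style file, keeping a backup.
# If the key is not present, the file is left unchanged.
set_ini_value() {
    key=$1
    value=$2
    file=$3
    cp -p "$file" "$file.bak"
    # Replace the whole "key=..." line; other lines pass through unchanged.
    sed "s|^${key}=.*|${key}=${value}|" "$file.bak" > "$file"
}

# Example:
# set_ini_value db_user mrs /etc/nagios/plugins/send_to_db.ini
```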

Information about old update 07

This Nagios monitors hosts published by Top BDII bdii-enmr.cerm.unifi.it

Detailed documentation and instructions about Nagios can be found here


Management instructions

Nothing needs to be done about the published information: a cron job keeps it up to date.

After a reconfiguration

  • ''service nagios restart''


Installation and configuration instructions

Initially this documentation was followed:

After a series of problems caused by database name errors (a single database named 'mrs' must be used, whereas in the previous links the database names are defined by the user as yaim variables; see ticket https://gus.fzk.de/ws/ticket_info.php?ticket=65594), the following documentation was used to complete the first installation correctly:

"All on one box" configuration (Nagios + NRPE + ig_UI) was installed.

Here are the steps executed on grid-monitor03.pd.infn.it.

Installation

  • host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''

  • copied repo files in ''/etc/yum.repos.d/'' as described in documentation
    • ''lcg-CA.repo''
    • ''glite-BDII.repo''
    • rpmforge (''rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm'')
    • sa1-release (''sa1-release-3-1.el5.noarch.rpm'')
    • ''ig.repo''
    • ''dag.repo'' (different versions for ig and for glite)
  • disabled (renamed with a different extension) ''dag.repo'', ''sl.repo'' and ''sl-security.repo'' because the option "event-scheduler=1" is used in file ''/etc/my.cnf'', and it is available only for MySQL > 5.1.6 (the default mysql was 5.0.77)

  • ''yum install httpd''
  • ''yum install lcg-CA''
  • ''yum groupinstall ig_UI_noafs''
  • ''yum install egee-NAGIOS''

Configuration

Nagios-specific variables were defined in /opt/nfs_install/3_2/nodes/grid-monitor03.pd.infn.it. In particular:

  • NAGIOS_ROLE=vo
    • it creates some databases (ATP, MDDB, MS)... don't know if really necessary...
    • it searches voms for the specified VO
  • NCG_VO=enmr.eu
  • BDII_HOST=bdii-enmr.cerm.unifi.it
    • it sets the Top BDII where the sites to be monitored are found
  • NCG_LDAP_FILTER="GlueSiteUniqueID=*"
    • this is a dummy filter (* matches every site), but the variable must have a value so that yaim (config_ncg) creates the correct file ''/etc/ncg/ncg.conf'' on its own, in such a way that ncg considers only the Top BDII; if the variable is not set, ncg searches for all sites matching other parameters, for example ROC=Italy

A lot of bugs had to be resolved before having a good configuration.

  • hardcoded parameter in ''/usr/share/doc/atp-1.15.6/mysql_schema/ver_1_6/increase_version.sql''
    • line 1: removed ''USE `mrs`;''
      • ATP_DB_NAME is a variable defined in ''site-info.def'' with value 'atp', not hardcoded with value 'mrs', so yaim couldn't create table atp.schema_details
    • no problems using only one DB named 'mrs' (ATP_DB_NAME=`mrs`)

  • wrong comment in ''/opt/glite/yaim/functions/config_mddb_mysql''
    • line 112: uncomment ''#mysqladmin -u root --password=${MYSQL_ADMIN} create $MDDB_DB_NAME > /dev/null 2>&1''
      • MDDB database couldn't be created
    • no problems using only one DB named 'mrs' (MDDB_DB_NAME=`mrs`)

  • wrong parameter in function tableName in file ''/usr/libexec/mddb/synchronizer.php''
    • line 22: changed ''vo'' to ''atp.vo''
      • a test was attempted on a nonexistent table
    • no problems using only one DB named 'mrs'

  • hardcoded parameters in each file in directory ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/''
    • removed every instance of ''USE `mrs`;'' on every file
      • MS_DB_NAME is a variable defined in ''site-info.def'' with value 'metricstore', not hardcoded with value 'mrs', so yaim couldn't go on
    • no problems using only one DB named 'mrs' (MS_DB_NAME=`mrs`)

  • undefined tables in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
    • created tables ''vo, metrics, service, profile'' and their dependencies, copying their definitions from the ATP database (is it wrong??)
      • these tables are required by other tables declared in the same file because of foreign keys, for example:
        • FOREIGN KEY (vo_id )
        • REFERENCES vo (id )
    • no problems using only one DB named 'mrs'

  • undefined field in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
    • added field 'db_name' on table 'schema_details' copying from its definition in MDDB database
      • it's used by file ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/increase_version.sql''
    • no problems using only one DB named 'mrs'

  • LDAP error with some TOPOLOGY definition:
    • set these variables:
      • NCG_TOPOLOGY_USE_SAM=true
      • NCG_TOPOLOGY_USE_GOCDB=false
      • NCG_TOPOLOGY_USE_ENOC=false
      • NCG_TOPOLOGY_USE_LDAP=false
    • initially they were inverted, but a blocking LDAP error occurred whenever a host could not be reached:
      • Invoking NCG::SiteInfo::LDAP.
      • DEBUG: in NCG::SiteDB::siteName with args:
      • DEBUG: in NCG::SiteDB::siteLDAP with args:
      • Getting info from LDAP: inaf-ce-01.ct.pi2s2.it:2170/Mds-Vo-Name=GRISU-COMETA-INAF-CT, O=Grid
      • ERROR: Cannot connect to inaf-ce-01.ct.pi2s2.it:2170
      • Module NCG::SiteInfo::LDAP hit critical error, stopping NCG

  • exit with error in ''/usr/sbin/ncg.reload.sh''
    • moved 'exit 0' from line 18 to line 19, outside the innermost 'if'.
      • if the nagios service is stopped (as it is at the first configuration), ''service nagios reload'' gives an error (exit 7: reload implies stop and start, and /etc/init.d/nagios considers stopping a stopped service an error); with a non-zero exit code, yaim failed

  • wrong directory in ''/opt/glite/yaim/functions/config_nagios''
    • line 266: changed ''lock_file=/var/run/nagios.pid'' to ''lock_file=/var/run/nagios/nagios.pid''
      • there was a permission denied error because the nagios daemon is executed by user nagios, but the pid file was not created in a directory writable by that user

  • short(?) timeout in ''/opt/glite/yaim/functions/config_ncg''
    • lines 299 and 448: changed from ''TIMEOUT=600'' to ''#TIMEOUT=600''
      • error starting ncg; the log in /var/log/ncg.log:
        • ERROR: Could not get results from SAM: 500 Server closed connection without sending any data back
        • ERROR: Could not get list of critical metrics from SAM: 500 Server closed connection without sending any data back
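
For the ''/usr/sbin/ncg.reload.sh'' item above, the failure comes from reloading a service that is not running. The sketch below shows an alternative way to avoid that error, not the fix that was actually applied: reload only if the service is running, otherwise start it. reload_or_start and nagios_ctl are hypothetical names.

```shell
#!/bin/sh
# Hypothetical alternative: reload the service only if it is running,
# otherwise start it. The command that controls the service is passed in
# as the first argument, so the logic can be tried without root.
reload_or_start() {
    ctl=$1
    if "$ctl" status >/dev/null 2>&1; then
        "$ctl" reload
    else
        "$ctl" start
    fi
}

# On the Nagios host this could be called through a small wrapper:
# nagios_ctl() { service nagios "$@"; }
# reload_or_start nagios_ctl
```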

After correcting the bugs, the yaim configuration command was finally run:

  • /opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/yaim38.log

Post configuration

  • changed https port to make site visible outside pd.infn.it
    • edited ''/etc/httpd/conf.d/ssl.conf'' changing from 443 to 50080

  • set variable NAGIOS_HTTPD_ENABLE_CONFIG=false in the yaim configuration file, so that the https configuration is not reset at every reconfiguration
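
The port change in ''/etc/httpd/conf.d/ssl.conf'' can be sketched with sed. This assumes the stock layout of the file, with a "Listen 443" line and a "<VirtualHost _default_:443>" line; change_ssl_port is a hypothetical helper and the .bak backup is an addition not present in the original procedure.

```shell
#!/bin/sh
# Hypothetical helper: switch the HTTPS port in an Apache ssl.conf.
# Assumes "Listen <port>" and "<VirtualHost ...:<port>>" lines.
change_ssl_port() {
    file=$1
    old=$2
    new=$3
    cp -p "$file" "$file.bak"
    sed -e "s|^Listen ${old}\$|Listen ${new}|" \
        -e "s|^\(<VirtualHost .*\):${old}>|\1:${new}>|" \
        "$file.bak" > "$file"
}

# Example (httpd must be restarted afterwards):
# change_ssl_port /etc/httpd/conf.d/ssl.conf 443 50080
```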

-- MarcoVerlato - 2014-02-19

Topic revision: r9 - 2014-02-19 - MarcoVerlato
 