---+ Nagios for !WeNMR

!WeNMR Nagios web page: [[https://grid-monitor03.pd.infn.it:50080/nagios/]]

Access is permitted only to enmr.eu members using a personal certificate (authorized DNs are retrieved by /etc/cron.d/voms-htpasswd and listed in the files /etc/nagios/htpasswd.users and /etc/httpd/httpd.users).

This Nagios monitors hosts belonging to the NGIs of Belgium, Germany, France, Italy, Spain, Portugal, Netherlands, UK, Poland, Malaysia, Taiwan and Brazil where enmr.eu probes can be executed. Sites from South Africa and OSG will soon be monitored too.

Detailed documentation about Nagios can be found [[https://wiki.egi.eu/wiki/VO_Service_Availability_Monitoring][here]].

---++ How to quickly add new WeNMR probes

   * 1. In /etc/ncg-metric-config.d create the file wenmr-probes.conf with JSON-formatted directives
   * 2. In /usr/libexec/grid-monitoring/probes/wenmr/wnjob/etc/wn.d/wenmr/ edit the services.cfg and commands.cfg files
   * 3. Implement the probe in the file /usr/libexec/grid-monitoring/probes/wenmr/wnjob/probes/wenmr/<probename>
   * 4. Ensure that the following is in place:
<verbatim>
cat /etc/ncg/ncg-localdb.d/wenmr-custom.conf
MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--add-wntar-nag!/usr/libexec/grid-monitoring/probes/wenmr/wnjob/
</verbatim>
   * 5. Add the new metrics to Poem through its application http://grid-monitor03.pd.infn.it:50180/poem/admin
   * 6. Reconfigure with
<verbatim>
/opt/glite/yaim/bin/yaim -s siteinfo/site-info.def -d 6 -c -n glite-UI -n glite-NAGIOS && service nagios restart
</verbatim>

---

---++ Information about latest update 09

---+++ Management instructions

For its probes, Nagios uses Badoer's certificate, which must be renewed before it expires (it lasts one week):
<verbatim>
[badoer@grid-monitor03]# myproxy-init --voms enmr.eu:/enmr.eu/ops -k NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -s prod-ui-02.pd.infn.it -l nagios -x -Z "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
</verbatim>

After a yaim reconfiguration, carry out the following steps:
   * Keep only the WMS of CERM, removing the other WMSes from
      * /opt/glite/etc/enmr.eu/glite_wms.conf
      * /opt/glite/etc/enmr.eu/glite_wmsui.conf
   * Add the !WeNMR BDII for SRM probes in sites not belonging to EGI-GOCDB
      * add the following line to /etc/ncg/ncg-localdb.d/uncert.conf
         * MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!ldap://bdii-wenmr.pd.infn.it:2170
      * or equivalently:
<verbatim>
cp /etc/ncg/ncg-localdb.d/uncert.conf.GOOD /etc/ncg/ncg-localdb.d/uncert.conf
</verbatim>
   * Configure ncg to find sites not belonging to EGI-GOCDB on their site-BDII
      * add the following lines to /etc/ncg/ncg.conf.d/uncert.conf for each site outside EGI-GOCDB
         * # <GOCDB/>
         * ADD_HOSTS=1
         * LDAP_ADDRESS=<siteBDII>
      * or equivalently:
<verbatim>
cp /etc/ncg/ncg.conf.d/uncert.conf-OK-OutOfGOCDB_sites /etc/ncg/ncg.conf.d/uncert.conf
</verbatim>

To add a site:
   * if the site is certified in EGI-GOCDB:
      * edit the yaim configuration file, adding the NGI of the site (if not already present) to the variable NCG_GOCDB_ROC_NAME.
      * reconfigure with yaim
   * if the site is present in EGI-GOCDB but not certified:
      * edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (the site should be present in the !TopBDII bdii-wenmr.pd.infn.it)
      * reconfigure with yaim
   * if the site is not present in EGI-GOCDB:
      * edit the yaim configuration file, adding the site name to the variable UNCERTIFIED_SITES (the site should be present in the !TopBDII bdii-wenmr.pd.infn.it)
      * reconfigure with yaim
      * edit grid-monitor03:/etc/ncg/ncg.conf.d/uncert.conf changing:
         * <NCG::SiteInfo SITE_NAME>
         * # <GOCDB/>
         * <LDAP>
         * LDAP_ADDRESS=SITE_BDII
         * # ADD_HOSTS=0
         * ADD_HOSTS=1
         * </LDAP>

To authorize a user whose DN isn't automatically retrieved from VOMS into /etc/nagios/htpasswd.users:
   * copy the user's DN into a file /etc/voms2htpasswd-static.d/*.conf

To add a custom Nagios probe see [[CustomNagiosProbes][here]].

---+++ Installation and configuration instructions

This documentation was followed: [[https://wiki.egi.eu/wiki/VO_Service_Availability_Monitoring][VO SAM]]

Here are the steps executed on grid-monitor03.pd.infn.it.
*Installation*

Installed SL5 x86_64
<verbatim>
service yum stop
chkconfig yum off
</verbatim>
   * host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''
<verbatim>
cd /etc/yum.repos.d/
wget http://repository.egi.eu/sw/production/cas/1/current/repo-files/egi-trustanchors.repo
wget http://grid-deployment.web.cern.ch/grid-deployment/glite/repos/3.X/glite-BDII.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/dag.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/ig.repo
wget http://grid-it.cnaf.infn.it/mrepo/repos/sl5/x86_64/glite-ui.repo

vi egi-sam.repo
[egi-sam]
name=EGI SAM repo
baseurl=http://repository.egi.eu/sw/production/sam/1/$basearch
enabled=1
gpgcheck=0
protect=1
priority=10

mv sl.repo sl.repo.disable
mv sl-security.repo sl-security.repo.disable
mv sl-fastbugs.repo sl-fastbugs.repo.disable
mv sl-contrib.repo sl-contrib.repo.disable

yum clean all
yum install lcg-CA
yum install httpd
yum groupinstall ig_UI_noafs
yum install yum-priorities
# necessary because the yaim configuration wants a newer version of MySQL
yum remove mysql-server-5.0.77-4.el5_5.4 mysql-5.0.77-4.el5_5.4 mysql-devel-5.0.77-4.el5_5.4
</verbatim>
   * edited /etc/yum.repos.d/dag.repo because of missing dependencies (cause unknown) with perl-DBD-mysql-4.014-1.el5.rfx (needed by egee-NAGIOS)
<verbatim>
[root@grid-monitor03 ~]# cat /etc/yum.repos.d/dag.repo
[dag]
name=DAG rpms
baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        http://ftp1.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        http://ftp2.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
        ftp://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/dag/
enabled=1
# To use priorities you must have yum-priorities installed
priority=30

[dag-extra]
name=DAG extras
baseurl=http://ftp.scientificlinux.org/linux/extra/dag/redhat/el5/en/$basearch/extras/
enabled=1

yum install egee-NAGIOS
# needed to let Nagios update the file /etc/nagios/htpasswd.users, where authorized users are listed
yum install 'perl(Class::Inspector)'
</verbatim>

*Configuration*
   * edit file <yaim-conf-dir>/3_2/nodes/grid-monitor03
<verbatim>
VOS="enmr.eu"
NAGIOS_HOST=grid-monitor03.$MY_DOMAIN
NAGIOS_ADMIN_DNS="/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Cristina Aiftimiei,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi,/C=IT/O=INFN/OU=Personal Certificate/L=LNL/CN=Simone Badoer,/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Marco Verlato"
NCG_NAGIOS_ADMIN=simone.badoer@pd.infn.it
NAGIOS_ROLE=vo
NCG_PROBES_TYPE=all
NCG_VO=enmr.eu
NAGIOS_HTTPD_ENABLE_CONFIG=true
NAGIOS_NCG_ENABLE_CONFIG=true
NAGIOS_SUDO_ENABLE_CONFIG=true
NAGIOS_NAGIOS_ENABLE_CONFIG=true
NAGIOS_CGI_ENABLE_CONFIG=true
NAGIOS_NSCA_PASS=xxx
NAGIOS_NCG_ENABLE_CRON=true
NCG_TOPOLOGY_USE_SAM=true
NCG_TOPOLOGY_USE_GOCDB=false
NCG_TOPOLOGY_USE_ENOC=false
NCG_TOPOLOGY_USE_LDAP=false
NCG_REMOTE_USE_SAM=false
NCG_REMOTE_USE_NAGIOS=false
NCG_REMOTE_USE_ENOC=false
MYSQL_ADMIN="xxx"
DB_PASS="xxx"
MYEGI_ADMIN_NAME="Simone Badoer"
MYEGI_ADMIN_EMAIL="simone.badoer@pd.infn.it"
MYEGI_DEFAULT_PROFILE="ROC"
NCG_MDDB_SUPPORTED_PROFILES="ROC,ROC_CRITICAL,ROC_OPERATORS"
NCG_NOTIFICATION_HEADER="WeNMR Nagios"
NCG_INCLUDE_EMPTY_HOSTS=0
# Found from GOCDB:
NCG_GOCDB_ROC_NAME="Italy NGI_IT NGI_NL NGI_DE UKI NGI_IBERGRID ROC_IGALC"
# Needed for uncertified sites:
UNCERTIFIED_SITES="BCBR"
UNCERTIFIED_WMS=wms-enmr.chem.uu.nl
UNCERTIFIED_BDII=bdii-enmr.chem.uu.nl

/opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/conf_ig_UI_noafs__glite-NAGIOS.`hostname -s`.`date +mHS`.log
</verbatim>
   * in the yaim configuration file of prod-ui-02, changed these variables and reconfigured prod-ui-02:
<verbatim>
GRID_AUTHORIZED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"
GRID_TRUSTED_RETRIEVERS="'/C=IT/O=INFN/OU=Host/L=Padova/CN=prod-ui-02.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=cert-30.pd.infn.it' '/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it'"

useradd badoer
[badoer@grid-monitor03]# myproxy-init -k NagiosRetrieve-grid-monitor03.pd.infn.it-enmr.eu -s prod-ui-02.pd.infn.it -l nagios -x -Z "/C=IT/O=INFN/OU=Host/L=Padova/CN=grid-monitor03.pd.infn.it"
</verbatim>
   * cp /etc/nagios/plugins/send_to_db.ini /etc/nagios/plugins/send_to_db.ini
   * edit /etc/nagios/plugins/send_to_db.ini changing:
<verbatim>
db_user=mrs
db_pwd=xxx
</verbatim>

----

---++ Information about old update 07

This Nagios monitors hosts published by Top BDII bdii-enmr.cerm.unifi.it

Detailed documentation and instructions about Nagios can be found [[https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim][here]].

----

---+++ Management instructions

Nothing needs to be done about the published information; a cron job keeps it up to date.
*After a reconfiguration*
   * ''service nagios restart''

----

---+++ Installation and configuration instructions

Initially this documentation was followed:
   * https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim
   * http://igrelease.forge.cnaf.infn.it/doku.php?id=doc:guides:install-3_2

After a series of problems due to database name errors (only one database, named 'mrs', must be used, whereas in the links above the database names are defined by the user as yaim variables; see ticket https://gus.fzk.de/ws/ticket_info.php?ticket=65594), the following documentation was used to complete the first installation correctly:
   * https://tomtools.cern.ch/confluence/display/SAM/Clean+egee-NAGIOS+installation
   * https://tomtools.cern.ch/confluence/display/SAMDOC/Update-05
   * https://tomtools.cern.ch/confluence/pages/viewpage.action?pageId=3245295

An "All on one box" configuration (Nagios + NRPE + ig_UI) was installed.

Here are the steps executed on grid-monitor03.pd.infn.it.

*Installation*
   * host certificates (''hostkey.pem'' and ''hostcert.pem'') installed in ''/etc/grid-security/''
   * copied repo files into ''/etc/yum.repos.d/'' as described in the documentation
      * ''lcg-CA.repo''
      * ''glite-BDII.repo''
      * rpmforge (''rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm'')
      * sa1-release (''sa1-release-3-1.el5.noarch.rpm'')
      * ''ig.repo''
      * ''dag.repo'' (different versions for ig and for glite)
   * disabled (renamed with a different extension) ''dag.repo'', ''sl.repo'' and ''sl-security.repo'', because the option "event-scheduler=1" used in the file ''/etc/my.cnf'' is available only for !MySQL > 5.1.6 (the default mysql was 5.0.77)
   * ''yum install httpd''
   * ''yum install lcg-CA''
   * ''yum groupinstall ig_UI_noafs''
   * ''yum install egee-NAGIOS''

*Configuration*

Nagios-specific variables were defined in /opt/nfs_install/3_2/nodes/grid-monitor03.pd.infn.it

In particular:
   * NAGIOS_ROLE=vo
      * it creates some databases (ATP, MDDB, MS); it is unclear whether they are really necessary
      * it searches voms for the specified VO
   * NCG_VO=enmr.eu
   * BDII_HOST=bdii-enmr.cerm.unifi.it
      * sets the Top BDII in which the sites to be monitored are found
   * NCG_LDAP_FILTER="GlueSiteUniqueID=*"
      * this is a "false" filter (* matches every site), but the variable must have a value so that yaim (config_ncg) creates the correct file ''/etc/ncg/ncg.conf'' by itself, in such a way that ncg considers only the Top BDII; if this variable is not set, ncg searches for all sites matching other parameters, for example ROC=Italy

Many bugs had to be fixed before reaching a working configuration.
   * syntax error in ''/opt/glite/yaim/functions/config_mysql''
      * line 83: changed ''/sbin.service'' to ''/sbin/service''
      * mysqld didn't start
      * this is a known issue: https://tomtools.cern.ch/confluence/pages/viewpage.action?pageId=3245295
   * hardcoded parameter in ''/usr/share/doc/atp-1.15.6/mysql_schema/ver_1_6/increase_version.sql''
      * line 1: removed ''USE `mrs`;''
      * ATP_DB_NAME is a variable defined in ''site-info.def'' with value 'atp', not hardcoded with value 'mrs', so yaim couldn't create table atp.schema_details
      * no problems using only one DB named 'mrs' (ATP_DB_NAME=`mrs`)
   * wrong comment in ''/opt/glite/yaim/functions/config_mddb_mysql''
      * line 112: uncommented ''#mysqladmin -u root --password=${MYSQL_ADMIN} create $MDDB_DB_NAME > /dev/null 2>&1''
      * the MDDB database couldn't be created
      * no problems using only one DB named 'mrs' (MDDB_DB_NAME=`mrs`)
   * wrong parameter in function tableName in file ''/usr/libexec/mddb/synchronizer.php''
      * line 22: changed ''vo'' to ''atp.vo''
      * a test on a nonexistent table was attempted
      * no problems using only one DB named 'mrs'
   * hardcoded parameters in each file in directory ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/''
      * removed every instance of ''USE `mrs`;'' in every file
      * MS_DB_NAME is a variable defined in ''site-info.def'' with value 'metricstore', not hardcoded with value 'mrs', so yaim couldn't proceed
      * no problems using only one DB named 'mrs' (MS_DB_NAME=`mrs`)
   * undefined tables in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
      * created tables ''vo, metrics, service, profile'' and their dependencies by copying their definitions from the ATP database (it is unclear whether this is the right approach)
      * these tables are required by other tables declared in the same file because of foreign keys, for example:
         * FOREIGN KEY (vo_id )
         * REFERENCES vo (id )
      * no problems using only one DB named 'mrs'
   * undefined field in ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/create_structure.sql''
      * added field 'db_name' to table 'schema_details', copying it from its definition in the MDDB database
      * it is used by the file ''/usr/share/doc/nagios2metricstore-1.0.29/DBScripts/initial/1.4/mysql/increase_version.sql''
      * no problems using only one DB named 'mrs'
   * LDAP error with some TOPOLOGY definitions:
      * set these variables:
         * NCG_TOPOLOGY_USE_SAM=true
         * NCG_TOPOLOGY_USE_GOCDB=false
         * NCG_TOPOLOGY_USE_ENOC=false
         * NCG_TOPOLOGY_USE_LDAP=false
      * in the beginning they were inverted, but there was a blocking LDAP error when a host couldn't be reached.
         * Invoking NCG::SiteInfo::LDAP.
         * DEBUG: in NCG::SiteDB::siteName with args:
         * DEBUG: in NCG::SiteDB::siteLDAP with args:
         * Getting info from LDAP: inaf-ce-01.ct.pi2s2.it:2170/Mds-Vo-Name=GRISU-COMETA-INAF-CT, O=Grid
         * ERROR: Cannot connect to inaf-ce-01.ct.pi2s2.it:2170
         * Module NCG::SiteInfo::LDAP hit critical error, stopping NCG
   * exit with error in ''/usr/sbin/ncg.reload.sh''
      * moved 'exit 0' from line 18 to line 19, outside the innermost 'if'.
      * if service nagios is stopped (at the first configuration it is stopped), ''service nagios reload'' gives an error (exit 7: reload implies stop and start, and /etc/init.d/nagios treats stopping an already stopped service as an error); since the exit code was non-zero, yaim failed
   * wrong directory in ''/opt/glite/yaim/functions/config_nagios''
      * line 266: changed ''lock_file=/var/run/nagios.pid'' to ''lock_file=/var/run/nagios/nagios.pid''
      * there was a permission-denied error because the nagios daemon is run by user nagios, but the pid file wasn't created in a directory writable by that user
   * short(?) timeout in ''/opt/glite/yaim/functions/config_ncg''
      * lines 299 and 448: changed from ''TIMEOUT=600'' to ''#TIMEOUT=600''
      * error starting ncg; the log in /var/log/ncg.log:
         * ERROR: Could not get results from SAM: 500 Server closed connection without sending any data back
         * ERROR: Could not get list of critical metrics from SAM: 500 Server closed connection without sending any data back

After correcting the bugs, the yaim configuration command was finally run:
   * /opt/glite/yaim/bin/ig_yaim -c -d 6 -s /usr/local/nfs/3_2/ig-site-info.def.current -n ig_UI_noafs -n glite-NAGIOS 2>&1 | tee /root/yaim38.log

*Post configuration*
   * changed the https port to make the site visible outside pd.infn.it
      * edited ''/etc/httpd/conf.d/ssl.conf'' changing from 443 to 50080
      * set variable NAGIOS_HTTPD_ENABLE_CONFIG=false in the yaim configuration file, to prevent the https configuration from being reset after every reconfiguration

---

-- Main.MarcoVerlato - 2012-02-15
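The https port change listed under *Post configuration* could be scripted roughly as follows. This is only a sketch: =change_https_port= is a hypothetical helper (not part of yaim or Nagios), and it assumes that the port appears in ssl.conf on a =Listen= line and on a =<VirtualHost ...:port>= line, which this page does not state explicitly.

```shell
#!/bin/sh
# Sketch only: change_https_port is a hypothetical helper, not part of yaim.
# Assumption: the port appears as "Listen <port>" and "<VirtualHost ...:<port>>".
change_https_port() {
    conf_file=$1   # e.g. /etc/httpd/conf.d/ssl.conf
    old_port=$2    # e.g. 443
    new_port=$3    # e.g. 50080

    # Keep a backup copy so the change can be reverted by hand.
    cp -p "$conf_file" "$conf_file.bak" || return 1

    # Rewrite only the Listen and VirtualHost lines; leave everything else alone.
    sed -e "s/^\([[:space:]]*Listen[[:space:]]\{1,\}\)${old_port}[[:space:]]*\$/\1${new_port}/" \
        -e "s/^\([[:space:]]*<VirtualHost[[:space:]].*:\)${old_port}>/\1${new_port}>/" \
        "$conf_file.bak" > "$conf_file"
}
```

It would be called as =change_https_port /etc/httpd/conf.d/ssl.conf 443 50080=, followed by a restart of httpd. The backup file ssl.conf.bak is left in place so that the original configuration can be restored manually.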
Topic revision: r7 - 2013-11-22 - MarcoVerlato
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.