- Enable Argus authZ and check results PASS
In site-info.def:
USE_ARGUS=yes
ARGUS_PEPD_ENDPOINTS="https://argus01.lcg.cscs.ch:8154/authz https://argus02.lcg.cscs.ch:8154/authz
https://argus03.lcg.cscs.ch:8154/authz"
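The change is then applied by reconfiguring the WMS profile with yaim (the same invocation used further below for the yaim-wms Argus bug):
/opt/glite/yaim/bin/yaim -c -s site-info.def -n WMS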
22/03/2012: argus-gsi-pep-callout missing from the MP; error while using gridftp. FIXED
22/05/2012 New test:
22 May 2012, 16:38:56 -I- PID: 30454 (Debug) - Calling the WMProxy jobRegister service
Warning - Unable to register the job to the service: https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
Argus denied authorization on jobRegister by ,C=IT,O=INFN,OU=Personal Certificate,L=CNAF,CN=Marco Cecchi
Error code: SOAP-ENV:Server
PASS
TODO: need to try with one policy that lets us pass
BUGS:
Vulnerability bug in ICE's proxy renewal (Advisory-SVG-2012-4073) PRE-CERTIFIED (adorigo 18/07/2012)
I used the ICE RPM from http://etics-repository.cern.ch/repository/download/registered/emi/emi.wms.ice/3.3.5/sl5_x86_64_gcc412EPEL/glite-wms-ice-3.3.5-4.sl5.x86_64.rpm
and verified that the issue described in Advisory-SVG-2012-4073 has been fixed, even at ICE startup (i.e. starting from a situation in which the vulnerability is present).
Vulnerability bug in ICE's proxy renewal (Advisory-SVG-2012-4039) PRE-CERTIFIED (adorigo 09/07/2012)
The bug has been fixed and verified. In particular, it has been verified that the vulnerability has disappeared and that under "normal" conditions proxy renewal still works correctly. The procedure for checking that proxy renewal works correctly is described here.
glite-wms-ice-proxy-renew can block indefinitely (https://savannah.cern.ch/bugs/?95584) PRE-CERTIFIED (adorigo 09/07/2012)
Set the following configuration parameters in /etc/glite-wms/glite_wms.conf (on the WMS node):
[root@cream-01 persist_dir]# egrep "proxy_renewal_freq|ice_log_level|renewal_timeout" /etc/glite-wms/glite_wms.conf
ice_log_level = 700;
proxy_renewal_frequency = 60;
proxy_renewal_timeout = 60;
Then open a console (as root) on any publicly reachable machine and start this command:
[root@cream-28 ~]# openssl s_server -cert /etc/grid-security/hostcert.pem -key /etc/grid-security/hostkey.pem -accept 7000
This simulates a non-responding, overloaded MyProxy server.
Log into the WMS node and make sure that the proxy renewal daemon cache is empty and that ICE doesn't have any proxy cached (also remove old ICE log files for convenience):
[root@cream-01 persist_dir]# service gLite stop
[...]
[root@cream-01 persist_dir]# rm -f /var/ice/persist_dir/* /var/log/wms/ice.log*
\rm -f /var/glite/spool/glite-renewd/*
[root@cream-01 persist_dir]# service gLite start
Tail ICE's log file, filtering on proxy renewal messages:
[root@cream-01 persist_dir]# tail -f /var/log/wms/ice.log |grep iceCommandDelegationRenewal
Switch to a WMS UI machine and write this JDL:
dorigoa@cream-03 13:47:23 ~>cat wms.jdl
[
Executable = "/bin/echo";
Arguments = "ciao";
InputSandbox = {};
stdoutput="stdout";
stderror="stderr";
OutputSandbox = {"stdout","stderr"};
requirements = RegExp("cream.*", other.GlueCEUniqueID);
myproxyserver="cream-28.pd.infn.it:7000"; // the machine running the openssl server
]
Submit the above JDL and switch to the console where the tail is running. After a while you should see log messages like these:
2012-07-09 13:47:52,785 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Contacting MyProxy server [cream-28.pd.infn.it:7000] for user dn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alvise Dorigo-/dteam/Role=NULL/Capability=NULL] with proxy certificate [/var/ice/persist_dir/779DCED0D3865DD842211CF2C904581ED1E2A964.betterproxy] to renew it...
2012-07-09 13:48:52,822 ERROR - iceCommandDelegationRenewal::renewAllDelegations() - Proxy renewal failed: [ERROR - /usr/bin/glite-wms-ice-proxy-renew killed the renewal child after timeout of 60 seconds. The proxy /var/ice/persist_dir/779DCED0D3865DD842211CF2C904581ED1E2A964.betterproxy has NOT been renewed! ]
clearly showing that the proxy renewal process no longer blocks indefinitely on a non-responding server, but only for a maximum of 60 seconds, as set in glite_wms.conf.
Deregistration of a proxy (2) (https://savannah.cern.ch/bugs/?83453) PRE-CERTIFIED (adorigo 25/06/2012)
Submitted a job and verified the proxy-deregistration in the syslog:
Connecting to the service https://cream-01.pd.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel09.cnaf.infn.it:9000/tsk_AU6HJZB4l9IPNeLdPQ
==========================================================================
From file /var/log/messages:
Jun 25 16:35:03 cream-01 glite-proxy-renewd[18810]: Proxy /var/proxycache/%2FC%3DIT%2FO%3DINFN%2FOU%3DPersonal%20Certificate%2FL%3DPadova%2FCN%3DAlvise%20Dorigo/9n4QO2qQq50mmLDohkmIhw/userproxy.pem of job https://devel09.cnaf.infn.it:9000/cJBETCHFmomxOnqC8ZJTiw has been registered as /var/glite/spool/glite-renewd/45a96bd6a16770e5fdc4c60bbae2646e.0
Jun 25 16:35:03 cream-01 glite-proxy-renewd[18810]: Proxy /var/proxycache/%2FC%3DIT%2FO%3DINFN%2FOU%3DPersonal%20Certificate%2FL%3DPadova%2FCN%3DAlvise%20Dorigo/9n4QO2qQq50mmLDohkmIhw/userproxy.pem of job https://devel09.cnaf.infn.it:9000/82dORq-A05f06LDYxwtlXQ has been registered as /var/glite/spool/glite-renewd/45a96bd6a16770e5fdc4c60bbae2646e.0
Jun 25 16:35:03 cream-01 glite_wms_wmproxy_server: submission from cream-12.pd.infn.it, DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alvise Dorigo, FQAN=/dteam/Role=NULL/Capability=NULL, userid=18757 for jobid=https://devel09.cnaf.infn.it:9000/tsk_AU6HJZB4l9IPNeLdPQ
[dorigoa@cream-12 ~]$ glite-wms-job-status https://devel09.cnaf.infn.it:9000/tsk_AU6HJZB4l9IPNeLdPQ
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel09.cnaf.infn.it:9000/tsk_AU6HJZB4l9IPNeLdPQ
Current Status: Done (Success)
Exit code: 0
Submitted: Mon Jun 25 16:35:03 2012 CEST
==========================================================================
- Nodes information for:
Status info for the Job : https://devel09.cnaf.infn.it:9000/82dORq-A05f06LDYxwtlXQ
Current Status: Done (Success)
Logged Reason(s):
- job completed
- Job Terminated Successfully
Exit code: 0
Status Reason: Job Terminated Successfully
Destination: cream-23.pd.infn.it:8443/cream-lsf-creamtest2
Submitted: Mon Jun 25 16:35:03 2012 CEST
==========================================================================
Status info for the Job : https://devel09.cnaf.infn.it:9000/cJBETCHFmomxOnqC8ZJTiw
Current Status: Done (Success)
Logged Reason(s):
- job completed
- Job Terminated Successfully
Exit code: 0
Status Reason: Job Terminated Successfully
Destination: cream-23.pd.infn.it:8443/cream-lsf-creamtest2
Submitted: Mon Jun 25 16:35:03 2012 CEST
==========================================================================
Again from /var/log/messages:
Jun 25 16:35:56 cream-01 glite-proxy-renewd[18810]: Proxy /var/glite/spool/glite-renewd/45a96bd6a16770e5fdc4c60bbae2646e.0 of job https://devel09.cnaf.infn.it:9000/cJBETCHFmomxOnqC8ZJTiw has been unregistered
Jun 25 16:35:56 cream-01 glite-proxy-renewd[18810]: Proxy /var/glite/spool/glite-renewd/45a96bd6a16770e5fdc4c60bbae2646e.0 of job https://devel09.cnaf.infn.it:9000/82dORq-A05f06LDYxwtlXQ has been unregistered
some sensible information should be logged on syslog (https://savannah.cern.ch/bugs/?92657) PRE-CERTIFIED (mcecchi 29/03/12)
May 18 17:19:07 devel09 glite_wms_wmproxy_server: submission from ui.cnaf.infn.it, DN=/C=IT/O=INFN/OU=Personal Certificate/L=CNAF/CN=Marco Cecchi, FQAN=/dteam/Role=NULL/Capability=NULL, userid=18264 for jobid=https://devel07.cnaf.infn.it:9000/_K-LYpekDA1xk9sisW5hBA
May 18 17:19:36 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/D6jb8a1zk8kfwLhm2PxAIw, destination=cmsrm-cream01.roma1.infn.it:8443/cream-lsf-cmsgcert
May 18 17:19:36 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/mlMh39hZcZSHzLozxmSUXQ, destination=ce203.cern.ch:8443/cream-lsf-grid_2nh_dteam
May 18 17:19:36 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/ToPUYBYlHPX8ucct9GMxJA, destination=ce207.cern.ch:8443/cream-lsf-grid_2nh_dteam
May 18 17:19:36 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/eqK70rofnD_ssOJVVGzZFg, destination=ce04.esc.qmul.ac.uk:8443/cream-sge-lcg_long
May 18 17:19:36 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/RWbT_HnCv_xhVIgxOz6Zew, destination=atlasce02.scope.unina.it:8443/cream-pbs-egeecert
May 18 17:19:37 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/Uen9xJpMD_muv4DgUa7XHg, destination=ce01.eela.if.ufrj.br:8443/cream-pbs-dteam
May 18 17:19:37 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/surSO68FUvkoMMJ5vwnslw, destination=ce-cr-02.ts.infn.it:8443/cream-lsf-cert
May 18 17:19:37 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/K6OlQy5IR0a_bs3jyIcASQ, destination=ce-grisbi.cbib.u-bordeaux2.fr:8443/cream-pbs-dteam
May 18 17:19:37 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/rQzNsXNGtVRtKjt0hW-n7w, destination=ce0.m3pec.u-bordeaux1.fr:8443/cream-pbs-dteam
May 18 17:19:37 devel09 glite-wms-workload_manager: jobid=https://devel07.cnaf.infn.it:9000/PVm7kNQu_h7CzrsFhGbIrQ, destination=cccreamceli06.in2p3.fr:8443/cream-sge-long
WMS UI emi-wmproxy-api-cpp and emi-wms-ui-api-python still use gethostbyaddr/gethostbyname (https://savannah.cern.ch/bugs/?89668) PRE-CERTIFIED (alvise 28/03/12)
Verified with a grep on the source code.
Submission with rfc proxy doesn't work (https://savannah.cern.ch/bugs/?88128) PRE-CERTIFIED (mcecchi 14/6/12)
Create a proxy with voms-proxy-init and option -rfc.
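For example (dteam VO, as used elsewhere in these tests):
[mcecchi@devel15 ~]$ voms-proxy-init --voms dteam -rfc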
1) Authentication fails at the LCG-CE, as expected:
27 Mar, 16:01:17 -I- EventGlobusSubmitFailed::process_event(): Got globus submit failed event.
27 Mar, 16:01:17 -I- EventGlobusSubmitFailed::process_event(): For cluster: 1363, reason: 7 authentication failed: GSS Major Status: Authentication Failed GSS Minor Status Error Chain: init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization init_sec_contex
27 Mar, 16:01:17 -I- EventGlobusSubmitFailed::process_event(): Job id = https://devel09.cnaf.infn.it:9000/kDLx5tj_Nxpt2ewI35AARQ
27 Mar, 16:01:17 -I- SubmitReader::internalRead(): Reading condor submit file of job https://devel09.cnaf.infn.it:9000/kDLx5tj_Nxpt2ewI35AARQ
2) let's submit a job to CREAM:
[mcecchi@devel15 ~]$ glite-wms-job-submit -a --endpoint https://devel16.cnaf.infn.it:7443/glite_wms_wmproxy_server -r cream-38.pd.infn.it:8443/cream-pbs-creamtest2 ls_cream.jdl
Connecting to the service https://devel16.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel16.cnaf.infn.it:9000/XRSoZaPhZds7iwhh_z4DDw
==========================================================================
[mcecchi@devel15 ~]$ glite-wms-job-status https://devel16.cnaf.infn.it:9000/XRSoZaPhZds7iwhh_z4DDw
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel16.cnaf.infn.it:9000/XRSoZaPhZds7iwhh_z4DDw
Current Status: Running
Status Reason: unavailable
Destination: cream-38.pd.infn.it:8443/cream-pbs-creamtest2
Submitted: Thu Jun 14 15:18:15 2012 CEST
==========================================================================
[mcecchi@devel15 ~]$ glite-wms-job-status https://devel16.cnaf.infn.it:9000/XRSoZaPhZds7iwhh_z4DDw
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel16.cnaf.infn.it:9000/XRSoZaPhZds7iwhh_z4DDw
Current Status: Done(Success)
Logged Reason(s):
- job completed
- Job Terminated Successfully
Exit code: 0
Status Reason: Job Terminated Successfully
Destination: cream-38.pd.infn.it:8443/cream-pbs-creamtest2
Submitted: Thu Jun 14 15:18:15 2012 CEST
EMI WMS wmproxy init.d script stop/start problems (https://savannah.cern.ch/bugs/?89577) PRE-CERTIFIED (mcecchi 23/03/12)
1. The restart command does not restart httpd, whereas stop + start does:
[root@devel09 ~]# /etc/init.d/glite-wms-wmproxy status
/usr/bin/glite_wms_wmproxy_server is running...
[root@devel09 ~]# ps aux | grep httpd
glite 11440 0.0 0.0 96440 2196 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 11441 0.0 0.0 96440 2480 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 11442 0.0 0.0 96440 2480 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 11443 0.0 0.0 96440 2480 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 11444 0.0 0.0 96440 2480 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 11445 0.0 0.0 96440 2480 ? S 11:31 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
root 15789 0.0 0.0 61192 740 pts/0 R+ 11:52 0:00 grep httpd
root 24818 0.0 0.1 96440 4716 ? Ss Mar19 0:04 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
[root@devel09 ~]# /etc/init.d/glite-wms-wmproxy restart
Restarting /usr/bin/glite_wms_wmproxy_server... ok
[root@devel09 ~]# ps aux | grep httpd
glite 15889 0.0 0.0 96440 2196 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 15890 0.0 0.0 96440 2488 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 15891 0.0 0.0 96440 2480 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 15892 0.0 0.0 96440 2480 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 15893 0.0 0.0 96440 2480 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
glite 15894 0.0 0.0 96440 2480 ? S 11:52 0:00 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
root 15897 0.0 0.0 61196 772 pts/0 S+ 11:53 0:00 grep httpd
root 24818 0.0 0.1 96440 4716 ? Ss Mar19 0:04 /usr/sbin/httpd -k start -f /etc/glite-wms/glite_wms_wmproxy_httpd.conf
2. A start immediately following a stop often fails and has to be repeated to get the service working again:
[root@devel09 ~]# /etc/init.d/glite-wms-wmproxy stop; /etc/init.d/glite-wms-wmproxy start
Stopping /usr/bin/glite_wms_wmproxy_server... ok
Starting /usr/bin/glite_wms_wmproxy_server... ok
3. The stop and start commands fail when invoked via ssh ([Sun Oct 09 17:08:01 2011] [warn] PassEnv variable HOSTNAME was undefined):
[mcecchi@cnaf ~]$ ssh root@devel09 '/etc/init.d/glite-wms-wmproxy stop;/etc/init.d/glite-wms-wmproxy start'
root@devel09's password:
Stopping /usr/bin/glite_wms_wmproxy_server... ok
Starting /usr/bin/glite_wms_wmproxy_server... ok
Make some WMS init scripts System V compatible (https://savannah.cern.ch/bugs/?91115) PRE-CERTIFIED (mcecchi 23/03/12)
[root@devel09 ~]# grep -1 chkconfig /etc/init.d/glite-wms-ice
# chkconfig: 345 95 06
# description: startup script for the ICE process
[root@devel09 ~]# grep -1 chkconfig /etc/init.d/glite-wms-wm
# chkconfig: 345 94 06
# description: WMS processing engine
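With these headers in place the scripts can be registered and queried with chkconfig; a quick check (output illustrative, runlevels 3, 4 and 5 on as per the headers above):
[root@devel09 ~]# chkconfig --add glite-wms-wm
[root@devel09 ~]# chkconfig --list glite-wms-wm
glite-wms-wm    0:off 1:off 2:off 3:on 4:on 5:on 6:off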
Semi-automated service backends configuration for WMS (task #23845, EMI Development Tracker, Done) PRE-CERTIFIED (mcecchi 23/03/12)
[root@devel09 ~]# cat /etc/my.cnf
[mysqld]
innodb_flush_log_at_trx_commit=2
innodb_buffer_pool_size=500M
!includedir /etc/mysql/conf.d/
innodb_flush_log_at_trx_commit=2 and innodb_buffer_pool_size=500M are the settings expected to be present.
WMproxy GACLs do not support wildcards (as they used to do) (https://savannah.cern.ch/bugs/?87261) PRE-CERTIFIED (mcecchi 23/03/12)
with:
<gacl version="0.0.1">
<entry> <voms> <fqan>/dtea*</fqan></voms> <allow> <exec/> </allow> </entry>
</gacl>
GRANT
<gacl version="0.0.1">
<entry> <voms> <fqan>/dteaM*</fqan></voms> <allow> <exec/> </allow> </entry>
</gacl>
DENY
<gacl version="0.0.1">
<entry> <voms> <fqan>/dteam</fqan></voms> <allow> <exec/> </allow> </entry>
</gacl>
DENY
<gacl version="0.0.1">
<entry> <voms> <fqan>/dteam</fqan></voms> <allow> <exec/> </allow> </entry>
<entry> <voms> <fqan>/dteam/*</fqan></voms> <allow> <exec/> </allow> </entry>
</gacl>
GRANT
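The FQAN that these GACL entries are matched against can be checked on the UI (the FQAN below is the one appearing in the submission logs above):
[mcecchi@ui ~]$ voms-proxy-info -fqan
/dteam/Role=NULL/Capability=NULL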
WMS logs should keep track of the last 90 days (https://savannah.cern.ch/bugs/?89871) PRE-CERTIFIED (mcecchi 22/03/12)
[root@devel09 ~]# grep -r rotate\ 90 /etc/logrotate.d/
/etc/logrotate.d/wm: rotate 90
/etc/logrotate.d/globus-gridftp: rotate 90
/etc/logrotate.d/globus-gridftp: rotate 90
/etc/logrotate.d/lcmaps: rotate 90
/etc/logrotate.d/lm: rotate 90
/etc/logrotate.d/jc: rotate 90
/etc/logrotate.d/glite-wms-purger: rotate 90
/etc/logrotate.d/wmproxy: rotate 90
/etc/logrotate.d/argus: rotate 90
/etc/logrotate.d/ice: rotate 90
[root@devel09 ~]# grep -r daily /etc/logrotate.d/
/etc/logrotate.d/wm: daily
/etc/logrotate.d/kill-stale-ftp: daily
/etc/logrotate.d/globus-gridftp: daily
/etc/logrotate.d/globus-gridftp: daily
/etc/logrotate.d/lcmaps: daily
/etc/logrotate.d/lm: daily
/etc/logrotate.d/jc: daily
/etc/logrotate.d/glite-wms-purger: daily
/etc/logrotate.d/wmproxy: daily
/etc/logrotate.d/argus: daily
/etc/logrotate.d/ice: daily
/etc/logrotate.d/glite-lb-server: daily
/etc/logrotate.d/bdii: daily
/etc/logrotate.d/kill-stale-ftp has "rotate 30" but it should be "rotate 90".
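A one-liner that lists the files still lacking a "rotate 90" line (kill-stale-ftp shows up here, together with files not covered by the bug):
[root@devel09 ~]# grep -L "rotate 90" /etc/logrotate.d/*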
yaim-wms changes for Argus based authZ (https://savannah.cern.ch/bugs/?90760) NOT PRE-CERTIFIED (mcecchi 22/03/12)
Modified site-info.def with:
USE_ARGUS=yes
ARGUS_PEPD_ENDPOINTS="https://argus01.lcg.cscs.ch:8154/authz https://argus02.lcg.cscs.ch:8154/authz
https://argus03.lcg.cscs.ch:8154/authz"
ran:
/opt/glite/yaim/bin/yaim -c -s siteinfo/site-info.def -n WMS
in glite_wms.conf:
ArgusAuthz = true;
ArgusPepdEndpoints = {"https://argus01.lcg.cscs.ch:8154/authz", "https://argus02.lcg.cscs.ch:8154/authz", "https://argus03.lcg.cscs.ch:8154/authz"};
glite-wms-check-daemons.sh should not restart daemons under the admin's nose (https://savannah.cern.ch/bugs/?89674) PRE-CERTIFIED (mcecchi 22/03/12)
[root@devel09 ~]# /etc/init.d/glite-wms-wm start
starting workload manager... ok
[root@devel09 ~]# /etc/init.d/glite-wms-wm status
/usr/bin/glite-wms-workload_manager (pid 486) is running...
[root@devel09 ~]# ll /var/run/glite-wms-w*
-rw-r--r-- 1 root root 4 Mar 22 10:02 /var/run/glite-wms-workload_manager.pid
[root@devel09 ~]# /etc/init.d/glite-wms-wm stop
stopping workload manager... ok
[root@devel09 ~]# date
Thu Mar 22 10:02:37 CET 2012
[root@devel09 ~]# ll /var/run/glite-wms-w*
ls: /var/run/glite-wms-w*: No such file or directory
[root@devel09 ~]# cat /etc/cron.d/glite-wms-check-daemons.cron
HOME=/
MAILTO=SA3-italia
*/5 * * * * root . /usr/libexec/grid-env.sh ; sh /usr/libexec/glite-wms-check-daemons.sh > /dev/null 2>&1
[root@devel09 ~]# sh /usr/libexec/glite-wms-check-daemons.sh
[root@devel09 ~]# /etc/init.d/glite-wms-wm status
/usr/bin/glite-wms-workload_manager is not running
[root@devel09 ~]# ps aux | grep workl
root 970 0.0 0.0 61192 756 pts/2 S+ 10:03 0:00 grep workl
[root@devel09 ~]# /etc/init.d/glite-wms-wm start
starting workload manager... ok
[root@devel09 ~]# ps aux | grep workl
glite 1009 11.0 0.6 255060 26444 ? Ss 10:04 0:00 /usr/bin/glite-wms-workload_manager --conf glite_wms.conf --daemon
root 1013 0.0 0.0 61196 764 pts/2 S+ 10:04 0:00 grep workl
[root@devel09 ~]# kill -9 1009
[root@devel09 ~]# sh /usr/libexec/glite-wms-check-daemons.sh
stopping workload manager... ok
starting workload manager... ok
[root@devel09 ~]# /etc/init.d/glite-wms-wm status
/usr/bin/glite-wms-workload_manager (pid 1196) is running...
[root@devel09 ~]#
Wrong location for PID file (https://savannah.cern.ch/bugs/?89857) PRE-CERTIFIED (mcecchi 21/03/12)
[root@devel09 ~]# ls /var/run/*pid
/var/run/atd.pid /var/run/crond.pid /var/run/glite-wms-job_controller.pid /var/run/gpm.pid /var/run/klogd.pid /var/run/ntpd.pid
/var/run/brcm_iscsiuio.pid /var/run/exim.pid /var/run/glite-wms-log_monitor.pid /var/run/haldaemon.pid /var/run/messagebus.pid /var/run/sshd.pid
/var/run/condor_master.pid /var/run/glite-wms-ice-safe.pid /var/run/glite-wms-workload_manager.pid /var/run/iscsid.pid /var/run/nrpe.pid /var/run/syslogd.pid
For ICE the PID file is now correctly handled:
[root@devel09 ~]# /etc/init.d/glite-wms-ice status
/usr/bin/glite-wms-ice-safe (pid 16820) is running...
[root@devel09 ~]# ll /var/run/glite-wms-ice-safe.pid
-rw-r--r-- 1 root root 6 Mar 27 14:41 /var/run/glite-wms-ice-safe.pid
[root@devel09 ~]# cat /var/run/glite-wms-ice-safe.pid
16820
[root@devel09 ~]# /etc/init.d/glite-wms-ice stop
stopping ICE... ok
[root@devel09 ~]# /etc/init.d/glite-wms-ice start
starting ICE... ok
[root@devel09 ~]# ll /var/run/glite-wms-ice-safe.pid
-rw-r--r-- 1 root root 6 Mar 27 14:42 /var/run/glite-wms-ice-safe.pid
[root@devel09 ~]# cat /var/run/glite-wms-ice-safe.pid
16969
[root@devel09 ~]# /etc/init.d/glite-wms-ice status
/usr/bin/glite-wms-ice-safe (pid 16969) is running...
[root@devel09 ~]# /etc/init.d/glite-wms-ice restart
stopping ICE... ok
starting ICE... ok
[root@devel09 ~]# ps -ef|grep ice
root 2942 1 0 Mar26 ? 00:00:00 gpm -m /dev/input/mice -t exps2
glite 17074 1 0 14:43 ? 00:00:00 /usr/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /tmp/icepid
glite 17080 17074 0 14:43 ? 00:00:00 sh -c /usr/bin/glite-wms-ice --conf glite_wms.conf /var/log/wms/ice_console.log 2>&1
glite 17081 17080 0 14:43 ? 00:00:00 /usr/bin/glite-wms-ice --conf glite_wms.conf /var/log/wms/ice_console.log
root 17114 23643 0 14:43 pts/1 00:00:00 grep ice
Pid file of ICE and WM has glite ownership (https://savannah.cern.ch/bugs/?91834) - PRE-CERTIFIED (adorigo - 20121001)
It must be verified that the WM and ICE PID files are owned by root. To do this, as root, stop the WM and ICE services and delete any previous PID files; then restart the services and check the ownership as in the example below:
[root@cream-01 ~]# /etc/init.d/glite-wms-wm stop
stopping workload manager... ok
[root@cream-01 ~]# \rm /var/run/glite-wms-workload_manager.pid
[root@cream-01 ~]# /etc/init.d/glite-wms-ice stop
stopping ICE... ok
[root@cream-01 ~]# \rm /var/run/glite-wms-ice-safe.pid
[root@cream-01 ~]# /etc/init.d/glite-wms-wm start
starting workload manager... ok
[root@cream-01 ~]# /etc/init.d/glite-wms-ice start
starting ICE... ok
[root@cream-01 ~]# ll /var/run/glite-wms-workload_manager.pid /var/run/glite-wms-ice-safe.pid
-rw-r--r-- 1 root root 6 Oct 1 11:39 /var/run/glite-wms-ice-safe.pid
-rw-r--r-- 1 root root 6 Oct 1 11:38 /var/run/glite-wms-workload_manager.pid
The job replanner should be configurable (https://savannah.cern.ch/bugs/?91941) PRE-CERTIFIED (mcecchi 21/03/12)
EnableReplanner=true in the WM conf
21 Mar, 16:51:54 -I: [Info] main(/home/condor/execute/dir_2479/userdir/emi.wms.wms-manager/src/main.cpp:468): WM startup completed...
21 Mar, 16:51:54 -I: [Info] operator()(/home/condor/execute/dir_2479/userdir/emi.wms.wms-manager/src/replanner.cpp:288): replanner in action
21 Mar, 16:51:57 -W: [Warning] get_site_name(/home/condor/execute/dir_2479/userdir/emi.wms.wms-ism/src/purchaser/ldap-utils.cpp:162): Cannot find GlueSiteUniqueID assignment.
EnableReplanner=false in the WM conf
21 Mar, 16:54:02 -I: [Info] main(/home/condor/execute/dir_2479/userdir/emi.wms.wms-manager/src/main.cpp:468): WM startup completed...
21 Mar, 16:54:05 -W: [Warning] get_site_name(/home/condor/execute/dir_2479/userdir/emi.wms.wms-ism/src/purchaser/ldap-utils.cpp:162): Cannot find GlueSiteUniqueID assignment.
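The setting itself can be checked directly in the configuration (output illustrative; the attribute sits in the WorkloadManager section):
[root@devel09 ~]# grep -i EnableReplanner /etc/glite-wms/glite_wms.conf
EnableReplanner = true;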
GlueServiceStatusInfo: ?? (https://savannah.cern.ch/bugs/?89435) PRE-CERTIFIED (mcecchi 21/03/12)
[root@devel09 ~]# /var/lib/bdii/gip/provider/glite-info-provider-service-wmproxy-wrapper|grep -i servicestatusinfo
GlueServiceStatusInfo: /usr/bin/glite_wms_wmproxy_server is running...
WMProxy limiter should log more at info level (https://savannah.cern.ch/bugs/?72280) PRE-CERTIFIED (mcecchi 21/03/12).
In the wmproxy conf:
jobRegister = "${WMS_LOCATION_SBIN}/glite_wms_wmproxy_load_monitor --oper jobRegister --load1 0 --load5 20 --load15 18 --memusage 99 --diskusage 95 --fdnum 1000 --jdnum 1500 --ftpconn 300";
LogLevel = 5;
Restart wmproxy; then in the wmproxy log:
21 Mar, 16:28:52 -S- PID: 875 - "wmputils::doExecv": Child failure, exit code: 256
21 Mar, 16:28:52 -I- PID: 875 - "wmpgsoapoperations::ns1__jobRegister": ------------------------------- Fault description --------------------------------
21 Mar, 16:28:52 -I- PID: 875 - "wmpgsoapoperations::ns1__jobRegister": Method: jobRegister
21 Mar, 16:28:52 -I- PID: 875 - "wmpgsoapoperations::ns1__jobRegister": Code: 1228
21 Mar, 16:28:52 -I- PID: 875 - "wmpgsoapoperations::ns1__jobRegister": Description: System load is too high:
Threshold for Load Average(1 min): 0 => Detected value for Load Average(1 min): 0.19
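The same denial can be triggered by hand by running the load script with the thresholds from the conf and checking its exit code (illustrative; the "exit code: 256" reported by wmproxy above corresponds to exit status 1 of the script):
[root@devel09 ~]# /usr/sbin/glite_wms_wmproxy_load_monitor --oper jobRegister --load1 0 --load5 20 --load15 18 --memusage 99 --diskusage 95 --fdnum 1000 --jdnum 1500 --ftpconn 300
[root@devel09 ~]# echo $?
1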
EMI WMS wmproxy rpm doesn't set execution permissions as it used to do in gLite (https://savannah.cern.ch/bugs/?89506) PRE-CERTIFIED (mcecchi 15/5/2012)
After installing the wmproxy rpm and without running yaim:
[root@devel09 ~]# ll /usr/libexec/glite_wms_wmproxy_dirmanager
-rwsr-xr-x 1 root root 18989 May 15 10:26 /usr/libexec/glite_wms_wmproxy_dirmanager
[root@devel09 ~]# ll /usr/sbin/glite_wms_wmproxy_load_monitor
-rwsr-xr-x 1 root root 22915 May 15 10:26 /usr/sbin/glite_wms_wmproxy_load_monitor
yaim-wms: set ldap query filter expression for GLUE2 in WMS configuration (https://savannah.cern.ch/bugs/?91563) PRE-CERTIFIED (mcecchi)
There are GLUE2 filters for both CE and SE:
[root@devel09 ~]# grep G2LDAP /etc/glite-wms/glite_wms.conf
IsmIiG2LDAPCEFilterExt = "(|(&(objectclass=GLUE2ComputingService)(|(GLUE2ServiceType=org.glite.ce.ARC)(GLUE2ServiceType=org.glite.ce.CREAM)))(|(objectclass=GLUE2ComputingManager)(|(objectclass=GLUE2ComputingShare)(|(&(objectclass=GLUE2ComputingEndPoint)(GLUE2EndpointInterfaceName=org.glite.ce.CREAM))(|(objectclass=GLUE2ToStorageService)(|(&(objectclass=GLUE2MappingPolicy)(GLUE2PolicyScheme=org.glite.standard))(|(&(objectclass=GLUE2AccessPolicy)(GLUE2PolicyScheme=org.glite.standard))(|(objectclass=GLUE2ExecutionEnvironment)(|(objectclass=GLUE2ApplicationEnvironment)(|(objectclass=GLUE2Benchmark)))))))))))";
IsmIiG2LDAPSEFilterExt = "(|(objectclass=GLUE2StorageService)(|(objectclass=GLUE2StorageManager)(|(objectclass=GLUE2StorageShare)(|(objectclass=GLUE2StorageEndPoint)(|(objectclass=GLUE2MappingPolicy)(|(objectclass=GLUE2AccessPolicy)(|(objectclass=GLUE2DataStore)(|(objectclass=GLUE2StorageServiceCapacity)(|(objectclass=GLUE2StorageShareCapacity))))))))))";;
[root@devel09 ~]#
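The filters can be exercised against a top BDII with ldapsearch on the GLUE2 branch (host shown is a placeholder; o=glue is the GLUE2 base):
[root@devel09 ~]# ldapsearch -x -LLL -h <top-bdii-host> -p 2170 -b o=glue '(objectClass=GLUE2ComputingService)' GLUE2ServiceType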
JobController logfile name is misspelled (https://savannah.cern.ch/bugs/?32611) PRE-CERTIFIED (alvise)
Verified that in the glite_wms.conf file the log file name is correct:
[root@devel09 glite-wms]# grep jobcontroller glite_wms.conf
LogFile = "${WMS_LOCATION_LOG}/jobcontroller_events.log";
glite-wms-job-submit doesn't always pick up other WMProxy endpoints if load on WMS is high (https://savannah.cern.ch/bugs/?40370) HOPEFULLY FIXED (alvise)
[wms] GlueServiceStatusInfo content is ugly (https://savannah.cern.ch/bugs/?48068) PRE-CERTIFIED (alvise).
Verified that the output of the command "/etc/init.d/glite-wms-wmproxy status" is as requested.
[ yaim-wms ] CeForwardParameters should include several more parameters (https://savannah.cern.ch/bugs/?61315) PRE-CERTIFIED (alvise):
[root@devel09 ~]# grep CeF /etc/glite-wms/glite_wms.conf
CeForwardParameters = {"GlueHostMainMemoryVirtualSize","GlueHostMainMemoryRAMSize",
"GlueCEPolicyMaxCPUTime", "GlueCEPolicyMaxObtainableCPUTime", "GlueCEPolicyMaxObtainableWallClockTime", "GlueCEPolicyMaxWallClockTime" };
Files specified with absolute paths shouldn't be used with inputsandboxbaseuri (https://savannah.cern.ch/bugs/?74832) PRE-CERTIFIED (alvise).
Verified that the JDL described in the comment is correctly handled by activating debug (--debug); the debug log showed that the file /etc/fstab is correctly staged from the UI node via gsiftp.
There's an un-catched out_of_range exception in the ICE component (https://savannah.cern.ch/bugs/?75099) PRE-CERTIFIED (alvise)
Tried on my build machine (able to run ICE without WM) submitting a JDL with an empty "ReallyRunningToken" attribute. ICE didn't crash as before. It is not yet possible to test all-in-one (WMProxy/WM/ICE) because of a problem with LCMAPS.
Too much flexibility in JDL syntax (https://savannah.cern.ch/bugs/?75802) PRE-CERTIFIED (alvise)
Verified with --debug that glite-wms-job-submit correctly mangles the Environment attribute:
dorigoa@cream-01 11:00:47 ~/emi/wmsui_emi2>grep -i environment jdl2
environment = "FOO=bar";
dorigoa@cream-01 11:50:02 ~/emi/wmsui_emi2>stage/usr/bin/glite-wms-job-submit --debug -a -c ~/JDLs/WMS/wmp_gridit.conf jdl2
[...]
-----------------------------------------
07 March 2012, 11:50:45 -I- PID: 3397 (Debug) - Registering JDL [ stdoutput = "out3.out"; SignificantAttributes = { "Requirements","Rank" }; DefaultNodeRetryCount = 5; executable = "ssh1.sh"; Type = "job"; Environment = { "FOO=bar" }; AllowZippedISB = false; VirtualOrganisation = "dteam"; JobType = "normal"; DefaultRank = -other.GlueCEStateEstimatedResponseTime; outputsandbox = { "out3.out","err2.err","fstab","grid-mapfile","groupmapfile"," passwd" }; InputSandbox = { "file:///etc/fstab","grid-mapfile","groupmapfile","gsiftp://cream-38.pd.infn.it/etc/passwd","file:///home/dorigoa/ssh1.sh" }; stderror = "err2.err"; inputsandboxbaseuri = "gsiftp://cream-38.pd.infn.it/etc/grid-security"; rank = -other.GlueCEStateEstimatedResponseTime; MyProxyServer = "myproxy.cern.ch"; requirements = other.GlueCEStateStatus == "Production" || other.GlueCEStateStatus == "testbedb" ]
[...]
So the mangling Environment = "FOO=bar"; -> Environment = { "FOO=bar" }; occurs correctly.
getaddrinfo() sorts results according to RFC3484, but random ordering is lost (https://savannah.cern.ch/bugs/?82779) PRE-CERTIFIED (alvise).
I did a shallow test; a thorough test needs an alias pointing to at least 3 or 4 different WMS nodes. The alias provided on the bug's Savannah page resolves to only 2 physical hosts, and I observed that both hosts are chosen by the UI when submitting several jobs. I ran this test from my EMI2 WMS UI work area, as I do not have a WMS UI EMI2 machine to try on.
glite-wms-job-status needs a json-compliant format (https://savannah.cern.ch/bugs/?82995) PRE-CERTIFIED (alvise), from my WMS UI EMI2 work area:
dorigoa@lxgrid05 14:20:13 ~/emi/wmsui_emi2>stage/usr/bin/glite-wms-job-status --json https://wms014.cnaf.infn.it:9000/pVQojatZbyoj_Pyab66_dw
{ "result": "success" , "https://wms014.cnaf.infn.it:9000/pVQojatZbyoj_Pyab66_dw": { "Current Status": "Done(Success)", "Logged Reason": {"0": "job completed","1": "Job Terminated Successfully"}, "Exit code": "0", "Status Reason": "Job Terminated Successfully", "Destination": "grive02.ibcp.fr:8443/cream-pbs-dteam", "Submitted": "Mon Mar 12 14:11:55 2012 CET", "Done": "1331558020"} }
Last LB event logged by ICE when job aborted for proxy expired should be ABORTED (https://savannah.cern.ch/bugs/?84839) PRE-CERTIFIED (alvise)
Submitted to ICE (running from my EMI2 work area) a job sleeping for 5 minutes with a proxy valid for 3 minutes (myproxyserver not set, so no proxy renewal). The last event logged by ICE (as shown in ICE's log) is:
2012-03-13 10:49:33,616 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Proxy is expired; Job has been terminated (got SIGTERM)] - [GRIDJobID="https://grid005.pd.infn.it:9000/0001331632035.314183" CREAMJobID="https://cream-23.pd.infn.it:8443/CREAM017935418"]
glite-wms-job-status needs a better handling of purged-related error code (https://savannah.cern.ch/bugs/?85063) HOPEFULLY FIXED (alvise). Reproducing the scenario that triggered the problem is highly improbable.
pkg-config info for wmproxy-api-cpp should be enriched (https://savannah.cern.ch/bugs/?85799) PRE-CERTIFIED (alvise, 30/03/2012):
[root@devel09 ~]# rpm -ql glite-wms-wmproxy-api-cpp-devel
/usr/include/glite
/usr/include/glite/wms
/usr/include/glite/wms/wmproxyapi
/usr/include/glite/wms/wmproxyapi/wmproxy_api.h
/usr/include/glite/wms/wmproxyapi/wmproxy_api_utilities.h
/usr/lib64/libglite_wms_wmproxy_api_cpp.so
/usr/lib64/pkgconfig/wmproxy-api-cpp.pc
[root@devel09 ~]# cat /usr/lib64/pkgconfig/wmproxy-api-cpp.pc
prefix=/usr
exec_prefix=${prefix}
libdir=${exec_prefix}/lib64
includedir=${prefix}/include
Name: wmproxy api cpp
Description: WMProxy C/C++ APIs
Version: 3.3.3
Requires: emi-gridsite-openssl
Libs: -L${libdir} -lglite_wms_wmproxy_api_cpp
Cflags: -I${includedir}
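The enriched .pc file can then be consumed in the usual way; the output reflects the Libs/Cflags fields above:
[root@devel09 ~]# pkg-config --cflags --libs wmproxy-api-cpp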
queryDb has 2 handling user's options (see ggus ticket for more info) (https://savannah.cern.ch/bugs/?86267) PRE-CERTIFIED (alvise). The way to verify it is the same as described (by me) in the related ticket: https://ggus.eu/tech/ticket_show.php?ticket=73658.
glite-wms-job-list-match --help show an un-implemented (and useless) option "--default-jdl" (https://savannah.cern.ch/bugs/?87444) PRE-CERTIFIED (alvise)
The command glite-wms-job-list-match --help doesn't show that option anymore.
EMI WMS wmproxy rpm doesn't set execution permissions as it used to do in gLite (https://savannah.cern.ch/bugs/?89506) PRE-CERTIFIED (alvise):
[root@devel09 ~]# ll /usr/sbin/glite_wms_wmproxy_load_monitor /usr/bin/glite_wms_wmproxy_server /usr/bin/glite-wms-wmproxy-purge-proxycache /usr/libexec/glite_wms_wmproxy_dirmanager
-rwxr-xr-x 1 nobody nobody 1876 Mar 2 15:14 /usr/bin/glite-wms-wmproxy-purge-proxycache
-rwxr-xr-x 1 nobody nobody 3059020 Mar 2 15:14 /usr/bin/glite_wms_wmproxy_server
-rwsr-xr-x 1 nobody nobody 22637 Mar 2 15:14 /usr/libexec/glite_wms_wmproxy_dirmanager
-rwsr-xr-x 1 nobody nobody 22915 Mar 2 15:14 /usr/sbin/glite_wms_wmproxy_load_monitor
Ownership is nobody:nobody rather than root:root, so the suid bit does not grant root privileges either.
yaim-wms creates wms.proxy in wrong path (https://savannah.cern.ch/bugs/?90129) PRE-CERTIFIED (alvise)
The path of wms.proxy now seems to be correct:
[root@devel09 ~]# ll ${WMS_LOCATION_VAR}/glite/wms.proxy
-r-------- 1 glite glite 2824 Mar 14 12:00 /var/glite/wms.proxy
[root@devel09 ~]# ll ${WMS_LOCATION_VAR}/wms.proxy
ls: /var/wms.proxy: No such file or directory
ICE log verbosity should be reduced to 300 (https://savannah.cern.ch/bugs/?91078) PRE-CERTIFIED (alvise):
[root@devel09 etc]# grep ice_log_level /etc/glite-wms/glite_wms.conf
ice_log_level = 300;
move lcmaps.log from /var/log/glite to WMS_LOCATION_LOG (https://savannah.cern.ch/bugs/?91484) PRE-CERTIFIED (alvise):
[root@devel09 etc]# ll $WMS_LOCATION_LOG/lcmaps.log ; ll /var/log/glite/lcmaps.log
-rw-r--r-- 1 glite glite 588 Mar 19 09:41 /var/log/wms/lcmaps.log
ls: /var/log/glite/lcmaps.log: No such file or directory
WMS: use logrotate uniformly in ice, lm, jc, wm, wmp (https://savannah.cern.ch/bugs/?91486) PRE-CERTIFIED (22/03/12 mcecchi)
logrotate disappeared from cron jobs:
[root@devel09 ~]# grep -r rotate /etc/cron.d
[root@devel09 ~]#
because it is consistently managed here:
[root@devel09 ~]# ll /etc/logrotate.d/
total 96
-rw-r--r-- 1 root root 111 Mar 22 15:40 argus
-rw-r--r-- 1 root root 106 Mar 10 00:13 bdii
-rw-r--r-- 1 root root 109 Mar 22 15:39 fetch-crl
-rw-r--r-- 1 root root 194 Mar 10 11:23 glite-lb-server
-rw-r--r-- 1 root root 128 Mar 22 15:40 glite-wms-purger
-rw-r--r-- 1 root root 240 Nov 22 05:06 globus-gridftp
-rw-r--r-- 1 root root 167 Feb 27 20:09 httpd
-rw-r--r-- 1 root root 109 Mar 22 15:40 ice
-rw-r--r-- 1 root root 126 Mar 22 15:40 jc
-rw-r--r-- 1 root root 83 Mar 10 06:34 kill-stale-ftp
-rw-r--r-- 1 root root 112 Mar 22 15:40 lcmaps
-rw-r--r-- 1 root root 123 Mar 22 15:40 lm
-rw-r--r-- 1 root root 129 Mar 22 15:40 wm
-rw-r--r-- 1 root root 192 Mar 22 15:40 wmproxy
remove several dismissed parameters from the WMS configuration (https://savannah.cern.ch/bugs/?91488) PRE-CERTIFIED (alvise)
Verified that the parameters cited in the Savannah bug are missing from glite_wms.conf (the command grep -E 'log_file_max_size|log_rotation_base_file|log_rotation_max_file_number|ice.input_type|wmp.input_type|wmp.locallogger|wm.dispatcher_type|wm.enable_bulk_mm|wm.ism_ii_ldapsearch_async' /etc/glite-wms/glite_wms.conf didn't produce any output).
WMS needs cron job to kill stale GridFTP processes (https://savannah.cern.ch/bugs/?67489) DONE, PRE-CERTIFIED (alvise, 27/03/2012)
Rebuilt the kill-stale-ftp RPM from the branch and installed it on devel09.
[root@devel09 ~]# cat /etc/cron.d/kill-stale-ftp.cron
PATH=/sbin:/bin:/usr/sbin:/usr/bin
5,15,25,35,45,55 * * * * root /sbin/kill-stale-ftp.sh >> /var/log/kill-stale-ftp.log 2>&1
[root@devel09 ~]# ll /sbin/kill-stale-ftp.sh
-rwxr-xr-x 1 root root 841 Mar 27 12:29 /sbin/kill-stale-ftp.sh
With my commit of today (27/03/2012) the path is now correct. Moreover, the script now seems to work when invoked by cron:
[root@devel09 ~]# tail -2 /var/log/kill-stale-ftp.log
=== START Tue Mar 27 14:05:01 CEST 2012 PID 6617
=== READY Tue Mar 27 14:05:01 CEST 2012 PID 6617
WMS UI depends on a buggy libtar (on SL5 at least) (https://savannah.cern.ch/bugs/?89443) PRE-CERTIFIED (alvise, 28/03/2012). Tried this JDL:
dorigoa@lxgrid05 16:01:39 ~/emi/wmsui_emi2>cat ~/JDLs/WMS/wms_test_tar_bug.jdl
[
AllowZippedISB = true;
Executable = "/bin/ls" ;
Arguments = "-lha " ;
Stdoutput = "ls.out" ;
InputSandbox = {"isb1", "isb2","isb3", "temp/isb4"};
OutputSandbox = { ".BrokerInfo", "ls.out"} ;
Retrycount = 2;
ShallowRetryCount = -1;
usertags = [ bug = "#82687" ];
VirtualOrganisation="dteam";
]
dorigoa@lxgrid05 16:01:41 ~/emi/wmsui_emi2>stage/usr/bin/glite-wms-job-submit --debug -a -e https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server ~/JDLs/WMS/wms_test_tar_bug.jdl
[ ... ]
28 March 2012, 16:02:21 -I- PID: 14236 (Debug) - File Transfer (gsiftp)
Command: /usr/bin/globus-url-copy
Source: file:///tmp/ISBfiles_YaMkV2gJJbddD38QUNR5DA_0.tar.gz
Destination: gsiftp://devel09.cnaf.infn.it:2811/var/SandboxDir/p_/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fp_5fNucDphCF_5fynIE-0XKnxg/input/ISBfiles_YaMkV2gJJbddD38QUNR5DA_0.tar.gz
-----------------------------------------
-----------------------------------------
28 March 2012, 16:02:22 -I- PID: 14236 (Debug) - File Transfer (gsiftp) Transfer successfully done
[ ... ]
So the .tar.gz file has been correctly created/transferred/removed. Verified that the source code no longer uses libtar's functions:
dorigoa@lxgrid05 16:11:05 ~/emi/wmsui_emi2>grep -r libtar emi.wms-ui.wms-ui-commands/src/
emi.wms-ui.wms-ui-commands/src/utilities/options_utils.cpp:* of the archiving tool (libtar; if zipped feature is allowed).
emi.wms-ui.wms-ui-commands/src/utilities/options_utils.h: * of the archiving tool (libtar; if zipped feature is allowed).
emi.wms-ui.wms-ui-commands/src/services/jobsubmit.cpp~://#include "libtar.h"
emi.wms-ui.wms-ui-commands/src/services/jobsubmit.cpp://#include "libtar.h"
Complete procedure to verify the bug and the fix (as reported in the Savannah bug):
- Have root or sudo access to a UI with an EMI1 installation
- create the path /home/alex/J0
- create the non-empty files:
-bash-3.2# cd /home/alex/J0
-bash-3.2# ls -l
total 12
-rw-r--r-- 1 root root 413 May 11 14:38 hoco_ltsh.e
-rw-r--r-- 1 root root 413 May 11 14:38 ltsh.sh
-rw-r--r-- 1 root root 413 May 11 14:38 plantilla_venus.dat
(make sure they are world-readable)
- Create this JDL file:
[dorigoa@cream-12 ~]$ cat JDLs/WMS/JDL_bug_89443.jdl
[
StdOutput = "myjob.out";
ShallowRetryCount = 10;
SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
RetryCount = 3;
Executable = "ltsh.sh";
Type = "job";
Arguments = "hoco_ltsh.e 0 1 200 114611111";
AllowZippedISB = true;
VirtualOrganisation = "gridit";
JobType = "normal";
DefaultRank = -other.GlueCEStateEstimatedResponseTime;
ZippedISB = { "ISBfiles_rjKoznMzsjvH6Nuvp0AhMQ_0.tar.gz" };
OutputSandbox = { "myjob.out","myjob.err","out.tar.gz" };
InputSandbox = { "file:///home/alex/J0/plantilla_venu...","file:///home/alex/J0/ltsh.sh","file:///home/alex/J0/hoco_ltsh.e" };
StdError = "myjob.err";
rank = -other.GlueCEStateEstimatedResponseTime;
MyProxyServer = "myproxy.cnaf.infn.it";
requirements = ( regexp("ng-ce.grid.unipg.it:8443/cream-pbs-grid",other.GlueCEUniqueID) )&& ( other.GlueCEStateStatus == "Production" )
]
Do not change anything in it; it must be submitted "as is".
Submit this JDL with this command:
$ glite-wms-job-submit --register-only -a --debug -e https://prod-wms-01.ct.infn.it:7443... <YOUR_JDL_CREATED_IN_THE_PREVIOUS_STEP> >&! log
>&! redirects stdout/stderr to a file in the tcsh shell; adapt it to your shell.
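With bash the equivalent redirection is:
$ glite-wms-job-submit --register-only -a --debug -e https://prod-wms-01.ct.infn.it:7443... <YOUR_JDL_CREATED_IN_THE_PREVIOUS_STEP> > log 2>&1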
Then grep ZIP in the log just created:
11 May 2012, 15:26:14 -I- PID: 11561 (Debug) - ISB ZIPPED file successfully created: /tmp/ISBfiles_ZT4DysizXpjHOT-hmzQf2A_0.tar.gz
ISB ZIP file : /tmp/ISBfiles_ZT4DysizXpjHOT-hmzQf2A_0.tar.gz
Decompress it:
dorigoa@cream-12 15:29:34 ~/JDLs/WMS>gunzip /tmp/ISBfiles_QPGfonkfQyOTbXa6uDpnZQ_0.tar.gz
dorigoa@cream-12 15:29:38 ~/JDLs/WMS>tar tvf /tmp/ISBfiles_QPGfonkfQyOTbXa6uDpnZQ_0.tar
-rw-r--r-- root/root 413 2012-05-11 14:38:09 SandboxDir/nJ/https_3a_2f_2fprod-wms-01.ct.infn.it_3a9000_2fnJmsSZ3XIaff3gbwkF0TVQ/input/plantilla_venus.dat
-rw-r--r-- root/root 413 2012-05-11 14:38:06 SandboxDir/nJ/https_3a_2f_2fprod-wms-01.ct.infn.it_3a9000_2fnJmsSZ3XIaff3gbwkF0TVQ/input/ltsh.sh
-rw-r--r-- root/root 413 2012-05-11 14:38:12 SandboxDir/nJ/https_3a_2f_2fprod-wms-01.ct.infn.it_3a9000_2fnJmsSZ3XIaff3gbwkF0TVQ/input/hoco_ltsh.
You can see that the filename hoco_ltsh.e has been truncated (to hoco_ltsh.) in the archive.
Repeat the same procedure on an EMI2 UI; the output will differ slightly as regards the location of the ISB.....tar.gz file, but again unzip it and check it with "tar tvf"; you should see that the last file is no longer truncated.
ICE should use env vars in its configuration (https://savannah.cern.ch/bugs/?90830) PRE-CERTIFIED (alvise, 29/03/2012).
Check glite_wms.conf:
[root@devel09 siteinfo]# grep -E 'persist_dir|Input|ice_host_cert|ice_host_key' /etc/glite-wms/glite_wms.conf
ice_host_cert = "${GLITE_HOST_CERT}";
Input = "${WMS_LOCATION_VAR}/ice/jobdir";
persist_dir = "${WMS_LOCATION_VAR}/ice/persist_dir";
ice_host_key = "${GLITE_HOST_KEY}";
cron job deletes /var/proxycache (https://savannah.cern.ch/bugs/?90640) PRE-CERTIFIED (alvise, 29/03/2012). Verified the usage of "-mindepth 1" as explained in the bug's comment on Savannah:
[root@devel09 cron.d]# grep proxycache *
glite-wms-wmproxy-purge-proxycache.cron:0 */6 * * * root . /usr/libexec/grid-env.sh ; /usr/bin/glite-wms-wmproxy-purge-proxycache /var/proxycache > /var/log/wms/glite-wms-wmproxy-purge-proxycache.log 2>&1
[root@devel09 cron.d]# grep find /usr/bin/glite-wms-wmproxy-purge-proxycache
find $1 -mindepth 1 -cmin +60 > $tmp_file
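The effect of -mindepth 1 is easy to see: without it, find lists the starting directory itself (which the cron job would then delete); with it, only the content is listed:
[root@devel09 ~]# find /var/proxycache | head -1
/var/proxycache
[root@devel09 ~]# find /var/proxycache -mindepth 1 | head -1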
ICE jobdir issue - 1 bad CE can block all jobs (https://savannah.cern.ch/bugs/?80751) PRE-CERTIFIED (alvise, 29/03/2012). This is HOPEFULLY FIXED; I verified that the source code fixing the problem is there, but it is very difficult to test, because one would need to simulate a CE that continuously goes into connection timeout.
EMI-1 WMS does not propagate user job exit code (https://savannah.cern.ch/bugs/?92922) PRE-CERTIFIED (mcecchi, 7/5/2012)
Submitted this job:
[
Executable = "/bin/false";
Arguments = "";
StdOutput = "out.log";
StdError = "err.log";
InputSandbox = {};
OutputSandbox = {};
myproxyserver="";
requirements = !RegExp("cream.*", other.GlueCEUniqueID);
RetryCount = 0;
ShallowRetryCount = 1;
]
and got:
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel16.cnaf.infn.it:9000/SWI6lM88RZ0noSpQhmz1EQ
Current Status: Done (Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Destination: lyogrid02.in2p3.fr:2119/jobmanager-pbs-dteam
Submitted: Thu Jun 7 11:06:19 2012 CEST
==========================================================================
glite_wms_wmproxy_server segfaults after job registration failure (https://savannah.cern.ch/bugs/?94845) PRE-CERTIFIED (mcecchi, 18/6/2012)
This has been checked extensively across all the tests. Submit a collection and check that it is properly registered.
LB failover mechanism in WMproxy needs to be reviewed (https://savannah.cern.ch/bugs/?90034) PRE-CERTIFIED (mcecchi, 18/6/2012)
In the WorkloadManagerProxy conf, put an invalid URL as the first LB in the vector:
LBServer = {"aaa.cnaf.infn.it:9000", "devel09.cnaf.infn.it:9000"};
Submit a job from the UI; after a while you will see that the second LB in the vector is picked up:
[mcecchi@ui ~]$ glite-wms-job-submit -a --endpoint https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server coll_1.jdl
Connecting to the service https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel09.cnaf.infn.it:9000/T3olEsUGW4Tgi-EAp4hUcw
==========================================================================
Cancellation of a dag's node doesn't work (https://savannah.cern.ch/bugs/?81651) PRE-CERTIFIED (mcecchi, 18/6/2012)
[mcecchi@ui ~]$ glite-wms-job-cancel https://devel09.cnaf.infn.it:9000/UE6W4uCneruHXY4R05azWg
Are you sure you want to remove specified job(s) [y/n]y : y
Connecting to the service https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
============================= glite-wms-job-cancel Success =============================
The cancellation request has been successfully submitted for the following job(s):
- https://devel09.cnaf.infn.it:9000/UE6W4uCneruHXY4R05azWg
========================================================================================
[mcecchi@ui ~]$ glite-wms-job-status https://devel09.cnaf.infn.it:9000/u-ig_gAOkNyOaHMYf2kyHA
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel09.cnaf.infn.it:9000/u-ig_gAOkNyOaHMYf2kyHA
Current Status: Running
Submitted: Mon Jun 18 12:27:14 2012 CEST
==========================================================================
- Nodes information for:
Status info for the Job : https://devel09.cnaf.infn.it:9000/UE6W4uCneruHXY4R05azWg
Current Status: Running
Status Reason: unavailable
Destination: ce03.ific.uv.es:8443/cream-pbs-short
Submitted: Mon Jun 18 12:27:14 2012 CEST
==========================================================================
[mcecchi@ui ~]$ glite-wms-job-status https://devel09.cnaf.infn.it:9000/u-ig_gAOkNyOaHMYf2kyHA
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel09.cnaf.infn.it:9000/u-ig_gAOkNyOaHMYf2kyHA
Current Status: Cleared
Submitted: Mon Jun 18 12:27:14 2012 CEST
==========================================================================
- Nodes information for:
Status info for the Job : https://devel09.cnaf.infn.it:9000/UE6W4uCneruHXY4R05azWg
Current Status: Cancelled
Logged Reason(s):
- Cancelled by user
Status Reason: Cancelled by user
Destination: ce03.ific.uv.es:8443/cream-pbs-short
Submitted: Mon Jun 18 12:27:14 2012 CEST
==========================================================================
EMI WMS WM might abort resubmitted jobs (https://savannah.cern.ch/bugs/?89508) PRE-CERTIFIED (mcecchi, 19/6/2012)
This problem was caused by the use of a throwing statement to read the value of GlueCEInfoHostName in the CE Ad. Now a non-throwing one is used:
ce_ad->EvaluateAttrString("GlueCEInfoHostName", ceinfohostname);
To test it, submit to a queue that does not publish GlueCEInfoHostName. The submission should run fine, without giving this error:
ClassAd error: attribute "GlueCEInfoHostName" does not exist or has the wrong type (expecting "std::string"))
WMProxy code requires FQANs (https://savannah.cern.ch/bugs/?72169) PRE-CERTIFIED (mcecchi, 19/6/2012)
The WMP code has been changed as requested in the ticket:
// gacl file has a valid entry for user proxy without fqan
if (execDN || execAU) {
    exec = execDN = execAU = true;
}
Various issues found while testing
18/04/2012
1) [dorigoa@cream-12 ~]$ glite-wms-job-submit -c ~/JDLs/WMS/wmp_devel09.conf -a ~/JDLs/WMS/wms.jdl
Connecting to the service https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
Warning - Unable to submit the job to the service: https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
Proxy file doesn't exist or has bad permissions
Error code: SOAP-ENV:Server
FIXED 17/04/12, commit in wmproxy. The problem was due to the recent authN/Z restructuring and occurred on a jobSubmit operation (i.e. a job submitted without ISB).
May 4 2012:
2) load_monitor gives:
Can't do setuid (cannot exec sperl)
FIXED: the dependency perl-suidperl has to be added in the MP.
Using repo from 9/7/2012:
dependency: perl-suidperl
provider: perl-suidperl.i386 4:5.8.8-32.el5_7.6
provider: perl-suidperl.x86_64 4:5.8.8-32.el5_7.6
provider: perl-suidperl.x86_64 4:5.8.8-32.el5_6.3
3) with 'AsyncJobStart = false;' wmp crashes every second submission. The problem is wherever there is a:
if (conf.getAsyncJobStart()) {
    // \/ Copy environment and restore it right after FCGI_Finish
    char** backupenv = copyEnvironment(environ);
    FCGI_Finish(); // returns control to client
    environ = backupenv;
    // /\ From here on, execution is asynchronous
}
/usr/bin/glite_wms_wmproxy_server
/lib64/libpthread.so.0
/lib64/libc.so.6(gsignal+0x35)
/lib64/libc.so.6(abort+0x110)
/lib64/libc.so.6
/lib64/libc.so.6
/lib64/libc.so.6(cfree+0x4b)
classad::ClassAd::~ClassAd()
glite::jdl::Ad::~Ad()
jobStart(jobStartResponse&, std::string const&, soap*)
ns1__jobStart(soap*, std::string, ns1__jobStartResponse&)
soap_serve_ns1__jobStart(soap*)
soap_serve_request(soap*)
glite::wms::wmproxy::server::WMProxyServe::wmproxy_soap_serve(soap*)
glite::wms::wmproxy::server::WMProxyServe::serve()
/usr/bin/glite_wms_wmproxy_server(main+0x667)
/lib64/libc.so.6(__libc_start_main+0xf4)
glite::wmsutils::exception::Exception::getStackTrace()
FIXED 14/05/12, commit in wmproxy (copyEnvironment)
4) submitting dag1.jdl
getSandboxBulkDestURI(getSandboxBulkDestURIResponse&, std::string const&, std::string const&)
ns1__getSandboxBulkDestURI(soap*, std::string, std::string, ns1__getSandboxBulkDestURIResponse&)
soap_serve_ns1__getSandboxBulkDestURI(soap*)
soap_serve_request(soap*)
glite::wms::wmproxy::server::WMProxyServe::wmproxy_soap_serve(soap*)
glite::wms::wmproxy::server::WMProxyServe::serve()
/usr/bin/glite_wms_wmproxy_server(main+0x667)
/lib64/libc.so.6(__libc_start_main+0xf4)
glite::wmsutils::exception::Exception::getStackTrace()
FIXED 14/05/12 by an update in LB bkserver from the latest RC
[mcecchi@ui ~]$ glite-wms-job-submit -a --endpoint https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server dag1.jdl
Connecting to the service https://devel09.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel09.cnaf.infn.it:9000/RDzM29cERl8imVCMnFoRyA
==========================================================================
5) WM DOES NOT MATCH ANYTHING
FIXED by moving the requirements onto WmsRequirements only, in the JDL. No requirements on ce_ad and no need for symmetric_match anymore.
The expression is now:
WmsRequirements = ((ShortDeadlineJob =?= TRUE ? RegExp(".*sdj$", other.GlueCEUniqueID) : !RegExp(".*sdj$", other.GlueCEUniqueID)) && (other.GlueCEPolicyMaxTotalJobs == 0 || other.GlueCEStateTotalJobs < other.GlueCEPolicyMaxTotalJobs) && (EnableWmsFeedback =?= TRUE ? RegExp("cream", other.GlueCEImplementationName, "i") : true) && (member(CertificateSubject, other.GlueCEAccessControlBaseRule) || member(strcat("VO:", VirtualOrganisation), other.GlueCEAccessControlBaseRule) || FQANmember(strcat("VOMS:", VOMS_FQAN), other.GlueCEAccessControlBaseRule)) && !FQANmember(strcat("DENY:", VOMS_FQAN), other.GlueCEAccessControlBaseRule) && (IsUndefined(other.OutputSE) || member(other.OutputSE, GlueCESEBindGroupSEUniqueID)));
6) Load script on wmproxy has security issues:
Insecure $ENV{PATH} while running setuid at /usr/sbin/glite_wms_wmproxy_load_monitor line 26.
FIXED, commit in wmproxy: load script interpreted with perl -U
TESTED on 16/5/2012
16 May, 16:13:55 -D- PID: 6886 - "wmpcommon::callLoadScriptFile": Executing command: /usr/sbin/glite_wms_wmproxy_load_monitor --oper jobRegister --load1 22 --load5 20 --load15 18 --memusage 99 --diskusage 95 --fdnum 1000 --jdnum 1500 --ftpconn 300
16 May, 16:13:55 -D- PID: 6886 - "wmpcommon::callLoadScriptFile": Executing load script file: /usr/sbin/glite_wms_wmproxy_load_monitor
22/05/12. INSTALLING FROM EMI2 RC4
7) [root@devel09 ~]# /usr/bin/glite-wms-workload_manager
22 May, 11:51:06 -I: [Info] main(main.cpp:289): This is the gLite Workload Manager, running with pid 2454
22 May, 11:51:06 -I: [Info] main(main.cpp:297): loading broker dll libglite_wms_helper_broker_ism.so
cannot load dynamic library libglite_wms_helper_broker_ism.so: /usr/lib64/libgsoap++.so.0: undefined symbol: soap_faultstring
25/5/12 FIXED, commit in broker-info Makefile.am
22/05/12
8) slower MM after authZ check by conf? (0/4310 [1] )
FIXED
15 Jun, 10:07:19 -I: [Info] checkRequirement(/home/mcecchi/34/emi.wms.wms-matchmaking/src/matchmakerISMImpl.cpp:105): MM for job: https://devel09.cnaf.infn.it:9000/0LXzbbGARFooXwYBAPOXEQ
(0/1145 [0] )
fixed using artefacts from remote build
9) Submission to CREAM DOES NOT WORK with collections and dags
FIXED by commit in wmproxy (setJobFileSystem)
- Cannot move ISB (retry_copy ${globus_transfer_cmd}
gsiftp://devel09.cnaf.infn.it:2811/var/SandboxDir/tu/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2ftutvLp_5fPTFrLUqQH4OSS-A/input/Test.sh
file:///scratch/9462489.1.medium/home_crm07_232749015/CREAM232749015/Test.sh):
error: globus_ftp_client: the server responded with an error
500 500-Command failed. : globus_l_gfs_file_open failed.
500-globus_xio: Unable to open file
/var/SandboxDir/tu/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2ftutvLp_5fPTFrLUqQH4OSS-A/input/Test.sh
500-globus_xio: System error in open: Permission denied
500-globus_xio: A system call failed: Permission denied
500 End.
Status Reason: failed (LB query failed)
10) Proxy exception: Unable to get Not Before date from Proxy
It happens on submission when the wmproxy serving processes restart. FIXED, committed in wmproxy (initwmp before authZ).
/var/log/wms/wmproxy.log-18 Jun, 10:14:31 -I- PID: 22575 - "wmproxy::main": Maximum core request count reached: 50
/var/log/wms/wmproxy.log-18 Jun, 10:14:31 -I- PID: 22575 - "wmproxy::main": Exiting WM proxy serving process ...
/var/log/wms/wmproxy.log-18 Jun, 10:14:31 -I- PID: 24999 - "wmproxy::main": ------- Starting Server Instance -------
/var/log/wms/wmproxy.log-18 Jun, 10:14:31 -I- PID: 24999 - "wmproxy::main": WM proxy serving process started
/var/log/wms/wmproxy.log-18 Jun, 10:14:31 -I- PID: 24999 - "wmproxy::main": ---------------------------------------
The reason is that SandboxDir is missing from the path:
15 Jun, 12:08:23 -D- PID: 31416 - "WMPAuthorizer::checkProxyValidity": Proxy path: /var//ji/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fji-MZN8xuc8UXCvqceHaHQ/user.proxy
15 Jun, 12:08:23 -E- PID: 31416 - "WMPAuthorizer::getNotBefore": Unable to get Not Before date from Proxy
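To see what getNotBefore actually fails on, the validity fields can be dumped with openssl; note that the correct path contains SandboxDir, unlike the /var//ji/... one logged above (diagnostic sketch):
[root@devel09 ~]# openssl x509 -noout -dates -in /var/SandboxDir/ji/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fji-MZN8xuc8UXCvqceHaHQ/user.proxy
On the broken path the file simply does not exist, so there is no Not Before date to parse.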
11) wms-wm stop doesn't delete pid file (and check-daemons kicks in)
FIXED by commit in wm init script
12) Configuration of Condor 7.8.0
FIXED by commit in yaim
13) Build against Condor 7.8.0 FIXED commit in jobsubmission
a fix is NEEDED in condorg.pc to build against an FHS-layout Condor
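A minimal sketch of what condorg.pc should look like for an FHS Condor layout (all paths and library names below are assumptions, not the shipped file):

prefix=/usr
libdir=${prefix}/lib64
includedir=${prefix}/include/condor

Name: condorg
Description: Condor-G client libraries
Version: 7.8.0
Libs: -L${libdir}/condor -lcondorapi
Cflags: -I${includedir}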
14) Restarting /usr/bin/glite_wms_wmproxy_server... -bash: condorc-initialize: command not found
condorc-initialize was meant for the gLite CE and is not needed; it was called
from /opt/glite/yaim/functions/config_gliteservices_wms FIXED commit in yaim
15) Jobs sent to the JC always stay in Running
FIXED by commit in jobsubmission (was due to the recent cleanup)
16) Crashes in WM with GLUE 2.0 purchaser enabled FIXED (among many others) by commit in ISM
a)
glite-wms-workload_manager: /usr/include/boost/shared_ptr.hpp:247: typename boost::detail::shared_ptr_traits<T>::reference boost::shared_ptr<T>::operator*() const [with T = classad::ClassAd]: Assertion `px != 0' failed.
b)
15 Jun, 10:25:30 -D: [Debug] fetch_bdii_ce_info_g2(ldap-utils-g2.cpp:1290): #6993 LDAP entries received in 12 seconds
15 Jun, 10:25:30 -E: [Error] handle_synch_signal(signal_handling.cpp:77): Got a synchronous signal (6), stack trace:
/usr/bin/glite-wms-workload_manager
/lib64/libpthread.so.0
/lib64/libc.so.6(gsignal+0x35)
/lib64/libc.so.6(abort+0x110)
/lib64/libc.so.6(__assert_fail+0xf6)
boost::shared_ptr<classad::ClassAd>::operator*() const
/usr/lib64/libglite_wms_ism_ii_g2_purchaser.so.0(_ZN5glite3wms3ism9purchaser21fetch_bdii_ce_info_g2ERKSsmS4_lS4_RSt3mapISsN5boo
/usr/lib64/libglite_wms_ism_ii_g2_purchaser.so.0(_ZN5glite3wms3ism9purchaser18fetch_bdii_info_g2ERKSsmS4_lS4_RSt3mapISsN5boost1
glite::wms::ism::purchaser::ism_ii_g2_purchaser::operator()()
void boost::_mfi::mf0<void, glite::wms::ism::purchaser::ism_purchaser>::call<boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser> >(boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser>&, void const*) const
void boost::_mfi::mf0<void, glite::wms::ism::purchaser::ism_purchaser>::operator()<boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser> >(boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser>&) const
/usr/bin/glite-wms-workload_manager(_ZN5boost3_bi5list1INS0_5valueINS_10shared_ptrIN5glite3wms3ism9purchaser13ism_purchaserEEEE
/usr/bin/glite-wms-workload_manager(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf0IvN5glite3wms3ism9purchaser13ism_purchaserEEENS0_5list1IN
/usr/bin/glite-wms-workload_manager(_ZN5boost6detail8function26void_function_obj_invoker0INS_3_bi6bind_tIvNS_4_mfi3mf0IvN5glite
boost::function0<void, std::allocator<void> >::operator()() const
/usr/bin/glite-wms-workload_manager
/usr/bin/glite-wms-workload_manager
boost::function0<void, std::allocator<void> >::operator()() const
glite::wms::manager::server::Events::run()
boost::_mfi::mf0<void, glite::wms::manager::server::Events>::operator()(glite::wms::manager::server::Events*) const
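Crash (a) is the canonical dereference of an empty boost::shared_ptr. A minimal sketch that trips the same assertion, with int standing in for classad::ClassAd:

#include <boost/shared_ptr.hpp>

int main() {
    boost::shared_ptr<int> p;  // default-constructed: internal px == 0
    int& r = *p;               // BOOST_ASSERT(px != 0) fires here, as in (a),
    (void)r;                   // when assertions are enabled
    return 0;
}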
17) Crash in LM FIXED commit in CVS (SubmitHost not allocated by Condor APIs)
Program received signal SIGSEGV, Segmentation fault.
0x0000003d02e782b0 in strlen () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003d02e782b0 in strlen () from /lib64/libc.so.6
#1 0x00000034e489c500 in std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&) ()
from /usr/lib64/libstdc++.so.6
#2 0x0000003969e56708 in glite::wms::jobsubmission::logmonitor::processer::EventSubmit::process_event() () from /usr/lib64/libglite_wms_jss_logmonitor.so.0
#3 0x0000003969e30526 in glite::wms::jobsubmission::logmonitor::CondorMonitor::process_next_event() () from /usr/lib64/libglite_wms_jss_logmonitor.so.0
#4 0x000000000040fc18 in glite::wms::jobsubmission::daemons::MonitorLoop::run() ()
#5 0x000000000040c39a in (anonymous namespace)::run_instance(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, glite::wms::common::utilities::LineParser const&, std::auto_ptr<glite::wms::jobsubmission::jccommon::LockFile>&, glite::wms::jobsubmission::daemons::MonitorLoop::run_code_t&) ()
#6 0x000000000040c72b in main ()
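The backtrace is the classic std::string-from-NULL crash: frame #1 constructs a std::string from a char* the Condor APIs never allocated, and strlen(NULL) in frame #0 segfaults. A guarded sketch (function name hypothetical, not the actual LM code):

#include <iostream>
#include <string>

std::string safe_submit_host(const char* s) {
    // std::string(NULL) is undefined behaviour: the constructor calls
    // strlen(s), which is exactly frames #0/#1 above. Guard first.
    return s ? std::string(s) : std::string();
}

int main() {
    std::cout << '[' << safe_submit_host(0) << "]\n";  // prints [] instead of crashing
    return 0;
}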
18) Scheduled status missing in jc/lm FIXED
Committed in LM, removed getSubmitHost()
19) Condor 7.8.0 rpm FIXED changed spec file in condor-emi-7.8.0
removed libclassad.so, the executables (classad_version, ...) and the include files
LEFT libclassad_7_8_0.so and libclassad.so.3
otherwise it conflicts with the classads and classads-devel packages
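An illustrative %files fragment matching the change (the real spec may differ in detail):

%files
%{_libdir}/libclassad_7_8_0.so
%{_libdir}/libclassad.so.3
# no longer shipped, to avoid the conflict with classads/classads-devel:
# libclassad.so, the classad_* executables, the include files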
20) proxy renewal INVALID: this is a problem specific to devel09; it works OK on a clean installation
this error:
glite_renewal_RegisterProxy
Exit code: 13
LB[Proxy] Error not available (empty messages)
15 Jun, 15:55:26 -S- PID: 26528 - "WMPEventLogger::registerProxyRenewal": Register job failed
glite_renewal_RegisterProxy
Exit code: 13
LB[Proxy] Error not available (empty messages)
means:
[root@devel09 ~]# /etc/init.d/glite-proxy-renewald status
glite-proxy-renewd not running
21) UI BUG: zipped ISB doesn't work with DAGs. INVALID: seems to have been fixed by some commit in the WM.
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::getDN_SSL": Getting user DN...
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::getDN_SSL": User DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alvise Dorigo
15 Jun, 17:11:44 -D- PID: 16909 - "WMPEventlogger::registerSubJobs": Registering DAG subjobs to LB Proxy...
15 Jun, 17:11:44 -D- PID: 16909 - "WMPEventlogger::setLoggingJob": Setting job for logging to LB Proxy...
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::getDN_SSL": Getting user DN...
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::getDN_SSL": User DN: /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alvise Dorigo
15 Jun, 17:11:44 -D- PID: 16909 - "wmpcoreoperations::submit": registerSubJobs OK, writing flag file: /var/SandboxDir/M4/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fM4GwcN3q0KYZz-ed7SHHQA/.registersubjobsok
15 Jun, 17:11:44 -D- PID: 16909 - "wmpcoreoperations::submit": Uncompressing zip file: ISBfiles_1kZPt8uCtNxrEBxTLVHzlQ_0.tar.gz
15 Jun, 17:11:44 -D- PID: 16909 - "wmpcoreoperations::submit": Absolute path: /var/SandboxDir/M4/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fM4GwcN3q0KYZz-ed7SHHQA/input/ISBfiles_1kZPt8uCtNxrEBxTLVHzlQ_0.tar.gz
15 Jun, 17:11:44 -D- PID: 16909 - "wmpcoreoperations::submit": Target directory: /var
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::doExecv": Forking process...
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::doExecv": Parent PID wait: 16909 waiting for: 19967
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::doExecv": Parent PID after wait: 16909 waiting for: 19967
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::doExecv": Child wait succesfully (WIFEXITED(status))
15 Jun, 17:11:44 -D- PID: 16909 - "wmputils::doExecv": WEXITSTATUS(status): 2
15 Jun, 17:11:44 -S- PID: 16909 - "wmputils::doExecv": Child failure, exit code: 512
15 Jun, 17:11:44 -S- PID: 16909 - "wmputils::doExecv": Child failure, exit code: 512
15 Jun, 17:11:44 -C- PID: 16909 - "wmputils::untarFile": Unable to untar ISB file:/var/SandboxDir/M4/https_3a_2f_2fdevel09.cnaf.infn.it_3a9000_2fM4GwcN3q0KYZz-ed7SHHQA/input/ISBfiles_1kZPt8uCtNxrEBxTLVHzlQ_0.tar.gz
15 Jun, 17:11:44 -E- PID: 16909 - "wmpcoreoperations::submit": Logging LOG_ENQUEUE_FAIL, std::exception Unable to untar ISB file
(please contact server administrator)
15 Jun, 17:11:44 -D- PID: 16909 - "WMPEventlogger::logEvent": Logging to LB Proxy...
15 Jun, 17:11:44 -D- PID: 16909 - "WMPEventlogger::logEvent": Logging Enqueue FAIL event...
15 Jun, 17:11:44 -D- PID: 16909 - "wmpcoreoperations::submit": Removing lock...
15 Jun, 17:11:44 -D- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": jobStart operation exception: The Operation is not allowed: Standard exception: Unable to untar ISB file
(please contact server administrator)
15 Jun, 17:11:44 -I- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": ------------------------------- Fault description --------------------------------
15 Jun, 17:11:44 -I- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": Method: jobStart
15 Jun, 17:11:44 -I- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": Code: 900
15 Jun, 17:11:44 -I- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": Description: The Operation is not allowed: Standard exception: Unable to untar ISB file
(please contact server administrator)
15 Jun, 17:11:44 -D- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": Stack:
15 Jun, 17:11:44 -D- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": JobOperationException: The Operation is not allowed: Standard exception: Unable to untar ISB file
(please contact server administrator)
at submit()[coreoperations.cpp:1726]
at jobStart()[coreoperations.cpp:1838]
15 Jun, 17:11:44 -I- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": ----------------------------------------------------------------------------------
15 Jun, 17:11:44 -D- PID: 16909 - "wmpgsoapoperations::ns1__jobStart": jobStart operation completed
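The two doExecv failure lines above are consistent rather than contradictory: 512 is the raw wait() status, whose WEXITSTATUS is 2, i.e. tar's fatal-error exit code. A minimal sketch of the decoding:

#include <cstdio>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t pid = fork();
    if (pid == 0) _exit(2);   // child exits with code 2, like the failing tar
    int status = 0;
    waitpid(pid, &status, 0);
    std::printf("raw status: %d\n", status);               // 512 (2 << 8)
    std::printf("WEXITSTATUS: %d\n", WEXITSTATUS(status)); // 2
    return 0;
}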
22) fetch_bdii_ce_info: no such object NOT SEEN ANYMORE; it seems to have been a transient network error
[root@devel09 ~]# grep such /var/log/wms/workload_manager_events.log
15 Jun, 17:16:35 -W: [Warning] fetch_bdii_ce_info(ldap-utils.cpp:688): No such object
15 Jun, 17:17:56 -W: [Warning] fetch_bdii_se_info(ldap-utils.cpp:346): No such object
23) cancel status not properly managed by dagmanless FIXED commit in WM DAG engine
a cancelled job did not cause its dependent nodes to be aborted
24) cancellation to condor does not work FIXED commit in jobsubmission (restored irepository in JC)
18 Jun, 12:29:28 -D- cancelJob(...): Condor id of job was:
18 Jun, 12:29:28 -S- cancelJob(...): Job cancellation refused.
18 Jun, 12:29:28 -S- cancelJob(...): Condor ID =
18 Jun, 12:29:28 -S- cancelJob(...): Reason: "
Couldn't find/remove all jobs matching constraint (ClusterId== && ProcId==0 && JobStatus!=3)
For comparison, there is no such problem on devel11 (EMI-1):
18 Jun, 12:46:26 -I- ControllerLoop::run(): Got new remove request (JOB ID = https://devel11.cnaf.infn.it:9000/_UFQCsCUNGpVmZOU-q1-HQ)...
18 Jun, 12:46:26 -I- JobControllerReal::cancel(...): Asked to remove job: https://devel11.cnaf.infn.it:9000/_UFQCsCUNGpVmZOU-q1-HQ
18 Jun, 12:46:26 -I- cancelJob(...): Job has been succesfully removed.
18 Jun, 12:46:26 -V- JobControllerReal::cancel(...): Job https://devel11.cnaf.infn.it:9000/_UFQCsCUNGpVmZOU-q1-HQ successfully marked for removal.
25) check log rotation FIXED commit in yaim (copytruncate)
according to lsof, the LM was still reading a .log.1 file after rotation
the same holds for the JC:
-rw-r--r-- 1 glite glite 0 Jun 19 04:02 jobcontroller_events.log
-rw-r--r-- 1 glite glite 349597 Jun 19 11:47 jobcontroller_events.log.1
[root@devel09 ~]# ps aux | grep job_
root 4193 0.0 0.0 61212 760 pts/0 S+ 11:48 0:00 grep job_
glite 16261 0.0 0.1 226112 6616 ? Ss Jun18 0:00 /usr/bin/glite-wms-job_controller -c glite_wms.conf
[root@devel09 ~]# lsof -p 16261|grep .log
glite-wms 16261 glite mem REG 253,0 671439 17115535 /usr/lib64/libglite_wms_logger.so.0.0.0
glite-wms 16261 glite 3u REG 253,0 349597 29524130 /var/log/wms/jobcontroller_events.log.1
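With copytruncate the live file is truncated in place instead of being renamed away, so daemons that never reopen their logfile keep writing to the current one. An illustrative stanza (not the exact file yaim ships):

/var/log/wms/jobcontroller_events.log {
    daily
    rotate 30
    copytruncate
    missingok
    notifempty
}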
26) first G2 sync purchasing is empty FIXED
First G2 sync purchasing had:
config.ns()->ii_dn(),
instead of "o=glue"
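For reference, the two GLUE bases can be queried by hand against a BDII (hostname below is a placeholder):

[root@devel09 ~]# ldapsearch -x -LLL -h <top-bdii> -p 2170 -b "o=glue" '(objectClass=GLUE2ComputingService)' dn
[root@devel09 ~]# ldapsearch -x -LLL -h <top-bdii> -p 2170 -b "mds-vo-name=local,o=grid" '(objectClass=GlueCE)' dn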
27) new crash in G2 purchaser FIXED it's the same problem shown in point 28: the two purchasers cannot work together with the present design
19 Jun, 15:56:20 -E: [Error] handle_synch_signal(/home/mcecchi/34/emi.wms.wms-manager/src/signal_handling.cpp:77): Got a synchronous signal (11), stack trace:
/usr/bin/glite-wms-workload_manager
/lib64/libpthread.so.0
std::string::compare(std::string const&) const
bool std::operator< <char, std::char_traits<char>, std::allocator<char> >(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
std::less<std::string>::operator()(std::string const&, std::string const&) const
/usr/lib64/libglite_wms_ism_ii_g2_purchaser.so.0(_ZNSt8_Rb_treeISsSt4pairIKSsN5boost6tuples5tupleIiiNS2_10shared_ptrIN7classad7
/usr/lib64/libglite_wms_ism_ii_g2_purchaser.so.0(_ZNSt3mapISsN5boost6tuples5tupleIiiNS0_10shared_ptrIN7classad7ClassAdEEENS0_8f
/usr/lib64/libglite_wms_ism_ii_g2_purchaser.so.0
glite::wms::ism::purchaser::ism_ii_g2_purchaser::operator()()
void boost::_mfi::mf0<void, glite::wms::ism::purchaser::ism_purchaser>::call<boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser> >(boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser>&, void const*) const
void boost::_mfi::mf0<void, glite::wms::ism::purchaser::ism_purchaser>::operator()<boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser> >(boost::shared_ptr<glite::wms::ism::purchaser::ism_purchaser>&) const
/usr/bin/glite-wms-workload_manager(_ZN5boost3_bi5list1INS0_5valueINS_10shared_ptrIN5glite3wms3ism9purchaser13ism_purchaserEEEE
/usr/bin/glite-wms-workload_manager(_ZN5boost3_bi6bind_tIvNS_4_mfi3mf0IvN5glite3wms3ism9purchaser13ism_purchaserEEENS0_5list1IN
/usr/bin/glite-wms-workload_manager(_ZN5boost6detail8function26void_function_obj_invoker0INS_3_bi6bind_tIvNS_4_mfi3mf0IvN5glite
boost::function0<void, std::allocator<void> >::operator()() const
/usr/bin/glite-wms-workload_manager
/usr/bin/glite-wms-workload_manager
boost::function0<void, std::allocator<void> >::operator()() const
glite::wms::manager::server::Events::run()
boost::_mfi::mf0<void, glite::wms::manager::server::Events>::operator()(glite::wms::manager::server::Events*) const
28) G1 and G2 purchasers do not work together? Yes, this is a design issue FIXED
EnableIsmIiGlue13Purchasing = true;
EnableIsmIiGlue20Purchasing = true;
but:
19 Jun, 16:54:39 -I: [Info] checkRequirement(/home/mcecchi/34/emi.wms.wms-matchmaking/src/matchmakerISMImpl.cpp:110): MM for listmatch (0/1124 [0] )
at the beginning, and then:
19 Jun, 16:55:09 -D: [Debug] schedule_at(/home/mcecchi/34/emi.wms.wms-manager/src/events.cpp:156): timed event scheduled at 1340117710 with priority 20
19 Jun, 16:55:09 -D: [Debug] operator()(/home/mcecchi/34/emi.wms.wms-manager/src/match_request.cpp:69): considering match https://localhost:6000/hsgQjAjmC8fJs30scM8mAQ /tmp/13043.20120619165508248 -1 0
19 Jun, 16:55:09 -I: [Info] checkRequirement(/home/mcecchi/34/emi.wms.wms-matchmaking/src/matchmakerISMImpl.cpp:110): MM for listmatch (2/81 [0] )
19 Jun, 16:55:10 -D: [Debug] schedule_at(/home/mcecchi/34/emi.wms.wms-manager/src/events.cpp:156): timed event scheduled at 1340117711 with priority 20
19 Jun, 17:03:50 -I: [Info] checkRequirement(/home/mcecchi/34/emi.wms.wms-matchmaking/src/matchmakerISMImpl.cpp:110): MM for listmatch (81/81 [0] )
19 Jun, 17:03:51 -D: [Debug] schedule_at(/home/mcecchi/34/emi.wms.wms-manager/src/events.cpp:156): timed event scheduled at 1340118232 with priority 20
Several crashes then occur when one purchaser triggers the ISM switch-over without the other one knowing, as the sketch below illustrates.
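A rough sketch of the hazard (all names illustrative, not the actual ISM code): the ISM is double-buffered and each purchaser flips the active side when it finishes, so two independent purchasers invalidate each other's idea of which buffer is pending:

#include <map>
#include <string>

typedef std::map<std::string, std::string> Buffer;  // stand-in for ISM entries

Buffer ism[2];   // ism[active] is read by the MM, ism[1 - active] is refilled
int active = 0;  // shared and unsynchronised across purchaser threads

void purchase_cycle(const std::string& source) {
    Buffer& pending = ism[1 - active];  // assumes sole ownership of the pending side
    pending.clear();
    pending["ce.example.org"] = source;
    active = 1 - active;  // a concurrent flip by the other purchaser breaks the
                          // ownership assumption: a reader (or the other writer)
                          // can land on a half-rewritten map
}

int main() {
    purchase_cycle("glue1");  // in the WM these cycles run in separate threads
    purchase_cycle("glue2");
    return 0;
}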
29) proxy renewal not installed. FIXED commit in the ice and wmproxy spec files (provides glite-px-proxyrenewal)
30) 18/07/2012 ICE log level: from 300 to 500. Tagged yaim 0_7, but not part of the present registered build
31) 18/07/2012 ice persist_dir is OK
32) 19/07/2012 SL6 gridftp not working FIXED by updating to a more recent EMI-2 repo
[root@devel08 grid-security]# LCAS_DEBUG_LEVEL=5 LCAS_LOG_LEVEL=5 LCMAPS_DEBUG_LEVEL=5 LCMAPS_LOG_LEVEL=5 /usr/sbin/globus-gridftp-server -ns -p 24024 -d all > /tmp/ftplog.txt 2>&1
[root@devel08 grid-security]# cat /tmp/ftplog.txt
[19911] Thu Jul 19 14:58:45 2012 :: GFork functionality not enabled.:
globus_gfork: GFork error: Env not set
[19911] Thu Jul 19 14:58:45 2012 :: Configuration read from /etc/gridftp.conf.
[19911] Thu Jul 19 14:58:45 2012 :: Server started in daemon mode.
[19911] Thu Jul 19 14:58:48 2012 :: New connection from: ui.cnaf.infn.it:38435
[19911] Thu Jul 19 14:58:48 2012 :: ui.cnaf.infn.it:38435: [CLIENT]: USER :globus-mapping:
[19911] Thu Jul 19 14:58:48 2012 :: ui.cnaf.infn.it:38435: [SERVER]: 331 Password required for :globus-mapping:.
[19911] Thu Jul 19 14:58:48 2012 :: ui.cnaf.infn.it:38435: [CLIENT]: PASS dummy
[19911] Thu Jul 19 14:58:49 2012 :: ui.cnaf.infn.it:38435: [CLIENT]: PASS dummy
[19911] Thu Jul 19 14:58:49 2012 :: ui.cnaf.infn.it:38435: [SERVER]: 530-Login incorrect. : globus_gss_assist: Error invoking callout
530-globus_callout_module: The callout returned an error
530-an unknown error occurred
530 End.
33) 19/07/2012 SL6 LB not working FIXED by updating to a more recent EMI-2 repo
34) 19/07/2012 SL6 WM not working FIXED
crash while loading the II library
disappeared after reinstalling classad(-devel), common and other libraries
35) 31/7/2012 Again? FIXED by a cleanup/install/yaim of the latest EMI-2
31 Jul, 11:58:32 -D- PID: 6649 - "wmpgsoapoperations::ns1__jobStart": jobStart operation called
31 Jul, 11:58:32 -D- PID: 6649 - "WMPAuthorizer::checkProxyValidity": Proxy path: /var//oz/https_3a_2f_2fdevel08.cnaf.infn.it_3a9000_2foz-mWzYgnJWchPD-_5faB7nA/user.proxy
31 Jul, 11:58:32 -E- PID: 6649 - "WMPAuthorizer::getNotBefore": Unable to get Not Before date from Proxy
31 Jul, 11:58:32 -D- PID: 6649 - "wmpgsoapoperations::ns1__jobStart": jobStart operation exception: Proxy exception: Unable to get Not Before date from Proxy