Pre-certification of WMS 3.3.6
Repository
http://etics-repository.cern.ch/repository/pm/volatile/repomd/id/4d429a92-e1ae-42e9-95e4-b2d5338349a5/sl5_x86_64_gcc412EPEL
updating from a WMS 3.3.5
Tests:
SVG #4073
OK
SVG #4039
OK
The pre-certification consists of submitting a job to the WMS and scanning the syslog file /var/log/messages to check that the WMProxy and the Workload Manager logged the information required by this bug.
Log in as root on the WMS machine and execute the command:
tail -f /var/log/messages|egrep "wmproxy|manager"
Then log into a UI and submit a job (any JDL will do) to the WMS. Two log lines should appear after a few seconds in the console running the tail command:
May 18 14:23:07 devel11 glite_wms_wmproxy_server[32565]: submission from lxgrid05.pd.infn.it, DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alvise Dorigo, FQAN=/dteam/Role=NULL/Capability=NULL, userid=18702 for jobid=https://devel07.cnaf.infn.it:9000/rkYSfEe5IqsDc17_UpPu3Q
May 18 14:23:10 devel11 glite-wms-workload_manager: jobid https://devel07.cnaf.infn.it:9000/rkYSfEe5IqsDc17_UpPu3Q was matched to destination creamce.gina.sara.nl:8443/cream-pbs-infra
Note in particular the DN, FQAN, and JobID information, and the UI's hostname.
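The manual check above can also be scripted. The following is a minimal sketch, assuming the WMProxy log line always has the layout of the sample shown above; the function name check_submission_line is hypothetical and not part of the WMS tooling.

```shell
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and succeeds only if
# at least one WMProxy line carries the UI hostname, DN, FQAN, userid
# and jobid fields. The field layout is assumed from the sample line above.
check_submission_line() {
    grep -Eq 'glite_wms_wmproxy_server\[[0-9]+\]: submission from [^ ,]+, DN=.+, FQAN=.+, userid=[0-9]+ for jobid=https://[^ ]+$'
}
```

For example, `grep wmproxy /var/log/messages | tail -n 1 | check_submission_line && echo OK` would report OK only if the most recent WMProxy line contains all the fields.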
The pre-certification was done in two phases: first, verification of the bug on an EMI1 installation; second, installation of the new RPM (which will be in the next EMI1 update 17) and verification that the bug disappeared.
To reproduce the bug, it is sufficient to use this JDL:
[
Executable = "/bin/touch" ;
Arguments = "/foo" ;
Retrycount = 2;
usertags = [ exe = "touch" ];
VirtualOrganisation="dteam";
requirements = ! RegExp("cream.*", other.GlueCEUniqueID);
]
and submit it to an EMI1 WMS that does not have the fix. Note that this bug occurs when the job lands on a non-CREAM CE (hence the requirements attribute in the JDL).
This should be the result:
glite-wms-job-status https://devel09.cnaf.infn.it:9000/U...
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel09.cnaf.infn.it:9000/U...
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce01.dur.scotgrid.ac.uk:2119/jobmanager-lcgpbs-q7d
Submitted: Wed May 16 16:19:38 2012 CEST
==========================================================================
Although the job should return exit code 1 (it cannot create a file in /: permission denied), the exit code reported by the status is 0, as shown above.
After applying the fix to this EMI1 WMS, the same JDL should produce the expected exit code (1):
glite-wms-job-status https://devel07.cnaf.infn.it:9000/f...
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/f...
Current Status: Done(Exit Code !=0)
Exit code: 1
Status Reason: Warning: job exit code != 0
Submitted: Wed May 16 16:54:25 2012 CEST
==========================================================================
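Comparing the reported exit code before and after the fix can be automated. This is a sketch only, assuming the "Exit code:" field appears exactly as in the two status outputs above; job_exit_code is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the numeric value of the first "Exit code:"
# field found in glite-wms-job-status output read on stdin.
# Prints nothing if the field is absent (e.g. the job is not Done yet).
job_exit_code() {
    sed -n 's/^[[:space:]]*Exit code:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}
```

Piping the output of glite-wms-job-status for the test job into job_exit_code should print 0 on a WMS without the fix and 1 on a WMS with it.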
The bug is "Hopefully fixed". Triggering the problem is very difficult.
- Use a CMS proxy and submit ONLY to korundi.grid.helsinki.fi:
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel11.cnaf.infn.it:9000/pLl1nyepSih7NYrivm8T1A
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: korundi.grid.helsinki.fi:2811/nordugrid-GE-mgrid
Submitted: Tue Jun 5 22:37:22 2012 CEST
==========================================================================
- Check if the gridmanager stays alive for reasonable amounts of time (occasional crashes are "normal" on WMS nodes).
[root@devel11 mcecchi]# grep STARTING /var/local/condor/log/GridmanagerLog.glite |tail
05/31 14:59:45 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 15:09:57 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 15:10:08 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 16:13:20 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 16:13:30 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 16:23:42 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
05/31 16:24:00 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
06/01 15:58:27 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
06/05 13:06:53 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
06/05 22:19:07 ** condor_gridmanager (CONDOR_GRIDMANAGER) STARTING UP
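The grep above shows only the last ten startups. To see how often the gridmanager restarted on each day, the same banner lines can be counted per date. This is a sketch, assuming the log keeps the "MM/DD HH:MM:SS" prefix shown above; gridmanager_restarts_per_day is a hypothetical name, and what counts as a reasonable number of restarts is left to the tester.

```shell
#!/bin/sh
# Hypothetical helper: read GridmanagerLog lines on stdin and print
# "<MM/DD> <count>" for every day with at least one STARTING UP banner.
# The date is taken from the first whitespace-separated field.
gridmanager_restarts_per_day() {
    awk '/STARTING UP/ { count[$1]++ }
         END { for (day in count) print day, count[day] }' | sort
}
```

It can be fed directly from the log file, for example `gridmanager_restarts_per_day < /var/local/condor/log/GridmanagerLog.glite`.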
Also check that the grid_monitor.sh script is properly read (ls -lu lists each file's last access time):
[root@devel11 mcecchi]# locate grid_monitor.sh|xargs ls -lu
-rwxr-xr-x 1 root root 42728 Jun 1 13:20 /opt/condor-7.4.2/libexec/glite/grid_monitor.sh
-rwxr-xr-x 1 root root 38151 Jun 5 23:07 /opt/condor-7.4.2/sbin/grid_monitor.sh
Generic test of job submission
Submission of three jobs (single, collection, DAG) and one cancellation, all with a non-CREAM CE destination (use 'requirements = ! RegExp("cream.*", other.GlueCEUniqueID);' in the JDL).
Single job
$ cat ~/JDLs/WMS/wms_submission_non_cream_true.jdl
[
Executable = "/bin/true";
Arguments = "";
myproxyserver="";
requirements = ! RegExp("cream.*", other.GlueCEUniqueID);
RetryCount = 0;
ShallowRetryCount = 1;
]
$ glite-wms-job-submit -a -c ~/JDLs/WMS/wmp_devel11.conf ~/JDLs/WMS/wms_submission_non_cream_true.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel07.cnaf.infn.it:9000/mq4J3d8lBxgBj_TA34Aa8g
==========================================================================
$ glite-wms-job-status https://devel07.cnaf.infn.it:9000/mq4J3d8lBxgBj_TA34Aa8g
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/mq4J3d8lBxgBj_TA34Aa8g
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: dangus.itpa.lt:2119/jobmanager-lcgpbs-short
Submitted: Fri May 25 09:29:43 2012 CEST
==========================================================================
Job cancellation
$ glite-wms-job-submit -a -c ~/JDLs/WMS/wmp_devel11.conf ~/JDLs/WMS/wms_sottomissione_non_cream_true.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
==========================================================================
$ glite-wms-job-status https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
Current Status: Scheduled
Status Reason: Job successfully submitted to Globus
Destination: egee02.grid.hku.hk:2119/jobmanager-lcgpbs-dteam
Submitted: Fri May 25 09:50:50 2012 CEST
==========================================================================
$ glite-wms-job-cancel https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
Are you sure you want to remove specified job(s) [y/n]y : y
Connecting to the service https://devel11.cnaf.infn.it:7443/glite_wms_wmproxy_server
============================= glite-wms-job-cancel Success =============================
The cancellation request has been successfully submitted for the following job(s):
- https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
========================================================================================
$ glite-wms-job-status https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/CFUZC1XPny7j5Dd596torw
Current Status: Cancelled
Destination: egee02.grid.hku.hk:2119/jobmanager-lcgpbs-dteam
Submitted: Fri May 25 09:50:50 2012 CEST
==========================================================================
Collection
$ cat /home/dorigoa/JDLs/WMS/coll1.jdl
[
type = "collection";
nodes = {
[
file ="/home/dorigoa/JDLs/WMS/coll/job.jdl" ;
],
[
file ="/home/dorigoa/JDLs/WMS/coll/job.jdl" ;
],
[
file ="/home/dorigoa/JDLs/WMS/coll/job.jdl" ;
],
[
file ="/home/dorigoa/JDLs/WMS/coll/job.jdl" ;
],
[
file ="/home/dorigoa/JDLs/WMS/coll/job.jdl" ;
]
};
]
$ cat /home/dorigoa/JDLs/WMS/coll/job.jdl
[
Executable = "/bin/ls" ;
Arguments = "/tmp" ;
RetryCount = 2 ;
Stdoutput = "std.out" ;
StdError = "std.err" ;
OutputSandbox = { "std.out" ,"std.err"} ;
InputSandbox = { "data/pippo" };
rank = 1 ;
ShallowRetryCount = 2;
usertags = [ exe = "ls" ];
requirements = !RegExp("cream.*", other.GlueCEUniqueID);
]
$ glite-wms-job-submit -a -c ~/JDLs/WMS/wmp_devel11.conf ~/JDLs/Alessio/UI/jdl/coll1.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel07.cnaf.infn.it:9000/NpR759bu84k_72RM5_v-kw
==========================================================================
$ glite-wms-job-status https://devel07.cnaf.infn.it:9000/NpR759bu84k_72RM5_v-kw
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/NpR759bu84k_72RM5_v-kw
Current Status: Done(Success)
Exit code: 0
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
- Nodes information for:
Status info for the Job : https://devel07.cnaf.infn.it:9000/FRKaSvQ-fX0Q5Qd-BFuIGg
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce1.grid.lebedev.ru:2119/jobmanager-lcgpbs-dteam
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/SfrHHxw4E3-KZKRx4-hZEQ
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce01.athena.hellasgrid.gr:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/fDTaOZAeRkvB0Rgt6ahu5Q
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce.reef.man.poznan.pl:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/ly9bq3Y764NQPh73HhjA-Q
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce01.kallisto.hellasgrid.gr:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/qy5ih7me0ABi2AzO2RZzqA
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ceprod03.grid.hep.ph.ic.ac.uk:2119/jobmanager-sge-long
Submitted: Fri May 25 11:14:54 2012 CEST
==========================================================================
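With several nodes, reading every "Current Status" line by eye is error-prone. The sketch below tallies the statuses found in a glite-wms-job-status output, assuming the field is printed as in the listing above; status_summary is a hypothetical name, and the count includes the parent job as well as its nodes.

```shell
#!/bin/sh
# Hypothetical helper: read glite-wms-job-status output on stdin and
# print "<count> <status>" for each distinct "Current Status:" value.
status_summary() {
    sed -n 's/^[[:space:]]*Current Status:[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*$/\1/p' |
        awk '{ n[$0]++ } END { for (s in n) print n[s], s }' |
        sort
}
```

For the collection above, every line of the summary should read Done(Success); any other status points to a node that needs a closer look.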
DAG
$ cat ~/JDLs/WMS/dag10nodi.jdl
[
SignificantAttributes = { "Requirements", "Rank" };
type = "dag";
requirements = !RegExp("cream.*", other.GlueCEUniqueID);
nodes = [
nodeA = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/env.jdl" ;
];
nodeB = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/sleep.jdl" ;
];
nodeC = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/touch.jdl" ;
];
nodeD = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/ls.jdl" ;
];
nodeE = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/ls.jdl" ;
];
nodeF = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/env.jdl" ;
];
nodeG = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/echo.jdl" ;
];
nodeH = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/random.jdl" ;
];
nodeI = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/ls.jdl" ;
];
nodeL = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/cat.jdl" ;
];
nodeM = [
node_type = "edg-jdl";
file ="/home/dorigoa/JDLs/WMS/cat.jdl" ;
];
dependencies = {
{ nodeA, { nodeB, nodeC, nodeD, nodeE } },
{ nodeD, nodeF },
{ nodeD, nodeG },
{ nodeE, nodeH },
{ nodeF, nodeI },
{ nodeF, nodeL },
{ nodeL, nodeM }
}
];
]
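The dependencies attribute above defines the order in which the nodes may run. One way to sanity-check it before submission is to feed the parent/child pairs to the standard tsort utility, which prints one valid execution order. This is a sketch: the pairs below were transcribed by hand from the JDL, and dag_order is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print one valid execution order for the DAG above.
# Each input line is a "parent child" pair taken from the dependencies
# attribute; the group { nodeA, { nodeB, nodeC, nodeD, nodeE } } is
# expanded into four separate pairs.
dag_order() {
    tsort <<'EOF'
nodeA nodeB
nodeA nodeC
nodeA nodeD
nodeA nodeE
nodeD nodeF
nodeD nodeG
nodeE nodeH
nodeF nodeI
nodeF nodeL
nodeL nodeM
EOF
}
```

Since nodeA is the only node without a parent, it is always printed first; the relative order of independent nodes such as nodeB and nodeC may vary.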
$ glite-wms-job-submit -a -c ~/JDLs/WMS/wmp_devel11.conf ~/JDLs/WMS/dag10nodi.jdl
Connecting to the service https://devel11.cnaf.infn.it:7443/glite_wms_wmproxy_server
====================== glite-wms-job-submit Success ======================
The job has been successfully submitted to the WMProxy
Your job identifier is:
https://devel07.cnaf.infn.it:9000/zSoSo9jMN-MgInLb9e4G0Q
==========================================================================
$ glite-wms-job-status https://devel07.cnaf.infn.it:9000/zSoSo9jMN-MgInLb9e4G0Q
======================= glite-wms-job-status Success =====================
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel07.cnaf.infn.it:9000/zSoSo9jMN-MgInLb9e4G0Q
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: dagman
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
- Nodes information for:
Status info for the Job : https://devel07.cnaf.infn.it:9000/2DZeXp0v5_3zGNueXqFeyw
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-enmr.chemie.uni-frankfurt.de:2119/jobmanager-lcgpbs-long
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/7Jk1daotQC_uXmv33bbCgA
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-enmr.chemie.uni-frankfurt.de:2119/jobmanager-lcgpbs-long
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/9m_oRvKFp1texXeSR6UUbg
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-enmr.chemie.uni-frankfurt.de:2119/jobmanager-lcgpbs-cert
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/QCBl5XHvyAOJPdkzdNO_VQ
Current Status: Done(Success)
Logged Reason(s):
- epilogue failed with error 1
Fri May 25 15:16:18 CEST 2012: Taken token gsiftp://devel11.cnaf.infn.it/var/SandboxDir/QC/https_3a_2f_2fdevel07.cnaf.infn.it_3a9000_2fQCBl5XHvyAOJPdkzdNO_5fVQ/token.txt_0
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-enmr.chemie.uni-frankfurt.de:2119/jobmanager-lcgpbs-verylong
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/_wtIrUOhjHyHCAiwR3Vcxw
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-enmr.chemie.uni-frankfurt.de:2119/jobmanager-lcgpbs-medium
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/epUEnBpmwZhaxLX1fWZYGA
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/jeG1XmOzuRIxW-PAIyIG_w
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: boalice3.bo.infn.it:2119/jobmanager-lcgpbs-cert
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/mlrvfleUcbWtRODiOFCxgQ
Current Status: Done(Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: boalice3.bo.infn.it:2119/jobmanager-lcgpbs-cert
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/qXXOVm_X7KvNf53g_2_vdQ
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-edu.grid.acad.bg:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/vTOKXPI8oMnUmpJZajgS7w
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-atlas.ipb.ac.rs:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
Status info for the Job : https://devel07.cnaf.infn.it:9000/wb7RKljxzfNjv0cxbYxjtA
Current Status: Done(Success)
Logged Reason(s):
-
- Job terminated successfully
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce-edu.grid.acad.bg:2119/jobmanager-pbs-dteam
Submitted: Fri May 25 13:43:44 2012 CEST
==========================================================================
--
MarcoCecchi - 2012-04-26