Check bugs:

  • Bugs #39807: In some circumstances, jobs which are killed by CREAM job wrapper might remain in ICE cache forever FIXED
    • Following the instructions in the bug's comments, set these parameters in the ICE section of the configuration file (i.e. glite_wms.conf):
      start_listener = false;
      start_subscription_updater = false;
      poller_delay = 900;
      poller_status_threshold_time = 60;
    • Submit a long job (i.e. a job that should run for more than 15 minutes), with a short proxy (i.e. a proxy with a lifetime of about 13 minutes).
    • Submit another job with a long proxy (i.e. more than an hour).
    • After about 17/18 minutes the original job should be ABORTED.
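The steps above could be scripted roughly as follows. This is only a sketch, not the script used for the test: the VO name, the JDL file name and the job-id file names are hypothetical placeholders, and the second job is assumed to use the same JDL as the first. With DRY_RUN=1 (the default here) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the bug #39807 check (not the original test script).
# VO, JDL and the *.ids file names are hypothetical placeholders.
VO="${VO:-dteam}"
JDL="${JDL:-long_job.jdl}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Long job submitted with a short proxy (13 minutes).
run voms-proxy-init -voms "$VO" -valid 0:13
run glite-wms-job-submit -a -o short_proxy.ids "$JDL"

# Second job submitted with a long proxy (2 hours).
run voms-proxy-init -voms "$VO" -valid 2:00
run glite-wms-job-submit -a -o long_proxy.ids "$JDL"

# After about 18 minutes the first job is expected to be ABORTED.
run sleep 1080
run glite-wms-job-status --noint -i short_proxy.ids
```

Running it with DRY_RUN=0 on a UI node would execute the commands for real; the ICE settings listed above must already be in place on the WMS.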

  • Bugs #42018: Missing exit on very severe error HOPEFULLY FIXED
    • Changes inside the code.

  • Bugs #42081: Exception not catched in ICE HOPEFULLY FIXED
    • Changes inside the code.

  • Bugs #42141: Calling the FileList::get_size() method should be mutex protected HOPEFULLY FIXED
    • Changes inside the code.

  • Bugs #44604: A bad handling of delegations slow down dramatically the submission rate of ICE HOPEFULLY FIXED
    • See tests below.

  • Bugs #46116: MaxOutputSandboxSize value not sent to CREAM by ICE FIXED
    • Set the parameter MaxOutputSandboxSize in the WorkloadManager section of the configuration file /opt/glite/etc/glite_wms.conf on the WMS to 100 and restart the workload manager.
    • Submit a JDL like this to a CREAM CE:
      [
      Type = "Job";
      Executable = "27215_exe.sh";
      Arguments = "70";
      StdOutput = "test.out";
      StdError = "test.err";
      InputSandbox = {"27215_exe.sh"};
      OutputSandbox = {"test.err","test.out","out2", "out1"};
      usertags = [ bug = "27215" ];
      ] 
      where 27215_exe.sh contains
      #!/bin/sh
      # Append one byte to out1 and to out2 on each of the $1 iterations.
      MAX=$1
      i=0
      while [ "$i" -lt "$MAX" ]; do
          printf '1' >> out1
          printf '2' >> out2
          i=$((i + 1))
      done
      
    • Take the CreamJobID from the "Transfer Event" logged by the "LogMonitor" (i.e. the field "Dest jobid").
    • Using the CE client, inspect the JDL sent to the CE with: glite-ce-job-status -L 2 <CreamJobID>; you should find this parameter: maxOutputSandboxSize = 1.000000000000000E+02;
    • Due to a bug in CREAM the output files are not truncated as expected.
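The check on the JDL sent to the CE could be automated with a small filter like the one below. It is a sketch under the assumption that the status output contains the attribute in the form quoted above; the function names get_max_osb and max_osb_is are hypothetical helpers, not part of any gLite client.

```shell
#!/bin/sh
# Print the value of maxOutputSandboxSize found in JDL text read from stdin.
# Prints nothing if the attribute is missing.
get_max_osb() {
    sed -n 's/.*maxOutputSandboxSize *= *\([^;]*\);.*/\1/p' | head -n 1
}

# Succeed if the value found on stdin is numerically equal to $1.
max_osb_is() {
    value=$(get_max_osb)
    [ -n "$value" ] && awk -v v="$value" -v e="$1" 'BEGIN { exit !(v + 0 == e + 0) }'
}

# Sample line taken from the text above.
echo 'maxOutputSandboxSize = 1.000000000000000E+02;' | get_max_osb
```

On a UI node the same filter could be fed with the real output, for example: glite-ce-job-status -L 2 <CreamJobID> | max_osb_is 100 && echo OK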

  • Bugs #47389: There's a mem leak in ICE that raises in some very rare circumstances HOPEFULLY FIXED
    • Not easy to reproduce

  • Bugs #47509: ICE must be modified in order to be compliant with modification to CEMon C++ API FIXED
    • Verify if the subscription of ICE to the CE works well (you need to look inside the log file of ICE)

  • Bugs #47996: Apparent database corruption when ICE exits FIXED
    • Submit a lot of jobs and restart the ICE daemon (i.e.: /opt/glite/etc/init.d/glite-wms-ice restart)

Tests on ICE

15) Test starts on Thu Mar 19 at 12:00:57 CET 2009 (WMS: wms008)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf and cream-21.pd)
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch")
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Use rpms from patch #2459
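The JDL used for the collections is not given on this page. As an illustration only, a collection of 40 "sleep 313" jobs could be generated as sketched below; the function name make_collection_jdl and the choice of node attributes are assumptions, not the original test setup.

```shell
#!/bin/sh
# Print a collection JDL with $1 nodes, each running "/bin/sleep $2".
# Note: plain sh has no local variables, so n, secs, i and sep are global.
make_collection_jdl() {
    n="$1"
    secs="$2"
    echo '['
    echo '  Type = "Collection";'
    echo '  Nodes = {'
    i=1
    while [ "$i" -le "$n" ]; do
        # Every node but the last one is followed by a comma.
        sep=','
        if [ "$i" -eq "$n" ]; then
            sep=''
        fi
        printf '    [ Executable = "/bin/sleep"; Arguments = "%s"; ]%s\n' "$secs" "$sep"
        i=$((i + 1))
    done
    echo '  };'
    echo ']'
}

# Small example with two nodes.
make_collection_jdl 2 313
```

A driver loop could then write such a file once (make_collection_jdl 40 313 > collection.jdl) and submit it every 60 seconds with glite-wms-job-submit -a collection.jdl.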

Test finishes on Tue Mar 24 at 11:56:59 CET 2009

  • 3750 collections submitted in 37283 seconds: 5/9/38 secs (min/avg/max)
    • 3450 submissions failed with "System load is too high"
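A min/avg/max triple like the one reported above can be derived from the per-collection submission times. The sketch below assumes one duration in seconds per input line and truncates the average to an integer; both the input format and the truncation are assumptions, and the sample numbers are made up.

```shell
#!/bin/sh
# Read one duration (seconds) per line on stdin and print min/avg/max.
# The average is truncated to an integer by the %d conversion.
min_avg_max() {
    awk 'NR == 1 { min = max = $1 + 0 }
         { v = $1 + 0; sum += v; if (v < min) min = v; if (v > max) max = v }
         END { if (NR > 0) printf "%d/%d/%d\n", min, sum / NR, max }'
}

# Made-up sample durations.
printf '%s\n' 5 9 13 | min_avg_max   # prints 5/9/13
```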

Final results taken on Tue Mar 24 at 13:44:29

  • Collections correctly submitted: 3750 (150000 jobs)
    • DONE OK: 145210 (96.81%)
    • ABORTED: 611 (0.41%)
    • NotDone: 4179 (2.78%)
    • Resubmitted: 1283 (0.86%)

  • Errors found (3191):
    • BLAH error (201 times 6.3%)
    • Cannot move ISB (569 times 17.83%)
    • Cannot move OSB (148 times 4.64%)
    • Transfer to CREAM failed (1895 times 59.39%)
    • Cannot take token (50 times 1.57%)
    • Proxy is expired (315 times 9.87%)
    • lsf_reason (9 times 0.28%)
    • blparser service is not alive (3 times 0.09%)
    • pbs_reason (1 time 0.03%)

  • All the aborted jobs (and most of the failures) are due to the proxy renewal mechanism, which was not working at the beginning of the test.
  • All the "NotDone" jobs were matched to a single CE (cert-05.cnaf.infn.it), which has some problems under investigation.
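The percentages in the result summaries are relative to the total number of jobs (job states) or to the total number of errors (error classes). A minimal helper to recompute them, assuming two-decimal rounding, could look like this; the function name pct is a hypothetical choice.

```shell
#!/bin/sh
# Print 100 * $1 / $2 rounded to two decimals.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

pct 145210 150000   # DONE OK jobs over all jobs: prints 96.81
pct 1895 3191       # "Transfer to CREAM failed" over all errors: prints 59.39
```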

14) Test starts on Fri Mar 13 at 11:44:57 CET 2009 (WMS: wms008)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf and cream-21.pd)
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch")
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Use rpms from patch #2459

Results (Mon Mar 16 17:21:55). Test interrupted due to a problem in the LSF server at CNAF.

  • Collections correctly submitted: 2932 (117280 jobs)
    • DONE OK: 113758 (97%)
    • ABORTED: 3 (0.003%)
    • NotDone: 3519 (3%)
    • Resubmitted: 141 (0.12%)

  • Errors found (179):
    • Cannot move ISB (14 times 7.82%)
    • BLAH error (92 times 51.4%)
      • blah error: send command timeout (28 times)
      • submission command failed (exit code = -15) (stdout:) (stderr: exe_getouterr: 200 seconds timeout expired, killing child process.- killed by signal 15.-) N/A (jobId = [...]) (43 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot connect to LSF. Please wait ...-Cannot connect to LSF. Please wait ...- exe_getouterr: 200 seconds timeout expired, killing child process.-) N/A (jobId = [...]) (13 times)
      • no jobId in submission script's output (stdout:) (stderr: exe_getouterr: 200 seconds timeout expired, killing child process.-) N/A (jobId = [...]) (7 times)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (1 time)
    • Transfer to CREAM failed (56 times 31.28%)
      • due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Fri 13 Mar 2009 11:50:35] ErrorCode=[0] Description=[system error] FaultCause=[cannot create the job's working directory! The problem seems to be related to glexec]" (56 times)
    • Cannot take token (14 times 7.82%)
    • Others (3 times 1.68%)
      • The job cannot be submitted because the blparser service is not alive (3 times)

13) Test starts on Wed Mar 4 at 13:41:28 CET 2009 (WMS: wms007)

Description:
  • 120 collections each of 60 jobs
  • One collection every 60 seconds
  • Four users
  • The job is a "sleep 313"
  • Resubmission is enabled
  • We use both CREAM and LCG CEs
  • Long proxy

Test finishes on Wed Mar 4 at 15:38:14 CET 2009

  • 120 collections submitted in 1108 seconds: 4/9/16 (min/avg/max)

Final results

  • Collections correctly submitted: 120 (7200 jobs)
    • DONE OK: 7198 (99.97%)
      • CREAM: 2537
      • LCG: 4661
    • ABORTED: 0 (0%)
    • Not finished: 2 (0.03%)
      • LCG: 2
    • Resubmitted: 182 (2.53%)

12) Test starts on Tue Feb 24 at 10:29:07 CET 2009 (WMS: wms007)

Description:
  • 120 collections each of 60 jobs
  • One collection every 60 seconds
  • Four users
  • The job is a "sleep 313"
  • Resubmission is enabled
  • We use both CREAM and LCG CEs
  • Long proxy

Test finishes on Tue Feb 24 at 12:25:54 CET 2009

  • 92 collections submitted in 712 seconds: 4/7/13 (min/avg/max)
    • 28 submissions failed (due to load limiter)

Final results

  • Collections correctly submitted: 91 (5460 jobs)
    • DONE OK: 5460 (100%)
      • CREAM: 3400
      • LCG: 2060
    • ABORTED: 0 (0%)
    • Resubmitted: 163 (2.99%)

  • The submission of one collection failed due to:
      Status Reason:      LBProxy is enabled
      Unable to query LB and LBProxy
      edg_wll_QueryEvents[Proxy]
      Exit code: 1413
      LB[Proxy] Error: DNS resolver error
      (edg_wll_gss_connect(): Unknown host)

11) Test starts on Mon Feb 23 at 15:25:49 CET 2009 (WMS: wms007)

Description:
  • 120 collections each of 60 jobs
  • One collection every 60 seconds
  • Four users
  • The job is a "sleep 313"
  • Resubmission is enabled
  • We use both CREAM and LCG CEs

Test finishes on Mon Feb 23 at 17:22:32 CET 2009

  • 120 collections submitted in 1092 seconds: 5/9/17 (min/avg/max)

Final results

  • Collections correctly submitted: 120 (7200 jobs)
    • DONE OK: 6950 (96.53%)
      • CREAM: 2243
      • LCG: 4707
    • ABORTED: 249 (3.46%)
      • LCG: 249
    • Not finished: 1 (0.01%)
      • LCG: 1
    • Resubmitted: 696 (9.7%)

  • All the aborted jobs failed with "proxy expired" because the proxy renewal daemon was not working.

10) Test starts on Thu Feb 5 at 12:41:35 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch")
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE
      • Fix a problem with proxy renewal seen in the previous test
      • Removed useless check of proxy duration in subscriptionManager, which could result in performance problems

Final results taken on Tue Feb 10 at 16:20:32 CET 2009

  • Collections correctly submitted: 1568 (62720 jobs)
    • DONE OK: 58641 (93.5%)
    • ABORTED: 4079 (6.5%)
    • Resubmitted: 41601 (66.33%)

  • Errors found (82530):
    • Cannot move ISB (75625 times 91.63%)
    • Cannot move OSB (115 times 0.14%)
    • Proxy is expired (5711 times 6.92%)
    • pbs_reason (702 times 0.85%)
      • pbs_reason=1; [...] proxy expired (477 times)
      • pbs_reason=271 (225 times)
    • Transfer to CREAM failed (187 times 0.23%)
      • due to exception: CREAM Register raised std::exception Connection to service [...] failed: (187 times)
    • lsf_reason (184 times 0.23%)
      • lsf_reason=36608; Proxy expired: job killed Terminated Master process killed (138 times)
      • lsf_reason=256 (43 times)
      • lsf_reason=1603 (3 times)
    • Cannot take token (4 times 0%)
    • BLAH error (2 times 0%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (2 times)

  • Job Aborted (4079)
    • request expired (792 times 19.42%)
    • hit job shallow retry count (3) (3263 times 80%)
    • hit job retry count (2) (24 times 0.58%)

Figure ice10.png: Test 10 ICE submission rate

9) Test starts on Fri Jan 30 at 12:41:22 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch";)
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45913 (only on cream-04.pd.infn.it)
      • Fix for bug #46283 (only on cream-04.pd.infn.it and cert-04.pd.infn.it)
    • ICE
      • Fix for bug #46405
      • 5 sec. (instead of 60) of delay between two LB logging tries
      • Error code is printed in the ICE log file when a log to LB fails

Test finishes on Mon Feb 2 at 12:39:09 CET 2009

  • 1427 collections submitted in 29752 seconds: 5/20/90 (min/avg/max)
    • 2893 submissions failed due to load limiter

Results taken on Mon Feb 02 at 15:24:31 CET 2009

  • Collections correctly submitted: 1427 (57080 jobs)
    • DONE OK: 28244 (49.48%)
    • ABORTED: 0 (0%)
    • Not finished: 28836 (50.52%)
    • Resubmissions: 2012 (3.52%)

  • Errors found (2233):
    • BLAH error (19 times 0.85%)
      • submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15007)-) N/A (jobId = [...]) (18 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15007)- exe_getouterr: poll() got an unknown event (stdout 0x0010 - stderr: 0x0000).-) N/A (jobId = [...]) (1 time)
    • Cannot move ISB (1872 times 83.84%)
    • Proxy is expired (325 times 14.55%)
    • lsf_reason (1 time 0.04%)
      • lsf_reason=36608 (1 time)
    • pbs_reason (16 times 0.72%)
      • pbs_reason=1; [...] proxy expired (15 times)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (1 time)

Figure ice9.png: Test 9 ICE submission rate

8) Test starts on Mon Jan 26 17:59:02 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE:
      • Fixed problem with proxy renewal seen in previous test

Test interrupted on Thu Jan 29 12:05:12 CET 2009 due to a problem in the proxy-renewal service daemon

Results taken on Thu Jan 29 at 18:35:31 CET 2009

  • Collections correctly submitted: 1433 (57320 jobs)
    • DONE OK: 33566 (58.56%)
    • ABORTED: 8414 (14.68%)
    • Not finished: 15340 (26.76%)
    • Resubmissions: 33507 (58.46%)

  • Errors found (33957):
    • BLAH error (3 times 0.01%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (1 time)
      • submission command failed (exit code = 106) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (2 times)
    • Cannot move ISB (4513 times 13.29%)
    • Cannot move OSB (73 times 0.21%)
    • Transfer to CREAM failed (28607 times 84.24%)
      • due to exception: Authentication error: The proxy is EXPIRED! (28134 times)
      • due to exception: Authentication error: Unable to open the file [/var/glite/SandboxDir/[...]/user.proxy] : No such file or directory (432 times)
      • Failed to create a delegation id for job [...]: reason is Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (19 times)
      • Failed to create a delegation id for job [...]: reason is Failed proxy validation - it has expired. (4 times)
      • Failed to create a delegation id for job [...]: reason is CreamProxy_Delegate::execute() - Coundl't open proxyfile [...]: The proxy is EXPIRED! (1 time)
      • CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (12 times)
      • CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 28 Jan 2009 17:42:42] ErrorCode=[0] Description=[system error] FaultCause=[cannot create the job's working directory! The problem seems to be related to glexec]" (5 times)
    • Proxy is expired (688 times 2.03%)
    • lsf_reason (61 times 0.18%)
      • lsf_reason=65280 (32 times)
      • lsf_reason=36608 (22 times)
      • lsf_reason=1603 (1 time)
      • lsf_reason=256 (6 times)
    • pbs_reason (12 times 0.04%)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (12 times)

Figure ice8.png: Test 8 ICE submission rate

BUGS:

  • CREAM
    • #46405: VOMSWrapper should try more than once to open a proxy file
  • BLAH
    • #46283: Possible memory leak in strtoken function for BLParser

7) Test starts on Fri Jan 23 12:28:01 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE:
      • Fixed problem seen in previous test

Test interrupted on Mon Jan 26 17:05:12 CET 2009

Results taken on Mon Jan 26 at 18:20:31 CET 2009

  • Collections correctly submitted: 1741 (69640 jobs)
    • DONE OK: 51894 (74.52%)
    • ABORTED: 44 (0.06%)
    • Not finished: 16186 (23.24%)
    • CANCELLED: 1516 (2.18%)
    • Resubmissions: 18218 (26.16%)

  • Errors found (18218):
    • BLAH error (2 times 0.01%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (1 time)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = [...]) (1 time)
    • Cannot take token (77 times 0.42%)
    • Cannot move ISB (14222 times 78.07%)
    • Cannot move OSB (82 times 0.45%)
    • Transfer to CREAM failed (4 times 0.02%)
      • due to exception: Authentication error: Unable to open the file [/var/glite/SandboxDir/[...]/user.proxy] : No such file or directory (4 times)
    • lsf_reason (30 times 0.16%)
      • lsf_reason=36608 (17 times)
      • lsf_reason=256 (13 times)
    • Proxy is expired (3113 times 17.09%)
    • pbs_reason (688 times 3.78%)
      • pbs_reason=-1 (656 times)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (32 times)

Figure ice7.png: Test 7 ICE submission rate

6) Test starts on Thu Jan 22 at 17:17:38 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45718
      • Fix for bug #45983
      • Fix for bug #46024
    • ICE
      • Use of same delegationid if CREAM complains that it doesn't exist anymore

Test aborted on Fri Jan 23 10:30:12 CET 2009

5) Test starts on Wed Jan 21 at 12:45:49 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • Set max_ice_threads = 40;
  • Used the CEs of testbedB (only PD)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Wed Jan 21 at 17:44:58 CET 2009

  • 224 collections submitted in 7416 seconds: 64/33/6 (max/avg/min)
    • 76 submissions failed due to load limiter

  • Collections correctly submitted: 224 (17920 jobs)
    • DONE OK: 17850 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 70 (0.4%)
    • Resubmissions: 8 (0.04%)

  • Errors found (8):
    • Cannot move ISB (3 times)
    • lsf_reason=1603 (3 times)
    • BLAH error (2 times)

Figure ice5.png: Test 5 ICE submission rate

4) Test starts on Tue Jan 20 at 10:09:58 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF) (cert-06 at cnaf is not considered)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Tue Jan 20 at 15:13:14 CET 2009

  • 197 collections submitted in 8499 seconds: 84/43/9 (max/avg/min)
    • 103 submissions failed due to load limiter

  • Collections correctly submitted: 197 (15760 jobs)
    • DONE OK: 15695 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 65 (0.4%)
    • Resubmissions: 3 (0.02%)

  • Errors found (3):
    • Cannot move OSB (1 time)
    • Cannot move ISB (2 times)

Figure ice4.png: Test 4 ICE submission rate

3) Test starts on Mon Jan 19 at 15:22:51 CET 2009 (WMS: devel14)

Description:
  • 2880 collections each of 40 jobs
  • One collection every 30 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test has been modified on Mon Jan 19 at 17:03:41:

  • 1440 collections each of 80 jobs
  • One collection every 60 seconds

Test finishes on Tue Jan 20 at 01:53:01 CET 2009

  • Collections correctly submitted: 399 (24800 jobs)
    • DONE OK: 24702 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 98 (0.4%)
    • Resubmissions: 1176 (4.74%)

  • Errors found (1176):
    • Cannot take token (1 time 0.08%)
    • Cannot move ISB (39 times 3.32%)
    • Transfer to CREAM failed (50 times 4.25%)
      • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (7 times)
      • Transfer to CREAM failed due to exception: Authentication error: The proxy is EXPIRED! (43 times)
    • lsf_reason=32512 (1085 times 92.27%)
    • lsf_reason=306 (1 time 0.08%)

Figure ice3.png: Test 3 ICE submission rate

2) Test starts on Tue Jan 13 at 15:38:11 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45437
      • Fix for bug #45736
    • ICE
      • Management of serialization error
      • Renewal done at 80 % of lifetime of proxy (or when there are only 20 minutes left)

Test finishes on Sun Jan 18 at 15:42:28 CET 2009

  • 7180 collections submitted in 70789 seconds: 141/9/3 (max/avg/min)
    • 20 submissions failed due to load limiter

  • Collections correctly submitted: 7180 (287200 jobs)
    • DONE OK: 284838 (99.18%)
    • ABORTED: 0 (0.0%)
    • Not finished: 2362 (0.82%)
    • Resubmissions: 4599 (1.60%)

  • Errors found (4599):
    • blparser service is not alive (578 times 12.57%)
    • BLAH error (288 times 6.26%)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = [...]) (52 times)
      • send command timeout (2 times)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (2 times)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = [...]) (219 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (13 times)
    • Cannot take token (201 times 4.37%)
    • Cannot move OSB (1 time 0.02%)
    • Cannot move ISB (5 times 0.11%)
    • Transfer to CREAM failed (19 times 0.41%)
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]" (10 times)
      • CREAM Register raised std::exception Connection to service [https://cert-xx.cnaf.infn.it:8443/ce-cream/services/CREAM2] failed: (9 times)
    • lsf_reason=32512 (3505 times 76.22%)
    • Proxy is expired (1 time 0.02%)
    • lsf_reason=306 (1 time 0.02%)

  • Jobs not finished:

    Scheduled  Running  Total  CE name
            0        6      6  cert-04.cnaf.infn.it
            6      349    355  cream-34.pd.infn.it
            0        3      3  cream-26.pd.infn.it
            0        2      2  cream-25.pd.infn.it
            5      332    337  cream-28.pd.infn.it
            0        1      1  cream-27.pd.infn.it
            0        2      2  cream-22.pd.infn.it
            0        5      5  cream-04.pd.infn.it
            0        1      1  cream-23.pd.infn.it
            0        2      2  cert-07.cnaf.infn.it
            1        0      1  cert-13.cnaf.infn.it
            6      334    340  cream-29.pd.infn.it
            9      327    336  cream-33.pd.infn.it
            0        5      5  cert-08.cnaf.infn.it
            0        2      2  cert-05.cnaf.infn.it
            6        0      6  cert-06.cnaf.infn.it
            6      307    313  cream-31.pd.infn.it
            4      296    300  cream-32.pd.infn.it
            8      337    345  cream-30.pd.infn.it
           51     2311   2362  Totals

BUGS:

  • CREAM
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45983: BLAH can leave children processes behind.

1) Test starts on Wed Jan 7 at 16:01:32 CET 2009 (WMS: devel18)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-12.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled

Test stopped on Mon Jan 12 due to a serialization error in ICE

Results taken on Mon Jan 12 at 12:52:56 CET 2009

  • Collections correctly submitted: 3733 (149320 jobs)
    • DONE OK: 144004 (96.44%)
    • ABORTED: 446 (0.3%)
    • Not finished: 4870 (3.26%)

  • Errors found:
    • Transfer to CREAM failed due to exception:
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Rollback executed due to: Deadlock found when trying to get lock; try restarting transaction]"
      • Authentication error: Unable to open the file [...]: No such file or directory
      • Connection to service [...] failed:
      • FaultCause=[User [...] not authorized for operation JobRegister]
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]"
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Server connection failure during transaction. Due to underlying exception: 'java.net.SocketException: Too many open files'.
      • FaultCause=[java.net.UnknownHostException: cream-31.pd.infn.it: cream-31.pd.infn.it]"
      • CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client]
      • Failed to get lease_id for job [...] Exception is Lease renew operation FAILED for lease ID [...] Exception is Connection to service [https://cream-29.pd.infn.it:8443/ce-cream/services/CREAM2] failed:
      • CREAM Start failed due to error MethodName=[JOB_START] Timestamp=[Wed 07 Jan 2009 22:10:43] ErrorCode=[2] Description=[the job has a status not compatible with the JOB_START command!] FaultCause=[N/A]
    • BLAH error:
      • submission command failed (exit code = -15) (stdout:) (stderr:/opt/glite/etc/blah.config: line 54: syntax error near unexpected token `('-/opt/glite/etc/blah.config: line 54: `//Added for test by Enrico Fattibene (07/01/2009)'--killed by signal 15-) N/A (jobId = CREAM251333253)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = CREAM550710004)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = CREAM027575485)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = CREAM752590056)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = CREAM988027857)
    • DELEGATION_PROXY_CERT_SANDBOX_PATH not defined!
    • Cannot move ISB [...] The proxy credential [...] expired 0 minutes ago.
    • Proxy is expired; Proxy expired: job killed Terminated Master process killed
    • lsf_reason=32512
    • Lease expired
    • The job cannot be submitted because the blparser service is not alive

BUGS:

  • CREAM
    • #45914: glexec and proxy rotation
    • #45913: Proxy renewal not done for CREAM jobs not yet in IDLE status
    • #45736: Problems in case of resubmissions in the same CREAM CE
    • #45437: Sometimes the jobPurger throws the exception "Too many open files"
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45717: BLParserPBS should consider log lines like "unable to run job"

-- AlessioGianelle - 08 Jan 2009

Topic attachments
Attachment  Size   Date                Who              Comment
ice3.png    4.5 K  2009-01-20 - 16:16  AlessioGianelle  Test 3 ICE submission rate
ice9.png    5.4 K  2009-02-02 - 11:33  AlessioGianelle  Test 9 ICE submission rate
ice10.png   5.6 K  2009-02-09 - 11:28  AlessioGianelle  Test 10 ICE submission rate
ice7.png    5.8 K  2009-01-26 - 17:08  AlessioGianelle  Test 7 ICE submission rate
ice8.png    5.9 K  2009-01-28 - 11:06  AlessioGianelle  Test 8 ICE submission rate
ice5.png    6.0 K  2009-01-22 - 08:41  AlessioGianelle  Test 5 ICE submission rate
ice4.png    6.4 K  2009-01-21 - 10:15  AlessioGianelle  Test 4 ICE submission rate
Topic revision: r75 - 2011-02-24 - AlessioGianelle