TESTs on ICE (Query Event)

16) Test starts on Wed Mar 10 at 16:13:15 CET 2010 (WMS: devel20)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs located at Padua and CNAF:
    • 6 CEs SL5/64b with cream version 1.12 (2 lsf + 4 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(7200)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • Update QueryEvents mechanism introducing parallelism
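
The description above fixes the nominal load of the test. As a rough sketch, the expected totals can be worked out as below. It assumes that "sleep random(7200)" draws the sleep time uniformly between 0 and 7200 seconds (the page does not state the distribution) and that one collection every 60 seconds is the overall rate, not a per-user rate. The function name nominal_load is illustrative only.

```python
from datetime import timedelta


def nominal_load(collections, jobs_per_collection, interval_s, max_sleep_s):
    """Return (total jobs, nominal submission window, mean jobs running at once)."""
    total_jobs = collections * jobs_per_collection
    window = timedelta(seconds=collections * interval_s)
    # Assumption: sleep times are uniform on [0, max_sleep_s], so the mean is half the maximum.
    mean_sleep_s = max_sleep_s / 2
    # Little's law: mean jobs in the system = arrival rate * mean time in the system.
    mean_running = jobs_per_collection * mean_sleep_s / interval_s
    return total_jobs, window, mean_running


total_jobs, window, mean_running = nominal_load(4320, 40, 60, 7200)
print(total_jobs, window, mean_running)
```

The estimate of concurrently running jobs ignores queueing time on the CEs, resubmissions and failed submissions, so it is only a lower-bound style guide to the load the CEs see.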

Submissions finish on Sun Mar 14 at 15:01:38 CET 2010

  • 4306 collections submitted in 268194 seconds: 5/62/529 (min/avg/max)
    • 14 submissions fail
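
The min/avg/max triple appears to be the per-collection submission time in seconds, with the average obtained by dividing the total by the number of collections and dropping the fraction. That reading is inferred from the figures, not stated on this page; the sketch below checks it against three tests reported here. The helper name average_submission_time is hypothetical.

```python
# (collections submitted, total seconds, reported average) as listed on this page
REPORTED = {
    16: (4306, 268194, 62),
    15: (14300, 492083, 34),
    14: (2878, 18140, 6),
}


def average_submission_time(collections, total_seconds):
    """Average seconds per collection, truncated to whole seconds."""
    return total_seconds // collections


for test, (collections, total_seconds, reported_avg) in REPORTED.items():
    computed = average_submission_time(collections, total_seconds)
    print(f"test {test}: computed {computed} s, reported {reported_avg} s")
```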

Final results

  • Collections correctly submitted: 3551 (142040 jobs)
    • DONE OK: 142039 (99.99%)
    • NOTDONE: 0 (0 %)
    • ABORTED: 1 ( - %)
    • CANCELLED: 0 (0 %)
    • Resubmitted: 5 ( - %)

Note:

  • 755 collections failed to be submitted by the workload-manager. Each failure logs a pair of messages like:
    11 Mar, 13:51:43 -E: [Error] unrecoverable_collection(submit_request.cpp:93): https://devel15.cnaf.infn.it:9000/otaMq9ix-WGKwJLEcXeoAQ: unable to retrieve children information from jobstatus
    11 Mar, 13:51:43 -E: [Error] unrecoverable(submit_request.cpp:111): https://devel15.cnaf.infn.it:9000/otaMq9ix-WGKwJLEcXeoAQ failed (request expired)
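
A count such as the 755 failed collections can be obtained by grouping workload-manager error lines by job identifier. The sketch below is one possible way to do that; the regular expression is inferred from the two sample lines only, and the group names (stamp, level, function, job_id, message) are illustrative labels, not names used by the workload-manager.

```python
import re
from collections import defaultdict

# Pattern inferred from the two sample lines only; group names are illustrative.
LOG_RE = re.compile(
    r"^(?P<stamp>\d{1,2} \w{3}, \d{2}:\d{2}:\d{2}) "
    r"-(?P<level>\w): \[(?P<label>\w+)\] "
    r"(?P<function>\w+)\((?P<source>[\w.]+):(?P<line>\d+)\): "
    r"(?P<job_id>https://\S+?):? (?P<message>.+)$"
)

SAMPLE = [
    "11 Mar, 13:51:43 -E: [Error] unrecoverable_collection(submit_request.cpp:93): "
    "https://devel15.cnaf.infn.it:9000/otaMq9ix-WGKwJLEcXeoAQ: "
    "unable to retrieve children information from jobstatus",
    "11 Mar, 13:51:43 -E: [Error] unrecoverable(submit_request.cpp:111): "
    "https://devel15.cnaf.infn.it:9000/otaMq9ix-WGKwJLEcXeoAQ failed (request expired)",
]


def failures_by_job(lines):
    """Group error messages by job identifier, skipping lines that do not match."""
    grouped = defaultdict(list)
    for raw in lines:
        match = LOG_RE.match(raw.strip())
        if match:
            grouped[match["job_id"]].append(match["message"])
    return dict(grouped)


failures = failures_by_job(SAMPLE)
print(len(failures), "failed collection(s)")
```

Both sample lines refer to the same collection, so they collapse into a single entry with two messages.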

ice16.png

QE16.png

15) Test starts on Fri Feb 26 at 15:23:29 CET 2010 (WMS: devel20)

Description:
  • 14400 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs located at Padua and CNAF:
    • 6 CEs SL5/64b with cream version 1.12 (2 lsf + 4 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(7200)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Thu Mar 4 at 23:49:18 CET 2010

  • 14300 collections submitted in 492083 seconds: 4/34/228 (min/avg/max)
    • 100 submissions fail

Final results

  • Collections correctly submitted: 13994 (279880 jobs)
    • DONE OK: 278744 (99.594%)
    • NOTDONE: 0 (0 %)
    • ABORTED: 10 (0.004 %)
    • CANCELLED: 1126 (0.402 %) (Stuck in torque queues)
    • Resubmitted: 832 (0.3 %)

  • Errors found (921)
    • BLAH error: (358 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:Connection timed out-qsub: cannot connect to server devel03.cnaf.infn.it (errno=110) Connection timed out-) N/A
      • no jobId in submission script's output (stdout:) (stderr: execute_cmd: 200 seconds timeout expired, killing child process.-)
      • submission command failed (exit code = 1) (stdout:) (stderr:Failed in an LSF library call: Error 0. Job not submitted.-)
      • submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server devel03.cnaf.infn.it (errno=15007) Unauthorized Request -)
    • Cannot move OSB ... proxy expired (93 times)
    • Cannot take token (6 times)
    • reason=1 ... proxy expired (9 times)
    • reason=999 (453 times)
    • Transfer to CREAM failed due to exception.. (2 times)

ice15.png

qe15.png

Note:

  • 306 collections (i.e. 6120 jobs) stuck on wmproxy
  • All aborted jobs failed with "Input sandbox's proxy is missing. Cannot resubmit job". The proxyrenewal daemon probably renewed the collection's proxy too late.

14) Test starts on Wed Feb 24 at 17:35:29 CET 2010 (WMS: devel20)

Description:
  • 2880 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs located at Padua:
    • 5 CEs SL5/64b with cream version 1.12 (2 lsf + 3 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 10 CEs SL4 with cream version 1.12 (5 lsf + 5 torque)
  • Use automatic-delegation
  • The job is a "sleep random(7200)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • Update QueryEvents mechanism (only useful events are sent by the CEs)

Submissions finish on Thu Feb 25 at 17:34:45 CET 2010

  • 2878 collections submitted in 18140 seconds: 3/6/42 (min/avg/max)
    • 2 submissions fail

Final results

  • Collections correctly submitted: 2878 (57560 jobs)
    • DONE OK: 57297 (99.54%)
    • NOTDONE: 0 (0 %)
    • ABORTED: 9 (0.02 %)
    • CANCELLED: 254 (0.44 %) (Stuck in pbs queues)
    • Resubmitted: 79 (0.14 %)

  • Errors found (90)
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Failed in an LSF library call: Error 0. Job not submitted.-TERM environment variable not set.-) N/A (3 times)
    • Cannot move ISB ... proxy expired (33 times)
    • Cannot move OSB ... proxy expired (40 times)
    • reason=1 (4 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2010-02-24T18:14:38.863Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-26.pd.infn.it] (1 time)
    • Input sandbox's proxy is missing. Cannot resubmit job (9 times). This is the reason for the aborted jobs (see bugs #52710 and #43577)
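
The percentages and the error total of this test can be cross-checked from the raw counts. The following is a minimal sketch using only the figures reported above; it assumes that percentages are taken over the jobs of the correctly submitted collections and rounded to two decimals, which matches the values shown here but is not stated on the page.

```python
TOTAL_JOBS = 2878 * 20  # collections correctly submitted x jobs per collection

FINAL_STATES = {"DONE OK": 57297, "NOTDONE": 0, "ABORTED": 9, "CANCELLED": 254}
RESUBMITTED = 79
ERROR_COUNTS = [3, 33, 40, 4, 1, 9]  # per-category counts, in the order listed
REPORTED_ERRORS = 90


def percent(count, total):
    """Share of total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {state: percent(count, TOTAL_JOBS) for state, count in FINAL_STATES.items()}
resubmitted_share = percent(RESUBMITTED, TOTAL_JOBS)

# Final states should partition the jobs; error categories should add up to the headline.
states_consistent = sum(FINAL_STATES.values()) == TOTAL_JOBS
errors_consistent = sum(ERROR_COUNTS) == REPORTED_ERRORS
print(shares, resubmitted_share, states_consistent, errors_consistent)
```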

ice14.png

qe.png

13) Test starts on Thu Feb 11 at 15:48:14 CET 2010 (WMS: devel20)

Description:
  • 6000 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs distributed between Padua and Bologna:
    • 6 CEs SL5/64b with cream version 1.12 (2 lsf + 4 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(7200)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • Use only one HD.
    • Changes in my.cnf file:
      set-variable = innodb_flush_log_at_trx_commit=2

Submissions finish on Sat Feb 13 at 17:48:25 CET 2010

  • 5997 collections submitted in 33256 seconds: 3/5/56 (min/avg/max)
    • 3 submissions fail

Final results

  • Collections correctly submitted: 5997 (119940 jobs)
    • DONE OK: 117898 (98.3%)
    • NOTDONE: 0 (0 %)
    • ABORTED: 66 (0.05 %)
    • CANCELLED: 1976 (1.65 %) (Stuck in pbs queues)
    • Resubmitted: 645 (0.54 %)

  • Errors found (806)
    • Input sandbox's proxy is missing. Cannot resubmit job (66 times). This is the reason for the ABORTED jobs
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Failed in an LSF library call: Error 0. Job not submitted.-TERM environment variable not set.-) (10 times)
    • Proxy expired: (687 times)
      • Cannot move ISB
      • Cannot move OSB
      • reason=1
      • reason=271
    • Transfer to CREAM failed due to exception (4 times)
    • Cannot take token (25 times)
    • /opt/edg/libexec/edg-gridftp-base-rm: timeout exceeded (14 times)

ice13.png

12) Test starts on Mon Feb 8 at 17:26:42 CET 2010 (WMS: devel20)

Description:
  • 2880 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs distributed between Padua and Bologna:
    • 6 CEs SL5/64b with cream version 1.12 (2 lsf + 4 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(7200)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • Use two HDs; the second one holds the "persist directory" of ice (i.e. its internal database) and the mysql directory (i.e. /var/lib/mysql)
    • Changes in my.cnf file:
      set-variable = innodb_buffer_pool_size=1800M
      set-variable = innodb_additional_mem_pool_size=200M
      set-variable = innodb_flush_log_at_trx_commit=0
      set-variable = innodb_log_file_size=100M
      set-variable = innodb_log_group_home_dir=/var/lib/mysql_logfiles
    • LBProxy and LBServer databases have been scratched
    • All SandBox directories have been removed
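
The "set-variable = name=value" lines use an older MySQL option-file syntax; later MySQL releases expect plain "name = value" entries instead. As a hedged sketch, the settings listed above can be rewritten mechanically as shown below. The helper name modernize is hypothetical, and whether the server used in this test accepted the newer form is not stated on this page.

```python
SETTINGS = """\
set-variable = innodb_buffer_pool_size=1800M
set-variable = innodb_additional_mem_pool_size=200M
set-variable = innodb_flush_log_at_trx_commit=0
set-variable = innodb_log_file_size=100M
set-variable = innodb_log_group_home_dir=/var/lib/mysql_logfiles
"""


def modernize(option_lines):
    """Rewrite 'set-variable = name=value' lines as 'name = value'; keep others unchanged."""
    result = []
    for line in option_lines.splitlines():
        key, sep, rest = line.partition("=")
        if sep and key.strip() == "set-variable":
            name, _, value = rest.strip().partition("=")
            result.append(f"{name.strip()} = {value.strip()}")
        else:
            result.append(line)
    return result


for line in modernize(SETTINGS):
    print(line)
```

Of these settings, innodb_flush_log_at_trx_commit=0 is the one that trades durability for speed: the InnoDB log is written and flushed roughly once per second instead of at every commit, so a crash can lose about a second of transactions.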

Submissions finish on Tue Feb 9 at 17:25:27 CET 2010

  • 2876 collections submitted in 12475 seconds: 3/4/37 (min/avg/max)
    • 4 submissions fail

Final results

  • Collections correctly submitted: 2876 (57520 jobs)
    • DONE OK: 56971 (99.05 %)
    • NOTDONE: 0 (0 %)
    • ABORTED: 0 (0 %)
    • CANCELLED: 549 (0.95 %) (Stuck in pbs queues)

  • Errors found (363)
    • BLAH error (9 times)
    • proxy expired (341 times)
    • Cannot take token (7 times)
    • reason=1; /opt/edg/libexec/edg-gridftp-base-rm: timeout exceeded Cannot take token (5 times)
    • Transfer to CREAM failed due to exception (1 time)

ice12.png

11) Test starts on Thu Feb 4 at 11:16:00 CET 2010 (WMS: devel20)

Description:
  • 1600 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs distributed between Padua and Bologna:
    • 3 CEs SL5/64b with cream version 1.12 (2 lsf + 1 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(3600)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used
  • Logging to LB from ICE is disabled

Submissions finish on Fri Feb 5 at 00:41:33 CET 2010

  • 1600 collections submitted in 16233 seconds: 4/10/74 (min/avg/max)

Final results

  • Collections correctly submitted: 1600 (32000 jobs)
    • DONE OK: 31328 (97.9 %)
    • NOTDONE: 0 (0 %)
    • ABORTED: 7 (0.02 %)
    • CANCELLED: 665 (2.08 %) (test interrupted)

  • Errors found (198)
    • reason=999 (22 times)
    • reason=1 [...] proxy expired (176 times)

ice11.png

10) Test starts on Tue Feb 2 at 12:38:25 CET 2010 (WMS: devel20)

Description:
  • 2400 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs distributed between Padua and Bologna:
    • 3 CEs SL5/64b with cream version 1.12 (2 lsf + 1 torque)
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 11 CEs SL4 with cream version 1.12 (5 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(3600)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Wed Feb 3 at 08:59:20 CET 2010

  • 2397 collections submitted in 39661 seconds: 4/16/99 (min/avg/max)
    • 3 submissions fail (due to limiter "FTP connections")

Final results

  • Collections correctly submitted: 2397 (47940 jobs)
    • DONE OK: 47728 (99.56 %)
    • NOTDONE: 20* (0.04 %)
    • ABORTED: 17** (0.04 %)
    • CANCELLED: 175*** (0.36 %) (jobs held in the torque system)
    • Resubmitted: 210 (0.44 %)

  • Errors found (213****)
    • reason=999*** (163 times)
    • reason=127; /opt/lcg/libexec/jobwrapper: line 42: ./CREAM500950657_jobWrapper.sh: No such file or directory (1 time)
    • reason=255 (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred. (3 times)
    • The endpoint is blacklisted (43 times)
    • Transfer to CREAM failed due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 03 Feb 2010 01:27:19] ErrorCode=[0] Description=[cannot store the delegation proxy locally] FaultCause=[Cannot run program "chmod": java.io.IOException: error=12, Cannot allocate memory]" (2 times)

ice10.png

Note:

* The NOTDONE ("not terminated") jobs are distributed as follows:

  • 1 collection (i.e. 20 jobs) stuck on wmproxy (midnight problem)

** All aborted jobs failed with "Input sandbox's proxy is missing. Cannot resubmit job". The proxyrenewal daemon probably renewed the collection's proxy too late.

*** The cancelled jobs were blocked in the pbs queues and removed by the sysadmin with qdel; in these cases the reported error is "reason=999".

**** Almost all errors occur on the same CE: cream-34.pd.infn.it (a CREAM 1.11 CE).

9) Test starts on Wed Jan 27 at 07:59:34 CET 2010 (WMS: devel20)

Description:
  • 2400 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • We use these CEs distributed between Padua and Bologna:
    • 4 CEs SL4 with cream version 1.11 (2 lsf + 2 torque)
    • 3 CEs SL5/64b with cream version 1.12 (2 lsf + 1 torque)
    • 12 CEs SL4 with cream version 1.12 (6 lsf + 6 torque)
  • Use automatic-delegation
  • The job is a "sleep random(3600)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Thu Jan 28 at 04:24:31 CET 2010

  • 2397 collections submitted in 43027 seconds: 4/17/102 (min/avg/max)
    • 3 submissions fail

Final results

  • Collections correctly submitted: 2397 (47940 jobs)
    • DONE OK: 47592 (99.27 %)
    • NOTDONE: 85* (0.18 %)
    • ABORTED: 0 (0 %)
    • CANCELLED: 263** (0.55 %) (jobs held in the torque system)
    • Resubmitted: 369 (0.77 %)

  • Errors found (379)
    • BLAH error ... (20 times)
    • Cannot move ISB ... (3 times)
    • Cannot take token (13 times)
    • reason=127 ... (3 times)
    • Problem to detect the lifetime of the proxy ... (5 times)
    • reason=1 (1 time)
    • reason=255 (1 time)
    • reason=999** (279 times)
    • SOCKET TIMEOUT occurred ... (3 times)
    • The endpoint is blacklisted ... (51 times)

ice09.png

qe09.png

Note:

* The NOTDONE ("not terminated") jobs are distributed as follows:

  • 2 collections (i.e. 40 jobs) stuck on wmproxy (midnight problem)
  • 43 jobs stuck in the pbs queue (see bug #62070)

** The cancelled jobs were blocked in the pbs queues and removed by the sysadmin with qdel; in these cases the reported error is "reason=999".

8) Test starts on Tue Nov 17 at 11:56:53 CEST 2009 (WMS: devel20)

Description:
  • 4000 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA
  • Use automatic-delegation
  • The job is a "sleep random(900)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Wed Nov 18 at 21:23:06 CEST 2009

  • 3997 collections submitted in 35909 seconds: 2/8/70 (min/avg/max)
    • 3 submission(s) fail(s)

Final results taken on Thu Nov 19 at 12:09:23 CEST 2009

  • Collections correctly submitted: 3997 (79940 jobs)
    • DONE OK: 79798 (99.82 %)
    • NOTDONE: 0 (0 %)
    • ABORTED: 0 (0 %)
    • CANCELLED: 142 (0.18 %) (jobs held in the torque system)
    • Resubmitted: 13415 (16.78 %)

  • Errors found (13230)
    • Cannot take token (22 times)
    • reason=1 (14 times)
    • reason=127; /opt/lcg/libexec/jobwrapper: line 42: ./CREAM927980874_jobWrapper.sh: No such file or directory (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred. (527 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-11-17T20:59:46.997Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-04.pd.infn.it] (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (12602 times)
    • Transfer to CREAM failed due to exception: CREAM Start raised exception The endpoint is blacklisted (63 times)

ice08.png

7) Test starts on Fri Nov 13 at 16:13:38 CEST 2009 (WMS: devel20)

Description:
  • 4000 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA
  • Use automatic-delegation
  • The job is a "sleep random(300)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • CEs
      • Update to new release candidate version 1.12

Submissions finish on Sun Nov 15 at 01:33:23 CEST 2009

  • 3443 collections submitted in 31829 seconds: 2/9/44 (min/avg/max)
    • 557 submission(s) fail(s)

Final results taken on Mon Nov 16 at 10:09:23 CEST 2009

  • Collections correctly submitted: 3443 (68860 jobs)
    • DONE OK: 60522 (87.89 %)
    • NOTDONE: 6727 (9.77 %)
    • ABORTED: 1611 (2.34 %)
    • Resubmitted: 26726 (38.81 %)

  • Errors found (44031)
    • Cannot take token (6 times)
    • reason=1 (7 times)
    • reason=127 (3 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred (185 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception MethodName=[jobRegister] ErrorCode=[0] Description=[The CREAM service cannot accept jobs at the moment] FaultCause=[Threshold for Load Average(15 min): 20 => Detected value for Load Average(15 min): (289 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (43432 times)
    • Transfer to CREAM failed due to exception: CREAM Start raised exception The endpoint is blacklisted (108 times)
    • Transfer to CREAM failed due to exception: CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-11-14T14:00:38.733Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-04.pd.infn.it] (1 time)

ice07.png

6) Test starts on Fri Oct 30 at 15:23:47 CEST 2009 (WMS: devel20)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedB (i.e. Production CEs 1.11, query event is not implemented)
  • Use automatic-delegation
  • Use proxy renewal service (myproxy.cern.ch)
  • The job is a "sleep random(2447)"
  • Resubmission is enabled
  • Lease mechanism is not used

Submissions finish on Wed Nov 4 at 15:56:11 CEST 2009

  • 7091 collections submitted in 164788 seconds: 4/23/117 (min/avg/max)
    • 109 submission(s) fail(s)

Final results taken on Fri Nov 06 at 12:08:43 CEST 2009

  • Collections correctly submitted: 7091 (283640 jobs)
    • DONE OK: 275956 (97.29 %)
    • NOTDONE: 4823 (1.7 %) *
    • ABORTED: 8 (~0%)
    • CANCELLED: 2853 (1.01 %) **
    • Resubmitted: 2933 (1.03 %)

  • Errors found (3972)
    • blah error: send command timeout (50 times)
    • BLAH error: submission command failed (exit code = -15) (stdout:) (stderr: exe_getouterr: 200 seconds timeout expired, killing child process.- killed by signal 15.-) N/A (jobId = CREAM110305536) (1 time)
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Bad host name, host group name or cluster name. Job not submitted.-TERM environment variable not set.- execute_cmd: poll() got an unknown event (stdout 0x0010 - stderr: 0x0000).-) N/A (jobId = CREAM198982235) (1 time)
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Cannot connect to default server host 'cream-32.pd.infn.it' - check pbs_server daemon.-qsub: cannot connect to server cream-32.pd.infn.it (errno=111)-TERM environment variable not set.-) N/A (jobId = CREAM946959077) (1 time)
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Failed in an LSF library call: Failed in sending/receiving a message: Connection reset by peer. Job not submitted.-TERM environment variable not set.-) N/A (jobId = CREAM166499182) (1 time)
    • BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Master batch daemon internal error. Job not submitted.-TERM environment variable not set.-) N/A (jobId = CREAM105778508) (7 times)
    • Cannot move ISB (1820 times)
    • Cannot take token (190 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred. (6 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (3 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (40 times)
    • lsf_reason=32512; /opt/lcg/libexec/jobwrapper: line 42: ./CREAM391495093_jobWrapper.sh: No such file or directory (12 times)
    • lsf_reason=-1 (5 times)
    • lsf_reason=2 (1 time)
    • pbs_reason=-1 (1616 times)
    • pbs_reason=1 (8 times)
    • reason=1; /opt/edg/libexec/edg-gridftp-base-rm: error globus_ftp_client: the server responded with an error 500 500-Command failed : System error in unlink: No such file or directory 500-A system call failed: No such file or directory 500 End. Cannot take token (15 times)
    • reason=127; /opt/lcg/libexec/jobwrapper: line 42: ./CREAM077961558_jobWrapper.sh: No such file or directory (1 time)
    • reason=999 (194 times)

Note:

* The NOTDONE ("not terminated") jobs are distributed as follows:

  • 100 collections (i.e. 4000 jobs) failed to be submitted (by WM) with reason "request expired"
  • 361 jobs are running
  • 462 Done (FAILED)
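
The breakdown above can be checked against the NOTDONE figure in the final results. This is a small arithmetic sketch using only numbers from this test; the job counts are taken as given, and the dictionary labels are paraphrases rather than official status names.

```python
JOBS_PER_COLLECTION = 40
NOTDONE_REPORTED = 4823

breakdown = {
    "request expired (never submitted by WM)": 4000,
    "still running": 361,
    "Done (FAILED)": 462,
}

notdone_total = sum(breakdown.values())
# Number of whole collections that 4000 expired jobs correspond to.
expired_collections = 4000 // JOBS_PER_COLLECTION
print(notdone_total, expired_collections)
```

The three groups add up to the reported NOTDONE count, and 4000 jobs at 40 jobs per collection correspond to 100 collections.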

** Jobs were cancelled from the pbs queue because maui marked them as "Blocked Jobs"

ice06.png

5) Test starts on Thu Oct 22 at 12:51:04 CEST 2009 (WMS: devel20)

Description:
  • 2000 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA (cream-12.pd, cream-04.pd and devel03.cnaf)
  • Use automatic-delegation
  • The job is a "sleep random(2447)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Fri Oct 23 at 05:30:36 CEST 2009

  • 1455 collections submitted in 16993 seconds: 4/11/48 (min/avg/max)
    • 545 submission(s) fail(s)

Final results taken on Fri Oct 23 at 16:08:43 CEST 2009

  • Collections correctly submitted: 1455 (29100 jobs)
    • DONE OK: 26714 (91.8 %)
    • NOTDONE: 168 (0.58 %)
    • ABORTED: 2218 (7.62 %)
    • Resubmitted: 4101 (14.09 %)

  • Errors found (1758)
    • BLAH error: no jobId in submission script's output (stdout:) (stderr: execute_cmd: 200 seconds timeout expired, killing child process.-) N/A (jobId = xxx) (27 times)
    • blah error: send command timeout (21 times)
    • Cannot move ISB (${globus_transfer_cmd} gsiftp://devel20.cnaf.infn.it:2811...... ): proxy expired (1 time)
    • Cannot take token (39 times)
    • reason=999 (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Connection to service [https://cream-04.pd.infn.it:8443/ce-cream/services/CREAM2] failed: (852 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred (54 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception MethodName=[invoke] ErrorCode=[0] Description=[Authorization error: Cannot set permissions to the store proxy certificate] FaultCause=[Authorization error: Cannot set permissions to the store proxy certificate] Timestamp=[Fri 23 Oct 2009 04:31:45] (21 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception MethodName=[invoke] ErrorCode=[0] Description=[Authorization error: Cannot store proxy certificate] FaultCause=[Authorization error: Cannot store proxy certificate] Timestamp=[Fri 23 Oct 2009 04:32:16] (3 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (733 times)
    • Transfer to CREAM failed due to exception: CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Fri 23 Oct 2009 04:44:45] ErrorCode=[0] Description=[system error] FaultCause=[The problem seems to be related to glexec]" (1 time)
    • Transfer to CREAM failed due to exception: CREAM Start raised exception The endpoint is blacklisted (5 times)

ice05.png

4) Test starts on Mon Oct 19 at 12:00:05 CEST 2009 (WMS: devel20)

Description:
  • 4000 collections each of 20 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA (cream-12.pd, cream-04.pd and devel03.cnaf)
  • Use automatic-delegation
  • The job is a "sleep random(4242)"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Test interrupted on Tue Oct 20 at 10:29:16 CEST 2009

  • Problems with the CEs that are blacklisted
    • cream-04.pd.infn.it:
      java.net.SocketException
      MESSAGE: Too many open files
      20 Oct 2009 12:55:42,323 org.glite.voms.PKIStore - Cannot refresh store: null
    • cream-12.pd.infn.it and devel03.cnaf.infn.it probably have problems with the new BLParser

  • Restarted cream-04.pd.infn.it at 12:49 on Wed Oct 21

Partial results taken on Thu Oct 22 at 09:41:43 CEST 2009

  • Collections correctly submitted: 1832 (36640 jobs)
    • DONE OK: 30808 (- %)
    • CANCELLED: 4905 (- %)
    • ABORTED: 927 (- %)
    • Resubmitted: 7140 (- %)

  • Errors found (2970)
    • BLAH error: no jobId in submission script's output (stdout:) (stderr: execute_cmd: 200 seconds timeout expired, killing child process.-) N/A (jobId = ...) (13 times)
    • blah error: send command timeout (22 times)
    • Cannot move ISB (${globus_transfer_cmd} gsiftp://devel20.cnaf.infn.it:2811/var/glite/SandboxDir/9d/https_3a_2f_2fdevel15.cnaf.infn.it_3a9000_2f9dVthtBkKwyOaSHnSLeXSQ/input/pippo file:///home/dteam028/home_cream_638539945/CREAM638539945/pippo): Problem to detect the lifetime of the proxy (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Connection to service [https://cream-04.pd.infn.it:8443/ce-cream/services/CREAM2] failed: (603 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred. (66 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception MethodName=[invoke] ErrorCode=[0] Description=[Authorization error: Cannot store proxy certificate] FaultCause=[Authorization error: Cannot store proxy certificate] Timestamp=[Tue 20 Oct 2009 05:33:28] (3 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-10-19T16:08:21.143Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-12.pd.infn.it] (2 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (1740 times)
    • Transfer to CREAM failed due to exception: CREAM Start raised exception The endpoint is blacklisted (9 times)
    • Transfer to CREAM failed due to exception: Failed to create a delegation id for job https://devel15.cnaf.infn.it:9000/01bDqoEMYLtCgBJAwkGVBQ: reason is Connection to service [https://cream-04.pd.infn.it:8443/ce-cream/services/gridsite-delegation] failed: (511 times)

ice04.png

3) Test starts on Fri Oct 16 at 13:50:05 CEST 2009 (WMS: devel20)

Description:
  • 2000 collections each of 25 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA (cream-12.pd, cream-04.pd and devel03.cnaf)
  • Use automatic-delegation
  • The job is a "sleep 666"
  • Resubmission is enabled
  • Use proxy renewal service (myproxy.cern.ch)
  • Lease mechanism is not used

Submissions finish on Sat Oct 17 at 06:29:16 CEST 2009

  • 1746 collections correctly submitted in 15578 seconds: 4/8/25 (min/avg/max)
    • 254 submission failures due to the load limiter

Final results taken on Mon Oct 19 at 09:41:43 CEST 2009

  • Collections correctly submitted: 1746 (43650 jobs)
    • DONE OK: 43642 (99.98 %)
    • CANCELLED: 8 (0.02%)
    • Resubmitted: 961 (2.2%)

  • Errors found (1097)
    • BLAH error: no jobId in submission script's output (stdout:) (stderr: execute_cmd: 200 seconds timeout expired, killing child process.-) N/A (jobId = xxxxx) (91 times)
    • BLAH error: submission command failed (exit code = 201) (stdout:) (stderr:[gLExec]: gLExec has detected an input file change during the use of the file. It's unknown if this file-jacking was accidental or intentional.-) N/A (jobId = xxxxx) (1 time)
    • Cannot move ISB: proxy expired (1 time)
    • Cannot take token (105 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred. (42 times)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[] - FaultCode=[SOAP-ENV:Server.generalException] - FaultSubCode=[SOAP-ENV:Server.generalException] - FaultDetail=[invoke2009-10-16T15:38:04.085Z0cannot write the authN proxy to file: nullcannot write the authN proxy to file: nullorg.glite.ce.faults.AuthenticationFaultcream-12.pd.infn.it] (1 time)
    • Transfer to CREAM failed due to exception: CREAM Register raised std::exception The endpoint is blacklisted (856 times)

ice03.png

2) Test starts on Thu Oct 15 at 12:21:53 CEST 2009 (WMS: devel20)

Description:
  • 2000 collections each of 25 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA (cream-12.pd, cream-04.pd and devel03.cnaf)
  • Use automatic-delegation
  • The job is a "sleep 666"
  • Resubmission is enabled
  • Lease mechanism is not used

  • Changes in the software wrt previous test:
    • WMS

Submissions finish on Fri Oct 16 at 05:22:55 CEST 2009

  • 1834 collections correctly submitted
    • 166 submission failures

Final results taken on Fri Oct 16 at 10:26:21 CEST 2009

  • Collections correctly submitted: 1834 (45850 jobs)
    • DONE OK: 42868 (-%)
    • ABORTED: 2976 (-%) *
    • Resubmitted: 2976+8 (-%)

  • Errors found
    • Cannot take token (19 times)
    • BLAH error: submission command failed (exit code = 201) (stdout:) (stderr:[gLExec]: gLExec has detected an input file change during the use of the file. It's unknown if this file-jacking was accidental or intentional.- execute_cmd: poll() got an unknown event (stdout 0x0010 - stderr: 0x0000).-) N/A (jobId = CREAM603493778) (https://devel15.cnaf.infn.it:9000/hzShPwSsvd6S_1kQ2XsbvA)
    • Proxy is expired

ice02.png

Note:

* All aborted jobs failed with "proxy expired" because I forgot to activate the proxy renewal service.

1) Test starts on Wed Oct 14 at 15:04:19 CEST 2009 (WMS: devel20)

Description:
  • 400 collections each of 25 jobs
  • One collection every 30 seconds
  • Four users
  • max_ice_threads = 10
  • Use all the CEs of testbedA (cream-12.pd, cream-04.pd and devel03.cnaf)
  • Use automatic-delegation
  • The job is a "sleep 666"
  • Resubmission is enabled
  • Lease mechanism is not used

Submissions finish on Wed Oct 14 at 18:22:55 CEST 2009

  • 400 collections submitted in 2371 seconds: 3/5/15 (min/avg/max)

Final results taken on Thu Oct 15 at 10:26:21 CEST 2009

  • Collections correctly submitted: 400 (10000 jobs)
    • DONE OK: 10000 (100%)
    • Resubmitted: 3 (0.03%)

  • Errors found (3)
    • Cannot take token (3 times)

ice01.png

-- AlessioGianelle - 13 Oct 2009

Topic attachments (name, size, date, author, comment):
  • DoneFailed.jobid, 25.7 K, 2009-11-06 15:37, AlessioGianelle, DoneFailed_06
  • QE16.png, 18.7 K, 2010-03-15 12:59, AlessioGianelle, Query events test 16
  • Running06.jobid, 38.5 K, 2009-11-06 15:41, AlessioGianelle, Running_06
  • ice01.png, 6.1 K, 2009-10-15 08:34, AlessioGianelle, Ice graph. Test 01
  • ice02.png, 5.9 K, 2009-10-16 11:04, AlessioGianelle, Ice graph. Test 02
  • ice03.png, 5.7 K, 2009-10-19 08:28, AlessioGianelle, Ice graph. Test 03
  • ice04.png, 6.3 K, 2009-10-21 14:07, AlessioGianelle, Ice graph. Test 04
  • ice05.png, 8.0 K, 2009-10-23 15:22, AlessioGianelle, Ice graph. Test 05
  • ice06.png, 7.8 K, 2009-11-06 12:15, AlessioGianelle, Ice graph. Test 06
  • ice07.png, 5.5 K, 2009-11-16 15:50, AlessioGianelle, Ice graph. Test 07
  • ice08.png, 18.8 K, 2009-11-19 17:14, AlessioGianelle, Ice graph. Test 08
  • ice09.png, 8.1 K, 2010-02-02 11:19, AlessioGianelle, Ice graph. Test 09
  • ice10.png, 10.7 K, 2010-02-04 10:55, AlessioGianelle, Ice graph. Test 10
  • ice11.png, 5.4 K, 2010-02-05 11:47, AlessioGianelle, Ice graph. Test 11
  • ice12.png, 12.8 K, 2010-02-11 09:51, AlessioGianelle, Ice graph. Test 12
  • ice13.png, 5.4 K, 2010-02-15 11:05, AlessioGianelle, Ice graph. Test 13
  • ice14.png, 5.7 K, 2010-02-26 12:36, AlessioGianelle, Ice graph. Test 14
  • ice15.png, 6.1 K, 2010-03-08 16:08, AlessioGianelle, Ice graph. Test 15
  • ice16.png, 5.3 K, 2010-03-15 12:53, AlessioGianelle, Ice graph. Test 16
  • qe.png, 13.6 K, 2010-02-26 12:38, AlessioGianelle, Query events test 14
  • qe09.png, 8.7 K, 2010-02-01 13:13, AlessioGianelle, Query events test 09
  • qe15.png, 12.0 K, 2010-03-08 15:27, AlessioGianelle, Query events test 15