
Tests on ICE

8) Test starts on Mon Jan 26 17:59:02 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
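
As a rough check of the scale of this configuration, the nominal workload can be derived from the figures above. This is a minimal sketch: the variable names are illustrative (they are not ICE or WMS parameters), and it assumes that every planned collection is accepted and submitted on schedule, which may not hold in practice.

```python
# Nominal workload for this test (names are illustrative, not ICE/WMS parameters).
collections = 4320          # planned collections
jobs_per_collection = 40
interval_s = 60             # one collection every 60 seconds

total_jobs = collections * jobs_per_collection
submission_window_h = collections * interval_s / 3600

print(f"{total_jobs} jobs over a {submission_window_h:.0f} h submission window")
```

Under these assumptions the plan amounts to 172800 jobs submitted over 72 hours.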

7) Test starts on Fri Jan 23 12:28:01 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used

Test interrupted on Mon Jan 26 17:05:12 CET 2009

Partial Results taken on Mon Jan 26 at 11:24:31 CET 2009

  • Collections correctly submitted: 1741 (69640 jobs)
    • DONE OK: 47459 (77.85%)
    • ABORTED: 44 (0.07%)
    • Not finished: 13457 (22.08%)
    • Resubmissions: 16139 (26.47%)
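
The percentages above can be reproduced from the raw counts. A minimal sketch, assuming (the report does not state this) that each percentage is taken over the sum of the three job states. That sum is 60960, which differs from the 69640 jobs quoted for the 1741 collections.

```python
# Job-state counts copied from the partial results above.
states = {"DONE OK": 47459, "ABORTED": 44, "Not finished": 13457}
resubmissions = 16139

accounted = sum(states.values())   # jobs covered by the three states
shares = {name: round(100 * n / accounted, 2) for name, n in states.items()}
resubmission_share = round(100 * resubmissions / accounted, 2)

print(accounted, shares, resubmission_share)
```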

  • Errors found (16139):
    • BLAH error (2 times 0.01%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (1 time)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = [...]) (1 time)
    • Cannot take token (71 times 0.44%)
    • Cannot move ISB (12481 times 77.33%)
    • Cannot move OSB (69 times 0.43%)
    • Transfer to CREAM failed (4 times 0.03%)
      • due to exception: Authentication error: Unable to open the file [/var/glite/SandboxDir/[...]/user.proxy] : No such file or directory (4 times)
    • lsf_reason (24 times 0.15%)
      • lsf_reason=36608 (13 times)
      • lsf_reason=256 (11 times)
    • Proxy is expired (2803 times 17.37%)
    • pbs_reason (685 times 4.24%)
      • pbs_reason=-1 (654 times)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (31 times)
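
To see which failure modes dominate, the top-level error counts can be ranked by their share of all resubmission errors. A minimal sketch using only the counts listed above (the dictionary layout is illustrative):

```python
# Top-level error counts from the list above.
errors = {
    "BLAH error": 2,
    "Cannot take token": 71,
    "Cannot move ISB": 12481,
    "Cannot move OSB": 69,
    "Transfer to CREAM failed": 4,
    "lsf_reason": 24,
    "Proxy is expired": 2803,
    "pbs_reason": 685,
}
total = sum(errors.values())

# Print categories from most to least frequent with their share of all errors.
for name, count in sorted(errors.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:26s} {count:6d} {100 * count / total:6.2f}%")
```

The counts add up to the 16139 errors reported, and "Cannot move ISB" together with "Proxy is expired" accounts for most of them.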

Figure ice7.png: Test 7 ICE submission rate

6) Test starts on Thu Jan 22 at 17:17:38 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used

Test aborted on Fri Jan 23 10:30:12 CET 2009

5) Test starts on Wed Jan 21 at 12:45:49 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • max_ice_threads = 40
  • Used the CEs of testbedB (only PD)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Wed Jan 21 at 17:44:58 CET 2009

  • 224 collections submitted in 7416 seconds: 64/33/6 (max/avg/min)
    • 76 submissions failed due to the load limiter
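
The submission figures above are consistent with the test plan: accepted plus rejected collections add up to the 300 planned. A minimal sketch of that check, assuming the max/avg/min triple is the time in seconds taken by a single collection submission (the report does not say so explicitly):

```python
planned = 300        # collections in the test description
submitted = 224
rejected = 76        # refused by the load limiter
elapsed_s = 7416     # total time reported for the accepted submissions

assert submitted + rejected == planned
avg_s = elapsed_s / submitted            # about 33 s, matching the reported average
rejected_share = 100 * rejected / planned

print(f"avg {avg_s:.0f} s per collection, {rejected_share:.1f}% rejected")
```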

  • Collections correctly submitted: 224 (17920 jobs)
    • DONE OK: 17850 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 70 (0.4%)
    • Resubmissions: 8 (0.04%)

  • Errors found (8):
    • Cannot move ISB (3 times)
    • lsf_reason=1603 (3 times)
    • BLAH error (2 times)

Figure ice5.png: Test 5 ICE submission rate

4) Test starts on Tue Jan 20 at 10:09:58 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF) (cert-06 at cnaf is not considered)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Tue Jan 20 at 15:13:14 CET 2009

  • 197 collections submitted in 8499 seconds: 84/43/9 (max/avg/min)
    • 103 submissions failed due to the load limiter

  • Collections correctly submitted: 197 (15760 jobs)
    • DONE OK: 15695 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 65 (0.4%)
    • Resubmissions: 3 (0.02%)

  • Errors found (3):
    • Cannot move OSB (1 time)
    • Cannot move ISB (2 times)

Figure ice4.png: Test 4 ICE submission rate

3) Test starts on Mon Jan 19 at 15:22:51 CET 2009 (WMS: devel14)

Description:
  • 2880 collections each of 40 jobs
  • One collection every 30 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test modified on Mon Jan 19 at 17:03:41:

  • 1440 collections each of 80 jobs
  • One collection every 60 seconds

Test finishes on Tue Jan 20 at 01:53:01 CET 2009

  • Collections correctly submitted: 399 (24800 jobs)
    • DONE OK: 24702 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 98 (0.4%)
    • Resubmissions: 1176 (4.74%)
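
Because the collection size changed from 40 to 80 jobs during the run, the 399 collections are a mix of both sizes. Assuming every collection submitted before the change had exactly 40 jobs and every one after it had exactly 80 (an assumption; the report gives only the totals), the split follows from the two totals:

```python
total_collections = 399
total_jobs = 24800
small, large = 40, 80    # jobs per collection before and after the change

# Solve: n_small + n_large = total_collections
#        small * n_small + large * n_large = total_jobs
n_large, remainder = divmod(total_jobs - small * total_collections, large - small)
n_small = total_collections - n_large

assert remainder == 0    # the totals admit an exact integer split
print(n_small, "collections of 40 jobs,", n_large, "collections of 80 jobs")
```

Under that assumption the run consists of 178 collections of 40 jobs and 221 collections of 80 jobs.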

  • Errors found (1176):
    • Cannot take token (1 time 0.08%)
    • Cannot move ISB (39 times 3.32%)
    • Transfer to CREAM failed (50 times 4.25%)
      • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (7 times)
      • Transfer to CREAM failed due to exception: Authentication error: The proxy is EXPIRED! (43 times)
    • lsf_reason=32512 (1085 times 92.27%)
    • lsf_reason=306 (1 time 0.08%)

Figure ice3.png: Test 3 ICE submission rate

2) Test starts on Tue Jan 13 at 15:38:11 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Sun Jan 18 at 15:42:28 CET 2009

  • 7180 collections submitted in 70789 seconds: 141/9/3 (max/avg/min)
    • 20 submissions failed due to the load limiter

  • Collections correctly submitted: 7180 (287200 jobs)
    • DONE OK: 284838 (99.18%)
    • ABORTED: 0 (0.0%)
    • Not finished: 2362 (0.82%)
    • Resubmissions: 4599 (1.60%)

  • Errors found (4599):
    • blparser service is not alive (578 times 12.57%)
    • BLAH error (288 times 6.26%)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = [...]) (52 times)
      • send command timeout (2 times)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (2 times)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = [...]) (219 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (13 times)
    • Cannot take token (201 times 4.37%)
    • Cannot move OSB (1 time 0.02%)
    • Cannot move ISB (5 times 0.11%)
    • Transfer to CREAM failed (19 times 0.41%)
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]" (10 times)
      • CREAM Register raised std::exception Connection to service [https://cert-xx.cnaf.infn.it:8443/ce-cream/services/CREAM2] failed: (9 times)
    • lsf_reason=32512 (3505 times 76.22%)
    • Proxy is expired (1 time 0.02%)
    • lsf_reason=306 (1 time 0.02%)

  • Jobs not finished:

Scheduled  Running  Total  CE name
        0        6      6  cert-04.cnaf.infn.it
        6      349    355  cream-34.pd.infn.it
        0        3      3  cream-26.pd.infn.it
        0        2      2  cream-25.pd.infn.it
        5      332    337  cream-28.pd.infn.it
        0        1      1  cream-27.pd.infn.it
        0        2      2  cream-22.pd.infn.it
        0        5      5  cream-04.pd.infn.it
        0        1      1  cream-23.pd.infn.it
        0        2      2  cert-07.cnaf.infn.it
        1        0      1  cert-13.cnaf.infn.it
        6      334    340  cream-29.pd.infn.it
        9      327    336  cream-33.pd.infn.it
        0        5      5  cert-08.cnaf.infn.it
        0        2      2  cert-05.cnaf.infn.it
        6        0      6  cert-06.cnaf.infn.it
        6      307    313  cream-31.pd.infn.it
        4      296    300  cream-32.pd.infn.it
        8      337    345  cream-30.pd.infn.it
       51     2311   2362  Totals
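
The totals row can be cross-checked against the per-CE rows, and the same data shows where the unfinished jobs are concentrated. A minimal sketch with the rows copied from the table above; the threshold of 100 jobs is an arbitrary choice for illustration:

```python
# Rows copied from the table above: (scheduled, running, CE name).
rows = [
    (0, 6, "cert-04.cnaf.infn.it"), (6, 349, "cream-34.pd.infn.it"),
    (0, 3, "cream-26.pd.infn.it"), (0, 2, "cream-25.pd.infn.it"),
    (5, 332, "cream-28.pd.infn.it"), (0, 1, "cream-27.pd.infn.it"),
    (0, 2, "cream-22.pd.infn.it"), (0, 5, "cream-04.pd.infn.it"),
    (0, 1, "cream-23.pd.infn.it"), (0, 2, "cert-07.cnaf.infn.it"),
    (1, 0, "cert-13.cnaf.infn.it"), (6, 334, "cream-29.pd.infn.it"),
    (9, 327, "cream-33.pd.infn.it"), (0, 5, "cert-08.cnaf.infn.it"),
    (0, 2, "cert-05.cnaf.infn.it"), (6, 0, "cert-06.cnaf.infn.it"),
    (6, 307, "cream-31.pd.infn.it"), (4, 296, "cream-32.pd.infn.it"),
    (8, 337, "cream-30.pd.infn.it"),
]

scheduled = sum(s for s, _, _ in rows)
running = sum(r for _, r, _ in rows)

# CEs holding more than 100 unfinished jobs (threshold chosen for illustration).
busy = {name: s + r for s, r, name in rows if s + r > 100}

print(scheduled, running, scheduled + running)
print(sorted(busy), sum(busy.values()))
```

The sums match the totals row (51 scheduled, 2311 running, 2362 in total), and seven Padova CEs (cream-28 to cream-34) hold 2326 of the 2362 unfinished jobs.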

BUGS:

  • CREAM
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45983: BLAH can leave children processes behind.

1) Test starts on Wed Jan 7 at 16:01:32 CET 2009 (WMS: devel18)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-12.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled

Test stopped on Mon Jan 12 because of a serialization error in ICE

Results taken on Mon Jan 12 at 12:52:56 CET 2009

  • Collections correctly submitted: 3733 (149320 jobs)
    • DONE OK: 144004 (96.44%)
    • ABORTED: 446 (0.3%)
    • Not finished: 4870 (3.26%)

  • Errors found:
    • Transfer to CREAM failed due to exception:
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Rollback executed due to: Deadlock found when trying to get lock; try restarting transaction]"
      • Authentication error: Unable to open the file [...]: No such file or directory
      • Connection to service [...] failed:
      • FaultCause=[User [...] not authorized for operation JobRegister]
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]"
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Server connection failure during transaction. Due to underlying exception: 'java.net.SocketException: Too many open files'.
      • FaultCause=[java.net.UnknownHostException: cream-31.pd.infn.it: cream-31.pd.infn.it]"
      • CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client]
      • Failed to get lease_id for job [...] Exception is Lease renew operation FAILED for lease ID [...] Exception is Connection to service [https://cream-29.pd.infn.it:8443/ce-cream/services/CREAM2] failed:
      • CREAM Start failed due to error MethodName=[JOB_START] Timestamp=[Wed 07 Jan 2009 22:10:43] ErrorCode=[2] Description=[the job has a status not compatible with the JOB_START command!] FaultCause=[N/A]
    • BLAH error:
      • submission command failed (exit code = -15) (stdout:) (stderr:/opt/glite/etc/blah.config: line 54: syntax error near unexpected token `('-/opt/glite/etc/blah.config: line 54: `//Added for test by Enrico Fattibene (07/01/2009)'--killed by signal 15-) N/A (jobId = CREAM251333253)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = CREAM550710004)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = CREAM027575485)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = CREAM752590056)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = CREAM988027857)
    • DELEGATION_PROXY_CERT_SANDBOX_PATH not defined!
    • Cannot move ISB [...] The proxy credential [...] expired 0 minutes ago.
    • Proxy is expired; Proxy expired: job killed Terminated Master process killed
    • lsf_reason=32512
    • Lease expired
    • The job cannot be submitted because the blparser service is not alive

BUGS:

  • CREAM
    • #45914: glexec and proxy rotation
    • #45913: Proxy renewal not done for CREAM jobs not yet in IDLE status
    • #45736: Problems in case of resubmissions in the same CREAM CE
    • #45437: Sometimes the jobPurger throws the exception "Too many open files"
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45717: BLParserPBS should consider log lines like "unable to run job"

-- AlessioGianelle - 08 Jan 2009

Topic attachments
  • ice3.png (4.5 K, 2009-01-20 16:16, AlessioGianelle): Test 3 ICE submission rate
  • ice4.png (6.4 K, 2009-01-21 10:15, AlessioGianelle): Test 4 ICE submission rate
  • ice5.png (6.0 K, 2009-01-22 08:41, AlessioGianelle): Test 5 ICE submission rate
  • ice7.png (5.8 K, 2009-01-26 17:08, AlessioGianelle): Test 7 ICE submission rate
Topic revision: r27 - 2009-01-26 - AlessioGianelle
 