
Tests on ICE

10) Test starts on Tue Feb 3 at 14:57:25 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch")
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE
      • Fix a problem with proxy renewal seen in the previous test
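The planned load implied by the description above can be worked out with a short calculation. This is only a sketch: the helper name `planned_load` is hypothetical, and it assumes that every collection is accepted and submitted exactly on schedule.

```python
from datetime import timedelta

def planned_load(collections: int, jobs_per_collection: int, interval_s: int):
    """Return (total jobs, time needed to submit every collection)."""
    total_jobs = collections * jobs_per_collection
    duration = timedelta(seconds=collections * interval_s)
    return total_jobs, duration

# Figures from the description: 4320 collections of 40 jobs, one every 60 seconds.
jobs, duration = planned_load(4320, 40, 60)
print(jobs, duration)  # 172800 3 days, 0:00:00
```

Under these assumptions the test is sized for 172800 jobs submitted over three days.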

9) Test starts on Fri Jan 30 at 12:41:22 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service (MyProxyServer = "myproxy.cern.ch")
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45913 (only on cream-04.pd.infn.it)
      • Fix for bug #46283 (only on cream-04.pd.infn.it and cert-04.pd.infn.it)
    • ICE
      • Fix for bug #46405
      • 5 sec. (instead of 60) of delay between two LB logging tries
      • Error code is printed in the ICE log file when a log to LB fails

Test finishes on Mon Feb 2 at 12:39:09 CET 2009

  • 1427 collections submitted in 29752 seconds: 5/20/90 (min/avg/max)
    • 2893 submissions fail due to load limiter
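The submission summaries on this page can be cross-checked with a few lines of Python. The sketch below assumes that the "avg" figure is the elapsed time in seconds divided by the number of submitted collections, truncated to an integer, and that submitted plus load-limited collections should add up to the planned number. Both readings are inferences from the numbers, not something the page states. The figures are copied from tests 9, 5, 4 and 2.

```python
# (planned, submitted, rejected by load limiter, elapsed seconds, reported avg)
runs = {
    "test 9": (4320, 1427, 2893, 29752, 20),
    "test 5": (300, 224, 76, 7416, 33),
    "test 4": (300, 197, 103, 8499, 43),
    "test 2": (7200, 7180, 20, 70789, 9),
}

def check(planned, submitted, rejected, elapsed_s, reported_avg):
    """Return (all collections accounted for, avg matches elapsed/submitted)."""
    accounted = submitted + rejected == planned
    avg_matches = elapsed_s // submitted == reported_avg
    return accounted, avg_matches

results = {name: check(*values) for name, values in runs.items()}
for name, outcome in results.items():
    print(name, outcome)
```

With these figures every run passes both checks, which supports the assumed reading of the summary line.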

Partial results taken on Mon Feb 02 at 15:24:31 CET 2009

  • Collections correctly submitted: 1427 (57080 jobs)
    • DONE OK: 28244 (49.48%)
    • ABORTED: 0 (0%)
    • Not finished: 28836 (50.52%)
    • Resubmissions: 2012 (3.52%)
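The percentages in the result lists are each count divided by the total number of jobs in the correctly submitted collections. A minimal sketch of that arithmetic, using the test 9 figures (the function name `shares` is hypothetical):

```python
def shares(counts: dict, total: int) -> dict:
    """Express each count as a percentage of total, rounded to two decimals."""
    return {name: round(100 * value / total, 2) for name, value in counts.items()}

total_jobs = 1427 * 40  # 1427 collections of 40 jobs each
counts = {
    "DONE OK": 28244,
    "ABORTED": 0,
    "Not finished": 28836,
    "Resubmissions": 2012,
}
percentages = shares(counts, total_jobs)
print(total_jobs, percentages)
```

Resubmissions are counted separately from the final job states, so only the first three entries add up to the total.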

  • Errors found (2233):
    • BLAH error (19 times 0.85%)
      • submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15007)-) N/A (jobId = [...]) (18 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:pbs_iff: cannot read reply from pbs_server-No Permission.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15007)- exe_getouterr: poll() got an unknown event (stdout 0x0010 - stderr: 0x0000).-) N/A (jobId = [...]) (1 time)
    • Cannot move ISB (1872 times 83.84%)
    • Proxy is expired (325 times 14.55%)
    • lsf_reason (1 time 0.04%)
      • lsf_reason=36608 (1 time)
    • pbs_reason (16 times 0.72%)
      • pbs_reason=1; [...] proxy expired (15 times)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (1 time)

ice9.png

8) Test starts on Mon Jan 26 at 17:59:02 CET 2009 (WMS: devel14)

Description:
  • 4320 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE:
      • Fixed problem with proxy renewal seen in previous test

Test interrupted due to a problem in the proxy-renewal service daemon on Thu Jan 29 12:05:12 CET 2009

Results taken on Thu Jan 29 at 18:35:31 CET 2009

  • Collections correctly submitted: 1433 (57320 jobs)
    • DONE OK: 33566 (58.56%)
    • ABORTED: 8414 (14.68%)
    • Not finished: 15340 (26.76%)
    • Resubmissions: 33507 (58.46%)

  • Errors found (33957):
    • BLAH error (3 times 0.01%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (1 time)
      • submission command failed (exit code = 106) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (2 times)
    • Cannot move ISB (4513 times 13.29%)
    • Cannot move OSB (73 times 0.21%)
    • Transfer to CREAM failed (28607 times 84.24%)
      • due to exception: Authentication error: The proxy is EXPIRED! (28134 times)
      • due to exception: Authentication error: Unable to open the file [/var/glite/SandboxDir/[...]/user.proxy] : No such file or directory (432 times)
      • Failed to create a delegation id for job [...]: reason is Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (19 times)
      • Failed to create a delegation id for job [...]: reason is Failed proxy validation - it has expired. (4 times)
      • Failed to create a delegation id for job [...]: reason is CreamProxy_Delegate::execute() - Coundl't open proxyfile [...]: The proxy is EXPIRED! (1 time)
      • CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (12 times)
      • CREAM Register returned error "MethodName=[jobRegister] Timestamp=[Wed 28 Jan 2009 17:42:42] ErrorCode=[0] Description=[system error] FaultCause=[cannot create the job's working directory! The problem seems to be related to glexec]" (5 times)
    • Proxy is expired (688 times 2.03%)
    • lsf_reason (61 times 0.18%)
      • lsf_reason=65280 (32 times)
      • lsf_reason=36608 (22 times)
      • lsf_reason=1603 (1 time)
      • lsf_reason=256 (6 times)
    • pbs_reason (12 times 0.04%)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (12 times)
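A nested error list like the one above is easy to check for internal consistency: the sub-counts of each category should add up to the category total, and the category totals should add up to the number in the "Errors found" header. The sketch below does this for the test 8 figures. The data layout is an illustrative choice, not a format produced by ICE.

```python
# Counts copied from the test 8 error list: name -> (category total, sub-counts).
TOTAL_ERRORS = 33957
categories = {
    "BLAH error": (3, [1, 2]),
    "Cannot move ISB": (4513, []),
    "Cannot move OSB": (73, []),
    "Transfer to CREAM failed": (28607, [28134, 432, 19, 4, 1, 12, 5]),
    "Proxy is expired": (688, []),
    "lsf_reason": (61, [32, 22, 1, 6]),
    "pbs_reason": (12, [12]),
}

def inconsistent(cats: dict) -> list:
    """Return the categories whose sub-counts do not add up to the category total."""
    return [name for name, (total, subs) in cats.items() if subs and sum(subs) != total]

top_level_sum = sum(total for total, _ in categories.values())
print(top_level_sum == TOTAL_ERRORS, inconsistent(categories))  # True []
```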

ice8.png

BUGS:

  • CREAM
    • #46405: VOMSWrapper should try more than once to open a proxy file
  • BLAH
    • #46283: Possible memory leak in strtoken function for BLParser

7) Test starts on Fri Jan 23 at 12:28:01 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • ICE:
      • Fixed problem seen in previous test

Test interrupted on Mon Jan 26 17:05:12 CET 2009

Results taken on Mon Jan 26 at 18:20:31 CET 2009

  • Collections correctly submitted: 1741 (69640 jobs)
    • DONE OK: 51894 (74.52%)
    • ABORTED: 44 (0.06%)
    • Not finished: 16186 (23.24%)
    • CANCELLED: 1516 (2.18%)
    • Resubmissions: 18218 (26.16%)

  • Errors found (18218):
    • BLAH error (2 times 0.01%)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (1 time)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = [...]) (1 time)
    • Cannot take token (77 times 0.42%)
    • Cannot move ISB (14222 times 78.07%)
    • Cannot move OSB (82 times 0.45%)
    • Transfer to CREAM failed (4 times 0.02%)
      • due to exception: Authentication error: Unable to open the file [/var/glite/SandboxDir/[...]/user.proxy] : No such file or directory (4 times)
    • lsf_reason (30 times 0.16%)
      • lsf_reason=36608 (17 times)
      • lsf_reason=256 (13 times)
    • Proxy is expired (3113 times 17.09%)
    • pbs_reason (688 times 3.78%)
      • pbs_reason=-1 (656 times)
      • pbs_reason=271; Proxy expired: job killed Terminated Master process killed (32 times)

ice7.png

6) Test starts on Thu Jan 22 at 17:17:38 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • Five users
  • max_ice_threads = 20
  • Used all the CEs of testbedB (except cert-06.cnaf)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 4242"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45718
      • Fix for bug #45983
      • Fix for bug #46024
    • ICE
      • Reuse the same delegation ID when CREAM reports that it no longer exists

Test aborted on Fri Jan 23 10:30:12 CET 2009

5) Test starts on Wed Jan 21 at 12:45:49 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • max_ice_threads = 40
  • Used the CEs of testbedB (only PD)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Wed Jan 21 at 17:44:58 CET 2009

  • 224 collections submitted in 7416 seconds: 64/33/6 (max/avg/min)
    • 76 submissions fail due to load limiter

  • Collections correctly submitted: 224 (17920 jobs)
    • DONE OK: 17850 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 70 (0.4%)
    • Resubmissions: 8 (0.04%)

  • Errors found (8):
    • Cannot move ISB (3 times)
    • lsf_reason=1603 (3 times)
    • BLAH error (2 times)

ice5.png

4) Test starts on Tue Jan 20 at 10:09:58 CET 2009 (WMS: devel14)

Description:
  • 300 collections each of 80 jobs
  • One collection every 60 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF) (cert-06 at cnaf is not considered)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test finishes on Tue Jan 20 at 15:13:14 CET 2009

  • 197 collections submitted in 8499 seconds: 84/43/9 (max/avg/min)
    • 103 submissions fail due to load limiter

  • Collections correctly submitted: 197 (15760 jobs)
    • DONE OK: 15695 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 65 (0.4%)
    • Resubmissions: 3 (0.02%)

  • Errors found (3):
    • Cannot move OSB (1 time)
    • Cannot move ISB (2 times)

ice4.png

3) Test starts on Mon Jan 19 at 15:22:51 CET 2009 (WMS: devel14)

Description:
  • 2880 collections each of 40 jobs
  • One collection every 30 seconds
  • One user
  • Used the lsf CEs of testbedB (PD+CNAF)
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used

Test has been modified on Mon Jan 19 at 17:03:41:

  • 1440 collections each of 80 jobs
  • One collection every 60 seconds

Test finishes on Tue Jan 20 at 01:53:01 CET 2009

  • Collections correctly submitted: 399 (24800 jobs)
    • DONE OK: 24702 (99.6%)
    • ABORTED: 0 (0.0%)
    • Not finished: 98 (0.4%)
    • Resubmissions: 1176 (4.74%)
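Because the collection size was changed from 40 to 80 jobs during this test, the 399 collections are a mix of the two sizes. Assuming every collection contained exactly 40 or exactly 80 jobs, the split can be recovered from the two totals. The function name `split_collections` is hypothetical.

```python
def split_collections(total_collections: int, total_jobs: int,
                      small: int = 40, large: int = 80):
    """Solve n_small + n_large = total_collections and
    small * n_small + large * n_large = total_jobs for non-negative integers."""
    extra = total_jobs - small * total_collections
    n_large, remainder = divmod(extra, large - small)
    if remainder or not 0 <= n_large <= total_collections:
        raise ValueError("totals are not consistent with the two collection sizes")
    return total_collections - n_large, n_large

n_small, n_large = split_collections(399, 24800)
print(n_small, n_large)  # 178 221
```

Under that assumption, 178 collections of 40 jobs were submitted before the change and 221 collections of 80 jobs after it.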

  • Errors found (1176):
    • Cannot take token (1 time 0.08%)
    • Cannot move ISB (39 times 3.32%)
    • Transfer to CREAM failed (50 times 4.25%)
      • Transfer to CREAM failed due to exception: CREAM Register raised std::exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client] (7 times)
      • Transfer to CREAM failed due to exception: Authentication error: The proxy is EXPIRED! (43 times)
    • lsf_reason=32512 (1085 times 92.27%)
    • lsf_reason=306 (1 time 0.08%)

ice3.png

2) Test starts on Tue Jan 13 at 15:38:11 CET 2009 (WMS: devel14)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-04.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled
  • Lease mechanism is not used
  • Changes in the software wrt previous test:
    • CEs:
      • Fix for bug #45437
      • Fix for bug #45736
    • ICE
      • Management of serialization error
      • Renewal done at 80 % of lifetime of proxy (or when there are only 20 minutes left)
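The renewal rule in the last item can be sketched as follows. This is an illustration, not the ICE implementation: it assumes that renewal is triggered by whichever condition is reached first, and the function name `renewal_delay` and its parameters are hypothetical.

```python
from datetime import timedelta

def renewal_delay(lifetime: timedelta,
                  percent: int = 80,
                  min_left: timedelta = timedelta(minutes=20)) -> timedelta:
    """Time after proxy creation at which a renewal would be attempted.

    Assumption: renew at `percent` of the lifetime, or when only `min_left`
    remains, whichever comes first.
    """
    at_percent = lifetime * percent / 100
    at_min_left = lifetime - min_left
    return min(at_percent, at_min_left)

print(renewal_delay(timedelta(hours=5)))  # 4:00:00
print(renewal_delay(timedelta(hours=1)))  # 0:40:00
```

For the 5-hour proxy used in these tests the 80% rule applies first, which agrees with the "renewed every 4 hours" figure in the description. The 20-minute rule would only take over for proxies shorter than 100 minutes.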

Test finishes on Sun Jan 18 at 15:42:28 CET 2009

  • 7180 collections submitted in 70789 seconds: 141/9/3 (max/avg/min)
    • 20 submissions fail due to load limiter

  • Collections correctly submitted: 7180 (287200 jobs)
    • DONE OK: 284838 (99.18%)
    • ABORTED: 0 (0.0%)
    • Not finished: 2362 (0.82%)
    • Resubmissions: 4599 (1.60%)

  • Errors found (4599):
    • blparser service is not alive (578 times 12.57%)
    • BLAH error (288 times 6.26%)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = [...]) (52 times)
      • send command timeout (2 times)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = [...]) (2 times)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = [...]) (219 times)
      • submission command failed (exit code = 1) (stdout:) (stderr:qsub: Invalid credential-) N/A (jobId = [...]) (13 times)
    • Cannot take token (201 times 4.37%)
    • Cannot move OSB (1 time 0.02%)
    • Cannot move ISB (5 times 0.11%)
    • Transfer to CREAM failed (19 times 0.41%)
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]" (10 times)
      • CREAM Register raised std::exception Connection to service [https://cert-xx.cnaf.infn.it:8443/ce-cream/services/CREAM2] failed: (9 times)
    • lsf_reason=32512 (3505 times 76.22%)
    • Proxy is expired (1 time 0.02%)
    • lsf_reason=306 (1 time 0.02%)

  • Jobs not finished:

    Scheduled  Running  Total  CE name
            0        6      6  cert-04.cnaf.infn.it
            6      349    355  cream-34.pd.infn.it
            0        3      3  cream-26.pd.infn.it
            0        2      2  cream-25.pd.infn.it
            5      332    337  cream-28.pd.infn.it
            0        1      1  cream-27.pd.infn.it
            0        2      2  cream-22.pd.infn.it
            0        5      5  cream-04.pd.infn.it
            0        1      1  cream-23.pd.infn.it
            0        2      2  cert-07.cnaf.infn.it
            1        0      1  cert-13.cnaf.infn.it
            6      334    340  cream-29.pd.infn.it
            9      327    336  cream-33.pd.infn.it
            0        5      5  cert-08.cnaf.infn.it
            0        2      2  cert-05.cnaf.infn.it
            6        0      6  cert-06.cnaf.infn.it
            6      307    313  cream-31.pd.infn.it
            4      296    300  cream-32.pd.infn.it
            8      337    345  cream-30.pd.infn.it
           51     2311   2362  Totals
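The per-CE figures above can be summarised with a short script that recomputes the total and measures how concentrated the unfinished jobs are. The threshold of 100 jobs used to pick out the heavily loaded CEs is an arbitrary choice made for illustration.

```python
# (scheduled, running, CE name), copied from the table of jobs not finished.
rows = [
    (0, 6, "cert-04.cnaf.infn.it"), (6, 349, "cream-34.pd.infn.it"),
    (0, 3, "cream-26.pd.infn.it"), (0, 2, "cream-25.pd.infn.it"),
    (5, 332, "cream-28.pd.infn.it"), (0, 1, "cream-27.pd.infn.it"),
    (0, 2, "cream-22.pd.infn.it"), (0, 5, "cream-04.pd.infn.it"),
    (0, 1, "cream-23.pd.infn.it"), (0, 2, "cert-07.cnaf.infn.it"),
    (1, 0, "cert-13.cnaf.infn.it"), (6, 334, "cream-29.pd.infn.it"),
    (9, 327, "cream-33.pd.infn.it"), (0, 5, "cert-08.cnaf.infn.it"),
    (0, 2, "cert-05.cnaf.infn.it"), (6, 0, "cert-06.cnaf.infn.it"),
    (6, 307, "cream-31.pd.infn.it"), (4, 296, "cream-32.pd.infn.it"),
    (8, 337, "cream-30.pd.infn.it"),
]

total = sum(scheduled + running for scheduled, running, _ in rows)
# CEs holding at least 100 unfinished jobs (illustrative threshold).
busy = {ce: s + r for s, r, ce in rows if s + r >= 100}
share = round(100 * sum(busy.values()) / total, 2)
print(total, sorted(busy), share)
```

With these figures, seven CEs (cream-28 to cream-34 at pd.infn.it) hold about 98% of the jobs that had not finished.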

BUGS:

  • CREAM
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45983: BLAH can leave children processes behind.

1) Test starts on Wed Jan 7 at 16:01:32 CET 2009 (WMS: devel18)

Description:
  • 7200 collections each of 40 jobs
  • One collection every 60 seconds
  • One user
  • Used the CEs of testbedB (PD+CNAF) plus cream-12.pd.infn.it
  • Used automatic-delegation and proxy renewal service
  • Proxy has 5 hours of lifetime (and it is renewed every 4 hours)
  • The job is a "sleep 313"
  • Resubmission is enabled

Test stopped on Mon Jan 12 due to a serialization error in ICE

Results taken on Mon Jan 12 at 12:52:56 CET 2009

  • Collections correctly submitted: 3733 (149320 jobs)
    • DONE OK: 144004 (96.44%)
    • ABORTED: 446 (0.3%)
    • Not finished: 4870 (3.26%)

  • Errors found:
    • Transfer to CREAM failed due to exception:
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Rollback executed due to: Deadlock found when trying to get lock; try restarting transaction]"
      • Authentication error: Unable to open the file [...]: No such file or directory
      • Connection to service [...] failed:
      • FaultCause=[User [...] not authorized for operation JobRegister]
      • FaultCause=[The problem seems to be related to glexec which reported: java.io.IOException: Too many open files]"
      • FaultCause=[org.glite.ce.common.db.DatabaseException: Server connection failure during transaction. Due to underlying exception: 'java.net.SocketException: Too many open files'.
      • FaultCause=[java.net.UnknownHostException: cream-31.pd.infn.it: cream-31.pd.infn.it]"
      • CREAM Start raised exception Received NULL fault; the error is due to another cause: FaultString=[Client fault] - FaultCode=[SOAP-ENV:Client] - FaultSubCode=[SOAP-ENV:Client]
      • Failed to get lease_id for job [...] Exception is Lease renew operation FAILED for lease ID [...] Exception is Connection to service [https://cream-29.pd.infn.it:8443/ce-cream/services/CREAM2] failed:
      • CREAM Start failed due to error MethodName=[JOB_START] Timestamp=[Wed 07 Jan 2009 22:10:43] ErrorCode=[2] Description=[the job has a status not compatible with the JOB_START command!] FaultCause=[N/A]
    • BLAH error:
      • submission command failed (exit code = -15) (stdout:) (stderr:/opt/glite/etc/blah.config: line 54: syntax error near unexpected token `('-/opt/glite/etc/blah.config: line 54: `//Added for test by Enrico Fattibene (07/01/2009)'--killed by signal 15-) N/A (jobId = CREAM251333253)
      • submission command failed (exit code = 120) (stdout:) (stderr:glexec policy violation: see glexec log for more details-) N/A (jobId = CREAM550710004)
      • submission command failed (exit code = 1) (stdout:) (stderr:Cannot resolve default server host 'cream-28.pd.infn.it' - check server_name file.-qsub: cannot connect to server cream-28.pd.infn.it (errno=15008)-) N/A (jobId = CREAM027575485)
      • submission command failed (exit code = -15) (stdout:) (stderr:-killed by signal 15-) N/A (jobId = CREAM752590056)
      • no jobId in submission script's output (stdout:) (stderr:) N/A (jobId = CREAM988027857)
    • DELEGATION_PROXY_CERT_SANDBOX_PATH not defined!
    • Cannot move ISB [...] The proxy credential [...] expired 0 minutes ago.
    • Proxy is expired; Proxy expired: job killed Terminated Master process killed
    • lsf_reason=32512
    • Lease expired
    • The job cannot be submitted because the blparser service is not alive

BUGS:

  • CREAM
    • #45914: glexec and proxy rotation
    • #45913: Proxy renewal not done for CREAM jobs not yet in IDLE status
    • #45736: Problems in case of resubmissions in the same CREAM CE
    • #45437: Sometimes the jobPurger throws the exception "Too many open files"
  • BLAH
    • #45718: Some check on log lines should be added on BLParser code
    • #45717: BLParserPBS should consider log lines like "unable to run job"

-- AlessioGianelle - 08 Jan 2009

Topic attachments
    Attachment  Size   Date                Who              Comment
    ice3.png    4.5 K  2009-01-20 - 16:16  AlessioGianelle  Test 3 ICE submission rate
    ice4.png    6.4 K  2009-01-21 - 10:15  AlessioGianelle  Test 4 ICE submission rate
    ice5.png    6.0 K  2009-01-22 - 08:41  AlessioGianelle  Test 5 ICE submission rate
    ice7.png    5.8 K  2009-01-26 - 17:08  AlessioGianelle  Test 7 ICE submission rate
    ice8.png    5.9 K  2009-01-28 - 11:06  AlessioGianelle  Test 8 ICE submission rate
    ice9.png    5.4 K  2009-02-02 - 11:33  AlessioGianelle  Test 9 ICE submission rate
Topic revision: r39 - 2009-02-03 - AlessioGianelle
 