Certification report patch 3621

Author(s): Elisabetta Molinari & Alessio Gianelle

Outcome: in certification...

Clean installation

Upgrade from production

Test Report

List Match

  • without data: Yes / Done
  • with data: Yes / Done

Submission/GetOutput

  • Normal jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done

  • Dag jobs through:
    • JC work: Yes / Done OK

  • Collection jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
    • job-output also works for collections, even though only the parent node is set to 'Cleared'

  • Parametric jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
      • tested with the following JDL:
         [
          JobType = "parametric";
          Executable = "/usr/bin/env";
          Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
          StdOutput = "echo_PARAM_.out";
          StdError = "echo_PARAM_.err";
          OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
          Parameters = 5;
          usertags = [ jdl = "parametric" ];
         ]
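
As a rough illustration of what the JDL above requests, the following Python sketch expands the _PARAM_ placeholder once per sub-job. It assumes the defaults ParameterStart = 0 and ParameterStep = 1, so that Parameters = 5 is read as the values 0 to 4; the helper name expand_parametric is hypothetical and not part of any gLite tool.

```python
def expand_parametric(template, parameters, start=0, step=1):
    """Return one string per sub-job with _PARAM_ replaced by its value.

    Assumption: an integer Parameters is an exclusive upper bound, with
    ParameterStart = 0 and ParameterStep = 1 as defaults.
    """
    return [template.replace("_PARAM_", str(value))
            for value in range(start, parameters, step)]


# Attribute values copied from the JDL used in the test.
std_outputs = expand_parametric("echo_PARAM_.out", parameters=5)
environment = expand_parametric("MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME", parameters=5)

print(std_outputs)
print(environment[0])
```

Under these assumptions each of the five nodes gets its own output file name and its own environment variable name.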

  • Bulk jobs sent through both ICE and JC, with RetryCount = 0:
    • Submit a bulk of 3 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 50 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 100 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 500 jobs -> success 99.9% Yes / Done both to ICE and JC
    • Submit a bulk of 1000 jobs -> success 99.9% Yes / Done both to ICE and JC

  • Perusal jobs through:
    • JC work: Yes / Done
    • ICE work: Yes / Done

  • MPICH jobs: No

Cancel

  • Normal jobs
    • ICE: Yes / Done
    • JC: Yes / Done
  • Dag: Yes / Done
    • Note that child nodes in status 'submitted' don't get cancelled
  • Collection
    • ICE: Yes / Done
    • JC: Yes / Done
  • Node of a collection: Yes / Done
    • Note: a collection stays in status 'waiting' when all its nodes are Done (Success) except for one that is 'Cancelled'

Others

  • BrokerInfo
    • ICE creation: Yes / Done
    • JC creation: Yes / Done

  • Resubmission
    • Shallow: Yes / Done
    • Deep: Yes / Done

  • Job Recovery
    • Tested with a few collections, restarting the WM while some node jobs were still in 'submitted' or 'waiting' status: Yes / Done

  • Prologue and Epilogue jobs
    • ICE: Yes / Done
    • JC: Yes / Done



Check bugs:

  • Bug #42288: Problem in forwarding cerequirements to a CREAM CE

  • Bug #48910: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts FIXED
    • stopped gLite services
    • deleted the jobdir under '/var/glite/workload_manager'
    • re-started the LM service checking that the jobdir gets recreated

  • Bug #52934: [ICE] Delegation in ICE doesn't refer to the myproxy server FIXED
    • GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg
      • Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cern.ch";
    • GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og
      • Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cnaf.infn.it";

  • Bug #53460: [ICE] Detection of job status changes for CREAM jobs should be improved FIXED
    • With a new CE (1.6), the ICE log shows:
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Database  ID=[1261041182000]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Exec time ID=[3]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2].
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] EventID [685143] timestsamp [1269272804]
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182
    • With an "old" CE, the "poller" method is used instead:
      2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888
      2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() -  TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]]
      2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() -  TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
      2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...
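
To check a longer ICE log for the same behaviour, a small Python sketch like the one below can list the CEs for which ICE fell back to the status poller. The pattern is taken from the WARN line in the excerpt above; the function name ces_using_poller is hypothetical, and the sketch assumes the message wording stays the same across ICE versions.

```python
import re

# Pattern taken from the WARN line shown in the log excerpt above.
FALLBACK_RE = re.compile(
    r"Not present QueryEvent on CE \[(?P<ce>[^\]]+)\]\. "
    r"Falling back to old-style StatusPoller"
)


def ces_using_poller(log_lines):
    """Return the set of CE URLs for which ICE fell back to the status poller."""
    found = set()
    for line in log_lines:
        match = FALLBACK_RE.search(line)
        if match:
            found.add(match.group("ce"))
    return found


sample = [
    "2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() -  TID=[150673032] "
    "Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. "
    "Falling back to old-style StatusPoller.",
    "2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Exec time ID=[3]",
]
print(ces_using_poller(sample))
```

A CE that never appears in the result is, under this assumption, being handled through QueryEvent.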

  • Bug #55103: [ICE] ICE port 7010 not cleaned up properly FIXED
    • We tried a stop/start/restart sequence:
      [root@wms007 ~]# ps ax | grep ice
       1283 pts/2    S+     0:00 grep ice
      30985 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      30989 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      30990 ?        Sl     0:15 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1321 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1353 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1357 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1358 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1398 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice restart
      stopping ICE... ok
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1433 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1437 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1438 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1470 pts/2    S+     0:00 grep ice
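
The transcript above shows that the processes go away after a stop; it does not show the port itself. As a complementary check, the following Python sketch tests whether a TCP port can be bound. It assumes that 7010 is the ICE listener port named in the bug title and that the script is run on the WMS host right after the stop; port_is_free is a hypothetical helper.

```python
import socket


def port_is_free(port, host=""):
    """Return True if a TCP socket can be bound to (host, port).

    An empty host means all local interfaces. SO_REUSEADDR is deliberately
    not set, so a socket still holding the port makes the bind fail.
    """
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        return False
    finally:
        probe.close()


if __name__ == "__main__":
    # Assumption: 7010 is the port referred to in Bug #55103.
    print("port 7010 free:", port_is_free(7010))
```

If this prints False after a stop, something is still holding the port and a following start would be expected to fail.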

  • Bug #55452: CMS production struck by waves of "Globus error 10: data transfer to the server failed"

  • Bug #56636: [ICE] statistics counters for monitoring

  • Bug #57295: [ICE] queryDb tool may create empty DB as root FIXED
    • Verify:
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
      [root@wms007 ~]#  /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING 
      0 item(s) found
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
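
The comparison of the two listings can also be done mechanically. The Python sketch below parses the two `ll` lines from the transcript and checks that owner, group and size are unchanged after running queryDb; parse_listing is a hypothetical helper and assumes the usual long-listing column order.

```python
def parse_listing(line):
    """Split one long-listing line into the fields relevant for this check."""
    fields = line.split()
    return {
        "mode": fields[0],
        "owner": fields[2],
        "group": fields[3],
        "size": int(fields[4]),
        "path": fields[-1],
    }


# Lines copied from the transcript: before and after running queryDb as root.
before = parse_listing(
    "-rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db")
after = parse_listing(
    "-rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db")

unchanged = before == after
owned_by_glite = after["owner"] == "glite" and after["group"] == "glite"
print(unchanged, owned_by_glite)
```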

  • Bug #57579: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process HOPEFULLY FIXED
    • Verify:
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice status
      /opt/glite/bin/glite-wms-ice-safe (pid 1433) is running...
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19866 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19899 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      19903 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      19904 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      19932 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19978 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      20009 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      20013 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      20014 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      20046 pts/2    S+     0:00 grep ice
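
Since this bug shows up only occasionally, the stop/start check is worth repeating many times. A minimal Python sketch for evaluating the `ps ax | grep ice` output automatically is given below; ice_processes is a hypothetical helper, and the sample text is copied from the transcript above.

```python
def ice_processes(ps_output):
    """Return (pid, command) pairs for ICE processes found in `ps ax` output."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)  # PID TTY STAT TIME COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        command = fields[4]
        # Skip the `grep ice` helper itself.
        if command.startswith("grep "):
            continue
        if "glite-wms-ice" in command:
            found.append((int(fields[0]), command))
    return found


after_stop = "19978 pts/2    S+     0:00 grep ice\n"
after_start = """\
20009 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
20013 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
20014 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
20046 pts/2    S+     0:00 grep ice
"""

print(ice_processes(after_stop))
print([pid for pid, _ in ice_processes(after_start)])
```

An empty list after a stop corresponds to the expected behaviour; any remaining entry would reproduce the bug.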

  • Bug #57596: [ICE] non resubmission if job failed for proxy expiration FIXED
    • Verify:
      2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
      2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded.
      2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED!
      2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
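
The expected sequence in the log is a refused resubmission followed by a Job Aborted event for the same job. The Python sketch below looks for jobs that have both lines; the matched strings are taken from the excerpt above, and the function name refused_and_aborted is hypothetical. It does not check the order of the two events.

```python
import re

JOBID_RE = re.compile(r'gridJobID="(?P<jobid>[^"]+)"')


def refused_and_aborted(log_lines):
    """Return grid job IDs with both a refused-resubmission and an Aborted line."""
    refused, aborted = set(), set()
    for line in log_lines:
        match = JOBID_RE.search(line)
        if not match:
            continue
        if "Will NOT resubmit job" in line:
            refused.add(match.group("jobid"))
        elif "Job Aborted Event" in line:
            aborted.add(match.group("jobid"))
    return refused & aborted


# Two lines copied from the log excerpt above.
sample_log = """\
2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED!
2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
""".splitlines()

print(refused_and_aborted(sample_log))
```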

  • Bug #58387: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy FIXED
    • Verify:
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA
      Current Status:     Aborted 
      Logged Reason(s):
          - Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed
      Status Reason:      Input sandbox's proxy is missing. Cannot resubmit job
      Destination:        ce202.cern.ch:8443/cream-lsf-grid_dteam
      Submitted:          Tue Mar 23 09:49:42 2010 CET
      *************************************************************

  • Bug #58977: [ICE] Wrong database column name in ICE SQL query

  • Bug #59240: [ICE] abort reasons not always printed in its logfile

  • Bug #59399: [ICE] doesn't correctly handle request in jobdir/old when it is restarted FIXED
    • Verify by submitting a big collection to CREAM CEs and then restarting ICE in the middle of the submission process:
      2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[168434952] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
    • restarting ice...
      2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579]
      2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem]
      2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads
      2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads
      [...]
      2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() -  TID=[144321160] This request is a Submission...
      2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() -  TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it...
      2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[144321160] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
      2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]

  • Bug #59453: [ICE] polling needs to be improved

  • Bug #60688: [ICE] does not respect LB server/proxy selection through the LBproxy attribute

  • Bug #61312: [ICE] Error in handling user dn in ICE's poller

  • Bug #61405: [ICE] Missing proxy validity evaluation in ICE

  • Bug #61413: [ICE] should not call EventQuery for a userDN if he/she doesn't have active jobs

  • Bug #61748: [ICE] EventQuery/Polling must be done also to blacklisted CE

  • Bug #63989: [ICE] doesn't handle exception raised by jobDir::new_entries()

-- AlessioGianelle - 2010-02-05

Topic revision: r29 - 2010-03-23 - AlessioGianelle
 