Tags:
, view all tags

Certification report patch 3621

Author(s): Elisabetta Molinari & Alessio Gianelle

Outcome: in certification...

Clean installation

Upgrade from production

Test Report

List Match

  • without data: Yes / Done
  • with data: Yes / Done

Submission/GetOutput

  • Normal jobs through
    • ICE work: Yes / Done
    • JC work: Yes / Done

  • Dag jobs through:
    • JC work: Yes / Done

  • Collection jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
    • also job-output for collections works even though only the parent node is set to 'Cleared'

  • Parametric jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
      • tested with the following
         [
          JobType = "parametric";
          Executable = "/usr/bin/env";
          Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
          StdOutput = "echo_PARAM_.out";
          StdError = "echo_PARAM_.err";
          OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
          Parameters =  5;
                usertags = [ jdl = "parametric" ];
         ]

  • Bulk jobs sent both through ICE and JC and RetryCount = 0; :
    • Submit a bulk of 3 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 50 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 100 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 500 jobs -> success 99.9% Yes / Done both to ICE and JC
    • Submit a bulk of 1000 jobs -> success 99.9% Yes / Done both to ICE and JC

  • Perusal jobs through:
    • JC work: Yes / Done
    • ICE work: Yes / Done

  • MPICH jobs: No

Cancel

  • Normal jobs
    • ICE: Yes / Done
    • JC: Yes / Done
  • Dag: Yes / Done
    • Note that children nodes in status 'submitted' don't get cancelled
  • Collection
    • ICE: Yes / Done
    • JC: Yes / Done
  • Node of a collection: Yes / Done
    • Note: collections stay in status 'waiting' when all the nodes are Done (Success) except for one that is 'Cancelled'

Others

  • BrokerInfo
    • ICE creation Yes / Done
    • JC creation: Yes / Done

  • Resubmission
    • Shallow: Yes / Done
    • Deep: Yes / Done

  • Job Recovery
    • Tested with a few collections re-starting the wm while some node jobs are still in a 'submitted or 'waiting' status Yes / Done

  • Prologue and Epilogue jobs
    • ICE: Yes / Done
    • JC: Yes / Done



Check bugs

  • Bug #42288: Problem in forwarding cerequirements to a CREAM CE NOT TESTED

  • Bug #48910: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts FIXED
    • Stopped gLite services and deleted the jobdir under '/var/glite/workload_manager'
      [root@wms007 jobdir]# service gLite stop
      [...]
      [root@wms007 workload_manager]# pwd
      /var/glite/workload_manager
      [root@wms007 workload_manager]# ls
      ismdump.fl  jobdir
      [root@wms007 workload_manager]# rm -rf jobdir
      [root@wms007 workload_manager]# ls
      ismdump.fl 
      • re-started the LM service checking that the jobdir gets recreated
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
        Starting LogMonitor...                                     [  OK  ]
        [root@wms007 workload_manager]# ls
        ismdump.fl  jobdir
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
        Logmonitor running...
    • Stopped gLite services and deleted the jobdir under '/var/glite/jobcontrol'
      [root@wms007 jobcontrol]# pwd
      /var/glite/jobcontrol
      [root@wms007 jobcontrol]# rm -rf jobdir
      [root@wms007 jobcontrol]# ls
      condorio  submit
      • re-started the JC service checking that the jobdir gets recreated
        [root@wms007 jobcontrol]# /opt/glite/etc/init.d/glite-wms-jc start JobController
        Starting JobController daemon(s)
           Starting JobController...                          [  OK  ]
        [root@wms007 jobcontrol]# ls
        condorio  jobdir  lock  submit
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-jc status JobController
        JobController running in pid: 3625
    • Stopped gLite services and deleted the jobdir under '/var/glite/ice'
      [root@wms007 ice]# pwd
      /var/glite/ice
      [root@wms007 ice]# ls
      jobdir  persist_dir
      [root@wms007 ice]# rm -rf jobdir/
      [root@wms007 ice]# ls
      persist_dir
      • re-started the ICE service checking that the jobdir gets recreated
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice start
        starting ICE... ok
        [root@wms007 ice]# ls
        jobdir  persist_dir
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice status
        /opt/glite/bin/glite-wms-ice-safe (pid 22783) is running...
    • Stopped gLite services and deleted all the jobdirs
      [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
      ice/:
      persist_dir
      
      jobcontrol/:
      condorio  submit
      
      workload_manager/:
      ismdump.fl
      • re-started the WM service checking that all the jobdirs get recreated
        [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm start
        starting workload manager... ok
        [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
        ice/:
        jobdir  persist_dir
        
        jobcontrol/:
        condorio  jobdir  submit
        
        workload_manager/:
        ismdump.fl  jobdir
        [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm status
        /opt/glite/bin/glite-wms-workload_manager (pid 23259) is running...
    • Comment Input/InputType parameter in wms conf file (Sections: ICE, WorkloadManager and JobController).
      • Try to start JobController:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc start JobController
        Starting !JobController daemon(s)
         Please set Input parameter in glite_wms.conf - JC section [FAILED]
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc status JobController
        JobController stopped.
      • Try to start LogMonitor:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
        Starting LogMonitor...Please set Input parameter in glite_wms.conf - WM section
                                                                   [FAILED]
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
        LogMonitor stopped.
      • Try to start ICE:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice start
        starting ICE... failure
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice status
        /opt/glite/bin/glite-wms-ice-safe is not running
      • Try to start WorkloadManager:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm start
        starting workload manager... Please set Input parameter in  - WM section
        Please set DispatcherType parameter in  - WM section
        Please set Input parameter in  - JC section
        Please set InputType parameter in  - JC section
        Please set Input parameter in  - ICE section
        Please set InputType parameter in  - ICE section
        failure
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm status
        /opt/glite/bin/glite-wms-workload_manager is not running

  • Bug #52934: [ICE] Delegation in ICE doesn't refer to the myproxy server FIXED
    • GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg
      • Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cern.ch";
    • GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og
      • Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cnaf.infn.it";

  • Bug #53460: [ICE] Detection of job status changes for CREAM jobs should be improved FIXED
    • Using a new CE (1.6) looking in ice's log there is:
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Database  ID=[1261041182000]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Exec time ID=[3]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2].
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] EventID [685143] timestsamp [1269272804]
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182
    • Using an "old" CE instead the "poller" method is used:
      2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888
      2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() -  TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]]
      2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() -  TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
      2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...

  • Bug #55103: [ICE] ICE port 7010 not cleaned up properly FIXED
    • We try a stop/start/restart sequence
      [root@wms007 ~]# ps ax | grep ice
       1283 pts/2    S+     0:00 grep ice
      30985 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      30989 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      30990 ?        Sl     0:15 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1321 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1353 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1357 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1358 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1398 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice restart
      stopping ICE... ok
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1433 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1437 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1438 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1470 pts/2    S+     0:00 grep ice

  • Bug #55452: CMS production struck by waves of "Globus error 10: data transfer to the server failed" NOT TESTED

  • Bug #56636: [ICE] statistics counters for monitoring FIXED
    • Verify the command and its options:
      [root@wms007 persist_dir]#  queryStats -t "2010-04-08 00:00:00"
      JOB_REGISTERED=2
      JOB_IDLE=2
      JOB_RUNNING=2
      JOB_REALLY-RUNNING=2
      JOB_DONE-OK=2
      
      [root@wms007 persist_dir]#  queryStats -f "2010-04-08 00:00:01" -t "2010-04-09 11:00:00"
      JOB_REGISTERED=4
      JOB_IDLE=4
      JOB_RUNNING=4
      JOB_REALLY-RUNNING=4
      JOB_DONE-OK=1
      JOB_DONE-FAILED=3
      
      [root@wms007 persist_dir]#  queryStats -f "2010-04-09 11:00:01"
      JOB_REGISTERED=255
      JOB_IDLE=255
      JOB_RUNNING=193
      JOB_REALLY-RUNNING=204
      JOB_DONE-OK=191
      JOB_ABORTED=6
      
      [root@wms007 persist_dir]#  queryStats
      JOB_REGISTERED=261
      JOB_IDLE=261
      JOB_RUNNING=199
      JOB_REALLY-RUNNING=210
      JOB_DONE-OK=194
      JOB_DONE-FAILED=3
      JOB_ABORTED=6

  • Bug #57295: [ICE] queryDb tool may create empty DB as root FIXED
    • Verify:
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
      [root@wms007 ~]#  /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING 
      0 item(s) found
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db

  • Bug #57579: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process HOPEFULLY FIXED
    • Verify:
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice status
      /opt/glite/bin/glite-wms-ice-safe (pid 1433) is running...
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19866 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19899 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      19903 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      19904 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      19932 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19978 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      20009 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      20013 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      20014 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      20046 pts/2    S+     0:00 grep ice

  • Bug #57596: [ICE] non resubmission if job failed for proxy expiration FIXED
    • Verify:
      2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
      2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded.
      2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED!
      2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]

  • Bug #58099: WMS purger forces purge of jobs if LB cannot be reached FIXED
    • Stop the LBServer and then run the cron purger:
      07 Apr, 16:09:13 -E: [Error] query_job_status(purger.cpp:125): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg:: edg_wll_JobStat [111] Connection refused(edg_wll_gss_connect())
      [glite@wms007 ~]$ 
    • Verify that the SandBox dir has not been removed:
       [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
      total 16
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:34 input
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:46 output
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:34 peek
      lrwxrwxrwx  1 glite    glite  102 Apr  6 14:34 user.proxy -> /var/glite/SandboxDir/Uo/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fUow8XY0NGbyumU3PPGMSng/user.proxy
      
    • Restart LBServer and verify that now the SBD of the job is purged:
      [glite@wms007 ~]$  /opt/glite/sbin/glite-wms-purgeStorage.sh  -p /var/glite/SandboxDir/ye  -t 10000
      07 Apr, 16:18:07 -I: [Info] operator()(purger.cpp:449): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg: removed DONE job
      [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
      ls: /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/: No such file or directory

  • Bug #58387: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy FIXED
    • Verify:
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA
      Current Status:     Aborted 
      Logged Reason(s):
          - Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed
      Status Reason:      Input sandbox's proxy is missing. Cannot resubmit job
      Destination:        ce202.cern.ch:8443/cream-lsf-grid_dteam
      Submitted:          Tue Mar 23 09:49:42 2010 CET
      *************************************************************

  • Bug #58977: [ICE] Wrong database colum name in ICE SQL query NOT TESTED

  • Bug #59240: [ICE] abort reasons not always printed in its logfile NOT TESTED

  • Bug #59399: [ICE] doesn't correctly handle request in jobdir/old when it is restarted FIXED
    • Verify submitting a big collection to cream CEs, and then restarting ICE in the middle of the submit process:
      2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[168434952] Going to START CreamJobID [https://cream
      -32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
    • restarting ice...
      2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579]
      2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem]
      2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads
      2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads
      [...]
      2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() -  TID=[144321160] This request is a Submission...
      2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() -  TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9
      000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it...
      2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[144321160] Going to START CreamJobID [https://cream
      -32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
      2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C
      3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9
      000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]

  • Bug #59453: [ICE] polling needs to be improved NOT TESTED

  • Bug #60668: [ICE] does not respect LB server/proxy selection through the LBproxy attribute FIXED
    • Set LBProxy = false; in glite_wms.conf (section Common), restart ice and submit...
      mysql> select * from events where jobid="YFyqjw3FF-BO-0U5BxCOtA";
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | YFyqjw3FF-BO-0U5BxCOtA |     0 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 394848 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     1 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 548652 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     2 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 608084 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     3 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 657231 |     8 | 2010-03-24 12:04:39 |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      4 rows in set (0.00 sec)
    • * Set LBProxy = true; in glite_wms.conf (section Common), restart ice and submit...
      mysql> select * from events where jobid="SlKOGSnaW0oKO3TJqw9tbA";
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | SlKOGSnaW0oKO3TJqw9tbA |     0 |   17 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 342720 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     1 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 470416 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     2 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 526402 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     3 |    2 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 606511 |     8 | 2010-03-24 12:09:54 |
      | SlKOGSnaW0oKO3TJqw9tbA |     4 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 712100 |     8 | 2010-03-24 12:09:54 |
      | SlKOGSnaW0oKO3TJqw9tbA |     5 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | 3f82b966e8a77413044be1a9144a4af4 |  43631 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     6 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 167414 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     7 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 297333 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     8 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 369636 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     9 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 431565 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    10 |    5 | JobController   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 745052 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    11 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 846002 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    12 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:10:04 | bdd27610035bb0ec9287e2ecaa3da2eb | 869424 |     8 | 2010-03-24 12:10:04 |
      | SlKOGSnaW0oKO3TJqw9tbA |    13 |    8 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb |  94855 |     8 | 2010-03-24 12:11:39 |
      | SlKOGSnaW0oKO3TJqw9tbA |    14 |   25 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 181448 |     8 | 2010-03-24 12:11:39 |
      | SlKOGSnaW0oKO3TJqw9tbA |    15 |   10 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 250291 |     8 | 2010-03-24 12:11:39 |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      16 rows in set (0.00 sec)

  • Bug #61312: [ICE] Error in handling user dn in ICE's poller FIXED
    • Submit 5 jobs to an old CreamCE (Cream 1.5) setting MyProxyServer attribute:
      2010-03-24 13:40:38,128 ERROR - iceCommandEventQuery::execute() -  TID=[159321352] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-33.pd.infn.it</ns2:hostname>]]
      2010-03-24 13:40:38,128 WARN - iceCommandEventQuery::execute() -  TID=[159321352] Not present QueryEvent on CE [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
      2010-03-24 13:40:38,128 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-24 13:40:38,128 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...
      2010-03-24 13:40:38,129 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Finished collecting jobs to poll. [5] jobs are to poll
      [...]
  • And so:
    [ale@cream-15 UI]$ glite-wms-job-status -v 0 -i testo --noint
    
    *************************************************************
    BOOKKEEPING INFORMATION:
    
    Status info for the Job : https://devel17.cnaf.infn.it:9000/tt3GLYuIiHuwrmnl7fGVtA
    Current Status:     Done (Success)
    
    *************************************************************
    BOOKKEEPING INFORMATION:
    
    Status info for the Job : https://devel17.cnaf.infn.it:9000/lY9fdOgQk5RcaH99g23z5g
    Current Status:     Done (Success)
    
    *************************************************************
    BOOKKEEPING INFORMATION:
    
    Status info for the Job : https://devel17.cnaf.infn.it:9000/jta5KlBZEP-r2KbE0SB0Vw
    Current Status:     Done (Success)
    
    *************************************************************
    BOOKKEEPING INFORMATION:
    
    Status info for the Job : https://devel17.cnaf.infn.it:9000/TNqI_PbRyqgFAN3L52IpKQ
    Current Status:     Done (Success)
    
    *************************************************************
    BOOKKEEPING INFORMATION:
    
    Status info for the Job : https://devel17.cnaf.infn.it:9000/V7Pnv2yE47CdHKgQmRaIvQ
    Current Status:     Done (Success)
    *************************************************************

  • Bug #61405: [ICE] Missing proxy validity evaluation in ICE NOT TESTED

  • Bug #61413: [ICE] should not call EventQuery for a userDN if he/she doesn't have active jobs FIXED
    • Submit a job to a CreamCE and wait until it finished.
    • Submit another job to a different CreamCE, you should not see any query to the previous used CreamCE.

  • Bug #61748: [ICE] EventQuery/Polling must be done also to blacklisted CE FIXED
    • Submit some jobs to a CreamCEVerify
    • Trigger a socket timeout so that ICE blacklisted the CreamCE :
      2010-03-24 15:58:40,753 ERROR - CreamProxyMethod::execute() - Connection timed out to CREAM: "EOF detected during communicati
      on. Probably service closed connection or SOCKET TIMEOUT occurred." on try 3/3. Blacklisting endpoint and giving up.
      2010-03-24 15:58:40,753 DEBUG - CEBlackList::blacklist_endpoint() - Blacklisting CE https://cream-25.pd.infn.it:8443/ce-cream
      /services/gridsite-delegation until Wed Mar 24 16:08:40 2010
    • Verify that the QueryEvent commad is called in any case:
      2010-03-24 16:05:28,952 DEBUG - eventStatusPoller::body() - Adding EventQuery command for couple (/C=IT/O=INFN/OU=Personal Ce
      rtificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL, https://cream-25.pd.infn.it:8443/ce-cream/services/C
      REAM2) to the thread pool...
    • Instead a submission fails:
      2010-03-24 15:58:43,265 DEBUG - Delegation_manager::delegate() - Creating new delegation with delegation id [12694427232E2651
      16wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream
      -25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio 
      Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dte
      am/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] Expiring on [Thu Mar 25 12:54:02 2010]
      2010-03-24 15:58:43,265 DEBUG - CEBlackList::is_blacklisted() - CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsit
      e-delegation is blacklisted until Wed Mar 24 16:08:40 2010
      2010-03-24 15:58:43,265 ERROR - Delegation_manager::delegate() - FAILED Creation of a new delegation with delegation id [1269
      4427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [h
      ttps://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova
      /CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio G
      ianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] - ERROR is: [The endpoint is blacklisted]
      2010-03-24 15:58:43,265 ERROR - iceCommandSubmit::execute() -  TID=[159308760] Error during submission of jdl= Fatal Exceptio
      n is:Failed to create a delegation id for job https://devel17.cnaf.infn.it:9000/UoVsvjIj1CPluHb81xM_pQ: reason is The endpoin
      t is blacklisted

  • Bug #63989: [ICE] doesn't handle exception raised by jobDir::new_entries() FIXED
    • Change the permission of the new directory in jobdir:
      [root@wms007 jobdir]# chmod 111 new/
      [root@wms007 jobdir]# ls -l
      total 48
      d--x--x--x  2 glite glite 40960 Mar 24 16:13 new
      drwxr-xr-x  2 glite glite  4096 Mar 24 16:13 old
      drwxr-xr-x  2 glite glite  4096 Mar 24 16:13 tmp
    • Look in ICE's log:
      2010-03-25 09:45:39,545 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied
      2010-03-25 09:45:40,546 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied

  • Bug #64698: jobwrapper max osb limit should be considered only if the gridftp server is the wms FIXED only for LCG-CE
    • Set MaxOutputSandboxSize = 10000000; in section WorkloadManager of file glite_wms.conf
    • Submit a jdl with a file of more than 10Mb in the !OutputSandbox parameter and set also the corresponding OutputSandboxDestURI parameter
    • Check the output dir in the SE:
      [root@devel18 tmp]# ls -lh
      -rw-r--r--  1 dteam044 dteam  50M Apr  7 16:23 bigfile
      -rw-r--r--  1 dteam044 dteam  646 Apr  7 16:23 ls.out
    • If you don't set OutputSandboxDestURI in the jdl, than the SandBox dir in the WMS should contain a .tail file of less than 10Mb:
      [root@wms007 persist_dir]#  ls -lh /var/glite/SandboxDir/A1/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fA1cdkhzepvrCjiU_5fTaKbpg/output/
      total 9.6M
      -rw-r--r--  1 dteam008 dteam 9.6M Apr  7 16:26 bigfile.tail
      -rw-r--r--  1 dteam008 dteam  637 Apr  7 16:26 ls.out

-- AlessioGianelle - 2010-02-05

Edit | Attach | PDF | History: r75 | r40 < r39 < r38 < r37 | Backlinks | Raw View | More topic actions...
Topic revision: r38 - 2010-04-09 - AlessioGianelle
 
  • Edit
  • Attach
This site is powered by the TWiki collaboration platformCopyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback