Certification report patch 3621

Author(s): Elisabetta Molinari & Alessio Gianelle

Outcome: Certified

Clean installation

  • copied registered repo for glite-wms_R_3_2_14_6 into '/etc/yum.repos.d':
    wget http://etics-repository.cern.ch/repository/pm/registered/repomd/id/5e0f9d1b-de35-48ec-ad71-3b32a55f2b46/slc4_ia32_gcc346/etics-registered-build-by-id.repo
  • launched 'yum install glite-WMS', yum install log is here
  • copied /opt/glite/yaim/examples/siteinfo/site-info.def into ~/siteinfo/site-info.def and /opt/glite/yaim/examples/siteinfo/services/glite-wms into ~/siteinfo/services/glite-wms
  • launched 'yum install lcg-CA'
  • copied host certificate and key into '/etc/grid-security'
  • launched yaim configuration '/opt/glite/yaim/bin/yaim -c -s site-info.def -n glite-WMS'
  • yaim configuration log file is here
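
The clean-installation steps above can be collected into one script. What follows is a minimal sketch, not the procedure the certifiers actually ran: the `run` helper and the `DRY_RUN` variable are hypothetical names introduced here, the `mkdir` step and the absolute path passed to yaim are assumptions, and by default the script only prints each command instead of executing it.

```shell
#!/bin/sh
# Sketch of the clean-installation sequence described above.
# DRY_RUN and run() are illustrative helpers, not part of gLite or yaim.
DRY_RUN="${DRY_RUN:-1}"

REPO_URL="http://etics-repository.cern.ch/repository/pm/registered/repomd/id/5e0f9d1b-de35-48ec-ad71-3b32a55f2b46/slc4_ia32_gcc346/etics-registered-build-by-id.repo"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_wms() {
    run wget -P /etc/yum.repos.d "$REPO_URL"
    run yum install glite-WMS
    # Assumption: the siteinfo directory does not exist yet.
    run mkdir -p "$HOME/siteinfo/services"
    run cp /opt/glite/yaim/examples/siteinfo/site-info.def "$HOME/siteinfo/site-info.def"
    run cp /opt/glite/yaim/examples/siteinfo/services/glite-wms "$HOME/siteinfo/services/glite-wms"
    run yum install lcg-CA
    # The host certificate and key must already be in /etc/grid-security.
    run /opt/glite/yaim/bin/yaim -c -s "$HOME/siteinfo/site-info.def" -n glite-WMS
}
```

Calling `install_wms` prints the seven commands; setting `DRY_RUN=0` beforehand would execute them as root.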

Upgrade from production

  • Started from a production WMS and upgraded it.

Test Report

The test report has been produced following the guidelines from here

List Match

List match without data

  • without data: Yes / Done
    • tried with the following
       cat myjob-toICE.jdl
      [
      Type = "Job";
      JobType = "normal";
      InputSandbox = { "file:///home/emolinari/test.sh"};
      VirtualOrganisation = "dteam";
      Executable="test.sh";
      Arguments="Hello ";
      Requirements = ( RegExp("/cream-",other.GlueCEUniqueID));
      Rank = 0;
      fuzzyrank = true;
      StdOutput="message.txt";
      StdError="err.log";
      OutputSandbox={"message.txt","err.log",".BrokerInfo"};
      usertags = [ jdl = "normal job to ICE" ];
      RetryCount = 0;
      ShallowRetryCount = 3;
      ]
       glite-wms-job-list-match --config glite_wms_devel20.conf -a myjob-toICE.jdl
      
      Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
      ==========================================================================
      
                           COMPUTING ELEMENT IDs LIST
       The following CE(s) matching your job requirements have been found:
      
              *CEId*
       - atlas-creamce-01.roma1.infn.it:8443/cream-lsf-atlasgcert
       - bocecream.bo.infn.it:8443/cream-pbs-cert
       - bocecream.bo.infn.it:8443/cream-pbs-certSL5
       - cccreamceli01.in2p3.fr:8443/cream-bqs-medium
       - cccreamceli01.in2p3.fr:8443/cream-bqs-short
       - ce01-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam
       - ce07-lcg.cr.cnaf.infn.it:8443/cream-lsf-dteam
       - ce201.cern.ch:8443/cream-lsf-grid_2nh_dteam
       - ce201.cern.ch:8443/cream-lsf-grid_dteam
       - ce202.cern.ch:8443/cream-lsf-grid_2nh_dteam
       - ce202.cern.ch:8443/cream-lsf-grid_dteam
       - cert-15.pd.infn.it:8443/cream-lsf-cert
       - cream-38.pd.infn.it:8443/cream-pbs-creamtest1
       - cream-38.pd.infn.it:8443/cream-pbs-creamtest2
       - cream-ce.ct.infn.it:8443/cream-lsf-cert
       - cream-ce.pr.infn.it:8443/cream-pbs-cert
       - cream-ce.research-infrastructures.eu:8443/cream-pbs-cert
       - devce.cnaf.infn.it:8443/cream-pbs-cert
       - gridce0.pi.infn.it:8443/cream-lsf-cert
       - prod-ce-01.pd.infn.it:8443/cream-lsf-cert
       - t2-ce-01.to.infn.it:8443/cream-pbs-cert
       - t2-ce-01.to.infn.it:8443/cream-pbs-short
       - t2-ce-05.lnl.infn.it:8443/cream-lsf-cert1
      
      ==========================================================================
      
    • tried substituting the Requirement with
      Requirements = ( !RegExp("/cream-",other.GlueCEUniqueID));
      glite-wms-job-list-match --config glite_wms_devel20.conf -a myjob-toLcg.jdl
      
      Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
      ==========================================================================
      
                           COMPUTING ELEMENT IDs LIST
       The following CE(s) matching your job requirements have been found:
      
              *CEId*
       - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
       - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
       - atlas-ce-02.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
       - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
       - boalice3.bo.infn.it:2119/jobmanager-lcgpbs-cert
       - boalice3.bo.infn.it:2119/jobmanager-lcgpbs-certSL5
       - cclcgceli01.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli01.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli01.in2p3.fr:2119/jobmanager-bqs-short
       - cclcgceli02.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli02.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli02.in2p3.fr:2119/jobmanager-bqs-short
       - cclcgceli03.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli03.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli03.in2p3.fr:2119/jobmanager-bqs-short
       - cclcgceli04.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli04.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli04.in2p3.fr:2119/jobmanager-bqs-short
       - cclcgceli07.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli07.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli07.in2p3.fr:2119/jobmanager-bqs-short
       - cclcgceli08.in2p3.fr:2119/jobmanager-bqs-long
       - cclcgceli08.in2p3.fr:2119/jobmanager-bqs-medium
       - cclcgceli08.in2p3.fr:2119/jobmanager-bqs-short
       - ce-01.grid.sissa.it:2119/jobmanager-lcgpbs-cert
       - ce-01.roma3.infn.it:2119/jobmanager-lcgpbs-cert
       - ce01-lhcb-t2.cr.cnaf.infn.it:2119/jobmanager-lcglsf-cert_t2
       - ce02-lhcb-t2.cr.cnaf.infn.it:2119/jobmanager-lcglsf-cert_t2
       - ce04-lcg.cr.cnaf.infn.it:2119/jobmanager-lcglsf-dteam
       - ce05-lcg.cr.cnaf.infn.it:2119/jobmanager-lcglsf-dteam
       - ce06-lcg.cr.cnaf.infn.it:2119/jobmanager-lcglsf-dteam
       - ce103.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce103.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce104.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce104.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce105.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce105.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce106.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce106.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce107.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce107.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce112.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce112.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce113.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce113.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce114.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce114.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce124.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce124.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce125.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce125.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce126.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce126.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce127.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce127.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce128.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce128.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce129.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce129.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce130.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce130.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce131.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce131.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce132.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce132.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - ce133.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
       - ce133.cern.ch:2119/jobmanager-lcglsf-grid_dteam
       - cmsce01.na.infn.it:2119/jobmanager-lcgpbs-cert
       - grid-ce-01.ba.infn.it:2119/jobmanager-lcgpbs-cert
       - grid-ce.lns.infn.it:2119/jobmanager-lcgpbs-cert
       - grid-ce.lns.infn.it:2119/jobmanager-lcgpbs-infinite
       - grid-ce.lns.infn.it:2119/jobmanager-lcgpbs-long
       - grid-ce.lns.infn.it:2119/jobmanager-lcgpbs-short
       - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert
       - grid-eo-engine04.esrin.esa.int:2119/jobmanager-lcgpbs-cert
       - grid0.fe.infn.it:2119/jobmanager-lcgpbs-cert
       - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
       - grid002.ca.infn.it:2119/jobmanager-lcglsf-cert
       - grid01.ge.infn.it:2119/jobmanager-lcglsf-cert
       - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert
       - gridce.ilc.cnr.it:2119/jobmanager-lcgpbs-cert
       - gridce.pg.infn.it:2119/jobmanager-lcgpbs-cert
       - gridce.sns.it:2119/jobmanager-lcgpbs-cert
       - gridce1.pi.infn.it:2119/jobmanager-lcglsf-cert
       - gridce2.pi.infn.it:2119/jobmanager-lcglsf-cert
       - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
       - griditce01.na.infn.it:2119/jobmanager-lcgpbs-cert
       - lcg-ce.research-infrastructures.eu:2119/jobmanager-lcgpbs-cert
       - linucs-ce-01.cs.infn.it:2119/jobmanager-lcgpbs-atlasgcert
       - pamelace01.na.infn.it:2119/jobmanager-lcgpbs-cert
       - pbs-enmr.cerm.unifi.it:2119/jobmanager-lcgpbs-cert
       - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert
       - t2-ce-01.lnl.infn.it:2119/jobmanager-lcglsf-cert1
       - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert
       - t2-ce-02.lnl.infn.it:2119/jobmanager-lcglsf-cert1
       - t2-ce-02.mi.infn.it:2119/jobmanager-lcgcondor-cert
       - t2-ce-02.to.infn.it:2119/jobmanager-lcgpbs-cert
       - t2-ce-02.to.infn.it:2119/jobmanager-lcgpbs-short
       - t2-ce-03.lnl.infn.it:2119/jobmanager-lcglsf-cert1
       - t2-ce-04.lnl.infn.it:2119/jobmanager-lcglsf-cert1
       - t2-ce-06.lnl.infn.it:2119/jobmanager-lcglsf-cert1
       - test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
       - test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-parallel
       - virgo-ce.roma1.infn.it:2119/jobmanager-lcgpbs-cert
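
The two Requirements expressions above split the matched CEs by the prefix of the queue part of the CE ID: `/cream-` for CREAM CEs (port 8443) and `/jobmanager-` for LCG CEs (port 2119). A minimal sketch for tallying a saved list-match output by that prefix follows; `count_ce_types` is a hypothetical helper name, and the input is assumed to be the plain-text output shown above.

```shell
# Count matched CEs by type from glite-wms-job-list-match output (stdin).
# count_ce_types is an illustrative name, not a gLite command.
count_ce_types() {
    awk '
        # CE lines look like " - host:port/queue"
        $1 == "-" && $2 ~ /\/cream-/      { cream++ }
        $1 == "-" && $2 ~ /\/jobmanager-/ { lcg++ }
        END { printf "cream=%d lcg=%d\n", cream, lcg }
    '
}
```

It could be used as `glite-wms-job-list-match --config glite_wms_devel20.conf -a myjob-toICE.jdl | count_ce_types`.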
      
      

List match with data

  • with data: Yes / Done

Submission/GetOutput

Normal Jobs

  • Normal jobs through
    • ICE work: Yes / Done
      • glite-wms-job-submit --config glite_wms_devel20.conf -a myjob-toICE.jdl
        
        Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
        
        
        ====================== glite-wms-job-submit Success ======================
        
        The job has been successfully submitted to the WMProxy
        Your job identifier is:
        
        https://devel15.cnaf.infn.it:9000/Atu_PYr8SfD3C4VGCU_SjQ
        
        ==========================================================================
         glite-wms-job-status https://devel15.cnaf.infn.it:9000/Atu_PYr8SfD3C4VGCU_SjQ
        
        
        *************************************************************
        BOOKKEEPING INFORMATION:
        
        Status info for the Job : https://devel15.cnaf.infn.it:9000/Atu_PYr8SfD3C4VGCU_SjQ
        Current Status:     Done (Success)
        Exit code:          0
        Status Reason:      Job Terminated Successfully
        Destination:        ce202.cern.ch:8443/cream-lsf-grid_dteam
        Submitted:          Mon Apr 26 15:00:37 2010 CEST
        *************************************************************
        
         glite-wms-job-output https://devel15.cnaf.infn.it:9000/Atu_PYr8SfD3C4VGCU_SjQ
        
        Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
        
        
        ================================================================================
        
                                JOB GET OUTPUT OUTCOME
        
        Output sandbox files for the job:
        https://devel15.cnaf.infn.it:9000/Atu_PYr8SfD3C4VGCU_SjQ
        have been successfully retrieved and stored in the directory:
        /tmp/jobOutput/emolinari_Atu_PYr8SfD3C4VGCU_SjQ
        --------------------------
        ls /tmp/jobOutput/emolinari_Atu_PYr8SfD3C4VGCU_SjQ/
        err.log  message.txt
        
        
    • JC work: Yes / Done
      • glite-wms-job-submit --config glite_wms_devel20.conf -a myjob-toLcg.jdl
        
        Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
        
        
        ====================== glite-wms-job-submit Success ======================
        
        The job has been successfully submitted to the WMProxy
        Your job identifier is:
        
        https://devel15.cnaf.infn.it:9000/k0ue-prhJAvtAvjL2x_7Qg
        ----------------------------------------------------------------------
        glite-wms-job-status https://devel15.cnaf.infn.it:9000/k0ue-prhJAvtAvjL2x_7Qg
        
        
        *************************************************************
        BOOKKEEPING INFORMATION:
        
        Status info for the Job : https://devel15.cnaf.infn.it:9000/k0ue-prhJAvtAvjL2x_7Qg
        Current Status:     Done (Success)
        Logged Reason(s):
            -
            - Job terminated successfully
        Exit code:          0
        Status Reason:      Job terminated successfully
        Destination:        ce131.cern.ch:2119/jobmanager-lcglsf-grid_2nh_dteam
        Submitted:          Mon Apr 26 15:18:42 2010 CEST
        *************************************************************
        glite-wms-job-output https://devel15.cnaf.infn.it:9000/k0ue-prhJAvtAvjL2x_7Qg
        
        Connecting to the service https://devel20.cnaf.infn.it:7443/glite_wms_wmproxy_server
        
        
        ================================================================================
        
                                JOB GET OUTPUT OUTCOME
        
        Output sandbox files for the job:
        https://devel15.cnaf.infn.it:9000/k0ue-prhJAvtAvjL2x_7Qg
        have been successfully retrieved and stored in the directory:
        /tmp/jobOutput/emolinari_k0ue-prhJAvtAvjL2x_7Qg
        
        ================================================================================
         ls /tmp/jobOutput/emolinari_k0ue-prhJAvtAvjL2x_7Qg
        err.log  message.txt
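
When many jobs are checked, the status field can be extracted from the `glite-wms-job-status` output instead of being read by eye. The sketch below assumes the output layout shown above; `job_current_status` and `job_succeeded` are hypothetical helper names.

```shell
# Extract the "Current Status" field from glite-wms-job-status output (stdin).
# job_current_status is an illustrative helper name.
job_current_status() {
    sed -n -e 's/[[:space:]]*$//' \
           -e 's/^[[:space:]]*Current Status:[[:space:]]*//p' | head -n 1
}

# Succeed only if the job ended in "Done (Success)".
job_succeeded() {
    [ "$(job_current_status)" = "Done (Success)" ]
}
```

For example, `glite-wms-job-status <jobid> | job_succeeded && glite-wms-job-output <jobid>` would retrieve the output sandbox only for a successful job.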
        

DAG jobs

  • DAG jobs through:
    • JC work: Yes / Done

Collection jobs

  • Collection jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
    • job-output for collections also works, although only the parent node is set to 'Cleared'

Parametric jobs

  • Parametric jobs through:
    • ICE work: Yes / Done
    • JC work: Yes / Done
      • test report here
      • tested with the following
         [
          JobType = "parametric";
          Executable = "/usr/bin/env";
          Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
          StdOutput = "echo_PARAM_.out";
          StdError = "echo_PARAM_.err";
          OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
          Parameters =  5;
                usertags = [ jdl = "parametric" ];
         ]
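
To illustrate how the `_PARAM_` placeholder in the JDL above is expanded, the following sketch generates the per-node file names. It assumes that, with `Parameters = 5` and no `ParameterStart` or `ParameterStep` attributes, the parameter takes the values 0 to 4 (start 0, step 1); `expand_param` is a hypothetical helper, not a WMS tool.

```shell
# Expand a JDL template string for each value of _PARAM_.
# Assumption: ParameterStart=0 and ParameterStep=1 when not set in the JDL.
expand_param() {
    template="$1"; count="$2"; start="${3:-0}"; step="${4:-1}"
    i="$start"
    while [ "$i" -lt "$count" ]; do
        printf '%s\n' "$template" | sed "s/_PARAM_/$i/g"
        i=$((i + step))
    done
}
```

Under that assumption, `expand_param 'echo_PARAM_.out' 5` lists the standard output files expected from the five nodes.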

  • Bulk jobs sent through both ICE and JC, with RetryCount = 0:
    • Submit a bulk of 3 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 50 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 100 jobs -> success 100% Yes / Done both to ICE and JC
    • Submit a bulk of 500 jobs -> success 99.9% Yes / Done both to ICE and JC
    • Submit a bulk of 1000 jobs -> success 99.9% Yes / Done both to ICE and JC
      • bulk test report to JC here
      • bulk test report to ICE here

Perusal jobs

  • Perusal jobs through:
    • JC work: Yes / Done
    • ICE work: Yes / Done

  • MPICH jobs: Yes / Done

Cancel

  • Normal jobs
    • ICE: Yes / Done
    • JC: Yes / Done
  • DAG: Yes / Done
    • Note that child nodes in status 'submitted' are not cancelled
  • Collection
    • ICE: Yes / Done
    • JC: Yes / Done
  • Node of a collection: Yes / Done
    • Note: collections stay in status 'waiting' when all the nodes are Done (Success) except for one that is 'Cancelled'

Others

  • BrokerInfo
    • ICE creation: Yes / Done test report here
    • JC creation: Yes / Done test report here

  • Resubmission
    • Shallow: Yes / Done
    • Deep: Yes / Done

  • Job Recovery
    • Tested with a few collections, restarting the WM while some node jobs were still in 'submitted' or 'waiting' status: Yes / Done
    • test report here

  • Prologue and Epilogue jobs Yes / Done



Stress test

Description:

  • 1200 collections each of 20 jobs
  • One collection every 60 seconds
  • Four users
  • No requirements (so randomly submitted to all the production CEs)
  • Use automatic-delegation
  • The job is a "sleep random(172)"
  • Resubmission is enabled
  • Use 2 proxy renewal services (myproxy.cern.ch and myproxy.cnaf.infn.it) plus jobs without specification of the MyproxyServer (equally distributed)
  • Use a 3.2.0.12 LB Server (patch #4083)

Final results

  • 1172 collections submitted in 6935 seconds: 3/6/125 (min/avg/max)
    • 28 submissions failed (due to LB error)
  • Jobs correctly submitted: 23440 (90% to LCG CEs and 10% to CREAM CEs)
    • DONE OK: 23335 (99.52%)
    • ABORTED: 105 (0.48%)
    • Resubmitted: 3067 (13.08%)
  • Aborted reasons:
    • Job proxy is expired: 42 times (40%)
    • (LB query failed): 57 times (54.4%)
    • Removal retries exceeded: 4 times (3.8%)
    • hit job shallow retry count (3): 1 time (0.9%)
    • Submission to condor failed: 1 time (0.9%)
  • Failures (3377)
    • 7 an authentication operation failed (~63%)
    • File not available.Cannot read JobWrapper output, both from Condor and from Maradona. (~18%)
    • Transfer to CREAM failed due to exception (~12%)
    • Got a job held event (~5%)
    • ... others (~2%)
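
The headline figures are linked by simple arithmetic: 1200 planned collections minus 28 failed submissions gives the 1172 collections submitted, and 20 jobs per collection gives the 23440 jobs. A small sketch that re-derives them, using only numbers from the report (`pct` is a hypothetical helper name):

```shell
# Re-derive the headline stress-test figures from the numbers in the report.
planned=1200        # collections planned
failed=28           # submissions that failed due to LB errors
jobs_per_coll=20    # jobs in each collection

submitted=$((planned - failed))          # collections actually submitted
jobs=$((submitted * jobs_per_coll))      # jobs correctly submitted

# Percentage with two decimals; pct is an illustrative helper name.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

echo "collections submitted: $submitted"
echo "jobs submitted:        $jobs"
echo "resubmitted:           $(pct 3067 "$jobs")%"
```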



Check bugs

  • Bug #42288: Problem in forwarding cerequirements to a CREAM CE FIXED
    • description of the problem --> "The parameters to be forwarded specified in the Requirements attribute of the .jdl classad are NOT considered and ICE does not send them to the CE, therefore the classad passed to BLAH does not contain them"
      • submitted the following .jdl via WMS:
         cat myjob_forwardReq.jdl
        [
        Type = "Job";
        JobType = "normal";
        InputSandbox = { "file:///home/emolinari/test.sh"};
        VirtualOrganisation = "dteam";
        Executable="test.sh";
        Arguments="Hello ";
        requirements = (other.GlueCEUniqueID == "cream-19.pd.infn.it:8443/cream-lsf-testbedB_1") && (other.GlueHostMainMemoryRAMSize >= 0) ;
        Rank = 0;
        myproxy = myproxy.cnaf.infn.it;
        fuzzyrank = true;
        StdOutput="message.txt";
        StdError="err.log";
        OutputSandbox={"message.txt","err.log",".BrokerInfo"};
        RetryCount = 0;
        ShallowRetryCount = 3;
        ]
      • checked in the ICE log file on the WMS, /var/log/glite/ice.log, that the CeRequirements field of the JDL is populated as follows
         CeRequirements = "true && ( true && ( true && ( other.GlueHostMainMemoryRAMSize >= 0 ) ) )"; 
      • checked on the CE that blah generates the correct classad with the requirements to be forwarded, as in the following:
         cat /tmp/subfile
        #!/bin/bash
        # LSF job wrapper generated by lsf_submit.sh
        # on Wed Apr 14 19:13:10 CEST 2010
        #
        # LSF directives:
        #BSUB -L /bin/bash
        #BSUB -J cre19_725998524
        #BSUB -q testbedB_1
        #BSUB -R "select[mem>=0]"
        ......
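
The last check, that the forwarded requirement reaches the batch system, can be scripted by extracting the `#BSUB -R` directive from the submit file generated by BLAH. This is a sketch under the assumption that the file has the layout shown above; `bsub_requirements` is a hypothetical helper name.

```shell
# Print the resource requirement strings passed to LSF via "#BSUB -R".
# Reads a BLAH-generated submit file on stdin; bsub_requirements is an
# illustrative helper name.
bsub_requirements() {
    sed -n 's/^#BSUB[[:space:]]\{1,\}-R[[:space:]]\{1,\}"\(.*\)"[[:space:]]*$/\1/p'
}
```

On the CE it could be run as `bsub_requirements < /tmp/subfile`; for the job above the expected line is the `select[mem>=0]` expression that corresponds to `other.GlueHostMainMemoryRAMSize >= 0`.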

  • Bug #48910: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts FIXED
    • Stopped gLite services and deleted the jobdir under '/var/glite/workload_manager'
      [root@wms007 jobdir]# service gLite stop
      [...]
      [root@wms007 workload_manager]# pwd
      /var/glite/workload_manager
      [root@wms007 workload_manager]# ls
      ismdump.fl  jobdir
      [root@wms007 workload_manager]# rm -rf jobdir
      [root@wms007 workload_manager]# ls
      ismdump.fl 
      • re-started the LM service checking that the jobdir gets recreated
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
        Starting LogMonitor...                                     [  OK  ]
        [root@wms007 workload_manager]# ls
        ismdump.fl  jobdir
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
        Logmonitor running...
    • Stopped gLite services and deleted the jobdir under '/var/glite/jobcontrol'
      [root@wms007 jobcontrol]# pwd
      /var/glite/jobcontrol
      [root@wms007 jobcontrol]# rm -rf jobdir
      [root@wms007 jobcontrol]# ls
      condorio  submit
      • re-started the JC service checking that the jobdir gets recreated
        [root@wms007 jobcontrol]# /opt/glite/etc/init.d/glite-wms-jc start JobController
        Starting JobController daemon(s)
           Starting JobController...                          [  OK  ]
        [root@wms007 jobcontrol]# ls
        condorio  jobdir  lock  submit
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-jc status JobController
        JobController running in pid: 3625
    • Stopped gLite services and deleted the jobdir under '/var/glite/ice'
      [root@wms007 ice]# pwd
      /var/glite/ice
      [root@wms007 ice]# ls
      jobdir  persist_dir
      [root@wms007 ice]# rm -rf jobdir/
      [root@wms007 ice]# ls
      persist_dir
      • re-started the ICE service checking that the jobdir gets recreated
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice start
        starting ICE... ok
        [root@wms007 ice]# ls
        jobdir  persist_dir
        [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice status
        /opt/glite/bin/glite-wms-ice-safe (pid 22783) is running...
    • Stopped gLite services and deleted all the jobdirs
      [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
      ice/:
      persist_dir
      
      jobcontrol/:
      condorio  submit
      
      workload_manager/:
      ismdump.fl
      • re-started the WM service checking that all the jobdirs get recreated
        [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm start
        starting workload manager... ok
        [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
        ice/:
        jobdir  persist_dir
        
        jobcontrol/:
        condorio  jobdir  submit
        
        workload_manager/:
        ismdump.fl  jobdir
        [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm status
        /opt/glite/bin/glite-wms-workload_manager (pid 23259) is running...
    • Commented out the Input/InputType parameters in the WMS configuration file (sections: ICE, WorkloadManager and JobController).
      • Try to start JobController:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc start JobController
        Starting JobController daemon(s)
         Please set Input parameter in glite_wms.conf - JC section [FAILED]
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc status JobController
        JobController stopped.
      • Try to start LogMonitor:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
        Starting LogMonitor...Please set Input parameter in glite_wms.conf - WM section
                                                                   [FAILED]
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
        LogMonitor stopped.
      • Try to start ICE:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice start
        starting ICE... failure
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice status
        /opt/glite/bin/glite-wms-ice-safe is not running
      • Try to start WorkloadManager:
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm start
        starting workload manager... Please set Input parameter in  - WM section
        Please set DispatcherType parameter in  - WM section
        Please set Input parameter in  - JC section
        Please set InputType parameter in  - JC section
        Please set Input parameter in  - ICE section
        Please set InputType parameter in  - ICE section
        failure
        [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm status
        /opt/glite/bin/glite-wms-workload_manager is not running
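
The jobdir checks for this bug were done by hand with `ls`. A minimal sketch of the same check as a function follows; `check_jobdirs` is a hypothetical helper, and the three service directories are the ones listed above under /var/glite.

```shell
# Report which jobdir directories exist under a gLite state directory.
# check_jobdirs is an illustrative helper; the base defaults to /var/glite.
check_jobdirs() {
    base="${1:-/var/glite}"
    missing=0
    for svc in workload_manager jobcontrol ice; do
        if [ -d "$base/$svc/jobdir" ]; then
            echo "OK      $base/$svc/jobdir"
        else
            echo "MISSING $base/$svc/jobdir"
            missing=$((missing + 1))
        fi
    done
    # Non-zero exit status when at least one jobdir is absent.
    [ "$missing" -eq 0 ]
}
```

Run after each service start, it returns a non-zero status if a jobdir was not recreated.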

  • Bug #52934: [ICE] Delegation in ICE doesn't refer to the myproxy server FIXED
    • GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg
      • Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cern.ch";
    • GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og
      • Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit]
      • Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
      • Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      • MyProxyServer = "myproxy.cnaf.infn.it";

  • Bug #52937: ICE uses the wrong DN to log to LB Hopefully FIXED
    • Submitted a job to a CREAM CE and then checked it with glite-wms-job-logging-info -v 2:
      [ale@cream-15 UI]$ glite-wms-job-logging-info -v 2 https://devel15.cnaf.infn.it:9000/RITl2CsEH_KRk2nEusZVKw | grep User
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
      - User                       =    /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
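
With this many events, it is easier to reduce the `User` fields to the set of distinct DNs. The sketch below strips trailing `/CN=proxy` components so that events logged with the user certificate and with a delegated proxy collapse onto the same DN. It assumes that a single remaining DN, the submitting user's, is the expected outcome; `distinct_user_dns` is a hypothetical helper name.

```shell
# Summarise the "User" fields of glite-wms-job-logging-info -v 2 output.
# Trailing /CN=proxy components are removed before de-duplicating.
# distinct_user_dns is an illustrative helper name.
distinct_user_dns() {
    sed -n 's/^[[:space:]]*-[[:space:]]*User[[:space:]]*=[[:space:]]*//p' |
        sed -e 's/[[:space:]]*$//' -e 's#\(/CN=proxy\)*$##' |
        sort -u
}
```

For example: `glite-wms-job-logging-info -v 2 <jobid> | distinct_user_dns`.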

  • Bug #53297: [ yaim-wms ] glite_wms.conf hardcoded parameters FIXED
    • tested by setting the parameter 'WMS_CONF_FILE_OVERWRITE' in the ~/siteinfo/services/glite-wms file
      • set the parameter 'WMS_CONF_FILE_OVERWRITE' to true: a backup copy of the glite_wms.conf file gets created in /opt/glite/etc/glite_wms.conf.bkp_20100608_101305 and the glite_wms.conf file gets overwritten
      • set the parameter 'WMS_CONF_FILE_OVERWRITE' to false: a new copy of the glite_wms.conf file gets created into /opt/glite/etc/glite_wms.conf.yaimnew_20100608_101633
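
After each yaim run, the copies it leaves next to glite_wms.conf can be listed to confirm which of the two behaviours occurred. A minimal sketch, assuming the `.bkp_<timestamp>` and `.yaimnew_<timestamp>` suffixes shown above; `list_conf_copies` is a hypothetical helper name.

```shell
# List yaim-generated copies of glite_wms.conf in a configuration directory.
# list_conf_copies is an illustrative helper; the directory defaults to
# /opt/glite/etc.
list_conf_copies() {
    dir="${1:-/opt/glite/etc}"
    for f in "$dir"/glite_wms.conf.bkp_* "$dir"/glite_wms.conf.yaimnew_*; do
        # Skip unmatched glob patterns, which the shell leaves unexpanded.
        [ -e "$f" ] && basename "$f"
    done
    return 0
}
```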

  • Bug #53460: [ICE] Detection of job status changes for CREAM jobs should be improved FIXED
    • With a new CE (1.6), the ICE log shows that the event query is used:
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Database  ID=[1261041182000]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() -  TID=[150673032] Exec time ID=[3]
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2].
      2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() -  TID=[150673032] EventID [685143] timestsamp [1269272804]
      2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182
    • With an "old" CE, the "poller" method is used instead:
      2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888
      2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() -  TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]]
      2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() -  TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
      2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...
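    • A minimal sketch, assuming the WARN message text shown above is stable, of how the CEs that fell back to the old-style StatusPoller could be listed from an ICE log. The function name is hypothetical and the location of the ICE log file is site-specific:

```shell
# Print each CE URL for which ICE logged a fallback to the old-style
# StatusPoller. Reads an ICE log on stdin. The message text is copied from
# the excerpt above and may differ in other ICE versions.
list_poller_fallback_ces() {
    sed -n 's/.*Not present QueryEvent on CE \[\([^]]*\)]\. Falling back to old-style StatusPoller\..*/\1/p' |
        LC_ALL=C sort -u
}
```

      It would be used as list_poller_fallback_ces < ice.log, where ice.log stands for the actual ICE log file.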

  • Bug #55103: [ICE] ICE port 7010 not cleaned up properly FIXED
    • Run a stop/start/restart sequence and check after each step which ICE processes are present:
      [root@wms007 ~]# ps ax | grep ice
       1283 pts/2    S+     0:00 grep ice
      30985 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      30989 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      30990 ?        Sl     0:15 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1321 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1353 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1357 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1358 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1398 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice restart
      stopping ICE... ok
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
       1433 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
       1437 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
       1438 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
       1470 pts/2    S+     0:00 grep ice
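    • The manual "ps ax | grep ice" checks above could be automated along these lines. This is only a sketch: the function name is hypothetical, and it counts lines mentioning glite-wms-ice rather than inspecting port 7010 directly:

```shell
# Count ICE-related processes in `ps ax` output read from stdin, ignoring
# any grep command that happens to appear in the listing.
count_ice_procs() {
    awk '/glite-wms-ice/ && !/grep/ { n++ } END { print n + 0 }'
}
```

      After a stop, ps ax | count_ice_procs is expected to print 0; after a start, the listings above show three processes (glite-wms-ice-safe, the sh -c wrapper and glite-wms-ice itself).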

  • Bug #55452: CMS production struck by waves of "Globus error 10: data transfer to the server failed" FIXED NOT CERTIFIED

  • Bug #56636: [ICE] statistics counters for monitoring FIXED
    • Verify the command and its options:
      [root@wms007 persist_dir]#  queryStats -t "2010-04-08 00:00:00"
      JOB_REGISTERED=2
      JOB_IDLE=2
      JOB_RUNNING=2
      JOB_REALLY-RUNNING=2
      JOB_DONE-OK=2
      
      [root@wms007 persist_dir]#  queryStats -f "2010-04-08 00:00:01" -t "2010-04-09 11:00:00"
      JOB_REGISTERED=4
      JOB_IDLE=4
      JOB_RUNNING=4
      JOB_REALLY-RUNNING=4
      JOB_DONE-OK=1
      JOB_DONE-FAILED=3
      
      [root@wms007 persist_dir]#  queryStats -f "2010-04-09 11:00:01"
      JOB_REGISTERED=255
      JOB_IDLE=255
      JOB_RUNNING=193
      JOB_REALLY-RUNNING=204
      JOB_DONE-OK=191
      JOB_ABORTED=6
      
      [root@wms007 persist_dir]#  queryStats
      JOB_REGISTERED=261
      JOB_IDLE=261
      JOB_RUNNING=199
      JOB_REALLY-RUNNING=210
      JOB_DONE-OK=194
      JOB_DONE-FAILED=3
      JOB_ABORTED=6
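    • The three time windows queried above are contiguous, so their counters should add up to the totals of the unbounded query; in the output above they do (for example 2 + 4 + 255 = 261 registered jobs). A small sketch of that cross-check, with a hypothetical function name:

```shell
# Sum KEY=VALUE counters from several queryStats outputs concatenated on
# stdin and print one KEY=TOTAL line per counter.
sum_counters() {
    awk -F= 'NF == 2 { total[$1] += $2 }
             END { for (k in total) print k "=" total[k] }' |
        LC_ALL=C sort
}
```

      Concatenating the outputs of the three windowed queries and piping them through sum_counters should reproduce the output of queryStats run without options.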

  • Bug #57295: [ICE] queryDb tool may create empty DB as root FIXED
    • Verify:
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
      [root@wms007 ~]#  /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING 
      0 item(s) found
      [root@wms007 ~]#  ll /var/glite/ice/persist_dir/ice.db 
      -rw-r--r--  1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
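    • The before/after comparison of the "ll" output could be scripted roughly as follows. The helper name is hypothetical; it reports numeric owner and group IDs rather than names:

```shell
# Print "uid:gid size" for a file, so that two snapshots taken before and
# after running a tool can be compared as plain strings.
file_signature() {
    ls -ln "$1" | awk '{ print $3 ":" $4 " " $5 }'
}

# Intended use (not executed here):
#   before=$(file_signature /var/glite/ice/persist_dir/ice.db)
#   /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING
#   after=$(file_signature /var/glite/ice/persist_dir/ice.db)
#   [ "$before" = "$after" ] && echo "ice.db ownership and size unchanged"
```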

  • Bug #57579: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process HOPEFULLY FIXED
    • Verify:
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice status
      /opt/glite/bin/glite-wms-ice-safe (pid 1433) is running...
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19866 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19899 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      19903 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      19904 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      19932 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop
      stopping ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      19978 pts/2    S+     0:00 grep ice
      [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start
      starting ICE... ok
      [root@wms007 ~]# ps ax | grep ice
      20009 ?        Ss     0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid
      20013 ?        S      0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1
      20014 ?        Sl     0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf
      20046 pts/2    S+     0:00 grep ice

  • Bug #57596: [ICE] non resubmission if job failed for proxy expiration FIXED
    • Verify:
      2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
      2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded.
      2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED!
      2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]

  • Bug #58099: WMS purger forces purge of jobs if LB cannot be reached FIXED
    • Stop the LBServer and then run the cron purger:
      07 Apr, 16:09:13 -E: [Error] query_job_status(purger.cpp:125): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg:: edg_wll_JobStat [111] Connection refused(edg_wll_gss_connect())
      [glite@wms007 ~]$ 
    • Verify that the SandBox dir has not been removed:
       [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
      total 16
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:34 input
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:46 output
      drwxrwx---  2 dteam008 glite 4096 Apr  6 14:34 peek
      lrwxrwxrwx  1 glite    glite  102 Apr  6 14:34 user.proxy -> /var/glite/SandboxDir/Uo/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fUow8XY0NGbyumU3PPGMSng/user.proxy
      
    • Restart the LBServer and verify that the job's SandboxDir (SBD) is now purged:
      [glite@wms007 ~]$  /opt/glite/sbin/glite-wms-purgeStorage.sh  -p /var/glite/SandboxDir/ye  -t 10000
      07 Apr, 16:18:07 -I: [Info] operator()(purger.cpp:449): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg: removed DONE job
      [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
      ls: /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/: No such file or directory
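    • The SandboxDir path checked above can be derived from the job ID. The sketch below infers the naming scheme purely from the listings in this report (":" becomes "_3a", "/" becomes "_2f", "_" becomes "_5f", and the first two characters of the unique part name the subdirectory); it is an assumption, not the WMS's documented algorithm, and other special characters are not handled:

```shell
# Map a job ID such as https://host:9000/<unique> to the SandboxDir path
# seen in the listings of this report. Encoding rules are inferred, see the
# note above. The body runs in a subshell so no variables leak out.
sandbox_dir_for_job() (
    jobid=$1
    unique=${jobid##*/}
    prefix=$(printf '%s' "$unique" | cut -c1-2)
    # Encode "_" first so the underscores introduced below are not re-encoded.
    encoded=$(printf '%s' "$jobid" | sed -e 's/_/_5f/g' -e 's/:/_3a/g' -e 's|/|_2f|g')
    printf '/var/glite/SandboxDir/%s/%s\n' "$prefix" "$encoded"
)
```

      For example, ls -ld "$(sandbox_dir_for_job https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg)" checks the directory used in this test.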

  • Bug #58387: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy FIXED
    • Verify:
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA
      Current Status:     Aborted 
      Logged Reason(s):
          - Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent():  LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR:  LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed
      Status Reason:      Input sandbox's proxy is missing. Cannot resubmit job
      Destination:        ce202.cern.ch:8443/cream-lsf-grid_dteam
      Submitted:          Tue Mar 23 09:49:42 2010 CET
      *************************************************************

  • Bug #58977: [ICE] Wrong database colum name in ICE SQL query FIXED
    • Submit some jobs to a CE (e.g. cream-25.pd.infn.it:8443/cream-lsf-testbedB_2):
      [root@wms007 20100409]# queryDb -v -C -G
      [https://cream-25.pd.infn.it:8443/CREAM881525184]  [https://devel17.cnaf.infn.it:9000/V2Lj0_-XWkrjRKMaf3f6ng]
      [https://cream-25.pd.infn.it:8443/CREAM425827870]  [https://devel17.cnaf.infn.it:9000/4TKQu1U_daCMMb2mRDR2cA]
      [https://cream-25.pd.infn.it:8443/CREAM543141647]  [https://devel17.cnaf.infn.it:9000/4wqozVNHEVUXWx5UzUaXqA]
      [https://cream-25.pd.infn.it:8443/CREAM769586568]  [https://devel17.cnaf.infn.it:9000/XLcaGE3kR3h8-oj8cMWE_A]
      [https://cream-25.pd.infn.it:8443/CREAM192029588]  [https://devel17.cnaf.infn.it:9000/PuOGoOxMf-pbfFu-wSkACw]
      [https://cream-25.pd.infn.it:8443/CREAM378177464]  [https://devel17.cnaf.infn.it:9000/T8ZKSu5zZPZY-Ee1gLRX5A]
      [https://cream-25.pd.infn.it:8443/CREAM299069473]  [https://devel17.cnaf.infn.it:9000/Xh1AMEor9hWOx4picngYkA]
      [https://cream-25.pd.infn.it:8443/CREAM012571708]  [https://devel17.cnaf.infn.it:9000/YjpCU6dfrLsDBs6wU_D3Hg]
      [https://cream-25.pd.infn.it:8443/CREAM561236418]  [https://devel17.cnaf.infn.it:9000/00Qc7RnutRORYVOd0ShIKg]
      [https://cream-25.pd.infn.it:8443/CREAM972351884]  [https://devel17.cnaf.infn.it:9000/ksz80OflJnDE_ynWHmKTwQ]
      [https://cream-25.pd.infn.it:8443/CREAM827240561]  [https://devel17.cnaf.infn.it:9000/WwfftKdV6_5lSgihPOUsaA]
      [https://cream-25.pd.infn.it:8443/CREAM573497695]  [https://devel17.cnaf.infn.it:9000/S5zbkyK72hv2LXwUD1vAFw]
      [https://cream-25.pd.infn.it:8443/CREAM735112819]  [https://devel17.cnaf.infn.it:9000/0J9nTy1tJxkcuRJ9oTZACw]
      [https://cream-25.pd.infn.it:8443/CREAM526570551]  [https://devel17.cnaf.infn.it:9000/Rcl0TypyUTXMwtLk86R3yA]
      [https://cream-25.pd.infn.it:8443/CREAM992848449]  [https://devel17.cnaf.infn.it:9000/xfH1fkIroQwNvlVBdn8N5A]
      [https://cream-25.pd.infn.it:8443/CREAM944698480]  [https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg]
      [https://cream-25.pd.infn.it:8443/CREAM729677007]  [https://devel17.cnaf.infn.it:9000/FIZA1Mjb4moUNel1N7UXvw]
      [https://cream-25.pd.infn.it:8443/CREAM589660323]  [https://devel17.cnaf.infn.it:9000/5DJlLG7M0v3C_-WMDKSdXQ]
      [https://cream-25.pd.infn.it:8443/CREAM994745139]  [https://devel17.cnaf.infn.it:9000/T_UdwnjC55dIVrPxJOVvmg]
      [https://cream-25.pd.infn.it:8443/CREAM228224655]  [https://devel17.cnaf.infn.it:9000/URw39mrv7jj-buJ3KDza8w]
      [https://cream-25.pd.infn.it:8443/CREAM397635733]  [https://devel17.cnaf.infn.it:9000/f3FOGwNoWpHyWkxO_87AIg]
      [https://cream-25.pd.infn.it:8443/CREAM510341828]  [https://devel17.cnaf.infn.it:9000/vEfH5j5_5R_7jFNrntEsog] 
      [https://cream-25.pd.infn.it:8443/CREAM788890890]  [https://devel17.cnaf.infn.it:9000/y0IVYbdR_UWbTmrXY5O8fA]
      
      ------------------------------------------------
      23 item(s) found
    • Also check the db_id registered in ICE's database:
      [root@wms007 20100409]#  sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
      1270820425000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2
    • Stop the cream CE. Drop its database. Create a new empty one. Restart the CE.
    • Check what happens in the ICE log file:
      2010-04-09 16:16:53,953 WARN - iceCommandEventQuery::execute() -  TID=[150150560] *** CREAM HAS PROBABLY BEEN SCRATCHED. GOING TO ERASE ALL JOBS RELATED TO OLD DB_ID [1270820425000] ***
    • Check whether any jobs remain in ICE's database:
      [root@wms007 persist_dir]# queryDb -v  -C -G
      
      ------------------------------------------------
      0 item(s) found
    • and if the db_id has been changed:
      [root@wms007 persist_dir]#  sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
      1270822483000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2
    • Look at the status of a job that has been removed:
          Status info for the Job : https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg
          Current Status:     Aborted
          Logged Reason(s):
              - job completed
          Status Reason:      CREAM'S database has been scratched and all its jobs have been lost
          Destination:        cream-25.pd.infn.it:8443/cream-lsf-testbedB_2
          Submitted:          Fri Apr  9 16:11:34 2010 CEST
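    • The db_id comparison can be reduced to a small helper that picks the db_id for one CE out of the "db_id|ceurl" rows printed by sqlite3. A sketch with a hypothetical function name:

```shell
# Print the db_id recorded for the given CE URL, reading sqlite3 output of
# the form "db_id|ceurl" on stdin.
dbid_for_ce() {
    awk -F'|' -v ce="$1" '$2 == ce { print $1 }'
}
```

      Taking one snapshot before the CE database is dropped and one after, the two values should differ (1270820425000 and 1270822483000 in the test above).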

  • Bug #59240: [ICE] abort reasons not always printed in its logfile FIXED NOT CERTIFIED

  • Bug #59339: [ICE] doesn't correctly handle request in jobdir/old when it is restarted FIXED
    • Verify by submitting a big collection to CREAM CEs and then restarting ICE in the middle of the submission process:
      2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[168434952] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
    • Restarting ICE...
      2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579]
      2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem]
      2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads
      2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads
      [...]
      2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() -  TID=[144321160] This request is a Submission...
      2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() -  TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it...
      2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() -  TID=[144321160] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
      2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
      2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]

  • Bug #59453: [ICE] polling needs to be improved FIXED NOT CERTIFIED

  • Bug #60668: [ICE] does not respect LB server/proxy selection through the LBproxy attribute FIXED
    • Set LBProxy = false; in glite_wms.conf (section Common), restart ice and submit...
      mysql> select * from events where jobid="YFyqjw3FF-BO-0U5BxCOtA";
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | YFyqjw3FF-BO-0U5BxCOtA |     0 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 394848 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     1 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 548652 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     2 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 608084 |     8 | 2010-03-24 12:04:39 |
      | YFyqjw3FF-BO-0U5BxCOtA |     3 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 657231 |     8 | 2010-03-24 12:04:39 |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      4 rows in set (0.00 sec)
    • Set LBProxy = true; in glite_wms.conf (section Common), restart ice and submit...
      mysql> select * from events where jobid="SlKOGSnaW0oKO3TJqw9tbA";
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      | SlKOGSnaW0oKO3TJqw9tbA |     0 |   17 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 342720 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     1 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 470416 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     2 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 526402 |     8 | 2010-03-24 12:09:53 |
      | SlKOGSnaW0oKO3TJqw9tbA |     3 |    2 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 606511 |     8 | 2010-03-24 12:09:54 |
      | SlKOGSnaW0oKO3TJqw9tbA |     4 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 712100 |     8 | 2010-03-24 12:09:54 |
      | SlKOGSnaW0oKO3TJqw9tbA |     5 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | 3f82b966e8a77413044be1a9144a4af4 |  43631 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     6 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 167414 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     7 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 297333 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     8 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 369636 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |     9 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 431565 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    10 |    5 | JobController   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 745052 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    11 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 846002 |     8 | 2010-03-24 12:09:55 |
      | SlKOGSnaW0oKO3TJqw9tbA |    12 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:10:04 | bdd27610035bb0ec9287e2ecaa3da2eb | 869424 |     8 | 2010-03-24 12:10:04 |
      | SlKOGSnaW0oKO3TJqw9tbA |    13 |    8 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb |  94855 |     8 | 2010-03-24 12:11:39 |
      | SlKOGSnaW0oKO3TJqw9tbA |    14 |   25 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 181448 |     8 | 2010-03-24 12:11:39 |
      | SlKOGSnaW0oKO3TJqw9tbA |    15 |   10 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 250291 |     8 | 2010-03-24 12:11:39 |
      +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
      16 rows in set (0.00 sec)
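    • To compare the two result sets at a glance, the events can be tallied by the component that logged them (the "prog" column). A sketch with a hypothetical function name; it only parses the mysql client's table output as pasted above:

```shell
# Tally events per logging component from mysql table output on stdin.
# Data rows are recognised by a numeric "event" column (third field when
# splitting on "|"); header and border lines are skipped.
count_events_by_prog() {
    awk -F'|' '$3 ~ /^ *[0-9]+ *$/ { gsub(/ /, "", $5); n[$5]++ }
               END { for (p in n) print p "=" n[p] }' |
        LC_ALL=C sort
}
```

      Applied to the tables above, the first query contains only WorkloadManager events, while the second also contains NetworkServer, JobController and LogMonitor events.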

  • Bug #61312: [ICE] Error in handling user dn in ICE's poller FIXED
    • Submit 5 jobs to an old CreamCE (Cream 1.5) setting MyProxyServer attribute:
      2010-03-24 13:40:38,128 ERROR - iceCommandEventQuery::execute() -  TID=[159321352] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-33.pd.infn.it</ns2:hostname>]]
      2010-03-24 13:40:38,128 WARN - iceCommandEventQuery::execute() -  TID=[159321352] Not present QueryEvent on CE [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
      2010-03-24 13:40:38,128 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]
      2010-03-24 13:40:38,128 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...
      2010-03-24 13:40:38,129 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Finished collecting jobs to poll. [5] jobs are to poll
      [...]
    • As a result, all five jobs complete successfully:
      [ale@cream-15 UI]$ glite-wms-job-status -v 0 -i testo --noint
      
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/tt3GLYuIiHuwrmnl7fGVtA
      Current Status:     Done (Success)
      
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/lY9fdOgQk5RcaH99g23z5g
      Current Status:     Done (Success)
      
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/jta5KlBZEP-r2KbE0SB0Vw
      Current Status:     Done (Success)
      
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/TNqI_PbRyqgFAN3L52IpKQ
      Current Status:     Done (Success)
      
      *************************************************************
      BOOKKEEPING INFORMATION:
      
      Status info for the Job : https://devel17.cnaf.infn.it:9000/V7Pnv2yE47CdHKgQmRaIvQ
      Current Status:     Done (Success)
      *************************************************************

  • Bug #61405: [ICE] Missing proxy validity evaluation in ICE FIXED
    • Submit this JDL with a 30-minute proxy that is NOT registered with the MyProxy server (myproxy.cnaf.infn.it):
       [
        executable = "/bin/sleep"; 
        arguments = "2000"; 
        MyProxyServer = "myproxy.cnaf.infn.it"; 
        requirements = ( other.GlueCEStateStatus == "testbedb" ); 
        DefaultRank =  -other.GlueCEStateEstimatedResponseTime; 
       ]
    • After a while, submit the same JDL with a fresh proxy and check in ICE's log whether the new proxy is used to refresh the delegation of the previous job:
      • First it should try to renew the proxy by contacting the MyProxy server:
        2010-04-14 11:47:40,622 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Contacting MyProxy server [myproxy.cnaf.infn.it] for user dn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] with proxy certificate [/var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy] to renew it...
        2010-04-14 11:47:40,622 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Executing command [export X509_USER_CERT=/var/glite/wms.proxy; export X509_USER_KEY=/var/glite/wms.proxy; /opt/glite/bin/glite-wms-ice-proxy-renew -s myproxy.cnaf.infn.it -p /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy -o /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy.renewed]...
        2010-04-14 11:47:40,783 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Command output is [/opt/glite/bin/glite-wms-ice-proxy-renew: glite_renewal_core_renew() failed: Error contacting MyProxy server for proxy /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy: ERROR from myproxy-server (myproxy.cnaf.infn.it):
        X509_verify_cert() failed: certificate has expired
        
        ]
      • Then it should use the proxy of the last arrived job to renew the delegation:
        2010-04-14 11:47:40,783 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Looking for the better proxy for DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server name [myproxy.cnaf.infn.it]...
        2010-04-14 11:47:40,783 INFO - iceCommandDelegationRenewal::renewAllDelegations() - Will Renew Delegation ID [12712381542E417936wms0072Ecnaf2Einfn2Eit] with BetterProxy [/var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy] that will expire on [Wed Apr 14 12:12:47 2010]
        2010-04-14 11:47:40,783 INFO - CreamProxy_DelegateRenew::execute() - Calling renewProxyReq to remote service [https://cream-39.pd.infn.it:8443/ce-cream/services/gridsite-delegation]

  • Bug #61413: [ICE] should not call EventQuery for a userDN if he/she doesn't have active jobs FIXED
    • Submit a job to a CreamCE and wait until it has finished.
    • Submit another job to a different CreamCE; you should not see any query to the previously used CreamCE.
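    • One way to check the second step is to count, in the portion of the ICE log written after the first job finished, the EventQuery commands scheduled for the first CE. This sketch assumes the "Adding EventQuery command for couple (DN, CE URL)" debug message quoted elsewhere in this report; the function name is hypothetical:

```shell
# Count "Adding EventQuery command" log lines that target the given CE URL.
# Reads ICE log lines on stdin; the message format is taken from a DEBUG
# line quoted in this report and may vary between ICE versions.
count_eventquery_for_ce() {
    awk -v ce="$1" '
        index($0, "Adding EventQuery command for couple") && index($0, ce ")") { n++ }
        END { print n + 0 }'
}
```

      The expected count for the previously used CreamCE is 0.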

  • Bug #61748: [ICE] EventQuery/Polling must be done also to blacklisted CE FIXED
    • Submit some jobs to a CreamCE.
    • Trigger a socket timeout so that ICE blacklists the CreamCE:
      2010-03-24 15:58:40,753 ERROR - CreamProxyMethod::execute() - Connection timed out to CREAM: "EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred." on try 3/3. Blacklisting endpoint and giving up.
      2010-03-24 15:58:40,753 DEBUG - CEBlackList::blacklist_endpoint() - Blacklisting CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation until Wed Mar 24 16:08:40 2010
    • Verify that the QueryEvent command is called in any case:
      2010-03-24 16:05:28,952 DEBUG - eventStatusPoller::body() - Adding EventQuery command for couple (/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL, https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2) to the thread pool...
    • A submission, by contrast, fails while the CE is blacklisted:
      2010-03-24 15:58:43,265 DEBUG - Delegation_manager::delegate() - Creating new delegation with delegation id [12694427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] Expiring on [Thu Mar 25 12:54:02 2010]
      2010-03-24 15:58:43,265 DEBUG - CEBlackList::is_blacklisted() - CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation is blacklisted until Wed Mar 24 16:08:40 2010
      2010-03-24 15:58:43,265 ERROR - Delegation_manager::delegate() - FAILED Creation of a new delegation with delegation id [12694427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] - ERROR is: [The endpoint is blacklisted]
      2010-03-24 15:58:43,265 ERROR - iceCommandSubmit::execute() -  TID=[159308760] Error during submission of jdl= Fatal Exception is:Failed to create a delegation id for job https://devel17.cnaf.infn.it:9000/UoVsvjIj1CPluHb81xM_pQ: reason is The endpoint is blacklisted
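    • To see which endpoints ICE has blacklisted and until when, the CEBlackList::blacklist_endpoint() debug lines can be extracted from the log. A sketch assuming the message format shown above, with each log entry on a single line; the function name is hypothetical:

```shell
# Print "endpoint|expiry" for every blacklisting event found in an ICE log
# read from stdin. Relies on the DEBUG message format quoted above.
list_blacklisted_endpoints() {
    sed -n 's/.*CEBlackList::blacklist_endpoint() - Blacklisting CE \([^ ]*\) until \(.*\)$/\1|\2/p'
}
```

      Any EventQuery line for the same CE with a timestamp before the printed expiry confirms that polling continues during the blacklist period.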

  • Bug #63989: [ICE] doesn't handle exception raised by jobDir::new_entries() FIXED
    • Change the permissions of the 'new' directory in jobdir:
      [root@wms007 jobdir]# chmod 111 new/
      [root@wms007 jobdir]# ls -l
      total 48
      d--x--x--x  2 glite glite 40960 Mar 24 16:13 new
      drwxr-xr-x  2 glite glite  4096 Mar 24 16:13 old
      drwxr-xr-x  2 glite glite  4096 Mar 24 16:13 tmp
    • Look in ICE's log:
      2010-03-25 09:45:39,545 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied
      2010-03-25 09:45:40,546 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied
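    • The log check above can be scripted. What follows is only a minimal sketch: the function name count_new_entries_errors is hypothetical (it is not part of ICE), and the log file path is an assumption that the caller must supply. It counts the lines in which Request_source_jobdir::get_requests() reports the error returned by jobDir::new_entries(); a non-zero count suggests the exception is being caught and logged.

```shell
#!/bin/sh
# Hypothetical helper (not part of ICE): count the log lines that report
# the error returned by jobDir::new_entries().
# $1 = path to the ICE log file (assumption: supplied by the caller).
count_new_entries_errors() {
    logfile=$1
    # -F: match a fixed string; -c: print the number of matching lines.
    # grep exits non-zero when the count is 0, so that status is masked.
    grep -F -c 'Error returned by method jobDir::new_entries()' "$logfile" || true
}
```

      Calling count_new_entries_errors with the ICE log file as its only argument prints the number of matching lines.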

  • Bug #64698: jobwrapper max osb limit should be considered only if the gridftp server is the wms FIXED only for LCG-CE
    • Set MaxOutputSandboxSize = 10000000; in section WorkloadManager of file glite_wms.conf
    • Submit a JDL whose OutputSandbox parameter includes a file larger than 10 MB, and also set the corresponding OutputSandboxDestURI parameter
    • Check the output dir in the SE:
      [root@devel18 tmp]# ls -lh
      -rw-r--r--  1 dteam044 dteam  50M Apr  7 16:23 bigfile
      -rw-r--r--  1 dteam044 dteam  646 Apr  7 16:23 ls.out
    • If you don't set OutputSandboxDestURI in the JDL, then the sandbox dir in the WMS should contain a .tail file of less than 10 MB:
      [root@wms007 persist_dir]#  ls -lh /var/glite/SandboxDir/A1/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fA1cdkhzepvrCjiU_5fTaKbpg/output/
      total 9.6M
      -rw-r--r--  1 dteam008 dteam 9.6M Apr  7 16:26 bigfile.tail
      -rw-r--r--  1 dteam008 dteam  637 Apr  7 16:26 ls.out
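    • The size check can also be done in bytes rather than by reading the 'ls -lh' output. This is a minimal sketch under stated assumptions: the function name sandbox_within_limit is hypothetical, and it assumes the limit applies to the sum of the sizes of all regular files in the output directory, which this report does not state explicitly.

```shell
#!/bin/sh
# Hypothetical helper: print the total size in bytes of the regular files
# under a directory and succeed only if it does not exceed the limit.
# $1 = directory to inspect, $2 = limit in bytes (e.g. 10000000).
# Assumption: the limit is compared against the sum of all file sizes.
sandbox_within_limit() {
    dir=$1
    limit=$2
    # Concatenate every regular file and count the bytes; the arithmetic
    # expansion strips any padding that wc may put around the number.
    total=$(( $(find "$dir" -type f -exec cat {} + | wc -c) ))
    echo "$total"
    [ "$total" -le "$limit" ]
}
```

      For the case above, the function would be called with the job's output directory in the WMS and 10000000 as arguments; a non-zero exit status means the directory is larger than the configured value.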

  • Bug #66721: Ineffective and never removed Job cancels FIXED
    • submitted a job with the option '--register-only' as in the following:
      glite-wms-job-submit --config ../glite_wms_devel14.conf --register-only -a ../myjob.jdl 
      
      Connecting to the service https://devel14.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
      
      ====================== glite-wms-job-submit Success ======================
      
      The job has been successfully registered to the WMProxy
      Your job identifier is:
      
      https://devel15.cnaf.infn.it:9000/mS8cXbg9szmcXsUETm8g2g
      
      ==========================================================================
      
      To complete the operation, the following file containing the InputSandbox of the job needs to be transferred:
      ==========================================================================================================
      ISB ZIP file : /tmp/ISBfiles_eCXKdK2egtqk7jZRlfFQpw_0.tar.gz
      Destination : gsiftp://devel14.cnaf.infn.it:2811/var/glite/SandboxDir/mS/https_3a_2f_2fdevel15.cnaf.infn.it_3a9000_2fmS8cXbg9szmcXsUETm8g2g/input/ISBfiles_eCXKdK2egtqk7jZRlfFQpw_0.tar.gz
      -----------------------------------------------------------------------------
      
      then start the job by issuing a submissiong with the option:
       --start https://devel15.cnaf.infn.it:9000/mS8cXbg9szmcXsUETm8g2g
    • cancel the previously submitted job as in the following:
      glite-wms-job-cancel https://devel15.cnaf.infn.it:9000/mS8cXbg9szmcXsUETm8g2g
      
      Are you sure you want to remove specified job(s) [y/n]y : y
      
      Connecting to the service https://devel14.cnaf.infn.it:7443/glite_wms_wmproxy_server
      
      
      ============================= glite-wms-job-cancel Success =============================
      
      The cancellation request has been successfully submitted for the following job(s):
      
      - https://devel15.cnaf.infn.it:9000/mS8cXbg9szmcXsUETm8g2g
      
      ========================================================================================
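    • The register-then-cancel sequence above can be chained without copying the job identifier by hand. This is a minimal sketch: the function name extract_job_id is hypothetical, and it assumes the identifier is the first output line that consists solely of an https URL, as in the glite-wms-job-submit output shown above.

```shell
#!/bin/sh
# Hypothetical helper: read glite-wms-job-submit output on stdin and print
# the first job identifier found.
# Assumption: the identifier is the first line made up only of an https
# URL (surrounding whitespace ignored), so lines such as
# "Connecting to the service https://..." are skipped.
extract_job_id() {
    sed -n 's|^[[:space:]]*\(https://[^[:space:]]*\)[[:space:]]*$|\1|p' | head -n 1
}
```

      The output of the '--register-only' submission can be piped into extract_job_id and the printed identifier passed to glite-wms-job-cancel, which still asks for confirmation as shown above.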
      

  • Bug #66986: ICE must be able to print out on file the stack trace trapping SIGSEGV, SIGILL, SIGABRT etc. FIXED NOT CERTIFIED

  • Bug #67097: [yaim-wms] Removed lcg-condor-extra usage FIXED
    • checked NORDUGRID_GAHP is set to the right value:
      grep NORDUGRID_GAHP /opt/glite/yaim/functions/*
      /opt/glite/yaim/functions/config_condor_wms:   setValue NORDUGRID_GAHP "\$(SBIN)/nordugrid_gahp"
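    • A complementary check is to look for leftover references. This is a minimal sketch: the function name no_stale_references is hypothetical, and it assumes that the fix means no yaim function file mentions lcg-condor-extra any more, which is inferred from the bug title rather than stated in this report.

```shell
#!/bin/sh
# Hypothetical helper: list the files under a directory that still contain
# a given fixed string; succeed only when no file does.
# $1 = directory to search, $2 = string to look for.
no_stale_references() {
    dir=$1
    pattern=$2
    # -r: recurse; -l: print only the names of matching files;
    # -F: treat the pattern as a fixed string, not a regular expression.
    if grep -rlF -- "$pattern" "$dir"; then
        return 1    # at least one file still mentions the string
    fi
    return 0
}
```

      Run against /opt/glite/yaim/functions with 'lcg-condor-extra' as the string, it prints any file that still refers to it and returns non-zero in that case.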

  • Bug #68891: ICE falls into an infinite loop when a job has expired proxy and has been submitted to a CREAM without EventQuery FIXED NOT CERTIFIED
    • The problem is not easy to reproduce, so the fix has not been certified.

-- AlessioGianelle - 2010-02-05

Topic attachments
Attachment                Size    Date                Who              Comment
brokerinfo                3.1 K   2010-05-05 - 12:40  UnknownUser
brokerinfo-ICE            2.5 K   2010-05-06 - 09:31  UnknownUser
bulk_jobs                 16.6 K  2010-05-04 - 09:04  UnknownUser      bulk jobs to JC
bulk_jobs_toICE           7.8 K   2010-05-04 - 12:51  UnknownUser
cancel-collection-toICE   3.3 K   2010-04-29 - 10:08  UnknownUser
cancel-collection-toJC    3.9 K   2010-04-29 - 10:21  UnknownUser
cancel-dag                3.6 K   2010-04-29 - 09:47  UnknownUser
collection-toICE          5.6 K   2010-04-27 - 10:15  UnknownUser
collection-toLcg          5.1 K   2010-04-27 - 10:02  UnknownUser
dag.txt                   9.2 K   2010-06-15 - 13:53  AlessioGianelle  Dag repo
deepresub.txt             5.1 K   2010-06-11 - 14:29  AlessioGianelle  Deer resub repo
epilogprolog.txt          5.7 K   2010-06-11 - 09:54  AlessioGianelle  EpilogProlog
jobcancel-toICE           3.2 K   2010-04-29 - 09:50  UnknownUser
jobcancel-toLcg           3.3 K   2010-04-29 - 09:51  UnknownUser
listmatchdata.txt         3.4 K   2010-04-29 - 12:54  AlessioGianelle  List Match with data
mpirepo.txt               6.4 K   2010-06-09 - 11:26  AlessioGianelle  Mpi repo
parametric-toICE          9.7 K   2010-04-27 - 11:26  UnknownUser
parametric-toLcg          9.3 K   2010-04-27 - 11:33  UnknownUser
perusal-toICE             4.8 K   2010-04-27 - 12:45  UnknownUser
perusal-toLcg             5.4 K   2010-04-28 - 10:21  UnknownUser
shallowresub.txt          5.0 K   2010-06-11 - 12:08  AlessioGianelle  Shallow repo
test_recovery             47.9 K  2010-06-16 - 09:44  UnknownUser      test recovery
test_upgrade_wms          28.1 K  2010-06-18 - 13:23  UnknownUser      test wms upgrade
updatelog.txt             22.8 K  2010-06-08 - 13:41  AlessioGianelle  Update log
yaim_conf_overwrite_true  28.0 K  2010-06-21 - 10:38  UnknownUser      yaim_conf_wms_overwrite
yaim_conf_wms_log         8.8 K   2010-06-18 - 10:16  UnknownUser      yaim_conf_wms
yum_install_wms           48.5 K  2010-06-18 - 10:13  UnknownUser      yum_install_wms
yum_install_wms_log_good  48.5 K  2010-06-09 - 13:09  UnknownUser      yum_install_wms_log
Topic revision: r75 - 2010-06-21 - ElisabettaMolinari
 