%TOC%

---+ *Certification report patch [[https://savannah.cern.ch/patch/index.php?3621][3621]]*

Author(s): Elisabetta Molinari & Alessio Gianelle

Outcome: %ORANGE% *in certification...* %ENDCOLOR%

---++ Clean installation

---++ Upgrade from production

---++ Test Report

---+++ List Match
   * without data: %ICON{choice-yes}%
   * with data: %ICON{choice-yes}%

---+++ Submission/GetOutput
   * =Normal= jobs through:
      * ICE work: %ICON{choice-yes}%
      * JC work: %ICON{choice-yes}%
   * =Dag= jobs through:
      * JC work: %ICON{choice-yes}%
   * =Collection= jobs through:
      * ICE work: %ICON{choice-yes}%
      * JC work: %ICON{choice-yes}%
      * job-output for collections also works, even though %ORANGE% only the parent node is set to 'Cleared' %ENDCOLOR%
   * =Parametric= jobs through:
      * ICE work: %ICON{choice-yes}%
      * JC work: %ICON{choice-yes}%
      * tested with the following JDL: <verbatim>
[
JobType = "parametric";
Executable = "/usr/bin/env";
Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"};
StdOutput = "echo_PARAM_.out";
StdError = "echo_PARAM_.err";
OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"};
Parameters = 5;
usertags = [ jdl = "parametric" ];
]</verbatim>
   * =Bulk= jobs sent through both ICE and JC, with !RetryCount = 0:
      * Submit a bulk of 3 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC
      * Submit a bulk of 50 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC
      * Submit a bulk of 100 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC
      * Submit a bulk of 500 jobs -> success 99.9% %ICON{choice-yes}% both to ICE and JC
      * Submit a bulk of 1000 jobs -> success 99.9% %ICON{choice-yes}% both to ICE and JC
   * =Perusal= jobs through:
      * JC work: %ICON{choice-yes}%
      * ICE work: %ICON{choice-yes}%
   * =MPICH= jobs: %ICON{choice-no}%

---+++ Cancel
   * Normal jobs
      * ICE: %ICON{choice-yes}%
      * JC: %ICON{choice-yes}%
   * Dag: %ICON{choice-yes}%
      * Note that child nodes in status %ORANGE% 'submitted' %ENDCOLOR% don't get cancelled
   * Collection
      * ICE: %ICON{choice-yes}%
      * JC: %ICON{choice-yes}%
   * Node of a collection: %ICON{choice-yes}%
      * Note: collections stay in status %ORANGE% 'waiting'%ENDCOLOR% when all the nodes are Done (Success) except for one that is 'Cancelled'

---+++ Others
   * =BrokerInfo=
      * ICE creation: %ICON{choice-yes}%
      * JC creation: %ICON{choice-yes}%
   * =Resubmission=
      * Shallow: %ICON{choice-yes}%
      * Deep: %ICON{choice-yes}%
   * =Job Recovery=
      * Tested with a few collections, restarting the WM while some node jobs were still in 'submitted' or 'waiting' status %ICON{choice-yes}%
   * =Prologue= and =Epilogue= jobs
      * ICE: %ICON{choice-yes}%
      * JC: %ICON{choice-yes}%

-------

-------

---++ Check bugs

   * Bug [[https://savannah.cern.ch/bugs/?42288][#42288]]: Problem in forwarding cerequirements to a CREAM CE %GREEN% FIXED %ENDCOLOR%
      * description of the problem --> "The parameters to be forwarded specified in the Requirements attribute of the .jdl classad are NOT considered and ICE does not send them to the CE, therefore the classad passed to BLAH does not contain them"
      * submitted the following .jdl via WMS: <verbatim>
cat myjob_forwardReq.jdl
[
Type = "Job";
JobType = "normal";
InputSandbox = { "file:///home/emolinari/test.sh"};
VirtualOrganisation = "dteam";
Executable="test.sh";
Arguments="Hello ";
requirements = (other.GlueCEUniqueID == "cream-19.pd.infn.it:8443/cream-lsf-testbedB_1") && (other.GlueHostMainMemoryRAMSize >= 0) ;
Rank = 0;
myproxy = myproxy.cnaf.infn.it;
fuzzyrank = true;
StdOutput="message.txt";
StdError="err.log";
OutputSandbox={"message.txt","err.log",".BrokerInfo"};
RetryCount = 0;
ShallowRetryCount = 3;
]</verbatim>
      * checked in the ICE log file on the WMS, /var/log/glite/ice.log, that the CERequirement field of the .jdl gets populated as follows: <verbatim>
CeRequirements = "true && ( true && ( true && ( other.GlueHostMainMemoryRAMSize >= 0 ) ) )";
</verbatim>
      * checked on the CE that BLAH generates the correct classad with the requirements to be forwarded, as follows: <verbatim>
cat /tmp/subfile
#!/bin/bash
# LSF job wrapper generated by lsf_submit.sh
# on Wed Apr 14 19:13:10 CEST 2010
#
# LSF directives:
#BSUB -L /bin/bash
#BSUB -J cre19_725998524
#BSUB -q testbedB_1
#BSUB -R "select[mem>=0]"
......</verbatim>
   * Bug [[https://savannah.cern.ch/bugs/?48910][#48910]]: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts %GREEN%FIXED%ENDCOLOR%
      * Stopped gLite services and deleted the jobdir under '/var/glite/workload_manager': <verbatim>
[root@wms007 jobdir]# service gLite stop
[...]
[root@wms007 workload_manager]# pwd
/var/glite/workload_manager
[root@wms007 workload_manager]# ls
ismdump.fl jobdir
[root@wms007 workload_manager]# rm -rf jobdir
[root@wms007 workload_manager]# ls
ismdump.fl
</verbatim>
      * Restarted the LM service and checked that the jobdir gets recreated: <verbatim>
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
Starting LogMonitor... [ OK ]
[root@wms007 workload_manager]# ls
ismdump.fl jobdir
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
Logmonitor running...</verbatim>
      * Stopped gLite services and deleted the jobdir under '/var/glite/jobcontrol': <verbatim>
[root@wms007 jobcontrol]# pwd
/var/glite/jobcontrol
[root@wms007 jobcontrol]# rm -rf jobdir
[root@wms007 jobcontrol]# ls
condorio submit</verbatim>
      * Restarted the JC service and checked that the jobdir gets recreated: <verbatim>
[root@wms007 jobcontrol]# /opt/glite/etc/init.d/glite-wms-jc start JobController
Starting JobController daemon(s)
Starting JobController... [ OK ]
[root@wms007 jobcontrol]# ls
condorio jobdir lock submit
[root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-jc status JobController
JobController running in pid: 3625</verbatim>
      * Stopped gLite services and deleted the jobdir under '/var/glite/ice': <verbatim>
[root@wms007 ice]# pwd
/var/glite/ice
[root@wms007 ice]# ls
jobdir persist_dir
[root@wms007 ice]# rm -rf jobdir/
[root@wms007 ice]# ls
persist_dir</verbatim>
      * Restarted the ICE service and checked that the jobdir gets recreated: <verbatim>
[root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice start
starting ICE... ok
[root@wms007 ice]# ls
jobdir persist_dir
[root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice status
/opt/glite/bin/glite-wms-ice-safe (pid 22783) is running...</verbatim>
      * Stopped gLite services and deleted all the jobdirs: <verbatim>
[root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
ice/:
persist_dir

jobcontrol/:
condorio submit

workload_manager/:
ismdump.fl</verbatim>
      * Restarted the WM service and checked that all the jobdirs get recreated: <verbatim>
[root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm start
starting workload manager... ok
[root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
ice/:
jobdir persist_dir

jobcontrol/:
condorio jobdir submit

workload_manager/:
ismdump.fl jobdir
[root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm status
/opt/glite/bin/glite-wms-workload_manager (pid 23259) is running...</verbatim>
      * Commented out the Input/InputType parameters in the WMS conf file (sections: ICE, !WorkloadManager and !JobController).
         * Try to start !JobController: <verbatim>
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc start JobController
Starting JobController daemon(s)
Please set Input parameter in glite_wms.conf - JC section [FAILED]
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc status JobController
JobController stopped.</verbatim>
         * Try to start !LogMonitor: <verbatim>
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
Starting LogMonitor...Please set Input parameter in glite_wms.conf - WM section [FAILED]
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
LogMonitor stopped.</verbatim>
         * Try to start ICE: <verbatim>
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice start
starting ICE... failure
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice status
/opt/glite/bin/glite-wms-ice-safe is not running</verbatim>
         * Try to start !WorkloadManager: <verbatim>
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm start
starting workload manager...
Please set Input parameter in - WM section
Please set DispatcherType parameter in - WM section
Please set Input parameter in - JC section
Please set InputType parameter in - JC section
Please set Input parameter in - ICE section
Please set InputType parameter in - ICE section
failure
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm status
/opt/glite/bin/glite-wms-workload_manager is not running</verbatim>
   * Bug [[https://savannah.cern.ch/bugs/?52934][#52934]]: [ICE] Delegation in ICE doesn't refer to the myproxy server %GREEN%FIXED%ENDCOLOR%
      * !GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg
         * Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit]
         * Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
         * Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
         * !MyProxyServer = "myproxy.cern.ch";
      * !GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og
         * Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit]
         * Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
         * Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
         * !MyProxyServer = "myproxy.cnaf.infn.it";
   * Bug [[https://savannah.cern.ch/bugs/?53460][#53460]]: [ICE] Detection of job status changes for CREAM jobs should be improved %GREEN%FIXED%ENDCOLOR%
      * Using a new CE (1.6), the ICE log shows: <verbatim>
2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] Database ID=[1261041182000]
2010-03-22 16:47:50,496 DEBUG - 
iceCommandEventQuery::execute() - TID=[150673032] Exec time ID=[3] 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]. 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] EventID [685143] timestsamp [1269272804] 2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182</verbatim> * Using an "old" CE instead the "poller" method is used:<verbatim> 2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888 2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() - TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]] 2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() - TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller. 
2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2] 2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...</verbatim> * Bug [[https://savannah.cern.ch/bugs/?55103][#55103]]: [ICE] ICE port 7010 not cleaned up properly %GREEN%FIXED%ENDCOLOR% * We try a stop/start/restart sequence<verbatim> [root@wms007 ~]# ps ax | grep ice 1283 pts/2 S+ 0:00 grep ice 30985 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 30989 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 30990 ? Sl 0:15 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 1321 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 1353 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 1357 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 1358 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 1398 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice restart stopping ICE... ok starting ICE... ok [root@wms007 ~]# ps ax | grep ice 1433 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 1437 ? 
S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 1438 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 1470 pts/2 S+ 0:00 grep ice</verbatim> * Bug [[https://savannah.cern.ch/bugs/?55452][#55452]]: CMS production struck by waves of "Globus error 10: data transfer to the server failed" %ORANGE% FIXED NOT CERTIFIED %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?56636][#56636]]: [ICE] statistics counters for monitoring %GREEN% FIXED %ENDCOLOR% * Verify the command and its options:<verbatim> [root@wms007 persist_dir]# queryStats -t "2010-04-08 00:00:00" JOB_REGISTERED=2 JOB_IDLE=2 JOB_RUNNING=2 JOB_REALLY-RUNNING=2 JOB_DONE-OK=2 [root@wms007 persist_dir]# queryStats -f "2010-04-08 00:00:01" -t "2010-04-09 11:00:00" JOB_REGISTERED=4 JOB_IDLE=4 JOB_RUNNING=4 JOB_REALLY-RUNNING=4 JOB_DONE-OK=1 JOB_DONE-FAILED=3 [root@wms007 persist_dir]# queryStats -f "2010-04-09 11:00:01" JOB_REGISTERED=255 JOB_IDLE=255 JOB_RUNNING=193 JOB_REALLY-RUNNING=204 JOB_DONE-OK=191 JOB_ABORTED=6 [root@wms007 persist_dir]# queryStats JOB_REGISTERED=261 JOB_IDLE=261 JOB_RUNNING=199 JOB_REALLY-RUNNING=210 JOB_DONE-OK=194 JOB_DONE-FAILED=3 JOB_ABORTED=6</verbatim> * Bug [[https://savannah.cern.ch/bugs/?57295][#57295]]: [ICE] queryDb tool may create empty DB as root %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> [root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db -rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db [root@wms007 ~]# /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING 0 item(s) found [root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db -rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db</verbatim> * Bug [[https://savannah.cern.ch/bugs/?57579][#57579]]: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process %GREEN%HOPEFULLY FIXED%ENDCOLOR% * Verify:<verbatim> [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice status 
/opt/glite/bin/glite-wms-ice-safe (pid 1433) is running... [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 19866 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 19899 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 19903 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 19904 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 19932 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 19978 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 20009 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 20013 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 20014 ? 
Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 20046 pts/2 S+ 0:00 grep ice</verbatim> * Bug [[https://savannah.cern.ch/bugs/?57596][#57596]]: [ICE] non resubmission if job failed for proxy expiration %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> 2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] 2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded. 2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED! 2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...) 
2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]</verbatim> * Bug [[https://savannah.cern.ch/bugs/?58099][#58099]]: WMS purger forces purge of jobs if LB cannot be reached %GREEN% FIXED %ENDCOLOR% * Stop the LBServer and then run the cron purger:<verbatim> 07 Apr, 16:09:13 -E: [Error] query_job_status(purger.cpp:125): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg:: edg_wll_JobStat [111] Connection refused(edg_wll_gss_connect()) [glite@wms007 ~]$ </verbatim> * Verify that the !SandBox dir has not been removed:<verbatim> [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/ total 16 drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:34 input drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:46 output drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:34 peek lrwxrwxrwx 1 glite glite 102 Apr 6 14:34 user.proxy -> /var/glite/SandboxDir/Uo/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fUow8XY0NGbyumU3PPGMSng/user.proxy </verbatim> * Restart LBServer and verify that now the SBD of the job is purged:<verbatim> [glite@wms007 ~]$ /opt/glite/sbin/glite-wms-purgeStorage.sh -p /var/glite/SandboxDir/ye -t 10000 07 Apr, 16:18:07 -I: [Info] operator()(purger.cpp:449): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg: removed DONE job [glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/ ls: /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/: No such file or directory</verbatim> * Bug [[https://savannah.cern.ch/bugs/?58387][#58387]]: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> 
************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA Current Status: Aborted Logged Reason(s): - Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed Status Reason: Input sandbox's proxy is missing. Cannot resubmit job Destination: ce202.cern.ch:8443/cream-lsf-grid_dteam Submitted: Tue Mar 23 09:49:42 2010 CET *************************************************************</verbatim> * Bug [[https://savannah.cern.ch/bugs/?58977][#58977]]: [ICE] Wrong database colum name in ICE SQL query %GREEN% FIXED %ENDCOLOR% * Submit some jobs to a ce (e.g. 
cream-25.pd.infn.it:8443/cream-lsf-testbedB_2): <verbatim>
[root@wms007 20100409]# queryDb -v -C -G
[https://cream-25.pd.infn.it:8443/CREAM881525184] [https://devel17.cnaf.infn.it:9000/V2Lj0_-XWkrjRKMaf3f6ng]
[https://cream-25.pd.infn.it:8443/CREAM425827870] [https://devel17.cnaf.infn.it:9000/4TKQu1U_daCMMb2mRDR2cA]
[https://cream-25.pd.infn.it:8443/CREAM543141647] [https://devel17.cnaf.infn.it:9000/4wqozVNHEVUXWx5UzUaXqA]
[https://cream-25.pd.infn.it:8443/CREAM769586568] [https://devel17.cnaf.infn.it:9000/XLcaGE3kR3h8-oj8cMWE_A]
[https://cream-25.pd.infn.it:8443/CREAM192029588] [https://devel17.cnaf.infn.it:9000/PuOGoOxMf-pbfFu-wSkACw]
[https://cream-25.pd.infn.it:8443/CREAM378177464] [https://devel17.cnaf.infn.it:9000/T8ZKSu5zZPZY-Ee1gLRX5A]
[https://cream-25.pd.infn.it:8443/CREAM299069473] [https://devel17.cnaf.infn.it:9000/Xh1AMEor9hWOx4picngYkA]
[https://cream-25.pd.infn.it:8443/CREAM012571708] [https://devel17.cnaf.infn.it:9000/YjpCU6dfrLsDBs6wU_D3Hg]
[https://cream-25.pd.infn.it:8443/CREAM561236418] [https://devel17.cnaf.infn.it:9000/00Qc7RnutRORYVOd0ShIKg]
[https://cream-25.pd.infn.it:8443/CREAM972351884] [https://devel17.cnaf.infn.it:9000/ksz80OflJnDE_ynWHmKTwQ]
[https://cream-25.pd.infn.it:8443/CREAM827240561] [https://devel17.cnaf.infn.it:9000/WwfftKdV6_5lSgihPOUsaA]
[https://cream-25.pd.infn.it:8443/CREAM573497695] [https://devel17.cnaf.infn.it:9000/S5zbkyK72hv2LXwUD1vAFw]
[https://cream-25.pd.infn.it:8443/CREAM735112819] [https://devel17.cnaf.infn.it:9000/0J9nTy1tJxkcuRJ9oTZACw]
[https://cream-25.pd.infn.it:8443/CREAM526570551] [https://devel17.cnaf.infn.it:9000/Rcl0TypyUTXMwtLk86R3yA]
[https://cream-25.pd.infn.it:8443/CREAM992848449] [https://devel17.cnaf.infn.it:9000/xfH1fkIroQwNvlVBdn8N5A]
[https://cream-25.pd.infn.it:8443/CREAM944698480] [https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg]
[https://cream-25.pd.infn.it:8443/CREAM729677007] [https://devel17.cnaf.infn.it:9000/FIZA1Mjb4moUNel1N7UXvw]
[https://cream-25.pd.infn.it:8443/CREAM589660323] [https://devel17.cnaf.infn.it:9000/5DJlLG7M0v3C_-WMDKSdXQ]
[https://cream-25.pd.infn.it:8443/CREAM994745139] [https://devel17.cnaf.infn.it:9000/T_UdwnjC55dIVrPxJOVvmg]
[https://cream-25.pd.infn.it:8443/CREAM228224655] [https://devel17.cnaf.infn.it:9000/URw39mrv7jj-buJ3KDza8w]
[https://cream-25.pd.infn.it:8443/CREAM397635733] [https://devel17.cnaf.infn.it:9000/f3FOGwNoWpHyWkxO_87AIg]
[https://cream-25.pd.infn.it:8443/CREAM510341828] [https://devel17.cnaf.infn.it:9000/vEfH5j5_5R_7jFNrntEsog]
[https://cream-25.pd.infn.it:8443/CREAM788890890] [https://devel17.cnaf.infn.it:9000/y0IVYbdR_UWbTmrXY5O8fA]
------------------------------------------------
23 item(s) found</verbatim>
      * Check also the db_id registered in ICE's database: <verbatim>
[root@wms007 20100409]# sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
1270820425000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2</verbatim>
      * Stop the CREAM CE. Drop its database. Create a new empty one. Restart the CE.
      * Check what happens in the ICE log file: <verbatim>
2010-04-09 16:16:53,953 WARN - iceCommandEventQuery::execute() - TID=[150150560] *** CREAM HAS PROBABLY BEEN SCRATCHED. GOING TO ERASE ALL JOBS RELATED TO OLD DB_ID [1270820425000] ***</verbatim>
      * Check whether any jobs are left in ICE's database: <verbatim>
[root@wms007 persist_dir]# queryDb -v -C -G
------------------------------------------------
0 item(s) found</verbatim>
      * and whether the db_id has changed: <verbatim>
[root@wms007 persist_dir]# sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
1270822483000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2</verbatim>
      * Look at the status of a job that has been removed: <verbatim>
Status info for the Job : https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg
Current Status: Aborted
Logged Reason(s):
 - job completed
Status Reason: CREAM'S database has been scratched and all its jobs have been lost
Destination: cream-25.pd.infn.it:8443/cream-lsf-testbedB_2
Submitted: Fri Apr 9 16:11:34 2010 CEST</verbatim>
   * Bug [[https://savannah.cern.ch/bugs/?59240][#59240]]: [ICE] abort reasons not always printed in its logfile %ORANGE% FIXED NOT CERTIFIED %ENDCOLOR%
   * Bug [[https://savannah.cern.ch/bugs/?59339][#59399]]: [ICE] doesn't correctly handle request in jobdir/old when it is restarted %GREEN%FIXED%ENDCOLOR%
      * Verify by submitting a big collection to CREAM CEs and then restarting ICE in the middle of the submission process: <verbatim>
2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() - TID=[168434952] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...</verbatim>
      * restarting ICE... <verbatim>
2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579]
2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem]
2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads
2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads
[...]
2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() - TID=[144321160] This request is a Submission...
2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() - TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it...
2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() - TID=[144321160] Going to START CreamJobID [https://cream-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]</verbatim>
   * Bug [[https://savannah.cern.ch/bugs/?59453][#59453]]: [ICE] polling needs to be improved %ORANGE% FIXED NOT CERTIFIED %ENDCOLOR%
      * See also these [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WmsTestsICE4][tests]]
   * Bug [[https://savannah.cern.ch/bugs/?60668][#60668]]: [ICE] does not respect LB server/proxy selection through the LBproxy attribute %GREEN%FIXED%ENDCOLOR%
      * Set *LBProxy = false;* in _glite_wms.conf_ (section Common), restart ICE and submit: <verbatim>
mysql> select * from events where jobid="YFyqjw3FF-BO-0U5BxCOtA";
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| YFyqjw3FF-BO-0U5BxCOtA |     0 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 394848 |     8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA |     1 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 548652 |     8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA |     2 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 608084 |     8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA |     3 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 657231 |     8 | 2010-03-24 12:04:39 |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
4 rows in set (0.00 sec)</verbatim>
      * Set *LBProxy = true;* in _glite_wms.conf_ (section Common), restart ICE and submit: <verbatim>
mysql> select * from events where jobid="SlKOGSnaW0oKO3TJqw9tbA";
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| jobid                  | event | code | prog            | host                | time_stamp          | userid                           | usec   | level | arrived             |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| SlKOGSnaW0oKO3TJqw9tbA |     0 |   17 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 342720 |     8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA |     1 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 470416 |     8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA |     2 |   21 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 526402 |     8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA |     3 |    2 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 606511 |     8 | 2010-03-24 12:09:54 |
| SlKOGSnaW0oKO3TJqw9tbA |     4 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 712100 |     8 | 2010-03-24 12:09:54 |
| SlKOGSnaW0oKO3TJqw9tbA |     5 |    4 | NetworkServer   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | 3f82b966e8a77413044be1a9144a4af4 |  43631 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |     6 |    5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 167414 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |     7 |   15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 297333 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |     8 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 369636 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |     9 |    4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 431565 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |    10 |    5 | JobController   | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 745052 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |    11 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 846002 |     8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA |    12 |    1 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:10:04 | bdd27610035bb0ec9287e2ecaa3da2eb | 869424 |     8 | 2010-03-24 12:10:04 |
| SlKOGSnaW0oKO3TJqw9tbA |    13 |    8 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb |  94855 |     8 | 2010-03-24 12:11:39 |
| SlKOGSnaW0oKO3TJqw9tbA |    14 |   25 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 181448 |     8 | 2010-03-24 12:11:39 |
| SlKOGSnaW0oKO3TJqw9tbA |    15 |   10 | LogMonitor      | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 250291 |     8 | 2010-03-24 12:11:39 |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
16 rows in set (0.00 sec)</verbatim>
   * Bug [[https://savannah.cern.ch/bugs/?61312][#61312]]: [ICE] Error in handling user dn in ICE's poller %GREEN%FIXED%ENDCOLOR%
      * Submit 5 jobs to an _old_ !CreamCE (Cream 1.5) setting the !MyProxyServer attribute: <verbatim>
2010-03-24 13:40:38,128 ERROR - iceCommandEventQuery::execute() - TID=[159321352] 
Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-33.pd.infn.it</ns2:hostname>]] 2010-03-24 13:40:38,128 WARN - iceCommandEventQuery::execute() - TID=[159321352] Not present QueryEvent on CE [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller. 2010-03-24 13:40:38,128 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2] 2010-03-24 13:40:38,128 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]... 2010-03-24 13:40:38,129 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Finished collecting jobs to poll. 
[5] jobs are to poll [...]</verbatim> * The jobs then complete successfully:<verbatim> [ale@cream-15 UI]$ glite-wms-job-status -v 0 -i testo --noint ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/tt3GLYuIiHuwrmnl7fGVtA Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/lY9fdOgQk5RcaH99g23z5g Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/jta5KlBZEP-r2KbE0SB0Vw Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/TNqI_PbRyqgFAN3L52IpKQ Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/V7Pnv2yE47CdHKgQmRaIvQ Current Status: Done (Success) *************************************************************</verbatim> * Bug [[https://savannah.cern.ch/bugs/?61405][#61405]]: [ICE] Missing proxy validity evaluation in ICE %GREEN% FIXED %ENDCOLOR% * Submit this JDL with a 30-minute proxy NOT registered on the !MyProxy server (myproxy.cnaf.infn.it):<verbatim> [ executable = "/bin/sleep"; arguments = "2000"; MyProxyServer = "myproxy.cnaf.infn.it"; requirements = ( other.GlueCEStateStatus == "testbedb" ); DefaultRank = -other.GlueCEStateEstimatedResponseTime; ]</verbatim> * After a while, submit the same JDL with a fresh proxy and check in the ICE log whether this new proxy is used to refresh the delegation of the previous job: * First, it should try to renew the proxy by contacting the !MyProxy server:<verbatim> 2010-04-14 11:47:40,622 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Contacting MyProxy 
server [myproxy.cnaf.infn.it] for user dn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] with proxy certificate [/var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy] to renew it... 2010-04-14 11:47:40,622 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Executing command [export X509_USER_CERT=/var/glite/wms.proxy; export X509_USER_KEY=/var/glite/wms.proxy; /opt/glite/bin/glite-wms-ice-proxy-renew -s myproxy.cnaf.infn.it -p /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy -o /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy.renewed]... 2010-04-14 11:47:40,783 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Command output is [/opt/glite/bin/glite-wms-ice-proxy-renew: glite_renewal_core_renew() failed: Error contacting MyProxy server for proxy /var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy: ERROR from myproxy-server (myproxy.cnaf.infn.it): X509_verify_cert() failed: certificate has expired]</verbatim> * Then it should use the proxy of the most recently arrived job to renew the delegation:<verbatim> 2010-04-14 11:47:40,783 DEBUG - iceCommandDelegationRenewal::renewAllDelegations() - Looking for the better proxy for DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server name [myproxy.cnaf.infn.it]... 
2010-04-14 11:47:40,783 INFO - iceCommandDelegationRenewal::renewAllDelegations() - Will Renew Delegation ID [12712381542E417936wms0072Ecnaf2Einfn2Eit] with BetterProxy [/var/glite/ice/persist_dir/2A9DAF04C398C21D6ADF7E884BC192ED95AF554C.betterproxy] that will expire on [Wed Apr 14 12:12:47 2010] 2010-04-14 11:47:40,783 INFO - CreamProxy_DelegateRenew::execute() - Calling renewProxyReq to remote service [https://cream-39.pd.infn.it:8443/ce-cream/services/gridsite-delegation]</verbatim> * Bug [[https://savannah.cern.ch/bugs/?61413][#61413]]: [ICE] should not call !EventQuery for a userDN if he/she doesn't have active jobs %GREEN%FIXED%ENDCOLOR% * Submit a job to a !CreamCE and wait until it has finished. * Submit another job to a _different_ !CreamCE; you should not see any query to the previously used !CreamCE. * Bug [[https://savannah.cern.ch/bugs/?61748][#61748]]: [ICE] !EventQuery/Polling must be done also to blacklisted CE %GREEN%FIXED%ENDCOLOR% * Submit some jobs to a !CreamCE * Trigger a socket timeout so that ICE blacklists the !CreamCE:<verbatim> 2010-03-24 15:58:40,753 ERROR - CreamProxyMethod::execute() - Connection timed out to CREAM: "EOF detected during communication. Probably service closed connection or SOCKET TIMEOUT occurred." on try 3/3. Blacklisting endpoint and giving up. 
2010-03-24 15:58:40,753 DEBUG - CEBlackList::blacklist_endpoint() - Blacklisting CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation until Wed Mar 24 16:08:40 2010</verbatim> * Verify that the !QueryEvent command is called in any case:<verbatim> 2010-03-24 16:05:28,952 DEBUG - eventStatusPoller::body() - Adding EventQuery command for couple (/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL, https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2) to the thread pool...</verbatim> * A submission, instead, fails while the CE is blacklisted:<verbatim> 2010-03-24 15:58:43,265 DEBUG - Delegation_manager::delegate() - Creating new delegation with delegation id [12694427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] Expiring on [Thu Mar 25 12:54:02 2010] 2010-03-24 15:58:43,265 DEBUG - CEBlackList::is_blacklisted() - CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation is blacklisted until Wed Mar 24 16:08:40 2010 2010-03-24 15:58:43,265 ERROR - Delegation_manager::delegate() - FAILED Creation of a new delegation with delegation id [12694427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] - ERROR is: [The endpoint is 
blacklisted] 2010-03-24 15:58:43,265 ERROR - iceCommandSubmit::execute() - TID=[159308760] Error during submission of jdl= Fatal Exception is:Failed to create a delegation id for job https://devel17.cnaf.infn.it:9000/UoVsvjIj1CPluHb81xM_pQ: reason is The endpoint is blacklisted</verbatim> * Bug [[https://savannah.cern.ch/bugs/?63989][#63989]]: [ICE] doesn't handle exception raised by jobDir::new_entries() %GREEN%FIXED%ENDCOLOR% * Change the permission of the _new_ directory in jobdir:<verbatim> [root@wms007 jobdir]# chmod 111 new/ [root@wms007 jobdir]# ls -l total 48 d--x--x--x 2 glite glite 40960 Mar 24 16:13 new drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 old drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 tmp</verbatim> * Look in ICE's log:<verbatim> 2010-03-25 09:45:39,545 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied 2010-03-25 09:45:40,546 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied</verbatim> * Bug [[https://savannah.cern.ch/bugs/?64698][#64698]]: jobwrapper max osb limit should be considered only if the gridftp server is the wms %GREEN% FIXED %ENDCOLOR% %ORANGE% *only for LCG-CE* %ENDCOLOR% * Set _MaxOutputSandboxSize = 10000000;_ in section !WorkloadManager of file glite_wms.conf * Submit a JDL with a file larger than 10 MB in the _OutputSandbox_ parameter and also set the corresponding _OutputSandboxDestURI_ parameter * Check the output dir in the SE:<verbatim> [root@devel18 tmp]# ls -lh -rw-r--r-- 1 dteam044 dteam 50M Apr 7 16:23 bigfile -rw-r--r-- 1 dteam044 dteam 646 Apr 7 16:23 ls.out</verbatim> * If you don't set _OutputSandboxDestURI_ in the JDL, then the !SandBox dir in the WMS should contain a =.tail= file of less than 10 MB:<verbatim> [root@wms007 
persist_dir]# ls -lh /var/glite/SandboxDir/A1/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fA1cdkhzepvrCjiU_5fTaKbpg/output/ total 9.6M -rw-r--r-- 1 dteam008 dteam 9.6M Apr 7 16:26 bigfile.tail -rw-r--r-- 1 dteam008 dteam 637 Apr 7 16:26 ls.out</verbatim> -- Main.AlessioGianelle - 2010-02-05
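The check for bug #61748 (EventQuery must still be issued for a blacklisted CE) was done by reading the ICE log by eye. A minimal sketch of how that check could be automated is given below. It assumes the log lines are available unwrapped, one event per line, in the format quoted in this report. The function name =queries_during_blacklist= and the regular expressions are hypothetical helpers written for this illustration and are not part of ICE.

```python
import re
from datetime import datetime

# Timestamp format at the start of each ICE log line (as quoted in this report).
LOG_TS = "%Y-%m-%d %H:%M:%S"
# Format of the "until ..." expiry printed by CEBlackList (English day/month names assumed).
UNTIL_TS = "%a %b %d %H:%M:%S %Y"

# Hypothetical patterns derived from the two log messages shown in the report.
BLACKLIST_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Blacklisting CE https://([^:/\s]+)\S* until (.+)$"
)
QUERY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Adding EventQuery command for couple \(.*https://([^:/\s]+)"
)


def queries_during_blacklist(lines):
    """Return (timestamp, host) for each EventQuery issued while host was blacklisted."""
    windows = {}  # host -> list of (blacklisted_at, blacklisted_until)
    hits = []
    for line in lines:
        line = line.strip()
        m = BLACKLIST_RE.search(line)
        if m:
            start = datetime.strptime(m.group(1), LOG_TS)
            end = datetime.strptime(m.group(3).strip(), UNTIL_TS)
            windows.setdefault(m.group(2), []).append((start, end))
            continue
        m = QUERY_RE.search(line)
        if m:
            when = datetime.strptime(m.group(1), LOG_TS)
            host = m.group(2)
            if any(s <= when <= e for s, e in windows.get(host, [])):
                hits.append((when, host))
    return hits


if __name__ == "__main__":
    import sys

    # Usage (the log path is an assumption): python check_blacklist.py /path/to/ice.log
    with open(sys.argv[1]) as log:
        for when, host in queries_during_blacklist(log):
            print(when, host)
```

A non-empty result for the blacklisted host is what the test above expects: ICE keeps querying the CE during the blacklist window, even though new submissions to it are refused.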
Topic revision: r43 - 2010-04-26 - AlessioGianelle