Tags:
,
view all tags
---+ *Certification report patch [[https://savannah.cern.ch/patch/index.php?3621][3621]]* Author(s): Elisabetta Molinari & Alessio Gianelle Outcome: %ORANGE% *in certification...* %ENDCOLOR% ---++ Clean installation ---++ Upgrade from production ---++ Test Report ---+++ List Match * without data: %ICON{choice-yes}% * with data: %ICON{choice-yes}% ---+++ Submission/GetOutput * =Normal= jobs through * ICE work: %ICON{choice-yes}% * JC work: %ICON{choice-yes}% * =Dag= jobs through: * JC work: %ICON{choice-yes}% * =Collection= jobs through: * ICE work: %ICON{choice-yes}% * JC work: %ICON{choice-yes}% * also job-output for collections works even though %ORANGE% only the parent node is set to 'Cleared' %ENDCOLOR% * =Parametric= jobs through: * ICE work: %ICON{choice-yes}% * JC work: %ICON{choice-yes}% * tested with the following <verbatim> [ JobType = "parametric"; Executable = "/usr/bin/env"; Environment = {"MYPATH_PARAM_=$PATH:/bin:/usr/bin:$HOME"}; StdOutput = "echo_PARAM_.out"; StdError = "echo_PARAM_.err"; OutputSandbox = {"echo_PARAM_.out","echo_PARAM_.err"}; Parameters = 5; usertags = [ jdl = "parametric" ]; ]</verbatim> * =Bulk= jobs sent both through ICE and JC and !RetryCount = 0; : * Submit a bulk of 3 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC * Submit a bulk of 50 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC * Submit a bulk of 100 jobs -> success 100% %ICON{choice-yes}% both to ICE and JC * Submit a bulk of 500 jobs -> success 99.9% %ICON{choice-yes}% both to ICE and JC * Submit a bulk of 1000 jobs -> success 99.9% %ICON{choice-yes}% both to ICE and JC * =Perusal= jobs through: * JC work: %ICON{choice-yes}% * ICE work: %ICON{choice-yes}% * =MPICH= jobs: %ICON{choice-no}% ---+++ Cancel * Normal jobs * ICE: %ICON{choice-yes}% * JC: %ICON{choice-yes}% * Dag: %ICON{choice-yes}% * Note that children nodes in status %ORANGE% 'submitted' %ENDCOLOR% don't get cancelled * Collection * ICE: %ICON{choice-yes}% * JC: %ICON{choice-yes}% * Node of a collection: %ICON{choice-yes}% * Note: collections stay in status %ORANGE% 'waiting'%ENDCOLOR% when all the nodes are Done (Success) except for one that is 'Cancelled' ---+++ Others * =BrokerInfo= * ICE creation %ICON{choice-yes}% * JC creation: %ICON{choice-yes}% * =Resubmission= * Shallow: %ICON{choice-yes}% * Deep: %ICON{choice-yes}% * =Job Recovery= * Tested with a few collections re-starting the wm while some node jobs are still in a 'submitted or 'waiting' status %ICON{choice-yes}% * =Prologue= and =Epilogue= jobs * ICE: %ICON{choice-yes}% * JC: %ICON{choice-yes}% ------- ------- ---++ Check bugs: * Bug [[https://savannah.cern.ch/bugs/?42288][#42288]]: Problem in forwarding cerequirements to a CREAM CE %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?48910][#48910]]: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts %GREEN%FIXED%ENDCOLOR% * Stopped gLite services and deleted the jobdir under '/var/glite/workload_manager'<verbatim> [root@wms007 jobdir]# service gLite stop [...] [root@wms007 workload_manager]# pwd /var/glite/workload_manager [root@wms007 workload_manager]# ls ismdump.fl jobdir [root@wms007 workload_manager]# rm -rf jobdir [root@wms007 workload_manager]# ls ismdump.fl </verbatim> * re-started the LM service checking that the jobdir gets recreated <verbatim> [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start Starting LogMonitor... [ OK ] [root@wms007 workload_manager]# ls ismdump.fl jobdir [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status Logmonitor running...</verbatim> * Stopped gLite services and deleted the jobdir under '/var/glite/jobcontrol'<verbatim> [root@wms007 jobcontrol]# pwd /var/glite/jobcontrol [root@wms007 jobcontrol]# rm -rf jobdir [root@wms007 jobcontrol]# ls condorio submit</verbatim> * re-started the JC service checking that the jobdir gets recreated <verbatim> [root@wms007 jobcontrol]# /opt/glite/etc/init.d/glite-wms-jc start JobController Starting JobController daemon(s) Starting JobController... [ OK ] [root@wms007 jobcontrol]# ls condorio jobdir lock submit [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-jc status JobController JobController running in pid: 3625</verbatim> * Stopped gLite services and deleted the jobdir under '/var/glite/ice'<verbatim> [root@wms007 ice]# pwd /var/glite/ice [root@wms007 ice]# ls jobdir persist_dir [root@wms007 ice]# rm -rf jobdir/ [root@wms007 ice]# ls persist_dir</verbatim> * re-started the ICE service checking that the jobdir gets recreated <verbatim> [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ice]# ls jobdir persist_dir [root@wms007 ice]# /opt/glite/etc/init.d/glite-wms-ice status /opt/glite/bin/glite-wms-ice-safe (pid 22783) is running...</verbatim> * Stopped gLite services and deleted all the jobdirs<verbatim> [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/ ice/: persist_dir jobcontrol/: condorio submit workload_manager/: ismdump.fl</verbatim> * re-started the WM service checking that all the jobdirs get recreated <verbatim> [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm start starting workload manager... ok [root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/ ice/: jobdir persist_dir jobcontrol/: condorio jobdir submit workload_manager/: ismdump.fl jobdir [root@wms007 glite]# /opt/glite/etc/init.d/glite-wms-wm status /opt/glite/bin/glite-wms-workload_manager (pid 23259) is running...</verbatim> * Comment Input/InputType parameter in wms conf file (Sections: ICE, !WorkloadManager and !JobController). * Try to start !JobController:<verbatim> [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc start JobController Starting !JobController daemon(s) Please set Input parameter in glite_wms.conf - JC section [FAILED] [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc status JobController JobController stopped.</verbatim> * Try to start !LogMonitor:<verbatim> [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start Starting LogMonitor...Please set Input parameter in glite_wms.conf - WM section [FAILED] [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status LogMonitor stopped.</verbatim> * Try to start ICE:<verbatim> [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... failure [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice status /opt/glite/bin/glite-wms-ice-safe is not running</verbatim> * Try to start !WorkloadManager:<verbatim> [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm start starting workload manager... Please set Input parameter in - WM section Please set DispatcherType parameter in - WM section Please set Input parameter in - JC section Please set InputType parameter in - JC section Please set Input parameter in - ICE section Please set InputType parameter in - ICE section failure [root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm status /opt/glite/bin/glite-wms-workload_manager is not running</verbatim> * Bug [[https://savannah.cern.ch/bugs/?52934][#52934]]: [ICE] Delegation in ICE doesn't refer to the myproxy server %GREEN%FIXED%ENDCOLOR% * !GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg * Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit] * Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B * Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle * !MyProxyServer = "myproxy.cern.ch"; * !GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og * Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit] * Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B * Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle * !MyProxyServer = "myproxy.cnaf.infn.it"; * Bug [[https://savannah.cern.ch/bugs/?53460][#53460]]: [ICE] Detection of job status changes for CREAM jobs should be improved %GREEN%FIXED%ENDCOLOR% * Using a new CE (1.6) looking in ice's log there is:<verbatim> 2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2] 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] Database ID=[1261041182000] 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] Exec time ID=[3] 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]. 2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] EventID [685143] timestsamp [1269272804] 2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182</verbatim> * Using an "old" CE instead the "poller" method is used:<verbatim> 2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888 2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() - TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]] 2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() - TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller. 2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2] 2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...</verbatim> * Bug [[https://savannah.cern.ch/bugs/?55103][#55103]]: [ICE] ICE port 7010 not cleaned up properly %GREEN%FIXED%ENDCOLOR% * We try a stop/start/restart sequence<verbatim> [root@wms007 ~]# ps ax | grep ice 1283 pts/2 S+ 0:00 grep ice 30985 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 30989 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 30990 ? Sl 0:15 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 1321 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 1353 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 1357 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 1358 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 1398 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice restart stopping ICE... ok starting ICE... ok [root@wms007 ~]# ps ax | grep ice 1433 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 1437 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 1438 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 1470 pts/2 S+ 0:00 grep ice</verbatim> * Bug [[https://savannah.cern.ch/bugs/?55452][#55452]]: CMS production struck by waves of "Globus error 10: data transfer to the server failed" %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?56636][#56636]]: [ICE] statistics counters for monitoring %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?57295][#57295]]: [ICE] queryDb tool may create empty DB as root %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> [root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db -rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db [root@wms007 ~]# /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING 0 item(s) found [root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db -rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db</verbatim> * Bug [[https://savannah.cern.ch/bugs/?57579][#57579]]: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process %GREEN%HOPEFULLY FIXED%ENDCOLOR% * Verify:<verbatim> [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice status /opt/glite/bin/glite-wms-ice-safe (pid 1433) is running... [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 19866 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 19899 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 19903 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 19904 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 19932 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice stop stopping ICE... ok [root@wms007 ~]# ps ax | grep ice 19978 pts/2 S+ 0:00 grep ice [root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-ice start starting ICE... ok [root@wms007 ~]# ps ax | grep ice 20009 ? Ss 0:00 /opt/glite/bin/glite-wms-ice-safe --conf glite_wms.conf --daemon /var/glite/glite-wms-ice-safe.pid 20013 ? S 0:00 sh -c /opt/glite/bin/glite-wms-ice --conf glite_wms.conf > /var/log/glite/ice_console.log 2>&1 20014 ? Sl 0:00 /opt/glite/bin/glite-wms-ice --conf glite_wms.conf 20046 pts/2 S+ 0:00 grep ice</verbatim> * Bug [[https://savannah.cern.ch/bugs/?57596][#57596]]: [ICE] non resubmission if job failed for proxy expiration %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> 2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] 2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded. 2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED! 2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...) 2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]</verbatim> * Bug [[https://savannah.cern.ch/bugs/?58099][#58099]]: WMS purger forces purge of jobs if LB cannot be reached %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?58387][#58387]]: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy %GREEN%FIXED%ENDCOLOR% * Verify:<verbatim> ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA Current Status: Aborted Logged Reason(s): - Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed Status Reason: Input sandbox's proxy is missing. Cannot resubmit job Destination: ce202.cern.ch:8443/cream-lsf-grid_dteam Submitted: Tue Mar 23 09:49:42 2010 CET *************************************************************</verbatim> * Bug [[https://savannah.cern.ch/bugs/?58977][#58977]]: [ICE] Wrong database colum name in ICE SQL query %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?59240][#59240]]: [ICE] abort reasons not always printed in its logfile %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?59339][#59399]]: [ICE] doesn't correctly handle request in jobdir/old when it is restarted %GREEN%FIXED%ENDCOLOR% * Verify submitting a big collection to cream CEs, and then restarting ICE in the middle of the submit process:<verbatim> 2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() - TID=[168434952] Going to START CreamJobID [https://cream -32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...</verbatim> * restarting ice... <verbatim> 2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579] 2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem] 2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads 2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads [...] 2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() - TID=[144321160] This request is a Submission... 2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() - TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9 000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it... 2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() - TID=[144321160] Going to START CreamJobID [https://cream -32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]... 2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C 3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...) 2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9 000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]</verbatim> * Bug [[https://savannah.cern.ch/bugs/?59453][#59453]]: [ICE] polling needs to be improved %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?60668][#60668]]: [ICE] does not respect LB server/proxy selection through the LBproxy attribute %GREEN%FIXED%ENDCOLOR% * Set *LBProxy = false;* in _glite_wms.conf_ (section Common), restart ice and submit...<verbatim> mysql> select * from events where jobid="YFyqjw3FF-BO-0U5BxCOtA"; +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ | jobid | event | code | prog | host | time_stamp | userid | usec | level | arrived | +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ | YFyqjw3FF-BO-0U5BxCOtA | 0 | 5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 394848 | 8 | 2010-03-24 12:04:39 | | YFyqjw3FF-BO-0U5BxCOtA | 1 | 15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 548652 | 8 | 2010-03-24 12:04:39 | | YFyqjw3FF-BO-0U5BxCOtA | 2 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 608084 | 8 | 2010-03-24 12:04:39 | | YFyqjw3FF-BO-0U5BxCOtA | 3 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 657231 | 8 | 2010-03-24 12:04:39 | +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ 4 rows in set (0.00 sec)</verbatim> * * Set *LBProxy = true;* in _glite_wms.conf_ (section Common), restart ice and submit...<verbatim> mysql> select * from events where jobid="SlKOGSnaW0oKO3TJqw9tbA"; +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ | jobid | event | code | prog | host | time_stamp | userid | usec | level | arrived | +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ | SlKOGSnaW0oKO3TJqw9tbA | 0 | 17 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 342720 | 8 | 2010-03-24 12:09:53 | | SlKOGSnaW0oKO3TJqw9tbA | 1 | 21 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 470416 | 8 | 2010-03-24 12:09:53 | | SlKOGSnaW0oKO3TJqw9tbA | 2 | 21 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 526402 | 8 | 2010-03-24 12:09:53 | | SlKOGSnaW0oKO3TJqw9tbA | 3 | 2 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 606511 | 8 | 2010-03-24 12:09:54 | | SlKOGSnaW0oKO3TJqw9tbA | 4 | 4 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 712100 | 8 | 2010-03-24 12:09:54 | | SlKOGSnaW0oKO3TJqw9tbA | 5 | 4 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | 3f82b966e8a77413044be1a9144a4af4 | 43631 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 6 | 5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 167414 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 7 | 15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 297333 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 8 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 369636 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 9 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 431565 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 10 | 5 | JobController | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 745052 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 11 | 1 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 846002 | 8 | 2010-03-24 12:09:55 | | SlKOGSnaW0oKO3TJqw9tbA | 12 | 1 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:10:04 | bdd27610035bb0ec9287e2ecaa3da2eb | 869424 | 8 | 2010-03-24 12:10:04 | | SlKOGSnaW0oKO3TJqw9tbA | 13 | 8 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 94855 | 8 | 2010-03-24 12:11:39 | | SlKOGSnaW0oKO3TJqw9tbA | 14 | 25 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 181448 | 8 | 2010-03-24 12:11:39 | | SlKOGSnaW0oKO3TJqw9tbA | 15 | 10 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 250291 | 8 | 2010-03-24 12:11:39 | +------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+ 16 rows in set (0.00 sec)</verbatim> * Bug [[https://savannah.cern.ch/bugs/?61312][#61312]]: [ICE] Error in handling user dn in ICE's poller %GREEN%FIXED%ENDCOLOR% * Submit 5 jobs to an _old_ !CreamCE (Cream 1.5) setting !MyProxyServer attribute:<verbatim> 2010-03-24 13:40:38,128 ERROR - iceCommandEventQuery::execute() - TID=[159321352] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-33.pd.infn.it</ns2:hostname>]] 2010-03-24 13:40:38,128 WARN - iceCommandEventQuery::execute() - TID=[159321352] Not present QueryEvent on CE [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller. 2010-03-24 13:40:38,128 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2] 2010-03-24 13:40:38,128 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-33.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]... 2010-03-24 13:40:38,129 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Finished collecting jobs to poll. [5] jobs are to poll [...]</verbatim> * And so:<verbatim> [ale@cream-15 UI]$ glite-wms-job-status -v 0 -i testo --noint ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/tt3GLYuIiHuwrmnl7fGVtA Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/lY9fdOgQk5RcaH99g23z5g Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/jta5KlBZEP-r2KbE0SB0Vw Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/TNqI_PbRyqgFAN3L52IpKQ Current Status: Done (Success) ************************************************************* BOOKKEEPING INFORMATION: Status info for the Job : https://devel17.cnaf.infn.it:9000/V7Pnv2yE47CdHKgQmRaIvQ Current Status: Done (Success) *************************************************************</verbatim> * Bug [[https://savannah.cern.ch/bugs/?61405][#61405]]: [ICE] Missing proxy validity evaluation in ICE %ORANGE% *NOT TESTED* %ENDCOLOR% * Bug [[https://savannah.cern.ch/bugs/?61413][#61413]]: [ICE] should not call !EventQuery for a userDN if he/she doesn't have active jobs %GREEN%FIXED%ENDCOLOR% * Submit a job to a !CreamCE and wait until it finished. * Submit another job to a _different_ !CreamCE, you should not see any query to the previous used !CreamCE. * Bug [[https://savannah.cern.ch/bugs/?61748][#61748]]: [ICE] !EventQuery/Polling must be done also to blacklisted CE %GREEN%FIXED%ENDCOLOR% * Submit some jobs to a !CreamCEVerify * Trigger a socket timeout so that ICE blacklisted the !CreamCE :<verbatim> 2010-03-24 15:58:40,753 ERROR - CreamProxyMethod::execute() - Connection timed out to CREAM: "EOF detected during communicati on. Probably service closed connection or SOCKET TIMEOUT occurred." on try 3/3. Blacklisting endpoint and giving up. 2010-03-24 15:58:40,753 DEBUG - CEBlackList::blacklist_endpoint() - Blacklisting CE https://cream-25.pd.infn.it:8443/ce-cream /services/gridsite-delegation until Wed Mar 24 16:08:40 2010</verbatim> * Verify that the !QueryEvent commad is called in any case:<verbatim> 2010-03-24 16:05:28,952 DEBUG - eventStatusPoller::body() - Adding EventQuery command for couple (/C=IT/O=INFN/OU=Personal Ce rtificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL, https://cream-25.pd.infn.it:8443/ce-cream/services/C REAM2) to the thread pool...</verbatim> * Instead a submission fails:<verbatim> 2010-03-24 15:58:43,265 DEBUG - Delegation_manager::delegate() - Creating new delegation with delegation id [12694427232E2651 16wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream -25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dte am/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] Expiring on [Thu Mar 25 12:54:02 2010] 2010-03-24 15:58:43,265 DEBUG - CEBlackList::is_blacklisted() - CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsit e-delegation is blacklisted until Wed Mar 24 16:08:40 2010 2010-03-24 15:58:43,265 ERROR - Delegation_manager::delegate() - FAILED Creation of a new delegation with delegation id [1269 4427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [h ttps://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova /CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio G ianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] - ERROR is: [The endpoint is blacklisted] 2010-03-24 15:58:43,265 ERROR - iceCommandSubmit::execute() - TID=[159308760] Error during submission of jdl= Fatal Exceptio n is:Failed to create a delegation id for job https://devel17.cnaf.infn.it:9000/UoVsvjIj1CPluHb81xM_pQ: reason is The endpoin t is blacklisted</verbatim> * Bug [[https://savannah.cern.ch/bugs/?63989][#63989]]: [ICE] doesn't handle exception raised by jobDir::new_entries() %GREEN%FIXED%ENDCOLOR% * Change the permission of the _new_ directory in jobdir:<verbatim> [root@wms007 jobdir]# chmod 111 new/ [root@wms007 jobdir]# ls -l total 48 d--x--x--x 2 glite glite 40960 Mar 24 16:13 new drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 old drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 tmp</verbatim> * Look in ICE's log:<verbatim> 2010-03-25 09:45:39,545 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied 2010-03-25 09:45:40,546 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied</verbatim> * Bug [[https://savannah.cern.ch/bugs/?64698][#64698]]: jobwrapper max osb limit should be considered only if the gridftp server is the wms %ORANGE% *NOT TESTED* %ENDCOLOR% -- Main.AlessioGianelle - 2010-02-05
Edit
|
Attach
|
PDF
|
H
istory
:
r75
|
r37
<
r36
<
r35
<
r34
|
B
acklinks
|
V
iew topic
|
More topic actions...
Topic revision: r35 - 2010-03-25
-
AlessioGianelle
Home
Site map
CEMon web
CREAM web
Cloud web
Cyclops web
DGAS web
EgeeJra1It web
Gows web
GridOversight web
IGIPortal web
IGIRelease web
MPI web
Main web
MarcheCloud web
MarcheCloudPilotaCNAF web
Middleware web
Operations web
Sandbox web
Security web
SiteAdminCorner web
TWiki web
Training web
UserSupport web
VOMS web
WMS web
WMSMonitor web
WeNMR web
EgeeJra1It Web
Create New Topic
Index
Search
Changes
Notifications
RSS Feed
Statistics
Preferences
View
Raw View
Print version
Find backlinks
History
More topic actions
Edit
Raw edit
Attach file or image
Edit topic preference settings
Set new parent
More topic actions
Account
Log In
Edit
Attach
Copyright © 2008-2022 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki?
Send feedback