Certification report patch 3621
Author(s): Elisabetta Molinari & Alessio Gianelle
Outcome:
in certification...
Clean installation
Upgrade from production
Test Report
List Match
- without data:
- with data:
Submission/GetOutput
-
Normal
jobs through
- ICE work:
- JC work:
-
Dag
jobs through:
- JC work:
-
Collection
jobs through:
- ICE work:
- JC work:
- also job-output for collections works even though only the parent node is set to 'Cleared'
-
Parametric
jobs through:
- ICE work:
- JC work:
-
Bulk
jobs sent both through ICE and JC and RetryCount = 0; :
- Submit a bulk of 3 jobs -> success 100% both to ICE and JC
- Submit a bulk of 50 jobs -> success 100% both to ICE and JC
- Submit a bulk of 100 jobs -> success 100% both to ICE and JC
- Submit a bulk of 500 jobs -> success 99.9% both to ICE and JC
- Submit a bulk of 1000 jobs -> success 99.9% both to ICE and JC
-
Perusal
jobs through:
- JC work:
- ICE work:
-
MPICH
jobs:
Cancel
- Normal jobs
- ICE:
- JC:
- Dag:
- Note that children nodes in status 'submitted' don't get cancelled
- Collection
- ICE:
- JC:
- Node of a collection:
- Note: collections stay in status 'waiting' when all the nodes are Done (Success) except for one that is 'Cancelled'
Others
-
BrokerInfo
- ICE creation
- JC creation:
-
Resubmission
- Shallow:
- Deep:
-
Job Recovery
- Tested with a few collections re-starting the wm while some node jobs are still in a 'submitted or 'waiting' status
-
Prologue
and Epilogue
jobs
- ICE:
- JC:
Check bugs
- Bug #42288: Problem in forwarding cerequirements to a CREAM CE NOT TESTED
- Bug #48910: Failure starting LM if its output jobdir doesn't exist; unprotected chown in WM/LM/JC startup scripts FIXED
- Stopped gLite services and deleted the jobdir under '/var/glite/workload_manager'
[root@wms007 jobdir]# service gLite stop
[...]
[root@wms007 workload_manager]# pwd
/var/glite/workload_manager
[root@wms007 workload_manager]# ls
ismdump.fl jobdir
[root@wms007 workload_manager]# rm -rf jobdir
[root@wms007 workload_manager]# ls
ismdump.fl
- Stopped gLite services and deleted the jobdir under '/var/glite/jobcontrol'
[root@wms007 jobcontrol]# pwd
/var/glite/jobcontrol
[root@wms007 jobcontrol]# rm -rf jobdir
[root@wms007 jobcontrol]# ls
condorio submit
- Stopped gLite services and deleted the jobdir under '/var/glite/ice'
[root@wms007 ice]# pwd
/var/glite/ice
[root@wms007 ice]# ls
jobdir persist_dir
[root@wms007 ice]# rm -rf jobdir/
[root@wms007 ice]# ls
persist_dir
- Stopped gLite services and deleted all the jobdirs
[root@wms007 glite]# ls workload_manager/ jobcontrol/ ice/
ice/:
persist_dir
jobcontrol/:
condorio submit
workload_manager/:
ismdump.fl
- Comment Input/InputType parameter in wms conf file (Sections: ICE, WorkloadManager and JobController).
- Try to start JobController:
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc start JobController
Starting !JobController daemon(s)
Please set Input parameter in glite_wms.conf - JC section [FAILED]
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-jc status JobController
JobController stopped.
- Try to start LogMonitor:
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm start
Starting LogMonitor...Please set Input parameter in glite_wms.conf - WM section
[FAILED]
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-lm status
LogMonitor stopped.
- Try to start ICE:
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice start
starting ICE... failure
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-ice status
/opt/glite/bin/glite-wms-ice-safe is not running
- Try to start WorkloadManager:
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm start
starting workload manager... Please set Input parameter in - WM section
Please set DispatcherType parameter in - WM section
Please set Input parameter in - JC section
Please set InputType parameter in - JC section
Please set Input parameter in - ICE section
Please set InputType parameter in - ICE section
failure
[root@wms007 workload_manager]# /opt/glite/etc/init.d/glite-wms-wm status
/opt/glite/bin/glite-wms-workload_manager is not running
- Bug #52934: [ICE] Delegation in ICE doesn't refer to the myproxy server FIXED
- GridJobID: https://devel17.cnaf.infn.it:9000/dj8r_iFRd8tnWH4bThPNeg
- Deleg Proxy ID = [12692524052E32526wms0072Ecnaf2Einfn2Eit]
- Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
- Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
- MyProxyServer = "myproxy.cern.ch";
- GridJobID: https://devel17.cnaf.infn.it:9000/UNB2dHJNn7euaDP3FvJ3og
- Deleg Proxy ID = [12692523642E948823wms0072Ecnaf2Einfn2Eit]
- Destination: cream-30.pd.infn.it:8443/cream-pbs-cream_B
- Owner = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle
- MyProxyServer = "myproxy.cnaf.infn.it";
- Bug #53460: [ICE] Detection of job status changes for CREAM jobs should be improved FIXED
- Using a new CE (1.6) looking in ice's log there is:
2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269272870.288498 1269272870.496129 0.207631
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] There're [2] event(s) for the couple DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2]
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] Database ID=[1261041182000]
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::execute() - TID=[150673032] Exec time ID=[3]
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] Processing [2] event(s) for Job [gridJobID="https://devel17.cnaf.infn.it:9000/uKbQNcbh7kIohBz6bDMNZQ" CREAMJobID="https://cream-30.pd.infn.it:8443/CREAM396193798"] userdn [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] and ce url [https://cream-30.pd.infn.it:8443/ce-cream/services/CREAM2].
2010-03-22 16:47:50,496 DEBUG - iceCommandEventQuery::processEventsForJob() - TID=[150673032] EventID [685143] timestsamp [1269272804]
2010-03-22 16:47:50,496 INFO - scoped_timer iceCommandEventQuery::processSingleEvent - TID=[150673032] InsertStat 1269272870.496682 1269272870.496864 0.000182
- Using an "old" CE instead the "poller" method is used:
2010-03-22 16:55:55,397 INFO - scoped_timer iceCommandEventQuery::execute() - SOAP Connection for QueryEvent - TID=[150673032] 1269273355.242918 1269273355.397806 0.154888
2010-03-22 16:55:55,397 ERROR - iceCommandEventQuery::execute() - TID=[150673032] Cannot query events for UserDN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] CEUrl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Exception Internal ex is [Received NULL fault; the error is due to another cause: FaultString=[No such operation 'QueryEventRequest'] - FaultCode=["http://xml.apache.org/axis/":Client] - FaultSubCode=["http://xml.apache.org/axis/":Client] - FaultDetail=[<ns2:hostname>cream-34.pd.infn.it</ns2:hostname>]]
2010-03-22 16:55:55,398 WARN - iceCommandEventQuery::execute() - TID=[150673032] Not present QueryEvent on CE [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. Falling back to old-style StatusPoller.
2010-03-22 16:55:55,398 INFO - iceCommandStatusPoller::execute() - Getting [100] jobs to poll for user [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl [https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]
2010-03-22 16:55:55,398 DEBUG - iceCommandStatusPoller::get_jobs_to_poll() - Collecting jobs to poll for userdn=[/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] creamurl=[https://cream-34.pd.infn.it:8443/ce-cream/services/CREAM2]. LIMIT set to [100]...
- Bug #55103: [ICE] ICE port 7010 not cleaned up properly FIXED
- Bug #55452: CMS production struck by waves of "Globus error 10: data transfer to the server failed" NOT TESTED
- Bug #56636: [ICE] statistics counters for monitoring FIXED
- Bug #57295: [ICE] queryDb tool may create empty DB as root FIXED
- Verify:
[root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db
-rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
[root@wms007 ~]# /opt/glite/bin/queryDb -c glite_wms.conf -s RUNNING,REALLY_RUNNING
0 item(s) found
[root@wms007 ~]# ll /var/glite/ice/persist_dir/ice.db
-rw-r--r-- 1 glite glite 1280000 Mar 22 17:05 /var/glite/ice/persist_dir/ice.db
- Bug #57579: [ICE] Occasionally the ICE's start/stop script doesn't kill the ICE process HOPEFULLY FIXED
- Bug #57596: [ICE] non resubmission if job failed for proxy expiration FIXED
- Verify:
2010-03-23 10:20:37,696 INFO - iceLBLogger::logEvent() - Job Done Failed Event, ExitCode=[0], FailureReason=[Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
2010-03-23 10:20:37,817 DEBUG - iceLBContext::testCode() - L&B call succeeded.
2010-03-23 10:20:37,828 ERROR - Ice::resubmit_job() - Will NOT resubmit job [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"] because it's Input Sandbox proxy file is not valid: The proxy is EXPIRED!
2010-03-23 10:20:37,828 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
2010-03-23 10:20:37,828 INFO - iceLBLogger::logEvent() - Job Aborted Event, reason=[Input sandbox's proxy is missing. Cannot resubmit job] - [gridJobID="https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA" CREAMJobID="https://ce202.cern.ch:8443/CREAM030114428"]
- Bug #58099: WMS purger forces purge of jobs if LB cannot be reached FIXED
- Stop the LBServer and then run the cron purger:
07 Apr, 16:09:13 -E: [Error] query_job_status(purger.cpp:125): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg:: edg_wll_JobStat [111] Connection refused(edg_wll_gss_connect())
[glite@wms007 ~]$
- Verify that the SandBox dir has not been removed:
[glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
total 16
drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:34 input
drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:46 output
drwxrwx--- 2 dteam008 glite 4096 Apr 6 14:34 peek
lrwxrwxrwx 1 glite glite 102 Apr 6 14:34 user.proxy -> /var/glite/SandboxDir/Uo/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fUow8XY0NGbyumU3PPGMSng/user.proxy
- Restart LBServer and verify that now the SBD of the job is purged:
[glite@wms007 ~]$ /opt/glite/sbin/glite-wms-purgeStorage.sh -p /var/glite/SandboxDir/ye -t 10000
07 Apr, 16:18:07 -I: [Info] operator()(purger.cpp:449): https://devel17.cnaf.infn.it:9000/yeoXs2eB1kvOaPp0Mtjthg: removed DONE job
[glite@wms007 ~]$ ls -l /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/
ls: /var/glite/SandboxDir/ye/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fyeoXs2eB1kvOaPp0Mtjthg/: No such file or directory
- Bug #58387: [ICE] should log a job aborted when it cannot resubmit the job for missing user proxy FIXED
- Verify:
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://devel17.cnaf.infn.it:9000/jw2aeAy1skHY3mRJHCF8YA
Current Status: Aborted
Logged Reason(s):
- Proxy is expired; /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) /opt/glite/bin/glite-lb-logevent: edg_wll_LogEvent*(): LB server (bkserver,lbproxy) store protocol error (edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused) Proxy expired: job killed Terminated Master process killed
Status Reason: Input sandbox's proxy is missing. Cannot resubmit job
Destination: ce202.cern.ch:8443/cream-lsf-grid_dteam
Submitted: Tue Mar 23 09:49:42 2010 CET
*************************************************************
- Bug #58977: [ICE] Wrong database colum name in ICE SQL query FIXED
- Submit some jobs to a ce (e.g. cream-25.pd.infn.it:8443/cream-lsf-testbedB_2):
[root@wms007 20100409]# queryDb -v -C -G
[https://cream-25.pd.infn.it:8443/CREAM881525184] [https://devel17.cnaf.infn.it:9000/V2Lj0_-XWkrjRKMaf3f6ng]
[https://cream-25.pd.infn.it:8443/CREAM425827870] [https://devel17.cnaf.infn.it:9000/4TKQu1U_daCMMb2mRDR2cA]
[https://cream-25.pd.infn.it:8443/CREAM543141647] [https://devel17.cnaf.infn.it:9000/4wqozVNHEVUXWx5UzUaXqA]
[https://cream-25.pd.infn.it:8443/CREAM769586568] [https://devel17.cnaf.infn.it:9000/XLcaGE3kR3h8-oj8cMWE_A]
[https://cream-25.pd.infn.it:8443/CREAM192029588] [https://devel17.cnaf.infn.it:9000/PuOGoOxMf-pbfFu-wSkACw]
[https://cream-25.pd.infn.it:8443/CREAM378177464] [https://devel17.cnaf.infn.it:9000/T8ZKSu5zZPZY-Ee1gLRX5A]
[https://cream-25.pd.infn.it:8443/CREAM299069473] [https://devel17.cnaf.infn.it:9000/Xh1AMEor9hWOx4picngYkA]
[https://cream-25.pd.infn.it:8443/CREAM012571708] [https://devel17.cnaf.infn.it:9000/YjpCU6dfrLsDBs6wU_D3Hg]
[https://cream-25.pd.infn.it:8443/CREAM561236418] [https://devel17.cnaf.infn.it:9000/00Qc7RnutRORYVOd0ShIKg]
[https://cream-25.pd.infn.it:8443/CREAM972351884] [https://devel17.cnaf.infn.it:9000/ksz80OflJnDE_ynWHmKTwQ]
[https://cream-25.pd.infn.it:8443/CREAM827240561] [https://devel17.cnaf.infn.it:9000/WwfftKdV6_5lSgihPOUsaA]
[https://cream-25.pd.infn.it:8443/CREAM573497695] [https://devel17.cnaf.infn.it:9000/S5zbkyK72hv2LXwUD1vAFw]
[https://cream-25.pd.infn.it:8443/CREAM735112819] [https://devel17.cnaf.infn.it:9000/0J9nTy1tJxkcuRJ9oTZACw]
[https://cream-25.pd.infn.it:8443/CREAM526570551] [https://devel17.cnaf.infn.it:9000/Rcl0TypyUTXMwtLk86R3yA]
[https://cream-25.pd.infn.it:8443/CREAM992848449] [https://devel17.cnaf.infn.it:9000/xfH1fkIroQwNvlVBdn8N5A]
[https://cream-25.pd.infn.it:8443/CREAM944698480] [https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg]
[https://cream-25.pd.infn.it:8443/CREAM729677007] [https://devel17.cnaf.infn.it:9000/FIZA1Mjb4moUNel1N7UXvw]
[https://cream-25.pd.infn.it:8443/CREAM589660323] [https://devel17.cnaf.infn.it:9000/5DJlLG7M0v3C_-WMDKSdXQ]
[https://cream-25.pd.infn.it:8443/CREAM994745139] [https://devel17.cnaf.infn.it:9000/T_UdwnjC55dIVrPxJOVvmg]
[https://cream-25.pd.infn.it:8443/CREAM228224655] [https://devel17.cnaf.infn.it:9000/URw39mrv7jj-buJ3KDza8w]
[https://cream-25.pd.infn.it:8443/CREAM397635733] [https://devel17.cnaf.infn.it:9000/f3FOGwNoWpHyWkxO_87AIg]
[https://cream-25.pd.infn.it:8443/CREAM510341828] [https://devel17.cnaf.infn.it:9000/vEfH5j5_5R_7jFNrntEsog]
[https://cream-25.pd.infn.it:8443/CREAM788890890] [https://devel17.cnaf.infn.it:9000/y0IVYbdR_UWbTmrXY5O8fA]
------------------------------------------------
23 item(s) found
- Check also the db_id registered in the ice's database
[root@wms007 20100409]# sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
1270820425000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2
- Stop the cream CE. Drop its database. Create a new empty one. Restart the CE.
- Check what happen in the ice log file:
2010-04-09 16:16:53,953 WARN - iceCommandEventQuery::execute() - TID=[150150560] *** CREAM HAS PROBABLY BEEN SCRATCHED. GOING TO ERASE ALL JOBS RELATED TO OLD DB_ID [1270820425000] ***
- Check if there are jobs in the Ice's database:
[root@wms007 persist_dir]# queryDb -v -C -G
------------------------------------------------
0 item(s) found
- and if the db_id has been changed:
[root@wms007 persist_dir]# sqlite3 /var/glite/ice/persist_dir/ice.db "SELECT db_id, ceurl from ce_dbid;"
1270822483000|https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2
- Look at the status of a job that has been removed:
Status info for the Job : https://devel17.cnaf.infn.it:9000/xjiyHJo3rkUsXXHHe0s6yg
Current Status: Aborted
Logged Reason(s):
- job completed
Status Reason: CREAM'S database has been scratched and all its jobs have been lost
Destination: cream-25.pd.infn.it:8443/cream-lsf-testbedB_2
Submitted: Fri Apr 9 16:11:34 2010 CEST
- Bug #59240: [ICE] abort reasons not always printed in its logfile NOT TESTED
- Bug #59399: [ICE] doesn't correctly handle request in jobdir/old when it is restarted FIXED
- Verify submitting a big collection to cream CEs, and then restarting ICE in the middle of the submit process:
2010-03-23 15:55:43,604 DEBUG - iceCommandSubmit::try_to_submit() - TID=[168434952] Going to START CreamJobID [https://cream
-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
- restarting ice...
2010-03-23 15:55:45,760 DEBUG - ICE VersionID is [Fri Mar 19 13:53:17 CET 2010] ProcessID=[23579]
2010-03-23 15:55:45,760 INFO - glite-wms-ice::main() - Host certificate is [/home/glite/.certs/hostcert.pem]
2010-03-23 15:55:45,817 DEBUG - iceThreadPool::iceThreadPool(ICE Submission Pool) - Creating 10 worker threads
2010-03-23 15:55:45,819 DEBUG - iceThreadPool::iceThreadPool(ICE Poller Pool) - Creating 5 worker threads
[...]
2010-03-23 15:55:48,967 INFO - iceCommandSubmit::execute() - TID=[144321160] This request is a Submission...
2010-03-23 15:55:48,968 INFO - iceCommandSubmit::try_to_submit() - TID=[144321160] GridJobID [https://devel17.cnaf.infn.it:9
000/iM8C3YV12fwhvIG5mNip5Q] has already been REGISTERED. Will only START it...
2010-03-23 15:55:48,968 DEBUG - iceCommandSubmit::try_to_submit() - TID=[144321160] Going to START CreamJobID [https://cream
-32.pd.infn.it:8443/CREAM036926381] related to GridJobID [https://devel17.cnaf.infn.it:9000/iM8C3YV12fwhvIG5mNip5Q]...
2010-03-23 15:55:49,154 INFO - iceLBContext::setLoggingJob - Setting log job to jobid=[https://devel17.cnaf.infn.it:9000/iM8C
3YV12fwhvIG5mNip5Q] LB server=[devel17.cnaf.infn.it:9000] (port is not used, actually...)
2010-03-23 15:55:49,155 INFO - iceLBLogger::logEvent() - Cream Transfer OK Event - [gridJobID="https://devel17.cnaf.infn.it:9
000/iM8C3YV12fwhvIG5mNip5Q" CREAMJobID="https://cream-32.pd.infn.it:8443/CREAM036926381"]
- Bug #59453: [ICE] polling needs to be improved NOT TESTED
- Bug #60668: [ICE] does not respect LB server/proxy selection through the LBproxy attribute FIXED
- Set LBProxy = false; in glite_wms.conf (section Common), restart ice and submit...
mysql> select * from events where jobid="YFyqjw3FF-BO-0U5BxCOtA";
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| jobid | event | code | prog | host | time_stamp | userid | usec | level | arrived |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| YFyqjw3FF-BO-0U5BxCOtA | 0 | 5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 394848 | 8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA | 1 | 15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 548652 | 8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA | 2 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 608084 | 8 | 2010-03-24 12:04:39 |
| YFyqjw3FF-BO-0U5BxCOtA | 3 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:04:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 657231 | 8 | 2010-03-24 12:04:39 |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
4 rows in set (0.00 sec)
- * Set LBProxy = true; in glite_wms.conf (section Common), restart ice and submit...
mysql> select * from events where jobid="SlKOGSnaW0oKO3TJqw9tbA";
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| jobid | event | code | prog | host | time_stamp | userid | usec | level | arrived |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
| SlKOGSnaW0oKO3TJqw9tbA | 0 | 17 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 342720 | 8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA | 1 | 21 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 470416 | 8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA | 2 | 21 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:53 | 3f82b966e8a77413044be1a9144a4af4 | 526402 | 8 | 2010-03-24 12:09:53 |
| SlKOGSnaW0oKO3TJqw9tbA | 3 | 2 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 606511 | 8 | 2010-03-24 12:09:54 |
| SlKOGSnaW0oKO3TJqw9tbA | 4 | 4 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:54 | 3f82b966e8a77413044be1a9144a4af4 | 712100 | 8 | 2010-03-24 12:09:54 |
| SlKOGSnaW0oKO3TJqw9tbA | 5 | 4 | NetworkServer | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | 3f82b966e8a77413044be1a9144a4af4 | 43631 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 6 | 5 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 167414 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 7 | 15 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 297333 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 8 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 369636 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 9 | 4 | WorkloadManager | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 431565 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 10 | 5 | JobController | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 745052 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 11 | 1 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:09:55 | bdd27610035bb0ec9287e2ecaa3da2eb | 846002 | 8 | 2010-03-24 12:09:55 |
| SlKOGSnaW0oKO3TJqw9tbA | 12 | 1 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:10:04 | bdd27610035bb0ec9287e2ecaa3da2eb | 869424 | 8 | 2010-03-24 12:10:04 |
| SlKOGSnaW0oKO3TJqw9tbA | 13 | 8 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 94855 | 8 | 2010-03-24 12:11:39 |
| SlKOGSnaW0oKO3TJqw9tbA | 14 | 25 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 181448 | 8 | 2010-03-24 12:11:39 |
| SlKOGSnaW0oKO3TJqw9tbA | 15 | 10 | LogMonitor | wms007.cnaf.infn.it | 2010-03-24 12:11:39 | bdd27610035bb0ec9287e2ecaa3da2eb | 250291 | 8 | 2010-03-24 12:11:39 |
+------------------------+-------+------+-----------------+---------------------+---------------------+----------------------------------+--------+-------+---------------------+
16 rows in set (0.00 sec)
- Bug #61405: [ICE] Missing proxy validity evaluation in ICE NOT TESTED
- Bug #61413: [ICE] should not call EventQuery for a userDN if he/she doesn't have active jobs FIXED
- Submit a job to a CreamCE and wait until it finished.
- Submit another job to a different CreamCE, you should not see any query to the previous used CreamCE.
- Bug #61748: [ICE] EventQuery/Polling must be done also to blacklisted CE FIXED
- Submit some jobs to a CreamCEVerify
- Trigger a socket timeout so that ICE blacklisted the CreamCE :
2010-03-24 15:58:40,753 ERROR - CreamProxyMethod::execute() - Connection timed out to CREAM: "EOF detected during communicati
on. Probably service closed connection or SOCKET TIMEOUT occurred." on try 3/3. Blacklisting endpoint and giving up.
2010-03-24 15:58:40,753 DEBUG - CEBlackList::blacklist_endpoint() - Blacklisting CE https://cream-25.pd.infn.it:8443/ce-cream
/services/gridsite-delegation until Wed Mar 24 16:08:40 2010
- Verify that the QueryEvent commad is called in any case:
2010-03-24 16:05:28,952 DEBUG - eventStatusPoller::body() - Adding EventQuery command for couple (/C=IT/O=INFN/OU=Personal Ce
rtificate/L=Padova/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL, https://cream-25.pd.infn.it:8443/ce-cream/services/C
REAM2) to the thread pool...
- Instead a submission fails:
2010-03-24 15:58:43,265 DEBUG - Delegation_manager::delegate() - Creating new delegation with delegation id [12694427232E2651
16wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [https://cream
-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio
Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle-/dte
am/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] Expiring on [Thu Mar 25 12:54:02 2010]
2010-03-24 15:58:43,265 DEBUG - CEBlackList::is_blacklisted() - CE https://cream-25.pd.infn.it:8443/ce-cream/services/gridsit
e-delegation is blacklisted until Wed Mar 24 16:08:40 2010
2010-03-24 15:58:43,265 ERROR - Delegation_manager::delegate() - FAILED Creation of a new delegation with delegation id [1269
4427232E265116wms0072Ecnaf2Einfn2Eit] CREAM URL [https://cream-25.pd.infn.it:8443/ce-cream/services/CREAM2] Delegation URL [h
ttps://cream-25.pd.infn.it:8443/ce-cream/services/gridsite-delegation] user DN [/C=IT/O=INFN/OU=Personal Certificate/L=Padova
/CN=Alessio Gianelle-/dteam/Role=NULL/Capability=NULL] proxy hash [/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio G
ianelle-/dteam/Role=NULL/Capability=NULL] MyProxy Server [myproxy.cern.ch] - ERROR is: [The endpoint is blacklisted]
2010-03-24 15:58:43,265 ERROR - iceCommandSubmit::execute() - TID=[159308760] Error during submission of jdl= Fatal Exceptio
n is:Failed to create a delegation id for job https://devel17.cnaf.infn.it:9000/UoVsvjIj1CPluHb81xM_pQ: reason is The endpoin
t is blacklisted
- Bug #63989: [ICE] doesn't handle exception raised by jobDir::new_entries() FIXED
- Change the permission of the new directory in jobdir:
[root@wms007 jobdir]# chmod 111 new/
[root@wms007 jobdir]# ls -l
total 48
d--x--x--x 2 glite glite 40960 Mar 24 16:13 new
drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 old
drwxr-xr-x 2 glite glite 4096 Mar 24 16:13 tmp
- Look in ICE's log:
2010-03-25 09:45:39,545 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied
2010-03-25 09:45:40,546 ERROR - Request_source_jobdir::get_requests() - Error returned by method jobDir::new_entries(): boost::filesystem::directory_iterator constructor: "/var/glite/ice/jobdir/new": Permission denied
- Bug #64698: jobwrapper max osb limit should be considered only if the gridftp server is the wms FIXED only for LCG-CE
- Set MaxOutputSandboxSize = 10000000; in section WorkloadManager of file glite_wms.conf
- Submit a jdl with a file of more than 10Mb in the OutputSandbox parameter and set also the corresponding OutputSandboxDestURI parameter
- Check the output dir in the SE:
[root@devel18 tmp]# ls -lh
-rw-r--r-- 1 dteam044 dteam 50M Apr 7 16:23 bigfile
-rw-r--r-- 1 dteam044 dteam 646 Apr 7 16:23 ls.out
- If you don't set OutputSandboxDestURI in the jdl, than the SandBox dir in the WMS should contain a
.tail
file of less than 10Mb:
[root@wms007 persist_dir]# ls -lh /var/glite/SandboxDir/A1/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fA1cdkhzepvrCjiU_5fTaKbpg/output/
total 9.6M
-rw-r--r-- 1 dteam008 dteam 9.6M Apr 7 16:26 bigfile.tail
-rw-r--r-- 1 dteam008 dteam 637 Apr 7 16:26 ls.out
--
AlessioGianelle - 2010-02-05
This topic: EgeeJra1It
> WmsTestsP3621
Topic revision: r39 - 2010-04-09 - AlessioGianelle