---+ TESTs

---+++ 2009-03-25 (Danilo)

   1. <noautolink>MatchMaking</noautolink> with data. %GREEN%OK%ENDCOLOR%
      * Files used: <verbatim>
>lcg-lr lfn:test_e-2M.txt
srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/file881fe282-f7ce-4e0a-bf6e-e45ebc2f54bc
srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/fileb225a02d-672b-439c-bf60-2ec5939563ac
>lcg-lr lfn:test_PI_1M.txt
srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-03-23/filee260881d-f7b9-4ba5-aadb-c73da3537700
srm://gridsrm.ts.infn.it/infngrid/generated/2009-03-23/file175411c7-870d-4302-bcfe-e4e471ebb23b</verbatim>
      * <noautolink>DataRequirements = { [ DataCatalogType = "DLI"; DataCatalog = "http://lfcserver.cnaf.infn.it:8085"; InputData = {"lfn:/grid/infngrid/test_e-2M.txt","lfn:/grid/infngrid/test_PI_1M.txt"}; ] };
      * DataAccessProtocol = "gsiftp"; </noautolink><verbatim>
[danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a data-req.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found
 - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
 - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert</verbatim>
   1. GANG MATCHING
      * JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace == 87200000); <verbatim>
[danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
 - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
(the closeSE is gridit-se-01 with 87200000 available space.)</verbatim>
      * JDL Requirements = anyMatch( other.storage.CloseSEs ,target.GlueSAStateAvailableSpace >= 50340000000); <verbatim>
[danilo@ui TEST_WMS]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a test_gang.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found:
 - argoce01.na.infn.it:2119/jobmanager-lcgpbs-cert
 - atlasce1.lnf.infn.it:2119/jobmanager-lcgpbs-cert
 - grid-ce2.pr.infn.it:2119/jobmanager-pbs-cert
 - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
 - prod-ce-01.pd.infn.it:2119/jobmanager-lcglsf-cert
 - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert
 - t2-ce-01.mi.infn.it:2119/jobmanager-lcgpbs-cert
 - atlasce01.na.infn.it:2119/jobmanager-lcgpbs-cert
 - atlas-ce-01.roma1.infn.it:2119/jobmanager-lcglsf-atlasgcert
 - grid012.ct.infn.it:2119/jobmanager-lcglsf-cert</verbatim>
   1. OUTPUT SE %GREEN%OK%ENDCOLOR%
      * <noautolink>OutputSE = "grid007g.cnaf.infn.it";</noautolink><verbatim>
[danilo@ui]$ glite-wms-job-list-match -c ~/WMSCONF/conf_wms007.conf -a outputSE.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found
 - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
</verbatim>
   3. <noautolink>File perusal</noautolink> with data. %GREEN%OK%ENDCOLOR%
   4. <noautolink>DAGS</noautolink> with data. %GREEN%OK%ENDCOLOR%

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/6ce4466c-2f5f-4ad3-884d-3588640072e3/slc4_ia32_gcc346/index.html][this]] repository:

Configuration Name: *glite-wms_R_3_2_1_12*

   1. For some jobs Condor seems unable to retrieve the job's standard output; in all these cases the "maradona" mechanism is used. DOES NOT SEEM A WMS ISSUE<verbatim>
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...</verbatim>
   2. Submit a collection in which all the nodes have "false" requirements, then cancel the collection; when you try to restart the WM, it fails. ---> HOPEFULLY FIXED BY MANAGER<verbatim>
<mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&, glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&, boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.</verbatim> FIXED BY MANAGER
   3. Cancelling a collection also triggers a cancel for the "Done" nodes: <verbatim>
Event: Done
- Arrived = Thu Mar 26 13:48:39 2009 CET
- Exit code = 0
- Host = wms009.cnaf.infn.it
- Reason = Job terminated successfully
- Source = LogMonitor
- Src instance = unique
- Status code = OK
- Timestamp = Thu Mar 26 13:48:39 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:09 2009 CET
- Host = wms009.cnaf.infn.it
- Source = WorkloadManager
- Src instance = 23245
- Status code = DONE
- Timestamp = Thu Mar 26 15:29:09 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:13 2009 CET
- Host = wms009.cnaf.infn.it
- Reason = Cancel requested by WorkloadManager
- Source = JobController
- Src instance = unique
- Status code = REQ
- Timestamp = Thu Mar 26 15:29:13 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy
---
Event: Cancel
- Arrived = Thu Mar 26 15:29:14 2009 CET
- Host = wms009.cnaf.infn.it
- Reason = I'm not able to retrieve the condor ID.
- Source = JobController
- Src instance = unique
- Status code = REFUSE
- Timestamp = Thu Mar 26 15:29:13 2009 CET
- User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy</verbatim>
   1. Recovery does not correctly handle a node resubmission request (it triggers two resubmissions: one due to the collection recovery and the other due to the node resubmission). FIXED BY MANAGER

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/66b2ad20-fefd-4dd2-94e4-9f71e434b047/slc4_ia32_gcc346/index.html][this]] repository:

Configuration Name: *glite-wms_R_3_2_1_8*

   1. The recovery mechanism is not able to handle a cancel and a resubmit request for the same job (it should cancel the job).<verbatim>
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q</verbatim> FIXED BY MANAGER
   1. Sometimes the submit command fails with this message:<verbatim>
[ale@cream-15 UI]$ glite-wms-job-submit -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg),
but an error occurred while transferring files:
Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s) (DestinationURI with https protocol not found)
JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
(please contact the server administrator)</verbatim> HOPEFULLY FIXED BY WMPROXY
   1. LogMonitor dies when it invokes _operator()(purger.cpp:432)_ <verbatim>
19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
</verbatim> HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION
   1. Submitting a collection with the attribute "requirements = false;" doesn't work. %GREEN%FIXED (glite-wms-wmproxy-3.2.1-4.slc4)%ENDCOLOR% Using this collection:<verbatim>
[
  Type = "collection";
  InputSandbox = {"exe/test.sh"};
  requirements = false;
  nodes = {
    [
      JobType = "Normal";
      Executable = "test.sh";
      StdOutput = "test.out";
      StdError = "test.err";
      OutputSandbox = {};
    ]
  }
]</verbatim>you should find this message in the wmproxy.log:<verbatim>
23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attributx</verbatim>

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/17a026b5-c4f9-4735-be68-24b4e334fe01/slc4_ia32_gcc346/index.html][this]] repository:

Configuration Name: *glite-wms_R_3_2_1_5*

   1. The WM startup script must be in charge of creating both the JC and the ICE input %GREEN%FIXED (glite-wms-manager-3.2.1-8)%ENDCOLOR%
   2. During recovery, pending nodes are Aborted %GREEN%FIXED (glite-wms-manager-3.2.1-6)%ENDCOLOR%
   3. After setting _EnableRecovery = true;_ in the glite_wms.conf file, the recovery is skipped: %GREEN%FIXED (glite-wms-manager-3.2.1-6)%ENDCOLOR%<verbatim>
16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery</verbatim>

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/e0bcdb15-8165-4457-a3d3-f73aec21413c/slc4_ia32_gcc346/index.html][this]] repository:

Configuration Name: *glite-wms_R_3_2_1_3*

   1. The output of the JobController startup script is not correct: %GREEN%FIXED (glite-wms-jobsubmission-3.2.1-3)%ENDCOLOR%<verbatim>
[root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
Starting JobController daemon(s)
[root@wms007 ~]# CondorG...ler... [ OK ]</verbatim>

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/77a4cf05-7b56-43ff-a9a7-debe75780880/slc4_ia32_gcc346][this]] repository:

Configuration Name: *glite-wms_R_3_2_1_2*

   1. After the submission of a collection of 20 jobs with "requirements=false", 20 "Pending" events are logged to the same job instead of one event per job. %GREEN%FIXED (glite-wms-manager-3.2.1-4)%ENDCOLOR%
   2. It is very hard to stop the WM %GREEN%FIXED (glite-wms-manager-3.2.1-6)%ENDCOLOR%<verbatim>
[root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
stopping workload manager... failure (stop it manually)</verbatim>

---++ Using [[http://etics-hd.cern.ch:8080/repository/pm/volatile/repomd/id/eaff55e8-37ff-4a75-b40f-1544f3232359/slc4_ia32_gcc346][this]] repository:

Configuration Name: *glite-wms_branch_3_2_0*

   1. 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting... HOPEFULLY FIXED BY MANAGER_3_2_10
   2. Cancel of pending nodes (or jobs) doesn't work %GREEN%FIXED (glite-wms-manager-3.2.1-7)%ENDCOLOR%. %RED%But the final status of the job is ABORTED instead of CANCELLED.%ENDCOLOR%
   3. Recovery doesn't recognize a mm request: %RED%NOT FIXED %ENDCOLOR% <verbatim>
09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)</verbatim>and it is not removed from the jobdir. %ORANGE%Note: the ui command hangs.%ENDCOLOR%
   4. Recovery of a collection with all aborted nodes causes a crash in the WM %GREEN%FIXED (glite-wms-manager-3.2.1-3)%ENDCOLOR%<verbatim>
12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
Program received signal SIGSEGV, Segmentation fault.
0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()</verbatim>
   5. A pending collection is completely resubmitted (i.e. including the DONE nodes) after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see [[https://savannah.cern.ch/bugs/?30816][#30816]] %GREEN%FIXED (glite-wms-manager-3.2.1-7)%ENDCOLOR%
   6. WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10 <verbatim>
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus</verbatim>
   7. Condor fails to start %GREEN%FIXED (glite-wms-jobsubmission-3.2.1-2)%ENDCOLOR%

-- Main.AlessioGianelle - 04 Mar 2009
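The list-match checks recorded in the 2009-03-25 tests compare the CEs printed by =glite-wms-job-list-match= against the expected ones by eye. A minimal sketch of how that comparison could be scripted is given below. It assumes that each matching CE is printed on its own line prefixed by "- ", as in the captured output; the helper name =parse_matched_ces= and the sample text layout are illustrative assumptions, not part of the WMS tooling.

```python
import re
from typing import List

# Assumption: a CE ID is printed as " - host:port/jobmanager-name" on its own line.
CE_LINE = re.compile(r"^\s*-\s+(\S+:\d+/\S+)\s*$")


def parse_matched_ces(output: str) -> List[str]:
    """Return the CE IDs found in list-match output, in the order printed."""
    ces = []
    for line in output.splitlines():
        match = CE_LINE.match(line)
        if match:
            ces.append(match.group(1))
    return ces


# Sample text copied from the MatchMaking-with-data test; the line breaks are assumed.
SAMPLE = """\
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
COMPUTING ELEMENT IDs LIST
The following CE(s) matching your job requirements have been found
 - grid001.ts.infn.it:2119/jobmanager-lcglsf-cert
 - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert
"""

if __name__ == "__main__":
    for ce in parse_matched_ces(SAMPLE):
        print(ce)
```

In a real run the text would come from the command's captured standard output rather than from =SAMPLE=; the expected CE set for each test still has to be supplied by the tester.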
Topic revision: r55 - 2011-02-24 - AlessioGianelle
Copyright © 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.