TEST 03252009, Configuration Name: glite-wms_R_3_2_1_11
Problems
Matchmaking (MM) with data requirements does not work:
Requirements = other.GlueCEInfoHostName != "spacin-ce1.dma.unina.it";
DataRequirements = {
[
DataCatalogType = "DLI";
DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
]
};
DataAccessProtocol = "gsiftp";
does not match, although both input files have replicas:
File1:
[cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/e-2M.txt
sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file1d1e9575-ae91-4d67-9080-f4a7f4687806
sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/fileb28f35a1-fbb4-413c-920f-b52cf002012e
sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file88152626-c3c9-448e-9275-b01278a0d9d0
sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/fileba70145a-2a5c-4f61-ac18-a4b48bc0e78b
sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/file2ddaccf9-c62c-417a-826d-3a1755cf3718
File2:
[cesini@ui DataReq]$ lcg-lr --vo infngrid lfn:/grid/infngrid/cesini/PI_1M.txt
sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filec4384060-5c78-4ee8-ac75-6a5b7161987d
sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/file6a7ce8b1-3aa9-442b-80c2-21493a60553e
sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filed03f96ea-4a70-4315-bc3a-a3839c9e0bed
sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a
sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f
Log-status
Status info for the Job : https://devel20.cnaf.infn.it:9000/LkRfwhr3ibCfROQfCCj0jA
Current Status: Aborted
Status Reason: PID: 11874 - "wmpcoreoperations::jobStart"
Submitted: Fri Mar 20 20:04:16 2009 CET
HOPEFULLY FIXED BY HELPER
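As a quick sanity check on the replica listings above, the following minimal sketch computes which storage elements hold a replica of every input file. The SURLs are copied from the two lcg-lr outputs; the function names are illustrative only and are not part of any gLite tool. It does not check which CEs are close to those SEs, so it only shows that candidate SEs exist, not that a match must succeed.

```python
from urllib.parse import urlparse

# Replica SURLs copied from the two lcg-lr listings in this report.
REPLICAS = {
    "lfn:/grid/infngrid/cesini/e-2M.txt": [
        "sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file1d1e9575-ae91-4d67-9080-f4a7f4687806",
        "sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/fileb28f35a1-fbb4-413c-920f-b52cf002012e",
        "sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file88152626-c3c9-448e-9275-b01278a0d9d0",
        "sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/fileba70145a-2a5c-4f61-ac18-a4b48bc0e78b",
        "sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/file2ddaccf9-c62c-417a-826d-3a1755cf3718",
    ],
    "lfn:/grid/infngrid/cesini/PI_1M.txt": [
        "sfn://gridba6.ba.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filec4384060-5c78-4ee8-ac75-6a5b7161987d",
        "sfn://gridse.pi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-26/file6a7ce8b1-3aa9-442b-80c2-21493a60553e",
        "sfn://prod-se-01.pd.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/filed03f96ea-4a70-4315-bc3a-a3839c9e0bed",
        "sfn://t2-se-01.mi.infn.it/flatfiles/SE00/infngrid/generated/2007-11-21/file5a113abe-0a0e-459f-9a2f-2478f8bb6c6a",
        "sfn://t2-se-03.lnl.infn.it/data1/infngrid/generated/2007-11-26/filed1f43d74-24ac-47a2-a83a-8b0e8b0d9d4f",
    ],
}


def storage_hosts(surls):
    """Return the set of SE host names found in a list of SURLs."""
    return {urlparse(surl).hostname for surl in surls}


def common_storage_hosts(replicas):
    """Return the SE hosts that hold a replica of every logical file."""
    host_sets = [storage_hosts(surls) for surls in replicas.values()]
    return set.intersection(*host_sets) if host_sets else set()


if __name__ == "__main__":
    # Print one common SE host per line, in alphabetical order.
    for host in sorted(common_storage_hosts(REPLICAS)):
        print(host)
```

With the listings above, both files are replicated on the same five SEs, so the data requirements alone do not explain the missing match.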
Bug #40982: When a collection is aborted the "Abort" event should be logged for the sub-nodes as well /2
Using this repository: Configuration Name: glite-wms_R_3_2_1_12
- It seems that for some jobs condor is not able to retrieve the standard output of the job; in all these cases the "maradona" mechanism is used.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.
24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...
24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): Maradona makes another goal !!!
24 Mar, 10:38:40 -U- JobWrapperOutputParser::parse_file(...): The legend goes on...
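To estimate how often this fallback happens, a log file can be scanned for the "passing ball to Maradona" message. The sketch below is only an illustration: the function name is hypothetical, and the message text and timestamp layout are assumed to match the excerpt above.

```python
import re

# Message emitted when the standard output was not useful (from the excerpt above).
FALLBACK = re.compile(r"passing ball to Maradona")
# Timestamp layout assumed from the excerpt, e.g. "24 Mar, 10:38:40".
TIMESTAMP = re.compile(r"^(\d{2} \w{3}, \d{2}:\d{2}:\d{2})")


def maradona_fallbacks(lines):
    """Return the timestamps of the log lines that report a Maradona fallback."""
    hits = []
    for line in lines:
        if FALLBACK.search(line):
            match = TIMESTAMP.match(line)
            # Keep None when a line has no recognisable timestamp.
            hits.append(match.group(1) if match else None)
    return hits


# Sample lines copied from the log excerpt above.
SAMPLE = [
    "24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Going to parse standard output file.",
    "24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Standard output does not contain useful data.",
    "24 Mar, 10:38:40 -H- JobWrapperOutputParser::parse_file(...): Standard output was not useful, passing ball to Maradona...",
    "24 Mar, 10:38:40 -*- JobWrapperOutputParser::parse_file(...): Got info from Maradona...",
]

if __name__ == "__main__":
    print(len(maradona_fallbacks(SAMPLE)), "fallback(s) found")
```

Reading a real log would only require passing an open file object instead of SAMPLE, since the function accepts any iterable of lines.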
- Submit a collection in which all the nodes have "false" requirements, then cancel the collection; when you try to restart the WM, it fails.
<mcecchi> glite-wms-workload_manager: recovery.cpp:387: void glite::wms::manager::server::single_request_recovery(const glite::wms::manager::server::<unnamed>::IdRequests&,
glite::wms::manager::server::Events&, const glite::wms::manager::server::WMReal&, glite::wms::manager::server::<unnamed>::IdToRequests&,
boost::shared_ptr<std::string>): Assertion `requests_for_id.size() == 1' failed.
HOPEFULLY FIXED BY MANAGER
Using this repository: Configuration Name: glite-wms_R_3_2_1_8
- The recovery mechanism is not able to handle a cancel and a resubmit request for the same job (it should cancel the job).
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:651): multiple requests [CS] for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q (status 2)
19 Mar, 12:45:21 -I: [Info] multiple_request_recovery(recovery.cpp:804): invalid pattern; ignoring all requests
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobcancel request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
19 Mar, 12:45:21 -D: [Debug] operator()(recovery.cpp:76): ignoring jobresubmit request for https://devel17.cnaf.infn.it:9000/iVp83BKI_-IytpdVlojq3Q
HOPEFULLY FIXED BY MANAGER
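The expected behaviour for this case can be written down as a small decision rule. This is only a sketch of the rule stated in the item above (a cancel should win), not the WM recovery code; the function name is hypothetical and the request names are taken from the log lines. No rule is assumed for other combinations of requests.

```python
def resolve_requests(requests):
    """Pick the request to honour when several are pending for the same job.

    Assumed rule, from the expected behaviour stated in this report:
    a pending cancel wins over any other request for that job.
    Returns None when no rule is known for the given combination.
    """
    if not requests:
        return None
    if len(requests) == 1:
        # A single request needs no arbitration.
        return requests[0]
    if "jobcancel" in requests:
        # Cancel takes precedence, e.g. over a pending resubmit.
        return "jobcancel"
    # Other combinations are deliberately left undecided in this sketch.
    return None


if __name__ == "__main__":
    print(resolve_requests(["jobcancel", "jobresubmit"]))
```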
- Sometimes the submit command fails with this message:
[ale@cream-15 UI]$ glite-wms-job-submit -a -c ~/UI/etc/wmp_wms007.conf -o wms007 jdl/fail/fail.jdl
Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
Error - The job has been successfully registered (the JobId is: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg), but an error occurred while transferring files:
Unable to find Job InputSandbox Relative Path information needed to create the ISB Zipped File(s)
(DestinationURI with https protocol not found)
JobId: https://devel17.cnaf.infn.it:9000/9fuQafsyV-lBWdiHXjkBFg
(please contact the server administrator)
HOPEFULLY FIXED BY WMPROXY
- LogMonitor dies when it invokes operator()(purger.cpp:432)
19 Mar, 15:50:20 -U- JobFilePurger::do_purge(...): Going to purge job storage...
19 Mar, 15:50:20 -*- MonitorLoop::run(): Got an unhandled standard exception !!!
19 Mar, 15:50:20 -*- MonitorLoop::run(): Namely: "call to empty boost::function"
19 Mar, 15:50:20 -*- MonitorLoop::run(): Aborting daemon...
HOPEFULLY FIXED BY PURGER AND JOBSUBMISSION
- Submitting a collection with the attribute "requirements = false;" does not work. Using this collection: FIXED (glite-wms-wmproxy-3.2.1-4.slc4)
[
Type = "collection";
InputSandbox = {"exe/test.sh"};
requirements = false;
nodes = {
[
JobType = "Normal";
Executable = "test.sh";
StdOutput = "test.out";
StdError = "test.err";
OutputSandbox = {};
]
}
]
you should find this message in wmproxy.log:
23 Mar, 10:30:34 -I- PID: 28115 - "wmpcommon::getType": JDL Type: dag
23 Mar, 10:30:34 -E- PID: 28115 - "wmpcoreoperations::jobStart": Error while checking JDL: requirements: wrong type caught for attributx
Using this repository: Configuration Name: glite-wms_R_3_2_1_5
- WM startup script must be in charge of creating both JC and ICE input FIXED (glite-wms-manager-3.2.1-8)
- During recovery pending nodes are Aborted FIXED (glite-wms-manager-3.2.1-6)
- After setting EnableRecovery = true; in the glite_wms.conf file, the recovery is skipped: FIXED (glite-wms-manager-3.2.1-6)
16 Mar, 12:41:33 -I: [Info] main(main.cpp:411): skipping recovery
Using this repository: Configuration Name: glite-wms_R_3_2_1_3
- The output of the startup script of JobController is not correct: FIXED (glite-wms-jobsubmission-3.2.1-3)
[root@wms007 ~]# /opt/glite/etc/init.d/glite-wms-jc start
Starting JobController daemon(s)
[root@wms007 ~]# CondorG...ler... [ OK ]
Using this repository: Configuration Name: glite-wms_R_3_2_1_2
- After the submission of a collection with 20 jobs with "requirements=false", there are 20 events "Pending" logged to the same job instead of one event per job. FIXED (glite-wms-manager-3.2.1-4)
- It is very hard to stop wm FIXED (glite-wms-manager-3.2.1-6)
[root@wms007 new]# /opt/glite/etc/init.d/glite-wms-wm stop
stopping workload manager... failure (stop it manually)
Using this repository: Configuration Name: glite-wms_branch_3_2_0
- 08 Mar, 05:04:44 -E: [Error] operator()(dispatcher_utils.cpp:407): Dispatcher: boost::filesystem::directory_iterator constructor: "/var/glite/workload_manager/jobdir/new": Cannot allocate memory. Exiting...
HOPEFULLY FIXED BY MANAGER_3_2_10
- Cancel of pending nodes (or jobs) doesn't work FIXED (glite-wms-manager-3.2.1-7). But the final status of the job is ABORTED instead of CANCELLED.
HOPEFULLY FIXED BY MANAGER_3_2_10 - TO BE RETESTED
- recovery doesn't recognize a mm request: FIXED (glite-wms-manager-3.2.1-7)
09 Mar, 11:58:57 -I: [Info] operator()(recovery.cpp:836): recovering https://localhost:6000/kqU2YOq-Ofu1_ArnS5RoIQ
09 Mar, 11:58:57 -E: [Error] single_request_recovery(recovery.cpp:392): cannot create LB context (2)
and the request is not removed from the jobdir. Now the request is ignored at startup ("ignoring match request"), but the UI command hangs.
THIS IS THE EXPECTED BEHAVIOUR WITH THE UI 3.1
- Recovery of a collection with all aborted nodes causes a crash in the wm FIXED (glite-wms-manager-3.2.1-3)
12 Mar, 12:44:33 -I: [Info] main(main.cpp:328): This is the gLite Workload Manager, running with pid 3953
12 Mar, 12:44:33 -I: [Info] main(main.cpp:336): loading broker dll libglite_wms_helper_broker_ism.so
12 Mar, 12:44:33 -I: [Info] main(main.cpp:357): created input reader of type jobdir from /var/glite/workload_manager/jobdir
12 Mar, 12:44:33 -I: [Info] main(main.cpp:402): starting recovery
12 Mar, 12:44:33 -I: [Info] operator()(recovery.cpp:836): recovering https://devel17.cnaf.infn.it:9000/4ZppCer_07TdPvtmVAsqYQ
12 Mar, 12:44:33 -I: [Info] single_request_recovery(recovery.cpp:407): submit request
Program received signal SIGSEGV, Segmentation fault.
0x081471b2 in glite::wms::manager::server::(anonymous namespace)::select_recoverable_nodes::select_recoverable_nodes ()
- A pending collection is completely (i.e. also the DONE nodes) resubmitted after every restart (through the recovery mechanism); this means that a job can be run more than once (see for example https://devel17.cnaf.infn.it:9000/630alnB1XH4ZEGqPyucc4g): see #30816 FIXED (glite-wms-manager-3.2.1-7)
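The intended selection can be illustrated as follows: on recovery, only the nodes that are not yet DONE should be resubmitted. This is a minimal sketch under that assumption, not the actual select_recoverable_nodes implementation; the function name, node names and status strings are hypothetical.

```python
# Assumption: nodes in these states have already run and must not be resubmitted.
FINISHED_STATES = {"DONE"}


def nodes_to_resubmit(node_states):
    """Return the names of the collection nodes that still need to be submitted.

    node_states maps a node name to its current status string.
    """
    return sorted(
        name
        for name, status in node_states.items()
        if status.upper() not in FINISHED_STATES
    )


if __name__ == "__main__":
    # Hypothetical collection with one finished node and two pending ones.
    example = {"node_0": "DONE", "node_1": "PENDING", "node_2": "PENDING"}
    print(nodes_to_resubmit(example))
```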
- WM dies after this message: -----------> HOPEFULLY FIXED BY MANAGER_3_2_10
09 Mar, 17:08:36 -E: [Error] unrecoverable_collection(submit_request.cpp:90): https://devel17.cnaf.infn.it:9000/u-dqY5E9RtQ9mHk0-i0oYg: unable to retrieve children information from jobstatus
- Condor fails to start FIXED (glite-wms-jobsubmission-3.2.1-2)
--
AlessioGianelle - 04 Mar 2009