---+++ Problems and solutions
1. <verbatim> /opt/glite/yaim/functions/config_glite_lb: line 99: /opt/glite/etc/glite-lb-dbsetup.sql: No such file or directory ERROR 1146 (42S02) at line 1: Table 'lbserver20.short_fields' doesn't exist ERROR 1146 (42S02) at line 1: Table 'lbserver20.long_fields' doesn't exist ERROR 1146 (42S02) at line 1: Table 'lbserver20.states' doesn't exist ERROR 1146 (42S02) at line 1: Table 'lbserver20.events' doesn't exist /opt/glite/yaim/functions/config_glite_lb: line 190: /opt/glite/etc/init.d/glite-lb-bkserverd: No such file or directory /opt/glite/yaim/functions/config_glite_lb: line 200: /opt/glite/etc/init.d/glite-lb-bkserverd: No such -> file or directory ABORT: Service glite-lb-bkserverd failed to start! ERROR: Error during the execution of function: config_glite_lb ERROR: Error during the configuration.Exiting. [FAILED] ERROR: One of the functions returned with error without specifying it's nature ! </verbatim> %GREEN% _We need to install LB_ *FIXED* %ENDCOLOR% <BR>
2. <verbatim> DEBUG: Skipping function: config_glite_lb_setenv because it is not defined DEBUG: Skipping function: config_glite_lb because it is not defined ERROR: Error during the configuration.Exiting. [FAILED]</verbatim> %GREEN% _install glite-yaim-lb_ *FIXED* %ENDCOLOR% <BR>
3. <verbatim>ERROR: Unable to execute /etc/init.d/globus-gridftp. ERROR: Error during the execution of function: config_globus_gridftp </verbatim> %GREEN% _install glite-initscript-globus-gridftp-1.0.2-1.noarch.rpm_ *FIXED* %ENDCOLOR% <BR>
4. <verbatim> Syntax error on line 242 of /opt/glite/etc/glite_wms_wmproxy_httpd.conf: FastCgiConfig: invalid option: -intial-env </verbatim> %GREEN% _fix the template /opt/glite/etc/glite_wms_wmproxy_httpd.conf.template_ *FIXED* %ENDCOLOR% <BR>
5. The start/stop script of ICE doesn't work <BR> %GREEN% _fix committed in CVS_ *FIXED* %ENDCOLOR% <BR>
6. 
glite-yaim-wms ame:-ame: <BR> %GREEN% _The version of yaim-wms is written directly into the Makefile during the CVS checkout (see keyword substitution). Since HEAD is used for the development of this component, the keyword substitution is not correct. When a tag is available the version will be correctly defined_ *FIXED* %ENDCOLOR% <BR>
7. <verbatim>Starting program: /opt/glite/bin/glite_wms_wmproxy_server (no debugging symbols found) [Thread debugging using libthread_db enabled] [New Thread 0x2b227bb94260 (LWP 25595)] Program received signal SIGSEGV, Segmentation fault. 0x00000000004e94a8 in __static_initialization_and_destruction_0 () (gdb) bt #0 0x00000000004e94a8 in __static_initialization_and_destruction_0 () #1 0x0000000000565c26 in __do_global_ctors_aux () #2 0x000000000045665b in _init () #3 0x00002b2274847010 in __CTOR_LIST__ () from /opt/glite/lib64/libglite_lb_clientpp_gcc64dbg.so.4 #4 0x0000000000565ba7 in __libc_csu_init () #5 0x0000003835e1d92e in __libc_start_main () from /lib64/libc.so.6 #6 0x0000000000457fc9 in _start () (gdb)</verbatim> %GREEN% *FIXED* %ENDCOLOR% <BR>
8. <verbatim> Warning - Unable to register the job to the service: https://cream-44.pd.infn.it:7443/glite_wms_wmproxy_server Unable to create job local directory (please contact server administrator) [root@cream-03 ~]# ls -l /opt/glite/bin/glite_wms_wmproxy_dirmanager -r-sr-xr-x 1 nobody nobody 46363 Jul 8 11:00 /opt/glite/bin/glite_wms_wmproxy_dirmanager</verbatim> %GREEN% *FIXED* %ENDCOLOR%. Implemented postun in ETICS. <BR>
9. %GREEN% *Bug: [[https://savannah.cern.ch/bugs/?72573][#72573]]* %ENDCOLOR%<verbatim>[root@cream-03 ~]# service gLite status [...] *** glite-lb-locallogger: glite-lb-logd not running [...]</verbatim> %GREEN% *FIXED* Using new LB tag (patch 4423) %ENDCOLOR% <BR>
10. lbproxy not started <BR> %GREEN% _GLITE_LB_TYPE=proxy_ is the default behaviour *FIXED* %ENDCOLOR% <BR>
11. 
<verbatim>[Fri Jul 09 18:05:21 2010] [error] Certificate Verification: Error (24): invalid CA certificate [Fri Jul 09 18:05:21 2010] [error] Certificate Verification: Error (26): unsupported certificate purpose</verbatim> %PURPLE% _Trying a simple list-match._ *Is it a warning?* %ENDCOLOR% <BR>
12. In /opt/glite/etc/lcmaps/lcmaps.db replace path = ${moddir} with path = /opt/glite/lib64/modules <BR> %GREEN% *FIXED* %ENDCOLOR% <BR>
13. Remove !SDJRequirements and !WmsRequirements from the !WorkloadManagerProxy section of /opt/glite/etc/glite_wms.conf <BR> %GREEN% *FIXED* %ENDCOLOR% <BR>
14. Add to the !WorkloadManager section of /opt/glite/etc/glite_wms.conf:<verbatim> WmsRequirements = ( (ShortDeadlineJob =?= TRUE) ? RegExp(".*sdj$", other.GlueCEUniqueID) : !RegExp(".*sdj$", other.GlueCEUniqueID)) && (other.GlueCEPolicyMaxTotalJobs == 0 || other.GlueCEStateTotalJobs < other.GlueCEPolicyMaxTotalJobs);</verbatim> %GREEN% *FIXED* %ENDCOLOR% <BR>
15. Correct /opt/glite/etc/glite_wms_wmproxy.gacl: insert a / before the "vo" and the rule /"vo"/Role=NULL/Capability=NULL <BR> %GREEN% *FIXED* %ENDCOLOR% <BR>
16. 
Cleared event is not logged: <verbatim> 27 Jul, 17:41:22 -E- PID: 25792 - "wmputils::doPurge": [Error] remove_path(purger.cpp:256): LB event logging failed LB server (bkserver,lbproxy) store protocol error (1417) - edg_wll_LogEvent(): LB server (bkserver,lbproxy) store protocol error;; Logging library ERROR: LB server (bkserver,lbproxy) store protocol error;; edg_wll_DoLogEvent(): edg_wll_log_connect error Transport endpoint is not connected;; edg_wll_gss_connect();; System Error: Connection refused 27 Jul, 17:41:22 -S- PID: 25792 - "wmpcoreoperations::jobpurge": Unable to complete job purge 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": ------------------------------- Fault Description -------------------------------- 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Method: jobPurge 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Code: 1202 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Description: The Operation is not allowed: Unable to complete job purge 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": Stack: 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": JobOperationException: The Operation is not allowed: Unable to complete job purge at jobpurge()[wmpcoreoperations.cpp:2640] 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": at jobpurge()[wmpcoreoperations.cpp:2546] 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": at jobPurge()[wmpcoreoperations.cpp:2667] 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": ---------------------------------------------------------------------------------- 27 Jul, 17:41:22 -D- PID: 25792 - "wmpgsoapoperations::ns1__jobPurge": jobPurge operation completed</verbatim> %GREEN% *FIXED* for normal jobs %ENDCOLOR% <BR> 17. 
Submission to LCG CE doesn't work: *Got a job held event, reason: Failed to initialize GAHP* <br> Possible solutions are:
   * install condor-lcg-1.1.0-1
   * set GT2_GAHP = /opt/condor-7.4.1/sbin/gahp_server and GRID_MONITOR = /opt/condor-7.4.1/libexec/glite/grid_monitor.sh in /opt/condor-c/local.<$HOSTNAME>/condor_config.local
<br> %GREEN% *Waiting for Marteen's investigation.... FIXED* Decided to use the second option. Changes done in yaim-wms.%ENDCOLOR% <BR>
18. %GREEN% *BUG [[https://savannah.cern.ch/bugs/?73192][#73192]]* %ENDCOLOR% Submission failed: <verbatim> [ale@egee-rb-03 UI]$ glite-wms-job-submit -a --config ~/UI/etc/wmp_cream-03.conf jdl/env.jdl Connecting to the service https://cream-03.pd.infn.it:7443/glite_wms_wmproxy_server Warning - Unable to register the job to the service: https://cream-03.pd.infn.it:7443/glite_wms_wmproxy_server LB: :2652 Set logging job failed edg_wll_SetLoggingJob LB[Proxy] Error: GSSAPI Error (failed to load GSI credentials: GSS Major Status: General failure (GSS Minor Status Error Chain: globus_gsi_gssapi: Error with GSI credential globus_gsi_gssapi: Error with gss credential handle globus_credential: Valid credentials could not be found in any of the possible locations specified by the credential search order. Valid credentials could not be found in any of the possible locations specified by the credential search order.</verbatim> Possible (ugly) workarounds for this problem are:
   * chown root.glite /etc/grid-security/host*.pem
   * cp /etc/grid-security/hostcert.pem /home/glite/.globus/usercert.pem and cp /etc/grid-security/hostkey.pem /home/glite/.globus/userkey.pem
<br> %GREEN% *FIXED* using new LB client (patch 4423) %ENDCOLOR% <BR>
19. 
WM required huge amount of memory:<verbatim> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 15007 glite 25 0 2657m 2.8g 8320 S 0.0 77.0 2:34.79 glite-wms-workl </verbatim> %PURPLE% *Monitoring..* %ENDCOLOR% %GREEN% Google malloc is used by default and correctly set by yaim %ENDCOLOR% <BR> 20. <verbatim> *** glite-lb-bkserverd: Starting glite-lb-bkserver ...Warning: MySQL library version mismatch (compiled '50045', runtime '50077') done</verbatim> %GREEN% Salvet said: Installed !MySQL library is slightly newer than library used for glite-lb-bkserver build. This is normal. *No Problem* %ENDCOLOR% <BR> 21. %GREEN% *Bug [[https://savannah.cern.ch/bugs/?func=detailitem&item_id=73206][#73206]]* %ENDCOLOR% Collection doesn't work. The only suspicious messages are in /var/log/messages: <verbatim> Sep 10 15:50:48 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:48Z : event=wms.wmpserver_setJobFileSystem() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw Sep 10 15:50:48 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:48Z : event=wms.wmpserver_setSubjobFileSystem() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw Sep 10 15:50:50 cream-44 glite_wms_wmproxy_server[6179]: ts=2010-09-10T13:50:50Z : event=wms.wmpserver_submit() : userid=18118 jobid=https://devel07.cnaf.infn.it:9000/pvRt15v6hiIBbEKj0UeeAw Sep 10 15:50:50 cream-44 kernel: glite_wms_wmpro[6179] general protection rip:3e86c797e0 rsp:7fff60c65148 error:0</verbatim> %GREEN% *FIXED* %ENDCOLOR% <BR> 22. %GREEN% *Bug [[https://savannah.cern.ch/bugs/?72970][#72970]]* %ENDCOLOR% I'm running on the WMS an lbserver in "both" mode:<verbatim> 9197 ? 
S 0:00 /opt/glite/bin/glite-lb-bkserverd --notif-il-sock=/tmp/glite-lb-notif.sock --notif-il-fprefix=/var/tmp/glite-lb-notif -c /home/glite/.certs/hostcert.pem -k /home/glite/.certs/hostkey.pem -i /var/glite/glite-lb-bkserverd.pid --dump-prefix /var/glite/dump --purge-prefix /var/glite/purge -B --proxy-il-sock /tmp/glite-lbproxy-ilog.sock --proxy-il-fprefix /tmp/glite-lbproxy-ilog_events --policy /opt/glite/etc/glite-lb/glite-lb-authz.conf</verbatim> Then I started locallogger:<verbatim> [root@devel08 glite]# /opt/glite/etc/init.d/glite-lb-locallogger start Starting glite-lb-logd ...This is LocalLogger, part of Workload Management System in EU DataGrid & EGEE. done Starting glite-lb-interlogd ... done</verbatim> but:<verbatim> [root@devel08 glite]# /opt/glite/etc/init.d/glite-lb-locallogger status glite-lb-logd not running</verbatim> instead:<verbatim> [root@devel08 ~]# ps ax | grep lb-log 10049 ? Ss 0:00 /opt/glite/bin/glite-lb-logd -i /var/glite/glite-lb-logd.pid -c /home/glite/.certs/hostcert.pem -k /home/glite/.certs/hostkey.pem 12135 pts/0 S+ 0:00 grep lb-log</verbatim> %GREEN% *FIXED* Using new LB tag (patch 4423) %ENDCOLOR% <BR>
23. Problem during installation:<verbatim> Error: Missing Dependency: org.glite.build.common-cpp >= 3.2.1 is needed by package glite-security-lcmaps-without-gsi-1.4.8-5.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412) Error: Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-wms-wmproxy-3.3.0-3.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412) Error: Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-WMS-3.3.0-0.sl5.x86_64 (ETICS-volatile-build-5b42070f-c48b-4e7b-9819-48810222a0b3-sl5_x86_64_gcc412)</verbatim> %GREEN% two packages missing... *FIXED* %ENDCOLOR% <BR>
24. Resubmission for jobs submitted with the -r option is not done. 
Jobs are aborted with "hit job shallow retry count (0)" even if !ShallowRetryCount was set in the JDL %GREEN% *FIXED* %ENDCOLOR%. <BR>
25. Problems with DAG. DAG job reported as failed:<verbatim> Current Status: Done (Exit Code !=0) Exit code: 1 Status Reason: Warning: job exit code != 0 Destination: dagman </verbatim> and nodes stuck in "Submitted". %GREEN% *FIXED* %ENDCOLOR%. <BR>
26. %GREEN% BUG [[https://savannah.cern.ch/bugs/?75223][#75223]] *FIXED* %ENDCOLOR% When a job fails, the logged reason is wrong:<verbatim> Event: Done - Arrived = Thu Nov 11 14:06:54 2010 CET - Exit code = 1 - Host = cream-46.pd.infn.it - Reason = LM_log_done_beginThu Nov 11 14:03:17 CET 2010: prologue failed with error 1 - Source = LogMonitor - Src instance = unique - Status code = FAILED - Timestamp = Thu Nov 11 14:06:54 2010 CET</verbatim> Another type of error:<verbatim> Event: Done - Arrived = Fri Nov 26 12:15:30 2010 CET - Exit code = 0 - Host = pamelawn23.na.infn.it - Reason = Fri Nov 26 12:13:45 CET 2010: Cannot - Source = LRMS - Status code = OK - Timestamp = Fri Nov 26 12:13:46 2010 CET - User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle</verbatim>
27. Problem installing the new [[http://etics-repository.cern.ch/repository/pm/registered/repomd/id/dc5ea97e-c44a-43ae-8853-bc6671c91c8b/sl5_x86_64_gcc412/index.html][build]] (07/11/2010) %GREEN% *FIXED* %ENDCOLOR%. <verbatim> --> Missing Dependency: mod_fastcgi >= 2.4.3 is needed by package glite-wms-wmproxy-3.3.0-4.sl5.x86_64 (ETICS-name-patch_2876_1)</verbatim>
28. Problem starting bdii: <verbatim> Starting BDII slapd: Traceback (most recent call last): File "/usr/sbin/bdii-update", line 821, in ? 
create_daemon(config['BDII_LOG_FILE']) File "/usr/sbin/bdii-update", line 168, in create_daemom e = os.open(log_file, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0644) OSError: [Errno 13] Permission denied: '/var/log/bdii/bdii-update.log' [ OK ] BDII update process failed to startStarting BDII update pro[FAILED]</verbatim> %GREEN% Be sure that the installed version comes from the ETICS repo: bdii-5.0.9-1%ENDCOLOR%
29. %GREEN% BUG [[https://savannah.cern.ch/bugs/?75099][#75099]] *FIXED* %ENDCOLOR%. <verbatim> 2010-11-09 13:01:07,649 DEBUG - iceCommandEventQuery::execute() - TID=[203357600] Database ID=[1265375986000] 2010-11-09 13:01:07,650 DEBUG - iceCommandEventQuery::execute() - TID=[203357600] Exec time ID=[6] 2010-11-09 13:01:07,651 INFO - scoped_timer iceCommandEventQuery::processEvents() - TID=[203357600] All Events Proc Time 1289304067.650547 1289304067.651501 0.000954 terminate called after throwing an instance of 'std::out_of_range' what(): basic_string::at Aborted</verbatim> This happens when shallow resubmission is set to -1.
30. 
Submission failed %GREEN% *Probably FIXED* using a new LB server (2.1) %ENDCOLOR%<verbatim> Warning - Unable to register the job to the service: https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server HTTP Error <html><head> <title>500 Internal Server Error</title> </head><body> <h1>Internal Server Error</h1> <p>The server encountered an internal error or misconfiguration and was unable to complete your request.</p> <p>Please contact the server administrator, [no address given] and inform them of the time the error occurred, and anything you might have done that may have caused the error.</p> <p>More information about this error may be available in the server error log.</p> </body></html> Error code: SOAP-ENV:Server</verbatim> Looking into the /var/log/messages file of the WMS you should read:<verbatim> Nov 10 17:10:44 cream-46 kernel: glite_wms_wmpro[15571]: segfault at 0000000000000000 rip 00002b6b3ca32548 rsp 00007fffa962f320 error 4</verbatim> Instead the wmproxy log says:<verbatim> 10 Nov, 17:10:35 -S- PID: 15571 - "WMPEventlogger::registerJob": Register job failed edg_wll_RegisterJobProxy Exit code: 22 LB[Proxy] Error: Invalid argument (edg_wll_RegisterJobMaster(): unable to register job Resource temporarily unavailable;; Logging library ERROR: Resource temporarily unavailable;; edg_wll_DoLogEventServer(): edg_wll_log_direct_read error LB server (bkserver,lbproxy) store protocol error;; edg_wll_log_proto_client_direct(): error reading answer from L&B direct server LB server (bkserver,lbproxy) store protocol error;; get_reply_gss(): error reading reply LB server (bkserver,lbproxy) store protocol error;; gss_reader(): error reading message Transport endpoint is not connected;; edg_wll_gss_read_full;; GSS Error: EOF occured;)</verbatim> The httpd-wmproxy-errors.log says:<verbatim> [Wed Nov 10 17:10:44 2010] [error] [client 193.206.210.108] FastCGI: incomplete headers (0 bytes) received from server "/opt/glite/bin/glite_wms_wmproxy_server"</verbatim> 31. 
Cleared event not logged %GREEN% *FIXED* %ENDCOLOR%. <verbatim> [ale@cream-03 ~]$ glite-wms-job-output https://devel07.cnaf.infn.it:9000/IvaPn8c4ezVSiHbQP8FJOg Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server Warning - JobPurging not allowed (The Operation is not allowed: Unable to complete job purge)</verbatim> %GREEN%You need to set the WMS DN in both the "READ_ALL" and "LOG_WMS_EVENTS" actions of the glite-lb-authz.conf file on the LB server%ENDCOLOR%
32. Cleared event not logged %GREEN% *FIXED* (2011/01/19)%ENDCOLOR%. <verbatim> [ale@cream-03 ~]$ glite-wms-job-output https://devel07.cnaf.infn.it:9000/Hn0VS-oMpYlT7bdAIXNB7g Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server Warning - JobPurging not allowed (Proxy exception: The delegated Proxy has expired)</verbatim> This happens when the job proxy has expired.
33. %GREEN% Bug [[https://savannah.cern.ch/bugs/?75368][#75368]] *FIXED* %ENDCOLOR% A [[%ATTACHURL%/loginfo.txt]["DONE OK"]] job is marked as [[%ATTACHURL%/status.txt]["ABORTED"]]. The problem is that a failed job should not be aborted (resubmission is possible only for Done Failed jobs). 
Another case:<verbatim> Event: Abort - Arrived = Wed Nov 17 13:31:54 2010 CET - Host = cream-46.pd.infn.it - Reason = BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:Connection timed out-qsub: cannot connect to server gridba3.ba.infn.it (errno=110) Connection timed out-TERM environment variable not set.-) N/A (jobId = CREAM809848230) - Source = LogMonitor - Timestamp = Wed Nov 17 13:31:53 2010 CET - User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy --- Event: Resubmission - Arrived = Wed Nov 17 13:31:54 2010 CET - Host = cream-46.pd.infn.it - Reason = Job resubmitted by ICE - Result = WILLRESUB - Source = LogMonitor - Tag = unavailable - Timestamp = Wed Nov 17 13:31:54 2010 CET - User = /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Alessio Gianelle/CN=proxy/CN=proxy</verbatim> 34. St9exception %GREEN% *FIXED* (2011/02/01)%ENDCOLOR%. After the "get-output" of a dag, the nodes are in "DONE-OK" state. If you try a get-output at this point for a node you have (see also 53.):<verbatim> [ale@cream-03 ~]$ glite-wms-job-output https://devel17.cnaf.infn.it:9000/-ZpRldEJcENxhNiwMqidiw Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server Error - getOutputFileList Error (St9exception)</verbatim> Instead the right message should be found in the wmproxy log file:<verbatim> 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": JobId Exception: The Operation is not allowed: The job has not been registered from this Workload Manager Proxy server (https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server) or it has been purged 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": ------------------------------- Fault Description -------------------------------- 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Method: getOutputFileList 19 Jan, 12:55:46 -D- PID: 22916 - 
"wmpgsoapoperations::ns1__getOutputFileList": Code: 1202 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Description: St9exception 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": Stack: 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": ---------------------------------------------------------------------------------- 19 Jan, 12:55:46 -D- PID: 22916 - "wmpgsoapoperations::ns1__getOutputFileList": getOutputFileList operation completed</verbatim> 35. The state of the nodes doesn't change after the output retrieval of the parent of a collection. %GREEN% Bug [[https://savannah.cern.ch/bugs/?77876][#77876]] *FIXED* (2011/02/14) %ENDCOLOR% 36. Dag job doesn't work %GREEN% *FIXED* %ENDCOLOR%. Probably the name of the token is wrongly set. 37. %GREEN% Bug [[https://savannah.cern.ch/bugs/?73715][#73715]] *FIXED* %ENDCOLOR% the Really Running event is not logged from JC/LM (instead it works for ICE). 38. The "type" attribute is _case sensitive_ %GREEN% *FIXED* installing the new jdl's rpm on the UI %ENDCOLOR%<verbatim> Error - AdSyntaxException The following parsing error(s) have been found: 'node_type' must be "dag"</verbatim> 39. %PURPLE% Bug [[https://savannah.cern.ch/bugs/?75402][#75402]] *FIXED* To verify with the new tag %ENDCOLOR% Synchronization loss between real validity of proxy and exp. time saved in ICE's database; this can happen when the copy of the new proxy fails <verbatim> 2010-11-15 10:57:41,869 INFO - DNProxyManager::setUserProxyIfLonger_Legacy() - Setting user proxy to [ /var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] copied to /var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy] because the old one is less long-lived. 
2010-11-15 10:57:42,019 ERROR - DNProxyManager::setUserProxyIfLonger_Legacy() - Error copying proxy [/var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] to [/var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy]. 2010-11-15 10:57:42,019 DEBUG - DNProxyManager::setUserProxyIfLonger_Legacy() - New proxy [/var/glite/SandboxDir/jI/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fjIwx1BAneLSLa93u3CEeAQ/user.proxy] has been copied into [/var/glite/ice/persist_dir/B23D0D7177A8B6234F1985493FA09FF41A4FA98C.proxy] - New Expiration Time is [Tue Nov 16 10:56:34 2010]</verbatim>
40. Problem with expired job proxies in ICE %PURPLE% *FIXED* To verify with the new tag %ENDCOLOR%. When the proxy has expired, the jobs in the ICE queue are not correctly removed:<verbatim> 2010-11-17 10:08:37,269 DEBUG - iceCommandLBLogging::execute - TID=[] Will not LOG anything to LB for Job [https://devel17.cnaf.infn.it:9000/ZSwHnJmAMblpLYKwz69mFQ] for reason: CreamJobID [CREAM898633002] disappeared from ICE database !</verbatim>
41. ICE logs Cancel events with a wrong sequence code %GREEN% *FIXED* (2011-01-19) %ENDCOLOR%
42. ICE aborted %PURPLE% *To be monitored* %ENDCOLOR% ICE sometimes aborts with these messages (see also 49.):<verbatim> t1169246528:p13952: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_destroy() failed [Thread System] mutex is locked (EBUSY)t1295124800:p13952: Fatal error: [Thread System] GLOBUSTHREAD: globus_thread_setspecific() failed</verbatim> or<verbatim> t1313102144:p12551: Fatal error: [Thread System] GLOBUSTHREAD: globus_thread_setspecific() failed [Thread System] invalid value passed to thread interface (EINVAL)Aborted (core dumped)</verbatim>
43. 
glite-wms-ice-db-rm, various errors %GREEN% *FIXED* (2011/01/19)%ENDCOLOR%<verbatim> [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm -h /opt/glite/bin/glite-wms-ice-db-rm: invalid option -- h Type /opt/glite/bin/glite-wms-ice-db-rm -h for help [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm Must specify at least one of the options --from-file <pathfile> or <gridjobid> [root@cream-46 persist_dir]# /opt/glite/bin/glite-wms-ice-db-rm --from-file a /opt/glite/bin/glite-wms-ice-db-rm: unrecognized option `--from-file' </verbatim> 44. feedback doesn't work %GREEN% *FIXED* %ENDCOLOR% The "old" token is not removed! <verbatim> [root@devel09 ~]# ls -l /var/glite/SandboxDir/B4/https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2fB4ni_5fIMNzn6cY6W8llClSA/ total 20 drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 input -rw-r--r-- 1 dteam035 dteam 133 Nov 30 10:41 Maradona.output drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 output drwxrwx--- 2 dteam035 glite 4096 Nov 30 10:17 peek -rw-r--r-- 1 glite glite 0 Nov 30 10:19 token.txt_0_ -rw-r--r-- 1 glite glite 0 Nov 30 10:17 token.txt_1</verbatim> 45. Misleading messages in Maradona file: %GREEN% *FIXED* (2011/02/02)%ENDCOLOR%<verbatim> LM_log_done_beginThu Dec 2 12:08:53 CET 2010: 0 LM_log_done_end jw exit status = 1</verbatim> Then the job fails with this message:<verbatim> - Standard output does not contain useful data.Cannot read JobWrapper output, both from Condor and from Maradona. Thu Dec 2 12:08:53 CET 2010: 0LM_log_done_beginThu Dec 2 12:08:53 CET 2010: 0LM_log_done_end</verbatim> 46. Feedback doesn't work with LCG-CE, so these CEs must be excluded in the MM %GREEN% *FIXED* (2011/02/10)%ENDCOLOR% Proposed fix is that yaim set:<verbatim> WmsRequirements = ((ShortDeadlineJob =?= TRUE) ? RegExp(".*sdj$", other.GlueCEUniqueID) : !RegExp(".*sdj$", other.GlueCEUniqueID)) && (other.GlueCEPolicyMaxTotalJobs == 0 || other.GlueCEStateTotalJobs < other.GlueCEPolicyMaxTotalJobs) && ((EnableWmsFeedback =?= TRUE) ? 
(RegExp("cream", other.GlueCEImplementationName, "i")) : true)</verbatim> 47. WM defines a rescheduled request as a "Submission" %GREEN% *FIXED* (2011/01/25)%ENDCOLOR% 48. *Bug: [[https://savannah.cern.ch/bugs/?76097][#76097]]* During the first mm of a node the "UserTag !CEInfoHostName" is not logged %GREEN% *FIXED* %ENDCOLOR% 49. Both WM and ICE give on high load: t1168148800:p30771: Fatal error: [Thread System] GLOBUSTHREAD: pthread_mutex_lock() failed. %PURPLE% Supposedly fixed by the 'newest' GPT in glite 32. Pay attention.%ENDCOLOR% (see also 42.) 50. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Yaim fails %GREEN% *FIXED* (2011/01/31)%ENDCOLOR%: <verbatim> cp: cannot stat `/opt/glite/sbin/glite_wms_wmproxy_load_monitor.template': No such file or directory chmod: cannot access `/opt/glite/sbin/glite_wms_wmproxy_load_monitor': No such file or directory</verbatim> 51. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Submission failed %GREEN% *FIXED* Problem was with fast-cgi in gsoap (2011/01/27)%ENDCOLOR% :<verbatim> Status: 500 Internal Server Error Server: gSOAP/2.7 Content-Type: text/xml; charset=utf-8 Content-Length: 831 Connection: close <?xml version="1.0" encoding="UTF-8"?> <SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/" xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl" xmlns:jsdlposix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix" xmlns:delegation1="http://www.gridsite.org/namespaces/delegation-1" xmlns:delegationns="http://www.gridsite.org/namespaces/delegation-2" xmlns:ns1="http://glite.org/wms/wmproxy"><SOAP-ENV:Body><SOAP-ENV:Fault SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"><faultcode>SOAP-ENV:Client</faultcode><faultstring>End of file or no input: 'Invalid 
argument'</faultstring></SOAP-ENV:Fault></SOAP-ENV:Body></SOAP-ENV:Envelope>[Fri Jan 14 16:45:46 2011] [error] [client 193.206.210.108] FastCGI: incomplete headers (0 bytes) received from server "/opt/glite/bin/glite_wms_wmproxy_server"</verbatim>
52. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Cannot take shallow resubmission token %GREEN% *FIXED* (2011/01/31)%ENDCOLOR%. In fact, in the sandbox dir of the job the token is named: *token.txt_0.* (note the dot)
53. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Misleading output message from UI %GREEN% *FIXED* (2011/02/01) %ENDCOLOR% When a user tries to retrieve the output files of another user's job, the message is (see also 34.):<verbatim> [ale@cream-03 UI]$ glite-wms-job-output https://devel17.cnaf.infn.it:9000/cxyj4EodCZx7qY2vretM6g Connecting to the service https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server Error - getOutputFileList Error (St9exception)</verbatim> Instead in the wmproxy log file you should find the right reason:<verbatim> 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": JobId Exception: User not authorized to perform this operation 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": ------------------------------- Fault Description -------------------------------- 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Method: getOutputFileList 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Code: 1202 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Description: St9exception 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": Stack: 19 Jan, 12:16:55 -D- PID: 17941 - "wmpgsoapoperations::ns1__getOutputFileList": ---------------------------------------------------------------------------------- 19 Jan, 12:16:55 -D- PID: 17941 - 
"wmpgsoapoperations::ns1__getOutputFileList": getOutputFileList operation completed</verbatim> 54. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] deep resubmission doesn't work %GREEN% *FIXED* (2011/02/07) %ENDCOLOR% In fact when the old token is grabbed, wm doesn't create the new one<verbatim> 31 Jan, 16:37:21 -I: [Info] operator()(dispatcher_utils.cpp:228): new jobresubmit for https://cream-44.pd.infn.it:9000/Ham9zx-h36iNR0FVtjT9Ng 31 Jan, 16:37:21 -E: [Error] operator()(submit_request.cpp:536): cannot rename temporary token /var/glite/SandboxDir/Ha/https_3a_2f_2fcream-44.pd.infn.it_3a9000_2fHam9zx-h36iNR0FVtjT9Ng/token.txt_0. (error 2) 31 Jan, 16:37:21 -I: [Info] postpone(submit_request.cpp:227): postponing https://cream-44.pd.infn.it:9000/Ham9zx-h36iNR0FVtjT9Ng (cannot create token for shallow resubmission)</verbatim> 55. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Bug: [[https://savannah.cern.ch/bugs/?func=detailitem&item_id=77366][#77366]]: Sometimes submission failed for LB error %GREEN% *Hopefully FIXED* (2011/02/10) %ENDCOLOR%:<verbatim> Warning - Unable to register the job to the service: https://cream-46.pd.infn.it:7443/glite_wms_wmproxy_server Register COLLECTIONfailed to LB server:devel17.cnaf.infn.it:9000 edg_wll_RegisterJobProxy/Sync Exit code: 22 LB[Proxy] Error: Invalid argument (edg_wll_RegisterJobMaster(): unable to register job Resource temporarily unavailable;; Logging library ERROR: Resource temporarily unavailable;; edg_wll_DoLogEventServer(): edg_wll_log_direct_read error LB server (bkserver,lbproxy) store protocol error;; edg_wll_log_proto_client_direct(): error reading answer from L&B direct server LB server (bkserver,lbproxy) store protocol error;; get_reply_gss(): error reading reply LB server (bkserver,lbproxy) store protocol error;; gss_reader(): error reading message Transport endpoint is not connected;; edg_wll_gss_read_full;; GSS Error: EOF occured;) Method: 
jobRegister Error - Operation failed Unable to find any endpoint where to perform service request</verbatim>
56. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] %GREEN% Bug [[https://savannah.cern.ch/bugs/index.php?77055][#77055]] *FIXED* (2011/01/31)%ENDCOLOR% "MyProxyServer: wrong type caught for attribute" for parametric jobs
57. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Feedback doesn't work %GREEN% *FIXED* (2011/02/07)%ENDCOLOR% <verbatim> 20 Jan, 13:28:13 -I: [Info] operator()(replanner.cpp:226): created replanning request for job https://cream-46.pd.infn.it:9000/P4kfEQQnyw3g9NL7qyfA3Q with token /var/glite/SandboxDir/P4/https_3a_2f_2fcream-46.pd.infn.it_3a9000_2fP4kfEQQnyw3g9NL7qyfA3Q/token.txt_1 20 Jan, 13:28:14 -I: [Info] operator()(dispatcher_utils.cpp:310): cannot create LB context (2) for [...] </verbatim>
58. [[http://devel12.cnaf.infn.it:7444/repository/wms-cert_110114/][repo 110114]] Before resubmission, ICE must make sure that the job's proxy is valid for at least "n" minutes. %GREEN% *Hopefully FIXED* (2011/02/10)%ENDCOLOR%
59. [[http://etics-repository.cern.ch/repository/pm/volatile/repomd/id/13027066-9827-4d38-be9d-e7cd81a19c9b/sl5_x86_64_gcc412/index.html][WMS repo 110131]] & [[http://etics-repository.cern.ch/repository/pm/volatile/repomd/id/b67fa2cb-2f62-4c64-8293-b5be44405b27/sl5_x86_64_gcc412/index.html][LB patch #4623]] Yaim error: _sed: can't read /opt/glite/etc/gip/glite-info-generic.conf: No such file or directory_ %GREEN% *FIXED* (2011/02/07)%ENDCOLOR% Probably the *glite-info-generic* package is missing. After configuration we have:<verbatim> [root@cream-44 ~]# ls -l /opt/glite/etc/gip/glite-info-generic.conf -rw-r--r-- 1 root root 0 Jan 31 15:45 /opt/glite/etc/gip/glite-info-generic.conf</verbatim>
60. 
[[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLog_cream-46#2011_02_04_Ale_Install_a_new_WMS][repo 110204]] %GREEN% *FIXED* (2011/02/14) %ENDCOLOR% Cron Purger doesn't work:<verbatim>
[root@cream-46 ~]# /opt/glite/sbin/glite-wms-purgeStorage --help
glite-wms-purgeStorage: glite-wms-purgeStorage.cpp:87: bool<unnamed>::lb_proxy(): Assertion `f_conf' failed.
Aborted</verbatim>
61. [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLog_cream-46#2011_02_04_Ale_Install_a_new_WMS][repo 110204]] %PURPLE% *NOT FIXED* (2011/02/11) %ENDCOLOR% All the queries made for replanning time out (open bug [[https://savannah.cern.ch/bugs/?78047][#78047]]):<verbatim>
10 Feb, 10:35:20 -D: [Debug] get_scheduled_jobs(lb_utils.cpp:153): error (110) while querying for scheduled jobs
10 Feb, 10:35:20 -D: [Debug] operator()(replanner.cpp:382): no jobs in scheduled state for more than 1800 seconds for replanning </verbatim>
62. [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLog_cream-46#2011_02_04_Ale_Install_a_new_WMS][repo 110204]] %PURPLE% *NOT FIXED* (2011/02/11) %ENDCOLOR% If a collection is aborted for "request expired", the nodes don't change status. The problem is due to the LB query (it is probably tied to point 61):<verbatim>
10 Feb, 07:01:48 -E: [Error] unrecoverable_collection(submit_request.cpp:108): https://cream-46.pd.infn.it:9000/MRy3d5geSzmep67b0VXV1A: unable to retrieve children information from jobstatus
10 Feb, 07:01:48 -E: [Error] unrecoverable(submit_request.cpp:126): https://cream-46.pd.infn.it:9000/MRy3d5geSzmep67b0VXV1A failed (request expired) </verbatim>
63. 
[[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLog_cream-46#2011_02_04_Ale_Install_a_new_WMS][repo 110204]] %RED% *NOT FIXED* (2011/02/11) %ENDCOLOR% Submitting a collection with 192 nodes produces this error (see bug [[https://savannah.cern.ch/bugs/?70061][#70061]]):<verbatim>
Status info for the Job : https://cream-44.pd.infn.it:9000/Tck2cFcOKuOjM4gfH9Ermg
Current Status: Waiting
Status Reason: jobid: unable to complete the operation: the attribute has not been initialised yet
Submitted: Fri Feb 11 17:06:23 2011 CET </verbatim>
64. [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLogDevel09#2011_02_14_Ale][repo 110214]] %RED% *NOT FIXED* (2011/02/14) %ENDCOLOR% The glite-WMS metapackage is wrong (too many dependencies, and google-perftools is missing)
65. [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLogDevel09#2011_02_14_Ale][repo 110214]] %RED% *NOT FIXED* (2011/02/14) %ENDCOLOR% There is a mistake in /opt/glite/yaim/functions/config_info_service_wms (a "\" is missing at line 86); yaim reports this message:<verbatim> sed: -e expression #3, char 84: unknown option to `s'</verbatim>
66. [[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLogDevel09#2011_02_14_Ale][repo 110214]] %RED% *NOT FIXED* (2011/02/14) %ENDCOLOR% When you cancel a collection, some nodes (the ones which were in state Running or DONE OK) are put in state "Cleared".
67. 
[[https://twiki.cnaf.infn.it/twiki/bin/view/EgeeJra1It/WorkLogDevel09#2011_02_14_Ale][repo 110214]] %RED% *NOT FIXED* (2011/02/14) %ENDCOLOR% Wm dies with these messages:<verbatim> 14 Feb, 16:05:46 -E: [Error] unrecoverable(submit_request.cpp:126): https://devel09.cnaf.infn.it:9000/3hTRNHN65Dv64kYDLKjZkQ failed (hit job shallow retry count (2)) 14 Feb, 16:05:46 -I: [Info] checkRequirement(matchmakerISMImpl.cpp:107): MM for job: https://devel09.cnaf.infn.it:9000/dsX5dYARJ2rAnk--uyYcJg (19/2976 [0] ) 14 Feb, 16:05:46 -E: [Error] handle_synch_signal(signal_handling.cpp:77): Got a synchronous signal (11), stack trace: /opt/glite/bin/glite-wms-workload_manager /lib64/libpthread.so.0 classad::EvalState::SetRootScope() classad::ClassAd::EvaluateAttr(std::string const&, classad::Value&) const glite::wmsutils::classads::evaluate_attribute(classad::ClassAd const&, std::string const&) /opt/glite/lib64/libglite_wms_matchmaking.so.0(_ZN5glite3wms11matchmaking17matchmakerISMImpl16checkRequirementERN7classad7Class glite::wms::broker::RBSimpleISMImpl::findSuitableCEs(classad::ClassAd const*) glite::wms::broker::ResourceBroker::findSuitableCEs(classad::ClassAd const*) /opt/glite/lib64/libglite_wms_helper_broker_ism.so glite::wms::helper::broker::Helper::resolve(classad::ClassAd const*, boost::shared_ptr<std::string>) const glite::wms::helper::Helper::resolve(classad::ClassAd const*, boost::shared_ptr<std::string>) const glite::wms::helper::RequestStateMachine::next_step(classad::ClassAd const*, boost::shared_ptr<std::string>) glite::wms::helper::Request::Impl::resolve() glite::wms::manager::server::Plan(classad::ClassAd const&, boost::shared_ptr<std::string>) glite::wms::manager::server::WMReal::submit(classad::ClassAd const&, boost::shared_ptr<_edg_wll_Context>, boost::shared_ptr<std::string>, bool) glite::wms::manager::server::SubmitProcessor::operator()() boost::function0<void, std::allocator<void> >::operator()() const glite::wms::manager::server::Events::run() 
boost::function0<void, std::allocator<boost::function_base> >::operator()() const /usr/lib64/libboost_thread.so.2</verbatim> -- Main.AlessioGianelle - 2011-02-22
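
When chasing token problems like the ones in points 54 and 57, it can help to work out where a job's token file should be on disk. The following is a minimal Python sketch, not WMS code. The path layout (first two characters of the job's unique string as a subdirectory, then the job id with ":" written as "_3a" and "/" as "_2f") is inferred only from the two log excerpts above, and =sandbox_dir= and =token_path= are hypothetical helper names.

```python
import posixpath
from urllib.parse import urlsplit

# Root directory as it appears in the log excerpts of points 54 and 57.
SANDBOX_ROOT = "/var/glite/SandboxDir"


def sandbox_dir(job_id, root=SANDBOX_ROOT):
    """Return the sandbox directory for a job id (layout inferred from the logs)."""
    # The unique string is the path part of the job id, e.g. "Ham9zx-h36iNR0FVtjT9Ng".
    unique = urlsplit(job_id).path.lstrip("/")
    if len(unique) < 2:
        raise ValueError("job id has no usable unique string: %r" % job_id)
    # ASSUMPTION: only ':' and '/' are escaped; other characters were not
    # observed in the logs, so they are left untouched here.
    encoded = job_id.replace(":", "_3a").replace("/", "_2f")
    return posixpath.join(root, unique[:2], encoded)


def token_path(job_id, index, root=SANDBOX_ROOT):
    """Return the expected path of token.txt_<index> for the given job."""
    return posixpath.join(sandbox_dir(job_id, root), "token.txt_%d" % index)


if __name__ == "__main__":
    print(token_path("https://cream-44.pd.infn.it:9000/Ham9zx-h36iNR0FVtjT9Ng", 0))
```

The resulting path can then be checked with =ls -l= on the WMS node to see whether the token exists before a resubmission is attempted.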
Topic attachments
| *Attachment* | *Size* | *Date* | *Who* | *Comment* |
| loginfo.txt | 41.7 K | 2010-11-11 - 10:35 | AlessioGianelle | Problem 33: loginfo |
| status.txt | 2.2 K | 2010-11-11 - 10:35 | AlessioGianelle | Problem 33: status |
Topic revision: r1 - 2011-02-22 - AlessioGianelle
Copyright © 2008-2023 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.