Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 17 to 17 | ||||||||
The workaround consists of running YAIM every time one of the above packages is updated. | ||||||||
Changed: | ||||||||
< < | Configuration error: cannot get instance of CommandManager: org/glite/lb/LBException | |||||||
> > | Configuration error: cannot get instance of CommandManager: org/glite/lb/LBException | |||||||
This issue occurs in one of the following cases: | ||||||||
Changed: | ||||||||
< < |
ln -s /usr/share/java/glite-lb-client-java.jar /var/lib/tomcat*/webapps/ce-cream/WEB-INF/lib
| |||||||
> > |
| |||||||
Changed: | ||||||||
< < | SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped | |||||||
> > | The workaround consists of creating a symbolic link in the web application deployment directory:
ln -sf /usr/lib/java/glite-lb-client-java.jar /var/lib/tomcat*/webapps/ce-cream/WEB-INF/lib
SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped | |||||||
This error is reported only by tomcat6 when the server container is being shut down. Even though the message suggests a memory leak, this issue does not occur while the server is running. |
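The symbolic-link workaround for the LBException error can be wrapped in a small helper, sketched here under stated assumptions. The function name `link_lb_client_jar` is hypothetical, and the jar location is passed as a parameter because the revisions on this page show it under both /usr/share/java and /usr/lib/java.

```shell
#!/bin/sh
# Hypothetical helper (not part of CREAM): link the L&B client jar into
# every CREAM web application lib directory passed as an argument.
# $1     : path of glite-lb-client-java.jar on this host (assumption: caller knows it)
# $2 ... : candidate WEB-INF/lib directories, e.g. the expanded tomcat glob
link_lb_client_jar() {
    jar="$1"
    shift
    if [ ! -f "$jar" ]; then
        echo "jar not found: $jar" >&2
        return 1
    fi
    for libdir in "$@"; do
        # Skip glob patterns that did not expand to an existing directory
        [ -d "$libdir" ] || continue
        ln -sf "$jar" "$libdir/"
        echo "linked $jar into $libdir"
    done
}
```

A possible invocation is `link_lb_client_jar /usr/lib/java/glite-lb-client-java.jar /var/lib/tomcat*/webapps/ce-cream/WEB-INF/lib`. Checking that the jar exists first avoids leaving a dangling link in the deployment directory.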
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 19 to 19 | ||||||||
Configuration error: cannot get instance of CommandManager: org/glite/lb/LBException | ||||||||
Changed: | ||||||||
< < | This issue occurs when upgrading the CREAM CE from 1.15 to 1.16. The workaround is to create a symbolic link in the web application: | |||||||
> > | This issue occurs in one of the following cases:
| |||||||
ln -s /usr/share/java/glite-lb-client-java.jar /var/lib/tomcat*/webapps/ce-cream/WEB-INF/lib | ||||||||
Added: | ||||||||
> > |
| |||||||
SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Missing infoprovider wrapper or corrupted configuration file in EMI-3
Due to a bug in the rpm scriptlets of the packages:
| |||||||
Configuration error: cannot get instance of CommandManager: org/glite/lb/LBException
This issue occurs when upgrading the CREAM CE from 1.15 to 1.16. The workaround is to create a symbolic link in the web application: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Configuration error: cannot get instance of CommandManager: org/glite/lb/LBException
This issue occurs when upgrading the CREAM CE from 1.15 to 1.16. The workaround is to create a symbolic link in the web application:
ln -s /usr/share/java/glite-lb-client-java.jar /var/lib/tomcat*/webapps/ce-cream/WEB-INF/lib | |||||||
SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped
This error is reported only by tomcat6 when the server container is being shut down. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Deleted: | ||||||||
< < | Error from TORQUE infoprovider: Exception: Cannot find user for xxxxx.xxxxxxxx
This error can be reported by the low level infoprovider of TORQUE when the CREAM service and TORQUE server are deployed on different hosts and a set of "local accounts", not belonging to any pool account, is defined on the TORQUE server. The error message is misleading: the infoprovider is unable to retrieve the group of the local account, not the user. The workaround consists of creating on the CREAM host the same set of users and groups defined on the TORQUE server. | |||||||
SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped
This error is reported only by tomcat6 when the server container is being shut down. | ||||||||
Line: 55 to 50 | ||||||||
Argus server does not authorize a user if it is under heavy load. For a workaround see Problem with Argus 1.5 (EMI-2) and CREAM | ||||||||
Deleted: | ||||||||
< < | NoClassDefFoundError from Argus updating from EMI-2 to EMI-3
When updating from EMI-2 to EMI-3, the support for Argus requires CAnL libraries which are incompatible with the CREAM service in EMI-3. It is necessary to force the service to use the old Argus libraries in this way:
ln -sf /usr/share/java/argus-pep-api-java-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-api-java.jar
ln -sf /usr/share/java/argus-pep-common-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-common.jar
Wrong time format for MaxWallClockTime
With EMI-2 update 9 an issue occurs concerning the time format of GlueCEPolicyMaxWallClockTime and GlueCEPolicyMaxObtainableWallClockTime on a TORQUE based installation. The attributes are published in hours. There is no workaround at the moment; the only way is to change the script /usr/libexec/info-dynamic-pbs according to:
360c360
< $maxWall = int(&convertHhMmSs($1)/60);
---
> $maxWall = int(&convertHhMmSs($1));
363c363
< $defaultWall = int(&convertHhMmSs($1)/60);
---
> $defaultWall = int(&convertHhMmSs($1));
Relevant bug: https://savannah.cern.ch/bugs/?101076 | |||||||
tomcat5 fails to start unless the supplementary repository is enabled in RHEL 5.9
In RHEL 5.9, installing tomcat5 (tomcat5-5.5.23-0jpp.37.el5) pulls in java-1.7.0-ibm-devel (1:1.7.0.3.0-1jpp.2.el5) if the supplementary-5 repository is enabled. However, without the supplementary repo set up, tomcat5 pulls in java-1.4.2-gcj-compat-devel instead. In this case starting tomcat5 fails, as seen in /var/log/tomcat5/catalina.out: | ||||||||
Line: 229 to 200 | ||||||||
Moreover, to guarantee that the accounting file is updated on the fly, the GE configuration should be tuned (using qconf -mconf) in order to add under the reporting_params the following definitions: accounting=true accounting_flush_time=00:00:00 | ||||||||
Deleted: | ||||||||
< < | Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter node authentication method. Please see here | |||||||
pbs_server create hangs on first time run
On a fresh installation of torque-server, running pbs_server for the first time by running /etc/init.d/pbs_server start will hang. Also, the command to create | ||||||||
Line: 347 to 312 | ||||||||
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce –infile /import/dir/to/cream-ce/diagnose-for-cream | ||||||||
Deleted: | ||||||||
< < | Reconfiguration after update
After an update of the CREAM RPM, it is mandatory to reconfigure (via yaim) | |||||||
Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD | ||||||||
Line: 382 to 344 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Error from TORQUE infoprovider: Exception: Cannot find user for xxxxx.xxxxxxxx
This error can be reported by the low level infoprovider of TORQUE when the CREAM service and TORQUE server are deployed on different hosts and a set of "local accounts", not belonging to any pool account, is defined on the TORQUE server. The error message is misleading: the infoprovider is unable to retrieve the group of the local account, not the user. The workaround consists of creating on the CREAM host the same set of users and groups defined on the TORQUE server.
Fix provided in EMI-3 update 14
NoClassDefFoundError from Argus updating from EMI-2 to EMI-3
When updating from EMI-2 to EMI-3, the support for Argus requires CAnL libraries which are incompatible with the CREAM service in EMI-3. It is necessary to force the service to use the old Argus libraries in this way:
ln -sf /usr/share/java/argus-pep-api-java-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-api-java.jar
ln -sf /usr/share/java/argus-pep-common-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-common.jar
Fix provided in EMI-3 update 6
Wrong time format for MaxWallClockTime
With EMI-2 update 9 an issue occurs concerning the time format of GlueCEPolicyMaxWallClockTime and GlueCEPolicyMaxObtainableWallClockTime on a TORQUE based installation. The attributes are published in hours. There is no workaround at the moment; the only way is to change the script /usr/libexec/info-dynamic-pbs according to:
360c360
< $maxWall = int(&convertHhMmSs($1)/60);
---
> $maxWall = int(&convertHhMmSs($1));
363c363
< $defaultWall = int(&convertHhMmSs($1)/60);
---
> $defaultWall = int(&convertHhMmSs($1));
Relevant bug: https://savannah.cern.ch/bugs/?101076
Fix provided in EMI-3 update 6 | |||||||
Problem switching off of the JobSubmissionManager (i.e. JOB_SUBMISSION_MANAGER_ENABLE false in /etc/glite-ce-cream/cream-config.xml)
The switching off of the JobSubmissionManager ( |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Error from TORQUE infoprovider: Exception: Cannot find user for xxxxx.xxxxxxxx
This error can be reported by the low level infoprovider of TORQUE when the CREAM service and TORQUE server are deployed on different hosts and a set of "local accounts", not belonging to any pool account, is defined on the TORQUE server. The error message is misleading: the infoprovider is unable to retrieve the group of the local account, not the user. The workaround consists of creating on the CREAM host the same set of users and groups defined on the TORQUE server. | |||||||
SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped
This error is reported only by tomcat6 when the server container is being shut down. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 13 to 13 | ||||||||
This error is reported only by tomcat6 when the server container is being shut down. Even though the message suggests a memory leak, this issue does not occur while the server is running. | ||||||||
Added: | ||||||||
> > | Problem in configuring CREAM (ONLY if the YAIM tool isn't used)
If CREAM isn't configured by using the YAIM tool, the following query MUST be executed on the cream database:
use creamdb; ALTER TABLE db_info MODIFY creationTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP; commit;
This is because there is an unwanted auto-updating of the field "creationTime" on the cream database, which causes a problem when submitting jobs via WMS. When submitting jobs via WMS, you could obtain the following wrong message:
=============== glite-wms-job-status Success =============
BOOKKEEPING INFORMATION:
Status info for the Job : ...
Current Status: Aborted <-----------------
Status Reason: CREAM'S database has been scratched and all its jobs have been lost <-----------------
Destination: ...
Submitted: ...
Parent Job: ...
==================================================================
N.B: The CREAM database isn't scratched, but it is so from the WMS point of view because of the above problem. | |||||||
Authorization error from Argus in EMI-2
Argus server does not authorize a user if it is under heavy load. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | SEVERE: A web application created a ThreadLocal ... but failed to remove it when the web application was stopped
This error is reported only by tomcat6 when the server container is being shut down. Even though the message suggests a memory leak, this issue does not occur while the server is running. | |||||||
Authorization error from Argus in EMI-2
Argus server does not authorize a user if it is under heavy load. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Deleted: | ||||||||
< < | A web application appears to have started a thread named [Timer-*] but has failed to stop it. This is very likely to create a memory leak
This error message is caused by a fault in the procedure that disables the CREAM-ES and/or CEMonitor service on an SL6 installation. The files /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml and /usr/share/tomcat6/conf/Catalina/localhost/ce-monitor.xml must be manually removed and the tomcat service restarted. | |||||||
Authorization error from Argus in EMI-2
Argus server does not authorize a user if it is under heavy load. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 40 to 40 | ||||||||
> $defaultWall = int(&convertHhMmSs($1)); | ||||||||
Added: | ||||||||
> > | Relevant bug: https://savannah.cern.ch/bugs/?101076 | |||||||
tomcat5 fails to start unless the supplementary repository is enabled in RHEL 5.9
In RHEL 5.9, installing tomcat5 (tomcat5-5.5.23-0jpp.37.el5) pulls in java-1.7.0-ibm-devel (1:1.7.0.3.0-1jpp.2.el5) if the supplementary-5 repository is enabled. However, without the supplementary repo set up, tomcat5 pulls in java-1.4.2-gcj-compat-devel instead. In this case starting tomcat5 fails, as seen in /var/log/tomcat5/catalina.out: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | A web application appears to have started a thread named [Timer-*] but has failed to stop it. This is very likely to create a memory leak
This error message is caused by a fault in the procedure that disables the CREAM-ES and/or CEMonitor service on an SL6 installation. The files /usr/share/tomcat6/conf/Catalina/localhost/ce-cream-es.xml and /usr/share/tomcat6/conf/Catalina/localhost/ce-monitor.xml must be manually removed and the tomcat service restarted. | |||||||
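The manual cleanup for the [Timer-*] message can be sketched as a small shell function. The name `remove_stale_contexts` is hypothetical; the directory and the descriptor names are the ones quoted in the entry above, and the function only touches the descriptors named by the caller, since only the disabled services should be affected.

```shell
#!/bin/sh
# Hypothetical helper (not part of CREAM): remove leftover context
# descriptors of services that have been disabled.
# $1     : Catalina localhost configuration directory
# $2 ... : context names without the .xml suffix (e.g. ce-cream-es ce-monitor)
remove_stale_contexts() {
    confdir="$1"
    shift
    for name in "$@"; do
        ctx="$confdir/$name.xml"
        if [ -f "$ctx" ]; then
            rm -f "$ctx"
            echo "removed $ctx"
        else
            echo "not present: $ctx"
        fi
    done
}
```

For instance, `remove_stale_contexts /usr/share/tomcat6/conf/Catalina/localhost ce-cream-es ce-monitor`, followed by a restart of the tomcat service using whatever init mechanism the host provides.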
Authorization error from Argus in EMI-2
Argus server does not authorize a user if it is under heavy load. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Authorization error from Argus in EMI-2
Argus server does not authorize a user if it is under heavy load. For a workaround see Problem with Argus 1.5 (EMI-2) and CREAM | |||||||
NoClassDefFoundError from Argus updating from EMI-2 to EMI-3
When updating from EMI-2 to EMI-3, the support for Argus requires CAnL libraries which are incompatible with the CREAM service in EMI-3. It is necessary to force the service to use the old Argus libraries in this way: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Changed: | ||||||||
< < | Wrong time format for MaxWallClockTime
With EMI-2 update 9 an issue occurs concerning the time format of GlueCEPolicyMaxWallClockTime and GlueCEPolicyMaxObtainableWallClockTime on a TORQUE based installation. | |||||||
> > | NoClassDefFoundError from Argus updating from EMI-2 to EMI-3
When updating from EMI-2 to EMI-3, the support for Argus requires CAnL libraries which are incompatible with the CREAM service in EMI-3. It is necessary to force the service to use the old Argus libraries in this way:
ln -sf /usr/share/java/argus-pep-api-java-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-api-java.jar
ln -sf /usr/share/java/argus-pep-common-compat.jar $CATALINA_HOME/webapps/ce-cream/WEB-INF/lib/argus-pep-common.jar
Wrong time format for MaxWallClockTime
With EMI-2 update 9 an issue occurs concerning the time format of GlueCEPolicyMaxWallClockTime and GlueCEPolicyMaxObtainableWallClockTime on a TORQUE based installation. | |||||||
The attributes are published in hours. There is no workaround at the moment; the only way is to change the script /usr/libexec/info-dynamic-pbs according to:
360c360 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 151 to 151 | ||||||||
Changed: | ||||||||
< < | This can happen if the BUpdaterSGE daemon is restarted by a user other than root (for example, tomcat starts the daemon at boot time and restarts it if the daemon is dead) without sourcing the proper environment. The workaround is to force the environment to be loaded in /etc/init.d/gLite and /etc/init.d/glite-ce-blahparser. This can be done simply by adding, at the beginning of those scripts, a line like the one below | |||||||
> > | This can happen if the BUpdaterSGE daemon is restarted by a user other than root (for example, tomcat starts the daemon at boot time and restarts it if the daemon is dead) without sourcing the proper environment. The workaround is to force the environment to be loaded in /etc/init.d/gLite and /etc/init.d/glite-ce-blahparser. This can be done simply by adding, at the beginning of those scripts, a line like the one below, which sources a file where the GE environment is properly defined (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL). | |||||||
. /etc/profile.d/sge.sh |
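The environment check described for the BUpdaterSGE process can be automated along these lines. This is a sketch: `missing_ge_vars` is a hypothetical name, and the variable names are matched as prefixes because the list in the text may abbreviate the names actually set on a given installation.

```shell
#!/bin/sh
# Hypothetical helper: print the GE variables that are NOT present in a
# NUL-separated environ file such as /proc/<pid>/environ.
# $1 : path to the environ file
missing_ge_vars() {
    environ_file="$1"
    for var in SGE_EXECD SGE_QMASTER SGE_ROOT SGE_CLUSTER_NAME SGE_CELL; do
        # Entries are NUL-separated; turn them into lines before matching.
        # Prefix match (assumption): SGE_EXECD also accepts e.g. SGE_EXECD_PORT.
        if ! tr '\000' '\n' < "$environ_file" | grep -q "^${var}"; then
            echo "$var"
        fi
    done
}
```

With the PID taken from the ps output, `missing_ge_vars /proc/7423/environ` prints nothing when every listed variable is present, and one name per line otherwise.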
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 138 to 138 | ||||||||
The BUpdaterSGE daemon does not inherit the correct GE environment variables | ||||||||
Changed: | ||||||||
< < | If the environment of a BUpdaterSGE process does not include the GE environment variables, the GE client commands (qstat, qconf) cannot be executed by BUpdaterSGE. | |||||||
> > | If the environment of a BUpdaterSGE process does not include the GE environment variables, the GE client commands (qstat, qconf) cannot be executed by BUpdaterSGE. A consequence of this is qacct segfault messages in syslog or dmesg. | |||||||
As a consequence, BUpdaterSGE will assume that jobs have been cancelled (because it receives no information from qstat or qacct). You can check the environment of the BUpdaterSGE process using the following commands and searching for the GE environment variables (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL) | ||||||||
Line: 147 to 147 | ||||||||
tomcat 7423 0.6 0.5 37184 21328 ? S Nov23 103:56 /usr/bin/BUpdaterSGE
root 30622 0.0 0.0 61180 804 pts/0 R+ 13:41 0:00 grep -i sge | ||||||||
Changed: | ||||||||
< < | # cat /proc/7423/environ | |||||||
> > | # (cat /proc/7423/environ; echo) | tr '\000' '\n' | |||||||
This can happen if the BUpdaterSGE daemon is restarted by a user other than root (for example, tomcat starts the daemon at boot time and restarts it if the daemon is dead) without sourcing the proper environment. The workaround is to force the environment to be loaded in /etc/init.d/gLite and /etc/init.d/glite-ce-blahparser. This can be done simply by adding, at the beginning of those scripts, a line like the one below |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Wrong time format for MaxWallClockTime
With EMI-2 update 9 an issue occurs concerning the time format of GlueCEPolicyMaxWallClockTime and GlueCEPolicyMaxObtainableWallClockTime on a TORQUE based installation. The attributes are published in hours. There is no workaround at the moment; the only way is to change the script /usr/libexec/info-dynamic-pbs according to:
360c360
< $maxWall = int(&convertHhMmSs($1)/60);
---
> $maxWall = int(&convertHhMmSs($1));
363c363
< $defaultWall = int(&convertHhMmSs($1)/60);
---
> $defaultWall = int(&convertHhMmSs($1));
| |||||||
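The effect of dropping the extra division by 60 can be illustrated with a small shell sketch. It assumes that the script's convertHhMmSs routine returns whole minutes; that is an assumption, chosen because it is consistent with the statement that the unpatched values come out in hours. The function names below are hypothetical stand-ins, not code from info-dynamic-pbs.

```shell
#!/bin/sh
# Hypothetical stand-in for convertHhMmSs.
# ASSUMPTION: the real routine converts HH:MM:SS into whole minutes.
hhmmss_to_minutes() {
    echo "$1" | awk -F: '{ print int($1 * 60 + $2 + $3 / 60) }'
}

# Value published before the change: the result is divided by 60 again,
# which yields hours.
wall_before_patch() {
    echo $(( $(hhmmss_to_minutes "$1") / 60 ))
}

# Value published after the change: the converter's result is used as is.
wall_after_patch() {
    hhmmss_to_minutes "$1"
}
```

Under that assumption, a 48:00:00 wall clock limit is published as 48 before the change and as 2880 after it.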
tomcat5 fails to start unless the supplementary repository is enabled in RHEL 5.9
In RHEL 5.9, installing tomcat5 (tomcat5-5.5.23-0jpp.37.el5) pulls in java-1.7.0-ibm-devel (1:1.7.0.3.0-1jpp.2.el5) if the supplementary-5 repository is enabled. However, without the supplementary repo set up, tomcat5 pulls in java-1.4.2-gcj-compat-devel instead. In this case starting tomcat5 fails, as seen in /var/log/tomcat5/catalina.out: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | tomcat5 fails to start unless the supplementary repository is enabled in RHEL 5.9
In RHEL 5.9, installing tomcat5 (tomcat5-5.5.23-0jpp.37.el5) pulls in java-1.7.0-ibm-devel (1:1.7.0.3.0-1jpp.2.el5) if the supplementary-5 repository is enabled. However, without the supplementary repo set up, tomcat5 pulls in java-1.4.2-gcj-compat-devel instead. In this case starting tomcat5 fails, as seen in /var/log/tomcat5/catalina.out:
Using CATALINA_BASE: /usr/share/tomcat5
Using CATALINA_HOME: /usr/share/tomcat5
Using CATALINA_TMPDIR: /usr/share/tomcat5/temp
Using JRE_HOME:
WARNING: error instantiating 'org.apache.juli.ClassLoaderLogManager' referenced by java.util.logging.manager, class not found
java.lang.ClassNotFoundException: org.apache.juli.ClassLoaderLogManager not found
<<No stacktrace available>>
WARNING: error instantiating '1catalina.org.apache.juli.FileHandler,' referenced by handlers, class not found
java.lang.ClassNotFoundException: 1catalina.org.apache.juli.FileHandler,
<<No stacktrace available>>
Exception during runtime initialization
java.lang.ExceptionInInitializerError
<<No stacktrace available>>
Caused by: java.lang.NullPointerException
<<No stacktrace available>>
Version-Release number of selected component (if applicable): tomcat5-5.5.23-0jpp.37.el5. How reproducible: always. Steps to reproduce:
| |||||||
Problem on transferring the sandbox files between the (EMI-3) CREAM CE and the WN on SLURM
This issue prevents the WN from transferring the sandbox files back to the CREAM node. It happens only if the file system is not shared (see the YAIM configuration for SLURM) and CREAM is configured with the following YAIM variables: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 16 to 16 | ||||||||
CE_BATCH_SYS=slurm JOB_MANAGER=slurm | ||||||||
Changed: | ||||||||
< < | ||||||||
> > | The workaround is to overwrite the file /usr/libexec/slurm_submit.sh with the fixed one available for download at https://github.com/prelz/BLAH/blob/master/src/scripts/slurm_submit.sh. It is not necessary to restart tomcat or reconfigure CREAM to apply the patch. | |||||||
The jobSuspend and jobResume operations fail on CREAM CE (EMI-3) SLURM enabled
Since the hold and resume operations have not yet been implemented in BLAH, CREAM reports the failure of such operations to the user with an error message like "slurm hold command failed (stdout:hold not supported-)" | ||||||||
Added: | ||||||||
> > | The jobs submitted to CREAM CE (EMI-3) SLURM enabled fail with the message “failure reason = 127”
If the jobs fail with “failure reason = 127”, please restart the sshd daemon on all nodes in order to apply the changes to the ssh configuration:
# /etc/init.d/sshd restart | |||||||
Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration for an EMI-2 CREAM CE using GE as batch system fails with: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 18 to 18 | ||||||||
Added: | ||||||||
> > | The jobSuspend and jobResume operations fail on CREAM CE (EMI-3) SLURM enabled
Since the hold and resume operations have not yet been implemented in BLAH, CREAM reports the failure of such operations to the user with an error message like "slurm hold command failed (stdout:hold not supported-)" | |||||||
Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration for an EMI-2 CREAM CE using GE as batch system fails with: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > |
Problem on transferring the sandbox files between the (EMI-3) CREAM CE and the WN on SLURM
This issue prevents the WN from transferring the sandbox files back to the CREAM node. It happens only if the file system is not shared (see the YAIM configuration for SLURM) and CREAM is configured with the following YAIM variables:
SANDBOX_TRANSFER_METHOD_BETWEEN_CE_WN=LRMS
CE_BATCH_SYS=slurm
JOB_MANAGER=slurm | |||||||
Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration for an EMI-2 CREAM CE using GE as batch system fails with: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Deleted: | ||||||||
< < | Problem switching off of the JobSubmissionManager (i.e. JOB_SUBMISSION_MANAGER_ENABLE false in /etc/glite-ce-cream/cream-config.xml)
The switching off of the JobSubmissionManager ( | |||||||
Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration for an EMI-2 CREAM CE using GE as batch system fails with: | ||||||||
Line: 265 to 259 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem switching off of the JobSubmissionManager (i.e. JOB_SUBMISSION_MANAGER_ENABLE false in /etc/glite-ce-cream/cream-config.xml)
The switching off of the JobSubmissionManager ( | |||||||
Problem by submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)
There is an unwanted auto-updating of the field "creationTime" on the cream database. This happens, for example, when tomcat is restarted (yaim stops and starts the tomcat service). |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 391 to 391 | ||||||||
Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=78902 | ||||||||
Added: | ||||||||
> > | CREAM is able to protect itself when the load, memory usage, etc. is too high by disabling submission through a limiter script (/usr/bin/glite_cream_load_monitor). Because the thresholds for these system values and the CREAM specific parameters are defined in a configuration file, the file /etc/glite-ce-cream-utils/glite_cream_load_monitor.conf must be modified as well. | |||||||
Memory leak in bupdater for PBS and LSF
Version 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS. Because of that, the memory usage of the bupdater process keeps increasing until it crashes or is killed by the OOM killer. It is then automatically restarted by blah. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 74 to 74 | ||||||||
This most probably means that the "qstat" and "qacct" commands cannot be successfully executed by tomcat. This could happen for several reasons:
| ||||||||
Changed: | ||||||||
< < |
| |||||||
> > |
| |||||||
The BUpdaterSGE daemon does not inherit the correct GE environment variables | ||||||||
Line: 106 to 106 | ||||||||
error: unable to send message to qmaster using port 15020 on host "xxxxxxxx": can't set CA chain file | ||||||||
Added: | ||||||||
> > | The accounting file is not shared in the CreamCE or not updated on the fly
BUpdaterSGE needs to consult the GE accounting file to determine how a given job ended. Therefore, the GE accounting file must be shared between the GE SERVER / QMASTER and the CREAM CE. Moreover, to guarantee that the accounting file is updated on the fly, the GE configuration should be tuned (using qconf -mconf) in order to add under the reporting_params the following definitions: accounting=true accounting_flush_time=00:00:00 | |||||||
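Whether the two reporting_params definitions are in place can be verified with a small filter over a dump of the GE global configuration. This is a sketch under stated assumptions: `check_reporting_params` is a hypothetical name, the configuration text is read from standard input (for example from `qconf -sconf`, assumed here as the way to print the configuration), and long lines are assumed to be continued with a trailing backslash.

```shell
#!/bin/sh
# Hypothetical helper: read a GE configuration dump on stdin and verify
# that reporting_params carries both settings named in the text.
check_reporting_params() {
    # Join backslash-continued lines, then keep the reporting_params entry
    params=$(sed -e ':a' -e '/\\$/N; s/\\\n[[:space:]]*//; ta' \
        | grep '^reporting_params' | head -n 1)
    case "$params" in
        *accounting=true*) ;;
        *) echo "missing: accounting=true"; return 1 ;;
    esac
    case "$params" in
        *accounting_flush_time=00:00:00*) ;;
        *) echo "missing: accounting_flush_time=00:00:00"; return 1 ;;
    esac
    echo "reporting_params OK"
}
```

A possible use is `qconf -sconf | check_reporting_params`; a non-zero exit status indicates that the configuration still needs to be edited with qconf -mconf.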
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter node authentication method. Please see |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 518 to 518 | ||||||||
-- LisaZangrando - 2012-10-26 | ||||||||
Added: | ||||||||
> > |
| |||||||
| ||||||||
Added: | ||||||||
> > |
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 14 to 14 | ||||||||
"Received NULL fault; the error is due to another cause: FaultString=[CREAM service not available: configuration failed!] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server]" | ||||||||
Deleted: | ||||||||
< < | EMI-2 CREAM CE delegates bad proxy to WN
The delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced by executing the following command:
openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid]
A temporary workaround is to change the proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh; the modified version of the script is available here | |||||||
Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration for an EMI-2 CREAM CE using GE as batch system fails with: | ||||||||
Line: 298 to 293 | ||||||||
commit;
| ||||||||
Added: | ||||||||
> > | Fix provided with EMI-2 update 7
EMI-2 CREAM CE delegates bad proxy to WN
The delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced by executing the following command:
openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid]
A temporary workaround is to change the proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh; the modified version of the script is available here
Fix provided with EMI-2 update 7 | |||||||
Problem with cancelled jobs notification
In BLAH 1.18.0 (EMI2), for the notification of cancelled jobs to work correctly, the following row must be added to /etc/blah.config: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 14 to 14 | ||||||||
"Received NULL fault; the error is due to another cause: FaultString=[CREAM service not available: configuration failed!] - FaultCode=[SOAP-ENV:Server] - FaultSubCode=[SOAP-ENV:Server]" | ||||||||
Deleted: | ||||||||
< < | Problem by submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)There is an unwanted auto-updating of the field "creationTime" on the cream database. This happen, for example, when tomcat is restarted (yaim does the stop and the start of the tomcat service). Submitting jobs via WMS, you could obtain the following wrong message:=================== glite-wms-job-status Success =================
BOOKKEEPING INFORMATION:
Status info for the Job :
...
Current Status: Aborted <-----------------
Status Reason: CREAM'S database has been scratched and all its jobs have been lost <-----------------
Destination: ...
Submitted: ...
Parent Job: ...
======================================================================
N.B: The CREAM database isn't scratched, but it is so from the WMS point of view because of the above problem.
The problem is solved applying the following workaround on the cream database:
use creamdb; ALTER TABLE db_info MODIFY creationTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP; commit; | |||||||
EMI-2 CREAM CE delegates bad proxy to WNThe delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced executing the following command:openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid] | ||||||||
Line: 298 to 265 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem by submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)
There is an unwanted auto-update of the "creationTime" field in the cream database. This happens, for example, when tomcat is restarted (yaim stops and starts the tomcat service). When submitting jobs via WMS, you could get the following misleading message:
=================== glite-wms-job-status Success =================
BOOKKEEPING INFORMATION:
Status info for the Job :
...
Current Status: Aborted <-----------------
Status Reason: CREAM'S database has been scratched and all its jobs have been lost <-----------------
Destination: ...
Submitted: ...
Parent Job: ...
======================================================================
N.B.: The CREAM database isn't actually scratched, but it appears so from the WMS point of view because of the above problem.
The problem is solved by applying the following workaround to the cream database:
use creamdb; ALTER TABLE db_info MODIFY creationTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP; commit; | |||||||
Problem with cancelled jobs notificationIn BLAH 1.18.0(EMI2) to run correctly the notification of cancelled jobs it is needed to add in /etc/blah.config the row: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 110 to 110 | ||||||||
this most probably means that the "qstat" and "qacct" commands cannot be executed successfully by tomcat. This can happen for several reasons: | ||||||||
Changed: | ||||||||
< < | * The BUpdaterSGE daemons does not inherits the correct GE environment variables
* = Tomcat user is not allowed to query the GE system=
* = The accounting file is not shared in the CreamCE= | |||||||
> > |
| |||||||
The BUpdaterSGE daemon does not inherit the correct GE environment variables |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 98 to 98 | ||||||||
This will be fixed in EMI 2.
CREAM jobs are cancelled with status reason=3 in a GE system | ||||||||
Added: | ||||||||
> > |
When BLAH submits a job to a GE system, the sge_submit.sh script records it in the blah job_registry with status "1"; all subsequent status changes are written to the job_registry by the BUpdaterSGE daemon.
BUpdaterSGE determines the status of a job by examining the output of a "qstat" command. A job that disappears from that output is ambiguous: was it cancelled or did it finish? To tell the difference, BUpdaterSGE queries the accounting log with "qacct". If the accounting log has information about the job, the job finished; otherwise it was cancelled. Two "qacct -j" queries are made, one minute apart. If both return an error, the job is assumed to be cancelled.
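The finished-versus-cancelled decision described above can be sketched in a few lines of shell. This is only an illustration of the logic, not the BUpdaterSGE implementation: the function name and the QACCT and RETRY_DELAY variables are hypothetical, and the only behaviour taken from the text is "two qacct -j queries, one minute apart, cancelled if both fail".

```shell
#!/bin/sh
# Illustrative sketch only; QACCT and RETRY_DELAY are hypothetical knobs,
# not BLAH configuration parameters.
QACCT="${QACCT:-qacct}"           # command used to query the accounting log
RETRY_DELAY="${RETRY_DELAY:-60}"  # seconds between the two queries

classify_vanished_job() {
    # $1 = id of a job that no longer appears in the qstat output
    jobid="$1"
    # First query: accounting data present means the job finished.
    if "$QACCT" -j "$jobid" >/dev/null 2>&1; then
        echo finished
        return 0
    fi
    # The accounting record may not be written yet, so wait and retry once.
    sleep "$RETRY_DELAY"
    if "$QACCT" -j "$jobid" >/dev/null 2>&1; then
        echo finished
    else
        # Both queries failed: the job is assumed to be cancelled.
        echo cancelled
    fi
}
```

The sketch makes the failure mode visible: if "qacct" cannot run at all (missing environment, missing permissions, accounting file not reachable), both queries fail and every vanished job is classified as cancelled.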
If jobs are systematically reported as cancelled in glite-cream-ce.log, like
02 Dec 2012 12:43:01,247 org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor - JOB CREAM172977114 STATUS CHANGED: REALLY-RUNNING => CANCELLED [description=Cancelled by CE admin] [failureReason=reason=3] [localUser=XXX] [workerNode=XXX] [delegationId=1354371579.274301]
this most probably means that the "qstat" and "qacct" commands cannot be executed successfully by tomcat. This can happen for several reasons:
* The BUpdaterSGE daemon does not inherit the correct GE environment variables
* The tomcat user is not allowed to query the GE system
* The accounting file is not shared with the CreamCE
The BUpdaterSGE daemon does not inherit the correct GE environment variables | |||||||
If the environment present in a BUpdaterSGE process does not include the GE environment variables, the GE client commands (qstat, qconf) can not be executed by BUpdaterSGE. As a consequence, BUpdaterSGE will assume that jobs have been cancelled (because it receives no information from qstat or qacct). You can check the environment for BUpdaterSGE process using the following commands and searching for the GE env variables (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL) | ||||||||
Line: 115 to 133 | ||||||||
. /etc/profile.d/sge.sh | ||||||||
Added: | ||||||||
> > | Tomcat user is not allowed to query the GE system
Some GE systems use certificates to encrypt the communication between GE client and server. For CreamCE, tomcat must be able to query your system (the BUpdaterSGE daemon runs as user tomcat). If this is not the case, you will most probably get the following error when running "qstat" as user tomcat:
su - tomcat
sh-3.2$ qstat -u '*'
error: commlib error: can't set CA chain file (/var/sgeCA/sge_qmaster/GridKa/userkeys/tomcat/cert.pem)
error: commlib error: ssl error ([ID=33558530] in module "system library": "No such file or directory")
error: unable to send message to qmaster using port 15020 on host "xxxxxxxx": can't set CA chain file | |||||||
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 99 to 97 | ||||||||
This will be fixed in EMI 2. | ||||||||
Added: | ||||||||
> > | CREAM jobs are cancelled with status reason=3 in a GE system
If the environment of a BUpdaterSGE process does not include the GE environment variables, BUpdaterSGE cannot execute the GE client commands (qstat, qconf). As a consequence, BUpdaterSGE assumes that jobs have been cancelled (because it receives no information from qstat or qacct). You can check the environment of the BUpdaterSGE process with the following commands, searching for the GE environment variables (SGE_EXECD, SGE_QMASTER, SGE_ROOT, SGE_CLUSTER_NAME, SGE_CELL):
# ps xuawww | grep -i sge
tomcat 7423 0.6 0.5 37184 21328 ? S Nov23 103:56 /usr/bin/BUpdaterSGE
root 30622 0.0 0.0 61180 804 pts/0 R+ 13:41 0:00 grep -i sge
# cat /proc/7423/environ
This can happen if the BUpdaterSGE daemon is restarted by a user other than root (for example, tomcat starts the daemon at boot time and restarts it if the daemon is dead) without sourcing the proper environment. The workaround is to force the environment to be loaded in /etc/init.d/gLite and /etc/init.d/glite-ce-blahparser, by adding a line like the one below at the beginning of both scripts:
. /etc/profile.d/sge.sh | |||||||
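Reading /proc/<pid>/environ by eye is error-prone because its entries are separated by NUL bytes. The check above could be automated with a small helper such as the following sketch. The function name is hypothetical, and the variable names are copied from the list in the text; they are treated as prefixes, on the assumption that an installation may export longer names such as SGE_EXECD_PORT.

```shell
#!/bin/sh
# Names copied from the text above; matched as prefixes (an assumption),
# so that e.g. SGE_EXECD_PORT counts as SGE_EXECD being present.
SGE_VARS="SGE_EXECD SGE_QMASTER SGE_ROOT SGE_CLUSTER_NAME SGE_CELL"

missing_sge_vars() {
    # $1 = environ file of the process, e.g. /proc/<pid>/environ
    #      (NAME=value entries separated by NUL bytes)
    environ_file="$1"
    for name in $SGE_VARS; do
        # Turn NUL separators into newlines, then look for the variable.
        if ! tr '\0' '\n' < "$environ_file" | grep -q "^${name}[A-Z_]*="; then
            echo "$name"
        fi
    done
}

# Usage, with the pid taken from the ps output:
#   missing_sge_vars /proc/7423/environ
```

An empty output means all the listed variables are present; any name printed is missing from the daemon's environment.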
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem switching off of the JobSubmissionManager (i.e. JOB_SUBMISSION_MANAGER_ENABLE false in /etc/glite-ce-cream/cream-config.xml)
The switching off of the JobSubmissionManager ( | |||||||
Problem by submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)There is an unwanted auto-updating of the field "creationTime" on the cream database. This happen, for example, when tomcat is restarted (yaim does the stop and the start of the tomcat service). |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 46 to 46 | ||||||||
openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid]A temporary workaround is to change to proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh, the modified version of the script is available here | ||||||||
Deleted: | ||||||||
< < | CREAM doesn't transfert the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the transfer method selected is "LRMS" and the name of the URL is lexicographically greater than "gsiftp://localhost". No workaround is available.FAILURE_REASON="Cannot enqueue the command id=-1: Data truncation: Data too long for column 'commandGroupId' at row 1 (rollback performed)"it's a bug reintroduced in EMI-2 CREAM CE: https://savannah.cern.ch/bugs/index.php?95593 A workaround is to modify a database table by executing the following query:use creamdb; ALTER TABLE command_queue MODIFY commandGroupId varchar(255) NULL; | |||||||
Problem at first configuration with EMI-2 CREAM with GEThe first yaim configuration for a EMI-2 CREAM CE using GE as batch system fails with: | ||||||||
Line: 462 to 445 | ||||||||
| ||||||||
Added: | ||||||||
> > | CREAM doesn't transfer the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"
The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the selected transfer method is "LRMS" and the name of the URL is lexicographically greater than "gsiftp://localhost". No workaround is available.
FAILURE_REASON="Cannot enqueue the command id=-1: Data truncation: Data too long for column 'commandGroupId' at row 1 (rollback performed)"
This is a bug reintroduced in EMI-2 CREAM CE: https://savannah.cern.ch/bugs/index.php?95593 A workaround is to modify a database table by executing the following query:
use creamdb; ALTER TABLE command_queue MODIFY commandGroupId varchar(255) NULL; | |||||||
Changed: | ||||||||
< < | -- MassimoSgaravatto - 2011-05-05 | |||||||
> > | -- LisaZangrando - 2012-10-26 | |||||||
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 86 to 86 | ||||||||
In this case the blparser can crash. It will then automatically restarted by the blparser_master process, but for a few minutes submissions won't work. | ||||||||
Deleted: | ||||||||
< < | Problem with cancelled jobs notification | |||||||
Deleted: | ||||||||
< < | In BLAH 1.18.0(EMI2) to run correctly the notification of cancelled jobs it is needed to add in /etc/blah.config the row:
bupdater_use_bhist_for_killed="yes" | |||||||
Issue with conflicting BUpdaterSGE instances | ||||||||
Line: 268 to 263 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > |
Problem with cancelled jobs notification
In BLAH 1.18.0 (EMI2), for the notification of cancelled jobs to work correctly, the following row must be added to /etc/blah.config:
bupdater_use_bhist_for_killed="yes"
Fix released with EMI-1 Update 17 and EMI-2 Update 1 | |||||||
Problem with generic dynamic scheduler with SGEThe yaim plugin for sge configures the gip for publishing information but when used out of the box the following error is shown in the BDII log: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 62 to 62 | ||||||||
NULL; | ||||||||
Deleted: | ||||||||
< < | GlueCEStateWaitingJobs: 444444 and WallTime workaroundIf on the queues there is published:GlueCEStateWaitingJobs: 444444and in the log /var/log/bdii/bdii-update.log you notice errors like the folllowing: Traceback (most recent call last): File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ? wrt = qwt * nwait TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'probably the queues have no "resources_default.walltime" parameter configured. So define it for each queue by launching, for example: # qmgr -c "set queue prod resources_default.walltime = 01:00:00" # qmgr -c "set queue cert resources_default.walltime = 01:00:00" # qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00" Error parsing GLUE2PolicyRuleAll the attibutes "GLUE2PolicyRule" defined in the file /var/lib/bdii/gip/ldif/ComputingShare.ldif MUST BE in the form "VO:nameofthevo" (the VO prefix is mandatory) Other strings, even the empty one, are not correctly parsed by the script lcg-info-dynamic-scheduler and the following error is reported in the BDII log:vogrp = tmpl[2].strip() IndexError: list index out of rangeIn that case the wrong attributes MUST BE removed | |||||||
Problem at first configuration with EMI-2 CREAM with GE | ||||||||
Line: 120 to 93 | ||||||||
bupdater_use_bhist_for_killed="yes" | ||||||||
Deleted: | ||||||||
< < | Problem with generic dynamic scheduler with SGEThe yaim plugin for sge configures the gip for publishing information but when used out of the box the following error is shown in the BDII log:Traceback (most recent call last): File "/usr/libexec/lcg-info-dynamic-scheduler", line 256, in ? import lrms ImportError: No module named lrmsThe workaround is defining PYTHONPATH in /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper :
$ cat /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper #!/bin/sh #/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf export PYTHONPATH=/usr/lib/python:$PYTHONPATH /usr/libexec/lcg-info-dynamic-scheduler -c /etc/lcg-info-dynamic-scheduler.confRelevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961 | |||||||
Issue with conflicting BUpdaterSGE instancesgLite service in CreamCE starts the following services by this exact order: tomcat5, glite-lb-locallogger and glite-ce-blahparser. | ||||||||
Line: 307 to 268 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem with generic dynamic scheduler with SGE
The yaim plugin for sge configures the gip for publishing information, but when used out of the box the following error is shown in the BDII log:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 256, in ?
    import lrms
ImportError: No module named lrms
The workaround is to define PYTHONPATH in /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper:
$ cat /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper
#!/bin/sh
#/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf
export PYTHONPATH=/usr/lib/python:$PYTHONPATH
/usr/libexec/lcg-info-dynamic-scheduler -c /etc/lcg-info-dynamic-scheduler.conf
Relevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961
Fix provided with EMI-2 update 3
Error parsing GLUE2PolicyRule
All the "GLUE2PolicyRule" attributes defined in the file /var/lib/bdii/gip/ldif/ComputingShare.ldif MUST BE in the form "VO:nameofthevo" (the VO prefix is mandatory). Other strings, even the empty one, are not correctly parsed by the script lcg-info-dynamic-scheduler and the following error is reported in the BDII log:
    vogrp = tmpl[2].strip()
IndexError: list index out of range
In that case the wrong attributes MUST BE removed.
Fix provided with EMI-2 update 3
GlueCEStateWaitingJobs: 444444 and WallTime workaround
If the following is published for the queues:
GlueCEStateWaitingJobs: 444444
and in the log /var/log/bdii/bdii-update.log you notice errors like the following:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
    wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
then the queues probably have no "resources_default.walltime" parameter configured. Define it for each queue by running, for example:
# qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00"
Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=83229
Fix provided with EMI-2 update 3 | |||||||
Issue with the setting of the maximum number of accepted FTP connectionsThe number of maximum number of gridftp connections is now automatically set in/etc/grid-security/gridftp.conf . |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem by submitting jobs via WMS (direct submission to CREAM CE isn't affected by this problem)
There is an unwanted auto-update of the "creationTime" field in the cream database. This happens, for example, when tomcat is restarted (yaim stops and starts the tomcat service). When submitting jobs via WMS, you could get the following misleading message:
=================== glite-wms-job-status Success =================
BOOKKEEPING INFORMATION:
Status info for the Job :
...
Current Status: Aborted <-----------------
Status Reason: CREAM'S database has been scratched and all its jobs have been lost <-----------------
Destination: ...
Submitted: ...
Parent Job: ...
======================================================================
N.B.: The CREAM database isn't actually scratched, but it appears so from the WMS point of view because of the above problem.
The problem is solved by applying the following workaround to the cream database:
use creamdb; ALTER TABLE db_info MODIFY creationTime TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP; commit; | |||||||
EMI-2 CREAM CE delegates bad proxy to WNThe delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced executing the following command:openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid] |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 11 to 11 | ||||||||
EMI-2 CREAM CE delegates bad proxy to WNThe delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced executing the following command:openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid] | ||||||||
Changed: | ||||||||
< < | A temporary workaround is to change to proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh from:
if [ $? -eq 0 ] ; then let "hours=$secondsLeft/3600" let "minutes=((secondsLeft%3600)/60)" /usr/bin/voms-proxy-init $rfcoption -limited -valid $hours:$minutes -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt else /usr/bin/voms-proxy-init $rfcoption -limited -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt fito if [ $? -eq 0 ] ; then let "hours=$secondsLeft/3600" let "minutes=((secondsLeft%3600)/60)" /usr/bin/grid-proxy-init $rfcoption -limited -valid $hours:$minutes -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt else /usr/bin/grid-proxy-init $rfcoption -limited -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt fi | |||||||
> > | A temporary workaround is to change the proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh; the modified version of the script is available here | |||||||
CREAM doesn't transfert the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the transfer method selected is "LRMS" and the name of the URL is lexicographically greater than "gsiftp://localhost". | ||||||||
Line: 424 to 405 | ||||||||
-- MassimoSgaravatto - 2011-05-05 \ No newline at end of file | ||||||||
Added: | ||||||||
> > |
|
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | EMI-2 CREAM CE delegates bad proxy to WN
The delegated and limited proxy on the CE contains a corrupted field "X509v3 Key Usage"; this issue can be reproduced by executing the following command:
openssl x509 -noout -text -in /var/glite/cream_sandbox/[vo]/[dn_fqan_mappeduser]/proxy/[delegationid_internalid]
A temporary workaround is to change the proxy-limiting command in the script /usr/bin/glite-cream-copyProxyToSandboxDir.sh from:
if [ $? -eq 0 ] ; then
  let "hours=$secondsLeft/3600"
  let "minutes=((secondsLeft%3600)/60)"
  /usr/bin/voms-proxy-init $rfcoption -limited -valid $hours:$minutes -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt
else
  /usr/bin/voms-proxy-init $rfcoption -limited -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt
fi
to
if [ $? -eq 0 ] ; then
  let "hours=$secondsLeft/3600"
  let "minutes=((secondsLeft%3600)/60)"
  /usr/bin/grid-proxy-init $rfcoption -limited -valid $hours:$minutes -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt
else
  /usr/bin/grid-proxy-init $rfcoption -limited -cert $fileNamePath.$$ -key $fileNamePath.$$ -out $fileNamePath.lmt
fi | |||||||
CREAM doesn't transfert the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the transfer method selected is "LRMS" and the name of the URL is lexicographically greater than "gsiftp://localhost". No workaround is available. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | CREAM doesn't transfer the output files remotely if SANDBOX_TRANSFER_METHOD="LRMS"
The related bug is https://savannah.cern.ch/bugs/index.php?95480; the error occurs only if the selected transfer method is "LRMS" and the name of the URL is lexicographically greater than "gsiftp://localhost". No workaround is available. | |||||||
FAILURE_REASON="Cannot enqueue the command id=-1: Data truncation: Data too long for column 'commandGroupId' at row 1 (rollback performed)"it's a bug reintroduced in EMI-2 CREAM CE: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | FAILURE_REASON="Cannot enqueue the command id=-1: Data truncation: Data too long for column 'commandGroupId' at row 1 (rollback performed)"
This is a bug reintroduced in EMI-2 CREAM CE: https://savannah.cern.ch/bugs/index.php?95593 A workaround is to modify a database table by executing the following query:
use creamdb; ALTER TABLE command_queue MODIFY commandGroupId varchar(255) NULL; | |||||||
GlueCEStateWaitingJobs: 444444 and WallTime workaroundIf on the queues there is published: | ||||||||
Line: 112 to 124 | ||||||||
This will be fixed in EMI 2. | ||||||||
Deleted: | ||||||||
< < | Issue with the setting of the maximum number of accepted FTP connectionsThe number of maximum number of gridftp connections is now automatically set in/etc/grid-security/gridftp.conf .
It should be manually added also in the file /etc/gridftp.conf where the line:
connections_max 150should be added. Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=78902 | |||||||
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see | ||||||||
Line: 268 to 265 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Issue with the setting of the maximum number of accepted FTP connections
The maximum number of gridftp connections is now automatically set in /etc/grid-security/gridftp.conf.
It should also be added manually to the file /etc/gridftp.conf, where the line:
connections_max 150
should be added. Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=78902 | |||||||
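Adding the line by hand on every reconfiguration is easy to get wrong (the line can be forgotten or appended twice). A minimal sketch of an idempotent helper follows; the function name is hypothetical, and it only checks for the exact line, so an existing "connections_max" entry with a different value would have to be edited manually.

```shell
#!/bin/sh
ensure_line() {
    # $1 = configuration file, $2 = exact line that must be present
    file="$1"
    line="$2"
    # -x matches whole lines only, -F treats the line as a fixed string.
    # The line is appended only when no identical line exists yet.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage for the workaround above:
#   ensure_line /etc/gridftp.conf 'connections_max 150'
```

Running the helper repeatedly leaves a single copy of the line in the file.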
Memory leak in bupdater for PBS and LSFVersion 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS. Because of that the usage of memory of the bupdater process will keep increasing till when it crashes/it is killed by OOM. It is then automatically restarted by blah. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | GlueCEStateWaitingJobs: 444444 and WallTime workaround
If the following is published for the queues:
GlueCEStateWaitingJobs: 444444
and in the log /var/log/bdii/bdii-update.log you notice errors like the following:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 435, in ?
    wrt = qwt * nwait
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
then the queues probably have no "resources_default.walltime" parameter configured. Define it for each queue by running, for example:
# qmgr -c "set queue prod resources_default.walltime = 01:00:00"
# qmgr -c "set queue cert resources_default.walltime = 01:00:00"
# qmgr -c "set queue cloudtf resources_default.walltime = 01:00:00" | |||||||
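The per-queue commands above all follow one pattern, so for a site with many queues they could be generated rather than typed. The following sketch only prints the qmgr commands so they can be reviewed before being run as root; the function name is hypothetical, and the queue names and walltime value are the ones from the example, not recommendations.

```shell
#!/bin/sh
walltime_commands() {
    # $1 = default walltime (HH:MM:SS); remaining arguments = queue names
    walltime="$1"
    shift
    for queue in "$@"; do
        # Print, do not execute: the output is meant to be reviewed first.
        printf 'qmgr -c "set queue %s resources_default.walltime = %s"\n' "$queue" "$walltime"
    done
}

# Usage:
#   walltime_commands 01:00:00 prod cert cloudtf
```

Once the printed commands look right, the output can be piped to "sh" on the batch server.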
Error parsing GLUE2PolicyRule |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Error parsing GLUE2PolicyRule
All the "GLUE2PolicyRule" attributes defined in the file /var/lib/bdii/gip/ldif/ComputingShare.ldif MUST BE in the form "VO:nameofthevo" (the VO prefix is mandatory). Other strings, even the empty one, are not correctly parsed by the script lcg-info-dynamic-scheduler and the following error is reported in the BDII log:
    vogrp = tmpl[2].strip()
IndexError: list index out of range
In that case the wrong attributes MUST BE removed. | |||||||
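To find the attributes that need removing, the LDIF file can be scanned for values that do not match the required form. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it assumes each attribute sits on its own line written as "GLUE2PolicyRule: value" with that exact capitalisation.

```shell
#!/bin/sh
bad_policy_rules() {
    # $1 = LDIF file, e.g. /var/lib/bdii/gip/ldif/ComputingShare.ldif
    # Prints every GLUE2PolicyRule line whose value is not "VO:<name>",
    # including lines with an empty value.
    # Exit status is 0 when offending lines were found, non-zero otherwise.
    grep '^GLUE2PolicyRule:' "$1" | grep -v '^GLUE2PolicyRule: *VO:[^ ]'
}

# Usage:
#   bad_policy_rules /var/lib/bdii/gip/ldif/ComputingShare.ldif
```

Any line printed is one that lcg-info-dynamic-scheduler would fail to parse according to the description above.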
Problem at first configuration with EMI-2 CREAM with GEThe first yaim configuration for a EMI-2 CREAM CE using GE as batch system fails with: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem at first configuration with EMI-2 CREAM with GE
The first yaim configuration of an EMI-2 CREAM CE using GE as batch system fails with:
/etc/lrms/scheduler.conf: No such file or directory
The problem disappears when yaim is run again. | |||||||
Problem if there are "extra" characters in the beginning of the tomcat key fileBecause on an issue in trustmanager, there can be problems if there is something before-----BEGIN ...---- in /etc/grid-security/tomcat-key.pem |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem if there are "extra" characters in the beginning of the tomcat key file
Because of an issue in trustmanager, there can be problems if there is something before -----BEGIN ...---- in /etc/grid-security/tomcat-key.pem
The workaround is simply to remove these characters. | |||||||
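Removing the extra characters can be done in an editor, or with a filter like the sketch below. The function name is hypothetical; it writes the cleaned key to standard output instead of modifying the file, so the result can be inspected before it replaces the original (with the original ownership and permissions preserved).

```shell
#!/bin/sh
strip_before_begin() {
    # $1 = PEM file; prints the content starting at the first "-----BEGIN "
    # The first sed drops every line before the one containing the marker;
    # the second removes anything preceding the marker on that line.
    sed -n '/-----BEGIN /,$p' "$1" | sed '1s/^.*\(-----BEGIN \)/\1/'
}

# Usage:
#   strip_before_begin /etc/grid-security/tomcat-key.pem > /tmp/tomcat-key.clean
```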
Problem suspending not running job for LSF with old blparserWhen the CREAM CE is configured to use the old blparser, there might be problems suspending jobs when LSF is used. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refer to known problem affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem suspending not running job for LSF with old blparser
When the CREAM CE is configured to use the old blparser, there might be problems suspending jobs when LSF is used. In this case the blparser can crash. It will then be automatically restarted by the blparser_master process, but for a few minutes submissions won't work. | |||||||
Problem with cancelled jobs notificationIn BLAH 1.18.0(EMI2) to run correctly the notification of cancelled jobs it is needed to add in /etc/blah.config the row: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 13 to 13 | ||||||||
In BLAH 1.18.0(EMI2) to run correctly the notification of cancelled jobs it is needed to add in /etc/blah.config the row: | ||||||||
Changed: | ||||||||
< < | bupdater_use_bhist_for_killed | |||||||
> > | bupdater_use_bhist_for_killed="yes" | |||||||
Problem with generic dynamic scheduler with SGE |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 40 to 40 | ||||||||
Relevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961 | ||||||||
Added: | ||||||||
> > | Issue with conflicting BUpdaterSGE instances
The gLite service in CreamCE starts the following services in this exact order: tomcat5, glite-lb-locallogger and glite-ce-blahparser. The default behaviour of tomcat5 is to start the BUpdaterSGE daemon if it thinks it is not running. The problem is that, at start-up time, BUpdaterSGE is also started afterwards by glite-ce-blahparser. This gives rise to two running instances of the BUpdaterSGE daemon and to a race condition while monitoring running jobs. Jobs may end up being cancelled because of the conflicting BUpdaterSGE instances. The workaround is to change the order in which the services are started:
# diff /root/gLite.orig /etc/init.d/gLite
36c36
< start) SERVICE_LIST=`cat $GLITE_STARTUP_FILE`
---
> start) SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort`
44c44
< stop) SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort`
---
> stop) SERVICE_LIST=`cat $GLITE_STARTUP_FILE | sort -r`
This will be fixed in EMI 2. | |||||||
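The effect of the patched init script can be previewed without touching the services. The sketch below assumes the startup file lists one service name per line, which the text does not state, and uses hypothetical function names; LC_ALL=C is set only to make the sort order predictable.

```shell
#!/bin/sh
# Assumption: the startup file holds one service name per line.
start_order() {
    # $1 = startup file; order used by the patched "start" branch
    LC_ALL=C sort "$1"
}

stop_order() {
    # $1 = startup file; order used by the patched "stop" branch
    LC_ALL=C sort -r "$1"
}

# Usage:
#   start_order "$GLITE_STARTUP_FILE"
```

With the three services named in the text, sorting puts glite-ce-blahparser first and tomcat5 last, so presumably BUpdaterSGE is already running by the time tomcat5 checks for it; the reversed order stops tomcat5 first.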
Issue with the setting of the maximum number of accepted FTP connections |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problem with cancelled jobs notification | |||||||
Changed: | ||||||||
< < | ||||||||
> > | In BLAH 1.18.0 (EMI2), for the notification of cancelled jobs to work correctly, the following row must be added to /etc/blah.config:
bupdater_use_bhist_for_killed | |||||||
Problem with generic dynamic scheduler with SGE |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 70 to 70 | ||||||||
Relevant bug: https://savannah.cern.ch/bugs/?90085 | ||||||||
Added: | ||||||||
> > |
Problems running single yaim functions in EMI-2
When configuring an EMI-2 CREAM-CE with yaim, there might be problems if a single function is run (the problem is that the =TOMCAT_VERSION_ variable is not defined). The problem involves the following functions:
yaim -c -s site-info.def -n creamCE ... ).
| |||||||
Problems re-enabling CEMon and/or EMI-ES
When configuring an EMI-2 CREAM-CE, by default CEMon and EMI-ES are not deployed, unless the relevant yaim variables USE_CEMON and USE_EMIES are set to true. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 70 to 70 | ||||||||
Relevant bug: https://savannah.cern.ch/bugs/?90085 | ||||||||
Added: | ||||||||
> > | Problems re-enabling CEMon and/or EMI-ES
When configuring an EMI-2 CREAM-CE, by default CEMon and EMI-ES are not deployed, unless the relevant yaim variables USE_CEMON and USE_EMIES are set to true.
There are problems if CEMon is disabled (i.e. with USE_CEMON not set or set to false) and then a reconfiguration is done re-enabling it (i.e. setting USE_CEMON to true). The workaround is to reinstall the glite-ce-monitor rpm before reconfiguring.
The same issue affects EMI-ES. | |||||||
Execution of DAG jobs |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 36 to 36 | ||||||||
Relevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961 | ||||||||
Added: | ||||||||
> > |
Issue with the setting of the maximum number of accepted FTP connections
The maximum number of gridftp connections is now automatically set in /etc/grid-security/gridftp.conf.
It should also be added manually to the file /etc/gridftp.conf, by adding the line:
connections_max 150
Relevant ticket: https://ggus.eu/tech/ticket_show.php?ticket=78902 | |||||||
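To verify that both gridftp configuration files carry the same limit, a sketch along the following lines may help. The helper name `conf_value` is hypothetical, and the code assumes the whitespace-separated `key value` layout shown in the line quoted above.

```shell
#!/bin/sh
# conf_value FILE KEY: print the value of KEY in a "key value" style file.
# Hypothetical helper for illustration; prints nothing if KEY is absent.
conf_value() {
    # If the key appears more than once, the last occurrence is reported.
    awk -v k="$2" '$1 == k { v = $2 } END { if (v != "") print v }' "$1"
}

# Example (on the CE):
# for f in /etc/grid-security/gridftp.conf /etc/gridftp.conf; do
#     echo "$f: $(conf_value "$f" connections_max)"
# done
```

An empty value for /etc/gridftp.conf indicates that the line still has to be added by hand.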
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Deleted: | ||||||||
< < | Memory leak in bupdater for PBS and LSF
Version 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS. Because of that, the memory usage of the bupdater process keeps increasing until it crashes or is killed by the OOM killer. It is then automatically restarted by blah. SGE is not affected by this problem. For LSF and PBS, the workaround is to configure the blparser using the old method (see: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d) or to restart the bupdater from time to time. Relevant bug: https://savannah.cern.ch/bugs/index.php?89859
Problems starting the bupdater and bnotifier with (S)GE
With SGE, there is a problem when starting the bupdater and bnotifier:
Starting BNotifier: /usr/bin/BNotifier: sge_helperpath not defined. Exiting [FAILED]
Starting BUpdaterSGE: /usr/bin/BUpdaterSGE: sge_helperpath not defined. Exiting [FAILED]
The workaround is to uncomment the variable:
sge_helperpath=/usr/bin/sge_helper
in the blah.config file. | |||||||
Deleted: | ||||||||
< < | See https://savannah.cern.ch/bugs/index.php?88974 and https://ggus.eu/ws/ticket_info.php?ticket=76067 | |||||||
Problem with generic dynamic scheduler with SGE | ||||||||
Line: 181 to 151 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Memory leak in bupdater for PBS and LSF
Version 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS. Because of that, the memory usage of the bupdater process keeps increasing until it crashes or is killed by the OOM killer. It is then automatically restarted by blah. SGE is not affected by this problem. For LSF and PBS, the workaround is to configure the blparser using the old method (see: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d) or to restart the bupdater from time to time. Relevant bug: https://savannah.cern.ch/bugs/index.php?89859 Fix provided with EMI-1 Update 12
Problems starting the bupdater and bnotifier with (S)GE
With SGE, there is a problem when starting the bupdater and bnotifier:
Starting BNotifier: /usr/bin/BNotifier: sge_helperpath not defined. Exiting [FAILED]
Starting BUpdaterSGE: /usr/bin/BUpdaterSGE: sge_helperpath not defined. Exiting [FAILED]
The workaround is to uncomment the variable:
sge_helperpath=/usr/bin/sge_helper
in the blah.config file.
See https://savannah.cern.ch/bugs/index.php?88974 and https://ggus.eu/ws/ticket_info.php?ticket=76067
Fix provided with EMI-1 Update 12 | |||||||
No dynamic info published for one VOview
For one VOView the lcg-info-dynamic-scheduler doesn't publish information, and therefore the values defined in the static ldif file are used. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 78 to 78 | ||||||||
a server database file hangs (/etc/init.d/pbs_server create ). Issue tracked at https://bugzilla.redhat.com/show_bug.cgi?id=744138. | ||||||||
Added: | ||||||||
> > | Problems with suspend when the old blparser is used
When the old blparser is used, there are problems with the suspend command. Relevant bug: https://savannah.cern.ch/bugs/?90085 | |||||||
Execution of DAG jobs |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 42 to 42 | ||||||||
See https://savannah.cern.ch/bugs/index.php?88974 and https://ggus.eu/ws/ticket_info.php?ticket=76067 | ||||||||
Added: | ||||||||
> > | Problem with generic dynamic scheduler with SGE
The yaim plugin for SGE configures the GIP for publishing information, but when used out of the box the following error is shown in the BDII log:
Traceback (most recent call last):
  File "/usr/libexec/lcg-info-dynamic-scheduler", line 256, in ?
    import lrms
ImportError: No module named lrms
The workaround is to define PYTHONPATH in /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper:
$ cat /var/lib/bdii/gip/plugin/glite-info-dynamic-scheduler-wrapper
#!/bin/sh
#/opt/lcg/libexec/lcg-info-dynamic-scheduler -c /opt/glite/etc/lcg-info-dynamic-scheduler.conf
export PYTHONPATH=/usr/lib/python:$PYTHONPATH
/usr/libexec/lcg-info-dynamic-scheduler -c /etc/lcg-info-dynamic-scheduler.conf
Relevant ticket: https://ggus.eu/ws/ticket_info.php?ticket=76961 | |||||||
Significant changes introduced with Torque 2.5.7-1 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 15 to 15 | ||||||||
SGE is not affected by this problem | ||||||||
Changed: | ||||||||
< < | For LSF and PBS, the workaround is to configure the blparser using the old method. See: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d | |||||||
> > | For LSF and PBS, the workaround is to configure the blparser using the old method (see: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d) or to restart the bupdater from time to time. | |||||||
Relevant bug: https://savannah.cern.ch/bugs/index.php?89859 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > |
Memory leak in bupdater for PBS and LSF
Version 1.16.3 of BLAH is affected by a quite critical memory leak in the bupdater component for LSF and PBS. Because of that, the memory usage of the bupdater process keeps increasing until it crashes or is killed by the OOM killer. It is then automatically restarted by blah. SGE is not affected by this problem. For LSF and PBS, the workaround is to configure the blparser using the old method. See: http://wiki.italiangrid.it/twiki/bin/view/CREAM/SystemAdministratorGuideForEMI1#1_2_4_Choose_the_BLAH_BLparser_d Relevant bug: https://savannah.cern.ch/bugs/index.php?89859 | |||||||
Problems starting the bupdater and bnotifier with (S)GE
With SGE, there is a problem when starting the bupdater and bnotifier: |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 31 to 31 | ||||||||
Deleted: | ||||||||
< < | No dynamic info published for one VOview
For one VOView the lcg-info-dynamic-scheduler doesn't publish information, and therefore the values defined in the static ldif file are used.
As found by Jan Astalos (thanks!), this is because of a missing blank line at the end of /var/lib/bdii/gip/ldif/static-file-CE.ldif created by YAIM.
While waiting for the fix, the workaround is simply to run:
echo >> /var/lib/bdii/gip/ldif/static-file-CE.ldif
after having configured via yaim. Relevant bug: http://savannah.cern.ch/bugs/?86191 | |||||||
Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see | ||||||||
Line: 61 to 44 | ||||||||
Deleted: | ||||||||
< < |
Problems if Torque is not configured to suppress mails
Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying. Relevant bug: https://savannah.cern.ch/bugs/index.php?86238
Condor and SGE support
Condor and SGE are not yet fully supported as batch systems for CREAM. | |||||||
Execution of DAG jobs
Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet. | ||||||||
Deleted: | ||||||||
< < | Memory issues with new BLAH Blparser
If the new Blparser is used (click here to check this) there can be issues if the blah registry becomes very large. The submission process can get slower and there can be problems with memory usage. While waiting for the fix, there are two possible workarounds:
| |||||||
qsub crashes | ||||||||
Line: 181 to 140 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | No dynamic info published for one VOview
For one VOView the lcg-info-dynamic-scheduler doesn't publish information, and therefore the values defined in the static ldif file are used.
As found by Jan Astalos (thanks!), this is because of a missing blank line at the end of /var/lib/bdii/gip/ldif/static-file-CE.ldif created by YAIM.
While waiting for the fix, the workaround is simply to run:
echo >> /var/lib/bdii/gip/ldif/static-file-CE.ldif
after having configured via yaim. Relevant bug: http://savannah.cern.ch/bugs/?86191 Fix provided with CREAM CE 1.13.3 (see http://savannah.cern.ch/task/?24022) released with EMI-1 Update 10
Problems if Torque is not configured to suppress mails
Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying. Relevant bug: https://savannah.cern.ch/bugs/index.php?86238 Fix provided with BLAH 1.16.3 (see http://savannah.cern.ch/task/?22845) released with EMI-1 Update 10
Memory issues with new BLAH Blparser
If the new Blparser is used (click here to check this) there can be issues if the blah registry becomes very large. The submission process can get slower and there can be problems with memory usage. While waiting for the fix, there are two possible workarounds:
| |||||||
Problems with Torque 2.5.7-1 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 27 to 27 | ||||||||
in the blah.config file. | ||||||||
Changed: | ||||||||
< < | See https://ggus.eu/ws/ticket_info.php?ticket=76067 | |||||||
> > | See https://savannah.cern.ch/bugs/index.php?88974 and https://ggus.eu/ws/ticket_info.php?ticket=76067 | |||||||
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 8 to 8 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problems starting the bupdater and bnotifier with (S)GE
With SGE, there is a problem when starting the bupdater and bnotifier:
Starting BNotifier: /usr/bin/BNotifier: sge_helperpath not defined. Exiting [FAILED]
Starting BUpdaterSGE: /usr/bin/BUpdaterSGE: sge_helperpath not defined. Exiting [FAILED]
The workaround is to uncomment the variable:
sge_helperpath=/usr/bin/sge_helper
in the blah.config file.
See https://ggus.eu/ws/ticket_info.php?ticket=76067 | |||||||
No dynamic info published for one VOview |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 32 to 32 | ||||||||
here | ||||||||
Added: | ||||||||
> > | pbs_server create hangs on first time run
On a fresh installation of torque-server, running pbs_server for the first time via /etc/init.d/pbs_server start will hang. The command to create a server database file (/etc/init.d/pbs_server create) also hangs. Issue tracked at https://bugzilla.redhat.com/show_bug.cgi?id=744138.
| |||||||
Problems if Torque is not configured to suppress mails
Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 25 to 25 | ||||||||
Relevant bug: http://savannah.cern.ch/bugs/?86191 | ||||||||
Deleted: | ||||||||
< < | Problems with Torque 2.5.7-1
There is a problem with the latest torque version available in the EPEL repository (2.5.7-1). At start the following error is reported:
[root@cream-38 ~]# /etc/init.d/pbs_server start
/var/torque/server_priv/serverdb
Starting TORQUE Server: PBS_Server: LOG_ERROR::No such file or directory (2) in pbs_init, unable to stat checkpoint directory /var/torque/checkpoint/, errno 2 (No such file or directory)
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
[FAILED]
The workaround is to create that directory or to install the otherwise unneeded torque-mom package. | |||||||
Significant changes introduced with Torque 2.5.7-1 | ||||||||
Line: 168 to 151 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > |
Problems with Torque 2.5.7-1
There is a problem with the latest torque version available in the EPEL repository (2.5.7-1). At start the following error is reported:
[root@cream-38 ~]# /etc/init.d/pbs_server start
/var/torque/server_priv/serverdb
Starting TORQUE Server: PBS_Server: LOG_ERROR::No such file or directory (2) in pbs_init, unable to stat checkpoint directory /var/torque/checkpoint/, errno 2 (No such file or directory)
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
[FAILED]
Problem addressed with Torque 2.5.7-2 | |||||||
Problems affecting users with certificates signed by the GermanGrid
Because of a bug in trustmanager, users with certificates signed by the GermanGrid CA can't submit jobs to CREAM. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 43 to 43 | ||||||||
The workaround is to make that directory or install the torque-mom package otherwise pointlessly. | ||||||||
Added: | ||||||||
> > | Significant changes introduced with Torque 2.5.7-1
The updated EPEL5 build of torque-2.5.7-1, as compared to previous versions, enables munge[1] as an inter-node authentication method. Please see here | |||||||
Problems if Torque is not configured to suppress mails
Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 13 to 13 | ||||||||
For one VOView the lcg-info-dynamic-scheduler doesn't publish information, and therefore the values defined in the static ldif file are used. | ||||||||
Changed: | ||||||||
< < | As found by Jan Astalos (thanks!), this is because of a missing newline at the end of /var/lib/bdii/gip/ldif/static-file-CE.ldif created by YAIM. | |||||||
> > | As found by Jan Astalos (thanks!), this is because of a missing blank line at the end of /var/lib/bdii/gip/ldif/static-file-CE.ldif created by YAIM. | |||||||
While waiting for the fix, the workaround is simply to run: | ||||||||
Line: 23 to 23 | ||||||||
after having configured via yaim | ||||||||
Added: | ||||||||
> > | Relevant bug: http://savannah.cern.ch/bugs/?86191 | |||||||
Problems with Torque 2.5.7-1
There is a problem with the latest torque version available in the EPEL repository (2.5.7-1). |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 41 to 41 | ||||||||
The workaround is to make that directory or install the torque-mom package otherwise pointlessly. | ||||||||
Added: | ||||||||
> > | Problems if Torque is not configured to suppress mails | |||||||
Added: | ||||||||
> > | Torque should be configured to suppress all mails (mail_domain=never). Otherwise the bupdater process of the blparser will keep dying. Relevant bug: https://savannah.cern.ch/bugs/index.php?86238 | |||||||
Condor and SGE support |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 23 to 23 | ||||||||
after having configured via yaim | ||||||||
Added: | ||||||||
> > | Problems with Torque 2.5.7-1
There is a problem with the latest torque version available in the EPEL repository (2.5.7-1). At start the following error is reported:
[root@cream-38 ~]# /etc/init.d/pbs_server start
/var/torque/server_priv/serverdb
Starting TORQUE Server: PBS_Server: LOG_ERROR::No such file or directory (2) in pbs_init, unable to stat checkpoint directory /var/torque/checkpoint/, errno 2 (No such file or directory)
PBS_Server: LOG_ERROR::PBS_Server, pbsd_init failed
[FAILED]
The workaround is to create that directory or to install the otherwise unneeded torque-mom package.
| |||||||
Condor and SGE support
Condor and SGE are not yet fully supported as batch systems for CREAM. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | No dynamic info published for one VOview
For one VOView the lcg-info-dynamic-scheduler doesn't publish information, and therefore the values defined in the static ldif file are used.
As found by Jan Astalos (thanks!), this is because of a missing newline at the end of /var/lib/bdii/gip/ldif/static-file-CE.ldif created by YAIM.
While waiting for the fix, the workaround is simply to run:
echo >> /var/lib/bdii/gip/ldif/static-file-CE.ldif
after having configured via yaim. | |||||||
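Because the `echo >>` workaround appends a line every time it is run, a guarded variant may be preferable when yaim is re-run often. The following is only a sketch: the function name `ensure_trailing_blank_line` is hypothetical, and it assumes that "blank line at the end" means the file should end with one empty line.

```shell
#!/bin/sh
# ensure_trailing_blank_line FILE: make sure FILE ends with an empty line,
# without adding a second one if it is already there.
# Hypothetical helper for illustration; not part of YAIM or the BDII.
ensure_trailing_blank_line() {
    file=$1
    # Nothing to do for a missing or empty file.
    [ -s "$file" ] || return 0
    # If the final byte is not a newline, terminate the last line first.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo >> "$file"
    fi
    # Append an empty line unless the last line is already empty.
    if [ -n "$(tail -n 1 "$file")" ]; then
        echo >> "$file"
    fi
}

# Example (after having configured via yaim):
# ensure_trailing_blank_line /var/lib/bdii/gip/ldif/static-file-CE.ldif
```

Running the function repeatedly leaves the file unchanged after the first call.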
Condor and SGE support
Condor and SGE are not yet fully supported as batch systems for CREAM. |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Line: 9 to 9 | ||||||||
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI) | ||||||||
Deleted: | ||||||||
< < | Problems affecting users with certificates signed by the GermanGrid
Because of a bug in trustmanager, users with certificates signed by the GermanGrid CA can't submit jobs to CREAM. The error message is something like:
Failed to create a delegation id for job https://grid-lb0.desy.de:9000/ADkeOt6tc0Rfi8oP-pzUrQ: reason is Client 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko' is not issuer of proxy 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko,CN=proxy,CN=proxy'.
| |||||||
Condor and SGE support
Condor and SGE are not yet fully supported as batch systems for CREAM. | ||||||||
Line: 134 to 122 | ||||||||
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI) | ||||||||
Added: | ||||||||
> > | Problems affecting users with certificates signed by the GermanGrid
Because of a bug in trustmanager, users with certificates signed by the GermanGrid CA can't submit jobs to CREAM. The error message is something like:
Failed to create a delegation id for job https://grid-lb0.desy.de:9000/ADkeOt6tc0Rfi8oP-pzUrQ: reason is Client 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko' is not issuer of proxy 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko,CN=proxy,CN=proxy'.
| |||||||
Problems with SubCAs when Argus is used as authorization system
There are problems when CREAM CE is configured to use Argus, happening with sub-CAs (e.g. CERN-TCA, UKeScienceCA) |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Deleted: | ||||||||
< < | Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software in production) | |||||||
| ||||||||
Changed: | ||||||||
< < | Condor and SGE support | |||||||
> > | Open known issues
Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software released in EMI)
Problems affecting users with certificates signed by the GermanGrid
Because of a bug in trustmanager, users with certificates signed by the GermanGrid CA can't submit jobs to CREAM. The error message is something like:
Failed to create a delegation id for job https://grid-lb0.desy.de:9000/ADkeOt6tc0Rfi8oP-pzUrQ: reason is Client 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko' is not issuer of proxy 'O=GermanGrid,OU=DESY,CN=Alexander Fomenko,CN=proxy,CN=proxy'.
Condor and SGE support | |||||||
Condor and SGE are not yet fully supported as batch systems for CREAM. | ||||||||
Changed: | ||||||||
< < | Execution of DAG jobs | |||||||
> > | Execution of DAG jobs | |||||||
Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet. | ||||||||
Changed: | ||||||||
< < | qsub crashes | |||||||
> > | Memory issues with new BLAH Blparser
If the new Blparser is used (click here to check this) there can be issues if the blah registry becomes very large. The submission process can get slower and there can be problems with memory usage. While waiting for the fix, there are two possible workarounds:
qsub crashes | |||||||
With some Torque versions, qsub was observed crashing with glibc detecting a double free or corruption. Although this is a problem to be addressed in Torque, adding: | ||||||||
Line: 26 to 55 | ||||||||
to /etc/blah.config should help | ||||||||
Changed: | ||||||||
< < | CREAM CE not Torque master: communication errors when the maui server and client are not of the same builds. | |||||||
> > | CREAM CE not Torque master: communication errors when the maui server and client are not of the same builds. | |||||||
* Bug #61698: when the CREAM CE is not a Torque server, there could be communication errors when the maui (and probably torque) server and client are NOT of the same builds. | ||||||||
Line: 82 to 111 | ||||||||
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce –infile /import/dir/to/cream-ce/diagnose-for-cream | ||||||||
Changed: | ||||||||
< < | Reconfiguration after update | |||||||
> > | Reconfiguration after update | |||||||
After an update of the CREAM RPM, it is mandatory to reconfigure (via yaim) | ||||||||
Changed: | ||||||||
< < | Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD | |||||||
> > | Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD | |||||||
Don't use special characters in the CREAM_DB_USER and CREAM_DB_PASSWORD yaim variables | ||||||||
Changed: | ||||||||
< < | Problems with OS language different than US English | |||||||
> > | Problems with OS language different than US English | |||||||
Problems have been reported if jobs are submitted through the WMS to a CREAM CE deployed on a machine installed using a non-English language. This is because of different representations of decimal numbers. The workaround in this case is to uncomment the line: | ||||||||
Line: 101 to 130 | ||||||||
in $CATALINA_HOME/conf/tomcat5.conf and then restart tomcat | ||||||||
Added: | ||||||||
> > | Old known issues
Problems in CREAM software or in other software modules affecting a CREAM based CE that have already been fixed (i.e. they are not affecting the latest release of the software released in EMI)
Problems with SubCAs when Argus is used as authorization system
There are problems when CREAM CE is configured to use Argus, happening with sub-CAs (e.g. CERN-TCA, UKeScienceCA)
| |||||||
-- MassimoSgaravatto - 2011-05-05 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Known issues | ||||||||
Added: | ||||||||
> > | Known problems in CREAM software or in other software modules affecting a CREAM based CE (the list refers to known problems affecting the latest release of the software in production) | |||||||
| ||||||||
Added: | ||||||||
> > | Condor and SGE support
Condor and SGE are not yet fully supported as batch systems for CREAM. | |||||||
Execution of DAG jobs
Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet. | ||||||||
Line: 19 to 26 | ||||||||
to /etc/blah.config should help | ||||||||
Added: | ||||||||
> > | CREAM CE not Torque master: communication errors when the maui server and client are not of the same builds.
* Bug #61698: when the CREAM CE is not a Torque server, there could be communication errors when the maui (and probably torque) server and client are NOT of the same builds. A common scenario/example when this can happen:
[root@cream-ce]# diagnose -g
If you see:
ERROR: lost connection to server
ERROR: cannot request service (status)
you are affected by the problem. A possible workaround is the following: On the LCG-CE create a cron file to dump the diagnose -g output to a file:
[root@lcg-ce]# cat <<EOF>> /etc/cron.d/diagnose-for-cream
*/5 * * * * root /usr/bin/diagnose -g > /export/dir/to/cream-ce/diagnose.out
EOF
The interval defined in the /etc/cron.d/diagnose-for-cream file has to be set by the experts. Just an example has been provided here.
Then export over NFS the directory where the file is located:
[root@lcg-ce]# cat /etc/exports
/export/dir/to/cream-ce cream-ce(rw,map_identity,no_root_squash,sync)
On the CREAM-CE include/mount the remote directory to a local one:
[root@cream-ce]# cat /etc/fstab | grep diagnose
lcg-ce:/export/dir/to/cream-ce /import/dir/to/cream-ce nfs defaults,bg 0 0
Then feed the lcg-info-dynamic-scheduler with the diagnose output file:
[root@cream-ce]# cat /opt/glite/etc/lcg-info-dynamic-scheduler.conf|grep vomaxjobs-maui
vo_max_jobs_cmd: /opt/lcg/libexec/vomaxjobs-maui -h lcg-ce –infile /import/dir/to/cream-ce/diagnose-for-cream
Reconfiguration after update
After an update of the CREAM RPM, it is mandatory to reconfigure (via yaim)
Special characters in CREAM_DB_USER and CREAM_DB_PASSWORD
Don't use special characters in the CREAM_DB_USER and CREAM_DB_PASSWORD yaim variables
Problems with OS language different than US English
Problems have been reported if jobs are submitted through the WMS to a CREAM CE deployed on a machine installed using a non-English language. This is because of different representations of decimal numbers. The workaround in this case is to uncomment the line:
LANG=en_US
in $CATALINA_HOME/conf/tomcat5.conf and then restart tomcat
| |||||||
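The check described in the maui communication-errors entry above, i.e. looking for the two ERROR lines in the `diagnose -g` output, could be scripted roughly as follows. The function name `maui_diagnose_failed` is hypothetical; the matched strings are the ones quoted in that entry.

```shell
#!/bin/sh
# maui_diagnose_failed: read `diagnose -g` output on stdin and return 0
# (success) if it contains one of the error lines quoted in the entry above.
# Hypothetical helper for illustration only.
maui_diagnose_failed() {
    grep -qE '^ERROR: (lost connection to server|cannot request service)'
}

# Example (on the CREAM CE):
# if diagnose -g 2>&1 | maui_diagnose_failed; then
#     echo "maui client/server communication problem detected"
# fi
```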
-- MassimoSgaravatto - 2011-05-05 |
Line: 1 to 1 | ||||||||
---|---|---|---|---|---|---|---|---|
Added: | ||||||||
> > |
Known issues
Execution of DAG jobs
Execution of DAG jobs on the CREAM based CE through the gLite WMS is not implemented yet.
qsub crashes
With some Torque versions, qsub was observed crashing with glibc detecting a double free or corruption. Although this is a problem to be addressed in Torque, adding:
export MALLOC_CHECK_=0
to /etc/blah.config should help
-- MassimoSgaravatto - 2011-05-05 |