Automatic tests
- report #1:
- CREAM UI version: 1.12.1; CREAM testsuite version: 1.0.7
- Used direct polling for monitoring and BUpdater/BNotifier for status change detection
- Batch system: TORQUE
- All tests have been performed successfully (see the report from the testsuite)
Since the current version of the CREAM CE does not enable CEMonitor in a standard installation, all the tests that rely on the notification mechanism were left out.
Test submission through a WMS (i.e. ICE)
Description:
- 2880 collections of 20 jobs each
- One collection every 60 seconds
- Four users
- We use these CEs located at Padua:
- 5 CEs SL5/64bit (2 lsf + 3 torque)
- Use automatic-delegation
- The job is a "sleep random(7200)"
- Resubmission is enabled
- Use proxy renewal service (myproxy.cern.ch)
- Lease mechanism is not used
Results
- Collections correctly submitted: 2880 (57600 jobs)
- DONE OK: 57600 (100%)
- NOTDONE: 0 (0 %)
- ABORTED: 0 (0 %)
- CANCELLED: 0 (0 %)
- Resubmitted: 23 (0.04 %)
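The totals above can be cross-checked with a few lines of shell arithmetic. This is only a sketch: the variable names are illustrative, the figures are copied from the description and results lists, and the 48-hour submission window is derived from them rather than stated in the report.

```shell
#!/bin/sh
# Figures copied from the description and results lists above.
collections=2880
jobs_per_collection=20
interval_s=60        # one collection every 60 seconds
resubmitted=23

# Total number of jobs and length of the submission window.
total_jobs=$((collections * jobs_per_collection))
submit_hours=$((collections * interval_s / 3600))

# POSIX sh arithmetic is integer-only, so awk computes the percentage.
resub_pct=$(LC_ALL=C awk -v r="$resubmitted" -v t="$total_jobs" \
    'BEGIN { printf "%.2f", 100 * r / t }')

echo "total jobs: $total_jobs"
echo "submission window: $submit_hours hours"
echo "resubmitted: $resub_pct %"
```

The computed values agree with the reported 57600 jobs and 0.04 % resubmission rate.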
Information system checks
Tested that the BDII is operational
ldapsearch -x -h cream-38.pd.infn.it -p 2170 -b o=infosys
Output is here
Checked the Glue 1.3 root entry:
ldapsearch -x -h cream-38.pd.infn.it -p 2170 -b mds-vo-name=resource,o=grid
Output is here
Checked the Glue 2.0 root entry:
ldapsearch -x -h cream-38.pd.infn.it -p 2170 -b o=glue
Output is here
Information content check:
$ gstat-validate-ce -H cream-38.pd.infn.it -p 2170 -b o=grid
OK - errors 0, warnings 0, info 0
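The three root-entry searches above could be wrapped in one helper, roughly as sketched below. The function name check_bdii_bases is hypothetical, and the "-s base" option (fetch only the root entry of each tree) is an addition not used in the original commands. The -h/-p options mirror the commands above; recent OpenLDAP clients document them as deprecated in favour of -H, so this may need adapting.

```shell
#!/bin/sh
# check_bdii_bases HOST [PORT]
# Hypothetical helper: queries the root entry of the three trees checked above
# and reports whether each search returned at least one "dn:" line.
check_bdii_bases() {
    host=$1
    port=${2:-2170}
    failed=0
    for base in "o=infosys" "mds-vo-name=resource,o=grid" "o=glue"; do
        # -x: anonymous simple bind; -s base: only the entry named by -b.
        if ldapsearch -x -h "$host" -p "$port" -b "$base" -s base 2>/dev/null \
            | grep -q '^dn:'; then
            echo "OK   $base"
        else
            echo "FAIL $base"
            failed=1
        fi
    done
    # Non-zero exit status if any of the three bases did not answer.
    return "$failed"
}
```

It would be called as check_bdii_bases cream-38.pd.infn.it 2170. It only checks that each tree answers; the content validation is still left to gstat-validate-ce.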
Checked bugs
- Bug #24708: a empty directory left on WN for every job FIXED
- Verified as explained here
- Bug #37430: BLParser should properly filter it's log output FIXED
- Not too clear what the fix is supposed to be
- According to the developer (M. Mezzadri) the command received by the old blparser from CREAM should be reported in the blparser log file without an extra new-line
- Verified in the old blparser log file
- Bug #45364: BLAH_JOB_CANCEL should report failure reason FIXED
- submit a job to CREAM and then cancel it using the LRMS command (e.g. qdel). Before the blparser (and therefore CREAM) realizes that the job was cancelled, issue a glite-ce-job-cancel.
- Issue a glite-ce-job-status -L 2. For the cancel command a failure (along with its reason) should be reported, such as:
*** Command Name = [JOB_CANCEL]
Command Category = [JOB_MANAGEMENT]
Command Status = [ERROR]
Command Fail Reason = [qdel: Unknown Job Id 45299.cream-38.pd.infn.it]
Creation Time = [Fri 26 Feb 2010 18:43:27] (1267206207)
Start Scheduling Time = [Fri 26 Feb 2010 18:43:27] (1267206207)
Start Processing Time = [Fri 26 Feb 2010 18:43:27] (1267206207)
Execution Completed Time = [Fri 26 Feb 2010 18:43:30] (1267206210)
- Bug #46419: CREAM sandbox area should be scratched when the CREAM DB is scratched FIXED
- Submit at least one job to the CE and wait for its termination, so that the sandbox area is not empty
- Increment the value of the parameter creamdb_database_version in the file /opt/glite/etc/glite-ce-cream/cream-config.xml.template
- reconfigure the node with yaim and check whether the sandbox area is empty
- Bug #47070: [ yaim-cream ] yaim cream module should support remote mysql setup HOPEFULLY FIXED
- Bug #47254: Possible problems if the proxy used to talk with CREAM is shorter than 10 minutes FIXED
- create a voms-proxy whose lifetime is shorter than 10 minutes
- submit a simple job whose lifetime is shorter than the voms-proxy one and verify its correct termination
- Bug #47804: Possible problems configuring blah in CREAM-CE for LSF FIXED
- copy the file profile.lsf from the LSF configuration directory into a new destination, for example /tmp
- define in the site-info.def the variable LSFPROFILE_DIR=/tmp and reconfigure with yaim
- verify that in the file /opt/glite/etc/blah.conf the profile is loaded from the new path
- Bug #48786: Load should be one of the parameter of DISABLE_SUBMISSION_POLICY in CREAM FIXED
- specify a low load level in the file /opt/glite/bin/glite_cream_load_monitor
- try to overload the node with the testsuite: cream-test-monitored-submit -r 30 -n 20000 -m 2000 -C 50 -l log4py.conf -j test.jdl -R <ceID> --sotimeout 60 --vo dteam --valid 02:00 and verify that with high load the submissions are rejected
- Bug #49497: user proxies on CREAM do not get cleaned up FIXED
- delegate a proxy whose lifetime is shorter than the parameter delegation_purge_rate of the CREAM configuration file
- wait for the new proxy cleanup run (at least twice the delegation_purge_rate) and verify that the proxy file has been removed from the directory
- Bug #50226: yaim-cream-ce should use config_secure_tomcat FIXED
- install the CE node from scratch
- verify the state of the trustmanager accessing the URL: https://ce-host:8443/ce-cream/services
- Bug #50723: CREAM: check for the jobtype is not case insensitive FIXED
- submit a job specifying the parameter "jobtype=Normal" in the JDL and verify the correct execution of the job
- Bug #50875: CREAM: reason for cancelled jobs should be reported FIXED
- submit and cancel a job using the CREAM CLI command and verify that the reason reports "Cancelled by user"
- submit and cancel a job using the LRMS command (e.g. qdel) and verify that the reason reports "Cancelled by CE admin"
- Bug #50876: CREAM reports that the proxy expired even when the problem is in detecting the lifetime of the proxy FIXED
- force a failure for the command grid-proxy-init in the jobwrapper, for example delegating a proxy on the CE, manually renaming the corresponding delegated proxy in the sandbox area and then submitting a job using the given delegation ID.
- verify that the failure reason reported by the job status contains the message: Problem to detect the lifetime of the proxy
- Bug #51046: CREAM: DelegProxyInfo info sometimes is wrong FIXED
- submit a job, wait for its termination and verify the correct lifetime of the proxy in the glite-ce-job-status output
- Bug #51118: config_cream_glexec doesn't set glexec permissions right FIXED
- install a CE node from scratch and verify the permissions for /opt/glite/sbin/glexec (6555) and /opt/glite/etc/glexec.conf (640)
- Bug #51124: catalina.out is clogged with grid-proxy-init warnings FIXED
- submit a job and check the catalina.out file
- Bug #51128: lcas-suexec.db on CREAM CE should be named lcas-glexec.db for consistency FIXED
- install a CE node from scratch
- verify the existence of the files: /opt/glite/etc/lcas/lcas-glexec.db and /opt/glite/etc/lcmaps/lcmaps-glexec.db
- Bug #51249: [ yaim-cream-ce ] refactor config_cream_db FIXED
- Install the node from scratch and verify all the basic operations of the CREAM service
- Bug #51310: Wrong event timestamp FIXED
- run the consumer server (glite-ce-monitor-consumer) on the client machine
- create a subscription for the topic CREAM_JOBS on the CE specifying the URL of the consumer server above
- submit a job and verify the validity of the field TIMESTAMP of any event
- Bug #51311: Wrong event timestamp generated by the CREAM Job Sensor FIXED
- Bug #51313: CEMon must not notify the expired events CANNOT REPRODUCE
- Bug #51705: glexec-wrapper.sh should be removed from CREAM RPM FIXED
- check the content of glite-ce-cream rpm
- Bug #51706: yaim-cream-ce: remove "lcg" prefix from JOB_MANAGER FIXED
- change the value of JOB_MANAGER in the site-info.def, e.g. from lsf to lcglsf
- configure the node with YAIM and verify that in the resource BDII the string lsf (and not lcglsf) appears in the GlueCEUniqueIDs
- Bug #51892: Exception when using java.text.DateFormat.parse FIXED
- try to overload the node with the testsuite: cream-test-monitored-submit -r 30 -n 20000 -m 2000 -C 50 -l log4py.conf -j test.jdl -R <ceID> --sotimeout 60 --vo dteam --valid 02:00 and verify that with high load the submissions are rejected
- verify the log of the CREAM service
- Bug #51928: BLAH crashes if the cerequirements classad attribute is malformed FIXED
- submit a job specifying a malformed cerequirements parameter
- verify that the job is executed and the parameter is ignored
- Bug #51978: CREAM can be slow to start FIXED
- submit a big bunch of long-lived jobs, for example using cream-test-monitored-submit -r 30 -n 2000 -m 2000 -C 100 --sotimeout 60 -j long.jdl -R <ce_id> where long.jdl is "[executable="/bin/sleep";arguments="3600";]"
- when all the jobs have been submitted restart the service and verify the startup time.
- verify in the CREAM and BLAHP logs that the jobs are checked one by one at startup, instead of polling all jobs from a given timestamp
- Bug #51993: Proxy renewal not very efficient for multiple jobs having the same delegationid FIXED
- stress the renewal mechanism with a single short delegated proxy, for example with the following test: cream-test-monitored-submit -r 30 -n 2000 -m 2000 -C 50 --sotimeout 60 -j long.jdl -R <ce_id> --vo <vo_name> --valid <00:20>
- Bug #52020: [ yaim-cream-ce ] Support use of file (besides syslog) for glexec logging FIXED
- Bug #52050: misleading error message "The problem seems to be related to glexec" FIXED
- The CREAM service does not make use of glexec anymore, and therefore this error message can't appear anymore
- Bug #52051: CEMon must remove all expired subscriptions on start-up FIXED
- create a subscription for the topic CREAM_JOBS on the CE with a short lifetime
- shutdown the service and wait for the expiration of the subscription
- restart the service and verify that the subscription does not exist anymore in the directory /opt/glite/var/cemonitor/subscription
- Bug #52052: Sometimes the getInfo() operation does not report the right list of topics FIXED
- enable or disable the CE sensor removing or adding the corresponding tag in the file /opt/glite/etc/glite-ce-monitor/cemonitor-config.xml
- wait for cemonitor to reload the configuration (usually 10 minutes)
- verify the availability of the topic using the command glite-ce-monitor-gettopics
- Bug #52268: BLAH leaves files in /tmp when CErequirements is set FIXED
- submit a job specifying a simple CE requirement (e.g. cerequirements="other.GlueHostMainMemoryRAMSize > 2000")
- verify that, after the execution of the job, in the tmp directory no files ce-req-file-* are left
- Bug #52651: CREAM file descriptor overuse FIXED
- try to overload the node with the testsuite: cream-test-monitored-submit -r 30 -n 20000 -m 2000 -C 50 -l log4py.conf -j test.jdl -R <ceID> --sotimeout 60 --vo dteam --valid 02:00 and verify that with high load the submissions are rejected
- search the CREAM log for "too many open files" and verify that the message does not appear
- Bug #52719: Blah doesn't set the 'executable' flag if a local jobwrapper is found FIXED
- Submitted a job to a CREAM CE
- Checked the BLAH wrapper: the chmod u+x of the CREAM JobWrapper is done in all cases (even if the job is going to be run on the WN via a local jobwrapper)
- Bug #52942: Missing description for ISB/OSB error in jobwrapper FIXED
- submit a job with an unreachable host in the inputsandbox or in the outputsandboxbasedesturi parameter
- verify that the output of the glite-ce-job-status contains the full description of the failure
- Bug #53459: [CREAM] Provide method to improve the detection of job status changes by ICE FIXED
- run the "monitored" part of the testsuite; the latest version of the testsuite makes use of the "event query" mechanism for keeping track of the job status.
- Bug #53499: CREAM job wrapper template should be put outside the jar FIXED
- check whether the file /opt/glite/share/webapps/ce-cream.war contains the file WEB-INF/jobwrapper.tpl
- Bug #54812: lsf_submit.sh job requirement FIXED
- Created (and chmoded +x) the file /opt/glite/bin/lsf_local_submit_attributes.sh on the CREAM CE with the following content:
#!/bin/sh
echo "BSUB -n 2"
- Submitted a job to that CE, without specifying in the JDL the cerequirements attribute
- Checked (via bjobs -l) that the -n 2 directive was used (which means that the lsf_local_submit_attributes.sh was run)
- Bug #54900: [ glite-yaim-cream-ce ] config_cream_tomcat_user should not add tomcat to VO FIXED
- check the membership of any VO group
- Bug #54949: Some job can remain in running state when BLParser is restarted for both lsf and pbs HOPEFULLY FIXED
- Not easy to reproduce
- Submitted several jobs (logged in different batch system log files) to a CREAM CE configured with the old blparser
- Restarted CREAM
- Didn't notice problems in getting the status of these jobs
- Bug #55078: Possible final state not considered in BLParserPBS and BUpdaterPBS CANNOT REPRODUCE
- To test the fix it would be necessary to have a scenario for which in the Torque log file for a certain job the event "Job Run..." is followed by the event "dequeuing from"
- Not able to reproduce such a scenario
- Bug #55420: Allow admin to purge CREAM jobs in a non terminal status FIXED
- temporarily disconnect any WN from the CE, e.g. by shutting down the mom server in a TORQUE installation
- submit a job
- on the CE with administrator privileges run the command: /opt/glite/sbin/JobDBAdminPurger.sh -u -p -s 2 as described in the wiki page
- verify with glite-ce-job-list that the job has been purged from the database
- verify that the sandbox directory of that job has been removed from /opt/glite/var/cream_sandbox
- manually remove the job from the batch system and reconnect all the WNs
- Bug #55438: BUpdater problems in updating job state with AssignFinalState for all batch system FIXED
- Submitted 3 jobs lasting 2 hours to a CREAM CE with only 2 job slots.
- For all the jobs the right events were logged by the bnotifier (i.e. it didn't log status=4 with failurereason=999)
- Bug #55531: BUpdaterPBS should consider lines like "unable to run job" FIXED
- Bug #55565: BLAH configuration attribute blah_disable_wn_proxy_renewal fails to disable proxy renewal. FIXED
- Verified issuing a BLAH_JOB_REFRESH proxy for a running job
- Moreover, the BLAH proxy renewal operation is not used anymore (the proxy on the CE is renewed by CREAM and no longer by BLAH)
- Bug #56075: Job failure reasons missing in the CREAM log file FIXED
- submit a job with an unreachable host in the inputsandbox or in the outputsandboxbasedesturi parameter
- verify that the following message appears in the log file: failureReason=Cannot move ISB (): error: globus_xio: Unable to connect to xxxx:2811 globus_xio: globus_libc_getaddrinfo failed.globus_common: Name or service not known
- Bug #56339: [blah] "service glite-ce-blparser restart" does not always work FIXED
- try the command /opt/glite/etc/init.d/glite-ce-blparser restart and verify the correct behaviour of the script
- Bug #56367: CREAM RPM depends on C libs FIXED
- check whether the glite-ce-cream package contains any ELF executable
- Bug #56518: BLAH blparser doesn't start after boot of the machine FIXED
- install the CE node from scratch specifying the parameter BLPARSER_WITH_UPDATER_NOTIFIER=false in the yaim configuration for creamCE
- reboot the machine and verify that the blparser_master is running
- Bug #56697: CREAM logging must be improved when CREAM register operation fails FIXED
- force the service to fail a register operation, e.g. by temporarily renaming the sandbox directory
- verify that the log reports at least the JobID and the reason of the failure
- Bug #57210: BLAH condor_submit script doesn't recognize certain options. CANNOT REPRODUCE
- Not possible to test the fix since we don't have CREAM based CEs with Condor as batch system
- Bug #57307: condor_submit.sh does not support the handling of "local" attributes CANNOT REPRODUCE
- Not possible to test the fix since we don't have CREAM based CEs with Condor as batch system
- Bug #57820: [yaim-cream-ce] CREAM-CE publishes GlueServiceDataValue incomplete FIXED
- run the infoprovider: /opt/glite/etc/gip/provider/glite-info-provider-service-cream-wrapper | grep GlueServiceDataValue
- verify that 3 different values are returned for the GlueServiceDataValue: the version, the DN and the host name of the CE
- Bug #58103: Cream database Query performance FIXED
- Internal improvement
- run a set of stress-tests and verify the performance
- Bug #58109: Wrong value for the "service version" property FIXED
- verify the property using the command glite-ce-service-info
- Bug #58119: CREAM CE: publish Production instead of Special as default value for GlueCEStateStatus FIXED
- verify with /opt/glite/libexec/glite-info-wrapper | grep -i gluecestatestatus
- Bug #58423: RFE: support for ISB/OSB transfers from/to gridftp servers running using credentials FIXED
- tested using java-based UI
- tested using the following JDL:
[
executable="/bin/ls";
inputsandbox={"gsiftp://lxsgaravatto.pd.infn.it:6787/etc/fstab?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto"};
stdoutput="out-gsi.out";
stderror="err-gsi.err";
outputsandbox={"out-gsi.out", "err-gsi.err"}
outputsandboxbasedesturi="gsiftp://lxsgaravatto.pd.infn.it:6787/tmp?DN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto";
]
- Bug #58659: NullPointerException from getStatus FIXED
- try to overload the node with the testsuite: cream-test-monitored-submit -r 30 -n 20000 -m 2000 -C 50 -l log4py.conf -j test.jdl -R <ceID> --sotimeout 60 --vo dteam --valid 02:00 and verify that the log of the testsuite does not report any NullPointerException
- Bug #58792: JobRegister fails, because cream_sandbox directory doesn't exist FIXED
- temporarily rename the directory /opt/glite/var/cream_sandbox without turning off the service
- submit a job and verify that the failure reports "cannot create the job's working directory!"
- Bug #58941: [yaim-cream-ce] lcmaps confs for glexec and gridftp are not fully synchronized FIXED
- verify that the file /opt/glite/etc/lcmaps/lcmaps.db is compliant with the one attached to the bug.
- Bug #59005: Possible problem with hold/resumed jobs in BUpdaterLSF FIXED
- Verified as reported here
- Bug #59329: Proxy symlinks left in the registry area until purged FIXED
- submit a job and verify the existence of the related symlink in the directory /opt/glite/var/blah/user_blah_job_registry.bjr/registry.proxydir
- when the job terminates verify that the symlink has been removed by blah.
- Bug #59686: Possible crash of BUpdarePBS due to wrong malloc FIXED
- Define the parameter pbs_spoolpath in the file /opt/glite/etc/blah.config
- run the BUpdaterPBS daemon and verify its liveness
- Bug #59862: [ yaim-cream-ce ] broken -v functionality FIXED
- remove a mandatory variable from the site-info.def, for example JOB_MANAGER
- run yaim configurator with option -v and verify that all the yaim functions are called.
- Bug #59962: Sometimes the CREAM initialization fails with "UserId = ADMINISTRATOR is not enable for that operation" CANNOT REPRODUCE
- Bug #60831: Error log message: "CREAM_JOB_SENSOR_HOST parameter not specified" FIXED
- verify that the parameter "CREAM_JOB_SENSOR_HOST" is not defined in the file /opt/glite/etc/glite-ce-cream/cream-config.xml
- submit several jobs
- verify that the log of the CREAM service does not report the error above
- Bug #61322: CREAM jw doesn't set GLITE_WMS_RB_BROKERINFO FIXED
- submit a job and verify that the jobwrapper script, contained in the sandbox area for that job, correctly defines the __brokerinfo variable
- Bug #61401: config_cream_blah and config_cream_clean don't take into account GLITE_LOCATION_LOG FIXED
- verify that the log files of blahp are saved into the directory specified by GLITE_LOCATION_LOG
- Bug #61402: [yaim-cream-ce] does not use GLITE_LOCATION_VAR/LOG is some cases FIXED
- change the value of GLITE_LOCATION_VAR and GLITE_LOCATION_LOG and run the yaim configurator
- verify that the new installation has been deployed into the new directory and the log is written in the new location
- Bug #61407: Set CE_ID in the cream jw FIXED
- submit a job and verify that the jobwrapper script, contained in the sandbox area for that job, correctly defines the CE_ID variable
- Bug #61493: [ yaim-cream-ce ] glexec_get_account policy order is wrong FIXED
- As reported in the bug, this was fixed by the fix for bug #58941
- Bug #61604: yaim-cream-ce should not install config_gip_software_plugin FIXED
- verify that the glite-yaim-cream-ce package does not contain the file config_gip_software_plugin but it contains config_cream_gip_software_plugin instead
- Bug #61730: CREAM jw: GLITE_WMS_LOG_DESTINATION should always be set with the FQDN FIXED
- submit a job and verify that the jobwrapper script, contained in the sandbox area for that job, defines a FQDN in the __ce_hostname variable
- Bug #61761: CEMon must guarantee the notification rate FIXED
- enable the "CE Sensor" plugin
- create a subscription for the topic published by the sensor above with a running consumer: glite-ce-monitor-subscribe --cert <user_proxy> --key <user_proxy> --topic CE_MONITOR --dialects ISM_CLASSAD_GLUE_1.2 --consumer-url <consumer_url> --rate 10 --duration 600 <cemonitor_url>
- create one or more subscriptions to a non-existing consumer URL or to a fake blocking one (e.g. using nc -l -p <consumer port>) specifying the same rate as above
- verify that the notification rate for the first consumer is correct
- Bug #61790: Problems in CREAM CE when there are "strange" characters in the subject certificate FIXED
- Verified submitting a job to a Torque CREAM CE with a proxy with subject: /DC=gov/DC=fnal/O=Fermilab/OU=Robots/CN=lcgcaf/CN=cdf/CN=Donatella Lucchesi/CN=UID:lucchesi
- With the same proxy there were problems before (see https://gus.fzk.de/ws/ticket_info.php?ticket=54767)
- Bug #62070: Possible problem with notification time in BNotifier HOPEFULLY FIXED
- Not possible to reproduce it according to the developer (M. Mezzadri)
- Bug #62207: [ yaim-cream ] Enable Glue 2.0 publishing FIXED
- Bug #62436: Possible problem with updater if job remain queued too long FIXED
- Fixed as reported here: 3 jobs lasting 2 hours were submitted to a CREAM CE with only 2 job slots. For the third one the BNotifier logged the right events (i.e. it didn't log status=4 with failurereason=999)
- Bug #62565: yaim-cream-ce requires BLPARSER_HOST even if the new blparser has to be configured FIXED
- Install the CE node from scratch removing the BLPARSER_HOST definition from the site-info.def and defining BLPARSER_WITH_UPDATER_NOTIFIER=true.
- verify that the yaim log does not report any error concerning the variable above and that the BNotifier and the BUpdater run correctly.
- Bug #62776: Yaim config for CREAM CE erroneously requires tomcat in glexec group FIXED
- Bug #62893: Possible proxy renewal problem in the CREAM jw FIXED
- try to overload the node with the testsuite: cream-test-monitored-submit -r 30 -n 20000 -m 2000 -C 50 -l log4py.conf -j test.jdl -R <ceID> --sotimeout 60 --vo dteam --valid 00:30
- verify that no proxy related issues occur
- Bug #63398: CREAM jw: removal of token should be retried in case of failure FIXED
- submit the following jdl:
[
environment= {"__token_file=gsiftp://host/path"};
executable="/bin/sleep";
arguments="30";
]
specifying an existing host and path first and verify that the job terminates successfully; the owner of the token must be the mapped user.
- submit the jdl above but specifying a fake host and/or path and verify that the job status reports 3 different failed attempts for taking the token:
"/opt/edg/libexec/edg-gridftp-base-rm: error globus_ftp_client: the server responded with an error 500 500-Command failed : System error in unlink: No such file or directory 500-A system call failed: No such file or directory 500 End"
- Bug #63874: CREAM sandbox dir creation program should not attempt creation of parent directories. FIXED
- temporarily rename the directory /opt/glite/var/cream_sandbox/<voname>
- submit a job using a VOMS proxy issued by the given VO and verify that the job fails and no directory /opt/glite/var/cream_sandbox/<voname> has been created.
- Bug #64593: RFE: CREAM jw should set the env variables CREAM_JOBID and GRID_JOBID. FIXED
- Verified as explained here
- Bug #64695: BLAH error after qsub failure. FIXED
- Verified as explained here
- Bug #65022: CEMon can shut down very slowly. FIXED
- Confirmed to be fixed by OSG/VDT people
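Several of the file-level checks in the bug list above (bugs #51118, #52268, #56367 and #57820) lend themselves to small shell helpers. The following is a minimal sketch, not the procedure actually used for this report: all function names are hypothetical, check_mode assumes GNU stat, and the expected values and paths are the ones quoted in the corresponding bug entries.

```shell
#!/bin/sh
# Hypothetical helpers; expected values come from the bug entries above.

# Bug #51118: compare a file's octal mode with the expected one,
# e.g. check_mode /opt/glite/sbin/glexec 6555
# Assumes GNU stat ("-c %a" prints the mode in octal, special bits included).
check_mode() {
    actual=$(stat -c %a "$1") || return 2
    [ "$actual" = "$2" ]
}

# Bug #52268: succeed only if no ce-req-file-* is left in the given directory.
no_cereq_leftovers() {
    for f in "$1"/ce-req-file-*; do
        # An unmatched glob stays literal, so test for real existence.
        [ -e "$f" ] && return 1
    done
    return 0
}

# Bug #56367: succeed if the file starts with the ELF magic (0x7f 'E' 'L' 'F').
is_elf() {
    [ "$(od -An -c -N 4 "$1" | tr -d ' \n')" = '177ELF' ]
}

# Bug #57820: count the distinct GlueServiceDataValue lines read from stdin.
count_service_data_values() {
    grep -i '^GlueServiceDataValue:' | sort -u | wc -l | tr -d ' '
}
```

For example, piping the output of /opt/glite/etc/gip/provider/glite-info-provider-service-cream-wrapper into count_service_data_values should print 3 if the version, the DN and the host name are all published. Likewise, running is_elf over the files listed by rpm -ql glite-ce-cream should find no match.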
Clean installation
Upgrade from production