Clean installation
Yum installation
The yum installation log is available here.
Yaim configuration
The yaim configuration log is available here.
Update
Yum update
The yum update log is available here.
Yaim configuration
The yaim configuration log is available here.
System tests
Functionality tests
Plain submission with file transfers
- Test result for LSF is available here PASSED
- Test result for Torque is available here PASSED
- Test result for SGE is available here PASSED
Old blparser for job which finishes normally
- This is what is reported in the blparser log file for PBS: PASSED
2011-11-02 20:33:41 Sent for Cream:[BatchJobId="914"; JobStatus=1; BlahJobName="cre38_144298043"; ClientJobId="144298043"; ChangeTime="2011-11-02 20:33:41";]
2011-11-02 20:33:42 Sent for Cream:[BatchJobId="914"; JobStatus=2; BlahJobName="cre38_144298043"; ClientJobId="144298043"; ChangeTime="2011-11-02 20:33:42";]
2011-11-02 20:38:53 Sent for Cream:[BatchJobId="914"; JobStatus=4; BlahJobName="cre38_144298043"; ClientJobId="144298043"; Reason="pbs_reason=0"; ChangeTime="2011-11-02 20:38:53";]
- This is what is reported in the blparser log file for LSF: PASSED
2011-11-02 20:50:08 Sent for Cream:[BatchJobId="193859"; JobStatus=1; BlahJobName="cre35_058976693"; ClientJobId="058976693"; ChangeTime="2011-11-02 20:50:07";]
2011-11-02 20:50:10 Sent for Cream:[BatchJobId="193859"; JobStatus=2; BlahJobName="cre35_058976693"; ClientJobId="058976693"; WorkerNode="prod-wn-003"; ChangeTime="2011-11-02 20:50:09";]
2011-11-02 20:55:16 Sent for Cream:[BatchJobId="193859"; JobStatus=4; BlahJobName="cre35_058976693"; ClientJobId="058976693"; WorkerNode="prod-wn-003"; Reason="lsf_reason=0"; ChangeTime="2011-11-02 20:55:15";]
Old blparser for job which is cancelled
- This is what is reported in the blparser log file for PBS: PASSED
2011-11-02 20:41:08 Sent for Cream:[BatchJobId="915"; JobStatus=1; BlahJobName="cre38_835745174"; ClientJobId="835745174"; ChangeTime="2011-11-02 20:41:08";]
2011-11-02 20:41:09 Sent for Cream:[BatchJobId="915"; JobStatus=2; BlahJobName="cre38_835745174"; ClientJobId="835745174"; ChangeTime="2011-11-02 20:41:09";]
2011-11-02 20:41:28 Sent for Cream:[BatchJobId="915"; JobStatus=3; BlahJobName="cre38_835745174"; ClientJobId="835745174"; ChangeTime="2011-11-02 20:41:28";]
2011-11-02 20:41:28 Sent for Cream:[BatchJobId="915"; JobStatus=4; BlahJobName="cre38_835745174"; ClientJobId="835745174"; Reason="pbs_reason=271"; ExitReason="Killed by Resource Management System"; ChangeTime="2011-11-02 20:41:28";]
- This is what is reported in the blparser log file for LSF: PASSED
2011-11-02 21:18:44 Sent for Cream:[BatchJobId="193887"; JobStatus=1; BlahJobName="cre35_062774649"; ClientJobId="062774649"; ChangeTime="2011-11-02 21:18:44";]
2011-11-02 21:18:51 Sent for Cream:[BatchJobId="193887"; JobStatus=2; BlahJobName="cre35_062774649"; ClientJobId="062774649"; WorkerNode="cream-wn-006"; ChangeTime="2011-11-02 21:18:50";]
2011-11-02 21:19:04 Sent for Cream:[BatchJobId="193887"; JobStatus=3; BlahJobName="cre35_062774649"; ClientJobId="062774649"; WorkerNode="cream-wn-006"; ChangeTime="2011-11-02 21:19:03";]
New blparser for job which finishes normally
- This is what is reported in the bnotifier log file for LSF: PASSED
2011-11-02 19:04:53 Sent for Cream:[BatchJobId="193680"; JobStatus=2; ChangeTime="2011-11-02 19:04:35"; WorkerNode="cream-wn-003"; ClientJobId="513490836"; BlahJobName="cre35_513490836";]
2011-11-02 19:10:14 Sent for Cream:[BatchJobId="193680"; JobStatus=4; ChangeTime="2011-11-02 19:09:43"; WorkerNode="cream-wn-003"; JwExitCode=0; Reason="reason=0"; ClientJobId="513490836"; BlahJobName="cre35_513490836";]
- This is what is reported in the bnotifier log file for PBS: PASSED
2011-11-02 19:43:17 Sent for Cream:[BatchJobId="907.cream-38.pd.infn.it"; JobStatus=2; ChangeTime="2011-11-02 19:42:40"; WorkerNode="cream-43.pd.infn.it"; ClientJobId="596525744"; BlahJobName="cre38_596525744";]
2011-11-02 19:48:47 Sent for Cream:[BatchJobId="907.cream-38.pd.infn.it"; JobStatus=4; ChangeTime="2011-11-02 19:47:50"; WorkerNode="cream-43.pd.infn.it"; JwExitCode=0; Reason="reason=0"; ClientJobId="596525744"; BlahJobName="cre38_596525744";]
- This is what is reported in the bnotifier log file for SGE: PASSED
2011-11-02 20:03:17 Sent for Cream:[BatchJobId="188"; JobStatus=2; ChangeTime="2011-11-02 20:03:16"; WorkerNode="sa3-wn001.egee.cesga.es"; ExitReason="0"; ClientJobId="652381209"; BlahJobName="cream_652381209";]
2011-11-02 20:07:22 Sent for Cream:[BatchJobId="188"; JobStatus=4; ChangeTime="2011-11-02 20:07:19"; WorkerNode="sa3-wn001.egee.cesga.es"; JwExitCode=0; Reason="reason=0"; ExitReason="0"; ClientJobId="652381209"; BlahJobName="cream_652381209";]
New blparser for job which is cancelled
- This is what is reported in the bnotifier log file for LSF: PASSED
2011-11-02 19:11:54 Sent for Cream:[BatchJobId="193699"; JobStatus=1; ChangeTime="2011-11-02 19:11:53"; ClientJobId="686826214"; BlahJobName="cre35_686826214";]
2011-11-02 19:12:19 Sent for Cream:[BatchJobId="193699"; JobStatus=2; ChangeTime="2011-11-02 19:11:55"; WorkerNode="prod-wn-001"; ClientJobId="686826214"; BlahJobName="cre35_686826214";]
2011-11-02 19:13:19 Sent for Cream:[BatchJobId="193699"; JobStatus=3; ChangeTime="2011-11-02 19:12:24"; WorkerNode="prod-wn-001"; JwExitCode=-999; Reason="reason=-999"; ClientJobId="686826214"; BlahJobName="cre35_686826214";]
- This is what is reported in the bnotifier log file for PBS: PASSED
2011-11-02 19:41:17 Sent for Cream:[BatchJobId="906.cream-38.pd.infn.it"; JobStatus=2; ChangeTime="2011-11-02 19:40:17"; WorkerNode="cream-43.pd.infn.it"; ClientJobId="697725056"; BlahJobName="cre38_697725056";]
2011-11-02 19:42:17 Sent for Cream:[BatchJobId="906.cream-38.pd.infn.it"; JobStatus=3; ChangeTime="2011-11-02 19:41:34"; WorkerNode="cream-43.pd.infn.it"; JwExitCode=-999; Reason="reason=-999"; ClientJobId="697725056"; BlahJobName="cre38_697725056";]
- This is what is reported in the bnotifier log file for SGE: PASSED
2011-11-02 20:12:37 Sent for Cream:[BatchJobId="190"; JobStatus=3; ChangeTime="2011-11-02 20:12:35"; JwExitCode=3; Reason="reason=3"; ExitReason="reason=3"; ClientJobId="119619569"; BlahJobName="cream_119619569";]
New blparser for a job which is suspended and then resumed
- This is what is reported in the bnotifier log file for LSF: PASSED
2011-11-02 19:15:39 Sent for Cream:[BatchJobId="193710"; JobStatus=2; ChangeTime="2011-11-02 19:15:25"; WorkerNode="prod-wn-001"; ClientJobId="095903296"; BlahJobName="cre35_095903296";]
2011-11-02 19:15:59 Sent for Cream:[BatchJobId="193710"; JobStatus=5; ChangeTime="2011-11-02 19:15:55"; WorkerNode="prod-wn-001"; ClientJobId="095903296"; BlahJobName="cre35_095903296";]
2011-11-02 19:16:39 Sent for Cream:[BatchJobId="193710"; JobStatus=2; ChangeTime="2011-11-02 19:16:17"; WorkerNode="prod-wn-001"; ClientJobId="095903296"; BlahJobName="cre35_095903296";]
2011-11-02 19:16:59 Sent for Cream:[BatchJobId="193710"; JobStatus=2; ChangeTime="2011-11-02 19:15:25"; WorkerNode="prod-wn-001"; ClientJobId="095903296"; BlahJobName="cre35_095903296";]
2011-11-02 19:21:00 Sent for Cream:[BatchJobId="193710"; JobStatus=4; ChangeTime="2011-11-02 19:20:30"; WorkerNode="prod-wn-001"; JwExitCode=0; Reason="reason=0"; ClientJobId="095903296"; BlahJobName="cre35_095903296";]
- This is what is reported in the bnotifier log file for PBS: PASSED
[BatchJobId="913.cream-38.pd.infn.it"; JobStatus=1; ChangeTime="2011-11-02 19:51:38"; ClientJobId="475769889"; BlahJobName="cre38_475769889";]
2011-11-02 19:52:18 Sent for Cream:[BatchJobId="913.cream-38.pd.infn.it"; JobStatus=5; ChangeTime="2011-11-02 19:52:04"; ClientJobId="475769889"; BlahJobName="cre38_475769889";]
2011-11-02 19:52:48 Sent for Cream:[BatchJobId="913.cream-38.pd.infn.it"; JobStatus=1; ChangeTime="2011-11-02 19:52:24"; ClientJobId="475769889"; BlahJobName="cre38_475769889";]
2011-11-02 19:53:48 Sent for Cream:[BatchJobId="913.cream-38.pd.infn.it"; JobStatus=2; ChangeTime="2011-11-02 19:53:30"; WorkerNode="cream-43.pd.infn.it"; ClientJobId="475769889"; BlahJobName="cre38_475769889";]
2011-11-02 19:59:18 Sent for Cream:[BatchJobId="913.cream-38.pd.infn.it"; JobStatus=4; ChangeTime="2011-11-02 19:58:41"; WorkerNode="cream-43.pd.infn.it"; JwExitCode=0; Reason="reason=0"; ClientJobId="475769889"; BlahJobName="cre38_475769889";]
- This is what is reported in the bnotifier log file for SGE: PASSED
[BatchJobId="189"; JobStatus=2; ChangeTime="2011-11-02 20:07:19"; WorkerNode="sa3-wn001.egee.cesga.es"; ExitReason="0"; ClientJobId="585824619"; BlahJobName="cream_585824619";]
2011-11-02 20:08:27 Sent for Cream:[BatchJobId="189"; JobStatus=5; ChangeTime="2011-11-02 20:08:24"; ExitReason="0"; ClientJobId="585824619"; BlahJobName="cream_585824619";]
2011-11-02 20:09:32 Sent for Cream:[BatchJobId="189"; JobStatus=2; ChangeTime="2011-11-02 20:09:29"; WorkerNode="sa3-wn001.egee.cesga.es"; ExitReason="0"; ClientJobId="585824619"; BlahJobName="cream_585824619";]
2011-11-02 20:12:42 Sent for Cream:[BatchJobId="189"; JobStatus=4; ChangeTime="2011-11-02 20:12:40"; WorkerNode="sa3-wn001.egee.cesga.es"; JwExitCode=0; Reason="reason=0"; ExitReason="0"; ClientJobId="585824619"; BlahJobName="cream_585824619";]
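The notification lines above all share one format, so the sequence of states reported for each job can be checked mechanically. The following is a minimal sketch in Python, not part of the original test procedure. It assumes the JobStatus codes mean 1 = idle, 2 = running, 3 = removed, 4 = completed, 5 = held, which is consistent with the logs above but is not stated on this page; the names `status_sequences` and `STATUS_NAMES` are hypothetical.

```python
import re
from collections import defaultdict

# Assumed meaning of the JobStatus codes (inferred from the logs, not stated on this page).
STATUS_NAMES = {1: "IDLE", 2: "RUNNING", 3: "REMOVED", 4: "COMPLETED", 5: "HELD"}

# BatchJobId is followed by JobStatus in every notification shown above.
NOTIFICATION_RE = re.compile(r'BatchJobId="([^"]+)";.*?JobStatus=(\d+);')


def status_sequences(lines):
    """Return {BatchJobId: [JobStatus, ...]} in the order the notifications appear."""
    sequences = defaultdict(list)
    for line in lines:
        match = NOTIFICATION_RE.search(line)
        if match:
            sequences[match.group(1)].append(int(match.group(2)))
    return dict(sequences)


# Old blparser, PBS, job which finishes normally (copied from the log above).
SAMPLE = [
    '2011-11-02 20:33:41 Sent for Cream:[BatchJobId="914"; JobStatus=1; BlahJobName="cre38_144298043"; ClientJobId="144298043"; ChangeTime="2011-11-02 20:33:41";]',
    '2011-11-02 20:33:42 Sent for Cream:[BatchJobId="914"; JobStatus=2; BlahJobName="cre38_144298043"; ClientJobId="144298043"; ChangeTime="2011-11-02 20:33:42";]',
    '2011-11-02 20:38:53 Sent for Cream:[BatchJobId="914"; JobStatus=4; BlahJobName="cre38_144298043"; ClientJobId="144298043"; Reason="pbs_reason=0"; ChangeTime="2011-11-02 20:38:53";]',
]

if __name__ == "__main__":
    for job_id, codes in status_sequences(SAMPLE).items():
        print(job_id, [STATUS_NAMES.get(code, code) for code in codes])
```

Under the assumed mapping, a job that finishes normally should end with status 4, a cancelled job should report status 3, and a suspended job should show status 5 before returning to its previous state.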
Regression tests
Verified using the testsuite documented at
https://twiki.cern.ch/twiki/bin/view/EMI/GE-utilsTestPlan:
[rrosende@ui testCreamFramework]$ pybot -e isb_baseuri -e isb_gsiftp -e osb_basedesturi -e osb_desturi -e hostnum -e smpgran -W $COLUMNS /home/rrosende/testCreamFramework/
=====================================================================================================================================================================================================================
testCreamFramework
=====================================================================================================================================================================================================================
testCreamFramework.Cream Test :: This is the main testing module,needed for testing the cream submission functionality and various jdl attributes.For more information check these urls:
=====================================================================================================================================================================================================================
Set Log Level | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Simple Submit :: Execute /bin/uname -a.Submit the jdl and wait for final job status to be done-ok. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
ISB Client to CE :: Execute a bash shell script.The script is stored on the UI.The jdl attribute InputSandbox is used.Firstly submit the jdl and then wait for final job state to be done-ok. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
OSB localhost :: Execute /bin/uname -a.Store std out and error streams on "gsiftp://localhost" .Firstly submit the jdl,wait for final job state done-ok,download the produced files,get the output file l... | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Environment :: Execute a bash shell script.The script is stored locally.The jdl sets an environmental variable.The script prints the value of that variable. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Epilogue :: Execute two bash shell scripts,one for the job and the other as an epilogue.The epilogue script creates a file with a certain string,which is collected by CREAM as an output file.Submit the... | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Prologue :: Execute two bash shell scripts,one for the job and the other as a prologue.The prologue script creates a file with a certain string,which is collected by CREAM as an output file.Submit the ... | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
CPUNumber :: Execute /bin/uname -a.Set the jdl attribute CPUNumber to a certain number. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
WholeNodes :: Execute /bin/uname -a.Set the jdl attribute WholeNodes to True and False. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HostNumber SMPGranularity and WholeNodes :: Execute /bin/uname -a.Set the jdl attributes HostNumber and SMPGranularity to a random -correct- value and WholeNodes to True and False.Job submission should... | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Explicit Delegation :: Test job submission with explicit delegation.Three cases are tested: | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Job Cancel :: Test job cancellation.Two cases are tested: | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Job List :: Test the job listing cli utility.Store the list of the user's jobs,before and after a job submission.The returned job id shouldn't exist in the job list command output before the submission... | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Job Suspend - Resume :: Submit a job and suspend it.Wait until it's state is REALLY-RUNNING. | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
File Cleanup :: Delete any and all leftover files from the test.All the files created are located under /tmp and have the prefix "cream_testing-". | PASS |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
testCreamFramework.Cream Test :: This is the main testing module,needed for testing the cream submission functionality and various jdl attributes.For more information check these urls: | PASS |
15 critical tests, 15 passed, 0 failed
15 tests total, 15 passed, 0 failed
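When the suite is rerun, the two summary lines at the end of the pybot output can be checked automatically. The following Python sketch assumes the summary keeps the wording shown above; `all_passed` is a hypothetical helper, not part of the testsuite.

```python
import re

# Matches summary lines such as "15 critical tests, 15 passed, 0 failed".
SUMMARY_RE = re.compile(r"(\d+) (?:critical tests|tests total), (\d+) passed, (\d+) failed")


def all_passed(output):
    """Return True if at least one summary line is present and none reports a failure."""
    summaries = SUMMARY_RE.findall(output)
    if not summaries:
        return False
    return all(
        int(total) == int(passed) and int(failed) == 0
        for total, passed, failed in summaries
    )


SAMPLE_SUMMARY = """15 critical tests, 15 passed, 0 failed
15 tests total, 15 passed, 0 failed"""

if __name__ == "__main__":
    print(all_passed(SAMPLE_SUMMARY))
```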
- Bug #75854 (Problems related to the growth of the blah registry) FIXED
Performed some stress tests against a CREAM CE configured with the new blparser and with job_registry_use_mmap=yes (the default scenario).
Monitored the processes using more than 20M, and blah was not among them.
At any rate the real verification is possible only in a production environment.
- Bug #77776 (BUpdater should have an option to use cached batch system commands) FIXED
Added to blah.config:
lsf_batch_caching_enabled=yes
batch_command_caching_filter=/usr/bin/runcmd.pl
where /usr/bin/runcmd.pl is:
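#!/usr/bin/perl
#---------------------#
# PROGRAM: argv.pl #
#---------------------#
# Filter used for this test: it only records the arguments it is called with.
$numArgs = $#ARGV + 1;   # number of arguments (not used below)
# Append each argument, one per line, to /tmp/xyz.
open (MYFILE, '>>/tmp/xyz');
foreach $argnum (0 .. $#ARGV) {
print MYFILE "$ARGV[$argnum]\n";
}
close (MYFILE);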
Verified that /tmp/xyz contains the queries to the batch system:
/opt/lsf/7.0/linux2.6-glibc2.3-x86/bin/bjobs
-u
all
-l
/opt/lsf/7.0/linux2.6-glibc2.3-x86/bin/bjobs
-u
all
-l
...
- Bug #79324 (blah is slow in processing files in the registry.npudir directory) HOPEFULLY FIXED
Not straightforward to verify this bug fix.
- Bug #80805 (BLAH job registry permissions should be improved) FIXED
Verified by checking the permissions of /var/blah:
/var/blah:
total 12
-rw-r--r-- 1 tomcat tomcat 5 Oct 18 07:32 blah_bnotifier.pid
-rw-r--r-- 1 tomcat tomcat 5 Oct 18 07:32 blah_bupdater.pid
drwxrwx--t 4 tomcat tomcat 4096 Oct 18 07:38 user_blah_job_registry.bjr
/var/blah/user_blah_job_registry.bjr:
total 16
-rw-rw-r-- 1 tomcat tomcat 1712 Oct 18 07:38 registry
-rw-r--r-- 1 tomcat tomcat 260 Oct 18 07:38 registry.by_blah_index
-rw-rw-rw- 1 tomcat tomcat 0 Oct 18 07:38 registry.locktest
drwxrwx-wt 2 tomcat tomcat 4096 Oct 18 07:38 registry.npudir
drwxrwx-wt 2 tomcat tomcat 4096 Oct 18 07:38 registry.proxydir
-rw-rw-r-- 1 tomcat tomcat 0 Oct 18 07:32 registry.subjectlist
/var/blah/user_blah_job_registry.bjr/registry.npudir:
total 0
/var/blah/user_blah_job_registry.bjr/registry.proxydir:
total 0
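To compare a listing like the one above against expected permissions, the mode strings printed by `ls -l` can be converted to octal. The following Python sketch is only an illustration: `parse_mode` is a hypothetical helper, and which octal values count as correct for the registry is not stated on this page.

```python
def parse_mode(mode_string):
    """Convert an `ls -l` mode string such as 'drwxrwx--t' into octal permission bits."""
    perms = mode_string[1:10]  # skip the file-type character
    if len(perms) != 9:
        raise ValueError("expected a 10-character mode string: %r" % mode_string)

    # Execute slots can also carry setuid, setgid or the sticky bit.
    specials = {2: (0o4000, "s", "S"), 5: (0o2000, "s", "S"), 8: (0o1000, "t", "T")}
    mode = 0
    for index, char in enumerate(perms):
        bit = 1 << (8 - index)
        if index in specials:
            special_bit, lower, upper = specials[index]
            if char == lower:        # special bit set, execute set
                mode |= special_bit | bit
            elif char == upper:      # special bit set, execute not set
                mode |= special_bit
            elif char == "x":
                mode |= bit
            elif char != "-":
                raise ValueError("unexpected character %r in %r" % (char, mode_string))
        else:
            expected = "r" if index % 3 == 0 else "w"
            if char == expected:
                mode |= bit
            elif char != "-":
                raise ValueError("unexpected character %r in %r" % (char, mode_string))
    return mode


if __name__ == "__main__":
    # Mode strings copied from the listing above.
    for name, mode_string in [
        ("user_blah_job_registry.bjr", "drwxrwx--t"),
        ("registry", "-rw-rw-r--"),
        ("registry.npudir", "drwxrwx-wt"),
    ]:
        print(name, oct(parse_mode(mode_string)))
```

A check for the sticky bit is then `parse_mode(s) & 0o1000 != 0`.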
- Bug #81354 (Missing 'Iwd' Attribute when trasferring files with the 'TransferInput' attribute causes thread to loop) FIXED
Verified in the following way:
$ /usr/bin/blahpd
$GahpVersion: 1.16.2 Mar 31 2008 INFN\ blahpd\ (poly,new_esc_format) $
BLAH_SET_SUDO_ID dteam001
S Sudo\ mode\ on
blah_job_submit 1 [cmd="/bin/cp";Args="fstab\ fstab.out";TransferInput="/home/dteam001/dir1/fstab";TransferOutput="fstab.out";TransferOutputRemaps="fstab.out=/home/dteam001/dir1/fstab.out";gridtype="pbs";queue="creamtest2";x509userproxy="/tmp/proxy"]
S
results
S 1
1 0 No\ error pbs/20111010/304.cream-38.pd.infn.it
$ ls /home/dteam001/dir1/
fstab fstab.out
Without the fix, the blah jobid was not returned.
- Bug #81824 (yaim-cream-ce should manage the attribute bupdater_loop_interval) FIXED
Set BUPDATER_LOOP_INTERVAL to 30 in siteinfo.def and reconfigured via yaim. Verified that blah.config contains:
bupdater_loop_interval=30
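The same check can be scripted after each reconfiguration. The Python sketch below assumes blah.config consists of simple `key=value` lines with `#` comments, which matches the fragments shown on this page but is not a full description of the file format; `read_blah_config` is a hypothetical helper.

```python
def read_blah_config(text):
    """Parse simple key=value lines, ignoring blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            # Drop optional surrounding double quotes from the value.
            settings[key.strip()] = value.strip().strip('"')
    return settings


SAMPLE_CONFIG = """# fragment for illustration only
bupdater_loop_interval=30
"""

if __name__ == "__main__":
    settings = read_blah_config(SAMPLE_CONFIG)
    print(settings.get("bupdater_loop_interval"))
```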
- Bug #82281 (blahp.log records should always contain CREAM job ID) FIXED
Submitted a job directly to CREAM using CREAM-CLI. In the accounting log file the following line got printed:
"timestamp=2011-10-10 14:37:38" "userDN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto" "userFQAN=/dteam/Role=NULL/Capability=NULL" "userFQAN=/dteam/NGI_IT/Role=NULL/Capability=NULL" "ceID=cream-38.pd.infn.it:8443/cream-pbs-creamtest2" "jobID=CREAM956286045" "lrmsID=300.cream-38.pd.infn.it" "localUser=18757" "clientID=cre38_956286045"
Submitted a job to CREAM through WMS. In the accounting log file the following line got printed:
"timestamp=2011-10-10 14:39:57" "userDN=/C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Massimo Sgaravatto" "userFQAN=/dteam/Role=NULL/Capability=NULL" "userFQAN=/dteam/NGI_IT/Role=NULL/Capability=NULL" "ceID=cream-38.pd.infn.it:8443/cream-pbs-creamtest2" "jobID=https://devel19.cnaf.infn.it:9000/dLvm84LvD7w7QXtLZK4L0A" "lrmsID=302.cream-38.pd.infn.it" "localUser=18757" "clientID=cre38_315532638"
- Bug #82297 (blahp.log rotation period is too short) FIXED
# cat /etc/logrotate.d/blahp-logrotate
/var/log/cream/accounting/blahp.log {
copytruncate
rotate 365
size = 10M
missingok
nomail
}
- Bug #83275 (Problem in updater with very short jobs that can cause no notification to cream) FIXED
Verified by submitting very short jobs (just an echo, with no sandbox) and checking that notifications are sent:
2011-11-04 14:11:11 Sent for Cream:[BatchJobId="927.cream-38.pd.infn.it"; JobStatus=4; ChangeTime="2011-11-04 14:08:55"; JwExitCode=0; Reason="reason=0"; ClientJobId="622028514"; BlahJobName="cre38_622028514";]
- Bug #83347 (Incorrect special character handling for BLAH Arguments and Environment attributes) FIXED
Submitted using this BLAH_JOB_SUBMIT command:
BLAH_JOB_SUBMIT 1 [Cmd="/bin/echo";Args="$HOSTNAME";Out="/tmp/stdout_l15367";In="/dev/null";GridType="pbs";Queue="creamtest1";x509userproxy="/tmp/proxy";Iwd="/tmp";TransferOutput="output_file";TransferOutputRemaps="output_file=/tmp/stdout_l15367";GridResource="blah"]
Verified that the output file contains the hostname of the WN.
Verified the fix in the following way:
- Set tracejob_logs_to_read to 2 in blah.config
- Set bupdater_debug_level to 3
- Restarted the bupdater
- Submitted a job
- Stopped tomcat and the bupdater-bnotifier before the end of the job
- Restarted tomcat (and therefore bupdater and bnotifier)
- Checked the bupdater log, where the following strings were found:
2011-10-12 07:26:09 /usr/bin/BUpdaterPBS: command_string in FinalStateQuery is:/usr/bin/tracejob -p /var/torque -m -l -a -n 1 630.cream-38.pd.infn.it
2011-10-12 07:26:40 /usr/bin/BUpdaterPBS: command_string in FinalStateQuery is:/usr/bin/tracejob -p /var/torque -m -l -a -n 2 630.cream-38.pd.infn.it
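The two log lines show tracejob being called with `-n 1` and then `-n 2` for the same job. A small script can extract these values from the bupdater log and compare them with the configured tracejob_logs_to_read. The Python sketch below assumes the log wording shown above; `tracejob_days` is a hypothetical helper, and the reading that the last `-n` value should equal tracejob_logs_to_read is an interpretation of this test, not documented behaviour.

```python
import re

# Captures the value passed to -n and the job id that follows it.
TRACEJOB_RE = re.compile(r"tracejob\b.*?\s-n\s+(\d+)\s+(\S+)")


def tracejob_days(log_lines):
    """Return {job id: [value of -n for each tracejob call, in log order]}."""
    days = {}
    for line in log_lines:
        match = TRACEJOB_RE.search(line)
        if match:
            days.setdefault(match.group(2), []).append(int(match.group(1)))
    return days


# Lines copied from the bupdater log above.
SAMPLE_LOG = [
    "2011-10-12 07:26:09 /usr/bin/BUpdaterPBS: command_string in FinalStateQuery is:/usr/bin/tracejob -p /var/torque -m -l -a -n 1 630.cream-38.pd.infn.it",
    "2011-10-12 07:26:40 /usr/bin/BUpdaterPBS: command_string in FinalStateQuery is:/usr/bin/tracejob -p /var/torque -m -l -a -n 2 630.cream-38.pd.infn.it",
]

if __name__ == "__main__":
    for job_id, values in tracejob_days(SAMPLE_LOG).items():
        print(job_id, values)
```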
To reproduce the problem, the tracejob output should contain something like:
08/08/2011 03:59:06 S preparing to send 'e' mail for job 430589.tonia.d-grid.scai.fraunhofer.de to -unavailable- (Exit_status=0
Issuing (on torque 2.5.7-2 provided by epel):
qmgr -c 'set server mail_domain=pd.infn.it'
I am not able to reproduce that, and therefore I am not able to verify the fix
- Bug #86050 (Due to a syntax error hung child processes cannot be killed) FIXED
Verified by checking the code.
- Bug #87419 (blparser_master add some spurious character in the BLParser command line) FIXED
Verified by configuring a CREAM CE using the old blparser.
The ps output no longer shows spurious characters:
root 26485 0.0 0.2 155564 5868 ? Sl 07:36 0:00 /usr/bin/BLParserPBS -d 1 -l /var/log/cream/glite-pbsparser.log -s /var/torque -p 33333 -m 56565
Not straightforward to reproduce this problem and therefore to verify the fix.
--
MassimoSgaravatto - 2011-10-10