CREAM & WN (IGI/EMI) installation with MPI, using the latest Torque (2.5.7-7) and Maui 3.3-4
On the CE host (also the PBS/Torque batch master):
INSTALLATION:
Repository settings:
cd /etc/yum.repos.d/
mv dag.repo dag.repo.orig
wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-cert-emi.repo
cd /root/
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
Package installation (CA, EPEL, EMI, CREAM, Torque, ...):
yum install ca-policy-egi-core
yum localinstall *.rpm
yum install xml-commons-apis
yum install emi-cream-ce
yum install emi-torque-server emi-torque-utils
yum install glite-mpi
Munge configuration:
/usr/sbin/create-munge-key
service munge start
chkconfig munge on
Starting PBS:
/etc/init.d/pbs_server start
File LOG:
Installation File on CE - cert-09.pd.infn.it Work Log
CONFIGURATION:
/opt/glite/yaim/bin/yaim -c -d 6 -s /usr/local/nfs/cert-3_2/rtc_mpi/rtc-site-info.def -n MPI_CE -n creamCE -n TORQUE_server -n TORQUE_utils 2>&1 | tee /root/conf_EMI_CREAM_Torque_MPI.`hostname -s`.`date +%Y%m%d-%H%M%S`.log
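Since tee saves the yaim output to a log file, that file can be scanned once the run finishes. The following is only a sketch: it assumes that yaim marks problem lines with the literal words ERROR and WARNING, and the helper name yaim_log_summary is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: summarize a saved yaim log.
# Assumption: problem lines contain the literal words ERROR or WARNING.
yaim_log_summary() {
    local log="$1"
    local errors warnings
    if [ ! -r "$log" ]; then
        echo "cannot read log: $log" >&2
        return 2
    fi
    # grep -c prints the number of matching lines; "|| true" hides its
    # non-zero exit status when the count is 0.
    errors=$(grep -c 'ERROR' "$log" || true)
    warnings=$(grep -c 'WARNING' "$log" || true)
    echo "errors=${errors} warnings=${warnings}"
    # Return success only when no error line was found.
    [ "$errors" -eq 0 ]
}
```

Usage would be along the lines of yaim_log_summary /root/conf_EMI_CREAM_Torque_MPI.<host>.<timestamp>.log, with the file name produced by the tee command above.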
SSH Customization:
Modify /etc/ssh/sshd_config as in the attached example.
Modify /etc/ssh/shosts.equiv as in the attached example.
service sshd restart
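The attached examples are not reproduced in these notes. Because shosts.equiv is only consulted for host-based authentication, one plausible sanity check is sketched below. It assumes the attached sshd_config enables HostbasedAuthentication; the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical check, assuming the attached sshd_config example enables
# host-based authentication (the mechanism that reads shosts.equiv).
hostbased_enabled() {
    local conf="${1:-/etc/ssh/sshd_config}"
    # Match an uncommented "HostbasedAuthentication yes" line (case-insensitive).
    grep -Eiq '^[[:space:]]*HostbasedAuthentication[[:space:]]+yes([[:space:]]|$)' "$conf"
}

# List the hosts named in an shosts.equiv-style file, skipping comments and blank lines.
equiv_hosts() {
    local equiv="${1:-/etc/ssh/shosts.equiv}"
    grep -Ev '^[[:space:]]*(#|$)' "$equiv"
}
```

Running sshd -t before the restart also checks the configuration file for syntax errors.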
File LOG:
Yaim Configuration File on CE cert-09.pd.infn.it
On WN Hosts:
INSTALLATION:
Repository settings:
cd /etc/yum.repos.d/
mv dag.repo dag.repo.orig
wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-cert-emi.repo
cd /root/
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
Package installation (CA, EPEL, EMI, WN, MPI, ...):
yum install ca-policy-egi-core
yum localinstall *.rpm
yum install igi-wn_torque_noafs
yum install glite-mpi
yum install openmpi openmpi-devel mpich2
Munge configuration:
scp cert-09:/etc/munge/munge.key /etc/munge/
chown munge:munge /etc/munge/munge.key
service munge start
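The WN must hold exactly the same key as the CE, and munged refuses a key file that group or other users can access. A minimal sketch of both checks follows; it assumes GNU stat and md5sum are available, and the function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check that a munge key has owner-only permissions.
# Assumption: GNU stat is available.
key_mode_ok() {
    local key="${1:-/etc/munge/munge.key}"
    local mode
    mode=$(stat -c '%a' "$key") || return 2
    # Accept owner read-only or owner read/write; nothing for group or other.
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Compare two copies of a key by checksum, without printing the key itself.
same_key() {
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}
```

Once the daemon is running, munge -n | unmunge on the WN gives a quick round-trip check that the key is usable.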
File LOG:
Installation File on WN - cert-wn64-08.pn.pd.infn.it Work Log
CONFIGURATION:
/opt/glite/yaim/bin/yaim -c -d 6 -s /usr/local/nfs/cert-3_2/rtc_mpi/rtc-site-info.def -n MPI_WN -n WN_torque_noafs 2>&1 | tee /root/conf_WN_Torque_MPI.`hostname -s`.`date +%Y%m%d-%H%M%S`.log
SSH Customization:
Modify /etc/ssh/sshd_config as in the attached example.
Modify /etc/ssh/shosts.equiv as in the attached example.
service sshd restart
File LOG:
Yaim Configuration File on WN cert-wn64-08.pn.pd.infn.it
TESTING:
JOB Submission:
First Test (simple job):
From an EMI UI, create a proxy and submit the job:
-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a testCream.jdl
https://cert-09.pd.infn.it:8443/CREAM043342708
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
Status = [RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
Status = [REALLY-RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
Status = [DONE-OK]
ExitCode = [0]
First Test :
PASSED
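The repeated glite-ce-job-status calls above can be wrapped in a small polling loop. This is a sketch under stated assumptions: the list of terminal states and the 10-second default interval are choices made here, and the function names are hypothetical.

```shell
#!/bin/bash
# Extract the value of the first "Status = [...]" line from
# glite-ce-job-status output read on stdin.
cream_status() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\([^]]*\)\].*$/\1/p' | head -n 1
}

# True when the state is final. Assumption: this list of terminal CREAM states.
is_terminal() {
    case "$1" in
        DONE-OK|DONE-FAILED|ABORTED|CANCELLED) return 0 ;;
        *) return 1 ;;
    esac
}

# Poll a job until it reaches a terminal state, printing each observed state.
wait_for_job() {
    local job="$1" interval="${2:-10}" state
    while :; do
        state=$(glite-ce-job-status "$job" | cream_status)
        # Stop instead of looping forever if no status could be read.
        [ -n "$state" ] || return 1
        echo "$state"
        if is_terminal "$state"; then
            return 0
        fi
        sleep "$interval"
    done
}
```

For the first test this would be called as wait_for_job https://cert-09.pd.infn.it:8443/CREAM043342708.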
Second Test (simple MPI job with 2 cores):
-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl
https://cert-09.pd.infn.it:8443/CREAM986827158
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Status = [IDLE]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Status = [RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Status = [REALLY-RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Status = [REALLY-RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Status = [DONE-OK]
ExitCode = [1]
-bash-3.2$ glite-ce-job-status -L 2 https://cert-09.pd.infn.it:8443/CREAM986827158
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
Current Status = [DONE-OK]
Working Dir = [[reserved]]
ExitCode = [1]
Grid JobID = [N/A]
LRMS Abs JobID = [[reserved]]
LRMS JobID = [[reserved]]
Deleg Proxy ID = [613f3558926a0ba8642c51cfcaeb019b88a5b791]
DelegProxyInfo = [Valid From : 2/14/12 2:04 PM (GMT)
Valid To : 2/14/12 11:14 PM (GMT)
Holder Subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi
Holder CA : /C=IT/O=INFN/CN=INFN CA
VO : dteam
AC Issuer : CN=voms2.hellasgrid.gr,OU=hellasgrid.gr,O=HellasGrid,C=GR
Attribute : /dteam/Role=NULL/Capability=NULL /dteam/NGI_IT/Role=NULL/Capability=NULL
]
Worker Node = [cert-wn64-08.pn.pd.infn.it]
Local User = [dteam009]
CREAM ISB URI = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/98/CREAM986827158/ISB]
CREAM OSB URI = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/98/CREAM986827158/OSB]
JDL = [[ Arguments = "name_mpi OPENMPI"; QueueName = "cert"; JobType = "Normal"; Executable = "mpi-start-wrapper.sh"; VirtualOrganisation = "dteam"; InputSandbox = { "/home/traldi/JOB_MPI/mpi-start-wrapper.sh","/home/traldi/JOB_MPI/mpi-hooks.sh","/home/traldi/JOB_MPI/name_mpi.c" }; CPUNumber = 2; StdOutput = "std.out"; Type = "Job"; OutputSandboxBaseDestUri = "gsiftp://prod-se-01.pd.infn.it/tmp"; StdError = "std.err"; BatchSystem = "pbs"; OutputSandbox = { "std.err","std.out" } ]]
Type = [Normal]
Job status changes:
-------------------
Status = [REGISTERED] - [Tue 14 Feb 2012 15:09:29] (1329228569)
Status = [PENDING] - [Tue 14 Feb 2012 15:09:31] (1329228571)
Status = [IDLE] - [Tue 14 Feb 2012 15:09:31] (1329228571)
Status = [RUNNING] - [Tue 14 Feb 2012 15:09:36] (1329228576)
Status = [REALLY-RUNNING] - [Tue 14 Feb 2012 15:09:40] (1329228580)
Status = [DONE-OK] - [Tue 14 Feb 2012 15:09:43] (1329228583)
Issued Commands:
-------------------
*** Command Name = [JOB_REGISTER]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
Creation Time = [Tue 14 Feb 2012 15:09:29] (1329228569)
Start Scheduling Time = [Tue 14 Feb 2012 15:09:29] (1329228569)
Start Processing Time = [Tue 14 Feb 2012 15:09:29] (1329228569)
Execution Completed Time = [Tue 14 Feb 2012 15:09:29] (1329228569)
*** Command Name = [JOB_START]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
Creation Time = [Tue 14 Feb 2012 15:09:31] (1329228571)
Start Scheduling Time = [Tue 14 Feb 2012 15:09:31] (1329228571)
Start Processing Time = [Tue 14 Feb 2012 15:09:31] (1329228571)
Execution Completed Time = [Tue 14 Feb 2012 15:09:38] (1329228578)
Second Test :
PASSED
Third Test (MPI job requiring 4 cores):
Test submission requiring 4 cores:
-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl
https://cert-09.pd.infn.it:8443/CREAM888211702
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM888211702
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM888211702]
Status = [ABORTED]
ExitCode = []
FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes-) N/A (jobId = CREAM888211702)]
Third Test :
NOT PASSED
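The qsub message "cannot locate feasible nodes" means Torque found no set of nodes matching the request. One way to see what the batch system can offer is to total the np values reported by pbsnodes -a. This is a sketch only: it assumes the usual "np = N" lines in that output, ignores running jobs and queue limits, and uses hypothetical function names.

```shell
#!/bin/bash
# Summarize the "np = N" (slots per node) lines of `pbsnodes -a` output
# read on stdin. Prints "<nodes> <total_slots> <max_slots_on_one_node>".
slot_summary() {
    awk '$1 == "np" && $2 == "=" {
             nodes++; total += $3
             if ($3 > max) max = $3
         }
         END { printf "%d %d %d\n", nodes, total, max }'
}

# Read a slot_summary line on stdin and succeed if the requested number
# of cores does not exceed the total number of configured slots.
fits_cluster() {
    local wanted="$1" nodes total max
    read -r nodes total max
    [ "$wanted" -le "$total" ]
}
```

On the CE this could be run as pbsnodes -a | slot_summary | fits_cluster 4.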
Third Test (bis) (MPI job requiring 4 cores):
Repeat the third test, submitting a job that uses 4 cores.
Modify one YAIM variable in the services/glite-mpi_ce file:
MPI_SUBMIT_FILTER=${MPI_SUBMIT_FILTER:-"yes"}
then reconfigure the CE and resubmit:
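The YAIM variable above uses the shell ${VAR:-default} expansion, so "yes" only takes effect when MPI_SUBMIT_FILTER is unset or empty at that point. A value set earlier would win, for instance in site-info.def, assuming yaim reads that file first. The function effective_filter below is a hypothetical name used only to illustrate the expansion.

```shell
#!/bin/bash
# Demonstrates the ${VAR:-default} form used for MPI_SUBMIT_FILTER.
# effective_filter is a hypothetical name used only for this illustration.
effective_filter() {
    # Simulate a value that may or may not have been set earlier.
    local MPI_SUBMIT_FILTER="${1-}"
    # Keep a non-empty earlier value; otherwise fall back to "yes".
    MPI_SUBMIT_FILTER=${MPI_SUBMIT_FILTER:-"yes"}
    echo "$MPI_SUBMIT_FILTER"
}
```

Calling effective_filter with no argument prints yes, while effective_filter no prints no.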
-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl
https://cert-09.pd.infn.it:8443/CREAM115768488
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
Status = [IDLE]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
Status = [RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
Status = [REALLY-RUNNING]
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
Status = [DONE-OK]
ExitCode = [0]
-bash-3.2$ glite-ce-job-status -L 2 https://cert-09.pd.infn.it:8443/CREAM115768488
****** JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
Current Status = [DONE-OK]
Working Dir = [[reserved]]
ExitCode = [0]
Grid JobID = [N/A]
LRMS Abs JobID = [[reserved]]
LRMS JobID = [[reserved]]
Deleg Proxy ID = [e1444f9cc9df997f65b2b6d247d1dc582814c451]
DelegProxyInfo = [Valid From : 2/15/12 2:12 PM (GMT)
Valid To : 2/15/12 9:26 PM (GMT)
Holder Subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi
Holder CA : /C=IT/O=INFN/CN=INFN CA
VO : dteam
AC Issuer : CN=voms2.hellasgrid.gr,OU=hellasgrid.gr,O=HellasGrid,C=GR
Attribute : /dteam/Role=NULL/Capability=NULL /dteam/NGI_IT/Role=NULL/Capability=NULL
]
Worker Node = [cert-wn64-08.pn.pd.infn.it]
Local User = [dteam009]
CREAM ISB URI = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/11/CREAM115768488/ISB]
CREAM OSB URI = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/11/CREAM115768488/OSB]
JDL = [[ Arguments = "name_mpi OPENMPI"; QueueName = "cert"; JobType = "Normal"; Executable = "mpi-start-wrapper.sh"; VirtualOrganisation = "dteam"; InputSandbox = { "/home/traldi/JOB_MPI/mpi-start-wrapper.sh","/home/traldi/JOB_MPI/mpi-hooks.sh","/home/traldi/JOB_MPI/name_mpi.c" }; CPUNumber = 4; StdOutput = "std.out"; Type = "Job"; OutputSandboxBaseDestUri = "gsiftp://prod-se-01.pd.infn.it/tmp"; StdError = "std.err"; BatchSystem = "pbs"; OutputSandbox = { "std.err","std.out" } ]]
Type = [Normal]
Job status changes:
-------------------
Status = [REGISTERED] - [Wed 15 Feb 2012 15:18:00] (1329315480)
Status = [PENDING] - [Wed 15 Feb 2012 15:18:04] (1329315484)
Status = [IDLE] - [Wed 15 Feb 2012 15:18:04] (1329315484)
Status = [RUNNING] - [Wed 15 Feb 2012 15:18:11] (1329315491)
Status = [REALLY-RUNNING] - [Wed 15 Feb 2012 15:18:14] (1329315494)
Status = [DONE-OK] - [Wed 15 Feb 2012 15:23:24] (1329315804)
Issued Commands:
-------------------
*** Command Name = [JOB_REGISTER]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
Creation Time = [Wed 15 Feb 2012 15:17:59] (1329315479)
Start Scheduling Time = [Wed 15 Feb 2012 15:17:59] (1329315479)
Start Processing Time = [Wed 15 Feb 2012 15:17:59] (1329315479)
Execution Completed Time = [Wed 15 Feb 2012 15:18:01] (1329315481)
*** Command Name = [JOB_START]
Command Category = [JOB_MANAGEMENT]
Command Status = [SUCCESSFULL]
Creation Time = [Wed 15 Feb 2012 15:18:04] (1329315484)
Start Scheduling Time = [Wed 15 Feb 2012 15:18:04] (1329315484)
Start Processing Time = [Wed 15 Feb 2012 15:18:04] (1329315484)
Execution Completed Time = [Wed 15 Feb 2012 15:18:11] (1329315491)
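The epoch values in the "Job status changes" list make it easy to see how long the job spent in each state. A small sketch follows; it assumes each status entry sits on a single line in the form shown above, and the function name is hypothetical.

```shell
#!/bin/bash
# Read "Status = [STATE] - [date] (epoch)" lines on stdin and print the
# seconds between consecutive states as "<from> -> <to>: <seconds>".
state_durations() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\([^]]*\)\][[:space:]]*-[[:space:]]*\[[^]]*\][[:space:]]*(\([0-9][0-9]*\)).*$/\1 \2/p' |
    awk 'NR > 1 { printf "%s -> %s: %d\n", prev, $1, $2 - t }
         { prev = $1; t = $2 }'
}
```

Fed the status history of the 4-core job, the last line reads REALLY-RUNNING -> DONE-OK: 310, i.e. the job ran for a little over five minutes.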
On the CE:
[root@cert-09 ~]# qstat -n
cert-09.pd.infn.it:
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
5.cert-09.pd.inf     dteam009 cert     cream_115768488   29356     2   4    --    --  R 00:02
   cert-wn64-08+cert-wn64-08+cert-wn64-07+cert-wn64-07
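The last line of the qstat -n output lists one entry per allocated task. Counting the entries per host shows how the 4 tasks were spread over the worker nodes. This is a sketch with a hypothetical function name; it also strips a "/slot" suffix in case the local Torque version prints one.

```shell
#!/bin/bash
# Count tasks per worker node from the exec-host line of `qstat -n` (stdin).
# Handles both "host+host" and "host/slot+host/slot" forms.
tasks_per_node() {
    tr '+' '\n' |
    sed -e 's/^[[:space:]]*//' -e 's|/.*$||' -e '/^$/d' |
    sort | uniq -c |
    awk '{ print $2, $1 }'
}
```

For the line shown above, this prints cert-wn64-07 2 and cert-wn64-08 2: two tasks on each of the two worker nodes.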
Third Test (bis) :
PASSED