
EMI-CREAM-Torque: installation using the latest Torque (2.5.7-7) and Maui 3.3-4, with MPI (CE and WN)

On the CE host (also the PBS/Torque batch master)

Here are the steps to install the latest CREAM CE using the Torque Staged-Rollout release in the ig/gLite distribution.

INSTALLATION:

Repository settings:

cd /etc/yum.repos.d/
mv dag.repo dag.repo.orig
wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-cert-emi.repo
cd /root/
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm
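
Before moving on to the package installation it can help to confirm that every download actually landed. The following is a minimal sketch, not part of the original procedure; the helper name missing_files is hypothetical. It prints the name of each expected file that is absent or empty in a given directory.

```shell
# Hypothetical helper: report which expected files are missing from a directory.
# Usage: missing_files DIR FILE...
missing_files() {
    dir=$1
    shift
    for f in "$@"; do
        # -s is true only if the file exists and is not empty
        [ -s "$dir/$f" ] || printf '%s\n' "$f"
    done
}

# Example (not run here):
#   missing_files /etc/yum.repos.d egi-trustanchors.repo igi-cert-emi.repo
#   missing_files /root epel-release-5-4.noarch.rpm emi-release-1.0.1-1.sl5.noarch.rpm
```

No output means all listed files are present.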

Package installation (CA, epel, emi, cream, torque,...):

yum install ca-policy-egi-core
yum localinstall *.rpm
yum install xml-commons-apis
yum install emi-cream-ce
yum install emi-torque-server emi-torque-utils
yum install glite-mpi

Munge configuration:

/usr/sbin/create-munge-key
service munge start
chkconfig munge on
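
If munge does not start, the key file is a reasonable first thing to look at, since the daemon is generally strict about who can read it. A minimal sketch of such a check follows; the helper name munge_key_ok is hypothetical and it assumes GNU stat as shipped with Scientific Linux.

```shell
# Hypothetical helper: succeed only if the key file exists, is non-empty,
# and grants no permissions to group or others.
# Assumes GNU stat (`stat -c %a` prints the octal mode, e.g. 400).
munge_key_ok() {
    key=$1
    [ -s "$key" ] || return 1
    mode=$(stat -c %a "$key") || return 1
    case $mode in
        *00) return 0 ;;   # e.g. 400 or 600: owner-only access
        *)   return 1 ;;
    esac
}

# Example (not run here):
#   munge_key_ok /etc/munge/munge.key || echo "check /etc/munge/munge.key"
```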

Starting PBS:

/etc/init.d/pbs_server start

File LOG

Installation File on CE - cert-09.pd.infn.it Work Log

CONFIGURATION:

 /opt/glite/yaim/bin/yaim -c -d 6 -s /usr/local/nfs/cert-3_2/rtc_mpi/rtc-site-info.def -n MPI_CE -n creamCE -n TORQUE_server -n TORQUE_utils 2>&1 | tee /root/conf_EMI_CREAM_Torque_MPI.`hostname -s`.`date +%Y%m%d-%H%M%S`.log
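
The CE and the WN use the same yaim invocation and differ only in the site-info.def path and the -n node types. A small sketch that assembles the command from those two inputs is shown here; the helper name yaim_cmd is hypothetical and it only prints the command, it does not run yaim.

```shell
# Hypothetical helper: print the yaim invocation for a site-info.def file
# and a list of node types, in the same form as the command above.
# Usage: yaim_cmd SITE_INFO NODE_TYPE...
yaim_cmd() {
    siteinfo=$1
    shift
    cmd="/opt/glite/yaim/bin/yaim -c -d 6 -s $siteinfo"
    for node in "$@"; do
        # one -n option per node type
        cmd="$cmd -n $node"
    done
    printf '%s\n' "$cmd"
}

# Example (not run here):
#   yaim_cmd /usr/local/nfs/cert-3_2/rtc_mpi/rtc-site-info.def \
#       MPI_CE creamCE TORQUE_server TORQUE_utils
```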

SSH Customization:

Modify the file /etc/ssh/sshd_config following the example attached here
Modify the file /etc/ssh/shosts.equiv following the example attached here

service sshd restart
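
The attached examples are not reproduced on this page. Assuming they enable host-based authentication (the mechanism that shosts.equiv belongs to), the sketch below reports the value sshd would pick up for that keyword. The helper name hostbased_setting is hypothetical; it ignores Match blocks and the "keyword=value" spelling.

```shell
# Hypothetical helper: print the first uncommented HostbasedAuthentication
# value found in an sshd_config-style file, or "unset" if there is none.
# sshd uses the first value it obtains for a keyword; keywords are
# case-insensitive, so the comparison is done in lower case.
hostbased_setting() {
    awk 'tolower($1) == "hostbasedauthentication" { print tolower($2); found = 1; exit }
         END { if (!found) print "unset" }' "$1"
}

# Example (not run here):
#   hostbased_setting /etc/ssh/sshd_config
```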

File LOG:

Yaim Configuration File on CE cert-09.pd.infn.it

On WN Hosts:

INSTALLATION:

Repository settings:

cd /etc/yum.repos.d/
mv dag.repo dag.repo.orig
wget http://repo-pd.italiangrid.it/mrepo/repos/egi-trustanchors.repo
wget http://repo-pd.italiangrid.it/mrepo/repos/igi/sl5/x86_64/igi-cert-emi.repo
cd /root/
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://emisoft.web.cern.ch/emisoft/dist/EMI/1/sl5/x86_64/updates/emi-release-1.0.1-1.sl5.noarch.rpm

Package installation (CA, epel, emi, cream, torque,...):

yum install ca-policy-egi-core
yum localinstall *.rpm
yum install igi-wn_torque_noafs
yum install glite-mpi
yum install openmpi openmpi-devel mpich2

Munge configuration:

scp cert-09:/etc/munge/munge.key /etc/munge/
chown munge:munge /etc/munge/munge.key
service munge start
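
Munge only works between hosts that share exactly the same key, so it is worth confirming that the copy on the WN matches the one on the CE. A minimal sketch follows; the helper name same_key is hypothetical and the two paths in the example are placeholders.

```shell
# Hypothetical helper: succeed if two key files are readable and
# byte-for-byte identical.
same_key() {
    [ -r "$1" ] && [ -r "$2" ] && cmp -s "$1" "$2"
}

# Example (not run here; /tmp/ce-munge.key is a placeholder for a copy
# of the CE key fetched separately):
#   same_key /etc/munge/munge.key /tmp/ce-munge.key && echo "keys match"
#
# A commonly used end-to-end check with the munge tools themselves:
#   munge -n | unmunge                 # local encode/decode
#   munge -n | ssh cert-09 unmunge     # credential created here, decoded on the CE
```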

File LOG:

Installation File on WN - cert-wn64-08.pn.pd.infn.it Work Log

CONFIGURATION:

/opt/glite/yaim/bin/yaim -c -d 6 -s /usr/local/nfs/cert-3_2/rtc_mpi/rtc-site-info.def -n MPI_WN -n WN_torque_noafs  2>&1 | tee /root/conf_WN_Torque_MPI.`hostname -s`.`date +%Y%m%d-%H%M%S`.log

SSH Customization:

Modify the file /etc/ssh/sshd_config following the example attached here
Modify the file /etc/ssh/shosts.equiv following the example attached here

service sshd restart

File LOG:

Yaim Configuration File on WN cert-wn64-08.pn.pd.infn.it

TESTING:

JOB Submission:

First Test (simple job):

From an EMI UI, create a proxy and submit the job:
 
-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a testCream.jdl
https://cert-09.pd.infn.it:8443/CREAM043342708


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
        Status        = [RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
        Status        = [REALLY-RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM043342708

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM043342708]
        Status        = [DONE-OK]
        ExitCode      = [0]
First Test : PASSED
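
The repeated glite-ce-job-status calls above can be scripted. The sketch below is one possible way to do it, not the procedure used for this test: job_state extracts the bracketed value from the short status output, and is_final_state recognises the CREAM states after which the job no longer changes. Both helper names are hypothetical.

```shell
# Hypothetical helpers for scripting the status checks shown above.

# Read glite-ce-job-status output on stdin and print the value inside
# the brackets of the first line that starts with "Status = [...]".
job_state() {
    sed -n 's/^[[:space:]]*Status[[:space:]]*=[[:space:]]*\[\([^]]*\)\].*$/\1/p' | head -n 1
}

# Succeed if the given state is a final one.
is_final_state() {
    case $1 in
        DONE-OK|DONE-FAILED|ABORTED|CANCELLED) return 0 ;;
        *) return 1 ;;
    esac
}

# Example polling loop (not run here; JOBID is a placeholder):
#   until is_final_state "$(glite-ce-job-status "$JOBID" | job_state)"; do
#       sleep 10
#   done
```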

Second Test (simple MPI job with 2 cores):

-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl
https://cert-09.pd.infn.it:8443/CREAM986827158
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Status        = [IDLE]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Status        = [RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Status        = [REALLY-RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Status        = [REALLY-RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Status        = [DONE-OK]
        ExitCode      = [1]

-bash-3.2$ glite-ce-job-status -L 2 https://cert-09.pd.infn.it:8443/CREAM986827158

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM986827158]
        Current Status = [DONE-OK]
        Working Dir    = [[reserved]]
        ExitCode       = [1]
        Grid JobID     = [N/A]
        LRMS Abs JobID = [[reserved]]
        LRMS JobID     = [[reserved]]
        Deleg Proxy ID = [613f3558926a0ba8642c51cfcaeb019b88a5b791]
        DelegProxyInfo = [Valid From      : 2/14/12 2:04 PM (GMT)
                          Valid To       : 2/14/12 11:14 PM (GMT)
                          Holder Subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi
                          Holder CA      : /C=IT/O=INFN/CN=INFN CA

                          VO              : dteam
                          AC Issuer       : CN=voms2.hellasgrid.gr,OU=hellasgrid.gr,O=HellasGrid,C=GR
                          Attribute       : /dteam/Role=NULL/Capability=NULL /dteam/NGI_IT/Role=NULL/Capability=NULL
                          ]
        Worker Node    = [cert-wn64-08.pn.pd.infn.it]
        Local User     = [dteam009]
        CREAM ISB URI  = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/98/CREAM986827158/ISB]
        CREAM OSB URI  = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/98/CREAM986827158/OSB]
        JDL            = [[ Arguments = "name_mpi OPENMPI"; QueueName = "cert"; JobType = "Normal"; Executable = "mpi-start-wrapper.sh"; VirtualOrganisation = "dteam"; InputSandbox = { "/home/traldi/JOB_MPI/mpi-start-wrapper.sh","/home/traldi/JOB_MPI/mpi-hooks.sh","/home/traldi/JOB_MPI/name_mpi.c" }; CPUNumber = 2; StdOutput = "std.out"; Type = "Job"; OutputSandboxBaseDestUri = "gsiftp://prod-se-01.pd.infn.it/tmp"; StdError = "std.err"; BatchSystem = "pbs"; OutputSandbox = { "std.err","std.out" } ]]
        Type           = [Normal]

        Job status changes:
        -------------------
        Status         = [REGISTERED] - [Tue 14 Feb 2012 15:09:29] (1329228569)
        Status         = [PENDING] - [Tue 14 Feb 2012 15:09:31] (1329228571)
        Status         = [IDLE] - [Tue 14 Feb 2012 15:09:31] (1329228571)
        Status         = [RUNNING] - [Tue 14 Feb 2012 15:09:36] (1329228576)
        Status         = [REALLY-RUNNING] - [Tue 14 Feb 2012 15:09:40] (1329228580)
        Status         = [DONE-OK] - [Tue 14 Feb 2012 15:09:43] (1329228583)

        Issued Commands:
        -------------------

        *** Command Name              = [JOB_REGISTER]
            Command Category          = [JOB_MANAGEMENT]
            Command Status            = [SUCCESSFULL]
            Creation Time             = [Tue 14 Feb 2012 15:09:29] (1329228569)
            Start Scheduling Time     = [Tue 14 Feb 2012 15:09:29] (1329228569)
            Start Processing Time     = [Tue 14 Feb 2012 15:09:29] (1329228569)
            Execution Completed Time  = [Tue 14 Feb 2012 15:09:29] (1329228569)


        *** Command Name              = [JOB_START]
            Command Category          = [JOB_MANAGEMENT]
            Command Status            = [SUCCESSFULL]
            Creation Time             = [Tue 14 Feb 2012 15:09:31] (1329228571)
            Start Scheduling Time     = [Tue 14 Feb 2012 15:09:31] (1329228571)
            Start Processing Time     = [Tue 14 Feb 2012 15:09:31] (1329228571)
            Execution Completed Time  = [Tue 14 Feb 2012 15:09:38] (1329228578)

Second Test : PASSED
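
The file mpi-start-wrapper_Cream.jdl itself is not attached to this page. The sketch below rebuilds a comparable JDL from the attributes echoed in the detailed status output of the MPI job. It is an approximation: QueueName, BatchSystem and VirtualOrganisation are left out on the assumption that they are filled in at submission time, and the helper name write_mpi_jdl is hypothetical.

```shell
# Hypothetical helper: write an MPI test JDL to FILE requesting CPUS cores.
# Attribute values are taken from the JDL shown in the status output.
# Usage: write_mpi_jdl FILE CPUS
write_mpi_jdl() {
    cat > "$1" <<EOF
[
  Type = "Job";
  JobType = "Normal";
  Executable = "mpi-start-wrapper.sh";
  Arguments = "name_mpi OPENMPI";
  CPUNumber = $2;
  StdOutput = "std.out";
  StdError = "std.err";
  InputSandbox = {
    "/home/traldi/JOB_MPI/mpi-start-wrapper.sh",
    "/home/traldi/JOB_MPI/mpi-hooks.sh",
    "/home/traldi/JOB_MPI/name_mpi.c"
  };
  OutputSandbox = { "std.err", "std.out" };
  OutputSandboxBaseDestUri = "gsiftp://prod-se-01.pd.infn.it/tmp";
]
EOF
}

# Example (not run here):
#   write_mpi_jdl mpi-start-wrapper_Cream.jdl 2
```

Only CPUNumber changes between the 2-core and 4-core tests.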

Third Test (MPI, 4 cores required):

Test submission requesting 4 cores:

-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl 
https://cert-09.pd.infn.it:8443/CREAM888211702


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM888211702

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM888211702]
        Status        = [ABORTED]
        ExitCode      = []
        FailureReason = [BLAH error: submission command failed (exit code = 1) (stdout:) (stderr:qsub: Job exceeds queue resource limits MSG=cannot locate feasible nodes-) N/A (jobId = CREAM888211702)]
Third Test : NOT PASSED

Third Test (bis) (MPI, 4 cores required):

Run the third test again, submitting a job that uses 4 cores.

First modify one YAIM variable in the services/glite-mpi_ce file:

MPI_SUBMIT_FILTER=${MPI_SUBMIT_FILTER:-"yes"}
and reconfigure the CE with YAIM.
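
The ${VAR:-"yes"} form only supplies "yes" when MPI_SUBMIT_FILTER is unset or empty at the time the line is evaluated; a value set earlier is kept. The sketch below shows that behaviour on any file containing the assignment. The helper name effective_submit_filter is hypothetical, and whether site-info.def is read before this file is an assumption that should be checked on the actual installation.

```shell
# Hypothetical helper: print the value MPI_SUBMIT_FILTER ends up with after
# sourcing FILE. Runs in a subshell so the caller's variables are untouched.
effective_submit_filter() {
    (
        . "$1" >/dev/null 2>&1 || exit 1
        printf '%s\n' "$MPI_SUBMIT_FILTER"
    )
}

# Example (not run here; the path is a placeholder for the real file):
#   effective_submit_filter /path/to/services/glite-mpi_ce
```

With the variable unset this prints yes; with MPI_SUBMIT_FILTER=no already set it prints no, which would leave the filter disabled despite the edit.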

-bash-3.2$ glite-ce-job-submit -r cert-09.pd.infn.it:8443/cream-pbs-cert -a mpi-start-wrapper_Cream.jdl 
https://cert-09.pd.infn.it:8443/CREAM115768488
-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
        Status        = [IDLE]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
        Status        = [RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
        Status        = [REALLY-RUNNING]


-bash-3.2$ glite-ce-job-status https://cert-09.pd.infn.it:8443/CREAM115768488

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
        Status        = [DONE-OK]
        ExitCode      = [0]


-bash-3.2$ glite-ce-job-status -L 2 https://cert-09.pd.infn.it:8443/CREAM115768488

******  JobID=[https://cert-09.pd.infn.it:8443/CREAM115768488]
        Current Status = [DONE-OK]
        Working Dir    = [[reserved]]
        ExitCode       = [0]
        Grid JobID     = [N/A]
        LRMS Abs JobID = [[reserved]]
        LRMS JobID     = [[reserved]]
        Deleg Proxy ID = [e1444f9cc9df997f65b2b6d247d1dc582814c451]
        DelegProxyInfo = [Valid From      : 2/15/12 2:12 PM (GMT)
                          Valid To       : 2/15/12 9:26 PM (GMT)
                          Holder Subject : /C=IT/O=INFN/OU=Personal Certificate/L=Padova/CN=Sergio Traldi
                          Holder CA      : /C=IT/O=INFN/CN=INFN CA
                          
                          VO              : dteam
                          AC Issuer       : CN=voms2.hellasgrid.gr,OU=hellasgrid.gr,O=HellasGrid,C=GR
                          Attribute       : /dteam/Role=NULL/Capability=NULL /dteam/NGI_IT/Role=NULL/Capability=NULL 
                          ]
        Worker Node    = [cert-wn64-08.pn.pd.infn.it]
        Local User     = [dteam009]
        CREAM ISB URI  = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/11/CREAM115768488/ISB]
        CREAM OSB URI  = [gsiftp://cert-09.pd.infn.it/var/cream_sandbox/dteam/_C_IT_O_INFN_OU_Personal_Certificate_L_Padova_CN_Sergio_Traldi_dteam_Role_NULL_Capability_NULL_dteam009/11/CREAM115768488/OSB]
        JDL            = [[ Arguments = "name_mpi OPENMPI"; QueueName = "cert"; JobType = "Normal"; Executable = "mpi-start-wrapper.sh"; VirtualOrganisation = "dteam"; InputSandbox = { "/home/traldi/JOB_MPI/mpi-start-wrapper.sh","/home/traldi/JOB_MPI/mpi-hooks.sh","/home/traldi/JOB_MPI/name_mpi.c" }; CPUNumber = 4; StdOutput = "std.out"; Type = "Job"; OutputSandboxBaseDestUri = "gsiftp://prod-se-01.pd.infn.it/tmp"; StdError = "std.err"; BatchSystem = "pbs"; OutputSandbox = { "std.err","std.out" } ]]
        Type           = [Normal]

        Job status changes:
        -------------------
        Status         = [REGISTERED] - [Wed 15 Feb 2012 15:18:00] (1329315480)
        Status         = [PENDING] - [Wed 15 Feb 2012 15:18:04] (1329315484)
        Status         = [IDLE] - [Wed 15 Feb 2012 15:18:04] (1329315484)
        Status         = [RUNNING] - [Wed 15 Feb 2012 15:18:11] (1329315491)
        Status         = [REALLY-RUNNING] - [Wed 15 Feb 2012 15:18:14] (1329315494)
        Status         = [DONE-OK] - [Wed 15 Feb 2012 15:23:24] (1329315804)

        Issued Commands:
        -------------------

        *** Command Name              = [JOB_REGISTER]
            Command Category          = [JOB_MANAGEMENT]
            Command Status            = [SUCCESSFULL]
            Creation Time             = [Wed 15 Feb 2012 15:17:59] (1329315479)
            Start Scheduling Time     = [Wed 15 Feb 2012 15:17:59] (1329315479)
            Start Processing Time     = [Wed 15 Feb 2012 15:17:59] (1329315479)
            Execution Completed Time  = [Wed 15 Feb 2012 15:18:01] (1329315481)


        *** Command Name              = [JOB_START]
            Command Category          = [JOB_MANAGEMENT]
            Command Status            = [SUCCESSFULL]
            Creation Time             = [Wed 15 Feb 2012 15:18:04] (1329315484)
            Start Scheduling Time     = [Wed 15 Feb 2012 15:18:04] (1329315484)
            Start Processing Time     = [Wed 15 Feb 2012 15:18:04] (1329315484)
            Execution Completed Time  = [Wed 15 Feb 2012 15:18:11] (1329315491)

On the CE:

[root@cert-09 ~]# qstat -n

cert-09.pd.infn.it: 
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
5.cert-09.pd.inf     dteam009 cert     cream_115768488   29356     2   4    --    --  R 00:02
   cert-wn64-08+cert-wn64-08+cert-wn64-07+cert-wn64-07
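
The last line of the qstat -n output lists one entry per allocated slot, so counting entries per host shows how the 4 requested cores were spread over the WNs. A minimal sketch follows; the helper name count_slots is hypothetical. It also strips a trailing "/N" slot index in case the output includes one.

```shell
# Hypothetical helper: read a qstat -n host list on stdin
# (entries separated by "+") and print "host=slots" per host.
count_slots() {
    tr '+' '\n' |
        sed 's|/.*$||' |            # drop a "/N" slot index if present
        awk 'NF { print $1 }' |     # trim whitespace, skip empty lines
        sort | uniq -c |
        awk '{ print $2 "=" $1 }'
}

# Example:
#   echo 'cert-wn64-08+cert-wn64-08+cert-wn64-07+cert-wn64-07' | count_slots
```

For the job above this gives 2 slots on each of cert-wn64-07 and cert-wn64-08, matching the NDS=2 and TSK=4 columns.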

Third Test (bis) : PASSED

-- SergioTraldi - 2012-02-15

Topic revision: r3 - 2012-02-17 - SergioTraldi