WMS Test Plan

Service Description

The Workload Management System (gLite WMS) is a software service of the gLite suite that distributes and manages tasks across the computing and storage resources available on a Grid. The WMS assigns user jobs to the Computing Elements (CEs) and Storage Elements (SEs) of a Grid environment so that:

  • jobs are always executed on resources that match the job requirements
  • grid-wide load balance is maintained, i.e. jobs are evenly and efficiently distributed across the entire Grid.

The WMS receives job execution requests from a client, finds appropriate resources, then dispatches the jobs and follows them until completion, handling failures along the way whenever possible. Besides single batch-like jobs, the WMS handles compound job types: Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs), Parametric Jobs (multiple jobs with one parametrized description), and Collections (multiple jobs with a common description). Jobs are described via a flexible, high-level Job Definition Language (JDL).

Deployment scenarios

TBD

Functionality tests

Features/Scenarios to be tested

YAIM-WMS Configuration Testing

  • Installation and configuration starting from a clean machine (i.e. only the OS installed)
  • Update and configuration from a previous version

WMS Job Submission/GetOutput Testing

Submit a job to the WMS service and, when it has finished, retrieve the output. Test job submission with the following types of jobs:

Normal Job
  • Test submission of normal jobs with different options and situations (Implemented)

  • Test the complete cycle with the two types of CEs: LCG and CREAM (Implemented)

More JDLs can be added in the future.

Perusal job

Job perusal is the ability to view output from a job while it is running. (Implemented)

DAG job

Directed Acyclic Graphs (a set of jobs where the input/output/execution of one or more jobs may depend on one or more other jobs).

  • Submit a jdl like this one:
[
  type = "dag";
  DefaultNodeShallowRetryCount = 3;
  nodes = [
    nodeA = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeB = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    nodeC = [
      node_type = "edg-jdl";
      file ="jdl/arg.jdl" ; 
    ];
    dependencies = {
      { nodeA, nodeB },
      { nodeA, nodeC }
    }
  ];
]

  • When the job has finished, retrieve the output files
  • Check the final status of the DAG (all nodes and the parent should be "Cleared")

More JDLs can be added in the future.
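
The final-status check can be scripted. The sketch below is a hypothetical helper (the name all_cleared is not part of gLite). It assumes that the status output for the DAG and its nodes contains one "Current Status: <state>" line per job; adjust the pattern if your client prints something different.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if every "Current Status:" line read
# from stdin reports "Cleared" and at least one such line was seen.
# ASSUMPTION: the status output has lines like "Current Status:     Cleared".
all_cleared() {
    awk '/Current Status:/ { seen++; if ($3 != "Cleared") bad++ }
         END { exit !(seen > 0 && bad == 0) }'
}

# Possible usage (JOBID is a placeholder for the DAG identifier):
#   glite-wms-job-status "$JOBID" | all_cleared && echo "DAG fully cleared"
```

The function fails when no status line is found at all, so an empty or unexpected output is not mistaken for success.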

Parametric Job

Multiple jobs with one parametrized description. (Implemented)

Collection Job

Multiple jobs with a common description. There are two ways to submit a collection: create a single JDL that contains the JDLs of all the nodes, or submit all the JDLs stored in a directory (bulk submission).

Parallel Job

WMS Job shallow and deep re-submission

  • For the shallow resubmission submit the following jdl file:
    [
      requirements =  (other.GlueCEStateStatus == "Production");
      Rank = -2 * other.GlueCEStateWaitingJobs;
      Executable = "/bin/ls";
      prologue = "/bin/false";
      shallowretrycount = 3;
      RetryCount = 0;
      usertags = [ exe = "shallowresub" ];
    ]
    • The job should be Aborted with the message: hit job shallow retry count (3)
    • Check the logging-info (-v 2) of the job; you should find 3 events from the WorkloadManager like this one:
      Event: Resubmission
      - Arrived                    =    Fri Jun 11 13:11:22 2010 CEST
      - Host                       =    devel18.cnaf.infn.it
      - Reason                     =    token still exists
      - Result                     =    SHALLOW
      - Source                     =    WorkloadManager
      - Src instance               =    13035
      - Tag                        =    /var/glite/SandboxDir/LW/https_3a_2f_2fdevel07.cnaf.infn.it_3a9000_2fLWVHkUN2cBOiEcWR7Rc6ew/token.txt
      - Timestamp                  =    Fri Jun 11 13:11:22 2010 CEST

  • For the deep resubmission submit the following jdl:
    [
      requirements =  (other.GlueCEStateStatus == "Production");
      Rank = -2 * other.GlueCEStateWaitingJobs;
      Executable = "/bin/ls";
      prologue = "/bin/false";
      shallowretrycount = -1; 
      RetryCount = 3;
      usertags = [ exe = "deepresub" ];
    ]
    • The job should be Aborted with the message: hit job retry count (3)
    • Check the logging-info (-v 2) of the job; you should find 3 events from the WorkloadManager like this one:
      Event: Resubmission
      - Arrived                    =    Fri Jun 11 14:15:19 2010 CEST
      - Host                       =    devel18.cnaf.infn.it
      - Level                      =    SYSTEM
      - Priority                   =    synchronous
      - Reason                     =    shallow resubmission is disabled
      - Result                     =    WILLRESUB
      - Seqcode                    =    UI=000000:NS=0000000005:WM=000016:BH=0000000000:JSS=000009:LM=000018:LRMS=000000:APP=000000:LBS=000000
      - Source                     =    WorkloadManager

WMS Job List-match Testing

Without data

With data

  • Set these variables:
      export LCG_GFAL_INFOSYS=<BDII set in the wms conf file> (e.g. cert-bdii-04.cnaf.infn.it:2170)
      export LFC_HOST=<lfc host name> (e.g. lfcserver.cnaf.infn.it)
      export LFC_HOME=<lfc home directory> (e.g. lfcserver.cnaf.infn.it:/grid/infngrid)
  • Register a file on an SE using the lcg-cr command:
      lcg-cr --vo <VO name> -d <SE host> -l lfn:<logical file name> file:<local file path>
  • Use lcg-rep to create some replicas:
      lcg-rep  --vo <VO name> -d <SE host> lfn:<logical file name>
  • lcg-lr is useful to see the list of replicas:
      lcg-lr --vo <VO name>  lfn:<logical file name>
  • Run a list-match command with a JDL like this one (as InputData, put the LFN(s) registered before):
    ###########################################
    #      JDL with Data Requirements         #
    ###########################################
    
    Executable = "calc-pi.sh";
    Arguments = "1000";
    StdOutput = "std.out";
    StdError = "std.err";
    Prologue = "prologue.sh";
    InputSandbox = {"calc-pi.sh", "fileA", "fileB","prologue.sh"};
    OutputSandbox = {"std.out", "std.err","out-PI.txt","out-e.txt"};
    Requirements = true;
    
    DataRequirements = {
    [
    DataCatalogType = "DLI";
    DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
    InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
    ]
    };
    DataAccessProtocol = "gsiftp";
  • You should obtain an output like this one:
    [ale@cream-15 DataReq]$ glite-wms-job-list-match -a --rank --config ~/UI/etc/wmp_wms007.conf data-req.jdl
    Connecting to the service https://wms007.cnaf.infn.it:7443/glite_wms_wmproxy_server
    
    ==========================================================================
    
             COMPUTING ELEMENT IDs LIST
     The following CE(s) matching your job requirements have been found:
    
      *CEId*                                                   *Rank*
    
     - cert-15.pd.infn.it:8443/cream-lsf-cert                           0
     - gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs-cert           0
     - prod-ce-01.pd.infn.it:8443/cream-lsf-cert                        0
     - prod-ce-02.pd.infn.it:2119/jobmanager-lcglsf-cert                0
     - t2-ce-01.to.infn.it:8443/cream-pbs-cert                          0
     - t2-ce-01.to.infn.it:8443/cream-pbs-short                         0
     - t2-ce-02.to.infn.it:2119/jobmanager-lcgpbs-cert                  0
     - t2-ce-02.to.infn.it:2119/jobmanager-lcgpbs-short                 0
     - test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-cert               0
     - test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-parallel           0
     - devce.cnaf.infn.it:8443/cream-pbs-cert                           -700634
    
    ==========================================================================
  • Finally, you can check whether these CEs are really close to the SEs where you replicated your file(s)
    • First list the SEs:
      [ale@cream-15 DataReq]$ lcg-lr --vo dteam  lfn:/grid/infngrid/cesini/PI_1M.txt
      
      srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/infngrid/generated/2009-11-24/file0b6a613c-e1f1-4e53-a328-2152e44a8576
      srm://gridit-se-01.cnaf.infn.it/dpm/cnaf.infn.it/home/infngrid/generated/2009-11-24/fileb9f3b972-806c-48a6-92fc-a9677901403a
      srm://prod-se-01.pd.infn.it/infngrid/generated/2009-11-24/file10b5fe11-ad95-41e6-a3a3-14bc5cb9694b
      srm://t2-se-00.to.infn.it/dpm/to.infn.it/home/infngrid/generated/2009-11-24/file1bb28727-1356-4fb1-bb73-5362f93e766f
    • Then, for each SE in the list, use this command to find the "Close" CEs:
      ldapsearch -x -h <BDII set in the WMS conf file> -p 2170 -b mds-vo-name=local,o=grid "(&(objectclass=GlueCESEBindGroup)(GlueCESEBindGroupSEUniqueID=<SE HOST>))" | grep ^GlueCESEBindGroupCEUniqueID
      
      GlueCESEBindGroupCEUniqueID: devce.cnaf.infn.it:8443/cream-pbs-cert
      GlueCESEBindGroupCEUniqueID: gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs
      GlueCESEBindGroupCEUniqueID: cert-08.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: cert-05.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-par
      GlueCESEBindGroupCEUniqueID: gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs
      GlueCESEBindGroupCEUniqueID: cert-07.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: cert-09.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: gridit-ce-001.cnaf.infn.it:2119/jobmanager-lcgpbs
      GlueCESEBindGroupCEUniqueID: cert-06.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: cert-04.cnaf.infn.it:8443/cream-lsf-pps
      GlueCESEBindGroupCEUniqueID: test7200a.cnaf.infn.it:2119/jobmanager-lcgpbs-cer
      GlueCESEBindGroupCEUniqueID: cert-13.cnaf.infn.it:8443/cream-lsf-pps
    • This query may return more CEs than the list-match did, because the results should also be filtered using the "vo" attribute.
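
The comparison between the list-match result and the close CEs can be sketched as follows. All function names are hypothetical. Because the ldapsearch sample above shows CE IDs cut short (e.g. "jobmanager-lcgpbs-par"), the sketch treats a close-CE entry as a match when it is a prefix of a matched CE ID; this is an assumption that tolerates the truncation but may occasionally accept a wrong queue.

```shell
#!/bin/sh
# Print the CE IDs found in list-match output read from stdin
# (lines of the form " - host:port/queue   rank").
matched_ces() {
    awk '$1 == "-" && $2 ~ /:[0-9]+\// { print $2 }' | sort -u
}

# Print the CE IDs found in ldapsearch output read from stdin.
close_ces() {
    awk '$1 == "GlueCESEBindGroupCEUniqueID:" { print $2 }' | sort -u
}

# Print the matched CEs ($1, file) that have no entry in the close-CE
# list ($2, file). ASSUMPTION: prefix match, to tolerate truncated IDs.
ces_not_close() {
    awk 'FILENAME == ARGV[1] { close_ce[$0] = 1; next }
         { ok = 0
           for (c in close_ce) if (index($0, c) == 1) { ok = 1; break }
           if (!ok) print }' "$2" "$1"
}

# Possible usage: save the list-match output as match.txt, append the
# ldapsearch output of every SE to ldap.txt, then run
#   matched_ces < match.txt > m.txt; close_ces < ldap.txt > c.txt
#   ces_not_close m.txt c.txt     # no output means every CE is close
```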

WMS Job Cancel Testing

Normal job

DAG job

Collection

Node of a collection

Delegation Testing

Logging-info Testing

Job Status Testing

Others

BrokerInfo
  • Use a jdl like this one to verify its creation:
    ###########################################
    #    JDL to test Brokerinfo Creation      #
    ###########################################
    
    Executable = "/bin/ls";
    Arguments = "-la";
    StdOutput = "std.out";
    StdError = "std.err";
    FuzzyRank = true;
    InputSandbox = {"calc-pi.sh", "fileA", "fileB"};
    OutputSandbox = {"std.out", "std.err",".BrokerInfo"};
    
    Requirements = (other.GlueCEInfoHostName != "spacin-ce1.dma.unina.it") && !regexp("8443/cream", other.GlueCEUniqueID) ;
    
    DataRequirements = {
    [
    DataCatalogType = "DLI";
    DataCatalog = "http://lfcserver.cnaf.infn.it:8085";
    InputData = {"lfn:/grid/infngrid/cesini/PI_1M.txt","lfn:/grid/infngrid/cesini/e-2M.txt"};
    ]
    };
    
    DataAccessProtocol = "gsiftp";
    
  • After retrieving the output files, you should see these files:
    [ale@cream-15 DataReq]$ ls -la /tmp/ale_5XbbEhIAtjO5uC5wRyCmyA
    total 1264
    drwxr-xr-x   2 ale  ale     4096 Mar  4 10:31 .
    drwxrwxrwt  17 root root 1269760 Mar  4 10:31 ..
    -rw-rw-r--   1 ale  ale     7804 Mar  4 10:31 .BrokerInfo
    -rw-rw-r--   1 ale  ale        0 Mar  4 10:31 std.err
    -rw-rw-r--   1 ale  ale      624 Mar  4 10:31 std.out
    And also:
    [ale@cream-15 DataReq]$ cat /tmp/ale_5XbbEhIAtjO5uC5wRyCmyA/std.out
    total 32
    drwxr-xr-x  2 dteam038 dteam 4096 Mar  4 10:27 .
    drwx------  5 dteam038 dteam 4096 Mar  4 10:27 ..
    -rw-r--r--  1 dteam038 dteam 7804 Mar  4 10:27 .BrokerInfo
    -rw-r--r--  1 dteam038 dteam  216 Mar  4 10:27 calc-pi.sh
    -rw-r--r--  1 dteam038 dteam   17 Mar  4 10:27 fileA
    -rw-r--r--  1 dteam038 dteam   17 Mar  4 10:27 fileB
    -rw-r--r--  1 dteam038 dteam  166 Mar  4 10:27 https_3a_2f_2fdevel17.cnaf.infn.it_3a9000_2f5XbbEhIAtjO5uC5wRyCmyA.output
    -rw-r--r--  1 dteam038 dteam    0 Mar  4 10:27 std.err
    -rw-r--r--  1 dteam038 dteam    0 Mar  4 10:27 std.out
    -rw-------  1 dteam038 dteam    0 Mar  4 10:27 tmp.LesGC30313

  • Check the correctness of the generated .BrokerInfo file:
    • glite-brokerinfo getCE -f <.BrokerInfo file path>
      • It should return the matched CE
    • glite-brokerinfo getDataAccessProtocol -f <.BrokerInfo file path>
      • It should return the DataAccessProtocol parameter as specified in the JDL
    • glite-brokerinfo getInputData -f <.BrokerInfo file path>
      • It should return the InputData files specified in the JDL
    • glite-brokerinfo getSEs -f <.BrokerInfo file path>
      • It should return all the SEs (seen by the BDII of the WMS) that store all the files specified in the InputData attribute (you can check it using the command lcg-lr --vo <VO name> <logical file name>)
    • glite-brokerinfo getCloseSEs -f <.BrokerInfo file path>
      • It should return only the SEs close to the matched CE (to see them, use the command: ldapsearch -x -h <specify the BDII host name> -p 2170 -b mds-vo-name=local,o=grid "(&(objectclass=GlueCESEBindGroup)(GlueCESEBindGroupCEUniqueID=<specify the CE name>))" | grep GlueCESEBindGroupSEUniqueID)
    • glite-brokerinfo getLFN2SFN <LFN of the files in InputData> -f <.BrokerInfo file path>
      • It should return the same output as the command lcg-lr --vo <VO name> <logical file name>
    • glite-brokerinfo getVirtualOrganization -f <.BrokerInfo file path>
      • It should return your VO

Prologue and Epilogue jobs
  • Submit a jdl like this one:
    ###########################################
    #     JDL with Prologue and Epilogue      #
    ###########################################
    
    Executable = "exe.sh";
    Arguments = "1000";
    StdOutput = "std.out";
    StdError = "std.err";
    Prologue = "prologue.sh";
    Epilogue = "epilogue.sh";
    FuzzyRank = true;
    
    InputSandbox = {"exe/exe.sh", "data/pippo", "exe/epilogue.sh","exe/prologue.sh"};
    OutputSandbox = {"std.out", "std.err", "prologue.out", "epilogue.out"};
    
    #Requirements = regexp("8443/cream", other.GlueCEUniqueID);
    #Requirements = regexp("2119/jobmanager", other.GlueCEUniqueID);
    
    RetryCount = 1;
    ShallowRetryCount = 2;
  • Where:
    [ale@cream-15 UI]$ cat exe/exe.sh
    #!/bin/sh
    
    date
    echo "Hello world!"
    echo "My prologue said:" 
    cat prologue.out
    echo "My argument is $1"
    
    [ale@cream-15 UI]$ cat exe/prologue.sh 
    #!/bin/sh
    
    date > prologue.out
    echo "I'm the prologue" >> prologue.out
    echo "##########################" >> prologue.out
    echo "This is the output of an ls command" >> prologue.out
    ls >> prologue.out
    echo "##########################" >> prologue.out
    
    [ale@cream-15 UI]$ cat exe/epilogue.sh 
    #!/bin/sh
    
    echo " This is the output of the job: `cat std.out`" > epilogue.out
    echo "#########################" >> epilogue.out
    echo "We finish at `date`" >> epilogue.out
    echo "#########################" >> epilogue.out
    echo "All the jokes are done!" >> epilogue.out

  • Use the Requirements expression to choose an LCG or a CREAM CE
  • When the job is Done, retrieve the output sandbox; it should contain these files:
    • epilogue.out
    • prologue.out
    • std.err and std.out
  • This should be the content of the files:
    [ale@cream-15 UI]$ cat prologue.out
    Fri Jun 11 10:24:15 BST 2010
    I'm the prologue
    ##########################
    This is the output of an ls command
    epilogue.sh
    exe.sh
    pippo
    prologue.out
    prologue.sh
    ##########################
    
    [ale@cream-15 UI]$ cat std.out
    Fri Jun 11 10:24:15 BST 2010
    Hello world!
    My prologue said:
    Fri Jun 11 10:24:15 BST 2010
    I'm the prologue
    ##########################
    This is the output of an ls command
    epilogue.sh
    exe.sh
    pippo
    prologue.out
    prologue.sh
    ##########################
    My argument is 1000
    
    [ale@cream-15 UI]$ cat epilogue.out
     This is the output of the job: Fri Jun 11 10:24:15 BST 2010
    Hello world!
    My prologue said:
    Fri Jun 11 10:24:15 BST 2010
    I'm the prologue
    ##########################
    This is the output of an ls command
    epilogue.sh
    exe.sh
    pippo
    prologue.out
    prologue.sh
    ##########################
    My argument is 1000
    #########################
    We finish at Fri Jun 11 10:24:15 BST 2010
    #########################
    All the jokes are done!
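
The checks on the retrieved sandbox can be automated with a short script. This is a minimal sketch: the function names are hypothetical, and the expected strings are taken from the exe.sh, prologue.sh and epilogue.sh listings above with Arguments = "1000".

```shell
#!/bin/sh
# Hypothetical helper: verify that every named file exists in directory $1.
# Prints "missing: <file>" on stderr for each absent file and returns 1.
check_sandbox() {
    cs_dir=$1
    shift
    cs_missing=0
    for cs_file in "$@"; do
        if [ ! -f "$cs_dir/$cs_file" ]; then
            echo "missing: $cs_file" >&2
            cs_missing=1
        fi
    done
    return "$cs_missing"
}

# Hypothetical helper: verify the marker lines written by the three scripts.
check_contents() {
    grep -q "I'm the prologue" "$1/prologue.out" &&
    grep -q "My prologue said:" "$1/std.out" &&
    grep -q "My argument is 1000" "$1/std.out" &&
    grep -q "All the jokes are done!" "$1/epilogue.out"
}

# Possible usage (OUTDIR is the directory where the output was retrieved):
#   check_sandbox "$OUTDIR" std.out std.err prologue.out epilogue.out &&
#   check_contents "$OUTDIR" && echo "sandbox OK"
```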

Stress tests

Collection of 1000 nodes

Two days test

Description

  • 2880 collections each of 20 jobs
  • One collection every 60 seconds
  • Four users
  • Use LCG-CEs and CREAM-CEs (with different batch systems)
  • Use automatic-delegation
  • The job is a "sleep random(672)"
  • Resubmission is enabled
  • Enable proxy renewal
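
As a sanity check on the figures above, the total load and the duration follow directly from the description. The sketch below uses hypothetical function names; reading "sleep random(672)" as a uniformly random integer number of seconds in [0, 672) is an assumption.

```shell
#!/bin/sh
# Total number of jobs: collections ($1) times jobs per collection ($2).
total_jobs() {
    echo $(( $1 * $2 ))
}

# Test duration in hours: collections ($1) times interval in seconds ($2).
duration_hours() {
    echo $(( $1 * $2 / 3600 ))
}

# ASSUMPTION: "random(672)" means a uniform integer in [0, 672).
random_sleep_seconds() {
    awk 'BEGIN { srand(); print int(rand() * 672) }'
}

total_jobs 2880 20       # prints 57600
duration_hours 2880 60   # prints 48, i.e. two days
```

Under that reading, a job payload could simply run sleep "$(random_sleep_seconds)".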

-- ElisabettaMolinari - 2010-02-24

Topic revision: r16 - 2011-04-27 - AlessioGianelle
 