Glue2 support in CREAM

The CREAM CE provided with EMI-1 includes some initial support for Glue2, which still needs to be finalized.

1 Target scenario

1.1 CREAM CE in no cluster mode

We assume that a CREAM CE is configured in no cluster mode if it is the only CREAM CE available at the site. Sites with multiple CREAM CEs (submitting to the same batch system) should always have a cluster node and therefore be configured in cluster mode.

If a CREAM CE is configured in no cluster mode, all the Glue2 object classes are published by the resource BDII running on the CREAM CE.

These objectclasses are:

  • ComputingService (done)
  • ComputingEndpoint (done)
  • AccessPolicy (done)
  • ComputingManager (done)
  • ComputingShare (done)
  • MappingPolicy (done)
  • ExecutionEnvironment (done)
  • Benchmark (done)
  • ToStorageService (done)
  • ApplicationEnvironment (todo)
  • Endpoint for RTEPublisher (done)
    • "Child" of ComputingService
  • Endpoint for CEMon (done)
    • "Child" of ComputingService
    • Published only if CEMon is deployed

The ComputingServiceId is the value of the yaim variable COMPUTING_SERVICE_ID, if specified. Otherwise it is ${CE_HOST}_ComputingElement.

The EndpointId is hostname + "_org.glite.ce.CREAM".
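
The two identifier rules above can be sketched in a few lines of Python. This is only an illustration, not the actual yaim or information provider code: the function names are hypothetical, and treating an empty COMPUTING_SERVICE_ID as "not specified" is an assumption.

```python
def computing_service_id(ce_host, computing_service_id=None):
    """Return the ComputingServiceId for a CREAM CE.

    Uses COMPUTING_SERVICE_ID when it is set; otherwise falls back to
    ${CE_HOST}_ComputingElement. An empty string is treated as unset
    (assumption of this sketch).
    """
    if computing_service_id:
        return computing_service_id
    return f"{ce_host}_ComputingElement"


def endpoint_id(hostname):
    """Return the EndpointId: hostname + "_org.glite.ce.CREAM"."""
    return hostname + "_org.glite.ce.CREAM"
```

For the testbed host cream-38.pd.infn.it with no COMPUTING_SERVICE_ID set, these rules give cream-38.pd.infn.it_ComputingElement and cream-38.pd.infn.it_org.glite.ce.CREAM.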

1.2 CREAM CE in cluster mode

Sites with multiple CREAM CEs (submitting to the same batch system) should always have a cluster node and therefore be configured in cluster mode.

If a CREAM CE is configured in cluster mode:

  • The resource BDII running on the CREAM CE publishes just the following objectclasses:
    • ComputingEndpoint (done)
    • AccessPolicy (done)
  • All the other objectclasses are published by the resource BDII running on the gLite-CLUSTER node:
    • ComputingService (done)
    • ComputingManager (done)
    • ComputingShare (todo)
    • MappingPolicy (todo)
    • ExecutionEnvironment (done)
    • Benchmark (done)
    • ToStorageService (done)
    • ApplicationEnvironment (todo)
    • Endpoint for RTEPublisher (done)
      • "Child" of ComputingService

  • The ServiceId is the one specified by the yaim variable COMPUTING_SERVICE_ID, which in this scenario is mandatory. This variable must have the same value on all the relevant nodes (the cluster node and all the CREAM CEs).

The EndpointId is hostname + "_org.glite.ce.CREAM".
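
As a rough illustration of the cluster-mode rules, the following Python sketch decides which node publishes a given objectclass and checks that COMPUTING_SERVICE_ID is set and identical on all nodes. The function names, the node labels and the dictionary layout are hypothetical and are not taken from the yaim configuration.

```python
# Objectclasses that stay on the CREAM CE in cluster mode (from the list above).
CE_SIDE_CLASSES = {"ComputingEndpoint", "AccessPolicy"}


def publishing_node(objectclass, cluster_mode):
    """Return which resource BDII publishes the given objectclass."""
    if not cluster_mode or objectclass in CE_SIDE_CLASSES:
        return "CREAM CE"
    return "gLite-CLUSTER"


def check_service_id(ids_by_node):
    """Validate COMPUTING_SERVICE_ID across nodes in cluster mode.

    ids_by_node maps a node name to its COMPUTING_SERVICE_ID value
    (None or "" if unset). Returns the common value, or raises
    ValueError if it is missing on some node or differs between nodes.
    """
    if not ids_by_node:
        raise ValueError("no nodes given")
    missing = sorted(node for node, value in ids_by_node.items() if not value)
    if missing:
        raise ValueError("COMPUTING_SERVICE_ID not set on: " + ", ".join(missing))
    distinct = set(ids_by_node.values())
    if len(distinct) > 1:
        raise ValueError("COMPUTING_SERVICE_ID differs between nodes")
    return distinct.pop()
```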

2 Batch system dynamic information

2.1 Torque/PBS

The PBS dynamic plugin for Glue1 publishes for each batch system queue something like:

dn: GlueCEUniqueID=cream-38.pd.infn.it:8443/cream-pbs-creamtest1,mds-vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 2.5.7
GlueCEInfoTotalCPUs: 5
GlueCEPolicyAssignedJobSlots: 5
GlueCEStateFreeCPUs: 5
GlueCEPolicyMaxCPUTime: 2880
GlueCEPolicyMaxWallClockTime: 4320
GlueCEStateStatus: Production

The generic dynamic scheduler plugin for Glue1 publishes, for each VOView, something like:

dn: GlueVOViewLocalID=alice,GlueCEUniqueID=cream-38.pd.infn.it:8443/cream-pbs-creamtest1,mds-vo-name=resource,o=grid
GlueVOViewLocalID: alice
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateTotalJobs: 0
GlueCEStateFreeJobSlots: 5
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0

and for each queue publishes something like:

dn: GlueCEUniqueID=cream-38.pd.infn.it:8443/cream-pbs-creamtest1,mds-vo-name=resource,o=grid
GlueCEStateFreeJobSlots: 5
GlueCEStateFreeCPUs: 5
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateTotalJobs: 0
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0
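
When comparing the output of the different plugins it can help to load such fragments into dictionaries. The following is a minimal sketch, assuming simple "attribute: value" lines with entries separated by blank lines. It does not handle LDIF line folding, base64 values or multi-valued attributes (a repeated attribute overwrites the earlier value), and the function name is hypothetical.

```python
def parse_entries(text):
    """Split plugin output into a list of {attribute: value} dictionaries."""
    entries = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first ": " only, so host:port inside a dn is preserved.
        key, _, value = line.partition(": ")
        current[key] = value
    if current:
        entries.append(current)
    return entries


sample = """\
dn: GlueVOViewLocalID=alice,GlueCEUniqueID=cream-38.pd.infn.it:8443/cream-pbs-creamtest1,mds-vo-name=resource,o=grid
GlueVOViewLocalID: alice
GlueCEStateRunningJobs: 0
GlueCEStateFreeJobSlots: 5
"""

entries = parse_entries(sample)
```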

2.2 LSF

The LSF dynamic plugin for Glue1 publishes for each batch system queue something like:

dn: GlueCEUniqueID=cream-29.pd.infn.it:8443/cream-lsf-creamcert2,mds-vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 7 Update 5
GlueCEInfoTotalCPUs: 216
GlueCEPolicyAssignedJobSlots: 216
GlueCEPolicyMaxRunningJobs: 216
GlueCEPolicyMaxCPUTime: 9999999999
GlueCEPolicyMaxWallClockTime: 9999999999
GlueCEPolicyPriority: -20
GlueCEStateFreeCPUs: 6
GlueCEStateFreeJobSlots: 216
GlueCEStateStatus: Production

The generic dynamic scheduler plugin for Glue1 publishes, for each VOView, something like:

dn: GlueVOViewLocalID=alice,GlueCEUniqueID=cream-29.pd.infn.it:8443/cream-lsf-creamcert1,mds-vo-name=resource,o=grid
GlueVOViewLocalID: alice
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateTotalJobs: 0
GlueCEStateFreeJobSlots: 216
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0

and for each queue publishes something like:

dn: GlueCEUniqueID=cream-29.pd.infn.it:8443/cream-lsf-creamcert1,mds-vo-name=resource,o=grid
GlueCEStateFreeJobSlots: 216
GlueCEStateFreeCPUs: 216
GlueCEStateRunningJobs: 0
GlueCEStateWaitingJobs: 0
GlueCEStateTotalJobs: 0
GlueCEStateEstimatedResponseTime: 0
GlueCEStateWorstResponseTime: 0

2.3 SGE

2.4 Work to be done to support Glue2 publication

2.4.1 Work to be done in the PBS/Torque information provider

  • The value published in Glue1 as GlueCEInfoLRMSVersion should be published in Glue2 as ProductVersion (ComputingManager objectclass)
  • The value published in Glue1 as GlueCEPolicyMaxCPUTime should be published in Glue2 as MaxCPUTime (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEPolicyMaxWallClockTime should be published in Glue2 as MaxWallTime (ComputingShare objectclass)
    • For all the ComputingShares referring to that batch system queue
  • The value published in Glue1 as GlueCEPolicyAssignedJobSlots should be published in Glue2 as ??? Nowhere ??
  • The value published in Glue1 as GlueCEStateStatus should be published in Glue2 as ServingState (ComputingShare objectclass)
    • For all the ComputingShares referring to that batch system queue
  • The value published in Glue1 as GlueCEInfoTotalCPUs should be published in Glue2 as ??? Nowhere ??
  • The value published in Glue1 as GlueCEStateFreeCPUs should be published in Glue2 as ??? Nowhere ??
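
The "for all the ComputingShares referring to that batch system queue" rule can be pictured as a fan-out of the queue-level values to every share of that queue. The sketch below is only illustrative: the share identifiers are hypothetical, the values are copied unchanged (any unit or vocabulary conversion between Glue1 and Glue2 is outside its scope), and applying the same fan-out to MaxCPUTime is an assumption, since the list above states it explicitly only for MaxWallTime and ServingState.

```python
# Glue1 queue-level attribute -> Glue2 ComputingShare attribute (from the list above).
QUEUE_TO_SHARE = {
    "GlueCEPolicyMaxCPUTime": "MaxCPUTime",
    "GlueCEPolicyMaxWallClockTime": "MaxWallTime",
    "GlueCEStateStatus": "ServingState",
}


def fan_out_to_shares(queue_entry, share_ids):
    """Copy the mapped queue-level values to each share of the queue.

    queue_entry: {Glue1 attribute: value} for one batch system queue.
    share_ids:   identifiers of the ComputingShares referring to that queue
                 (hypothetical names, chosen by the caller).
    """
    share_values = {
        glue2: queue_entry[glue1]
        for glue1, glue2 in QUEUE_TO_SHARE.items()
        if glue1 in queue_entry
    }
    # Each share gets its own copy so later per-share updates do not interfere.
    return {share_id: dict(share_values) for share_id in share_ids}


queue = {
    "GlueCEPolicyMaxCPUTime": "2880",
    "GlueCEPolicyMaxWallClockTime": "4320",
    "GlueCEStateStatus": "Production",
}
shares = fan_out_to_shares(queue, ["creamtest1_alice", "creamtest1_dteam"])
```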

2.4.2 Work to be done in the LSF information provider

  • The value published in Glue1 as GlueCEInfoLRMSVersion should be published in Glue2 as ProductVersion (ComputingManager objectclass)
  • The value published in Glue1 as GlueCEPolicyMaxCPUTime should be published in Glue2 as MaxCPUTime (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEPolicyMaxWallClockTime should be published in Glue2 as MaxWallTime (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEPolicyMaxRunningJobs should be published in Glue2 as MaxRunningJobs (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateFreeJobSlots should be published in Glue2 as ???
    • But the same value is published by the generic provider ...
  • The value published in Glue1 as GlueCEStateStatus should be published in Glue2 as ServingState (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEPolicyAssignedJobSlots should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEInfoTotalCPUs should be published in Glue2 as ??? Nowhere ??
  • The value published in Glue1 as GlueCEStateFreeCPUs should be published in Glue2 as ??? Nowhere ??
    • But the same value is published by the generic provider ...
  • The value published in Glue1 as GlueCEPolicyPriority should be published in Glue2 as ??? Nowhere ??

2.4.3 Work to be done in the generic dynamic scheduler

  • The value published in Glue1 as GlueCEStateFreeJobSlots for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
    • But the same value is published by the LSF provider ...
  • The value published in Glue1 as GlueCEStateRunningJobs for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEStateWaitingJobs for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEStateTotalJobs for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEStateEstimatedResponseTime for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEStateWorstResponseTime for the GlueCE objectclass should be published in Glue2 as ?? Nowhere ??
  • The value published in Glue1 as GlueCEStateFreeCPUs for the GlueCE objectclass should be published in Glue2 as ??? Nowhere ??
    • But the same value is published by the batch system specific (PBS, LSF) provider ...

  • The value published in Glue1 as GlueCEStateRunningJobs for the VOView objectclass should be published in Glue2 as RunningJobs (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateWaitingJobs for the VOView objectclass should be published in Glue2 as WaitingJobs (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateTotalJobs for the VOView objectclass should be published in Glue2 as TotalJobs (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateFreeJobSlots for the VOView objectclass should be published in Glue2 as FreeSlots (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateEstimatedResponseTime for the VOView objectclass should be published in Glue2 as EstimatedAverageWaitingTime (ComputingShare objectclass)
  • The value published in Glue1 as GlueCEStateWorstResponseTime for the VOView objectclass should be published in Glue2 as EstimatedWorstWaitingTime (ComputingShare objectclass)
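
The VOView part of the mapping above could be expressed as a simple lookup table. This is a sketch under stated assumptions, not the dynamic scheduler code: the function name is hypothetical, the Glue2 names are the short attribute names used in this page rather than full LDAP attribute names, and values are copied as-is.

```python
# Glue1 VOView attribute -> Glue2 ComputingShare attribute (from the list above).
VOVIEW_TO_SHARE = {
    "GlueCEStateRunningJobs": "RunningJobs",
    "GlueCEStateWaitingJobs": "WaitingJobs",
    "GlueCEStateTotalJobs": "TotalJobs",
    "GlueCEStateFreeJobSlots": "FreeSlots",
    "GlueCEStateEstimatedResponseTime": "EstimatedAverageWaitingTime",
    "GlueCEStateWorstResponseTime": "EstimatedWorstWaitingTime",
}


def voview_to_share(voview):
    """Translate one Glue1 VOView entry into ComputingShare attributes.

    Attributes without a mapping (for example dn or GlueVOViewLocalID)
    are dropped; mapped values are copied unchanged.
    """
    return {
        glue2: voview[glue1]
        for glue1, glue2 in VOVIEW_TO_SHARE.items()
        if glue1 in voview
    }


voview = {
    "GlueVOViewLocalID": "alice",
    "GlueCEStateRunningJobs": "0",
    "GlueCEStateWaitingJobs": "0",
    "GlueCEStateTotalJobs": "0",
    "GlueCEStateFreeJobSlots": "5",
    "GlueCEStateEstimatedResponseTime": "0",
    "GlueCEStateWorstResponseTime": "0",
}
share = voview_to_share(voview)
```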

3 Relevant RFCs

4 Testbed

The following machines are being used for testing:

  • cream-38.pd.infn.it (Torque)
  • cream-29.pd.infn.it (LSF)

-- MassimoSgaravatto - 2011-06-21

Topic revision: r17 - 2011-09-16 - MassimoSgaravatto
 
