Simulation Concept

Background

Todo

HEP context.

Components

The core simulation builds on several components and concepts:

If you are planning to adapt the simulation for your specific use case, please review the different components to determine which functionality to extend and where.

Job Generator

The Job Generator processes the job input files and translates the time-based characteristics of the jobs into simulation time. For this, the timestamp of the first job of each job input file is taken as the base timestamp, resulting in a time value of 0 for the first job. All following jobs are adapted accordingly, i.e. time = time - base.

The Job Generator itself acts as a generator: a job is put into the simulation's Job Queue as soon as the simulation time reaches the translated job queueing time.
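The timestamp translation described above can be sketched as follows. This is an illustrative stand-in, not the actual LAPIS implementation; the `queued_at` field name is an assumption.

```python
# Hypothetical sketch of the Job Generator's timestamp translation:
# the first job's timestamp becomes the base, so it starts at time 0
# and all later jobs keep their relative offsets.
def translate_timestamps(jobs):
    """Shift job timestamps so the first job starts at simulation time 0."""
    base = jobs[0]["queued_at"]  # base timestamp of this job input file
    return [{**job, "queued_at": job["queued_at"] - base} for job in jobs]

jobs = [{"queued_at": 1000}, {"queued_at": 1042}, {"queued_at": 1100}]
print(translate_timestamps(jobs))
# → [{'queued_at': 0}, {'queued_at': 42}, {'queued_at': 100}]
```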

Job Queue

The Job Queue is filled with jobs in creation-time order by the Job Generator. The queue is managed by the scheduler and contains all jobs that are not yet scheduled to a drone as well as jobs that have not yet been processed successfully.

Pools

Pools are created based on the pool input files. Each pool is characterised by a set of defined resources. Further, each pool has a capacity, i.e. the maximum number of drones that can be created from it. If the capacity is not specified, a maximum capacity of float("inf") is assumed.

For pools, we differentiate between static and dynamic pools. While static pools are initialised with a fixed number of drones, for dynamic pools the number of drones is adapted dynamically by the pool controller.

class lapis.pool.Pool(make_drone: Callable, *, capacity: int = inf, init: int = 0, name: str = None)[source]

A pool encapsulating a number of drones. Given a specific demand, allocation and utilisation, the pool is able to adapt the number of drones providing the given resources.

Parameters
  • capacity – Maximum number of drones that can be instantiated within the pool

  • init – Number of drones to instantiate at creation time of the pool

  • name – Name of the pool

  • make_drone – Callable to create a drone with specific properties for this pool

class lapis.pool.StaticPool(make_drone: Callable, capacity: int = 0)[source]

A static pool does not react to changing conditions regarding demand, allocation and utilisation. Instead, it initialises a fixed number of drones (given by capacity) with the specified resources.

Parameters
  • capacity – Maximum number of drones that can be instantiated within the pool

  • resources – Dictionary of resources available for each drone instantiated within the pool
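A minimal stand-in can illustrate the two constructors documented above, in particular the unbounded default capacity of dynamic pools and the fixed initialisation of static pools. The class bodies are assumptions for illustration only; just the signatures follow the documented API.

```python
# Illustrative stand-ins mirroring the documented constructor signatures;
# the bodies are NOT the actual LAPIS implementation.
class Pool:
    def __init__(self, make_drone, *, capacity=float("inf"), init=0, name=None):
        self.make_drone = make_drone
        self.capacity = capacity  # unbounded unless specified
        self.name = name
        self.drones = [make_drone() for _ in range(init)]

class StaticPool(Pool):
    def __init__(self, make_drone, capacity=0):
        # A static pool starts at full capacity and never adapts afterwards.
        super().__init__(make_drone, capacity=capacity, init=capacity)

def make_drone():
    return object()  # placeholder drone factory

dynamic = Pool(make_drone, name="dynamic")
static = StaticPool(make_drone, capacity=3)
print(dynamic.capacity, len(static.drones))  # → inf 3
```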

Controllers

Each pool is managed by a controller that runs periodically to check allocation and utilisation of the assigned pool(s) and to regulate the demand for drones of the given pool.

The concept of controllers is introduced by COBalD. The controllers implemented in LAPIS share the general concept as well as the implementation by subclassing provided controllers such as cobald.controller.linear.LinearController or cobald.controller.relative_supply.RelativeSupplyController and overriding run(), e.g. lapis.controller.SimulatedLinearController.run(). In this way, the current TARDIS/COBalD setup can be validated and future extensions can be simulated.
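The subclassing pattern can be sketched as below. The `LinearController` stand-in, the `Pool` stub, and the thresholds are illustrative assumptions, not the real COBalD or LAPIS code; only the shape (a subclass overriding an async `run()` service entry point) follows the documented pattern.

```python
import asyncio

class Pool:
    """Stub pool exposing the quantities a controller regulates."""
    def __init__(self):
        self.demand = 0
        self.utilisation = 1.0  # fraction of acquired resources in use
        self.allocation = 1.0   # fraction of acquired resources assigned

class LinearController:
    """Stand-in for cobald.controller.linear.LinearController."""
    def __init__(self, target, rate=1, low_utilisation=0.5, low_allocation=0.5):
        self.target = target
        self.rate = rate
        self.low_utilisation = low_utilisation
        self.low_allocation = low_allocation

    def regulate(self, interval):
        # Linear rule: grow demand while the pool is well used,
        # shrink it while acquired resources sit idle.
        if self.target.utilisation >= self.low_utilisation:
            self.target.demand += self.rate * interval
        elif self.target.allocation <= self.low_allocation:
            self.target.demand = max(0, self.target.demand - self.rate * interval)

class SimulatedLinearController(LinearController):
    async def run(self, steps=5):
        # Service entry point: re-evaluate the pool once per simulated interval.
        for _ in range(steps):
            self.regulate(interval=1)
            await asyncio.sleep(0)  # yield control, standing in for simulation time

pool = Pool()
asyncio.run(SimulatedLinearController(pool).run())
print(pool.demand)  # → 5 (demand grew while utilisation stayed high)
```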

Available controller implementations from COBalD in LAPIS are:

class lapis.controller.SimulatedLinearController(*args, **kwargs)[source]
async run()[source]

Service entry point

class lapis.controller.SimulatedRelativeSupplyController(*args, **kwargs)[source]
async run()[source]

Service entry point

In addition, there is an implementation considered an extension to COBalD:

class lapis.controller.SimulatedCostController(*args, **kwargs)[source]

Drones

Drones provide instances of the set of resources defined by a given pool. Drones are the only objects in the simulation that are able to process jobs. Simplified, drones represent worker nodes.

The concept of drones is introduced by TARDIS. A drone is a generalisation of the pilot concept used, for example, in High Energy Physics, and acts as a placeholder for the real workloads to be processed. A drone is expected to manage its lifecycle autonomously, meaning that it handles failures and termination independently from other components within the system.

Warning

Drones are not yet fully employed in LAPIS. They already run independently but do not handle termination themselves.

Scheduler

The scheduler is the connecting component between the jobs in the job queue and the running drones. It performs the matchmaking between jobs and drones and assigns each job to the best-evaluated drone. Whenever a job is assigned to a drone, it is removed from the job queue. The scheduler is notified as soon as a job terminates, regardless of the termination state. It is then the scheduler's task to either remove the job from the simulation in case of success or to re-insert it into the job queue to retry processing.
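The termination handling described above can be sketched as follows. The `Scheduler` class and its attributes are illustrative assumptions, not the LAPIS API; only the success/failure branching follows the documented behaviour.

```python
from collections import deque

class Scheduler:
    """Hypothetical sketch of the scheduler's termination handling."""
    def __init__(self):
        self.job_queue = deque()  # jobs waiting to be matched to a drone
        self.finished = []        # jobs removed from the simulation

    def job_terminated(self, job, successful):
        # The scheduler is notified on every termination, regardless of state.
        if successful:
            self.finished.append(job)   # success: remove from the simulation
        else:
            self.job_queue.append(job)  # failure: re-insert to retry processing

scheduler = Scheduler()
scheduler.job_terminated("job-1", successful=True)
scheduler.job_terminated("job-2", successful=False)
print(len(scheduler.finished), len(scheduler.job_queue))  # → 1 1
```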

LAPIS currently supports an HTCondor-like implementation of a scheduler:

class lapis.scheduler.CondorJobScheduler(job_queue)[source]

The goal of the HTCondor job scheduler is to mimic how HTCondor schedules jobs. HTCondor performs scheduling based on a priority queue. The priorities themselves are managed by the operators of an HTCondor instance, so different instances can behave very differently.

This implementation builds a priority queue that sorts job slots by increasing cost. The cost is calculated based on the strategy currently used at GridKa. The scheduler checks whether a job fits a slot exactly or fits into it several times. The cost of putting a job into a given slot is the amount of resources that would remain unallocated.
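The cost notion above can be illustrated with a small sketch: the cost of placing a job in a slot is the amount of resources left unallocated, and the scheduler prefers the slot with the lowest cost. Resource names and the exact formula are assumptions for illustration, not the actual GridKa strategy.

```python
def placement_cost(slot_resources, job_resources):
    """Resources that would remain unallocated if the job ran in this slot."""
    return sum(
        slot_resources[name] - job_resources.get(name, 0)
        for name in slot_resources
    )

job = {"cores": 4, "memory": 8}
slots = [{"cores": 8, "memory": 16}, {"cores": 4, "memory": 8}]

# Sorting slots by increasing cost picks the tightest fit first:
best = min(slots, key=lambda slot: placement_cost(slot, job))
print(placement_cost(slots[0], job), placement_cost(best, job))  # → 12 0
```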

Warning

The implementation of the HTCondor scheduler is still very rough. The matchmaking currently does not rely on given requirements, but only considers required and provided resources for jobs and drones. The automatic clustering, therefore, also only relies on the type and number of resources and is applied to drones only at the moment.