Simulation Concept
Background
Todo
HEP context.
Components
The core simulation builds on several components and concepts:
- Pools and their Controllers,
- Drones, and
- the Scheduler.
If you are planning to adapt the simulation for your specific use case, please consider the different components to determine what and where to extend functionality.
Job Generator
The Job Generator processes any job input files. It takes care to translate time-based characteristics of the jobs into simulation time. For this, the timestamp of the first job of each job input file is taken as the base timestamp, resulting in a time value of 0 for the first job. All following jobs are adapted accordingly, i.e. time is time - base.
The Job Generator itself acts as a generator, meaning that a job is put into the simulation's Job Queue as soon as the simulation time corresponds to the translated job queueing time.
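The time translation described above can be sketched as follows (an illustrative example, not the actual LAPIS implementation):

```python
# Illustrative sketch of the time translation described above:
# absolute queueing timestamps are shifted so that the first job
# of an input file enters the simulation at time 0.
def translate_job_times(queue_times):
    """Shift absolute timestamps so the first job starts at time 0."""
    base = queue_times[0]  # timestamp of the first job is the base
    return [t - base for t in queue_times]

# Jobs queued at absolute times 1000, 1005 and 1030 enter the
# simulation at times 0, 5 and 30.
print(translate_job_times([1000, 1005, 1030]))  # [0, 5, 30]
```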
Job Queue
The Job Queue is filled with jobs in creation-time order by the Job Generator. The queue is managed by the scheduler and contains all jobs that are not yet scheduled to a drone as well as jobs that have not yet been processed successfully.
Pools
Pools are created based on the pool input files. Each pool is characterised by a set of defined resources. Further, each pool has a capacity, i.e. the number of drones that can be created from it. If the capacity is not specified, a maximum capacity of float("inf") is assumed.
For pools, we differentiate static and dynamic pools. While static pools are initialised with a fixed number of drones, for dynamic pools the number of drones is adapted dynamically by the pool controller.
- class lapis.pool.Pool(make_drone: Callable, *, capacity: int = inf, init: int = 0, name: str = None)[source]
  A pool encapsulating a number of drones. Given a specific demand, allocation and utilisation, the pool is able to adapt in terms of the number of drones providing the given resources.
  Parameters:
  - make_drone – Callable to create a drone with specific properties for this pool
  - capacity – Maximum number of drones that can be instantiated within the pool
  - init – Number of drones to instantiate at creation time of the pool
  - name – Name of the pool
- class lapis.pool.StaticPool(make_drone: Callable, capacity: int = 0)[source]
  A static pool does not react to changing conditions regarding demand, allocation and utilisation, but instead initialises the given capacity of drones with initialised resources.
  Parameters:
  - capacity – Maximum number of drones that can be instantiated within the pool
  - resources – Dictionary of resources available for each drone instantiated within the pool
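The capacity and init semantics described above can be illustrated with a simplified, hypothetical stand-in (this is not the real lapis.pool.Pool; all names here are assumptions for illustration):

```python
# Hypothetical, simplified stand-in for the pool behaviour described
# above -- NOT the real lapis.pool.Pool. It only illustrates how
# `capacity` caps the number of drones and `init` seeds the pool.
import math

class SimplePool:
    def __init__(self, make_drone, *, capacity=math.inf, init=0, name=None):
        self.capacity = capacity
        self.name = name
        self._make_drone = make_drone
        # `init` drones are created at pool creation time
        self.drones = [make_drone() for _ in range(init)]

    def scale_to(self, demand):
        """Grow or shrink the pool towards `demand`, bounded by capacity."""
        target = int(min(demand, self.capacity))
        while len(self.drones) < target:
            self.drones.append(self._make_drone())
        del self.drones[target:]

pool = SimplePool(lambda: object(), capacity=4, init=2, name="example")
pool.scale_to(10)          # demand exceeds capacity ...
print(len(pool.drones))    # ... so the pool stops at 4
```

A static pool would simply never call anything like `scale_to` after initialisation, while a dynamic pool has its demand adjusted periodically by its controller.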
Controllers
Each pool is managed by a controller. Each controller runs periodically to check the allocation and utilisation of its assigned pool(s) in order to regulate the demand of drones for the given pool.
The concept of controllers is introduced by COBalD. The controllers implemented in LAPIS share the general concept as well as the implementation by subclassing the provided controllers, such as cobald.controller.linear.LinearController or cobald.controller.relative_supply.RelativeSupplyController, and overwriting the run() method, e.g. lapis.controller.SimulatedLinearController.run(). In this way, we enable validation of the current TARDIS/COBalD setup as well as simulation of future extensions.
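The periodic regulation described above can be sketched as a single controller iteration (an assumed simplification of a COBalD-style linear controller; the threshold names and values here are illustrative, not the actual cobald/lapis defaults):

```python
# Sketch of one iteration of the control loop described above,
# assuming a COBalD-style linear controller (hypothetical thresholds,
# not the actual cobald/lapis classes): demand is nudged up when the
# pool is well utilised and down when it sits mostly idle.
def linear_step(demand, allocation, utilisation, *, low=0.5, high=0.9, rate=1):
    """One controller iteration: return the new drone demand."""
    if utilisation >= high:
        return demand + rate          # pool is busy: ask for more drones
    if allocation <= low:
        return max(demand - rate, 0)  # pool is idle: release drones
    return demand                     # within the target band: keep demand

demand = 5
demand = linear_step(demand, allocation=0.95, utilisation=0.95)  # -> 6
demand = linear_step(demand, allocation=0.2, utilisation=0.1)    # -> 5
print(demand)
```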
Available controller implementations from COBalD in LAPIS are:
And there is also an implementation considered as an extension for COBalD:
Drones
Drones provide instances of the set of resources defined by a given pool. Drones are the only objects in the simulation that are able to process jobs. Simplified, drones represent worker nodes.
The concept of drones is introduced by TARDIS. A drone is a generalisation of the pilot concept used, for example, in High Energy Physics, and is a placeholder for the real workloads to be processed. A drone is expected to autonomously manage its lifecycle, meaning that it handles failures and termination independently from other components within the system.
Warning
Drones are not yet fully employed in LAPIS. They already run independently but do not handle termination themselves.
Scheduler
The scheduler is the connecting component between the jobs in the job queue and the running drones. It does the matchmaking between jobs and drones to assign each job to the best-evaluated drone. Whenever a job is assigned to a drone, the job is removed from the job queue. The scheduler is notified as soon as a job terminates, independent of the state of termination. It is then the task of the scheduler to decide whether to remove the job from the simulation in case of success, or to re-insert it into the job queue to retry processing.
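The termination handling described above can be sketched as follows (hypothetical names, not the LAPIS implementation): successful jobs leave the simulation, failed jobs go back into the queue.

```python
# Illustrative sketch of the scheduler's termination handling described
# above (hypothetical helper, not the LAPIS implementation): successful
# jobs leave the simulation, failed jobs are re-inserted for a retry.
from collections import deque

def handle_termination(job_queue, job, succeeded):
    """Drop finished jobs; re-insert failed ones for another attempt."""
    if not succeeded:
        job_queue.append(job)  # retry: back into the job queue

queue = deque()
handle_termination(queue, "job-1", succeeded=True)   # removed for good
handle_termination(queue, "job-2", succeeded=False)  # re-queued
print(list(queue))  # ['job-2']
```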
LAPIS currently supports an HTCondor-like implementation of a scheduler:
- class lapis.scheduler.CondorJobScheduler(job_queue)[source]
  The goal of the HTCondor job scheduler is to mimic how HTCondor schedules jobs. HTCondor does scheduling based on a priority queue. The priorities themselves are managed by the operators of HTCondor, so different instances can behave very differently.
  This implementation builds a priority queue that sorts job slots by increasing cost. The cost itself is calculated based on the current strategy used at GridKa. The scheduler checks whether a job either exactly fits a slot or fits into it several times. The cost for putting a job into a given slot is given by the amount of resources that might remain unallocated.
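The cost ranking sketched in the docstring above can be illustrated as follows (an assumed simplification for illustration, not GridKa's real strategy or the LAPIS code): the cost of placing a job on a drone is the amount of resources that would remain unallocated, so exact fits are cheapest.

```python
# Sketch of the cost ranking described above (assumed simplification,
# not GridKa's actual formula): the cost of placing a job on a drone
# is the amount of resources that would remain unallocated.
def placement_cost(job, drone):
    """Return leftover resources after placement, or None if it won't fit."""
    leftover = 0
    for resource, available in drone.items():
        required = job.get(resource, 0)
        if required > available:
            return None  # job does not fit this drone
        leftover += available - required
    return leftover

job = {"cores": 2, "memory": 4}
drones = [{"cores": 8, "memory": 16}, {"cores": 2, "memory": 4}]
costs = [placement_cost(job, d) for d in drones]
print(costs)  # [18, 0] -- the exact fit is the cheapest slot
```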
Warning
The implementation of the HTCondor scheduler is still very rough. The matchmaking currently does not rely on given requirements, but only considers required and provided resources for jobs and drones. The automatic clustering, therefore, also only relies on the type and number of resources and is applied to drones only at the moment.