Monitoring Simulation Data

Monitoring information is critical in simulations, but collecting it can add significant overhead. For this reason, LAPIS provides object-based monitoring: whenever a monitoring-relevant object changes during the simulation, the object is put into a monitoring usim.Queue for further processing.

When running a simulation, register the logging callables you require with the monitoring component. A number of predefined logging callables are ready to use, see Predefined Monitoring Functions. Each logging callable is parameterised with the object types it is able to process. Whenever an object becomes available in the monitoring queue, the monitoring component checks whether matching logging callables have been registered to handle that object. The monitoring itself runs asynchronously: the logging process starts whenever elements become available in the monitoring queue.
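The queue-and-dispatch behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the LAPIS implementation: asyncio.Queue stands in for usim.Queue, the Drone class is a stand-in, the None sentinel is a device of this sketch, and matching objects to callables with isinstance against whitelist is an assumption.

```python
import asyncio
from typing import Callable, Dict, List


class Drone:
    """Stand-in for a monitored object type (not the LAPIS class)."""

    def __init__(self, name: str) -> None:
        self.name = name


def log_drone(drone: Drone) -> List[Dict]:
    return [{"drone": drone.name}]


# object types this callable is able to process
log_drone.whitelist = (Drone,)


async def dispatch(
    queue: asyncio.Queue, callables: List[Callable], sink: List[Dict]
) -> None:
    """Hand every queued object to the callables whose whitelist matches."""
    while True:
        obj = await queue.get()  # suspends until an object is available
        if obj is None:  # sentinel used by this sketch to stop the loop
            break
        for func in callables:
            if isinstance(obj, func.whitelist):
                sink.extend(func(obj))


async def main() -> List[Dict]:
    queue: asyncio.Queue = asyncio.Queue()
    records: List[Dict] = []
    worker = asyncio.ensure_future(dispatch(queue, [log_drone], records))
    await queue.put(Drone("drone-1"))  # matches the whitelist, gets logged
    await queue.put("not a drone")  # no matching callable, ignored
    await queue.put(None)
    await worker
    return records


records = asyncio.run(main())
print(records)
```

Only the object whose type appears in a whitelist produces records; everything else passes through the queue without any logging cost.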

If you want to define your own logging callable, for example one that logs information about changes to a drone, it should have the following format:

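from typing import Dict, List, Tuple

from lapis.drone import Drone
# LoggingSocketHandler and JsonFormatter have to be imported as well

def log_object(the_object: Drone) -> List[Dict]:
    # return one dict per record to log
    return []

log_object.name: str = "identifying_name"
log_object.whitelist: Tuple = (Drone,)  # object types handled by this callable
log_object.logging_formatter: Dict = {
    LoggingSocketHandler.__name__: JsonFormatter(),
}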

The object types processed by your callable are given as a tuple in whitelist. You further need to set an identifying name for your callable, as well as a logging.Formatter for each logging handler you want to support in logging_formatter.
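As an illustration of a callable that returns actual records, here is a minimal sketch that runs without LAPIS installed. The Drone class, its jobs attribute, and the record keys are hypothetical stand-ins; the standard library's SocketHandler and a plain logging.Formatter replace LoggingSocketHandler and JsonFormatter.

```python
import logging
import logging.handlers
from typing import Dict, List


class Drone:
    """Hypothetical stand-in; the real class is lapis.drone.Drone."""

    def __init__(self, name: str, jobs: int) -> None:
        self.name = name
        self.jobs = jobs


def log_drone_jobs(drone: Drone) -> List[Dict]:
    # one record per drone; the keys are made up for this sketch
    return [{"drone": drone.name, "jobs": drone.jobs}]


log_drone_jobs.name = "drone_jobs"
log_drone_jobs.whitelist = (Drone,)
log_drone_jobs.logging_formatter = {
    # keyed by the class name of the handler, as in the format above
    logging.handlers.SocketHandler.__name__: logging.Formatter("%(message)s"),
}

records = log_drone_jobs(Drone("drone-1", jobs=3))
print(log_drone_jobs.name, records)
```

Each dictionary in the returned list is one record; returning an empty list means nothing is logged for that object.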

Registering your logging callable is then straightforward; you just need to call

simulator.monitoring.register_statistic(log_object)

That’s it!

LAPIS currently supports logging to

  • TCP,

  • File, and/or

  • Telegraf.

See Command Line Interface for details on how to utilise the different logging options.

Predefined Monitoring Functions

LAPIS provides predefined functions that monitor relevant information about your pools, resources, and jobs. In addition, information relevant to COBalD is provided.

General Monitoring

lapis.monitor.general.resource_statistics(drone: lapis.drone.Drone) → List[Dict]

Log ratio of used and requested resources for drones.

Parameters

drone – the drone

Returns

list of records for logging

lapis.monitor.general.user_demand(job_queue: lapis.scheduler.JobQueue) → List[Dict]

Log global user demand.

Parameters

job_queue – the job queue

Returns

list of records for logging

lapis.monitor.general.job_statistics(scheduler: lapis.scheduler.CondorJobScheduler) → List[Dict]

Log number of jobs running in all drones.

Note

The logging is currently synchronised with the frequency of the scheduler. If a finer resolution is required, updates of drones could additionally be used to trigger logging.

Parameters

scheduler – the scheduler

Returns

list of records for logging

lapis.monitor.general.job_events(job: lapis.job.Job) → List[Dict]

Log relevant events for jobs. Relevant events are

  • start of a job,

  • finishing of a job, either successful or not.

Information about the start of a job is relevant to enable timely analysis of waiting times. For the finishing of a job, the record contains the success status as well as additional information on exceeded resources or refusal by the drone.

Warning

The logging format includes the name / identifier of a job. This might result in a huge index in the Grafana database. The job name is currently included to enable better lookup and analysis of related events.

Parameters

job – the job to log information for

Returns

list of records for logging

lapis.monitor.general.pool_status(pool: lapis.pool.Pool) → List[Dict]

Log state changes of pools and drones.

Parameters

pool – the pool

Returns

list of records for logging

lapis.monitor.general.configuration_information(simulator: Simulator) → List[Dict]

Log information on how pools and drones are configured, e.g. provided resources.

Parameters

simulator – the simulator

Returns

list of records for logging

COBalD-specific Monitoring

lapis.monitor.cobald.drone_statistics(drone: lapis.drone.Drone) → List[Dict]

Collect allocation, utilisation, demand and supply of drones.

Parameters

drone – the drone

Returns

list of records for logging

lapis.monitor.cobald.pool_statistics(pool: lapis.pool.Pool) → List[Dict]

Collect allocation, utilisation, demand and supply of pools.

Parameters

pool – the pool

Returns

list of records for logging

Caching-specific Monitoring

Todo

Will be added as soon as the caching branch is merged.

Telegraf

LAPIS supports sending monitoring information to Telegraf via the CLI option --log-telegraf. The monitoring information for Telegraf is sent to the default UDP logging port, logging.handlers.DEFAULT_UDP_LOGGING_PORT, which is port 9021.
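The port constant comes from the Python standard library and can be checked directly. The sketch below also sends one datagram over the loopback interface to show what a UDP listener receives; it deliberately binds to an OS-chosen free port so that it cannot clash with a running Telegraf instance, and the payload is a made-up example rather than actual LAPIS output.

```python
import logging.handlers
import socket

# constant defined by the standard library's logging.handlers module
port = logging.handlers.DEFAULT_UDP_LOGGING_PORT
print(port)

# receiver on an OS-chosen free port (port 0), loopback only
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)

# sender plays the role of the simulation emitting one record
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"user_demand value=3i", receiver.getsockname())

payload, _ = receiver.recvfrom(4096)
sender.close()
receiver.close()
print(payload)
```

To inspect real simulation output the same way, the receiver would bind to the logging port instead, provided no other process is already listening on it.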

Resource Status

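===========  ==================  ===========================  =======
type         name                values                       comment
===========  ==================  ===========================  =======
measurement  resource_status
tag          tardis              uuid
tag          resource_type       [memory | disk | cores | …]
tag          pool_configuration  [None | uuid]
tag          pool_type           [pool | drone]
tag          pool                uuid
field        used_ratio          float
field        requested_ratio     float
timestamp    time                float
===========  ==================  ===========================  =======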
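To make the roles of measurement, tags, fields, and timestamp concrete, the following sketch renders one resource_status record as a line in InfluxDB line protocol, a format Telegraf accepts. That LAPIS serialises records exactly this way is an assumption, as are the helper to_line, all tag and field values, and the conversion of the time to nanoseconds; escaping of special characters is omitted.

```python
from typing import Dict


def to_line(
    measurement: str, tags: Dict[str, str], fields: Dict[str, float], time: float
) -> str:
    """Render a record as 'measurement,tags fields timestamp' (no escaping)."""
    tag_part = ",".join(f"{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(
        f"{key}={value!r}" for key, value in sorted(fields.items())
    )
    # line protocol timestamps are integers; nanoseconds are assumed here
    return f"{measurement},{tag_part} {field_part} {int(time * 1e9)}"


line = to_line(
    "resource_status",
    tags={
        "tardis": "lapis-1",  # placeholder instead of a real uuid
        "resource_type": "memory",
        "pool_configuration": "None",
        "pool_type": "drone",
        "pool": "pool-1",  # placeholder instead of a real uuid
    },
    fields={"used_ratio": 0.5, "requested_ratio": 0.75},
    time=120.0,
)
print(line)
```

Tags identify which series a value belongs to, while fields carry the measured values; this is why identifiers such as pool and resource_type are tags and the ratios are fields.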

COBalD Status

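===========  ==================  ==============  ============
type         name                values          comment
===========  ==================  ==============  ============
measurement  cobald_status
tag          tardis              uuid
tag          pool_configuration  [None | uuid]
tag          pool_type           [pool | drone]
tag          pool                uuid
field        allocation          float
field        utilization         float
field        demand              float
field        supply              float
field        job_count           int             Running jobs
timestamp    time                float
===========  ==================  ==============  ============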

Pool Status

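===========  ==================  ================================  =======
type         name                values                            comment
===========  ==================  ================================  =======
measurement  system_status
tag          tardis              uuid
tag          parent_pool         uuid
tag          pool_configuration  [None | uuid]
tag          pool_type           [pool | drone]
tag          pool                uuid
field        status              [DownState | CleanupState | …]
timestamp    time                float
===========  ==================  ================================  =======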

User Demand

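===========  ===========  ======  =======
type         name         values  comment
===========  ===========  ======  =======
measurement  user_demand
tag          tardis       uuid
field        value        int
timestamp    time         float
===========  ===========  ======  =======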

Configuration

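===========  ==================  ===========================  =======
type         name                values                       comment
===========  ==================  ===========================  =======
measurement  configuration
tag          tardis              uuid
tag          pool_configuration  uuid
tag          resource_type       [memory | disk | cores | …]
field        value               float
timestamp    time                float
===========  ==================  ===========================  =======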