Monitoring Simulation Data
Monitoring information is critical in simulations, but collecting it can add
significant overhead. For this reason, LAPIS provides object-based monitoring:
whenever a monitoring-relevant object changes during the simulation, the object
is put into a monitoring usim.Queue for further processing.
When running a simulation, register the logging callables you require with the monitoring component. A number of predefined logging callables are available, see Predefined Monitoring Functions. Each logging callable declares the object types it is able to process. Whenever an object becomes available in the monitoring queue, the monitoring component checks whether matching logging callables have been registered for that object's type. The monitoring itself runs asynchronously: the logging process starts as soon as elements become available in the monitoring queue.
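The type-based matching described above can be sketched in a few lines. This is a simplified, synchronous illustration only: `MonitoringSketch`, `drain`, `drone_jobs` and the stand-in `Drone` and `Job` classes are hypothetical names, and a plain `deque` replaces the asynchronous usim.Queue that LAPIS uses.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Type


class Drone:  # hypothetical stand-in for lapis.drone.Drone
    def __init__(self, name: str, jobs: int) -> None:
        self.name = name
        self.jobs = jobs


class Job:  # hypothetical stand-in for lapis.job.Job
    def __init__(self, name: str) -> None:
        self.name = name


class MonitoringSketch:
    """Toy registry that maps object types to the callables accepting them."""

    def __init__(self) -> None:
        self.queue: Deque[object] = deque()
        self._statistics: Dict[Type, List[Callable]] = {}

    def register_statistic(self, statistic: Callable) -> None:
        # Every type listed in the callable's whitelist routes to the callable.
        for element in statistic.whitelist:
            self._statistics.setdefault(element, []).append(statistic)

    def drain(self) -> List[Dict]:
        """Process everything currently queued and return the records."""
        records: List[Dict] = []
        while self.queue:
            log_object = self.queue.popleft()
            # Exact type lookup; objects without a matching callable are skipped.
            for statistic in self._statistics.get(type(log_object), []):
                records.extend(statistic(log_object))
        return records


def drone_jobs(drone: Drone) -> List[Dict]:
    return [{"drone": drone.name, "jobs": drone.jobs}]


drone_jobs.name = "drone_jobs"
drone_jobs.whitelist = (Drone,)

monitoring = MonitoringSketch()
monitoring.register_statistic(drone_jobs)
monitoring.queue.append(Drone("drone-1", jobs=3))
monitoring.queue.append(Job("job-1"))  # no callable is registered for Job
records = monitoring.drain()
print(records)
```

Only the drone produces a record here, because no registered callable lists `Job` in its whitelist.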
If you want to define your own logging callable, for example one that logs information about changes to a drone, it should have the following format:
def log_object(the_object: Drone) -> List[Dict]:
    return []
log_object.name: str = "identifying_name"
log_object.whitelist: Tuple = (Drone,)
log_object.logging_formatter: Dict = {
    LoggingSocketHandler.__name__: JsonFormatter(),
}
The object types processed by your callable are given as a tuple in whitelist. You further need to set an identifying name for your callable as well as logging_formatter, which maps the class name of a logging handler to the logging.Formatter used for that logging option.
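As an illustration of a callable that actually produces records, the following sketch reports the core usage of a drone. It is a minimal example under stated assumptions: the `Drone` dataclass, its `used_cores` and `total_cores` attributes, and the record keys are hypothetical, and the standard library's `logging.StreamHandler` and `logging.Formatter` stand in for the LAPIS handler and formatter classes.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Drone:  # hypothetical stand-in; the real class is lapis.drone.Drone
    name: str
    used_cores: float
    total_cores: float


def core_usage(drone: Drone) -> List[Dict]:
    """Return one record describing the core usage of a drone."""
    if drone.total_cores == 0:
        return []  # nothing sensible to report without any cores
    return [
        {
            "pool": drone.name,
            "resource_type": "cores",
            "used_ratio": drone.used_cores / drone.total_cores,
        }
    ]


# Metadata attributes attached to the function object.
core_usage.name = "core_usage"
core_usage.whitelist = (Drone,)
core_usage.logging_formatter = {
    # key: class name of a logging handler, value: formatter used with it
    logging.StreamHandler.__name__: logging.Formatter("%(message)s"),
}

records = core_usage(Drone("drone-1", used_cores=2, total_cores=8))
print(records)
```

Returning an empty list is a simple way to signal that there is nothing to log for a given object.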
To register your logging callable, you just need to call
simulator.monitoring.register_statistic(log_object)
That’s it!
LAPIS currently supports logging to TCP, File, and/or Telegraf.
See Command Line Interface for details on how to utilise the different logging options.
Predefined Monitoring Functions
LAPIS provides predefined functions that monitor relevant information about your pools, resources, and jobs. In addition, functions for information relevant to COBalD are provided.
General Monitoring
lapis.monitor.general.resource_statistics(drone: lapis.drone.Drone) → List[Dict]
Log ratio of used and requested resources for drones.
- Parameters
drone – the drone
- Returns
list of records for logging
lapis.monitor.general.user_demand(job_queue: lapis.scheduler.JobQueue) → List[Dict]
Log global user demand.
- Parameters
job_queue – the job queue
- Returns
list of records for logging
lapis.monitor.general.job_statistics(scheduler: lapis.scheduler.CondorJobScheduler) → List[Dict]
Log number of jobs running in all drones.
Note
The logging is currently synchronised with the frequency of the scheduler. If a finer resolution is required, drone updates can additionally be taken into account.
- Parameters
scheduler – the scheduler
- Returns
list of records for logging
lapis.monitor.general.job_events(job: lapis.job.Job) → List[Dict]
Log relevant events for jobs. Relevant events are
- start of a job,
- finishing of a job, either successful or not.
Information about the start of a job is relevant to enable timely analysis of waiting times. When a job finishes, the record states whether it succeeded and adds information on exceeded resources or refusal by the drone.
Warning
The logging format includes the name / identifier of a job. This might result in a huge index of the Grafana database. The job name is currently included to enable better lookup and analysis of related events.
- Parameters
job – the job to log information for
- Returns
list of records for logging
lapis.monitor.general.pool_status(pool: lapis.pool.Pool) → List[Dict]
Log state changes of pools and drones.
- Parameters
pool – the pool
- Returns
list of records for logging
COBalD-specific Monitoring
lapis.monitor.cobald.drone_statistics(drone: lapis.drone.Drone) → List[Dict]
Collect allocation, utilisation, demand and supply of drones.
- Parameters
drone – the drone
- Returns
list of records for logging
lapis.monitor.cobald.pool_statistics(pool: lapis.pool.Pool) → List[Dict]
Collect allocation, utilisation, demand and supply of pools.
- Parameters
pool – the pool
- Returns
list of records to log
Caching-specific Monitoring
Todo
Will be added as soon as the caching branch is merged.
Telegraf
LAPIS supports sending monitoring information to Telegraf via the CLI option --log-telegraf. The monitoring information for Telegraf is sent to the default UDP logging port, logging.handlers.DEFAULT_UDP_LOGGING_PORT, which is port 9021.
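The port constant comes from Python's standard library and can be checked directly. The sketch below also shows one way a UDP logging handler could send formatted text instead of the pickled records that `logging.handlers.DatagramHandler` sends by default. The class name `TextDatagramHandler` and the message content are assumptions for illustration, not the LAPIS implementation. No datagram is sent, since only the payload is built.

```python
import logging
import logging.handlers


class TextDatagramHandler(logging.handlers.DatagramHandler):
    """Hypothetical handler that sends formatted text instead of pickles."""

    def makePickle(self, record: logging.LogRecord) -> bytes:
        # DatagramHandler pickles records by default; a text-based listener
        # cannot read pickles, so the formatted message is sent instead.
        return self.format(record).encode("utf-8")


port = logging.handlers.DEFAULT_UDP_LOGGING_PORT
handler = TextDatagramHandler("localhost", port)  # socket is created lazily
handler.setFormatter(logging.Formatter("%(message)s"))

record = logging.LogRecord(
    name="monitoring",
    level=logging.INFO,
    pathname="sketch.py",
    lineno=0,
    msg="user_demand value=%d",
    args=(3,),
    exc_info=None,
)
payload = handler.makePickle(record)
print(port, payload)
```

Attaching such a handler to a logger with `logger.addHandler(handler)` would send one datagram per log call to the given host and port.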
Resource Status
| type | name | values | comment |
| --- | --- | --- | --- |
| measurement | resource_status | – | |
| tag | tardis | uuid | |
| tag | resource_type | [memory \| disk \| cores \| …] | |
| tag | pool_configuration | [ | |
| tag | pool_type | [pool \| drone] | |
| tag | pool | uuid | |
| field | used_ratio | | |
| field | requested_ratio | | |
| timestamp | time | | |
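The measurement, tag, field and timestamp terminology matches the InfluxDB line protocol that Telegraf can consume. Assuming records are rendered in that protocol, a resource status record could look like the line produced by the sketch below. The helper `to_line_protocol` and all tag and field values are hypothetical, escaping of special characters is omitted, and the pool_configuration tag is left out because its values are not listed above.

```python
from typing import Dict, Union

FieldValue = Union[float, str]


def to_line_protocol(
    measurement: str,
    tags: Dict[str, str],
    fields: Dict[str, FieldValue],
    timestamp: int,
) -> str:
    """Render one record as a line protocol line (no escaping applied)."""
    # Tags are sorted by key and appended to the measurement name.
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(
        # String field values are quoted, numeric ones are written as-is.
        f'{key}="{value}"' if isinstance(value, str) else f"{key}={value}"
        for key, value in fields.items()
    )
    return f"{measurement}{tag_part} {field_part} {timestamp}"


line = to_line_protocol(
    "resource_status",
    tags={
        "tardis": "lapis-1",
        "resource_type": "memory",
        "pool_type": "drone",
        "pool": "drone-42",
    },
    fields={"used_ratio": 0.5, "requested_ratio": 0.75},
    timestamp=1000,
)
print(line)
```

Tags end up as indexed metadata while fields carry the measured values, which is why the ratios are fields and the pool identifiers are tags.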
COBalD Status
| type | name | values | comment |
| --- | --- | --- | --- |
| measurement | cobald_status | – | |
| tag | tardis | uuid | |
| tag | pool_configuration | [ | |
| tag | pool_type | [pool \| drone] | |
| tag | pool | uuid | |
| field | allocation | | |
| field | utilization | | |
| field | demand | | |
| field | supply | | |
| field | job_count | | Running jobs |
| timestamp | time | | |
Pool Status
| type | name | values | comment |
| --- | --- | --- | --- |
| measurement | system_status | – | |
| tag | tardis | uuid | |
| tag | parent_pool | uuid | |
| tag | pool_configuration | [ | |
| tag | pool_type | [pool \| drone] | |
| tag | pool | uuid | |
| field | status | [DownState \| CleanupState \| …] | |
| timestamp | time | | |
User Demand
| type | name | values | comment |
| --- | --- | --- | --- |
| measurement | user_demand | – | |
| tag | tardis | uuid | |
| field | value | | |
| timestamp | time | | |
Configuration
| type | name | values | comment |
| --- | --- | --- | --- |
| measurement | configuration | – | |
| tag | tardis | uuid | |
| tag | pool_configuration | uuid | |
| tag | resource_type | [memory \| disk \| cores \| …] | |
| field | value | | |
| timestamp | time | | |