Description
Currently, we have the following hardware metadata interfaces:
Compute plugin method | What's included | How's updated/delivered | :---------------------- | :---------------- | :------------------------ | get_attached_devices() |
The device list with model names | As a part of the return value of kernel creation info | extra_info() |
Device driver/runtime versions | Agent heartbeat (cached in DB) + GraphQL Agent.compute_plugins |
get_node_hwinfo() |
Per-slot-key health info and extra metadata | Manager-to-agent RPC call (not cached) + GraphQL Agent.hardware_metadta |
---|
Let's clarify the prupose and role of these API as follows:
-
Migrate extra metadata in
get_node_hwinfo()
toextra_info()
in compute plugins* Automatically filter out the detailed metadata inAgent.compute_plugins
when the hardware information is hidden to the users -
Retire
get_node_hwinfo()
by introducing a new "health" type event to signal the health status changes of agents and its compute plugin devices.* Later, we could add specific event handlers such as (separate issues & PRs):* Automaticschedulable=false
configuration per agent -
Automatic reduction of the agent resource slot ("per-device" schedulability control) with auto-killing containers using those devices
-
Email notification
-
Aggregate and store the health information in the
agents
table