Refactor the per-compute-plugin hardware information

Currently, we have the following hardware metadata interfaces:

| Compute plugin method | What's included | How's updated/delivered |:----------------------|:----------------|:------------------------| `get_attached_devices()` | The device list with model names | As a part of the return value of kernel creation info | `extra_info()` | Device driver/runtime versions | Agent heartbeat (cached in DB) + GraphQL `Agent.compute_plugins` | `get_node_hwinfo()` | Per-slot-key health info and extra metadata | Manager-to-agent RPC call (not cached) + GraphQL `Agent.hardware_metadta` |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Let's clarify the prupose and role of these API as follows:

- [ ] Migrate extra metadata in `get_node_hwinfo()` to `extra_info()` in compute plugins\* Automatically filter out the detailed metadata in `Agent.compute_plugins` when the hardware information is hidden to the users

- [ ] Retire `get_node_hwinfo()` by introducing a new "health" type event to signal the health status changes of agents and its compute plugin devices.\* Later, we could add specific event handlers such as (separate issues & PRs):\* Automatic `schedulable=false` configuration per agent
- Automatic reduction of the agent resource slot ("per-device" schedulability control) with auto-killing containers using those devices
- Email notification
- Aggregate and store the health information in the `agents` table





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor the per-compute-plugin hardware information #1426

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Refactor the per-compute-plugin hardware information #1426

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions