Skip to content

Refactor the per-compute-plugin hardware information #1426

Open
@achimnol

Description

@achimnol

Currently, we have the following hardware metadata interfaces:

Compute plugin method What's included How's updated/delivered :---------------------- :---------------- :------------------------ get_attached_devices() The device list with model names As a part of the return value of kernel creation info extra_info() Device driver/runtime versions Agent heartbeat (cached in DB) + GraphQL Agent.compute_plugins get_node_hwinfo() Per-slot-key health info and extra metadata Manager-to-agent RPC call (not cached) + GraphQL Agent.hardware_metadta

Let's clarify the prupose and role of these API as follows:

  • Migrate extra metadata in get_node_hwinfo() to extra_info() in compute plugins* Automatically filter out the detailed metadata in Agent.compute_plugins when the hardware information is hidden to the users

  • Retire get_node_hwinfo() by introducing a new "health" type event to signal the health status changes of agents and its compute plugin devices.* Later, we could add specific event handlers such as (separate issues & PRs):* Automatic schedulable=false configuration per agent

  • Automatic reduction of the agent resource slot ("per-device" schedulability control) with auto-killing containers using those devices

  • Email notification

  • Aggregate and store the health information in the agents table

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions