[Feature Request] Eval should be allowed to run with many different random seeds and generation params then averaged #328

@tawnymanticore

Is your feature request related to a problem? Please describe.
When running evaluations on the evaluation set, the client should be able to specify how many times each sample is repeated. E.g., for a 100-sample eval set, they may want to run over it 3x (bumping the eval size up to 300 generations) with different random seeds to get a sense of the variance, since in production they may not use greedy decoding.
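To make the request concrete, here is a minimal sketch of a repeated, seeded eval loop that averages across runs. Everything in it is hypothetical: `evaluate_with_repeats` and `toy_score` are illustrative names rather than part of any existing API, and `toy_score` is a stand-in for "generate with sampling, then grade".

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence


def evaluate_with_repeats(
    samples: Sequence[int],
    score_fn: Callable[[int, random.Random], float],
    seeds: Sequence[int],
) -> Dict[str, object]:
    """Score every sample once per seed, then aggregate across seeds."""
    per_seed_means: List[float] = []
    for seed in seeds:
        rng = random.Random(seed)  # one independent RNG per repetition
        scores = [score_fn(sample, rng) for sample in samples]
        per_seed_means.append(statistics.fmean(scores))
    # A standard deviation needs at least two repetitions.
    spread = statistics.stdev(per_seed_means) if len(per_seed_means) > 1 else 0.0
    return {
        "per_seed_means": per_seed_means,
        "mean": statistics.fmean(per_seed_means),
        "stdev": spread,
        "total_generations": len(samples) * len(seeds),
    }


def toy_score(sample: int, rng: random.Random) -> float:
    # Hypothetical stand-in for a sampled generation plus a pass/fail grader.
    return 1.0 if rng.random() < 0.7 else 0.0


result = evaluate_with_repeats(list(range(100)), toy_score, seeds=[0, 1, 2])
print(result["total_generations"])  # 300
print(result["mean"], result["stdev"])
```

Reporting the spread across seeds alongside the mean is what would let a client judge whether a single eval run is representative.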

Checks

  • I've searched the docs for a solution
  • I've searched for existing GitHub issues

Describe the solution you'd like
A pretty UI where the eval client is greeted by a "Greedy" vs. "Sampling" toggle. The Greedy option explains that every sample will produce the same generation on every run; the Sampling option explains that each run can produce a different generation. Sampling then has sub-controls, maybe topP and Temperature to start, auto-populated with whatever defaults the service (Fireworks, for example) currently deploys.
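The settings behind such a toggle could be modeled roughly as below. This is only a sketch under assumptions: `GenerationParams`, `greedy`, and `sampling` are hypothetical names, treating temperature 0 as greedy decoding is a common convention rather than a confirmed behavior of this service, and the `service_defaults` values are placeholders, not actual Fireworks defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GenerationParams:
    temperature: float
    top_p: float
    seed: Optional[int] = None

    @property
    def is_greedy(self) -> bool:
        # Assumption: temperature 0 means deterministic (greedy) decoding.
        return self.temperature == 0.0


def greedy() -> GenerationParams:
    """The 'Greedy' toggle: the same generation on every run."""
    return GenerationParams(temperature=0.0, top_p=1.0)


def sampling(
    service_defaults: GenerationParams,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    seed: Optional[int] = None,
) -> GenerationParams:
    """The 'Sampling' toggle: start from the service defaults, allow overrides."""
    params = GenerationParams(
        temperature=service_defaults.temperature if temperature is None else temperature,
        top_p=service_defaults.top_p if top_p is None else top_p,
        seed=seed,
    )
    if params.temperature <= 0.0:
        raise ValueError("sampling requires temperature > 0")
    if not 0.0 < params.top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    return params


# Placeholder defaults for illustration only, not the real service values.
service_defaults = GenerationParams(temperature=1.0, top_p=1.0)
runs = [sampling(service_defaults, seed=s) for s in (0, 1, 2)]
print(greedy().is_greedy, [r.seed for r in runs])
```

Pre-filling from `service_defaults` keeps the eval aligned with what the deployed model actually does unless the client deliberately overrides it.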

Describe alternatives you've considered
At the very least, a Greedy vs. not-Greedy switch. Clients may not know that they have variance at inference time, and they need to know whether their evals are representative.

Metadata

Labels: enhancement (New feature or request)
