Request for Script to Reproduce Experiments in Figure 2 #75

@weizhepei

Thank you for the excellent work! I was wondering if you could share more implementation details or the script used for running the experiments in Figure 2. If I understand correctly, the model weights are stored in the GPU’s global memory in different precisions (e.g., 16-bit/8-bit/4-bit) and then moved to the on-chip cache for computation in 16-bit precision.
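
To make sure my mental model is right, here is a rough sketch (my own code, not from this repo) of the W8A16 pattern I have in mind: int8 weights plus a per-channel scale live in global memory and are dequantized to fp16 before the matmul, so the arithmetic itself always runs in 16-bit:

```python
import torch

# My own sketch of the W8A16 pattern (hypothetical names, not this repo's API):
# int8 weights + a per-output-channel fp16 scale are stored in global memory
# and dequantized to fp16 so the matmul always runs in 16-bit precision.
def quantize_per_channel(w_fp16):
    # Symmetric per-output-channel int8 quantization.
    scale = w_fp16.abs().amax(dim=1, keepdim=True) / 127.0
    w_int8 = torch.clamp(torch.round(w_fp16 / scale), -128, 127).to(torch.int8)
    return w_int8, scale.to(torch.float16)

def w8a16_linear(x_fp16, w_int8, scale):
    w_deq = w_int8.to(torch.float16) * scale  # dequantize to 16-bit
    return x_fp16 @ w_deq.t()                 # arithmetic stays in fp16

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
x = torch.randn(1, 4096, dtype=torch.float16, device="cuda")
w_int8, scale = quantize_per_channel(w)
y = w8a16_linear(x, w_int8, scale)
```

I realize this naive version materializes the full dequantized fp16 copy back in global memory, so it would not show any bandwidth savings; my understanding is that real kernels fuse the dequantization into the matmul so that only the int8/int4 weights are ever read from HBM.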

For the 8-bit/4-bit setups, the weights would then need to be converted back to 16-bit before computation, so that the arithmetic operations are identical across setups. Is the only difference between the 16-bit, 8-bit, and 4-bit setups, then, the time taken to move the weights from global memory to on-chip memory?
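
If that reading is right, a measurement script might look roughly like the sketch below (again my own guess, not your script). Plain PyTorch can only capture the fp16 baseline and a raw-bandwidth lower bound for int8, since real 8-bit/4-bit timings need a fused dequantize-GEMV kernel (e.g. the ones in AWQ/exllama/bitsandbytes):

```python
import torch

def bench(fn, iters=200, warmup=20):
    # Average GPU-side latency in ms, measured with CUDA events.
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

n = 8192
x = torch.randn(1, n, dtype=torch.float16, device="cuda")
w16 = torch.randn(n, n, dtype=torch.float16, device="cuda")

# fp16 baseline: batch-1 GEMV latency is dominated by streaming w16 from HBM.
t16 = bench(lambda: x @ w16.t())
gb = w16.numel() * w16.element_size() / 1e9
print(f"fp16 GEMV: {t16:.3f} ms (~{gb / (t16 / 1e3):.0f} GB/s effective)")

# Raw-bandwidth lower bound for the 8-bit setup: force a full read of an
# int8 tensor with half the bytes (the reduction forces every byte to move).
w8 = torch.randint(-128, 128, (n, n), dtype=torch.int8, device="cuda")
t8 = bench(lambda: torch.sum(w8, dtype=torch.int32))
print(f"int8 weight read: {t8:.3f} ms")
```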

By the way, I'm also wondering whether you have tried these experiments on more advanced GPUs such as the H100/H200, where memory bandwidth is much higher. Would memory bandwidth still be the latency bottleneck for LLM inference in that case? Can you share some insights on this?
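
For reference, my own back-of-the-envelope roofline math, using approximate public H100 SXM specs (my numbers, please correct me if they are off), suggests decoding would stay memory-bound even there:

```python
# Approximate H100 SXM specs (my assumptions, not from the paper):
bw = 3.35e12   # HBM3 bandwidth, bytes/s
peak = 990e12  # dense BF16 Tensor Core throughput, FLOP/s

# Arithmetic intensity needed to become compute-bound (the ridge point):
ridge = peak / bw
print(f"ridge point: {ridge:.0f} FLOP/byte")  # prints ~296

# Batch-1 fp16 decoding does ~2 FLOPs (one multiply-add) per 2-byte weight,
# i.e. ~1 FLOP/byte, two orders of magnitude below the ridge point. Higher
# bandwidth shrinks the latency, but the bottleneck only shifts to compute
# once the batch size (weight reuse per byte read) grows into the hundreds.
```

Thank you!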
