Description
Thank you for the excellent work! I was wondering if you could share more implementation details or the script used for running the experiments in Figure 2. If I understand correctly, the model weights are stored in the GPU’s global memory in different precisions (e.g., 16-bit/8-bit/4-bit) and then moved to the on-chip cache for computation in 16-bit precision.
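To make sure I'm reading the setup correctly, here is a minimal sketch of what I have in mind (my own assumption, not the authors' actual script): weights live in GPU global memory at reduced precision and are dequantized to fp16 just before the matmul, so the arithmetic itself is always 16-bit. The packing layout for the 4-bit case is just an illustrative guess.

```python
import torch

def dequant_matmul(x_fp16: torch.Tensor, w_q: torch.Tensor,
                   scale: torch.Tensor, bits: int) -> torch.Tensor:
    """Dequantize stored weights to fp16, then run the matmul in fp16."""
    if bits == 16:
        w_fp16 = w_q                                    # already fp16
    elif bits == 8:
        w_fp16 = w_q.to(torch.float16) * scale          # int8 -> fp16
    elif bits == 4:
        # Assumes two int4 values packed per uint8 byte, low nibble first.
        lo = (w_q & 0x0F).to(torch.int8) - 8
        hi = (w_q >> 4).to(torch.int8) - 8
        w_fp16 = torch.stack([lo, hi], dim=-1).flatten(-2).to(torch.float16) * scale
    else:
        raise ValueError(f"unsupported bit width: {bits}")
    return x_fp16 @ w_fp16.t()                          # compute stays in fp16
```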
For the 8-bit/4-bit setups, the weights would need to be converted to 16-bit for computation to ensure a fair comparison of arithmetic operations across the different setups. So is the only difference between the 16-bit, 8-bit, and 4-bit setups the time taken to move the weights from global memory to on-chip memory?
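If that reading is right, a purely bandwidth-bound model would predict roughly a 1x / 0.5x / 0.25x latency ratio for 16-/8-/4-bit weights. Here is a back-of-envelope check; the model size and bandwidth numbers are illustrative assumptions, not measurements from the paper.

```python
# Assumed 7B-parameter model streamed once per decoded token,
# on an A100-class GPU with roughly 2 TB/s of HBM bandwidth.
params = 7e9
bandwidth_bytes_per_s = 2.0e12

for bits in (16, 8, 4):
    weight_bytes = params * bits / 8
    t_mem = weight_bytes / bandwidth_bytes_per_s   # time to move the weights only
    print(f"{bits:2d}-bit weights: ~{t_mem * 1e3:.1f} ms per token (memory traffic only)")
```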
By the way, I'm wondering whether you have tried these experiments on more advanced GPUs such as the H100/H200, where the memory bandwidth is much higher. In that case, will memory bandwidth still be the latency bottleneck for LLM inference? Could you share some insights on this? Thank you!
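My own rough roofline estimate (using approximate published specs, which should be treated as assumptions and checked against the vendor datasheets for the exact SKU) suggests batch-1 decode stays memory-bound even on H100, since peak compute grows at least as fast as bandwidth, but I'd appreciate your view.

```python
# Approximate dense fp16 tensor-core peak and HBM bandwidth (assumed values).
gpus = {
    "A100 SXM": {"peak_fp16_tflops": 312, "hbm_tb_s": 2.0},
    "H100 SXM": {"peak_fp16_tflops": 990, "hbm_tb_s": 3.35},
}

# Batch-1 decode is essentially a GEMV: ~2 FLOPs per fp16 weight element read,
# i.e. about 1 FLOP per byte of weight traffic.
gemv_intensity = 1.0  # FLOPs per byte

for name, spec in gpus.items():
    ridge = spec["peak_fp16_tflops"] / spec["hbm_tb_s"]   # FLOPs/byte at the roofline knee
    regime = "memory-bound" if gemv_intensity < ridge else "compute-bound"
    print(f"{name}: ridge ~{ridge:.0f} FLOPs/byte; GEMV at ~{gemv_intensity} FLOPs/byte is {regime}")
```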