Hi everyone, I wanted to ask about the random pruning function used in DARE, as implemented in the PEFT library.
The file I'm referring to on GitHub is `src/peft/utils/merge_utils.py`:
```python
def random_pruning(tensor: torch.Tensor, density: float, rescale: bool) -> torch.Tensor:
    """
    Prune random values based on the specified fraction `density`.

    Args:
        tensor (`torch.Tensor`): The tensor to prune.
        density (`float`): The fraction of values to preserve. Should be in [0,1].
        rescale (`bool`): Whether to rescale the result to preserve the expected value of the original tensor.

    Returns:
        `torch.Tensor`: The pruned tensor.
    """
    mask = torch.bernoulli(torch.full_like(input=tensor, fill_value=density))
    pruned_tensor = tensor * mask
    if rescale:
        torch.div(input=pruned_tensor, other=density)
    return pruned_tensor
```
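For context, my reading of the PEFT convention is that `density` is the fraction of entries to keep, so a 1 in the Bernoulli mask marks a surviving value. A quick sanity check of that reading (my own snippet, not code from the library):

```python
import torch

tensor = torch.randn(1000, 1000)
density = 0.4

# With p = density, roughly a `density` fraction of the mask entries are 1 (kept).
mask = torch.bernoulli(torch.full_like(input=tensor, fill_value=density))
print(mask.mean().item())  # expected to be close to 0.4
```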
In comparison, the author of the original DARE paper (GitHub: yule-BUAA, repo: MergeLM, file: model_merging_methods/mask_weights_utils.py) uses this function instead:
```python
def mask_input_with_mask_rate(input_tensor: torch.Tensor, mask_rate: float, use_rescale: bool, mask_strategy: str):
    """
    mask the input with mask rate
    :param input_tensor: Tensor, input tensor
    :param mask_rate: float, mask rate
    :param use_rescale: boolean, whether to rescale the input by 1 / (1 - mask_rate)
    :param mask_strategy: str, mask strategy, can be "random" and "magnitude"
    :return:
    """
    assert 0.0 <= mask_rate <= 1.0, f"wrong range of mask_rate {mask_rate}, should be [0.0, 1.0]!"
    if mask_strategy == "random":
        mask = torch.bernoulli(torch.full_like(input=input_tensor, fill_value=mask_rate)).to(input_tensor.device)
        masked_input_tensor = input_tensor * (1 - mask)
    else:
        assert mask_strategy == "magnitude", f"wrong setting for mask_strategy {mask_strategy}!"
        original_shape = input_tensor.shape
        input_tensor = input_tensor.flatten()
        num_mask_params = int(len(input_tensor) * mask_rate)
        # Tensor, shape (1, ), find the num_mask_params-th smallest magnitude element of all the parameters in the model
        kth_values, _ = input_tensor.abs().kthvalue(k=num_mask_params, dim=0, keepdim=True)
        # Tensor, shape (num_total_params, ), where True is for parameters that we want to perform mask
        mask = input_tensor.abs() <= kth_values
        masked_input_tensor = input_tensor * (~mask)
        masked_input_tensor = masked_input_tensor.reshape(original_shape)
    if use_rescale and mask_rate != 1.0:
        masked_input_tensor = torch.div(input=masked_input_tensor, other=1 - mask_rate)
    return masked_input_tensor
```
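For comparison, this is how I understand the paper's function would be called for the random strategy: `mask_rate` is the fraction of values to drop, and the survivors are rescaled by `1 / (1 - mask_rate)`. A minimal usage example of my own, based on the function quoted above:

```python
import torch

tensor = torch.randn(4, 4)

# Drop roughly 60% of the entries at random and rescale the kept ones by 1 / (1 - 0.6).
pruned = mask_input_with_mask_rate(
    tensor, mask_rate=0.6, use_rescale=True, mask_strategy="random"
)
print(pruned)
```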
I noticed two possible issues with the PEFT implementation:

- The mask is not inverted in the PEFT version. In the original paper implementation, the random mask contains 1 for elements to drop and 0 for elements to keep; the inversion `(1 - mask)` ensures that only the selected weights are zeroed out. In PEFT's version, the mask is applied directly (i.e., 1 = kept, 0 = dropped), which may be logically inverted compared to the original DARE implementation.
- Incorrect scaling factor during rescaling. In the PEFT version, when `rescale=True`, the pruned tensor is divided by `density`. In the paper, however, the rescaling divides by `1 - mask_rate`, i.e., the proportion of weights retained after masking. This can be seen clearly in the paper implementation: `masked_input_tensor = torch.div(input=masked_input_tensor, other=1 - mask_rate)`. A sketch of the behavior I would expect follows this list.
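To make that expectation concrete, here is a minimal sketch of what I would expect the random branch to do, written in the paper's `mask_rate` convention. This is a hypothetical helper of my own, not the PEFT API:

```python
import torch


def dare_random_drop(tensor: torch.Tensor, mask_rate: float, rescale: bool = True) -> torch.Tensor:
    # Hypothetical sketch following the paper's convention:
    # `mask_rate` is the fraction of entries to DROP, so a 1 in the Bernoulli
    # draw marks an entry to zero out, and the survivors are rescaled by 1 / (1 - mask_rate).
    drop_mask = torch.bernoulli(torch.full_like(input=tensor, fill_value=mask_rate))
    pruned = tensor * (1 - drop_mask)
    if rescale and mask_rate != 1.0:
        pruned = pruned / (1 - mask_rate)
    return pruned
```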
I wrote a simple test to compare the two methods:
```python
import torch


def dare_paper_pruning(
    tensor: torch.Tensor, mask_rate: float, mask: torch.Tensor
) -> torch.Tensor:
    # Invert the mask: 1 marks elements to drop, 0 marks elements to keep.
    masked_input = tensor * (1 - mask)
    masked_input = torch.div(input=masked_input, other=(1 - mask_rate))
    return masked_input


def hf_random_pruning(
    tensor: torch.Tensor, mask_rate: float, mask: torch.Tensor
) -> torch.Tensor:
    # Apply the mask directly and rescale by the mask rate, mirroring PEFT's random_pruning.
    masked_input = tensor * mask
    masked_input = torch.div(input=masked_input, other=mask_rate)
    return masked_input


tensor = torch.randn(4, 4)
mask_rate = 0.6
p = torch.full_like(input=tensor, fill_value=mask_rate)
mask = torch.bernoulli(p)

print(dare_paper_pruning(tensor, mask_rate, mask))
print(hf_random_pruning(tensor, mask_rate, mask))
```
The outputs are not the same, which confirms that the two functions behave differently.
My question is: am I misunderstanding something, or are these two genuine bugs in the PEFT version of the `random_pruning` function?

I'd be happy to submit a fix or PR if needed. Thanks!