Hi everyone, I wanted to ask about the PEFT library's implementation of the random pruning function used by DARE.

The file I’m referring to on GitHub is:

src/peft/utils/merge_utils.py

def random_pruning(tensor: torch.Tensor, density: float, rescale: bool) -> torch.Tensor:
    """
    Prune random values based on the specified fraction `density`.

    Args:
        tensor (`torch.Tensor`): The tensor to prune.
        density (`float`): The fraction of values to preserve. Should be in [0,1].
        rescale (`bool`): Whether to rescale the result to preserve the expected value of the original tensor.

    Returns:
        `torch.Tensor`: The pruned tensor.
    """
    mask = torch.bernoulli(torch.full_like(input=tensor, fill_value=density))
    pruned_tensor = tensor * mask
    if rescale:
        torch.div(input=pruned_tensor, other=density)
    return pruned_tensor
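
For reference, here is a quick check of what this function does (my own snippet; it assumes the function can be imported from the module path above). Per the docstring, density is the fraction of values preserved, so with density=0.3 roughly 30% of the entries survive:

import torch
from peft.utils.merge_utils import random_pruning  # module path as referenced above

torch.manual_seed(0)
x = torch.randn(100, 100)
out = random_pruning(x, density=0.3, rescale=False)
print((out != 0).float().mean())  # roughly 0.3: `density` is the fraction kept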

In comparison, the author of the original DARE paper (GitHub: yule-BUAA, repo: MergeLM, file: model_merging_methods/mask_weights_utils.py) uses this function instead:

def mask_input_with_mask_rate(input_tensor: torch.Tensor, mask_rate: float, use_rescale: bool, mask_strategy: str):
    """
    mask the input with mask rate
    :param input_tensor: Tensor, input tensor
    :param mask_rate: float, mask rate
    :param use_rescale: boolean, whether to rescale the input by 1 / (1 - mask_rate)
    :param mask_strategy: str, mask strategy, can be "random" and "magnitude"
    :return:
    """
    assert 0.0 <= mask_rate <= 1.0, f"wrong range of mask_rate {mask_rate}, should be [0.0, 1.0]!"
    if mask_strategy == "random":
        mask = torch.bernoulli(torch.full_like(input=input_tensor, fill_value=mask_rate)).to(input_tensor.device)
        masked_input_tensor = input_tensor * (1 - mask)
    else:
        assert mask_strategy == "magnitude", f"wrong setting for mask_strategy {mask_strategy}!"
        original_shape = input_tensor.shape
        input_tensor = input_tensor.flatten()
        num_mask_params = int(len(input_tensor) * mask_rate)
        # Tensor, shape (1, ), find the num_mask_params-th smallest magnitude element of all the parameters in the model
        kth_values, _ = input_tensor.abs().kthvalue(k=num_mask_params, dim=0, keepdim=True)
        # Tensor, shape (num_total_params, ), where True is for parameters that we want to perform mask
        mask = input_tensor.abs() <= kth_values
        masked_input_tensor = input_tensor * (~mask)
        masked_input_tensor = masked_input_tensor.reshape(original_shape)
    if use_rescale and mask_rate != 1.0:
        masked_input_tensor = torch.div(input=masked_input_tensor, other=1 - mask_rate)
    return masked_input_tensor
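
For concreteness, here is a quick toy run of the paper's random branch (my own snippet; it assumes the function above is defined in scope). With mask_rate=0.75, roughly three quarters of the entries are zeroed out and each survivor is rescaled by 1 / (1 - 0.75) = 4:

import torch

torch.manual_seed(0)
x = torch.ones(3, 4)
out = mask_input_with_mask_rate(x, mask_rate=0.75, use_rescale=True,
                                mask_strategy="random")
print(out)                        # surviving entries equal 1 / (1 - 0.75) = 4
print((out != 0).float().mean())  # roughly 0.25 of the entries remain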

I noticed two possible issues with the PEFT implementation:

  1. The mask is not inverted in the PEFT version.

    In the original paper implementation, the random mask contains 1 for elements to drop and 0 for elements to keep; the inversion (1 - mask) ensures that only the selected weights are zeroed out.

    In PEFT's version, the mask is used directly (i.e., 1 marks kept values and 0 marks dropped ones), which may be logically inverted compared to the original DARE implementation.

  2. Incorrect scaling factor during rescaling.

    In the PEFT version, if rescale=True, the pruned tensor is divided by density. In the paper, however, the rescaling divides by 1 - mask_rate, i.e., the fraction of weights retained after masking. This can be seen in the paper implementation: masked_input_tensor = torch.div(input=masked_input_tensor, other=1 - mask_rate). (A small standalone sketch right after this list illustrates both points.)
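
To make both points concrete, here is a small Monte Carlo check I wrote (my own sketch, not code from either repo). For a Bernoulli mask, the identity E[x * m / p] = x holds only when p is the probability that an element is kept, so whichever mask convention is used, the rescaling divisor has to be the keep probability:

import torch

torch.manual_seed(0)
x = torch.randn(4, 4)
mask_rate = 0.6            # paper convention: fraction of weights to drop
keep_prob = 1 - mask_rate  # fraction kept (what PEFT's docstring calls `density`)

n_trials = 20_000
acc = torch.zeros_like(x)
for _ in range(n_trials):
    # paper-style mask: 1 marks elements to drop, hence the (1 - mask) inversion
    mask = torch.bernoulli(torch.full_like(x, mask_rate))
    acc += x * (1 - mask) / keep_prob

# the running mean converges to the original tensor, so dividing by the
# keep probability preserves the expected value
print((acc / n_trials - x).abs().max())  # close to 0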

I wrote a simple test to compare the two methods:

import torch


def dare_paper_pruning(
    tensor: torch.Tensor, mask_rate: float, mask: torch.Tensor
) -> torch.Tensor:
    # invert the mask: 1 marks elements to drop, 0 marks elements to keep
    masked_input = tensor * (1 - mask)
    # rescale by the fraction of retained weights, 1 - mask_rate
    masked_input = torch.div(input=masked_input, other=(1 - mask_rate))
    return masked_input


def hf_random_pruning(
    tensor: torch.Tensor, mask_rate: float, mask: torch.Tensor
) -> torch.Tensor:
    # use the mask directly, as PEFT's random_pruning does
    masked_input = tensor * mask
    # rescale by the given rate, mirroring PEFT's division by `density`
    masked_input = torch.div(input=masked_input, other=mask_rate)
    return masked_input


torch.manual_seed(0)  # fixed seed so the comparison is reproducible
tensor = torch.randn(4, 4)
mask_rate = 0.6
p = torch.full_like(input=tensor, fill_value=mask_rate)
mask = torch.bernoulli(p)
print(dare_paper_pruning(tensor, mask_rate, mask))
print(hf_random_pruning(tensor, mask_rate, mask))

The outputs are not the same: for an identical mask and rate, different elements survive and the survivors are scaled by different factors, so the two functions clearly behave differently.

My question is:
Am I misunderstanding something? Or are these two bugs in the PEFT version of the random_pruning function?

I’d be happy to help submit a fix or PR if needed. Thanks!
