LogSoftmaxBackward0 returns NaN during training on Kaggle versioned runs (Granite 2B 4bit) #3002

@bx0-0

Bug Description

I’m using Unsloth to fine-tune the model unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit in a Kaggle notebook.

When the notebook is run interactively, training works fine.

However, once I use "Save & Run All" to create a versioned Kaggle notebook, training crashes after a few hundred steps with this error:

RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.

The error is raised during the backward pass (run_backward), around step ~250 of the first epoch.
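
For context, that message is the format PyTorch's autograd anomaly detection raises when it traps the first backward op that returns NaN. A minimal sketch of enabling it explicitly (plain torch API, nothing Unsloth-specific) so the traceback also points at the forward op that produced the NaN:

import torch

# Make backward raise on the first op that returns NaN and print the traceback
# of the corresponding forward op (adds overhead; for debugging only).
torch.autograd.set_detect_anomaly(True)

# ... then run the training call below as usual.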


💻 Environment

  • Model: unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit
  • Platform: Kaggle Notebook (T4 GPU)
  • Torch version: 2.3.0+cu124
  • CUDA: 12.4
  • Unsloth version: latest (via pip install unsloth)
  • Transformers version: latest
  • TRL: latest
  • Python: 3.11
  • Training mode: 4-bit quantized + fp16/bf16

🧪 Code Used

✅ Model & Tokenizer Loading

from unsloth import FastLanguageModel
import torch

torch._dynamo.config.disable = True  # Prevent recompilation issues

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = True,
)
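
Note that the snippet stops after loading; the LoRA/PEFT step isn't shown here. For completeness, a typical Unsloth flow attaches adapters before building the trainer; a sketch with the usual documented defaults (not part of the original snippet, values are placeholders):

# Hypothetical adapter setup (standard FastLanguageModel.get_peft_model call;
# r / lora_alpha / target_modules below are placeholder defaults).
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)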

Trainer Setup:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model       = model,
    tokenizer   = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 4096,
    dataset_num_proc = 2,
    packing = False,

    args = TrainingArguments(
        per_device_train_batch_size  = 4,
        gradient_accumulation_steps  = 8,
        num_train_epochs             = 3,
        learning_rate                = 1e-5,
        lr_scheduler_type            = "linear",
        warmup_ratio                 = 0.1,
        max_grad_norm                = 1.0,
        optim                        = "adamw_8bit",
        fp16                         = not is_bfloat16_supported(),
        bf16                         = is_bfloat16_supported(),
        weight_decay                 = 0.01,
        seed                         = 3407,
        output_dir                   = "granite2b_spam_classifier",
        logging_steps                = 10,
        save_strategy                = "epoch",
        report_to                    = "none",
    ),
)
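
Since the fp16/bf16 choice is made at runtime by is_bfloat16_supported(), and fp16 is generally more prone to overflow than bf16, it is worth confirming which branch the Kaggle T4 actually takes. A quick check, using only imports already present above:

import torch
from unsloth import is_bfloat16_supported

# Print the GPU compute capability and the precision branch TrainingArguments will take.
print("device capability:", torch.cuda.get_device_capability())
print("is_bfloat16_supported():", is_bfloat16_supported())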

Training Call:

trainer_stats = trainer.train()
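
To catch the divergence before the backward pass blows up, a small callback that stops training as soon as a logged loss is non-finite could help. This is a generic transformers TrainerCallback sketch (a hypothetical helper, not an Unsloth API):

import math
from transformers import TrainerCallback

class StopOnNonFiniteLoss(TrainerCallback):
    # Stop training the moment a logged loss becomes NaN or inf.
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss {loss} at step {state.global_step}")
            control.should_training_stop = True
        return control

trainer.add_callback(StopOnNonFiniteLoss())  # then call trainer.train() as above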

Error Details:
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
File "/usr/local/lib/python3.11/dist-packages/torch/autograd/init.py", line 823, in backward
File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
File "/usr/local/lib/python3.11/dist-packages/unsloth/models/_utils.py", line 1129, in _unsloth_pre_compute_loss
File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3836, in compute_loss
File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _call_impl
...

Additional Notes:
  • Training works normally when the notebook is run interactively.
  • The crash only happens when the notebook is versioned ("Save Version" / "Save & Run All" on Kaggle).
  • I have already set torch._dynamo.config.disable = True to prevent recompilation errors.
  • The error seems specific to headless (non-interactive) Kaggle runs.
