Bug Description
I'm using Unsloth to fine-tune unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit in a Kaggle notebook.
When running the notebook interactively, training works fine.
However, once I use "Save & Run All" to create a versioned Kaggle notebook, training crashes after a few hundred steps with this error:

```
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
```

The error happens during the backward pass (`run_backward`) around step ~250 of the first epoch.
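As far as I know, this error format comes from PyTorch's autograd anomaly detection. In case it helps anyone reproduce or narrow this down, here is a minimal diagnostic sketch using only the standard `torch.autograd.set_detect_anomaly` API (the `trainer` it calls is the one defined in the setup below):

```python
import torch

# Standard PyTorch API: makes the backward pass raise at the first op
# that produces NaNs, with a forward-pass trace attached. Anomaly
# detection slows training considerably, so I'd only enable it for a
# short run around the failing step (~250 here).
torch.autograd.set_detect_anomaly(True)

trainer_stats = trainer.train()  # `trainer` as defined in the setup below
```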
💻 Environment
- Model: unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit
- Platform: Kaggle Notebook (T4 GPU)
- Torch version: 2.3.0+cu124
- CUDA: 12.4
- Unsloth version: latest (via `pip install unsloth`)
- Transformers version: latest
- TRL version: latest
- Python: 3.11
- Training mode: 4-bit quantized + fp16/bf16
🧪 Code Used
✅ Model & Tokenizer Loading

```python
from unsloth import FastLanguageModel
import torch

torch._dynamo.config.disable = True  # Prevent recompilation issues

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/granite-3.2-2b-instruct-unsloth-bnb-4bit",
    max_seq_length = 4096,
    dtype = None,
    load_in_4bit = True,
)
```
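One thing worth noting: the T4 has no bf16 support, so `is_bfloat16_supported()` should be False there and training falls back to fp16, which is more prone to overflow/NaNs. A quick check (standard torch APIs only) to confirm which dtype the versioned run actually picks:

```python
import torch

# T4 is compute capability 7.5, which lacks bf16, so Unsloth's
# dtype=None auto-detection should fall back to fp16 on this GPU.
print(torch.cuda.get_device_name(0))        # e.g. "Tesla T4"
print(torch.cuda.get_device_capability(0))  # (7, 5) on a T4
print(torch.cuda.is_bf16_supported())       # False on a T4 -> fp16 path
```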
Trainer Setup:

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 4096,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 8,
        num_train_epochs = 3,
        learning_rate = 1e-5,
        lr_scheduler_type = "linear",
        warmup_ratio = 0.1,
        max_grad_norm = 1.0,
        optim = "adamw_8bit",
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        weight_decay = 0.01,
        seed = 3407,
        output_dir = "granite2b_spam_classifier",
        logging_steps = 10,
        save_strategy = "epoch",
        report_to = "none",
    ),
)
```
Training Call:

```python
trainer_stats = trainer.train()
```
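As a stopgap while debugging, a small callback can stop training cleanly at the first non-finite loss instead of crashing deep in the backward pass. This is only a sketch: `NanStopCallback` is a name I made up, but the hooks it uses (`TrainerCallback.on_log`, `control.should_training_stop`, `trainer.add_callback`) are standard transformers APIs:

```python
import math
from transformers import TrainerCallback

class NanStopCallback(TrainerCallback):
    """Stop training gracefully as soon as a logged loss is NaN/inf."""
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss {loss} at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control

trainer.add_callback(NanStopCallback())  # register before trainer.train()
```

With `logging_steps = 10`, this catches a NaN loss within at most 10 steps of it first appearing.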
Error Details:

```
RuntimeError: Function 'LogSoftmaxBackward0' returned nan values in its 0th output.
  File "/usr/local/lib/python3.11/dist-packages/torch/autograd/__init__.py", line 823, in backward
  File "/usr/local/lib/python3.11/dist-packages/torch/_dynamo/eval_frame.py", line 574, in _fn
  File "/usr/local/lib/python3.11/dist-packages/unsloth/models/_utils.py", line 1129, in _unsloth_pre_compute_loss
  File "/usr/local/lib/python3.11/dist-packages/transformers/trainer.py", line 3836, in compute_loss
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1739, in _call_impl
  ...
```
Additional Notes:
- Training works normally when the notebook is run interactively.
- The crash only happens when the notebook is versioned (i.e., "Save & Run All" / "Save Version" on Kaggle).
- I already set `torch._dynamo.config.disable = True` to prevent recompilation errors.
- The error seems specific to headless (non-interactive) Kaggle runs; see the fingerprint sketch below.
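Since the only variable is interactive vs. versioned execution, it may help to print the same runtime fingerprint in both modes and diff the output. A minimal sketch using only standard introspection; if I remember right, Kaggle sets the `KAGGLE_KERNEL_RUN_TYPE` environment variable to distinguish interactive from batch sessions, but that variable name is from memory:

```python
import os, sys, torch, transformers

# Print the same fingerprint in the interactive and the versioned run,
# then diff the two outputs to spot what differs in the headless session.
print("python:", sys.version)
print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("transformers:", transformers.__version__)
print("gpu:", torch.cuda.get_device_name(0))
# Kaggle-specific env var (name from memory): "Interactive" vs "Batch"
print("KAGGLE_KERNEL_RUN_TYPE:", os.environ.get("KAGGLE_KERNEL_RUN_TYPE"))
```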