backward(): Trying to backward through the graph a second time (...). #604
-
Hi! My name is Alex and I'm learning how to do backprop through FGO. My current task is to train a NN model to predict odometry measurements in an end-to-end manner. I have already implemented a simple example with 1D synthetic data. First I was getting an error at the mse_loss.backward() call: "Trying to backward through the graph a second time (...)".
So, I set retain_graph=True for my mse_loss.backward() call, and training started and converged.
But when I started adding batches, a new error popped up during the second epoch:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [30, 1]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
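For context, here is a tiny standalone snippet (nothing to do with my notebook, just an illustration of what PyTorch is complaining about): the error fires when a tensor that autograd saved for the backward pass is later modified in place, so its version counter no longer matches the saved one.

import torch

x = torch.ones(3, requires_grad=True)
y = x * 2
loss = (y * y).sum()  # autograd saves y (version 0) to compute the gradient of y * y
y.add_(1)             # in-place update bumps y's version counter to 1
loss.backward()       # RuntimeError: ... modified by an inplace operation; expected version 0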
When I was writing the code I was looking at examples/state_estimation_2d.py as a reference. The main difference I see is that instead of predicting factor weights I'm trying to predict the measurements.
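Concretely, the difference is roughly this (the names below are made up for illustration, not the actual ones from the example or my notebook); in both cases the NN output just goes into the theseus input dictionary under the corresponding variable name:

theseus_inputs["cost_weight"] = weight_model(features)           # example: NN predicts a cost weight
theseus_inputs[f"predicted_odometry_{i}"] = odometry_model(imu)  # my case: NN predicts the measurement itself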
I think I'm missing something in how to prepare the graph for autograd, or in how to call/update the inner/outer loops. I tried to visualize the graph with torchviz, but it didn't help.
I have already looked through the discussions and issues and haven't spotted anything related to the errors mentioned above.
It would be great if somebody could have a look at my code and maybe give a hint on how to fix training with batching.
Here is the version of the Jupyter notebook for the second error:
https://github.com/nosmokingsurfer/fgraph_diff/blob/master/theseus_tests/theseus_tum_vi/linear_motion_test.ipynb
Cheers,
Alex
-
Hi @nosmokingsurfer, sorry it took me so long to respond to this, I've been really busy with other deadlines. Do you have a smaller example that reproduces this error? Your notebook is a bit large and it will take me a long time to figure out what's going on.
-
Hi @nosmokingsurfer, the cause of the error that required you to add retain_graph=True is that you were not setting initial values for the optimization variables, which means that after the first loop the values from the previous optimization were used as initial values (thus retaining graph info). You can replace your initialization for loop with the following:
theseus_inputs = {}
for i in range(N):
    if i < N - 1:
        tmp = torch.zeros(B, 4)
        tmp[:, 2] = 1.0
        tmp[:, 0] = 0.5 * predicted_acc[:, i] ** 2 + predicted_acc[:, i]
        theseus_inputs[f"predicted_odometry_{i}"] = tmp
    # Using SE2(...).tensor converts the (x, y, theta) input to (x, y, cos, sin)
    theseus_inputs[f"pose_{i}"] = th.SE2(torch.zeros(B, 3)).tensor