-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Editorial: Do not use [[DFSAncestorIndex]] in InnerModuleLinking #3637
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
To detect strongly connected components in module graphs, we need to use a number that increases as we go further away from the graph root on any branch of the DFS tree. This is so that when a node has a child that (1) is still not finalized yet, and (2) has a lower number than the node itself, we know that the child is an ancestor of the node and thus we are in a strongly connected component. Strongly connected components roots are thus nodes from which it is not possible to reach a non-finalized node with a lower number. For this Tarjan's algorithm uses the discovery index of each node, but other indexes that have the same property are: - the depth of the node in the DFS tree - the index of the node in the stack that contains yet-to-be-finalized nodes. This commit updates the SCC discovery logic to use this last option, rather than carrying around the discovery index.
…ower values Instead, only update it at the end of InnerModuleLinking if the current module is a non-root node of a SCC. We do not need to update its value in the loop through a module's dependencies because, even if there was a loop leading back to the node currently being processed, the [[DFSAncestorIndex]] value propagated back across the loop would not cause the module's [[DFSAncestorIndex]] to decrease.
Rather than returning ~unused~ and then reading it from the [[DFSAncestorIndex]] slot. Also return the index for modules that are not participating in cycles (returning the module index itself), to avoid having a separate path in the InnerModuleLinking loop.
The lowest possible [[DFSAncestorIndex]] will already be propagated back to the SCC root through _one_ of the SCC branches. InnerModuleLinking does not need modules to have their actually lowest possible [[DFSAncestorIndex]] set to know that they are a non-root node of a SCC: it's enough to know that from that node it's possible to reach _one_ other node with a lower index, and that's signaled by the return value.
[[DFSAncestorIndex]] index is only set once per module during a given traversal process, and it represents the module's index in _stack_. Instead explicitly storing it, we can compute it given _stack_ when needed (i.e. when we process an edge that "closes" a loop).
First some background - strongly connected components affect execution in the module system in only subtle ways:
That's it! Otherwise strongly connected components are indeed relatively unobservable to end users. But correctly classifying strongly connected components is a fundamental invariant. Now for (1) we don't formally propagate errors through the entire strongly connected component, we actually only propagate errors up the current stack of a strongly connected component. Your simplification based on this observation does seem to be unobservable so far as (1) is concerned. But, as mentioned when we discused this, your change very much does break the Tarjan algorithm's detection in being able to identify the strongly connected components correctly per Tarjan's algorithm. Simple counter example where A [0, 0] Under your scheme you would not correctly classify the two separate strongly connected components { B, C } and { D, E }, and instead classify both B and D as having the same ancestor index 1 since you are incorrectly conflating the component root index with the stack index. That is you would transition all 4 modules together instead of handling them separately. This would also incorrectly set the cycle root in the evaluation code. The comparison between sccAncestorIndex and moduleIndex is also invalid as the stack indexing diverges from the module indexing for large parallel graphs. As a side note - the onus is on you here to prove your modification or Tarjan's algorithm handles the edge cases, not on me to point out the counter examples. |
I understand, that's why I split the change in small commits each of them trying to carefully explain why the change is correct 😅 Assuming that the example you mean is the following, which yields those [dfs index, dfs ancestor index] pairs and those two SCCs: ![]() This is what happens with the updated algorithm:
The updated algorithm correctly detects the three SCCs: BC, DE, and A. Note that, both before and after this PR, the SCC detection rule is not "a set of module with the same associated number is a SCC". Already before these changes, multiple modules in the same SCC can have different [[DFSAncestorIndex]] values. For example (interactive):
Even though D is in the same SCC as B/C/E. Instead, the rule to tell that a module X is not a SCC root is that from X's outgoing edges you can reach a node with index lower than X. Then, if a module is a SCC root (so if the lowest index you can reach from X is X's index), the contents of the SCC is the list of modules in the stack starting from X.
Yes, and that's fine. When you have large parallel graphs (that are not in the same SCC), one of them is going to be processed before the other. What happens in the first of them has no effect at all on what SCCs there are in the second one. We can forget that we actually already used some indexes for that first completed graph part, and we can reuse them. Edit: I'll publish a forked version of https://nicolo-ribaudo.github.io/es-module-evaluation with these changes. |
I have a more precise proof that using the stack depth is equivalent to using the discovery index. Consider a module graph that contains a module M, with the stack at the time M is first discovered (right after pushing M to it) being « s0, s1, ..., sn-1, M = sn ». For a given module X, the discovery index of X is D(X) and the depth of the stack right before pushing it is S(X). Next(X) is the module that is discovered right after X (that is, D(Next(X)) = D(X)+1). [*] There are two possible cases. One is the trivial one, when we are going down the left-most branch of the DFS tree, so D(s0) = 0, D(s1) = 1, ..., D(sn) = n. In this case obviously D(si)=S(si) for each i. If we are not in that situation, then there is a smallest j such that D(sj) ≠ S(sj) (that is, D(sj) ≠ j). j > 0, because s0 is the root module of the graph, which is pushed in the stack first and removed at the end of the whole process. Thus we have D(s0)=S(s0)=0, D(s1)=S(s1)=1, ..., D(sj-1)=S(sj-1)=j-1. Now, we know that sj ≠ Next(sj-1) (otherwise D(sj)=D(sj-1)+1=(j-1)+1=j, which violates the condition on which we chose j). We also know that sj is not reachable from Next(sj-1), otherwise Next(sj-1) would be on the stack between sj-1 and sj (which is impossible, since they are by definition consecutive in the stack). Let { O1...Ok } be the set of all k modules reachable from Next(sj-1), Next(sj-1) included, ordered such that D(O1) < ... < D(Ok).
O1...Ok are all discovered before sj, so
Given [1] and [2], Evaluation(s0) is equivalent to first doing Evaluation(O1) (which evaluates all and only the modules { O1...Ok }) followed then by Evaluation(s0). This has no effect on evaluation order, SCC detection, or on the modules in the stack when discovering a module not in the { O1...Ok } set. The effect of Evaluation(O1) is that when we will then do Evaluation(s0) all the modules { O1...Ok } will have already been evaluated, thus we skip them in the DFS tree. Let's call D'(...) the discovery indexes of this Evaluation(s0) that happens after Evaluation(O1). We have that for each module P:
Thus we have that:
Now, we repeat from [*]:
|
This is an attempt at removing [[DFSAncestorIndex]], similarly to how #3625 removed [[DFSIndex]].
When talking to @guybedford he mentioned that not following Tarjan's algorithm is likely a bug, however it's possible that we can tweak it in ways that are not observable for our needs. This PR tries to do so.
Even though they do not represent how I came up with this different approach, I tried to split the changes in 5 separate commits that can be reviewed one-by-one for correctness.
This changes the editorial aspects but does not simplify implementations: they'd still need to keep track of a per-traversal number (just now it's the stack length rather than the discovery length) and a per-module number (unless they prefer to do an
O(number of nodes in a SCC)
loop for each cycle detected while traversing the graph, to find an index in a list).Again, like for #3635, this change is not objectively good and I'm curious to se what editors think about it. If @guybedford cannot poke holes in my logic, I'll update this PR to do the same for InnerModuleEvaluation and then remove the [[DFSAncestorIndex]] field from cyclic module records: there would be no more "auxiliary DFS fields" at all.