Tags: pytorch/torchrec
Dynamic 2D sparse parallel (#3177)

Summary: Pull Request resolved: #3177

We add the ability to set the 2D parallel configuration per module (coined Dynamic 2D parallel). This means an EBC and an EC can be sharded differently on the data parallel dimension: for example, 4 replicas per EBC shard and 2 replicas per EC shard. The previous setup required all modules to share the same replication factor.

To do this, we introduce a lightweight dataclass that provides per-module configuration, allowing very granular control should the user require it:

```python
class DMPCollectionConfig:
    module: nn.Module  # this is expected to be the unsharded module
    plan: "ShardingPlan"  # sub-tree-specific sharding plan
    sharding_group_size: int
    use_inter_host_allreduce: bool = False
```

The dataclass provides the context for each module we are sharding and, if configured, creates separate process groups and sync logic for each of these modules.

Usage is as follows; suppose we want to use a different 2D configuration for EmbeddingCollection:

```python
# create a plan for the model tables over Y world size
# create a plan for the EmbeddingCollection tables over X world size
ec_config = DMPCollectionConfig(EmbeddingCollection, embedding_collection_plan, sharding_group_size)

model = DMPCollection(
    # pass in default args
    submodule_configs=[ec_config],
)
```

Future work includes:
- making it easier for users to create separate sharding plans per module
- per-table 2D

Reviewed By: liangbeixu

Differential Revision: D76774334

fbshipit-source-id: 27c7e0bc806d8227d784461a197cd8f1c7f6adfc
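To make the replication arithmetic above concrete, here is a minimal, self-contained sketch. It re-declares `DMPCollectionConfig` locally (so it does not depend on the final import path) and assumes the replication factor is `world_size // sharding_group_size`; the world size of 8 and the placeholder module/plan values are illustrative only, not taken from this diff.

```python
# Self-contained sketch (not TorchRec code): DMPCollectionConfig is
# re-declared from the summary above, and the replica counts assume
# replication factor = world_size // sharding_group_size.
from dataclasses import dataclass
from typing import Any, Type

import torch.nn as nn


@dataclass
class DMPCollectionConfig:
    module: Type[nn.Module]  # unsharded module type this config applies to
    plan: Any                # sub-tree-specific sharding plan ("ShardingPlan")
    sharding_group_size: int
    use_inter_host_allreduce: bool = False


WORLD_SIZE = 8  # illustrative world size, not from the diff

# EBC: each shard spans 2 ranks -> 8 // 2 = 4 replicas per shard
ebc_config = DMPCollectionConfig(nn.Module, plan=None, sharding_group_size=2)

# EC: each shard spans 4 ranks -> 8 // 4 = 2 replicas per shard
ec_config = DMPCollectionConfig(nn.Module, plan=None, sharding_group_size=4)

for name, cfg in (("EBC", ebc_config), ("EC", ec_config)):
    print(f"{name}: {WORLD_SIZE // cfg.sharding_group_size} replicas per shard")
```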
Fix test more generally (#3165)

Summary: Pull Request resolved: #3165

https://www.internalfb.com/diff/D77614983 attempted to fix a test, but I still see the same failure showing up in other tests, so this fixes it in general.

Reviewed By: huydhn

Differential Revision: D77758554

fbshipit-source-id: bd390081b68fa650f1cfd6d2a93a1fbf206aaff7
Revert D76476676: Multisect successfully blamed "D76476676: OSS TorchRec Internal MPZCH modules" for one test failure (#3146)

Summary: Pull Request resolved: #3146

This diff reverts D76476676 (OSS TorchRec Internal MPZCH modules, by lizhouyu), which causes the following test failure.

Tests affected:
- [cogwheel:cogwheel_tps_basic_test#test_tps_basic_latency](https://www.internalfb.com/intern/test/562950123898458/)

Here's the Multisect link: https://www.internalfb.com/multisect/30184635

Here are the tasks that are relevant to this breakage:
T211534727: 100+ tests, 10+ build rules failing for minimal_viable_ai

The backout may land if someone accepts it. If this diff has been generated in error, you can Commandeer and Abandon it.

Depends on D76476676

Reviewed By: lizhouyu

Differential Revision: D77502155

fbshipit-source-id: eb990251f3276372592c30a7361579e2a3639d6c
Fix Unit Test SkipIf Worldsize check (#3098)

Summary: Pull Request resolved: #3098

These unit tests actually require at least 4 GPUs due to their world size requirements. This updates the skipIf condition to match.

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: aliafzal

Differential Revision: D76621861

fbshipit-source-id: 09f9b04c4d3cb7b10736fbbaff3886a8534b96fa
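The summary does not show the decorator itself; the following is a minimal sketch of the kind of world-size guard being corrected, assuming a plain `unittest`-style test (TorchRec's own skip helpers may differ).

```python
import unittest

import torch


class ShardingTest(unittest.TestCase):
    # Skip unless the host has enough GPUs for the required world size.
    @unittest.skipIf(
        torch.cuda.device_count() < 4,
        "test requires a world size of 4, i.e. at least 4 GPUs",
    )
    def test_sharding_world_size_4(self) -> None:
        ...  # test body elided
```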
minor refactor of the GitHub workflow (#3062)

Summary: Pull Request resolved: #3062

# context
* refactor the matrix statement in the GitHub workflow
* for pull requests, only cu128 + py313 runs in the GPU CI, and only py39 and py313 run in the CPU CI

Reviewed By: aporialiao

Differential Revision: D76242338

fbshipit-source-id: b56ba4965842d89371d9a7baec858734bc306aaf
PMT (#3023)

Summary: Pull Request resolved: #3023

# context
* `_test_sharding` is a frequently used test function covering many TorchRec sharding test cases
* the multiprocess environment often introduces additional difficulty when debugging, especially for kernel-size issues (the multiprocess environment is not actually needed)
* this change makes the test run on the main process when `world_size == 1`, so that a simple `breakpoint()` just works

Reviewed By: iamzainhuda

Differential Revision: D74131796

fbshipit-source-id: ccc34ab589c0153cc0ce1187bba3df7dd63cbfc6
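A minimal sketch of the dispatch described above; the helper names here are hypothetical and not the actual TorchRec test utilities — the point is only that the single-rank path stays on the calling process so `breakpoint()` attaches directly.

```python
from typing import Any, Callable, Dict

import torch.multiprocessing as mp


def _entry(rank: int, test_fn: Callable[..., None], world_size: int, kwargs: Dict[str, Any]) -> None:
    # Shared entry point so both the in-process and spawned paths run the
    # same test body. test_fn must be a module-level (picklable) function.
    test_fn(rank=rank, world_size=world_size, **kwargs)


def run_sharding_test(test_fn: Callable[..., None], world_size: int, **kwargs: Any) -> None:
    """Hypothetical dispatcher illustrating the single-process debug path."""
    if world_size == 1:
        # Single rank: run directly on the main process so pdb/breakpoint()
        # works inside the test body.
        _entry(0, test_fn, world_size, kwargs)
    else:
        # Multiple ranks: spawn one process per rank, as before.
        mp.spawn(_entry, args=(test_fn, world_size, kwargs), nprocs=world_size)
```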