Tags: pytorch/torchrec

v2025.07.14.00

Dynamic 2D sparse parallel (#3177)

Summary:
Pull Request resolved: #3177

We add the ability to set the 2D parallel configuration per module (coined Dynamic 2D parallel). This means an EBC and an EC can be sharded differently along the data-parallel dimension: for example, 4 replicas per EBC shard and 2 replicas per EC shard. Previously, all modules were required to use the same replication factor.

To do this, we introduce a lightweight dataclass that provides per-module configuration, allowing very granular control should the user require it:
```python
class DMPCollectionConfig:
    module: nn.Module  # this is expected to be the unsharded module
    plan: "ShardingPlan"  # sub-tree-specific sharding plan
    sharding_group_size: int
    use_inter_host_allreduce: bool = False
```

The dataclass provides the context for each module we are sharding and, if configured, creates separate process groups and sync logic for each of those modules.

Usage is as follows. Suppose we want to use a different 2D configuration for EmbeddingCollection:
```python
# create plan for model tables over Y world size
# create plan for EmbeddingCollection tables over X world size
ec_config = DMPCollectionConfig(EmbeddingCollection, embedding_collection_plan, sharding_group_size)
model = DMPCollection(
  # pass in default args
  submodule_configs=[ec_config]
)
```
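
To make the replica counts from the example above concrete, here is a minimal sketch (the helper function is illustrative, not part of the TorchRec API) of how a module's replication factor follows from the overall world size and its `sharding_group_size`, assuming the replicas occupy the remaining data-parallel dimension:

```python
# Illustrative helper, not part of the TorchRec API: under 2D parallelism,
# ranks are split into sharding groups, and each shard is replicated across
# the remaining data-parallel dimension.
def replicas_per_shard(world_size: int, sharding_group_size: int) -> int:
    assert world_size % sharding_group_size == 0, "world size must be divisible"
    return world_size // sharding_group_size

# With 8 ranks total: a sharding group of 2 gives 4 replicas per shard
# (the EBC example above), while a sharding group of 4 gives 2 replicas
# per shard (the EC example above).
print(replicas_per_shard(8, sharding_group_size=2))  # 4
print(replicas_per_shard(8, sharding_group_size=4))  # 2
```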

Future work includes:
- making it easier for users to create separate sharding plans per module
- per-table 2D

Reviewed By: liangbeixu

Differential Revision: D76774334

fbshipit-source-id: 27c7e0bc806d8227d784461a197cd8f1c7f6adfc

v2025.07.07.00

Fix test more generally (#3165)

Summary:
Pull Request resolved: #3165

D77614983 (https://www.internalfb.com/diff/D77614983) attempted to fix a test, but the failure still shows up in other tests, so this fixes it more generally.

Reviewed By: huydhn

Differential Revision: D77758554

fbshipit-source-id: bd390081b68fa650f1cfd6d2a93a1fbf206aaff7

v2025.06.30.00

Revert D76476676: Multisect successfully blamed "D76476676: OSS TorchRec Internal MPZCH modules" for one test failure (#3146)

Summary:
Pull Request resolved: #3146

This diff reverts D76476676
D76476676: OSS TorchRec Internal MPZCH modules by lizhouyu causes the following test failure:

Tests affected:
- [cogwheel:cogwheel_tps_basic_test#test_tps_basic_latency](https://www.internalfb.com/intern/test/562950123898458/)

Here's the Multisect link:
https://www.internalfb.com/multisect/30184635
Here are the tasks that are relevant to this breakage:
T211534727: 100+ tests, 10+ build rules failing for minimal_viable_ai

The backout may land if someone accepts it.

If this diff has been generated in error, you can Commandeer and Abandon it.

Depends on D76476676

Reviewed By: lizhouyu

Differential Revision: D77502155

fbshipit-source-id: eb990251f3276372592c30a7361579e2a3639d6c

v2025.06.23.00

minor refactoring to use list comprehension (#3125)

Summary:
Pull Request resolved: #3125

# context
* imported from github [#1602](#1602)
* rebased on trunk
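
For readers without access to the linked PR, a generic before/after sketch of this kind of refactor (the variable names are illustrative, not taken from the actual change):

```python
tables = ["t0", "t1", "t2"]  # placeholder data for the example

# Before: build the list with an explicit loop and append.
names = []
for table in tables:
    names.append(table.upper())

# After: the same result expressed as a list comprehension.
names = [table.upper() for table in tables]
```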

Reviewed By: spmex

Differential Revision: D77031163

fbshipit-source-id: f65168f6ab0b74eca75b72fd60ec0ea7c762f3dc

v2025.06.16.00

Fix Unit Test SkipIf Worldsize check (#3098)

Summary:
Pull Request resolved: #3098

These unit tests actually require at least 4 GPUs due to their world-size requirements. Updating the skipIf condition to match.
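
A rough sketch of the corrected guard (the test class and method names are hypothetical; only the device-count threshold reflects the change described above):

```python
import unittest

import torch


class ShardingWorldSize4Test(unittest.TestCase):  # hypothetical test class
    # Skip unless at least 4 GPUs are available, matching the world size
    # these tests actually need.
    @unittest.skipIf(
        torch.cuda.device_count() < 4,
        "Requires a world size of 4, i.e. at least 4 GPUs",
    )
    def test_sharding(self) -> None:
        ...
```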

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: aliafzal

Differential Revision: D76621861

fbshipit-source-id: 09f9b04c4d3cb7b10736fbbaff3886a8534b96fa

v2025.06.09.00

minor refactor of github workflow (#3062)

Summary:
Pull Request resolved: #3062

# context
* refactor the matrix statement in the github workflow
* in case of pull-request:
  * only cu128+py313 would be running in gpu ci
  * only py39 and py313 would be running in cpu ci

Reviewed By: aporialiao

Differential Revision: D76242338

fbshipit-source-id: b56ba4965842d89371d9a7baec858734bc306aaf

v1.2.0

update cuda version in readme

v1.2.0-rc3

update tensordict version

v1.2.0-rc2

change version number

v2025.06.02.00

PMT (#3023)

Summary:
Pull Request resolved: #3023

# context
* `_test_sharding` is a frequently used test function covering many TorchRec sharding test cases
* the multiprocess env often introduces additional difficulties when debugging, especially for kernel-size issues (the multiprocess env is not actually needed)
* this change makes it run on the main process when `world_size == 1`, so that a simple `breakpoint()` just works (a sketch of the pattern follows below)
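
A minimal sketch of the pattern (names and signature are illustrative; the actual `_test_sharding` helper differs): run the callable inline when there is only one rank, otherwise spawn worker processes as before.

```python
import multiprocessing
from typing import Any, Callable


def run_sharding_test(  # illustrative wrapper, not the actual TorchRec helper
    fn: Callable[..., None], world_size: int, **kwargs: Any
) -> None:
    if world_size == 1:
        # Single-rank case: stay on the main process so breakpoint() and
        # other debuggers work without multiprocess plumbing.
        fn(rank=0, world_size=1, **kwargs)
        return
    ctx = multiprocessing.get_context("spawn")
    procs = []
    for rank in range(world_size):
        p = ctx.Process(
            target=fn,
            kwargs={"rank": rank, "world_size": world_size, **kwargs},
        )
        p.start()
        procs.append(p)
    for p in procs:
        p.join()
```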

Reviewed By: iamzainhuda

Differential Revision: D74131796

fbshipit-source-id: ccc34ab589c0153cc0ce1187bba3df7dd63cbfc6