Tags: pytorch/torchrec
Dynamic 2D sparse parallel (#3177)

Summary: Pull Request resolved: #3177

We add the ability to set the 2D parallel configuration per module (coined Dynamic 2D parallel). This means an EBC and an EC can be sharded differently on the data parallel dimension: for example, 4 replicas per EBC shard and 2 replicas per EC shard. The previous setup required all modules to share the same replication factor.

To do this, we introduce a lightweight dataclass that provides per-module configuration, allowing very granular control should the user require it:

```python
class DMPCollectionConfig:
    module: nn.Module  # this is expected to be the unsharded module
    plan: "ShardingPlan"  # sub-tree-specific sharding plan
    sharding_group_size: int
    use_inter_host_allreduce: bool = False
```

The dataclass provides the context for each module we are sharding and, if configured, creates separate process groups and sync logic for each of these modules.

Usage is as follows; suppose we want to use a different 2D configuration for EmbeddingCollection:

```python
# create a plan for the model tables over Y world size
# create a plan for the EmbeddingCollection tables over X world size
ec_config = DMPCollectionConfig(EmbeddingCollection, embedding_collection_plan, sharding_group_size)

model = DMPCollection(
    # pass in default args
    submodule_configs=[ec_config],
)
```

Future work includes:
- making it easier for users to create separate sharding plans per module
- per-table 2D

Reviewed By: liangbeixu

Differential Revision: D76774334

fbshipit-source-id: 27c7e0bc806d8227d784461a197cd8f1c7f6adfc
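To make the replication arithmetic above concrete, here is a minimal, self-contained sketch. It re-declares `DMPCollectionConfig` locally (so it does not depend on the final import path) and assumes the replication factor is `world_size // sharding_group_size`; the world size of 8 and the placeholder module/plan values are illustrative only, not taken from this diff.

```python
# Self-contained sketch (not TorchRec code): DMPCollectionConfig is
# re-declared from the summary above, and the replica counts assume
# replication factor = world_size // sharding_group_size.
from dataclasses import dataclass
from typing import Any, Type

import torch.nn as nn


@dataclass
class DMPCollectionConfig:
    module: Type[nn.Module]  # unsharded module type this config applies to
    plan: Any                # sub-tree-specific sharding plan ("ShardingPlan")
    sharding_group_size: int
    use_inter_host_allreduce: bool = False


WORLD_SIZE = 8  # illustrative world size, not from the diff

# EBC: each shard spans 2 ranks -> 8 // 2 = 4 replicas per shard
ebc_config = DMPCollectionConfig(nn.Module, plan=None, sharding_group_size=2)

# EC: each shard spans 4 ranks -> 8 // 4 = 2 replicas per shard
ec_config = DMPCollectionConfig(nn.Module, plan=None, sharding_group_size=4)

for name, cfg in (("EBC", ebc_config), ("EC", ec_config)):
    print(f"{name}: {WORLD_SIZE // cfg.sharding_group_size} replicas per shard")
```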
Fix test more generally (#3165)

Summary: Pull Request resolved: #3165

https://www.internalfb.com/diff/D77614983 attempted to fix a test, but I still see the same failure showing up in other tests, so this fixes it in general.

Reviewed By: huydhn

Differential Revision: D77758554

fbshipit-source-id: bd390081b68fa650f1cfd6d2a93a1fbf206aaff7
Revert D76476676: Multisect successfully blamed "D76476676: OSS TorchRec Internal MPZCH modules" for one test failure (#3146)

Summary: Pull Request resolved: #3146

This diff reverts D76476676 (OSS TorchRec Internal MPZCH modules, by lizhouyu), which causes the following test failure.

Tests affected:
- [cogwheel:cogwheel_tps_basic_test#test_tps_basic_latency](https://www.internalfb.com/intern/test/562950123898458/)

Here's the Multisect link: https://www.internalfb.com/multisect/30184635

Here are the tasks that are relevant to this breakage:
T211534727: 100+ tests, 10+ build rules failing for minimal_viable_ai

The backout may land if someone accepts it. If this diff has been generated in error, you can Commandeer and Abandon it.

Depends on D76476676

Reviewed By: lizhouyu

Differential Revision: D77502155

fbshipit-source-id: eb990251f3276372592c30a7361579e2a3639d6c
Fix Unit Test SkipIf Worldsize check (#3098)

Summary: Pull Request resolved: #3098

These unit tests actually require at least 4 GPUs due to their world size requirements. This updates the skipIf condition to match.

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: aliafzal

Differential Revision: D76621861

fbshipit-source-id: 09f9b04c4d3cb7b10736fbbaff3886a8534b96fa
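The summary does not show the decorator itself; the following is a minimal sketch of the kind of world-size guard being corrected, assuming a plain `unittest`-style test (TorchRec's own skip helpers may differ).

```python
import unittest

import torch


class ShardingTest(unittest.TestCase):
    # Skip unless the host has enough GPUs for the required world size.
    @unittest.skipIf(
        torch.cuda.device_count() < 4,
        "test requires a world size of 4, i.e. at least 4 GPUs",
    )
    def test_sharding_world_size_4(self) -> None:
        ...  # test body elided
```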
minor refactor of the GitHub workflow (#3062)

Summary: Pull Request resolved: #3062

# context
* refactor the matrix statement in the GitHub workflow
* for pull requests, only cu128 + py313 runs in the GPU CI, and only py39 and py313 run in the CPU CI

Reviewed By: aporialiao

Differential Revision: D76242338

fbshipit-source-id: b56ba4965842d89371d9a7baec858734bc306aaf
PMT (#3023)

Summary: Pull Request resolved: #3023

# context
* `_test_sharding` is a frequently used test function covering many TorchRec sharding test cases
* the multiprocess environment often introduces additional difficulty when debugging, especially for kernel-size issues (the multiprocess environment is not actually needed)
* this change makes the test run on the main process when `world_size == 1`, so that a simple `breakpoint()` just works

Reviewed By: iamzainhuda

Differential Revision: D74131796

fbshipit-source-id: ccc34ab589c0153cc0ce1187bba3df7dd63cbfc6
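A minimal sketch of the dispatch described above; the helper names here are hypothetical and not the actual TorchRec test utilities — the point is only that the single-rank path stays on the calling process so `breakpoint()` attaches directly.

```python
from typing import Any, Callable, Dict

import torch.multiprocessing as mp


def _entry(rank: int, test_fn: Callable[..., None], world_size: int, kwargs: Dict[str, Any]) -> None:
    # Shared entry point so both the in-process and spawned paths run the
    # same test body. test_fn must be a module-level (picklable) function.
    test_fn(rank=rank, world_size=world_size, **kwargs)


def run_sharding_test(test_fn: Callable[..., None], world_size: int, **kwargs: Any) -> None:
    """Hypothetical dispatcher illustrating the single-process debug path."""
    if world_size == 1:
        # Single rank: run directly on the main process so pdb/breakpoint()
        # works inside the test body.
        _entry(0, test_fn, world_size, kwargs)
    else:
        # Multiple ranks: spawn one process per rank, as before.
        mp.spawn(_entry, args=(test_fn, world_size, kwargs), nprocs=world_size)
```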