
Adding support for Apple Silicon MPS Devices (M1/Pro/Max/Ultra) WIP #975


Open
wants to merge 19 commits into master

Conversation

@sbonner0 (Contributor) commented Jun 12, 2022

February 2025 Update

PyTorch still has issues with segfaults on MPS. Here are a few things we're waiting on:

Original Issue Text

Description of the Change

Since PyTorch has added support for Apple Silicon machines (via Metal Performance Shaders, https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/), I thought it would be interesting to add support for it to PyKEEN. However, the full range of PyTorch operations is not yet supported, meaning that, although some models (I have mostly been testing TransE) can be trained on an M1 GPU, the evaluation currently fails due to the following missing operation: 'aten::index.Tensor_out'

Link to PyTorch issue tracking MPS progress: pytorch/pytorch#77764
These issues have some big implications for PyKEEN. For example, complex tensors are currently not supported by MPS, so there is no support for RotatE, ComplEx, etc.

In order to test/work on this, there are some prerequisites to consider. MPS-enabled PyTorch requires macOS 12.3+ and an ARM-native Python installation. Details on how to install MPS-enabled PyTorch are provided here: https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/
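
For anyone setting this up, a quick way to check that the installed PyTorch build actually exposes MPS is the standard torch.backends.mps query (a minimal sketch, not part of this PR):

import torch

# is_built(): was this PyTorch binary compiled with MPS support?
# is_available(): can the MPS runtime be used on this machine (macOS 12.3+, Apple Silicon)?
if torch.backends.mps.is_built() and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")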

Possible Drawbacks

An extra range of (relatively niche) devices to support. Also, existing benchmarks show that performance is still a long way from even a mid-range CUDA-enabled GPU. The recent inclusion of PyTorch Lightning may also mean it is better to leave the handling of exotic accelerators to it.

Release Notes

  • Added support for PyKEEN model training/evaluation on Apple Silicon-equipped machines (M1/Pro/Max/Ultra).

@sbonner0 marked this pull request as draft June 12, 2022 20:29
@migalkin (Member) commented Jun 14, 2022

That's a great initiative; I can assist here. My laptop is a very basic M1 (8 GB) with mps enabled (torch 1.13 dev from the nightly branch).

For the MVP, I created a default pipeline:

from pykeen.pipeline import pipeline
result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(
        create_inverse_triples=True,
    ),
    model='DistMult',
    model_kwargs=dict(
        embedding_dim=64,
    ),
)

print(result.get_metric("hits_at_10"))

Well, it seems that the GPU core in the basic M1 is kinda bad 😅 or something is not yet optimized.
Training in the x86-compatible env through Rosetta gives about 100 batches/s.

Training batches on cpu:  24%|██████▉                      | 513/2126 [00:05<00:15, 104.98batch/s]

Training in the native arm64 env with the mps device is only about 44 batches/s, so pretty much 2x slower 🤔

Training batches on mps:0:  19%|█▊        | 396/2126 [00:09<00:38, 44.55batch/s]

@sbonner0 Is it different on your machine?

@@ -186,9 +186,10 @@ def evaluate(

     if batch_size is None and self.automatic_memory_optimization:
         # Using automatic memory optimization on CPU may result in undocumented crashes due to OS' OOM killer.
-        if model.device.type == "cpu":
+        if model.device.type == "cpu" or model.device.type == "mps":
Member

Would this also work?

Suggested change
-        if model.device.type == "cpu" or model.device.type == "mps":
+        if model.device.type != "cuda":

@migalkin (Member) Jun 14, 2022

Generally yes, trying out moving tensors to mps returns expected results:

>>> a = torch.rand(5,5,device="mps")
>>> a.device.type
'mps'

Member

Can we update this logger.info to better explain why this only works on CUDA GPUs?

Member

I don't know how mps works.

It does not work for CPU, since the OS' OOM killer is not as lenient as CUDA's out-of-memory handling and usually just kills your whole program instead of allowing the program to handle an exception.
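
For context, the reason automatic memory optimization can work on CUDA is that an out-of-memory allocation surfaces as a catchable RuntimeError, so a search loop can shrink the batch size and retry. A rough sketch of that pattern (not PyKEEN's actual implementation; evaluate_batch is a hypothetical callable):

def find_max_batch_size(evaluate_batch, start: int) -> int:
    """Halve the batch size until evaluation no longer runs out of memory (sketch only)."""
    batch_size = start
    while batch_size >= 1:
        try:
            evaluate_batch(batch_size)
            return batch_size
        except RuntimeError as error:
            # A CUDA OOM surfaces as a catchable RuntimeError; on CPU the OS'
            # OOM killer may terminate the whole process before we get here.
            if "out of memory" not in str(error):
                raise
            batch_size //= 2
    raise MemoryError("even batch_size=1 does not fit into memory")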

@sbonner0 (Contributor, Author) Jun 14, 2022

My thinking behind initially disabling this was that on Apple Silicon systems the CPU and GPU share memory, so I thought it best to err on the side of caution while MPS support is very early and not let a model completely fill the shared memory. We can assess this again when there is better support in PyTorch.

Member

FYI this line of code has been upstreamed into Max's AMO library (torch_max_mem)

Member

(the warning part of it 😄 we still need a distinction here to avoid OOM killer action for non-cuda parts)

@sbonner0 (Contributor, Author)

Hi @migalkin, thanks for being enthusiastic about this and testing it out! I can confirm the performance as it stands is not great when running on the GPU (I too am on a base M1, although with 16 GB in a 13-inch MBP). This seems to be mostly a PyTorch issue, as the M1-GPU-enabled TensorFlow performs quite a bit better. As seen in the linked PyTorch issue, support for this hardware is really early and many operations are still missing, so I do hope it will improve quite a bit over time.

@sbonner0 (Contributor, Author)

@migalkin - just wanted to confirm with you that the evaluation loop is also currently failing due to an error with tensor indexing?

@migalkin (Member)

@migalkin - just wanted to confirm with you that the evaluation loop is also currently failing due to an error with tensor indexing?

Generally, running a script with the env variable PYTORCH_ENABLE_MPS_FALLBACK=1 prevents a lot of crashes.
Right now, the eval script indeed crashes with the error

File "...pykeen/src/pykeen/evaluation/evaluator.py", line 465, in create_sparse_positive_filter_
    filter_batch[:, 1] = all_pos_triples[:, filter_col : filter_col + 1].view(1, -1)[:, filter_batch[:, 1]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (CPU)

but it seems to be related to tensors and devices in general, not specific to MPS 🤔
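
As a side note on the fallback variable: as far as I can tell it is read when torch initializes, so it either needs to be exported in the shell or set before the import, roughly like this (a minimal sketch):

import os

# set before the first `import torch`, otherwise the fallback is not picked up
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from pykeen.pipeline import pipeline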

@tcz commented Dec 17, 2022

I'd very much appreciate this feature. Can I help out somehow?

@cthoyt (Member) commented Dec 17, 2022

I think the main blocker is waiting for one of the core team to get a computer with Apple Silicon, or to see if there's a way to set up automated testing on a machine with one.

@sbonner0 (Contributor, Author)

I'd very much appreciate this feature. Can I help out somehow?

Hi @tcz, just to add to what Charley has said - on a technical level, there is still a lot of (frankly basic) functionality in PyTorch that doesn't have an appropriate MPS ops implementation. A good example is that there is no support for complex tensors, meaning models like ComplEx and RotatE do not work.

However, if you would like to help, I think it would be good if you could install the latest nightly version of PyTorch on an Apple Silicon machine and report back here any PyKEEN functionality that does not work.

@cthoyt (Member) commented Sep 23, 2023

Update: @mberr and I were working on AMO on Apple Silicon. We upstreamed that functionality into torch_max_mem and bumped the minimum version here.

@mberr (Member) commented Sep 23, 2023

I think we also need to adjust these lines from == "cpu" to != "cuda" (a sketch of the adjustment follows the excerpts below):

num_triples = mapped_triples.shape[0]
# no batch size -> automatic memory optimization
if batch_size is None:
    batch_size = 32 if device.type == "cpu" else num_triples
    logger.debug(f"Automatically set maximum batch size to {batch_size=}")

# set upper limit of batch size for automatic memory optimization
if not batch_size:
    if self.model.device.type == "cpu":
        batch_size = 32
    else:
        batch_size = len(self.dataset)

# Take the biggest possible training batch_size, if batch_size not set
batch_size_sufficient = False
if batch_size is None:
    if self.automatic_memory_optimization:
        # Using automatic memory optimization on CPU may result in undocumented crashes due to OS' OOM killer.
        if self.model.device.type == "cpu":
            batch_size = 256
            batch_size_sufficient = True
            logger.info(
                "Currently automatic memory optimization only supports GPUs, but you're using a CPU. "
                "Therefore, the batch_size will be set to the default value '{batch_size}'",
            )
        else:
            batch_size, batch_size_sufficient = self.batch_size_search(triples_factory=triples_factory)
    else:
        batch_size = 256
        logger.info(f"No batch_size provided. Setting batch_size to '{batch_size}'.")
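
For reference, the adjustment to the first excerpt would look roughly like this (a sketch of the proposed change, not the final diff):

num_triples = mapped_triples.shape[0]
# no batch size -> automatic memory optimization
if batch_size is None:
    # treat every non-CUDA device (cpu, mps, ...) conservatively
    batch_size = 32 if device.type != "cuda" else num_triples
    logger.debug(f"Automatically set maximum batch size to {batch_size=}")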

@cthoyt (Member) commented Sep 23, 2023

@mberr are there other types of devices we could imagine people using (does Lightning or Accelerate have a unique one)? In that case, we might need to do something a bit more clever about how we infer certain properties of the device, something that doesn't rely on hard-coding the name but rather consults a lookup table we can maintain externally.

@mberr (Member) commented Sep 23, 2023

They are not really documented, but these are shown in the error message:

>>> torch.device("a")
RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: a

I think for now, it would suffice to use the cpu settings for everything except cuda, and wait until someone comes around with any of the other devices. We could also add a small debug/info/warning message when we see an unexpected device.
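
A sketch of what that could look like (the helper name and the constant are made up for illustration and are not part of PyKEEN):

import logging

import torch

logger = logging.getLogger(__name__)

# device types whose behaviour we have actually tested
KNOWN_DEVICE_TYPES = {"cpu", "cuda", "mps"}


def resolve_max_batch_size(device: torch.device, num_triples: int) -> int:
    """Choose an upper bound for the batch size based on the device type (sketch)."""
    if device.type not in KNOWN_DEVICE_TYPES:
        logger.warning("Unexpected device type %r; falling back to the conservative CPU defaults.", device.type)
    if device.type == "cuda":
        # on CUDA, AMO can safely search downwards from the full number of triples
        return num_triples
    # cpu, mps, and everything else: stay small to avoid the OS' OOM killer
    return 32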

@cthoyt (Member) commented Sep 23, 2023

Current status

from pykeen.pipeline import pipeline

if __name__ == '__main__':
    result = pipeline(
        epochs=5,
        dataset="fb15k237",
        dataset_kwargs=dict(
            create_inverse_triples=True,
        ),
        model="DistMult",
        model_kwargs=dict(
            embedding_dim=64,
        ),
    )

    print(result.get_metric("hits_at_10"))

Gave

No random seed is specified. Setting to 2322437377.
INFO:pykeen.triples.triples_factory:Creating inverse triples.
Training epochs on mps:0:   0%|                                                                                                                                                 | 0/5 [00:00<?, ?epoch/s]
INFO:pykeen.triples.triples_factory:Creating inverse triples.
Training epochs on mps:0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:58<00:00, 11.75s/epoch, loss=0.501, prev_loss=0.725]
Evaluating on mps:0:   0%|                                                                                                                                              | 0.00/20.4k [00:00<?, ?triple/s]
WARNING:torch_max_mem.api:Encountered tensors on device_types={'mps'} while only ['cuda'] are considered safe for automatic memory utilization maximization. This may lead to undocumented crashes (but can be safe, too).
Evaluating on mps:0:  92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋          | 18.8k/20.4k [00:14<00:00, 2.17ktriple/s]
/AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:90: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (20438)'
fish: Job 1, 'python test.py' terminated by signal SIGABRT (Abort)
/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@mberr (Member) commented Sep 23, 2023

/AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:90: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (20438)'

sounds like the array slicing may not be compatible with the usual behaviour, where the stop index is allowed to be larger than the array size, e.g.,

a = torch.zeros(10)
# on CPU this is fine and simply returns the 4 remaining elements
a[6:6+6]

@sbonner0 (Contributor, Author)

Thanks @cthoyt and @mberr for picking this up again. I feel that MPS support is still quite far behind where it would need to be for it to be enabled by default for users with M-series machines. What do you think about having it as an option a user must select, rather than enabling it by default?
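
One way to make it opt-in would be to keep the current cpu/cuda resolution as the default and only use MPS when a user requests it explicitly, e.g. via the pipeline's device argument (a sketch of the intended usage, assuming the device argument is passed through unchanged):

from pykeen.pipeline import pipeline

# MPS is only used because it is requested explicitly;
# automatic resolution would keep preferring cuda/cpu
result = pipeline(
    dataset="fb15k237",
    model="DistMult",
    device="mps",
)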

@cthoyt mentioned this pull request May 14, 2024
@cthoyt (Member) commented Aug 9, 2024

GHA now supports Apple machines with Apple Silicon (including GPU). Can we get those into the test regime here?

@cthoyt marked this pull request as ready for review February 9, 2025 10:15
@cthoyt (Member) commented Apr 25, 2025

FYI, torch 2.7 might have solved the automatic memory optimization issue!
