Adding support for Apple Silicon MPS Devices (M1/Pro/Max/Ultra) WIP #975
base: master
Conversation
That's a great initiative, I can assist here - my laptop is a very basic M1 (8 GB) with mps enabled (torch 1.13 dev from the nightly branch). For the MVP, I created a default pipeline:

```python
from pykeen.pipeline import pipeline

result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(
        create_inverse_triples=True,
    ),
    model="DistMult",
    model_kwargs=dict(
        embedding_dim=64,
    ),
)
print(result.get_metric("hits_at_10"))
```

Well, it seems that the GPU core in the basic M1 is kinda bad 😅 or something is not yet optimized. Training in the native arm64 env with the mps device is only about 44 batches/s, so pretty much 2x slower 🤔

@sbonner0 Is it different on your machine?
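For anyone reproducing this comparison, a minimal device-selection sketch (assumes torch >= 1.12, where `torch.backends.mps` exists; the resolved device can then be passed to PyKEEN's `pipeline` via its `device` argument):

```python
import torch

# Pick the MPS backend when it is usable, otherwise fall back to CPU.
if torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```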
src/pykeen/evaluation/evaluator.py (Outdated)

```diff
@@ -186,9 +186,10 @@ def evaluate(
     if batch_size is None and self.automatic_memory_optimization:
         # Using automatic memory optimization on CPU may result in undocumented crashes due to OS' OOM killer.
-        if model.device.type == "cpu":
+        if model.device.type == "cpu" or model.device.type == "mps":
```
Would this also work?

```diff
-if model.device.type == "cpu" or model.device.type == "mps":
+if model.device.type != "cuda":
```
Generally yes; moving tensors to mps returns the expected results:

```python
>>> a = torch.rand(5, 5, device="mps")
>>> a.device.type
'mps'
```
Can we update this `logger.info` to better explain why this only works on CUDA GPUs?
I don't know how mps behaves here. It does not work for CPU, since the OS' OOM killer is not as lenient as CUDA's out-of-memory handling and usually just kills your whole program instead of raising an exception the program can handle.
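A minimal sketch of how such a guard plus a more informative log message might look (the helper name is hypothetical; the rationale follows the comment above):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def amo_is_safe(device: torch.device) -> bool:
    """Whether AMO's probe-and-catch-OOM strategy is safe on the given device.

    CUDA raises a recoverable RuntimeError when a batch does not fit in memory;
    on CPU (and on MPS, where CPU and GPU share memory) the OS' OOM killer may
    terminate the whole process instead of raising an exception.
    """
    if device.type != "cuda":
        logger.info(
            "Automatic memory optimization is disabled on %s devices: only CUDA "
            "raises a catchable exception on out-of-memory.",
            device.type,
        )
        return False
    return True
```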
My reasoning for initially disabling this was that on Apple Silicon systems the CPU and GPU share memory, so I thought it best to err on the side of caution whilst MPS support is very early and not let a model completely fill the shared memory. We can assess this again when there is better support in PyTorch.
FYI this line of code has been upstreamed into Max's AMO library.
(the warning part of it 😄 we still need a distinction here to avoid OOM killer action on the non-CUDA devices)
Hi @migalkin, thanks for being enthusiastic about this and testing it out! I can confirm the performance as it stands is not great when running on the GPU (I too am on a base M1, although with 16 GB in a 13" MBP). This seems to be mostly a PyTorch issue, as the M1-GPU-enabled TensorFlow does perform quite a bit better. As seen in the linked PyTorch issue, support for this hardware is really early and many operations are still missing, so I do hope it will improve quite a bit over time.
@migalkin - just wanted to confirm with you that the evaluation loop is also currently failing due to an error with tensor indexing?
Generally, running a script with the env variable set fails with:

```
File "...pykeen/src/pykeen/evaluation/evaluator.py", line 465, in create_sparse_positive_filter_
    filter_batch[:, 1] = all_pos_triples[:, filter_col : filter_col + 1].view(1, -1)[:, filter_batch[:, 1]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (CPU)
```

but it seems to be related to tensors and devices in general, not pertaining to MPS 🤔
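A small reproduction of that device mismatch with hypothetical tensors, plus the obvious workaround of aligning the index device (a sketch, guarded so it only runs where MPS exists):

```python
import torch

if torch.backends.mps.is_available():
    table = torch.arange(10)                  # indexed tensor lives on the CPU
    idx = torch.tensor([1, 3], device="mps")  # ...but the indices ended up on MPS

    # table[idx] raises: "indices should be either on cpu or on the same device
    # as the indexed tensor (CPU)". Moving the indices fixes it:
    ok = table[idx.cpu()]
```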
I'd very much appreciate this feature. Can I help out somehow?
I think the main blocker is waiting on one of the core team to get a computer with Apple silicon, or to see if there's a way to set up automated testing on a machine with one.
Hi @tcz, just to add to what Charley has said - on a technical level there is still a lot of (frankly basic) functionality from PyTorch that doesn't have an appropriate MPS op version. A good example is that it has no support for complex tensors, meaning models like ComplEx and RotatE do not work. However, if you would like to help, I think it would be good if you could install the latest nightly version of PyTorch on an Apple Silicon machine and report back here any PyKEEN functionality that does not work.
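As a starting point for such reports, a tiny probe like this (a sketch, not PyKEEN code) shows whether complex tensors work on a given nightly:

```python
import torch

if torch.backends.mps.is_available():
    try:
        real = torch.ones(2, device="mps")
        imag = torch.ones(2, device="mps")
        torch.complex(real, imag)  # unsupported on MPS at the time of writing
        print("complex tensors work on this nightly")
    except RuntimeError as err:
        print(f"complex tensors still unsupported on MPS: {err}")
```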
Update: @mberr and I were working on AMO on Apple Silicon. We upstreamed that functionality into Max's AMO library.
I think we also need to adjust these lines:

- `src/pykeen/evaluation/evaluator.py`, lines 367 to 371 (at a541a3b)
- `src/pykeen/evaluation/evaluation_loop.py`, lines 195 to 200 (at a541a3b)
- `src/pykeen/training/training_loop.py`, lines 508 to 524 (at a541a3b)
@mberr are there other types of devices we could imagine people using (do Lightning or Accelerate have a unique one)? In that case, we might need to do something a bit more clever about how we infer certain properties of the device, not relying on hard-coding the name but rather querying a lookup table we can maintain externally.
They are not really documented, but these are shown in the error message:

```python
>>> torch.device("a")
RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: a
```

I think for now it would suffice to use the CPU settings for everything except cuda, and wait until someone comes around with any of the other devices. We could also add a small debug/info/warning message when we see an unexpected device.
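A sketch of what such an externally maintained lookup table could look like (names are hypothetical; unknown device types fall back to the conservative CPU settings with a warning, as suggested above):

```python
import logging

import torch

logger = logging.getLogger(__name__)

# Only CUDA is known to raise a catchable exception on out-of-memory.
SUPPORTS_AMO: dict[str, bool] = {
    "cuda": True,
    "cpu": False,
    "mps": False,
}


def supports_amo(device: torch.device) -> bool:
    if device.type not in SUPPORTS_AMO:
        logger.warning(
            "Unknown device type %r; falling back to conservative CPU settings.",
            device.type,
        )
    return SUPPORTS_AMO.get(device.type, False)
```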
Current status
Gave
Sounds like the array slicing may not be compatible in the usual way, where it is allowed to have the stop index be larger than the array size, e.g.:

```python
a = torch.zeros(10)
a[6:6+6]  # the stop index may run past the end; the slice is clamped to the array size
```
GHA now supports Apple machines with Apple Silicon (including GPU) - can we get those into the test regime here?
FYI, torch 2.7 might have solved the automatic memory optimization issue!
February 2025 Update
PyTorch still has issues with segfaults on MPS. Here are a few things we're waiting on:
Original Issue Text
Description of the Change
Since PyTorch has added support for Apple Silicon machines (via the use of Metal Performance Shaders, https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/), I thought it would be interesting to add support for it into PyKEEN. However, the full range of PyTorch operations is not currently supported, meaning that, although some models (I have mostly been testing TransE) can be trained on an M1 GPU, the evaluation currently fails due to the following missing operation:

`aten::index.Tensor_out`
Link to PyTorch issue tracking MPS progress: pytorch/pytorch#77764
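For ops that are still missing, the PyTorch nightlies expose a CPU fallback through an environment variable; it must be set before `torch` is imported (a workaround rather than a fix, since the fallback runs on the CPU):

```python
import os

# Route ops without an MPS kernel (e.g. aten::index.Tensor_out) to the CPU.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch  # noqa: E402  (must come after setting the variable)
```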
These issues have some big implications for PyKEEN. For example, complex tensors are currently not supported by MPS, so there is no support for RotatE, ComplEx, etc.
In order to test/work on this, there are some prerequisites to consider: MPS-enabled PyTorch requires macOS 12.3+ and an ARM-native Python installation. Details on how to install MPS-enabled PyTorch are provided here: https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/
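Once installed, a quick sanity check distinguishes a build without MPS support from a machine/OS that cannot use it:

```python
import torch

print(torch.backends.mps.is_built())      # was this PyTorch compiled with MPS support?
print(torch.backends.mps.is_available())  # can this machine (macOS 12.3+, arm64) use it?
```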
Possible Drawbacks
An extra range of (relatively niche) devices to support. Also, from existing benchmarks, performance is still a long, long way from even a mid-range CUDA-enabled GPU. It is also possible that the recent inclusion of PyTorch Lightning means it might be better to leave the handling of exotic accelerators to it.
Release Notes