
Adding support for Apple Silicon MPS Devices (M1/Pro/Max/Ultra) WIP #975


Open
wants to merge 19 commits into master

Conversation

@sbonner0 (Contributor) commented Jun 12, 2022

February 2025 Update

PyTorch still has issues with segfaults on MPS. Here are a few things we're waiting on:

Original Issue Text

Description of the Change

Since PyTorch has added support for Apple Silicon machines (via Metal Performance Shaders, https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/), I thought it would be interesting to add support for it to PyKEEN. However, the full range of PyTorch operations is not yet supported, meaning that, although some models (I have mostly been testing TransE) can be trained on an M1 GPU, the evaluation currently fails due to the following missing operation: 'aten::index.Tensor_out'

Link to PyTorch issue tracking MPS progress: pytorch/pytorch#77764
These issues have some big implications for PyKEEN. For example, complex tensors are currently not supported by MPS, so there is no support for RotatE, ComplEx, etc.

In order to test/work on this, there are some prerequisites to consider. MPS-enabled PyTorch requires macOS 12.3+ and an ARM-native Python installation. Details on how to install MPS-enabled PyTorch are provided here: https://pytorch.org/blog/introducing-accelerated-pytorch-training-on-mac/
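
For anyone setting this up, a quick way to check that the installed PyTorch build actually exposes MPS is the standard torch.backends.mps query (a minimal sketch, not part of this PR):

import torch

# is_built(): was this PyTorch binary compiled with MPS support?
# is_available(): can the MPS runtime be used on this machine (macOS 12.3+, Apple Silicon)?
if torch.backends.mps.is_built() and torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(f"Using device: {device}")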

Possible Drawbacks

An extra range of (relatively niche) devices to support. Also, existing benchmarks show that performance is still a long way from even a mid-range CUDA-enabled GPU. The recent inclusion of PyTorch Lightning may also mean it is better to leave the handling of exotic accelerators to it.

Release Notes

  • Added support for PyKEEN model training/evaluation on Apple Silicon-equipped machines (M1/Pro/Max/Ultra).

@sbonner0 marked this pull request as draft June 12, 2022 20:29
@migalkin (Member) commented Jun 14, 2022

That's a great initiative; I can assist here. My laptop is a very basic M1 (8 GB) with mps enabled (torch 1.13 dev from the nightly branch).

For the MVP, I created a default pipeline:

from pykeen.pipeline import pipeline
result = pipeline(
    dataset="fb15k237",
    dataset_kwargs=dict(
        create_inverse_triples=True,
    ),
    model='DistMult',
    model_kwargs=dict(
        embedding_dim=64,
    ),
)

print(result.get_metric("hits_at_10"))

Well, it seems that the GPU core in the basic M1 is kinda bad 😅 or something is not yet optimized.
Training in the x86-compatible env through Rosetta gives about 100 batches/s.

Training batches on cpu:  24%|██████▉                      | 513/2126 [00:05<00:15, 104.98batch/s]

Training in the native arm64 env with the mps device is only about 44 batches/s, so pretty much 2x slower 🤔

Training batches on mps:0:  19%|█▊        | 396/2126 [00:09<00:38, 44.55batch/s]

@sbonner0 Is it different on your machine?

@@ -186,9 +186,10 @@ def evaluate(

     if batch_size is None and self.automatic_memory_optimization:
         # Using automatic memory optimization on CPU may result in undocumented crashes due to OS' OOM killer.
-        if model.device.type == "cpu":
+        if model.device.type == "cpu" or model.device.type == "mps":
Member

Would this also work?

Suggested change
-        if model.device.type == "cpu" or model.device.type == "mps":
+        if model.device.type != "cuda":

@migalkin (Member) Jun 14, 2022

Generally yes, trying out moving tensors to mps returns expected results:

>>> a = torch.rand(5,5,device="mps")
>>> a.device.type
'mps'

Member

Can we update this logger.info to better explain why this only works on CUDA GPUs?

Member

I don't know how mps works.

It does not work for CPU, since the OS' OOM killer is not as lenient as CUDA's out-of-memory handling and usually just kills your whole program instead of allowing the program to handle an exception.
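
For context, the reason automatic memory optimization can work on CUDA is that an out-of-memory allocation surfaces as a catchable RuntimeError, so a search loop can shrink the batch size and retry. A rough sketch of that pattern (not PyKEEN's actual implementation; evaluate_batch is a hypothetical callable):

def find_max_batch_size(evaluate_batch, start: int) -> int:
    """Halve the batch size until evaluation no longer runs out of memory (sketch only)."""
    batch_size = start
    while batch_size >= 1:
        try:
            evaluate_batch(batch_size)
            return batch_size
        except RuntimeError as error:
            # A CUDA OOM surfaces as a catchable RuntimeError; on CPU the OS'
            # OOM killer may terminate the whole process before we get here.
            if "out of memory" not in str(error):
                raise
            batch_size //= 2
    raise MemoryError("even batch_size=1 does not fit into memory")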

@sbonner0 (Contributor, Author) Jun 14, 2022

My thinking behind initially disabling this was that on Apple Silicon systems the CPU and GPU share memory, so I thought it best to err on the side of caution while MPS support is very early and not let a model completely fill the shared memory. We can assess this again when there is better support in PyTorch.

Member

FYI this line of code has been upstreamed into Max's AMO library (torch_max_mem)

Member

(the warning part of it 😄 we still need a distinction here to avoid OOM killer action for non-cuda parts)

@sbonner0 (Contributor, Author)

Hi @migalkin, thanks for being enthusiastic about this and testing it out! I can confirm the performance as it stands is not great when running on the GPU (I too am on a base M1, although with 16 GB in a 13-inch MBP). This seems to be mostly a PyTorch issue, as the M1-GPU-enabled TensorFlow performs quite a bit better. As seen in the linked PyTorch issue, support for this hardware is really early and many operations are still missing, so I do hope it will improve quite a bit over time.

@sbonner0 (Contributor, Author)

@migalkin - just wanted to confirm with you that the evaluation loop is also currently failing due to an error with tensor indexing?

@migalkin (Member)

@migalkin - just wanted to confirm with you that the evaluation loop is also currently failing due to an error with tensor indexing?

Generally, running a script with the env variable PYTORCH_ENABLE_MPS_FALLBACK=1 prevents a lot of crashes.
Right now, the eval script indeed crashes with the error

File "...pykeen/src/pykeen/evaluation/evaluator.py", line 465, in create_sparse_positive_filter_
    filter_batch[:, 1] = all_pos_triples[:, filter_col : filter_col + 1].view(1, -1)[:, filter_batch[:, 1]]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (CPU)

but it seems to be related to tensors and devices in general, not specific to MPS 🤔
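
As a side note on the fallback variable: as far as I can tell it is read when torch initializes, so it either needs to be exported in the shell or set before the import, roughly like this (a minimal sketch):

import os

# set before the first `import torch`, otherwise the fallback is not picked up
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch
from pykeen.pipeline import pipeline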

@tcz commented Dec 17, 2022

I'd very much appreciate this feature. Can I help out somehow?

@cthoyt (Member) commented Dec 17, 2022

I think the main blocker is waiting for one of the core team to get a computer with Apple Silicon, or to see if there's a way to set up automated testing on a machine with one.

@sbonner0 (Contributor, Author)

I'd very much appreciate this feature. Can I help out somehow?

Hi @tcz, just to add to what Charley has said - on a technical level, there is still a lot of (frankly basic) functionality in PyTorch that doesn't have an appropriate MPS ops implementation. A good example is that there is no support for complex tensors, meaning models like ComplEx and RotatE do not work.

However, if you would like to help, I think it would be good if you could install the latest nightly version of PyTorch on an Apple Silicon machine and report back here any PyKEEN functionality that does not work.

@cthoyt (Member) commented Sep 23, 2023

Update: @mberr and I were working on AMO on Apple Silicon. We upstreamed that functionality into torch_max_mem and bumped the minimum version here.

@mberr (Member) commented Sep 23, 2023

I think we also need to adjust these lines from == "cpu" to != "cuda" (a sketch of the adjustment follows the excerpts below):

num_triples = mapped_triples.shape[0]
# no batch size -> automatic memory optimization
if batch_size is None:
    batch_size = 32 if device.type == "cpu" else num_triples
    logger.debug(f"Automatically set maximum batch size to {batch_size=}")

# set upper limit of batch size for automatic memory optimization
if not batch_size:
    if self.model.device.type == "cpu":
        batch_size = 32
    else:
        batch_size = len(self.dataset)

# Take the biggest possible training batch_size, if batch_size not set
batch_size_sufficient = False
if batch_size is None:
    if self.automatic_memory_optimization:
        # Using automatic memory optimization on CPU may result in undocumented crashes due to OS' OOM killer.
        if self.model.device.type == "cpu":
            batch_size = 256
            batch_size_sufficient = True
            logger.info(
                "Currently automatic memory optimization only supports GPUs, but you're using a CPU. "
                "Therefore, the batch_size will be set to the default value '{batch_size}'",
            )
        else:
            batch_size, batch_size_sufficient = self.batch_size_search(triples_factory=triples_factory)
    else:
        batch_size = 256
        logger.info(f"No batch_size provided. Setting batch_size to '{batch_size}'.")
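
For reference, the adjustment to the first excerpt would look roughly like this (a sketch of the proposed change, not the final diff):

num_triples = mapped_triples.shape[0]
# no batch size -> automatic memory optimization
if batch_size is None:
    # treat every non-CUDA device (cpu, mps, ...) conservatively
    batch_size = 32 if device.type != "cuda" else num_triples
    logger.debug(f"Automatically set maximum batch size to {batch_size=}")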

@cthoyt (Member) commented Sep 23, 2023

@mberr are there other types of devices we could imagine people using (does Lightning or Accelerate have a unique one)? In that case, we might need to do something a bit more clever about how we infer certain properties of the device, something that doesn't rely on hard-coding the name but rather consults a lookup table we can maintain externally.

@mberr (Member) commented Sep 23, 2023

They are not really documented, but these are shown in the error message:

>>> torch.device("a")
RuntimeError: Expected one of cpu, cuda, ipu, xpu, mkldnn, opengl, opencl, ideep, hip, ve, fpga, ort, xla, lazy, vulkan, mps, meta, hpu, mtia, privateuseone device type at start of device string: a

I think for now, it would suffice to use the cpu settings for everything except cuda, and wait until someone comes around with any of the other devices. We could also add a small debug/info/warning message when we see an unexpected device.
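
A sketch of what that could look like (the helper name and the constant are made up for illustration and are not part of PyKEEN):

import logging

import torch

logger = logging.getLogger(__name__)

# device types whose behaviour we have actually tested
KNOWN_DEVICE_TYPES = {"cpu", "cuda", "mps"}


def resolve_max_batch_size(device: torch.device, num_triples: int) -> int:
    """Choose an upper bound for the batch size based on the device type (sketch)."""
    if device.type not in KNOWN_DEVICE_TYPES:
        logger.warning("Unexpected device type %r; falling back to the conservative CPU defaults.", device.type)
    if device.type == "cuda":
        # on CUDA, AMO can safely search downwards from the full number of triples
        return num_triples
    # cpu, mps, and everything else: stay small to avoid the OS' OOM killer
    return 32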

@cthoyt (Member) commented Sep 23, 2023

Current status

from pykeen.pipeline import pipeline

if __name__ == '__main__':
    result = pipeline(
        epochs=5,
        dataset="fb15k237",
        dataset_kwargs=dict(
            create_inverse_triples=True,
        ),
        model="DistMult",
        model_kwargs=dict(
            embedding_dim=64,
        ),
    )

    print(result.get_metric("hits_at_10"))

Gave

No random seed is specified. Setting to 2322437377.
INFO:pykeen.triples.triples_factory:Creating inverse triples.
Training epochs on mps:0:   0%|                                                                                                                                                 | 0/5 [00:00<?, ?epoch/s]
INFO:pykeen.triples.triples_factory:Creating inverse triples.
Training epochs on mps:0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:58<00:00, 11.75s/epoch, loss=0.501, prev_loss=0.725]
Evaluating on mps:0:   0%|                                                                                                                                              | 0.00/20.4k [00:00<?, ?triple/s]
WARNING:torch_max_mem.api:Encountered tensors on device_types={'mps'} while only ['cuda'] are considered safe for automatic memory utilization maximization. This may lead to undocumented crashes (but can be safe, too).
Evaluating on mps:0:  92%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋          | 18.8k/20.4k [00:14<00:00, 2.17ktriple/s]
/AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:90: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (20438)'
fish: Job 1, 'python test.py' terminated by signal SIGABRT (Abort)
/opt/homebrew/Cellar/python@3.11/3.11.4_1/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

@mberr (Member) commented Sep 23, 2023

/AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:90: failed assertion `[MPSNDArrayDescriptor sliceDimension:withSubrange:] error: the range subRange.start + subRange.length does not fit in dimension[1] (20438)'

sounds like the array slicing may not be compatible with the usual behaviour, where the stop index is allowed to be larger than the array size, e.g.,

a = torch.zeros(10)
# on CPU this is fine and simply returns the 4 remaining elements
a[6:6+6]

@sbonner0 (Contributor, Author)

Thanks @cthoyt and @mberr for picking this up again. I feel that MPS support is still quite far behind where it would need to be for it to be enabled by default for users with M-series machines. What do you think about having it as an option a user must select, rather than enabling it by default?
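
One way to make it opt-in would be to keep the current cpu/cuda resolution as the default and only use MPS when a user requests it explicitly, e.g. via the pipeline's device argument (a sketch of the intended usage, assuming the device argument is passed through unchanged):

from pykeen.pipeline import pipeline

# MPS is only used because it is requested explicitly;
# automatic resolution would keep preferring cuda/cpu
result = pipeline(
    dataset="fb15k237",
    model="DistMult",
    device="mps",
)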

@cthoyt mentioned this pull request May 14, 2024
@cthoyt (Member) commented Aug 9, 2024

GHA now supports Apple machines with Apple Silicon (including GPU). Can we get those into the test regime here?

@cthoyt marked this pull request as ready for review February 9, 2025 10:15
@cthoyt (Member) commented Apr 25, 2025

FYI, torch 2.7 might have solved the automatic memory optimization issue!
