Please describe the bug
example/linear_regression.py with the AllReduce strategy crashes when run on a CPU-only multi-node cluster with a resource spec like the following (a short sketch of how the strategy is attached follows the spec):
nodes:
  - address: X.X.X.X
    cpus: [0]
    chief: true
  - address: X.X.X.X
    cpus: [0]
    ssh_config: conf
ssh:
  conf:
    username: XXX
    key_file: YYY.pem
    shared_envs:
      LD_LIBRARY_PATH: '$LD_LIBRARY_PATH:/usr/local/cuda/lib64'
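For reference, a minimal sketch of how the AllReduce strategy would be attached to this spec; the constructor and strategy-builder imports follow the AutoDist README, and the file name resource_spec.yml is an assumption:

from autodist import AutoDist
from autodist.strategy import AllReduce

# Point AutoDist at the CPU-only resource spec above and request the
# AllReduce strategy builder (file name is a placeholder).
autodist = AutoDist(resource_spec_file='resource_spec.yml',
                    strategy_builder=AllReduce())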
Output
Segmentation fault (core dumped)
2021-02-24 14:39:39.456448: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:160] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1614195579.456255335","description":"Error received from peer ipv4:127.0.0.1:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/xxx/.local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnavailableError: From /job:worker/replica:0/task:1:
Socket closed
Additional GRPC error information:
{"created":"@1614195579.456825677","description":"Error received from peer ipv4:10.20.41.65:15000","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Socket closed","grpc_status":14}
	 [[{{node scoped_allocator_1_2_CollectiveReduce}}]]
Please describe the expected behavior
The example should train to completion on the CPU-only multi-node cluster, just as it does when GPUs on the same nodes are used.
System information and environment
- OS Platform and Distribution: Ubuntu 18.04
- TensorFlow version: 2.2.0
- Python version: 3.6.12
- GCC/Compiler version (if compiling from source):
- CUDA version: 10.1
- NCCL version: 10
- cuDNN version: 10.1
- GPU model and memory: GTX 1080 Ti, 12G
- AutoDist version: github master
To Reproduce
Steps to reproduce the behavior:
Run example/linear_regression.py with the AllReduce strategy on a multi-node, CPU-only cluster using a resource spec like the one above (see the sketch under "Code snippet to reproduce the problem" below).
Code snippet to reproduce the problem
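A hedged minimal sketch, modelled on the AutoDist README's linear-regression example: the AutoDist constructor, autodist.scope(), and autodist.create_distributed_session() calls follow the README, while the data, model, and the file name resource_spec.yml are illustrative assumptions rather than the exact script.

import numpy as np
import tensorflow as tf
from autodist import AutoDist
from autodist.strategy import AllReduce

# CPU-only resource spec from above; AllReduce strategy builder as in the report.
autodist = AutoDist(resource_spec_file='resource_spec.yml',
                    strategy_builder=AllReduce())

TRUE_W, TRUE_B, NUM_EXAMPLES, EPOCHS = 3.0, 2.0, 1000, 10
inputs = np.random.randn(NUM_EXAMPLES).astype(np.float32)
outputs = inputs * TRUE_W + TRUE_B + 0.1 * np.random.randn(NUM_EXAMPLES).astype(np.float32)

with tf.Graph().as_default(), autodist.scope():
    x = tf.compat.v1.placeholder(tf.float32, shape=[NUM_EXAMPLES])
    y = tf.compat.v1.placeholder(tf.float32, shape=[NUM_EXAMPLES])
    W = tf.Variable(5.0, name='W')
    b = tf.Variable(0.0, name='b')
    loss = tf.reduce_mean(tf.square(W * x + b - y))
    train_op = tf.compat.v1.train.GradientDescentOptimizer(0.01).minimize(loss)

    # With the CPU-only spec this session run fails with the
    # scoped_allocator_1_2_CollectiveReduce error shown above;
    # with GPUs on the same nodes it trains normally.
    sess = autodist.create_distributed_session()
    for epoch in range(EPOCHS):
        loss_val, _ = sess.run([loss, train_op], feed_dict={x: inputs, y: outputs})
        print('epoch {}: loss {}'.format(epoch, loss_val))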
Additional information
The same example works on the GPUs of the same cluster.
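For comparison, a GPU-backed variant of the same node entries would presumably look like this (the gpus key follows the AutoDist resource spec convention; the device indices are assumptions):

nodes:
  - address: X.X.X.X
    gpus: [0]   # assumed device index; with GPU entries the example trains normally
    chief: true
  - address: X.X.X.X
    gpus: [0]
    ssh_config: conf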