Describe the bug
Turning on asynchronous training (export BYTEPS_ENABLE_ASYNC=1) crashes the BytePS server during SendPullResponse in byteps/server/server.cc. The fatal check (Check failed: updates.merged.tensor init 524288 first) suggests a pull request reaches the server before the corresponding push has initialized the merged tensor for that key.
Expected behavior
Training should run error-free, just as it does with synchronous training.
Stack trace from the crashed server
The trace below was produced by turning on BYTEPS_ENABLE_GDB, setting BYTEPS_LOG_LEVEL to INFO, and setting PS_VERBOSE to 2.
root@025575bc0d1646e18ff8810e88261f03000000:/# bpslaunch
BytePS launching server
Command: gdb -ex 'run' -ex 'bt' -batch --args python3.8 -c 'import byteps.server'
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[2021-01-18 03:00:55.980567: I byteps/common/compressor/compressor_registry.cc:28] dithering_compressor compressor is registered
[2021-01-18 03:00:55.980615: I byteps/common/compressor/compressor_registry.cc:28] onebit_compressor compressor is registered
[2021-01-18 03:00:55.980635: I byteps/common/compressor/compressor_registry.cc:28] randomk_compressor compressor is registered
[2021-01-18 03:00:55.980650: I byteps/common/compressor/compressor_registry.cc:28] topk_compressor compressor is registered
[2021-01-18 03:00:55.980663: I byteps/common/compressor/compressor_registry.cc:28] vanilla_ef compressor is registered
[03:00:55] byteps/server/server.cc:419: BytePS server is enabled asynchronous training
[03:00:55] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[New Thread 0x7f0da1300700 (LWP 27)]
[03:00:55] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[03:00:55] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[New Thread 0x7f0da0aff700 (LWP 28)]
[New Thread 0x7f0da02fe700 (LWP 29)]
[New Thread 0x7f0d9fafd700 (LWP 30)]
[New Thread 0x7f0d9f2fc700 (LWP 31)]
[New Thread 0x7f0d9eafb700 (LWP 32)]
[New Thread 0x7f0d9e2fa700 (LWP 33)]
[[03:00:55] src/van.cc:389: Bind to role=server, ip=10.0.0.4, port=33261, is_recovery=0
03:00:55] src/./zmq_van.h:287: Start ZMQ recv thread
[New Thread 0x7f0d9daf9700 (LWP 34)]
[03:00:55] src/van.cc:478: ? => 1. Meta: request=0, timestamp=0, control={ cmd=ADD_NODE, node={ role=server, ip=10.0.0.4, port=33261, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:503: 1 => 2147483647. Meta: request=0, timestamp=4, control={ cmd=ADD_NODE, node={ role=worker, id=9, ip=10.0.0.8, port=60029, is_recovery=0 role=worker, id=11, ip=10.0.0.6, port=39183, is_recovery=0 role=worker, id=13, ip=10.0.0.5, port=60011, is_recovery=0 role=server, id=8, ip=10.0.0.4, port=33261, is_recovery=0 role=worker, id=15, ip=10.0.0.4, port=52559, is_recovery=0 role=scheduler, id=1, ip=10.0.0.4, port=12345, is_recovery=0 } }. THIS IS NOT DATA MSG!
[03:01:26] src/van.cc:338: S[8] is connected to others
[New Thread 0x7f0d9d2f8700 (LWP 35)]
[03:01:26] src/van.cc:478: ? => 1. Meta: request=1, timestamp=1, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[Thread 0x7f0d9d2f8700 (LWP 35) exited]
[03:01:27] src/van.cc:503: 1 => 8. Meta: request=0, timestamp=10, control={ cmd=BARRIER, barrier_group=0 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:478: ? => 1. Meta: request=1, timestamp=2, control={ cmd=BARRIER, barrier_group=7 }. THIS IS NOT DATA MSG!
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[New Thread 0x7f0d9caf7700 (LWP 36)]
[New Thread 0x7f0d7ffff700 (LWP 37)]
[New Thread 0x7f0d7f7fe700 (LWP 38)]
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140364908230080
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139792355468736
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=140285773407680
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=0, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=139663003149760
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 11. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 15. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 13 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0, data_type={ UINT64 OTHER INT32 } Body: data_size=8 data_size=40 data_size=4
[03:01:27] src/van.cc:478: ? => 13. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:478: ? => 9. Meta: request=0, timestamp=1, app_id=0, customer_id=0, simple_app=0, push=1, head=0, key=0
[03:01:27] src/van.cc:503: 15 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] src/van.cc:503: 11 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] 3rdparty/ps-lite/include/dmlc/logging.h:276: [03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]
[terminate called after throwing an instance of '03:01:27] src/van.cc:dmlc::Error'
503: what(): 13 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
[03:01:27] byteps/server/server.cc:52: Check failed: updates.merged.tensor init 524288 first
Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x229fc) [0x7f0da25bc9fc]
[bt] (1) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x22e3d) [0x7f0da25bce3d]
[bt] (2) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::SendPullResponse(byteps::server::DataHandleType, unsigned long, ps::KVMeta const&, ps::KVServer<char>*)+0x2b2) [0x7f0da25b6472]
[bt] (3) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(byteps::server::BytePSHandler(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)+0x8f2) [0x7f0da25b8562]
[bt] (4) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x24b8e) [0x7f0da25beb8e]
[bt] (5) /usr/local/lib/python3.8/dist-packages/byteps-0.2.5-py3.8-linux-x86_64.egg/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so(+0x41d89) [0x7f0da25dbd89]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd6df) [0x7f0da1c9a6df]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7f0da39046db]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f0da3c3da3f]
[03:01:27] src/van.cc:503: 9 => 8. Meta: request=1, timestamp=2, app_id=0, customer_id=0, simple_app=0, push=0, head=0, key=0, data_type={ UINT64 OTHER } Body: data_size=8 data_size=0
Thread 2 "python3.8" received signal SIGABRT, Aborted.
[Switching to Thread 0x7f0da1300700 (LWP 27)]
0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007f0da3b5af47 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f0da3b5c8b1 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f0da1c69957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f0da1c6fae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f0da1c6fb21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f0da1c6fd54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f0da25bcf27 in dmlc::LogMessageFatal::~LogMessageFatal (this=0x7f0da12ff310, __in_chrg=<optimized out>) at 3rdparty/ps-lite/include/dmlc/logging.h:277
#7 0x00007f0da25b6472 in byteps::server::SendPullResponse (type=..., type@entry=..., key=<optimized out>, req_meta=..., server=server@entry=0x2018b60) at byteps/server/server.cc:52
#8 0x00007f0da25b8562 in byteps::server::BytePSHandler (req_meta=..., req_data=..., server=0x2018b60) at byteps/server/server.cc:376
#9 0x00007f0da25beb8e in std::function<void (ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*)>::operator()(ps::KVMeta const&, ps::KVPairs<char> const&, ps::KVServer<char>*) const (__args#2=<optimized out>, __args#1=..., __args#0=..., this=0x2018dc8) at /usr/include/c++/7/bits/std_function.h:706
#10 ps::KVServer<char>::Process (this=0x2018b60, msg=...) at 3rdparty/ps-lite/include/ps/kv_app.h:516
#11 0x00007f0da25dbd89 in std::function<void (ps::Message const&)>::operator()(ps::Message const&) const (__args#0=..., this=0x2018e88) at /usr/include/c++/7/bits/std_function.h:706
#12 ps::Customer::Receiving (this=0x2018e80) at src/customer.cc:64
#13 0x00007f0da1c9a6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x00007f0da39046db in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007f0da3c3da3f in clone () from /lib/x86_64-linux-gnu/libc.so.6
root@025575bc0d1646e18ff8810e88261f03000000:/#
To Reproduce
Steps to reproduce the behavior:
- Build the docker image using a local Dockerfile, modified from the one provided in the repo with a few changes. The BytePS built in the image is vanilla BytePS (commit 69a3d) with only some extra print statements added.
- Set up a 4-node cluster. In my case, I used Azure ML for a 4-node cluster (STANDARD_NC24, 4 x NVIDIA Tesla K80).
- Start 4 worker containers (one on each of the four nodes), plus 1 server container and 1 scheduler container on node 0. Then run the training scripts. These steps are exactly the same as those in [Question] Does replacing torch.distributed.all_reduce with BytePS impact the training curve? #356, except that the async training knob (BYTEPS_ENABLE_ASYNC) is turned on. The exact steps follow below.
Building the docker image
# cd into the directory with Dockerfile and do
export BYTEPS_WITHOUT_MXNET=1; export BYTEPS_WITHOUT_TENSORFLOW=1; sudo docker build -t bytepsimage/pytorch . -f Dockerfile --build-arg FRAMEWORK=pytorch
BytePS setup
# scheduler on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=scheduler --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name scheduler bytepsimage/pytorch
# server on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=server --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name server bytepsimage/pytorch
# 4 workers on node 0-3
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=0 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=0 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker0 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=1 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=1 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker1 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=2 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=2 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker2 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=3 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=3 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=1 --name worker3 bytepsimage/pytorch
# for each node, go inside the container (server, scheduler, worker 0, ..., worker 3)
sudo docker exec -it {replace_me_with_name_of_container} bash
# use bpslaunch (no arguments) to start the server/scheduler; for the workers, run one of:
bpslaunch python3.8 bps_issue.py --bps -pindex 0
bpslaunch python3.8 bps_issue.py --bps -pindex 1
bpslaunch python3.8 bps_issue.py --bps -pindex 2
bpslaunch python3.8 bps_issue.py --bps -pindex 3
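bps_issue.py itself is not reproduced here. For orientation, a minimal stand-in that exercises the same push/pull path might look like the hypothetical sketch below, modeled on the BytePS PyTorch examples (byteps.torch); the model, batch data, and the --bps/-pindex flag handling are placeholders, not the contents of the actual script:

import argparse

import torch
import torch.nn as nn
import byteps.torch as bps

parser = argparse.ArgumentParser()
parser.add_argument("--bps", action="store_true")    # placeholder for the --bps flag above
parser.add_argument("-pindex", type=int, default=0)  # placeholder for -pindex
args = parser.parse_args()

bps.init()  # picks up the DMLC_* / BYTEPS_* variables set on the container
torch.cuda.set_device(bps.local_rank())

model = nn.Linear(512, 256).cuda()  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Route gradient push/pull through the BytePS server
optimizer = bps.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical weights
bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    x = torch.randn(32, 512).cuda()  # placeholder batch
    optimizer.zero_grad()
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()  # gradients are pushed and pulled here

With BYTEPS_ENABLE_ASYNC=1 set on all containers, each optimizer.step() generates the push/pull traffic that reaches SendPullResponse on the server, which is where the crash above occurs.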
Environment (please complete the following information):
- OS: Linux 4.15.0-1100-azure x86_64
- GCC version: Ubuntu 5.4.0-6ubuntu1~16.04.12
- CUDA and NCCL version: CUDA 11.0, NCCL 2.7.8-1
- Framework (TF, PyTorch, MXNet): PyTorch
A few other things
- Thank you in advance for your time to look at this issue, and thank you for open-sourcing this and being so responsive in the issues section!