mxnet-gluon example (with RDMA): create gluon data loader before initializing byteps #56

juncgu · 2019-07-11T04:16:14Z

As a follow-up to #54: the mxnet-gluon example crashes (the server crashes in the RDMA layer) when BytePS uses RDMA (DMLC_ENABLE_RDMA=1).

The example initializes BytePS (with RDMA) and only then creates the gluon data loader (code):

# Initialize BytePS
bps.init()

# BytePS: pin context to local rank
context = mx.cpu(bps.local_rank()) if args.no_cuda else mx.gpu(bps.local_rank())
num_workers = bps.size()

# Load training and validation data
train_data, val_data, train_size = get_mnist_iterator()

As explained by @ymjiang in #54, the default (multi-process) gluon data loader forks its worker processes from the process that has already initialized RDMA, which may lead to unexpected errors (reason). Therefore, one very simple solution is to create the gluon data loader before calling bps.init(), which is what this PR does.
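The ordering described above can be sketched in isolation. This is a minimal illustration, not the code from the example script: `setup_training`, `fake_create_data_loader`, and `fake_init_communication` are hypothetical stand-ins that only record the order in which they are called. In the real example the two steps would be `get_mnist_iterator()` and `bps.init()`. The sketch also assumes the data loader starts its worker processes when it is constructed.

```python
from typing import Any, Callable, List


def setup_training(
    create_data_loader: Callable[[], Any],
    init_communication: Callable[[], None],
) -> Any:
    """Create the (forking) data loader first, then initialize communication."""
    # Step 1: any worker processes are forked here, while the parent
    # process does not yet hold RDMA resources.
    loader = create_data_loader()
    # Step 2: only now bring up the fork-sensitive communication layer.
    init_communication()
    return loader


# Hypothetical stand-ins: they record call order instead of doing real work.
call_order: List[str] = []


def fake_create_data_loader() -> List[str]:
    call_order.append("create_data_loader")
    return ["batch0", "batch1"]


def fake_init_communication() -> None:
    call_order.append("init_communication")


loader = setup_training(fake_create_data_loader, fake_init_communication)
print(call_order)
```

Passing the two steps in as callables keeps the ordering in one place, so a later edit to the training script is less likely to move the initialization back ahead of the loader by accident.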

Thank you.

ymjiang

LGTM. Thanks!

get gluon data loader before bps(RDMA) init

28ac3e7

Following bytedance#54, the default gluon data loader forks processes from the RDMA process, which may lead to unexpected errors.

ymjiang approved these changes Jul 11, 2019

changlan approved these changes Jul 11, 2019

ymjiang merged commit 31a1546 into bytedance:master Jul 11, 2019

ymjiang mentioned this pull request Oct 11, 2019

Failure in mxnet-gluon example (RDMA) #54

Closed

DeruiLiu mentioned this pull request Aug 6, 2020

some question about to start server. Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282

Closed

pleasantrabbit pushed a commit that referenced this pull request Aug 13, 2020

hotfix: fix typo (#56)

ced6f2b
