Open
Description
Describe the bug
A fatal error occurred when I built a Docker image from my Dockerfile.
To Reproduce
Steps to reproduce the behavior:
- The content of my Dockerfile:
COPY ../byteps ./byteps
RUN ls -alh ./byteps
ARG https_proxy
ARG http_proxy
ARG BYTEPS_BASE_PATH=/usr/local
ARG BYTEPS_PATH=$BYTEPS_BASE_PATH/byteps
ARG BYTEPS_GIT_LINK=https://github.com/bytedance/byteps
ARG BYTEPS_BRANCH=master
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
tzdata \
ca-certificates \
git \
curl \
wget \
vim \
cmake \
lsb-release \
libnuma-dev \
ibverbs-providers \
librdmacm-dev \
ibverbs-utils \
rdmacm-utils \
libibverbs-dev \
python3 \
python3-dev \
python3-pip \
python3-setuptools \
libnccl2=2.21.5-1+cuda12.2 \
libnccl-dev=2.21.5-1+cuda12.2
#COPY --from=builder /etc/reslov.conf /etc/reslov.conf
# install framework
# note: for tf <= 1.14, you need gcc-4.9
RUN g++ --version
ARG FRAMEWORK=tensorflow
RUN if [ "$FRAMEWORK" = "tensorflow" ]; then \
pip3 install --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple pip; \
pip3 install tensorflow==2.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple; \
pip3 install --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple setuptools; \
elif [ "$FRAMEWORK" = "pytorch" ]; then \
pip3 install -U numpy==1.18.1 torchvision==0.5.0 torch==1.4.0; \
elif [ "$FRAMEWORK" = "mxnet" ]; then \
pip3 install -U mxnet-cu100==1.5.0; \
else \
echo "unknown framework: $FRAMEWORK"; \
exit 1; \
fi
RUN ls -lh /byteps
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH
RUN cd $BYTEPS_BASE_PATH &&\
#COPY --form=builder /home/albert/tanyi4/github.com/bytedance/byteps $BYTEPS_PATH
# git clone --recursive -b $BYTEPS_BRANCH $BYTEPS_GIT_LINK &&\
cp /byteps ./byteps -r && \
cd $BYTEPS_PATH &&\
python3 setup.py install
- Then I built the image: docker build -t bytepsimage/tensorflow . -f Dockerfile --build-arg FRAMEWORK=tensorflow
- The error log is as follows:
Libraries have been installed in:
#13 78.85 | ^~~~~~~~
#13 78.88 byteps/server/server.cc: In function ‘void byteps::server::BytePSHandler(const ps::KVMeta&, const ps::KVPairs&, ps::KVServer)’:
#13 78.88 byteps/server/server.cc:350:15: warning: unused variable ‘update’ [-Wunused-variable]
#13 78.88 350 | auto& update = updates->merged;
#13 78.88 | ^~~~~~
#13 78.94 In file included from 3rdparty/ps-lite/include/ps/ps.h:13,
#13 78.94 from byteps/server/server.h:24,
#13 78.94 from byteps/server/server.cc:16:
#13 78.94 3rdparty/ps-lite/include/ps/kv_app.h: In instantiation of ‘ps::KVServer::KVServer(int, bool, int) [with Val = char]’:
#13 78.94 byteps/server/server.cc:501:62: required from here
#13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: warning: ‘new’ of type ‘ps::Customer’ with extended alignment 64 [-Waligned-new=]
#13 78.94 354 | this->obj_ = new Customer(
#13 78.94 | ^~~~~~~~~~~~~
#13 78.94 355 | app_id, app_id, std::bind(&KVServer::Process, this, 1), postoffice);
#13 78.94 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: note: uses ‘void operator new(std::size_t)’, which does not have an alignment parameter
#13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
#13 82.24 x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 build/temp.linux-x86_64-cpython-38/byteps/common/common.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/compressor_registry.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/error_feedback.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/dithering.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/onebit.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/randomk.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/topk.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/vanilla_error_feedback.o build/temp.linux-x86_64-cpython-38/byteps/common/cpu_reducer.o build/temp.linux-x86_64-cpython-38/byteps/common/logging.o build/temp.linux-x86_64-cpython-38/byteps/server/server.o 3rdparty/ps-lite/build/libps.a 3rdparty/ps-lite/deps/lib/libzmq.a -L/usr/local/nccl/lib -L/usr/local/nccl/lib64 -L/usr/lib -lrdmacm -libverbs -lrt -o build/lib.linux-x86_64-cpython-38/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so -Wl,--version-script=byteps.lds -fopenmp
#13 82.46 INFO: Unable to build TensorFlow plugin, will skip it.
#13 82.46
#13 82.46 Traceback (most recent call last):
#13 82.46 File "setup.py", line 383, in check_tf_version
#13 82.46 import tensorflow as tf
#13 82.46 ModuleNotFoundError: No module named 'tensorflow'
#13 82.46
#13 82.46 During handling of the above exception, another exception occurred:
Environment (please complete the following information):
- OS: Ubuntu 20.04
- GCC version: g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
- CUDA and NCCL version: CUDA 12.2.0, NCCL 2.21.5
- Framework (TF, PyTorch, MXNet): TensorFlow 2.5.0
- pip: 24.0
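One way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the pip3 commands in the framework step of the Dockerfile above are joined with `;`, so a failed TensorFlow install would not stop the build, and setup.py would only hit the missing module later. The helper below checks whether a module is importable by the interpreter that runs setup.py. The name `check_module` is hypothetical and not part of BytePS.

```shell
# Hypothetical helper (not part of BytePS): report whether MODULE can be
# imported by INTERPRETER. Returns 0 on success and 1 otherwise.
check_module() {
    interpreter="$1"
    module="$2"
    if "$interpreter" -c "import $module" >/dev/null 2>&1; then
        echo "ok: $module is importable by $interpreter"
    else
        echo "missing: $module is not importable by $interpreter"
        return 1
    fi
}

# Example use inside the image, before running setup.py:
#   check_module python3 tensorflow || exit 1
```

In the Dockerfile itself, the same check could be a single step placed right after the framework install, for example `RUN python3 -c "import tensorflow as tf; print(tf.__version__)"`. Joining the pip3 commands with `&&` instead of `;` would also make that step fail at the first install that does not succeed.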