install failed #447

@themoonstone

Describe the bug
A fatal error occurred when I built a Docker image from my Dockerfile.

To Reproduce
Steps to reproduce the behavior:

  1. The content of my Dockerfile:
COPY ../byteps ./byteps
RUN ls -alh ./byteps
ARG https_proxy
ARG http_proxy

ARG BYTEPS_BASE_PATH=/usr/local
ARG BYTEPS_PATH=$BYTEPS_BASE_PATH/byteps
ARG BYTEPS_GIT_LINK=https://github.com/bytedance/byteps
ARG BYTEPS_BRANCH=master

ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
        build-essential \
        tzdata \
        ca-certificates \
        git \
        curl \
        wget \
        vim \
        cmake \
        lsb-release \
        libnuma-dev \
        ibverbs-providers \
        librdmacm-dev \
        ibverbs-utils \
        rdmacm-utils \
        libibverbs-dev \
        python3 \
        python3-dev \
        python3-pip \
        python3-setuptools \
        libnccl2=2.21.5-1+cuda12.2 \
        libnccl-dev=2.21.5-1+cuda12.2
#COPY --from=builder /etc/reslov.conf /etc/reslov.conf
# install framework
# note: for tf <= 1.14, you need gcc-4.9
RUN g++ --version
ARG FRAMEWORK=tensorflow
RUN if [ "$FRAMEWORK" = "tensorflow" ]; then \
        pip3 install --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple pip; \
        pip3 install tensorflow==2.5.0 -i https://pypi.tuna.tsinghua.edu.cn/simple; \
        pip3 install --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple setuptools; \
    elif [ "$FRAMEWORK" = "pytorch" ]; then \
        pip3 install -U numpy==1.18.1 torchvision==0.5.0 torch==1.4.0; \
    elif [ "$FRAMEWORK" = "mxnet" ]; then \
        pip3 install -U mxnet-cu100==1.5.0; \
    else \
        echo "unknown framework: $FRAMEWORK"; \
        exit 1; \
    fi
RUN ls -lh /byteps
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH
RUN cd $BYTEPS_BASE_PATH &&\
#COPY --form=builder /home/albert/tanyi4/github.com/bytedance/byteps $BYTEPS_PATH
#    git clone --recursive -b $BYTEPS_BRANCH $BYTEPS_GIT_LINK &&\
    cp /byteps ./byteps -r && \
    cd $BYTEPS_PATH &&\ 
    python3 setup.py install
  2. Then build the image: docker build -t bytepsimage/tensorflow . -f Dockerfile --build-arg FRAMEWORK=tensorflow
  3. The error log is as follows:
    Libraries have been installed in:
    #13 78.85 | ^~~~~~~~
    #13 78.88 byteps/server/server.cc: In function ‘void byteps::server::BytePSHandler(const ps::KVMeta&, const ps::KVPairs&, ps::KVServer)’:
    #13 78.88 byteps/server/server.cc:350:15: warning: unused variable ‘update’ [-Wunused-variable]
    #13 78.88 350 | auto& update = updates->merged;
    #13 78.88 | ^~~~~~
    #13 78.94 In file included from 3rdparty/ps-lite/include/ps/ps.h:13,
    #13 78.94 from byteps/server/server.h:24,
    #13 78.94 from byteps/server/server.cc:16:
    #13 78.94 3rdparty/ps-lite/include/ps/kv_app.h: In instantiation of ‘ps::KVServer::KVServer(int, bool, int) [with Val = char]’:
    #13 78.94 byteps/server/server.cc:501:62: required from here
    #13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: warning: ‘new’ of type ‘ps::Customer’ with extended alignment 64 [-Waligned-new=]
    #13 78.94 354 | this->obj_ = new Customer(
    #13 78.94 | ^~~~~~~~~~~~~
    #13 78.94 355 | app_id, app_id, std::bind(&KVServer::Process, this, 1), postoffice);
    #13 78.94 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    #13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: note: uses ‘void operator new(std::size_t)’, which does not have an alignment parameter
    #13 78.94 3rdparty/ps-lite/include/ps/kv_app.h:354:18: note: use ‘-faligned-new’ to enable C++17 over-aligned new support
    #13 82.24 x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fwrapv -O2 build/temp.linux-x86_64-cpython-38/byteps/common/common.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/compressor_registry.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/error_feedback.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/dithering.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/onebit.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/randomk.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/topk.o build/temp.linux-x86_64-cpython-38/byteps/common/compressor/impl/vanilla_error_feedback.o build/temp.linux-x86_64-cpython-38/byteps/common/cpu_reducer.o build/temp.linux-x86_64-cpython-38/byteps/common/logging.o build/temp.linux-x86_64-cpython-38/byteps/server/server.o 3rdparty/ps-lite/build/libps.a 3rdparty/ps-lite/deps/lib/libzmq.a -L/usr/local/nccl/lib -L/usr/local/nccl/lib64 -L/usr/lib -lrdmacm -libverbs -lrt -o build/lib.linux-x86_64-cpython-38/byteps/server/c_lib.cpython-38-x86_64-linux-gnu.so -Wl,--version-script=byteps.lds -fopenmp
    #13 82.46 INFO: Unable to build TensorFlow plugin, will skip it.
    #13 82.46
    #13 82.46 Traceback (most recent call last):
    #13 82.46 File "setup.py", line 383, in check_tf_version
    #13 82.46 import tensorflow as tf
    #13 82.46 ModuleNotFoundError: No module named 'tensorflow'
    #13 82.46
    #13 82.46 During handling of the above exception, another exception occurred:
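
The traceback shows `import tensorflow` failing inside setup.py, even though an earlier Dockerfile step runs pip3 install tensorflow==2.5.0. One possible explanation, which is an assumption and not something the log confirms, is that `python3` and `pip3` resolve to different interpreters in the image, or that the pip step did not leave TensorFlow importable. A minimal sketch for checking this from the same `python3` that runs setup.py follows; `module_status` is a hypothetical helper name, not part of BytePS.

```python
import importlib.util
import sys


def module_status(name):
    """Return (found, location) for a top-level module in this interpreter.

    Hypothetical helper for illustration; it only looks the module up
    and does not import it.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return False, None
    return True, spec.origin


if __name__ == "__main__":
    # Run this with the same `python3` that executes `setup.py install`.
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    for name in ("tensorflow", "setuptools"):
        found, location = module_status(name)
        status = f"found at {location}" if found else "NOT importable"
        print(f"{name}: {status}")
```

Saved under a name of your choosing and run in a RUN step placed just before the setup.py step, this prints which interpreter is in use and whether TensorFlow is visible to it. If TensorFlow is reported as not importable, comparing the interpreter path against the output of pip3 --version may help narrow down the mismatch.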

Environment (please complete the following information):

  • OS: Ubuntu 20.04
  • GCC version: g++ (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20)
  • CUDA and NCCL version: CUDA 12.2.0, NCCL 2.21.5
  • Framework (TF, PyTorch, MXNet): TensorFlow 2.5.0
  • pip: 24.0
