python/examples/alpha_zero.py crashes with CUDA_ERROR_NOT_INITIALIZED #1122

Open
@jthemphill

Description

I'm running Ubuntu 22.04 under WSL2, and I've tried both tensorflow==2.14.0 and tf-nightly==2.15.0.dev20231010. I am using Python 3.11.5, which the latest version of TensorFlow supports.

You can install TensorFlow with GPU support via pip install --extra-index-url https://pypi.nvidia.com tensorflow[and-cuda], or install the nightly version with pip install --extra-index-url https://pypi.nvidia.com tf-nightly[and-cuda]. Note that without the --extra-index-url flag the installation fails, because TensorFlow 2.14.0 depends on specific versions of tensorrt and tensorrt-lib that are not in the public PyPI repository.

I verified that my graphics card is visible to the WSL2 container:

~/open_spiel$ nvidia-smi
Tue Oct 10 22:58:49 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.120                Driver Version: 537.58       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1080        On  | 00000000:01:00.0  On |                  N/A |
| 35%   52C    P0              35W / 180W |    962MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        20      G   /Xwayland                                 N/A      |
|    0   N/A  N/A        23      G   /Xwayland                                 N/A      |
+---------------------------------------------------------------------------------------+

I also verified that TensorFlow itself runs correctly on my GPU: the following script prints its result, and my GPU's utilization spikes while it runs:

import tensorflow as tf

tf.debugging.set_log_device_placement(True)

# Create some tensors
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

print(c)
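
A quicker complementary check, sketched here under the assumption that TensorFlow is installed in the active environment, is to ask TensorFlow which GPUs it can see. The helper name visible_gpus is hypothetical; tf.config.list_physical_devices is part of the public TensorFlow 2 API.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TF is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    # Each entry is a PhysicalDevice; its .name looks like "/physical_device:GPU:0".
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    print(visible_gpus())
```

An empty list would mean TensorFlow imported but found no GPU, which is a different failure from the one reported here.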

But even though TensorFlow works with my graphics card, alpha_zero.py fails:

~/open_spiel$ python open_spiel/python/examples/alpha_zero.py 
2023-10-10 22:51:42.689219: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-10-10 22:51:42.689281: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-10-10 22:51:42.690266: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-10-10 22:51:42.695684: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-10-10 22:51:43.360880: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-10-10 22:51:44.101936: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127175: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.127253: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Starting game connect_four
Writing logs and checkpoints to: /tmp/az-2023-10-10-22-51-connect_four-87c21nuk
Model type: resnet(128, 10)
actor-0 started
actor-1 started
learner started
[2023-10-10 22:51:44.141] Initializing model
evaluator-0 started
Exception caught in evaluator-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
evaluator-0 exiting
Process Process-3:
Exception caught in actor-0: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-0 exiting
Process Process-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 287, in evaluator
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
2023-10-10 22:51:44.231365: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231499: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:44.231562: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
Exception caught in actor-1: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
actor-1 exiting
Process Process-2:
Traceback (most recent call last):
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/jhemphill/miniconda3/envs/tf/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 171, in _watcher
    return fn(config=config, logger=logger, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 268, in actor
    model = _init_model_from_config(config)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/alpha_zero.py", line 148, in _init_model_from_config
    return model_lib.Model.build_model(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 173, in build_model
    cls._define_graph(model_type, input_shape, output_size, nn_width,
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 241, in _define_graph
    torso = cascade(observations, [
            ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 28, in cascade
    x = fn(x)
        ^^^^^
  File "/home/jhemphill/oss/open_spiel/open_spiel/python/algorithms/alpha_zero/model.py", line 50, in batch_norm_layer
    applied = bn(x, training)
              ^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/engine/base_layer_v1.py", line 838, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 690, in wrapper
    return converted_call(f, args, kwargs, options=options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 377, in converted_call
    return _call_unconverted(f, args, kwargs, options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 459, in _call_unconverted
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 597, in call
    outputs = self._fused_batch_norm(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/layers/normalization/batch_normalization.py", line 990, in _fused_batch_norm
    output, mean, variance = control_flow_util.smart_cond(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/keras/src/utils/control_flow_util.py", line 108, in smart_cond
    return tf.__internal__.smart_cond.smart_cond(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/framework/smart_cond.py", line 57, in smart_cond
    return cond.cond(pred, true_fn=true_fn, false_fn=false_fn,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/jhemphill/oss/open_spiel/venv/lib/python3.11/site-packages/tensorflow/python/eager/context.py", line 605, in ensure_initialized
    pywrap_tfe.TFE_DeleteContextOptions(opts)
tensorflow.python.framework.errors_impl.InternalError: Failed call to cuDeviceGet: CUDA_ERROR_NOT_INITIALIZED: initialization error
^C2023-10-10 22:51:45.582786: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582883: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.582914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:2017] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-10-10 22:51:45.582959: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:887] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-10-10 22:51:45.583002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1926] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6562 MB memory:  -> device: 0, name: NVIDIA GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1
[2023-10-10 22:51:45.587] learner exiting
learner exiting

<hangs at 0% GPU usage>
Caught a KeyboardInterrupt, stopping early.

AlphaZero forks actor, evaluator, and learner processes, and it is these subprocesses that fail, so I believe this is related to tensorflow/tensorflow#57877.
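
To illustrate the suspected fork-related failure mode, the sketch below starts workers with Python's spawn start method, so each child begins in a fresh interpreter instead of inheriting the parent's state through fork. This is an assumption about a possible workaround, not a confirmed fix for alpha_zero.py, and I have not tested it against OpenSpiel. The names worker and launch_workers are hypothetical, and the TensorFlow import is left as a comment so the sketch runs without a GPU.

```python
import multiprocessing as mp
import os


def worker(name):
    # In a real actor or evaluator, TensorFlow would be imported here, inside
    # the child, so that CUDA is first initialized in this process
    # (assumption: `import tensorflow as tf` would go on the next line).
    print(f"{name} running in pid {os.getpid()}")


def launch_workers(names, start_method="spawn"):
    # get_context returns a context bound to one start method and leaves the
    # global default start method untouched.
    ctx = mp.get_context(start_method)
    procs = [ctx.Process(target=worker, args=(n,), name=n) for n in names]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Exit code 0 means the child finished without an uncaught exception.
    return {p.name: p.exitcode for p in procs}


if __name__ == "__main__":
    print(launch_workers(["actor-0", "actor-1", "evaluator-0"]))
```

With spawn, the target function and its arguments must be picklable and the launch must sit behind the `if __name__ == "__main__":` guard, which is why the sketch is structured this way.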

Metadata

Assignees

No one assigned

    Labels

    Windows: This is about support on the Windows platform

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
