Cannot run trainDopamine.py in Docker container on AWS


#1

Hello.

I followed the Cloud Training Guide.
When I ran the command below,

docker run --runtime=nvidia test-training python trainDopamine.py

I got the following error:

Traceback (most recent call last):
  File "trainDopamine.py", line 3, in <module>
    from dopamine.agents.rainbow import rainbow_agent
  File "/opt/conda/envs/python36/lib/python3.6/site-packages/dopamine/agents/rainbow/rainbow_agent.py", line 44, in <module>
    from dopamine.agents.dqn import dqn_agent
  File "/opt/conda/envs/python36/lib/python3.6/site-packages/dopamine/agents/dqn/dqn_agent.py", line 29, in <module>
    from dopamine.replay_memory import circular_replay_buffer
  File "/opt/conda/envs/python36/lib/python3.6/site-packages/dopamine/replay_memory/circular_replay_buffer.py", line 36, in <module>
    import gin.tf
  File "/opt/conda/envs/python36/lib/python3.6/site-packages/gin/tf/__init__.py", line 68, in <module>
    from gin.tf.utils import GinConfigSaverHook
  File "/opt/conda/envs/python36/lib/python3.6/site-packages/gin/tf/utils.py", line 34, in <module>
    config.register_file_reader(tf.io.gfile.GFile, tf.io.gfile.exists)
AttributeError: module 'tensorflow._api.v1.io' has no attribute 'gfile'

Do you have any clue why this happened?

My environment:
A p2.xlarge instance created from the Deep Learning Base AMI,
exactly as the guide suggests.

Dockerfile:

FROM nvidia/cuda:9.0-cudnn7-runtime-ubuntu16.04

RUN apt-get clean && apt-get update && apt-get install -y locales
RUN echo "en_US.UTF-8 UTF-8" > /etc/locale.gen && \
    locale-gen
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV SHELL /bin/bash

RUN apt-get update && \
    apt-get install -y curl bzip2 xvfb ffmpeg git libxrender1

WORKDIR /aaio

RUN curl -o ~/miniconda.sh -O  https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && \
     chmod +x ~/miniconda.sh && \
     ~/miniconda.sh -b -p /opt/conda && \
     rm ~/miniconda.sh && \
     /opt/conda/bin/conda clean -ya && \
     /opt/conda/bin/conda create -n python36 python=3.6 numpy

ENV PATH /opt/conda/envs/python36/bin:/opt/conda/envs/bin:$PATH

RUN pip install animalai

COPY agent.py /aaio/agent.py
COPY data /aaio/data

ENV HTTP_PROXY ""
ENV HTTPS_PROXY ""
ENV http_proxy ""
ENV https_proxy ""

########################################################################################################################
RUN git clone https://github.com/beyretb/AnimalAI-Olympics.git
RUN pip uninstall --yes tensorflow
RUN pip install tensorflow-gpu==1.12.2
RUN apt-get install unzip wget
RUN wget https://www.doc.ic.ac.uk/~bb1010/animalAI/env_linux_v1.0.0.zip
RUN mv env_linux_v1.0.0.zip AnimalAI-Olympics/env/
RUN unzip AnimalAI-Olympics/env/env_linux_v1.0.0.zip -d AnimalAI-Olympics/env/
WORKDIR /aaio/AnimalAI-Olympics/examples
RUN sed -i 's/docker_training=False/docker_training=True/g' trainDopamine.py

RUN pip install animalai-train
########################################################################################################################

CMD ["/bin/bash"]

I’d appreciate some help.


#2

Hi,

This is due to an update in the gin package that broke some dependencies. I will update the GitHub repo and the packages to fix this, but in your case you can just change the TensorFlow version:

replace RUN pip install tensorflow-gpu==1.12.2
with RUN pip install tensorflow-gpu==1.14

I still need to make sure it doesn’t break anything else, but it should fix this issue.
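
To confirm the change took effect after rebuilding the image, a quick check along these lines could be run inside the container. This is only a sketch: the helper names `has_io_gfile` and `check_installed_tensorflow` are made up for illustration and are not part of gin or TensorFlow. It simply looks for the two attributes that the traceback shows gin accessing (`tf.io.gfile.GFile` and `tf.io.gfile.exists`).

```python
import importlib


def has_io_gfile(tf_module):
    """Return True if the given module exposes io.gfile.GFile and io.gfile.exists.

    These are the two attributes gin/tf/utils.py accesses in the traceback.
    """
    io_module = getattr(tf_module, "io", None)
    gfile = getattr(io_module, "gfile", None)
    return hasattr(gfile, "GFile") and hasattr(gfile, "exists")


def check_installed_tensorflow():
    """Return (version, has_gfile) for the installed TensorFlow, or None if absent."""
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        return None
    return tf.__version__, has_io_gfile(tf)


if __name__ == "__main__":
    result = check_installed_tensorflow()
    if result is None:
        print("tensorflow is not installed in this environment")
    else:
        version, ok = result
        print("tensorflow", version, "- tf.io.gfile available:", ok)
```

If the last line reports that `tf.io.gfile` is not available, the image is most likely still using the older TensorFlow build, so the `import gin.tf` step would be expected to fail the same way.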