TensorFlow Installation Notes

Category: Default, Programming | 2016-08-17 | Views: 1,789

Preface

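I recently took several open courses on deep learning but still wasn't satisfied; I kept feeling I should get my hands on a real framework. So which one: Caffe, TensorFlow, Torch, or something else? After some comparison I chose TensorFlow. First, it is a more general-purpose framework with the best Python support; second, it is backed by Google and is open source. I expect it to become popular in both academia and industry.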

Installation: A Live Log

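First I needed a dual-boot setup on my Windows 10 machine (I skipped a virtual machine because VMs make poor use of the GPU and similar hardware), so I installed Ubuntu (version 14.10). I won't cover how, since there are plenty of guides online. Next comes TensorFlow itself. The official site offers five installation methods: pip, Docker, Anaconda, Virtualenv, and building from source. Since Anaconda bundles a large number of scientific computing libraries that should be useful later on, I chose the Anaconda-based installation.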

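1. First, download the matching Anaconda installer from the Anaconda download page.

2. cd into the download directory and run: bash Anaconda2-4.1.1-Linux-x86_64.sh

Then follow the prompts, which ask for the install directory and so on. The installer also automatically installs Python 2.7.12, beautifulsoup, ipython, and more: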

installing: python-2.7.12-1 ...
installing: _nb_ext_conf-0.2.0-py27_0 ...
installing: alabaster-0.7.8-py27_0 ...
installing: anaconda-client-1.4.0-py27_0 ...
installing: anaconda-navigator-1.2.1-py27_0 ...
installing: argcomplete-1.0.0-py27_1 ...
installing: astropy-1.2.1-np111py27_0 ...
installing: babel-2.3.3-py27_0 ...
installing: backports-1.0-py27_0 ...
installing: backports_abc-0.4-py27_0 ...
installing: beautifulsoup4-4.4.1-py27_0 ...

Note that at the end the installer asks:

Do you wish the installer to prepend the Anaconda2 install location
to PATH in your /root/.bashrc ? [yes|no]

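Answer yes here so that Anaconda is added to the PATH environment variable and can be used right away; otherwise you have to configure PATH by hand later. Because the environment variable has changed, open a new terminal to test the installation. Typing python in the new terminal shows: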

Python 2.7.12 |Anaconda 4.1.1 (64-bit)| (default, Jul  2 2016, 17:42:40) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://anaconda.org

So the installation clearly succeeded.

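3. Run conda create -n tensorflow python=2.7 to create a conda environment.

4. Run source activate tensorflow to activate the environment.

5. Run pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0rc0-cp27-none-linux_x86_64.whl to install TensorFlow with GPU support.

Note that GPU support requires installing the CUDA Toolkit and the cuDNN Toolkit first (cuDNN requires registering on the official site).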
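The wheel filename in step 5 encodes which interpreter it was built for, which is why the conda environment pins python=2.7. Below is a minimal sketch that splits such a filename into its parts, assuming the standard wheel naming convention (PEP 427); `parse_wheel_filename` is a hypothetical helper written for this note, not part of pip.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components.

    Hypothetical helper for illustration; pip does this internally.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: {0}".format(filename))
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag may sit between version and python tag.
        del parts[2]
    if len(parts) != 5:
        raise ValueError("unexpected wheel name: {0}".format(filename))
    name, version, python_tag, abi_tag, platform_tag = parts
    return {
        "name": name,
        "version": version,
        "python_tag": python_tag,      # e.g. cp27 means CPython 2.7
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,  # e.g. linux_x86_64
    }


info = parse_wheel_filename("tensorflow-0.8.0rc0-cp27-none-linux_x86_64.whl")
print(info["version"], info["python_tag"], info["platform_tag"])
```

A cp27 wheel will not install into a Python 3 environment, so the python= value passed to conda create has to match this tag.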

6. After the install finishes, start python and run:

import tensorflow as tf

That produced a pile of errors:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/__init__.py", line 23, in <module>
    from tensorflow.python import *
  File "/root/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/__init__.py", line 45, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/root/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 28, in <module>
    _pywrap_tensorflow = swig_import_helper()
  File "/root/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/pywrap_tensorflow.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow', fp, pathname, description)
ImportError: libcudart.so.7.5: cannot open shared object file: No such file or directory

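It looks like this is because CUDA was not installed properly yet. Downloading the full CUDA Toolkit mentioned in step 5 was too slow (an estimated 10 hours), so I used the network installer instead. First download the repository package, then run: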

dpkg -i cuda-repo-ubuntu1410_7.0-28_amd64.deb 
apt-get update
apt-get install cuda 

This takes only about 20 minutes. Once installed, CUDA should be under /usr/local/. Next, install the cuDNN Toolkit; cd into its download directory:

tar xvzf cudnn-7.0-linux-x64-v3.0-prod.tgz
cp cuda/include/cudnn.h  /usr/local/cuda/include
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64

Then set the LD_LIBRARY_PATH and CUDA_HOME environment variables. You can add the commands below to ~/.bashrc so that they take effect automatically on every login:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda

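7. Testing

Testing still gave the same error: libcudart.so.7.5 could not be found. I searched the disk for the file with locate libcudart.so.7.5, and indeed it was not there. My CUDA version was probably too old. After cd /usr/local/cuda/lib64 I found libcudart.so.7.0.28 rather than libcudart.so.7.5.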
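The manual search above can also be done from Python. This is a minimal sketch, assuming CUDA lives under /usr/local/cuda/lib64 as in this log; `find_cudart_versions` is a hypothetical helper name.

```python
import glob
import os


def find_cudart_versions(lib_dir="/usr/local/cuda/lib64"):
    """Return the sorted libcudart.so* file names present in lib_dir."""
    pattern = os.path.join(lib_dir, "libcudart.so*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    found = find_cudart_versions()
    wanted = "libcudart.so.7.5"  # the name from the ImportError
    print("found: {0}".format(found))
    print("{0} present: {1}".format(wanted, wanted in found))
```

If the list shows only 7.0 files, as it did here, the installed toolkit is older than the one the TensorFlow wheel was built against.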

8. Reinstalling the CUDA Toolkit

apt-get remove cuda
apt-get autoremove
# Download http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/cuda-repo-ubuntu1404_7.5-18_amd64.deb
apt-get remove cuda-repo-ubuntu1410
dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb  # without the previous step, dpkg fails trying to overwrite /etc/apt/sources.list.d/cuda.list, which also belongs to package cuda-repo-ubuntu1410 7.0-28
apt-get update
sudo apt-get install cuda
# Error: cuda : Depends: cuda-7-5 (= 7.5-18) but it is not going to be installed
# E: Unable to correct problems, you have held broken packages.

What a mess. Let's start over from scratch.

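  1. Same as above
  2. Same as above
  3. conda create -n tensor python=2.7
  4. source activate tensor
  5. Install the CUDA Toolkit: download it first, then cd into the download directory:
    dpkg -i cuda-repo-ubuntu1404_7.5-18_amd64.deb
    apt-get update
    apt-get install cuda
    # Error: cuda : Depends: cuda-7-5 (= 7.5-18) but it is not going to be installed
    # E: Unable to correct problems, you have held broken packages.
    # Unbelievable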

Installing the wrong version is a real pain. Time to clean up the system:

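apt-get --purge remove nvidia-*  # completely remove the NVIDIA packages
rm -rf anaconda2
# In .bashrc, delete the line that adds anaconda to PATH
# Still no luck, same error: cuda : Depends: cuda-7-5 (= 7.5-18) but it is not going to be installed
# E: Unable to correct problems, you have held broken packages.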

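I couldn't sort it out, so I switched to the local installer: download CUDA and cuDNN. Oddly, the download was very slow on Ubuntu but much faster on Windows, so I downloaded the files on Windows and copied them over to Ubuntu.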

Installation: The Bug-Free Version

1.

Since the package dependency problem could not be resolved, I reinstalled the system with Ubuntu 14.04.5.

2.

Download CUDA and cuDNN, then cd into the download directory:

    dpkg -i cuda-repo-ubuntu1404-7-5-local_7.5-18_amd64.deb
    sudo apt-get update
    sudo apt-get install cuda
    # Wait a moment, then set up cuDNN
    tar xvzf cudnn-7.5-linux-x64-v5.0-ga.tgz
    cp cuda/include/cudnn.h /usr/local/cuda/include
    cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
    chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

3.

Edit .bashrc and add:

    export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
    export CUDA_HOME=/usr/local/cuda
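After opening a new terminal, it can be worth confirming that the two directories really ended up in LD_LIBRARY_PATH. The sketch below is one way to check, assuming the same /usr/local/cuda paths as in the export lines above; `missing_library_dirs` is a hypothetical helper name.

```python
import os

# Directories taken from the export lines above.
REQUIRED_DIRS = [
    "/usr/local/cuda/lib64",
    "/usr/local/cuda/extras/CUPTI/lib64",
]


def missing_library_dirs(ld_library_path, required=REQUIRED_DIRS):
    """Return the required directories absent from an LD_LIBRARY_PATH string."""
    entries = [entry.rstrip("/") for entry in ld_library_path.split(":") if entry]
    return [d for d in required if d.rstrip("/") not in entries]


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    print("missing: {0}".format(missing_library_dirs(current)))
    print("CUDA_HOME: {0}".format(os.environ.get("CUDA_HOME")))
```

An empty "missing" list means the dynamic loader will search both CUDA directories in this shell.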

4.

Download Anaconda, then cd into the download directory:

    bash Anaconda2-4.1.1-Linux-x86_64.sh
Adjust the settings at the prompts; change the install directory to whatever you prefer.

5.

Open a new terminal:

    conda create -n tfgpu python=2.7
    source activate tfgpu
    pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.10.0rc0-cp27-none-linux_x86_64.whl

6.

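After installing, I rebooted and got a black screen. It is probably a dual-GPU issue. Never mind that for now; first drop into a tty to check whether TensorFlow is installed correctly.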

    Ctrl+Alt+F2  # switch to tty2 and log in
    root@mageek-ThinkPad-T550:~# source activate tfgpu
    (tfgpu) root@mageek-ThinkPad-T550:~# python
    Python 2.7.12 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:42:40) 
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Anaconda is brought to you by Continuum Analytics.
    Please check out: http://continuum.io/thanks and https://anaconda.org
    >>> import tensorflow as tf
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
    I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
    >>> sess = tf.Session()
    I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
    name: GeForce 940M
    major: 5 minor: 0 memoryClockRate (GHz) 1.124
    pciBusID 0000:08:00.0
    Total memory: 1023.88MiB
    Free memory: 997.54MiB
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
    I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
    I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:08:00.0)
    >>> 
    (tfgpu) root@mageek-ThinkPad-T550:~# source deactivate
So the installation succeeded.

7. Fixing the black screen

```
vim /etc/modprobe.d/blacklist.conf
# Add the following lines to blacklist these kernel modules
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
# Save and quit
sudo prime-select intel  # prefer the Intel integrated GPU
reboot  # after the reboot the graphical desktop comes up
```
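To double-check the edit before rebooting, the blacklist file can be parsed for the module names it contains. This is a minimal sketch, assuming the simple "blacklist <module>" line format used above; `blacklisted_modules` is a hypothetical helper name.

```python
import os


def blacklisted_modules(conf_text):
    """Return the set of module names listed in 'blacklist <module>' lines."""
    modules = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) == 2 and fields[0] == "blacklist":
            modules.add(fields[1])
    return modules


if __name__ == "__main__":
    path = "/etc/modprobe.d/blacklist.conf"
    if os.path.exists(path):
        with open(path) as handle:
            names = blacklisted_modules(handle.read())
        print("nouveau blacklisted: {0}".format("nouveau" in names))
```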

8. IPython

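At this point running ipython does start a shell, but import tensorflow fails inside it. Run conda install ipython first and then start ipython again, and it works. Only this command installs ipython into the tfgpu environment, and ipython can find tensorflow only when both live in the same environment.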
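A quick way to see which environment a Python or IPython session belongs to is to look at its interpreter prefix. The sketch below assumes conda's default layout of <install>/envs/<name>; `env_name_from_prefix` is a hypothetical helper name.

```python
import os
import sys


def env_name_from_prefix(prefix):
    """Return the conda environment name implied by a prefix, or None.

    Assumes the default conda layout <install>/envs/<name>.
    """
    parent, name = os.path.split(prefix.rstrip(os.sep))
    if os.path.basename(parent) == "envs":
        return name
    return None


if __name__ == "__main__":
    print("interpreter: {0}".format(sys.executable))
    print("environment: {0}".format(env_name_from_prefix(sys.prefix)))
```

Run inside ipython, this should report tfgpu once ipython has been installed into that environment; None suggests the session is using an interpreter outside any named environment.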

9. IDE

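IPython is already much nicer than the plain python shell, but retyping the same commands every time, such as import tensorflow as tf, is still tedious, so an IDE is needed. I recommend Komodo Edit. Download it, unpack it, cd into the directory, and run ./install.sh, then follow the prompts to set the install directory (make sure you have write permission). Mine is /usr/local/Komodo-Edit-10/. Then add it to PATH. After that, open a new terminal and run komodo to launch the IDE. Configure a few basics such as indentation and the color scheme, and it is ready for use.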

Create a new file tf1.py:

import tensorflow as tf
import numpy as np

# Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(100).astype(np.float32)
y_data = x_data * 0.1 + 0.3

# Try to find values for W and b that compute y_data = W * x_data + b
# (We know that W should be 0.1 and b 0.3, but TensorFlow will
# figure that out for us.)
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
b = tf.Variable(tf.zeros([1]))
y = W * x_data + b

# Minimize the mean squared errors.
loss = tf.reduce_mean(tf.square(y - y_data))
optimizer = tf.train.GradientDescentOptimizer(0.5)
train = optimizer.minimize(loss)

# Before starting, initialize the variables.  We will 'run' this first.
init = tf.initialize_all_variables()

# Launch the graph.
sess = tf.Session()
sess.run(init)

# Fit the line.
for step in range(201):
    sess.run(train)
    if step % 20 == 0:
        print(step, sess.run(W), sess.run(b))

# Learns best fit is W: [0.1], b: [0.3]

Run it:

# cd into the directory containing the file
source activate tfgpu
python tf1.py

Output:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:08:00.0
Total memory: 1023.88MiB
Free memory: 997.54MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:08:00.0)
(0, array([-0.09839484], dtype=float32), array([ 0.5272761], dtype=float32))
(20, array([ 0.02831561], dtype=float32), array([ 0.33592272], dtype=float32))
(40, array([ 0.07941294], dtype=float32), array([ 0.31031665], dtype=float32))
(60, array([ 0.09408762], dtype=float32), array([ 0.30296284], dtype=float32))
(80, array([ 0.09830203], dtype=float32), array([ 0.3008509], dtype=float32))
(100, array([ 0.09951238], dtype=float32), array([ 0.30024436], dtype=float32))
(120, array([ 0.09985995], dtype=float32), array([ 0.3000702], dtype=float32))
(140, array([ 0.09995978], dtype=float32), array([ 0.30002016], dtype=float32))
(160, array([ 0.09998845], dtype=float32), array([ 0.30000579], dtype=float32))
(180, array([ 0.09999669], dtype=float32), array([ 0.30000168], dtype=float32))
(200, array([ 0.09999905], dtype=float32), array([ 0.30000049], dtype=float32))
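To make it clearer what the optimizer in tf1.py is doing, the same fit can be written out by hand. This is a sketch in plain Python rather than TensorFlow, assuming the same setup as tf1.py (mean squared error, learning rate 0.5, 201 steps, W started uniformly in [-1, 1], b started at 0); the gradients are derived by hand, and `fit_line` is a hypothetical helper name.

```python
import random


def fit_line(xs, ys, learning_rate=0.5, steps=201, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # same starting range as tf.random_uniform
    b = 0.0
    n = float(len(xs))
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of mean((w * x + b - y) ** 2) with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


data_rng = random.Random(42)
xs = [data_rng.random() for _ in range(100)]
ys = [0.1 * x + 0.3 for x in xs]
w, b = fit_line(xs, ys)
print("w = {0:.5f}, b = {1:.5f}".format(w, b))
```

As in the TensorFlow run above, w and b should approach 0.1 and 0.3; the exact digits depend on the random data.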

10. NN

# Find the tensorflow directory
python -c 'import os; import inspect; import tensorflow; print(os.path.dirname(inspect.getfile(tensorflow)))'
#/root/anaconda2/envs/tfgpu/lib/python2.7/site-packages/tensorflow
cd /root/anaconda2/envs/tfgpu/lib/python2.7/site-packages/tensorflow/models/image/mnist/  # enter the directory
python convolutional.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library    libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "convolutional.py", line 326, in <module>
    tf.app.run()
  File "/root/anaconda2/envs/tfgpu/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
    sys.exit(main(sys.argv))
  File "convolutional.py", line 138, in main
    train_data = extract_data(train_data_filename, 60000)
  File "convolutional.py", line 85, in extract_data
    buf = bytestream.read(IMAGE_SIZE * IMAGE_SIZE * num_images * NUM_CHANNELS)
  File "/root/anaconda2/envs/tfgpu/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/root/anaconda2/envs/tfgpu/lib/python2.7/gzip.py", line 315, in _read
    self._read_eof()
  File "/root/anaconda2/envs/tfgpu/lib/python2.7/gzip.py", line 354, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x4b01c89e != 0xd2b9b600L

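So the CRC check failed. It is easier to download the files directly from the official site and copy them into the data directory; reading convolutional.py shows where the files come from. Comparing the files the program had already downloaded into data with the ones on the official site confirms that the program's download was corrupted: the file is considerably smaller, probably because of packet loss during the download.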
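A downloaded .gz file can be checked before training by decompressing it once and seeing whether Python's gzip module complains. This is a minimal sketch using only the standard library; `gzip_is_intact` is a hypothetical helper name, and it also returns False when the file cannot be opened at all.

```python
import gzip
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at path decompresses cleanly.

    A truncated file or a CRC mismatch makes this return False.
    """
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the end so the trailing CRC and length get verified.
            while stream.read(chunk_size):
                pass
    except (IOError, EOFError, zlib.error):
        return False
    return True
```

For example, gzip_is_intact("data/train-images-idx3-ubyte.gz") would have flagged the broken file above before convolutional.py tried to read it.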
Running it again:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties: 
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:08:00.0
Total memory: 1023.88MiB
Free memory: 997.54MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:08:00.0)
Initialized!
E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime CuDNN library: 5005 (compatibility version 5000) but source was compiled with 4007 (compatibility version 4000).  If using a binary install, upgrade your CuDNN library to match.  If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
F tensorflow/core/kernels/conv_ops.cc:457] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms) 
Aborted (core dumped)

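This means the cuDNN I installed is v5, but this TensorFlow binary was compiled against cuDNN v4. So I downloaded v4 and reconfigured cuDNN following step 2.

This overwrites cuDNN v5, so remember to back it up in case it is needed later. I renamed the previously extracted cuda directory to cudnn5005: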

cd /usr/local/cuda/lib64
rm -f libcudnn*  # remove cuDNN v5
# First cd into the cuDNN v4 download directory
tar xvzf cudnn-7.0-linux-x64-v4.0-prod.tgz
cp cuda/include/cudnn.h /usr/local/cuda/include  # overwrite the v5 header with v4
cp cuda/lib64/libcudnn* /usr/local/cuda/lib64  # install the v4 libraries
chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
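The numbers 5005 and 4007 in the cuDNN error message are packed version numbers. The sketch below decodes them, assuming the packing major * 1000 + minor * 100 + patch used by cuDNN releases of this era; the "compatibility version" rule (drop the patch digits) is inferred from the log line "5005 (compatibility version 5000)" and is not taken from TensorFlow's source.

```python
def decode_cudnn_version(version):
    """Split a packed cuDNN version into (major, minor, patch).

    Assumes version = major * 1000 + minor * 100 + patch.
    """
    major, rest = divmod(version, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def compatibility_version(version):
    """Drop the patch digits, as the log line suggests (5005 -> 5000)."""
    return version - version % 100


loaded, compiled = 5005, 4007  # values from the error message
print("loaded:   {0}".format(decode_cudnn_version(loaded)))
print("compiled: {0}".format(decode_cudnn_version(compiled)))
print("compatible: {0}".format(
    compatibility_version(loaded) == compatibility_version(compiled)))
```

Read this way, the runtime library was cuDNN 5.0.5 while the binary expected 4.0.x, which is why swapping in v4 resolves the error.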

Running it again:
cd /root/anaconda2/envs/tfgpu/lib/python2.7/site-packages/tensorflow/models/image/mnist/  # enter the directory
python convolutional.py
Output:

I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
E tensorflow/stream_executor/cuda/cuda_driver.cc:491] failed call to cuInit: CUDA_ERROR_NO_DEVICE
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:153] retrieving CUDA diagnostic information for host: mageek-ThinkPad-T550
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:160] hostname: mageek-ThinkPad-T550
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:185] libcuda reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:356] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  352.63  Sat Nov  7 21:25:42 PST 2015
GCC version:  gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3) 
"""
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] kernel reported version is: 352.63.0
I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:293] kernel version seems to match DSO: 352.63.0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:81] No GPU devices available on machine.
Initialized!
Step 0 (epoch 0.00), 5.4 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 280.2 ms
Minibatch loss: 3.287, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.0%
Step 200 (epoch 0.23), 281.0 ms
Minibatch loss: 3.491, learning rate: 0.010000
Minibatch error: 12.5%
Validation error: 3.6%
Step 300 (epoch 0.35), 281.0 ms
Minibatch loss: 3.265, learning rate: 0.010000
Minibatch error: 10.9%
Validation error: 3.2%
Step 400 (epoch 0.47), 293.0 ms
Minibatch loss: 3.221, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 2.7%
Step 500 (epoch 0.58), 289.0 ms
Minibatch loss: 3.292, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 2.7%
Step 600 (epoch 0.70), 287.4 ms
Minibatch loss: 3.227, learning rate: 0.010000
Minibatch error: 7.8%
Validation error: 2.6%
Step 700 (epoch 0.81), 287.0 ms
Minibatch loss: 3.015, learning rate: 0.010000
Minibatch error: 3.1%
Validation error: 2.4%
Step 800 (epoch 0.93), 287.0 ms
Minibatch loss: 3.152, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 2.0%
Step 900 (epoch 1.05), 287.7 ms
Minibatch loss: 2.938, learning rate: 0.009500
Minibatch error: 3.1%
Validation error: 1.6%
Step 1000 (epoch 1.16), 287.4 ms
Minibatch loss: 2.862, learning rate: 0.009500
Minibatch error: 1.6%
Validation error: 1.7%
.
.
.

So the program runs, but it did not find the GPU. Reboot and try again:
reboot

.....

source activate tfgpu
cd /root/anaconda2/envs/tfgpu/lib/python2.7/site-packages/tensorflow/models/image/mnist/  # enter the directory
python convolutional.py

Output:
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
Extracting data/train-images-idx3-ubyte.gz
Extracting data/train-labels-idx1-ubyte.gz
Extracting data/t10k-images-idx3-ubyte.gz
Extracting data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce 940M
major: 5 minor: 0 memoryClockRate (GHz) 1.124
pciBusID 0000:08:00.0
Total memory: 1023.88MiB
Free memory: 997.54MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce 940M, pci bus id: 0000:08:00.0)
Initialized!
Step 0 (epoch 0.00), 81.3 ms
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Step 100 (epoch 0.12), 44.4 ms
Minibatch loss: 3.291, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.1%
Step 200 (epoch 0.23), 44.4 ms
Minibatch loss: 3.462, learning rate: 0.010000
Minibatch error: 12.5%
Validation error: 3.6%
Step 300 (epoch 0.35), 44.0 ms
Minibatch loss: 3.188, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 3.2%
Step 400 (epoch 0.47), 44.3 ms
Minibatch loss: 3.253, learning rate: 0.010000
Minibatch error: 9.4%
Validation error: 2.8%
Step 500 (epoch 0.58), 44.3 ms
Minibatch loss: 3.288, learning rate: 0.010000
Minibatch error: 9.4%
Validation error: 2.5%
Step 600 (epoch 0.70), 43.9 ms
Minibatch loss: 3.180, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 2.8%
Step 700 (epoch 0.81), 44.2 ms
Minibatch loss: 3.033, learning rate: 0.010000
Minibatch error: 3.1%
Validation error: 2.4%
Step 800 (epoch 0.93), 44.0 ms
Minibatch loss: 3.149, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 2.0%
Step 900 (epoch 1.05), 44.0 ms
Minibatch loss: 2.919, learning rate: 0.009500
Minibatch error: 3.1%
Validation error: 1.6%
Step 1000 (epoch 1.16), 43.8 ms
Minibatch loss: 2.849, learning rate: 0.009500
Minibatch error: 0.0%
Validation error: 1.7%
Step 1100 (epoch 1.28), 43.6 ms
Minibatch loss: 2.822, learning rate: 0.009500
Minibatch error: 0.0%
Validation error: 1.6%
Step 1200 (epoch 1.40), 43.6 ms
Minibatch loss: 2.979, learning rate: 0.009500
Minibatch error: 7.8%
Validation error: 1.5%
Step 1300 (epoch 1.51), 43.6 ms
Minibatch loss: 2.763, learning rate: 0.009500
Minibatch error: 0.0%
Validation error: 1.9%
Step 1400 (epoch 1.63), 43.6 ms
Minibatch loss: 2.781, learning rate: 0.009500
Minibatch error: 3.1%
Validation error: 1.5%
Step 1500 (epoch 1.75), 43.6 ms
Minibatch loss: 2.861, learning rate: 0.009500
Minibatch error: 6.2%
Validation error: 1.4%
Step 1600 (epoch 1.86), 43.8 ms
Minibatch loss: 2.698, learning rate: 0.009500
Minibatch error: 1.6%
Validation error: 1.3%
Step 1700 (epoch 1.98), 43.9 ms
Minibatch loss: 2.650, learning rate: 0.009500
Minibatch error: 0.0%
Validation error: 1.3%
Step 1800 (epoch 2.09), 44.1 ms
Minibatch loss: 2.652, learning rate: 0.009025
Minibatch error: 1.6%
Validation error: 1.3%
Step 1900 (epoch 2.21), 44.1 ms
Minibatch loss: 2.655, learning rate: 0.009025
Minibatch error: 1.6%
Validation error: 1.3%
Step 2000 (epoch 2.33), 43.9 ms
Minibatch loss: 2.640, learning rate: 0.009025
Minibatch error: 3.1%
Validation error: 1.2%
Step 2100 (epoch 2.44), 44.0 ms
Minibatch loss: 2.568, learning rate: 0.009025
Minibatch error: 0.0%
Validation error: 1.1%
Step 2200 (epoch 2.56), 44.0 ms
Minibatch loss: 2.564, learning rate: 0.009025
Minibatch error: 0.0%
Validation error: 1.1%
Step 2300 (epoch 2.68), 44.2 ms
Minibatch loss: 2.561, learning rate: 0.009025
Minibatch error: 1.6%
Validation error: 1.2%
Step 2400 (epoch 2.79), 44.2 ms
Minibatch loss: 2.500, learning rate: 0.009025
Minibatch error: 0.0%
Validation error: 1.3%
Step 2500 (epoch 2.91), 44.0 ms
Minibatch loss: 2.471, learning rate: 0.009025
Minibatch error: 0.0%
Validation error: 1.2%
Step 2600 (epoch 3.03), 43.8 ms
Minibatch loss: 2.451, learning rate: 0.008574
Minibatch error: 0.0%
Validation error: 1.2%
Step 2700 (epoch 3.14), 43.6 ms
Minibatch loss: 2.483, learning rate: 0.008574
Minibatch error: 1.6%
Validation error: 1.1%
Step 2800 (epoch 3.26), 43.7 ms
Minibatch loss: 2.426, learning rate: 0.008574
Minibatch error: 1.6%
Validation error: 1.1%
Step 2900 (epoch 3.37), 44.3 ms
Minibatch loss: 2.449, learning rate: 0.008574
Minibatch error: 3.1%
Validation error: 1.1%
Step 3000 (epoch 3.49), 43.9 ms
Minibatch loss: 2.395, learning rate: 0.008574
Minibatch error: 1.6%
Validation error: 1.0%
Step 3100 (epoch 3.61), 44.1 ms
Minibatch loss: 2.390, learning rate: 0.008574
Minibatch error: 3.1%
Validation error: 1.0%
Step 3200 (epoch 3.72), 43.6 ms
Minibatch loss: 2.330, learning rate: 0.008574
Minibatch error: 0.0%
Validation error: 1.1%
Step 3300 (epoch 3.84), 43.8 ms
Minibatch loss: 2.319, learning rate: 0.008574
Minibatch error: 1.6%
Validation error: 1.1%
Step 3400 (epoch 3.96), 44.4 ms
Minibatch loss: 2.296, learning rate: 0.008574
Minibatch error: 0.0%
Validation error: 1.0%
Step 3500 (epoch 4.07), 44.4 ms
Minibatch loss: 2.273, learning rate: 0.008145
Minibatch error: 0.0%
Validation error: 1.0%
Step 3600 (epoch 4.19), 44.2 ms
Minibatch loss: 2.253, learning rate: 0.008145
Minibatch error: 0.0%
Validation error: 0.9%
Step 3700 (epoch 4.31), 44.4 ms
Minibatch loss: 2.237, learning rate: 0.008145
Minibatch error: 0.0%
Validation error: 1.0%
Step 3800 (epoch 4.42), 43.8 ms
Minibatch loss: 2.234, learning rate: 0.008145
Minibatch error: 1.6%
Validation error: 0.9%
Step 3900 (epoch 4.54), 43.9 ms
Minibatch loss: 2.325, learning rate: 0.008145
Minibatch error: 3.1%
Validation error: 0.9%
Step 4000 (epoch 4.65), 43.6 ms
Minibatch loss: 2.215, learning rate: 0.008145
Minibatch error: 0.0%
Validation error: 1.1%
Step 4100 (epoch 4.77), 43.6 ms
Minibatch loss: 2.209, learning rate: 0.008145
Minibatch error: 1.6%
Validation error: 1.0%
Step 4200 (epoch 4.89), 43.6 ms
Minibatch loss: 2.242, learning rate: 0.008145
Minibatch error: 1.6%
Validation error: 1.0%
Step 4300 (epoch 5.00), 43.5 ms
Minibatch loss: 2.188, learning rate: 0.007738
Minibatch error: 1.6%
Validation error: 0.9%
Step 4400 (epoch 5.12), 43.5 ms
Minibatch loss: 2.155, learning rate: 0.007738
Minibatch error: 3.1%
Validation error: 1.0%
Step 4500 (epoch 5.24), 43.5 ms
Minibatch loss: 2.164, learning rate: 0.007738
Minibatch error: 4.7%
Validation error: 0.9%
Step 4600 (epoch 5.35), 43.5 ms
Minibatch loss: 2.095, learning rate: 0.007738
Minibatch error: 0.0%
Validation error: 0.9%
Step 4700 (epoch 5.47), 43.6 ms
Minibatch loss: 2.062, learning rate: 0.007738
Minibatch error: 0.0%
Validation error: 0.9%
Step 4800 (epoch 5.59), 43.6 ms
Minibatch loss: 2.068, learning rate: 0.007738
Minibatch error: 1.6%
Validation error: 1.0%
Step 4900 (epoch 5.70), 43.6 ms
Minibatch loss: 2.062, learning rate: 0.007738
Minibatch error: 1.6%
Validation error: 1.0%
Step 5000 (epoch 5.82), 43.5 ms
Minibatch loss: 2.148, learning rate: 0.007738
Minibatch error: 3.1%
Validation error: 1.0%
Step 5100 (epoch 5.93), 43.5 ms
Minibatch loss: 2.017, learning rate: 0.007738
Minibatch error: 1.6%
Validation error: 0.9%
Step 5200 (epoch 6.05), 43.5 ms
Minibatch loss: 2.074, learning rate: 0.007351
Minibatch error: 3.1%
Validation error: 1.0%
Step 5300 (epoch 6.17), 43.6 ms
Minibatch loss: 1.983, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 1.1%
Step 5400 (epoch 6.28), 43.6 ms
Minibatch loss: 1.957, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 0.8%
Step 5500 (epoch 6.40), 43.5 ms
Minibatch loss: 1.955, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 0.9%
Step 5600 (epoch 6.52), 43.5 ms
Minibatch loss: 1.926, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 0.8%
Step 5700 (epoch 6.63), 43.5 ms
Minibatch loss: 1.914, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 1.0%
Step 5800 (epoch 6.75), 43.6 ms
Minibatch loss: 1.897, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 0.9%
Step 5900 (epoch 6.87), 43.5 ms
Minibatch loss: 1.887, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 0.8%
Step 6000 (epoch 6.98), 43.6 ms
Minibatch loss: 1.878, learning rate: 0.007351
Minibatch error: 0.0%
Validation error: 1.0%
Step 6100 (epoch 7.10), 43.5 ms
Minibatch loss: 1.859, learning rate: 0.006983
Minibatch error: 0.0%
Validation error: 0.8%
Step 6200 (epoch 7.21), 43.6 ms
Minibatch loss: 1.844, learning rate: 0.006983
Minibatch error: 0.0%
Validation error: 0.8%
Step 6300 (epoch 7.33), 43.6 ms
Minibatch loss: 1.850, learning rate: 0.006983
Minibatch error: 1.6%
Validation error: 0.9%
Step 6400 (epoch 7.45), 43.6 ms
Minibatch loss: 1.916, learning rate: 0.006983
Minibatch error: 3.1%
Validation error: 0.8%
Step 6500 (epoch 7.56), 43.6 ms
Minibatch loss: 1.808, learning rate: 0.006983
Minibatch error: 0.0%
Validation error: 0.8%
Step 6600 (epoch 7.68), 43.5 ms
Minibatch loss: 1.839, learning rate: 0.006983
Minibatch error: 1.6%
Validation error: 0.9%
Step 6700 (epoch 7.80), 43.6 ms
Minibatch loss: 1.781, learning rate: 0.006983
Minibatch error: 0.0%
Validation error: 0.8%
Step 6800 (epoch 7.91), 43.6 ms
Minibatch loss: 1.773, learning rate: 0.006983
Minibatch error: 0.0%
Validation error: 0.8%
Step 6900 (epoch 8.03), 43.5 ms
Minibatch loss: 1.762, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.9%
Step 7000 (epoch 8.15), 43.5 ms
Minibatch loss: 1.797, learning rate: 0.006634
Minibatch error: 1.6%
Validation error: 0.9%
Step 7100 (epoch 8.26), 43.5 ms
Minibatch loss: 1.741, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.8%
Step 7200 (epoch 8.38), 43.5 ms
Minibatch loss: 1.744, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.9%
Step 7300 (epoch 8.49), 43.6 ms
Minibatch loss: 1.726, learning rate: 0.006634
Minibatch error: 1.6%
Validation error: 0.8%
Step 7400 (epoch 8.61), 43.5 ms
Minibatch loss: 1.704, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.8%
Step 7500 (epoch 8.73), 43.6 ms
Minibatch loss: 1.695, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.8%
Step 7600 (epoch 8.84), 43.5 ms
Minibatch loss: 1.808, learning rate: 0.006634
Minibatch error: 3.1%
Validation error: 0.8%
Step 7700 (epoch 8.96), 43.6 ms
Minibatch loss: 1.667, learning rate: 0.006634
Minibatch error: 0.0%
Validation error: 0.9%
Step 7800 (epoch 9.08), 43.5 ms
Minibatch loss: 1.660, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Step 7900 (epoch 9.19), 43.6 ms
Minibatch loss: 1.649, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.9%
Step 8000 (epoch 9.31), 43.5 ms
Minibatch loss: 1.666, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Step 8100 (epoch 9.43), 43.6 ms
Minibatch loss: 1.626, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Step 8200 (epoch 9.54), 43.5 ms
Minibatch loss: 1.633, learning rate: 0.006302
Minibatch error: 1.6%
Validation error: 0.8%
Step 8300 (epoch 9.66), 43.6 ms
Minibatch loss: 1.616, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Step 8400 (epoch 9.77), 43.6 ms
Minibatch loss: 1.597, learning rate: 0.006302
Minibatch error: 0.0%
Validation error: 0.8%
Step 8500 (epoch 9.89), 43.5 ms
Minibatch loss: 1.612, learning rate: 0.006302
Minibatch error: 1.6%
Validation error: 0.8%
Test error: 0.8%

Finally Done!!!
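The two MNIST runs above differ mainly in time per step: roughly 280 to 290 ms without the GPU versus about 44 ms with it. The sketch below pulls those timings out of the log text and compares them. The sample lines are copied from the logs above; `mean_step_ms` is a hypothetical helper name, and skipping Step 0 is an assumption (its timing is not representative in either run).

```python
import re

STEP_RE = re.compile(r"^Step (\d+) \(epoch ([\d.]+)\), ([\d.]+) ms$")


def mean_step_ms(log_text, skip_first=True):
    """Average the per-step times reported in a convolutional.py log."""
    times = []
    for line in log_text.splitlines():
        match = STEP_RE.match(line.strip())
        if match:
            times.append(float(match.group(3)))
    if skip_first and len(times) > 1:
        times = times[1:]  # Step 0 is not representative
    if not times:
        raise ValueError("no 'Step ...' lines found")
    return sum(times) / len(times)


cpu_log = """Step 0 (epoch 0.00), 5.4 ms
Step 100 (epoch 0.12), 280.2 ms
Step 200 (epoch 0.23), 281.0 ms
Step 300 (epoch 0.35), 281.0 ms"""

gpu_log = """Step 0 (epoch 0.00), 81.3 ms
Step 100 (epoch 0.12), 44.4 ms
Step 200 (epoch 0.23), 44.4 ms
Step 300 (epoch 0.35), 44.0 ms"""

speedup = mean_step_ms(cpu_log) / mean_step_ms(gpu_log)
print("GPU speedup: {0:.1f}x".format(speedup))
```

On this laptop's GeForce 940M, that works out to a speedup of roughly six to seven times for this model.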

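Summary

This took four days of back and forth. The lesson: follow the official instructions step by step. Different versions are often incompatible with each other, so don't casually download other versions, and analyze each error message carefully before deciding on the next step.

Feel free to visit my homepage: http://mageek.cn/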



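This article was written by mageek and is licensed under Creative Commons Attribution 3.0. It may be freely reposted and quoted, provided the author is credited and the source is cited.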