I’m back. I think I’m close, but I’m running into a few last troublesome issues. A lot has changed and improved, so I’ve readjusted my build steps and (hopefully) documented them well below. Very long error logs are included as pastebin links.
@lissyx @elpimous_robot – anything jump out at you in what’s below?
Goal: compile deepspeech native_client for ARMv8 (aarch64) with GPU support
Configuration
All work was performed on an NVIDIA Jetson TX1 running JetPack 3.1. The
kernel was recompiled to support swap files, and an 8GB swap file was
created.
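For reference, creating such a swap file can be sketched as follows (not necessarily the exact commands used here; requires root and assumes the stock ext4 rootfs):

```shell
# Sketch: create and enable an 8GB swap file (run as root)
sudo fallocate -l 8G /swapfile     # reserve the space
sudo chmod 600 /swapfile           # swap files must not be world-readable
sudo mkswap /swapfile              # write the swap signature
sudo swapon /swapfile              # enable it for this boot
swapon --show                      # confirm the new swap area is active
```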
Prep Work
First, just set up the repos.
Clone Mozilla’s DeepSpeech and TensorFlow repositories at the right versions:
mkdir $HOME/deepspeech
cd $HOME/deepspeech
git clone https://github.com/mozilla/DeepSpeech
git clone --branch r1.4 https://github.com/mozilla/tensorflow
cd $HOME
ln -s deepspeech/tensorflow ./
ln -s deepspeech/DeepSpeech ./
Update tc-vars.sh
I adjusted the CUDA paths to match what’s actually on the TX1; diff below.
git diff 23d3d54b3cbc9099678e9f01e45ea2627c835fc1:tc-vars.sh HEAD:tc-vars.sh
diff --git a/23d3d54b3cbc9099678e9f01e45ea2627c835fc1:tc-vars.sh b/HEAD:tc-vars.sh
index dec1ad7..f372dc4 100755
--- a/23d3d54b3cbc9099678e9f01e45ea2627c835fc1:tc-vars.sh
+++ b/HEAD:tc-vars.sh
@@ -95,7 +95,9 @@ if [ "${OS}" = "Darwin" ]; then
fi;
### Define build parameters/env variables that we will re-ues in sourcing scripts.
-TF_CUDA_FLAGS="TF_CUDA_CLANG=0 TF_CUDA_VERSION=9.0 TF_CUDNN_VERSION=7 CUDA_TOOLKIT_PATH=${DS_ROOT_TASK}/DeepSpeech/CUDA CUDNN_INSTALL_PATH=${DS_ROOT_TASK}/DeepSpeech/CUDA TF_CUDA_COMPUTE_CAPABILITIES=\"3.0,3.5,3.7,5.2,6.0,6.1\""
+GV_CUDA_PATH='/usr/local/cuda'
+GV_CUDNN_PATH='/usr/lib/aarch64-linux-gnu/'
+TF_CUDA_FLAGS="TF_CUDA_CLANG=0 TF_CUDA_VERSION=8.0 TF_CUDNN_VERSION=6 CUDA_TOOLKIT_PATH=${GV_CUDA_PATH} CUDNN_INSTALL_PATH=${GV_CUDNN_PATH} TF_CUDA_COMPUTE_CAPABILITIES=\"3.0,3.5,3.7,5.2,5.3,6.0,6.1\""
BAZEL_ARM_FLAGS="--config=rpi3"
BAZEL_CUDA_FLAGS="--config=cuda"
BAZEL_EXTRA_FLAGS="--copt=-fvisibility=hidden"
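One subtlety worth flagging: TF_CUDA_FLAGS is a single string of KEY=VALUE pairs that tc-build.sh later applies with `eval "export ${TF_CUDA_FLAGS}"`. A minimal standalone illustration of that pattern (the values here are just examples):

```shell
# Minimal illustration of the eval-export pattern tc-build.sh uses on TF_CUDA_FLAGS
TF_CUDA_FLAGS="TF_CUDA_VERSION=8.0 TF_CUDNN_VERSION=6"
eval "export ${TF_CUDA_FLAGS}"   # each KEY=VALUE pair becomes an exported variable
echo "${TF_CUDA_VERSION}"        # prints 8.0
echo "${TF_CUDNN_VERSION}"       # prints 6
```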
Update tc-build.sh
Update tc-build to add a new option for building TensorFlow natively on ARMv8 with CUDA support (using the vars set in tc-vars.sh).
git diff 23d3d54b3cbc9099678e9f01e45ea2627c835fc1:tc-build.sh tc-build.sh
diff --git a/tc-build.sh b/tc-build.sh
index 31c4d69..a7d432e 100755
--- a/tc-build.sh
+++ b/tc-build.sh
@@ -11,14 +11,18 @@ if [ "$1" = "--gpu" ]; then
build_gpu=yes
fi
-if [ "$1" = "--arm" ]; then
- build_gpu=no
+if [ "$2" = "--arm" ]; then
build_arm=yes
fi
-pushd ${DS_ROOT_TASK}/DeepSpeech/tf/
+pushd ${DS_ROOT_TASK}/tensorflow
BAZEL_BUILD="bazel ${BAZEL_OUTPUT_USER_ROOT} build -s --explain bazel_monolithic_tf.log --verbose_explanations --experimental_strict_action_env --config=monolithic"
+ # experimental aarch64 GPU build (NVIDIA Jetson-class devices)
+ if [ "${build_gpu}" = "yes" -a "${build_arm}" = "yes" ]; then
+ eval "export ${TF_CUDA_FLAGS}" && (echo "" | TF_NEED_CUDA=1 ./configure) && ${BAZEL_BUILD} -c opt ${BAZEL_CUDA_FLAGS} ${BAZEL_EXTRA_FLAGS} ${BUILD_TARGET_LIB_CPP_API} ${BUILD_TARGET_GRAPH_TRANSFORMS} ${BUILD_TARGET_GRAPH_SUMMARIZE} ${BUILD_TARGET_GRAPH_BENCHMARK} ${BUILD_TARGET_CONVERT_MMAP} ${BUILD_TARGET_FRAMEWORK} ${BUILD_TARGET_DEEPSPEECH} ${BUILD_TARGET_DEEPSPEECH_UTILS} ${BUILD_TARGET_KENLM} ${BUILD_TARGET_TRIE}
+ fi
+
# Pure amd64 CPU-only build
Build tensorflow 1.4
By running `tc-build.sh --gpu --arm`, we obtain this tree in `bazel-bin` (very long paste), which contains the build targets we specified. In particular, `libdeepspeech.so`, `libtensorflow_cc.so`, etc. are all built and of reasonable sizes.
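To sanity-check those outputs without eyeballing the whole tree, a small helper of the sort I use (the function name and file list are my own, not from the upstream scripts):

```shell
# Hypothetical helper: check that expected build artifacts exist and are non-empty
check_artifacts() {
  local dir="$1"; shift
  local status=0
  local f
  for f in "$@"; do
    if [ -s "${dir}/${f}" ]; then
      echo "${f}: OK ($(stat -c '%s' "${dir}/${f}") bytes)"
    else
      echo "${f}: MISSING or empty"
      status=1
    fi
  done
  return ${status}
}

# e.g., on the TX1:
# check_artifacts "${HOME}/tensorflow/bazel-bin/native_client" libdeepspeech.so generate_trie
```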
Attempt to build native-client
Next, I adapted the flow from taskcluster (as suggested on discourse) and created $HOME/deepspeech/DeepSpeech/taskcluster/cuda-arm-build.sh:
#!/bin/bash
set -xe
source $(dirname "$0")/../tc-tests-utils.sh
source ${DS_ROOT_TASK}/tensorflow/tc-vars.sh
BAZEL_TARGETS="
//native_client:libdeepspeech.so
//native_client:deepspeech_utils
//native_client:generate_trie
"
BAZEL_ENV_FLAGS="TF_NEED_CUDA=1 ${TF_CUDA_FLAGS}"
BAZEL_BUILD_FLAGS="${BAZEL_CUDA_FLAGS} ${BAZEL_EXTRA_FLAGS} ${BAZEL_OPT_FLAGS}"
SYSTEM_TARGET=host
EXTRA_LOCAL_CFLAGS="-march=armv8-a"
EXTRA_LOCAL_LDFLAGS="-L/usr/local/cuda/targets/aarch64-linux/lib/ -L/usr/local/cuda/targets/aarch64-linux/lib/stubs -lcudart -lcuda"
#do_bazel_build
deepspeech_python_build()
{
rename_to_gpu=$1
# unset PYTHON_BIN_PATH
# unset PYTHONPATH
# export PYENV_ROOT="${DS_ROOT_TASK}/DeepSpeech/.pyenv"
# export PATH="${PYENV_ROOT}/bin:$PATH"
# install_pyenv "${PYENV_ROOT}"
# install_pyenv_virtualenv "$(pyenv root)/plugins/pyenv-virtualenv"
mkdir -p wheels
SETUP_FLAGS=""
if [ "${rename_to_gpu}" ]; then
SETUP_FLAGS="--project_name deepspeech-gpu"
fi
# for pyver in ${SUPPORTED_PYTHON_VERSIONS}; do
# pyenv install ${pyver}
# pyenv virtualenv ${pyver} deepspeech
# source ${PYENV_ROOT}/versions/${pyver}/envs/deepspeech/bin/activate
# RASPBIAN=/tmp/multistrap-raspbian-jessie is only needed when cross-building
EXTRA_CFLAGS="${EXTRA_LOCAL_CFLAGS}" EXTRA_LDFLAGS="${EXTRA_LOCAL_LDFLAGS}" EXTRA_LIBS="${EXTRA_LOCAL_LIBS}" make -C native_client/ \
TARGET=${SYSTEM_TARGET} \
TFDIR=${DS_TFDIR} \
SETUP_FLAGS="${SETUP_FLAGS}" \
bindings-clean bindings
cp native_client/dist/*.whl wheels
make -C native_client/ bindings-clean
# deactivate
# pyenv uninstall --force deepspeech
# done;
}
#do_deepspeech_binary_build
deepspeech_python_build rename_to_gpu
#do_deepspeech_nodejs_build rename_to_gpu
$(dirname "$0")/decoder-build.sh
Running this failed quickly, with `ld` unable to find `Model::Model` in `libdeepspeech.so`.
With `nm`, we can inspect `libdeepspeech.so` and see that the symbols are indeed missing. They are present, however, in `libdeepspeech.a`:
ubuntu@nvidia:~/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.so | grep Model::Model
ubuntu@nvidia:~/tensorflow/bazel-bin/native_client$ nm -gC libdeepspeech.a | grep Model::Model
0000000000000000 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
0000000000000000 T DeepSpeech::Model::Model(char const*, int, int, char const*, int)
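One thing worth checking here: tc-vars.sh sets `BAZEL_EXTRA_FLAGS="--copt=-fvisibility=hidden"`, and `-fvisibility=hidden` produces exactly this pattern, i.e. symbols present in the static archive but absent from the shared object’s dynamic symbol table (unless they are explicitly marked default-visibility). A tiny standalone demonstration (file and symbol names are made up):

```shell
# Demonstrate that -fvisibility=hidden removes symbols from a .so's dynamic table
cat > vis_demo.c <<'EOF'
int demo_symbol(void) { return 1; }
EOF
gcc -fPIC -shared -o libvis_default.so vis_demo.c
gcc -fPIC -shared -fvisibility=hidden -o libvis_hidden.so vis_demo.c
nm -gD libvis_default.so | grep demo_symbol                       # exported
nm -gD libvis_hidden.so | grep demo_symbol || echo "not exported"
```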
We move past this point with some trepidation, but changing DeepSpeech/native_client/definitions.mk as follows yielded some success:
ubuntu@nvidia:~/deepspeech/DeepSpeech/native_client$ git diff 52adc2b2ddfb70eebfea84ada44f74af29336f2b:native_client/definitions.mk definitions.mk
diff --git a/native_client/definitions.mk b/native_client/definitions.mk
index 32a4d80..622e88d 100644
--- a/native_client/definitions.mk
+++ b/native_client/definitions.mk
@@ -48,8 +48,8 @@ LDFLAGS_RPATH := -Wl,-rpath,@executable_path
endif
CFLAGS += $(EXTRA_CFLAGS)
-LIBS := -ldeepspeech -ldeepspeech_utils $(EXTRA_LIBS)
-LDFLAGS_DIRS := -L${TFDIR}/bazel-bin/native_client $(EXTRA_LDFLAGS)
+LIBS := -ltensorflow_so -l:libdeepspeech.a -l:libdeepspeech_utils.a $(EXTRA_LIBS)
+LDFLAGS_DIRS := -L${TFDIR}/bazel-bin/tensorflow -L${TFDIR}/bazel-bin/native_client $(EXTRA_LDFLAGS)
LDFLAGS += $(LDFLAGS_NEEDED) $(LDFLAGS_RPATH) $(LDFLAGS_DIRS) $(LIBS)
With the linker now pointed at `libdeepspeech.a`, those symbols resolve. However, rerunning the native client build yields this very long error, which makes it past the missing `Model::Model`, only to fail in a very similar fashion on symbols in `libtensorflow.so`.
At this point I started to worry that my bazel build step was wrong in some way that broke the linker.
Questions
- How can I make `*.so` files only, and skip building `.a` archives entirely?
- What could cause symbols to be stripped from the `.so` in this situation?
- How close am I to home base?