Deep Learning Applications Design, Development and
Deployment in IoT Edge
Jayakumar. S PhD (IIT Bombay)
Object Automation, 04/02/2020
Contents
1. Introduction
2. Power 9 based DL training
2.1 HW and SW Requirements
2.1.1 Talos™ II Entry-Level Developer System
2.1.2 Software
2.2 “DLtrain” resources in GitHub
2.3 Build Tool Set
2.4 Build Process to Create “DLtrain”
2.5 Use DLtrain to train an ANN model
2.6 Use DLtrain for Inferencing
3. Working with CUDA cores
3.1 Hardware Used in CUDA computing
3.2 Driver Software Installation in Power AC922
3.3 Single Project using Power AC922 and GPU
3.3.1 Build Application
3.3.2 Sample Code in .cu
4. Inference app in Android phone
4.1 Install NDK in Android Studio
4.2 Build Inference Engine
5. Update Inference Engine with Trained Model
Question: How to build toPhone/SndModel.jar?
Question: How to use toPhone/SndModel.jar?
6. Working with ML in Watson Studio
7. Visual Recognition in Watson Studio
8. Deploy Visual Recognition
9. Visual Recognition Client in Android Phone
1. Introduction
The intelligent IoT edge plays a critical role in services that require real-time inferencing. Historically, systems in this space have carried a high degree of engineering complexity in both deployment and operation. SCADA, for example, has long been at work in the power generation, oil and gas, and cement industries. SCADA keeps humans in the loop, which is what makes it Supervisory Control and Data Acquisition. With the advent of deep learning and its success in the modern digital domain, there is huge interest among researchers in carrying deep learning models into the industrial verticals mentioned above and moving toward intelligent control and data acquisition. In place of the supervisor, the intelligent IoT edge appears set to take over the tasks that human beings handle in the supervisory role. There is thus immense interest in making the IoT edge an intelligent system in these core engineering verticals, beyond the requirements of the consumer industry.
This lecture series is designed to cover NN models, training an NN model with training data (mostly MNIST), and validating trained NN models before deploying them in the IoT edge. For deployment, there is strong interest in using smartphones as the IoT edge, so that each learner can use the same device without much investment during the learning period. Industrial deployments, however, are expected to target devices such as the Jetson Nano, Ultra96-V2, and the mmWave radar IWR 6843. An advanced lecture series is planned to cover these IoT edge devices.
2. Power 9 based DL training
Along with compute capability, it is important to have ultra-high-speed I/O to share training data with the GPUs and with other Power AC922 systems. An Artificial Neural Network (ANN) model is created to classify a given image. The MNIST data set is used to train the ANN models on the AC922.
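For orientation, a fully connected MNIST classifier of the kind trained here maps the 784 input pixels through one or more hidden layers to 10 digit outputs. The following minimal C++ sketch shows that forward pass; the layer sizes (784 -> 64 -> 10) and the sigmoid activation are illustrative assumptions, not DLtrain's actual configuration, which is set in its conf file.

// forward.cpp -- illustrative forward pass of a small fully connected ANN
// for MNIST (784 -> 64 -> 10). Layer sizes and the sigmoid activation are
// illustrative only; DLtrain's real network is configured in its conf file.
#include <cmath>
#include <cstddef>
#include <vector>

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One dense layer: out[j] = sigmoid(b[j] + sum_i W[j*in_size + i] * in[i])
static std::vector<float> dense(const std::vector<float> &in,
                                const std::vector<float> &W,
                                const std::vector<float> &b)
{
    const std::size_t in_size = in.size(), out_size = b.size();
    std::vector<float> out(out_size);
    for (std::size_t j = 0; j < out_size; ++j) {
        float z = b[j];
        for (std::size_t i = 0; i < in_size; ++i)
            z += W[j * in_size + i] * in[i];
        out[j] = sigmoid(z);
    }
    return out;
}

// Classify one 28x28 image (784 pixels scaled to [0,1]): the predicted
// digit is the index of the largest output.
static int classify(const std::vector<float> &pixels,
                    const std::vector<float> &W1, const std::vector<float> &b1,
                    const std::vector<float> &W2, const std::vector<float> &b2)
{
    std::vector<float> hidden = dense(pixels, W1, b1); // 784 -> 64
    std::vector<float> output = dense(hidden, W2, b2); // 64 -> 10
    int best = 0;
    for (int k = 1; k < 10; ++k)
        if (output[k] > output[best]) best = k;
    return best;
}

int main()
{
    // All-zero weights stand in for a trained model loaded from disk.
    std::vector<float> img(784, 0.0f);
    std::vector<float> W1(64 * 784, 0.0f), b1(64, 0.0f);
    std::vector<float> W2(10 * 64, 0.0f), b2(10, 0.0f);
    return classify(img, W1, b1, W2, b2);
}

Training adjusts the weights W and biases b by backpropagation over the MNIST examples; that is the part DLtrain performs on the AC922.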
2.1 HW and SW Requirements
2.1.1 Talos™ II Entry-Level Developer System
TLSDS3
Talos™ II Entry-Level Developer System including:
● EATX chassis with 500W ATX power supply
● A single Talos™ II Lite EATX mainboard
● One 4-core IBM POWER9 CPU
○ 4 cores per package
○ SMT4 capable
○ POWER IOMMU
○ Hardware virtualization extensions
○ 90W TDP
● One 3U POWER9 heatsink / fan (HSF) assembly
● 8GB DDR4 ECC registered memory
● 2 front panel USB 2.0 ports
● 128GB internal NVMe storage
● Recovery DVD
https://www.raptorcs.com/content/TLSDS3/intro.html
2.1.2 Software
1. Ubuntu 18.04
2. cmake 3.10.2
3. gcc-9, g++-9
2.2 “DLtrain” resources in GitHub
https://github.com/DLinIoTedge/DLtrain provides ready-to-use applications to train ANN-based deep learning models. The deep learning training application is coded in C and C++.
2.3 Build Tool Set
The Ubuntu 18.04 g++/gcc tool set is used (cmake is also used, to generate the Makefile) to create the DL application that runs on the Power AC922.
2.4 Build Process to Create “DLtrain”
Build the “DLtrain” application on an Ubuntu 18.04 (Power AC922) machine with gcc-9 and g++-9. Note that a fresh build directory must be created after the old one is removed:
cd C++NNFast
rm -rf build
mkdir build
cd build
cmake -D CMAKE_C_COMPILER=gcc-9 -D CMAKE_CXX_COMPILER=g++-9 ..
make
2.5 Use DLtrain to train an ANN model
Use the “DLtrain” application to train ANN models on the MNIST data set.
bin/DLtrain conf train      // train the DL model
bin/DLtrain conf train o    // overwrite a previously trained model
2.6 Use DLtrain for Inferencing
Use the “DLtrain” application to infer handwritten digits stored as 28x28-pixel images.
bin/DLtrain conf infer 3            // only the 3s from the data set
bin/DLtrain conf infer <filename>   // raw binary input (28x28 = 784 bytes)
bin/DLtrain conf infer img.raw      // sample file with the digit 4 in img.raw
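The raw file above is simply the 784 pixel bytes of one 28x28 grayscale image, with no header. As a convenience, the following C++ sketch extracts one image from a standard MNIST idx3 image file into that form; the program name mnist2raw and the assumption that DLtrain reads plain pixel bytes without normalization are mine and should be checked against the DLtrain source.

// mnist2raw.cpp -- extract one 28x28 image from an MNIST idx3 image file
// into a raw 784-byte file, the format described above for
// `bin/DLtrain conf infer <filename>`.
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s <idx3-images-file> <index> <out.raw>\n", argv[0]);
        return EXIT_FAILURE;
    }
    FILE *in = fopen(argv[1], "rb");
    if (!in) { perror("fopen"); return EXIT_FAILURE; }

    long index = atol(argv[2]);
    // MNIST idx3 layout: 16-byte header (magic, count, rows, cols), then
    // images stored as consecutive 28x28 = 784 unsigned bytes.
    if (fseek(in, 16 + index * 784L, SEEK_SET) != 0) { perror("fseek"); return EXIT_FAILURE; }

    std::vector<unsigned char> img(784);
    if (fread(img.data(), 1, img.size(), in) != img.size()) {
        fprintf(stderr, "short read: is the index in range?\n");
        return EXIT_FAILURE;
    }
    fclose(in);

    FILE *out = fopen(argv[3], "wb");
    if (!out) { perror("fopen"); return EXIT_FAILURE; }
    fwrite(img.data(), 1, img.size(), out);
    fclose(out);
    return EXIT_SUCCESS;
}

Example: g++ -o mnist2raw mnist2raw.cpp, then ./mnist2raw train-images-idx3-ubyte 7 img.raw, then bin/DLtrain conf infer img.raw.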
3. Working with CUDA cores
3.1 Hardware Used in CUDA computing
Hardware from NVIDIA: GeForce RTX 2070
● GPU Architecture: Turing
● NVIDIA CUDA® Cores: 2304
● RTX-OPS: 42T
● Boost Clock: 1620 MHz
● Frame Buffer: 8 GB GDDR6
● Memory Speed: 14 Gbps
3.2 Driver Software Installation in Power AC922
Get CUDA 10.2 for Ubuntu 18.04 on ppc64le from the following link. Use wget to obtain “cuda_10.2.89_440.33.01_linux_ppc64le.run”:
http://developer.download.nvidia.com/compute/cuda/10.2/Prod/local_installers/cuda_10.2.89_440.33.01_linux_ppc64le.run
Install the driver using the following command:
sudo sh cuda_10.2.89_440.33.01_linux_ppc64le.run
After installation, run nvidia-smi to get details of the installed driver.
3.3 Single Project using Power AC922 and GPU
3.3.1 Build Application
A file with the *.cu extension is used to build an application that runs partly on the AC922 host and partly on the CUDA cores of the GPU.
For the GPU: nvcc, the NVIDIA (R) CUDA compiler driver, version “Cuda compilation tools, release 10.1, V10.1.243”.
For the host CPU: g++ (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0.
Use the make tool (GNU Make 4.1) to build the application.
jk@jkDL:~/NVIDIA_CUDA-10.1_Samples/0_Simple/vectorAdd$ make
/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_30,code=sm_30
-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37
-gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52
-gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61
-gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75
-gencode arch=compute_75,code=compute_75 -o vectorAdd vectorAdd.o
mkdir -p ../../bin/ppc64le/linux/release
cp vectorAdd ../../bin/ppc64le/linux/release
The above make process creates “vectorAdd”, which is then executed as shown below.
jk@jkDL:~/NVIDIA_CUDA-10.1_Samples/0_Simple/vectorAdd$ ./vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
3.3.2 Sample Code in .cu
The sample source code is given below.
#include <stdio.h>
#include <stdlib.h> // malloc/free, rand, exit
#include <math.h>   // fabs
// For the CUDA runtime routines (prefixed with "cuda_")
#include <cuda_runtime.h>
#include <helper_cuda.h>
/**
* CUDA Kernel Device code
*
* Computes the vector addition of A and B into C. The 3 vectors have the same
* number of elements numElements.
*/
__global__ void
vectorAdd(const float *A, const float *B, float *C, int numElements)
{
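// Each thread handles one element: global index = block offset + thread index within the block.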
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements)
{
C[i] = A[i] + B[i];
}
}
/**
* Host main routine
*/
int
main(void)
{
// Error code to check return values for CUDA calls
cudaError_t err = cudaSuccess;
// Print the vector length to be used, and compute its size
int numElements = 50000;
size_t size = numElements * sizeof(float);
printf("[Vector addition of %d elements]\n", numElements);
// Allocate the host input vector A
float *h_A = (float *)malloc(size);
// Allocate the host input vector B
float *h_B = (float *)malloc(size);
// Allocate the host output vector C
float *h_C = (float *)malloc(size);
// Verify that allocations succeeded
if (h_A == NULL || h_B == NULL || h_C == NULL)
{
fprintf(stderr, "Failed to allocate host vectors!\n");
exit(EXIT_FAILURE);
}
// Initialize the host input vectors
for (int i = 0; i < numElements; ++i)
{
h_A[i] = rand()/(float)RAND_MAX;
h_B[i] = rand()/(float)RAND_MAX;
}
// Allocate the device input vector A
float *d_A = NULL;
err = cudaMalloc((void **)&d_A, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Allocate the device input vector B
float *d_B = NULL;
err = cudaMalloc((void **)&d_B, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Allocate the device output vector C
float *d_C = NULL;
err = cudaMalloc((void **)&d_C, size);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy the host input vectors A and B in host memory to the device input vectors in
// device memory
printf("Copy input data from the host memory to the CUDA device\n");
err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Launch the Vector Add CUDA Kernel
int threadsPerBlock = 256;
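// Round up, so the last (possibly partial) block still covers the final elements.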
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
printf("CUDA kernel launch with %d blocks of %d threads\n", blocksPerGrid,
threadsPerBlock);
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
err = cudaGetLastError();
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Copy the device result vector in device memory to the host result vector
// in host memory.
printf("Copy output data from the CUDA device to the host memory\n");
err = cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n",
cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Verify that the result vector is correct
for (int i = 0; i < numElements; ++i)
{
if (fabs(h_A[i] + h_B[i] - h_C[i]) > 1e-5)
{
fprintf(stderr, "Result verification failed at element %d!\n", i);
exit(EXIT_FAILURE);
}
}
printf("Test PASSED\n");
// Free device global memory
err = cudaFree(d_A);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector A (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_B);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector B (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
err = cudaFree(d_C);
if (err != cudaSuccess)
{
fprintf(stderr, "Failed to free device vector C (error code %s)!\n", cudaGetErrorString(err));
exit(EXIT_FAILURE);
}
// Free host memory
free(h_A);
free(h_B);
free(h_C);
printf("Done\n");
return 0;
}
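The sample above was built with the samples-tree Makefile shown in 3.3.1. It can also be compiled directly with nvcc; the include path below points at the samples' common/inc directory, where helper_cuda.h lives (adjust the path to match your installation):

nvcc -ccbin g++ -I ~/NVIDIA_CUDA-10.1_Samples/common/inc -o vectorAdd vectorAdd.cu
./vectorAdd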
4. Inference app in Android phone
4.1 Install NDK in Android Studio
The ANN-model-based inference engine is coded in C++ and C. To build a library from the inference engine code, the NDK must also be installed in Android Studio. The installation of Android Studio and of the NDK is described in detail below. The following is installed on an Ubuntu 18.04 x86 machine.
Installation of dependencies
Android Studio requires OpenJDK version 8 or above to be installed on your system:
sudo apt update
sudo apt install openjdk-8-jdk
java -version
Install Android Studio
sudo snap install android-studio --classic
Start Android Studio either by typing android-studio in your terminal or by clicking on the Android Studio icon (Activities -> Android Studio).
SDK version required
Use SDK 22 or above. The present build of the J722 version of the app uses SDK 29.
Install NDK
Use the SDK Manager to install the following components (NDK); they are needed to build the JNI part of the inference engine.
Packages to install:
- LLDB 3.1 (lldb;3.1)
- CMake 3.10.2.4988404 (cmake;3.10.2.4988404)
- NDK (Side by side) 20.1.5948944 (ndk;20.1.5948944)
Reference:
https://linuxize.com/post/how-to-install-android-studio-on-ubuntu-18-04/
4.2 Build Inference Engine
Using Android Studio, build the inference engine as given in the following workflow and copy the resulting APK onto an Android phone.
Windows 10 or Ubuntu 18.04 with Android Studio is used (the latest stable version of Android Studio is 3.3.1.0). NDK (Side by side) 20.1.5948944 (ndk;20.1.5948944) is used to build the JNI-side library for the inference engine.
The full source code of the inference engine is given at https://github.com/DLinIoTedge/NN
The following diagram shows the workflow to create the J722 application as an APK that can run on an Android phone.
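For orientation, the sketch below shows what the JNI boundary built by the NDK looks like: a native classify entry point that receives the image bytes from the Java side and forwards them to the C/C++ inference engine. All names here (package org.example.j722, class Infer, the run_inference stub) are hypothetical; the real bindings are in the DLinIoTedge/NN repository.

// infer_jni.cpp -- minimal, hypothetical JNI wrapper around a native
// inference entry point. The NDK builds this into a shared library that
// the Java side loads with System.loadLibrary().
#include <jni.h>
#include <cstddef>
#include <vector>

// Stub standing in for the real inference engine: takes the 784 pixel
// bytes of a 28x28 image and returns the predicted digit (0..9).
static int run_inference(const unsigned char *pixels, std::size_t n)
{
    (void)pixels; (void)n;
    return -1; // placeholder; the real engine runs the trained ANN
}

// Matches a hypothetical Java declaration:
//   package org.example.j722;  class Infer { public native int classify(byte[] image); }
extern "C" JNIEXPORT jint JNICALL
Java_org_example_j722_Infer_classify(JNIEnv *env, jobject /*self*/, jbyteArray image)
{
    const jsize n = env->GetArrayLength(image);
    std::vector<unsigned char> buf(static_cast<std::size_t>(n));
    env->GetByteArrayRegion(image, 0, n, reinterpret_cast<jbyte *>(buf.data()));
    return run_inference(buf.data(), buf.size());
}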
5. Update Inference Engine with Trained Model
The source code of the model update application is given at this URL:
https://github.com/DLinIoTedge/Send2Phone
The following IP network configuration is recommended.
The Send2Phone source code is in Java, and it runs on a Power machine or on an x86 machine.
Question: How to build toPhone/SndModel.jar?
javac main.java
Question: How to use toPhone/SndModel.jar?
java -jar toPhone/SndModel.jar
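Note on the build step: javac produces only .class files, so an additional jar step is needed before the jar can be run as above; for example (assuming the entry class is named Main):

javac main.java
jar cfe toPhone/SndModel.jar Main *.class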
Also follow the workflow given above, on the host CPU and on the Android phone. Successful completion of these operations results in the “deployment of the trained model” into the inference engine that runs on the Android smartphone.
6. Working with ML in Watson Studio
Details are given in the following link
ai-ml-watson.pdf
7. Visual Recognition in Watson Studio
Details are given in the following link
A beginner's guide to setting up a visual recognition service
8. Deploy Visual Recognition
// The following command worked:
curl -X POST -u "apikey:put your API key here" -F "images_file=@sasi.jpeg" -F "threshold=0.6" -F "classifier_ids=DefaultCustomModel_1738357304" "https://gateway.watsonplatform.net/visual-recognition/api/v3/classify?version=2018-03-19"
{
"images": [
{
"classifiers": [
{
"classifier_id": "DefaultCustomModel_1738357304",
"name": "Default Custom Model",
"classes": [
{
"class": "lb1compress.zip",
"score": 0.855
}
]
}
],
"image": "sasi.jpeg"
}
],
"images_processed": 1,
"custom_classes": 3
}
Sample code for the client application (Python)
pip
pip install --upgrade "watson-developer-cloud>=2.1.1"
Authentication
from watson_developer_cloud import VisualRecognitionV3
visual_recognition = VisualRecognitionV3(
version='{version}',
iam_apikey='{iam_api_key}'
)
Authentication (for instances created before May 23, 2018)
from watson_developer_cloud import VisualRecognitionV3
visual_recognition = VisualRecognitionV3(
version='{version}',
api_key='{api_key}'
)
Classify an image
import json
from watson_developer_cloud import VisualRecognitionV3
visual_recognition = VisualRecognitionV3(
'2018-03-19',
iam_apikey='{iam_api_key}')
with open('./fruitbowl.jpg', 'rb') as images_file:
classes = visual_recognition.classify(
images_file,
threshold='0.6',
classifier_ids='DefaultCustomModel_967878440').get_result()
print(json.dumps(classes, indent=2))
9. Visual Recognition Client in Android Phone
Details are given in the following link
VR
///////////////////////////////////////////// document ends //////////////////////////////////////////////////////
