
Limits of Computer Performance and How to Think About Them


Limits of Computer Performance and How to Think About Them
(from the Next-Generation Database Research Report)

November 29, 2018
SAKURA Internet Inc.
SAKURA Internet Research Center
Senior Researcher Naoto Matsumoto


Limits of Computer Performance and How to Think About Them

  1. 1. Limits of Computer Performance and How to Think About Them. 2018/11/29, SAKURA Internet Inc., SAKURA Internet Research Center, Senior Researcher / Naoto Matsumoto. (C) Copyright 1996-2017 SAKURA Internet Inc. From the Next-Generation Database Research Report.
  2. 2. The data-processing flow between server and storage across a storage network (basic technical grounding). A request passes: Application → OS (kernel) → CPU → PCI Express 3.0 → 40/100Gbit/s NIC → (network) → 40/100Gbit/s NIC → PCI Express 3.0 → CPU → OS (kernel) → HDD/SSD, with data processing in DRAM on both the requesting side (data lookup, result output) and the providing side (data delivery). Server-to-server data communication therefore involves a very long chain of processing stages; because of that length the flow is likened to a plesiosaur (a long-necked marine reptile).
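To make the "long neck" concrete, here is a minimal Python sketch that simply enumerates the hops in the slide's diagram; the stage names mirror the figure and nothing here is measured.

    # Enumerate the hops a single read request crosses (stage names from slide 2).
    request_path = [
        "Application", "OS (kernel)", "CPU", "PCI Express 3.0", "40/100G NIC",
        "network", "40/100G NIC", "PCI Express 3.0", "CPU", "OS (kernel)", "HDD/SSD",
    ]
    # The response retraces the same chain in reverse.
    kernel_hops = sum("kernel" in s for s in request_path) * 2
    print(len(request_path) * 2, "hops round-trip,", kernel_hops, "kernel crossings")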
  3. 3. Data size versus processing rate per unit time (one second): 40Gbit/s Ethernet (DPDK) 47M pps (64 Bytes)*; 40Gbit/s Ethernet (line rate) 3M pps (1500 Bytes)*; fio RAMDISK (DDR4) 19M iops (128 Bytes)***; NVMe SSD (U.2) 1-3M iops (4 KBytes)***; redis GET (localhost/DRAM) 2M rps (2 Bytes)**; Apache Ignite (CPU) 250K ops****; cupy.sort (GPU GDDR5) 214M ops (uint8)*****. Normalizing the units to Ops/byte gives: 2k pps/byte (NIC, line rate), 732k pps/byte (CPU, DPDK), 148k iops/byte (CPU, RAMDISK), 1M rps/byte (CPU, redis), 750 iops/byte (CPU, NVMe), 214M ops/byte (GPU). When a workload needs a large volume of high-speed computation, a configuration that tightly couples fast memory to the compute units is best. SOURCE: Linux 40GbE DPDK Performance / High Speed Packet Processing with Terminator 5 / Chelsio Communications Inc. (2015)*; redis-benchmark GET rates with AMD RYZEN 1800X and Intel Kaby Lake (i7-7700K) / SAKURA Internet Research Center (2017/05)**; SAKURA Internet Research Center lab test results (2017)***; Apache Ignite on Intel Core i7 (4.5GHz)****; R = randint(0,100,600000000); a = cp.array(R, dtype=np.uint8) 2.27 sec; cp.sort(a) 0.54 sec / SAKURA Internet Research Center (2018/05)*****.
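The normalization itself is plain division (operations per second ÷ record size); a quick Python check against the figures quoted above, which reproduces the slide's rounded values:

    # Reproduce the Ops/byte normalization from slide 3 (figures as quoted above).
    benchmarks = {
        "40GbE DPDK (64B packets)":  (47e6, 64),    # -> ~734k pps/byte (slide: 732k)
        "40GbE line rate (1500B)":   (3e6, 1500),   # -> 2k pps/byte
        "fio RAMDISK DDR4 (128B)":   (19e6, 128),   # -> ~148k iops/byte
        "redis GET (2B values)":     (2e6, 2),      # -> 1M rps/byte
        "NVMe SSD U.2 (4KB)":        (3e6, 4096),   # -> ~732 iops/byte (slide: 750)
        "cupy.sort GPU (1B uint8)":  (214e6, 1),    # -> 214M ops/byte
    }
    for name, (ops, size) in benchmarks.items():
        print(f"{name}: {ops / size:,.0f} ops/byte")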
  4. 4. Problems with conventional information-sharing systems, and issues to sort out for the next-generation database domain. Example data-processing lifecycle: an unspecified large population of read users (80%) and an unspecified small population of write users (20%) pass through request-dispatching logic to caches/services (APIs), and behind those to the database/storage layer, which handles permanent data storage, consistency checking, and archive processing; the result is complex and costly. Today these problems are solved by brute force and sheer hardware volume; optimizing cache efficiency for reads and improving request dispatching remain open issues going forward.
  5. 5. Appendix/memo CPU/GPU 5
  6. 6. How to measure your dataflow using Apache Ignite. Apache Ignite benchmark (operations/sec) across Intel Core i7 (4.5GHz), AMD Threadripper (3.8GHz), and AMD EPYC (2.1GHz): for processes bound to a single thread, the environment with the highest CPU clock performs best.
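The slide does not reproduce the benchmark harness itself; a minimal sketch of the same idea, assuming an Ignite node is running locally with the thin-client port open and the pyignite package installed (not part of the slide's memo):

    # Hypothetical single-thread put loop against a local Apache Ignite node,
    # in the spirit of the slide's operations/sec comparison.
    import time
    from pyignite import Client

    client = Client()
    client.connect('127.0.0.1', 10800)            # default thin-client port
    cache = client.get_or_create_cache('bench')

    n = 100_000
    start = time.time()
    for i in range(n):
        cache.put(i, i)
    elapsed = time.time() - start
    print(f"{n / elapsed:,.0f} puts/sec (single thread)")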
  7. 7. How to measure your dataflow using cupy & numpy (NVIDIA GPU). SOURCE: SAKURA Internet Research Center (04/2018), Project Sprig. Setup: # apt install python-pip; # pip install --upgrade pip; # pip install --upgrade setuptools; # pip install numpy cupy; # python. Each script times three steps with time.time(): generating R = randint(0,100,600000000); building the array with a = np.array(R, dtype=np.uint8) (numpy) or a = cp.array(R, dtype=cp.uint8) (cupy); and sorting with np.sort(a) or cp.sort(a). Performance comparison (time, lower is better): generating R takes 5.36 sec in both cases; building the array takes 0.46 sec with numpy (CPU) versus 2.27 sec with cupy (GPU, which includes the host-to-device copy); sorting takes 15.1 sec with numpy (CPU) versus 0.54 sec with cupy (GPU). nvidia-smi snapshot during the test: GeForce GTX 1050, 1205MiB / 1997MiB memory in use, 0% utilization at rest.
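The two scripts differ only in the array module; a consolidated sketch of the same measurement (Python 3), with one assumption added: an explicit GPU synchronize after cp.sort so the kernel time is actually captured:

    # Consolidated numpy-vs-cupy timing, equivalent to the slide's two scripts.
    import time
    import numpy as np
    import cupy as cp
    from numpy.random import randint

    def bench(xp, label, R):
        start = time.time()
        a = xp.array(R, dtype=xp.uint8)        # for cupy this includes the host-to-device copy
        t_build = time.time() - start
        start = time.time()
        xp.sort(a)
        if xp is cp:
            cp.cuda.Stream.null.synchronize()  # wait for the GPU kernel before stopping the clock
        t_sort = time.time() - start
        print(f"{label}: build {t_build:.2f}s, sort {t_sort:.2f}s")

    R = randint(0, 100, 600000000)             # same 600M-element input as the slide
    bench(np, "numpy (CPU)", R)
    bench(cp, "cupy (GPU)", R)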
  8. 8. ROCm with dGPU(AMD GPU) using pyopencl 8 # uname -sr; cat /etc/lsb-release Linux 4.4.0-116-generic DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" (ROCm does not support 17.10) # lscpu Model name: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz # lspci | grep VGA 65:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) ROCm Platform Supports Two Graphics Core Next (GCN) GPU Generations GFX8: Radeon RX 480,Radeon RX 470,Radeon RX 460,R9 Nano,Radeon R9 Fury,Radeon R9 Fury X Radeon Pro WX7100, FirePro S9300 x2 Radeon Vega Frontier Edition, Radeon Instinct: MI6, MI8, and MI25 (https://rocm.github.io/hardware.html) # apt update # apt dist-upgrade -y # apt-get install -y libnuma-dev # wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add - # sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list' # apt update # apt-get install -y rocm-dkms # ln -s /opt/rocm/opencl/lib/x86_64/libOpenCL.so.1 /usr/lib/libOpenCL.so # usermod -a -G video $LOGNAME # sync; sync; sync; reboot # /opt/rocm/opencl/bin/x86_64/clinfo Platform Version: OpenCL 2.1 AMD-APP.internal (2576.0) Platform Name: AMD Accelerated Parallel Processing # apt install python-pip opencl-headers -y # pip install --upgrade pip # pip install --upgrade setuptools # pip install pyopencl Successfully installed pyopencl-2018.1.1 >>> import numpy as np >>> import pyopencl as cl >>> from pyopencl import array as clarray >>> from pyopencl import algorithm as clalg >>> ctx = cl.create_some_context(0) >>> queue = cl.CommandQueue(ctx) >>> R = np.random.randint(0, 99, 100000000).astype(np.int8) >>> a = clarray.to_device(queue, R) >>> b = clalg.copy_if(a, 'ary[i] >= 55') >>> print b
  9. 9. How to burn your GPU with CUDA9.1 9 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # vi /etc/modprobe.d/blacklist-nouveau.conf blacklist nouveau options nouveau modeset=0 # sync; sync; reboot # apt install g++ freeglut3-dev build-essential libx11-dev libxmu-dev # apt install libxi-dev libglu1-mesa libglu1-mesa-dev gcc-6 g++-6 Download CUDA9.1 from https://developer.nvidia.com/cuda-toolkit # bash cuda_9.1.85_387.26_linux.run --silent --toolkit --override --no-opengl-libs --driver # ln -s /usr/bin/gcc-6 /usr/local/cuda/bin/gcc # ln -s /usr/bin/g++-6 /usr/local/cuda/bin/g++ # vi ~/.bashrc export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} export CUDA_HOME=/usr/local/cuda # source ~/.bashrc # git clone https://github.com/wilicc/gpu-burn.git # cd gpu-burn/ # vi Makefile NVCC=/usr/local/cuda/bin/nvcc # make # ./gpu_burn 1000 # watch -n 1 nvidia-smi +-----------------------------------------------------------------------------+ | NVIDIA-SMI 387.26 Driver Version: 387.26 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 GeForce GTX 1050 Off | 00000000:65:00.0 Off | N/A | | 37% 72C P0 N/A / 65W | 1793MiB / 1997MiB | 100% Default | +-------------------------------+----------------------+----------------------+
  10. 10. How to burn your GPU with CUDA9.1 (MapD Community Edition 3.4.0) 10 # apt install -y curl apt-transport-https # useradd -U mapd # ufw disable; ufw enable; ufw allow 9092/tcp; ufw allow 22/tcp # curl https://releases.mapd.com/ce/mapd-ce-cuda.list | sudo tee /etc/apt/sources.list.d/mapd.list # curl https://releases.mapd.com/GPG-KEY-mapd | sudo apt-key add - # apt update # apt install -y mapd # vi ~/.bashrc export MAPD_USER=mapd export MAPD_GROUP=mapd export MAPD_STORAGE=/var/lib/mapd export MAPD_PATH=/opt/mapd # source ~/.bashrc # mkdir -p $MAPD_STORAGE # chown -R $MAPD_USER $MAPD_STORAGE # cd $MAPD_PATH/systemd # ./install_mapd_systemd.sh # cd $MAPD_PATH # systemctl start mapd_server; systemctl enable mapd_server # systemctl start mapd_web_server; systemctl enable mapd_web_server # $MAPD_PATH/insert_sample_data 2) Flights (2008) 10k 2 # $MAPD_PATH/bin/mapdql -t Password: HyperInteractive mapdql> SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime" FROM flights_2008_10k WHERE distance <= 33 GROUP BY origin_city, dest_city; Execution time: 1268 ms, Total time: 1269 ms SOURCE: https://www.mapd.com/platform/download-community/ +---------------------------------------------------------- | NVIDIA-SMI 387.26 Driver Version: 387.26 |-------------------------------+----------------------+--- | GPU Name Persistence-M| Bus-Id Disp.A | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | |===============================+======================+=== | 0 GeForce GTX 1050 Off | 00000000:65:00.0 Off | | 29% 27C P0 N/A / 65W | 1449MiB / 1997MiB | +-------------------------------+----------------------+--- |========================================================== | 0 5828 C /opt/mapd/bin/mapd_server +---------------------------------------------------------- Origin|Destination|Average Airtime West Palm Beach|Tampa|33.81818181818182 Norfolk|Baltimore|36.07142857142857 Ft. Myers|Orlando|28.66666666666667 Indianapolis|Chicago|39.53846153846154 Tampa|West Palm Beach|33.25 Orlando|Ft. Myers|32.58333333333334 Austin|Houston|33.05555555555556 Chicago|Indianapolis|32.7 Baltimore|Norfolk|31.71428571428572 Houston|Austin|29.61111111111111
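The mapdql query above can also be issued from Python; a sketch assuming the pymapd client (not shown on the slide) and the same default credentials, with the binary protocol port (pymapd's default, 9091) reachable locally:

    # Hypothetical pymapd session against the local mapd_server started above.
    from pymapd import connect

    con = connect(user="mapd", password="HyperInteractive",
                  host="localhost", dbname="mapd")
    cur = con.cursor()
    cur.execute("SELECT origin_city, dest_city, AVG(airtime) "
                "FROM flights_2008_10k WHERE distance <= 33 "
                "GROUP BY origin_city, dest_city")
    for row in cur:          # DB-API cursor: iterate the result rows
        print(row)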
  11. 11. ROCm with dGPU(AMD GPU) (memo) 11 # uname -sr; cat /etc/lsb-release Linux 4.4.0-87-generic DISTRIB_DESCRIPTION="Ubuntu 16.04.3 LTS" # lscpu Model name: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz # lspci | grep VGA 65:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) / *[Radeon RX 460] ROCm Platform Supports Two Graphics Core Next (GCN) GPU Generations GFX8: Radeon RX 480,Radeon RX 470,Radeon RX 460,R9 Nano,Radeon R9 Fury,Radeon R9 Fury X Radeon Pro WX7100, FirePro S9300 x2 Radeon Vega Frontier Edition, Radeon Instinct: MI6, MI8, and MI25 (https://rocm.github.io/hardware.html) # apt update # apt dist-upgrade # apt-get install -y libnuma-dev # sync; sync; sync; reboot # wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | sudo apt-key add - # sh -c 'echo deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main > /etc/apt/sources.list.d/rocm.list' # apt-get install -y rocm-dkms # usermod -a -G video $LOGNAME # sync; sync; sync; reboot # /opt/rocm/opencl/bin/x86_64/clinfo Platform Version: OpenCL 2.1 AMD-APP.internal (2545.0) Platform Name: AMD Accelerated Parallel Processing # wget https://raw.githubusercontent.com/bgaster/opencl-book-samples/master/src/Chapter_2/HelloWorld/HelloWorld.cpp # wget https://raw.githubusercontent.com/bgaster/opencl-book-samples/master/src/Chapter_2/HelloWorld/HelloWorld.cl # g++ -I /opt/rocm/opencl/include/ ./HelloWorld.cpp -o HelloWorld -L/opt/rocm/opencl/lib/x86_64 -lOpenCL # ./HelloWorld 0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 102 105 108 111 114 117 120 ... 2985 2988 2991 2994 2997 Executed program succesfully.
  12. 12. AMDGPU ROCm Tensorflow 1.8 install memo (does not support Ubuntu 18.04) 12 # uname -sr; tail -2 /etc/lsb-release Linux 4.4.0-131-generic DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS" # lspci 17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67ef (rev cf) # apt update # apt dist-upgrade # apt install -y libnuma-dev wget python3-pip # sync; sync; sync; reboot # wget -qO - http://repo.radeon.com/rocm/apt/debian/rocm.gpg.key | apt-key add - # vi /etc/apt/sources.list.d/rocm.list deb [arch=amd64] http://repo.radeon.com/rocm/apt/debian/ xenial main # apt update # apt install -y rocm-dkms # usermod -a -G video $LOGNAME # sync; sync; sync; reboot # apt install -y rocm-libs miopen-hip cxlactivitylogger # sync; sync; sync; reboot # wget http://repo.radeon.com/rocm/misc/tensorflow/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl # pip3 install ./tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl # git clone https://github.com/tensorflow/models.git # python3 classify_image.py # cd ; git clone https://github.com/tensorflow/tensorflow.git # cd tensorflow/ # python3 tensorflow/examples/speech_commands/train.py # watch -n 1 /opt/rocm/bin/rocm-smi ==================== ROCm System Management Interface ==================== ================================================================================ GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD 0 35c 21.82W 1210Mhz 300Mhz 0.0% auto 0% 0% ================================================================================ ==================== End of ROCm SMI Log ==================== 2018-09-02 10:40:10.368117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties: name: Device 67ef AMDGPU ISA: gfx803 memoryClockRate (GHz) 1.21 pciBusID 0000:17:00.0 Total memory: 2.00GiB Free memory: 1.75GiB Adding visible gpu devices: 0 Device interconnect Created TensorFlow device (/job:localhost/replica:0/task:0/device: GPU:0 with 1567 MB memory) -> physical GPU (device: 0, name: Device 67ef, pci bus id: 0000:17:00.0)
  13. 13. AMDGPU ROCm Tensorflow 1.8 (classify_image.py) 13 # wget http://repo.radeon.com/rocm/misc/tensorflow/tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl # pip3 install ./tensorflow-1.8.0-cp35-cp35m-manylinux1_x86_64.whl # git clone https://github.com/tensorflow/models.git # python3 classify_image.py 2018-09-02 10:40:10.368117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1451] Found device 0 with properties: name: Device 67ef AMDGPU ISA: gfx803 memoryClockRate (GHz) 1.21 pciBusID 0000:17:00.0 Total memory: 2.00GiB Free memory: 1.75GiB 2018-09-02 10:40:10.368135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1562] Adding visible gpu devices: 0 2018-09-02 10:40:10.368153: I tensorflow/core/common_runtime/gpu/gpu_device.cc:989] Device interconnect StreamExecutor with strength 1 edge matrix: 2018-09-02 10:40:10.368162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:995] 0 2018-09-02 10:40:10.368175: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1008] 0: N 2018-09-02 10:40:10.368207: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1124] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1567 MB memory) -> physical GPU (device: 0, name: Device /opt/rocm/miopen/share/miopen/db/gfx803_14.cd.pdb.txt giant panda, panda, panda bear, coon bear, Ailuropoda melanoleuca (score = 0.89107) indri, indris, Indri indri, Indri brevicaudatus (score = 0.00779) lesser panda, red panda, panda, bear cat, cat bear, Ailurus fulgens (score = 0.00296) custard apple (score = 0.00147) earthstar (score = 0.00117) #
  14. 14. AMDGPU ROCm Tensorflow 1.8 (speech_commands/train.py) 14 # git clone https://github.com/tensorflow/tensorflow.git # cd tensorflow/ # python3 tensorflow/examples/speech_commands/train.py 2018-09-02 10:43:36.924800: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA AMDGPU ISA: gfx803 memoryClockRate (GHz) 1.21 pciBusID 0000:17:00.0 Total memory: 2.00GiB Free memory: 1.75GiB : INFO:tensorflow:Step #1: rate 0.001000, accuracy 9.0%, cross entropy 2.724346 INFO:tensorflow:Step #2: rate 0.001000, accuracy 9.0%, cross entropy 2.521507 : INFO:tensorflow:Saving to "/tmp/speech_commands_train/conv.ckpt-4300" INFO:tensorflow:Step #4301: rate 0.001000, accuracy 65.0%, cross entropy 1.094288 INFO:tensorflow:Step #4302: rate 0.001000, accuracy 69.0%, cross entropy 0.876309 : # /opt/rocm/bin/rocm-smi GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD 0 52c 44.230W 1172Mhz 1750Mhz 0.0% auto 0% 0% # top top - 10:58:10 up 25 min, 2 users, load average: 1.51, 1.29, 0.89 Tasks: 222 total, 2 running, 220 sleeping, 0 stopped, 0 zombie %Cpu0 : 6.2 us, 1.7 sy, 0.0 ni, 92.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 5.6 us, 2.8 sy, 0.0 ni, 91.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 8.3 us, 3.1 sy, 0.0 ni, 88.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 6.4 us, 2.7 sy, 0.0 ni, 90.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 9.8 us, 3.7 sy, 0.0 ni, 86.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 8.4 us, 3.0 sy, 0.0 ni, 88.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu6 : 5.4 us, 2.3 sy, 0.0 ni, 92.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu7 : 3.4 us, 2.0 sy, 0.0 ni, 94.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu8 : 3.4 us, 1.7 sy, 0.0 ni, 94.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu9 : 3.7 us, 1.7 sy, 0.0 ni, 94.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu10 : 6.0 us, 2.7 sy, 0.0 ni, 91.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu11 : 4.4 us, 2.0 sy, 0.0 ni, 93.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
  15. 15. Appendix/memo NVMe SSD/SPDK/DRAM 15
  16. 16. In-Memory Computing for FASTDATA using fio with RAMDISK(DDR4) 16 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # lshw -c cpu product: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz # lshw -class memory description: DIMM DDR4 Synchronous 2666 MHz (0.4 ns) # mkdir /ramdisk # mount -t tmpfs tmpfs /ramdisk # fio -directory=/ramdisk -rw=read -bs=* -size=1G -numjobs=16 -runtime=10 -group_reporting -name=data 64GB RAMDISK (fio block size: Bytes) with Core i7-7800X OverClocked 5GHz 19.9M IOPS 18.6M IOPS 16.3M IOPS 12.6M IOPS 7.8M IOPS 4.6M IOPS 2.4M IOPS 1.2M IOPS (Bytes)
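On this slide -bs=* stands for the block-size sweep behind the IOPS chart. A small Python driver for such a sweep, assuming fio is installed and /ramdisk is mounted as above (the sizes listed here are illustrative, not necessarily the slide's exact list):

    # Sweep fio block sizes against the tmpfs RAMDISK from slide 16.
    import subprocess

    for bs in ["64", "128", "256", "512", "1k", "2k", "4k", "8k"]:
        subprocess.run(
            ["fio", "-directory=/ramdisk", "-rw=read", f"-bs={bs}",
             "-size=1G", "-numjobs=16", "-runtime=10",
             "-group_reporting", "-name=data"],
            check=True,
        )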
  17. 17. How To Configure NVMe over Fabrics using MLNX_OFED <DRAFT> 17 NVMe Target Configuration # ./mlnxofedinstall --add-kernel-support --with-nvmf # modprobe mlx5_core # modprobe nvmet # modprobe nvmet-rdma # modprobe nvme-rdma # mkdir /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name # cd /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name # echo 1 > attr_allow_any_host # mkdir namespaces/10 # cd namespaces/10 # echo -n /dev/nvme0n1 > device_path # echo 1 > enable # mkdir /sys/kernel/config/nvmet/ports/1 # cd /sys/kernel/config/nvmet/ports/1 # ip addr add 1.1.1.1/24 dev enp2s0f0 # echo 1.1.1.1 > addr_traddr # echo rdma > addr_trtype # echo 4420 > addr_trsvcid # echo ipv4 > addr_adrfam # ln -s /sys/kernel/config/nvmet/subsystems/nvme-subsystem-name /sys/kernel/config/nvmet/ports/1/subsystems/nvme-subsystem-name NVMe Client (Initiator) Configuration # ./mlnxofedinstall --add-kernel-support --with-nvmf # modprobe mlx5_core # modprobe nvme-rdma # git clone https://github.com/linux-nvme/nvme-cli.git # cd nvme-cli # make # make install # nvme discover -t rdma -a 1.1.1.1 -s 4420 # nvme connect -t rdma -n nvme-subsystem-name -a 1.1.1.1 -s 4420 # nvme disconnect -d /dev/nvme0n1
  18. 18. Intel SPDK(Storage Performance Development Kit) benchmark 18 # uname -sr; Linux 4.10.0-40-generic # apt-get install libnuma-dev git uuid-dev libaio-dev libcunit1-dev libcunit1 libssl-dev g++ -y # cd /opt/; git clone https://github.com/axboe/fio # cd fio; git checkout -b fio-2.21 # make; make install # cd /opt/; git clone https://github.com/spdk/spdk # cd spdk; git submodule update --init # ./configure --with-fio=/opt/fio/ # make # /opt/spdk/scripts/setup.sh # fio --name=nvme --numjobs=8 --filename="trtype=PCIe traddr=0000.01.00.0 ns=1" --bs=4K --iodepth=4 --ioengine=/opt/spdk/examples/nvme/fio_plugin/fio_plugin --group_reporting --size=50% --runtime=100 --thread=8 --rw=read nvme: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=spdk, iodepth=4 ... fio-3.2-19-g609ac1 Starting 8 threads Starting DPDK 17.11.0 initialization... [ DPDK EAL parameters: fio -c 0x1 -m 512 --file-prefix=spdk_pid18356 ] EAL: Detected 8 lcore(s) EAL: No free hugepages reported in hugepages-1048576kB EAL: Probing VFIO support... EAL: PCI device 0000:01:00.0 on NUMA socket 0 EAL: probe driver: 8086:2700 spdk_nvme nvme: (groupid=0, jobs=8): err= 0: pid=18367: Mon Nov 27 15:36:06 2017 read: IOPS=572k, BW=2236MiB/s (2345MB/s)(218GiB/100001msec) slat (nsec): min=91, max=471828, avg=200.94, stdev=122.85 clat (usec): min=9, max=13319, avg=55.44, stdev= 7.84 lat (usec): min=14, max=13319, avg=55.64, stdev= 7.84 clat percentiles (usec): | 1.00th=[ 48], 5.00th=[ 50], 10.00th=[ 50], 20.00th=[ 51], | 30.00th=[ 52], 40.00th=[ 53], 50.00th=[ 53], 60.00th=[ 54], | 70.00th=[ 56], 80.00th=[ 60], 90.00th=[ 64], 95.00th=[ 67], | 99.00th=[ 88], 99.50th=[ 91], 99.90th=[ 100], 99.95th=[ 111], | 99.99th=[ 121] bw ( KiB/s): min=242664, max=310392, per=12.50%, avg=286296.77, stdev=11653.87, samples=1592 iops : min=60666, max=77598, avg=71574.18, stdev=2913.46, samples=1592 lat (usec) : 10=0.01%, 20=0.01%, 50=9.44%, 100=90.46%, 250=0.09% lat (usec) : 500=0.01%, 750=0.01% lat (msec) : 2=0.01%, 20=0.01%
  19. 19. In-Memory Database Registration Performance Check (Intel vs AMD) 19 Purley# uname -sr; cat /etc/redhat-release Linux 3.10.0-514.el7.x86_64 CentOS Linux release 7.3.1611 (Core) Purley# grep proc /proc/cpuinfo | wc -l 48 Purley# lscpu Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz RYZEN# uname -sr; cat /etc/debian_version Linux 4.10.0-19-generic stretch/sid RYZEN# grep proc /proc/cpuinfo | wc -l 16 RYZEN# lscpu Model name: AMD Ryzen 7 1800X Eight-Core Processor redis shows a measurable drop in per-process throughput as the stored data size grows.
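The slide's measurement used redis-benchmark; to see the same size-dependent drop from Python, one could time single-client GETs at several value sizes. A sketch assuming a local redis-server and the redis-py package:

    # Time redis GET throughput at increasing value sizes (single client).
    import time
    import redis

    r = redis.Redis(host="localhost", port=6379)
    for size in [2, 64, 1024, 16384]:
        r.set("k", "x" * size)               # store a value of the given size
        n = 100_000
        start = time.time()
        for _ in range(n):
            r.get("k")
        print(f"{size}B values: {n / (time.time() - start):,.0f} GET/s")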
  20. 20. In-Memory Database Performance Check 20 Intel Purley AMD Ryzen Xeon Phi(KNL) # uname -sr; cat /etc/redhat-release Linux 3.10.0-514.el7.x86_64 CentOS Linux release 7.3.1611 (Core) # grep proc /proc/cpuinfo | wc -l 48 # lscpu Model name: Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz
  21. 21. ALL FLASH DATACENTER & IN-MEMORY COMPUTING: HOT TOPICS 21 SOURCE: SAKURA Internet Research Center. (2017/10), Project Sprig.
  22. 22. ClickHouse column-oriented database Install memo 22 # uname -sr; cat /etc/issue Linux 4.10.0-35-generic Ubuntu 17.04 # apt install software-properties-common # apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4 # apt-add-repository "deb http://repo.yandex.ru/clickhouse/trusty stable main" # apt-get update # apt-get install clickhouse-server-common clickhouse-client -y # service clickhouse-server start # clickhouse-client --multiline ClickHouse client version 1.1.54304. Connecting to localhost:9000. Connected to ClickHouse server version 1.1.54304. :) CREATE TABLE ontime ( Year UInt16, Quarter UInt8, Month UInt8, : Div5TailNum String ) ENGINE = MergeTree(FlightDate, (Year, FlightDate), 8192); or # xz -v -c -d < ontime.csv.xz | clickhouse-client --query="INSERT INTO ontime FORMAT CSV"
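Once the server is up and the ontime table is created and loaded, the same data can also be queried from Python; a sketch assuming the clickhouse-driver package (not part of the slide's memo) and the default native port 9000:

    # Query the ontime table over ClickHouse's native protocol.
    from clickhouse_driver import Client

    client = Client("localhost")
    rows = client.execute(
        "SELECT Year, count() FROM ontime GROUP BY Year ORDER BY Year")
    for year, cnt in rows:
        print(year, cnt)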
  23. 23. MariaDB ColumnStore column-oriented database Install memo 23 # uname -sr; cat /etc/redhat-release Linux 3.10.0-514.el7.x86_64 Red Hat Enterprise Linux Server release 7.4 (Maipo) # mkdir mcs; cd mcs; # wget https://downloads.mariadb.com/ColumnStore/1.0.11/centos/x86_64/7/mariadb-columnstore-1.0.11-1-centos7.x86_64.rpm.tar.gz # tar xzvf ./mariadb-columnstore-1.0.11-1-centos7.x86_64.rpm.tar.gz # yum install boost boost-devel boost-doc expect perl-DBD-MySQL -y # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-common.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-client.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-server.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-libs.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-shared.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-gssapi-client.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-gssapi-server.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-platform.rpm -vh # rpm -i mariadb-columnstore-1.0.11-1-x86_64-centos7-storage-engine.rpm -vh # /usr/local/mariadb/columnstore/bin/postConfigure Select the type of System Server install [1=single, 2=multi] (2) > 1 Enter System Name (columnstore-1) > sprig-1 Select the type of Data Storage [1=internal, 2=external, 3=GlusterFS] (1) > 1 Enter the list (Nx,Ny,Nz) or range (Nx-Nz) of DBRoot IDs assigned to module 'pm1' (1) > 1 # . /usr/local/mariadb/columnstore/bin/columnstoreAlias # mcsadmin MariaDB ColumnStore Admin Console enter 'help' for list of commands enter 'exit' to exit the MariaDB ColumnStore Command Console use up/down arrows to recall commands mcsadmin>
  24. 24. Appendix/memo eBPF/XDP/DPDK 24
  25. 25. Quagga with ROUTE_MULTIPATH (memo) 25 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # grep ROUTE_MULTIPATH /usr/src/*/.config CONFIG_IP_ROUTE_MULTIPATH=y # apt-get install -y quagga traceroute # vi /etc/sysctl.conf net.ipv4.conf.all.forwarding=1 net.ipv4.fib_multipath_hash_policy = 1 net.ipv4.conf.all.arp_announce = 2 net.ipv4.conf.all.arp_ignore = 1 net.ipv4.conf.default.arp_filter = 1 net.ipv6.conf.all.forwarding=1 net.ipv6.route.max_size = 32768 net.ipv6.xfrm6_gc_thresh = 32768 # touch /etc/quagga/zebra.conf # touch /etc/quagga/ospfd.conf # touch /etc/quagga/ospf6d.conf # chown quagga.quaggavty /etc/quagga/*.conf # chmod 640 /etc/quagga/*.conf # ufw disable # vi /etc/quagga/daemons zebra=yes ospfd=yes ospf6d=yes # echo VTYSH_PAGER=more >> /etc/environment # sync; sync; sync; reboot # vtysh Quagga with ROUTE_MULTIPATH
  26. 26. My First XDP (eXpress Data Path) 26 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # apt install -y make gcc libssl-dev bc libelf-dev libcap-dev clang # apt install -y gcc-multilib llvm libncurses5-dev git bison flex pkg-config # apt install -y libmnl0 libmnl-dev clang libasm1 libasm-dev # mkdir /usr/local/include/asm # ln -s /usr/include/x86_64-linux-gnu/asm/* /usr/local/include/asm # git clone git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git # cd iproute2/ # ./configure --prefix=/sbin # make; make install # vi xdp_example.c #include <linux/bpf.h> #ifndef __section # define __section(NAME) __attribute__((section(NAME), used)) #endif __section("prog") int xdp_drop(struct xdp_md *ctx) { return XDP_DROP; } char __license[] __section("license") = "GPL"; # clang -O2 -Wall -target bpf -c xdp_example.c -o xdp_example.o # ip link set dev eth0 xdp obj xdp_example.o # ip link set dev eth0 xdp off SOURCE: https://github.com/torvalds/linux/tree/master/samples/bpf, http://cilium.readthedocs.io/en/latest/bpf/#llvm, http://vger.kernel.org/netconf2017_files/XDP_devel_update_NetConf2017_Seoul.pdf, http://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html, https://netdevconf.org/1.2/slides/oct7/10_nic_viljoen_eBPF_Offload_to_Hardware__cls_bpf_and_XDP_finalised.pdf, https://people.netfilter.org/hawk/presentations/NetDev2.2_2017/XDP_for_the_Rest_of_Us_Part_2.pdf, XDP – eXpress Data Path
  27. 27. My First F-Stack 27 # lscpu Model name: AMD Ryzen Threadripper 1900X 8-Core Processor # uname -sr; cat /etc/lsb-release Linux 4.10.0-35-generic DISTRIB_CODENAME=zesty DISTRIB_DESCRIPTION="Ubuntu 17.04" # cd /opt # git clone https://github.com/F-Stack/f-stack.git # /opt/f-stack/dpdk/tools/dpdk-setup.sh [15] x86_64-native-linuxapp-gcc Option: 15 # echo 1024 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages # mkdir /mnt/huge # mount -t hugetlbfs nodev /mnt/huge # echo 0 > /proc/sys/kernel/randomize_va_space # modprobe uio # insmod /opt/f-stack/dpdk/x86_64-native-linuxapp-gcc/kmod/igb_uio.ko # insmod /opt/f-stack/dpdk/x86_64-native-linuxapp-gcc/kmod/rte_kni.ko # export FF_PATH=/opt/f-stack/ # export FF_DPDK=/opt/f-stack/dpdk/x86_64-native-linuxapp-gcc/ # cd /root/f-stack/lib # make ; make ; make ; make install # cd /opt/f-stack/app/nginx-1.11.10 # ./configure --prefix=/usr/local/nginx_fstack --with-ff_module --without-http_rewrite_module # make # make install # grep f-stack /usr/local/nginx_fstack/conf/nginx.conf fstack_conf f-stack.conf; # grep addr /usr/local/nginx_fstack/conf/f-stack.conf addr=192.168.1.2 Copyright © 2018. Tencent Cloud All rights reserved.
  28. 28. My First FD.io VPP (Segment Routing for IPv6 / L3VPN for IPv4 traffic) 28 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # vi /etc/apt/sources.list.d/99fd.io.list deb [trusted=yes] https://nexus.fd.io/content/repositories/fd.io.ubuntu.xenial.main/ ./ # apt-get update # apt-get install -y vpp-lib vpp vpp-plugins # service vpp start # service vpp status ● vpp.service - vector packet processing engine Loaded: loaded (/lib/systemd/system/vpp.service; enabled; vendor preset: enabled) Active: active (running) since Tue 2018-02-13 09:30:25 JST; 21s ago : CGroup: /system.slice/vpp.service └─2011 /usr/bin/vpp -c /etc/vpp/startup.conf # vppctl vpp# set sr encaps source addr C1:: vpp# sr policy add bsid C1::999:2 next C2:: next C4::4 encap vpp# sr steer l3 1.1.1.0/24 via sr policy bsid C1::999:2 : vpp# sr localsid address C4::4 behavior end.dx4 GigabitEthernet0/6/0 1.1.1.1 vpp# show sr localsid SRv6 - My LocalSID Table: ========================= Address: c4::4 Behavior: DX4 (Endpoint with decapsulation and IPv4 cross-connect) Iface: GigabitEthernet0/6/0 Next hop: 1.1.1.1 SOURCE: VPP/Segment Routing for IPv6 (https://wiki.fd.io/view/VPP/Segment_Routing_for_IPv6) © 2017 FD.io is a Linux Foundation Project. All Rights Reserved.
  29. 29. FD.io VPP with XeonPhi (Basic Configuration) 29 # uname -sr; cat /etc/lsb-release Linux 4.13.0-21-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # lscpu CPU(s): 256 Model name: Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz # vi /etc/apt/sources.list.d/99fd.io.list deb [trusted=yes] https://nexus.fd.io/content/repositories/fd.io.ubuntu.xenial.main/ ./ # apt-get update # apt install vpp vpp-lib vpp-plugins python-pip # pip install vpp-config # vpp-config 5) Execute some basic tests. Command: 5 1) List/Create Simple IPv4 Setup Command: 1 Would you like to keep this configuration [Y/n]? n Would you like add address to interface GigabitEthernet4/0/1 [Y/n]? Y Please enter the IPv4 Address [n.n.n.n/n]: 1.1.1.11/24 # vi /etc/vpp/startup.conf unix { nodaemon log /var/log/vpp/vpp.log full-coredump cli-listen /run/vpp/cli.sock exec /usr/local/vpp/vpp-config/scripts/set_int_ipv4_and_up } # sync; sync; sync; reboot © 2017 FD.io is a Linux Foundation Project. All Rights Reserved. # vppctl # show int Name Idx State GigabitEthernet4/0/1 1 up # show int addr GigabitEthernet4/0/1 (up): 1.1.1.11/24
  30. 30. FD.io VPP with XeonPhi (Load Balancer plugin) 30 # vppctl # show int addr GigabitEthernet4/0/1 (up): 1.1.1.11/24 # lb conf ip4-src-address 1.1.1.11 timeout 3 # lb vip 1.2.3.4/32 encap gre4 new_len 1024 # lb as 1.2.3.4/32 1.1.1.8 1.1.1.9 1.1.1.10 # show lb vips 1.2.3.4 ip4-gre4 1.2.3.4/32 new_size:1024 #as:3 Application Server(1.1.1.8,9,10) side Configuration # ip tunnel add tun0 mode gre local 1.1.1.8 remote 1.1.1.11 ttl 255 # ifconfig tun0 1.2.3.4/32 up # echo 1 > /proc/sys/net/ipv4/conf/tun0/arp_ignore # echo 2 > /proc/sys/net/ipv4/conf/tun0/arp_announce # echo 0 > /proc/sys/net/ipv4/conf/tun0/rp_filter # echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter # echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore # echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce © 2017 FD.io is a Linux Foundation Project. All Rights Reserved. (Diagram: FD.io VPP at 1.1.1.11 load-balances the VIP 1.2.3.4/32 over GRE tunnels to application servers 1.1.1.8 and 1.1.1.9, each with tun0 bound to 1.2.3.4/32; return traffic goes back via Direct Server Response (DSR).)
  31. 31. Quagga with ROUTE_MULTIPATH for BGP load balancing (memo) 31 # uname -sr; cat /etc/lsb-release Linux 4.13.0-36-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # grep ROUTE_MULTIPATH /usr/src/*/.config /usr/src/linux-headers-4.13.0-36-generic/.config:CONFIG_IP_ROUTE_MULTIPATH=y # apt install -y quagga traceroute # touch /etc/quagga/zebra.conf; touch /etc/quagga/bgpd.conf; chown quagga.quaggavty /etc/quagga/*.conf # chmod 640 /etc/quagga/*.conf # ufw disable ; echo VTYSH_PAGER=more >> /etc/environment # vi /etc/quagga/daemons zebra=yes bgpd=yes # sync; sync; sync; reboot # vtysh # router bgp 65001 # bgp router-id 1.1.1.1 # bgp bestpath as-path multipath-relax # bgp bestpath compare-routerid # redistribute connected # neighbor 1.1.1.2 remote-as 65002 # neighbor 1.1.1.3 remote-as 65003 # maximum-paths 64 # interface lo # ip address 1.2.3.4/24 # router bgp 65002 # bgp router-id 1.1.1.2 # bgp bestpath as-path multipath-relax # bgp bestpath compare-routerid # redistribute connected # neighbor 1.1.1.1 remote-as 65001 # maximum-paths 64 # interface lo # ip address 1.2.3.4/24 # router bgp 65003 # bgp router-id 1.1.1.3 # bgp bestpath as-path multipath-relax # bgp bestpath compare-routerid # redistribute connected # neighbor 1.1.1.1 remote-as 65001 # maximum-paths 64 # show ip bgp BGP table version is 0, local router ID is 1.1.1.1 Status codes: s suppressed, d damped, h history, * valid, > best, = multipath, Network Next Hop Metric LocPrf Weight Path *> 1.2.3.0/24 1.1.1.2 0 0 65002 *= 1.1.1.3 0 0 65003
  32. 32. FD.io VPP tap-inject with sample_plugins 32 © 2017 FD.io is a Linux Foundation Project. All Rights Reserved. # uname -sr; cat /etc/lsb-release Linux 4.13.0-37-generic DISTRIB_DESCRIPTION="Ubuntu 17.10" # echo VTYSH_PAGER=more >> /etc/environment # apt install -y quagga # touch /etc/quagga/zebra.conf # touch /etc/quagga/bgpd.conf # chown quagga.quaggavty /etc/quagga/*.conf # chmod 640 /etc/quagga/*.conf # ufw disable # vi /etc/quagga/daemons zebra=yes bgpd=yes # sync; sync; sync; reboot # apt install build-essential -y # cd /opt/ # git clone https://gerrit.fd.io/r/vpp # git clone https://gerrit.fd.io/r/vppsb # cd /opt/vpp # ./extras/vagrant/build.sh # make install-dep; make bootstrap; make build # vi /opt/vppsb/router/router/tap_inject_node.c #include <sys/uio.h> # ln -sf /opt/vppsb/netlink # ln -sf /opt/vppsb/router # ln -sf /opt/vppsb/netlink/netlink.mk build-data/packages/ # ln -sf /opt/vppsb/router/router.mk build-data/packages/ # cd build-root/ # make V=0 PLATFORM=vpp TAG=vpp_debug netlink-install router-install # dpkg -i *.deb # cp -p /opt/vpp/build-root/install-vpp_debug-native/router/lib64/router.so.0.0.0 /usr/lib/vpp_plugins/router.so # service vpp restart # vppctl enable tap-inject # vppctl show tap-inject GigabitEthernet13/0/0 -> vpp1 GigabitEthernetb/0/0 -> vpp0 # vtysh (quagga) # configure terminal (config)# interface vpp0 (config-if)# ip address 192.168.11.100/24 (config-if)# exit (config)# exit # write # quit # vppctl show int addr GigabitEthernetb/0/0 (up): L3 192.168.11.100/24 L3 fe80::20c:29ff:fe24:af28/64 # /opt/vpp/src/examples/sample-plugin # libtoolize # aclocal # autoconf # autoheader # automake --add-missing # chmod +x configure # ./configure # make # make install GigabitEthernetb/0/0 vpp0 vpp_plugins / router.so vpp_plugins / sample_plugin.so quagga
  33. 33. FD.io VPP 18.07 with Ubuntu 16.04.5 LTS (does not support Ubuntu 18.04) 33 # uname -sr; tail -2 /etc/lsb-release Linux 4.4.0-131-generic DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS" # apt remove --purge vpp* # vi /etc/apt/sources.list.d/99fd.io.list deb [trusted=yes] https://nexus.fd.io/content/repositories/fd.io.stable.1807.ubuntu.xenial.main/ ./ # apt update # apt dist-upgrade -y # apt install -y vpp vpp-lib vpp-plugins vpp-dpdk-dkms # vppctl show pci Address Sock VID:PID Link Speed Driver Product Name 0000:05:00.0 0 8086:1539 2.5 GT/s x1 uio_pci_generic 0000:65:00.0 0 8086:1584 8.0 GT/s x8 uio_pci_generic XL710 40GbE Controller # vi /etc/vpp/startup.conf dpdk { dev 0000:65:00.0 } # service vpp restart # service vpp status Active: active (running) since Tue 2018-09-04 18:50:02 JST; 2s ago # vppctl set int ip address FortyGigabitEthernet65/0/0 1.2.3.4/24 # vppctl set int state FortyGigabitEthernet65/0/0 up # vppctl show interface addr FortyGigabitEthernet65/0/0 (up): L3 1.2.3.4/24 # vppctl show version vpp v18.07-rc2~6-gdb6d6b3~b28 built by root on 10268b67c8b1 at Mon Jul 30 ... # vi /etc/apt/sources.list deb http://security.ubuntu.com/ubuntu bionic-security main # apt update # apt install libssl1.1 -y Download from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc2/ # dpkg -i linux-headers-4.19.0-041900rc2_...rc2.201809022230_all.deb # dpkg -i linux-headers-4.19.0-041900rc2-...rc2.201809022230_amd64.deb # dpkg -i linux-headers-4.19.0-041900rc2-...rc2.201809022230_amd64.deb # dpkg -i linux-image-unsigned-4.19.0-......rc2.201809022230_amd64.deb # sync; sync; sync; reboot # uname -sr; tail -2 /etc/lsb-release Linux 4.19.0-041900rc2-generic DISTRIB_CODENAME=xenial DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS" # vppctl show int # (it does not work) NOTICE: does not work with kernel 4.19-rc2 © 2018 The Fast Data Project. Copyright © 2018 FD.IO Project a Series of LF Projects, LLC
  34. 34. Appendix/memo etc 34
  35. 35. My First Intel/Movidius NCS 35 SOURCE: SAKURA Internet Research Center. (2017/07) Project Sprig. $ sudo su # apt-get update ; apt-get upgrade -y # mkdir /opt/mvncsdk ; cd /opt/mvncsdk/ GoTo: https://developer.movidius.com/getting-started # wget https://ncs-forum-uploads.s3.amazonaws.com/ncsdk/MvNC_SDK_01_07_07/MvNC_SDK_1.07.07.tgz # tar zxvf MvNC_SDK_1.07.07.tgz ; tar zxvf MvNC_Toolkit-1.07.06.tgz ; tar xzvf ./MvNC_API-1.07.07.tgz # ./bin/setup.sh ; ./bin/data/dlnets.sh # source ~/.bashrc # cd /opt/mvncsdk/ncapi/; ./setup.sh ; cd ./c_examples/ ; make # ./ncs-fullcheck -l2 -c1 ../networks/AlexNet ../images/cat.jpg Device 0 Address: 2 - VID/PID 03e7:2150 Starting wait for connect with 2000ms timeout Found Address: 2 - VID/PID 03e7:2150 Found EP 0x81 : max packet size is 512 bytes Found EP 0x01 : max packet size is 512 bytes Found and opened device Performing bulk write of 825136 bytes... Successfully sent 825136 bytes of data in 35.764553 ms (22.002540 MB/s) Boot successful, device address 2 Found Address: 2 - VID/PID 040e:f63b done Booted 2 -> VSC OpenDevice 2 succeeded Graph allocated : $ uname -sr; Linux 4.8.0-36-generic (Ubuntu 16.04.02) $ lsusb -v Device Descriptor: iProduct 2 Movidius MA2X5X MaxPower 500mA © Copyright Movidius 2017. All Rights Reserved.
  36. 36. UP Board AI Core Configuration memo 36 # uname -sr; cat /etc/lsb-release Linux 4.4.0-116-generic DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS" # lshw *-pci:1 *-usb description: USB controller product: FL1100 USB 3.0 Host Controller *-usbhost:1 *-usb UNCLAIMED description: Generic USB device product: Movidius MA2X5X vendor: Movidius Ltd. # git clone -b ncsdk2 http://github.com/Movidius/ncsdk && cd ncsdk && make install # export PYTHONPATH="${PYTHONPATH}:/opt/movidius/caffe/python" # cd /examples/tensorflow/inception_v3 # cat run.py image_filename = path_to_images + 'nps_electric_guitar.png' devices = mvnc.enumerate_devices() # python3 run.py Number of categories: 1001 Start download to NCS... ******************************************************************************* inception-v3 on NCS ******************************************************************************* 547 electric guitar 0.988281 403 acoustic guitar 0.00751877 715 pick, plectrum, plectron 0.0014801 421 banjo 0.000901222 820 stage 0.000654221 ******************************************************************************* Finished Copyright 2018 Up Board | All Rights Reserved
  37. 37. USB 3.0 CAPTURE HDMI 4K with Loop-through for Image redistribution 37 # uname -sr; tail -1 /etc/redhat-release Linux 3.10.0-862.9.1.el7.x86_64 CentOS Linux release 7.4.1708 (Core) # yum install -y usbutils hwinfo mplayer v4l-utils ffmpeg git # lsusb -t /: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/10p, 5000M |__ Port 4: Dev 2, If 9, Class=Human Interface Device, Driver=usbhid, 5000M # lsusb -vv # hwinfo --usb # v4l2-ctl --list-devices USB Capture HDMI 4K+ (usb-0000:00:14.0-4): /dev/video0 # v4l2-ctl -d /dev/video0 --info # v4l2-ctl --list-formats-ext -d /dev/video0 Type : Video Capture Name : YUV 4:2:2 (YUYV) Size: Discrete 4096x2160 Interval: Discrete 0.017s (60.000 fps) # wget https://libav.org/releases/libav-12.3.tar.xz # tar Jxvf ./libav-12.3.tar.xz; cd libav-12.3 # ./configure --disable-yasm; make; make install # avconv -f video4linux2 -input_format nv12 -s 1920x1080 -i /dev/video0 -qscale 10 out.mpeg Input #0, video4linux2, from '/dev/video0': Duration: N/A, start: 1240.062083, bitrate: 1492992 kb/s nv12, 1920x1080, 1492992 kb/s 60 fps, 1000k tbn # ffmpeg -f v4l2 -list_formats all -i /dev/video0 [video4linux2,v4l2 @ 0x24114c0] Raw : yuyv422 : YUV 4:2:2 (YUYV) : 640x360 640x480 720x480 720x576 768x576 800x600 856x480 960x540 1024x576 1024x768 1280x720 1280x800 1280x960 1280x1024 1368x768 1440x900 1600x1200 1680x1050 1920x1080 1920x1200 2048x1080 2560x1440 3840x2160 4096x2160 [video4linux2,v4l2 @ 0x24114c0] Raw : nv12 : YUV 4:2:0 (NV12) : 640x360 640x480 720x480 720x576 768x576 800x600 856x480 960x540 1024x576 1024x768 1280x720 1280x800 1280x960 1280x1024 1368x768 1440x900 1600x1200 1680x1050 1920x1080 1920x1200 2048x1080 2560x1440 3840x2160 4096x2160 © 2018, Nanjing Magewell Electronics Co., Ltd (Diagram: the ORIGINAL server's HDMI output is read once by a USB 3.0 bus-powered HDMI capture device; each device's HDMI loop-through feeds the next capture device in the chain, so the COPY servers all receive the same image for POWER ON / OS DOWN monitoring.)
  38. 38. AMD Threadripper 1900X overview/spec 38 # uname -sr Linux 4.10.0-19-generic # vi /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT="pci=noaer" # update-grub; sync; sync; sync; reboot # lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 16 On-line CPU(s) list: 0-15 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 1 NUMA node(s): 1 : Model name: AMD Ryzen Threadripper 1900X 8-Core Processor CPU MHz: 3800.000 CPU max MHz: 3800.0000 CPU min MHz: 2200.0000 BogoMIPS: 7585.39 Virtualization: AMD-V L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 8192K NUMA node0 CPU(s): 0-15 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx cpb hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
  39. 39. AMD EPYC 7251 overview/spec 39 # uname -sr ; cat /etc/redhat-release Linux 3.10.0-693.5.2.el7.x86_64 CentOS Linux release 7.4.1708 (Core) # lscpu Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Thread(s) per core: 2 Core(s) per socket: 8 Socket(s): 2 NUMA node(s): 8 Vendor ID: AuthenticAMD CPU family: 23 Model: 1 Model name: AMD EPYC 7251 8-Core Processor Stepping: 2 CPU MHz: 1200.000 CPU max MHz: 2100.0000 CPU min MHz: 1200.0000 BogoMIPS: 4199.47 Virtualization: AMD-V L1d cache: 32K L1i cache: 64K L2 cache: 512K L3 cache: 4096K NUMA node0 CPU(s): 0,1,16,17 NUMA node1 CPU(s): 2,3,18,19 NUMA node2 CPU(s): 4,5,20,21 NUMA node3 CPU(s): 6,7,22,23 NUMA node4 CPU(s): 8,9,24,25 NUMA node5 CPU(s): 10,11,26,27 NUMA node6 CPU(s): 12,13,28,29 NUMA node7 CPU(s): 14,15,30,31 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl nonstop_tsc extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb hw_pstate avic fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold overflow_recov succor smca
  40. 40. Appendix: SSD DC P4800X and cost/performance analysis 40 (Cost) 512GB DDR4 DRAM (2666MHz/ECC) $6,399.68 SOURCE: © 2016 Colfax International. , © 2000-2017 Newegg Inc. / SAKURA Internet Research Center. (08/2017) Project Sprig. 750GB 3D XPOINT/NVMe SSD (P4800X x2) $3,790.00 30GB (300M records) sort: 296 sec 200GB (2,000M records) sort: 4,648 sec 192GB DDR4 DRAM (2666MHz/ECC) $2,399.88- In-Memory Computing All Flash Computing Processing Size: 6.7x Processing Cost: 1.5x Processing Time: 15x In-Memory Computing # gensort -a 2000000000 test # time sort --parallel=52 -T /memdrv test -o out # gensort -a 300000000 test # time sort --parallel=52 -T /ramdisk test -o out Processing Size: 2.7x Processing Cost: 2.7x Processing Time: N/A
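The ratio callouts on this slide follow directly from the quoted prices, sizes, and sort times; checking the arithmetic in Python:

    # Verify slide 40's cost/performance ratios from its quoted figures.
    inmem_cost, flash_cost, big_dram_cost = 2399.88, 3790.00, 6399.68
    inmem_size, flash_size, dram_capacity = 30, 200, 192   # GB sorted / DRAM GB
    inmem_time, flash_time = 296, 4648                     # seconds

    print(f"size {flash_size / inmem_size:.1f}x")          # 6.7x
    print(f"cost {flash_cost / inmem_cost:.1f}x")          # ~1.6x (slide rounds to 1.5x)
    print(f"time {flash_time / inmem_time:.0f}x")          # ~16x (slide rounds to 15x)
    print(f"dram {512 / dram_capacity:.1f}x size, "
          f"{big_dram_cost / inmem_cost:.1f}x cost")       # 2.7x each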
  41. 41. Appendix: stream_openmp performance check 41 (Chart: memory-bandwidth comparison across Xeon Phi, AMD RYZEN, and two Xeon systems.) Special Thanks: Takefumi Miyoshi
  42. 42. Appendix: Network Application Benchmark result (iperf with 20 servers) 42
  43. 43. How to measure your dataflow using fio, pktgen and bandwidthTest 43 (Diagram: two Intel Core i7-7800X hosts linked by Mellanox ConnectX-4 40GbE. Per host: RAMDISK on DDR4-2133 16GB x4, WRITE 12,648MB/s and READ 13,793MB/s at bs=256KB; Intel Optane 900P (3D XPoint), WRITE 2,000MB/s and READ 2,500MB/s at bs=4KB; pktgen at 40Mpps with 64B packets = 2,560MB/s against a 40GbE max rate of 5,000MB/s; GeForce GTX 1050 Host to Device 6,029MB/s, Device to Host 6,448MB/s.) # cd /opt # git clone git://dpdk.org/dpdk # git clone git://dpdk.org/apps/pktgen-dpdk export RTE_SDK=/opt/dpdk export RTE_TARGET=x86_64-native-linuxapp-gcc # sysctl vm.nr_hugepages=2048 # cd /opt/dpdk # make install T=x86_64-native-linuxapp-gcc # /opt/dpdk/usertools/dpdk-devbind.py -u 0b:00.0 # /opt/dpdk/usertools/dpdk-devbind.py -u 13:00.0 # /opt/dpdk/usertools/dpdk-devbind.py -b igb_uio 0b:00.0 # /opt/dpdk/usertools/dpdk-devbind.py -b igb_uio 13:00.0 # /opt/dpdk/usertools/dpdk-devbind.py --status # cd /opt/pktgen-dpdk/ # make # /opt/pktgen-dpdk/tools/setup.sh # /opt/pktgen-dpdk/app/x86_64-native-linuxapp-gcc/pktgen -- -m "1.0, 2.1" # mount -t tmpfs -o size=32G tmpfs /ramdisk # fio --directory=/ramdisk --rw=write --bs=4k --size=1G --numjobs=3 --runtime=100 --group_reporting --name=data # bash cuda_9.1.85_387.26_linux.run --silent --toolkit --override --no-opengl-libs --driver : # cd NVIDIA_CUDA-9.1_Samples/1_Utilities/bandwidthTest # ./bandwidthTest
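The MB/s figures in this diagram are unit conversions from the raw rates; a quick Python check:

    # Sanity-check the throughput numbers quoted on slide 43.
    pps, pkt_bytes = 40e6, 64
    print(f"pktgen: {pps * pkt_bytes / 1e6:,.0f} MB/s")   # 2,560 MB/s
    print(f"40GbE:  {40e9 / 8 / 1e6:,.0f} MB/s")          # 5,000 MB/s max rate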
