Analysis and Improvement of IOTA PoW Implementation
chenwei (魏禛)
<zhenwei.tw@gmail.com>
AndyYang (楊子賢)
<kukry5566@gmail.com>
March 10, 2018 / SITCON2018 1
chenwei (魏禛)
● From Tainan, Taiwan
● Master's student at National Taiwan University
● Recent work
○ Learning how to implement an interpreter
○ Learning Golang
○ Optimizing neural networks on multiple GPUs
● GitHub <https://github.com/chenwei-tw>
2
AndyYang (楊子賢)
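● From Taipei
● First-year graduate student in Computer Science and Information Engineering at National Taiwan University
● Research areas:
○ Machine learning
○ Computer architecture
● Recent work:
○ ReRAM-based accelerator for convolutional neural networks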
3
Brief Introduction to IOTA
from: “Iota Tangle Visualization” <https://simulation1.tangle.works/>
4
Brief Introduction to IOTA
● IRI (IOTA Reference Implementation)
○ Provides a RESTful API for participating in the Tangle
○ Exchanges transactions with other nodes
○ Maintains a database for storing transactions
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係” [The relationship between the IOTA light wallet, full wallet, and IOTA node]
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>
Referenced: “IOTA API Reference”
<https://iota.readme.io/v1.2.0/reference>
5
Brief Introduction to IOTA
● (Light) Wallet
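○ Check the balance, receive payments, transfer funds
○ Because the wallet does not run a full node, it gets all of its information by talking to a full node through the RESTful API described earlier
○ Before doing any operation with your wallet, check that the connected host is available
Referenced: “IOTA 輕量錢包、完整錢包與 IOTA Node 的關係” [The relationship between the IOTA light wallet, full wallet, and IOTA node]
<https://blog.louie.lu/2017/12/06/relationship-between-iota-light-wallet-full-wallet-and-full-node/>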
6
Brief Introduction to IOTA
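● How is a transaction issued?
○ The node selects two transactions to verify
○ It checks whether those two transactions conflict (e.g. a negative account balance)
○ It solves a cryptographic puzzle (PoW), which costs computing power
Referenced: “Tangle 白皮書” [Tangle white paper] <https://hackmd.io/s/ryriSgvAW>
Further Reading: “深入理解 IOTA 交易方式” [An in-depth look at how IOTA transactions work]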
<https://blog.louie.lu/2018/01/10/in-depth-explain-iota-transaction/>
7
How I got involved
● <attachToTangle> in IRI
Referenced: “iotaledger/iri” <https://github.com/iotaledger/iri>
8
How I got involved
● Many IOTA PoW implementations are hidden in these libraries
○ curl.lib.js <https://github.com/iotaledger/curl.lib.js>
○ gIOTA <https://github.com/iotaledger/gIOTA>
○ ccurl <https://github.com/iotaledger/ccurl>
○ iota-pearldiver <https://github.com/mlouielu/iota-pearldiver>
9
Why choose gIOTA?
● gIOTA collects several PoW implementations (C, SSE, AVX, OpenCL)
○ Most of them are C code embedded in Golang
● So once we build IOTA's underlying trinary structure in C, we can port these implementations over quickly
10
Trinary Structure
● Trinary is a base-3 numeral system, an alternative to binary
● Trits: analogous to bits, a ternary digit is a trit. A trit may have the value 1, 0, or -1
● Trytes: a tryte consists of 3 trits, which can represent 27 values
○ In IOTA, trytes are represented as the characters '9' and 'A' to 'Z'
Referenced: “IOTA Glossary” <https://iota.readme.io/docs/glossary>
11
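As a rough illustration of the mapping on this slide, the Python sketch below converts between trytes and trits. It assumes the usual IOTA convention: the alphabet order 9ABCDEFGHIJKLMNOPQRSTUVWXYZ, with 9 as 0, A to M as 1 to 13, N to Z as -13 to -1, and the least significant trit first. The function names are made up for this example and are not taken from dcurl.

```python
TRYTE_ALPHABET = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def tryte_to_trits(tryte):
    """Return the 3 balanced-ternary trits of one tryte, least significant first."""
    index = TRYTE_ALPHABET.index(tryte)
    # Indices 0..13 map to 0..13, indices 14..26 map to -13..-1.
    value = index if index <= 13 else index - 27
    trits = []
    for _ in range(3):
        remainder = value % 3          # Python yields 0, 1 or 2 here
        if remainder == 2:
            remainder = -1             # balanced ternary uses -1 instead of 2
        trits.append(remainder)
        value = (value - remainder) // 3
    return trits


def trits_to_tryte(trits):
    """Inverse of tryte_to_trits for a list of exactly 3 trits."""
    value = trits[0] + 3 * trits[1] + 9 * trits[2]
    return TRYTE_ALPHABET[value % 27]


def trytes_to_trits(trytes):
    """Convert a tryte string to a flat list of trits."""
    return [trit for tryte in trytes for trit in tryte_to_trits(tryte)]


print(tryte_to_trits("9"))     # [0, 0, 0]
print(trytes_to_trits("AZ"))   # [1, 0, 0, -1, 0, 0]
```

Storing one trit per list element is wasteful but keeps the mapping easy to follow; a C implementation would more likely use a byte array.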
Our Trinary Structure
Source Code: “chenwei-tw/dcurl” <https://github.com/chenwei-tw/dcurl/blob/dev/src/trinary/trinary.h>
12
What is PoW (Proof Of Work)?
● The tryte 9 is {0,0,0} in trits
[Figure: a hash whose last MWM trits are all 0 (...0000...0)]
Referenced: “The Anatomy of a Transaction”
<https://domschiener.gitbooks.io/iota-guide/content/chapter1/transactions-and-bundles.html>
13
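A minimal sketch of the condition the figure describes: a hash counts as valid proof of work when its last MWM trits are all 0. The helper names are hypothetical, and the hash is assumed to be given as a list of trits.

```python
def trailing_zero_trits(trits):
    """Count how many trits at the end of the list are 0."""
    count = 0
    for trit in reversed(trits):
        if trit != 0:
            break
        count += 1
    return count


def satisfies_mwm(hash_trits, mwm):
    """True if the hash ends with at least `mwm` zero trits."""
    return trailing_zero_trits(hash_trits) >= mwm


# A made-up hash: four arbitrary trits followed by 14 zeros.
example_hash = [1, -1, 0, 1] + [0] * 14
print(satisfies_mwm(example_hash, 14))   # True
print(satisfies_mwm(example_hash, 15))   # False
```

Each extra required zero trit makes a random hash three times less likely to qualify, which is why a larger MWM costs more computation.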
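How is the nonce found?
● The multi-threaded implementations collected in gIOTA do not actually divide the computation among threads; they are a brute-force approach that runs several copies of the same function at once and waits to see which one finds an answer first
● Each thread starts from a different seed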
14
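The racing strategy described on this slide can be sketched as follows. This is only an illustration: SHA-256 stands in for IOTA's Curl hash, "the digest is divisible by 3**mwm" stands in for "the hash ends with mwm zero trits", and the names toy_hash_ok, race_for_nonce and the 2**32 stride between starting points are all assumptions made for the example. Python threads also do not run CPU-bound code in parallel, so this shows the structure, not the speedup.

```python
import hashlib
import threading


def toy_hash_ok(payload, nonce, mwm):
    """Stand-in PoW check: the digest, read as an integer, is divisible by 3**mwm."""
    digest = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % (3 ** mwm) == 0


def race_for_nonce(payload, mwm, num_threads=4):
    """Run identical searches from different starting nonces; the first hit wins."""
    found = threading.Event()
    result = {}
    result_lock = threading.Lock()

    def worker(start):
        nonce = start
        while not found.is_set():
            if toy_hash_ok(payload, nonce, mwm):
                with result_lock:
                    if not found.is_set():   # keep only the first answer
                        result["nonce"] = nonce
                        found.set()
                return
            nonce += 1

    stride = 1 << 32   # assumed gap between the threads' starting points
    threads = [threading.Thread(target=worker, args=(i * stride,))
               for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result["nonce"]


nonce = race_for_nonce(b"example transaction", mwm=5)
print(nonce, toy_hash_ok(b"example transaction", nonce, 5))
```

Since every thread does the same work on a different range, the threads never need to exchange data; the only shared state is the flag that stops the losers.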
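Testing the correctness of gIOTA
● The C, Go, and SSE implementations pass without problems
Referenced: “用 C 開發 IOTA PoW 的各種實作” [Developing the various IOTA PoW implementations in C] <https://hackmd.io/s/HyNw4VM-z>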
15
Testing the correctness of gIOTA
● The AVX and OpenCL implementations, however, fail
pow_avx_test.go:47: pow is illegal
J9QTUNNMONCMIR9JBNMRC9SC9QTBRKBUVCBYBUITBHEICYVQ9HXEXSPWPU9KACTSDRSQBDOJPOOEAFVMP
pow_cl_test.go:46: pow is illegal
IIHYVX9VHSMQWSNDJYWZOJBCBTPVQBLVBF9UYIYSTEKJVEFVY9JPJJMRLFWOJFKNWKAANSZKLXDBWMALI
● We later found that iotaledger/ccurl and gIOTA use the same OpenCL kernel function, yet ccurl produces correct results; we suspect the problem lies in how gIOTA launches the kernel
● The GPU performance evaluation and the later design are therefore based on a modified version of iotaledger/ccurl
16
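Measuring the performance of the PoW implementations
● We first measured three PoW implementations using a single set of trytes
● We then found that the time needed to find the nonce differs from one set of trytes to another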
17
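Measuring the performance of the PoW implementations
● Measure with a large number of trytes and plot the distribution to compare the implementations
● Results for 30 trytes, 200 samples
[Figure annotation: 47 sets of samples take about 10 seconds to run, the price of repeatedly initializing the OpenCL context]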
Source Code: “chenwei-tw/iota-pow-in-c”
<https://github.com/chenwei-tw/iota-pow-in-c>
18
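Because the search time depends on the input, one measurement says little about an implementation. A sketch of the kind of measurement described here, under stated assumptions: pow_fn is a hypothetical callable standing in for any PoW implementation, and the input length, sample count and bin width are arbitrary choices, not the values used in the talk.

```python
import random
import time

TRYTE_ALPHABET = "9ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def random_trytes(length, rng):
    """Generate a random tryte string of the given length."""
    return "".join(rng.choice(TRYTE_ALPHABET) for _ in range(length))


def time_samples(pow_fn, inputs):
    """Run pow_fn once per input and return the elapsed time of each run in seconds."""
    durations = []
    for trytes in inputs:
        start = time.perf_counter()
        pow_fn(trytes)
        durations.append(time.perf_counter() - start)
    return durations


def histogram(durations, bin_width):
    """Group durations into bins of bin_width seconds: {bin index: count}."""
    counts = {}
    for duration in durations:
        bin_index = int(duration // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return counts


rng = random.Random(2018)
inputs = [random_trytes(81, rng) for _ in range(200)]
# Placeholder workload; a real measurement would call a PoW implementation here.
durations = time_samples(lambda trytes: sum(ord(c) for c in trytes), inputs)
print(histogram(durations, bin_width=0.001))
```

Plotting the bin counts against time gives a frequency distribution per implementation, which can then be compared across implementations.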
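Why does OpenCL perform poorly?
● Question: why is the OpenCL implementation on the GPU so slow?
● Possible causes:
○ Does the kernel function that searches for the nonce take a long time to compute?
○ Is the communication overhead between device and host too large?
○ Or is one of the OpenCL API calls at fault?
● A further problem:
○ The GPU in our test environment is from Nvidia, and Nvidia does not provide a profiling tool for its OpenCL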
19
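Rolling our own CUDA version!
● The most intuitive idea was to rewrite the OpenCL implementation in CUDA and then inspect it with nvprof, one of the tools in the CUDA toolkit
● The results in the figure below do not directly reveal the cause of the slowdown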
Further Reading: “Profiler :: CUDA Toolkit Documentation”
<http://docs.nvidia.com/cuda/profiler-users-guide/index.html>
20
Trying another profiling tool
● We later found another profiling tool on GitHub, uftrace, which reports information such as:
○ Duration
○ TID
○ Number of function calls
○ Total time
● Although uftrace cannot provide profiling information about the GPU, what it reports is still enough to show where the performance bottleneck is
Referenced: “namhyung/uftrace” <https://github.com/namhyung/uftrace>
21
uftrace measurement results
● record: runs a program and saves the trace data
● graph: shows the function call graph in the trace data
$ uftrace record pow_cl
$ uftrace graph main
22
uftrace measurement results
● The GPU initialization phase accounts for nearly 70% of the total time

Function               Duration
total time             1.938 s
init_clcontext         1.354 s
init_cl_kernel         14.362 us
write_cl_buffer        1.541 ms
clEnqueueWriteBuffer   1.538 ms
clWaitForEvents        569.901 ms
clEnqueueReadBuffer    84.981 us
Hash                   5.502 ms

[Figure: the calls are grouped into two phases, OpenCL context initialization and OpenCL searching nonce]
23
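The "nearly 70%" figure can be checked from the durations on this slide. The short sketch below converts each duration to seconds and divides by the total; it assumes the total of 1.938 is in seconds (its unit is missing on the slide) and that the function names are the split labels joined back together.

```python
UNIT_TO_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6}

TOTAL_SECONDS = 1.938   # assumed to be seconds

# (function, value, unit) as listed on the slide
measurements = [
    ("init_clcontext", 1.354, "s"),
    ("init_cl_kernel", 14.362, "us"),
    ("write_cl_buffer", 1.541, "ms"),
    ("clEnqueueWriteBuffer", 1.538, "ms"),
    ("clWaitForEvents", 569.901, "ms"),
    ("clEnqueueReadBuffer", 84.981, "us"),
    ("Hash", 5.502, "ms"),
]


def share_of_total(value, unit, total_seconds=TOTAL_SECONDS):
    """Fraction of the total run time taken by one measured call."""
    return value * UNIT_TO_SECONDS[unit] / total_seconds


for name, value, unit in measurements:
    print(f"{name:22s} {share_of_total(value, unit):6.1%}")
```

Context initialization comes out at roughly 69.9% and clWaitForEvents at roughly 29.4%, so every other call is negligible by comparison.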
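How to fix the problems in the OpenCL version
● Find a way to avoid initializing the OpenCL context repeatedly
○ ccurl's solution is to run only one PoW task at a time and reuse the same context
● After reading the ccurl source code, we believe its data structures were also designed with multi-threaded PoW tasks in mind, but when we launched several <ccurl_pow> calls at the same time in one address space, the resulting hashes were wrong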
24
New IOTA PoW Library - dcurl
● Goal
○ Make PoW run as fast as possible on the given hardware
○ Integrate it into IRI and verify whether performance improves
● Our ideas
○ PoW tasks can be executed by multiple threads
○ Integrate the most powerful IOTA PoW implementations
25
New IOTA PoW Library - dcurl
● Hardware Environment
○ Ubuntu 16.04
○ Intel(R) Xeon(R) CPU E5-2650 v4 @2.2GHz 48 cores
○ Nvidia Titan Xp
○ 94.2 GB RAM
26
New IOTA PoW Library - dcurl
27
New IOTA PoW Library - dcurl
It’s important to find the respective lock
28
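One way to read the ideas from the preceding slides (initialize each context only once, and let several PoW tasks run at the same time) is a pool of pre-initialized contexts, each guarded by its own lock, where a task takes whichever lock is free. The diagram from this slide is not available in the text, so the sketch below is only an interpretation under that assumption and not dcurl's actual code; ContextPool, init_context and run are hypothetical names.

```python
import threading


class ContextPool:
    """A fixed set of contexts, each initialized once and guarded by its own lock."""

    def __init__(self, num_contexts, init_context):
        # The expensive initialization happens once per context, up front.
        self._contexts = [init_context(i) for i in range(num_contexts)]
        self._locks = [threading.Lock() for _ in range(num_contexts)]
        # Admit at most as many tasks as there are contexts.
        self._slots = threading.Semaphore(num_contexts)

    def run(self, task):
        """Run task(context) on the first context whose lock is free."""
        with self._slots:
            while True:
                for context, lock in zip(self._contexts, self._locks):
                    if lock.acquire(blocking=False):   # try-lock, do not wait
                        try:
                            return task(context)
                        finally:
                            lock.release()


pool = ContextPool(2, init_context=lambda i: {"device": i})
print(pool.run(lambda context: context["device"]))
```

The semaphore keeps surplus tasks waiting outside instead of spinning over busy locks, and the per-context lock ensures that a context is never used by two tasks at once.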
Does multi-thread really bring speedup?
[Figure: frequency vs. time (s)]
29
Does multi-thread really bring speedup?
[Figure: frequency vs. time (s)]
30
Compare dcurl with other PoW Libraries
[Figure: frequency vs. time (s)]
31
Integrate dcurl into IRI
32
Integrate dcurl into IRI
● Use javah to generate the header file for the C program
$ javah com.iota.iri.hash.PearlDiver
33
Integrate dcurl into IRI
● <jni.h> provides many functions for converting between Java objects and C objects, such as:
○ GetIntArrayElements() takes a Java int array and returns a pointer to its elements as a C array
○ SetIntArrayRegion() copies a region of a C int array into a Java int array
Further Reading: “JNI Functions”
<https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html>
Further Reading: “Java Programming Tutorial Java Native Interface (JNI)”
<https://www.ntu.edu.sg/home/ehchua/programming/java/JavaNativeInterface.html>
34
Integrate dcurl into IRI
● Reminder
○ Give the compiler the include path of OpenJDK
○ Set the Java library path before launching your JVM
● Let's compile it!
○ We get a shared library for the JVM to load
○ Done!
Source code: “chenwei-tw/iri” <https://github.com/chenwei-tw/iri/tree/task/integrate_dcurl>
35
Performance between IRI and dcurl
<attachToTangle> Performance Comparison
[Figure: frequency vs. time (s)]
Different Hardware Platform
● Intel(R) Core(™) i7-8700K Processor
● Nvidia GeForce GTX 1080 Ti
● 32 GB Memory
36
Something in progress ...
● Fix the AVX implementation
● Let dcurl configure its environment and support multiple GPUs
● dcurl crashes if there is not enough GPU memory
● Have dcurl choose a suitable parameter set automatically
37
Future Work
● Add a new interface for PearlDiver in IRI, so that everyone can load the PoW implementation best suited to their hardware environment
● Search for other bottlenecks in IRI and try to improve them
38

Editor's Notes

  • #6 Anything that can carry out these operations can be called a “full node”