Building a Small Hybrid-DB System with SMACK
Lessons Learned (and Pitfalls We Hit)
Chih-Hsuan Hsu (Joe)
PilotTV / Data Engineer
About Me
● Chih-Hsuan Hsu (Joe)
● PilotTV Data Engineer
● Interested in SMACK/ELK architecture
● Technical book translator
○ Spark學習手冊
○ Cassandra技術手冊
● LinkedIn: www.linkedin.com/in/joechh
● Mail: joechh731126@gmail.com
Story: Renovating an Existing System
● RDB load was growing by the day
● Single point of failure
● We wanted to turn the batch-mode pipeline into a streaming-based data flow
● To keep phase one simple, we adopted only Spark, Kafka, and Cassandra from SMACK
○ because we weren't familiar with Mesos and Akka yet… QQ
System Before / (Planned) After the Renovation
[Before/after architecture diagram: the legacy ETL is replaced by a new ETL that publishes through a Kafka Producer]
New ETL
● Implemented in Java
● The ETL produces JSON-format streaming records
● The JSON stream is sent to the Kafka cluster through the Kafka Producer API
● Problem: a Kafka Producer built with default settings had throughput that was far too low
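As a point of reference, here is a minimal sketch of such a default-settings producer against the 0.8.2 (old) producer API; the broker list, topic name, and payload are hypothetical:

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class DefaultJsonProducer {
    public static void main(String[] args) {
        // Only the mandatory settings are supplied, so producer.type,
        // compression, batching, and acks all keep their 0.8.2 defaults.
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer = new Producer<>(new ProducerConfig(props));
        producer.send(new KeyedMessage<>("etl-topic", "{\"event\":\"demo\",\"value\":1}"));
        producer.close();
    }
}
```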
Experimenting with Kafka Producer Parameters (0.8.2)

Parameter                       Default   Options
producer.type                   sync      sync, async
compression.codec               none      none, gzip, snappy
batch.num.messages              200       unlimited
request.required.acks           0         -1, 0, 1
queue.buffering.max.messages    10000     unlimited

Ref: https://kafka.apache.org/082/documentation.html
Kafka Producer: producer.type
● Setting producer.type to async turns on batch transmission mode
● Batch mode gives better throughput, but data loss is possible if the client suddenly crashes
[Benchmark chart: ~6.3X throughput with async]
Kafka Producer: batch.num.messages
● In async mode, the amount of data transmitted per batch
● A batch is flushed when either:
○ the record count reaches batch.num.messages
○ the wait exceeds queue.buffering.max.ms
[Benchmark chart: ~2.55X and ~2.53X throughput]
Kafka Producer: queue.buffering.max.messages
● How many messages may sit in the producer's buffer queue
[Benchmark chart: ~1.05X throughput]
Kafka Producer: compression.codec
● The output stream can be compressed
[Benchmark chart: ~2.14X and ~3.02X throughput for the two codecs]
Kafka Producer: request.required.acks
● 0: no receipt acknowledgment (ack) from the Kafka cluster at all
● 1: ack from the replica leader only
● -1: ack from every replica
[Benchmark chart: ~6.3X and ~2.94X throughput]
Kafka Producer: request.required.acks (cont.)
● request.required.acks = 1 can still lose data
● Being in sync with the leader node does not mean fault tolerance
[Diagram sequence across three slides: the producer sends messages 5 and 6; the leader appends them and returns the ack while followers 1 and 2 still hold only messages 1-4, so a leader failure at that moment loses 5 and 6]
Kafka Producer: request.required.acks (cont.)
● request.required.acks = -1
● For fault tolerance: replication.factor >= 2 && min.insync.replicas >= 2
[Diagram sequence across three slides: the producer sends messages 5 and 6, and the ack returns only after all replicas have synced, so every replica's log holds messages 1-6]
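A hedged sketch of the durability-oriented setup the diagrams describe; the broker list is hypothetical, and the topic-level settings are shown as comments because they live on the broker/topic rather than the producer:

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.ProducerConfig;

public class DurableProducerConfig {
    static Producer<String, String> build() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // Wait for every in-sync replica to confirm before the send succeeds.
        props.put("request.required.acks", "-1");
        // Topic side (assumption: set when the topic is created):
        //   replication factor >= 2 and min.insync.replicas >= 2,
        // so a successful ack implies at least two copies of each message exist.
        return new Producer<>(new ProducerConfig(props));
    }
}
```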
Java Lambda Streaming
● During implementation we found that parallelStream() also helps throughput! (see the sketch after the next slide)
[Benchmark chart: ~3.47X throughput]
About the Kafka Producer object itself…
● It is thread safe, so a single instance can be shared by all threads
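A sketch combining the last two observations: one shared, thread-safe producer instance fed from a Java parallel stream. The topic name and records are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ParallelSender {
    // A single producer instance shared by every worker thread (it is thread safe).
    private static final Producer<String, String> PRODUCER = build();

    private static Producer<String, String> build() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092"); // hypothetical broker
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "async");
        return new Producer<>(new ProducerConfig(props));
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("{\"v\":1}", "{\"v\":2}", "{\"v\":3}");
        // parallelStream() fans the send() calls out across the common fork-join pool.
        records.parallelStream()
               .forEach(json -> PRODUCER.send(new KeyedMessage<>("etl-topic", json)));
        PRODUCER.close();
    }
}
```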
Kafka Producer: Experiment Conclusions

Parameter                       Final value   Options
producer.type                   async         sync, async
compression.codec               snappy        none, gzip, snappy
batch.num.messages              1000          unlimited
request.required.acks           0             -1, 0, 1
queue.buffering.max.messages    20000         unlimited

● You still have to test against your actual workload and your tolerance for data loss
● It is ultimately a latency vs. throughput trade-off
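The same producer construction as before, but with the final values from the table applied; still a sketch with hypothetical brokers:

```java
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.ProducerConfig;

public class TunedProducerConfig {
    static Producer<String, String> build() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092"); // hypothetical brokers
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("producer.type", "async");                // batch transmission mode
        props.put("compression.codec", "snappy");           // compress the output stream
        props.put("batch.num.messages", "1000");            // records per batch
        props.put("request.required.acks", "0");            // throughput over durability
        props.put("queue.buffering.max.messages", "20000"); // producer-side buffer size
        return new Producer<>(new ProducerConfig(props));
    }
}
```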
Spark Streaming
● Implemented in Scala
● Multiple Kafka streams are joined client-side
● The join result is written (upsert) into SQL Server to preserve the existing architecture
[Architecture diagram: the new ETL with Kafka Producer feeds Kafka; Spark performs the streaming join and upserts into SQL Server]
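The actual implementation was Scala; below is a rough Java equivalent (kept in Java for consistency with the other sketches) using the direct-stream API from spark-streaming-kafka-0-8. The topic names and join key are hypothetical:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class StreamJoin {
    static JavaPairDStream<String, Tuple2<String, String>> join(JavaStreamingContext jssc) {
        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "broker1:9092"); // hypothetical broker

        JavaPairDStream<String, String> orders = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("orders"));
        JavaPairDStream<String, String> payments = KafkaUtils.createDirectStream(
                jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
                kafkaParams, Collections.singleton("payments"));

        // The client-side join runs inside Spark, once per micro-batch;
        // the joined result is what then gets written to SQL Server.
        return orders.join(payments);
    }
}
```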
Spark Streaming at the Beginning
● Problem: tragic processing throughput, argh!! (join at 500 msgs/sec!!)
DB Lock Resource
● Inspecting the DB showed that Spark's upserts were exhausting the lock resources…
Solution
● Create a plain base table (without any index)
● Replace upsert with insert, then do a second-stage aggregation through a stored procedure (see the sketch below)
[Diagram: the stream appends to the base table; the stored procedure aggregates it]
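A sketch of the insert-then-aggregate pattern as it would look from plain JDBC (for example inside a foreachPartition handler); the JDBC URL, credentials, table, and stored-procedure names (raw_events, usp_aggregate_events) are all hypothetical:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class AppendThenAggregate {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost:1433;databaseName=etl", "user", "pass")) {
            // 1) Append only: plain INSERTs into the index-free base table,
            //    which avoids the lock pressure that the upserts created.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO raw_events (k, v) VALUES (?, ?)")) {
                ps.setString(1, "key-1");
                ps.setInt(2, 42);
                ps.addBatch();
                ps.executeBatch();
            }
            // 2) The second-stage aggregation happens inside the DB,
            //    driven by a stored procedure.
            try (CallableStatement cs = conn.prepareCall("{call usp_aggregate_events}")) {
                cs.execute();
            }
        }
    }
}
```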
Throughput Improvement (join 500 -> 13,000 msgs/sec)
Another Pitfall When Pairing Spark with an RDB
● SQL Server allows at most 32,767 concurrent connections
● Don't ask me how I know…
● Whichever connection pool you use, remember to compute the total connection count
● Total Connections = connection pool size * Spark executor count (e.g., a pool of 10 connections on each of 50 executors already means 500 connections)
When Using the Stateful mapWithState() API for Mapping…
● When requirements allow, set a timeout so idle KV pairs are evicted, reducing the state table's memory usage
● The timeout counts from the last time a KV pair was accessed
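A sketch of a timed-out state spec against Spark 2.0's Java API (the talk's code is Scala); the 30-minute window and the running-sum state are illustrative only:

```java
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;

public class StateWithTimeout {
    static StateSpec<String, Integer, Integer, Integer> spec() {
        Function3<String, Optional<Integer>, State<Integer>, Integer> update =
                (key, value, state) -> {
                    int sum = (value.isPresent() ? value.get() : 0)
                            + (state.exists() ? state.get() : 0);
                    if (!state.isTimingOut()) { // a timing-out state cannot be updated
                        state.update(sum);
                    }
                    return sum;
                };
        // Keys idle for 30 minutes are evicted from the state table,
        // keeping its memory footprint bounded.
        return StateSpec.function(update).timeout(Durations.minutes(30));
    }
}
```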
Some Handy Configs for spark-submit
Ref: https://spark.apache.org/docs/2.0.0-preview/configuration.html
● supervise
● spark.streaming.backpressure.enabled
● spark.streaming.backpressure.initialRate
● spark.streaming.kafka.maxRatePerPartition
● spark.executor.extraJavaOptions
○ -XX:+UseConcMarkSweepGC
● spark.cleaner.referenceTracking.cleanCheckpoint
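Most of these can also be set programmatically on the SparkConf; a sketch (the rate values are hypothetical, and supervise is a spark-submit flag rather than a conf key):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingConf {
    public static JavaStreamingContext create() {
        SparkConf conf = new SparkConf()
                .setAppName("etl-streaming")
                // Let Spark throttle Kafka ingestion based on batch processing times.
                .set("spark.streaming.backpressure.enabled", "true")
                .set("spark.streaming.backpressure.initialRate", "5000")  // hypothetical rate
                .set("spark.streaming.kafka.maxRatePerPartition", "2000") // hypothetical rate
                .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")
                .set("spark.cleaner.referenceTracking.cleanCheckpoint", "true");
        // --supervise is passed to spark-submit itself so a failed driver is restarted.
        return new JavaStreamingContext(conf, Durations.seconds(10));
    }
}
```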
NoSQL: Cassandra
● Query-first schema design philosophy
● Before designing the tables, list as many usage scenarios as you can
Step 1. Draw the ER diagram
Step 2. Consider the query scenarios
Step 3. Create tables that satisfy those queries
Ref: https://www.datastax.com/
However… Cassandra is out of this project!
User: Joe… we want to build a dashboard. It needs ad-hoc queries that can do arbitrary operations on arbitrary fields, so there's no way we can discuss the possible queries with you~~
Joe: …
Migrate the NoSQL Solution to the ELK Stack!
[Architecture diagram: the new ETL with Kafka Producer now feeds Logstash and Elasticsearch instead of Cassandra]
Logstash Ingestion from Kafka
● Problem: bulk-loading throughput into ES was very low (indexing 8,000 docs/min)
Researching Logstash Startup Flags
Ref: https://www.elastic.co/guide/en/logstash/2.4/command-line-flags.html
First-Step Improvement
● Since resources were still available, we tried raising the worker count and batch size
● workers -> 20; batch -> 500
[Benchmark chart: ~7.5X throughput]
When Logstash Has to Handle Multiple Kafka Topics…
● On version >= 5.0, the topics attribute can subscribe to several topics at once
● On version < 5.0… each one has to be declared separately in the config file
ES-Side Tuning for Bulk Loading
● Several trade-off options during a bulk load (see the sketch after this list):
○ Don't care about the latency of the newest data -> refresh less often (relax index.refresh_interval)
○ Don't care about fault tolerance or query speed -> set the replica count to 0
○ Don't care about the IO taken by segment merging (the faster the better) -> don't throttle merge IO
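A sketch of applying the first two trade-offs through ES's index-settings REST endpoint, using only the JDK's HTTP client so no extra library is assumed; the host and index name are hypothetical, and the values should be restored once the load finishes:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BulkLoadSettings {
    public static void main(String[] args) throws Exception {
        // Disable refreshes and replicas for the duration of the bulk load;
        // afterwards, PUT the normal values (e.g. "30s" and 1) back.
        String body = "{\"index\":{\"refresh_interval\":\"-1\",\"number_of_replicas\":0}}";
        URL url = new URL("http://eshost:9200/myindex/_settings");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```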
Bulk Load Improvement
● Final result: 400K docs/min, a 50X throughput gain
When the Bulk Load Is Humming Along…
● Too many open files!
One pitfall after another…
Changing ES max_file_descriptors
● First check the current max_file_descriptors, then raise the limit accordingly
ES Search Query Tuning
● Optimize (force merge) cold indices, even down to a single segment (see the sketch after this list)
● Use the two kinds of cache to boost query performance
○ Filter cache: caches filter results so later queries can reuse them
○ Shard cache: caches the entire query result; an identical query is answered straight from the cache
● Don't forget to remove the temporary settings made for bulk-load mode
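A sketch of force-merging a cold index down to a single segment over REST (ES 2.x exposes this as _forcemerge); the host and index name are hypothetical:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class ForceMergeColdIndex {
    public static void main(String[] args) throws Exception {
        // Only merge "cold" indices that no longer receive writes.
        URL url = new URL("http://eshost:9200/logs-2017.01/_forcemerge?max_num_segments=1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```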
Filter Query vs. Normal Query
Ref: Elasticsearch in Action
[Diagram comparing how a filter query and a normal (scoring) query are executed]
Translate Normal Query to Filter Query
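A sketch of the translation using the ES 2.x Java QueryBuilders; the fields (title, status) are hypothetical. Moving the exact-match condition from a scoring must clause into a filter clause skips scoring and lets the filter cache kick in:

```java
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class FilterQueryTranslation {
    // Normal form: both conditions participate in scoring.
    static BoolQueryBuilder normalQuery() {
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("title", "smack"))
                .must(QueryBuilders.termQuery("status", "published"));
    }

    // Filter form: the exact-match condition becomes a non-scoring,
    // cacheable filter clause.
    static BoolQueryBuilder filterQuery() {
        return QueryBuilders.boolQuery()
                .must(QueryBuilders.matchQuery("title", "smack"))
                .filter(QueryBuilders.termQuery("status", "published"));
    }
}
```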
Adding and Modifying ES Fields
● Adding a brand-new field is easy (flexible schema)
● Changing a field's type is painful!! It requires a reindex…
Kibana… No Pitfalls Yet (Just a Matter of Time?)
● Kibana 4.2+ ships the Sense tool, which is great for issuing DSL queries!!
● $ ./bin/kibana plugin --install elastic/sense
Summary
● Versions of the components discussed
○ Kafka: 0.8.2
○ Spark: 2.0.2
○ Cassandra: 3.10
○ Elasticsearch: 2.4.5
○ Logstash: 2.4.1
○ Kibana: 4.6.4