
Real time big data applications with hadoop ecosystem


  1. Real-time Big Data Applications with Hadoop Ecosystem
     Chris Huang
     Sr. Manager, Core Tech
     2014/9/24
     9/25/2014 Confidential | Copyright 2013 TrendMicro Inc.
  2. About – Chris Huang
     • Chris Huang
       – SPN Solution Developer Manager
       – SPN Hadoop Architect
       – Hadoop.TW Active Member
     • Believes Cloud, Service, Software, Big Data are critical factors for Taiwan’s future economic development
  3. Conference Talks
  4. Conference Talks
  5. Hot Keywords in Hadoop Community
     Real-time
     • Impala, Stinger
     Computing Framework
     • YARN, Tez
     In Memory
     • Spark
     Streaming
     • Kafka, Storm
  6. Big Data Applications
     • Operational
       – Real-time
       – Near Real-time
     • Analytical
       – Batch
       – Interactive
       – Near Real-time
       – Streaming
  7. An Online Music Example
     • Operational
       – Recent N login times (listen duration)
       – Recent N albums/artists the user browses
       – Recent N keywords the user searches
       – Recent N songs/albums/artists the user listens to (buys)
       – Recent N months of the user’s purchase amount
     • Analytical
       – Recommend the right song/album/artist to the right user at the right time
       – Correlate similar songs/albums/artists (CDDB or user behavior)
       – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
       – Know regional music trends
       – Calculate regional leaderboard
       – Connect user with social network
  8. An Online Banking Example
     • Operational
       – Recent N login times / frequency
       – Recent N items purchased by credit card
       – Recent N months of balance amount
       – Recent N transfer in/out amounts
       – Recent N investment events
       – Recent N months of investment balance
     • Analytical
       – Know more about the user’s profile (assets/debts/shopping habits/family)
       – Recommend the right product to the right user (investment, credit card, loan)
       – Know seasonal trends (tax month/year end/back to school/X’mas)
       – Know regional investment product leaderboard (by different age)
       – Recommend products by similar user profile
  9. Building Your Big Data Applications
     • Think about your data
       – Entity or Event?
     • Think about your use case
       – Operational or Analytic?
     • Think about your data user
       – External or Internal?
  10. Think About Your Data
      Slides from “Apache HBase Application Archetypes”, HBaseCon 2014
      You can replace HBase with similar alternatives, but the concepts are the same
  19. Think About Your Use Case
  20. Operational Use Case 1
      [Diagram; labels: MR / Spark, Real-time, Batch, HDFS]
  21. HBase: No Secondary Index (yet)
      • Search index building (row key)
      • Use Solr to make text data searchable
        – Snapshot & clone table
        – Index column qualifier text
        – Record row-key in Solr document
        – Use HBase client to fetch data
      • Usually takes less than a few seconds
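The lookup pattern on slide 21 (index the text, record the row key in the index document, then fetch the row by key) can be sketched in a few lines. This is a minimal in-memory illustration, not the SPN implementation: the dictionary `hbase_table` stands in for an HBase table, `build_index` stands in for Solr indexing, and every name and sample row is hypothetical.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column qualifier: text value}.
# The table contents and key format are made up for illustration.
hbase_table = {
    "user#001": {"bio": "jazz piano fan", "city": "Taipei"},
    "user#002": {"bio": "rock guitar fan", "city": "Tokyo"},
    "user#003": {"bio": "jazz guitar teacher", "city": "Taipei"},
}


def build_index(table, qualifier):
    """Index the text of one column qualifier; each posting records the row key."""
    index = defaultdict(set)
    for row_key, columns in table.items():
        for token in columns.get(qualifier, "").lower().split():
            index[token].add(row_key)
    return index


def search(index, table, term):
    """Query the index for row keys, then fetch the full rows by key."""
    row_keys = sorted(index.get(term.lower(), set()))
    return [(key, table[key]) for key in row_keys]


bio_index = build_index(hbase_table, "bio")
hits = search(bio_index, hbase_table, "jazz")
print([key for key, _ in hits])  # ['user#001', 'user#003']
```

The index only needs to hold the searchable text and the row key; the full record stays in the primary store and is read by key, which is the access path HBase serves with low latency.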
  22. Operational Use Case 2 (SPN)
      [Diagram; labels: Get, Scan; Solr Client; low latency; high throughput; Index Query; MapReduce; Pig; HDFS; Flume; Feed App; Real-time; Batch]
  23. Operational Use Case 3 (Mixed)
      [Diagram; labels: Real-time; Put, Incr, Append; Get, Scan; Solr Client; low latency; high throughput; Index Query; Gets; Short scan; MapReduce; Pig; HDFS; Flume; Feed App; Batch; HBase Client; Bulk Import; MR / Spark; HBase Replication; Solr]
  24. HBase or HDFS?
      • Depends on what your data is
        – Entity or Event?
      • Depends on your workload
        – Low latency?
        – Random read/write?
        – Short/full scan?
        – Sequential read/write?
        – Update?
  25. Wait… Batch for Operational?
  26. Yes, why not?
  30. Operational: Batch + Real-time
      • Bridge the gap between batch and now
      • 80/20 rule
        – HDFS/MapReduce/Spark solves 80% easily
        – Remaining 20% takes 80% of the effort
      • Get as close as possible, but don’t overdo it!
  31. What is Real-time?
      • Real-time is NOT always “faster than batch”
        – If you have really BIG DATA
      • Most of the time, we want Timely Information
      • Minimize the gap between scheduled batch jobs
      [Timeline diagram: three consecutive blocks labeled Hourly Job]
      How to get the result at 1:33?
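One way to read "bridge the gap between batch and now" is to answer a query at 1:33 by adding the increments seen since 1:00 to the result of the last completed hourly job. The sketch below illustrates that idea only; the event list, the function names, and the assumption that batch jobs finish exactly on the hour are all hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event log of (timestamp, key) pairs. In a real system the batch
# part would live in HDFS and the recent part in a low-latency store.
events = [
    (datetime(2014, 7, 20, 0, 15), "US"),
    (datetime(2014, 7, 20, 0, 50), "TW"),
    (datetime(2014, 7, 20, 1, 5), "US"),
    (datetime(2014, 7, 20, 1, 20), "US"),
    (datetime(2014, 7, 20, 1, 40), "TW"),
]


def batch_view(events, batch_end):
    """Counts the last completed hourly job would hold: events before batch_end."""
    return Counter(key for ts, key in events if ts < batch_end)


def realtime_delta(events, batch_end, now):
    """Counts for events that arrived after the last batch, up to 'now'."""
    return Counter(key for ts, key in events if batch_end <= ts <= now)


def query(events, now):
    """Answer at 'now' = last batch result + increments since that batch."""
    batch_end = now.replace(minute=0, second=0, microsecond=0)
    return batch_view(events, batch_end) + realtime_delta(events, batch_end, now)


result = query(events, datetime(2014, 7, 20, 1, 33))
print(dict(result))  # {'US': 3, 'TW': 1}
```

The real-time part only has to cover the window since the last batch, so it stays small even when the batch data is very large.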
  32. Analytical Use Case
      Batch/streaming compute
      Near real-time/interactive delivery
  33. Near Real-time Interactive
  34. Recommendation System
  36. The Online Music Example
      • Operational
        – Recent N login times (listen duration)
        – Recent N albums/artists the user browses
        – Recent N keywords the user searches
        – Recent N songs/albums/artists the user listens to (buys)
        – Recent N months of the user’s purchase amount
      • Analytical
        – Recommend the right song/album/artist to the right user at the right time
        – Correlate similar songs/albums/artists (CDDB or user behavior)
        – Know seasonal music trends (X’mas, Valentine’s Day, New Year)
        – Know regional music trends
        – Calculate regional leaderboard
        – Connect user with social network
      Do you really want an analytical result (recommendation) EVERY 50 milliseconds?
  37. Analytical Use Case 1
      [Diagram; labels: Batch; HDFS; Index Query; Solr Client; Real-time]
  38. Analytical Use Case 2 (SPN)
      “A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”, HBaseCon 2014
  41. You Need an Interactive Analytic Engine
  42. Stinger
  44. Impala Architecture
      [Architecture diagram; labels: four worker nodes, each showing Datanode, Tasktracker, Regionserver, impala daemon; NN, JT, HM (Active); NN, JT, HM (Standby); State store; Catalog; Hive Metastore]
  49. Apache Pig (MapReduce)
      • Do hourly count on Akamai log
        – A = load 'date://2014/07/20/00' using AkamaiRCLoader();
          B = foreach (group A all) generate COUNT_STAR(A);
          dump B;
        – … 0% complete 100% complete (194202349)
      Too Slow for Interactive
  50. Using Impala
      • No memory cache
        – > select count(*) from akafast where day=20140720 and hour=0
        – 194202349
      • With OS cache
      • Do a further query:
        – select count(*) from akafast where day=20140720 and hour=00 and c='US';
        – 41118019
      Makes Sense Now
  51. Don’t Connect Analytic Engine with Operational Use Case
  52. Analytical Use Case 3
      [Diagram; labels: low latency; high throughput; Real-time; Put, Incr, Append; Gets; Short scan; HBase Client; Impala/Stinger; HDFS; Flume; Feed App; Interactive; Bulk Import; MR / Spark; Batch; Customer; Analyst]
  53. Streaming Use Cases
  58. TME – Trend Message Exchange
      http://trendmicro.github.io/tme/
  59. Streaming Operational Use Case
      [Diagram; labels: Real-time; Gets; Short scan; Kafka/Storm; Put, Incr, Append; HBase Client; low latency; HDFS; high throughput; Streaming; Index Query; Solr Client]
  60. Streaming Analytical Use Case
      [Diagram; labels: Put, Incr, Append; HBase Client; Kafka/Storm; low latency; HDFS; high throughput; Flume; Feed App; Gets; Short scan; Impala/Stinger; Interactive; Analyst; Real-time; Customer; Streaming]
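The labels on the two streaming slides suggest a stream processor that writes each event twice: as increments to a low-latency store (the Put, Incr, Append path) and as raw records to high-throughput storage. The arrows of the original diagrams are not recoverable from the transcript, so the sketch below is only one plausible reading. `CounterStore` and `AppendLog` are hypothetical in-memory stand-ins, not HBase, HDFS, Kafka, or Storm APIs.

```python
from collections import Counter


class CounterStore:
    """Hypothetical stand-in for a low-latency store with increment support."""

    def __init__(self):
        self._counts = Counter()

    def incr(self, key, amount=1):
        self._counts[key] += amount
        return self._counts[key]

    def get(self, key):
        # Counter returns 0 for keys that were never incremented.
        return self._counts[key]


class AppendLog:
    """Hypothetical stand-in for append-only, high-throughput storage."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


def process_stream(stream, counters, log):
    """For every event: update the serving counter and keep the raw record."""
    for event in stream:
        counters.incr((event["user"], event["action"]))
        log.append(event)


# Made-up events standing in for messages consumed from a queue.
stream = [
    {"user": "u1", "action": "listen"},
    {"user": "u1", "action": "listen"},
    {"user": "u2", "action": "search"},
]

counters, log = CounterStore(), AppendLog()
process_stream(stream, counters, log)
print(counters.get(("u1", "listen")), len(log.records))  # 2 3
```

Under this reading, customer-facing reads hit the counters, while analysts query the raw records later with an interactive engine.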
  61. Think About Your Data User
  62. Data User
      • External
        – Customer
        – Partner
      • Internal
        – Business report user
        – Data researcher
        – Data analyst
        – Algorithm developer
      • They want instant response
      • They don’t know (and don’t care) if the recommendation was computed 1 hour ago or 50 ms ago
      • Interactive or near real-time is enough
      • Sometimes they can even wait for batch (make data small and analyze)
      • Of course, everyone wants results faster, but it depends on your investment $$
  63. No Silver Bullet for Real-time or Big Data Applications
  64. Q&A
