Real-time Big Data Applications with Hadoop Ecosystem

Published in: Technology

  1. Real-time Big Data Applications with Hadoop Ecosystem
     Chris Huang, Sr. Manager, Core Tech
     2014/9/24
     9/25/2014 Confidential | Copyright 2013 TrendMicro Inc.
  2. About – Chris Huang
     • Chris Huang
       – SPN Solution Developer Manager
       – SPN Hadoop Architect
       – Hadoop.TW Active Member
     • Believes Cloud, Service, Software, and Big Data are critical factors for Taiwan’s future economic development
  3. Conference Talks
  4. Conference Talks
  5. Hot Keywords in Hadoop Community
     • Real-time: Impala, Stinger
     • Computing Framework: YARN, Tez
     • In Memory: Spark
     • Streaming: Kafka, Storm
  6. Big Data Applications
     • Operational
       – Real-time
       – Near Real-time
     • Analytical
       – Batch
       – Interactive
       – Near Real-time
       – Streaming
  7. An Online Music Example
     • Operational
       – Recent N login times (listening duration)
       – Recent N albums/artists the user browsed
       – Recent N keywords the user searched
       – Recent N songs/albums/artists the user listened to (or bought)
       – The user’s purchase amount over the recent N months
     • Analytical
       – Recommend the right song/album/artist to the right user at the right time
       – Correlate similar songs/albums/artists (CDDB or user behavior)
       – Know seasonal music trends (Christmas, Valentine’s Day, New Year)
       – Know regional music trends
       – Calculate regional leaderboards
       – Connect users with their social networks
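The operational "Recent N" items above all share one access pattern: append an event for a user, keep only the newest N, and read them back quickly. A minimal in-memory sketch of that pattern follows. The class `RecentN` and its method names are hypothetical and not from the slides; a production system would keep this state in a low-latency store rather than in process memory.

```python
from collections import defaultdict, deque


class RecentN:
    """Keep only the N most recent events per (user, event kind)."""

    def __init__(self, n):
        self.n = n
        # deque(maxlen=n) drops the oldest entry automatically once it is full.
        self._events = defaultdict(lambda: deque(maxlen=n))

    def record(self, user_id, kind, value):
        """Append one event, e.g. kind="search" with the keyword as value."""
        self._events[(user_id, kind)].append(value)

    def recent(self, user_id, kind):
        """Return the stored events, newest first (empty list if none)."""
        # .get() avoids creating an empty entry for unknown users.
        return list(reversed(self._events.get((user_id, kind), ())))


if __name__ == "__main__":
    history = RecentN(3)
    for keyword in ["jazz", "blues", "rock", "folk"]:
        history.record("user-1", "search", keyword)
    print(history.recent("user-1", "search"))
```

Reads touch a single small entry per user, which is the property that makes this kind of query operational rather than analytical.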
  8. An Online Banking Example
     • Operational
       – Recent N login times / frequency
       – Recent N items purchased by credit card
       – Balance amount over the recent N months
       – Recent N transfer in/out amounts
       – Recent N investment events
       – Investment balance over the recent N months
     • Analytical
       – Know the user’s profile better (assets/debts/shopping habits/family)
       – Recommend the right product to the right user (investment, credit card, loan)
       – Know seasonal trends (tax month/year end/back to school/Christmas)
       – Know the regional investment product leaderboard (by age group)
       – Recommend products by similar user profile
  9. Building Your Big Data Applications
     • Think about your data
       – Entity or Event?
     • Think about your use case
       – Operational or Analytic?
     • Think about your data user
       – External or Internal?
  10. Think About Your Data
      • Slides from “Apache HBase Application Archetypes”, HBaseCon 2014
      • You can replace HBase with similar alternatives, but the concepts are the same
  19. Think About Your Use Case
  20. Operational Use Case 1
      [Diagram; surviving labels: HDFS, MR / Spark, Real-time, Batch]
  21. HBase: No Secondary Index (yet)
      • Search index building (row key)
      • Use Solr to make text data searchable
        – Snapshot & clone table
        – Index column qualifier text
        – Record the row key in the Solr document
        – Use the HBase client to fetch data
      • Usually takes less than a few seconds
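The lookup pattern on this slide (index the text, store the row key in the index document, then fetch the full row by key) can be sketched without either system. In the sketch below a plain dict stands in for the HBase table and a token-to-row-key map stands in for the Solr index. All names and sample rows are hypothetical, and no real Solr or HBase client API is shown.

```python
from collections import defaultdict

# Stand-in for an HBase table: row key -> {column qualifier: text}.
TABLE = {
    "row-001": {"title": "real time analytics"},
    "row-002": {"title": "batch analytics on hdfs"},
    "row-003": {"title": "real time search"},
}


def build_index(rows):
    """Map each token to the set of row keys whose text contains it."""
    index = defaultdict(set)
    for row_key, columns in rows.items():
        for text in columns.values():
            for token in text.lower().split():
                index[token].add(row_key)
    return index


def search(index, rows, query):
    """Return {row_key: columns} for rows that match every query token."""
    tokens = query.lower().split()
    if not tokens:
        return {}
    # Step 1: the index answers the text query with row keys only.
    row_keys = set.intersection(*(index.get(t, set()) for t in tokens))
    # Step 2: the row keys are used for direct gets against the table.
    return {key: rows[key] for key in sorted(row_keys)}


if __name__ == "__main__":
    idx = build_index(TABLE)
    print(search(idx, TABLE, "real time"))
```

The index only needs to hold searchable text plus the row key; the table remains the single source of the full record.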
  22. Operational Use Case 2 (SPN)
      [Diagram; surviving labels: Feed App, Flume, HDFS, MapReduce, Pig, Solr, Index, Query, Client; Get, Scan; low latency, high throughput; Real-time, Batch]
  23. Operational Use Case 3 (Mixed)
      [Diagram; surviving labels: Feed App, Flume, HDFS, MapReduce, Pig, MR / Spark, HBase Client, Bulk Import, HBase Replication, Solr, Index, Query, Client; Put, Incr, Append, Get, Gets, Scan, Short scan; low latency, high throughput; Real-time, Batch]
  24. HBase or HDFS?
      • Depends on what your data is
        – Entity or Event?
      • Depends on your workload
        – Low latency?
        – Random read/write?
        – Short/full scan?
        – Sequential read/write?
        – Update?
  25. Wait… Batch for Operational?
  26. Yes, why not?
  30. Operational: Batch + Real-time
      • Bridge the gap between the last batch run and now
      • 80/20 rule
        – HDFS/MapReduce/Spark solves 80% easily
        – The remaining 20% takes 80% of the effort
      • Get as close as possible, but don’t overdo it!
  31. What is Real-time?
      • Real-time is NOT always “faster than batch”
        – If you have really BIG DATA
      • Most of the time, we want Timely Information
      • Minimize the gap between scheduled batch jobs
      [Diagram labels: Hourly Job (×3); “How to get the result at 1:33?”]
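One way to read the 1:33 question: take the totals that completed hourly jobs have already produced, then add only the events that arrived since the last hour boundary. The sketch below illustrates that idea under simplifying assumptions. The function name and data shapes are hypothetical, and it assumes every hour before the current one already has a finished batch result.

```python
from datetime import datetime


def count_as_of(batch_counts, recent_events, as_of):
    """Combine hourly batch results with events newer than the last batch.

    batch_counts: {hour_start (datetime): count} from completed hourly jobs.
    recent_events: iterable of event timestamps not yet covered by a batch.
    as_of: the moment the caller wants a result for, e.g. 1:33.
    """
    hour_start = as_of.replace(minute=0, second=0, microsecond=0)
    # Everything before the current hour comes from the batch output.
    batch_total = sum(c for h, c in batch_counts.items() if h < hour_start)
    # The gap between the last hour boundary and as_of is counted from raw events.
    realtime_total = sum(1 for t in recent_events if hour_start <= t <= as_of)
    return batch_total + realtime_total


if __name__ == "__main__":
    batches = {datetime(2014, 7, 20, 0): 100}
    events = [
        datetime(2014, 7, 20, 1, 5),
        datetime(2014, 7, 20, 1, 20),
        datetime(2014, 7, 20, 1, 40),  # after 1:33, so not counted
    ]
    print(count_as_of(batches, events, datetime(2014, 7, 20, 1, 33)))
```

The real-time part only ever covers less than one batch interval of data, which keeps it small no matter how large the batch history grows.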
  32. Analytical Use Case
      • Batch/streaming compute
      • Near real-time/interactive delivery
  33. Near Real-time Interactive
  34. Recommendation System
  36. The Online Music Example
      • Operational
        – Recent N login times (listening duration)
        – Recent N albums/artists the user browsed
        – Recent N keywords the user searched
        – Recent N songs/albums/artists the user listened to (or bought)
        – The user’s purchase amount over the recent N months
      • Analytical (recommendation)
        – Recommend the right song/album/artist to the right user at the right time
        – Correlate similar songs/albums/artists (CDDB or user behavior)
        – Know seasonal music trends (Christmas, Valentine’s Day, New Year)
        – Know regional music trends
        – Calculate regional leaderboards
        – Connect users with their social networks
      • Do you really want analytical results EVERY 50 milliseconds?
  37. Analytical Use Case 1
      [Diagram; surviving labels: HDFS, Batch, Solr, Index, Query, Client, Real-time]
  38. Analytical Use Case 2 (SPN)
      • “A Graph Service for Global Web Entities Traversal and Reputation Evaluation Based on HBase”, HBaseCon 2014
  41. You Need an Interactive Analytic Engine
  42. Stinger
  44. Impala Architecture
      [Diagram; surviving labels: NN, JT, HM (Active); NN, JT, HM (Standby); four nodes each labeled Datanode, Tasktracker, Regionserver, impala daemon; State store; Catalog; Hive Metastore]
  49. Apache Pig (MapReduce)
      • Do an hourly count on the Akamai log
        – A = load 'date://2014/07/20/00' using AkamaiRCLoader(); B = foreach (group A all) generate COUNT_STAR(A); dump B;
        – … 0% complete 100% complete (194202349)
      • Too slow for interactive use
  50. Using Impala
      • No memory cache
        – > select count(*) from akafast where day=20140720 and hour=0
        – 194202349
      • With OS cache
      • Do a further query:
        – select count(*) from akafast where day=20140720 and hour=00 and c='US';
        – 41118019
      • Makes sense now
  51. Don’t Connect the Analytic Engine to the Operational Use Case
  52. Analytical Use Case 3
      [Diagram; surviving labels: Feed App, Flume, HDFS, MR / Spark, Bulk Import, HBase Client, Impala/Stinger, Customer, Analyst; Put, Incr, Append, Gets, Short scan; low latency, high throughput; Real-time, Interactive, Batch]
  53. Streaming Use Cases
  58. TME – Trend Message Exchange
      http://trendmicro.github.io/tme/
  59. Streaming Operational Use Case
      [Diagram; surviving labels: Kafka/Storm, HDFS, HBase Client, Solr, Index, Query, Client; Put, Incr, Append, Gets, Short scan; low latency, high throughput; Real-time, Streaming]
  60. Streaming Analytical Use Case
      [Diagram; surviving labels: Feed App, Flume, Kafka/Storm, HDFS, HBase Client, Impala/Stinger, Customer, Analyst; Put, Incr, Append, Gets, Short scan; low latency, high throughput; Real-time, Interactive, Streaming]
  61. Think About Your Data User
  62. Data User
      • External
        – Customer
        – Partner
      • Internal
        – Business report user
        – Data researcher
        – Data analyst
        – Algorithm developer
      • They want an instant response
      • They don’t know (and don’t care) whether the recommendation was computed 1 hour ago or 50 ms ago
      • Interactive or near real-time is enough
      • Sometimes they can even wait for batch (make the data small, then analyze)
      • Of course, everyone wants results faster, but that depends on how much you invest ($$)
  63. No Silver Bullet for Real-time or Big Data Applications
  64. Q&A