Approaching real-time-hadoop

Published in: Technology, Business

  1. Approaching real-time: Things you can do before going Impala • Chris Huang, SPN Hadoop Architect
  2. About – Chris Huang • Chris Huang – SPN Hadoop Architect – SPN Dumbo Team – Hadoop.TW Active Member 9/28/2013 Confidential | Copyright 2013 TrendMicro Inc.
  3. About – SPN • SPN, Smart Protection Network – proactive cloud-based threat interception technology • 2013 Big Data Foresight Forum – Scaling Big Data Mining Infrastructure: The Smart Protection Network http://www.slideshare.net/chenhsiu/scaling-bigdatamininginfra2
  5. Batch vs. Real-time • Batch (high throughput): Q: How can I transport 10,000 people from Taipei to Kaohsiung? • Real-time (timely information): Q: What’s the fastest way to Taipei Train Station?
  6. Time is Money! • 67% query Hadoop using Hive • 51% load data into Hadoop in less than 90 mins • 54% use HBase for real-time data access (* Cloudera customer survey, Aug. 2012)
  7. From Batch to Real-time • Bridge the gap between batch and now • 80/20 rule – Hadoop solves 80% easily – The remaining 20% takes 80% of the effort • Go as close as possible, don’t overdo it!
  8. What is Real-time? • Real-time is NOT always “faster than batch” – If you have really BIG DATA • Most of the time, we want Timely Information • Minimize the gap between scheduled MR jobs • [Timeline: three consecutive hourly jobs] How do we get a result at 1:33?
  9. So, you want to talk about Impala?
  10. NO
  11. Impala is not a silver bullet (* Here Impala denotes any interactive query solution, including Apache Drill and Apache Tez + Stinger)
  12. You can do a lot before using Impala
  13. 3 Arrows for Real-time Applications • HBase (20%) • SolrCloud (60%) • Streaming (20%)
  14. Example Case
  15. Question 1 • If we get a C&C malicious URL hxxp://www.thebadguy.com/?info=12345678 • Yesterday, who accessed that URL? From where, and how? What’s the frequency?
  16. Very Simple
  17. But we have 5 billion lines of log per day
  18. It takes about 20 minutes (~1 hour if you’re not lucky)
  19. And we may query 50,000 times a day
  20. We need a real-time (interactive) system
  21. 1st Arrow: HBase
  22. Make Good Use of HBase Row Key • Row key format: com.domain.reverse#YYYYMMDD • Easy to retrieve data by row key scan • Reference: Hadoop in Taiwan 2012 – Designing a High-Performance HBase Schema, Understanding HBase http://www.youtube.com/watch?v=8DMzNmVrXEI
      Region | Start Key                      | End Key
      R1     | net.pwnnetwork#201208          | net.tlm100.f19e100f#201304
      R2     | net.tlm100.f19e100f#201304     | nl.efkobeton.www#201211
      R3     | nl.efkobeton.www#201211        | no.rubrikk#201305
      R4     | no.rubrikk#201305              | org.saintalphonsus.www#201304
      R5     | org.saintalphonsus.www#201304  | pl.opole.uni.socjologia.www#201301
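The row key layout on this slide can be illustrated with a small sketch. It only builds keys and scan boundaries as strings; it does not talk to HBase, and the function names `reverse_domain`, `row_key`, and `scan_range` are hypothetical helpers introduced here, not part of the original deck.

```python
from datetime import date


def reverse_domain(host):
    """Reverse the labels of a host name: www.example.com -> com.example.www."""
    return ".".join(reversed(host.lower().split(".")))


def row_key(host, day):
    """Build a key in the slide's com.domain.reverse#YYYYMMDD layout."""
    return f"{reverse_domain(host)}#{day:%Y%m%d}"


def scan_range(host):
    """Return (start, stop) keys covering every date for one exact host.

    '$' is the ASCII character right after '#', so replacing the trailing
    '#' with '$' gives an exclusive upper bound for the prefix scan.
    """
    reversed_host = reverse_domain(host)
    return reversed_host + "#", reversed_host + "$"


if __name__ == "__main__":
    print(row_key("www.thebadguy.com", date(2013, 9, 27)))
    print(scan_range("www.thebadguy.com"))
```

Because keys sort lexicographically, all days for one host sit next to each other, which is what makes a row key scan cheap. Reversing the domain also keeps hosts under the same top-level domain adjacent.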
  23. Compute Once, Import Once • Clarify your use case • Compute the whole thing once – Daily job + hourly job • Import into HBase using Bulk Loading • Query on the fly, with constant query time
  24. If You Really Care About Real-Time • Delta data is not big, so don’t use MR • Write another program to calculate on the fly • Dynamically put into HBase – Row key: com.domain.reverse#YYYYMMDD_HHmmss • Query from both hourly batch and delta data • Drop delta data in the next hourly batch • [Timeline: delta data between the 2 am and 3 am batches]
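The batch-plus-delta idea on this slide can be sketched in memory as follows. This is a toy model under stated assumptions: the class `HitCounts` and its methods are hypothetical, plain Python containers stand in for the HBase tables, and each hourly batch is assumed to carry cumulative counts up to its cut-off time.

```python
from collections import Counter
from datetime import datetime


class HitCounts:
    """Toy stand-in for an hourly batch table plus a small delta table."""

    def __init__(self):
        self.batch = Counter()   # counts produced by the last hourly job
        self.delta = []          # (timestamp, key) events seen since that job

    def record(self, ts, key):
        """Called by the lightweight on-the-fly program for each new event."""
        self.delta.append((ts, key))

    def load_batch(self, counts, until):
        """Install a new hourly result and drop delta events it already covers."""
        self.batch = Counter(counts)
        self.delta = [(ts, k) for ts, k in self.delta if ts >= until]

    def query(self, key):
        """Answer from both sources: batch total plus not-yet-batched events."""
        return self.batch[key] + sum(1 for _, k in self.delta if k == key)


if __name__ == "__main__":
    counts = HitCounts()
    counts.load_batch({"com.thebadguy.www": 10}, datetime(2013, 9, 28, 1))
    counts.record(datetime(2013, 9, 28, 1, 20), "com.thebadguy.www")
    counts.record(datetime(2013, 9, 28, 1, 33), "com.thebadguy.www")
    print(counts.query("com.thebadguy.www"))  # batch count plus two delta events
```

A query at 1:33 therefore sees the 1:00 batch result plus everything recorded since, and the delta shrinks again as soon as the next hourly result is loaded.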
  25. But... Life suffers because of “but”
  26. Question 2 • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
  27. HBase does not have a secondary index (yet)
  28. 2nd Arrow: SolrCloud
  29. Lucene, Solr, SolrCloud • TW Hadoop User Group Q1 Meetup - Solr Tutorial http://www.slideshare.net/chenhsiu/20130310-solr-tuorial
  30. What is Lucene? • Full-text search library • Written in Java • Indexing & searching • One of the top 5 Apache projects
  31. Inverted Index • https://developer.apple.com/library/mac/#documentation/userexperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
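For readers unfamiliar with the term, an inverted index maps each term to the documents that contain it. The following is a minimal sketch of that data structure only; it is not how Lucene is implemented, and `build_index` and `search` are hypothetical names with a naive whitespace tokenizer.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term of the query (AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "malicious site hosted in Japan",
        2: "benign site hosted in Taiwan",
        3: "malicious download server",
    }
    index = build_index(docs)
    print(search(index, "malicious site"))
```

Lookups touch only the posting sets of the query terms rather than every document, which is why search stays fast as the collection grows.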
  32. What is Solr? • Enterprise search server based on Lucene – NOT a database • Advanced full-text search capabilities • Flexible and adaptable with XML configuration • Extensible plug-in architecture • REST-like APIs • Web admin interface • Runs inside a Java servlet container such as Jetty and Tomcat
  33. Use Hadoop MapReduce for Indexing • [Diagram: Lucene Indexing Flow]
  34. Use SolrCloud for Scalable, Fault Tolerant Query • [Diagram: Solr: Index Query Flow]
  35. What is SolrCloud?
  36. Indexing in SolrCloud
  37. Searching in SolrCloud
  38. Question 2 • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
      Pig script:
        A = load 'date://2013/09/28' using NSCTmProxyURLFProtobufLoader();
        B = foreach A generate value.addr.peerIp as ip, value.NSCLog.URL as url, Location(value.addr.peerIp) as loc;
        C = foreach B generate ip, url, loc.countryName as cn, CONCAT(CONCAT((chararray)loc.latitude, ','), (chararray)loc.longitude) as loc;
        store C into 'solrcloud://$COLLECTION' using SolrStorage('ip_s,url_domain,cn_s,loc_p', '$USERNAME', '$PASSWORD');
      Solr query:
        hxxp://$SERVER:8983/solr/$SHARD/select?q=cn_s:Japan+url_s:com*&wt=json&indent=true&rows=5&sort=geodist(loc_p,30.0,130.0)+asc
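The Solr query URL on this slide can also be assembled programmatically. The sketch below only builds the URL string and sends nothing. The function name `geo_query_url` and the example base URL are assumptions, while the field names (`cn_s`, `url_s`, `loc_p`) and parameters are copied from the slide's query as written.

```python
from urllib.parse import urlencode


def geo_query_url(base_url, country, domain_prefix, lat, lon, rows=5):
    """Build a select URL that filters by country and reversed-domain prefix,
    sorted by distance from (lat, lon) using Solr's geodist() function."""
    params = {
        "q": f"cn_s:{country} url_s:{domain_prefix}*",
        "wt": "json",
        "indent": "true",
        "rows": rows,
        "sort": f"geodist(loc_p,{lat},{lon}) asc",
    }
    # urlencode percent-escapes reserved characters and turns spaces into '+'.
    return f"{base_url}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and collection name, for illustration only.
    print(geo_query_url("http://localhost:8983/solr/collection1",
                        "Japan", "com", 30.0, 130.0))
```

Letting `urlencode` handle escaping avoids hand-written `+` and `%` sequences, which are easy to get wrong once the query contains parentheses and commas.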
  39. That’s it? YES
  40. If You Really Care About Real-Time • Delta data is not big, so don’t use MR • Write another program to calculate on the fly • Solr supports dynamic indexing – Send your data to Solr to create a delta index • Query from both batch index and delta index • Drop delta index in the next hourly batch • [Timeline: delta data between the 2 am and 3 am batches]
  41. Domain/IP Census
  42. www.facebook.com
  43. Excellent!
  44. But... Life suffers because of “but”
  45. We need to identify the use case first • Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency? • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
  46. 3rd Arrow: Streaming
  47. Question 1 Revisited • Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency? • Can you send an email when there is contact with a specific C&C server? • Can you monitor connections from a specific client IP to a list of C&C servers? • I found a certain pattern in C&C URL paths; can you give me an hourly update of the top 10 path groupings? • Report the parent process SHA-1 of each C&C connection to the Virus DB for sourcing
  48. The Messaging • OSDC.TW 2012 - TME: Open Source Realtime Big Data Processing Platform http://cloud.github.com/downloads/trendmicro/tme/TME_Introduction_OSDC.tw2012%20.pdf
  49. Let’s dump the data
  50. You need lots of workers!
  51. Your boss won’t buy you another 100 servers
  52. NextGen MapReduce (YARN)
  53. Storm-YARN • Storm-on-YARN: Convergence of Low-Latency and Big-Data http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2
  54. Continuously Processing • Calculate data on the fly, endless processing • Hook up your processing anytime – Or store scripts on ZooKeeper • Leverage your existing Hadoop cluster • Dynamically scale your workers in and out
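To make "calculate data on the fly" concrete, here is a minimal single-process sketch of a worker that handles two of the requests from the "Question 1 Revisited" slide: alerting on contact with a C&C server and an hourly top-10 of C&C URL paths. It is not Storm or TME code. The class `CncMonitor`, its callback, and the event shape (client IP plus URL) are assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


class CncMonitor:
    """Consumes events one at a time and keeps state for the current window."""

    def __init__(self, cnc_hosts, alert):
        self.cnc_hosts = set(cnc_hosts)
        self.alert = alert             # callback, e.g. something that sends an email
        self.path_counts = Counter()   # C&C URL paths seen in the current window

    def on_event(self, client_ip, url):
        """Handle one log event as it arrives from the message queue."""
        parts = urlsplit(url)
        if parts.hostname in self.cnc_hosts:
            self.alert(client_ip, url)
            self.path_counts[parts.path or "/"] += 1

    def flush_window(self, n=10):
        """Called once per hour: return the top-n paths and reset the window."""
        top = self.path_counts.most_common(n)
        self.path_counts.clear()
        return top


if __name__ == "__main__":
    monitor = CncMonitor({"www.thebadguy.com"},
                         lambda ip, url: print("ALERT", ip, url))
    monitor.on_event("10.0.0.1", "http://www.thebadguy.com/gate.php?id=1")
    monitor.on_event("10.0.0.3", "http://example.org/")
    print(monitor.flush_window())
```

In a real deployment many such workers would run in parallel, each consuming a partition of the stream, which is where the slides' point about scaling workers on the existing Hadoop cluster comes in.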
  55. Summary
  56. 3 Arrows for Real-time Applications • HBase (20%) • SolrCloud (60%) • Streaming (20%)
  57. 80/20 Rule • Go as close as possible, don’t overdo it
  58. Why not just use Impala?
  59. The same problem, anyway
  60. Q&A
  61. You’re brilliant. We’re hiring!