Approaching real-time-hadoop

Published in: Technology, Business

  • The analytics platform at Trend Micro has experienced tremendous growth over the past few years in terms of size and complexity. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”.
  • Transcript

    • 1. Approaching real-time: Things you can do before going Impala Chris Huang SPN Hadoop Architect
    • 2. About – Chris Huang • Chris Huang – SPN Hadoop Architect – SPN Dumbo Team – Hadoop.TW Active Member 9/28/2013 Confidential | Copyright 2013 TrendMicro Inc. 2
    • 3. About – SPN • SPN, Smart Protection Network – Proactive cloud-based threat interception technology • 2013 Big Data Foresight Forum – Scaling Big Data Mining Infrastructure: The Smart Protection Network http://www.slideshare.net/chenhsiu/scaling-bigdatamininginfra2
    • 5. Batch vs. Real-time • Batch, High Throughput – Q: How can I transport 10,000 people from Taipei to Kaohsiung? • Real-time, Timely Information – Q: What’s the fastest way to Taipei Train Station?
    • 6. 67% query Hadoop using Hive • 51% load data into Hadoop in less than 90 minutes • 54% use HBase for real-time data access (* Cloudera customer survey, Aug. 2012) • Time is Money!
    • 7. From Batch to Real-time • Bridge the gap between batch and now • 80/20 rule – Hadoop solves 80% easily – Remaining 20% takes 80% of the effort • Go as close as possible, don’t overdo it!
    • 8. What is Real-time? • Real-time is NOT always “faster than batch” – If you have really BIG DATA • Most of the time, we want Timely Information • Minimize the gap between scheduled MR jobs (Diagram: three consecutive hourly jobs) How do we get a result at 1:33?
    • 9. So, you want to talk about Impala?
    • 10. NO
    • 11. Impala is not a silver bullet * Here Impala denotes any interactive query solution, including Apache Drill and Apache Tez + Stinger
    • 12. You can do a lot before using Impala
    • 13. 3 Arrows for Real-time Applications: HBase (20%), SolrCloud (60%), Streaming (20%)
    • 14. Example Case
    • 15. Question 1 • If we get a C&C malicious URL hxxp://www.thebadguy.com/?info=12345678 • Yesterday, who accessed that URL? From where, and how? What’s the frequency?
    • 16. Very Simple
    • 17. But we have 5 billion lines of log per day
    • 18. It takes about 20 minutes, or up to 1 hour if you’re not lucky
    • 19. And we may query 50,000 times a day
    • 20. We need a real-time (interactive) system
    • 21. 1st Arrow: HBase
    • 22. Make Good Use of HBase Row Key • Row key format: com.domain.reverse#YYYYMMDD • Easily retrieve data with a row key scan • Hadoop in Taiwan 2012 – Designing a High-Performance HBase Schema: Understanding HBase http://www.youtube.com/watch?v=8DMzNmVrXEI
          Region  Start Key                      End Key
          R1      net.pwnnetwork#201208          net.tlm100.f19e100f#201304
          R2      net.tlm100.f19e100f#201304     nl.efkobeton.www#201211
          R3      nl.efkobeton.www#201211        no.rubrikk#201305
          R4      no.rubrikk#201305              org.saintalphonsus.www#201304
          R5      org.saintalphonsus.www#201304  pl.opole.uni.socjologia.www#201301
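The row key layout on slide 22 can be illustrated with a small sketch. It is plain Python over a sorted list, not the HBase client API: `make_row_key` and `prefix_scan` are hypothetical helper names, and the sample domains and dates are invented for illustration. The point it shows is that reversing the domain keeps all rows for one host (or one whole domain) adjacent in sort order, so a lookup becomes one seek plus a short sequential read.

```python
from bisect import bisect_left


def make_row_key(domain: str, day: str) -> str:
    """Build a key in the slide's format: reversed domain, '#', then YYYYMMDD."""
    reversed_domain = ".".join(reversed(domain.lower().split(".")))
    return f"{reversed_domain}#{day}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix: one binary search to find the
    start position, then a sequential read until the prefix stops matching."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical sample rows, kept sorted the way a key-ordered store would.
keys = sorted(
    make_row_key(domain, day)
    for domain, day in [
        ("www.thebadguy.com", "20130927"),
        ("www.thebadguy.com", "20130928"),
        ("mail.thebadguy.com", "20130927"),
        ("www.example.org", "20130927"),
    ]
)

one_host = prefix_scan(keys, "com.thebadguy.www#")   # every day for one host
whole_domain = prefix_scan(keys, "com.thebadguy.")   # every host under the domain
```

Keeping the trailing `#` or `.` in the prefix avoids accidentally matching a longer name that merely begins with the same characters.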
    • 23. Compute Once, Import Once • Clarify your use case • Compute the whole thing once – Daily job + hourly job • Import into HBase using Bulk Loading • On the fly query, with constant query time
    • 24. If You Really Care About Real-Time • Delta data are not big; don’t use MR • Write another program to calculate on the fly • Dynamically put into HBase – Row key: com.domain.reverse#YYYYMMDD_HHmmss • Query from both hourly batch and delta data • Drop delta data in next hourly batch (Diagram: delta data between the 2 am and 3 am batches)
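The batch-plus-delta idea on slide 24 could look roughly like the following. This is a minimal in-memory sketch, not HBase code: the dictionaries stand in for the batch and delta tables, and `query_hits`, `drop_covered_delta`, the cutoff format, and the sample counts are all assumptions made for illustration. Only the two row key formats come from the slides.

```python
def stamp(delta_key: str) -> str:
    """Extract the 'YYYYMMDD_HHmmss' part of a delta row key."""
    return delta_key.rsplit("#", 1)[1]


def query_hits(batch, delta_keys, day_key, batch_cutoff):
    """Hits for day_key = batch total + delta rows not yet covered by a batch.

    batch_cutoff is the 'YYYYMMDD_HHmmss' time up to which the last hourly
    batch has already counted events; zero-padded stamps compare correctly
    as plain strings.
    """
    recent = [
        key for key in delta_keys
        if key.startswith(day_key + "_") and stamp(key) >= batch_cutoff
    ]
    return batch.get(day_key, 0) + len(recent)


def drop_covered_delta(delta_keys, new_cutoff):
    """After the next hourly batch lands, keep only delta rows it did not cover."""
    return [key for key in delta_keys if stamp(key) >= new_cutoff]


# Hypothetical state at 1:33: the 1 am batch counted 40 hits for this host.
day_key = "com.thebadguy.www#20130928"
batch = {day_key: 40}
delta = [
    day_key + "_005900",  # stale: already included in the 1 am batch
    day_key + "_010512",
    day_key + "_013300",
]

hits_at_0133 = query_hits(batch, delta, day_key, "20130928_010000")
delta_after_2am = drop_covered_delta(delta, "20130928_020000")
```

Because the delta set only ever holds less than one batch interval of data, the extra read stays small even though it is computed at query time.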
    • 25. But... Life suffers because of “but”
    • 26. Question 2 • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
    • 27. HBase does not have a secondary index (yet)
    • 28. 2nd Arrow: SolrCloud
    • 29. Lucene, Solr, SolrCloud • TW Hadoop User Group Q1 Meetup - Solr Tutorial http://www.slideshare.net/chenhsiu/20130310-solr-tuorial
    • 30. What is Lucene? • Full-text search library • Written in Java • Indexing & searching • One of the top 5 Apache projects
    • 31. Inverted Index https://developer.apple.com/library/mac/#documentation/userexperience/Conceptual/SearchKitConcepts/searchKit_basics/searchKit_basics.html
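Since the slide only names the inverted index, a toy version may help make the idea concrete. This is a sketch of the general data structure, not of Lucene's implementation: tokenization here is a plain lowercase whitespace split, and the function names and sample documents are invented for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents containing every term in the query (AND search)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents.
docs = {
    1: "malicious site hosted in Japan",
    2: "benign site hosted in Taiwan",
    3: "malicious URL seen in Japan",
}
index = build_inverted_index(docs)
both = search_all(index, "malicious japan")
```

A lookup touches only the posting sets of the query terms, which is why query time depends on how many documents match rather than on how many were indexed.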
    • 32. What is Solr? • Enterprise search server based on Lucene – NOT a database • Advanced full-text search capabilities • Flexible and adaptable with XML configuration • Extensible plug-in architecture • REST-like APIs • Web admin interface • Runs inside a Java servlet container such as Jetty and Tomcat
    • 33. Use Hadoop MapReduce for Indexing (Diagram: Lucene Indexing Flow)
    • 34. Use SolrCloud for Scalable, Fault Tolerant Query (Diagram: Solr: Index Query Flow)
    • 35. What is SolrCloud?
    • 36. Indexing in SolrCloud
    • 37. Searching in SolrCloud
    • 38. Question 2 • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
          Pig script:
            -- Load one day of proxy URL logs
            A = load 'date://2013/09/28' using NSCTmProxyURLFProtobufLoader();
            -- Extract the peer IP and URL, and look up the IP's location
            B = foreach A generate value.addr.peerIp as ip, value.NSCLog.URL as url, Location(value.addr.peerIp) as loc;
            -- Keep the country name and build a "latitude,longitude" string
            C = foreach B generate ip, url, loc.countryName as cn, CONCAT(CONCAT((chararray)loc.latitude, ','), (chararray)loc.longitude) as loc;
            -- Index the result into SolrCloud
            store C into 'solrcloud://$COLLECTION' using SolrStorage('ip_s,url_domain,cn_s,loc_p', '$USERNAME', '$PASSWORD');
          Solr query:
            hxxp://$SERVER:8983/solr/$SHARD/select?q=cn_s:Japan+url_s:com*&wt=json&indent=true&rows=5&sort=geodist(loc_p,30.0,130.0)+asc
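The Solr select URL on slide 38 can also be assembled programmatically, which avoids hand-encoding mistakes. The sketch below uses only Python's standard `urllib.parse`; the field names (`cn_s`, `url_s`, `loc_p`) and parameters are copied from the slide, while `build_select_url`, the host name, and the shard name are placeholders assumed for illustration. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, country, url_prefix, lat, lon, rows=5):
    """Build a select URL mirroring the slide's query: filter by country and
    URL prefix, then sort ascending by distance from (lat, lon)."""
    params = {
        "q": f"cn_s:{country} url_s:{url_prefix}*",
        "wt": "json",
        "indent": "true",
        "rows": str(rows),
        "sort": f"geodist(loc_p,{lat},{lon}) asc",
    }
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host and shard; substitute real values for $SERVER and $SHARD.
url = build_select_url(
    "http://solr.example.invalid:8983/solr/shard1", "Japan", "com", 30.0, 130.0
)
decoded = parse_qs(urlsplit(url).query)
```

Decoding the query string back with `parse_qs` is a cheap way to confirm that the encoded URL still carries the intended `q` and `sort` values.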
    • 39. That’s it? YES
    • 40. If You Really Care About Real-Time • Delta data are not big; don’t use MR • Write another program to calculate on the fly • Solr supports dynamic indexing – Send your data to Solr to create a delta index • Query from both batch index and delta index • Drop delta index in next hourly batch (Diagram: delta data between the 2 am and 3 am batches)
    • 41. Domain/IP Census
    • 42. www.facebook.com
    • 43. Excellent!
    • 44. But... Life suffers because of “but”
    • 45. We need to identify the use case first • Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency? • Query malicious sites with pattern *.com hosted in Japan, sorted by the distance to GeoLocation (30.0,130.0)
    • 46. 3rd Arrow: Streaming
    • 47. Question 1 Revisited • Yesterday, who accessed hxxp://www.thebadguy.com? From where, and how? What’s the frequency? • Can you send an email when there is a contact to a specific C&C server? • Can you monitor a specific client IP against a list of C&C servers? • I found a certain pattern in C&C URL paths; can you give me an hourly update of the top 10 path groupings? • Report the C&C connection’s parent process SHA-1 to Virus DB for sourcing
    • 48. The Messaging • OSDC.TW 2012 - TME: Open Source Realtime Big Data Processing Platform http://cloud.github.com/downloads/trendmicro/tme/TME_Introduction_OSDC.tw2012%20.pdf
    • 49. Let’s dump the data
    • 50. You need lots of workers!
    • 51. Your boss won’t buy you another 100 servers
    • 52. NextGen MapReduce (YARN)
    • 53. Storm-YARN • Storm-on-YARN: Convergence of Low-Latency and Big-Data http://www.slideshare.net/Hadoop_Summit/feng-june26-1120amhall1v2
    • 54. Continuous Processing • Calculate data on the fly, endless processing • Hook up your processing anytime – Or store scripts on ZooKeeper • Leverage your existing Hadoop cluster • Dynamically scale in/out your workers
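To make the streaming requests from slide 47 concrete, here is a single-process sketch of the per-event logic. It is plain Python and not Storm or Storm-on-YARN code: the class name, the event fields (`ts`, `client_ip`, `url`), and the sample events are assumptions made for illustration. In a real topology this logic would sit inside a worker, and the alert list would be replaced by an email or message hook.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


class CncStreamMonitor:
    """Process events one at a time: record an alert for every contact with a
    known C&C host and keep per-hour counts of the URL paths requested."""

    def __init__(self, cnc_hosts):
        self.cnc_hosts = set(cnc_hosts)
        self.paths_by_hour = defaultdict(Counter)
        self.alerts = []  # stand-in for sending an email

    def process(self, event):
        """event: dict with 'ts' ('YYYYMMDD_HHmmss'), 'client_ip', and 'url'."""
        parts = urlsplit(event["url"])
        if parts.hostname not in self.cnc_hosts:
            return
        hour = event["ts"][:11]  # 'YYYYMMDD_HH'
        self.paths_by_hour[hour][parts.path] += 1
        self.alerts.append((event["ts"], event["client_ip"], parts.hostname))

    def top_paths(self, hour, n=10):
        """Most frequent C&C URL paths seen in the given hour."""
        return self.paths_by_hour[hour].most_common(n)


# Hypothetical event stream.
monitor = CncStreamMonitor(["www.thebadguy.com"])
events = [
    {"ts": "20130928_010512", "client_ip": "10.0.0.5", "url": "http://www.thebadguy.com/gate.php?info=1"},
    {"ts": "20130928_011020", "client_ip": "10.0.0.7", "url": "http://www.thebadguy.com/update"},
    {"ts": "20130928_013300", "client_ip": "10.0.0.5", "url": "http://www.thebadguy.com/gate.php?info=2"},
    {"ts": "20130928_013301", "client_ip": "10.0.0.9", "url": "http://www.example.org/index.html"},
]
for event in events:
    monitor.process(event)

top = monitor.top_paths("20130928_01")
```

Results are available the moment an event arrives, so the question "what happened at 1:33?" no longer has to wait for the next hourly job.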
    • 55. Summary
    • 56. 3 Arrows for Real-time Applications: HBase (20%), SolrCloud (60%), Streaming (20%)
    • 57. 80/20 Rule • As close as possible, don’t overdo it
    • 58. Why not just use Impala?
    • 59. The same problem, anyway
    • 60. Q&A
    • 61. You’re Brilliant • We’re hiring!