HadoopCon- Trend Micro SPN Hadoop Overview

Consultant at SAS Institute Taiwan Ltd.
Sep. 23, 2014

  1. Trend Micro SPN Hadoop Overview • 張雅芳 Mammi Chang @ 2014 Taiwan HadoopCon
  2. Who am I? • Mammi Chang 張雅芳 • Engineer, SPN, Trend Micro • SPN Hadoop cluster administrator for 2 years • Developer of operation tools • Expertise: HDFS/HBase/Pig • Experience with the Mahout recommendation system
  3. Why Big Data in Trend Micro?
  4. Web Reputation: 8+ billion URLs processed daily (rows: Technology, Process, Operation) • User Traffic / Sourcing, CDN vendor, Rating Server for Known Threats, Unknown & Prefilter, Page Download, Threat Analysis • 8 billion/day; 40% filtered, 4.8 billion/day; 82% filtered, 860 million/day; 99.98% filtered, 25,000 malicious URLs/day • Trend Micro Products / Technology, CDN Cache, High Throughput Web Service, Hadoop Cluster, Web Crawling, Machine Learning, Data Mining • Block malicious URLs within 15 minutes once they go online!
  5. SPN Solution Architecture • File, Web / URL, Email, Domain, IP • File Reputation Service • Email Reputation Service • Customer Smart Protection Community Intelligence (feedback loop) • Web Reputation Service • Sourcing, Processing & Analysis, Validate & Create Solution, Quality Assurance, Solution Distribution, Solution Adoption • SPN Correlation
  6. SPN Hadoop Use Cases: Marketing Report Near real time Service Researcher Data Scientist Hadoop Platform query Service Batch processing data business value information HBase HDFS
  7. Yesterday • ~40 Hadoop nodes • ~15 service/user accounts • 3 teams • <50 TB storage • <100 jobs per day
  8. Today • Hundreds of Hadoop nodes • >170 service/user accounts • >13 teams • ~1.5 PB storage • >16000 jobs per day
  9. 1 MapReduce job submitted every 5.4 seconds
  10. Central Management • Hadoop as a Service • Automation • High Availability • Customization
  11. Real-World Difficulties in Deployment • Hundreds of servers • Complicated Hadoop ecosystem deployment • Need for configuration management • Limited maintenance time
  12. Hadoop Ecosystem • Puppet: IT automation software • Hadooppet: a project for deploying the Trend Micro Hadoop distribution on a large cluster
  13. Hadooppet Workflow – Cluster Deployment • Puppet server /etc/puppet: auth.conf, autosign.conf, files, fileserver.conf, manifests, modules, puppet.conf, ssl • Puppet client /etc/puppet (on each Hadoop node): auth.conf, fileserver.conf, puppet.conf, ssl • 1. Certificate request 2. Sign certificate 3. Retrieve catalog for nodes • Each Puppet client pulls packages from the Yum server and auto-deploys Hadoop by role
  14. Hadooppet Workflow – Change Configuration • Puppet server /etc/puppet: auth.conf, autosign.conf, files, fileserver.conf, manifests, modules, puppet.conf, ssl • Puppet client /etc/puppet (on each Hadoop node): auth.conf, fileserver.conf, puppet.conf, ssl • 1. Modify configuration at server side 2. Synchronize configuration to the Puppet clients
  15. Hadooppet • CLUSTER DEPLOYMENT BY DISTRIBUTION / ENVIRONMENT: POC, Staging, Production; all-in-one VM, AWS EC2 deployment • CLUSTER DEPLOYMENT: package installation; configuration adjustment • CLUSTER OPERATION: add new Hadoop node/client; account management; process management • SANITY CHECK: DFSIO, YCSB, etc.; sample applications
  16. Anything more?
  17. Real-World Difficulties with the Hadoop Distribution • Too many running services to make big changes • No suitable Hadoop version for Trend Micro • Always need to patch for our needs
  18. Trend Micro Hadoop (TMH) • Be flexible: pick the features the business needs • Pull official patches into the currently adopted version • Add your own patches at any time • ISSUE TRACKING: Jira • DEVELOPMENT: Gitlab, Hudson • TESTING: Dumbo Cluster, POC / Staging • DEPLOYMENT: Hadooppet • PROFILING: Nagios, Ganglia, Splunk • MANAGEMENT: Hadooppet
  19. TMH Development Process • Jira: issue tracking • Gitlab: version control of source code • Unit Test: developers run unit tests on their local development machines • Hudson: build / test software projects • Yum Server: automatic updates, package and dependency management
  20. POC Hadoop Cluster • Staging Hadoop Cluster • Production Hadoop Cluster • Yum Server • Developer
  21. POC Hadoop Cluster • Staging Hadoop Cluster • Production Hadoop Cluster • Yum Server • Developer
  22. Hadoop Cluster Profiling • Availability – Process health – Cluster health – System health • Utilization – Cluster usage – Log analysis • Auditing
  23. Nagios • Service health monitoring • Cluster health monitoring Ganglia • System monitoring / Hadoop metrics monitoring • Cluster resource monitoring Splunk • Application / cluster resource profiling • Auditing / log analysis
  24. Case Study • User: I feel the cluster's HDFS has become slow recently… • Me: Really? Since when? Do you have any detailed information or logs?
  25. User: …, let me check on it. • Me: Okay!
  26. 15 minutes later… • User: Sorry, we have no logs now. But it is really slow. • Me: …
  27. What can I do? • Check Nagios service alerts • Check Splunk cluster HDFS profiling and recent user usage • Check Ganglia cluster load
  28. I can… • Check Nagios service alerts • Check Splunk cluster HDFS profiling and recent user usage • Check Ganglia cluster load => Find the root cause!
  29. Central Management • Hadoop as a Service • Automation • High Availability • Customization
  30. Tomorrow • YARN • MRv2 • Spark? • Impala? • We choose what we really want!
  31. Thank you! WE ARE HIRING! WELCOME TO JOIN TMH! #TrendInsight
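
The daily volumes on slide 4 can be checked with a little arithmetic. The sketch below assumes that each "filtered" percentage applies to the volume entering that stage, which the slide does not state explicitly; the function name `remaining_after_filter` is made up for illustration. Only the first two stages are reproduced, because the transcript does not make clear which volume the final 99.98% refers to.

```python
def remaining_after_filter(volume: int, filtered_percent: int) -> int:
    """Return the volume left after dropping `filtered_percent` percent of it."""
    return volume * (100 - filtered_percent) // 100


incoming = 8_000_000_000  # 8 billion URLs/day, from slide 4
after_stage1 = remaining_after_filter(incoming, 40)      # 40% filtered
after_stage2 = remaining_after_filter(after_stage1, 82)  # 82% filtered

print(after_stage1)  # 4800000000
print(after_stage2)  # 864000000, which the slide rounds to 860 million
```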
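
The "5.4 seconds" on slide 9 follows from the ">16000 jobs per day" on slide 8 if the jobs are assumed to be spread evenly over 24 hours. That even spread is an assumption made here for illustration; real submission rates will be bursty. The helper name `seconds_between_jobs` is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def seconds_between_jobs(jobs_per_day: int) -> float:
    """Average gap between job submissions, assuming an even spread over the day."""
    return SECONDS_PER_DAY / jobs_per_day


print(seconds_between_jobs(16_000))  # 5.4   (slide 8: >16000 jobs per day)
print(seconds_between_jobs(100))     # 864.0 (slide 7: <100 jobs per day)
```

Because the "Today" figure is a lower bound, 5.4 seconds is an upper bound on the average gap.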
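
The three numbered steps on slide 13 (certificate request, certificate signing, catalog retrieval) and the "deploy by role" idea can be modelled in a few lines. This is a toy model in Python, not Puppet code and not the Hadooppet implementation, which the slides do not show. The class `PuppetServer`, the host names, the role names, and the package names are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-package mapping; the real Hadooppet manifests are not in the slides.
ROLE_PACKAGES = {
    "namenode": ["hadoop", "hadoop-namenode"],
    "datanode": ["hadoop", "hadoop-datanode"],
}


@dataclass
class PuppetServer:
    node_roles: dict                      # hostname -> role
    signed: set = field(default_factory=set)

    def sign_certificate(self, hostname: str) -> None:
        # Steps 1 and 2: a client requests a certificate and the server signs it.
        if hostname not in self.node_roles:
            raise KeyError(f"unknown node: {hostname}")
        self.signed.add(hostname)

    def catalog_for(self, hostname: str) -> list:
        # Step 3: only a node with a signed certificate may retrieve its catalog.
        if hostname not in self.signed:
            raise PermissionError(f"certificate not signed: {hostname}")
        return ROLE_PACKAGES[self.node_roles[hostname]]


server = PuppetServer({"nn1": "namenode", "dn1": "datanode"})
server.sign_certificate("dn1")
print(server.catalog_for("dn1"))  # ['hadoop', 'hadoop-datanode']
```

In this model, adding a node only requires a new entry in `node_roles`; the role then decides what gets installed.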
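
Slide 14 describes changing configuration on the server and synchronizing it to the clients. One common way to decide which files need to be pushed is to compare content checksums; the sketch below illustrates that idea only and does not claim to show how Puppet or Hadooppet does it internally. The function names are hypothetical, and the file contents are simplified placeholders rather than real Hadoop XML.

```python
import hashlib


def checksum(content: str) -> str:
    """SHA-256 hex digest of a file's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def synchronize(server_files: dict, client_files: dict) -> list:
    """Copy every server file that is missing or different on the client.

    Returns the sorted list of paths that were updated.
    """
    changed = []
    for path, content in server_files.items():
        if path not in client_files or checksum(client_files[path]) != checksum(content):
            client_files[path] = content
            changed.append(path)
    return sorted(changed)


# Placeholder contents; a real hdfs-site.xml is XML, not key=value.
server = {"core-site.xml": "a", "hdfs-site.xml": "dfs.replication=3"}
client = {"core-site.xml": "a", "hdfs-site.xml": "dfs.replication=2"}

print(synchronize(server, client))  # ['hdfs-site.xml']
print(synchronize(server, client))  # [] (a second run finds nothing to change)
```

The second call returning an empty list shows the property that matters for configuration management: applying the same desired state twice changes nothing.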