Hadoop ecosystem framework and Hadoop in live environment


Delhi Hadoop User Group MeetUp - 10th Sept. 2011 - Slides

  1. 1. <ul><li>Hadoop ecosystem framework Hadoop in live environment </li></ul><ul><li>- Ashish Agrawal </li></ul>
  2. 2. Outline <ul><li>Introduction to HADOOP & Distributed FileSystems </li></ul><ul><li>Architecture of Hadoop Ecosystem (Hbase/Pig) & setting up Hadoop Single/Multiple node cluster </li></ul><ul><li>Introduction to MapReduce & running sample programs on Hadoop </li></ul><ul><li>Hadoop ecosystem framework - Hadoop in live environment </li></ul>
  3. 3. Hadoop Ecosystem <ul><li>HDFS </li></ul><ul><li>Map Reduce </li></ul><ul><li>Hbase </li></ul><ul><li>Pig </li></ul><ul><li>Hive </li></ul><ul><li>Mahout </li></ul><ul><li>Zookeeper </li></ul>
  4. 4. HDFS Architecture
  5. 5. Map Reduce Flow By Ricky Ho
  6. 6. HBase Architecture
  7. 7. Job Scheduler <ul><li>CronJobs </li></ul><ul><li>Chain MapReduce </li></ul><ul><li>Azkaban by LinkedIn </li></ul><ul><li>Oozie by Yahoo! </li></ul>
  8. 8. Overview of Oozie <ul><li>Manages data processing jobs </li></ul><ul><li>Offers a scalable, data-oriented service </li></ul><ul><li>Manages dependencies between jobs </li></ul><ul><li>Supports job execution in topological order </li></ul><ul><li>Provides time- and event-driven triggering mechanisms </li></ul>
  9. 9. Overview of Oozie <ul><li>Supports map reduce, pig, filesystem, java applications, even map reduce streaming and pipes as action nodes </li></ul><ul><li>Action nodes are connected through dependency edges </li></ul><ul><li>Decision, fork and join nodes are used as flow control operations </li></ul>
  10. 10. Overview of Oozie <ul><li>Actions and decisions depend on job properties, Hadoop counters, or file/directory status </li></ul><ul><li>A workflow application contains the workflow definition file, jar files, native and third-party libraries, resource files, and Pig scripts </li></ul>
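The dependency handling described in the Oozie overview slides (jobs run in topological order along dependency edges) can be illustrated with a small, scheduler-independent sketch. This is not Oozie's API or workflow format; the job names are hypothetical, and the code only shows what "execution in topological order" means, using Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "clean_logs": set(),
    "sessionize": {"clean_logs"},
    "click_counts": {"clean_logs"},
    "daily_report": {"sessionize", "click_counts"},
}


def execution_order(deps):
    """Return one valid order in which every job runs after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

In this sketch `sessionize` and `click_counts` do not depend on each other, so a scheduler would be free to run them in parallel, which is roughly the situation a fork/join pair expresses in a workflow.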
  11. 11. Oozie vs Azkaban <ul><li>Oozie can be restarted from the point of failure, but Azkaban cannot </li></ul><ul><li>Oozie keeps flow state in a DB, while Azkaban keeps it in memory </li></ul><ul><li>Azkaban fixes the execution path before starting a job, while Oozie lets decision nodes choose it at run time </li></ul><ul><li>Azkaban does not support event triggers </li></ul><ul><li>Azkaban is used for simpler workflows </li></ul>
  12. 12. Chain MR <ul><li>Chains multiple mapper classes in a single map task, which saves a lot of I/O </li></ul><ul><li>The output of the immediately preceding mapper is fed as input to the current mapper </li></ul><ul><li>The output of the last mapper is written as the task output </li></ul><ul><li>Supports passing key/value pairs to the next mapper by reference to save [de]serialization time </li></ul><ul><li>ChainReducer supports chaining multiple mapper classes after the reducer, within the reduce task </li></ul>
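The chaining idea on the Chain MR slide can be sketched without Hadoop: each mapper's output is handed straight to the next mapper in memory, and only the last mapper's output is kept. The three mapper functions and the `chain` helper below are hypothetical illustrations, not Hadoop's `ChainMapper` API.

```python
def tokenize(key, value):
    """First mapper: split a line into (word, 1) pairs."""
    for word in value.split():
        yield word, 1


def lowercase(key, value):
    """Second mapper: normalise the word emitted by the previous mapper."""
    yield key.lower(), value


def drop_stopwords(key, value, stopwords=frozenset({"the", "a", "of"})):
    """Third mapper: filter out common words."""
    if key not in stopwords:
        yield key, value


def apply_mapper(mapper, records):
    """Run one mapper lazily over a stream of (key, value) records."""
    for key, value in records:
        yield from mapper(key, value)


def chain(mappers, records):
    """Feed each mapper's output into the next one without intermediate storage."""
    for mapper in mappers:
        records = apply_mapper(mapper, records)
    return records


lines = [(0, "The quick fox"), (14, "the lazy dog of a fox")]
result = list(chain([tokenize, lowercase, drop_stopwords], lines))
print(result)
```

Because the stages are generators, no intermediate list is materialised between mappers, which mirrors the I/O saving the slide attributes to running the whole chain inside one map task.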
  13. 13. Oozie Flow [diagram; nodes shown: Start, MapReduce, Fork, MR Streaming, Pig, Join, Decision, MR Pipes, Java, FileSystem, End]
  14. 14. Performance Tuning Parameters <ul><li>Network bandwidth – Gigabit network </li></ul><ul><li>Disk throughput – SCSI drives </li></ul><ul><li>Memory usage – ECC RAM </li></ul><ul><li>CPU overhead for thread handling </li></ul><ul><li>HDFS block size </li></ul><ul><li>Max number of requests allowed in progress </li></ul><ul><li>Per-user file descriptors – need to be set high </li></ul><ul><li>Running the balancer </li></ul>
  15. 15. Performance Tuning Parameters <ul><li>Sufficient space for temp directory </li></ul><ul><li>Compressed data storage </li></ul><ul><li>Speculative execution </li></ul><ul><li>Use of a combiner function – must be associative & commutative </li></ul><ul><li>Selection of job scheduler: FIFO/Capacity/Fair </li></ul><ul><li>Number of mappers: larger files are preferred </li></ul><ul><li>Number of reducers: slightly less than #nodes </li></ul>
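The requirement that a combiner be associative and commutative can be checked with a small simulation. The sketch below is plain Python rather than Hadoop code, and the helper names are hypothetical: it applies a function per mapper first and then again at the reducer, and compares that with applying it once over all values. Sum survives the extra step; a naive mean does not.

```python
from collections import defaultdict


def combine(pairs, fn):
    """Apply fn per key to a list of (key, value) pairs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return [(key, fn(values)) for key, values in grouped.items()]


def reduce_all(pairs, fn):
    """Final reduce step: same grouping, returned as a dict."""
    return dict(combine(pairs, fn))


def mean(values):
    return sum(values) / len(values)


# Output of two hypothetical map tasks for the same key.
mapper_a = [("x", 1), ("x", 2), ("x", 3)]
mapper_b = [("x", 10)]

# Sum: combining per mapper first gives the same answer as reducing directly.
sum_direct = reduce_all(mapper_a + mapper_b, sum)
sum_combined = reduce_all(combine(mapper_a, sum) + combine(mapper_b, sum), sum)

# Mean: a mean of per-mapper means is not the overall mean.
mean_direct = reduce_all(mapper_a + mapper_b, mean)
mean_combined = reduce_all(combine(mapper_a, mean) + combine(mapper_b, mean), mean)

print(sum_direct, sum_combined)
print(mean_direct, mean_combined)
```

A common workaround for averages is to have the combiner emit (sum, count) pairs and divide only in the reducer, since both components combine safely.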
  16. 16. Performance Tuning Parameters <ul><li>Compression of intermediate data from Mappers </li></ul><ul><li>sort size (io.sort.mb) – larger if mapper has to write large data </li></ul><ul><li>Sort factor (io.sort.factor) – set high for larger jobs (#input files can be merged at once) </li></ul><ul><li>mapred.reduce.parallel.copies - higher for large jobs </li></ul><ul><li>dfs.namenode.handler.count & dfs.datanode.handler.count – high for large cluster </li></ul>
  17. 17. Tips <ul><li>Use an appropriate MapReduce language </li></ul><ul><ul><li>Java: speed, control and binary data. Working with existing libraries. </li></ul></ul><ul><ul><li>Pipes: working with existing C++ libraries </li></ul></ul><ul><ul><li>Streaming: writing MR in scripting languages </li></ul></ul><ul><ul><li>Dumbo (Python), Happy (Jython), Wukong (Ruby) </li></ul></ul><ul><ul><li>Pig, Hive, Cascading: for nested data, joins etc. </li></ul></ul><ul><li>Rule of thumb: pure Java for large, recurring jobs, Hive for SQL-style analysis and Pig/Streaming for ad-hoc analysis. </li></ul>
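To make the Streaming option concrete, here is a minimal word-count sketch in the style Streaming expects: the mapper emits tab-separated `key<TAB>value` lines, the framework sorts them, and the reducer sums consecutive lines that share a key. In a real job the mapper and reducer would be separate scripts reading standard input; here they are ordinary functions, and `run_locally` is a hypothetical helper that imitates the sort step so the example runs without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a Streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts for runs of identical keys; input must be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce on one machine (no Hadoop involved)."""
    return list(reducer(sorted(mapper(lines))))


counts = run_locally(["to be or not to be"])
print(counts)
```

The reducer relies only on the input being grouped by key, which is why the same logic can be tested locally with a plain sort before it is submitted to a cluster.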
  18. 18. Tips <ul><li>A few larger files are preferred over many smaller files </li></ul><ul><li>Report progress </li></ul><ul><ul><li>For CPU-intensive jobs, increase mapred.task.timeout (default 10 mins) </li></ul></ul><ul><li>Use the distributed cache </li></ul><ul><ul><li>To make data available to all mappers/reducers, for example a lookup hash map </li></ul></ul><ul><ul><li>Also used to make auxiliary jars available to mappers/reducers </li></ul></ul>
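The preference for a few large files can be illustrated with back-of-the-envelope arithmetic. The sketch below assumes, as a simplification, one map task per block and at least one per non-empty file; real input formats apply additional rules, and the 64 MB block size and file sizes are illustrative assumptions rather than figures from the talk.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_sizes_bytes, block_size_bytes):
    """Rough estimate: one map task per block, at least one per non-empty file."""
    return sum(
        math.ceil(size / block_size_bytes)
        for size in file_sizes_bytes
        if size > 0
    )


block_size = 64 * MB  # assumed block size, for illustration only

many_small = [1 * MB] * 10_000   # 10,000 files of 1 MB each
few_large = [1000 * MB] * 10     # the same total volume in 10 files

small_tasks = estimate_map_tasks(many_small, block_size)
large_tasks = estimate_map_tasks(few_large, block_size)
print(small_tasks, large_tasks)
```

Under these assumptions the same data volume needs far fewer map tasks when stored in large files, so less time goes to per-task startup overhead.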
  19. 19. Tips <ul><li>Use SequenceFile and MapFile </li></ul><ul><ul><li>Splittable. Unlike other compressible formats, they are MapReduce-friendly: each map gets an independent split to work on </li></ul></ul><ul><ul><li>Compressible. With block compression you get the benefits of compression (less disk space, faster reads and writes) while keeping the file splittable. </li></ul></ul><ul><ul><li>Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a fairly compact format. </li></ul></ul><ul><li>A MapFile is an indexed SequenceFile, useful if you want to do look-ups by key. </li></ul>
  20. 20. Mahout (Machine learning library) <ul><li>Collaborative Filtering </li></ul><ul><li>User and Item based recommenders </li></ul><ul><li>K-Means, Fuzzy K-Means clustering </li></ul><ul><li>Mean Shift clustering </li></ul><ul><li>Dirichlet process clustering </li></ul><ul><li>Latent Dirichlet Allocation </li></ul><ul><li>Singular value decomposition </li></ul><ul><li>Parallel Frequent Pattern mining </li></ul><ul><li>Complementary Naive Bayes classifier </li></ul>
  21. 21. Different minds Different interpretation <ul><li>http://www.youtube.com/watch?v=9izUKE5bN0U </li></ul>
  22. 22. Hadoop in live environment <ul><li>Google </li></ul><ul><li>Yahoo </li></ul><ul><li>Amazon </li></ul><ul><li>LinkedIn </li></ul><ul><li>Facebook </li></ul><ul><li>StumbleUpon </li></ul><ul><li>Nokia </li></ul><ul><li>Last.fm </li></ul><ul><li>Clickable </li></ul>
  23. 23. @Google <ul><li>Google uses it for </li></ul><ul><ul><li>indexing the web </li></ul></ul><ul><ul><li>computing PageRank </li></ul></ul><ul><ul><li>processing geographic information in Google Maps </li></ul></ul><ul><ul><li>clustering news articles, </li></ul></ul><ul><ul><li>machine translation </li></ul></ul><ul><ul><li>Google Trends etc </li></ul></ul>
  24. 24. @Google <ul><li>An example: </li></ul><ul><ul><li>403,152 TB (terabytes) of data </li></ul></ul><ul><ul><li>394 machines were allocated </li></ul></ul><ul><ul><li>Completion time is six and a half minutes. </li></ul></ul><ul><li>Google indexing system uses 20 TB of data </li></ul><ul><li>Bigtable (Hbase) is used for many Google products such as Orkut, Finance etc. </li></ul><ul><li>Sawzall is used for massive log processing </li></ul>
  25. 25. @Yahoo! <ul><li>The Two Quadrillionth Bit of π is 0! </li></ul><ul><ul><li>One of the largest computations took 23 days of wall clock time and 503 years of CPU time on a 1000-node cluster </li></ul></ul><ul><ul><li>Yahoo! has 4000 nodes in its Hadoop cluster </li></ul></ul><ul><li>The following slides are taken from the Open Cirrus Summit 2009 </li></ul>
  26. 26. Hadoop is critical to Yahoo’s business <ul><li>When you visit Yahoo, you are interacting with data processed with Hadoop! </li></ul><ul><ul><li>Ads Optimization </li></ul></ul><ul><ul><li>Content Optimization </li></ul></ul><ul><ul><li>Search Index </li></ul></ul><ul><ul><li>Content Feed Processing </li></ul></ul><ul><ul><li>Machine Learning (e.g. Spam filters) </li></ul></ul>
  27. 27. Tremendous Impact on Productivity <ul><li>Makes Developers & Scientists more productive </li></ul><ul><ul><li>Key computations solved in days and not months </li></ul></ul><ul><ul><li>Projects move from research to production in days </li></ul></ul><ul><ul><li>Easy to learn, even our rocket scientists use it! </li></ul></ul><ul><li>The major factors </li></ul><ul><ul><li>You don’t need to find new hardware to experiment </li></ul></ul><ul><ul><li>You can work with all your data! </li></ul></ul><ul><ul><li>Production and research based on same framework </li></ul></ul><ul><ul><li>No need for R&D to do IT (it just works) </li></ul></ul>
  28. 28. Search & Advertising Sciences Hadoop Applications: Search Assist™ <ul><li>Database for Search Assist™ is built using Hadoop. </li></ul><ul><li>3 years of log data </li></ul><ul><li>20 steps of MapReduce </li></ul><ul><li>Before Hadoop vs. after Hadoop </li></ul><ul><ul><li>Time: 26 days vs. 20 minutes </li></ul></ul><ul><ul><li>Language: C++ vs. Python </li></ul></ul><ul><ul><li>Development time: 2-3 weeks vs. 2-3 days </li></ul></ul>
  29. 29. Largest Hadoop Clusters in the Universe <ul><li>25,000+ nodes (~200,000 cores) </li></ul><ul><ul><li>Clusters of up to 4,000 nodes </li></ul></ul><ul><li>4 Tiers of clusters </li></ul><ul><ul><li>Development, Testing and QA (~10%) </li></ul></ul><ul><ul><li>Proof of Concepts and Ad-Hoc work (~10%) </li></ul></ul><ul><ul><ul><li>Runs the latest version of Hadoop – currently 0.20 </li></ul></ul></ul><ul><ul><li>Science and Research (~60%) </li></ul></ul><ul><ul><ul><li>Runs more stable versions </li></ul></ul></ul><ul><ul><li>Production (~20%) </li></ul></ul><ul><ul><ul><li>Currently Hadoop 0.18.3 </li></ul></ul></ul>
  30. 30. Large Hadoop-Based Applications <ul><li>Webmap </li></ul><ul><ul><li>2008: ~70 hours runtime, ~300 TB shuffling, ~200 TB output, 1480 nodes </li></ul></ul><ul><ul><li>2009: ~73 hours runtime, ~490 TB shuffling, ~280 TB output, 2500 nodes </li></ul></ul><ul><li>Sort benchmarks (Jim Gray contest) </li></ul><ul><ul><li>2008: 1 Terabyte sorted in 209 seconds on 900 nodes </li></ul></ul><ul><ul><li>2009: 1 Terabyte sorted in 62 seconds on 1500 nodes </li></ul></ul><ul><ul><li>2009: 1 Petabyte sorted in 16.25 hours on 3700 nodes </li></ul></ul><ul><li>Largest cluster </li></ul><ul><ul><li>2008: 2000 nodes, 6PB raw disk, 16TB of RAM, 16K CPUs </li></ul></ul><ul><ul><li>2009: 4000 nodes, 16PB raw disk, 64TB of RAM, 32K CPUs (40% faster CPUs too) </li></ul></ul>
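The sort benchmark figures on this slide are easier to compare as throughput. The sketch below converts each run into aggregate and per-node rates; it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes), which the slide does not specify, and reads the flattened figures as 1 TB in 209 s on 900 nodes (2008), 1 TB in 62 s on 1500 nodes (2009), and 1 PB in 16.25 hours on 3700 nodes (2009).

```python
TB = 10**12  # assuming decimal terabytes
PB = 10**15  # assuming decimal petabytes

# (label, bytes sorted, seconds, nodes), as read from the slide
runs = [
    ("2008, 1 TB", TB, 209, 900),
    ("2009, 1 TB", TB, 62, 1500),
    ("2009, 1 PB", PB, 16.25 * 3600, 3700),
]


def throughput(total_bytes, seconds, nodes):
    """Return (aggregate GB/s, per-node MB/s) for one sort run."""
    aggregate = total_bytes / seconds
    return aggregate / 1e9, aggregate / nodes / 1e6


results = {label: throughput(b, s, n) for label, b, s, n in runs}
for label, (gb_per_s, mb_per_s_node) in results.items():
    print(f"{label}: {gb_per_s:.2f} GB/s aggregate, {mb_per_s_node:.2f} MB/s per node")
```

Per-node rates are only a rough comparison, since the runs differ in hardware, Hadoop version and data volume.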
  31. 31. @Facebook <ul><li>Claims to have the largest single Hadoop cluster in the world </li></ul><ul><li>Has multiple clusters at separate data centers </li></ul><ul><li>Largest warehouse cluster currently spans 3,000 machines </li></ul><ul><li>Scans around 2 petabytes per day </li></ul><ul><li>300 people throughout the company query this warehouse every month </li></ul>
  32. 32. @Facebook <ul><li>Facebook “Messages” uses HBase in production </li></ul><ul><li>Collects click logs in near real time from web servers and streams them directly into Hadoop clusters </li></ul><ul><li>Medium-term archiving of MySQL databases </li></ul><ul><ul><li>Fast backup and recovery from data stored in the Hadoop file system </li></ul></ul><ul><ul><li>Reduces maintenance and deployment costs for archiving petabyte-size datasets. </li></ul></ul>
  33. 33. @Nokia <ul><li>Started using Hadoop in August 2009 in the search analytics team </li></ul><ul><li>Started with a cluster of 15 machines </li></ul><ul><li>Analyses large-scale search logs for various analytics purposes </li></ul><ul><li>Search relevance calculation </li></ul><ul><li>Duplicate places handling, data cleaning </li></ul><ul><li>Fuzzy query parsing and tagging for spelling correction and lookahead suggestion model </li></ul>
  34. 34. @Clickable <ul><li>Uses HBase, HDFS and MapReduce for various purposes such as data storage, analytics, reporting and recommendations </li></ul><ul><li>7-machine cluster for production </li></ul><ul><li>Uses HBase to handle continuous data updates from networks or any other user action at our end. </li></ul>
  35. 35. @Stumbleupon <ul><li>Log early, log often, log everything </li></ul><ul><li>No piece of data is too small or too noisy to be used in future </li></ul><ul><li>Uses Hadoop for Apache log file processing, session analysis and spam detection </li></ul>
  36. 36. @Stumbleupon <ul><li>Uses Scribe to collect data directly into HDFS, where it is reviewed and processed by a number of systems </li></ul><ul><li>Uses MR to extract data from logs for click counts </li></ul><ul><li>Also uses Hadoop for search index updates, thumbnail creation and recommendation systems </li></ul>
  37. 37. <ul><ul><li>Questions? </li></ul></ul>