Hadoop Ecosystem Framework & Hadoop in Live Environment
Delhi Hadoop User Group MeetUp - 10th Sept. 2011 - Slides
Presentation Transcript

  • 1.
    • Hadoop Ecosystem Framework & Hadoop in Live Environment
    • - Ashish Agrawal
  • 2. Outline
    • Introduction to HADOOP & Distributed FileSystems
    • Architecture of the Hadoop ecosystem (HBase/Pig) & setting up Hadoop single- and multi-node clusters
    • Introduction to MapReduce & running sample programs on Hadoop
    • Hadoop ecosystem framework - Hadoop in live environment
  • 3. Hadoop Ecosystem
    • HDFS
    • MapReduce
    • HBase
    • Pig
    • Hive
    • Mahout
    • Zookeeper
  • 4. HDFS Architecture
  • 5. MapReduce Flow (diagram by Ricky Ho)
  • 6. HBase Architecture
  • 7. Job Scheduler
    • CronJobs
    • Chain MapReduce
    • Azkaban by LinkedIn
    • Oozie by Yahoo!
  • 8. Overview of Oozie
    • Manages data processing jobs
    • Offers a scalable, data-oriented service
    • Manages dependencies between jobs
    • Supports job execution in topological order
    • Provides time- and event-driven triggering mechanisms
  • 9. Overview of Oozie
    • Supports MapReduce, Pig, filesystem, and Java applications, as well as MapReduce Streaming and Pipes, as action nodes
    • Action nodes are connected through dependency edges
    • Decision, fork and join nodes are used as flow control operations
  • 10. Overview of Oozie
    • Actions and decisions depend upon job properties, Hadoop counters, or file/directory status
    • A workflow application packages the workflow definition file, JAR files, native and third-party libraries, resource files, and Pig scripts (see the client sketch below)
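As a rough illustration of how such a packaged workflow application is kicked off, here is a minimal sketch using Oozie's Java client API. The server URL, HDFS application path, and the jobTracker/nameNode properties are made-up placeholders that a typical workflow.xml would reference; they are not values from the talk.

```java
// Sketch: submitting a packaged workflow application to an Oozie server.
// The URL and paths below are placeholders, not values from the talk.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class SubmitWorkflow {
  public static void main(String[] args) throws Exception {
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

    Properties conf = client.createConfiguration();
    // HDFS directory holding workflow.xml, lib/ JARs, Pig scripts, etc.
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/my-wf-app");
    conf.setProperty("jobTracker", "jobtracker-host:8021");
    conf.setProperty("nameNode", "hdfs://namenode:8020");

    String jobId = client.run(conf);             // submit and start the workflow
    System.out.println("Workflow job submitted: " + jobId);

    WorkflowJob job = client.getJobInfo(jobId);  // poll current status
    System.out.println("Status: " + job.getStatus());
  }
}
```

The workflow.xml at the application path is where the action, decision, fork and join nodes described above are defined.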
  • 11. Oozie vs Azkaban
    • Oozie can be restarted from the point of failure, but Azkaban cannot
    • Oozie keeps workflow state in a DB, while Azkaban keeps it in memory
    • Azkaban fixes the execution path before starting a job, while Oozie lets decision nodes decide at runtime
    • Azkaban does not support event triggers
    • Azkaban is used for simpler workflows
  • 12. Chain MR
    • Chains multiple mapper classes within a single map task, which saves a lot of I/O
    • The output of the immediately preceding mapper is fed as input to the current mapper
    • The output of the last mapper is written as the task output
    • Supports passing key/value pairs to the next mapper by reference, to save [de]serialization time
    • ChainReducer supports chaining multiple mapper classes after the reducer within the reduce task (see the sketch below)
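A minimal sketch of this wiring using the old mapred API's ChainMapper/ChainReducer. The LowerCaseMapper, TokenMapper and SumReducer classes are illustrative stand-ins, not code from the talk.

```java
// Sketch: a [Map+ / Reduce] pipeline inside a single MapReduce job (old mapred API).
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainExample {

  // First mapper: lower-cases each input line.
  public static class LowerCaseMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<LongWritable, Text> out, Reporter reporter) throws IOException {
      out.collect(key, new Text(value.toString().toLowerCase()));
    }
  }

  // Second mapper in the same map task: emits (word, 1) pairs.
  public static class TokenMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      for (String w : value.toString().split("\\s+")) {
        out.collect(new Text(w), new IntWritable(1));
      }
    }
  }

  // Reducer: sums counts per word.
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainExample.class);
    job.setJobName("chained word count");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Two mappers chained in one map task; byValue=false passes records by
    // reference, skipping [de]serialization between the chained mappers.
    ChainMapper.addMapper(job, LowerCaseMapper.class, LongWritable.class, Text.class,
        LongWritable.class, Text.class, false, new JobConf(false));
    ChainMapper.addMapper(job, TokenMapper.class, LongWritable.class, Text.class,
        Text.class, IntWritable.class, false, new JobConf(false));

    // Reducer for the reduce task; further mappers could follow it via
    // ChainReducer.addMapper(...) if post-processing were needed.
    ChainReducer.setReducer(job, SumReducer.class, Text.class, IntWritable.class,
        Text.class, IntWritable.class, false, new JobConf(false));

    JobClient.runJob(job);
  }
}
```

Because the chained mappers run inside one task, intermediate records never hit disk between them, which is where the I/O saving mentioned on this slide comes from.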
  • 13. Oozie Flow (workflow diagram with Start, MapReduce, Fork, MR Streaming, Pig, Join, Decision, MR Pipes, Java, FileSystem, and End nodes)
  • 14. Performance Tuning Parameters
    • Network bandwidth – gigabit network
    • Disk throughput – SCSI Drives
    • Memory usage – ECC RAM
    • CPU overhead for thread handling
    • HDFS block size
    • Max number of requests allowed in progress
    • Per-user file descriptor limit – needs to be set high
    • Running the balancer
  • 15. Performance Tuning Parameters
    • Sufficient space for temp directory
    • Compressed data storage
    • Speculative task execution
    • Use of a combiner function – must be associative & commutative
    • Selection of job scheduler: FIFO/Capacity/Fair
    • Number of mappers: larger input files are preferred
    • Number of reducers: slightly fewer than the number of nodes
  • 16. Performance Tuning Parameters
    • Compression of intermediate data from Mappers
    • Sort buffer size (io.sort.mb) – set larger if the mapper writes a lot of data
    • Sort factor (io.sort.factor) – set higher for larger jobs (number of streams merged at once)
    • mapred.reduce.parallel.copies – set higher for large jobs
    • dfs.namenode.handler.count & dfs.datanode.handler.count – set high for large clusters (see the configuration sketch below)
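A minimal sketch of how several of the knobs from slides 14-16 can be set programmatically, assuming the pre-YARN (Hadoop 0.20-era) property names used on these slides. The concrete values are illustrative only, not recommendations from the talk.

```java
// Sketch: setting some of the tuning parameters from slides 14-16 in code.
// These are the pre-YARN (0.20-era) property names; values are illustrative only.
import org.apache.hadoop.conf.Configuration;

public class TuningDefaults {
  public static Configuration tuned() {
    Configuration conf = new Configuration();

    // Slide 14: a larger HDFS block size favours a few big files over many small ones.
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);          // 128 MB blocks

    // Slide 15: compress intermediate map output and enable speculative execution.
    conf.setBoolean("mapred.compress.map.output", true);
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

    // Slide 16: sort/shuffle buffers and parallel copies for large jobs.
    conf.setInt("io.sort.mb", 200);                    // map-side sort buffer
    conf.setInt("io.sort.factor", 50);                 // streams merged at once
    conf.setInt("mapred.reduce.parallel.copies", 20);  // copier threads per reducer

    // Slide 16: RPC handler threads, normally set in the cluster's *-site.xml.
    conf.setInt("dfs.namenode.handler.count", 64);
    conf.setInt("dfs.datanode.handler.count", 8);

    return conf;
  }
}
```

In practice the handler counts and block size usually live in hdfs-site.xml and mapred-site.xml on the cluster; the sort, shuffle, compression and speculative-execution properties can also be overridden per job as above.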
  • 17. Tips
    • Use an appropriate MapReduce language
      • Java : Speed, control and binary data. Working with existing libraries.
      • Pipes : Working with existing C++ libraries
      • Streaming : Writing MR in scripting languages
      • Dumbo (Python), Happy (Jython), Wukong (Ruby)
      • Pig, Hive, Cascading: for nested data, joins, etc.
    • Rule of thumb: pure Java for large, recurring jobs, Hive for SQL-style analysis, and Pig/Streaming for ad-hoc analysis.
  • 18. Tips
    • A few larger files are preferred over many smaller files
    • Report progress
      • For CPU-intensive jobs, increase mapred.task.timeout (default 10 minutes)
    • Use the distributed cache
      • Makes data available to all mappers/reducers, for example a lookup hash map (see the sketch below)
      • Also used to make auxiliary JARs available to mappers/reducers
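A minimal sketch of the lookup-hash-map pattern mentioned above, using the old mapred API's DistributedCache. The HDFS file path and the tab-separated format of the lookup file are made-up placeholders.

```java
// Sketch: shipping a small lookup file to every mapper via the DistributedCache
// (old mapred API). The HDFS path and file format below are placeholders.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  // Driver side: register the file once; the framework copies it to each node.
  public static void addLookupFile(JobConf job) throws IOException {
    DistributedCache.addCacheFile(URI.create("/user/demo/lookup.tsv"), job);
  }

  // Task side: load the locally cached copy into an in-memory hash map.
  @Override
  public void configure(JobConf job) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) lookup.put(parts[0], parts[1]);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Failed to load lookup file from cache", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String id = value.toString().trim();
    String enriched = lookup.get(id);      // join each input record against the cache
    if (enriched != null) {
      out.collect(new Text(id), new Text(enriched));
    }
  }
}
```

Auxiliary JARs can be shipped the same way with DistributedCache.addFileToClassPath(...) or the -libjars option on the job command line.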
  • 19. Tips
    • Use SequenceFile and MapFile
      • Splittable. Unlike many other compressed formats, they are MapReduce-friendly: each map gets an independent split to work on.
      • Compressible. With block compression you get the benefits of compression (less disk space, faster reads and writes) while keeping the file splittable.
      • Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a fairly compact format.
    • A MapFile is an indexed SequenceFile, useful if you want to do lookups by key (see the sketch below).
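A minimal sketch of writing a block-compressed SequenceFile and doing a keyed lookup in a MapFile, using standard Writable key/value types. The paths and sample records are placeholders.

```java
// Sketch: writing a block-compressed SequenceFile and looking up a key in a MapFile.
// Paths and records are placeholders; keys/values use standard Writable types.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a block-compressed SequenceFile: compact, compressed, still splittable.
    Path seqPath = new Path("/user/demo/data.seq");
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, seqPath,
        Text.class, IntWritable.class, SequenceFile.CompressionType.BLOCK);
    try {
      writer.append(new Text("apple"), new IntWritable(3));
      writer.append(new Text("banana"), new IntWritable(7));
    } finally {
      writer.close();
    }

    // A MapFile is an indexed SequenceFile: write sorted keys, then look one up.
    String mapDir = "/user/demo/lookup.map";
    MapFile.Writer mapWriter = new MapFile.Writer(conf, fs, mapDir,
        Text.class, IntWritable.class);
    try {
      mapWriter.append(new Text("apple"), new IntWritable(3));   // keys must be sorted
      mapWriter.append(new Text("banana"), new IntWritable(7));
    } finally {
      mapWriter.close();
    }

    MapFile.Reader reader = new MapFile.Reader(fs, mapDir, conf);
    IntWritable value = new IntWritable();
    reader.get(new Text("banana"), value);                       // random access by key
    System.out.println("banana -> " + value.get());
    reader.close();
  }
}
```

Block compression compresses runs of records together, which usually compresses better than per-record compression, while the sync markers keep the file splittable for MapReduce.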
  • 20. Mahout (Machine learning library)
    • Collaborative Filtering
    • User- and item-based recommenders (see the sketch below)
    • K-Means, Fuzzy K-Means clustering
    • Mean Shift clustering
    • Dirichlet process clustering
    • Latent Dirichlet Allocation
    • Singular value decomposition
    • Parallel Frequent Pattern mining
    • Complementary Naive Bayes classifier
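To give a flavour of the library, here is a minimal (non-distributed) user-based recommender using Mahout's Taste API of that era. The ratings.csv path, the neighbourhood size of 10, and the user/item IDs are made-up placeholders.

```java
// Sketch: a user-based collaborative-filtering recommender with Mahout's Taste API.
// ratings.csv (userID,itemID,preference rows) and the parameters are placeholders.
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderDemo {
  public static void main(String[] args) throws Exception {
    // Preferences as userID,itemID,rating rows.
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Similarity between users, and a neighbourhood of the 10 most similar users.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);

    // Recommend 3 items for user 42 based on what similar users liked.
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```

Mahout of that era also shipped Hadoop MapReduce versions of several of these algorithms for running the same kind of computation at scale.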
  • 21. Different minds, different interpretations
    • http://www.youtube.com/watch?v=9izUKE5bN0U
  • 22. Hadoop in live environment
    • Google
    • Yahoo
    • Amazon
    • LinkedIn
    • Facebook
    • StumbleUpon
    • Nokia
    • Last.fm
    • Clickable
  • 23. @Google
    • Google uses it for
      • indexing the web
      • computing PageRank
      • processing geographic information in Google Maps
      • clustering news articles
      • machine translation
      • Google Trends, etc.
  • 24. @Google
    • An Example :
      • 403,152 TB (terabytes) of data
      • 394 machines allocated
      • Completion time of six and a half minutes
    • Google's indexing system uses 20 TB of data
    • Bigtable (on which HBase is modeled) is used for many Google products such as Orkut, Finance, etc.
    • Sawzall is used for massive log processing
  • 25. @Yahoo!
    • The Two Quadrillionth Bit of π is 0!
      • One of the largest computations took 23 days of wall clock time and 503 years of CPU time on a 1000-node cluster
      • Yahoo! has 4,000 nodes in its largest Hadoop cluster
    • The following slides are taken from the Open Cirrus Summit 2009
  • 26. Hadoop is critical to Yahoo’s business
    • When you visit Yahoo!, you are interacting with data processed with Hadoop!
    Examples: ads optimization, content optimization, search index, content feed processing, machine learning (e.g. spam filters)
  • 27. Tremendous Impact on Productivity
    • Makes Developers & Scientists more productive
      • Key computations solved in days and not months
      • Projects move from research to production in days
      • Easy to learn, even our rocket scientists use it!
    • The major factors
      • You don’t need to find new hardware to experiment
      • You can work with all your data!
      • Production and research based on same framework
      • No need for R&D to do IT (it just works)
  • 28. Search & Advertising Sciences – Hadoop Applications: Search Assist™
    • Database for Search Assist™ is built using Hadoop.
    • 3 years of log-data
    • 20-steps of map-reduce
    Before Hadoop vs. after Hadoop:
    • Time: 26 days → 20 minutes
    • Language: C++ → Python
    • Development time: 2-3 weeks → 2-3 days
  • 29. Largest Hadoop Clusters in the Universe
    • 25,000+ nodes (~200,000 cores)
      • Clusters of up to 4,000 nodes
    • 4 Tiers of clusters
      • Development, Testing and QA (~10%)
      • Proof of Concepts and Ad-Hoc work (~10%)
        • Runs the latest version of Hadoop – currently 0.20
      • Science and Research (~60%)
        • Runs more stable versions
      • Production (~20%)
        • Currently Hadoop 0.18.3
  • 30. Large Hadoop-Based Applications (2008 vs. 2009)
    • Webmap: 2008 – ~70 hours runtime, ~300 TB shuffling, ~200 TB output, 1,480 nodes; 2009 – ~73 hours runtime, ~490 TB shuffling, ~280 TB output, 2,500 nodes
    • Sort benchmarks (Jim Gray contest): 2008 – 1 terabyte sorted in 209 seconds on 900 nodes; 2009 – 1 terabyte sorted in 62 seconds on 1,500 nodes, and 1 petabyte sorted in 16.25 hours on 3,700 nodes
    • Largest cluster: 2008 – 2,000 nodes, 6 PB raw disk, 16 TB of RAM, 16K CPUs; 2009 – 4,000 nodes, 16 PB raw disk, 64 TB of RAM, 32K CPUs (40% faster CPUs too)
  • 31. @Facebook
    • Claims to have the largest single Hadoop cluster in the world
    • Have multiple clusters at separate data centers
    • Largest warehouse cluster currently spans 3,000 machines
    • Scans around 2 petabytes per day
    • 300 people throughout the company query this warehouse every month
  • 32. @Facebook
    • Facebook Messages uses HBase in production
    • Collects click logs in near real time from web servers and streams them directly into Hadoop clusters
    • Medium-term archiving of MySQL databases
      • Fast backup and recovery from data stored in Hadoop File System
      • Reduces maintenance and deployment costs for archiving petabyte-size datasets
  • 33. @Nokia
    • Started using Hadoop in August 2009 in the search analytics team
    • Started with a 15-machine cluster
    • Analyses large-scale search logs for various analytics purposes
    • Search relevance calculation
    • Duplicate places handling, data cleaning
    • Fuzzy query parsing and tagging for spelling correction and lookahead suggestion model
  • 34. @Clickable
    • Using HBase, HDFS, and MapReduce for various purposes such as data storage, analytics, reporting, and recommendations
    • 7-machine cluster for production
    • Uses HBase to handle continuous data updates from networks or any other user action at our end
  • 35. @StumbleUpon
    • Log early, log often, log everything
    • No piece of data is too small or too noisy to be used in the future
    • Uses Hadoop for Apache log file processing, session analysis, and spam detection
  • 36. @StumbleUpon
    • Uses Scribe to collect data directly into HDFS, where it is reviewed and processed by a number of systems
    • Uses MapReduce to extract data from logs for click counts
    • Uses Hadoop for search index updates, thumbnail creation, and recommendation systems
  • 37.
      • Questions?