Hadoop ecosystem framework  n hadoop in live environment
Upcoming SlideShare
Loading in...5

Hadoop ecosystem framework n hadoop in live environment



Delhi Hadoop User Group MeetUp - 10th Sept. 2011 -Slides

Delhi Hadoop User Group MeetUp - 10th Sept. 2011 -Slides



Total Views
Views on SlideShare
Embed Views



3 Embeds 8 5
http://www.docshut.com 2
http://apollo89.com 1


Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.


11 of 1

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    Hadoop ecosystem framework  n hadoop in live environment Hadoop ecosystem framework n hadoop in live environment Presentation Transcript

      • Hadoop ecosystem framework Hadoop in live environment
      • - Ashish Agrawal
    • Outline
      • Introduction to HADOOP & Distributed FileSystems
      • Architecture of Hadoop Ecosystem (Hbase/Pig) & setting up Hadoop Single/Multiple node cluster
      • Introduction to MapReduce & running sample programs on Hadoop
      • Hadoop ecosystem framework - Hadoop in live environment
    • Hadoop Ecosystem
      • HDFS
      • Map Reduce
      • Hbase
      • Pig
      • Hive
      • Mahout
      • Zookeeper
    • HDFS Architecture
    • Map Reduce Flow By Ricky Ho
    • HBase Architecture
    • Job Scheduler
      • CronJobs
      • Chain Map Recude
      • Azkaban By LinkedIn
      • Oozie by Yahoo!
    • Overview of Oozie
      • Manage data processing jobs
      • Offers scalable data oriented service
      • Manages dependencies between jobs
      • Support job execution in topological order
      • Provides time & event driven triggering mechanism
    • Overview of Oozie
      • Supports map reduce, pig, filesystem, java applications, even map reduce streaming and pipes as action nodes
      • Action nodes are connected through dependency edges
      • Decision, fork and join nodes are used as flow control operations
    • Overview of Oozie
      • Actions and decisions depends upon properties of job, hadoop counters or file/directory status
      • A workflow application contains definition file for workflow, jar files, native and third party libraries, resource file and pig scripts
    • Oozie vs Azkaban
      • Oozie can be restarted from point of failure but azkaban does not
      • Oozie keeps flow in DB while azkaban keeps in memory
      • Azkaban fixes execution path before starting job while Oozie allows decision nodes to decide
      • Azkaban does not support event trigger
      • Azkaban is used for simpler work flow
    • Chain MR
      • Chains the multiple mapper classes in single map task which saves lots of I/O
      • The output of immediate previous mapper is fed as input to current mapper
      • The output of last mapper is written as task output
      • Supports passing key/value pairs to next maps by reference to save [de]serialization time
      • ChainReducer supports to chain multiple mapper classes after reducer within reducer task
    • Oozie Flow Start Map reduce Fork MR Streaming Pig Join Decision MR Pipes Java FileSystem End
    • Performance Tuning Parameters
      • Network bandwidth – Gigabytes Nw
      • Disk throughput – SCSI Drives
      • Memory usage – ECC RAM
      • CPU overhead for thread handling
      • HDFS block size
      • Max number of requests allowed in progress
      • Per user file descriptors – needs to be set high
      • Running the balancer
    • Performance Tuning Parameters
      • Sufficient space for temp directory
      • Compressed data storage
      • Speculative data execution
      • Use of combiner function – Associative & commulative
      • Selection of Job scheduler : FIFO/Capacity/Fair
      • Number of mappers : larger files are preferred
      • Number of reducers : Slightly less than #nodes
    • Performance Tuning Parameters
      • Compression of intermediate data from Mappers
      • sort size (io.sort.mb) – larger if mapper has to write large data
      • Sort factor (io.sort.factor) – set high for larger jobs (#input files can be merged at once)
      • mapred.reduce.parallel.copies - higher for large jobs
      • dfs.namenode.handler.count & dfs.datanode.handler.count – high for large cluster
    • Tips
      • Use an appropriate MapReduce language
        • Java : Speed, control and binary data. Working with existing libraries.
        • Pipes : Working with existing C++ libraries
        • Streaming : Writing MR in scripting languages
        • Dumbo (Python), happy(Jython), Wukong (Ruby)
        • Pig, Hive, Cascading : For nested data, joins etc
      • Thumb Rule : P ure Java for large, recurring jobs, Hive for SQL style analysis and Pig/Streaming for ad-hoc analysis.
    • Tips
      • Few Larger files are preferred over many smaller files
      • Report Progress
        • For CPU intensive job, increase the mapred.task.timeout (default 10 mins)
      • Use Distributed cache
        • To make data available to all mappers/reducers. For example keeping look up hash map
        • Used to make auxiliary jars available among mappers/reducers
    • Tips
      • Use SequenceFile and MapFile
        • Splittable. Unlike other compressable format, they are map reduce job friendly and each map gets an independent split to work on
        • Compressible. By using block compression you get the benefits of compression (use less disk space, faster to read and write), while keeping the file splittable still.
        • Compact. SequenceFiles are usually used with Hadoop Writable objects, which have a pretty compact format.
      • A MapFile is an indexed SequenceFile, useful for if you want to do look-ups by key.
    • Mahout (Machine learning library)
      • Collaborative Filtering
      • User and Item based recommenders
      • K-Means, Fuzzy K-Means clustering
      • Mean Shift clustering
      • Dirichlet process clustering
      • Latent Dirichlet Allocation
      • Singular value decomposition
      • Parallel Frequent Pattern mining
      • Complementary Naive Bayes classifier
    • Different minds Different interpretation
      • http://www.youtube.com/watch?v=9izUKE5bN0U
    • Hadoop in live environment
      • Google
      • Yahoo
      • Amazon
      • LinkedIn
      • Facebook
      • StumbleUpon
      • Nokia
      • Last.fm
      • Clickable
    • @Google
      • Google uses it for
        • indexing the web
        • computing PageRank
        • processing geographic information in Google Maps
        • clustering news articles,
        • machine translation
        • Google Trends etc
    • @Google
      • An Example :
        • 403,152 TB (terabytes) data
        • 394 machines were allocated
        • Completion time is 6 minutes and a half.
      • Google indexing system uses 20TB data
      • Bigtable (Hbase) is used for many Google products such as Orkut, Finance etc.
      • Sawzall is used for massive log processing
    • @Yahoo!
      • The Two Quadrillionth Bit of π is 0!
        • One of the largest computations took 23 days of wall clock time and 503 years of CPU time on a 1000-node cluster
        • Yahoo! Has 4000 nodes in hadoop cluster
      • Following slides have been taken from opencirrus summit 2009
    • Hadoop is critical to Yahoo’s business
      • When you visit yahoo, you are interacting with data processed with Hadoop!
      Ads Optimization Content Optimization Search Index Content Feed Processing Machine Learning (e.g. Spam filters)
    • Tremendous Impact on Productivity
      • Makes Developers & Scientists more productive
        • Key computations solved in days and not months
        • Projects move from research to production in days
        • Easy to learn, even our rocket scientists use it!
      • The major factors
        • You don’t need to find new hardware to experiment
        • You can work with all your data!
        • Production and research based on same framework
        • No need for R&D to do IT (it just works)
    • Search & Advertising Sciences Hadoop Applications: Search Assist™
      • Database for Search Assist™ is built using Hadoop.
      • 3 years of log-data
      • 20-steps of map-reduce
      Before Hadoop After Hadoop Time 26 days 20 minutes Language C++ Python Development Time 2-3 weeks 2-3 days
    • Largest Hadoop Clusters in the Universe
      • 25,000+ nodes (~200,000 cores)
        • Clusters of up to 4,000 nodes
      • 4 Tiers of clusters
        • Development, Testing and QA (~10%)
        • Proof of Concepts and Ad-Hoc work (~10%)
          • Runs the latest version of Hadoop – currently 0.20
        • Science and Research (~60%)
          • Runs more stable versions
        • Production (~20%)
          • Currently Hadoop 0.18.3
    • Large Hadoop-Based Applications 2008 2009 Webmap ~70 hours runtime ~300 TB shuffling ~200 TB output 1480 nodes ~73 hours runtime ~490 TB shuffling ~280 TB output 2500 nodes Sort benchmarks (Jim Gray contest)
      • 1 Terabyte sorted
      • 209 seconds
      • 900 nodes
      • 1 Terabyte sorted
      • 62 seconds, 1500 nodes 1 Petabyte sorted
      • 16.25 hours, 3700 nodes
      Largest cluster
      • 2000 nodes
      • 6PB raw disk
      • 16TB of RAM
      • 16K CPUs
      • 4000 nodes
      • 16PB raw disk
      • 64TB of RAM
      • 32K CPUs
      • (40% faster CPUs too)
    • @Facebook
      • Claims to have the largest single Hadoop cluster in the world
      • Have multiple clusters at separate data centers
      • Largest warehouse cluster currently spans 3000 of machines
      • Scan around 2 petabytes per day
      • 300 people throughout the company query this warehouse every month
    • @Facebook
      • Facebook ”messages” uses the Hbase in prod
      • Collects click logs in near real time from web servers and stream them directly into Hadoop clusters
      • Medium-term archiving of MySQL databases
        • Fast backup and recovery from data stored in Hadoop File System
        • Reduces maintenance and deployment costs for archiving petabyte size datasets.
    • @Nokia
      • Started using hadoop in August 2009 in search analytics team
      • Started with 15 machines as part of cluster
      • To analyse large scale search logs for various analytics purposes
      • Search relevance calculation
      • Duplicate places handling, data cleaning
      • Fuzzy query parsing and tagging for spelling correction and lookahead suggestion model
    • @Clickable
      • Using Hbase, HDFS, Map reduce for various purposes such as data storage, analytics, reportings and recommendations
      • 7 machines cluster for production
      • Used Hbase to address continous data updates from networks or any other user action at our end.
    • @Stumbleupon
      • Log early, log often, log everything
      • No piece of data is too small or too noisy to be used in future
      • Uses for apache log file processing and session analysis, spam detection
    • @Stumbleupon
      • Uses Scribe to collect data directly into HDFS where it is reviewed and processed by number of systems
      • Uses MR to extract data from logs for click counts
      • Uses for search index updates, thumbnail creation and recommendation systems
        • Questions?