An introduction to Hadoop for large scale data analysis
Presentation Transcript

• Hadoop – Large scale data analysis
  Abhijit Sharma
  9/8/2011
• Big Data Trends
  Unprecedented growth in:
  Data set size – Facebook: 21+ PB data warehouse, 12+ TB/day
  Un(semi)-structured data – logs, documents, graphs
  Connected data – web, tags, graphs
  Relevant to enterprises – logs, social media, machine-generated data, breaking down of data silos
• Putting Big Data to work
  Data-driven organization – decision support, new offerings
  Analytics on large data sets – e.g. Facebook Insights (Page, App, etc. stats)
  Data mining – clustering, e.g. grouping Google News articles
  Search – e.g. Google
• Problem characteristics and examples
  Embarrassingly data-parallel problems
  Data chunked & distributed across the cluster
  Parallel processing with data locality – tasks dispatched where the data is
  Horizontal/linear scaling on commodity hardware
  Write once, read many
  Examples:
  Distributed logs – grep, # of accesses per URL
  Search – term vector generation, reverse links
• What is Hadoop?
  Open-source system for large-scale batch distributed computing on big data
  Map Reduce programming paradigm & framework
  Map Reduce infrastructure
  Distributed file system (HDFS)
  Endorsed/used extensively by web giants – Google, Facebook, Yahoo!
• Map Reduce - Definition
  MapReduce is a programming model and an implementation for parallel processing of large data sets
  Map processes each logical record of an input split to generate a set of intermediate key/value pairs
  Reduce merges all intermediate values associated with the same intermediate key
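  In Hadoop's Java API these two roles correspond to the Mapper and Reducer base classes. A minimal sketch of their contracts (using the org.apache.hadoop.mapreduce API; the My* class names are illustrative):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> – map() is called once per input record
    class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit intermediate (key, value) pairs with context.write(...)
        }
    }

    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> – reduce() is called once per intermediate key,
    // with all values that share that key
    class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // merge the values for this key and write the result with context.write(...)
        }
    }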
• Map Reduce - Functional Programming Origins
  Map: apply a function to each list member – parallelizable (Groovy example)
  [1, 2, 3].collect { it * it }
  Output: [1, 2, 3] -> Map (square) -> [1, 4, 9]
  Reduce: fold a function with an accumulator over the list members
  [1, 2, 3].inject(0) { sum, item -> sum + item }
  Output: [1, 2, 3] -> Reduce (sum) -> 6
  Map & Reduce composed
  [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
  Output: [1, 2, 3] -> Map (square) -> [1, 4, 9] -> Reduce (sum) -> 14
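  The same composed pipeline can be written with Java streams (Java being the language of the Hadoop examples later in the deck); a small, self-contained sketch:

    import java.util.Arrays;
    import java.util.List;

    public class SquareSum {
        public static void main(String[] args) {
            List<Integer> input = Arrays.asList(1, 2, 3);

            int result = input.stream()
                              .map(x -> x * x)          // Map (square): [1, 4, 9]
                              .reduce(0, Integer::sum); // Reduce (sum) with accumulator 0: 14

            System.out.println(result);                 // prints 14
        }
    }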
• Word Count - Shell
  cat * | grep -oE '\w+' | sort | uniq -c
  input | map | shuffle & sort | reduce
  (grep -o emits one matched word per line, so sort groups identical words and uniq -c counts them)
• Word Count - Map Reduce
• Word Count - Pseudo code
  mapper(filename, file-contents):
    for each word in file-contents:
      emit(word, 1)                  // one count per occurrence, e.g. ("the", 1) for every "the"

  reducer(word, Iterator values):    // values is the list of counts for a word, e.g. ("the", [1, 1, ...])
    sum = 0
    for each value in values:
      sum = sum + value
    emit(word, sum)
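  The same word count written against the Hadoop Java API – a minimal sketch using org.apache.hadoop.mapreduce (the API types are stock Hadoop; the class names are illustrative):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map: for each input line, emit (word, 1) for every word
        public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum all counts for a word and emit (word, total)
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }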
• Examples – Map Reduce Definitions
  Word Count / distributed log search for # of accesses per URL
  Map – emits (word/URL, 1) for each doc/log split
  Reduce – sums up the counts for a specific word/URL
  Term vector generation – term -> [doc-id]
  Map – emits (term, doc-id) for each doc split
  Reduce – identity reducer – accumulates (term, [doc-id, doc-id, ...])
  Reverse links – invert source -> target links into target -> source
  Map – emits (target, source) for each doc split
  Reduce – identity reducer – accumulates (target, [source, source, ...])
  (a sketch of the reverse-links job follows this list)
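  As one worked example, a minimal sketch of the reverse-links job in Hadoop Java, assuming each input line has the form "source<TAB>target" (an illustrative assumption) and a reducer that simply collects the sources for each target:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ReverseLinks {

        // Map: read "source \t target" and emit (target, source)
        public static class InvertMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split("\t");
                if (parts.length == 2) {
                    context.write(new Text(parts[1]), new Text(parts[0]));
                }
            }
        }

        // Reduce: collect all sources that point at the same target into one list
        public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text target, Iterable<Text> sources, Context context)
                    throws IOException, InterruptedException {
                StringBuilder list = new StringBuilder();
                for (Text source : sources) {
                    if (list.length() > 0) {
                        list.append(", ");
                    }
                    list.append(source.toString());
                }
                context.write(target, new Text(list.toString()));
            }
        }
    }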
• Map Reduce – Hadoop Implementation
  Hides the complexity of distributed computing:
  Automatic parallelization of the job
  Automatic data chunking & distribution (via HDFS)
  Data locality – MR tasks dispatched where the data is
  Fault tolerance to server, storage, and network failures
  Network and disk transfer optimization
  Load balancing
  (a sketch of the small job driver a user writes is shown below)
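  Because the framework takes care of all of the above, the user-supplied code shrinks to the map/reduce classes plus a small driver. A minimal sketch, reusing the WordCount classes from the earlier slide (input/output paths are passed on the command line):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            // plug in the user-supplied map and reduce logic
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // HDFS input and output locations
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // the framework handles splitting, scheduling, shuffle & sort, and fault tolerance
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }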
• Hadoop Map Reduce Architecture
• HDFS Characteristics
  Very large files – block size 64 MB / 128 MB
  Data access pattern – write once, read many
  Writes are large, create & append only
  Reads are large & streaming
  Commodity hardware
  Tolerant to failures – server, storage, network
  Highly available through transparent replication
  Throughput matters more than latency
  (a small client-side example of this access pattern follows)
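  To make the write-once / streaming-read pattern concrete, a minimal sketch using the HDFS Java client API in org.apache.hadoop.fs (the file path is illustrative; the cluster address comes from the usual core-site.xml / hdfs-site.xml configuration):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();       // picks up the cluster configuration
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("/user/demo/events.log");  // illustrative path

            // Write once: create the file and stream data into it
            try (FSDataOutputStream out = fs.create(file)) {
                out.write("first record\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read many: open and stream the file back (large, sequential reads)
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }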
• HDFS Architecture
• Thanks
• Backup Slides
• Map & Reduce Functions
• Job Configuration
• Hadoop Map Reduce Components
  Job Tracker
  Runs on the master node and tracks MR jobs
  Task Tracker
  Runs on the data nodes and tracks the Mapper/Reducer tasks assigned to that node
  Sends heartbeats to the Job Tracker
  Maintains and picks up tasks from a queue
• HDFS
  Name Node
  Manages the file system namespace and regulates client access to files – stores the metadata
  Maintains the mapping of blocks to Data Nodes and their replicas
  Manages replication
  Executes namespace operations such as opening, closing, and renaming files and directories
  Data Node
  One per node; manages the local storage attached to that node
  Internally, a file is split into one or more blocks, and these blocks are stored across a set of Data Nodes
  Serves read and write requests from the file system's clients; Data Nodes also perform block creation, deletion, and replication on instruction from the Name Node