An introduction to Hadoop for large scale data analysis
Presentation Transcript

  • Hadoop – Large scale data analysis
    Abhijit Sharma
    9/8/2011
  • Big Data Trends
    Unprecedented growth in:
    Data set size – Facebook 21+ PB data warehouse, 12+ TB/day
    Un(semi)-structured data – logs, documents, graphs
    Connected data – web, tags, graphs
    Relevant to enterprises – logs, social media, machine-generated data, breaking of silos
  • Putting Big Data to work
    Data-driven org – decision support, new offerings
    Analytics on large data sets (FB Insights – Page, App etc. stats)
    Data Mining – Clustering – Google News articles
    Search – Google
  • Problem characteristics and examples
    Embarrassingly data parallel problems
    Data chunked & distributed across the cluster
    Parallel processing with data locality – tasks dispatched where the data is
    Horizontal/linear scaling approach using commodity hardware
    Write once, read many
    Examples:
    Distributed logs – grep, # of accesses per URL
    Search – Term Vector generation, Reverse Links
  • What is Hadoop?
    Open source system for large scale batch distributed computing on big data
    Map Reduce Programming Paradigm & Framework
    Map Reduce Infrastructure
    Distributed File System (HDFS)
    Endorsed/used extensively by web giants – Google, FB, Yahoo!
  • Map Reduce - Definition
    MapReduce is a programming model and an implementation for parallel processing of large data sets
    Map processes each logical record per input split to generate a set of intermediate key/value pairs
    Reduce merges all intermediate values associated with the same intermediate key
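    In type terms (as given in the original Google MapReduce paper):
    map    (k1, v1)       -> list(k2, v2)
    reduce (k2, list(v2)) -> list(v2)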
  • Map Reduce - Functional Programming Origins
    Map : Apply a function to each list member - parallelizable
    [1, 2, 3].collect { it * it }
    Output : [1, 2, 3] -> Map (Square) : [1, 4, 9]
    Reduce : Apply a function and an accumulator to each list member
    [1, 2, 3].inject(0) { sum, item -> sum + item }
    Output : [1, 2, 3] -> Reduce (Sum) : 6
    Map & Reduce
    [1, 2, 3].collect { it * it }.inject(0) { sum, item -> sum + item }
    Output : [1, 2, 3] -> Map (Square) -> [1, 4, 9] -> Reduce (Sum) : 14
  • Word Count - Shell
    cat * | grep <word> | sort | uniq -c
    input | map | shuffle & sort | reduce
  • Word Count - Map Reduce
  • Word Count - Pseudo code
    mapper (filename, file-contents):
      for each word in file-contents:
        emit (word, 1) // a count of 1 per occurrence of a word, e.g. ("the", 1) for each occurrence of "the"
    reducer (word, Iterator values): // iterator over the list of counts for a word, e.g. ("the", [1, 1, ..])
      sum = 0
      for each value in values:
        sum = sum + value
      emit (word, sum)
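    A minimal Java sketch of the pseudocode above, using the org.apache.hadoop.mapreduce API; the class names WordCountMapper/WordCountReducer are illustrative, and both classes are shown together for brevity (in a real job they would typically be public classes or static nested classes of the driver):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // mapper(filename, file-contents): emit (word, 1) for every word occurrence
    class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);               // e.g. ("the", 1)
            }
        }
    }

    // reducer(word, values): sum the counts collected for each word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(word, new IntWritable(sum));  // e.g. ("the", 42)
        }
    }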
  • Examples – Map Reduce Definition
    Word Count / distributed log search for # of accesses to various URLs
    Map – emits (word/URL, 1) for each doc/log split
    Reduce – sums up the counts for a specific word/URL
    Term Vector generation – term -> [doc-id]
    Map – emits (term, doc-id) for each doc split
    Reduce – Identity Reducer – accumulates the (term, [doc-id, doc-id, ..])
    Reverse Links – source -> target inverted to target -> source
    Map – emits (target, source) for each doc split
    Reduce – Identity Reducer – accumulates the (target, [source, source, ..])
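    A hypothetical sketch of the Term Vector example above, assuming an input format that delivers (doc-id, doc-contents) records (e.g. KeyValueTextInputFormat) and a reducer that concatenates doc-ids rather than the stock identity reducer:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map – emits (term, doc-id) for each doc split
    class TermVectorMapper extends Mapper<Text, Text, Text, Text> {
        private final Text term = new Text();

        @Override
        protected void map(Text docId, Text contents, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(contents.toString());
            while (tokens.hasMoreTokens()) {
                term.set(tokens.nextToken());
                context.write(term, docId);
            }
        }
    }

    // Reduce – accumulates (term, [doc-id, doc-id, ..]) as a comma-separated list
    class TermVectorReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text term, Iterable<Text> docIds, Context context)
                throws IOException, InterruptedException {
            StringBuilder ids = new StringBuilder();
            for (Text docId : docIds) {
                if (ids.length() > 0) ids.append(",");
                ids.append(docId.toString());
            }
            context.write(term, new Text(ids.toString()));
        }
    }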
  • Map Reduce – Hadoop Implementation
    Hides the complexity of distributed computing
    Automatic parallelization of jobs
    Automatic data chunking & distribution (via HDFS)
    Data locality – MR tasks dispatched where the data is
    Fault tolerant to server, storage and N/W failures
    Network and disk transfer optimization
    Load balancing
  • Hadoop Map Reduce Architecture
  • HDFS Characteristics
    Very large files – block size 64 MB/128 MB
    Data access pattern – write once, read many
    Writes are large, create & append only
    Reads are large & streaming
    Commodity hardware
    Tolerant to failure – server, storage, network
    Highly available through transparent replication
    Throughput is more important than latency
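    As an illustrative sketch (the path, replication factor and block size are made-up values), a client can request a per-file block size and replication factor when creating a file through the Hadoop FileSystem API, in keeping with the write-once, large-block model above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateLargeFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            // Write once: create with a 128 MB block size and 3 replicas.
            FSDataOutputStream out = fs.create(
                    new Path("/data/events.log"),
                    true,                  // overwrite if it exists
                    4096,                  // io buffer size
                    (short) 3,             // replication factor
                    128L * 1024 * 1024);   // block size in bytes
            out.writeBytes("large, streaming, append-style writes go here\n");
            out.close();
            fs.close();
        }
    }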
  • HDFS Architecture
  • Thanks
  • Backup Slides
  • Map & Reduce Functions
  • Job Configuration
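    A hedged sketch of a typical job configuration and submission for the word-count example sketched earlier (class names and paths are illustrative; Job.getInstance is the Hadoop 2+ form, older releases used new Job(conf, ...)):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);    // from the word-count sketch above
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);  // submit and wait for completion
        }
    }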
  • Hadoop Map Reduce Components
    Job Tracker – tracks MR jobs; runs on the master node
    Task Tracker
    Runs on data nodes and tracks the Mapper and Reducer tasks assigned to the node
    Sends heartbeats to the Job Tracker
    Maintains and picks up tasks from a queue
  • HDFS
    Name Node
    Manages the file system namespace and regulates access to files by clients – stores metadata
    Maps blocks to Data Nodes and replicas
    Manages replication
    Executes file system namespace operations like opening, closing, and renaming files and directories
    Data Node
    One per node; manages the local storage attached to the node
    Internally, a file is split into one or more blocks and these blocks are stored on a set of Data Nodes
    Responsible for serving read and write requests from the file system's clients; Data Nodes also perform block creation, deletion, and replication upon instruction from the Name Node
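    A minimal client-side sketch of the namespace and read operations listed above (the paths are made up); the metadata calls go to the Name Node, while the streamed bytes are served by Data Nodes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Namespace operations executed by the Name Node.
            fs.mkdirs(new Path("/reports"));
            fs.rename(new Path("/data/events.log"), new Path("/reports/events.log"));

            // Open & read: block locations come from the Name Node,
            // the streaming read itself is served by the Data Nodes holding the blocks.
            FSDataInputStream in = fs.open(new Path("/reports/events.log"));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.close();
            fs.close();
        }
    }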