Hadoop: Playing with data, at scale


If you have a lot of data to process…
           what should you know?

Mahesh Tiyyagura

25th November, Bangalore
Mahesh Tiyyagura

Email: tmahesh@gmail.com
http://www.twitter.com/tmahesh

Works on large-scale crawling and extraction of structured data from the web.
Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.
Hadoop
•    Massively scalable storage and batch data processing system

•    It's all about scale…
       –  Scaling hardware infrastructure (horizontal scaling)
       –  Scaling operations and maintenance (handling failures)
       –  Scaling developer productivity (keep it simple)
Numbers you should know…
•    You can store, say, 10TB of data per node
•    1 Disk: 75MB/sec (sequential read)
•    Say you want to process 200GB of data
•    That's ~45 minutes just to read the data from one disk!! (back-of-envelope sketch at the end of this slide)
•    Processing the data (CPU) is much faster (say, 10x)
•    To remove the bottleneck, we need to read data in parallel
•    Read from 100 Disks in parallel: 7.5GB/sec!!
•    Insight: Move computation, NOT data

•    Oh! BTW, data should not (and cannot) reside on only one node
•    In a 1000-node cluster, you can expect ~10 failures per week
•    For peace of mind, reliability should be handled by software


Hadoop is designed to address these issues.
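A back-of-envelope sketch of the arithmetic above, as a toy Python calculation. The 75MB/sec-per-disk and 200GB figures are the slide's illustrative numbers, not benchmarks:

    # Rough read-time estimate: 200GB at 75MB/sec, on 1 disk vs. 100 disks
    DATA_GB = 200
    DISK_MB_PER_SEC = 75

    def read_time_seconds(data_gb, disks):
        total_mb = data_gb * 1024
        return total_mb / (DISK_MB_PER_SEC * disks)

    print(read_time_seconds(DATA_GB, 1) / 60)   # ~45 minutes on a single disk
    print(read_time_seconds(DATA_GB, 100))      # ~27 seconds across 100 disks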
The Platform, in brief…
•    HDFS: Storing Data
      –  Data split into multiple blocks across nodes
      –  Replication protects data from failures
      –  A master node orchestrates the read/write requests (without being a bottleneck!!)
      –  Scales linearly… 4TB of raw disk translates to ~1TB of usable storage (default 3x replication plus headroom; tunable)

•    MapReduce (MR): Processing Data
      –  A beautiful abstraction; asks user to implement just 2 functions (Map and Reduce)
      –  You don't need any knowledge of network IO, node failures, checkpoints, distributed what??
      –  Most data processing jobs can be expressed in the MapReduce abstraction
      –  Data processed locally, in parallel. Reliability is implicit.
      –  A giant merge sort infrastructure does the magic

       We'll revisit this slide. Some things are better understood in retrospect.
HDFS
MR: Programming Model
•    Map function: (key, value) -> (key1, value1) list
•    Reduce function: (key1, value1 list) -> (key1, output)

•    Examples:
      –  map(k, v) -> emit(k, v.toUpper())
      –  map(k, v) -> foreach c in v; do emit(k, c); done

      –  reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
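A minimal Python sketch of the example functions above, with emit modelled as yield. This only illustrates the programming model; it is not the actual Hadoop API:

    # map(k, v) -> emit(k, v.toUpper())
    def map_upper(k, v):
        yield k, v.upper()

    # map(k, v) -> emit one (key, character) pair per character of the value
    def map_chars(k, v):
        for c in v:
            yield k, c

    # reduce(k, vals) -> emit(k, sum of vals)
    def reduce_sum(k, vals):
        yield k, sum(vals)

    print(list(map_upper("doc1", "hello")))   # [('doc1', 'HELLO')]
    print(list(reduce_sum("w", [1, 2, 3])))   # [('w', 6)]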
MAPREDUCE
Thinking in MapReduce ….
•    Word Count Example (a runnable toy version follows this slide)
      –  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
      –  reduce(word, counts list) -> foreach count in counts; do sum += count; done; emit(word, sum)




•    Document search index (Inverted index)
      –  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
      –  reduce(term, docid list) -> emit(term, docid list)
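A toy, single-machine run of the word-count example above. The shuffle/sort that Hadoop performs between map and reduce is simulated with an in-memory dict; all names here are illustrative:

    from collections import defaultdict

    def wc_map(docid, text):
        for word in text.split():
            yield word, 1

    def wc_reduce(word, counts):
        yield word, sum(counts)

    def run(mapper, reducer, records):
        groups = defaultdict(list)
        for k, v in records:                  # map phase
            for mk, mv in mapper(k, v):
                groups[mk].append(mv)         # "shuffle": group map output by key
        return [out for rk, rvals in groups.items()
                    for out in reducer(rk, rvals)]   # reduce phase

    docs = [("d1", "to be or not to be"), ("d2", "to do or not to do")]
    print(run(wc_map, wc_reduce, docs))
    # [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('do', 2)]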
Thinking in MapReduce ….

•    All the anchor text to a page
      –  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
      –  reduce(link, anchorText list) -> emit(link, anchorText list)




•    Image resize
      –  map(imgid, image) -> emit(imgid, image.resize())
      –  No reduce needed (a map-only job)
Hadoop Streaming Demo
•    cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile

•    Each line of the input file is written to the stdin of shellMapper.sh
•    Each line on stdout of shellMapper.sh is split into a key and a value (text before the first tab is the key, the rest is the value)
•    Each (key, value) pair, sorted and grouped by key, is fed as a line to the stdin of shellReducer.sh
•    Each line on stdout of shellReducer.sh is written to the output file (sample mapper.py/reducer.py sketches follow this slide)

•    hadoop jar hadoop-streaming.jar \
         -input myInputDirs \
         -output myOutputDir \
         -mapper /bin/cat \
         -reducer /bin/wc

•    More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html

•    Will switch to the terminal now for the DEMO…
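A sketch of the word-count job as two Hadoop Streaming scripts in Python, following the tab-separated stdin/stdout protocol described above (mapper.py and reducer.py are hypothetical file names):

    # mapper.py -- read lines from stdin, emit "word<TAB>1" for each word
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- streaming input arrives sorted by key, so counts for
    # consecutive identical words can be summed with one running counter
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

To test locally before submitting (the pipe from the first bullet): cat someInputFile | python mapper.py | sort | python reducer.py > someOutputFile. On the cluster, pass -mapper mapper.py -reducer reducer.py (shipping the scripts with -file) to hadoop-streaming.jar.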
Brief intro to PIG
•    Ad hoc data analysis
•    An abstraction over MapReduce
•    Think of it as a stdlib for MapReduce
•    Supports common data processing operators (join, group by)
•    A high-level language for data processing

 PIG demo – switch to terminal

•    Also try HIVE, which exposes a SQL-like interface over data in HDFS

HIVE demo – switch to terminal