Hadoop: Playing with data, at scale


If you have a lot of data to process…
           What should you know?

Mahesh Tiyyagura

Email: tmahesh@gmail.com
http://www.twitter.com/tmahesh

Work on large scale crawling and extraction of ...
Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing data, and sensor data. Hadoop and the ecosystem of projects built around it provide simple, easy-to-use tools for storing and analyzing such large data collections on commodity hardware.

Topics Covered

* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce PigLatin, an easy-to-use data processing language.

Speaker Profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate of IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on Search and Advertising products.

Published in: Technology, Education


  1. Hadoop: Playing with data, at scale
     If you have a lot of data to process… what should you know?
     Mahesh Tiyyagura
     25th November, Bangalore
  2. Mahesh Tiyyagura
     Email: tmahesh@gmail.com
     http://www.twitter.com/tmahesh
     Works on large-scale crawling and extraction of structured data from the web. Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.
  3. Hadoop
     •  Massively scalable storage and batch data processing system
     •  It's all about scale…
        –  Scaling hardware infrastructure (horizontal scaling)
        –  Scaling operations and maintenance (handling failures)
        –  Scaling developer productivity (keep it simple)
  4. Numbers you should know…
     •  You can store, say, 10TB of data per node
     •  1 disk: 75MB/sec (sequential read)
     •  Say you want to process 200GB of data
     •  That is ~45 minutes (200GB / 75MB/sec) just to read the data!!
     •  Processing data (CPU) is much, much faster (say, 10x)
     •  To remove the bottleneck, we need to read data in parallel
     •  Read from 100 disks in parallel: 7.5GB/sec!!
     •  Insight: move computation, NOT data
     •  Oh! BTW, data should not (and cannot) reside on only one node
     •  In a 1000-node cluster, you can expect ~10 failures per week
     •  For peace of mind, reliability should be handled by software
     Hadoop is designed to address these issues.
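The read-time arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 75MB/sec, 200GB and 100-disk figures come from the slide, while decimal units (1GB = 1000MB) and perfectly even striping across disks are assumptions.

```python
# Back-of-the-envelope read times using the slide's figures.
DISK_MB_PER_SEC = 75   # sequential read speed of one disk (from the slide)
DATA_GB = 200          # size of the data set (from the slide)
DISKS = 100            # disks read in parallel (from the slide)
MB_PER_GB = 1000       # assumption: decimal units


def read_seconds(data_gb, disks):
    """Seconds to scan data_gb when it is spread evenly over `disks` disks."""
    return data_gb * MB_PER_GB / (DISK_MB_PER_SEC * disks)


single_disk_minutes = read_seconds(DATA_GB, 1) / 60
parallel_seconds = read_seconds(DATA_GB, DISKS)
aggregate_gb_per_sec = DISK_MB_PER_SEC * DISKS / MB_PER_GB

print(f"1 disk: {single_disk_minutes:.0f} min")
print(f"{DISKS} disks: {parallel_seconds:.1f} s at {aggregate_gb_per_sec} GB/sec")
```

Under these assumptions a single disk needs about three quarters of an hour, while 100 disks finish the same scan in under half a minute, which is the case for moving computation to where the data already sits.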
  5. The Platform, in brief…
     •  HDFS: Storing Data
        –  Data split into multiple blocks across nodes
        –  Replication protects data from failures
        –  A master node orchestrates the read/write requests (without being a bottleneck!!)
        –  Scales linearly… 4TB of raw disk translates to ~1TB of storage (tunable)
     •  MapReduce (MR): Processing Data
        –  A beautiful abstraction; asks the user to implement just 2 functions (Map and Reduce)
        –  You need no knowledge of network IO, node failures, checkpoints, distributed what??
        –  Most data processing jobs can be mapped onto the MapReduce abstraction
        –  Data is processed locally, in parallel. Reliability is implicit.
        –  A giant merge sort infrastructure does the magic
     Will revisit this slide. Some things are better understood in retrospect.
  6. HDFS
  7. MR: Programming Model
     •  Map function: (key, value) -> (key1, value1) list
     •  Reduce function: (key1, value1 list) -> (key1, output)
     •  Examples:
        –  map(k, v) -> emit(k, v.toUpper())
        –  map(k, v) -> foreach c in v; do emit(k, c); done
        –  reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
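To make the two signatures concrete, here is a minimal in-memory sketch of the programming model in Python. It is not how Hadoop is implemented: `run_mapreduce`, `explode` and `total` are hypothetical names, and everything runs in one process with no distribution or fault tolerance. The map and reduce functions mirror the second and third examples on the slide.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group emitted values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Visiting keys in sorted order mimics the sorted input a reducer sees.
    return [reduce_fn(key, groups[key]) for key in sorted(groups)]


def explode(key, value):
    # map(k, v) -> foreach c in v; do emit(k, c); done
    for c in value:
        yield key, c


def total(key, values):
    # reduce(k, vals) -> emit(k, sum of vals)
    return key, sum(values)


result = run_mapreduce([("a", [1, 2]), ("b", [3]), ("a", [4])], explode, total)
print(result)  # [('a', 7), ('b', 3)]
```

The user code is only `explode` and `total`; the grouping step in between is what the framework provides.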
  8. MAPREDUCE
  9. Thinking in MapReduce…
     •  Word Count Example
        –  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
        –  reduce(word, counts list) -> foreach count in counts; do sum += count; done; emit(word, sum)
     •  Document search index (Inverted index)
        –  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
        –  reduce(term, docid list) -> emit(term, docid list)
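Both jobs on this slide can be tried on a toy corpus. The sketch below groups by sorting the emitted pairs, a small-scale stand-in for the "giant merge sort" mentioned earlier. The helper `map_reduce` is hypothetical, and `text.split()` stands in for the slide's `getTerms(html)`, so the documents here are plain text rather than HTML.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, map_fn, reduce_fn):
    # Map every record, sort the emitted pairs by key, then reduce each key group.
    pairs = sorted(
        (pair for key, value in records for pair in map_fn(key, value)),
        key=itemgetter(0),
    )
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=itemgetter(0))
    ]


docs = [("d1", "to be or not to be"), ("d2", "to do")]


def wc_map(docid, text):
    return [(word, 1) for word in text.split()]


def wc_reduce(word, counts):
    return word, sum(counts)


def index_map(docid, text):
    # set() emits each term once per document.
    return [(term, docid) for term in set(text.split())]


def index_reduce(term, docids):
    return term, sorted(docids)


word_counts = dict(map_reduce(docs, wc_map, wc_reduce))
index = dict(map_reduce(docs, index_map, index_reduce))
print(word_counts["to"], index["to"], index["be"])
```

The only difference between the two jobs is what the mapper emits as the value (a count of 1 versus the document id) and what the reducer does with the list.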
  10. Thinking in MapReduce…
      •  All the anchor text to a page
         –  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
         –  reduce(link, anchorText list) -> emit(link, anchorText list)
      •  Image resize
         –  map(imgid, image) -> emit(imgid, image.resize())
         –  No need for reduce
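A possible shape for the anchor-text job, sketched with Python's standard `html.parser`. The slide does not define `getLinks`, so `LinkParser` and `get_links` below are one assumed implementation, and the sample pages and URLs are made up for illustration. The grouping loop at the end plays the role of the shuffle plus the identity-style reducer.

```python
from collections import defaultdict
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def get_links(html):
    parser = LinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


def map_anchor(docid, html):
    # map(docid, html) -> emit(link, anchorText) for every link on the page
    for url, anchor_text in get_links(html):
        yield url, anchor_text


pages = [
    ("p1", '<p>See <a href="http://example.com/h">Hadoop docs</a></p>'),
    ("p2", '<a href="http://example.com/h">HDFS and MapReduce</a>'),
]

anchors = defaultdict(list)
for docid, html in pages:
    for url, text in map_anchor(docid, html):
        anchors[url].append(text)

print(dict(anchors))
```

Note that the key emitted by the mapper is the target URL, not the source document, which is what brings all anchor texts pointing at one page together in a single reduce call.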
  11. Hadoop Streaming Demo
      •  cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile
      •  Each line in someInputFile is written to stdin of shellMapper.sh
      •  Each line on stdout of shellMapper.sh is split into key and value (text before the first tab is the key, the rest is the value)
      •  Each key, value pair is fed, sorted by key, as a line into stdin of shellReducer.sh
      •  Each line on stdout of shellReducer.sh is written to someOutputFile
      •  hadoop jar hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
      •  More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html
      •  Will share the terminal session now… for DEMO
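The streaming contract described above (lines in, tab-separated key and value lines out, reducer input sorted by key) can be sketched in Python. This is a local simulation, not the demo from the talk: `mapper`, `reducer` and `run_local` are hypothetical names, and `run_local` imitates the sort that sits between the map and reduce stages.

```python
import io


def mapper(stdin, stdout):
    # Emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")


def reducer(stdin, stdout):
    # Input lines arrive sorted by key, so equal keys are adjacent.
    current, total = None, 0
    for line in stdin:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")


def run_local(text):
    """Imitate `mapper | sort | reducer` on an in-memory string."""
    mapped = io.StringIO()
    mapper(io.StringIO(text), mapped)
    shuffled = "".join(sorted(mapped.getvalue().splitlines(keepends=True)))
    reduced = io.StringIO()
    reducer(io.StringIO(shuffled), reduced)
    return reduced.getvalue()


output = run_local("to be or not\nto be\n")
print(output, end="")
```

In a real streaming job each function would live in its own script that reads `sys.stdin` and writes `sys.stdout`, and the two scripts would be passed as the `-mapper` and `-reducer` arguments shown in the command above.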
  12. Brief intro to PIG
      •  Ad hoc data analysis
      •  An abstraction over MapReduce
      •  Think of it as a stdlib for MapReduce
      •  Supports common data processing operators (join, group by)
      •  A high-level language for data processing
      PIG demo – switch to terminal
      •  Also try… HIVE, which exposes a SQL-like interface over HDFS data
      HIVE demo – switch to terminal
