Hadoop: Playing with data, at scale


If you have a lot of data to process…
           what should you know?

Mahesh Tiyyagura

25th November, Bangalore
Mahesh Tiyyagura

Email: tmahesh@gmail.com
http://www.twitter.com/tmahesh

Works on large-scale crawling and extraction of structured data from the web.
Used Hadoop at Yahoo! to run machine learning algorithms and analyze click logs.
Hadoop
•    Massively scalable storage and batch data processing system

•    It's all about scale…
       –  Scaling hardware infrastructure (horizontal scaling)
       –  Scaling operations and maintenance (handling failures)
       –  Scaling developer productivity (keep it simple)
Numbers you should know…
•    You can store, say, 10TB of data per node
•    1 Disk: 75MB/sec (sequential read)
•    Say you want to process 200GB of data
•    That's ~45 minutes just to read the data from one disk!! (back-of-envelope sketch at the end of this slide)
•    Processing the data (CPU) is much faster (say, 10x)
•    To remove the bottleneck, we need to read data in parallel
•    Read from 100 Disks in parallel: 7.5GB/sec!!
•    Insight: Move computation, NOT data

•    Oh! BTW, data should not (and cannot) reside on only one node
•    In a 1000-node cluster, you can expect ~10 failures per week
•    For peace of mind, reliability should be handled by software


Hadoop is designed to address these issues.
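A back-of-envelope sketch of the arithmetic above, as a toy Python calculation. The 75MB/sec-per-disk and 200GB figures are the slide's illustrative numbers, not benchmarks:

    # Rough read-time estimate: 200GB at 75MB/sec, on 1 disk vs. 100 disks
    DATA_GB = 200
    DISK_MB_PER_SEC = 75

    def read_time_seconds(data_gb, disks):
        total_mb = data_gb * 1024
        return total_mb / (DISK_MB_PER_SEC * disks)

    print(read_time_seconds(DATA_GB, 1) / 60)   # ~45 minutes on a single disk
    print(read_time_seconds(DATA_GB, 100))      # ~27 seconds across 100 disks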
The Platform, in brief…
•    HDFS: Storing Data
      –  Data split into multiple blocks across nodes
      –  Replication protects data from failures
      –  A master node orchestrates the read/write requests (without being a bottleneck!!)
      –  Scales linearly… 4TB of raw disk translates to ~1TB of usable storage (default 3x replication plus headroom; tunable)

•    MapReduce (MR): Processing Data
      –  A beautiful abstraction; asks user to implement just 2 functions (Map and Reduce)
      –  You don't need any knowledge of network IO, node failures, checkpoints, distributed what??
      –  Most data processing jobs can be expressed in the MapReduce abstraction
      –  Data processed locally, in parallel. Reliability is implicit.
      –  A giant merge sort infrastructure does the magic

       We'll revisit this slide. Some things are better understood in retrospect.
HDFS
MR: Programming Model
•    Map function: (key, value) -> (key1, value1) list
•    Reduce function: (key1, value1 list) -> (key1, output)

•    Examples:
      –  map(k, v) -> emit(k, v.toUpper())
      –  map(k, v) -> foreach c in v; do emit(k, c); done

      –  reduce(k, vals) -> foreach v in vals; do sum += v; done; emit(k, sum)
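A minimal Python sketch of the example functions above, with emit modelled as yield. This only illustrates the programming model; it is not the actual Hadoop API:

    # map(k, v) -> emit(k, v.toUpper())
    def map_upper(k, v):
        yield k, v.upper()

    # map(k, v) -> emit one (key, character) pair per character of the value
    def map_chars(k, v):
        for c in v:
            yield k, c

    # reduce(k, vals) -> emit(k, sum of vals)
    def reduce_sum(k, vals):
        yield k, sum(vals)

    print(list(map_upper("doc1", "hello")))   # [('doc1', 'HELLO')]
    print(list(reduce_sum("w", [1, 2, 3])))   # [('w', 6)]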
MAPREDUCE
Thinking in MapReduce ….
•    Word Count Example (a runnable toy version follows this slide)
      –  map(docid, text) -> foreach word in text.split(); do emit(word, 1); done
      –  reduce(word, counts list) -> foreach count in counts; do sum += count; done; emit(word, sum)




•    Document search index (Inverted index)
      –  map(docid, html) -> foreach term in getTerms(html); do emit(term, docid); done
      –  reduce(term, docid list) -> emit(term, docid list)
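A toy, single-machine run of the word-count example above. The shuffle/sort that Hadoop performs between map and reduce is simulated with an in-memory dict; all names here are illustrative:

    from collections import defaultdict

    def wc_map(docid, text):
        for word in text.split():
            yield word, 1

    def wc_reduce(word, counts):
        yield word, sum(counts)

    def run(mapper, reducer, records):
        groups = defaultdict(list)
        for k, v in records:                  # map phase
            for mk, mv in mapper(k, v):
                groups[mk].append(mv)         # "shuffle": group map output by key
        return [out for rk, rvals in groups.items()
                    for out in reducer(rk, rvals)]   # reduce phase

    docs = [("d1", "to be or not to be"), ("d2", "to do or not to do")]
    print(run(wc_map, wc_reduce, docs))
    # [('to', 4), ('be', 2), ('or', 2), ('not', 2), ('do', 2)]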
Thinking in MapReduce ….

•    All the anchor text to a page
      –  map(docid, html) -> foreach link in getLinks(html); do emit(link, anchorText); done
      –  reduce(link, anchorText list) -> emit(link, anchorText list)




•    Image resize
      –  map(imgid, image) -> emit(imgid, image.resize())
      –  No reduce needed (a map-only job)
Hadoop Streaming Demo
•    cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile

•    Each line of the input file is written to the stdin of shellMapper.sh
•    Each line on stdout of shellMapper.sh is split into a key and a value (text before the first tab is the key, the rest is the value)
•    Each (key, value) pair, sorted and grouped by key, is fed as a line to the stdin of shellReducer.sh
•    Each line on stdout of shellReducer.sh is written to the output file (sample mapper.py/reducer.py sketches follow this slide)

•    hadoop jar hadoop-streaming.jar \
         -input myInputDirs \
         -output myOutputDir \
         -mapper /bin/cat \
         -reducer /bin/wc

•    More Info: http://hadoop.apache.org/common/docs/r0.18.2/streaming.html

•    Will switch to the terminal now for the DEMO…
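A sketch of the word-count job as two Hadoop Streaming scripts in Python, following the tab-separated stdin/stdout protocol described above (mapper.py and reducer.py are hypothetical file names):

    # mapper.py -- read lines from stdin, emit "word<TAB>1" for each word
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- streaming input arrives sorted by key, so counts for
    # consecutive identical words can be summed with one running counter
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

To test locally before submitting (the pipe from the first bullet): cat someInputFile | python mapper.py | sort | python reducer.py > someOutputFile. On the cluster, pass -mapper mapper.py -reducer reducer.py (shipping the scripts with -file) to hadoop-streaming.jar.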
Brief intro to PIG
•    Ad hoc data analysis
•    An abstraction over MapReduce
•    Think of it as a stdlib for MapReduce
•    Supports common data processing operators (join, group by)
•    A high-level language for data processing

 PIG demo – switch to terminal

•    Also try HIVE, which exposes a SQL-like interface over data in HDFS

HIVE demo – switch to terminal