A Distributed Programming Framework
A Very Short Introduction
“Big data” is data that becomes large enough that it cannot be processed by conventional means.
~ O’Reilly Radar
Apache Hadoop is not a database.
Apache Hadoop is not a single program, tool, or application, but a set of projects with a common goal, integrated under one umbrella term: Hadoop (Core).
Anatomy of a Hadoop Cluster
Distributed Computing (MapReduce)
Distributed storage (HDFS)
The MapReduce master is responsible for organizing where computational work should be scheduled on the slave nodes.
The HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located.
Let the data remain where it is and move the executable code to its hosting machine.
Stated simply, the mapper is meant to filter and transform the input into something that the reducer can aggregate over.
MapReduce uses lists and (key/value) pairs as its main data primitives.
Example: shapes are keys; their colors are values.
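The shapes/colors idea can be sketched as a tiny in-memory MapReduce in plain Python: map emits (key, value) pairs, a sort stands in for the shuffle, and the reduce collects all values sharing a key. The records here are made-up illustration data, not Hadoop API calls.

```python
from itertools import groupby
from operator import itemgetter

# Toy input records: shape observations, each with a color.
records = [("circle", "red"), ("square", "blue"),
           ("circle", "green"), ("square", "blue")]

# Map phase: emit (key, value) pairs -- the shape is the key,
# its color is the value (an identity map for this toy input).
mapped = [(shape, color) for shape, color in records]

# Shuffle phase: sort by key so equal keys become adjacent.
mapped.sort(key=itemgetter(0))

# Reduce phase: gather every value that shares a key.
reduced = {key: [color for _, color in group]
           for key, group in groupby(mapped, key=itemgetter(0))}

print(reduced)  # {'circle': ['red', 'green'], 'square': ['blue', 'blue']}
```

On a real cluster the three phases run on different machines; the structure of the computation stays exactly this.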
Move data from RDBMS into Hadoop using Sqoop
Move log files using Flume, Chukwa, or Scribe
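A typical Sqoop import looks like the sketch below; the hostname, database, table, and user are placeholders, and the target directory is an assumption — adjust everything for your cluster.

```shell
# Pull one table from an RDBMS (MySQL here) into HDFS.
# All connection details below are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --username etl_user -P \
  --target-dir /data/raw/orders
```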
Writing Map/Reduce Jobs
We can use multiple languages to write Map/Reduce jobs:
Python with Hadoop Streaming
Pros: fast development
Cons: slower than Java, no access to the Hadoop API
Java
Pros: fast, access to the Hadoop API
Cons: verbose language
Pig
Pros: very small scripts, faster than streaming
Cons: yet another language to learn
Hive
Pros: SQL-like syntax (easy for non-programmers) and a relational data model
Cons: slower than Pig, more moving parts
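The Hadoop Streaming option above amounts to two small scripts that talk over stdin/stdout; a minimal word-count sketch, written as testable functions rather than two separate files:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map step: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce step: pairs arrive sorted by key; sum the counts per word."""
    split = (pair.split("\t") for pair in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Hadoop sorts mapper output between the phases; locally you can
    # simulate the whole pipeline with: cat input | mapper | sort | reducer
    for out in reducer(sorted(mapper(sys.stdin))):
        print(out)
```

In a real job the two functions would live in separate executables passed to Hadoop Streaming as the `-mapper` and `-reducer` programs; the framework supplies the sort in between.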
Where can we use Hadoop?
Granular reports over large data sets spanning 5-7 years
Root cause analysis
Better capacity planning (servers, people, bandwidth)
Recommendations (better than those from external parties, thanks to the volume of data available)