MapReduce and Hadoop

Cadenelli Nicola
Datenbanken Implementierungstechniken

Outline
Introduction
● History
● Motivations
MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions
Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world
MapReduce & Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

Diagram: GFS, MapReduce and BigTable (Google) next to HDFS and MapReduce (Hadoop)

2004: Google publishes the MapReduce paper (the GFS paper dates from 2003).
2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.
Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.
A Brief History

● Data sets are getting really big: more than 10 TB.
E.g.: the Large Synoptic Survey Telescope (30 TB / night)
● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to more than 1000 machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change the system, do we lose all our optimisation work?
● Google needed to recreate the index of the web.
Motivations

“MapReduce is a programming model and an
associated implementation for processing and
generating large data sets.” – Google, Inc.
MapReduce paper, 2004.
It is a really simple API with just two serial functions, map() and reduce(),
and it is language independent (Java, Python, Perl …).
What is MapReduce?

MapReduce hides messy details in the runtime
library:
● Parallelization and Distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates
All users benefit from improvements to the core library.
Why is MapReduce useful?

1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from outside, the outline is always the same (read, process, write);
only map and reduce change to fit the problem.
Typical problem solved by MapReduce
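The five steps above can be sketched in a few lines of Python. This is a minimal single-process illustration, not how a real MapReduce runtime is implemented; the helper run_mapreduce and the status-code example (map_status, reduce_status, the log lines) are hypothetical names made up for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for steps 1-5; a real runtime spreads this over many machines."""
    groups = defaultdict(list)
    # Steps 1-2: read every record and let map_fn emit intermediate (key, value) pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # step 3 (shuffle): group values by key
    results = []
    # Step 3 (sort) and step 4 (reduce): visit intermediate keys in sorted order.
    for out_key in sorted(groups):
        results.extend(reduce_fn(out_key, groups[out_key]))
    return results  # step 5: a real job would write these to the distributed file system


# Hypothetical problem: count HTTP status codes in a web-server log.
def map_status(_line_number, line):
    yield line.split()[-1], 1


def reduce_status(status, ones):
    yield status, sum(ones)


log = [(0, "GET /index.html 200"), (1, "GET /missing 404"), (2, "GET /about.html 200")]
status_counts = run_mapreduce(log, map_status, reduce_status)
print(status_counts)  # [('200', 2), ('404', 1)]
```

Only map_status and reduce_status are specific to the problem; run_mapreduce stays the same for every job.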

● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block, which minimizes network usage.
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes: re-execute it!
● There can be more mappers and reducers than nodes.
Some Execution Details
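The "re-execute" rule can be illustrated with a small Python sketch. It assumes that tasks are deterministic functions of their input, so running one again is harmless; run_with_reexecution and FlakyMapTask are hypothetical names for this illustration and do not correspond to any Hadoop API.

```python
def run_with_reexecution(task, max_attempts=4):
    """Run task(); if it crashes, simply run it again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up and report the failure to the caller


class FlakyMapTask:
    """Hypothetical map task that crashes on its first two attempts."""

    def __init__(self, split):
        self.split = split
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= 2:
            raise RuntimeError("simulated worker crash")
        # Deterministic output: the same input split always yields the same pairs.
        return [(word.lower(), 1) for word in self.split.split()]


output, attempts = run_with_reexecution(FlakyMapTask("To be"))
print(output, attempts)  # [('to', 1), ('be', 1)] 3
```

In a real cluster the master would typically reschedule the task on a different machine; the sketch retries in place only to stay short.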

Execution overview
Google, Inc. MapReduce paper, 2004.

The programmer has to write two primary methods:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)
● All v2 with the same k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.
MapReduce Programming Model

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Example: Word Frequency
Input: “documentx”, “To be or not to be”
Output:
“be”, 2
“not”, 1
“or”, 1
“to”, 2

Input: “document1”, “To be or not to be”; “document2”, “text”; ...
Map output: “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1; ...
Shuffle and Sort (aggregate values by key): “be”, 1; “be”, 1; ... “not”, 1; ... “or”, 1; ... “to”, 1; “to”, 1; ...
Reduce input:
key = “be”, values = “1”, “1”
key = “not”, values = “1”
key = “or”, values = “1”
key = “to”, values = “1”, “1”
...
Reduce output: “be”, 2; “not”, 1; “or”, 1; “to”, 2; ...
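The word-frequency data flow can be reproduced with a short runnable Python sketch. It mirrors the pseudocode from the MapReduce paper but runs in a single process; lowercasing the text is an assumption made here so that "To" and "to" are counted together, as in the example output, and the function names are hypothetical.

```python
from collections import defaultdict


def map_words(doc_name, contents):
    # key: document name, value: document contents
    # Assumption: lowercase the text so "To" and "to" are the same word.
    for word in contents.lower().split():
        yield word, "1"


def reduce_counts(word, counts):
    # key: a word, values: a list of counts (as strings, like in the pseudocode)
    return word, sum(int(c) for c in counts)


def word_frequency(documents):
    grouped = defaultdict(list)
    for name, contents in documents:
        for word, count in map_words(name, contents):
            grouped[word].append(count)  # shuffle: aggregate values by key
    # sort: reduce the keys in sorted order
    return [reduce_counts(word, grouped[word]) for word in sorted(grouped)]


result = word_frequency([("document1", "To be or not to be")])
print(result)  # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```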

● Inverted index
- Find which documents contain a specific word.
- Map: parse document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs.
Emit <word, list(document-ID)>
● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to target in a page source.
- Reduce: concatenate the list of all source URLs associated with a target.
Emit <target, list(source)>
Other examples
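Both examples fit the same map / shuffle / reduce pattern, as the following Python sketch suggests. It runs in one process, and the document IDs, page names and helper names (shuffle, inverted_index, reverse_links) are hypothetical; emitting each word only once per document is an assumption of this sketch.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group the values of (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def inverted_index(documents):
    """documents: dict of document ID -> text."""
    # Map: emit <word, document-ID> once per distinct word in a document.
    pairs = ((word, doc_id)
             for doc_id, text in documents.items()
             for word in set(text.lower().split()))
    # Reduce: sort the document IDs of each word.
    return {word: sorted(ids) for word, ids in shuffle(pairs).items()}


def reverse_links(pages):
    """pages: dict of source page -> list of link targets found in it."""
    # Map: emit <target, source> for each link.
    pairs = ((target, source)
             for source, targets in pages.items()
             for target in targets)
    # Reduce: concatenate all sources of each target.
    return dict(shuffle(pairs))


index = inverted_index({"d1": "to be or not", "d2": "to do"})
links = reverse_links({"a.html": ["b.html", "c.html"], "c.html": ["b.html"]})
print(index["to"], links["b.html"])  # ['d1', 'd2'] ['a.html', 'c.html']
```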

● Proven to be a useful abstraction
● Really simplifies large-scale computations
● Fun to use:
- Focus on problem
- Let the library deal with messy details
Conclusions on MapReduce

Diagram: GFS and MapReduce (Google) next to HDFS and MapReduce (Hadoop)

● It is a framework for distributed processing
● It is Open Source (Apache License 2.0)
● It is a top-level Apache Project
● Written in Java
● Batch processing centric
● Runs on commodity hardware
What is Hadoop?
Hadoop Distributed File System
● For very large files: TBs, PBs.
● Each file is partitioned into chunks (blocks) of 64 MB by default.
● Each chunk is replicated several times (3 by default) across racks, for fault tolerance.
● It is an abstract file system on top of the local one: disks are formatted with ext3, ext4 or XFS.
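A quick calculation shows what these numbers mean for a single file. The sketch below assumes 64 MB chunks, a replication factor of 3, and that a partially filled last chunk only takes the space it needs; hdfs_footprint is a hypothetical helper, not part of any Hadoop API.

```python
CHUNK_SIZE = 64 * 1024**2  # 64 MB chunks, as on the slide
REPLICATION = 3            # assumed replication factor


def hdfs_footprint(file_size, chunk_size=CHUNK_SIZE, replication=REPLICATION):
    """Return (chunks, stored replicas, raw bytes used) for a file of file_size bytes."""
    chunks = -(-file_size // chunk_size)  # ceiling division
    replicas = chunks * replication
    raw_bytes = file_size * replication
    return chunks, replicas, raw_bytes


one_tb = 1024**4
chunks, replicas, raw_bytes = hdfs_footprint(one_tb)
print(chunks, replicas, raw_bytes // 1024**4)  # 16384 49152 3
```

A 1 TB file therefore becomes 16384 chunks, stored as 49152 replicas that occupy 3 TB of raw disk space across the cluster.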

Hadoop Architecture
● TaskTracker is the MapReduce server (processing part)
● DataNode is the HDFS server (data part)
Diagram: one machine running a TaskTracker and a DataNode

Hadoop Architecture - Master/Slave
JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status
Diagram: one JobTracker coordinating three machines, each running a TaskTracker and a DataNode
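How a master might assign map tasks while preferring nodes that already store the input block can be sketched as follows. This is a simplified greedy illustration, not the actual JobTracker scheduler: it ignores rack awareness, and the function assign_map_tasks, the block names and the node names are all hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block.

    block_locations: dict of block -> list of nodes holding a replica
    free_slots: dict of node -> number of free task slots
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    for block, holders in block_locations.items():
        local = [node for node in holders if slots.get(node, 0) > 0]
        # Fall back to any node with a free slot if no data-local node is available.
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # cluster is full; remaining tasks have to wait
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = node
    return assignment


blocks = {"A": ["node1", "node2"], "B": ["node1"], "C": ["node3"]}
plan = assign_map_tasks(blocks, {"node1": 1, "node2": 1, "node3": 1})
print(plan)  # {'A': 'node1', 'B': 'node2', 'C': 'node3'}
```

In this run blocks A and C are processed where they are stored, while B has to be read over the network because the only slot on node1 is already taken.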

Hadoop Architecture - Master/Slave
NameNode:
● Keeps information on data location
● Decides where a file has to be written
Data never flows through the NameNode!
Diagram: one NameNode coordinating three machines, each running a TaskTracker and a DataNode

Hadoop Architecture – Scalable
● Multiple machines running Hadoop form a cluster.
● What if we need more storage or compute power? Just add machines.
Diagram: three machines, each running a TaskTracker and a DataNode

Hadoop Architecture - Overview
Diagram: a client with a file split into blocks A, B and C, together with the JobTracker, the NameNode and the Secondary NameNode

Hadoop Ecosystem – Pig & Hive
Diagram: Pig and Hive on top of MapReduce, which runs on top of HDFS

Hadoop Ecosystem – HBase
Diagram: HBase added to the stack of Pig, Hive, MapReduce and HDFS

@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation
@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection
What is MapReduce/Hadoop used for?

MapReduce's use of input files and lack of schema support prevent the
performance improvements enabled by features like B-trees and hash
partitioning . . .
. . . and most company data is stored in databases!
but . . .

● SQL-MapReduce by Teradata Aster
● In-Database Map-Reduce by Oracle
● Connectors to allow external Hadoop
programs to access data from databases
and to store Hadoop output in databases
Solutions

It is a framework that lets developers write SQL-MapReduce functions in
languages such as Java, C#, Python and C++ and push them into the
database for advanced in-database analytics.
SQL-MapReduce

MR functions can be used like custom SQL operators and
can implement any algorithm or transformation.
SQL-MapReduce - Syntax
http://www.asterdata.com/resources/mapreduce.php

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
ON Tokenize ( ON blogs )
PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;
Demo #2: Why do Reduce when we have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
LIMIT 20;
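The point of Demo #2, that a plain GROUP BY can take over the role of the reduce step, can be tried out with Python's built-in sqlite3 module. SQLite is used here only as a stand-in for the Aster database: it has no Tokenize function, so the tokenization ("map") is done in Python, and the tokens table and the sample blog texts are hypothetical. An ORDER BY is added so that the result is deterministic.

```python
import sqlite3

blogs = ["to be or not to be", "to do"]  # hypothetical blog texts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tokens (word TEXT)")
# "Map": tokenization plays the role of Tokenize ( ON blogs ).
conn.executemany(
    "INSERT INTO tokens VALUES (?)",
    [(word,) for text in blogs for word in text.split()],
)
# "Reduce": ordinary SQL aggregation instead of a custom reduce function.
rows = conn.execute(
    "SELECT word, count(*) AS wordcount "
    "FROM tokens GROUP BY word "
    "ORDER BY wordcount DESC, word LIMIT 20"
).fetchall()
conn.close()
print(rows)  # [('to', 3), ('be', 2), ('do', 1), ('not', 1), ('or', 1)]
```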

● Uses Table Functions to implement Map-Reduce within the database.
● Parallelization is provided by the Oracle Parallel Execution framework.
Using this in combination with SQL, Oracle provides a simple mechanism for
database developers to develop Map-Reduce functionality using languages they know.
In-Database Map-Reduce by Oracle

SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(
        SELECT value(map_result).word word
        FROM table(oracle_map_reduce.mapper(
            cursor(SELECT a FROM documents), ' '
        )) map_result
    )
));

However, these solutions are not source-compatible with Hadoop.
Native Hadoop programs need to be rewritten before they can be used in
databases.
Still not perfect!

Questions?