MapReduce and Hadoop

Cadenelli Nicola
Database Implementation Techniques (Datenbanken Implementierungstechniken)

Outline
Introduction
● History
● Motivations
MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions
Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world
MapReduce & Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions

Figure: Google's GFS, MapReduce and BigTable; Hadoop's HDFS and MapReduce.

A Brief History
● 2003-2004: Google publishes the GFS and MapReduce papers.
● 2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.
● Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.

Motivations
● Data sets are getting really big: more than 10 TB.
E.g.: Large Synoptic Survey Telescope (30 TB / night)
● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to more than 1000 machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change system, do we lose all our optimisation work?
● Google needed to recreate the index of the web.

What is MapReduce?
“MapReduce is a programming model and an associated implementation for processing and generating large data sets.” – Google, Inc., MapReduce paper, 2004.
It is a really simple API with just two serial functions, map() and reduce(), and it is language independent (Java, Python, Perl …).

Why is MapReduce useful?
MapReduce hides messy details in the runtime library:
● Parallelization and distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates
All users benefit from improvements to the core library.

Typical problem solved by MapReduce
1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from the outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.
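The five steps above can be sketched as a tiny single-machine driver. This is only an illustration of the outline, not how a real cluster runs it: the name map_reduce, the in-memory lists and the city/temperature readings are all assumptions made up for this example.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run the five phases in memory on one machine (illustrative only)."""
    # 1. Read: iterate over the input (key, value) records.
    # 2. Map: each record yields zero or more intermediate (key, value) pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # 3. Shuffle and sort: group the intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 4. Reduce: one call per key, keys visited in sorted order.
    results = []
    for key in sorted(groups):
        results.extend(reduce_fn(key, groups[key]))

    # 5. Write: a real job writes to storage; here we just return a list.
    return results


# Hypothetical input: (line number, "city,temperature") records.
readings = [(0, "Rome,31"), (1, "Oslo,18"), (2, "Rome,35"), (3, "Oslo,22")]


def max_temp_map(_line_no, line):
    city, temp = line.split(",")
    yield city, int(temp)


def max_temp_reduce(city, temps):
    yield city, max(temps)


print(map_reduce(readings, max_temp_map, max_temp_reduce))
# [('Oslo', 22), ('Rome', 35)]
```

Only max_temp_map and max_temp_reduce are specific to the problem; the driver stays the same for any other job.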

Some Execution Details
● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block → minimizes network usage!
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes: re-execute it!
● It is possible to have more mappers and reducers than nodes.
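Re-execution works as a recovery strategy as long as a task gives the same result every time it runs on the same input. A minimal sketch of the idea, assuming deterministic tasks without side effects; run_task, max_attempts and the simulated crash are hypothetical names for this illustration, not part of any Hadoop API.

```python
def run_task(task, task_input, max_attempts=3):
    """Re-run a deterministic task until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(task_input)
        except Exception as error:
            print(f"attempt {attempt} failed: {error}")
    raise RuntimeError(f"task failed {max_attempts} times")


# Hypothetical mapper that crashes on its first two executions.
calls = {"count": 0}


def flaky_map(line):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("simulated worker crash")
    return [(word, 1) for word in line.split()]


result = run_task(flaky_map, "to be")
print(result)
```

The third attempt succeeds and returns the same pairs a crash-free run would have produced, so the rest of the job never notices the failures.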

Execution overview
Figure: execution overview diagram from the Google, Inc. MapReduce paper, 2004.

MapReduce Programming Model
The programmer has to write two primary methods:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)
● All v2 with the same k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.

Example: Word Frequency
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

Input: “documentx”, “To be or not to be”
Output:
“be”, 2
“not”, 1
“or”, 1
“to”, 2
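A runnable Python version of the pseudocode above, as a sketch. Two things here are assumptions: words are lowercased so that “To” and “to” are counted together (which is what the sample output suggests), and the helper word_frequency stands in for the shuffle step that the MapReduce runtime would normally perform.

```python
from collections import defaultdict


def word_count_map(doc_name, contents):
    # key: document name (unused), value: document contents
    for word in contents.lower().split():
        yield word, 1


def word_count_reduce(word, counts):
    # key: a word, values: all the counts emitted for that word
    return word, sum(counts)


def word_frequency(documents):
    # Stand-in for shuffle and sort: collect the counts of each word.
    groups = defaultdict(list)
    for doc_name, contents in documents:
        for word, count in word_count_map(doc_name, contents):
            groups[word].append(count)
    return [word_count_reduce(word, groups[word]) for word in sorted(groups)]


print(word_frequency([("documentx", "To be or not to be")]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```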

Map:
“document1”, “To be or not to be” → “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1
“document2”, “text” → ...
Shuffle and Sort: aggregate values by key
... “be”, 1; “be”, 1 ... “not”, 1 ... “or”, 1 ... “to”, 1; “to”, 1 ...
Reduce:
key = “be”, values = “1”,“1” → “be”, 2
key = “not”, values = “1” → “not”, 1
key = “or”, values = “1” → “or”, 1
key = “to”, values = “1”,“1” → “to”, 2
...

Other examples
● Inverted index
- Find which documents contain a specific word.
- Map: parse each document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs. Emit <word, list(document-ID)>
● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to a target URL found in a page named source.
- Reduce: concatenate the list of all source URLs associated with a target. Emit <target, list(source)>
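Both examples fit the same map/reduce shape, as the sketch below tries to show. The helper run, the document IDs and the page names are hypothetical and exist only for this illustration; run is an in-memory stand-in for the framework.

```python
from collections import defaultdict


def run(records, map_fn, reduce_fn):
    """In-memory stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return dict(reduce_fn(k, groups[k]) for k in sorted(groups))


# Inverted index: which documents contain a given word?
def index_map(doc_id, text):
    for word in set(text.lower().split()):  # emit each word once per document
        yield word, doc_id


def index_reduce(word, doc_ids):
    return word, sorted(doc_ids)


# Reverse web-link graph: which pages link to a given target?
def links_map(source, targets):
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    return target, list(sources)


docs = [("d1", "to be or not to be"), ("d2", "to do")]
pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]

print(run(docs, index_map, index_reduce))
print(run(pages, links_map, links_reduce))
```

In both cases the map function simply swaps the roles of key and value, and the grouping step does the rest.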

Conclusions on MapReduce
● Proven to be a useful abstraction
● Really simplifies large-scale computations
● Fun to use:
- Focus on the problem
- Let the library deal with the messy details

Figure: Google's GFS and MapReduce; Hadoop's HDFS and MapReduce.

What is Hadoop?
● A framework for distributed processing
● Open source (Apache License 2.0)
● A top-level Apache project
● Written in Java
● Batch-processing centric
● Runs on commodity hardware

Hadoop Distributed File System
● For very large files: TBs, PBs.
● Each file is partitioned into chunks of 64 MB.
● Each chunk is replicated several times (3 by default), on different racks, for fault tolerance.
● It is an abstract FS: the underlying disks are formatted with ext3, ext4 or XFS.
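A quick back-of-the-envelope calculation makes the chunking and replication figures concrete. This sketch assumes 64 MB chunks and three replicas per chunk; the function name hdfs_footprint is made up for the example.

```python
BLOCK_SIZE = 64 * 1024**2  # 64 MB chunks, as stated above


def hdfs_footprint(file_size_bytes, replication=3):
    """Return (chunks, stored chunk copies, raw bytes used) for one file."""
    # Round up: the last chunk may be only partially filled.
    blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    # A partial last chunk only occupies its real size on disk.
    raw_bytes = file_size_bytes * replication
    return blocks, blocks * replication, raw_bytes


one_tb = 1024**4
blocks, copies, raw = hdfs_footprint(one_tb)
print(blocks, copies, raw // 1024**4)
# 16384 49152 3
```

A 1 TB file therefore becomes 16384 chunks, 49152 stored chunk copies and 3 TB of raw disk space across the cluster.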

Hadoop Architecture
● TaskTracker is the MapReduce server (processing part)
● DataNode is the HDFS server (data part)
Figure: a machine running a TaskTracker and a DataNode.

Hadoop Architecture - Master/Slave
JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of the jobs' status
Figure: a JobTracker above three machines, each running a TaskTracker and a DataNode.

Hadoop Architecture - Master/Slave
NameNode:
● Keeps information on data location
● Decides where a file has to be written
Data never flows through the NameNode!
Figure: a NameNode above three machines, each running a TaskTracker and a DataNode.
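The split between metadata and data can be sketched with two toy classes. These are not Hadoop's real classes or APIs: the names, the block IDs and the file path are hypothetical and only illustrate that the client asks the NameNode where blocks live and then reads the bytes directly from a DataNode.

```python
class DataNode:
    """Toy DataNode: stores the actual bytes of each block."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Toy NameNode: metadata only, it never holds file contents."""

    def __init__(self):
        self.block_map = {}  # file name -> list of (block id, [DataNode, ...])

    def locate(self, filename):
        return self.block_map[filename]


def read_file(namenode, filename):
    data = b""
    for block_id, replicas in namenode.locate(filename):  # metadata request
        data += replicas[0].read(block_id)  # bytes come straight from a DataNode
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"To be or "
dn2.blocks["blk_1"] = b"To be or "  # second replica of the same block
dn2.blocks["blk_2"] = b"not to be"

nn = NameNode()
nn.block_map["/docs/hamlet.txt"] = [("blk_1", [dn1, dn2]), ("blk_2", [dn2])]

print(read_file(nn, "/docs/hamlet.txt"))
# b'To be or not to be'
```

Because the NameNode answers only small metadata requests, it does not become a bandwidth bottleneck when many clients read large files.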

Hadoop Architecture – Scalable
● Multiple machines running Hadoop form a cluster.
● What if we need more storage or compute power?
Figure: three machines side by side, each running a TaskTracker and a DataNode.

Hadoop Architecture - Overview
Figure: a Client, the JobTracker, the NameNode and the Secondary NameNode, with a File and the labels A, B and C.

Hadoop Ecosystem – Pig & Hive
Figure: stack with HDFS at the bottom, MapReduce above it, and Pig and Hive on top.

Hadoop Ecosystem – HBase
Figure: the same stack (HDFS, MapReduce, Pig and Hive) with HBase added.

What is MapReduce/Hadoop used for?
@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation
@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection

but . . .
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by features like B-trees and hash partitioning . . .
. . . and most of the data in companies is stored in databases!

Solutions
● SQL-MapReduce by Teradata Aster
● In-Database Map-Reduce by Oracle
● Connectors that allow external Hadoop programs to access data from databases and to store Hadoop output in databases

SQL-MapReduce
A framework that allows developers to write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.

SQL-MapReduce - Syntax
MR functions can be used like custom SQL operators and can implement any algorithm or transformation.
http://www.asterdata.com/resources/mapreduce.php

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;

Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
LIMIT 20;
Demo #2: Why do Reduce when we have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
LIMIT 20;

In-Database Map-Reduce by Oracle
● Uses table functions to implement Map-Reduce within the database.
● Parallelization is provided by the Oracle Parallel Execution framework.
Using this in combination with SQL, Oracle provides a simple mechanism for database developers to develop Map-Reduce functionality using languages they know.

SELECT *
FROM table(oracle_map_reduce.reducer(
    cursor(
        SELECT value(map_result).word word
        FROM table(oracle_map_reduce.mapper(
            cursor(SELECT a FROM documents), ' '
        )) map_result
    )
));

Still not perfect!
However, these solutions are not source compatible with Hadoop.
Native Hadoop programs need to be rewritten before becoming usable in databases.

Questions?
