MapReduce and Hadoop
Cadenelli Nicola
Datenbanken Implementierungstechniken
Introduction
● History
● Motivations
MapReduce
● What MapReduce is
● Why it is useful
● Execution Details
● Some Examples
● Conclusions
Outline
Hadoop
● Introduction
● Hadoop Architecture
● Hadoop Ecosystem
● In the real world
MapReduce&Databases
● SQL-MapReduce
● In-Database Map-Reduce
● Conclusions
Introduction MapReduce Hadoop MR&Databases
● ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○ ○
[Diagram: Google's stack (GFS, MapReduce, BigTable) next to Hadoop's (HDFS, MapReduce)]
2004: Google publishes the papers.
2006: Apache releases Hadoop, the first open-source implementation of GFS and MapReduce.
Now: Amazon, AOL, eBay, Facebook, HP, IBM, Last.fm, LinkedIn, Microsoft, Spotify, Twitter and more are using Hadoop.
A Brief History
● Data sets are becoming really big: more than 10 TB.
E.g.: Large Synoptic Survey Telescope (30 TB / night)
● The best idea is to scale out (not scale up) the system, but . . .
- How do we scale to 1000+ machines?
- How do we handle machine failures?
- How can we facilitate communication between nodes?
- If we change the system, do we lose all our optimisation work?
● Google needed to recreate the index of the web.
Motivations
“MapReduce is a programming model and an
associated implementation for processing and
generating large data sets.” – Google, Inc.
MapReduce paper, 2004.
It is a really simple API with just two serial functions, map() and reduce(), and it is language-independent (Java, Python, Perl …).
What is MapReduce?
MapReduce hides messy details in the runtime
library:
● Parallelization and Distribution
● Load balancing
● Network and disk transfer optimization
● Handling of machine failures
● Fault tolerance
● Monitoring & status updates
All users benefit from improvements to the core library.
Why is MapReduce useful?
1. Read a lot of data
2. Map: extract something we care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
Seen from the outside, the outline is always the same (read, process, write); only map and reduce change to fit the problem.
Typical problem solved by MapReduce
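The five steps above can be sketched as a tiny in-memory driver. This is only an illustration under simplifying assumptions (a single process, everything in memory); the names run_mapreduce, max_temp_map and max_temp_reduce and the temperature data are hypothetical and are not part of Hadoop or of Google's library.

```python
from collections import defaultdict
from itertools import chain


def run_mapreduce(records, map_fn, reduce_fn):
    """Run the five steps on an in-memory list of (key, value) records."""
    # 1. Read: here the "input" is simply the list passed in.
    # 2. Map: each record yields zero or more intermediate (key, value) pairs.
    intermediate = chain.from_iterable(map_fn(k, v) for k, v in records)
    # 3. Shuffle and sort: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # 4. Reduce: visit keys in sorted order, one call per key.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    # 5. Write: here the results are simply returned.
    return output


# Hypothetical problem: maximum temperature per city.
def max_temp_map(_line_no, line):
    city, temp = line.split(",")
    yield city, int(temp)


def max_temp_reduce(city, temps):
    yield city, max(temps)


readings = [(0, "Vienna,21"), (1, "Rome,30"), (2, "Vienna,25"), (3, "Rome,28")]
result = run_mapreduce(readings, max_temp_map, max_temp_reduce)
print(result)  # [('Rome', 30), ('Vienna', 25)]
```

Swapping in a different pair of map and reduce functions changes the problem being solved while the driver stays the same.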
● A single master controls job execution on multiple slaves.
● Mappers are preferentially placed on the same node or the same rack as their input block → minimizes network usage!
● Mappers save their outputs to local disk before serving them to reducers.
● If a map or reduce task crashes: re-execute it!
● There can be more mappers and reducers than nodes.
Some Execution Details
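Two of the points above, locality-aware placement and re-execution after a crash, can be sketched as follows. This is a toy model, not Hadoop's scheduler: pick_node, run_with_retries, the node and rack names and the limit of four attempts are all assumptions made for the illustration.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Prefer a free node holding the block, then one on the same rack, then any."""
    local = [n for n in free_nodes if n in block_replicas]
    if local:
        return local[0], "node-local"
    replica_racks = {rack_of[n] for n in block_replicas}
    same_rack = [n for n in free_nodes if rack_of[n] in replica_racks]
    if same_rack:
        return same_rack[0], "rack-local"
    return free_nodes[0], "off-rack"


def run_with_retries(task, max_attempts=4):
    """Re-execute a task until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            continue  # a real master would reschedule, possibly on another node
    raise RuntimeError("task failed on every attempt")


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1"}, ["n1", "n3"], rack_of))  # ('n1', 'node-local')
print(pick_node({"n1"}, ["n2", "n3"], rack_of))  # ('n2', 'rack-local')
print(pick_node({"n1"}, ["n3", "n4"], rack_of))  # ('n3', 'off-rack')

calls = {"count": 0}


def flaky_map_task():
    # Simulated task that crashes on its first two attempts.
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated crash")
    return "map output"


outcome = run_with_retries(flaky_map_task)
print(outcome)  # ('map output', 3)
```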
[Figure: Execution overview. Source: Google, Inc. MapReduce paper, 2004.]
The programmer has to write two primary methods:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(k2,v2)
● All values v2 with the same key k2 are reduced together, in order.
● The input keys and values are drawn from a different domain than the output keys and values.
MapReduce Programming Model
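What "reduced together, in order" means with several reducers can be sketched like this. The MapReduce paper describes hash(key) mod R as its default partitioning function; the code below is only an approximation of that idea, and the function names and the use of zlib.crc32 (chosen so that runs are repeatable) are assumptions of this sketch.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Assign a key to a reducer; every pair with the same key lands together."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route (k2, v2) pairs to reducers, then group and sort the keys per reducer."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for k2, v2 in pairs:
        buckets[partition(k2, num_reducers)][k2].append(v2)
    return [sorted(bucket.items()) for bucket in buckets]


pairs = [("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)]
parts = shuffle(pairs, num_reducers=2)
for reducer_id, groups in enumerate(parts):
    # Each reducer sees its own keys in sorted order, each with all its values.
    print(reducer_id, groups)
```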
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Example: Word Frequency
“documentx”, “To be or not to be”
“be”, 2
“not”, 1
“or”, 1
“to”, 2
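A runnable Python counterpart of the pseudocode might look like the sketch below. It runs in a single process and is not Hadoop code; word_frequency is a hypothetical helper standing in for the framework, and lowercasing the words is an assumption made so that "To" and "to" are counted together, as in the output shown above.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    # key: document name, value: document contents
    for word in contents.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    # key: a word, values: a list of counts
    yield word, sum(counts)


def word_frequency(documents):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for name, contents in documents:
        for word, one in map_fn(name, contents):
            groups[word].append(one)
    result = {}
    for word in sorted(groups):
        for key, total in reduce_fn(word, groups[word]):
            result[key] = total
    return result


counts = word_frequency([("documentx", "To be or not to be")])
print(counts)  # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```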
Word frequency, step by step:
Input: “document1”, “To be or not to be”; “document2”, “text”; ...
Map: “to”, 1; “be”, 1; “or”, 1; “not”, 1; “to”, 1; “be”, 1; ...
Shuffle and Sort (aggregate values by key): “be”, 1; “be”, 1; ... “not”, 1; ... “or”, 1; ... “to”, 1; “to”, 1; ...
Reduce input: key = “be”, values = “1”, “1”; key = “not”, values = “1”; key = “or”, values = “1”; key = “to”, values = “1”, “1”; ...
Output: “be”, 2; “not”, 1; “or”, 1; “to”, 2; ...
● Inverted index
- Find which documents contain a specific word.
- Map: parse each document, emit <word, document-ID> pairs.
- Reduce: for each word, sort the corresponding document IDs.
Emit <word, list(document-ID)>
● Reverse web-link graph
- Find where page links come from.
- Map: output <target, source> for each link to a target URL found in a page named source.
- Reduce: concatenate the list of all source URLs associated with a target.
Emit <target, list(source)>
Other examples
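Both examples can be sketched in Python with the same small driver. The driver map_reduce, the sample documents and the page names are hypothetical, and emitting each word only once per document is an assumption of this sketch.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Minimal single-process stand-in for the framework."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    return dict(pair for k2 in sorted(groups) for pair in reduce_fn(k2, groups[k2]))


# Inverted index: <word, document-ID> pairs, then sorted IDs per word.
def index_map(doc_id, text):
    for word in set(text.lower().split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    yield word, sorted(doc_ids)


# Reverse web-link graph: <target, source> pairs, then all sources per target.
def links_map(source, targets):
    for target in targets:
        yield target, source


def links_reduce(target, sources):
    yield target, list(sources)


docs = [("d2", "to be"), ("d1", "not to")]
index = map_reduce(docs, index_map, index_reduce)
print(index)  # {'be': ['d2'], 'not': ['d1'], 'to': ['d1', 'd2']}

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
backlinks = map_reduce(pages, links_map, links_reduce)
print(backlinks)  # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```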
● Proven to be a useful abstraction
● Really simplifies large-scale computations
● Fun to use:
- Focus on problem
- Let the library deal with messy details
Conclusions on MapReduce
[Diagram: GFS and MapReduce at Google; HDFS and MapReduce in Hadoop]
● It is a framework for distributed processing
● It is open source (Apache License 2.0)
● It is a top-level Apache project
● Written in Java
● Batch-processing centric
● Runs on commodity hardware
What is Hadoop?
Hadoop Distributed File System
● For very large files: TBs, PBs.
● Each file is partitioned into chunks of 64 MB.
● Each chunk is replicated several times (>= 3), on different racks, for fault tolerance.
● It is an abstract file system layered on the native one: the disks themselves are formatted with ext3, ext4 or XFS.
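The chunking and replication figures translate into simple arithmetic. The sketch below uses the 64 MB chunk size and the factor of 3 quoted above; hdfs_footprint is a hypothetical helper, and the assumption that a partly filled last chunk only occupies the bytes it holds is labelled in the code.

```python
BLOCK_SIZE = 64 * 1024**2  # 64 MB, the chunk size quoted above
REPLICATION = 3            # the minimum replica count quoted above


def hdfs_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (number of chunks, total chunk replicas, raw bytes stored)."""
    # Ceiling division: a partly filled last chunk still counts as one chunk.
    blocks = (file_size_bytes + block_size - 1) // block_size
    # Assumption: the last chunk only occupies the bytes it actually holds.
    return blocks, blocks * replication, file_size_bytes * replication


one_tb = 1024**4
blocks, replicas, raw = hdfs_footprint(one_tb)
print(blocks, replicas, raw // 1024**4)  # 16384 49152 3
```

So a 1 TB file becomes 16384 chunks, 49152 stored replicas, and about 3 TB of raw disk across the cluster.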
Hadoop Architecture
● TaskTracker is the MapReduce server (processing part)
● DataNode is the HDFS server (data part)
[Diagram: one machine running both a TaskTracker and a DataNode]
Hadoop Architecture - Master/Slave
JobTracker:
● Accepts users' jobs
● Assigns tasks to workers
● Keeps track of each job's status
[Diagram: a JobTracker above three machines, each running a TaskTracker and a DataNode]
Hadoop Architecture - Master/Slave
NameNode:
● Keeps information on data location
● Decides where a file has to be written
[Diagram: a NameNode above three machines, each running a TaskTracker and a DataNode]
Data never flows through the NameNode!
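The division of labour between NameNode and DataNodes can be illustrated with a toy model: the NameNode object holds only metadata, and the file bytes are fetched directly from DataNode objects. ToyNameNode, ToyDataNode, read_file and the file path are hypothetical names for this sketch and do not correspond to the real HDFS client API.

```python
class ToyNameNode:
    """Holds only metadata: which DataNodes store which chunk of which file."""

    def __init__(self):
        self.block_locations = {}  # (path, chunk index) -> list of DataNode names

    def register(self, path, index, holders):
        self.block_locations[(path, index)] = list(holders)

    def locate(self, path):
        indexes = sorted(i for p, i in self.block_locations if p == path)
        return [(i, self.block_locations[(path, i)]) for i in indexes]


class ToyDataNode:
    """Holds the actual chunk contents."""

    def __init__(self):
        self.blocks = {}  # (path, chunk index) -> bytes


def read_file(namenode, datanodes, path):
    data = b""
    for index, holders in namenode.locate(path):
        # The bytes come straight from a DataNode, never via the NameNode.
        data += datanodes[holders[0]].blocks[(path, index)]
    return data


datanodes = {name: ToyDataNode() for name in ("dn1", "dn2", "dn3")}
namenode = ToyNameNode()
chunks = [(b"hello ", ["dn1", "dn2"]), (b"world", ["dn3", "dn1"])]
for index, (chunk, holders) in enumerate(chunks):
    for name in holders:
        datanodes[name].blocks[("/demo.txt", index)] = chunk
    namenode.register("/demo.txt", index, holders)

content = read_file(namenode, datanodes, "/demo.txt")
print(content)  # b'hello world'
```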
Hadoop Architecture – Scalable
● Multiple machines running Hadoop form a cluster.
● What if we need more storage or compute power?
[Diagram: further machines, each running a TaskTracker and a DataNode, added next to the first one]
Hadoop Architecture - Overview
[Diagram: a Client, the JobTracker, the NameNode and the Secondary NameNode, with a File and parts labelled A, B and C]
Hadoop Ecosystem – Pig & Hive
[Diagram: stack showing HDFS, MapReduce, Pig and Hive]
Hadoop Ecosystem – HBase
[Diagram: the same stack (HDFS, MapReduce, Pig, Hive) with HBase added]
@Google
● Index construction for Google Search
● Article clustering for Google News
● Statistical machine translation
@Yahoo! (4100 nodes)
● “Web map” powering Yahoo! Search
● Spam detection for Yahoo! Mail
@Facebook (>100 PB of storage)
● Data mining
● Ad optimization
● Spam detection
What is MapReduce/Hadoop used for?
MapReduce's use of input files and lack of schema support prevent the performance improvements enabled by features like B-trees and hash partitioning . . .
. . . and most company data is stored in databases!
but . . .
● SQL-MapReduce by Teradata Aster
● In-Database Map-Reduce by Oracle
● Connectors to allow external Hadoop
programs to access data from databases
and to store Hadoop output in databases
Solutions
It is a framework that lets developers write SQL-MapReduce functions in languages such as Java, C#, Python and C++ and push them into the database for advanced in-database analytics.
SQL-MapReduce
MR functions can be used like custom SQL operators and
can implement any algorithm or transformation.
SQL-MapReduce - Syntax
http://www.asterdata.com/resources/mapreduce.php
Demo #1: Map (Tokenization) and Reduce (WordCount) in SQL/MR
SELECT key AS word, value AS wordcount
FROM WordCountReduce (
    ON Tokenize ( ON blogs )
    PARTITION BY key
)
ORDER BY wordcount DESC
LIMIT 20;
Demo #2: Why do Reduce when we have SQL?
SELECT word, count(*) AS wordcount
FROM Tokenize ( ON blogs )
GROUP BY word
ORDER BY wordcount DESC
LIMIT 20;
Example: Word Frequency
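The idea behind Demo #2, a custom map step feeding an ordinary SQL GROUP BY, can be imitated with Python's standard sqlite3 module. This is only an analogy: SQLite has no SQL/MR, so the tokenize step runs in Python and writes into a helper table. The tokens table, the sample rows and the extra ordering by word (added to make ties deterministic) are assumptions of this sketch.

```python
import sqlite3


def tokenize(rows):
    """Map step: one output row per word in each blog body."""
    for (text,) in rows:
        for word in text.lower().split():
            yield (word,)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blogs (body TEXT)")
conn.executemany(
    "INSERT INTO blogs VALUES (?)",
    [("To be or not to be",), ("to do is to be",)],
)

# Materialise the map output, then let plain SQL do the "reduce".
conn.execute("CREATE TABLE tokens (word TEXT)")
rows = conn.execute("SELECT body FROM blogs").fetchall()
conn.executemany("INSERT INTO tokens VALUES (?)", tokenize(rows))

top = conn.execute(
    "SELECT word, count(*) AS wordcount FROM tokens "
    "GROUP BY word ORDER BY wordcount DESC, word LIMIT 20"
).fetchall()
print(top)
# [('to', 4), ('be', 3), ('do', 1), ('is', 1), ('not', 1), ('or', 1)]
```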
● Uses Table Functions to implement Map-Reduce within the database.
● Parallelization is provided by the Oracle Parallel Execution framework.
Combined with SQL, this gives database developers a simple mechanism for developing Map-Reduce functionality in languages they know.
In-Database Map-Reduce by Oracle
SELECT *
FROM table(oracle_map_reduce.reducer(
       cursor(
         SELECT value(map_result).word word
         FROM table(oracle_map_reduce.mapper(
                cursor(SELECT a FROM documents), ' '
              )) map_result
       )
     ));
Example: Word Frequency
However, these solutions are not source-compatible with Hadoop.
Native Hadoop programs need to be rewritten before they can be used in databases.
Still not perfect!
Questions?
