MapReduce and the art of “Thinking Parallel”
Shailesh Kumar
Third Leap, Inc.
Three I’s of a great product!
Interface: Intuitive | Functional | Elegant
Infrastructure: Storage | Computation | Network
Intelligence: Learn | Predict | Adapt | Evolve
Drowning in Data, Starving for Knowledge
ATATTAGGTTTTTACCTACCC
AGGAAAAGCCAACCAACCTC
GATCTCTTGTAGATCTGTTCT
CTAAACGAACTTTAAAATCTG
TGTAGCTGTCGCTCGGCTG
CATGCCTAGTGCACCTACGC
AGTATAAACAATAATAAATTTT
ACTGTCGTTGACAAGAAACG
AGTAACTCGTCCCTCTTCTG
CAGACTGCTTATTACGCGAC
CGTAAGCTAC…
How BIG is Big Data?
▪ 600 million tweets per DAY
▪ 100 hours per MINUTE
▪ 800+ websites per MINUTE
▪ 100 TB of data uploaded DAILY
▪ 3.5 Billion queries PER DAY
▪ 300 Million active customers
How BIG is Big Data?
▪ Better Sensors
▪ Higher resolution, Real-time, Diverse measurements, …
▪ Faster Communication
▪ Network infrastructure, Compression Technologies, …
▪ Cheaper Storage
▪ Cloud based storage, large warehouses, NoSQL databases
▪ Massive Computation
▪ Cloud computing, MapReduce/Hadoop parallel processing paradigms
▪ Intelligent Decisions
▪ Advances in Machine Learning and Artificial Intelligence
How did we get here?
The Evolution of “Computing”
Parallel Computing Basics
▪ Data Parallelism (distributed computing)
▪ Lots of data → Break it into “chunks”,
▪ Process each “chunk” of data in parallel,
▪ Combine results from each “chunk”
▪ MAPREDUCE = Data Parallelism
▪ Process Parallelism (data flow computing)
▪ Lots of stages → Set up process graph
▪ Pass data through all stages
▪ All stages running in parallel on different data
▪ Assembly line = process parallelism
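The data-parallel recipe above (chunk, process each chunk, combine) can be sketched in a few lines. This is only an illustration: the names `chunked`, `process_chunk` and `data_parallel_sum_of_squares` are hypothetical, the per-chunk work (a sum of squares) is an arbitrary stand-in, and a thread pool plays the role that separate machines would play in a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Break a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    """Work done independently on one chunk (illustrative: sum of squares)."""
    return sum(x * x for x in chunk)


def data_parallel_sum_of_squares(items, chunk_size=3, workers=4):
    chunks = chunked(items, chunk_size)                   # break into chunks
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))  # process each chunk
    return sum(partials)                                  # combine the results


total = data_parallel_sum_of_squares(list(range(10)))
print(total)
```

The combine step works here because addition does not care how the data was split; that property is what makes a task data-parallelizable.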
Agenda
MAPREDUCE Background
Problem 1 – Similarity between all pairs of documents!
Problem 2 – Parallelizing K-Means clustering
Problem 3 – Finding all Maximal Cliques in a Graph
MAPREDUCE 101: A 4-stage Process
[Figure: Lots of data → Shard 1 … Shard N → Map 1 … Map K → Combine 1 … Combine K → Shuffle 1 … Shuffle K → Reduce 1 … Reduce R → Output 1 … Output R. Each Map processes N/K shards.]
MAPREDUCE 101: An example Task
▪ Count total frequency of all words on the web
▪ Total number of documents > 20Billion
▪ Total number of unique words > 20Million
▪ Non-Parallel / Linear Implementation
for each document d on the Web
    for each unique word w in d
        DocCount(w, d) = # times w occurred in d
        WebCount(w) += DocCount(w, d)
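The linear implementation above translates almost line for line into Python. This is a minimal sketch on three tiny stand-in "documents"; the function name `web_count` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def web_count(documents):
    """Sequential baseline: visit every document, then every unique word in it."""
    totals = Counter()
    for doc in documents:
        doc_count = Counter(doc.split())   # DocCount(w, d) for every w in d
        for word, n in doc_count.items():
            totals[word] += n              # WebCount(w) += DocCount(w, d)
    return totals


docs = ["a b a", "b c", "a c c"]
counts = web_count(docs)
print(counts)
```

Every document is visited one after another, so the running time grows with the total number of documents; this single loop is what MapReduce spreads across machines.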
MAPREDUCE – MAP/COMBINE
Shard   | Map output (Key Value)    | Combine output (Key Value)
Shard 1 | A 10, B 7, C 9, D 3, B 4  | A 10, B 11, C 9, D 3
Shard 2 | A 3, D 1, C 4, D 9, B 6   | A 3, B 6, C 4, D 10
Shard 3 | B 3, D 5, C 4, A 6, A 3   | A 9, B 3, C 4, D 5
MAPREDUCE – Shuffle/Reduce
Input to Shuffle (Key Value):
Shuffle 1: A 10, B 11, C 9, D 3
Shuffle 2: A 3, B 6, C 4, D 10
Shuffle 3: A 9, B 3, C 4, D 5
After Shuffle, values grouped by key (Key Value):
Reduce 1 input: A 10, A 3, A 9, C 9, C 4, C 4
Reduce 2 input: B 11, B 6, B 3, D 3, D 10, D 5
Reduce output (Key Value):
Reduce 1: A 22, C 17
Reduce 2: B 20, D 18
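The Combine, Shuffle and Reduce stages can be traced end to end on the slide's own numbers. The sketch below is a single-process simulation, not a Hadoop job; the function names and the `partition` rule (keys A and C go to reducer 1, B and D to reducer 2, as in the tables) are illustrative assumptions.

```python
from collections import defaultdict

# Map outputs per shard, copied from the tables above.
SHARDS = [
    [("A", 10), ("B", 7), ("C", 9), ("D", 3), ("B", 4)],
    [("A", 3), ("D", 1), ("C", 4), ("D", 9), ("B", 6)],
    [("B", 3), ("D", 5), ("C", 4), ("A", 6), ("A", 3)],
]


def combine(pairs):
    """COMBINE: pre-aggregate values per key inside one shard."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return dict(local)


def shuffle(combined_tables, partition):
    """SHUFFLE: send all values of a key to the reducer chosen by `partition`."""
    buckets = defaultdict(lambda: defaultdict(list))
    for table in combined_tables:
        for key, value in table.items():
            buckets[partition(key)][key].append(value)
    return buckets


def reduce_bucket(bucket):
    """REDUCE: sum the values collected for each key."""
    return {key: sum(values) for key, values in bucket.items()}


def partition(key):
    return 1 if key in ("A", "C") else 2


combined = [combine(shard) for shard in SHARDS]
shuffled = shuffle(combined, partition)
result = {reducer: reduce_bucket(bucket) for reducer, bucket in shuffled.items()}
print(result)
```

The combiner is optional for correctness here: it only shrinks the amount of data the shuffle has to move.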
Key Questions in MAPREDUCE
▪ Is the task really “data-parallelizable”?
▪ High dependence tasks (e.g. Fibonacci series)
▪ Recursive tasks (e.g. Binary Search)
▪ What is the key-value pair output for MAP step?
▪ Each map processes only one data record at a time
▪ It can generate none, one, or multiple key-value pairs
▪ How to combine values of a key in REDUCE step?
▪ The key for reduce is same as key for Map output
▪ The reduce function must be “order agnostic”
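“Order agnostic” can be checked by brute force on a small input: a reduce function qualifies only if every arrival order of the values gives the same answer. The helper `is_order_agnostic` below is a hypothetical test harness, not part of any MapReduce framework, and it only probes arrival order on one sample input.

```python
from functools import reduce
from itertools import permutations


def is_order_agnostic(fn, values):
    """True if folding `values` with `fn` gives one result for every ordering."""
    results = {reduce(fn, ordering) for ordering in permutations(values)}
    return len(results) == 1


values = [11, 6, 3]   # the values of key B from the word-count example
add_ok = is_order_agnostic(lambda a, b: a + b, values)
sub_ok = is_order_agnostic(lambda a, b: a - b, values)
print(add_ok, sub_ok)
```

Sum, max, min and set intersection pass this check; subtraction does not, so it cannot be used as a reduce function when the shuffle delivers values in arbitrary order.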
Other considerations
▪ Reliability/Robustness
▪ A processor or disk might go bad during the process
▪ Optimization/Efficiency
▪ Allocate CPUs near data shards to reduce network overhead
▪ Scale/Parallelism
▪ Degree of parallelization grows linearly with the number of machines
▪ Simplicity/Usability
▪ Just specify the Map task and the Reduce task and be done!
▪ Generality
▪ Lots of parallelizable tasks can be written in MapReduce
▪ With some creativity, many more than you can imagine!
Similarity between all pairs of docs.
▪ Why bother?
▪ Document Clustering, Similar document search, etc.
▪ Document represented as a “Bag-of-Tokens”
▪ A weight associated with each token in the vocabulary.
▪ Most weights are zero – Sparsity
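▪ Cosine Similarity between two documents (a plain dot product when the weight vectors are normalized to unit length)
$d_i = \{ w_1^i, w_2^i, \ldots, w_T^i \}, \quad d_j = \{ w_1^j, w_2^j, \ldots, w_T^j \}$
$Sim(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$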
Non-Parallel / Linear Implementation
For each document $d_i$
    For each document $d_j$ ($j > i$)
        $Sim(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$
Complexity = $O(D^2 T \sigma)$
$\sigma$ = Sparsity factor = $10^{-5}$ = Average fraction of vocabulary per document
$D = O(10\text{B}), \quad T = O(10\text{M})$
Complexity = $O(10^{20+7-5}) = O(10^{22})$
Toy Example for doc-doc similarity
A classic “Join”
Documents = {W, X, Y, Z}, Words = {a, b, c, d, e}
Input
W → {(a,1), (b,2), (e,5)}
X → {(a,3), (c,4), (d,5)}
Y → {(b,6), (c,7), (d,8)}
Z → {(a,9), (e,10)}
Output
(W,X) → Sim(W,X) = 3
(W,Y) → Sim(W,Y) = 12
(W,Z) → Sim(W,Z) = 59
(X,Y) → Sim(X,Y) = 68
(X,Z) → Sim(X,Z) = 27
(Y,Z) → Sim(Y,Z) = 0
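As a baseline, the non-parallel double loop can be run directly on the toy documents. The sketch stores each document as a sparse dictionary of token weights; the names `DOCS`, `sim` and `all_pairs` are chosen for illustration only.

```python
from itertools import combinations

# Toy documents from the example: token -> weight (zero weights omitted).
DOCS = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}


def sim(di, dj):
    """Sum of weight products over the tokens the two documents share."""
    return sum(w * dj[t] for t, w in di.items() if t in dj)


def all_pairs(docs):
    """Visit every pair (i, j) with j > i, as in the linear implementation."""
    return {(i, j): sim(docs[i], docs[j]) for i, j in combinations(sorted(docs), 2)}


pairs = all_pairs(DOCS)
print(pairs)
```

Note that the pair (Y, Z) is still visited even though the two documents share no token; with billions of sparse documents, almost all of the work goes into such empty pairs.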
Reverse Indexing to the rescue
First convert the data to reverse index
Documents (document → tokens)
W → {(a,1), (b,2), (e,5)}
X → {(a,3), (c,4), (d,5)}
Y → {(b,6), (c,7), (d,8)}
Z → {(a,9), (e,10)}
Reverse index (token → documents)
a → {(W,1), (X,3), (Z,9)}
b → {(W,2), (Y,6)}
c → {(X,4), (Y,7)}
d → {(X,5), (Y,8)}
e → {(W,5), (Z,10)}
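Key/Value for the MAP-Step
MAP (one reverse-index record in, one key/value per document pair out)
a → {(W,1), (X,3), (Z,9)} emits (W,X) → 3, (W,Z) → 9, (X,Z) → 27
b → {(W,2), (Y,6)} emits (W,Y) → 12
c → {(X,4), (Y,7)} emits (X,Y) → 28
d → {(X,5), (Y,8)} emits (X,Y) → 40
e → {(W,5), (Z,10)} emits (W,Z) → 50
After SHUFFLE (grouped by key)
(W,X) → 3
(W,Y) → 12
(W,Z) → 9
(W,Z) → 50
(X,Y) → 40
(X,Y) → 28
(X,Z) → 27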
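Value combining in REDUCE-Step
Reduce input
(W,X) → 3
(W,Y) → 12
(W,Z) → 9
(W,Z) → 50
(X,Y) → 40
(X,Y) → 28
(X,Z) → 27
Reduce output (sum of values per key)
(W,X) → Sim(W,X) = 3
(W,Y) → Sim(W,Y) = 12
(W,Z) → Sim(W,Z) = 59
(X,Y) → Sim(X,Y) = 68
(X,Z) → Sim(X,Z) = 27
(Y,Z) → Sim(Y,Z) = 0 (no key is ever emitted for this pair, so the similarity is implicitly zero)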
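The reverse-index formulation can be simulated in one process to check the numbers above. This is a sketch under stated assumptions: the function names are hypothetical, the shuffle is a plain dictionary, and in a real deployment building the reverse index would itself be a separate MapReduce job.

```python
from collections import defaultdict
from itertools import combinations

DOCS = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}


def reverse_index(docs):
    """token -> list of (document, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for token, weight in weights.items():
            index[token].append((doc_id, weight))
    return index


def map_token(postings):
    """MAP: one reverse-index record -> one partial product per document pair."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj


def doc_similarity(docs):
    grouped = defaultdict(list)                 # SHUFFLE: group by (di, dj)
    for postings in reverse_index(docs).values():
        for key, value in map_token(postings):
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}   # REDUCE


sims = doc_similarity(DOCS)
print(sims)
```

Only pairs that share at least one token ever produce a key, so the work is proportional to the number of co-occurrences rather than to the number of document pairs.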
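K-Means Clustering
assignments → centers
$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} x_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}}$
[Figure: two clusters of points with $\delta_{n,1}^{(t)} = 1$ and $\delta_{n,2}^{(t)} = 1$; centers $m_1^{(t)}, m_2^{(t)}$ move to $m_1^{(t+1)}, m_2^{(t+1)}$]
centers → assignments
$\delta_{n,k}^{(t+1)} = \left( k == \arg\min_{j=1 \ldots K} \left\{ \Delta\left( x_n, m_j^{(t)} \right) \right\} \right)$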
K-means clustering 101 – Non-parallel
E-Step – Update assignments from centers
$\pi_n^{(t)} \leftarrow \arg\min_{k=1 \ldots K} \left\{ \Delta\left( x_n, m_k^{(t)} \right) \right\}$
Complexity: $O(NKD)$, where N = number of data points, K = number of clusters, D = number of dimensions
M-Step – Update centers from cluster assignments
$m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\left( \pi_n^{(t)} = k \right) x_n}{\sum_{n=1}^{N} \delta\left( \pi_n^{(t)} = k \right)}$
Complexity: $O(ND)$, where N = number of data points, D = number of dimensions
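One E-step followed by one M-step can be written out as follows. This is a sketch rather than the deck's implementation: it assumes that the distance Δ is squared Euclidean distance, and it keeps the old center for a cluster that receives no points; both choices, and all function names, are assumptions.

```python
def sq_dist(x, m):
    """Assumed distance: squared Euclidean distance between point and center."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def e_step(points, centers):
    """Index of the nearest center for every point: N * K distances of cost D."""
    return [
        min(range(len(centers)), key=lambda k: sq_dist(x, centers[k]))
        for x in points
    ]


def m_step(points, assignments, centers):
    """New center = mean of the assigned points, accumulated in one pass."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for x, k in zip(points, assignments):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else centers[k]
        for k in range(len(centers))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(1.0, 1.0), (9.0, 1.0)]
assignments = e_step(points, centers)
new_centers = m_step(points, assignments, centers)
print(assignments, new_centers)
```

The E-step touches every point-center pair, which is where the extra factor K in its cost comes from; the M-step touches every point exactly once.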
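K-Means MapReduce
Input: current centers $\{ m_k^{(t)} \}_{k=1}^{K}$
Map: for each point $x_n$, compute $\pi_n^{(t)} = \arg\min_{k=1 \ldots K} \left\{ \Delta\left( x_n, m_k^{(t)} \right) \right\}$ and emit Key = $\pi_n^{(t)}$ → Value = $x_n$
Shuffle: group the points by key $\pi_n^{(t)}$
Reduce: for each key $k$, $m_k^{(t+1)} \leftarrow \frac{\sum_{n=1}^{N} \delta\left( \pi_n^{(t)} = k \right) x_n}{\sum_{n=1}^{N} \delta\left( \pi_n^{(t)} = k \right)}$
Output: updated centers $\{ m_k^{(t+1)} \}_{k=1}^{K}$
Iterative MapReduce: Update Cluster Centers/iteration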
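The Map, Shuffle and Reduce roles of one K-Means iteration, and the outer loop that repeats them, might look like the sketch below. It runs in a single process, assumes squared Euclidean distance for Δ, keeps the old center when a cluster receives no points, and stops when the centers no longer change; these choices and the function names are illustrative assumptions.

```python
from collections import defaultdict


def sq_dist(x, m):
    """Assumed distance: squared Euclidean."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def kmeans_map(x, centers):
    """MAP: emit (index of nearest center, point)."""
    k = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
    return k, x


def kmeans_reduce(points):
    """REDUCE: new center = coordinate-wise mean of the points of one key."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans_iteration(points, centers):
    grouped = defaultdict(list)              # SHUFFLE: group points by key
    for x in points:
        k, value = kmeans_map(x, centers)
        grouped[k].append(value)
    return [
        kmeans_reduce(grouped[k]) if k in grouped else centers[k]
        for k in range(len(centers))
    ]


def kmeans(points, centers, max_iters=20):
    """Iterative MapReduce: one job per iteration until the centers settle."""
    for _ in range(max_iters):
        new_centers = kmeans_iteration(points, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
final_centers = kmeans(points, [(1.0, 1.0), (9.0, 1.0)])
print(final_centers)
```

Only the K current centers need to be sent to every mapper in each iteration. A mean cannot be merged from partial means directly, so a combiner for this job would have to emit per-cluster sums and counts instead.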
Cliques: Useful structures in Graphs
Nodes | Edges
• People | Co-Social
• Products | Co-purchase
• Movies | Co-like
• Keywords | Co-occurrence
• Documents | Similarity
• Genes | Co-expressions
• Neurons | Co-firing
Example Concepts in IMDB
• guitarist, rock-music, guitar, song, musician, rock-band, singer, electric-guitar, singing
• university, school, college, student, classroom, school-teacher, teacher, teacher-student-relationship
• judge, lawsuit, trial, lawyer, false-persecution, perjury, courtroom
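Graph, Cliques, and Maximal Cliques
Clique = a “fully connected” sub-graph
Maximal Clique = a clique with no “Super-clique”
Finding all Maximal Cliques is hard: worst case $O(3^{n/3})$, since a graph with n nodes can contain up to $3^{n/3}$ maximal cliques
[Figure: example graph on nodes a, b, c, d, e, f, g, h]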
Neighborhood of a Clique
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
f is connected to BOTH b and c
g is connected to BOTH b and c
N({b,c}) = {f,g}
CLIQUEMAP: Clique (key) → Its Neighbor (value)
{a} → {b,e}
{a,b} → {e}
{b,c} → {f,g}
{b,c,f} → {g}
{h} → ∅
{c,d} → ∅
{a,b,e} → ∅
{b,c,f,g} → ∅
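A CliqueMap entry can be computed directly from the adjacency list: the neighborhood of a clique is the intersection of its members' neighbor sets. The sketch below uses the example graph's adjacency list as given in the deck; the names `ADJ` and `neighborhood` are illustrative.

```python
# Adjacency list of the example graph (node -> set of neighbors).
ADJ = {
    "a": {"b", "e"},
    "b": {"a", "c", "e", "f", "g"},
    "c": {"b", "d", "f", "g"},
    "d": {"c"},
    "e": {"a", "b"},
    "f": {"b", "c", "g"},
    "g": {"b", "c", "f"},
    "h": set(),
}


def neighborhood(clique, adj):
    """Nodes connected to every member of a non-empty clique."""
    return set.intersection(*(adj[v] for v in clique))


print(neighborhood({"b", "c"}, ADJ))
```

No member of the clique can appear in the result, because a node is never in its own neighbor set.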
Growing Cliques from CliqueMap
{b,c,f} → {g}
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
{b,c,f} is a clique and g is connected to all of them ⇒ {b,c,f,g} is a clique
MapReduce for Maximal Cliques
CliqueMap of size k → size k + 1
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
Iteration 1 (Input: Adjacency List)
{a} → {b,e}
{b} → {a,c,e,f,g}
{c} → {b,d,f,g}
{d} → {c}
{e} → {a,b}
{f} → {b,c,g}
{g} → {b,c,f}
{h} → ∅
Iteration 2
{a,b} → {e}
{a,e} → {b}
{b,c} → {f,g}
{b,e} → {a}
{b,f} → {c,g}
{b,g} → {c,f}
{c,f} → {b,g}
{c,g} → {b,f}
{f,g} → {b,c}
{c,d} → ∅
Iteration 3
{a,b,e} → ∅
{b,c,f} → {g}
{b,c,g} → {f}
{b,f,g} → {c}
{c,f,g} → {b}
Iteration 4
{b,c,f,g} → ∅
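Key/Value for the MAP-Step
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
MAP (each record C → N(C) emits, for every neighbor v in N(C), the pair C ∪ {v} ⇒ N(C) \ {v})
{a} → {b,e} emits
    {a,b} ⇒ {e}
    {a,e} ⇒ {b}
{e} → {a,b} emits
    {a,e} ⇒ {b}
    {b,e} ⇒ {a}
{b} → {a,c,e,f,g} emits
    {a,b} ⇒ {c,e,f,g}
    {b,c} ⇒ {a,e,f,g}
    {b,e} ⇒ {a,c,f,g}
    {b,f} ⇒ {a,c,e,g}
    {b,g} ⇒ {a,c,e,f}
SHUFFLE (grouped by key; keys {a,e}, {a,b}, {b,e} shown)
{a,e} ⇒ {b}
{a,e} ⇒ {b}
{a,b} ⇒ {e}
{a,b} ⇒ {c,e,f,g}
{b,e} ⇒ {a,c,f,g}
{b,e} ⇒ {a}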
Value combining in REDUCE-Step
Reduce = Intersection
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
SHUFFLE
{a,e} ⇒ {b}
{a,e} ⇒ {b}
{a,b} ⇒ {e}
{a,b} ⇒ {c,e,f,g}
{b,e} ⇒ {a,c,f,g}
{b,e} ⇒ {a}
REDUCE
{a,b} → {e} ∩ {c,e,f,g} = {e}
{b,e} → {a,c,f,g} ∩ {a} = {a}
{a,e} → {b} ∩ {b} = {b}
Value combining in REDUCE-Step
[Figure: example graph on nodes a, b, c, d, e, f, g, h]
Key {c,d}
{c} → {b,d,f,g} emits {c,d} ⇒ {b,f,g}
{d} → {c} emits {c,d} ⇒ ∅
Reduce: {c,d} → {b,f,g} ∩ ∅ = ∅
Key {b,c}
{b} → {a,c,e,f,g} emits {b,c} ⇒ {a,e,f,g}
{c} → {b,d,f,g} emits {b,c} ⇒ {d,f,g}
Reduce: {b,c} → {a,e,f,g} ∩ {d,f,g} = {f,g}
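Putting the Map step (grow every clique by each of its neighbors) and the Reduce step (intersect the emitted neighbor sets) together gives one round that turns a CliqueMap of size k into one of size k + 1. The sketch below is a single-process simulation with hypothetical function names; it relies on the fact that a clique whose neighborhood is ∅ cannot grow any further and is therefore maximal.

```python
from collections import defaultdict

ADJ = {
    "a": "be", "b": "acefg", "c": "bdfg", "d": "c",
    "e": "ab", "f": "bcg", "g": "bcf", "h": "",
}


def clique_map_step(clique_map):
    """One MapReduce round: CliqueMap of size k -> CliqueMap of size k + 1."""
    emitted = defaultdict(list)                 # SHUFFLE: group by grown clique
    for clique, neighbors in clique_map.items():
        for v in neighbors:                     # MAP: C ∪ {v} ⇒ N(C) \ {v}
            emitted[clique | {v}].append(neighbors - {v})
    # REDUCE: the neighborhood of the grown clique is the intersection
    return {c: frozenset.intersection(*values) for c, values in emitted.items()}


def maximal_cliques(adjacency):
    """Iterate rounds, collecting cliques whose neighborhood is empty."""
    clique_map = {
        frozenset([v]): frozenset(nbrs) for v, nbrs in adjacency.items()
    }
    maximal = []
    while clique_map:
        maximal += [c for c, nbrs in clique_map.items() if not nbrs]
        clique_map = clique_map_step(clique_map)
    return maximal


size_two = clique_map_step(
    {frozenset([v]): frozenset(nbrs) for v, nbrs in ADJ.items()}
)
found = maximal_cliques(ADJ)
print(sorted("".join(sorted(c)) for c in found))
```

On the example graph the loop runs four rounds and reports {h}, {c,d}, {a,b,e} and {b,c,f,g}, matching the ∅ entries of the CliqueMap.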
“Art of Thinking Parallel” is about
▪ Transforming the Input Data appropriately
▪ e.g. Reverse Indexing (doc-doc similarity)
▪ Breaking the problem into smaller ones
▪ e.g. Iterative MapReduce (clustering)
▪ Designing the Map step - Key/Value output
▪ e.g. CliqueMaps in Maximal Cliques
▪ Designing the Reduce step – Combine values of key
▪ e.g. Intersections in Maximal Cliques

More Related Content

What's hot

An application of gd
An application of gdAn application of gd
An application of gd
graphhoc
 
Geospatial Data in R
Geospatial Data in RGeospatial Data in R
Geospatial Data in R
Barry Rowlingson
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
Rebecca Bilbro
 
Multi-scalar multiplication: state of the art and new ideas
Multi-scalar multiplication: state of the art and new ideasMulti-scalar multiplication: state of the art and new ideas
Multi-scalar multiplication: state of the art and new ideas
Gus Gutoski
 
The Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
The Lazy Traveling Salesman Memory Management for Large-Scale Link DiscoveryThe Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
The Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
Holistic Benchmarking of Big Linked Data
 
k-means Clustering and Custergram with R
k-means Clustering and Custergram with Rk-means Clustering and Custergram with R
k-means Clustering and Custergram with R
Dr. Volkan OBAN
 
1452 86301000013 m
1452 86301000013 m1452 86301000013 m
1452 86301000013 m
Praveen Kumar
 
Kumaraswamy distributin:
Kumaraswamy distributin:Kumaraswamy distributin:
Kumaraswamy distributin:
Pankaj Das
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and Applications
ASPAK2014
 
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHSDISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
graphhoc
 
Direct split-radix algorithm for fast computation of type-II discrete Hartley...
Direct split-radix algorithm for fast computation of type-II discrete Hartley...Direct split-radix algorithm for fast computation of type-II discrete Hartley...
Direct split-radix algorithm for fast computation of type-II discrete Hartley...
TELKOMNIKA JOURNAL
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
PyData
 
DocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
DocEng2013 Bilauca Healy - Splitting Wide Tables OptimallyDocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
DocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
mbilauca
 
Sqlserver 2008 r2
Sqlserver 2008 r2Sqlserver 2008 r2
Sqlserver 2008 r2
Kashif Akram
 
Distributed Support Vector Machines
Distributed Support Vector MachinesDistributed Support Vector Machines
Distributed Support Vector Machines
Harsha Vardhan Tetali
 
S. Duplij. Polyadic algebraic structures and their applications
S. Duplij. Polyadic algebraic structures and their applicationsS. Duplij. Polyadic algebraic structures and their applications
S. Duplij. Polyadic algebraic structures and their applications
Steven Duplij (Stepan Douplii)
 
Multilayerity within multilayerity? On multilayer assortativity in social net...
Multilayerity within multilayerity? On multilayer assortativity in social net...Multilayerity within multilayerity? On multilayer assortativity in social net...
Multilayerity within multilayerity? On multilayer assortativity in social net...
Moses Boudourides
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
Dataspora
 

What's hot (18)

An application of gd
An application of gdAn application of gd
An application of gd
 
Geospatial Data in R
Geospatial Data in RGeospatial Data in R
Geospatial Data in R
 
A Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and DistributionsA Visual Exploration of Distance, Documents, and Distributions
A Visual Exploration of Distance, Documents, and Distributions
 
Multi-scalar multiplication: state of the art and new ideas
Multi-scalar multiplication: state of the art and new ideasMulti-scalar multiplication: state of the art and new ideas
Multi-scalar multiplication: state of the art and new ideas
 
The Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
The Lazy Traveling Salesman Memory Management for Large-Scale Link DiscoveryThe Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
The Lazy Traveling Salesman Memory Management for Large-Scale Link Discovery
 
k-means Clustering and Custergram with R
k-means Clustering and Custergram with Rk-means Clustering and Custergram with R
k-means Clustering and Custergram with R
 
1452 86301000013 m
1452 86301000013 m1452 86301000013 m
1452 86301000013 m
 
Kumaraswamy distributin:
Kumaraswamy distributin:Kumaraswamy distributin:
Kumaraswamy distributin:
 
Treewidth and Applications
Treewidth and ApplicationsTreewidth and Applications
Treewidth and Applications
 
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHSDISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
DISTANCE TWO LABELING FOR MULTI-STOREY GRAPHS
 
Direct split-radix algorithm for fast computation of type-II discrete Hartley...
Direct split-radix algorithm for fast computation of type-II discrete Hartley...Direct split-radix algorithm for fast computation of type-II discrete Hartley...
Direct split-radix algorithm for fast computation of type-II discrete Hartley...
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
DocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
DocEng2013 Bilauca Healy - Splitting Wide Tables OptimallyDocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
DocEng2013 Bilauca Healy - Splitting Wide Tables Optimally
 
Sqlserver 2008 r2
Sqlserver 2008 r2Sqlserver 2008 r2
Sqlserver 2008 r2
 
Distributed Support Vector Machines
Distributed Support Vector MachinesDistributed Support Vector Machines
Distributed Support Vector Machines
 
S. Duplij. Polyadic algebraic structures and their applications
S. Duplij. Polyadic algebraic structures and their applicationsS. Duplij. Polyadic algebraic structures and their applications
S. Duplij. Polyadic algebraic structures and their applications
 
Multilayerity within multilayerity? On multilayer assortativity in social net...
Multilayerity within multilayerity? On multilayer assortativity in social net...Multilayerity within multilayerity? On multilayer assortativity in social net...
Multilayerity within multilayerity? On multilayer assortativity in social net...
 
An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)An Interactive Introduction To R (Programming Language For Statistics)
An Interactive Introduction To R (Programming Language For Statistics)
 

Viewers also liked

Offline first geeknight
Offline first geeknightOffline first geeknight
Offline first geeknight
Hyderabad Scalability Meetup
 
GeekNight: Evolution of Programming Languages
GeekNight: Evolution of Programming LanguagesGeekNight: Evolution of Programming Languages
GeekNight: Evolution of Programming Languages
Hyderabad Scalability Meetup
 
Serverless architectures
Serverless architecturesServerless architectures
Serverless architectures
Hyderabad Scalability Meetup
 
Geeknight : Artificial Intelligence and Machine Learning
Geeknight : Artificial Intelligence and Machine LearningGeeknight : Artificial Intelligence and Machine Learning
Geeknight : Artificial Intelligence and Machine Learning
Hyderabad Scalability Meetup
 
Proving parallelism
Proving parallelismProving parallelism
Proving parallelism
salvie alvaro
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
Hyderabad Scalability Meetup
 
Parallelism and perpendicularity
Parallelism and perpendicularityParallelism and perpendicularity
Parallelism and perpendicularity
salvie alvaro
 
Pertemuan ii mankiw krugman
Pertemuan ii mankiw krugmanPertemuan ii mankiw krugman
Pertemuan ii mankiw krugman
stephaniejessey
 
Sim Photosynthesis
Sim  PhotosynthesisSim  Photosynthesis
Sim Photosynthesis
eric
 
What is Parallelism?
What is Parallelism?What is Parallelism?
What is Parallelism?
Hussain Al-ghawi
 

Viewers also liked (10)

Offline first geeknight
Offline first geeknightOffline first geeknight
Offline first geeknight
 
GeekNight: Evolution of Programming Languages
GeekNight: Evolution of Programming LanguagesGeekNight: Evolution of Programming Languages
GeekNight: Evolution of Programming Languages
 
Serverless architectures
Serverless architecturesServerless architectures
Serverless architectures
 
Geeknight : Artificial Intelligence and Machine Learning
Geeknight : Artificial Intelligence and Machine LearningGeeknight : Artificial Intelligence and Machine Learning
Geeknight : Artificial Intelligence and Machine Learning
 
Proving parallelism
Proving parallelismProving parallelism
Proving parallelism
 
Understanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQLUnderstanding and building big data Architectures - NoSQL
Understanding and building big data Architectures - NoSQL
 
Parallelism and perpendicularity
Parallelism and perpendicularityParallelism and perpendicularity
Parallelism and perpendicularity
 
Pertemuan ii mankiw krugman
Pertemuan ii mankiw krugmanPertemuan ii mankiw krugman
Pertemuan ii mankiw krugman
 
Sim Photosynthesis
Sim  PhotosynthesisSim  Photosynthesis
Sim Photosynthesis
 
What is Parallelism?
What is Parallelism?What is Parallelism?
What is Parallelism?
 

Similar to Map reduce and the art of Thinking Parallel - Dr. Shailesh Kumar

Outrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph Databases
Max De Marzi
 
Interactive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social GraphsInteractive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social Graphs
Tokyo Tech (Tokyo Institute of Technology)
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
David Gleich
 
Minimum spanning tree
Minimum spanning treeMinimum spanning tree
Minimum spanning tree
AhmedMalik74
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
David Gleich
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
David Gleich
 
LalitBDA2015V3
LalitBDA2015V3LalitBDA2015V3
LalitBDA2015V3
Lalit Kumar
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
jeykottalam
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Max De Marzi
 
FUNCTION- Algebraic Function
FUNCTION- Algebraic FunctionFUNCTION- Algebraic Function
FUNCTION- Algebraic Function
Janak Singh saud
 
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
Ken'ichi Matsui
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api Training
Spark Summit
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
MLconf
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
Christian Robert
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.com
jackpot201
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-Joins
Jonny Daenen
 
Chapter 09-Trees
Chapter 09-TreesChapter 09-Trees
Chapter 09-Trees
MuhammadBakri13
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
Trector Rancor
 
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
Lucidworks
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 

Similar to Map reduce and the art of Thinking Parallel - Dr. Shailesh Kumar (20)

Outrageous Ideas for Graph Databases
Outrageous Ideas for Graph DatabasesOutrageous Ideas for Graph Databases
Outrageous Ideas for Graph Databases
 
Interactive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social GraphsInteractive High-Dimensional Visualization of Social Graphs
Interactive High-Dimensional Visualization of Social Graphs
 
Relaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networksRelaxation methods for the matrix exponential on large networks
Relaxation methods for the matrix exponential on large networks
 
Minimum spanning tree
Minimum spanning treeMinimum spanning tree
Minimum spanning tree
 
Localized methods for diffusions in large graphs
Localized methods for diffusions in large graphsLocalized methods for diffusions in large graphs
Localized methods for diffusions in large graphs
 
Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...Gaps between the theory and practice of large-scale matrix-based network comp...
Gaps between the theory and practice of large-scale matrix-based network comp...
 
LalitBDA2015V3
LalitBDA2015V3LalitBDA2015V3
LalitBDA2015V3
 
COCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate AscentCOCOA: Communication-Efficient Coordinate Ascent
COCOA: Communication-Efficient Coordinate Ascent
 
Developer Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker NotesDeveloper Intro Deck-PowerPoint - Download for Speaker Notes
Developer Intro Deck-PowerPoint - Download for Speaker Notes
 
FUNCTION- Algebraic Function
FUNCTION- Algebraic FunctionFUNCTION- Algebraic Function
FUNCTION- Algebraic Function
 
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
第13回数学カフェ「素数!!」二次会 LT資料「乱数!!」
 
Visual Api Training
Visual Api TrainingVisual Api Training
Visual Api Training
 
MLconf NYC Shan Shan Huang
MLconf NYC Shan Shan HuangMLconf NYC Shan Shan Huang
MLconf NYC Shan Shan Huang
 
CDT 22 slides.pdf
CDT 22 slides.pdfCDT 22 slides.pdf
CDT 22 slides.pdf
 
Make money fast! department of computer science-copypasteads.com
Make money fast!   department of computer science-copypasteads.comMake money fast!   department of computer science-copypasteads.com
Make money fast! department of computer science-copypasteads.com
 
Parallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-JoinsParallel Evaluation of Multi-Semi-Joins
Parallel Evaluation of Multi-Semi-Joins
 
Chapter 09-Trees
Chapter 09-TreesChapter 09-Trees
Chapter 09-Trees
 
Tree distance algorithm
Tree distance algorithmTree distance algorithm
Tree distance algorithm
 
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
Scorer’s Diversity Phase 2.0: Presented by Mikhail Khludnev, Grid Dynamics Inc.
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 

More from Hyderabad Scalability Meetup

Turbo charging v8 engine
Turbo charging v8 engineTurbo charging v8 engine
Turbo charging v8 engine
Hyderabad Scalability Meetup
 
Git internals
Git internalsGit internals
Nlp
NlpNlp
Internet of Things - GeekNight - Hyderabad
Internet of Things - GeekNight - HyderabadInternet of Things - GeekNight - Hyderabad
Internet of Things - GeekNight - Hyderabad
Hyderabad Scalability Meetup
 
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep DiveDemystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Hyderabad Scalability Meetup
 
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep DiveDemystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Hyderabad Scalability Meetup
 
Java 8 Lambda Expressions
Java 8 Lambda ExpressionsJava 8 Lambda Expressions
Java 8 Lambda Expressions
Hyderabad Scalability Meetup
 
No SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability MeetupNo SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Hyderabad Scalability Meetup
 
Docker by demo
Docker by demoDocker by demo

More from Hyderabad Scalability Meetup (10)

Turbo charging v8 engine
Turbo charging v8 engineTurbo charging v8 engine
Turbo charging v8 engine
 
Git internals
Git internalsGit internals
Git internals
 
Nlp
NlpNlp
Nlp
 
Internet of Things - GeekNight - Hyderabad
Internet of Things - GeekNight - HyderabadInternet of Things - GeekNight - Hyderabad
Internet of Things - GeekNight - Hyderabad
 
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep DiveDemystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep Dive
 
Demystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep DiveDemystify Big Data, Data Science & Signal Extraction Deep Dive
Demystify Big Data, Data Science & Signal Extraction Deep Dive
 
Java 8 Lambda Expressions
Java 8 Lambda ExpressionsJava 8 Lambda Expressions
Java 8 Lambda Expressions
 
No SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability MeetupNo SQL and MongoDB - Hyderabad Scalability Meetup
No SQL and MongoDB - Hyderabad Scalability Meetup
 
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability MeetupApache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
Apache Spark - Lightning Fast Cluster Computing - Hyderabad Scalability Meetup
 
Docker by demo
Docker by demoDocker by demo
Docker by demo
 


Map reduce and the art of Thinking Parallel - Dr. Shailesh Kumar

  • 1. MapReduce and the art of “Thinking Parallel”. Shailesh Kumar, Third Leap, Inc.
  • 2. Three I’s of a great product! Interface: Intuitive | Functional | Elegant. Infrastructure: Storage | Computation | Network. Intelligence: Learn | Predict | Adapt | Evolve.
  • 3. Drowning in Data, Starving for Knowledge ATATTAGGTTTTTACCTACCC AGGAAAAGCCAACCAACCTC GATCTCTTGTAGATCTGTTCT CTAAACGAACTTTAAAATCTG TGTAGCTGTCGCTCGGCTG CATGCCTAGTGCACCTACGC AGTATAAACAATAATAAATTTT ACTGTCGTTGACAAGAAACG AGTAACTCGTCCCTCTTCTG CAGACTGCTTATTACGCGAC CGTAAGCTAC…
  • 4. How BIG is Big Data? 600 million tweets per DAY; 100 hours per MINUTE; 800+ websites per MINUTE; 100 TB of data uploaded DAILY; 3.5 billion queries PER DAY; 300 million active customers.
  • 5. How did we get here?
      ▪ Better Sensors: higher resolution, real-time, diverse measurements, …
      ▪ Faster Communication: network infrastructure, compression technologies, …
      ▪ Cheaper Storage: cloud-based storage, large warehouses, NoSQL databases
      ▪ Massive Computation: cloud computing, MapReduce/Hadoop parallel processing paradigms
      ▪ Intelligent Decisions: advances in Machine Learning and Artificial Intelligence
  • 6. The Evolution of “Computing”
  • 7. Parallel Computing Basics
      ▪ Data Parallelism (distributed computing)
          ▪ Lots of data → break it into “chunks”
          ▪ Process each “chunk” of data in parallel
          ▪ Combine results from each “chunk”
          ▪ MAPREDUCE = data parallelism
      ▪ Process Parallelism (data flow computing)
          ▪ Lots of stages → set up a process graph
          ▪ Pass data through all stages
          ▪ All stages running in parallel on different data
          ▪ Assembly line = process parallelism
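The three data-parallel steps on this slide (chunk, process each chunk, combine) can be sketched in a few lines of Python. This is only an illustration: the function names, the sum-of-squares task, and the chunk size are assumptions rather than anything from the deck, and the thread pool merely stands in for separate machines (CPython threads do not speed up pure-Python arithmetic).

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, size):
    """Break a list into consecutive chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(chunk):
    """Per-chunk work (hypothetical task): a partial sum of squares."""
    return sum(x * x for x in chunk)


def data_parallel_sum_of_squares(data, chunk_size=3, workers=4):
    chunks = chunked(data, chunk_size)                    # 1. break into chunks
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))  # 2. process chunks in parallel
    return sum(partials)                                  # 3. combine the partial results


total = data_parallel_sum_of_squares(list(range(10)))
```

The combine step works here because addition does not care how the data was chunked; that property is what makes a task data-parallelizable.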
  • 8. Agenda MAPREDUCE Background Problem 1 – Similarity between all pairs of documents! Problem 2 – Parallelizing K-Means clustering Problem 3 – Finding all Maximal Cliques in a Graph
  • 9. MAPREDUCE 101: A 4-stage Process. [Diagram: lots of data is split into Shards 1…N, which flow through Map 1…K → Combine 1…K → Shuffle 1…K → Reduce 1…R → Output 1…R. Each Map processes N/K shards.]
  • 10. MAPREDUCE 101: An example Task
      ▪ Count total frequency of all words on the web
      ▪ Total number of documents > 20 billion
      ▪ Total number of unique words > 20 million
      ▪ Non-Parallel / Linear Implementation:
          for each document d on the Web
              for each unique word w in d
                  DocCount(w, d) = # times w occurred in d
                  WebCount(w) += DocCount(w, d)
  • 11. MAPREDUCE – MAP/COMBINE (Key Value pairs)
      Shard 1 → Map-1: A 10, B 7, C 9, D 3, B 4 → Combine-1: A 10, B 11, C 9, D 3
      Shard 2 → Map-2: A 3, D 1, C 4, D 9, B 6 → Combine-2: A 3, B 6, C 4, D 10
      Shard 3 → Map-3: B 3, D 5, C 4, A 6, A 3 → Combine-3: A 9, B 3, C 4, D 5
  • 12. MAPREDUCE – Shuffle/Reduce
      Combine outputs: (A 10, B 11, C 9, D 3); (A 3, B 6, C 4, D 10); (A 9, B 3, C 4, D 5)
      Shuffle 1, 2, 3 → Reduce 1 input: A 10, A 3, A 9, C 9, C 4, C 4 → Reduce 1 output: A 22, C 17
      Shuffle 1, 2, 3 → Reduce 2 input: B 11, B 6, B 3, D 3, D 10, D 5 → Reduce 2 output: B 20, D 18
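The four stages on slides 11 and 12 can be replayed on the same toy numbers with a small single-process simulation. This is a sketch, not a real MapReduce job: the function names and the rule that sends keys A and C to reducer 1 and keys B and D to reducer 2 are assumptions chosen to mirror the slides.

```python
from collections import Counter, defaultdict

# Map outputs for the three shards, copied from slide 11.
shards = [
    [("A", 10), ("B", 7), ("C", 9), ("D", 3), ("B", 4)],
    [("A", 3), ("D", 1), ("C", 4), ("D", 9), ("B", 6)],
    [("B", 3), ("D", 5), ("C", 4), ("A", 6), ("A", 3)],
]


def combine(pairs):
    """Sum values per key within a single map task's output."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return dict(local)


def partition(key):
    """Hypothetical routing rule: A and C go to reducer 1, everything else to reducer 2."""
    return 1 if key in ("A", "C") else 2


def shuffle(combined_tables):
    """Group the partial values of each key under the reducer that owns the key."""
    buckets = defaultdict(lambda: defaultdict(list))
    for table in combined_tables:
        for key, value in table.items():
            buckets[partition(key)][key].append(value)
    return buckets


def reduce_sum(bucket):
    """Add up all partial counts of each key."""
    return {key: sum(values) for key, values in bucket.items()}


combined = [combine(pairs) for pairs in shards]
reduced = {r: reduce_sum(bucket) for r, bucket in shuffle(combined).items()}
```

With the slide's data, `reduced[1]` holds the totals for A and C and `reduced[2]` the totals for B and D. The combiner is optional for correctness; it only shrinks what has to cross the network during the shuffle.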
  • 13. Key Questions in MAPREDUCE
      ▪ Is the task really “data-parallelizable”?
          ▪ High dependence tasks (e.g. Fibonacci series)
          ▪ Recursive tasks (e.g. Binary Search)
      ▪ What is the key-value pair output for MAP step?
          ▪ Each map processes only one data record at a time
          ▪ It can generate none, one, or multiple key-value pairs
      ▪ How to combine values of a key in REDUCE step?
          ▪ The key for reduce is same as key for Map output
          ▪ The reduce function must be “order agnostic”
  • 14. Other considerations
      ▪ Reliability/Robustness
          ▪ A processor or disk might go bad during the process
      ▪ Optimization/Efficiency
          ▪ Allocate CPUs near data shards to reduce network overhead
      ▪ Scale/Parallelism
          ▪ Parallelization linearly proportional to number of machines
      ▪ Simplicity/Usability
          ▪ Just specify the Map task and the Reduce task and be done!
      ▪ Generality
          ▪ Lots of parallelizable tasks can be written in MapReduce
          ▪ With some creativity, many more than you can imagine!
  • 15. Agenda MAPREDUCE Background Problem 1 – Similarity between all pairs of documents! Problem 2 – Parallelizing K-Means clustering Problem 3 – Finding all Maximal Cliques in a Graph
  • 16. Similarity between all pairs of docs.
      ▪ Why bother?
          ▪ Document clustering, similar-document search, etc.
      ▪ Document represented as a “Bag-of-Tokens”
          ▪ A weight associated with each token in the vocabulary
          ▪ Most weights are zero – Sparsity
      ▪ Cosine Similarity between two documents (the dot product below equals the cosine when the weight vectors are normalized to unit length)
          $d_i = \{w_1^i, w_2^i, \dots, w_T^i\}$, $d_j = \{w_1^j, w_2^j, \dots, w_T^j\}$
          $\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$
  • 17. Non-Parallel / Linear Implementation
          For each document $d_i$
              For each document $d_j$ ($j > i$)
                  $\mathrm{Sim}(d_i, d_j) = \sum_{t=1}^{T} w_t^i \times w_t^j$
      Complexity = $O(D^2 T \sigma)$, where $\sigma$ = sparsity factor = $10^{-5}$ = average fraction of the vocabulary per document
      $D = O(10\text{B})$, $T = O(10\text{M})$
      Complexity = $O(10^{20+7-5}) = O(10^{22})$
  • 18. Toy Example for doc-doc similarity: a classic “Join”
      Documents = {W, X, Y, Z}, Words = {a, b, c, d, e}
      Input:
          W → {(a,1), (b,2), (e,5)}
          X → {(a,3), (c,4), (d,5)}
          Y → {(b,6), (c,7), (d,8)}
          Z → {(a,9), (e,10)}
      Output:
          (W,X) → Sim(W,X) = 3
          (W,Y) → Sim(W,Y) = 12
          (W,Z) → Sim(W,Z) = 59
          (X,Y) → Sim(X,Y) = 68
          (X,Z) → Sim(X,Z) = 27
          (Y,Z) → Sim(Y,Z) = 0
  • 19. Reverse Indexing to the rescue: first convert the data to a reverse index
      Documents:
          W → {(a,1), (b,2), (e,5)}
          X → {(a,3), (c,4), (d,5)}
          Y → {(b,6), (c,7), (d,8)}
          Z → {(a,9), (e,10)}
      Reverse index:
          a → {(W,1), (X,3), (Z,9)}
          b → {(W,2), (Y,6)}
          c → {(X,4), (Y,7)}
          d → {(X,5), (Y,8)}
          e → {(W,5), (Z,10)}
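  • 20. Key/Value for the MAP-Step
      Map (one reverse-index entry at a time):
          a → {(W,1), (X,3), (Z,9)} emits (W,X) → 3; (W,Z) → 9; (X,Z) → 27
          b → {(W,2), (Y,6)} emits (W,Y) → 12
          c → {(X,4), (Y,7)} emits (X,Y) → 28
          d → {(X,5), (Y,8)} emits (X,Y) → 40
          e → {(W,5), (Z,10)} emits (W,Z) → 50
      After the shuffle (grouped by key): (W,X) → 3; (W,Y) → 12; (W,Z) → 9; (W,Z) → 50; (X,Y) → 40; (X,Y) → 28; (X,Z) → 27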
  • 21. Value combining in REDUCE-Step
      Reduce input: (W,X) → 3; (W,Y) → 12; (W,Z) → 9; (W,Z) → 50; (X,Y) → 40; (X,Y) → 28; (X,Z) → 27
      Reduce output (sum per key):
          (W,X) → Sim(W,X) = 3
          (W,Y) → Sim(W,Y) = 12
          (W,Z) → Sim(W,Z) = 59
          (X,Y) → Sim(X,Y) = 68
          (X,Z) → Sim(X,Z) = 27
          (Y,Z) → Sim(Y,Z) = 0
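Slides 18 to 21 can be reproduced with a short single-process sketch: build the reverse index, let the map step emit one weight product per document pair that shares a word, and let the reduce step sum the products per pair. The function names are assumptions made for illustration. Pairs with no shared word, such as (Y, Z), are never emitted, so their similarity of 0 is implied rather than computed.

```python
from collections import defaultdict
from itertools import combinations

# Toy input from slide 18: document -> {word: weight}.
docs = {
    "W": {"a": 1, "b": 2, "e": 5},
    "X": {"a": 3, "c": 4, "d": 5},
    "Y": {"b": 6, "c": 7, "d": 8},
    "Z": {"a": 9, "e": 10},
}


def build_reverse_index(docs):
    """Convert document -> words into word -> [(document, weight), ...]."""
    index = defaultdict(list)
    for doc_id, weights in docs.items():
        for word, weight in weights.items():
            index[word].append((doc_id, weight))
    return index


def map_word(postings):
    """Emit ((doc_i, doc_j), w_i * w_j) for every pair of documents sharing one word."""
    for (di, wi), (dj, wj) in combinations(sorted(postings), 2):
        yield (di, dj), wi * wj


def reduce_pairs(emitted):
    """Sum the partial products of each document pair."""
    sims = defaultdict(int)
    for pair, product in emitted:
        sims[pair] += product
    return dict(sims)


index = build_reverse_index(docs)
emitted = [kv for postings in index.values() for kv in map_word(postings)]
similarity = reduce_pairs(emitted)
```

Sorting the postings before pairing keeps every key in a canonical order, so (W, Z) and (Z, W) can never end up as two different keys at the reducer.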
  • 22. Agenda MAPREDUCE Background Problem 1 – Similarity between all pairs of documents! Problem 2 – Parallelizing K-Means clustering Problem 3 – Finding all Maximal Cliques in a Graph
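  • 23. K-Means Clustering
      Assignments → centers: $m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta_{n,k}^{(t)} x_n}{\sum_{n=1}^{N} \delta_{n,k}^{(t)}}$
      Centers → assignments: $\delta_{n,k}^{(t+1)} = \left( k == \arg\min_{j=1 \dots K} \Delta\left(x_n, m_j^{(t)}\right) \right)$
      [Figure labels: centers $m_1^{(t)}$, $m_2^{(t)}$ and $m_1^{(t+1)}$, $m_2^{(t+1)}$; points marked $\delta_{n,1}^{(t)} = 1$ and $\delta_{n,2}^{(t)} = 1$.]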
  • 24. K-means clustering 101 – Non-parallel
      E-Step – Update assignments from centers: $\pi_n^{(t)} \leftarrow \arg\min_{k=1 \dots K} \Delta\left(x_n, m_k^{(t)}\right)$
          $O(NKD)$: N = number of data points, K = number of clusters, D = number of dimensions
      M-Step – Update centers from cluster assignments: $m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right)}$
          $O(ND)$: N = number of data points, D = number of dimensions
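  • 25. K-Means MapReduce
      Input: current centers $\{m_k^{(t)}\}_{k=1}^{K}$
      Map: for each $x_n$, emit Key = $\pi_n^{(t)}$ → Value = $x_n$, where $\pi_n^{(t)} = \arg\min_{k=1 \dots K} \Delta\left(x_n, m_k^{(t)}\right)$
      Shuffle: group values by key $\pi_n^{(t)}$
      Reduce: $m_k^{(t+1)} \leftarrow \dfrac{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right) x_n}{\sum_{n=1}^{N} \delta\left(\pi_n^{(t)} = k\right)}$
      Output: updated centers $\{m_k^{(t+1)}\}_{k=1}^{K}$
      Iterative MapReduce: update the cluster centers once per iteration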
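One K-Means iteration written as map, shuffle, and reduce might look like the sketch below. Several things here are assumptions, not taken from the slides: squared Euclidean distance stands in for the unspecified Δ, a center that attracts no points keeps its old position, and the function names and sample points are made up for illustration.

```python
from collections import defaultdict


def squared_distance(x, m):
    """Assumed choice for the slide's distance function Delta."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def kmeans_map(x, centers):
    """Emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
    return nearest, x


def kmeans_reduce(points):
    """Average all points assigned to one cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_iteration(data, centers):
    groups = defaultdict(list)  # shuffle: group points by cluster key
    for x in data:
        k, value = kmeans_map(x, centers)
        groups[k].append(value)
    # Assumption: an empty cluster keeps its previous center.
    return [kmeans_reduce(groups[k]) if k in groups else centers[k]
            for k in range(len(centers))]


data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_iteration(data, centers)
```

Each call is one MapReduce job; the driver repeats it with the new centers until they stop moving. In a real job a combiner would usually forward per-cluster partial sums and counts instead of raw points, which this sketch leaves out.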
  • 26. Agenda MAPREDUCE Background Problem 1 – Similarity between all pairs of documents! Problem 2 – Parallelizing K-Means clustering Problem 3 – Finding all Maximal Cliques in a Graph
  • 27. Cliques: Useful structures in Graphs
      Nodes → edges:
          • People → Co-Social
          • Products → Co-purchase
          • Movies → Co-like
          • Keywords → Co-occurrence
          • Documents → Similarity
          • Genes → Co-expressions
          • Neurons → Co-firing
  • 29. Graph, Cliques, and Maximal Cliques
      Clique = a “fully connected” sub-graph
      Maximal Clique = a clique with no “Super-clique”
      Finding all Maximal Cliques is NP-hard: $O(3^{n/3})$
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
  • 30. Neighborhood of a Clique
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      f is connected to BOTH b and c; g is connected to BOTH b and c, so N({b,c}) = {f,g}
      CLIQUEMAP: Clique (key) → Its Neighbor (value)
          {a} → {b,e}
          {a,b} → {e}
          {b,c} → {f,g}
          {b,c,f} → {g}
          {h} → ∅
          {c,d} → ∅
          {a,b,e} → ∅
          {b,c,f,g} → ∅
  • 31. Growing Cliques from CliqueMap
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      {b,c,f} → {g}: {b,c,f} is a clique and g is connected to all of them ⇒ {b,c,f,g} is a clique
  • 32. MapReduce for Maximal Cliques: CliqueMap of size k → size k + 1
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      Iteration 1 (Input: Adjacency List): {a} → {b,e}; {b} → {a,c,e,f,g}; {c} → {b,d,f,g}; {d} → {c}; {e} → {a,b}; {f} → {b,c,g}; {g} → {b,c,f}; {h} → ∅
      Iteration 2: {a,b} → {e}; {a,e} → {b}; {b,c} → {f,g}; {b,e} → {a}; {b,f} → {c,g}; {b,g} → {c,f}; {c,f} → {b,g}; {c,g} → {b,f}; {f,g} → {b,c}; {c,d} → ∅
      Iteration 3: {a,b,e} → ∅; {b,c,f} → {g}; {b,c,g} → {f}; {b,f,g} → {c}; {c,f,g} → {b}
      Iteration 4: {b,c,f,g} → ∅
  • 33. Key/Value for the MAP-Step
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      MAP:
          {a} → {b,e} emits {a,b} ⇒ {e}; {a,e} ⇒ {b}
          {e} → {a,b} emits {a,e} ⇒ {b}; {b,e} ⇒ {a}
          {b} → {a,c,e,f,g} emits {a,b} ⇒ {c,e,f,g}; {b,c} ⇒ {a,e,f,g}; {b,e} ⇒ {a,c,f,g}; {b,f} ⇒ {a,c,e,g}; {b,g} ⇒ {a,c,e,f}
      SHUFFLE (grouped by key):
          {a,e} ⇒ {b}; {a,e} ⇒ {b}
          {a,b} ⇒ {e}; {a,b} ⇒ {c,e,f,g}
          {b,e} ⇒ {a,c,f,g}; {b,e} ⇒ {a}
  • 34. Value combining in REDUCE-Step (Reduce = Intersection)
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      SHUFFLE:
          {a,e} ⇒ {b}; {a,e} ⇒ {b}
          {a,b} ⇒ {e}; {a,b} ⇒ {c,e,f,g}
          {b,e} ⇒ {a,c,f,g}; {b,e} ⇒ {a}
      REDUCE:
          {a,b} → {e} ∩ {c,e,f,g} = {e}
          {b,e} → {a,c,f,g} ∩ {a} = {a}
          {a,e} → {b} ∩ {b} = {b}
  • 35. Value combining in REDUCE-Step
      [Figure: example graph on nodes a, b, c, d, e, f, g, h.]
      {c} → {b,d,f,g} emits {c,d} ⇒ {b,f,g}; {d} → {c} emits {c,d} ⇒ ∅
          Reduce: {c,d} → {b,f,g} ∩ ∅ = ∅
      {b} → {a,c,e,f,g} emits {b,c} ⇒ {a,e,f,g}; {c} → {b,d,f,g} emits {b,c} ⇒ {d,f,g}
          Reduce: {b,c} → {a,e,f,g} ∩ {d,f,g} = {f,g}
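The clique-growing rounds on slides 32 to 35 can be simulated in one process as follows. The adjacency list is the one given as input on slide 32; the function names and the driver loop are assumptions made for this sketch. Each round maps every clique and its neighborhood to candidate cliques one node larger, then reduces by intersecting the candidate neighborhoods; a clique whose neighborhood is empty cannot grow and is reported as maximal.

```python
from collections import defaultdict

# Adjacency list from slide 32 (Iteration 1 input).
adjacency = {
    "a": {"b", "e"},
    "b": {"a", "c", "e", "f", "g"},
    "c": {"b", "d", "f", "g"},
    "d": {"c"},
    "e": {"a", "b"},
    "f": {"b", "c", "g"},
    "g": {"b", "c", "f"},
    "h": set(),
}


def clique_map(clique, neighbors):
    """For each neighbor v, emit (clique + v) => the remaining neighbors."""
    for v in neighbors:
        yield clique | {v}, neighbors - {v}


def grow(clique_map_k):
    """One MapReduce round: CliqueMap of size k -> CliqueMap of size k + 1."""
    grouped = defaultdict(list)                      # shuffle buffer
    for clique, neighbors in clique_map_k.items():   # map
        for key, value in clique_map(clique, neighbors):
            grouped[key].append(value)
    # reduce: intersect every candidate neighborhood emitted for the same clique
    return {key: frozenset.intersection(*values) for key, values in grouped.items()}


def maximal_cliques(adjacency):
    current = {frozenset([v]): frozenset(nbrs) for v, nbrs in adjacency.items()}
    maximal = []
    while current:
        # A clique with an empty neighborhood cannot be extended, so it is maximal.
        maximal += [clique for clique, nbrs in current.items() if not nbrs]
        current = grow(current)
    return maximal


level2 = grow({frozenset([v]): frozenset(n) for v, n in adjacency.items()})
cliques = maximal_cliques(adjacency)
```

On this graph the loop stops after the size-4 round, because {b,c,f,g} has no neighbors left and therefore emits nothing.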
  • 36. “Art of Thinking Parallel” is about
      ▪ Transforming the Input Data appropriately
          ▪ e.g. Reverse Indexing (doc-doc similarity)
      ▪ Breaking the problem into smaller ones
          ▪ e.g. Iterative MapReduce (clustering)
      ▪ Designing the Map step – Key/Value output
          ▪ e.g. CliqueMaps in Maximal Cliques
      ▪ Designing the Reduce step – Combine values of key
          ▪ e.g. Intersections in Maximal Cliques