Relational Algebra and MapReduce
Towards High-level Programming Languages
Pietro Michiardi
Eurecom
Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 1 / 23
Sources and Acks
Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with
MapReduce,” Morgan & Claypool Publishers, 2010.
http://lintool.github.io/MapReduceAlgorithms/
Tom White, “Hadoop, The Definitive Guide,” O’Reilly / Yahoo
Press, 2012
Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec, “Mining of
Massive Datasets”, Cambridge University Press, 2013
Relational Algebra
Relational Algebra Introduction
Introduction
Disclaimer
This is not a full course on Relational Algebra
Nor is this a course on SQL
Introduction to Relational Algebra, RDBMS and SQL
Follow the video lectures of the Stanford class on RDBMS
https://www.coursera.org/course/db
→ Note that you have to sign up for an account
Overview of this part
Brief introduction to simplified relational algebra
Useful to understand Pig, Hive and HBase
Relational Algebra Operators
A number of operations on data fit the relational algebra model well
In a traditional RDBMS, queries typically retrieve small amounts of
data
In this course, and in particular in this class, we should keep in
mind the particular workload underlying MapReduce
→ Full scans of large amounts of data
→ Queries are not selective [1]; they process all data
A review of some terminology
A relation is a table
Attributes are the column headers of the table
The set of attributes of a relation is called a schema
Example: R(A1, A2, ..., An) indicates a relation called R whose
attributes are A1, A2, ..., An
[1] This is true in general. However, most ETL jobs involve selection and
projection to do data preparation.
Operators
Let’s start with an example
Below, we have part of a relation called Links describing the
structure of the Web
There are two attributes: From and To
A row, or tuple, of the relation is a pair of URLs, indicating the
existence of a link between them
→ The number of tuples in a real dataset is on the order of billions (10^9)
From    To
url1    url2
url1    url3
url2    url3
url2    url4
...     ...
Relations (however big) can be stored in a distributed
filesystem
If they don’t fit on a single machine, they’re broken into pieces (think
HDFS)
Next, we review and describe a set of relational algebra
operators
Intuitive explanation of what they do
“Pseudo-code” of their implementation in/by MapReduce
Selection: σC(R)
Apply condition C to each tuple of relation R
Output a relation containing only the tuples that satisfy C
Projection: πS(R)
Given a subset S of the attributes of relation R
Output a relation containing, for each tuple, only the components for
the attributes in S
Union, Intersection and Difference
Well-known operators on sets
Apply to the sets of tuples of two relations that have the same
schema
Variations on the theme: the same operators can work on bags (multisets)
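The operators above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: a relation is modelled as a set of tuples, the schema is kept in a separate tuple of attribute names, and the names `select`, `project`, `links` and `other` are hypothetical helpers and data, not part of the slides.

```python
# A relation is modelled as a set of tuples; the schema is a separate tuple
# of attribute names (both are illustrative choices, not from the slides).
LINKS_SCHEMA = ("From", "To")
links = {("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")}


def select(relation, condition):
    """Selection: keep only the tuples that satisfy the condition C."""
    return {t for t in relation if condition(t)}


def project(relation, schema, attributes):
    """Projection: keep only the components for the attributes in S."""
    positions = [schema.index(a) for a in attributes]
    return {tuple(t[i] for i in positions) for t in relation}


# Selection with C = "the link starts at url1".
from_url1 = select(links, lambda t: t[0] == "url1")

# Projection onto S = {To}; using a set removes duplicate tuples.
targets = project(links, LINKS_SCHEMA, ["To"])

# Union, intersection and difference are the usual set operators,
# applied to two relations with the same schema.
other = {("url2", "url4"), ("url5", "url1")}
union = links | other
intersection = links & other
difference = links - other
```

Because Python sets drop duplicates, this sketch follows the set semantics; a bag variant would use lists or `collections.Counter` instead.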
Natural join: R ⋈ S
Given two relations, compare each pair of tuples, one from each
relation
If the tuples agree on all the attributes common to both schemas →
produce an output tuple that has a component for each attribute
Otherwise produce nothing
Join condition can be on a subset of attributes
Let’s work with an example
Recall the Links relation from previous slides
Query (or data processing job): find the paths of length
two in the Web
Join Example
Informally, to satisfy the query we must:
find the triples of URLs in the form (u, v, w) such that there is a link
from u to v and a link from v to w
Using the join operator
Imagine we have two relations (with different schemas), and let’s try
to apply the natural join operator
There are two copies of Links: L1(U1, U2) and L2(U2, U3)
Let’s compute L1 ⋈ L2
For each tuple t1 of L1 and each tuple t2 of L2, see if their U2
components are the same
If yes, then produce a tuple in output, with the schema (U1, U2, U3)
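The procedure just described (compare every pair of tuples, keep the pairs that agree on U2) can be written down directly. The sketch below is a minimal, single-machine illustration that uses the four sample rows of Links; the function name `two_hop_paths` is hypothetical.

```python
# The four sample tuples of the Links relation, as (From, To) pairs.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]


def two_hop_paths(l1, l2):
    """Nested-loop natural join of L1(U1, U2) and L2(U2, U3) on U2."""
    result = []
    for (u1, u2) in l1:
        for (v2, u3) in l2:
            if u2 == v2:  # the two tuples agree on the shared attribute U2
                result.append((u1, u2, u3))
    return result


# Both inputs are copies of Links, so this is a self-join.
paths = two_hop_paths(links, links)
print(paths)
```

With these sample rows, only url1 → url2 can be extended, so the result contains the paths through url2 to url3 and to url4. Every tuple of L1 is compared with every tuple of L2, which is what makes this direct approach expensive on large relations.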
What we have seen is called (to be precise) a self-join
Question: How would you implement a self join in your favorite
programming language?
Question: What is the time complexity of your algorithm?
Question: What is the space complexity of your algorithm?
To continue the example
Say you are not interested in the entire two-hop path but just the
start and end nodes
Then you do a projection and the notation would be: πU1,U3(L1 ⋈ L2)
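A possible single-machine sketch of πU1,U3(L1 ⋈ L2) follows. Note that it swaps in a hash index on the join attribute instead of comparing every pair of tuples; that choice, and the names `hash_self_join` and `project_endpoints`, are assumptions made for illustration, not something the slides prescribe.

```python
from collections import defaultdict

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]


def hash_self_join(links):
    """Join Links with itself on U2 using an in-memory index."""
    by_source = defaultdict(list)  # U2 value -> list of U3 values
    for src, dst in links:
        by_source[src].append(dst)
    # For each (u1, u2), look up the links leaving u2 instead of rescanning.
    return [(u1, u2, u3) for (u1, u2) in links for u3 in by_source.get(u2, [])]


def project_endpoints(paths):
    """Projection onto (U1, U3); the set removes duplicate tuples."""
    return {(u1, u3) for (u1, _, u3) in paths}


endpoints = project_endpoints(hash_self_join(links))
```

The index costs extra memory proportional to the size of Links, but each tuple then only visits its matching partners rather than the whole relation.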
Grouping and Aggregation: γX (R)
Given a relation R, partition its tuples according to their values in
one set of attributes G
The set G is called the grouping attributes
Then, for each group, aggregate the values in certain other
attributes
Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...
In the notation, X is a list of elements that can be:
A grouping attribute
An expression θ(A), where θ is one of the (five) aggregation
functions and A is an attribute NOT among the grouping attributes
The result of this operation is a relation with one tuple for each
group
That tuple has a component for each of the grouping attributes, with
the value common to tuples of that group
That tuple has another component for each aggregation, with the
aggregate value for that group
Let’s work with an example
Imagine that a social-networking site has a relation
Friends(User, Friend)
The tuples are pairs (a, b) such that b is a friend of a
Query: compute the number of friends each member
has
Grouping and Aggregation Example
How to satisfy the query
γUser,COUNT(Friend)(Friends)
This operation groups all the tuples by the value in their first
component
→ There is one group for each user
Then, for each group, it counts the number of friends
Some details
The COUNT operation applied to an attribute does not consider the
values of that attribute
In fact, it counts the number of tuples in the group
In SQL, there is a “count distinct” operator that counts the number
of different values
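The difference between counting tuples and counting distinct values can be seen in a small sketch. The sample data is invented, and it deliberately contains a repeated tuple (so Friends is treated as a bag here, which is an assumption); `group_count` and `group_count_distinct` are hypothetical helper names.

```python
from collections import defaultdict

# Invented sample of Friends(User, Friend); ("ann", "bob") appears twice.
friends = [("ann", "bob"), ("ann", "carl"), ("bob", "ann"), ("ann", "bob")]


def group_count(relation):
    """Group by User and count the tuples in each group."""
    counts = defaultdict(int)
    for user, _friend in relation:
        counts[user] += 1  # the Friend value itself is never inspected
    return sorted(counts.items())


def group_count_distinct(relation):
    """Group by User and count the different Friend values in each group."""
    distinct = defaultdict(set)
    for user, friend in relation:
        distinct[user].add(friend)
    return sorted((user, len(fs)) for user, fs in distinct.items())


print(group_count(friends))
print(group_count_distinct(friends))
```

For user "ann" the first function reports three tuples while the second reports two different friends; on a relation without repeated tuples the two results coincide.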
Relational Algebra Operators and MapReduce
Computing Selection
In practice, selections do not need a full-blown MapReduce
implementation
They can be implemented in the map phase alone
Actually, they could also be implemented in the reduce portion
A MapReduce implementation of σC(R)
Map: For each tuple t in R, check if t satisfies C
If so, emit a key/value pair (t, t)
Reduce: Identity reducer
Question: single or multiple reducers?
NOTE: the output is not exactly a relation
WHY?
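The selection job above can be tried out with a toy, single-process stand-in for the framework. Everything here is an assumption made for illustration: `run_mapreduce` is a hypothetical helper that only mimics the map, group-by-key and reduce steps (it is not the Hadoop API), and the condition C is "the link starts at url1".

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # keys reach the reducer in sorted order
        output.extend(reducer(key, groups[key]))
    return output


def select_mapper(t):
    # Condition C (assumed for this example): the link starts at url1.
    if t[0] == "url1":
        yield (t, t)


def identity_reducer(key, values):
    for value in values:
        yield (key, value)


links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
pairs = run_mapreduce(links, select_mapper, identity_reducer)

# The job output is a list of (t, t) key/value pairs; keeping only the
# keys turns it back into a relation.
selected = [key for key, _value in pairs]
```

The reducer adds nothing to the result, which is why the map phase alone is enough for a selection.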
Computing Projections
Similar process to selection
But projection may cause the same tuple to appear several times
A MapReduce implementation of πS(R)
Map: For each tuple t in R, construct a tuple t' by eliminating those
components whose attributes are not in S
Emit a key/value pair (t', t')
Reduce: For each key t' produced by any of the Map tasks, fetch (t', [t', ..., t'])
Emit a key/value pair (t', t')
NOTE: the reduce operation is duplicate elimination
This operation is associative and commutative, so it is possible to
optimize MapReduce by using a Combiner in each mapper
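A small simulation shows the duplicate elimination at work. As before, `run_mapreduce` is a hypothetical single-process helper standing in for the framework (not the Hadoop API), and the projection onto the To attribute of Links is an assumed example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def project_mapper(t):
    # Build t' by dropping the components not in S; here S = {To}.
    t_prime = (t[1],)
    yield (t_prime, t_prime)


def dedup_reducer(t_prime, values):
    # values is [t', ..., t']; emit a single copy whatever its length.
    yield (t_prime, t_prime)


links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]
projected = [key for key, _ in run_mapreduce(links, project_mapper, dedup_reducer)]
```

In the sample data url3 is the target of two links, so the mapper emits its projected tuple twice and the reducer collapses the two copies into one.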
Computing Unions
Suppose relations R and S have the same schema
Map tasks will be assigned chunks from either R or S
Mappers don’t do much, they just pass tuples on to the reducers
Reducers do duplicate elimination
A MapReduce implementation of union
Map: For each tuple t in R or S, emit a key/value pair (t, t) [2]
Reduce: For each key t there will be either one or two values
Emit (t, t) in either case
[2] Hadoop MapReduce supports reading multiple inputs.
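The union job can be sketched with the same kind of toy simulation. The helper `run_mapreduce` is hypothetical (a single-process stand-in, not the Hadoop API), the sample relations are invented, and concatenating the two lists is only a stand-in for reading multiple inputs.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def identity_mapper(t):
    yield (t, t)


def union_reducer(t, values):
    # One value if t is in a single relation, two if it is in both;
    # emit t once in either case.
    yield (t, t)


# Invented sample relations with the same schema and no internal duplicates.
R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
union = [t for t, _ in run_mapreduce(R + S, identity_mapper, union_reducer)]
```

The tuple (3, 4) occurs in both inputs and appears only once in the result.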
Computing Intersections
Very similar to computing unions
Suppose relations R and S have the same schema
The map function is the same (an identity mapper) as for union
The reduce function must produce a tuple only if both relations
have that tuple
A MapReduce implementation of intersection
Map: For each tuple t in R or S, emit a key/value pair (t, t)
Reduce: If key t has value list [t, t] then emit the key/value pair (t, t)
Otherwise, emit the key/value pair (t, NULL); pairs with a NULL value
are not part of the intersection
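A toy simulation of the intersection job is sketched below. It assumes that R and S contain no duplicate tuples (otherwise a value list [t, t] could come from a single relation), uses Python's `None` in the role of NULL, and relies on a hypothetical single-process helper `run_mapreduce` rather than the Hadoop API.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def identity_mapper(t):
    yield (t, t)


def intersection_reducer(t, values):
    if values == [t, t]:  # t was emitted once for R and once for S
        yield (t, t)
    else:
        yield (t, None)  # None plays the role of NULL


# Invented sample relations with the same schema and no internal duplicates.
R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
pairs = run_mapreduce(R + S, identity_mapper, intersection_reducer)

# Dropping the pairs whose value is None leaves the intersection.
intersection = [t for t, value in pairs if value is not None]
```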
Computing difference
Assume we have two relations R and S with the same
schema
The only way a tuple t can appear in the output is if it is in R but not
in S
The map function passes tuples from R and S to the reducer
NOTE: it must inform the reducer whether the tuple came from R or
S
A MapReduce implementation of difference
Map: For a tuple t in R, emit a key/value pair (t, 'R'), and for a tuple t in S,
emit a key/value pair (t, 'S')
Reduce: For each key t, do the following:
If it is associated to ['R'], then emit (t, t)
If it is associated to ['R', 'S'], ['S', 'R'] or ['S'], emit the key/value
pair (t, NULL)
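The tagging idea can be illustrated with a toy simulation of R − S. The assumptions are the same as in any such sketch: `run_mapreduce` is a hypothetical single-process helper (not the Hadoop API), each input record is pre-labelled with the name of its relation, the relations contain no duplicates, and `None` stands for NULL.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def tagging_mapper(record):
    # Each record arrives as (relation name, tuple); the name becomes the value.
    name, t = record
    yield (t, name)


def difference_reducer(t, names):
    if names == ["R"]:  # t occurs in R and not in S
        yield (t, t)
    else:  # ["R", "S"], ["S", "R"] or ["S"]
        yield (t, None)


# Invented sample relations with the same schema and no internal duplicates.
R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
records = [("R", t) for t in R] + [("S", t) for t in S]
pairs = run_mapreduce(records, tagging_mapper, difference_reducer)
difference = [t for t, value in pairs if value is not None]
```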
Computing the natural Join
This topic is subject to continuous refinements
There are many JOIN operators and many different
implementations
We’ve seen some of them in the laboratory sessions
Let’s look at two relations R(A, B) and S(B, C)
We must find tuples that agree on their B components
We shall use the B-value of tuples from either relation as the key
The value will be the other component and the name of the relation
That way the reducer knows which relation each tuple comes from
A MapReduce implementation of Natural Join
Map: For each tuple (a, b) of R emit the key/value pair (b, ('R', a))
For each tuple (b, c) of S emit the key/value pair (b, ('S', c))
Reduce: Each key b will be associated to a list of pairs that are either ('R', a)
or ('S', c)
Emit key/value pairs of the form
(b, [(a1, b, c1), (a2, b, c2), ...]), with one triple (a, b, c) for each
combination of an ('R', a) and an ('S', c) in the list
NOTES
Question: what if the MapReduce framework did not implement
the distributed (and sorted) group by?
In general, for n tuples in relation R and m tuples in relation S, all
with a common B-value, we end up with n × m tuples in the result
If all tuples of both relations have the same B-value, then we’re
computing the Cartesian product
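The reduce-side join and the n × m growth can be checked on a small invented example. This is a sketch under assumptions: `run_mapreduce` is a hypothetical single-process stand-in for the framework (not the Hadoop API), records are pre-labelled with their relation name, and keys with no matching pair are simply skipped.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def join_mapper(record):
    name, t = record
    if name == "R":  # t = (a, b): key on b, keep a
        a, b = t
        yield (b, ("R", a))
    else:  # t = (b, c): key on b, keep c
        b, c = t
        yield (b, ("S", c))


def join_reducer(b, tagged_values):
    a_values = [v for name, v in tagged_values if name == "R"]
    c_values = [v for name, v in tagged_values if name == "S"]
    triples = [(a, b, c) for a in a_values for c in c_values]
    if triples:  # keys present in only one relation produce nothing
        yield (b, triples)


# Invented sample relations R(A, B) and S(B, C).
R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("x", 20), ("z", 30)]
records = [("R", t) for t in R] + [("S", t) for t in S]
joined = run_mapreduce(records, join_mapper, join_reducer)
```

For the key "x" there are two tuples from R and two from S, so the reducer produces 2 × 2 = 4 triples; "y" and "z" each occur in one relation only and contribute nothing.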
[Figure: Overview of SQL Joins]
Grouping and Aggregation in MapReduce
Let R(A, B, C) be a relation to which we apply γA,θ(B)(R)
The map operation prepares the grouping
The grouping is done by the framework
The reducer computes the aggregation
Simplifying assumptions: one grouping attribute and one
aggregation function
MapReduce implementation of γA,θ(B)(R) [3]
Map: For each tuple (a, b, c) emit the key/value pair (a, b)
Reduce: Each key a represents a group
Apply θ to the list [b1, b2, ..., bn]
Emit the key/value pair (a, x) where x = θ([b1, b2, ..., bn])
[3] Note here that we are also projecting.
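As a final sketch, the grouping and aggregation job can be simulated with θ = SUM. The choice of SUM, the sample tuples and the helper `run_mapreduce` (a hypothetical single-process stand-in for the framework, not the Hadoop API) are all assumptions made for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def grouping_mapper(t):
    # Keep the grouping attribute A as key and B as value; C is dropped,
    # which is the projection mentioned in the note.
    a, b, _c = t
    yield (a, b)


def sum_reducer(a, b_values):
    # theta = SUM, applied to the list [b1, b2, ..., bn] of the group.
    yield (a, sum(b_values))


# Invented sample of R(A, B, C).
R = [("a1", 1, "p"), ("a1", 2, "q"), ("a2", 5, "r")]
result = run_mapreduce(R, grouping_mapper, sum_reducer)
```

Swapping `sum` for `len`, `min` or `max` gives COUNT, MIN or MAX with the same mapper.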
Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 23 / 23

More Related Content

What's hot

3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clusteringKrish_ver2
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Marina Santini
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem pptsunera pathan
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxGovardhanV7
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLRamakant Soni
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitionerSubhas Kumar Ghosh
 
OLAP operations
OLAP operationsOLAP operations
OLAP operationskunj desai
 
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxEX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxvishal choudhary
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methodsKrish_ver2
 
Fuzzy logic and application in AI
Fuzzy logic and application in AIFuzzy logic and application in AI
Fuzzy logic and application in AIIldar Nurgaliev
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsDatamining Tools
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 

What's hot (20)

3.5 model based clustering
3.5 model based clustering3.5 model based clustering
3.5 model based clustering
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods Lecture 6: Ensemble Methods
Lecture 6: Ensemble Methods
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Hadoop And Their Ecosystem ppt
 Hadoop And Their Ecosystem ppt Hadoop And Their Ecosystem ppt
Hadoop And Their Ecosystem ppt
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptx
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
Hadoop combiner and partitioner
Hadoop combiner and partitionerHadoop combiner and partitioner
Hadoop combiner and partitioner
 
Bayesian learning
Bayesian learningBayesian learning
Bayesian learning
 
OLAP operations
OLAP operationsOLAP operations
OLAP operations
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptxEX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
EX-6-Implement Matrix Multiplication with Hadoop Map Reduce.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Fraud and Risk in Big Data
Fraud and Risk in Big DataFraud and Risk in Big Data
Fraud and Risk in Big Data
 
3.2 partitioning methods
3.2 partitioning methods3.2 partitioning methods
3.2 partitioning methods
 
Fuzzy logic and application in AI
Fuzzy logic and application in AIFuzzy logic and application in AI
Fuzzy logic and application in AI
 
Data Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlationsData Mining: Mining ,associations, and correlations
Data Mining: Mining ,associations, and correlations
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 

Similar to Relational Algebra and MapReduce

Unit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfUnit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfajajkhan16
 
Lecture 07 leonidas guibas - networks of shapes and images
Lecture 07   leonidas guibas - networks of shapes and imagesLecture 07   leonidas guibas - networks of shapes and images
Lecture 07 leonidas guibas - networks of shapes and imagesmustafa sarac
 
RelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdfRelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdf10GUPTASOUMYARAMPRAK
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Yueshen Xu
 
Integrative Parallel Programming in HPC
Integrative Parallel Programming in HPCIntegrative Parallel Programming in HPC
Integrative Parallel Programming in HPCVictor Eijkhout
 
Ch4.mapreduce algorithm design
Ch4.mapreduce algorithm designCh4.mapreduce algorithm design
Ch4.mapreduce algorithm designAllenWu
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingcoolmirza143
 
Map reduce
Map reduceMap reduce
Map reducexydii
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)anh tuan
 
Automatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkAutomatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkKinoshita Minoru
 
Graph Matching
Graph MatchingGraph Matching
Graph Matchinggraphitech
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Introduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsIntroduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsAdri Jovin
 

Similar to Relational Algebra and MapReduce (20)

Unit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdfUnit-II DBMS presentation for students.pdf
Unit-II DBMS presentation for students.pdf
 
Lecture 07 leonidas guibas - networks of shapes and images
Lecture 07   leonidas guibas - networks of shapes and imagesLecture 07   leonidas guibas - networks of shapes and images
Lecture 07 leonidas guibas - networks of shapes and images
 
RelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdfRelationalAlgebra-RelationalCalculus-SQL.pdf
RelationalAlgebra-RelationalCalculus-SQL.pdf
 
Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)Aggregation computation over distributed data streams(the final version)
Aggregation computation over distributed data streams(the final version)
 
Integrative Parallel Programming in HPC
Integrative Parallel Programming in HPCIntegrative Parallel Programming in HPC
Integrative Parallel Programming in HPC
 
Ch4.mapreduce algorithm design
Ch4.mapreduce algorithm designCh4.mapreduce algorithm design
Ch4.mapreduce algorithm design
 
Main map reduce
Main map reduceMain map reduce
Main map reduce
 
Map reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreadingMap reduceoriginalpaper mandatoryreading
Map reduceoriginalpaper mandatoryreading
 
Map reduce
Map reduceMap reduce
Map reduce
 
Map reduce
Map reduceMap reduce
Map reduce
 
2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)2004 map reduce simplied data processing on large clusters (mapreduce)
2004 map reduce simplied data processing on large clusters (mapreduce)
 
Relational Algebra-23-04-2023.pdf
Relational Algebra-23-04-2023.pdfRelational Algebra-23-04-2023.pdf
Relational Algebra-23-04-2023.pdf
 
Automatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce FrameworkAutomatic Synthesis of Combiners in the MapReduce Framework
Automatic Synthesis of Combiners in the MapReduce Framework
 
Graph Matching
Graph MatchingGraph Matching
Graph Matching
 
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
MUMS: Agent-based Modeling Workshop - Practical Bayesian Optimization for Age...
 
Lecture 1 mapreduce
Lecture 1  mapreduceLecture 1  mapreduce
Lecture 1 mapreduce
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
 
An Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked ExamplesAn Introduction to MATLAB with Worked Examples
An Introduction to MATLAB with Worked Examples
 
Introduction to Relational Database Management Systems
Introduction to Relational Database Management SystemsIntroduction to Relational Database Management Systems
Introduction to Relational Database Management Systems
 

Recently uploaded

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAbdelrhman abooda
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 

Recently uploaded (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptxAmazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
Amazon TQM (2) Amazon TQM (2)Amazon TQM (2).pptx
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 

Relational Algebra and MapReduce

  • 1. Relational Algebra and MapReduce Towards High-level Programming Languages Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 1 / 23
  • 2. Sources and Acks Jimmy Lin and Chris Dyer, “Data-Intensive Text Processing with MapReduce,” Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/ Tom White, “Hadoop, The Definitive Guide,” O’Reilly / Yahoo Press, 2012 Anand Rajaraman, Jeffrey D. Ullman, Jure Leskovec, “Mining of Massive Datasets”, Cambridge University Press, 2013 Pietro Michiardi (Eurecom) Relational Algebra and MapReduce 2 / 23
  • 3. Relational Algebra and MapReduce
  • 4. Introduction. Disclaimer: this is not a full course on Relational Algebra, nor is it a course on SQL. For an introduction to Relational Algebra, RDBMS and SQL, follow the video lectures of the Stanford class on RDBMS at https://www.coursera.org/course/db (note that you have to sign up for an account). Overview of this part: a brief introduction to simplified relational algebra, useful to understand Pig, Hive and HBase.
  • 5. Relational Algebra Operators. There are a number of operations on data that fit the relational algebra model well. In a traditional RDBMS, queries involve retrieval of small amounts of data. In this course, and in particular in this class, we should keep in mind the particular workload underlying MapReduce: full scans of large amounts of data, with queries that are not selective and process all data. (This is true in general. However, most ETL jobs involve selection and projection to do data preparation.) A review of some terminology: a relation is a table; attributes are the column headers of the table; the set of attributes of a relation is called a schema. Example: R(A1, A2, ..., An) indicates a relation called R whose attributes are A1, A2, ..., An.
  • 6. Operators. Let’s start with an example. Below, we have part of a relation called Links describing the structure of the Web. There are two attributes: From and To. A row, or tuple, of the relation is a pair of URLs, indicating the existence of a link between them. The number of tuples in a real dataset is on the order of billions (10^9).
      From | To
      url1 | url2
      url1 | url3
      url2 | url3
      url2 | url4
      ···  | ···
  • 7. Operators. Relations (however big) can be stored in a distributed filesystem. If they don’t fit on a single machine, they’re broken into pieces (think HDFS). Next, we review and describe a set of relational algebra operators: an intuitive explanation of what they do, and “pseudo-code” of their implementation in/by MapReduce.
  • 8. Operators. Selection, σ_C(R): apply condition C to each tuple of relation R; produce in output a relation containing only the tuples that satisfy C. Projection, π_S(R): given a subset S of the attributes of relation R, produce in output a relation containing, for each tuple, only the components for the attributes in S. Union, Intersection and Difference: well-known operators on sets; they apply to the sets of tuples of two relations that have the same schema. Variations on the theme: they can also work on bags.
  • 9. Operators. Natural join, R ⋈ S: given two relations, compare each pair of tuples, one from each relation. If the tuples agree on all the attributes common to both schemas, produce an output tuple that has components for each attribute; otherwise produce nothing. The join condition can be on a subset of attributes. Let’s work with an example: recall the Links relation from the previous slides. Query (or data processing job): find the paths of length two in the Web.
  • 10. Join Example. Informally, to satisfy the query we must find the triples of URLs of the form (u, v, w) such that there is a link from u to v and a link from v to w. Using the join operator: imagine we have two relations (with different schemas), and let’s try to apply the natural join operator. There are two copies of Links: L1(U1, U2) and L2(U2, U3). Let’s compute L1 ⋈ L2: for each tuple t1 of L1 and each tuple t2 of L2, see if their U2 components are the same; if yes, produce a tuple in output, with the schema (U1, U2, U3).
  • 11. Join Example. What we have seen is called (to be precise) a self-join. Question: how would you implement a self-join in your favorite programming language? Question: what is the time complexity of your algorithm? Question: what is the space complexity of your algorithm? To continue the example: say you are not interested in the entire two-hop path but just in the start and end nodes. Then you do a projection, and the notation would be π_{U1,U3}(L1 ⋈ L2).
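One possible answer to the implementation question, as a minimal single-machine sketch rather than the course's reference solution: instead of comparing every pair of tuples, it indexes the second copy of Links by its U2 component (a hash-based join, swapped in for the pairwise comparison described on the slide). The function name `two_hop_paths` is hypothetical, and only the four sample rows of Links are used.

```python
from collections import defaultdict

# The four sample rows of the Links relation shown earlier.
links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def two_hop_paths(links):
    """Self-join L1(U1, U2) with L2(U2, U3) on the shared U2 component."""
    # Index the L2 copy by its U2 component.
    by_source = defaultdict(list)
    for u2, u3 in links:
        by_source[u2].append(u3)
    # Pair every L1 tuple (u1, u2) with each L2 tuple that starts at u2.
    return [(u1, u2, u3) for u1, u2 in links for u3 in by_source.get(u2, [])]

paths = two_hop_paths(links)

# Projection on (U1, U3): keep only start and end nodes, without duplicates.
endpoints = sorted({(u1, u3) for u1, _, u3 in paths})

print(paths)      # [('url1', 'url2', 'url3'), ('url1', 'url2', 'url4')]
print(endpoints)  # [('url1', 'url3'), ('url1', 'url4')]
```

The index costs extra memory proportional to the size of Links, which is one way to think about the space-complexity question.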
  • 12. Operators. Grouping and Aggregation, γ_X(R): given a relation R, partition its tuples according to their values in one set of attributes G, called the grouping attributes. Then, for each group, aggregate the values in certain other attributes. Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ... In the notation, X is a list of elements, each of which can be a grouping attribute, or an expression θ(A), where θ is one of the (five) aggregation functions and A is an attribute NOT among the grouping attributes.
  • 13. Operators. Grouping and Aggregation, γ_X(R): the result of this operation is a relation with one tuple for each group. That tuple has a component for each of the grouping attributes, with the value common to the tuples of that group, and another component for each aggregation, with the aggregate value for that group. Let’s work with an example: imagine that a social-networking site has a relation Friends(User, Friend), whose tuples are pairs (a, b) such that b is a friend of a. Query: compute the number of friends each member has.
  • 14. Grouping and Aggregation Example. How to satisfy the query: γ_{User,COUNT(Friend)}(Friends). This operation groups all the tuples by the value in their first component, so there is one group for each user. Then, for each group, it counts the number of friends. Some details: the COUNT operation applied to an attribute does not consider the values of that attribute; in fact, it counts the number of tuples in the group. In SQL, there is a “count distinct” operator that counts the number of different values.
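The difference between COUNT and “count distinct” can be seen on a small, made-up Friends relation. This is only a sketch: the user names are hypothetical, and the repeated tuple is included on purpose so that the two counts differ (bag semantics).

```python
from collections import Counter, defaultdict

# Hypothetical Friends(User, Friend) tuples; ("alice", "bob") appears twice.
friends = [("alice", "bob"), ("alice", "carol"), ("bob", "alice"), ("alice", "bob")]

# COUNT(Friend): number of tuples in each User group, values are not inspected.
friend_count = Counter(user for user, _ in friends)

# "Count distinct": number of different Friend values in each User group.
distinct = defaultdict(set)
for user, friend in friends:
    distinct[user].add(friend)
distinct_friend_count = {user: len(values) for user, values in distinct.items()}

print(dict(friend_count))      # {'alice': 3, 'bob': 1}
print(distinct_friend_count)   # {'alice': 2, 'bob': 1}
```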
  • 15. Computing Selection. In practice, selections do not need a full-blown MapReduce implementation: they can be implemented in the map phase alone (actually, they could also be implemented in the reduce portion). A MapReduce implementation of σ_C(R):
      Map: for each tuple t in R, check if t satisfies C; if so, emit a key/value pair (t, t).
      Reduce: identity reducer.
      Question: single or multiple reducers? NOTE: the output is not exactly a relation. WHY?
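The selection pseudo-code can be tried out with a tiny in-memory stand-in for the framework. Everything here is an assumption made for illustration: `run_mapreduce` is a hypothetical helper (not a Hadoop API) that maps, groups by key and reduces on one machine, and the condition C is arbitrarily chosen as "From equals url1".

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def select_map(t):
    # Condition C (an assumption for this example): the From attribute is url1.
    if t[0] == "url1":
        yield (t, t)

def identity_reduce(key, values):
    for value in values:
        yield (key, value)

pairs = run_mapreduce(links, select_map, identity_reduce)
# The job emits (t, t) key/value pairs; keeping the keys gives back tuples.
selected = [key for key, _ in pairs]
print(selected)  # [('url1', 'url2'), ('url1', 'url3')]
```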
  • 16. Computing Projections. Similar process to selection, but projection may cause the same tuple to appear several times. A MapReduce implementation of π_S(R):
      Map: for each tuple t in R, construct a tuple t′ by eliminating those components whose attributes are not in S; emit a key/value pair (t′, t′).
      Reduce: for each key t′ produced by any of the Map tasks, fetch (t′, [t′, ···, t′]); emit a key/value pair (t′, t′).
      NOTE: the reduce operation is duplicate elimination. This operation is associative and commutative, so it is possible to optimize MapReduce by using a Combiner in each mapper.
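A minimal sketch of the projection job, assuming a hypothetical in-memory helper `run_mapreduce` (not a Hadoop API) and projecting the sample Links rows onto the single attribute From. The Combiner optimization is not modeled.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

def project_map(t):
    # S = {From}: drop every component whose attribute is not in S.
    t_prime = (t[0],)
    yield (t_prime, t_prime)

def dedup_reduce(key, values):
    # However many copies of t' arrived, emit it exactly once.
    yield (key, key)

projected = [key for key, _ in run_mapreduce(links, project_map, dedup_reduce)]
print(projected)  # [('url1',), ('url2',)]
```

Without the reducer, the map output would contain each From value twice, which is the duplication the slide warns about.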
  • 17. Computing Unions. Suppose relations R and S have the same schema. Map tasks will be assigned chunks from either R or S. Mappers don’t do much: they just pass tuples on to the reducers, which do duplicate elimination. A MapReduce implementation of union:
      Map (Hadoop MapReduce supports reading multiple inputs): for each tuple t in R or S, emit a key/value pair (t, t).
      Reduce: for each key t there will be either one or two values; emit (t, t) in either case.
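The union job can be sketched the same way. The helper `run_mapreduce` and the two small relations are hypothetical, and concatenating `R + S` stands in for the framework reading multiple inputs.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Two made-up relations with the same schema; (3, 4) is in both.
R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]

def identity_map(t):
    yield (t, t)

def union_reduce(key, values):
    # One or two values arrive per key; emit the tuple once either way.
    yield (key, key)

union = [key for key, _ in run_mapreduce(R + S, identity_map, union_reduce)]
print(union)  # [(1, 2), (3, 4), (5, 6)]
```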
  • 18. Computing Intersections. Very similar to computing unions. Suppose relations R and S have the same schema. The map function is the same as for union (an identity mapper). The reduce function must produce a tuple only if both relations have that tuple. A MapReduce implementation of intersection:
      Map: for each tuple t in R or S, emit a key/value pair (t, t).
      Reduce: if key t has value list [t, t], emit the key/value pair (t, t); otherwise, emit the key/value pair (t, NULL).
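A sketch of the intersection job under the same assumptions (hypothetical `run_mapreduce` helper, made-up relations). It relies on R and S being sets, so a value list of length two can only mean "present in both"; NULL is represented by Python's `None`.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Made-up relations with the same schema and no duplicates inside each one.
R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]

def identity_map(t):
    yield (t, t)

def intersection_reduce(key, values):
    # Value list [t, t] means the tuple came from both relations.
    if len(values) == 2:
        yield (key, key)
    else:
        yield (key, None)

pairs = run_mapreduce(R + S, identity_map, intersection_reduce)
intersection = [key for key, value in pairs if value is not None]
print(intersection)  # [(3, 4)]
```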
  • 19. Computing Difference. Assume we have two relations R and S with the same schema. The only way a tuple t can appear in the output is if it is in R but not in S. The map function passes tuples from R and S to the reducer. NOTE: it must inform the reducer whether the tuple came from R or S. A MapReduce implementation of difference:
      Map: for a tuple t in R, emit a key/value pair (t, ‘R’), and for a tuple t in S, emit a key/value pair (t, ‘S’).
      Reduce: for each key t, if it is associated to [‘R’], emit (t, t); if it is associated to [‘R’, ‘S’], [‘S’, ‘R’] or [‘S’], emit the key/value pair (t, NULL).
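For the difference, the mapper needs to know where each tuple came from. In this sketch that is modeled by tagging every input record with its relation name before the job runs; the tagging scheme, the helper `run_mapreduce` and the sample relations are all assumptions made for illustration.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

R = [(1, 2), (3, 4)]
S = [(3, 4), (5, 6)]
# Each input record carries the name of the relation it was read from.
tagged = [("R", t) for t in R] + [("S", t) for t in S]

def tag_map(record):
    name, t = record
    yield (t, name)

def difference_reduce(key, values):
    # Keep t only when every value says it came from R.
    if set(values) == {"R"}:
        yield (key, key)
    else:
        yield (key, None)

pairs = run_mapreduce(tagged, tag_map, difference_reduce)
difference = [key for key, value in pairs if value is not None]
print(difference)  # [(1, 2)]
```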
  • 20. Computing the Natural Join. This topic is subject to continuous refinements: there are many JOIN operators and many different implementations, and we’ve seen some of them in the laboratory sessions. Let’s look at two relations R(A, B) and S(B, C). We must find tuples that agree on their B components. We shall use the B-value of tuples from either relation as the key. The value will be the other component and the name of the relation, so that the reducer knows which relation each tuple comes from.
  • 21. Computing the Natural Join. A MapReduce implementation of natural join:
      Map: for each tuple (a, b) of R, emit the key/value pair (b, (‘R’, a)); for each tuple (b, c) of S, emit the key/value pair (b, (‘S’, c)).
      Reduce: each key b will be associated to a list of pairs that are either (‘R’, a) or (‘S’, c); emit key/value pairs of the form (b, [(a1, b, c1), (a2, b, c2), ···, (an, b, cn)]).
      NOTES. Question: what if the MapReduce framework did not implement the distributed (and sorted) group by? In general, for n tuples in relation R and m tuples in relation S, all with a common B-value, we end up with nm tuples in the result. If all tuples of both relations have the same B-value, then we’re computing the Cartesian product.
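The reduce-side join above can be sketched on one machine as follows. The helper `run_mapreduce`, the record tagging and the sample tuples are hypothetical; the reducer here also skips keys that appear in only one relation, which is a choice of this sketch.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Made-up relations R(A, B) and S(B, C).
R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]
tagged = [("R", t) for t in R] + [("S", t) for t in S]

def join_map(record):
    name, t = record
    if name == "R":
        a, b = t
        yield (b, ("R", a))
    else:
        b, c = t
        yield (b, ("S", c))

def join_reduce(b, values):
    a_values = [v for name, v in values if name == "R"]
    c_values = [v for name, v in values if name == "S"]
    joined = [(a, b, c) for a in a_values for c in c_values]
    if joined:  # keys present in only one relation produce nothing
        yield (b, joined)

result = run_mapreduce(tagged, join_map, join_reduce)
print(result)
```

With two R tuples and two S tuples sharing the B-value b1, the single output pair holds 2 × 2 = 4 joined tuples, matching the nm count in the notes.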
  • 22. Overview of SQL Joins (figure)
  • 23. Grouping and Aggregation in MapReduce. Let R(A, B, C) be a relation to which we apply γ_{A,θ(B)}(R). The map operation prepares the grouping, the grouping is done by the framework, and the reducer computes the aggregation. Simplifying assumptions: one grouping attribute and one aggregation function. MapReduce implementation of γ_{A,θ(B)}(R) (note here that we are also projecting):
      Map: for each tuple (a, b, c), emit the key/value pair (a, b).
      Reduce: each key a represents a group; apply θ to the list [b1, b2, ···, bn]; emit the key/value pair (a, x), where x = θ([b1, b2, ···, bn]).
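A minimal sketch of this job, with θ passed in as an ordinary Python function. The helper `run_mapreduce`, the factory `make_aggregate_reduce` and the sample relation are hypothetical names and data chosen for the example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-machine stand-in: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Made-up relation R(A, B, C).
R = [("x", 1, "p"), ("x", 2, "q"), ("y", 5, "r")]

def group_map(t):
    a, b, _c = t      # C is dropped here, which is the projection step
    yield (a, b)

def make_aggregate_reduce(theta):
    def aggregate_reduce(a, b_values):
        yield (a, theta(b_values))
    return aggregate_reduce

sums = run_mapreduce(R, group_map, make_aggregate_reduce(sum))
maxima = run_mapreduce(R, group_map, make_aggregate_reduce(max))
print(sums)    # [('x', 3), ('y', 5)]
print(maxima)  # [('x', 2), ('y', 5)]
```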