www.edureka.co/mapreduce-design-patterns
Top 3 Design Patterns in MapReduce
Agenda
Today we will take you through the following:
 Summarization Patterns
» Numerical Summarization (Hands On)
 Filter Patterns
» Finding Top K records (Hands On)
 Join Patterns
» Reduce side join (Hands On)
MapReduce Review
Why MapReduce Design Patterns - Question
Let's broach this topic with a few questions.
 Would you use standard sorting algorithms on the MapReduce framework?
» Quick Sort, Merge Sort etc.? No.
» Why?
 MapReduce imposes constraints like any other framework
» You have to think in terms of Map tasks and Reduce tasks
» Programmer has little control over many aspects of execution
 But MapReduce does provide a number of techniques for controlling flow of data
MapReduce Paradigm - Constraints (Contd.)
 Programmer has little control over many aspects of execution
» Where a mapper or reducer runs
» When a mapper or reducer begins or finishes
» Which input key-value pairs are processed by a specific mapper
» Which intermediate key-value pairs are processed by a specific reducer
Why MapReduce Design Patterns - Answer
 Because of the constraints discussed in the earlier slides
» Design patterns capture the best ways people have already found to solve these problems
 Because of the MapReduce techniques for controlling execution & flow of data
» Apply these techniques to problems in standard, proven ways
 Judicious use of the Distributed Cache and a sorting comparator can help in quite a few algorithms
 Scalability & Efficiency concerns
Summarization Patterns – What is it
 Provides a high-level aggregate view of a data set when visual inspection of the whole data set is not feasible
 Group similar data together and perform an operation such as
» Calculating a statistic, indexing, counting etc.
 Apply to a new data set to quickly understand what's important and what to look at closely
 Example
» Number of hits per hour per location on a website in a web log
» Average length of comments per user in blog comments
» Top ten salaries per profession, by region
Numerical Summarizations – Description
 General pattern for calculating aggregate statistics over a data set
 Group records by a key field and calculate a numerical aggregate per group
» Min, max, sum, average, median, standard deviation etc.
 Use Combiner properly for efficient implementation
 Example
» Take advertising actions based on the hours users are most active on your site
» Hourly average of the amount users spend on your site
 Applicability – Use it when
» You are dealing with numerical data or counting
» The data can be grouped by fields
Numerical Summarizations – Structure
 Mapper
» Output Key = field to group by; Output Value = numerical item to summarize on
» Make sure only relevant items are output from the mapper, to reduce network traffic
 Combiner
» Use if the summarization operation in the reducer is associative & commutative
» Will reduce the network traffic between Map tasks & Reduce tasks
Numerical Summarizations – Structure (Contd.)
 Partitioner
» Use a custom partitioner if the data is skewed
» To distribute computation uniformly across reducers
 Reducer
» Each reducer applies the summarization function to the values received for each group key
» Output key = group key; output value = summarization statistic
» Job output is a set of part files containing a single record per reducer input group
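The mapper, combiner and reducer roles above can be sketched in plain Python, without Hadoop. This is a minimal simulation under stated assumptions: the `RECORDS` data and the names `map_record`, `merge` and `run_job` are hypothetical, and a dictionary stands in for the shuffle. It computes min, max and count per key, and reuses the same merge function as combiner and reducer because that operation is associative and commutative.

```python
from collections import defaultdict

# Hypothetical input: (user_id, hour_of_comment) records; not taken from the slides.
RECORDS = [("alice", 9), ("bob", 14), ("alice", 17), ("bob", 11), ("alice", 3)]


def map_record(record):
    """Mapper: key = field to group by, value = (min, max, count) for one record."""
    user, hour = record
    return user, (hour, hour, 1)


def merge(a, b):
    """Associative and commutative, so it can act as both combiner and reducer."""
    return min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2]


def run_job(splits):
    shuffled = defaultdict(list)               # stands in for the shuffle phase
    for split in splits:                       # one simulated map task per split
        combined = {}
        for record in split:
            key, value = map_record(record)
            combined[key] = merge(combined[key], value) if key in combined else value
        for key, value in combined.items():    # only combined values leave the map task
            shuffled[key].append(value)
    result = {}
    for key, values in shuffled.items():       # reducer: one output record per group key
        acc = values[0]
        for value in values[1:]:
            acc = merge(acc, value)
        result[key] = acc
    return result


summary = run_job([RECORDS[:3], RECORDS[3:]])
print(summary)
```

With the combiner in place, each simulated map task sends at most one value per key to the reducer instead of one value per record.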
Numerical Summarizations – Analogy, Performance
 Performance
» The crux of this pattern, grouping by key, is what MapReduce provides at its core
» Performs well when the combiner is used properly
» For a skewed data set, use a custom partitioner for improved performance
» Use an appropriate number of reducers
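To illustrate the custom partitioner point, here is a small sketch in plain Python. Everything in it is an assumption for illustration: `hash_partition` merely imitates a default hash partitioner (using `zlib.crc32` so the result is repeatable), and `skew_aware_partition` is one hypothetical strategy that gives each known hot key a reducer of its own. All records for a key still go to the same reducer, so this only keeps a hot key from sharing a reducer with other keys.

```python
import zlib
from collections import Counter


def hash_partition(key, num_reducers):
    """Stand-in for a default hash partitioner (crc32 keeps the demo deterministic)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def skew_aware_partition(key, num_reducers, hot_keys):
    """Hypothetical custom partitioner: each hot key gets a dedicated reducer,
    all other keys are hashed over the remaining reducers."""
    if num_reducers <= len(hot_keys):
        raise ValueError("need more reducers than hot keys")
    if key in hot_keys:
        return hot_keys.index(key)
    remaining = num_reducers - len(hot_keys)
    return len(hot_keys) + zlib.crc32(key.encode("utf-8")) % remaining


def load_per_reducer(keys, partition):
    """Count how many records each reducer would receive."""
    return Counter(partition(k) for k in keys)


# Hypothetical skewed key distribution: one country dominates the data.
keys = ["us"] * 90 + ["fr", "de", "in", "jp", "br"] * 2
hot = ["us"]

default_load = load_per_reducer(keys, lambda k: hash_partition(k, 4))
custom_load = load_per_reducer(keys, lambda k: skew_aware_partition(k, 4, hot))
print("default:", dict(default_load))
print("custom: ", dict(custom_load))
```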
Numerical Summarizations – Use Cases
 Min/Max/Count
» Analytics to find minimum, maximum, count of an event
 Average/Median/Standard Deviation
» Analytics similar to Min/Max/Count
» Implementation is not as straightforward because these operations are not associative, so the reducer logic cannot be reused directly as a combiner
 Record Count
» Common analytics to get a heartbeat of the data flow rate over a particular interval
 Word Count
» Basic Text Analytics of word count in a document
» Hello World of MapReduce
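For the Average use case, one common workaround is to carry (sum, count) pairs through the combiner and divide only at the end. The sketch below is plain Python with hypothetical data and names (`combine`, `map_task_1`, `map_task_2`); it also shows why averaging the per-task averages gives a different, wrong answer.

```python
def combine(values):
    """Merge partial (sum, count) pairs; addition is associative and commutative."""
    values = list(values)                 # allow generators to be read twice
    total = sum(s for s, _ in values)
    count = sum(c for _, c in values)
    return total, count


# Hypothetical comment lengths for one user, seen by two different map tasks.
map_task_1 = [10, 20, 30]
map_task_2 = [100]

# Combiner output of each map task.
partial_1 = combine((v, 1) for v in map_task_1)
partial_2 = combine((v, 1) for v in map_task_2)

# Reducer: merge the partials, then divide once.
total, count = combine([partial_1, partial_2])
average = total / count

# Averaging the per-task averages ignores how many records each task saw.
naive = (partial_1[0] / partial_1[1] + partial_2[0] / partial_2[1]) / 2

print(average, naive)
```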
Min/Max/Count Example – Data Flow
DEMO
Min/Max/Count Example
Filtering Patterns – What is it
 Finding a subset of interest from a large data set
 So that further analytics can be applied on this subset
 These patterns don't alter the original dataset
Example:
 Sampling – to get a representative sample to feed to machine learning algorithms
 Selecting all records for a user to apply further analytics
Basic Filtering Pattern – Description
 Acts as an abstract base pattern for several other filtering patterns
 Filter out records that are not of interest and keep the ones that are
 A parallel processing system like Hadoop is required due to the large size of the original data set
 The filtered subset may be large or small
Example: To study the behaviour of users between 10 and 11am, keep only the log file records from that hour
Applicability – Use it when
 Widely applicable
 Use it when data can be easily parsed to yield a filtering criterion
Basic Filtering Pattern – Structure
Basic Filtering Pattern – Structure (Contd.)
Mapper
 Applies filtering criteria to each record it receives
 Outputs records that match the filtering criteria
 Output key/value pairs same as input key/value pairs
Combiner
 Not Required; map only job
Partitioner
 Not Required; map only job
Reducer
 Generally not required; map-only job
 An identity reducer can be used if needed
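The map-only structure above can be sketched as a single generator in plain Python. The log lines, the byte-offset keys and the name `filter_mapper` are hypothetical; the point is only that each record is tested against the criterion and emitted unchanged when it matches, with no combiner, partitioner or reducer involved.

```python
import re


def filter_mapper(records, pattern):
    """Map-only filter: emit each (key, value) pair unchanged if the value matches."""
    regex = re.compile(pattern)
    for key, value in records:
        if regex.search(value):
            yield key, value


# Hypothetical log lines keyed by byte offset.
LOG = [
    (0, "2014-03-01 10:15:02 GET /index.html"),
    (36, "2014-03-01 11:02:47 GET /about.html"),
    (72, "2014-03-01 10:59:59 POST /login"),
]

# Keep only the records from the 10-11am window.
kept = list(filter_mapper(LOG, r" 10:\d\d:\d\d "))
for key, value in kept:
    print(key, value)
```

Swapping the regular expression for any other predicate gives the other filtering use cases, such as distributed grep or data cleansing.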
Basic Filtering Pattern – Use Cases
 Closer view of data
 Removing low scoring data
 Distributed grep
 Data cleansing
 Simple random sampling
 Tracking a thread of events
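As one illustration of these use cases, simple random sampling can be written as a map-only filter whose criterion is a random draw. This is a sketch in plain Python; `sample_mapper`, `fraction` and `seed` are hypothetical names, and the seed is there only to make the demo repeatable.

```python
import random


def sample_mapper(records, fraction, seed=None):
    """Map-only simple random sampling: keep each record with probability `fraction`."""
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


# Hypothetical data set of 10,000 record ids; keep roughly 10% of them.
sample = list(sample_mapper(range(10_000), fraction=0.1, seed=42))
print(len(sample))
```

The sample size is only approximately `fraction` times the input size, because every record is kept or dropped independently.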
Top Ten – Description
 Retrieve a fixed, relatively small number (10) of records from a large data set
 Based on a ranking criterion that defines a total ordering
 You can manually look at this small number of records to see what's special about them
 Important in terms of how one would implement Top Ten in MapReduce vis-a-vis SQL
» In SQL or any programming language you would sort and then take the top 10
» In MapReduce, total order sorting is complex and resource intensive
Example: Top ten users with the highest number of comments posted on StackOverflow in 2014
Top Ten – Applicability
Applicability – Use it when
 A comparator function is available for ranking records
 The number of output records is much smaller than the number of input records
» If not, one is better off sorting the whole data set
Top Ten – Structure
Top Ten – Structure (Contd.)
Mapper
 In the setup() method, initialize an array of size k (= 10)
 In map(), insert the record into the array in sorted order
 If sizeOf(array) > 10, truncate the array to size 10, keeping the highest 10
 In cleanup(), read the array and output key = null and value = record
Combiner and custom Partitioner not required
Reducer
 Since the number of output records from the mappers is small, only 1 reducer is used
 The reducer does the same thing as the mapper
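The Top Ten mapper and reducer logic can be sketched in plain Python as follows. Note the substitution: instead of a sorted array that is truncated after each insert, this sketch uses a min-heap from the standard `heapq` module, which keeps the same ten highest records. The function names and the `(score, record)` input shape are assumptions for illustration.

```python
import heapq

K = 10


def top_k_mapper(records, k=K):
    """Keep the k highest-scoring records seen by one map task and emit
    them at the end, as cleanup() would."""
    heap = []                                    # min-heap of (score, record)
    for score, record in records:
        if len(heap) < k:
            heapq.heappush(heap, (score, record))
        elif score > heap[0][0]:                 # beats the current k-th best
            heapq.heapreplace(heap, (score, record))
    return [(None, item) for item in heap]       # key = null, value = record


def top_k_reducer(values, k=K):
    """Single reducer: same selection over at most k candidates per mapper."""
    return heapq.nlargest(k, values)


# Hypothetical input: two splits of (comment_count, user) records.
splits = [
    [(i, f"user{i}") for i in range(0, 50)],
    [(i, f"user{i}") for i in range(50, 100)],
]
candidates = [value for split in splits for _, value in top_k_mapper(split)]
top_ten = top_k_reducer(candidates)
print(top_ten)
```

Each map task sends at most ten records to the reducer, which is why a single reducer is enough.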
Top Ten – Use Cases
 Outlier analysis
 Select interesting data for downstream BI systems that cannot handle Big Data sets
 Publish interesting dashboards
DEMO
Top Ten Example
Join Patterns – What is it
 Datasets generally exist in multiple sources
 Deriving full value from them requires merging them together
 Join Patterns are used for this purpose
 Performing joins on the fly on Big Data can be costly in terms of time
Example: Joining StackOverflow data from Comments & Posts on UserId
Join – Refresher
 Inner Join
 Outer Join
» Left Outer Join
» Right Outer Join
» Full Outer Join
 Anti Join
 Cartesian Product
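As a quick refresher on what each join type returns, here is a toy example in plain Python. The two tables are hypothetical and hold one row per key to keep the output small. The anti join shown assumes the "full outer join minus inner join" definition; some systems use the term for unmatched rows of one side only.

```python
# Hypothetical toy tables keyed by user id, one row per key.
users = {1: "alice", 2: "bob", 3: "carol"}
comments = {2: "nice post", 3: "thanks", 4: "first!"}

# Inner join: keys present on both sides.
inner = {k: (users[k], comments[k]) for k in users.keys() & comments.keys()}

# Outer joins: keep unmatched rows from one or both sides, padded with None.
left_outer = {k: (users[k], comments.get(k)) for k in users}
right_outer = {k: (users.get(k), comments[k]) for k in comments}
full_outer = {k: (users.get(k), comments.get(k))
              for k in users.keys() | comments.keys()}

# Anti join (assumed definition): rows that have no match on the other side.
anti = {k: (users.get(k), comments.get(k))
        for k in users.keys() ^ comments.keys()}

# Cartesian product: every row of one table paired with every row of the other.
cartesian = [(u, c) for u in users.values() for c in comments.values()]

print(inner)
print(anti)
```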
Reduce Side Join – Description
 Easiest to implement but can be longest to execute
 Supports all types of join operation
 Can join multiple data sources, but is expensive in terms of network resources & time
 All data is transferred across the network
Example: Join PostLinks table data in StackOverflow to Posts data
Reduce Side Join – Description (Contd.)
 Applicability – Use it when
» Multiple large data sets need to be joined
» (If one of the data sources is small, consider a replicated join instead)
» Different data sources are linked by a foreign key
» You want all join operations to be supported
Reduce Side Join – Structure
Reduce Side Join – Structure (Contd.)
 Mapper
» Output key should be the foreign key
» Value can be the whole record plus a tag identifying the source
» Use projection and output only the required fields
 Combiner
» Not required; no additional benefit
 Partitioner
» Use a custom partitioner if required
 Reducer
» Reducer logic based on type of join required
» Reducer receives the data from all the different sources per key
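The reduce side join structure above can be sketched in plain Python. This is a simulation under stated assumptions: the `USERS` and `COMMENTS` data, the source tags "U" and "C", and all function names are hypothetical, and a dictionary stands in for the shuffle. Only inner and left outer joins are shown; other join types would add further branches to the reducer.

```python
from collections import defaultdict

# Hypothetical projected records: (user_id, field).
USERS = [(1, "alice"), (2, "bob"), (3, "carol")]
COMMENTS = [(2, "nice post"), (2, "agreed"), (3, "thanks"), (4, "first!")]


def tag_mapper(records, tag):
    """Mapper: key = foreign key, value = (source tag, projected record)."""
    for key, value in records:
        yield key, (tag, value)


def join_reducer(values, join_type):
    """Reducer: split the values by source, then pair them per join type."""
    left = [v for tag, v in values if tag == "U"]
    right = [v for tag, v in values if tag == "C"]
    if join_type == "inner":
        return [(l, r) for l in left for r in right]
    if join_type == "left_outer":
        return [(l, r) for l in left for r in (right or [None])]
    raise ValueError(f"unsupported join type: {join_type}")


def reduce_side_join(join_type):
    shuffled = defaultdict(list)                 # stands in for shuffle and sort
    tagged = list(tag_mapper(USERS, "U")) + list(tag_mapper(COMMENTS, "C"))
    for key, value in tagged:
        shuffled[key].append(value)
    result = {}
    for key in sorted(shuffled):                 # one reduce call per foreign key
        rows = join_reducer(shuffled[key], join_type)
        if rows:
            result[key] = rows
    return result


inner_rows = reduce_side_join("inner")
left_rows = reduce_side_join("left_outer")
print(inner_rows)
print(left_rows)
```

Every tagged record passes through the simulated shuffle, which mirrors why this pattern is expensive on the network.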
Reduce Side Join – Performance
 Performance
» The entire data set moves across the network to the reducers
» You can optimize by using projection and sending only the required fields
» The number of reducers is typically higher than normal
» If you can use any other Join type for your problem, use that instead
DEMO
Reduce Side Join Example
TOP MAPREDUCE DESIGN PATTERNS

More Related Content

What's hot

Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analyticsAvinash Pandu
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingMohammad Mustaqeem
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analyticsAvinash Pandu
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111NavNeet KuMar
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseJonathan Bloom
 
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseLarge-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseAapo Kyrölä
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCAapo Kyrölä
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processinghuguk
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reducePaladion Networks
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersSaliya Ekanayake
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 HiveNamit Jain
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkIRJET Journal
 
Lecture 24
Lecture 24Lecture 24
Lecture 24Shani729
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Martin Junghanns
 
Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...Shinwoo Jang
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Shardinginside-BigData.com
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map ReduceApache Apex
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins Edureka!
 

What's hot (20)

Applying stratosphere for big data analytics
Applying stratosphere for big data analyticsApplying stratosphere for big data analytics
Applying stratosphere for big data analytics
 
Application of MapReduce in Cloud Computing
Application of MapReduce in Cloud ComputingApplication of MapReduce in Cloud Computing
Application of MapReduce in Cloud Computing
 
Stratosphere with big_data_analytics
Stratosphere with big_data_analyticsStratosphere with big_data_analytics
Stratosphere with big_data_analytics
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 
Intro to Hybrid Data Warehouse
Intro to Hybrid Data WarehouseIntro to Hybrid Data Warehouse
Intro to Hybrid Data Warehouse
 
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defenseLarge-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
Large-Scale Graph Computation on Just a PC: Aapo Kyrola Ph.D. thesis defense
 
Large-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PCLarge-scale Recommendation Systems on Just a PC
Large-scale Recommendation Systems on Just a PC
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Analysing of big data using map reduce
Analysing of big data using map reduceAnalysing of big data using map reduce
Analysing of big data using map reduce
 
High Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC ClustersHigh Performance Data Analytics with Java on Large Multicore HPC Clusters
High Performance Data Analytics with Java on Large Multicore HPC Clusters
 
Hadoop Summit 2009 Hive
Hadoop Summit 2009 HiveHadoop Summit 2009 Hive
Hadoop Summit 2009 Hive
 
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame WorkA Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
A Big-Data Process Consigned Geographically by Employing Mapreduce Frame Work
 
Lecture 24
Lecture 24Lecture 24
Lecture 24
 
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink Forward 2015
 
Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...Building data fusion surrogate models for spacecraft aerodynamic problems wit...
Building data fusion surrogate models for spacecraft aerodynamic problems wit...
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Scalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data ShardingScalable Machine Learning: The Role of Stratified Data Sharding
Scalable Machine Learning: The Role of Stratified Data Sharding
 
Machine Learning - Unsupervised Learning
Machine Learning - Unsupervised LearningMachine Learning - Unsupervised Learning
Machine Learning - Unsupervised Learning
 
Introduction to Map Reduce
Introduction to Map ReduceIntroduction to Map Reduce
Introduction to Map Reduce
 
Reduce Side Joins
Reduce Side Joins Reduce Side Joins
Reduce Side Joins
 

Viewers also liked

A new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingA new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingThibault Dory
 
[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4yyooooon
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 

Viewers also liked (11)

A new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingA new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarking
 
[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 

Similar to TOP MAPREDUCE DESIGN PATTERNS

Mrdp reduce side_join
Mrdp reduce side_joinMrdp reduce side_join
Mrdp reduce side_joinEdureka!
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsDilum Bandara
 
Download It
Download ItDownload It
Download Itbutest
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...AzarulIkhwan
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsRobert Grossman
 
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSUSING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSHCL Technologies
 
Using dask for large systems of financial models
Using dask for large systems of financial modelsUsing dask for large systems of financial models
Using dask for large systems of financial modelsPetr Wolf
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5Robert Grossman
 
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotDistributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotCitus Data
 
Useful Tools for Problem Solving by Operational Excellence Consulting
Useful Tools for Problem Solving by Operational Excellence ConsultingUseful Tools for Problem Solving by Operational Excellence Consulting
Useful Tools for Problem Solving by Operational Excellence ConsultingOperational Excellence Consulting
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchArtemSunfun
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...butest
 
Victor Chang: Cloud computing business framework
Victor Chang: Cloud computing business frameworkVictor Chang: Cloud computing business framework
Victor Chang: Cloud computing business frameworkCBOD ANR project U-PSUD
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...Yahoo Developer Network
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Databricks
 

Similar to TOP MAPREDUCE DESIGN PATTERNS (20)

Mrdp reduce side_join
Mrdp reduce side_joinMrdp reduce side_join
Mrdp reduce side_join
 
Embarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel ProblemsEmbarrassingly/Delightfully Parallel Problems
Embarrassingly/Delightfully Parallel Problems
 
Download It
Download ItDownload It
Download It
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
Task Scheduling using Tabu Search algorithm in Cloud Computing Environment us...
 
Sawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data CloudsSawmill - Integrating R and Large Data Clouds
Sawmill - Integrating R and Large Data Clouds
 
Clementine tool
Clementine toolClementine tool
Clementine tool
 
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICSUSING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
USING FACTORY DESIGN PATTERNS IN MAP REDUCE DESIGN FOR BIG DATA ANALYTICS
 
Using dask for large systems of financial models
Using dask for large systems of financial modelsUsing dask for large systems of financial models
Using dask for large systems of financial models
 
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
The Impact of Cloud Computing on Predictive Analytics 7-29-09 v5
 
E05312426
E05312426E05312426
E05312426
 
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco SlotDistributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
Distributing Queries the Citus Way | PostgresConf US 2018 | Marco Slot
 
Useful Tools for Problem Solving by Operational Excellence Consulting
Useful Tools for Problem Solving by Operational Excellence ConsultingUseful Tools for Problem Solving by Operational Excellence Consulting
Useful Tools for Problem Solving by Operational Excellence Consulting
 
mod 2.pdf
mod 2.pdfmod 2.pdf
mod 2.pdf
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
Towards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning ResearchTowards Increasing Predictability of Machine Learning Research
Towards Increasing Predictability of Machine Learning Research
 
A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...A Survey of Machine Learning Methods Applied to Computer ...
A Survey of Machine Learning Methods Applied to Computer ...
 
Victor Chang: Cloud computing business framework
Victor Chang: Cloud computing business frameworkVictor Chang: Cloud computing business framework
Victor Chang: Cloud computing business framework
 
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...Apache Hadoop India Summit 2011 talk  "Making Hadoop Enterprise Ready with Am...
Apache Hadoop India Summit 2011 talk "Making Hadoop Enterprise Ready with Am...
 
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
 

More from Edureka!

What to learn during the 21 days Lockdown | Edureka




TOP MAPREDUCE DESIGN PATTERNS

  • 2. Agenda
      Today we will take you through the following:
      • Summarization Patterns
          » Numerical Summarization (Hands On)
      • Filter Patterns
          » Finding Top K records (Hands On)
      • Join Patterns
          » Reduce side join (Hands On)
  • 3. MapReduce Review
  • 4. Why MapReduce Design Patterns - Question
      Let's broach this topic with a few questions.
      • Would you use standard sorting algorithms on the MapReduce framework?
          » Quick Sort, Merge Sort, etc.? No.
          » Why?
      • MapReduce imposes constraints, like any other framework
          » You have to think in terms of Map tasks and Reduce tasks
          » The programmer has little control over many aspects of execution
      • But MapReduce does provide a number of techniques for controlling the flow of data
  • 5. MapReduce Paradigm - Constraints (Contd.)
      • The programmer has little control over many aspects of execution
          » Where a mapper or reducer runs
          » When a mapper or reducer begins or finishes
          » Which input key-value pairs are processed by a specific mapper
          » Which intermediate key-value pairs are processed by a specific reducer
  • 6. Why MapReduce Design Patterns - Answer
      • Because of the constraints discussed in the earlier slide
          » Design patterns capture the best ways people have learnt to solve these problems
      • Because of the MapReduce techniques for controlling execution and flow of data
          » Apply these techniques to problems in standard ways that people have already worked out
      • Judicious use of the Distributed Cache or a sorting comparator can help in quite a few algorithms
      • Scalability and efficiency concerns
  • 7. Summarization Patterns – What is it
      • Provides a high-level aggregate view of a data set when visual inspection of the whole data is not feasible
      • Group similar data together and perform operations such as
          » Calculating a statistic, indexing, counting, etc.
      • Apply to a new dataset to quickly understand what's important and what to look at closely
      • Example
          » Number of hits per hour per location on a website, from a web log
          » Average length of comments per user in blog comments
          » Top ten salaries per profession, region-wise
  • 8. Numerical Summarizations – Description
      • General pattern for calculating aggregate statistics on a dataset
      • Group records by a key field and calculate a numerical aggregate per group
          » Min, max, sum, average, median, standard deviation, etc.
      • Use a Combiner properly for an efficient implementation
      • Example
          » Take advertising actions based on the hours users are most active on your site
          » Group the average amount users spend on your site by hour
      • Applicability – Use it when
          » You are dealing with numerical data or counting
          » The data can be grouped by fields
  • 9. Numerical Summarizations – Structure
      • Mapper
          » Output key = field to group by; output value = numerical item to summarize on
          » Make sure only relevant items are output from Map, to reduce network traffic
      • Combiner
          » Use if the summarization operation on the reducer is associative and commutative
          » Will reduce the network traffic between Map tasks and Reduce tasks
  • 10. Numerical Summarizations – Structure (Contd.)
      • Partitioner
          » Use a custom partitioner if you suspect skew in the data
          » To distribute computation uniformly across reducers
      • Reducer
          » Each reducer applies the summarization function to the values it receives for each group key
          » Output key = group key; output value = summarization statistic
          » Job output is a set of part files containing a single record per reducer input group
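The mapper, combiner and reducer roles described above can be sketched in a few lines. This is a minimal plain-Python simulation of the data flow, not Hadoop API code: the record layout (`user_id`, `hour`), the sample data and the helper names `map_record`, `merge` and `run_job` are all assumptions made for illustration. Because min, max and count are associative and commutative, the same `merge` function can serve as both combiner and reducer.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: (user_id, hour_of_comment) records.
RECORDS = [("u1", 9), ("u2", 14), ("u1", 22), ("u1", 3), ("u2", 8)]


def map_record(record):
    """Mapper: emit the group key and a (min, max, count) triple for one record."""
    user_id, hour = record
    return user_id, (hour, hour, 1)


def merge(a, b):
    """Combiner and reducer logic: merge two (min, max, count) triples."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2])


def run_job(splits):
    """Simulate map -> combine (per input split) -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        combined = {}  # combiner output for this map task only
        for key, value in map(map_record, split):
            combined[key] = merge(combined[key], value) if key in combined else value
        for key, value in combined.items():
            shuffled[key].append(value)  # this is what would cross the network
    # Reducer: one output record per group key.
    return {key: reduce(merge, values) for key, values in shuffled.items()}


result = run_job([RECORDS[:3], RECORDS[3:]])
print(result)
```

Running the job with a different split layout should give the same result, which is the property that makes a combiner safe to use here.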
  • 11. Numerical Summarizations – Analogy, Performance
      • Performance
          » The crux of this pattern, grouping by key, is what MapReduce provides at its core
          » Performs well when the combiner is used properly
          » For a skewed dataset, use a custom partitioner for improved performance
          » Use an appropriate number of reducers
  • 12. Numerical Summarizations – Use Cases
      • Min/Max/Count
          » Analytics to find the minimum, maximum or count of an event
      • Average/Median/Standard Deviation
          » Analytics similar to Min/Max/Count
          » Implementation is not as straightforward because these operations are not associative
      • Record Count
          » Common analytics to get a heartbeat of the data flow rate over a particular interval
      • Word Count
          » Basic text analytics: counting words in a document
          » The "Hello World" of MapReduce
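To make the point about non-associative operations concrete: an average of averages is generally wrong, so a combiner cannot simply emit a partial average. One common workaround, sketched below in plain Python under assumed names and sample data, is to carry a `(sum, count)` pair through the combiner and divide only in the reducer. The record layout `(hour, comment_length)` is hypothetical.

```python
from collections import defaultdict

# Hypothetical input: (hour, comment_length) records.
RECORDS = [(10, 40), (10, 60), (11, 30), (10, 80), (11, 50)]


def map_record(record):
    """Mapper: emit the hour and a (partial sum, partial count) pair."""
    hour, length = record
    return hour, (length, 1)


def combine(pairs):
    """Combiner: add up sums and counts; this step is associative."""
    total = sum(s for s, _ in pairs)
    count = sum(c for _, c in pairs)
    return total, count


def reduce_average(pairs):
    """Reducer: merge the partial pairs, then divide exactly once."""
    total, count = combine(pairs)
    return total / count


def run_job(splits):
    """Simulate map -> combine (per input split) -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for key, value in map(map_record, split):
            local[key].append(value)
        for key, pairs in local.items():
            shuffled[key].append(combine(pairs))
    return {key: reduce_average(pairs) for key, pairs in shuffled.items()}


averages = run_job([RECORDS[:3], RECORDS[3:]])
print(averages)
```

With this data, averaging the two per-split averages for hour 10 would give 65, while the sum-and-count approach gives the true mean of 60.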
  • 13. Min/Max/Count Example – Data Flow
  • 14. DEMO: Min/Max/Count Example
  • 15. Filtering Patterns – What is it
      • Finding a subset of interest from a large data set
      • So that further analytics can be applied to this subset
      • These patterns don't alter the original dataset
      Example:
      • Sampling: getting a representative sample to feed to Machine Learning algorithms
      • Selecting all records for a user to apply further analytics
  • 16. Basic Filtering Pattern – Description
      • Acts as a basic, abstract filtering pattern for some other patterns
      • Filter out records that are not of interest and keep the ones that are
      • A parallel processing system like Hadoop is required because of the large size of the original data set
      • The filtered-in subset may be large or small
      Example: To study the behaviour of users between 10 and 11 am, keep only the log records from that hour
      Applicability – Use it when
      • Widely applicable
      • Use it when the data can be easily parsed to yield a filtering criterion
  • 17. Basic Filtering Pattern – Structure
  • 18. Basic Filtering Pattern – Description
      Mapper
      • Applies the filtering criteria to each record it receives
      • Outputs records that match the filtering criteria
      • Output key/value pairs are the same as the input key/value pairs
      Combiner
      • Not required; map-only job
      Partitioner
      • Not required; map-only job
      Reducer
      • Generally not required; map-only job
      • But identity reducers can be used
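The map-only structure above can be sketched as follows, using the 10 to 11 am log example. This is a plain-Python illustration rather than Hadoop code; the log line format, the regular expression and the function names are assumptions. The mapper emits its input pair unchanged when the record matches and emits nothing otherwise, and there is no combiner, partitioner or reducer.

```python
import re

# Hypothetical log lines in "date time user=... action=..." form.
LOG_LINES = [
    "2014-03-01 10:15:02 user=alice action=login",
    "2014-03-01 11:02:47 user=bob action=post",
    "2014-03-01 10:59:59 user=carol action=comment",
    "2014-03-01 09:30:00 user=dave action=login",
]

# Filtering criterion: the timestamp falls in the hour starting at 10:00.
HOUR_10 = re.compile(r"^\d{4}-\d{2}-\d{2} 10:")


def filter_mapper(offset, line):
    """Mapper: yield the input pair unchanged if it matches, else nothing."""
    if HOUR_10.search(line):
        yield offset, line


def run_map_only_job(lines):
    """Simulate a map-only job: mapper output is the job output."""
    output = []
    for offset, line in enumerate(lines):
        output.extend(filter_mapper(offset, line))
    return output


kept = run_map_only_job(LOG_LINES)
for offset, line in kept:
    print(offset, line)
```

Swapping the regular expression for another predicate gives the other use cases of this pattern, such as distributed grep or data cleansing.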
  • 19. Basic Filtering Pattern – Use Cases
      • Closer view of data
      • Removing low scoring data
      • Distributed grep
      • Data cleansing
      • Simple random sampling
      • Tracking a thread of events
  • 20. Top Ten – Description
      • Filter in a fixed and relatively small number (10) of records from a large data set
      • Based on a ranking criterion that defines a total ordering
      • You can manually look at this small number of records to see what's special about them
      • Important in terms of how one would implement Top Ten in MapReduce vis-a-vis SQL
          » In SQL or any programming language you would sort and then take the top 10
          » In MapReduce, total order sorting is complex and resource intensive
      Example: Top ten users with the highest number of comments posted on StackOverflow in 2014
  • 21. Top Ten – Applicability
      Applicability – Use it when
      • A comparator function is available for ranking records
      • The number of output records is much smaller than the number of input records
          » If not, one is better off sorting the whole dataset
  • 22. Top Ten – Structure
  • 23. Top Ten – Structure
      Mapper
      • In the setup() method, initialize an array of size k (= 10)
      • In map(), insert the record field into the array in sorted order
      • If sizeOf(array) > 10, truncate the array to size 10, keeping the highest 10
      • In cleanup(), read the array and output key = null and value = record
      Combiner and custom Partitioner not required
      Reducer
      • Since the number of output records from the mappers is small, only 1 reducer is used
      • The reducer does things similar to the mapper
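The structure above can be sketched in plain Python. Note one substitution: instead of the sorted array the slide describes, this sketch keeps each mapper's candidates in a min-heap via the standard `heapq` module, which serves the same purpose of retaining only the k highest items. The record fields (`user`, `comments`), the generated sample data and the function names are assumptions for illustration.

```python
import heapq

K = 10


def top_k_mapper(records, k=K):
    """One map task: retain at most k (score, user) pairs in a min-heap.

    The returned list stands for what cleanup() would emit with key = null.
    """
    heap = []
    for record in records:
        item = (record["comments"], record["user"])
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # New item beats the smallest retained one, so swap it in.
            heapq.heapreplace(heap, item)
    return heap


def top_k_reducer(candidate_lists, k=K):
    """Single reducer: pick the overall top k from all mappers' candidates."""
    candidates = [item for lst in candidate_lists for item in lst]
    return heapq.nlargest(k, candidates)


# Hypothetical data: 100 users with distinct comment counts, in 4 input splits.
users = [{"user": f"user{i}", "comments": (i * 37) % 101} for i in range(100)]
splits = [users[i::4] for i in range(4)]

top_ten = top_k_reducer(top_k_mapper(split) for split in splits)
for score, user in top_ten:
    print(user, score)
```

Each mapper sends at most k records to the reducer, so the single reducer sees at most k times the number of map tasks, regardless of input size.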
  • 24. Top Ten – Use Cases
      • Outlier analysis
      • Select interesting data for downstream BI systems that cannot handle Big Data sets
      • Publish interesting dashboards
  • 25. DEMO: Top Ten Example
  • 26. Join Patterns – What is it
      • Datasets generally exist in multiple sources
      • Deriving their full value requires merging them together
      • Join Patterns are used for this purpose
      • Performing joins on the fly on Big Data can be costly in terms of time
      Example: Joining StackOverflow data from Comments and Posts on UserId
  • 27. Join – Refresher
      • Inner Join
      • Outer Join
          » Left Outer Join
          » Right Outer Join
          » Full Outer Join
      • Anti Join
      • Cartesian Product
  • 28. Reduce Side Join – Description
      • Easiest to implement but can be the longest to execute
      • Supports all types of join operation
      • Can join multiple data sources, but is expensive in terms of network resources and time
      • All data is transferred across the network
      Example: Join PostLinks table data in StackOverflow to Posts data
  • 29. Reduce Side Join – Description (Contd.)
      • Applicability – Use it when
          » Multiple large data sets need to be joined
          » If one of the data sources is small, look at using a replicated join instead
          » Different data sources are linked by a foreign key
          » You want all join operations to be supported
  • 30. Reduce Side Join – Structure
  • 31. Reduce Side Join – Structure (Contd.)
      • Mapper
          » Output key should reflect the foreign key
          » Value can be the whole record plus an identifier for the source
          » Use projection and output only the required fields
      • Combiner
          » Not required; no additional benefit
      • Partitioner
          » Use a custom partitioner if required
      • Reducer
          » Reducer logic depends on the type of join required
          » Reducer receives the data from all the different sources per key
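The structure above can be sketched as a plain-Python simulation. It is not Hadoop code, and the tiny `POSTS` and `COMMENTS` tables, the source tags `"P"` and `"C"`, and the function names are all assumptions for illustration. Each mapper keys its output on the foreign key and tags the projected value with its source; the reducer separates the tagged values per key and combines them according to the requested join type (only inner and left outer are shown).

```python
from collections import defaultdict

# Hypothetical data sources linked by user_id.
POSTS = [
    {"user_id": 1, "title": "Intro to Hadoop"},
    {"user_id": 2, "title": "Combiners"},
    {"user_id": 3, "title": "Partitioners"},
]
COMMENTS = [
    {"user_id": 1, "text": "Nice"},
    {"user_id": 1, "text": "Thanks"},
    {"user_id": 4, "text": "Orphan"},
]


def map_posts(record):
    """Mapper for posts: key = foreign key, value = (source tag, projected field)."""
    return record["user_id"], ("P", record["title"])


def map_comments(record):
    """Mapper for comments: same key, different source tag."""
    return record["user_id"], ("C", record["text"])


def join_reducer(key, tagged_values, join_type="inner"):
    """Reducer: split values by source tag, then pair them up per join type."""
    posts = [v for tag, v in tagged_values if tag == "P"]
    comments = [v for tag, v in tagged_values if tag == "C"]
    if join_type == "left_outer" and not comments:
        comments = [None]  # keep posts that have no matching comment
    return [(key, p, c) for p in posts for c in comments]


def run_join(join_type="inner"):
    """Simulate both mappers, the shuffle by key, and the reducer."""
    shuffled = defaultdict(list)
    for key, value in map(map_posts, POSTS):
        shuffled[key].append(value)
    for key, value in map(map_comments, COMMENTS):
        shuffled[key].append(value)
    rows = []
    for key in sorted(shuffled):
        rows.extend(join_reducer(key, shuffled[key], join_type))
    return rows


inner = run_join("inner")
left = run_join("left_outer")
print(inner)
print(left)
```

Every record from both sources passes through the shuffle, which is why this join is the most general and also the most expensive in network traffic.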
  • 32. Reduce Side Join – Performance
      • Performance
          » The whole data set moves across the network to the reducers
          » You can optimize by using projection and sending only the required fields
          » The number of reducers is typically higher than normal
          » If you can use any other join type for your problem, use that instead
  • 33. DEMO: Reduce Side Join Example
  • 34. Demo
  • 36. Survey: Your feedback is vital for us, be it a compliment, a suggestion or a complaint. It helps us to make your experience better! Please spare a few minutes to take the survey after the webinar.
