SlideShare a Scribd company logo
MAPREDUCE SUCCINCTLY
Data everywhere
Problem - We are drowning in data
Hadoop’s place
Effective storage and processing of large chunks of data
Google GFS and MapReduce
• Google was dealing a large amount of data over 10 years ago
• Documented experience in a series of papers
• The MapReduce programming model
• Google File System
• Scalable model that was implemented in Hadoop
Disk speeds
• Processing 10 TB file
• Time – ~430 minutes
• Stored as 1TB on 10 machines
• Time – ~43 minutes
To store data at scale you need to
use multiple disks/machines
Processor trends
• CPU speeds are not growing exponentially
• Processors take less power
• Processors are able to do more in one cycle
Product Name
Intel® Core™ i7-920
Processor (8M Cache,
2.66 GHz, 4.80 GT/s
Intel® QPI)
Intel® Core™ i7-6700K
Processor (8M Cache, up
to 4.20 GHz)
Code Name Bloomfield Skylake
Launch Date Q4'08 Q3'15
Lithography 45 nm 14 nm
Recommended
Customer Price BOX : $305.00 BOX : $350.00
# of Cores 4 4
# of Threads 8 8
Processor Base
Frequency 2.66 GHz 4 GHz
Max Turbo
Frequency 2.93 GHz 4.2 GHz
TDP 130 W 91 W
Source - http://ark.intel.com/compare/88195,37147
To scale you need to use multiple
CPUs/machines
Network speeds
• Gigabit - Speed: 1000 mbps
• Size: 1 TB
• ~ 2 Hours
Don’t move data unless you have to
Example scenario
• Example that we will use to understand the problem
• Data on favorite beverage
• Calculate average cups consumed per day for each beverage
Brianna, coffee, 3
Cameron, milk, 5
Thomas, milk, 4
Wyatt, coffee, 5
coffee, 4
milk, 4.5
Example – Single Threaded
Average cups consumed by tea drinkers is 3.33
Transform
Group by beverage
Summarize and display results
The problem of shared state
Can we avoid
shared state?
Key idea – cooperating units
• Organize program into independent but cooperating units
• Programs need to be broken into a structure that will minimize
the need for any shared state
• Cooperating units can work in parallel without sharing resources
and cooperate as needed
Key idea – avoid shared state
Sum large list
Add list 1
Add list 2
Add list 3
Add and display
sum
How can we apply to our problem?
• Data can be split into blocks
• Each block of data can be processed by a thread
Stage 1 - input Stage 1 - output Stage 2 - output Stage 3 output
Brianna, coffee, 1
Cameron, milk, 5
Thomas, milk, 4
Wyatt, tea, 1
Victoria, coffee, 3
Grace, coffee, 4
David, tea, 4
coffee, 1
milk, 5
milk, 4
tea, 1
coffee, 3
coffee, 4
tea, 4
coffee, {1,3,4}
milk, {5, 4}
tea, {1, 4}
Coffee – 2.67
Milk, 4.5
Tea – 2.5
The Akka Actor model
• Units can send and receive messages
• Mailbox
Implementation structured to avoid shared state
Implementation – Take 2
Implementation – Take 3
MapReduce
Framework
Sorts, groups and
sends data by key
[Sort/Shuffle step]
The MapReduce framework
Preparation Map - input Map - output Sort/shuffle -
output
Reduce output
Break files into
blocks that can
be processed
independently
Locate and use
code to read
each record
Brianna, coffee, 1
Cameron, milk, 5
Thomas, milk, 4
Wyatt, tea, 1
Victoria, coffee, 3
Grace, coffee, 4
David, tea, 4
coffee, 1
milk, 5
milk, 4
tea, 1
coffee, 3
coffee, 4
tea, 4
coffee, {1,3,4}
milk, {5, 4}
tea, {1, 4}
Coffee – 2.67
Milk, 4.5
Tea – 2.5
Hadoop Distributed File System
• Files are split into large blocks
• Each block is stored on multiple nodes
• Namenode tracks block location
Other aspects
• Framework does a lot of the heavy lifting
• Machines can fail
• Tasks can fail
• Stragglers
• Users just write the Map and Reduce functions
Cup count demo – Apache Hadoop
• Demo
• Program is almost identical to what we wrote
Next steps
• Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr
• Read Google’s paper on Map Reduce and GFS (HDFS)
• http://research.google.com/archive/mapreduce.html
• http://research.google.com/archive/gfs.html
• Get familiar with Hadoop and Apache Spark
• Become familiar with functional programming
• Scala, F#, Clojure
• Check out Syncfusion’s free e-Books on related topics
• If working with Windows checkout Syncfusion’s easy to use Big Data Platform -
http://www.syncfusion.com/products/big-data
http://www.syncfusion.com/products/big-data
http://www.syncfusion.com/resources/techportal/ebooks
Related links
Thank you
Daniel Jebaraj
www.syncfusion.com

More Related Content

Similar to MapReduce succinctly

Hadoop
HadoopHadoop
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
Ahmed Misbah
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
DataWorks Summit/Hadoop Summit
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
Michael Hiskey
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemVaibhav Jain
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
RahulBhole12
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Databricks
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
Ruben Badaró
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
Performance Tuning
Performance TuningPerformance Tuning
Performance Tuning
Jannet Peetz
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
Caserta
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureGwen (Chen) Shapira
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systemselliando dias
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
MongoDB
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
Mark Kromer
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
AtulYadav218546
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
Adam Doyle
 
Breaking data
Breaking dataBreaking data
Breaking data
Terry Bunio
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)
Huibert Aalbers
 

Similar to MapReduce succinctly (20)

Hadoop
HadoopHadoop
Hadoop
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
 
Meeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop ClustersMeeting Performance Goals in multi-tenant Hadoop Clusters
Meeting Performance Goals in multi-tenant Hadoop Clusters
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Writing Scalable Software in Java
Writing Scalable Software in JavaWriting Scalable Software in Java
Writing Scalable Software in Java
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
 
Performance Tuning
Performance TuningPerformance Tuning
Performance Tuning
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Scaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding FailureScaling ETL with Hadoop - Avoiding Failure
Scaling ETL with Hadoop - Avoiding Failure
 
Storage Systems For Scalable systems
Storage Systems For Scalable systemsStorage Systems For Scalable systems
Storage Systems For Scalable systems
 
Hardware Provisioning
Hardware Provisioning Hardware Provisioning
Hardware Provisioning
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
MapReduce.pptx
MapReduce.pptxMapReduce.pptx
MapReduce.pptx
 
Hadoop Data Modeling
Hadoop Data ModelingHadoop Data Modeling
Hadoop Data Modeling
 
Breaking data
Breaking dataBreaking data
Breaking data
 
ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)ITI015En-The evolution of databases (I)
ITI015En-The evolution of databases (I)
 

Recently uploaded

Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
Tendenci - The Open Source AMS (Association Management Software)
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
Max Andersen
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
Matt Welsh
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
Tier1 app
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Globus
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
Philip Schwarz
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
rickgrimesss22
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
kalichargn70th171
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
Paco van Beckhoven
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
abdulrafaychaudhry
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Globus
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
Georgi Kodinov
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
informapgpstrackings
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
Globus
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Natan Silnitsky
 

Recently uploaded (20)

Corporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMSCorporate Management | Session 3 of 3 | Tendenci AMS
Corporate Management | Session 3 of 3 | Tendenci AMS
 
Quarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden ExtensionsQuarkus Hidden and Forbidden Extensions
Quarkus Hidden and Forbidden Extensions
 
Large Language Models and the End of Programming
Large Language Models and the End of ProgrammingLarge Language Models and the End of Programming
Large Language Models and the End of Programming
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERROR
 
Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...Developing Distributed High-performance Computing Capabilities of an Open Sci...
Developing Distributed High-performance Computing Capabilities of an Open Sci...
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
 
A Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of PassageA Sighting of filterA in Typelevel Rite of Passage
A Sighting of filterA in Typelevel Rite of Passage
 
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxTop Features to Include in Your Winzo Clone App for Business Growth (4).pptx
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptx
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 
A Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdfA Comprehensive Look at Generative AI in Retail App Testing.pdf
A Comprehensive Look at Generative AI in Retail App Testing.pdf
 
Prosigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns: Transforming Business with Tailored Technology Solutions
Prosigns: Transforming Business with Tailored Technology Solutions
 
Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024Cracking the code review at SpringIO 2024
Cracking the code review at SpringIO 2024
 
Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024Globus Connect Server Deep Dive - GlobusWorld 2024
Globus Connect Server Deep Dive - GlobusWorld 2024
 
Lecture 1 Introduction to games development
Lecture 1 Introduction to games developmentLecture 1 Introduction to games development
Lecture 1 Introduction to games development
 
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisProviding Globus Services to Users of JASMIN for Environmental Data Analysis
Providing Globus Services to Users of JASMIN for Environmental Data Analysis
 
2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx2024 RoOUG Security model for the cloud.pptx
2024 RoOUG Security model for the cloud.pptx
 
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...
 
GlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote sessionGlobusWorld 2024 Opening Keynote session
GlobusWorld 2024 Opening Keynote session
 
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.ILBeyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL
 

MapReduce succinctly

  • 2. Data everywhere Problem - We are drowning in data
  • 3. Hadoop’s place Effective storage and processing of large chunks of data
  • 4. Google GFS and MapReduce • Google was dealing a large amount of data over 10 years ago • Documented experience in a series of papers • The MapReduce programming model • Google File System • Scalable model that was implemented in Hadoop
  • 5. Disk speeds • Processing 10 TB file • Time – ~430 minutes • Stored as 1TB on 10 machines • Time – ~43 minutes To store data at scale you need to use multiple disks/machines
  • 6. Processor trends • CPU speeds are not growing exponentially • Processors take less power • Processors are able to do more in one cycle Product Name Intel® Core™ i7-920 Processor (8M Cache, 2.66 GHz, 4.80 GT/s Intel® QPI) Intel® Core™ i7-6700K Processor (8M Cache, up to 4.20 GHz) Code Name Bloomfield Skylake Launch Date Q4'08 Q3'15 Lithography 45 nm 14 nm Recommended Customer Price BOX : $305.00 BOX : $350.00 # of Cores 4 4 # of Threads 8 8 Processor Base Frequency 2.66 GHz 4 GHz Max Turbo Frequency 2.93 GHz 4.2 GHz TDP 130 W 91 W Source - http://ark.intel.com/compare/88195,37147 To scale you need to use multiple CPUs/machines
  • 7. Network speeds • Gigabit - Speed: 1000 mbps • Size: 1 TB • ~ 2 Hours Don’t move data unless you have to
  • 8. Example scenario • Example that we will use to understand the problem • Data on favorite beverage • Calculate average cups consumed per day for each beverage Brianna, coffee, 3 Cameron, milk, 5 Thomas, milk, 4 Wyatt, coffee, 5 coffee, 4 milk, 4.5
  • 9. Example – Single Threaded Average cups consumed by tea drinkers is 3.33 Transform Group by beverage Summarize and display results
  • 10. The problem of shared state Can we avoid shared state?
  • 11. Key idea – cooperating units • Organize program into independent but cooperating units • Programs need to be broken into a structure that will minimize the need for any shared state • Cooperating units can work in parallel without sharing resources and cooperate as needed
  • 12. Key idea – avoid shared state Sum large list Add list 1 Add list 2 Add list 3 Add and display sum
  • 13. How can we apply to our problem? • Data can be split into blocks • Each block of data can be processed by a thread Stage 1 - input Stage 1 - output Stage 2 - output Stage 3 output Brianna, coffee, 1 Cameron, milk, 5 Thomas, milk, 4 Wyatt, tea, 1 Victoria, coffee, 3 Grace, coffee, 4 David, tea, 4 coffee, 1 milk, 5 milk, 4 tea, 1 coffee, 3 coffee, 4 tea, 4 coffee, {1,3,4} milk, {5, 4} tea, {1, 4} Coffee – 2.67 Milk, 4.5 Tea – 2.5
  • 14. The Akka Actor model • Units can send and receive messages • Mailbox
  • 15. Implementation structured to avoid shared state
  • 17. Implementation – Take 3 MapReduce Framework Sorts, groups and sends data by key [Sort/Shuffle step]
  • 18. The MapReduce framework Preparation Map - input Map - output Sort/shuffle - output Reduce output Break files into blocks that can be processed independently Locate and use code to read each record Brianna, coffee, 1 Cameron, milk, 5 Thomas, milk, 4 Wyatt, tea, 1 Victoria, coffee, 3 Grace, coffee, 4 David, tea, 4 coffee, 1 milk, 5 milk, 4 tea, 1 coffee, 3 coffee, 4 tea, 4 coffee, {1,3,4} milk, {5, 4} tea, {1, 4} Coffee – 2.67 Milk, 4.5 Tea – 2.5
  • 19. Hadoop Distributed File System • Files are split into large blocks • Each block is stored on multiple nodes • Namenode tracks block location
  • 20. Other aspects • Framework does a lot of the heavy lifting • Machines can fail • Tasks can fail • Stragglers • Users just write the Map and Reduce functions
  • 21. Cup count demo – Apache Hadoop • Demo • Program is almost identical to what we wrote
  • 22. Next steps • Check out sample files on GitHub - https://github.com/danjebaraj/hadoopmr • Read Google’s paper on Map Reduce and GFS (HDFS) • http://research.google.com/archive/mapreduce.html • http://research.google.com/archive/gfs.html • Get familiar with Hadoop and Apache Spark • Become familiar with functional programming • Scala, F#, Clojure • Check out Syncfusion’s free e-Books on related topics • If working with Windows checkout Syncfusion’s easy to use Big Data Platform - http://www.syncfusion.com/products/big-data