MapReduce Explained: Why We Need It and How It Works
1. MapReduce
Why do we need it?
What is it?
My initial interaction with it
Joe Duimstra
Aug 6, 2015
2. Data Growth
There is an ENORMOUS amount of data out there and
it's growing exponentially!!
Example:
20+ billion web pages x 20KB = 400+ TB
1 computer reads 30-35 MB/sec from disk
~4 months to read the web
~1,000 hard drives to store the web
It takes even more to do something useful with the data!
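The back-of-envelope numbers above can be checked in a few lines. This sketch assumes ~400 GB per drive (implied by "~1,000 hard drives to store the web") and the higher 35 MB/sec read speed:

```python
# Back-of-envelope check of the slide's numbers (assumed values, not measurements)
pages = 20e9                 # 20+ billion web pages
page_size = 20e3             # 20 KB each, in bytes
total_bytes = pages * page_size          # 4e14 bytes = 400 TB

read_speed = 35e6            # one disk reads ~35 MB/sec
seconds = total_bytes / read_speed
months = seconds / (3600 * 24 * 30)      # roughly 4.4 months to read it all

drive_size = 400e9           # assuming ~400 GB drives
drives = total_bytes / drive_size        # ~1,000 drives to store it

print(total_bytes / 1e12, months, drives)
```

So a single machine needs months just to scan the data once, which is the motivation for spreading the work across many machines.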
3. What to do?
Option 1: Make a bigger, custom machine
–Custom = expensive
–Eventually even this machine will be too small
–Note: could be useful if cheap enough to be used with option 2
5. Parallel computing
● Use cheap, off-the-shelf servers and network equipment
● Moving data is expensive, so do the computation near the data
MapReduce (from Google) is one paradigm for parallel computing
● Use a distributed file system
–Hardware will fail
–Provide redundancy by replicating chunks across machines
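The map/shuffle/reduce phases can be sketched in a few lines of single-machine Python, using word count (the canonical MapReduce example). This is an illustration of the paradigm only, not Hadoop code; in a real cluster the framework runs mappers and reducers on many machines and performs the shuffle over the network:

```python
from collections import defaultdict

def map_phase(doc):
    # Mapper: emit (key, value) pairs -- here (word, 1) for each word
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this between phases)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: combine all values for one key -- here, sum the counts
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Because each mapper only sees its own input split and each reducer only sees one key's values, both phases parallelize naturally across cheap machines.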
7. My impressions
● I have virtually no experience with Java, so that's an initial barrier
● Hadoop seems pretty low-level
– Running multiple jobs is kind of a pain
– Seems like you need knowledge of the partitioning, sorting, and grouping implementations to optimize performance
– I believe abstractions (e.g. Cascading) exist?
● Spark:
– Should be faster than Hadoop since it works in memory and has lazy execution
– Seems like a more cohesive set of tools for parallel data processing
● Hadoop:
– Requires you to 'roll up your sleeves'