A Hands-on Introduction to MapReduce (in Python)

David Massart

This presentation explains how to build a simple MapReduce algorithm step by step.

A Hands-on Introduction to
MapReduce in Python
David Massart, PhD

Outline
• Set-up and requirements
• Counting words
• Limitations
• Map / Reduce
– Mapping
– Shuffling
– Reducing
• Hadoop

Environment Set-up
• Required
– Unix-like shell
• Linux
• Mac OS X
• Windows + Cygwin
– Python (e.g., Anaconda)
• Good to have
– Java 8
– Hadoop 2.6

Moby Dick by Herman Melville
• Download Moby Dick:
wget https://www.gutenberg.org/cache/epub/2701/pg2701.txt
• Rename it input.txt:
mv pg2701.txt input.txt

Limitations
• Processing time is, at best, proportional to the
size of the text
• In practice, performance degrades as the
dictionary of distinct words grows
• Very large texts can require more than one
disk
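
The slides do not reproduce the word-counting script, so the following is only a minimal sketch of what a single-process counter might look like. The function name `count_words` and the definition of a word (a run of lowercase letters or apostrophes) are assumptions made for illustration. It shows where the limitations come from: every line is read once, and one in-memory entry is kept per distinct word.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters or apostrophes, case-insensitive.
WORD = re.compile(r"[a-z']+")

def count_words(lines):
    """Count word occurrences across an iterable of text lines."""
    counts = Counter()  # grows by one entry per distinct word
    for line in lines:  # a single pass over the whole text
        counts.update(WORD.findall(line.lower()))
    return counts

# Small in-memory demo; hypothetical sample text.
sample = ["Call me Ishmael.", "Call me again."]
for word, n in sorted(count_words(sample).items()):
    print(f"{word}\t{n}")
```

Passing `sys.stdin` instead of `sample` would let a script built on this function be invoked as `./counter.py < input.txt`.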

MapReduce, Part 2: Shuffling
• Redistribute data based on the output keys
produced by the "mapper"
• So that all data belonging to one key is
grouped together
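
As a rough illustration of the grouping described above, the sketch below sorts a handful of key/value pairs and collects the values of each key. The function name `shuffle` and the sample pairs are hypothetical. On a single machine, sorting is enough to make equal keys adjacent, which is the role `sort` plays in the shell pipeline.

```python
from itertools import groupby
from operator import itemgetter

def shuffle(pairs):
    """Group (key, value) pairs so each key appears once with all its values."""
    # Sorting by key places all pairs with the same key next to each other.
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Hypothetical mapper output: one (word, 1) pair per occurrence.
mapped = [("whale", 1), ("sea", 1), ("whale", 1), ("ship", 1)]
print(shuffle(mapped))
```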

./mapper.py < input.txt | sort | ./reducer.py
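
The pipeline above relies on two scripts whose source is not included in these slides. The sketch below is one possible shape for them, written as two functions so it can be run on its own; the names `mapper` and `reducer`, the tab-separated record format, and the word pattern are all assumptions. The mapper emits a `word<TAB>1` record per word, and the reducer sums runs of consecutive records with the same word, so it only works on sorted input and never holds more than one running total in memory.

```python
import re

def mapper(lines):
    """Emit one 'word<TAB>1' record for every word in the input lines."""
    for line in lines:
        for word in re.findall(r"[a-z']+", line.lower()):
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share the same word."""
    current, total = None, 0
    for record in sorted_records:
        word, count = record.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"  # the previous word is complete
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"  # flush the last word

# In-memory stand-in for: ./mapper.py < input.txt | sort | ./reducer.py
text = ["The whale, the sea", "the ship"]
for out in reducer(sorted(mapper(text))):
    print(out)
```

Because each stage only consumes and produces lines of text, the same two functions could be wrapped in scripts that read `sys.stdin` and chained with `sort` as in the command above.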

More details available at
http://zettadatanet.wordpress.com
These slides are available at
http://www.slideshare.net/dmassart/mapreduce-20150315

A Hands-on Introduction to MapReduce (in Python)

1. A Hands-on Introduction to MapReduce in Python David Massart, PhD

2. Who Am I?

3. Outline • Set-up and requirements • Counting words • Limitations • Map / Reduce – Mapping – Shuffling – Reducing • Hadoop

4. Environment Set-up • Required – Unix-like shell • Linux • Mac OS X • Windows + Cygwin – Python (e.g., Anaconda) • Good to have – Java 8 – Hadoop 2.6

5. Moby Dick by Herman Melville • Download Moby Dick: wget https://www.gutenberg.org/cache/epub/2701/pg2701.txt • Rename it input.txt: mv pg2701.txt input.txt

6. cat input.txt

7. Counting Words

8. ./counter.py < input.txt

9. Limitations • Processing time is, at best, proportional to the size of the text • Actually, performance decreases with the size of the dictionary • Very large texts can require more than one disk

10. MapReduce, Part 1: Mapping

11. ./mapper.py < input.txt

12. MapReduce, Part 2: Shuffling • Redistribute data based on the output keys produced by the "mapper" • So that all data belonging to one key is grouped together

13. ./mapper.py < input.txt | sort

14. MapReduce, Part 3: Reducing

15. ./mapper.py < input.txt | sort | ./reducer.py

16. Hadoop

17. More details available at http://zettadatanet.wordpress.com These slides are available at http://www.slideshare.net/dmassart/mapreduce-20150315

Editor's Notes

Fact names need to be normalized too.
