MapReduce to Apache Spark: An Ecosystem Evolves

This deck summarizes the evolution from MapReduce to Apache Spark for data processing. Key points:
• MapReduce delivered breakthroughs in data locality, fault tolerance, and scalability; its programming model forces developers to write generally scalable solutions.
• Apache Spark offers a richer, more expressive API that needs 2-5x less code than MapReduce, plus in-memory execution that is up to an order of magnitude faster.
• In a January 2015 Typesafe survey, 82% of respondents had replaced MapReduce with Spark and 78% needed faster processing for large data sets. Spark is now an important part of the Hadoop platform.

MapReduce to Apache Spark:
An Ecosystem Evolves
Doug Cutting (@cutting)
Chief Architect & Co-founder of Apache Hadoop

Hadoop’s Original Architecture
• MapReduce: Data Processing and Resource Management
• HDFS: Filesystem/Storage

The MapReduce Breakthrough
Key advances in MapReduce:
• Data locality: Automatic split computation and appropriate launch of mappers
• Fault tolerance: Writing out intermediate results and using restartable mappers make it possible to run on commodity hardware
• Linear scalability: The combination of locality and the programming model forces developers to write generally scalable solutions
[Diagram: twelve Map tasks feeding four Reduce tasks]
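
The programming model behind these points can be illustrated with a small single-machine sketch. This is only a toy word count, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the simple byte-sum partitioner are assumptions made for illustration. It shows why the model tends to scale: each mapper sees only its own split, and each reducer sees only the keys routed to it.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(mapped_pairs, num_reducers):
    """Group values by key and route each key to exactly one partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        # Illustrative partitioner (an assumption): sum of the key's bytes.
        index = sum(key.encode("utf-8")) % num_reducers
        partitions[index][key].append(value)
    return partitions


def reduce_phase(partition):
    """Reducer: sum the counts collected for each key in one partition."""
    return {key: sum(values) for key, values in partition.items()}


def word_count(splits, num_reducers=4):
    """Run map, shuffle, and reduce over a list of splits (lists of lines)."""
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    result = {}
    for partition in shuffle(mapped, num_reducers):
        result.update(reduce_phase(partition))
    return result
```

In a real cluster each call to `map_phase` would run next to the data it reads, and a failed mapper could simply be rerun on its split, which is where the locality and fault-tolerance points above come from.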

Apache Spark: A Better MapReduce
Easy, Expressive API
• Rich API (Java, Scala, and Python)
• Interactive shell
• 2-5x less code needed than MR
Fast Execution
• General execution graphs
• In-memory storage
• Order-of-magnitude improvement over MR
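
To make the API and in-memory points concrete, here is a minimal single-machine sketch. `ToyDataset` and all of its methods are hypothetical names invented for this illustration; it is not Spark's API and runs no distributed execution. It only mimics two ideas from the slide: transformations chain into a pipeline that runs when results are requested, and a cached dataset is kept in memory so later requests skip recomputation.

```python
class ToyDataset:
    """Hypothetical stand-in for a distributed dataset (NOT Spark's API)."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function producing records
        self._use_cache = False    # set by cache()
        self._cached = None        # in-memory copy, filled on first use

    @classmethod
    def from_list(cls, records):
        return cls(lambda: list(records))

    def _records(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = list(self._compute())
            return self._cached
        return self._compute()

    def flat_map(self, fn):
        return ToyDataset(
            lambda: [out for rec in self._records() for out in fn(rec)])

    def map(self, fn):
        return ToyDataset(lambda: [fn(rec) for rec in self._records()])

    def reduce_by_key(self, fn):
        def compute():
            merged = {}
            for key, value in self._records():
                merged[key] = fn(merged[key], value) if key in merged else value
            return list(merged.items())
        return ToyDataset(compute)

    def cache(self):
        """Keep the computed records in memory after the first action."""
        self._use_cache = True
        return self

    def collect(self):
        """Action: run the chained pipeline and return a list of records."""
        return list(self._records())


# Word count as one chained expression; nothing runs until collect().
lines = ToyDataset.from_list(["a b a", "b c"])
counts = (lines.flat_map(str.split)
               .map(lambda word: (word, 1))
               .reduce_by_key(lambda x, y: x + y)
               .cache())
```

The whole word count fits in one chained expression, which is the kind of brevity the "less code" bullet refers to; calling `counts.collect()` a second time returns the cached records instead of rerunning the pipeline.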

Big Data Developers are Rapidly Sparking Up
Source: Typesafe Apache Spark Adoption Survey, Jan. 2015
• 82% have replaced MapReduce with Spark
• 78% need faster processing for large data sets
• 62% load data into Spark via HDFS
• 22% of respondents run CDH, more than twice as many as any other Hadoop platform

Spark is now an important part of the Hadoop Platform

A Platform That Just Won’t Stop Growing
[Figure: stacked timeline of Hadoop ecosystem projects, 2006 to 2015, with each year's new projects added on top of the existing ones. New projects per year are listed below; * marks projects supported in CDH.]
• 2006: Core Hadoop* (HDFS, MapReduce)
• 2007: Solr*, Pig*
• 2008: HBase*, ZooKeeper*
• 2009: Hive*, Mahout*
• 2010: Sqoop*, Avro*
• 2011: Flume*, Bigtop*, Oozie*, HCatalog*, Hue*
• 2012: Spark*, Tez, Impala*, Kafka*, Drill, YARN*
• 2013: Parquet*, Sentry*
• 2014: Knox, Flink
• 2015: Kudu*, RecordService*, Ibis*, Falcon

Hadoop’s Next 10 Years
• Interest in public-cloud deployments is driving native support for them into the platform.
• Rapid hardware advances are forcing the community to re-think Hadoop’s foundations.
• Data sources are more numerous, distributed, and diverse (IoT), and Hadoop will adapt.

Learn More
cloudera.com/hadoop10

Editor's Notes

This data is from Typesafe’s 2015 survey of 2100+ developers, data scientists, and IT executives whose organizations are either running or researching Spark.
What does the future hold for Hadoop? There are many possible permutations, but these are just a couple of the obvious influences going forward.
