
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem

22,124 views


Learn why Apache Spark is replacing MapReduce as the default general data processing engine for Hadoop ecosystem components.

Published in: Software


  1. © Cloudera, Inc. All rights reserved. Why Apache Spark is the Heir to MapReduce in the Apache Hadoop Ecosystem
  2. MapReduce: Hadoop’s Original Data Processing Engine
     Key Advances by MapReduce:
     • Data Locality: Automatic split computation, with mappers launched where appropriate
     • Fault-Tolerance: Writing out intermediate results and making mappers restartable made it possible to run on commodity hardware
     • Linear Scalability: Combination of locality and a programming model that forces developers to write generally scalable solutions to problems
     [Diagram: Map tasks (12) and Reduce tasks (4)]
  3. MapReduce Did Its Original Job Well, But…
     MR was sufficient for many use cases, but a bit like Haiku in its expressiveness: A very rigid framework; Diverse, powerful.
     [Diagram: MapReduce, Hive, Pig, Mahout, Crunch, Solr]
  4. We Can Do Better with Apache Spark
     Better Developer Productivity:
     • Rich APIs for Scala, Java, and Python
     • Interactive shell
     Better Performance:
     • General execution graphs
     • In-memory storage
  5. High-Productivity Language Support
     • Native support for multiple languages with identical APIs
     • Use of closures, iterations, and other common language constructs to minimize code
     • Unified API for batch and streaming
     Python:
         lines = sc.textFile(...)
         lines.filter(lambda s: "ERROR" in s).count()
     Scala:
         val lines = sc.textFile(...)
         lines.filter(s => s.contains("ERROR")).count()
     Java:
         JavaRDD<String> lines = sc.textFile(...);
         lines.filter(new Function<String, Boolean>() {
           public Boolean call(String s) { return s.contains("ERROR"); }
         }).count();
  6. Automatic Parallelization of Complex Flows
     In Spark, individual execution tasks are expressed as a single, parallelized program flow. Big time saver for developers!
     Pseudocode from the slide:
         rdd1.map(splitlines).filter("ERROR")
         rdd2.map(splitlines).groupBy(key)
         rdd2.join(rdd1, key).take(10)
  7. Integrated Streaming
     Run continuous processing of data using Spark’s core API.
     Example use cases:
     • “On-the-fly” ETL as data is ingested into Hadoop/HDFS
     • Detecting anomalous behavior and triggering alerts
     • Continuous reporting of summary metrics for incoming data
  8. Spark and Hadoop Belong Together (via YARN)
     [Diagram: YARN; Spark; Spark Streaming; GraphX; MLlib; HDFS, HBase; Hive; Pig; Impala; Spark or MR; Spark SQL; Search. Legend: Core Hadoop, Spark components]
  9. Cloudera Is a Leader in the Spark Movement
     Timeline, 2013 to 2016:
     • Identified Spark’s early potential
     • Ships and supports Spark with CDH 4.4
     • Significant contributions to Spark-on-YARN integration
     • Announces initiative to make Spark the standard execution engine
     • Launches first Spark training
     • Added security integration
     • Cloudera engineers publish O’Reilly Spark book
     • Leading effort to further performance, usability, and enterprise-readiness
  10. Spark is Replacing MapReduce as the Open Standard
     With help from Cloudera’s Apache committers, ecosystem communities are complementing MapReduce with Spark as their execution engine, or making Spark the default: Hive, Pig, Mahout, Crunch, Solr.
  11. Cloudera & Intel: Joint Roadmap for Spark
     Cloudera and Intel engineers are major contributors to Spark, working alongside those of Databricks and the rest of the global Apache community to help build the platform.
     • 23 total engineers working on Spark (including 5 committers)
       • Cloudera: 8 (4 committers)
       • Intel: 15 (1 committer)
     • 900+ patches contributed to date
  12. Developers are Sparking Up
     Source: Typesafe Apache Spark Adoption Survey, Jan. 2015
     • 82% of users have Spark to replace MapReduce
     • 78% of users need faster processing for large data sets
     • 67% of users plan to introduce event stream processing
     • 22% of users run Spark on Cloudera, twice as many as any other platform option
  13. Focus Areas for Contributions
     Enterprise Readiness, Performance, SQL:
     • Comprehensive Security
     • Comprehensive Governance
     • Improved Monitoring and Dashboards
     • Core shuffle and sort improvements
     • Improved leverage of HDFS data locality
     • Automatic performance tuning
     • Leverage HDFS Caching
     • Scale testing
     • HDFS discard-able distributed memory integration
     • Spark-on-YARN improvements: dynamic container resizing
     • Spark SQL stability
     • SQL on Spark Streaming
     • Column-level security
     Growing the Ecosystem:
     • Hive on Spark
     • Remote Spark Context
     • Sqoop on Spark
     Data Science:
     • MLlib Pipelines
     • Interactive iPython-style notebooks
     • Intel MKL integration for performance improvements
  14. Get Educated About Spark at cloudera.com/spark
     • Read the Spark book by Cloudera’s committers
     • Get Spark training
     • Get hands-on with Spark and Hadoop on AWS
  15. Thank You
     cloudera.com/spark
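
The map and reduce stages described on slide 2 can be illustrated with a small single-process sketch in plain Python. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample log lines are hypothetical, and the "splits" are ordinary lists standing in for HDFS input splits. It only shows the shape of the programming model: mappers emit key/value pairs per split, pairs are grouped by key, and reducers aggregate each group.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(split: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (log_level, 1) for every non-empty line in one split."""
    for line in split:
        if line.strip():
            level = line.split()[0]
            yield level, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all mapper output by key, as the framework does between stages."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: sum the counts collected for one key."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run the mappers over each split, then shuffle, then reduce."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


# Hypothetical input: two splits of a log file.
splits = [
    ["ERROR disk full", "INFO started"],
    ["ERROR timeout", "WARN slow response", "INFO stopped"],
]
counts = run_job(splits)
print(counts)
```

In a real cluster each split would be processed by a mapper scheduled near its data, and the intermediate pairs would be written out so that failed tasks can be restarted; the sketch leaves both out.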

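
The pseudocode flow on slide 6 (split, filter on "ERROR", group by key, join, take 10) can be traced in plain Python as a rough stand-in. This sketch does not use the Spark API and runs in a single process; the helper names, the `key message` line layout, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import islice
from typing import Dict, List, Tuple


def split_line(line: str) -> Tuple[str, str]:
    """Split 'key rest of line' into (key, rest); assumed input layout."""
    key, _, rest = line.partition(" ")
    return key, rest


def error_records(lines: List[str]) -> List[Tuple[str, str]]:
    """Analogue of rdd1.map(splitlines).filter("ERROR")."""
    return [(k, msg) for k, msg in map(split_line, lines) if "ERROR" in msg]


def group_by_key(lines: List[str]) -> Dict[str, List[str]]:
    """Analogue of rdd2.map(splitlines).groupBy(key)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in map(split_line, lines):
        groups[key].append(value)
    return groups


def join_take(groups: Dict[str, List[str]],
              errors: List[Tuple[str, str]],
              n: int = 10) -> List[Tuple[str, Tuple[List[str], str]]]:
    """Analogue of rdd2.join(rdd1, key).take(10): inner join, first n rows."""
    joined = ((k, (groups[k], msg)) for k, msg in errors if k in groups)
    return list(islice(joined, n))


# Hypothetical data: log lines keyed by host, and host-to-rack records.
log_lines = ["host1 ERROR disk full", "host2 INFO ok", "host3 ERROR timeout"]
host_lines = ["host1 rack-a", "host1 rack-b", "host2 rack-a"]

result = join_take(group_by_key(host_lines), error_records(log_lines))
print(result)
```

In Spark the same steps would be declared as one program and scheduled across the cluster; here they simply run in order, which is enough to see what each stage produces.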