Apache Spark: killer or savior of Apache Hadoop?

The Big Boss(tm) has just OKed the first Hadoop cluster in the company. You are the guy in charge of analyzing petabytes of your company's valuable data using a combination of custom MapReduce jobs and SQL-on-Hadoop solutions. All of a sudden the web is full of articles telling you that Hadoop is dead, Spark has won and you should quit while you're still ahead. But should you?

Presentation Transcript

  • Apache Spark: killer or savior of Apache Hadoop? Roman Shaposhnik Director of Open Source @Pivotal (Twitter: @rhatr)
  • Who’s this guy? •  Director of Open Source (building a team of OS contributors) •  Apache Software Foundation guy (Member, VP of Apache Incubator, committer on Hadoop, Giraph, Sqoop, etc.) •  Used to be root@Cloudera •  Used to be PHB@Yahoo! (original Hadoop team) •  Used to be a hacker at Sun Microsystems (Sun Studio compilers and tools)
  • Shameless plug http://manning.com/martella
  • Dearly beloved…
  • 40 minutes to figure out Hadoop vs. Spark
  • 40 minutes to figure out Hadoop++ == Spark
  • 40 minutes to figure out Hadoop + Spark
  • 40 minutes to figure out
  • Long, long time ago… HDFS ASF Projects FLOSS Projects Pivotal Products MapReduce
  • In a blink of an eye HDFS Pig Sqoop Flume Coordination and workflow management ZooKeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLlib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN Tachyon
  • A Spark view? HDFS Sqoop Flume Coordination and workflow management ZooKeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie Hadoop UI Hue SolrCloud Phoenix HBase Spark Shark Streaming MLlib GraphX SpringXD YARN Tachyon
  • BDAS
  • Principle #1 HDFS is the datalake
  • Your datacenter … server 1 server N
  • Hadoop’s view MapReduce server 1 server N HDFS
  • HDFS: decoupled storage … MR HDFS MR
  • Anatomy of MapReduce d a c a b c a 3 b 1 c 2 a 1 b 1 c 1 a 1 c 1 a 1 a 1 1 1 b 1 c 1 1 HDFS mappers reducers HDFS
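The map, shuffle and reduce stages in the diagram above can be sketched on plain Scala collections. This is a minimal local illustration, not Hadoop code: the object and method names are hypothetical, and the two input splits ("a b c" and "a c a") are an assumption reconstructed from the intermediate pairs shown on the slide.

```scala
object MapReduceAnatomy {
  // Map phase: each mapper turns one input split into (word, 1) pairs.
  def mapPhase(split: String): Seq[(String, Int)] =
    split.split(" ").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: collect the pairs from all mappers and group them by key.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: each reducer sums the values that arrived for one key.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }

  // The whole job: splits -> mappers -> shuffle -> reducers.
  def wordCount(splits: Seq[String]): Map[String, Int] =
    reducePhase(shuffle(splits.flatMap(mapPhase)))
}
```

Calling `MapReduceAnatomy.wordCount(Seq("a b c", "a c a"))` gives the counts from the slide (a 3, b 1, c 2). In a real cluster the splits and the final counts would both live in HDFS, and the shuffle would move data between machines.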
  • Principle #2 MR is assembly language
  • MapReduce 1.0 Job Tracker Task Tracker (HDFS) Task Tracker (HDFS) task1 task1 task1 task1 task1 task1 task1 task1 task1 taskN
  • YARN (AKA MR2.0) Resource Manager Job Tracker task1 task1 task1 task1 task1 Task Tracker
  • Principle #3 MR: YARN + library
  • What’s wrong with MR? Source: UC Berkeley Spark project (just the image)
  • Principle #4 $ grep -R | awk | sort …
  • Spark philosophy • Make life easy for Data Scientists • Provide well documented and expressive APIs • Powerful Domain Specific Libraries • Easy integration with storage systems • Caching to avoid data movement • Well defined releases, stable API
  • Spark innovations • Resilient Distributed Datasets (RDDs) • Distributed on a cluster • Manipulated via parallel operators (map, etc.) • Automatically rebuilt on failure • A parallel ecosystem • A solution to iterative and multi-stage apps
  • RDDs warnings = textFile(…).filter(_.contains("warning")) .map(_.split(' ')(1)) HadoopRDD path = hdfs:// FilteredRDD contains… MappedRDD split…
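The chain HadoopRDD, FilteredRDD, MappedRDD on this slide is a lineage: each dataset remembers its parent and the function applied to it, which is what allows a lost partition to be recomputed. The toy model below is only a sketch of that idea on local collections. `ToyRDD`, `Source`, `Filtered` and `Mapped` are hypothetical names, not Spark classes, and the sample log lines are made up.

```scala
object LineageSketch {
  // A toy stand-in for an RDD: it stores how to compute its data, not the data.
  sealed trait ToyRDD[A] {
    def compute(): Seq[A]
    def filter(p: A => Boolean): ToyRDD[A] = Filtered(this, p)
    def map[B](f: A => B): ToyRDD[B] = Mapped(this, f)
  }

  // Root of the lineage: plays the role of reading a file from storage.
  final case class Source[A](load: () => Seq[A]) extends ToyRDD[A] {
    def compute(): Seq[A] = load()
  }

  // Child that remembers its parent and a predicate.
  final case class Filtered[A](parent: ToyRDD[A], p: A => Boolean) extends ToyRDD[A] {
    def compute(): Seq[A] = parent.compute().filter(p)
  }

  // Child that remembers its parent and a transformation.
  final case class Mapped[A, B](parent: ToyRDD[A], f: A => B) extends ToyRDD[B] {
    def compute(): Seq[B] = parent.compute().map(f)
  }

  // Made-up input standing in for the file on the slide.
  val lines: Seq[String] = Seq("warning disk full", "info all good", "warning cpu hot")

  // Same shape as the slide: keep the warnings, then take the second field.
  val warnings: ToyRDD[String] =
    Source(() => lines).filter(_.contains("warning")).map(_.split(' ')(1))
}
```

Nothing is evaluated until `compute()` is called, and calling it again replays the whole chain from the source. That replay is the rebuild-on-failure behaviour in miniature.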
  • Parallel operators • map, reduce • sample, filter • groupBy, reduceByKey • join, leftOuterJoin, rightOuterJoin • union, cross
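Two of the operators listed above, reduceByKey and join, are easy to misread if you only know their names. The helpers below are local stand-ins written for this note on plain Scala sequences of pairs. They are assumptions that mimic the semantics, not Spark's implementation, and they do nothing in parallel.

```scala
object PairOps {
  // Local stand-in for reduceByKey: merge all values sharing a key with `f`.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2).reduce(f)) }

  // Local stand-in for an inner join: pair up values whose keys match.
  // Keys present on only one side are dropped.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- left; (k2, w) <- right if k == k2) yield (k, (v, w))
}
```

For example, `PairOps.reduceByKey(Seq("a" -> 1, "b" -> 2, "a" -> 3))(_ + _)` sums the two values for "a" and leaves "b" alone. A left or right outer join would differ from `join` only in keeping the unmatched keys from one side.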
  • How do I use it? val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...")
  • Principle #5 Memory is the new disk
  • RDDs are the foundation • SQL • Graph • ML • Streaming
  • Spark SQL • Lib in Spark Core that models RDDs as relations • SchemaRDD • Replaces Shark • Lightweight with no code from Hive • Import/Export into different storage formats • Columnar storage (as in Shark)
  • Spark Streaming • Extend Spark to do large scale stream processing • Simple, batch like API with RDDs • Single semantics for both real time and high latency
  • D-Streams
  • Streaming from Twitter TwitterUtils.createStream(...) .filter(_.getText.contains("Spark")) .countByWindow(Seconds(5))
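The D-Stream idea behind this snippet is that a stream is treated as a sequence of small batches, so a windowed count is just a count over the last few batches. The sketch below shows that on local collections. `WindowedCount` and its parameters are hypothetical, the window is measured in batches rather than seconds, and the tweets are plain strings rather than Twitter status objects.

```scala
object WindowedCount {
  // Each inner Seq is one micro-batch of tweet texts.
  // Returns, for every sliding window of `windowBatches` consecutive batches,
  // how many tweets in that window mention "Spark".
  def countByWindow(batches: Seq[Seq[String]], windowBatches: Int): List[Int] =
    batches
      .map(batch => batch.count(_.contains("Spark"))) // per-batch filter + count
      .sliding(windowBatches)                         // group consecutive batches
      .map(_.sum)                                     // total per window
      .toList
}
```

The same filter-then-count logic would work unchanged on a single large batch, which is the "single semantics" point made on the Spark Streaming slide.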
  • Spark GraphX • Pregel (BSP) (formerly known as Bagel) • Graph-centric modeling • Unification of processing • No more MR trickery
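Pregel-style (BSP) processing runs in supersteps: every vertex sends messages to its neighbours, every vertex updates itself from its inbox, and all vertices synchronize before the next round. The sketch below propagates the maximum value through a small graph in that style. It is a local illustration with hypothetical names, not the GraphX API, and it halts as soon as a superstep changes nothing.

```scala
object PregelSketch {
  // edges: adjacency lists (vertex id -> ids it sends to)
  // values: the current value held by each vertex
  @annotation.tailrec
  def propagateMax(edges: Map[Int, Seq[Int]], values: Map[Int, Int]): Map[Int, Int] = {
    // Superstep, part 1: every vertex sends its current value to its neighbours.
    val messages: Seq[(Int, Int)] =
      for ((src, dsts) <- edges.toSeq; dst <- dsts) yield (dst, values(src))

    // Superstep, part 2: every vertex keeps the max of its own value and its inbox.
    val inbox: Map[Int, Int] =
      messages.groupBy(_._1).map { case (v, ms) => (v, ms.map(_._2).max) }
    val next: Map[Int, Int] =
      values.map { case (v, x) => (v, math.max(x, inbox.getOrElse(v, x))) }

    // Barrier: stop when no vertex changed, otherwise run another superstep.
    if (next == values) values else propagateMax(edges, next)
  }
}
```

On the three-vertex cycle 1 → 2 → 3 → 1 with starting values 5, 1 and 9, every vertex ends up holding 9 after two supersteps. Expressing the same loop as chained MapReduce jobs is the "MR trickery" the slide refers to.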
  • You killed Apache Giraph?
  • MLbase • Machine Learning toolset • MATLAB for scale out computing • Built on Spark MLlib • Classification, Regression, Collaborative Filtering, etc.
  • What is really happening? HDFS Pig Sqoop Flume Coordination and workflow management ZooKeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire XD Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLlib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN Tachyon
  • Principle #6 Spark: the ecosystem
  • Maybe it's not so bad server 1 server N
  • But HDFS/YARN are safe? HDFS, Ceph, S3, NAS, etc. New HDFS New YARN
  • What is *really* going on? • 2009 Research at UCB, written in Scala • 2010 Open Sourced • 2013 Accepted into Apache Incubator • 2013 Databricks formed ($14M funding) • 2014 Becomes TLP with ASF • 2014 Spark 1.0 is out • 2014 Databricks gets an extra $33M
  • Bigdata: brought to U by ASF • >50% ML traffic • 100-200 contributors across 25-35 companies • More active than Hadoop • Cross-pollination with other TLPs
  • Principle #7 Where Hadoop was ‘09
  • This is how hardening looks
  • What is Hadoop? Hadoop != MR + HDFS
  • The ecosystem • Apache HBase • Apache Crunch, Pig, Hive and Phoenix • Apache Giraph • Apache Oozie • Apache Mahout • Apache Sqoop and Flume
  • Principle #8 Spark: an alternative backend
  • Spark is best for cloud
  • Principle #9 Memory is expensive
  • What’s new? • True elasticity • Resource partitioning • Security • Data marketplace • Multi datacenter deployments
  • Hadoop Maturity ETL Offload Accommodate massive data growth with existing EDW investments Data Lakes Unify Unstructured and Structured Data Access Big Data Apps Build analytic-led applications impacting top line revenue Data-Driven Enterprise App Dev and Operational Management on HDFS Data Architecture
  • Pivotal HD on Pivotal CF • Enterprise PaaS Management System • Flexible multi-language ‘buildpack’ architecture • Deployed applications enjoy built-in services • On-Premise Hadoop as a Service • Single cluster deployment of Pivotal HD • Developers instantly bind to shared Hadoop Clusters • Speeds up time-to-value
  • Pivotal’s view Data Science Platform Tachyon/Gem Cluster Manager MR Application Stream Server MPP SQL Data Lake / HDFS / Virtual Storage GemFireXD ...ETC Hadoop HDFS Isilon App Dev / Ops MLbase Streaming Legacy Systems Legacy Data Scientists Data Sources End Users SparkSQL
  • Principle #10 The rumors of my death…
  • It will be called Hadoop HDFS Pig Sqoop Flume Coordination and workflow management ZooKeeper Command Center ASF Projects FLOSS Projects Pivotal Products GemFire with Tachyon Oozie MapReduce Hive Tez Giraph Hadoop UI Hue SolrCloud Phoenix HBase Crunch Mahout Spark Shark Streaming MLlib GraphX Impala HAWQ SpringXD MADlib Hamster PivotalR YARN
  • Spark recap • Is it “Big Data” (Yes) • Is it “Hadoop” (No) • It’s one of those “in memory” things, right (Yes) • JVM, Java, Scala (All) • Is it Real or just another shiny technology with a long, but ultimately small tail (Yes and ?)
  • A NEW PLATFORM FOR A NEW ERA
  • Credits • Wikipedia and Dilbert.com • Apache Software Foundation • Scott Deeg • Milind Bhandarkar • Susheel Kaushik • Mak Gokhale
  • Questions ?