Intro to Apache Spark

  1. Intro to Apache Spark. Marius Soutier, Freelance Software Engineer, @mariussoutier. Clustered In-Memory Computation.
  2. Motivation. Problem: running computations on BigData® • Classical data architectures break down • RDBMSs can’t handle large amounts of data well • Most RDBMSs can’t handle multiple input formats • Most NoSQL stores don’t offer analytics
  3. The 3 Vs of Big Data. Volume: 100s of GB, TB, PB. Variety: structured, unstructured, semi-structured. Velocity: sensors, real-time “Fast Data”.
  4. Hadoop (1) • The de-facto standard for running computations on large amounts of diverse data is Hadoop • Hadoop consists of HDFS, a distributed, fault-tolerant file system, and Map/Reduce, a model for parallelizable computations pioneered by Google • Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
  5. Hadoop (2) • However, Map/Reduce runs as high-latency batch jobs • Not suitable for interactive queries, real-time analytics, or machine learning • Pure Map/Reduce is hard to develop and maintain
  6. Enter Spark. Spark is a framework for clustered in-memory data processing.
  7. Apache Spark (1) • Developed at UC Berkeley, released in 2010 • Apache top-level project since February 2014; current version is 1.2.1 / 1.3.0 • USP: uses cluster-wide available memory to speed up computations • Very active community
  8. Apache Spark (2) • Written in Scala (& Akka), with APIs for Java and Python • Programming model is a collection pipeline* instead of Map/Reduce (see the sketch below) • Supports batch, streaming, and interactive processing, or all combined using a unified API. * http://martinfowler.com/articles/collection-pipeline/
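To make the collection-pipeline model concrete, here is a minimal sketch in plain Scala (no Spark involved); the input list is a made-up example. Spark's RDD API mirrors exactly this chained style, only distributed across a cluster:

    // Collection pipeline: each step transforms the output of the previous one.
    val lines = List("to be or not to be", "that is the question")

    val counts: Map[String, Int] = lines
      .flatMap(_.split("\\s+"))                     // split lines into words
      .groupBy(identity)                            // group equal words together
      .map { case (word, ws) => word -> ws.size }   // count each group

    // counts("to") == 2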
  9. Spark Ecosystem (diagram): Spark Core underlies Spark SQL, Spark Hive, BlinkDB (approximate SQL), Spark Streaming, MLlib (machine learning), GraphX, and SparkR, several of them marked alpha, alongside the Tachyon storage layer.
  10. Spark is a framework for clustered in-memory data processing. Spark is a platform for data-driven products.
  11. RDD • Base abstraction: the Resilient Distributed Dataset (RDD) • Essentially a distributed collection of objects • Can be cached in memory or on disk (see the sketch below)
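A minimal sketch of that caching choice, assuming a SparkContext named sc; MEMORY_AND_DISK is one of several storage levels Spark offers:

    import org.apache.spark.storage.StorageLevel

    // cache() keeps partitions in memory (shorthand for persist(MEMORY_ONLY))
    val inMemory = sc.parallelize(1 to 1000000).cache()

    // persist() with an explicit level spills partitions to disk when memory is tight
    val onDisk = sc.textFile("/tmp/word.txt").persist(StorageLevel.MEMORY_AND_DISK)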
  12. RDD Word Count

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext()
    val input: RDD[String] = sc.textFile("/tmp/word.txt")

    val words: RDD[(String, Long)] = input
      .flatMap(line => line.toLowerCase.split("\\s+"))  // split lines into lower-case words
      .map(word => word -> 1L)                          // pair each word with a count of 1
      .cache()                                          // keep the pairs in memory for reuse

    val wordCountsRdd: RDD[(String, Long)] = words
      .reduceByKey(_ + _)  // sum the counts per word
      .sortByKey()         // order alphabetically by word

    val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()  // action: fetch results to the driver
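Building on this RDD, a hedged sketch of how Spark SQL (one of the ecosystem components above) could query the same data; the toDF/DataFrame calls are assumed from the Spark 1.3-era API, and the Word case class is introduced here purely for illustration:

    import org.apache.spark.sql.SQLContext

    case class Word(word: String, cnt: Long)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // enables .toDF() on RDDs of case classes

    val df = wordCountsRdd.map { case (w, c) => Word(w, c) }.toDF()
    df.registerTempTable("word_counts")  // expose the DataFrame to SQL
    sqlContext.sql("SELECT word, cnt FROM word_counts WHERE cnt > 10").show()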
  13. Cluster (diagram: a Driver holding a SparkContext talks to a Master, which coordinates Workers, each running an Executor that processes tasks) • The Spark app (driver) builds a DAG from RDD operations • The DAG is split into tasks that are executed by the workers (see the sketch below)
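A minimal sketch of how that DAG builds up, again assuming the sc from above: transformations are lazy and only record lineage; the action at the end is what actually gets split into tasks and shipped to the executors:

    val lines  = sc.textFile("/tmp/word.txt")       // transformation: nothing runs yet
    val errors = lines.filter(_.contains("ERROR"))  // transformation: the DAG just grows
    val n      = errors.count()                     // action: DAG is split into tasks and executed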
  14. Example Architecture (diagram): input flows into HDFS and a message queue; Spark Streaming feeds a real-time dashboard, Spark batch jobs produce analytics and reports, and SparkSQL serves interactive SQL.
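As a hedged sketch of the streaming leg of such an architecture, using the Spark 1.x streaming API: a socket stream stands in for the message queue here, and the host and port are hypothetical:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))         // 10-second micro-batches
    val messages = ssc.socketTextStream("localhost", 9999)  // stand-in for the message queue

    val counts = messages
      .flatMap(_.toLowerCase.split("\\s+"))
      .map(word => word -> 1L)
      .reduceByKey(_ + _)

    counts.print()  // a real system would push to the dashboard instead
    ssc.start()
    ssc.awaitTermination()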
  15. Demo. Questions?
