
Intro to Apache Spark

Intro to Spark at the Scala User Group Cologne/Bonn



  1. Intro to Apache Spark: Clustered In-Memory Computation. Marius Soutier, Freelance Software Engineer, @mariussoutier
  2. Motivation. The problem: running computations on BigData® • Classical data architectures break down • RDBMSs can’t handle large amounts of data well • Most RDBMSs can’t handle multiple input formats • Most NoSQL stores don’t offer analytics
  3. The 3 Vs of Big Data • Volume: 100s of GB, TB, PB • Variety: structured, unstructured, semi-structured • Velocity: sensors, real-time, “fast data”
  4. Hadoop (1) • The de facto standard for running computations on large amounts of varied data is Hadoop • Hadoop consists of HDFS, a distributed, fault-tolerant file system, and Map/Reduce, a model for parallelizable computations pioneered by Google • Hadoop is typically run on a (large) cluster of non-virtualized commodity hardware
  5. Hadoop (2) • However, Map/Reduce jobs are batch jobs with high latency • Not suitable for interactive queries, real-time analytics, or machine learning • Pure Map/Reduce is hard to develop and maintain
  6. Enter Spark • Spark is a framework for clustered in-memory data processing
  7. Apache Spark (1) • Developed at UC Berkeley, released in 2010 • Apache top-level project since February 2014; current version is 1.2.1 / 1.3.0 • USP: uses cluster-wide available memory to speed up computations • Very active community
  8. Apache Spark (2) • Written in Scala (& Akka), with APIs for Java and Python • Programming model is a collection pipeline* instead of Map/Reduce (sketched below) • Supports batch, streaming, and interactive processing, or all combined using a unified API (* http://martinfowler.com/articles/collection-pipeline/)
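A quick sketch of the collection-pipeline style from slide 8 (the log path, field layout, and "click" filter are made-up assumptions for illustration; sc is the SparkContext from the later examples):

      // Chain transformations on an RDD instead of writing separate
      // Mapper and Reducer classes as in classic Hadoop Map/Reduce.
      val clicksPerUser = sc.textFile("/logs/access.log")   // hypothetical input
        .filter(_.contains("click"))                        // keep click events
        .map(line => line.split("\t")(0) -> 1L)             // userId -> 1
        .reduceByKey(_ + _)                                 // count per user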
  9. Spark Ecosystem • [diagram] On top of Spark Core: Spark SQL, Spark Hive, BlinkDB (approximate SQL), Spark Streaming, MLlib (machine learning), GraphX, and SparkR (several of these marked alpha at the time), with Tachyon underneath
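As a taste of one ecosystem component, a minimal Spark SQL sketch (Spark 1.3-era API; the JSON file and its fields are assumptions):

      import org.apache.spark.sql.SQLContext

      val sqlContext = new SQLContext(sc)
      val people = sqlContext.jsonFile("/tmp/people.json") // infers a schema from the JSON
      people.registerTempTable("people")                   // expose it to SQL queries
      sqlContext.sql("SELECT name FROM people WHERE age >= 18")
        .collect()
        .foreach(println)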
  10. Spark is a framework for clustered in-memory data processing. More than that, Spark is a platform for data-driven products.
  11. RDD • Base abstraction: the Resilient Distributed Dataset (RDD) • Essentially a distributed collection of objects • Can be cached in memory or on disk (a small sketch follows)
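A small sketch of creating and caching an RDD (the data is made up):

      // parallelize distributes a local collection across the cluster
      val numbers = sc.parallelize(1 to 1000000)
      val squares = numbers.map(n => n.toLong * n) // lazy transformation
      squares.cache()                              // keep partitions in memory;
                                                   // persist(StorageLevel.MEMORY_AND_DISK)
                                                   // would also allow spilling to disk
      println(squares.count())                     // first action computes and caches
      println(squares.reduce(_ + _))               // second action reuses the cache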
  12. RDD Word Count

      import org.apache.spark._
      import org.apache.spark.SparkContext._ // pair-RDD functions (pre-1.3)
      import org.apache.spark.rdd.RDD

      val sc = new SparkContext() // configuration is read from a SparkConf / system properties

      val input: RDD[String] = sc.textFile("/tmp/word.txt")

      val words: RDD[(String, Long)] = input
        .flatMap(line => line.toLowerCase.split("\\s+")) // split on whitespace; the
                                                         // original's "s+" lost its backslash
        .map(word => word -> 1L)
        .cache()

      val wordCountsRdd: RDD[(String, Long)] = words
        .reduceByKey(_ + _)
        .sortByKey()

      val wordCounts: Array[(String, Long)] = wordCountsRdd.collect()
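One way to poke at the result (a sketch; the output path is hypothetical, and in the spark-shell sc is already provided):

      wordCountsRdd.take(10).foreach(println)           // inspect the first entries
      wordCountsRdd.saveAsTextFile("/tmp/word-counts")  // write results to HDFS/local disk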
  13. Cluster • [diagram] The Driver (holding the SparkContext) talks to the Master; Workers run Executors that execute Tasks • Spark app (driver) builds a DAG from RDD operations • The DAG is split into tasks that are executed by the workers
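The DAG is built lazily: transformations only record lineage, and nothing runs on the cluster until an action is called. A minimal sketch:

      val lines = sc.textFile("/tmp/word.txt")  // no job runs yet
      val upper = lines.map(_.toUpperCase)      // still lazy: just extends the DAG
      val n = upper.count()                     // action: the DAG is split into
                                                // tasks and shipped to executors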
  14. Example Architecture • [diagram] Input from HDFS and a message queue feeds Spark Streaming and Spark batch jobs, with SparkSQL on top, serving a real-time dashboard, interactive SQL, and analytics/reports (sketched below)
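A minimal Spark Streaming sketch matching this architecture (the socket source stands in for the message queue; host and port are assumptions):

      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(sc, Seconds(10))       // 10-second micro-batches
      val events = ssc.socketTextStream("localhost", 9999)  // e.g. a queue consumer instead
      events.map(event => event -> 1L)
            .reduceByKey(_ + _)
            .print()                                        // or push to a dashboard
      ssc.start()
      ssc.awaitTermination()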
  15. Demo • Questions?
