Apache Spark
The Next Generation Cluster Computing
Ivan Lozić, 04/25/2017
Ivan Lozić, software engineer & entrepreneur
Scala & Spark, C#, Node.js, Swift
Web page: www.deegloo.com
E-Mail: ilozic@gmail.com
LinkedIn: https://www.linkedin.com/in/ilozic/
Zagreb, Croatia
Contents
● Apache Spark and its relation to Hadoop MapReduce
● What makes Apache Spark run fast
● How to use Spark's rich API to build batch ETL jobs
● Streaming capabilities
● Structured streaming
Apache Hadoop
Apache Hadoop
● Open Source framework for distributed storage and processing
● Origins are in the project “Nutch” back in 2002 (Cutting, Cafarella)
● 2006 - Yahoo! created Hadoop, based on Google's GFS and MapReduce papers
● Based on the MapReduce programming model
● Fundamental assumption - hardware failures are common, so all the modules are built to handle them automatically
● Clusters built of commodity hardware
Apache Spark
Motivation
● Hardware - CPU compute bottleneck
● Users - democratise access to data and improve usability
● Applications - necessity to build near real time big data applications
Apache Spark
● Open source, fast and expressive cluster computing framework designed for big data analytics
● Compatible with Apache Hadoop
● Developed at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation in 2013
● Original author - Matei Zaharia
● Databricks Inc. - company behind Apache Spark
Apache Spark
● General distributed computing engine which unifies:
○ SQL and DataFrames
○ Real-time streaming (Spark Streaming)
○ Machine learning (Spark ML/MLlib)
○ Graph processing (GraphX)
Apache Spark
● Runs everywhere - standalone, EC2, Hadoop YARN, Apache Mesos
● Reads and writes from/to:
○ File/Directory
○ HDFS/S3
○ JDBC
○ JSON
○ CSV
○ Parquet
○ Cassandra, HBase, ...
Apache Spark - architecture
source: Databricks
Word count - MapReduce vs Spark
// Hadoop MapReduce (Java)
package org.myorg;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts for each word
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
// Spark (Scala)
val file = spark.sparkContext.textFile("hdfs://...")
val counts = file
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Hadoop ecosystem
Who uses Apache Spark?
Core data abstractions
Resilient Distributed Dataset
● RDDs are partitioned collections of objects - building blocks of Spark
● Immutable and provide fault tolerant computation
● Two types of operations:
1. Transformations - map, flatMap, filter, sortBy, groupBy, reduceByKey, ...
2. Actions - collect, count, take, first, reduce, foreach, saveToCassandra, ...
RDD
● Types of operations are based on Scala collection API
● Transformations are lazily evaluated - each one only adds a node to the DAG (Directed Acyclic Graph) of the computation
● Actions trigger scheduling of the DAG and the actual computation
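The split between lazy transformations and eager actions can be sketched as follows. This is a minimal illustration, assuming a local Spark 2.x session; the application name, the master setting and the sample numbers are arbitrary choices, not taken from the slides.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; appName and master are assumptions.
val spark = SparkSession.builder()
  .appName("lazy-evaluation-sketch")
  .master("local[2]")
  .getOrCreate()

val numbers = spark.sparkContext.parallelize(1 to 10)

// Transformations: these only describe the computation, nothing runs yet.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Action: collect() triggers the job and returns the results to the driver.
val result = squares.collect().toList
println(result)

spark.stop()
```

Until `collect()` is called, Spark has only recorded how `squares` is derived from `numbers`; no data has been processed.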
RDD
Data shuffling
● Sending data over the network
● Slow - should be minimized as much as possible!
● Typical example - groupByKey (slow) vs reduceByKey (faster)
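A small sketch of the groupByKey vs reduceByKey comparison, assuming a local Spark 2.x session and made-up (category, 1) pairs. Both variants produce the same counts; the difference is how much data crosses the network during the shuffle.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; appName and master are assumptions.
val spark = SparkSession.builder()
  .appName("shuffle-sketch")
  .master("local[2]")
  .getOrCreate()

// Hypothetical sample data: one (category, 1) pair per incident.
val pairs = spark.sparkContext.parallelize(
  Seq(("THEFT", 1), ("ASSAULT", 1), ("THEFT", 1), ("THEFT", 1)))

// groupByKey shuffles every single pair, then sums on the receiving side.
val viaGroupByKey = pairs.groupByKey().mapValues(_.sum).collect().toMap

// reduceByKey first combines values per key inside each partition,
// so only partial sums are shuffled.
val viaReduceByKey = pairs.reduceByKey(_ + _).collect().toMap

println(viaGroupByKey)
println(viaReduceByKey)

spark.stop()
```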
RDD - the problems
● They express the "how" better than the "what"
● Operations and data types inside closures are a black box for Spark - Spark cannot optimize them
// byCommaButNotUnderQuotes: a regex, defined elsewhere, that splits on commas outside quotes
val category = spark.sparkContext.textFile("/data/SFPD_Incidents_2003.csv")
  .map(line => line.split(byCommaButNotUnderQuotes)(1))
  .filter(cat => cat != "Category")
Structure (Structured APIs)
Spark SQL
● Successor to “Shark”, which was built to enable HiveQL queries on Spark
● As of Spark 2.0 - SQL:2003 support
category.toDF("categoryName").createOrReplaceTempView("category")
spark.sql("""
  SELECT categoryName, count(*) AS Count
  FROM category
  GROUP BY categoryName
  ORDER BY 2 DESC
""").show(5)
DataFrame
● Higher level abstraction (DSL) to manipulate data
● Distributed collection of rows organized into named columns
● Modeled after the pandas DataFrame
● DataFrame has a schema (something RDD is missing)
val categoryDF = category.toDF("categoryName")
categoryDF
  .groupBy("categoryName")
  .count()                    // adds a column named "count"
  .orderBy($"count".desc)
  .show(5)
DataFrame
Structured APIs error-check comparison
source: Databricks
Dataset
● Extension to DataFrame
● Type-safe
● DataFrame = Dataset[Row]
import spark.implicits._  // encoders needed by .as[Incident] and mapGroups

case class Incident(Category: String, DayOfWeek: String)

val incidents = spark
  .read
  .option("header", "true")
  .csv("/data/SFPD_Incidents_2003.csv")
  .select("Category", "DayOfWeek")
  .as[Incident]

val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

// Per category, count incidents for each day of the week
val histogram = incidents.groupByKey(_.Category).mapGroups {
  case (category, daysOfWeek) =>
    val buckets = new Array[Int](7)
    daysOfWeek.map(_.DayOfWeek).foreach { dow =>
      buckets(days.indexOf(dow)) += 1
    }
    (category, buckets)
}
What makes Spark fast?
In memory computation
Hadoop MapReduce:
● Fault tolerance is achieved by using HDFS
● It is easily possible to spend 90% of the time on disk I/O alone
[Diagram: input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → HDFS read → ...]
Apache Spark:
● Fault tolerance is provided by building the lineage of transformations
● Data is not replicated
[Diagram: input → iter. 1 → iter. 2 → ...]
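To make the in-memory reuse and the lineage idea concrete, here is a minimal sketch assuming a local Spark 2.x session. The data and names are illustrative only; `cache()` and `toDebugString` are standard RDD methods.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; appName and master are assumptions.
val spark = SparkSession.builder()
  .appName("in-memory-sketch")
  .master("local[2]")
  .getOrCreate()

// Mark the RDD for in-memory caching; nothing is computed yet.
val doubled = spark.sparkContext.parallelize(1 to 1000).map(_ * 2).cache()

// First action computes the partitions and keeps them in memory.
val total = doubled.reduce(_ + _)

// Second action reuses the cached partitions instead of recomputing them.
val largest = doubled.max()

// The lineage Spark would replay to rebuild a lost partition.
println(doubled.toDebugString)

spark.stop()
```

If a cached partition is lost, Spark recomputes it from the recorded lineage rather than reading a replicated copy.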
Catalyst - query optimizer
● Applies transformations to convert an unoptimized query plan into an optimized one
source: Databricks
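One way to look at what Catalyst does with a query is `explain(true)`, which prints the parsed, analyzed and optimized logical plans followed by the physical plan. The sketch below assumes a local Spark 2.x session and a tiny made-up DataFrame; the exact plan text varies between Spark versions, so no output is shown.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; appName and master are assumptions.
val spark = SparkSession.builder()
  .appName("catalyst-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the incidents data set.
val incidents = Seq(
  ("THEFT", "Monday"),
  ("ASSAULT", "Friday"),
  ("THEFT", "Sunday")
).toDF("Category", "DayOfWeek")

// A deliberately redundant query: two projections around one filter.
val theftDays = incidents
  .select($"Category", $"DayOfWeek")
  .filter($"Category" === "THEFT")
  .select($"DayOfWeek")

// Prints the logical plans (before and after optimization) and the physical plan.
theftDays.explain(true)

val days = theftDays.as[String].collect().toList

spark.stop()
```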
Project Tungsten
● Improves Spark execution memory and CPU efficiency by:
○ Performing explicit memory management instead of relying on JVM objects (Dataset encoders)
○ Generating code on the fly to fuse multiple operators into one (whole stage codegen)
○ Introducing cache-aware computation
○ Using an in-memory columnar format
● Brings Spark closer to the bare metal
Dataset encoders
● Encoders translate between domain objects and Spark's internal format
source: Databricks
Dataset encoders
● Encoders bridge objects with data sources
{
  "Category": "THEFT",
  "IncidntNum": "150060275",
  "DayOfWeek": "Saturday"
}
case class Incident(IncidntNum: Int,
                    Category: String,
                    DayOfWeek: String)
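As a rough illustration of how an encoder ties a case class to Spark's internal schema, the sketch below derives an encoder for the `Incident` class with `Encoders.product` and inspects the resulting schema. It assumes Spark SQL 2.x on the classpath and only covers the schema side of encoders, not serialization itself.

```scala
import org.apache.spark.sql.Encoders

case class Incident(IncidntNum: Int,
                    Category: String,
                    DayOfWeek: String)

// Derive an encoder from the case class definition.
val encoder = Encoders.product[Incident]

// The encoder carries the schema Spark uses for its internal row format.
val schema = encoder.schema
schema.printTreeString()

// Field names paired with their Spark SQL type names.
val fieldTypes = schema.fields.map(f => f.name -> f.dataType.simpleString).toList
```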
Dataset benchmark
Space efficiency
source: Databricks
Dataset benchmark
Serialization/deserialization performance
source: Databricks
Whole stage codegen
● Fuse the operators together
● Generate code on the fly
● The idea: generate specialized code as if it was written manually to be fast
● Reported result: Spark 2.0 is up to 10x faster than Spark 1.6 on some workloads
Whole stage codegen
SELECT COUNT(*) FROM store_sales
WHERE ss_item_sk=1000
Whole stage codegen
Volcano iterator model
Whole stage codegen
What if we asked an intern to write this in C#?
long count = 0;
foreach (var ss_item_sk in store_sales) {
    if (ss_item_sk == 1000)
        count++;
}
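The contrast between the two execution styles can be sketched in plain Scala, without Spark. This is a simplified model, not Spark's generated code: the `Operator`, `Scan` and `Filter` names and the sample data are invented for illustration. The Volcano-style version pulls one row at a time through a chain of operators, while the fused version is the single tight loop the "intern" would write.

```scala
// Volcano-style operators: each one pulls rows from its child via next().
trait Operator {
  def next(): Option[Long]
}

// Leaf operator: returns the rows of an in-memory table one by one.
final class Scan(rows: Array[Long]) extends Operator {
  private var position = 0
  def next(): Option[Long] =
    if (position < rows.length) {
      val row = rows(position)
      position += 1
      Some(row)
    } else None
}

// Filter operator: skips rows from its child until one matches.
final class Filter(child: Operator, predicate: Long => Boolean) extends Operator {
  def next(): Option[Long] = {
    var row = child.next()
    while (row.exists(value => !predicate(value))) row = child.next()
    row
  }
}

// Root aggregate: drains its child and counts the rows it produced.
def countRows(child: Operator): Long = {
  var count = 0L
  while (child.next().isDefined) count += 1
  count
}

// "Intern" version: scan, filter and count fused into one loop.
def fusedCount(rows: Array[Long]): Long = {
  var count = 0L
  var i = 0
  while (i < rows.length) {
    if (rows(i) == 1000L) count += 1
    i += 1
  }
  count
}

// Hypothetical ss_item_sk values standing in for the store_sales table.
val storeSales = Array[Long](1000, 7, 1000, 42, 1000)

val volcanoResult = countRows(new Filter(new Scan(storeSales), _ == 1000L))
val fusedResult = fusedCount(storeSales)
println(s"volcano=$volcanoResult fused=$fusedResult")
```

Both versions return the same count; the fused loop avoids a virtual call and an `Option` allocation per row, which is the kind of overhead whole stage codegen aims to remove.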
Volcano vs Intern
[Figure: comparison of the Volcano and Intern versions]
source: Databricks
Volcano vs Intern
Developing ETL with Spark
Choose your favorite IDE
Define Spark job entry point
import org.apache.spark.sql.SparkSession

object IncidentsJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Incidents processing job")
      .config("spark.sql.shuffle.partitions", "16")
      .master("local[4]")
      .getOrCreate()

    // { spark transformations and actions... }

    System.exit(0)
  }
}
Create build.sbt file
lazy val root = (project in file(".")).
  settings(
    organization := "com.mycompany",
    name := "spark.job.incidents",
    version := "1.0.0",
    scalaVersion := "2.11.8",
    mainClass in Compile := Some("com.mycompany.spark.job.incidents.main")
  )

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
  "com.microsoft.sqlserver" % "sqljdbc4" % "4.0"
)
Create application (fat) jar file
$ sbt compile
$ sbt test
$ sbt assembly   # requires the sbt-assembly plugin
Submit job via spark-submit command
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
Example workflow
1. pull content (code)
2. take build number (331)
3. build & test
4. copy to cluster (job331.jar)
5. create/schedule job job331 (http)
6. spark submit job331
[Diagram labels: produce job artifact, notification]
Spark Streaming
Apache Spark streaming
● Scalable fault tolerant streaming system
● Receivers receive data streams and chop them into batches
● Spark processes batches and pushes out the result
● Input: Files, Socket, Kafka, Flume, Kinesis...
Apache Spark streaming
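import kafka.serializer.DefaultDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils  // spark-streaming-kafka-0-8 integration

def main(args: Array[String]): Unit = {
  val conf = new SparkConf()
    .setMaster("local[2]")
    .setAppName("Incidents processing job - Stream")
  val ssc = new StreamingContext(conf, Seconds(1))  // 1-second batches

  // Topics and kafkaParams are assumed to be defined elsewhere in the application
  val topics = Set(Topics.Incident)
  val directKafkaStream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte],
    DefaultDecoder, DefaultDecoder](
    ssc,
    kafkaParams,
    topics)

  // process batches: decode the message value, then split it into words
  directKafkaStream
    .map { case (_, value) => new String(value, "UTF-8") }
    .flatMap(_.split(" ")) // ...

  // Start the computation
  ssc.start()
  ssc.awaitTermination()
  System.exit(0)
}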
Apache Spark streaming
● Integrates with the rest of the ecosystem
○ Combine batch and stream processing
○ Combine machine learning with streaming
○ Combine SQL with streaming
Structured streaming
[Alpha version in Spark 2.1]
Structured streaming (continuous apps)
● High-level streaming API built on DataFrames
● Catalyst optimizer creates incremental execution plan
● Unifies streaming, interactive and batch queries
● Supports multiple sources and sinks
● E.g. aggregate data in a stream, then serve using JDBC
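The "aggregate data in a stream" idea can be sketched end to end as below. This is a minimal illustration under several assumptions: a local Spark 2.x session, a temporary directory with one tiny CSV file as the streaming source, and the in-memory sink (query name `category_counts`) in place of a real serving layer such as JDBC. All paths and names are invented for the example.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local session for illustration only; appName and master are assumptions.
val spark = SparkSession.builder()
  .appName("structured-streaming-sketch")
  .master("local[2]")
  .getOrCreate()

// Illustrative input: a temporary directory holding one small CSV file.
val sourceDir = Files.createTempDirectory("incidents-source")
Files.write(
  sourceDir.resolve("part-0.csv"),
  "Category\nTHEFT\nASSAULT\nTHEFT\n".getBytes("UTF-8"))

// Streaming file sources need an explicit schema.
val schema = StructType(Seq(StructField("Category", StringType)))

// Same DataFrame operations as in a batch job, applied to a stream.
val counts = spark.readStream
  .option("header", "true")
  .schema(schema)
  .csv(sourceDir.toString)
  .groupBy("Category")
  .count()

// Keep the running aggregate in an in-memory table that can be queried with SQL.
val query = counts.writeStream
  .outputMode("complete")
  .format("memory")
  .queryName("category_counts")
  .start()

// Block until the files currently in the directory have been processed.
query.processAllAvailable()

val result = spark.table("category_counts")
  .collect()
  .map(row => row.getString(0) -> row.getLong(1))
  .toMap
println(result)

query.stop()
spark.stop()
```

In a long-running application the query would be left running (for example with `query.awaitTermination()`), and new files dropped into the source directory would update the counts incrementally.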
Structured streaming key idea
The simplest way to perform streaming analytics is not having to reason
about streaming.
Structured streaming
Structured streaming
● Reusing same API
// finite (batch)
val categories = spark
  .read
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")
// infinite (streaming)
val categories = spark
  .readStream
  .option("header", "true")
  .schema(schema)
  .csv("/data/source")
  .select("Category")
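Structured streaming
● Reusing same API
// finite (batch)
categories
  .write
  .format("parquet")
  .save("/data/warehouse/categories.parquet")
// infinite (streaming)
categories
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/categories")  // required by the file sink; path is illustrative
  .start("/data/warehouse/categories.parquet")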
Structured streaming
Useful resources
● Spark home page: https://spark.apache.org/
● Spark summit page: https://spark-summit.org/
● Apache Spark Docker image: https://github.com/dylanmei/docker-zeppelin
● SFPD Incidents: https://data.sfgov.org/Public-Safety/Police-Department-Incidents/tmnf-yvry
Thank you for your attention!
References
● Michael Armbrust - Structuring Spark: DataFrames, Datasets and Streaming - https://spark-summit.org/2016/events/structuring-spark-dataframes-datasets-and-streaming/
● Apache Parquet - https://parquet.apache.org/
● Spark Performance: What's Next - https://spark-summit.org/east-2016/events/spark-performance-whats-next/
● Avoid groupByKey - https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html

More Related Content

What's hot

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
Dean Chen
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Sachin Aggarwal
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
Takuya UESHIN
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
Cloudera, Inc.
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 
Spark overview
Spark overviewSpark overview
Spark overview
Lisa Hua
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
Cheng Lian
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
Patrick Wendell
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Sameer Farooqui
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
spark-project
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
Majid Hajibaba
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
hadooparchbook
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
Spark Summit
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
CloudxLab
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
sparkInstructor
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
Databricks
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
Tudor Lapusan
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Robert Sanders
 

What's hot (20)

Apache Spark RDDs
Apache Spark RDDsApache Spark RDDs
Apache Spark RDDs
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep diveApache Spark Introduction and Resilient Distributed Dataset basics and deep dive
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive
 
20140908 spark sql & catalyst
20140908 spark sql & catalyst20140908 spark sql & catalyst
20140908 spark sql & catalyst
 
Introduction to Apache Spark Developer Training
Introduction to Apache Spark Developer TrainingIntroduction to Apache Spark Developer Training
Introduction to Apache Spark Developer Training
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Spark overview
Spark overviewSpark overview
Spark overview
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015 Spark & Cassandra at DataStax Meetup on Jan 29, 2015
Spark & Cassandra at DataStax Meetup on Jan 29, 2015
 
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17Deep Dive with Spark Streaming - Tathagata  Das - Spark Meetup 2013-06-17
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17
 
Apache Spark
Apache Spark Apache Spark
Apache Spark
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
Apache Spark RDD 101
Apache Spark RDD 101Apache Spark RDD 101
Apache Spark RDD 101
 
Spark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student SlidesSpark Summit East 2015 Advanced Devops Student Slides
Spark Summit East 2015 Advanced Devops Student Slides
 
Apache spark Intro
Apache spark IntroApache spark Intro
Apache spark Intro
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Viewers also liked

Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.
Konstantinos Kloudas
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
DataWorks Summit/Hadoop Summit
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
Thomas W. Dinsmore
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
tcloudcomputing-tw
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Databricks
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
Aerospike, Inc.
 

Viewers also liked (8)

Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.Introduction to Stateful Stream Processing with Apache Flink.
Introduction to Stateful Stream Processing with Apache Flink.
 
Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem Large-Scale Stream Processing in the Hadoop Ecosystem
Large-Scale Stream Processing in the Hadoop Ecosystem
 
Apache Spark Briefing
Apache Spark BriefingApache Spark Briefing
Apache Spark Briefing
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
Tcloud Computing Hadoop Family and Ecosystem Service 2013.Q3
 
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on TutorialsSparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
Sparkcamp @ Strata CA: Intro to Apache Spark with Hands-on Tutorials
 
What the Spark!? Intro and Use Cases
What the Spark!? Intro and Use CasesWhat the Spark!? Intro and Use Cases
What the Spark!? Intro and Use Cases
 

Similar to Apache Spark, the Next Generation Cluster Computing

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Аліна Шепшелей
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
Inhacking
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
Databricks
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
Databricks
 
Introduce spark (by 조창원)
Introduce spark (by 조창원)Introduce spark (by 조창원)
Introduce spark (by 조창원)
I Goo Lee.
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
Databricks
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
Thành Nguyễn
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
jeykottalam
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
Vienna Data Science Group
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
Mohamed hedi Abidi
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
Hugo Gävert
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Databricks
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
Denis Dus
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
Snehal Nagmote
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Summit
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
Databricks
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 

Similar to Apache Spark, the Next Generation Cluster Computing (20)

Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
Vitalii Bondarenko HDinsight: spark. advanced in memory big-data analytics wi...
 
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
SE2016 BigData Vitalii Bondarenko "HD insight spark. Advanced in-memory Big D...
 
Spark what's new what's coming
Spark what's new what's comingSpark what's new what's coming
Spark what's new what's coming
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Introduce spark (by 조창원)
Introduce spark (by 조창원)Introduce spark (by 조창원)
Introduce spark (by 조창원)
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
 
Real-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to StreamingReal-Time Spark: From Interactive Queries to Streaming
Real-Time Spark: From Interactive Queries to Streaming
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Introduction to Scalding and Monoids
Introduction to Scalding and MonoidsIntroduction to Scalding and Monoids
Introduction to Scalding and Monoids
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Using spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and CassandraUsing spark 1.2 with Java 8 and Cassandra
Using spark 1.2 with Java 8 and Cassandra
 
Meetup ml spark_ppt
Meetup ml spark_pptMeetup ml spark_ppt
Meetup ml spark_ppt
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 
Spark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with SparkSpark Application Carousel: Highlights of Several Applications Built with Spark
Spark Application Carousel: Highlights of Several Applications Built with Spark
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 

Apache Spark, the Next Generation Cluster Computing

  • 1. Apache Spark The next Generation Cluster Computing Ivan Lozić, 04/25/2017
  • 2. Ivan Lozić, software engineer & entrepreneur Scala & Spark, C#, Node.js, Swift Web page: www.deegloo.com E-Mail: ilozic@gmail.com LinkedIn: https://www.linkedin.com/in/ilozic/ Zagreb, Croatia
  • 3. Contents ● Apache Spark and its relation to Hadoop MapReduce ● What makes Apache Spark run fast ● How to use Spark's rich API to build batch ETL jobs ● Streaming capabilities ● Structured streaming
  • 5. Apache Hadoop ● Open-source framework for distributed storage and processing ● Origins are in the “Nutch” project back in 2002 (Cutting, Cafarella) ● 2006: Yahoo! created Hadoop, based on GFS and MapReduce ● Based on the MapReduce programming model ● Fundamental assumption: all modules are built to handle hardware failures automatically ● Clusters are built from commodity hardware
  • 8. Motivation ● Hardware - CPU compute bottleneck ● Users - democratise access to data and improve usability ● Applications - the need to build near-real-time big data applications
  • 9. Apache Spark ● Open-source, fast and expressive cluster computing framework designed for big data analytics ● Compatible with Apache Hadoop ● Developed at UC Berkeley's AMPLab in 2009 and donated to the Apache Software Foundation in 2013 ● Original author - Matei Zaharia ● Databricks Inc. - company behind Apache Spark
  • 10. Apache Spark ● General distributed computing engine which unifies: ○ SQL and DataFrames ○ Real-time streaming (Spark Streaming) ○ Machine learning (Spark ML/MLlib) ○ Graph processing (GraphX)
  • 11. Apache Spark ● Runs everywhere - standalone, EC2, Hadoop YARN, Apache Mesos ● Reads and writes from/to: ○ File/Directory ○ HDFS/S3 ○ JDBC ○ JSON ○ CSV ○ Parquet ○ Cassandra, HBase, ... 11
  • 12. Apache Spark - architecture (diagram; source: Databricks)
  • 13. Word count - MapReduce vs Spark
      Hadoop MapReduce (Java):
        package org.myorg;

        import java.io.IOException;
        import java.util.*;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapred.*;
        import org.apache.hadoop.util.*;

        public class WordCount {
          // Mapper: emits (word, 1) for every token in the input line.
          public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
              }
            }
          }

          // Reducer (also used as combiner): sums the counts for each word.
          public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCount.class);
            conf.setJobName("wordcount");
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);
            conf.setMapperClass(Map.class);
            conf.setCombinerClass(Reduce.class);
            conf.setReducerClass(Reduce.class);
            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));
            JobClient.runJob(conf);
          }
        }

      Spark (Scala):
        // `spark` here is a SparkContext
        val file = spark.textFile("hdfs://...")
        val counts = file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
        counts.saveAsTextFile("hdfs://...")
  • 15. Who uses Apache Spark? 15
  • 17. Resilient Distributed Dataset ● RDDs are partitioned collections of objects - the building blocks of Spark ● Immutable, and provide fault-tolerant computation ● Two types of operations: 1. Transformations - map, filter, sortBy, groupBy, reduceByKey, ... 2. Actions - collect, count, reduce, take, first, foreach, saveToCassandra, ...
  • 18. RDD ● Types of operations are based on the Scala collection API ● Transformations are lazily evaluated and become nodes of a DAG (Directed Acyclic Graph) ● Actions trigger DAG creation and the actual computation
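The lazy-transformation idea on this slide can be imitated with a plain Scala iterator. This is only a sketch under that assumption: it uses the standard library rather than Spark's RDD API, and the names `LazySketch` and `evaluations` are made up for illustration. Building the pipeline runs nothing; forcing it with `toList` plays the role of an action.

```scala
object LazySketch {
  // Counts how many times the mapping function actually runs.
  var evaluations = 0

  def run(): (Int, List[Int], Int) = {
    evaluations = 0
    // "Transformations": chaining map and filter only describes the work.
    val pipeline = (1 to 10).iterator
      .map { x => evaluations += 1; x * 2 }
      .filter(_ % 3 == 0)
    val before = evaluations // nothing has been evaluated at this point
    // "Action": materializing the result pulls every element through the chain.
    val result = pipeline.toList
    (before, result, evaluations)
  }
}
```

Calling `LazySketch.run()` returns the evaluation count before the action, the materialized result, and the evaluation count afterwards.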
  • 20. Data shuffling ● Sending data over the network ● Slow - should be minimized as much as possible! ● Typical example - groupByKey (slow) vs reduceByKey (faster) 20
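Why reduceByKey tends to shuffle less than groupByKey can be illustrated without a cluster. The sketch below is an assumption-laden model, not Spark code: partitions are plain Scala sequences, and "shuffled records" is simply the number of records that would have to leave their partition. `ShuffleSketch` and its helper names are hypothetical.

```scala
object ShuffleSketch {
  type Partition = Seq[(String, Int)]

  // Sum the values for each key in a local collection.
  private def sumByKey(records: Seq[(String, Int)]): Map[String, Int] =
    records.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // groupByKey style: every record is sent across the network, then summed.
  def groupThenSum(partitions: Seq[Partition]): (Map[String, Int], Int) = {
    val shuffled = partitions.flatten
    (sumByKey(shuffled), shuffled.size)
  }

  // reduceByKey style: each partition is pre-aggregated locally, so only
  // one partial sum per key and partition has to be shuffled.
  def combineThenSum(partitions: Seq[Partition]): (Map[String, Int], Int) = {
    val shuffled = partitions.flatMap(p => sumByKey(p).toSeq)
    (sumByKey(shuffled), shuffled.size)
  }
}
```

Both functions return the same totals; the second element of the tuple shows how many records were moved in each style.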
  • 21. RDD - the problems ● They express the "how" better than the "what" ● Operations and data types inside closures are a black box for Spark, so Spark cannot optimize them
      // byCommaButNotUnderQuotes is not defined on the slide; presumably a
      // regex that splits on commas outside quoted fields
      val category = spark.sparkContext.textFile("/data/SFPD_Incidents_2003.csv")
        .map(line => line.split(byCommaButNotUnderQuotes)(1))
        .filter(cat => cat != "Category")
  • 23. SparkSQL ● Originally named “Shark” - to enable HiveQL queries ● As of Spark 2.0 - SQL 2003 support
      category.toDF("categoryName").createOrReplaceTempView("category")
      spark.sql("""
        SELECT categoryName, count(*) AS Count
        FROM category
        GROUP BY categoryName
        ORDER BY 2 DESC
      """).show(5)
  • 24. DataFrame ● Higher-level abstraction (DSL) for manipulating data ● Distributed collection of rows organized into named columns ● Modeled after the pandas DataFrame ● A DataFrame has a schema (something an RDD is missing)
      // toDF and the $"..." column syntax require: import spark.implicits._
      val categoryDF = category.toDF("categoryName")
      categoryDF
        .groupBy("categoryName")
        .count()
        .orderBy($"Count".desc)
        .show(5)
  • 26. Structured APIs error-check comparison (figure; source: Databricks)
  • 27. Dataset ● Extension to DataFrame ● Type-safe ● DataFrame = Dataset[Row]
      case class Incident(Category: String, DayOfWeek: String)

      val incidents = spark
        .read
        .option("header", "true")
        .csv("/data/SFPD_Incidents_2003.csv")
        .select("Category", "DayOfWeek")
        .as[Incident]

      val days = Array("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

      // Per category, count incidents for each day of the week.
      val histogram = incidents.groupByKey(_.Category).mapGroups {
        case (category, daysOfWeek) =>
          val buckets = new Array[Int](7)
          daysOfWeek.map(_.DayOfWeek).foreach { dow => buckets(days.indexOf(dow)) += 1 }
          (category, buckets)
      }
  • 29. In memory computation ● Hadoop MapReduce: fault tolerance is achieved by using HDFS, so every iteration reads its input from HDFS and writes its result back (input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → ...); it is easy to spend 90% of the time on disk I/O alone ● Spark: fault tolerance is provided by building a lineage of transformations; data is not replicated (input → iter. 1 → iter. 2 → ...)
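Lineage-based fault tolerance can be pictured as "remember the recipe, not the copies". The following is a minimal sketch in plain Scala under that assumption; `Lineage` and `computePartition` are hypothetical names, not Spark classes. A lost partition is rebuilt by replaying the recorded transformations over its source data instead of reading a replica.

```scala
object LineageSketch {
  // A dataset is described by its source partitions plus the ordered list
  // of transformations applied to them (the lineage).
  final case class Lineage(source: Vector[Vector[Int]], steps: List[Int => Int]) {
    // Recording a transformation does not touch the data.
    def map(f: Int => Int): Lineage = copy(steps = steps :+ f)

    // Compute one partition by replaying every recorded step on its source.
    def computePartition(i: Int): Vector[Int] =
      steps.foldLeft(source(i))((data, f) => data.map(f))
  }

  def demo(): (Vector[Int], Vector[Int]) = {
    val ds = Lineage(Vector(Vector(1, 2), Vector(3, 4)), Nil).map(_ + 1).map(_ * 10)
    val original = ds.computePartition(1)
    // Pretend the node holding partition 1 failed: only that partition is
    // recomputed, and the result is identical because the steps are deterministic.
    val recovered = ds.computePartition(1)
    (original, recovered)
  }
}
```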
  • 30. Catalyst - query optimizer ● Applies transformations to convert an unoptimized query plan into an optimized one (figure; source: Databricks)
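The idea of rewriting an unoptimized plan into an optimized one can be shown on a toy expression tree. This is a sketch only: the `Expr` types and the two rules (constant folding and removal of neutral elements) are invented for illustration and are far simpler than Catalyst's real plan nodes and rule sets.

```scala
object OptimizerSketch {
  sealed trait Expr
  final case class Lit(value: Int) extends Expr
  final case class Col(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr

  // Rewrite bottom-up: optimize the children first, then simplify this node.
  def optimize(e: Expr): Expr = e match {
    case Add(l, r) => simplify(Add(optimize(l), optimize(r)))
    case Mul(l, r) => simplify(Mul(optimize(l), optimize(r)))
    case other     => other
  }

  private def simplify(e: Expr): Expr = e match {
    case Add(Lit(a), Lit(b)) => Lit(a + b) // constant folding
    case Mul(Lit(a), Lit(b)) => Lit(a * b)
    case Add(x, Lit(0))      => x          // x + 0 becomes x
    case Add(Lit(0), x)      => x
    case Mul(x, Lit(1))      => x          // x * 1 becomes x
    case Mul(Lit(1), x)      => x
    case other               => other
  }
}
```

For instance, `optimize(Add(Col("x"), Add(Lit(1), Lit(2))))` folds the constant subtree so that only one addition remains to be evaluated per row.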
  • 31. Project Tungsten ● Improve Spark execution memory and CPU efficiency by: ○ Performing explicit memory management instead of relying on JVM objects (Dataset encoders) ○ Generating code on the fly to fuse multiple operators into one (Whole stage codegen) ○ Introducing cache-aware computation ○ In-memory columnar format ● Bringing Spark closer to the bare metal 31
  • 32. Dataset encoders ● Encoders translate between domain objects and Spark's internal format (figure; source: Databricks)
  • 33. Dataset encoders ● Encoders bridge objects with data sources
      { "Category": "THEFT", "IncidntNum": "150060275", "DayOfWeek": "Saturday" }

      case class Incident(IncidntNum: Int, Category: String, DayOfWeek: String)
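What "bridging objects with data sources" involves can be sketched by hand: pick fields by name and convert them to the types the case class declares. The `decode` helper below is hypothetical and works on an already parsed record; real Spark encoders are generated automatically and target a compact binary row format, which this sketch does not attempt to model.

```scala
import scala.util.Try

object EncoderSketch {
  final case class Incident(IncidntNum: Int, Category: String, DayOfWeek: String)

  // Map one record (field name -> raw string value) onto the domain object.
  // Returns None if a field is missing or IncidntNum is not a valid Int.
  def decode(record: Map[String, String]): Option[Incident] =
    for {
      num <- record.get("IncidntNum").flatMap(s => Try(s.toInt).toOption)
      cat <- record.get("Category")
      dow <- record.get("DayOfWeek")
    } yield Incident(num, cat, dow)
}
```

Note that the record on the slide carries `IncidntNum` as a string while the case class declares an `Int`, so some conversion step of this kind is needed on the way in.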
  • 36. Whole stage codegen ● Fuse the operators together ● Generate code on the fly ● The idea: generate specialized code that is as fast as if it had been written by hand ● Result: up to about 10x faster in Spark 2.0 than in Spark 1.6 for some operators
  • 37. Whole stage codegen - example query
      SELECT COUNT(*) FROM store_sales WHERE ss_item_sk = 1000
  • 38. Whole stage codegen - Volcano iterator model (figure)
  • 39. Whole stage codegen - What if we asked an intern to write this in C#?
      long count = 0;
      foreach (var ss_item_sk in store_sales) {
        if (ss_item_sk == 1000) count++;
      }
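The contrast between the Volcano iterator model and the hand-written loop can be reproduced in a few lines of Scala. This is a rough sketch with invented names (`Operator`, `Scan`, `Filter`), not Spark's physical operators: the first version pulls each row through a chain of generic `next()` calls, the second is the single fused loop that whole-stage code generation aims to approximate.

```scala
object CodegenSketch {
  // Volcano style: every operator exposes next() and pulls from its child.
  trait Operator { def next(): Option[Long] }

  final class Scan(rows: Array[Long]) extends Operator {
    private var i = 0
    def next(): Option[Long] =
      if (i < rows.length) { val r = rows(i); i += 1; Some(r) } else None
  }

  final class Filter(child: Operator, pred: Long => Boolean) extends Operator {
    def next(): Option[Long] = {
      var row = child.next()
      // Skip rows until one passes the predicate or the child is exhausted.
      while (row.exists(r => !pred(r))) row = child.next()
      row
    }
  }

  def countVolcano(rows: Array[Long], key: Long): Long = {
    val plan = new Filter(new Scan(rows), _ == key)
    var count = 0L
    while (plan.next().isDefined) count += 1
    count
  }

  // Fused style: the same query as one tight loop over the data.
  def countFused(rows: Array[Long], key: Long): Long = {
    var count = 0L
    var i = 0
    while (i < rows.length) {
      if (rows(i) == key) count += 1
      i += 1
    }
    count
  }
}
```

Both functions return the same count; the fused version avoids the per-row virtual calls and intermediate `Option` allocations of the iterator chain.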
  • 44. Define Spark job entry point
      object IncidentsJob {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder()
            .appName("Incidents processing job")
            .config("spark.sql.shuffle.partitions", "16")
            .master("local[4]")
            .getOrCreate()

          // Spark transformations and actions...

          System.exit(0)
        }
      }
  • 45. Create build.sbt file
      lazy val root = (project in file(".")).
        settings(
          organization := "com.mycompany",
          name := "spark.job.incidents",
          version := "1.0.0",
          scalaVersion := "2.11.8",
          mainClass in Compile := Some("com.mycompany.spark.job.incidents.main")
        )

      libraryDependencies ++= Seq(
        "org.apache.spark" %% "spark-core" % "2.0.1" % "provided",
        "org.apache.spark" %% "spark-sql" % "2.0.1" % "provided",
        "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
        "com.microsoft.sqlserver" % "sqljdbc4" % "4.0"
      )
  • 46. Create application (fat) jar file
      $ sbt compile
      $ sbt test
      $ sbt assembly   (requires the sbt-assembly plugin)
  • 47. Submit job via spark-submit command
      ./bin/spark-submit \
        --class <main-class> \
        --master <master-url> \
        --deploy-mode <deploy-mode> \
        --conf <key>=<value> \
        ... # other options
        <application-jar> \
        [application-arguments]
  • 48. Example workflow (diagram): code → 1. pull content → 2. take build number (331) → 3. build & test → 4. copy to cluster (job331.jar; produce job artifact notification) → 5. create/schedule job job331 (http) → 6. spark-submit job331
  • 50. Apache Spark streaming ● Scalable, fault-tolerant streaming system ● Receivers receive data streams and chop them into batches ● Spark processes the batches and pushes out the result ● Input: Files, Socket, Kafka, Flume, Kinesis...
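"Chopping a stream into batches" can be sketched with ordinary collections. The example assumes each event carries an arrival time in milliseconds and assigns it to a fixed-length batch interval; `Event` and `toBatches` are hypothetical names, and a real receiver would do this continuously rather than over a finished sequence.

```scala
object MicroBatchSketch {
  // An event with its arrival time in milliseconds.
  final case class Event(timeMs: Long, payload: String)

  // Assign every event to the batch interval its arrival time falls into and
  // return the non-empty batches in time order as (batch index, payloads).
  def toBatches(events: Seq[Event], batchMs: Long): Seq[(Long, Seq[String])] =
    events
      .groupBy(_.timeMs / batchMs)
      .toSeq
      .sortBy(_._1)
      .map { case (batch, es) => batch -> es.sortBy(_.timeMs).map(_.payload) }
}
```

With a one-second interval, each resulting batch could then be handed to the same processing code that a batch job would use.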
  • 51. Apache Spark streaming
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setMaster("local[2]")
          .setAppName("Incidents processing job - Stream")
        val ssc = new StreamingContext(conf, Seconds(1))

        // The topic list is cut off on the slide after the first entry.
        val topics = Set(Topics.Incident)

        // kafkaParams (broker list etc.) is not shown on the slide.
        val directKafkaStream = KafkaUtils.createDirectStream[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
          ssc, kafkaParams, topics)

        // process batches: decode each message value as UTF-8 text, then split into words
        directKafkaStream.map { case (_, value) => new String(value, "UTF-8") }.flatMap(_.split(" "))...

        // Start the computation
        ssc.start()
        ssc.awaitTermination()
        System.exit(0)
      }
  • 52. Apache Spark streaming ● Integrates with the rest of the ecosystem ○ Combine batch and stream processing ○ Combine machine learning with streaming ○ Combine SQL with streaming 52
  • 54. Structured streaming (continuous apps) ● High-level streaming API built on DataFrames ● Catalyst optimizer creates incremental execution plan ● Unifies streaming, interactive and batch queries ● Supports multiple sources and sinks ● E.g. aggregate data in a stream, then serve using JDBC 54
  • 55. Structured streaming key idea The simplest way to perform streaming analytics is not having to reason about streaming. 55
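One way to read this key idea: the user writes the query as if all data were already present, and the engine keeps the answer up to date as new rows arrive. The sketch below models that with plain Scala maps; `countAll` and `update` are hypothetical names, and Spark's actual incremental execution plans are considerably more involved.

```scala
object IncrementalSketch {
  // Batch-style query: count occurrences per category over all rows.
  def countAll(rows: Seq[String]): Map[String, Int] =
    rows.groupBy(identity).map { case (k, v) => k -> v.size }

  // Incremental version: fold newly arrived rows into the previous result
  // without revisiting rows that were already counted.
  def update(state: Map[String, Int], newRows: Seq[String]): Map[String, Int] =
    newRows.foldLeft(state)((acc, k) => acc.updated(k, acc.getOrElse(k, 0) + 1))
}
```

Folding `update` over the arriving chunks yields the same result as running `countAll` over everything seen so far, which is what lets the same query be reasoned about as a batch query.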
  • 57. Structured streaming ● Reusing same API
      Batch (finite):
        val categories = spark
          .read
          .option("header", "true")
          .schema(schema)
          .csv("/data/source")
          .select("Category")

      Streaming (infinite):
        val categories = spark
          .readStream
          .option("header", "true")
          .schema(schema)
          .csv("/data/source")
          .select("Category")
  • 58. Structured streaming ● Reusing same API
      Batch (finite):
        categories
          .write
          .format("parquet")
          .save("/data/warehouse/categories.parquet")

      Streaming (infinite):
        categories
          .writeStream
          .format("parquet")
          .start("/data/warehouse/categories.parquet")
  • 60. Useful resources ● Spark home page: https://spark.apache.org/ ● Spark summit page: https://spark-summit.org/ ● Apache Spark Docker image: https://github.com/dylanmei/docker-zeppelin ● SFPD Incidents: https://data.sfgov.org/Public-Safety/Police-Department-Incidents/tmnf-yvry
  • 61. Thank you for the attention! 61
  • 62. References ● Michael Armbrust - Structuring Spark: DataFrames, Datasets and Streaming - https://spark-summit.org/2016/events/structuring-spark-dataframes-datasets-and-streaming/ ● Apache Parquet - https://parquet.apache.org/ ● Spark Performance: What's Next - https://spark-summit.org/east-2016/events/spark-performance-whats-next/ ● Avoid groupByKey - https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reduceby_key_over_groupbykey.html