SlideShare a Scribd company logo
What’s new in Spark 2.0?
Machine Learning Stockholm Meetup
27 October, 2016
Schibsted Media Group
1
Rerngvit Yanggratoke @ Combient AB
Örjan Lundberg @ Combient AB
Combient - Who We Are
• A joint venture owned by several Swedish global enterprises in the Wallenberg sphere
• One of our key missions is the

Analytics Centre of Excellence (ACE)
• Provide data science and data engineer
resources to our owners and members
• Share best practices, advanced methods,
and state-of-the-art technology to
generate business value from data
Apache Spark Evolution
2016201420122010 2011 2013 2015
Apache Incubator
Project
Spark 1.0

released
Spark 2.0
released
First Spark Summit
2017
Spark paper

OSDI 2012
Top-level

Apache Project
Spark paper

HotCloud 2010

announced

strong investment in Spark
Project begin

in AMPLab – 

UC Berkeley
Spark technical

report, UC Berkeley

announced efforts

to unite Spark with
Hadoop
* Very rapid takeoffs: from research publication to major adoptions in industry in 4 years
integrates Spark 1.5.2

into their distribution
added Spark to 

their 4.0 release
Spark 2.0 “Easier, Faster, and Smarter”
• Spark Session - a new entry point
• Unification of APIs => Dataset & Dataframe
• Spark SQL enhancements: subqueries, native Window functions, 

better error messages
• Built-in CSV file type support
• Machine learning pipeline persistence across languages
• Approximate DataFrame Statistics Functions
• Performance enhancement (Whole-stage-code generation)
• Structured streaming (Alpha)
4
Spark Session
5
Spark 1.6 => SQLContext, HiveContext
Spark 2.0 => SparkSession
val ss = SparkSession.builder().master(master)
.appName(appName).getOrCreate()
val conf = new SparkConf().setMaster(master).setAppName(appName)
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
Unified API: DataFrame & Dataset
• DataFrame - introduced in Spark 1.3 (March 2015)
• Dataset - experimental in Spark 1.6 (March 2016)
6
• Observation: very rapid changes of APIs

Java, Scala,
Python, R
Java, Scala
Dataset API example
7
Which countries has a
shrinking population
from 2014 to 2015?
name population_2014 population_2015
Afghanistan XXXX XXXX
Albania XXXX XXXX
…. …. ….
• Key benefit over DataFrame API: static typing!
Compile time
error detection
Table: country_pop
Subqueries
8
Spark 1.6 => Subqueries inside FROM clause
Spark 2.0 => Subqueries inside FROM and WHERE clause
Ex. Which countries have more than 100M people in 2016?
Window Functions
9
Spark 1.X => required HiveContext
Spark 2.0 => native window function 

SQL Error messages
10
Finding typos/errors in larger SQL?
SQL Error messages
11
Spark 2.0 => illustrate possible options
Spark 1.6 => an exception happen at this location X
Built-in CSV file support
12
Spark 1.6 => require package databricks/spark-csv
Spark 2.0 => built-in
Spark 2.0 Machine learning
(ML) APIs Status
• spark.mllib (RDD-based APIs)
• Entering bug-fixing modes (no new features)
• spark.ml (Dataframe-based): models + transformations
• Introduced in January 2015
• Become a primary API
13An example Pipeline model
ML Pipeline Persistence
Across languages
14
Data scientist Data engineer
• Prototype and experiments with
candidate ML pipelines

• Focus on rapid prototype and
fast exploration of data

ML Pipelines
• Implement and deploy pipelines in
production environment

• Focus on maintainable and
reliable systems

Propose Implement / deploy
Prefer Python / R / ..
Why?
Source: classroomclipart.com Source: worldartsme.com/
Prefer Java / Scala
- Diverge sourcecodes

- Synchronisation efforts/delays
Problems
ML Pipeline Persistence
Across languages
15
Data scientist Data engineer
• Prototype and experiments with
candidate ML pipelines

• Focus on rapid prototype and
fast exploration of data

ML Pipelines
• Implement and deploy pipelines in
production environment

• Focus on maintainable and
reliable systems

Propose Implement / deploy
Prefer Python / R / ..
Why?
Pipeline persistence allow for import/export pipeline across languages
• Introduced only for Scala in Spark 1.6
• Status of Spark 2.0
• Scala/Java (Full), Python (Near-full), R (with hacks)
Source: classroomclipart.com Source: worldartsme.com/
Prefer Java / Scala
Create ML pipeline

and persist it in Python
16
Loading the pipeline
and test it in Scala
17
Approximate DataFrame 

Statistics Functions
Approximate Quantile
18
Motivation: rapid explorative data analysis
Bloom filter - Approximate set membership data structure
CountMinSketch - Approximate frequency estimation data structure
Michael Greenwald and Sanjeev Khanna. 2001. Space-efficient online computation of quantile
summaries. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data
(SIGMOD '01), Timos Sellis and Sharad Mehrotra (Eds.). ACM, New York, NY, USA, 58-66.
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun.
ACM 13, 7 (July 1970), 422-426. DOI=http://dx.doi.org/10.1145/362686.362692
Cormode, Graham (2009). "Count-min sketch" (PDF). Encyclopedia of Database Systems.
Springer. pp. 511–516.
Bloom filter- Approximate set
membership data structure
19
Inserting items O(k)
Spark 2: example code
Membership query => O(k)
k hash functions
Trade space efficiency with false positives
Source : wikipedia
Count-Min Sketch
20
Source : http://debasishg.blogspot.se/2014/01/count-
min-sketch-data-structure-for.html
Spark 2: example code
Operations

• Add item - O(wd)

• Query item frequency - O(wd)
Approximate Quantile
21
Spark 2: example code
Details are much more complicated than the above two. :)
Count-Min Sketch
22
Source : http://debasishg.blogspot.se/2014/01/count-
min-sketch-data-structure-for.html
Spark 2: example code
Operations

• Add item - O(wd)

• Query item frequency - O(wd)
Approximate Quantile
23
Spark 2: example code
Details are much more complicated than the above two. :)
res8: Array[Double] = Array(167543.0, 1260934.0, 3753121.0, 5643475.0)
Databrick example notebook:

https://docs.cloud.databricks.com/docs/latest/sample_applications/
04%20Apache%20Spark%202.0%20Examples/05%20Approximate%20Quantile.html
Whole-stage code generation
• Collapsing multiple function calls into one
• Minimising overheads (e.g., virtual function calls) and using CPU registers
24
Volcano (Spark 1.X)
Example Source:
Databrick example notebook:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/
6122906529858466/293651311471490/5382278320999420/latest.html
Whole-stage code generation
(Spark 2.0)
Structure Streaming [Alpha]
• SQL on streaming data
25
Source : https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
Databrick example notebook:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/
4012078893478893/295447656425301/5985939988045659/latest.html

Structure Streaming [Alpha]
26
Source: https://databricks.com/blog/2016/07/28/continuous-applications-evolving-
streaming-in-apache-spark-2-0.html
Not ready for production
We are currently recruiting 🙂
combient.com/#careers
Presentation materials are available below.
https://github.com/rerngvit/technical_presentations/tree/master/
2016-10-27_ML_Meetup_Stockholm%40Schibsted

More Related Content

What's hot

Graph Databases
Graph DatabasesGraph Databases
Graph Databases
Sergey Enin
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
Taro L. Saito
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Databricks
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
Kyle Bader
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
GraphAware
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
Lars Albertsson
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
Gregg Barrett
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
Chetan Khatri
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
Itzhak Kameli
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
Databricks
 
Presto summit israel 2019-04
Presto summit   israel 2019-04Presto summit   israel 2019-04
Presto summit israel 2019-04
Ori Reshef
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...nimak
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
Alexandru Adrian Ormenisan
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLS
Alasdair Gray
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
Makoto Yui
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use Case
Jason Plurad
 

What's hot (20)

Graph Databases
Graph DatabasesGraph Databases
Graph Databases
 
Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014Streaming Distributed Data Processing with Silk #deim2014
Streaming Distributed Data Processing with Silk #deim2014
 
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
Bridging the Gap Between Data Scientists and Software Engineers – Deploying L...
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
 
Graph-Powered Machine Learning
Graph-Powered Machine Learning Graph-Powered Machine Learning
Graph-Powered Machine Learning
 
Data pipelines from zero to solid
Data pipelines from zero to solidData pipelines from zero to solid
Data pipelines from zero to solid
 
Introduction to Microsoft R Services
Introduction to Microsoft R ServicesIntroduction to Microsoft R Services
Introduction to Microsoft R Services
 
Fossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatriFossasia 2018-chetan-khatri
Fossasia 2018-chetan-khatri
 
Big Data, Bigger Analytics
Big Data, Bigger AnalyticsBig Data, Bigger Analytics
Big Data, Bigger Analytics
 
Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!Make your PySpark Data Fly with Arrow!
Make your PySpark Data Fly with Arrow!
 
Presto summit israel 2019-04
Presto summit   israel 2019-04Presto summit   israel 2019-04
Presto summit israel 2019-04
 
Hadoop intro
Hadoop introHadoop intro
Hadoop intro
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...
Cross-Tier Application and Data Partitioning of Web Applications for Hybrid C...
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
 
Data provenance in Hopsworks
Data provenance in HopsworksData provenance in Hopsworks
Data provenance in Hopsworks
 
Dataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLSDataset Descriptions in Open PHACTS and HCLS
Dataset Descriptions in Open PHACTS and HCLS
 
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4jNeo4j-Databridge: Enterprise-scale ETL for Neo4j
Neo4j-Databridge: Enterprise-scale ETL for Neo4j
 
Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0Introduction to Apache Hivemall v0.5.0
Introduction to Apache Hivemall v0.5.0
 
Airline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use CaseAirline Reservations and Routing: A Graph Use Case
Airline Reservations and Routing: A Graph Use Case
 

Similar to What's new in Spark 2.0?

What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
Örjan Lundberg
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
Juantomás García Molina
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
Mark Kerzner
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
Paco Nathan
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_VeriticalsPeyman Mohajerian
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
Spark Summit
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
DataWorks Summit/Hadoop Summit
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Slim Baltagi
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Slim Baltagi
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Paco Nathan
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
BigDataEverywhere
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
Jim Haughwout
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
DataWorks Summit
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
DataWorks Summit/Hadoop Summit
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Databricks
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
eRic Choo
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
Paco Nathan
 

Similar to What's new in Spark 2.0? (20)

What's new in spark 2.0?
What's new in spark 2.0?What's new in spark 2.0?
What's new in spark 2.0?
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
IBM Strategy for Spark
IBM Strategy for SparkIBM Strategy for Spark
IBM Strategy for Spark
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
 
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
 
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
Spark Streaming the Industrial IoT
Spark Streaming the Industrial IoTSpark Streaming the Industrial IoT
Spark Streaming the Industrial IoT
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
Crash Course HS16Melb - Hands on Intro to Spark & Zeppelin
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
Scaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data ScienceScaling up with Cisco Big Data: Data + Science = Data Science
Scaling up with Cisco Big Data: Data + Science = Data Science
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 

What's new in Spark 2.0?

  • 1. What’s new in Spark 2.0? Machine Learning Stockholm Meetup 27 October, 2016 Schibsted Media Group 1 Rerngvit Yanggratoke @ Combient AB Örjan Lundberg @ Combient AB
  • 2. Combient - Who We Are • A joint venture owned by several Swedish global enterprises in the Wallenberg sphere • One of our key missions is the
 Analytics Centre of Excellence (ACE) • Provide data science and data engineer resources to our owners and members • Share best practices, advanced methods, and state-of-the-art technology to generate business value from data
  • 3. Apache Spark Evolution 2016201420122010 2011 2013 2015 Apache Incubator Project Spark 1.0 released Spark 2.0 released First Spark Summit 2017 Spark paper
 OSDI 2012 Top-level Apache Project Spark paper HotCloud 2010 announced strong investment in Spark Project begin in AMPLab – UC Berkeley Spark technical report, UC Berkeley announced efforts to unite Spark with Hadoop * Very rapid takeoffs: from research publication to major adoptions in industry in 4 years integrates Spark 1.5.2 into their distribution added Spark to their 4.0 release
  • 4. Spark 2.0 “Easier, Faster, and Smarter” • Spark Session - a new entry point • Unification of APIs => Dataset & Dataframe • Spark SQL enhancements: subqueries, native Window functions, 
 better error messages • Built-in CSV file type support • Machine learning pipeline persistence across languages • Approximate DataFrame Statistics Functions • Performance enhancement (Whole-stage-code generation) • Structured streaming (Alpha) 4
  • 5. Spark Session 5 Spark 1.6 => SQLContext, HiveContext Spark 2.0 => SparkSession val ss = SparkSession.builder().master(master) .appName(appName).getOrCreate() val conf = new SparkConf().setMaster(master).setAppName(appName) val sc = new SparkContext(conf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  • 6. Unified API: DataFrame & Dataset • DataFrame - introduced in Spark 1.3 (March 2015) • Dataset - experimental in Spark 1.6 (March 2016) 6 • Observation: very rapid changes of APIs
 Java, Scala, Python, R Java, Scala
  • 7. Dataset API example 7 Which countries has a shrinking population from 2014 to 2015? name population_2014 population_2015 Afghanistan XXXX XXXX Albania XXXX XXXX …. …. …. • Key benefit over DataFrame API: static typing! Compile time error detection Table: country_pop
  • 8. Subqueries 8 Spark 1.6 => Subqueries inside FROM clause Spark 2.0 => Subqueries inside FROM and WHERE clause Ex. Which countries have more than 100M people in 2016?
  • 9. Window Functions 9 Spark 1.X => required HiveContext Spark 2.0 => native window function 

  • 10. SQL Error messages 10 Finding typos/errors in larger SQL?
  • 11. SQL Error messages 11 Spark 2.0 => illustrate possible options Spark 1.6 => an exception happen at this location X
  • 12. Built-in CSV file support 12 Spark 1.6 => require package databricks/spark-csv Spark 2.0 => built-in
  • 13. Spark 2.0 Machine learning (ML) APIs Status • spark.mllib (RDD-based APIs) • Entering bug-fixing modes (no new features) • spark.ml (Dataframe-based): models + transformations • Introduced in January 2015 • Become a primary API 13An example Pipeline model
  • 14. ML Pipeline Persistence Across languages 14 Data scientist Data engineer • Prototype and experiments with candidate ML pipelines • Focus on rapid prototype and fast exploration of data ML Pipelines • Implement and deploy pipelines in production environment • Focus on maintainable and reliable systems Propose Implement / deploy Prefer Python / R / .. Why? Source: classroomclipart.com Source: worldartsme.com/ Prefer Java / Scala - Diverge sourcecodes - Synchronisation efforts/delays Problems
  • 15. ML Pipeline Persistence Across languages 15 Data scientist Data engineer • Prototype and experiments with candidate ML pipelines • Focus on rapid prototype and fast exploration of data ML Pipelines • Implement and deploy pipelines in production environment • Focus on maintainable and reliable systems Propose Implement / deploy Prefer Python / R / .. Why? Pipeline persistence allow for import/export pipeline across languages • Introduced only for Scala in Spark 1.6 • Status of Spark 2.0 • Scala/Java (Full), Python (Near-full), R (with hacks) Source: classroomclipart.com Source: worldartsme.com/ Prefer Java / Scala
  • 16. Create ML pipeline
 and persist it in Python 16
  • 17. Loading the pipeline and test it in Scala 17
  • 18. Approximate DataFrame 
 Statistics Functions Approximate Quantile 18 Motivation: rapid explorative data analysis Bloom filter - Approximate set membership data structure CountMinSketch - Approximate frequency estimation data structure Michael Greenwald and Sanjeev Khanna. 2001. Space-efficient online computation of quantile summaries. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data (SIGMOD '01), Timos Sellis and Sharad Mehrotra (Eds.). ACM, New York, NY, USA, 58-66. Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (July 1970), 422-426. DOI=http://dx.doi.org/10.1145/362686.362692 Cormode, Graham (2009). "Count-min sketch" (PDF). Encyclopedia of Database Systems. Springer. pp. 511–516.
  • 19. Bloom filter- Approximate set membership data structure 19 Inserting items O(k) Spark 2: example code Membership query => O(k) k hash functions Trade space efficiency with false positives Source : wikipedia
  • 20. Count-Min Sketch 20 Source : http://debasishg.blogspot.se/2014/01/count- min-sketch-data-structure-for.html Spark 2: example code Operations • Add item - O(wd) • Query item frequency - O(wd)
  • 21. Approximate Quantile 21 Spark 2: example code Details are much more complicated than the above two. :)
  • 22. Count-Min Sketch 22 Source : http://debasishg.blogspot.se/2014/01/count- min-sketch-data-structure-for.html Spark 2: example code Operations • Add item - O(wd) • Query item frequency - O(wd)
  • 23. Approximate Quantile 23 Spark 2: example code Details are much more complicated than the above two. :) res8: Array[Double] = Array(167543.0, 1260934.0, 3753121.0, 5643475.0) Databrick example notebook:
 https://docs.cloud.databricks.com/docs/latest/sample_applications/ 04%20Apache%20Spark%202.0%20Examples/05%20Approximate%20Quantile.html
  • 24. Whole-stage code generation • Collapsing multiple function calls into one • Minimising overheads (e.g., virtual function calls) and using CPU registers 24 Volcano (Spark 1.X) Example Source: Databrick example notebook:
 https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/ 6122906529858466/293651311471490/5382278320999420/latest.html Whole-stage code generation (Spark 2.0)
  • 25. Structure Streaming [Alpha] • SQL on streaming data 25 Source : https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html Databrick example notebook: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/ 4012078893478893/295447656425301/5985939988045659/latest.html

  • 26. Structure Streaming [Alpha] 26 Source: https://databricks.com/blog/2016/07/28/continuous-applications-evolving- streaming-in-apache-spark-2-0.html Not ready for production
  • 27. We are currently recruiting 🙂 combient.com/#careers Presentation materials are available below. https://github.com/rerngvit/technical_presentations/tree/master/ 2016-10-27_ML_Meetup_Stockholm%40Schibsted