© 2017 Impetus Technologies
WEBINAR
Anand Venugopal
Product Head & AVP,
StreamAnalytix
The Structured Streaming Upgrade to Apache Spark
and How Enterprises Can Benefit
Amit Assudani
Sr. Technical Architect – Spark,
StreamAnalytix
August 2017
Quick Webinar Notes
• Our focus: Enabling the real-time enterprise and making Spark easy to use
• Sharing our experience and expertise with you
• Level of content
• 20-80 :: New-Experienced (w.r.t. Spark)
• Format: A combination of panel discussion and presentation
• Usage of some artifacts and pictures from Apache Spark website and other public sources
• Q&A and interactions are important and highly valued
• Please send us your comments/ feedback using the Webex console
Webinar Outline
• About Impetus and what is StreamAnalytix? – 2 minutes
• Apache Spark – Know the basics and its evolution – 8 minutes
• A deep dive into Structured Streaming – 25 minutes
• What is it?
• How is it different from 1.0?
• Features and technical highlights
• Benefits and limitations
• Upgrades and migrations
• Future roadmap
• Talent vs Tooling – 5 minutes
• Q&A – 5+ minutes
About Impetus
• Mission critical technology solutions since 1996
• Fortune 500: Big Data clients
• 1700 people; US, India, global reach
• Unique mix of Big Data products and services
Real-time Stream Processing & Machine Learning Platform
+
Visual Spark Studio
Apache Spark – The Beginning
• Started as a project at the Berkeley AMPLab in 2009 by Matei Zaharia; open sourced (BSD) in 2010
• Framework on a distributed resource management system (Mesos)
• Goal: speed up ML jobs in Apache Hadoop with an in-memory approach
• 30x performance increase on Hadoop jobs
Apache Spark – Current State
• Robust, widely used technology
• Survey by Taneja Group in November 2016 highlights:
• 54% of 7000 enterprise participants said they were actively using Spark
• 55% of workloads were ETL / data processing / engineering
• Cloud deployments projected well beyond 30%
• Popular new initiatives: data science exploration, streaming and machine learning
• Workload styles (slide diagram): hi-speed batch, interactive, iterative, graph, streaming (micro-batch)
• Sits on Hadoop and/or cloud
Spark Evolution
Major version: Spark 0.X
Minor version: Spark 0.7-0.9 (Feb 2014)
Features:
• Becomes a top level Apache project
• RDD concept introduced with Spark
• Scala and Java binding
• Adds a Python API called PySpark
• Introduces Spark Streaming
• Introduces MLlib
• Includes a first version of GraphX
Remarks:
• PySpark makes it possible to use Spark from Python
• Spark Streaming adds near real-time processing capability
• Spark Streaming is now out of alpha and includes significant optimizations and simplified high availability deployment
Spark Evolution
Major version: Spark 1.0-1.2
Minor version: Spark 1.0 (May 2014)
Features:
• Adds Spark SQL
• Guarantees stability of its core API
• Full support for running seamlessly in secured Hadoop clusters
Remarks:
• Spark 1.0 was the first production ready, backward compatible release. Viewed Spark Streaming as faster batch processing rather than streaming
• Became the 1st open source Big Data framework to embrace in-memory computing
Minor version: Spark 1.1 (Sep 2014)
Features:
• Migrates all customer workloads from Shark to Spark SQL
• Expansion of MLlib
• Extends libraries and sources for Spark Streaming
Remarks:
• First minor release in the 1.X series. Added significant extensions to the newly added Spark SQL and to Spark MLlib
Minor version: Spark 1.2 (Dec 2014)
Features:
• A new API for external data sources
• New H/A driver support through a Write Ahead Log (WAL), removes any single-point-of-failure from Spark Streaming
• A higher-level API for constructing pipelines in the spark.ml package
• GraphX project provides a stable API
Remarks:
• Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step.
• However, still lacked a direct API to read structured data from Spark
Spark Evolution
Major version: Spark 1.3-1.5
Minor version: Spark 1.3 (Mar 2015)
Features:
• A new DataFrames API
• Provides a rich set of new MLlib algorithms
• Adds APIs for the direct Kafka streaming source
Remarks:
• DataFrames allow Spark to better understand the structure of data as well as the computation being performed.
• First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)
Minor version: Spark 1.4 (Jun 2015)
Features:
• Introduces SparkR
• ML pipelines API graduates from alpha with new transformers and improved Python coverage
• Adds visual debugging and monitoring utilities to evaluate running Spark applications
• A REST API
• Initial performance improvements in Project Tungsten
• A pluggable interface for write ahead logs
Remarks:
• Targets data scientists with SparkR on the new DataFrame API.
• Ships the initial pieces of Project Tungsten, the first version of custom memory management
Minor version: Spark 1.5 (Sep 2015)
Features:
• 1st major pieces of Project Tungsten
• New ML algorithms, extends the new R API
• Adds visualization of SQL and DataFrame query plans in the web UI
• Operational features for the streaming component, such as backpressure support
Remarks:
• Pushes Project Tungsten
• Focused on increasing Spark’s performance through several low-level architectural optimizations
• Another major theme was data science
Spark Evolution
Major version: Spark 1.6
Minor version: Spark 1.6 (Jan 2016)
Features:
• Experimental Dataset API
• New data science functionalities; ML pipeline persistence and new algorithms
• A new and efficient mapWithState API, replaces updateStateByKey
• Speedup of 10X for streaming state management
• SQL queries on files
Remarks:
• Datasets, a typed extension of the DataFrame API, make it possible to work with custom objects and lambda functions while keeping the benefits of Spark SQL
Spark Evolution
Major version: Spark 2.0-2.2
Minor version: Spark 2.0 (Jul 2016)
Features:
• A new API, Structured Streaming
• Second generation Tungsten engine
• Unified DataFrame and Dataset in Scala/Java
• Substantial (2-10X) performance speedup for common operators in SQL and DataFrames with a new technique called whole stage code generation
Remarks:
• Structured Streaming launched experimentally. Aims to integrate batch and stream. Introduces the concept of continuous applications
Minor version: Spark 2.1 (Dec 2016)
Features:
• Hardening of Structured Streaming – still experimental
• Adds a number of SQL functionalities
• Focuses on advanced analytics
• SparkR becomes the most comprehensive library for distributed machine learning on R
Remarks:
• Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications. Introduces:
• Event-time watermarks
• Support for all file-based formats and all file-based features
• Native support for Kafka 0.10
Minor version: Spark 2.2 (Jul 2017)
Features:
• Production ready Structured Streaming
• Focuses on advanced analytics and Python
• Cost-based optimizer
• Limit the max number of records written per file
• Support for parsing multi-line JSON & CSV files
Remarks:
• The Structured Streaming APIs are now GA and are no longer labeled experimental
• Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX
Poll Question
What is your currently used Spark version?
- 1.6 or prior
- 2.1
- 2.2
- Planning to start soon
- No plans
A Deep Dive into Structured Streaming
Structured Streaming – What is it?
• Strongly improved framework over Spark Streaming (DStream API) of Spark 1.x
• High level streaming API built on Spark SQL (DataFrame/Dataset API) and Catalyst Optimizer
• Express streaming computations the same way as batch computations
• Repeated query / incremental execution on unbounded table
Structured Streaming – What is it?
• “NO REASONING ABOUT STREAMING”
• Simply define a flow:
• source → transformation → sink → mode & trigger time → checkpoint
• Structured Streaming makes Streaming ETL +
Analytics easier and a natural single flow
• Not restricted to hard batch duration limits (delivers
lower latency)
• Exactly-once guarantee is now truly end-to-end: it includes the sink layer
Structured Streaming – Code Snippet
(Structured Streaming vs Batch)
// Structured Streaming
// Assumes an existing SparkSession named `spark`.
val streamIn = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
// Chain the cast into the write (the slide discarded its result);
// the Kafka sink expects `key` / `value` columns.
streamIn.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  // Placeholder path; needed unless spark.sql.streaming.checkpointLocation is set.
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
// Batch: same shape, with read/write/save instead of readStream/writeStream/start
val batchIn = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
batchIn.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()
Structured Streaming – Code Snippet
(DStreams vs Batch)
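// DStream
// Assumes an existing StreamingContext named `streamingContext`.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
val topics = Array("topicA", "topicB")
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  // Deserializers are required by the Kafka consumer; the slide omitted them.
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
// The result is an InputDStream[ConsumerRecord[String, String]]; the type is inferred.
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
// No built-in Kafka write support in the DStream API
// Batch
// Assumes an existing SparkSession named `spark`.
val batchIn = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
batchIn.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .save()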
Streaming Code – Executed on “Trigger”
(One Time Batch)
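// Structured Streaming
// Assumes an existing SparkSession named `spark`.
import org.apache.spark.sql.streaming.Trigger
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
// One-time trigger: process whatever is available, then stop
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  // Placeholder path; the checkpoint is how the next run knows where this one stopped.
  .option("checkpointLocation", "/path/to/checkpoint")
  .trigger(Trigger.Once())
  .start()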
• No worry about figuring out “changed data” and output
consistency
• Much easier stateful processing like deduping
• Unified code: No different code base for Lambda
solutions
• Cost saving by not running the cluster 24/7
Poll Results
Structured Streaming – Features and Highlights
(Event Time; Window Duration and Triggers)
• Event time orientation
• In combination with “windows” and triggers
• Aggregates maintained by Structured Streaming
• No need to write separate code
• Incremental query and output modes
• append / complete / update
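The points above can be sketched as a windowed count over event time. This is a minimal illustration, not the presenters' code: the object name, the 10-second window, the local master and the 20-second run time are assumptions, and the built-in `rate` source stands in for a real feed. The aggregation is written as a function on a DataFrame so the same code applies to batch and streaming input.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object WindowedCounts {
  // Counts rows per 10-second event-time window.
  // Works unchanged on a batch DataFrame or a streaming one.
  def countsPerWindow(events: DataFrame): DataFrame =
    events.groupBy(window(col("timestamp"), "10 seconds")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("windowed-counts-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()

    // Built-in test source; emits the columns `timestamp` and `value`.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    val query = countsPerWindow(events)
      .writeStream
      .outputMode("update") // emit only the windows whose count changed in this trigger
      .format("console")
      .option("truncate", "false")
      .start()

    query.awaitTermination(20000L) // let it run for roughly 20 seconds
    query.stop()
    spark.stop()
  }
}
```

Switching `outputMode` to `complete` would re-emit every window on each trigger; the aggregate state itself is maintained by Structured Streaming in either case.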
Structured Streaming – Features and Highlights
(Late Data Handling)
Structured Streaming – Features and Highlights
(Watermarking (“Data too late!”))
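Since the slide graphics are not reproduced here, the idea behind watermarking can be sketched in plain Scala. This is a simplified model under stated assumptions, not Spark's implementation: the `Event` type, the function name and the per-event comparison are hypothetical, and Spark applies the rule to aggregation state (declared with `withWatermark` on a Dataset) rather than to raw events. The model keeps the largest event time seen, moves the watermark to that value minus the allowed delay at the end of each micro-batch, and treats anything older than the watermark as too late.

```scala
object WatermarkSketch {
  final case class Event(key: String, eventTimeSec: Long)

  /** Processes micro-batches in order and returns (accepted, droppedAsTooLate). */
  def run(batches: Seq[Seq[Event]], delaySec: Long): (Seq[Event], Seq[Event]) = {
    var watermark = Long.MinValue          // nothing is "too late" before the first batch
    var maxSeen: Option[Long] = None       // largest event time observed so far
    val accepted = Seq.newBuilder[Event]
    val dropped = Seq.newBuilder[Event]

    for (batch <- batches) {
      for (e <- batch) {
        if (e.eventTimeSec < watermark) dropped += e else accepted += e
        maxSeen = Some(maxSeen.fold(e.eventTimeSec)(m => math.max(m, e.eventTimeSec)))
      }
      // The watermark only moves forward, and only between batches.
      maxSeen.foreach(m => watermark = math.max(watermark, m - delaySec))
    }
    (accepted.result(), dropped.result())
  }
}
```

With a 10-second delay, an event stamped 110 that arrives after an earlier batch has already seen 130 falls behind the watermark of 120 and is dropped, while an event stamped 125 in the same batch is still accepted.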
Structured Streaming – Features and Highlights
• New data formats:
• Native multi-line JSON support
• Native CSV data source
• Stateful processing and time-outs beyond aggregations
• Using mapGroupsWithState and flatMapGroupsWithState
• New built-in ‘rate’ source that generates data for benchmarking and testing
• x number of events, <xyz> format
• Metrics for Structured Streaming: new metrics sink
• Connect with Graphite
• Streaming listener (metrics for every batch execution)
• Kafka 0.10 support; from_json, to_json, explode
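As one illustration of the Kafka and `from_json` items, the sketch below parses a JSON payload carried in the Kafka `value` column. The object name, the `account` / `amount` schema and the sample payload are hypothetical; only `from_json`, `col` and the `StructType` builder are Spark APIs.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object ParseKafkaJson {
  // Hypothetical payload schema, e.g. {"account":"A1","amount":12.5}
  val paymentSchema: StructType = new StructType()
    .add("account", StringType)
    .add("amount", DoubleType)

  // Expects a `value` column as produced by the Kafka source (binary or string).
  // Returns one row per message with the columns `account` and `amount`.
  def parse(kafkaRows: DataFrame): DataFrame =
    kafkaRows
      .select(from_json(col("value").cast("string"), paymentSchema).as("payment"))
      .select("payment.account", "payment.amount")
}
```

The same `parse` function can be applied to the result of `spark.readStream.format("kafka")...load()` or of the batch `spark.read` equivalent; messages that do not match the schema come back as null fields rather than failing the query.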
Structured Streaming – Features and Highlights
• New input / output features:
• Kafka stream / batch writer (the DStream API had no Kafka writer)
• Kafka batch / stream source (Kafka was not available as a batch source earlier)
• Partitioning of output data files (example: Hive data output)
• Deduplication is a built-in function
• Example: major bank use case
• Without Structured Streaming: manually record each hash value in an external store and check against it
• With Structured Streaming: unbounded table with hash values
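The built-in deduplication could look roughly like the sketch below. The column names `recordHash` and `eventTime` and the 10-minute delay are assumptions chosen to echo the hash-value example; `withWatermark` and `dropDuplicates` are the Spark calls involved.

```scala
import org.apache.spark.sql.DataFrame

object Dedup {
  // Assumes `events` has a timestamp column `eventTime` and a string column `recordHash`.
  def dedupe(events: DataFrame): DataFrame =
    events
      // Bounds how long seen keys are remembered; without it the state grows without limit.
      .withWatermark("eventTime", "10 minutes")
      // Including the event-time column in the key is what allows old state to be purged.
      .dropDuplicates("recordHash", "eventTime")
}
```

On a streaming Dataset the set of seen keys lives in Spark's state store instead of an external lookup table, and it is recovered from the checkpoint after a failure.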
Structured Streaming – Additional Features
• Improvements (not new):
• Easier stream to batch join
• Recovering from failures using checkpoints (also available in DStream)
• “Code productivity” enhanced / continuous SQL over batches and aggregations (maintained by Structured Streaming)
• Enhanced batch inter-operability
Spark Version Management Considerations (Migration, Co-existence)
• Co-existence of 1.6 and 2.x on the same Hadoop cluster
• Forward compatibility changes
• SparkSession is now the new entry point of Spark
• Replaces the old (1.x) SQLContext and HiveContext
• Dataset API and DataFrame API are unified
• Scala: DataFrame becomes a type alias for Dataset[Row]
• Java API users must replace DataFrame with Dataset<Row>
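For Scala code being moved from 1.x, the two changes above might look like the following sketch. The object name, the view name and the sample data are illustrative; the local master is an assumption for a stand-alone run.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionMigration {
  def total(spark: SparkSession): Long = {
    import spark.implicits._

    // 1.x code that still expects a SQLContext can obtain one from the session.
    val sqlContext = spark.sqlContext

    val df: DataFrame = Seq(("a", 1), ("b", 2)).toDF("key", "n")
    // In 2.x, DataFrame is only an alias, so this assignment needs no conversion.
    val rows: Dataset[Row] = df

    rows.createOrReplaceTempView("migration_sketch")
    sqlContext.sql("SELECT sum(n) FROM migration_sketch").head().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // Single entry point; adding .enableHiveSupport() to the builder covers the HiveContext case.
    val spark = SparkSession.builder()
      .appName("migration-sketch")
      .master("local[1]") // assumption: local run for illustration only
      .getOrCreate()
    println(total(spark))
    spark.stop()
  }
}
```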
Structured Streaming – Limitations
• Machine learning support still weak (coming soon)
• Multiple (chained) aggregations not supported
• limit, take, collect, show, count, foreach do not work directly on a streaming Dataset
• Join limitations
• Caching for multiple actions
• Aggregation queries / SQL on single micro batch
• No Kinesis support
• Java 8 only
Structured Streaming – Future: Mid-Long Term
• Streaming without micro-batches
• ~1 ms latency has been promised (and without code changes)
• Berkeley Drizzle project: potential replacement of the streaming engine
• For users: will not be much different
• No changes in code
Talent vs. Tooling
Shortage of Talent and the Urgent Need For It
• Spark projects are increasing
• Need to get done quickly with budget controls
• The big barrier
• Talent - Deep Spark / Scala skills are hard to find
• Big gap between Spark prototype app vs. production grade scale, stability
• Lot of engineers on other projects need to be made productive quickly
The Need for Tooling
• Need very good enterprise grade, UI driven tooling around Spark to make it easy
• Need to cover all bases:
• Development, Debugging, Deployment, DevOps, Monitoring
• Also need to cover the full data processing journey
• Ingest
• Data Quality
• Blending
• Transformation / Enrichment
• Analytics / Machine Learning
• Loading of target databases
• Visualization
StreamAnalytix – “Visual Spark” and More…
• StreamAnalytix is one such platform which makes Spark easy
• Drag-and-drop UI to build and deploy Spark apps in minutes
• Real-time and Batch Data360 platform – on Apache Spark 2.1
• Support for Spark 2.2 and Structured Streaming coming in 4Q
About StreamAnalytix
• Enterprise grade, UI driven streaming, IoT and batch analytics and machine learning platform
• Based on multiple open-source engines: Spark, Storm and Flink (future)
• On premise and cloud compatible
Please provide your feedback on the webinar and your
interest to attend our upcoming webinars.
Meet us at Booth # 127
Strata Data Conference in New York
September 26-28, 2017
Thank you.
Questions?
