The structured streaming upgrade to Apache Spark and how enterprises can benefit - StreamAnalytix Webinar

© 2017 Impetus Technologies
WEBINAR
The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit
August 2017
Anand Venugopal, Product Head & AVP, StreamAnalytix
Amit Assudani, Sr. Technical Architect – Spark, StreamAnalytix

Quick Webinar Notes
• Our focus: enabling the real-time enterprise and making Spark easy to use
• Sharing our experience and expertise with you
• Level of content
  • 20-80 :: New-Experienced (w.r.t. Spark)
• Format: a combination of panel discussion and presentation
• Uses some artifacts and pictures from the Apache Spark website and other public sources
• Q&A and interactions are important and highly valued
• Please send us your comments/feedback using the Webex console

Webinar Outline
• About Impetus and what is StreamAnalytix? – 2 minutes
• Apache Spark: know the basics and its evolution – 8 minutes
• A deep dive into Structured Streaming – 25 minutes
  • What is it?
  • How is it different from 1.0?
  • Features and technical highlights
  • Benefits and limitations
  • Upgrades and migrations
  • Future roadmap
• Talent vs Tooling – 5 minutes
• Q&A – 5+ minutes

About Impetus
• Mission-critical technology solutions since 1996
• Fortune 500 Big Data clients
• 1,700 people; US, India, global reach
• Unique mix of Big Data products and services

Real-time Stream Processing & Machine Learning Platform
+
Visual Spark Studio

Apache Spark – The Beginning
• Started as a project at UC Berkeley's AMPLab in 2009 by Matei Zaharia; open sourced (BSD license) in 2010
• Framework on a distributed resource management system (Mesos)
• Goal: speed up ML jobs in Apache Hadoop with an in-memory approach
• 30x performance increase on Hadoop jobs

Apache Spark – Current State
• Robust, widely used technology
• Survey by Taneja Group in November 2016 highlights:
  • 54% of 7,000 enterprise participants said they were actively using Spark
  • 55% of workloads were ETL / data processing / engineering
  • Cloud deployments projected well beyond 30%
  • Popular new initiatives: data science exploration, streaming and machine learning
[Diagram: Spark workloads (high-speed batch, interactive, iterative, graph, micro-batch streaming); sits on Hadoop and/or cloud]

Spark Evolution
Major version: Spark 0.x
Spark 0.7-0.9 (Feb 2014)
Features:
• Becomes a top-level Apache project
• RDD concept introduced with Spark
• Scala and Java bindings
• Adds a Python API called PySpark
• Introduces Spark Streaming
• Introduces MLlib
• Includes a first version of GraphX
Remarks:
• PySpark makes it possible to use Spark from Python
• Spark Streaming adds near-real-time processing capability
• Spark Streaming is now out of alpha and includes significant optimizations and simplified high-availability deployment

Spark Evolution
Major version: Spark 1.0-1.2
Spark 1.0 (May 2014)
Features:
• Adds Spark SQL
• Guarantees stability of its core API
• Full support for running seamlessly in secured Hadoop clusters
Remarks:
• Spark 1.0 was the first production-ready, backward-compatible release. It viewed Spark Streaming as faster batch processing rather than streaming
• Became the first open-source Big Data framework to embrace in-memory computing
Spark 1.1 (Sep 2014)
Features:
• Migrates all customer workloads from Shark to Spark SQL
• Expansion of MLlib
• Extends libraries and sources for Spark Streaming
Remarks:
• First minor release in the 1.x series. Added significant extensions to the newly added Spark SQL and to Spark MLlib
Spark 1.2 (Dec 2014)
Features:
• A new API for external data sources
• New high-availability driver support through a Write Ahead Log (WAL), which removes any single point of failure from Spark Streaming
• A higher-level API for constructing pipelines in the spark.ml package
• GraphX project provides a stable API
Remarks:
• Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step
• However, still lacked a direct API to read structured data from Spark

Spark Evolution
Major version: Spark 1.3-1.5
Spark 1.3 (Mar 2015)
Features:
• A new DataFrames API
• Provides a rich set of new MLlib algorithms
• Adds APIs for a direct Kafka streaming source
Remarks:
• DataFrames allow Spark to better understand the structure of data as well as the computation being performed
• First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)
Spark 1.4 (Jun 2015)
Features:
• Introduces SparkR
• ML pipelines API graduates from alpha with new transformers and improved Python coverage
• Adds visual debugging and monitoring utilities to evaluate running Spark applications
• A REST API for application information
• Initial performance improvements in Project Tungsten
• A pluggable interface for write ahead logs
Remarks:
• Targets data scientists with SparkR on the new DataFrame API
• Ships the initial pieces of Project Tungsten, the first version of custom memory management
Spark 1.5 (Sep 2015)
Features:
• First major pieces of Project Tungsten
• New ML algorithms; extends the new R API
• Adds visualization of SQL and DataFrame query plans in the web UI
• Operational features for the streaming component, such as backpressure support
Remarks:
• Pushes Project Tungsten
• Focused on increasing Spark's performance through several low-level architectural optimizations
• Another major theme was data science

Spark Evolution
Major version: Spark 1.6
Spark 1.6 (Jan 2016)
Features:
• Experimental Dataset API
• New data science functionality: ML pipeline persistence and new algorithms
• A new and efficient mapWithState API, which replaces updateStateByKey
• 10x speedup for streaming state management
• SQL queries on files
Remarks:
• Datasets, a typed extension of the DataFrame API, make it possible to work with custom objects and lambda functions while keeping the benefits of Spark SQL

Spark Evolution
Major version: Spark 2.0-2.2
Spark 2.0 (Jul 2016)
Features:
• A new API, Structured Streaming
• Second-generation Tungsten engine
• Unified DataFrame and Dataset in Scala/Java
• Substantial (2-10x) performance speedup for common operators in SQL and DataFrames with a new technique called whole-stage code generation
Remarks:
• Structured Streaming launched experimentally. Aims to integrate batch and stream processing. Introduces the concept of continuous applications
Spark 2.1 (Dec 2016)
Features:
• Hardening of Structured Streaming (still experimental)
• Adds a number of SQL functionalities
• Focuses on advanced analytics
• SparkR becomes the most comprehensive library for distributed machine learning on R
Remarks:
• Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications. Introduces:
  • Event-time watermarks
  • Support for all file-based formats and all file-based features
  • Native support for Kafka 0.10
Spark 2.2 (Jul 2017)
Features:
• Production-ready Structured Streaming
• Focuses on advanced analytics and Python
• Cost-based optimizer
• Limit on the maximum number of records written per file
• Support for parsing multi-line JSON and CSV files
Remarks:
• The Structured Streaming APIs are now GA and are no longer labeled experimental
• Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX

Poll Question
What is your currently used Spark version?
- 1.6 or prior
- 2.1
- 2.2
- Planning to start soon
- No plans

A Deep Dive into Structured Streaming

Structured Streaming – What is it?
• A substantially improved framework compared with Spark Streaming (the DStream API) of Spark 1.x
• High-level streaming API built on Spark SQL (DataFrame/Dataset API) and the Catalyst optimizer
• Express streaming computations the same way as batch computations
• Repeated query / incremental execution on an unbounded table
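
To make the "same way as batch" point concrete, here is a minimal sketch rather than anything taken from the webinar itself. It assumes Spark 2.2 on the classpath; the schema, the column names (device, status, eventTime) and the /tmp/events directory are hypothetical. One function holds the query logic and is applied first to a bounded read and then to an unbounded one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object UnifiedBatchAndStream {
  // Hypothetical event schema; streaming file sources need an explicit schema.
  val schema: StructType = new StructType()
    .add("device", StringType)
    .add("status", StringType)
    .add("eventTime", TimestampType)

  // One definition of the query logic, usable on static and streaming input.
  def errorCountsByDevice(events: DataFrame): DataFrame =
    events.filter(col("status") === "ERROR").groupBy("device").count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("unified-sketch")
      .master("local[2]")
      .getOrCreate()

    // Batch: the directory is treated as a bounded table and read once.
    errorCountsByDevice(spark.read.schema(schema).json("/tmp/events")).show()

    // Streaming: the same query, re-evaluated incrementally as files arrive.
    val query = errorCountsByDevice(spark.readStream.schema(schema).json("/tmp/events"))
      .writeStream
      .outputMode("complete") // emit the full, updated result table on each trigger
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

Only the read and write calls differ between the two runs; the transformation in the middle is shared.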

Structured Streaming – What is it?
• “NO REASONING ABOUT STREAMING”
• Simply define a flow:
  • source → transformation → sink → mode & trigger time → checkpoint
• Structured Streaming makes streaming ETL + analytics easier and a natural single flow
• Not restricted to hard batch duration limits (delivers lower latency)
• Exactly-once guarantee is now truly end-to-end: it includes the sink layer
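
The flow above can be sketched end to end as follows. This is an illustration under assumptions, not the presenters' code: the sensor schema and the /tmp/in, /tmp/out and /tmp/chk paths are hypothetical, and a file source and Parquet sink are used only because they need no external services.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object StreamingFlow {
  // Transformation step: keep rows with a non-null, non-negative reading.
  def validReadings(readings: DataFrame): DataFrame =
    readings.filter(col("reading").isNotNull && col("reading") >= 0.0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("flow-sketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical schema for the incoming JSON files.
    val schema = new StructType().add("sensor", StringType).add("reading", DoubleType)

    val source = spark.readStream.schema(schema).json("/tmp/in") // source
    val cleaned = validReadings(source)                           // transformation

    val query = cleaned.writeStream
      .format("parquet")                             // sink
      .option("path", "/tmp/out")
      .outputMode("append")                          // mode
      .trigger(Trigger.ProcessingTime("10 seconds")) // trigger time
      .option("checkpointLocation", "/tmp/chk")      // checkpoint
      .start()
    query.awaitTermination()
  }
}
```

The checkpoint directory is what lets a restarted query resume from where it stopped.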

Structured Streaming – Code Snippet
(Structured Streaming vs Batch)
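// Assumes an existing SparkSession named `spark`
import spark.implicits._

// Structured Streaming: read from Kafka, cast key/value to strings, write back to Kafka
val streamDf = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
val streamKv = streamDf
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]
streamKv.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2") // a different topic from the source, to avoid a feedback loop
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder path; required by the Kafka sink
  .start()

// Batch: same formats, with read/write in place of readStream/writeStream
val batchDf = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
batchDf.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2")
  .save()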

Structured Streaming – Code Snippet
(DStreams vs Batch)
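import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// DStream (assumes an existing StreamingContext named `streamingContext`)
val topics = Array("topicA", "topicB")
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val stream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    streamingContext,
    PreferConsistent,
    Subscribe[String, String](topics, kafkaParams)
  )
val keyValues = stream.map(record => (record.key, record.value))
// No built-in Kafka write support in the DStream API

// Batch (assumes an existing SparkSession named `spark`)
val df = spark
  .read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092,anotherhost:9092")
  .option("subscribe", "topicA,topicB")
  .load()
df.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092,anotherhost:9092")
  .option("topic", "topicC") // placeholder output topic
  .save()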

Streaming Code – Executed on “Trigger”
(One Time Batch)
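import org.apache.spark.sql.streaming.Trigger

// Structured Streaming (assumes an existing SparkSession named `spark`)
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
val kv = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
// One-time trigger: process everything available since the last run, then stop
kv.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic2") // placeholder output topic
  .option("checkpointLocation", "/path/to/checkpoint") // placeholder path; tracks progress between runs
  .trigger(Trigger.Once())
  .start()
• No need to work out “changed data” or worry about output consistency
• Much easier stateful processing, such as deduplication
• Unified code: no separate code bases for Lambda-architecture solutions
• Cost savings from not running the cluster 24/7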

Poll Results

Structured Streaming – Features and Highlights
(Event Time; Window Duration and Triggers)
• Event-time orientation
  • In combination with “windows” and triggers
• Aggregates maintained by Structured Streaming
  • No need to write separate code
• Incremental query and output modes
  • append / complete / update
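
As an illustration of event-time windows combined with an output mode, the sketch below counts words per sliding window. It is an assumption-laden example, not code from the slides: it uses the built-in socket source with its includeTimestamp option (so a listener such as `nc -lk 9999` must be running), and the column names word and eventTime are chosen here for readability.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object EventTimeWindows {
  // Counts per word over 10-minute windows that slide every 5 minutes.
  // Spark maintains the running aggregates; no separate state code is needed.
  def countsPerWindow(events: DataFrame): DataFrame =
    events
      .groupBy(window(col("eventTime"), "10 minutes", "5 minutes"), col("word"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("window-sketch")
      .master("local[2]")
      .getOrCreate()

    // The socket source yields columns (value, timestamp) when includeTimestamp is set.
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()
      .toDF("word", "eventTime")

    val query = countsPerWindow(events).writeStream
      .outputMode("update") // emit only the windows whose counts changed in this trigger
      .format("console")
      .option("truncate", "false")
      .start()
    query.awaitTermination()
  }
}
```

Switching outputMode to "complete" would instead re-emit every window on each trigger.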

(Late Data Handling)

(Watermarking (“Data too late!”))
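
A hedged sketch of how a watermark is typically declared: the 10-minute lateness threshold, the 5-minute window and the column names are assumptions chosen for illustration, and the socket source again stands in for a real feed. With a watermark in place, append mode can emit each window once it can no longer change, and state for older windows can be dropped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object LateDataHandling {
  // Accept events up to 10 minutes late (measured against the latest event time seen).
  // Data arriving later than that may be dropped as "too late".
  def finalCountsPerWindow(events: DataFrame): DataFrame =
    events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window(col("eventTime"), "5 minutes"), col("word"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("watermark-sketch")
      .master("local[2]")
      .getOrCreate()

    // Requires a listener on the port, for example one started with `nc -lk 9999`.
    val events = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .option("includeTimestamp", true)
      .load()
      .toDF("word", "eventTime")

    val query = finalCountsPerWindow(events).writeStream
      .outputMode("append") // each window is written once, after the watermark passes its end
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```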

• New data formats:
  • Native multi-line JSON support
  • Native CSV data source
• Stateful processing and timeouts beyond aggregations
  • Using mapGroupsWithState and flatMapGroupsWithState
• New built-in “rate” source that generates data for benchmarking and testing
  • x number of events, <xyz> format
• Metrics for Structured Streaming: new metrics sink
  • Connect with Graphite
  • Streaming listener (metrics for every batch execution)
• Kafka 0.10 support; from_json, to_json, explode
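
To show how from_json fits with the Kafka source, here is a minimal sketch. The payment schema (accountId, amount), the broker addresses and the topic name are hypothetical; only the `value` column produced by the Kafka source and the from_json function itself come from Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object ParseKafkaJson {
  // Hypothetical payload schema for the JSON carried in the Kafka message value.
  val paymentSchema: StructType = new StructType()
    .add("accountId", StringType)
    .add("amount", DoubleType)

  // Works on any DataFrame with a binary or string `value` column.
  def parsePayments(raw: DataFrame): DataFrame =
    raw
      .select(from_json(col("value").cast("string"), paymentSchema).as("payment"))
      .select("payment.*") // flatten the parsed struct into top-level columns

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("from-json-sketch")
      .master("local[2]")
      .getOrCreate()

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("subscribe", "payments") // hypothetical topic
      .load()

    val query = parsePayments(raw).writeStream
      .outputMode("append")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

Running the streaming part needs the spark-sql-kafka-0-10 package on the classpath; parsePayments itself can be tried on a plain batch DataFrame.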

• New input / output features:
  • Kafka stream / batch writer (DStreams did not have a Kafka writer)
  • Kafka batch / stream source (Kafka was not previously available as a batch source)
  • Partitioning of output data files (example: Hive data output)
• Deduplication is a built-in function
  • Example: major bank use case
    • Without Structured Streaming: manually record each hash value in an external store and check against it
    • With Structured Streaming: unbounded table with hash values
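
The deduplication case can be sketched with the built-in dropDuplicates operator. This is an illustrative guess at the shape of such a job, not the bank's implementation: the recordHash and eventTime columns, the one-hour watermark and the /tmp/records directory are all assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

object Deduplicate {
  // Spark keeps the seen (recordHash, eventTime) pairs as streaming state;
  // the watermark bounds how long that state has to be retained.
  def dedupe(records: DataFrame): DataFrame =
    records
      .withWatermark("eventTime", "1 hour")
      .dropDuplicates("recordHash", "eventTime")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("dedupe-sketch")
      .master("local[2]")
      .getOrCreate()

    // Hypothetical schema: a precomputed hash per record plus its event time.
    val schema = new StructType()
      .add("recordHash", StringType)
      .add("eventTime", TimestampType)

    val records = spark.readStream.schema(schema).json("/tmp/records")

    val query = dedupe(records).writeStream
      .outputMode("append")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```

No external store or manual lookup appears in the code; the duplicate check is part of the query.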

• Improvements (not new):
  • Easier stream-to-batch joins
  • Recovery from failures using checkpoints (also available with DStreams)
  • Enhanced “code productivity”: continuous SQL over batches and aggregations (maintained by Structured Streaming)
  • Enhanced batch interoperability
Structured Streaming – Additional Features
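
As one example of the stream-to-batch join mentioned above, the sketch below enriches a stream of transactions with a static accounts table. The file paths, column names and CSV/JSON formats are hypothetical; the join itself is the ordinary DataFrame join.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object StreamStaticJoin {
  // The same join call is used whether `transactions` is streaming or static.
  def enrich(transactions: DataFrame, accounts: DataFrame): DataFrame =
    transactions.join(accounts, Seq("accountId"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("join-sketch")
      .master("local[2]")
      .getOrCreate()

    // Static (batch) reference data, read once.
    val accounts = spark.read.option("header", "true").csv("/tmp/accounts.csv")

    // Streaming side: hypothetical JSON transaction files.
    val txSchema = new StructType().add("accountId", StringType).add("amount", DoubleType)
    val transactions = spark.readStream.schema(txSchema).json("/tmp/transactions")

    val query = enrich(transactions, accounts).writeStream
      .outputMode("append")
      .format("console")
      .start()
    query.awaitTermination()
  }
}
```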

Spark Version Management Considerations
(Migration, Co-existence)
• Co-existence of 1.6 and 2.x on the same Hadoop cluster
• Forward compatibility changes
  • SparkSession is now the entry point of Spark
    • Replaces the old (1.x) SQLContext and HiveContext
  • Dataset API and DataFrame API are unified
    • Scala: DataFrame becomes a type alias for Dataset[Row]
    • Java API users must replace DataFrame with Dataset<Row>
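
A small sketch of what these migration points look like in Scala code, assuming Spark 2.x on the classpath. The application name and the commented-out 1.x lines are for comparison only.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionMigration {
  // In Spark 2.x Scala, DataFrame is a type alias for Dataset[Row],
  // so a DataFrame can be returned where a Dataset[Row] is expected.
  def asRows(df: DataFrame): Dataset[Row] = df

  def main(args: Array[String]): Unit = {
    // Spark 1.x, for comparison:
    //   val sc = new SparkContext(conf)
    //   val sqlContext = new SQLContext(sc) // or new HiveContext(sc)
    val spark = SparkSession.builder
      .appName("migration-sketch")
      .master("local[2]")
      // .enableHiveSupport() // takes the place of HiveContext; needs Hive classes on the classpath
      .getOrCreate()

    val df: DataFrame = spark.range(5).toDF("id")
    asRows(df).show()

    // Code that still expects the old handles can obtain them from the session.
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext
    println(s"Spark ${sc.version}, legacy SQLContext available: ${sqlContext != null}")

    spark.stop()
  }
}
```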

Structured Streaming – Limitations
• Machine learning support still weak (coming soon)
• Multiple (chained) aggregations not supported
• limit, take, collect, show, count and foreach do not work directly on streaming DataFrames
• Join limitations
• Caching for multiple actions
• Aggregation queries / SQL on single micro batch
• No Kinesis support
• Java 8 only
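
Because actions such as show and collect are unavailable on a streaming DataFrame, one common workaround for inspection is the built-in memory sink, sketched below. The table name word_counts and the socket source are assumptions for illustration, and the helper canUseBatchActions is a hypothetical convenience, not a Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InspectStream {
  // Hypothetical helper: batch-style actions only apply when the DataFrame is not streaming.
  def canUseBatchActions(df: DataFrame): Boolean = !df.isStreaming

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("inspect-sketch")
      .master("local[2]")
      .getOrCreate()

    // Requires a listener on the port, for example one started with `nc -lk 9999`.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // lines.show() would fail here: streaming queries must be started with writeStream.start().
    println(s"Batch actions allowed: ${canUseBatchActions(lines)}")

    // Instead, write the running result to an in-memory table.
    val query = lines.groupBy("value").count()
      .writeStream
      .format("memory")
      .queryName("word_counts") // hypothetical table name
      .outputMode("complete")
      .start()

    // Give the query a moment to process input, then inspect it with ordinary batch SQL.
    Thread.sleep(5000)
    spark.sql("SELECT * FROM word_counts").show()
    query.stop()
    spark.stop()
  }
}
```

The memory sink keeps results on the driver, so it is suited to debugging and small result sets rather than production output.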

Structured Streaming – Future: Mid-Long Term
• Streaming without micro-batches
  • ~1 ms latency has been promised (without code changes)
• Berkeley Drizzle project: potential replacement of the streaming engine
• For users: will not be much different
  • No changes in code

Talent vs. Tooling

Shortage of Talent and the Urgent Need For It
• Spark projects are increasing
• They need to get done quickly and within budget
• The big barrier
  • Talent: deep Spark / Scala skills are hard to find
  • Big gap between a Spark prototype app and production-grade scale and stability
  • Many engineers on other projects need to be made productive quickly

The Need for Tooling
• Need very good enterprise-grade, UI-driven tooling around Spark to make it easy
• Need to cover all bases:
  • Development, debugging, deployment, DevOps, monitoring
• Also need to cover the full data processing journey
  • Ingest
  • Data quality
  • Blending
  • Transformation / enrichment
  • Analytics / machine learning
  • Loading of target databases
  • Visualization

StreamAnalytix – “Visual Spark” and More…
• StreamAnalytix is one such platform which makes Spark easy
• Drag-and-drop UI to build and deploy Spark apps in minutes
• Real-time and Batch Data360 platform – on Apache Spark 2.1
• Support for Spark 2.2 and Structured Streaming coming in 4Q

About StreamAnalytix
• Enterprise-grade, UI-driven streaming, IoT and batch analytics and machine learning platform
• Based on multiple open-source engines: Spark, Storm and Flink (future)
• On-premise and cloud compatible

Please provide your feedback on the webinar and your
interest to attend our upcoming webinars.
Meet us at Booth # 127
Strata Data Conference in New York
September 26-28, 2017
