
The structured streaming upgrade to Apache Spark and how enterprises can benefit - StreamAnalytix Webinar


Adoption of Apache Spark for real-time data analysis is growing, driven by its ability to handle sophisticated analytical requirements and by its common framework for streaming and batch processing. However, most organizations also want "true streaming" features such as lower latency and the ability to process out-of-order data.

Structured Streaming, a new high-level API introduced in Apache Spark 2.0, promises these and other enhancements to the Spark approach to streaming data processing.

In this webinar, Anand Venugopal (Product Head) and other technical experts from StreamAnalytix discuss the promising developments in Apache Spark 2.0 and how organizations can use Structured Streaming to make timely, accurate decisions and stay competitive.

Published in: Data & Analytics

  1. © 2017 Impetus Technologies. WEBINAR: The Structured Streaming Upgrade to Apache Spark and How Enterprises Can Benefit. Presenters: Anand Venugopal, Product Head & AVP, StreamAnalytix; Amit Assudani, Sr. Technical Architect – Spark, StreamAnalytix. August 2017.
  2. Quick Webinar Notes
       • Our focus: enabling the real-time enterprise and making Spark easy to use
       • Sharing our experience and expertise with you
       • Level of content: 20-80 :: New-Experienced (w.r.t. Spark)
       • Format: a combination of panel discussion and presentation
       • Some artifacts and pictures are taken from the Apache Spark website and other public sources
       • Q&A and interactions are important and highly valued
       • Please send us your comments and feedback using the Webex console
  3. Webinar Outline
       • About Impetus and what is StreamAnalytix? – 2 minutes
       • Apache Spark: the basics and its evolution – 8 minutes
       • A deep dive into Structured Streaming – 25 minutes
           • What is it?
           • How is it different from 1.0?
           • Features and technical highlights
           • Benefits and limitations
           • Upgrades and migrations
           • Future roadmap
       • Talent vs Tooling – 5 minutes
       • Q&A – 5+ minutes
  4. About Impetus
       • Mission critical technology solutions since 1996
       • Fortune 500: Big Data clients
       • 1700 people; US, India, global reach
       • Unique mix of Big Data products and services
  5. Real-time Stream Processing & Machine Learning Platform + Visual Spark Studio
  6. Apache Spark – The Beginning
       • Started as a project at the UC Berkeley AMPLab in 2009 by Matei Zaharia; open sourced (BSD license) in 2010
       • A framework on top of a distributed resource management system (Mesos)
       • Goal: speed up ML jobs in Apache Hadoop with an in-memory approach
       • 30x performance increase on Hadoop jobs
  7. Apache Spark – Current State
       • Robust, widely used technology
       • Highlights of a Taneja Group survey, November 2016:
           • 54% of 7000 enterprise participants said they were actively using Spark
           • 55% of workloads were ETL / data processing / engineering
           • Cloud deployments projected well beyond 30%
           • Popular new initiatives: data science exploration, streaming and machine learning
       • Diagram labels: Micro-batch, Hi-speed Batch, Interactive, Iterative, Graph, Streaming; sits on Hadoop and/or Cloud
  8. Spark Evolution – Spark 0.X
       • Spark 0.7-0.9 (Feb 2014)
           • Features:
               • Becomes a top level Apache project
               • RDD concept introduced with Spark
               • Scala and Java binding
               • Adds a Python API called PySpark
               • Introduces Spark Streaming
               • Introduces MLlib
               • Includes a first version of GraphX
           • Remarks:
               • PySpark makes it possible to use Spark from Python
               • Spark Streaming adds near real-time processing capability
               • Spark Streaming is now out of alpha and includes significant optimizations and simplified high availability deployment
  9. Spark Evolution – Spark 1.0-1.2
       • Spark 1.0 (May 2014)
           • Features:
               • Adds Spark SQL
               • Guarantees stability of its core API
               • Full support for running seamlessly in secured Hadoop clusters
           • Remarks:
               • Spark 1.0 was the first production ready, backward compatible release. It viewed Spark Streaming as faster batch processing rather than streaming
               • Became the 1st open source Big Data framework to embrace in-memory computing
       • Spark 1.1 (Sep 2014)
           • Features:
               • Migrates all customer workloads from Shark to Spark SQL
               • Expansion of MLlib
               • Extends libraries and sources for Spark Streaming
           • Remarks:
               • First minor release in the 1.X series. Added significant extensions to the newly added Spark SQL and to Spark MLlib
       • Spark 1.2 (Dec 2014)
           • Features:
               • A new API for external data sources
               • New H/A driver support through a Write Ahead Log (WAL), which removes any single point of failure from Spark Streaming
               • A higher-level API for constructing pipelines in the spark.ml package
               • GraphX project provides a stable API
           • Remarks:
               • Recognized the need for structured data and started to evolve to support it. Introduced a specialized RDD schema as a first step
               • However, still lacked a direct API to read structured data from Spark
  10. Spark Evolution – Spark 1.3-1.5
       • Spark 1.3 (Mar 2015)
           • Features:
               • A new DataFrames API
               • Provides a rich set of new MLlib algorithms
               • Adds APIs for a direct Kafka streaming source
           • Remarks:
               • DataFrames allow Spark to better understand the structure of data as well as the computation being performed
               • First unified API to read from structured and semi-structured sources (both RDBMS and NoSQL databases)
       • Spark 1.4 (Jun 2015)
           • Features:
               • Introduces SparkR
               • ML pipelines API graduates from alpha with new transformers and improved Python coverage
               • Adds visual debugging and monitoring utilities to evaluate running Spark applications
               • A REST API
               • Initial performance improvements in Project Tungsten
               • A pluggable interface for write ahead logs
           • Remarks:
               • Targets data scientists with SparkR on the new DataFrame API
               • Ships the initial pieces of Project Tungsten, becoming the first version with custom memory management
       • Spark 1.5 (Sep 2015)
           • Features:
               • 1st major pieces of Project Tungsten
               • New ML algorithms, extends the new R API
               • Adds visualization of SQL and DataFrame query plans in the web UI
               • Operational features for the streaming component, such as backpressure support
           • Remarks:
               • Pushes Project Tungsten
               • Focused on increasing Spark’s performance through several low-level architectural optimizations
               • Another major theme was data science
  11. Spark Evolution – Spark 1.6
       • Spark 1.6 (Jan 2016)
           • Features:
               • Experimental Dataset API
               • New data science functionality: ML pipeline persistence and new algorithms
               • A new and efficient mapWithState API, which replaces updateStateByKey
               • Speedup of 10X for streaming state management
               • SQL queries on files
           • Remarks:
               • Datasets, a typed extension of the DataFrame API, make it possible to work with custom objects and lambda functions while keeping the benefits of Spark SQL
  12. Spark Evolution – Spark 2.0-2.2
       • Spark 2.0 (Jul 2016)
           • Features:
               • A new API, Structured Streaming
               • Second generation Tungsten engine
               • Unified DataFrame and Dataset in Scala/Java
               • Substantial (2-10X) performance speedup for common operators in SQL and DataFrames with a new technique called whole stage code generation
           • Remarks:
               • Structured Streaming launched experimentally
               • Aims to integrate batch and stream processing. Introduces the concept of continuous applications
       • Spark 2.1 (Dec 2016)
           • Features:
               • Hardening of Structured Streaming (still experimental)
               • Adds a number of SQL functionalities
               • Focuses on advanced analytics
               • SparkR becomes the most comprehensive library for distributed machine learning on R
           • Remarks:
               • Introduced Structured Streaming as a high-level API for building continuous applications. Aims to make it easier to build end-to-end streaming applications
               • Introduces event-time watermarks
               • Introduces support for all file-based formats and all file-based features
               • Adds native support for Kafka 0.10
       • Spark 2.2 (Jul 2017)
           • Features:
               • Production ready Structured Streaming
               • Focuses on advanced analytics and Python
               • Cost-based optimizer
               • Limit on the max number of records written per file
               • Support for parsing multi-line JSON & CSV files
           • Remarks:
               • The Structured Streaming APIs are now GA and are no longer labeled experimental
               • Adds various SQL functionalities and introduces additional algorithms in MLlib and GraphX
  13. Poll Question: What is your currently used Spark version?
       • 1.6 or prior
       • 2.1
       • 2.2
       • Planning to start soon
       • No plans
  14. A Deep Dive into Structured Streaming
  15. Structured Streaming – What is it?
       • Strongly improved framework over Spark Streaming (DStream API) of Spark 1.x
       • High level streaming API built on Spark SQL (DataFrame/Dataset API) and the Catalyst Optimizer
       • Express streaming computations the same way as batch computations
       • Repeated query / incremental execution on an unbounded table
  16. Structured Streaming – What is it?
       • “NO REASONING ABOUT STREAMING”
       • Simply define a flow:
           • source → transformation → sink → mode & trigger time → checkpoint
       • Structured Streaming makes Streaming ETL + Analytics easier and a natural single flow
       • Not restricted to hard batch duration limits (delivers lower latency)
       • Exactly-once guarantee now truly end-to-end: includes the sink layer
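The flow on slide 16 (source → transformation → sink → mode & trigger → checkpoint) could be sketched roughly as follows. This is a minimal illustration, not code from the webinar: the object name `FlowSketch`, the helper `evensOnly`, the local master setting, and the checkpoint path are all assumptions, and the built-in `rate` source stands in for a real input such as Kafka.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

object FlowSketch {
  // Transformation: plain DataFrame logic, usable on batch and streaming input alike.
  def evensOnly(df: DataFrame): DataFrame =
    df.filter(col("value") % 2 === 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flow-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()

    // Source: the built-in "rate" test source emits (timestamp, value) rows.
    val source = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // Sink, output mode, trigger interval and checkpoint location.
    val query = evensOnly(source).writeStream
      .format("console")
      .outputMode(OutputMode.Append())
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .option("checkpointLocation", "/tmp/flow-sketch-checkpoint") // hypothetical path
      .start()

    query.awaitTermination()
  }
}
```

Because `evensOnly` only uses ordinary DataFrame operations, the same function can be applied to a batch DataFrame, which is the "same way as batch" point made on slide 15.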
  17. Structured Streaming – Code Snippet (Structured Streaming vs Batch)

          // Structured Streaming
          import spark.implicits._ // encoders needed by .as[(String, String)]

          val streamIn = spark
            .readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("subscribe", "topic1")
            .load()
          // Keep the result of the cast; the transformation does not modify streamIn in place
          val streamOut = streamIn
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
            .as[(String, String)]
          streamOut.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("topic", "topic1")
            .option("checkpointLocation", "/path/to/checkpoint") // placeholder; the Kafka sink requires a checkpoint location
            .start()

          // Batch
          val batchIn = spark
            .read
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("subscribe", "topic1")
            .load()
          val batchOut = batchIn
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
            .as[(String, String)]
          batchOut.write
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("topic", "topic1")
            .save()
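  18. Structured Streaming – Code Snippet (DStreams vs Batch)

          // DStream
          import org.apache.kafka.common.serialization.StringDeserializer
          import org.apache.spark.streaming.kafka010.KafkaUtils
          import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
          import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

          val topics = Array("topicA", "topicB")
          val kafkaParams = Map[String, Object](
            "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
            "key.deserializer" -> classOf[StringDeserializer],   // required by the Kafka consumer
            "value.deserializer" -> classOf[StringDeserializer], // required by the Kafka consumer
            "group.id" -> "use_a_separate_group_id_for_each_stream",
            "auto.offset.reset" -> "latest",
            "enable.auto.commit" -> (false: java.lang.Boolean)
          )
          val stream = KafkaUtils.createDirectStream[String, String](
            streamingContext,
            PreferConsistent,
            Subscribe[String, String](topics, kafkaParams)
          )
          stream.map(record => (record.key, record.value))
          // NO Kafka write support

          // Batch
          import spark.implicits._ // encoders needed by .as[(String, String)]

          val batchIn = spark
            .read
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("subscribe", "topic1")
            .load()
          val batchOut = batchIn
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
            .as[(String, String)]
          batchOut.write
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("topic", "topic1")
            .save()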
  19. Streaming Code – Executed on “Trigger” (One Time Batch)

          // Structured Streaming
          import org.apache.spark.sql.streaming.Trigger
          import spark.implicits._ // encoders needed by .as[(String, String)]

          val streamIn = spark
            .readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("subscribe", "topic1")
            .load()
          val streamOut = streamIn
            .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
            .as[(String, String)]

          // One Time Trigger: process whatever is available, then stop
          streamOut.writeStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("topic", "topic1")
            .option("checkpointLocation", "/path/to/checkpoint") // placeholder; the Kafka sink requires a checkpoint location
            .trigger(Trigger.Once())
            .start()

       • No need to figure out “changed data” or to worry about output consistency
       • Much easier stateful processing, such as deduplication
       • Unified code: no separate code base for Lambda solutions
       • Cost saving by not running the cluster 24/7
  20. Poll Results
  21. Structured Streaming – Features and Highlights (Event Time; Window Duration and Triggers)
       • Event time orientation
           • In combination with “windows” and triggers
       • Aggregates maintained by Structured Streaming
           • No need to write separate code
       • Incremental query and output modes
           • append / complete / update
  22. Structured Streaming – Features and Highlights (Late Data Handling)
  23. Structured Streaming – Features and Highlights (Watermarking (“Data too late!”))
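Slides 21 to 23 name event-time windows, late data handling and watermarking without showing code. A minimal sketch of how these typically fit together is given below; the object name `WatermarkSketch`, the helper `windowedCounts`, the 10-minute window and the 15-minute lateness threshold are illustrative assumptions, and the `rate` source again stands in for real input.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object WatermarkSketch {
  // Counts rows per 10-minute event-time window. On a stream, the watermark lets
  // Spark drop state for windows that fall more than 15 minutes behind the latest
  // event time seen, and rows arriving later than that may be ignored ("data too late").
  def windowedCounts(events: DataFrame): DataFrame =
    events
      .withWatermark("timestamp", "15 minutes")
      .groupBy(window(col("timestamp"), "10 minutes"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("watermark-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()

    val events = spark.readStream
      .format("rate") // emits a "timestamp" column used here as event time
      .option("rowsPerSecond", "10")
      .load()

    windowedCounts(events).writeStream
      .format("console")
      .outputMode("update") // emit only the windows whose counts changed
      .option("truncate", "false")
      .start()
      .awaitTermination()
  }
}
```

With `update` output mode only changed windows are emitted per trigger; `append` would instead emit each window once, after the watermark has passed its end.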
  24. Structured Streaming – Features and Highlights
       • New data formats:
           • Native multi-line JSON support
           • Native CSV data source
       • Stateful processing and time-outs beyond aggregations
           • Using mapGroupsWithState and flatMapGroupsWithState
       • New built-in ‘rate’ source that generates data for benchmarking and testing
           • x number of events, <xyz> format
       • Metrics for Structured Streaming: new metrics sink
           • Connect with Graphite
           • Streaming listener (for metrics for every batch execution)
       • Kafka 0.10 support; from_json, to_json, explode
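As one illustration of the `from_json` function listed on slide 24, the sketch below parses a JSON string column (such as a Kafka `value`) into typed columns. The object name `JsonSketch`, the helper `parse`, and the `user`/`amount` schema are hypothetical and chosen only for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructType}

object JsonSketch {
  // Hypothetical payload schema, e.g. {"user":"a","amount":3}
  val schema: StructType = new StructType()
    .add("user", StringType)
    .add("amount", LongType)

  // Parses the "value" column as JSON and flattens it into user/amount columns.
  // Works the same on a batch DataFrame and on a streaming one.
  def parse(raw: DataFrame): DataFrame =
    raw
      .select(from_json(col("value").cast("string"), schema).as("event"))
      .select("event.user", "event.amount")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    val raw = Seq("""{"user":"a","amount":3}""", """{"user":"b","amount":7}""").toDF("value")
    parse(raw).show()
    spark.stop()
  }
}
```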
  25. Structured Streaming – Features and Highlights
       • New input / output features:
           • Kafka stream / batch writer (DStream didn't have a Kafka writer)
           • Kafka batch / stream source (Kafka wasn't available as a source for batch earlier)
           • Partitioning output data files (example: Hive data output)
       • Deduplication is a built-in function
           • Example: major bank use case
           • Without Structured Streaming: manually record hash values in an external store and check against them
           • With Structured Streaming: unbounded table with hash values
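The built-in deduplication mentioned on slide 25 could look roughly like the sketch below. It is not the bank's implementation: the object name `DedupSketch`, the helper `dedupe`, the column names `hash` and `eventTime`, and the one-hour watermark are assumptions made for illustration.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}

object DedupSketch {
  // Drops records whose (hash, eventTime) pair was already seen. On a stream,
  // the watermark bounds how long Spark keeps the seen pairs in state.
  def dedupe(events: DataFrame): DataFrame =
    events
      .withWatermark("eventTime", "1 hour")
      .dropDuplicates("hash", "eventTime")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dedup-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()
    import spark.implicits._

    val ts = Timestamp.valueOf("2017-08-01 10:00:00")
    val events = Seq(("h1", ts), ("h1", ts), ("h2", ts)).toDF("hash", "eventTime")
    dedupe(events).show() // the repeated ("h1", ts) record appears once
    spark.stop()
  }
}
```

The same `dedupe` function can be applied to a streaming DataFrame, in which case Spark maintains the set of seen keys itself instead of an external store.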
  26. Structured Streaming – Additional Features
       • Improvements (not new):
           • Easier stream to batch join
           • Recovering from failures using checkpoints (this was there in DStream also)
           • “Code productivity” enhanced: continuous SQL over batches and aggregations (maintained by Structured Streaming)
           • Enhanced batch inter-operability
  27. Spark Version Management Considerations (Migration, Co-existence)
       • Co-existence of 1.6 and 2.x on the same Hadoop cluster
       • Forward compatibility changes
           • SparkSession is now the new entry point of Spark
               • Replaces the old (1.x) SQLContext and HiveContext
           • Dataset API and DataFrame API are unified
               • Scala: DataFrame becomes a type alias for Dataset[Row]
               • Java API users must replace DataFrame with Dataset<Row>
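The migration points on slide 27 can be illustrated with a short sketch. The object name `SessionSketch`, the helper `countRows`, and the view name `ids` are hypothetical; the local master setting is an assumption for running the example outside a cluster.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object SessionSketch {
  // Runs a SQL query through SparkSession, the 2.x replacement for SQLContext.
  def countRows(spark: SparkSession): Long = {
    val df: DataFrame = spark.range(3).toDF("id")

    // Compiles without conversion because DataFrame is a type alias for Dataset[Row].
    val asDataset: Dataset[Row] = df

    asDataset.createOrReplaceTempView("ids")
    spark.sql("SELECT count(*) AS n FROM ids").first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    // One entry point instead of SQLContext / HiveContext. Where HiveContext was
    // used in 1.x, .enableHiveSupport() can be added to the builder, assuming
    // the Hive classes are on the classpath.
    val spark = SparkSession.builder()
      .appName("session-sketch")
      .master("local[2]") // assumption: local run for illustration only
      .getOrCreate()

    println(countRows(spark))
    spark.stop()
  }
}
```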
  28. Structured Streaming – Limitations
       • Machine learning support still weak (coming soon)
       • Multiple (chained) aggregations not supported
       • limit, take, collect, show, count and foreach do not work on streaming DataFrames/Datasets
       • Join limitations
       • Caching for multiple actions
       • Aggregation queries / SQL on a single micro-batch
       • No Kinesis support
       • Java 8 only
  29. Structured Streaming – Future: Mid-Long Term
       • Streaming without micro-batches
           • ~1 ms latency has been promised (and without code changes)
           • Berkeley Drizzle project: potential replacement of the streaming engine
       • For users: will not be much different
           • No changes in code
  30. Talent vs. Tooling
  31. Shortage of Talent and the Urgent Need For It
       • Spark projects are increasing
           • Need to get done quickly with budget controls
       • The big barrier
           • Talent: deep Spark / Scala skills are hard to find
           • Big gap between a Spark prototype app and production grade scale and stability
           • A lot of engineers on other projects need to be made productive quickly
  32. The Need for Tooling
       • Need very good enterprise grade, UI driven tooling around Spark to make it easy
       • Need to cover all bases:
           • Development, Debugging, Deployment, DevOps, Monitoring
       • Also need to cover the full data processing journey
           • Ingest
           • Data Quality
           • Blending
           • Transformation / Enrichment
           • Analytics / Machine Learning
           • Loading of target databases
           • Visualization
  33. StreamAnalytix – “Visual Spark” and More…
       • StreamAnalytix is one such platform which makes Spark easy
       • Drag-and-drop UI to build and deploy Spark apps in minutes
       • Real-time and Batch Data360 platform on Apache Spark 2.1
       • Support for Spark 2.2 and Structured Streaming coming in 4Q
  34. About StreamAnalytix
       • Enterprise grade, UI driven streaming, IoT and batch analytics and machine learning platform
       • Based on multiple open-source engines: Spark, Storm and Flink (future)
       • On premise and cloud compatible
  35. Please provide your feedback on the webinar and your interest to attend our upcoming webinars. Meet us at Booth # 127, Strata Data Conference in New York, September 26-28, 2017
  36. Thank you. Questions? © 2017 Impetus Technologies – Confidential