Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Thing you didn't know you could do in Spark


Published on

This presentation discusses issues with the modern lambda architecture and how Spark attempts to solve them with structured streaming and interactive querying. It then shows how SnappyData takes these solutions one step further with its Synopsis Data Engine

Published in: Data & Analytics
  • Be the first to comment

Thing you didn't know you could do in Spark

  1. 1. ThingsYou Didn'tKnowYou Could Do in Spark Version 0.1 | © Snappydata Inc 2017 Strategic Advisor @ Snappydata Assistant Prof. @Univ. of Mich Barzan Mozafari CTO | Co-founder @ Snappydata Jags Ramnarayan COO | Co-founder @ Snappydata Sudhir Menon
  2. 2. 2 Current Market Trends Analytic workloads are moving to the cloud Scale-outinfrastructure->IncreasinglyExpensive Excessivedatashuffling -> Longer responsetimes Real-time operational analytics are on the rise MixedWorkloadscombine streams,operationaldataandHistoricaldata Real-timeMatters Concurrency is increasing Sharedinfrustrure(10’sof usersand apps,100’sof queries) Exploratoryanalytics BIandvisualizationtools Concurrentqueries Data volumes are growing rapidly IndustrialIoT(Internetof Things) Datagrowing fasterthanMoore’slaw
  3. 3. 3 Our Observations Acheiving interactive analytics is hard Analystswantresponsetime <2 seconds Hardwith concurrentqueriesin the cluster Hardonananalyst’slaptop Hardonlargedatasets Hardwhenthereare networkshuffles After 99.9%- accurate early results, the final answer is no longer needed 200xqueryspeedup 200xreductionof clusterload, networkshuffles,andI/O waits Lambda architecture is extremely complex (and slow!) Dealingwith multipledataformats, disparateproductAPIs Wastedresources Querying multiple remote data sources is slow Curateddatafrom multiplesourcesisslow
  4. 4. Agenda • Mixed workloads Structured Streaming in Spark 2.0 State Management in Spark 2.0 • Interactive Analytics in Spark 2.0 • The SnappyData Solution • Q&A
  5. 5. Stream Processing Mixed workloads are way too common Transactions (point lookups, small updates) Interactive Analytics Analytics on mutating data Maintaining state or counters while processing streams Correlating and joining streams w/ large histories
  6. 6. Mixed Workload in IoT Analytics - Map sensors to tags - Monitor temperature - Send alerts Correlate current temperature trend with history…. Anomaly detection – score against models Interact using dynamic queries
  7. 7. How Mixed Workloads are Supported Today? Lambda Architecture
  8. 8. Scalable Store HDFS, MPP DBData-in-motion Analytics Application Streams Alerts Enrich Reference DB Lamba Architecure is Complex Enrich – requires reference data join State – manage windows with mutating state Analytics – Joins to history, trend patterns Model maintenance Models Transaction writes Interactive OLAP queries Updates
  9. 9. Scalable Store Updates HDFS, MPP DBData-in-motion Analytics Application Streams Alerts Enrich Reference DB Lamba Architecure is Complex Storm, Spark streaming, Samza… NoSQL – Cassandra, Redis, Hbase …. Models Enterprise DB – Oracle, Postgres.. OLAP SQL DB
  10. 10. Lamba Architecure is Complex • Complexity: learn and master multiple products  data models, disparate APIs, configs • Slower • Wasted resources
  11. 11. Can we simplify & optimize? How about a single clustered DB that can manage stream state, transactional data and run OLAP queries? RDB Tables Txn Framework for stream processing, etc Stream processing HDFS MPP DB Scalable writes, point reads, OLAP queries Apps
  12. 12. One idea: Start from Apache Spark Spark Core Spark Streaming Spark ML Spark SQL GraphX Stream program can Leverage ML, SQL, Graph Why? • Unified programming across streaming, interactive, batch • Multiple data formats, data sources • Rich library for Streaming, SQL, Machine Learning, … But: Spark is a computational engine not a store or a database
  13. 13. Structured Streaming in Spark 2.0
  14. 14. Source: Databricks presentation
  15. 15. Source: Databricks presentation creates a streaming DataFrame, does not start any of the computation write...startStream() defines where & how to output the data and starts the processing
  16. 16. Source: Databricks presentation
  17. 17. New State Management API in Spark 2.0 – KV Store • Preserve state across streaming aggregations across multiple batches • Fetch, store KV pairs • Transactional • Supports versioning (batch) • Built in store that persists to HDFS
  18. 18. State Management in Spark 2.0
  19. 19. Deeper Look – Ad Impression Analytics Ad Network Architecture – Analyze log impressions in real time
  20. 20. Ad Impression Analytics Ref -
  21. 21. Bottlenecks in the write path Stream micro batches in parallel from Kafka to each Spark executor Filter and collect event for 1 minute. Reduce to 1 event per Publisher, Geo every minute Execute GROUP BY … Expensive Spark Shuffle … Shuffle again in DB cluster … data format changes … serialization costs val input :DataFrame= .options(kafkaOptions).format("kafkaSource") .stream() val result :DataFrame = input .where("geo != 'UnknownGeo'") .groupBy( window("event-time", "1min"), Publisher", "geo") .agg(avg("bid"), count("geo"),countDistinct("cookie")) val query = result.write.format("org.apache.spark.sql.cassandra") .outputMode("append") .startStream("dest-path”)
  22. 22. Bottlenecks in the Write Path • Aggregations – GroupBy, MapReduce • Joins with other streams, Reference data Shuffle Costs (Copying, Serialization) Excessive copying in Java based Scale out stores
  23. 23. Avoid bottleneck - Localize process with State Spark Executor Spark Executor Kafka queue KV KV Publisher: 1,3,5 Geo: 2,4,6 Avoid shuffle by localizing writes Input queue, Stream, KV Store all share the same partitioning strategy Publisher: 2,4,6 Geo: 1,3,5 Publisher: 1,3,5 Geo: 2,4,6 Publisher: 2,4,6 Geo: 1,3,5 Spark 2.0 StateStore?
  24. 24. Impedance mismatch with KV stores We want to run interactive “scan”-intensive queries: - Find total uniques for Ads grouped on geography - Impression trends for Advertisers(group by query) Two alternatives: row-stores vs. column stores
  25. 25. Goal – Localize processingFast key-based lookup But, too slow to run aggregations, scan based interactive queries Fast scans, aggregations Updates and random writes are very difficult Consume too much memory
  26. 26. Why columnar storage in-memory?
  27. 27. Interactive Analytics in Spark 2.0
  28. 28. But, are in-memory column-stores fast enough? SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) Berkeley AMPLab Big Data Benchmark -- AWS m2.4xlarge ; total of 342 GB
  29. 29. Why distributed, in-memory and columnar formats not enough for interactive speeds? 1. Distributed shuffles can take minutes 2. Software overheads, e.g., (de)serialization 3. Concurrent queries competing for the same resources So, how can we ensure interactive-speed analytics?
  30. 30. • Most apps happy to tradeoff 1% accuracy for 200x speedup! • Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data! • Often can make perfectly accurate decisions without having perfectly accurate answers! • A/B Testing, visualization, ... • The data itself is usually noisy • Processing entire data doesn’t necessarily mean exact answers! • Inference is probabilistic anyway Use statistical techniques to shrink data!
  31. 31. Our Solution: SnappyData
  32. 32. SnappyData: A New Approach A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics Batch design, high throughput Real-time design Low latency, HA, concurrency Vision: Drastically reduce the cost and complexity in modern big data Rapidly Maturing Matured over 13 years
  33. 33. SnappyData: A New Approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time operational Analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial product with Approximate Query Processing (AQP) MPP DB Index
  34. 34. SnappyData Data Model Rows Columnar Stream processing ODBC JDBCProbabilistic data Snappy Data Server – Spark Executor + Store Index Spark Program SQL Parser RDD Table Hybrid Storage RDD Cluster Mgr Scheduler High Latency/OLAP Low Latency Add/remove Servers Catalog
  35. 35. Features - Deeply integrated database for Spark - 100% compatible with Spark - Extensions for Transactions (updates), SQL stream processing - Extensions for High Availability - Approximate query processing for interactive OLAP - OLTP+OLAP Store - Replicated and partitioned tables - Tables can be Row or Column oriented (in-memory & on-disk) - SQL extensions for compatibility with SQL Standard - create table, view, indexes, constraints, etc
  36. 36. Time Series Analytics ( Trade and Quote Analytics
  37. 37. Time Series Analytics – - Find last quoted price for Top 100 instruments for given date "select quote.sym, avg(bid) from quote join S on (quote.sym = S.sym) where date='$d' group by quote.sym” // Replaced with AVG as MemSQL doesn’t support LAST - Max price for Top 100 Instruments across Exchanges for given date "select trade.sym, ex, max(price) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, ex” - Find Average size of Trade in each hour for top 100 Instruments for given date "select trade.sym, hour(time), avg(size) from trade join S on (trade.sym = S.sym) where date='$d' group by trade.sym, hour(time)"
  38. 38. Results for Small day (40million), Big day(640 million) 0 500 1000 1500 2000 2500 3000 3500 Q1 Q2 Q3 SnappyData 224 125 124 MemSQL 556 1884 3210 AxisTitle Query Response time in Millis (Smaller is better) 0 10000 20000 30000 40000 50000 60000 SnappyData MemSQL Impala Greenplum Q1 1769 8652 45000 56000 Q2 432 26034 2250 900 Q3 494 49614 1880 1000AxisTitle Query Response time in Millis (Smaller is better)
  39. 39. TPC-H: 10X-20X faster than Spark 2.0
  40. 40. Synopses Data Engine
  41. 41. Use Case Patterns 1. Analytics cache - Caching for Analytics over any “big data” store (esp. MPP) - Federate query between samples and backend’ 2. Stream analytics for Spark Process streams, transform, real-time scoring, store, query 3. In-memory transactional store Highly concurrent apps, SQL cache, OLTP + OLAP
  42. 42. Interactive-Speed Analytic Queries – Exact or Approximate Select avg(Bid), Advertiser from T1 group by Advertiser Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1 Speed/Accuracy tradeoffError(%) 30 mins Time to Execute on Entire Dataset Interactive Queries 2 sec Execution Time 42 100 secs 2 secs 1% Error
  43. 43. Approximate Query Processing Features • Support for uniform sampling • Support for stratified sampling - Solutions exist for stored data (BlinkDB) - SnappyData works for infinite streams of data too • Support for exponentially decaying windows over time • Support for synopses - Top-K queries, heavy hitters, outliers, ... • [future] Support for joins • Workload mining (
  44. 44. Uniform (Random) Sampling ID Advertiser Geo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Uniform Sample ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 1/3 5 adv20 NY 0.0001 1/3 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table
  45. 45. Uniform (Random) Sampling ID Advertiser Geo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Uniform Sample ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 2/3 5 adv20 NY 0.0001 2/3 1 adv10 NY 0.0001 2/3 2 adv10 VT 0.0005 2/3 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table Larger
  46. 46. Stratified Sampling ID Advertiser Geo Bid 1 adv10 NY 0.0001 2 adv10 VT 0.0005 3 adv20 NY 0.0002 4 adv10 NY 0.0003 5 adv20 NY 0.0001 6 adv30 VT 0.0001 Stratified Sample on Geo ID Advertiser Geo Bid Sampling Rate 3 adv20 NY 0.0002 1/4 2 adv10 VT 0.0005 1/2 SELECT avg(bid) FROM AdImpresssions WHERE geo = ‘VT’ Original Table
  47. 47. Sketching techniques ● Sampling not effective for outlier detection ○ MAX/MIN etc ● Other probabilistic structures like CMS, heavy hitters, etc ● SnappyData implements Hokusai ○ Capturing item frequencies in timeseries ● Design permits TopK queries over arbitrary trim intervals (Top100 popular URLs) SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
  48. 48. Airline Analytics Demo Zeppelin Spark Interpreter (Driver) Zeppelin Server Row cache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM Row cache Columnar compressed Spark Executor JVM
  49. 49. Free Cloud trial service – Project iSight ● Free AWS/Azure credits for anyone to try out SnappyData ● One click launch of private SnappyData cluster with Zeppelin ● Multiple notebooks with comprehensive description of concepts and value ● Bring your own data sets to try ‘Immediate visualization’ using Synopses data --- URL for AWS Service Send email to to be notified. Release expected next week
  50. 50. How SnappyData Extends Spark
  51. 51. Snappy Spark Cluster Deployment topologies • Snappy store and Spark Executor share the JVM memory • Reference based access – zero copy • SnappyStore is isolated but use the same COLUMN FORMAT AS SPARK for high throughput Unified Cluster Split Cluster
  52. 52. Simple API – Spark Compatible ● Access Table as DataFrame Catalog is automatically recovered ● Store RDD[T]/DataFrame can be stored in SnappyData tables ● Access from Remote SQL clients ● Addtional API for updates, inserts, deletes //Save a dataFrame using the Snappy or spark context … context.createExternalTable(”T1", "ROW", myDataFrame.schema, props ); //save using DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(pro ps).saveAsTable(”T1"); val impressionLogs: DataFrame = context.table(colTable) val campaignRef: DataFrame = context.table(rowTable) val parquetData: DataFrame = context.table(parquetTable) <… Now use any of DataFrame APIs … >
  53. 53. Extends Spark CREATE [Temporary] TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING ‘JDBC | ROW | COLUMN ’ OPTIONS ( COLOCATE_WITH 'table_name', // Default none PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default REDUNDANCY '1' , // Manage HA PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // Empty string will map to default disk store. OFFHEAP "true | false" EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. [AS select_statement];
  54. 54. Simple to Ingest Streams using SQL Consume from stream Transform raw data Continuous Analytics Ingest into in-memory Store Overflow table to HDFS Create stream table AdImpressionLog (<Columns>) using directkafka_stream options ( <socket endpoints> "topics 'adnetwork-topic’ “, "rowConverter ’ AdImpressionLogAvroDecoder’ ) streamingContext.registerCQ( "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques from AdImpressionLog window (duration '2' seconds, slide '2' seconds) where geo != 'unknown' group by publisher, geo”)// Register CQ .foreachDataFrame(df => { df.write.format("column").mode(SaveMode.Appen d) .saveAsTable("adImpressions")
  55. 55. AdImpression Demo Spark, SQL Code Walkthrough, interactive SQL
  56. 56. AdImpression Ingest Performance Loading from parquet files using 4 cores Stream ingest (Kafka + Spark streaming to store using 1 core)
  57. 57. Unified Cluster Architecture
  58. 58. How do we extend Spark for Real Time? • Spark Executors are long running. Driver failure doesn’t shutdown Executors • Driver HA – Drivers run “Managed” with standby secondary • Data HA – Consensus based clustering integrated for eager replication
  59. 59. How do we extend Spark for Real Time? • By pass scheduler for low latency SQL • Deep integration with Spark Catalyst(SQL) – collocation optimizations, indexing use, etc • Full SQL support – Persistent Catalog, Transaction, DML
  60. 60. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  61. 61. SnappyData is Open Source ● Ad Analytics example/benchmark - ● ● Learn more ● Connect: ○ twitter: ○ facebook: ○ slack: