
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th


Presented at the BDAM meetup in Palo Alto on Sept 14th. Jags Ramnarayan, CTO, SnappyData, discusses an ad analytics use case running on SnappyData.



  1. 1. Speed-of-thought Analytics for Ad Impressions using Spark 2.0 and SnappyData. Jags Ramnarayan, CTO, Co-founder @ SnappyData
  2. 2. Our Pedigree: SnappyData SpinOut ● New Spark-based open source project started by Pivotal GemFire founders + engineers ● Decades of in-memory data management experience ● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database. Funded by Pivotal, GE, GTD Capital
  3. 3. Mixed workloads are common – Why is Lambda the answer?
  4. 4. Lambda Architecture is Complex • Complexity: learn and master multiple products, data models, disparate APIs, configs • Slower • Wasted resources
  5. 5. Can we simplify & optimize? Perhaps a single clustered DB that can manage stream state, transactional data and run OLAP queries?
  6. 6. Ad Analytics using Lambda
  7. 7. Ad Impression Analytics Ad Network Architecture – Analyze log impressions in real time
  8. 8. Ad Impression Analytics Ref -
  9. 9. Bottlenecks in the write path. Stream micro-batches in parallel from Kafka to each Spark executor. Filter and collect events for 1 minute; reduce to 1 event per Publisher, Geo every minute. Execute GROUP BY … expensive Spark shuffle … shuffle again in the DB cluster … data format changes … serialization costs.
val input: DataFrame = sparkSession.read
  .options(kafkaOptions).format("kafkaSource")
  .stream()
val result: DataFrame = input
  .where("geo != 'UnknownGeo'")
  .groupBy(window("event-time", "1min"), "Publisher", "geo")
  .agg(avg("bid"), count("geo"), countDistinct("cookie"))
val query = result.write.format("org.apache.spark.sql.cassandra")
  .outputMode("append")
  .startStream("dest-path")
  10. 10. Bottlenecks in the Write Path • Aggregations: GroupBy, MapReduce • Joins with other streams, reference data • Shuffle costs (copying, serialization) • Excessive copying in Java-based scale-out stores
  11. 11. Impedance mismatch with KV stores. We want to run interactive "scan"-intensive queries: - Find total uniques for Ads grouped on geography - Impression trends for Advertisers (group-by query) Two alternatives: row stores vs. column stores
  12. 12. Goal – Localize processing. Row store: fast key-based lookup, but too slow to run aggregations and scan-based interactive queries. Column store: fast scans and aggregations, but updates and random writes are very difficult, and it consumes too much memory.
  13. 13. Why columnar storage in-memory? Source: MonetDB
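The row-vs-column contrast on these slides can be sketched in plain Scala. This is an illustrative sketch, not SnappyData's actual storage layout: the `AdImpression` schema and data are made up. The point is that an AVG(bid) scan over a columnar layout reads one contiguous array, which is cache-friendly, instead of chasing a pointer into every record object.

```scala
// Row layout: one object per record; scanning one field touches every object.
case class AdImpression(advertiser: String, geo: String, bid: Double)

val rows: Seq[AdImpression] = Seq(
  AdImpression("adv10", "NY", 0.0001),
  AdImpression("adv10", "VT", 0.0005),
  AdImpression("adv20", "NY", 0.0002)
)

// Column layout: each field stored contiguously; an AVG(bid) scan reads
// only the bid array sequentially.
val bids: Array[Double] = rows.map(_.bid).toArray

def avgBidRowStore(rs: Seq[AdImpression]): Double =
  rs.map(_.bid).sum / rs.size

def avgBidColumnStore(col: Array[Double]): Double =
  col.sum / col.length
```

Both produce the same answer; the difference is memory traffic per scanned field, which is what the MonetDB-style results on this slide measure.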
  14. 14. How did Spark 2.0 do? Spark 2.0, MacBook Pro (4-core 2.8 GHz Intel i7, enough RAM); data set: 105 million records. select AVG(bid) from AdImpressions: ~3 seconds with Parquet files in the OS buffer, ~600 milliseconds when managed in Spark memory.
  15. 15. Spark 2.0 query plan. What is different? Scan over 105 million integers; shuffle results from each partition so we can compute AVG across all partitions (the shuffle is cheap in this case: only 11 partitions). select AVG(ArrDelay) from airline
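The plan described here can be sketched without Spark: each partition computes a partial (sum, count), a small shuffle merges the partials, and the average is finalized once. The data below is a stand-in for the 105 million records; only the structure of the computation matches the slide.

```scala
// Stand-in data set; the slide's real scan covers 105M records.
val bids: Seq[Double] = (1 to 105).map(_.toDouble)

// Split into partitions (11 here, mirroring the slide's 11 partitions).
val partitions: Seq[Seq[Double]] = bids.grouped(10).toSeq

// Map side: one partial aggregate (sum, count) per partition.
val partials: Seq[(Double, Long)] =
  partitions.map(p => (p.sum, p.size.toLong))

// Reduce side (the cheap shuffle): combine partials, then finalize.
val (totalSum, totalCount) =
  partials.reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val avg: Double = totalSum / totalCount
```

The shuffle moves only one (sum, count) pair per partition, which is why it is cheap even when the scan itself covers 105M values.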
  16. 16. Whole Stage Code Generation. Typical query engine design: each operator is implemented using functions, and functions imply chasing pointers … expensive. Code generation in Spark: remove virtual function calls; use arrays and variables instead of objects; capitalize on the modern CPU cache. Operators: Aggregate, Filter, Scan, Project. How to remove complexity? Add a layer. How to improve perf? Remove a layer.
Filter() {
  getNextRow() {
    row = scan.getNextRow()  // child
    if (filter condition holds) return row
  }
}
Scan() {
  getNextRow() {
    return next row from fileInputStream
  }
}
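The two designs on this slide can be contrasted in a few lines of Scala. This is an illustrative sketch of the idea, not Spark's generated code: the operator names and the `sumVolcano`/`sumGenerated` functions are made up.

```scala
// Iterator ("Volcano") style: each operator is an object; every row pays
// virtual dispatch and Option boxing on the way up the operator tree.
trait Op { def next(): Option[Double] }

class Scan(data: Iterator[Double]) extends Op {
  def next(): Option[Double] = if (data.hasNext) Some(data.next()) else None
}

class Filter(child: Op, pred: Double => Boolean) extends Op {
  def next(): Option[Double] = {
    var r = child.next()
    while (r.exists(v => !pred(v))) r = child.next()  // skip filtered rows
    r
  }
}

def sumVolcano(data: Seq[Double]): Double = {
  val plan = new Filter(new Scan(data.iterator), _ > 0)
  var total = 0.0
  var r = plan.next()
  while (r.isDefined) { total += r.get; r = plan.next() }
  total
}

// Whole-stage-generated style: scan + filter + aggregate fused into one
// tight loop, no virtual calls, state kept in local variables.
def sumGenerated(data: Array[Double]): Double = {
  var total = 0.0
  var i = 0
  while (i < data.length) {
    val v = data(i)
    if (v > 0) total += v
    i += 1
  }
  total
}
```

Both compute the same result; the fused loop is what "remove a layer" buys: per-row work shrinks to a comparison and an add that the CPU can keep in registers.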
  17. 17. Good enough? Hitting the CPU wall? select count(*), advertiser From history t1, current t2, adimpress t3 Where t1 Join t2 Join t3 group by geo order by count desc limit 8. Distributed joins can be very expensive. [Chart: response time in seconds vs. concurrency (1 and 10 concurrent users)]
  18. 18. Challenges with In-memory Analytics - DRAM is still relatively expensive for the deluge of data - Analytics in the cloud requires fluid data movement - How do you move large volumes to/from clouds?
  19. 19. Use statistical techniques to shrink data? • Most apps are happy to trade off 1% accuracy for a 200x speedup! • Can usually get a 99.9% accurate answer by looking at only a tiny fraction of the data! • Often can make perfectly accurate decisions without perfectly accurate answers! • A/B testing, visualization, ... • The data itself is usually noisy • Processing the entire data set doesn't necessarily mean exact answers! • Inference is probabilistic anyway
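The trade-off on this slide can be demonstrated with a uniform sample in a few lines of Scala. The data and the 1% sampling rate are made up for illustration; the point is that the sample mean is an unbiased estimate of the full-data mean at a fraction of the scan cost.

```scala
// Illustrative data: 100k synthetic "bids"; a real workload would be far larger.
val rng = new scala.util.Random(42)
val bids: Vector[Double] = Vector.fill(100000)(rng.nextDouble())

// Uniform sample: keep each row independently with probability 1%.
val sampleRate = 0.01
val sample = bids.filter(_ => rng.nextDouble() < sampleRate)

// Exact answer vs. the estimate computed from ~1% of the data.
val exactAvg  = bids.sum / bids.size
val approxAvg = sample.sum / sample.size
```

With ~1,000 sampled rows the estimate lands within a few percent of the exact average, while the scan touched 100x less data, which is the speed/accuracy trade this slide is arguing for.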
  20. 20. SnappyData A Hybrid Open source system for Transactions, Analytics, Streaming (
  21. 21. SnappyData – In-memory Hybrid DB with Spark A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics Batch design, high throughput Real-time design Low latency, HA, concurrency Vision: Drastically reduce the cost and complexity in modern big data Rapidly Maturing Matured over 13 years
  22. 22. SnappyData fuses speed + serving layer Process, store streams Kafka Snappy Data Server – Spark Executor + Store Batch compute Reference data Lazy write, Fetch on demand RDB HDFS In-memory compute, state Current Operational data External data S3, Rdb, XML…Spark API ++ - Java, Scala, Python, R, REST Synopses data Interactive analytic queries History data
  23. 23. Realizing 'speed-of-thought' Analytics Rows Columnar Stream processing Kafka Queue (partition) Snappy Data Server – Spark Executor + Store Index Process Spark or SQL Program Batch compute Hybrid Store RDB (Reference data) HDFS MPP DB In-memory compute, state overflow Local persist Spark API ++ - Java, Scala, Python, R, REST Synopses Interactive analytic queries
  24. 24. • Fast - Stream, ingested data colocated on shared key - Tables colocated on shared key - Far less copying, serialization - Improvements to vectorization (20X faster than spark) • Use less memory, CPU - Maintain only “Hot/active” data in RAM - Summarize all data using Synopses • Flexible - Spark. Enough said. Fast, Fewer resources, Flexible
  25. 25. Features - Deeply integrated database for Spark - 100% compatible with Spark - Extensions for Transactions (updates), SQL stream processing - Extensions for High Availability - Approximate query processing for interactive OLAP - OLTP+OLAP Store - Replicated and partitioned tables - Tables can be Row or Column oriented (in-memory & on-disk) - SQL extensions for compatibility with SQL Standard - create table, view, indexes, constraints, etc
  26. 26. TPC-H: 10X-20X faster than Spark 2.0
  27. 27. Interactive-Speed Analytic Queries – Exact or Approximate. Select avg(Bid), Advertiser from T1 group by Advertiser. Select avg(Bid), Advertiser from T1 group by Advertiser with error 0.1. [Chart: speed/accuracy tradeoff, error (%) vs. execution time, from ~2 sec (interactive) to ~30 mins (entire dataset); example: ~100 secs exact vs. ~2 secs with 1% error]
  28. 28. Query execution with accuracy guarantee. PARSE QUERY → Can the query be executed on the cache? (recent time window, computable from samples, within error constraints) Yes: in-memory execution with error bar → response. No (point query on history, outlier query, very complex query): execute in parallel on the base table → response.
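The routing decision on this slide can be sketched as a small predicate. This is a toy model, not SnappyData's planner: the `Query` fields and `routeToSynopses` are hypothetical names standing in for the checks listed above.

```scala
// Hypothetical query descriptor capturing the slide's routing criteria.
case class Query(
  errorTolerance: Option[Double],  // e.g. Some(0.1) for "with error 0.1"
  withinRecentWindow: Boolean,     // touches only the cached time window?
  isOutlierQuery: Boolean          // MAX/MIN-style queries samples can't answer
)

// Route to the in-memory synopses only when the query tolerates error,
// fits the cached window, and is not an outlier/point query.
def routeToSynopses(q: Query): Boolean =
  q.errorTolerance.isDefined && q.withinRecentWindow && !q.isOutlierQuery
```

Queries that fail the predicate fall back to the parallel scan of the base table, trading latency for an exact answer.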
  29. 29. Synopses Data Engine Features • Support for uniform sampling • Support for stratified sampling - Solutions exist for stored data (BlinkDB) - SnappyData works for infinite streams of data too • Support for exponentially decaying windows over time • Support for synopses - Top-K queries, heavy hitters, outliers, ... • [future] Support for joins • Workload mining (
  30. 30. Uniform (Random) Sampling
Original Table:
ID | Advertiser | Geo | Bid
 1 | adv10      | NY  | 0.0001
 2 | adv10      | VT  | 0.0005
 3 | adv20      | NY  | 0.0002
 4 | adv10      | NY  | 0.0003
 5 | adv20      | NY  | 0.0001
 6 | adv30      | VT  | 0.0001
Uniform Sample:
ID | Advertiser | Geo | Bid    | Sampling Rate
 3 | adv20      | NY  | 0.0002 | 1/3
 5 | adv20      | NY  | 0.0001 | 1/3
SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
  31. 31. Uniform (Random) Sampling – Larger sample
Original Table:
ID | Advertiser | Geo | Bid
 1 | adv10      | NY  | 0.0001
 2 | adv10      | VT  | 0.0005
 3 | adv20      | NY  | 0.0002
 4 | adv10      | NY  | 0.0003
 5 | adv20      | NY  | 0.0001
 6 | adv30      | VT  | 0.0001
Uniform Sample:
ID | Advertiser | Geo | Bid    | Sampling Rate
 3 | adv20      | NY  | 0.0002 | 2/3
 5 | adv20      | NY  | 0.0001 | 2/3
 1 | adv10      | NY  | 0.0001 | 2/3
 2 | adv10      | VT  | 0.0005 | 2/3
SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
  32. 32. Stratified Sampling
Original Table:
ID | Advertiser | Geo | Bid
 1 | adv10      | NY  | 0.0001
 2 | adv10      | VT  | 0.0005
 3 | adv20      | NY  | 0.0002
 4 | adv10      | NY  | 0.0003
 5 | adv20      | NY  | 0.0001
 6 | adv30      | VT  | 0.0001
Stratified Sample on Geo:
ID | Advertiser | Geo | Bid    | Sampling Rate
 3 | adv20      | NY  | 0.0002 | 1/4
 2 | adv10      | VT  | 0.0005 | 1/2
SELECT avg(bid) FROM AdImpressions WHERE geo = 'VT'
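The stratified scheme on this slide can be sketched in a few lines: each Geo stratum is sampled at its own rate, so a rare group like VT keeps representation that a uniform sample would likely miss. The table and per-stratum rates mirror the slide; the function names are illustrative.

```scala
// The slide's table as (advertiser, geo, bid) tuples.
val table = Seq(
  ("adv10", "NY", 0.0001), ("adv10", "VT", 0.0005), ("adv20", "NY", 0.0002),
  ("adv10", "NY", 0.0003), ("adv20", "NY", 0.0001), ("adv30", "VT", 0.0001)
)

// Per-stratum sampling rates as on the slide: 1/4 for NY, 1/2 for VT.
val rates = Map("NY" -> 0.25, "VT" -> 0.5)

// Sample each stratum independently, tagging rows with their rate so a
// query engine can re-weight estimates later.
val rng = new scala.util.Random(7)
val stratifiedSample: Seq[(String, String, Double, Double)] =
  table.flatMap { case (adv, geo, bid) =>
    val rate = rates(geo)
    if (rng.nextDouble() < rate) Some((adv, geo, bid, rate)) else None
  }

// Estimating AVG(bid) for geo = 'VT' uses only the VT stratum's rows;
// within one stratum every row shares the same rate, so a plain mean works.
val vtBids = stratifiedSample.collect { case (_, "VT", bid, _) => bid }
val vtEstimate: Option[Double] =
  if (vtBids.nonEmpty) Some(vtBids.sum / vtBids.size) else None
```

The per-row rate is what distinguishes this from uniform sampling: SUM/COUNT style aggregates scale each sampled row by 1/rate, while per-stratum averages can use the rows directly.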
  33. 33. Sketching techniques ● Sampling not effective for outlier detection ○ MAX/MIN etc ● Other probabilistic structures like CMS, heavy hitters, etc ● SnappyData implements Hokusai ○ Capturing item frequencies in timeseries ● Design permits TopK queries over arbitrary time intervals (Top100 popular URLs) SELECT pageURL, count(*) frequency FROM Table WHERE …. GROUP BY …. ORDER BY frequency DESC LIMIT 100
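A minimal Count-Min Sketch in the spirit of this slide: each item is hashed into d rows of w counters, and the estimate is the minimum across rows, which can overcount (on hash collisions) but never undercount. This is an illustrative toy, not SnappyData's Hokusai implementation; widths, depths, and the hashing scheme are made up.

```scala
class CountMinSketch(width: Int, depth: Int) {
  // depth rows of width counters.
  private val counts = Array.ofDim[Long](depth, width)
  // One hash seed per row (arbitrary odd constants).
  private val seeds = (1 to depth).map(i => i * 0x9E3779B9L)

  private def bucket(item: String, row: Int): Int = {
    val h = (item.hashCode.toLong ^ seeds(row)) * 0x85EBCA6BL
    (((h % width) + width) % width).toInt  // non-negative modulo
  }

  def add(item: String): Unit =
    for (r <- 0 until depth) counts(r)(bucket(item, r)) += 1

  // Minimum over rows: collisions only inflate counters, so the true
  // frequency is always <= the estimate.
  def estimate(item: String): Long =
    (0 until depth).map(r => counts(r)(bucket(item, r))).min
}

// Track page-URL frequencies, as in the slide's Top-100 URLs query.
val cms = new CountMinSketch(256, 4)
(1 to 100).foreach(_ => cms.add("/index"))
(1 to 5).foreach(_ => cms.add("/about"))
```

A Top-K query then keeps a small heap of candidate items ordered by their estimates; the sketch itself stays fixed-size no matter how many distinct URLs stream past, which is why it suits infinite streams where sampling fails for outliers.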
  34. 34. Synopses Data Engine Demo. Zeppelin Server → Zeppelin Spark Interpreter (Driver) → Spark Executor JVMs (each combining a row cache with a columnar compressed store).
  35. 35. Unified OLAP/OLTP streaming w/ Spark ● Far fewer resources: TB problem becomes GB. ○ CPU contention drops ● Far less complex ○ single cluster for stream ingestion, continuous queries, interactive queries and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  36. 36. SnappyData is Open Source ● Ad Analytics example/benchmark - ● ● Learn more ● Connect: ○ twitter: ○ facebook: ○ slack:
  37. 37. EXTRAS
  38. 38. Use Case Patterns 1. Operational Analytics DB - Caching for analytics over disparate sources - Federate query between samples and backend 2. Stream analytics for Spark - Process streams, transform, real-time scoring, store, query 3. In-memory transactional store - Highly concurrent apps, SQL cache, OLTP + OLAP
  39. 39. How SnappyData Extends Spark
  40. 40. Snappy Spark Cluster Deployment topologies • Snappy store and Spark Executor share the JVM memory • Reference-based access – zero copy • SnappyStore is isolated but uses the same column format as Spark for high throughput. Unified Cluster / Split Cluster
  41. 41. Simple API – Spark Compatible ● Access Table as DataFrame; catalog is automatically recovered ● Store: RDD[T]/DataFrame can be stored in SnappyData tables ● Access from remote SQL clients ● Additional API for updates, inserts, deletes
// Save a DataFrame using the Snappy or Spark context …
context.createExternalTable("T1", "ROW", myDataFrame.schema, props)
// Save using the DataFrame API
dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")
val impressionLogs: DataFrame = context.table(colTable)
val campaignRef: DataFrame = context.table(rowTable)
val parquetData: DataFrame = context.table(parquetTable)
<… Now use any of the DataFrame APIs …>
  42. 42. Extends Spark
CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name (
  <column definition>
) USING 'JDBC | ROW | COLUMN'
OPTIONS (
  COLOCATE_WITH 'table_name',               -- default none
  PARTITION_BY 'PRIMARY KEY | column name', -- replicated table by default
  REDUNDANCY '1',                           -- manage HA
  PERSISTENT 'DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS', -- empty string maps to the default disk store
  OFFHEAP 'true | false',
  EVICTION_BY 'MEMSIZE 200 | COUNT 200 | HEAPPERCENT',
  …
)
[AS select_statement];
  43. 43. Simple to Ingest Streams using SQL. Consume from stream → transform raw data → continuous analytics → ingest into in-memory store → overflow table to HDFS.
Create stream table AdImpressionLog (<Columns>)
using directkafka_stream options (
  <socket endpoints>
  "topics 'adnetwork-topic'",
  "rowConverter 'AdImpressionLogAvroDecoder'"
)
streamingContext.registerCQ(
  "select publisher, geo, avg(bid) as avg_bid, count(*) imps, count(distinct(cookie)) uniques
   from AdImpressionLog window (duration '2' seconds, slide '2' seconds)
   where geo != 'unknown' group by publisher, geo")  // Register CQ
.foreachDataFrame(df => {
  df.write.format("column").mode(SaveMode.Append).saveAsTable("adImpressions")
})
  44. 44. Unified Cluster Architecture
  45. 45. How do we extend Spark for Real Time? • Spark Executors are long running. Driver failure doesn’t shutdown Executors • Driver HA – Drivers run “Managed” with standby secondary • Data HA – Consensus based clustering integrated for eager replication
  46. 46. How do we extend Spark for Real Time? • Bypass the scheduler for low-latency SQL • Deep integration with Spark Catalyst (SQL): colocation optimizations, index use, etc. • Full SQL support: persistent catalog, transactions, DML
  47. 47. AdImpression Demo Spark, SQL Code Walkthrough, interactive SQL
  48. 48. Concurrent Ingest + Query Performance • AWS 4 c4.2xlarge instances: 8 cores, 15 GB memory each • Each node ingests the stream from Kafka in parallel • Parallel batch writes to store (32 partitions) • Only a few cores used for stream writes, as most of the CPU is reserved for OLAP queries. Stream ingestion rate, per-second throughput (on 4 nodes, with a cap on CPU to allow for queries): Spark-Cassandra 322,000; Spark-InMemoryDB 480,000; SnappyData 670,000. 2X – 45X faster (vs. Cassandra, MemSQL)
  49. 49. Concurrent Ingest + Query Performance. Q1, a sample "scan"-oriented OLAP query (Spark SQL), executed while ingesting data: select count(*) AS adCount, geo from adImpressions group by geo order by adCount desc limit 20; Response time (millis) at 30M / 60M / 90M rows: Spark-Cassandra 20346 / 65061 / 93960; Spark-InMemoryDB 3649 / 5801 / 7295; SnappyData 1056 / 1571 / 2144. 2X – 45X faster.