
Intro to SnappyData Webinar



The slides for the first ever SnappyData webinar. Covers SnappyData core concepts, programming models, benchmarks and more.

SnappyData is open sourced here:

We also have a deep technical paper here:

We can be easily contacted on Slack, Gitter and more:

Published in: Technology


  1. SnappyData: Unified stream and interactive analytics in a single in-memory cluster with Spark. Jags Ramnarayan, CTO and co-founder, SnappyData. Feb 2016
  2. SnappyData: an EMC/Pivotal spin-out ● New Spark-based open source project started by Pivotal GemFire founders and engineers ● Decades of in-memory data management experience ● Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
  3. Lambda Architecture (LA) for Analytics
  4. Is Lambda complex? Impala, HBase, Cassandra, Redis, ... Application
  5. Is Lambda complex? Impala, HBase, Cassandra, Redis, ... Application • Does the application have to federate queries? • Complex application: multiple data models and disparate APIs to deal with? • Slow? • Expensive to maintain?
  6. Can we simplify and optimize? Can the batch and speed layers use a single, unified serving DB?
  7. Deeper Look: Ad Impression Analytics
  8. Ad Impression Analytics. Ref -
  9. Bottlenecks in the write path. Stream micro-batches in parallel from Kafka to each Spark executor; emit key-value pairs so we can group by publisher and geo. The GROUP BY triggers an expensive Spark shuffle, then the data is shuffled again in the DB cluster, with data format changes and serialization costs along the way.
  10. Bottlenecks in the write path. Shuffle costs: • Aggregations (GroupBy, MapReduce) • Joins with other streams and reference data • Replication (HA) in the fast data store. Copying and serialization costs: • Data models across Spark and the fast data store are different • Data flows through too many processes • In-JVM copying • Excessive copying in Java-based scale-out stores
  11. Goal: localize processing • Can we localize processing with state and avoid the shuffle? • Can Kafka partitions, Spark partitions, and the data store share the same partitioning policy? (partition by Advertiser, Geo) • Embedded state: Spark's native mapWithState, Apache Samza, Flink? Good, scalable KV stores, but is this enough? (Note: show Cassandra, MemSQL ingestion performance, maybe in a later section)
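To make the shared-partitioning idea concrete, here is a minimal, language-agnostic sketch in plain Python (not SnappyData's actual API; the function names and the CRC-based scheme are illustrative assumptions). When the Kafka producer, the stream processor, and the store all route a record through the same partitioning function on (advertiser, geo), every record lands on the node that already owns its bucket, so the later GROUP BY needs no shuffle.

```python
import zlib

NUM_PARTITIONS = 8

def partition_for(advertiser: str, geo: str) -> int:
    """One deterministic policy shared by producer, stream processor, and store."""
    key = f"{advertiser}|{geo}".encode()
    return zlib.crc32(key) % NUM_PARTITIONS

# Producer side: route the impression to a Kafka partition.
impression = {"advertiser": "acme", "geo": "US", "bid": 0.42}
kafka_partition = partition_for(impression["advertiser"], impression["geo"])

# Store side: the same key maps to the same bucket, so the executor that
# consumed this Kafka partition can aggregate and write locally.
store_bucket = partition_for(impression["advertiser"], impression["geo"])

assert kafka_partition == store_bucket  # no cross-node data movement needed
```

The essential design choice is that the partitioning function is data, agreed on by all three tiers, rather than an internal detail of each system.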
  12. Impedance mismatch for interactive queries. On ingested ad impressions we want to run queries like: find the total uniques for a certain ad grouped by geography and month; impression trends for advertisers; or detect outliers in the bidding price. Unfortunately, most KV-oriented stores handle this very inefficiently: they are not suited for scans, aggregations, or distributed joins on large volumes, and memory utilization in a KV store is poor.
  13. Why columnar storage?
  14. In-memory columns, but still slow? SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X) (Berkeley AMPLab Big Data Benchmark; AWS m2.4xlarge; 342 GB total)
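Why the columnar layout helps a query like the one above can be sketched in plain Python (an illustration, not any engine's storage format; the sample rows are made up, and the field names follow the AMPLab benchmark schema). The aggregation touches only two columns, so a column-wise layout scans two contiguous vectors instead of dragging every field of every row through memory.

```python
from collections import defaultdict

# The same table, conceptually stored row-wise ...
rows = [
    {"sourceIP": "10.0.0.1", "destURL": "a.com", "adRevenue": 0.50},
    {"sourceIP": "10.0.0.2", "destURL": "b.com", "adRevenue": 1.25},
    {"sourceIP": "10.0.1.9", "destURL": "c.com", "adRevenue": 0.75},
]

# ... and column-wise: one contiguous (scan-friendly, compressible) vector
# per column; the query never needs to materialize destURL at all.
columns = {
    "sourceIP": [r["sourceIP"] for r in rows],
    "adRevenue": [r["adRevenue"] for r in rows],
}

# SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) ... GROUP BY SUBSTR(sourceIP, 1, 7)
totals = defaultdict(float)
for prefix, rev in zip(columns["sourceIP"], columns["adRevenue"]):
    totals[prefix[:7]] += rev  # touches only the two columns the query needs

print(dict(totals))  # {'10.0.0.': 1.75, '10.0.1.': 0.75}
```

In a real columnar engine the same property also enables vectorized execution and per-column compression, which is why the slide still asks why such scans can remain slow at 342 GB.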
  15. Can we use statistical methods to shrink data? • It is not always possible to store the data in full: many applications (telecoms, ISPs, search engines) can't keep everything • It is inconvenient to work with data in full: just because we can doesn't mean we should • It is faster to work with a compact summary: better to explore data on a laptop than a cluster (Ref: Graham Cormode, "Sampling for Big Data"). Can we use statistical techniques to understand the data, synthesize something very small, and still answer analytical queries?
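The core idea, estimating an aggregate from a small uniform sample instead of the full dataset, can be sketched in a few lines of plain Python (illustrative numbers only; the "ad revenue" data is synthetic and this is not SnappyData's sampler). Scaling the sample sum by the inverse sampling fraction approximates the true sum while scanning 1% of the data.

```python
import random

random.seed(7)
population = [random.uniform(0.0, 2.0) for _ in range(100_000)]  # e.g. ad revenues
true_sum = sum(population)

sample_fraction = 0.01
sample = random.sample(population, int(len(population) * sample_fraction))
estimate = sum(sample) / sample_fraction  # scale up by 1/fraction

relative_error = abs(estimate - true_sum) / true_sum
print(f"true={true_sum:.0f} estimate={estimate:.0f} error={relative_error:.2%}")
```

The error here shrinks roughly with the square root of the sample size, which is the statistical basis for the speed/accuracy tradeoff in the approximate queries shown later in the deck.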
  16. SnappyData: a new approach. A single unified HA cluster: OLTP + OLAP + stream for real-time analytics. Vision: drastically reduce the cost and complexity of modern big data. (Diagram: the batch design center, high throughput, rapidly maturing, converging with the real-time design center: low latency, HA, high concurrency, matured over 13 years.)
  17. SnappyData: a new approach. A single unified HA cluster: OLTP + OLAP + stream for real-time operational analytics, TBs in memory. The first commercial project on Approximate Query Processing (AQP). (Diagram: row and columnar tables, indexes, transactions, stream processing, and AQP over an MPP DB; access via ODBC, JDBC, REST, and Spark in Scala, Java, Python, R; persistence to HDFS.)
  18. Tenets, guiding principles ● Memory is the new bottleneck for speed ● Memory densities will follow Moore's law ● 100% Spark compatible: powerful, concise; simplify the runtime ● Aim for Google-search-like speed for analytic queries ● Dramatically reduce the costs associated with analytics; distributed systems are expensive, especially in production
  19. Use case patterns • Stream ingestion database for Spark: process streams, transform, real-time scoring, store, query • In-memory database for apps: highly concurrent apps, SQL cache, OLTP + OLAP • Analytic caching pattern: caching for analytics over any "big data" store (especially MPP); federate queries between samples and the backend
  20. Why Spark: the confluence of streaming, interactive, and batch. Unifies batch, streaming, and interactive computation; easy to build sophisticated applications; supports iterative, graph-parallel algorithms; powerful APIs in Scala, Python, Java. (Diagram: Spark core with Spark Streaming for streaming; SQL for batch and interactive; BlinkDB for interactive; GraphX for data-parallel, iterative workloads; MLlib for sophisticated algorithms. Source: a Spark Summit presentation by Ion Stoica.)
  21. Spark Cluster
  22. Snappy Spark cluster deployment topologies. Unified cluster: • The Snappy store and the Spark executor share JVM memory • Reference-based access, zero copy. Split cluster: • The Snappy store is isolated but uses the same column format as Spark for high throughput
  23. Simple API, Spark compatible ● Access a table as a DataFrame; the catalog is automatically recovered ● Store: an RDD[T]/DataFrame can be stored in SnappyData tables ● Access from remote SQL clients ● Additional API for updates, inserts, deletes

    // Save a DataFrame using the Snappy or Spark context
    context.createExternalTable("T1", "ROW", myDataFrame.schema, props)
    // Save using the DataFrame API
    dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1")

    val impressionLogs: DataFrame = context.table(colTable)
    val campaignRef: DataFrame = context.table(rowTable)
    val parquetData: DataFrame = context.table(parquetTable)
    <... now use any of the DataFrame APIs ...>
  24. Extends Spark

    CREATE [TEMPORARY] TABLE [IF NOT EXISTS] table_name (
      <column definition>
    ) USING 'JDBC | ROW | COLUMN'
    OPTIONS (
      COLOCATE_WITH 'table_name',               -- default: none
      PARTITION_BY 'PRIMARY KEY | column name', -- replicated table by default
      REDUNDANCY '1',                           -- manage HA
      PERSISTENT 'DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS',
                                                -- empty string maps to the default disk store
      OFFHEAP 'true | false',
      EVICTION_BY 'MEMSIZE 200 | COUNT 200 | HEAPPERCENT',
      ...
    ) [AS select_statement];
  25. Simple to ingest streams using SQL: consume from the stream, transform raw data, run continuous analytics, ingest into the in-memory store, overflow tables to HDFS

    CREATE STREAM TABLE AdImpressionLog (<columns>) USING directkafka_stream OPTIONS (
      <socket endpoints>
      "topics 'adnetwork-topic'",
      "rowConverter 'AdImpressionLogAvroDecoder'"
    )

    // Register a continuous query and write each micro-batch to the column table
    streamingContext.registerCQ(
      "select publisher, geo, avg(bid) as avg_bid, count(*) imps, " +
      "count(distinct(cookie)) uniques from AdImpressionLog " +
      "window (duration '2' seconds, slide '2' seconds) " +
      "where geo != 'unknown' group by publisher, geo")
    .foreachDataFrame(df => {
      df.write.format("column").mode(SaveMode.Append).saveAsTable("adImpressions")
    })
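What the continuous query above computes per window can be sketched in plain Python over one micro-batch (the records are made up and this is not the SnappyData runtime, just the semantics): for each (publisher, geo), the average bid, the impression count, and the number of unique cookies, with 'unknown' geos filtered out.

```python
from collections import defaultdict

# One 2-second micro-batch of ad impressions (synthetic records).
batch = [
    {"publisher": "pub1", "geo": "US", "bid": 0.40, "cookie": "c1"},
    {"publisher": "pub1", "geo": "US", "bid": 0.60, "cookie": "c2"},
    {"publisher": "pub1", "geo": "US", "bid": 0.50, "cookie": "c1"},
    {"publisher": "pub2", "geo": "unknown", "bid": 0.90, "cookie": "c3"},
]

agg = defaultdict(lambda: {"bids": [], "cookies": set()})
for rec in batch:
    if rec["geo"] == "unknown":           # where geo != 'unknown'
        continue
    key = (rec["publisher"], rec["geo"])  # group by publisher, geo
    agg[key]["bids"].append(rec["bid"])
    agg[key]["cookies"].add(rec["cookie"])

for (publisher, geo), a in agg.items():
    avg_bid = sum(a["bids"]) / len(a["bids"])       # avg(bid)
    imps = len(a["bids"])                           # count(*)
    uniques = len(a["cookies"])                     # count(distinct(cookie))
    print(publisher, geo, round(avg_bid, 2), imps, uniques)
```

In the real pipeline this aggregation runs continuously, window by window, and each window's result frame is appended to the 'adImpressions' column table.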
  26. AdImpression demo: Spark, SQL code walkthrough, interactive SQL
  27. AdImpression demo: performance comparison to Cassandra, MemSQL (TBD)
  28. AdImpression ingest performance: loading from Parquet files using 4 cores; stream ingest (Kafka + Spark Streaming to the store using 1 core)
  29. Why are ingestion and querying fast? Linear scaling with partition pruning; the input queue, the stream, and the in-memory DB all share the same partitioning strategy
  30. How does it scale with concurrency? ● The parallel query engine skips Spark SQL scheduling for low-latency queries ● Column tables: fast scans/aggregations, also automatically compressed ● Row tables: fast key-based or selective queries; can have any number of secondary indexes
  31. How does it scale with concurrency? ● Distributed shuffles for joins, ordering, etc. are expensive ● Two techniques minimize this: 1) Colocation of related tables, so all related records live together. For instance, with a 'Users' table partitioned across nodes, all related 'Ad impressions' can be colocated on the same partition; the parallel query then executes the join locally on each partition. 2) Replication of dimension tables, so joins to dimension tables are always localized
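Technique 1 above, colocation turning a distributed join into purely local joins, can be sketched in plain Python (a conceptual model with made-up tables and a CRC-based bucketing function, not SnappyData's implementation). Because both tables route rows through the same partitioning function on user_id, every matching pair lives in the same bucket, and the union of the per-bucket joins equals the global join with no rows crossing the network.

```python
import zlib

NUM_BUCKETS = 4
bucket = lambda user_id: zlib.crc32(user_id.encode()) % NUM_BUCKETS

users = [("u1", "acme"), ("u2", "globex"), ("u3", "initech")]     # (user_id, advertiser)
impressions = [("u1", "US"), ("u1", "EU"), ("u3", "US")]          # (user_id, geo)

# Colocate: route both tables' rows through the same partitioning function.
user_buckets = {b: [] for b in range(NUM_BUCKETS)}
imp_buckets = {b: [] for b in range(NUM_BUCKETS)}
for uid, advertiser in users:
    user_buckets[bucket(uid)].append((uid, advertiser))
for uid, geo in impressions:
    imp_buckets[bucket(uid)].append((uid, geo))

# Each "node" joins only its own bucket, entirely locally.
joined = []
for b in range(NUM_BUCKETS):
    local_users = dict(user_buckets[b])
    for uid, geo in imp_buckets[b]:
        if uid in local_users:
            joined.append((uid, local_users[uid], geo))

print(sorted(joined))
# [('u1', 'acme', 'EU'), ('u1', 'acme', 'US'), ('u3', 'initech', 'US')]
```

Technique 2 is the degenerate case of the same idea: a small dimension table replicated to every bucket is trivially "colocated" with everything.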
  32. Low-latency interactive analytic queries: exact or approximate. Exact: select avg(volume), symbol from T1 where <time range> group by symbol. Approximate: select avg(volume), symbol from T1 where <time range> group by symbol with error 0.1 confidence 0.8. (Chart: the speed/accuracy tradeoff, error vs. execution time as a function of sample size; executing on the entire dataset takes on the order of 30 minutes, while interactive queries on samples return in about 2 seconds.)
  33. Key feature: synopses data ● Maintain stratified samples ○ Intelligent sampling to keep error bounds low ● Probabilistic data structures ○ TopK for time series (using time-aggregation CMS, item aggregation) ○ Histograms, HyperLogLog, Bloom filters, wavelets

    CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS (
      BASETABLE 'table_name'  -- source column table or stream table
      [ SAMPLINGMETHOD 'stratified | uniform' ]
      STRATA name (
        QCS ('comma-separated-column-names')
        [ FRACTION 'frac' ]
      ),+  -- one or more QCS
    )
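Why stratified sampling on a query column set (QCS) beats uniform sampling can be sketched in plain Python (a simplified illustration with made-up rates and data; SnappyData's sampler is more sophisticated). Sampling each (publisher, geo) stratum separately guarantees rare groups keep enough rows for a usable per-group estimate, where a 1% uniform sample would leave a 100-row stratum with roughly one row.

```python
import random
from collections import defaultdict

random.seed(42)
# A skewed dataset: one huge stratum and one rare stratum.
data = [("pub1", "US", random.uniform(0, 1)) for _ in range(9_900)] + \
       [("pub2", "JP", random.uniform(0, 1)) for _ in range(100)]

def stratified_sample(rows, qcs, min_per_stratum=50, fraction=0.01):
    """Sample each QCS group separately, with a floor per stratum."""
    strata = defaultdict(list)
    for row in rows:
        strata[qcs(row)].append(row)
    sample = []
    for rows_in_stratum in strata.values():
        # Take at least min_per_stratum rows, else the overall fraction.
        k = max(min_per_stratum, int(len(rows_in_stratum) * fraction))
        k = min(k, len(rows_in_stratum))
        sample.extend(random.sample(rows_in_stratum, k))
    return sample

sample = stratified_sample(data, qcs=lambda r: (r[0], r[1]))
per_stratum = defaultdict(int)
for pub, geo, _ in sample:
    per_stratum[(pub, geo)] += 1
print(dict(per_stratum))  # the rare (pub2, JP) stratum still has 50 rows
```

This is what keeps per-group error bounds low: the error of each group's estimate depends on that group's sample size, not on the group's share of the overall data.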
  34. Unified Cluster Architecture
  35. How do we extend Spark for real time? • Spark executors are long-running; a driver failure doesn't shut down the executors • Driver HA: drivers run "managed" with a standby secondary • Data HA: consensus-based clustering integrated for eager replication
  36. How do we extend Spark for real time? • Bypass the scheduler for low-latency SQL • Deep integration with Spark Catalyst (SQL): colocation optimizations, index use, etc. • Full SQL support: persistent catalog, transactions, DML
  37. Performance: Spark vs. Snappy (TPC-H). See the ACM SIGMOD 2016 paper for details; available on our blogs
  38. Performance: Snappy vs. in-memory DB (YCSB)
  39. Unified OLAP/OLTP/streaming with Spark ● Far fewer resources: a TB problem becomes a GB problem ○ CPU contention drops ● Far less complex ○ a single cluster for stream ingestion, continuous queries, interactive queries, and machine learning ● Much faster ○ compressed data managed in distributed memory in columnar form reduces volume and is much more responsive
  40. SnappyData is open source ● Available for download on GitHub today ● Learn more ● Connect: ○ twitter: ○ facebook: ○ linkedin: ○ slack:
  41. EXTRAS
  42. Colocated row/column tables in Spark. (Diagram: three Spark executors, each hosting row and column tables, tasks, the Spark block manager, and stream processing.) ● Spark executors are long-lived and shared across multiple apps ● The Snappy memory manager and the Spark block manager are integrated
  43. Tables can be partitioned or replicated. (Diagram: a consistent replica of each replicated table on every node, alongside a partitioned table whose buckets A-W are spread across nodes with partition replicas for HA.) Data is partitioned with one or more replicas. Use partitioned tables for large fact tables and replicated tables for small dimension tables
  44. SnappyData components. (Diagram: Spark core with micro-batch streaming and Spark SQL/Catalyst; SnappyData adds transactions, an OLAP job scheduler, OLAP and OLTP query paths, AQP, a P2P cluster/replication service, row and column tables, indexes, sample tables, and shared-nothing logs; access via Spark programs, JDBC, and ODBC.)