SnappyData Overview Slidedeck for Big Data Bellevue


Published on

Slides presented at Big Data Bellevue on Jan 20th 2015.

Published in: Software


  1. 1. SnappyData Getting Spark ready for real-time, operational analytics Suds Menon Co-Founder SnappyData Jan 2016
  2. 2. Last Week Tonight in Big Data
  3. 3. IoT is what makes the big data challenge very real A 10 Trillion Device World1 1:
  4. 4. Because Insights are like people. Useful for a short period of time The New Arms Race ●  Sift through data to get insights to improve your business ●  What is your time to insights? ●  What is your time to operationalizing insights?
  5. 5. Can we use the past to accurately predict the future? The Holy Grail of Analytics
  6. 6. The faster you go, the bigger your business advantage Speeding Up Insights
  7. 7. Exploding data volumes fuel the search for distributed solutions How We Got Here Teradata Cognos GreenPlum Netezza, ParAccel Hadoop (SQL on Hadoop) Spark (Spark SQL)
  8. 8. Every enterprise today deals with these 4 kinds of data interactions The Four Horsemen Of Data OLTP OLAP Streaming Machine Learning
  9. 9. Who Are We? ●  An EMC-Pivotal spinout focused on real time operational analytics ●  New Spark-based open source project started by Pivotal GemFire founders+engineers ●  Decades of in-memory data management experience ●  Focus on real-time, operational analytics: Spark inside an OLTP+OLAP database
  10. 10. SnappyData At Cruising Altitude Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time operational analytics – TBs in memory RDB Rows Txn Columnar API Stream processing ODBC, JDBC, REST Spark - Scala, Java, Python, R HDFS AQP First commercial project on Approximate Query Processing (AQP) MPP DB Index
  11. 11. SnappyData: A new approach Single unified HA cluster: OLTP + OLAP + Stream for real-time analytics Batch design, high throughput Real-time design center - Low latency, HA, concurrent Vision: Drastically reduce the cost and complexity in modern big data
  12. 12. Huge community adoption, slip streaming into Hadoop momentum, great data integration platform Why Spark? •  Most events in life can be analyzed as micro batches •  Blends streaming, interactive, and batch analytics •  Appeals to Java, R, Python, Scala programmers •  Rich set of transformations and libraries •  RDD and fault tolerance without replication •  Offers Spark SQL as a key capability
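
The micro-batch idea in the first bullet of slide 12 can be sketched without Spark: events are grouped into fixed time windows, and each group is then processed like a small batch job. This is a minimal sketch under stated assumptions; `micro_batches` and the 1-second interval are hypothetical names and values chosen for illustration, not Spark Streaming's API.

```python
from collections import defaultdict

def micro_batches(events, interval_s=1.0):
    """Group (timestamp, value) events into consecutive fixed-width batches.

    Hypothetical helper for illustration only; a streaming engine would do
    this grouping inside its own scheduler.
    """
    batches = defaultdict(list)
    for ts, value in events:
        # Events in [k * interval, (k + 1) * interval) share batch index k.
        batches[int(ts // interval_s)].append(value)
    # Return (batch_index, values) pairs in time order.
    return sorted(batches.items())

# Made-up events: (seconds since start, payload).
events = [(0.1, 5), (0.7, 3), (1.2, 10), (2.5, 1), (2.9, 4)]

# Each micro batch is then handled with ordinary batch logic, here a sum.
per_batch_sums = [(k, sum(vals)) for k, vals in micro_batches(events)]
print(per_batch_sums)
```

Shrinking the interval lowers latency at the cost of more, smaller batches, which is the trade-off behind treating streams as micro batches.
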
  13. 13. Spark is a compute framework that processes data, not an analytics database Clearing Up Some Spark Myths ●  It is NOT a distributed in-memory database ○  It’s a computational framework with immutable caching ●  It is NOT Highly Available ○  Fault tolerance is not the same as HA ●  NOT well suited for real time, operational environments ○  Does not handle concurrency well ○  Does not share data very well either
  14. 14. SnappyData & Lambda SnappyData Focus
  15. 15. Perspective on Lambda for real time In-Memory DB Interactive queries, updates Deep Scale, High volume MPP DB Transform Data-in-motion Analytics Application Streams Alerts
  17. 17. Market Surveillance FLAG DETECT ANALYZE INGEST Identify patterns based on query results Partitioned, HA stream ingestion Prevent settlement, investigate further SQL queries & Stream Analytics on microbatches
  18. 18. Contextual Marketing RESPOND DECIDE ANALYZE INGEST Pick Ad based on variety of reference data parameters Transactional request for Ad placement Deliver in real time Join with history, join with user profile, join with location
  19. 19. Location Based Telco Services Geo Fencing Mobile Marketing Network Analytics ●  INGEST, CORRELATE, JOIN WITH HISTORICAL DATA, RESPOND
  20. 20. Spark Architecture Driver Cluster Manager (YARN, Mesos, Standalone) Worker Worker Worker Executor
  21. 21. REST API for Job Submission Worker Worker Worker Data Server Executor Cluster Manager (YARN, Mesos, Standalone) Data Server Executor Snappy Infused Spark Architecture JDBC Clients ODBC Clients Job Server Lead Node Lead Node
  22. 22. Core Components Of SnappyData
  23. 23. Colocated row/column Tables in Spark Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing Row Table Column Table Spark Executor TASK Spark Block Manager Stream processing ●  Spark Executors are long lived and shared across multiple apps ●  Gem Memory Mgr and Spark Block Mgr integrated
  24. 24. Table can be partitioned or replicated Replicated Table Partitioned Table (Buckets A-H) Replicated Table Partitioned Table (Buckets I-P) consistent replica on each node Partition Replica (Buckets A-H) Replicated Table Partitioned Table (Buckets Q-W) Partition Replica (Buckets I-P) Data partitioned with one or more replicas
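
The bucket layout on slide 24 can be illustrated with a small placement routine: keys hash to buckets, each bucket has a primary node, and redundant copies live on other nodes. This is a sketch only. The CRC32 hash, the round-robin placement, and the names `bucket_for` and `placement` are assumptions made for the example, not SnappyData's actual algorithm.

```python
import zlib

def bucket_for(key, num_buckets):
    """Map a key to a bucket with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

def placement(num_buckets, nodes, redundancy=1):
    """Assign each bucket a primary node plus `redundancy` replica nodes.

    Round-robin placement is an assumption. Replicas go to the next nodes
    in the ring, so a copy never shares a node with its primary.
    """
    if redundancy >= len(nodes):
        raise ValueError("need more nodes than redundant copies")
    layout = {}
    for b in range(num_buckets):
        owners = [nodes[(b + i) % len(nodes)] for i in range(redundancy + 1)]
        layout[b] = {"primary": owners[0], "replicas": owners[1:]}
    return layout

nodes = ["node1", "node2", "node3"]  # hypothetical cluster members
layout = placement(num_buckets=6, nodes=nodes, redundancy=1)

# Route a key: find its bucket, then the nodes that hold that bucket.
b = bucket_for("customer-42", 6)
print(b, layout[b])
```

With one redundant copy, losing any single node still leaves every bucket available on another node, which is the high-availability property the slide attributes to partition replicas.
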
  25. 25. Linearly scale with shared partitions Spark Executor Spark Executor Kafka queue Subscriber N-Z Subscriber A-M Subscriber A-M Ref data Linearly scale with partition pruning Input queue, Stream, IMDB, Output queue all share the same partitioning strategy
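
A rough sketch of why a shared partitioning strategy helps, as slide 25 claims: when stream events and reference data are partitioned on the same key, each partition can be joined locally, and a lookup for one subscriber touches a single partition. The partition count, the CRC32 hash, the sample rows, and all function names below are assumptions for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # assumed value for this sketch

def partition_of(key):
    """Stable key-to-partition mapping shared by every dataset."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def partition_rows(rows, key_fn):
    """Split rows into NUM_PARTITIONS lists using the shared mapping."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[partition_of(key_fn(row))].append(row)
    return parts

# Made-up stream events and reference data, both keyed by subscriber.
events = [("alice", "call"), ("bob", "sms"), ("alice", "data"), ("zoe", "call")]
ref = [("alice", "gold"), ("bob", "silver"), ("zoe", "gold")]

event_parts = partition_rows(events, lambda r: r[0])
ref_parts = partition_rows(ref, lambda r: r[0])

def colocated_join(event_parts, ref_parts):
    """Join partition i of the events only with partition i of the ref data."""
    joined = []
    for ev_part, ref_part in zip(event_parts, ref_parts):
        lookup = dict(ref_part)  # built from local rows only, no shuffle
        joined.extend((sub, kind, lookup[sub]) for sub, kind in ev_part)
    return joined

def lookup_subscriber(sub):
    """Partition pruning: scan only the partition that can hold `sub`."""
    return [r for r in event_parts[partition_of(sub)] if r[0] == sub]

print(sorted(colocated_join(event_parts, ref_parts)))
print(lookup_subscriber("alice"))
```

Because no row crosses a partition boundary during the join, adding partitions (and nodes to host them) spreads the work without adding network traffic.
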
  26. 26. Point access, updates, fast writes ●  Row tables with PKs are distributed HashMaps ○  with secondary indexes ●  Support for transactional semantics ○  read_committed, repeatable_read ●  Support for scalable high write rates ○  streaming data goes through stages ○  queue streams, intermediate storage (Delta row buffer), immutable compressed columns
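
To make the first bullet of slide 26 concrete, here is a single-node sketch of a row table as a hash map keyed by primary key, with one secondary index kept in step on every write. The class `RowTable` and its methods are hypothetical. Distribution, transactions, and the delta row buffer are deliberately left out.

```python
from collections import defaultdict

class RowTable:
    """Hash map of rows keyed by primary key, plus one secondary index."""

    def __init__(self, pk, indexed_column):
        self.pk = pk
        self.indexed_column = indexed_column
        self.rows = {}                 # primary key value -> row dict
        self.index = defaultdict(set)  # indexed value -> set of primary keys

    def _unindex(self, row):
        """Remove a row's entry from the secondary index."""
        value = row[self.indexed_column]
        keys = self.index[value]
        keys.discard(row[self.pk])
        if not keys:
            del self.index[value]

    def put(self, row):
        """Insert or update by primary key, keeping the index consistent."""
        key = row[self.pk]
        old = self.rows.get(key)
        if old is not None:
            self._unindex(old)
        self.rows[key] = dict(row)
        self.index[row[self.indexed_column]].add(key)

    def get(self, key):
        """Point lookup by primary key."""
        return self.rows.get(key)

    def delete(self, key):
        old = self.rows.pop(key, None)
        if old is not None:
            self._unindex(old)

    def find_by_index(self, value):
        """Lookup through the secondary index instead of a full scan."""
        return [self.rows[k] for k in sorted(self.index.get(value, ()))]

table = RowTable(pk="id", indexed_column="city")
table.put({"id": 1, "city": "Bellevue"})
table.put({"id": 2, "city": "Seattle"})
table.put({"id": 3, "city": "Bellevue"})
table.put({"id": 1, "city": "Seattle"})  # update moves row 1 in the index
print([r["id"] for r in table.find_by_index("Seattle")])
```

Point reads and writes cost one hash lookup each, which is why a hash-map layout suits the OLTP-style access the slide describes.
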
  27. 27. Full Spark Compatibility ●  Any table is also visible as a DataFrame ●  Any RDD[T]/DataFrame can be stored in SnappyData tables ●  Tables appear like any JDBC sourced table ○  But, in executor memory by default ●  Additional API for updates, inserts, deletes // Save a DataFrame using the Spark context … context.createExternalTable("T1", "ROW", myDataFrame.schema, props); // Save using the DataFrame API dataDF.write.format("ROW").mode(SaveMode.Append).options(props).saveAsTable("T1");
  28. 28. Extends Spark CREATE [Temporary] TABLE [IF NOT EXISTS] table_name ( <column definition> ) USING 'JDBC | ROW | COLUMN' OPTIONS ( COLOCATE_WITH 'table_name', // Default none PARTITION_BY 'PRIMARY KEY | column name', // will be a replicated table, by default REDUNDANCY '1', // Manage HA PERSISTENT "DISKSTORE_NAME ASYNCHRONOUS | SYNCHRONOUS", // Empty string will map to default disk store. OFFHEAP "true | false" EVICTION_BY "MEMSIZE 200 | COUNT 200 | HEAPPERCENT", ….. [AS select_statement];
  29. 29. Key feature: Synopses Data ●  Maintain stratified samples ○  Intelligent sampling to keep error bounds low ●  Probabilistic data ○  TopK for time series (using time aggregation CMS, item aggregation) ○  Histograms, HyperLogLog, Bloom Filters, Wavelets CREATE SAMPLE TABLE sample-table-name USING columnar OPTIONS ( BASETABLE 'table_name' // source column table or stream table [ SAMPLINGMETHOD "stratified | uniform" ] STRATA name ( QCS ("comma-separated-column-names") [ FRACTION "frac" ] ),+ // one or more QCS
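
The stratified-sample bullet on slide 29 can be illustrated with a small estimator: rows are grouped by the query column set (QCS), each group is sampled separately, and each group's sample mean is scaled back up by the group's true size. This is a sketch under assumptions. The function names, the `min_per_stratum` floor, and the fixed seed are illustrative choices, not SnappyData's sampling implementation.

```python
import random
from collections import defaultdict

def stratified_sample(rows, qcs, fraction, min_per_stratum=1, seed=0):
    """Sample each stratum (distinct value of column `qcs`) separately.

    Returns {stratum: (population_size, sampled_rows)}. The floor of
    `min_per_stratum` rows keeps rare groups from vanishing.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[row[qcs]].append(row)
    sample = {}
    for key, members in strata.items():
        n = max(min_per_stratum, round(len(members) * fraction))
        n = min(n, len(members))
        sample[key] = (len(members), rng.sample(members, n))
    return sample

def estimate_sum(sample, column):
    """Estimate SUM(column) by scaling each stratum's sample mean."""
    total = 0.0
    for population_size, picked in sample.values():
        mean = sum(r[column] for r in picked) / len(picked)
        total += mean * population_size
    return total

# Made-up data: one large region with small amounts, one rare region with large ones.
rows = [{"region": "us", "amount": 10} for _ in range(1000)]
rows += [{"region": "nz", "amount": 500} for _ in range(10)]

sample = stratified_sample(rows, qcs="region", fraction=0.05)
print(estimate_sum(sample, "amount"))
```

A uniform 5% sample could easily miss the ten "nz" rows and underestimate the total, while the stratified sample always keeps at least one row from each group.
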
  30. 30. AQP Architecture
  31. 31. Spot The Differences
  32. 32. SnappyData is Open Source ●  Beta will be on github in January. We are looking for contributors! ●  Learn more & register for beta: ●  Connect: ○  twitter: ○  facebook: ○  linkedin: ○  slack: ○  IRC: #snappydata
  33. 33. Q&A
  34. 34. THANK YOU