SnappyData Toronto Meetup Nov 2017


Slides from Sudhir Menon's presentation at the Pivotal Toronto Meetup

Published in: Technology

  1. SnappyData: Building Continuous Applications Driven By Real Time Insights • Version 0.1 | © SnappyData Inc 2017 • Sudhir Menon, Founder, COO
  2. Who Are We? • New Spark-based open source project started by Pivotal GemFire founders + engineers • Decades of in-memory data management experience • Focus on real-time, operational analytics: Spark based OLTP+OLAP database • Spinout: SnappyData, funded by Pivotal, GE, GTD Capital
  3. The Big Data Market Is Facing Disruption (Again) • Higher Data Volumes • Growth in Streaming Workloads • Analytics on Live Data • Growth in unstructured data • Machine Learning/AI as first class workloads • Need to reduce complexity and cloud costs
  4. Spark – Key to emerging battleground in Analytics and BI - Multi-model data - Integrate Streaming data • Phase 1 • Phase 2 • Phase 3? • Source: Gartner
  5. Mixed Workloads Are Everywhere • Stream Processing • Transaction (point lookups, small updates) • Interactive Analytics • Analytics on mutating data • Correlating and joining streams with large histories • Maintaining state or counters while ingesting streams
  6. Time Value of Information – Why does it matter? • Elapsed time from event occurrence to event analytics matters • Latency in using information for learning matters • Concurrency matters • Recovery time matters • User-kernel crossings matter • In short, liveness of data matters when it comes to making decisions based on current information
  7. What Is Happening With Applications • Applications that are intelligent, proactive, learn from past interactions, and are context aware in their decision making • Fast and reliable ingestion capabilities • Support high memory density • Utilize memory to reduce response time • Support high concurrency • Work on live data • Support data mutability
  8. Let's Discuss Some Use Cases • Market Surveillance Systems (Trading exchanges, Market makers) • Real Time Scoring Systems (Product recommendations, real time offers) • Telco Analytics (Location based services, Predictive analytics) • Sensor Analytics (Real time alerting for parking management, lighting etc.) • Ad analytics + Ad placement systems • Credit Card Fraud • Detecting and Stopping Malware
  9. Mixed Workloads in Industrial IOT • IOT Devices • Event Stream • Anomaly detection – score against models - Map sensors to tags - Monitor temperature - Send alerts • Correlate current temperature trend with history… • Interact using dynamic queries…
  10. Pain points we come across today • Is Spark ready for real time? Enterprise? • Lacks mutable state management • Not designed for high concurrency and mixed workloads • Inadequate SQL support; limited ODBC/JDBC access • Fault tolerant, not HA • Near impossible to do Live Analytics on NoSQL stores • Pattern today – periodic copy of state into some analytic DB/Hadoop • Stale insight, not continuous or real time • Interactive dynamic aggregations not possible • Data models make support for BI tools like Tableau difficult • Most stream processors not capable of true Analytics • Deep stream analytics augmenting stream processing
  11. Pattern 1: Native data store for Spark … Fast, Simple • Too much copying … too slow for real time, interactive analytics • SnappyData: multi-model, distributed in-memory data store natively designed for Spark - 20X faster than Spark for analytic queries - 1000X faster than Spark+NoSQL - Mutable DataFrames, transactions, indexing - Rich, complete SQL + All Spark APIs - Highly Available data, Spark Driver - Enterprise grade security - Native support for ML/DL • Diagram labels: Data sources; Scalable NoSQL; NoSQL, Hadoop; Immutable Cache; Spark Analytics (compute); SnappyData
  12. Pattern 2: Interactive analytics on live data in NoSQL stores • Problem: Interactive Analytics at scale and concurrency for LIVE data sets, e.g. Sensor data, Customer interactions • Expensive, complex, batch • Difficult to deal with semi-structured • Read only, stale insight • Inflexible, slow • Diagram labels: Scalable NoSQL; Operational, Live Data; Continuous updates; NoSQL; Hadoop, MPP DB; ETL; MPP; Cubes, aggregations - MongoDB BI connector - Custom BI like SlamData; Tableau Extracts; Tableau
  13. Pattern 2: Live Analytics on Polyglot NotOnlySQL stores - Live updates propagated to in-memory Analytics Cluster in SnappyData • Diagram labels: CDC Streams; Sensor stream; Hadoop; NoSQL; Spark Transform (Data Prep); Rich SPARK APIs; window; Micro Service 1; Micro Service 2; Micro Service 3; In-memory Row-Column Tables; Virtual Tables; NoSQL Connectors; SQL; Visualize on any tool; SnappyData
  14. Pattern 3: Streaming Analytics not just simple processing • Streaming in Flink, DataFlow, Apex … - KV stores offer little to no analytic operators - Joins, aggregations across multiple DBs not possible - Too slow • Diagram labels: Unbounded Streams; Stream App; window; State Update; KV Store; Index; OLAP; Column table; Hadoop; NoSQL
  15. Pattern 3: Deep integration of stream processing with OLAP • Streaming deeply integrated with Analytics DB - Continuous queries on stream + history + enterprise data - Simple: Build stream+analytics apps using single model - Much faster than stitching • Diagram labels: Sensor streams; CDC streams; Transaction Streams…; Stream; Rich SPARK APIs; In-memory Row-Column Tables; NoSQL Connectors; SQL; Pull history on Demand; Continuously summarize; Tableau, Zeppelin
  16. How Mixed Workloads Are Supported Today • 1: New Data • 2: Batch layer (Master Dataset) • 3: Serving layer (Batch views) • 4: Speed layer (Real-time Views) • 5: Query
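
The five-step flow on this slide (new data, batch layer, serving layer, speed layer, query) can be illustrated with a minimal Python sketch. It is not taken from the presentation: the function names, the event shape, and the per-key count used as the "view" are assumptions chosen only to show why a query has to merge two separately maintained views.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute the whole view from the master dataset."""
    return Counter(event["key"] for event in master_dataset)

def update_realtime_view(realtime_view, event):
    """Speed layer: incrementally fold in one event the batch run has not seen yet."""
    realtime_view[event["key"]] += 1

def query(key, batch_view, realtime_view):
    """Query: merge the (stale) batch view with the (recent) real-time view."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)

# Hypothetical data: three events already in the master dataset,
# two more that arrived after the last batch run.
master = [{"key": "sensor-a"}, {"key": "sensor-a"}, {"key": "sensor-b"}]
batch_view = build_batch_view(master)

realtime_view = Counter()
for event in [{"key": "sensor-a"}, {"key": "sensor-c"}]:
    update_realtime_view(realtime_view, event)

print(query("sensor-a", batch_view, realtime_view))
```

Even in this toy form the same counting logic exists twice, once per layer, which is the duplication the following slides describe as complexity.
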
  17. Lambda Architecture is Complex • SOURCE APPS, KAFKA, STORM, CASSANDRA, ..... • Complexity: learn and master multiple products, data models, disparate APIs, configs • Slower • Wasted resources
  18. Can We Simplify & Optimize?
  19. How about a single clustered DB that can manage stream state, transactional data & run OLAP queries? • Scalable writes, point reads, OLAP queries • Diagram labels: Apps; Stream processing; Framework for Stream Processing, etc; RDB; Tables; Txn; MPP DB; HDFS
  20. Our Solution: SnappyData • A Single Unified Cluster: OLTP + OLAP + Streaming for real-time analytics
  21. Our Solution • Single Unified HA Cluster: OLTP + OLAP + Streaming for real-time analytics • Real-time design: low latency, HA, concurrency, replication based, consensus driven • Batch design: high throughput, lineage based system • Rapidly Maturing • Matured over 13 years • Deep Scale, High Volume MPP DB
  22. A Spark Based Big Data Analytics Platform • SNAPPYDATA: Spark API (Streaming, ML, Graph) • Transactions, Indexing • Full SQL • HA • DataFrame, RDD, DataSets • IN-MEMORY: Rows, Columnar, Spark Cache, Synopses (Samples) • Unified Data Access (Virtual Tables) • Unified Catalog • Native Store • HDFS/HBASE • S3 • JSON, CSV, XML • SQL db • Cassandra • MPP DB • Stream sources • Spark Jobs, Scala/Java/Python/R API, JDBC/ODBC, Object API (RDD, DataSets)
  23. We transform Spark from this… • USER 1 / APP 1 and USER 2 / APP 2, each with: SPARK MASTER; Spark Execution (Worker); Framework for streaming SQL, ML…; Immutable CACHE • Cannot update • Repeated for each User/APP • Bottleneck • HDFS • SQL • NoSQL • Deep Scale, High Volume MPP DB
  24. … Into “an always-on hybrid database”! • Spark Execution (Worker) JVM - Long running • Framework for streaming SQL, ML… • Spark Driver • Spark Job • SPARK Cluster • IN-Memory ROW + COLUMN Store - Mutable, - Transactional • Start with Indexing • JDBC • ODBC • Shared Nothing Persistence • HISTORY: HDFS, SQL, NoSQL, Deep Scale, High Volume MPP DB
  25. Architecture • Cluster Manager & Scheduler • Snappy Data Server (Spark Executor + Store) • Parser • OLAP • TXN • Synopsis Data Engine • Distributed Membership Service • HA • Stream Processing • Data Frame • RDD • Low Latency • High Latency • HYBRID Store: Probabilistic, Rows, Columns, Index • Query Optimizer • Add / Remove Server • Tables • ODBC/JDBC
  26. Unified API • Spark’s DataFrame API allows for: ML, graph, batch & streaming, SQL (selects) • SnappyData adds full SQL support and extends DataFrame and DataSource APIs for: Mutability semantics (DML & transactions); Indexing; SQL-based streaming
  27. Can we use Statistical techniques to shrink data? • Most apps happy to trade off 1% accuracy for 200x speedup! • Can usually get a 99.9% accurate answer by only looking at a tiny fraction of data! • Often can make perfectly accurate decisions with imperfect answers! - A/B Testing, visualization, ... • The data itself is usually noisy - Processing entire data doesn’t necessarily mean exact answers!
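
The trade-off on this slide can be made concrete with a small Python sketch that answers an average from a 1% uniform sample and reports an error margin alongside it. This is an illustration only, not SnappyData's engine: the function name `approximate_mean`, the synthetic data, and the normal-approximation 95% margin are all assumptions.

```python
import math
import random
import statistics

def approximate_mean(values, fraction, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    confidence half-width based on the sample standard error.
    """
    rng = random.Random(seed)
    n = max(2, int(len(values) * fraction))
    sample = rng.sample(values, n)
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(n)
    return mean, 1.96 * stderr

# Hypothetical data set: 200,000 noisy readings around 100.
data_rng = random.Random(42)
data = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

estimate, margin = approximate_mean(data, fraction=0.01)
exact = statistics.fmean(data)
print(f"approx = {estimate:.2f} +/- {margin:.2f}, exact = {exact:.2f}")
```

The sample touches 2,000 of the 200,000 rows, and the margin shrinks with the square root of the sample size, so a small fraction of the data is often enough for a decision.
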
  28. Probabilistic Store: Sketches + Uniform & Stratified Samples • 1. Streaming CMS (Count-Min-Sketch): higher resolution for more recent time ranges; Time: [t1, t2), [t2, t3), [t3, t4), [t4, now); 4T, 2T, T, ≤T • 2. Top-K Queries w/ Arbitrary Filters: maintain a small sample at each CMS cell; Traditional CMS vs. CMS+Samples • 3. Fully Distributed Stratified Samples: always include timestamp as a stratified column for streams; Streams; Aging; Row Store (In-memory); Column Store (Disk)
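
For readers unfamiliar with the Count-Min Sketch named on this slide, the following is a minimal, plain Python version of the traditional structure. It is a sketch under stated assumptions, not SnappyData's implementation: it omits the time-range resolution and the per-cell samples the slide describes, and the class name, the `width`/`depth` defaults, and the salted BLAKE2b hashing are choices made here for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size frequency counter; estimates never undercount."""

    def __init__(self, width=1000, depth=5):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.rows[row][self._index(item, row)] for row in range(self.depth))

# Hypothetical stream of ad identifiers.
sketch = CountMinSketch()
for ad in ["ad-1"] * 500 + ["ad-2"] * 20:
    sketch.add(ad)

print(sketch.estimate("ad-1"), sketch.estimate("ad-2"))
```

Memory stays at `width * depth` counters regardless of how many distinct keys the stream contains, which is what makes the structure attractive for unbounded streams.
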
  29. High-Level Accuracy Guarantees • Quality certified Approx Answers • Query Engine • HAC • Bias Estimate • Variance Estimate • Pipelined bootstrapped operator • Interactive Query • Continuous Query • STREAMS • Aging • SNAPPY STORE • Stratified Samples • Row store (Memory) • Column Store (Disk)
  30. What is unique • Deep Fusion with Spark: Elastic, highly available in-memory store for OLTP fused with Spark’s memory manager and the Catalyst/Tungsten engine. The store itself is exposed as native Spark data frames. • Extreme Speed thru CPU code gen, vectorization: Extend Spark’s Tungsten engine with better code generation, colocation schemes, .. • Synopses Data Engine: Use statistical techniques to reduce data by 100-1000x. Answer queries in fraction of time and resources.
  31. Cloud Ready
  32. Dealing with Credit Card Fraud • Credit Card transaction stream • SnappyData Cluster • Streaming Application • User History • Prediction Model • Black Listed Cards • ………. • Data Lake • Notification to owner • Notification to merchant
  33. Preventing Bill Shock, Real Time Upgrades • The system detects approaching usage limits, notifies users and gives them a chance to buy a one time upgrade or a new plan, increasing loyalty & revenue • Diagram labels: CDR Stream; SnappyData Cluster; Streaming Application; Customers Approaching Limit; Plan Info; Data Lake; Immediate SMS to customer; Schedule callback through call center
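
The detection step described on this slide, keeping running usage per customer while ingesting a CDR stream and flagging customers who approach their plan limit, can be sketched in a few lines of Python. Everything here is hypothetical: the record fields (`customer`, `minutes`), the 80% threshold, and the function name are assumptions, and a real deployment would keep this state in the cluster's tables rather than in a local dictionary.

```python
from collections import defaultdict

def detect_approaching_limit(cdr_stream, plan_limits, threshold=0.8):
    """Yield (customer, usage, limit) the first time usage crosses threshold * limit."""
    usage = defaultdict(float)   # running state maintained while ingesting the stream
    notified = set()             # avoid sending the same customer repeated alerts
    for record in cdr_stream:
        customer = record["customer"]
        usage[customer] += record["minutes"]
        limit = plan_limits[customer]
        if customer not in notified and usage[customer] >= threshold * limit:
            notified.add(customer)
            yield customer, usage[customer], limit

# Hypothetical plan info and call detail records.
plan_limits = {"alice": 100, "bob": 200}
cdrs = [
    {"customer": "alice", "minutes": 50},
    {"customer": "bob", "minutes": 30},
    {"customer": "alice", "minutes": 35},
    {"customer": "alice", "minutes": 10},
]

alerts = list(detect_approaching_limit(cdrs, plan_limits))
print(alerts)
```

Each alert would then feed the actions the slide names, such as an immediate SMS or a scheduled callback.
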
  34. Market Surveillance For Market Makers • Stream Ingestion • Reference Data • Stream analytics • Insider detection • Apply Rules • Detect Market Manipulation • Alert & Notify Downstream Systems • Trigger Investigations • Spark Streaming • SQL Querying • Continuous Queries • Partitioned Stream Ingestion • Summaries & Alerts • Messaging • Machine Learning
  35. Connected Car Real Time Data Flow • SnappyData Cluster • Kafka Receiver • Vehicle Time Series Data • Vehicle History • Driver History • Streaming Application • HDFS, HBase Raw Data Store • Custom Summary Dashboard • Notification to owner • ………. • System KPIs • Asset Metadata
  36. Real Time Marketing Campaigns • A stream matching engine that uses customer history, their current location and relevant offers to effectively target users creates differentiation & generates revenue • Diagram labels: Offline Analysis; REAL TIME MATCHING ENGINE; Customer History; Notification Sub-system; Historical Customer Profiles; User by Geo location; PERSONALIZED CAMPAIGNS TO USERS; Ingest Stream; REAL TIME OFFERS from Merchants
  37. Sensor Analytics • 000’s data points/sec • Emergency Shutdown • Tuning & Optimization, Monitor & Control • Continuous Real-time Analysis • Maintenance • Billing
  38. SnappyData RFQ Analytics • RFQs/Trades/Quotes streams • Message Bus • Stream Ingestion • Reference Data • ETL • OLAP and Low Latency Querying in SQL • Machine Learning in Spark • Analytic Dashboards
  39. Ad Analytics • 1.5-2x faster ingestion, faster trx • 7-142× faster analytics (at 300M records)
  40. Data Synopsis Engine
  41. TPCH • 100 GB, Avg Latency: SnappyData 5.7s • MemSQL 12.0s • Spark 66.9s
  42. THANK YOU! • Try it out: http:// • Resources: http:// resources