C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by DeWayne Filppi

This session will describe how to resolve processing limitations by placing the streaming and data store interfaces in memory as well, through an in-memory computing platform. It will also describe how to resolve the complexity challenge with a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services on any environment (private/public cloud, bare metal…). Best of all, this optimization can be implemented seamlessly, with no code changes to your apps.

Transcript

  • 1. Real Time Big Data With Storm, Cassandra, and In-Memory Computing
      DeWayne Filppi, @dfilppi
  • 2. Big Data Predictions
      “Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” Edd Dumbill, O’REILLY
      ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
  • 3. The Two Vs of Big Data: Velocity, Volume
  • 4. We’re Living in a Real Time World…
      Homeland Security, Real Time Search, Social eCommerce, User Tracking & Engagement, Financial Services
  • 5. The Flavors of Big Data Analytics: Counting, Correlating, Research
  • 6. Analytics @ Twitter – Counting
      • How many signups, tweets, retweets for a topic?
      • What’s the average latency?
      • Demographics
      • Countries and cities
      • Gender
      • Age groups
      • Device types
      • …
  • 7. Analytics @ Twitter – Correlating
      • What devices fail at the same time?
      • What features get users hooked?
      • What places on the globe are “happening”?
  • 8. Analytics @ Twitter – Research
      • Sentiment analysis
      • “Obama is popular”
      • Trends
      • “People like to tweet after watching American Idol”
      • Spam patterns
      • How can you tell when a user spams?
  • 9. It’s All about Timing
      • “Real time” (< few seconds)
      • Reasonably quick (seconds - minutes)
      • Batch (hours/days)
  • 10. It’s All about Timing
      • Event driven / stream processing
      • High resolution – every tweet gets counted
      • Ad-hoc querying
      • Medium resolution (aggregations)
      • Long running batch jobs (ETL, map/reduce)
      • Low resolution (trends & patterns)
      This is what we’re here to discuss :)
  • 11. VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
  • 12. In Memory Data Grid Review
      • RAM is the new disk
      • Data partitioned across a cluster
      • Large “virtual” memory space
      • Transactional
      • Highly available
      • Code collocated with data.
  • 13. Data Grid + Cassandra: A Complete Solution
      • Data flows through the in-memory cluster async to Cassandra
      • Side effects calculated
      • Filtering an option
      • Enrichment an option
      • Results instantly available
      • Internal and external event listeners notified
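The flow on this slide (write to memory, notify listeners, persist asynchronously) can be sketched as a write-behind store. This is a minimal single-process illustration, not the XAP or Cassandra API: `WriteBehindGrid`, its methods, and the listener callback are hypothetical names, and a plain dict stands in for Cassandra.

```python
import queue
import threading


class WriteBehindGrid:
    """Toy in-memory store that persists asynchronously (hypothetical names)."""

    def __init__(self, backing_store):
        self.memory = {}                    # stands in for the in-memory grid
        self.backing_store = backing_store  # stands in for Cassandra
        self.listeners = []                 # callbacks notified on every write
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def write(self, key, value):
        self.memory[key] = value            # result is available immediately
        for listener in self.listeners:
            listener(key, value)
        self._pending.put((key, value))     # persistence happens later

    def read(self, key):
        return self.memory[key]

    def _drain(self):
        # Background thread: copy queued writes to the backing store.
        while True:
            key, value = self._pending.get()
            self.backing_store[key] = value
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has reached the backing store."""
        self._pending.join()


cassandra_stand_in = {}
grid = WriteBehindGrid(cassandra_stand_in)
seen = []
grid.listeners.append(lambda key, value: seen.append(key))
grid.write("tweet:1", {"user": "dfilppi", "text": "hello"})
value_from_memory = grid.read("tweet:1")
grid.flush()
```

Reads are served from memory as soon as `write` returns, while the backing store catches up in the background. A real grid would also replicate the pending queue so that queued writes survive a node failure, which this sketch does not attempt.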
  • 14. Simplified Event Flow
  • 15. Grid – Cassandra Interface
      • Hector and CQL based interface
      • In memory data must be mapped to column families.
      • Configurable class to column family mapping
      • Must serialize individual fields
      • Fixed fields can use defined types
      • Variable fields (for schemaless in-memory mode) need serializers
      • Object model flattening
      • By default, nested fields are flattened.
      • Can be overridden by custom serializer.
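To make "object model flattening" concrete, here is a small sketch of turning a nested object into a single level of column names and values. The dotted naming scheme and the `flatten` helper are assumptions for illustration; the slides do not say how the XAP interface actually names flattened columns.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into one level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested object: recurse and prefix its fields with the parent name.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


event = {"id": 42, "user": {"name": "dfilppi", "geo": {"country": "US"}}}
row = flatten(event)
```

Here `row` holds the columns `id`, `user.name`, and `user.geo.country`. A custom serializer, as the slide mentions, would instead store `user` as a single serialized value.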
  • 16. Virtues and Limitations
      • Could be faster: high availability has a cost
      • Complex flows not easy to assemble or understand with simple event handlers
      • Complete stack, not just two tools of many
      • Fast.
      • Microsecond latencies for in memory operations
      • Fast enough for almost anybody
      • Highly available/self healing
      • Elastic
  • 17. Storm Background
      • Popular open source, real time, in-memory, streaming computation platform.
      • Includes distributed runtime and intuitive API for defining distributed processing flows.
      • Scalable and fault tolerant.
      • Developed at BackType, and open sourced by Twitter
  • 18. Storm Abstractions
      • Streams
      • Unbounded sequence of tuples
      • Spouts
      • Source of streams (Queues)
      • Bolts
      • Functions, Filters, Joins, Aggregations
      • Topologies
  • 19. Streaming word count with Storm
      • Storm has a simple builder interface for creating stream processing topologies
      • Storm delegates persistence to external providers
      • Cassandra, because of its write performance, is commonly used
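The code shown on the original slide did not survive in this transcript. As a stand-in, the shape of a streaming word count (spout, splitting bolt, counting bolt) can be sketched with plain Python generators. This is not the Storm API: the function names are hypothetical, the source is finite rather than unbounded, and a `Counter` stands in for external state such as Cassandra.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence from a finite stand-in source."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, state):
    """Bolt: keeps a running count per word and emits (word, count)."""
    for (word,) in stream:
        state[word] += 1
        yield (word, state[word])


counts = Counter()  # stands in for externally persisted state
source = sentence_spout(["storm is fast", "Storm scales"])
topology = count_bolt(split_bolt(source), counts)
emitted = list(topology)
```

In a real topology each stage runs as many parallel tasks across a cluster, and tuples for the same word are routed to the same counting task; the generator chain only shows the data flow.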
  • 20. Storm: Optimistic Processing
      • Storm (quite rationally) assumes success is normal
      • Storm uses batching and pipelining for performance
      • Therefore the spout must be able to replay tuples on demand in case of error.
      • Any kind of quasi-queue like data source can be fashioned into a spout.
      • No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
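The replay requirement can be illustrated with a toy spout that remembers each emitted tuple until it is acknowledged. The method names loosely mirror the nextTuple/ack/fail callbacks of Storm's spout interface, but `ReplayableSpout` itself is a hypothetical, in-process sketch and not Storm code.

```python
from collections import deque


class ReplayableSpout:
    """Toy spout that keeps emitted tuples until they are acked."""

    def __init__(self, items):
        self._queue = deque(enumerate(items))  # (message id, tuple) pairs
        self._in_flight = {}

    def next_tuple(self):
        if not self._queue:
            return None
        msg_id, item = self._queue.popleft()
        self._in_flight[msg_id] = item      # remember it until acked
        return msg_id, item

    def ack(self, msg_id):
        self._in_flight.pop(msg_id)         # fully processed, safe to forget

    def fail(self, msg_id):
        # Put the tuple back on the queue so it is emitted again.
        self._queue.append((msg_id, self._in_flight.pop(msg_id)))


spout = ReplayableSpout(["a", "b"])
processed = []
attempts = 0
while (emitted := spout.next_tuple()) is not None:
    msg_id, item = emitted
    attempts += 1
    if item == "a" and attempts == 1:
        spout.fail(msg_id)                  # simulate a downstream error
        continue
    processed.append(item)
    spout.ack(msg_id)
```

The failed tuple is processed after the one behind it, so replay gives at-least-once delivery but not ordering. Any source that can hold unacknowledged items this way, such as a queue, can play the spout role.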
  • 21. Fast. Want to go faster?
      • Eliminate non-memory components
      • Replace the disk based queue with a reliable in-memory queue
      • Replace disk based state persistence with in-memory persistence
      • Asynchronously update disk based state (C*)
  • 22. Sample Architecture
  • 23. References
      • Try the Cloudify recipe
      • Download Cloudify: http://www.cloudifysource.org/
      • Download the Recipe (apps/xapstream, services/xapstream):
          – https://github.com/CloudifySource/cloudify-recipes
      • XAP – Cassandra Interface Details:
      • http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
      • Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
      • https://github.com/Gigaspaces/storm-integration
      • For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
      • http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
      • http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
      • Part 3 coming soon.
  • 25. Twitter Storm With Cassandra
  • 26. Storm Overview
  • 27. Storm Concepts
      • Streams
      • Unbounded sequence of tuples
      • Spouts
      • Source of streams (Queues)
      • Bolts
      • Functions, Filters, Joins, Aggregations
      • Topologies
  • 28. Challenge – Word Count
      [Diagram labels: Tweets, Count, ?, Word:Count]
      • Hottest topics
      • URL mentions
      • etc.
  • 29. Streaming word count with Storm
  • 30. Supercharging Storm
      • Storm doesn’t supply persistence, but provides for it
      • Storm optimizes IO to slow persistence (e.g. databases) using batching.
      • Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
      [Diagram label: Tweets, events, whatever…]
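Batching writes to a slow store, as described on this slide, amounts to buffering records and handing them over in groups. The sketch below shows the idea in isolation; `BatchingWriter` and its `sink` callable are hypothetical and do not represent how Storm implements batching internally.

```python
class BatchingWriter:
    """Buffers records and hands them to a slow sink in groups."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # callable taking a list of records (hypothetical)
        self.batch_size = batch_size
        self._buffer = []

    def write(self, record):
        self._buffer.append(record)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._buffer:
            self.sink(self._buffer)   # one round trip for the whole batch
            self._buffer = []         # start a fresh buffer


batches = []
writer = BatchingWriter(batches.append, batch_size=3)
for n in range(7):
    writer.write(n)
writer.flush()                        # push the final partial batch
```

Seven writes reach the sink as three calls instead of seven. The trade-off is that buffered records are lost if the process dies before a flush, which is why the slide asks the stream provider to support reliability and replay.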
  • 31. XAP Real Time Analytics
  • 32. Two Layer Approach
      • Advantage: Minimal “impedance mismatch” between layers.
          – Both NoSQL cluster technologies, with similar advantages
      • Grid layer serves as an in memory cache for interactive requests.
      • Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
      [Diagram labels: Raw Event Stream, In Memory Compute Cluster, Real Time Events, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE]
  • 33. Simplified Architecture
  • 34. Key Concepts
      • Flowing event streams through memory for side effects
      • Event driven architecture executing in-memory
      • Raw events flushed, aggregations/derivations retained
      • All layers horizontally scalable
      • All layers highly available
      • Real-time analytics & cached batch analytics on same scalable layer
      • Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
  • 35. Keep Things In Memory
      Facebook keeps 80% of its data in Memory (Stanford research)
      RAM is 100-1000x faster than Disk (Random seek)
      • Disk: 5-10ms
      • RAM: ~0.001msec
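As a quick check of the arithmetic, the two latency figures quoted on the slide can be compared directly. The numbers below are taken from the slide as written; nothing here is a measurement.

```python
# Latency figures as quoted on the slide, converted to microseconds.
DISK_SEEK_US = (5_000, 10_000)   # 5-10 ms
RAM_ACCESS_US = 1                # ~0.001 ms

speedup_low = DISK_SEEK_US[0] // RAM_ACCESS_US
speedup_high = DISK_SEEK_US[1] // RAM_ACCESS_US
print(f"Implied speedup: {speedup_low:,}x to {speedup_high:,}x per random access")
```

Taken at face value, these two figures imply a ratio of 5,000x to 10,000x, so the slide's "100-1000x" range is the more conservative claim.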
  • 36. Take Aways
      • A data grid can serve different needs for big data analytics:
      • Supercharge a dedicated stream processing cluster like Storm.
          – Provide fast, reliable, transactional tuple streams and state
      • Provide a general purpose analytics platform
          – Roll your own
      • Simplify overall architecture while enhancing scalability
          – Ultra high performance/low latency
          – Dynamically scalable processing and in-memory storage
          – Eliminate messaging tier
          – Eliminate or minimize need for RDBMS
  • 37. References
      • Realtime Analytics with Storm and Hadoop
      • http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
      • Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
      • Twitter Storm: http://storm-project.net
      • XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/