C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by Dewayne Filppi

This session will describe how to resolve the processing limitations by placing the streaming and data store interfaces in-memory as well, through an in-memory computing platform, and also how to resolve the complexity challenge by implementing a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services, on any environment (private/public cloud, bare metal…). And the best news is that all this optimization can be implemented seamlessly, with no code change to your apps.

Published in: Technology

Transcript of "C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by Dewayne Filppi"

  1. Real Time Big Data With Storm, Cassandra, and In-Memory Computing. DeWayne Filppi, @dfilppi
  2. Big Data Predictions: “Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” Edd Dumbill, O’Reilly. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
  3. The Two Vs of Big Data: Velocity, Volume
  4. We’re Living in a Real Time World… Homeland Security, Real Time Search, Social eCommerce, User Tracking & Engagement, Financial Services
  5. The Flavors of Big Data Analytics: Counting, Correlating, Research
  6. Analytics @ Twitter: Counting
      - How many signups, tweets, retweets for a topic?
      - What’s the average latency?
      - Demographics
        - Countries and cities
        - Gender
        - Age groups
        - Device types
        - …
  7. Analytics @ Twitter: Correlating
      - What devices fail at the same time?
      - What features get users hooked?
      - What places on the globe are “happening”?
  8. Analytics @ Twitter: Research
      - Sentiment analysis
        - “Obama is popular”
      - Trends
        - “People like to tweet after watching American Idol”
      - Spam patterns
        - How can you tell when a user spams?
  9. It’s All about Timing
      - “Real time” (< few seconds)
      - Reasonably quick (seconds - minutes)
      - Batch (hours/days)
  10. It’s All about Timing
      - Event driven / stream processing
        - High resolution: every tweet gets counted
      - Ad-hoc querying
        - Medium resolution (aggregations)
      - Long running batch jobs (ETL, map/reduce)
        - Low resolution (trends & patterns)
      - This is what we’re here to discuss :)
  11. VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
  12. In Memory Data Grid Review
      - RAM is the new disk
      - Data partitioned across a cluster
      - Large “virtual” memory space
      - Transactional
      - Highly available
      - Code collocated with data
  13. Data Grid + Cassandra: A Complete Solution
      - Data flows through the in-memory cluster async to Cassandra
      - Side effects calculated
      - Filtering an option
      - Enrichment an option
      - Results instantly available
      - Internal and external event listeners notified
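
Slide 13 describes data that is available from memory immediately while reaching Cassandra asynchronously. The following is a minimal Python sketch of that write-behind pattern, not the XAP implementation: the class `WriteBehindGrid`, its method names, and the plain dictionary standing in for Cassandra are all hypothetical.

```python
import queue
import threading


class WriteBehindGrid:
    """Hypothetical in-memory store that persists writes asynchronously."""

    def __init__(self, persist):
        self._data = {}                 # in-memory copy, readable immediately
        self._pending = queue.Queue()   # writes waiting to reach the backing store
        self._persist = persist         # stand-in for a Cassandra write
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key, value):
        self._data[key] = value          # the result is instantly available
        self._pending.put((key, value))  # persistence happens later

    def get(self, key):
        return self._data.get(key)

    def flush(self):
        self._pending.join()             # block until every queued write is persisted

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self._persist(key, value)
            self._pending.task_done()


backing_store = {}   # plays the role of Cassandra in this sketch
grid = WriteBehindGrid(backing_store.__setitem__)
grid.put("tweets:storm", 42)
print(grid.get("tweets:storm"))   # served from memory, before persistence completes
grid.flush()
print(backing_store)
```

Reads never wait on the backing store; only `flush()` does, which is the trade-off the slide summarizes as "results instantly available".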
  14. Simplified Event Flow
  15. Grid-Cassandra Interface
      - Hector and CQL based interface
      - In memory data must be mapped to column families.
        - Configurable class to column family mapping
      - Must serialize individual fields
        - Fixed fields can use defined types
        - Variable fields (for schemaless in-memory mode) need serializers
      - Object model flattening
        - By default, nested fields are flattened.
        - Can be overridden by custom serializer.
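
Slide 15 says nested fields are flattened by default when in-memory objects are mapped to column families. As a rough illustration of what flattening means, here is a Python sketch; the dotted column-naming scheme and the `flatten` helper are assumptions for illustration, since the slides do not state the naming convention XAP actually uses.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into one level of column-name -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the path as a prefix.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


tweet = {"id": 7, "user": {"name": "dfilppi", "location": {"country": "US"}}}
row = flatten(tweet)
print(row)
```

Each leaf value ends up in its own column, which is why the slide notes that individual fields must be serializable.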
  16. Virtues and Limitations
      - Could be faster: high availability has a cost
      - Complex flows not easy to assemble or understand with simple event handlers
      - Complete stack, not just two tools of many
      - Fast.
        - Microsecond latencies for in memory operations
        - Fast enough for almost anybody
      - Highly available/self healing
      - Elastic
  17. Storm Background
      - Popular open source, real time, in-memory, streaming computation platform.
      - Includes distributed runtime and intuitive API for defining distributed processing flows.
      - Scalable and fault tolerant.
      - Developed at BackType, and open sourced by Twitter
  18. Storm Abstractions
      - Streams
        - Unbounded sequence of tuples
      - Spouts
        - Source of streams (Queues)
      - Bolts
        - Functions, Filters, Joins, Aggregations
      - Topologies
  19. Streaming word count with Storm
      - Storm has a simple builder interface for creating stream processing topologies
      - Storm delegates persistence to external providers
      - Cassandra, because of its write performance, is commonly used
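
The spout, bolt, and topology roles behind the streaming word count on slide 19 can be sketched with plain Python generators. This is a conceptual model only, not Storm's API: the function names are hypothetical, and a real topology would run distributed across a cluster and hand its counts to an external store such as Cassandra.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after every update."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


tweets = ["storm is fast", "cassandra is fast"]
# The "topology" is simply the wiring of spout -> bolt -> bolt.
topology = count_bolt(split_bolt(sentence_spout(tweets)))
updates = list(topology)
print(updates)
```

Because the count bolt emits an update per tuple, the latest count for every word is available as soon as the word is seen, which is the "high resolution" behavior described on slide 10.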
  20. Storm: Optimistic Processing
      - Storm (quite rationally) assumes success is normal
      - Storm uses batching and pipelining for performance
      - Therefore the spout must be able to replay tuples on demand in case of error.
      - Any kind of quasi-queue-like data source can be fashioned into a spout.
      - No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
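
Slide 20 requires that a spout be able to replay tuples on demand when processing fails. The sketch below shows one way such a source could behave, assuming an in-memory queue; `ReplayableSpout` is a hypothetical class, and although its `next_tuple`, `ack`, and `fail` methods echo the callbacks on Storm's spout interface, this is not Storm code.

```python
from collections import deque


class ReplayableSpout:
    """Holds each emitted tuple until it is acked; failed tuples are re-queued."""

    def __init__(self, items):
        self._ready = deque(enumerate(items))  # (message id, tuple) pairs
        self._pending = {}                      # emitted but not yet acked

    def next_tuple(self):
        if not self._ready:
            return None
        msg_id, item = self._ready.popleft()
        self._pending[msg_id] = item
        return msg_id, item

    def ack(self, msg_id):
        self._pending.pop(msg_id)               # fully processed, safe to forget

    def fail(self, msg_id):
        # Put the tuple back on the queue so it is emitted again later.
        self._ready.append((msg_id, self._pending.pop(msg_id)))


spout = ReplayableSpout(["a", "b"])
processed = []
failed_once = set()
while (emitted := spout.next_tuple()) is not None:
    msg_id, item = emitted
    if item == "a" and msg_id not in failed_once:
        failed_once.add(msg_id)
        spout.fail(msg_id)      # simulate a downstream error on the first attempt
    else:
        processed.append(item)
        spout.ack(msg_id)
print(processed)
```

The failed tuple is processed after the one that followed it, so replay of this kind gives at-least-once delivery but not ordering.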
  21. Fast. Want to go faster?
      - Eliminate non-memory components
      - Replace the disk based queue with a reliable in-memory queue
      - Replace disk based state persistence with in-memory persistence
      - Asynchronously update disk based state (C*)
  22. Sample Architecture
  23. References
      - Try the Cloudify recipe
        - Download Cloudify: http://www.cloudifysource.org/
        - Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
      - XAP-Cassandra interface details:
        - http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
      - Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
        - https://github.com/Gigaspaces/storm-integration
      - For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
        - http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
        - http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
        - Part 3 coming soon.
  25. Twitter Storm With Cassandra
  26. Storm Overview
  27. Storm Concepts
      - Streams
        - Unbounded sequence of tuples
      - Spouts
        - Source of streams (Queues)
      - Bolts
        - Functions, Filters, Joins, Aggregations
      - Topologies
  28. Challenge: Word Count (diagram labels: Tweets, Count, Word:Count)
      - Hottest topics
      - URL mentions
      - etc.
  29. Streaming word count with Storm
  30. Supercharging Storm
      - Storm doesn’t supply persistence, but provides for it
      - Storm optimizes IO to slow persistence (e.g. databases) using batching.
      - Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
      - Tweets, events, whatever…
  31. XAP Real Time Analytics
  32. Two Layer Approach
      - Advantage: Minimal “impedance mismatch” between layers.
        - Both NoSQL cluster technologies, with similar advantages
      - Grid layer serves as an in memory cache for interactive requests.
      - Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
      - Diagram labels: In Memory Compute Cluster; NoSQL Cluster; Raw Event Stream; Real Time Events; Raw And Derived Events; Reporting Engine; Scale
  33. Simplified Architecture
  34. Key Concepts
      - Flowing event streams through memory for side effects
      - Event driven architecture executing in-memory
      - Raw events flushed, aggregations/derivations retained
      - All layers horizontally scalable
      - All layers highly available
      - Real-time analytics & cached batch analytics on same scalable layer
      - Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
  35. Keep Things In Memory
      - Facebook keeps 80% of its data in memory (Stanford research)
      - RAM is 100-1000x faster than disk (random seek)
        - Disk: 5-10 ms
        - RAM: ~0.001 ms
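
The latency figures on slide 35 can be turned into a quick back-of-the-envelope comparison. The short Python calculation below uses only the numbers quoted on the slide, converted to microseconds; the variable names are illustrative.

```python
# Latencies quoted on the slide, converted to microseconds to keep the math in integers.
DISK_SEEK_US = (5_000, 10_000)   # 5-10 ms per random seek
RAM_ACCESS_US = 1                # ~0.001 ms per access

# How many RAM accesses fit in the time of one disk seek.
ratios = [disk_us // RAM_ACCESS_US for disk_us in DISK_SEEK_US]
# How many random seeks a single disk can serve per second at each latency.
seeks_per_second = [1_000_000 // disk_us for disk_us in DISK_SEEK_US]

print("RAM vs disk latency ratio:", ratios)
print("Random disk seeks per second:", seeks_per_second)
```

Taken at face value, the two quoted latencies imply a gap of several thousand times, above the slide's headline range of 100-1000x, so both are best read as rough orders of magnitude.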
  36. Take Aways
      - A data grid can serve different needs for big data analytics:
        - Supercharge a dedicated stream processing cluster like Storm.
          - Provide fast, reliable, transactional tuple streams and state
        - Provide a general purpose analytics platform
          - Roll your own
        - Simplify overall architecture while enhancing scalability
          - Ultra high performance/low latency
          - Dynamically scalable processing and in-memory storage
          - Eliminate messaging tier
          - Eliminate or minimize need for RDBMS
  37. References
      - Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
      - Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
      - Twitter Storm: http://storm-project.net
      - XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/