C* Summit 2013: Real-Time Big Data with Storm, Cassandra, and In-Memory Computing by Dewayne Filppi


This session will describe how to resolve the processing limitations by placing the streaming and data store interfaces in-memory as well, through an in-memory computing platform, and also how to resolve the complexity challenge by implementing a DevOps approach that abstracts all the underlying infrastructure and provides single-click management of all the application tiers and services, on any environment (private/public cloud, bare metal…). And the best news is that all this optimization can be implemented seamlessly, with no code change to your apps.


    Presentation Transcript

    • Real Time Big Data With Storm, Cassandra, and In-Memory Computing. DeWayne Filppi, @dfilppi
    • Big Data Predictions: “Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” Edd Dumbill, O’Reilly. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
    • The Two Vs of Big Data: Velocity, Volume
    • We’re Living in a Real Time World…
        § Homeland Security
        § Real Time Search
        § Social
        § eCommerce
        § User Tracking & Engagement
        § Financial Services
    • The Flavors of Big Data Analytics: Counting, Correlating, Research
    • Analytics @ Twitter – Counting
        § How many signups, tweets, retweets for a topic?
        § What’s the average latency?
        § Demographics
            – Countries and cities
            – Gender
            – Age groups
            – Device types
            – …
    • Analytics @ Twitter – Correlating
        § What devices fail at the same time?
        § What features get users hooked?
        § What places on the globe are “happening”?
    • Analytics @ Twitter – Research
        § Sentiment analysis
            – “Obama is popular”
        § Trends
            – “People like to tweet after watching American Idol”
        § Spam patterns
            – How can you tell when a user spams?
    • It’s All about Timing
        § “Real time” (< few seconds)
        § Reasonably quick (seconds - minutes)
        § Batch (hours/days)
    • It’s All about Timing
        • Event driven / stream processing
        • High resolution – every tweet gets counted
        • Ad-hoc querying
        • Medium resolution (aggregations)
        • Long running batch jobs (ETL, map/reduce)
        • Low resolution (trends & patterns)
        This is what we’re here to discuss :-)
    • VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
    • In Memory Data Grid Review
        § RAM is the new disk
        § Data partitioned across a cluster
        § Large “virtual” memory space
        § Transactional
        § Highly available
        § Code collocated with data
    • Data Grid + Cassandra: A Complete Solution
        • Data flows through the in-memory cluster and is written asynchronously to Cassandra
        • Side effects calculated
        • Filtering is an option
        • Enrichment is an option
        • Results instantly available
        • Internal and external event listeners notified
    • Simplified Event Flow
    • Grid – Cassandra Interface
        § Hector and CQL based interface
        § In-memory data must be mapped to column families.
            – Configurable class to column family mapping
        § Must serialize individual fields
            – Fixed fields can use defined types
            – Variable fields (for schemaless in-memory mode) need serializers
        § Object model flattening
            – By default, nested fields are flattened.
            – Can be overridden by custom serializer.
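The object model flattening described in the slide above can be illustrated with a minimal sketch. This is not the XAP or Hector implementation: the function name `flatten`, the dotted column-naming scheme, and the sample `tweet` record are all assumptions made for illustration.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dicts into one level of column-name -> value pairs.

    Hypothetical helper: the "parent.child" naming is an assumption,
    not the documented XAP column-naming scheme.
    """
    columns: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Nested object: recurse and merge its fields into the same row.
            columns.update(flatten(value, name, sep))
        else:
            # Leaf field: becomes a single column.
            columns[name] = value
    return columns


# Sample in-memory object with nested fields (invented for illustration).
tweet = {"id": 42, "user": {"name": "dfilppi", "geo": {"country": "US"}}}
row = flatten(tweet)
```

Here `row` holds one flat set of columns (`id`, `user.name`, `user.geo.country`) that could be written to a single column family row. A custom serializer, as the slide notes, would replace this default by storing a nested field as one opaque value instead.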
    • Virtues and Limitations
        § Virtues
            – Complete stack, not just two tools of many
            – Fast: microsecond latencies for in-memory operations; fast enough for almost anybody
            – Highly available/self healing
            – Elastic
        § Limitations
            – Could be faster: high availability has a cost
            – Complex flows not easy to assemble or understand with simple event handlers
    • Storm Background
        § Popular open source, real-time, in-memory, streaming computation platform.
        § Includes distributed runtime and intuitive API for defining distributed processing flows.
        § Scalable and fault tolerant.
        § Developed at BackType, and open sourced by Twitter
    • Storm Abstractions
        § Streams
            – Unbounded sequence of tuples
        § Spouts
            – Source of streams (Queues)
        § Bolts
            – Functions, Filters, Joins, Aggregations
        § Topologies
        [Diagram: Spout, Bolt, Topologies]
    • Streaming word count with Storm
        § Storm has a simple builder interface for creating stream processing topologies
        § Storm delegates persistence to external providers
        § Cassandra, because of its write performance, is commonly used
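The shape of a streaming word count can be sketched without a cluster. The code below is plain Python, not the Storm API: `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical stand-ins for a spout and two bolts, chained generators stand in for a topology, and a `Counter` stands in for the external persistence provider (for example Cassandra).

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout(sentences: Iterable[str]) -> Iterator[str]:
    """Stand-in for a spout: emits one sentence tuple at a time."""
    for sentence in sentences:
        yield sentence


def split_bolt(stream: Iterable[str]) -> Iterator[str]:
    """Stand-in for a bolt that splits each sentence into word tuples."""
    for sentence in stream:
        for word in sentence.lower().split():
            yield word


def count_bolt(stream: Iterable[str], store: Counter) -> Iterator[Tuple[str, int]]:
    """Stand-in for a bolt that updates a running count per word.

    `store` plays the role of the external state provider.
    """
    for word in stream:
        store[word] += 1
        yield word, store[word]


# Wiring the stages together plays the role of the topology builder.
store = Counter()
topology = count_bolt(
    split_bolt(sentence_spout(["storm is fast", "cassandra is fast"])), store
)
updates = list(topology)  # drain the stream; each item is (word, running count)
```

In a real topology each stage would run as parallel tasks across the cluster, with tuples for the same word routed to the same counting task; this sketch only shows the data flow.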
    • Storm: Optimistic Processing
        § Storm (quite rationally) assumes success is normal
        § Storm uses batching and pipelining for performance
        § Therefore the spout must be able to replay tuples on demand in case of error.
        § Any kind of quasi-queue-like data source can be fashioned into a spout.
        § No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
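The replay requirement can be made concrete with a small sketch. The class below is a hypothetical in-memory stand-in, not Storm code; its method names echo Storm's spout callbacks (`nextTuple`, `ack`, `fail`), but the bookkeeping shown is an assumption about one simple way to support replay.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, Optional, Tuple


class ReplayableSpout:
    """Hypothetical queue-like source that can re-emit failed tuples."""

    def __init__(self, items: Iterable[Any]) -> None:
        # Tuples waiting to be emitted, each tagged with a message id.
        self._pending: Deque[Tuple[int, Any]] = deque(enumerate(items))
        # Tuples emitted but not yet acknowledged.
        self._in_flight: Dict[int, Any] = {}

    def next_tuple(self) -> Optional[Tuple[int, Any]]:
        """Emit the next tuple, remembering it until it is acked or failed."""
        if not self._pending:
            return None
        msg_id, value = self._pending.popleft()
        self._in_flight[msg_id] = value
        return msg_id, value

    def ack(self, msg_id: int) -> None:
        """Processing succeeded: the tuple can be forgotten."""
        self._in_flight.pop(msg_id, None)

    def fail(self, msg_id: int) -> None:
        """Processing failed: put the tuple back so it is emitted again."""
        value = self._in_flight.pop(msg_id)
        self._pending.append((msg_id, value))


spout = ReplayableSpout(["a", "b"])
first = spout.next_tuple()
second = spout.next_tuple()
spout.ack(first[0])    # "a" was processed successfully
spout.fail(second[0])  # "b" failed downstream
replayed = spout.next_tuple()  # "b" is emitted a second time
```

Because success is the normal case, the only cost on the happy path is holding un-acked tuples in memory; the replay path is exercised only on failure.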
    • Fast. Want to go faster?
        § Eliminate non-memory components
            – Replace the disk-based queue with a reliable in-memory queue
            – Replace disk-based state persistence with in-memory persistence
        § Asynchronously update disk-based state (C*)
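Asynchronously updating disk-based state is often called write-behind. The following is a minimal sketch of that idea under stated assumptions: `WriteBehindCounter` is a hypothetical class, a plain dict stands in for Cassandra, and a single background thread drains the queue. It is not the XAP persistence mechanism, and `increment` is not safe for concurrent callers.

```python
import queue
import threading
from typing import Dict, Tuple


class WriteBehindCounter:
    """In-memory counts with asynchronous writes to a slower backing store."""

    def __init__(self, backing_store: Dict[str, int]) -> None:
        self.memory: Dict[str, int] = {}     # fast, authoritative state
        self._backing_store = backing_store  # stand-in for Cassandra
        self._queue: "queue.Queue[Tuple[str, int]]" = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def increment(self, key: str) -> int:
        """Update memory immediately; the durable write happens later."""
        value = self.memory.get(key, 0) + 1
        self.memory[key] = value
        self._queue.put((key, value))
        return value

    def _drain(self) -> None:
        """Background loop that applies queued updates to the backing store."""
        while True:
            key, value = self._queue.get()
            self._backing_store[key] = value
            self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued update has reached the backing store."""
        self._queue.join()


durable: Dict[str, int] = {}
counter = WriteBehindCounter(durable)
for word in ["storm", "storm", "cassandra"]:
    counter.increment(word)
counter.flush()
```

Callers see the new count as soon as `increment` returns, while the slower store catches up in the background. The trade-off is that a crash before the queue drains loses the pending writes, which is why the slide calls for a reliable in-memory queue.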
    • Sample Architecture
    • References
        § Try the Cloudify recipe
            – Download Cloudify: http://www.cloudifysource.org/
            – Download the Recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
        § XAP – Cassandra Interface Details:
            – http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
        § Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github:
            – https://github.com/Gigaspaces/storm-integration
        § For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
            – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
            – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
            – Part 3 coming soon.
    • Twitter Storm With Cassandra
    • Storm Overview
    • Storm Concepts
        § Streams
            – Unbounded sequence of tuples
        § Spouts
            – Source of streams (Queues)
        § Bolts
            – Functions, Filters, Joins, Aggregations
        § Topologies
        [Diagram: Spouts, Bolt, Topologies]
    • Challenge – Word Count
        [Diagram labels: Tweets; Count?; Word:Count]
        • Hottest topics
        • URL mentions
        • etc.
    • Streaming word count with Storm
    • Supercharging Storm
        § Storm doesn’t supply persistence, but provides for it
        § Storm optimizes IO to slow persistence (e.g. databases) using batching.
        § Storm processes streams (tweets, events, whatever…). The stream provider itself needs to support persistency, batching, and reliability.
    • XAP Real Time Analytics
    • Two Layer Approach
        § Advantage: Minimal “impedance mismatch” between layers.
            – Both NoSQL cluster technologies, with similar advantages
        § Grid layer serves as an in-memory cache for interactive requests.
        § Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability.
        [Diagram labels: Raw Event Stream; In Memory Compute Cluster; Real Time Events; Raw And Derived Events; NoSQL Cluster; Reporting Engine; Scale]
    • Simplified Architecture
    • Key Concepts
        § Flowing event streams through memory for side effects
        § Event driven architecture executing in-memory
        § Raw events flushed, aggregations/derivations retained
        § All layers horizontally scalable
        § All layers highly available
        § Real-time analytics & cached batch analytics on same scalable layer
        § Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
    • Keep Things In Memory
        Facebook keeps 80% of its data in memory (Stanford research)
        RAM is 100-1000x faster than disk (random seek)
        • Disk: 5-10 ms
        • RAM: ~0.001 ms
    • Take Aways
        § A data grid can serve different needs for big data analytics:
        § Supercharge a dedicated stream processing cluster like Storm.
            – Provide fast, reliable, transactional tuple streams and state
        § Provide a general purpose analytics platform
            – Roll your own
        § Simplify overall architecture while enhancing scalability
            – Ultra high performance/low latency
            – Dynamically scalable processing and in-memory storage
            – Eliminate messaging tier
            – Eliminate or minimize need for RDBMS
    • References
        § Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
        § Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
        § Twitter Storm: http://storm-project.net
        § XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/