StormWars - when the data stream shrinks
bvishnu
Apache Storm
• A Stream Processing framework
• Used to pull data from a stream and perform real-time analytics on the data
a Stream…
• Can be Apache Kafka or Amazon Kinesis.
• Normally has partitions/shards for better read & write throughput.
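As a quick sketch of why partitions/shards improve throughput: records are spread across shards (here by a simple hash of the record key; the key names and shard count are illustrative, not tied to any specific stream), so producers and consumers can work on different shards in parallel.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard for a record by hashing its key (illustrative scheme)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

# Four hypothetical record keys spread over 3 shards.
keys = ["user-1", "user-2", "user-3", "user-4"]
placement = {k: shard_for(k, 3) for k in keys}
```

The same key always lands on the same shard, which is what lets per-key ordering survive the parallelism.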
Partition Metadata
• Storm uses INTEGERS (0, 1, …) to identify partitions.
• Whereas…
• Amazon Kinesis uses STRINGS to identify partitions.
So how can we process data?
• User sorts the STRINGS (shard IDs)
• User maps the sorted shard IDs to integers 0…N

Shard-id-0001  <->  0
Shard-id-0002  <->  1
…
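The sort-and-map trick above can be sketched as follows (the shard IDs are illustrative, and this is a sketch of the idea, not the actual Storm spout code):

```python
# Map Kinesis string shard IDs to the integer partition IDs Storm expects:
# sort the strings, then assign each its index.
shard_ids = ["Shard-id-0002", "Shard-id-0001", "Shard-id-0003"]

partition_map = {shard: i for i, shard in enumerate(sorted(shard_ids))}

print(partition_map)
# {'Shard-id-0001': 0, 'Shard-id-0002': 1, 'Shard-id-0003': 2}
```

Because the mapping is derived purely from sorting, it is stable only as long as the set of shards does not change — which is exactly the problem the rest of the deck walks through.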
Storm API
Shard Split in Amazon Kinesis
Stream shrinks (3 to 2 shards)
Disturbance in the Force
• Storm partition metadata is NO longer valid, as the shard has been deleted.
• Storm partition metadata should now be:

shard-2  <->  0
shard-3  <->  1
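A quick sketch of why the old mapping breaks (shard names as in the slides; not actual spout code): re-sorting after the deletion shifts the remaining shards onto different integers, so any state checkpointed under the old partition numbers now points at the wrong shards.

```python
def partition_map(shard_ids):
    """Sort-and-enumerate mapping from string shard IDs to integer partitions."""
    return {shard: i for i, shard in enumerate(sorted(shard_ids))}

before = partition_map(["shard-1", "shard-2", "shard-3"])
# {'shard-1': 0, 'shard-2': 1, 'shard-3': 2}

# shard-1 is split into shard-2 and shard-3, then eventually deleted:
after = partition_map(["shard-2", "shard-3"])
# {'shard-2': 0, 'shard-3': 1}

# Partition 0's checkpoint was taken against shard-1,
# but partition 0 now refers to shard-2.
```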
a Solution:
• WHITE_LIST of shards for a Storm topology.
• A Storm topology pulls from a specific set of shards.
• So in our case:
  – start topology-1 with WHITELIST = "shard-1"
  – split the shard
  – start topology-2 with WHITELIST = "shard-2 & shard-3"
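The whitelist idea can be sketched like this (function and variable names are illustrative, not the actual spout API): each topology builds its integer mapping only over the shards in its own whitelist, so one topology's mapping never shifts because of shards it does not own.

```python
def assigned_partitions(all_shard_ids, whitelist):
    """Map only whitelisted shards to integer partitions for one topology."""
    visible = sorted(s for s in all_shard_ids if s in whitelist)
    return {shard: i for i, shard in enumerate(visible)}

shards = ["shard-1", "shard-2", "shard-3"]

topology_1 = assigned_partitions(shards, {"shard-1"})
topology_2 = assigned_partitions(shards, {"shard-2", "shard-3"})

print(topology_1)  # {'shard-1': 0}
print(topology_2)  # {'shard-2': 0, 'shard-3': 1}
```

When shard-1 disappears, topology-1's mapping simply becomes empty, while topology-2's mapping is unaffected — the two mappings never overlap.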
a Solution…
• When shard-1 gets deleted, topology-1 dies with it.
• Topology-2 continues processing data for the new shards.
a Solution…
So, there is NO metadata conflict, as there are 2 different topologies pulling data from different sets of shards.
Thank you
&
May the force be with you!

jaihind213@gmail.com
sweetweet213@twitter
mash213.wordpress.com
linkedin.com/in/213vishnu
