Relevance - Deal Personalization and Real Time Big Data Analytics
Prassnitha Sampath
psampath@groupon.com
Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase - NoSQL matters Dublin 2015

1,483 views

Relevance and personalization are crucial to building a personalized local commerce experience at Groupon. This talk gives an overview of the real-time analytics infrastructure, which handles over 3 million events per second and scales to billions of data points. It covers how our Kafka -> Storm -> Redis/HBase pipeline generates real-time analytics for hundreds of millions of Groupon users, and discusses architecture design choices and tradeoffs, including algorithmic choices such as Bloom filters and HyperLogLog. Attendees can take away lessons from our real-life experience that help them understand various tuning methods and their tradeoffs, and apply them in their own solutions.

Published in: Data & Analytics

Prassnitha Sampath - Real Time Big Data Analytics with Kafka, Storm & HBase - NoSQL matters Dublin 2015

  1. 1. Relevance - Deal Personalization and Real Time Big Data Analytics
     Prassnitha Sampath, psampath@groupon.com
  2. 2. About Me
     •  Lead Engineer working on Real Time Data Infrastructure @ Groupon
     •  Graduate of Portland State and Madras University
  3. 3. What are Groupon Deals?
  4. 4. Our Relevance Scenario
     Users
  5. 5. Scaling: Keeping Up With a Changing Business
     2011 / 2012 / 2014: growing number of deals, growing users
     •  100 Million+ subscribers
     •  We need to store data such as user click history, email records, service logs, etc. This is billions of data points and terabytes of data.
  6. 6. Changing Business: Shift from Email to Mobile
     •  Growth in mobile business
     •  Reducing dependence on email marketing
     100 Million+ app downloads
  7. 7. Deal Personalization Infrastructure Use Cases
     •  Deliver Personalized Emails (Offline System): personalize billions of emails for hundreds of millions of users
     •  Deliver Personalized Website & Mobile Experience (Online System): personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views
  8. 8. Deal Personalization Infrastructure Use Cases
     •  Deliver Personalized Website, Mobile and Email Experience
     •  Deal Performance
     •  Understand User Behavior
     Deliver Relevant Experience with High Quality Deals
  9. 9. Earlier System
     Components: Data Pipeline (User Logs, Email Records, User History, etc.); Offline Personalization Map/Reduce; Email; Online Deal Personalization API; MySQL Store
  10. 10. Earlier System
     Components: Data Pipeline; Offline Personalization Map/Reduce; Email; Online Deal Personalization API; MySQL Store
     •  Scaling MySQL for data such as user click history and email records was painful unless we sharded the data
     •  Data Pipeline is not "Real Time"
  11. 11. Ideal System
     Components: Real Time Data Pipeline; Offline Personalization Map/Reduce; Email; Online Deal Personalization API; Ideal Data Store
     •  Common data store that serves data to both online and offline systems
     •  Data store that scales to hundreds of millions of records
     •  Data store that works well with our existing Hadoop-based systems
     •  Real Time pipeline that scales and can process about 100,000 messages/second
  12. 12. Final Design
     Components: Web Site Logs; Mobile Logs; Kafka Message Broker; Storm; HBase; Offline Personalization Map/Reduce; Email; Online Deal Personalization API
  13. 13. Two Challenges With HBase
     •  How to scale to 100,000 writes/second?
     •  How to run Map Reduce programs over HBase without affecting read latency?
     •  How to batch load data into HBase without affecting read latencies?
  14. 14. Final HBase Design
     Components: Real Time HBase; Batch HBase; Replication; Bulk load data via HFiles; Map Reduce over HBase
  15. 15. Leveraging System for Real Time Analytics
     Various requirements from relevance algorithms to pre-compute real time analytics for better targeting
     •  Category Level Multidimensional Performance Metrics: How do women in Dublin convert for pizza deals?
     •  Deal Level Performance Metrics: How do women in Dublin convert for a particular pizza deal?
  16. 16. Leveraging System for Real Time Analytics
     More Complex Examples
     •  Category Level Multidimensional Performance Metrics: How do women in Dublin from the Dundrum area aged 30-35 convert for New York Style Pizza, when the deal is located within 2 miles and priced between €10-€20?
     •  Deal Level Performance Metrics: How do women in Dublin from the Dundrum area aged 30-35 convert for a particular deal?
  17. 17. Leveraging System for Real Time Analytics
     Even More Complex Examples
     •  How do women in Dublin from the Dundrum area aged 30-35, who also like activities such as biking and are active customers on our mobile platform, convert when the deal is located within 2 miles and priced between €10-€20?
     •  How do women in Dublin from the Dundrum area aged 30-35, who also like activities such as biking and are active customers of Groupon deals on the mobile platform, convert for this particular deal?
  18. 18. Power of Simple Counting
     It turns out all the earlier questions can be answered if we can count the appropriate events in the appropriate bucket:
     Conversion rate for pizza deals for women in Dublin = (No. of purchases by women in Dublin for pizza deals) / (No. of deal impressions by women in Dublin for pizza deals)
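The ratio on slide 18 can be sketched in a few lines of Python. This is only an illustration: the counter values and the tuple key layout are made up, and the zero-impression guard is an assumption rather than something the slides specify.

```python
def conversion_rate(purchases: int, impressions: int) -> float:
    """Return purchases / impressions, or 0.0 when nothing was shown."""
    if impressions == 0:
        return 0.0
    return purchases / impressions


# Hypothetical counters for the bucket (gender=F, city=Dublin, category=pizza).
counters = {
    ("impressions", "F", "Dublin", "pizza"): 2000,
    ("purchases", "F", "Dublin", "pizza"): 50,
}

rate = conversion_rate(
    counters[("purchases", "F", "Dublin", "pizza")],
    counters[("impressions", "F", "Dublin", "pizza")],
)
print(f"{rate:.3f}")
```

Because the rate is derived from two plain counters, any question that can be phrased as a bucket needs only two increments per event and one division at read time.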
  19. 19. Real Time Analytics Infrastructure
     Real time infrastructure processing 100,000 requests/second
     •  Kafka topic with real time user events
     •  Storm running the analytics topology. The Storm topology calculates the various dimensions/buckets and updates the appropriate Redis bucket. Redis is sharded from the client side.
     •  Redis 1, Redis 2, ... Redis N. The Redis cluster handles over 3 million events per second and stores over 14 billion unique keys.
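Slide 19 says Redis is sharded from the client side but does not say how keys are assigned to shards. A minimal sketch, assuming a stable hash with modulo assignment (the shard addresses and the key format are hypothetical):

```python
import zlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


# Hypothetical shard addresses; the slides only name "Redis 1 ... Redis N".
shards = ["redis-1:6379", "redis-2:6379", "redis-3:6379"]

key = "impressions:F:Dublin:pizza"
print(shards[shard_for(key, len(shards))])
```

A stable hash matters here because every writer and reader must agree on where a counter lives. Plain modulo assignment remaps most keys when the number of shards changes, so consistent hashing is a common alternative when shards are added often.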
  20. 20. Real Time Analytics Infrastructure - Explained
     Kafka topic with real time user events -> Storm: read user event data from Kafka -> find out which buckets this event falls into -> increase the event counter for the appropriate bucket in Redis -> Redis shards
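The three steps on slide 20 (read an event, find its buckets, increment counters) can be sketched as follows. This is an assumption-laden illustration: the dimension names, the key format, and the choice to count every non-empty combination of dimensions are hypothetical, and a `Counter` stands in for Redis.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dimensions; the talk mentions gender, city, category and more.
DIMENSIONS = ["gender", "city", "category"]

counts = Counter()  # in-memory stand-in for the sharded Redis counters


def buckets_for(event: dict, dimensions: list) -> list:
    """Return one bucket key per non-empty combination of the event's dimensions."""
    present = [d for d in dimensions if d in event]
    keys = []
    for size in range(1, len(present) + 1):
        for combo in combinations(present, size):
            parts = "|".join(f"{d}={event[d]}" for d in combo)
            keys.append(f"{event['type']}|{parts}")
    return keys


def process(event: dict) -> None:
    """Increment the counter of every bucket the event falls into."""
    for key in buckets_for(event, DIMENSIONS):
        counts[key] += 1


process({"type": "purchase", "gender": "F", "city": "Dublin", "category": "pizza"})
print(len(counts), counts["purchase|gender=F|city=Dublin"])
```

Counting every combination makes the number of increments per event grow as 2^d - 1 for d dimensions, which is one way a 100,000 events/second input could turn into millions of Redis operations per second. Whether the production topology enumerated all combinations or a curated subset is not stated in the slides.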
  21. 21. Scaling Challenges - Kafka - Storm
     •  Storm was hard to scale. We had to try various combinations to finalize how many bolts of each type are required for steady state operations, and how many workers are needed overall.
     •  Use the "topology.max.spout.pending" setting in Storm topologies. We found it very useful for shielding topologies from a sudden surge in traffic.
     •  Build your entire infrastructure so that duplicate data is tolerated.
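"topology.max.spout.pending" caps how many emitted tuples a spout task may have un-acked at once. The toy simulation below is not Storm code; it is a sketch with made-up numbers (burst size, acks per tick) showing why such a cap shields a topology from a surge: the backlog waits in the source instead of piling up in flight.

```python
from collections import deque


def run(incoming, max_pending: int, acks_per_tick: int):
    """Simulate a spout that stops emitting once max_pending tuples are un-acked.

    Returns (peak number of in-flight tuples, ticks needed to drain everything).
    """
    backlog = deque(incoming)  # tuples still waiting in the source, e.g. Kafka
    pending = 0                # emitted but not yet acked
    peak = 0
    ticks = 0
    while backlog or pending:
        # Emit only while under the cap.
        while backlog and pending < max_pending:
            backlog.popleft()
            pending += 1
        peak = max(peak, pending)
        # Downstream bolts ack a fixed number of tuples per tick.
        pending -= min(pending, acks_per_tick)
        ticks += 1
    return peak, ticks


capped_peak, capped_ticks = run(range(10_000), max_pending=500, acks_per_tick=100)
uncapped_peak, uncapped_ticks = run(range(10_000), max_pending=10_000, acks_per_tick=100)
print(capped_peak, uncapped_peak)
```

In this model the cap does not change how fast the bolts drain the burst; it only bounds how much work is in flight at any moment. In Storm the setting takes effect only for topologies that use acking.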
  22. 22. Scaling Challenges - Redis
     •  Reduce the memory footprint by using hashes. They are very memory efficient compared to normal Redis keys.
     •  To support a high rate of write operations, we turned off AOF and turned on RDB backups.
     Redis was the easiest of all the infrastructure pieces (compared with Kafka, Storm, HBase).
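The "use hashes" advice on slide 22 amounts to grouping related counters as fields of one Redis hash instead of giving each counter its own top-level key. A sketch of one possible layout, assuming one hash per bucket and one field per event type (the layout and key names are hypothetical). A dictionary stands in for Redis, and `hincrby` mimics the Redis HINCRBY command.

```python
from collections import defaultdict

# In-memory stand-in for Redis hashes: {hash_name: {field: count}}.
hashes = defaultdict(lambda: defaultdict(int))


def hincrby(name: str, field: str, amount: int = 1) -> int:
    """Mimic Redis HINCRBY: increment one field of one hash, return the new value."""
    hashes[name][field] += amount
    return hashes[name][field]


def record(event_type: str, bucket: str) -> int:
    """Store all counters of a bucket in a single hash (hypothetical layout)."""
    return hincrby(f"bucket:{bucket}", event_type)


record("impressions", "F|Dublin|pizza")
record("impressions", "F|Dublin|pizza")
record("purchases", "F|Dublin|pizza")
print(dict(hashes["bucket:F|Dublin|pizza"]))
```

Redis stores small hashes in a compact encoding as long as they stay under its configured size thresholds, which is where the memory saving comes from. Very large hashes fall back to a regular hash table, so the grouping has to keep each hash small.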
  23. 23. When Small is Big – Bloom Filters
     •  Since both Kafka and Storm can send the same data twice, especially at scale, it was important to build downstream infrastructure that can handle duplicate data.
     •  However, by its very nature the Analytics Topology (Counting Topology) cannot handle duplicates.
     •  Storing individual messages for billions of messages is far too expensive and would take a lot more memory.
     •  So we used Bloom filters. At a very small error rate, we could effectively de-dupe data with a very small memory footprint.
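The de-duplication idea on slide 23 can be sketched with a small Bloom filter. This is a generic textbook construction, not Groupon's implementation: the sizing uses the standard formulas for a target false-positive rate, the two hash values come from one SHA-256 digest (double hashing), and the event ids are made up.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter sized for an expected item count and error rate."""

    def __init__(self, expected_items: int, error_rate: float):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(1, math.ceil(-expected_items * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step for double hashing
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> bool:
        """Insert item; return True if it was (possibly) seen before."""
        seen = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                seen = False
                self.bits[byte] |= 1 << bit
        return seen


seen = BloomFilter(expected_items=100_000, error_rate=0.001)
events = ["evt-1", "evt-2", "evt-1", "evt-3", "evt-2"]  # hypothetical event ids
unique = [e for e in events if not seen.add(e)]
print(unique)
```

With these parameters the filter needs on the order of 180 KB for 100,000 ids, far less than storing the ids themselves. The trade-off is one-sided: a repeated event is always caught, but a new event is occasionally mistaken for a duplicate and dropped, so counts can be slightly low by roughly the configured error rate.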
  24. 24. Avoiding Errors – Backups/Recovery Strategy
     For a high volume system that also drives so much revenue for the company, a good backup/recovery strategy is necessary.
     •  Redis: RDB backups are taken every few hours and stored in HDFS for later use.
     •  HBase: HBase snapshot functionality is used. Snapshots are taken every few hours.
     •  Kafka/Storm: All input into the Kafka topic is stored in HDFS for 30 days, so any hour or day can be replayed from HDFS if necessary.
  25. 25. Monitoring
     Overall end-to-end monitoring to test the complete flow of data through the Kafka -> Storm -> HBase pipeline. A crawler crawls the page, and monitoring looks for the corresponding data in HBase.
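The end-to-end check on slide 25 can be sketched as a poll-until-visible loop: after the crawler triggers a page view, the monitor looks up the expected record until it appears or a deadline passes. Everything named here is hypothetical (the `wait_for_record` helper, the probe key, the timeouts), and a dictionary stands in for the HBase lookup.

```python
import time


def wait_for_record(lookup, key, timeout_s=60.0, poll_s=1.0):
    """Poll lookup(key) until it returns a value or the timeout expires.

    Returns the observed latency in seconds, or None if the record never appeared.
    """
    start = time.monotonic()
    while True:
        if lookup(key) is not None:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(poll_s)


# Stand-in for the store the pipeline writes to; a real check would query HBase.
store = {"probe-123": {"page": "/deals/pizza"}}

latency = wait_for_record(store.get, "probe-123", timeout_s=1.0, poll_s=0.01)
missing = wait_for_record(store.get, "probe-999", timeout_s=0.05, poll_s=0.01)
print(latency is not None, missing)
```

Returning the latency rather than a plain boolean lets the same probe feed both an alert (record never arrived) and a trend line (pipeline is getting slower).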
  26. 26. Questions? Thank you!
     psampath@groupon.com
     www.groupon.com/techjobs
     Slides prepared in collaboration with Ameya Kanitkar