Slides for Munich Datageek Meetup


  1. User Behaviour Tracking: Track - Store - Process. Florian Pfeiffer, Head of Data & Infrastructure
  2. Vision: "Let's build our own Google Analytics"
  3. Why? Analytics does sampling; we want the (raw) data.
  4. Ideas, Thoughts & Goals: fast, with minimal impact on page loading time; high availability; track users across multiple platforms; storage engine? -> HBase
  5. Infrastructure
  6. Numbers! 10-20 ms response time per pixel; record so far: ~2,500 concurrent requests; 1.5 billion entries in HBase; 10 nodes in the Hadoop cluster.
  7. Serving Infrastructure: load balancers and round-robin DNS; nginx with the empty_gif module (~2 ms); data is written to a logfile.
  8. Storing Infrastructure: every nginx node runs flume-ng; Flume ingests the logfile; AsyncHBaseSink with a custom serializer; direct writes to HBase.
  9. Why Flume? We already had it in production ;) Storm might be an interesting alternative.
  10. HBase rowkey design
  11. Why? You can scan through all data and use filters to select specific data, but scanning with a start and stop row speeds things up (a lot).
  12. HBase rowkey design: Do I need a fast user lookup or a fast timespan lookup? User: clientid,ts<,connectionId>. Timespan: ts,clientid<,connectionId>.
  13. Inverse Timestamps: data in HBase is stored lexicographically sorted. Normal TS: a scan yields the oldest results first. Inverse TS: newer entries come first (and you can cancel the scan once you have enough data).
  14. Cross Domain Tracking: (Flash) cookies; fingerprinting; ETag; HTML5 storage.
  15. The olden times… or Cookies: easy to drop a 3rd-party cookie with a userId on different websites; increasingly blocked (Safari, Firefox, ...).
  16. Fingerprinting: yields interesting results on desktop, difficult on e.g. iPhone; invisible to the user; last resort if everything else fails?
  17. ETag: quite new, based on the browser cache; sounds interesting.
  18. HTML5 Storage: store data in local HTML5 storage; retrieve data with cross-domain messaging.
  19. Store data: e.g. UserId, SessionId, GeoIP, URL, action, data
  20. Batch Processing: calculate how many users are active on platform A and also on B; get the traffic of all questions belonging to channel X, sorted by country.
  21. Now to something completely different…
  22. demo
  23. Recommendations with Myrrix
  24. Myrrix: evolution: Taste -> Mahout -> Myrrix (-> Oryx); recommender based on ALS.
  25. Recommendations @: users emit signals on questions (view, like, gives answer, answer is voted best); the application sends signals through RabbitMQ to the recommendation servers.
  26. YEAH, but what happens when a new user signs up?
  27. ?
  28. Fetch data from tracking and feed it into Myrrix.
  29. Collecting & storing data works great; using & processing it is another thing ;)
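
The rowkey slides (11 to 13) can be illustrated with a minimal Python sketch. It does not talk to HBase; a sorted list of strings stands in for HBase's lexicographic row ordering. The 13-digit millisecond timestamp width, the comma separator, and all function names are assumptions made for this illustration and are not taken from the slides.

```python
from bisect import bisect_left

# Assumption: timestamps are epoch milliseconds that fit into 13 digits.
TS_WIDTH = 13
MAX_TS = 10 ** TS_WIDTH - 1


def inverse_ts(ts_ms: int) -> str:
    """Zero-padded inverse timestamp: newer events sort lexicographically first."""
    return str(MAX_TS - ts_ms).zfill(TS_WIDTH)


def user_rowkey(client_id: str, ts_ms: int, connection_id: str = "") -> str:
    """'User' layout from slide 12: clientid,ts<,connectionId>, with an inverse ts."""
    parts = [client_id, inverse_ts(ts_ms)]
    if connection_id:
        parts.append(connection_id)
    return ",".join(parts)


def user_scan_bounds(client_id: str) -> tuple[str, str]:
    """Start row (inclusive) and stop row (exclusive) covering one client.

    Assumes client ids never contain the separator. The stop row replaces
    the trailing ',' with the next character in byte order.
    """
    start = client_id + ","
    stop = client_id + chr(ord(",") + 1)
    return start, stop


def scan(sorted_keys: list[str], start: str, stop: str, limit: int) -> list[str]:
    """Stand-in for a range scan: jump to the start row, stop early after `limit` rows."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    return sorted_keys[lo:min(hi, lo + limit)]


if __name__ == "__main__":
    table = sorted([
        user_rowkey("c1", 1000),
        user_rowkey("c1", 3000),
        user_rowkey("c2", 1500),
        user_rowkey("c1", 2000),
    ])
    start, stop = user_scan_bounds("c1")
    print(scan(table, start, stop, limit=2))  # the two newest rows for c1
```

Because the range scan jumps straight to the first row of the client and the inverse timestamp puts the newest entry there, the scan can be cancelled after a few rows instead of filtering the whole table.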
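
The first batch question on slide 20 (how many users are active on platform A and also on B) boils down to a set intersection. The sketch below shows only that logic, in memory; it is not the Hadoop job from the talk. It assumes each tracking record carries a user id and a URL (two of the fields listed on slide 19) and that the platform can be read from the URL's host name. The host names and function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def users_per_platform(records):
    """Group user ids by platform, taking the URL host name as the platform.

    `records` is an iterable of (user_id, url) pairs (an assumed record shape).
    """
    users = defaultdict(set)
    for user_id, url in records:
        users[urlparse(url).hostname].add(user_id)
    return users


def active_on_both(records, platform_a, platform_b):
    """Return the set of users seen on platform_a and also on platform_b."""
    users = users_per_platform(records)
    return users[platform_a] & users[platform_b]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample = [
        ("u1", "https://a.example/question/1"),
        ("u2", "https://a.example/question/2"),
        ("u1", "https://b.example/article/9"),
        ("u3", "https://b.example/article/3"),
    ]
    both = active_on_both(sample, "a.example", "b.example")
    print(len(both), sorted(both))
```

At the data volumes mentioned on slide 6 the same grouping and intersection would run as a distributed job, but the shape of the computation stays the same.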