Diving In The Deep End Of The Big Data Pool

My Ignite talk at Strata Barcelona 2014

Published in: Software

  1. 1. François Garillot @huitseeker Diving In The Deep End of the Big Data Pool
  2. 2. 17:45 Thursday Understanding your Unicorns: Data Science Team Building in Action Location: 120-121
  3. 3. 4 analytical PhDs, 3 weeks, 1 org with data & a QUESTION
  4. 4. Stephen Gadd Marisa Figueiredo Federica Capranico François Garillot (me)
  5. 5. Gamers Food Coffee Shops Entertainment Films Mums Preschool Entertainment Globetrotters Sport Music Festivals Football Fans In Car Market Buyers Pet Owners Technology Drivers Family SMB Gamblers Mums Shoppers University Students Music Zone 1 commuters Zone 1 commuters Infrequent Zone 1 commuters Freq. Zone 1 commuters Resident Zone 1 commuters Regular Autos B2B Business/Finance Careers Education Entertainment Family & Youth Gambling Gaming Government IT Lifestyle News Property Retail Search Social Sport Telco Travel
  6. 6. Gamers Food Coffee Shops Entertainment Films Mums Preschool Entertainment Globetrotters Sport Music Festivals Football Fans In Car Market Buyers Pet Owners Technology Drivers Family SMB Gamblers Mums Shoppers University Students Music Zone 1 commuters Zone 1 commuters Infrequent Zone 1 commuters Freq. Zone 1 commuters Resident Zone 1 commuters Regular Autos B2B Business/Finance Careers Education Entertainment Family & Youth Gambling Gaming Government IT Lifestyle News Property Retail Search Social Sport Telco Travel 5+ millions 50+ K
  7. 7. ... so: Things Not To Mess Up
  8. 8. Nobody ever gets those two right
  9. 9. Unsupervised clustering: find new segments based on web browsing history
  10. 10. Unsupervised clustering: find new segments based on web browsing history. Simrank: relative distances. Spatial representation: have a position for each user. No implementation that works at scale!
  11. 11. Simrank & MDS. Website graph: 22 million nodes, 123 million edges. Simrank: 5+ millions, 25+ trillions. Clustering.
  12. 12. Simrank & MDS. MDS: scalable but too complex to do in time. Website graph: 22 million nodes, 123 million edges. Simrank: ✓ Implemented, 5+ millions. MDS, (45, 36): ✖ Fail. Clustering.
  13. 13. Lay the bare stuff down first, THEN refine
  14. 14. Cluster still a huge mess to deploy
  15. 15. Results: Locality-Sensitive Hashing. Singles. Hand-made code! Typical web browsing: pof.com, tagged.com. “The year of being single”, Marketing Magazine, 2013; “The rise of the single economy”, The Guardian, 2014
  16. 16. Final results obtained on the last day
  17. 17. Essential: fuel & friends
  18. 18. Power & network fail; bare pipeline first; distributed is hard, let's go think instead!; fuel & friends
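
Slide 15 credits the final results to Locality-Sensitive Hashing and hand-made code, but the transcript does not say which LSH scheme was used. As a rough illustration only, the sketch below assumes MinHash with banding over each user's set of visited sites, a common choice for set similarity. With 5+ million users there are on the order of 5,000,000² = 25 trillion user pairs, which may be what the "25+ trillions" figure refers to, so a method that only compares users landing in the same bucket is attractive. The function names, parameters, and the `.test` domains are hypothetical; only pof.com and tagged.com come from the slides.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, item: str) -> int:
    # Deterministic 64-bit hash of (seed, item). Python's built-in hash()
    # is salted per process for strings, so it is avoided here.
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(sites, num_hashes=32):
    # One minimum per seeded hash function. Two sets agree on a given
    # position with probability equal to their Jaccard similarity.
    # Assumes `sites` is non-empty.
    return [min(_hash(seed, s) for s in sites) for seed in range(num_hashes)]


def candidate_pairs(histories, num_hashes=32, bands=8):
    # Banding: split each signature into `bands` chunks and bucket users
    # by chunk. Users sharing at least one bucket become candidate pairs.
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for user, sites in histories.items():
        sig = minhash_signature(sites, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(user)
    pairs = set()
    for users in buckets.values():
        for u, v in combinations(sorted(users), 2):
            pairs.add((u, v))
    return pairs


def jaccard(a, b):
    # Exact set similarity, used to check candidates after bucketing.
    return len(a & b) / len(a | b)


# Toy browsing histories (hypothetical data, not from the talk).
histories = {
    "u1": {"pof.com", "tagged.com", "example-news.test"},
    "u2": {"pof.com", "tagged.com", "example-news.test"},
    "u3": {"example-sport.test", "example-travel.test"},
}

pairs = candidate_pairs(histories)
```

Only the pairs returned by `candidate_pairs` would then need an exact `jaccard` check, instead of every pair of users. The number of hash functions and bands trades recall against the number of candidates, and the values above are arbitrary.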
