
Bring survey sampling techniques into big data


Techniques initially developed for official statistics are now used in machine learning and big data contexts.

Published in: Science

  1. Bringing survey sampling techniques into ‘big data’. Antoine Rebecq, Ubisoft Montréal, November 7, 2018.
  2. About me. Formerly: survey sampling methodologist at INSEE, France. “Type A” data scientist turned “Type B”.
  3. Key takeaway: the future of ‘big data’ is a statistician.
  4. Summary. I. What is a data science team? How can a (survey) statistician fit into it? II. Examples of awesome ‘big data’ challenges that could use statisticians.
  5. I. Data science and data scientists
  6. I. Data science and data scientists. Data scientists = combination of computer science, statistics, applied mathematics and domain expertise. Type A data scientist = focused on analyses and decision science. Type B data scientist = focused on production data applications (typically ML, recommendations, etc.).
  7. What does our type B data science team do? Machine learning in games! Example: recommendations (from Netflix: Basilico, 2015).
  8. What does our type B data science team do? Send data; send content; compute ML models.
  9. What does our type B data science team do? At its core, it is a programming team writing production code: distributed computation; optimized algorithms; code history and reviews. Tech stack: [not captured in this transcript]
  10. Modern data science teams. The (in)famous data science Venn diagram (Conway, 2013).
  11. Modern data science teams. Some truths: blur the line between all jobs (opportunities, not requirements); unicorns are rare but they do exist; let them have fun; pay them accordingly. More generally: create opportunities for everyone to learn from every domain.
  12. Modern data science teams. What can statisticians get from CS culture? Quality control for statisticians (hint: it’s the same!): distributed computation; optimized algorithms; code history and reviews. The R community has a very positive influence in introducing CS quality processes to statistics and data science (see, for example, Wickham, 2015 on git).
  13. II. Examples of ‘big data’ challenges that could use statisticians
  14. II. Examples of challenges: 1. A/B testing; 2. Sampled events (understanding data sources); 3. Improving ML algorithms (quality); 4. Improving ML algorithms (speed); 5. Understanding user feedback.
  15. II. Examples of challenges. 1. A/B testing. A/B testing = ‘big data’ term for a randomized controlled trial (RCT). Very useful for product shipping and business decisions. For example, Microsoft has a dedicated team doing extensive work on A/B testing (see Deng, 2018).
  16. II. Examples of challenges. 1. A/B testing. Need for carefully crafted sampling designs (image from Miller).
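
As a rough illustration of why the sampling design of an A/B test needs care, the sketch below computes an approximate per-arm sample size for comparing two conversion rates, using the usual normal-approximation formula for a two-sided two-proportion z-test. The function name, the baseline rate and the uplift are assumptions chosen for illustration; they do not come from the slides or from Miller's calculator.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed in each arm of a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test level
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    p_bar = (p_control + p_treatment) / 2          # pooled rate under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_power * math.sqrt(
            p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
        )
    ) ** 2
    return math.ceil(numerator / (p_treatment - p_control) ** 2)


# Hypothetical scenario: 10% baseline conversion, hoping to detect 12%.
n_small_effect = sample_size_per_arm(0.10, 0.12)
# A larger uplift (10% to 15%) needs far fewer users per arm.
n_large_effect = sample_size_per_arm(0.10, 0.15)
print(n_small_effect, n_large_effect)
```

The required sample grows roughly with the inverse square of the effect size, which is why small uplifts call for a design decided before the test starts.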
  17. II. Examples of challenges. 2. Sampled tracking events. Event = a single piece of information sent to the server when something happens. Some events are sampled to reduce load (CPU, network, storage).
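
When only a sample of events reaches the server, raw counts understate the true totals. A minimal sketch of the standard correction, assuming each event is kept independently with a known probability, weights every received event by the inverse of that probability (the Horvitz-Thompson estimator). The 5% sampling rate and the event volume below are hypothetical.

```python
import random


def horvitz_thompson_total(values, inclusion_probs):
    """Estimate a population total from sampled values and their inclusion probabilities."""
    return sum(v / p for v, p in zip(values, inclusion_probs))


rng = random.Random(0)
rate = 0.05              # hypothetical client-side sampling rate
true_events = 100_000    # hypothetical number of events that actually happened

# Each event is kept independently with probability `rate`; a kept event counts as 1.
received = [1 for _ in range(true_events) if rng.random() < rate]

estimated_total = horvitz_thompson_total(received, [rate] * len(received))
print(len(received), round(estimated_total))
```

With unequal sampling rates across event types or player groups, the same function applies as long as each received event carries its own inclusion probability.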
  18. II. Examples of challenges. 2. Sampled tracking events. Example: analysis of balancing in a fighting game. An event is sent by a sample of players when they use a new weapon. Question: is sword A better than sword B? -> Analysis of matches where these weapons are used …
  19. II. Examples of challenges. 2. Sampled tracking events. … This is an indirect sampling design (Lavallée, 2009), with unequal probabilities because of players' preferences, game rules, etc. Our ‘quick-and-dirty’ solution: calibration and the R package Icarus (Rebecq, 2016).
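
To make the idea of calibration concrete, here is a minimal sketch of raking (iterative proportional fitting), one of the calibration methods covered by the Deville and Särndal framework. It is written in plain Python for illustration and is not the Icarus API; the player attributes, the starting weights and the target margins are all hypothetical.

```python
def rake(rows, weights, margins, n_iter=100, tol=1e-9):
    """Adjust weights so weighted category totals match known population margins.

    rows:    list of dicts, one per sampled unit
    weights: initial weights, one per row
    margins: {variable: {category: target_total}}
    """
    w = list(weights)
    for _ in range(n_iter):
        max_shift = 0.0
        for var, targets in margins.items():
            # Current weighted total of each category of this variable.
            totals = {}
            for row, wi in zip(rows, w):
                totals[row[var]] = totals.get(row[var], 0.0) + wi
            # Rescale every unit so its category hits the target total.
            for i, row in enumerate(rows):
                factor = targets[row[var]] / totals[row[var]]
                w[i] *= factor
                max_shift = max(max_shift, abs(factor - 1.0))
        if max_shift < tol:
            break
    return w


# Hypothetical sample of five players who sent the event.
rows = [
    {"platform": "pc", "level": "novice"},
    {"platform": "pc", "level": "expert"},
    {"platform": "pc", "level": "expert"},
    {"platform": "console", "level": "novice"},
    {"platform": "console", "level": "expert"},
]
# Hypothetical known totals in the full player base (100 players).
margins = {
    "platform": {"pc": 50.0, "console": 50.0},
    "level": {"novice": 60.0, "expert": 40.0},
}
calibrated = rake(rows, [20.0] * len(rows), margins)


def weighted_total(var, category):
    return sum(w for row, w in zip(rows, calibrated) if row[var] == category)


print([round(w, 2) for w in calibrated])
```

After raking, estimates computed with the calibrated weights reproduce the known margins, which corrects part of the imbalance introduced by unequal sampling probabilities.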
  20. II. Examples of challenges. 3. Better probabilities for ML algorithms using sampling calibration. Using sampling calibration (Deville, 1992) to craft better probabilities from ML algorithms. 1. Example with balancing of sample data: http://nc233.com/2018/07/weighting-tricks-for-machine-learning-with-icarus-part-1/
  21. II. Examples of challenges. 3. Better probabilities for ML algorithms using sampling calibration
  22. II. Examples of challenges. 3. Better probabilities for ML algorithms using sampling calibration. 2. Directly calibrate output probabilities (WIP): better simulations; better recommendations.
  23. II. Examples of challenges. 4. Speed up big data tasks. Example: sampling to speed up network analyses (Leskovec, 2016 and Rebecq, 2017).
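
One simple way sampling can speed up a network analysis is to estimate a graph statistic from a uniform random sample of nodes instead of scanning the whole graph. The sketch below does this for the mean degree of a synthetic random graph; it is an assumption-laden toy example, not the method of the cited references, and the graph size and sample size are hypothetical.

```python
import random


def estimate_mean_degree(adjacency, sample_size, rng):
    """Estimate the mean degree from a simple random sample of nodes."""
    nodes = rng.sample(list(adjacency), sample_size)
    return sum(len(adjacency[node]) for node in nodes) / sample_size


rng = random.Random(42)
n_nodes = 5000

# Build a synthetic undirected graph by drawing random pairs of distinct nodes.
adjacency = {node: set() for node in range(n_nodes)}
for _ in range(20_000):
    a, b = rng.sample(range(n_nodes), 2)
    adjacency[a].add(b)
    adjacency[b].add(a)

# Full scan (what we want to avoid on a very large graph).
true_mean = sum(len(neighbors) for neighbors in adjacency.values()) / n_nodes
# Estimate from 10% of the nodes.
estimate = estimate_mean_degree(adjacency, 500, rng)
print(round(true_mean, 3), round(estimate, 3))
```

Statistics that depend on graph structure beyond single nodes (clustering, path lengths) generally need dedicated sampling designs, since uniform node sampling does not preserve them.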
  24. II. Examples of challenges. 5. Understand user feedback. Sentiment analysis (Pang, 2002). Direct feedback from the community vs. sampling and a carefully crafted questionnaire.
  25. Conclusion. A lot of interesting topics in the survey sampling literature can be very useful for ‘big data’ problems (research and practice). Hire a statistician for your type A data science team! Hire a statistician for your type B data science team! If you’re a statistician, look into ‘big data’ jobs for interesting challenges!
  26. Thanks! Antoine Rebecq. Blog post: nc233.com/symposium2018. LinkedIn.
  27. References (1) [Basilico, 2015] BASILICO, Justin. Recommendations for building Machine Learning systems. https://www.slideshare.net/SessionsEvents/justin-basilico-research-engineering-manager-at-netflix-at-mlconf-sf-111315 [Conway, 2013] CONWAY, Drew. The data science Venn diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram [Deville, 1992] DEVILLE, Jean-Claude and SÄRNDAL, Carl-Erik. Calibration estimators in survey sampling. Journal of the American Statistical Association, 1992, vol. 87, no 418, p. 376-382. [Deng, 2018] DENG, Alex, KNOBLICH, Ulf, and LU, Jiannan. Applying the Delta method in metric analytics: A practical guide with novel ideas. arXiv preprint arXiv:1803.06336, 2018. [Lavallée, 2009] LAVALLÉE, Pierre. Indirect sampling. Springer Science & Business Media, 2009.
  28. References (2) [Leskovec, 2016] LESKOVEC, Jure and SOSIČ, Rok. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 2016, vol. 8, no 1, p. 1. [Miller] MILLER, Evan. Evan Miller’s sample size calculator. https://www.evanmiller.org/ab-testing/sample-size.html [Pang, 2002] PANG, Bo, LEE, Lillian, and VAITHYANATHAN, Shivakumar. Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, Volume 10. Association for Computational Linguistics, 2002. p. 79-86. [Rebecq, 2017] REBECQ, Antoine. Sampling graphs. https://nc233.com/2017/03/sampling-graphs-mad-stat-seminar-at-toulouse-school-of-economics/
  29. References (3) [Rebecq, 2016] REBECQ, Antoine. Icarus: un package R pour le calage sur marges et ses variantes. In: 9e colloque francophone sur les sondages, Gatineau (Canada). 2016. [Wickham, 2015] WICKHAM, Hadley. R packages: organize, test, document, and share your code. O'Reilly Media, Inc., 2015 (page on git available at http://r-pkgs.had.co.nz/git.html)
