Bring survey sampling techniques into big data

BRINGING SURVEY SAMPLING
TECHNIQUES INTO ‘BIG DATA’
ANTOINE REBECQ
UBISOFT MONTRÉAL
NOVEMBER 7, 2018

About me
• Formerly: survey sampling methodologist at INSEE, France
• “Type A” data scientist turned “Type B”

Key takeaway
The future of ‘big data’ is a statistician

Summary
I. What is a data science team? How can a (survey) statistician fit into it?
II. Examples of awesome ‘big data’ challenges that could use statisticians

I. Data science and data scientists
Data science = combination of computer science, statistics, applied mathematics and domain expertise
Type A data scientist = Focused on analyses, decision science
Type B data scientist = Focused on production data application
(typically ML, recommendations, etc.)

What does our type B data science team do?
Machine Learning in games! Example: Recommendations (from Netflix:
Basilico, 2015)

(Diagram labels: Send data, Send content, Compute ML models)

At core: programming team
- Production code:
  - Distributed computation
  - Optimized algorithms
  - Code history and reviews
Tech stack:

Modern data science teams
The (in)famous data science Venn diagram (Conway, 2013)

Some truths:
- Blur the line between all jobs (opportunities, not requirements)
- Unicorns are rare but they do exist
- Let them have fun!
- Pay them accordingly!
More generally: Create opportunities for everyone to learn from every
domain

What can statisticians get from CS culture?
- Quality control for statisticians (hint: it’s the same!):
  - Distributed computation
  - Optimized algorithms
  - Code history and reviews
R community has a very positive influence in introducing CS quality
processes for statistics and data science (for example see Wickham,
2015 on git).

II. Examples of ‘big data’ challenges that
could use statisticians

II. Examples of challenges
1. A/B testing
2. Sampled events (understanding data sources)
3. Improving ML algorithms (quality)
4. Improving ML algorithms (speed)
5. Understanding user feedback

1. A/B testing
A/B testing = ‘big data’ term for Randomized Controlled Trial (RCT)
Very useful for:
- Product shipping
- Business decisions
For example Microsoft has a dedicated team doing extensive work on
A/B testing (see Deng, 2018).

1. A/B testing
Need for carefully crafted sampling designs (Image from Miller).
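
A minimal sketch of the kind of sample size calculation behind an A/B test design, assuming a two-sided two-proportion z-test with equal-sized arms. This is the usual textbook approximation, not necessarily the formula used by the cited calculator; the function name, the baseline rate and the minimum detectable effect are illustrative assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, min_effect, alpha=0.05, power=0.8):
    """Approximate users needed in EACH arm of an A/B test.

    p_base     : conversion rate of the control arm (assumed known)
    min_effect : smallest absolute difference we want to detect
    alpha      : two-sided significance level
    power      : probability of detecting the effect if it exists
    """
    p_alt = p_base + min_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_effect ** 2)


# Hypothetical example: 10% baseline conversion, detect a 2-point lift.
n_per_arm = sample_size_per_arm(0.10, 0.02)
print(n_per_arm)
```

Halving the minimum detectable effect roughly quadruples the required sample, which is why the design has to be settled before the test starts.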

2. Sampled tracking events
Event = a single piece of information sent to the server when something happens
Some events are sampled to reduce load (CPU, network, storage)
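
When events are sampled at a known rate, totals can be recovered by weighting each received event by the inverse of its sampling rate (a Horvitz-Thompson style estimator). The sketch below assumes the rate is known per event type; the event names and rates are hypothetical.

```python
from collections import defaultdict


def estimate_event_totals(received_events, sampling_rates):
    """Estimate how many events really happened, per event type.

    received_events : list of dicts with a "type" key (events that reached the server)
    sampling_rates  : dict mapping event type -> probability that an event is sent
    """
    totals = defaultdict(float)
    for event in received_events:
        rate = sampling_rates[event["type"]]
        # Each received event stands for 1 / rate events in the full population.
        totals[event["type"]] += 1.0 / rate
    return dict(totals)


# Hypothetical data: weapon events are sampled at 10%, match-end events are all sent.
received = [{"type": "weapon_used"}] * 50 + [{"type": "match_end"}] * 20
rates = {"weapon_used": 0.1, "match_end": 1.0}
totals = estimate_event_totals(received, rates)
print(totals)
```

Comparing raw counts of the two event types without these weights would understate weapon usage by a factor of ten.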

Example: analysis of balancing in a fighting game
An event is sent by a sample of players when they use a new weapon.
Question: is sword A better than sword B?
-> Analysis of matches where these weapons are used
…

… This is an indirect sampling design (Lavallée, 2009)
(Unequal probabilities because of players’ preferences, game rules, etc.)
Our ‘quick-and-dirty’ solution: calibration and R package Icarus
(Rebecq, 2016)
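
To illustrate what calibration does, here is a minimal raking-ratio sketch in plain Python: weights are rescaled in turn until the weighted totals match known margins. It is not the Icarus API and not the actual solution used by the team; the variables, categories and margin totals are hypothetical, and every category in the margins is assumed to appear in the sample.

```python
def rake_weights(rows, initial_weights, margins, n_iter=50):
    """Raking ratio: iteratively rescale weights to match known margins.

    rows            : list of dicts of categorical values, one per sampled unit
    initial_weights : starting weights (e.g. inverse inclusion probabilities)
    margins         : {variable: {category: known population total}}
    """
    weights = list(initial_weights)
    for _ in range(n_iter):
        for var, targets in margins.items():
            # Current weighted total of each category of this variable.
            current = {cat: 0.0 for cat in targets}
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Rescale so that this variable's margins are matched exactly.
            weights = [
                w * targets[row[var]] / current[row[var]]
                for row, w in zip(rows, weights)
            ]
    return weights


def weighted_total(rows, weights, var, cat):
    return sum(w for row, w in zip(rows, weights) if row[var] == cat)


# Hypothetical sample of matches with the weapon used and the player level.
rows = [
    {"weapon": "sword_A", "level": "novice"},
    {"weapon": "sword_A", "level": "expert"},
    {"weapon": "sword_B", "level": "novice"},
    {"weapon": "sword_B", "level": "expert"},
    {"weapon": "sword_A", "level": "expert"},
]
margins = {
    "weapon": {"sword_A": 60.0, "sword_B": 40.0},
    "level": {"novice": 50.0, "expert": 50.0},
}
calibrated = rake_weights(rows, [10.0] * len(rows), margins)
print(calibrated)
```

Win rates per weapon can then be computed with the calibrated weights instead of raw counts, which corrects for the groups that are over-represented in the sample.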

3. Better probabilities for ML algorithms using sampling calibration
Using sampling calibration (Deville, 1992) to craft better probabilities
from ML algorithms
1. Example with balancing of sample data:
http://nc233.com/2018/07/weighting-tricks-for-machine-learning-with-icarus-part-1/

2. Directly calibrate output probabilities (WIP)
- Better simulations
- Better recommendations

4. Speed up big data tasks
Example: Sampling to speed up network analyses (Leskovec, 2016 and Rebecq,
2017)
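
As a toy illustration of the idea, the sketch below estimates the mean degree of a graph from a uniform sample of nodes instead of scanning all of them. This is the simplest possible design and is not taken from the cited references; the random graph generator and all sizes are assumptions made for the example.

```python
import random


def random_graph(n_nodes, n_edges, seed=0):
    """Build a hypothetical undirected graph as {node: set of neighbours}."""
    rng = random.Random(seed)
    adjacency = {node: set() for node in range(n_nodes)}
    for _ in range(n_edges):
        a, b = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if a != b:  # skip self-loops
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def estimate_mean_degree(adjacency, sample_size, seed=0):
    """Estimate the mean degree from a simple random sample of nodes."""
    rng = random.Random(seed)
    sampled_nodes = rng.sample(sorted(adjacency), sample_size)
    return sum(len(adjacency[node]) for node in sampled_nodes) / sample_size


graph = random_graph(n_nodes=2000, n_edges=10000, seed=1)
true_mean = sum(len(neighbours) for neighbours in graph.values()) / len(graph)
estimate = estimate_mean_degree(graph, sample_size=200, seed=2)
print(true_mean, estimate)
```

Only 10% of the nodes are visited here. Statistics that depend on the graph structure (paths, communities) generally need more careful sampling designs than uniform node sampling.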

5. Understand user feedback
Sentiment analysis (Pang, 2002)
Direct feedback from community
Vs.
Sampling and carefully crafted questionnaire

Conclusion
- A lot of interesting topics in survey sampling literature can be super
useful for ‘big data’ problems (research and practice)
- Hire a statistician for your type A data science team!
- Hire a statistician for your type B data science team!
- If you’re a statistician, look into ‘big data’ jobs for interesting
challenges!

Thanks!
Antoine Rebecq
Blog post: nc233.com/symposium2018
LinkedIn

References (1)
[Basilico, 2015] BASILICO, Justin. Recommendations for building Machine Learning systems. https://www.slideshare.net/SessionsEvents/justin-basilico-research-engineering-manager-at-netflix-at-mlconf-sf-111315
[Conway, 2013] CONWAY, Drew. The data science Venn diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
[Deville, 1992] DEVILLE, Jean-Claude and SÄRNDAL, Carl-Erik. Calibration estimators in survey sampling. Journal of the American Statistical Association, 1992, vol. 87, no 418, p. 376-382.
[Deng, 2018] DENG, Alex, KNOBLICH, Ulf, and LU, Jiannan. Applying the Delta method in metric analytics: A practical guide with novel ideas. arXiv preprint arXiv:1803.06336, 2018.
[Lavallée, 2009] LAVALLÉE, Pierre. Indirect sampling. Springer Science & Business Media, 2009.

References (2)
[Leskovec, 2016] LESKOVEC, Jure and SOSIČ, Rok. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 2016, vol. 8, no 1, p. 1.
[Miller] MILLER, Evan. Evan Miller’s sample size calculator. https://www.evanmiller.org/ab-testing/sample-size.html
[Pang, 2002] PANG, Bo, LEE, Lillian, and VAITHYANATHAN, Shivakumar. Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10. Association for Computational Linguistics, 2002. p. 79-86.
[Rebecq, 2017] REBECQ, Antoine. Sampling graphs. https://nc233.com/2017/03/sampling-graphs-mad-stat-seminar-at-toulouse-school-of-economics/

References (3)
[Rebecq, 2016] REBECQ, Antoine. Icarus: un package R pour le calage sur marges et ses variantes. In: 9e colloque francophone sur les sondages, Gatineau (Canada). 2016.
[Wickham, 2015] WICKHAM, Hadley. R packages: organize, test, document, and share your code. O'Reilly Media, Inc., 2015 (page on git available at http://r-pkgs.had.co.nz/git.html)
