No-Bullshit Data Science

1. No-Bullshit Data Science. Szilárd Pafka, PhD, Chief Scientist, Epoch. Domino Data Science Popup, San Francisco, Feb 2017

3. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results, etc. mentioned in this talk.

8. Example #1

25. [Figure: two benchmark charts. Aggregation: 100M rows, 1M groups. Join: 100M rows x 1M rows. Both panels report time [s].]
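As a rough illustration of the two operations named on slide 25, here is a minimal sketch in plain Python. It uses only the standard library and small synthetic data. The function names (`aggregate_sum`, `hash_join`, `timed`) and the sizes are assumptions made for illustration; this is not the code or the tools benchmarked in the talk.

```python
import random
import time
from collections import defaultdict


def aggregate_sum(rows):
    """Group (key, value) rows by key and sum the values per group."""
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def hash_join(left, right):
    """Inner-join (key, value) rows in `left` against a key -> attribute dict `right`."""
    return [(key, value, right[key]) for key, value in left if key in right]


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call of fn(*args)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative sizes only, far below the 100M rows / 1M groups on the slide.
    n_rows, n_groups = 100_000, 1_000
    big = [(rng.randrange(n_groups), rng.random()) for _ in range(n_rows)]
    small = {k: f"label_{k}" for k in range(n_groups)}
    _, t_agg = timed(aggregate_sum, big)
    _, t_join = timed(hash_join, big, small)
    print(f"aggregation: {t_agg:.3f} s, join: {t_join:.3f} s")
```

A single timed call is noisy; repeating each measurement and reporting the minimum or median would be a reasonable refinement.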

26. (largest data analyzed)

30. [Figure: Gradient Boosting Machines. Axes: data size [M], training time [s]. Annotation: 10x.]

32. Linear tops off (axes: data size, accuracy)

33. Linear tops off; more data & better algo (axes: data size, accuracy)

34. Linear tops off; more data & better algo; random forest on 1% of data beats linear on all data (axes: data size, accuracy)


41. Example #2

43. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf


47. R packages, Python scikit-learn, Vowpal Wabbit, H2O, xgboost, Spark MLlib, a few others


50. EC2

51. n = 10K, 100K, 1M, 10M, 100M; training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
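The measurements listed on slide 51 can be sketched as a small harness. This is a minimal sketch under stated assumptions: `auc` computes the area under the ROC curve with the standard rank-sum (Mann-Whitney) formulation, and `run_benchmark` is a hypothetical helper that times only the training call for each data size. The callables `make_data`, `train`, and `predict` are placeholders supplied by the caller. RAM usage and per-core CPU are not measured here, and the transcript does not say how the talk handled reading, pre-processing, or scoring.

```python
import time


def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) formulation.

    labels are 0/1; tied scores receive the average of their ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def run_benchmark(sizes, make_data, train, predict):
    """For each size n, time train() and compute AUC on the test split."""
    results = []
    for n in sizes:
        x_train, y_train, x_test, y_test = make_data(n)
        start = time.perf_counter()
        model = train(x_train, y_train)
        elapsed = time.perf_counter() - start  # training call only
        test_auc = auc(y_test, predict(model, x_test))
        results.append({"n": n, "train_time_s": elapsed, "auc": test_auc})
    return results
```

Keeping the timer around the training call alone makes the numbers comparable across tools that differ in how they load data, which is a design choice of this sketch.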

56. 10x

65. Best linear: 71.1

68. learn_rate = 0.1, max_depth = 6, n_trees = 300; learn_rate = 0.01, max_depth = 16, n_trees = 1000
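To make the gap between the two settings on slide 68 concrete, here is a small sketch that records them and computes an upper bound on model size. The parameter names and values come from the slide. The `GBMSetting` class, the constant names `SHALLOW` and `DEEP`, and the leaf-count bound are assumptions for illustration: the bound assumes binary trees, where a tree of depth d has at most 2^d leaves, and real trees are usually much smaller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GBMSetting:
    """One gradient boosting configuration (names as written on the slide)."""
    learn_rate: float
    max_depth: int
    n_trees: int

    def max_leaves_per_tree(self) -> int:
        # A binary tree of depth d has at most 2**d leaves.
        return 2 ** self.max_depth

    def max_total_leaves(self) -> int:
        # Upper bound over the whole ensemble, not an expected size.
        return self.n_trees * self.max_leaves_per_tree()


SHALLOW = GBMSetting(learn_rate=0.1, max_depth=6, n_trees=300)
DEEP = GBMSetting(learn_rate=0.01, max_depth=16, n_trees=1000)

if __name__ == "__main__":
    for name, setting in (("shallow", SHALLOW), ("deep", DEEP)):
        print(name, setting, "max total leaves:", setting.max_total_leaves())
```

Under this bound the second setting allows up to 65,536,000 leaves against 19,200 for the first, which hints at why deeper, slower-learning ensembles cost far more to train.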

71. ...

78. Summary
