
No-Bullshit Data Science


by Szilard Pafka
Chief Scientist at Epoch

Szilard studied Physics in the 90s in Budapest and obtained a PhD using statistical methods to analyze the risk of financial portfolios. He then worked in finance, quantifying and managing market risk. A decade ago he moved to California to become the Chief Scientist of a credit card processing company, doing what is now called data science (data munging, analysis, modeling, visualization, machine learning etc.). He is the founder/organizer of several data science meetups in Santa Monica, and he is also a visiting professor at CEU in Budapest, where he teaches data science in the Masters in Business Analytics program.

While practitioners have been extracting business value from data for decades, the last several years have seen an unprecedented amount of hype in this field. This hype has created not only unrealistic expectations of results, but also glamour around the newest tools, supposedly capable of extraordinary feats. In this talk I will apply the much-needed methods of critical thinking and quantitative measurement (which data scientists are supposed to use daily in solving problems for their companies) to assess the capabilities of the most widely used software tools for data science. I will discuss two such analyses in detail: one concerning the size of datasets used for analytics, and the other regarding the performance of machine learning software used for supervised learning.

Published in: Technology


  1. No-Bullshit Data Science. Szilárd Pafka, PhD, Chief Scientist, Epoch. Domino Data Science Popup, San Francisco, Feb 2017
  2. Disclaimer: I am not representing my employer (Epoch) in this talk. I can neither confirm nor deny whether Epoch is using any of the methods, tools, results etc. mentioned in this talk.
  3. Example #1
  4. Aggregation (100M rows, 1M groups) and join (100M rows x 1M rows) timings [time in s]
  5. (largest data analyzed)
  6. (largest data analyzed)
  7. (largest data analyzed)
  8. Gradient Boosting Machines: training time [s] vs data size [M], 10x steps
  9. Accuracy vs data size: linear tops off
  10. Accuracy vs data size: linear tops off; more data & better algo
  11. Accuracy vs data size: linear tops off; more data & better algo; random forest on 1% of data beats linear on all data
  12. Accuracy vs data size: linear tops off; more data & better algo; random forest on 1% of data beats linear on all data
  13. Example #2
  14. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf
  15. http://www.cs.cornell.edu/~alexn/papers/empirical.icml06.pdf http://lowrank.net/nikos/pubs/empirical.pdf
  16. - R packages - Python scikit-learn - Vowpal Wabbit - H2O - xgboost - Spark MLlib - a few others
  17. - R packages - Python scikit-learn - Vowpal Wabbit - H2O - xgboost - Spark MLlib - a few others
  18. EC2
  19. n = 10K, 100K, 1M, 10M, 100M. Measured: training time, RAM usage, AUC, CPU % by core; read data, pre-process, score test data
  20. 10x
  21. Best linear: 71.1
  22. learn_rate = 0.1, max_depth = 6, n_trees = 300 vs learn_rate = 0.01, max_depth = 16, n_trees = 1000
  23. ...
  24. Summary
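The slide-4 benchmark (group-by aggregation over 100M rows with 1M groups, and a 100M x 1M join) can be sketched at reduced scale. This is a minimal sketch, assuming pandas as a stand-in for whichever tool is being timed; the sizes are scaled down to 1M rows / 10K groups so it runs in seconds.

```python
# Scaled-down sketch of the slide-4 benchmark: a group-by aggregation and
# a join, timed separately. pandas stands in for the compared tools; the
# sizes (1M rows / 10K groups) are assumptions, scaled down from the talk.
import time
import numpy as np
import pandas as pd

n_rows, n_groups = 1_000_000, 10_000
rng = np.random.default_rng(42)
d = pd.DataFrame({"key": rng.integers(0, n_groups, n_rows),
                  "x": rng.random(n_rows)})
lookup = pd.DataFrame({"key": np.arange(n_groups),
                       "y": rng.random(n_groups)})

t0 = time.time()
agg = d.groupby("key", as_index=False)["x"].mean()   # aggregation
t_agg = time.time() - t0

t0 = time.time()
joined = d.merge(lookup, on="key")                   # join
t_join = time.time() - t0

print(f"aggregation: {t_agg:.2f}s  join: {t_join:.2f}s")
```

The point of the talk's version is to run the same two operations at full scale across many tools and compare wall-clock times.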
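The comparison behind slides 9-12 (a linear model trained on all the data vs a random forest trained on only 1% of it, scored by accuracy/AUC) can be set up as follows. This is a sketch on synthetic data, which is an assumption; the talk used a real benchmark dataset, so which model wins here may differ from the slide's result.

```python
# Sketch of the slides 9-12 comparison: linear model on all training data
# vs random forest on a 1% sample, both scored by AUC on a held-out set.
# The synthetic dataset is an assumption, not the talk's benchmark data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100_000, n_features=20,
                           n_informative=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

# linear model on all of the training data
lin = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
auc_lin = roc_auc_score(y_te, lin.predict_proba(X_te)[:, 1])

# random forest on a 1% sample of the training data
idx = np.random.default_rng(0).choice(len(X_tr), len(X_tr) // 100,
                                      replace=False)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr[idx], y_tr[idx])
auc_rf = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(f"linear on all data:  AUC = {auc_lin:.3f}")
print(f"random forest on 1%: AUC = {auc_rf:.3f}")
```

The slide's claim is that on data with enough nonlinear structure, the sampled nonlinear model can beat the linear model trained on everything, because the linear model's accuracy tops off as data grows.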
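The slide-19 measurement harness (train at n = 10K...100M and record training time, RAM usage, AUC, and CPU % by core) can be sketched like this. The sizes and model here are scaled-down stand-ins, an assumption for illustration; RAM and per-core CPU need an external monitor and are not captured.

```python
# Sketch of the slide-19 harness: for each training-set size, record
# training time and test AUC. Sizes are scaled down from the benchmark's
# 10K..100M; RAM usage and CPU % per core are not measured in this sketch.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=30_000, n_features=20, random_state=0)
X_te, y_te = X[-10_000:], y[-10_000:]          # held-out test set

results = []
for n in (1_000, 5_000, 20_000):               # scaled-down size steps
    t0 = time.time()
    m = GradientBoostingClassifier(n_estimators=50).fit(X[:n], y[:n])
    elapsed = time.time() - t0
    auc = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    results.append((n, elapsed, auc))
    print(f"n={n:>6}  train time={elapsed:6.2f}s  AUC={auc:.3f}")
```

Running the same loop over every tool (R packages, scikit-learn, Vowpal Wabbit, H2O, xgboost, Spark MLlib) at each 10x size step is what produces the benchmark's comparison tables.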
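The two GBM settings compared on slide 22 can be written as parameter dicts. The slide uses the names learn_rate / n_trees; mapping them to scikit-learn's learning_rate / n_estimators is an assumed translation for illustration.

```python
# The two GBM settings from slide 22, as scikit-learn-style parameter
# dicts (learn_rate -> learning_rate, n_trees -> n_estimators is an
# assumed name mapping, not from the talk).
fast_shallow = dict(learning_rate=0.1, max_depth=6, n_estimators=300)
slow_deep = dict(learning_rate=0.01, max_depth=16, n_estimators=1000)

# The second setting trains many more, much deeper trees, so it is far
# more expensive; whether it yields better AUC is exactly the kind of
# empirical question the talk's benchmark measures rather than assumes.
print(fast_shallow)
print(slow_deep)
```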
