Modern Data Science

Presentation on Modern Data Science

Data scientists are in high demand, and there is simply not enough talent to fill the jobs. Why? Because the sexiest job of the 21st century requires a broad, multidisciplinary mix of skills at the intersection of mathematics, statistics, computer science, communication and business. Finding a data scientist is hard. Finding people who understand what a data scientist is, is equally hard.

Watch the video in Spanish here: https://www.youtube.com/watch?v=R3jeBHLLiiM

Published in: Technology

  1. 1. Modern Data Science Alejandro Correa Bahnsen June 2016 @albahnsen 1
  2. 2. Who am I? Data Scientist PhD in Machine Learning Interested in Big Data Engineering Passionate about open-source Scikit-Learn contributor :) Organizer of the Bogota Big Data Science Meetup 2
  3. 3. Who I've worked with 3
  4. 4. Where I work Lead Data Scientist working on applying Machine Learning for Security Informatics 4
  5. 5. Aims of this talk Discuss what a Modern Data Scientist is (and what it is not) 5
  6. 6. 6
  7. 7. It's 2016 and there is still no unique definition of Data Science 7
  8. 8. 8
  9. 9. “A data scientist is a statistician who lives in San Francisco.” “Data Science is statistics on a Mac.” 9
  10. 10. Data Science is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... 10
  11. 11. Even worse, people use several words interchangeably 11
  12. 12. 12
  13. 13. 13
  14. 14. 14
  15. 15. 15
  16. 16. Let's focus only on modern data science 16
  17. 17. So what is Data Science? 17
  18. 18. Data Science 18
  19. 19. Data Science is the intersection of Hacking Skills, Math & Statistics Knowledge and Substantive Expertise Those are the pillars of data science: computing, statistics, mathematics and quantitative disciplines combined to analyze data for better decision making 19
  20. 20. Hacking Skills Ability to build things and find clever solutions to problems. Programming/Coding: Python and R (and others) Databases: MySQL, PostgreSQL, Cassandra, MongoDB and CouchDB. Visualization: D3, Tableau, Qlikview and Markdown. Big Data: Hadoop, MapReduce and Spark. 20
  21. 21. Hacking Skills 21
  22. 22. Hacking Skills http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html 22
  23. 23. Hacking Skills http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html 23
  24. 24. Math & Statistics Being able to understand the right solution to each problem. Linear algebra: matrix manipulation. Machine Learning: Random Forests, SVM, Boosting. Descriptive statistics: describe, cluster. Statistical inference: generate new knowledge. 24
  25. 25. Math & Statistics 25
  26. 26. Substantive Expertise The ability to ask good questions requires domain understanding; that's why a data scientist can't create data-based solutions without good industry knowledge. Is this A or B or C? (classification). Is this weird? (anomaly detection). How much/how many? (regression). How is it organized? (clustering). What should I do next? (reinforcement learning). 26
  27. 27. How did we get here 27
  28. 28. Data Science Examples 28
  29. 29. Netflix Prize 29
  30. 30. Google Flu Trends 30
  31. 31. Creating a Rembrandt 31
  32. 32. Obama campaign 32
  33. 33. Moneyball 33
  34. 34. AlphaGo 34
  35. 35. My recent experience 35
  36. 36. Phishing Detection 36
  37. 37. Malware Identification 37
  38. 38. Man-in-the-Browser Attacks 38
  39. 39. Intrusion Detection 39
  40. 40. Fraud Detection 40
  41. 41. Fraud Detection Estimate the probability of a transaction being fraudulent based on customer patterns and recent fraudulent behavior. Issues when constructing a fraud detection system: class imbalance, cost-sensitivity, short response time of the system, dimensionality of the search space, feature preprocessing, model selection. 41
  42. 42. Fraud Detection 42
  43. 43. Class Imbalance Fraudulent transactions represent between 0.01% and 0.5% of all transactions. Create a balanced dataset using: under-sampling, over-sampling, Tomek links, Condensed Nearest Neighbor, NearMiss, or the Synthetic Minority Over-sampling Technique (SMOTE). 43
  44. 44. Class Imbalance Synthetic Minority Over-sampling Technique (SMOTE) 44
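As a rough illustration of the first option on the list, here is a minimal sketch of random under-sampling using only the Python standard library. It is not SMOTE and not necessarily what the talk used; the function name `random_undersample` and the toy data are assumptions made for this example.

```python
import random


def random_undersample(rows, labels, seed=0):
    """Balance a binary dataset by keeping all minority-class rows and an
    equally sized random subset of the majority-class rows."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # The shorter index list is the minority class.
    minority, majority = sorted((fraud_idx, legit_idx), key=len)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical toy data: 1,000 transactions, 5 of them fraudulent (0.5%).
labels = [1 if i % 200 == 0 else 0 for i in range(1000)]
rows = [{"id": i, "amount": 10.0 + i} for i in range(1000)]

bal_rows, bal_labels = random_undersample(rows, labels)
print(len(bal_labels), sum(bal_labels))
```

Under-sampling discards most of the legitimate transactions, which is why the slide also lists over-sampling methods such as SMOTE that synthesize additional minority-class examples instead.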
  45. 45. Cost-Sensitivity Typical evaluation of a classification model:

                                Actual Fraud            Actual Legitimate
        Predicted Fraud         True Positives (TP)     False Positives (FP)
        Predicted Legitimate    False Negatives (FN)    True Negatives (TN)

        $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$

        $F_1\ \text{Score} = \frac{2\,TP}{2\,TP + FP + FN}$
  46. 46. Cost-Sensitivity These metrics assume the same financial cost for false positives and false negatives! That is not the case in fraud detection. False positives: when a transaction is predicted as fraudulent but is in fact legitimate, there is an administrative cost. False negatives: when a fraud goes undetected, the amount of that transaction is lost. 46
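To make the limitation concrete, the sketch below computes accuracy and the F1 score from confusion-matrix counts using their standard definitions, Accuracy = (TP + TN) / (TP + FP + TN + FN) and F1 = 2TP / (2TP + FP + FN). The helper names and the toy labels are assumptions for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    # Convention used here: F1 is 0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy case: 5 frauds in 1,000 transactions, and a "model"
# that labels every transaction as legitimate.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
print(acc, f1)
```

With heavily imbalanced classes, a model that never flags fraud still scores very high accuracy, and neither metric accounts for how much money each missed fraud actually cost.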
  47. 47. Cost-Sensitivity Cost Matrix

                                Actual Fraud                      Actual Legitimate
        Predicted Fraud         $C_{TP} = C_a$                    $C_{FP} = C_a$
        Predicted Legitimate    $C_{FN} = \mathrm{AMT}_i$         $C_{TN} = 0$

        $\text{Cost}(f(S)) = \sum_{i=1}^{N} \left[ y_i (1 - c_i)\,\mathrm{AMT}_i + c_i C_a \right]$
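The cost function can be sketched in a few lines of Python. This assumes, following the cost matrix, that y_i is the true label, c_i is the predicted label (1 = fraud), AMT_i is the transaction amount and C_a is a fixed administrative cost. The function name `total_cost` and the example numbers are hypothetical.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum over transactions of y_i * (1 - c_i) * AMT_i + c_i * C_a."""
    cost = 0.0
    for y, c, amt in zip(y_true, y_pred, amounts):
        missed_fraud = y * (1 - c) * amt  # false negative: the amount is lost
        investigation = c * admin_cost    # any fraud alert costs C_a
        cost += missed_fraud + investigation
    return cost


# Hypothetical example with four transactions and C_a = 5.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [100.0, 250.0, 40.0, 60.0]

model_cost = total_cost(y_true, y_pred, amounts, admin_cost=5.0)
# Baseline: flag nothing, so every fraud amount is lost.
baseline_cost = total_cost(y_true, [0, 0, 0, 0], amounts, admin_cost=5.0)
print(model_cost, baseline_cost)
```

Comparing a model's cost against the flag-nothing baseline gives a measure of savings that reflects transaction amounts, which accuracy and F1 do not.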
  48. 48. Feature Engineering Raw Features 48
  49. 49. Feature Engineering Transaction aggregated features 49
  50. 50. Feature Engineering Periodic Features 50
  51. 51. Feature Engineering Social Networks Analysis 51
  52. 52. Finally - Some Models Data Large European Card Processing company 2012 & 2013 card present transactions 20 Million transactions 40,000 frauds 2 Million Euros in losses in the test set 52
  53. 53. Finally - Some Models Algorithms: Fuzzy Rules, Neural Networks, Naive Bayes, Random Forests, Random Forests with Cost-Proportionate Sampling, Cost-Sensitive Random Patches Decision Trees 53
  54. 54. Finally - Some Models 54
  55. 55. Takeaways 55
  56. 56. How could you learn more? 56
  57. 57. How could you learn more? 57
  58. 58. How could you learn more? 58
  59. 59. Embrace open-source 59
  60. 60. Support open-source 60
  61. 61. Modern Data Scientist The sexiest job of the 21st century 61
  62. 62. Thank You! @albahnsen albahnsen.com 62