Modern Data Science
Alejandro Correa Bahnsen
June 2016
@albahnsen

Presentation on Modern Data Science

Data scientists are in high demand. There is simply not enough talent to fill the jobs. Why? Because the sexiest job of the 21st century requires a broad, multidisciplinary mix of skills at the intersection of mathematics, statistics, computer science, communication and business. Finding a data scientist is hard. Finding people who understand what a data scientist is, is equally hard.

Watch the video (in Spanish) here: https://www.youtube.com/watch?v=R3jeBHLLiiM


Modern Data Science

  1. Modern Data Science. Alejandro Correa Bahnsen, June 2016, @albahnsen
  2. Who am I?
     • Data Scientist
     • PhD in Machine Learning
     • Interested in Big Data Engineering
     • Passionate about open-source
     • Scikit-Learn contributor :)
     • Organizer of the Bogota Big Data Science Meetup
  3. Who I've worked with
  4. Where I work: Lead Data Scientist working on applying Machine Learning for Security Informatics
  5. Aims of this talk: discuss what a Modern Data Scientist is (and what it is not)
  7. It's 2016 and there is still no unique definition of Data Science
  9. “A data scientist is a statistician who lives in San Francisco.” “Data Science is statistics on a Mac.”
  10. Data Science is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...
  11. Even worse, people use several words interchangeably
  16. Let's focus only on modern data science
  17. So what is Data Science?
  18. Data Science
  19. Data Science is the intersection of Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise. Those are the pillars of data science: computing, statistics, mathematics and quantitative disciplines combined to analyze data for better decision making.
  20. Hacking Skills: ability to build things and find clever solutions to problems.
      • Programming/Coding: Python and R (and others)
      • Databases: MySQL, PostgreSQL, Cassandra, MongoDB and CouchDB
      • Visualization: D3, Tableau, QlikView and Markdown
      • Big Data: Hadoop, MapReduce and Spark
  21. Hacking Skills
  22. Hacking Skills (http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html)
  23. Hacking Skills (http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html)
  24. Math & Statistics: being able to understand the right solution to each problem.
      • Linear algebra: matrix manipulation
      • Machine Learning: Random Forests, SVM, Boosting
      • Descriptive statistics: describe, cluster
      • Statistical inference: generate new knowledge
  25. Math & Statistics
  26. Substantive Expertise: the ability to ask good questions requires domain understanding; that’s why a data scientist can’t create data-based solutions without good industry knowledge.
      • Is this A or B or C? (classification)
      • Is this weird? (anomaly detection)
      • How much/how many? (regression)
      • How is it organized? (clustering)
      • What should I do next? (reinforcement learning)
  27. How did we get here
  28. Data Science Examples
  29. Netflix Prize
  30. Google Flu Trends
  31. Creating a Rembrandt
  32. Obama campaign
  33. Moneyball
  34. AlphaGo
  35. My recent experience
  36. Phishing Detection
  37. Malware Identification
  38. Man-in-the-Browser Attacks
  39. Intrusion Detection
  40. Fraud Detection
  41. Fraud Detection: estimate the probability of a transaction being fraudulent, based on customer patterns and recent fraudulent behavior. Issues when constructing a fraud detection system:
      • Class imbalance
      • Cost-sensitivity
      • Short time response of the system
      • Dimensionality of the search space
      • Feature preprocessing
      • Model selection
  42. Fraud Detection
  43. Class Imbalance: fraudulent transactions represent between 0.01% and 0.5% of all transactions. Create a balanced dataset using:
      • Under-sampling
      • Over-sampling
      • TomekLinks sampling
      • Condensed Nearest Neighbor
      • NearMiss
      • Synthetic Minority Over-sampling (SMOTE)
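The simplest of the strategies listed on this slide, random under-sampling, can be sketched as follows. This is a minimal illustration with NumPy, not the code used in the talk; the function name `random_undersample` and the label convention (1 = fraud, 0 = legitimate, with fraud as the minority class) are assumptions made for the example.

```python
import numpy as np

def random_undersample(X, y, seed=0):
    """Keep every fraud row and an equally sized random subset of legitimate rows.

    Assumes y uses 1 for fraud (minority) and 0 for legitimate (majority).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    # Draw as many legitimate rows as there are frauds, without replacement.
    n_keep = min(len(fraud_idx), len(legit_idx))
    kept_legit = rng.choice(legit_idx, size=n_keep, replace=False)
    idx = np.concatenate([fraud_idx, kept_legit])
    rng.shuffle(idx)  # avoid having all frauds first
    return X[idx], y[idx]

# Hypothetical data: 1,000 transactions, 5 of them fraudulent (0.5%).
X = np.arange(2000).reshape(1000, 2)
y = np.zeros(1000, dtype=int)
y[:5] = 1
X_bal, y_bal = random_undersample(X, y)
print(len(y_bal), int(y_bal.sum()))
```

Under-sampling discards most of the legitimate transactions, which is the trade-off that motivates the other techniques in the list.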
  44. Class Imbalance: Synthetic Minority Over-sampling Technique (SMOTE)
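SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below illustrates that idea with NumPy only; it is a simplified illustration, not a reference implementation, and the name `smote_sketch` and its parameters are hypothetical. It needs at least two minority samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points from the minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbors than other points
    # Pairwise squared Euclidean distances between minority samples.
    d = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # shape (n, k)
    # Pick a random base point and one of its k nearest neighbors.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    # Place the synthetic point at a random position on the segment between them.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

# Hypothetical minority samples on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(X_min, n_new=20, k=2)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.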
  45. Cost-Sensitivity. Typical evaluation of a classification model:

                             Actual Fraud           Actual Legitimate
      Predicted Fraud        True Positives (TP)    False Positives (FP)
      Predicted Legitimate   False Negatives (FN)   True Negatives (TN)

      $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
      $F_1\ \text{Score} = \frac{2\,TP}{2\,TP + FP + FN}$
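A short sketch of these two measures, using the standard definitions Accuracy = (TP + TN) / (TP + FP + TN + FN) and F1 = 2TP / (2TP + FP + FN). The counts in the example are hypothetical and chosen only to show how accuracy behaves under heavy class imbalance.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all transactions that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical case: 10,000 transactions, 20 frauds, and a model that
# labels every transaction as legitimate.
acc = accuracy(tp=0, fp=0, tn=9980, fn=20)
f1 = f1_score(tp=0, fp=0, fn=20)
print(acc, f1)
```

In this made-up case the model catches no fraud at all, yet its accuracy is 99.8%, while the F1 score is 0.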
  46. Cost-Sensitivity: typical evaluation assumes the same financial cost for false positives and false negatives! That is not the case in fraud detection:
      • False positives: when a transaction is predicted as fraudulent but is in fact legitimate, there is an administrative cost.
      • False negatives: when a fraud goes undetected, the amount of that transaction is lost.
  47. Cost-Sensitivity. Cost Matrix:

                             Actual Fraud          Actual Legitimate
      Predicted Fraud        $C_{TP} = C_a$        $C_{FP} = C_a$
      Predicted Legitimate   $C_{FN} = AMT_i$      $C_{TN} = 0$

      $Cost(f(S)) = \sum_{i=1}^{N} \left( y_i (1 - c_i) AMT_i + c_i C_a \right)$
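The cost measure on this slide can be sketched in a few lines. The sketch assumes, consistent with the cost matrix, that y_i is the true label (1 = fraud), c_i is the predicted label (1 = predicted fraud), AMT_i is the transaction amount and C_a is the administrative cost of handling an alert. The function name `total_cost` and the example values are hypothetical.

```python
import numpy as np

def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum over transactions of y_i * (1 - c_i) * AMT_i + c_i * C_a."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    amounts = np.asarray(amounts, dtype=float)
    # Missed frauds cost the transaction amount; every alert costs admin_cost.
    per_transaction = y_true * (1 - y_pred) * amounts + y_pred * admin_cost
    return float(per_transaction.sum())

# Four hypothetical transactions: one TP, one FN, one FP, one TN.
cost = total_cost(
    y_true=[1, 1, 0, 0],
    y_pred=[1, 0, 1, 0],
    amounts=[100, 250, 40, 60],
    admin_cost=5,
)
print(cost)  # 5 (TP) + 250 (FN) + 5 (FP) + 0 (TN) = 260.0
```

Unlike accuracy, this measure weights each missed fraud by the money actually lost, so a single large false negative can outweigh many false positives.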
  48. Feature Engineering: Raw Features
  49. Feature Engineering: Transaction aggregated features
  50. Feature Engineering: Periodic Features
  51. Feature Engineering: Social Networks Analysis
  52. Finally - Some Models. Data:
      • Large European card processing company
      • 2012 & 2013 card-present transactions
      • 20 million transactions
      • 40,000 frauds
      • 2 million euros in losses in the test set
  53. Finally - Some Models. Algorithms:
      • Fuzzy Rules
      • Neural Networks
      • Naive Bayes
      • Random Forests
      • Random Forests with Cost-Proportionate Sampling
      • Cost-Sensitive Random Patches Decision Trees
  54. Finally - Some Models
  55. Takeaways
  56. How could you learn more?
  57. How could you learn more?
  58. How could you learn more?
  59. Embrace open-source
  60. Support open-source
  61. Modern Data Scientist: the sexiest job of the 21st century
  62. Thank You! @albahnsen, albahnsen.com