Data Science: De la Matemática a la Práctica (Data Science: From Mathematics to Practice)

Modern Data Science
Alejandro Correa Bahnsen
August 2016
@albahnsen

Who am I?
Data Scientist
PhD in Machine Learning
Interested in Big Data Engineering
Passionate about open-source
Scikit-Learn contributor :)
Organizer of the Bogota Big Data Science Meetup

Where I work
Lead Data Scientist working on applying
Machine Learning for Security Informatics

Aims of this talk
Discuss what a Modern Data Scientist is
(And what is not)

It's 2016 and there is still no unique definition of Data Science

“A data scientist is a statistician who lives in San Francisco.”
“Data Science is statistics on a Mac.”

Data Science is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...

Even worse, people use several words interchangeably

Let's focus only on modern data science

Data Science is the intersection of
Hacking Skills, Math & Statistics
Knowledge and Substantive Expertise
Those are the pillars of data science: computing,
statistics, mathematics and quantitative disciplines
combined to analyze data for better decision making

Hacking Skills
Ability to build things and find clever solutions to problems.
Programming/Coding: Python and R (and others)
Databases: MySQL, PostgreSQL, Cassandra,
MongoDB and CouchDB.
Visualization: D3, Tableau, Qlikview and Markdown.
Big Data: Hadoop, MapReduce and Spark.

Hacking Skills
http://www.kdnuggets.com/2016/06/r-python-top-analytics-data-mining-data-science-software.html

Math & Statistics
Being able to understand which solution is right for each problem
Linear algebra: Matrix manipulation
Machine Learning: Random Forests, SVM, Boosting
Descriptive statistics: Describe, Cluster
Statistical inference: Generate new knowledge

Substantive Expertise
The ability to ask good questions requires domain understanding; that is why a data scientist can't create data-based solutions without good industry knowledge.
Is this A or B or C? (classification)
Is this weird? (anomaly detection)
How much/how many? (regression)
How is it organized? (clustering)
What should I do next? (reinforcement learning)

Fraud Detection
Estimate the probability that a transaction is fraudulent, based on customer patterns and recent fraudulent behavior
Issues when constructing a fraud detection system:
Class Imbalance
Cost-sensitivity
Short response time of the system
Dimensionality of the search space
Feature preprocessing
Model selection

Class Imbalance
Fraudulent transactions represent between 0.01% and 0.5% of all transactions
Create a balanced dataset using:
Under sampling
Over sampling
TomekLinks sampling
Condensed Nearest Neighbor
NearMiss
Synthetic Minority Over-sampling Technique (SMOTE)
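As a rough illustration of the first option in the list, random undersampling can be sketched in plain Python. The function name `random_undersample`, the toy data and the 1:1 target ratio are assumptions made for this example, not details from the talk.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Keep every fraud row plus an equally sized random sample of legitimate rows."""
    rng = random.Random(seed)
    fraud_idx = [i for i, y in enumerate(labels) if y == 1]
    legit_idx = [i for i, y in enumerate(labels) if y == 0]
    # Sample without replacement; never request more rows than exist.
    kept_legit = rng.sample(legit_idx, min(len(fraud_idx), len(legit_idx)))
    kept = sorted(fraud_idx + kept_legit)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 2 frauds among 1,000 transactions (0.2%).
labels = [1 if i in (10, 500) else 0 for i in range(1000)]
rows = [[float(i)] for i in range(1000)]
X_bal, y_bal = random_undersample(rows, labels)
```

The balanced sample discards most legitimate transactions, which is the usual trade-off of undersampling: training is faster and the classes are even, but information about the majority class is lost.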

Class Imbalance
Synthetic Minority Over-sampling Technique (SMOTE)
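The core idea of SMOTE, creating new minority-class points by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration, not a library implementation: the function name `smote_sketch`, the parameter `k=2` and the toy points are assumptions.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points on segments between minority-class neighbours."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical fraud examples described by two features.
frauds = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
new_points = smote_sketch(frauds, n_new=5)
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class.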

Cost-Sensitivity
Typical evaluation of a classification model:

                      Actual Fraud          Actual Legitimate
Predicted Fraud       True Positives (TP)   False Positives (FP)
Predicted Legitimate  False Negatives (FN)  True Negatives (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$F_1\ \text{Score} = \frac{2\,TP}{2\,TP + FP + FN}$$
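A small worked example, using the standard definitions of accuracy and F1 score, shows why accuracy alone can mislead on imbalanced data. The counts below are hypothetical: a model that never flags fraud on 1,000 transactions containing 5 frauds.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: the model predicts "legitimate" for every transaction.
tp, fp, tn, fn = 0, 0, 995, 5
acc = accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
```

Here accuracy is 99.5% even though not a single fraud is caught, while the F1 score is 0.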

Cost-Sensitivity
This evaluation assumes the same financial cost for false positives and false negatives!
Not the case in fraud detection:
False positives: predicting a transaction as fraudulent when in fact it is not incurs an administrative cost.
False negatives: failing to detect a fraud means the amount of that transaction is lost.
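One consequence of this asymmetry can be sketched as a decision rule. Assume a model outputs a fraud probability and that every flagged transaction incurs the administrative cost; then flagging is worthwhile only when the expected loss of letting the transaction through exceeds that cost. The function name `should_flag` and all values are hypothetical, and this rule is an illustration rather than the method used in the talk.

```python
def should_flag(p_fraud, amount, admin_cost):
    """Flag when the expected loss of ignoring the transaction exceeds the review cost."""
    expected_loss_if_ignored = p_fraud * amount
    return expected_loss_if_ignored > admin_cost

# Hypothetical values: same 5% fraud probability, 10 euro administrative cost.
small = should_flag(0.05, amount=50.0, admin_cost=10.0)    # expected loss 2.5
large = should_flag(0.05, amount=1000.0, admin_cost=10.0)  # expected loss about 50
```

With identical fraud probabilities, the small transaction is let through and the large one is flagged, which a single probability threshold would not do.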

Cost-Sensitivity
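Cost Matrix

                      Actual Fraud         Actual Legitimate
Predicted Fraud       C_{TP_i} = C_a       C_{FP_i} = C_a
Predicted Legitimate  C_{FN_i} = Amt_i     C_{TN_i} = 0

$$\text{Cost}(f(S)) = \sum_{i=1}^{N} \left[ y_i (1 - c_i)\, Amt_i + c_i\, C_a \right]$$

where y_i is the actual label and c_i the predicted label of transaction i (1 = fraud), Amt_i is the transaction amount and C_a is the administrative cost.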
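The total cost of a set of predictions can be computed directly from this cost matrix: each missed fraud costs its transaction amount and each flagged transaction costs the administrative cost. The sketch below assumes labels are encoded as 1 for fraud and 0 for legitimate; the function name `total_cost` and the sample values are hypothetical.

```python
def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum y_i * (1 - c_i) * Amt_i + c_i * C_a over all transactions."""
    return sum(
        y * (1 - c) * amt + c * admin_cost
        for y, c, amt in zip(y_true, y_pred, amounts)
    )

# Hypothetical transactions: one of each outcome (TP, FN, FP, TN).
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [200.0, 300.0, 50.0, 80.0]
cost = total_cost(y_true, y_pred, amounts, admin_cost=10.0)
```

In this example the single missed fraud (300) dominates the two administrative costs (10 each), for a total of 320.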

Feature Engineering
Raw Features

Feature Engineering
Transaction aggregated features

Feature Engineering
Periodic Features

Feature Engineering
Social Networks Analysis
51

Finally - Some Models
Data
Large European Card Processing company
2012 & 2013 card present transactions
20 Million transactions
40,000 frauds
2 Million Euros in losses in the test set

Finally - Some Models
Algorithms
Fuzzy Rules
Neural Networks
Naive Bayes
Random Forests
Random Forests with Cost-Proportionate Sampling
Cost-Sensitive Random Patches Decision Trees

Modern Data Scientist
The sexiest job of the 21st century

Thank You!
@albahnsen
albahnsen.com
