Data scientists are ranked among the most in-demand professionals in today's job market. Unfortunately, there are not enough sufficiently qualified candidates to meet this demand. This may be due to the complexity of the skills the profession requires, which include mathematics, statistics, computing, and management. Through real-life examples, this talk aims to show that successfully completing a data science project is possible. The process requires understanding the business problem, applying suitable mathematical or statistical models, and implementing the solution correctly.
2. Who am I?
Data Scientist
PhD in Machine Learning
Interested in Big Data Engineering
Passionate about open-source
Scikit-Learn contributor :)
Organizer of the Bogota Big Data Science Meetup
9. “A data scientist is a statistician
who lives in San Francisco.”
“Data Science is statistics on a
Mac.”
10. Data Science is like teenage sex:
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it...
19. Data Science is the intersection of
Hacking Skills, Math & Statistics
Knowledge and Substantive Expertise
Those are the pillars of data science: computing,
statistics, mathematics and quantitative disciplines
combined to analyze data for better decision making
20. Hacking Skills
Ability to build things and find clever solutions to
problems.
Programming/Coding: Python and R (and others)
Databases: MySQL, PostgreSQL, Cassandra,
MongoDB and CouchDB.
Visualization: D3, Tableau, Qlikview and Markdown.
Big Data: Hadoop, MapReduce and Spark.
24. Math & Statistics
Being able to understand the right solution to each
problem
Linear algebra: Matrix manipulation
Machine Learning: Random Forests, SVM, Boosting
Descriptive statistics: Describe, Cluster
Statistical inference: Generate new knowledge
26. Substantive Expertise
The ability to ask good questions requires domain
understanding; that is why a data scientist can't create
data-based solutions without good industry knowledge.
Is this A or B or C? (classification)
Is this weird? (anomaly detection)
How much/how many? (regression)
How is it organized? (clustering)
What should I do next? (reinforcement learning)
41. Fraud Detection
Estimate the probability that a transaction is fraudulent,
based on customer patterns and recent fraudulent
behavior
Issues when constructing a fraud detection system:
Class Imbalance
Cost-sensitivity
Short response time of the system
Dimensionality of the search space
Feature preprocessing
Model selection
43. Class Imbalance
Fraudulent transactions represent between 0.01% and
0.5% of all transactions
Create a balanced dataset using:
Under sampling
Over sampling
TomekLinks sampling
Condensed Nearest Neighbor
NearMiss
Synthetic Minority Over-sampling Technique (SMOTE)
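As an illustration of the simplest option in this list, random under-sampling can be sketched as follows. This is a minimal sketch, not the speaker's implementation: the function name `random_under_sample`, the toy data, and the convention that `y == 1` marks fraud are all assumptions made for the example.

```python
import numpy as np

def random_under_sample(X, y, seed=0):
    """Keep every fraud row (y == 1) and an equal-sized random subset
    of legitimate rows (y == 0). Assumes fraud is the minority class."""
    rng = np.random.default_rng(seed)
    fraud_idx = np.flatnonzero(y == 1)
    legit_idx = np.flatnonzero(y == 0)
    # Draw as many legitimate rows as there are frauds, without replacement
    keep_legit = rng.choice(legit_idx, size=len(fraud_idx), replace=False)
    keep = np.sort(np.concatenate([fraud_idx, keep_legit]))
    return X[keep], y[keep]

# Hypothetical toy data: 1,000 transactions, 5 of them fraudulent (0.5%)
toy_rng = np.random.default_rng(42)
X = toy_rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:5] = 1

X_bal, y_bal = random_under_sample(X, y)
print(X_bal.shape, int(y_bal.sum()))
```

Under-sampling discards most legitimate transactions, which is why the other methods on the slide choose more carefully which majority rows to drop or generate additional minority examples instead.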
45. Cost-Sensitivity
Typical evaluation of a classification model:
                        Actual Fraud            Actual Legitimate
Predicted Fraud         True Positives (TP)     False Positives (FP)
Predicted Legitimate    False Negatives (FN)    True Negatives (TN)
$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
$F_1\ \text{Score} = \frac{2\,TP}{2\,TP + FP + FN}$
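The two metrics can be computed directly from the four confusion-matrix counts, using the standard definitions of accuracy and F1 score. The counts below are hypothetical, chosen only to show how accuracy can look excellent on imbalanced data while the F1 score stays modest.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all transactions that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical counts: 10,000 transactions, 50 of them actual frauds
tp, fp, tn, fn = 30, 20, 9930, 20

acc = accuracy(tp, fp, tn, fn)
f1 = f1_score(tp, fp, fn)
print(f"accuracy={acc:.3f}, f1={f1:.3f}")
```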
46. Cost-Sensitivity
This evaluation assumes the same financial cost for false
positives and false negatives!
Not the case in fraud detection:
False positives: when a transaction is predicted as
fraudulent but is in fact legitimate, there is an
administrative cost.
False negatives: when a fraud goes undetected, the amount
of that transaction is lost.
47. Cost-Sensitivity
Cost Matrix
                        Actual Fraud        Actual Legitimate
Predicted Fraud         $C_{TP} = C_a$      $C_{FP} = C_a$
Predicted Legitimate    $C_{FN} = Amt_i$    $C_{TN} = 0$
$Cost(f(S)) = \sum_{i=1}^{N} \left[ y_i (1 - c_i)\, Amt_i + c_i\, C_a \right]$
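The cost measure on this slide can be sketched in a few lines, reading it as the sum over transactions of y_i (1 - c_i) Amt_i + c_i C_a. The interpretation of the symbols is an assumption here: y_i is taken as the actual label (1 = fraud), c_i as the predicted label, Amt_i as the transaction amount, and C_a as the fixed administrative cost. The function name `total_cost` and the sample values are hypothetical.

```python
import numpy as np

def total_cost(y_true, y_pred, amounts, admin_cost):
    """Sum of per-transaction costs:
    missed fraud (y=1, c=0) costs the transaction amount;
    any predicted fraud (c=1) costs the administrative cost;
    a correctly passed legitimate transaction costs nothing."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    amounts = np.asarray(amounts, dtype=float)
    per_txn = y_true * (1 - y_pred) * amounts + y_pred * admin_cost
    return float(per_txn.sum())

# Hypothetical example: one TP, one FN, one FP, one TN
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
amounts = [100, 250, 40, 60]

cost = total_cost(y_true, y_pred, amounts, admin_cost=5)
print(cost)  # 5 (TP) + 250 (FN) + 5 (FP) + 0 (TN) = 260.0
```

Unlike accuracy, this measure weights each error by what it costs, so a single missed high-value fraud can outweigh many false alarms.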
52. Finally - Some Models
Data
Large European Card Processing company
2012 & 2013 card-present transactions
20 Million transactions
40,000 frauds
2 Million Euros in losses in the test set
53. Finally - Some Models
Algorithms
Fuzzy Rules
Neural Networks
Naive Bayes
Random Forests
Random Forests with Cost-Proportionate Sampling
Cost-Sensitive Random Patches Decision Trees