LEARNING FROM
BIOMETRICS
To prevent #CyberSecurity 🕵 threats
Valerio Maggio
@leriomaggio Data Scientist & Pythonista @ FBK
vmaggio@fbk.eu
SORRY, WHO?
• Post Doc Researcher
• Background in CS
• Interested in Machine & Deep Learning
• Core in Biomedicine & Environment
here
We’re looking for students for
Internship & (PhD) Thesis
• Applied Machine Learning (a.k.a. Data Science)
https://mpbalab.fbk.eu
• Geek & Nerd
• Fellow Pythonista since 2006
this is a better me !-)
SORRY, WHO?
100K points if
you get this pun !-)
github.com/leriomaggio
Machine
Learning
BUZZWORDS
WHAT THE CLOUDS SAY
WHAT THE CLOUDS KEEP
SAYING…
WHAT THE CLOUDS STILL
SAY…
WHAT THE CLOUDS
FINALLY SAY!
Learning from Data
for future predictions
MACHINE LEARNING FLAVOURS
SUPERVISED SETTING
• Input data are accompanied by
labels the ML model can learn from
• In other words, labels are the reference
the model uses to estimate the expected
outcomes
DIGITS CLASSIFICATION
Labels are
Categories
HOUSE PRICES ESTIMATION
Labels are
Real numbers
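The two supervised examples above could be sketched as follows with scikit-learn. This is only an illustration: the choice of library, the linear models, and the synthetic regression data standing in for house prices are all assumptions, not taken from the slides.

```python
from sklearn.datasets import load_digits, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

# Classification: labels are categories (the digits 0-9).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
digit_accuracy = clf.score(X_test, y_test)  # fraction of correct digits

# Regression: labels are real numbers.
# Synthetic data is used here as a stand-in for a house-price dataset.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
price_r2 = reg.score(Xr_test, yr_test)  # R^2 on held-out data
```

In both cases the model only sees the labels of the training split; the score is computed on data it has not seen.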
UNSUPERVISED
SETTING
• No label is provided
• Learning directly from data
• e.g. Clustering
CLUSTERING
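A minimal clustering sketch, assuming scikit-learn and synthetic 2-D points (neither appears in the slides). Note that no labels are passed to the model: the groups are found from the data alone.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic points drawn around three centres; the true labels are discarded.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# K-Means is one of many clustering algorithms; it is used here for illustration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_  # one cluster index per input point
```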
WHAT THE CLOUDS
FINALLY SAY!
Let’s play with
all of this!
APPLIED ML IN 5 STEPS
• Collect the Data
1. Look at the Data & Clean the Data
2. Prepare the data
3. Train your model(s)
4. Predict with your best model on unseen data
(namely: data NOT used in training)
5. Deploy your system in production
TWO COMMON FRAUDS
Account Hijacking
Card Faking
TWO COMMON FRAUDS
Account Hijacking
User Identification
USER IDENTIFICATION
KEYSTROKE DYNAMICS
Keystroke dynamics consists of analysing the way a user types by monitoring
keyboard inputs thousands of times per second and processing this data through an
algorithm, which then defines a pattern for future comparison
Identifying an individual based on their way of typing on a physical or virtual keyboard
KEYSTROKE DYNAMICS
Time between two key presses
Time between one press and one release
Time between one release and the next press
Time between two key releases
Intuition:
Users have unique ways to
type on keyboards
(i.e. typing patterns)
KEYSTROKE DYNAMICS
Down-Down Time: time between two key presses
Dwell Time: time between one press and one release
Flight Time: time between one release and the next press
Up-Up Time: time between two key releases
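The four timings above can be computed from raw key events roughly as follows. This is a sketch under stated assumptions: the event format `(key, press_time, release_time)` in milliseconds and the function name `keystroke_features` are hypothetical, not the format used in the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical event format: (key, press_time_ms, release_time_ms), in typing order.
KeyEvent = Tuple[str, int, int]


def keystroke_features(events: List[KeyEvent]) -> Dict[str, List[int]]:
    """Compute the four keystroke timings for a sequence of key events."""
    # Dwell: how long each key is held down (press to release of the same key).
    dwell = [release - press for _, press, release in events]
    # The remaining timings are defined on consecutive pairs of keys.
    pairs = list(zip(events, events[1:]))
    down_down = [nxt[1] - cur[1] for cur, nxt in pairs]  # press to next press
    flight = [nxt[1] - cur[2] for cur, nxt in pairs]     # release to next press
    up_up = [nxt[2] - cur[2] for cur, nxt in pairs]      # release to next release
    return {"dwell": dwell, "down_down": down_down, "flight": flight, "up_up": up_up}


events = [("p", 0, 80), ("y", 150, 210), ("t", 300, 390)]
features = keystroke_features(events)
# features["dwell"]     -> [80, 60, 90]
# features["down_down"] -> [150, 150]
# features["flight"]    -> [70, 90]
# features["up_up"]     -> [130, 180]
```

A typed word of n keys yields n dwell times and n - 1 values for each of the other three timings, which together form one typing pattern.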
LOOKING FOR
ANOMALIES
DATA COLLECTION
Down-Down Time: time between two key presses
Dwell Time: time between one press and one release
Flight Time: time between one release and the next press
Up-Up Time: time between two key releases
• Dataset Statistics:
• 50 different users
• 450+ patterns each
STEP 1: LOOK AT
THE DATA AND
CLEAN THEM
UP-UP TIME - USERNAME FIELD - WEB VS APP
UP-UP TIME - PASSWORD FIELD - WEB VS APP
DWELL TIME - USERNAME FIELD - WEB VS APP
DWELL TIME - PASSWORD FIELD - WEB VS APP
DATA
CLEANING
Complexity-Invariant
Distance Measure
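For reference, the complexity-invariant distance (CID) is usually defined as the Euclidean distance between two series multiplied by a correction factor, the ratio of the larger to the smaller complexity estimate, where the complexity of a series is the length of the line through its consecutive points. The sketch below follows that common definition; how it was applied to the keystroke data is not stated in the slides, and the handling of constant series (the `eps` guard) is an assumption of this example.

```python
import numpy as np


def complexity_estimate(series: np.ndarray) -> float:
    """Complexity of a series: sqrt of the sum of squared consecutive differences."""
    return float(np.sqrt(np.sum(np.diff(series) ** 2)))


def cid_distance(q, c, eps: float = 1e-12) -> float:
    """Complexity-invariant distance between two equal-length series."""
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    euclidean = float(np.linalg.norm(q - c))
    high = max(complexity_estimate(q), complexity_estimate(c))
    low = min(complexity_estimate(q), complexity_estimate(c))
    if high == 0.0:
        # Both series are constant: no complexity correction to apply (assumption).
        return euclidean
    # eps guards against division by zero when only one series is constant (assumption).
    return euclidean * high / max(low, eps)


smooth = [0, 1, 2, 3]   # complexity sqrt(3)
jagged = [0, 2, 0, 2]   # complexity sqrt(12)
distance = cid_distance(smooth, jagged)  # Euclidean sqrt(6) times correction factor 2
```

The correction factor is always at least 1, so series of very different complexity are pushed further apart than plain Euclidean distance would place them.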
FEATURE SCALING (NORMALISATION)
Original
Feature Data
MinMax Scaling
Standard Scaling
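The two scalings named above could be applied with scikit-learn as in this sketch. The small feature matrix is made up for illustration (think of two timing features in milliseconds), and the library choice is an assumption.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up feature matrix: rows are patterns, columns are two timing features (ms).
X = np.array([[80.0, 150.0],
              [60.0, 150.0],
              [90.0, 210.0],
              [70.0, 130.0]])

# MinMax scaling maps each column to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Standard scaling gives each column zero mean and unit variance.
X_standard = StandardScaler().fit_transform(X)
```

In practice the scaler should be fitted on the training data only and then reused to transform the test data, so that no information leaks from the test set.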
STEP 2: PREPARE
THE DATA
TRAIN-TEST CUT
WHAT WE DO
WHAT WE REALLY DO
K-Fold Cross Validation
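A single train-test cut and K-Fold cross validation can be compared side by side as in this sketch. It assumes scikit-learn, a synthetic dataset, and a logistic regression model, none of which come from the slides.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Single train-test cut: one score, which depends on how the cut happened to fall.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
holdout_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).score(X_test, y_test)

# K-Fold cross validation: every sample is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
mean_score, score_spread = fold_scores.mean(), fold_scores.std()
```

The spread of the fold scores gives a rough idea of how much a single cut could mislead.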
STEP 3-4: TRAIN
AND TEST ML
MODEL
Deep AutoEncoder
Encoder Decoder
…
Classification Deep Network
One AutoEncoder + FC Network
Outlier Detector (per user)
DEEPKS
Deep AutoEncoder
Encoder Decoder
DEEPKS
1. AUTOENCODER
Trained on genuine keystroke patterns
Unsupervised Machine (Deep) Learning
Deep AutoEncoder
Encoder Decoder
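The idea of training an autoencoder on genuine patterns only, then treating a high reconstruction error as a sign of an impostor, can be sketched as below. This is not the DeepKS model: to keep the example small, scikit-learn's `MLPRegressor` with a narrow middle layer is swapped in for a deep autoencoder, the data is synthetic, and the 99th-percentile threshold is an arbitrary assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
mixing = rng.normal(size=(2, 10))  # genuine patterns live near a 2-D subspace


def genuine_patterns(n: int) -> np.ndarray:
    """Synthetic stand-in for one user's typing patterns (10 timing features)."""
    latent = rng.normal(size=(n, 2))
    return latent @ mixing + 0.05 * rng.normal(size=(n, 10))


X_train = genuine_patterns(500)           # used to fit the autoencoder
X_genuine = genuine_patterns(200)         # unseen genuine patterns
X_impostor = 3.0 * rng.normal(size=(200, 10))  # patterns with a different structure

# Stand-in autoencoder: the network is trained to reproduce its own input
# through a 2-unit bottleneck (encoder 8 -> 2, decoder 2 -> 8).
autoencoder = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=2000, random_state=0)
autoencoder.fit(X_train, X_train)


def reconstruction_error(X: np.ndarray) -> np.ndarray:
    """Mean squared reconstruction error per pattern."""
    return np.mean((autoencoder.predict(X) - X) ** 2, axis=1)


# Flag a pattern as an outlier when its error exceeds what genuine data produces.
threshold = np.quantile(reconstruction_error(X_train), 0.99)
genuine_flagged = np.mean(reconstruction_error(X_genuine) > threshold)
impostor_flagged = np.mean(reconstruction_error(X_impostor) > threshold)
```

Because one such detector is fitted per user, no impostor data is needed for this step; the threshold is derived from the user's own genuine patterns.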
DEEPKS
2. DISCRIMINATOR
Trained on genuine &
adversarial patterns
EVALUATION METRICS
Confusion Matrix
over ~5200 samples
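A confusion matrix can be computed as in the sketch below, here on a handful of made-up labels rather than the ~5200 samples from the talk. The encoding 1 = genuine user, 0 = impostor is an assumption, and the false acceptance and false rejection rates are added because they are common in biometrics, not because the slides report them.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = genuine user, 0 = impostor (assumed encoding).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes, in label order [0, 1].
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

false_acceptance_rate = fp / (fp + tn)  # impostors accepted as genuine
false_rejection_rate = fn / (fn + tp)   # genuine users rejected
# matrix -> [[3, 1],
#            [1, 3]]
```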
SAMPLE
SIZE TEST
Q: How many patterns would I
need to be confident about the
accuracy of the model?
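One common way to approach this question is a learning curve: train on growing subsets of the data and watch where the validation accuracy levels off. The sketch below assumes scikit-learn, synthetic data, and a logistic regression model; the slides do not say which procedure was actually used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Train on 10% .. 100% of each training fold and score on the held-out fold.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
    cv=5,
    shuffle=True,
    random_state=0,
)

# Average validation accuracy for each training-set size.
mean_test_accuracy = test_scores.mean(axis=1)
```

When the mean validation accuracy stops improving as the training size grows, collecting more patterns is unlikely to help much.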
Feature Importance
rf.fit(X,y_DL)
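The call above suggests fitting a random forest `rf` on the features `X` against `y_DL`, which reads like the labels produced by the deep learning model; that interpretation is an assumption. Under it, the forest acts as a surrogate whose feature importances hint at which features drive the deep model's decisions. A sketch with synthetic data, where the stand-in for `y_DL` depends on one feature only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))  # six made-up features

# Stand-in for the deep model's predictions: they depend only on feature 2.
y_DL = (X[:, 2] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y_DL)

# Importances sum to 1; higher means the feature was more useful for the splits.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
```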
STEP 5: DEPLOY
YOUR SOLUTION
[Architecture diagram. Component labels: Data Collector, Feature Extraction, Feature Database, Feature Detection, Orchestration, Model Training Service, Models Database, Model Service, Alarms Database, Alarms Dashboard, SOC. Numbered flows 1-12 carry: Raw Data, Features, Features + Labels, Labels, Models, Prediction Request, Score, Alarm, Confirmation/Rejection.]
[Diagram. API Engine: Feature extractor and DL Model, exchanging {json} with raw data, features, predictions.]
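The API Engine idea (JSON in, feature extractor, DL model, JSON out) can be sketched with the standard library alone. Everything named here is hypothetical: the `events` payload layout, the `extract_features` and `handle_request` functions, and the model passed in as a plain callable are illustration only, not the deployed service.

```python
import json
from typing import Callable, Dict, List


def extract_features(raw_events: List[Dict[str, int]]) -> List[int]:
    """Hypothetical feature extractor: dwell time (ms) of each key event."""
    return [event["release"] - event["press"] for event in raw_events]


def handle_request(body: str, model: Callable[[List[int]], float]) -> str:
    """Turn a JSON request with raw key events into a JSON response."""
    raw_events = json.loads(body)["events"]
    features = extract_features(raw_events)
    prediction = model(features)  # the trained model is injected by the caller
    # The response carries raw data, features and prediction together.
    return json.dumps(
        {"raw_data": raw_events, "features": features, "prediction": prediction}
    )


request_body = json.dumps({"events": [{"press": 0, "release": 80},
                                      {"press": 150, "release": 210}]})
# A constant score stands in for a real model in this example.
response = json.loads(handle_request(request_body, lambda features: 0.5))
```

Passing the model in as an argument keeps the request handling independent of how the model was trained or stored.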
SHAMELESS
PLUG
pydata.it
pycon.it
EUROSCIPY 2018
Fondazione Bruno Kessler | Associazione Python Italia
University of Trento
Northern Italy | Trentino Region
Tentative dates: Aug. 28 - Sept. 01 2018
Stay posted at euroscipy.org
trento.python.it
Next Meetup: Feb 22, 2018 - 19:00 ➡ @Clab
SHAMELESS
SELF
PROMOTION
https://github.com/leriomaggio/deep-learning-keras-tensorflow
THANK YOU!
🍻
Now it’s time for Cheers
🥓
@leriomaggio
vmaggio@fbk.eu

Learning from Biometric Fingerprints to prevent Cyber Security Threats