Poorly Supervised Learning
How to engineer labels for machine learning
when you don’t have any.
Chris Anagnostopoulos
Chief Data Scientist, www.ment.at
Hon. Senior Lecturer, Imperial College
Mentat
(X, y)
image courtesy of Stanford DAWN
Predict the class y of an object, given a description x of that object
Function approximation:
\[ \tilde{f} = \arg\min_{f \in \mathcal{F}} \sum_{i=1}^{n} L(f(x_i), y_i) \]
Parametric estimation:
\[ \tilde{\theta} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} L(f_\theta(x_i), y_i) \]
Relies on a classifier (family of functions)
and a labelled dataset.
learning by example
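As a minimal sketch of parametric estimation, the snippet below picks the parameter that minimises the summed loss over a labelled dataset. The toy data, the threshold classifier, the 0-1 loss and the parameter grid are all assumptions made for illustration; they are not taken from the slides.

```python
# Toy labelled dataset (x_i, y_i): hypothetical 1-D descriptions with classes -1 / +1.
data = [(0.5, -1), (1.0, -1), (1.5, -1), (2.5, 1), (3.0, 1), (3.5, 1)]

def f(theta, x):
    """Threshold classifier f_theta(x): +1 if x exceeds theta, else -1."""
    return 1 if x > theta else -1

def zero_one_loss(prediction, label):
    """L(f(x), y): 1 for a mistake, 0 otherwise."""
    return 0 if prediction == label else 1

def empirical_risk(theta):
    """Sum of the losses over all labelled examples."""
    return sum(zero_one_loss(f(theta, x), y) for x, y in data)

# Candidate parameters Theta: a coarse grid (an assumption of this sketch).
grid = [i / 4 for i in range(17)]  # 0.0, 0.25, ..., 4.0

# argmin over the grid; min() returns the first minimiser it meets.
theta_hat = min(grid, key=empirical_risk)
print(theta_hat, empirical_risk(theta_hat))
```

Everything here depends on the labels y_i being available, which is exactly the requirement the rest of the deck tries to relax.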
(X, y) vs
Few labels
Multiple noisy labels
Not missing at random
Unit of analysis / missing context
Class imbalance
Expert prior rules
The X and the y in Cybersecurity
The X in Cybersecurity
web proxy logs
Windows authentication logs
process logs
packet capture
email headers
network flow
DNS logs
firewall logs
Unix syslog
A broad array of data formats, often capturing different aspects of the same activity.
An example: exfiltration
Attack stages: LinkedIn, email, social engineering, spearphishing, user clicks, malware download, privilege escalation, file scanning and encryption, login, malicious infrastructure
Data sources: spam filter, web proxy logs, antivirus, firewall logs, process logs, Windows authentication logs, packet capture, email headers, netflow, Unix syslog
The y in Cybersecurity
Actual attacks are (thankfully) very rare, but there are plenty of signals worth surfacing: near-misses, early attack stages, risky behaviours, “suspect” traffic.
Mountains of “soft” labelling rules captured as search queries, automated rules (NIDS) or look-ups (intel)
vs
Very little to no time to produce manual labels at any reasonable scale
Benign vs malware in pcap
Network intrusion detection systems (NIDS) logs
Threat intel: observables, indicators of compromise
Threat intel: kill chain
Threat intel: TTPs
Even when y is available, it’s often about a complex X obtained by data fusion.
Often a noisy y is available by checking against a DB.
The user is looking for patterns and lookalikes.
No y exists for the vast majority of traffic.
Solution: label engineering
Standard tools: data fusion and feature engineering
Expert-defined features are usually meant as scores.
A middle-ground between rules and classifiers
• hard to explicitly describe how cats differ from dogs.
• learning by example avoids that, but needs a gazillion labels
• human learning is a sensible mix of examples and imperfect rules
cat dog
“cats are smaller than dogs”
• the trick is to interpret rules as noisy, partial labels
• a feature engineering step is needed to map rules to raw data
Raw Data:
01011010110101000
10101101010111010
101101010111101…
Features:
{
‘size’: ‘large’,
‘ear shape’: ‘pointy’,
‘color’: [‘white’, ‘orange’],
‘nose shape’: …
}
Expert Ruleset:
if it is very large, then it is not a cat,
if its ears are pointy, then it is likely a cat
…
Noisy Labels:
Example ID   Rule 1   Rule 2   Rule 3   Ground Truth
01.jpeg      Cat      Dog      Cat      Cat
02.jpeg      Cat      -        Cat      -
03.jpeg      -        -        Dog      -
04.jpeg      -        Cat      -        Cat
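The step from features and an expert ruleset to a table of noisy labels can be sketched as follows. This is only an illustration: the feature dictionaries are made up, and reading “not a cat” as a vote for “Dog” assumes a two-class cat/dog problem.

```python
CAT, DOG, ABSTAIN = "Cat", "Dog", None

def rule_size(features):
    # "if it is very large, then it is not a cat"
    # (read as a vote for Dog, assuming only two classes)
    return DOG if features.get("size") == "very large" else ABSTAIN

def rule_ears(features):
    # "if its ears are pointy, then it is likely a cat"
    return CAT if features.get("ear shape") == "pointy" else ABSTAIN

RULES = [rule_size, rule_ears]

def noisy_labels(features):
    """One vote (or abstention) per rule, i.e. one row of the noisy-label table."""
    return [rule(features) for rule in RULES]

# Hypothetical feature dictionaries, one per image.
examples = {
    "a.jpeg": {"size": "very large", "ear shape": "floppy"},
    "b.jpeg": {"size": "small", "ear shape": "pointy"},
    "c.jpeg": {"size": "very large", "ear shape": "pointy"},  # the rules conflict
    "d.jpeg": {"size": "medium", "ear shape": "floppy"},      # no rule fires
}

label_matrix = {name: noisy_labels(feats) for name, feats in examples.items()}
for name, row in label_matrix.items():
    print(name, row)
```

Conflicts and abstentions are kept rather than resolved at this stage; they are the raw material for the combination step described later in the deck.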
Manual | Data Programming
Semantically Easy
Semantically Tough
Raw Data:
Case Report
Class: Confidential
Jack Black interviewing suspect John Doe
Date of interview: 11/01/2015
Summary: Regarding the incident in London on the 6th of December 2010, the main suspect was John Doe.
Expert Rules (Labelling Functions):
- whichever name appears first in the document is the name of the suspect.
- if in the first paragraph the pattern “Suspect: [Name]” appears, then that is the suspect name. It can also be called “Person of Interest”.
- if the term “incident” or “crime” or “investigation” and the term “suspect” appears in the summary, followed by a name, and that name also appears in the first paragraph
Noisy Tags:
Example ID   Rule 1     Rule 2   Rule 3   Ground Truth
01.doc       J. Black   -        J. Doe   J. Doe
02.doc       …          …        …        …
03.doc       …          …        …        …
04.doc       …          …        …        …
Manual | Data Programming
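The three expert rules for the case report could be written as labelling functions roughly like this. It is a sketch under several assumptions: the report is already split into a first paragraph and a summary, names are detected with a toy gazetteer of first names (a stand-in for a real name-detection primitive), and the third rule is taken to conclude “then that name is the suspect”, which the slide leaves unstated.

```python
import re

# Hypothetical domain primitive: a known first name followed by a capitalised surname.
FIRST_NAMES = ("Jack", "John")
NAME = re.compile(r"\b(?:%s) [A-Z][a-z]+\b" % "|".join(FIRST_NAMES))

report = {
    "first_paragraph": "Jack Black interviewing suspect John Doe",
    "summary": ("Regarding the incident in London on the 6th of December 2010, "
                "the main suspect was John Doe."),
}

def rule_1(doc):
    """Whichever name appears first in the document is the suspect."""
    match = NAME.search(doc["first_paragraph"] + " " + doc["summary"])
    return match.group(0) if match else None

def rule_2(doc):
    """'Suspect: [Name]' or 'Person of Interest: [Name]' in the first paragraph."""
    pattern = r"(?:Suspect|Person of Interest):\s*(%s)" % NAME.pattern
    match = re.search(pattern, doc["first_paragraph"])
    return match.group(1) if match else None

def rule_3(doc):
    """An incident term and 'suspect' followed by a name in the summary,
    where that name also appears in the first paragraph (conclusion assumed)."""
    summary = doc["summary"]
    if not re.search(r"\b(?:incident|crime|investigation)\b", summary):
        return None
    match = re.search(r"\bsuspect\b.*?(%s)" % NAME.pattern, summary)
    if match and match.group(1) in doc["first_paragraph"]:
        return match.group(1)
    return None

votes = [rule(report) for rule in (rule_1, rule_2, rule_3)]
print(votes)
```

On this report the first rule picks the interviewer, the second abstains and the third finds the suspect, which mirrors the 01.doc row of the table.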
Weakly supervised learning and data programming
Labelling functions are a middle ground between feature engineering and labelling. They are a very information-efficient use of analyst time.
Challenge: domain primitives
This was most striking in the chemical name task, where complicated surface forms (e.g., “9-acetyl-1,3,7-trimethyl-pyrimidinedione”) caused tokenization errors which had significant effects on system recall. This was corrected by utilizing a domain-specific tokenizer.
Is all expert labelling heuristic? (as opposed to ground truth labelling)
Vision and other types of cognition are often hard to reduce to a number of heuristics, but sometimes they are heuristics on top of “deep” domain primitives (e.g., a “circle” and a “line”).
In non-sensory domains, we would expect expert labelling to be much more rules-based.
Eliciting the full decision tree voluntarily is impossible, however; in practice, it seems experts use a forest methodology, rather than one relying on a single deep tree.
Regulation plays a role here. If you are expected to follow a playbook and document your work, then a procedural description is natural.
Domain Primitives
Log analysis: sessions, action sequences, statistical profiling (“unusually high”), data fusion (job roles), integration with threat intel knowledge DBs, jargon (“C2-like behaviour”, “domain-flux”)
Threat intel: formal representation of TTPs, regex for cyber entities (hashes, IPs, malware names …)
The whole point is to empower the expert to express their heuristics programmatically in an easy, accurate and scalable fashion.
“Hey. We’re building a model for user agents. Does this one look suspicious?”
“Yep! That’s a bad one.”
“Great! Out of curiosity, why?”
“The SeznamBot version is wrong”
Where you had one additional label, now you have a labelling rule.
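The analyst's answer can be captured as a labelling function over user-agent strings. The sketch below is illustrative only: the set of accepted SeznamBot versions is a hypothetical placeholder that the analyst would have to supply, and the 1 / 0 encoding (suspicious / abstain) is an assumption.

```python
import re

SUSPICIOUS, ABSTAIN = 1, 0

# Hypothetical allow-list of versions; the real list would come from the analyst.
KNOWN_SEZNAMBOT_VERSIONS = {"3.2", "4.0"}
SEZNAMBOT = re.compile(r"SeznamBot/(\d+(?:\.\d+)*)")

def lf_seznambot_version(user_agent):
    """Vote 'suspicious' when a SeznamBot user agent carries an unknown version."""
    match = SEZNAMBOT.search(user_agent)
    if match is None:
        return ABSTAIN  # the rule has nothing to say about other user agents
    if match.group(1) in KNOWN_SEZNAMBOT_VERSIONS:
        return ABSTAIN
    return SUSPICIOUS

print(lf_seznambot_version("Mozilla/5.0 (compatible; SeznamBot/9.9)"))
print(lf_seznambot_version("Mozilla/5.0 (compatible; SeznamBot/3.2)"))
```

The function abstains far more often than it votes, which is fine: it labels every matching user agent in the logs, not just the one the analyst was shown.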
“Hey. We’re building a model for user agents. Does this one look suspicious?”
“Yep! That’s a bad one.”
“Great! Out of curiosity, why?”
“The IP is blacklisted”
This is a case of distant learning that might or might not induce a signal on the user agent itself (only a small fraction of all attacks will have anomalous user agents, so the label is very noisy).
“Hey. We’re building a model for user agents. Does this one look suspicious?”
“Yep! That’s a bad one.”
“Great! Out of curiosity, why?”
“The IP is from Russia”
This is a case of missing context that could be turned into a labelling rule if we enrich the data. Again, the label might be noisy as far as the user agent is concerned - but that’s OK!
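Context-based answers such as a blacklisted IP or a country of origin only become labelling rules once the record is enriched with that context. A minimal sketch, assuming a record with a `src_ip` field and two hypothetical lookup tables (a real deployment would query threat-intel and geo-IP services instead):

```python
import ipaddress

SUSPICIOUS, ABSTAIN = 1, 0

# Hypothetical enrichment tables, using documentation-only address ranges.
BLACKLIST = {"203.0.113.7"}
GEO = {ipaddress.ip_network("198.51.100.0/24"): "RU"}

def enrich(record):
    """Attach the missing context (blacklist hit, country) to a raw log record."""
    ip = ipaddress.ip_address(record["src_ip"])
    country = next((c for net, c in GEO.items() if ip in net), None)
    return {**record, "blacklisted": record["src_ip"] in BLACKLIST, "country": country}

def lf_blacklisted_ip(record):
    return SUSPICIOUS if record["blacklisted"] else ABSTAIN

def lf_country_of_interest(record):
    return SUSPICIOUS if record["country"] == "RU" else ABSTAIN

record = enrich({"src_ip": "198.51.100.20", "user_agent": "Mozilla/5.0"})
print(lf_blacklisted_ip(record), lf_country_of_interest(record))
```

Neither function looks at the user agent at all, so the labels they produce are noisy with respect to it; the generative model later in the deck is what accounts for that noise.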
Breaking free of (X,y): label engineering
semi-supervised
active learning
weakly supervised
multi-view learning
unsupervised learning
data programming
distant learning
importance reweighting
cost-sensitive learning
label-noise robustness
boosting
co-training
Maths and algos
Same loss as before:
\[ \mathbb{E}_{(x,y)\sim\pi}\, L(f_\theta(x), y) \]
Enter labelling functions:
\[ \lambda_i : X \to \{-1, 0, 1\} \]
where 0 denotes a missing value. Examples:
“an IP that we have never seen before, and the TLD is not a .com, .co.uk, .org”
“an IP that appears in our blacklist”
“an IP that appears in our greylist”
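The three example rules above can be written as functions with values in {-1, 0, 1}. In this sketch the reference sets are hypothetical, the input is assumed to be a dict with `ip` and `domain` fields, and every rule is assumed to vote +1 (malicious) when it fires and 0 otherwise; a -1 vote would come from a rule that asserts benign traffic, such as a whitelist.

```python
MALICIOUS, ABSTAIN = 1, 0

# Hypothetical reference data.
SEEN_IPS = {"192.0.2.1"}
BLACKLIST = {"203.0.113.7"}
GREYLIST = {"198.51.100.9"}
COMMON_TLDS = (".com", ".co.uk", ".org")

def lf_new_ip_rare_tld(x):
    """Never-seen IP whose domain does not end in a common TLD."""
    is_new = x["ip"] not in SEEN_IPS
    rare_tld = not x["domain"].endswith(COMMON_TLDS)
    return MALICIOUS if is_new and rare_tld else ABSTAIN

def lf_blacklist(x):
    return MALICIOUS if x["ip"] in BLACKLIST else ABSTAIN

def lf_greylist(x):
    return MALICIOUS if x["ip"] in GREYLIST else ABSTAIN

LFS = [lf_new_ip_rare_tld, lf_blacklist, lf_greylist]

def label_vector(x):
    """Lambda = lambda(x): one entry in {-1, 0, 1} per labelling function."""
    return [lf(x) for lf in LFS]

print(label_vector({"ip": "203.0.113.7", "domain": "example.xyz"}))
print(label_vector({"ip": "192.0.2.1", "domain": "example.com"}))
```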
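Generative model:
\[ \mu_{\alpha,\beta}(\Lambda, Y) = \frac{1}{2} \prod_{i=1}^{m} \left( \beta_i \alpha_i 1_{[\Lambda_i = Y]} + \beta_i (1 - \alpha_i) 1_{[\Lambda_i = -Y]} + (1 - \beta_i) 1_{[\Lambda_i = 0]} \right) \]
“each labelling function λ has a probability β of labelling an object and α of getting the label right”
\( \mu_{\alpha,\beta}(\Lambda, Y) \): remember, we do not have any Ys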
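The generative model is simple enough to evaluate directly. The sketch below computes μ for a vector of votes and a candidate label, and then applies Bayes' rule to obtain P(Y = 1 | Λ). The particular α and β values are made up for illustration.

```python
def mu(lambdas, y, alpha, beta):
    """mu_{alpha,beta}(Lambda, Y) for Lambda in {-1, 0, 1}^m and Y in {-1, 1}."""
    prob = 0.5  # uniform prior over the two classes
    for lam, a, b in zip(lambdas, alpha, beta):
        if lam == 0:
            prob *= 1 - b        # the function abstained
        elif lam == y:
            prob *= b * a        # it voted and was right
        else:
            prob *= b * (1 - a)  # it voted and was wrong
    return prob

def posterior_positive(lambdas, alpha, beta):
    """P(Y = 1 | Lambda), obtained by normalising over both values of Y."""
    pos = mu(lambdas, 1, alpha, beta)
    neg = mu(lambdas, -1, alpha, beta)
    return pos / (pos + neg)

# Hypothetical accuracies and coverages for three labelling functions.
alpha = [0.9, 0.7, 0.6]
beta = [0.2, 0.5, 0.5]

print(posterior_positive([1, 0, 0], alpha, beta))   # one vote
print(posterior_positive([1, 1, 1], alpha, beta))   # three agreeing votes
print(posterior_positive([1, -1, 0], alpha, beta))  # a disagreement
```

Note that β cancels in the posterior: coverage affects how often a function speaks, not how much its vote is worth.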
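Maximum likelihood:
\[ \mu_{\alpha^\star,\beta^\star}(\Lambda, Y) \quad \text{where} \quad (\alpha^\star, \beta^\star) = \arg\max_{\alpha,\beta} \sum_{x \in X} \log P_{(\Lambda,Y)\sim\mu_{\alpha,\beta}}\left(\Lambda = \lambda(x)\right) \]
“what values of α and β seem most likely given our unlabelled data X?” Average over all possible labels.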
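A rough sketch of the maximum-likelihood step on a tiny, made-up label matrix follows. Two simplifications are assumptions of this sketch rather than part of the slides: a single accuracy α is shared by all labelling functions, and it is found by a coarse grid search restricted to α > 0.5 (the marginal likelihood is unchanged when every α is replaced by 1 - α, so some such restriction is needed). Because β factorises out of the sum over Y, its maximiser is simply each function's coverage.

```python
import math

# Hypothetical label matrix: one row per unlabelled object, one column per labelling function.
L = [
    [1, 1, 1], [1, 1, 0], [-1, -1, -1], [-1, -1, 0],
    [1, -1, 1], [0, 1, 1], [-1, 0, -1], [0, 0, 0],
]
m = len(L[0])

# Maximum-likelihood beta_i: the fraction of objects that function i labels.
beta = [sum(row[i] != 0 for row in L) / len(L) for i in range(m)]

def mu(row, y, alpha):
    """mu_{alpha,beta}(Lambda, Y) with one shared accuracy alpha."""
    prob = 0.5
    for lam, b in zip(row, beta):
        if lam == 0:
            prob *= 1 - b
        else:
            prob *= b * (alpha if lam == y else 1 - alpha)
    return prob

def log_likelihood(alpha):
    """Sum over x of log P(Lambda = lambda(x)), summing out the unknown Y."""
    return sum(math.log(mu(row, 1, alpha) + mu(row, -1, alpha)) for row in L)

grid = [0.55 + 0.05 * k for k in range(9)]  # 0.55, 0.60, ..., 0.95
alpha_star = max(grid, key=log_likelihood)
print(beta, alpha_star)
```

No ground-truth label is used anywhere: the only evidence about accuracy is how often the functions agree with one another.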
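Noise-aware empirical loss:
\[ \tilde{\theta} = \arg\min_{\theta} \sum_{x \in X} \mathbb{E}_{(\Lambda,Y)\sim\mu_{\alpha^\star,\beta^\star}} \left[ L(f_\theta(x), Y) \mid \Lambda = \lambda(x) \right] \]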
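For a binary label the conditional expectation in the noise-aware loss has only two terms, weighted by the posterior P(Y = 1 | Λ = λ(x)). The sketch below assumes a logistic loss, a one-parameter linear scorer f_θ(x) = θx, made-up posteriors and a grid search; none of these choices come from the slides.

```python
import math

def logistic_loss(score, y):
    """L(f(x), y) = log(1 + exp(-y * f(x))) for y in {-1, 1}."""
    return math.log1p(math.exp(-y * score))

def noise_aware_loss(score, p_positive):
    """E[L(f(x), Y) | Lambda = lambda(x)], with p_positive = P(Y = 1 | Lambda = lambda(x))."""
    return (p_positive * logistic_loss(score, 1)
            + (1 - p_positive) * logistic_loss(score, -1))

# Hypothetical training set: (feature x, posterior probability that Y = 1).
train = [(2.0, 0.9), (1.0, 0.7), (-1.5, 0.1), (-0.5, 0.4)]

def total_loss(theta):
    """Noise-aware empirical loss of the scorer f_theta(x) = theta * x."""
    return sum(noise_aware_loss(theta * x, p) for x, p in train)

thetas = [k / 10 for k in range(-30, 31)]
theta_hat = min(thetas, key=total_loss)
print(theta_hat)
```

When the posterior is exactly 0 or 1 this reduces to the ordinary supervised loss, so any classifier trained by loss minimisation can be plugged in.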
Assumptions:
\( y \perp f(x) \mid \lambda(x) \): “the labelling functions tell you all you need to know about the label”
\( \lambda_i(x) \perp \lambda_j(x) \mid y \): “the labelling functions are independent”
So overlaps are very informative!
Software
Chatter
(mostly text)
Logs
(device data)
Intel Alerts
Analysis Rulesets
keyword extraction
topic classification
link analysis
alias detection
regular expressions
SQL / SPL queries
NIDS rules
DPI classification
behavioural profiling
infra (IP) reputation
Action
relationship mining
cyber entity extraction
graph reputation scoring
ruleset optimisation
regex parameterisation
data fusion (e.g., sessionisation)
Λ(Χ)
Use Cases in Defence and Security
Anomaly Detection in Cyber Logs: < 10 true positives, billions of log lines; vast amounts of threat intel and custom Splunk queries.
Relation Extraction from Investigator Reports: ~10K case reports, 2 annotated ones; semi-standardised language and HR databases allow for easy creation of labelling rules. “Mr. John SMITH”: suspect.
Detect New Buildings in Aerial Photography: annotated data for UK, 0 annotations for territories of interest; huge rulesets already exist in legacy software.
Conclusion
Experts often give you one label where they could simply tell you the rule they are using to produce it.
Labelling functions are more intuitive than feature engineering; they can yield vastly more labels than manual annotation; and they capture know-how.
Forcing the user to deal with the tradeoff between accuracy and coverage is unnecessary. Boosting has taught us how to combine weak heuristics.
At the end of all this, you can still use your favourite classifier.
Focus on the real pain point: dearth of labels.
UK Co. 08251230
Level 39
One Canada Place
Canary Wharf, London
United Kingdom
+44 203 6373330 www.ment.at
info@ment.at

How to Engineer Labels for Machine Learning with Limited Data

  • 1. Poorly Supervised Learning How to engineer labels for machine learning when you don’t have any. Chris Anagnostopoulos Chief Data Scientist, www.ment.at Hon. Senior Lecturer, Imperial College
  • 3. Mentat (X, y) image courtesy of Stanford DAWN
  • 4. Mentat (X, y) Predict the class y of an object, given a description x of that object ˜f = argmin f2 nX i=1 L(f(xi), yi) ˜✓ = argmin ✓2⇥ nX i=1 L(f✓(xi), yi) Function approximation: Parametric estimation: Relies on a classifier (family of functions) and a labelled dataset. learning by example
  • 5. Mentat Few labels Multiple noisy labels Not missing at random Unit of analysis / missing context Class imbalance Expert prior rules vs (X, y)
  • 6. Mentat The X and the y in Cybersecurity
  • 7. Mentat The X in Cybersecurity web proxy logs windows authentication logs process logs packet capture email headers network flow DNS logs firewall logs Unix syslog A broad array of data formats, often capturing different aspects of the same activity.
  • 8. Mentat spamfilter An example: exfiltration web proxy logs windows authentication logs packet captureemail headers netflow firewall logs Unix syslog LinkedIn Email social engineering spearphishing User clicks malware download webproxy antivirus privilege escalation file scanning and encryption login malicious infrastructure firewall process logs
  • 9. Mentat The y in Cybersecurity actual attacks are (thankfully) very rare but plenty of signals worth surfacing: near-misses, early attack stages, risky behaviours, “suspect” traffic
  • 10. Mentat The y in Cybersecurity vs Mountains of “soft” labelling rules captured as search queries, automated rules (NIDS) or look-ups (intel) Very little to no time to produce manual labels at any reasonable scale
  • 11. Mentat The y in Cybersecurity Benign vs malware in pcap Network intrusion detection systems (NIDS) logs Threat intel: observables, indicators of compromise Threat intel: kill chain Threat intel: TTPs even when y is available, it’s often about a complex X obtained by data fusion often a noisy y is available by checking against a DB the user is looking for patterns and lookalikes no y exists for the vast majority of traffic
  • 13. Mentat Standard tools: data fusion and feature engineering expert-defined features are usually meant as scores
  • 14. Mentat A middle-ground between rules and classifiers • hard to explicitly describe how cats differ from dogs. • learning by example avoids that, but needs a gazillion labels • human learning is a sensible mix of examples and imperfect rules cat dog “cats are smaller than dogs” • the trick is to interpret rules as noisy, partial labels • a feature engineering step is needed to map rules to raw data
  • 15. Mentat A middle-ground between rules and classifiers Raw Data Features Expert Ruleset Noisy Labels 01011010110101000 10101101010111010 101101010111101… { ‘size’: ‘large’, ‘ear shape’: ‘pointy’, ’color:’ [‘white’, ‘orange’], ‘nose shape: … } if it is very large, then it is not a cat, if its ears are pointy, then it is likely a cat … Example ID Rule 1 Rule 2 Rule 3 Ground Truth 01.jpeg Cat Dog Cat Cat 02.jpeg Cat - Cat - 03.jpeg - - Dog - 04.jpeg - Cat - Cat
  • 16. Mentat A middle-ground between rules and classifiers Raw Data Features Expert Ruleset Noisy Labels 01011010110101000 10101101010111010 101101010111101… { ‘size’: ‘large’, ‘ear shape’: ‘pointy’, ’color:’ [‘white’, ‘orange’], ‘nose shape: … } if it is very large, then it is not a cat, if its ears are pointy, then it is likely a cat … Example ID Rule 1 Rule 2 Rule 3 Ground Truth 01.jpeg Cat Dog Cat Cat 02.jpeg Cat - Cat - 03.jpeg - - Dog - 04.jpeg - Cat - Cat ManualData Programming
  • 17. Mentat A middle-ground between rules and classifiers Raw Data Features Expert Ruleset Noisy Labels 01011010110101000 10101101010111010 101101010111101… { ‘size’: ‘large’, ‘ear shape’: ‘pointy’, ’color:’ [‘white’, ‘orange’], ‘nose shape: … } if it is very large, then it is not a cat, if its ears are pointy, then it is likely a cat … Example ID Rule 1 Rule 2 Rule 3 Ground Truth 01.jpeg Cat Dog Cat Cat 02.jpeg Cat - Cat - 03.jpeg - - Dog - 04.jpeg - Cat - Cat ManualData Programming Semantically Easy Semantically Tough
  • 18. Mentat A middle-ground between rules and classifiers Raw Data Features Expert Rules Noisy Tags Case Report Class: Confidential Jack Black interviewing suspect John Doe Date of interview: 11/01/2015 Summary: Regarding the incident in London on the 6th of December 2010, the main suspect was John Doe. - whichever name appears first in the document is the name of the suspect. - if in the first paragraph the pattern “Suspect: [Name]” appears, then that is the suspect name. It can also be called “Person of Interest”. - if the term “incident” or “crime” or “investigation” and the term “suspect” appears in the summary, followed by a name, and that name also appears in the first paragraph Example ID Rule 1 Rule 2 Rule 3 Ground Truth 01.doc J. Black - J. Doe J. Doe 02.doc … … … … 03.doc … … … … 04.doc … … … … ManualData Programming Labelling Functions
  • 19. Mentat Weakly supervised learning and data programming Labelling functions is a middle ground between feature engineering and labelling. It is a very information-efficient use of analyst time. Challenge: This was most striking in the chemical name task, where complicated surface forms (e.g., “9-acetyl-1,3,7-trimethyl- pyrim- idinedione") caused tokenization errors which had significant effects on system recall. This was corrected by utilizing a domain-specific tokenizer. domain primitives
  • 20. Weakly supervised learning and data programming. Is all expert labelling heuristic? (as opposed to ground truth labelling) Vision and other types of cognition are often hard to reduce to a number of heuristics, but sometimes they are heuristics on top of “deep” domain primitives (e.g., a “circle” and a “line”).
  • 21. Weakly supervised learning and data programming. Is all expert labelling heuristic? (as opposed to ground truth labelling) In non-sensory domains, we would expect expert labelling to be much more rules-based. Eliciting the full decision tree voluntarily is impossible, however; in practice, experts seem to use a forest methodology rather than relying on a single deep tree. Regulation plays a role here: if you are expected to follow a playbook and document your work, then a procedural description is natural.
  • 22. Weakly supervised learning and data programming. Domain Primitives. Log analysis: sessions, action sequences, statistical profiling (“unusually high”), data fusion (job roles), integration with threat intel knowledge DBs, jargon (“C2-like behaviour”, “domain-flux”). Threat intel: formal representation of TTPs, regex for cyber entities (hashes, IPs, malware names …). The whole point is to empower the expert to express their heuristics programmatically in an easy, accurate and scalable fashion.
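A "regex for cyber entities" primitive could look roughly like the sketch below. The patterns, the `extract_entities` helper and the sample report are illustrative assumptions, not taken from the talk; production extractors would need stricter validation.

```python
import re

# Illustrative patterns only (assumptions, not the speaker's own primitives).
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")
MD5_RE = re.compile(r"\b[a-fA-F0-9]{32}\b")
SHA256_RE = re.compile(r"\b[a-fA-F0-9]{64}\b")


def extract_entities(text):
    """Return the IPv4 addresses and hashes found in a free-text string."""
    return {
        "ipv4": IPV4_RE.findall(text),
        "md5": MD5_RE.findall(text),
        "sha256": SHA256_RE.findall(text),
    }


# Hypothetical threat-intel sentence using a documentation-range IP.
report = "Beacon to 203.0.113.7 dropped d41d8cd98f00b204e9800998ecf8427e"
entities = extract_entities(report)
```

Once such primitives exist, an expert can write heuristics in terms of "the hash" or "the IP" rather than raw strings.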
  • 23. Weakly supervised learning and data programming. “Hey. We’re building a model for user agents. Does this one look suspicious?” “Yep! That’s a bad one.” “Great! Out of curiosity, why?” “The SeznamBot version is wrong” Where you had one additional label, now you have a labelling rule.
  • 24. Weakly supervised learning and data programming. “Hey. We’re building a model for user agents. Does this one look suspicious?” “Yep! That’s a bad one.” “Great! Out of curiosity, why?” “The IP is blacklisted” This is a case of distant learning that might or might not induce a signal on the user agent itself (a small fraction of all attacks will have anomalous user agents, so the label is very noisy).
  • 25. Weakly supervised learning and data programming. “Hey. We’re building a model for user agents. Does this one look suspicious?” “Yep! That’s a bad one.” “Great! Out of curiosity, why?” “The IP is from Russia” This is a case of missing context that could be turned into a labelling rule if we enrich the data. Again, the label might be noisy as far as the user agent is concerned, but that’s OK!
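As a minimal sketch of turning missing context into a labelling rule, the example below enriches each event with a country code and then votes on it. The lookup table, the watchlist, the field names and the functions `enrich` and `lf_country` are all hypothetical; a real deployment would use a GeoIP service.

```python
ABSTAIN, SUSPICIOUS = 0, 1

# Hypothetical enrichment table (documentation-range IPs); stands in for a GeoIP lookup.
IP_TO_COUNTRY = {"203.0.113.7": "RU", "198.51.100.23": "GB"}
WATCHLIST = {"RU"}  # assumption: encodes the analyst's stated reason


def enrich(event):
    """Attach a country code to a log event (None if the IP is unknown)."""
    return {**event, "country": IP_TO_COUNTRY.get(event["src_ip"])}


def lf_country(event):
    """Vote 'suspicious' if the source country is on the watchlist, else abstain."""
    return SUSPICIOUS if event.get("country") in WATCHLIST else ABSTAIN


events = [
    {"src_ip": "203.0.113.7", "user_agent": "Mozilla/5.0"},
    {"src_ip": "198.51.100.23", "user_agent": "curl/8.0"},
    {"src_ip": "192.0.2.1", "user_agent": "Mozilla/5.0"},
]
votes = [lf_country(enrich(e)) for e in events]
```

The rule never looks at the user agent itself, which is why its votes are noisy labels for a user-agent model.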
  • 26. Breaking free of (X, y): label engineering. Semi-supervised, active learning, weakly supervised, multi-view learning, unsupervised learning, data programming, distant learning, importance reweighting, cost-sensitive learning, label-noise robustness, boosting, co-training.
  • 27. Maths and algos. Same loss as before: $\mathbb{E}_{(x,y)\sim\pi}\left[L(f_\theta(x), y)\right]$. Enter labelling functions $\lambda_i : \mathcal{X} \to \{-1, 0, 1\}$, where 0 is a missing value: “an IP that we have never seen before, and the TLD is not a .com, .co.uk, .org”; “an IP that appears in our blacklist”; “an IP that appears in our greylist”. Generative model: $\mu_{\alpha,\beta}(\Lambda, Y) = \frac{1}{2}\prod_{i=1}^{m}\left(\beta_i\alpha_i\,\mathbb{1}[\Lambda_i = Y] + \beta_i(1-\alpha_i)\,\mathbb{1}[\Lambda_i = -Y] + (1-\beta_i)\,\mathbb{1}[\Lambda_i = 0]\right)$, i.e. “each labelling function λ has a probability β of labelling an object and α of getting the label right”. Remember: in $\mu_{\alpha,\beta}(\Lambda, Y)$ we do not have any Ys.
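The three quoted rules and the generative model can be sketched as follows. The lists, event fields and the values of α and β are made-up placeholders. These particular rules only vote +1 or abstain (0), although the model also allows votes of -1.

```python
import itertools
import numpy as np

# Hypothetical reference data (documentation-range IPs).
BLACKLIST = {"203.0.113.7"}
GREYLIST = {"198.51.100.23"}
SEEN_IPS = {"192.0.2.1", "198.51.100.23"}
COMMON_TLDS = (".com", ".co.uk", ".org")


def lf_blacklist(e):
    return 1 if e["ip"] in BLACKLIST else 0


def lf_greylist(e):
    return 1 if e["ip"] in GREYLIST else 0


def lf_new_ip_rare_tld(e):
    is_new = e["ip"] not in SEEN_IPS
    rare_tld = not e["domain"].endswith(COMMON_TLDS)
    return 1 if is_new and rare_tld else 0


LFS = [lf_blacklist, lf_greylist, lf_new_ip_rare_tld]

events = [
    {"ip": "203.0.113.7", "domain": "update-check.xyz"},
    {"ip": "192.0.2.1", "domain": "example.com"},
    {"ip": "198.51.100.23", "domain": "example.org"},
]
# Label matrix: one row per event, one column per labelling function.
L = np.array([[lf(e) for lf in LFS] for e in events])


def mu(lam, y, alpha, beta):
    """Joint probability of votes lam and label y under the independent model."""
    lam = np.asarray(lam)
    per_lf = (beta * alpha * (lam == y)
              + beta * (1 - alpha) * (lam == -y)
              + (1 - beta) * (lam == 0))
    return 0.5 * float(np.prod(per_lf))


alpha = np.array([0.9, 0.7, 0.6])  # assumed accuracies
beta = np.array([0.1, 0.2, 0.3])   # assumed coverage probabilities

# Sanity check: mu sums to 1 over all vote patterns and both labels.
total = sum(mu(lam, y, alpha, beta)
            for lam in itertools.product([-1, 0, 1], repeat=3)
            for y in (-1, 1))
```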
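  • 28. Maths and algos. Maximum likelihood: $\mu_{\alpha^\star,\beta^\star}(\Lambda, Y)$, where $(\alpha^\star, \beta^\star) = \arg\max_{\alpha,\beta} \sum_{x \in X} \log P_{(\Lambda,Y)\sim\mu_{\alpha,\beta}}\left(\Lambda = \lambda(x)\right)$, i.e. “what values of α and β seem most likely given our unlabelled data X?”. Average over all possible labels. Noise-aware empirical loss: $\tilde{\theta} = \arg\min_{\theta} \sum_{x \in X} \mathbb{E}_{(\Lambda,Y)\sim\mu_{\alpha^\star,\beta^\star}}\left[L(f_\theta(x), Y) \mid \Lambda = \lambda(x)\right]$. Assumptions: $y \perp f(x) \mid \lambda(x)$ (“the labelling functions tell you all you need to know about the label”) and $\lambda_i(x) \perp \lambda_j(x) \mid y$ (“the labelling functions are independent”). So overlaps are very informative!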
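One way to carry out the two steps is sketched below on simulated votes. The slides do not say which optimiser is used; this sketch swaps in expectation-maximisation (EM) to climb the marginal likelihood, then plugs the resulting label posterior into a logistic loss. The function names, the simulated accuracies and the starting value `alpha0` are assumptions.

```python
import numpy as np


def posterior_pos(L, alpha, beta):
    """P(Y = +1 | votes) for each row of L, with a 1/2 prior on each label."""
    def lik(y):
        per_lf = (beta * alpha * (L == y)
                  + beta * (1 - alpha) * (L == -y)
                  + (1 - beta) * (L == 0))
        return per_lf.prod(axis=1)
    pos, neg = lik(1), lik(-1)
    return pos / (pos + neg)


def fit_em(L, n_iter=200, alpha0=0.7):
    """EM estimate of accuracies alpha and coverages beta from unlabelled votes."""
    voted = (L != 0)
    beta = voted.mean(axis=0)  # closed form: fraction of non-abstentions
    alpha = np.full(L.shape[1], alpha0)  # start above 0.5: rules beat chance
    for _ in range(n_iter):
        q = posterior_pos(L, alpha, beta)[:, None]      # E-step
        agree = q * (L == 1) + (1 - q) * (L == -1)      # expected correct votes
        alpha = agree.sum(axis=0) / np.maximum(voted.sum(axis=0), 1)  # M-step
        alpha = np.clip(alpha, 1e-3, 1 - 1e-3)
    return alpha, beta


def noise_aware_logistic_loss(scores, q):
    """Sum over x of the expected logistic loss, with Y drawn from the posterior q."""
    loss_pos = np.logaddexp(0.0, -scores)  # loss if Y = +1
    loss_neg = np.logaddexp(0.0, scores)   # loss if Y = -1
    return float(np.sum(q * loss_pos + (1 - q) * loss_neg))


# Simulate votes from known parameters; y is used only to generate the data.
rng = np.random.default_rng(0)
n = 20000
true_alpha = np.array([0.9, 0.85, 0.8, 0.75, 0.9])
true_beta = np.array([0.7, 0.6, 0.8, 0.7, 0.5])
y = rng.choice([-1, 1], size=n)
votes = rng.random((n, 5)) < true_beta
right = rng.random((n, 5)) < true_alpha
L = votes * np.where(right, y[:, None], -y[:, None])

alpha_hat, beta_hat = fit_em(L)
q = posterior_pos(L, alpha_hat, beta_hat)
loss_at_zero = noise_aware_logistic_loss(np.zeros(n), q)  # classifier scoring 0 everywhere
```

The posterior `q` plays the role of soft labels, so any classifier that accepts probabilistic targets can be trained on it.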
  • 29. Software. Chatter (mostly text), Logs (device data), Intel, Alerts, Analysis, Rulesets. keyword extraction, topic classification, link analysis, alias detection, regular expressions, SQL / SPL queries, NIDS rules, DPI classification, behavioural profiling, infra (IP) reputation. Action: relationship mining, cyber entity extraction, graph reputation scoring, ruleset optimisation, regex parameterisation, data fusion (e.g., sessionisation). Λ(Χ)
  • 30. Use Cases in Defence and Security. Anomaly Detection in Cyber Logs: < 10 true positives, billions of log lines; vast amounts of threat intel and custom Splunk queries. Relation Extraction from Investigator Reports: ~10K case reports, 2 annotated ones; semi-standardised language and HR databases allow for easy creation of labelling rules (e.g., “Mr. John SMITH”: suspect). Detect New Buildings in Aerial Photography: annotated data for UK, 0 annotations for territories of interest; huge rulesets already exist in legacy software.
  • 31. Conclusion. Experts often give you one label where they could simply tell you the rule they are using to produce it. Labelling functions are more intuitive than feature engineering; they can yield vastly more labels than manual annotation; and they capture know-how. Forcing the user to deal with the tradeoff between accuracy and coverage is unnecessary. Boosting has taught us how to combine weak heuristics. At the end of all this, you can still use your favourite classifier. Focus on the real pain point: dearth of labels.
  • 32. UK Co. 08251230 Level 39 One Canada Place Canary Wharf, London United Kingdom +44 203 6373330 www.ment.at info@ment.at