Data Science vs. the Bad Guys: Defending LinkedIn from Fraud and Abuse

David Freeman, Head of Anti-Abuse Engineering at LinkedIn
©2013 LinkedIn Corporation. All Rights Reserved.
Data Science vs. The Bad Guys
Using data to defend LinkedIn against fraud and abuse
David Freeman
Head of Security Data Science at LinkedIn



Strata+Hadoop World
San Jose, CA
20 Feb 2015
World’s largest professional network
But not everyone follows the rules!
Why?
What do they try to do?
• Spam Messages
• Spam Content
• Fake Companies
• Fraud Ads
• Fake Jobs
• Social Engineering
• Social Action Spam (e.g. likes, follows)
• Payment Fraud
• Malware
• Malicious URLs
• Scraping
How do they do it?
How do we stop them?
How we stop them — process
1. Stop the bleeding!
2. Heuristic rules.
3. Machine learning.

Hypothetical Example: lots of fake accounts from one IP address
• Stop the bleeding: block the IP.
• Heuristic rule: limit signup rate from any IP.
• Machine learning: model trained on historical data, incorporating
– Signups/IP/hour
– Signups/IP/day
– # good accounts on IP
– # bad accounts on IP
– other features
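
The first two responses in the hypothetical example, plus two of the listed model features, can be sketched in Python. This is a minimal illustration under stated assumptions, not LinkedIn's implementation: the class name `SignupTracker`, the in-memory storage, and the default limit of 5 signups per hour are all hypothetical.

```python
from collections import defaultdict, deque

HOUR = 3600
DAY = 24 * HOUR


class SignupTracker:
    """Tracks signup timestamps per IP (hypothetical helper for illustration)."""

    def __init__(self, max_per_hour=5):
        self.max_per_hour = max_per_hour    # assumed heuristic limit
        self.blocked_ips = set()            # "stop the bleeding": manual blocklist
        self.signups = defaultdict(deque)   # ip -> accepted signup times in the last day

    def _prune(self, ip, now):
        """Drop timestamps that are a day old or older."""
        q = self.signups[ip]
        while q and now - q[0] >= DAY:
            q.popleft()

    def features(self, ip, now):
        """Two of the model inputs named on the slide: signups/IP/hour and signups/IP/day."""
        self._prune(ip, now)
        q = self.signups[ip]
        per_hour = sum(1 for t in q if now - t < HOUR)
        return {"signups_ip_hour": per_hour, "signups_ip_day": len(q)}

    def allow(self, ip, now):
        """Apply the blocklist, then the rate limit; record accepted signups."""
        if ip in self.blocked_ips:
            return False
        if self.features(ip, now)["signups_ip_hour"] >= self.max_per_hour:
            return False
        self.signups[ip].append(now)
        return True
```

Timestamps are plain seconds, so `allow("1.2.3.4", now)` can be driven from any clock. The dictionary returned by `features` is the kind of input a model trained on historical data could consume alongside the good and bad account counts for the IP.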
How we stop them — Infrastructure
[Diagram: online and offline paths. Components: request, scoring, abuse DB, accept, reject, scheduled scoring jobs]
Case studies:
• Registration
• Fake accounts
• Account takeover
If they can’t get in, then they can’t do damage!
How can we tell if you’re real?
Answer: Asset Reputation Systems
We have 347 million members’ worth of data on
• Names
• Email addresses
• IP addresses
• ISPs
• Browsers
• etc.
We can assign a reputation score to each asset
based on the level of abuse we’ve seen in the past.
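
One simple way to turn past abuse counts into a reputation score is a smoothed abuse rate. The sketch below is an assumption made for illustration, not the scoring the talk describes: the function name, the prior rate of 1%, and the prior weight of 20 are hypothetical.

```python
def reputation_score(bad_count, total_count, prior_rate=0.01, prior_weight=20):
    """Smoothed estimate of an asset's abuse rate, between 0 and 1.

    bad_count:   accounts using this asset (IP, email domain, ...) labeled abusive
    total_count: all accounts seen using this asset
    Assets with little history are pulled toward prior_rate (assumed value).
    """
    if bad_count < 0 or total_count < bad_count:
        raise ValueError("need 0 <= bad_count <= total_count")
    return (bad_count + prior_rate * prior_weight) / (total_count + prior_weight)
```

The smoothing keeps an asset seen once with one bad account from scoring worse than an asset with 80 bad accounts out of 100, which a raw ratio would do.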
Reputation Scoring
Instantaneous
• Calculated online from recent data
• Catches new bad activity
• Minimal feature set
• Sample feature: rate of signups from IP in last hour

Historical
• Calculated offline from long-term data
• Catches recurring bad activity
• Extensive feature set
• Sample feature: % of accounts using IP labeled abusive
Scoring Registration Attempts
• Machine-learned model combines reputation features (offline + online) to produce a registration score.
[Figure: registration score axis from 0 to 1, with 0.5 marked]
• How do we choose the thresholds?
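
As a rough illustration of combining an online and an offline reputation feature into a single registration score, here is a logistic model in Python. The talk does not say which model family is used; the feature names, weights, and bias below are hypothetical and hand-picked, where a real model would learn them from labeled data.

```python
import math

# Hypothetical weights; a real model would learn these from labeled data.
WEIGHTS = {
    "signups_ip_hour": 0.4,       # online (instantaneous) feature
    "ip_abusive_fraction": 5.0,   # offline (historical) feature
}
BIAS = -4.0


def registration_score(features):
    """Logistic combination of reputation features; returns a score in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))
```

A signup from a quiet IP with no abusive history scores near 0, while one from a busy IP with mostly abusive accounts scores near 1. Where to cut between those extremes is the threshold question the slide raises.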
Precision/Recall Tradeoffs
• Once system is online, it’s hard to distinguish false positives from true positives.
• User has no recourse, so be conservative!
• Bad guys who slip through will be caught sooner or later in other models.
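
Being conservative can be read as picking the blocking threshold from labeled historical data so that precision stays above a target, and accepting the lost recall. The sketch below assumes a labeled sample of scores is available; the function name and its interface are hypothetical.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Return the lowest threshold t such that blocking every score >= t meets
    the target precision on the labeled sample, or None if no threshold does.

    scores: model scores; labels: True for abusive, False for legitimate.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    true_pos = 0
    for i, (score, is_bad) in enumerate(ranked, start=1):
        true_pos += is_bad
        # Only evaluate at the last item of a run of tied scores.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_pos / i >= target_precision:
            best = score
    return best
```

Choosing the lowest qualifying threshold maximizes recall subject to the precision target. Registrations that score below it are let through, on the expectation that other models will catch them later.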
Fake Accounts Offline
Offline models can use many more features:
• Invitations
• Connection graph
• Profile content
• Messages sent/received
• Pattern of pages viewed
• Reported by other members
• etc.
Fake Accounts — Online and Offline
[Diagram: fake account models (Heuristic/ML), abuse DB, replication]
Online/Offline Tradeoffs
Online
• Instant action
• Data collected from many sources
• Computationally limited
• Slow to build and iterate

Offline
• Action delayed hours to days
• Data all in one place (HDFS)
• Lots of computational resources
• Fast to build and iterate
Fake Account Defense in Action
[Chart: cumulative number of accounts over time. Series: Blocked at Registration; Fake Accounts Caught; Fakes Caught Within 48h of Creation]
Precision/Recall again…
Fake account models have to be very precise.
How can we stop bad activity without making good members unhappy?
Member Reputation
Estimate the probability that a given member is real.
Stop abuse before it happens!
Member reputation infrastructure
[Diagram: fake account models (Heuristic/ML), abuse DB, member reputation model (ML), reputation DB, replication]
Attackers are smart
What do you do when your fake accounts get blocked?
Use real accounts instead!
Many ways to get into an account
Weak passwords
Attack:
Defense:
Pitfalls:
Credential dumps
Attack:
Defense:
Pitfalls:
Brute force attacks
Attack:
Defense:
Pitfalls:
Phishing
Attack:
Defense:
Pitfalls:
Personal Attacks
Attack:
Defense:
Pitfalls:
Password defense
We must assume the attacker already has the password!
Data Science to the Rescue!
• Are you in a city we’ve seen you in before?
• Are you using a computer we’ve seen you use before?
• Have we seen abuse from this IP address?
• etc.

• For user u and data X, estimate Pr[attack | u, X], i.e., the likelihood that the person logging in is not actually you.
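
The questions on this slide can be read as features of a single login attempt, making up part of the data X. The sketch below shows one way to compute them. The `LoginHistory` record, the field names, and the `abusive_ips` set are hypothetical, and real systems would likely use richer signals than exact matches.

```python
from dataclasses import dataclass, field


@dataclass
class LoginHistory:
    """Per-member record of previously seen login attributes (hypothetical)."""
    cities: set = field(default_factory=set)
    devices: set = field(default_factory=set)


def login_features(history, city, device, ip, abusive_ips):
    """Answer the slide's questions as boolean features for one login attempt."""
    return {
        "known_city": city in history.cities,
        "known_device": device in history.devices,
        "ip_seen_abusive": ip in abusive_ips,
    }
```

For example, a member who has only logged in from San Jose on one laptop would get `known_city=False` and `known_device=False` for a login from a new city on an unknown browser.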
Estimating likelihood of attack
Heuristic:
[Figure labels: "BAD", "Not so bad"]
Estimating likelihood of attack
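Machine Learning:
$$
\Pr[\text{attack} \mid u, X] =
\underbrace{\Pr[\text{attack} \mid X]}_{\text{Asset Reputation}} \cdot
\underbrace{\frac{\Pr[X]}{\Pr[X \mid u]}}_{\text{Member and Site History}} \cdot
\underbrace{\frac{\Pr[u \mid \text{attack}]}{\Pr[u]}}_{\text{Member Reputation}}
$$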
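
The slide gives the factorization Pr[attack | u, X] = Pr[attack | X] · Pr[X] / Pr[X | u] · Pr[u | attack] / Pr[u] without a derivation. One way to arrive at it, offered here as a reading and not as the speaker's stated argument, is Bayes' rule plus the assumption that, given an attack, the login data X is independent of which member u is targeted:

```latex
\begin{align*}
\Pr[\text{attack} \mid u, X]
  &= \frac{\Pr[u, X \mid \text{attack}] \, \Pr[\text{attack}]}{\Pr[u, X]}
     && \text{(Bayes' rule)} \\
  &= \frac{\Pr[X \mid \text{attack}] \, \Pr[u \mid \text{attack}] \, \Pr[\text{attack}]}
          {\Pr[X \mid u] \, \Pr[u]}
     && \text{(assumed independence of $u$ and $X$ given attack)} \\
  &= \Pr[\text{attack} \mid X] \cdot \frac{\Pr[X]}{\Pr[X \mid u]}
     \cdot \frac{\Pr[u \mid \text{attack}]}{\Pr[u]}
     && \text{(since $\Pr[X \mid \text{attack}] \Pr[\text{attack}] = \Pr[\text{attack} \mid X] \Pr[X]$)}
\end{align*}
```

Each factor can then be estimated separately: the first from how abusive the assets in X have been, the second from how usual X is for this member compared with the site overall, and the third from how likely this member is to be targeted.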
Scoring Login Attempts
• Use machine-learned model + heuristic rules to compute a login score.
[Figure: login score axis from 0 to 1, with 0.5 marked]
• Thresholds determined by precision/recall tradeoffs (e.g. aim for x% false positives)
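
A minimal sketch of turning a login score into an action is shown below. The talk does not say what happens in each score band or where the thresholds sit. The three actions, the threshold values of 0.5 and 0.9, and the way rule hits are folded in are all assumptions for illustration.

```python
def login_decision(model_score, rule_hits, challenge_at=0.5, block_at=0.9):
    """Map a login score to an action (hypothetical policy).

    model_score: estimated probability of attack, in [0, 1].
    rule_hits:   names of heuristic rules that fired; any hit forces
                 at least a challenge.
    """
    if not 0.0 <= model_score <= 1.0:
        raise ValueError("model_score must be in [0, 1]")
    if model_score >= block_at:
        return "block"
    if model_score >= challenge_at or rule_hits:
        return "challenge"
    return "allow"
```

In practice the two thresholds would be set from labeled data to hit a chosen false positive rate, as the slide suggests, instead of being fixed constants.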
Take-aways
• Stop bad guys at the entry points.
• Be careful about bothering good members.
• Securing registration is hard: not much data.
• Securing login is hard: passwords suck.
• Run models offline to catch what you missed online.
Questions?
dfreeman@linkedin.com
(p.s. We’re hiring)