Data Science vs. the Bad Guys: Defending LinkedIn from Fraud and Abuse

As the world’s largest professional network, LinkedIn is subject to a barrage of fraudulent and/or abusive activity aimed at its member-facing products. LinkedIn’s Security Data Science team is tasked with detecting bad activity and building proactive solutions to keep it from happening in the first place. In this talk we explore various types of abuse we see at LinkedIn and discuss some of the solutions we’ve built to defend against them. We focus on ways bad actors can enter the site: fake accounts and account takeover. Some common themes include:

- Precision/recall tradeoffs: No model is 100% accurate, so we must always make a call on where to draw the line when flagging accounts or activity as abusive. What’s the cost of labeling a good member as bad vs. labeling a bad member as good?

- Online/offline tradeoffs: Online models can stop fraudulent activity before it has a chance to gain traction; offline models can use more data and cast a wider net, while also requiring less engineering effort to build. For any given abuse pattern, we must consider whether we can detect and stop the activity in real time and also whether it’s worth the effort to do so.

- Machine learning vs. heuristic rules: Machine-learned models can be very powerful, but they also require sufficient well-labeled training data and are more difficult to maintain. Heuristic (though still data-driven!) rules can often achieve 90% of the goal with 10% of the effort — but how do you tell when this is the case?

Published in: Data & Analytics

  1. Data Science vs. The Bad Guys: Using data to defend LinkedIn against fraud and abuse. David Freeman, Head of Security Data Science at LinkedIn. Strata+Hadoop World, San Jose, CA, 20 Feb 2015. ©2013 LinkedIn Corporation. All Rights Reserved.
  2. World’s largest professional network. But not everyone follows the rules!
  3. Why?
  4. What do they try to do? • Spam Messages • Spam Content • Fake Companies • Fraud Ads • Fake Jobs • Social Engineering • Social Action Spam (e.g. likes, follows) • Payment Fraud • Malware • Malicious URLs • Scraping
  5. How do they do it?
  6. How do we stop them?
  7. How we stop them (process), with a hypothetical example: lots of fake accounts from one IP address. (1) Stop the bleeding: block the IP. (2) Heuristic rules: limit signup rate from any IP. (3) Machine learning: model trained on historical data, incorporating signups/IP/hour, signups/IP/day, # good accounts on IP, # bad accounts on IP, and other features.
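The heuristic step on slide 7 ("limit signup rate from any IP") could be sketched as a sliding-window counter. This is only an illustration, not LinkedIn's implementation: the class name `SignupRateLimiter`, the limit of 5 signups, and the one-hour window are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SignupRateLimiter:
    """Hypothetical heuristic rule: at most `max_signups` per IP per window."""

    def __init__(self, max_signups=5, window_seconds=3600):
        # Both limits are illustrative values, not figures from the talk.
        self.max_signups = max_signups
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # ip -> timestamps of accepted signups

    def allow(self, ip, now):
        """Return True and record the signup if `ip` is under its limit at time `now`."""
        events = self._events[ip]
        # Discard signups that have aged out of the sliding window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_signups:
            return False  # over the limit: reject (or challenge) this signup
        events.append(now)
        return True
```

A caller would pass the request's IP and a timestamp in seconds, e.g. `limiter.allow("203.0.113.7", time.time())`. The per-IP counts this rule keeps are the same kind of signal (signups/IP/hour) that the machine-learning step later consumes as a feature.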
  8. How we stop them: infrastructure. [Diagram with Online and Offline sides; labels: request, scoring, abuse DB, accept, reject, scheduled scoring jobs.]
  9. Case studies: • Registration • Fake accounts • Account takeover. If they can’t get in, then they can’t do damage!
  10. How can we tell if you’re real?
  11. Answer: Asset Reputation Systems. We have 347 million members’ worth of data on • Names • Email addresses • IP addresses • ISPs • Browsers • etc. We can assign a reputation score to each asset based on the level of abuse we’ve seen in the past.
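One simple way to turn "the level of abuse we’ve seen in the past" into a per-asset score is a smoothed abuse rate. The sketch below is an assumption for illustration only: the function name `abuse_reputation`, the prior rate of 1%, and the prior weight of 20 are invented here, and the talk does not say how LinkedIn actually computes reputation.

```python
def abuse_reputation(bad_count, total_count, prior_rate=0.01, prior_weight=20.0):
    """Smoothed fraction of accounts seen on an asset that were labeled abusive.

    The prior pulls assets with little history toward `prior_rate`, so one bad
    account on a barely seen IP does not immediately score as 100% abusive.
    All parameter values are illustrative assumptions.
    """
    if bad_count < 0 or total_count < bad_count:
        raise ValueError("need 0 <= bad_count <= total_count")
    return (bad_count + prior_rate * prior_weight) / (total_count + prior_weight)
```

For example, an IP with 80 abusive accounts out of 100 scores far higher than one with 1 out of 100, while an IP never seen before scores exactly the prior rate.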
  12. Reputation Scoring. Instantaneous: • Calculated online from recent data • Catches new bad activity • Minimal feature set • Sample feature: rate of signups from IP in last hour. Historical: • Calculated offline from long-term data • Catches recurring bad activity • Extensive feature set • Sample feature: % of accounts using IP labeled abusive.
  13. Scoring Registration Attempts • Machine-learned model combines reputation features (offline + online) to produce a registration score. [Figure: score axis from 0 to 1, with 0.5 marked.] • How do we choose the thresholds?
  14. Precision/Recall Tradeoffs • Once the system is online, it’s hard to distinguish false positives from true positives. • The user has no recourse, so be conservative! • Bad guys who slip through will be caught sooner or later in other models.
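The "be conservative" advice on slide 14 can be made concrete by picking the score cutoff from labeled data so that precision stays above a chosen floor. The sketch below is one possible way to do that, not the method used in the talk; the function name `pick_threshold`, the convention that label 1 means abusive, and the 0.99 default are assumptions.

```python
def pick_threshold(scores, labels, min_precision=0.99):
    """Pick the lowest cutoff whose precision on labeled data meets the floor.

    `labels` uses 1 for abusive and 0 for good (an assumed convention). An
    example is flagged when its score is >= the cutoff. Returns a tuple
    (cutoff, precision, recall), or None if no cutoff meets the floor.
    """
    total_bad = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = None
    flagged = 0
    true_pos = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Examples tied at the same score are flagged together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            flagged += 1
            true_pos += ranked[i][1]
            i += 1
        precision = true_pos / flagged
        if precision >= min_precision:
            recall = true_pos / total_bad if total_bad else 0.0
            best = (cutoff, precision, recall)  # later cutoffs are lower
    return best
```

Raising `min_precision` trades recall for fewer good members wrongly blocked, which is the tradeoff the slide describes; accounts missed at this cutoff are left for the other models to catch.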
  15. Fake Accounts Offline. Offline models can use many more features: • Invitations • Connection graph • Profile content • Messages sent/received • Pattern of pages viewed • Reported by other members • etc.
  16. Fake Accounts: Online and Offline. [Diagram labels: abuse DB, fake account models (heuristic/ML), replication.]
  17. Online/Offline Tradeoffs. Online: • Instant action • Data collected from many sources • Computationally limited • Slow to build and iterate. Offline: • Action delayed hours to days • Data all in one place (HDFS) • Lots of computational resources • Fast to build and iterate.
  18. Fake Account Defense in Action. [Chart: cumulative number of accounts over time; series: Blocked at Registration, Fake Accounts Caught, Fakes Caught Within 48h of Creation.]
  19. Precision/recall again… Fake account models have to be very precise. How can we stop bad activity without making good members unhappy?
  20. Member Reputation. Estimate the probability that a given member is real. Stop abuse before it happens!
  21. Member reputation infrastructure. [Diagram labels: abuse DB, fake account models (heuristic/ML), member reputation model (ML), reputation DB, replication.]
  22. Attackers are smart. What do you do when your fake accounts get blocked? Use real accounts instead!
  23. Many ways to get into an account
  24. Weak passwords. Attack: Defense: Pitfalls:
  25. Credential dumps. Attack: Defense: Pitfalls:
  26. Brute force attacks. Attack: Defense: Pitfalls:
  27. Phishing. Attack: Defense: Pitfalls:
  28. Personal Attacks. Attack: Defense: Pitfalls:
  29. Password defense. We must assume the attacker already has the password!
  30. Data Science to the Rescue! • Are you in a city we’ve seen you in before? • Are you using a computer we’ve seen you use before? • Have we seen abuse from this IP address? • etc. • For user u and data X, estimate $\Pr[\text{attack} \mid u, X]$, i.e., the likelihood that the person logging in is an attacker rather than you.
  31. Estimating likelihood of attack. Heuristic: [Figure labels: BAD, Not so bad.]
  32. Estimating likelihood of attack. Machine Learning: $$\Pr[\text{attack} \mid u, X] = \Pr[\text{attack} \mid X] \cdot \frac{\Pr[X]}{\Pr[X \mid u]} \cdot \frac{\Pr[u \mid \text{attack}]}{\Pr[u]}$$ The three factors are labeled, in order: Asset Reputation, Member and Site History, Member Reputation.
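The factorization on slide 32 follows from Bayes’ rule if one assumes that X and u are independent given an attack; that assumption is ours and is not stated on the slide. The sketch below just multiplies the three factors. The function name `attack_probability`, its argument names, and the clipping to 1.0 are illustrative choices; how each input probability is estimated is outside this sketch.

```python
def attack_probability(p_attack_given_x, p_x, p_x_given_u, p_u_given_attack, p_u):
    """Combine the three factors of the login-attack estimate.

    Pr[attack | u, X] = Pr[attack | X] * (Pr[X] / Pr[X | u]) * (Pr[u | attack] / Pr[u])

    Every input is an estimated probability in (0, 1]. Because the inputs are
    separate estimates, the product can exceed 1, so it is clipped here
    (the clipping is an assumption of this sketch).
    """
    inputs = {
        "p_attack_given_x": p_attack_given_x,
        "p_x": p_x,
        "p_x_given_u": p_x_given_u,
        "p_u_given_attack": p_u_given_attack,
        "p_u": p_u,
    }
    for name, value in inputs.items():
        if not 0.0 < value <= 1.0:
            raise ValueError(f"{name} must be in (0, 1], got {value}")
    asset_reputation = p_attack_given_x          # how abusive is this IP/browser?
    history = p_x / p_x_given_u                  # is X unusual for this member?
    member_reputation = p_u_given_attack / p_u   # is this member a likely target?
    return min(asset_reputation * history * member_reputation, 1.0)
```

The middle factor carries the intuition from slide 30: when the member often logs in with data X (a familiar city or computer), Pr[X | u] is large and the estimate shrinks; when X is rare for that member, the estimate grows.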
  33. Scoring Login Attempts • Use machine-learned model + heuristic rules to compute a login score. [Figure: score axis from 0 to 1, with 0.5 marked.] • Thresholds determined by precision/recall tradeoffs (e.g. aim for x% false positives).
  34. Take-aways • Stop bad guys at the entry points. • Be careful about bothering good members. • Securing registration is hard: not much data. • Securing login is hard: passwords suck. • Run models offline to catch what you missed online.
  35. Questions? dfreeman@linkedin.com (p.s. We’re hiring)
