Smartening the Crowd: Computational Techniques for Improving Human Verification, at SOUPS 2011

We looked at how to augment crowdsourcing techniques to improve coverage, accuracy, and timeliness in identifying phishing attacks. We used relatively simple clustering algorithms to group phish together, and weighted votes based on previous correct answers.

  • Figure note (Individual Accuracy): Average accuracy for each decile of users, sorted by accuracy. For example, the average accuracy of the top 10% of users in both conditions was 100%, whereas the average accuracy of the bottom 10% was under 30% in the Control condition and under 50% in the Cluster condition.

Presentation Transcript

  • Slide 1: Smartening the Crowds: Computational Techniques for Improving Human Verification to Fight Phishing Scams. Gang Liu, Wenyin Liu (Department of Computer Science, City University of Hong Kong); Guang Xiang, Bryan A. Pendleton, Jason I. Hong (Carnegie Mellon University)
  • Slide 3: Detecting Phishing Websites • Method 1: Use heuristics – Unusual patterns in URL, HTML, topology – Approach is favored by researchers – High true positives, some false positives • Method 2: Manually verify – Approach used by industry blacklists today (Microsoft, Google, PhishTank) – Very few false positives, low risk of liability – Slow, easy to overwhelm
  • Slide 7: Wisdom of Crowds Approach • Mechanics of PhishTank – Submissions require at least 4 votes and 70% agreement – Some votes weighted more • Total stats (Oct 2006 – Feb 2011) – 1.1M URL submissions from volunteers – 4.3M votes, resulting in about 646k identified phish • Why so many votes for only 646k phish?
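As a concrete illustration, here is a minimal Python sketch of a PhishTank-style labeling rule (at least 4 votes, 70% agreement). The function name and the symmetric 70% rule for "legitimate" labels are illustrative assumptions, not PhishTank's actual code:

```python
# Minimal sketch of a PhishTank-style aggregation rule: a submission
# needs >= 4 votes and >= 70% agreement before it gets a label.
def label_submission(votes):
    """votes: list of bools, True = voted 'phish'. Returns a label or None."""
    if len(votes) < 4:
        return None                      # not enough votes yet
    phish_share = sum(votes) / len(votes)
    if phish_share >= 0.70:
        return "phish"
    if phish_share <= 0.30:              # 70% agreement the other way
        return "legitimate"
    return None                          # no 70% majority either way

print(label_submission([True, True, True, False]))  # 75% agreement -> 'phish'
```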
  • Slide 8: PhishTank Statistics, Jan 2011
      Submissions     16,019
      Total Votes     69,648
      Valid Phish     12,789
      Invalid Phish      549
      Median Time     2 hrs 23 min
    • 69,648 votes → at most 17,412 labels (at 4 votes per label) – But only 12,789 phish and 549 legitimate URLs were identified – 2,681 URLs were not labeled at all • Median delay of 2+ hours still has room for improvement
  • Slide 9: Why Care? • Can improve performance of human-verified blacklists – Dramatically reduce time to blacklist – Improve breadth of coverage – Offer same or better accuracy • More broadly, a new way of improving a crowd's performance on a task
  • Slide 10: Ways of Smartening the Crowd • Change the order URLs are shown – Ex. most recent vs. closest to completion • Change how submissions are shown – Ex. show one at a time or in groups • Adjust threshold for labels – PhishTank is 4 votes and 70% – Ex. vote weights, algorithm also votes • Motivating people / allocating work – Filtering by brand, competitions, teams of voters, leaderboards
  • Slide 15: Overview of Our Work • Crawled unverified submissions from PhishTank over a 2-week period • Replayed URLs on MTurk over 2 weeks – Required participants to play 2 rounds of Anti-Phishing Phil – Clustered phish by HTML similarity – Two cases: phish shown one at a time, or in a cluster (not strictly separate conditions) – Evaluated effectiveness of the vote-weight algorithm after the fact
  • Slide 16: Anti-Phishing Phil • We had MTurkers play two rounds of Phil [Sheng 2007] to qualify (µ = 5.2 min) • Goal was to reduce lazy MTurkers and ensure a base level of knowledge
  • Slide 17: Clustering Phish • Observations – Most phish are generated by toolkits and thus are similar in content and appearance – Can potentially reduce labor by labeling suspicious sites in bulk – Labeling a single site as phish can be hard if the brand is unfamiliar, easier with multiple examples
  • Slide 20: Most Phish Can Be Clustered • With all data over two weeks, 3,180 of 3,973 web pages could be grouped (80%) – Used shingling and DBSCAN (see paper) – 392 clusters, sizes from 2 to 153 URLs
  • Slide 22: MTurk Tasks • Two kinds of tasks, control and cluster – Listed these as two separate HITs – MTurkers paid $0.01 per label – Cannot enforce between-subjects conditions on MTurk – An MTurker saw a given URL at most once • Four votes minimum, 70% threshold – Stopped at 4 votes; cannot dynamically request more votes on MTurk – 153 URLs (3.9%) in control and 127 (3.2%) in cluster were not labeled
  • Slide 23: MTurk Tasks • URLs were replayed in order – Ex. if crawled from PhishTank at 2:51am on day 1, we replayed it at 2:51am on day 1 of the experiment – Listed new HITs each day rather than one HIT lasting two weeks (to avoid delays and a last-minute rush)
  • Slide 24: Summary of Experiment • 3,973 suspicious URLs – Ground truth from Google, MSIE, and PhishTank, checked every 10 min – 3,877 were phish, 96 were not • 239 MTurkers participated – 174 did HITs for both control and cluster – 26 in control only, 39 in cluster only • Total of 33,781 votes placed – 16,308 in control – 11,463 in cluster (17,473 equivalent) • Cost (participants + Amazon): $476.67 USD
  • Slide 25: Results of Aquarium • All votes are the individual votes • Labeled URLs are after aggregation
  • Slide 26: Comparing Coverage and Time
  • Slide 27: Voteweight • Use time and accuracy to weight votes – Those who vote early and accurately are weighted more – Older votes are discounted – Incorporates a penalty for wrong votes • Done after data was collected – Harder to do in real time, since we don’t know the true label until later • See paper for parameter tuning – Of threshold and penalty function
  • Slide 28: Voteweight Results • Control condition, best scenario (before → after) – 94.8% accuracy, avg 11.8 hrs, median 3.8 → 95.6% accuracy, avg 11.0 hrs, median 2.3 • Cluster condition, best scenario (before → after) – 95.4% accuracy, avg 1.8 hrs, median 0.7 → 97.2% accuracy, avg 0.8 hrs, median 0.5 • Overall: small gains, though potentially more fragile and more complex
  • Slide 29: Limitations of Our Study • Two limitations of MTurk – No separation between control and cluster conditions – ~3% of URLs had unresolved tie votes (would need more votes) • Possible learning effects? – Hard to tease out with our data – Aquarium doesn’t offer feedback – Everyone played Phil – Neither condition was prioritized over the other • Optimistic case: no active subversion
  • Slide 30: Conclusion • Investigated two techniques for smartening the crowd for anti-phishing – Clustering and voteweight • Clustering offers significant advantages w.r.t. time and coverage • Voteweight offers smaller improvements in effectiveness
  • Slide 31: Acknowledgments • Research supported by CyLab under ARO grants DAAD19-02-1-0389 and W911NF-09-1-0273 • Research Grants Council of the Hong Kong Special Administrative Region, China [Project No. CityU 117907]
  • Slide 33: Individual Accuracy
  • Slide 34: 4. Framework
  • Slide 35: 4.1 Whitelists – Include 3,208 domains – From Google Safe Browsing (2,784): http://sb.google.com/safebrowsing/update?version=goog-white-domain:1:1 – From millersmiles (424): http://www.millersmiles.co.uk/scams.php – Reduce false positives – Save human effort
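The whitelist check itself is straightforward. A sketch, assuming domains are stored as a plain set and that a host matches if it or any parent domain is whitelisted (the slides do not specify the matching rule, and the example domains are placeholders):

```python
# Hypothetical whitelist filter: skip human verification for known-good domains.
from urllib.parse import urlparse

WHITELIST = {"google.com", "paypal.com"}  # the study used 3,208 domains

def needs_review(url):
    """True if the URL's host is not covered by the whitelist."""
    host = urlparse(url).hostname or ""
    # Match the host itself or any parent domain (e.g. www.paypal.com).
    parts = host.split(".")
    return not any(".".join(parts[i:]) in WHITELIST for i in range(len(parts)))

print(needs_review("http://www.paypal.com/login"))    # False -> whitelisted
print(needs_review("http://paypa1-secure.example/"))  # True  -> send to crowd
```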
  • Slide 36: 4.2 Clustering – Content similarity measurement (shingling method): with S(q), S(d) the sets of unique n-grams in pages q and d, similarity is $r(q,d) = \frac{|S(q) \cap S(d)|}{|S(q) \cup S(d)|}$ – The similarity threshold is 0.65 – Average time cost of calculating the similarity of two web pages: 0.063 microseconds (SD = 0.05), on a laptop with a 2 GHz dual-core CPU and 1 GB of RAM – DBSCAN with Eps = 0.65 and MinPts = 2 – Clustering all 3,973 collected pages took about 1 second
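A minimal Python sketch of this clustering step follows. Two assumptions: the shingles are character 8-grams (the slide does not give n), and DBSCAN runs on distance = 1 − r(q,d), so the 0.65 similarity threshold becomes eps = 0.35 on the distance scale (the slide reports Eps = 0.65 on its own scale):

```python
# Sketch: Jaccard similarity over unique n-gram shingles, clustered with
# DBSCAN on a precomputed pairwise distance matrix.
import numpy as np
from sklearn.cluster import DBSCAN

def shingles(text, n=8):
    """Set of unique character n-grams of a page's content."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def similarity(q, d, n=8):
    """r(q,d) = |S(q) ∩ S(d)| / |S(q) ∪ S(d)|."""
    sq, sd = shingles(q, n), shingles(d, n)
    union = sq | sd
    return len(sq & sd) / len(union) if union else 0.0

def cluster_pages(pages):
    """Return a DBSCAN cluster label per page (-1 = noise / unclustered)."""
    m = len(pages)
    dist = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            dist[i, j] = dist[j, i] = 1.0 - similarity(pages[i], pages[j])
    return DBSCAN(eps=0.35, min_samples=2, metric="precomputed").fit_predict(dist)
```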
  • Slide 37: 4.2 Clustering – Incremental update of the data: – If there is no similar web page, we create a new cluster for the new submission – If the similarity is above the given threshold and all similar web pages are in the same cluster, we assign the new submission to that cluster (unless the cluster is at its maximum size) – If there are similar web pages in several different clusters, we choose the largest cluster that is not at its maximum size – After a new submission is grouped into a cluster, it has zero votes and does not inherit the votes of any other submission in that cluster
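The three assignment rules above can be sketched as follows; `similarity` is the shingling function from the previous sketch, and the maximum cluster size is left as a hypothetical parameter since the slides do not state its value:

```python
# Sketch of the incremental-update rules for placing a new submission.
def assign(page, clusters, threshold=0.65, max_size=None):
    """Place a new submission into `clusters` (a list of page lists)."""
    # Clusters that contain at least one sufficiently similar page...
    candidates = [c for c in clusters
                  if any(similarity(page, p) >= threshold for p in c)]
    # ...and that still have room (max_size is hypothetical here).
    if max_size is not None:
        candidates = [c for c in candidates if len(c) < max_size]
    if not candidates:               # rule 1: no similar page -> new cluster
        new_cluster = [page]
        clusters.append(new_cluster)
        return new_cluster
    best = max(candidates, key=len)  # rules 2-3: largest eligible cluster
    best.append(page)                # the new submission starts with zero votes
    return best
```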
  • Slide 38: 4.3 Voteweight – The core idea behind voteweight is that participants who are more helpful in terms of time and accuracy are weighted more than other participants – It measures how strongly a user’s votes influence the final status of suspicious URLs – Its value comes from the accuracy of the user’s history: a correct vote is rewarded and a wrong one penalized, and recent behavior is weighted more than past behavior
  • Slide 39: 4.3 Voteweight – In our model, we use $y \in [t, +\infty) \cup (-\infty, -t]$ to label the status of a URL, where y is the sum of voteweights for a given URL and t is the voteweight threshold: $y \geq t$ means the URL has been voted a phishing URL, and $y \leq -t$ means it has been voted legitimate
  • Slide 40: 4.3 Voteweight – The voteweight equations (reconstructed from the slide):
    (1) $RV_i = R_i - \alpha \cdot P_i$
    (2) $v_i = \begin{cases} RV_i & \text{if } RV_i \geq 0 \\ 0 & \text{otherwise} \end{cases}$
    (3) $v_i' = \dfrac{v_i}{\sum_{k=1}^{M} v_k}$
    (4) $R_i = \sum_{j=1}^{N} \dfrac{T_j - T_0 + 1}{T - T_0} \cdot I(C_{ij} = L_j)$
    (5) $P_i = \sum_{j=1}^{N} \dfrac{T_j - T_0 + 1}{T - T_0} \cdot I(C_{ij} \neq L_j)$
    (6) $I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$
    (7) $l_t = \sum_{i=1}^{K} v_i' \cdot C_{it}$
    (8) $C_{it} = \begin{cases} 1 & \text{if voted as phish} \\ -1 & \text{otherwise} \end{cases}$
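Read this way, the equations compute a reward R_i and penalty P_i from each user's voting history (recent votes weigh more), combine them into a non-negative raw weight, normalize across users, and sum signed weighted votes per URL. A Python sketch under that reading; since the formulas above are my reconstruction of a garbled slide, treat the details as an assumption:

```python
# Sketch of the voteweight computation; names follow the slides
# (R_i, P_i, RV_i, v_i, l_t).
def raw_weight(times, correct, T, T0, alpha):
    """Non-negative raw weight v_i for one user from their vote history.
    times[j]  : time of the user's j-th past vote
    correct[j]: True if that vote matched the eventual label
    """
    w = [(tj - T0 + 1) / (T - T0) for tj in times]       # recent votes weigh more
    R = sum(wj for wj, ok in zip(w, correct) if ok)      # reward (eq. 4)
    P = sum(wj for wj, ok in zip(w, correct) if not ok)  # penalty (eq. 5)
    return max(R - alpha * P, 0.0)                       # eqs. 1-2

def label_score(raw_weights, votes_phish):
    """l_t = sum_i v'_i * C_it, with C_it = +1 for a phish vote, -1 otherwise."""
    total = sum(raw_weights) or 1.0                      # normalization (eq. 3)
    return sum((v / total) * (1 if phish else -1)        # eqs. 7-8
               for v, phish in zip(raw_weights, votes_phish))

# The URL is labeled phish if label_score(...) >= t, legitimate if <= -t.
```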
  • Slide 41: 7. Investigating Voteweight Tuning Parameters in the Control Condition – Voteweight achieves its best accuracy of 95.6% and a time cost of 11 hours with t = 0.08 and α = 2.5 in the control condition – Average time cost drops to 11 hours (11.8 hours without voteweight) – Median time cost drops to 2.3 hours (3.8 hours without voteweight)
  • Slide 42: 7. Investigating Voteweight Tuning Parameters in the Cluster Condition – Voteweight achieves its best accuracy of 97.2% and a time cost of 0.8 hours with t = 0.06 and α = 1 in the cluster condition – Average time cost drops to 0.8 hours (1.8 hours without voteweight) – Median time cost drops to 0.5 hours (0.7 hours without voteweight)