Typo-Squatting: a Nuisance or a Threat to Your Traffic?



  1. 1. Typo-Squatting: a Nuisance or a Threat to Your Traffic? Mishari Almishari
  2. 2. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  3. 3. Introduction - Motivation <ul><li>Traffic is important to web domains! </li></ul><ul><ul><li>No point in launching without incoming traffic </li></ul></ul><ul><ul><li>Losing/gaining traffic means losing/gaining money </li></ul></ul><ul><ul><li>One way to price ads is the pay-per-click model </li></ul></ul><ul><li>Traffic diversion could be a serious threat to a domain </li></ul>
  4. 4. Introduction - Motivation <ul><li>Typos may attract traffic </li></ul><ul><ul><li>Users are prone to making typos </li></ul></ul><ul><ul><li>Users may forget about visiting the target domain </li></ul></ul><ul><ul><ul><li>Threat to the target domain! </li></ul></ul></ul><ul><li>Intentionally registering such typo domains is called typo-squatting </li></ul>
  5. 5. Introduction - Goal <ul><li>To study how much traffic typo-squatters can get from target domains </li></ul><ul><ul><li>Are those domains attracting much traffic? </li></ul></ul><ul><ul><ul><li>There are many typo-squatting domains registered (Banerjee et al., 08) </li></ul></ul></ul><ul><ul><ul><li>Search-engine typo corrections and browser auto-completions! </li></ul></ul></ul><ul><ul><li>How much traffic are target domains losing? </li></ul></ul><ul><ul><li>Is it a negligible ratio or a serious threat? </li></ul></ul><ul><ul><li>Do users go back to target domains or get distracted? </li></ul></ul>
  6. 6. Introduction - Challenges <ul><li>How to identify typo-squatting domains? </li></ul><ul><ul><li>Does Typo mean Typo-squatting? </li></ul></ul><ul><ul><ul><li>Short Domains </li></ul></ul></ul><ul><ul><ul><ul><li>www.abc.com and www.abd.com </li></ul></ul></ul></ul><ul><ul><ul><li>Longer Domains </li></ul></ul></ul><ul><ul><ul><ul><li>www.walmart.com and www.walkmart.com </li></ul></ul></ul></ul><ul><ul><li>If not, how can we? </li></ul></ul><ul><ul><ul><li>Hijacking indicator </li></ul></ul></ul>
  7. 7. Introduction - Contribution <ul><li>Automatic and accurate identification of typo-squatting domains (Measurement Methodology) </li></ul><ul><li>Bound on how much traffic target domains are losing to typo-squatting domains (Measurement Results) </li></ul>
  8. 8. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Related Work </li></ul><ul><li>Future Work </li></ul><ul><li>Conclusion </li></ul>
  9. 9. Background – Domain Parking <ul><li>Domain Parking is the practice of showing a temporary page for an unused domain before launching it </li></ul>
  10. 10. Background - Domain Parking
  11. 11. Background – Domain Parking
  12. 12. Background – Domain Parking
  13. 13. Background – Domain Parking <ul><li>Domain Parking Service </li></ul><ul><ul><li>Parks and hosts unused domains </li></ul></ul><ul><ul><li>Monetizes the traffic by showing ads </li></ul></ul><ul><li>Many typo-squatting domains are parked domains (Wang et al., 06), (Keats, 07) </li></ul>
  14. 14. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  15. 15. Methodology <ul><li>Data Collection </li></ul><ul><li>Identifying Typo-Squatting Domains </li></ul>
  16. 16. Methodology - Data Collection <ul><li>DNS traces @ UCI Resolvers </li></ul><ul><li>Internal requests to domain names </li></ul><ul><li>DNS query precedes HTTP request </li></ul><ul><li>Caching limitation </li></ul><ul><li>Our study represents a lower bound </li></ul>
  17. 17. Methodology - Data Collection [Diagram labels: USER QUERY, UCI NET, UCI Resolver, INTERNET, Our Machine. Record fields: DATE, TIME, HASHED-IP, DOMAIN, TYPE, CLASS]
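
The record fields listed on this slide (DATE, TIME, HASHED-IP, DOMAIN, TYPE, CLASS) suggest a simple per-query log line. A minimal parsing sketch follows, assuming whitespace-separated fields in that order; the actual trace format is not given in the slides, and the sample line, `DnsQuery`, and `parse_record` are hypothetical names made up for illustration.

```python
from typing import NamedTuple, Optional


class DnsQuery(NamedTuple):
    date: str
    time: str
    hashed_ip: str   # anonymized client address
    domain: str      # queried name, normalized to lowercase
    qtype: str       # e.g. "A"
    qclass: str      # e.g. "IN"


def parse_record(line: str) -> Optional[DnsQuery]:
    """Parse one log line; return None if it does not have six fields."""
    parts = line.split()
    if len(parts) != 6:
        return None
    date, time, hashed_ip, domain, qtype, qclass = parts
    # DNS names are case-insensitive and may carry a trailing root dot.
    return DnsQuery(date, time, hashed_ip, domain.lower().rstrip("."), qtype, qclass)


# Made-up sample line, not taken from the study's traces.
sample = "2009-01-15 10:32:07 a1b2c3 www.Example.com. A IN"
record = parse_record(sample)
```

Normalizing case and the trailing dot up front keeps later hit counting from splitting one domain across several spellings.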
  18. 18. Methodology – Identify Typo-squatting Domain <ul><li>Identify Similar Domains </li></ul><ul><ul><li>Single Error Typo </li></ul></ul><ul><ul><ul><li>Single errors account for 90-95% of spelling/typo errors (Pollock et al., 83) </li></ul></ul></ul><ul><ul><ul><li>www.walmart.com and www.wamart.com </li></ul></ul></ul><ul><ul><li>gTLD substitution </li></ul></ul><ul><ul><ul><li>www.amazon.com and www.amazon.org </li></ul></ul></ul>
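
The two similarity rules on this slide can be sketched as below. This is an illustration under stated assumptions, not the study's implementation: a "single error" is taken to mean one insertion, deletion, substitution, or swap of two adjacent characters, and the `GTLDS` set is a hypothetical list chosen for the example.

```python
GTLDS = {"com", "org", "net", "info", "biz"}  # assumed set, for illustration


def split_domain(domain):
    """Split 'www.walmart.com' into ('www.walmart', 'com')."""
    name, _, tld = domain.lower().rpartition(".")
    return name, tld


def is_single_error(a, b):
    """True if a and b differ by exactly one edit (insert, delete,
    substitute, or transpose two adjacent characters)."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        diffs = [i for i in range(len(a)) if a[i] != b[i]]
        if len(diffs) == 1:                      # one substitution
            return True
        if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
            i, j = diffs                         # adjacent transposition
            return a[i] == b[j] and a[j] == b[i]
        return False
    if len(a) > len(b):                          # make a the shorter string
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    return a[i:] == b[i + 1:]                    # skip one extra char in b


def is_similar(target, candidate):
    """Single-error typo under the same TLD, or same name under another gTLD."""
    t_name, t_tld = split_domain(target)
    c_name, c_tld = split_domain(candidate)
    if t_tld == c_tld:
        return is_single_error(t_name, c_name)
    return t_name == c_name and t_tld in GTLDS and c_tld in GTLDS
```

With this sketch, `is_similar("www.walmart.com", "www.wamart.com")` and `is_similar("www.amazon.com", "www.amazon.org")` both hold, matching the slide's examples.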
  19. 19. Methodology – Identify Typo-squatting Domains <ul><li>But a similar domain is not enough! </li></ul><ul><ul><li>www.abc.com and www.abd.com </li></ul></ul><ul><ul><li>www.walmart.com and www.walkmart.com </li></ul></ul><ul><ul><li>www.usps.com and www.usps.org </li></ul></ul><ul><ul><li>Random Sample </li></ul></ul><ul><ul><ul><li>More than 54% are not typo-squatting </li></ul></ul></ul>Need to identify hijacking intention
  20. 20. Methodology – Identify Typo-squatting Domain <ul><li>Identify Hijacking Indicator </li></ul><ul><ul><li>Parked Domain (Ads – listing) </li></ul></ul><ul><ul><ul><li>~ 88% </li></ul></ul></ul><ul><ul><li>Forwarding to other domains </li></ul></ul><ul><ul><ul><li>~ 8% </li></ul></ul></ul><ul><ul><li>Others: Inappropriate Content, … </li></ul></ul>Parked Domain as the indicator
  21. 21. Methodology – Identify Typo-squatting Domain: Similar Domain AND Parked Domain => Typo-Squatting Domain
  22. 22. Methodology – Identify Typo-squatting Domain <ul><li>How to identify Parked Domain? </li></ul><ul><ul><li>Parked Domain Classifier </li></ul></ul><ul><ul><ul><li>96% </li></ul></ul></ul><ul><ul><li>Presence of Parking signatures </li></ul></ul><ul><ul><ul><li>Well-known parking signatures (domain names/urls) </li></ul></ul></ul>
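
The signature check mentioned on this slide could look roughly like the following sketch: scan the URLs found on a page and flag it if any host belongs to a known parking service. The slides do not list the signatures, so the hosts in `PARKING_SIGNATURES` are placeholders under the reserved `.example` TLD, and `has_parking_signature` is a hypothetical helper.

```python
from urllib.parse import urlparse

# Placeholder signatures; real ones would be parking-service hostnames.
PARKING_SIGNATURES = {"parking-service.example", "ads.parked.example"}


def has_parking_signature(urls):
    """True if any URL's host equals or is a subdomain of a known signature."""
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        for sig in PARKING_SIGNATURES:
            if host == sig or host.endswith("." + sig):
                return True
    return False


# Made-up page links for illustration.
page_links = [
    "http://www.example.org/about",
    "http://cdn.parking-service.example/landing?d=wamart.com",
]
flagged = has_parking_signature(page_links)
```

Matching on the parsed hostname rather than on a raw substring avoids false hits when a signature merely appears in a query string.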
  23. 23. Methodology - Summary: Identify Similar Domains => Identify Parked Domains => List of Typo-squatting Domains
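
The summary above amounts to intersecting two tests: a domain counts as typo-squatting only if it is similar to a target and also parked. A minimal sketch, assuming the two tests are supplied as functions; `find_typo_squatting` and the toy stand-ins below are hypothetical, and the parked set is made up for the demo.

```python
def find_typo_squatting(observed, targets, is_similar, is_parked):
    """Map each target to the observed domains that are similar AND parked."""
    found = {}
    for target in targets:
        for domain in observed:
            if is_similar(target, domain) and is_parked(domain):
                found.setdefault(target, []).append(domain)
    return found


# Toy stand-ins for the similarity test and the parked-domain classifier.
similar_pairs = {
    ("www.walmart.com", "www.wamart.com"),
    ("www.walmart.com", "www.walkmart.com"),
}
parked = {"www.wamart.com"}  # made up for the demo, not a measured result

result = find_typo_squatting(
    observed=["www.wamart.com", "www.walkmart.com", "www.usps.org"],
    targets=["www.walmart.com"],
    is_similar=lambda t, d: (t, d) in similar_pairs,
    is_parked=lambda d: d in parked,
)
```

In the toy run only the domain that passes both tests is kept, so a similar but non-parked domain is not labeled as typo-squatting.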
  24. 24. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  25. 25. Parked Domain Classifier: Build Data Set => Extract Core Features => Combine Into Classifier
  26. 26. Data Set <ul><li>Data set consists of 2,800 domains </li></ul><ul><li>700 are parked domains </li></ul><ul><ul><li>Collected from the MS Strider website </li></ul></ul><ul><li>2,100 are non-parked domains </li></ul><ul><ul><li>Collected from the fourteen Yahoo Directory top categories </li></ul></ul>
  27. 27. Feature Selection <ul><li>Heuristically, Identify common features in parked domain </li></ul><ul><li>Compute the distribution of those features for verification </li></ul><ul><li>Common Link Ratio Max </li></ul>
  28. 28. Feature Selection
  29. 29. Combining Features Into Classifier <ul><li>Tried Different Classifier Algorithms </li></ul><ul><ul><li>Decision Tree </li></ul></ul><ul><ul><li>SVM </li></ul></ul><ul><ul><li>K-Nearest Neighbor </li></ul></ul><ul><ul><li>Random Forest </li></ul></ul><ul><ul><ul><li>The best performance </li></ul></ul></ul>
  30. 30. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  31. 31. DATA Sets <ul><li>DNS Traces </li></ul><ul><ul><li>Four Months </li></ul></ul><ul><ul><li>~ 30 million domains ( ~ 2 billion hits ) ( ~ 30,000 users ) </li></ul></ul><ul><li>Target Domain Set </li></ul><ul><ul><li>Alexa’s Top 500 popular domains </li></ul></ul><ul><ul><li>~53,000,000 hits </li></ul></ul>
  32. 32. Typo-Squatting Domains & Hits <ul><li>1,332 typo-squatting domains </li></ul><ul><li>13,431 hits (~ 110 a day) </li></ul><ul><li>Is it large or small? </li></ul><ul><ul><li>500 target domains </li></ul></ul><ul><ul><li>4-month period </li></ul></ul><ul><ul><li>~ 30,000 users </li></ul></ul><ul><ul><li>A similar ratio may translate to a non-trivial number for larger user populations </li></ul></ul><ul><ul><ul><li>30,000 => 110 per day </li></ul></ul></ul><ul><ul><ul><li>300,000 => 1,100 per day </li></ul></ul></ul><ul><ul><ul><li>3,000,000 => 11,000 per day (x 365 = ~ 4,000,000 a year) </li></ul></ul></ul>
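
The extrapolation on this slide is a linear scaling of the observed daily rate by user population. A small sketch of that arithmetic, assuming 30-day months for the four-month trace (the slides do not give the exact number of days) and a hypothetical helper `projected_daily_hits`:

```python
HITS = 13_431          # typo-squatting hits observed in the trace
DAYS = 4 * 30          # four months, assuming 30-day months
USERS = 30_000         # approximate user population

hits_per_day = HITS / DAYS                   # roughly 110 a day
hits_per_user_per_day = hits_per_day / USERS


def projected_daily_hits(population):
    """Scale the observed per-user rate linearly to another population."""
    return hits_per_user_per_day * population


daily_at_3m = projected_daily_hits(3_000_000)
yearly_at_3m = daily_at_3m * 365             # on the order of 4 million
```

The linear scaling assumes other user populations mistype at the same rate as the one measured, which is the "similar ratio" premise stated on the slide.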
  33. 33. Typo-squatting Ratio <ul><li>0.025% of total number of queries </li></ul><ul><li>(89% , ≤ 1%) (70%, ≤ 0.1%) ( 57%, ≤ 0.01%) </li></ul>
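
The headline figure can be checked against the hit counts reported earlier (13,431 typo-squatting hits against roughly 53,000,000 target hits). The sketch below also shows one way to read the paired numbers, under the assumption that "(89%, ≤ 1%)" means 89% of target domains have a ratio of at most 1%; that reading and the toy per-domain ratios are assumptions, not data from the study.

```python
def squat_ratio(squat_hits, target_hits):
    """Typo-squatting hits relative to hits on the target domains."""
    return squat_hits / target_hits


overall = squat_ratio(13_431, 53_000_000)
overall_percent = overall * 100              # close to 0.025


def share_at_or_below(ratios, threshold):
    """Fraction of domains whose ratio does not exceed the threshold."""
    return sum(r <= threshold for r in ratios) / len(ratios)


# Made-up per-domain ratios, only to show the computation.
toy_ratios = [0.02, 0.005, 0.0005, 0.00005]
share_1pct = share_at_or_below(toy_ratios, 0.01)
```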
  34. 34. User Correction Ratio – Alexa-500 <ul><li>54% of typo-squatting queries are corrected </li></ul><ul><li>~ 51% of squatted target domains have most squat hits corrected </li></ul>
  35. 35. Potential Hit Loss <ul><li>Potential Hit Loss Ratio = 0.012% </li></ul><ul><li>(92% , ≤ 1%) (78%, ≤ 0.1%) (64%, ≤ 0.01%) </li></ul>
  36. 36. Potential Money Loss <ul><li>~75% do not point to target domains </li></ul><ul><li>Referring Typo-Sqt Ratio = 0.008% </li></ul><ul><li>(96%, ≤ 1%) (91%, ≤ 0.1%) ( 81%, ≤ 0.01%) </li></ul>
  37. 37. Non-existing Similar Domains <ul><li>8,285 potential hits (~ 500 non-existing typo domains) </li></ul><ul><li>0.015% of total number of queries </li></ul><ul><li>(96%, ≤ 1%) (83%, ≤ 0.1%) (66%, ≤ 0.01%) </li></ul>
  38. 38. Typo-Squatting Distribution <ul><li>19 % of all Typo-squatting hits </li></ul>
  39. 39. Top Ten Typo-squatting Domains <ul><li>19 % of all Typo-squatting hits </li></ul>
  40. 40. Top Ten Target Domains <ul><li>Responsible for 55% of all typo-squatting queries of Alexa-500 </li></ul><ul><li>50 million hits of “www.facebook.com” </li></ul>
  41. 41. Typo Characterization <ul><li>Most typos are single errors (95% vs. 5%) </li></ul><ul><li>Most gTLD substitutions are “com” to “org” (50%) </li></ul><ul><li>Additions – 37% are of non-adjacent keys </li></ul><ul><li>Substitutions – 77% are of non-adjacent keys </li></ul><ul><li>Substitutions – 13% are between “a” and “o” </li></ul><ul><ul><li>Spelling error </li></ul></ul>
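
Classifying a typo as adjacent-key or not needs some model of the keyboard. The sketch below assumes a QWERTY layout with each row shifted half a key to the right of the one above, and treats two letters as adjacent when they are at most one row and one key-width apart; the slides do not say how adjacency was defined, so this geometry and the name `are_adjacent` are assumptions.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

# Map each letter to (row, horizontal position); lower rows are staggered.
KEY_POS = {
    ch: (row, col + 0.5 * row)
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}


def are_adjacent(a, b):
    """True if a and b are distinct letters on neighboring keys."""
    pa, pb = KEY_POS.get(a.lower()), KEY_POS.get(b.lower())
    if pa is None or pb is None or pa == pb:
        return False  # non-letters and identical keys are not adjacent
    return abs(pa[0] - pb[0]) <= 1 and abs(pa[1] - pb[1]) <= 1
```

Under this model `are_adjacent("a", "s")` holds while `are_adjacent("a", "o")` does not, which fits the slide's reading of "a"/"o" substitutions as spelling errors rather than slips of the finger.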
  42. 42. Typo-squatting Domains – TP60 <ul><li>15,499 hits </li></ul><ul><li>0.045% of total number of queries </li></ul><ul><li>(76%, ≤ 1%) (60%, ≤ 0.5%) </li></ul>
  43. 43. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  44. 44. Future Work <ul><li>How much of the ads budget goes to squatters? </li></ul><ul><li>Enhance our identification technique </li></ul><ul><li>See if the results hold at other ISPs </li></ul><ul><li>Typo modeling for getting traffic back </li></ul>
  45. 45. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  46. 46. Related Work <ul><li>MS Strider Project [Wang et al. Sruti06] </li></ul><ul><li>McAfee Study [ Keats McAfee White Paper 07 ] </li></ul><ul><li>JAAL project [Banerjee et al. Infocom 08] </li></ul>
  47. 47. Outline <ul><li>Introduction </li></ul><ul><li>Background </li></ul><ul><li>Methodology </li></ul><ul><li>Parked Domain Classifier </li></ul><ul><li>Measurements </li></ul><ul><li>Future Work </li></ul><ul><li>Related Work </li></ul><ul><li>Conclusion </li></ul>
  48. 48. Conclusion <ul><li>Accurately and automatically identify typo-squatting domains </li></ul><ul><li>How much traffic goes to typo-squatters </li></ul><ul><li>Bound on how much traffic the target domain is losing to typo-squatting </li></ul><ul><ul><li>Inconsequential </li></ul></ul>