Mapping Domain Names to Categories

Oversee.net + UCLA IPAM RIPS Summer internship project 2013

Published in: Technology

Transcript

  • 1. Mapping Domain Names to Categories. Maya Rotmensch, Sorcha Gilroy, Corina Gurău. Academic Mentor: Cristina Garcia-Cardona. Industry Sponsor: Oversee.net (Kryztof Urban). Research in Industrial Projects, Institute for Pure & Applied Mathematics, University of California, Los Angeles. August 15, 2013. (Institute of Pure and Applied Mathematics) Mapping Domain Names to Categories August 15, 2013 1 / 41
  • 2. Outline
      • 1 Oversee.net
      • 2 Problem Statement: Why so complicated?; ESA - Explicit Semantic Analysis; How Oversee.net Does It
      • 3 Our Project: Our Focus; Methodology; Results
      • 4 Concluding Remarks
  • 4. Oversee.net’s Business Model. [Diagram: Person, Website]
  • 5. [Diagram: Person looking for games, A gaming website]
  • 6. Oversee.net’s Business Model. [Diagram: Person looking for games, Domain, A gaming website]
      • Direct navigation: users navigate to a website by typing its address into the address bar instead of using a search engine.
      • Example: a user looking for a gaming website navigates to ’addictinggamas.com’.
  • 7. Oversee.net’s Business Model. Domain parking + traffic matching → Oversee.net. [Diagram: Person, Domain, Category, Website]
      • Monetized domain parking: the registration of internet domain names without placing any content on the domain. Owners monetize traffic by displaying links and advertisements.
  • 8. Oversee.net’s Business Model. Advertisers are partners of Oversee.net. They choose the types of traffic they want from Oversee.net’s category tree.
  • 9. Oversee.net’s Business Model
      • Parked domains do not have any content, so mapping domains to categories is extremely difficult.
      • Oversee.net uses keywords to describe domains and categories. [Diagram: Domain, Keywords, Keywords, Category]
      • This is not enough, as the two sets of keywords are not guaranteed to use the same language!
  • 11. So what’s the big deal?
      • Reasoning about concepts
      • Scarcity of input information
      • Example 1, spelling error: cheapvacatins.com
      • Example 2, ambiguous meaning: bigbearhuts.com (animals? huts? It is supposed to be winter sports.)
  • 12. Text Categorization
      • Our problem can be thought of as a categorization problem: we need to assign a domain to one or more classes or categories.
      • A natural choice is topic modeling.
      • However, unlike most text categorization problems, we don’t actually have documents to classify, as we are dealing with undeveloped domains.
  • 13. Topic Modeling
      • This method analyzes the relationships between documents in a corpus by isolating a set of topics from the documents.
      • For meaningful results, one must work with a set of large texts. Our data set, however, consists of keywords, as our domains are undeveloped.
      • This method generates topics organically. The categories we are attempting to map into, however, are pre-defined.
  • 14. ESA - Explicit Semantic Analysis. Building a Semantic Interpreter
      • A vector space model + an exogenous knowledge base → represent the meaning of text [1]
      • Number of articles ∼ 3.5 million; number of terms ∼ 45 million
      • [1] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007.
  • 15. ESA - Explicit Semantic Analysis

                 Government  Finance  Toys  Children  Bank  School  ...
      Law        0.2         0.3      0.8   0.9       0.2   0.7     ...
      Article2   0.8         0.9      0.1   0.3       0.7   0.5     ...
      Article3   0.5         0.2      0.3   0.6       0.4   0.8     ...
      Article4   0.1         0.2      0.1   0.3       0.4   0.2     ...
      ...        ...         ...      ...   ...       ...   ...     ...

      • Term frequency inverse document frequency: $\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$
      • Logarithmically scaled term frequency: $\mathrm{tf}(t, d) = \log(f(t, d) + 1)$
      • Inverse document frequency: $\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$
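The weighting on slide 15 can be sketched in a few lines of Python. This is only an illustration of the three formulas on a toy corpus: the function name `tfidf_matrix`, the whitespace tokenizer, and the three example documents are assumptions, not part of the project's actual pipeline.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return {doc_name: {term: weight}} with tf = log(f + 1) and idf = log(|D| / df)."""
    n_docs = len(docs)
    # Raw term counts f(t, d) per document (naive whitespace tokenization, an assumption).
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    return {
        name: {t: math.log(f + 1) * math.log(n_docs / df[t]) for t, f in c.items()}
        for name, c in counts.items()
    }

# Toy corpus standing in for the knowledge-base articles (hypothetical content).
docs = {
    "Law": "law court government law",
    "Bank": "bank finance government",
    "Toy": "toys children school",
}
weights = tfidf_matrix(docs)
print(weights["Law"])
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to any article's vector, which is the intended effect of the inverse document frequency.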
  • 16. ESA - Explicit Semantic Analysis. Using a Semantic Interpreter
      • Cosine similarity measure: $\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
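As a small illustration of the cosine measure on slide 16, the sketch below compares two sparse vectors stored as dictionaries. The dictionary representation and the example weights are assumptions made for the illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {key: weight} dicts."""
    dot = sum(w * b[k] for k, w in a.items() if k in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical article weights for two short texts.
text_a = {"Law": 1.0, "Bank": 0.0}
text_b = {"Law": 1.0, "Bank": 1.0}
sim = cosine_similarity(text_a, text_b)
print(round(sim, 4))
```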
  • 17. How Oversee.net Does It
      • Instead of comparing two texts, compare two small sets of words!
      • Use keywords to describe domains and categories.
      • Represent these keywords in terms of DBpedia articles.
      • A keyword is significantly related to an article if the TF-IDF is above a certain threshold.
      • The set of articles associated with a domain/category is the union of the sets of articles associated with its keywords.
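The thresholding and union steps on slide 17 might look roughly like the following. The lookup table `keyword_weights`, its contents, and the threshold value 0.5 are all hypothetical; the slides do not state the threshold that Oversee.net uses.

```python
def articles_for_keywords(keywords, keyword_weights, threshold):
    """Union of the articles whose TF-IDF weight for some keyword exceeds the threshold."""
    articles = set()
    for kw in keywords:
        for article, weight in keyword_weights.get(kw, {}).items():
            if weight > threshold:
                articles.add(article)
    return articles

# Hypothetical {keyword: {article: tf-idf weight}} lookup.
keyword_weights = {
    "bank": {"Bank": 0.9, "River": 0.2},
    "financial": {"Finance": 0.8, "Bank": 0.6},
}
domain_articles = articles_for_keywords(["bank", "financial"], keyword_weights, threshold=0.5)
print(sorted(domain_articles))
```

Keywords that are missing from the lookup simply contribute no articles, which mirrors the "scarcity of input information" problem for misspelled domains.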
  • 18. How Oversee.net Does It
      • Compare the two sets of articles (A - domains, B - categories) using the Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
      • Categories with the highest scores using this index are matched to a domain.
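The Jaccard comparison and the ranking of categories from slide 18 can be sketched as below. The category names, article sets, and the `top_n` cutoff are illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, 0.0 if both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_categories(domain_articles, category_articles, top_n=3):
    """Score every category against the domain and return the best top_n (name, score) pairs."""
    scores = {name: jaccard(domain_articles, arts) for name, arts in category_articles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical article sets for one domain and three categories.
domain = {"Bank", "Finance", "Loan"}
categories = {
    "banking": {"Bank", "Loan", "Cheque"},
    "games": {"Chess"},
    "finance": {"Finance", "Bank", "Loan", "Pension"},
}
ranked = rank_categories(domain, categories)
print(ranked)
```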
  • 20. Our Focus. [Diagram: Domain, Keywords, Keywords, Category]
      • Critical link: domains to keywords
      • Improve quality of keywords using Click Through Rate (CTR), String Similarity, and Semantic Analysis

      Keyword                CTR   String Similarity   Semantic Similarity
      industrial             20    80                  0
      industriel             20    89                  0
      industrie              20    100                 0
      china manufacturer     20    0                   88
      industries             20    80                  98
      industrial companies   20    0                   86
  • 21. Domain Keywords. Focusing on developing the link between domains and keywords, we posed two main research questions:
      • Could we use ESA to extend the number of meaningful keywords per domain?
      • Could we use the keywords obtained through Oversee.net's in-house statistics as the basis of the new keywords?
  • 22. Methodology. Extending the set of keywords.
  • 23. Methodology. Extending the set of keywords. When generating new keywords:
      • Only take the top 3 articles
      • Only take the top 2 terms
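The diagram for this step did not survive in the transcript, so the following is only one plausible reading of "top 3 articles, top 2 terms": for each existing keyword, look up its 3 highest-weighted articles, then take the 2 highest-weighted terms of each article as new keywords. The function name, both lookup tables, and the weights are hypothetical.

```python
def expand_keywords(keywords, term_to_articles, article_to_terms,
                    top_articles=3, top_terms=2):
    """Propose new keywords via the best articles of each keyword (assumed interpretation)."""
    new_keywords = []
    for kw in keywords:
        # Articles most strongly associated with this keyword.
        best_articles = sorted(term_to_articles.get(kw, {}).items(),
                               key=lambda kv: kv[1], reverse=True)[:top_articles]
        for article, _ in best_articles:
            # Terms most strongly associated with that article.
            best_terms = sorted(article_to_terms.get(article, {}).items(),
                                key=lambda kv: kv[1], reverse=True)[:top_terms]
            for term, _ in best_terms:
                if term not in keywords and term not in new_keywords:
                    new_keywords.append(term)
    return new_keywords

# Hypothetical tf-idf lookups in both directions.
term_to_articles = {
    "train": {"Rail transport": 0.9, "Train station": 0.8, "Locomotive": 0.7, "Training": 0.1},
}
article_to_terms = {
    "Rail transport": {"train": 0.9, "railway": 0.8, "freight": 0.2},
    "Train station": {"station": 0.9, "train": 0.8, "platform": 0.5},
    "Locomotive": {"locomotive": 0.9, "engine": 0.6},
}
expanded = expand_keywords(["train"], term_to_articles, article_to_terms)
print(expanded)
```

Keeping the cutoffs small limits how much unrelated vocabulary each noisy seed keyword can pull in, which matters for the noise amplification reported in the results.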
  • 24. Methodology. Method 2 for extending the set of keywords: breaking up and correcting the domain name
      • Example: domain = ’chaselogon.com’. [Diagram residue, candidate splits: haselogon; aselogon; cha selogon; chas elogon; chase logon; chasel ogon; chaselo gon; chaselog; chaselogo]
      • If the entire string matches a word in the reference file, then stop.
      • If both parts of the broken string are exact words, then stop.
      • If a substring is an exact word, then correct the other part using edit distances.
      • Corrections used: deletions, transpositions, replacements, insertions
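The three rules on slide 24 can be sketched as follows. This is an illustrative reconstruction, not the project's code: it restricts corrections to a single edit, picks the alphabetically first correction when several exist, and uses a tiny made-up `vocabulary` in place of the reference file.

```python
import string

def edits1(word):
    """All strings one deletion, transposition, replacement, or insertion away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def parse_domain(domain, vocabulary):
    """Split a domain name into known words, correcting one part if needed."""
    vocabulary = set(vocabulary)
    name = domain.lower().split(".")[0]
    # Rule 1: the entire string is a known word.
    if name in vocabulary:
        return [name]
    pairs = [(name[:i], name[i:]) for i in range(1, len(name))]
    # Rule 2: both parts of some split are known words.
    for left, right in pairs:
        if left in vocabulary and right in vocabulary:
            return [left, right]
    # Rule 3: one part is a known word; correct the other within one edit.
    for left, right in pairs:
        if left in vocabulary:
            fixes = sorted(edits1(right) & vocabulary)
            if fixes:
                return [left, fixes[0]]
        if right in vocabulary:
            fixes = sorted(edits1(left) & vocabulary)
            if fixes:
                return [fixes[0], right]
    return []

# Toy stand-in for the reference file.
vocabulary = {"chase", "logon", "cheap", "vacations"}
print(parse_domain("chaselogon.com", vocabulary))
print(parse_domain("cheapvacatins.com", vocabulary))
```

A real reference file would need word frequencies to choose between competing corrections; the alphabetical tie-break here is only a placeholder.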
  • 25. Methodology. Method 2 for extending the set of keywords
      • The reference file is made up of collections of text, to which we added more information: company names, popular websites, brand and store names, countries and major cities.
      • Initial keywords: chameloeon
      • Keywords after parsing: chas, chase, elson, login
  • 26. Methodology. Generating new keywords and mapping to categories
      • Domain: bankfianancial.com
      • Keywords: ncofinancial ban bank financial financial institutions financial centre lobsters official personal societies chairman . . .
      • Jaccard Index = 0.240492: finance retirement pension debit card tenant credit check ...
      • Jaccard Index = 0.348147: credit cards debit card credit applications rewards program ...
      • Jaccard Index = 0.219457: banking savings banking checks community bank ...
  • 27. Results: Comparing Their Keywords to Semantic. We were given a sample of 300 domains that had been matched by hand to a total of 500 categories.

                              CTR & String Similarity   CTR, String Similarity & Semantic Analysis
      Number of matches       25                        309
      Percentage of matches   5%                        61.8%
  • 28. Results: Generating New Keywords. Using Method 1:

                              CTR & String Similarity   Method 1   CTR & String Similarity & 7 Random
      Number of matches       25                        21         24
      Percentage of matches   5%                        4.2%       4.8%

      • Most of the time, the different methods yielded the same results.
      • Case where the new keywords improved the system: thhetrainline.com
      • Case where the base case did better: inindustries.com
  • 29. Results: thhetrainline.com
      • Base keywords: thetrainline. Jaccard Index = 0.0001, microcars & city cars; Jaccard Index = 0.0002, property management
      • With new keywords: thetrainline strafe train moving departing train station telecommunications georgia rain shine . . . Jaccard Index = 0.1348, bus & rail; Jaccard Index = 0.2255, libraries & museums
  • 30. Results: inindustries.com
      • Base keywords: industrial industrias industriel . . . Jaccard Index = 0.0786, manufacturing
      • With new keywords: industrial industrias industriel . . . ministry quarterly garden/outdoor filipino footballer . . . Jaccard Index = 0.099, tourist destinations; Jaccard Index = 0.1326, real estate
  • 31. Results: Parsing the Domains. Using Method 1 & 2:

                              CTR & String Similarity   Method 1 & 2   CTR & String Similarity & 15 Random
      Number of matches       25                        93             23
      Percentage of matches   5%                        18.6%          4.6%
  • 32. Results - Parsing the Domains
      • chaselogon.com, keywords: chameloeon. No category matched.
      • addictinggamas.com, keywords: chameloeon chas chase elson login password journalists cyber logins expensive beatles . . . Jaccard Index = 0.4637, credit cards; Jaccard Index = 0.4637, banking
  • 33. Results: Parsing the Domains. Using Method 2:

                              CTR & String Sim.   Method 1 & 2   Method 2
      Number of matches       25                  97             77 out of 356
      Percentage of matches   5%                  19.4%          ∼ 21.6%

      • Initial results show that, overall, using parsing alone might be more beneficial → this depends on the amount of noise.
      • Example with a lot of noise: mobilestorage.ca
      • Example with minimal noise: addictinggamas.com
  • 34. Results - Amplification of noise: mobilestorage.ca
      • Keywords: gfilestorage mobileshop mobile storage age investor vilest . . . Jaccard Index = 0.1011, mobile & wireless; Jaccard Index = 0.0959, music & audio
      • With new keywords: gfilestorage mobileshop mobile storage age investor vilest . . . legal age taylor phone companies mobil . . . Jaccard Index = 0.0942, music & audio; Jaccard Index = 0.0887, education
  • 35. Results - Minimal noise: addictinggamas.com
      • Keywords: addictinggams addictivegames adictigegames . . . addict addicting games ingram . . . Jaccard Index = 0.0153, software
      • With new keywords: addictinggams addictivegames adictigegames . . . addict addicting games ingram . . . gameplay requires game impulsedriven flash add ons . . . Jaccard Index = 0.2019, computer & video games; Jaccard Index = 0.1975, games
  • 36. Results: Extended Matches. We extended possible matches to the parent and root nodes of the category tree, and checked in how many cases the parent or root node of the categories we obtained matched the manual matching.

                                       CTR & String Sim.   Method 1   Method 1 & 2   Method 2
      Number of matches                25                  21         97             77 out of 356
      Percentage of matches            5%                  4.2%       19.4%          ∼ 21.6%
      Number of extended matches       32                  29         128            102 out of 356
      Percentage of extended matches   6.4%                5.8%       25.6%          ∼ 28.7%
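A minimal sketch of the extended-match check, assuming the category tree is stored as a child-to-parent map. The three category names and the tree shape below are invented for the example; the real Oversee.net category tree is not given in the slides.

```python
def ancestors(category, parent):
    """Chain of ancestors from the direct parent up to the root (empty for a root node)."""
    chain = []
    while category in parent:
        category = parent[category]
        chain.append(category)
    return chain

def is_extended_match(predicted, manual, parent):
    """True if the predicted category, its parent, or its root equals the manual label."""
    chain = ancestors(predicted, parent)
    candidates = {predicted}
    if chain:
        candidates.add(chain[0])   # direct parent
        candidates.add(chain[-1])  # root of the tree
    return manual in candidates

# Hypothetical fragment of a category tree (child -> parent, acyclic).
parent = {
    "computer & video games": "games",
    "games": "entertainment",
}
print(is_extended_match("computer & video games", "games", parent))
print(is_extended_match("computer & video games", "entertainment", parent))
print(is_extended_match("games", "computer & video games", parent))
```

The check is deliberately one-directional: a prediction that is more specific than the manual label counts, but a prediction that is more general than the manual label does not.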
  • 38. Conclusion
      • Implemented a program to match domains with categories
      • Created an ESA-based method to amplify existing keywords
      • Adapted a domain name parsing and spell-correcting method
      • Revisiting our research questions:
      • Could we use ESA to extend the number of meaningful keywords per domain? → Yes
      • Could we use the keywords obtained through Oversee.net's in-house statistics as the basis of the new keywords? → No, or at least further processing must be done.
      • getting better & more keywords → getting a few good keywords
  • 39. Future Directions
      • Find out how many good initial keywords are required to use our method successfully.
      • Explore a better way of ranking keywords and determine which are the most descriptive ones.
      • Click through rate and string similarity comparisons are not sufficiently descriptive; a better scoring method is needed.
      • Keep a reference list of the most popular websites, so that the given domains can be compared to these.
      • Analyze content in websites to amplify the domain-to-category mapping.
  • 40. Thank you!
      • Academic Mentor: Cristina Garcia-Cardona
      • Industry Sponsor: Kryztof Urban and Oversee.net
      • RIPS Director: Dr. Michael Raugh
      • Director of IPAM: Dr. Russ Caflisch
      • IPAM Staff: Dimi, Stacey, Stacy, Roland, Stephanie, and everyone that made RIPS possible
  • 41. Questions? Thank you for listening!