Evolving web, evolving search

Published in: Education, Technology

  1. 1. Evolving Web, Evolving Search. Yuan Tian & Tianqi Chen, Apex Data & Knowledge Management Lab, Shanghai Jiao Tong University
  2. 2. Agenda: Introduction to SJTU; Introduction to Apex Lab; Research; Demo
  4. 4. Shanghai Jiao Tong University: Location; History; Student; Campus
  5. 5. Agenda: Introduction to SJTU; Introduction to Apex Lab; Research; Demo
  6. 6. Apex Lab. Director: Professor Yong Yu. Associate Professor: Guirong Xue
  7. 7. Apex Lab Research: Web Search; Social Web; Semantic Search; Machine Learning; Image Search
  8. 8. Apex Lab Project Partners
  9. 9. Apex Lab Ph.D. Students: Haofen Wang; Jing Lu; Jia Chen; Guangcan Liu; Xian Wu; Yunbo Cao; Ruihua Song. 35 Master Students
  10. 10. Agenda: Introduction to SJTU; Introduction to Apex Lab; Research; Demo
  11. 11. Research: Traditional Web; Social Web; Semantic Web; Machine Learning
  13. 13. Search on Traditional Web. Focus on how to: improve search relevance and rank pages; integrate mining technologies into search; search finer-grained objects instead of documents. Search applications: general search engine; vertical search engine; meta search engine
  14. 14. Expert Search
  15. 15. Expert Search (introduction): Web pages are treated as bags of words; queries are not fully understood
  16. 16. Expert Search (motivation): Searching for experts is an increasingly important information need • A PM searches for a developer • A patient searches for a doctor • A student searches for a professor • …… • Not only in the enterprise • But also on the WWW
  17. 17. Query -> Ranked list of experts. An evidence: an expert and a query co-occur in a document under a certain relation constraint
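The evidence idea on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy corpus, the single-token expert names, the token `window` standing in for the "relation constraint", and the count-based score are all hypothetical and are not taken from the talk.

```python
from collections import Counter


def rank_experts(docs, query, window=10):
    """Rank experts by counting co-occurrence evidence with a query term.

    docs   -- list of (text, expert_names) pairs; names are single tokens (assumption)
    query  -- a single query term
    window -- hypothetical relation constraint: maximum token distance
    """
    scores = Counter()
    q = query.lower()
    for text, experts in docs:
        tokens = text.lower().split()
        q_pos = [i for i, t in enumerate(tokens) if t == q]
        for expert in experts:
            e_pos = [i for i, t in enumerate(tokens) if t == expert.lower()]
            # One piece of evidence per (query, expert) position pair
            # that satisfies the distance constraint.
            scores[expert] += sum(
                1 for i in q_pos for j in e_pos if abs(i - j) <= window
            )
    # Drop experts without any evidence; highest score first.
    return [(e, s) for e, s in scores.most_common() if s > 0]


docs = [
    ("alice maintains the search index", ["alice"]),
    ("bob fixed a bug in search and alice reviewed the search patch", ["bob", "alice"]),
]
print(rank_experts(docs, "search", window=3))
```

A tighter window acts as a stricter relation constraint: with `window=3`, the mention of bob is too far from the query term to count as evidence.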
  18. 18. Research: Traditional Web; Social Web; Semantic Web; Machine Learning
  19. 19. The Emergence of Web 2.0: the Web gets social. Web 1.0 -> Web 2.0: Publishing -> Participation; Personal Websites -> Blogging; Content Management Systems -> Wikis; Britannica Online -> Wikipedia; Directories (taxonomy) -> Tagging ("folksonomy"). The barrier to contribution is lower. More people are involved. They are less professional.
  20. 20. Search on Web 2.0. Focus on how to: make use of user-involved data; search on new social media
  21. 21. Deegle (WWW 2006, WWW 2007, SIGIR 2008)
  22. 22. Related facets; Search results; Related tags; Related users
  23. 23. Emotion Analysis on the Blog: A blog can be a source of news, but also a stage for expressing emotion. Enhancing blog search for different users
  24. 24. Blog Search. Informative articles: news similar to the news on traditional news websites; technical descriptions, e.g. programming techniques; commonsense knowledge; objective comments on events in the world. Affective articles: diaries about personal affairs; descriptions of self-feelings or self-emotions
  25. 25. Two types of blog
  26. 26. Intent-driven blog search (WWW 2007). Columns: rank | informative sense | snippet
      1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator, DB2 Application Developer, MQSeries Engineer, VisualAge For Java …
      2 | -0.94 | Crazy Me! I have hesitated between Acer and smuggled IBM for one week. I wouldn’t have taken into account the price, quality or service if I had enough money …
      3 | 1.00 | Selling IBM laptop, t22p3-900, dvd S3/, independent accelerating display card. 3550 YUAN. (Post fee not included). Please contact 30316255. We guarantee the quality. This product is only sold within Tianjing city ...
      4 | -0.35 | I got a laptop from my friend this week. Although outdated, it is still a classical one in IBM enthusiast’s mind. There are many second hand IBM laptops in the market. Although I have sold many IBM laptops …
      5 | -0.53 | Doctor said that I should make more preparations mentally. You have stayed with me for three years, leaving without any words. Do you feel fair for me? Do you remember the moments we were together? You are heartless, I hate you! ...
  27. 27. Columns: rank | informative sense | snippet
      1 | 1.00 | The catalogue of IBM certification: DB2 Database Administrator, DB2 Application Developer, MQSeries Engineer, VisualAge For Java …
      2 | 1.00 | Biao Lin is a military talent. Stalin called him “the gifted general”. Americans called him “the unbeaten general”. Chiang Kai-shek called him “devil of war”. Biao Lin is a special person in modern history …
      3 | 0.99 | Microsoft’s hotmail can only be registered with suffix “@hotmail.com” by default. You can register @msn.com by visiting …
      4 | 0.95 | Yi Shang is still sending the file to me. I will practice it later. 1. Start up Instance (db2inst1) db2start; 2. Stop Instance (db2inst1) db2stop …
      5 | 0.84 | Name: Lei Zhang. Student number: 5030309959. Class number: 007. The analysis and review about the tendency of Jilin Chemical Industry’s stock in 2005. Date, Increasing and Decreasing ranges, Open Price, Close Price, Amount of deals …
      6 | 0.01 | Recently I like reading the Buddhist Scripture. I can learn philosophies in it. It makes me comfortable. It is from ...
      7 | -0.11 | It’s out of my mind when I first saw it. The water seemed to be exuding from the building. There was much water on the floor of education building. Water was all around us, anywhere you can touch had water. …
      8 | -0.51 | I read an article about the last emperor Po-yee today. I have watched “The Last Emperor” before, which realistically described his life without losing artistry. His love impressed me. As an emperor, he can’t choose the one he loved …
      9 | -0.53 | She is 164 in height with white skin, black hair and long limp leg. I like the girl who has long hair and likes sport and dancing. I like sweet girls. …
      10 | -0.94 | I have many things to do at the end of this semester. There are five final examinations: Discrete Mathematics, Communication Theory, Architecture of Computer, Algorithm and Law. I know little about them. OMG! Only four weeks are left. There are also two projects, Compiler and Operation System. Compiler can be easily completed but Operation System …
  28. 28. Research: Traditional Web; Social Web; Semantic Web; Machine Learning
  29. 29. Our Vision of Semantic Web Search • It covers most of the important topics in SW • A lot of tools are built in each layer • 10+ top papers (WWW’09, SIGMOD’09, SIGMOD’08, VLDB’07, ICDE’09, ISWC’07, etc.)
  30. 30. Knowledge Engineering Layer. Ontology Engineering • Orient: Integrating Ontology Engineering into Industry Tooling Environment (ISWC 2004). Ontology Learning & Population • EachWiki: Facilitating Semantics Reuse for Wikipedia Authoring (ISWC/ASWC 2007) • PORE: Semi-supervised Positive Only Relation Extraction from Wikipedia (ISWC/ASWC 2007) • HS Explorer: Unsupervised Hierarchical Semantics Explorer for Social Annotations (ISWC/ASWC 2007) • Catriple: Extracting Triples from Wikipedia Categories (ASWC 2008)
  31. 31. Indexing and Search Layer. Ontology Query Engine based on DBMS • SOR: A Practical System for OWL Ontology Storage, Reasoning and Search (VLDB 2007, SIGMOD 2008). Annotation-based Semantic Search Engine (DB + IR) • CE2: Towards Large Scale Annotation-based Semantic Search (CIKM 2008). An Extension to IR Index for Relational Search • Semplore: An IR Approach to Scalable Hybrid Query of Semantic Web Data (ISWC/ASWC 2007, ASWC 2008, WWW 2009, JWS). Pattern-based RDF Store
  32. 32. SOR: Semantic Object Repository; based on IBM DB2; supports T-Box reasoning
  33. 33. Semplore: Extension to a traditional IR engine; ranking is considered
  34. 34. CE^2: Search over a semantically annotated corpus; combination of DB and IR search engines
  35. 35. Pattern-based RDF store: Learning to materialize join results; efficient retrieval of pattern matches; reasonable extra space -> significant performance increase (on some datasets)
  36. 36. Query Interface and User Interaction Layer. Keyword Interface for Semantic Search • Q2Semantic: Lightweight Ontology based Keyword Interpretation for Semantic Search (ESWC 2008, ICDE 2009). Natural Language Interface for Semantic Search • PANTO: A Portable Natural Language Interface to Ontologies (ESWC 2007). Snippet Generation • Snippet Generation for Semantic Web Search Engines (ASWC 2008). Ontology Presentation • ZoomRDF: Semantic-driven Fisheye Zooming for RDF Data (WWW 2010)
  37. 37. Q2Semantic: Structured queries vs. keyword queries; structural data
  38. 38. RDF Snippet: Representation of search results. How will you know which answers are most relevant?
  39. 39. ZoomRDF
  40. 40. Research: Traditional Web; Social Web; Semantic Web; Machine Learning
  41. 41. Agenda: Introduction to SJTU; Introduction to Apex Lab; Research; Demo
  42. 42. How to make them as a whole? We focused on Semantic Web search: closed corpus / one single data source involved; scales only to millions of triples; uncertainty is not fully considered or used. However, we need Semantic Web search with: more than 11 million data sources (Web heterogeneity); more than 2 billion triples (scalability); uncertainty everywhere. Thus, we carefully consider the following topics: pay-as-you-go semantic data integration; a semantic search engine towards billions of triples; a user-friendly query interface for the Semantic Web. Slide callouts: "Missing"; "Let’s Forget"
  43. 43. Hermes (2nd place Billion Triple Challenge, SIGMOD 2009, JWS): (1) Integrate and index data sources; (2) Understand user’s need; (3) Search and refine. [Architecture figure. User steps: input keywords, e.g. “Article Stanford Turing Award”; select a query; refine or navigate results, suggestions and affiliations. Components: schema-level mapping; data-level mapping; graph data processing (data graph summarization, element label extraction, graph element scoring, mapping discovery); keyword translation (keyword mapping, top-k query graph search); distributed query processing (query graph decomposition, query planning & optimization, local query processing, result combination & ranking); internal indices (keyword index, schema index, structure index, mapping index)]
  44. 44. Heterogeneous Transfer Learning. Machine Learning Team, APEXLAB, Shanghai Jiao Tong University
  45. 45. Machine Learning Team in APEX: Focus on machine learning and its application in Web mining and IR: transfer learning; advertising techniques on the Web; short text classification and clustering; multilingual search result integration
  46. 46. Outline: Introduction to heterogeneous transfer learning; Cross media: Text -> Image (clustering, classification); Cross language: English -> Chinese; Application: Visual Contextual Advertising
  48. 48. Traditional machine learning: training data and test data are in the same distribution. [Figure: training data (news) and test data]
  49. 49. Transfer learning: the distributions are not identical. [Figure: training data (news) and test data]
  50. 50. Heterogeneous Transfer Learning: learning across different feature spaces. Example texts: "A fixed-wing aircraft, typically called an airplane, aeroplane or simply plane, is an aircraft capable of flight using forward motion that generates lift as the wing moves through the air…"; "An automobile, motor car or car is a wheeled motor vehicle used for transporting passengers, which also carries its own engine or motor..." [Figure: training data are text documents; test data]
  51. 51. Related Areas of Heterogeneous Learning. Multiple-domain data are split by the feature space among domains. Heterogeneous feature spaces are split by instance correspondences among domains: if each instance in one domain has its correspondence in other domains, the setting is multi-view learning; if there are few or no instance correspondences among domains, it is heterogeneous transfer learning. Homogeneous feature spaces are split by the data distribution among domains: different distributions give transfer learning across different distributions; the same distribution gives traditional machine learning. [Figure: source domain texts "Apple is a fruit that can be found …" and "Banana is the common name for …"; target domain]
  56. 56. Outline: Introduction to heterogeneous transfer learning; Cross media: Text -> Image (classification, clustering); Cross language: English -> Chinese; Application: Visual Contextual Advertising
  57. 57. Text to Images [Dai et al. NIPS 2008] [Lin et al. APWeb 2010]: Mining and learning from multimedia data is becoming increasingly important. Limited by scarce labeled image data, can we use abundant text data on the Web? Our answer is YES
  58. 58. Objective [figure]
  59. 59. Basic Ideas: Exploiting co-occurrence data as a bridge between text and image
  60. 60. Data Sets: Documents from ODP; images from Caltech-256
  61. 61. Experimental Result
  62. 62. Approach 2: Naïve Bayes Way [Lin et al. APWeb 2010]. [Figure labels: P(v|w)P(w|c); P(w|c); P(v|w)]
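The factors P(v|w) and P(w|c) on this slide suggest chaining text statistics into the image feature space. A minimal sketch, assuming that v is a visual word, w a text word and c a class, that the bridge is the marginalization P(v|c) = Σ_w P(v|w) P(w|c), and that classification is a standard multinomial Naive Bayes decision; the function names, the toy probability tables and the absence of smoothing are assumptions, not details from the cited paper.

```python
import math


def visual_word_given_class(p_v_given_w, p_w_given_c):
    """Compute P(v|c) = sum_w P(v|w) * P(w|c).

    p_v_given_w[w][v] -- estimated from text/image co-occurrence data
    p_w_given_c[c][w] -- estimated from labeled text
    """
    n_v = len(p_v_given_w[0])
    return [
        [
            sum(p_w[w] * p_v_given_w[w][v] for w in range(len(p_w)))
            for v in range(n_v)
        ]
        for p_w in p_w_given_c
    ]


def classify(image_counts, p_v_given_c, prior):
    """Return argmax_c of log P(c) + sum_v n_v * log P(v|c).

    Assumes all probabilities used are strictly positive (no smoothing here).
    """
    scores = []
    for c, p_v in enumerate(p_v_given_c):
        score = math.log(prior[c])
        score += sum(n * math.log(p) for n, p in zip(image_counts, p_v) if n > 0)
        scores.append(score)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy tables: 2 text words, 2 visual words, 2 classes (all values hypothetical).
p_v_given_w = [[0.9, 0.1], [0.2, 0.8]]
p_w_given_c = [[0.8, 0.2], [0.1, 0.9]]
p_v_given_c = visual_word_given_class(p_v_given_w, p_w_given_c)
print(classify([5, 1], p_v_given_c, prior=[0.5, 0.5]))
```

The point of the sketch is that no labeled images are needed: class information enters only through P(w|c), learned from text, and reaches the image features through the co-occurrence table P(v|w).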
  63. 63. Text-aided Image Classification (TAIC)
  64. 64. Experiments: TAIC. Data sets: 9 binary classification data sets and 5 six-class classification data sets; image data from Caltech-256 and Fifteen Scene; auxiliary text data from Open Directory Project. Baseline methods: base classifiers Naïve Bayes (NBC) and support vector machine (SVM)
  65. 65. Evaluation 1: Classification. Average error rate: heterogeneous TL 0.318; no heterogeneous TL 0.334 (4.8% error reduction)
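The 4.8% figure is consistent with a relative reduction computed against the baseline without heterogeneous transfer learning. A small check, assuming that reading of the slide (the function name is illustrative):

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# Average error rates from the slide: 0.334 without and 0.318 with heterogeneous TL.
reduction = relative_reduction(0.334, 0.318)
print(f"{100 * reduction:.1f}% error reduction")
```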
  66. 66. Outline: Introduction to heterogeneous transfer learning; Cross media: Text -> Image (classification, clustering); Cross language: English -> Chinese; Application: Visual Contextual Advertising
  67. 67. Text-aided Image Clustering [Yang et al. ACL 2009]: Image clustering is an effective method for increasing the accessibility of image search results. But traditional clustering methods do not work well with a small amount of data. We consider using annotated images in the social Web to help image clustering. [Figure: "Apple = … OR …"]
  68. 68. Annotated PLSA Model for Clustering: Leveraging the auxiliary text data by using the topics Z as a bridge. [Figure: words from auxiliary data (from Flickr); topics; image features]
  69. 69. Making the transfer: a log-likelihood objective function with two parts, image features and auxiliary text features. Image feature to image instance correlation: A. Word feature to image feature correlation: B. $\mathcal{L} = \sum_j \left[ \lambda \sum_i \frac{A_{ij}}{\sum_{j'} A_{ij'}} \log P(f_j \mid v_i) + (1-\lambda) \sum_l \frac{B_{lj}}{\sum_{j'} B_{lj'}} \log P(f_j \mid w_l) \right]$, where $\lambda$ is the trade-off parameter, the denominators are normalization terms, the first sum is the likelihood of the image features and the second is the likelihood of the auxiliary text features
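To make the two-part objective concrete, here is a hedged sketch that evaluates it for given tables. It assumes the objective is a trade-off λ between a row-normalized image term, weighted by A and using log P(f|v), and a row-normalized text term, weighted by B and using log P(f|w). That form is a reading of the garbled slide, not a verified transcription. The probability tables are taken as already computed (in PLSA they would come from the topic model), and all names are illustrative.

```python
import math


def aplsa_objective(A, B, p_f_given_v, p_f_given_w, lam):
    """Weighted log-likelihood with an image part and an auxiliary text part.

    A[i][j]           -- correlation of image instance i with image feature j
    B[l][j]           -- correlation of word l with image feature j
    p_f_given_v[i][j] -- P(f_j | v_i), assumed given and positive where A > 0
    p_f_given_w[l][j] -- P(f_j | w_l), assumed given and positive where B > 0
    lam               -- trade-off parameter in [0, 1]
    """
    total = 0.0
    for i, row in enumerate(A):
        norm = sum(row)  # normalization over features
        for j, a in enumerate(row):
            if a > 0:
                total += lam * (a / norm) * math.log(p_f_given_v[i][j])
    for l, row in enumerate(B):
        norm = sum(row)
        for j, b in enumerate(row):
            if b > 0:
                total += (1 - lam) * (b / norm) * math.log(p_f_given_w[l][j])
    return total


# One image, one word, two image features (toy values).
A = [[1, 1]]
B = [[2, 0]]
print(aplsa_objective(A, B, [[0.5, 0.5]], [[0.25, 0.75]], lam=0.5))
```

Setting `lam=1.0` reduces the objective to the image-only likelihood, which is one way to see the auxiliary text as an optional regularizer on the topics.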
  70. 70. Experiment Setup. Data sets: generated from Caltech-256 and 15-scene corpora. Baseline methods: baseline clustering methods KMeans, PLSA and STC. Strategies: separate: clustering on target image data only; combined: clustering target image data and annotated image data together and evaluating the result for target image data
  71. 71. Experimental Result. [Chart: entropy (y-axis, 0 to 2) for KM_Separate, KM_Combine, PLSA_Separate, PLSA_Combine, STC and aPLSA] Average entropy: heterogeneous TL 0.741; no heterogeneous TL 0.786 (5.7% entropy reduction)
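Entropy as a clustering quality measure can be sketched as below (lower is better). This uses a common definition, the size-weighted average of the label entropy inside each cluster, with base-2 logarithms; the slide does not state the exact definition or log base used in the experiments, so both are assumptions.

```python
import math
from collections import Counter


def clustering_entropy(cluster_ids, labels):
    """Size-weighted average of the per-cluster label entropy (base 2)."""
    n = len(labels)
    members_by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        members_by_cluster.setdefault(cluster, []).append(label)

    total = 0.0
    for members in members_by_cluster.values():
        size = len(members)
        counts = Counter(members)
        # Entropy of the true-label distribution inside this cluster.
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


# Pure clusters give 0; one cluster mixing two classes evenly gives 1 bit.
print(clustering_entropy([0, 0, 1, 1], ["frog", "frog", "kayak", "kayak"]))
print(clustering_entropy([0, 0, 0, 0], ["frog", "frog", "kayak", "kayak"]))
```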
  72. 72. Clustering Results on Caltech256 [Griffin et al. TR 2007]. [Figure: example clusters: frog; kayak; jesus-christ; bear; watch]
  73. 73. Outline: Introduction to heterogeneous transfer learning; Cross media: Text -> Image (clustering, classification); Cross language: English -> Chinese; Application: Visual Contextual Advertising
  74. 74. Cross-language Classification [Ling et al. WWW 2008]. Text classification: a classifier learns from labeled Chinese Web pages and classifies unlabeled Chinese Web pages
  75. 75. Cross-language Classification: Much labelled data in English, but little in Chinese. Labeled data (English | Chinese): News: Reuters-21578 | ?; Newsgroups: 20 Newsgroups | ?; Web pages: Open Directory Project data (> 1M) | very few ODP data (< 20k, ~1%)
  76. 76. Cross-language Classification: a classifier learns from labeled English Web pages and classifies unlabeled Chinese Web pages
  77. 77. Cross-language Classification: Information Bottleneck. X: signals to be encoded (Web pages); X̃: codewords (class labels); Y: features related to X (terms)
  78. 78. Cross-language Classification: Optimization. [Figure labels, truncated: "minimize"; "Information betw…" (twice); "Minimize this distance"]
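For orientation, the generic information bottleneck functional (Tishby et al.) is shown below in the notation of the previous slide, with X the Web pages, X̃ the class labels used as codewords and Y the terms. This is the textbook form, given as a hedged reference point; the slide's own formula did not survive extraction, and the trade-off parameter β is not taken from the talk.

```latex
\min_{p(\tilde{x} \mid x)} \; \mathcal{L}
  = I(X; \tilde{X}) - \beta \, I(\tilde{X}; Y),
\qquad
I(X; Y) - I(\tilde{X}; Y) \;\ge\; 0 .
```

The inequality on the right is the data processing inequality: compressing pages X into labels X̃ can only lose information about the terms Y, so a good classifier is one that keeps this loss small.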
  79. 79. Cross-language Classification: Performance
  80. 80. Outline: Introduction to heterogeneous transfer learning; Cross media: Text -> Image (clustering, classification); Cross language: English -> Chinese; Application: Visual Contextual Advertising
  81. 81. Application: Visual Contextual Advertising [Chen et al. AAAI 2010]: Previous research focused on advertising for text Web pages. With the booming of multimedia data, we need to recommend advertisements for these data. Difficulty: images and text are in different feature spaces. Use the co-occurrence data to bridge these two feature spaces
  82. 82. Figure illustration of Visual Contextual Advertising
  83. 83. Visual Contextual Advertising. [Equations not recoverable; surviving text: "(based on the independence assumption)"; "We assume that there is …"; "Where …"]
  84. 84. Experimental Results: Co-occurrence data from Flickr. Test images from Flickr and the Fifteen Scene data set. Advertisements are crawled from the MSN search engine with queries chosen from the AOL query log.
  85. 85. Experimental Result
  86. 86. Experimental Result
  87. 87. Thank you. For more details of APEXLAB: http://apex.sjtu.edu.cn/apex_wiki/FrontPage Our works: http://apex.sjtu.edu.cn/apex_wiki/Papers
