
The Need for and fundamentals of an Open Web Index


Published in: Internet


  1. THE NEED FOR AND FUNDAMENTALS OF AN OPEN WEB INDEX
     Prof. Dr. Dirk Lewandowski, Hamburg University of Applied Sciences, Hamburg, Germany
     dirk.lewandowski@haw-hamburg.de
     First International Symposium on Open Search Technology, Garching, 23 October 2019
  2. Proposal for an Open Web Index (OWI), Prof. Dr. Dirk Lewandowski
     ABOUT ME
     • Professor of Information Research and Information Retrieval at Hamburg University of Applied Sciences
     • Author of 100+ scholarly articles on search engines
     • German-language book “Suchmaschinen verstehen” (Springer, 2nd edition, 2018)
     • Editor, Aslib Journal of Information Management (Emerald Publishing)
     • Served as expert for the High Court of Justice (UK) and Deutscher Bundestag (German parliament)
     https://searchstudies.org/dirk
  3. WHY WE NEED AN OPEN WEB INDEX
  4. GOOGLE SERVES MORE THAN 2,000,000,000,000 (TWO TRILLION) QUERIES PER YEAR.
  5. PROBLEM STATEMENT
     • As there is no central directory of the Web, private search engine companies have built large indexes of its contents
     • Companies operating Web-scale indexes do not allow other interested parties sufficient access to their data
     • The difficulties in building a Web index lie in technical issues, operating costs, Web size, and freshness
     • Due to these difficulties, there is no Web index built by a European company (or other entity)
  6. IDEA
     VISION
     • To build a public library of the Web
     TECHNICAL IDEA
     • Separate the index from the services that are built on the index
     PUBLIC VS. PRIVATE
     • While the index should be public, the services can be proprietary
  7. STRUCTURE
     [Diagram components: OWI Crawler; OWI Basic Indexer; OWI Advanced Indexer; OWI Web Index; OWI Usage Data Index; OWI Interface / API; Services 1 to 4; users of each service]
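The separation of a public index from the services built on it (slides 6 and 7) might be sketched roughly as follows. This is only an illustration under stated assumptions: the slides name an "OWI Interface / API" but do not specify it, so `WebIndex`, `lookup`, `InMemoryIndex`, `WebSearchService`, and `TrendService` are hypothetical names, and the in-memory index is a toy stand-in for a Web-scale one.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Document:
    url: str
    text: str


class WebIndex(Protocol):
    """Hypothetical public interface; the slides do not define the real API."""

    def lookup(self, term: str) -> list[Document]: ...


class InMemoryIndex:
    """Toy stand-in for the shared, public index."""

    def __init__(self, documents: list[Document]) -> None:
        self._postings: dict[str, list[Document]] = {}
        for doc in documents:
            # Index each distinct lower-cased term of the document once.
            for term in set(doc.text.lower().split()):
                self._postings.setdefault(term, []).append(doc)

    def lookup(self, term: str) -> list[Document]:
        return list(self._postings.get(term.lower(), []))


class WebSearchService:
    """One possible service; its ranking logic could remain proprietary."""

    def __init__(self, index: WebIndex) -> None:
        self._index = index

    def search(self, query: str) -> list[str]:
        # A real service would rank results; this sketch just sorts by URL.
        return sorted(doc.url for doc in self._index.lookup(query))


class TrendService:
    """A second, independent service built on the same index."""

    def __init__(self, index: WebIndex) -> None:
        self._index = index

    def mention_count(self, term: str) -> int:
        return len(self._index.lookup(term))


if __name__ == "__main__":
    shared = InMemoryIndex([
        Document("https://example.org/a", "open web index"),
        Document("https://example.org/b", "web search"),
    ])
    print(WebSearchService(shared).search("web"))
    print(TrendService(shared).mention_count("index"))
```

Both services depend only on the `lookup` interface, so either could be replaced or kept closed-source without touching the index itself.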
  8. POSSIBLE APPLICATIONS
     N.B.: This list of ideas is far from complete and serves illustrative purposes only.
     SEARCH
     • Web search
     • Vertical search, e.g., video or scholarly content
     SCIENCE / RESEARCH
     • Trend analysis, e.g., political trends
     • Language use on the Web
     • Research evaluation, e.g., Altmetrics
     DATA ANALYSIS
     • Data aggregation, e.g., company or person dossiers
     • Opinion mining (“Who says what about whom?”)
     • Market research
     ARTIFICIAL INTELLIGENCE
     OWI could build the foundation for large-scale AI applications, e.g.,
     • Machine translation
     • Question answering
     COMBINING OWI DATA WITH PROPRIETARY DATA
     • Company profiles + OWI data = enriched company dossiers
     • Product data + OWI data = enriched product descriptions
     • Geospatial data + OWI data = enriched map applications
  9. WHY DON’T WE JUST START BUILDING IT?
  10. WHAT SIZE SHOULD A WEB INDEX HAVE?
      • 1.71 billion websites
      • How many pages/URLs does this mean?
      → There is no such thing as a complete index.
      → However, without representing a major part of the Web, an index is more or less useless.
  11. WHY ARE INITIATIVES LIKE COMMON CRAWL NOT ENOUGH?
      They are not comprehensive
      - Common Crawl: 2.6 billion pages (not websites!)
      They are static
      - Crawling once a month is very different from keeping an index current at all times
      They do not provide search functionality
      - No (basic) indexing as needed to build applications on top of the index
      - No spam control as needed to build applications
      - No human raters to control the quality of the index
      → The use of initiatives like Common Crawl is more or less restricted to analysing Web content. Due to the sampling procedure applied, they may not even be very useful for that.
  12. CRAWLING IS NOT THE PROBLEM, ANYWAY
      Crawling is just the beginning of a long process. Indexing is required to make the index searchable.
      The real problems are:
      1) Avoiding spam (= excluding it from the index); spam makes up A LOT of the Web’s content
      2) Keeping the index fresh
      3) Providing indexing (basic and advanced)
      4) Making the index searchable
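To make the steps after crawling concrete, here is a minimal sketch of spam exclusion, basic indexing, and search over already-crawled pages. It is an assumption-laden toy, not the OWI design: `is_spam` uses a made-up repetition heuristic purely as a placeholder, `build_index` and `search` are hypothetical names, and freshness (problem 2) is not modelled at all.

```python
from collections import defaultdict


def is_spam(text: str) -> bool:
    """Placeholder heuristic (an assumption): flag empty pages and pages
    where a single word makes up more than half of all words."""
    words = text.lower().split()
    if not words:
        return True
    most_frequent = max(words.count(word) for word in set(words))
    return most_frequent / len(words) > 0.5


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Basic indexing: map each term to the set of URLs containing it,
    skipping pages flagged as spam."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        if is_spam(text):
            continue  # spam is excluded from the index entirely
        for term in text.lower().split():
            index[term].add(url)
    return dict(index)


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return URLs that contain every query term (boolean AND), sorted."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)


if __name__ == "__main__":
    crawled = {
        "https://example.org/a": "open web index for europe",
        "https://example.org/b": "web index freshness matters",
        "https://example.org/spam": "buy buy buy buy now",
    }
    idx = build_index(crawled)
    print(search(idx, "web index"))
    print(search(idx, "buy"))
```

Even in this toy form, the crawled pages are only the input; the spam filter, the index structure, and the query logic are separate pieces of work that a crawl alone does not provide.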
  13. BIAS ON THE WEB
      Baeza-Yates, R. (2018). Bias on the web. Communications of the ACM, 61(6), 54–61. https://doi.org/10.1145/3209581
  14. WHO CONTROLS THE RESULT RANKINGS?
      [Diagram labels: Search Engine Providers; Content Providers; Users; Search Engine Optimizers; Search Engine Result Page]
  15. HOW TO PROCEED
  16. HOW TO PROCEED
      - A comprehensive and fresh Web index is a societal and political project, not merely a technical problem.
      - Therefore, we need to approach policymakers. They should decide to build the index (and to finance it).
      - To keep the index independent of governments, a European foundation should be established to govern it.
      - The technical implementation of the index should lie in the hands of those (companies or institutions) best capable of building it.
  17. THANK YOU
      Dirk Lewandowski, Hamburg University of Applied Sciences, Hamburg, Germany
      dirk.lewandowski@haw-hamburg.de
      www.searchstudies.org/dirk
