The document provides an overview of challenges in large-scale web search engines. It discusses scalability and efficiency issues, including the size and dynamic nature of the web, high user volumes, and the cost of large data centers. The main sections cover web crawling, indexing, query processing, and caching. Open research problems such as web partitioning, crawler placement, and coupling crawling with distributed search and indexing are also discussed.
1. Scalability and Efficiency
Challenges in Large-Scale
Web Search Engines
Ricardo Baeza-Yates
B. Barla Cambazoglu
Yahoo Labs
Barcelona, Spain
Tutorial in SIGIR’13, WWW’14, SIGIR’14, MUMIA’14
(based on the book chapter in The Information Retrieval Series, 2011)
2. Disclaimer
• This talk presents the opinions of the authors. It does not
necessarily reflect the views of Yahoo Inc. or any other
entity.
• Algorithms, techniques, features, etc. mentioned here
might or might not be in use by Yahoo or any other
company.
• Parts of the content (e.g., images, plots) used in this
presentation were taken from public web pages.
3. Yahoo Labs Barcelona
• Groups
– data mining
– semantic search
– social media
– web retrieval
• Web retrieval group
– distributed web retrieval
– scalability and efficiency
– opinion/sentiment analysis
– personalization
4. Outline of the Tutorial
• Background
• Main sections
– web crawling
– indexing
– query processing
– caching
• Concluding remarks
• Open discussion
5. Structure of the Main Sections
• Definitions
• Metrics
• Issues and techniques
– single computer
– cluster of computers
– multiple search sites
• Research problems
• Q&A
7. Brief History of Search Engines
• Past
  - Before browsers
    - Gopher
  - Before the bubble
    - Altavista
    - Lycos
    - Infoseek
    - Excite
    - HotBot
  - After the bubble
    - Yahoo
    - Google
    - Microsoft
• Current
  - Global
    - Google, Bing
  - Regional
    - Yandex, Baidu
• Future
  - Facebook ?
  - …
8. Anatomy of a Search Engine Result Page
• Main web search results (the focus of this tutorial)
• Direct displays (vertical search results)
  – image
  – video
  – local
  – shopping
  – related entities
• Query suggestions
• Advertisements
9. Anatomy of a Search Engine Result Page
(Annotated example result page showing the user query, the algorithmic search results, ads, a knowledge graph panel, related entity suggestions, and movie and video direct displays.)
10. Actors in Web Search
• User’s perspective: accessing information
– relevance
– speed
• Search engine’s perspective: monetization
– attract more users
– increase the ad revenue
– reduce the operational costs
• Advertiser’s perspective: publicity
– attract more users
– pay little
12. What Makes Web Search Difficult?
• Size
• Diversity
• Dynamicity
• All of these three features can be observed in
– the Web
– web users
13. What Makes Web Search Difficult?
• The Web
– more than 180 million web servers and 950 million host names
– almost 1 billion computers directly connect to the Internet
– the largest data repository (estimated as 100 billion pages)
– constantly changing
– diverse in terms of content and data formats
• Users
– too many! (over 2.5 billion at the end of 2012)
– diverse in terms of their culture, education, and demographics
– very short queries (hard to understand the intent)
– changing information needs
– little patience (few queries posed & few answers seen)
14. Expectations from a Search Engine
• Crawl and index a large fraction of the Web
• Maintain most recent copies of the content in the Web
• Scale to serve hundreds of millions of queries every day
• Evaluate a typical query under several hundred milliseconds
• Serve most relevant results for a user query
15. Search Data Centers
• Quality and performance requirements imply large amounts
of compute resources, i.e., very large data centers
• High variation in data center sizes
- hundreds of thousands of computers
- a few computers
16. Cost of Data Centers
• Data center facilities are heavy consumers of energy,
accounting for between 1.1% and 1.5% of the world’s total
energy use in 2010.
17. Financial Costs
• Costs
– depreciation: old hardware needs to be replaced
– maintenance: failures need to be handled
– operational: energy spending needs to be reduced
19. Major Components in a Web Search Engine
• Web crawling
• Indexing
• Query processing
(Diagram: the crawler fetches pages from the Web into a document collection; the indexer builds an index over the collection; the query processor evaluates user queries against the index and returns results to the user.)
22. Web Crawling
• Web crawling is the process of locating, fetching, and storing
the pages available in the Web
• Computer programs that perform this task are referred to as
- crawlers
- spiders
- harvesters
23. Web Graph
• Web crawlers exploit the hyperlink structure of the Web
24. Web Crawling Process
• Initialize a URL download queue with some seed URLs
• Repeat the following steps
– fetch the content of a URL selected from the download queue
– store the fetched content in a repository
– extract the hyperlinks within the fetched content
– add the extracted links into the download queue
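A minimal sketch of this loop in Python (standard library only). The seed URL, the page limit, and the in-memory repository are illustrative assumptions; politeness, robots.txt handling, and URL normalization, all discussed later in this section, are omitted here.

# Minimal breadth-first crawling loop (illustrative sketch, not a production crawler).
import re
import urllib.parse
import urllib.request
from collections import deque

def crawl(seed_urls, max_pages=10):
    queue = deque(seed_urls)          # URL download queue, initialized with seed URLs
    seen = set(seed_urls)             # URLs discovered so far
    repository = {}                   # fetched content, keyed by URL

    while queue and len(repository) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip pages that fail to download
        repository[url] = html        # store the fetched content
        # Extract hyperlinks and add newly discovered URLs to the queue.
        for href in re.findall(r'href=["\'](.*?)["\']', html):
            link = urllib.parse.urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return repository

# Example (hypothetical seed): crawl(["https://example.com/"], max_pages=5)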
26. Web Crawling
• Crawling divides
the Web into
three sets
– downloaded
– discovered
– undiscovered
seed page
discovered
(frontier)
undiscovered
downloaded
27. URL Prioritization
• A state-of-the-art web crawler maintains two separate
queues for prioritizing the download of URLs
– discovery queue
  – downloads pages pointed to by already discovered links
  – tries to increase the coverage of the crawler
– refreshing queue
  – re-downloads already downloaded pages
  – tries to increase the freshness of the repository
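One way the two queues might be realized is as priority queues; in this sketch the priorities (an importance estimate for discovery, elapsed age for refreshing) and the simple scheduling policy are assumptions for illustration only.

# Two prioritized URL queues (sketch): one for discovery, one for refreshing.
import heapq
import time

discovery_queue = []   # entries: (-importance, url); more important URLs are fetched first
refresh_queue = []     # entries: (-age, url); older local copies are refreshed first

def enqueue_discovered(url, estimated_importance):
    # Importance could come from in-degree or a PageRank estimate (assumed here).
    heapq.heappush(discovery_queue, (-estimated_importance, url))

def enqueue_downloaded(url, last_download_time):
    age = time.time() - last_download_time
    heapq.heappush(refresh_queue, (-age, url))

def next_url(prefer_discovery=True):
    # A real scheduler balances coverage against freshness; here we simply
    # prefer one queue and fall back to the other.
    for queue in ([discovery_queue, refresh_queue] if prefer_discovery
                  else [refresh_queue, discovery_queue]):
        if queue:
            _, url = heapq.heappop(queue)
            return url
    return None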
28. URL Prioritization (Discovery)
• Random (A, B, C, D)
• Breadth-first (A)
• In-degree (C)
• PageRank (B)
(Diagram: example frontier pages A, B, C, and D reachable from the seed page; each strategy above selects the page given in parentheses; more intense red color indicates higher PageRank.)
29. URL Prioritization (Refreshing)
• Random (A, B, C, D)
• PageRank (B)
• Age (C)
• User feedback (D)
• Longevity (A)
(Diagram: example pages A, B, C, and D shown in download time order (by the crawler), in last update time order (by the webmaster), and by estimated average update frequency, ranging from never to every minute; more intense red color indicates higher PageRank, more intense blue color indicates larger user interest.)
30. Metrics
• Quality metrics
– coverage: percentage of the Web discovered or downloaded by
the crawler
– freshness: measure of staleness of the local copy of a page
relative to the page’s original copy on the Web
– page importance: percentage of important (e.g., popular) pages
in the repository
• Performance metrics
– throughput: content download rate in bytes per unit of time
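Freshness and age are commonly formalized as in the refresh-policy work of Cho and Garcia-Molina cited at the end of this section; one usual form is:

F(p;t) = \begin{cases} 1 & \text{if the local copy of } p \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}

A(p;t) = \begin{cases} 0 & \text{if the local copy of } p \text{ is up-to-date at time } t \\ t - t_{mod}(p) & \text{otherwise} \end{cases}

where t_{mod}(p) is the time of the first modification of p not yet reflected in the repository, and the collection-level averages are

\bar{F}(C;t) = \frac{1}{|C|} \sum_{p \in C} F(p;t), \qquad \bar{A}(C;t) = \frac{1}{|C|} \sum_{p \in C} A(p;t)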
32. Crawling Architectures
• Single computer
– CPU, RAM, and disk become a bottleneck
– not scalable
• Parallel
– multiple computers, single data center
– scalable
• Geographically distributed
– multiple computers, multiple data centers
– scalable
– reduces the network latency
34. Issues in Web Crawling
• Dynamics of the Web
– web growth
– content change
• Malicious intent
– hostile sites (e.g., spider traps, infinite domain name generators)
– spam sites (e.g., link farms)
35. Issues in Web Crawling
• URL normalization (a.k.a. canonicalization)
– case-folding
– removing leading "www" strings (e.g., www.cnn.com → cnn.com)
– adding trailing slashes (e.g., cnn.com/a → cnn.com/a/)
– resolving relative paths (e.g., ../index.html)
• Web site properties
– sites with restricted content (e.g., robot exclusion),
– unstable sites (e.g., variable host performance, unreliable
networks)
– politeness requirements
36. DNS Caching
• Before a web page is
crawled, the host name
needs to be resolved to
an IP address
• Since the same host
name appears many
times, DNS entries are
locally cached by the
crawler
(Diagram: the crawler resolves a host name through its local DNS cache, querying the DNS server only on a miss, then opens TCP and HTTP connections to the web server and stores the downloaded page in the web page repository.)
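A sketch of crawler-side DNS caching with the Python standard library; the unbounded in-memory dictionary is a simplification (a real crawler would bound the cache and honor DNS TTLs).

# Crawler-side DNS cache (sketch): resolve each host name at most once.
import socket

_dns_cache = {}

def resolve(host):
    # Return a cached IP address if available; otherwise query the DNS resolver.
    if host not in _dns_cache:
        _dns_cache[host] = socket.gethostbyname(host)   # may raise socket.gaierror
    return _dns_cache[host]

# Example (hypothetical host): resolve("example.com") queries DNS once;
# subsequent calls are answered from the local cache.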
37. Multi-threaded Crawling
• Multi-threaded crawling
– crawling is a network-bound task
– crawlers employ multiple threads to crawl different web pages
simultaneously, increasing their throughput significantly
– in practice, a single node can run up to around a hundred
crawling threads
– multi-threading becomes infeasible when the number of
threads is very large due to the overhead of context switching
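A sketch of a network-bound fetch step driven by a thread pool; the pool size and the fetch helper are illustrative.

# Multi-threaded fetching (sketch): crawling is network-bound, so threads overlap I/O.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return url, response.read()
    except Exception:
        return url, None

def fetch_all(urls, num_threads=64):
    # In practice a single node runs up to around a hundred crawling threads;
    # far beyond that, context-switching overhead dominates.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(fetch, urls))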
38. Politeness
• Multi-threading leads to politeness issues
• If not well-coordinated, the crawler may issue too many
download requests at the same time, overloading
– a web server
– an entire sub-network
• A polite crawler
– puts a delay between two consecutive downloads from the
same server (a commonly used delay is 20 seconds)
– closes the established TCP-IP connection after the web page is
downloaded from the server
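A sketch of the per-host delay; the 20-second figure mirrors the delay mentioned above, and blocking with sleep (rather than rescheduling the URL) is a simplification.

# Per-host politeness delay (sketch).
import time
import urllib.parse

_last_access = {}          # host name -> time of the last download from that host
POLITENESS_DELAY = 20.0    # seconds between downloads from the same server

def wait_for_turn(url):
    host = urllib.parse.urlsplit(url).hostname
    now = time.time()
    earliest = _last_access.get(host, 0.0) + POLITENESS_DELAY
    if now < earliest:
        time.sleep(earliest - now)   # a real crawler would reschedule instead of blocking
    _last_access[host] = time.time()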
39. Robot Exclusion Protocol
• A standard from the early
days of the Web
• A file (called robots.txt) in a
web site advising web
crawlers about which parts
of the site are accessible
• Crawlers often cache
robots.txt files for efficiency
purposes
User-agent: googlebot # all services
Disallow: /private/ # disallow this directory
User-agent: googlebot-news # only the news service
Disallow: / # on everything
User-agent: * # all robots
Disallow: /something/ # on this directory
User-agent: * # all robots
Crawl-delay: 10 # wait at least 10 seconds
Disallow: /directory1/ # disallow this directory
Allow: /directory1/myfile.html # allow a subdirectory
Host: www.example.com # use this mirror
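Python's standard library includes a parser for this format; the sketch below checks whether a URL may be fetched and reads an optional crawl delay. The site, paths, and user agent string are hypothetical.

# Checking robots.txt before crawling (sketch), using the standard library parser.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")   # hypothetical site
rp.read()                                          # fetch and parse robots.txt

if rp.can_fetch("MyCrawler", "https://www.example.com/directory1/myfile.html"):
    delay = rp.crawl_delay("MyCrawler") or 1       # honor Crawl-delay if present
    print("allowed to fetch; crawl delay:", delay)
else:
    print("disallowed by robots.txt")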
40. Mirror Sites
• A mirror site is a replica of an existing site, used to reduce the
network traffic or improve the availability of the original site
• Mirror sites lead to redundant crawling and, in turn, reduced
discovery rate and coverage for the crawler
• Mirror sites can be detected by analyzing
– URL similarity
– link structure
– content similarity
41. Data Structures
• Good implementation of data structures is crucial for the
efficiency of a web crawler
• The most critical data structure is the “seen URL” table
– stores all URLs discovered so far and continuously grows as new
URLs are discovered
– consulted before each URL is added to the discovery queue
– has high space requirements (mostly stored on the disk)
– URLs are stored as MD5 hashes
– frequent/recent URLs are cached in memory
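A sketch of the seen-URL test with MD5 digests; the in-memory set stands in for the disk-resident table plus memory cache described above.

# "Seen URL" test using MD5 hashes (sketch): store fixed-size digests instead of full URLs.
import hashlib

_seen = set()   # in a real crawler, disk-resident with an in-memory cache of hot entries

def seen_before(url):
    digest = hashlib.md5(url.encode("utf-8")).digest()   # 16-byte fixed-size key
    if digest in _seen:
        return True
    _seen.add(digest)
    return False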
43. Web Partitioning and Fault Tolerance
• Web partitioning
– typically based on the MD5 hashes of URLs or host names
– site-based partitioning is preferable because URL-based
partitioning may lead to politeness issues if the crawling
decisions given by individual nodes are not coordinated
• Fault tolerance
– when a crawling node dies, its URLs are partitioned over the
remaining nodes
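A sketch of site-based partitioning: hashing the host name assigns every page of a site to the same crawling node, which keeps politeness decisions local; the node count is illustrative.

# Site-based Web partitioning (sketch): assign every host to one crawling node.
import hashlib
import urllib.parse

NUM_NODES = 4   # illustrative number of crawling nodes

def node_for(url):
    host = urllib.parse.urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_NODES   # same host -> same node

# All URLs from the same site map to the same node, e.g.
# node_for("http://example.com/a") == node_for("http://example.com/b")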
44. Parallelization Alternatives
• Firewall mode
– lower coverage
• Crossover mode
– duplicate pages
• Exchange mode
– communication overhead
(Diagram: with the Web partitioned across crawling nodes, crossover mode crawls some pages in duplicate, firewall mode never discovers some pages, and exchange mode communicates discovered links between nodes.)
46. Geographically Distributed Web Crawling
• Benefits
- higher crawling throughput
- geographical proximity
- lower crawling latency
- improved network politeness
- less overhead on routers because of fewer hops
- resilience to network partitions
- better coverage
- increased availability
- continuity of business
- better coupling with distributed indexing/search
- reduced data migration
47. Geographically Distributed Web Crawling
• Four crawling countries
- US
- Brazil
- Spain
- Turkey
• Eight target countries
- US, Canada
- Brazil, Chile
- Spain, Portugal
- Turkey, Greece
48. Focused Web Crawling
• The goal is to locate and download a large portion of web pages
that match a given target theme as early as possible.
• Example themes
- topic (nuclear energy)
- media type (forums)
- demographics (kids)
• Strategies
- URL patterns
- referring page content
- local graph structure
49. Sentiment Focused Web Crawling
• Goal: to locate and download web pages that contain positive
or negative sentiments (opinionated content) as early as
possible
50. Hidden Web Crawling
• Hidden Web: web pages that a crawler cannot access by
simply following the link structure
• Examples
- unlinked pages
- private sites
- contextual pages
- scripted content
- dynamic content
(Diagram: a hidden web crawler uses a vocabulary to fill in forms on a root page and stores the returned result pages in its repository.)
51. Research Problem – Passive Discovery
• URL discovery by external agents
– toolbar logs
– email messages
– tweets
• Benefits
– improved coverage
– early discovery
(Diagram: URLs extracted from toolbar logs pass through a toolbar data processor and a URL filter into the crawler's frontier, alongside URLs that the fetcher and parser discover from the Web and store in the repository.)
52. Research Problem – URL Scheduling
• Web master
– create
– modify
– delete
• Crawler
– discover
– download
– refresh
• How to optimally allocate available crawling resources
between the tasks of page discovery and refreshing
(State diagram: a page's state on the Web (inexistent, located, modified, deleted) and in the repository (undiscovered, in the frontier, fresh, deleted) changes through passive transitions caused by the webmaster (creation, modification, deletion) and active transitions taken by the crawler (discovery, download, re-download, purge).)
53. Research Problem – Web Partitioning
• Web partitioning/repartitioning: the problem of finding a Web
partition that minimizes the costs in distributed Web crawling
– minimization objectives
  – page download times
  – link exchange overhead
  – repartitioning overhead
– constraints
  – coupling with distributed search
  – load balancing
54. Research Problem – Crawler Placement
• Crawler placement problem: the problem of finding the
optimum geographical placement for a given number of data
centers
– geographical locations are now objectives, not constraints
• Problem variant: assuming some data centers were already
built, find an optimum location to build a new data center for
crawling
55. Research Problem – Coupling with Search
• Coupling with geographically distributed indexing/search
– crawled data may be
  – moved to a single data center
  – replicated on multiple data centers
  – partitioned among a number of data centers
– decisions must be made on
  – what data to move (e.g., pages or index)
  – how to move it (e.g., compression)
  – how often to move it (i.e., synchronization)
56. Research Problem – Green Web Crawling
• Goal: reduce the carbon footprint generated by the web
servers while handling the requests of web crawlers
• Idea
- crawl web sites when they are consuming green energy
(e.g., during the day when solar energy is more available)
- crawl web sites consuming green energy more often as an
incentive to promote the use of green energy
57. Published Web Crawler Architectures
• Bingbot: Microsoft's Bing web crawler
• FAST Crawler: Used by Fast Search & Transfer
• Googlebot: Web crawler of Google
• PolyBot: A distributed web crawler
• RBSE: The first published web crawler
• WebFountain: A distributed web crawler
• WebRACE: A crawling and caching module
• Yahoo Slurp: Web crawler used by Yahoo Search
58. Open Source Web Crawlers
• DataparkSearch: GNU General Public License.
• GRUB: open source distributed crawler of Wikia Search
• Heritrix: Internet Archive's crawler
• ICDL Crawler: cross-platform web crawler
• Norconex HTTP Collector: licensed under GPL
• Nutch: Apache License
• Open Search Server: GPL license
• PHP-Crawler: BSD license
• Scrapy: BSD license
• Seeks: Affero general public license
59. Key Papers
• Cho, Garcia-Molina, and Page, "Efficient crawling through URL ordering",
WWW, 1998.
• Heydon and Najork, "Mercator: a scalable, extensible web crawler", World
Wide Web, 1999.
• Chakrabarti, van den Berg, and Dom, "Focused crawling: a new approach
to topic-specific web resource discovery", Computer Networks, 1999.
• Najork and Wiener, "Breadth-first crawling yields high-quality pages",
WWW, 2001.
• Cho and Garcia-Molina, "Parallel crawlers", WWW, 2002.
• Cho and Garcia-Molina, "Effective page refresh policies for web crawlers”,
ACM Transactions on Database Systems, 2003.
• Lee, Leonard, Wang, and Loguinov, "IRLbot: Scaling to 6 billion pages and
beyond", ACM TWEB, 2009.
62. Indexing
• Indexing is the process of converting crawled web documents
into an efficiently searchable form.
• An index is a representation for the document collection over
which user queries will be evaluated.
63. Indexing
• Abandoned indexing techniques
– suffix arrays
– signature files
• Currently used indexing technique
– inverted index
64. Signature File
• For a given piece of text, a signature is created by encoding
the words in it
• For each word, a bit signature is computed
– contains F bits
– m out of the F bits are set to 1 (determined by hash functions)
• In case of long documents
– a signature is created for each logical text block
– block signatures are concatenated to form the signature for the
entire document
65. Signature File
• When searching, the signature of the query term is AND'ed with the document signature; the document is a potential match if all of the query bits are covered
• A term that is absent from the document may still have all of its bits covered, producing a false match
• Example with F = 6 and m = 2 (the document signature is the OR of its term signatures)

  document terms:            query terms:
  apple   10 00 10           apple   10 00 10   (match)
  orange  00 01 10           banana  01 00 01   (no match)
  peach   10 01 00           …                  (false match)
  document signature: 10 01 10
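A minimal Python sketch of this scheme, assuming a 6-bit signature and a toy hash that sets up to m = 2 bits per term; the hash choice and all names are illustrative, not from the slides:

```python
import hashlib

F = 6  # signature width in bits
M = 2  # bits set per term (collisions may set fewer)

def term_signature(term: str) -> int:
    # Derive up to M bit positions from hashes of the term.
    sig = 0
    for seed in range(M):
        h = hashlib.md5(f"{seed}:{term}".encode()).digest()
        sig |= 1 << (h[0] % F)
    return sig

def document_signature(terms) -> int:
    # The document signature is the OR of its term signatures.
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

def may_contain(doc_sig: int, term: str) -> bool:
    # All bits of the query signature must be covered by the
    # document signature; false matches are possible.
    q = term_signature(term)
    return doc_sig & q == q

doc_sig = document_signature(["apple", "orange", "peach"])
print(may_contain(doc_sig, "apple"))   # True: the term is present
print(may_contain(doc_sig, "banana"))  # may be False, or True as a false match
```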
66. Inverted Index
• An inverted index has two parts
  – inverted lists, composed of posting entries
    – document id
    – term score
  – a vocabulary index (dictionary)
67. Sample Document Collection
Doc id Text
1 pease porridge hot
2 pease porridge cold
3 pease porridge in the pot
4 pease porridge hot, pease porridge not cold
5 pease porridge cold, pease porridge not hot
6 pease porridge hot in the pot
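For illustration (not part of the original slides), a small Python sketch that builds an inverted index with (doc id, term count) postings over the sample collection above:

```python
from collections import Counter, defaultdict

docs = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
    4: "pease porridge hot, pease porridge not cold",
    5: "pease porridge cold, pease porridge not hot",
    6: "pease porridge hot in the pot",
}

# vocabulary term -> list of (doc id, term count) postings
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    counts = Counter(text.replace(",", "").lower().split())
    for term, count in counts.items():
        index[term].append((doc_id, count))

print(index["pease"])  # [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 1)]
```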
69. Inverted Index
• Additional data structures
- position lists: list of all positions a term occurs in a document
- document array: document length, PageRank, spam score, …
• Sections
- title, body, header, anchor text (inbound, outbound links)
[Figure: the inverted list for "pease", i.e., postings (1, 1) (2, 1) (3, 1) (4, 2) (5, 2) (6, 1) with document frequency 6, together with its position list and a document array entry (document length, spam score, …) for doc id 4]
70. Metrics
• Quality metrics
- spam rate: fraction of spam pages in the index
- duplicate rate: fraction of exact or near duplicate web pages
present in the index
• Performance metrics
- compactness: size of the index in bytes
- deployment cost: time and effort it takes to create and deploy
a new inverted index from scratch
- update cost: time and space overhead of updating a
document entry in the index
71. Indexing Documents
• Index terms are extracted from documents after some
processing, which may involve
- tokenization
- stopword removal
- case conversion
- stemming
– original text: Living in America
– after applying all steps: liv america
– in practice: living in america
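A rough sketch of such a pipeline in Python; the stopword list and the suffix-stripping rule are simplified stand-ins for what a production engine would use:

```python
import re

STOPWORDS = {"in", "the", "a", "an", "of"}  # toy stopword list

def stem(token: str) -> str:
    # Extremely simplified suffix stripping ("living" -> "liv");
    # real systems use Porter-style stemmers.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def index_terms(text: str, stop=True, stemming=True):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenization + case conversion
    if stop:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stemming:
        tokens = [stem(t) for t in tokens]
    return tokens

print(index_terms("Living in America"))                               # ['liv', 'america']
print(index_terms("Living in America", stop=False, stemming=False))   # ['living', 'in', 'america']
```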
72. Duplicate Detection
• Detecting documents with duplicate content
- exact duplicates (solution: computing/comparing hash values)
- near duplicates (solution: shingles instead of hash values)
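A hedged sketch of shingle-based near-duplicate detection; the shingle size and any similarity threshold are illustrative choices:

```python
def shingles(text: str, k: int = 3) -> set:
    # k-word shingles of the document
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = "pease porridge hot pease porridge cold"
d2 = "pease porridge hot pease porridge not cold"
# Exact duplicates would share identical hash values; near duplicates
# are detected by a high overlap between their shingle sets instead.
print(jaccard(shingles(d1), shingles(d2)))  # compare against a chosen threshold
```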
73. Features
• Offline computed features
- content: spam score, domain quality score
- web graph: PageRank, HostRank
- usage: click count, CTR, dwell time
• Online computed features
- query-document similarity: tf-idf, BM25
- term proximity features
74. PageRank
• A link analysis algorithm that assigns a weight to each web
page indicating its importance
• Iterative process that converges to a unique solution
• Weight of a page is proportional to
- number of inlinks of the page
- weight of linking pages
• Other algorithms
- HITS
- SimRank
- TrustRank
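For illustration, a small power-iteration sketch of PageRank on a made-up three-page graph; the damping factor (0.85) and the convergence tolerance are the usual textbook choices rather than anything prescribed by the slides:

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        new = {p: (1 - d) / n for p in pages}
        for p, outlinks in links.items():
            if not outlinks:                 # dangling page: spread its rank evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                share = d * rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += share
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(pagerank(toy_graph))  # converged weights for the toy graph
```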
75. Spam Filtering
• Content spam
- web pages with potentially many popular search terms and
with little or no useful information
- goal is to match many search queries to the page content
and increase the traffic coming from search engines
76. Spam Filtering
• Link spam
- a group of web sites that all link to every other site in the group
- goal is to boost the PageRank score of pages and make them more frequently visible in web search results
77. Spam Filtering
• PageRank is subject to manipulation by link farms
[Figure: list of the top 10 sites by PageRank]
78. Inverted List Compression
• Benefits
- reduced space consumption
- reduced transfer costs
- increased posting list cache hit rate
• Gap encoding
  - original doc ids: 17 18 28 40 44 47 56 58
  - gap encoded:      17  1 10 12  4  3  9  2
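A minimal sketch of the gap encoding above, paired with variable-byte compression (a common, though not the only, way to encode the small gaps); the byte layout is the standard 7-bits-plus-continuation-flag scheme:

```python
def gap_encode(doc_ids):
    # Store the first id, then differences between consecutive ids.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte(n: int) -> bytes:
    # Variable-byte encoding: 7 payload bits per byte,
    # high bit set on the last byte of each number.
    out = []
    while True:
        out.append(n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out.reverse()
    out[-1] |= 0x80
    return bytes(out)

postings = [17, 18, 28, 40, 44, 47, 56, 58]
gaps = gap_encode(postings)
print(gaps)                              # [17, 1, 10, 12, 4, 3, 9, 2]
print(b"".join(vbyte(g) for g in gaps))  # each small gap fits in a single byte
```

Gap encoding alone does not save space; it only makes the numbers small enough for a variable-length code such as the one above to pay off.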
80. Document Identifier Reordering
• Goal: reassign document identifiers so that we obtain many
small d-gaps, facilitating compression
• Example
old lists:   L1: 1 3 6 8 9     L2: 2 4 5 6 9     L3: 3 6 7 9
mapping:     1→1  2→9  3→2  4→7  5→8  6→3  7→5  8→6  9→4
new lists:   L1: 1 2 3 4 6     L2: 3 4 7 8 9     L3: 2 3 4 5
old d-gaps:  2 3 2 1 | 2 1 1 3 | 3 1 2
new d-gaps:  1 1 1 2 | 1 3 1 1 | 1 1 1
81. Document Identifier Reordering
• Techniques
- sorting URLs alphabetically and assigning ids in that order
  - idea: pages from the same site have high textual overlap
  - simple yet effective
  - only applicable to web page collections
- clustering similar documents
  - assigns nearby ids to documents in the same cluster
- traversal of the document similarity graph
  - formulated as the traveling salesman problem
82. Index Construction
• Equivalent to computing the transpose of a matrix
• In-memory techniques do not work well with web-scale data
• Techniques
- two-phase
- first phase: read the collection and allocate a skeleton for the index
- second phase: fill the posting lists
- one-phase
- keep reading documents and building an in-memory index
- each time the memory is full, flush the index to the disk
- merge all on-disk indexes into a single index in a final step
83. Index Maintenance
• Grow a new (delta) index in the memory; each time the
memory is full, flush the in-memory index to disk
- no merge
- flushed index is written to disk as a separate index
- increases fragmentation and query processing time
- eventually requires merging all on-disk indexes or rebuilding
- incremental indexing
- each inverted list contains additional empty space at the end
- new documents are appended to the empty space in the list
- merging delta index
- immediate merge: maintains only one copy of the index on disk
- selective merge: maintains multiple generations of the index on disk
84. Inverted Index Partitioning/Replication
• In practice, the inverted index is
– partitioned on thousands of computers in a large search cluster
– reduces query response times
– allows scaling with increasing collection size
– replicated on tens of search clusters
– increases query processing throughput
– allows scaling with increasing query volume
– provides fault tolerance
85. Inverted Index Partitioning
• Two alternatives for partitioning an inverted index
– term-based partitioning
– T inverted lists are distributed across P processors
– each processor is responsible for processing the postings of a
mutually disjoint subset of inverted lists assigned to itself
– single disk access per query term
– document-based partitioning
– N documents are distributed across P processors
– each processor is responsible for processing the postings of a
mutually disjoint subset of documents assigned to itself
– multiple (parallel) disk accesses per query term
88. Comparison of Index Partitioning Approaches
                                 Document-based    Term-based
Space consumption                Higher            Lower
Number of disk accesses          Higher            Lower
Concurrency                      Lower             Higher
Computational load imbalance     Lower             Higher
Max. posting list I/O time       Lower             Higher
Cost of index building           Lower             Higher
Maintenance cost                 Lower             Higher
89. Inverted Index Partitioning
• In practice, document-based partitioning is used
– simpler to build and update
– low inter-query-processing concurrency, but good load balance
– lower throughput, but better response time
– high throughput is achieved by replication
– easier to maintain
– more fault tolerant
• Hybrid techniques are possible (e.g., term partitioning inside
a document sub-collection)
90. Parallel Index Creation
• Possible alternatives for creating an inverted index in parallel
- message passing paradigm
- doc-based: each node builds a local index using its documents
- term-based: posting lists are communicated between the nodes
- MapReduce framework
Example (MapReduce):
  documents:      D1: A B C    D2: E B D    D3: B C
  map output:     (A, D1) (B, D1) (C, D1)   (E, D2) (B, D2) (D, D2)   (B, D3) (C, D3)
  reduce output:  A: D1    B: D1 D2 D3    C: D1 D3    D: D2    E: D2
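A toy, single-process imitation of that map/shuffle/reduce flow in Python; a real deployment would use Hadoop or a similar framework, and the document contents are the ones from the example above:

```python
from collections import defaultdict
from itertools import groupby

documents = {"D1": "A B C", "D2": "E B D", "D3": "B C"}

# Map: emit (term, doc id) pairs.
pairs = [(term, doc_id) for doc_id, text in documents.items()
         for term in text.split()]

# Shuffle: group pairs by term (Hadoop does this between map and reduce).
pairs.sort()

# Reduce: concatenate doc ids into the inverted list of each term.
inverted = {term: [doc for _, doc in group]
            for term, group in groupby(pairs, key=lambda p: p[0])}

print(inverted)  # {'A': ['D1'], 'B': ['D1', 'D2', 'D3'], 'C': ['D1', 'D3'], 'D': ['D2'], 'E': ['D2']}
```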
91. Static Index Pruning
• Idea: create a small version of the search index that can accurately answer most search queries
• Techniques
  - term-based pruning
  - doc-based pruning
• Result quality
  - guaranteed
  - not guaranteed
• In practice, caching achieves the same effect better
[Figure: queries are first evaluated on the pruned index and fall back to the main index when needed]
92. Tiered Index
• A sequence of sub-indexes
  - earlier sub-indexes are small and keep more important documents
  - later sub-indexes are larger and keep less important documents
  - a query is processed selectively, only on the first n tiers
• Two decisions need to be made
  - tiering (offline): how to place documents in different tiers
  - fall-through (online): at which tier to stop processing the query
[Figure: a query visits the 1st tier index first, then falls through to the 2nd and 3rd tier indexes as needed]
93. Tiered Index
• The tiering strategy is based on
some document importance metric
- PageRank
- click count
- spam score
• Fall-through strategy
- query the next index until there are
enough results
- query the next index until search
result quality is good
- predict the next tier’s result quality
by machine learning
94. Document Clustering
[Figure: a federator routes each query to the indexes of the most relevant document clusters (cluster 1, cluster 2, cluster 3)]
• Documents are clustered
- similarity between documents
- co-click likelihood
• A separate index is built for
each document cluster
95. Document Clustering
• A query is processed on the
indexes associated with the
most similar n clusters
• Reduces the workload
• Suffers from the load
imbalance problem
- query topic distribution may
be skewed
- certain indexes have to be
queried much more often
98. Research Problem – Indexing Architectures
• Replicated • Pruned • Clustered • Tiered
• Missing research
- a thorough performance comparison of these architectures
that includes all possible optimizations
- scalability analysis with web-scale document collections and
large numbers of nodes
99. Research Problem – Off-network Popularity
• Traditional features
- statistical analysis
- link analysis
- proximity
- spam
- clicks
- session analysis
• Off-network popularity
- social signals
- network
[Figure: signals gathered off the search network (users, routers, the Web) feed a content popularity database, which the ranker consults when producing the search engine result page]
100. Key Papers
• Brin and Page, "The anatomy of a large-scale hypertextual Web search
engine", Computer Networks and ISDN Systems, 1998.
• Zobel, Moffat, and Ramamohanarao, "Inverted files versus signature files
for text indexing". ACM Transactions on Database Systems, 1998.
• Page, Brin, Motwani, and Winograd, "The PageRank citation ranking:
bringing order to the Web", Technical report, 1998.
• Kleinberg, "Authoritative sources in a hyperlinked environment", Journal of
the ACM, 1999.
• Ribeiro-Neto, Moura, Neubert, and Ziviani, "Efficient distributed algorithms
to build inverted files", SIGIR, 1999.
• Carmel, Cohen, Fagin, Farchi, Herscovici, Maarek, and Soffer, "Static index
pruning for information retrieval systems", SIGIR, 2001.
• Scholer, Williams, Yiannis, and Zobel, "Compression of inverted indexes for
fast query evaluation", SIGIR, 2002.
103. Query Processing
• Query processing is the problem of generating the best-
matching answers (typically, top 10 documents) to a given
user query, spending the least amount of time
• Our focus: creating 10 blue links as an answer to a user query
104. Web Search
• Web search is a sorting problem!
[Figure: a stream of example user queries ("daylight saving", "giant squid", "good Indian restaurant", "Honda CRX", "facebook", …) issued against the Web; a ranking function f is applied to a query such as "good Indian restaurant" to sort web pages]
105. Metrics
• Quality metrics
- relevance: the degree to which returned answers meet user’s
information need.
• Performance metrics
- latency: the response time delay experienced by the user
- peak throughput: number of queries that can be processed per
unit of time without any degradation on other metrics
106. Measuring Relevance
• It is not always possible to know the user's intent and information need
• Commonly used relevance metrics in practice
  - recall
  - precision
  - DCG
  - NDCG
• Example (three rankings of four results, three relevant documents in total):

               Ranking 1   Ranking 2   Optimal
  Recall       1/3         1/3         1
  Precision    1/4         1/4         3/4
  DCG          1           0.63        1 + 0.63 + 0.5 = 2.13
  NDCG         1/2.13      0.63/2.13   1
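A short helper that reproduces those numbers, assuming binary relevance and the usual 1/log2(rank + 1) discount; it is a sketch, not any particular engine's implementation:

```python
import math

def dcg(relevances):
    # Binary relevance with a 1 / log2(rank + 1) discount.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

# Three relevant documents exist; each list gives relevance at ranks 1..4.
ranking1 = [1, 0, 0, 0]
ranking2 = [0, 1, 0, 0]
optimal  = [1, 1, 1, 0]

ideal = dcg(optimal)   # DCG of the best possible ranking = 2.13...
for name, r in [("Ranking 1", ranking1), ("Ranking 2", ranking2), ("Optimal", optimal)]:
    print(name, "DCG =", round(dcg(r), 2), "NDCG =", round(dcg(r) / ideal, 2))
```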
107. Estimating Relevance
• How to estimate the relevance between a given document
and a user query?
• Alternative models for score computation
- vector-space model
- statistical models
- language models
• They all pretty much boil down to the same thing
108. Example Scoring Function
• Notation
  - q: a user query
  - d: a document in the collection
  - t: a term in the query
  - N: number of documents in the collection
  - tf(t, d): number of occurrences of the term in the document
  - df(t): number of documents containing the term
  - |d|: number of unique terms in the document
• Commonly used scoring function: tf-idf

  s(q, d) = Σ_{t ∈ q} w(t, d) × log(N / df(t)),   where   w(t, d) = tf(t, d) / |d|
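A direct transcription of that scoring function in Python, reusing the pease-porridge collection from the indexing section; the tokenization is deliberately naive:

```python
import math
from collections import Counter

docs = {
    1: "pease porridge hot",
    2: "pease porridge cold",
    3: "pease porridge in the pot",
    4: "pease porridge hot pease porridge not cold",
    5: "pease porridge cold pease porridge not hot",
    6: "pease porridge hot in the pot",
}
N = len(docs)
tokenized = {d: text.split() for d, text in docs.items()}
df = Counter(t for terms in tokenized.values() for t in set(terms))

def score(query: str, doc_id: int) -> float:
    terms = tokenized[doc_id]
    tf = Counter(terms)
    unique = len(set(terms))                  # |d|
    s = 0.0
    for t in query.split():
        if t in tf:
            w = tf[t] / unique                # w(t, d) = tf(t, d) / |d|
            s += w * math.log(N / df[t])      # sum of w(t, d) * log(N / df(t)) over query terms
    return s

# Documents containing both "hot" and "pot" rank first.
print(sorted(docs, key=lambda d: score("hot pot", d), reverse=True))
```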
111. Design Alternatives in Ranking
• Document matching
- conjunctive (AND) mode
- the document must contain all query terms
- higher precision, lower recall
- disjunctive (OR) mode
- document must contain at least one of the query terms
- lower precision, higher recall
112. Design Alternatives in Ranking
• Inverted list organizations
- increasing order of document ids
- decreasing order of weights in postings
- sorted by term frequency
- sorted by score contribution (impact)
- within the same impact block, sorted in increasing order of document ids
Example posting list (doc id, weight):
  doc-id ordered:  (2, 0.2) (3, 0.5) (5, 0.1) (7, 0.4) (8, 0.3) (10, 0.1)
  weight ordered:  (3, 0.5) (7, 0.4) (8, 0.3) (2, 0.2) (5, 0.1) (10, 0.1)
114. Design Alternatives in Ranking
• Weights stored in postings
- term frequency
- suitable for compression
- normalized term frequency
- no need to store the document length array
- not suitable for compression
- precomputed score
- no need to store the idf value in the vocabulary
- no need to store the document length array
- not suitable for compression
115. Design Alternatives in Ranking
• In practice
- AND mode: faster and leads to better results in web search
- doc-id sorted lists: enables compression
- document-at-a-time list traversal: enables better optimizations
- term frequencies: enables compression
116. Scoring Optimizations
• Skipping
- list is split into blocks linked with pointers called skips
- store the maximum document id in each block
- skip a block if it is guaranteed that the sought document is not within the block
- gains in decompression time
- overhead of skip pointers
[Figure: a posting list split into blocks, each with a skip pointer storing the block's maximum doc id]
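A toy illustration of skipping during a doc-id lookup; the block size and the postings are made up, and real systems skip over compressed blocks rather than Python lists:

```python
import bisect

BLOCK = 4  # postings per block (illustrative)

def build_skips(postings):
    """postings: list of (doc_id, score) pairs sorted by doc id."""
    blocks = [postings[i:i + BLOCK] for i in range(0, len(postings), BLOCK)]
    skips = [block[-1][0] for block in blocks]  # max doc id per block
    return blocks, skips

def contains(blocks, skips, doc_id):
    # Skip every block whose maximum doc id is smaller than the target;
    # only the first remaining block needs to be decompressed and scanned.
    b = bisect.bisect_left(skips, doc_id)
    if b == len(blocks):
        return False
    return any(d == doc_id for d, _ in blocks[b])

postings = [(2, 0.2), (3, 0.5), (5, 0.1), (7, 0.4), (8, 0.3), (10, 0.1),
            (12, 0.2), (15, 0.3), (21, 0.1)]
blocks, skips = build_skips(postings)
print(contains(blocks, skips, 15))  # True, found without scanning the first block
print(contains(blocks, skips, 4))   # False
```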
117. Scoring Optimizations
• Dynamic index pruning
  - store the maximum possible score contribution of each list
  - compute the maximum possible score for the current document
  - compare with the lowest score in the heap
  - gains in scoring and decompression time
[Figure: three posting lists L1, L2, L3 with maximum score contributions 0.5, 0.3, and 0.2; the current top-2 heap holds documents 3 and 7 with scores 0.9 and 0.7]
118. Scoring Optimizations
• Early termination
- stop scoring documents when it is guaranteed that none of the remaining documents can make it into the top k list
- gains in scoring and decompression time
[Figure: score-ordered posting lists L1, L2, L3; the current heap holds documents 3 and 7 with scores 1.0 and 0.6, and scoring terminates once the remaining score contributions cannot change the top k]
119. Snippet Generation
• Search result snippets (a.k.a., summary or abstract)
- important for users to correctly judge the relevance of a web
page to their information need before clicking on its link
• Snippet generation
- a forward index is built providing a mapping between pages
and the terms they contain
- snippets are computed using this forward index and only for
the top 10 result pages
- efficiency of this step is important
- entire page as well as snippets can be cached
120. Parallel Query Processing
• Query processing can be parallelized at different granularities
  - parallelization within a search node (intra-query parallelism)
  - multi-threading within a search node (inter-query parallelism)
  - parallelization within a search cluster (intra-query parallelism)
  - replication across search clusters (inter-query parallelism)
  - distribution across multiple data centers
121. Query Processing Architectures
• Single computer
  - not scalable in terms of response time
• Search cluster
  - large search clusters (low response time)
  - replicas of clusters (high query throughput)
• Multiple search data centers
  - reduces user-to-center latencies
122. Query Processing in a Data Center
[Figure: a user query enters a frontend (query rewriters, cache, blender), which sends it through a broker to one or more search clusters; each search cluster consists of a master node and many child nodes, and results flow back to the user]
• Multiple node types
- frontend, broker, master, child
125. Query Processing on a Search Node
• Two-phase ranking
- simple ranking
- linear combination of query-dependent and query-independent
scores potentially with score boosting
- main objective: efficiency
- complex ranking
- machine learned
- main objective: quality
126. Machine Learned Ranking
• Many features
– term statistics (e.g., BM25)
– term proximity
– link analysis (e.g., PageRank)
– spam detection
– click data
– search session analysis
• Popular learners used by commercial search engines
– neural networks
– boosted decision trees
127. Machine Learned Ranking
• Example: gradient boosted decision trees
– chain of weak learners
– each learner contributes a partial score to the final document score
• Assuming
– 1000 trees
– an average tree depth of 10
– 100 documents scored per query
– 1000 search nodes
• Expensive
– 1000 × 10 × 100 = 1 million operations per query per node
– around 1 billion comparisons for the entire search cluster
128. Machine Learned Ranking
• Document-ordered traversal (DOT)
– scores are computed one document at a time over all scorers
– an iteration of the outer loop produces the complete score information for a
partial set of documents
• Disadvantages
– poor branch prediction because a different scorer is used in each inner loop iteration
– poor cache hit rates in accessing the data about scorers (for the same reason)
129. Machine Learned Ranking
• Scorer-ordered traversal (SOT)
– scores are computed one scorer at a time over all documents
– an iteration of the outer loop produces the partial score information for the
complete set of documents
• Disadvantages
– memory requirement (feature vectors of all documents need to be kept in memory)
– poor cache hit rates in accessing features as a different document is used in each
inner loop iteration
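A schematic contrast of the two traversal orders for an additive ensemble; the tree evaluation is stubbed out and all names are illustrative:

```python
def evaluate(tree, features):
    # Stub for walking one decision tree over a feature vector.
    return 0.1

def score_dot(docs_features, trees):
    # Document-ordered traversal: finish one document across all scorers.
    scores = []
    for features in docs_features:
        scores.append(sum(evaluate(tree, features) for tree in trees))
    return scores

def score_sot(docs_features, trees):
    # Scorer-ordered traversal: apply one scorer to all documents,
    # keeping partial scores for every document in memory.
    scores = [0.0] * len(docs_features)
    for tree in trees:
        for i, features in enumerate(docs_features):
            scores[i] += evaluate(tree, features)
    return scores

trees = ["tree"] * 1000          # e.g., 1000 boosted trees
docs = [{"bm25": 1.2}] * 100     # e.g., 100 candidate documents per node
assert score_dot(docs, trees) == score_sot(docs, trees)
```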
130. Machine Learned Ranking
• Early termination
– idea: place predictive functions between scorers
– predict during scoring whether a document will enter the final top k and quit scoring accordingly
131. Multi-site Web Search Architecture
Key points
- multiple, regional data centers (sites)
- user-to-center assignment
- local web crawling
- partitioned web index
- partial document replication
- query processing with selective forwarding
132. Multi-site Distributed Query Processing
• Local query response time
  – 2 × user-to-site latency
  – local processing time
• Forwarded query response time
  – local query response time
  – 2 × site-to-site latency
  – non-local query processing time
133. Query Forwarding
• Problem
– selecting a subset of non-local sites to which the query will be forwarded
• Objectives
  – reducing the loss in search quality w.r.t. evaluation over the full index
  – reducing average query response times and/or query processing workload
134. Machine-Learned Query Forwarding
• Two-phase forwarding decision
– pre-retrieval classification
– query features
– user features
– post-retrieval classification
– result features
• The post-retrieval classifier is consulted only if the confidence of the pre-retrieval classifier is not high enough
• This enables overlapping local query processing with remote processing of the query
135. Web Search Response Latency
• Constituents of response latency
  - user-to-center network latency
  - search engine latency
    - query features
    - caching
    - current system workload
  - page rendering time in the browser
[Figure: latency components on the path between the user, the search frontend, and the search backend (tuf, tfb, tpre, tproc, tpost, tbf, tfu, trender)]
136. Response Latency Prediction
• Problem: predict the response time of a query before processing it on the backend search system
• Useful for making query scheduling decisions
• Solution: build a machine learning model with many features
  - number of query terms
  - total number of postings in inverted lists
  - average term popularity
  - …
137. Impact of Response Latency on Users
• Impact of latency on the user varies depending on
  - context: time, location
  - user: demographics
  - query: intent
138. Impact of Response Latency on Users
• A/B test by Google and Microsoft (Schurman & Brutlag)
[Figure: "Persistent Impact of Post-header Delay": daily searches per user relative to control over weeks 3–11 for 200 ms and 400 ms added delays; usage drops by up to about 1% and the effect persists even after the delay is removed]
139. Research Problem – Energy Savings
• Observation: electricity prices vary across data centers and depending on the time of day
• Idea: forward queries to cheaper search data centers to reduce the
electricity bill under certain constraints
140. Research Problem – Green Search Engines
• Goal: reduce the carbon footprint of the search engine
• Query processing
  - shift workload from data centers that consume brown energy to those that consume green energy
- constraints:
- response latency
- data center capacity
141. Open Source Search Engines
• DataparkSearch: GNU general public license
• Lemur Toolkit & Indri Search Engine: BSD license
• Lucene: Apache software license
• mnoGoSearch: GNU general public license
• Nutch: based on Lucene
• Seeks: Affero general public license
• Sphinx: free software/open source
• Terrier Search Engine: open source
• Zettair: open source
142. Key Papers
• Turtle and Flood, "Query evaluation: strategies and optimizations",
Information Processing and Management, 1995.
• Barroso, Dean, and Holzle, "Web search for a planet: the Google
cluster architecture", IEEE Micro, 2003.
• Broder, Carmel, Herscovici, Soffer, and Zien, "Efficient query
evaluation using a two-level retrieval process", CIKM, 2003.
• Chowdhury and Pass, “Operational requirements for scalable search
systems”, CIKM, 2003.
• Moffat, Webber, Zobel, and Baeza-Yates, "A pipelined architecture
for distributed text query evaluation", Information Retrieval, 2007.
143. Key Papers
• Turpin, Tsegay, Hawking, and Williams, "Fast generation of result
snippets in web search", SIGIR, 2007.
• Baeza-Yates, Gionis, Junqueira, Plachouras, and Telloli, "On the
feasibility of multi-site web search engines", CIKM, 2009.
• Cambazoglu, Zaragoza, Chapelle, Chen, Liao, Zheng, and
Degenhardt, "Early exit optimizations for additive machine learned
ranking systems", WSDM, 2010.
• Wang, Lin, and Metzler, "Learning to efficiently rank", SIGIR, 2010.
• Macdonald, Tonellotto, Ounis, "Learning to predict response times
for online query scheduling", SIGIR, 2012.
146. Caching
• Cache: quick-access storage system
– may store data to eliminate the need to fetch the data from a
slower storage system
– may store precomputed results to eliminate redundant
computation in the backend
              Main backend     Cache
  speed       slower           faster
  workload    higher           lower
  capacity    larger           smaller
  cost        cheaper          more expensive
  freshness   fresher          more stale
147. Caching
• Often appears in a hierarchical form
– OS: registers, L1 cache, L2 cache,
memory, disk
– network: browser cache, web proxy,
server cache, backend database
• Benefits
– reduces the workload
– reduces the response time
– reduces the financial cost
[Figure: a multi-level cache hierarchy, from level 1 (fastest) to level 3 (slowest)]
148. Metrics
• Quality metrics
- freshness: average staleness of the data served by the cache
• Performance metrics
- hit rate: fraction of total requests that are answered by the cache
- cost: total processing cost incurred to the backend system
150. Query Frequency
• Skewed distribution in
query frequency
– Few queries are issued
many times (head
queries)
– Many queries are issued
rarely (tail queries)
151. Inter-Arrival Time
• Skewed distribution in query inter-arrival time
  – many queries have low inter-arrival times
  – few queries have high inter-arrival times
153. Caching Techniques
• Static caching
– built in an offline manner
– prefers items that were accessed often in the past
– periodically re-deployed
• Dynamic caching
– maintained in an online manner
– prefers items that are recently accessed
– requires removing items from the cache (eviction)
• Static/dynamic caching
– shares the cache space between a static and a dynamic cache
154. Static/Dynamic Caching
[Figure: an example request stream; the static cache (capacity 2), built from past frequencies, holds the most frequent items (A, C), while the dynamic cache (capacity 3) holds the most recently requested items (e.g., D, F, G)]
155. Techniques Used in Caching
• Admission: items are prevented from being cached
• Prefetching: items are cached before they are requested
• Eviction: items are removed from the cache when it is full
156. Admission
• Idea: certain items may be prevented from being cached
forever or until confidence is gained about their popularity
• Example admission criteria
– query length
– query frequency
• Example
  – minimum frequency threshold for admission: 2
  – maximum query length for admission: 4
  – query stream: ABC IJKLMN ABC ABC XYZ Q XYZ
[Figure: the resulting contents of the query cache and of the result cache for this stream]
157. Prefetching
• Idea: certain items can be cached before they are actually
requested if there is enough confidence that they will be
requested in the near future.
• Example use case: result page prefetching
Requested: page 1 → prefetch: page 2 and page 3
158. Eviction
• Goal: remove items that are not likely to lead to a cache hit
from the cache to make space for items that are more useful.
• Ideal policy: evict the item that will be requested in the most
distant future.
• Policies: FIFO, LFU, LRU, SLRU
• Most common: LRU
[Figure: an LRU cache processing an example request stream; when the cache is full and a new item arrives, the least recently used item (here, A) is evicted]
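A compact LRU cache sketch built on an ordered dictionary; the capacity and the request stream are arbitrary, and production result caches are considerably more elaborate:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None                      # cache miss
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            evicted, _ = self.items.popitem(last=False)  # drop least recently used
            return evicted
        return None

cache = LRUCache(capacity=4)
for query in ["A", "D", "C", "B", "A", "E"]:
    if cache.get(query) is None:
        evicted = cache.put(query, f"results for {query}")
        if evicted:
            print("evicted:", evicted)   # "D" is evicted when "E" arrives
```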
159. Result Cache Freshness
• In web search engines
– the index is continuously updated or re-built
– result caches have almost infinite capacity
– this leads to a staleness problem
161. Time-to-Live (TTL)
• Common solution:
setting a time-to-live
value for each item
• TTL strategies
– fixed: all items are
associated with the
same TTL value
– adaptive: a separate TTL value is set for each item
[Figure: backend query traffic rate (queries/sec) by hour of the day for TTL = 0 (no cache), TTL = 1 hour, and TTL = infinite; expired entries and compulsory misses hit the backend, and the infinite-TTL curve gives a lower bound on backend traffic]
163. Advanced Solutions
• Cache invalidation: the indexing system provides feedback
to the caching system upon an index update so that the
caching system can identify the stale results
– hard to implement
– incurs communication and computation overheads
– highly accurate
164. Advanced Solutions
• Cache refreshing: stale results are predicted and scheduled
for re-computation in idle cycles of the backend search
system
– easy to implement
– little computational overhead
– not very accurate
167. Result Caching on Multi-site Search Engines
• Strategies
- Local: each search site caches results of queries issued to itself
- redundant processing of the same query when issued to different sites
- Global: all sites cache results of all queries
- overhead of transferring cached query results to all remote sites
- Partial: query results are cached on relevant sites only
- trade-off between the local and global approaches
- Pointer: remote sites maintain only pointers to the site that cached
the query results
- reduces the redundant query processing overhead
168. Research Problem - Financial Perspective
• Past work optimizes
- hit rate
- backend workload
• Optimizing financial cost
- result degradation
- staleness
- current query traffic
- peak sustainable traffic
- current electricity price
[Figure: backend query traffic (queries/sec) by hour of the day for TTL = 1 hour, 4 hours, 16 hours, and infinite, plotted against the peak query throughput sustainable by the backend; annotations mark the opportunity for prefetching at low-traffic hours, the risk of overflow at peak hours, and the potential for load reduction]
169. Key Papers
• Markatos, “On caching search engine query results”, Computer
Communications, 2001.
• Long and Suel, “Three-level caching for efficient query processing in
large web search engines”, WWW, 2005.
• Fagni, Perego, Silvestri, and Orlando, “Boosting the performance of
web search engines: caching and prefetching query results by
exploiting historical usage data”, ACM Transactions on Information
Systems, 2006.
• Altingovde, Ozcan, and Ulusoy, “A cost-aware strategy for query
result caching in web search engines”, ECIR, 2009.
• Cambazoglu, Junqueira, Plachouras, Banachowski, Cui, Lim, and
Bridge, “A refreshing perspective of search engine caching”, WWW,
2010.
172. Summary
• Presented a high-level overview of important scalability and
efficiency issues in large-scale web search
• Provided a summary of commonly used metrics
• Discussed a number of potential research problems in the field
• Provided references to available software and the key
research work in literature
173. Observations
• Unlike past research, the current research on scalability is
mainly driven by the needs of commercial search engine
companies
• Scalability of web search engines is likely to remain a research challenge for some time to come
• The lack of hardware resources and large datasets makes scalability research quite difficult, especially for researchers in academia
174. Suggestions to Newcomers
• Follow the trends in the Web, user bases, and hardware
parameters to identify the real performance bottlenecks
• Watch out for newly emerging techniques whose primary target is to improve search quality, and think about their impact on search performance
• Re-use or adapt existing solutions in more mature research fields,
such as databases, computer networks, and distributed computing
• Know the key people in the field (the community is small) and
follow their work
175. Important Surveys/Books
• Web search engine scalability and efficiency
- B. B. Cambazoglu and Ricardo Baeza-Yates, “Scalability
Challenges in Web Search Engines”, The Information Retrieval
Series, 2011.
• Web crawling
- C. Olston and M. Najork: “Web Crawling”, Foundations and
Trends in Information Retrieval, 2010.
• Indexing
- J. Zobel and A. Moffat, “Inverted files for text search engines”,
ACM Computing Surveys, 2006.
• Query processing
- R. Baeza-Yates and B. Ribeiro-Neto, Modern Information
Retrieval (2nd edition), 2011.