Scientific discovery and innovation in an era of data-intensive science
William (Bill) Michener, Professor and Director of e-Science Initiatives for University Libraries, University of New Mexico; DataONE Principal Investigator
The scope and nature of biological, environmental and earth sciences research are evolving rapidly in response to environmental challenges such as global climate change, invasive species and emergent diseases. Scientific studies are increasingly focusing on long-term, broad-scale, and complex questions that require massive amounts of diverse data collected by remote sensing platforms and embedded environmental sensor networks; collaborative, interdisciplinary science teams; and new tools that promote scientific data preservation, discovery, and innovation. This talk describes the challenges facing scientists as they transition into this new era of data intensive science, presents current solutions, and lays out a roadmap to the future where new information technologies significantly increase the pace of scientific discovery and innovation.
Opening Keynote: The Many and the One: BCE themes in 21st century data curation
Allen Renear, Professor and Interim Dean, Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Two scientists can be using "the same data" even though the computer files involved appear to be quite different. This is familiar enough and, for the most part, in small communities with shared practices and familiar datasets, raises few problems. But these informal understandings do not scale to 21st century data curation. To get full value from cyberinfrastructure we must support huge quantities of heterogeneous data developed and used by diverse communities -- often with widely varying methods, tools, and purposes. To accomplish this, our informal practices and understandings must be replaced, or at least supplemented, by a shared framework of standard terminology for describing complex cascades of representational levels and relationships. Fundamental problems in data curation -- in particular problems involving provenance, identifiers, and data citation -- cannot be fully resolved without such a framework. Although the deepest problems here have ancient origins, useful practical measures are now within reach. Some recent work toward this end being carried out at the Center for Informatics Research in Science and Scholarship (CIRSS) at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign will be described.
Data Equivalence
Mark Parsons, Lead Project Manager, Senior Associate Scientist, National Snow and Ice Data Center
Data citation, especially using persistent identifiers like Digital Object Identifiers (DOIs), is an increasingly accepted scientific practice. Recently, several respected organizations have developed guidelines for data citation. The different guidelines are largely congruent in that they agree on the basic practice and elements of data citation, especially for relatively static, whole data collections. There is less agreement on the more subtle nuances of data citation that are sometimes necessary to ensure precise reference and scientific reproducibility -- the core purpose of data citation. We need to be sure that if you follow a data reference you get to the precise data that were used, or at least their scientific equivalent. Identifiers such as DOIs are necessary but not sufficient for the precise, detailed references that are needed. This talk discusses issues around data set versioning, micro-citation, and scientific equivalence. I propose some interim solutions and suggest research strategies for the future.
EZID: Easy dataset identification & management
Joan Starr, Manager, Strategic and Project Planning and EZID Service Manager, California Digital Library
Data and data curation are assuming a growing role in today's research library. New approaches are needed both to address the resulting challenges and to take advantage of the emerging opportunities. Long-term identifiers represent one such tool. In this presentation, Joan Starr will introduce identifiers and an application designed to make them easy to create and manage: EZID. She will provide a closer look at two identifier types, DOIs and ARKs, and discuss what bringing an identifier service to your institution might mean.
Efficient Query Processing in Geographic Web Search Engines (Yen-Yu Chen)
Geographic web search engines allow users to constrain and order search results in an intuitive manner by focusing a query on a particular geographic region. Geographic search technology, also called local search, has recently received significant interest from major search engine companies. Academic research in this area has focused primarily on techniques for extracting geographic knowledge from the web. In this paper, we study the problem of efficient query processing in scalable geographic search engines. Query processing is a major bottleneck in standard web search engines, and the main reason for the thousands of machines used by the major engines. Geographic search engine query processing is different in that it requires a combination of text and spatial data processing techniques. We propose several algorithms for efficient query processing in geographic search engines, integrate them into an existing web search query processor, and evaluate them on large sets of real data and query traces.
Difference Between Crawling, Indexing and Caching (Laxman Kotte)
Many people are confused about the terms crawling, indexing, and caching. Read this document to understand the difference between crawling, indexing, and caching.
Blogging With WordPress - Social Media Bootcamp (wesleyzhao)
Another presentation made to the Portland office of CH2M Hill, introducing employees to the very fundamentals of blogging using the WordPress front end.
Search Engine Optimization - Social Media Bootcamp (wesleyzhao)
An installment of the Social Media Bootcamp (CH2M Hill office in Portland, OR) that deals with Google and Search Engine Optimization. This covers the basics of how a search engine works and how to use that to your advantage to get more search hits. This covers key words, page links, web crawlers, and more.
Crawling, indexing, ranking: Make the search engine crawlers and algorithms y... (SEO monitor)
Long before your website achieves any top ranking, search engines digest it by crawling and indexing it. In both stages of evaluation there are capacity limits, and they can become bottlenecks. Learn to understand each judgement stage, which tools there are for steering it, and how to adjust things to get the maximum out of your SEO efforts.
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli... (lucenerevolution)
Presented by M.C. Srivas | MapR. See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This session addresses the biggest issue facing Big Data: search, discovery, and analytics need to be integrated. While creating and maintaining separate Solr and Hadoop clusters is time consuming, error prone, and difficult to keep in sync, most Hadoop installations do not integrate Solr within the same cluster. Find out how to easily integrate these capabilities into a single cluster. The session will also touch on some of the technical aspects of Big Data search, including how to: protect against silent index corruption that permeates large distributed clusters; overcome the shard distribution problem by leveraging Hadoop to ensure accurate distributed search results; and provide real-time indexing for distributed search, including support for streaming data capture. Srivas will also share relevant experiences from his days at Google, where he ran one of the major search infrastructure teams and GFS, BigTable, and MapReduce were used extensively.
Using the LucidWorks REST API to Support User-Configuration Big Data Search E... (lucenerevolution)
Presented by Mark Davis, CTO Kitenga - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
Kitenga's Analyst system uses the LucidWorks Enterprise REST API in a variety of ways, including for configuring collections and managing Solr schema. As part of the Kitenga platform, the ZettaSearch Designer empowers the end-user to dynamically drag-and-drop search widgets to create a specialized search interface. For users to effectively design search UIs that meet their needs, they need to be able to understand the available schema fields that populate a given collection. ZettaSearch Designer interrogates the Solr infrastructure using the Lucid REST API to provide an overview of the available metadata. It is then easy for the user to build rich, faceted search experiences around the metadata library indexed into the collection. In this implementation overview, I will describe the design of ZettaSearch Designer, how it interacts with big data technologies like Hadoop as part of the indexing pipeline, and how it uses the LucidWorks API to enable user discovery of the metadata needed to create novel search user interfaces on the fly.
Introducing the Big Data Ecosystem with Caserta Concepts & Talend (Caserta)
In this one-hour webinar, Caserta Concepts and Talend described an approach to achieve an architectural framework and roadmap to extend a traditional enterprise data warehouse environment, into a Big Data ecosystem.
They illustrated the architectural components involved for collecting, analyzing and delivering Big Data, with a focus on the importance of Hadoop, Data Integration, Machine Learning, NoSQL, Business Intelligence and Analytics.
Attendees learned:
Which Big Data technologies can’t be ignored
Considerations when extending the data ecosystem
What happens to your existing investment
What are the points of integration
Does Big Data = better data?
To access the recorded webinar or to learn more, visit http://www.casertaconcepts.com/.
Life Science Database Cross Search and Metadata (Maori Ito)
Life science databases are sometimes difficult to understand due to a lack of information. I'd like to add metadata to these databases and improve search results.
Kitenga's ZettaVox and ZettaSearch products support SOLR and Lucene ecosystems at both the ingestion point and for the search user. In this talk, I will show how ZettaVox, our professional content mining platform on Hadoop, can be used to index content and rich metadata into a LucidWorks Enterprise installation. Being built on Hadoop, ZettaVox scales up by scaling out. I will then create an end-user search and analytics experience using our ZettaSearch solution that leverages the faceted metadata to enhance information discovery and analysis. All in about 20 minutes.
Keynote at the Dutch-Belgian Information Retrieval Workshop, November 2016, Delft, Netherlands.
Based on KDD 2016 tutorial with Sara Hajian and Francesco Bonchi.
KDD 2016 tutorial on Algorithmic Bias, Parts I and II.
Video:
Part I: https://www.youtube.com/watch?v=mJcWrfoGup8
Part II: https://www.youtube.com/watch?v=nKemhMbaYcU
Part III: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
KDD 2016 tutorial on Algorithmic Bias, Parts III and IV.
Video: https://www.youtube.com/watch?v=ErgHjxJsEKA
By Sara Hajian, Francesco Bonchi, and Carlos Castillo.
http://francescobonchi.com/algorithmic_bias_tutorial.html
Various examples of observational studies, mostly for the analysis of social media.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Basic concepts about natural experiments, based mostly on Dunning's book.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Predictions of links in graphs based on content and information propagations.
Lecture for the M. Sc. Data Science, Sapienza University of Rome, Spring 2016.
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
PHP Frameworks: I want to break free (IPC Berlin 2024) (Ralf Eggert)
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Challenges in Distributed Information Retrieval [RBY] (ICDE 2007, Turkey)
1. Challenges in Distributed Information Retrieval
Ricardo Baeza-Yates (1,2)
Joint work with: C. Castillo (1), F. Junqueira (1), V. Plachouras (1) and F. Silvestri (3)
(1) Yahoo! Research Barcelona – Catalunya, Spain
(2) Yahoo! Research Latin America – Santiago, Chile
(3) ISTI-CNR – Pisa, Italy
Outline: Crawling, Indexing, Query Processing, Caching
5-6. Crawling
In theory it is simple: fetch, parse, fetch, parse, ...
In practice it is difficult: it implies using other people's resources (web servers' CPU and network).
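The fetch-parse loop can be sketched as a breadth-first traversal over a URL frontier. This is a minimal illustrative sketch, not the crawler from the talk: the `fetch` callback and the in-memory `web` dictionary are hypothetical stand-ins for real HTTP requests and HTML link extraction.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed, fetch, max_pages=100):
    # Minimal fetch-parse loop: take a URL off the frontier, fetch it,
    # extract its links, and enqueue any link not yet seen.
    frontier = deque([seed])
    seen = {seed}
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)           # real crawlers must be polite here
        pages[url] = text
        for link in links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# Toy in-memory "web" standing in for actual HTTP fetching and parsing.
web = {
    "http://a/":  ("page a", ["/b", "http://c/"]),
    "http://a/b": ("page b", []),
    "http://c/":  ("page c", ["http://a/"]),
}
result = crawl("http://a/", lambda u: web.get(u, ("", [])))
```

In practice this loop is exactly where the difficulties named above appear: per-host rate limiting, robots.txt, retries, and the cost of using other people's servers.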
7-10. Issues
How to partition the crawling task?
What to do when one agent fails?
How to communicate among agents?
How to deal with external factors?
11-12. Partitioning
Host-based partitioning exploits locality of links.
Balance improves if large and small hosts are treated differently.
Performance improves if geographic location is considered.
Consistent hashing allows agents to be added to and removed from the pool [Boldi et al., 2004].
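The consistent-hashing idea can be illustrated with a short sketch. The agent and host names are made up for illustration; the point (as in Boldi et al.) is that hashing at the host level preserves link locality, and that removing an agent reassigns only the hosts that agent owned.

```python
import bisect
import hashlib

def _h(key):
    # Stable 64-bit position on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:16], 16)

class ConsistentHashRing:
    """Each crawl agent is placed at several points on a hash ring; a host is
    assigned to the first agent clockwise from the host's hash. Adding or
    removing an agent only moves the hosts adjacent to that agent's points."""
    def __init__(self, agents, replicas=64):
        self.replicas = replicas
        self._ring = []  # sorted list of (position, agent)
        for agent in agents:
            self.add(agent)

    def add(self, agent):
        for i in range(self.replicas):
            bisect.insort(self._ring, (_h(f"{agent}#{i}"), agent))

    def remove(self, agent):
        self._ring = [(p, a) for p, a in self._ring if a != agent]

    def agent_for(self, host):
        # First ring point at or after the host's position, wrapping around.
        i = bisect.bisect(self._ring, (_h(host), ""))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing(["agent-0", "agent-1", "agent-2"])
hosts = [f"host{i}.example" for i in range(200)]
before = {h: ring.agent_for(h) for h in hosts}
ring.remove("agent-1")
after = {h: ring.agent_for(h) for h in hosts}
```

After the removal, every host that stays with its old agent keeps its assignment; only hosts previously on `agent-1` are redistributed, which is precisely the property that makes adding and removing agents cheap.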
13. Communication
Host-based partitioning reduces communication.
Highly-linked URLs should be cached.
Communication with the server can be improved if the server cooperates.
14. External factors
DNS can be a bottleneck.
Varying quality of HTTP implementations.
Varying quality of HTML coding.
Varying quality of service in general.
Spam.
16. What's Indexing

Indexing, in both databases and IR, is the process of building an index over a collection of documents
Inverted indexes are typically used in IR
  Lexicon: contains the distinct terms appearing in the collection's documents
  Posting lists: contain descriptions of the occurrences of each term within the corresponding documents
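A lexicon with posting lists can be built in a few lines. This toy sketch (the function name and sample documents are invented for illustration) maps each distinct term to a posting list of (document id, term frequency) pairs:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns the lexicon as a dict
    mapping each distinct term to its posting list, a sorted list of
    (doc_id, term_frequency) pairs."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    # Posting lists are kept sorted by doc_id, as in a real index.
    return {t: sorted(postings.items()) for t, postings in index.items()}

docs = {1: "distributed search engines", 2: "distributed indexing for search"}
index = build_inverted_index(docs)
# index["distributed"] == [(1, 1), (2, 1)]
```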
20. Index and Distributed Indexing

[Figure: the T × D term-document matrix, with terms T1, T2, ..., Tn as rows and documents D1, D2, ..., Dm as columns. A term partition slices the matrix horizontally; a document partition slices it vertically.]
21. Document Partitioning

Split the collection into several sub-collections and index each one of them separately (corresponding to vertically slicing the T × D matrix)
Pros:
  higher throughput
  new documents are easily added to existing indexes
  load is balanced
Cons:
  high number of disk operations
  high volume of data read from disk
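Query processing over a document-partitioned index is scatter-and-merge: the broker broadcasts the query to every shard, each shard answers from its own sub-index, and the broker merges the partial top-k lists. A toy sketch (the term-frequency scoring is a stand-in for a real ranking function, and the function names are invented):

```python
import heapq

def search_shard(shard_index, query_terms, k):
    """Score documents in one sub-collection by summed term frequency
    (a stand-in for a real ranking function) and return its local top k."""
    scores = {}
    for term in query_terms:
        for doc_id, tf in shard_index.get(term, []):
            scores[doc_id] = scores.get(doc_id, 0) + tf
    return heapq.nlargest(k, scores.items(), key=lambda x: (x[1], -x[0]))

def broker_search(shards, query_terms, k=3):
    """Broadcast the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:
        partial.extend(search_shard(shard, query_terms, k))
    return heapq.nlargest(k, partial, key=lambda x: (x[1], -x[0]))
```

Every shard is contacted for every query, which is where the extra disk operations listed above come from, but shards work in parallel and only small top-k lists travel to the broker.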
29. Term Partitioning

Split the terms of the lexicon (and the corresponding inverted lists) among the search systems (corresponding to horizontally slicing the T × D matrix)
Pros:
  reduced number of disk accesses
  reduced volume of exchanged data
Cons:
  requires the entire index to be built before slicing it into partitions
  not scalable with large collections
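Under term partitioning, only the nodes holding the query's terms are contacted; each returns the posting lists for its terms, and the broker combines them. A toy sketch (the hash-based term assignment and function names are assumptions for illustration, and the broker simply intersects document sets):

```python
def term_partition(index, n_nodes):
    """Assign each term's posting list to one node by hashing the term."""
    nodes = [dict() for _ in range(n_nodes)]
    for term, postings in index.items():
        nodes[hash(term) % n_nodes][term] = postings
    return nodes

def route_query(nodes, query_terms, n_nodes):
    """Contact only the nodes holding the query's terms; each returns
    its posting list, and the broker intersects the document sets."""
    posting_sets = []
    for term in query_terms:
        node = nodes[hash(term) % n_nodes]
        posting_sets.append({doc for doc, _ in node.get(term, [])})
    return set.intersection(*posting_sets) if posting_sets else set()
```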
36. Partitioning Goals

Partitioning is the first design issue to be faced in distributed indexing
A distributed index should allow for efficient query routing and resolution
Reducing the number of nodes queried is desirable too
39. Partitioning Techniques

Random partitioning
  documents are assigned uniformly at random to the various partitions
Topical organization using clustering (e.g., k-means [Larkey et al., 2000, Liu and Croft, 2004])
  documents are first clustered, and each partition is then composed of one (or more) cluster(s)
Usage-induced partitioning (e.g., the Query-Vector Document Model [Puppin et al., 2006])
  clustering is induced by the way users interact with the index
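Topical partitioning can be sketched with plain k-means: cluster document vectors and make each cluster a partition. This is a simplified sketch (Euclidean k-means on small dense vectors), not the actual procedure of [Larkey et al., 2000] or [Liu and Croft, 2004]; the function name and toy vectors are invented.

```python
import random

def kmeans_partition(doc_vectors, k, iters=20, seed=0):
    """Cluster document vectors with plain Euclidean k-means;
    each resulting cluster becomes one partition of the collection."""
    rnd = random.Random(seed)
    centroids = rnd.sample(doc_vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assignment step: each document goes to its nearest centroid.
        for v in doc_vectors:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[c])))
            clusters[nearest].append(v)
        # Update step: recompute centroids (keep old one if cluster empty).
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters
```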
45. Load Balancing Issues

In document-partitioned indexes not adopting collection selection strategies, load is almost balanced among all the query processors
In term-partitioned indexes (even with the new pipelined schema [Webber et al., 2006]), load balancing is an issue
In federated document-partitioned systems where collection selection is applied, balancing the load is still an unexplored issue

[Figure: load percentage (0-100) per query processor (1-8), for a document-distributed index and a pipelined term-distributed index.]
46. Exploiting Usage Information

Query logs contain features that are critical for optimizing the efficiency of different parts of search engines:
  query distribution
  query arrival times
  clickthrough information
  ...
51. Usage Information in Term-Partitioned Systems

The frequency of query terms can be exploited to partition a collection with the aim of balancing the load of the query processors
  bin-packing approach [Moffat et al., 2006]
  data-mining approach [Lucchese et al., 2007]
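The bin-packing idea can be illustrated with the classic greedy heuristic: sort terms by query-log frequency and repeatedly assign the heaviest unassigned term to the currently least loaded node. This is a sketch of the general technique, not the exact algorithm of [Moffat et al., 2006]; the function name and toy frequencies are invented.

```python
import heapq

def bin_pack_terms(term_query_freq, n_nodes):
    """Greedy longest-processing-time bin packing: each term (heaviest
    query-log frequency first) goes to the least loaded node, so the
    expected query load is balanced across the nodes."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    heapq.heapify(heap)
    assignment = {}
    for term, freq in sorted(term_query_freq.items(), key=lambda x: -x[1]):
        load, node = heapq.heappop(heap)   # least loaded node so far
        assignment[term] = node
        heapq.heappush(heap, (load + freq, node))
    return assignment
```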
54. Usage Information in Document-Partitioned Systems

Random partitioning does not ensure load balancing [Badue et al., 2006]
Broadcast-based systems perform unnecessary operations on sub-collections containing few or no relevant documents
Usage-based mapping can be adopted to build sub-collections that can be effectively discriminated upon query receipt [Puppin et al., 2006]
57. Challenges in Distributed Indexing

In document-partitioned systems, we need to find partitioning strategies that enhance the effectiveness of collection selection
In both kinds of systems, it is a challenge to find effective load-balancing strategies
59. Query Processing

System components
  Clients submitting queries
  Sites consisting of servers
  Servers are commodity computers
Query processing
  The system receives a query
  Query routing: forwarding the query to appropriate sites
  Merging results
Challenges
  Determining appropriate sites on the fly
  WAN communication is costly
60. Challenges in More Detail

Large-scale systems
  Large amounts of data
  Large data structures
  Large numbers of clients and servers
Partitioning of data structures
  Necessary due to very large data structures
  Enables parallel processing
  e.g., the document collection split by topic, language, or region
Replication of data structures
  For availability, throughput, and response time
  Conflicts with resource utilization
61. Framework for Distributed Query Processing

[Figure: a client connected through a WAN to three sites in different regions (Site A in Region X, Site B in Region Y, Site C in Region Z).]

The query processor matches documents to the received queries
The coordinator receives queries and routes them to the appropriate sites
The cache stores the results of previous queries
62. Currently...

Multiple sites
Sites are full replicas of each other
Simple query routing: dynamic DNS
Under the previous framework, there is an opportunity to
  use storage resources more efficiently
  adopt more sophisticated query routing mechanisms
  apply effective partitioning strategies (e.g., language-based strategies)
63. Partitioning

Goals
  Achieve cost-effective scalability
  Reduce response times
Potential solutions
  Partition large data structures by topic, language, etc.
  Effective query routing, first to local sites and then to global sites
  Incremental presentation of results to alleviate network latencies
64. Dependability

Goals
  Availability of query processors
  Consistency of replicated query data (can be weak)
  Consistency of user state, e.g., personalization and user preferences
Potential solutions
  More network resources: multi-homed sites
  Replication, both within and across sites
  Consistency: techniques for weak consistency (replicas eventually converge)
  Caching: improves availability when query processors are unavailable
65. Dependability

Achieving availability is not straightforward
The BIRN system was studied by Junqueira and Marzullo [Junqueira and Marzullo, 2005]
Partitions are quite frequent

[Figure: average number of sites (0-12) whose monthly availability falls below 100%, 99.8%, 99%, 98%, and 97%.]
66. Communication

Message latency
  Communication is costly in wide-area networks
  Latency is not negligible
  The capacity of the servers is reduced as the latency to process a query increases
Potential solutions
  Reduce as much as possible the number of sites contacted to process a query
  Have most queries processed by sites that are close in terms of network distance
67. Caching Query Results or Postings [Baeza-Yates et al., 2007]

Caching query answers:
  44% of queries are singletons (appear only once)
  88% of the unique queries are singletons
  An infinite cache would achieve a 56% hit ratio
Caching postings of terms:
  4% of terms are singletons
  73% of the unique terms (the vocabulary) are singletons
  An infinite cache would achieve a 96% hit ratio
Note: all statistics and graphs on caching refer to a one-year query log from yahoo.co.uk
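Statistics like those above can be computed from any request stream: an infinite cache misses each distinct item exactly once, so its hit ratio is one minus the fraction of distinct items, and singletons are the items seen exactly once. A small sketch (the function name is invented; the example stream is not the yahoo.co.uk log):

```python
from collections import Counter

def cache_stats(stream):
    """Compute singleton shares and the infinite-cache hit ratio for a
    stream of requests (queries or query terms)."""
    counts = Counter(stream)
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "singleton_share": singletons / total,          # of all requests
        "unique_singleton_share": singletons / len(counts),  # of distinct items
        # An infinite cache misses each distinct item exactly once.
        "infinite_cache_hit_ratio": 1 - len(counts) / total,
    }
```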
68. Static or Dynamic Caching of Postings

Static caching of postings (Qtf)
  Cache the terms with the highest query-log frequency fq(t)
However, there is a tradeoff between fq(t) and fd(t)
  Terms with a high query-log frequency fq(t) are good for the cache
  Terms with a high document frequency fd(t) occupy too much space
Static caching of postings as a knapsack problem (QtfDf)
  Cache the posting lists of the terms with the highest ratio fq(t)/fd(t)
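The QtfDf policy amounts to a greedy knapsack: rank terms by fq(t)/fd(t), using the posting-list length fd(t) as the space cost, and cache down the ranking until the memory budget runs out. A sketch (the function name and the toy term statistics are invented):

```python
def static_posting_cache(terms, budget):
    """terms: dict of term -> (fq, fd), where fq is the query-log
    frequency and fd is the document frequency (posting-list length,
    used as the space cost). Greedily caches the posting lists with the
    highest fq/fd ratio until the space budget is exhausted (QtfDf)."""
    ranked = sorted(terms.items(),
                    key=lambda kv: kv[1][0] / kv[1][1],
                    reverse=True)
    cached, used = [], 0
    for term, (fq, fd) in ranked:
        if used + fd <= budget:  # skip lists that no longer fit
            cached.append(term)
            used += fd
    return cached
```

Frequent but very long posting lists (typically stopwords) lose to moderately frequent terms with short lists, which is exactly the fq(t) versus fd(t) tradeoff on the slide.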
70. Analysis of Static Caching

Trade-offs between caching postings and caching answers
  Caching postings results in more hits
  Caching answers is faster
  Comparing the two requires considering time and space parameters
Problem: given a fixed amount of memory and the average response times of a system, how much should be allocated to caching answers and how much to caching postings?
71. Analysis of Static Caching

Scenario 1: centralized retrieval system, complete/partial query evaluation, un/compressed postings
  The postings cache can answer more queries than the answers cache
  Most of the available memory goes to caching postings
Scenario 2: WAN-distributed system, complete/partial query evaluation, un/compressed postings
  Network time dominates
  Most of the available memory goes to caching answers
Query dynamics
  Slowly changing query dynamics make static caching viable
72. References

Badue, C., Baeza-Yates, R., Ribeiro-Neto, B., Ziviani, A., and Ziviani, N. (2006). Analyzing imbalance among homogeneous index servers in a web search system. Information Processing & Management.

Baeza-Yates, R., Gionis, A., Junqueira, F., Murdock, V., Silvestri, F., and Plachouras, V. (2007). The impact of caching on search engines. In Proceedings of the International ACM SIGIR Conference (to appear), Amsterdam, Netherlands.

Boldi, P., Codenotti, B., Santini, M., and Vigna, S. (2004). UbiCrawler: a scalable fully distributed web crawler. Software, Practice and Experience, 34(8):711-726.
73. References (continued)

Junqueira, F. and Marzullo, K. (2005). Coterie availability in sites. In Proceedings of the International Conference on Distributed Computing (DISC), number 3724 in LNCS, pages 3-17, Krakow, Poland. Springer Verlag.

Larkey, L. S., Connell, M. E., and Callan, J. (2000). Collection selection and results merging with topically organized U.S. patents and TREC data. In CIKM '00: Proceedings of the ninth international conference on Information and knowledge management, pages 282-289, New York, NY, USA. ACM Press.

Liu, X. and Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 186-193, New York, NY, USA. ACM Press.
74. References (continued)

Lucchese, C., Orlando, S., Perego, R., and Silvestri, F. (2007). Mining query logs to optimize index partitioning in parallel web search engines. To appear in Proceedings of the 2nd International Conference on Scalable Information Systems (INFOSCALE 2007).

Moffat, A., Webber, W., and Zobel, J. (2006). Load balancing for term-distributed parallel retrieval. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 348-355, New York, NY, USA. ACM Press.

Puppin, D., Silvestri, F., and Laforenza, D. (2006). Query-driven document partitioning and collection selection. In InfoScale '06: Proceedings of the 1st international conference on Scalable information systems, page 34, New York, NY, USA. ACM Press.
75. References (continued)

Webber, W., Moffat, A., Zobel, J., and Baeza-Yates, R. (2006). A pipelined architecture for distributed text query evaluation. Information Retrieval. Published online October 5, 2006.