Introduction to Enterprise Search. A two-hour class to introduce Enterprise Search. It covers:
The problems enterprise search can solve
History of (web) search
How we search and find?
Current state of Enterprise Search + stats
Technical concept
Information quality
Feedback cycle
Five dimensions of Findability
4. Agenda
• Problem
• History of (web) search
• How we search and find?
• Current state of Enterprise Search + stats
• Technical concept
• Information quality
• Feedback cycle
• Five dimensions of Findability
8. The Problems
• Growing amounts of Information
• Changing patterns of information
consumption
• Information silos
• Web-like behaviour > Information filters
• Internal information use is still in the
Digital Stone Age
9. History of Search
In academia, search is called Information
Retrieval.
It is an old discipline, dating back
thousands of years...
Basic concepts in Information Retrieval:
Recall and Precision, more later...
10. Directories vs. Search Engines
• Directories are manually compiled taxonomies of
websites
• Directories are far more costly and time intensive to
maintain
• Directories lack coverage, although they provide an
important alternative, especially for novice surfers
• Search engines rely mainly on automated search
algorithms
• Search engines rank pages by popularity on the web,
the more referrals (links) the more relevant
11. Early days of Web Search
Yahoo – searchable directory (1994, ~10000 websites)
• Integrates search over its directory. Organized by subject matter. Sites can be
suggested, but human editors control the quality of the directory (~100 dedicated editors)
Ask – natural language search engine (1998)
• Used human editors to match popular queries. Tried different algorithms to rank
pages by popularity
Google – searchable index (1998)
• Developed PageRank, a popularity algorithm that hides bad content. Set standards
(spellchecking, query suggestion, search results page design)
12. Web Search - evolution
First generation (1995-97) – AltaVista, Excite, WebCrawler
Uses mostly on-page data (text and formatting).
Informational queries.
Second generation (1998-2010) – Google, Yahoo
Use off-page, web-specific data: link analysis, anchor text,
click-through data. Informational and navigational queries.
Third generation (2010-present) – Google, Wolfram-Alpha,
Bing
Blend data from many sources and try to answer "the need
behind the query": semantic analysis, context determination,
dynamic database selection, etc. Informational, navigational, and
transactional queries.
14. Seeking information modes:
Navigational
Reach a particular site that the user has in
mind, either because they visited it in the
past or because they assume that such a
site exists. Such queries usually have only
one "right" result.
15. Seeking information modes:
Transactional
Reach a site where further interaction will happen. This
interaction constitutes the transaction defining these
queries. The main categories for such queries are
shopping, finding various web-mediated services,
downloading various types of file (images, songs, etc.),
accessing certain databases (e.g. Yellow Pages type data),
finding servers (e.g. for gaming), etc.
16. Four modes of seeking information
Finding something when I
know what I want and have
words to describe it.
17. Four modes of seeking information
Exploring when I only have
some idea of what I want and
may lack the words to
articulate it.
18. Four modes of seeking information
Finding relevant items when I
don’t know what I need.
19. Four modes of seeking information
Finding something I have seen
before, but can’t remember
where.
20. The State of Enterprise Search
• The amount of information is growing
every day
• What to Search for?
• Where to Search?
• How to Search?
• Search is simple, complex and powerful
• Findability Dimensions
30. WHAT ARE THE OBSTACLES
TO FINDING THE RIGHT
INFORMATION?
31. Globally
63.4% POOR SEARCH FUNCTIONALITY
52.1% DON'T KNOW WHERE TO LOOK
51.4% INCONSISTENCY IN HOW WE TAG
CONTENT
50.0% LACK OF ADEQUATE TAGS
33.1% DON’T KNOW WHAT TO LOOK FOR
32. Wikipedia Definition
“Enterprise search is the practice of
making content from multiple
enterprise-type sources, such as
databases and intranets, searchable to a
defined audience.”
http://en.wikipedia.org/wiki/Enterprise_search
33. The Concept of Enterprise
Search: Precision
In the field of information retrieval, precision is the
fraction of retrieved documents that are relevant to the
search.
Precision takes all retrieved documents into account,
but it can also be evaluated at a given cut-off rank,
considering only the topmost results returned by the
system. This measure is called precision at n or P@n.
Source: Wikipedia
34. The Concept of Enterprise
Search: Recall
Recall in information retrieval is the fraction of the
documents that are relevant to the query that are
successfully retrieved.
For example, for text search on a set of documents, recall
is the number of correct results divided by the number
of results that should have been returned.
Source: Wikipedia
35. Precision and Recall
M = number of relevant documents
N = number of retrieved documents
R = number of retrieved documents that are also relevant
36. Precision and Recall
Recall = R / M =
Number of retrieved documents that are
also relevant / Total number of relevant
documents.
Precision = R / N =
Number of retrieved documents that are
also relevant / Total number of retrieved
documents.
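The two formulas, plus P@n from the precision slide, can be sketched in a few lines of Python. This is only an illustration: the function names and the document ids are made up for the example, and the relevance judgements are assumed to be given as a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: list of document ids returned by the search engine
    relevant:  set of document ids judged relevant to the query
    """
    retrieved_set = set(retrieved)
    r = len(retrieved_set & relevant)  # R: retrieved documents that are also relevant
    n = len(retrieved_set)             # N: retrieved documents
    m = len(relevant)                  # M: relevant documents
    precision = r / n if n else 0.0
    recall = r / m if m else 0.0
    return precision, recall


def precision_at_n(retrieved, relevant, n):
    """Precision over the top n results only (P@n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc in retrieved[:n] if doc in relevant)
    return hits / n


# Hypothetical query: 4 documents returned, 3 documents are actually relevant.
results = ["d1", "d2", "d3", "d4"]
judged_relevant = {"d1", "d3", "d5"}
print(precision_recall(results, judged_relevant))   # R=2, N=4, M=3
print(precision_at_n(results, judged_relevant, 2))  # only d1 and d2 are considered
```

Here precision is 2/4 and recall is 2/3: half of what was returned is useful, and one relevant document (d5) was never found.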
37. Relevance
...enterprises typically have to use other query-
independent factors, such as a document's recency or
popularity, along with query-dependent factors
traditionally associated with information retrieval
algorithms. Also, the rich functionality of enterprise
search UIs, such as clustering and faceting, diminish
reliance on ranking as the means to direct the user's
attention.
Source: Wikipedia
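As a rough sketch of what combining query-dependent and query-independent factors might look like, the Python below blends a text-match score with recency and popularity. Everything in it is an assumption made for illustration: the function name, the weights, the 180-day half-life and the 10,000-view cap are not taken from any particular search platform.

```python
import math


def blended_score(text_score, age_days, views,
                  half_life_days=180.0, max_views=10_000,
                  w_text=0.7, w_recency=0.2, w_popularity=0.1):
    """Combine a query-dependent score with two query-independent signals.

    text_score: query-dependent relevance in [0, 1] (e.g. a normalised text match)
    age_days:   days since the document was last updated
    views:      how many times the document has been opened
    All weights and constants are illustrative assumptions.
    """
    # Recency: 1.0 for a brand-new document, halved every half_life_days.
    recency = 0.5 ** (age_days / half_life_days)
    # Popularity: log-scaled view count, capped at 1.0.
    popularity = min(math.log1p(views) / math.log1p(max_views), 1.0)
    return w_text * text_score + w_recency * recency + w_popularity * popularity


# Two documents that match the query equally well; the fresher one scores higher.
fresh = blended_score(text_score=0.5, age_days=0, views=0)
stale = blended_score(text_score=0.5, age_days=365, views=0)
print(fresh > stale)
```

In practice the weights would be tuned against search analytics and user feedback rather than fixed up front.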
39. Relevance
We do not have PageRank...
...but we have social!
Social Reconnects Enterprise Search
Emails, People Catalogues, Connections,
Tagging, Sharing etc.
41. Search based Solutions
Examples of implementations:
- People Search
- Product Search
- Document Search
- Intranet and Website Search
- E-commerce
- Dashboard / Search as a Service
42. Information / Content
• Good Data/Information hygiene
• Crap in = Crap out
• Metadata is very important!
• Taxonomy and Metadata demystified
• TetraPak example (video)
• SimCorp example
• VGR example (video)
52. User Satisfaction
• Feedback form
• KPI from Search Analytics
• Session time × number of sessions = time spent
on search; time spent on search × hourly price =
cost per “answer”
• Add search refinements + exit page (= is
the right answer)
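The cost KPI above is simple arithmetic and can be sketched in Python. The function name and the figures are invented for the example. The optional division by the number of answered sessions is an added assumption (for instance, sessions that end on an exit page), not something the slide specifies.

```python
def search_cost(avg_session_minutes, sessions, hourly_price, answered_sessions=None):
    """Estimate what search costs in working time.

    avg_session_minutes: average length of a search session, from search analytics
    sessions:            number of search sessions in the period
    hourly_price:        internal hourly cost of an employee
    answered_sessions:   optional count of sessions judged to have found an answer
                         (an assumption added for this sketch)
    """
    hours_spent = avg_session_minutes * sessions / 60  # time spent on search
    total_cost = hours_spent * hourly_price            # time spent x hourly price
    if answered_sessions:
        return total_cost / answered_sessions          # cost per answered session
    return total_cost


# Hypothetical month: 1000 sessions of 3 minutes each at an hourly price of 50.
print(search_cost(3, 1000, 50))                         # total cost of time spent
print(search_cost(3, 1000, 50, answered_sessions=800))  # cost per answered session
```

Tracking this figure over time gives a concrete number to set against the feedback form and the search analytics KPIs.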
53. Findability by Findwise
1. BUSINESS
Build solutions to support your business processes and goals
2. INFORMATION
Prepare information to make it findable
3. USERS
Build usable solutions based on user needs
4. ORGANISATION
Govern and improve your solution over time
5. SEARCH TECHNOLOGY
Build solutions based on state-of-the-art search technology
54. Business
• Analyze how your business goals and
strategies can be met by improved
information access
• Set Findability goals. Examples: increase
sales revenue, raise productivity, improve
knowledge sharing, enable better collaboration
• Specify your requirements
• Define KPIs and measure the success of your
investments
55. Information
• Clean up and archive or delete outdated or
irrelevant information
• Ensure good quality of information by
adding structured and suitable metadata
• Create and use information models and
taxonomies
• Tagging?
56. Users
• Get to know your users and their needs
• Make sure your solution is easy to use
• Perform continuous usability evaluations,
like usage tests and expert evaluations
• Make sure users find what they are looking
for
• Enable feedback loops for complaints,
feedback and praise
57. Organisation
• Resources!
• Define processes, roles and routines to
govern the solution
• Perform Search Analytics
• Create easy-to-use administration
interfaces
• Provide training, both technical and editorial
• Help publishers get started with processes
for better findability
58. Search Technology
• Select a suitable search platform or make
the most of your current solution
• Design your architecture with search-as-a-
service in mind
• Utilise the full potential of the selected
technology
59. Kristian Norling
LinkedIn
@kristiannorling
@findwise
findwise.com
Findability Blog
Slideshare
Vimeo
Newsroom
Editor's Notes
What do you want to know?
We humans love to collect information; we have a harder time deleting or archiving it.
When we start valuing information correctly, we can also motivate investments in search and put processes in place to keep information up to date and of high quality.
Information hygiene: structure, metadata.
Infonomics = information as an asset on the balance sheet.
Is this how you feel information is organised and structured in your organisation?
Is the information you need stored in a silo somewhere?
Yahoo: The directory is organized by subject matter, the top level containing categories such as Arts and Humanities, Business and Economy, Computers and the Internet, Education, Government, Health, News and Media, Recreation and Sports, Science, Society and Culture, and so on.
The natural hierarchical structure of the directory allows users easy navigation through and across its categories.
The directory is not strictly hierarchical, as it has many cross-references between categories from different parts of the hierarchy. For example, the subcategory Musicals under the Theater category has a reference to the Movies and Film subcategory, which comes under Entertainment.
Web directories provide an important alternative to search engines, especially for novice surfers, as the directory structure makes it easy to find relevant information, provided that an appropriate category for the search query can be found. The fundamental drawback of directories is their lack of coverage.
Knowing the category of a web page that a user clicked on is very indicative of the user's interests, and may be used to recommend to the user similar pages from the same or a related category. To solve the problem of how to automatically associate a web page with a category, we need to make use of machine learning techniques for automatic categorization of web pages.
Navigational queries: reach a particular site that the user has in mind, either because they visited it in the past or because they assume that such a site exists. These queries usually have only one "right" result.
Informational queries: find information assumed to be available on the web in a static form.
Transactional queries: reach a site where further interaction will happen. This interaction constitutes the transaction defining these queries. The main categories for such queries are shopping, finding various web-mediated services, downloading various types of file (images, songs, etc.), accessing certain databases (e.g. Yellow Pages type data), finding servers (e.g. for gaming), etc.
2nd gen.: Google was the first engine to use link analysis as a primary ranking factor, and DirectHit concentrated on click-through data. By now, all major engines use all these types of data. Link analysis and anchor text seem crucial for navigational queries.
3rd gen.: For instance, on a query like San Francisco the engine might present direct links to a hotel reservation page for San Francisco, a map server, a weather server, etc.
Rapidly changing landscape.
Information silos. They are everywhere.
Enterprise Search can “integrate” them.
On intranets or in our website search we do not have the equivalent of PageRank.
We can’t use the number of inbound links as a factor for relevancy.
We have to find other ways...