The document provides an overview of search engines and search algorithms. It discusses (1) the key concepts of search including user intent, queries, documents and results; (2) the technical aspects such as indexing, ranking, and learning algorithms; and (3) current and future challenges for search. Learning algorithms covered include pointwise, pairwise, and listwise approaches. The goal of search engines is to accurately match user intent with relevant documents from a large corpus.
2. About
• Enterprise software company that develops products for software
developers, project managers, and content management
3. About
• Our products:
4. About Me
Head of Search & Smarts Engineering at Atlassian
• In charge of all customer-facing ML/AI initiatives, including Search
• Our main initiative is Cross-Product Search in ‘Home’
Before Atlassian:
• Particle Physicist by training
• Initiated Data Science efforts at several companies
• Previously member of the Search team at @WalmartLabs
5. About this Talk
What to expect
• A general introduction to Search
• An overview of both the Engineering and ML aspects of Search
• Insights into the current and future challenges of Search
What not to expect
• An extensive tutorial covering the entire Learning-to-Rank landscape
• To become a Search expert in 40 min
6. Outline
• Part I: The Concepts of Search
• Part II: The Technical Aspects of Search
• Part III: Learning Algorithms
• Part IV: Measuring Search Relevance
• Part V: The Challenges and the Future of Search
8. The (Pre)History of Search
Altavista: first to allow NL queries
Web Crawler: 1st crawler to index entire pages
Lycos: ranked relevance retrieval
Yahoo! Directory
1990: Archie, the first search engine: an index of downloadable directory listings
1991: Veronica, Jughead: search file names and titles stored in Gopher index systems
1992: Vlib: Tim Berners-Lee set up a Virtual Library
1993: Excite; WWW Wanderer; Primitive Web Search
1994
1995: LookSmart
1996: Inktomi: HotBot; Google
1997: Ask.com
9. The History of Search
Timeline from 1998 to 2010; entries in slide order: MSN, Open Directory Project (1998); AllTheWeb (1999); Overture Services; Snap; LiveSearch; Cuil; Bing; inline search suggestions.
10. What is Search?
Convert an intent into an action that helps people
retrieve something, i.e. a piece of content
CONTENT OVERLOAD
Search
11. What is Search?
• Retrieving, organizing & classifying information
• Includes:
  • Web Search
  • Faceted Search (e-Commerce)
  • Enterprise Search
• But also:
  • Different types of documents: Image Search, etc.
• In a wider sense of the term:
  • Recommendation (Search with no explicit intent from the user)
  • Structured Query Language
14. What is Search (Really) About?
Diagram: Users (user intent) send a request (the search query) against the content (documents); search results are returned to the users. Stages: INTERPRETATION, RETRIEVAL, DISPLAY.
15. What is Search (Really) About?
Multi-tenancy Search
• Query space not controlled
• Content dependent on customer
Diagram: three users (User 1, User 2, User 3), each with their own intent and their own document set (Documents 1, Documents 2, Documents 3). Each user sends a request (search query) and gets search results returned. Stages: INTERPRETATION, RETRIEVAL, DISPLAY.
16. Data Zoo For Search
Query data
• What are you searching for? (query terms)
Content data
• What are the documents about? (topics)
Contextual data
• Who are you? (user data – both static and learned)
• In which circumstances are you searching?
Engagement data
• As a group (what web pages are ‘hot’ these days?)
• As an individual (your personal viewing history)
17. The Processes of Search
CRAWLER
• Automated browser that views your web pages
• Strips out the HTML text content
18. The Processes of Search
INDEXER
• Stores records of all pages viewed by the spider/crawler
• Database being searched when the ‘search’ button is hit
19. The Processes of Search
SEARCHER
• Algorithm used to sort through the database of pages
• Finds the most relevant content
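The three processes above can be sketched end to end on a couple of in-memory pages. This is only a minimal illustration, not how any production engine works: `PAGES`, `crawl`, `build_index` and `search` are hypothetical names, no network access is involved, and "most relevant" is approximated by counting matching query words.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML page, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def crawl(pages):
    """Crawler stand-in: 'views' each page and strips out the HTML."""
    for url, html in pages.items():
        extractor = TextExtractor()
        extractor.feed(html)
        extractor.close()
        yield url, " ".join(extractor.chunks)


def build_index(crawled):
    """Indexer: records which pages contain which words."""
    index = defaultdict(set)
    for url, text in crawled:
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Searcher: pages ordered by how many query words they contain."""
    hits = defaultdict(int)
    for word in query.lower().split():
        for url in index.get(word, ()):
            hits[url] += 1
    return sorted(hits, key=lambda url: (-hits[url], url))


# Hypothetical pages standing in for fetched web content.
PAGES = {
    "a.html": "<html><body><h1>Cats</h1><p>cats with sunglasses</p></body></html>",
    "b.html": "<html><body><p>dogs with hats</p></body></html>",
}

index = build_index(crawl(PAGES))
results = search(index, "cats sunglasses")
```

Here `results` contains only `a.html`, while a query for `with` matches both pages.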
22. Indexing
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query.
23. Indexing
• Without an index, the search engine would scan every document in the corpus
• Benefits: computation and time saving at query time
  • 10,000 documents can be queried within milliseconds with an index
  • a sequential scan could take hours
• Disadvantages:
  • additional computer storage required to store the index
  • increase in the time required for an update to take place
• Design factors:
  • Storage techniques
  • Index size, lookup speed
  • Maintenance, fault tolerance
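The trade-off described above can be made concrete with a small sketch that contrasts a sequential scan with a lookup in a precomputed index. The corpus `DOCS` and the function names are hypothetical, and the example makes no claim about real timings.

```python
from collections import defaultdict


def sequential_scan(docs, term):
    """No index: look inside every document at query time."""
    return {doc_id for doc_id, text in docs.items() if term in text.lower().split()}


def build_inverted_index(docs):
    """Built ahead of query time; this is the extra storage and update cost."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def index_lookup(index, term):
    """With an index: a single dictionary lookup instead of a full scan."""
    return set(index.get(term, ()))


# Hypothetical toy corpus.
DOCS = {
    1: "search engines rank documents",
    2: "an index speeds up search",
    3: "cats wearing sunglasses",
}

inverted = build_inverted_index(DOCS)
```

Both `sequential_scan(DOCS, "search")` and `index_lookup(inverted, "search")` return documents 1 and 2; the difference is that the scan touches every document on every query, while the index pays that cost once and again on each update.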
24. What Happens at Indexing Time?
Indexing Process
• Text Acquisition: identifies and stores documents for indexing (e-mail, web pages, news articles, memos, letters) in a document data store, as text + metadata (doc type, structure, features)
• Text Transformation: transforms documents into index terms or features
• Index Creation: takes index terms & creates data structures (inverted indexes) to support fast searching
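As a rough sketch of the text transformation and index creation steps, the snippet below turns stored documents into index terms and then into a positional inverted index. The stop-word list, the `STORE` records and the function names are assumptions made for illustration; real pipelines typically add stemming, language detection and more.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; real systems use larger, language-specific ones.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def transform(text):
    """Text transformation: raw text -> list of (position, index term)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOP_WORDS]


def create_index(store):
    """Index creation: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, record in store.items():
        for pos, term in transform(record["text"]):
            index[term].setdefault(doc_id, []).append(pos)
    return index


# Text acquisition stand-in: a document data store holding text + metadata.
STORE = {
    "memo-1": {"text": "The history of search", "doc_type": "memo"},
    "news-7": {"text": "Search engines and the Web", "doc_type": "news"},
}

index = create_index(STORE)
```

Keeping term positions in the index is one way to support proximity-based ranking later on.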
25. Parsing
1. Identify What To Search For
Find out what words get searched and interpret the query term
2. Parse The Query Language Itself
Recognizing and interpreting operators (AND, OR, NOT, etc.) and field restrictors
3. Extend Search to Other Query Terms
This includes:
• Fuzzy Matching (spelling mistakes)
• Entity and Thematic Modeling (related words)
4. Relevance Ranking Improvements
Such as:
• boosting documents containing all of the terms close together (proximity weighting)
• boosting documents from trustworthy sources, demoting documents from unreliable sites
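Steps 2 and 3 can be illustrated with a deliberately small query evaluator. It is a sketch under several assumptions: `INDEX` is a hypothetical inverted index, operators are applied strictly left to right with no precedence or parentheses, field restrictors are not handled, and fuzzy matching is approximated with the standard library's `difflib.get_close_matches`.

```python
import difflib

# Hypothetical inverted index: term -> set of document ids.
INDEX = {
    "cats": {1, 2},
    "sunglasses": {2, 3},
    "dogs": {4},
}
ALL_DOCS = {1, 2, 3, 4}


def resolve(term):
    """Fuzzy matching: map a misspelled term to the closest indexed term."""
    term = term.lower()
    if term in INDEX:
        return term
    close = difflib.get_close_matches(term, list(INDEX), n=1, cutoff=0.8)
    return close[0] if close else term


def evaluate(query):
    """Evaluates 'a AND b', 'a OR b', 'a AND NOT b' strictly left to right."""
    result, pending_op, negate = None, "AND", False
    for token in query.split():
        if token in ("AND", "OR"):
            pending_op = token
        elif token == "NOT":
            negate = True
        else:
            docs = INDEX.get(resolve(token), set())
            if negate:
                docs = ALL_DOCS - docs
                negate = False
            if result is None:
                result = docs
            elif pending_op == "AND":
                result = result & docs
            else:
                result = result | docs
            pending_op = "AND"  # adjacent terms default to AND
    return result if result is not None else set()
```

For example, `evaluate("cats AND NOT sunglasses")` keeps only document 1, and the misspelled query `sunglases` still resolves to the indexed term `sunglasses`.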
28. Ranking
Query: Cats with sunglasses
Results (image captions):
• Just hanging out with my sunglasses on. Am I cool or what?
• Me with glasses just because… it makes me smart.
• What I see right here is Jim Belushi as a cat. Along with the Blues Brothers behind.
• You will never be as capable of rocking shades… quite as well as this feline friend.
29. Ranking
Relevance score = f(query, document) ∈ [0, 1]
Scores shown for the four results above, in order: 0.9, 0.7, 0.3, 0.1
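One simple way to obtain a relevance score f(query, document) in [0, 1] is the cosine similarity between term-count vectors. This is a swapped-in textbook technique chosen for illustration, not the scoring function behind the numbers on the slide, so it will not reproduce 0.9, 0.7, 0.3 and 0.1. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-cased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def relevance(query, document):
    """Cosine similarity of term-count vectors (within [0, 1] up to rounding)."""
    q, d = term_counts(query), term_counts(document)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rank(query, documents):
    """Orders documents by decreasing relevance score."""
    return sorted(documents, key=lambda doc: relevance(query, doc), reverse=True)


QUERY = "cats with sunglasses"
CAPTIONS = [
    "You will never be as capable of rocking shades",
    "Me with glasses just because it makes me smart",
    "Just hanging out with my sunglasses on",
]

ranked = rank(QUERY, CAPTIONS)
```

The caption that shares two words with the query comes first and the caption with no shared words comes last, which also shows the limit of pure term matching: "rocking shades" is about sunglasses but scores zero.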
30. Reranking
Query re-ranking lets you run a simple query (A) to match documents and then
re-order the top N documents using the scores from a more complex query (B)
33. Reranking
[Figure: four results in original rank order with scores 0.9, 0.7, 0.3, 0.1; the top N documents carry re-ranking scores 1.0, 0.9 and 0.5 and are re-ordered accordingly.]
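A minimal sketch of re-ranking, assuming the first pass returns (document id, score) pairs sorted by the simple query's score. The document ids and the way the re-ranking scores are assigned to documents are hypothetical; only the score values come from the slide.

```python
def rerank_top_n(results, rescore, n):
    """Re-order the first n results by a second score; keep the tail in its original order.

    results: list of (doc_id, score_a), sorted by score_a in descending order.
    rescore: function doc_id -> score_b (the more complex, more expensive query B).
    """
    head, tail = results[:n], results[n:]
    rescored = [(doc_id, rescore(doc_id)) for doc_id, _ in head]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored + tail


# Scores from the simple query A; document ids are hypothetical.
first_pass = [("d1", 0.9), ("d2", 0.7), ("d3", 0.3), ("d4", 0.1)]
# Hypothetical assignment of re-ranking scores to the top 3 documents.
query_b_scores = {"d1": 0.5, "d2": 1.0, "d3": 0.9}

reranked = rerank_top_n(first_pass, query_b_scores.get, n=3)
print(reranked)  # [('d2', 1.0), ('d3', 0.9), ('d1', 0.5), ('d4', 0.1)]
```

The expensive query B is evaluated only N times, regardless of how many documents query A matched.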
34. Boosting and Personalization
Boosting
Run a simple query (A), then modify the {query, document} relevance scores to
boost some content (for example, based on popularity or engagement)
38. Boosting and Personalization
Boosting
New relevance = original relevance + 𝛼 · page clicks + 𝛽 · total dwell time (minutes), with 𝛼 = 0.03, 𝛽 = 0.01
Original rank   Original relevance   Page clicks   Total dwell time (minutes)   New relevance   New rank
1               0.9                  2,000         500                          65.9            3
2               0.7                  5,000         400                          154.7           2
3               0.3                  6,000         100                          181.3           1
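The boosted scores can be reproduced with a few lines. This sketch assumes the additive form new relevance = original relevance + 𝛼 · clicks + 𝛽 · dwell time, which matches the numbers on the slide; the document ids are hypothetical.

```python
ALPHA = 0.03  # weight on page clicks (value from the slide)
BETA = 0.01   # weight on total dwell time in minutes (value from the slide)


def boosted_relevance(relevance, clicks, dwell_minutes, alpha=ALPHA, beta=BETA):
    """New relevance = original relevance + alpha * clicks + beta * dwell time."""
    return relevance + alpha * clicks + beta * dwell_minutes


# (document id, original relevance, page clicks, total dwell time in minutes)
# Document ids are hypothetical; the numbers are the ones on the slide.
docs = [
    ("doc1", 0.9, 2000, 500),
    ("doc2", 0.7, 5000, 400),
    ("doc3", 0.3, 6000, 100),
]

scored = [(doc_id, round(boosted_relevance(r, c, d), 1)) for doc_id, r, c, d in docs]
new_ranking = sorted(scored, key=lambda pair: pair[1], reverse=True)
print(new_ranking)  # [('doc3', 181.3), ('doc2', 154.7), ('doc1', 65.9)]
```

With these weights the click term dominates the original relevance, so the ranking is fully reversed. Normalizing the engagement signals before adding them is one way to keep the boost from overwhelming the text match.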
41. Learning-to-Rank (2)
[Diagram: The training data consists of queries q1, q2, …, qn. Query qi comes with documents x1(i), x2(i), …, xm(i)(i) and labels y(i). The learning system fits a model h on the training data. The ranking system applies h to test data (a query q with documents x1, x2, …, xm and unknown labels) and outputs the prediction h(x).]
44. Learning-to-Rank Algorithms
Pointwise
• Predict relevance on a document-by-document basis
• 3 types of supervised machine learning algorithms can be used:
• Regression-based algorithms
• Classification-based algorithms
• Ordinal regression
Pairwise
• Tell which document is better in a given pair of documents: it is a classification problem
• The goal is to minimize the average number of inversions in the ranking
Listwise
• Directly optimize one of the ranking evaluation measures
45. Pointwise Approach
• Predict the exact relevance degree of each document
• Assumes that each {query, document} pair has a numerical or ordinal score
• Input space contains the feature vector of every single document
• Can be approximated by a regression problem
• Ordinal regression:
• The {query, document} relevance score can only take a small, finite number of values
46. Pointwise Approach: Summary
                   Regression         Classification           Ordinal Regression
Input Space        Single documents x_j (all three)
Output Space       Real values        Non-ordered categories   Ordinal categories
Hypothesis Space   Scoring function f(x) (all three)
Loss Function      Regression loss    Classification loss      Ordinal regression loss
                   each of the form L(f; x_j, y_j), where y_j is the label of document x_j
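As a concrete instance of a pointwise loss L(f; x_j, y_j), here is a minimal sketch of the regression variant with a squared loss. The linear scoring function, the feature vectors and the labels are assumptions made for illustration.

```python
def linear_score(weights, features):
    """Scoring function f(x): dot product of the weights and a document's feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def pointwise_squared_loss(weights, documents):
    """Mean of (f(x_j) - y_j)^2 over single documents, ignoring which query they belong to."""
    errors = [(linear_score(weights, x) - y) ** 2 for x, y in documents]
    return sum(errors) / len(errors)


# Hypothetical training data: (feature vector x_j, relevance label y_j).
documents = [([1.0, 0.0], 2.0), ([0.5, 0.5], 1.0), ([0.0, 1.0], 0.0)]

print(pointwise_squared_loss([2.0, 0.0], documents))  # 0.0: every label is predicted exactly
print(pointwise_squared_loss([0.0, 0.0], documents))  # larger: all scores are 0
```

Each document contributes to the loss on its own, so the loss says nothing directly about the order of the documents within a query.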
47. Pairwise Algorithms
• Focus on the relative order between 2 documents instead of predicting relevance
• Learn a binary classifier to tell which document is better in a pair of documents
• Goal: minimize the average number of inversions in the ranking
• Pairwise preference is used as the ground truth
• Limitations:
• Does not differentiate inversions at top vs. bottom positions
• Examples:
• RankNet
48. Pairwise Algorithms: Summary
Input Space        Document pairs $(x_u, x_v)$
Output Space       Preference $y_{u,v} \in \{+1, -1\}$
Hypothesis Space   Preference function $h(x_u, x_v) = 2 \cdot I_{\{f(x_u) > f(x_v)\}} - 1$
Loss Function      Pairwise classification loss $L(h; x_u, x_v, y_{u,v})$
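The preference function and the inversion count can be written out directly. This is a sketch on toy data: the one-feature documents, the graded labels and the scoring function `f` are hypothetical, and it counts inversions rather than training a model such as RankNet.

```python
from itertools import combinations


def preference(f, x_u, x_v):
    """h(x_u, x_v) = 2 * I{f(x_u) > f(x_v)} - 1: +1 if x_u is scored higher, else -1."""
    return 2 * int(f(x_u) > f(x_v)) - 1


def count_inversions(f, docs, labels):
    """Count document pairs whose predicted preference disagrees with the ground truth."""
    inversions = 0
    for u, v in combinations(range(len(docs)), 2):
        if labels[u] == labels[v]:
            continue  # no ground-truth preference for ties
        truth = 1 if labels[u] > labels[v] else -1
        if preference(f, docs[u], docs[v]) != truth:
            inversions += 1
    return inversions


# Hypothetical one-feature documents and graded relevance labels.
docs = [0.2, 0.9, 0.5]
labels = [2, 1, 0]


def f(x):
    """Toy scoring function: the score equals the feature value."""
    return x


print(count_inversions(f, docs, labels))  # 2
```

Both inversions involve the most relevant document, yet they count exactly as much as an inversion at the bottom of the list would, which is the limitation noted for pairwise methods.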
49. Listwise Algorithms
• Pick an evaluation measure and optimize its value, averaged over all queries
• Challenges:
• Continuous approximations of the measures are used because most measures are not continuous functions
• 2 types of approaches:
• Direct Optimization of IR Evaluation Measures
• Minimization of Listwise Ranking Losses
50. Listwise Algorithms: Summary
                   Listwise Loss Minimization                          Direct Optimization of IR Measure
Input Space        Document set $\mathbf{x} = \{x_j\}_{j=1}^{m}$ (both approaches)
Output Space       Permutation $\pi_y$                                 Ordered categories $\mathbf{y} = \{y_j\}_{j=1}^{m}$
Hypothesis Space   $h(\mathbf{x}) = \mathrm{sort} \circ f(\mathbf{x})$   $h(\mathbf{x}) = f(\mathbf{x})$
Loss Function      Listwise loss $L(h; \mathbf{x}, \pi_y)$             1 − surrogate measure $L(h; \mathbf{x}, \mathbf{y})$
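One well-known listwise ranking loss is the ListNet-style cross entropy between "top-one" probability distributions. The slides do not name it, so treat the sketch below as one possible example of a listwise loss, with hypothetical labels and scores.

```python
import math


def softmax(scores):
    """Turn a list of scores into a probability distribution (top-one probabilities)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_cross_entropy(predicted_scores, true_scores):
    """Cross entropy between the top-one distributions of the labels and the predictions."""
    p_true = softmax(true_scores)
    p_pred = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


# Hypothetical graded labels for the documents of one query.
labels = [2.0, 1.0, 0.0]
good = listwise_cross_entropy([3.0, 2.0, 1.0], labels)  # same order as the labels
bad = listwise_cross_entropy([1.0, 2.0, 3.0], labels)   # reversed order
print(good < bad)  # True
```

The loss is computed over the whole document list of a query at once and is a smooth function of the scores, which is what makes gradient-based training possible.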
51. Summary
3 input ligands: A, B, C
Different methods:
• Pointwise: Score(A), Score(B), Score(C)
• Pairwise: f(A) > f(B), f(B) > f(C), f(A) > f(C)
• Listwise: P(A,B,C), P(B,A,C), P(B,C,A)
Output
Ranking = CBA
52. Graph-Based Algorithms
Example: the PageRank Algorithm
• Link analysis algorithm
• Algorithm invented by Larry Page (Google co-founder)
• Score goes from 0 to 10
• Other Alternatives:
• Page Authority
• HostRank
• Voting Algorithms
• …
[Figure: link graph with nodes labeled A, B and C]
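The link-analysis idea behind PageRank can be sketched with a short power iteration. The three-page graph, the damping factor of 0.85 and the iteration count are assumptions for illustration. The result is a probability distribution that sums to 1, so mapping it onto a 0 to 10 scale would be a separate step that is not shown here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a link graph given as {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page shares its rank equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page graph (every link target must also be a key).
links = {"A": ["B"], "B": ["C"], "C": ["B"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))  # ['B', 'C', 'A']
```

Page B ranks first because it receives links from both A and C, while A has no incoming links and keeps only the baseline share.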
53. Features
Example features (TREC)
Rank   Feature                  Rank   Feature
1      TF of body               …      …
2      TF of anchor             51     PageRank
3      TF of title              52     HostRank
4      TF of URL                53     Topical PageRank
5      TF of whole document     54     Topical HITS authority
6      IDF of body              55     Topical HITS hub
7      IDF of anchor            56     Inlink number
8      IDF of title             57     Outlink number
9      IDF of URL               58     Number of slashes in URL
10     IDF of whole document    59     Length of URL
Feature groups: IR/NLP features; Linkage; Engagement
TF: term frequency
IDF: inverse document frequency
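TF and IDF are simple to compute. The sketch below uses the raw count for TF and log(N / df) for IDF, which is one common variant among several; the tiny tokenized corpus is hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, tokens):
    """TF: number of times the term occurs in one field or document."""
    return Counter(tokens)[term]


def inverse_document_frequency(term, corpus):
    """IDF: log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0


# Hypothetical corpus of already tokenized documents.
corpus = [
    ["cats", "with", "sunglasses"],
    ["cats", "and", "dogs"],
    ["dogs", "with", "hats"],
    ["cats", "cats", "cats"],
]
tf = term_frequency("cats", corpus[3])
idf = inverse_document_frequency("cats", corpus)
print(tf, round(idf, 3))  # 3 0.288
```

A term that appears in most documents, like "cats" here, gets a low IDF, so its high TF in a single document counts for less.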
54. Conventional Ranking Models
Query-dependent
• Boolean model, extended Boolean model, etc.
• Vector space model, latent semantic indexing (LSI), etc.
• BM25 model, statistical language model, etc.
Query-independent
• PageRank, TrustRank, BrowseRank, etc.
Problems with Conventional Models
• Manual parameter tuning difficult
• Too many parameters
• Evaluation measures not smooth
• Sometimes leads to overfitting
• Ensemble approach (combining models into a more effective one) not trivial
56. Search Engine Evaluation: Index
Corpus Size
• Number of pages indexed
Search engine overlap
• Fraction of pages indexed by engine A also indexed by engine B
Freshness
• Age of the pages in the index
Spam resilience
• Fraction of pages in index that are spam
Duplicates
• Number of unique pages in index
57. Search Engine Evaluation: Relevance Judgment
Types of judgments classified similarly to Ranking Algorithms
1. Degree of Relevance
• Binary: relevant vs. irrelevant
• Multiple ordered categories:
Perfect > Excellent > Good > Fair > Bad
2. Pairwise Preference
• Document A is more relevant than document B
3. Total Order
• Documents are ranked as {A,B,C,..} according to their relevance
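58. Evaluation Measure – MAP & NDCG
Precision at position k for query q:
$P@k = \frac{\#\{\text{relevant docs in top } k \text{ results}\}}{k}$
Average precision for query q:
$AP = \frac{\sum_{k} P@k \cdot l_k}{\#\{\text{relevant documents}\}}$
where $l_k = 1$ if the document at position k is relevant and 0 otherwise
NDCG at position k for query q:
$NDCG@k = Z_k \sum_{j=1}^{k} G(\pi^{-1}(j)) \, \eta(j)$
where $Z_k$ is the normalization, the sum is the cumulation, $G$ is the gain of the document at position j, and $\eta(j)$ is the position discount
MAP & NDCG: averaged over all queries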
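These measures can be computed in a few lines. The sketch makes two assumptions that the slide leaves open: the gain is G = 2^label − 1 and the position discount is η(j) = 1 / log2(j + 1), both common choices. It also assumes that every relevant document appears in the ranked list when computing AP.

```python
import math


def precision_at_k(relevance, k):
    """P@k: fraction of the top k results that are relevant (relevance is a 0/1 list)."""
    return sum(relevance[:k]) / k


def average_precision(relevance):
    """AP: sum of P@k over positions k holding a relevant document, divided by #relevant."""
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    total = sum(
        precision_at_k(relevance, k)
        for k, rel in enumerate(relevance, start=1)
        if rel
    )
    return total / num_relevant


def dcg_at_k(labels, k):
    """Sum over j = 1..k of gain * discount, with gain 2^label - 1 and discount 1 / log2(j + 1)."""
    return sum((2 ** g - 1) / math.log2(j + 1) for j, g in enumerate(labels[:k], start=1))


def ndcg_at_k(labels, k):
    """NDCG@k = Z_k * DCG@k, where 1 / Z_k is the DCG@k of the ideal ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


# Hypothetical judgments, listed in ranked order.
print(round(average_precision([1, 0, 1, 0]), 3))  # 0.833
print(ndcg_at_k([2, 1, 0], 3))                    # 1.0 for an ideally ordered list
```

MAP and the reported NDCG are then the means of these per-query values over all test queries.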
59. Evaluation Measure - Summary
Query-level: every query contributes equally to the measure
• Computed on documents associated with the same query
• Bounded for each query
• Averaged over all test queries
Position-based: rank position is explicitly used (weighting)
• Top-ranked objects more important
• Relative order vs. relevance score of each document
• Rank is a non-continuous, non-differentiable function of the scores
60. Part V: The Challenges
and the Future of Search
61. The Challenges of Enterprise Search
• Near duplicates and versioning
• More recently, “quoting” between websites
• Metadata and file formats
• Search across multiple sources
• How to merge several indexes?
• Challenges with latency?
• Security, Privacy, Regulations
62. Future Research
• User Logs as Ground Truth
• A gold mine that has not been leveraged so far
• Implicit feedback
• Click-through rates, etc.
• Feature Engineering
• New Directions of Research
• Semi-supervised Ranking
• Transfer Ranking
63. Conclusions
• While 20+ years old, Search is still hard
• But there are off-the-shelf solutions…
• A problem where ML can help (learning-to-rank space)
• Most promising algorithms use a listwise approach
• Very dynamic area of research
• But doing Search well requires more than Learning-to-Rank:
• Query Parsing, Topic modeling, etc.
• It is getting harder with ever more types of documents