The document discusses various techniques for summarizing search results and detecting duplicate web pages. It describes static and dynamic summaries, where static summaries are always the same regardless of the query and dynamic summaries are query-dependent. It also covers different methods for generating static and dynamic summaries, as well as challenges in producing good dynamic summaries. The document then discusses various spam techniques used by search engine optimizers and the arms race between SEOs and search engines. It concludes by outlining approaches for detecting near-duplicate and mirrored web pages.
3. Summaries
Having ranked the documents matching a query, we wish to present a results list.
Most commonly, the document title plus a short summary.
The title is typically automatically extracted from document metadata.
What about the summaries?
4. Summaries
Two basic kinds: static and dynamic.
A static summary of a document is always the same, regardless of the query that hit the doc.
Dynamic summaries are query-dependent: they attempt to explain why the document was retrieved for the query at hand.
5. Static summaries
In typical systems, the static summary is a subset of the document.
Simplest heuristic: the first 50 or so words of the document (the count can be varied); the summary is cached at indexing time.
More sophisticated: extract from each document a set of "key" sentences; simple NLP heuristics score each sentence, and the summary is made up of the top-scoring sentences.
Most sophisticated: NLP is used to synthesize a summary; seldom used in IR (cf. text summarization).
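To make the first two heuristics concrete, here is a minimal Python sketch (not from the original deck; the scoring function, which favors early, reasonably long sentences, is an illustrative assumption, not a prescribed method):

```python
import re

def first_k_words(text: str, k: int = 50) -> str:
    """Simplest static summary: the first k words, cached at indexing time."""
    return " ".join(text.split()[:k])

def key_sentence_summary(text: str, n: int = 3) -> str:
    """Extractive summary: score sentences with simple heuristics,
    keep the top-scoring ones in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def score(i: int, s: str) -> float:
        position_bonus = 1.0 / (1 + i)                 # earlier sentences matter more
        length_bonus = min(len(s.split()), 25) / 25.0  # avoid tiny fragments
        return position_bonus + length_bonus

    ranked = sorted(range(len(sentences)), key=lambda i: -score(i, sentences[i]))
    keep = sorted(ranked[:n])                          # restore document order
    return " ".join(sentences[i] for i in keep)
```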
6. Dynamic summaries
Present one or more "windows" within the document that contain several of the query terms.
"KWIC" snippets: KeyWord-In-Context presentation, generated in conjunction with scoring.
If the query is found as a phrase, show the/some occurrences of the phrase in the doc; if not, show windows within the doc that contain multiple query terms.
The summary itself gives the entire content of the window: all terms, not only the query's.
7. Generating dynamic summaries
If we have only a positional index, we cannot (easily) reconstruct the context surrounding hits.
If we cache the documents at index time, we can run the window through the cached copy, cueing to hits found in the positional index.
E.g., the positional index says "the query is a phrase at position 4378", so we go to this position in the cached document and stream out the content.
Most often, a fixed-size prefix of the doc is cached.
Note: the cached copy can be outdated.
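A minimal sketch of snippet generation from a cached prefix; here the hit positions come from tokenizing the cached text directly rather than from a real positional index, and the function, window size, and parameter names are illustrative:

```python
def kwic_snippets(cached_text: str, query_terms: set[str],
                  window: int = 10, max_snippets: int = 2) -> list[str]:
    """Cue into the cached document at query-term hits and stream out
    the surrounding window -- all terms in the window, not only the query's.
    Assumes query_terms are lowercased."""
    tokens = cached_text.split()
    snippets = []
    for i, tok in enumerate(tokens):
        if tok.lower().strip(".,;:!?") in query_terms:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            snippets.append("... " + " ".join(tokens[lo:hi]) + " ...")
            if len(snippets) == max_snippets:
                break
    return snippets

# e.g. kwic_snippets(cached_prefix, {"maui", "resort"})
```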
8. Dynamic summaries
Producing good dynamic summaries is a tricky optimization problem.
The real estate for the summary is normally small and fixed.
We want a short item, so as to show as many KWIC matches as possible, and perhaps other things like the title.
We want snippets long enough to be useful.
We want linguistically well-formed snippets: users prefer snippets that contain complete phrases.
We want snippets maximally informative about the doc.
But users really like snippets, even if they complicate IR system design.
10. Adversarial IR (Spam)
Motives: commercial, political, religious, lobbies; promotion funded by an advertising budget.
Operators: contractors (Search Engine Optimizers) for lobbies and companies; web masters; hosting services.
Forum: Webmaster World (www.webmasterworld.com), with search-engine-specific tricks and discussions about academic papers.
11. Search Engine Optimization II
Adversarial IR ("search engine wars")
12. Can you trust words on the page?
[Screenshot examples from July 2002: pornographic content served at auctions.hitsoffice.com/ alongside www.ebay.com/]
13. Simplest forms
Early engines relied on the density of terms: the top-ranked pages for the query maui resort were the ones containing the most maui's and resort's.
SEOs responded with dense repetitions of chosen terms, e.g., maui resort maui resort maui resort.
Often, the repetitions would be in the same color as the background of the web page: the repeated terms got indexed by crawlers but were not visible to humans in browsers.
So we can't trust the words on a web page for ranking.
14. A few spam technologies
Cloaking: serve fake content to the search engine robot; DNS cloaking: switch IP address, impersonate.
Doorway pages: pages optimized for a single keyword that redirect to the real target page.
Keyword spam: misleading meta-keywords, excessive repetition of a term, fake "anchor text", hidden text via colors, CSS tricks, etc.
Link spamming: mutual admiration societies, hidden links, awards; domain flooding: numerous domains that point or redirect to a target page.
Robots: fake click stream, fake query stream, millions of submissions via Add-URL.
15. More spam techniques
Cloaking: serve fake content to the search engine spider; DNS cloaking: switch IP address, impersonate.
[Diagram: is the requester a search engine spider? If yes, serve SPAM; if no, serve the real doc. That is cloaking.]
16. Tutorial on Cloaking & Stealth Technology
17. Variants of keyword stuffing
Misleading meta-tags, excessive repetition.
Hidden text via colors, style-sheet tricks, etc.
Meta-tags = "… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …"
18. More spam techniques
Doorway pages: pages optimized for a single keyword that redirect to the real target page.
Link spamming: mutual admiration societies, hidden links, awards (more on these later); domain flooding: numerous domains that point or redirect to a target page.
Robots: fake query stream from rank-checking programs, which "curve-fit" the ranking functions of search engines; millions of submissions via Add-URL.
19. The war against spam
Quality signals: prefer authoritative pages based on votes from authors (linkage signals) and votes from users (usage signals).
Policing of URL submissions: anti-robot tests.
Limits on meta-keywords.
Robust link analysis: ignore statistically implausible linkage (or text); use link analysis to detect spammers (guilt by association).
20. The war against spam
Spam recognition by machine learning: training sets based on known spam.
Family-friendly filters: linguistic analysis, general classification techniques, etc.; for images, flesh-tone detectors, source-text analysis, etc.
Editorial intervention: blacklists; top queries audited; complaints addressed.
21. Acid test
Which SEOs rank highly on the query seo?
Web search engines have policies on the SEO practices they tolerate or block; see the pointers in Resources.
Adversarial IR: the unending (technical) battle between SEOs and web search engines.
See for instance http://airweb.cse.lehigh.edu/
23. Duplicate/Near-Duplicate Detection
Duplication: exact match, detected with fingerprints.
Near-duplication: approximate match.
Overview: compute syntactic similarity with an edit-distance measure, and use a similarity threshold to detect near-duplicates; e.g., similarity > 80% => documents are "near duplicates".
Note this is not transitive, though it is sometimes used transitively.
24. Computing Similarity
Segments of a document (natural or artificial breakpoints) [Brin95].
Shingles (word k-grams) [Brin95, Brod98]: "a rose is a rose is a rose" => { a_rose_is_a, rose_is_a_rose, is_a_rose_is }.
Similarity measure between two docs (= sets of shingles): set intersection [Brod98], specifically Size_of_Intersection / Size_of_Union, i.e., the Jaccard measure.
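A minimal Python sketch of shingling and the Jaccard measure (the function names and the choice of k are illustrative assumptions):

```python
def shingles(text: str, k: int = 4) -> set[str]:
    """Word k-grams ("shingles") of a document."""
    words = text.lower().split()
    return {"_".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(s1: set[str], s2: set[str]) -> float:
    """Size_of_Intersection / Size_of_Union."""
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

d1 = shingles("a rose is a rose is a rose")
# d1 == {"a_rose_is_a", "rose_is_a_rose", "is_a_rose_is"}
```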
25. Shingles + Set Intersection
Computing exact set intersection of shingles between all pairs of documents is expensive.
Approximate it using a cleverly chosen subset of shingles from each document (a sketch), and estimate the Jaccard measure from the short sketches.
Create a "sketch vector" (e.g., of size 200) for each document; documents which share more than t (say 80%) of corresponding vector elements are deemed similar.
For doc d, sketch_d[i] is computed as follows: let f map all shingles in the universe to 0..2^m, and let π_i be a specific random permutation on 0..2^m; then sketch_d[i] = min over all shingles s in d of π_i(f(s)).
26. Shingling with sampling minima
Given two documents A1, A2, let S1 and S2 be their shingle sets.
Resemblance = |S1 ∩ S2| / |S1 ∪ S2|.
Let α = min(π(S1)) and β = min(π(S2)).
Then P(α = β) = Resemblance.
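A sketch of the sketch-vector computation, with the caveat that the true random permutations π_i are simulated here by salted hash functions, a common simplification rather than the exact construction above (a real system would also use a stable 64-bit hash instead of Python's per-process hash):

```python
import random

def sketch(shingle_set: set[str], num_perms: int = 200, seed: int = 0) -> list[int]:
    """Min-hash sketch: for each simulated permutation pi_i (a salted hash,
    standing in for a true random permutation on 0..2^m), keep the minimum
    value over all shingles in the document."""
    salts = [random.Random(seed + i).getrandbits(64) for i in range(num_perms)]
    return [min(hash((salt, s)) for s in shingle_set) for salt in salts]

def estimate_resemblance(sk1: list[int], sk2: list[int]) -> float:
    """The fraction of agreeing sketch positions estimates the resemblance."""
    return sum(a == b for a, b in zip(sk1, sk2)) / len(sk1)
```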
27. Computing Sketch[i] for Doc1
[Diagram over the number line 0..2^64: start with 64-bit shingles, permute the number line with π_i, and pick the minimum value.]
28. Test if Doc1.Sketch[i] = Doc2.Sketch[i]
[Diagram: Document 1 and Document 2 permuted on the 0..2^64 number line with the same π_i, yielding minima A and B. Are these equal?]
Test for 200 random permutations: π_1, π_2, …, π_200.
29. However…
[Diagram: the same two permuted documents, with minima A and B.]
A = B iff the shingle with the minimum value in the union of Doc1 and Doc2 is common to both (i.e., lies in the intersection).
This happens with probability Size_of_Intersection / Size_of_Union. Why?
30. Set Similarity
Set similarity (Jaccard measure): simJ(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|.
View the sets as columns of a matrix, with one row per element in the universe; a_ij = 1 indicates the presence of item i in set j.
Example:
C1 C2
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1
simJ(C1, C2) = 2/5 = 0.4
31. Key Observation
For columns Ci, Cj, there are four types of rows:
    Ci Cj
A:   1  1
B:   1  0
C:   0  1
D:   0  0
Overloading notation, let A also denote the number of rows of type A (similarly B, C, D).
Claim: simJ(Ci, Cj) = A / (A + B + C)
32. Min Hashing
Randomly permute the rows.
Define h(Ci) = index of the first row with a 1 in column Ci.
Surprising property: P[ h(Ci) = h(Cj) ] = simJ(Ci, Cj).
Why? Both equal A/(A+B+C): look down columns Ci, Cj until the first non-type-D row; h(Ci) = h(Cj) iff that row is of type A.
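A quick empirical check of this property on the example matrix from slide 30 (a sketch, not part of the original deck):

```python
import random

# Columns C1, C2 from the slide-30 example; simJ(C1, C2) = 2/5 = 0.4.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

def h(col: list[int], perm: list[int]) -> int:
    """Index, under the permutation, of the first row with a 1 in this column."""
    return min(perm[i] for i, v in enumerate(col) if v)

rng = random.Random(42)
trials, hits = 10_000, 0
for _ in range(trials):
    perm = list(range(len(C1)))
    rng.shuffle(perm)
    hits += h(C1, perm) == h(C2, perm)

print(hits / trials)  # ~0.4, matching simJ(C1, C2)
```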
33. Mirror Detection
Mirroring is the systematic replication of web pages across hosts; it is the single largest cause of duplication on the web.
Host1/α and Host2/β are mirrors iff for all (or most) paths p, whenever http://Host1/α/p exists, http://Host2/β/p exists as well with identical (or near-identical) content, and vice versa.
34. Mirror Detection example
http://www.elsevier.com/ and http://www.elsevier.nl/
Structural Classification of Proteins:
http://scop.mrc-lmb.cam.ac.uk/scop
http://scop.berkeley.edu/
http://scop.wehi.edu.au/scop
http://pdb.weizmann.ac.il/scop
http://scop.protres.ru/
36. Motivation
Why detect mirrors?
Smart crawling: fetch from the fastest or freshest server; avoid duplication.
Better connectivity analysis: combine inlinks; avoid double-counting outlinks.
Redundancy in result listings: "if that fails you can try: <mirror>/samepath".
Proxy caching.
37. Bottom-Up Mirror Detection [Cho00]
Maintain clusters of subgraphs.
Initialize clusters from trivial subgraphs: group near-duplicate single documents into a cluster.
In subsequent passes, merge clusters of the same cardinality and corresponding linkage, avoiding any decrease in cluster cardinality.
To detect mirrors we need adequate path overlap, with the contents of corresponding pages fetched within a small time range.
38. Can we use URLs to find mirrors?
[Diagram: matching directory trees (a, b, c, d) on www.synthesis.org and synthesis.stanford.edu, followed by side-by-side listings of sample URLs from the two hosts. The hosts mirror much of the same content, but the sorted URL listings do not line up pairwise, e.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html sits opposite synthesis.stanford.edu/Docs/ProjAbs/deliv/high-tech-…, so naive URL comparison fails.]
39. Top-Down Mirror Detection [Bhar99, Bhar00c]
E.g., www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html and synthesis.stanford.edu/Docs/ProjAbs/synsys/quant-dev-new-teach.html.
What features could indicate mirroring?
Hostname similarity: word unigrams and bigrams, e.g., { www, www.synthesis, synthesis, … }
Directory similarity: positional path bigrams, e.g., { 0:Docs/ProjAbs, 1:ProjAbs/synsys, … }
IP address similarity: 3- or 4-octet overlap (though many hosts sharing an IP address => virtual hosting by an ISP)
Host outlink overlap
Path overlap; potentially, path + sketch overlap
A sketch of the first two features appears below.
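A minimal Python sketch of the hostname and directory features (the function names are illustrative; only the standard library is used):

```python
from urllib.parse import urlparse

def hostname_features(host: str) -> set[str]:
    """Word unigrams and bigrams of the hostname."""
    parts = host.split(".")
    unigrams = set(parts)
    bigrams = {".".join(parts[i:i + 2]) for i in range(len(parts) - 1)}
    return unigrams | bigrams

def path_features(url: str) -> set[str]:
    """Positional path bigrams, e.g. '0:Docs/ProjAbs'."""
    segs = [s for s in urlparse(url).path.split("/") if s]
    return {f"{i}:{segs[i]}/{segs[i + 1]}" for i in range(len(segs) - 1)}

u = "http://www.synthesis.org/Docs/ProjAbs/synsys/synalysis.html"
# hostname_features("www.synthesis.org") includes {"www", "www.synthesis", "synthesis"}
# path_features(u) includes {"0:Docs/ProjAbs", "1:ProjAbs/synsys"}
```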
40. Implementation
Phase I, candidate pair detection: find features that pairs of hosts have in common, and compute a list of host pairs which might be mirrors.
Phase II, host pair validation: test each host pair and determine the extent of mirroring. Check whether 20 paths sampled from Host1 have near-duplicates on Host2, and vice versa, and use transitive inferences:
IF Mirror(A,x) AND Mirror(x,B) THEN Mirror(A,B)
IF Mirror(A,x) AND !Mirror(x,B) THEN !Mirror(A,B)
Evaluation: 140 million URLs on 230,000 hosts (1999). The best approach combined 5 sets of features; the top 100,000 host pairs had precision = 0.57 and recall = 0.86.
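A schematic of the Phase II check in Python; sample_paths, fetch, and near_duplicate are hypothetical stand-ins for a crawler and a sketch-based near-duplicate test, not functions from the cited papers:

```python
def validate_mirror_pair(host1: str, host2: str,
                         sample_paths, fetch, near_duplicate,
                         n: int = 20) -> bool:
    """Phase II sketch: check whether n paths sampled from each host have
    near-duplicates at the same path on the other host, and vice versa."""
    for src, dst in ((host1, host2), (host2, host1)):
        for path in sample_paths(src, n):
            a, b = fetch(src, path), fetch(dst, path)
            if b is None or not near_duplicate(a, b):
                return False
    return True
```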
41. WebIR Infrastructure
Connectivity Server: fast access to links, to support link analysis.
Term Vector Database: fast access to document vectors, to augment link analysis.
42. Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
Fast web-graph access to support connectivity analysis.
Stores mappings in memory from URL to outlinks and URL to inlinks.
Applications: HITS and PageRank computations; crawl simulation; graph algorithms (web connectivity, diameter, etc.; more on this later); visualizations.
43. Usage
[Diagram: a graph algorithm runs in memory; input URLs and values are translated to IDs (via fingerprints), and output IDs are translated back to URLs and values, using translation tables kept on disk.]
Translation tables on disk:
URL text: 9 bytes/URL (compressed from ~80 bytes)
FP(64b) -> ID(32b): 5 bytes
ID(32b) -> FP(64b): 8 bytes
ID(32b) -> URL: 0.5 bytes
44. ID assignment
Partition URLs into 3 sets, sorted lexicographically:
High: max(indegree, outdegree) > 254
Medium: 254 >= max degree > 24
Low: the remaining 75%
IDs are assigned in sequence (densely). E.g., high IDs:
ID        URL
9891      www.amazon.com/
9912      www.amazon.com/jobs/
9821878   www.geocities.com/
40930030  www.google.com/
85903590  www.yahoo.com/
Adjacency lists: in-memory tables for outlinks and inlinks; a list index maps from a source ID to the start of its adjacency list.
45. Adjacency List Compression - I
[Diagram: a list index pointing into per-source adjacency lists, shown alongside the same lists delta-encoded; e.g., source 104's list (98, 132, 153) becomes the delta sequence (-6, 34, 21).]
Adjacency lists: smaller delta values are exponentially more frequent (80% of links go to the same host), so compress the deltas with a variable-length encoding (e.g., Huffman).
List index pointers: 32b for high, base+16b for medium, base+8b for low; average 12b per pointer.
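The slide suggests Huffman coding for the deltas; as an illustrative alternative exploiting the same skew toward small deltas, here is a sketch of delta encoding with LEB128-style varints and zigzag mapping for signed values:

```python
def zigzag(n: int) -> int:
    """Interleave signed ints so small magnitudes get small codes (-6 -> 11)."""
    return (n << 1) ^ (n >> 63)

def varint(n: int) -> bytes:
    """LEB128-style variable-length encoding: 7 payload bits per byte."""
    out = bytearray()
    while True:
        b, n = n & 0x7F, n >> 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_adjacency(src: int, neighbors: list[int]) -> bytes:
    """Delta-encode a sorted adjacency list relative to the source ID."""
    out, prev = bytearray(), src
    for nid in sorted(neighbors):
        out += varint(zigzag(nid - prev))  # e.g. 104 -> (98, 132, 153) gives deltas (-6, 34, 21)
        prev = nid
    return bytes(out)
```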
46. Adjacency List Compression - II
Inter-list compression. Basis: similar URLs may share links, so URLs close in ID space may have overlapping adjacency lists.
Approach: define a representative adjacency list for a block of IDs, either the adjacency list of a reference ID or the union of the adjacency lists in the block, and represent each list in terms of deletions and additions whenever that is cheaper.
Measurements: intra-list + starts: 8-11 bits per link (580M pages / 16GB RAM); inter-list: 5.4-5.7 bits per link (870M pages / 16GB RAM).
47. Term Vector Database [Stat00]
Fast access to 50-word term vectors for web pages.
Term selection: restricted to the middle third of the lexicon by document frequency; the top 50 words in the document by TF.IDF.
Term weighting: deferred till run-time (can be based on term frequency, document frequency, document length).
Applications: content + connectivity analysis (e.g., topic distillation); topic-specific crawls; document classification.
Performance: storage, 33GB for 272M term vectors; speed, 17 ms/vector on an AlphaServer 4100 (the latency to read a disk block).
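A minimal sketch of the term-selection step; the data structures df and lexicon_by_df and the use of raw term frequencies as the stored weights are illustrative assumptions (actual weighting is deferred to run-time, as the slide notes):

```python
import math
from collections import Counter

def term_vector(doc_tokens: list[str], df: dict[str, int], num_docs: int,
                lexicon_by_df: list[str], top_k: int = 50) -> dict[str, int]:
    """Keep the top_k terms of a document by TF.IDF, restricted to the
    middle third of the lexicon ordered by document frequency; raw term
    frequencies are stored, to be weighted at run-time."""
    third = len(lexicon_by_df) // 3
    middle = set(lexicon_by_df[third:2 * third])
    tf = Counter(t for t in doc_tokens if t in middle)

    def tfidf(term: str) -> float:
        return tf[term] * math.log(num_docs / (1 + df.get(term, 0)))

    selected = sorted(tf, key=tfidf, reverse=True)[:top_k]
    return {t: tf[t] for t in selected}
```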
48. Architecture
[Diagram: layout of the URLid-to-term-vector lookup: a URLid is mapped through an offset table (a 4-byte base per group, with one 128-bit vector per 480 URLids) to its TV record, which holds the term entries (LC:TID) followed by their frequency entries (FRQ:RL).]
Editor's Notes
Arms race.
Small biotech firm; query example from last time; Infoseek example.
Talk about expert witness; George W. Bush example.
More complex problem of finding the "original" site.