SlideShare a Scribd company logo
1 of 31
Grouping Search-Engine Returned Citations forPerson-Name Queries Reema Al-Kamha, David W. Embley, WIDM’04 Brigham Young University Advisor: Chia-Hui Chang Student: Kuan-Hua Huo Date: 2010-6-1 1
Outline Instruction Related work A multi-faseted approach Experimental results Concluding remarks 2
Instruction Citations: the returned results that are related to a specific query in a search engine. Citation contains the title of the web page found, text below the title that includes the keywords of the query, and the URL of the web page found.  Using Google the query ”Kelly Flanagan” 3
Instruction   cont. In the output retain the basic search-engine returned citations within each group maintain the search engine ranking order among groups maintain the relative order of citations as originally presented by the search engine 4
Related work A cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source.   we would need to find a context, which is not straightforward for the mixture of structured, semistructured, and unstructured documents on the web. But we found that neither produced satisfactory results. 5
Related work   cont. Object identification refers to the task of deciding that two observed objects are in fact one and the same object.  compare an object’s shared attributes in order to identify matching objects  our technique involves links and page similarity in addition to attributes 6
Related work   cont. One technique that that is typically used to  resolve objec identity is probabilistic modeling. Probabilistic modeling  compares objects based on shared attributes uses appearance probability to determine the similarity between objects. 7
A multi-faceted approach Each facet represents a relevant aspect of the problem space about which we can gather evidence that two citations reference the same person or different persons. attributes about a person links within and among sites page similarity as facets 8
A multi-faceted approach   cont. Attributes  Links Page similarity Confidence Matrix Constuction Final Confidence Matrix Grouping Algorithm 9
Attributes Attributes appear in web pages of citations phone number email address State (from a list that contains all state names and their abbreviations) city zip code 10
Attributes   cont. We only extract state, city, and zip code values in an address context. To extract a city we extract all strings that match the regular expressions 	([A-Z]( w)+( )?) { 1,3 }  	and satisfy the context specification consisting of this string followed by an optional comma, white space, and a state name. 11
Links If two URLs share a common host, we can be reasonably confident that they refer to the same person. 12
Links   cont.  It is common to have two different persons that have the same name in two citations that have a popular host like www.yahoo.com. To find the number of pages that point to a host Query link:siteURL in Google shows all pages and gives a count of  the  number  of  pages  that  point  to  that  URL 13
Links   cont.  We determined empirically that a host h is popular for person-name queries if more than 400 pages point to h. If two citations c 1 and c 2 that are results of a person name query share the same non-popular host, or  if the URL of one citation c 1 has the same non-popular host as one of the URLs that belongs to the web page referenced by the other citation c 2,  then we can be confident about grouping c 1 and c 2 together for the same person. 14
Page similarity The similarity between web pages referenced by the two returned citations If two web pages refer to the same person, there are specific words associated with that person. David  Embley,  who  is  a  professor  and  a  co-director  of  the  Data  ExtractionResearch Group  in  the  Computer  Science  Departmentat  Brigham Young University. Sandra Rogers contain Lessons from the Light, a book she wrote. 15
Page similarity   cont. We consider pairs of words  that start with a capital letter and  that are either adjacent or separated by a connector (and, or, but) or by a preposition which may be followed by an article (a, an, the) or by a single capital letter followed by dot. Adjacent cap-word pairs Cap-Word  (Connector | Preposition  (Article)? | (Capital-LetterDot))?Cap-Word 16
Page similarity   cont. We  eliminate  these  pairs  by  constructing  a  stop-word list, which is a list of frequently appearing adjacent cap-word pairs. We collected approximately 10,000 web documents taken at random from the Open Directory Project, DMOZ. We constructed all adjacent cap-word pairs. We considered all pairs with a frequency greater than two to be stop words. 17
Page similarity   cont. the  number  of  adjacent  cap-word  pairs  as an indicator of the similarity between two web pages In particular, we consider whether two web pages share exactly one,  exactly two,  exactly three,  or four or more adjacent cap-word pairs.    The greater the number of adjacent cap-word  pairs,  the  greater  the  similarity  between  the  pages. 18
Confidence Matrix Construction Construct a confidence matrix, one for each facet attributes, links, and page similarity The confidence matrix: upper triangular matrix over all pairs of the n  returned citations C 1 , C 2 , ...  , C n . Element Cij (i < j) in the confidence matrix the confidence Ci and C j refer to the same person 19
Confidence Matrix Construction   cont. In order to compute the conditional probabilities that represent confidence values, we construct a training set. We entered each name as a query for Google, and we collected the first 50 returned citations for each name. 49+48+.....+2+1 = 1,225 comparison pairs Total  number of comparisons 9*1225=11,025 20
Confidence Matrix Construction   cont. Same  Person:  whether  the  names  are for  the  same person Phone:  whether the web pages to which the citations link contain the same phone number Email:  whether the web pages to which the citations link contain the same email address Zip: whether the web pages to which the citations link contain the same address zip code City:  whether the web pages to which the citations link contain the same address city State:  whether the web pages to which the citations link contain the same address state 21 ,[object Object]
Host2 : whether the URL of one citation has the same host as one of the URLs that belongs to the web page of the other citation
Share1 :  whether the web pages referenced by the citations share exactly one adjacent cap-word pair
Share2 :  whether the web pages referenced by the citations share exactly two adjacent cap-word pair
Share3 : whether the web pages referenced by the citations share exactly three adjacent cap-word pair;
Share ≥ 4 :  whether the web pages referenced by the citations share four or more adjacent cap-word pairs,[object Object]
Confidence Matrix Construction   cont. For pairs, triples, quadruples, and quintuples of attributes, we also compute conditional probabilities. For example, we estimate P(Same Person  = “Yes” | City = “Yes” and State = “Yes”) which is the probability that two citations refer to the same person knowing that the web pages referenced by them share the same address city and state, by dividing the number of citation pairs that are related to the same person and have the same address city and state by the number of citation pairs that share same address city and state in the training set. 23
Confidence Matrix Construction   cont. For link  facet the URLs of the citations share the same non-popular host  For example, we estimate P(Same Person = “Yes” | Host1  “Yes” and Host1  is non-popular) by dividing the number of citation pairs that are related to the same person and have the same non-popular host by the number of citation pairs that share a common, non-popular host. 24
Confidence Matrix Construction   cont. For page similarity facet the web pages referenced by them share exactly one, or two, or three, or four or more pairs of two adjacent cap-word pairs.  For example, we estimate P(Same Person = “Yes” | Share2 = “Yes”) by dividing the number of citation pairs that are related to the same person and share two cap-word pairs by the number of citation pairs that share two cap-word pairs. 25
Final Confidence Matrix Generate  the final confidence matrix by combining the confidence matrices using Stanford certainty theory  Stanford certainty theory defines a confidence measure and generates some simple rules for combining independent evidence. 26
Final Confidence Matrix   cont.  CF (E 1 ): the certainty factor associated with evidence E 1 for  some  observation  B CF (E 2 ): the  certainty  factor associated with evidence E 2 for the same observation B new certainty factor CF of B (called the compound certainty  factor  of  B) CF (E 1 )+CF (E 2 )-(CF (E 1 ) ∗ CF (E 2 )). 27

More Related Content

Viewers also liked

Bitácora nº 3
Bitácora nº 3Bitácora nº 3
Bitácora nº 3Sthiven
 
Macpartitionmanager
MacpartitionmanagerMacpartitionmanager
MacpartitionmanagerRanvir Kumar
 
Unit 5 research project scientific structure template
Unit 5 research project scientific structure templateUnit 5 research project scientific structure template
Unit 5 research project scientific structure templatepaulcoxworthing
 
Boas práticas no desenvolvimento de software
Boas práticas no desenvolvimento de softwareBoas práticas no desenvolvimento de software
Boas práticas no desenvolvimento de softwareFelipe
 
Colvin exadata and_oem12c
Colvin exadata and_oem12cColvin exadata and_oem12c
Colvin exadata and_oem12cEnkitec
 
Building a social media presence
Building a social media presenceBuilding a social media presence
Building a social media presenceMugur Frunzetti
 
Apache Projekte als Basis einer Integrationsplattform
Apache Projekte als Basis einer IntegrationsplattformApache Projekte als Basis einer Integrationsplattform
Apache Projekte als Basis einer IntegrationsplattformCofinpro AG
 
Environmental & Geographical Science Honours 2015
Environmental & Geographical Science Honours 2015Environmental & Geographical Science Honours 2015
Environmental & Geographical Science Honours 2015UCT
 
Six Steps to eLearning engagement and boosting the value of your LMS
Six  Steps  to eLearning engagement and boosting the value of your LMSSix  Steps  to eLearning engagement and boosting the value of your LMS
Six Steps to eLearning engagement and boosting the value of your LMSInfor CERTPOINT
 
Nominalia and Simply @ Congreso web
Nominalia and Simply @ Congreso webNominalia and Simply @ Congreso web
Nominalia and Simply @ Congreso webNominalia
 
RefWorks Get References Into Folders
RefWorks Get References Into FoldersRefWorks Get References Into Folders
RefWorks Get References Into FoldersUCT
 

Viewers also liked (20)

Bitácora nº 3
Bitácora nº 3Bitácora nº 3
Bitácora nº 3
 
Macpartitionmanager
MacpartitionmanagerMacpartitionmanager
Macpartitionmanager
 
Unit 5 research project scientific structure template
Unit 5 research project scientific structure templateUnit 5 research project scientific structure template
Unit 5 research project scientific structure template
 
MSP TechDay
MSP TechDayMSP TechDay
MSP TechDay
 
Boas práticas no desenvolvimento de software
Boas práticas no desenvolvimento de softwareBoas práticas no desenvolvimento de software
Boas práticas no desenvolvimento de software
 
Colvin exadata and_oem12c
Colvin exadata and_oem12cColvin exadata and_oem12c
Colvin exadata and_oem12c
 
Building a social media presence
Building a social media presenceBuilding a social media presence
Building a social media presence
 
www.Maxi-stromvergleich.de
www.Maxi-stromvergleich.dewww.Maxi-stromvergleich.de
www.Maxi-stromvergleich.de
 
Trabajo
TrabajoTrabajo
Trabajo
 
Apache Projekte als Basis einer Integrationsplattform
Apache Projekte als Basis einer IntegrationsplattformApache Projekte als Basis einer Integrationsplattform
Apache Projekte als Basis einer Integrationsplattform
 
Aerial Photos
Aerial PhotosAerial Photos
Aerial Photos
 
Globalisation
GlobalisationGlobalisation
Globalisation
 
Case study pack
Case study packCase study pack
Case study pack
 
Let's START !
Let's START !Let's START !
Let's START !
 
Environmental & Geographical Science Honours 2015
Environmental & Geographical Science Honours 2015Environmental & Geographical Science Honours 2015
Environmental & Geographical Science Honours 2015
 
Six Steps to eLearning engagement and boosting the value of your LMS
Six  Steps  to eLearning engagement and boosting the value of your LMSSix  Steps  to eLearning engagement and boosting the value of your LMS
Six Steps to eLearning engagement and boosting the value of your LMS
 
Nominalia and Simply @ Congreso web
Nominalia and Simply @ Congreso webNominalia and Simply @ Congreso web
Nominalia and Simply @ Congreso web
 
Winter
WinterWinter
Winter
 
September 2012 Toronto Real Estate Market Watch
September 2012 Toronto Real Estate Market WatchSeptember 2012 Toronto Real Estate Market Watch
September 2012 Toronto Real Estate Market Watch
 
RefWorks Get References Into Folders
RefWorks Get References Into FoldersRefWorks Get References Into Folders
RefWorks Get References Into Folders
 

Similar to Grouping search engine returned citations for person-name queries

Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDataminingTools Inc
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clusteringguest0edcaf
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And ClusteringDatamining Tools
 
Lec2_Information Integration.ppt
 Lec2_Information Integration.ppt Lec2_Information Integration.ppt
Lec2_Information Integration.pptNaglaaFathy42
 
Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Nexgen Technology
 
Secured Ontology Mapping
Secured Ontology Mapping Secured Ontology Mapping
Secured Ontology Mapping dannyijwest
 
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...cscpconf
 
Computing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engineComputing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search enginecsandit
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Modeloudesign
 
Chat bot using text similarity approach
Chat bot using text similarity approachChat bot using text similarity approach
Chat bot using text similarity approachdinesh_joshy
 
Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Mariana Damova, Ph.D
 
Entity Resolution in Large Graphs
Entity Resolution in Large GraphsEntity Resolution in Large Graphs
Entity Resolution in Large GraphsApurva Kumar
 
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...INFOGAIN PUBLICATION
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...Holistic Benchmarking of Big Linked Data
 
Using NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesUsing NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesSushant Shankar
 
Knowledge Based Trust: Estimating the Trustworthiness of Web Sources
Knowledge Based Trust: Estimating the Trustworthiness of Web SourcesKnowledge Based Trust: Estimating the Trustworthiness of Web Sources
Knowledge Based Trust: Estimating the Trustworthiness of Web SourcesЮниВеб
 

Similar to Grouping search engine returned citations for person-name queries (20)

Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Textmining Retrieval And Clustering
Textmining Retrieval And ClusteringTextmining Retrieval And Clustering
Textmining Retrieval And Clustering
 
Lec2_Information Integration.ppt
 Lec2_Information Integration.ppt Lec2_Information Integration.ppt
Lec2_Information Integration.ppt
 
Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,Entity linking with a knowledge base issues,
Entity linking with a knowledge base issues,
 
Secured Ontology Mapping
Secured Ontology Mapping Secured Ontology Mapping
Secured Ontology Mapping
 
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
Object surface segmentation, Image segmentation, Region growing, X-Y-Z image,...
 
Computing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engineComputing semantic similarity measure between words using web search engine
Computing semantic similarity measure between words using web search engine
 
The Duet model
The Duet modelThe Duet model
The Duet model
 
Week 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data ModelWeek 4 The Relational Data Model & The Entity Relationship Data Model
Week 4 The Relational Data Model & The Entity Relationship Data Model
 
Chat bot using text similarity approach
Chat bot using text similarity approachChat bot using text similarity approach
Chat bot using text similarity approach
 
Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011Contextual Ontology Alignment - ESWC 2011
Contextual Ontology Alignment - ESWC 2011
 
Entity Resolution in Large Graphs
Entity Resolution in Large GraphsEntity Resolution in Large Graphs
Entity Resolution in Large Graphs
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
Link Analysis
Link AnalysisLink Analysis
Link Analysis
 
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
An Improved Web Explorer using Explicit Semantic Similarity with ontology and...
 
Measuring Similarity Between Contexts and Concepts
Measuring Similarity Between Contexts and ConceptsMeasuring Similarity Between Contexts and Concepts
Measuring Similarity Between Contexts and Concepts
 
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
EARL: Joint Entity and Relation Linking for Question Answering over Knowledge...
 
Using NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion housesUsing NLP to find contextual relationships between fashion houses
Using NLP to find contextual relationships between fashion houses
 
Knowledge Based Trust: Estimating the Trustworthiness of Web Sources
Knowledge Based Trust: Estimating the Trustworthiness of Web SourcesKnowledge Based Trust: Estimating the Trustworthiness of Web Sources
Knowledge Based Trust: Estimating the Trustworthiness of Web Sources
 

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Recently uploaded (20)

Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Grouping search engine returned citations for person-name queries

  • 1. Grouping Search-Engine Returned Citations forPerson-Name Queries Reema Al-Kamha, David W. Embley, WIDM’04 Brigham Young University Advisor: Chia-Hui Chang Student: Kuan-Hua Huo Date: 2010-6-1 1
  • 2. Outline Instruction Related work A multi-faseted approach Experimental results Concluding remarks 2
  • 3. Instruction Citations: the returned results that are related to a specific query in a search engine. Citation contains the title of the web page found, text below the title that includes the keywords of the query, and the URL of the web page found. Using Google the query ”Kelly Flanagan” 3
  • 4. Instruction cont. In the output retain the basic search-engine returned citations within each group maintain the search engine ranking order among groups maintain the relative order of citations as originally presented by the search engine 4
  • 5. Related work A cross-document coreference occurs when the same person, place, event, or concept is discussed in more than one text source. we would need to find a context, which is not straightforward for the mixture of structured, semistructured, and unstructured documents on the web. But we found that neither produced satisfactory results. 5
  • 6. Related work cont. Object identification refers to the task of deciding that two observed objects are in fact one and the same object. compare an object’s shared attributes in order to identify matching objects our technique involves links and page similarity in addition to attributes 6
  • 7. Related work cont. One technique that that is typically used to resolve objec identity is probabilistic modeling. Probabilistic modeling compares objects based on shared attributes uses appearance probability to determine the similarity between objects. 7
  • 8. A multi-faceted approach Each facet represents a relevant aspect of the problem space about which we can gather evidence that two citations reference the same person or different persons. attributes about a person links within and among sites page similarity as facets 8
  • 9. A multi-faceted approach cont. Attributes Links Page similarity Confidence Matrix Constuction Final Confidence Matrix Grouping Algorithm 9
  • 10. Attributes Attributes appear in web pages of citations phone number email address State (from a list that contains all state names and their abbreviations) city zip code 10
  • 11. Attributes cont. We only extract state, city, and zip code values in an address context. To extract a city we extract all strings that match the regular expressions ([A-Z]( w)+( )?) { 1,3 } and satisfy the context specification consisting of this string followed by an optional comma, white space, and a state name. 11
  • 12. Links If two URLs share a common host, we can be reasonably confident that they refer to the same person. 12
  • 13. Links cont. It is common to have two different persons that have the same name in two citations that have a popular host like www.yahoo.com. To find the number of pages that point to a host Query link:siteURL in Google shows all pages and gives a count of the number of pages that point to that URL 13
  • 14. Links cont. We determined empirically that a host h is popular for person-name queries if more than 400 pages point to h. If two citations c 1 and c 2 that are results of a person name query share the same non-popular host, or if the URL of one citation c 1 has the same non-popular host as one of the URLs that belongs to the web page referenced by the other citation c 2, then we can be confident about grouping c 1 and c 2 together for the same person. 14
  • 15. Page similarity The similarity between web pages referenced by the two returned citations If two web pages refer to the same person, there are specific words associated with that person. David Embley, who is a professor and a co-director of the Data ExtractionResearch Group in the Computer Science Departmentat Brigham Young University. Sandra Rogers contain Lessons from the Light, a book she wrote. 15
  • 16. Page similarity cont. We consider pairs of words that start with a capital letter and that are either adjacent or separated by a connector (and, or, but) or by a preposition which may be followed by an article (a, an, the) or by a single capital letter followed by dot. Adjacent cap-word pairs Cap-Word (Connector | Preposition (Article)? | (Capital-LetterDot))?Cap-Word 16
  • 17. Page similarity cont. We eliminate these pairs by constructing a stop-word list, which is a list of frequently appearing adjacent cap-word pairs. We collected approximately 10,000 web documents taken at random from the Open Directory Project, DMOZ. We constructed all adjacent cap-word pairs. We considered all pairs with a frequency greater than two to be stop words. 17
  • 18. Page similarity cont. the number of adjacent cap-word pairs as an indicator of the similarity between two web pages In particular, we consider whether two web pages share exactly one, exactly two, exactly three, or four or more adjacent cap-word pairs. The greater the number of adjacent cap-word pairs, the greater the similarity between the pages. 18
  • 19. Confidence Matrix Construction Construct a confidence matrix, one for each facet attributes, links, and page similarity The confidence matrix: upper triangular matrix over all pairs of the n returned citations C 1 , C 2 , ... , C n . Element Cij (i < j) in the confidence matrix the confidence Ci and C j refer to the same person 19
  • 20. Confidence Matrix Construction cont. In order to compute the conditional probabilities that represent confidence values, we construct a training set. We entered each name as a query for Google, and we collected the first 50 returned citations for each name. 49+48+.....+2+1 = 1,225 comparison pairs Total number of comparisons 9*1225=11,025 20
  • 21.
  • 22. Host2 : whether the URL of one citation has the same host as one of the URLs that belongs to the web page of the other citation
  • 23. Share1 : whether the web pages referenced by the citations share exactly one adjacent cap-word pair
  • 24. Share2 : whether the web pages referenced by the citations share exactly two adjacent cap-word pair
  • 25. Share3 : whether the web pages referenced by the citations share exactly three adjacent cap-word pair;
  • 26.
  • 27. Confidence Matrix Construction cont. For pairs, triples, quadruples, and quintuples of attributes, we also compute conditional probabilities. For example, we estimate P(Same Person = “Yes” | City = “Yes” and State = “Yes”) which is the probability that two citations refer to the same person knowing that the web pages referenced by them share the same address city and state, by dividing the number of citation pairs that are related to the same person and have the same address city and state by the number of citation pairs that share same address city and state in the training set. 23
  • 28. Confidence Matrix Construction cont. For link facet the URLs of the citations share the same non-popular host For example, we estimate P(Same Person = “Yes” | Host1 “Yes” and Host1 is non-popular) by dividing the number of citation pairs that are related to the same person and have the same non-popular host by the number of citation pairs that share a common, non-popular host. 24
  • 29. Confidence Matrix Construction cont. For page similarity facet the web pages referenced by them share exactly one, or two, or three, or four or more pairs of two adjacent cap-word pairs. For example, we estimate P(Same Person = “Yes” | Share2 = “Yes”) by dividing the number of citation pairs that are related to the same person and share two cap-word pairs by the number of citation pairs that share two cap-word pairs. 25
  • 30. Final Confidence Matrix Generate the final confidence matrix by combining the confidence matrices using Stanford certainty theory Stanford certainty theory defines a confidence measure and generates some simple rules for combining independent evidence. 26
  • 31. Final Confidence Matrix cont. CF (E 1 ): the certainty factor associated with evidence E 1 for some observation B CF (E 2 ): the certainty factor associated with evidence E 2 for the same observation B new certainty factor CF of B (called the compound certainty factor of B) CF (E 1 )+CF (E 2 )-(CF (E 1 ) ∗ CF (E 2 )). 27
  • 32. Gouping algorithm Input: the final confidence matrix Output: groups of the search engine returned citations We are highly confident about grouping two citations Ci and C j together in a set S 1 We are highly confident about grouping two citations C j and C k together in a set S 2 S 1 and S 2 share one or more citations (C j in our example) =>we are confident about grouping S 1 and S 2 together in one group S 3 . The threshold we use for “highly confident” is 0.8, which we determined empirically. 28
  • 33. Expermimental results To evaluate the performance of our system, we used split and merge measures. For example: eight returned citations C 1 , C 2 , C 3 , C 4 , C 5 , C 6 , C 7 , C 8 correct grouping result Group 1: { C 1 , C 2 , C 4 , C 6 , C 7 } , Group 2: { C 3 , C 8 } , Group 3: { C 5 } Our system Group 1: { C 1 , C 2 , C 4 } , Group 2: { C 3 , C 6 , C 7 } , Group 3: { C 5 , C 8 } 29
  • 34. Expermimental results cont. The average normalized score for splits for all facets is 0.004 The average normalized score for merges is 0.014 30
  • 35. Concluding remark We designed and implemented a system that can automatically group the returned citations from a search engine person-name query, such that each group of citations refers to the same person. We gave experimental evidence to show that our approach can be successful. 31