Named entity recognition and disambiguation using an iterative graph processing system
Julien Gonçalves
Julien Gonçalves
VP research and partner at Reportlinker
Working on semantic technologies since 2004
@rlk_jgo
fr.linkedin.com/pub/julien-gonçalves/2/557/a21
Who is Reportlinker?
ReportLinker is a technology company focused on providing
actionable information from global market research and data,
to Marketers, Analysts, Researchers and knowledge workers
in the enterprise.
4000 recurring clients around the world (Honeywell, 3M, …)
A €10M company
40 employees, 50% of them engineers and semantic specialists
What is ReportLinker?
Reportlinker finds, filters and organizes the latest industry data
Find, Filter and Organize
30 million new documents analyzed / month
Workflow: document discovery, NLP, search engine
1 billion URLs verified / month
3 million documents available as relevant
Each part of the workflow is scalable
Natural Language Processing
Text preprocessing: converting PDF/DOC/PPT formats to text.
Lexical analysis: parsing words and sentences.
Morphology analysis: chapter/table/figure detection using the morphology of the original document.
Semantic analysis: using a thesaurus with 3.5 million ontologies (industry, geography, topic).
Storage: structured, annotated and sliced, ready to be searched.
Data relevance: scoring each type of data by relevance (statistics, analysis, …).
Semantic Analysis
A Data Lexical Platform using a thesaurus with 3 dimensions:
Industry: 350 industries, 3000 sub-industries
Geography: world regions, countries, main cities
Topic: ontologies about industrial economics (production, exportation, …)
Semantic Analysis
Industry
Controlled vocabulary helps to find the right meaning of a term
Agribusiness > Food > Fruit and Vegetable > Fruit > apple, banana
One term can have several meanings (e.g. “apple” as a fruit or as a company).
Proximity to other concepts of the thesaurus found in the same section of text helps to find the right meaning.
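The proximity idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny `THESAURUS` dictionary, its sense labels and the `pick_sense` function are hypothetical, and "proximity" is reduced to counting related thesaurus terms that occur in the same section of text.

```python
import re

# Hypothetical mini-thesaurus: each sense of an ambiguous term lists related concepts.
THESAURUS = {
    "apple": {
        "fruit": {"banana", "fruit", "vegetable", "food", "agribusiness"},
        "company": {"iphone", "ipad", "software", "shares"},
    }
}

def pick_sense(term, section_text):
    """Return the sense whose related concepts appear most in the section, or None."""
    words = set(re.findall(r"[a-z]+", section_text.lower()))
    scores = {sense: len(related & words)
              for sense, related in THESAURUS[term].items()}
    best = max(scores, key=scores.get)
    # Without any related concept nearby, the term stays ambiguous.
    return best if scores[best] > 0 else None
```

With this toy data, a section mentioning "banana" and "food" resolves "apple" to the fruit sense, while a section with no related concept is left undecided.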
Semantic Analysis
How can we find, normalize and classify the company names mentioned in our reports?
Semantic Analysis: Company
Very simple approach: use a database of company names as an ontology.
We bought and used a database with 2 000 000 company names.
FAIL! This approach did not work at all.
Too many company names also exist as common nouns (e.g. “Post Office”, “Table”, …).
To avoid the noise, we need to match more context in order to be sure of the right meaning of a concept.
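The noise problem is easy to reproduce. The sketch below is a hypothetical stand-in for the dictionary approach: `COMPANY_NAMES` is a three-entry sample invented for illustration, and `naive_matches` simply reports every dictionary name present in a text, with no context check.

```python
import re

# Hypothetical sample standing in for the purchased list of company names.
COMPANY_NAMES = {"post office", "table", "toto inc."}

def naive_matches(text):
    """Return every dictionary name found in the text, ignoring context."""
    lowered = text.lower()
    return sorted(
        name for name in COMPANY_NAMES
        # Lookarounds keep "table" from matching inside a longer word.
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", lowered)
    )

print(naive_matches("The table below shows parcels handled by each post office."))
# prints ['post office', 'table']
```

Both hits are false positives: the sentence mentions no company at all, which is why more context is needed before accepting a match.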
Semantic Analysis: Company
Millions of companies exist around the world.
Company context changes very often (acquisitions, new activities, ...).
Hundreds of companies can have the same name.
To be able to disambiguate, we need additional context for each company concept.
Our approach
STEP 1: Hypotheses
STEP 2: Inferences
STEP 3: Analysis & Classification
To create our own database of company names with additional context for disambiguation.
To exploit our content (110 million documents) to discover and identify company names, people, products.
To use an inference engine to build a relational graph with verified concepts and contexts.
To use this new base of verified companies, enriched with contexts, to find the right companies in our content.
Step 1: Hypotheses
For each document analysed, we extract several “hypotheses” (unverified facts) using text mining rules.
We mainly have 3 types of hypotheses:
Identification of a concept (the probability that it is a company, person, product, …)
Relation between 2 concepts (context proximity between 2 concepts in the document)
Proximity between a concept and an industry/country (context proximity with another dimension in the document)
Step 1: Hypotheses
Example
In March 2010, Toto inc. acquired Thingso corp., the new CEO Kevin Sherpa wants to be present in China to sell the new Xbrid3.
“Kevin Sherpa” is guessed as a person name (NER rules).
“Toto inc.” and “Thingso corp.” are guessed as companies (NER rules). The “safer” the pattern, the stronger the hypothesis.
“China” is a country (ontology).
“Xbrid3” is an unidentified named entity (NER rules).
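A rule-based extraction of this kind could look like the sketch below. The two regular expressions, their strength scores and the `COUNTRIES` set are assumptions made up for this example; the slides do not show the actual NER rules. The unidentified entity “Xbrid3” is deliberately left out, since no rule here covers it.

```python
import re

SENTENCE = ("In March 2010, Toto inc. acquired Thingso corp., the new CEO "
            "Kevin Sherpa wants to be present in China to sell the new Xbrid3.")

# Hypothetical rules: (entity type, pattern, strength of the hypothesis).
# A legal suffix is treated as a "safer" pattern than a job title.
RULES = [
    ("company", re.compile(r"\b([A-Z]\w+ (?:inc|corp|Ltd)\.)"), 0.9),
    ("person", re.compile(r"\bCEO ([A-Z]\w+ [A-Z]\w+)"), 0.7),
]
COUNTRIES = {"China", "US"}  # stand-in for the geography ontology

def extract_hypotheses(text):
    """Return (label, type, strength) hypotheses found in one sentence."""
    found = [(m.group(1), kind, strength)
             for kind, pattern, strength in RULES
             for m in pattern.finditer(text)]
    # Ontology hits are known concepts, so they get full strength.
    found += [(c, "country", 1.0) for c in sorted(COUNTRIES)
              if re.search(r"\b" + re.escape(c) + r"\b", text)]
    return found
```

On the example sentence this yields two company hypotheses, one person hypothesis and one country.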
Step 1: Hypotheses
[Figure: graph linking Toto inc., Thingso corp., Kevin Sherpa and Xbrid3]
Label | Context | Industry / Geography
Toto inc. | Thingso corp. (C) / Kevin Sherpa (P) / Xbrid3 | China
Thingso corp. | Toto inc. (C) / Kevin Sherpa (P) / Xbrid3 | China
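The context table for the example sentence can be derived mechanically from the extracted entities. The sketch below assumes a simple input shape (a list of `(label, type)` pairs, with `C` for company, `P` for person and `?` for unidentified); the function name `context_rows` and the row layout are hypothetical.

```python
# Entities guessed in the example sentence. Types are hypotheses, not verified facts.
ENTITIES = [("Toto inc.", "C"), ("Thingso corp.", "C"),
            ("Kevin Sherpa", "P"), ("Xbrid3", "?")]
GEOGRAPHY = ["China"]

def context_rows(entities, geography):
    """For each company, list the other entities of the sentence as its context."""
    rows = {}
    for label, kind in entities:
        if kind == "C":
            context = [other for other, _ in entities if other != label]
            rows[label] = {"context": context, "geography": list(geography)}
    return rows
```

Each company gets one row whose context is every other entity found in the same scope, which matches the two rows of the table.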
Step 1: Hypotheses
To validate these hypotheses, we need to find more facts verifying the same hypotheses.
Data volume is one of the key elements of this approach.
We mine billions of sentences from economic reports and 3 million news updates every month.
Each hypothesis brings new information and new contexts around a company concept.
The more sources verify a hypothesis, the more likely it is to become a verified fact.
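One simple way to picture this validation is to count the distinct sources behind each hypothesis. The sketch below is an assumption, not the method from the talk: the threshold `MIN_SOURCES` is invented (the slides give no number), and a hypothesis is represented as a plain tuple.

```python
from collections import defaultdict

MIN_SOURCES = 3  # hypothetical threshold; the slides do not state one

def verified_facts(observations, min_sources=MIN_SOURCES):
    """Keep hypotheses observed in at least `min_sources` distinct sources.

    observations: iterable of (hypothesis, source_id) pairs.
    """
    sources = defaultdict(set)
    for hypothesis, source_id in observations:
        sources[hypothesis].add(source_id)
    return {h for h, seen in sources.items() if len(seen) >= min_sources}
```

Repeating the same hypothesis within one document does not help here; only independent sources raise the count, which is why data volume matters.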
Step 2: Inferences
An inference engine verifies all the hypotheses around each concept in order to keep only the verified facts.
[Figure: sub-graph linking the concepts C1, C2, P1 and P2]
From millions or billions of sub-graphs (one per scope of context), we obtain one final consolidated graph composed of only thousands of sub-graphs.
Step 2: Inferences
Apache Giraph is an iterative graph processing system built
for high scalability.
Giraph implements the Pregel model and other features that make graph computation easy to use.
Giraph loads the whole graph in memory, so computation is very quick.
Giraph is highly scalable.
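The Pregel model can be illustrated without Giraph itself. The single-process Python sketch below is not Giraph code and uses none of its API; it only mimics the vertex-centric loop (supersteps, messages, halting when no messages remain) on a toy task, labelling each vertex with the smallest vertex id of its connected component. The function name `pregel_min_label` is hypothetical.

```python
def pregel_min_label(edges):
    """Label each vertex with the smallest vertex id in its connected component.

    In every superstep a vertex reads its incoming messages, updates its value
    and messages its neighbours only if the value changed. The run stops when
    no messages are in flight.
    """
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    value = {v: v for v in neighbours}
    # Superstep 0: every vertex sends its own id to its neighbours.
    inbox = {v: [] for v in neighbours}
    for v in neighbours:
        for n in neighbours[v]:
            inbox[n].append(value[v])

    supersteps = 0
    while any(inbox.values()):
        supersteps += 1
        outbox = {v: [] for v in neighbours}
        for v, messages in inbox.items():
            if messages and min(messages) < value[v]:
                value[v] = min(messages)
                for n in neighbours[v]:
                    outbox[n].append(value[v])
        inbox = outbox  # messages become visible only in the next superstep
    return value, supersteps
```

In a real deployment the vertices are partitioned across workers and the message exchange happens over the network, but the per-vertex logic keeps this shape.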
Step 2: Inferences
Graph reduction continues until we can’t reduce the graph anymore
[Figure: reduction of the “Toto Inc.” sub-graphs over two iterations]
Initial sub-graphs:
Toto Inc. #1: Kevin Sherpa, Thingso corp., Xbrid3 (China)
Toto Inc. #2: Kevin Sherpa, David Rego, Xbrid3 (China)
Toto Inc. #3: Thingso corp., David Rego, Xbrid Project, Neko Ltd. (US)
Toto Inc. #4: Kevin Sherpa, Xbrid Project, Neko Ltd. (China)
Iteration 1:
Toto Inc. #1: Kevin Sherpa, David Rego, Thingso corp., Xbrid3 (China)
Toto Inc. #2: Xbrid Project, Kevin Sherpa, David Rego, Thingso corp., Neko Ltd. (China, US)
Iteration 2:
Toto Inc. #1: Xbrid Project, Kevin Sherpa, Xbrid3, David Rego, Thingso corp., Neko Ltd. (China, US)
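The reduction can be imitated on the “Toto Inc.” example with a small fixed-point loop. This sketch rests on an assumption not stated in the slides: two sub-graphs of the same name are merged when they share at least `MIN_SHARED` contexts. The pairwise comparison is quadratic and is meant for illustration only, not for millions of sub-graphs.

```python
MIN_SHARED = 2  # hypothetical merge threshold, not given in the slides

def reduce_once(subgraphs, min_shared=MIN_SHARED):
    """One synchronous pass: group sub-graphs sharing enough contexts, then merge."""
    n = len(subgraphs)
    group = list(range(n))  # group[i] = representative of sub-graph i
    for i in range(n):
        for j in range(i + 1, n):
            # Compare the sub-graphs as they were at the start of the pass.
            if len(subgraphs[i] & subgraphs[j]) >= min_shared:
                old, new = group[j], group[i]
                group = [new if g == old else g for g in group]
    merged = {}
    for i, g in enumerate(group):
        merged.setdefault(g, set()).update(subgraphs[i])
    return list(merged.values())

def reduce_graph(subgraphs):
    """Repeat the pass until the number of sub-graphs stops shrinking."""
    current = [set(s) for s in subgraphs]
    iterations = 0
    while True:
        reduced = reduce_once(current)
        if len(reduced) == len(current):
            return current, iterations
        current, iterations = reduced, iterations + 1

TOTO = [
    {"Kevin Sherpa", "Thingso corp.", "Xbrid3"},
    {"Kevin Sherpa", "David Rego", "Xbrid3"},
    {"Thingso corp.", "David Rego", "Xbrid Project", "Neko Ltd."},
    {"Kevin Sherpa", "Xbrid Project", "Neko Ltd."},
]
```

Calling `reduce_graph(TOTO)` merges the four sub-graphs into two in the first pass and into one in the second, after which nothing more can be reduced.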
Step 2: Inferences
The final graph is filtered to obtain a base of verified companies.
Only the best context is kept for each company name (context frequently related to the company).
Special iterations are run to normalize companies with very similar names (e.g. “Google France” and “Google Fr”).
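The slides do not say how close names are detected inside those iterations. As a stand-in, the sketch below uses plain string similarity from Python's standard `difflib` module, with an invented cut-off `THRESHOLD` and the arbitrary choice of keeping the longest name as the canonical form.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical similarity cut-off

def similar(a, b, threshold=THRESHOLD):
    """True when two names are close according to difflib's similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def normalize(names):
    """Map every name to the first longer (or equal) name it closely resembles."""
    canonical = []
    mapping = {}
    for name in sorted(names, key=len, reverse=True):
        match = next((c for c in canonical if similar(name, c)), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping
```

With this rule “Google Fr” maps to “Google France”, while an unrelated name such as “Toto inc.” stays on its own. String similarity alone would wrongly merge distinct companies with near-identical names, so shared context should also be checked in practice.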
Step 3: Semantic Analysis
Company Name | Name to Match / Alias | Contexts | Industry / Geography
Toto inc. | Toto inc., Toto incorporated, Toto | Xbrid Project, Kevin Sherpa, Xbrid3, David Rego, Thingso corp., Neko Ltd., ... | China, US
Apple inc. | Apple inc., Apple incorporated, Apple | Tim Cook, iPhone, iPad, Steve Jobs, ... | US, World
The more “common” a company name is (common noun, several companies with the same name, high frequency in the corpus), the more diverse the context it needs in order to be verified.
Step 3: Semantic Analysis
Kevin Sherpa said “Toto forecasts to double its revenue in China selling the new Xbrid3.”
1) “Toto” is a possible name to match, normalized as “Toto inc.”
2) “Toto” is found in this text, so we load all the contextual terms related to this concept in order to disambiguate and select the right concept.
3) Contextual terms are found: “Toto” is classified as “Toto inc.” in this text.
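The three steps can be sketched as a lookup followed by a context count. Everything named here is an assumption for illustration: the `VERIFIED` dictionary is a hand-written excerpt, “Toto S.A.” is an invented second candidate added to show the disambiguation, and `min_hits` is an arbitrary threshold.

```python
import re

# Hypothetical excerpt of the verified base: alias -> candidates with contexts.
VERIFIED = {
    "toto": [
        {"name": "Toto inc.",
         "contexts": {"kevin sherpa", "xbrid3", "xbrid project",
                      "david rego", "thingso corp.", "neko ltd.", "china"}},
        {"name": "Toto S.A.",  # invented homonym, for illustration only
         "contexts": {"bakery", "paris"}},
    ],
}

def contains(term, text):
    """Whole-term match, so a short term cannot match inside a longer word."""
    return re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text) is not None

def classify(alias, text, min_hits=2):
    """Return the candidate whose contexts best match the text, or None."""
    lowered = text.lower()
    if not contains(alias, lowered):
        return None
    best_name, best_hits = None, 0
    for candidate in VERIFIED.get(alias, []):
        hits = sum(contains(ctx, lowered) for ctx in candidate["contexts"])
        if hits > best_hits:
            best_name, best_hits = candidate["name"], hits
    return best_name if best_hits >= min_hits else None
```

On the example sentence, three contextual terms of “Toto inc.” are present (Kevin Sherpa, China, Xbrid3), so the alias is classified as that company. A text mentioning “Toto” with no known context is left unclassified.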
Step 3: Semantic Analysis
[Figure: verified company names and the contextual terms related to companies are loaded in memory; the NLP process analyses the content, checks the contextual terms and outputs each company found, disambiguated and classified]
Company names that are eligible are loaded in memory (NLP
process)
Contexts are loaded in memory in a remote cluster (Redis)
Beta version: Statistics
400 million hypotheses
2 million documents analysed
graph nodes: 27 million
graph edges: 380 million
> 400 000 companies verified, enriched with contexts
Conclusion
Using Big Data analytics, we found a very good approach to discover, disambiguate and normalise company names. This solution works because we succeeded in resolving 3 main issues:
Data volume
Pattern detection to discover hypotheses (NER rules)
Optimized algorithms for the inference engine
QUESTIONS?

100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 

Recently uploaded (20)

Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 

Julien Gonçalves: Named entity recognition and disambiguation using an iterative graph processing system

  • 1. Named entity recognition and disambiguation using an iterative graph processing system. Julien Gonçalves
  • 2. Julien Gonçalves. VP research and partner at Reportlinker. Working on semantic technologies since 2004. @rlk_jgo, fr.linkedin.com/pub/julien-gonçalves/2/557/a21
  • 3. Who is ReportLinker? ReportLinker is a technology company focused on providing actionable information from global market research and data to marketers, analysts, researchers and knowledge workers in the enterprise. 4000 recurring clients around the world (Honeywell, 3M, …). A €10M company. 40 employees, 50% of them engineers and semantic specialists.
  • 4. What is ReportLinker? ReportLinker finds, filters and organizes the latest industry data.
  • 5. Find, Filter and Organize. Workflow: document discovery, NLP, search engine. 1 billion URLs verified / month. 30 million new documents analyzed / month. 3 million documents available as relevant. Each part of the workflow is scalable.
  • 6. Natural Language Processing. Text preprocessing: converting PDF/DOC/PPT formats to text. Lexical analysis: parsing words and sentences. Morphology analysis: chapter/table/figure detection using the morphology of the original document. Semantic analysis: using a thesaurus with 3.5 million ontologies (industry, geography, topic). Storage: structured, annotated, sliced, ready to be searched. Data relevance: scoring each type of data by relevance (statistics, analysis, …).
  • 7. Semantic Analysis. A data lexical platform using a thesaurus with 3 dimensions. Industry: 350 industries, 3000 sub-industries. Geography: world regions, countries, main cities. Topic: ontologies about industrial economics (production, exportation, …).
  • 8. Semantic Analysis: Industry. A controlled vocabulary helps to find the right meaning of a term. Example hierarchy: Agribusiness > Food > Fruit and Vegetable > Fruit > apple, banana. One term can carry several meanings (e.g. “apple” as a fruit or as a company). Proximity to other thesaurus concepts found in the same section of text helps to select the right meaning.
  • 9. Semantic Analysis. How can we find, normalize and classify the company names mentioned in our reports?
  • 10. Semantic Analysis: Company. A very simple approach: use a database of company names as an ontology. FAIL! This approach did not work at all. We bought and used a database of 2 000 000 company names, but too many company names also exist as common nouns (e.g. “Post Office”, “Table”, …). To avoid this noise, we need to match more context in order to be sure of the right meaning of a concept.
  • 11. Semantic Analysis: Company. Millions of companies exist around the world. Company context changes very often (acquisitions, new activities, ...). Hundreds of companies can have the same name. To be able to disambiguate, we need additional context for each company concept.
  • 12. Our approach: create our own database of company names with additional context for disambiguation. Step 1, Hypotheses: exploit our content (110 million documents) to discover and identify company names, people and products. Step 2, Inferences: use an inference engine to build a relational graph with verified concepts and contexts. Step 3, Analysis & Classification: use this new base of verified companies, enriched with contexts, to find the right companies in our content.
  • 13. Step 1: Hypotheses. For each document analysed, we extract several “hypotheses” (unverified facts) using text mining rules. We mainly have 3 types of hypotheses: identification of a concept (the probability that it is a company, person, product, …); relation between 2 concepts (context proximity between 2 concepts in the document); proximity between a concept and an industry/country (context proximity with another dimension in the document).
  • 14. Step 1: Hypotheses. Example: “In March 2010, Toto inc. acquired Thingso corp., the new CEO Kevin Sherpa wants to be present in China to sell the new Xbrid3.” “Kevin Sherpa” is guessed to be a person name (NER rules). “Toto inc.” and “Thingso corp.” are guessed to be companies (NER rules). The “safer” the pattern, the stronger the hypothesis. “China” is a country (ontology). “Xbrid3” is an unidentified named entity (NER rules).
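The extraction on this slide could be sketched roughly as follows. This is only an illustration, not ReportLinker's rule set: the regular expressions, the strength values and the names `RULES`, `COUNTRIES` and `extract_hypotheses` are all assumptions made for the example.

```python
import re

# Stand-in for the geography ontology (assumption).
COUNTRIES = {"China", "US"}

# (type, pattern, strength). Patterns and strengths are hypothetical:
# a legal suffix is treated as a "safer" pattern than a bare capitalised token.
RULES = [
    ("company", re.compile(r"\b((?:[A-Z]\w+ )+(?:inc|corp|ltd)\.)"), 0.9),
    ("person", re.compile(r"\bCEO ([A-Z][a-z]+ [A-Z][a-z]+)"), 0.8),
    ("unknown", re.compile(r"\b([A-Z][a-z]+\d+)\b"), 0.3),
]


def extract_hypotheses(text):
    """Return {label: (type, strength)} for every rule that fires on the text."""
    hypotheses = {}
    for kind, pattern, strength in RULES:
        for match in pattern.finditer(text):
            # Keep the first (strongest) rule that matched a given label.
            hypotheses.setdefault(match.group(1), (kind, strength))
    for country in COUNTRIES:
        if re.search(rf"\b{re.escape(country)}\b", text):
            hypotheses[country] = ("country", 1.0)  # ontology match, not a guess
    return hypotheses


text = ("In March 2010, Toto inc. acquired Thingso corp., the new CEO "
        "Kevin Sherpa wants to be present in China to sell the new Xbrid3.")
hypotheses = extract_hypotheses(text)
for label, (kind, strength) in hypotheses.items():
    print(f"{label}: {kind} ({strength})")
```

Each hypothesis stays unverified at this stage; the strength only records how safe the matching pattern is considered to be.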
  • 15. Step 1: Hypotheses. Hypotheses extracted from the example (C = company, P = person):
      Label | Context | Industry / Geography
      Toto inc. | Thingso corp. (C) / Kevin Sherpa (P) / Xbrid3 | China
      Thingso corp. | Toto inc. (C) / Kevin Sherpa (P) / Xbrid3 | China
  • 16. Step 1: Hypotheses. To validate these hypotheses we need to find more facts verifying the same hypotheses. Data volume is one of the key elements of this approach: we mine billions of sentences from economic reports and 3 million news updates every month. Each hypothesis brings new information and new contexts around a company concept. The more sources verify a hypothesis, the more likely it is to become a verified fact.
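A minimal sketch of the "verified by several sources" idea, assuming a hypothesis becomes a fact once it has been seen in a minimum number of distinct sources. The threshold `min_sources`, the tuple layout and the sample observations are hypothetical; the slides do not say how verification is actually scored.

```python
from collections import defaultdict


def verified_facts(observations, min_sources=3):
    """observations: iterable of (source_id, hypothesis) pairs.

    Returns the hypotheses seen in at least `min_sources` distinct sources.
    """
    sources = defaultdict(set)
    for source_id, hypothesis in observations:
        sources[hypothesis].add(source_id)  # repeated sources count once
    return {h for h, seen in sources.items() if len(seen) >= min_sources}


observations = [
    ("report-1", ("Toto inc.", "is_a", "company")),
    ("report-2", ("Toto inc.", "is_a", "company")),
    ("news-7", ("Toto inc.", "is_a", "company")),
    ("news-7", ("Toto inc.", "is_a", "company")),   # same source again
    ("report-1", ("Table", "is_a", "company")),     # seen only once
]
facts = verified_facts(observations)
print(facts)
```

Counting distinct sources rather than raw mentions keeps a single repetitive document from verifying its own hypothesis.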
  • 17. Step 2: Inferences. An inference engine verifies all the hypotheses around each concept in order to keep only the verified facts. [Diagram: graph of nodes C1, C2, P1, P2.] From millions or billions of sub-graphs (one per scope of context), we obtain 1 final consolidated graph composed of only thousands of sub-graphs.
  • 18. Step 2: Inferences. Apache Giraph is an iterative graph processing system built for high scalability. Giraph implements the Pregel model, along with other features that make graph computation easy to use. Giraph loads the whole graph in memory, so computation is very quick. Giraph is highly scalable.
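To give a feel for the Pregel model mentioned here, the following is a single-process Python sketch of vertex-centric supersteps. It is not Giraph code and does not use the Giraph API; the function name `pregel_min_label` and the toy graph are assumptions. Each vertex keeps the smallest label it has heard of, which groups connected vertices under one label.

```python
def pregel_min_label(edges):
    """Propagate the smallest vertex label through each connected component."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    value = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbours.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(value[v])

    # Later supersteps: a vertex is only active if it received messages.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < value[v]:
                value[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(value[v])
            # Otherwise the vertex "votes to halt" and sends nothing.
        inbox = outbox  # messages are delivered at the next superstep
    return value


labels = pregel_min_label([("a", "b"), ("b", "c"), ("x", "y")])
print(labels)
```

The computation stops when no vertex has anything left to send, which is the same halting idea a Pregel-style system applies across many machines.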
  • 19. Step 2: Inferences. Graph reduction continues until we can’t reduce the graph anymore. [Diagram: four initial sub-graphs for “Toto Inc.” (#1 to #4), with context nodes Kevin Sherpa, David Rego, Thingso corp., Neko Ltd., Xbrid3, Xbrid Project, China and US, are merged into two sub-graphs at iteration 1 and into a single sub-graph at iteration 2.]
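The reduction loop can be illustrated with a small sketch. The merge criterion used here (two sub-graphs of the same company name are merged when they share at least `min_shared` context nodes) is an assumption chosen for the example, as are the function names; the slides do not state the actual rule. Geography nodes are left out for brevity.

```python
def reduce_once(subgraphs, min_shared=2):
    """One iteration: each sub-graph takes part in at most one merge."""
    remaining = list(subgraphs)
    reduced = []
    while remaining:
        current = remaining.pop(0)
        for i, other in enumerate(remaining):
            if len(current & other) >= min_shared:
                current = current | remaining.pop(i)
                break
        reduced.append(current)
    return reduced


def reduce_graph(subgraphs, min_shared=2):
    """Iterate until the number of sub-graphs stops decreasing."""
    current = [set(s) for s in subgraphs]
    sizes = [len(current)]
    while True:
        reduced = reduce_once(current, min_shared)
        if len(reduced) == len(current):
            return current, sizes
        current = reduced
        sizes.append(len(current))


toto = [
    {"Kevin Sherpa", "Thingso corp.", "Xbrid3"},                    # Toto Inc. #1
    {"Kevin Sherpa", "David Rego", "Xbrid3"},                       # Toto Inc. #2
    {"Thingso corp.", "David Rego", "Xbrid Project", "Neko Ltd."},  # Toto Inc. #3
    {"Kevin Sherpa", "Xbrid Project", "Neko Ltd."},                 # Toto Inc. #4
]
final, sizes = reduce_graph(toto)
print(sizes, final)
```

With these four context sets the sub-graph count goes from 4 to 2 to 1, after which a further pass changes nothing and the loop stops.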
  • 20. Step 2: Inferences. The final graph is filtered to obtain a base of verified companies. Only the best context is kept for each company name (context frequently related to the company). Special iterations are run to normalize company names that are very close to each other (e.g. “Google France” and “Google Fr”).
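One possible way to group very close names is plain string similarity. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 0.8 threshold, the choice of the longest variant as canonical name and the helper names are assumptions, not the method used in the talk.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Case-insensitive similarity test between two names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def normalize_names(names, threshold=0.8):
    """Map every name to a canonical variant (here: the longest of its cluster)."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if similar(name, cluster[0], threshold):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return {name: max(cluster, key=len)
            for cluster in clusters for name in cluster}


mapping = normalize_names(["Google France", "Google Fr", "Toto inc."])
print(mapping)
```

String similarity alone would also merge unrelated companies with near-identical names, which is why shared context from the graph remains necessary in practice.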
  • 21. Step 3: Semantic Analysis.
      Company Name | Name to Match / Alias | Contexts | Industry / Geography
      Toto inc. | Toto inc., Toto incorporated, Toto | Xbrid Project, Kevin Sherpa, Xbrid3, David Rego, Thingso corp., Neko Ltd., ... | China, US
      Apple inc. | Apple inc., Apple incorporated, Apple | Tim Cook, iPhone, iPad, Steve Jobs, ... | US, World
    The more “common” a company name is (common noun, several companies with the same name, high frequency in the corpus), the greater the diversity of context it needs in order to be verified.
  • 22. Step 3: Semantic Analysis. Example: Kevin Sherpa said “Toto forecasts to double its revenue in China selling the new Xbrid3.” 1) “Toto” is a possible name to match, normalized as “Toto inc.” 2) “Toto” is found in this text, so we load all the contextual terms related to this concept in order to disambiguate and select the right concept. 3) Contextual terms are found, so “Toto” is classified as “Toto inc.” in this text.
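The three steps above can be sketched as a lookup followed by a context check. The in-memory dictionary stands in for the verified company base, and the `min_contexts` threshold and the `classify` function are assumptions made for illustration.

```python
import re

# Stand-in for the base of verified companies (aliases and contexts from the slides).
COMPANIES = {
    "Toto inc.": {
        "aliases": {"Toto inc.", "Toto incorporated", "Toto"},
        "contexts": {"Xbrid Project", "Kevin Sherpa", "Xbrid3",
                     "David Rego", "Thingso corp.", "Neko Ltd."},
    },
    "Apple inc.": {
        "aliases": {"Apple inc.", "Apple incorporated", "Apple"},
        "contexts": {"Tim Cook", "iPhone", "iPad", "Steve Jobs"},
    },
}


def classify(text, companies=COMPANIES, min_contexts=2):
    """Return {company: matched contexts} for aliases confirmed by context."""
    found = {}
    for name, entry in companies.items():
        # Steps 1 and 2: is any alias of this company present in the text?
        if not any(re.search(rf"\b{re.escape(alias)}(?!\w)", text)
                   for alias in entry["aliases"]):
            continue
        # Step 3: accept the match only if enough contextual terms co-occur.
        hits = {term for term in entry["contexts"] if term in text}
        if len(hits) >= min_contexts:
            found[name] = hits
    return found


quote = ('Kevin Sherpa said "Toto forecasts to double its revenue in China '
         'selling the new Xbrid3."')
print(classify(quote))
print(classify("She put an Apple on the table."))
```

A more common name such as “Apple” could be given a higher `min_contexts` than a rare one, in line with the remark that common names need more diverse context.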
  • 23. Step 3: Semantic Analysis. [Diagram: content to analyse goes through NLP; verified company names are loaded in memory; contextual terms related to companies are checked; output is the companies found, disambiguated and classified.] Company names that are eligible are loaded in memory (NLP process). Contexts are loaded in memory in a remote cluster (Redis).
  • 24. Beta version: statistics. 400 million hypotheses. 2 million documents analysed. Graph nodes: 27 million. Graph edges: 380 million. More than 400 000 companies verified, enriched with contexts.
  • 25. Conclusion. Using Big Data analytics, we found a very good approach to discover, disambiguate and normalise company names. This solution works because we succeeded in resolving 3 main issues: data volume; pattern detection to discover hypotheses (NER rules); optimized algorithms for the inference engine.