Student Team : Liang Shi, Alexander Michels, Himanshu Ahuja
Academic Mentor : Shadi Shahsavari
Industry Mentor : Dr. Stephen DeSalvo, Urjit Patel
Information Extraction and Aggregation
from Unstructured Web Data
for Business Profiling
1. Manual Search
2. Credible Database
3. Forward-looking Models
4. Predict Likely Losses
Praedicat: An Insurance Tech Company
• Determine litigation risks
• Predict the likely amount of losses
RIPS Team
Automating
1. Manual Search
2. Credible Database
3. Forward-looking Models
4. Predict Likely Losses
Where do we fit in?
Manual Search Process
1. Search Engine
2. Site Search
3. Evaluate Contents
Change Keywords
- Government Databases - News
Difficulty of Searching Information
Less Indicative of Litigation Risks More Indicative of Litigation Risks
Structured Web Pages
Facility Report for 3M Facility Report for Samsung
Unstructured Web Pages
Problem Statement
How to automate information extraction,
classification, and fact-checking for unstructured
data on the Internet
Solution Overview
Web Crawling Framework
Information Classification & Aggregation
Computational Fact-Checking using KGs
Web Crawling Framework: Company Name → Query Formulator
Example query: “3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF (terms marked Mandatory or Optional)
Information Classification & Aggregation: Raw text → Master Document
Computational Fact-Checking using KGs: Base Knowledge Graph
Query Formulator: Asking about the right things!
Zero useful results
PDF result mentions Rentokil Initial PLC involvement in window cleaning.
‘Apple Inc.’ returns the right results.
Query Formulator: How did we ask the right things?
• Name of the company
• Mention the file type
• Making keywords mandatory
• Making some words optional
• Optional alias
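The query-building rules above can be sketched in a few lines of Python. This is only an illustration, not the team's code: the function name `formulate_query` and its parameters are hypothetical, and it assumes (as the diagram labels suggest) that the quoted company name is the mandatory part and that the `~` terms are the optional ones, here taken from the words of the company's former name.

```python
def formulate_query(company, optional_terms=(), filetype=None):
    """Build a search-engine query string (hypothetical helper).

    The quoted company name is treated as the mandatory part, each
    optional term is prefixed with "~", and `filetype` restricts the
    kind of document returned.
    """
    parts = [f'"{company}"']
    parts.extend(f"~{term}" for term in optional_terms)
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = formulate_query("3M", ["Minnesota", "Mining", "Manufacturing"], "PDF")
print(query)  # "3M" ~Minnesota ~Mining ~Manufacturing filetype:PDF
```

Keeping the parts in a list makes it easy to drop the file type or the optional terms when a first query returns nothing useful.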
[Pipeline diagram repeated. New in this step: the Query Formulator emits a Refined Query, which is passed to the Web Crawler.]
Web Crawling: What is web crawling?
[Figure: crawl graph with Start and End nodes.]
Web Crawling: Unsupervised machines cannot be trusted
Start with a Google search of the company and its business activity.
The business activity appears in the financial report, which is surfaced specifically through the search services provided by the website.
Web Crawling: Where and how far?
The problem:
• We don’t know how far to dig, or where to dig.
• We don’t know the credible sources, or where the information lies on those sources.
Web Crawling: Credible data to the rescue
• Interestingly, the structured data (available on Federal websites & Wikipedia) is also credible!
• Designed specific crawlers to get data from specific databases.
• Created baseline data to support unsupervised web crawling.
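One way to picture the "where and how far" problem is a breadth-first crawl that is bounded by a depth limit and restricted to a set of credible domains. The sketch below is an assumption-laden illustration, not the project's crawler: `bounded_crawl`, `get_links`, the domain set and the in-memory `LINKS` table (standing in for real HTTP fetches) are all hypothetical.

```python
from collections import deque
from urllib.parse import urlparse


def bounded_crawl(start_url, get_links, credible_domains, max_depth):
    """Breadth-first crawl limited by depth and by a set of credible domains."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # "how far": do not follow links beyond the depth limit
        for link in get_links(url):
            if link in seen:
                continue
            if urlparse(link).netloc not in credible_domains:
                continue  # "where": stay on credible sources only
            seen.add(link)
            queue.append((link, depth + 1))
    return visited


# Hypothetical link graph used instead of fetching pages over the network.
LINKS = {
    "https://www.sec.gov/a": ["https://www.sec.gov/b", "https://ads.example.com/x"],
    "https://www.sec.gov/b": ["https://en.wikipedia.org/wiki/3M"],
    "https://en.wikipedia.org/wiki/3M": ["https://www.sec.gov/c"],
}

pages = bounded_crawl(
    "https://www.sec.gov/a",
    lambda url: LINKS.get(url, []),
    {"www.sec.gov", "en.wikipedia.org"},
    max_depth=2,
)
print(pages)
```

With `max_depth=2` the crawl stops at the Wikipedia page and never reaches `https://www.sec.gov/c`, and the advertising link is skipped because its domain is not in the credible set.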
[Pipeline diagram repeated. New in this step: the Web Crawler returns a List of URLs, which is passed to the Parser.]
Parser: Getting unstructured data
• Use of text abundance to locate meaningful paragraphs.
• Filtering out tags containing social media redirects.
• Removing graphic contents, advertisements.
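The three parser rules above can be approximated with Python's standard `html.parser` module. This is a minimal sketch under stated assumptions: "text abundance" is modelled as a minimum word count per `<p>` element, the `SOCIAL_HOSTS` list is made up for illustration, and graphics are dropped simply because only paragraph text is collected. The class name `ParagraphExtractor` is hypothetical.

```python
from html.parser import HTMLParser

# Assumed list of social-media hosts; the real filter may differ.
SOCIAL_HOSTS = ("facebook.com", "twitter.com", "linkedin.com")


class ParagraphExtractor(HTMLParser):
    """Collect the text of <p> elements that contain enough words."""

    def __init__(self, min_words=5):
        super().__init__()
        self.min_words = min_words
        self.paragraphs = []
        self._buffer = None       # text pieces of the current <p>, or None
        self._skip_link = False   # True while inside a social-media <a>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            self._skip_link = any(host in href for host in SOCIAL_HOSTS)

    def handle_endtag(self, tag):
        if tag == "a":
            self._skip_link = False
        elif tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if len(text.split()) >= self.min_words:
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and not self._skip_link:
            self._buffer.append(data)


html = """
<div><img src="ad.png"><p>Share on <a href="https://twitter.com/share">Twitter</a></p>
<p>The 3M Company produces adhesives, abrasives and laminates.</p></div>
"""
parser = ParagraphExtractor(min_words=5)
parser.feed(html)
print(parser.paragraphs)
```

The short "Share on" paragraph is discarded by the word-count threshold, while the descriptive sentence about 3M is kept.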
[Pipeline diagram repeated. New in this step: the Parser produces Raw Texts, which go to the Web Resource Manager; an "SEC Source URLs …" input is also shown.]
Web Resource Manager:
UUID (Universally Unique Identifier)
• source/resource_uuid.(pdf/html)
• docs/resource_uuid.json
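The file layout above (a raw source file plus a JSON document, both named after the same UUID) could be produced along these lines. The function `register_resource` and the metadata fields are hypothetical, and the use of a random `uuid.uuid4()` is an assumption; the slides do not say how the identifiers are generated.

```python
import json
import tempfile
import uuid
from pathlib import Path


def register_resource(root, url, content, extension):
    """Store a fetched resource under a fresh UUID and write its metadata."""
    resource_id = str(uuid.uuid4())  # assumption: random (version 4) UUIDs
    source_path = Path(root) / "source" / f"{resource_id}.{extension}"
    doc_path = Path(root) / "docs" / f"{resource_id}.json"
    source_path.parent.mkdir(parents=True, exist_ok=True)
    doc_path.parent.mkdir(parents=True, exist_ok=True)
    source_path.write_bytes(content)
    metadata = {"uuid": resource_id, "url": url, "source": source_path.name}
    doc_path.write_text(json.dumps(metadata))
    return resource_id


with tempfile.TemporaryDirectory() as root:
    rid = register_resource(root, "https://example.com/report.pdf", b"%PDF-", "pdf")
    stored = sorted(p.relative_to(root).as_posix()
                    for p in Path(root).rglob("*") if p.is_file())
    print(stored)
```

Naming both files after one identifier means the raw download and its extracted document can always be matched without consulting a separate index.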
[Pipeline diagram repeated. New in this step: the output of the Parser and Web Resource Manager is raw text, shown below.]
Output: Raw text
The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company, is an American multinational conglomerate corporation operating in the fields of industry, health care, and consumer goods.[2] The company produces a variety of products, including adhesives, abrasives, laminates, passive fire protection, personal protective equipment, dental and orthodontic products, electronic materials, medical products, car-care products,[3] electronic circuits, healthcare software and optical films.[4]
Outputs of Site Crawlers
• Financial statements for 52,629 companies
• 21,202 Facility Reports
• Product and ingredient list for 4,535 companies
• Thousands of subsidiary structures
• Tens of thousands of Wikipedia pages
Data
[Pipeline diagram repeated, with the same raw-text output. New in this step: the raw text is passed to the Classifier.]
Self-Supervised Learning
1. Label: labels its own training examples using heuristics
2. Train Classifier: trains a classifier on the examples it labeled
3. Use Classifier: classifies using the features it learned from self-labeled data
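The label, train, use loop above can be illustrated with a toy example. Everything here is an assumption made for the sketch: the keyword heuristics in `heuristic_label`, the tiny corpus, and the choice of a hand-written naive Bayes model (the slides use Doc2Vec features, which are not reproduced here).

```python
import math
from collections import Counter


def heuristic_label(text):
    """Return 1 (relevant), 0 (irrelevant) or None (unsure). Hypothetical rules."""
    words = text.lower().split()
    if any(w in {"acquired", "subsidiary", "produces"} for w in words):
        return 1
    if len(words) < 4:
        return 0
    return None


def train(examples):
    """Count words per class; assumes both classes have at least one example."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        docs[label] += 1
    return counts, docs


def classify(model, text):
    """Naive Bayes with add-one smoothing over the self-labeled counts."""
    counts, docs = model
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log(docs[label] / sum(docs.values()))
        for word in text.lower().split():
            score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


corpus = [
    "3M acquired a subsidiary in 2014",
    "The company produces adhesives and abrasives",
    "see note 2",
    "item 3",
    "The company makes adhesives and films",
]
# Step 1: label what the heuristics can decide; leave the rest unlabeled.
labeled = [(t, heuristic_label(t)) for t in corpus if heuristic_label(t) is not None]
unlabeled = [t for t in corpus if heuristic_label(t) is None]
# Step 2: train on the self-labeled examples.
model = train(labeled)
# Step 3: use the classifier on text the heuristics could not decide.
for text in unlabeled:
    print(text, "->", classify(model, text))
```

The last sentence matches none of the keyword rules, yet the trained model marks it as relevant because it shares vocabulary with the self-labeled relevant examples.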
Doc2Vec
• Represents semantic meaning of documents in a vector space
• You can "tag" documents with topics.
• We can attempt to cluster or classify documents using tags.
Example tags: Apple, iPhone, Swift, Mac
Classification Results: Web Pages
TF-IDF Produced:
• - riddel j
• 1941
• rhop
• danaida
• - boisduv j
We Produced:
• 2014 Chemr acquired 3D-Radar as a subsidiary of Curtiss-Wright Corporation in May 2014
Classification Results: Financial Statements
TF-IDF Produced:
• item 3
• asu no
• see note 2
• 10
• -11
We Produced:
• these challenges add to the uncertainties of the legislative changes enacted as part of ACA
[Pipeline diagram repeated. New in this step: the Classifier outputs Relevant Text Documents to the Profile Manager, which holds the mappings CIK → SEC, NAICS → SEC and SEC → CIK and per-company entries (Pfizer, 3M, Dole) with URLs, Wiki and Subsidiaries.]
Profile Manager
• Aggregates information by company
• Queryable
• Contains utility functions
CIK: Central Index Key
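A rough sketch of what "aggregates information by company" and "queryable" could look like is given below. The class layout, method names and the CIK value are all hypothetical placeholders chosen for illustration; the actual Profile Manager in the project repository may be organized quite differently.

```python
from collections import defaultdict


class ProfileManager:
    """Aggregate information per company, keyed by CIK (Central Index Key)."""

    def __init__(self):
        self._profiles = defaultdict(
            lambda: {"names": set(), "urls": [], "subsidiaries": []}
        )
        self._name_to_cik = {}

    def add_company(self, cik, name):
        self._profiles[cik]["names"].add(name)
        self._name_to_cik[name.lower()] = cik

    def add_url(self, cik, url):
        self._profiles[cik]["urls"].append(url)

    def add_subsidiary(self, cik, subsidiary):
        self._profiles[cik]["subsidiaries"].append(subsidiary)

    def cik_for(self, name):
        """Utility lookup: company name (case-insensitive) to CIK, or None."""
        return self._name_to_cik.get(name.lower())

    def profile(self, cik):
        """Return everything aggregated for one company, or None if unknown."""
        return self._profiles.get(cik)


manager = ProfileManager()
manager.add_company("0000000001", "3M")  # placeholder CIK, not the real one
manager.add_url("0000000001", "https://en.wikipedia.org/wiki/3M")
manager.add_subsidiary("0000000001", "Example Subsidiary LLC")  # made-up name
print(manager.profile(manager.cik_for("3m")))
```

Keying every record on one identifier lets results from different crawlers (SEC filings, Wikipedia, subsidiary lists) be merged into a single profile per company.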
[Pipeline diagram repeated. New in this step: a Positive Feedback arrow and the Master Document output, e.g. "The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company".]
Master Documents
• Aggregates all the relevant company info
• Wikipedia
• Subsidiaries
• Web Crawler results
• Produced thousands for Praedicat and our code can produce as many as needed
https://github.com/himahuja/pcatxcore
[Pipeline diagram repeated. New in this step: the Master Document is passed to Triple Construction (Reverb).]
Triple Construction (Reverb): "Tim Cook is heading Apple." → (TimCook, heads, Apple)
Open Information Extraction
• Need to convert relevant text to structured data
• Reverb gives us this capability using Natural Language Processing
A. Fader, S. Soderland, and O. Etzioni, Identifying relations for open information extraction, in
Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP
’11, Stroudsburg, PA, USA, 2011, Association for Computational Linguistics, pp. 1535–1545.
[Pipeline diagram repeated. New in this step: Triple Construction (Reverb) emits (S, P, O) triples, e.g. "Tim Cook is heading Apple." → (TimCook, heads, Apple), as facts to be checked by Computational Fact-Checking. Facts with a low truth value are discarded; facts with a high truth value update the Knowledge Graph, which starts from the Base Knowledge Graph (positive feedback).]
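The feedback loop in the diagram (check each candidate fact, discard low truth values, add high truth values to the knowledge graph) can be sketched as follows. The threshold of 0.5 and the scoring function `toy_truth_value` are assumptions; the toy scorer is only a stand-in so that the example runs, and is not a real fact-checking algorithm.

```python
def update_knowledge_graph(graph, candidates, truth_value, threshold=0.5):
    """Add candidate (S, P, O) triples whose truth value meets the threshold."""
    accepted, discarded = [], []
    for triple in candidates:
        if truth_value(graph, triple) >= threshold:
            graph.add(triple)          # positive feedback into the graph
            accepted.append(triple)
        else:
            discarded.append(triple)   # low truth value: drop the fact
    return accepted, discarded


def toy_truth_value(graph, triple):
    """Stand-in scorer: 1.0 if both entities already appear in the graph."""
    entities = {e for s, _, o in graph for e in (s, o)}
    subject, _, obj = triple
    return 1.0 if subject in entities and obj in entities else 0.0


graph = {("TimCook", "worksAt", "Apple")}   # hypothetical base knowledge graph
candidates = [("TimCook", "heads", "Apple"), ("TimCook", "heads", "Mars")]
accepted, discarded = update_knowledge_graph(graph, candidates, toy_truth_value)
print(accepted)
print(discarded)
```

Because accepted facts are written back into `graph`, later candidates are scored against a knowledge graph that already includes the earlier high-truth-value facts.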
Knowledge Graphs
Knowledge Linker
• Valid facts should lie along specific paths
G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini,
Computational fact checking from knowledge networks, PLOS ONE, 10 (2015).
[Figure: knowledge-graph path example with "Is in" edges; node labels include "Institute of" and "Westwood, Los Angeles, California, US".]
Knowledge Stream
P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in
knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
• A "stream" (set of paths) provides more context than a single path
• Relational similarity improves the path specificity equation in Knowledge Linker
[Figure: example graph with node labels Math, RIPS, Ph.D.s, Papers.]
PredPath
B. Shi and T. Weninger, Fact checking in large knowledge graphs - A
discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
[Figure series: "has major" examples. Node labels: UCLA, Math, CMU, C.S., College, Subject, Ph.D.s, Students, Finger Painting. Candidate "has major" facts are marked High Truth Value or Low Truth Value.]
Towards a New Computational Fact-Checking Algorithm
[Figure: example graph; labels in extraction order: "Math", "Workshop", "UCLA", "CMU", "Robotics", "Program in", "Anthropology".]
StreamMiner, motivated by PredPath*
Why both negative and positive samples?
[Figure: example graphs labeled Positive Sample and Negative Sample.]
Built negative and positive feature sets for training on graphs.
*B. Shi and T. Weninger, Fact checking in large knowledge graphs - A
discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
Path Specificity
Node Specificity: how general the idea of the node is (how many concepts are connected to it)
Very General: University
Very Specific: Conference Room, IPAM, UCLA
Path Similarity: how similar two relations are
e.g.: Mentors
Highly Similar: advises, counsels
Less similar: robs, steals
Path Specificity = Node Specificity + Path Similarity
StreamMiner, motivated by KREL-LINKER*
Path Specificity = Node Specificity + Path Similarity
• Node Specificity: logarithm of node in-degree
• Path Similarity: relational similarity w.r.t. predicate P, as cosine distance of co-occurrence
*P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in
knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
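The two ingredients named on this slide can be computed in a few lines. This sketch makes several assumptions: the co-occurrence vectors are invented, the helper names are hypothetical, and the code computes cosine similarity (one minus the cosine distance mentioned on the slide) so that larger values mean more similar relations.

```python
import math


def node_generality(in_degree):
    """Logarithm of node in-degree: larger values mean a more general node."""
    return math.log(in_degree)


def cosine_similarity(a, b):
    """Cosine similarity of two co-occurrence vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical co-occurrence counts of three predicates with three others.
cooccurrence = {
    "advises": [3, 1, 0],
    "counsels": [6, 2, 0],
    "robs": [0, 0, 5],
}

print(node_generality(1))     # a node with a single incoming edge
print(node_generality(1000))  # a hub such as "University"
print(cosine_similarity(cooccurrence["advises"], cooccurrence["counsels"]))
print(cosine_similarity(cooccurrence["advises"], cooccurrence["robs"]))
```

In this toy data "advises" and "counsels" co-occur with the same predicates in the same proportions, so their similarity is close to 1, while "advises" and "robs" share no co-occurrences and score 0.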
Path Specificity is more important than Path length
[Figure: example graph with node labels Place, UCLA, University, Team, UCLA Bruins, and edges labeled "is a".]
P = "is a?" (predicate in question)
u(P, is a) = 1
u(P, has a) = 0.6
u(P, has athletic team) = 0.1
StreamMiner, motivated by Knowledge Stream*
* P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in
knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
Use of Transitive Closure on Dijkstra’s Algorithm with
Yen’s K-Shortest paths for mining path specificity
instead of path length.
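To make "path specificity instead of path length" concrete, the sketch below runs plain Dijkstra over edge costs derived from specificity rather than from hop count. It is a simplified illustration and not StreamMiner itself: the transitive-closure step and Yen's K-shortest paths are omitted, the graph, in-degrees and node names are invented, and the edge cost `log(in_degree) + (1 - u)` is only one assumed way of combining node specificity with relational similarity.

```python
import heapq
import math


def most_specific_path(edges, in_degree, similarity, source, target):
    """Dijkstra where generic nodes and dissimilar predicates cost more."""
    adjacency = {}
    for src, predicate, dst in edges:
        # Assumed cost: log in-degree of the node entered, plus a penalty
        # for predicates that are dissimilar to the predicate in question.
        cost = math.log(in_degree[dst]) + (1.0 - similarity[predicate])
        adjacency.setdefault(src, []).append((dst, cost))
    best = {source: 0.0}
    queue = [(0.0, source, [source])]
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, step in adjacency.get(node, []):
            new_cost = cost + step
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                heapq.heappush(queue, (new_cost, nxt, path + [nxt]))
    return math.inf, []


# Hypothetical graph: a 2-hop route through a very general hub versus a
# 3-hop route through specific nodes.
edges = [
    ("s", "is a", "hub"), ("hub", "is a", "t"),
    ("s", "is a", "x"), ("x", "is a", "y"), ("y", "is a", "t"),
]
in_degree = {"hub": 1000, "x": 2, "y": 2, "t": 3}
similarity = {"is a": 1.0, "has a": 0.6, "has athletic team": 0.1}

cost, path = most_specific_path(edges, in_degree, similarity, "s", "t")
print(path, round(cost, 3))
```

The route through the hub has fewer hops, but entering a node with in-degree 1000 is so costly that the longer route through the specific nodes is preferred.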
Stream Miner, Novel Fact Checking Algorithm
• Use of positive and negative feature sets. (Motivated by PredPath)
• Use of both node specificity and path similarity. (Motivated by K-REL-LINKER)
• Use of Transitive Closure on Dijkstra’s Algorithm with Yen’s K-Weighted Shortest paths for mining path specificity instead of path length. (Motivated by Knowledge Stream)
Stream Miner: Performance
In its first run on a sub-sample database, Stream Miner achieved an average AUROC (area under the true-positive-rate vs. false-positive-rate curve) of 86.325, on par with the benchmark and state-of-the-art model PredPath.
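For readers unfamiliar with the metric, AUROC can be computed directly as the fraction of (true fact, false fact) pairs in which the true fact receives the higher score. The labels and scores below are invented for illustration and are not the project's data; note also that this sketch returns a value between 0 and 1, whereas the figure reported above appears to be on a 0 to 100 scale.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical truth values assigned to two true facts and two false facts.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.85]
print(auroc(labels, scores))  # 0.75
```

Three of the four positive-negative pairs are ranked correctly here, which gives 0.75; a perfect ranking would give 1.0 and a random one about 0.5.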
[Full pipeline diagram repeated: Web Crawling Framework → Information Classification & Aggregation → Computational Fact-Checking using KGs.]
Conclusion
Contributions
• A web crawling, classification and fact-checking architecture.
• A classification technique for retrieving relevant information.
• A fact-checking algorithm, StreamMiner, for checking information credibility.
Contribution: Making Impact
• Scaled up the Analysts' ability to retrieve information
• Data of 52,000+ Companies for decision-making
Acknowledgements
Shadi Shahsavari, our Academic Mentor
Dr. Stephen DeSalvo, Industry Mentor
Melissa Boudrea, Industry Sponsor
Urjit Patel, Industry Mentor
Susana Serna, our Program Director
David Medina, our IT Professional
Dimi Mavalski, Program Coordinator
Ronald McFarland, Program Coordinator
Questions?

More Related Content

What's hot

Single View of the Customer
Single View of the Customer Single View of the Customer
Single View of the Customer
MongoDB
 
Responsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillarsResponsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillars
Sofus Macskássy
 
University Single Constituent View Repository ( SCoRe)
University Single Constituent View Repository ( SCoRe)University Single Constituent View Repository ( SCoRe)
University Single Constituent View Repository ( SCoRe)
Hemant Verma
 
Eu gdpr technical workflow and productionalization neccessary w privacy ass...
Eu gdpr technical workflow and productionalization   neccessary w privacy ass...Eu gdpr technical workflow and productionalization   neccessary w privacy ass...
Eu gdpr technical workflow and productionalization neccessary w privacy ass...
Steven Meister
 
10844 5415 The Value Of Corporate Secrets
10844 5415 The Value Of Corporate Secrets10844 5415 The Value Of Corporate Secrets
10844 5415 The Value Of Corporate Secrets
GuardEra Access Solutions, Inc.
 
Structured Content Meets Taxonomy
Structured Content Meets TaxonomyStructured Content Meets Taxonomy
Structured Content Meets Taxonomy
Semantic Web Company
 
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
Antidot
 
Linking SharePoint Documents with Structured Data
Linking SharePoint Documents with Structured DataLinking SharePoint Documents with Structured Data
Linking SharePoint Documents with Structured Data
Semantic Web Company
 
PROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked DataPROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked Data
Semantic Web Company
 
F E A D R M A K M 2005 03 28
F E A  D R M  A K M 2005 03 28F E A  D R M  A K M 2005 03 28
F E A D R M A K M 2005 03 28
Amit Maitra
 

What's hot (10)

Single View of the Customer
Single View of the Customer Single View of the Customer
Single View of the Customer
 
Responsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillarsResponsible Data Use in AI - core tech pillars
Responsible Data Use in AI - core tech pillars
 
University Single Constituent View Repository ( SCoRe)
University Single Constituent View Repository ( SCoRe)University Single Constituent View Repository ( SCoRe)
University Single Constituent View Repository ( SCoRe)
 
Eu gdpr technical workflow and productionalization neccessary w privacy ass...
Eu gdpr technical workflow and productionalization   neccessary w privacy ass...Eu gdpr technical workflow and productionalization   neccessary w privacy ass...
Eu gdpr technical workflow and productionalization neccessary w privacy ass...
 
10844 5415 The Value Of Corporate Secrets
10844 5415 The Value Of Corporate Secrets10844 5415 The Value Of Corporate Secrets
10844 5415 The Value Of Corporate Secrets
 
Structured Content Meets Taxonomy
Structured Content Meets TaxonomyStructured Content Meets Taxonomy
Structured Content Meets Taxonomy
 
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
ISWC 2012 - Industry Track: "Linked Enterprise Data: leveraging the Semantic ...
 
Linking SharePoint Documents with Structured Data
Linking SharePoint Documents with Structured DataLinking SharePoint Documents with Structured Data
Linking SharePoint Documents with Structured Data
 
PROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked DataPROPEL . Austrian's Roadmap for Enterprise Linked Data
PROPEL . Austrian's Roadmap for Enterprise Linked Data
 
F E A D R M A K M 2005 03 28
F E A  D R M  A K M 2005 03 28F E A  D R M  A K M 2005 03 28
F E A D R M A K M 2005 03 28
 

Similar to Information Extraction and Aggregation from Unstructured Web Data for Business Profiling

Apps
AppsApps
Web mining and social media mining
Web mining and social media miningWeb mining and social media mining
Web mining and social media mining
Roxana Tadayon
 
Web
WebWeb
Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise
deteo
 
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Cambridge Semantics
 
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdfUnveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
AqsaBatool21
 
A data-centric program
A data-centric program A data-centric program
A data-centric program
at MicroFocus Italy ❖✔
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data Platforms
Olha Hrytsay
 
Klarna Tech Talk - Mind the Data!
Klarna Tech Talk - Mind the Data!Klarna Tech Talk - Mind the Data!
Klarna Tech Talk - Mind the Data!
Jeffrey T. Pollock
 
Electronic Data Discovery
Electronic Data DiscoveryElectronic Data Discovery
Electronic Data Discovery
Sigma Infosolutions, LLC
 
A Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining PresentationA Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining Presentation
millerca2
 
Using Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic CommerceUsing Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic Commerce
Ella Mae Ayen
 
Electronic Commerce
Electronic CommerceElectronic Commerce
Electronic Commerce
ellamee27
 
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdfWhat Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
AqsaBatool21
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
Harvinder Atwal
 
Office 365 : Data leakage control, privacy, compliance and regulations in the...
Office 365 : Data leakage control, privacy, compliance and regulations in the...Office 365 : Data leakage control, privacy, compliance and regulations in the...
Office 365 : Data leakage control, privacy, compliance and regulations in the...
Edge Pereira
 
UK Cyber Vulnerability Index 2013
UK Cyber Vulnerability Index 2013UK Cyber Vulnerability Index 2013
UK Cyber Vulnerability Index 2013
Martin Jordan
 
Your Secret Weapon to Extract Data from Multiple Websites.pdf
Your Secret Weapon to Extract Data from Multiple Websites.pdfYour Secret Weapon to Extract Data from Multiple Websites.pdf
Your Secret Weapon to Extract Data from Multiple Websites.pdf
AqsaBatool21
 
conceptClassifier For SharePoint Driving Business Value
conceptClassifier For SharePoint Driving Business ValueconceptClassifier For SharePoint Driving Business Value
conceptClassifier For SharePoint Driving Business Value
martingarland
 
Sqrrl Enterprise: Integrate, Explore, Analyze
Sqrrl Enterprise: Integrate, Explore, AnalyzeSqrrl Enterprise: Integrate, Explore, Analyze
Sqrrl Enterprise: Integrate, Explore, Analyze
Sqrrl
 

Similar to Information Extraction and Aggregation from Unstructured Web Data for Business Profiling (20)

Apps
AppsApps
Apps
 
Web mining and social media mining
Web mining and social media miningWeb mining and social media mining
Web mining and social media mining
 
Web
WebWeb
Web
 
Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise Deteo. Data science, Big Data expertise
Deteo. Data science, Big Data expertise
 
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
Anzo Smart Data Lake 4.0 - a Data Lake Platform for the Enterprise Informatio...
 
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdfUnveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
Unveiling The Powerhouse LinkedIn Data Scraper By AhmadsoftwareCom.pdf
 
A data-centric program
A data-centric program A data-centric program
A data-centric program
 
Building Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data PlatformsBuilding Predictive Analytics on Big Data Platforms
Building Predictive Analytics on Big Data Platforms
 
Klarna Tech Talk - Mind the Data!
Klarna Tech Talk - Mind the Data!Klarna Tech Talk - Mind the Data!
Klarna Tech Talk - Mind the Data!
 
Electronic Data Discovery
Electronic Data DiscoveryElectronic Data Discovery
Electronic Data Discovery
 
A Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining PresentationA Practical Approach To Data Mining Presentation
A Practical Approach To Data Mining Presentation
 
Using Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic CommerceUsing Information Technology to Engage in Electronic Commerce
Using Information Technology to Engage in Electronic Commerce
 
Electronic Commerce
Electronic CommerceElectronic Commerce
Electronic Commerce
 
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdfWhat Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
What Is The Best Tool To Scrape LinkedIn Businesses Data.pdf
 
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder AtwalDataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
DataOps - Big Data and AI World London - March 2020 - Harvinder Atwal
 
Office 365 : Data leakage control, privacy, compliance and regulations in the...
Office 365 : Data leakage control, privacy, compliance and regulations in the...Office 365 : Data leakage control, privacy, compliance and regulations in the...
Office 365 : Data leakage control, privacy, compliance and regulations in the...
 
UK Cyber Vulnerability Index 2013
UK Cyber Vulnerability Index 2013UK Cyber Vulnerability Index 2013
UK Cyber Vulnerability Index 2013
 
Your Secret Weapon to Extract Data from Multiple Websites.pdf
Your Secret Weapon to Extract Data from Multiple Websites.pdfYour Secret Weapon to Extract Data from Multiple Websites.pdf
Your Secret Weapon to Extract Data from Multiple Websites.pdf
 
conceptClassifier For SharePoint Driving Business Value
conceptClassifier For SharePoint Driving Business ValueconceptClassifier For SharePoint Driving Business Value
conceptClassifier For SharePoint Driving Business Value
 
Sqrrl Enterprise: Integrate, Explore, Analyze
Sqrrl Enterprise: Integrate, Explore, AnalyzeSqrrl Enterprise: Integrate, Explore, Analyze
Sqrrl Enterprise: Integrate, Explore, Analyze
 

Recently uploaded

一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
一比一原版多伦多大学毕业证(UofT毕业证书)学历如何办理
eoxhsaa
 
Information Extraction and Aggregation from Unstructured Web Data for Business Profiling

  • 1. Information Extraction and Aggregation from Unstructured Web Data for Business Profiling. Student Team: Liang Shi, Alexander Michels, Himanshu Ahuja. Academic Mentor: Shadi Shahsavari. Industry Mentors: Dr. Stephen DeSalvo, Urjit Patel.
  • 2. Praedicat: An Insurance Tech Company. Determine litigation risks; predict the likely amount of losses. Workflow: 1. Manual Search; 2. Credible Database; 3. Forward-looking Models; 4. Predict Likely Losses.
  • 3. Where do we fit in? RIPS Team: Automating. Workflow: 1. Manual Search; 2. Credible Database; 3. Forward-looking Models; 4. Predict Likely Losses.
  • 4. Manual Search Process: 1. Search Engine; 2. Site Search; 3. Evaluate Contents; Change Keywords.
  • 5. Difficulty of Searching Information: Government Databases; News. Scale: from Less Indicative of Litigation Risks to More Indicative of Litigation Risks.
  • 6. Structured Web Pages: Facility Report for 3M; Facility Report for Samsung.
  • 7. Structured Web Pages: Facility Report for 3M; Facility Report for Samsung.
  • 10. Problem Statement: How can we automate information extraction, classification, and fact-checking for unstructured data on the Internet?
  • 11. Solution Overview: Web Crawling Framework; Information Classification & Aggregation; Computational Fact-Checking using KGs.
  • 12. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Raw text. Information Classification & Aggregation: Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 13. Query Formulator: Asking about the right things! Zero useful results. PDF result mentions Rentokil Initial PLC involvement in window cleaning. ‘Apple Inc.’ returns the right results.
  • 14. Query Formulator: How did we ask the right things? Name of the company; optional alias; making keywords mandatory; making some words optional; mention the file type.
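The query pattern shown on the slides (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF) can be sketched as a small string builder. This is a minimal illustration only: the function name `formulate_query` and its parameters are hypothetical, and whether a given search engine honours the quoting, `~`, and `filetype:` operators is an assumption, not something the slides confirm.

```python
def formulate_query(company, optional_terms=(), filetype=None):
    """Build a search query string in the style shown on the slides.

    Hypothetical helper: the company name is quoted to make it mandatory,
    optional terms are prefixed with '~', and a file type filter is appended.
    """
    parts = [f'"{company}"']                      # mandatory exact phrase
    parts.extend(f"~{term}" for term in optional_terms)  # optional terms
    if filetype:
        parts.append(f"filetype:{filetype}")      # restrict result file type
    return " ".join(parts)


query = formulate_query("3M", ["Minnesota", "Mining", "Manufacturing"], "PDF")
```

Here the optional terms play the role of the "optional alias" on the slide: they are the words of the company's former name.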
  • 15. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; Raw text. Information Classification & Aggregation: Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 16. Web Crawling: What is web crawling? (Diagram: Start, End.)
  • 17. Web Crawling: Unsupervised machines cannot be trusted. Start with a Google search of the company and its business activity. The business activity appears in the financial report, which is found through the search service provided by the website itself. (Diagram: Start, End.)
  • 18. Web Crawling: Where and how far? The problem: we don’t know how far to dig or where to dig. We don’t know which sources are credible or where the information lies on those sources.
  • 19. Web Crawling: Credible data to the rescue. Interestingly, the structured data (available on federal websites and Wikipedia) is also credible. We designed specific crawlers to get data from specific databases, and created baseline data to support unsupervised web crawling.
  • 20. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw text. Information Classification & Aggregation: Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 21. Parser: Getting unstructured data. Use of text abundance to locate meaningful paragraphs. Filtering out tags containing social media redirects. Removing graphic content and advertisements.
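A rough sketch of the parsing idea above, using only Python's standard `html.parser`. It is not the team's parser: the word-count threshold standing in for "text abundance", the `SOCIAL_HOSTS` list, and the class and function names are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed list of social media hosts whose links should be dropped.
SOCIAL_HOSTS = ("facebook.com", "twitter.com", "linkedin.com")


class ParagraphExtractor(HTMLParser):
    """Collect <p> text that is long enough to look like real content."""

    def __init__(self, min_words=8):
        super().__init__()
        self.min_words = min_words   # assumed "text abundance" threshold
        self.paragraphs = []
        self._buf = None             # text pieces of the current <p>
        self._skip_tag = None        # tag whose content is being ignored

    def handle_starttag(self, tag, attrs):
        if self._skip_tag:
            return
        href = dict(attrs).get("href") or ""
        is_social = tag == "a" and any(h in href for h in SOCIAL_HOSTS)
        if tag in ("script", "style") or is_social:
            self._skip_tag = tag     # ignore everything until this tag closes
        elif tag == "p":
            self._buf = []

    def handle_endtag(self, tag):
        if self._skip_tag:
            if tag == self._skip_tag:
                self._skip_tag = None
            return
        if tag == "p" and self._buf is not None:
            text = " ".join("".join(self._buf).split())
            if len(text.split()) >= self.min_words:
                self.paragraphs.append(text)
            self._buf = None

    def handle_data(self, data):
        if self._skip_tag is None and self._buf is not None:
            self._buf.append(data)


def extract_paragraphs(html, min_words=8):
    parser = ParagraphExtractor(min_words)
    parser.feed(html)
    parser.close()
    return parser.paragraphs
```

Images and other graphic elements carry no paragraph text, so they fall out of the result without special handling; advertisement blocks would need site-specific rules that this sketch does not attempt.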
  • 22. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Raw text. Information Classification & Aggregation: Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 23. Web Resource Manager: UUID (Universally Unique Identifier). Files: source/resource_uuid.(pdf/html); docs/resource_uuid.json
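The file layout on this slide can be illustrated with Python's standard `uuid` module. The helper name `resource_paths` and the choice of a random `uuid4` are assumptions; the slide only states that resources are stored under `source/` and `docs/` and keyed by a UUID.

```python
import uuid
from pathlib import PurePosixPath


def resource_paths(extension, resource_id=None):
    """Return storage paths for one crawled resource (hypothetical helper).

    The raw download goes to source/<uuid>.<pdf|html> and the parsed
    record to docs/<uuid>.json, mirroring the layout on the slide.
    """
    if extension not in ("pdf", "html"):
        raise ValueError("extension must be 'pdf' or 'html'")
    rid = resource_id or uuid.uuid4()   # assumption: random version-4 UUID
    return {
        "uuid": str(rid),
        "source": str(PurePosixPath("source") / f"{rid}.{extension}"),
        "doc": str(PurePosixPath("docs") / f"{rid}.json"),
    }


paths = resource_paths("pdf")
```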
  • 24. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …). Information Classification & Aggregation: Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph. Output Raw text: The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company, is an American multinational conglomerate corporation operating in the fields of industry, health care, and consumer goods.[2] The company produces a variety of products, including adhesives, abrasives, laminates, passive fire protection, personal protective equipment, dental and orthodontic products, electronic materials, medical products, car-care products,[3] electronic circuits, healthcare software and optical films.[4]
  • 25. Outputs of Site Crawlers (Data): financial statements for 52,629 companies; 21,202 Facility Reports; product and ingredient lists for 4,535 companies; thousands of subsidiary structures; tens of thousands of Wikipedia pages.
  • 26. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Output Raw text (the 3M excerpt from slide 24). Information Classification & Aggregation: Classifier; Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 27. Self-Supervised Learning. Label: labels its own training examples using heuristics. Train Classifier: trains a classifier on the examples it labeled. Use Classifier: classifies using the features it learned from self-labeled data.
  • 28. Doc2Vec: Represents the semantic meaning of documents in a vector space. You can "tag" documents with topics. We can attempt to cluster or classify documents using tags. (Diagram labels: Apple, iPhone, Swift, Mac.)
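The label, train, use loop on slide 27 can be sketched with a tiny bag-of-words Naive Bayes classifier. This is only an illustration of the loop, not the team's model (the slides mention Doc2Vec for the features): the cue words, the short-text heuristic, and all function names are assumptions.

```python
import math
from collections import Counter

# Hypothetical heuristic cues for "relevant" text; not from the slides.
RELEVANT_CUES = {"acquired", "subsidiary", "produces"}


def tokenize(text):
    return [w.strip(".,") for w in text.lower().split()]


def heuristic_label(text):
    """Step 1 (Label): 1 = relevant, 0 = irrelevant, None = leave unlabeled."""
    words = tokenize(text)
    if any(w in RELEVANT_CUES for w in words):
        return 1
    if len(words) < 4:          # assumed: very short fragments are noise
        return 0
    return None


def train(texts):
    """Step 2 (Train): count words per class on the self-labeled examples."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text in texts:
        label = heuristic_label(text)
        if label is None:
            continue
        docs[label] += 1
        counts[label].update(tokenize(text))
    return counts, docs


def classify(model, text):
    """Step 3 (Use): Naive Bayes with add-one smoothing."""
    counts, docs = model
    vocab_size = len(set(counts[0]) | set(counts[1]))
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for word in tokenize(text):
            score += math.log((counts[label][word] + 1) / (total + vocab_size + 1))
        scores[label] = score
    return max(scores, key=scores.get)
```

The point of the design is that the classifier can then label text the heuristics left undecided, such as a sentence that contains none of the cue words but shares vocabulary with the relevant examples.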
  • 29. Classification Results: Web Pages. TF-IDF produced: "- riddel j"; "1941"; "rhop"; "danaida"; "- boisduv j". We produced: "2014 Chemr acquired 3D-Radar as a subsidiary of Curtiss-Wright Corporation in May 2014".
  • 30. Classification Results: Financial Statements. TF-IDF produced: "item 3"; "asu no"; "see note 2"; "10"; "-11". We produced: "these challenges add to the uncertainties of the legislative changes enacted as part of ACA".
  • 31. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Output Raw text (the 3M excerpt from slide 24). Information Classification & Aggregation: Classifier; Relevant Text Documents; Profile Manager (CIK → SEC, NAICS → SEC, SEC → CIK; Pfizer, 3M, Dole; URLs, Wiki, Subsidiaries); Master Document. Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 32. Profile Manager: aggregates information by company; queryable; contains utility functions. (Diagram label: Central Index Key.)
  • 33. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Output Raw text (the 3M excerpt from slide 24). Information Classification & Aggregation: Classifier; Relevant Text Documents; Positive Feedback; Profile Manager (CIK → SEC, NAICS → SEC, SEC → CIK; Pfizer, 3M, Dole; URLs, Wiki, Subsidiaries); Output Master Document ("The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company"). Computational Fact-Checking using KGs: Base Knowledge Graph.
  • 34. Master Documents: aggregate all the relevant company info (Wikipedia, subsidiaries, Web Crawler results). We produced thousands for Praedicat, and our code can produce as many as needed. https://github.com/himahuja/pcatxcore
  • 35. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Output Raw text (the 3M excerpt from slide 24). Information Classification & Aggregation: Classifier; Relevant Text Documents; Positive Feedback; Profile Manager (CIK → SEC, NAICS → SEC, SEC → CIK; Pfizer, 3M, Dole; URLs, Wiki, Subsidiaries); Output Master Document ("The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company"). Computational Fact-Checking using KGs: Triple Construction (Reverb), e.g. "Tim Cook is heading Apple." becomes (TimCook, heads, Apple); Base Knowledge Graph.
  • 36. Open Information Extraction: We need to convert relevant text to structured data. Reverb gives us this capability using Natural Language Processing. A. Fader, S. Soderland, and O. Etzioni, Identifying relations for open information extraction, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’11, Stroudsburg, PA, USA, 2011, Association for Computational Linguistics, pp. 1535–1545.
  • 37. [Pipeline diagram] Web Crawling Framework: Company Name; Query Formulator; Refined Query (“3M” ~Minnesota ~Mining ~Manufacturing filetype:PDF, with mandatory and optional terms); Web Crawler; List of URLs; Parser; Raw Texts; Web Resource Manager (SEC, Source URLs, …); Output Raw text (the 3M excerpt from slide 24). Information Classification & Aggregation: Classifier; Relevant Text Documents; Positive Feedback; Profile Manager (CIK → SEC, NAICS → SEC, SEC → CIK; Pfizer, 3M, Dole; URLs, Wiki, Subsidiaries); Output Master Document ("The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company"). Computational Fact-Checking using KGs: Triple Construction (Reverb) into (S, P, O) triples, e.g. "Tim Cook is heading Apple." becomes (TimCook, heads, Apple); Facts to be checked; Computational Fact-Checking; Discarded facts (low truth value); Knowledge Graph update with high-truth-value facts; Positive Feedback; Base Knowledge Graph.
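The feedback loop in this diagram (check facts, discard low-truth-value ones, add high-truth-value ones to the knowledge graph) can be sketched as follows. The `Triple` type, the `update_graph` helper, the 0.5 threshold, and the example scores are all hypothetical; in the pipeline the truth values would come from the fact-checking algorithm, which is not reproduced here.

```python
from collections import namedtuple

# (S, P, O) triple as produced by the triple-construction step.
Triple = namedtuple("Triple", "subject predicate object")


def update_graph(graph, scored_facts, threshold=0.5):
    """Add facts whose truth value reaches the threshold; return the rest.

    `graph` is a plain set of Triples standing in for the knowledge graph;
    `scored_facts` is an iterable of (Triple, truth_value) pairs.
    The threshold value is an assumption for illustration.
    """
    discarded = []
    for triple, truth_value in scored_facts:
        if truth_value >= threshold:
            graph.add(triple)          # high truth value: update the graph
        else:
            discarded.append(triple)   # low truth value: discard
    return discarded


graph = set()
discarded = update_graph(graph, [
    (Triple("TimCook", "heads", "Apple"), 0.9),     # made-up score
    (Triple("TimCook", "heads", "Samsung"), 0.1),   # made-up score
])
```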
  • 39. KnowledgeLinker: Valid facts should lie along specific paths. (Diagram: Westwood, Los Angeles, California, US, linked by "is in" edges.) G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini, Computational fact checking from knowledge networks, PLOS ONE, 10 (2015).
  • 40. Knowledge Stream: A "stream" (set of paths) provides more context than a single path. Relational similarity improves the path specificity equation in Knowledge Linker. (Diagram labels: Institute of, Math, RIPS, Ph.D.s, Papers.) P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
  • 41. PredPath. (Diagram labels: UCLA, Math, has major; College, Subject, has major.) B. Shi and T. Weninger, Fact checking in large knowledge graphs - A discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
  • 42. PredPath. (Diagram labels: CMU, C.S., has major; Ph.D.s, Finger Painting, Students, has major.) B. Shi and T. Weninger, Fact checking in large knowledge graphs - A discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
  • 43. PredPath. (Diagram labels: UCLA, Math, has major; Ph.D.s, Finger Painting, Students, has major; High Truth Value; Low Truth Value.) B. Shi and T. Weninger, Fact checking in large knowledge graphs - A discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
  • 44. Towards a New Computational Fact-Checking Algorithm
  • 45. Why both negative and positive samples? (Diagram labels: Math Workshop, UCLA, CMU, Robotics, Program in Anthropology; three items marked Positive Sample and one marked Negative Sample.)
  • 46. StreamMiner, motivated by PredPath*: Built negative and positive feature sets for training on graphs. *B. Shi and T. Weninger, Fact checking in large knowledge graphs - A discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).
  • 47. Path Specificity = Node Specificity + Path Similarity. Node specificity: how general the idea of the node is (how many concepts are connected to it). Very general: University. Very specific: Conference Room, IPAM, UCLA. Path similarity: how similar two relations are. E.g. for "mentors", highly similar: advises, counsels; less similar: robs, steals.
  • 48. StreamMiner, motivated by KREL-LINKER*: Path Specificity = Node Specificity + Path Similarity. Node specificity: logarithm of node in-degree. Path similarity: relational similarity w.r.t. predicate P, as cosine distance of co-occurrence. *P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
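The two ingredients named on this slide can be computed as follows. This is a sketch of the components only: how StreamMiner combines them into a single path score is not spelled out on the slides, so no combination is attempted. The function names and the toy triples are assumptions.

```python
import math
from collections import Counter


def log_in_degree(triples):
    """Logarithm of each node's in-degree, from (subject, predicate, object) triples.

    A node with many incoming edges (a general concept) gets a larger value.
    """
    in_degree = Counter(obj for _, _, obj in triples)
    return {node: math.log(degree) for node, degree in in_degree.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse co-occurrence vectors (dicts)."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy graph, made up for illustration.
triples = [
    ("UCLA", "is_a", "University"),
    ("CMU", "is_a", "University"),
    ("IPAM", "located_in", "UCLA"),
]
generality = log_in_degree(triples)
```

In the toy graph, "University" has two incoming edges and "UCLA" has one, so "University" scores as the more general node, matching the "Very General: University" example on slide 47.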
  • 49. Path Specificity is more important than Path length. P = predicate in question. (Diagram labels: Place, UCLA, University, Team, UCLA Bruins; edges: is a, University, is a, is a?; u(P, is a) = 1; u(P, is a) = 1; u(P, has a) = 0.6; u(P, is a) = 1; u(P, has athletic team) = 0.1.)
  • 50. StreamMiner, motivated by Knowledge Stream*: Use of Transitive Closure on Dijkstra’s Algorithm with Yen’s K-Shortest paths for mining path specificity instead of path length. *P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).
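To illustrate searching by a path cost rather than by hop count, here is a plain Dijkstra sketch over a weighted graph. It covers only the single-cheapest-path part; Yen's K-shortest paths and the transitive closure step are not implemented. The node names are borrowed from slide 49, but the edge costs are made up and the function name is hypothetical.

```python
import heapq


def cheapest_path(graph, source, target):
    """Dijkstra's algorithm on a dict-of-dicts graph of non-negative edge costs.

    Returns (total_cost, path); (inf, []) if the target is unreachable.
    """
    queue = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


# Made-up costs: the direct edge is expensive (think: unspecific),
# while the two-hop route is cheap (think: specific).
graph = {
    "UCLA": {"Place": 5.0, "University": 1.0},
    "University": {"Place": 1.0},
}
cost, path = cheapest_path(graph, "UCLA", "Place")
```

With these costs the search prefers the two-hop route through "University" over the direct one-hop edge, which is the behaviour slide 49 argues for: a longer path can win when its edges are cheaper.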
  • 51. Stream Miner, Novel Fact Checking Algorithm. Motivated by PredPath: use of positive and negative feature sets. Motivated by K-REL-LINKER: use of both node specificity and path similarity. Motivated by Knowledge Stream: use of Transitive Closure on Dijkstra’s Algorithm with Yen’s K-Weighted Shortest paths for mining path specificity instead of path length.
  • 52. Stream Miner: Performance. In its first run on a sub-sample database, Stream Miner achieved an average AUROC (area under the true-positive vs. false-positive curve) of 86.325, on par with the benchmark and state-of-the-art model PredPath.
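For readers unfamiliar with the metric, AUROC can be computed directly as the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counting half. The sketch below uses made-up labels and scores, not the project's data, and the function name is an assumption.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for true facts, 0 for false facts; scores: predicted truth values.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up example: 3 of the 4 positive/negative pairs are ranked correctly.
value = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking; the slide's figure appears to be on a 0 to 100 scale.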
  • 53. [Pipeline diagram, same content as slide 37.]
  • 55. Contributions: A web crawling, classification and fact-checking architecture. A classification technique for retrieving relevant information. A fact-checking algorithm, StreamMiner, for checking information credibility.
  • 56. Contribution: Making Impact. Scaled up the analysts' ability to retrieve information. Data on 52,000+ companies for decision-making.
  • 57. Acknowledgements: Shadi Shahsavari, our Academic Mentor; Dr. Stephen DeSalvo, Industry Mentor; Melissa Boudrea, Industry Sponsor; Urjit Patel, Industry Mentor; Susana Serna, our Program Director; David Medina, our IT Professional; Dimi Mavalski, Program Coordinator; Ronald McFarland, Program Coordinator.