Information Extraction and Aggregation from Unstructured Web Data for Business Profiling

Student Team: Liang Shi, Alexander Michels, Himanshu Ahuja
Academic Mentor: Shadi Shahsavari
Industry Mentors: Dr. Stephen DeSalvo, Urjit Patel

Praedicat: An Insurance Tech Company
• Determine litigation risks
• Predict the likely amount of losses
Workflow: 1. Manual Search → 2. Credible Database → 3. Forward-looking Models → 4. Predict Likely Losses

Where do we fit in?
RIPS Team: Automating
Workflow: 1. Manual Search → 2. Credible Database → 3. Forward-looking Models → 4. Predict Likely Losses

Manual Search Process
1. Search Engine
2. Site Search
3. Evaluate Contents
Change Keywords

Difficulty of Searching Information
- Government Databases
- News
[Axis: Less Indicative of Litigation Risks → More Indicative of Litigation Risks]

Structured Web Pages
• Facility Report for 3M
• Facility Report for Samsung

Problem Statement
How can we automate information extraction, classification, and fact-checking for unstructured data on the Internet?

Solution Overview
• Web Crawling Framework
• Information Classification & Aggregation
• Computational Fact-Checking using KGs

Web Crawling Framework
[Pipeline diagram: Company Name → Query Formulator → query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Raw text → Master Document → Computational Fact-Checking using KGs → Base Knowledge Graph]

Query Formulator: Asking about the right things!
• Zero useful results: the PDF result mentions Rentokil Initial PLC involvement in window cleaning.
• ‘Apple Inc.’ returns the right results.

Query Formulator: How did we ask the right things?
• Name of the company
• Making keywords mandatory
• Making some words optional
• Optional alias
• Mention the file-type
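
The query-building steps above can be sketched as a small function. This is a minimal illustration, not the team's implementation: the function name `formulate_query` and its parameters are hypothetical, and the quoting, "~" and "filetype:" operators are simply copied from the example query on the slides.

```python
def formulate_query(company, optional_terms=(), aliases=(), filetype=None):
    """Build a search query string from a company name.

    Assumptions (taken from the slide's example, not from any search
    engine's documentation): quoting makes the name mandatory, a "~"
    prefix marks a term or alias as optional, and "filetype:" restricts
    the result type.
    """
    parts = [f'"{company}"']
    parts += [f"~{term}" for term in list(aliases) + list(optional_terms)]
    if filetype:
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)


query = formulate_query(
    "3M",
    optional_terms=["Minnesota", "Mining", "Manufacturing"],
    filetype="PDF",
)
print(query)  # "3M" ~Minnesota ~Mining ~Manufacturing filetype:PDF
```

Keeping the mandatory and optional parts as separate arguments makes it easy to retry a search with different optional keywords while the company name stays fixed.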

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → Raw text → Master Document → Base Knowledge Graph]

Web Crawling: What is Web Crawling?
[Diagram: a crawl from Start to End]

Web Crawling: Unsupervised machines cannot be trusted
[Diagram: a crawl from Start to End]
• Start with a Google search of the company and its business activity.
• The business activity appears in the financial report, which appears specifically in the search services provided by the website.

Web Crawling: Where and how far?
The problem:
• We don’t know how far to dig or where to dig.
• We don’t know which sources are credible, or where the information lies on those sources.

Web Crawling: Credible data to the rescue
• Interestingly, the structured data (available on Federal websites & Wikipedia) is also credible!
• Designed specific crawlers to get data from specific databases.
• Created baseline data to support unsupervised web crawling.
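
One way to bound "where and how far" is a breadth-first crawl with a depth limit and an allowlist of credible domains. The sketch below is an assumption about how such a crawler could look, not the team's crawler: `crawl`, `get_links` and the toy `LINKS` graph (including the example URLs) are hypothetical, and no real HTTP requests are made.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(seed_urls, get_links, credible_domains, max_depth=2):
    """Breadth-first crawl restricted to credible domains and a depth limit.

    get_links(url) is a caller-supplied function returning outgoing links;
    here it stands in for fetching and parsing a page.
    """
    seen = set()
    visited = []
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if url in seen or urlparse(url).netloc not in credible_domains:
            continue
        seen.add(url)
        visited.append(url)
        if depth < max_depth:
            queue.extend((link, depth + 1) for link in get_links(url))
    return visited


# Toy link graph standing in for real pages (hypothetical URLs).
LINKS = {
    "https://www.sec.gov/a": ["https://www.sec.gov/b", "https://ads.example.com/x"],
    "https://www.sec.gov/b": ["https://www.sec.gov/c"],
    "https://www.sec.gov/c": ["https://www.sec.gov/d"],
}
pages = crawl(
    ["https://www.sec.gov/a"],
    lambda url: LINKS.get(url, []),
    credible_domains={"www.sec.gov"},
    max_depth=2,
)
print(pages)
```

The advertisement link is dropped by the domain allowlist, and page `d` is never reached because it lies beyond the depth limit.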

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw text → Master Document → Base Knowledge Graph]

Parser: Getting unstructured data
• Use of text abundance to locate meaningful paragraphs.
• Filtering out tags containing social media redirects.
• Removing graphic contents and advertisements.
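
The parsing heuristics above can be sketched with Python's standard `html.parser`. This is an illustrative guess at the approach, not the team's parser: the word-count threshold, the `SOCIAL_HOSTS` list and the `SKIPPED_TAGS` set are all assumptions chosen for the example.

```python
from html.parser import HTMLParser

SOCIAL_HOSTS = ("facebook.com", "twitter.com", "linkedin.com")  # assumed list
SKIPPED_TAGS = {"script", "style", "figure", "aside"}  # assumed non-content containers


class ParagraphExtractor(HTMLParser):
    """Keep <p> text with enough words; drop social links and skipped containers."""

    def __init__(self, min_words=8):
        super().__init__()
        self.min_words = min_words
        self.paragraphs = []
        self._buffer = None           # text pieces of the current <p>, or None
        self._skip_depth = 0          # > 0 while inside a skipped container
        self._in_social_link = False  # True inside <a> pointing to a social host

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "p" and self._skip_depth == 0:
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            self._in_social_link = any(host in href for host in SOCIAL_HOSTS)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "a":
            self._in_social_link = False
        elif tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if len(text.split()) >= self.min_words:  # "text abundance" check
                self.paragraphs.append(text)
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None and self._skip_depth == 0 and not self._in_social_link:
            self._buffer.append(data)


def extract_paragraphs(html, min_words=8):
    parser = ParagraphExtractor(min_words)
    parser.feed(html)
    return parser.paragraphs


sample = """
<html><body>
<script>var x = 1;</script>
<p>Share <a href="https://twitter.com/share">Tweet this</a></p>
<aside><p>Buy our amazing product today at a huge discount while stocks last</p></aside>
<p>The 3M Company produces adhesives, abrasives, laminates and many other products.</p>
</body></html>
"""
print(extract_paragraphs(sample))
```

Only the last paragraph survives: the share line is too short once the social link is removed, and the promotional text sits inside a skipped container.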

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Raw text → Master Document → Base Knowledge Graph]

Web Resource Manager:
• Each resource is identified by a UUID (Universally Unique Identifier)
• source/resource_uuid.(pdf/html)
• docs/resource_uuid.json
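
The naming scheme above can be illustrated as follows. The slides do not say how the UUIDs are generated, so deriving them deterministically from the URL with `uuid.uuid5` is an assumption, as is the helper name `resource_paths`.

```python
import uuid
from pathlib import PurePosixPath


def resource_paths(url, extension):
    """Map a source URL to source/<uuid>.<ext> and docs/<uuid>.json."""
    if extension not in ("pdf", "html"):
        raise ValueError("extension must be 'pdf' or 'html'")
    # Assumption: one stable UUID per URL, so re-crawling does not duplicate files.
    resource_uuid = uuid.uuid5(uuid.NAMESPACE_URL, url)
    return {
        "uuid": str(resource_uuid),
        "source": str(PurePosixPath("source") / f"{resource_uuid}.{extension}"),
        "doc": str(PurePosixPath("docs") / f"{resource_uuid}.json"),
    }


paths = resource_paths("https://example.com/report.pdf", "pdf")
print(paths["source"])
print(paths["doc"])
```

A random `uuid.uuid4()` would work equally well if identifiers do not need to be reproducible from the URL.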

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Master Document → Base Knowledge Graph]
Output (raw text):
The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company, is an American multinational conglomerate corporation operating in the fields of industry, health care, and consumer goods.[2] The company produces a variety of products, including adhesives, abrasives, laminates, passive fire protection, personal protective equipment, dental and orthodontic products, electronic materials, medical products, car-care products,[3] electronic circuits, healthcare software and optical films.[4]

Outputs of Site Crawlers
Data:
• Financial statements for 52,629 companies
• 21,202 Facility Reports
• Product and ingredient lists for 4,535 companies
• Thousands of subsidiary structures
• Tens of thousands of Wikipedia pages

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Classifier → Master Document → Base Knowledge Graph]

Self-Supervised Learning
1. Label: labels its own training examples using heuristics
2. Train Classifier: trains a classifier on the examples it labeled
3. Use Classifier: classifies using the features it learned from self-labeled data
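
The three steps above can be sketched end to end with a tiny Naive Bayes model. This is a toy illustration under stated assumptions, not the classifier used in the project: the keyword heuristic in `KEYWORDS`, the four example sentences and the choice of multinomial Naive Bayes are all hypothetical.

```python
import math
from collections import Counter

KEYWORDS = {"acquired", "subsidiary", "manufactures", "products"}  # hypothetical heuristic


def tokenize(text):
    return [w.strip(".,;:()").lower() for w in text.split() if w.strip(".,;:()")]


def heuristic_label(text):
    """Step 1: label an example without human input (1 = relevant, 0 = not)."""
    return 1 if KEYWORDS & set(tokenize(text)) else 0


def train(texts):
    """Step 2: fit word counts per class on the self-labeled examples."""
    counts = {0: Counter(), 1: Counter()}
    docs = Counter()
    for text in texts:
        label = heuristic_label(text)
        docs[label] += 1
        counts[label].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, docs, vocab


def classify(model, text):
    """Step 3: score new text with Laplace-smoothed multinomial Naive Bayes."""
    counts, docs, vocab = model
    scores = {}
    for label in (0, 1):
        total = sum(counts[label].values())
        score = math.log((docs[label] + 1) / (sum(docs.values()) + 2))
        for word in tokenize(text):
            if word in vocab:
                score += math.log((counts[label][word] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


corpus = [
    "3M acquired a subsidiary that manufactures adhesives",
    "The company manufactures adhesives and abrasives products",
    "Click here to subscribe to our newsletter",
    "Follow us and subscribe for weekly newsletter updates",
]
model = train(corpus)
print(classify(model, "adhesives and abrasives"))      # contains no heuristic keyword
print(classify(model, "subscribe to the newsletter"))
```

The first test sentence contains none of the heuristic keywords, yet the trained model can still mark it relevant because it learned co-occurring words such as "adhesives" from the self-labeled examples.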

Doc2Vec
• Represents semantic meaning of documents in a vector space
• You can "tag" documents with topics.
• We can attempt to cluster or classify documents using tags.
Example tags: Apple, iPhone, Swift, Mac

Classification Results: Web Pages
TF-IDF Produced:
• - riddel j
• 1941
• rhop
• danaida
• - boisduv j
We Produced:
• 2014 Chemr acquired 3D-
Radar as a subsidiary of
Curtiss-Wright Corporation in
May 2014

Classification Results: Financial Statements
TF-IDF Produced:
• item 3
• asu no
• see note 2
• 10
• -11
We Produced:
• these challenges add to the
uncertainties of the
legislative changes enacted
as part of ACA

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Classifier → Relevant Text Documents → Profile Manager → Master Document → Base Knowledge Graph]
[Profile Manager detail: lookups CIK → SEC, NAICS → SEC, SEC → CIK; one profile per company (Pfizer, 3M, Dole), each holding URLs, Wiki, Subsidiaries]

Profile Manager
• Aggregates information by company
• Queryable
• Contains utility functions
CIK = Central Index Key
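
A minimal sketch of a queryable, per-company store is shown below. It assumes profiles are keyed by CIK with a name lookup on top; the class and method names are hypothetical, and the CIK value in the example is a placeholder rather than a real identifier.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    cik: str
    name: str
    urls: list = field(default_factory=list)
    subsidiaries: list = field(default_factory=list)
    documents: list = field(default_factory=list)


class ProfileManager:
    """Aggregates information by company and answers simple queries."""

    def __init__(self):
        self._by_cik = {}
        self._cik_by_name = {}

    def get_or_create(self, cik, name):
        profile = self._by_cik.setdefault(cik, Profile(cik, name))
        self._cik_by_name[name.lower()] = cik
        return profile

    def add_document(self, cik, name, url, text):
        profile = self.get_or_create(cik, name)
        profile.urls.append(url)
        profile.documents.append(text)

    def by_name(self, name):
        """Utility lookup: company name → profile (None if unknown)."""
        return self._by_cik.get(self._cik_by_name.get(name.lower()))


manager = ProfileManager()
# "0000000000" is a placeholder CIK, not 3M's real identifier.
manager.add_document("0000000000", "3M", "https://example.com/3m", "The 3M Company ...")
manager.add_document("0000000000", "3M", "https://example.com/3m-10k", "Item 1. Business ...")
print(len(manager.by_name("3m").documents))
```

Keying on the CIK keeps documents from different sources together even when a company appears under several names.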

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Classifier → Profile Manager (SEC → CIK; profiles for Pfizer, 3M, Dole with URLs, Wiki, Subsidiaries) → Output: Master Document → Base Knowledge Graph; Positive Feedback loop]
Master Document (excerpt): The 3M Company, formerly known as the Minnesota Mining and Manufacturing Company

Master Documents
• Aggregates all the relevant company info
  • Wikipedia
  • Subsidiaries
  • Web Crawler results
• Produced thousands for Praedicat, and our code can produce as many as needed
https://github.com/himahuja/pcatxcore

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Classifier → Profile Manager (SEC → CIK; profiles for Pfizer, 3M, Dole with URLs, Wiki, Subsidiaries) → Output: Master Document → Triple Construction (Reverb) → Base Knowledge Graph; Positive Feedback loop]
Triple Construction example: "Tim Cook is heading Apple." → (TimCook, heads, Apple)

Open Information Extraction
• Need to convert relevant text to structured data
• Reverb gives us this capability using Natural Language Processing
A. Fader, S. Soderland, and O. Etzioni, Identifying relations for open information extraction, in
Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP
’11, Stroudsburg, PA, USA, 2011, Association for Computational Linguistics, pp. 1535–1545.

[Pipeline diagram: Company Name → Query Formulator → Refined Query (mandatory: “3M”; optionals: ~Minnesota ~Mining ~Manufacturing; filetype:PDF) → Web Crawler → List of URLs → Parser → Raw Texts → Web Resource Manager (SEC, Source URLs, …) → Output: Raw text → Classifier → Profile Manager (SEC → CIK; profiles for Pfizer, 3M, Dole with URLs, Wiki, Subsidiaries) → Output: Master Document → Triple Construction (Reverb) → Facts to be checked → Computational Fact-Checking → Base Knowledge Graph; Positive Feedback loops]
Triple Construction example: "Tim Cook is heading Apple." → (S, P, O) = (TimCook, heads, Apple)
Computational Fact-Checking outcomes:
• Discarded facts (Low Truth Value)
• Knowledge Graph Update with high truth value facts

KnowledgeLinker
• Valid facts should lie along specific paths
G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, and A. Flammini,
Computational fact checking from knowledge networks, PLOS ONE, 10 (2015).
[Diagram: a path of "Is in" edges through Westwood, Los Angeles, California, US]

Knowledge Stream
• A "stream" (set of paths) provides more context than a single path
• Relational similarity improves the path specificity equation in Knowledge Linker
[Diagram labels: Institute of, Math, RIPS, Ph.D.s, Papers]
P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in
knowledge graphs to support fact checking, CoRR, abs/1708.07239 (2017).

PredPath
[Diagram labels: UCLA, Math, College, Subject; edges labeled "has major"]
B. Shi and T. Weninger, Fact checking in large knowledge graphs - A
discriminative predicate path mining approach, CoRR, abs/1510.05911 (2015).

PredPath
[Diagram labels: CMU, C.S., Ph.D.s, Finger Painting, Students; edges labeled "has major"]

PredPath
[Diagram: "has major" edges among UCLA, Math, Ph.D.s, Finger Painting, Students; one fact marked High Truth Value and one marked Low Truth Value]

Towards a New
Computational Fact-Checking
Algorithm

Why both negative and positive samples?
[Diagram labels: Math, Workshop, UCLA, CMU, Robotics, Program in Anthropology; three examples marked Positive Sample and one marked Negative Sample]

StreamMiner, motivated by PredPath*
Built negative and positive feature sets
for training on graphs.
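
The slides do not describe how the negative set was built. One common approach, shown here purely as an assumption, is to corrupt the object of each true triple so that the false triple keeps the same subject and predicate. The function name and the example triples are illustrative only.

```python
import random


def negative_samples(positives, seed=0):
    """Create one false triple per true triple by swapping in another object.

    Assumption: any object seen elsewhere in the positive set is a valid
    replacement, as long as the resulting triple is not itself a known fact.
    """
    rng = random.Random(seed)
    known = set(positives)
    objects = sorted({obj for _, _, obj in positives})
    negatives = []
    for subj, pred, obj in positives:
        candidates = [c for c in objects if c != obj and (subj, pred, c) not in known]
        if candidates:
            negatives.append((subj, pred, rng.choice(candidates)))
    return negatives


positives = [
    ("UCLA", "has major", "Math"),
    ("CMU", "has major", "Robotics"),  # illustrative triple, not from the slides
]
negatives = negative_samples(positives)
print(negatives)
```

Training on both sets lets a classifier learn which paths separate true facts from plausible-looking false ones, rather than only which paths occur around true facts.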

Path Specificity
Node Specificity: how general the idea of the node is (how many concepts are connected to it)
• Very General: University
• Very Specific: Conference Room, IPAM, UCLA
Path Similarity: how similar two relations are
• e.g.: Mentors
• Highly Similar: advises, counsels
• Less similar: robs, steals
Path Specificity = Node Specificity + Path Similarity

StreamMiner, motivated by KREL-LINKER*
Path Specificity = Node Specificity + Path Similarity
• Node Specificity: logarithm of node in-degree
• Path Similarity: relational similarity w.r.t. predicate P, as cosine distance of co-occurrence
*P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in

Path Specificity is more important than Path length
P = predicate in question ("is a?")
[Diagram labels: Place, UCLA, University, Team, UCLA Bruins; edges labeled "is a"]
Relational similarity of each edge label to P:
u(P, is a) = 1
u(P, has a) = 0.6
u(P, has athletic team) = 0.1
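
To see why a longer path can beat a shorter one, the two ingredients can be combined into a per-edge cost. The sketch below is not the StreamMiner or K-REL-LINKER formula: adding log in-degree to (1 - u) is an assumption made only to turn "general nodes and dissimilar relations are bad" into a number, and the in-degree values are invented.

```python
import math


def edge_cost(in_degree, similarity):
    """Cost of entering a node through one relation (lower = more specific).

    in_degree: incoming edge count of the node being entered (generality).
    similarity: u(P, q) in [0, 1] for the edge's relation q.
    Assumption: cost = log(in_degree) + (1 - similarity).
    """
    generality = math.log(in_degree) if in_degree > 1 else 0.0
    return generality + (1.0 - similarity)


def path_cost(steps):
    """Sum edge costs over a path given as (in_degree, similarity) pairs."""
    return sum(edge_cost(k, u) for k, u in steps)


# Two hops through a very general hub node (hypothetical in-degree 1000).
short_general = [(1000, 1.0), (1, 1.0)]
# Three hops through specific nodes, one edge with u = 0.6 as on the slide.
long_specific = [(3, 1.0), (2, 0.6), (1, 1.0)]

print(path_cost(short_general) > path_cost(long_specific))  # True
```

Under this cost the three-hop path is preferred, which matches the slide's point that specificity matters more than hop count.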

StreamMiner, motivated by Knowledge Stream*
* P. Shiralkar, A. Flammini, F. Menczer, and G. L. Ciampaglia, Finding streams in
Use of Transitive Closure on Dijkstra’s Algorithm with
Yen’s K-Shortest paths for mining path specificity
instead of path length.
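
For readers unfamiliar with the building blocks, here is a compact sketch of Yen's k-shortest simple paths on top of Dijkstra's algorithm. It covers only that part of the description: the transitive-closure step is not shown, the toy graph and its weights are invented, and in StreamMiner the edge weights would come from path specificity rather than from the constants used here.

```python
import heapq


def dijkstra(graph, source, target, banned_nodes=frozenset(), banned_edges=frozenset()):
    """Return (cost, path) of the cheapest path, or None if unreachable."""
    heap = [(0.0, source, [source])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for nxt, weight in graph.get(node, {}).items():
            if nxt in settled or nxt in banned_nodes or (node, nxt) in banned_edges:
                continue
            heapq.heappush(heap, (cost + weight, nxt, path + [nxt]))
    return None


def yen_k_shortest(graph, source, target, k):
    """Return up to k loop-free paths as (cost, path), cheapest first."""
    first = dijkstra(graph, source, target)
    if first is None:
        return []
    accepted = [first]
    seen = {tuple(first[1])}
    candidates = []  # min-heap of (cost, path)
    while len(accepted) < k:
        last_path = accepted[-1][1]
        for i in range(len(last_path) - 1):
            root = last_path[: i + 1]
            # Forbid edges that would reproduce an accepted path sharing this root.
            banned_edges = {
                (p[i], p[i + 1]) for _, p in accepted if p[: i + 1] == root
            }
            spur = dijkstra(graph, root[-1], target, frozenset(root[:-1]), banned_edges)
            if spur is None:
                continue
            path = root[:-1] + spur[1]
            if tuple(path) not in seen:
                seen.add(tuple(path))
                cost = sum(graph[a][b] for a, b in zip(path, path[1:]))
                heapq.heappush(candidates, (cost, path))
        if not candidates:
            break
        accepted.append(heapq.heappop(candidates))
    return accepted


# Toy directed graph with invented weights.
GRAPH = {"A": {"B": 1, "C": 4}, "B": {"C": 1, "D": 5}, "C": {"D": 1}, "D": {}}
for cost, path in yen_k_shortest(GRAPH, "A", "D", k=3):
    print(cost, path)
```

Returning several cheap paths instead of one gives the fact checker a "stream" of evidence between subject and object.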

Stream Miner, Novel Fact Checking Algorithm
• Use of positive and negative feature sets (motivated by PredPath).
• Use of both node specificity and path similarity (motivated by K-REL-LINKER).
• Use of Transitive Closure on Dijkstra’s Algorithm with Yen’s K-Weighted Shortest paths for mining path specificity instead of path length (motivated by Knowledge Stream).

Stream Miner: Performance
In its first run on a sub-sample database, Stream Miner produced an average AUROC score of 86.325 (area under the true positive rate vs. false positive rate curve), on par with the benchmark and state-of-the-art model PredPath.
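
For reference, AUROC can be computed directly from truth scores and labels as the probability that a randomly chosen true fact scores higher than a randomly chosen false one. The example scores below are made up and unrelated to the reported result.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]          # 1 = true fact, 0 = false fact
scores = [0.9, 0.4, 0.6, 0.1]  # made-up truth values from a fact checker
print(auroc(labels, scores))   # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of true and false facts.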

Contributions
• A web crawling, classification and fact-checking architecture.
• A classification technique for retrieving relevant information.
• A fact-checking algorithm, StreamMiner, for checking
information credibility.

Contribution: Making Impact
• Scaled up the analysts' ability to retrieve information
• Data on 52,000+ companies for decision-making

Acknowledgements
Shadi Shahsavari, our Academic Mentor
Dr. Stephen DeSalvo, Industry Mentor
Melissa Boudrea, Industry Sponsor
Urjit Patel, Industry Mentor
Susana Serna, our Program Director
David Medina, our IT Professional
Dimi Mavalski, Program Coordinator
Ronald McFarland, Program Coordinator
