AI PLATFORM TO MINE COMPETITIVE INTELLIGENCE
FROM BILLIONS OF UNSTRUCTURED ONLINE SOURCES
My background
2
Meltwater is the global leader in media intelligence
3
1500 employees worldwide
28,000 corporate clients
50 offices across 6 continents
Bootstrapped: No venture funding
FOUNDED 2001 in Oslo, Norway
HEADQUARTERS in San Francisco
Big data company: We process 100 million documents and billions of searches every day
4
data capture
data enrichment
proprietary search engine
real-time analytics
5
We track leading performance
real-time
analytics
Client Satisfaction
Industry Trends
Competitive Intelligence
Brand Perception
Share of Voice
28,000 corporate clients
6
Under the hood
7
Ingestion:
• AI crawling for unstructured web
• Programmatic APIs for partnerships
• Over 100M documents every day
Media Intelligence applications
• 1M complex Boolean queries configured
• Counters, aggregates, drill downs, pivoting, regression
• Vertical Search, news feed, media exposure, alerts based
on trends & anomalies, influencers etc
Data Augmentation (15 languages):
• Text categorization (topic, language)
• Keyphrase extraction, summarization
• Sentiment analysis (entity, aspect level)
• Semantic hashing for near duplicate detection
Knowledge Management
• NER (person, location, organization, ...)
• NED ( https://en.wikipedia.org/wiki/Tim_Cook )
• Relation & event extraction
• Truth finding, link prediction, graph mining
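Near-duplicate detection by hashing, listed under Data Augmentation above, can be illustrated with a small fingerprinting sketch. The slides do not say which hashing method is used; SimHash over word 3-grams is swapped in here as a simple stand-in, and the function names are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint built from word 3-gram shingles.

    Assumption: SimHash is a stand-in; the deck only says "semantic hashing".
    bits must be at most 128 because an MD5 digest is 16 bytes long.
    """
    tokens = re.findall(r"\w+", text.lower())
    shingles = [" ".join(tokens[i:i + 3]) for i in range(max(1, len(tokens) - 2))]
    weights = [0] * bits
    for shingle in shingles:
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for bit in range(bits):
            # Each shingle votes +1 or -1 on every bit position.
            weights[bit] += 1 if (value >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Two documents would be treated as near duplicates when the Hamming distance between their fingerprints falls under a threshold chosen for the corpus.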
Motivation for the platform
8
Access to Structured Data
• Make sure the data is clean, complete & normalized
• Make it relevant by connecting the dots
• Bring the methods close to the data
Data is deceitful. Need a systematic way to
mine, propose, and explain possible insights
• we need factual knowledge
• combine (machine) learning and reasoning
Source: www.tylervigen.com (Spurious Correlations)
9
Streaming, Search, Analytics, APIs
Building blocks to leverage the platform
Data Enrichment Platform
Enrich, analyze & build insights by interoperating with all major players
Knowledge Graph
Enable cognitive applications on top of our data by connecting the dots
AI-Driven Data Acquisition
Bring high quality outside data to our repository with minimal human effort
Media Intelligence
Apps
New
Apps
Enterprise
Solutions
Custom
solutions
3rd party
Apps
PaaS
Outside
Data
Context
Building
Enrichment
& Analysis
Service
Layer
Global Monitoring
Distribute
Analyze & Report
Influence & Engage
Outside
Insight
AI-Powered
Reporting
Employee
App
Freemium
100M documents ingested daily
150 NLP/IR pipelines
Hundreds of billions of searches
10
A treasure trove of valuable external data
Online News
Share Price
Job Postings
Press Releases
Patent Filings
Social Media
Financial Filings
Real Estate Rates
App Downloads
Web traffic
Unemployment
Oil Prices
Court Documents
Online ad Spend
Blogs
Forums
Product Reviews
Interest Rates
Corporate Websites
Weather Data
11
Sources of Information
• Text (Information Extraction)
• DOM (Web Extraction)
• WebTables
• Annotations (Site Microdata)
The academic web
12
Typical comments about web data extraction
• Microdata and the semantic web have solved problems
• All the data is in web tables
• APIs provide all the structured data you need
The real web
13
Web data extraction is not a solved problem
• APIs are limited to large websites
• Web tables and microdata are marginal
• The real problem is not one-time extraction, but keeping the data up-to-date over time
AI crawling for Web data extraction
14
Traditional scraping requires a huge human effort:
• Code wrappers for each source, e.g., in Scrapy or MW’s source configurations
• Visual testing and support tools (à la Connotate, Mozenda, …)
• Automatic scraping for a small number of fixed data types (à la Diffbot), e.g., Microdata
• Meltwater (old): ~50 “source engineers” maintaining manual wrappers
o sources failing at a rate of hundreds per week, 1-2 h to fix each source effectively
15
Web-Scale Wrapper Induction
We need to scale to the web
• minimize supervision per source
But: we can afford prior knowledge
• about entities and attributes
• mostly in the form of a known knowledge graph for domain knowledge
• expressed as gazetteers or rules for local, textual information
• higher-level rules or classifiers for complex structures
Web-Scale Wrapper Induction
16
Problem: application of prior knowledge is costly & noisy
• wrapper induction to generalise to other pages of site
• “template” hypothesis
Solution: Generate “wrapper” program from examples
• then apply to all pages of a site
• when to apply which extractor
Full site extraction needs to also deal with
• Interactivity such as pagination & form filling (deep web)
• Detecting complex structures such as lists, tables, …
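The idea of generating a wrapper from examples and then applying it to every page of a site can be sketched with a classic left/right delimiter wrapper. This is a minimal illustration under the template hypothesis, not the system described in the deck; the pages, function names, and delimiter strategy are assumptions.

```python
from os.path import commonprefix


def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page_html, target_value) pairs.

    Assumes every page shares one template, so the text right before and
    right after the target value is identical across the examples.
    """
    befores, afters = [], []
    for page, value in examples:
        start = page.index(value)
        befores.append(page[:start][::-1])  # reversed: common suffix becomes a prefix
        afters.append(page[start + len(value):])
    left = commonprefix(befores)[::-1]
    right = commonprefix(afters)
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the value between the learned delimiters, or None if absent."""
    left, right = wrapper
    start = page.find(left)
    if start < 0:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end >= 0 else None


# Hypothetical pages from one site, annotated with the title to extract.
EXAMPLES = [
    ('<html><h1 class="t">Tesla buys SolarCity</h1><p>Body one</p></html>', "Tesla buys SolarCity"),
    ('<html><h1 class="t">Apple earnings</h1><p>Other text</p></html>', "Apple earnings"),
]
WRAPPER = induce_wrapper(EXAMPLES)
NEW_PAGE = '<html><h1 class="t">Bayer results</h1><p>More</p></html>'
TITLE = apply_wrapper(WRAPPER, NEW_PAGE)
```

Two annotated pages are enough here because the template is trivial; real sites need more examples and a way to decide which extractor applies to which page, as the slide notes.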
17
Fairhair.AI Crawlers
Exploration
• Focused crawling
• Stop conditions
• Relational transducers
Template Discovery
• Data areas detection
• Record segmentation
• Attribute alignment
Form Understanding
• Labelling
• Classification
• Filling
Domain Modeling
• DOM annotation
(dictionaries, regexes)
• Web phenomenology
(forms, fields, labels,
menus)
• Conceptual models
Framework for rule-based feature engineering supporting quick turnaround for domain-specific rich features on
top of a library of 2.5k pre-built features representing the structure, visual rendering, and textual content of a
webpage, as well as the link structure and interaction patterns of the entire site.
18
{
"title": "White House vows to fight media 'tooth and nail' over Trump coverage; says it presented 'alternative
facts'",
"authors":
[ {
"name": "Doina Chiacu",
"socialHandlers": {
"linkedin": "https://www.linkedin.com/in/doina-chiacu-2b2a9875",
"twitter": "https://twitter.com/doinachiacu" } },
{
"name": "Jason Lange",
"socialHandlers": {
"twitter": "https://twitter.com/langejason" } }
],
"datePublished": {
"date": "2017-01-22",
"time": "9:36PM"
},
"keywords": "Politics",
"summary": "The White House vowed on Sunday to fight the news media “tooth and nail” over what it sees
as unfair attacks, with a top adviser saying the Trump administration had presented “alternative facts” to
counter low inauguration crowd estimates.",
"siteHandlers": {
"twitter": "@UnionLeader" },
"ingress": "WASHINGTON — The White House vowed on Sunday to fight the news media “tooth and nail”
over what it sees as unfair attacks, with a top adviser saying the Trump administration had presented
“alternative facts” to counter low inauguration crowd estimates.",
"images": [ {
"url": "http://www.unionleader.com/storyimage/UL/20170122/NEWS06/170129767/AR/0/AR-
170129767.jpg",
"type": "primary" }],
"engagements": [ {
"value": "4",
"type": "comments" } ],
"content": "On his first full day as president, Trump said he had a “running war” with the media and accused
journalists of underestimating the number of people who turned out Friday for his swearing-in.\n\nWhite House
officials made clear no truce was on the horizon on Sunday in television interviews that set a much harsher
tone in the traditionally adversarial relationship between the White House and the press corps.\n\n“The point is
not the crowd size. The point is the attacks and the attempt to delegitimize this president in one day. And we’re
not going to sit around and take it,” Chief of Staff Reince Priebus said on “Fox News Sunday.”\n\nThe sparring
with the media has dominated Trump’s first weekend in office, eclipsing debate over policy and Cabinet
19
Effects of AI Crawling
• 80-90% lower human effort without loss in quality compared with state-of-the-art
• 10-100x more sources and domains than existing automated solutions and affordable supervised ones
• 3-10x more attributes
e.g., 300k+ news sources, 1M+ company websites, job postings, press releases
20
Information extraction
Identify and disambiguate mentions of entities of interest in a document.
Tokenizer + Splitter + PoS:
Tesla/NNP has/VBZ announced/VBN the/DT full/JJ acquisition/NN of/IN SolarCity/NNP which/WDT closed/VBD on/IN Monday/NNP morning/NN .
NER:
Tesla [ORG] has announced the full acquisition of SolarCity [ORG] which closed on Monday morning [DATETIME] .
NED candidates:
• Tesla: Tesla Science Center at Wardenclyffe; Tesla (Czechoslovak company); Tesla, Inc.
• SolarCity: SolarCity Corporation; City Solar AG
• Monday: Black Monday; Monday
21
Motivation
• Disambiguated entities are necessary to produce new relations for the graph via Relation
Extraction (RE) - no disambiguation, no linking of relations to the nodes in the graph
• Plain keyword search is extremely noisy
Example graph:
• Tesla, Inc.: CEO Elon Musk; HQ Palo Alto, CA
• SolarCity Corporation: CEO Lyndon Rive; HQ San Mateo, CA
• ACQUISITION: acquire(Tesla_Inc, SolarCity_Corp)
Tesla has announced the full acquisition of SolarCity which closed on Monday morning .
NER architecture
22
● No feature engineering required
● Current SOTA does use DL
● Transfer learning to new domains
with fewer training instances
● Better Generalization
Character Representation using CNN
23
● Different embedding sizes: token, character
● Optimizers: Adam, RMSProp, SGD
● Varying dropout rates
● Changing number of hidden layers and units
● Different regularization techniques
● Average pooling vs. max pooling for CNN
● Varying filter and window length for CNN
Ref : End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF , Xuezhe Ma, Eduard Hovy -
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)
24
NED Architecture
Tesla has announced the full acquisition of
SolarCity which closed on Monday morning.
Plain text NER Annotated spans
entity(0,5,ORG,"Tesla")
entity(44,53,ORG,"SolarCity")
entity(70,84,DATETIME,"Monday Morning")
Leipzig University’s AGDISTIS NED
HITS - Scoring algorithm
Tesla
SolarCity
Monday Morning
Indexed
triple store
Disambiguated entities
Tesla
|-> http://.../mw/Tesla_Inc
SolarCity
|-> http://.../dbp/SolarCity
Tesla Inc.
SolarCity Corporation
KG
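The annotated spans handed from NER to NED are character offsets into the plain text. The sketch below parses the `entity(start,end,type,"surface")` notation from the slide and checks each offset against the sentence; the parser and helper names are hypothetical, and straight quotes are assumed in the annotation strings.

```python
import re

TEXT = "Tesla has announced the full acquisition of SolarCity which closed on Monday morning."

ANNOTATIONS = [
    'entity(0,5,ORG,"Tesla")',
    'entity(44,53,ORG,"SolarCity")',
    'entity(70,84,DATETIME,"Monday Morning")',
]

PATTERN = re.compile(r'entity\((\d+),(\d+),(\w+),"([^"]*)"\)')


def parse_entity(line):
    """Turn one annotation string into (start, end, label, surface)."""
    match = PATTERN.fullmatch(line)
    if match is None:
        raise ValueError(f"not an entity annotation: {line!r}")
    start, end, label, surface = match.groups()
    return int(start), int(end), label, surface


def span_matches(text, entity):
    """Check that the offsets select the surface form (case-insensitive)."""
    start, end, _label, surface = entity
    return text[start:end].lower() == surface.lower()


ENTITIES = [parse_entity(line) for line in ANNOTATIONS]
ALL_VALID = all(span_matches(TEXT, entity) for entity in ENTITIES)
```

The comparison ignores case because the slide writes the surface form as "Monday Morning" while the sentence has "Monday morning".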
25
HITS Algorithm (PageRank variant)
● It is query dependent, that is, the (Hubs and Authority) scores resulting from the
link analysis are influenced by the search terms;
● As a corollary, it is executed at query time, not at indexing time, with the
associated hit on performance that accompanies query-time processing.
● It computes two scores per document, hub and authority, as opposed to a single
score;
● It is processed on a small subset of ‘relevant’ documents (a 'focused subgraph' or
base set), not all documents as was the case with PageRank.
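The two scores per node can be computed with a short power iteration. This is a generic HITS sketch on a hypothetical candidate graph, not AGDISTIS's implementation; the node names and edges are made up for illustration.

```python
import math


def hits(graph, iterations=50):
    """Compute hub and authority scores for a directed graph.

    graph maps each node to the list of nodes it links to.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes pointing at you.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes you point at.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical focused subgraph of candidate entities.
GRAPH = {
    "Tesla_Inc": ["SolarCity"],
    "Elon_Musk": ["Tesla_Inc", "SolarCity"],
    "Tesla_Science_Center": [],
}
HUB, AUTH = hits(GRAPH)
```

A candidate with no links from the other candidates, such as `Tesla_Science_Center` in this toy graph, ends up with an authority score of zero.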
26
Problems and solutions
Problem: AGDISTIS only receives named entities, so its disambiguation context is limited
Solution: Complement the NER with a domain-specific (non-named) entity recognizer and verbs.
• We have a first version available for the business domain
• Bootstrapped from a large corpus of docs and a termbank generation algorithm
Example: Tim Cook [PER] is the CEO [BUSINESS TERM] of Apple [ORG]. I know Bayer [ORG] is a pharmaceutical [INDUSTRY] company [BUSINESS TERM].
Problem: AGDISTIS is language-agnostic but the index is not. We were initially limited to English
Solution: Replicate the same process for other languages by mapping the index fields
• Not all languages have comprehensive DBpedia fields
• The English part of the index (the richest) probably always has to be present
27
Architectural Overview – RE Service
28
Relation extraction using LSTMs
Input Sentence: Google is competing with Microsoft
Architecture (figure): Input Sentence, Embedding Layer (Vectors), LSTM Layer, Dense Layer, Softmax, Output
Knowledge Graph
29
• Funding Developments
• Leadership Changes
• New Offerings
• Bankruptcy,
• Restructuring, Cost
Cutting
Editorial
Influencer
DB
Job
Postings
Press
Releases Company
Database
SEC
Filings
Patents
Social
Media
Company
Website
FHAI
Knowledge
Graph
3rd Party
providers
• Competitor
• Customer
• Investment
• Lawsuit/Litigation
• Partnership
• Companies
• Brands
• Products
• Key people
• Influencers
• Relate facts
• Data mining
• Cognitive applications (higher-order
reasoning)
• Contextual Features
• Supplier
• Acquisition
• Out/under performance
• Expanding Operations
• Compliance
Entities: Goal:
Infer high-level insights from a set of extracted events/facts.
30
Challenges
Source: Xin Luna Dong (Google) - PVLDB ‘14
Figure: source types with sizes: Text (301M), Document Object Model (1,280M), Tables (10M), Annotations / website metadata (145M); other values shown in the figure: 110M, 13K, 1.5M.3M, 1.1M, 1.7M
● Knowledge deduplication / integration
● Truth finding (contradictory facts)
● Confidence values
31
Graph embedding
• Given a graph (g), entities (e) and relations (r), produce a low-rank tensor factorization of the co-occurrence cube of all combinations of (e,r,e)
• Input:
• Graph
• Vector size
• Goal:
• Find vectors for all (e) and (r) that minimize the scoring function:
• Output:
• Embedding vectors for (e) and (r).
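The scoring function itself is not reproduced in the transcript. As a sketch of what a low-rank factorization of the (e,r,e) cube can look like, the example below assumes a RESCAL-style bilinear score, with a vector per entity and a matrix per relation; this choice, the names, and the toy values are all assumptions.

```python
def bilinear_score(head, relation, tail):
    """Score a triple as head^T * R * tail (assumed RESCAL-style model).

    head, tail: entity vectors of length d; relation: d x d matrix.
    """
    return sum(
        head[i] * relation[i][j] * tail[j]
        for i in range(len(head))
        for j in range(len(tail))
    )


def reconstruction_loss(observed, entities, relations):
    """Squared error between observed cube cells (0/1) and model scores."""
    return sum(
        (value - bilinear_score(entities[h], relations[r], entities[t])) ** 2
        for (h, r, t), value in observed.items()
    )


# Hypothetical 2-dimensional embeddings that reproduce the toy cube exactly.
ENTITIES = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
RELATIONS = {"r": [[0.0, 1.0], [0.0, 0.0]]}
OBSERVED = {("a", "r", "b"): 1, ("b", "r", "a"): 0}
LOSS = reconstruction_loss(OBSERVED, ENTITIES, RELATIONS)
```

Training would adjust the entity vectors and relation matrices to drive this loss down; the vector size from the slide corresponds to `d` here.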
32
Link prediction using embeddings
• Given a pair of entities (e1,e2), score how probable it is that they are related
• Input:
• Embedding vectors of the entities and relation r
• Annotated examples of true and false combinations of e,r,e
• Goal:
• Find a decision boundary that separates the true from the false combinations
• E.g.: use a standard classifier (SVM or RandomForest) with the embeddings as features
• Output:
• A probability score for e1,e2,r
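A minimal sketch of this setup follows. The slide suggests an SVM or RandomForest; a small logistic regression trained by batch gradient descent is swapped in here so the example stays dependency-free. The embeddings and labels are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def features(e1, r, e2):
    """Use the concatenated embeddings of (e1, r, e2) as the feature vector."""
    return list(e1) + list(r) + list(e2)


def predict_proba(weights, bias, x):
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train(samples, labels, lr=0.5, epochs=200):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias = [0.0] * len(samples[0]), 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            err = predict_proba(weights, bias, x) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical embeddings: "a" and "b" are similar, "c" is not.
ENTITY = {"a": [1.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
RELATION = [1.0, 1.0]
X = [
    features(ENTITY["a"], RELATION, ENTITY["b"]),  # annotated as true
    features(ENTITY["a"], RELATION, ENTITY["c"]),  # annotated as false
]
Y = [1, 0]
WEIGHTS, BIAS = train(X, Y)
```

`predict_proba(WEIGHTS, BIAS, features(e1, r, e2))` then plays the role of the probability score for e1,e2,r.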
33
Path ranking
Competitor?
• Country: weak indicator path (Germany: BMW, Mercedes, Volkswagen; Sweden: Volvo, Scania)
• Industry: strong indicator path (Car, Engines)
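The path-ranking idea in this figure can be sketched by counting, for each relation type, the two-step paths that connect a pair of companies, then weighting the path types. The edges and weights below are hypothetical; in a real path ranking model the weights would be learned from labelled pairs.

```python
from collections import defaultdict

# Hypothetical edges for illustration only.
TRIPLES = [
    ("BMW", "industry", "Car"),
    ("Mercedes", "industry", "Car"),
    ("Volvo", "industry", "Car"),
    ("Scania", "industry", "Engines"),
    ("BMW", "country", "Germany"),
    ("Mercedes", "country", "Germany"),
    ("Volvo", "country", "Sweden"),
    ("Scania", "country", "Sweden"),
]

# Assumed weights: industry is the strong indicator, country the weak one.
WEIGHTS = {"industry": 0.8, "country": 0.2}


def path_features(triples, a, b):
    """For each relation r, count paths a -r-> x <-r- b."""
    out = defaultdict(set)
    for head, rel, tail in triples:
        out[(head, rel)].add(tail)
    relations = sorted({rel for _, rel, _ in triples})
    return {rel: len(out[(a, rel)] & out[(b, rel)]) for rel in relations}


def competitor_score(triples, a, b):
    """Weighted sum of path counts as a crude competitor score."""
    feats = path_features(triples, a, b)
    return sum(WEIGHTS[rel] * count for rel, count in feats.items())
```

With these toy edges, sharing an industry outweighs sharing a country, which mirrors the strong and weak indicator paths in the figure.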
Link prediction combination
• Combine PRA (path ranking) and embedding models to produce a superior link scoring/prediction algorithm.
• Achievement: significant improvement over SOTA during our experiments
34
35
Output
Organization: 11,171,077
Person: 1,708,796
Relation             Instances
Competition          33,327,137
Works At             228,070
Investor (Company)   93,980
Founder              67,515
Board Member         43,525
Acquisition          15,532
Investor (Person)    10,420
Sub-organization     4,214
275M Facts Mined (Distinct)
36
Data platform
Sources: Forums, Online News, Patents & Trademarks, Job Postings, Social, Company Websites, Blogs
2.8k vCPU, 21TB RAM, 630TB SSD
200B Documents
30M Sources
Analytics Layer
Serving Layer
Data Science Platform
Knowledge Graph
37
NLP/IE Pipelines
A workflow for analyzing datasets (Batch & Real Time)
The standard workflow: fetching documents from the data lake
Components shown: DL, language classifier, topic classifier, country classifier, router, NER1 ... NERn, NED1 ... NEDn, RE1 ... REn, DP, IR
Repositories shown: model repo, nlp-data repo, NED index repo, GS repo, registry (versioned SW), scoring GS
Every time a new component sw-version is registered, a scoring task against the GS is triggered
Supports multiway data flows, e.g., for ensembles of NERs
LANGUAGE-COUNTRY: en-us, en-uk, sv-se, fr-fr, fr-ca, ...
CLASSIFIER TOPICS: Arts & Entertainment, Business, Demographic Groups, Environment & Nature, Events, Government & Politics, Health, Lifestyle, Living Things, Media, Science, Social Affairs, Sports, Technology
DOCUMENT SECTIONS: title, ingress, highlights, body, captions, quotations
Example document metadata:
{
"topic": "business",
"language": "en",
"country": "us",
"section": "body"
}
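The routing step in this pipeline can be pictured as a lookup from classifier output to a pipeline variant. The sketch below is an assumption about how such a router might work: the route table, the fallback order, and the mapping of keys to `NER1` and `NER2` are hypothetical; only the metadata fields come from the slide.

```python
# Hypothetical route table: (language, country, topic) -> NER pipeline name.
ROUTES = {
    ("en", "us", "business"): "NER1",
    ("en", None, None): "NER2",
}


def route(meta):
    """Pick a pipeline, falling back from the most to the least specific key."""
    lang, country, topic = meta["language"], meta["country"], meta["topic"]
    for key in ((lang, country, topic), (lang, country, None), (lang, None, None)):
        if key in ROUTES:
            return ROUTES[key]
    raise LookupError(f"no pipeline registered for {meta!r}")


DOC = {"topic": "business", "language": "en", "country": "us", "section": "body"}
CHOSEN = route(DOC)
```

A document with no exact match falls back to the language-level pipeline, and an unsupported language raises an error rather than being silently misrouted.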
38
Human in the loop
● Annotate text, entities, classification and custom HTML
● Task assignment, ranking, inter-annotator agreement
● Gold set creation for any structured data like NED, Knowledge Fusion
Very time-consuming
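The slides do not say how inter-annotator agreement is measured. One common choice for two annotators is Cohen's kappa, sketched here as an assumption with hypothetical labels.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical entity-type labels from two annotators on four mentions.
ANNOTATOR_A = ["ORG", "ORG", "PER", "PER"]
ANNOTATOR_B = ["ORG", "ORG", "PER", "ORG"]
KAPPA = cohens_kappa(ANNOTATOR_A, ANNOTATOR_B)
```

Here the annotators agree on three of four items (0.75) while chance agreement is 0.5, giving a kappa of 0.5.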
39
Data Programming
Our deep learning approach for extracting entity-relation tuples requires a huge amount of labelled
data, which is an expensive and time-consuming effort.
Facebook is competing with Google
With Snorkel the goal is to write heuristics to programmatically generate training data.
Potential relation mentions → Heuristic (labelling) functions f1 ... fn → Probabilistic training labels
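The idea of heuristic labelling functions can be sketched without the Snorkel library. Snorkel learns a generative model over the functions' votes; an unweighted average of the non-abstaining votes is swapped in below as a much simpler stand-in. The three heuristics and their regular expressions are hypothetical.

```python
import re

COMPETITOR, NOT_COMPETITOR, ABSTAIN = 1, 0, -1


def lf_competing(sentence):
    return COMPETITOR if re.search(r"\bcompet\w*", sentence, re.I) else ABSTAIN


def lf_rival(sentence):
    return COMPETITOR if re.search(r"\brivals?\b", sentence, re.I) else ABSTAIN


def lf_partnership(sentence):
    return NOT_COMPETITOR if re.search(r"\bpartner(s|ed|ship)?\b", sentence, re.I) else ABSTAIN


LABELLING_FUNCTIONS = [lf_competing, lf_rival, lf_partnership]


def probabilistic_label(sentence):
    """Average the non-abstaining votes; None when every function abstains."""
    votes = [lf(sentence) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return sum(votes) / len(votes)
```

For the slide's example, `probabilistic_label("Facebook is competing with Google")` yields 1.0, while a sentence that triggers conflicting heuristics lands between 0 and 1.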
40
Snorkel - Full pipeline
marginal likelihood estimate to learn the joint distribution of data and latent labels
41
Sample Insight
Building
42
• Input text is processed as a sequence of UTF-8 encoded bytes.
• The hidden states of the model encode all the information the model has learned.
• Final cell states are used as the feature representation.
• Encoded output values range from -1 to 1.
• The mLSTM response lag is corrected using a reverse correlation method, which ensures the responses align with the corresponding text.
• The lag-corrected mLSTM response to individual keyphrases is averaged to produce keyphrases with sentiment values.
Multiplicative LSTM for sequence modelling. Krause et al., 2017
Learning to Generate Reviews and Discovering Sentiment. Radford et al., 2017
System Block Diagram: Input text, mLSTM encoder, Aspect extractor, encoder lag correction, Aspect Level sentiment
Aspect level sentiment (ALS) extraction using character level LSTM
43
Network design
• Single-layer multiplicative LSTM with 4096 units
• Mini-batches of 128 subsequences of length 256
• 4 Pascal Titan X GPUs
• Training took approximately one month
• Trained on ~100M labelled online reviews
44
Unsupervised SAE
Large corpus of homogeneous documents (50k ~ 250k)
• same domain (use a classifier), preferably no bundles
Normalisation and tagging
• tokenisation (NUT specific)
• orthography normalisation (most common orthography)
• POS tagging (Hepple’s on TreeBank)
• NP chunking (Ramshaw – Mitchell)
NP Clustering
• head noun lemmatization (approx. last noun in NP)
• frequent head nouns -> aspect terms
Segmentation
• cPMI optimal parsing of an NP -> modifiers / multi-words
Generalisation and typing
• structured aspect patterns (SAP)
• entity, aspect term, qualifier, quantifier
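The NP clustering step (frequent head nouns become aspect terms) can be sketched as follows. The head noun is approximated as the last token of the noun phrase, as the slide suggests; the plural-stripping rule is a crude stand-in for real lemmatization, and the noun phrases and threshold are hypothetical.

```python
from collections import Counter


def head_noun(noun_phrase):
    """Approximate the head noun as the last token, with naive plural stripping."""
    head = noun_phrase.lower().split()[-1]
    if len(head) > 3 and head.endswith("s") and not head.endswith("ss"):
        head = head[:-1]  # crude stand-in for lemmatization
    return head


def aspect_terms(noun_phrases, min_count=2):
    """Return head nouns that occur at least min_count times."""
    counts = Counter(head_noun(np) for np in noun_phrases if np.strip())
    return sorted(term for term, count in counts.items() if count >= min_count)


# Hypothetical noun phrases chunked from product reviews.
NOUN_PHRASES = [
    "the bright display",
    "a cracked display",
    "small fonts",
    "the default font",
    "email alert notification",
]
ASPECTS = aspect_terms(NOUN_PHRASES)
```

On a real corpus of 50k to 250k documents the threshold would be far higher than 2, and the segmentation step would then split each noun phrase into modifiers and multi-words around the head.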
45
The filled markers indicate shifts in the LSTM response that are used to extract keyphrases in the text and their corresponding sentiment values from the LSTM response.
Aspect                      Sentiment
Display                     negative
Email alert notification    negative
Fonts                       negative
wallpaper                   negative
Example
Aspect sentiment extraction using character level LSTM
46
Example Use Case
47
Connectors to serving systems
Data Ingestion & Insights Delivery by setting up simple schema mappers
48
Involve users, entrepreneurs, and researchers
6 Data Science Hubs (co-working spaces)
✔ Sydney
✔ Berlin
✔ New York
✔ London
✔ San Francisco
✔ Singapore
Meltwater Entrepreneurial School of Technology
• HQ in Accra, Ghana
• Training program for African entrepreneurs
• Incubator (25+ startups)
• Networking hub
University collaborations
49
50
More?
Questions?
51
http://bit.ly/fhai_meltwater

More Related Content

What's hot

Linking Open, Big Data Using Semantic Web Technologies - An Introduction
Linking Open, Big Data Using Semantic Web Technologies - An IntroductionLinking Open, Big Data Using Semantic Web Technologies - An Introduction
Linking Open, Big Data Using Semantic Web Technologies - An IntroductionRonald Ashri
 
From Structured Data to Linked Open Governmental Data
From Structured Data to Linked Open Governmental DataFrom Structured Data to Linked Open Governmental Data
From Structured Data to Linked Open Governmental DataDongpo Deng
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarSpazioDati
 
Using the Semantic Web Stack to Make Big Data Smarter
Using the Semantic Web Stack to Make  Big Data SmarterUsing the Semantic Web Stack to Make  Big Data Smarter
Using the Semantic Web Stack to Make Big Data SmarterMatheus Mota
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageOntotext
 
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...Connected Data World
 
Fraudes Financières: Méthodes de Prévention et Détection
Fraudes Financières: Méthodes de Prévention et DétectionFraudes Financières: Méthodes de Prévention et Détection
Fraudes Financières: Méthodes de Prévention et DétectionLinkurious
 
Text analytics for Google Spreadsheets using Text Mining add-on
Text analytics for Google Spreadsheets using Text Mining add-on Text analytics for Google Spreadsheets using Text Mining add-on
Text analytics for Google Spreadsheets using Text Mining add-on SpazioDati
 
Semantic Security : Authorization on the Web with Ontologies
Semantic Security : Authorization on the Web with OntologiesSemantic Security : Authorization on the Web with Ontologies
Semantic Security : Authorization on the Web with OntologiesAmit Jain
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...BOBCATSSS 2017
 
Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphsSören Auer
 
Thinking Outside the Table
Thinking Outside the TableThinking Outside the Table
Thinking Outside the TableOntotext
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Ontotext
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesOntotext
 
Keynote Exploring and Exploiting Official Publications
Keynote Exploring and Exploiting Official PublicationsKeynote Exploring and Exploiting Official Publications
Keynote Exploring and Exploiting Official Publicationsmaartenmarx
 
What can linked data do for digital libraries
What can linked data do for digital librariesWhat can linked data do for digital libraries
What can linked data do for digital librariesSören Auer
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data introvafopoulos
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataSören Auer
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked datavafopoulos
 

What's hot (20)

Linking Open, Big Data Using Semantic Web Technologies - An Introduction
Linking Open, Big Data Using Semantic Web Technologies - An IntroductionLinking Open, Big Data Using Semantic Web Technologies - An Introduction
Linking Open, Big Data Using Semantic Web Technologies - An Introduction
 
From Structured Data to Linked Open Governmental Data
From Structured Data to Linked Open Governmental DataFrom Structured Data to Linked Open Governmental Data
From Structured Data to Linked Open Governmental Data
 
Data Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch SeminarData Curation @ SpazioDati - NEXA Lunch Seminar
Data Curation @ SpazioDati - NEXA Lunch Seminar
 
Using the Semantic Web Stack to Make Big Data Smarter
Using the Semantic Web Stack to Make  Big Data SmarterUsing the Semantic Web Stack to Make  Big Data Smarter
Using the Semantic Web Stack to Make Big Data Smarter
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
Powerful Information Discovery with Big Knowledge Graphs –The Offshore Leaks ...
 
Fraudes Financières: Méthodes de Prévention et Détection
Fraudes Financières: Méthodes de Prévention et DétectionFraudes Financières: Méthodes de Prévention et Détection
Fraudes Financières: Méthodes de Prévention et Détection
 
Text analytics for Google Spreadsheets using Text Mining add-on
Text analytics for Google Spreadsheets using Text Mining add-on Text analytics for Google Spreadsheets using Text Mining add-on
Text analytics for Google Spreadsheets using Text Mining add-on
 
Semantic Security : Authorization on the Web with Ontologies
Semantic Security : Authorization on the Web with OntologiesSemantic Security : Authorization on the Web with Ontologies
Semantic Security : Authorization on the Web with Ontologies
 
Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...Nemeth Marton - Widening the limits of cognitive reception with online digita...
Nemeth Marton - Widening the limits of cognitive reception with online digita...
 
Enterprise knowledge graphs
Enterprise knowledge graphsEnterprise knowledge graphs
Enterprise knowledge graphs
 
Thinking Outside the Table
Thinking Outside the TableThinking Outside the Table
Thinking Outside the Table
 
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
Transforming Your Data with GraphDB: GraphDB Fundamentals, Jan 2018
 
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven RecipesReasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
Reasoning with Big Knowledge Graphs: Choices, Pitfalls and Proven Recipes
 
Cognitive data
Cognitive dataCognitive data
Cognitive data
 
Keynote Exploring and Exploiting Official Publications
Keynote Exploring and Exploiting Official PublicationsKeynote Exploring and Exploiting Official Publications
Keynote Exploring and Exploiting Official Publications
 
What can linked data do for digital libraries
What can linked data do for digital librariesWhat can linked data do for digital libraries
What can linked data do for digital libraries
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked DataIntroduction to the Data Web, DBpedia and the Life-cycle of Linked Data
Introduction to the Data Web, DBpedia and the Life-cycle of Linked Data
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 

Similar to AI Platform Mines Competitive Intelligence from Billions of Online Sources

How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB How Insurance Companies Use MongoDB
How Insurance Companies Use MongoDB MongoDB
 
Introduction to Neo4j
Introduction to Neo4jIntroduction to Neo4j
Introduction to Neo4jNeo4j
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to GraphsNeo4j
 
In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017In-Memory Computing Webcast. Market Predictions 2017
In-Memory Computing Webcast. Market Predictions 2017SingleStore
 
AI, ML and Graph Algorithms: Real Life Use Cases with Neo4j
AI, ML and Graph Algorithms: Real Life Use Cases with Neo4jAI, ML and Graph Algorithms: Real Life Use Cases with Neo4j
AI, ML and Graph Algorithms: Real Life Use Cases with Neo4jIvan Zoratti
 
Webinar: How Financial Services Organizations Use MongoDB
Webinar: How Financial Services Organizations Use MongoDBWebinar: How Financial Services Organizations Use MongoDB
Webinar: How Financial Services Organizations Use MongoDBMongoDB
 
Keynote: GraphTour Toronto
Keynote: GraphTour TorontoKeynote: GraphTour Toronto
Keynote: GraphTour TorontoNeo4j
 
Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino and AWS: collaborative analytics and model governance at financial ser...
Domino and AWS: collaborative analytics and model governance at financial ser...Domino Data Lab
 
Data APIs as a Foundation for Systems of Engagement
Data APIs as a Foundation for Systems of EngagementData APIs as a Foundation for Systems of Engagement
Data APIs as a Foundation for Systems of EngagementVictor Olex
 
Analytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformAnalytical Innovation: How to Build the Next Generation Data Platform
Analytical Innovation: How to Build the Next Generation Data PlatformVMware Tanzu
 
Knowledge Graphs Webinar- 11/7/2017
Knowledge Graphs Webinar- 11/7/2017Knowledge Graphs Webinar- 11/7/2017
Knowledge Graphs Webinar- 11/7/2017Neo4j
 
APIs in Enterprise
APIs in EnterpriseAPIs in Enterprise
APIs in EnterpriseVictor Olex
 
How Financial Services Organizations Use MongoDB
How Financial Services Organizations Use MongoDBHow Financial Services Organizations Use MongoDB
How Financial Services Organizations Use MongoDBMongoDB
 
10/ EnterpriseDB @ OPEN'16
10/ EnterpriseDB @ OPEN'16 10/ EnterpriseDB @ OPEN'16
10/ EnterpriseDB @ OPEN'16 Kangaroot
 
Ketnote: GraphTour Boston
Ketnote: GraphTour BostonKetnote: GraphTour Boston
Ketnote: GraphTour BostonNeo4j
 
La bi, l'informatique décisionnelle et les graphes
La bi, l'informatique décisionnelle et les graphesLa bi, l'informatique décisionnelle et les graphes
La bi, l'informatique décisionnelle et les graphesCédric Fauvet
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoSpark Summit
 
Introduction to Neo4j
Introduction to Neo4jIntroduction to Neo4j
Introduction to Neo4jNeo4j
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
 

AI Platform Mines Competitive Intelligence from Billions of Online Sources

  • 1. AI PLATFORM TO MINE COMPETITIVE INTELLIGENCE FROM BILLIONS OF UNSTRUCTURED ONLINE SOURCES
  • 3. Meltwater is the global leader in media intelligence. 1,500 employees worldwide; 28,000 corporate clients; 50 offices across 6 continents; bootstrapped, no venture funding; founded 2001 in Oslo, Norway; headquarters in San Francisco
  • 4. Big data company: We process 100 million documents and billions of searches every day. Data capture, data enrichment, proprietary search engine, real-time analytics
  • 5. We track leading performance. Real-time analytics: Client Satisfaction, Industry Trends, Competitive Intelligence, Brand Perception, Share of Voice
  • 7. Under the hood. Ingestion: • AI crawling for the unstructured web • Programmatic APIs for partnerships • Over 100M documents every day. Media Intelligence applications: • 1M complex Boolean queries configured • Counters, aggregates, drill-downs, pivoting, regression • Vertical search, news feed, media exposure, alerts based on trends & anomalies, influencers, etc. Data Augmentation (15 languages): • Text categorization (topic, language) • Keyphrase extraction, summarization • Sentiment analysis (entity, aspect level) • Semantic hashing for near-duplicate detection. Knowledge Management: • NER (person, location, organization, ...) • NED ( https://en.wikipedia.org/wiki/Tim_Cook ) • Relation & event extraction • Truth finding, link prediction, graph mining
  • 8. Motivation for the platform. Access to Structured Data: • Make sure the data is clean, complete & normalized • Make it relevant by connecting the dots • Bring the methods close to the data. Data is deceitful. We need a systematic way to mine, propose, and explain possible insights: • we need factual knowledge • combine (machine) learning and reasoning. Source: www.tylervigen.com (Spurious Correlations)
  • 9. Building blocks to leverage the platform: Streaming, Search, Analytics, APIs. Data Enrichment Platform: Enrich, analyze & build insights by interoperating with all major players. Knowledge Graph: Enable cognitive applications on top of our data by connecting the dots. AI-Driven Data Acquisition: Bring high quality outside data to our repository with minimal human effort. Media Intelligence Apps, New Apps, Enterprise Solutions, Custom solutions, 3rd party Apps, PaaS. Outside Data, Context Building, Enrichment & Analysis, Service Layer. Global Monitoring, Distribute, Analyze & Report, Influence & Engage, Outside Insight, AI-Powered Reporting, Employee App, Freemium. 100M documents ingested daily; 150 NLP/IR pipelines; 100s of billions of searches
  • 10. A treasure trove of valuable external data: Online News, Share Price, Job Postings, Press Releases, Patent Filings, Social Media, Financial Filings, Real Estate Rates, App Downloads, Web Traffic, Unemployment, Oil Prices, Court Documents, Online Ad Spend, Blogs, Forums, Product Reviews, Interest Rates, Corporate Websites, Weather Data
  • 11. Sources of Information: Text (Information Extraction), DOM (Web Extraction), WebTables, Annotations (Site Microdata)
  • 12. The academic web. Typical comments about web data extraction: • Microdata and the semantic web have solved the problem • All the data is in web tables • APIs provide all the structured data you need
  • 13. The real web. Web data extraction is not a solved problem: • APIs are limited to large websites • Web tables and microdata are marginal • The real problem is not one-time extraction, but keeping the data up-to-date over time
  • 14. AI crawling for Web data extraction. Traditional scraping requires a huge human effort: • Code wrappers for each source, e.g., in Scrapy or MW’s source configurations • Visual testing and support tools (à la Connotate, Mozenda, …) • Automatic scraping for a small number of fixed data types (à la Diffbot), e.g., Microdata • Meltwater (old): ~50 “source engineers” maintaining manual wrappers, with sources failing at a rate of 100s per week and 1-2h to fix each source effectively
  • 15. Web-Scale Wrapper Induction. We need to scale to the web: • minimize supervision per source. But: we can afford prior knowledge • about entities and attributes • mostly in the form of a known knowledge graph for domain knowledge • expressed as gazetteers or rules for local, textual information • higher-level rules or classifiers for complex structures
  • 16. Web-Scale Wrapper Induction. Problem: application of prior knowledge is costly & noisy • wrapper induction to generalise to other pages of a site • “template” hypothesis. Solution: Generate a “wrapper” program from examples • then apply it to all pages of a site • when to apply which extractor. Full site extraction also needs to deal with • Interactivity such as pagination & form filling (deep web) • Detecting complex structures such as lists, tables, …
  • 17. Fairhair.AI Crawlers. Exploration: • Focused crawling • Stop conditions • Relational transducers. Template Discovery: • Data areas detection • Record segmentation • Attribute alignment. Form Understanding: • Labelling • Classification • Filling. Domain Modeling: • DOM annotation (dictionaries, regexes) • Web phenomenology (forms, fields, labels, menus) • Conceptual models. Framework for rule-based feature engineering supporting quick turnaround for domain-specific rich features on top of a library of 2.5k pre-built features representing the structure, visual rendering, and textual content of a webpage, as well as the link structure and interaction patterns of the entire site.
  • 18. { "title": "White House vows to fight media 'tooth and nail' over Trump coverage; says it presented 'alternative facts'", "authors": [ { "name": "Doina Chiacu", "socialHandlers": { "linkedin": "https://www.linkedin.com/in/doina-chiacu-2b2a9875", "twitter": "https://twitter.com/doinachiacu" } }, { "name": "Jason Lange", "socialHandlers": { "twitter": "https://twitter.com/langejason" } } ], "datePublished": { "date": "2017-01-22", "time": "9:36PM" }, "keywords": "Politics", "summary": "The White House vowed on Sunday to fight the news media “tooth and nail” over what it sees as unfair attacks, with a top adviser saying the Trump administration had presented “alternative facts” to counter low inauguration crowd estimates.", "siteHandlers": { "twitter": "@UnionLeader" }, "ingress": "WASHINGTON — The White House vowed on Sunday to fight the news media “tooth and nail” over what it sees as unfair attacks, with a top adviser saying the Trump administration had presented “alternative facts” to counter low inauguration crowd estimates.", "images": [ { "url": "http://www.unionleader.com/storyimage/UL/20170122/NEWS06/170129767/AR/0/AR-170129767.jpg", "type": "primary" }], "engagements": [ { "value": "4", "type": "comments" } ], "content": "On his first full day as president, Trump said he had a “running war” with the media and accused journalists of underestimating the number of people who turned out Friday for his swearing-in.\n\nWhite House officials made clear no truce was on the horizon on Sunday in television interviews that set a much harsher tone in the traditionally adversarial relationship between the White House and the press corps.\n\n“The point is not the crowd size. The point is the attacks and the attempt to delegitimize this president in one day. And we’re not going to sit around and take it,” Chief of Staff Reince Priebus said on “Fox News Sunday.”\n\nThe sparring with the media has dominated Trump’s first weekend in office, eclipsing debate over policy and Cabinet
  • 19. Effects of AI Crawling: 80-90% lower human effort without loss in quality compared with state-of-the-art; 10-100x more sources and domains than existing automated solutions and affordable supervised ones; 3-10x more attributes. E.g., 300k+ news sources, 1M+ company websites, job postings, press releases
  • 20. Information extraction: Identify and disambiguate mentions of entities of interest in a document. Example: Tesla has announced the full acquisition of SolarCity which closed on Monday morning. Tokenizer + Splitter + PoS: Tesla/NNP has/VBZ announced/VBN the/DT full/JJ acquisition/NN of/IN SolarCity/NNP which/WDT closed/VBD on/IN Monday/NNP morning/NN. NER: Tesla [ORG], SolarCity [ORG], Monday morning [DATETIME]. NED candidates: Tesla: Tesla Science Center at Wardenclyffe, Tesla (Czechoslovak company), Tesla, Inc.; SolarCity: SolarCity Corporation, City Solar AG; Monday: Black Monday, Monday
  • 21. Motivation • Disambiguated entities are necessary to produce new relations for the graph via Relation Extraction (RE): without disambiguation, relations cannot be linked to the nodes in the graph • Plain keyword search is extremely noisy. Example: Tesla has announced the full acquisition of SolarCity which closed on Monday morning. Graph: Tesla, Inc. (CEO: Elon Musk; HQ: Palo Alto, CA); SolarCity Corporation (CEO: Lyndon Rive; HQ: San Mateo, CA); ACQUISITION: acquire(Tesla_Inc, SolarCity_Corp)
  • 22. NER architecture ● No feature engineering required ● Current SOTA does use DL ● Transfer learning to new domains with fewer training instances ● Better generalization
  • 23. Character Representation using CNN ● Different embedding sizes (token, character) ● Optimizers: Adam, RMSProp, SGD ● Varying dropout rates ● Changing number of hidden layers and units ● Different regularization techniques ● Average pooling vs max pooling for CNN ● Varying filter and window length for CNN. Ref: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Xuezhe Ma, Eduard Hovy, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)
  • 24. NED Architecture. Plain text: Tesla has announced the full acquisition of SolarCity which closed on Monday morning. NER annotated spans: entity(0,5,ORG,"Tesla") entity(44,53,ORG,"SolarCity") entity(70,84,DATETIME,"Monday Morning"). NED: Leipzig University’s AGDISTIS with the HITS scoring algorithm over an indexed triple store. Disambiguated entities: Tesla |-> http://.../mw/Tesla_Inc, SolarCity |-> http://.../dbp/SolarCity. KG: Tesla Inc., SolarCity Corporation
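The annotated spans on the NED architecture slide are character offsets into the plain text. As a minimal sketch of that hand-off format (the `Span` type and `annotate` helper are hypothetical, and the mention list is given by hand rather than produced by a real NER model), the offsets can be reproduced like this:

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity type, e.g. ORG
    text: str    # surface form


def annotate(text: str, mentions: List[Tuple[str, str]]) -> List[Span]:
    """Locate each (surface, label) mention in text and return its span.

    Stand-in for an NER model: it only finds the first literal occurrence.
    """
    spans = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            continue  # mention not present in the text
        spans.append(Span(start, start + len(surface), label, surface))
    return spans


SENTENCE = ("Tesla has announced the full acquisition of SolarCity "
            "which closed on Monday morning.")
MENTIONS = [("Tesla", "ORG"), ("SolarCity", "ORG"), ("Monday morning", "DATETIME")]
SPANS = annotate(SENTENCE, MENTIONS)

for span in SPANS:
    print(f"entity({span.start},{span.end},{span.label},{span.text!r})")
```

The resulting offsets match the ones shown on the slide, which is what a downstream NED step needs in order to map each mention back to the source text.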
  • 25. HITS Algorithm (PageRank variant) ● It is query dependent, that is, the (hub and authority) scores resulting from the link analysis are influenced by the search terms; ● As a corollary, it is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing. ● It computes two scores per document, hub and authority, as opposed to a single score; ● It is processed on a small subset of ‘relevant’ documents (a 'focused subgraph' or base set), not all documents as was the case with PageRank.
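The hub and authority scores described on the HITS slide can be computed by simple power iteration. The following is a minimal sketch on a toy candidate graph; the node names and edges are hypothetical and are not taken from the deck's triple store, and this is not AGDISTIS's actual implementation.

```python
from math import sqrt
from typing import Dict, List, Tuple


def hits(edges: List[Tuple[str, str]], iterations: int = 50
         ) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Return (hub, authority) scores for a directed graph given as edges."""
    nodes = sorted({n for edge in edges for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: sum of hub scores of nodes linking to this node.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        auth = {n: v / norm for n, v in new_auth.items()}
        # Hub update: sum of authority scores of nodes this node links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {n: v / norm for n, v in new_hub.items()}
    return hub, auth


# Hypothetical focused subgraph around the candidates for "Tesla" and "SolarCity".
EDGES = [
    ("Tesla_Inc", "SolarCity_Corp"),
    ("Elon_Musk", "Tesla_Inc"),
    ("Elon_Musk", "SolarCity_Corp"),
    ("Tesla_Science_Center", "Nikola_Tesla"),
]
HUB, AUTH = hits(EDGES)
for node in sorted(AUTH, key=AUTH.get, reverse=True):
    print(f"{node}: authority={AUTH[node]:.3f} hub={HUB[node]:.3f}")
```

In this toy graph the candidates that are connected to each other end up with high scores, while the isolated reading of "Tesla" decays towards zero, which is the intuition behind using link analysis for disambiguation.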
  • 26. Problems and solutions. Problem: AGDISTIS only receives named entities, thus its disambiguation context is limited. Solution: Complement the NER with a domain-specific (non-named) entity recognizer and verbs. • We have a first version available for the business domain • Bootstrapped from a large corpus of docs and a termbank generation algorithm. Examples: Tim Cook [PER] is the CEO [BUSINESS TERM] of Apple [ORG]. I know Bayer [ORG] is a pharmaceutical [INDUSTRY] company [BUSINESS TERM]. Problem: AGDISTIS is language agnostic but the index is not. We were initially limited to English. Solution: Replicate the same process on other languages by mapping the index fields • Not all languages have comprehensive DBpedia fields • The English part (the richest) of the index probably always has to be present
  • 28. Relation extraction using LSTMs. Input sentence: Google is competing with Microsoft. Layers: Input Sentence, Embedding Layer (vectors), LSTM Layer, Dense Layer, Softmax Output
  • 29. Knowledge Graph • Funding Developments • Leadership Changes • New Offerings • Bankruptcy, Restructuring, Cost Cutting Editorial Influencer DB Job Postings Press Releases Company Database SEC Filings Patents Social Media Company Website FHAI Knowledge Graph 3rd Party providers • Competitor • Customer • Investment • Lawsuit/Litigation • Partnership • Companies • Brands • Products • Key people • Influencers • Relate facts • Data mining • Cognitive applications (higher-order reasoning) • Contextual Features • Supplier • Acquisition • Out/under performance • Expanding Operations • Compliance Entities: Goal: Infer high-level insights from a set of extracted events/facts.
  • 30. Challenges ● Knowledge deduplication / integration ● Truth finding (contradictory facts) ● Confidence values. Figure (source: Xin Luna Dong (Google), PVLDB ’14): Text (301M), Document Object Model (1,280M), Tables (10M), Annotations (website metadata) (145M)
  • 31. Graph embedding • Given a graph (g), entities (e) and relations (r), produce a low-rank tensor factorization of the co-occurrence cube of all combinations of (e,r,e) • Input: • Graph • Vector size • Goal: • Find vectors for all (e) and (r) that minimize the scoring function: • Output: • Embedding vectors for (e) and (r).
  • 32. Link prediction using embeddings • Given a pair of entities (e1,e2), give a score for how probable it is that they have a relation • Input: • Embedding vectors of entities and relation r • Annotated examples of true and false combinations of e,r,e • Goal: • Find a decision boundary that separates the true and false combinations • E.g.: Using a standard classifier (SVM or RandomForest), use embeddings as features • Output: • A probability score for e1,e2,r
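The link prediction recipe on this slide (embeddings as features, annotated true/false pairs, a standard classifier, a probability as output) can be sketched in a few lines. This sketch swaps in a hand-written logistic regression for the SVM or RandomForest named on the slide so that it runs without dependencies. The 2-dimensional embedding values, the pharma companies, and the choice of an element-wise product as the pair feature are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical entity embeddings (a real system would learn these from the graph).
EMB: Dict[str, List[float]] = {
    "BMW": [1.0, 0.1], "Mercedes": [0.9, 0.2], "Volvo": [1.0, 0.0],
    "Bayer": [0.1, 1.0], "Pfizer": [0.2, 0.9],
}

# Annotated (e1, e2) pairs for the relation "competitor": 1 = true, 0 = false.
TRAIN: List[Tuple[Tuple[str, str], int]] = [
    (("BMW", "Mercedes"), 1), (("Bayer", "Pfizer"), 1),
    (("BMW", "Bayer"), 0), (("Mercedes", "Pfizer"), 0),
]


def features(e1: str, e2: str) -> List[float]:
    """Pair feature: element-wise product of the two entity embeddings."""
    return [a * b for a, b in zip(EMB[e1], EMB[e2])]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs: int = 3000, lr: float = 1.0):
    """Fit logistic regression weights with batch gradient descent."""
    dim = len(features(*examples[0][0]))
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for (e1, e2), y in examples:
            x = features(e1, e2)
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        n = len(examples)
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


W, B = train(TRAIN)


def score(e1: str, e2: str) -> float:
    """Probability that (e1, competitor, e2) holds, according to the classifier."""
    return sigmoid(sum(wi * xi for wi, xi in zip(W, features(e1, e2))) + B)


print(f"Volvo vs BMW:   {score('Volvo', 'BMW'):.2f}")
print(f"Volvo vs Bayer: {score('Volvo', 'Bayer'):.2f}")
```

Volvo never appears in the training pairs, so the scores for it come entirely from how close its embedding lies to the other companies, which is the point of using embeddings as features.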
  • 33. Path ranking. Competitor? Entities: BMW, Mercedes, Volkswagen, Volvo, Scania. Industry (Car, Engines): strong indicator path. Country (Germany, Sweden): weak indicator path
  • 34. Link prediction combination • Combine PRA and embedding models to produce a superior link scoring/prediction algorithm. • Achievement: significant improvement over SOTA during our experiments
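The path ranking slide contrasts a strong indicator path (shared industry) with a weak one (shared country). A minimal sketch of that idea follows, using random-walk probabilities along typed paths as features. The triples, the assignment of companies to industries, and the two path weights are hypothetical; a real path ranking model would learn the weights from labelled pairs.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple

# Hypothetical toy graph loosely following the companies on the slide.
TRIPLES = [
    ("BMW", "industry", "Cars"), ("Mercedes", "industry", "Cars"),
    ("Volkswagen", "industry", "Cars"), ("Volvo", "industry", "Cars"),
    ("Scania", "industry", "Engines"),
    ("BMW", "country", "Germany"), ("Mercedes", "country", "Germany"),
    ("Volkswagen", "country", "Germany"),
    ("Volvo", "country", "Sweden"), ("Scania", "country", "Sweden"),
]


def build_index(triples) -> Dict[Tuple[str, str], Set[str]]:
    """Index neighbours by (node, relation), adding inverse relations."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for head, rel, tail in triples:
        index[(head, rel)].add(tail)
        index[(tail, rel + "^-1")].add(head)
    return index


def path_probability(index, source: str, target: str, path: Sequence[str]) -> float:
    """Probability that a random walk from source along path ends at target."""
    dist = {source: 1.0}
    for rel in path:
        nxt: Dict[str, float] = defaultdict(float)
        for node, prob in dist.items():
            neighbours = index.get((node, rel), ())
            for neighbour in neighbours:
                nxt[neighbour] += prob / len(neighbours)
        dist = nxt
    return dist.get(target, 0.0)


INDEX = build_index(TRIPLES)
PATHS = {
    "industry": ("industry", "industry^-1"),
    "country": ("country", "country^-1"),
}
# Assumed weights: industry is a strong indicator, country a weak one.
WEIGHTS = {"industry": 0.9, "country": 0.1}


def competitor_score(a: str, b: str) -> float:
    """Weighted sum of path features for the candidate pair (a, b)."""
    return sum(WEIGHTS[name] * path_probability(INDEX, a, b, path)
               for name, path in PATHS.items())


print(f"BMW-Mercedes: {competitor_score('BMW', 'Mercedes'):.3f}")
print(f"Volvo-Scania: {competitor_score('Volvo', 'Scania'):.3f}")
```

Under these assumed weights, two companies that share only a country score far lower than two that share an industry.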
  • 35. Output: 275M facts mined (distinct). Entities: Organization 11,171,077; Person 1,708,796. Relation instances: Competition 33,327,137; Works At 228,070; Investor (Company) 93,980; Founder 67,515; Board Member 43,525; Acquisition 15,532; Investor (Person) 10,420; Sub-organization 4,214
  • 36. Data platform. Sources: Forums, Online News, Patents & Trademarks, Job Postings, Social, Company Websites, Blogs. 2.8k vCPU, 21TB RAM, 630TB SSD; 200B Documents; 30M Sources. Analytics Layer, Serving Layer, Data Science Platform, Knowledge Graph
  • 37. NLP/IE Pipelines NED 2 LANGUAGE-COUNTRY en-us en-uk sv-se fr-fr fr-ca ... DL language classifier topic classifier NER1 router NER2 . . . country classifier CLASSIFIER TOPICS Arts & Entertainment Business Demographic Groups Environment & Nature Events Government & Politics Health Lifestyle Living Things Media Science Social Affairs Sports Technology { “topic”: “business”, “language”: ”en”, “country”: ”us”, “section”: “body” } NED1 NED n . . . RE2 RE1 REn . . . DP model repo nlp-data repo registry NED index repo GS repo scoringGS versioned SW NERn Every time a new component software version is registered, a scoring task against the GS is triggered A workflow for analyzing datasets (Batch & Real Time) The standard workflow fetching documents from the data lake DOCUMENT SECTIONS title ingress highlights body captions quotations Supports multiway data flows, e.g., for ensembles of NERs IR
  • 38. Human in the loop ● Annotate text, entities, classification and custom HTML ● Task assignment, ranking, inter-annotator agreement ● Gold set creation for any structured data like NED, Knowledge Fusion. Very time consuming
  • 39. Data Programming. Our deep learning approach for extracting entity-relation tuples requires a huge amount of labelled data, which is an expensive and time-consuming effort. Example: Facebook is competing with Google. With Snorkel the goal is to write heuristics to programmatically generate training data. Potential relation mentions, heuristic (labelling) functions f1 ... fn, probabilistic training labels
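The data programming idea on this slide, heuristic labelling functions that vote on candidate relation mentions, can be sketched without any library. The labelling functions and sentences below are hypothetical, and the votes are combined by a plain vote fraction rather than the generative label model that Snorkel fits, so this only illustrates the shape of the approach.

```python
import re
from typing import Callable, List, Optional

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_competing_with(sentence: str) -> int:
    """Vote POSITIVE when the sentence uses a 'compete with/against' pattern."""
    pattern = r"\bcompet(?:e|es|ed|ing) (?:with|against)\b"
    return POSITIVE if re.search(pattern, sentence, re.I) else ABSTAIN


def lf_rival(sentence: str) -> int:
    """Vote POSITIVE when the sentence mentions a rival."""
    return POSITIVE if "rival" in sentence.lower() else ABSTAIN


def lf_partnership(sentence: str) -> int:
    """Vote NEGATIVE when the sentence talks about a partnership."""
    return NEGATIVE if "partner" in sentence.lower() else ABSTAIN


def lf_acquisition(sentence: str) -> int:
    """Vote NEGATIVE when the sentence describes an acquisition."""
    pattern = r"\bacqui(?:re|res|red|sition)\b"
    return NEGATIVE if re.search(pattern, sentence, re.I) else ABSTAIN


LABELLING_FUNCTIONS: List[Callable[[str], int]] = [
    lf_competing_with, lf_rival, lf_partnership, lf_acquisition,
]


def soft_label(sentence: str) -> Optional[float]:
    """Fraction of non-abstaining votes that are POSITIVE, or None if all abstain."""
    votes = [lf(sentence) for lf in LABELLING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return None
    return sum(1 for v in votes if v == POSITIVE) / len(votes)


for text in [
    "Facebook is competing with Google",
    "Tesla has announced the full acquisition of SolarCity",
    "Google, a rival of Microsoft, partners with it on standards",
    "Tim Cook is the CEO of Apple",
]:
    print(f"{soft_label(text)!s:>5}  {text}")
```

Sentences where the functions disagree receive an intermediate label, and sentences no function covers are left unlabelled; a downstream model would be trained on the labelled portion.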
  • 40. Snorkel: full pipeline. Marginal likelihood estimate to learn the joint distribution of data and latent labels
  • 42. Aspect level sentiment extraction using character level LSTM • Input text is processed as a sequence of UTF-8 encoded bytes. • Hidden states of the model encode all information the model has learned. • Final cell states are used as the feature representation. • Encoded output values range from -1 to 1. • The mLSTM response lag is corrected using the reverse correlation method, which ensures the responses align with the corresponding text. • The underlying lag-corrected mLSTM responses to individual keyphrases are averaged to produce keyphrases with sentiment values. System block diagram: input text, mLSTM encoder, aspect extractor, encoder lag correction, aspect level sentiment. References: Multiplicative LSTM for sequence modelling, Krause et al., 2017; Learning to Generate Reviews and Discovering Sentiment, Radford et al., 2017
  • 43. ALS extraction using character level LSTM. Network design: • Single-layer multiplicative LSTM with 4096 units • Mini-batches of 128 subsequences of length 256 • 4 Pascal Titan X GPUs • Training took approximately one month • Trained on ~100M online reviews that are labelled. References: Multiplicative LSTM for sequence modelling, Krause et al., 2017; Learning to Generate Reviews and Discovering Sentiment, Radford et al., 2017
  • 44. Unsupervised SAE. Large corpus of homogeneous documents (50k ~ 250k) • same domain (use a classifier), preferably no bundles. Normalisation and tagging • tokenisation (NUT specific) • orthography normalisation (most common orthography) • POS tagging (Hepple’s on TreeBank) • NP chunking (Ramshaw & Marcus). NP Clustering • head noun lemmatization (approx. last noun in NP) • frequent head nouns -> aspect terms. Segmentation • cPMI optimal parsing of an NP -> modifiers / multi-words. Generalisation and typing • structured aspect patterns (SAP) • entity, aspect term, qualifier, quantifier
  • 45. Aspect sentiment extraction using character level LSTM. Example: the filled markers indicate shifts in the LSTM response that are used to extract keyphrases in the text and their corresponding sentiment values from the LSTM response. Aspect / Sentiment: Display: negative; Email alert notification: negative; Fonts: negative; Wallpaper: negative. References: Multiplicative LSTM for sequence modelling, Krause et al., 2017; Learning to Generate Reviews and Discovering Sentiment, Radford et al., 2017
  • 47. Connectors to serving systems: Data Ingestion & Insights Delivery by setting up simple schema mappers
  • 48. Involve users, entrepreneurs, and researchers. 6 Data Science Hubs (co-working spaces): ✔ Sydney ✔ Berlin ✔ New York ✔ London ✔ San Francisco ✔ Singapore. Meltwater Entrepreneurial School of Technology: • HQ in Accra, Ghana • Training program for African entrepreneurs • Incubator (25+ startups) • Networking hub. University collaborations
  • 49. Building blocks to leverage the platform: Streaming, Search, Analytics, APIs. Data Enrichment Platform: Enrich, analyze & build insights by interoperating with all major players. Knowledge Graph: Enable cognitive applications on top of our data by connecting the dots. AI-Driven Data Acquisition: Bring high quality outside data to our repository with minimal human effort. Media Intelligence Apps, New Apps, Enterprise Solutions, Custom solutions, 3rd party Apps, PaaS. Outside Data, Context Building, Enrichment & Analysis, Service Layer. Global Monitoring, Distribute, Analyze & Report, Influence & Engage, Outside Insight, AI-Powered Reporting, Employee App, Freemium