Web Data Extraction: A Crash Course
Giorgio Orsi
(giorgio.orsi@meltwater.com)
WHO AM I
Senior Research Scientist at Meltwater, a global Media Intelligence company
Honorary Researcher at the School of CS at the University of Birmingham
Co-investigator of the EPSRC VADA (Value Added Data) Programme Grant
Co-founder and Head of Data Engineering of Wrapidity, a web scraping startup
About me
I like playing with data!
About Meltwater
Meltwater: Media Intelligence
influencers, trends, sentiment analysis, media exposure
Meltwater: Science and Entrepreneurship
University collaborations
6 Data Science Hubs (co-working spaces)
London
San Francisco
Singapore
Sydney
Berlin
New York
Meltwater Entrepreneurial School of Technology
HQ in Accra, Ghana
Training program for African entrepreneurs
Incubator (25+ startups)
Networking hub
Web Data Extraction
refcode postcode bedrooms bathrooms available price
33453 OX2 6AR 3 2 15/10/2013 £1280 pcm
33433 OX4 7DG 2 1 18/04/2013 £995 pcm
Process of turning semi-structured (templated) web data into structured data
>10000
Web Data Extraction vs Information Extraction
Data is structured according to templates, annotated or styled
Web Data Extraction vs Information Extraction
Data is hidden in plain text (entities, relations, aspects)
not our focus today…
Web Data Extraction: Why
– Nilesh Dalvi
Yahoo!, then Facebook
“For many kinds of information one has to extract
from thousands of sites in order to build a
comprehensive database”
http://arxiv.org/pdf/1203.6406.pdf
Knowledge base construction (Yago, DBPedia, Wikidata, BabelNet)
Web Data Extraction: Why
Converging trends in data management
outside insight: shift from internal data to
external data (social, knowledge bases,
news, reports, reviews, jobs)
dark data: semi-structured and unstructured data
data preparation: preparing and maintaining
data for mining and analytics
leading vs lagging performance indicators
Typical comments about web data extraction
Microdata and the semantic web have solved the problem
All the data you need is in web tables
APIs provide all the structured data you need
The Academic Web
[Diagram: a product provider exposing a Semantic API (RDF), a Structured API (XML/JSON) and an HTML interface with 1 template]
The Real Web
Web data extraction is not (even remotely) a solved problem
APIs limited to large websites (aggregators)
Web tables and microdata are marginal
The real problem is not one-time extraction, but keeping it up over time
[Diagram repeated from the previous slide: product provider, Semantic API (RDF), Structured API (XML/JSON), HTML interface, 1 template]
● 1B+ webpages over the Web
● Contribution is skewed: 1-50K
As of 11/2013
Source: Xin Luna Dong (Google, now at Amazon) - PVLDB ‘14
[Figure: chart labelled ANNO, Information Extraction and Data Extraction; values shown: 110M, 0.3M, 1.5M, 13K, 1.1M, 1.7M]
The Real Web
Web Data Extraction: How
manual / (semi) supervised: accurate, but expensive + non-scalable
unsupervised: less accurate, but cheaper + scalable
What have we tried so far
Wrapper Induction: similar objects are presented in similar structures
You need training data (web = high sample complexity + many features)
use human to inform system (supervision / crowd)
Fully unsupervised methods fail beyond simple structures and get tricked by
regular noise
Fact redundancy (e.g., Google Knowledge Vault)
Works well with highly-redundant common-sense facts (London, Barack Obama)
Ephemeral and infrequent entities get lost or noisy…
Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, Wei Zhang:
From Data Fusion to Knowledge Fusion. PVLDB 7(10): 881-892 (2014)
Valter Crescenzi, Giansalvatore Mecca:
Automatic information extraction from large websites. J. ACM 51(5): 731-779 (2004)
Tim Furche, Jinsong Guo, Sebastian Maneth, Christian Schallhart:
Robust and Noise Resistant Wrapper Induction. SIGMOD Conference 2016: 773-784
It’s all about…
Scale
DIADEM / Wrapidity: Full-site Web Data Extraction
Bringing Web Data Extraction to the real web and at industrial scale
Key insights…
Replace site supervision with domain knowledge
Increase robustness of wrapper generation algorithms (navigation and induction)
Make wrapper generation algorithms knowledge-parametric (both ML and Rules)
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang:
DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)
DIADEM: Full-site Web Data Extraction
Template discovery
Result pages, detail pages
Navigation
forms, menus, categories,
bread crumbs, pagination,
infinite scroll, detail links
Full-site Web Data Extraction
Wrapper induction
generalisation and weaving
doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
//form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /}
/.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>]
/(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})*
//div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>]
[? .//span[@class='prop_price']/text():<price=normalize-space(.)> ]
[? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ]
[? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ]
[? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ]
[? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ]
[? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ]
[? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ]
[? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ]
[? .//@src:<image=normalize-space(.)> ]
[? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ]
[? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]
Wrapper Execution
Parallel execution
instantiation, splitting,
distribution, monitoring
[Figure: copies and fragments of the wrapper above, illustrating how one wrapper is instantiated and split into parts that are distributed to workers]
EMR on Amazon AWS
Data Cleaning and Wrapper Repair
A                 | B             | C                | D                      | E                                           | F       | G
Ava’s possessions | March 4, 2016 | Rated: R         | Off Hollywood Pictures | Genre(s): Sci-Fi, Mystery, Thriller, Horror | 89 min  | 51
Camino            | March 4, 2016 | Rated: Not Rated | Bielberg Entertainment | Genre(s): Action, Adventure, Thriller       | 103 min | tbd
Wrapper-generated instance
[Figure: flow network from SOURCE to SINK with nodes vA,(), vB,(A), vC,(A,B), vD,(A), vB,(), vA,(B), vD,(B,A) and edge weights 13, 9 and 4]
title releaseMonth releaseDay releaseYear rating genres producer runtime overall score
Target signature
Stefano Ortona, Giorgio Orsi, Tim Furche, Marcello Buoncristiano:
Joint repairs for web wrappers. ICDE 2016: 1146-1157
Background Knowledge
Domain knowledge (once per application domain)
Describe target objects via entities, relationships, instances
Provide a way to identify them on web pages via shallow NLP (dictionaries, regexes)
Use it to annotate both the visible and invisible parts of the live DOM
Record
DataArea
Page
Result Page
Detail Page
Block
Attribute
Nav Menu
Form
…
RE record
Price
property type
location
beds
…
RE data area search res number
records number
Metamodel
Model
Rules
(cursymb:instance) number:instance[value>=80k && value<=200M]
|
number:instance[value>80k && value<=200M] (cursymb:instance | curname:instance) -> price:instance
cursymb:instance
£ -> { norm = GBP }
$ -> { norm = USD }
GBP -> { norm = GBP }
USD -> { norm = USD }
Dictionaries
pounds -> {norm = GBP}
dollars -> {norm = USD}
curname:instance
price
amount
price:label
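The rule and dictionaries above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DIADEM's annotation engine: the regular expressions, the Price class and the function annotate_price are hypothetical names, the split of entries between the two dictionaries is a guess, and the value range is treated as inclusive on both ends (the slide uses >= in one alternative and > in the other).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Dictionaries from the slide: surface form -> normalised currency code.
# Which entries belong to cursymb and which to curname is an assumption.
CURSYMB = {"£": "GBP", "$": "USD", "GBP": "GBP", "USD": "USD"}
CURNAME = {"pounds": "GBP", "dollars": "USD"}
CURRENCIES = {**CURSYMB, **CURNAME}

# Value range from the rule: 80k to 200M (inclusive here, by assumption).
MIN_VALUE = 80_000
MAX_VALUE = 200_000_000

_SCALE = {"k": 1_000, "m": 1_000_000}
# A number with optional thousands separators and an optional k/M suffix.
_NUMBER = r"(\d[\d,]*(?:\.\d+)?)([kKmM](?![a-zA-Z]))?"
_SYMBOLS = "|".join(re.escape(s) for s in CURSYMB)
_NAMES = "|".join(re.escape(s) for s in CURRENCIES)

# First alternative of the rule: currency symbol followed by a number.
PREFIX = re.compile(rf"({_SYMBOLS})\s*{_NUMBER}")
# Second alternative: number followed by a currency symbol or name.
SUFFIX = re.compile(rf"{_NUMBER}\s*({_NAMES})(?!\w)")


@dataclass
class Price:
    value: float
    currency: str


def _to_value(digits: str, suffix: Optional[str]) -> float:
    value = float(digits.replace(",", ""))
    return value * _SCALE[suffix.lower()] if suffix else value


def annotate_price(text: str) -> Optional[Price]:
    """Return the first in-range price found in text, or None."""
    # (pattern, group index of the currency, group index of the digits)
    for pattern, cur_group, num_group in ((PREFIX, 1, 2), (SUFFIX, 3, 1)):
        for match in pattern.finditer(text):
            value = _to_value(match.group(num_group), match.group(num_group + 1))
            if MIN_VALUE <= value <= MAX_VALUE:
                return Price(value, CURRENCIES[match.group(cur_group)])
    return None


print(annotate_price("Guide price £250,000"))
print(annotate_price("£1280 pcm"))  # a monthly rent falls outside the range
```

The range check is what keeps the annotator domain-specific: a rent of £1280 looks like a price syntactically, but it is rejected for a for-sale domain.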
Background Knowledge
Labels and instances, visible and invisible (HTML structure, Javascript values)
labels
instances
<div class="icon first">
<img src="…/bdes.jpg" alt="Bedrooms" title="Bedrooms">
<br>8
</div>
<div class="icon">
<img src="…/bath.jpg" alt="Bathrooms" title="Bathrooms">
<br>4
</div>
labels
Javascript values
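A minimal sketch of how invisible labels could be paired with visible instances in markup like the snippet above, using only Python's standard html.parser. The class name IconAttributeParser, the LABELS mapping and the image file names are assumptions made for this example; the sketch also covers only the HTML case, not Javascript values.

```python
from html.parser import HTMLParser

# Hypothetical label dictionary: alt text -> attribute name.
LABELS = {"Bedrooms": "bedroom_number", "Bathrooms": "bathroom_number"}

SNIPPET = """
<div class="icon first">
<img src="bdes.jpg" alt="Bedrooms" title="Bedrooms">
<br>8
</div>
<div class="icon">
<img src="bath.jpg" alt="Bathrooms" title="Bathrooms">
<br>4
</div>
"""


class IconAttributeParser(HTMLParser):
    """Pairs a label hidden in an img alt attribute with the next visible text."""

    def __init__(self):
        super().__init__()
        self.pending = None  # attribute name waiting for its value
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt in LABELS:
                self.pending = LABELS[alt]

    def handle_data(self, data):
        text = data.strip()
        if text and self.pending:
            self.values[self.pending] = text
            self.pending = None


parser = IconAttributeParser()
parser.feed(SNIPPET)
print(parser.values)
```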
DOM Annotation
Combine standalone/online annotators using ML + Ontologies (Argumentation)
When not enough, create gazetteers and JAPE rules (GATE framework)
ROSeAnn – Reconciling Opinions of Semantic Annotators
Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt:
ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)
Forms
Forms are one of the hardest things to deal with in Web Data Extraction
entry point forms; secondary / refinement forms
Form Understanding and querying
Form labelling
Field grouping
form filling / querying
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart:
The ontological key: automatically understanding and integrating forms to access the deep Web.
VLDB J. 22(5): 615-640 (2013)
Michael Benedikt, Balder ten Cate, Efthymia Tsamoura:
Generating Plans from Proofs. ACM Trans. Database Syst. 40(4): 22:1-22:45 (2016)
Exploration strategy
Knowledge-driven focused crawling
Relational Transducers to declaratively represent strategies (data driven)
Everything gets translated into logical facts
Decision: Which action to take?
[Figure 8: DIADEM controller: action generation and execution. Stages run from Stage 1 (Init Page) to Stage 5 (Finalize); other labels: crawler, next, link, filling, back, iFrame, Browser Interaction, success, failure]
(2) If there are multiple execution-ready transducers, control flow is determined by priorities dynamically computed by the control transducer. Transducers are executed in order of their priority. Dependency and guard rules, registered by the individual trans-
Guarded FSTs
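The priority-driven control flow described in the excerpt above can be illustrated with a toy controller. This is a sketch under assumptions, not the DIADEM controller: the Transducer class, the state dictionary and the three example transducers are hypothetical, and the loop re-evaluates guards and priorities after every action, which is one possible reading of "executed in order of their priority".

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

State = Dict[str, Any]


@dataclass
class Transducer:
    name: str
    guard: Callable[[State], bool]    # is the transducer execution-ready?
    priority: Callable[[State], int]  # computed dynamically from the state
    action: Callable[[State], None]   # updates the facts in the state


def run_controller(transducers: List[Transducer], state: State, max_steps: int = 20) -> List[str]:
    """Repeatedly run the highest-priority transducer whose guard holds."""
    trace = []
    for _ in range(max_steps):
        ready = [t for t in transducers if t.guard(state)]
        if not ready:
            break
        best = max(ready, key=lambda t: t.priority(state))
        best.action(state)
        trace.append(best.name)
    return trace


state = {"form_filled": False, "pages_left": 2, "done": False}
transducers = [
    Transducer("fill_form",
               guard=lambda s: not s["form_filled"],
               priority=lambda s: 10,
               action=lambda s: s.update(form_filled=True)),
    Transducer("next_page",
               guard=lambda s: s["form_filled"] and s["pages_left"] > 0,
               priority=lambda s: 5,
               action=lambda s: s.update(pages_left=s["pages_left"] - 1)),
    Transducer("finalize",
               guard=lambda s: s["form_filled"] and s["pages_left"] == 0 and not s["done"],
               priority=lambda s: 1,
               action=lambda s: s.update(done=True)),
]
print(run_controller(transducers, state))
```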
Website Exploration
Block / Page classification
ML (SVM and Decision trees)
Features are knowledge-parametric
Template Discovery
Detail page analysis
use result pages to collect corresponding detail pages
collate the detail pages and use result-page analysis (harder… more noise)
use result-detail redundancy to compensate for additional noise
Result page analysis
essentially… it is tree-mining (well known problem)
regular annotations in regular DOM structures
compensates for low precision
microstructures (tables, lists, key-value maps)
compensates for low recall
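A toy illustration of "regular annotations in regular DOM structures": look for the parent node whose same-tag children most regularly contain a PRICE annotation, and treat those children as records. It is a simplification for well-formed markup using the standard library, not the tree-mining used in DIADEM; the PRICE regex, the sample page and the function find_record_prices are assumptions of this sketch.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical shallow annotator for prices.
PRICE = re.compile(r"£\s?\d[\d,]*")

PAGE = """
<html><body>
  <div id="nav"><a>Home</a><a>Contact</a></div>
  <div id="results">
    <div><p><span>£250,000</span> <b>Oxford</b></p></div>
    <div><p><span>£310,000</span> <b>Abingdon</b></p></div>
    <div><em>Sponsored listing</em></div>
    <div><p><strong>£199,950</strong> <b>Didcot</b></p></div>
  </div>
</body></html>
"""


def text_of(element):
    """All text below an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def find_record_prices(root):
    """Prices of the records under the best-supported data area."""
    best = (0, None, None)  # (support, parent, child tag)
    for parent in root.iter():
        # Per child tag, count the children that carry a PRICE annotation.
        support = Counter(
            child.tag for child in parent if PRICE.search(text_of(child))
        )
        for tag, count in support.items():
            # A single annotated child is not a regular structure.
            if count >= 2 and count > best[0]:
                best = (count, parent, tag)
    _, parent, tag = best
    if parent is None:
        return []
    return [
        PRICE.search(text_of(child)).group(0)
        for child in parent
        if child.tag == tag and PRICE.search(text_of(child))
    ]


print(find_record_prices(ET.fromstring(PAGE.strip())))
```

The sponsored block has the same tag as the records but no annotation, so it is skipped; an isolated price elsewhere on the page would not form a repeated group and would be ignored as well.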
[Figure 5: Attribute alignment. DOM subtrees inside a data area (a, div, p, span, b, em, strong and i nodes) carrying PRICE, LOCATION and BEDS annotations]
doc('http://www.trulia.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
//form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /}
/.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>]
/(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})*
//div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>]
[? .//span[@class='prop_price']/text():<price=normalize-space(.)> ]
[? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ]
[? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ]
[? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ]
[? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ]
[? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ]
[? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ]
[? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ]
[? .//@src:<image=normalize-space(.)> ]
[? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ]
[? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]
Navigation
Record &
attributes
Extraction Language: OXPath
OXPath = XPath + 4
iteration / visual axis / actions / extraction markers
https://github.com/diadem/OXPath
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers:
OXPath: A language for scalable data extraction, automation, and crawling on the deep web.
VLDB J. 22(1): 47-72 (2013)
Evaluation
Effort per source
no human effort
about 5 mins analysis time
about 150 pages per CPU hour for extraction
about 1 hour per 50k records in cleaning and post-processing
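As a back-of-the-envelope aid, the rates above can be turned into a per-source estimate. The function estimate_hours and the example site size (3,000 pages, 25,000 records) are made up for illustration; only the three rates come from the slide.

```python
# Rates quoted on the slide.
ANALYSIS_MINUTES = 5
PAGES_PER_CPU_HOUR = 150
RECORDS_PER_CLEANING_HOUR = 50_000


def estimate_hours(pages: int, records: int) -> dict:
    """Rough machine time per source, in hours, split by phase."""
    analysis = ANALYSIS_MINUTES / 60
    extraction = pages / PAGES_PER_CPU_HOUR
    cleaning = records / RECORDS_PER_CLEANING_HOUR
    return {
        "analysis": analysis,
        "extraction": extraction,
        "cleaning": cleaning,
        "total": analysis + extraction + cleaning,
    }


# Hypothetical site: 3,000 pages holding 25,000 records.
print(estimate_hours(pages=3_000, records=25_000))
```

For a site of that size, extraction dominates: 20 CPU hours against half an hour of cleaning and five minutes of analysis.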
[Plot "RE-FULL": time (minutes, ticks 0 to 20) and visited pages (ticks 0 to 40)]
Evaluation: Templated Websites
160,000 restaurant chain locations, from over 295 chains including all major chains
85% effective wrappers, all automatically maintained
95% precision of extracted location information
How is it done in Meltwater
Our ingestion fetches about 3.3M documents / day from 190k editorial sources,
re-crawled every 30 minutes.
With the social fire hoses we go up to 30M documents / day.
Since its foundation, Meltwater has indexed almost 200B documents.
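To put these volumes in perspective, here is the arithmetic for sustained rates. The helper names are invented for this example, and the figures are simple averages that ignore bursts and daily cycles.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(docs_per_day: float) -> float:
    """Average sustained document rate."""
    return docs_per_day / SECONDS_PER_DAY


def source_visits_per_second(sources: int, recrawl_minutes: float) -> float:
    """Average rate of source visits needed to honour the re-crawl interval."""
    return sources / (recrawl_minutes * 60)


print(round(docs_per_second(3_300_000), 1))            # editorial: about 38.2 docs/s
print(round(docs_per_second(30_000_000), 1))           # with social: about 347.2 docs/s
print(round(source_visits_per_second(190_000, 30), 1)) # about 105.6 source visits/s
```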
How is it done in Meltwater
Asian websites generate as much content as the rest of the world combined.
We sometimes stretch our 2 secs politeness policy a bit.
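A politeness policy of this kind is usually enforced per host. The sketch below is one simple way to do it and is not Meltwater's crawler: the class PolitenessScheduler and its schedule method are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Spaces out fetches to the same host by a fixed delay."""

    def __init__(self, delay_seconds: float = 2.0):
        self.delay = delay_seconds
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def schedule(self, url: str, now: float) -> float:
        """Return the time at which url may be fetched."""
        host = urlsplit(url).netloc
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + self.delay
        return start


scheduler = PolitenessScheduler(delay_seconds=2.0)
print(scheduler.schedule("http://a.example/1", now=0.0))  # first hit: immediately
print(scheduler.schedule("http://a.example/2", now=0.0))  # same host: pushed back
print(scheduler.schedule("http://b.example/1", now=0.5))  # other host: immediately
```

Stretching the policy then amounts to lowering delay_seconds for selected hosts, which is a trade-off between freshness and load on the source.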
How is it done in Meltwater
Ingestion:
Social media hoses (partnerships)
Editorial (partnerships + web crawling)
Broadcasts (views on the above)
Storage and search:
Elasticsearch
RabbitMQ (distributed queues)
AWS
Enrichments (15 languages):
Text categorization (topic, language)
NERD (person, location, organization, ...)
NED ( https://en.wikipedia.org/wiki/Tim_Cook )
Sentiment Analysis
Media Intelligence applications
Boolean queries (keywords / entities)
Counters
Aggregates
Drill downs / pivoting
AWS cluster: 354 i3.2xlarge, about 2800 vCPU, 21TB RAM, 630TB NVMe disks, 30K shards
JSON-XPath
"articleTpls": [
  {
    "id": "article_template_0",
    "startUrls": [
      "http://www-03.ibm.com/press/us/en/pressrelease/33304.wss",
      "http://www-03.ibm.com/press/us/en/pressrelease/33420.wss",
      "http://www-03.ibm.com/press/us/en/pressrelease/33117.wss",
      "http://www-03.ibm.com/press/us/en/pressrelease/33303.wss"
    ],
    "urlPatterns": [
      "(?<wordset>([a-zA-Z]{1,}[:]{1,}){1,1})//(?<wordnumberset>([\\w]{1,}[-.]{1,}){1,3}[\\w]{1,})/(?<wordset1>([a-zA-Z]{1,}[/]{1,}){1,3}[a-zA-Z]{1,})/(?<wordnumberset1>([\\w]{1,}[.]{1,}){1,1}[\\w]{1,})"
    ],
    "titleXpath": "wrty:normalize-space(//h1[@class='ibm-small'])",
    "bylineXpath": "//div[@class='ibm-two-column']//strong",
    "ingressXpath": "wrty:normalize-space(//div[@id='ibm-content-main']/div[@class='ibm-container'][1]//p[1])",
    "contentXpath": {
      "includeXpath": "wrty:normalize-space(wrty:string-join(//div[@id='ibm-content-main']//div[@class='ibm-container-body']/node()[self::p|self::h2[@class='ibm-inner-subhead']], \"\\n\"))"
    },
    "engagementPatterns": [],
    "imagePatterns": [
      {
        "baseXpath": "//img[@width='500']",
        "urlXpath": "."
      }
    ],
    "authorPatterns": [
      {
        "baseXpath": "//div[@class='ibm-two-column']//strong",
        "nameXpath": "wrty:normalize-space(.)"
      }
    ]
  }
]
Instead of OXPath, a JSON-like wrapper specification is used
Why a JSON-like wrapper? Well, you can query JSON.
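A small sketch of what querying such a specification can look like. The spec below is a cut-down, hypothetical variant of the template above: the simplified URL patterns, the second template and the helper functions template_for and templates_using are all assumptions made for this example.

```python
import json
import re

# Hypothetical, simplified wrapper specification with two templates.
SPEC = json.loads(r"""
{
  "articleTpls": [
    {
      "id": "article_template_0",
      "urlPatterns": ["http://www-03\\.ibm\\.com/press/.+\\.wss"],
      "titleXpath": "//h1[@class='ibm-small']",
      "bylineXpath": "//div[@class='ibm-two-column']//strong",
      "imagePatterns": [{"baseXpath": "//img[@width='500']", "urlXpath": "."}]
    },
    {
      "id": "article_template_1",
      "urlPatterns": ["http://example\\.org/news/\\d+"],
      "titleXpath": "//h1",
      "imagePatterns": []
    }
  ]
}
""")


def template_for(url):
    """Id of the first template whose URL patterns match url, else None."""
    for tpl in SPEC["articleTpls"]:
        if any(re.fullmatch(pattern, url) for pattern in tpl["urlPatterns"]):
            return tpl["id"]
    return None


def templates_using(fragment):
    """Ids of templates with a top-level *Xpath field containing fragment."""
    return [
        tpl["id"]
        for tpl in SPEC["articleTpls"]
        if any(
            key.endswith("Xpath") and isinstance(value, str) and fragment in value
            for key, value in tpl.items()
        )
    ]


print(template_for("http://www-03.ibm.com/press/us/en/pressrelease/33304.wss"))
print(templates_using("ibm-"))
```

Queries like the second one are useful for maintenance: when a site renames a CSS class, they find every wrapper that depends on it.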
What do you do with the data?
Content:
Companies
Brands
Products
Key people
Influencers
Goals:
Relate facts
Data mining
Cognitive applications
Challenges:
Data Cleaning
Data deduplication / integration
Truth Finding
Let’s build a knowledge graph
5M orgs, 10M people, 200M edges
More information
Selected papers: Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian
Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database.
PVLDB (2014)
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon
Sellers: OXPath: A language for scalable data extraction, automation, and crawling
on the deep web. VLDB J. (2013)
Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart,
Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page
Extraction. RR (2011)
Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn:
Reconciling Opinions of Semantic Annotators. PVLDB (2013)
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian
Schallhart: The ontological key: automatically understanding and integrating forms
to access the deep Web. VLDB J. (2013)
Stefano Ortona, Giorgio Orsi, Marcello Buoncristiano, Tim Furche: Joint Repairs for Web
Wrappers. ICDE (2016)
Tim Furche, Georg Gottlob, Leonid Libkin, Giorgio Orsi, Norman W. Paton: Data Wrangling
for Big Data: Challenges and Opportunities. EDBT (2016)
Omer Gunes, Giorgio Orsi, Tim Furche: Structured Aspect Extraction. CoLing
(2016)
Know somebody who is looking for a PhD?
Meltwater is sponsoring a PhD Scholarship in Large-scale Sentiment Analysis
The post is based at the School of
Computer Science in Birmingham
supervised by Dr Mark Lee and myself
Access to Meltwater’s hoodies and goodies
AWS infrastructure
A huge knowledge graph
200B documents (social, editorial, financial statements, job posts) to play with
Questions?
More about me at: http://www.orsigiorgio.net/
More about Meltwater at: http://www.meltwater.com/
More about Wrapidity at: http://www.wrapidity.com/

Web Data Extraction: A Crash Course

  • 1.
    Web Data Extraction:A Crash Course Giorgio Orsi (giorgio.orsi@meltwater.com)
  • 2.
    WHO AM I SeniorResearch Scientist at Meltwater, a global Media Intelligence company Honorary Researcher at the School of CS at the University of Birmingham Co-investigator of the EPSRC VADA (Value Added Data) Programme Grant Co-founder and Head of Data Engineering of Wrapidity a web scraping startup About me I like playing with data!
  • 3.
  • 4.
    Meltwater: Media Intelligence influencerstrends sentiment analysis media exposure
  • 5.
    Meltwater: Science andEntrepreneurship University collaborations 6 Data Science Hubs (co-working spaces) London San Francisco Singapore Sydney Berlin New York Meltwater Entrepreneurial School of Technology HQ in Accra, Ghana Training program for African entrepreneurs Incubator (25+ startups) Networking hub
  • 6.
    Web Data Extraction refcodepostcode bedrooms bathrooms available price 33453 OX2 6AR 3 2 15/10/2013 £1280 pcm 33433 OX4 7DG 2 1 18/04/2013 £995 pcm Process or turning semi-structured (templated) web data into structured data >10000
  • 7.
    Web Data Extractionvs Information Extraction Data is structured according to templates, annotated or styled
  • 8.
    Web Data Extractionvs Information Extraction Data is hidden in plain text (entities, relations, aspects) not our focus today…
  • 9.
    Web Data Extraction:Why – N I L E S H D A LV I Yahoo!, then Facebook “For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database” http://arxiv.org/pdf/1203.6406.pdf Knowledge base construction (Yago, DBPedia, Wikidata, BabelNet)
  • 10.
    Web Data Extraction:Why Converging trends in data management outside insight: shift from internal data to external data (social, knowledge bases, news, reports, reviews, jobs) dark data: semi/un structured data data preparation: preparing and maintain data for mining and analytics leading vs lagging performance indicators
  • 11.
    Typical comments aboutweb data extraction Microdata and the semantic web have solved the problem All the data you need is in web tables APIs provide all the structured data you need The Academic Web Product provider Semantic API (RDF) Structured API (XML/JSON) HTML interface 1 template
  • 12.
    The Real Web Webdata extraction is not (even remotely) a solved problem APIs limited to large websites (aggregators) Web tables and microdata are marginal The real problem is not one-time extraction, but keep doing it over time Product provider Semantic API (RDF) Structured API (XML/JSON) HTML interface 1 template
  • 13.
    ● 1B+ Webpagesover the Web ● Contribution is skewed: 1- 50K As of 11/2013 Source: Xin Luna Dong (Google, now at Amazon) - PVLDB ‘14 110M 0.3M 1.5M 13K 1.1M 1.7M ANNO Information Extraction Data Extraction The Real Web
  • 14.
    Web Data Extraction:How manual / (semi) supervised accurate expensive + non-scalable less accurate cheaper + scalable unsupervised
  • 15.
    What have wetried so far Wrapper Induction: similar objects are presented in similar structures You need training data (web = high sample complexity + many features) use human to inform system (supervision / crowd) Fully unsupervised methods fail beyond simple structures and gets tricked by regular noise Fact redundancy (e.g., Google Knowledge Vault) Works well with highly-redundant common-sense facts (London, Barack Obama) Ephemeral and unfrequent entities get lost or noisy… Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, Wei Zhang: From Data Fusion to Knowledge Fusion. PVLDB 7(10): 881-892 (2014) Valter Crescenzi, Giansalvatore Mecca: Automatic information extraction from large websites. J. ACM 51(5): 731-779 (2004) Tim Furche, Jinsong Guo, Sebastian Maneth, Christian Schallhart: Robust and Noise Resistant Wrapper Induction. SIGMOD Conference 2016: 773-784
  • 16.
  • 17.
    DIADEM / Wrapidity:Full-site Web Data Extraction Bringing Web Data Extraction to the real web and at industrial scale Key insights… Replace site supervision with domain knowledge Increase robustness of wrapper generation algorithms (navigation and induction) Make wrapper generation algorithms knowledge-parametric (both ML and Rules) Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)
  • 18.
    DIADEM: Full-site WebData Extraction Template discovery Result pages, detail pages Navigation forms, menus, categories, bread crumbs, pagination, infinite scroll, detail links
  • 19.
    Full-site Web DataExtraction Wrapper induction generalisation and weaving doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /} //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /} /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>] /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})* //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ] [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ] [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ] [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ] [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ] [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ] [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ] [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ] [? .//@src:<image=normalize-space(.)> ] [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ] [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]
  • 20.
    doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /} //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /} /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>] /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick/})* //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ] [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ] [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ] [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ] [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ] [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ] [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ] [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ] [? .//@src:<image=normalize-space(.)> ] [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ] [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ] Wrapper Execution Parallel execution instantiation, splitting, distribution, monitoring doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /} //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /} /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>] /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})* //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ] [? 
.//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ] [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ] [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ] [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ] [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ] [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ] [? .//@src:<image=normalize-space(.)> ] [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ] [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ] //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ] [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ] [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ] [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ] [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ] [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ] [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ] [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ] [? .//@src:<image=normalize-space(.)> ] [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ] [? 
.//div[@class='prop_maininfo']:<description=normalize-space(.)> ] doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /} //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /} /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>] /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})* //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ] doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /} //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /} /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>] /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})* //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>] [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ] [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ] [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ] [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ] [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ] [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ] [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ] [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ] [? .//@src:<image=normalize-space(.)> ] [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ] [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ] EMR on Amazon AWS
  • 21.
    Data Cleaning and Wrapper Repair
    Wrapper-generated instance (columns A to G):
      Ava’s possessions; March 4, 2016; Rated: R; Off Hollywood Pictures; Genre(s): Sci-Fi, Mystery, Thriller, Horror; 89 min; 51
      Camino; March 4, 2016; Rated: Not Rated; Bielberg Entertainment; Genre(s): Action, Adventure, Thriller; 103 min; tbd
    Target signature: title, releaseMonth, releaseDay, releaseYear, rating, genres, producer, runtime, overall score
    [Figure: graph from SOURCE to SINK over the nodes vA, vB, vC, vD]
    Stefano Ortona, Giorgio Orsi, Tim Furche, Marcello Buoncristiano: Joint repairs for web wrappers. ICDE 2016: 1146-1157
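    To make the gap between the wrapper-generated instance and the target signature concrete, here is a minimal hand-written sketch. It is not the joint-repair algorithm from the cited paper: it assumes the seven wrapper columns are split as title, date, rating, producer, genres, runtime and score, and the function name `to_target` is hypothetical.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Two rows as a wrapper might emit them (values taken from the slide;
# the column boundaries are an assumption).
ROWS = [
    ("Ava's possessions", "March 4, 2016", "Rated: R", "Off Hollywood Pictures",
     "Genre(s): Sci-Fi, Mystery, Thriller, Horror", "89 min", "51"),
    ("Camino", "March 4, 2016", "Rated: Not Rated", "Bielberg Entertainment",
     "Genre(s): Action, Adventure, Thriller", "103 min", "tbd"),
]


def to_target(row):
    """Map one wrapper row onto the nine fields of the target signature."""
    title, date, rating, producer, genres, runtime, score = row
    month, day, year = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4})", date).groups()
    return {
        "title": title,
        "releaseMonth": MONTHS.index(month) + 1,
        "releaseDay": int(day),
        "releaseYear": int(year),
        # Strip the label text that the wrapper captured with the value.
        "rating": re.sub(r"^Rated:\s*", "", rating),
        "genres": [g.strip() for g in re.sub(r"^Genre\(s\):\s*", "", genres).split(",")],
        "producer": producer,
        "runtime": int(re.fullmatch(r"(\d+) min", runtime).group(1)),
        # "tbd" means no score is available yet.
        "overall score": int(score) if score.isdigit() else None,
    }


repaired = [to_target(row) for row in ROWS]
```

    The hard part in practice is that the column boundaries themselves are wrong or unstable, which is what the joint repair of wrapper and data addresses; the sketch only covers the final cleaning step.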
  • 22.
    Background Knowledge
    Domain knowledge (once per application domain)
    Describe target objects via entities, relationships, instances
    Provide a way to identify them on web pages via shallow NLP (dictionaries, regexes)
    Use it to annotate both the visible and invisible parts of the live DOM
    Metamodel: Record, Data Area, Page, Result Page, Detail Page, Block, Attribute, Nav Menu, Form, …
    Model: RE record (price, property type, location, beds, …); RE data area (search res number, records number)
    Rules:
      (cursymb:instance) number:instance[value>=80k && value<=200M] | number:instance[value>80k && value<=200M] (cursymb:instance | curname:instance) -> price:instance
    Dictionaries:
      cursymb:instance: £ -> { norm = GBP }, $ -> { norm = USD }, GBP -> { norm = GBP }, USD -> { norm = USD }
      curname:instance: pounds -> { norm = GBP }, dollars -> { norm = USD }
      price:label: price, amount
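    The price rule and its dictionaries can be approximated in a few lines. This is a sketch under assumptions, not the DIADEM rule engine: it treats both bounds as inclusive, only handles plain digits with thousands separators, and the name `annotate_prices` is made up for illustration.

```python
import re

# Dictionaries from the slide: surface form -> normalized currency code.
CURSYMB = {"£": "GBP", "$": "USD", "GBP": "GBP", "USD": "USD"}
CURNAME = {"pounds": "GBP", "dollars": "USD"}

_SYM = "|".join(re.escape(s) for s in CURSYMB)
_NAME = "|".join(re.escape(s) for s in CURNAME)

# A currency symbol may precede the number; a symbol or a name may follow it.
PRICE_RE = re.compile(
    rf"(?:(?P<pre>{_SYM})\s*)?(?P<num>\d[\d,]*)(?:\s*(?P<post>{_SYM}|{_NAME}))?"
)


def annotate_prices(text, low=80_000, high=200_000_000):
    """Return price:instance annotations for numbers that sit next to a
    currency marker and fall inside the plausible range for the domain."""
    annotations = []
    for match in PRICE_RE.finditer(text):
        currency = match.group("pre") or match.group("post")
        if currency is None:
            continue  # a bare number is not a price
        value = int(match.group("num").replace(",", ""))
        if low <= value <= high:
            norm = CURSYMB.get(currency) or CURNAME.get(currency)
            annotations.append({"type": "price:instance", "value": value, "norm": norm})
    return annotations


sale = annotate_prices("Guide price £250,000, 3 bedrooms")
named = annotate_prices("Offers over 450000 dollars")
too_small = annotate_prices("Deposit £995")
```

    The range check is what makes the rule domain-specific: 80k to 200M suits property sale prices and would have to change for rentals or other domains.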
  • 23.
    Background Knowledge
    Labels and instances, visible and invisible (HTML structure, JavaScript values)
    Labels and instances in the HTML structure:
      <div class="icon first">
        <img src="…/bdes.jpg" alt="Bedrooms" title="Bedrooms">
        <br>8
      </div>
      <div class="icon">
        <img src="…/bath.jpg" alt="Bathrooms" title="Bathrooms">
        <br>4
      </div>
    [Figure: labels found in JavaScript values]
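    The HTML case above can be sketched with the standard library alone: the label is invisible (it lives in the `alt` attribute) while the instance is the visible text that follows. The class name `AltLabelParser` and the pairing heuristic (first non-empty text after an image, within the same `div`) are assumptions made for this example.

```python
from html.parser import HTMLParser

SNIPPET = """
<div class="icon first">
  <img src="…/bdes.jpg" alt="Bedrooms" title="Bedrooms">
  <br>8
</div>
<div class="icon">
  <img src="…/bath.jpg" alt="Bathrooms" title="Bathrooms">
  <br>4
</div>
"""


class AltLabelParser(HTMLParser):
    """Pair each image's alt text (label) with the next visible text (instance)."""

    def __init__(self):
        super().__init__()
        self.pending_label = None
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.pending_label = dict(attrs).get("alt")

    def handle_data(self, data):
        text = data.strip()
        if text and self.pending_label:
            self.pairs.append((self.pending_label, text))
            self.pending_label = None

    def handle_endtag(self, tag):
        # Do not let a label leak into the next block.
        if tag == "div":
            self.pending_label = None


def extract_pairs(html):
    parser = AltLabelParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


pairs = extract_pairs(SNIPPET)
```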
  • 24.
    DOM Annotation
    Combine standalone/online annotators using ML + Ontologies (Argumentation)
    When that is not enough, create gazetteers and JAPE rules (GATE framework)
    ROSeAnn: Reconciling Opinions of Semantic Annotators
    Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)
  • 25.
    Forms
    Forms are one of the hardest things to deal with in Web Data Extraction
    Entry point
    Secondary / refinement forms
    Form understanding and querying
    Form labelling
    Field grouping
    Form filling / querying
    Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart: The ontological key: automatically understanding and integrating forms to access the deep Web. VLDB J. 22(5): 615-640 (2013)
    Michael Benedikt, Balder ten Cate, Efthymia Tsamoura: Generating Plans from Proofs. ACM Trans. Database Syst. 40(4): 22:1-22:45 (2016)
  • 26.
    Exploration strategy
    Knowledge-driven focused crawling
    Relational Transducers to declaratively represent strategies (data driven)
    Everything gets translated into logical facts
    Decision: Which action to take?
    Guarded FSTs
    [Figure 8: DIADEM controller: action generation and execution. Stages run from "Stage 1: InitPage" to "Stage 5: Finalize"; the crawler actions (next, link, filling, back, iFrame) go through Browser Interaction and end in success or failure]
    Excerpt from the paper: "(2) If there are multiple execution-ready transducers, control flow is determined by priorities dynamically computed by the control transducer. Transducers are executed in order of their priority. Dependency and guard rules, registered by the individual trans-"
    Website Exploration
    Block / Page classification
    ML (SVM and Decision trees)
    Features are knowledge-parametric
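    The control flow described in the excerpt (guards decide which transducers are ready, priorities decide their order) can be illustrated with a toy controller. This is only a sketch: the `Transducer` class, the fact names and the choice to run the highest priority first are assumptions, and real relational transducers operate on logical facts rather than a Python dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Transducer:
    name: str
    guard: Callable[[Dict], bool]     # is this transducer execution-ready?
    priority: Callable[[Dict], int]   # computed dynamically from the facts
    action: Callable[[Dict], None]    # updates the facts


def step(transducers: List[Transducer], facts: Dict) -> List[str]:
    """Run every ready transducer once, highest priority first (assumed order)."""
    ready = [t for t in transducers if t.guard(facts)]
    ready.sort(key=lambda t: t.priority(facts), reverse=True)
    executed = []
    for transducer in ready:
        transducer.action(facts)
        executed.append(transducer.name)
    return executed


def fill_form(facts):
    facts["form_filled"] = True


def follow_next_link(facts):
    facts["pages_visited"] += 1


facts = {"form_present": True, "form_filled": False, "next_link": True, "pages_visited": 0}

TRANSDUCERS = [
    Transducer("follow_next_link", lambda f: f["next_link"], lambda f: 5, follow_next_link),
    Transducer("fill_form", lambda f: f["form_present"] and not f["form_filled"],
               lambda f: 10, fill_form),
]

first_round = step(TRANSDUCERS, facts)
second_round = step(TRANSDUCERS, facts)
```

    In the second round the form guard no longer holds, so only the link-following transducer runs.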
  • 27.
    Template Discovery
    Detail page analysis
      use result pages to collect corresponding detail pages
      collate the detail pages and use result-page analysis (harder… more noise)
      use result-detail redundancy to compensate for additional noise
    Result page analysis
      essentially… it is tree-mining (well known problem)
      regular annotations in regular DOM structures compensate for low precision
      microstructures (tables, lists, key-value maps) compensate for low recall
    [Figure 5: Attribute alignment. A data area whose records carry PRICE, LOCATION and BEDS annotations on p, span, b, strong, em, div and i nodes]
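    The idea that regular annotations in regular DOM structures compensate for low precision can be sketched as a vote over tag paths. This is a simplification of tree mining, not the algorithm from the papers: the annotator, the sample markup and the names `annotate` and `align_attributes` are all assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Illustrative data area with three records; the <em> in the last record
# triggers a false PRICE annotation.
DATA_AREA = ET.fromstring(
    "<div>"
    "<p><span>£250,000</span><b>OX2 6AR</b></p>"
    "<p><span>£310,000</span><b>OX4 7DG</b></p>"
    "<p><span>£199,950</span><b>OX1 3QD</b><em>Save £5,000</em></p>"
    "</div>"
)

POSTCODE_RE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b")


def annotate(text):
    """A deliberately noisy annotator based on shallow patterns."""
    if "£" in text:
        return "PRICE"
    if POSTCODE_RE.search(text):
        return "LOCATION"
    return None


def annotated_paths(node, path=()):
    """Yield (attribute type, tag path) for every annotated node in a record."""
    path = path + (node.tag,)
    label = annotate(node.text or "")
    if label:
        yield label, path
    for child in node:
        yield from annotated_paths(child, path)


def align_attributes(records, min_support=0.5):
    """Keep, per attribute type, the tag path that most records agree on."""
    votes = defaultdict(Counter)
    for record in records:
        for label, path in set(annotated_paths(record)):  # one vote per record
            votes[label][path] += 1
    template = {}
    for label, counter in votes.items():
        path, count = counter.most_common(1)[0]
        if count / len(records) >= min_support:
            template[label] = "/".join(path)
    return template


template = align_attributes(list(DATA_AREA))
```

    The stray `em` annotation is outvoted because it occurs in one record only, while `p/span` is supported by all three.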
  • 28.
    Extraction Language: OXPath
    Navigation:
      doc('http://www.trulia.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
        //form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /}
        /.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>]
        /(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick/})*
    Record & attributes:
        //div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>]
          [? .//span[@class='prop_price']/text():<price=normalize-space(.)> ]
          [? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ]
          [? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ]
          [? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ]
          [? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ]
          [? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ]
          [? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ]
          [? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ]
          [? .//@src:<image=normalize-space(.)> ]
          [? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ]
          [? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]
    OXPath = XPath + 4: iteration / visual axis / actions / extraction markers
    https://github.com/diadem/OXPath
    Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. 22(1): 47-72 (2013)
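    For readers who do not know OXPath, the record-and-attributes part can be mimicked in plain Python. This sketch covers only the extraction markers with optional predicates (`[? … ]`): no browser actions, no pagination. It runs on a small well-formed sample page that is made up for the example, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

PAGE = ET.fromstring("""
<div>
  <div class="proplist_wrap odd">
    <span class="prop_price"> £250,000 </span>
    <span>Bedrooms:</span><strong>3</strong>
    <span>Bathrooms:</span><strong>2</strong>
  </div>
  <div class="proplist_wrap even">
    <span class="prop_price">£310,000</span>
    <span>Bedrooms:</span><strong>4</strong>
  </div>
</div>
""")


def normalize_space(text):
    """Same effect as XPath's normalize-space()."""
    return " ".join((text or "").split())


def value_after_label(record, label):
    """Emulate .//span[.='label']/following-sibling::strong/text()."""
    for parent in record.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "span" and normalize_space(child.text) == label:
                for sibling in children[index + 1:]:
                    if sibling.tag == "strong":
                        return normalize_space(sibling.text)
    return None


def extract_records(root, current_url):
    records = []
    for div in root.iter("div"):
        if "proplist_wrap" not in div.get("class", ""):
            continue  # contains(@class, 'proplist_wrap')
        record = {"origin_url": current_url}
        price = div.find(".//span[@class='prop_price']")
        candidates = {
            "price": normalize_space(price.text) if price is not None else None,
            "bedroom_number": value_after_label(div, "Bedrooms:"),
            "bathroom_number": value_after_label(div, "Bathrooms:"),
        }
        # Optional predicates: an attribute is emitted only when it is found.
        record.update({k: v for k, v in candidates.items() if v})
        records.append(record)
    return records


records = extract_records(PAGE, "http://example.org/results")
```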
  • 29.
    Evaluation
    Effort per source
      no human effort
      about 5 mins analysis time
      about 150 pages per CPU hour for extraction
      about 1 hour per 50k records in cleaning and post-processing
    [Figure: RE-FULL; axes: time (minutes), 0 to 20, and visited pages, 0 to 40]
  • 30.
    Evaluation: Templated Websites
    160,000 restaurant chain locations, from over 295 chains including all major chains
    85% effective wrappers, all automatically maintained
    95% precision of extracted location information
  • 31.
    How is it done in Meltwater
    Our ingestion fetches about 3.3M documents / day from 190k editorial sources, re-crawled every 30 minutes. With the social firehoses we go up to 30M documents / day. Since its foundation, Meltwater has indexed almost 200B documents.
  • 32.
    How is it done in Meltwater
    Asian websites generate as much content as the rest of the world combined. We sometimes stretch our 2-second politeness policy a bit.
  • 33.
    How is it done in Meltwater
    Ingestion:
      Social media hoses (partnerships)
      Editorial (partnerships + web crawling)
      Broadcasts (views on the above)
    Storage and search:
      Elasticsearch
      RabbitMQ (distributed queues)
      AWS
    Enrichments (15 languages):
      Text categorization (topic, language)
      NERD (person, location, organization, ...)
      NED ( https://en.wikipedia.org/wiki/Tim_Cook )
      Sentiment Analysis
    Media Intelligence applications:
      Boolean queries (keywords / entities)
      Counters
      Aggregates
      Drill downs / pivoting
    AWS cluster: 354 i3.2xlarge, about 2800 vCPU, 21TB RAM, 630TB NVMe disks, 30K shards
  • 34.
    JSON-XPath
    "articleTpls": [
      {
        "id": "article_template_0",
        "startUrls": [
          "http://www-03.ibm.com/press/us/en/pressrelease/33304.wss",
          "http://www-03.ibm.com/press/us/en/pressrelease/33420.wss",
          "http://www-03.ibm.com/press/us/en/pressrelease/33117.wss",
          "http://www-03.ibm.com/press/us/en/pressrelease/33303.wss"
        ],
        "urlPatterns": [
          "(?<wordset>([a-zA-Z]{1,}[:]{1,}){1,1})//(?<wordnumberset>([\\w]{1,}[-.]{1,}){1,3}[\\w]{1,})/(?<wordset1>([a-zA-Z]{1,}[/]{1,}){1,3}[a-zA-Z]{1,})/(?<wordnumberset1>([\\w]{1,}[.]{1,}){1,1}[\\w]{1,})"
        ],
        "titleXpath": "wrty:normalize-space(//h1[@class='ibm-small'])",
        "bylineXpath": "//div[@class='ibm-two-column']//strong",
        "ingressXpath": "wrty:normalize-space(//div[@id='ibm-content-main']/div[@class='ibm-container'][1]//p[1])",
        "contentXpath": {
          "includeXpath": "wrty:normalize-space(wrty:string-join(//div[@id='ibm-content-main']//div[@class='ibm-container-body']/node()[self::p|self::h2[@class='ibm-inner-subhead']], \"\\n\"))"
        },
        "engagementPatterns": [],
        "imagePatterns": [
          {
            "baseXpath": "//img[@width='500']",
            "urlXpath": "."
          }
        ],
        "authorPatterns": [
          {
            "baseXpath": "//div[@class='ibm-two-column']//strong",
            "nameXpath": "wrty:normalize-space(.)"
          }
        ]
      }
    ]
    Instead of OXPath a JSON-like wrapper specification is used
    Why a JSON-like wrapper? Well, you can query JSON.
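    How such a wrapper might be applied can be sketched as follows. This is not Meltwater's engine: the `wrty:` functions are proprietary, so the sketch only emulates `wrty:normalize-space(...)`, it assumes the `[w]` classes in the slide were originally `\w`, it rewrites `(?<name>` groups into Python's `(?P<name>` syntax, and it uses relative XPaths (`.//`) because that is what `xml.etree.ElementTree` supports. The sample page is invented.

```python
import json
import re
import xml.etree.ElementTree as ET

# A cut-down template in the spirit of the slide (assumed field subset).
TEMPLATE = json.loads(r'''
{
  "id": "article_template_0",
  "urlPatterns": [
    "(?<wordset>([a-zA-Z]{1,}[:]{1,}){1,1})//(?<wordnumberset>([\\w]{1,}[-.]{1,}){1,3}[\\w]{1,})/(?<wordset1>([a-zA-Z]{1,}[/]{1,}){1,3}[a-zA-Z]{1,})/(?<wordnumberset1>([\\w]{1,}[.]{1,}){1,1}[\\w]{1,})"
  ],
  "titleXpath": "wrty:normalize-space(.//h1[@class='ibm-small'])",
  "authorPatterns": [
    {"baseXpath": ".//div[@class='ibm-two-column']//strong",
     "nameXpath": "wrty:normalize-space(.)"}
  ]
}
''')

PAGE = ET.fromstring("""
<html><body>
  <h1 class="ibm-small">  Example   press release </h1>
  <div class="ibm-two-column"><p><strong> Jane Doe </strong></p></div>
</body></html>
""")


def evaluate(node, expression):
    """Evaluate a path, honouring an optional wrty:normalize-space(...) wrapper."""
    wrapped = re.fullmatch(r"wrty:normalize-space\((.*)\)", expression)
    path = wrapped.group(1) if wrapped else expression
    target = node.find(path)
    if target is None:
        return None
    text = "".join(target.itertext())
    return " ".join(text.split()) if wrapped else text


def apply_template(template, url, root):
    """Return the extracted article, or None if no URL pattern matches."""
    patterns = [p.replace("(?<", "(?P<") for p in template["urlPatterns"]]
    if not any(re.fullmatch(p, url) for p in patterns):
        return None
    authors = [
        evaluate(base, pattern["nameXpath"])
        for pattern in template["authorPatterns"]
        for base in root.findall(pattern["baseXpath"])
    ]
    return {"title": evaluate(root, template["titleXpath"]), "authors": authors}


article = apply_template(
    TEMPLATE, "http://www-03.ibm.com/press/us/en/pressrelease/33304.wss", PAGE)
skipped = apply_template(TEMPLATE, "http://example.org/", PAGE)
```

    Because the wrapper is plain data, templates can be stored, versioned and queried like any other JSON document, which is the point made on the slide.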
  • 35.
    What do you do with the data?
    Content: Companies, Brands, Products, Key people, Influencers
    Goals: Relate facts, Data mining, Cognitive applications
    Challenges: Data Cleaning, Data deduplication / integration, Truth Finding
    Let’s build a knowledge graph
    5M orgs, 10M people, 200M edges
  • 36.
    More information
    Selected papers:
    Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB (2014)
    Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. (2013)
    Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. RR (2011)
    Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB (2013)
    Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart: The ontological key: automatically understanding and integrating forms to access the deep Web. VLDB J. (2013)
    S. Ortona, G. Orsi, M. Buoncristiano, T. Furche: Joint Repairs for Web Wrappers. ICDE (2016)
    Tim Furche, Georg Gottlob, L. Libkin, Giorgio Orsi, N. Paton: Data Wrangling for Big Data: Challenges and Opportunities. EDBT: 1845-1856 (2016)
    Omer Gunes, Giorgio Orsi, Tim Furche: Structured Aspect Extraction. CoLing (2016)
  • 37.
    Know somebody who is looking for a PhD?
    Meltwater is sponsoring a PhD Scholarship in Large-scale Sentiment Analysis
    The post is based at the School of Computer Science in Birmingham, supervised by Dr Mark Lee and myself
    Access to Meltwater’s:
      hoodies and goodies
      AWS infrastructure
      a huge knowledge graph
      200B documents (social, editorial, financial statements, job posts) to play with
  • 38.
    Questions?
    More about me at: http://www.orsigiorgio.net/
    More about Meltwater at: http://www.meltwater.com/
    More about Wrapidity at: http://www.wrapidity.com/