Web Data Extraction: A Crash Course
Giorgio Orsi
(giorgio.orsi@meltwater.com)

WHO AM I
Senior Research Scientist at Meltwater, a global Media Intelligence company
Honorary Researcher at the School of CS at the University of Birmingham
Co-investigator of the EPSRC VADA (Value Added Data) Programme Grant
Co-founder and Head of Data Engineering of Wrapidity, a web scraping startup
About me
I like playing with data!

Meltwater: Media Intelligence
influencers, trends, sentiment analysis, media exposure

Meltwater: Science and Entrepreneurship
University collaborations
6 Data Science Hubs (co-working spaces)
London
San Francisco
Singapore
Sydney
Berlin
New York
Meltwater Entrepreneurial School of Technology
HQ in Accra, Ghana
Training program for African entrepreneurs
Incubator (25+ startups)
Networking hub

Web Data Extraction
refcode | postcode | bedrooms | bathrooms | available  | price
33453   | OX2 6AR  | 3        | 2         | 15/10/2013 | £1280 pcm
33433   | OX4 7DG  | 2        | 1         | 18/04/2013 | £995 pcm
Process of turning semi-structured (templated) web data into structured data
>10000
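
To make the definition concrete, here is a minimal sketch of turning a templated page into rows like the ones in the table. The HTML snippet, the class names and the `FIELDS` mapping are assumptions made up for illustration; real pages are rarely well-formed XML and would need a proper HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing page (not taken from a real site).
PAGE = """
<div id="results">
  <div class="listing"><span class="ref">33453</span><span class="postcode">OX2 6AR</span>
    <span class="beds">3</span><span class="price">£1280 pcm</span></div>
  <div class="listing"><span class="ref">33433</span><span class="postcode">OX4 7DG</span>
    <span class="beds">2</span><span class="price">£995 pcm</span></div>
</div>
"""

# Output column -> CSS class that the (assumed) template uses for it.
FIELDS = {"refcode": "ref", "postcode": "postcode", "bedrooms": "beds", "price": "price"}


def extract(page):
    """Return one dict per templated record found in the page."""
    root = ET.fromstring(page.strip())
    rows = []
    for record in root.findall(".//div[@class='listing']"):
        row = {}
        for field, css_class in FIELDS.items():
            node = record.find(f".//span[@class='{css_class}']")
            row[field] = node.text.strip() if node is not None and node.text else ""
        rows.append(row)
    return rows


rows = extract(PAGE)
for row in rows:
    print(row)
```

The same four lookups work on every record because the records share one template; that regularity is what the rest of the deck exploits.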

Web Data Extraction vs Information Extraction
Web data extraction: data is structured according to templates, annotated or styled
Information extraction: data is hidden in plain text (entities, relations, aspects); not our focus today…

Web Data Extraction: Why
“For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database”
– Nilesh Dalvi (Yahoo!, then Facebook)
http://arxiv.org/pdf/1203.6406.pdf
Knowledge base construction (Yago, DBPedia, Wikidata, BabelNet)

Web Data Extraction: Why
Converging trends in data management
outside insight: shift from internal data to external data (social, knowledge bases, news, reports, reviews, jobs)
dark data: semi-structured / unstructured data
data preparation: preparing and maintaining data for mining and analytics
leading vs lagging performance indicators

Typical comments about web data extraction
Microdata and the semantic web have solved the problem
All the data you need is in web tables
APIs provide all the structured data you need
The Academic Web
[Figure: a product provider exposing a Semantic API (RDF), a Structured API (XML/JSON) and an HTML interface (1 template)]

The Real Web
Web data extraction is not (even remotely) a solved problem
APIs limited to large websites (aggregators)
Web tables and microdata are marginal
The real problem is not one-time extraction, but keep doing it over time
[Figure: a product provider exposing a Semantic API (RDF), a Structured API (XML/JSON) and an HTML interface (1 template)]

The Real Web
● 1B+ webpages over the Web
● Contribution is skewed: 1-50K
As of 11/2013
Source: Xin Luna Dong (Google, now at Amazon) - PVLDB ‘14
[Figure: contributions split between Information Extraction, Data Extraction and ANNO; values shown: 110M, 0.3M, 1.5M, 13K, 1.1M, 1.7M]

Web Data Extraction: How
manual / (semi-)supervised: accurate, but expensive + non-scalable
unsupervised: less accurate, but cheaper + scalable

What have we tried so far
Wrapper Induction: similar objects are presented in similar structures
You need training data (web = high sample complexity + many features)
Use humans to inform the system (supervision / crowd)
Fully unsupervised methods fail beyond simple structures and get tricked by regular noise
Fact redundancy (e.g., Google Knowledge Vault)
Works well with highly redundant common-sense facts (London, Barack Obama)
Ephemeral and infrequent entities get lost or noisy…
Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, Wei Zhang:
From Data Fusion to Knowledge Fusion. PVLDB 7(10): 881-892 (2014)
Valter Crescenzi, Giansalvatore Mecca:
Automatic information extraction from large websites. J. ACM 51(5): 731-779 (2004)
Tim Furche, Jinsong Guo, Sebastian Maneth, Christian Schallhart:
Robust and Noise Resistant Wrapper Induction. SIGMOD Conference 2016: 773-784
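
The wrapper induction idea ("similar objects are presented in similar structures") can be illustrated with a toy sketch: given two labelled example nodes, keep the path steps they share and wildcard the sibling index where they differ. This is a deliberately simplified scheme on a made-up page, not the algorithm of any of the papers cited above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the human labels two of the three prices.
PAGE = """
<html><body>
  <div><h1>Results</h1></div>
  <ul>
    <li><b>Flat A</b><span>£995</span></li>
    <li><b>Flat B</b><span>£1280</span></li>
    <li><b>Flat C</b><span>£1500</span></li>
  </ul>
</body></html>
"""


def paths_by_text(root):
    """Map each leaf text to its path: a list of (tag, index among same-tag siblings)."""
    found = {}

    def walk(node, path):
        if len(node) == 0 and node.text and node.text.strip():
            found[node.text.strip()] = path
        counts = {}
        for child in node:
            idx = counts.get(child.tag, 0)
            counts[child.tag] = idx + 1
            walk(child, path + [(child.tag, idx)])

    walk(root, [])
    return found


def induce(examples):
    """Keep steps on which all examples agree; replace differing indexes by None (any)."""
    first = examples[0]
    tags = [tag for tag, _ in first]
    if any([tag for tag, _ in p] != tags for p in examples):
        raise ValueError("examples do not share a tag path")
    return [(tag, idx if all(p[i][1] == idx for p in examples) else None)
            for i, (tag, idx) in enumerate(first)]


def run_wrapper(root, wrapper):
    """Follow the generalised path from the root and return the text of matched nodes."""
    nodes = [root]
    for tag, idx in wrapper:
        matched = []
        for node in nodes:
            same = [child for child in node if child.tag == tag]
            matched.extend(same if idx is None else same[idx:idx + 1])
        nodes = matched
    return [n.text.strip() for n in nodes if n.text]


root = ET.fromstring(PAGE.strip())
paths = paths_by_text(root)
wrapper = induce([paths["£995"], paths["£1280"]])
prices = run_wrapper(root, wrapper)
print(wrapper)
print(prices)
```

Two examples are enough here only because the page is clean; with regular noise (ads, featured listings) the same generalisation would happily extract the noise as well, which is the failure mode listed above.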

DIADEM / Wrapidity: Full-site Web Data Extraction
Bringing Web Data Extraction to the real web and at industrial scale
Key insights…
Replace site supervision with domain knowledge
Increase robustness of wrapper generation algorithms (navigation and induction)
Make wrapper generation algorithms knowledge-parametric (both ML and Rules)
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang:
DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)

DIADEM: Full-site Web Data Extraction
Template discovery: result pages, detail pages
Navigation: forms, menus, categories, breadcrumbs, pagination, infinite scroll, detail links

Full-site Web Data Extraction
Wrapper induction
generalisation and weaving
doc('http://www.wwagency.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
//form/div[@class='formbtn-ctn'][last()]/button[@class='formbtn']/{click /}
/.:<data_area>[?.//div[@class='pagenumlinks'][1]//span/text():<number_results=.>]
/(//div[contains(@class,'proplist_wrap')]/following-sibling::div//a[@class='pagenum'][last()]/{nextclick /})*
//div[contains(@class,'proplist_wrap')]:<record>[? .:<origin_url=current-url()>]
[? .//span[@class='prop_price']/text():<price=normalize-space(.)> ]
[? .//span[.='Type:']/following-sibling::strong/text():<property_type=normalize-space(.)> ]
[? .//div[@class='prop_statuses']//text():<property_status=normalize-space(.)> ]
[? .//span[.='Bathrooms:']/following-sibling::strong/text():<bathroom_number=normalize-space(.)> ]
[? .//span[.='Bedrooms:']/following-sibling::strong/text():<bedroom_number=normalize-space(.)> ]
[? .//strong[@class='orange']/preceding-sibling::text():<location_raw=string(.)> ]
[? .//strong[@class='orange']/text():<postcode=normalize-space(.)> ]
[? .//strong/preceding-sibling::strong/text():<street_address=normalize-space(.)> ]
[? .//@src:<image=normalize-space(.)> ]
[? .//div[@class='prop_statuses']/following-sibling::a/@href:<url=normalize-space(.)> ]
[? .//div[@class='prop_maininfo']:<description=normalize-space(.)> ]

Wrapper Execution
Parallel execution: instantiation, splitting, distribution, monitoring
EMR on Amazon AWS
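
A rough sketch of the split / distribute / monitor steps, using a local thread pool as a stand-in for a cluster. The `run_wrapper` function is a placeholder and the URLs are invented; nothing here reflects how the actual EMR jobs were set up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split(urls, size):
    """Splitting: cut the list of start URLs into batches."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def run_wrapper(batch):
    """Placeholder for executing an instantiated wrapper on each URL of a batch."""
    return [{"url": url, "status": "ok"} for url in batch]


def execute(urls, batch_size=2, workers=4):
    """Distribution and monitoring: run batches in parallel, collect results and failures."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_wrapper, batch): batch for batch in split(urls, batch_size)}
        for future in as_completed(futures):
            try:
                results.extend(future.result())
            except Exception as exc:  # a failed batch is recorded, not fatal
                failures.append((futures[future], repr(exc)))
    return sorted(results, key=lambda r: r["url"]), failures


urls = [f"http://site.example/page/{i}" for i in range(5)]
results, failures = execute(urls)
print(len(results), "results,", len(failures), "failed batches")
```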

Data Cleaning and Wrapper Repair
Wrapper-generated instance:
A | B | C | D | E | F | G
Ava’s possessions | March 4, 2016 | Rated: R | Off Hollywood Pictures | Genre(s): Sci-Fi, Mystery, Thriller, Horror | 89 min | 51
Camino | March 4, 2016 | Rated: Not Rated | Bielberg Entertainment | Genre(s): Action, Adventure, Thriller | 103 min | tbd
[Figure: flow network from SOURCE to SINK over nodes vA,(), vB,(A), vC,(A,B), vD,(A), vB,(), vA,(B), vD,(B,A), with edge weights 13, 9 and 4]
Target signature: title, releaseMonth, releaseDay, releaseYear, rating, genres, producer, runtime, overall score
Stefano Ortona, Giorgio Orsi, Tim Furche, Marcello Buoncristiano:
Joint repairs for web wrappers. ICDE 2016: 1146-1157
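
To show what a repair has to achieve, the sketch below re-segments one wrapper-generated row so that it fits the target signature. It uses hand-written regular expressions and assumes the seven-column reading of the instance (title, date, rating, producer, genres, runtime, score); the paper's joint repair works differently and is not reproduced here.

```python
import re

# Patterns for cells that need splitting or stripping of template text (assumptions).
DATE = re.compile(r"(?P<releaseMonth>[A-Z][a-z]+) (?P<releaseDay>\d{1,2}), (?P<releaseYear>\d{4})")
RATING = re.compile(r"Rated:\s*(?P<rating>.+)")
GENRES = re.compile(r"Genre\(s\):\s*(?P<genres>.+)")
RUNTIME = re.compile(r"(?P<runtime>\d+) min")


def repair(row):
    """Map a 7-column wrapper row onto the target attributes."""
    title, date, rating, producer, genres, runtime, score = row
    out = {"title": title, "producer": producer, "overall_score": score}
    for pattern, cell in ((DATE, date), (RATING, rating), (GENRES, genres), (RUNTIME, runtime)):
        match = pattern.fullmatch(cell.strip())
        if match:
            out.update(match.groupdict())
    return out


row = ["Ava's possessions", "March 4, 2016", "Rated: R", "Off Hollywood Pictures",
       "Genre(s): Sci-Fi, Mystery, Thriller, Horror", "89 min", "51"]
repaired = repair(row)
print(repaired)
```

The date cell is under-segmented (one cell, three target attributes), while the rating, genres and runtime cells carry template text that has to be stripped.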

Background Knowledge
Domain knowledge (once per application domain)
Describe target objects via entities, relationships, instances
Provide a way to identify them on web pages via shallow NLP (dictionaries, regexes)
Use it to annotate both the visible and invisible parts of the live DOM
Metamodel: Page (Result Page, Detail Page), DataArea, Record, Attribute, Block, Nav Menu, Form, …
Model:
RE record: price, property type, location, beds, …
RE data area: search res number, records number
Rules:
(cursymb:instance) number:instance[value>=80k && value<=200M]
| number:instance[value>80k && value<=200M] (cursymb:instance | curname:instance)
-> price:instance
Dictionaries:
cursymb:instance
£ -> { norm = GBP }
$ -> { norm = USD }
GBP -> { norm = GBP }
USD -> { norm = USD }
curname:instance
pounds -> { norm = GBP }
dollars -> { norm = USD }
price:label
price
amount
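
The price rule and its dictionaries can be approximated in a few lines of Python. The value range (80k to 200M) and the currency dictionaries come from the slide; the regular expressions, the `k`/`m` suffix handling and the function name are assumptions, and this is not DIADEM's rule language.

```python
import re

# Dictionaries from the slide, lower-cased for lookup.
NORM = {"£": "GBP", "$": "USD", "gbp": "GBP", "usd": "USD",
        "pounds": "GBP", "dollars": "USD"}
MULT = {"": 1, "k": 1_000, "m": 1_000_000}  # assumed shorthand for thousands / millions

SYM = r"£|\$|GBP|USD"          # cursymb:instance
NAME = r"pounds|dollars"       # curname:instance
NUM = r"(?P<num>\d+(?:,\d{3})*(?:\.\d+)?)(?P<mult>[km]?)"

# Rule alternatives: symbol before the number, or symbol/name after it.
BEFORE = re.compile(rf"(?P<cur>{SYM})\s*{NUM}", re.IGNORECASE)
AFTER = re.compile(rf"{NUM}\s*(?P<cur>{SYM}|{NAME})", re.IGNORECASE)


def annotate_prices(text, low=80_000, high=200_000_000):
    """Return price:instance annotations whose value lies within [low, high]."""
    hits, taken = [], []
    for pattern in (BEFORE, AFTER):
        for m in pattern.finditer(text):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue  # already covered by an earlier annotation
            value = float(m["num"].replace(",", "")) * MULT[m["mult"].lower()]
            if low <= value <= high:
                taken.append(m.span())
                hits.append({"text": m.group(0), "value": value,
                             "norm": NORM[m["cur"].lower()]})
    return hits


hits = annotate_prices("Guide price £250,000, or offers over 1.2m dollars; parking £5")
for hit in hits:
    print(hit)
```

The range check is what makes the annotator domain-specific: "£5" is a perfectly good amount of money, but not a plausible property price.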

Background Knowledge
Labels and instances, visible and invisible (HTML structure, Javascript values)
labels
instances
<div class="icon first">
<img src="…/bdes.jpg" alt="Bedrooms" title="Bedrooms">
<br>8
</div>
<div class="icon">
<img src="…/bath.jpg" alt="Bathrooms" title="Bathrooms">
<br>4
</div>
labels
Javascript values
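
For the invisible labels in the snippet, a small sketch with Python's standard `html.parser`: the `alt` (or `title`) of each image is taken as the label and the next non-empty text node as the instance. The pairing heuristic and the class name `IconAnnotator` are assumptions for illustration.

```python
from html.parser import HTMLParser


class IconAnnotator(HTMLParser):
    """Pair each <img alt=...> label with the next non-empty text node."""

    def __init__(self):
        super().__init__()
        self.pending_label = None
        self.pairs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            self.pending_label = attributes.get("alt") or attributes.get("title")

    def handle_data(self, data):
        text = data.strip()
        if text and self.pending_label:
            self.pairs[self.pending_label] = text
            self.pending_label = None


SNIPPET = """
<div class="icon first">
<img src="…/bdes.jpg" alt="Bedrooms" title="Bedrooms">
<br>8
</div>
<div class="icon">
<img src="…/bath.jpg" alt="Bathrooms" title="Bathrooms">
<br>4
</div>
"""

annotator = IconAnnotator()
annotator.feed(SNIPPET)
annotator.close()
pairs = annotator.pairs
print(pairs)
```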

DOM Annotation
Combine standalone/online annotators using ML + ontologies (argumentation)
When that is not enough, create gazetteers and JAPE rules (GATE framework)
ROSeAnn – Reconciling Opinions of Semantic Annotators
Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt:
ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)

Forms
Forms are one of the hardest things to deal with in Web Data Extraction
Entry-point forms vs secondary / refinement forms
Form understanding and querying
Form labelling
Field grouping
Form filling / querying
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart:
The ontological key: automatically understanding and integrating forms to access the deep Web.
VLDB J. 22(5): 615-640 (2013)
Michael Benedikt, Balder ten Cate, Efthymia Tsamoura:
Generating Plans from Proofs. ACM Trans. Database Syst. 40(4): 22:1-22:45 (2016)

Exploration strategy
Knowledge-driven focused crawling
Relational Transducers to declaratively represent strategies (data driven)
Everything gets translated into logical facts
Decision: Which action to take?
[Figure 8: DIADEM controller: action generation and execution. Stages run from Stage 1: InitPage to Stage 5: Finalize, with success and failure outcomes; crawler actions (next, link, filling, back, iFrame) go through browser interaction]
If there are multiple execution-ready transducers, control flow is determined by priorities dynamically computed by the control transducer. Transducers are executed in order of their priority. Dependency and guard rules, registered by the individual trans…
Guarded FSTs
Website Exploration
Block / Page classification
ML (SVM and Decision trees)
Features are knowledge-parametric
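
The control loop behind "Decision: which action to take?" might look roughly like this: every transducer has a guard over the known facts, the ready ones are ranked by a dynamically computed priority, and the winner adds new facts. The transducer names, the facts and the "highest priority first" reading are assumptions for illustration, not DIADEM's controller.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Transducer:
    name: str
    guard: Callable[[set], bool]     # execution-ready given the current facts?
    priority: Callable[[set], int]   # computed dynamically from the facts
    run: Callable[[set], set]        # returns the new facts it derives


def step(transducers, facts):
    """Run the ready transducer with the highest priority; return its name or None."""
    ready = [t for t in transducers if t.guard(facts)]
    if not ready:
        return None
    chosen = max(ready, key=lambda t: t.priority(facts))
    facts |= chosen.run(facts)
    return chosen.name


def explore(transducers, facts, max_steps=20):
    trace = []
    for _ in range(max_steps):
        name = step(transducers, facts)
        if name is None:
            break
        trace.append(name)
    return trace


# Hypothetical transducers for a tiny exploration.
TRANSDUCERS = [
    Transducer("init_page", lambda f: "page_loaded" not in f,
               lambda f: 10, lambda f: {"page_loaded", "has_form"}),
    Transducer("fill_form", lambda f: "has_form" in f and "form_filled" not in f,
               lambda f: 5, lambda f: {"form_filled", "has_next"}),
    Transducer("follow_next", lambda f: "has_next" in f and "visited_next" not in f,
               lambda f: 3, lambda f: {"visited_next"}),
    Transducer("finalize", lambda f: "page_loaded" in f and "finalized" not in f,
               lambda f: 1, lambda f: {"finalized"}),
]

trace = explore(TRANSDUCERS, set())
print(trace)
```

Because everything is expressed as facts, adding a new behaviour means registering one more guarded transducer rather than rewriting the loop.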

Template Discovery
Detail page analysis
use result pages to collect corresponding detail pages
collate the detail pages and use result-page analysis (harder… more noise)
use result-detail redundancy to compensate for additional noise
Result page analysis
essentially… it is tree-mining (well known problem)
regular annotations in regular DOM structures: compensates for low precision
microstructures (tables, lists, key-value maps): compensates for low recall
[Figure 5: Attribute alignment. Sibling subtrees of a data area (div, a, p, span, b, em, strong, i nodes) carrying PRICE, LOCATION and BEDS annotations]
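
A toy version of result-page analysis, assuming a made-up page and a deliberately weak annotator: siblings are grouped by structure, the most frequent structure becomes the record template, and each attribute position takes the majority annotation. The function names and the signature scheme are illustrative assumptions, far simpler than real tree mining.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical data area: one ad (noise) and three records.
PAGE = """
<div id="data">
  <div class="ad"><span>£100,000</span></div>
  <p><span>£250,000</span><b>Oxford</b></p>
  <p><span>£310,000</span><b>London</b></p>
  <p><span>£180,000</span><b>Leeds</b></p>
</div>
"""


def annotate(text):
    """Toy annotator: imprecise on prices (also fires on the ad), low recall on locations."""
    if text.startswith("£"):
        return "PRICE"
    if text in {"Oxford", "London"}:  # "Leeds" is missing from the gazetteer
        return "LOCATION"
    return None


def signature(node):
    return (node.tag, tuple(child.tag for child in node))


def discover(root, min_support=2):
    """Pick the most frequent sibling structure and label its positions by majority vote."""
    groups = defaultdict(list)
    for child in root:
        groups[signature(child)].append(child)
    sig, records = max(groups.items(), key=lambda item: len(item[1]))
    if len(records) < min_support:
        return []
    labels = []
    for pos in range(len(sig[1])):
        votes = Counter(annotate((r[pos].text or "").strip()) for r in records)
        votes.pop(None, None)  # unannotated nodes do not vote
        labels.append(votes.most_common(1)[0][0] if votes else None)
    return [{label: (r[pos].text or "").strip()
             for pos, label in enumerate(labels) if label} for r in records]


records = discover(ET.fromstring(PAGE.strip()))
for record in records:
    print(record)
```

The ad is dropped because its structure is not repeated (precision), and "Leeds" is labelled LOCATION because it sits where the other locations sit (recall).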

Extraction Language: OXPath
OXPath = XPath + 4 extensions: iteration / visual axis / actions / extraction markers
doc('http://www.trulia.com/')//label[@for='sale_type_id']/following-sibling::select/{0 /}
Expression parts: navigation; record & attributes
https://github.com/diadem/OXPath
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers:
OXPath: A language for scalable data extraction, automation, and crawling on the deep web.
VLDB J. 22(1): 47-72 (2013)
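
What iteration plus extraction markers buy you can be imitated in plain Python: keep following the "next" link (the role of the Kleene-star action) and emit one record per match on every page. The in-memory `SITE` dictionary and its markup are assumptions; OXPath itself drives a real browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical paginated site held in memory: URL -> well-formed page.
SITE = {
    "/p1": "<div><p class='rec'>A</p><p class='rec'>B</p><a class='next' href='/p2'>next</a></div>",
    "/p2": "<div><p class='rec'>C</p><a class='next' href='/p3'>next</a></div>",
    "/p3": "<div><p class='rec'>D</p></div>",
}


def crawl(start, max_pages=100):
    """Extract records from a page, then follow the next link until there is none."""
    records, url, seen = [], start, set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = ET.fromstring(SITE[url])
        for node in page.findall(".//p[@class='rec']"):
            records.append({"origin_url": url, "value": node.text})
        next_link = page.find(".//a[@class='next']")
        url = next_link.get("href") if next_link is not None else None
    return records


records = crawl("/p1")
print(records)
```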

Evaluation
Effort per source
no human effort
about 5 mins analysis time
about 150 pages per CPU hour for extraction
about 1 hour per 50k records in cleaning and post-processing
[Figure: RE-FULL; time (minutes) vs. visited pages]
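
As a back-of-the-envelope check, the per-source figures can be combined into a rough cost estimate. The rates are the ones quoted on the slide; the site size in the example (3,000 pages, 50,000 records) is invented, and CPU hours and elapsed hours are simply reported side by side.

```python
def estimate(pages, records, analysis_minutes=5,
             pages_per_cpu_hour=150, records_per_hour=50_000):
    """Rough effort for one source, using the rates quoted on the slide."""
    return {
        "analysis_hours": analysis_minutes / 60,
        "extraction_cpu_hours": pages / pages_per_cpu_hour,
        "cleaning_hours": records / records_per_hour,
    }


effort = estimate(pages=3_000, records=50_000)
for step, hours in effort.items():
    print(f"{step}: {hours:.2f}")
```

Extraction dominates by a wide margin, which is why wrapper execution is the part that gets parallelised.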

Evaluation: Templated Websites
160,000 restaurant chain locations, from over 295 chains including all major chains
85% effective wrappers, all automatically maintained
95% precision of extracted location information

How is it done in Meltwater
Our ingestion fetches about 3.3M documents / day from 190k editorial sources, re-crawled every 30 minutes.
With the social firehoses we go up to 30M documents / day.
Since its foundation, Meltwater has indexed almost 200B documents.

Asian websites generate as much content as the rest of the world combined.
We sometimes stretch our 2-second politeness policy a bit.
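
Two quick calculations behind these numbers, as a sketch: how many source visits a 30-minute re-crawl implies, and how a per-host politeness delay spaces out fetches. The 190k sources, 30 minutes and 2 seconds come from the slide; the URLs and function names are made up, and this is not Meltwater's scheduler.

```python
from urllib.parse import urlsplit


def visits_per_day(sources, interval_minutes):
    """Number of source visits per day when every source is re-crawled at a fixed interval."""
    return sources * (24 * 60 // interval_minutes)


def schedule(urls, delay=2.0, start=0.0):
    """Give each URL the earliest fetch time that keeps `delay` seconds between hits on a host."""
    next_free = {}
    plan = []
    for url in urls:
        host = urlsplit(url).netloc
        when = max(start, next_free.get(host, start))
        plan.append((when, url))
        next_free[host] = when + delay
    return sorted(plan)


print(visits_per_day(190_000, 30))

plan = schedule([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
])
for when, url in plan:
    print(when, url)
```

Requests to different hosts can go out at the same time; only repeated hits on one host are spread out.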

Ingestion:
Social media hoses (partnerships)
Editorial (partnerships + web crawling)
Broadcasts (views on the above)
Storage and search:
Elasticsearch
RabbitMQ (distributed queues)
AWS
Enrichments (15 languages):
Text categorization (topic, language)
NERD (person, location, organization, ...)
NED ( https://en.wikipedia.org/wiki/Tim_Cook )
Sentiment Analysis
Media Intelligence applications
Boolean queries (keywords / entities)
Counters
Aggregates
Drill downs / pivoting
AWS cluster: 354 i3.2xlarge, about 2800 vCPU, 21TB RAM, 630TB NVMe disks, 30K shards

JSON-XPath
{
  "articleTpls": [
    {
      "id": "article_template_0",
      "startUrls": [
        "http://www-03.ibm.com/press/us/en/pressrelease/33304.wss",
        "http://www-03.ibm.com/press/us/en/pressrelease/33303.wss"
      ],
      "urlPatterns": [
        "(?<wordset>([a-zA-Z]{1,}[:]{1,}){1,1})//(?<wordnumberset>([\\w]{1,}[-.]{1,}){1,3}[\\w]{1,})/(?<wordset1>([a-zA-Z]{1,}[/]{1,}){1,3}[a-zA-Z]{1,})/(?<wordnumberset1>([\\w]{1,}[.]{1,}){1,1}[\\w]{1,})"
      ],
      "titleXpath": "wrty:normalize-space(//h1[@class='ibm-small'])",
      "bylineXpath": "//div[@class='ibm-two-column']//strong",
      "ingressXpath": "wrty:normalize-space(//div[@id='ibm-content-main']/div[@class='ibm-container'][1]//p[1])",
      "contentXpath": {
        "includeXpath": "wrty:normalize-space(wrty:string-join(//div[@id='ibm-content-main']//div[@class='ibm-container-body']/node()[self::p|self::h2[@class='ibm-inner-subhead']], \"\\n\"))"
      },
      "engagementPatterns": [],
      "imagePatterns": [
        {
          "baseXpath": "//img[@width='500']",
          "urlXpath": "."
        }
      ],
      "authorPatterns": [
        {
          "baseXpath": "//div[@class='ibm-two-column']//strong",
          "nameXpath": "wrty:normalize-space(.)"
        }
      ]
    }
  ]
}
Instead of OXPath, a JSON-like wrapper specification is used
Why a JSON-like wrapper? Well, you can query JSON.
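
"You can query JSON" in practice: once wrappers are plain JSON, questions such as "which template handles this URL?" or "which templates lack a byline?" become a few lines of code. The specification below is a cut-down, partly invented variant (simplified URL patterns, no `wrty:` functions, a second template added for contrast).

```python
import json
import re

# Simplified wrapper specification (assumption: not the production schema).
SPEC = json.loads(r"""
{
  "articleTpls": [
    {
      "id": "article_template_0",
      "urlPatterns": ["^http://www-03\\.ibm\\.com/press/.+\\.wss$"],
      "titleXpath": "//h1[@class='ibm-small']",
      "bylineXpath": "//div[@class='ibm-two-column']//strong"
    },
    {
      "id": "article_template_1",
      "urlPatterns": ["^http://example\\.org/news/\\d+$"],
      "titleXpath": "//h1"
    }
  ]
}
""")


def templates_for(url, spec=SPEC):
    """Templates whose URL patterns match the given URL."""
    return [t for t in spec["articleTpls"]
            if any(re.search(pattern, url) for pattern in t["urlPatterns"])]


def xpaths(template):
    """All top-level XPath expressions of a template."""
    return {key: value for key, value in template.items()
            if key.endswith("Xpath") and isinstance(value, str)}


def templates_missing(field, spec=SPEC):
    """Ids of templates that do not define the given field."""
    return [t["id"] for t in spec["articleTpls"] if field not in t]


matched = templates_for("http://www-03.ibm.com/press/us/en/pressrelease/33304.wss")
print([t["id"] for t in matched])
print(xpaths(matched[0]))
print(templates_missing("bylineXpath"))
```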

What do you do with the data?
Content:
Companies
Brands
Products
Key people
Influencers
Goals:
Relate facts
Data mining
Cognitive applications
Challenges:
Data Cleaning
Data deduplication / integration
Truth Finding
Let’s build a knowledge graph
5M orgs, 10M people, 200M edges

More information
Selected papers:
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB (2014)
Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. (2013)
Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. RR (2011)
Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB (2013)
Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart: The ontological key: automatically understanding and integrating forms to access the deep Web. VLDB J. (2013)
S. Ortona, G. Orsi, M. Buoncristiano, T. Furche: Joint Repairs for Web Wrappers. ICDE (2016)
Tim Furche, Georg Gottlob, L. Libkin, Giorgio Orsi, N. Paton: Data Wrangling for Big Data: Challenges and Opportunities. EDBT (2016)
Omer Gunes, Giorgio Orsi, Tim Furche: Structured Aspect Extraction. COLING (2016)

Know somebody who is looking for a PhD?
Meltwater is sponsoring a PhD Scholarship in Large-scale Sentiment Analysis
The post is based at the School of Computer Science in Birmingham, supervised by Dr Mark Lee and myself
Access to Meltwater’s hoodies and goodies
AWS infrastructure
A huge knowledge graph
200B documents (social, editorial, financial statements, job posts) to play with

Questions?
More about me at: http://www.orsigiorgio.net/
More about Meltwater at: http://www.meltwater.com/
More about Wrapidity at: http://www.wrapidity.com/
