Master’s Thesis
Sociopath: automatic local events extractor
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
June, 2017
Supervisor: Ing. Jan Drchal, PhD
Galina Alperovich
Problem: automatically extract event info from any web page
Requirements
- Extract: name, date, location, description of the event
- Automatic extraction regardless of design and web page structure
- High accuracy
Examples of different designs
Motivation
- An Information Extraction task on the Web
- Web technologies
- Machine learning: 4 classification tasks (name, date, location, description)
- A popular class of problems in search engines
An interesting and non-trivial task
Classification problem
Web element
id: //a[@id = 'name-id']
tag: <div>
text: “Summer classic concert”
font size: 16px
font weight: 300
block height: 35 px
block width: 79 px
X, Y coords: [155, 230]
# of siblings: 2
….
Class: “Event name”
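The element-to-label mapping above can be sketched as a plain Python structure; the field names and the feature-vector ordering below are illustrative, not the thesis's actual schema:

```python
# One training example sketched as a dict: a web element described by DOM,
# visual, spatial and textual properties, plus its target class.
# (Field names are assumptions for illustration.)
web_element = {
    "xpath": "//a[@id = 'name-id']",
    "tag": "div",
    "text": "Summer classic concert",
    "font_size": 16,     # px
    "font_weight": 300,
    "block_height": 35,  # px
    "block_width": 79,   # px
    "x": 155,
    "y": 230,
    "n_siblings": 2,
}
label = "event_name"

def to_feature_vector(el):
    """Turn the numeric properties into a fixed-order feature vector."""
    return [el["font_size"], el["font_weight"], el["block_height"],
            el["block_width"], el["x"], el["y"], el["n_siblings"]]

print(to_feature_vector(web_element))
```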
What would the training data look like?
ID    URL    class        Tag   Text                          Font size  Color_1  X    ...
id_1  url_1  name         div   “Summer festival”             57         240      120  ..
id_2  url_1  location     li    “Central park”                17         210      130  ..
id_3  url_1  description  span  “Summer is a perfect time..”  36         100      100  ..
id_4  url_2  no_event     a     “http://...”                  ..         ..       ..   ..
id_5  url_2  date         ..    ..                            ..         ..       ..   ..
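A table like this is naturally held in a pandas DataFrame. A minimal sketch with the slide's sample values (column names are assumptions, and only the fully specified rows are included):

```python
import pandas as pd

# Sketch of the training table: one row per web element, one column per
# feature, plus the target class. Values are taken from the slide's example.
rows = [
    ("id_1", "url_1", "name",        "div",  "Summer festival",            57),
    ("id_2", "url_1", "location",    "li",   "Central park",               17),
    ("id_3", "url_1", "description", "span", "Summer is a perfect time..", 36),
    ("id_4", "url_2", "no_event",    "a",    "http://...",               None),
]
df = pd.DataFrame(rows,
                  columns=["id", "url", "class", "tag", "text", "font_size"])
print(df[["class", "tag"]])
```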
Difficulties
- No training data available ⇒ we need to create it
- The list of relevant features must be specified
- Web pages are very different and diverse
- Full web page rendering is not fast
- Little previous research
Thesis structure
1. Literature review
2. Training data collection
3. Data cleaning
4. Exploratory data analysis
5. Modelling and evaluation

Architecture of the application
Implementation of training data collection
- Schema.org + Microdata semantic HTML markup: Event, Person, Product, Article, etc.
- Web Data Commons: a huge online archive of URLs with semantic markup
- MetaCentrum: a parallel crawler over these pages extracts features for the Event schema elements
A training dataset where we know exactly where the event components are!
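A rough sketch of how Microdata markup pinpoints the labelled elements of a schema.org Event, using only the standard library. The HTML snippet and the collector class are illustrative; the real pipeline renders full pages with PhantomJS and crawls at scale on MetaCentrum:

```python
from html.parser import HTMLParser

# Illustrative page fragment with schema.org/Event Microdata markup.
EVENT_HTML = """
<div itemscope itemtype="http://schema.org/Event">
  <h1 itemprop="name">Summer classic concert</h1>
  <span itemprop="startDate">2017-06-21</span>
  <span itemprop="location">Central park</span>
</div>
"""

class MicrodataCollector(HTMLParser):
    """Collect itemprop -> text pairs; these give free labels for training."""

    def __init__(self):
        super().__init__()
        self.props = {}       # itemprop name -> element text
        self._current = None  # itemprop currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current] = data.strip()
            self._current = None

collector = MicrodataCollector()
collector.feed(EVENT_HTML)
print(collector.props)
```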
Data cleaning and feature extraction

           Before cleaning   After cleaning
Features   300+              30
Rows       1.6M              170K
Some Features

DOM-tree-related:
- HTML tag
- Siblings in the tree
- Children in the tree
- Depth

Visual:
- Color of the text
- Text alignment
- Family, size and weight of the font
- Padding

Spatial:
- X and Y coordinates
- Visual properties of a block (height, width)

Textual:
- Tf-Idf matrix
- Punctuation and digits
- Upper-case letters
- Length of the text
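The textual features listed above are simple character statistics over an element's text; a minimal sketch (the function name and exact feature set are illustrative):

```python
import string

# Sketch of the textual features: character-level statistics of one
# element's text. (Feature names are assumptions for illustration.)
def textual_features(text):
    return {
        "length": len(text),
        "n_digits": sum(c.isdigit() for c in text),
        "n_punct": sum(c in string.punctuation for c in text),
        "n_upper": sum(c.isupper() for c in text),
    }

print(textual_features("Summer Classic Concert, June 21!"))
```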
Not all features are important

Feature importance for the ‘name’ component (Random Forest)

Top-5 features for the event name:
1. Font family
2. Tag
3. Block width
4. Font size
5. Number of upper-case letters
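Reading per-feature importances off a fitted Random Forest, as was done for the ‘name’ component, can be sketched on synthetic data (the feature names and data here are stand-ins, not the thesis's real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, but only feature 0 determines the label,
# so the forest should rank it first. (Names are illustrative stand-ins.)
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
feature_names = ["font_size", "block_width", "x", "y"]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
print(ranking)
```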
Evaluation

              Name   Date   Location   Description
Accuracy      0.86   0.91   0.81       0.87
Precision     0.86   0.90   0.81       0.83
Recall        0.90   0.95   0.91       0.91
F1-measure    0.86   0.91   0.82       0.86

Best metric values achieved for each event component.
Cross-validation with k = 5
Extreme Random Forest showed the best results on average
Classification models
- Random Forest
- SVM
- Logistic Regression
- Extreme Random Forest
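Comparing the listed models with 5-fold cross-validation can be sketched as follows. Synthetic data stands in for one component's real dataset, so the scores will not match the reported numbers; sklearn's `ExtraTreesClassifier` is used as the "Extreme Random Forest":

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for one component's (e.g. 'name') binary dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
}

# Mean F1 over 5 folds for each model, as in the evaluation setup (k = 5).
scores = {name: cross_val_score(m, X, y, cv=5, scoring="f1").mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda p: -p[1]):
    print(f"{name}: {s:.2f}")
```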
Tools
- Python: sklearn, seaborn
- PhantomJS for page rendering
- Scrapy, HTML features
- MetaCentrum (parallel crawling)

Feature engineering
- TF-IDF for word importance
- PCA, t-SNE
- Feature importance from XGBoost and Random Forest
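The TF-IDF step turns element texts into a sparse word-importance matrix; a minimal sklearn sketch (the sample texts are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each element's text becomes one row of a sparse TF-IDF matrix;
# columns are vocabulary words. (Sample texts are illustrative.)
texts = [
    "Summer classic concert",
    "Central park, main stage",
    "Summer is a perfect time for concerts",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(tfidf.shape)  # (number of elements, vocabulary size)
```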
Conclusion
- Review of modern Web extraction methods
- Parallel automatic collection of the training dataset
- Engineering of DOM-tree, visual, textual and spatial features
- Extensive dataset cleaning
- Insights into the dataset
- Several classification models for every event component
- The dataset is now public and the whole pipeline is published on GitHub
- Proof of concept for automatic training set collection
Thank you!
Headless PhantomJS is no longer supported; does that affect possible future work?

PhantomJS is a headless browser used for page rendering and automated web testing, and it must keep pace with modern browsers, so timely updates matter. If it is no longer actively maintained, other alternatives will be adopted (Nightmare, for example), because automated web interface testing is standard practice today.
Is it possible to render vector-format pictures with matplotlib?
Yes :)

from matplotlib import pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])  # plot something first
fig.savefig('filename.eps', format='eps')  # EPS is a vector format
Disadvantages of separate classification problems for every event component?

- Every element is considered independently of the others ⇒ we lose information
- Mutual positions and other relative features would probably improve the results
Do you plan to further utilize/promote your system?

Probably yes. I want to try to build a scalable system covering events in different cities; with such a framework it would be easy to find them.
