Master’s Thesis
Sociopath: automatic local events extractor
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
June, 2017
Supervisor: Ing. Jan Drchal, PhD
Galina Alperovich
Problem: automatically extract event info from any web page
Requirements
- Extract: name, date, location, description of the event
- Automatic extraction regardless of design and web page structure
- High accuracy
Examples of different designs
Motivation
- An Information Extraction task on the Web
- Web technologies
- Machine learning: 4 classification tasks (name, date, location, description)
- A popular class of problems in search engines
An interesting and non-trivial task
Classification problem
Web element
id: //a[@id = 'name-id']
tag: <div>
text: “Summer classic concert”
font size: 16px
font weight: 300
block height: 35 px
block width: 79 px
X, Y coords: [155, 230]
# of siblings: 2
….
Class: “Event name”
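The element-to-label mapping above can be sketched as a plain Python structure; the field names and the feature-vector ordering below are illustrative, not the thesis's actual schema:

```python
# One training example sketched as a dict: a web element described by DOM,
# visual, spatial and textual properties, plus its target class.
# (Field names are assumptions for illustration.)
web_element = {
    "xpath": "//a[@id = 'name-id']",
    "tag": "div",
    "text": "Summer classic concert",
    "font_size": 16,     # px
    "font_weight": 300,
    "block_height": 35,  # px
    "block_width": 79,   # px
    "x": 155,
    "y": 230,
    "n_siblings": 2,
}
label = "event_name"

def to_feature_vector(el):
    """Turn the numeric properties into a fixed-order feature vector."""
    return [el["font_size"], el["font_weight"], el["block_height"],
            el["block_width"], el["x"], el["y"], el["n_siblings"]]

print(to_feature_vector(web_element))
```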
What would the training data look like?
ID    URL    class        Tag   Text                          Font size  Color_1  X    ...
id_1  url_1  name         div   “Summer festival”             57         240      120  ..
id_2  url_1  location     li    “Central park”                17         210      130  ..
id_3  url_1  description  span  “Summer is a perfect time..”  36         100      100  ..
id_4  url_2  no_event     a     “http://...”                  ..         ..       ..   ..
id_5  url_2  date         ..    ..                            ..         ..       ..   ..
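A table like this is naturally held in a pandas DataFrame. A minimal sketch with the slide's sample values (column names are assumptions, and only the fully specified rows are included):

```python
import pandas as pd

# Sketch of the training table: one row per web element, one column per
# feature, plus the target class. Values are taken from the slide's example.
rows = [
    ("id_1", "url_1", "name",        "div",  "Summer festival",            57),
    ("id_2", "url_1", "location",    "li",   "Central park",               17),
    ("id_3", "url_1", "description", "span", "Summer is a perfect time..", 36),
    ("id_4", "url_2", "no_event",    "a",    "http://...",               None),
]
df = pd.DataFrame(rows,
                  columns=["id", "url", "class", "tag", "text", "font_size"])
print(df[["class", "tag"]])
```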
Difficulties
- No training data available ⇒ we need to create it
- The list of relevant features must be specified
- Web pages are very different and diverse
- Full web page rendering is not fast
- Little previous research
Thesis structure
1. Literature review
2. Training data collection
3. Data cleaning
4. Exploratory data analysis
5. Modelling and evaluation

Architecture of the application
Implementation of training data collection
- Schema.org + Microdata semantic HTML markup: Event, Person, Product, Article, etc.
- Web Data Commons: a huge online archive of URLs with semantic markup
- MetaCentrum: a parallel crawler over these pages extracts features for the Event schema elements
A training dataset where we know exactly where the event components are!
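A rough sketch of how Microdata markup pinpoints the labelled elements of a schema.org Event, using only the standard library. The HTML snippet and the collector class are illustrative; the real pipeline renders full pages with PhantomJS and crawls at scale on MetaCentrum:

```python
from html.parser import HTMLParser

# Illustrative page fragment with schema.org/Event Microdata markup.
EVENT_HTML = """
<div itemscope itemtype="http://schema.org/Event">
  <h1 itemprop="name">Summer classic concert</h1>
  <span itemprop="startDate">2017-06-21</span>
  <span itemprop="location">Central park</span>
</div>
"""

class MicrodataCollector(HTMLParser):
    """Collect itemprop -> text pairs; these give free labels for training."""

    def __init__(self):
        super().__init__()
        self.props = {}       # itemprop name -> element text
        self._current = None  # itemprop currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._current = attrs["itemprop"]

    def handle_data(self, data):
        if self._current and data.strip():
            self.props[self._current] = data.strip()
            self._current = None

collector = MicrodataCollector()
collector.feed(EVENT_HTML)
print(collector.props)
```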
Data cleaning and feature extraction

           Before cleaning   After cleaning
Features   300+              30
Rows       1.6M              170K
Some Features

DOM-tree-related:
- HTML tag
- Siblings in the tree
- Children in the tree
- Depth

Visual:
- Color of the text
- Text alignment
- Family, size and weight of the font
- Padding

Spatial:
- X and Y coordinates
- Visual properties of a block (height, width)

Textual:
- Tf-Idf matrix
- Punctuation and digits
- Upper-case letters
- Length of the text
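The textual features listed above are simple character statistics over an element's text; a minimal sketch (the function name and exact feature set are illustrative):

```python
import string

# Sketch of the textual features: character-level statistics of one
# element's text. (Feature names are assumptions for illustration.)
def textual_features(text):
    return {
        "length": len(text),
        "n_digits": sum(c.isdigit() for c in text),
        "n_punct": sum(c in string.punctuation for c in text),
        "n_upper": sum(c.isupper() for c in text),
    }

print(textual_features("Summer Classic Concert, June 21!"))
```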
Not all features are important

Feature importance for the ‘name’ component (Random Forest)

Top-5 features for the event name:
1. Font family
2. Tag
3. Block width
4. Font size
5. Number of upper-case letters
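Reading per-feature importances off a fitted Random Forest, as was done for the ‘name’ component, can be sketched on synthetic data (the feature names and data here are stand-ins, not the thesis's real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 4 features, but only feature 0 determines the label,
# so the forest should rank it first. (Names are illustrative stand-ins.)
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] > 0.5).astype(int)
feature_names = ["font_size", "block_width", "x", "y"]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda p: p[1], reverse=True)
print(ranking)
```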
Evaluation

              Name   Date   Location   Description
Accuracy      0.86   0.91   0.81       0.87
Precision     0.86   0.90   0.81       0.83
Recall        0.90   0.95   0.91       0.91
F1-measure    0.86   0.91   0.82       0.86

Best metric values achieved for each event component.
Cross-validation with k = 5
Extreme Random Forest showed the best results on average
Classification models
- Random Forest
- SVM
- Logistic Regression
- Extreme Random Forest
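Comparing the listed models with 5-fold cross-validation can be sketched as follows. Synthetic data stands in for one component's real dataset, so the scores will not match the reported numbers; sklearn's `ExtraTreesClassifier` is used as the "Extreme Random Forest":

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for one component's (e.g. 'name') binary dataset.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Extra Trees": ExtraTreesClassifier(random_state=0),
}

# Mean F1 over 5 folds for each model, as in the evaluation setup (k = 5).
scores = {name: cross_val_score(m, X, y, cv=5, scoring="f1").mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda p: -p[1]):
    print(f"{name}: {s:.2f}")
```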
Tools
- Python: sklearn, seaborn
- PhantomJS for page rendering
- Scrapy, HTML features
- MetaCentrum (parallel crawling)

Feature engineering
- TF-IDF for word importance
- PCA, t-SNE
- Feature importance from XGBoost and Random Forest
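The TF-IDF step turns element texts into a sparse word-importance matrix; a minimal sklearn sketch (the sample texts are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each element's text becomes one row of a sparse TF-IDF matrix;
# columns are vocabulary words. (Sample texts are illustrative.)
texts = [
    "Summer classic concert",
    "Central park, main stage",
    "Summer is a perfect time for concerts",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts)
print(tfidf.shape)  # (number of elements, vocabulary size)
```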
Conclusion
- Review of modern Web extraction methods
- Parallel automatic collection of the training dataset
- Engineering of DOM-tree, visual, textual and spatial features
- Extensive dataset cleaning
- Insights into the dataset
- Several classification models for every event component
- The dataset is now public and the whole pipeline is published on GitHub
- Proof of concept for automatic training set collection
Thank you!
Headless PhantomJS is no longer supported; does that affect possible future work?

PhantomJS is a headless browser used for page rendering and automated web testing, and it must keep pace with modern browsers, so timely updates matter. If it is no longer actively maintained, other alternatives will be adopted (Nightmare, for example), because automated web interface testing is standard practice today.
Is it possible to render vector-format pictures with matplotlib?
Yes :)

from matplotlib import pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])  # plot something first
fig.savefig('filename.eps', format='eps')  # EPS is a vector format
Disadvantages of separate classification problems for every event component?

- Every element is considered independently of the others ⇒ we lose information
- Mutual positions and other relative features would probably improve the results
Do you plan to further utilize/promote your system?

Probably yes. I want to try to build a scalable system covering events in different cities; with such a framework it would be easy to find them.
