Master’s Thesis
Sociopath: automatic local events extractor
Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering
June, 2017
Supervisor: Ing. Jan Drchal, PhD
Galina Alperovich
Problem: Extract event information automatically from any web page
Requirements
- Extract: name, date, location, description of the event
- Automatic extraction regardless of design and web page structure
- High accuracy
Examples of different designs
Motivation
- An Information Extraction task on the Web
- Web technologies
- Machine learning: 4 classification tasks (date, name, location, description)
- A popular type of problem in search engines
An interesting and non-trivial task
Classification problem
Web element
id: //a[@id = 'name-id']
tag: <div>
text: “Summer classic concert”
font size: 16px
font weight: 300
block height: 35 px
block width: 79 px
X, Y coords: [155, 230]
# of siblings: 2
….
Class: “Event name”
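As a rough illustration of this setup, one web element can be held as a flat feature record plus a class label. The sketch below is an assumption rather than the thesis code: the `WebElement` class and its field names are hypothetical, and only the example values come from the slide.

```python
from dataclasses import dataclass, asdict


@dataclass
class WebElement:
    """Hypothetical record for one rendered DOM element (field names are illustrative)."""
    xpath: str
    tag: str
    text: str
    font_size_px: float
    font_weight: int
    block_height_px: float
    block_width_px: float
    x: float
    y: float
    n_siblings: int
    label: str  # target class: "name", "date", "location", "description" or "no_event"


# Values taken from the example element on the slide.
element = WebElement(
    xpath="//a[@id = 'name-id']",
    tag="div",
    text="Summer classic concert",
    font_size_px=16,
    font_weight=300,
    block_height_px=35,
    block_width_px=79,
    x=155,
    y=230,
    n_siblings=2,
    label="name",
)

# Split the record into classifier input (features) and target (label).
row = asdict(element)
features = {key: value for key, value in row.items() if key != "label"}
target = row["label"]
print(features, target)
```

Each of the four classifiers would then see the same features, with the target reduced to "is this element the event name (or date, location, description) or not".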
What would the training data look like?
ID    URL    Class        Tag   Text                          Font size  Color_1  X    ...
id_1  url_1  name         div   “Summer festival”             57         240      120  ..
id_2  url_1  location     li    “Central park”                17         210      130  ..
id_3  url_1  description  span  “Summer is a perfect time..”  36         100      100  ..
id_4  url_2  no_event     a     “http://...”                  ..         ..       ..   ..
id_5  url_2  date         ..    ..                            ..         ..       ..   ..
Difficulties
- No training data available ⇒ we need to create it
- Specify the list of relevant features
- Web pages are highly diverse
- Full web page rendering is slow
- Little previous research
Thesis structure
1. Literature review
2. Training data collection
3. Data cleaning
4. Exploratory data analysis
5. Modelling and Evaluation
Architecture of the application
Implementation of training data collection
- Schema.org + Microdata semantic HTML markup: Event, Person, Product, Article, etc.
- Web Data Commons: a large online archive of URLs with semantic markup
- MetaCentrum: parallel crawling of the pages to extract features for the Event schema elements
Result: a training dataset in which we know exactly where the event components are!
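The labeling idea can be sketched with the standard library alone: elements inside a Schema.org Event scope that carry an `itemprop` attribute give (label, tag, text) triples for free. This is a minimal sketch under stated assumptions, not the thesis pipeline: the `EventMicrodataParser` class and the sample HTML are hypothetical, the markup is assumed to be well-formed, and no page rendering (and therefore no visual or spatial feature) is involved.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not tracked as open elements.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class EventMicrodataParser(HTMLParser):
    """Collects (itemprop, tag, text) for elements inside a Schema.org Event scope."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # (tag, itemprop or None, text parts, is_event_scope)
        self.event_depth = 0     # > 0 while inside an Event itemscope
        self.labeled = []        # resulting (itemprop, tag, text) triples

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        itemtype = attrs.get("itemtype") or ""
        is_event = "itemscope" in attrs and itemtype.endswith("schema.org/Event")
        if is_event:
            self.event_depth += 1
        inside_event = self.event_depth > 0 and not is_event
        itemprop = attrs.get("itemprop") if inside_event else None
        self.open_elements.append((tag, itemprop, [], is_event))

    def handle_data(self, data):
        # Text belongs to every open element that carries an itemprop.
        for _, itemprop, parts, _ in self.open_elements:
            if itemprop:
                parts.append(data)

    def handle_endtag(self, tag):
        # Assumption: well-formed markup, so the closing tag matches the last open one.
        if not self.open_elements or self.open_elements[-1][0] != tag:
            return
        _, itemprop, parts, is_event = self.open_elements.pop()
        if is_event:
            self.event_depth -= 1
        if itemprop:
            text = " ".join("".join(parts).split())  # normalize whitespace
            self.labeled.append((itemprop, tag, text))


SAMPLE_HTML = """
<div itemscope itemtype="http://schema.org/Event">
  <h1 itemprop="name">Summer classic concert</h1>
  <span itemprop="startDate">2017-06-21</span>
  <span itemprop="location">Central park</span>
  <p itemprop="description">Summer is a perfect time for music.</p>
</div>
<a href="#">Unrelated link</a>
"""

parser = EventMicrodataParser()
parser.feed(SAMPLE_HTML)
parser.close()
labeled = parser.labeled
for itemprop, tag, text in labeled:
    print(itemprop, tag, text)
```

Elements outside the Event scope, such as the unrelated link, get no label here; in the training table they would correspond to the `no_event` class.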
Data cleaning and feature extraction
Features 300 + 30
Rows 1.6M 170K
Some Features
DOM tree-related
- HTML tag
- Siblings in a tree
- Children in a tree
- Depth
Visual
- Color of the text
- Text alignment
- Family, size and weight of the font
- Padding
Spatial
- X and Y coordinates
- Visual properties of a block (h, w)
Textual
- Tf-Idf matrix
- Punctuation and digits
- Upper case letters
- Length of the text
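The simpler textual features can be computed directly from an element's text. The function below is a minimal sketch; its name and the output keys are hypothetical, and the thesis may count these quantities differently (for example as ratios rather than raw counts).

```python
import string


def textual_features(text: str) -> dict:
    """Count-based textual features for one element's text (illustrative names)."""
    return {
        "length": len(text),
        "n_upper": sum(ch.isupper() for ch in text),
        "n_digits": sum(ch.isdigit() for ch in text),
        "n_punct": sum(ch in string.punctuation for ch in text),
    }


feats = textual_features("Summer Festival, 21 June 2017!")
print(feats)
```

A date string tends to have a high digit count and an event name a high share of upper case letters, which is why such cheap counts can carry signal.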
Not all features are important
Feature importance for the ‘name’ class (Random Forest)
Top-5 for Event name:
1. Font family
2. Tag
3. Block width
4. Font size
5. Number of uppercase letters
Evaluation
             Name   Date   Location   Description
Accuracy     0.86   0.91   0.81       0.87
Precision    0.86   0.90   0.81       0.83
Recall       0.90   0.95   0.91       0.91
F1-measure   0.86   0.91   0.82       0.86
The highest value of each metric for every event component.
Cross-validation with k = 5
On average, Extreme Random Forest showed the best result.
Classification models
Random forest
SVM
Logistic regression
Extreme Random Forest
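For reference, the four metrics and the 5-fold split can be written out in a few lines. This is a sketch under assumptions: the helper names and the toy labels are hypothetical, each component is scored as one class against the rest, and the folds are contiguous without shuffling, which may differ from the thesis setup.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one class against all others."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def k_fold_indices(n, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, test
        start = stop


# Toy labels, invented for illustration only.
y_true = ["name", "no_event", "name", "date", "name", "no_event"]
y_pred = ["name", "name", "name", "date", "no_event", "no_event"]
metrics = binary_metrics(y_true, y_pred, positive="name")
folds = list(k_fold_indices(10, k=5))
print(metrics)
print(folds[0])
```

In a cross-validated run, the model is fitted on each `train` index set, scored on the matching `test` set, and the per-fold metrics are averaged.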
Tools
Python: sklearn, seaborn
PhantomJS for page rendering
Scrapy, HTML features
MetaCentrum (parallel crawling)
Feature engineering
TF-IDF for word importance
PCA, t-SNE
Feature importance from XGBoost and Random Forest
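To show what TF-IDF contributes, here is the textbook formula written out by hand: term frequency within a text multiplied by the log of the inverse document frequency. It is a sketch only. The `tf_idf` function and the sample texts are hypothetical, tokenization is plain whitespace splitting, and library implementations such as the one in sklearn use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


docs = ["summer festival", "central park", "summer concert in the park"]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in few element texts ("festival") gets a higher weight than one that appears in many ("summer"), and a word present in every text gets weight zero.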
Conclusion
- Review of modern Web extraction methods
- Parallel automatic collection of the training dataset
- Engineering of DOM-tree, visual, textual and spatial features
- Extensive dataset cleaning
- Insights into the dataset
- Several classification models for every event component
- The dataset is now public and the whole process is published on GitHub
- Proof of concept of automatic training set collection
Thank you!
Headless PhantomJS is no longer supported. Does that affect possible future work?
PhantomJS is a web testing tool that has to keep pace with modern web browsers, so timely updates are important.
If it is no longer actively supported, alternatives will take its place (Nightmare is one example), because automated web interface testing is standard practice today.
Is it possible to render vector format pictures with matplotlib?
Yes :)
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
# EPS is a vector format, so the saved figure scales without loss
fig.savefig('filename.eps', format='eps')
Disadvantages of separate classification problems for every event component?
- Every element is considered independently of the others ⇒ information is lost
- Mutual positions and other relative features would probably improve the results
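One way the second point could be made concrete is with pairwise features between candidate elements. This is a hypothetical sketch of possible future work, not something the thesis implements: the `relative_position` function, the dictionary keys, and the second element's coordinates are all invented for illustration.

```python
import math


def relative_position(a, b):
    """Offset and Euclidean distance from element a to element b, in pixels."""
    dx = b["x"] - a["x"]
    dy = b["y"] - a["y"]
    return {"dx": dx, "dy": dy, "distance": math.hypot(dx, dy)}


# Candidate name element (coordinates from the earlier slide) and a
# hypothetical candidate date element placed below and to the right of it.
name_candidate = {"x": 155, "y": 230}
date_candidate = {"x": 185, "y": 270}
rel = relative_position(name_candidate, date_candidate)
print(rel)
```

Such features would let a model learn, for example, that a date usually sits close to the event name, which independent per-element classifiers cannot express.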
Do you plan to further utilize/promote your system?
Probably yes. I want to try to build a scalable event system for different cities; such a framework would make local events easy to find.
