SlideShare a Scribd company logo
1 of 33
Download to read offline
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
The Impact of Source Code Normalization
on Main Content Extraction
Hadi Mohammadzadeh,
T. Gottron, F. Schweiggert and G. Nakhaeizadeh
Institute of Applied Information Processing
University of Ulm , Germany
hadi.mohammadzadeh@uni-ulm.de
WEBIST 2012, April 2012, Porto, Portugal
1 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
2 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
3 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
What is the Main Content Extarction (MCE)!!
Content Extraction (CE) is defined as the task of determining
the main textual content of an HTML document
and/or removing the additional items, such as advertisements,
navigation bars.
4 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
An example of web document with selected MC
5 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
What is the Problem
In recent years, many algorithms have been introduced to
provide solutions to this task
But, none of these methods demostrated perfect results on
all types of web documents
A typical flaw is a poor extraction performance on highly
structured web documents containing many hyperlinks,
such as Wikipedia
So, here AdDANAg will be introduced to address the
problem of main content extraction from highly structured
web documents
6 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
7 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the First Phase
In the First phase, DANAg calculates the amount of
content and code characters in each line of an HTML file
and stores these numbers in two one-dimensional arrays T
and S, respectively
8 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
Phase One : calculates the amount of content and code characters in each line of
an HTML file
9 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Second Phase
In the Second phase, DANAg computes diffi through
following smoothed Formula for each line i of the HTML
file, and keeps these new numbers in a one-dimensional
array D, to produce a smoothed plot which can be seen in
next slide
diffi = (Ti − Si) + (Ti+1 − Si+1) + (Ti-1 − Si-1) (1)
In the next slide we draw a column above the x-axis if
diffi > 0. Otherwise, a line with length |diffi | is depicted
below the x-axis
10 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
Phase Two : compute diffi = (Ti − Si) + (Ti+1 − Si+1) + (Ti-1 − Si-1)
11 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Third Phase
Finding the boundaries of the MC regions, but how
After recognizing the longest column above the x-axis, the
algorithm moves up and down in the HTML file to find all
paragraph belonging to the MC. But where is the end of
these movements?
The number of lines we need to traverse to find the next
MC paragraph is defined as a parameter P, initialized with
20.
By considering this parameter, we move up or down,
respectively until we can not find a line containing Content
characters.
At this moment, all lines we found make our MC.
12 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Fourth Phase
Finally, we feed all HTML lines determined in the previous
phase as an input to a parser. The output of parser shows
the main content
13 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
14 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
The Structure of AdDANAg
As we mentioned, AdDANAg addresses the problem of main
content extraction from highly structured web documents
The process behind AdDANAg can be divided into two
sections:
A pre-processing phase
And a core extraction phase which is DANAg
15 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
An Example of Wikipeia Web Page
16 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Comparison between BBC Web Site and Wikipeia
This figure shows some paragraphs of a normal HTML file.
There is no hyperlink in this code, so approaches like DANAg
has no problem in extracting the main content accurately
</div>
<p id="story_continues_2">&quot;Especially the under-fives and the pregnant
women, they&#039;re suffering from malnutrition and communicable disease like the
measles, diarrhoea and pneumonia,&quot; he said.</p>
<p>Earlier this week Mark Bowden, the UN humanitarian affairs co-ordinator
for Somalia, told the BBC that the country was close to famine. </p>
<p>Last week Somalia&#039;s al-Shabab Islamist militia - which has been
fighting the Mogadishu government - said it was lifting its ban on foreign aid
agencies provided they did not show a &quot;hidden agenda&quot;. </p>
<p>Some 3,000 people flee each day for neighbouring countries such as
Ethiopia and Kenya which are struggling to cope.</p>
</div>
17 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Comparison between BBC Web Site and Wikipeia
In comparison wirh the previous figure, this figure represents
some portions of source code with plenty of hyperlinks
18 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
AdDANAg Normalizes all HTML Hyperlinks Using a Substitution rule
The purpose of pre-processing step is to normalise the ratio
of content and code characters representing hyperlinks
Suppose the following hyperlink to be contained in an
HTML file:
<a href=“http://www.BBC.com/»BBC Web Site</a>
First, the length of the anchor text (in this example: BBC
Web Site) is determined and we refer to this value by LT
Then, we substitute the attribute part of the opening tag
with a placeholder text of length LT - 5, where the
subtracted 5 comes from the length of <a></a>
So, using the underscore sign _ as placeholder the new
hyperlink for our example should be as below:
<a _______>BBC Web Site</a>
19 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Original Diagram Produced by DANAg
20 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Original Diagram Produced by AdDANAg
21 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Smoothed Diagram Produced by DANAg
22 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Smoothed Diagram Produced by AdDANAg
23 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
24 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Data Sets
Tabelle: Evaluation corpus of 2166 web pages.
Web site Source Size Lang.
BBC www.bbc.co.uk/persian 598 Farsi
Hamshahri hamshahrionline.ir 375 Farsi
Jame Jam www.jamejamonline.ir 136 Farsi
Ahram www.jamejamonline.ir 188 Arabic
Reuters ara.reuters.com 116 Arabic
Embassy of www.teheran.diplo.de 31 Farsi
Germany Vertretung/teheran/fa
BBC www.bbc.co.uk/urdu 234 Urdu
BBC www.bbc.co.uk/pashto 203 Pashto
BBC www.bbc.co.uk/arabic 252 Arabic
Wiki fa.wikipedia.org 33 Farsi
25 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Data Sets
Tabelle: Evaluation corpus of 9101 web pages.
Web site Source Size Languages
BBC www.bbc.co.uk 1000 English
Economist www.economist.com 250 English
Golem golem.de 1000 German
Heise www.heise.de 1000 German
Manual several 65 German, English
Repubblica www.repubblica.it 1000 Italian
Spiegel www.spiegel.de 1000 German
Telepolis www.telepolis.de 1000 German
Wiki fa.wikipedia.org 1000 English
Yahoo news.yahoo.com 1000 English
Zdf www.heute.de 422 German
26 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Evaluation
r =
length(k)
length(g)
p =
length(k)
length(m)
F1 = 2 ∗
p ∗ r
p + r
27 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
28 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Evaluation results based on F1-measure
AlAhram
BBCArabic
BBCPashto
BBCPersian
BBCUrdu
Embassy
Hamshahri
JameJam
Reuters
Wikipedia
ACCB-40 0.871 0.826 0.859 0.892 0.948 0.784 0.842 0.840 0.900 0.736
BTE 0.853 0.496 0.854 0.589 0.961 0.810 0.480 0.791 0.889 0.817
DSC 0.871 0.885 0.840 0.950 0.896 0.824 0.948 0.914 0.851 0.747
FE 0.809 0.060 0.165 0.063 0.002 0.017 0.225 0.027 0.241 0.225
KFE 0.690 0.717 0.835 0.748 0.750 0.762 0.678 0.783 0.825 0.624
LQF-25 0.788 0.780 0.844 0.841 0.957 0.860 0.765 0.737 0.870 0.773
LQF-50 0.785 0.777 0.837 0.828 0.954 0.856 0.767 0.724 0.870 0.772
LQF-75 0.773 0.773 0.837 0.819 0.954 0.852 0.756 0.724 0.870 0.750
TCCB-18 0.886 0.826 0.912 0.925 0.990 0.887 0.871 0.929 0.959 0.814
TCCB-25 0.874 0.861 0.909 0.927 0.992 0.883 0.888 0.924 0.958 0.814
Density 0.879 0.202 0.908 0.741 0.958 0.882 0.920 0.907 0.934 0.665
DANAg 0.949 0.986 0.944 0.995 0.999 0.917 0.991 0.966 0.945 0.699
AdDANAg 0.949 0.985 0.944 0.996 0.999 0.917 0.991 0.973 0.945 0.852
29 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Evaluation results based on F1-measure
BBC
Economist
Zdf
Golem
Heise
Repubblica
Spiegel
Telepolis
Yahoo
Wikipedia
Manual
Plain 0.595 0.613 0.514 0.502 0.575 0.704 0.549 0.906 0.582 0.823 0.371
LQF 0.826 0.720 0.578 0.806 0.787 0.816 0.775 0.910 0.670 0.752 0.381
Crunch 0.756 0.815 0.772 0.837 0.810 0.887 0.706 0.859 0.738 0.725 0.382
DSC 0.937 0.881 0.847 0.958 0.877 0.925 0.902 0.902 0.780 0.594 0.403
TCCB 0.914 0.903 0.745 0.947 0.821 0.918 0.910 0.913 0.758 0.660 0.404
CCB 0.923 0.914 0.929 0.935 0.841 0.964 0.858 0.908 0.742 0.403 0.420
ACCB 0.924 0.890 0.929 0.959 0.916 0.968 0.861 0.908 0.732 0.682 0.419
Density 0.575 0.874 0.708 0.873 0.906 0.344 0.761 0.804 0.886 0.708 0.354
DANAg 0.924 0.900 0.912 0.979 0.945 0.970 0.949 0.932 0.952 0.646 0.401
AdDANAg 0.922 0.900 0.911 0.994 0.931 0.970 0.951 0.932 0.950 0.840 0.404
30 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
31 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Conclusion and Future Work
Result shows that the AdDANAg produces satisfactory
Main Content, specially on the rich hyperlink documents
Trying to improve AdDANAg to a free-parameter algorithm
32 / 33
Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Thank you for your attention
33 / 33

More Related Content

Similar to Webist2012 presentation

Computer programming all chapters
Computer programming all chaptersComputer programming all chapters
Computer programming all chapters
Ibrahim Elewah
 
comparative study on cross platfom frameworks mobile apps
comparative study on cross platfom frameworks mobile appscomparative study on cross platfom frameworks mobile apps
comparative study on cross platfom frameworks mobile apps
apurva vyas
 
SAP Training Manual Project Systems .pdf
SAP Training Manual Project Systems .pdfSAP Training Manual Project Systems .pdf
SAP Training Manual Project Systems .pdf
DineshChanakya1
 
Python time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving carsPython time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving cars
Andreas Pawlik
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Revolution Analytics
 

Similar to Webist2012 presentation (20)

abapin21days.pdf
abapin21days.pdfabapin21days.pdf
abapin21days.pdf
 
Computer programming all chapters
Computer programming all chaptersComputer programming all chapters
Computer programming all chapters
 
educational course/tutorialoutlet.com
educational course/tutorialoutlet.comeducational course/tutorialoutlet.com
educational course/tutorialoutlet.com
 
comparative study on cross platfom frameworks mobile apps
comparative study on cross platfom frameworks mobile appscomparative study on cross platfom frameworks mobile apps
comparative study on cross platfom frameworks mobile apps
 
Bdc BATCH DATA COMMUNICATION
Bdc BATCH DATA COMMUNICATIONBdc BATCH DATA COMMUNICATION
Bdc BATCH DATA COMMUNICATION
 
Resources for business professors and students including ERP
Resources for business professors and students including ERPResources for business professors and students including ERP
Resources for business professors and students including ERP
 
SAP Training Manual Project Systems .pdf
SAP Training Manual Project Systems .pdfSAP Training Manual Project Systems .pdf
SAP Training Manual Project Systems .pdf
 
Pagerank
PagerankPagerank
Pagerank
 
MaxTECH Technical Training Presentation from MaximoWorld 2018
MaxTECH Technical Training Presentation from MaximoWorld 2018MaxTECH Technical Training Presentation from MaximoWorld 2018
MaxTECH Technical Training Presentation from MaximoWorld 2018
 
Python time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving carsPython time series analysis and visualization for self-driving cars
Python time series analysis and visualization for self-driving cars
 
Training Storyboard
Training StoryboardTraining Storyboard
Training Storyboard
 
Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?Comparing Distributed Indexing To Mapreduce or Not?
Comparing Distributed Indexing To Mapreduce or Not?
 
R programming for data science
R programming for data scienceR programming for data science
R programming for data science
 
Precomputing recommendations with Apache Beam
Precomputing recommendations with Apache BeamPrecomputing recommendations with Apache Beam
Precomputing recommendations with Apache Beam
 
Web Traffic Time Series Forecasting
Web Traffic  Time Series ForecastingWeb Traffic  Time Series Forecasting
Web Traffic Time Series Forecasting
 
Chrome Developer Tools - Pro Tips & Tricks
Chrome Developer Tools - Pro Tips & TricksChrome Developer Tools - Pro Tips & Tricks
Chrome Developer Tools - Pro Tips & Tricks
 
Web Page Recommendation using Domain Knowledge and Web Usage Knowledge
Web Page Recommendation using Domain Knowledge and Web Usage KnowledgeWeb Page Recommendation using Domain Knowledge and Web Usage Knowledge
Web Page Recommendation using Domain Knowledge and Web Usage Knowledge
 
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
Flink Forward Berlin 2018: Raj Subramani - "A streaming Quantitative Analytic...
 
Data Science - Part II - Working with R & R studio
Data Science - Part II -  Working with R & R studioData Science - Part II -  Working with R & R studio
Data Science - Part II - Working with R & R studio
 
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results RevealedIs Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
Is Revolution R Enterprise Faster than SAS? Benchmarking Results Revealed
 

More from Hadi Mohammadzadeh

Accurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML FilesAccurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML Files
Hadi Mohammadzadeh
 
Main Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML FilesMain Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML Files
Hadi Mohammadzadeh
 
Content extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi MohammadzadehContent extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
Text mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehText mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
Text mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehText mining, By Hadi Mohammadzadeh
Text mining, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 

More from Hadi Mohammadzadeh (10)

TitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web PagesTitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web Pages
 
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
 
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
 
Accurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML FilesAccurate Main Content Extraction from Persian HTML Files
Accurate Main Content Extraction from Persian HTML Files
 
Main Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML FilesMain Content Extraction from Persian HTML Files
Main Content Extraction from Persian HTML Files
 
Information filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi MohammadzadehInformation filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi Mohammadzadeh
 
Content extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi MohammadzadehContent extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi Mohammadzadeh
 
Text mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehText mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi Mohammadzadeh
 
Text mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehText mining, By Hadi Mohammadzadeh
Text mining, By Hadi Mohammadzadeh
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi Mohammadzadeh
 

Recently uploaded

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
Health
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
anilsa9823
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
anilsa9823
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Recently uploaded (20)

+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

Webist2012 presentation

  • 1. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion The Impact of Source Code Normalization on Main Content Extraction Hadi Mohammadzadeh, T. Gottron, F. Schweiggert and G. Nakhaeizadeh Institute of Applied Information Processing University of Ulm , Germany hadi.mohammadzadeh@uni-ulm.de WEBIST 2012, April 2012, Porto, Portugal 1 / 33
  • 2. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 2 / 33
  • 3. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 3 / 33
  • 4. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion What is the Main Content Extarction (MCE)!! Content Extraction (CE) is defined as the task of determining the main textual content of an HTML document and/or removing the additional items, such as advertisements, navigation bars. 4 / 33
  • 5. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion An example of web document with selected MC 5 / 33
  • 6. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion What is the Problem In recent years, many algorithms have been introduced to provide solutions to this task But, none of these methods demostrated perfect results on all types of web documents A typical flaw is a poor extraction performance on highly structured web documents containing many hyperlinks, such as Wikipedia So, here AdDANAg will be introduced to address the problem of main content extraction from highly structured web documents 6 / 33
  • 7. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 7 / 33
  • 8. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg The Phases of DANAg, the First Phase In the First phase, DANAg calculates the amount of content and code characters in each line of an HTML file and stores these numbers in two one-dimensional arrays T and S, respectively 8 / 33
  • 9. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg Phase One : calculates the amount of content and code characters in each line of an HTML file 9 / 33
  • 10. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg The Phases of DANAg, the Second Phase In the Second phase, DANAg computes diffi through following smoothed Formula for each line i of the HTML file, and keeps these new numbers in a one-dimensional array D, to produce a smoothed plot which can be seen in next slide diffi = (Ti − Si) + (Ti+1 − Si+1) + (Ti-1 − Si-1) (1) In the next slide we draw a column above the x-axis if diffi > 0. Otherwise, a line with length |diffi | is depicted below the x-axis 10 / 33
  • 11. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg Phase Two : compute diffi = (Ti − Si) + (Ti+1 − Si+1) + (Ti-1 − Si-1) 11 / 33
  • 12. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg The Phases of DANAg, the Third Phase Finding the boundaries of the MC regions, but how After recognizing the longest column above the x-axis, the algorithm moves up and down in the HTML file to find all paragraph belonging to the MC. But where is the end of these movements? The number of lines we need to traverse to find the next MC paragraph is defined as a parameter P, initialized with 20. By considering this parameter, we move up or down, respectively until we can not find a line containing Content characters. At this moment, all lines we found make our MC. 12 / 33
  • 13. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Algorithm, DANAg The Phases of DANAg, the Fourth Phase Finally, we feed all HTML lines determined in the previous phase as an input to a parser. The output of parser shows the main content 13 / 33
  • 14. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 14 / 33
  • 15. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg The Structure of AdDANAg As we mentioned, AdDANAg addresses the problem of main content extraction from highly structured web documents The process behind AdDANAg can be divided into two sections: A pre-processing phase And a core extraction phase which is DANAg 15 / 33
  • 16. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg An Example of Wikipeia Web Page 16 / 33
  • 17. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Comparison between BBC Web Site and Wikipeia This figure shows some paragraphs of a normal HTML file. There is no hyperlink in this code, so approaches like DANAg has no problem in extracting the main content accurately </div> <p id="story_continues_2">&quot;Especially the under-fives and the pregnant women, they&#039;re suffering from malnutrition and communicable disease like the measles, diarrhoea and pneumonia,&quot; he said.</p> <p>Earlier this week Mark Bowden, the UN humanitarian affairs co-ordinator for Somalia, told the BBC that the country was close to famine. </p> <p>Last week Somalia&#039;s al-Shabab Islamist militia - which has been fighting the Mogadishu government - said it was lifting its ban on foreign aid agencies provided they did not show a &quot;hidden agenda&quot;. </p> <p>Some 3,000 people flee each day for neighbouring countries such as Ethiopia and Kenya which are struggling to cope.</p> </div> 17 / 33
  • 18. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Comparison between BBC Web Site and Wikipeia In comparison wirh the previous figure, this figure represents some portions of source code with plenty of hyperlinks 18 / 33
  • 19. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg AdDANAg Normalizes all HTML Hyperlinks Using a Substitution rule The purpose of pre-processing step is to normalise the ratio of content and code characters representing hyperlinks Suppose the following hyperlink to be contained in an HTML file: <a href=“http://www.BBC.com/»BBC Web Site</a> First, the length of the anchor text (in this example: BBC Web Site) is determined and we refer to this value by LT Then, we substitute the attribute part of the opening tag with a placeholder text of length LT - 5, where the subtracted 5 comes from the length of <a></a> So, using the underscore sign _ as placeholder the new hyperlink for our example should be as below: <a _______>BBC Web Site</a> 19 / 33
  • 20. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Original Diagram Produced by DANAg 20 / 33
  • 21. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Original Diagram Produced by AdDANAg 21 / 33
  • 22. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Smoothed Diagram Produced by DANAg 22 / 33
  • 23. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion AdDANAg, an Extended Version of DANAg Smoothed Diagram Produced by AdDANAg 23 / 33
  • 24. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 24 / 33
  • 25. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Data Sets Tabelle: Evaluation corpus of 2166 web pages. Web site Source Size Lang. BBC www.bbc.co.uk/persian 598 Farsi Hamshahri hamshahrionline.ir 375 Farsi Jame Jam www.jamejamonline.ir 136 Farsi Ahram www.jamejamonline.ir 188 Arabic Reuters ara.reuters.com 116 Arabic Embassy of www.teheran.diplo.de 31 Farsi Germany Vertretung/teheran/fa BBC www.bbc.co.uk/urdu 234 Urdu BBC www.bbc.co.uk/pashto 203 Pashto BBC www.bbc.co.uk/arabic 252 Arabic Wiki fa.wikipedia.org 33 Farsi 25 / 33
  • 26. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Data Sets Tabelle: Evaluation corpus of 9101 web pages. Web site Source Size Languages BBC www.bbc.co.uk 1000 English Economist www.economist.com 250 English Golem golem.de 1000 German Heise www.heise.de 1000 German Manual several 65 German, English Repubblica www.repubblica.it 1000 Italian Spiegel www.spiegel.de 1000 German Telepolis www.telepolis.de 1000 German Wiki fa.wikipedia.org 1000 English Yahoo news.yahoo.com 1000 English Zdf www.heute.de 422 German 26 / 33
  • 27. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Evaluation r = length(k) length(g) p = length(k) length(m) F1 = 2 ∗ p ∗ r p + r 27 / 33
  • 28. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 28 / 33
  • 29. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Evaluation results based on F1-measure AlAhram BBCArabic BBCPashto BBCPersian BBCUrdu Embassy Hamshahri JameJam Reuters Wikipedia ACCB-40 0.871 0.826 0.859 0.892 0.948 0.784 0.842 0.840 0.900 0.736 BTE 0.853 0.496 0.854 0.589 0.961 0.810 0.480 0.791 0.889 0.817 DSC 0.871 0.885 0.840 0.950 0.896 0.824 0.948 0.914 0.851 0.747 FE 0.809 0.060 0.165 0.063 0.002 0.017 0.225 0.027 0.241 0.225 KFE 0.690 0.717 0.835 0.748 0.750 0.762 0.678 0.783 0.825 0.624 LQF-25 0.788 0.780 0.844 0.841 0.957 0.860 0.765 0.737 0.870 0.773 LQF-50 0.785 0.777 0.837 0.828 0.954 0.856 0.767 0.724 0.870 0.772 LQF-75 0.773 0.773 0.837 0.819 0.954 0.852 0.756 0.724 0.870 0.750 TCCB-18 0.886 0.826 0.912 0.925 0.990 0.887 0.871 0.929 0.959 0.814 TCCB-25 0.874 0.861 0.909 0.927 0.992 0.883 0.888 0.924 0.958 0.814 Density 0.879 0.202 0.908 0.741 0.958 0.882 0.920 0.907 0.934 0.665 DANAg 0.949 0.986 0.944 0.995 0.999 0.917 0.991 0.966 0.945 0.699 AdDANAg 0.949 0.985 0.944 0.996 0.999 0.917 0.991 0.973 0.945 0.852 29 / 33
  • 30. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Evaluation results based on F1-measure BBC Economist Zdf Golem Heise Repubblica Spiegel Telepolis Yahoo Wikipedia Manual Plain 0.595 0.613 0.514 0.502 0.575 0.704 0.549 0.906 0.582 0.823 0.371 LQF 0.826 0.720 0.578 0.806 0.787 0.816 0.775 0.910 0.670 0.752 0.381 Crunch 0.756 0.815 0.772 0.837 0.810 0.887 0.706 0.859 0.738 0.725 0.382 DSC 0.937 0.881 0.847 0.958 0.877 0.925 0.902 0.902 0.780 0.594 0.403 TCCB 0.914 0.903 0.745 0.947 0.821 0.918 0.910 0.913 0.758 0.660 0.404 CCB 0.923 0.914 0.929 0.935 0.841 0.964 0.858 0.908 0.742 0.403 0.420 ACCB 0.924 0.890 0.929 0.959 0.916 0.968 0.861 0.908 0.732 0.682 0.419 Density 0.575 0.874 0.708 0.873 0.906 0.344 0.761 0.804 0.886 0.708 0.354 DANAg 0.924 0.900 0.912 0.979 0.945 0.970 0.949 0.932 0.952 0.646 0.401 AdDANAg 0.922 0.900 0.911 0.994 0.931 0.970 0.951 0.932 0.950 0.840 0.404 30 / 33
  • 31. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Outline 1 Motivation 2 DANAg Algorithm, DANAg 3 AdDANAg AdDANAg, an Extended Version of DANAg 4 Data Sets, Evaluation 5 Results 6 Conclusion 31 / 33
  • 32. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Conclusion and Future Work Result shows that the AdDANAg produces satisfactory Main Content, specially on the rich hyperlink documents Trying to improve AdDANAg to a free-parameter algorithm 32 / 33
  • 33. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion Thank you for your attention 33 / 33