Diamond Application Development Crafting Solutions with Precision
Webist2012 presentation
1. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
The Impact of Source Code Normalization
on Main Content Extraction
Hadi Mohammadzadeh,
T. Gottron, F. Schweiggert and G. Nakhaeizadeh
Institute of Applied Information Processing
University of Ulm , Germany
hadi.mohammadzadeh@uni-ulm.de
WEBIST 2012, April 2012, Porto, Portugal
1 / 33
2. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
2 / 33
3. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
3 / 33
4. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
What is the Main Content Extarction (MCE)!!
Content Extraction (CE) is defined as the task of determining
the main textual content of an HTML document
and/or removing the additional items, such as advertisements,
navigation bars.
4 / 33
5. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
An example of web document with selected MC
5 / 33
6. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
What is the Problem
In recent years, many algorithms have been introduced to
provide solutions to this task
But, none of these methods demostrated perfect results on
all types of web documents
A typical flaw is a poor extraction performance on highly
structured web documents containing many hyperlinks,
such as Wikipedia
So, here AdDANAg will be introduced to address the
problem of main content extraction from highly structured
web documents
6 / 33
7. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
7 / 33
8. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the First Phase
In the First phase, DANAg calculates the amount of
content and code characters in each line of an HTML file
and stores these numbers in two one-dimensional arrays T
and S, respectively
8 / 33
9. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
Phase One : calculates the amount of content and code characters in each line of
an HTML file
9 / 33
10. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Second Phase
In the Second phase, DANAg computes diffi through
following smoothed Formula for each line i of the HTML
file, and keeps these new numbers in a one-dimensional
array D, to produce a smoothed plot which can be seen in
next slide
diffi = (Ti − Si) + (Ti+1 − Si+1) + (Ti-1 − Si-1) (1)
In the next slide we draw a column above the x-axis if
diffi > 0. Otherwise, a line with length |diffi | is depicted
below the x-axis
10 / 33
12. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Third Phase
Finding the boundaries of the MC regions, but how
After recognizing the longest column above the x-axis, the
algorithm moves up and down in the HTML file to find all
paragraph belonging to the MC. But where is the end of
these movements?
The number of lines we need to traverse to find the next
MC paragraph is defined as a parameter P, initialized with
20.
By considering this parameter, we move up or down,
respectively until we can not find a line containing Content
characters.
At this moment, all lines we found make our MC.
12 / 33
13. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Algorithm, DANAg
The Phases of DANAg, the Fourth Phase
Finally, we feed all HTML lines determined in the previous
phase as an input to a parser. The output of parser shows
the main content
13 / 33
14. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
14 / 33
15. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
The Structure of AdDANAg
As we mentioned, AdDANAg addresses the problem of main
content extraction from highly structured web documents
The process behind AdDANAg can be divided into two
sections:
A pre-processing phase
And a core extraction phase which is DANAg
15 / 33
16. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
An Example of Wikipeia Web Page
16 / 33
17. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Comparison between BBC Web Site and Wikipeia
This figure shows some paragraphs of a normal HTML file.
There is no hyperlink in this code, so approaches like DANAg
has no problem in extracting the main content accurately
</div>
<p id="story_continues_2">"Especially the under-fives and the pregnant
women, they're suffering from malnutrition and communicable disease like the
measles, diarrhoea and pneumonia," he said.</p>
<p>Earlier this week Mark Bowden, the UN humanitarian affairs co-ordinator
for Somalia, told the BBC that the country was close to famine. </p>
<p>Last week Somalia's al-Shabab Islamist militia - which has been
fighting the Mogadishu government - said it was lifting its ban on foreign aid
agencies provided they did not show a "hidden agenda". </p>
<p>Some 3,000 people flee each day for neighbouring countries such as
Ethiopia and Kenya which are struggling to cope.</p>
</div>
17 / 33
18. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Comparison between BBC Web Site and Wikipeia
In comparison wirh the previous figure, this figure represents
some portions of source code with plenty of hyperlinks
18 / 33
19. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
AdDANAg Normalizes all HTML Hyperlinks Using a Substitution rule
The purpose of pre-processing step is to normalise the ratio
of content and code characters representing hyperlinks
Suppose the following hyperlink to be contained in an
HTML file:
<a href=“http://www.BBC.com/»BBC Web Site</a>
First, the length of the anchor text (in this example: BBC
Web Site) is determined and we refer to this value by LT
Then, we substitute the attribute part of the opening tag
with a placeholder text of length LT - 5, where the
subtracted 5 comes from the length of <a></a>
So, using the underscore sign _ as placeholder the new
hyperlink for our example should be as below:
<a _______>BBC Web Site</a>
19 / 33
20. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Original Diagram Produced by DANAg
20 / 33
21. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Original Diagram Produced by AdDANAg
21 / 33
22. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Smoothed Diagram Produced by DANAg
22 / 33
23. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
AdDANAg, an Extended Version of DANAg
Smoothed Diagram Produced by AdDANAg
23 / 33
24. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
24 / 33
25. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Data Sets
Tabelle: Evaluation corpus of 2166 web pages.
Web site Source Size Lang.
BBC www.bbc.co.uk/persian 598 Farsi
Hamshahri hamshahrionline.ir 375 Farsi
Jame Jam www.jamejamonline.ir 136 Farsi
Ahram www.jamejamonline.ir 188 Arabic
Reuters ara.reuters.com 116 Arabic
Embassy of www.teheran.diplo.de 31 Farsi
Germany Vertretung/teheran/fa
BBC www.bbc.co.uk/urdu 234 Urdu
BBC www.bbc.co.uk/pashto 203 Pashto
BBC www.bbc.co.uk/arabic 252 Arabic
Wiki fa.wikipedia.org 33 Farsi
25 / 33
26. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Data Sets
Tabelle: Evaluation corpus of 9101 web pages.
Web site Source Size Languages
BBC www.bbc.co.uk 1000 English
Economist www.economist.com 250 English
Golem golem.de 1000 German
Heise www.heise.de 1000 German
Manual several 65 German, English
Repubblica www.repubblica.it 1000 Italian
Spiegel www.spiegel.de 1000 German
Telepolis www.telepolis.de 1000 German
Wiki fa.wikipedia.org 1000 English
Yahoo news.yahoo.com 1000 English
Zdf www.heute.de 422 German
26 / 33
27. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Evaluation
r =
length(k)
length(g)
p =
length(k)
length(m)
F1 = 2 ∗
p ∗ r
p + r
27 / 33
28. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
28 / 33
31. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Outline
1 Motivation
2 DANAg
Algorithm, DANAg
3 AdDANAg
AdDANAg, an Extended Version of DANAg
4 Data Sets, Evaluation
5 Results
6 Conclusion
31 / 33
32. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Conclusion and Future Work
Result shows that the AdDANAg produces satisfactory
Main Content, specially on the rich hyperlink documents
Trying to improve AdDANAg to a free-parameter algorithm
32 / 33
33. Motivation DANAg AdDANAg Data Sets, Evaluation Results Conclusion
Thank you for your attention
33 / 33