Motivation       UTF-8 Encoding Form          DANA             Results   Conclusion




                A Fast and Accurate Approach for
                    Main Content Extraction
                  based on Character Encoding

         Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert
                    and Gholamreza Nakhaeizadeh

                                    Ph.D. Candidate
                      Institute of Applied Information Processing
                                    University of Ulm
                                        Germany
                           hadi.mohammadzadeh@uni-ulm.de


                 TIR 2011, August 2011, Toulouse, France

Outline


      1 Motivation

      2 UTF-8 Encoding Form

      3 DANA
             Algorithm, named DANA
             Data Sets, Evaluation

      4 Results

      5 Conclusion



Motivation

[Figure: an example of a web page with the selected main content (MC)]




             What are the right-to-left (R2L) languages?
                 Languages written from right to left, such as
                 Arabic, Persian, Urdu and Pashto
             What are our goals?
                 From a technical point of view, most main content
                 extraction (MCE) approaches use HTML tags to separate the
                 main content from the extraneous items.
                 This requires a parser to process the entire web page,
                 which increases the computational cost of these
                 approaches.
                 Thus, the main goal of the proposed algorithm is to increase
                 both the accuracy and the effectiveness of MCE algorithms
                 dealing with R2L languages



UTF-8 Encoding Form

             In UTF-8, ASCII characters need only one byte, with a
             value guaranteed to be less than 128
             In UTF-8, all letters of the right-to-left languages (Persian,
             Arabic, Pashto, and Urdu) take exactly 2 bytes,
             each with a value greater than 127
             Using a simple condition, we can separate ASCII
             characters from Non-ASCII characters:
                 If the value of a byte is < 128 ⇒ the byte belongs to an
                 ASCII character
                 Otherwise it belongs to a Non-ASCII character
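
The byte test above can be sketched in a few lines of Python. This is only an illustration: the function name and the sample string are assumptions, not part of the original slides.

```python
def count_ascii_and_non_ascii(data: bytes) -> tuple[int, int]:
    """Return (ascii_bytes, non_ascii_bytes) for a UTF-8 encoded byte string."""
    # A byte below 128 encodes an ASCII character; every byte of a
    # multi-byte UTF-8 sequence is 128 or above.
    ascii_bytes = sum(1 for b in data if b < 128)
    return ascii_bytes, len(data) - ascii_bytes


# Hypothetical sample: a four-letter Persian word (written with \u escapes)
# wrapped in a <p> tag.
sample = "<p>\u0633\u0644\u0627\u0645</p>".encode("utf-8")
print(count_ascii_and_non_ascii(sample))
```

In this sample the two tags contribute seven ASCII bytes and the four letters contribute eight Non-ASCII bytes, two per letter.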




Algorithm, named DANA


The Phases of DANA




             In the first phase, we count the ASCII and Non-ASCII
             characters on each line of the HTML file and save the
             counts in two 1D arrays, T1 and T2
             In the second phase, we look for the areas comprising
             the MC in the HTML file, using the arrays T1 and T2
             Finally, we feed all HTML lines determined in the previous
             phase as input to a parser. The output of the parser is
             the main content




Phase One : Counting ASCII and Non-ASCII characters of each line




      DANA reads an HTML file line by line and counts the number
      of ASCII and Non-ASCII characters on each line
             The number of Non-ASCII characters on each line is saved
             in the one-dimensional array T1
             The number of ASCII characters on each line is saved in
             the one-dimensional array T2
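
Phase one can be sketched as follows. This sketch makes two assumptions: it counts bytes rather than decoded characters (so a two-byte letter counts twice), and it follows the convention of the diagram and the diff formula, where T1 holds the Non-ASCII counts and T2 the ASCII counts. The function name is hypothetical.

```python
def count_per_line(html_bytes: bytes) -> tuple[list[int], list[int]]:
    """Return (t1, t2): per-line Non-ASCII and ASCII byte counts."""
    t1: list[int] = []  # Non-ASCII bytes on each line (assumed convention)
    t2: list[int] = []  # ASCII bytes on each line (assumed convention)
    for line in html_bytes.splitlines():
        non_ascii = sum(1 for b in line if b >= 128)
        t1.append(non_ascii)
        t2.append(len(line) - non_ascii)
    return t1, t2


# Hypothetical two-line page: a navigation line and a Persian paragraph.
page = (
    b"<div class='nav'>menu</div>\n"
    + "<p>\u0633\u0644\u0627\u0645</p>\n".encode("utf-8")
)
t1, t2 = count_per_line(page)
```

No HTML parser is involved at this stage, which is the point of the approach: the file is only scanned byte by byte.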




Phase Two : Finding areas Comprising MC



      We look for areas of the HTML file in which:
             Non-ASCII characters have high density
             ASCII characters have low density
      To illustrate our approach, we depict two diagrams. In the first
      diagram, for each line of the HTML file:
             For the Non-ASCII characters, a vertical line is drawn
             above the x-axis, with the length stored in T1
             Similarly, for the ASCII characters, a vertical line is
             drawn below the x-axis, with the length stored in T2



      Different types of regions:
             Regions with low or near-zero density of columns above the
             x-axis and high density of columns below the x-axis,
             labelled A
             A region with high density of columns above the x-axis and
             low density of columns below the x-axis, labelled B
             Regions with only a slight difference between the density of
             the columns above and below the x-axis, labelled C




      Now the problem of finding the MC in the HTML file becomes the
      problem of finding region B. To find region B we follow 3 steps:
          1) Draw a smoothed diagram. For every column i we calculate
          diff_i and draw a new diagram:

              diff_i = (T1_i - T2_i) + (T1_{i+1} - T2_{i+1}) + (T1_{i-1} - T2_{i-1})     (1)

                   If diff_i > 0, we draw a line of length diff_i
                   above the x-axis
                   Otherwise, we draw a line of length |diff_i|
                   below the x-axis
             2) Find the column with the longest line above the
             x-axis
             3) Find the boundaries of the MC region
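
Steps 1 and 2 can be sketched as below. The slides do not say how the first and last lines are handled, so this sketch assumes that a missing neighbour contributes 0; the function names and the sample arrays are hypothetical.

```python
def smoothed_diff(t1: list[int], t2: list[int]) -> list[int]:
    """Sum (T1 - T2) over each line and its two direct neighbours."""
    raw = [a - b for a, b in zip(t1, t2)]
    diff = []
    for i in range(len(raw)):
        total = raw[i]
        if i > 0:
            total += raw[i - 1]  # previous line
        if i + 1 < len(raw):
            total += raw[i + 1]  # next line
        diff.append(total)
    return diff


def peak_line(diff: list[int]) -> int:
    """Index of the longest column above the x-axis (largest diff value)."""
    return max(range(len(diff)), key=diff.__getitem__)


# Hypothetical counts for a five-line file.
t1 = [0, 0, 10, 12, 0]   # Non-ASCII counts
t2 = [5, 6, 1, 0, 7]     # ASCII counts
diff = smoothed_diff(t1, t2)
peak = peak_line(diff)
```

Summing three adjacent lines keeps a single short line inside a long paragraph from breaking the region apart.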
      [Figure: smoothed diagram of diff calculated by formula (1), plotted
      against the number of lines in the HTML file, with the main content
      region marked]

      Finding the boundaries of the MC region, but how?
             After recognizing the longest column above the x-axis, the
             algorithm moves up and down in the HTML file to find all
             paragraphs belonging to the MC. But where do these
             movements end?
             The number of lines we need to traverse to find the next
             MC paragraph is defined as a parameter P, initialized to
             20.
             Using this parameter, we keep moving up and down
             until we cannot find a line containing
             Non-ASCII characters.
             At that point, all the lines we have found make up our MC.
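
One possible reading of this boundary search is sketched below. It assumes that T1 holds the per-line Non-ASCII counts and that the search in each direction stops once P consecutive lines contain no Non-ASCII characters; the slides do not spell out the exact stopping rule, so treat this as an interpretation. The function name is hypothetical.

```python
def find_mc_bounds(t1: list[int], peak: int, p: int = 20) -> tuple[int, int]:
    """Return (first, last) line indices of the main content region."""

    def extend(step: int) -> int:
        last_hit = peak      # last line known to contain Non-ASCII bytes
        gap = 0              # consecutive lines without Non-ASCII bytes
        i = peak + step
        while 0 <= i < len(t1) and gap < p:
            if t1[i] > 0:
                last_hit = i
                gap = 0
            else:
                gap += 1
            i += step
        return last_hit

    return extend(-1), extend(+1)


# Hypothetical Non-ASCII counts for a 13-line file, with the peak at line 5.
t1 = [3, 0, 0, 0, 5, 9, 4, 0, 2, 0, 0, 0, 1]
narrow = find_mc_bounds(t1, peak=5, p=2)
wide = find_mc_bounds(t1, peak=5, p=4)
```

A small P cuts the region off at the first longer gap, while a larger P also collects paragraphs that are separated from the peak by several markup-only lines.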

Phase Three : Extracting MC




      In the final phase,
      we feed all HTML lines determined in the previous phase as
      input to a parser.
      Following our hypothesis, the output of the parser is exactly the
      main content.
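
The slides do not name the parser that is used. As a stand-in, the sketch below uses `HTMLParser` from Python's standard library to strip the tags from the selected lines; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html_lines: list[str]) -> str:
    """Parse only the lines selected in phase two and return their text."""
    parser = TextCollector()
    parser.feed("\n".join(html_lines))
    parser.close()
    return " ".join(parser.chunks)


print(extract_text(["<p>Hello <b>world</b></p>", "<div>again</div>"]))
```

Because only the lines found in phase two reach the parser, the parsing cost depends on the size of the main content and not on the size of the whole page.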




Data Sets, Evaluation


Data Sets


              Web site                    Num. of pages   Language
              BBC                         598             Farsi
              Hamshahri                   375             Farsi
              Jame Jam                    136             Farsi
              Al Ahram                    188             Arabic
              Reuters                     116             Arabic
              Embassy of Germany, Iran    31              Farsi
              BBC                         234             Urdu
              BBC                         203             Pashto
              BBC                         252             Arabic
              Wikipedia                   33              Farsi


              Languages covered: Arabic, Farsi, Pashto, and Urdu
              In total, we collected 2166 web pages from these web sites.

Evaluation




              r = \frac{\mathrm{length}(k)}{\mathrm{length}(g)}

              p = \frac{\mathrm{length}(k)}{\mathrm{length}(m)}

              F1 = 2 \cdot \frac{p \cdot r}{p + r}
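
The three measures can be computed as in the sketch below. The slide does not define k, g, and m; the sketch assumes that g is the gold-standard main content, m is the extracted content, and k is the part the two have in common, and it takes their lengths as inputs. The function name and the zero-length handling are also assumptions.

```python
def f1_measure(len_k: int, len_g: int, len_m: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from the three lengths."""
    r = len_k / len_g if len_g else 0.0   # recall: overlap / gold standard
    p = len_k / len_m if len_m else 0.0   # precision: overlap / extracted
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return r, p, f1


# Hypothetical lengths: 8 units shared, 10 in the gold standard, 16 extracted.
r, p, f1 = f1_measure(len_k=8, len_g=10, len_m=16)
```

With these hypothetical lengths, recall is 0.8 and precision is 0.5, so extracting too much text lowers F1 even when most of the gold standard is covered.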




Evaluation results based on F1-measure




        Method    Al Ahram  BBC Arabic  BBC Pashto  BBC Persian  BBC Urdu  Embassy  Hamshahri  Jame Jam  Reuters  Wikipedia
        ACCB-40   0.8714    0.8255      0.8594      0.8925       0.9476    0.7837   0.8420     0.8398    0.8997   0.7364
        BTE       0.8534    0.4957      0.8544      0.5895       0.9606    0.8095   0.4801     0.7906    0.8891   0.8167
        DSC       0.8706    0.8849      0.8398      0.9505       0.8962    0.8238   0.9482     0.9142    0.8510   0.7471
        FE        0.8086    0.0600      0.1652      0.0626       0.0023    0.0173   0.2251     0.0275    0.2408   0.2250
        KFE       0.6905    0.7186      0.8349      0.7480       0.7504    0.7620   0.6777     0.7833    0.8253   0.6244
        LQF-25    0.7877    0.7796      0.8436      0.8410       0.9566    0.8596   0.7650     0.7372    0.8699   0.7735
        LQF-50    0.7855    0.7772      0.8374      0.8279       0.9544    0.8561   0.7673     0.7240    0.8699   0.7719
        LQF-75    0.7733    0.7727      0.8374      0.8190       0.9544    0.8516   0.7560     0.7240    0.8699   0.7497
        TCCB-18   0.8861    0.8265      0.9121      0.9253       0.9898    0.8867   0.8712     0.9292    0.9593   0.8142
        TCCB-25   0.8737    0.8608      0.9091      0.9271       0.9916    0.8832   0.8884     0.9240    0.9583   0.8142
        Density   0.8787    0.2016      0.9081      0.7415       0.9579    0.8818   0.9197     0.9063    0.9336   0.6649
        DANA      0.9845    0.9633      0.9363      0.9944       1.0       0.9350   0.9797     0.9452    0.9670   0.6740




Average processing speed (MB/s)



              Method                   Processing speed (megabytes/second)
              ACCB-40                                     0.40
              BTE                                         0.17
              DSC                                         7.76
              FE                                         14.33
              KFE                                        11.76
              LQF-25                                      1.25
              LQF-50                                      1.25
              LQF-75                                      1.25
              TCCB-18                                    17.09
              TCCB-25                                    15.86
              Density                                     7.62
              DANA                                       19.43




Conclusion and Future Work



             Results show that DANA produces satisfactory MC, with an
             F1-measure of about 0.935 or higher on every data set
             except Wikipedia (0.6740)
             We do not need a parser in the first phase; parsing
             is time consuming
             Future work
                 Generalizing DANA by differentiating between tags and
                 text/content, to propose a new language-independent
                 content extraction approach
                 Extending DANA to Wikipedia web pages to obtain better
                 results
                 Turning DANA into a parameter-free algorithm



                     Thank you for your attention




                                                                   25 / 25

More Related Content

What's hot

IRJET-RFID Based Book Tracking in Libraries: Using Bicam
IRJET-RFID Based Book Tracking in Libraries: Using BicamIRJET-RFID Based Book Tracking in Libraries: Using Bicam
IRJET-RFID Based Book Tracking in Libraries: Using Bicam
IRJET Journal
 
Multiplux
MultipluxMultiplux
Multiplux
KTHEJESH1
 
Duplicate Code Detection using Control Statements
Duplicate Code Detection using Control StatementsDuplicate Code Detection using Control Statements
Duplicate Code Detection using Control Statements
Editor IJCATR
 
Cs511 data-extraction
Cs511 data-extractionCs511 data-extraction
Cs511 data-extraction
Borseshweta
 
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
semanticsconference
 
Matlab 1 level_1
Matlab 1 level_1Matlab 1 level_1
Matlab 1 level_1
Ahmed Farouk
 
Operators and Expressions in C#
Operators and Expressions in C#Operators and Expressions in C#
Operators and Expressions in C#
Prasanna Kumar SM
 
C Programming: Structure and Union
C Programming: Structure and UnionC Programming: Structure and Union
C Programming: Structure and Union
Selvaraj Seerangan
 
June 05 MS3
June 05 MS3June 05 MS3
June 05 MS3Samimvez
 
15ss
15ss15ss
Syllabus of BCA Second Year JAMMU University
Syllabus of BCA Second Year JAMMU UniversitySyllabus of BCA Second Year JAMMU University
Syllabus of BCA Second Year JAMMU UniversityRavi Shairaywal
 
CIS110 Computer Programming Design Chapter (14)
CIS110 Computer Programming Design Chapter  (14)CIS110 Computer Programming Design Chapter  (14)
CIS110 Computer Programming Design Chapter (14)
Dr. Ahmed Al Zaidy
 
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
VLSICS Design
 
Structure and Enum in c#
Structure and Enum in c#Structure and Enum in c#
Structure and Enum in c#
Prasanna Kumar SM
 

What's hot (17)

Matlab tutorial
Matlab tutorialMatlab tutorial
Matlab tutorial
 
IRJET-RFID Based Book Tracking in Libraries: Using Bicam
IRJET-RFID Based Book Tracking in Libraries: Using BicamIRJET-RFID Based Book Tracking in Libraries: Using Bicam
IRJET-RFID Based Book Tracking in Libraries: Using Bicam
 
302 sargent word2007-ssp2008
302 sargent word2007-ssp2008302 sargent word2007-ssp2008
302 sargent word2007-ssp2008
 
Multiplux
MultipluxMultiplux
Multiplux
 
Duplicate Code Detection using Control Statements
Duplicate Code Detection using Control StatementsDuplicate Code Detection using Control Statements
Duplicate Code Detection using Control Statements
 
Cs511 data-extraction
Cs511 data-extractionCs511 data-extraction
Cs511 data-extraction
 
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
Philippe Martin and Jérémy Bénard | Importing, Translating and Exporting Know...
 
Matlab 1 level_1
Matlab 1 level_1Matlab 1 level_1
Matlab 1 level_1
 
M.Sc_Syllabus
M.Sc_SyllabusM.Sc_Syllabus
M.Sc_Syllabus
 
Operators and Expressions in C#
Operators and Expressions in C#Operators and Expressions in C#
Operators and Expressions in C#
 
C Programming: Structure and Union
C Programming: Structure and UnionC Programming: Structure and Union
C Programming: Structure and Union
 
June 05 MS3
June 05 MS3June 05 MS3
June 05 MS3
 
15ss
15ss15ss
15ss
 
Syllabus of BCA Second Year JAMMU University
Syllabus of BCA Second Year JAMMU UniversitySyllabus of BCA Second Year JAMMU University
Syllabus of BCA Second Year JAMMU University
 
CIS110 Computer Programming Design Chapter (14)
CIS110 Computer Programming Design Chapter  (14)CIS110 Computer Programming Design Chapter  (14)
CIS110 Computer Programming Design Chapter (14)
 
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF...
 
Structure and Enum in c#
Structure and Enum in c#Structure and Enum in c#
Structure and Enum in c#
 

Similar to Accurate Main Content Extraction from Persian HTML Files

A Simulink model for modified fountain codes
A Simulink model for modified fountain codesA Simulink model for modified fountain codes
A Simulink model for modified fountain codes
TELKOMNIKA JOURNAL
 
Error Detection and Correction in SRAM Cell Using Decimal Matrix Code
Error Detection and Correction in SRAM Cell Using Decimal Matrix CodeError Detection and Correction in SRAM Cell Using Decimal Matrix Code
Error Detection and Correction in SRAM Cell Using Decimal Matrix Code
iosrjce
 
D04561722
D04561722D04561722
D04561722
IOSR-JEN
 
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor boardLab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
Katrina Little
 
Performance evaluation of family of prime sequence codes in an ocdma system
Performance evaluation of family of prime sequence codes in an ocdma systemPerformance evaluation of family of prime sequence codes in an ocdma system
Performance evaluation of family of prime sequence codes in an ocdma system
IAEME Publication
 
1588147798Begining_ABUAD1.pdf
1588147798Begining_ABUAD1.pdf1588147798Begining_ABUAD1.pdf
1588147798Begining_ABUAD1.pdf
SemsemSameer1
 
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
IJECEIAES
 
Edifact
EdifactEdifact
Edifact
Ergoclicks
 
Design and implementation of log domain decoder
Design and implementation of log domain decoder Design and implementation of log domain decoder
Design and implementation of log domain decoder
IJECEIAES
 
Emergency Service Provide by Mobile
Emergency Service Provide by MobileEmergency Service Provide by Mobile
Emergency Service Provide by MobileSamiul Hoque
 
Convolutional Codes.pdf
Convolutional Codes.pdfConvolutional Codes.pdf
Convolutional Codes.pdf
Jimma University
 
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
IRJET Journal
 
Unit 3 - Data Link Layer - Part A
Unit 3 - Data Link Layer - Part AUnit 3 - Data Link Layer - Part A
Unit 3 - Data Link Layer - Part A
Chandan Gupta Bhagat
 
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEMA NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
VLSICS Design
 
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
IRJET Journal
 
SDH_Frame_Structure.ppt
SDH_Frame_Structure.pptSDH_Frame_Structure.ppt
SDH_Frame_Structure.ppt
KarthikeyanPandian7
 
Lcdf4 chap 03_p2
Lcdf4 chap 03_p2Lcdf4 chap 03_p2
Lcdf4 chap 03_p2
ozgur_can
 
T04504121126
T04504121126T04504121126
T04504121126
IJERA Editor
 

Similar to Accurate Main Content Extraction from Persian HTML Files (20)

A Simulink model for modified fountain codes
A Simulink model for modified fountain codesA Simulink model for modified fountain codes
A Simulink model for modified fountain codes
 
Error Detection and Correction in SRAM Cell Using Decimal Matrix Code
Error Detection and Correction in SRAM Cell Using Decimal Matrix CodeError Detection and Correction in SRAM Cell Using Decimal Matrix Code
Error Detection and Correction in SRAM Cell Using Decimal Matrix Code
 
D04561722
D04561722D04561722
D04561722
 
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor boardLab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
Lab 7 -RAM and ROM, Xilinx, Digelent BASYS experimentor board
 
Performance evaluation of family of prime sequence codes in an ocdma system
Performance evaluation of family of prime sequence codes in an ocdma systemPerformance evaluation of family of prime sequence codes in an ocdma system
Performance evaluation of family of prime sequence codes in an ocdma system
 
1588147798Begining_ABUAD1.pdf
1588147798Begining_ABUAD1.pdf1588147798Begining_ABUAD1.pdf
1588147798Begining_ABUAD1.pdf
 
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System
 
Edifact
EdifactEdifact
Edifact
 
O9edifact
O9edifactO9edifact
O9edifact
 
Design and implementation of log domain decoder
Design and implementation of log domain decoder Design and implementation of log domain decoder
Design and implementation of log domain decoder
 
Emergency Service Provide by Mobile
Emergency Service Provide by MobileEmergency Service Provide by Mobile
Emergency Service Provide by Mobile
 
Convolutional Codes.pdf
Convolutional Codes.pdfConvolutional Codes.pdf
Convolutional Codes.pdf
 
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
IRJET- A Survey on Encode-Compare and Decode-Compare Architecture for Tag Mat...
 
Unit 3 - Data Link Layer - Part A
Unit 3 - Data Link Layer - Part AUnit 3 - Data Link Layer - Part A
Unit 3 - Data Link Layer - Part A
 
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEMA NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM
 
536 541
536 541536 541
536 541
 
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
BER Performance Improvement for 4 X 4 MIMO Single Carrier FDMA System Using M...
 
SDH_Frame_Structure.ppt
SDH_Frame_Structure.pptSDH_Frame_Structure.ppt
SDH_Frame_Structure.ppt
 
Lcdf4 chap 03_p2
Lcdf4 chap 03_p2Lcdf4 chap 03_p2
Lcdf4 chap 03_p2
 
T04504121126
T04504121126T04504121126
T04504121126
 

More from Hadi Mohammadzadeh

TitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web PagesTitleFinder Extracting the Headline of News Web Pages
TitleFinder Extracting the Headline of News Web Pages
Hadi Mohammadzadeh
 
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Revealing Trends Based on Defined Queries in Biological Publications Using Co...
Hadi Mohammadzadeh
 
Webist2012 presentation
Webist2012 presentationWebist2012 presentation
Webist2012 presentation
Hadi Mohammadzadeh
 
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...Improving Retrieval Accuracy  in Main Content Extraction  from  HTML Web Docu...
Improving Retrieval Accuracy in Main Content Extraction from HTML Web Docu...
Hadi Mohammadzadeh
 
Information filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi MohammadzadehInformation filtering, By Hadi Mohammadzadeh
Information filtering, By Hadi Mohammadzadeh
Hadi Mohammadzadeh
 
Content extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi MohammadzadehContent extraction: By Hadi Mohammadzadeh
Content extraction: By Hadi MohammadzadehHadi Mohammadzadeh
 
Text mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehText mining by examples, By Hadi Mohammadzadeh
Text mining by examples, By Hadi MohammadzadehHadi Mohammadzadeh
 
Text mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehText mining, By Hadi Mohammadzadeh
Text mining, By Hadi MohammadzadehHadi Mohammadzadeh
 
Information retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehInformation retreival, By Hadi Mohammadzadeh
Information retreival, By Hadi MohammadzadehHadi Mohammadzadeh
 

Accurate Main Content Extraction from Persian HTML Files

  • 1. Motivation UTF-8 Encoding Form DANA Results Conclusion. A Fast and Accurate Approach for Main Content Extraction based on Character Encoding. Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert and Gholamreza Nakhaeizadeh. Ph.D. Candidate, Institute of Applied Information Processing, University of Ulm, Germany. hadi.mohammadzadeh@uni-ulm.de. TIR 2011, August 2011, Toulouse, France.
  • 2. Outline: 1 Motivation; 2 UTF-8 Encoding Form; 3 DANA (Algorithm, named DANA; Data Sets, Evaluation); 4 Results; 5 Conclusion.
  • 4. An example of a web page with its main content (MC) selected.
  • 5. What are the right-to-left (R2L) languages? Languages that are written from right to left, such as Arabic, Persian, Urdu and Pashto. What are our goals? From a technical point of view, most main content extraction (MCE) approaches use HTML tags to separate the main content from extraneous items. This requires parsing the entire web page, which increases the computation cost of these approaches. Thus, the main goal of the proposed algorithm is to increase both the accuracy and the effectiveness of MCE algorithms dealing with R2L languages.
  • 7. UTF-8 Encoding Form. In UTF-8, an ASCII character needs only one byte, whose value is guaranteed to be less than 128. In UTF-8, every letter of the right-to-left languages (Persian, Arabic, Pashto and Urdu) takes exactly 2 bytes, each with a value greater than 127. A simple condition therefore separates ASCII characters from non-ASCII characters: if the value of a byte is < 128, the byte belongs to an ASCII character; otherwise it belongs to a non-ASCII character.
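The byte test on slide 7 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `split_bytes` is hypothetical, and the sample string is an assumed example (a four-letter Persian word wrapped in a `<p>` tag).

```python
def split_bytes(text: str) -> tuple[int, int]:
    """Return (ascii_bytes, non_ascii_bytes) for the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    # A byte value below 128 is an ASCII character; anything else is part
    # of a multi-byte (non-ASCII) character.
    non_ascii = sum(1 for b in data if b >= 128)
    return len(data) - non_ascii, non_ascii


# Four Persian letters (written as escapes) inside a <p> element:
# the tags contribute 7 ASCII bytes, the letters 4 * 2 = 8 non-ASCII bytes.
sample = "<p>\u0633\u0644\u0627\u0645</p>"
print(split_bytes(sample))  # (7, 8)
```

Note that the sketch counts bytes rather than characters, so each R2L letter contributes two to the non-ASCII total.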
  • 9. Algorithm, named DANA. The phases of DANA: In the first phase, we count the number of ASCII and non-ASCII characters in each line of the HTML file and save the counts in two one-dimensional arrays, T1 and T2. In the second phase, we use the arrays T1 and T2 to find the areas of the HTML file that comprise the MC. Finally, we feed all HTML lines determined in the previous phase to a parser. The output of the parser is the main content.
  • 10. Phase One: Counting ASCII and non-ASCII characters of each line. DANA reads an HTML file line by line and counts the number of ASCII and non-ASCII characters in each line. The number of non-ASCII characters of each line is saved in the one-dimensional array T1. The number of ASCII characters of each line is saved in the one-dimensional array T2.
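Phase One can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `count_per_line` is hypothetical, counts are taken per byte, and T1 is assumed to hold the non-ASCII counts and T2 the ASCII counts, which is the convention the diagram description and the diff formula rely on.

```python
def count_per_line(html: bytes) -> tuple[list[int], list[int]]:
    """Return (t1, t2): per-line non-ASCII and ASCII byte counts."""
    t1: list[int] = []  # assumed: non-ASCII bytes per line
    t2: list[int] = []  # assumed: ASCII bytes per line
    for line in html.splitlines():
        non_ascii = sum(1 for b in line if b >= 128)
        t1.append(non_ascii)
        t2.append(len(line) - non_ascii)
    return t1, t2


# Three lines: an opening tag, four Persian letters, a closing tag.
page = "<div>\n\u0633\u0644\u0627\u0645\n</div>".encode("utf-8")
t1, t2 = count_per_line(page)
print(t1)  # [0, 8, 0]
print(t2)  # [5, 0, 6]
```

No HTML parser is involved at this stage; the file is only scanned byte by byte.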
  • 11. Phase Two: Finding areas comprising the MC. We look for areas in the HTML file in which non-ASCII characters have high density and ASCII characters have low density. To illustrate our approach, we depict two diagrams. In the first diagram, for each line of the HTML file, a vertical line whose length equals the number of non-ASCII characters, as stored in T1, is drawn above the x-axis. Similarly, a vertical line whose length equals the number of ASCII characters, as stored in T2, is drawn below the x-axis.
  • 12. Phase Two: Finding areas comprising the MC.
  • 13. Phase Two: Finding areas comprising the MC. Different types of regions: regions with low or near-zero density of columns above the x-axis and high density of columns below the x-axis, labelled A; regions with high density of columns above the x-axis and low density of columns below the x-axis, labelled B; regions with only a slight difference between the density of the columns above and below the x-axis, labelled C.
  • 14. Phase Two: Finding areas comprising the MC. Now the problem of finding the MC in the HTML file becomes the problem of finding region B. To find region B we follow 3 steps. 1) Draw a smoothed diagram: for every column we calculate $\mathit{diff}_i = (T1_i - T2_i) + (T1_{i+1} - T2_{i+1}) + (T1_{i-1} - T2_{i-1})$ (1) and draw a new diagram. If $\mathit{diff}_i > 0$, we draw a line of length $\mathit{diff}_i$ above the x-axis; otherwise, we draw a line of length $|\mathit{diff}_i|$ below the x-axis. 2) Find the longest column above the x-axis. 3) Find the boundaries of the MC region.
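Steps 1 and 2 on slide 14 can be sketched as below. The function names are hypothetical, and one detail is an assumption the slides do not cover: at the first and last line, the missing neighbour is treated as contributing zero.

```python
def smoothed_diff(t1: list[int], t2: list[int]) -> list[int]:
    """diff_i = (T1_i - T2_i) + (T1_{i+1} - T2_{i+1}) + (T1_{i-1} - T2_{i-1})."""
    raw = [a - b for a, b in zip(t1, t2)]
    diff = []
    for i in range(len(raw)):
        total = raw[i]
        if i > 0:
            total += raw[i - 1]  # previous line, if any
        if i + 1 < len(raw):
            total += raw[i + 1]  # next line, if any
        diff.append(total)
    return diff


def peak_line(diff: list[int]) -> int:
    """Index of the longest column above the x-axis (first one on ties)."""
    return max(range(len(diff)), key=diff.__getitem__)


t1 = [0, 40, 60, 0]    # non-ASCII counts per line (made-up numbers)
t2 = [30, 10, 5, 25]   # ASCII counts per line (made-up numbers)
d = smoothed_diff(t1, t2)
print(d)             # [0, 55, 60, 30]
print(peak_line(d))  # 2
```

Summing over the two neighbours smooths out isolated lines, so a single markup-heavy line inside a text paragraph does not split region B.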
  • 15. Phase Two: Finding areas comprising the MC. [Figure labels: diff calculated by formula (1); main content; number of lines in the HTML file.]
  • 16. Phase Two: Finding areas comprising the MC. Finding the boundaries of the MC region, but how? After recognizing the longest column above the x-axis, the algorithm moves up and down in the HTML file to find all paragraphs belonging to the MC. But where do these movements end? The number of lines we traverse to find the next MC paragraph is defined as a parameter P, initialized to 20. Using this parameter, we move up or down until we cannot find a line containing non-ASCII characters. At that point, all the lines we have found make up the MC.
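The boundary search on slide 16 leaves some room for interpretation. The sketch below shows one possible reading, not necessarily the authors' exact procedure: starting from the peak line, the region keeps growing in each direction as long as another line with non-ASCII characters turns up within the next P lines. The function name `expand_region` and the sample counts are hypothetical.

```python
def expand_region(t1: list[int], peak: int, p: int = 20) -> tuple[int, int]:
    """Return (first, last) line indices of the region around the peak.

    t1 is assumed to hold the non-ASCII count of each line.
    """
    def walk(step: int) -> int:
        edge = pos = peak
        gap = 0  # consecutive lines seen without non-ASCII characters
        while 0 <= pos + step < len(t1) and gap < p:
            pos += step
            if t1[pos] > 0:
                edge, gap = pos, 0  # found another content line
            else:
                gap += 1
        return edge

    return walk(-1), walk(+1)


counts = [3, 0, 0, 9, 12, 0, 7, 0, 0, 0, 4]
print(expand_region(counts, peak=4, p=2))  # (3, 6)
print(expand_region(counts, peak=4, p=3))  # (0, 6)
```

A larger P bridges longer runs of markup-only lines, at the cost of possibly pulling in unrelated text far from the peak.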
  • 17. Phase Three: Extracting the MC. In the final phase, we feed all HTML lines determined in the previous phase to a parser. Following our hypothesis, the output of the parser is exactly the main content.
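Phase Three can be sketched with the HTML parser from the Python standard library. The slides do not say which parser DANA uses, so `html.parser.HTMLParser` is a stand-in, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(lines: list[str]) -> str:
    """Feed only the selected HTML lines to the parser and join the text."""
    parser = TextCollector()
    parser.feed("\n".join(lines))
    parser.close()
    return " ".join(parser.chunks)


selected = ["<p>first <b>bold</b></p>", "<div>second</div>"]
print(extract_text(selected))  # first bold second
```

Only the lines chosen in Phase Two reach the parser, which is where the saving in parsing cost comes from.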
  • 18. Data Sets, Evaluation. Data sets:
        Web site                    Num. of pages   Language
        BBC                         598             Farsi
        Hamshahri                   375             Farsi
        Jame Jam                    136             Farsi
        Ahram                       188             Arabic
        Reuters                     116             Arabic
        Embassy of Germany, Iran    31              Farsi
        BBC                         234             Urdu
        BBC                         203             Pashto
        BBC                         252             Arabic
        Wiki                        33              Farsi
    Languages: Arabic, Farsi, Pashto, and Urdu. We have collected 2166 web pages from different web sites.
  • 19. Data Sets, Evaluation. Evaluation: $r = \frac{\mathrm{length}(k)}{\mathrm{length}(g)}$, $p = \frac{\mathrm{length}(k)}{\mathrm{length}(m)}$, $F1 = 2 \cdot \frac{p \cdot r}{p + r}$.
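The three measures on slide 19 can be computed as in the sketch below. The slide does not define k, g and m; the sketch assumes that g is the gold-standard main content, m the extracted content and k their overlap, and it takes the three lengths as plain numbers. The function name `f1_scores` and the sample lengths are hypothetical.

```python
def f1_scores(len_k: int, len_g: int, len_m: int) -> tuple[float, float, float]:
    """Return (recall, precision, F1) from the three lengths.

    Assumed meaning: len_k = overlap, len_g = gold standard, len_m = extracted.
    """
    r = len_k / len_g if len_g else 0.0
    p = len_k / len_m if len_m else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return r, p, f1


# 90 units of overlap, 100 in the gold standard, 120 extracted.
r, p, f1 = f1_scores(90, 100, 120)
print(round(r, 4), round(p, 4), round(f1, 4))  # 0.9 0.75 0.8182
```

The guards against zero lengths are a convenience of the sketch, so that empty documents yield 0.0 instead of raising an error.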
  • 21. Results. Evaluation results based on F1-measure. Data sets: BBC Persian, BBC Pashto, BBC Arabic, Hamshahri, BBC Urdu, Wikipedia, Jame Jam, Al Ahram, Embassy, Reuters. F1 per method (one value per data set):
        ACCB-40   0.8714  0.8255  0.8594  0.8925  0.9476  0.7837  0.8420  0.8398  0.8997  0.7364
        BTE       0.8534  0.4957  0.8544  0.5895  0.9606  0.8095  0.4801  0.7906  0.8891  0.8167
        DSC       0.8706  0.8849  0.8398  0.9505  0.8962  0.8238  0.9482  0.9142  0.8510  0.7471
        FE        0.8086  0.0600  0.1652  0.0626  0.0023  0.0173  0.2251  0.0275  0.2408  0.2250
        KFE       0.6905  0.7186  0.8349  0.7480  0.7504  0.7620  0.6777  0.7833  0.8253  0.6244
        LQF-25    0.7877  0.7796  0.8436  0.8410  0.9566  0.8596  0.7650  0.7372  0.8699  0.7735
        LQF-50    0.7855  0.7772  0.8374  0.8279  0.9544  0.8561  0.7673  0.7240  0.8699  0.7719
        LQF-75    0.7733  0.7727  0.8374  0.8190  0.9544  0.8516  0.7560  0.7240  0.8699  0.7497
        TCCB-18   0.8861  0.8265  0.9121  0.9253  0.9898  0.8867  0.8712  0.9292  0.9593  0.8142
        TCCB-25   0.8737  0.8608  0.9091  0.9271  0.9916  0.8832  0.8884  0.9240  0.9583  0.8142
        Density   0.8787  0.2016  0.9081  0.7415  0.9579  0.8818  0.9197  0.9063  0.9336  0.6649
        DANA      0.9845  0.9633  0.9363  0.9944  1.0     0.9350  0.9797  0.9452  0.9670  0.6740
  • 22. Results. Average processing time (MB/s):
        Method     Time performance (megabyte/second)
        ACCB-40    0.40
        BTE        0.17
        DSC        7.76
        FE         14.33
        KFE        11.76
        LQF-25     1.25
        LQF-50     1.25
        LQF-75     1.25
        TCCB-18    17.09
        TCCB-25    15.86
        Density    7.62
        DANA       19.43
  • 24. Conclusion and Future Work. The results show that DANA produces satisfactory MC, with an F1-measure > 0.935. We do not need a parser in the first phase; using a parser is time consuming. Future work: generalizing DANA by differentiating between tags and text/content to propose a new language-independent content extraction approach; extending DANA for Wikipedia web pages to obtain better results; trying to turn DANA into a parameter-free algorithm.
  • 25. Thank you for your attention.