The document describes a new algorithm called DANA for extracting the main content from web pages written in right-to-left languages like Arabic and Persian. DANA uses UTF-8 encoding to distinguish ASCII and non-ASCII characters and identify regions with high non-ASCII character density as potential main content. It works in three phases: 1) counting ASCII and non-ASCII characters per line, 2) analyzing the character counts to find regions with main content, and 3) extracting the text from identified regions using an HTML parser. The algorithm aims to accurately extract main content faster than existing approaches by leveraging character encoding rather than HTML tags.
Implementation and Design of High Speed FPGA-based Content Addressable Memory (ijsrd.com)
CAM stands for content addressable memory, a special type of computer memory used in very high-speed search applications. A CAM implements a high-speed lookup-table function in a single clock cycle using dedicated comparison circuitry. It is also known as associative memory or an associative array, although the latter term usually refers to a programming data structure. Unlike standard computer memory (RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed so that the user supplies a data word and the CAM searches its entire memory to see whether that word is stored anywhere in it. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found. The design coding, simulation, logic synthesis, and implementation will be done using various EDA tools.
Duplicate Code Detection using Control Statements (Editor IJCATR)
Code clone detection is an important area of research, as reusability is a key factor in software evolution. Duplicate code degrades the design and structure of software and qualities like readability, changeability, and maintainability. Code clones also increase maintenance cost, as incorrect changes in copied code may lead to more errors. In this paper we address structural code similarity detection and propose new methods to detect structural clones using the structure of control statements. By structure we mean the order of control statements used in the source code. We have considered two orders of control structures: (i) the sequence of control statements as it appears; (ii) the execution flow of control statements.
EFFICIENT IMPLEMENTATION OF 16-BIT MULTIPLIER-ACCUMULATOR USING RADIX-2 MODIF... (VLSICS Design)
In this paper, we propose a new multiplier-and-accumulator (MAC) architecture for low-power and high-speed arithmetic. High-speed and low-power MAC units are required for digital signal processing applications like the Fast Fourier Transform, Finite Impulse Response filters, convolution, etc. To improve speed and reduce dynamic power, glitches (1-to-0 transitions) and spikes (0-to-1 transitions) need to be reduced. An adder designed using the spurious power suppression technique (SPST) avoids unwanted glitches and spikes, minimizing switching power dissipation and hence dynamic power. The radix-2 modified Booth algorithm halves the number of partial products by grouping bits from the multiplier term, which improves speed. The proposed radix-2 modified Booth MAC with SPST gives a factor of 5 less delay and 7% less power consumption compared to an array MAC.
Structures and enumerations are value types in C#. Value-type variables store their data on the stack. An enumeration is a set of named integer constants, declared using the enum keyword. C# enumerations are value data types: an enumeration contains its own values and cannot inherit from or be inherited by other types.
This paper introduces a Simulink model design for a modified fountain code, a new version of the traditional Luby transform (LT) codes.
The design constructs the blocks required to generate the generator matrix of limited-degree-hopping-segment Luby transform (LDHS-LT) codes. This code is especially designed for short-length data files, which have attracted great interest for wireless sensor networks. It generates the degrees in a predetermined sequence rather than randomly, and partitions the data file into segments. Data packets are selected serially according to the integer generated from both the degree and segment generators. The code is tested using a Monte Carlo simulation approach against conventional code generation using the robust soliton distribution (RSD) for degree generation, and the simulation results show better performance on all tested parameters.
Error Detection and Correction in SRAM Cell Using Decimal Matrix Code (iosrjce)
Error Correction Codes (ECCs) are commonly used to protect memories from soft errors. As technology scales, Multiple Cell Upsets (MCUs) become more common and affect a larger number of cells. To prevent MCUs, several ECCs are used, but their main problem is that they require complex encoder and decoder architectures and higher delay overheads. The decimal matrix code (DMC) minimizes the area and delay overheads compared to existing codes such as Hamming and RS codes, and also improves memory reliability by enhancing the error-correction capability. In this paper, a novel decimal matrix code based on divide-symbol is proposed to enhance memory reliability with lower delay overhead. The proposed DMC utilizes a decimal algorithm to obtain the maximum error-detection capability. Moreover, the encoder-reuse technique (ERT) is proposed to minimize the area overhead of extra circuits without disturbing the whole encoding and decoding processes. ERT uses the DMC encoder itself as part of the decoder.
Performance Enhancement of MIMO-OFDM using Redundant Residue Number System (IJECEIAES)
The telecommunication industry requires high-capacity networks with high data rates, achieved through Multiple-Input Multiple-Output (MIMO) communication combined with Orthogonal Frequency Division Multiplexing (OFDM). Still, the communication channel suffers from noise, interference, and distortion due to hardware design limitations and the channel environment. To combat these challenges and achieve enhanced performance, various error-control techniques are implemented to enable the receiver to detect and correct received errors, so that for a given transmitted signal power the system achieves a lower Bit Error Rate (BER). This research focuses on Redundant Residue Number System (RRNS) coding as a Forward Error Correction (FEC) scheme that improves the performance of MIMO-OFDM-based wireless communications in comparison with current methods such as Low-Density Parity Check (LDPC) coders at the transmitter side or equalizers at the receiver side. The BER performance of the system was measured using MATLAB for different simulated channel conditions, including the effects of signal amplitude reduction and multipath delay spreading. Simulation results show that the RRNS coding scheme enhances system performance over conventional error detection and correction schemes by utilizing the distinct features of the Residue Number System (RNS).
Design and implementation of log domain decoder (IJECEIAES)
The Low-Density Parity-Check (LDPC) code has become prominent in communications systems for error correction, owing to its robust error-correcting performance and its ability to meet the requirements of the 5G system. However, the main challenge facing researchers is the hardware implementation, because of its high complexity and long run-time. In this paper, an efficient and optimal design for a log domain decoder is implemented using Xilinx System Generator with an FPGA device, Kintex-7 (XC7K325T-2FFG900C). Results confirm that the proposed decoder gives a Bit Error Rate (BER) very close to theoretical calculations, which shows that this decoder is suitable for next-generation demands for high data rates with very low BER.
A NOVEL APPROACH FOR LOWER POWER DESIGN IN TURBO CODING SYSTEM (VLSICS Design)
Low power is an extremely important issue for future mobile communication systems; the focus of this paper is implementing turbo codes for low-power solutions. The effect on performance of variations in parameters like frame length, number of iterations, type of encoding scheme, and type of interleaver in the presence of additive white Gaussian noise is studied with a floating-point model. To obtain the effect of quantization and word-length variation, a fixed-point model of the application is also developed. The application performance measure, namely bit-error rate (BER), is used as a design constraint while optimizing for power and area coverage. Low-power optimization is performed at the implementation level through voltage scaling. With these techniques, power is reduced by 98.5%, area (LUTs) by 57%, and the speed grade is increased. This type of power manager is proposed and implemented based on the timing details of the turbo decoder in the VHDL model.
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Similar to Accurate Main Content Extraction from Persian HTML Files (20)
Removing Uninteresting Bytes in Software Fuzzing (Aftab Hussain)
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Epistemic Interaction: tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The Metaverse and AI: how can decision-makers harness the Metaverse for their... (Jen Stirrup)
The Metaverse is popularized in science fiction, and now it is becoming closer to being a part of our daily lives through the use of social media and shopping companies. How can businesses survive in a world where Artificial Intelligence is becoming the present as well as the future of technology, and how does the Metaverse fit into business strategy when futurist ideas are developing into reality at accelerated rates? How do we do this when our data isn't up to scratch? How can we move towards success with our data so we are set up for the Metaverse when it arrives?
How can you help your company evolve, adapt, and succeed using Artificial Intelligence and the Metaverse to stay ahead of the competition? What are the potential issues, complications, and benefits that these technologies could bring to us and our organizations? In this session, Jen Stirrup will explain how to start thinking about these technologies as an organisation.
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
📕 Curious on our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35: Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also held a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Essentials of Automations: The Art of Triggers and Actions in FME (Safe Software)
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days, 6.6.2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 (Albert Hoitingh)
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
The new frontiers of AI in RPA with UiPath Autopilot™ (UiPathCommunity)
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that integrates Artificial Intelligence into the development and use of Automations.
📕 Together we will look at some examples of using Autopilot across different tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Pushing the limits of ePRTC: 100ns holdover for 100 days (Adtran)
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Accurate Main Content Extraction from Persian HTML Files
1. Motivation UTF-8 Encoding Form DANA Results Conclusion
A Fast and Accurate Approach for
Main Content Extraction
based on Character Encoding
Hadi Mohammadzadeh, Thomas Gottron, Franz Schweiggert
and Gholamreza Nakhaeizadeh
Ph.D. Candidate
Institute of Applied Information Processing
University of Ulm
Germany
hadi.mohammadzadeh@uni-ulm.de
Tir 2011, August 2011, Toulouse, France
1 / 25
2. Outline
1 Motivation
2 UTF-8 Encoding Form
3 DANA
Algorithm, named DANA
Data Sets, Evaluation
4 Results
5 Conclusion
4. An example of web pages with selected MC
5. What are the right-to-left (R2L) languages?
Languages which are written from the right side to the left side: Arabic, Persian, Urdu, and Pashto.
What are our goals?
From a technical point of view, most main content extraction approaches use HTML tags to separate the main content from the extraneous items. This implies the need to employ a parser for the entire web page. Consequently, the computation costs of these main content extraction (MCE) approaches increase.
Thus, the main goal of the proposed algorithm is to increase both the accuracy and the effectiveness of MCE algorithms dealing with R2L languages.
7. UTF-8 Encoding Form
In UTF-8, ASCII characters need only one byte, with a value guaranteed to be less than 128.
In UTF-8, all letters of the right-to-left languages (Persian, Arabic, Pashto, and Urdu) take exactly 2 bytes, each byte with a value greater than 127.
By using a simple condition, we are able to separate ASCII characters from non-ASCII characters:
if the value of a byte is < 128, the byte belongs to an ASCII character;
otherwise, it belongs to a non-ASCII character.
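As a minimal sketch of this byte test (not the authors' code; the function name and the choice to count bytes rather than decoded characters are ours, noting that every R2L letter contributes exactly two bytes, so byte counts preserve the density signal):

```python
def count_line(line: bytes):
    """Classify the bytes of one HTML line by the UTF-8 rule above.

    Bytes with a value < 128 belong to ASCII characters; bytes with a
    value > 127 belong to non-ASCII (e.g. Persian or Arabic) letters.
    Returns (non_ascii_count, ascii_count), i.e. the per-line entries
    of the arrays T1 and T2 described in the next slides.
    """
    ascii_count = sum(1 for b in line if b < 128)
    return len(line) - ascii_count, ascii_count

# A line mixing HTML markup (ASCII) with the Persian word "سلام":
# the 4 Persian letters occupy 2 bytes each (8 non-ASCII bytes),
# while "<p>" and "</p>" contribute 7 ASCII bytes.
t1, t2 = count_line("<p>سلام</p>".encode("utf-8"))
```

No parsing is needed here: the test works directly on the raw byte stream, which is what makes the first phase cheap.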
9. Algorithm, named DANA
The Phases of DANA
In the first phase, we count the number of ASCII and non-ASCII characters in each line of the HTML file and save the counts in two 1-D arrays, T1 and T2.
In the second phase, we look for the areas comprising the MC in the HTML file, using the arrays T1 and T2.
Finally, we feed all HTML lines determined in the previous phase as input to a parser. The output of the parser is the main content.
10. Algorithm, named DANA
Phase One: Counting ASCII and non-ASCII characters of each line
DANA reads the HTML file line by line and counts the number of ASCII and non-ASCII characters in each line.
The number of non-ASCII characters of each line is saved in the one-dimensional array T1.
The number of ASCII characters of each line is saved in the one-dimensional array T2.
11. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
We recognize areas in the HTML file in which:
non-ASCII characters have high density;
ASCII characters have low density.
To illustrate our approach, we depict two diagrams. In the first diagram, for each line of the HTML file:
for the non-ASCII characters, a vertical line whose length equals the count stored in T1 is drawn above the x-axis;
similarly, for the ASCII characters, a vertical line whose length equals the count stored in T2 is drawn below the x-axis.
12. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
(Diagram: per-line character counts plotted as columns above the x-axis for non-ASCII characters and below it for ASCII characters.)
13. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
Different types of regions:
Regions with low or near-zero density of columns above the x-axis and high density of columns below the x-axis, labelled A.
Regions with high density of columns above the x-axis and low density of columns below the x-axis, labelled B.
Regions with a slight difference between the density of the columns above and below the x-axis, labelled C.
14. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
Now the problem of finding the MC in the HTML file becomes the problem of finding region B. To find region B we follow 3 steps:
1) Draw the smoothed diagram. For every column we calculate diff_i and draw a new diagram:
diff_i = (T1_i - T2_i) + (T1_(i+1) - T2_(i+1)) + (T1_(i-1) - T2_(i-1))
If diff_i > 0, we draw a line with length diff_i above the x-axis;
otherwise, we draw a line with length |diff_i| below the x-axis.
2) We find the column with the greatest length above the x-axis.
3) Finding the boundaries of the MC region.
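Steps 1 and 2 can be sketched in Python as follows (our reading of the slides; the handling of the first and last lines, where a neighbour is missing, is an assumption):

```python
def smoothed_diff(T1, T2):
    """Step 1: smooth each line's (non-ASCII - ASCII) count with its
    two neighbours: diff_i = (T1_i - T2_i) + (T1_(i+1) - T2_(i+1))
    + (T1_(i-1) - T2_(i-1)).

    T1 holds per-line non-ASCII counts, T2 per-line ASCII counts.
    Missing neighbours at the file's ends are simply omitted
    (an assumption; the slides do not specify the boundary case).
    """
    raw = [a - b for a, b in zip(T1, T2)]
    n = len(raw)
    return [raw[i]
            + (raw[i - 1] if i > 0 else 0)
            + (raw[i + 1] if i < n - 1 else 0)
            for i in range(n)]

def peak_column(diff):
    """Step 2: index of the tallest column above the x-axis."""
    return max(range(len(diff)), key=diff.__getitem__)

# Lines 2-3 are non-ASCII-dense (region B); the peak lands there.
diff = smoothed_diff([0, 1, 9, 8, 0], [5, 4, 0, 1, 6])
# diff == [-8, 1, 13, 10, 1]; peak_column(diff) == 2
```

The smoothing makes isolated noisy lines less likely to produce a spurious peak, since a single line's value is averaged with its two neighbours.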
15. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
(Diagram: the smoothed diff values plotted against the line numbers of the HTML file, with the main content region marked.)
16. Algorithm, named DANA
Phase Two: Finding areas comprising the MC
Finding the boundaries of the MC regions: but how?
After recognizing the longest column above the x-axis, the algorithm moves up and down in the HTML file to find all paragraphs belonging to the MC. But where do these movements end?
The number of lines we traverse to find the next MC paragraph is defined by a parameter P, initialized with 20.
Considering this parameter, we move up or down, respectively, until we cannot find a line containing non-ASCII characters.
At this point, all the lines we have found make up our MC.
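Step 3 can be sketched as an outward scan from the peak line. Here P is the parameter from the slide (default 20), T1 holds the per-line non-ASCII counts, and the exact stopping rule, giving up after P consecutive lines without non-ASCII characters, is our reading of the description rather than the authors' code:

```python
def find_boundaries(T1, peak, P=20):
    """Expand the MC region up and down from the peak line.

    In each direction we keep scanning while a line containing
    non-ASCII characters (T1[i] > 0) appears within the next P lines;
    once P consecutive lines without non-ASCII characters are seen,
    the region ends at the last non-ASCII line found.
    Returns (first_line, last_line) as indices into T1.
    """
    n = len(T1)

    def extend(step):
        last, gap, i = peak, 0, peak + step
        while 0 <= i < n and gap < P:
            if T1[i] > 0:
                last, gap = i, 0     # still inside the main content
            else:
                gap += 1             # a line with no non-ASCII characters
            i += step
        return last

    return extend(-1), extend(+1)

# With P=2, the isolated non-ASCII line at index 9 is too far to reach:
# find_boundaries([0, 0, 3, 0, 4, 0, 0, 0, 0, 5], peak=4, P=2) == (2, 4)
```

The parameter P trades recall for precision: a larger P bridges bigger gaps between MC paragraphs but risks absorbing nearby non-content text.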
17. Algorithm, named DANA
Phase Three: Extracting the MC
In the final phase, we feed all HTML lines determined in the previous phase as input to a parser.
Following our hypothesis, the output of the parser is exactly the main content.
18. Data Sets, Evaluation
Data Sets
Web site                    Num. of Pages   Language
BBC                         598             Farsi
Hamshahri                   375             Farsi
Jame Jam                    136             Farsi
Ahram                       188             Arabic
Reuters                     116             Arabic
Embassy of Germany, Iran    31              Farsi
BBC                         234             Urdu
BBC                         203             Pashto
BBC                         252             Arabic
Wiki                        33              Farsi
Languages covered: Arabic, Farsi, Pashto, and Urdu.
We have collected 2166 web pages from different web sites.
19. Data Sets, Evaluation
Evaluation
r = length(k) / length(g)
p = length(k) / length(m)
F1 = 2 * (p * r) / (p + r)
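Read as recall, precision, and F1 (an interpretation on our part: g is the gold-standard main content, m the extracted text, and k the correctly extracted part; the slides give only the formulas), the metrics compute as:

```python
def evaluate(len_k: int, len_g: int, len_m: int):
    """Recall r = length(k)/length(g), precision p = length(k)/length(m),
    F1 = 2*p*r/(p+r), following the slide's formulas.

    len_k: length of the correctly extracted text (assumed meaning),
    len_g: length of the gold-standard main content,
    len_m: length of the extracted output.
    """
    r = len_k / len_g
    p = len_k / len_m
    f1 = 2 * p * r / (p + r)
    return r, p, f1

# Example: 900 of 1000 gold characters recovered in a 1200-character
# extraction gives r = 0.9, p = 0.75, F1 ~ 0.818.
```

Note that F1 is the harmonic mean of p and r, so an extractor that returns the whole page (high r, low p) or a single paragraph (high p, low r) is penalized either way.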
22. Average processing time (MB/s)
Approach    Throughput (MB/s)
ACCB-40     0.40
BTE         0.17
DSC         7.76
FE          14.33
KFE         11.76
LQF-25      1.25
LQF-50      1.25
LQF-75      1.25
TCCB-18     17.09
TCCB-25     15.86
Density     7.62
DANA        19.43
24. Conclusion and Future Work
Results show that DANA produces a satisfactory MC, with an F1-measure > 0.935.
We do not need to use a parser in the first phase; using a parser is time-consuming.
Future work:
Generalizing DANA by differentiating between tags and text/content, to propose a new language-independent content extraction approach.
Extending DANA to Wikipedia web pages to obtain better results.
Improving DANA into a parameter-free algorithm.
25. Thank you for your attention