Méthodologie de Recherche (Research Methodology)

     RSS Join Engine


 Maroun Baydoun inf1312, OGL
  Marwan Azzam inf1311, OGL


    Thursday, 14 May, 2010

Contents

1 Introduction
2 Related Work
  2.1   Joining RSS Feeds
  2.2   Comparing Text Documents
        2.2.1   Document Index Graph Model
  2.3   Relating RSS Items (News)
  2.4   Websites and Applications
3 Hypothesis
4 Architecture
5 Pseudo-code
6 Development
7 Implementation
8 Simulation
9 Consideration
  9.1   Advantages
  9.2   Disadvantages
  9.3   Possible improvements
  9.4   Other applications
10 Conclusion

1     Introduction

    RSS (most commonly expanded as "Really Simple Syndication") is a family of web feed formats used
to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized
format.

Internet users often subscribe to many RSS feeds to stay up to date on the latest news. However, this
information is spread out across many sources, which makes it difficult for users to keep track of the most
important headlines. Therefore, it is essential to come up with a tool to bring together news from different
sources and present the user with a single feed containing the intersection of all other feeds.




2     Related Work

2.1     Joining RSS Feeds

    To intersect different feeds, we start by defining semantic relatedness between RSS elements/items, so that
relations are determined from the meaning of terms rather than only from their syntactic properties.

XML documents can be compared according to:


    • Their structure. (structure-based similarity)

    • Their content. (content-based similarity)

    • A combination of both. (hybrid similarity)


RSS feeds can be related in several ways:


    • Inclusion: a news item can be completely included in another news item.

    • Intersection: two news items might refer to similar concepts. We say the two news items intersect.

    • Opposition: two news items might refer to the same topic but in opposite ways.




2.2     Comparing Text Documents

2.2.1     Document Index Graph Model


  In order to relate RSS feeds, it is helpful to use a document clustering technique, which segregates
documents into groups so that each group represents a topic that is different from those represented by the other
groups.

Any clustering technique relies on four concepts: a data representation model, a similarity measure, a cluster
model, and a clustering algorithm that builds the clusters based on the similarity measure and the data model.

Traditional document clustering techniques rely on single term analysis. They use the Vector Space Model.

The Vector Space Model represents a document as a vector of terms. The vector contains the weights of
the terms (the frequency of the terms for example). Similarity between documents can be measured using
similarity measures applied to the vectors (such as cosine). This model uses single-term analysis only (no
word proximity or phrase based analysis).
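
As an illustration of the Vector Space Model described above, the following is a minimal sketch that builds term-frequency vectors and compares them with the cosine measure. The class and method names are hypothetical, and raw term frequency is assumed as the weight; other weighting schemes are possible.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the Vector Space Model: term-frequency vectors compared by cosine. */
final class VectorSpaceSketch {

    /** Builds a term-frequency vector from whitespace-separated, lower-cased words. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tf.merge(word, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two term-frequency vectors (0 if either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }
}
```

Because the vectors ignore word order, "the cat sat" and "sat the cat" receive a cosine of (approximately) 1, which is exactly the information that phrase-based analysis tries to recover.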

Even though this approach is widely used, it proves insufficient because it leaves out phrase analysis. Therefore
a better method would be to combine single-term and phrase analysis.

This brings us to the introduction of a new model, the Document Index Graph (DIG).

The DIG is based on graph theory. It uses graph properties to match sentences of any length from a new document
against any number of previously seen documents. The time taken by this process is proportional to
the number of words in the new document.




2.3    Relating RSS Items (News)

  Each node (element) of an RSS tree is a pair e = ⟨η, ζ⟩, where e.η is the element name and e.ζ is its
content.

The concept of neighborhood

Neighborhood is used to identify the relationships between texts, and consequently between RSS elements.

Neighborhood can be classified as follows:


   • Semantic Neighborhood: The semantic neighborhood of a concept Ci is defined as the set of concepts
      Cj in a given knowledge base KB that are related to Ci via the hyponymy or meronymy semantic
      relations, directly or via transitivity.

   • Global Semantic Neighborhood: The global semantic neighborhood of a concept Ci is the union of its
      semantic neighborhoods with respect to all synonymy ( ≡ ), hyponymy and meronymy relations
      taken together.

   • Antonym Neighborhood: The antonym neighborhood of a concept Ci is defined as the set of concepts
      Cj in a given knowledge base KB that are related to Ci via the antonymy relation (ω), directly or transitively
      via synonymy ( ≡ ), hyponymy or hypernymy.


Relations and relatedness of RSS elements:

For two simple elements e1 and e2, the Element Relatedness (ER) algorithm returns a pair consisting of the
semantic relatedness value SemRel and the Relation, both based on the corresponding TR results for the labels and the contents.




Relation relies on a rule-based method combining the label and value relationships as follows:


   • Elements e1 and e2 are disjoint if either their labels or values are disjoint.

   • Element e1 includes e2, if e1.η includes e2.η and e1.ζ includes e2.ζ.

   • Two elements e1 and e2 intersect if either their labels or values intersect.

   • Two elements e1 and e2 are equal if both their labels and values are equal.

   • Two elements e1 and e2 are opposite if both their contents are opposite.
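
The rules above can be read as a function that combines a label relation with a value relation. The sketch below is one possible reading, not the ER algorithm itself: the source does not state which rule wins when several apply, so the order of the checks, the treatment of equality as a form of inclusion, and the fallback to "intersect" are all assumptions, and every name is hypothetical.

```java
/** Sketch of the rule-based element relation; the order of the checks is an assumption. */
final class ElementRelationSketch {

    enum Relation { EQUAL, INCLUDE, INTERSECT, OPPOSITE, DISJOINT }

    /**
     * Combines the relation between two labels with the relation between two values.
     * INCLUDE is read as "the first element's part includes the second's".
     */
    static Relation combine(Relation label, Relation value) {
        // Equal only if both labels and values are equal.
        if (label == Relation.EQUAL && value == Relation.EQUAL) {
            return Relation.EQUAL;
        }
        // Disjoint if either labels or values are disjoint.
        if (label == Relation.DISJOINT || value == Relation.DISJOINT) {
            return Relation.DISJOINT;
        }
        // Opposite if the contents (values) are opposite.
        if (value == Relation.OPPOSITE) {
            return Relation.OPPOSITE;
        }
        // Inclusion if both the label and the value include their counterparts
        // (assumption: an equal part counts as included).
        if (covers(label) && covers(value)) {
            return Relation.INCLUDE;
        }
        // Assumed fallback: anything else is treated as an intersection.
        return Relation.INTERSECT;
    }

    private static boolean covers(Relation r) {
        return r == Relation.EQUAL || r == Relation.INCLUDE;
    }
}
```

For example, under these assumptions, equal labels with an including value yield INCLUDE, while equal labels with opposite values yield OPPOSITE.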


For two RSS items I1 and I2, each containing a group of elements, the Item Relatedness (IR) Algorithm
returns a pair containing SemRel and Relation.




By combining relations between sub elements, the relation between two items I1 and I2 is identified using
the following rule-based method:


   • Items I1 and I2 are disjoint if all elements ei and ej are disjoint (elements are disjoint if there is no
     relatedness whatsoever between them, i.e., SemRel(I1, I2) = 0).

   • Item I1 includes I2 if all elements ei of I1 include all elements ej of I2.

   • Two items I1 and I2 intersect if at least two of their elements intersect.

   • Two items I1 and I2 are equal if all elements ei of I1 are equal to all elements ej of I2.

   • Two items I1 and I2 are opposite if at least two of their respective elements are opposite.



2.4    Websites and Applications

There are many websites and applications that provide services related to RSS feed aggregation, but none
of them implements an RSS join engine based on semantics. They simply enable users to merge
many news feeds into a single feed.

These tools include:


   • xFruits (http://www.xfruits.com)

   • Flock (http://flock.sourceforge.net/index.html)

   • RSSOwl (http://www.rssowl.org)

   • BlogBridge (http://www.blogbridge.com)

   • Yahoo Pipes (http://pipes.yahoo.com)

   • Feedzeo (http://feedzeo.sourceforge.net)




3      Hypothesis

    Based on the work presented above, the simplest method is to consider two phrases similar
if they have a predefined number of words in common. However, this method has substantial weaknesses
because it neglects the semantics of the phrases: sentences that are written differently but convey the same meaning
will be deemed dissimilar.
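
The weakness of this baseline is easy to reproduce. The following minimal sketch counts the distinct words two phrases share and compares the count with a threshold; the class name, the lower-casing, and the splitting on non-word characters are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Sketch of the naive baseline: two phrases are similar if they share enough words. */
final class WordOverlapSketch {

    /** Number of distinct words (case-insensitive) that appear in both phrases. */
    static int sharedWords(String a, String b) {
        Set<String> shared = words(a);
        shared.retainAll(words(b));
        return shared.size();
    }

    /** True if the phrases share at least `threshold` distinct words. */
    static boolean similar(String a, String b, int threshold) {
        return sharedWords(a, b) >= threshold;
    }

    /** Splits on non-word characters and drops the empty token a leading separator can produce. */
    private static Set<String> words(String s) {
        Set<String> out = new HashSet<>(Arrays.asList(s.toLowerCase().split("\\W+")));
        out.remove("");
        return out;
    }
}
```

With this sketch, "Stocks fell sharply on Monday" and "Shares dropped steeply at the start of the week" share no words and are rejected at any positive threshold, even though they report the same event. Conversely, "Stocks fell sharply on Monday" and "Stocks rose sharply on Monday" share four words despite saying opposite things.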

Thus, the proposed solution is to implement a phrase-based document similarity algorithm, based
on an index graph model, to create an RSS join engine. This engine takes up to five RSS feeds as input, places a
window on each feed, and then runs the similarity algorithm in order to intersect the feeds.




4      Architecture




    1. Feeds: The user can supply at most five RSS feeds.

    2. Parser: The application relies on the ROME RSS feed parser, which accepts the URL of the feed
       as a parameter and returns the list of its items.

    3. Windows: On every list of items, we place a window, which contains the five most recent items.

    4. Join Engine: The join engine applies the phrase-based document similarity algorithm to the items
       contained in the windows.
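
The windowing step can be sketched as follows. This is an illustration rather than the application's code: the Item class is a hypothetical stand-in for a parsed feed entry, the window size is left as a parameter, and every item is assumed to carry a publish date.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

/** Sketch of the window placed on each feed: the most recent items only. */
final class WindowSketch {

    /** Hypothetical stand-in for a parsed feed entry. */
    static final class Item {
        final String title;
        final Date published; // assumed to be non-null

        Item(String title, Date published) {
            this.title = title;
            this.published = published;
        }
    }

    /** Returns the `size` most recent items, newest first, without modifying the input list. */
    static List<Item> window(List<Item> items, int size) {
        List<Item> sorted = new ArrayList<>(items);
        Collections.sort(sorted, new Comparator<Item>() {
            @Override
            public int compare(Item a, Item b) {
                return b.published.compareTo(a.published); // newest first
            }
        });
        // A feed with fewer items than the window size yields a shorter window.
        return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
    }
}
```

Calling window(items, 5) on each parsed feed would give the join engine at most five items per feed to compare.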




5     Pseudo-code

CREATE GRAPH:

FOR EACH feed
      Read feed
      Parse feed
      Create window
      Sort feed items by publish date
      Include the five most recent items in the window
      CALL BUILD GRAPH
END FOR

BUILD GRAPH:

Create document
Fill document with feed items
FOR EACH sentence in document
      IF first-word of sentence NOT in graph
         Add first-word of sentence to graph
      END IF
      Create list
      FOR EACH word in sentence after first-word
         IF (previous-word, word) IS edge in graph
            Extend phrase matches in the list for sentences that continue along (previous-word, word)
            Add the new phrase matches to the list
         ELSE
            Add (previous-word, word) to graph
            Update sentence path in nodes previous-word and word
         END IF
      END FOR
END FOR
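
The BUILD GRAPH procedure can be sketched in plain Java as shown below. This is a simplified illustration under stated assumptions, not the application's JGraphT-based implementation: nodes are kept implicitly as map keys, each edge records where it was seen (document, sentence, position) so that a match is only extended when the words are also adjacent in the earlier document, and all class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified Document Index Graph: nodes are words, edges are consecutive word pairs. */
final class DigSketch {

    /** One place where an edge was seen: document, sentence, and position of its first word. */
    private static final class Occurrence {
        final int doc, sentence, position;
        Occurrence(int doc, int sentence, int position) {
            this.doc = doc; this.sentence = sentence; this.position = position;
        }
    }

    /** A phrase match that is still being extended along the current sentence. */
    private static final class Active {
        final int doc, sentence, start; // start = index of the first matched word in the new sentence
        int lastPosition;               // position of the last matched word in the earlier sentence
        Active(Occurrence o, int start) {
            this.doc = o.doc; this.sentence = o.sentence;
            this.start = start; this.lastPosition = o.position + 1;
        }
    }

    // edges.get(from).get(to) lists every occurrence of the word pair (from, to) seen so far.
    private final Map<String, Map<String, List<Occurrence>>> edges = new HashMap<>();
    private int nextDocId = 0;

    /** Adds a document (tokenized sentences) and returns the phrases it shares with earlier documents. */
    List<String> addDocument(List<List<String>> sentences) {
        int docId = nextDocId++;
        List<String> matches = new ArrayList<>();
        for (int s = 0; s < sentences.size(); s++) {
            List<String> words = sentences.get(s);
            List<Active> active = new ArrayList<>();
            for (int i = 1; i < words.size(); i++) {
                List<Occurrence> seen = edges
                        .computeIfAbsent(words.get(i - 1), k -> new HashMap<>())
                        .computeIfAbsent(words.get(i), k -> new ArrayList<>());
                List<Active> next = new ArrayList<>();
                for (Occurrence o : seen) {
                    if (o.doc == docId) continue; // only match against earlier documents
                    Active extended = null;
                    for (Active a : active) {
                        if (a.doc == o.doc && a.sentence == o.sentence && a.lastPosition == o.position) {
                            extended = a;
                            break;
                        }
                    }
                    if (extended != null) {       // the earlier sentence continues along this edge
                        extended.lastPosition++;
                        active.remove(extended);
                        next.add(extended);
                    } else {                      // a new two-word match starts here
                        next.add(new Active(o, i - 1));
                    }
                }
                // Matches that could not be extended end at the previous word.
                for (Active ended : active) matches.add(phrase(words, ended.start, i));
                active = next;
                seen.add(new Occurrence(docId, s, i - 1)); // graph building happens in the same pass
            }
            for (Active ended : active) matches.add(phrase(words, ended.start, words.size()));
        }
        return matches;
    }

    private static String phrase(List<String> words, int from, int toExclusive) {
        return String.join(" ", words.subList(from, toExclusive));
    }
}
```

Adding "river rafting is fun" and then "wild river rafting trips" reports the shared phrase "river rafting" for the second document. Each word pair of the new document is looked up once; the work per pair also depends on how many earlier occurrences of that pair exist.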




6      Development

    The engine is a Java web application developed with NetBeans IDE 6.8, using Java EE 6 and GlassFish v3
as the application server. The JavaServer Faces framework is adopted to simplify development.

The application uses several external open-source libraries that are not provided by default in Java:


    1. ROME, JDOM: for RSS feed parsing.

    2. OpenNLP: for natural language processing.

    3. JGraphT: for graph manipulation.




7      Implementation

    Every RSS feed is parsed using the ROME and JDOM libraries: the feed URL is given as input, and a
SyndFeed is returned as output. The SyndFeed type represents any kind of feed (RSS,
Atom). From the returned SyndFeed, a list of items of type SyndEntry is obtained.
Next, a window is associated with every feed in order to hold the fifteen most recent entries
from that list.

At the end of this step, there are at most five windows, each containing up to fifteen entries. OpenNLP
is then used to split each entry into sentences, stored in an array of String. The StringTokenizer class is
then used to divide each sentence into an array of words.
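
The two splitting steps can be sketched with the standard library alone. In this illustration, a naive regular expression stands in for the OpenNLP sentence detector (which needs a trained model), while word splitting uses StringTokenizer as in the text; the delimiter set and the lower-casing are assumptions, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Sketch of the two splitting steps: entry text into sentences, sentences into words. */
final class TokenizeSketch {

    /**
     * Naive sentence splitter: breaks after '.', '!' or '?' followed by whitespace.
     * This is only a stand-in for a trained sentence detector and will mis-split abbreviations.
     */
    static String[] sentences(String text) {
        return text.trim().split("(?<=[.!?])\\s+");
    }

    /** Splits a sentence into lower-cased words, treating whitespace and common punctuation as delimiters. */
    static String[] words(String sentence) {
        StringTokenizer tokens = new StringTokenizer(sentence, " \t\n\r\f.,;:!?\"()");
        List<String> out = new ArrayList<>();
        while (tokens.hasMoreTokens()) {
            out.add(tokens.nextToken().toLowerCase());
        }
        return out.toArray(new String[0]);
    }
}
```

For the text "Oil prices rose. Markets reacted quickly!", this yields two sentences, and the first one is split into the words "oil", "prices" and "rose".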

After that, JGraphT is used to create an empty directed graph, which forms the
basis of the Document Index Graph model. The graph is built as follows:


    • For every word, we check if it already exists in the graph; if not, we add it.

    • For every two consecutive words, an edge is created in the graph.


Phrase matching and graph building take place simultaneously. Phrase matching proceeds as follows:


    • If an edge already exists between two consecutive words, the path in the graph extending from that edge is
       followed until the last existing edge is reached; a matching phrase is then detected and added to the list of
       already matched phrases.

    • At the end of processing, some of the matching phrases must be eliminated because they don't hold any
       semantic value.


The remaining sentences should be evaluated in order to assess the degree of similarity between RSS entries.
This evaluation concerns:


    • The length of the sentence.

    • The weight of the sentence in the entry.


However, phrase-based matching alone can be insufficient. A better approach is to also incorporate
single-term similarity. Once similarity measures between entries are established, a new RSS feed is created using
ROME to contain the matched entries. This feed is returned as the join result.




8     Simulation

To test the application, the following was done:


    1. Launch it.

    2. Fill the text fields with RSS feed URLs.

    3. Inspect the results.


One example is the following:




Other tests were run, and the results are illustrated in the table below:




These results point out the following observations:


  1. Given that the tests are carried out online, there is a high probability that the feeds are constantly
     changing. Therefore, each evaluation may run on different inputs.

  2. The number of matched entries is not directly linked to the number of feeds entered.




9     Consideration

9.1     Advantages

    1. Ease of use: simple user interface.

    2. High performance: low processing time.

    3. Efficiency: minimal use of bandwidth.



9.2     Disadvantages

    1. Inputs are limited to five feeds.

    2. Not the optimal technique (though it incurs less overhead).



9.3     Possible improvements

    1. Increase the maximum number of inputs without sacrificing performance.

    2. Improve this technique to include semantic similarity so that the rate of matched entries increases.

    3. Create different versions of the application for different platforms, such as mobile phones and
       desktop applications.



9.4     Other applications

This technique can also be useful in other fields of application. It can be applied to match inputs other than
RSS feeds.

In general, it can be used to match any text content, such as speeches or research papers.




10     Conclusion

  In this paper, we described a technique for creating an RSS join engine. We discussed how RSS feeds can
be joined. Afterwards, we examined how text documents can be compared, and focused on the Document
Index Graph Model in the context of a phrase-based document similarity. Then we moved to enumerate how
RSS items can be related, before looking into previously developed websites and applications that attempted
to solve the question of how to join RSS feeds and finding that none of the preexisting solutions is well
adapted to this task.

By proposing a technique based on the Document Index Graph model, we were able to take advantage of the efficiency
and the ease of implementation of that model. Furthermore, this technique can be improved, and
can even be applied to fields other than RSS feeds.





More Related Content

Similar to Rss Join Engine

Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET Journal
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Toolsmcthedog
 
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...IJwest
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET Journal
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET Journal
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeSumant Tambe
 
Interactive news feed extraction system 2
Interactive news feed extraction system 2Interactive news feed extraction system 2
Interactive news feed extraction system 2IAEME Publication
 
R journal 2011-2
R journal 2011-2R journal 2011-2
R journal 2011-2Ajay Ohri
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Databaseijbuiiir1
 
LoCloud - D3.3: Metadata Enrichment services
LoCloud - D3.3: Metadata Enrichment servicesLoCloud - D3.3: Metadata Enrichment services
LoCloud - D3.3: Metadata Enrichment serviceslocloud
 

Similar to Rss Join Engine (20)

Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
IRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text RankIRJET- Automatic Text Summarization using Text Rank
IRJET- Automatic Text Summarization using Text Rank
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
Test Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful ToolsTest Strategy Utilising Mc Useful Tools
Test Strategy Utilising Mc Useful Tools
 
Sda 9
Sda   9Sda   9
Sda 9
 
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
AUTOMATIC CONVERSION OF RELATIONAL DATABASES INTO ONTOLOGIES: A COMPARATIVE A...
 
IRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software TechnologiesIRJET- A Novel Approch Automatically Categorizing Software Technologies
IRJET- A Novel Approch Automatically Categorizing Software Technologies
 
rscript_paper-1
rscript_paper-1rscript_paper-1
rscript_paper-1
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect Information
 
databases
databasesdatabases
databases
 
Reactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/SubscribeReactive Stream Processing for Data-centric Publish/Subscribe
Reactive Stream Processing for Data-centric Publish/Subscribe
 
Degreeproject
DegreeprojectDegreeproject
Degreeproject
 
Interactive news feed extraction system 2
Interactive news feed extraction system 2Interactive news feed extraction system 2
Interactive news feed extraction system 2
 
Focused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSIFocused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSI
 
system sequence diagram
system sequence diagramsystem sequence diagram
system sequence diagram
 
Msr2021 tutorial-di penta
Msr2021 tutorial-di pentaMsr2021 tutorial-di penta
Msr2021 tutorial-di penta
 
R journal 2011-2
R journal 2011-2R journal 2011-2
R journal 2011-2
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Database
 
Document Summarizer
Document SummarizerDocument Summarizer
Document Summarizer
 
LoCloud - D3.3: Metadata Enrichment services
LoCloud - D3.3: Metadata Enrichment servicesLoCloud - D3.3: Metadata Enrichment services
LoCloud - D3.3: Metadata Enrichment services
 

Recently uploaded

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfngoud9212
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Bluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdfBluetooth Controlled Car with Arduino.pdf
Bluetooth Controlled Car with Arduino.pdf
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Rss Join Engine

  • 1. M´thodologie de Recherche e RSS Join Engine Maroun Baydoun inf1312, OGL Marwan Azzam inf1311, OGL Thursday, 14 May, 2010
  • 2. Contents 1 Introduction 3 2 Related Work 3 2.1 Joining RSS Feeds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Comparing Text Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 Document Index Graph Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.3 Relating Rss Items (News) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.4 Websites and Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3 Hypothesis 8 4 Architecture 8 5 Pseudo-code 9 6 Development 11 7 Implementation 11 8 Simulation 12 9 Consideration 15 9.1 Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 9.2 Disadvantages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 9.3 Possible improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 1
  • 3. Contents (continued)
         9.4 Other applications
       10 Conclusion
  • 4. 1 Introduction
       RSS (most commonly expanded as "Really Simple Syndication") is a family of web feed formats used to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized format. Internet users often subscribe to many RSS feeds to stay up to date on the latest news. However, this information is spread across many sources, which makes it difficult for users to keep track of the most important headlines. A tool is therefore needed that brings together news from different sources and presents the user with a single feed containing the intersection of all the input feeds.
       2 Related Work
       2.1 Joining RSS Feeds
       To intersect different feeds, we start by defining semantic relatedness between RSS elements/items, so that relations are determined from the meaning of terms rather than from their syntactic properties alone. XML documents can be compared according to:
         • Their structure (structure-based similarity).
         • Their content (content-based similarity).
         • A combination of both (hybrid similarity).
       RSS news items can be related in several ways:
         • Inclusion: a news item can be completely included in another news item.
         • Intersection: two news items might refer to similar concepts. We say the two news items intersect.
         • Opposition: two news items might refer to the same topic but in opposite ways.
  • 5. 2.2 Comparing Text Documents
       2.2.1 Document Index Graph Model
       In order to relate RSS feeds, it is helpful to use a document clustering technique, which segregates documents into groups so that each group represents a topic different from those represented by the other groups. Any clustering technique relies on four concepts: a data representation model, a similarity measure, a cluster model, and a clustering algorithm that builds the clusters from the similarity measure and the data model.
       Traditional document clustering techniques rely on single-term analysis and use the Vector Space Model, which represents a document as a vector of terms. The vector contains the weights of the terms (for example, their frequencies). Similarity between documents can then be measured by applying a similarity measure (such as cosine) to the vectors. This model uses single-term analysis only (no word proximity or phrase-based analysis). Even though this approach is widely used, it proves insufficient because it leaves out phrase analysis. A better method is therefore to combine single-term and phrase analysis.
       This brings us to a newer model, the Document Index Graph (DIG). The DIG is based on graph theory. It uses graph properties to match a phrase of any length from a document against any number of previously seen documents. The time taken by this process is proportional to the number of words in the new document.
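As a minimal sketch of the Vector Space Model comparison described in Section 2.2.1, the Java class below builds term-frequency vectors and computes their cosine. The names `CosineSketch`, `termFrequencies`, and `cosine` are illustrative assumptions, and raw term frequency is assumed as the weight; this is not taken from the authors' implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class CosineSketch {
    /** Builds a sparse term-frequency vector: term -> number of occurrences. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        // Assumption: lower-cased tokens split on non-word characters.
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two sparse vectors; 0.0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            Integer other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> vector) {
        double sumOfSquares = 0.0;
        for (int weight : vector.values()) {
            sumOfSquares += (double) weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

Two texts made of the same words in a different order receive the same vector and therefore the maximal cosine, which is one way to see why single-term analysis alone cannot capture phrase information.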
  • 6. 2.3 Relating RSS Items (News)
       Each node or element e of an RSS tree is a pair e = ⟨η, ζ⟩, where e.η is the element name and e.ζ its content. The concept of neighborhood is used to identify relationships between texts and, consequently, between RSS elements. Neighborhood can be classified as follows:
         • Semantic neighborhood: the semantic neighborhood of a concept Ci is defined as the set of concepts Cj in a given knowledge base KB related to Ci via the hyponymy or meronymy semantic relations, directly or via transitivity.
         • Global semantic neighborhood: the global semantic neighborhood of a concept Ci is the union of its semantic neighborhoods with respect to the synonymy ( ≡ ), hyponymy, and meronymy relations taken together.
         • Antonym neighborhood: the antonym neighborhood of a concept Ci is defined as the set of concepts Cj in a given knowledge base KB related to Ci via the antonymy relation (ω), directly or transitively via synonymy ( ≡ ), hyponymy, or hypernymy.
  • 7. Relations and relatedness of RSS elements: for two simple elements e1 and e2, the Element Relatedness (ER) algorithm returns a pair consisting of the semantic relatedness value SemRel and the Relation, based on the corresponding TR label and content values. Relation relies on a rule-based method combining the label and value relationships as follows:
         • Elements e1 and e2 are disjoint if either their labels or their values are disjoint.
         • Element e1 includes e2 if e1.η includes e2.η and e1.ζ includes e2.ζ.
         • Two elements e1 and e2 intersect if either their labels or their values intersect.
         • Two elements e1 and e2 are equal if both their labels and their values are equal.
         • Two elements e1 and e2 are opposite if their contents are opposite.
       For two RSS items I1 and I2, each containing a group of elements, the Item Relatedness (IR) algorithm returns a pair containing SemRel and Relation. By combining the relations between sub-elements, the relation between two items I1 and I2 is identified using the following rule-based method:
         • Items I1 and I2 are disjoint if all elements ei and ej are disjoint (elements are disjoint if there is no relatedness whatsoever between them, i.e., SemRel(I1, I2) = 0).
         • Item I1 includes I2 if all elements ei of I1 include all elements ej of I2.
  • 8.    • Two items I1 and I2 intersect if at least two of their elements intersect.
         • Two items I1 and I2 are equal if all elements ei of I1 are equal to all elements ej of I2.
         • Two items I1 and I2 are opposite if at least two of their respective elements are opposite.
       2.4 Websites and Applications
       Many websites and applications provide services related to RSS feed aggregation, but none of them implements an RSS join engine based on semantics. They simply enable users to merge many news feeds into a single feed. These tools are:
         • xFruits (http://www.xfruits.com)
         • Flock (http://flock.sourceforge.net/index.html)
         • RSSOwl (http://www.rssowl.org)
         • BlogBridge (http://www.blogbridge.com)
         • Yahoo Pipes (http://pipes.yahoo.com)
         • Feedzeo (http://feedzeo.sourceforge.net)
  • 9. 3 Hypothesis
       Based on the preceding sections, the simplest method would be to consider two phrases similar if they have a predefined number of words in common. However, this method has substantial weaknesses because it neglects the semantics of the phrases: sentences written differently but conveying the same meaning are deemed dissimilar. The proposed solution therefore consists of implementing a phrase-based document similarity algorithm based on an index graph model to create an RSS join engine. This engine takes up to five RSS feeds as input, places a window on each feed, and then runs the similarity algorithm in order to intersect the feeds.
       4 Architecture
         1. Feeds: the user can supply up to five RSS feeds.
         2. Parser: the application relies on the ROME RSS feed parser, which accepts the URL of a feed as a parameter and returns the list of its items.
         3. Windows: on every list of items, we place a window containing the five most recent items.
  • 10.    4. Join Engine: the join engine applies the phrase-based document similarity algorithm to the items contained in the windows.
       5 Pseudo-code
         CREATE GRAPH:
           FOR EACH feed
             Read feed
             Parse feed
             Create window
             Sort feed items by publish date
             Include the five most recent items in the window
             CALL build graph
           END FOR
         BUILD GRAPH:
           Create document
           Fill document with feed items
           FOR EACH sentence in document
             IF first-word of sentence NOT in graph
               Add first-word of sentence to graph
             END IF
             Create list
             FOR EACH word in sentence
               IF (previous-word, word) IS edge in graph
                 Extend phrase matches in the list for sentences that continue along (previous-word, word)
                 Add the new phrase matches to the list
               ELSE
                 Add (previous-word, word) to graph
                 Update sentence path in nodes previous-word and word
               END IF
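The windowing step of CREATE GRAPH (sort the items by publish date, then keep the most recent ones) could be sketched in Java as follows. `WindowSketch`, `FeedItem`, and `mostRecent` are hypothetical names used only for illustration; in the application the items would come from the feed parser. The window size is passed as a parameter rather than fixed.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class WindowSketch {
    /** Hypothetical stand-in for a parsed feed item (title plus publish date). */
    static final class FeedItem {
        final String title;
        final Instant published;

        FeedItem(String title, Instant published) {
            this.title = title;
            this.published = published;
        }
    }

    /** Returns up to `size` most recent items, newest first; the input list is not modified. */
    static List<FeedItem> mostRecent(List<FeedItem> items, int size) {
        List<FeedItem> sorted = new ArrayList<>(items);
        // Sort by publish date in descending order (newest first).
        sorted.sort(Comparator.comparing((FeedItem item) -> item.published).reversed());
        // A feed with fewer items than the window size yields a shorter window.
        return new ArrayList<>(sorted.subList(0, Math.min(size, sorted.size())));
    }
}
```

Copying the list before sorting keeps the parsed feed untouched, so the same feed can be windowed again with a different size.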
  • 12. 6 Development
       The application is a Java web application developed with NetBeans IDE 6.8, using Java EE 6 and GlassFish v3 as the application server. The JavaServer Faces framework is adopted to simplify development. The application uses several external open-source libraries that are not provided with Java by default:
         1. ROME, JDOM: for RSS feed parsing.
         2. OpenNLP: for natural language processing.
         3. JGraphT: for graph manipulation.
       7 Implementation
       Every RSS feed is parsed using the ROME and JDOM libraries: the feed URL is given as input and a SyndFeed is returned as output. The SyndFeed type represents any kind of feed (RSS, Atom). A list of items of type SyndEntry is then obtained from the returned SyndFeed. Next, a window is associated with every feed in order to hold the fifteen most recent entries from the generated list. At the end of this step, there are at most five windows, each containing fifteen entries.
       OpenNLP is then used to split each entry into sentences, stored in an array of String. The StringTokenizer class is used to divide each sentence into an array of words. After that, JGraphT is used to create an empty directed graph, which constitutes the basis of the Document Index Graph model. The graph is built as follows:
         • For every word, we check whether it already exists in the graph; if not, we add it.
         • For every two consecutive words, an edge is created in the graph.
       Phrase matching and graph building take place simultaneously. Phrase matching occurs over the following steps:
  • 13.    • If an edge already exists between two consecutive words, the path in the graph continuing from that edge is followed until the last existing edge is reached; a matching phrase is detected and added to the list of already matched phrases.
         • At the end of processing, some of the matching phrases must be eliminated because they do not hold any semantic value.
       The remaining phrases should be evaluated in order to assess the degree of similarity between RSS entries. This evaluation concerns:
         • The length of the phrase.
         • The weight of the phrase in the entry.
       However, phrase-based matching alone can be insufficient. A better approach is to incorporate single-term similarity as well. Once the inter-entry similarity measures are established, a new RSS feed is created using ROME to contain the matched entries. This feed is returned as the join result.
       8 Simulation
       To test the application, the following was done:
         1. Launch it.
         2. Fill the text fields with RSS feed URLs.
         3. Inspect the results.
       One example is the following:
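The simultaneous graph building and phrase matching described in Section 7 can be sketched as below. This is a deliberately simplified variant, not the authors' code and not the full Document Index Graph: it uses plain `HashMap`s instead of JGraphT, and for each edge it stores only the set of documents containing it, rather than per-node sentence-path tables. A reported run is therefore only guaranteed to consist of edges already seen in other documents; it is not guaranteed to occur contiguously in one of them. `PhraseGraphSketch` and `addSentence` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

final class PhraseGraphSketch {
    // edges.get(from).get(to) = ids of the documents containing the edge from -> to
    private final Map<String, Map<String, Set<Integer>>> edges = new HashMap<>();

    /**
     * Adds one sentence of document docId to the graph and returns the runs of
     * two or more consecutive words whose edges were already present from
     * other documents.
     */
    List<String> addSentence(int docId, String sentence) {
        // Whitespace tokenization only; punctuation is not stripped in this sketch.
        List<String> words = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(sentence.toLowerCase());
        while (tokenizer.hasMoreTokens()) {
            words.add(tokenizer.nextToken());
        }

        List<String> matches = new ArrayList<>();
        StringBuilder run = null; // phrase currently being extended, if any
        for (int i = 1; i < words.size(); i++) {
            String previous = words.get(i - 1);
            String current = words.get(i);
            // Look up the edge, creating it (with no documents yet) if it is new.
            Set<Integer> docs = edges
                    .computeIfAbsent(previous, k -> new HashMap<>())
                    .computeIfAbsent(current, k -> new HashSet<>());
            boolean seenInOtherDocument = docs.stream().anyMatch(d -> d != docId);
            if (seenInOtherDocument) {
                if (run == null) {
                    run = new StringBuilder(previous);
                }
                run.append(' ').append(current);
            } else if (run != null) {
                matches.add(run.toString()); // the run ends at a new edge
                run = null;
            }
            docs.add(docId);
        }
        if (run != null) {
            matches.add(run.toString());
        }
        return matches;
    }
}
```

Each sentence is scanned once, so the work per new document grows with its number of words, in line with the cost stated for the DIG in Section 2.2.1.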
  • 14. (Figure: example run of the application.)
  • 15. Other tests were carried out; their results, tabulated in the original slides, lead to the following observations:
         1. Given that the tests are carried out online, there is a high probability that the feeds are constantly changing. Therefore, each evaluation may take place on different inputs.
         2. The number of matched entries is not directly linked to the number of feeds entered.
  • 16. 9 Consideration
       9.1 Advantages
         1. Ease of use: simple user interface.
         2. High performance: low processing time.
         3. Efficiency: minimal use of bandwidth.
       9.2 Disadvantages
         1. Input is limited to five feeds.
         2. Not the optimal technique (though it incurs less overhead).
       9.3 Possible improvements
         1. Increase the maximum number of inputs without sacrificing performance.
         2. Extend the technique to include semantic similarity so that the rate of matched entries increases.
         3. Create versions of the application for different platforms, such as mobile phones and desktop applications.
       9.4 Other applications
       This technique can also be useful in other fields of application. It can be applied to match inputs other than RSS feeds. In general, it can be used to match any text content, such as speeches or research papers.
  • 17. 10 Conclusion
       In this paper, we described a technique for creating an RSS join engine. We discussed how RSS feeds can be joined. We then examined how text documents can be compared, focusing on the Document Index Graph model in the context of phrase-based document similarity. Next, we enumerated how RSS items can be related, before looking into existing websites and applications that attempt to join RSS feeds and finding that none of them is well adapted to this task. By building our technique on the Document Index Graph model, we were able to take advantage of the efficiency and ease of implementation of that model. This technique can be improved further, and can also be applied to inputs other than RSS feeds.
  • 18. References
       [1] Fekade Getahun, Joe Tekli, Chbeir Richard, Marco Viviani, Kokou Yetongnon. Relating RSS News/Items. Laboratoire Electronique, Informatique et Image (LE2I), UMR-CNRS, Université de Bourgogne, Sciences et Techniques. http://vision.u-bourgogne.fr/Le2i/user data/publications/2356 Chapter-LNCS-ICWE%20final.pdf
       [2] Khaled M. Hammouda, Mohamed S. Kamel. Phrase-based Document Similarity Based on an Index Graph Model. Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1. E-mail: hammouda,mkamel@pami.uwaterloo.ca. http://pami.uwaterloo.ca/pub/hammouda/hammouda icdm02.pdf