Transcript of "This is a title"

  1. High-Level System Overview
     Crawler → Preprocessor → DB Processor → Store in DB
  2. Infrastructure Requirements
     • Component independence
     • Messaging
     • Scalability
     • Minimize code written; use as much open-source code as possible
  3. Infrastructure Choices
     • JBoss/Java vs. .NET
     • Spring Framework vs. plain-old Java
     • Oracle vs. MySQL
     • Hibernate ORM vs. plain-old SQL
  4. Logging Structure
     • Entering and exiting methods (of reasonable importance)
     • Catching Java checked exceptions
     • Uniform structure
     • org.fydproject.component.mainMethod.subMethod1.subMethod2…
     • Ex. org.projectnlp.preprocessor.stemmer
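The logging conventions on this slide can be sketched in Java as follows. This is only an illustration: the slide does not name a logging library, so `java.util.logging` is an assumption, and the class `StemmerLoggingSketch` and its methods `stem` and `stripSuffix` are hypothetical. Only the logger name `org.projectnlp.preprocessor.stemmer` comes from the slide.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class StemmerLoggingSketch {
    // Hierarchical logger name taken from the slide's example.
    static final Logger LOG = Logger.getLogger("org.projectnlp.preprocessor.stemmer");

    // Hypothetical method of "reasonable importance": logs entry and exit.
    static String stem(String word) {
        // entering/exiting log at FINER, so they are hidden under the default config.
        LOG.entering("StemmerLoggingSketch", "stem", word);
        String result;
        try {
            result = stripSuffix(word);
        } catch (IOException e) {
            // Checked exceptions are logged where they are caught.
            LOG.log(Level.SEVERE, "stem failed for word: " + word, e);
            result = word; // fall back to the unstemmed input
        }
        LOG.exiting("StemmerLoggingSketch", "stem", result);
        return result;
    }

    // Toy helper (not a real stemmer) that throws a checked exception on empty input.
    static String stripSuffix(String word) throws IOException {
        if (word == null || word.isEmpty()) {
            throw new IOException("empty input");
        }
        boolean hasSuffix = word.endsWith("ing") && word.length() > 4;
        return hasSuffix ? word.substring(0, word.length() - 3) : word;
    }

    public static void main(String[] args) {
        System.out.println(stem("crawling"));
        System.out.println(stem(""));
    }
}
```

Naming loggers by package and component lets log levels be tuned per component, for example by raising only the preprocessor loggers to a finer level.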
  5. Pseudo-Code for Crawler Manager
     • Begin infinite loop
       – For each messageBoard in List
         • crawlAll
       – End For loop
     • End infinite loop
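The pseudo-code above could look roughly like this in Java. The `MessageBoard` interface and the `maxPasses` parameter are assumptions made for illustration: the slide's loop is infinite, and the bound is added here only so the sketch terminates when run.

```java
import java.util.ArrayList;
import java.util.List;

class CrawlerManagerSketch {
    /** Hypothetical stand-in for one message board's crawler. */
    interface MessageBoard {
        void crawlAll();
    }

    /**
     * Visits every board once per pass and returns the number of crawlAll calls.
     * maxPasses is an assumption; the slide loops forever.
     */
    static int run(List<MessageBoard> boards, int maxPasses) {
        int crawls = 0;
        for (int pass = 0; pass < maxPasses; pass++) { // "infinite loop" on the slide
            for (MessageBoard board : boards) {        // for each messageBoard in List
                board.crawlAll();
                crawls++;
            }
        }
        return crawls;
    }

    public static void main(String[] args) {
        List<String> visited = new ArrayList<>();
        List<MessageBoard> boards = new ArrayList<>();
        boards.add(() -> visited.add("boardA"));
        boards.add(() -> visited.add("boardB"));
        int crawls = run(boards, 2);
        System.out.println(crawls + " crawls: " + visited);
    }
}
```

A production loop would presumably replace the pass bound with `while (true)` or a shutdown flag.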
  6. High-Level Crawler Strategy
     • Failed messages are persisted
     • Message markers (right-hand side labels) are persisted
     • Algorithm prevents crawling duplicate messages
     Diagram labels:
     • Markers: Old Message Threshold, Oldest Message Crawled, Last Successful crawl, Last Successful message extracted, Newest Message
     • Regions: Newly Crawled Messages, Old successful Crawled Messages, Old Messages Yet to be Crawled, Messages from Crash
     • Axis: Highest Message Id, Lowest Message Id
  7. Crawler Strategy Algorithm
     • Crawl all previous failed messages
     • Crawl ‘crashed messages’
     • Crawl new messages
     • Crawl new failed messages
     • Crawl old messages
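The five steps can be sketched as below. This is one possible reading of the strategy, not the deck's implementation, and it rests on several assumptions: message IDs grow over time; the markers are ordered old threshold ≤ oldest crawled ≤ last successful crawl ≤ last extracted ≤ newest; "crashed messages" lie between the last successful crawl and the last extracted message; and "new failed messages" are those that failed during steps 2 and 3. The class, field, and parameter names and the `fetch` callback are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongPredicate;

class CrawlStrategySketch {
    final Set<Long> failed = new LinkedHashSet<>(); // persisted failed message IDs
    final Set<Long> done = new HashSet<>();         // IDs crawled successfully in this run
    final List<String> trace = new ArrayList<>();   // order of attempts, for inspection
    final LongPredicate fetch;                      // hypothetical: true if the crawl succeeded

    CrawlStrategySketch(LongPredicate fetch) {
        this.fetch = fetch;
    }

    private void crawl(String phase, long id) {
        if (done.contains(id)) {
            return; // duplicate guard: never re-crawl a message that already succeeded
        }
        trace.add(phase + ":" + id);
        if (fetch.test(id)) {
            done.add(id);
            failed.remove(id);
        } else {
            failed.add(id);
        }
    }

    void run(long oldThreshold, long oldestCrawled, long lastSuccessfulCrawl,
             long lastExtracted, long newest) {
        // 1. Crawl all previously failed messages.
        for (long id : new ArrayList<>(failed)) crawl("failed", id);
        Set<Long> stillFailed = new HashSet<>(failed);
        // 2. Crawl 'crashed messages' (assumed range).
        for (long id = lastSuccessfulCrawl + 1; id <= lastExtracted; id++) crawl("crashed", id);
        // 3. Crawl new messages.
        for (long id = lastExtracted + 1; id <= newest; id++) crawl("new", id);
        // 4. Crawl new failed messages: those that failed in steps 2 and 3.
        List<Long> newFailed = new ArrayList<>(failed);
        newFailed.removeAll(stillFailed);
        for (long id : newFailed) crawl("newFailed", id);
        // 5. Crawl old messages, newest first, down to the old-message threshold.
        for (long id = oldestCrawled - 1; id >= oldThreshold; id--) crawl("old", id);
    }

    public static void main(String[] args) {
        Set<Long> flaky = new HashSet<>();
        // Message 12 fails on its first attempt only; everything else succeeds.
        CrawlStrategySketch s = new CrawlStrategySketch(id -> !(id == 12 && flaky.add(id)));
        s.failed.add(8L);
        s.run(5, 7, 10, 11, 13);
        System.out.println(s.trace);
    }
}
```

In this sketch the `failed` set stands in for the persisted failed-message store; a real crawler would also persist the markers after each step so that a crash can be resumed.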
  8. Preprocessor Block Diagram
     In from Crawler
     • Lowercase
     • HTML Parser Cleanup (appears twice in the diagram)
     • Contractions Dictionary
     • Slang Dictionary
     • Punctuation Cleaner
     • Stop Words Dictionary
     • Negation Engine
     • Stemmer
     Out to DB Processor
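The block diagram reads as a chain of text transformations, which might be sketched as a list of stages applied in order. Everything below is illustrative: the dictionaries are tiny made-up samples, the regular expressions are simplistic, and the Negation Engine and Stemmer stages are omitted because the slide does not describe their behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class PreprocessorSketch {
    // Sample dictionaries; real ones would presumably be loaded from files or a DB.
    static final Map<String, String> CONTRACTIONS = new HashMap<>();
    static final Map<String, String> SLANG = new HashMap<>();
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "is", "it", "that"));
    static final List<UnaryOperator<String>> STAGES = new ArrayList<>();

    static {
        CONTRACTIONS.put("don't", "do not");
        CONTRACTIONS.put("it's", "it is");
        SLANG.put("gr8", "great");
        SLANG.put("u", "you");

        STAGES.add(t -> t.toLowerCase(Locale.ROOT));                // Lowercase
        STAGES.add(t -> t.replaceAll("<[^>]*>", " ").trim());       // HTML cleanup (naive tag strip)
        STAGES.add(t -> mapWords(t, CONTRACTIONS));                 // Contractions dictionary
        STAGES.add(t -> mapWords(t, SLANG));                        // Slang dictionary
        STAGES.add(t -> t.replaceAll("[^a-z0-9\\s]", " "));         // Punctuation cleaner
        STAGES.add(t -> Arrays.stream(t.trim().split("\\s+"))       // Stop words dictionary
                .filter(w -> !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" ")));
    }

    // Replaces whole whitespace-separated tokens found in the dictionary.
    // Tokens with punctuation attached (e.g. "gr8,") are not matched in this toy version.
    static String mapWords(String text, Map<String, String> dict) {
        return Arrays.stream(text.split("\\s+"))
                .map(w -> dict.getOrDefault(w, w))
                .collect(Collectors.joining(" "));
    }

    static String process(String raw) {
        String text = raw;
        for (UnaryOperator<String> stage : STAGES) {
            text = stage.apply(text);
        }
        return text;
    }

    public static void main(String[] args) {
        // prints "great you do not stop"
        System.out.println(process("<b>It's GR8</b> that u don't stop!"));
    }
}
```

The stage order matters: contractions are expanded before punctuation is stripped, since stripping apostrophes first would break the dictionary lookups. The word "not" is deliberately kept out of the sample stop-word list so that a later negation stage could still see it.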