This is a title


  1. High Level System Overview
     Crawler -> Preprocessor -> DB Processor -> Store in DB
  2. Infrastructure Requirements
     • Component independence
     • Messaging
     • Scalability
     • Minimize code written; use as much open-source code as possible
  3. Infrastructure Choices
     • JBoss/Java vs .NET
     • Spring Framework vs plain-old Java
     • Oracle vs MySQL
     • Hibernate ORM vs plain-old SQL
  4. Logging Structure
     • Entering and exiting methods (of reasonable importance)
     • Catching Java checked exceptions
     • Uniform structure
       – org.fydproject.component.mainMethod.subMethod1.subMethod2…
       – Ex. org.projectnlp.preprocessor.stemmer
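The logging conventions on this slide could look roughly like the sketch below. It assumes `java.util.logging` (the deck does not say which logging library was used), and the class `StemmerLoggingSketch` and its toy `stripSuffix` helper are hypothetical. Only the logger name is taken from the slide's example.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class illustrating the slide's three logging rules.
class StemmerLoggingSketch {
    // Uniform, hierarchical logger name, copied from the slide's example.
    static final String LOGGER_NAME = "org.projectnlp.preprocessor.stemmer";
    private static final Logger LOG = Logger.getLogger(LOGGER_NAME);

    static String stem(String word) {
        // Rule 1: log method entry (java.util.logging emits this at FINER).
        LOG.entering("StemmerLoggingSketch", "stem", word);
        String result = word;
        try {
            result = stripSuffix(word);
        } catch (IOException e) {
            // Rule 2: log every caught checked exception with its stack trace.
            LOG.log(Level.SEVERE, "stem failed for input: '" + word + "'", e);
        }
        // Rule 1: log method exit together with the returned value.
        LOG.exiting("StemmerLoggingSketch", "stem", result);
        return result;
    }

    // Toy stand-in for real stemming; throws a checked exception on empty input.
    static String stripSuffix(String word) throws IOException {
        if (word.isEmpty()) {
            throw new IOException("empty input");
        }
        return word.endsWith("ing") ? word.substring(0, word.length() - 3) : word;
    }
}
```

Because logger names are dot-separated, a level set on `org.projectnlp.preprocessor` would apply to every sub-component under it, which is one plausible reason for the uniform naming scheme.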
  5. Pseudo-Code for Crawler Manager
     • Begin infinite loop
       – For each messageBoard in List
         • crawlAll
       – End For loop
     • End infinite loop
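A minimal Java rendering of the pseudo-code above might look as follows. The `MessageBoard` interface and `CrawlerManager` class are assumed names, and the interruption check is an addition so the loop can be stopped cleanly. The slide itself shows an unconditional infinite loop.

```java
import java.util.List;

// Hypothetical abstraction over one message board to be crawled.
interface MessageBoard {
    void crawlAll();
}

class CrawlerManager {
    private final List<MessageBoard> boards;

    CrawlerManager(List<MessageBoard> boards) {
        this.boards = boards;
    }

    // One pass of the inner "for each messageBoard" loop.
    void crawlOnce() {
        for (MessageBoard board : boards) {
            board.crawlAll();
        }
    }

    // The outer loop; runs until the thread is interrupted (an assumption,
    // added so that the manager can be shut down).
    void runForever() {
        while (!Thread.currentThread().isInterrupted()) {
            crawlOnce();
        }
    }
}
```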
  6. High-Level Crawler Strategy
     • Failed messages are persisted
     • Message markers (right-hand side labels) are persisted
     • Algorithm prevents crawling duplicate messages
     [Diagram: messages ordered from Highest MessageId to Lowest MessageId. Labels: Old Message Threshold, Oldest Message Crawled, Last Successful crawl, Last Successful message extracted, Newest Message. Regions: Newly Crawled Messages, Old successful Crawled Messages, Old Messages Yet to be Crawled, Messages from Crash.]
  7. Crawler Strategy Algorithm
     • Crawl all previous failed messages
     • Crawl ‘crashed messages’
     • Crawl new messages
     • Crawl new failed messages
     • Crawl old messages
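The five phases above can be sketched as a single crawl pass. Everything beyond the phase order is an assumption: the class `CrawlPassSketch`, the `fetch` callback (returns true on success), the parameter names loosely modelled on the diagram labels, and the choice to treat message ids as contiguous integers crawled in ascending order within each phase. Duplicate crawling is avoided here by skipping any id already fetched in this pass.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.LongPredicate;

class CrawlPassSketch {
    // Ids fetched successfully during this pass, in crawl order.
    final Set<Long> crawled = new LinkedHashSet<>();
    // Hypothetical fetcher: returns true if the message was crawled successfully.
    private final LongPredicate fetch;

    CrawlPassSketch(LongPredicate fetch) {
        this.fetch = fetch;
    }

    // Crawls the given ids, skipping ids already crawled; returns the failures.
    private List<Long> crawl(List<Long> ids) {
        List<Long> failures = new ArrayList<>();
        for (long id : ids) {
            if (crawled.contains(id)) {
                continue; // never crawl the same message twice
            }
            if (fetch.test(id)) {
                crawled.add(id);
            } else {
                failures.add(id);
            }
        }
        return failures;
    }

    // Inclusive ascending id range; empty when from > to.
    private static List<Long> range(long from, long to) {
        List<Long> ids = new ArrayList<>();
        for (long id = from; id <= to; id++) {
            ids.add(id);
        }
        return ids;
    }

    // Runs the five phases and returns the ids that should be persisted as failed.
    List<Long> run(List<Long> previouslyFailed, List<Long> crashed,
                   long lastSuccessfulId, long newestId,
                   long oldestCrawledId, long oldThresholdId) {
        Set<Long> stillFailed = new LinkedHashSet<>();
        stillFailed.addAll(crawl(previouslyFailed));                           // 1. previous failures
        stillFailed.addAll(crawl(crashed));                                    // 2. crashed messages
        List<Long> newFailures = crawl(range(lastSuccessfulId + 1, newestId)); // 3. new messages
        stillFailed.addAll(crawl(newFailures));                                // 4. retry new failures
        stillFailed.addAll(crawl(range(oldThresholdId, oldestCrawledId - 1))); // 5. old messages
        return new ArrayList<>(stillFailed);
    }
}
```

In a real crawler the returned failure list and the updated markers would be written to the database after each pass, in line with the previous slide's note that both are persisted.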
  8. Preprocessor Block Diagram
     [Diagram: In from Crawler, then the stages below, then Out to DB Processor]
     • Lowercase
     • HTML Parser Cleanup
     • HTML Parser Cleanup
     • Contractions Dictionary
     • Slang Dictionary
     • Punctuation Cleaner
     • Stop Words Dictionary
     • Negation Engine
     • Stemmer
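One way to picture the preprocessor is as a chain of string-to-string stages. The sketch below is illustrative only: the class name, the tiny dictionaries, and the regex-based HTML stripping are assumptions, the stage order is read off the label order in the diagram, and the slang, negation, and stemming stages are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

class PreprocessorSketch {
    // Tiny illustrative dictionaries; real ones would be far larger.
    static final Map<String, String> CONTRACTIONS = new LinkedHashMap<>();
    static {
        CONTRACTIONS.put("don't", "do not");
        CONTRACTIONS.put("it's", "it is");
    }
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "is", "it", "a"));

    // Stages in the (assumed) order of the block diagram.
    static List<UnaryOperator<String>> stages() {
        List<UnaryOperator<String>> stages = new ArrayList<>();
        stages.add(s -> s.toLowerCase(Locale.ROOT));          // Lowercase
        stages.add(s -> s.replaceAll("<[^>]*>", " "));        // crude HTML tag removal
        stages.add(PreprocessorSketch::expandContractions);   // Contractions Dictionary
        stages.add(s -> s.replaceAll("[^a-z0-9\\s]", " "));   // Punctuation Cleaner
        stages.add(PreprocessorSketch::removeStopWords);      // Stop Words Dictionary
        return stages;
    }

    static String expandContractions(String s) {
        for (Map.Entry<String, String> e : CONTRACTIONS.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        return s;
    }

    static String removeStopWords(String s) {
        return Arrays.stream(s.trim().split("\\s+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }

    // Runs the raw crawler output through every stage in order.
    static String process(String raw) {
        String text = raw;
        for (UnaryOperator<String> stage : stages()) {
            text = stage.apply(text);
        }
        return text.trim();
    }
}
```

For example, `PreprocessorSketch.process("<p>It's the BEST, don't sell!</p>")` yields `best do not sell`. Contractions are expanded before punctuation is removed so that the apostrophes are still present when the dictionary lookup runs.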
