About onlineextrems concept

About Onlineextrems.com



Transcript

  • 1. Overview: Onlineextrems.com
  • 2. Platform overview
      • A single unified platform for all content types (consolidated to reduce development and maintenance costs)
      • A flexible system that can support any new content type
      • High automation (cuts configuration costs)
      • Real-time coverage, or as close to it as possible, for each content type
      • Improved data quality through validation rules
      • Implemented this year
    January 1, 2013 · Onlineextrems.com
  • 3. Supporting all the content types
      • Message boards
      • Blogs and microblogs (Myspace, Blogger, LiveJournal...)
      • Blog comments
      • Social networks: Facebook, LinkedIn, Xing
      • Author profiles
      • Product reviews
      • Usenet: mailing lists, groups
      • Traditional media: CNN, Reuters
  • 4. Consolidating the content systems
      Data mining systems:
      • Message boards
      • Blogs
      • Social networking sites
      • Author profiles system
      • Usenet + Newsgroups system
  • 5. Some of our challenges
      • Dynamic nature of the web
      • Supporting many different types of content
      • Automatically “understanding” millions of sites with different structures
          • Over 8000 message boards
          • Over 95 million blogs
      • Supporting data in different languages
      • Data quality
  • 6. Data mining process
      What are the important aspects of the data mining?
      • Managing the order in which we crawl pages
          • Efficiency (e.g. not entering posts where the number of comments hasn’t changed)
          • Next page (we need to follow it to get more comments)
      • Extracting relevant data out of everything on the page
      • Separating the data into posts (or comments)
      • Transforming specific data into the desired format
          • Handling dates in differing formats
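Two of the points on this slide, skipping posts whose comment count has not changed and handling dates in differing formats, can be sketched in a few lines. This is a minimal illustration in Python, not the platform's own code: the function names, the `DATE_FORMATS` list, and the shape of the `listing` dictionary are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of date layouts; a real crawler would need many more.
DATE_FORMATS = ["%Y-%m-%d %H:%M", "%d/%m/%Y", "%B %d, %Y", "%d %b %Y"]


def normalize_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


def posts_to_fetch(listing, seen_counts):
    """Pick the posts whose comment count changed since the last crawl.

    listing:     {post_url: comment_count} read from an index page (assumed shape).
    seen_counts: {post_url: comment_count} remembered from the previous crawl;
                 updated in place so the next call compares against this crawl.
    """
    changed = [url for url, count in listing.items()
               if seen_counts.get(url) != count]
    seen_counts.update(listing)
    return changed
```

Trying formats in a fixed order means ambiguous dates such as `01/02/2013` are resolved by whichever layout comes first, so the order would likely need to be configured per site.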
  • 7. Data mining technologies
      • Jelly: simple XML workflow engine
      • HttpClient: fetcher
      • Rome: feed parser
      • Velocity: output template engine
      • JMX + JConsole: managing the system
  • 8. Flows
      • Built from steps, which are the building blocks
      • Allow adding support for new content types without writing code
      • The implementation is based on Apache Jelly, which executes XML files as scripts
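The idea of a flow assembled from reusable steps and described in an XML file can be illustrated with a toy interpreter. The sketch below is plain Python and does not use or imitate the Apache Jelly API; the step names (`set`, `upper`, `emit`), the `<flow>` layout, and the shared context dictionary are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Registry of step implementations, keyed by XML tag name (names are hypothetical).
STEPS = {}


def step(name):
    """Decorator that registers a function as the handler for one step tag."""
    def register(func):
        STEPS[name] = func
        return func
    return register


@step("set")
def set_step(ctx, attrs):
    ctx[attrs["var"]] = attrs["value"]


@step("upper")
def upper_step(ctx, attrs):
    ctx[attrs["var"]] = ctx[attrs["var"]].upper()


@step("emit")
def emit_step(ctx, attrs):
    ctx.setdefault("output", []).append(ctx[attrs["var"]])


def run_flow(xml_text):
    """Execute each child element of the root, in order, against a shared context."""
    ctx = {}
    for element in ET.fromstring(xml_text):
        STEPS[element.tag](ctx, element.attrib)
    return ctx


FLOW = """
<flow>
  <set var="title" value="hello"/>
  <upper var="title"/>
  <emit var="title"/>
</flow>
"""
```

With this shape, supporting a new content type means writing a new XML flow from existing steps; code is only needed when a new kind of step is required.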
  • 9. XML parser
      • Parses the data from simple XML files into the common in-memory “items” structure
      • For now only supports elements and not attributes
      • Used for Twitter
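A parser of this kind, reading child elements into a list of items and ignoring attributes, might look like the following sketch. The slides do not define the “items” structure, so a list of dictionaries is assumed here, and the sample document and tag names are invented for illustration.

```python
import xml.etree.ElementTree as ET


def parse_items(xml_text, item_tag):
    """Turn each <item_tag> element into a dict of child tag -> text.

    Attributes are deliberately ignored, mirroring the elements-only
    limitation mentioned on the slide.
    """
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter(item_tag):
        items.append({child.tag: (child.text or "").strip() for child in node})
    return items


# Invented sample input; the id attributes are dropped by parse_items.
SAMPLE = """<statuses>
  <status id="1"><user>alice</user><text>first post</text></status>
  <status id="2"><user>bob</user><text>second post</text></status>
</statuses>"""
```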
  • 10. HTML parser
      • Applies XSLT transformations to HTML pages
      • Extracts the data into the common in-memory “items” structure
      • Uses “Tag Soup” library to read HTML as if it were XML
      • Faster and more robust than the current XML conversion method
      • Used for Author Profiles
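The sketch below shows the general idea of reading malformed HTML leniently and pulling named fields into an item. It swaps in a different technique from the one on the slide: Python's standard `html.parser` stands in for Tag Soup, and a hand-written class match stands in for the XSLT transformation. The field names and the sample markup are assumptions.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect the first text chunk of each element whose class is a wanted field."""

    def __init__(self, fields):
        super().__init__()
        self.fields = set(fields)
        self.current = None  # field currently being read, if any
        self.item = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.fields:
            self.current = css_class

    def handle_data(self, data):
        if self.current and data.strip():
            self.item[self.current] = data.strip()
            self.current = None


def extract(html, fields):
    """Parse possibly malformed HTML and return one item as a dict."""
    parser = FieldExtractor(fields)
    parser.feed(html)
    parser.close()
    return parser.item


# Invented author-profile fragment with unclosed <p> tags.
MESSY = '<div><p class="name">Alice<p class="location">Berlin</div>'
```

The unclosed `<p>` tags in `MESSY` would make a strict XML parser fail, which is the problem a lenient HTML reader addresses.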
  • 11. XML Output
      • Output in XML files
      • Configurable output format using template file
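Template-driven XML output can be sketched as follows. Python's `string.Template` stands in for Velocity here, so the placeholder syntax is not Velocity's; the template text, the field names, and the `<posts>` wrapper are assumptions for illustration.

```python
from string import Template
from xml.sax.saxutils import escape

# Hypothetical per-item template; in practice it would be loaded from a file.
ITEM_TEMPLATE = Template(
    "  <post>\n"
    "    <author>$author</author>\n"
    "    <body>$body</body>\n"
    "  </post>"
)


def render(items):
    """Render a list of item dicts as an XML document string."""
    rendered = []
    for item in items:
        # Escape &, < and > so field values cannot break the markup.
        safe = {key: escape(str(value)) for key, value in item.items()}
        rendered.append(ITEM_TEMPLATE.substitute(safe))
    return "<posts>\n" + "\n".join(rendered) + "\n</posts>"
```

Changing the output format then only requires editing the template, not the rendering code.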
  • 12. Sample Work
  • 13. Sample Work
  • 14. Thank You
      Connect and share with us… www.onlineextrems.com
