MongoDB and Web Scraping with the Gyes Platform


  1. MongoDB and Web Scraping with the Gyes Platform
     Jesus Diaz
     jesus.diaz@infinithread.com

  2. What is Gyes?
     ● Aggregation Platform for the Web
       ○ Finance (Mint.com, Manilla.com)
       ○ Travel (kayak.com)
       ○ Shopping (nextag.com)

  3. What is Gyes? (cont)
     ● Domain-specific Scrapers
     ● Gyes scrapers = JavaScript + jQuery
     ● Full Web context access

  4. Goals
     ● Decouple Data Extraction from Data Consumption
     ● Provide a Flexible Data Model
     ● Provide a Semi-structured Model to Access Scraped Data

  5. Overall Architecture
     ● REST API (latest, run, collect)
     ● UI (www.gyeslab.com)
       ○ Develop crawlers
       ○ Check scraped data
     ● Schedule
     ● Data Repository
     Example crawler:
       bot.open('http://somesite.com', function(status) {
         ...
         return {a: 3, b: 4};
       });

  6. From Goals to Challenges
     ● Flexible Data Model
     ● Flexible, semi-structured Means to Access Data

  7. Take 1: Key-Value pairs (Tuple spaces)
     Crawler returns:
       result.source = "Newspaper A"
       result.date = "1/20/2013"
       result.news[0].id = 8
       result.news[0].text = "Headline #1"
       result.news[1].id = 9
       result.news[1].text = "Headline #2"
       ....
     Data gets stored as:
       key1    key2     key3  ...  value
       result  source              Newspaper A
       result  date                1/20/2013
       result  news[0]  id         8
       result  news[0]  text       Headline #1
       result  news[1]  id         9
       result  news[1]  text       Headline #2

  8. Key-Value pairs (cont)
     Advantages
     ● Flexible Data Model
     Disadvantages
     ● Cumbersome to "rebuild" data
     ● Hard to handle versioning
     ● Lack of great commercial implementations (DIY?)

  9. Take 2: Enter JSON
     We are scraping the web using JavaScript + jQuery. Why don't we use JSON?
     Thank you, Captain Obvious!
     Crawler returns:
       {
         "source": "Newspaper A",
         "date": "1/20/2013",
         "news": [
           { "id": 8, "text": "Headline #1" },
           { "id": 9, "text": "Headline #2" }
           ....
         ]
       }
     Data gets stored as: Plain Text.

  10. Enter JSON (cont)
     Advantages
     ● Flexible Data Model
     Disadvantages
     ● Plain text
     What about that flexible, semi-structured mechanism to access the data we wanted to provide?

  11. MongoDB to the rescue
     ● No tricks, store data as-is
     ● Flexible (the structure of scraped data can change, MongoDB doesn't care)
     ● Semi-structured model allows users to convert data to strongly typed objects
     ● Powerful query mechanisms
     ● Scalable (oh yeah)
     ● Again, store data as-is, consume as-is.

  12. Overall Architecture (2)
     ● Clients, exchanging JSON with the REST API
     ● REST API (latest, run, collect), exchanging BSON/JSON with the Data Repository
     ● UI (www.gyeslab.com)
       ○ Develop crawlers
       ○ Check scraped data
     ● Schedule
     ● Data Repository (MongoDB)
     Example crawler:
       bot.open('http://somesite.com', function(status) {
         ...
         return {a: 3, b: 4};
       });

  13. What's Next with Gyes and MongoDB
     ● Scale Data Repository + API
       ○ Sharding
       ○ Get data closer to users
     ● Add support for querying data by projection
       ○ Slices of data
       ○ Arbitrary attribute subset selection

  14. The End
     Questions?
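
The trade-off described in slides 7 and 8 can be sketched in plain JavaScript. This is only an illustration, not Gyes code: `flatten` and `rebuild` are hypothetical helper names, and the `{ keys, value }` row shape is an assumption standing in for the key1/key2/key3/value columns on the slide. It shows why the key-value approach is flexible to write but cumbersome to read back, since every consumer has to reassemble the original object from its rows.

```javascript
// Hypothetical helpers for illustration only; not part of Gyes.

// Flatten a nested scrape result into rows of { keys, value },
// mirroring the key1/key2/key3/value layout (array items become "name[i]").
// Empty objects and empty arrays produce no rows and are therefore lost.
function flatten(value, path, out) {
  if (Array.isArray(value)) {
    const last = path[path.length - 1];
    value.forEach((item, i) => {
      flatten(item, path.slice(0, -1).concat(last + "[" + i + "]"), out);
    });
  } else if (value !== null && typeof value === "object") {
    Object.keys(value).forEach((key) => {
      flatten(value[key], path.concat(key), out);
    });
  } else {
    out.push({ keys: path, value: value });
  }
  return out;
}

// Rebuild the nested structure from the rows. This is the "cumbersome" part:
// each key such as "news[0]" must be parsed back into a name plus indices.
// Assumes property names are non-empty and contain no square brackets.
function rebuild(rows) {
  const root = {};
  for (const row of rows) {
    const steps = [];
    for (const key of row.keys) {
      const parts = key.split(/[\[\]]/).filter((p) => p !== "");
      steps.push(parts[0]);
      parts.slice(1).forEach((index) => steps.push(Number(index)));
    }
    let node = root;
    steps.forEach((step, i) => {
      if (i === steps.length - 1) {
        node[step] = row.value;
        return;
      }
      if (node[step] === undefined) {
        // Create an array when the next step is a numeric index.
        node[step] = typeof steps[i + 1] === "number" ? [] : {};
      }
      node = node[step];
    });
  }
  return root;
}

// Sample data taken from the slides.
const scraped = {
  source: "Newspaper A",
  date: "1/20/2013",
  news: [
    { id: 8, text: "Headline #1" },
    { id: 9, text: "Headline #2" },
  ],
};

const rows = flatten(scraped, ["result"], []);
const restored = rebuild(rows).result;
console.log(rows.length, JSON.stringify(restored));
```

Storing the object as-is in a document store, as slide 11 proposes, removes the need for both helpers: what the crawler returns is what the consumer reads.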
