MongoDB and Web Scraping with the Gyes platform. MongoDB Atlanta 2013

Gyes is an aggregation platform for the Web. Gyes allows you to develop, schedule and troubleshoot data extraction programs (crawlers) that translate HTML content into structured data you can use later on. In selecting the data model for the platform, several challenges arose due to the lack of structure of the scraped data and the need to provide meaningful and efficient access to it. MongoDB was our third rewrite of the Gyes back-end, and it has by far exceeded expectations. In this talk, I would like to discuss some of the challenges we faced and how MongoDB addressed them. Details about implementation challenges are also shared.

  • Hi, my name is Jesus Diaz, and I'm a Software Architect at Infinithread Corporation. We are a software solutions company based in Fort Lauderdale, FL. We focus mainly on the enterprise, big data, and algorithms. At the same time, we develop a few ideas of our own, and Gyes is one of them. Before we start, I would like to thank the folks from 10gen for giving me this time to showcase Gyes and our experiences with this fantastic piece of software that is MongoDB.
  • -What is Gyes? Simply put, Gyes is an aggregation platform for the web. -Just to give you some background: if we think about the web as a huge data source, there is a lot of unstructured information that is not easy to consume or query. By unstructured data I mean data that basically sits in a web page as a combination of text and HTML markup, so there is no web service or API in place to consume it programmatically. -The reasons for the absence of a service layer that would provide better access to the data are manifold. If we rule out intellectual property, many times the people or company that produced the data in the first place don't think there is anything else worth doing with it, or they don't have the time or money to work on an API service. Or the content is mostly text and it doesn't really make sense to create a structure. -In any case, there is huge value in thematic aggregation, where you fetch and combine data from multiple websites to produce a comprehensive view of a subject, or to compare attributes of the same items across different vendors. Great examples of this are, for the finance industry, Mint.com and Manilla.com; for the travel industry, Kayak.com; and for merchandise and general shopping, Nextag.com.
  • -Gyes is our idea of what a web scraping platform should be. We see it as a tool to facilitate the scraping portion of systems that use aggregated data as part of their value proposition. In other words, we see people using Gyes as their data backend provider. You have an idea, but don't have the time or resources to implement the scraping logic? Well, you can use Gyes as Software as a Service (SaaS). -Gyes allows you to define domain-specific scrapers, that is, scrapers that focus on one or a few pages. -Scrapers are coded in JavaScript, with the added bonus of injecting jQuery on the scraped page, giving you all the niceties of the library. The extracted data is represented as a JavaScript object that is serialized for storage and later consumption as JSON. -Now, something we think sets Gyes apart from other scraping tools and platforms out there is that, while other approaches focus on the data collection part, as in collection for later use, analysis and mining, Gyes focuses on providing programmatic access to the data. You can do all that analysis and mining later on, of course, but in designing Gyes we paid special attention to facilitating access to the data from other programs (mobile apps, services, web sites) via a clean and pragmatic API.
  • -Based on our experience developing scrapers, we had a few goals in mind when we set out to develop Gyes. The first one was to decouple the data extraction from the data consumption. This buys you some time to correct the scraper when the page structure changes. -The second goal was to provide a flexible data model to accommodate the scraped data. Since we were developing a generic platform, we didn't want to make any assumptions about the nature or organization of the data. -Thirdly, even though we knew the structure of the data was unpredictable, we did want to provide a semi-structured way to access it. That is, within your own model, for your specific domain, you should be able to treat the data as somewhat structured.
  • The biggest challenge we faced was to find a storage platform that would accommodate our data requirements. We ruled out relational databases early in the game for several reasons: lack of native support for JSON, and a tabular structure that wouldn't provide much flexibility in the data schema. The other alternative we considered was key/value stores. Key/value stores provide a great deal of flexibility: basically, you associate a data blob with a key. But the blob itself is opaque to the store; in other words, in a key/value store you can only query by the key component of the pairs. This was actually the biggest argument in favor of continuing to look for a more robust solution.
  • When we heard about MongoDB, we thought that someone had made a database to fit our needs. With MongoDB there are no tricks: we can store the data as-is, in JSON. No mappings, no translations, no loss of structure or error-prone format conversions. MongoDB provides a great deal of flexibility when it comes to storing data. I mean, you can store any object in a given collection, regardless of its structure. This works great because the very order in a collection defines the result history (versioning), and it doesn't really matter if the scraper started pulling more or less data at a given point in time. There are no schemas to change, no backward-compatibility changes. It is all up to the user and his application logic. Speaking of being flexible: now we can offer our clients the power of filtering, sorting and slicing their data the way they want. If they want to get only certain results, or certain components of a result, it is almost straightforward in terms of MongoDB's query-by-projection. And now, all of a sudden, our system is very easy to scale. Our data repository can grow to terabytes, and MongoDB will handle all that complexity for us. Just to reiterate: store data as-is, serve it ready to consume as-is.
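    A minimal mongo shell sketch of that "store as-is, query by projection" idea; the collection name and document fields here are illustrative, not the actual Gyes schema:

      // Store a scraped result exactly as the crawler produced it (illustrative document).
      db.results.insert({
          CrawlerName: "gannet",
          Status: "success",
          Data: { Kingdom: "Animalia", Class: "Aves" }
      });

      // Query by content and return only the pieces the client asked for (projection).
      db.results.find(
          { Status: "success" },               // filter
          { CrawlerName: 1, Data: 1, _id: 0 }  // projection
      );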
  • Putting everything together: this is the source code of a very simple crawler. As you can see, we are opening a page from Wikipedia and building a JavaScript object with the data we want to collect from the page, in this case the scientific classification of gannets.
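    The slide's code is not reproduced in this transcript; the following is only a hypothetical sketch of what such a scraper could look like, assuming jQuery has been injected into the page and that the platform serializes whatever object the scraper function returns (the selectors and function name are made up, not the actual Gyes API):

      // Hypothetical Gyes-style scraper for the gannet Wikipedia page.
      function scrapeGannetClassification() {
          var classification = {};

          // Wikipedia taxoboxes render rank/value pairs as table rows.
          $('table.infobox.biota tr').each(function () {
              var cells = $(this).find('td');
              if (cells.length === 2) {
                  var rank = $(cells[0]).text().replace(':', '').trim();
                  classification[rank] = $(cells[1]).text().trim();
              }
          });

          // The returned object is what gets serialized to JSON and stored.
          return { name: 'Gannet', scientificClassification: classification };
      }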
  • This is the original Wikipedia page; highlighted, you will see the data that's being scraped by the crawler.
  • Now, in terms of Gyes, that data is stored as JSON in MongoDB. What you see here is the Data tab on the Gyes user interface. There is an entry for each crawler execution. The returned JavaScript object is serialized and stored in the Data property of the JSON entry in MongoDB.
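    One such entry might look roughly like this; the field names are the ones that appear in the find example later in the talk (CrawlerName, Status, Data, RequestId), while the values are made up:

      {
          "_id": ObjectId("..."),
          "RequestId": 42,
          "CrawlerName": "gannet",
          "Status": "success",
          "Data": {
              "name": "Gannet",
              "scientificClassification": { "Kingdom": "Animalia", "Class": "Aves" }
          }
      }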
  • So, now, based on what we just saw, how do we use MongoDB? We keep one database per Gyes user. This ensures data segregation and avoids name conflicts when it comes to naming your crawlers. There are two collections per crawler. One stores 'permanent' results (results generated by a crawler scheduler or promoted specifically by the user); these results are automatically available to the API data operations. The other collection stores temporary results (data produced while developing and tuning the scrapers).
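    In mongo shell terms, that layout might look something like the sketch below; the database name and the "_temp" collection suffix are invented for illustration, only the one-database-per-user / two-collections-per-crawler split comes from the talk:

      // One database per Gyes user (database name is hypothetical).
      use ubirates

      // Two collections per crawler; the "_temp" suffix is an invented convention.
      show collections
      //   gannet            <- permanent results, available to the API
      //   gannet_temp       <- temporary results from developing/tuning the scraper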
  • -As I mentioned before, one of the design goals of Gyes is to ease the consumption of the aggregated data. We don't want to create just a SaaS scraping platform; we would also like to become a BaaS (backend as a service) solution for applications (web and mobile) where most, if not all, of the logic is based on data aggregation. With this in mind, we provide an API to open up the platform to programmatic access. The data functions of the API rely heavily on MongoDB features. -The API is RESTful, and the general form of a call has the following parts: the base URL (in this case we are using our public implementation of the platform, Gyeslab); the API version; the API function invoked; the user identifier (user login); and the crawler or crawler group. Just to explain that last part a little more: we support the concept of crawler groups, sets of crawlers you want to treat as one on certain occasions. Say you have 15 crawlers that scrape data from news sites; it makes sense to perform operations over all of them at the same time, such as getting the latest data scraped, running them all, etc. Finally, there is the apiKey, a unique key per user that confirms his identity against the API. -Our data API functions leverage the underlying query capabilities of MongoDB a great deal, and I would like to talk more about this with a real-life use case: Ubirates.
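    Putting those pieces together, and using the Ubirates call that appears later in the deck as the concrete example, the URL is assembled roughly like this (plain JavaScript, values taken from the slide):

      var baseUrl  = 'http://api.gyeslab.com';  // public Gyeslab implementation
      var version  = 'v1';                      // API version
      var fn       = 'find';                    // API function invoked
      var user     = 'ubirates';                // user identifier (login)
      var crawlers = 'all';                     // crawler or crawler group
      var apiKey   = 'xxyy';                    // per-user key, confirms identity

      var url = [baseUrl, version, fn, user, crawlers].join('/') + '?apiKey=' + apiKey;
      // -> http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy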
  • -Ubirates is a financial aggregation website that uses Gyes to scrape data from several Japanese institutions. -Currently, it supports data from 10 banks, and counting. -It uses Gyes not only as an aggregator platform, but as a BaaS too. This means that the data is queried every time a user loads the page.
  • -So, this is a screenshot of the Ubirates site. For those of us who don't know Japanese, it might seem a little cryptic at first. For me, I guess that after spending so much time with our client developing it, it just happens to make sense. Maybe just like the code in The Matrix. -Basically, you have the banks in the first column, with the logos, then subcategories of products (savings, certificates of deposit, etc.), and then the actual banking products, with the interest rates, based on the time you are willing to leave your money untouched. Now, all this information is collected using a single API call from Gyes, which is the find API call.
  • -The find API call queries the results of a crawler or group of crawlers. Among the query parameters it accepts, you can specify a number of results to be returned (take). By default, it will return the last 10 results produced by the crawler. Now, the real power of this call is in the parameter passed in its body. As you can see, we are passing a JSON object which has two components, a q property and a p property; q stands for query, and p for projection. -So basically, this find call is asking for the most recent result for the crawlers of user 'ubirates' that belong to the group 'all', such that Status is success. Furthermore, I'm requesting a projection of that data that only contains two fields: CrawlerName and Data. -For those of you already familiar with MongoDB, you can see where this is going.
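    For illustration only, a client could issue that exact call along these lines; the URL and body are the ones shown on the slide, and jQuery's $.ajax is just one possible way to send the POST:

      $.ajax({
          type: 'POST',
          url: 'http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1',
          contentType: 'application/json',
          data: JSON.stringify({
              q: { Status: 'success' },               // query: only successful runs
              p: { CrawlerName: 1, Data: 1, _id: 0 }  // projection: crawler name and data
          }),
          success: function (results) {
              console.log(results);  // latest result per crawler in the 'all' group
          }
      });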
  • -Now, on the API server, this is the core of the implementation of the find method. We use the MongoDB .NET driver, so my apologies if some of you are not familiar with C# and lambda expressions. -In this code excerpt, 'crawlers' is a collection that contains all the crawlers referenced by the API call, and 'database' is the MongoDB database associated with the user executing the query. -To make it clearer, I'm going to dim the code that does not relate directly to MongoDB. -We use the q parameter to perform a find over each collection included in the query (remember, there is one MongoDB collection per crawler, and one MongoDB database per user). When we find all the elements that match the query, we filter their content by the projection selected by the user, using the p parameter. After doing some skipping and taking, we are ready to return the data to the user. -As you can see, thanks to the query capabilities of MongoDB, implementing such complex functionality is, for us, more about translating, validating and passing the heavy lifting on to MongoDB.
  • Real quick, kind of wrapping things up: what's next for us with MongoDB? Well, we are working on scaling the data repository along with the API. One of the challenges we have right now is the increasing interest of mobile application developers in using Gyes to partially or completely replace their backend, so we need to provide fast access to the data. Another interesting goal is to do all we can to optimize the API queries. Here there are two things we have been thinking about. First, give our users the ability to manage the indexes of their data repository. As in the Ubirates use case, it would be nice to create an index on the 'Status' field to speed up the queries, but more than that: this is about allowing the user to optimize searches over attributes that only exist in his data. Second, it makes sense to cache some of these query results. Some data gets updated only once a day, so in that case it makes sense to provide a caching mechanism that optimizes the API performance.
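    For the Ubirates queries above, that user-managed index might boil down to something like the following mongo shell sketch; the collection name is hypothetical, and ensureIndex was the index-creation call in the MongoDB versions current at the time (later renamed createIndex):

      // Index the field the Ubirates find call filters on.
      db.somebank.ensureIndex({ Status: 1 });

      // The query portion of the find call can now use the index
      // instead of scanning every stored result.
      db.somebank.find({ Status: 'success' }, { CrawlerName: 1, Data: 1, _id: 0 });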
  • This concludes the presentation. Now, I'll open the floor for questions...

    1. MongoDB and Web Scraping with the Gyes Platform. Jesus Diaz (jesus.diaz@infinithread.com, @infinithread, @elyisu). MongoDB Atlanta 2013
    2. What is Gyes?
       • Let's think of the web as a huge data source of unstructured data
       • Absence of a web service or API layer to consume most of the data
       • Significant value on thematic aggregation
         • Finance (Mint.com, Manilla.com)
         • Travel (Kayak.com)
         • Shopping (Nextag.com)
    3. What is Gyes? (cont.)
       • Aggregation platform for the web
       • SaaS or hosted
       • Domain-specific scrapers
       • JavaScript + jQuery = JSON
       • Oriented to provide programmatic access to the data
    4. Goals
       • Decouple data extraction from data consumption
       • Provide a flexible data model
       • Provide a semi-structured model to access scraped data
    5. Challenges: Data Storage
       • Relational databases?
         • Lack of support for JSON
         • Tabular structure vs. data schema flexibility
       • Key/value stores
         • Very flexible, but
         • Inability to query the data in more than one dimension
    6. The MongoDB solution
       • No tricks, store data as-is
       • Flexible (structure of scraped data can change, MongoDB doesn't care)
       • Powerful query mechanisms
       • Scalable
       • Again, store data as-is, consume as-is
    7. Using MongoDB in Gyes
    8. Using MongoDB in Gyes
       • One database per user
         • Data segregation
         • Avoid name conflicts
       • Two collections per crawler
         • Permanent results (available to the API)
         • Temporary results (developing and tuning crawlers)
    9. Gyes API
       • Ease data consumption programmatically
       • RESTful
       • API data functions leverage MongoDB query capabilities (latest, find)
    10. Case Study: Ubirates
        • www.ubirates.com. Financial aggregation website (Japan)
        • 10 banks (and counting)
        • Gyes as aggregator platform and BaaS (data served via API upon page load)
    11. Case Study: Ubirates (cont.)
        find API call (POST)
        URL: http://api.gyeslab.com/v1/find/ubirates/all?apiKey=xxyy&take=1
        Body:
        {
          q: { Status: success },
          p: { CrawlerName: 1, Data: 1, _id: 0 }
        }
    12. Use Case: Ubirates (cont.)
        var data = crawlers
            .Select(crawler => database.GetCollection(crawler.ToLower()))
            .Select(collection => collection
                .Find(q)
                .SetSortOrder(SortBy.Descending("RequestId"))
                .SetFields(p)
                .Skip(skip)
                .Take(take)
                .ToJson(jsonWritterSettings));
    13. What's Next
        • Scale data repository + API
          • Sharding
          • Get data closer to users
        • Query optimizations
          • Indexing
          • Caching
    14. The End. Thanks! @infinithread, www.infinithread.com, www.gyeslab.com
