Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce


  • Elastic Web Mining 01 November 2009
  • Over the prior 4 years I had a startup called Krugle, which provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the “programmer’s web” to find information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. And that in turn led to Bixo Labs, which is a platform for quickly creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
  • The goal is to generate a large, high-quality web crawl that results in an Amazon public dataset. Having a large set of public data can be very useful, and I'll show one example later. For Amazon, it's some honey to lure people into running jobs in EC2 & EMR. For us, it was a reason to really push the boundaries of what Bixo could handle. Bixo is the open source web mining toolkit project we sponsor, and we use it for a lot of things, but crawling 100M+ pages wasn't one of them. It also wound up being a good test for incremental releases of Cascading 1.1, which I think is now final? Why has it taken longer? Well, part of it is because any time you have to deal with a large slice of the web, it hurts. As Jimmy Liu at CMU said, the web is an endless series of edge cases. Plus we had some work to do to figure out how to run effectively in EMR.
  • The three in bold are the ones that I’ll spend a bit more time talking about today.
  • Surprising, given that Tika's charset detection is based on ICU, and that’s often the gold standard for internationalization. One simple approach to validating quality is to compare with any charset found in HTML meta tags. Which can still be wrong, but is usually correct - and way better than the HTTP response headers. Our input is the Avro files we generated from a sample crawl covering 1.7M top domains, based on US traffic reports from Alexa and Quantcast. BTW, please provide input on the format. There's a link to a blog post about it at the end of the talk. We'd like to finalize it before we generate a lot more data.
  • I’m not expecting you to be able to understand or even read this, but it shows the actual workflow portion of a typical analysis app. There are three functions you’re not seeing, which pick the records to analyze, do the analysis, and then generate a report.
  • I took a slice of data and calculated the accuracy of Tika, when compared to the meta tag charset. Now the meta tag data can lie, though it’s better than the HTTP response headers, and is usually pretty accurate. The vertical scale is accuracy, from 0 to 100%, and the horizontal scale is the number of pages for each charset (log). It’s a log scale so that you don’t wind up with all but a few charsets squished to the left. From this, you can see that what we want is for common charsets to have high accuracy, so points in the upper-right. But we don’t get that, unfortunately. And this is with calling us-ascii and iso-8859-1 subsets of UTF-8. For some reason Tika really likes the gb18030 encoding - many UTF-8 pages are classified as this. Clearly we could re-run this with modified versions of Tika, or other open source detectors. In fact it would be great to find an intern to implement the approach that Ted Dunning has recommended, of using log-likelihood ratios, to see how that compares. Similar issues exist for Tika’s language detection, so it could be a two-fer. The key point is that by having a large enough data set, and an easy way to process it, you have the ability to quickly try new approaches, and feel confident about the end results.
  • As Tom White pointed out, any input format needs to be splittable for efficiency. So it’s not just about which compression format to use (the answer to that is usually LZO). It relates to anything beyond simple one-line-per-record text files. There was some pain in creating a Cascading scheme that let us use Avro for reading & writing. But true to the Just-In-Time nature of open source projects, Hadoop input/output formats for Avro were recently committed. So we’ve posted an initial version of a Cascading scheme for Avro, which seems to be working well so far.
  • I should ask first how many people know about Amazon’s Elastic MapReduce? Don’t be shy. And how many have actually run jobs using it? Web mining, and crawling like we’re doing, is very bursty. You run a job, then analyze the results, figure out what went wrong, and run again. Even though it costs more per hour than raw EC2, often it's cheaper: no waking up to find that your 20 server cluster choked after 15 minutes, but you paid for 10 hours, and no coming back from vacation and realizing that you forgot to terminate the cluster. Also, unlike our friends here at Yahoo, I don't have money to build out even a 20 server cluster. And for bursty jobs, most of the time those 20 servers are sitting idle, which increases the effective cost significantly.
  • Log files get auto-uploaded to S3, which is nice. Except when they're huge, so you're waiting around for the file, and when it finally shows up it's too big to effectively download & examine. So never run in trace mode. EMR uses an older version of Hadoop, 0.18.3. Which means I regularly had something working fine in EC2, and then it would fail in EMR due to some version dependency. Oh, and use Bootstrap Actions to tune your cluster. For example, with 50 slaves I was often running out of namenode listener threads.
  • Every URL, status, additional bits of info. SQL starts having problems when you get past a few million things. And the seed list for the crawl was 1.7M top-level domains. After the first loop, there would be close to 70M URLs. 3 medium instances for three weeks = $250. And right now, for example, I'd be stressing about paying for servers I wasn't actually using. Web crawling is very bursty. And secretly, I just wasn't excited about configuring and maintaining an HBase cluster. For a different type of crawl, HBase would make total sense. It’s not about you, it’s about me. Maybe this has never happened for any of you, but occasionally I look at a piece of code and I say to myself - WTF? Who the heck thought this was a good idea? Something like that happened to me the first time I looked at the Nutch code that handled updating the CrawlDB, where the state of the crawl is kept. I was thinking, I know Doug's a smart guy, why? So then I was busy trying to use SequenceFiles to store the crawl state, and I realized I was recreating exactly the same logic, only not as well. Which was my code telling me to just stop. Plus I found copying this ever-growing file to and from S3 to be increasingly time consuming.
  • Paying for something you’re not using is like having a timeshare that you never visit. Right now I’ve got 75M URLs in SimpleDB, and it’s not costing me anything.
  • Applies to interaction with other external, distributed, shared systems. Very similar to issues related to web crawling, in fact. E.g., high latency, low throughput, high error rates compared to disk I/O.
  • At this point we’re starting to talk about real performance. And also real backup issues. Even with 10 mappers, it’s easy to completely swamp SimpleDB. We were loading about 200K records/minute with 4 mappers, and 100 threads/mapper. For accessing shared resources via HTTP (web pages, SimpleDB), you wind up having to layer a multithreading adapter on top of Hadoop to get reasonable performance. But since these are shared resources, you have additional constraints in dealing with fairness. And you have to be able to handle a much higher error rate - it's not like writing to a disk.
  • It's easy to blame lots of things: “EMR must get the dregs from the EC2 server pool.” “We're not getting rack locality for our data.” “The DNS servers at Amazon are being overwhelmed.” “Maybe there's a bad network card.” Todd Lipcon - Poor Man’s Profiling.
  • Why this picture? Because many of these mistakes have to do with configuration, not code.
  • Lesson learned: early termination of outliers can give a huge effective performance boost. Tika language detection: on by default, and really slow - and I don't need it. Lesson learned: you need to worry about the entire software stack, not just the top (your stuff) and the bottom (Hadoop). JVM reuse not enabled. Lesson learned: avoid processing in setup (use a pre-built Bloom filter), or only do it in reducers where you have more control. Logging level still set to trace. Generating 100GB+ of data. E.g. logging each domain (1.7M) in the whitelist that was getting put into the set.
  • So that’s it. I think we’ve got time for questions. One more thing - in 20 minutes I couldn't cover some other aspects, one of which was using Mahout to do ML in Hadoop/EC2. Specifically, to classify links as spammy or not, based on analyzing the fetch process and the fetched content. If there's enough interest, I could finish that presentation and do something online - there’s a link at the bottom for anyone who’s interested.
  • Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
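
The charset comparison described in the speaker notes above (Tika's detected charset versus the charset declared in the HTML meta tag, tallied per charset) can be sketched roughly as follows. This is a minimal illustration and not the actual Cascading workflow from the talk: the function names and sample records are hypothetical, and folding us-ascii and iso-8859-1 into UTF-8 on both sides is one simple reading of the "subsets of UTF-8" remark.

```python
from collections import defaultdict

# Assumption: these charsets are counted as matching UTF-8, following the
# talk's "calling us-ascii and iso-8859-1 subsets of UTF-8".
UTF8_COMPATIBLE = {"us-ascii", "iso-8859-1"}


def normalize(charset):
    """Lower-case a charset name and fold UTF-8-compatible names into utf-8."""
    name = charset.strip().lower()
    return "utf-8" if name in UTF8_COMPATIBLE else name


def accuracy_by_charset(records):
    """records: iterable of (meta_tag_charset, detected_charset) pairs.

    Returns {charset: (page_count, accuracy)}, where accuracy is the
    fraction of pages whose detected charset matches the meta tag.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for meta, detected in records:
        expected = normalize(meta)
        totals[expected] += 1
        if normalize(detected) == expected:
            hits[expected] += 1
    return {cs: (totals[cs], hits[cs] / totals[cs]) for cs in totals}


# Hypothetical sample records, for illustration only.
sample = [
    ("UTF-8", "UTF-8"),
    ("utf-8", "GB18030"),
    ("utf-8", "ISO-8859-1"),
    ("Shift_JIS", "Shift_JIS"),
    ("windows-1252", "utf-8"),
]
result = accuracy_by_charset(sample)
for charset, (pages, accuracy) in sorted(result.items()):
    print(charset, pages, round(accuracy, 2))
```

Plotting page count on a log axis against accuracy would give the kind of chart the notes describe, with the desirable region being the upper right.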

    1. Public Terabyte Dataset Project: Web crawling with Amazon’s EMR. Ken Krugler, Bixo Labs, Inc. Hadoop Bay Area Meetup, 21 April 2010
    2. About me
         • Background in consulting, search, vertical crawl
             • Apple (Mac), I18N, Expert systems, Palm OS
             • Krugle search engine for open source code
         • Open source projects
             • Nutch, Lucene, Solr
             • Bixo web mining toolkit
             • Tika content extraction
         • Founder of Bixo Labs
             • Elastic web mining platform
             • Hadoop, Cascading, Bixo in EC2/EMR
    3. In 20 Minutes I’ll Talk About…
         • Public Terabyte Dataset project
         • Amazon’s Elastic MapReduce
         • Some really embarrassing mistakes
    4. What is the Public Terabyte Dataset?
         • Large scale crawl of top US domains
         • Sponsored by Concurrent/Bixo Labs
         • Public dataset for use in Amazon’s cloud
         • As expected, taking longer than expected
             • Pain is weakness leaving the body
             • Questions, input?
    5. Using Many Different Technologies
         • AWS Elastic MapReduce - server farm
         • Hadoop, Cascading - processing workflow
         • Bixo, Tika - web crawling, extracting links
         • AWS SimpleDB - maintaining crawl state
         • AWS S3 - saving results
         • Apache Avro - storing results
    6. One Example of Analyzing Results
         • Tika charset detection is…not great
         • Simple code to process Avro files
         • Comparing meta tags with derived
         • Then make the results sexy in Excel
    7. Cascading Analysis Workflow
    9. Why Use Avro For Resulting Dataset?
         • Originally tried WARC (Web Archive)
         • But not really cross-language (Java, C)
         • And not easily splittable
    10. Amazon’s Elastic MapReduce
         • Auto-configured Hadoop clusters
         • Transient / on-demand
         • Good for “bursty” jobs
         • Low $$$/Ops requirements
    11. Effectively Using Elastic MapReduce
         • Avoid the 10 Second Failure
             • use the --alive option
         • Avoid the TB Log File
             • never run with trace debugging
         • Use new Bootstrap Actions
             • tune configuration for larger clusters
    12. Why Use SimpleDB?
         • Need to maintain crawl state
         • Too big for MySQL
         • Too expensive with HBase
         • Too painful with SequenceFiles
         • SimpleDB to the rescue (sort of)
    13. SimpleDB Fundamentals
         • Distributed key/value store
             • Some interesting query/update support
         • Pay for usage, not storage
         • Uses HTTP for requests
             • Latency, throughput issues
         • Shared resource
             • So there’s the “Back OFF!” issue
    14. SimpleDB Tap - simple
    15. SimpleDB Tap - batch puts
    16. SimpleDB Tap - sharding
    17. SimpleDB Tap - multithreading
    18. SimpleDB Tap - distributed
    19. Why is My Job Running So Sloooow?
         • I blame Amazon
             • “EMR servers must suck”
         • I blame Hadoop
             • It is an older version
         • I blame Cascading
             • The workflow planner must have a bug
         • Prehistoric Caveman Profiling
             • kill -QUIT to the rescue
    20. Configuration Bugs, not Code Bugs
    21. The Real Problems
         • Fetching ALL the pages
         • Tika language detection enabled
         • (Re)building of distributed data cache
         • Generating log files, not results
    22. Summary
         • Public Terabyte Dataset is “getting there”
             • Useful for testing analysis code
             • Free, easy to use in EC2
         • Elastic MapReduce works well
             • For bursty, occasional jobs
             • When coupled with other AWS services
         • Many “bugs” are configuration problems
    23. Any Questions?
         • My email:
         • [email_address]
         • Blog post about sample PTD results:
         • Input for Machine Learning in EC2 talk:
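
The "SimpleDB Tap" slides name several stages (batch puts, sharding, multithreading) whose diagrams did not survive in this transcript. The sketch below illustrates the general ideas of batching, sharding by key hash, and backing off when a shared store pushes back. It is not the actual Bixo or Cascading tap: every name here (`TransientError`, `put_batch`, `store`, the batch size and shard count) is a hypothetical stand-in, and `put_batch` represents whatever client call actually writes a batch.

```python
import random
import time
import zlib


class TransientError(Exception):
    """Hypothetical error for a 'slow down' style response from the store."""


def shard_for(key, num_shards):
    """Pick a shard (for example, one store domain) from a stable key hash."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def put_with_backoff(put_batch, shard, batch, max_retries=5,
                     base_delay=0.1, sleep=time.sleep):
    """Call put_batch(shard, batch), retrying with exponential backoff.

    Returns the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        try:
            put_batch(shard, batch)
            return attempt
        except TransientError:
            if attempt == max_retries:
                raise
            # Double the delay on each retry and add jitter so that many
            # writers do not all retry at the same instant.
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))


def store(records, put_batch, num_shards=4, batch_size=25, sleep=time.sleep):
    """Group (key, value) records by shard, then write them in batches."""
    by_shard = {}
    for key, value in records:
        by_shard.setdefault(shard_for(key, num_shards), []).append((key, value))
    for shard, items in sorted(by_shard.items()):
        for batch in batches(items, batch_size):
            put_with_backoff(put_batch, shard, batch, sleep=sleep)
```

In a real job each mapper would typically run many such writers in parallel threads, which is where the fairness and error-rate concerns from the notes come in. The backoff keeps a burst of writers from swamping the shared service.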