Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce

Published in: Technology

  1. 1. Public Terabyte Dataset Project
      Web crawling with Amazon’s EMR
      Ken Krugler, Bixo Labs, Inc.
      Hadoop Bay Area Meetup, 21 April 2010
  2. 2. About me
      • Background in consulting, search, vertical crawl
          • Apple (Mac), I18N, Expert systems, Palm OS
          • Krugle search engine for open source code
      • Open source projects
          • Nutch, Lucene, Solr
          • Bixo web mining toolkit
          • Tika content extraction
      • Founder of Bixo Labs | http://bixolabs.com
          • Elastic web mining platform
          • Hadoop, Cascading, Bixo in EC2/EMR
  3. 3. In 20 Minutes I’ll Talk About…
      • Public Terabyte Dataset project
      • Amazon’s Elastic MapReduce
      • Some really embarrassing mistakes
  4. 4. What is the Public Terabyte Dataset?
      • Large scale crawl of top US domains
      • Sponsored by Concurrent/Bixo Labs
      • Public dataset for use in Amazon’s cloud
      • As expected, taking longer than expected
          • Pain is weakness leaving the body
          • Questions, input? http://bixolabs.com/PTD/
  5. 5. Using Many Different Technologies
      • AWS Elastic MapReduce - server farm
      • Hadoop, Cascading - processing workflow
      • Bixo, Tika - web crawling, extracting links
      • AWS SimpleDB - maintaining crawl state
      • AWS S3 - saving results
      • Apache Avro - storing results
  6. 6. One Example of Analyzing Results
      • Tika charset detection is…not great
      • Simple code to process Avro files
      • Comparing the charset in meta tags with the derived (detected) charset
      • Then make the results sexy in Excel
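The comparison on this slide can be sketched in a few lines. This is only an illustration in Python, not the Cascading workflow from the talk. It assumes each record has already been read out of the Avro files as a hypothetical `(html_bytes, detected_charset)` pair, and the names `meta_charset` and `compare_charsets` are made up for this example.

```python
import re
from collections import Counter

# Matches charset declarations such as <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">.
META_CHARSET = re.compile(
    rb"""<meta[^>]*charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def meta_charset(html_bytes):
    """Return the lower-cased charset declared in a meta tag, or None."""
    match = META_CHARSET.search(html_bytes)
    return match.group(1).decode("ascii").lower() if match else None


def compare_charsets(records):
    """Tally agreement between declared and detected charsets.

    `records` is an iterable of (html_bytes, detected_charset) pairs.
    This pair layout is an assumption, not the actual PTD Avro schema.
    """
    tally = Counter()
    for html_bytes, detected in records:
        declared = meta_charset(html_bytes)
        if declared is None:
            tally["no meta tag"] += 1
        elif detected is not None and declared == detected.lower():
            tally["match"] += 1
        else:
            tally["mismatch"] += 1
    return tally
```

The resulting counts are the kind of table that could then be exported for charting in a spreadsheet. A real comparison would probably also want to treat charset aliases as equal, which this sketch does not attempt.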
  7. 7. Cascading Analysis Workflow
  8. 9. Why Use Avro For Resulting Dataset?
      • Originally tried WARC (Web Archive)
      • But not really cross-language (Java, C)
      • And not easily splittable
  9. 10. Amazon’s Elastic MapReduce
      • Auto-configured Hadoop clusters
      • Transient / on-demand
      • Good for “bursty” jobs
      • Low $$$/Ops requirements
  10. 11. Effectively Using Elastic MapReduce
      • Avoid the 10 Second Failure
          • use the --alive option
      • Avoid the TB Log File
          • never run with trace debugging
      • Use new Bootstrap Actions
          • tune configuration for larger clusters
  11. 12. Why Use SimpleDB?
      • Need to maintain crawl state
      • Too big for MySQL
      • Too expensive with HBase
      • Too painful with SequenceFiles
      • SimpleDB to the rescue (sort of)
  12. 13. SimpleDB Fundamentals
      • Distributed key/value store
          • Some interesting query/update support
      • Pay for usage, not storage
      • Uses HTTP for requests
          • Latency, throughput issues
      • Shared resource
          • So there’s the “Back OFF!” issue
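One common way to live with a shared service that tells clients to back off is to retry with capped exponential backoff plus random jitter. The following is a minimal Python sketch of that general technique, not the code used in the talk. `ServiceBusyError` and `call_with_backoff` are hypothetical names, and the delay values are arbitrary assumptions.

```python
import random
import time


class ServiceBusyError(Exception):
    """Hypothetical stand-in for a 'service unavailable, slow down' response."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1,
                      max_delay=5.0, sleep=time.sleep):
    """Call `request()` and retry with jittered exponential backoff.

    Re-raises ServiceBusyError if the final attempt also fails.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceBusyError:
            if attempt == max_attempts - 1:
                raise
            # Delay ceiling doubles on each retry, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random amount up to the ceiling so that
            # many clients do not all retry at the same instant.
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the waiting can be replaced in tests. With many Hadoop tasks hitting the same service at once, the jitter matters as much as the exponential growth.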
  13. 14. SimpleDB Tap - simple
  14. 15. SimpleDB Tap - batch puts
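The slide itself is a diagram, so what follows is only a guess at the idea rather than the tap's real code. SimpleDB's BatchPutAttributes call accepts at most 25 items per request, so a writer has to group items into chunks of that size before sending them. A minimal Python sketch of the chunking step, where the `batches` helper is a hypothetical name:

```python
from itertools import islice

# SimpleDB's BatchPutAttributes accepts at most 25 items per request.
MAX_BATCH_ITEMS = 25


def batches(items, size=MAX_BATCH_ITEMS):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded list would then be sent as one request, which cuts the number of HTTP round trips by up to a factor of 25 compared with single puts.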
  15. 16. SimpleDB Tap - sharding
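This slide is also a diagram. A usual way to shard a key/value store like SimpleDB is to spread items over several domains by hashing each item key. The Python sketch below illustrates that general approach and is not taken from the talk. The `crawl-state` prefix, the function names, and the choice of MD5 as the hash are all assumptions.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index in [0, num_shards) using a stable hash.

    Python's built-in hash() is randomized per process for strings, so a
    digest is used to make every worker agree on the same shard.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def domain_for(key, num_shards, prefix="crawl-state"):
    """Return the (hypothetical) domain name that stores `key`."""
    return f"{prefix}-{shard_for(key, num_shards)}"
```

For example, `domain_for("http://example.com/", 8)` always returns the same one of eight domain names, on every machine in the cluster. The trade-off is that changing the shard count later moves most keys to a different domain.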
  16. 17. SimpleDB Tap - multithreading
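Because every request is an HTTP round trip, a single thread spends most of its time waiting on the network. One way to hide that latency is to issue several requests at once from a thread pool. The Python sketch below shows the general pattern only. `put_batch` is a hypothetical callable that sends one batch, and the worker count is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def put_batches_concurrently(batches, put_batch, max_workers=8):
    """Send each batch with `put_batch`, running up to `max_workers` at once.

    Results come back in the same order as `batches`. If any call raises,
    the exception is re-raised here when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(put_batch, batches))
```

More threads raise throughput only until the service starts pushing back, so a pool like this would normally be combined with some form of retry and backoff.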
  17. 18. SimpleDB Tap - distributed
  18. 19. Why is My Job Running So Sloooow?
      • I blame Amazon
          • “EMR servers must suck”
      • I blame Hadoop
          • It is an older version
      • I blame Cascading
          • The workflow planner must have a bug
      • Prehistoric Caveman Profiling
          • kill -QUIT to the rescue
  19. 20. Configuration Bugs, not Code Bugs
  20. 21. The Real Problems
      • Fetching ALL the pages
      • Tika language detection enabled
      • (Re)building of distributed data cache
      • Generating log files, not results
  21. 22. Summary
      • Public Terabyte Dataset is “getting there”
          • Useful for testing analysis code
          • Free, easy to use in EC2
      • Elastic MapReduce works well
          • For bursty, occasional jobs
          • When coupled with other AWS services
      • Many “bugs” are configuration problems
  22. 23. Any Questions?
      • My email: [email_address]
      • Blog post about sample PTD results: http://bixolabs.com/blog/2010/032010/04/21/first-sample-of-public-terabyte-dataset/
      • Input for Machine Learning in EC2 talk: http://bixolabs.com/ml-talk/
