
BDT204 Awesome Applications of Open Data - AWS re: Invent 2012


Dive into the world of big data as we discuss how open, public datasets can be harnessed using the AWS cloud. Drawing on large data collections such as the 1000 Genomes Project and the Common Crawl, this session shows how you can process billions of web pages and vast volumes of genomic data to find new insights into society.



  1. $100 Worth of Priceless: Leveraging Common Crawl and Spot Instances to Data Mine the Web
     Lisa Green, Common Crawl; Matthew Berk, Lucky Oyster
     (Lucky Oyster — dive deep, discover pearls)
  2. Common Crawl Data
     • ~8 billion web pages
     • ~120 TB
     • 2008–2012
     • ARC files, JSON metadata, text files
     • Available to anyone on Amazon’s Public Data Sets
  3. What Does $100 Buy You?
     • 2 nosebleed seats at an NFL game
     • 1/10 the cost of an entry-level Dell PowerEdge
     • 80 minutes of a mid-level engineer’s time
     • Omakase for 1 at Shiro’s Sushi in Seattle
     or…
  4. $100 + 14 hours + 300 lines of Ruby = 3.4 billion web pages processed, data mined, and indexed for search and research. Even a few years ago, this would have been unthinkable.
  5. The Experiment
     • Process the most recent (2012) Web crawl from Common Crawl
     • Determine the extent and nature of hardcoded references to Facebook
     • Extract structured metadata (Open Graph and …)
     • Store, analyze, and index entity metadata and link structure
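The Open Graph extraction step above can be sketched in a few lines of Ruby. The talk's actual code is not published, so the regex and method name here are illustrative; the pattern also assumes the `property` attribute precedes `content`, which is common but not guaranteed in real pages.

```ruby
# Illustrative Open Graph tag extraction from raw page HTML.
# Captures the og:* key and its content value.
OG_TAG = /<meta\s+property=["']og:(\w+)["']\s+content=["']([^"']*)["']/i

def extract_open_graph(html)
  html.scan(OG_TAG).each_with_object({}) do |(key, value), tags|
    tags[key.downcase] = value
  end
end

html = '<meta property="og:type" content="movie">' \
       '<meta property="og:title" content="Moneyball">'
extract_open_graph(html)
# => {"type"=>"movie", "title"=>"Moneyball"}
```

A production pipeline would use a real HTML parser rather than a regex, but at crawl scale a narrow pattern like this is cheap and tolerant of broken markup.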
  6. Components
     • AWS Spot Instances
       – Peak of ~200 nodes
       – ~5,000 hours of compute time
       – Average cost of $0.02 per hour
     • Custom Ruby code for extraction and analysis
     • Beanstalkd, Apache httpd, Sinatra
     • Some sysadmin elbow grease
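As a quick sanity check, the component figures above line up with the $100 headline: roughly 5,000 instance-hours at an average spot price of about $0.02/hour (both values approximate, taken from the slide).

```ruby
# Back-of-envelope check of the slide's cost figures (approximate values).
compute_hours  = 5_000
price_per_hour = 0.02   # average spot price, USD
compute_hours * price_per_hour
# => 100.0
```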
  7. Architecture
     • Master instance (m2.4xlarge)
       – Queue of Common Crawl S3 paths
       – Data collection and node-control service
       – Indexers and Solr instances
     • Worker nodes (c1.medium)
       – Spot instances launched from a worker AMI
       – Consume S3 paths; decompress and stream ARC files
       – Extract and analyze
     • Goals: simplicity, interruption tolerance, and high throughput
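The worker side of this architecture can be sketched as follows. Everything here is illustrative, not the talk's code: Ruby's core `Queue` stands in for the Beanstalkd job queue, and an in-memory gzip blob stands in for an ARC file streamed from S3.

```ruby
require 'zlib'
require 'stringio'

# Stream-decompress one gzipped blob and tally simple per-line statistics,
# mimicking a worker scanning ARC records for Facebook references.
def process_arc(gzipped, stats)
  Zlib::GzipReader.new(StringIO.new(gzipped)).each_line do |line|
    stats[:pages]    += 1 if line.include?('<html')
    stats[:facebook] += 1 if line.include?('facebook.com')
  end
  stats
end

def run_worker(queue, stats)
  # Pop jobs non-blockingly until the queue drains; a real worker would
  # block on Beanstalkd's reserve and delete each job on success.
  while (job = (queue.pop(true) rescue nil))
    process_arc(job, stats)
  end
  stats
end

queue = Queue.new
queue << Zlib.gzip(%(<html><a href="http://facebook.com/page">like</a></html>\n))
run_worker(queue, { pages: 0, facebook: 0 })
# => {:pages=>1, :facebook=>1}
```

Keeping workers this stateless is what makes the design interruption-tolerant: if a spot instance is reclaimed mid-job, the queued S3 path can simply be handed to another node.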
  8. Findings / Output
     • Lucky Oyster Study (see appendix or …)
     • Utility computing = major cost savings
     • Reusable framework for low-complexity, Web-scale crawl processing
     • Indexes of 400+ million structured entities for R&D
  9. Thank you. Questions?
  10. Appendix
  11. The Lucky Oyster Study
     • Based on 3.4 billion URLs from Common Crawl
     • 22% of pages reference Facebook directly
     • 8% of pages implement Open Graph tags
     • Top Open Graph types: hotels, movies, activities, songs, games, books
     • A study of the shift in the locus (away from the open Web) and the nature (toward entities) of content
  12. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.