BDT204 Awesome Applications of Open Data - AWS re:Invent 2012

Dive into the world of big data as we discuss how open, public datasets can be harnessed using the AWS cloud. With many large data collections available (such as the 1000 Genomes Project and the Common Crawl), join this session to find out how you can process billions of web pages and trillions of genes to find new insights into society.

Transcript

  • 1. Lucky Oyster: dive deep, discover pearls
    $100 Worth of Priceless: Leveraging Common Crawl and Spot Instances to Data Mine The Web
    Lisa Green, Common Crawl; Matthew Berk, Lucky Oyster
  • 2. Common Crawl Data
    • ~8 billion web pages
    • ~120 TB
    • 2008-2012
    • ARC files, JSON metadata, text files
    • Available to anyone on Amazon's Public Data Sets
  • 3. What Does $100 Buy You?
    • 2 nosebleed seats at an NFL game
    • 1/10 the cost of an entry-level Dell PowerEdge
    • 80 minutes of time from a mid-level engineer
    • Omakase for 1 at Shiro's Sushi in Seattle
    or…
  • 4. $100 + 14 hours + 300 lines of Ruby = 3.4 billion Web pages processed, data mined, and indexed for search and research. Even a few years ago, this would have been unthinkable.
  • 5. The Experiment
    • Process the most recent (2012) Web crawl from Common Crawl
    • Determine the extent and nature of hardcoded references to Facebook
    • Extract structured metadata (Open Graph and Schema.org); a sketch of this extraction step follows the transcript
    • Store, analyze, and index entity metadata and link structure
  • 6. Components
    • AWS Spot Instances
      – Peak of ~200 nodes
      – ~5,000 hours of compute time
      – Average cost of $0.02 per hour
    • Custom Ruby code for extraction and analysis
    • Beanstalkd, Apache httpd, Sinatra
    • Some sysadmin elbow grease
  • 7. Architecture
    • Master instance (m2.4xlarge)
      – Queue for Common Crawl S3 paths
      – Data collection and node control service
      – Indexers and Solr instances
    • Worker nodes (c1.medium)
      – Spot instances with worker AMI
      – Consume S3 paths; decompress and stream ARC files
      – Extract and analyze
    • Goals were simplicity, interruption tolerance, and high throughput
    (Sketches of a worker loop and of the collection service follow the transcript.)
  • 8. Findings / Output
    • Lucky Oyster Study (see appendix or http://blog.luckyoyster.com)
    • Utility computing = major cost savings
    • Reusable framework for low-complexity Web-scale crawl processing
    • Indexes of 400+ million structured entities for R&D
  • 9. Thank you. Questions? matthew@luckyoyster.com, lisa@commoncrawl.org
  • 10. Appendix
  • 11. The Lucky Oyster Study
    • Based on 3.4 billion URLs from Common Crawl
    • 22% of pages reference Facebook directly
    • 8% of pages implement Open Graph tags
    • Top Open Graph types: hotels, movies, activities, songs, games, books
    • Study of the shift in the locus (away from the open Web) and nature (towards entities) of content
  • 12. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
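
Below is a minimal Ruby sketch of the per-page extraction step described on slide 5. The deck only says "300 lines of Ruby" and "custom Ruby code", so the Nokogiri parser, the extract_entities helper, and the exact output fields are illustrative assumptions, not the speakers' actual implementation.

    # Sketch only: Nokogiri, the helper name, and the output fields are assumptions;
    # the deck does not show the real extraction code.
    require 'nokogiri'
    require 'json'

    # Pull Open Graph <meta property="og:..."> tags, Schema.org itemtype values,
    # and hardcoded Facebook references out of a single HTML payload.
    def extract_entities(url, html)
      doc = Nokogiri::HTML(html)

      open_graph = {}
      doc.css('meta[property^="og:"]').each do |tag|
        open_graph[tag['property']] = tag['content']
      end

      schema_types  = doc.css('[itemtype]').map { |node| node['itemtype'] }.uniq
      facebook_refs = doc.css('a[href*="facebook.com"], script[src*="facebook.com"]').size

      {
        'url'           => url,
        'open_graph'    => open_graph,
        'schema_types'  => schema_types,
        'facebook_refs' => facebook_refs
      }
    end

    if __FILE__ == $0
      html = '<html><head><meta property="og:type" content="movie">' \
             '<meta property="og:title" content="Example"></head>' \
             '<body itemtype="http://schema.org/Movie">' \
             '<a href="http://www.facebook.com/example">Like us</a></body></html>'
      puts JSON.pretty_generate(extract_entities('http://example.com/', html))
    end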
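
Slide 7's worker loop (consume S3 paths, then decompress and stream ARC files) could look roughly like the following. This sketch assumes the beaneater gem as the beanstalkd client, anonymous HTTP access to the public data set bucket, and the extract_entities helper from the previous sketch; the queue host, tube name, and path layout are assumptions rather than the setup actually used in the talk.

    # Sketch of a worker-node loop. The beaneater client, tube name, bucket URL,
    # and path format are assumptions; only beanstalkd and ARC streaming are
    # named in the deck.
    require 'beaneater'
    require 'json'
    require_relative 'extract_entities'   # the per-page extraction sketch above

    QUEUE_HOST = ENV.fetch('QUEUE_HOST', 'master.internal:11300')  # hypothetical master address
    BUCKET_URL = 'https://aws-publicdatasets.s3.amazonaws.com'     # assumed public HTTP endpoint

    beanstalk = Beaneater.new(QUEUE_HOST)
    tube      = beanstalk.tubes['crawl-paths']    # assumed tube of ARC file keys

    loop do
      begin
        job = tube.reserve(120)                   # wait up to 120 s for the next ARC path
      rescue Beaneater::TimedOutError
        next                                      # nothing queued right now; keep polling
      end
      path = job.body                             # assumed layout, e.g. "common-crawl/.../1234.arc.gz"

      # Stream and decompress on the fly; zcat handles the multi-member gzip
      # layout of ARC files, so nothing is ever written to local disk.
      IO.popen("curl -s '#{BUCKET_URL}/#{path}' | zcat") do |arc|
        # Each ARC record starts with a header line: URL, IP, date, MIME type, length.
        while (header = arc.gets)
          next if header.strip.empty?                         # blank separator between records
          url, _ip, _date, _mime, length = header.split(' ')
          payload = arc.read(length.to_i)                     # raw HTTP response for this record
          next if url.nil? || url.start_with?('filedesc://')  # skip the ARC file-level header
          puts extract_entities(url, payload).to_json         # real system: report to the master (next sketch)
        end
      end

      job.delete                                  # acknowledge so the job is not re-queued
    end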
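
The master's "data collection and node control service" (slide 7) pairs with the Sinatra mention on slide 6. A minimal sketch of such a service might look like this; the routes, payload shape, and append-only output file are all assumed for illustration and are not described in the deck.

    # Sketch of a master-side collection and control service, assuming Sinatra
    # as named on slide 6. Routes, payload shape, and the output file are
    # illustrative assumptions.
    require 'sinatra'
    require 'json'

    OUTPUT = File.open('entities.jsonl', 'a')    # hypothetical append-only results log
    OUTPUT.sync = true

    # Workers POST one extracted record per request as a JSON body.
    post '/results' do
      record = JSON.parse(request.body.read)
      OUTPUT.puts(record.to_json)                # persist for the indexers / Solr to pick up
      status 204
    end

    # Simple node-control check so workers can ask whether to keep pulling jobs.
    get '/control' do
      content_type :json
      { 'run' => true }.to_json
    end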