SlideShare a Scribd company logo
1 of 24
Public Terabyte Dataset Project Web crawling with Amazon’s EMR Ken Krugler, Bixo Labs, Inc. Hadoop Bay Area Meetup 21 April 2010
About me ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
In 20 Minutes I’ll Talk About… ,[object Object],[object Object],[object Object]
What is the Public Terabyte Dataset? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Using Many Different Technologies ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
One Example of Analyzing Results ,[object Object],[object Object],[object Object],[object Object]
Cascading Analysis Workflow
 
Why Use Avro For Resulting Dataset? ,[object Object],[object Object],[object Object]
Amazon’s Elastic MapReduce ,[object Object],[object Object],[object Object],[object Object]
Effectively Using Elastic MapReduce ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Why Use SimpleDB? ,[object Object],[object Object],[object Object],[object Object],[object Object]
SimpleDB Fundamentals ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
SimpleDB Tap - simple
SimpleDB Tap - batch puts
SimpleDB Tap - sharding
SimpleDB Tap - multithreading
SimpleDB Tap - distributed
Why is My Job Running So Sloooow? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Configuration Bugs, not Code Bugs
The Real Problems ,[object Object],[object Object],[object Object],[object Object]
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Any Questions? ,[object Object],[object Object],[object Object],[object Object]
 

More Related Content

What's hot

Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonYahoo Developer Network
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practicesHadoop User Group
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Joydeep Sen Sarma
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for HadoopHadoop User Group
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiJoydeep Sen Sarma
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesKelly Technologies
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Joydeep Sen Sarma
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceJoydeep Sen Sarma
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Alluxio, Inc.
 

What's hot (20)

Nov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In PythonNov HUG 2009: Hadoop Record Reader In Python
Nov HUG 2009: Hadoop Record Reader In Python
 
HUG August 2010: Best practices
HUG August 2010: Best practicesHUG August 2010: Best practices
HUG August 2010: Best practices
 
Nextag talk
Nextag talkNextag talk
Nextag talk
 
Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012Facebook Retrospective - Big data-world-europe-2012
Facebook Retrospective - Big data-world-europe-2012
 
Nov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.HNov 2010 HUG: Fuzzy Table - B.A.H
Nov 2010 HUG: Fuzzy Table - B.A.H
 
Karmasphere Studio for Hadoop
Karmasphere Studio for HadoopKarmasphere Studio for Hadoop
Karmasphere Studio for Hadoop
 
Hadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-DelhiHadoop Hive Talk At IIT-Delhi
Hadoop Hive Talk At IIT-Delhi
 
January 2011 HUG: Howl Presentation
January 2011 HUG: Howl PresentationJanuary 2011 HUG: Howl Presentation
January 2011 HUG: Howl Presentation
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Cloud Optimized Big Data
Cloud Optimized Big DataCloud Optimized Big Data
Cloud Optimized Big Data
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
Mumak
MumakMumak
Mumak
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Hadoop sqoop
Hadoop sqoop Hadoop sqoop
Hadoop sqoop
 
Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015Qubole @ AWS Meetup Bangalore - July 2015
Qubole @ AWS Meetup Bangalore - July 2015
 
Qubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant ConferenceQubole Overview at the Fifth Elephant Conference
Qubole Overview at the Fifth Elephant Conference
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
 

Viewers also liked

Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Hadoop User Group
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupHadoop User Group
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector Yahoo Developer Network
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataYahoo Developer Network
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsHadoop User Group
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopYahoo Developer Network
 
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Nick Galbreath
 
AWS Customer Presentation - eHarmony
AWS Customer Presentation - eHarmonyAWS Customer Presentation - eHarmony
AWS Customer Presentation - eHarmonyAmazon Web Services
 

Viewers also liked (20)

Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
Yahoo! Hadoop User Group - May Meetup - Extraordinarily rapid and robust data...
 
Yahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user groupYahoo! Mail antispam - Bay area Hadoop user group
Yahoo! Mail antispam - Bay area Hadoop user group
 
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
August 2016 HUG: Open Source Big Data Ingest with StreamSets Data Collector
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
January 2011 HUG: Pig Presentation
January 2011 HUG: Pig PresentationJanuary 2011 HUG: Pig Presentation
January 2011 HUG: Pig Presentation
 
January 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka PresentationJanuary 2011 HUG: Kafka Presentation
January 2011 HUG: Kafka Presentation
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Nov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big DataNov 2010 HUG: Business Intelligence for Big Data
Nov 2010 HUG: Business Intelligence for Big Data
 
HUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - FacebookHUG Nov 2010: HDFS Raid - Facebook
HUG Nov 2010: HDFS Raid - Facebook
 
Cascalog internal dsl_preso
Cascalog internal dsl_presoCascalog internal dsl_preso
Cascalog internal dsl_preso
 
Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
Common crawlpresentation
Common crawlpresentationCommon crawlpresentation
Common crawlpresentation
 
Karmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-toolsKarmasphere hadoop-productivity-tools
Karmasphere hadoop-productivity-tools
 
Pig at Linkedin
Pig at LinkedinPig at Linkedin
Pig at Linkedin
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
 
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
Rate Limiting at Scale, from SANS AppSec Las Vegas 2012
 
AWS Customer Presentation - eHarmony
AWS Customer Presentation - eHarmonyAWS Customer Presentation - eHarmony
AWS Customer Presentation - eHarmony
 

Similar to Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce

Distributed-ness: Distributed computing & the clouds
Distributed-ness: Distributed computing & the cloudsDistributed-ness: Distributed computing & the clouds
Distributed-ness: Distributed computing & the cloudsRobert Coup
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesJon Meredith
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCCal Henderson
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Bhupesh Bansal
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archroyans
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archguest18a0f1
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Archmclee
 
Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Matthew McCullough
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environmentDelhi/NCR HUG
 
DrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performanceDrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performanceAshok Modi
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Sujee Maniyam
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!Andraz Tori
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookAmr Awadallah
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupesh Bansal
 

Similar to Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce (20)

Distributed-ness: Distributed computing & the clouds
Distributed-ness: Distributed computing & the cloudsDistributed-ness: Distributed computing & the clouds
Distributed-ness: Distributed computing & the clouds
 
Front Range PHP NoSQL Databases
Front Range PHP NoSQL DatabasesFront Range PHP NoSQL Databases
Front Range PHP NoSQL Databases
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYCScalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
Scalable Web Architectures: Common Patterns and Approaches - Web 2.0 Expo NYC
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Web20expo Scalable Web Arch
Web20expo Scalable Web ArchWeb20expo Scalable Web Arch
Web20expo Scalable Web Arch
 
Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]Cloud Computing Bootcamp On The Google App Engine [v1.1]
Cloud Computing Bootcamp On The Google App Engine [v1.1]
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
DrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performanceDrupalCampLA 2011: Drupal backend-performance
DrupalCampLA 2011: Drupal backend-performance
 
Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2Cost effective BigData Processing on Amazon EC2
Cost effective BigData Processing on Amazon EC2
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Dean4j@Njug5
Dean4j@Njug5Dean4j@Njug5
Dean4j@Njug5
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and FacebookHow Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
 
WebWorkersCamp 2010
WebWorkersCamp 2010WebWorkersCamp 2010
WebWorkersCamp 2010
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Uotm workshop
Uotm workshopUotm workshop
Uotm workshop
 
Bhupeshbansal bigdata
Bhupeshbansal bigdata Bhupeshbansal bigdata
Bhupeshbansal bigdata
 

More from Hadoop User Group

More from Hadoop User Group (16)

Hdfs high availability
Hdfs high availabilityHdfs high availability
Hdfs high availability
 
2 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-212 hadoop@e bay-hug-2010-07-21
2 hadoop@e bay-hug-2010-07-21
 
1 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-211 content optimization-hug-2010-07-21
1 content optimization-hug-2010-07-21
 
3 avro hug-2010-07-21
3 avro hug-2010-07-213 avro hug-2010-07-21
3 avro hug-2010-07-21
 
1 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit20101 hadoop security_in_details_hadoop_summit2010
1 hadoop security_in_details_hadoop_summit2010
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Flightcaster Presentation Hadoop
Flightcaster  Presentation  HadoopFlightcaster  Presentation  Hadoop
Flightcaster Presentation Hadoop
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Security Preview
Hadoop Security PreviewHadoop Security Preview
Hadoop Security Preview
 
Hadoop Release Plan Feb17
Hadoop Release Plan Feb17Hadoop Release Plan Feb17
Hadoop Release Plan Feb17
 
Twitter Protobufs And Hadoop Hug 021709
Twitter Protobufs And Hadoop   Hug 021709Twitter Protobufs And Hadoop   Hug 021709
Twitter Protobufs And Hadoop Hug 021709
 
Ordered Record Collection
Ordered Record CollectionOrdered Record Collection
Ordered Record Collection
 
Searching At Scale
Searching At ScaleSearching At Scale
Searching At Scale
 
Hadoop Record Reader In Python
Hadoop Record Reader In PythonHadoop Record Reader In Python
Hadoop Record Reader In Python
 
File Context
File ContextFile Context
File Context
 

Recently uploaded

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 

Recently uploaded (20)

Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptx
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 

Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce

Editor's Notes

  1. Elastic Web Mining 01 November 2009
  2. Elastic Web Mining 01 November 2009 Over the prior 4 years I had a startup called Krugle, that provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the “programmer’s web” to find out information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. And that in turn led to Bixo Labs, which is a platform for quickly creating creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
  3. Elastic Web Mining 01 November 2009
  4. Elastic Web Mining 01 November 2009 Goal is to generate large, high quality web crawl that results in an Amazon public dataset Having a large set of public data can be very useful, and I'll show one example later. For Amazon, it's some honey to lure people into running jobs in EC2 & EMR. For us, it was a reason to really push the boundaries of what Bixo could handle. Bixo is the open source web mining toolkit project we sponsor, and we use it for a lot of things, but crawling 100M+ pages wasn't one of them It also wound up being a good test for incremental releases of Cascading 1.1, which I think is now final? Why has it taken longer? Well, part of it is because any time you have to deal with a large slice of the web, it hurts. As Jimmy Liu at CMU said, the web is an endless series of edge cases. Plus we had some work to do to figure out how to effectively run in EMR
  5. Elastic Web Mining 01 November 2009 The three in bold are the ones that I’ll spend a bit more time talking about today
  6. Elastic Web Mining 01 November 2009 Surprising, given that it’s based on ICU, and that’s often the gold standard for internationalization. One simple approach to validating quality is to compare with any charset found in HTML meta tags. Which can still be wrong, but is usually correct - and way better than the http response headers. Our input is the Avro files we generated from a sample crawl covering 1.7M top domains, based on US traffic reports From Alexa and Quantcast. BTW, please provide input on the format.Link to blog post about it at end of talk. Like to finalize before we generate a lot more data.
  7. Elastic Web Mining 01 November 2009 I’m not expecting you to be able to understand or even read this, but it shows the actual workflow portion of a typical analysis app There are three functions you’re not seeing, that pick the records to analyze, do the analysis, and then generate a report.
  8. Elastic Web Mining 01 November 2009 I took a slice of data and calculated the accuracy of Tika, when compared to the meta tag charset. Now the meta tag data can lie, though it’s better than the http response headers, and is usually pretty accurate. The vertical scale is accuracy, from 0 to 100%, and the horizontal scale is the number of pages for each charset (log) Now it’s a log scale so that you don’t wind up with all but a few charsets squished to the left. From this, you can see that what we want is for common charsets to have high accuracy, so points in the upper-right But we don’t get that, unfortunately. And this is with calling us-ascii and iso-8859-1 subsets of UTF-8 For some reason Tika really likes the gb18030 encoding - many UTF-8 pages are classified as this. Clearly we could re-run this with modified versions of Tika, or other open source detectors. In fact it would be great to find an intern to implement the approach that Ted Dunning has recommended, of using Log-liklihood ratios, to see how that compares. Similar issues exist for Tika’s language detection, so it could be a two-fer. The key point is that by having a large enough data set, and an easy way to process it, you have the ability to quickly try new approaches, and feel confident about the end results.
  9. Elastic Web Mining 01 November 2009 As Tom White pointed out, any input format needs to be splittable for efficiency. So it’s not just about which compression format to use (the answer to that is usually LZO) It relates to anything beyond simple one-line-per-record text files There was some pain in creating a Cascading scheme that let us use Avro for reading & writing. But true to the Just-In-Time nature of open source projects, Hadoop input/output formats for Avro were recently committed. So we’ve posted an initial version of a Cascading scheme for Avro, which seems to be working well so far.
  10. Elastic Web Mining 01 November 2009 I should ask first how many people know about Amazon’s Elastic Mapreduce? Don’t be shy. And how many have actually run jobs using it? Web mining, and crawling like we’re doing, is very bursty. You run a job, then analyze the results, figure out what went wrong, and run again. Even though it costs more per-hour than raw EC2, often it's cheaper- no waking up to find that your 20 server cluster chokes after 15 minutes, but you paid for 10 hours- no coming back from vacation and realizing that you forgot to terminate the cluster. Also, unlike our friends here at Yahoo, I don't have money to build out evena 20 server cluster. And for bursty jobs, most of the time those 20 servers are sitting idle, which increases the effective cost significantly.
  11. Elastic Web Mining 01 November 2009 Log files get auto-uploaded to S3, which is niceExcept when they're huge, so you're waiting around for the file, and when it finally shows up it's too big to effectively download & examine.So never run in trace mode EMR uses an older version of Hadoop, 0.18.3. Which means I regularly something working fine in EC2, and then it would fail in EMR due to some version dependency. Oh, and use Bootstrap Actions to tune your cluster. For example, with 50 slaves I was often running out of namenode listener threads.
  12. Elastic Web Mining 01 November 2009 Every URL, status, additional bits of info SQL starts having problems when you get past a few million things.And the seed list for the crawl was 1.7M top-level domains .After the first loop, there would be close to 70M URLs. 3 medium instances for three weeks = $250And right now, for example, I'd be stressing about paying for servers I wasn't actually using.Web crawling is very burstyAnd secretly, I just wasn't excited about configuring and maintaining an HBase cluster.For a different type of crawl, HBase would make total sense. It’s not about you, It’s about me. Maybe this has never happened for any of you, but occasionally I look at a piece of code and I say to myself - WTF? who the heck thought this was a good idea?Something like that happened to me, the first time I looked at the Nutch code that handled updating the CrawlDB, where the state of the crawl is kept.I was thinking, I know Doug's a smart guy, why?So then I was busy trying to use SequenceFiles to store the crawl state, and I realized I was recreating exactly the same logic, only not as well.Which was my code telling me to just stop.Plus I found that copying this ever-growing file to and from S3 to be increasingly time consuming.
  13. Elastic Web Mining 01 November 2009 Paying for something you’re not using is like having a timeshare that you never visit. Right now I’ve got 75M URLs in SimpleDB, and it’s not costing me anything.
  14. Elastic Web Mining 01 November 2009 Applies to interaction with other external, distributed, shared systems. Very similar to issues related to web crawling, in fact. Eg. High latency, low throughput, high error rates compared to disk I/O
  15. Elastic Web Mining 01 November 2009 At this point we’re starting to talk about real performance. And also real backup issues. Even with 10 mappers, it’s easy to completely swamp SimpleDB. We were loading about 200K records/minute with 4 mappers, and 100 threads/mapper. For accessing shared resources via HTTP (web pages, SimpleDB), you wind up having to layer a mutithreading adapter on top of Hadoop to get reasonable performance. But since these are shared resources, you have additional constraints in dealing fairness And you have to be able to handle a much higher error rate - it's not like writing to a disk
  16. Elastic Web Mining 01 November 2009 Easy to blame lots of things EMR must get the dregs from the EC2 server pool We not getting rack locality for our data The DNS servers at Amazon are being overwhelmed Maybe there's a bad network card” Todd Lipcon - Poor Manís Profiling
  17. Elastic Web Mining 01 November 2009 Why this picture? Because many of these mistakes have to do with configuration, not code.
  18. Elastic Web Mining 01 November 2009 Lesson learned: early termination of outliers can give huge effective performance boost On by default, and really slow - and I don't need I Lesson learned: you need to worry about the entire software stack, not just the top (your stuff) and the bottom (Hadoop) JVM reuse not enabled Lesson learned: avoid processing in setup (use pre-built bloom filter), or only do it in reducers where you have more control logging level still set to trace. Generating 100GB+ of data E.g. logging each domain (1.7M) in whitelist that was getting put into the set.
  19. Elastic Web Mining 01 November 2009
  20. Elastic Web Mining 01 November 2009 So that’s it. I think we’ve got time for questions. One more thing - i n 20 minutes I couldn't cover some other aspects - one of which was using Mahout to do ML in Hadoop/EC2. Specifically, to classify links as spammy or not, based on analyzing the fetch process and the fetched content. If there's enough interest, I could finish that presentation and do something online - there’s a link at the bottom for anyone who’s interested.
  21. Elastic Web Mining 01 November 2009