SlideShare a Scribd company logo
 
Web Mining in the Cloud Ken Krugler, Bixo Labs, Inc. ACM Silicon Valley Data Mining Camp 01 November 2009 Hadoop/Cascading/Bixo in EC2
About me ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Typical Data Mining
Data Mining Victory!
Meanwhile, Over at McAfee…
Web Mining 101 ,[object Object],[object Object],[object Object]
4 Steps in Web Mining ,[object Object],[object Object],[object Object],[object Object]
Web Mining versus Data Mining ,[object Object],[object Object],[object Object],[object Object]
How to Mine Large Scale Web Data? ,[object Object],[object Object],[object Object],[object Object],[object Object]
One Solution - the HECB Stack ,[object Object],[object Object],[object Object],[object Object]
EC2 - Amazon Elastic Compute Cloud ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Why Hadoop? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Why Cascading? ,[object Object],[object Object],[object Object],[object Object]
Why Bixo? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
SEO Keyword Data Mining ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Workflow
Custom Code for Example ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
End Result in Data Mining Tool
What Next? ,[object Object],[object Object],[object Object],[object Object],[object Object]
Another Example - HUGMEE ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Helpful Hadoopers ,[object Object],[object Object],[object Object],[object Object]
Scoring Algorithm ,[object Object],[object Object],[object Object],[object Object]
High Level Steps ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
High Level Steps ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Workflow
Building the Flow
mod_mbox Page
Custom Operation
Validate
This Hug’s for Ted!
Produce
Public Terabyte Dataset ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Back
Summary ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Any Questions? ,[object Object],[object Object],[object Object]
 

More Related Content

What's hot

Google history nd architecture
Google history nd architectureGoogle history nd architecture
Google history nd architectureDivyangee Jain
 
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
Mikael Svenson
 
Take Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next LevelTake Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next Level
Jeff Fried
 
Building an unstructured data management solution with elastic search and ama...
Building an unstructured data management solution with elastic search and ama...Building an unstructured data management solution with elastic search and ama...
Building an unstructured data management solution with elastic search and ama...
mobiusservices
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Ahmed Helmy
 
Consuming External Content and Enriching Content with Apache Camel
Consuming External Content and Enriching Content with Apache CamelConsuming External Content and Enriching Content with Apache Camel
Consuming External Content and Enriching Content with Apache Camel
therealgaston
 
Understanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid SearchUnderstanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid Search
Jeff Fried
 
Building a spa_in_30min
Building a spa_in_30minBuilding a spa_in_30min
Building a spa_in_30min
Michael Hackstein
 
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks FusionWebinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Lucidworks
 
How to migrate from any CMS (thru the front-door)
How to migrate from any CMS (thru the front-door)How to migrate from any CMS (thru the front-door)
How to migrate from any CMS (thru the front-door)
ICF CIRCUIT
 
MongoDB et Hadoop
MongoDB et HadoopMongoDB et Hadoop
MongoDB et HadoopMongoDB
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
Gary Stafford
 
Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?
therealgaston
 
Azure datafactory
Azure datafactoryAzure datafactory
Azure datafactory
Dimko Zhluktenko
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and Recommenders
Lucidworks
 
Dspace 7 presentation
Dspace 7 presentationDspace 7 presentation
Dspace 7 presentation
mohamed Elzalabany
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
Treasure Data, Inc.
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
MongoDB
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
Stéphane Fréchette
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without Programming
Michelle Minkoff
 

What's hot (20)

Google history nd architecture
Google history nd architectureGoogle history nd architecture
Google history nd architecture
 
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
Search Queries Explained – A Deep Dive into Query Rules, Query Variables and ...
 
Take Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next LevelTake Cloud Hybrid Search to the Next Level
Take Cloud Hybrid Search to the Next Level
 
Building an unstructured data management solution with elastic search and ama...
Building an unstructured data management solution with elastic search and ama...Building an unstructured data management solution with elastic search and ama...
Building an unstructured data management solution with elastic search and ama...
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Consuming External Content and Enriching Content with Apache Camel
Consuming External Content and Enriching Content with Apache CamelConsuming External Content and Enriching Content with Apache Camel
Consuming External Content and Enriching Content with Apache Camel
 
Understanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid SearchUnderstanding and Applying Cloud Hybrid Search
Understanding and Applying Cloud Hybrid Search
 
Building a spa_in_30min
Building a spa_in_30minBuilding a spa_in_30min
Building a spa_in_30min
 
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks FusionWebinar: Event Processing & Data Analytics with Lucidworks Fusion
Webinar: Event Processing & Data Analytics with Lucidworks Fusion
 
How to migrate from any CMS (thru the front-door)
How to migrate from any CMS (thru the front-door)How to migrate from any CMS (thru the front-door)
How to migrate from any CMS (thru the front-door)
 
MongoDB et Hadoop
MongoDB et HadoopMongoDB et Hadoop
MongoDB et Hadoop
 
Building a Data Lake on AWS
Building a Data Lake on AWSBuilding a Data Lake on AWS
Building a Data Lake on AWS
 
Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?Do you need an external search platform for Adobe Experience Manager?
Do you need an external search platform for Adobe Experience Manager?
 
Azure datafactory
Azure datafactoryAzure datafactory
Azure datafactory
 
Webinar: Search and Recommenders
Webinar: Search and RecommendersWebinar: Search and Recommenders
Webinar: Search and Recommenders
 
Dspace 7 presentation
Dspace 7 presentationDspace 7 presentation
Dspace 7 presentation
 
Scaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big DataScaling to Infinity - Open Source meets Big Data
Scaling to Infinity - Open Source meets Big Data
 
MongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business InsightsMongoDB and Hadoop: Driving Business Insights
MongoDB and Hadoop: Driving Business Insights
 
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
On the move with Big Data (Hadoop, Pig, Sqoop, SSIS...)
 
Almost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without ProgrammingAlmost Scraping: Web Scraping without Programming
Almost Scraping: Web Scraping without Programming
 

Viewers also liked

Elastic Web Mining
Elastic Web MiningElastic Web Mining
Elastic Web Mining
Ken Krugler
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Cloudera, Inc.
 
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Mohamed Zaki
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Ontotext
 
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
gilforum
 
Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철
HoChul Lee
 
Text mining
Text miningText mining
Text mining
Malik Imran
 
Dm ml study_roadmap
Dm ml study_roadmapDm ml study_roadmap
Dm ml study_roadmap
Kang Pilsung
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약
Sung Yub Kim
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
Ontotext
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
Matthew (정재화)
 
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
uEngine Solutions
 
Text data mining1
Text data mining1Text data mining1
Text data mining1KU Leuven
 
집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요
Kwang Woo NAM
 
마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전
Taejoon Yoo
 
Информационный вестник Сентябрь 2013
Информационный вестник Сентябрь 2013 Информационный вестник Сентябрь 2013
Информационный вестник Сентябрь 2013
Ingria. Technopark St. Petersburg
 
Up in the clouds sdd 2012
Up in the clouds sdd 2012Up in the clouds sdd 2012
Up in the clouds sdd 2012
Andrea Ginsky
 
Londons Digital Neighbourhoods Workshop - Background Paper
Londons Digital Neighbourhoods Workshop - Background PaperLondons Digital Neighbourhoods Workshop - Background Paper
Londons Digital Neighbourhoods Workshop - Background PaperNetworked Neighbourhoods
 
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
Mongara AB
 

Viewers also liked (20)

Elastic Web Mining
Elastic Web MiningElastic Web Mining
Elastic Web Mining
 
Big data concept
Big data conceptBig data concept
Big data concept
 
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
Hadoop World 2011: Data Mining in Hadoop, Making Sense of it in Mahout! - Mic...
 
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
Analyzing Customer Experience Feedback Using Text Mining: A Linguistics-Based...
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다[2014년 3월 25일] mining minds   빅 데이터, 욕망을 읽다
[2014년 3월 25일] mining minds 빅 데이터, 욕망을 읽다
 
Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철Kth daisy 추천솔루션_20130509_v1.0_이호철
Kth daisy 추천솔루션_20130509_v1.0_이호철
 
Text mining
Text miningText mining
Text mining
 
Dm ml study_roadmap
Dm ml study_roadmapDm ml study_roadmap
Dm ml study_roadmap
 
Data Mining with R CH1 요약
Data Mining with R CH1 요약Data Mining with R CH1 요약
Data Mining with R CH1 요약
 
Best Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining ProcessingBest Practices for Large Scale Text Mining Processing
Best Practices for Large Scale Text Mining Processing
 
Expanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with TajoExpanding Your Data Warehouse with Tajo
Expanding Your Data Warehouse with Tajo
 
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
Io t에서 big data를 통합하는 통합 빅데이터 플랫폼 flamingo_클라우다인_김병곤 대표이사
 
Text data mining1
Text data mining1Text data mining1
Text data mining1
 
집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요집단지성 프로그래밍 01-데이터마이닝 개요
집단지성 프로그래밍 01-데이터마이닝 개요
 
마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전마인즈랩 회사소개서 V2.3_한국어버전
마인즈랩 회사소개서 V2.3_한국어버전
 
Информационный вестник Сентябрь 2013
Информационный вестник Сентябрь 2013 Информационный вестник Сентябрь 2013
Информационный вестник Сентябрь 2013
 
Up in the clouds sdd 2012
Up in the clouds sdd 2012Up in the clouds sdd 2012
Up in the clouds sdd 2012
 
Londons Digital Neighbourhoods Workshop - Background Paper
Londons Digital Neighbourhoods Workshop - Background PaperLondons Digital Neighbourhoods Workshop - Background Paper
Londons Digital Neighbourhoods Workshop - Background Paper
 
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
Mongara Arbetsrätt och sociala media Svensk Bensinhandel, Mongara Gran Canari...
 

Similar to Elastic Web Mining

Build Your Own Search Engine
Build Your Own Search EngineBuild Your Own Search Engine
Build Your Own Search Engine
goodfriday
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
Amazon Web Services
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloud
kidrane
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
Kellyn Pot'Vin-Gorman
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single Database
Datafiniti
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
Amazon Web Services
 
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUSBest Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
guest7c2e070
 
Connecting Your Data Analytics Pipeline
Connecting Your Data Analytics PipelineConnecting Your Data Analytics Pipeline
Connecting Your Data Analytics Pipeline
Amazon Web Services
 
Fundamentals Of Search
Fundamentals Of SearchFundamentals Of Search
Fundamentals Of Search
Search Tools Consulting
 
PoolParty Thesaurus Management Quick Overview
PoolParty Thesaurus Management Quick OverviewPoolParty Thesaurus Management Quick Overview
PoolParty Thesaurus Management Quick Overview
Andreas Blumauer
 
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
ITProceed
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
Antonio Silveira
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
Amazon Web Services
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
javier ramirez
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
Amazon Web Services
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
Roy Kim
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Precisely
 

Similar to Elastic Web Mining (20)

Build Your Own Search Engine
Build Your Own Search EngineBuild Your Own Search Engine
Build Your Own Search Engine
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Best Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWSBest Practices for Building a Data Lake on AWS
Best Practices for Building a Data Lake on AWS
 
Seravia in the Cloud
Seravia in the CloudSeravia in the Cloud
Seravia in the Cloud
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single Database
 
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUSBest Practices to SharePoint Architecture Fundamentals NZ & AUS
Best Practices to SharePoint Architecture Fundamentals NZ & AUS
 
Connecting Your Data Analytics Pipeline
Connecting Your Data Analytics PipelineConnecting Your Data Analytics Pipeline
Connecting Your Data Analytics Pipeline
 
Fundamentals Of Search
Fundamentals Of SearchFundamentals Of Search
Fundamentals Of Search
 
PoolParty Thesaurus Management Quick Overview
PoolParty Thesaurus Management Quick OverviewPoolParty Thesaurus Management Quick Overview
PoolParty Thesaurus Management Quick Overview
 
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
Office Track: SharePoint Online Migration - Asses, Prepare, Migrate & Support...
 
RavenDB overview
RavenDB overviewRavenDB overview
RavenDB overview
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWSBig Data, Ingeniería de datos, y Data Lakes en AWS
Big Data, Ingeniería de datos, y Data Lakes en AWS
 
Example.ppt
Example.pptExample.ppt
Example.ppt
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 

More from Ken Krugler

Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, Faster
Ken Krugler
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scale
Ken Krugler
 
Suicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraSuicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and Cassandra
Ken Krugler
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Ken Krugler
 
Strata web mining tutorial
Strata web mining tutorialStrata web mining tutorial
Strata web mining tutorial
Ken Krugler
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to Hadoop
Ken Krugler
 
A (very) short history of big data
A (very) short history of big dataA (very) short history of big data
A (very) short history of big data
Ken Krugler
 
Thinking at scale with hadoop
Thinking at scale with hadoopThinking at scale with hadoop
Thinking at scale with hadoop
Ken Krugler
 

More from Ken Krugler (8)

Faster Workflows, Faster
Faster Workflows, FasterFaster Workflows, Faster
Faster Workflows, Faster
 
Similarity at scale
Similarity at scaleSimilarity at scale
Similarity at scale
 
Suicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and CassandraSuicide Risk Prediction Using Social Media and Cassandra
Suicide Risk Prediction Using Social Media and Cassandra
 
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & SolrFaster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
Faster, Cheaper, Better - Replacing Oracle with Hadoop & Solr
 
Strata web mining tutorial
Strata web mining tutorialStrata web mining tutorial
Strata web mining tutorial
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to Hadoop
 
A (very) short history of big data
A (very) short history of big dataA (very) short history of big data
A (very) short history of big data
 
Thinking at scale with hadoop
Thinking at scale with hadoopThinking at scale with hadoop
Thinking at scale with hadoop
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
Rohit Gautam
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
Large Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial ApplicationsLarge Language Model (LLM) and it’s Geospatial Applications
Large Language Model (LLM) and it’s Geospatial Applications
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 

Elastic Web Mining

Editor's Notes

  1. Over the prior 4 years I had a startup called Krugle, that provided code search for open source projects and inside large companies. We did a large, 100M page crawl of the “programmer’s web” to find out information about open source projects. Based on what I learned from that experience, I started the Bixo open source project. It’s a toolkit for building web mining workflows, and I’ll be talking more about that later. Several companies paid me to integrate Bixo into an existing data processing environment. And that in turn led to Bixo Labs, which is a platform for quickly creating creating web mining apps. Elastic means the size of the system can easily be changed to match the web mining task.
  2. This is the world that many of you live in. Analyzing data to find important patterns. Here’s an example of output from the QlikView business intelligence tool It was used to help analyze the relative prevalence of keywords in two competing web sites. Here you see two word terms that often occur on McAfee’s site, but not on Symantec’s Which is very useful data for anybody who worries about search engine optimization.
  3. You all know about analyzing data to find important patterns that get managers all worked up…
  4. But how do you get to this point? How do you use the web as the source for data that you’re analyzing That’s what I’m going to be talking about here.
  5. Quick intro to web mining, so we’re on the same page Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing.
  6. It’s common to confuse web crawling with fetching. Crawling is the process of automatically finding new pages by extracting links from fetched pages. But for many web mining applications, you have a “white list” of pre-defined URLs. In either case, though, you need to reliably, efficiently and politely fetch pages. Content comes in a variety of formats - typically HTML, but also PDF, word, zip archives, etc. Need to parse these formats to extract key data - typically text, but could be image data. Often the analyze step will include aspects of machine learning - classification, clustering. “useful data” covers a lot of ground, because there are a lot of ways to use the output of web mining. Generating an index is one of the most common, because people think about search as the goal. But for data mining, the end result at this point is often highly reduced data that is input to traditional data mining tools.
  7. What are the key differences between web mining and traditional data mining I’m saying “traditional” because the face of data mining is clearly changing. But if you look at most vendor tools, the focus is on what I’d call “traditional data mining” Scale - 10M is big for data mining, but not for web mining Access - with DM, once you defeated Mongor, keeper of data base access keys, you were golden Web pages are typically public, but it’s a shared resource so implicit rules apply. Like “don’t bring my web site to its knees”. Data mining breaks traditional implicit contract, so extra cautions apply. Implicit contract is that I let you crawl me, and you drive traffic to me when your search index goes live. But with DM, there often isn’t an index as the end result. With mining DBs, there’s explicit structure, which is mostly lacking from web pages.
  8. If it doesn’t scale, then it won’t handle the quantity of data you’ll ultimately want to process from the web If you can’t create real workflows, it will never be reliable or efficient. If you don’t use specialized web crawling code, you’ll get blacklisted Because you’re trying to distill down large data, there’s often some custom processing. If you don’t run it a cloud environment, you’ll be wasting money - and I’ll explain why in a few slides.
  9. I’m focusing on one particular solution to the challenges of web mining that I just described. It’s the “HECB” stack. I’m going to talk about these from the bottom up, which is EC2 first, then Hadoop…but the acronym didn’t work as well.
  10. At Krugle we ran two clusters, one of 11 servers, and a smaller 4 server cluster In the end, our actual utilization ratio was probably < 20% Even with close to 100% utilization, the break-even point for EC2 vs. colo is somewhere between 50 and 200 servers, depending on who you talk to. If utilization was 20%, then break even would be 250 to 1000 servers. Mining for search doesn’t work so well in this model - cluster should be always crawling (ABC) so not as bursty And transferring raw content, parse, and index will generate lots of transfer charges. But for web mining that’s focused on data mining, data is distilled so this isn’t an issue.
  11. Map-reduce - how do you parallelize the processing of lots of data so that you can Do the work on many servers? The answer is Map-reduce. HDFS - how do you store lots of data in a fault-tolerant, cost-effective manner. How do you make sure the data (the big stuff) moves as little as possible during processing. The answer is the Hadoop distributed file system. It’s open source, so lots of support, consultants, rapid bug fixes, etc. Large companies are using it, especially Yahoo Elastic map reduce is a special service built on top of EC2, where it’s easier to run Hadoop jobs Because you have access to pre-configured Hadoop clusters, special tools, etc.
  12. If you ever had to write a complex workflow using Hadoop, you know the answer. It frees you from the lower-level details of thinking in map-reduce. You can think about the workflow as operations on records with fields. And in data mining, the workflow often winds up being very complex. Because you can build workflows out of a mix of pre-defined & custom pipes, it’s a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) Key aspect of reliable workflows is Cascading’s ability to check your workflow (the DAG it builds) Finds cases where fields aren’t available for operations. Solves a key problem we ran into when customizing Nutch at Krugle
  13. Does the world really need yet another web crawler? No, but it does need a web mining toolkit Two companies agreed to sponsor work on Bixo as an open source project. Polite yet efficient - tension between those two goals that’s hard to resolve. If you do a crawl of any reasonable size, you’ll run into lots of errors. Even if a web server says “I swear to you, I’m sending you a 20K HTML file in English” It’s a 50K text file in Russian using the Cyrillic character set. And because it’s open source, you get the benefit of a community of users. They contribute re-usable toolkit components.
  14. Whenever I show a workflow diagram like this, I make a joke about it being intuitively obvious. Which, obviously, it’s not. And in fact the full workflow is a bit bigger, as I left out the second stage that describes more of the keyword analysis. But the key point is that the blue color items are provided by Cascading. And the green color items are provided by Bixo. So what’s left are two yellow items, which represent the two points of customization.
  15. There were two main pieces of custom code that needed to be written. One was some URL filtering to focus on the right content inside the web sites. Avoiding non-English pages by specific URL patterns. Same kind of thing for forums and such, since these pages weren’t part of what could easily be optimized. And if enough people need this type of support, since Bixo is open source it will likely become part of the toolkit
  16. Finally we can actually use a traditional data mining tool to help make sense of the digested data. Many things we could do in addition Clustering of results, to improve keyword analysis Larger sites have “areas of interest” Identifying broken links, typos Identifying personal data - email addresses, phone numbers
  17. I try to limit presentations to 20 slides - so I’ve hit that limit In the spirit of the unconference - let me know what you’d like to do next.
  18. Let’s use a real example now of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes,and sucking up. But the Apache Foundation could decides to go for a quantitative approach for the HUGMEE award.
  19. How do you figure out the most helpful Hadoopers? As we discussed previously, it’s a classic web mining problem Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?
  20. Parsing the mod_mbox page is simple with Tika’s HtmlParser Cheated a bit when parsing emails - some users like Owen have many aliases So hand-generated alias resolution table.
  21. Need to ignore “thanks” in “thanks in advance for doing my job for me” signoff. Generate two tuples for each email: one with messageId/name/address One with reply-to messageId/score Group/sum aspect is classic reduce operation.
  22. I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but… Key point is that only purple is stuff that I had to actually create Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page. But only two custom operations actually needed - parsing mbox_page and calculating score Running took about 30 minutes - mostly politely waiting until it was Ok to politely do another fetch. Downloaded 150MB of mbox files 409 unique email addresses with at least one positive reply.
  23. Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. Don’t want to be writing tricky code here. Could optimize, but that would be a mistake…most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.
  24. Example of the top-level pages that were fetched in first phase. Then needed to be parsed to extract links to mbox files.
  25. Example of one of two custom operation Parsing mod_mbox page Uses Tika to extract Ids Emits tuple with URL for each mbox ID
  26. Curve looks right - exponential decay. 409 unique email addresses that got some love from somebody.
  27. And the winner is…Ted Dunning I know - I should have colored the elephant yellow.
  28. A list of the usual suspects Coincidentally, Ted helped me derive the scoring algorithm I used…hmm.