Website classification
using Apache Spark
Amith Nambiar
Demo of the WebCat app
Business problem
Automatically classify new websites into one or more
predefined categories.
Why?
New websites appear every day in the web logs collected from data providers.
They need to be categorised before they are presented to customers
in the daily reports.
Website classification using
Apache Spark's MLlib.
Training Data
Starting point was already categorised data in the
form:
URL, category_id
www.linux.com, 10 -> (Computers and Internet)
www.coles.com.au, 20 -> (Shopping and Classifieds)
Training Data
Developed a crawler to crawl each of the categorised websites.
2,550 websites were picked for the initial training and test data.
URL, Category_Id -> URL, Category_Id, Features
www.coles.com.au, 20 ->
www.coles.com.au, 20, groceri deliv kitchen bench custom receiv deliveri first
spend onlin liquorland cole card cole insur apparel cole credit card locat hour
look hervey hervey today normal store hour monday friday 8am special store hour
saturday decemb sunday decemb store store search suburb postcod search
suburb postcod select locat suburb locat found pleas store store state recip
inspir recip tast cole partner tast weekli plan easier visit tast cook month cole
magazin everyday ingredi sensat meal famili friend latest cole cole handi video
recip creativ kitchen visit cole youtub rang rang product product bakeri dairi fresh
fruit cole mobil card heston liquor special diet gluten kosher foodtruck term condit
corpor respons corpor respons supplier commit work …
The text of each website was crawled, stemmed,
and stripped of stop words.
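The cleaning step can be sketched roughly as below. This is a minimal illustration, not the crawler's actual code: the stop-word list and the `strip_suffix` stemmer are hypothetical stand-ins. The stems on the slide look like the output of a Porter-style stemmer, which this crude suffix stripper does not reproduce exactly.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with", "your"}

# Hypothetical crude stemmer: strips the first matching suffix, if any.
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def strip_suffix(word):
    """Remove one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_features(page_text):
    """Lowercase, tokenise, drop stop words and short tokens, then stem."""
    tokens = re.findall(r"[a-z0-9]+", page_text.lower())
    return [strip_suffix(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 2]


features = extract_features(
    "Groceries delivered to your kitchen bench. Special store hours on Saturday."
)
print(features)
```

The resulting token list plays the role of the Features column in the training data.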
Bayes's theorem
Website classification using
Naive Bayes
Naive Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features.
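Written out, with c a category and x_1, ..., x_n the term features of a website (notation chosen here for illustration, not taken from the slides), Bayes' theorem, the naive independence assumption, and the resulting decision rule are:

```latex
P(c \mid x_1, \dots, x_n) = \frac{P(c)\, P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)}
\qquad
P(x_1, \dots, x_n \mid c) \approx \prod_{i=1}^{n} P(x_i \mid c)
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```

The denominator is the same for every category, so it can be dropped when picking the most probable one.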
tf-idf for weighting
In information retrieval, tf–idf, short for term frequency–inverse document frequency,
is a numerical statistic that is intended to reflect how important a word is to a document
in a collection or corpus.
https://en.wikipedia.org/wiki/Tf-idf
tf-idf
The tf-idf value increases proportionally to the number of times a word appears in the
document, but is offset by the frequency of the word in the corpus, which helps to adjust for
the fact that some words appear more frequently in general.
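As a rough illustration of that weighting, the sketch below computes tf-idf in plain Python. It assumes raw term counts for tf and a smoothed idf of log((N + 1) / (df + 1)); which variant WebCat used is not stated in the slides, and the function name and sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf is the raw count of the term in the document;
    idf is log((N + 1) / (df + 1)), where N is the number of documents
    and df is the number of documents containing the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in documents:
        term_counts = Counter(doc)
        weights.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        })
    return weights


# Hypothetical stemmed documents, one list of terms per website.
docs = [
    ["store", "store", "recip", "cole"],
    ["linux", "kernel", "store"],
    ["linux", "distro"],
]
weights = tf_idf(docs)
print(weights[0])
```

In the first document "store" occurs twice but also appears in two of the three documents, so its weight ends up below that of "recip", which occurs once and only in this document.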
Training data from Database/HDFS -> TermDoc RDDs -> tf-idf values -> Array of LabeledPoint(classId, vector)
Calculate tf-idf values on the features.
Create a LabeledPoint for each training data row.
Train the Naive Bayes model: model = NaiveBayes.train(labelPoints)
Predict the class of new data: model.predict(feature_vector), e.g. “Automotive”
Each row of training data (a website) is turned into the form
(ClassId, Sparse Vector), for example:
5.0, [100,(1,44,..),(0.3,0.12,…)]
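To make the train and predict steps concrete, here is a small pure-Python sketch of a multinomial Naive Bayes classifier over sparse vectors. It only mimics the shape of the `NaiveBayes.train` and `model.predict` calls; it is not the MLlib implementation. The function names, the dict-based sparse vector (feature index -> tf-idf weight), the additive smoothing value, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_naive_bayes(labeled_points, num_features, smoothing=1.0):
    """Train on (class_id, {feature_index: weight}) pairs.

    Returns {class_id: (log_prior, {index: log_likelihood}, log_unseen)}.
    """
    class_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: defaultdict(float))
    for class_id, vector in labeled_points:
        class_counts[class_id] += 1
        for index, weight in vector.items():
            feature_totals[class_id][index] += weight

    n_points = len(labeled_points)
    model = {}
    for class_id, count in class_counts.items():
        total = sum(feature_totals[class_id].values())
        denom = total + smoothing * num_features  # additive smoothing
        log_likelihood = {
            index: math.log((weight + smoothing) / denom)
            for index, weight in feature_totals[class_id].items()
        }
        log_unseen = math.log(smoothing / denom)
        model[class_id] = (math.log(count / n_points), log_likelihood, log_unseen)
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for a sparse vector."""
    best_class, best_score = None, -math.inf
    for class_id, (log_prior, log_likelihood, log_unseen) in model.items():
        score = log_prior
        for index, weight in vector.items():
            score += weight * log_likelihood.get(index, log_unseen)
        if score > best_score:
            best_class, best_score = class_id, score
    return best_class


# Toy training rows: (class id, sparse tf-idf vector). Values are made up.
training = [
    (20.0, {0: 0.6, 1: 0.3}),
    (20.0, {0: 0.4, 2: 0.5}),
    (10.0, {3: 0.7, 4: 0.2}),
    (10.0, {3: 0.5, 0: 0.1}),
]
model = train_naive_bayes(training, num_features=5)
predicted = predict(model, {0: 0.5, 1: 0.2})
print(predicted)
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing term keeps a feature that was never seen for a class from zeroing out that class entirely.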
API first for Data science
http://engineering.pivotal.io/post/api-first-for-data-science/
High Level Architecture of WebCat
Components: WebCat App, Queues/Topics, Link Collector Service, Link Crawler Service,
Classification Service, Training Data Database, Apache Spark
Request: Categorize www.coles.com.au
Response: Category is “Shopping and Classifieds”
Scale the Crawler service independently of the rest of the services.
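The idea of scaling the crawler separately can be sketched with in-process queues. This is only an analogy under stated assumptions: the slides do not name the message broker or show service code, so `run_pipeline`, `fake_crawl`, and `fake_classify` are hypothetical, and threads stand in for separately deployed service instances.

```python
import queue
import threading


def run_pipeline(urls, crawl, classify, crawler_workers=2):
    """Push URLs through crawl and classify stages joined by queues.

    The number of crawler workers is set independently of the classifier,
    mirroring a crawler service that is scaled on its own.
    """
    crawl_queue = queue.Queue()
    classify_queue = queue.Queue()

    def crawler():
        while True:
            url = crawl_queue.get()
            if url is None:  # sentinel: no more work for this worker
                break
            classify_queue.put((url, crawl(url)))

    workers = [threading.Thread(target=crawler) for _ in range(crawler_workers)]
    for worker in workers:
        worker.start()
    for url in urls:
        crawl_queue.put(url)
    for _ in workers:
        crawl_queue.put(None)
    for worker in workers:
        worker.join()

    # All crawling has finished, so the queue can be drained in one thread.
    results = {}
    while not classify_queue.empty():
        url, features = classify_queue.get()
        results[url] = classify(features)
    return results


def fake_crawl(url):
    """Stand-in for the crawler: returns canned features."""
    return ["store", "recip"] if "coles" in url else ["linux", "kernel"]


def fake_classify(features):
    """Stand-in for the classification service."""
    return "Shopping and Classifieds" if "store" in features else "Computers and Internet"


results = run_pipeline(["www.coles.com.au", "www.linux.com"],
                       fake_crawl, fake_classify, crawler_workers=6)
print(results)
```

Because the stages only communicate through queues, changing `crawler_workers` does not touch the classification code.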
WebCat dashboard on PWS - Pivotal Web Services
Note that the crawler service is scaled up to 6 instances for better performance.
Ideas for improving WebCat?
User feedback loop to update the model on incorrect predictions
Categorise www.bmw.com
We think it is “Electronics” - Did we get it right?
Upload your own data - (website, category) pairs
I know kogan.com.au belongs to category “Shopping and Classifieds” - add it to
the training data please.
More data = Better predictions?
User defined categories
e.g. realestate.com.au -> “Real Estate”
Create New Category “Real Estate”
Provide a publicly available API for categorised websites
GET /websites/{id}/category
GET /websites/{id}/features
…
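One way such endpoints could be resolved is sketched below. Only the two paths come from the slide; the in-memory `WEBSITES` store, the `handle_get` function, the sample id, and the response shape are hypothetical, and a real service would sit behind a web framework.

```python
import re

# Hypothetical in-memory store standing in for the training data database.
WEBSITES = {
    "1": {
        "url": "www.coles.com.au",
        "category": "Shopping and Classifieds",
        "features": ["groceri", "deliv", "kitchen"],
    },
}

# Matches /websites/{id}/category and /websites/{id}/features.
ROUTE = re.compile(r"^/websites/(?P<id>[^/]+)/(?P<field>category|features)$")


def handle_get(path):
    """Return (status_code, body) for a GET request path."""
    match = ROUTE.match(path)
    if match is None:
        return 404, {"error": "unknown route"}
    site = WEBSITES.get(match.group("id"))
    if site is None:
        return 404, {"error": "unknown website"}
    field = match.group("field")
    return 200, {"url": site["url"], field: site[field]}


print(handle_get("/websites/1/category"))
print(handle_get("/websites/1/features"))
```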
WebCat on Apache MADlib
http://madlib.incubator.apache.org/

Slides from the Apache Spark Meetup in Sydney, November 2016
