Classifying Ephemeral vs. Long-Lasting Content on the Web
Akif Khan Yusufzai 11CSS07
Moonis Javed 11CSS40
Introduction
● Web classification is a very important machine learning problem with
wide applicability in tasks such as news classification, content
prioritization, focused crawling and sentiment analysis of web content.
● On the Web, classification of page content is essential to focused
crawling, to the assisted development of web directories, to topic
specific Web link analysis and to contextual advertising.
● Our GOAL:
Study a specific instance of this broad and vital web classification
problem and develop a successful prediction system.
Goal Of Project:
● Building a crawler to scrape data from different websites across
the internet, such as bloomberg.com, blogs from blogger.com, and
the Times of India news website.
● Building a classifier to categorize web pages as evergreen or
ephemeral.
● Binary classes of classification:
o Ephemeral (short lived)
o Evergreen (Long lived)
Ephemeral Content
● Short lived, i.e., it loses its relevance after
a certain period of time.
● Based on current happenings or
interests.
● Fades out after some time, after which its
viewership or hit count becomes negligible.
● Examples
o A news topic
o A viral video
Long Lasting Content
● Content does not lose its relevance even
after a very long time.
● Usually based on everlasting topics
● Examples
o A cooking recipe
o Information about a monument such as the
Taj Mahal, or about history.
Technical Details
● Dataset from Kaggle’s ‘StumbleUpon Challenge’
● Contains approximately 10,000 HTML documents for
training and testing purposes.
● To improve our model we will scrape recent data from other
websites.
● Fields
o URL
o Boilerplate text: contains the title and body of the HTML page.
o Number of characters
o Number of links
o Number of words in the URL
o User-determined label (only in training), etc.
Preprocessing
● Bag of Words Model:
o Create a dictionary of all the words with their frequency of
occurrence.
o Remove the highest-frequency words (filler words like "the", "is",
etc.), as their presence doesn’t affect the prediction model.
o Remove the lowest-frequency words, which do not occur often
enough to be helpful for prediction.
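The bag-of-words steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the sample documents, and the thresholds `min_docs` and `max_doc_ratio` are all assumptions, and word frequency is measured here as the number of documents containing the word.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents, min_docs=2, max_doc_ratio=0.8):
    """Keep words found in at least `min_docs` documents and in at most
    `max_doc_ratio` of all documents (both thresholds are illustrative)."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    limit = max_doc_ratio * len(documents)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary word in `text`."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]

# Hypothetical sample documents, not taken from the dataset.
docs = [
    "the recipe is simple and the cake is sweet",
    "the election is today and the vote is close",
    "the cake recipe is a classic",
    "the vote count is delayed",
]
vocab = build_vocabulary(docs)
vector = bag_of_words(docs[2], vocab)
```

Here "the" and "is" are dropped because they occur in every document, and words seen in only one document are dropped as too rare.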
Preprocessing
Another approach…
we use the term frequency-inverse document frequency (TF-IDF) of
each word.
● TF-IDF is the product of two quantities: the term frequency and
the inverse document frequency.
● Term frequency: the number of times a word appears in a given
document.
● Inverse document frequency: measures how rare the word is
across all documents; the more documents contain the word, the
lower its IDF.
Formula for calculating IDF
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )
D is the set of training examples (documents)
|D| is the number of training examples
|{d ∈ D : t ∈ d}| is the number of documents where the
word t appears [6].
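As a worked illustration, the sketch below computes TF-IDF weights assuming the standard definition idf(t, D) = log(|D| / |{d ∈ D : t ∈ d}|), which matches the symbols defined above. Documents are represented as token lists; the function names and the tiny corpus are hypothetical, and returning 0.0 for a term that never occurs is a convenience choice, not part of the definition.

```python
import math
from collections import Counter

def idf(term, documents):
    """idf(t, D) = log(|D| / |{d in D : t in d}|); 0.0 if the term never occurs."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing) if containing else 0.0

def tf_idf(document, documents):
    """Map each term of `document` to its raw count times its idf."""
    counts = Counter(document)
    return {term: count * idf(term, documents) for term, count in counts.items()}

# Hypothetical corpus of already-tokenized documents.
corpus = [
    ["cake", "recipe", "cake"],
    ["election", "news"],
    ["recipe", "news"],
    ["history", "monument"],
]
tfidf_weights = tf_idf(corpus[0], corpus)
```

In this corpus "cake" appears twice in the first document and in no other document, so it gets weight 2·log(4), while "recipe" appears in two of the four documents and gets log(2).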
Classifier Models used
1. Naive Bayes Model
2. Logistic regression Analysis
3. Support Vector Machine (SVM)
4. Decision tree model - Random Forest
Naïve Bayes Model
● family of simple probabilistic classifiers
● based on applying Bayes' theorem
● strong (naive) independence assumptions between
the features
● popular method for text categorization
● used with word frequencies as features
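To make the independence assumption concrete, here is a minimal sketch of a multinomial Naive Bayes classifier over word counts with add-one smoothing. It is an assumption-laden illustration: the results reported later use a Gaussian variant, and the function names and training examples here are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of (tokens, label). Returns class counts,
    per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in samples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, tokens):
    """Pick the class maximising log P(class) + sum of log P(word | class)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero out the product.
            numerator = word_counts[label][token] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data: 1 = evergreen, 0 = ephemeral.
training = [
    (["cake", "recipe", "bake"], 1),
    (["history", "monument"], 1),
    (["election", "today", "news"], 0),
    (["viral", "video", "today"], 0),
]
nb_model = train_naive_bayes(training)
```

Summing log-probabilities per word is exactly where the "naive" assumption enters: each word contributes independently of the others given the class.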
Logistic Regression
● used for predicting the outcome of a categorical dependent
variable (i.e., a class label) based on one or more predictor
variables (features)
● makes use of one or more predictor variables that may be
either continuous or categorical data
● Binomial logistic regression is used, as the final predictions are
binary
(0 = ephemeral, 1 = long lasting)
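A minimal sketch of binomial logistic regression, trained with plain stochastic gradient descent, is shown below. The learning rate, epoch count, function names, and the two-feature toy data are assumptions chosen for illustration only.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that the example belongs to class 1 (long lasting)."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in zip(X, y):
            error = predict_proba(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# Hypothetical features; label 1 = long lasting, 0 = ephemeral.
X = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
y = [1, 1, 0, 0]
lr_weights, lr_bias = train_logistic(X, y)
labels = [1 if predict_proba(lr_weights, lr_bias, f) >= 0.5 else 0 for f in X]
```

Thresholding the predicted probability at 0.5 turns the continuous output into the binary label.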
Support Vector Machine (SVM)
● builds a model that assigns new examples to one
category or the other, making it a non-probabilistic
binary linear classifier
● used to analyze data and recognize patterns
● A set of features that describes one case (i.e., a row of predictor
values) is called a vector.
● creates an optimal N-dimensional hyperplane which separates the
data into two categories
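The separating hyperplane can be sketched as a weight vector and bias, with the class given by the sign of w·x + b. The example below trains a linear SVM by sub-gradient descent on the regularised hinge loss; the hyperparameters, function names, and toy data are assumptions, and a production system would normally use an optimised library solver instead.

```python
def decision(weights, bias, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_linear_svm(X, y, lr=0.1, reg=0.01, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss.
    Labels must be -1 or +1."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * decision(weights, bias, x)
            # Shrink the weights (L2 penalty), which favours a wide margin.
            weights = [w - lr * reg * w for w in weights]
            if margin < 1:
                # The example is inside the margin or misclassified: push it out.
                weights = [w + lr * label * xi for w, xi in zip(weights, x)]
                bias += lr * label
    return weights, bias

# Hypothetical, linearly separable data: +1 = evergreen, -1 = ephemeral.
X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]]
y = [1, 1, -1, -1]
svm_w, svm_b = train_linear_svm(X, y)
predictions = [1 if decision(svm_w, svm_b, x) >= 0 else -1 for x in X]
```

Only examples with margin below 1 change the weights, which is why the final hyperplane depends on the points closest to the boundary.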
Decision Tree Model - Random Forest
● ensemble learning method for classification (and regression).
● operates by constructing a multitude of decision trees at training
time.
● outputs the class that is the mode of the classes output by the
individual trees.
● applies the general technique of bootstrap aggregating, or
bagging.
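The two ideas above, bootstrap aggregating and taking the mode of the tree outputs, can be sketched as follows. This is a simplified illustration: each "tree" is a one-level decision stump, the per-split random feature selection of a full random forest is omitted, and all names, the seed, and the data are assumptions.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def train_stump(rows):
    """Fit a one-level tree: choose the (feature, threshold) split with the
    fewest training errors. Each row is (features, label)."""
    best = None
    for f in range(len(rows[0][0])):
        for threshold in sorted({feats[f] for feats, _ in rows}):
            left = [lab for feats, lab in rows if feats[f] <= threshold]
            right = [lab for feats, lab in rows if feats[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def stump_predict(stump, features):
    f, threshold, left_label, right_label = stump
    return left_label if features[f] <= threshold else right_label

def train_forest(rows, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def forest_predict(forest, features):
    """Return the mode of the individual tree predictions."""
    votes = Counter(stump_predict(stump, features) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical rows: (features, label) with 1 = evergreen, 0 = ephemeral.
rows = [([0.1, 5.0], 0), ([0.2, 3.0], 0), ([0.3, 4.0], 0),
        ([0.8, 5.0], 1), ([0.9, 3.0], 1), ([0.7, 4.0], 1)]
forest = train_forest(rows)
prediction = forest_predict(forest, [0.85, 4.0])
```

Because every tree sees a different resample of the data, individual trees may disagree; the majority vote is what smooths out their errors.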
Outliers
Results of Different Models
SVM
● Linear SVM on Tf-Idf vectorized body
20 fold CV score : 86.8915%
● Linear SVM on Tf-Idf vectorized body after outlier
removal
20 fold CV score : 87.2765%
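For readers unfamiliar with the metric, a 20-fold CV score is the accuracy averaged over 20 train/test splits in which each example is tested exactly once. The sketch below shows the mechanics under stated assumptions: folds are contiguous and unshuffled, and `fit`, `predict`, and the majority-class baseline are hypothetical placeholders, not the models evaluated here.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_accuracy(fit, predict, X, y, k=20):
    """Average accuracy over k folds. `fit(X, y)` returns a model and
    `predict(model, x)` returns a label; both are supplied by the caller."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most common training label.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = list(range(40))
y = [i % 2 for i in range(40)]
baseline_score = cross_val_accuracy(fit_majority, predict_majority, X, y, k=20)
```

On this balanced toy data the baseline scores 0.5, which is the kind of reference point the reported scores should be read against.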
Results of Different Models
Gaussian NB:
● 20 Fold CV Score
69.825%
● 20 Fold CV Score after
Outlier Removal
70.379%
Results of Different Models
Random Forest:
● 20 Fold CV Score
79.95%
● 20 Fold CV Score after
Outlier Removal
80.1174%
Results of Different Models
Word Cloud of highest frequency words
Work To Be Done
● Verify Outliers using clustering methods like
K-means Clustering, DBSCAN
● Build Ensemble Method by combining
multiple models
● Apply AdaBoost to improve Ensemble
Accuracy
Area of Application
● useful for recommenders attempting to classify different
news stories based on type.
● used for archival projects to determine what web content
merits inclusion.
● for content sites interested in capacity planning for
hosting different pages based on expected longevity.
● used for placing advertisements: companies can bid more on ads
displayed on long-lasting pages, as the ads will be
visible for a longer time.
Future Scope
● Apply distributed or cloud computing to further improve accuracy
and provide real time classification.
● Predict page lifetime (the time for which a page receives a fair
amount of hits)
● Sentiment analysis of long lasting web pages
References
[1] Kaggle, “StumbleUpon Evergreen Classification Challenge”,
http://www.kaggle.com/c/stumbleupon
[2] T. Fawcett, “An Introduction to ROC Analysis”, Pattern Recognition Letters,
vol. 27, no. 8, 2006, pp. 861-874
[3] J. Ramos, “Using TF-IDF to Determine Word Relevance in Document Queries”,
ICML, 2003
Thank You
Any Questions ?
