- 1. Modified Naive Bayes Model for Improved Web Page Classification
- 2. Overview • Abstract • Objective • Related Works • Traditional Naive Bayes Model • Proposed Modification • Classification Algorithm • Experimental Setup • Experimental Results • Conclusions • References
- 3. Abstract • The World Wide Web is a large repository of information, and it keeps growing exponentially. Most of the information stored in it is unstructured or semi-structured, so it is not easy to extract a desired piece of information from that large collection of unprocessed data. Several strategies have been employed to mine the data on the World Wide Web to find interesting information and organize it in a meaningful way. Web data mining is a field of active research interest, and computer scientists across the world are working on the development and improvement of web mining strategies. In this paper we present a modified probabilistic model based on the naive Bayes theorem, which classifies web pages based on their textual content better than the traditional naive Bayes model.
- 4. Objective • Explore the traditional naive Bayes text classification model, and optimize it for improved performance in terms of cost and time. • Use the above algorithm for automated classification of web pages based on their textual content.
- 5. Previous Works • The following are some of the methods presented for text classification: – Decision tree method [5] – Rule-based classification method [6] – SVM (Support Vector Machine) [7][8][9] – Neural network classifiers [10][11][12] – Bayesian classifiers [13][14] – K-nearest neighbour approach [16] • Most of these text classification methods were further extended to web page classification by taking the textual content or hierarchy of a web page into consideration [17]
- 6. Traditional Naive Bayes Model • Bayesian classifiers are statistical classifiers that predict class membership probabilities of tuples, i.e. the probability that a tuple belongs to a particular class. • Naive Bayes classifiers are Bayesian classifiers that assume 'class conditional independence'. • Under class conditional independence, the effect of an attribute's value on a class is independent of the values of the other attributes.
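The independence assumption above can be sketched in a few lines of Python: a class score is its prior multiplied by per-word likelihoods, each treated as independent of the others. All counts and class names below are invented for illustration, not taken from the paper's dataset.

```python
# Sketch of class conditional independence in a naive Bayes classifier:
# score(c) = P(c) * product of P(w|c) over the words in the document,
# with each word's likelihood treated as independent of the others.

def naive_bayes_score(words, prior, word_prob):
    """Return P(c) * prod P(w|c) under the independence assumption."""
    score = prior
    for w in words:
        # Tiny fallback probability for words never seen in class c
        score *= word_prob.get(w, 1e-6)
    return score

# Hypothetical per-word likelihoods for two toy classes.
sports = {"ball": 0.05, "team": 0.04, "score": 0.03}
tech = {"ball": 0.001, "team": 0.01, "score": 0.02}

doc = ["team", "score", "ball"]
s_sports = naive_bayes_score(doc, 0.5, sports)
s_tech = naive_bayes_score(doc, 0.5, tech)
print("sports" if s_sports > s_tech else "tech")  # → sports
```

Note the long chain of multiplications: this is exactly what the modified model later replaces with additions.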
- 7. Classification Algorithm
  INPUT
    freq_list -> word frequency list of the test web page
    database -> 2-d hash table containing the word frequency list of each category
    categories -> list of available categories for classification
  OUTPUT
    category -> category of the given web page

  function probability_model(freq_list, database, categories):
      v = total_number_of_words_in_database
      pc = {}   // a hash table storing the score of each category
      for each category in categories:
          attributes = database[category]
          n = total_number_of_words_in_category
          pc[category] = 0
          for each word in freq_list:
              if word not in attributes:
                  pc[category] = pc[category] + 1.0 / (n + v)
              else:
                  pc[category] = pc[category] + (1 + attributes[word]) / (n + v)
      Category = the category for which pc[category] is maximum
      return Category
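The pseudocode on slide 7 translates almost directly into runnable Python. The sketch below follows the slide's logic (summing smoothed per-word terms per category); the toy training database and category names are invented for illustration.

```python
def probability_model(freq_list, database, categories):
    """Modified naive Bayes classifier from slide 7: accumulate
    smoothed term weights additively per category and return the
    category with the highest total."""
    # v: vocabulary size across the whole training database
    v = len({word for attrs in database.values() for word in attrs})
    pc = {}  # score of each category
    for category in categories:
        attributes = database[category]
        n = sum(attributes.values())  # total words in this category
        pc[category] = 0.0
        for word in freq_list:
            # Unseen words contribute 1/(n+v); seen words (1+count)/(n+v)
            pc[category] += (1.0 + attributes.get(word, 0)) / (n + v)
    return max(pc, key=pc.get)

# Toy training data: word -> frequency per category (hypothetical).
database = {
    "sports": {"ball": 10, "team": 8},
    "tech": {"code": 12, "bug": 5},
}
print(probability_model(["ball", "team"], database, ["sports", "tech"]))
# → sports
```

Because the terms are added rather than multiplied, a single zero-frequency word cannot zero out a category's score, which is why the slides note that Laplacian correction becomes unnecessary.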
- 8. Experimental Setup • Dataset – The standard 20 Newsgroups dataset was used for testing. – It contains 19,997 documents classified uniformly into 20 different groups. Some of the newsgroups are very closely related to each other, while others are highly unrelated. • Testing Methods – Random subset sampling – K-fold cross validation
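The two testing methods named above can be sketched in plain Python. This is a minimal illustration of the evaluation protocols, not the paper's actual harness; the toy (document, label) pairs are invented.

```python
import random

def k_fold_splits(data, k):
    """Yield (train, test) partitions for k-fold cross validation:
    each of the k folds serves as the test set exactly once."""
    data = data[:]
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def random_subset_split(data, n_train):
    """Random subset sampling: pick n_train random training
    documents and test on the remainder."""
    data = data[:]
    random.shuffle(data)
    return data[:n_train], data[n_train:]

# Toy stand-in for the 20 Newsgroups documents: (doc_id, group) pairs.
docs = [(f"doc{i}", i % 20) for i in range(100)]
for train, test in k_fold_splits(docs, 5):
    assert len(train) + len(test) == len(docs)
```

Varying `n_train` from 100 to 900 reproduces the shape of the random-subset experiment on the next slide; varying `k` from 2 to 10 mirrors the cross-validation experiment.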
- 9. Experimental Results
  Random Subset Sampling: Traditional Naive Bayes Model vs. Modified Model

  Number of training documents | Accuracy with traditional model (%) | Accuracy with proposed model (%)
  100 | 68.63 | 68.96
  200 | 74.52 | 75.04
  300 | 77.78 | 78.66
  400 | 79.10 | 80.28
  500 | 80.23 | 81.46
  600 | 81.61 | 82.62
  700 | 82.25 | 83.41
  800 | 83.06 | 83.93
  900 | 82.91 | 84.36
- 10. Experimental Results (contd.) • (Chart) Random Subset Sampling: Traditional Naive Bayes Model vs. Modified Model
- 11. Experimental Results (contd.)
  K-Fold Cross Validation: Traditional Naive Bayes Model vs. Modified Model

  Number of folds | Accuracy with traditional model (%) | Accuracy with proposed model (%)
  2 | 73.26 | 74.67
  3 | 76.51 | 77.75
  4 | 77.29 | 78.74
  5 | 77.66 | 79.24
  6 | 78.22 | 79.66
  7 | 78.70 | 80.05
  8 | 79.07 | 80.49
  9 | 79.51 | 80.93
  10 | 79.70 | 81.12
- 12. Experimental Results (contd.) • (Chart) K-Fold Cross Validation: Traditional Naive Bayes Model vs. Modified Model
- 13. Conclusions • The proposed modified naive Bayes classifier is better than the traditional model for the following reasons: – Experimental results show that it provides higher accuracy than the traditional model. – Long multiplication operations can be replaced by less expensive addition operations. – There is no need for Laplacian correction, since a zero probability for an individual term no longer leads to a zero probability for the whole category. – The floating-point underflow problem, which can arise from repeated multiplication of small numbers, is avoided.
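The underflow point in the conclusions is easy to demonstrate: multiplying thousands of small per-word probabilities collapses to 0.0 in double-precision floating point, while an additive score (as in the modified model) or a sum of logs stays finite. The probability value and repetition count below are arbitrary illustrations.

```python
import math

p = 1e-5            # a typical small per-word probability (illustrative)
n_terms = 100_000   # number of terms in a long document (illustrative)

# Multiplicative score, as in the traditional model: underflows to 0.0.
prod = 1.0
for _ in range(n_terms):
    prod *= p
print(prod)  # → 0.0

# Additive score, as in the modified model: stays finite and nonzero.
total = sum(p for _ in range(n_terms))

# The classical alternative fix: sum logs instead of multiplying.
log_score = n_terms * math.log(p)
print(total > 0, math.isfinite(log_score))  # → True True
```

Either additive form preserves the ordering needed to pick the maximum-scoring category without ever leaving the representable range of a double.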
- 14. References
  1. http://www.sciencedaily.com/releases/2013/05/130522085217.htm
  2. http://www.worldwidewebsize.com/
  3. http://www.thecultureist.com/2013/05/09/how-many-people-use-the-internet-more-than-2-billion-infographic/
  4. Automated Classification of Web Sites using Naive Bayesian Algorithm. Ajay S. Patil and B.V. Pawar. IMECS 2012.
  5. A decision-tree-based symbolic rule induction system for text categorization. D.E. Johnson, F.J. Oles, T. Zhang, T. Goets. 2002.
  6. Automated learning of decision rules for text categorization. Chidanand Apte and Fred Damerau. ACM 1994.
  7. A statistical learning model of text classification for support vector machines. Thorsten Joachims. ACM 2001.
  8. Text categorization with support vector machines: Learning with many relevant features. Thorsten Joachims.
  9. Support vector machines for text categorization. A. Basu, C. Watters, and M. Shepherd. IEEE 2002.
  10. Automated text classification using a dynamic artificial neural network model. M. Ghiassi, M. Olschimke, B. Moon, P. Arnaudo. Elsevier 2012.
  11. Study of a text classification method based on neural network model. Jian Chen, Hailan Pan, Qinyum Ao. Springer 2012.
  12. Automatic text classification using artificial neural network. Springer, volume 172, 2005.
  13. Naive Bayes text classification. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
  14. Naive Bayesian text classification. John Graham-Cumming. 2005.
  15. A survey of text classification algorithms. Charu C. Aggarwal, ChengXiang Zhai.
  16. An improved K-nearest neighbour algorithm for text categorization. Li Baoli, Yu Shiven, Lu Qin. ICCPOL 2003.
- 15. References (contd.)
  17. Recent research in web page classification - a review. Alamelu Mangai, Santhosh Kumar, Sugumaran. IJCET 2010.
  18. Data Mining: Practical Machine Learning Tools and Techniques. Ian H. Witten and Eibe Frank. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.
  19. Fast categorization of large document collections. Shanks, V. and H.E. Williams. SPIRE 2001.
  20. Simple and accurate feature selection for hierarchical categorization. Wibowo, W. and H.E. Williams. ACM 2002.
  21. Computational Approaches to Analyzing Weblogs. Mihalcea, R. and H. Liu.
  22. Graph-based text classification: Learn from your neighbors. Angelova, R. and G. Weikum. SIGIR 2006.
  23. M.F. Porter, "An algorithm for suffix stripping", Program, vol. 14, no. 3, pp. 130-137, Jul. 1980.
  24. Python Beautiful Soup library. http://www.crummy.com/software/BeautifulSoup/
  25. Porter stemming algorithm, with various implementations. http://tartarus.org/martin/PorterStemmer/index.html
  26. Stanford NLP, naive Bayes text classification. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
  27. The BOW or libbow C library. [Online]. Available: http://www.cs.cmu.edu/~mccallum/bow/
  28. Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc. (Python NLTK)
  29. www.matplotlib.org - Python-based open source plotting and analysis toolkit.
  30. Data Mining: Concepts and Techniques. Han, Kamber, Pei. Morgan Kaufmann, 3rd Edition, 2012.
  31. DMOZ open directory project. [Online]. Available: http://dmoz.org/
  32. Home page for the 20 Newsgroups dataset. http://qwone.com/~jason/20Newsgroups/
