
# Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier


1. 2013 IEEE International Conference on Big Data — Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier. Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen
2. Outline ✤ Introduction ✤ Naive Bayes Classification ✤ Implementation of Naive Bayes in Hadoop ✤ Experimental Study
3. Introduction — A typical method for obtaining valuable information is to extract the sentiment or opinion from a message. This paper aims to evaluate the scalability of the Naive Bayes classifier (NBC) on large datasets.
4. Introduction — NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput; the accuracy of NBC improves and approaches 82%.
5. Naive Bayes Classification — The naive Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem with strong (naive) independence assumptions between the features. It is a popular method for text categorization (the problem of judging which category a document belongs to).
6. Naive Bayes Classification — prior probability: P(A); posterior probability: P(A|B)
7. Naive Bayes Classification — Bayes' theorem: P(POS|excellent,terrible) = P(POS) × P(excellent,terrible|POS) / P(excellent,terrible); in general, P(POS|d1) = P(POS) × P(d1|POS) / P(d1)
8. Naive Bayes Classification — Under the independence assumption, P(excellent,terrible|POS) ≈ P(excellent|POS) × P(terrible|POS), so P(POS|excellent,terrible) = P(POS) × P(excellent|POS) × P(terrible|POS) / P(excellent,terrible)
9. Naive Bayes Classification — training counts: d1 (class POS): excellent 5, terrible 1; d2 (class NEG): excellent 2, terrible 6. Test document d3: (excellent, 8), (terrible, 2). Then P(POS|excellent,terrible) = (1/2) × (5/6)^8 × (1/6)^2 / P(excellent,terrible) and P(NEG|excellent,terrible) = (1/2) × (2/8)^8 × (6/8)^2 / P(excellent,terrible)
10. Naive Bayes Classification — for d3: (excellent, 8), (terrible, 2), the unnormalized scores are (1/2) × (5/6)^8 × (1/6)^2 ≈ 0.00323011165 for POS versus (1/2) × (2/8)^8 × (6/8)^2 ≈ 0.00000429153 for NEG, so d3 is classified POS
11. Naive Bayes Classification — the winning POS score: (1/2) × (5/6)^8 × (1/6)^2
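The arithmetic in the worked example above can be checked with a short script. This is an illustrative sketch only — the toy counts come from the slides, but the function and variable names are mine, not the paper's code:

```python
import math

# Word counts from the toy training set (slide 9):
# class POS: excellent=5, terrible=1 (6 tokens total)
# class NEG: excellent=2, terrible=6 (8 tokens total)
priors = {"POS": 0.5, "NEG": 0.5}
likelihood = {
    "POS": {"excellent": 5 / 6, "terrible": 1 / 6},
    "NEG": {"excellent": 2 / 8, "terrible": 6 / 8},
}

def score(doc, cls):
    """Unnormalized posterior: P(cls) * prod over words of P(w|cls)^count."""
    s = priors[cls]
    for word, count in doc.items():
        s *= likelihood[cls][word] ** count
    return s

d3 = {"excellent": 8, "terrible": 2}
pos, neg = score(d3, "POS"), score(d3, "NEG")
print(pos, neg)                        # ~0.00323 vs ~0.0000043
print("POS" if pos > neg else "NEG")   # d3 is classified POS
```

Since the denominator P(excellent,terrible) is the same for both classes, comparing the unnormalized scores is enough to pick the winner.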
12. Naive Bayes Classification — N is the total number of documents, Nc is the number of documents in class c, and Nwi is the frequency of a word wi in class c
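The symbol definitions above correspond to the standard maximum-likelihood estimates for a multinomial naive Bayes model; a sketch of the usual formulas (the Laplace-smoothed variant is an assumption on my part — the slide does not show it):

$$\hat{P}(c) = \frac{N_c}{N}, \qquad \hat{P}(w_i \mid c) = \frac{N_{w_i}}{\sum_{w \in V} N_w}$$

where the counts in the second formula are taken within class c and V is the vocabulary. With Laplace (add-one) smoothing, commonly used to avoid zero probabilities for unseen words:

$$\hat{P}(w_i \mid c) = \frac{N_{w_i} + 1}{\sum_{w \in V} N_w + |V|}$$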
13. Implementation of Naive Bayes in Hadoop — pre-processing the raw dataset
14. Implementation of Naive Bayes in Hadoop — 1000 positive and 1000 negative reviews
15. Implementation of Naive Bayes in Hadoop — the training pass emits tuples (word, posSum, negSum): each word's frequency across all positive and all negative documents, e.g. (excellent, 1000, 10)
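The training pass described above can be sketched in plain Python (the paper's actual implementation is a Hadoop MapReduce job; this single-process version, with names of my choosing, only illustrates the (word, posSum, negSum) output format):

```python
from collections import defaultdict

def word_frequencies(labeled_docs):
    """Count each word's frequency across all positive and all
    negative documents, producing (word, posSum, negSum) records."""
    counts = defaultdict(lambda: [0, 0])  # word -> [posSum, negSum]
    for label, text in labeled_docs:
        for word in text.split():
            counts[word][0 if label == "pos" else 1] += 1
    return {w: tuple(c) for w, c in counts.items()}

docs = [("pos", "excellent excellent film"), ("neg", "terrible film")]
print(word_frequencies(docs))
# {'excellent': (2, 0), 'film': (1, 1), 'terrible': (0, 1)}
```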
16. Implementation of Naive Bayes in Hadoop — the word-frequency table (word, posSum, negSum), e.g. (excellent, 1000, 10), is joined with per-document counts (word, count, docID), e.g. (excellent, 20, 5), to produce (docID, count, word, posSum, negSum) records, e.g. (5, 20, excellent, 1000, 10)
17. Implementation of Naive Bayes in Hadoop — for each document, the (docID, count, word, posSum, negSum) records, e.g. (5, 10, excellent, 20, 5) and (5, 2, terrible, 5, 20), are scored per class: 10×log(20) + 2×log(5) for pos versus 10×log(5) + 2×log(20) for neg; the output is (docID, predict, correct), e.g. (5, pos, true), (6, neg, false)
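The classification step on this slide can be sketched as a reduce function over one document's records. This is a Python sketch, not the paper's Hadoop code, and it reproduces the log-score comparison exactly as shown on the slide (the full implementation would presumably use normalized, smoothed log-probabilities rather than raw class sums):

```python
import math

def classify(doc_records):
    """Given all (count, word, posSum, negSum) records for one document,
    score each class by sum(count * log(classSum)) and predict the larger."""
    pos_score = sum(count * math.log(pos_sum)
                    for count, _, pos_sum, _ in doc_records)
    neg_score = sum(count * math.log(neg_sum)
                    for count, _, _, neg_sum in doc_records)
    return "pos" if pos_score > neg_score else "neg"

# The slide's example for docID 5:
records = [(10, "excellent", 20, 5), (2, "terrible", 5, 20)]
# pos: 10*log(20) + 2*log(5)  >  neg: 10*log(5) + 2*log(20)
print(classify(records))  # pos
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause on long documents.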
18. Experimental Study — the Hadoop cluster has one name node and six data nodes; each VM is allocated two virtual CPUs and 4 GB of memory. The 7 nodes run on a Dell server with 12 Intel Xeon E5-2630 2.3 GHz cores and 32 GB of memory, using Xen Cloud Platform (XCP) 1.6 as the hypervisor.
19. Experimental Study — training data (figure)
20. Experimental Study (figure)