Optimal Sentiment Analysis using
Multiple Sequence Alignment in a
High Performance Computing Environment
Vineetha. V and Achuthsankar S. Nair
Dept. of Computational Biology & Bioinformatics
University of Kerala
High-level Summary
ICBAI, 2016
• Presents a novel way of using Multiple Sequence Alignment (MSA) as a text summarization tool in
performing Sentiment Analysis
• Improved algorithm for MSA
• Implementation on Hadoop for handling large volumes of data
• Supervised learning using a Naïve Bayes classifier
• MSA as a promising option: results on par with other data processing techniques
Sentiment Analysis
• Sentiment Analysis is the mining of public opinion and social interactions from the web
• People express their opinions and suggestions through:
• Social Media (Twitter, Facebook)
• Blogs (Google blogs, personalized sites)
• E-commerce sites (Amazon, eBay, Flipkart)
• Review sites (MouthShut, CNET)
• Discussion forums
• Sentiment Analysis helps in identifying the orientation of opinion
Relevance of Sentiment Analysis
Business & Organizations
• Product reviews & consumer attitudes
• Trends
• Market intelligence
• Product and service benchmarking
Individuals
• Purchasing a product
• Using a service
• Other decision-making tasks
• Finding like-minded individuals or communities
Politics & Government
• To know the views of the general public
• Survey on a topic
• Opinion polls to decide on policy formation/modification
SA – Use Cases
• Retail Supply Chain - Identifying customer sentiments for improving product
features and deciding on new products
• Web Optimization – Sentiment analysis and opinion mining to offer best online
experience to users
• Social Networking Sites – For better understanding of social dynamics
• Historical and comparative Linguistics – For the comparative method by which
linguists traditionally reconstruct languages.
• Business and marketing research - In analyzing series of purchases over time
Sentiment Analysis – General Workflow
[Diagram] Workflow steps: Text Preprocessing → Parsing the content → Text Refinement → Analysis and Scoring
[Diagram] Sentiment Analysis / Opinion Mining: Machine Learning Technique, Statistical Analysis, Natural Language Processing
What is Multiple Sequence Alignment (MSA)?
• Aligning more than 2 sequences (streams of characters) to identify similarities between
them
• A common technique used in bioinformatics to align biological sequence data
• Proven to be a powerful technique in various fields of study
• Applicable to various Natural Language Processing (NLP) tasks
• Successfully used in NLP tasks such as machine translation and generation and
multi-document summarization
MSA - Applications
• Phylogenetic Analysis
• Predicting Protein Structure/Function
• Machine Translation
• Historical & Comparative Linguistics
• Multi-Document Summarization
Common Text Processing Techniques used in SA
• Data Filtering
• Stemming
• Stop word elimination
• Lexicon based approach
• Text Summarization
• Extractive and Abstractive
• Frequency Driven approach
Why not Multiple Sequence Alignment for Text Summarization in SA?
But, MSA has limitations?
• The exact method for MSA uses a Dynamic Programming approach (Needleman-Wunsch algorithm)
For 2 input sequences:
• Create an M*N matrix (M and N are the input sequence lengths)
• Fill the matrix based on character similarity
• Trace back to find the optimum alignment
Time & Space Complexity
• Pairwise Alignment: O(M*N)
• Multiple Sequence Alignment: O(m^n) for n sequences of length m
• For MSA, complexity increases exponentially with the number of sequences!
• Multiple Sequence Alignment, where more than 2 sequences are aligned to identify
similarities between them, is a classic NP-complete problem in computer science
High Performance Computing
• To Solve computational problems that need significant processing power and
resources.
• Reduces execution time and accommodates complicated problems.
• HPC Clusters, distributed computing, Cloud computing
Head Node
Worker 1 …… Worker N
The Big Data realm
Objectives
Presents a novel approach for Sentiment Analysis
With the data volume on the rise, how can the latest
developments in Big Data arena be utilized for
handling the load in SA?
MSA has been proven to be a successful technique
in many NLP tasks. The objective of this work is to
utilize MSA as a Text Summarization technique in
performing Sentiment Analysis.
Proposed Model
• Proposed Model has 4 major components
• Data Acquisition: collection of input data into the system
• Text Summarization: data is then shortened into a summarized form
• Feature Extraction: filter out features from the summarized data
• Sentiment Classification: identify the polarity of the sentiment based on the
extracted features
System Design
[Diagram: system design, with parallel implementation on a Hadoop cluster]
• Data Acquisition: input from social media via APIs, or manual input; text data is stored in HDFS
• Text Summarization (several parallel instances): data fetched from HDFS; summary text placed in a message queue
• Feature Extraction: feature vector placed in a message queue
• Classifier: a training job uses the training set; output is Positive or Negative
Modules in Detail - Data Acquisition
• The data acquisition module supports two types of input:
• APIs - social media sites like Twitter expose APIs for collecting data from their
sites
• File input - data can be collected manually and stored in files for the acquisition
module to process
• This module performs basic text processing, such as:
• Converting text to a single case and removing special characters like “#”
• This ensures the best possible alignment of text during summarization
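The basic text processing described above could be sketched roughly as follows. This is only an illustration: the function name `normalize`, the choice to keep decimal points inside numbers, and the restriction to ASCII letters are assumptions, not details taken from the slides.

```python
import re

def normalize(text: str) -> str:
    """Lowercase text and strip special characters (hypothetical helper)."""
    text = text.lower()
    # Replace anything that is not an ASCII letter, digit, whitespace, or period.
    text = re.sub(r"[^a-z0-9\s.]", " ", text)
    # Drop periods unless they sit between two digits (keeps "59.99").
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", text)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(normalize("The #Kindle is GREAT, just £59.99!"))
```

Normalizing case and punctuation first means that two reviews differing only in capitalization or hashtags produce identical tokens, which is what lets the later alignment step match them.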
Modules in Detail - HDFS
• Output of the data acquisition module is stored in HDFS
• HDFS is the distributed, scalable file system of the Hadoop framework
• Capable of holding large amounts of data
• Provides easy access
• Offers faster data retrieval and improved processing
• Data is broken down into blocks and stored across multiple machines in the
cluster
• Fault tolerant, as multiple copies of the same data are stored in the cluster
Modules in Detail – Text Summarization
• Core Module of the proposed system
• Sub components:
• HDFS Interface – Picks data available in the HDFS for summarization and then
invokes the MSA Module
• MSA Module – Performs summarization of input data.
Example input text:
The Amazon Kindle, updated for 2016 and now in its eighth iteration, is a brilliant buy for
book worms. The RRP is low at just £59.99 from Amazon, but as part of its Black Friday
deals, we could see it go even lower - right now the Amazon Kindle is going for just £56.99
at Amazon.
Example output text:
the amazon kindle updated for 2016 and now in its eighth iteration is a brilliant buy for
book worms the rrp is low at just 59.99 but as part of its black friday deals we could see it
go even lower
Text Summarization – MSA
• Needleman-Wunsch algorithm
• Parallel implementation using Hadoop Data Clusters
• Algorithm improved to fill only the 3 central diagonals of the matrix
[Diagram] S1 + S2 → AS1 (pairwise alignment); AS1 + S3 → AS2 (pairwise alignment)
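A pairwise Needleman-Wunsch alignment with an optional diagonal band could look roughly like the sketch below. It is not the authors' Java/MapReduce code. The function name, the scoring values (match +1, mismatch -1, gap -1), and the reading of "3 diagonals" as `band=1` are all assumptions made for illustration. The sketch still allocates the full matrix, so the band saves time but not memory.

```python
from typing import List, Optional, Sequence, Tuple

NEG_INF = float("-inf")

def needleman_wunsch(a: Sequence[str], b: Sequence[str], band: Optional[int] = None,
                     match: int = 1, mismatch: int = -1, gap: int = -1
                     ) -> Tuple[float, List[Tuple[str, str]]]:
    """Globally align a and b; return (score, list of aligned pairs).

    With band=k only cells where |i - j| <= k are filled;
    band=1 fills the main diagonal and its two neighbours.
    """
    m, n = len(a), len(b)
    if band is not None and abs(m - n) > band:
        raise ValueError("band too narrow for these sequence lengths")
    score = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0
    for i in range(m + 1):
        lo = 0 if band is None else max(0, i - band)
        hi = n if band is None else min(n, i + band)
        for j in range(lo, hi + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if i > 0 and j > 0:
                s = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + s)
            if i > 0:
                best = max(best, score[i - 1][j] + gap)  # gap in b
            if j > 0:
                best = max(best, score[i][j - 1] + gap)  # gap in a
            score[i][j] = best
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs: List[Tuple[str, str]] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", b[j - 1]))
            j -= 1
    return score[m][n], pairs[::-1]

s1 = "the kindle is a brilliant buy".split()
s2 = "the kindle is a great buy".split()
print(needleman_wunsch(s1, s2, band=1))
```

Here the sequences are lists of words rather than characters, which is also an assumption. The same function accepts plain strings if character-level alignment is wanted.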
MSA Implementation in Hadoop
Complexity Analysis of MSA
Needleman-Wunsch algorithm
• Complexity of pairwise alignment: O(m*n)
• MSA (n sequences of length m): O(m^n)
• Possible permutations: nPn = n!
• Possible pairwise alignments in each set: sum(1 to n-1)
• Time complexity: O((n-1)*m^2)
Hadoop Implementation
• Complexity of pairwise alignment: 3m+2
• MSA on Hadoop: O((n-1)*(3m+2))
Reduced Matrix Calculation
Modules in Detail – Feature Extraction
• Selects features from the summary text
• An N-gram model of feature extraction is implemented
• Output is the N-gram feature vector
• The feature vector is pushed to the message queue for the classifier module to pick up
[Diagram] Message Queue → Feature Extraction → Message Queue
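As a minimal sketch of N-gram feature extraction on a summary text: the slides say only that a standard Python implementation was used, so the function name `ngram_features`, whitespace tokenization, and the count-based vector are assumptions.

```python
from collections import Counter
from typing import Dict, Tuple

def ngram_features(text: str, n: int = 2) -> Dict[Tuple[str, ...], int]:
    """Count word n-grams in a whitespace-tokenized text (hypothetical helper)."""
    tokens = text.split()
    # zip over n shifted views of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(grams))

print(ngram_features("a brilliant buy a brilliant deal", n=2))
```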
Modules in Detail – Classifier
• Picks up the feature vector from the message queue
• Classifies it as either ‘Positive’ or ‘Negative’
• A Naïve Bayes classifier is used:
• a simple implementation of probabilistic classification
• assigns class labels to the feature vectors, with the labels drawn from a
finite set of training data
• The training data consists of processed comments labelled as positive or
negative
• The system produces a model of all unique words and their frequencies in the positive
and negative categories
• Classification is based on the probability of each summary text under the 2
categories
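The word-frequency model described above can be illustrated with a small multinomial Naïve Bayes in plain Python. This is a sketch only: the deck reports a TensorFlow implementation, and the class name, the add-one (Laplace) smoothing, and the toy training sentences here are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Tuple

class NaiveBayes:
    """Multinomial Naive Bayes over word counts (illustrative sketch)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                      # smoothing constant (assumed)
        self.word_counts: Dict[str, Counter] = {}
        self.doc_counts: Counter = Counter()
        self.vocab: set = set()

    def fit(self, docs: Iterable[Tuple[List[str], str]]) -> "NaiveBayes":
        for tokens, label in docs:
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens: List[str]) -> Dict[str, float]:
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)  # class prior
            for tok in tokens:
                if tok in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens: List[str]) -> str:
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)

# Toy training data, invented for this example.
train = [
    ("brilliant buy".split(), "positive"),
    ("great value brilliant".split(), "positive"),
    ("poor battery".split(), "negative"),
    ("poor screen bad buy".split(), "negative"),
]
model = NaiveBayes().fit(train)
print(model.predict("brilliant value".split()))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.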
Experimental Set Up
A virtual Hadoop cluster was set up on a Lenovo server with VirtualBox, using Red Hat as the
operating system. We built a cluster with 3 virtual nodes and one manager node to
manage the cluster.
• The MSA MapReduce code was written in Java and executed on Hadoop 2.7.3.
• Open MQ was used as the JMS provider.
• Feature extraction was implemented in Python using a standard implementation.
• The Naïve Bayes classifier was implemented using the ‘TensorFlow’ software library.
Test Data
• The sample dataset consists of 1000 product reviews from specific categories, collected from online
product review sites.
• 500 positive and 500 negative reviews were used as the data set.
• 80% was used as training data and the remaining 20% as test data.
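An 80/20 split of a balanced review set might be sketched as below. The slides do not say how the split was drawn, so splitting each class separately (a stratified split), the fixed seed, and the placeholder review texts are assumptions.

```python
import random
from typing import Dict, List, Sequence, Tuple

Review = Tuple[str, str]  # (text, label)

def stratified_split(items: Sequence[Review], train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[Review], List[Review]]:
    """Split each label group separately so both sets keep the class balance."""
    rng = random.Random(seed)
    by_label: Dict[str, List[Review]] = {}
    for text, label in items:
        by_label.setdefault(label, []).append((text, label))
    train: List[Review] = []
    test: List[Review] = []
    for group in by_label.values():
        rng.shuffle(group)
        cut = int(round(len(group) * train_fraction))
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test

# Placeholder data with the same shape as the described set: 500 + 500 reviews.
data = [(f"review {i}", "positive") for i in range(500)]
data += [(f"review {i}", "negative") for i in range(500, 1000)]
train_set, test_set = stratified_split(data)
print(len(train_set), len(test_set))
```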
Results
[Figure: Measures of Sentiment Analysis using MSA. Categories: Sensitivity, Precision, Accuracy, Specificity; axis range 0.64 to 0.68]
[Figure: Comparative view. Categories: Stopword elimination, Stemming; axis range 0.665 to 0.71]
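For reference, the four measures named in the chart are defined from the confusion-matrix counts as in the sketch below. The counts in the usage line are invented for illustration and are not the results reported in this work.

```python
from typing import Dict

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard binary-classification measures; assumes non-zero denominators."""
    return {
        "sensitivity": tp / (tp + fn),                  # true positive rate
        "specificity": tn / (tn + fp),                  # true negative rate
        "precision": tp / (tp + fp),                    # positive predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for a 200-review test set (100 positive, 100 negative).
metrics = classification_metrics(tp=70, fp=40, tn=60, fn=30)
print(metrics)
```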
Conclusion
• MSA is a promising technique for Sentiment Analysis
• The improved algorithm for MSA resulted in reduced complexity
• Parallel implementation of MSA using Hadoop Data Clusters
• The solution provides:
• Accuracy on par with other common techniques such as stemming and stop word
elimination
• Scalability, as it uses the Hadoop MapReduce framework
Future Directions
• Only MSA module has been implemented in MapReduce
• Feature extraction and the classifier modules can also be implemented in
MapReduce
• To handle huge volume of input text data.
• To improve performance and scalability.
• Machine learning techniques along with MSA would be a good option for text
summarization.
• Explore other Big data Frameworks and classification techniques.
• Comparison with other data processing techniques.
References
• Osama M. Rababah, Ahmad K. Hwaitat, Dana A. Al Qudah. (2016) Sentiment analysis as a way of web
optimization. Academicjournals.org/SRE (2DFDF3858431), Volume: 11(8), pages: 90-96
• S. B. Needleman and C. D. Wunsch. (1970) A general method applicable to the search for similarities in the amino
acid sequence of two proteins. Journal of Molecular Biology, Volume: 48(3), pages: 443–453
• Sudha Sadasivam G Baktavatchalam G. (2010). A novel approach to multiple sequence alignment using Hadoop
data grids. International Journal of Bioinformatics Research and Applications. Volume: 6(5), pages: 472-83
• V. Finley Lacatusu, Steven J. Maiorano and Sanda M. Harabagiu. (2004) Multi-Document Summarization using
Multiple-Sequence Alignment. LREC (2004)
• Sara A. Shehab, Arabi Keshk, Hany Mahgoub. (2012) Fast Dynamic Algorithm for Sequence Alignment based on
Bioinformatics. International Journal of Computer Applications (0975 – 8887) Volume: 37(7), pages: 54-61
Thank You