1. Optimal Sentiment Analysis using
Multiple Sequence Alignment in a
High Performance Computing Environment
Vineetha. V and Achuthsankar S. Nair
Dept. of Computational Biology & Bioinformatics
University of Kerala
2. High-level Summary
ICBAI, 2016
Presents a novel way of using Multiple Sequence Alignment (MSA) as a text summarization tool for Sentiment Analysis.
• Improved algorithm for MSA
• Implementation on Hadoop to handle large volumes of data
• Supervised learning using a Naïve Bayes classifier
• MSA is a promising option: results are on par with other data processing techniques
3. Sentiment Analysis
• Mining public opinion and social interactions from the web is termed Sentiment Analysis
• People express their opinions and suggestions through:
• Social media (Twitter, Facebook)
• Blogs (Google blogs, personalized sites)
• E-commerce sites (Amazon, eBay, Flipkart)
• Review sites (Mouthshut, CNET)
• Discussion forums
• Sentiment Analysis helps identify the orientation of an opinion
4. Relevance of Sentiment Analysis
Business & Organizations
• Product reviews & consumer attitudes
• Trends
• Market intelligence
• Product and service benchmarking
Individuals
• Purchasing a product
• Using a service
• Other decision-making tasks
• Finding like-minded individuals or communities
Politics & Government
• Knowing the views of the general public
• Surveys on a topic
• Opinion polls to decide on policy formation or modification
5. SA – Use Cases
• Retail supply chain: identifying customer sentiment to improve product features and decide on new products
• Web optimization: sentiment analysis and opinion mining to offer the best online experience to users
• Social networking sites: better understanding of social dynamics
• Historical and comparative linguistics: supporting the comparative method by which linguists traditionally reconstruct languages
• Business and marketing research: analyzing series of purchases over time
6. Sentiment Analysis – General Workflow
Workflow stages:
• Text Preprocessing
• Parsing the content
• Text Refinement
• Analysis and Scoring
Sentiment Analysis / Opinion Mining draws on:
• Machine Learning Techniques
• Statistical Analysis
• Natural Language Processing
7. What is Multiple Sequence Alignment (MSA)?
• Aligning more than 2 sequences (streams of characters) to identify similarities between them
• A common technique in bioinformatics for aligning biological sequence data
• Proven to be a powerful technique in various fields of study
• Applicable to various Natural Language Processing (NLP) tasks
• Successfully used in NLP tasks such as machine translation and generation, and multi-document summarization
9. Common Text Processing Techniques used in SA
• Data Filtering
• Stemming
• Stop word elimination
• Lexicon based approach
• Text Summarization
• Extractive and Abstractive
• Frequency Driven approach
Why not Multiple Sequence Alignment for Text Summarization in SA?
10. But, MSA has limitations?
• The exact method for MSA uses a dynamic programming approach (Needleman-Wunsch algorithm)
For 2 input sequences:
• Create an $M \times N$ matrix (M and N are the input sequence sizes)
• Fill the matrix based on character similarity
• Trace back to find the optimum alignment
Time & Space Complexity
• Pairwise alignment: $M \times N$
• Multiple Sequence Alignment: $M^N$ (for N sequences of length M)
• For MSA, complexity increases exponentially with the number of sequences!
• MSA, where more than 2 sequences are aligned to identify similarities between them, is a classic NP-complete problem in computer science
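The three steps listed for two input sequences (create the matrix, fill it, trace back) can be sketched as follows. This is a minimal illustration, not the authors' code: the scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the `GAP` symbol, and the function name are assumptions, and the sequences here are lists of words, although the same function works on plain strings.

```python
GAP = "-"  # hypothetical gap symbol, chosen for this sketch

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align sequences a and b; return (score, aligned_a, aligned_b)."""
    m, n = len(a), len(b)

    def sub(i, j):
        # Similarity of a[i-1] and b[j-1] (1-based matrix indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # Step 1: create the (m+1) x (n+1) matrix; row 0 and column 0 are all-gap prefixes.
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap

    # Step 2: fill every cell from its diagonal, upper and left neighbours.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Step 3: trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return score[m][n], out_a[::-1], out_b[::-1]


s1 = "the kindle is a brilliant buy".split()
s2 = "the kindle is brilliant".split()
nw_score, aligned_1, aligned_2 = needleman_wunsch(s1, s2)
```

Both the fill and the stored matrix grow with the product of the two sequence lengths, which is the pairwise cost quoted on the slide.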
11. High Performance Computing
• Solves computational problems that need significant processing power and resources
• Reduces execution time and accommodates complicated problems
• HPC clusters, distributed computing, cloud computing
[Figure: a head node coordinating Worker 1 … Worker N]
13. Objectives
• Present a novel approach for Sentiment Analysis.
• With data volume on the rise, how can the latest developments in the Big Data arena be used to handle the load in SA?
• MSA has proven to be a successful technique in many NLP tasks. The objective of this work is to use MSA as a text summarization technique for Sentiment Analysis.
14. Proposed Model
• The proposed model has 4 major components:
• Data Acquisition: collects input data into the system
• Text Summarization: shortens the data into a summarized form
• Feature Extraction: filters out features from the summarized data
• Sentiment Classification: identifies the polarity of the sentiment based on the extracted features
15. System Design
[Figure: system design]
• Social media data enters Data Acquisition through APIs or manual input
• Text data is stored in HDFS
• Data is fetched from HDFS by several Text Summarization instances (parallel implementation on the Hadoop cluster)
• Summary text is placed in a message queue
• Feature Extraction places the feature vector in a message queue
• The Classifier, trained by a training job on the training set, labels the text Positive or Negative
16. Modules in Detail - Data Acquisition
• The data acquisition module supports two types of input:
• APIs: social media sites like Twitter expose APIs for collecting data from their sites
• File input: data can be collected manually and stored in files for the acquisition module to process
• This module performs basic text processing, such as:
• Converting text to a single case and removing special characters like “#”
• This helps the text align as well as possible during summarization
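The basic text processing described for this module could look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function name `normalize` is hypothetical, and the rule used here (keep only ASCII letters, digits, and spaces) is one possible reading of "special characters".

```python
import re

def normalize(text):
    """Lowercase the text and replace special characters such as '#' with spaces."""
    lowered = text.lower()
    # Assumption: anything that is not an ASCII letter, digit or whitespace is "special".
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    # Collapse the runs of whitespace left behind by the substitution.
    return " ".join(cleaned.split())


normalized = normalize("Loving my new #Kindle!!  Great BUY @Amazon")
```

A rule this strict also drops punctuation inside numbers and any non-ASCII letters, so a real pipeline would likely need a more careful character class.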
17. Modules in Detail - HDFS
• The output of the data acquisition module is stored in HDFS
• HDFS is the distributed, scalable file system of the Hadoop framework
• Capable of holding large amounts of data
• Provides easy access
• Offers faster data retrieval and improved processing
• Data is broken down into small blocks and stored across multiple machines in the cluster
• Fault tolerant, as multiple copies of the same data are stored in the cluster
18. Modules in Detail – Text Summarization
• Core Module of the proposed system
• Sub components:
• HDFS Interface – Picks data available in the HDFS for summarization and then
invokes the MSA Module
• MSA Module – Performs summarization of input data.
Sample input text:
The Amazon Kindle, updated for 2016 and now in its eighth iteration, is a brilliant buy for
book worms. The RRP is low at just £59.99 from Amazon, but as part of its Black Friday
deals, we could see it go even lower - right now the Amazon Kindle is going for just £56.99
at Amazon.
Text after processing:
the amazon kindle updated for 2016 and now in its eighth iteration is a brilliant buy for
book worms the rrp is low at just 59.99 but as part of its black friday deals we could see it
go even lower
19. Text Summarization – MSA
• Needleman-Wunsch algorithm
• Parallel implementation using Hadoop Data Clusters
• Algorithm improved to fill only the 3 main diagonals of the matrix
[Figure: S1 and S2 are aligned pairwise to give AS1; AS1 and S3 are then aligned pairwise to give AS2]
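Filling only the main diagonals can be illustrated with a banded variant of the Needleman-Wunsch fill. The sketch below is an assumption-laden illustration rather than the authors' algorithm: the function name, the `band` parameter, and the scoring values are hypothetical, and it computes only the alignment score plus a count of the cells it visits. With `band=1` it touches the main diagonal and its two neighbours, so the work grows roughly as 3m for sequences of length m instead of m squared. It still allocates the full matrix for clarity; a real implementation would store only the band.

```python
NEG_INF = float("-inf")

def banded_nw_score(a, b, band=1, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score restricted to cells with |i - j| <= band.

    Returns (score, cells_filled). Cells outside the band stay at -infinity,
    so alignments that drift further from the diagonal are never considered.
    """
    m, n = len(a), len(b)
    if abs(m - n) > band:
        # The bottom-right corner would lie outside the band.
        raise ValueError("length difference exceeds the band width")

    score = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    cells_filled = 0
    for i in range(m + 1):
        for j in range(max(0, i - band), min(n, i + band) + 1):
            if i == 0 and j == 0:
                best = 0
            else:
                best = NEG_INF
                if i > 0 and j > 0:
                    step = match if a[i - 1] == b[j - 1] else mismatch
                    best = score[i - 1][j - 1] + step
                if i > 0:
                    best = max(best, score[i - 1][j] + gap)
                if j > 0:
                    best = max(best, score[i][j - 1] + gap)
            score[i][j] = best
            cells_filled += 1
    return score[m][n], cells_filled


r1 = "the kindle is a brilliant buy".split()
r2 = "the kindle is a great buy".split()
banded_score, banded_cells = banded_nw_score(r1, r2)
```

The restriction trades optimality for speed: the result matches the full algorithm only when an optimal alignment stays inside the band, which is why the sketch rejects inputs whose lengths differ by more than the band width.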
21. Complexity Analysis of MSA
Needleman-Wunsch algorithm
• Complexity of pairwise alignment: $O(m \cdot n)$
• MSA ($n$ sequences of length $m$): $O(m^n)$
• Possible permutations: ${}^{n}P_{n} = n!$
• Possible pairwise alignments in each set: sum(1 to n-1)
• Time complexity: $O((n-1) \cdot m^2)$
Hadoop implementation (reduced matrix calculation)
• Complexity of pairwise alignment: $3m + 2$
• MSA on Hadoop: $O((n-1) \cdot (3m + 2))$
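To make the gap between the two cost expressions concrete, the short sketch below plugs sample values into them. The function names and the sample sizes (10 sequences of 100 tokens) are assumptions for illustration; the formulas are the ones stated on this slide, counted in matrix cells per chain of n-1 pairwise alignments.

```python
def full_matrix_cells(m):
    """Cells filled by one pairwise alignment of two length-m sequences."""
    return m * m

def reduced_matrix_cells(m):
    """Cells filled per pairwise alignment in the reduced calculation (3m + 2)."""
    return 3 * m + 2

def chained_msa_cost(n, m, cells_per_pair):
    """Total cells for n sequences aligned through n - 1 pairwise alignments."""
    return (n - 1) * cells_per_pair(m)


full_cost = chained_msa_cost(10, 100, full_matrix_cells)
reduced_cost = chained_msa_cost(10, 100, reduced_matrix_cells)
```

For these sample sizes the full calculation fills 90,000 cells and the reduced one 2,718, roughly a 33-fold reduction, and the ratio widens as m grows.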
22. Modules in Detail – Feature Extraction
• Selects features from the summary text
• An N-gram model of feature extraction is implemented
• The output is the N-gram feature vector
• The feature vector is pushed to the message queue for the classifier module to pick up
[Figure: Message Queue → Feature Extraction → Message Queue]
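An N-gram feature vector of the kind described here can be sketched in a few lines. This is an illustration only: the function name, whitespace tokenization, and the choice of word-level N-grams counted in a `Counter` are assumptions, since the slides do not say how the feature vector is represented.

```python
from collections import Counter

def ngram_features(text, n=2):
    """Return a Counter mapping each word-level n-gram in text to its frequency."""
    tokens = text.split()
    # A text with fewer than n tokens yields no n-grams (empty range).
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


bigrams = ngram_features("a brilliant buy for book worms", n=2)
```

Setting `n=1` gives plain word counts, which is the representation a word-frequency classifier would consume.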
23. Modules in Detail – Classifier
• Picks up the feature vector from the message queue
• Classifies it as either ‘Positive’ or ‘Negative’
• A Naïve Bayes classifier is used:
• A simple implementation of probabilistic classification
• Assigns class labels to feature vectors based on the class labels in a finite set of training data
• The training data consists of processed comments labeled as positive or negative
• The system builds a model of all unique words and their frequencies in the positive and negative categories
• Classification is based on the probability of each summary text in the 2 categories
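The word-frequency model and the probability comparison described above can be sketched with the standard library alone. This is a stand-in for illustration, not the authors' implementation: the class name, the add-one (Laplace) smoothing, the use of log probabilities, and the tiny training set are all assumptions.

```python
import math
from collections import Counter

class WordNaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)   # documents per category
        self.word_counts = {}               # category -> word frequencies
        for text, label in zip(texts, labels):
            self.word_counts.setdefault(label, Counter()).update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each known word.
            logp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in text.split():
                if word in self.vocab:  # words never seen in training are ignored
                    logp += math.log((counts[word] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


model = WordNaiveBayes().fit(
    ["brilliant buy", "great value love it", "poor battery", "terrible screen hate it"],
    ["Positive", "Positive", "Negative", "Negative"],
)
prediction = model.predict("brilliant value")
```

Smoothing keeps a single unseen word-category pair from zeroing out the whole probability, and summing logs avoids underflow on longer summaries.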
24. Experimental Set Up
A virtual Hadoop cluster was set up on a Lenovo server using VirtualBox, with Red Hat as the
operating system. The cluster had 3 virtual nodes and one manager node to manage it.
The MSA MapReduce code was written in Java and executed on Hadoop 2.7.3.
Open MQ was used as the JMS provider.
Feature extraction was implemented in Python using a standard implementation.
The Naïve Bayes classifier was implemented using the ‘TensorFlow’ software library.
Test Data
• The sample dataset consisted of 1000 product reviews from specific categories, collected from online product review sites
• 500 positive and 500 negative reviews were used as the data set
• 80% was used as training data and the remaining 20% as test data
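An 80/20 split of a balanced data set like this one could be produced as sketched below. The slides do not say how the split was drawn, so the per-class (stratified) shuffling, the fixed seed, and the function name are assumptions made for illustration; the placeholder review strings are not real data.

```python
import random

def stratified_split(items, labels, train_fraction=0.8, seed=0):
    """Split (item, label) pairs so each class keeps the same train/test ratio."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in zip(items, labels):
        by_label.setdefault(label, []).append(item)

    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += [(item, label) for item in group[:cut]]
        test += [(item, label) for item in group[cut:]]
    return train, test


# Placeholder identifiers standing in for 500 positive and 500 negative reviews.
reviews = [f"review-{i}" for i in range(1000)]
polarity = ["Positive"] * 500 + ["Negative"] * 500
train_set, test_set = stratified_split(reviews, polarity)
```

Splitting within each class keeps both subsets balanced, so accuracy on the 200 held-out reviews is not skewed toward one polarity.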
26. Conclusion
• MSA is a promising technique for Sentiment Analysis
• The improved MSA algorithm reduced complexity
• MSA was implemented in parallel using Hadoop data clusters
• The solution provides:
• Accuracy on par with other common techniques such as stemming and stop word elimination
• Scalability, as it uses the Hadoop MapReduce framework
27. Future Directions
• Only the MSA module has been implemented in MapReduce
• The feature extraction and classifier modules can also be implemented in MapReduce
• To handle huge volumes of input text data
• To improve performance and scalability
• Combining machine learning techniques with MSA would be a good option for text summarization
• Explore other Big Data frameworks and classification techniques
• Compare with other data processing techniques
28. References
• Osama M. Rababah, Ahmad K. Hwaitat, Dana A. Al Qudah. (2016) Sentiment analysis as a way of web
optimization. Academicjournals.org/SRE (2DFDF3858431), Volume: 11(8), pages: 90-96
• S. B. Needleman and C. D. Wunsch. (1970) A general method applicable to the search for similarities in the amino
acid sequence of two proteins. Journal of Molecular Biology, Volume: 48(3), pages: 443–453
• Sudha Sadasivam G., Baktavatchalam G. (2010) A novel approach to multiple sequence alignment using Hadoop
data grids. International Journal of Bioinformatics Research and Applications, Volume: 6(5), pages: 472-83
• V. Finley Lacatusu, Steven J. Maiorano and Sanda M. Harabagiu. (2004) Multi-Document Summarization using
Multiple-Sequence Alignment. LREC (2004)
• Sara A. Shehab, Arabi Keshk, Hany Mahgoub. (2012) Fast Dynamic Algorithm for Sequence Alignment based on
Bioinformatics. International Journal of Computer Applications (0975 – 8887) Volume: 37(7), pages: 54-61