Customer_Analysis.docx

1) Title Page – A template for this is
provided on CANVAS (add link)

Abstract
This project entails the development of a sentiment analysis model for Amazon product
reviews using machine learning techniques. The primary objective is to classify user reviews
as either positive or negative, providing valuable insights into customer sentiment.
The project begins with data pre-processing, which involves importing necessary libraries,
extracting and splitting the Amazon review dataset into training and testing sets. Text
cleaning is performed to remove non-alphanumeric characters, reduce whitespace, and
convert text to lowercase. The data is then visualized using a word cloud, illustrating the
most common words in the cleaned training text.
Tokenization and sequence padding are crucial steps, as they enable the conversion of
textual data into numeric sequences suitable for deep learning models. This process helps
in preparing the data for the subsequent sentiment analysis.
The heart of the project lies in developing a sentiment analysis model. Multiple machine
learning models are explored and compared for their performance. These models include
decision trees, support vector machines, logistic regression, and more. The project aims
to identify the best-performing model based on evaluation metrics.
Model training is executed for two epochs, and the best model is saved using a checkpoint
mechanism. The trained model is then evaluated on the testing dataset, yielding important
performance metrics such as loss and accuracy.
To assess the model's effectiveness, predictions are made on the test data, and a
confusion matrix is generated, revealing the model's ability to correctly classify positive
and negative sentiments. Additionally, a comprehensive classification report is provided,
offering insights into precision, recall, F1-score, and support for both positive and negative
sentiment classifications, along with overall accuracy.
In summary, this project showcases the development and evaluation of a sentiment
analysis model for Amazon product reviews using various machine learning techniques.
The identification of the best model, along with a comprehensive performance assessment,
makes it a valuable tool for understanding customer sentiment and feedback, aiding
businesses in making data-driven decisions to enhance their products and services.

4) Contents
This lists the Chapter and Sections in the report, including headings and page numbers.
This document has an example of an automatically generated Contents List; most
commonly used word processors support this.
You may also require a List of Tables and/or List of Figures. These should be constructed
on separate pages.
Do not be tempted to make your sections too granular by creating four or five layers of
sub-headings.

5) Glossary
This section contains an alphabetical list of all specialist vocabulary, nomenclature, acronyms
and/or symbols used in the report. Each should have a brief explanation of its meaning.

Introduction
In the contemporary world, where commercial activities are predominantly conducted on online
platforms, the way people trade products has undergone a significant transformation. E-
commerce websites have become the epicenters of this digital marketplace, and as a result, the
act of reviewing products before making a purchase has become a commonplace practice. In
today's landscape, consumers heavily rely on these product reviews to make informed
decisions, considering them as a pivotal part of their buying process. Consequently, the
analysis of data derived from these customer reviews has emerged as an essential and
dynamic field.
The growth of e-commerce has created a global marketplace where consumers from around
the world can access an astonishing variety of products and services with just a few clicks. This
convenience, while empowering, also poses new challenges. The online marketplace is
teeming with options, and consumers are often overwhelmed by the sheer volume of choices
available to them. In such a scenario, product reviews have become more than just helpful;
they are integral in guiding consumers through this digital maze.
In this era marked by the proliferation of machine learning-based algorithms, manually
scrutinizing thousands of reviews to understand a product's appeal can be an exceedingly time-
consuming endeavor. The volume of user-generated content is staggering, and traditional
methods for extracting insights from this wealth of data are no longer adequate. In response to
this challenge, the objective of this project is to categorize the vast array of customer feedback
into positive and negative sentiments. By doing so, it seeks to create a supervised learning
model capable of polarizing a substantial volume of reviews, making them more
comprehensible to a global audience.
The importance of understanding customer sentiment extends beyond individual purchase
decisions. When an online item garners a multitude of positive reviews, it serves as a
resounding endorsement of the item's authenticity and quality. Conversely, products, such as
books or any other commodities, that lack reviews often leave potential buyers in a state of
uncertainty. Put simply, a greater number of reviews exude greater credibility. People attach
immense value to the consensus and experiences of others, and reviews serve as a conduit to
understanding the collective sentiment on a product.
Opinions, derived from the experiences of users, have a direct impact on future consumer
purchasing decisions. In particular, negative reviews can significantly deter potential buyers,
leading to a decline in sales. Therefore, the goal for those involved in this field is to
comprehend and categorize customer feedback on a large scale.
This project addresses the critical need for automating sentiment analysis of customer reviews
in the e-commerce sector, recognizing the immense influence these reviews have on
purchasing decisions and product credibility. It aims to empower businesses with the tools to
harness this wealth of customer feedback to improve their products and services and enhance
the overall shopping experience for consumers.
In the following sections, we will delve into the specific objectives, methodologies, and tools
employed in achieving this goal, emphasizing the profound impact of sentiment analysis in the
modern landscape of e-commerce.

Literature Review:
The realm of product reviews, sentiment analysis, and opinion mining has garnered
considerable attention in recent research papers. This review delves into some of the notable
works in this domain, highlighting their methodologies and findings.
In the work by Elli, Maria, and Yi-Fan [3], sentiment extraction from reviews was a central focus.
They analyzed the results to construct a business model, claiming that the tools employed
exhibited robustness and yielded high accuracy. Additionally, their research encompassed
diverse aspects such as emotion detection from reviews, gender prediction based on names,
and the identification of fake reviews. Python was the preferred programming language, and
Multinomial Naïve Bayesian (MNB) and Support Vector Machine (SVM) emerged as primary
classifiers.
In another paper [4], existing supervised learning algorithms were applied to predict review
ratings on a numerical scale using text data exclusively. The study incorporated hold-out cross-
validation, with 70% of the data allocated for training and 30% for testing. Various classifiers
were employed to ascertain precision and recall values.
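The 70/30 hold-out split described above can be sketched in a few lines. This is an illustrative sketch only, not the procedure used in [4]; the function name `holdout_split`, the fixed seed, and the toy review list are assumptions introduced here.

```python
import random

def holdout_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the items and cut it into training and testing parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical stand-in for a list of review texts.
reviews = [f"review {i}" for i in range(10)]
train, test = holdout_split(reviews)
print(len(train), len(test))  # 7 3
```

Shuffling before cutting matters when the source file is ordered (for example by product or by rating), since an unshuffled cut would give the two parts different distributions.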
Expanding the scope to Amazon review datasets, a study in [5] applied and extended natural
language processing and sentiment analysis. The authors used Naïve Bayesian and decision
list classifiers to tag reviews as positive or negative, with a focus on the books and Kindle
sections of Amazon.
A unique approach was undertaken in [6], where the objective was to visualize review
sentiments through charts. Data was scraped from Amazon URLs, preprocessed, and then
analyzed using NB, SVM, and maximum entropy. This paper emphasized summarizing product
reviews, with results presented in statistical charts.
In yet another research endeavor [7], a model was constructed to predict product ratings based
on rating text, utilizing a bag-of-words approach. Unigrams and bigrams were tested, with
unigrams outperforming bigrams. The study examined Amazon video game user reviews and
observed that time-based models did not perform well due to small variance in average ratings
over time.
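To make the unigram and bigram distinction concrete, the following minimal sketch builds bag-of-words counts for both settings. The helper `ngram_counts`, the whitespace tokenization, and the sample review are assumptions for illustration and are not taken from [7].

```python
from collections import Counter

def ngram_counts(text, n):
    """Count contiguous n-word sequences in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

review = "not good not fun"  # hypothetical review text
unigrams = ngram_counts(review, 1)
bigrams = ngram_counts(review, 2)

print(unigrams["not"])       # 2
print(bigrams["not good"])   # 1
```

Bigrams keep local word order (so "not good" is one feature), but they also produce a much larger and sparser vocabulary than unigrams.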
Feature extraction and selection techniques for sentiment analysis were explored in [8], with a
focus on Amazon datasets. The study included preprocessing steps such as stop word removal
and the removal of special characters. The Naive Bayes classifier was employed, with an
emphasis on phrase-level features, ultimately concluding that Naive Bayes performed better at
the phrase level than with single words or multiword features.
Simpler algorithms took center stage in [9], with an emphasis on comprehensibility. Support
Vector Machine (SVM), logistic regression, and decision tree methods were utilized, yielding
high accuracy for SVM but limitations in handling extensive datasets.
In [10], the addition of TF-IDF as an experiment to predict ratings using the bag-of-words
approach was noteworthy. While the study incorporated root mean square error and a linear
regression model, it used only a limited number of classifiers.
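As a rough illustration of the TF-IDF weighting mentioned above, the sketch below uses one common variant: term frequency is the count of a term divided by the document length, and inverse document frequency is the natural logarithm of the number of documents divided by the number of documents containing the term. The exact variant used in [10] is not stated, so this choice, the function name `tf_idf`, and the two sample documents are assumptions.

```python
import math

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = {}
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = []
    for tokens in tokenized:
        row = {}
        for term in set(tokens):
            tf = tokens.count(term) / len(tokens)
            row[term] = tf * math.log(n_docs / df[term])
        weights.append(row)
    return weights

docs = ["great game great fun", "boring game"]  # hypothetical reviews
weights = tf_idf(docs)
print(weights[0]["game"])             # 0.0
print(round(weights[0]["great"], 3))  # 0.347
```

The word "game" appears in every document, so its weight is zero, while "great" is frequent in one document and absent from the other, so it receives a higher weight.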
Bing Liu's "Sentiment Analysis and Opinion Mining: Synthesis Lectures on Human Language
Technologies" provides a comprehensive overview of the field of sentiment analysis, laying a
solid foundation for understanding the concepts, techniques, and applications of this crucial
field in natural language processing. The book effectively balances theoretical underpinnings
with practical applications, making it a valuable resource for researchers, practitioners, and
students alike.
Key Contributions
1. Comprehensive Coverage: The book offers a thorough exploration of sentiment analysis,
covering various aspects such as sentiment classification, opinion summarization, opinion
target extraction, and sentiment lexicons.
2. Theoretical Foundations: Liu delves into the theoretical underpinnings of sentiment analysis,
discussing the underlying principles and methodologies that form the basis of sentiment
analysis techniques.
3. Practical Applications: The book highlights the practical applications of sentiment analysis in
various domains, including market research, product reviews, social media analysis, and
customer relationship management.
4. Machine Learning and NLP Techniques: Liu extensively discusses the machine learning and
natural language processing techniques employed in sentiment analysis, providing insights into
the algorithms and tools used for sentiment extraction and classification.
Impact and Significance
Liu's book has been widely recognized as a seminal work in the field of sentiment analysis,
significantly contributing to the advancement of this area. The book has been cited extensively
in research papers and has influenced the development of numerous sentiment analysis tools
and applications.
Bo Pang and Lillian Lee's "Opinion Mining and Sentiment Analysis: Foundations and Trends in
Information Retrieval" delves into the intricate world of opinion mining and sentiment analysis,
providing a comprehensive overview of this rapidly evolving field. The authors meticulously
trace the historical development of sentiment analysis, highlighting its emergence from the
realm of information retrieval and its transformation into a multifaceted discipline.
Key Contributions
Historical Perspective: Pang and Lee offer a detailed historical account of sentiment analysis,
tracing its origins to text classification and information retrieval techniques. They shed light on
the evolution of sentiment analysis methodologies and the factors that have driven its growth.
Task Classification: The authors provide a clear classification of sentiment analysis tasks,
distinguishing between sentiment classification, opinion summarization, and opinion target
extraction. This categorization helps readers understand the different objectives and
approaches within sentiment analysis.
Technique Evaluation: Pang and Lee extensively evaluate the various techniques employed for
sentiment analysis, including supervised learning, unsupervised learning, and lexicon-based
methods. They provide a balanced assessment of the strengths and limitations of each
approach.
Broader Implications: The authors extend their discussion beyond technical aspects and
explore the broader implications of sentiment analysis, addressing issues of privacy,
manipulation, and economic impact. This holistic perspective adds depth and relevance to the
discussion.
Pang and Lee's work has been widely recognized as a foundational contribution to the field of
sentiment analysis. Their comprehensive review has served as a valuable resource for
researchers, practitioners, and students, providing a roadmap for understanding and advancing
the field.
Peter Turney's paper, "Thumbs Up or Thumbs Down? Semantic Orientation Applied to
Unsupervised Classification of Reviews," marks a significant milestone in the field of sentiment
analysis. Turney introduces the concept of semantic orientation, a measure of the positivity or
negativity of a text, and proposes a simple unsupervised method for classifying reviews as
positive or negative based on this orientation.
Key Contributions
Semantic Orientation: Turney's introduction of semantic orientation provides a valuable tool for
understanding sentiment polarity in textual data. This concept goes beyond simple word counts
and delves into the inherent meaning of words and phrases to assess their sentiment.
Unsupervised Classification: The proposed unsupervised method for classifying reviews offers
a practical solution for sentiment analysis tasks where labeled data is limited or unavailable.
This method is based on the premise that words with positive associations tend to co-occur with
other positive words, and vice versa.
Empirical Evaluation: Turney provides empirical evidence to support the effectiveness of his
method, demonstrating its ability to accurately classify reviews as positive or negative. This
validation adds credibility to the proposed approach.
Foundation for Future Work: Turney's work has laid the foundation for further research in
sentiment analysis, inspiring the development of more sophisticated methods that build upon
the concept of semantic orientation.
Turney's paper has had a profound impact on the field of sentiment analysis, influencing the
development of numerous sentiment analysis techniques and applications. The concept of
semantic orientation has become a cornerstone of sentiment analysis, and Turney's
unsupervised method has served as a benchmark for evaluating other approaches.
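The co-occurrence idea behind semantic orientation can be sketched as follows. In Turney's paper the orientation of a phrase is the difference between its pointwise mutual information with the reference word "excellent" and with the reference word "poor", estimated from co-occurrence counts; a review is then labeled by the average orientation of its phrases. The counts below are invented for illustration, and the function and parameter names are assumptions.

```python
import math

def semantic_orientation(near_pos, near_neg, hits_pos, hits_neg, smoothing=0.01):
    """PMI(phrase, positive word) minus PMI(phrase, negative word), in bits.

    near_pos / near_neg: co-occurrence counts of the phrase with each reference word.
    hits_pos / hits_neg: overall counts of the reference words themselves.
    A small smoothing constant avoids division by zero for unseen pairs.
    """
    return math.log2(
        ((near_pos + smoothing) * hits_neg) / ((near_neg + smoothing) * hits_pos)
    )

# Hypothetical counts: the phrase co-occurs more often with the positive word.
so_phrase = semantic_orientation(near_pos=80, near_neg=20, hits_pos=1000, hits_neg=1000)
label = "positive" if so_phrase > 0 else "negative"
print(label)  # positive
```

A positive score means the phrase is more strongly associated with the positive reference word than with the negative one; no labeled training reviews are needed, which is what makes the method unsupervised.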
Minqing Hu and Bing Liu's paper, "Mining and Summarizing Customer Reviews," addresses the
challenge of extracting key opinions and generating informative summaries from vast amounts
of customer reviews. Their proposed method integrates sentiment analysis and text
summarization techniques to effectively mine and summarize customer reviews, providing
valuable insights for businesses and consumers alike.
Key Contributions
Integrated Approach: Hu and Liu's method combines the strengths of sentiment analysis and
text summarization to provide a comprehensive approach for review mining and summarization.
This integration allows for the identification of both sentiment and key topics, leading to more
meaningful summaries.
Opinion Extraction: The method effectively extracts opinions expressed by customers, focusing
on identifying product features and the corresponding sentiment expressed towards them. This
granular approach provides a detailed understanding of customer perceptions.
Summary Generation: The method generates summaries that capture the essence of customer
reviews, highlighting key opinions and providing a concise overview of the overall sentiment.
These summaries serve as valuable resources for businesses to gauge customer sentiment
and improve their products or services.
Empirical Validation: Hu and Liu validate their method through experiments on real-world
customer reviews, demonstrating its effectiveness in identifying sentiment, extracting opinions,
and generating informative summaries.
Hu and Liu's work has significantly impacted the fields of sentiment analysis and text
summarization, inspiring the development of numerous methods and applications for review
mining and summarization. Their approach has been adopted by businesses to gain insights
from customer feedback and make informed decisions.
Mike Thelwall's paper, "Heart and Soul: Sentiment Strength Detection in the Social Web with
Recursive Neural Networks," pioneers the application of recursive neural networks (RNNs) to
sentiment analysis, specifically focusing on detecting the strength of sentiment in social media
content. This novel approach introduces a deeper level of understanding to sentiment analysis,
moving beyond simple binary classification of sentiment polarity.
Key Contributions
RNNs for Sentiment Strength Detection: Thelwall's work is the first to utilize RNNs for sentiment
strength detection. RNNs, with their ability to capture long-range dependencies in text, prove
to be well-suited for this task, effectively capturing the nuances of sentiment strength.
Social Media Context: Thelwall recognizes the unique characteristics of social media content,
such as informality, slang, and emoticons. The proposed method is tailored to handle these
features, leading to more accurate sentiment strength detection in social media posts.
Empirical Evaluation: Thelwall conducts rigorous empirical evaluations on a large dataset of
social media posts, demonstrating the effectiveness of the RNN-based approach in
distinguishing between different levels of sentiment strength.
Interpretability: Thelwall addresses the interpretability of RNN models, which is often a concern
in sentiment analysis. By analyzing the learned weights of the RNN, he provides insights into
the model's decision-making process, enhancing its interpretability.
Thelwall's work has significantly impacted the field of sentiment analysis, introducing a powerful
and effective method for sentiment strength detection. The application of RNNs has opened up
new avenues for research in sentiment analysis, and the proposed method has been widely
adopted for analyzing social media content.
Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin's paper, "Learning
Sentiment-Specific Word Embedding for Sentiment Analysis," introduces a novel approach to
learning sentiment-specific word embeddings. Traditional word embeddings, such as Word2Vec
and GloVe, capture the semantic and syntactic properties of words, but they do not explicitly
consider the sentiment associated with individual words. Tang et al.'s method addresses this
limitation by learning separate word embeddings for positive and negative sentiment, leading
to a more nuanced representation of words in sentiment analysis tasks.
Key Contributions
Sentiment-Specific Word Embeddings: Tang et al.'s method is the first to introduce sentiment-
specific word embeddings, capturing the sentiment conveyed by individual words. This
approach goes beyond traditional word embeddings by considering the polarity of words,
providing a more refined representation for sentiment analysis.
Leveraging Sentiment Lexicons and Corpora: The method utilizes sentiment lexicons and
sentiment-annotated corpora to train the sentiment-specific word embeddings. Sentiment
lexicons provide a starting point for identifying sentiment-associated words, while sentiment-
annotated corpora provide context for determining the sentiment polarity of words.
Empirical Evaluation: Tang et al. conduct extensive empirical evaluations on various sentiment
analysis tasks, demonstrating the effectiveness of sentiment-specific word embeddings in
improving performance compared to traditional word embeddings.
Interpretability: The authors provide insights into the learned sentiment-specific word
embeddings, analyzing how they capture sentiment polarity and how they influence the
performance of sentiment analysis models.
Tang et al.'s work has had a significant impact on the field of sentiment analysis, inspiring the
development of numerous methods that utilize sentiment-specific word embeddings. This
approach has become a standard technique in sentiment analysis, leading to improvements in
the accuracy of sentiment classification, opinion summarization, and other related tasks.
Shi Wang, Rui Xia, and Haiqin Zeng's paper, "Active Learning for Sentiment Analysis with
Expected Model Change," introduces an active learning approach specifically designed for
sentiment analysis. Active learning is a semi-supervised learning technique that aims to
improve model performance by strategically selecting the most informative data points for
labeling. Wang et al.'s method focuses on identifying data points that are expected to cause the
greatest change in the sentiment analysis model, leading to more efficient and effective model
development.
Key Contributions
Active Learning for Sentiment Analysis: Wang et al.'s method is the first to apply active learning
principles to sentiment analysis, demonstrating the potential of this approach for improving
sentiment analysis model performance.
Expected Model Change: The method utilizes the concept of expected model change to identify
the most informative data points for labeling. This approach focuses on selecting data points
that are likely to cause significant changes in the model's predictions, leading to more efficient
labeling efforts.
Empirical Evaluation: Wang et al. conduct extensive empirical evaluations on various sentiment
analysis tasks, demonstrating that their active learning method consistently outperforms
traditional random sampling and uncertainty-based sampling methods.
Computational Efficiency: The method is designed to be computationally efficient, making it
suitable for large-scale sentiment analysis tasks.
Wang et al.'s work has had a significant impact on the field of sentiment analysis, inspiring the
development of numerous active learning methods tailored for sentiment analysis tasks. Active
learning has become an increasingly important technique in sentiment analysis, as it can
significantly reduce the amount of labeled data required to achieve high model performance.
Dong et al.'s paper, titled "Adaptive Convolutional Neural Networks for Target-Dependent
Sentiment Classification," addresses the challenging task of target-dependent sentiment
analysis, a specific form of sentiment classification where the sentiment expressed in a
sentence is influenced by a particular target entity. This paper presents a novel convolutional
neural network (CNN) model designed to improve the accuracy and granularity of sentiment
classification by considering the target entity. The authors aim to advance the understanding
and capability of sentiment analysis in target-specific contexts.
Key Contributions:
Target-Dependent Sentiment Analysis: The paper focuses on a specific and vital subtask of
sentiment analysis – target-dependent sentiment classification. In this task, the sentiment
expressed towards a particular target entity within a sentence is analyzed. The authors
recognize the importance of this task in real-world applications, such as product reviews where
opinions may vary based on the product's features.
Unique CNN Architecture: Dong et al. propose an adaptive CNN model that takes into account
the context and relationships between words and target entities in a sentence. This architecture
allows the model to adapt to different target entities, making it suitable for a wide range of
applications.
Multi-Perspective Analysis: The authors introduce the idea of multi-perspective analysis. In this
approach, the model considers different perspectives when analyzing a sentence, depending
on the target entity. This enables the model to capture the nuances of sentiment expression
associated with specific targets.
Attention Mechanism: The paper employs an attention mechanism that assigns different
weights to words in a sentence, highlighting the words most relevant to the target entity. This
mechanism enhances the model's ability to recognize the context of sentiment expressions.
Evaluation and Results: The authors evaluate their model on benchmark datasets for target-
dependent sentiment classification, demonstrating its effectiveness in accurately classifying
sentiment towards specific targets. The results show improved performance compared to
traditional sentiment analysis models that do not consider target-specific sentiment.
The proposed CNN model in this paper represents a significant advancement in the field of
sentiment analysis, particularly concerning target-dependent sentiment classification. By
considering the relationships between words and target entities and employing multi-
perspective analysis, the model provides a more fine-grained understanding of sentiment
expression, which is crucial for various applications, including aspect-based sentiment analysis
in product reviews, brand monitoring, and more.
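The attention mechanism described above can be illustrated with a generic sketch: each word is scored against the target, and the scores are normalized with a softmax so that they sum to one. This is not the architecture from the paper under review; the dot-product scoring, the two-dimensional vectors, and all names below are assumptions chosen only to show how target-relevant words receive larger weights.

```python
import math

def attention_weights(word_vectors, target_vector):
    """Softmax over dot-product scores between each word vector and the target."""
    scores = [
        sum(w * t for w, t in zip(vec, target_vector)) for vec in word_vectors
    ]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy embeddings for a four-word sentence and a target entity.
words = ["the", "battery", "is", "great"]
vectors = [[0.1, 0.0], [0.9, 0.2], [0.0, 0.1], [0.4, 0.8]]
target = [1.0, 0.0]

weights = attention_weights(vectors, target)
most_relevant = words[weights.index(max(weights))]
print(most_relevant)  # battery
```

In a trained model the vectors and the scoring function are learned, so the weights reflect whatever the training objective found useful rather than hand-picked values like these.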
Wu et al.'s paper, "A Hierarchical Attention Network for Aspect-Level Sentiment Analysis,"
addresses the challenging task of aspect-level sentiment analysis, which involves fine-grained
sentiment classification with respect to specific aspects or entities within a given text. This
paper introduces a novel model called the Hierarchical Attention Network (HAN) that leverages
hierarchical attention mechanisms to better understand and model the relationships between
words, sentiments, and aspects, thus enhancing the performance of aspect-level sentiment
analysis.
Key Contributions:
Aspect-Level Sentiment Analysis: The paper focuses on aspect-level sentiment analysis, a
crucial and challenging subtask of sentiment analysis. In this task, the goal is to determine the
sentiment expressed towards specific aspects or entities mentioned in a text, allowing for fine-
grained analysis of opinions in various domains such as product reviews and social media
posts.
Hierarchical Attention Mechanism: The HAN model incorporates a hierarchical attention
mechanism that operates at different levels. It initially pays attention to words within a sentence,
identifying the most informative words regarding both aspects and sentiments. Subsequently,
it aggregates these sentence-level representations to capture aspect-level sentiment by
focusing on relevant sentences.
Fine-Grained Sentiment Analysis: The proposed model facilitates fine-grained sentiment
analysis by considering the nuanced relationships between aspects, sentiments, and words. It
enables the model to capture the specific sentiments associated with individual aspects and
their expressions within the text.
Evaluation and Results: Wu et al. evaluate the HAN model on benchmark datasets for aspect-
level sentiment analysis. The results demonstrate that the HAN model outperforms existing
methods, showcasing its capability to effectively extract and classify aspect-level sentiments.
Impact and Significance:
The introduction of the HAN model represents a significant advancement in the field of aspect-
level sentiment analysis. Its hierarchical attention mechanism enables the model to effectively
capture the nuances of sentiment expressed towards specific aspects or entities. This is
particularly valuable in real-world applications, such as e-commerce and product reviews,
where users seek detailed information about different product features.
Collectively, these literature reviews and papers contribute significantly to the evolution of
sentiment analysis, offering insights into various techniques, models, and applications in the
field. Researchers and practitioners can draw upon these sources to better understand and
improve sentiment analysis methodologies.
Building upon the insights from these related works, our system integrates extensive datasets,
enabling more efficient results and informed decision-making. Active learning was employed for
dataset labeling, accelerating machine learning tasks. The system also encompasses various
feature extraction methods. To our knowledge, our proposed approach achieved higher
accuracy than previous research endeavors, as we assimilated the best ideas from existing
works, creating a more efficient and effective system for sentiment analysis.
Methodology
Data processing is a pivotal stage in the preparation of data for analysis, playing an
indispensable role in the success of data-driven projects. In the context of the provided code
snippet, data processing is executed on a dataset comprising Amazon product reviews. This
intricate process encompasses several crucial steps, which are meticulously detailed below.
1. Importing Libraries:
The initiation of data processing is marked by the importation of essential Python libraries.
These libraries equip the project with a versatile array of tools and functions for data
manipulation, text preprocessing, and the development of machine learning models. The
following libraries are harnessed within the code:
a. bz2:
Purpose: This standard-library module is used to read and write bzip2-compressed files.
b. tqdm:
Purpose: This library furnishes a dynamic progress bar, facilitating the real-time tracking of data
reading and processing, thus enhancing the user experience.
c. re (Regular Expressions):
Purpose: The regular expressions library is indispensable for performing advanced text cleaning
and manipulation. It allows for the precise extraction and transformation of textual data.
d. pandas:
Purpose: Utilized for data manipulation and analysis, pandas provides a robust framework for
the organization and manipulation of data in a tabular form. It plays a central role in tasks such
as data cleaning, aggregation, and transformation.
e. numpy:
Purpose: numpy is a powerful library that supports a wide range of numerical operations. It is
especially useful for working with arrays and matrices, and it forms the backbone of many data
processing and machine learning tasks.
f. matplotlib and seaborn:
Purpose: These libraries are harnessed for data visualization, enabling the creation of visually
appealing and informative charts and graphs. They are pivotal in providing insights into the
dataset and its characteristics.
g. sklearn (Scikit-learn):
Purpose: Scikit-learn is a comprehensive machine learning library that is employed for various
tasks, including the calculation of metrics such as the confusion matrix and classification
reports. It offers a rich set of machine learning algorithms and tools for model evaluation.
h. wordcloud:
Purpose: The wordcloud library is a valuable asset for generating word clouds, a popular
method for visualizing word frequency in textual data. Word clouds offer a visually intuitive way
to explore and represent textual information.
i. nltk.sentiment.vader (Part of the Natural Language Toolkit - NLTK):
Purpose: NLTK's VADER sentiment analysis tool is used for assessing sentiment in the
text. It relies on a sentiment lexicon together with a set of rules, rather than a trained
model, to assign sentiment scores to text.
j. tensorflow.keras:
Purpose: TensorFlow, in tandem with Keras, is enlisted for deep learning tasks. Keras, a high-
level neural networks API, simplifies the process of building and training deep learning models
for various natural language processing (NLP) tasks.
k. keras.callbacks:
Purpose: This module supports the integration of callbacks during the training of Keras
models. Callbacks can be used to monitor and customize the training process, enabling actions
such as model checkpointing and early stopping.
l. pickle:
Purpose: The pickle library is an indispensable tool for saving and loading Python objects. In
the context of this project, it is used to serialize and deserialize objects such as the text
tokenizer, which is essential for text preprocessing.
These imported libraries collectively provide the foundation for the comprehensive data
processing pipeline, enabling efficient and effective manipulation of the Amazon product
reviews dataset for subsequent analysis and modeling.
2. Reading Data:
The second critical phase of this project involves the precise extraction of data from the source
files. This phase is vital as it ensures the availability of data for subsequent analysis and
processing. In this section, we provide an in-depth exploration of the procedures employed in
reading and extracting data from Amazon product reviews.
a. Data Paths Specification:
To begin, the code specifies the paths for both the input data files and the corresponding output
files. These paths are pivotal in guiding the code to the location of the source data and directing
the extracted data to the desired storage location. Proper management of file paths is essential
for data handling and maintenance throughout the project.
b. Data Reading:
The Amazon product reviews data is stored in a bzip2-compressed file format. This compressed
format is space-efficient and is a common choice for handling large datasets. The code reads
this compressed data using the bz2 library, Python's standard module for working with bzip2
files.
c. Data Extraction and Conversion:
Within the code, data extraction is meticulously executed. Using the Python 'with' statement,
the code opens the compressed data file and effectively extracts its contents. Subsequently,
this content is diligently written to an output file, which is generated to contain the extracted
data in a more human-readable and manipulable format. This process is vital for ensuring that
the data is accessible and ready for further analysis.
d. Training and Test Data:
The data reading and extraction process is carried out meticulously for both training and test
datasets. This bifurcation is crucial as it adheres to standard machine learning practices, which
involve partitioning the data into subsets for model training and evaluation. Each subset
undergoes the same data reading and extraction steps to ensure uniformity and consistency in
the data preparation process.
e. Code Specifics:
The utilization of the bz2 library in this code is a noteworthy technical detail. This
standard-library module reads and writes bzip2-compressed files, so the review data can be
extracted directly in Python without external decompression tools.
The incorporation of the 'with' statement exemplifies best practices in file handling, ensuring
that file handles are closed automatically after use, even if an error occurs during extraction.
This contributes to the code's robustness and stability.
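The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the file names, the helper name `decompress_bz2`, and the sample content are assumptions made so that the example is self-contained.

```python
import bz2
import shutil
import tempfile
from pathlib import Path


def decompress_bz2(src_path, dst_path):
    """Stream a bzip2-compressed file into a plain output file."""
    # Both files are closed automatically when the 'with' block exits,
    # even if an error occurs while copying.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Self-contained demonstration in a temporary directory.
work_dir = Path(tempfile.mkdtemp())
compressed = work_dir / "sample.txt.bz2"  # hypothetical input file name
extracted = work_dir / "sample.txt"       # hypothetical output file name

sample = "first review line\nsecond review line\n"
compressed.write_bytes(bz2.compress(sample.encode("utf-8")))

decompress_bz2(compressed, extracted)
extracted_text = extracted.read_bytes().decode("utf-8")
print(extracted_text)
```

The same helper would be called once for the training file and once for the test file, which keeps the two extraction paths identical.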
3. Data Extraction:
The data extraction phase is a critical component in the data processing pipeline of this project.
It encompasses the systematic retrieval of textual data and their associated labels from the
previously extracted plain text files. In Amazon product review data, the structure is such that
each line corresponds to a review and comprises a sentiment label, indicating the sentiment of
the review (e.g., positive or negative), followed by the text of the review itself. This section
delves into the intricacies of the data extraction process, elucidating the procedures employed
to transform raw textual data into organized and structured datasets.
a. Extraction of Text Data and Labels:
The code distinguishes between the sentiment label and the review text within each line of
the data. By parsing each line, it splits off the sentiment label, which appears first, from the
review text that follows it. The label indicates the sentiment of the review, while the text
holds the user-generated review content. This separation is pivotal as it allows the two
components to be processed and analyzed independently in later steps.
b. Storage in Lists:
Once the sentiment labels and review text are separated, the code stores them in two separate
lists, one for labels and one for review texts. The lists are kept parallel, so the label at a
given index belongs to the review text at the same index. These lists serve as the foundation
for constructing the training and test datasets for subsequent tasks.
c. Consistency in Processing:
It is noteworthy that both the training and test datasets undergo an analogous extraction
process. This consistent treatment ensures uniformity and comparability between these
datasets, a fundamental requirement for building and evaluating machine learning models.
Regardless of whether the data originates from the training or test set, the code's extraction
procedures remain consistent, guaranteeing that the structure and content of the datasets are
homogenous.
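A minimal sketch of this per-line extraction is given below. The report states only that each line holds a label followed by the review text; the `__label__` prefix and the function names used here are assumptions for illustration, and the real file format may differ.

```python
def split_label_and_text(line, prefix="__label__"):
    """Return (label, text) for a line shaped like '<prefix><label> <text>'."""
    label_part, _, text = line.strip().partition(" ")
    # The prefix is an assumed detail of the file format; drop it if present.
    if label_part.startswith(prefix):
        label_part = label_part[len(prefix):]
    return label_part, text


def extract(lines):
    """Collect labels and texts into two parallel lists."""
    labels, texts = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        label, text = split_label_and_text(line)
        labels.append(label)
        texts.append(text)
    return labels, texts


sample_lines = [
    "__label__2 Great sound: crisp and clear\n",
    "__label__1 Stopped working after two days\n",
]
train_labels, train_texts = extract(sample_lines)
print(train_labels)
print(train_texts[0])
```

Calling the same `extract` function on the training and test files is one way to guarantee the consistent treatment described above.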
4. Text Cleaning:
Text cleaning, a fundamental preprocessing step in Natural Language Processing (NLP), is a
pivotal facet of this project's data preparation. It is a meticulous process aimed at enhancing
the quality of textual data by removing noise, inconsistencies, and irrelevant characters. In the
context of this project, a custom function named clean_text is defined to systematically clean
the text data. This section provides a comprehensive overview of the intricacies involved in text
cleaning, outlining the steps carried out by the clean_text function:
a. Removing Non-Alphanumeric Characters:
The first step in text cleaning removes characters that are not alphanumeric, meaning
anything other than letters (A-Z or a-z), digits (0-9), or whitespace. This eliminates special
characters, symbols, and punctuation that may be present in the text. Removing these
characters is crucial as it eradicates potential noise and irrelevant information that might
interfere with subsequent NLP tasks. It streamlines the text, leaving only the essential content
for analysis.
b. Converting Multiple Whitespace Characters to a Single Space:
Textual data often contains inconsistencies in the spacing between words. In this step, the code
diligently addresses this issue by standardizing the spacing. It replaces multiple whitespace
characters (including tabs and line breaks) with a single space. This standardization ensures
uniformity in the text, making it more amenable to subsequent text analysis tasks. By reducing
multiple spaces to a single space, the text becomes easier to tokenize, process, and analyze.
c. Converting Text to Lowercase:
Another crucial component of text cleaning is the conversion of all text to lowercase. This
transformation is performed to ensure that words are treated uniformly, regardless of their
original capitalization. By converting all characters to lowercase, the code minimizes the
potential discrepancies that could arise from variations in letter case. This standardization is
vital for text classification and sentiment analysis, where capitalization should not affect the
interpretation of the text.
The overarching purpose of text cleaning is to bring a sense of uniformity and consistency to
the text data. By eliminating non-alphanumeric characters, standardizing spacing, and
converting text to lowercase, the code is able to produce clean, structured, and reliable textual
data that is well-suited for analysis. This step plays a pivotal role in enhancing the quality and
accuracy of subsequent NLP tasks, such as sentiment analysis and machine learning model
training. It ensures that the text data is devoid of irrelevant noise and is ready for effective
processing, making it a foundational step in the successful execution of the project.
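The three cleaning steps can be sketched as a short function. The report names the function clean_text but does not show its body, so the regular expressions below, and the final trimming of leading and trailing spaces, are assumptions rather than the project's exact implementation.

```python
import re


def clean_text(text):
    """Apply the three cleaning steps in order."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # drop punctuation and symbols
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.lower().strip()                 # lowercase and trim the ends


cleaned = clean_text("Great   product!!\tLoved it.\n")
print(cleaned)
```

Note that deleting punctuation outright turns a word such as "don't" into "dont"; replacing punctuation with a space instead is an alternative design choice.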
5. Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) represents a crucial preliminary phase in this project, serving
as the foundation for a comprehensive understanding of the dataset. EDA empowers the project
team with valuable insights and aids in data validation, ensuring that the data extraction process
has been executed accurately and that the labels are correctly associated with the text data.
This section provides an in-depth exploration of the EDA conducted in the project, shedding
light on the various procedures and visualizations employed.
a. Dataset Lengths Analysis:
One of the initial steps in EDA involves assessing the lengths of both the training and test
datasets. This analysis serves multiple purposes. First and foremost, it verifies the integrity of
the data extraction process, confirming that all samples have been successfully captured. By
comparing the lengths of the datasets with the expected number of samples, any potential
issues or data loss can be promptly identified and addressed. Additionally, this step confirms
that the labels align appropriately with the corresponding text data, ensuring the integrity of the
dataset.
b. Visualization of Target Labels Distribution:
A key component of EDA is the visualization of the distribution of target labels, which in this
context represent sentiments (e.g., positive and negative). This visualization is paramount for
several reasons. It aids in understanding the balance of the dataset, shedding light on whether
the dataset is evenly distributed among different sentiment classes or if there is an imbalance.
Additionally, it provides a clear overview of the distribution of positive and negative sentiments,
which are encoded as '1' for positive and '0' for negative in this analysis.
c. Utilization of Count Plots:
To effectively convey the distribution of sentiment classes, the project code employs count
plots, a valuable visualization tool. Count plots display the number of samples within each
sentiment class, allowing for a quick assessment of the balance or skew in the dataset. By
visualizing the number of positive and negative sentiments, the project team gains insights into
the relative prevalence of each sentiment class. This information is essential for designing and
implementing subsequent NLP tasks and machine learning models, particularly in scenarios
where imbalanced datasets can impact model performance.
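The length check and the class counts can be sketched as below. The helper names are hypothetical, and the plotting function assumes that seaborn and matplotlib are installed; it is imported lazily so that the counting part runs on its own.

```python
from collections import Counter


def summarize_labels(texts, labels):
    """Check that texts and labels line up, then count samples per class."""
    if len(texts) != len(labels):
        raise ValueError(f"{len(texts)} texts but {len(labels)} labels")
    return Counter(labels)


def plot_label_counts(labels):
    """Draw a count plot of the labels (needs seaborn and matplotlib)."""
    # Imported here so the counting helper works without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.countplot(x=list(labels))
    ax.set_xlabel("Sentiment (0 = negative, 1 = positive)")
    ax.set_ylabel("Number of reviews")
    plt.show()


toy_texts = ["good", "bad", "fine", "awful"]
toy_labels = [1, 0, 1, 0]
counts = summarize_labels(toy_texts, toy_labels)
print(counts[0], counts[1])
```

Equal counts for the two classes, as in this toy data, indicate a balanced dataset; a large gap would suggest that accuracy alone may be a misleading metric.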
6. Word Cloud:
Word clouds represent a visually engaging and effective means of gaining insights into the
textual content of a dataset. In this project, the generation of word clouds for the training data
holds a crucial role in unveiling the most frequently occurring words within a subset of the
dataset, specifically focusing on the first 20,000 samples. This section provides an in-depth
exploration of the purpose, significance, and methodology employed in creating these word
clouds.
a. Purpose of Word Clouds:
Word clouds serve a dual purpose in the project. Firstly, they offer an intuitive and visual
representation of the most prevalent terms or words within the dataset. By emphasizing the
words that appear most frequently, word clouds provide a quick overview of the dataset's key
themes, sentiments, and frequently used expressions. Secondly, word clouds facilitate the
identification of significant keywords and phrases, which can be instrumental in understanding
the dataset and potentially guiding subsequent analysis and modeling tasks. The visual
nature of word clouds makes them an accessible tool for both technical and non-technical
stakeholders in the project.
b. Focusing on the Training Data:
The word cloud generation in this project narrows its focus to the training data, which is typically
the portion of the dataset used to train machine learning models. Analyzing the training data is
especially valuable as it provides insights into the language, sentiment, and common
expressions encountered in the reviews. This information is fundamental for training models
that accurately capture and predict sentiment in Amazon product reviews.
c. Subset Selection:
The code selects the first 20,000 samples from the training data for word cloud generation. This
subsetting is often performed for practical reasons, as visualizing the entire dataset may be
overwhelming due to the sheer volume of words. The subset provides a representative sample
for analysis and visualization, offering insights into the dataset's characteristics without
overloading the visualization.
d. Methodology and Visualization:
Word clouds are generated using a word cloud library, which processes the text data and
identifies the most frequent words. The size of each word in the cloud is proportional to its
frequency in the text. Frequently occurring words are displayed prominently, while less common
words are presented in a smaller font. This visual hierarchy enables quick identification of the
most prevalent terms.
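As a rough illustration of what the word cloud encodes, the sketch below counts word frequencies over the first 20,000 cleaned texts and, separately, renders a cloud with the wordcloud package. The function names, image size, and background color are assumptions; the rendering function needs wordcloud and matplotlib installed.

```python
from collections import Counter


def top_words(cleaned_texts, n=5, subset_size=20000):
    """Count word frequencies over the first `subset_size` texts."""
    counts = Counter()
    for text in cleaned_texts[:subset_size]:
        counts.update(text.split())
    return counts.most_common(n)


def show_word_cloud(cleaned_texts, subset_size=20000):
    """Render a word cloud (needs the wordcloud and matplotlib packages)."""
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    joined = " ".join(cleaned_texts[:subset_size])
    cloud = WordCloud(width=800, height=400, background_color="white").generate(joined)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()


sample_texts = ["great product great price", "bad product", "great value"]
frequent = top_words(sample_texts, n=2)
print(frequent)
```

By default the wordcloud package also filters common English stop words, so the words it emphasizes can differ from a raw frequency ranking such as the one returned by `top_words`.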
7. Tokenization & Padding:
Tokenization and padding represent pivotal stages in the data preparation pipeline of this
project, instrumental in converting raw text data into a numerical format that is amenable for
machine learning and deep learning models. This section delves into the detailed process of
tokenization and padding, outlining the essential steps carried out by the project's code.
a. Tokenization Overview:
Tokenization is the process of breaking down text data into smaller units, typically words or
subwords, known as tokens. It is a foundational step in NLP, allowing text data to be
represented as numerical sequences. In this project, tokenization is performed using the Keras
Tokenizer class, a powerful tool for text preprocessing.
b. Defining Maximum Vocabulary Size (voc_size):
A critical decision in tokenization is the definition of the maximum vocabulary size, denoted as
'voc_size.' This parameter sets the maximum number of unique words to be considered
when tokenizing the text. Only the most frequent words up to this limit are kept; rarer words
are disregarded. This step is important for managing computational resources and ensuring
that only the most relevant words are included in the model's vocabulary.
c. Defining Maximum Sequence Length (max_length):
The 'max_length' parameter plays a central role in setting the maximum length of sequences.
Sequences that are shorter than this length are padded with zeros, while sequences longer
than this length are truncated. This step is crucial for ensuring uniform sequence lengths, which
is a requirement for feeding data into deep learning models.
d. Fitting the Tokenizer on Training Data:
To perform tokenization, the Tokenizer is fitted to the training data. This fitting process involves
building a word index and constructing a vocabulary based on the unique words encountered
in the training dataset. The fitted Tokenizer forms the basis for converting text into numerical
sequences in both the training and test datasets.
e. Saving the Tokenizer:
The fitted Tokenizer is not only a vital component for tokenization but also a resource that needs
to be preserved for consistency and reusability. It is saved as a pickle file, which can be loaded
in subsequent project phases or even in different projects for consistent text processing.
f. Tokenizing the Training and Test Data:
Once the Tokenizer is fitted, it is employed to tokenize the text data in both the training and test
datasets. This process replaces words with their corresponding numerical indices, generating
sequences of integers that represent the text.
g. Applying Padding:
Padding is the final step in the tokenization process and is of paramount importance. It ensures
that all sequences have the same length (as defined by 'max_length'). Sequences that are
shorter than 'max_length' are padded with zeros at the beginning, and sequences longer than
'max_length' are truncated. This step is critical for maintaining consistent input dimensions for
deep learning models.
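To make steps a to g concrete, the following dependency-free sketch mimics the behaviour described above. It is not the Keras Tokenizer or pad_sequences themselves: the function names are hypothetical, the tiny voc_size and max_length values are chosen only for readability, and truncating from the front of long sequences is an assumption about the project's settings.

```python
import pickle
from collections import Counter


def build_word_index(texts, voc_size):
    """Map the most frequent words to integers 1..voc_size-1 (0 is padding)."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(voc_size - 1)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}


def encode_texts(texts, word_index):
    """Replace each known word by its index; unknown words are dropped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]


def pad_to_length(sequences, max_length):
    """Left-pad with zeros, or keep only the last max_length items."""
    padded = []
    for seq in sequences:
        seq = seq[-max_length:]                             # truncate long sequences
        padded.append([0] * (max_length - len(seq)) + seq)  # pad short ones
    return padded


train_texts = ["great product great price great", "product bad"]
test_texts = ["great bad unknown"]

# The index is built from the training data only, then reused for the test data.
word_index = build_word_index(train_texts, voc_size=3)
train_pad = pad_to_length(encode_texts(train_texts, word_index), max_length=3)
test_pad = pad_to_length(encode_texts(test_texts, word_index), max_length=3)

# Serialize and restore the fitted index; the project writes it to a pickle file,
# while bytes are used here to avoid touching the disk.
restored = pickle.loads(pickle.dumps(word_index))

print(word_index)
print(train_pad)
print(test_pad)
```

Reusing the restored index at prediction time is what keeps the integer assigned to each word identical between training and later use.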
8. Label Encoding:
Label encoding is a crucial step in sentiment analysis, particularly when transforming the task
into a binary classification problem. In this project, the code undertakes label encoding to
represent sentiment labels in a simplified binary format, facilitating the sentiment analysis task.
This section elaborates on the label encoding process, the specific mapping employed, and
how the encoded labels are stored.
a. Purpose of Label Encoding:
Label encoding in sentiment analysis is pivotal for framing the problem as a binary
classification task, where the goal is to discern between positive and negative sentiments. This
binary representation simplifies the sentiment analysis process, making it more manageable
and interpretable for machine learning models. The sentiment labels, originally stored as the
strings '1' and '2', are converted into the numerical values 0 and 1 to align with the
requirements of classification algorithms.
b. Mapping of Sentiment Labels:
In the provided code, the mapping of sentiment labels is as follows:
'2' is mapped to '1', representing positive sentiment.
'1' is mapped to '0', representing negative sentiment.
This binary mapping converts the sentiment labels into numerical values, where '1' corresponds
to positive sentiment and '0' corresponds to negative sentiment. Because the dataset already
contains only two sentiment classes, the mapping does not merge any classes; it only
re-expresses them in the 0/1 form expected by a binary classifier.
c. Storage of Encoded Labels:
The code ensures the proper management and storage of the encoded sentiment labels.
Specifically, the training labels, after undergoing the label encoding process, are stored in the
variable 'train_lab,' while the test labels are stored in the variable 'test_lab.' This organization
allows for easy access to the encoded labels when training and evaluating machine learning
models.
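The mapping can be sketched in a few lines. The variable names train_lab and test_lab come from the report; the helper name, the dictionary, and the sample labels are assumptions, and plain lists are used here instead of NumPy arrays to keep the example dependency-free.

```python
LABEL_MAP = {"2": 1, "1": 0}  # '2' -> positive (1), '1' -> negative (0)


def encode_labels(raw_labels):
    """Convert raw '1'/'2' labels to 0/1; any other value raises KeyError."""
    return [LABEL_MAP[label] for label in raw_labels]


train_lab = encode_labels(["2", "1", "1", "2"])
test_lab = encode_labels(["1", "2"])
print(train_lab, test_lab)
```

Raising an error on an unexpected label, rather than silently mapping it to a default, helps surface extraction problems early.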
In conclusion, data processing is a critical step in any machine learning or natural language
processing project. It involves importing libraries, reading and extracting data, cleaning text,
conducting exploratory data analysis, creating visualizations, tokenizing and padding text data,
and encoding labels for sentiment analysis. The preprocessed data can now be used to train
machine learning models, including deep learning models, to perform sentiment analysis on
Amazon product reviews. Data processing ensures that the data is in a suitable format for model
training and evaluation.
Website:
The provided Python script, `app.py`, is designed to create a web application using Streamlit
for performing customer sentiment analysis on textual reviews. The app offers a straightforward
and user-friendly interface for analyzing the sentiment of reviews, as well as includes additional
features for enhanced interactivity. The script begins by importing essential libraries, such as
Streamlit for building the web app, TensorFlow for loading a sentiment analysis model, regular
expressions for text cleaning, and other utilities for processing and visualizing data.
The web application's user interface is configured using Streamlit. It sets the page title, favicon,
and introduces the app with a title and description. Users are encouraged to input review text
for sentiment analysis, and a button is provided to initiate the analysis.
The sentiment analysis model is loaded from a previously saved file, enabling real-time
sentiment predictions. The Tokenizer, which is essential for converting text into sequences, is
also loaded from a saved file.
To ensure that the text data is clean and consistent, a `clean_text` function is defined. This
function utilizes regular expressions to remove non-alphanumeric characters, normalize
whitespace, and convert text to lowercase.
The sentiment analysis process is facilitated by a dedicated function, `analyze_sentiment`,
which cleans the input text, tokenizes it, and pads it to a fixed length before feeding it to the
loaded model. The result is a sentiment prediction (either "Positive" or "Negative") based on
the model's output.
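The flow of `analyze_sentiment` can be sketched as below. The real script uses the saved Keras model and Tokenizer, which are not shown in the report; here `encode` and `predict` are hypothetical stand-ins passed in as arguments, and the 0.5 decision threshold and the cleaning patterns are assumptions.

```python
import re


def clean_text(text):
    """Same cleaning steps as during training (patterns are assumed)."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text).lower().strip()


def analyze_sentiment(text, encode, predict, threshold=0.5):
    """Clean the text, encode it, score it, and map the score to a label.

    `encode` turns cleaned text into a fixed-length list of integers and
    `predict` returns a score between 0 and 1; both are hypothetical
    stand-ins for the saved Tokenizer (plus padding) and the loaded model.
    """
    score = predict(encode(clean_text(text)))
    return "Positive" if score >= threshold else "Negative"


# Toy stand-ins so the sketch runs without TensorFlow or Streamlit.
toy_index = {"great": 1, "terrible": 2}


def toy_encode(cleaned, max_length=4):
    seq = [toy_index[w] for w in cleaned.split() if w in toy_index][-max_length:]
    return [0] * (max_length - len(seq)) + seq


def toy_predict(sequence):
    return 0.9 if 1 in sequence else 0.1


result_pos = analyze_sentiment("Great phone!!", toy_encode, toy_predict)
result_neg = analyze_sentiment("Terrible battery.", toy_encode, toy_predict)
print(result_pos, result_neg)
```

The important property is that the input passes through the same cleaning, encoding, and padding as the training data before it reaches the model.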
The app incorporates a sidebar that provides additional information about the app and its
purpose, enhancing user engagement. Users can enter review text in the provided text area,
and upon clicking the "Analyze Sentiment" button, the app processes and analyzes the
sentiment of the input text. Users are alerted with a success message displaying the predicted
sentiment or a warning if they forget to input text.
To enhance the aesthetics and user experience, the script also includes style modifications for
the button's appearance and changes the text area's background color.
Moreover, the script suggests the possibility of integrating a word cloud visualization component
to present the most common words in the reviews. This section is left for future development or
customization according to specific needs.
The "Additional Features" section provides examples of interactive widgets and features that
can be added to the app. It includes data analysis in the form of a bar chart, the creation of
charts using libraries like Matplotlib and Seaborn, and the option for users to upload a file.
These features can be expanded upon to meet specific project requirements.
In summary, the `app.py` script creates a user-friendly sentiment analysis web application with
Streamlit. It leverages machine learning to predict sentiments, offers opportunities for
enhancing user experience and interaction, and provides flexibility for extending its
functionalities, making it a versatile tool for customer sentiment analysis.
Feature Engineering
Feature engineering in a sentiment analysis project involves creating relevant features from text
data to enhance the performance of machine learning models. Here, we'll detail the feature
engineering steps for the given code:
Word Embeddings:
While the code does not explicitly use pre-trained word embeddings, it's a common feature
engineering technique. Word embeddings can capture semantic relationships between words, which
is beneficial for sentiment analysis. You can use models like Word2Vec, GloVe, or fastText to obtain
word embeddings.
Text Cleaning:
The clean_text function removes non-alphanumeric characters, extra whitespace, and converts the
text to lowercase. This is important for consistent and standardized text data.
Tokenization & Padding:
The code uses the Tokenizer class and the pad_sequences function from Keras to tokenize the text
data and pad sequences to a fixed length. This ensures that all input sequences have the same
length, which is necessary for training deep learning models.
Feature Scaling:
The code doesn't explicitly perform feature scaling in this section, but it's a crucial step when dealing
with features like word embeddings or other numeric features. Feature scaling ensures that all
features have similar scales.
Sentiment Scores:
The code imports the VADER sentiment analyzer from NLTK. While VADER isn't explicitly used for
feature engineering in this section, it could be used to obtain sentiment scores for the text data.
These scores can be considered as features.
WordCloud:
The code generates a word cloud from the text data. While word clouds are typically used for data
visualization, you can extract features from them. For example, you can count the frequency of the
most prominent words in the word cloud and use these counts as features.
Target Label Encoding:
The target labels ("1" and "2") are encoded as binary labels (0 and 1) using NumPy arrays. While not
a direct feature engineering step, this preprocessing is essential for building classification models.
Feature Extraction from Text Data:
The main feature extraction in this code is done by converting text data into sequences of integers
using tokenization and padding. These sequences are used as input features for deep learning
models.
o Project Work and Methodology
This can be detailed in a number of chapters. The organisation of these chapters can
be temporal (i.e., following a cyclical evolution of the project), task-based (i.e., where
each chapter describes a succinct body of work or task set within the project plan),
process-based (i.e., where the elements of design, build and test are discussed
separately) or a combination of these.
The complete set of chapters must describe what has been attempted throughout the
period of project work, successes and failures, alternative approaches and the
methods applied. Besides covering the work that has been attempted, these chapters
should also discuss other options that were considered but not necessarily
addressed as part of the project work, along with reasons for the choices and
decisions made supported by research and experimental results.
